At my last firm, I came across what seemed like a really clever way of creating constants for the indices of certain pieces of a delimited string, or columns in a SQL table, but I've long wondered if there's a performance or other flaw I didn't catch.
For example, if there were five data fields in a single delimited string containing address information, you might see the following:
const int INDX_NAME = 0;
const int INDX_ADDR = INDX_NAME + 1;
const int INDX_CITY = INDX_ADDR + 1;
const int INDX_STATE = INDX_CITY + 1;
const int INDX_ZIP = INDX_STATE + 1;
That way, if in the future, the program needed to separate what could be two lines of an address between the name and city (say, street address on one line, suite number on the other), all the programmer would have to do is add the line
const int INDX_ADDR2 = INDX_ADDR + 1;
and change the line creating INDX_CITY to
const int INDX_CITY = INDX_ADDR2 + 1;
In short, this seemed like a really clever way of handling column/piece index lists that could change, especially if the list was 50+ items, and you need to add something at index number 3. However, I wasn't sure how much this could start affecting performance as the list starts getting large, or whether there was some other inherent flaw I wasn't seeing.
Any thoughts? Is there something that makes this lazy programming instead of clever?
Thanks.
-EDIT-
Given the responses I've been getting, I can see the value of enums, but I'm concerned they'd only be useful if every field I'm using is exactly one spot away from the next. What if I'm using pieces of some data store that other programmers are also referencing, but I only care about a subset of the total fields, so I need to skip blocks of fields that I don't care about?
For example, say I'm using 100 data fields from a delimited string that is 300 fields long, but my application doesn't care about fields 1-4 (array counting). Is there a clean way to use enums to replace
const int INDX_RECORD_ID = 0;
const int INDX_DATE_UPDATED = INDX_RECORD_ID + 5;
const int INDX_USER_UPDATED = INDX_DATE_UPDATED + 1;
...
?
You should really use an enum here. They are designed for exactly this type of situation:
public enum MyEnum
{
    INDX_NAME,
    INDX_ADDR,
    INDX_CITY,
    INDX_STATE,
    INDX_ZIP
}
Adding values (to any position) will update all of the indexes accordingly.
Using an enumeration will have several advantages: (not an exhaustive list)
You can type a function parameter or a variable as the enumeration, rather than as a bare index. This makes the intent clearer to other coders than requiring them to know that "this parameter needs to be given one of the constants in this other class."
You don't need to worry about making a mistake where you don't add one to the previous constant. I could easily see that happening when you add in a new item in the middle, or just as a copy/paste error.
You can use other enum tools, such as the ability to get the string name of an enum value (as well as its integer value) for easier display; you can use reflection to iterate the possible values; you can parse a string or an integer into the appropriate enum value; and so on. A few of these are sketched below.
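For instance, a minimal sketch of those tools in action, using the MyEnum type above:

MyEnum field = (MyEnum)Enum.Parse(typeof(MyEnum), "INDX_CITY");
Console.WriteLine(field);      // prints "INDX_CITY"
Console.WriteLine((int)field); // prints 2

// Reflection-style iteration over all defined values
foreach (MyEnum value in Enum.GetValues(typeof(MyEnum)))
    Console.WriteLine(value + " = " + (int)value);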
As for performance, the values (for both an enum and your constants) will be evaluated at compile time, not runtime, so at runtime it won't have any additional cost over manual numbering.
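Regarding the edit: enum members can also be given explicit values, so you can skip blocks of fields you don't care about and let the implicit numbering resume from there. A minimal sketch using the field names from the edit:

public enum FieldIndex
{
    INDX_RECORD_ID = 0,
    INDX_DATE_UPDATED = 5, // skips fields 1-4
    INDX_USER_UPDATED      // implicitly 6
    // ...
}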
No, it's not "lazy". I think it's simply good practice. And it incurs no runtime penalty because const fields must be compile-time constants.
I've seen this a couple of times recently in high-profile code, where constant values are defined as variables named after the value, then used only once. I wondered why this gets done.
E.g. Linux Source (resize.c)
unsigned five = 5;
unsigned seven = 7;
E.g. C#.NET Source (Quaternion.cs)
double zero = 0;
double one = 1;
Naming numbers is terrible practice: one day something will need to change, and you'll end up with unsigned five = 7.
If it has some meaning, give it a meaningful name. The 'magic number' five is no improvement over the magic number 5; it's worse, because it might not actually equal 5.
This kind of thing generally arises from cargo-cult programming style guidelines, where someone heard that "magic numbers are bad" and forbade their use without fully understanding why.
Well named variables
Giving proper names to variables can dramatically clarify code, such as
constant int MAXIMUM_PRESSURE_VALUE=2;
This gives two key advantages:
The value MAXIMUM_PRESSURE_VALUE may be used in many different places, if for whatever reason that value changes you need to change it in only one place.
Where used it immediately shows what the function is doing, for example the following code obviously checks if the pressure is dangerously high:
if (pressure>MAXIMUM_PRESSURE_VALUE){
//without me telling you you can guess there'll be some safety protection in here
}
Poorly named variables
However, everything has a counter-argument, and what you have shown looks very much like a good idea taken so far that it makes no sense. Defining TWO as 2 adds no value:
constant int TWO=2;
The value TWO may be used in many different places: perhaps to double things, perhaps to access an index. If in the future you need to change the index, you cannot just change it to int TWO=3;, because that would affect all the other (completely unrelated) ways you've used TWO; now you'd be tripling instead of doubling, and so on.
Where used it gives you no more information than if you just used "2". Compare the following two pieces of code:
if (pressure>2){
//2 might be good, I have no idea what happens here
}
or
if (pressure>TWO){
//TWO means 2, 2 might be good, I still have no idea what happens here
}
Worse still (as seems to be the case here), TWO may not even equal 2; if so, this is a form of obfuscation where the intention is to make the code less clear, and obviously it achieves that.
The usual reason for this is a coding standard which forbids magic numbers but doesn't count TWO as a magic number, which of course it is! 99% of the time you want to use a meaningful variable name, but in that 1% of the time using TWO instead of 2 gains you nothing (sorry, I mean ZERO).
(This code is inspired by Java but is intended to be language-agnostic.)
Short version:
A constant five that just holds the number five is pretty useless. Don't go around making these for no reason (sometimes you have to because of syntax or typing rules, though).
The named variables in Quaternion.cs aren't strictly necessary, but you can make the case for the code being significantly more readable with them than without.
The named variables in ext4/resize.c aren't constants at all. They're tersely-named counters. Their names obscure their function a bit, but this code actually does correctly follow the project's specialized coding standards.
What's going on with Quaternion.cs?
This one's pretty easy.
Right after this:
double zero = 0;
double one = 1;
The code does this:
return zero.GetHashCode() ^ one.GetHashCode();
Without the local variables, what does the alternative look like?
return 0.0.GetHashCode() ^ 1.0.GetHashCode(); // doubles, not ints!
What a mess! Readability is definitely on the side of creating the locals here. Moreover, I think explicitly naming the variables indicates "We've thought about this carefully" much more clearly than just writing a single confusing return statement would.
What's going on with resize.c?
In the case of ext4/resize.c, these numbers aren't actually constants at all. If you follow the code, you'll see that they're counters and their values actually change over multiple iterations of a while loop.
Note how they're initialized:
unsigned three = 1;
unsigned five = 5;
unsigned seven = 7;
Three equals one, huh? What's that about?
See, what actually happens is that update_backups passes these variables by reference to the function ext4_list_backups:
/*
* Iterate through the groups which hold BACKUP superblock/GDT copies in an
* ext4 filesystem. The counters should be initialized to 1, 5, and 7 before
* calling this for the first time. In a sparse filesystem it will be the
* sequence of powers of 3, 5, and 7: 1, 3, 5, 7, 9, 25, 27, 49, 81, ...
* For a non-sparse filesystem it will be every group: 1, 2, 3, 4, ...
*/
static unsigned ext4_list_backups(struct super_block *sb, unsigned *three,
unsigned *five, unsigned *seven)
They're counters that are preserved over the course of multiple calls. If you look at the function body, you'll see that it's juggling the counters to find the next power of 3, 5, or 7, creating the sequence you see in the comment: 1, 3, 5, 7, 9, 25, 27, &c.
Now, for the weirdest part: the variable three is initialized to 1 because 3^0 = 1. The power 0 is a special case, though, because it's the only time 3^x = 5^x = 7^x. Try your hand at rewriting ext4_list_backups to work with all three counters initialized to 1 (3^0, 5^0, 7^0) and you'll see how much more cumbersome the code becomes. Sometimes it's easier to just tell the caller to do something funky (initialize the list to 1, 5, 7) in the comments.
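To see the shape of the algorithm without wading through the kernel source, here's a rough sketch of the idea in C# (the kernel's real code is C and differs in its details):

// Emits 1, 3, 5, 7, 9, 25, 27, 49, 81, ... by always advancing the smallest counter
static IEnumerable<uint> PowerSequence()
{
    uint three = 1, five = 5, seven = 7; // the same odd-looking initialization
    while (true)
    {
        if (three <= five && three <= seven) { yield return three; three *= 3; }
        else if (five <= seven)              { yield return five;  five  *= 5; }
        else                                 { yield return seven; seven *= 7; }
    }
}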
So, is five = 5 good coding style?
Is "five" a good name for the thing that the variable five represents in resize.c? In my opinion, it's not a style you should emulate in just any random project you take on. The simple name five doesn't communicate much about the purpose of the variable. If you're working on a web application or rapidly prototyping a video chat client or something and decide to name a variable five, you're probably going to create headaches and annoyance for anyone else who needs to maintain and modify your code.
However, this is one example where generalities about programming don't paint the full picture. Take a look at the kernel's coding style document, particularly the chapter on naming.
GLOBAL variables (to be used only if you really need them) need to
have descriptive names, as do global functions. If you have a function
that counts the number of active users, you should call that
"count_active_users()" or similar, you should not call it "cntusr()".
...
LOCAL variable names should be short, and to the point. If you have
some random integer loop counter, it should probably be called "i".
Calling it "loop_counter" is non-productive, if there is no chance of it
being mis-understood. Similarly, "tmp" can be just about any type of
variable that is used to hold a temporary value.
If you are afraid to mix up your local variable names, you have another
problem, which is called the function-growth-hormone-imbalance syndrome.
See chapter 6 (Functions).
Part of this is C-style coding tradition. Part of it is purposeful social engineering. A lot of kernel code is sensitive stuff, and it's been revised and tested many times. Since Linux is a big open-source project, it's not really hurting for contributions — in most ways, the bigger challenge is checking those contributions for quality.
Calling that variable five instead of something like nextPowerOfFive is a way to discourage contributors from meddling in code they don't understand. It's an attempt to force you to really read the code you're modifying in detail, line by line, before you try to make any changes.
Did the kernel maintainers make the right decision? I can't say. But it's clearly a purposeful move.
My organisation has certain programming guidelines, one of which forbids the use of magic numbers...
eg:
if (input == 3) //3 what? Elephants?....3 really is the magic number here...
This would be changed to:
#define INPUT_1_VOLTAGE_THRESHOLD 3u
if (input == INPUT_1_VOLTAGE_THRESHOLD) //Not elephants :(
We also have a source file with every value from -200,000 to 200,000 #defined in the format:
#define MINUS_TWO_ZERO_ZERO_ZERO_ZERO_ZERO -200000
which can be used in place of magic numbers, for example when referencing a specific index of an array.
I imagine this has been done for "Readability".
The numbers 0, 1, ... are integers. Here, the 'named variables' give the integer a different type. It might be more reasonable to declare these as constants (const unsigned five = 5;).
I've used something akin to that a couple times to write values to files:
const int32_t zero = 0;
fwrite( &zero, sizeof(zero), 1, myfile );
fwrite accepts a const pointer, but if some function needs a non-const pointer, you'll end up using a non-const variable.
P.S.: That always keeps me wondering what the sizeof zero might be.
How do you come to the conclusion that it is used only once? It is public; it could be used any number of times from any assembly.
public static readonly Quaternion Zero = new Quaternion();
public static readonly Quaternion One = new Quaternion(1.0f, 1.0f, 1.0f, 1.0f);
The same thing applies to the .NET Framework's decimal class, which also exposes public constants like this:
public const decimal One = 1m;
public const decimal Zero = 0m;
Numbers are often given a name when they have special meaning.
For example, in the Quaternion case, the identity quaternion and the unit-length quaternion have special meaning and are frequently used in a special context. Namely, the quaternion (0, 0, 0, 1) is the identity quaternion, so it's common practice to define it as a named constant instead of using magic numbers.
For example
// define as static
static Quaternion Identity = new Quaternion(0, 0, 0, 1);
Quaternion Q1 = Quaternion.Identity;
// or
if (Q1.Length == 1.0f) // unit length, not considering floating point error
One of my first programming jobs was on a PDP 11 using Basic. The Basic interpreter allocated memory to every number required, so every time the program mentioned 0, a byte or two would be used to store the number 0. Of course back in those days memory was a lot more limited than today and so it was important to conserve.
Every program in that work place started with:
10 U0%=0
20 U1%=1
That is, for those who have forgotten their Basic:
Line number 10: create an integer variable called U0 and assign it the number 0
Line number 20: create an integer variable called U1 and assign it the number 1
These variables, by local convention, never held any other value, so they were effectively constants. They allowed 0 and 1 to be used throughout the program without wasting any memory.
Aaaaah, the good old days!
Sometimes it's more readable to write:
double pi = 3.14; // constant, or even not constant
...
CircleArea=pi*r*r;
instead of:
CircleArea=3.14*r*r;
and maybe you will use pi again (you are not sure, but you think it's possible later, or in other classes if it's public),
and then if you want to change pi = 3.14 into pi = 3.14159, it's easier.
And there are others, like e = 2.71, Avogadro's number, etc.
This method gets:
IEnumerable<object[]> - in which every array is of a fixed size (it represents a relational data structure).
DataEnumerable.Column[] - some metadata columns; mostly they will have the same value for all rows.
Expected outcome:
Each "row" should get a value for each of these columns (so the data structure remains relational).
private IEnumerable<object[]> BindExtraColumns(IEnumerable<object[]> baseData, int dataSize, DataEnumerable.Column[] columnsToAdd)
{
    int extraColumnsLength = columnsToAdd.Length;
    object[] row = new object[dataSize + extraColumnsLength];
    string columnName;
    int rowNumberColumnIndex = -1;
    for (int i = 0; i < extraColumnsLength; i++)
    {
        // Assign values that don't change between rows...
        // Assign rowNumberColumnIndex if a row number column exists
    }
    // Assign values that change per row here; since we currently support only row number,
    // it's not generic enough
    if (rowNumberColumnIndex != -1)
    {
        int rowNumber = 1;
        foreach (var baseRow in baseData)
        {
            row[rowNumberColumnIndex] = rowNumber;
            Array.Copy(baseRow, 0, row, extraColumnsLength, dataSize);
            yield return row;
            rowNumber++;
        }
    }
    else
    {
        foreach (var baseRow in baseData)
        {
            Array.Copy(baseRow, 0, row, extraColumnsLength, dataSize);
            yield return row;
        }
    }
}
This method can be called from hundreds of threads with relatively big data sets, so performance here is critical, and I tried to create as few new objects as possible.
Please note: this is a private method, used ONLY by a DataReader, which reads each row and passes it to another array immediately, prior to reading the next row.
So: can the array copying here be optimized somehow, and should I (carefully) trade memory to speed things up?
Thanks
Your code is fundamentally broken. You're just returning a reference to the same array every time, which means that unless the caller uses the data within each item immediately, it effectively gets lost. For example, suppose I use:
List<object[]> rows = BindExtraColumns(data, size, toAdd).ToList();
Then when I iterate over the rows, I find the same data in every row. That's really not a good experience.
I think it would make much more sense to create a new array for each iteration. Yes, that's a lot of extra memory being used - but it doesn't surprise callers nearly as much.
If you really don't want to do that, I suggest you change the approach so that the caller has to pass in an Action<object[]> to be executed on each row, with the documented proviso that if the caller stashes a reference to the array, they may well be surprised by the results.
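A rough sketch of that approach, keeping your single reused buffer (the extra parameter is mine, not from your code):

private void BindExtraColumns(IEnumerable<object[]> baseData, int dataSize,
                              DataEnumerable.Column[] columnsToAdd,
                              Action<object[]> processRow)
{
    object[] row = new object[dataSize + columnsToAdd.Length];
    // ... assign the fixed extra-column values into row, as before ...
    foreach (var baseRow in baseData)
    {
        Array.Copy(baseRow, 0, row, columnsToAdd.Length, dataSize);
        processRow(row); // documented: callers must not hold on to 'row'
    }
}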
You're obviously very concerned about performance, but if your data is coming from a database I'd expect the array creation/copying performance to be insignificant. You should write the simplest (and most reliable) code that works first, and then benchmark it to see whether it performs well enough. Unless you've got evidence that you need to make this surprising design choice, it feels like you're optimizing way too early.
EDIT: Now we know that it's a private method only used in one specific place, I would still avoid this reuse. It's simply fragile. I really would change to passing in an Action<object[]> or simply copying the data to a new array every time. I certainly wouldn't keep the current approach without strong evidence that it's a bottleneck: as I said before, I'd expect the database communication to be much more important. Leaving timebombs in your code like this very rarely works out well.
If you really, really want to keep doing this, you should document it very strongly, giving severe warnings that the result is non-idiomatic.
In terms of whether there's more optimization you could do - well... one alternative would be to avoid having to work with a single array in the first place. You could create a class which held references to both arrays (the current base row and the fixed data) and exposed an indexer which returned the value from one array or the other based on which index was being requested. We don't know what you're doing with the data ("passes it to another array" doesn't really mean anything) so we don't know whether that's feasible, but it would be efficient and could be implemented without the odd behaviour.
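For example, something along these lines (a sketch; the class and member names are mine, not from your code):

// Presents the fixed extra columns and the current base row as one virtual row
class CombinedRow
{
    private readonly object[] fixedData;  // extra-column values, assigned once
    public object[] BaseRow { get; set; } // swapped per iteration, no copying

    public CombinedRow(object[] fixedData) { this.fixedData = fixedData; }

    public object this[int index] =>
        index < fixedData.Length
            ? fixedData[index]
            : BaseRow[index - fixedData.Length];
}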
In Java, strings are immutable. If we have a string and make changes to it, we get a new string referenced by the same variable:
String str = "abc";
str += "def"; // now str refers to another piece in the heap containing "abcdef"
// while "abc" is still somewhere in the heap until taken by GC
It's been said that int and double are immutable in C#. Does that mean that when we have an int and later change it, we get a new int "pointed" to by the same variable? The same thing, but on the stack.
int i = 1;
i += 1; // same thing: in the stack there is value 2 to which variable
// i is attached, and somewhere in the stack there is value 1
Is that correct? If not, in what way is int immutable?
To follow up on Marc's (perfectly acceptable) answer: Integer values are immutable but integer variables may vary. That's why they're called "variables".
Integer values are immutable: if you have the value that is the number 12, there's no way to make it odd, no way to paint it blue, and so on. If you try to make it odd by, say, adding one, then you end up with a different value, 13. Maybe you store that value in the variable that used to contain 12, but that doesn't change any property of 12. 12 stays exactly the same as it was before.
You haven't changed (and cannot change) something about the int; you have assigned a new int value (and discarded the old value). Thus it is immutable.
Consider a more complex struct:
var x = new FooStruct(123);
x.Value = 456; // mutate
x.SomeMethodThatChangedInternalState(); // mutate
x = new FooStruct(456); // **not** a mutate; this is a *reassignment*
However, there is no "pointing" here. The struct is directly on the stack (in this case): no references involved.
Although this may sound obvious, I'll add a couple of lines that helped me understand, in case someone has the same confusion.
There are various kinds of mutability, but generally when people say "immutable" they mean that the class's members cannot be changed.
String probably stores its characters in an array that is not accessible and cannot be changed via methods; that's why string is immutable. The + operation on strings always returns a new string.
int probably stores its value in a member that is likewise inaccessible and therefore cannot be changed. All operations on int return a new int, and this value is copied to the variable, because it's a value type.
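A tiny illustration of that copy-on-assignment behavior:

int a = 5;
int b = a;            // the value 5 is copied into b
b += 1;               // b now holds a new value, 6
Console.WriteLine(a); // still 5; a was never touched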
I have a dataset. This dataset will serve as a lookup table: given a number, I should be able to look up a corresponding value for that number.
The dataset (let's say it's CSV) has a few caveats, though. Instead of:
1,ABC
2,XYZ
3,LMN
The numbers are ranges (- being "through", not minus):
1-3,ABC // 1, 2, and 3 = ABC
4-8,XYZ // 4, 5, 6, 7, 8 = XYZ
11-11,LMN // 11 = LMN
All the numbers are signed ints. No ranges overlap with another ranges. There are some gaps; there are ranges that aren't defined in the dataset (like 9 and 10 in the last snippet above).
How might I model this dataset in C# so that I have the most-performant lookup while keeping my in-memory footprint low?
The only option I've come up with suffers from overconsumption of memory. Let's say my dataset is:
1-2,ABC
4-6,XYZ
Then I create a Dictionary<int,string>() whose key/values are:
1/ABC
2/ABC
4/XYZ
5/XYZ
6/XYZ
Now I have hash performance-lookup, but tons of wasted space in the hash table.
Any ideas? Maybe just use PLINQ instead and hope for good performance? ;)
If your dictionary is going to truly store a wide range of key values, an approach that expands all possible ranges into explicit keys will rapidly consume more memory than you likely have available.
Your best option is to use a data structure that supports some variation of binary search (or other O(log N) lookup technique). Here's a link to a generic RangeDictionary for .NET that uses an OrderedList internally, and has O(log N) performance.
Achieving constant-time O(1) lookup requires that you expand all ranges into explicit keys. This requires both a lot of memory, and can actually degrade performance when you need to split or insert a new range. This probably isn't what you want.
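If you'd rather roll the binary search yourself than use a library, a minimal sketch of the idea: keep the ranges sorted by start and binary-search on lookup (the types and names here are illustrative):

struct Range { public int Start, End; public string Value; }

// 'ranges' must be sorted by Start, with no overlaps
static string Lookup(List<Range> ranges, int key)
{
    int lo = 0, hi = ranges.Count - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (key < ranges[mid].Start) hi = mid - 1;
        else if (key > ranges[mid].End) lo = mid + 1;
        else return ranges[mid].Value;
    }
    return null; // key falls in one of the gaps
}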
You can create a doubly-indirected lookup:
Dictionary<int, int> keys;
Dictionary<int, string> values;
Then store the data like this:
keys.Add(1, 1);
keys.Add(2, 1);
keys.Add(3, 1);
//...
keys.Add(11, 3);
values.Add(1, "ABC");
//...
values.Add(3, "LMN");
And then look the data up:
return values[keys[3]]; //returns "ABC"
I'm not sure how much memory footprint this will save with trivial strings, but once you get beyond "ABC" it should help.
EDIT
After Dan Tao's comment below, I went back and checked on what he was asking about. The following code:
var abc = "ABC";
var def = "ABC";
Console.WriteLine(ReferenceEquals(abc, def));
will write "True" to the console. Which means that the either the compiler or the runtime (clarification?) is maintaining the reference to "ABC", and assigns it as the value of both variables.
After reading up some more on Interned strings, if you're using string literals to populate the dictionary, or Interning computed strings, it will in fact take more space to implement my suggestion than the original dictionary would have taken. If you're not using Interned strings, then my solution should take less space.
FINAL EDIT
If you're treating your strings correctly, there should be no excess memory usage from the original Dictionary<int, string> because you can assign them to a variable and then assign that reference as the value (or, if you need to, because you can Intern them)
Just make sure your assignment code includes an intermediate variable assignment:
while (thereAreStringsLeftToAssign)
{
var theString = theStringToAssign;
foreach (var i in range)
{
strings.Add(i, theString);
}
}
As arootbeer has mentioned in his answer, the following code does not create multiple instances of the string "ABC"; rather, it interns a single instance and assigns a reference to that instance to each KeyValuePair<int, string> in dictionary:
var dictionary = new Dictionary<int, string>();
dictionary[0] = "ABC";
dictionary[1] = "ABC";
dictionary[2] = "ABC";
// etc.
OK, so in the case of string literals, you're only using one string instance per range of keys. Is there a scenario where this wouldn't be the case--that is, where you would be using a separate string instance for each key within the range (this is what I assume you're concerned about when you speak of "overconsumption of memory")?
Honestly, I don't think so. There are scenarios where multiple equivalent string instances may be created without the benefit of interning, yes. But I can't imagine these scenarios would affect what you're trying to do here.
My reasoning is this: you want to assign certain values to different ranges of keys, right? So any time you are defining a key-range-value pairing of this sort, you have a single value and several keys. The single part is what leads me to doubt that you'll ever have multiple instances of the same string, unless it is defined as the value for more than one range.
To illustrate: yes, the following code will instantiate two identical strings:
string x = "ABC";
Console.Write("Type 'ABC' and press Enter: ");
string y = Console.ReadLine();
Console.WriteLine(Equals(x, y));
Console.WriteLine(ReferenceEquals(x, y));
The above program, assuming the user follows instructions and types "ABC," outputs True, then False. So you might think, "Ah, so when a string is only provided at run-time, it isn't interned! So this could be where my values could be duplicated!"
But... again: I don't think so. It all comes back to the fact that you are going to be assigning a single value to a range of keys. So let's say your values come from user input; then your code would look something like this:
var dictionary = new Dictionary<int, string>();
int start, count;
GetRange(out start, out count);
string value = GetValue();
foreach (int key in Enumerable.Range(start, count))
{
// Look, you're using the same string instance to assign
// to each key... how could it be otherwise?
dictionary[key] = value;
}
Now, if you were actually thinking more along the lines of what LBushkin mentions in his answer--that you may potentially have huge ranges, making it impractical to define a KeyValuePair<int, string> for each key within that range (e.g., if you have a range of 1-1000000)--then I would agree that you're best off with some sort of data structure that bases its lookup on a binary search. If that's more your scenario, say so and I will be happy to offer more ideas on that front. (Or you could just take a look at the link LBushkin already posted.)
Use a balanced ordered tree (or something similar) mapping start-of-range to end-of-range and data. This will be easy to implement for non-overlapping ranges.
arootbeer has a good solution, but one you may find confusing to work with.
Another choice is to use a reference type instead of a string, so that you point to the same reference
class StringContainer {
    public string Value { get; set; }
}

var values = new Dictionary<int, StringContainer>();

var value1 = new StringContainer { Value = "ABC" };
values.Add(1, value1);
values.Add(2, value1);
They will both point to the same instance of StringContainer
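A quick demonstration of the shared reference (and the hazard to be aware of):

values[1].Value = "XYZ";
Console.WriteLine(values[2].Value); // prints "XYZ" as well - both keys share one container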
EDIT: Thanks for the comments everyone. This method handles value types other than string, so it might be useful for more than the given example. Also, it is my understanding that strings don't always behave in the manner you would expect from reference values, but I could be wrong.
What's the fastest way to parse strings in C#?
Currently I'm just using string indexing (string[index]) and the code runs reasonably, but I can't help but think that the continuous range checking that the index accessor does must be adding something.
So, I'm wondering what techniques I should consider to give it a boost. These are my initial thoughts/questions:
Use methods like string.IndexOf() and IndexOfAny() to find characters of interest. Are these faster than manually scanning a string by string[index]?
Use regex's. Personally, I don't like regex as I find them difficult to maintain, but are these likely to be faster than manually scanning the string?
Use unsafe code and pointers. This would eliminate the index range checking but I've read that unsafe code wont run in untrusted environments. What exactly are the implications of this? Does this mean the whole assembly won't load/run, or will only the code marked unsafe refuse to run? The library could potentially be used in a number of environments, so to be able to fall back to a slower but more compatible mode would be nice.
What else might I consider?
NB: I should say, the strings I'm parsing could be reasonably large (say 30k) and in a custom format for which there is no standard .NET parser. Also, performance of this code is not super critical, so this is partly just a theoretical question of curiosity.
30k is not what I would consider to be large. Before getting excited, I would profile. The indexer should be fine for the best balance of flexibility and safety.
For example, to create a 128k string (and a separate array of the same size), fill it with junk (including the time to handle Random) and sum all the character code-points via the indexer takes... 3ms:
var watch = Stopwatch.StartNew();
char[] chars = new char[128 * 1024];
Random rand = new Random(); // fill with junk
for (int i = 0; i < chars.Length; i++)
{
    chars[i] = (char)('a' + rand.Next(26));
}
int sum = 0;
string s = new string(chars);
int len = s.Length;
for (int i = 0; i < len; i++)
{
    sum += (int)s[i]; // exercise the string indexer, which is what we're timing
}
watch.Stop();
Console.WriteLine(sum);
Console.WriteLine(watch.ElapsedMilliseconds + "ms");
Console.ReadLine();
For files that are actually large, a reader approach should be used - StreamReader etc.
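For instance, a minimal sketch of chunked reading (the path and buffer size are arbitrary):

using (var reader = new StreamReader("data.txt"))
{
    char[] buffer = new char[8192];
    int read;
    while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        for (int i = 0; i < read; i++)
        {
            // examine buffer[i] here instead of indexing one huge string
        }
    }
}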
"Parsing" is quite an inexact term. Since you talks of 30k, it seems that you might be dealing with some sort of structured string which can be covered by creating a parser using a parser generator tool.
A nice tool to create, maintain and understand the whole process is the GOLD Parsing System by Devin Cook: http://www.devincook.com/goldparser/
This can help you create code which is efficient and correct for many textual parsing needs.
As for your points:
1. string.IndexOf()/IndexOfAny() is usually not useful for parsing that goes further than splitting a string.
2. Regexes are better suited if there are no recursions or overly complex rules.
3. Unsafe code is basically a no-go if you haven't really identified this as a serious problem. The JIT can take care of doing the range checks only when needed, and indeed for simple loops (the typical for loop) this is handled pretty well.