I'm trying to parse a large text string. I need to split the original string into blocks of 15 characters (the next block might start with white space, so the Trim function is used). I'm using two strings, the original and a temporary one. The temp string stores each 15-character block.
I wonder whether I could run into a performance issue because strings are immutable. This is the code:
string original = "THIS IS SUPPOSE TO BE A LONG STRING AN I NEED TO SPLIT IT IN BLOCKS OF 15 CHARACTERS.SO";
string temp = string.Empty;
while (original.Length != 0)
{
temp = original.Substring(0, 14).Trim();
original = original.Substring(14, (original.Length -14)).Trim();
}
I'd appreciate your feedback on a better way to achieve this functionality.
You'll get slightly better performance like this (but whether the performance gain will be significant is another matter entirely):
for (var startIndex = 0; startIndex < original.Length; startIndex += 15)
{
temp = original.Substring(startIndex, Math.Min(original.Length - startIndex, 15)).Trim();
}
This performs better because you're not copying the remainder of the original string (everything after the current block) on each loop iteration.
EDIT
To advance the index to the next non-whitespace character, you can do something like this:
for (var startIndex = 0; startIndex < original.Length; )
{
if (char.IsWhiteSpace(original, startIndex))
{
startIndex++;
continue;
}
temp = original.Substring(startIndex, Math.Min(original.Length - startIndex, 15)).Trim();
startIndex += 15;
}
I think you are right about the immutable issue - recreating 'original' each time is probably not the fastest way.
How about passing 'original' into a StringReader class?
If your original string is longer than a few thousand chars, you'll have noticeable (>0.1s) processing time and a lot of GC pressure. The first Substring call is fine, and I don't think you can avoid it unless you go deep inside System.String and mess around with m_FirstChar. The second Substring can be avoided completely by going char-by-char and tracking your position with an int index.
In general, if you run this on bigger data, such code might be problematic. It depends, of course, on your needs.
In general, it might be a good idea to use the StringBuilder class, which lets you operate on strings in a "more mutable" way without the performance hit, such as removing from its beginning without reallocating the whole string.
In your example, however, I would consider throwing out the line that takes a substring from original and replacing it with code that updates an index pointing to where the next substring should start. The while condition would then just check whether the index is at the end of the string, and the temp assignment would take its substring not from 0 but from i, where i is that index.
However, don't optimize code if you don't have to. I'm assuming here that you need more performance and are willing to sacrifice some time and/or write slightly less understandable code for more efficiency.
Related
In C#, I have a string like this:
"1 3.14 (23, 23.2, 43,88) 8.27"
I need to convert this string to other types according to the value, such as int/float/vector3. Right now I have some code like this:
public static int ReadInt(this string s, ref string op)
{
s = s.Trim();
string ss = "";
int idx = s.IndexOf(" ");
if (idx > 0)
{
ss = s.Substring(0, idx);
op = s.Substring(idx);
}
else
{
ss = s;
op = "";
}
return Convert.ToInt32(ss);
}
This reads the first int value out. I have similar functions to read float, vector3, etc. The problem is that in my application I have to do this a lot: I receive the string from some plugin and need to parse it every single frame. That creates a lot of strings, and the resulting GC activity will hurt performance. Is there a way I can do similar work without creating temp strings?
Generation 0 objects such as those created here may well not impact performance too much, as they are relatively cheap to collect. I would change from using Convert to calling int.Parse() with the invariant culture before I started worrying about the GC overhead of the extra strings.
Also, you don't really need to create a new string to accomplish the Trim() behavior. After all, you're scanning and indexing the string anyway. Just do your initial scan for whitespace, and then for the space delimiter between ss and op, so you get just the substrings you need. Right now you're creating 50% more string instances than you really need.
All that said, no...there's not anything built into the basic .NET framework that would parse a substring without actually creating a new string instance. You would have to write your own parsing routines to accomplish that.
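As a rough illustration of what such a hand-written routine could look like, here is a minimal sketch that reads an int by scanning the characters in place and advancing a position, so no Trim or Substring results are allocated. The class name SubstringFreeParser and the method ReadInt(string, ref int) are hypothetical names chosen for this example; they are not part of .NET, and the sketch ignores culture-specific formatting.

```csharp
using System;

public static class SubstringFreeParser
{
    // Hypothetical helper: reads a base-10 int starting at 'pos' and advances
    // 'pos' past it. No temporary strings are created.
    public static int ReadInt(string s, ref int pos)
    {
        // Skip leading whitespace instead of calling Trim().
        while (pos < s.Length && char.IsWhiteSpace(s[pos])) pos++;

        bool negative = false;
        if (pos < s.Length && (s[pos] == '-' || s[pos] == '+'))
        {
            negative = s[pos] == '-';
            pos++;
        }

        int start = pos;
        int value = 0;
        while (pos < s.Length && s[pos] >= '0' && s[pos] <= '9')
        {
            // 'checked' throws OverflowException instead of silently wrapping.
            // Limitation of this sketch: int.MinValue itself cannot be parsed.
            value = checked(value * 10 + (s[pos] - '0'));
            pos++;
        }

        if (pos == start)
            throw new FormatException("No digits found at position " + start);

        return negative ? -value : value;
    }
}
```

The caller keeps one int position per input string and passes it by ref to each read, which replaces the "op" remainder string in the question's version. Float and vector readers could follow the same pattern, though whether this is worth the extra code should still be measured.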
You should measure the actual real-world performance impact first, to make sure these substrings really are a significant issue.
I don't know what the "some plugin" is or how you have to handle the input from it, but I would not be surprised to hear that the overhead in acquiring the original input string(s) for this scenario swamps the overhead of the substrings for parsing.
I have a Base64 encoded string like this :
SWwgw6l0YWl0IHVuIHBldGl0IG5hdmlyZS [...] 0IG5hdmlyZSA=
The input string can be large (> 1MB). For interoperability reasons, I need to add a carriage return into that large string every 64 characters.
My first guess was to use a StringBuilder and call the "AppendLine" method every 64 characters, like this:
string InputB64_Without_CRLF = "SWwgw6l0YWl0IHVuIHBldGl0IG5hdmlyZS [...] 0IG5hdmlyZSA=";
int BufferSize = 64;
int Index = 0;
StringBuilder sb = new StringBuilder();
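while (Index < InputB64_Without_CRLF.Length) {
// The last chunk may be shorter than BufferSize, so clamp the length.
sb.AppendLine(InputB64_Without_CRLF.Substring(Index, Math.Min(BufferSize, InputB64_Without_CRLF.Length - Index)));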
Index += BufferSize;
}
string Output_With_CRLF = sb.ToString();
But I'm worried about the performance of that portion of code. Is there a better means to insert a character into a string at a certain position without rebuilding another string?
Is there a better means to insert a character into a string at a certain position without rebuilding another string?
.NET strings are immutable, which means that they cannot be modified once they have been created.
Therefore, if you want to insert characters into a string, there is no other way but to create a new one. And StringBuilder is quite probably the most efficient way to go about this, because it allows you to perform as many string-building steps as needed, and only create one single new string in the end.
Unless you've actually noticed performance problems in a real-world scenario, keep your current solution. It looks fine to me, at least from a performance point of view.
Some further fine points to consider:
If you're still not happy with your solution, I can think of only a few minor things that might make your current solution more efficient:
Declare the StringBuilder's required capacity up front, so that its backing character buffer won't have to be resized:
var additionalCharactersCount = Environment.NewLine.Length * (input.Length / 64);
var sb = new StringBuilder(capacity: input.Length + additionalCharactersCount);
Insert the complete input string into the StringBuilder first, then repeatedly .Insert(…, Environment.NewLine) every 64 characters.
I am not at all certain whether this would actually improve execution speed, but it would get rid of the repeated string creation caused by .Substring. Measure for yourself whether it's faster than your solution or not.
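For comparison, here is a sketch of a third variant that combines the pre-sized capacity with the StringBuilder.Append(string, int, int) overload, which copies a range of the input directly and so avoids the intermediate Substring results. The class and method names (Base64Wrapper, WrapLines) are made up for this example, and like the other suggestions it is untested for speed against the original.

```csharp
using System;
using System.Text;

public static class Base64Wrapper
{
    // Hypothetical helper: appends Environment.NewLine after every
    // 'lineLength' characters (and after the final, possibly shorter, chunk).
    public static string WrapLines(string input, int lineLength = 64)
    {
        if (lineLength <= 0)
            throw new ArgumentOutOfRangeException(nameof(lineLength));

        // Number of chunks, rounding up; one line break is added per chunk.
        int breaks = (input.Length + lineLength - 1) / lineLength;
        var sb = new StringBuilder(input.Length + breaks * Environment.NewLine.Length);

        for (int index = 0; index < input.Length; index += lineLength)
        {
            // Clamp the final chunk so we never read past the end of the input.
            int count = Math.Min(lineLength, input.Length - index);
            sb.Append(input, index, count); // copies the range, no Substring
            sb.Append(Environment.NewLine);
        }
        return sb.ToString();
    }
}
```

This keeps the question's behavior of ending the output with a line break, mirroring AppendLine.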
Your code is not inefficient; trying to save 100ms or less is usually not worth the effort. But if you are concerned, here is another, slightly more efficient way to insert a new line (which is sometimes \r\n, not just \n) every 64 characters:
// Start at the last index so that our new line inserts do not move the text still to be processed, making sure to insert at every 64th character of the original string.
// It looks silly to divide and multiply again, but it works because it is integer division.
StringBuilder sb = new StringBuilder(InputB64_Without_CRLF);
for (int i = (InputB64_Without_CRLF.Length / 64) * 64; i >= 64; i -= 64)
sb.Insert(i, Environment.NewLine);
string Output_With_CRLF = sb.ToString();
This will only be a tiny bit more efficient than your original code; you likely won't notice much difference.
After talking with stakx, I had this idea. By using the StringBuilder you do not create many strings over and over. The StringBuilder is very efficient and will handle its inserts without creating more objects.
I am writing a custom string split. It will split on a dot (.) that is not preceded by an odd number of backslashes (\).
«string» -> «IEnumerable<string>»
"hello.world" -> "hello", "world"
"abc\.123" -> "abc\.123"
"aoeui\\.dhtns" -> "aoeui\\","dhtns"
I would like to know whether there is a substring operation that reuses the original string (for speed), or whether there is an existing split that can do this fast.
This is what I have, but it is 2 to 3 times slower than input.Split('.'), where input is a string. (I know it is a slightly more complex problem, but not that much.)
public IEnumerable<string> HandMadeSplit(string input)
{
var Result = new LinkedList<string>();
var word = new StringBuilder();
foreach (var ch in input)
{
if (ch == '.')
{
Result.AddLast(word.ToString());
word.Length = 0;
}
else
{
word.Append(ch);
}
}
Result.AddLast(word.ToString());
return Result;
}
It now uses List instead of LinkedList, records the beginning and end of each substring, and uses string.Substring to create the new substrings. This helps a lot and is nearly as fast as string.Split, but with my adjustments added. (will add code)
The loop that you show is the right approach if you need performance. (Regex wouldn't be).
Switch to an index-based for-loop. Remember the index of the start of the match. Don't append individual chars. Instead, remember the range of characters to copy out and do that with a single Substring call per item.
Also, don't use a LinkedList. It is slower than a List for almost all cases except random-access mutations.
You might also switch from List to a normal array that you resize with Array.Resize. This results in slightly tedious code (because you have inlined a part of the List class into your method) but it cuts out some small overheads.
Next, don't return an IEnumerable because that forces the caller through indirection when accessing its items. Return a List or an array.
This is the one I eventually settled on. It is not as fast as string.Split, but it is good enough and can be modified to do what I want.
private IEnumerable<string> HandMadeSplit2b(string input)
{
// This one is marginally better than the second-best (2), but makes the resolver (its client) much faster, nearly as fast as the original.
var Result = new List<string>();
var begining = 0;
var len = input.Length;
for (var index=0;index<len;index++)
{
if (input[index] == '.')
{
Result.Add(input.Substring(begining,index-begining));
begining = index+1;
}
}
Result.Add(input.Substring(begining));
return Result;
}
You shouldn't try to use string.Split for that.
If you need help implementing it, a simple way to solve this is to have a loop that scans the string, keeping track of the last place where you found a qualifying dot. When you find a new qualifying dot (or reach the end of the input string), just yield return the current substring.
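A minimal sketch of that loop, assuming "qualifying" means a dot preceded by an even number (including zero) of consecutive backslashes, as in the question's examples. The names EscapedSplitter and SplitOnUnescapedDots are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class EscapedSplitter
{
    // Hypothetical helper: lazily yields the pieces between qualifying dots.
    public static IEnumerable<string> SplitOnUnescapedDots(string input)
    {
        int start = 0;        // index where the current piece begins
        int backslashes = 0;  // length of the backslash run just before the current char

        for (int i = 0; i < input.Length; i++)
        {
            char ch = input[i];
            if (ch == '\\')
            {
                backslashes++;
                continue;
            }

            // A dot qualifies only when the preceding backslash run is even.
            if (ch == '.' && backslashes % 2 == 0)
            {
                yield return input.Substring(start, i - start);
                start = i + 1;
            }
            backslashes = 0; // any non-backslash character ends the run
        }

        // Whatever remains after the last qualifying dot is the final piece.
        yield return input.Substring(start);
    }
}
```

Because this is an iterator, nothing runs until the caller starts enumerating, and the backslashes are left in the output exactly as in the question's examples.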
Edit: about returning a list or an array vs. using yield
If the most important thing in your application is the time the caller spends iterating the substrings, then you should populate a list or an array and return that, as suggested in the accepted answer. I would not use a resizable array while collecting the substrings because this would be slow.
On the other hand, if you care about overall performance and memory, and if the caller sometimes doesn't have to iterate over the entire list, you should use yield return. When you use yield return, you have the advantage that no code at all executes until the caller has called MoveNext (directly or indirectly through a foreach). This means that you save the memory for allocating the array/list, and you save the time spent on allocating/resizing/populating the list. You will spend time almost only on the logic of finding the substrings, and this will be done lazily, that is, only when actually needed because the caller continues to iterate the substrings.
I'm working with huge string data for a project in C#. I'm confused about which approach I should use to manipulate my string data.
First Approach:
StringBuilder myString = new StringBuilder().Append(' ', 1024);
while(someString[++counter] != someChar)
myString[i++] = someString[counter];
Second Approach:
string myString;
int i = counter;
while(someString[++counter] != someChar);
myString = someString.Substring(i, counter - i);
Which of the two would be faster (and more efficient), considering the strings I'm working with are huge?
The strings are already in the RAM.
The size of the string can vary from 32MB-1GB.
You should use IndexOf rather than doing individual character manipulations in a loop, and add whole chunks of string to the result:
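StringBuilder myString = new StringBuilder();
int pos = someString.IndexOf(someChar, counter);
// IndexOf returns -1 when someChar is not found; take the rest of the string in that case.
if (pos < 0) pos = someString.Length;
// Substring takes a start index and a length, not an end index.
myString.Append(someString.Substring(counter, pos - counter));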
For "huge" strings, it may make sense to take a streamed approach and not load the whole thing into memory. For the best raw performance, you can sometimes squeeze a little more speed out by using pointer math to search and capture pieces of strings.
To be clear, I'm stating two completely different approaches.
1 - Stream
The OP doesn't say how big these strings are, but it may be impractical to load them into memory. Perhaps they are being read from a file, from a data reader connected to a DB, from an active network connection, etc.
In this scenario, I would open a stream, read forward, buffering my input in a StringBuilder until the criteria was met.
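A minimal sketch of that streamed approach, assuming the "criteria" is simply reaching a terminator character. StreamScanner and ReadUntil are hypothetical names; the TextReader could be a StreamReader over a file or network stream, or a StringReader over an in-memory string.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamScanner
{
    // Hypothetical helper: reads forward from 'reader', buffering characters
    // until 'terminator' is seen or the input ends. The terminator itself is
    // consumed but not included in the result.
    public static string ReadUntil(TextReader reader, char terminator)
    {
        var buffer = new StringBuilder();
        int next;
        while ((next = reader.Read()) != -1) // Read() returns -1 at end of input
        {
            char ch = (char)next;
            if (ch == terminator)
                break;
            buffer.Append(ch);
        }
        return buffer.ToString();
    }
}
```

Only the characters up to the match are ever held in memory, and repeated calls on the same reader continue from where the previous call stopped.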
2 - Unsafe Char Manipulation
This requires that you do have the complete string. You can obtain a char* to the start of a string quite simply:
// fix entire string in memory so that we can work w/ memory range safely
fixed( char* pStart = bigString )
{
char* pChar = pStart; // unfixed pointer to start of string
char* pEnd = pStart + bigString.Length;
}
You can now increment pChar and examine each character. You can buffer it (e.g. if you want to examine multiple adjacent characters) or not as you choose. Once you determine the ending memory location, you now have a range of data that you can work with.
Unsafe Code and Pointers in c#
2.1 - A Safer Approach
If you are familiar with unsafe code, it is very fast, expressive, and flexible. If not, I would still use a similar approach, but without the pointer math. This is similar to the approach which @supercat suggested, namely:
Get a char[].
Read through it character by character.
Buffer where needed. StringBuilder is good for this; set an initial size and reuse the instance.
Analyze buffer where needed.
Dump buffer often.
Do something with the buffer when it contains the desired match.
And an obligatory disclaimer for unsafe code: The vast majority of the time the framework methods are a better solution. They are safe, tested, and invoked millions of times per second. Unsafe code puts all of the responsibility on the developer. It does not make any assumptions; it's up to you to be a good framework/OS citizen (e.g. not overwriting immutable strings, allowing buffer overruns, etc.). Because it does not make any assumptions and removes the safeguards, it will often yield a performance increase. It's up to the developer to determine if there is indeed a benefit, and to decide if the advantages are significant enough.
Per request from OP, here are my test results.
Assumptions:
Big string is already in memory, no requirement for reading from disk
Goal is to not use any native pointers/unsafe blocks
The "checking" process is simple enough that something like Regex is not needed. For now simplifying to a single char comparison. The below code can easily be modified to consider multiple chars at once, this should have no effect on the relative performance of the two approaches.
public static void Main()
{
string bigStr = GenString(100 * 1024 * 1024);
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < 10; i++)
{
int counter = -1;
StringBuilder sb = new StringBuilder();
while (bigStr[++counter] != 'x')
sb.Append(bigStr[counter]);
Console.WriteLine(sb.ToString().Length);
}
sw.Stop();
Console.WriteLine("StringBuilder: {0}", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
for (int i = 0; i < 10; i++)
{
int counter = -1;
while (bigStr[++counter] != 'x') ;
Console.WriteLine(bigStr.Substring(0, counter).Length);
}
sw.Stop();
Console.WriteLine("Substring: {0}", sw.Elapsed.TotalSeconds);
}
public static string GenString(int size)
{
StringBuilder sb = new StringBuilder(size);
for (int i = 0; i < size - 1; i++)
{
sb.Append('a');
}
sb.Append('x');
return sb.ToString();
}
Results (release build, .NET 4):
StringBuilder ~7.9 sec
Substring ~1.9 sec
StringBuilder was consistently > 3x slower, with a variety of different sized strings.
There's an IndexOf operation which would search more quickly for someChar, but I'll assume your real function to find the desired length is more complicated than that. In that scenario, I would recommend copying someString to a Char[], doing the search, and then using the new String(Char[], Int32, Int32) constructor to produce the final string. Indexing a Char[] is going to be so much more efficient than indexing a String or StringBuilder that, unless you expect to typically need only a small fraction of the string, copying everything to the Char[] will be a 'win' (unless, of course, you could simply use something like IndexOf).
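A short sketch of that recommendation, with a single-character comparison standing in for the more complicated check. CharArrayScanner and TakeUntil are hypothetical names, and the speed claim itself should be measured for your data rather than taken from this example.

```csharp
using System;

public static class CharArrayScanner
{
    // Hypothetical helper: copies the string to a char[] once, scans the
    // array, then builds the result with the String(char[], int, int)
    // constructor.
    public static string TakeUntil(string source, int start, char terminator)
    {
        char[] chars = source.ToCharArray(); // one copy of the whole string

        int end = start;
        // Stand-in for a more complex test on chars[end] and its neighbours.
        while (end < chars.Length && chars[end] != terminator)
            end++;

        return new string(chars, start, end - start);
    }
}
```

If the terminator is never found, the sketch returns everything from start to the end of the string.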
Even if the length of the string will often be much larger than the length of interest, you may still be best off using a Char[]. Pre-initialize the Char[] to some size, and then do something like:
Char[] temp = new Char[1024];
int i=0;
while (i < theString.Length)
{
int subLength = theString.Length - i;
if (subLength > temp.Length) // May impose other constraints on subLength, provided
subLength = temp.Length; // it's greater than zero.
theString.CopyTo(i, temp, 0, subLength);
// ... do stuff with the array
i+=subLength;
}
Once you're all done, you may then use a single Substring call to construct a string with the necessary characters from the original. If your application requires building a string whose characters differ from the original, you could use a StringBuilder and, within the above loop, use the Append(Char[], Int32, Int32) method to add processed characters to it.
Note also that with the above loop construct, one may decide to reduce subLength at any point in the loop, provided it is not reduced to zero. For example, if one is trying to find whether the string contains a prime number of sixteen or fewer digits enclosed by parentheses, one could start by scanning for an open-paren; if one finds it and it's possible that the data one is looking for might extend beyond the array, set subLength to the position of the open-paren, and reloop. Such an approach will result in a small amount of redundant copying, but not much (often none), and will eliminate the need to keep track of parsing state between loops. It is a very convenient pattern.
You always want to use StringBuilder when manipulating strings. This is because strings are immutable, so every modification creates a new object.
I'm aiming for speed; this must be ultra fast.
string s = something;
for (int j = 0; j < s.Length; j++)
{
if (s[j] == 'ь')
if(s.Length>(j+1))
if(s[j+1] != 'о')
s[j] = 'ъ';
}
It gives me this error: "Property or indexer 'string.this[int]' cannot be assigned to -- it is read only"
How do I do it the fastest way?
Fast way? Use a StringBuilder.
Fastest way? Always pass around a char* and a length instead of a string so you can modify the buffer in-place, but make sure you don't ever modify any string object.
There are at least two options:
Use a StringBuilder and keep track of the previous character.
You could just use a regular expression "ь(?!о)" or a simple string replacement of "ьо" depending on what your needs are (your question seems self-contradictory).
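As a sketch of the StringBuilder option, the following mirrors the logic of the question's loop: a 'ь' is replaced by 'ъ' only when another character follows and that character is not 'о', so a trailing 'ь' is left alone. That reading is an assumption, since the question is ambiguous, and this version looks ahead at the next character rather than tracking the previous one. SoftSignReplacer and ReplaceSoftSign are hypothetical names.

```csharp
using System;
using System.Text;

public static class SoftSignReplacer
{
    // Hypothetical helper: copies the string into a mutable StringBuilder,
    // edits characters in place, and creates one new string at the end.
    public static string ReplaceSoftSign(string s)
    {
        var sb = new StringBuilder(s);
        // Stop one short of the end: the last character has no successor,
        // and the question's code leaves it untouched.
        for (int j = 0; j + 1 < sb.Length; j++)
        {
            if (sb[j] == 'ь' && sb[j + 1] != 'о')
                sb[j] = 'ъ';
        }
        return sb.ToString();
    }
}
```

The same loop works on a char[] obtained from ToCharArray(), which may be worth comparing in your own measurements.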
I tested the performance of a StringBuilder approach versus regular expressions and there is very little difference - at most a factor of 2:
Method               Iterations per second
StringBuilder        153480.094
Regex (uncompiled)   90021.978
Regex (compiled)     136355.787
string.Replace       1427605.174
If performance is critical for you I would strongly recommend making some performance measurements before jumping to conclusions about what the fastest approach is.
Strings in .NET are read-only. You could use StringBuilder.