fast custom string splitting

fast custom string splitting - c#

I am writing a custom string split. It will split on a dot(.) that is not preceded by an odd number of backslashes (\).
«string» -> «IEnemerable<string>»
"hello.world" -> "hello", "world"
"abc\.123" -> "abc\.123"
"aoeui\\.dhtns" -> "aoeui\\","dhtns"
I would like to know if there is a substring that will reuse the original string (for speed), or is there an existing split that can do this fast?
This is what I have but is 2—3 times slower than input.Split('.') //where input is a string. (I know it is a (slightly more complex problem, but not that much)
public IEnumerable<string> HandMadeSplit(string input)
{
var Result = new LinkedList<string>();
var word = new StringBuilder();
foreach (var ch in input)
{
if (ch == '.')
{
Result.AddLast(word.ToString());
word.Length = 0;
}
else
{
word.Append(ch);
}
}
Result.AddLast(word.ToString());
return Result;
}
It now uses List instead of LinkedList, and record beginning and end of substring and use string.substring to create the new substrings. This does a lot and is nearly as fast as string.split but I have added my adjustments. (will add code)

The loop that you show is the right approach if you need performance. (Regex wouldn't be).
Switch to an index-based for-loop. Remember the index of the start of the match. Don't append individual chars. Instead, remember the range of characters to copy out and do that with a single Substring call per item.
Also, don't use a LinkedList. It is slower than a List for almost all cases except random-access mutations.
You might also switch from List to a normal array that you resize with Array.Resize. This results in slightly tedious code (because you have inlined a part of the List class into your method) but it cuts out some small overheads.
Next, don't return an IEnumerable because that forces the caller through indirection when accessing its items. Return a List or an array.

This is the one I eventually settled on. It is not as fast a string.split, but good enough and can be modified, to do what I want.
private IEnumerable<string> HandMadeSplit2b(string input)
{
//this one is margenaly better that the second best 2, but makes the resolver (its client much faster), nealy as fast as original.
var Result = new List<string>();
var begining = 0;
var len = input.Length;
for (var index=0;index<len;index++)
{
if (input[index] == '.')
{
Result.Add(input.Substring(begining,index-begining));
begining = index+1;
}
}
Result.Add(input.Substring(begining));
return Result;
}

You shouldn't try to use string.Split for that.
If you need help to implement it, a simple way to solve this is to have loop that scans the string, keeping track of the last place where you found a qualifying dot. When you find a new qualifying dot (or reach the end of the input string), just yield return the current substring.
Edit: about returning a list or an array vs. using yield
If in your application, the most important thing is the time spent by the caller on iterating the substrings, then you should populate a list or an array and return that, as suggested in the accepted question. I would not use a resizable array while collecting the substrings because this would be slow.
On the other hand, if you care about the overall performance and about memory, and if sometimes the caller doesn't have to iterate over the entire list, you should use yield return. When you use yield return, you have the advantage that no code at all is executing until the caller has called MoveNext (directly or indirectly through a foreach). This means that you save the memory for allocating the array/list, and you save the time spent on allocating/resizing/populating the list. You will be spending time almost only on the logic of finding the substrings, and this will be done lazily, that is - only when actually needed because the caller continues to iterate the substrings.

Related

C# Fastest way to determine if a string contains all elements of a list

Quick background. I have a string of words - I separate out those words into a List (I've tried HashSet it doesn't make any difference - and you lose the ordered nature of a List).
I then manipulate the original words in many dull ways - and create thousands of "new strings" - all of these strings are in a StringBuilder which has been set .ToString();
At the end of the manipulation, I want to QC those new strings - and be sure that every word that was in the original set - is still somewhere in those new strings and I have not accidentally lost a word.
That original string, can run to hundreds of individual words.
Short Example:
List<string> uniqueWords = new List<string> { "two", "three", "weather sunday" };
string final = "two and tomorrow\n\rtwo or wednesday\n\rtwo with thursday\n\rtwo without friday\n\rthree gone tomorrow\n\rthree weather saturday\n\rthree timely sunday";
The output string can run to tens of millions of characters, millions of words, 200,000+ rows of data (when split). You may notice that there are words that are actually two words separated by a space - so I cannot simply split out the individual words by splitting on the space as comparing them to the original would fail, and I need to confirm the words are exactly as they appeared originally - having weather somewhere and sunday somewhere - is not the same as having 'weather sunday' - for my purposes.
The the code I have tried so far and have benchmarked:
First attempt:
var allWords = uniqueWords.Where(substring => final.Contains(substring, StringComparison.CurrentCultureIgnoreCase)).ToList();
Second Attempt:
List<string> removeableList = new(uniqueWords);
foreach (var item in uniqueWords)
{
if (removeableList.Count == 0)
{
break;
}
if (final.Contains(item))
{
removeableList.Remove(item);
}
}
Third Attempt:
List<string> removeableList = new(uniqueWords);
for (int i = uniqueWords.Count; i >= 0; i--)
{
if (removeableList.Count == 0)
{
break;
}
if (final.Contains(uniqueWords[i]))
{
removeableList.Remove(uniqueWords[i]);
}
}
These are the results:
These results are repeatable, though I will say that the First Attempt tends to fluctuate quite a lot while the Second and Third Attempts tend to remain at about the same level - the Third Attempt does seem to do better than the Second rather consistently.
Are there any options that I am missing?
I have tried it using a Regex Matches collection into a HashSet - oh that was bad, 4 times worse than the First Attempt.
If there is a way to improve the performance on this task I would love to find it.

Your attempt #1 uses CurrentCultureIgnoreCase which will be slow. But even after removing that, you are adding to the list, rather than removing, and therefore the list might need to be resized.
You are also measuring two different things: option #1 is getting the list of words which are in final, the others get the list of words which are not.
Further options include:
Use List.RemoveAll
List<string> remainingWords = new(uniqueWords);
remainingWords.RemoveAll(final.Contains); // use delegate directly, without anonymous delegate
Use a pre-sized list and use Linq
List<string> remainingWords = new(uniqueWords.Length);
remainingWords.AddRange(uniqueWords.Where(s => !final.Contains(s)));
Each of these two options can be flipped depending on what result you are trying to achieve, as mentioned.
List<string> words = new(uniqueWords);
words.RemoveAll(s => !final.Contains(s));
List<string> words = new(uniqueWords.Length);
words.AddRange(uniqueWords.Where(final.Contains)); // use delegate directly, without anonymous delegate

#Charlieface, thanks for that - I tried those, I think you have a point about adding to a list - as that appears much slower. For me it doesn't matter whether it is adding or removing, the result is a True/False return - whether the list is empty or of the size of the original list.
Sixth Attempt:
List<string> removeableList = new(uniqueWords.Count);
removeableList.AddRange(uniqueWords.Where(s => !parsedTermsComplete!.Contains(s)));
Seventh Attempt:
List<string> removeableList = new(uniqueWords);
removeableList.RemoveAll(parsedTermsComplete!.Contains);
Results in comparison to Third Attempt (fastest generally):
The adding does appear slower - and memory is a little higher for the RemoveAll but timing is consistent - bearing in mind it fluctuates depending on what Windows decides to do at any given moment...
Here is an interesting implementation of the AhoCorasickTree method - which I saw mentioned on this site somewhere else.
My knowledge on this is extremely limited so this may not be a good implementation at all - I am not saying it is a good implementation just that it works - this comes from a nuget package, but I am unsure on SO's policy on nuget package links, so won't link for now. In testing, creating an array was faster than creating a list.
Eighth Attempt:
var wordArray = uniqueWords.ToArray();
int i = uniqueWords.Count - 1;
foreach (var item in wordArray)
{
var keyWords = new AhoCorasickTree(new[] { item });
if (keyWords.Contains(parsedTermsComplete))
{
uniqueWords.RemoveAt(i);
}
i--;
}
I noticed in testing that creating a "removableList" was actually slower than creating a removableArray (found this out implementing the above Aho run). I updated the Third Attempt to incorporate this:
var removeableArray = uniqueWords.ToArray();
for (int i = removeableArray.Length -1; i >= 0; i--)
{
if (!uniqueWords.Any())
{
break;
}
if (parsedTermsComplete!.Contains(removeableArray[i]))
{
uniqueWords.RemoveAt(i);
}
}
The Benchmarks come out like this, the Third Attempt is updated to an array, the Seventh Attempt is the AhoCorasick implementation on a list, and the Eighth Attempt is the AhoCorasick implementation on an Array.
The ToArray - does seem faster than List, which is good to know.
My only issue with the AhoCorasick is that in practice - in a WASM application - this is actually much slower, so not a good option for me - but I put it here because it does seem to be much faster in Benchmarks (may be using multiple threads where WASM is limited to 1) and doesn't appear to allocate any memory, so might be useful to someone - interesting that the Third Attempt also appears to be allocated no memory when using an Array implementation whereas on a list it was allocated.

Need to understand how recursion is working in finding word permutation code sample

I have a working code sample for finding possible word permutations for mis-typed words. For example, someone types the word "gi" so the suggested words would be "go" or "hi". The find get_nearby_letters() and isWord() functions are given.
I need to understand how is sub_words is getting values. And since the function is called recursively how is the char[] nearby_letters = get_nearby_letters(letters[index-1]); program statement being reached?
I seem to be having trouble understanding how recursive functions work.
public List<string> nearby_words(string word)
{
List<string> possible_words;
char[] letters = word.ToCharArray();
possible_words = get_nearby_permutations(letters, 0);
possible_words = possible_words.Where(x => isWord(x)).ToList();
return possible_words;
}
public List<string> get_nearby_permutations(char[] letters, int index)
{
List<string> permutations = new List<string>();
if (index >= letters.Count())
{
permutations = new List<string> { "" };
return permutations;
}
List<string> sub_words = get_nearby_permutations(letters, ++index);
char[] nearby_letters = get_nearby_letters(letters[index-1]);
foreach (var sub_word in sub_words)
{
foreach (var letter in nearby_letters)
{
permutations.Add(letter+sub_word);
}
}
return permutations;
}

The function rightget_nearby_permutations() is recursive because it calls itself inside of the fuction. Now you are wondering how the part after the recursive call can even be reached.
Have a look at the parameter index, which is counted up each time. At the start rightget_nearby_permutations() will be called index = 0. Inside of the function you have a recursive function call with ++index, which means the index will be counted up by one.
This goes on until the condition index >= letters.Count() is reached. This time there will be no recursive call and a List with one empty string will be returned. In the previously calling function this List gets stored in the parameter sub_words.
And now everything goes backwards and the lines after the recursive call will be reached and permutations populated.
ProTip: Use debugging and breakpoints to check what your code is doing.
Edit: Example of recursive call for letters.Count()==2:
Function 1
index = 0
Recursive call of Function 2
index = 1
Recursive call of Function 3
index = 2
index >= letters.Count() == true
return
continue with f2
return permutations
continue with f1
return permutations

how is sub_words is getting values
The local variable sub_words receives the return value of the method call to the get_nearby_permutations() method.
how is the char[] nearby_letters = get_nearby_letters(letters[index-1]); program statement being reached?
That program statement is executed after the previous call to get_nearby_permutations() returns. It works just like any program statement that follows a method call.
Stack Overflow isn't really the best place to seek help understanding recursion. It's a broad topic and typically requires some hand-holding with the student to walk them through the specifics. You should read articles such the Wikipedia article Recursion (computer science) and the In plain English, what is recursion? Q&A on programmers.stackexchange.com.
It its core, recursion is two things:
A method (function) that calls itself, and
One or more termination cases, i.e. a reason for the method to not call itself
In your example, the method calls itself to obtain the results of the operation on the input after the current index. IMHO, it should have been written like this:
List<string> sub_words = get_nearby_permutations(letters, index + 1);
char[] nearby_letters = get_nearby_letters(letters[index]);
That would make more clear that it's not really that you want a new value for index in the current call frame of the method, but that the next call should use the incremented value. Incrementing the value and then subtracting it when the variable is used later in the current call frame is just confusing and inefficient.
So, you have the first part, clearly. The second part, a reason to not call itself, happens because each time it calls itself, the index value increases by one. Eventually, the index value is large enough that there are no more characters to process, and the list containing the empty string is returned instead of the method calling itself.
That is fundamentally how your recursive method works. Of course, there is a bit more to it than that. After all, the method does real work as well. But that's just regular algorithmic stuff. I.e. given the results of the recursive call, now the method will create different combinations of the current letter with the various strings returned by the recursive call.
Since the first time the method returns, it's simply returned a list with the empty string, all of the "combinations" are just the letters near the current letter. But then those letters are returned as sub_words values to the previous call to the method, at which point it then combines those values with the nearby letters to the previous letter.
In this way, the method works its way back, creating different permutations of possible words by trying all the different combinations of letters with each of the previously-determined, shorter letter combinations.
With all that in mind, your next step should be simply to step into the method using the debugger. You will find that with each call into the method, the index value increases by one, until eventually the method returns from the termination clause (i.e. the list containing the empty string), and then from there, each time you return a list, you proceed to generate a longer list based on the current letter and the previous list.
The debugger can be very informative in understanding this code. I recommend it be one of the very next things you try. :)

Working with String.Substrings, is there lack of performance?

I'm trying to parse a large text string. I need to split the original string in blocks of 15 characters(and the next block might contain white spaces, so the trim function is used). I'm using two strings, the original and a temporary one. This temp string is used to store each 15 length block.
I wonder if I could fall into a performance issue because strings are immutable. This is the code:
string original = "THIS IS SUPPOSE TO BE A LONG STRING AN I NEED TO SPLIT IT IN BLOCKS OF 15 CHARACTERS.SO";
string temp = string.Empty;
while (original.Length != 0)
{
temp = original.Substring(0, 14).Trim();
original = original.Substring(14, (original.Length -14)).Trim();
}
I appreciate your feedback in order to find a best way to achieve this functionality.

You'll get slightly better performance like this (but whether the performance gain will be significant is another matter entirely):
for (var startIndex = 0; startIndex < original.Length; startIndex += 15)
{
temp = original.Substring(startIndex, Math.Min(original.Length - startIndex, 15)).Trim();
}
This performs better because you're not copying the last all-but-15-characters of the original string with each loop iteration.
EDIT
To advance the index to the next non-whitespace character, you can do something like this:
for (var startIndex = 0; startIndex < original.Length; )
{
if (char.IsWhiteSpace(string, startIndex)
{
startIndex++;
continue;
}
temp = original.Substring(startIndex, Math.Min(original.Length - startIndex, 15)).Trim();
startIndex += 15;
}

I think you are right about the immutable issue - recreating 'original' each time is probably not the fastest way.
How about passing 'original' into a StringReader class?

If your original string is longer than few thousand chars, you'll have noticable (>0.1s) processing time and a lot of GC pressure. First Substring call is fine and I don't think you can avoid it unless you go deep inside System.String and mess around with m_FirstChar. Second Substring can be avoided completely when going char-by-char and iterating over int.

In general, if you would run this on bigger data such code might be problematic, it of course depends on your needs.
In general, it might be a good idea to use StringBuilder class, which will allow you to operator on strings in "more mutable" way without performance hit, like remove from it's beggining without reallocating whole string.
In your example however I would consider throwing out lime that takes substring from original and substitute it with some code that would update some indexes pointing where you should get new substring from. Then while condition would be just checking if your index as at the end of the string and your temp method would take substring not from 0 to 14 but from i, where i would be this index.
However - don't optimize code if you don't have to, I'm assuming here that you need more performance and you want to sacrifice some time and/or write a bit less understandable code for more efficiency.

C# highest string

This seems so trivial but I'm not finding an answer with Google.
I'm after a high value for a string for a semaphore at the end of a sorted list of strings.
It seems to me that char.highest.ToString() should do it--but this compares low, not high.
Obviously it's not truly possible to create a highest possible string because it would always be lower than the same thing + more data but the strings I'm sorting are all valid pathnames and thus the symbols used are constrained.
In response to the comments:
In the pre-unicode days in Delphi I would simply have used #255. I simply want a string that will compare higher than any possible pathname. This should be trivial--why isn't it??
Response #2:
It's not the sorting that requires the sentinel, it's the processing afterwards. I have multiple lists that I am sort-of merging (a simplistic merge won't do the job.) and either I duplicate code or I have dummy values that always compare high.

A string representation of the highest character will only be one character long.
Why don't you just append it as a semaphore after sorting, rather than trying to make it something that will sort afterwards?
Alternatively, you could specify your own comparator that sorts your token after any other string, and calls the default comparator otherwise.

I had the same problem when trying to put null values at the bottom of a list in a LINQ OrderBy() statement. I ended up using...
Char.ConvertFromUtf32(0x10ffff)
...which worked a treat.

Something like this?
public static String Highest(this String value)
{
Char highest = '\0';
foreach (Char c in value)
{
highest = Math.Max(c, highest);
}
return new String(new Char[] { highest });
}

What is the most efficient way to count all of the words in a richtextbox?

I am writing a text editor and need to provide a live word count. Right now I am using this extension method:
public static int WordCount(this string s)
{
s = s.TrimEnd();
if (String.IsNullOrEmpty(s)) return 0;
int count = 0;
bool lastWasWordChar = false;
foreach (char c in s)
{
if (Char.IsLetterOrDigit(c) || c == '_' || c == '\'' || c == '-')
{
lastWasWordChar = true;
continue;
}
if (lastWasWordChar)
{
lastWasWordChar = false;
count++;
}
}
if (!lastWasWordChar) count--;
return count + 1;
}
I have it set so that the word count runs on the richtextbox's text every tenth of a second (if the selection start is different from what it was last time the method ran). The problem is that the word count gets slow when working on very long files. To solve this I am thinking about having the word count only run on the current paragraph, recording the word count each time and comparing it against what the word count was last time the word count ran. It would then add the difference between the two to the total word count.
Doing this would cause many complications (if the user pastes, if the user deletes a paragraph, ect.)
Is this a logical way to go about improving my word count? Or is there something that I don't know about which would make it better?
EDIT:
Would it work to run the word count on a different thread? I don't know much about threading, will research.
SAMPLE TEXT THAT I USED:

You could do a simpler word count based on white-space:
public static int WordCount(this string s)
{
return s.Split(new char[] {' '},
StringSplitOptions.RemoveEmptyEntries).Length;
}
MSDN provides this example, should give you an accurate word count much faster on large files.

You could also use a very simple Regex that looks for at least one word character and/or apostrophe to capture the contractions:
public static int WordCount(this string s)
{
return Regex.Matches(s, #"[\w']+").Count;
}
This will return 2141 matches (which is actually more correct than Word in this case because Word counts the single asterisk as a word in the sentence "by stabbing a * with her finger").

Your method is actually faster than the proposed String.Split method, nearly three times faster on x86 and more than two times faster on x64 in fact. I suspect JIT is messing with your timings, always run your microbenchmarks twice as JIT will occupy the vast majority of the time during your first run. And because String.Split has been NGEN'd, it doesn't need to be compiled to native code and thus will appear to be faster.
Not to mention it's also more accurate, String.Split will count 7 words here:
test : : this is a test
It also makes sense, String.Split doesn't perform any magic and I would be very surprised if the creation of an array of many strings would be faster than simply iterating over the individual characters in the string. Foreaching over a string has apparently been highly optimized as I tried unsafe pointer arithmetic and it was actually a tiny bit slower than a simple foreach. I really doubt there's any way to do this faster, other than being smart about which sections in your text need word counts.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.