stringbuilder vs list string when searching line by line c#

stringbuilder vs list string when searching line by line c# - c#

I have an WinForm app that getting output from serail/telnet terminals
because of historical decisions all output goes to a list like this
static List<string> BufferLog = new List<string>();
serialInputData += serialPort.ReadExisting();
BufferLog.Add(serialInputData);
Now I want to add another function to block thread until a sentence {one word is also possible }
what I had in mind was to do something like
if (IsWaitForCustomMessage)
{
while(IsNotTimeout)
{
List<string> waiterList = serialInputData.Split('\n').ToList();
if (waiterList.Exists(x => x.Contains("SomeSentenc")) return true ;
}
return false;
}
assuming that serialInputData is not containing one line but many lines
What I want to know is , Is there is any faster way to check those lines ?
The only other way to do it fairly simple for me is with stringBuilder , I am more the willing to try other ways
also fro your experiences should I change the BufferLog from List<string> to some other type ?

Last question first - yes, I'd use StringBuilder instead of List(string) because its a closer fit to what you are doing (building a string with incremental inputs). Just tidier rather than better performance neccessarily.
I think you are asking how to wait until the StringBuilder contains a specific sequence of chars ? Instead of breaking it into lines, is there any reason you couldn't just use IndexOf ? This would prevent the need to move the strings around in memory and will be pretty fast.

Related

C# Fastest way to determine if a string contains all elements of a list

Quick background. I have a string of words - I separate out those words into a List (I've tried HashSet it doesn't make any difference - and you lose the ordered nature of a List).
I then manipulate the original words in many dull ways - and create thousands of "new strings" - all of these strings are in a StringBuilder which has been set .ToString();
At the end of the manipulation, I want to QC those new strings - and be sure that every word that was in the original set - is still somewhere in those new strings and I have not accidentally lost a word.
That original string, can run to hundreds of individual words.
Short Example:
List<string> uniqueWords = new List<string> { "two", "three", "weather sunday" };
string final = "two and tomorrow\n\rtwo or wednesday\n\rtwo with thursday\n\rtwo without friday\n\rthree gone tomorrow\n\rthree weather saturday\n\rthree timely sunday";
The output string can run to tens of millions of characters, millions of words, 200,000+ rows of data (when split). You may notice that there are words that are actually two words separated by a space - so I cannot simply split out the individual words by splitting on the space as comparing them to the original would fail, and I need to confirm the words are exactly as they appeared originally - having weather somewhere and sunday somewhere - is not the same as having 'weather sunday' - for my purposes.
The the code I have tried so far and have benchmarked:
First attempt:
var allWords = uniqueWords.Where(substring => final.Contains(substring, StringComparison.CurrentCultureIgnoreCase)).ToList();
Second Attempt:
List<string> removeableList = new(uniqueWords);
foreach (var item in uniqueWords)
{
if (removeableList.Count == 0)
{
break;
}
if (final.Contains(item))
{
removeableList.Remove(item);
}
}
Third Attempt:
List<string> removeableList = new(uniqueWords);
for (int i = uniqueWords.Count; i >= 0; i--)
{
if (removeableList.Count == 0)
{
break;
}
if (final.Contains(uniqueWords[i]))
{
removeableList.Remove(uniqueWords[i]);
}
}
These are the results:
These results are repeatable, though I will say that the First Attempt tends to fluctuate quite a lot while the Second and Third Attempts tend to remain at about the same level - the Third Attempt does seem to do better than the Second rather consistently.
Are there any options that I am missing?
I have tried it using a Regex Matches collection into a HashSet - oh that was bad, 4 times worse than the First Attempt.
If there is a way to improve the performance on this task I would love to find it.

Your attempt #1 uses CurrentCultureIgnoreCase which will be slow. But even after removing that, you are adding to the list, rather than removing, and therefore the list might need to be resized.
You are also measuring two different things: option #1 is getting the list of words which are in final, the others get the list of words which are not.
Further options include:
Use List.RemoveAll
List<string> remainingWords = new(uniqueWords);
remainingWords.RemoveAll(final.Contains); // use delegate directly, without anonymous delegate
Use a pre-sized list and use Linq
List<string> remainingWords = new(uniqueWords.Length);
remainingWords.AddRange(uniqueWords.Where(s => !final.Contains(s)));
Each of these two options can be flipped depending on what result you are trying to achieve, as mentioned.
List<string> words = new(uniqueWords);
words.RemoveAll(s => !final.Contains(s));
List<string> words = new(uniqueWords.Length);
words.AddRange(uniqueWords.Where(final.Contains)); // use delegate directly, without anonymous delegate

#Charlieface, thanks for that - I tried those, I think you have a point about adding to a list - as that appears much slower. For me it doesn't matter whether it is adding or removing, the result is a True/False return - whether the list is empty or of the size of the original list.
Sixth Attempt:
List<string> removeableList = new(uniqueWords.Count);
removeableList.AddRange(uniqueWords.Where(s => !parsedTermsComplete!.Contains(s)));
Seventh Attempt:
List<string> removeableList = new(uniqueWords);
removeableList.RemoveAll(parsedTermsComplete!.Contains);
Results in comparison to Third Attempt (fastest generally):
The adding does appear slower - and memory is a little higher for the RemoveAll but timing is consistent - bearing in mind it fluctuates depending on what Windows decides to do at any given moment...
Here is an interesting implementation of the AhoCorasickTree method - which I saw mentioned on this site somewhere else.
My knowledge on this is extremely limited so this may not be a good implementation at all - I am not saying it is a good implementation just that it works - this comes from a nuget package, but I am unsure on SO's policy on nuget package links, so won't link for now. In testing, creating an array was faster than creating a list.
Eighth Attempt:
var wordArray = uniqueWords.ToArray();
int i = uniqueWords.Count - 1;
foreach (var item in wordArray)
{
var keyWords = new AhoCorasickTree(new[] { item });
if (keyWords.Contains(parsedTermsComplete))
{
uniqueWords.RemoveAt(i);
}
i--;
}
I noticed in testing that creating a "removableList" was actually slower than creating a removableArray (found this out implementing the above Aho run). I updated the Third Attempt to incorporate this:
var removeableArray = uniqueWords.ToArray();
for (int i = removeableArray.Length -1; i >= 0; i--)
{
if (!uniqueWords.Any())
{
break;
}
if (parsedTermsComplete!.Contains(removeableArray[i]))
{
uniqueWords.RemoveAt(i);
}
}
The Benchmarks come out like this, the Third Attempt is updated to an array, the Seventh Attempt is the AhoCorasick implementation on a list, and the Eighth Attempt is the AhoCorasick implementation on an Array.
The ToArray - does seem faster than List, which is good to know.
My only issue with the AhoCorasick is that in practice - in a WASM application - this is actually much slower, so not a good option for me - but I put it here because it does seem to be much faster in Benchmarks (may be using multiple threads where WASM is limited to 1) and doesn't appear to allocate any memory, so might be useful to someone - interesting that the Third Attempt also appears to be allocated no memory when using an Array implementation whereas on a list it was allocated.

how to convert part of the string to int/float/vector3 etc. without creating a temp string?

in C#, I have a string like this:
"1 3.14 (23, 23.2, 43,88) 8.27"
I need to convert this string to other types according to the value like int/float/vector3, now i have some code like this:
public static int ReadInt(this string s, ref string op)
{
s = s.Trim();
string ss = "";
int idx = s.IndexOf(" ");
if (idx > 0)
{
ss = s.Substring(0, idx);
op = s.Substring(idx);
}
else
{
ss = s;
op = "";
}
return Convert.ToInt32(ss);
}
this will read the first int value out, i have some similar functions to read float vector3 etc. but the problem is : in my application, i have to do this a lot because i received the string from some plugin and i need to do it every single frame, so i created a lot of strings which caused a lot GC will impact the performance, is their a way i can do similar stuff without creating temp strings?

Generation 0 objects such as those created here may well not impact performance too much, as they are relatively cheap to collect. I would change from using Convert to calling int.Parse() with the invariant culture before I started worrying about the GC overhead of the extra strings.
Also, you don't really need to create a new string to accomplish the Trim() behavior. After all, you're scanning and indexing the string anyway. Just do your initial scan for whitespace, and then for the space delimiter between ss and op, so you get just the substrings you need. Right now you're creating 50% more string instances than you really need.
All that said, no...there's not anything built into the basic .NET framework that would parse a substring without actually creating a new string instance. You would have to write your own parsing routines to accomplish that.
You should measure the actual real-world performance impact first, to make sure these substrings really are a significant issue.
I don't know what the "some plugin" is or how you have to handle the input from it, but I would not be surprised to hear that the overhead in acquiring the original input string(s) for this scenario swamps the overhead of the substrings for parsing.

fast custom string splitting

I am writing a custom string split. It will split on a dot(.) that is not preceded by an odd number of backslashes (\).
«string» -> «IEnemerable<string>»
"hello.world" -> "hello", "world"
"abc\.123" -> "abc\.123"
"aoeui\\.dhtns" -> "aoeui\\","dhtns"
I would like to know if there is a substring that will reuse the original string (for speed), or is there an existing split that can do this fast?
This is what I have but is 2—3 times slower than input.Split('.') //where input is a string. (I know it is a (slightly more complex problem, but not that much)
public IEnumerable<string> HandMadeSplit(string input)
{
var Result = new LinkedList<string>();
var word = new StringBuilder();
foreach (var ch in input)
{
if (ch == '.')
{
Result.AddLast(word.ToString());
word.Length = 0;
}
else
{
word.Append(ch);
}
}
Result.AddLast(word.ToString());
return Result;
}
It now uses List instead of LinkedList, and record beginning and end of substring and use string.substring to create the new substrings. This does a lot and is nearly as fast as string.split but I have added my adjustments. (will add code)

The loop that you show is the right approach if you need performance. (Regex wouldn't be).
Switch to an index-based for-loop. Remember the index of the start of the match. Don't append individual chars. Instead, remember the range of characters to copy out and do that with a single Substring call per item.
Also, don't use a LinkedList. It is slower than a List for almost all cases except random-access mutations.
You might also switch from List to a normal array that you resize with Array.Resize. This results in slightly tedious code (because you have inlined a part of the List class into your method) but it cuts out some small overheads.
Next, don't return an IEnumerable because that forces the caller through indirection when accessing its items. Return a List or an array.

This is the one I eventually settled on. It is not as fast a string.split, but good enough and can be modified, to do what I want.
private IEnumerable<string> HandMadeSplit2b(string input)
{
//this one is margenaly better that the second best 2, but makes the resolver (its client much faster), nealy as fast as original.
var Result = new List<string>();
var begining = 0;
var len = input.Length;
for (var index=0;index<len;index++)
{
if (input[index] == '.')
{
Result.Add(input.Substring(begining,index-begining));
begining = index+1;
}
}
Result.Add(input.Substring(begining));
return Result;
}

You shouldn't try to use string.Split for that.
If you need help to implement it, a simple way to solve this is to have loop that scans the string, keeping track of the last place where you found a qualifying dot. When you find a new qualifying dot (or reach the end of the input string), just yield return the current substring.
Edit: about returning a list or an array vs. using yield
If in your application, the most important thing is the time spent by the caller on iterating the substrings, then you should populate a list or an array and return that, as suggested in the accepted question. I would not use a resizable array while collecting the substrings because this would be slow.
On the other hand, if you care about the overall performance and about memory, and if sometimes the caller doesn't have to iterate over the entire list, you should use yield return. When you use yield return, you have the advantage that no code at all is executing until the caller has called MoveNext (directly or indirectly through a foreach). This means that you save the memory for allocating the array/list, and you save the time spent on allocating/resizing/populating the list. You will be spending time almost only on the logic of finding the substrings, and this will be done lazily, that is - only when actually needed because the caller continues to iterate the substrings.

C# better way to do this?

Hi I have this code below and am looking for a prettier/faster way to do this.
Thanks!
string value = "HelloGoodByeSeeYouLater";
string[] y = new string[]{"Hello", "You"};
foreach(string x in y)
{
value = value.Replace(x, "");
}

You could do:
y.ToList().ForEach(x => value = value.Replace(x, ""));
Although I think your variant is more readable.

Forgive me, but someone's gotta say it,
value = Regex.Replace( value, string.Join("|", y.Select(Regex.Escape)), "" );
Possibly faster, since it creates fewer strings.
EDIT: Credit to Gabe and lasseespeholt for Escape and Select.

While not any prettier, there are other ways to express the same thing.
In LINQ:
value = y.Aggregate(value, (acc, x) => acc.Replace(x, ""));
With String methods:
value = String.Join("", value.Split(y, StringSplitOptions.None));
I don't think anything is going to be faster in managed code than a simple Replace in a foreach though.

It depends on the size of the string you are searching. The foreach example is perfectly fine for small operations but creates a new instance of the string each time it operates because the string is immutable. It also requires searching the whole string over and over again in a linear fashion.
The basic solutions have all been proposed. The Linq examples provided are good if you are comfortable with that syntax; I also liked the suggestion of an extension method, although that is probably the slowest of the proposed solutions. I would avoid a Regex unless you have an extremely specific need.
So let's explore more elaborate solutions and assume you needed to handle a string that was thousands of characters in length and had many possible words to be replaced. If this doesn't apply to the OP's need, maybe it will help someone else.
Method #1 is geared towards large strings with few possible matches.
Method #2 is geared towards short strings with numerous matches.
Method #1
I have handled large-scale parsing in c# using char arrays and pointer math with intelligent seek operations that are optimized for the length and potential frequency of the term being searched for. It follows the methodology of:
Extremely cheap Peeks one character at a time
Only investigate potential matches
Modify output when match is found
For example, you might read through the whole source array and only add words to the output when they are NOT found. This would remove the need to keep redimensioning strings.
A simple example of this technique is looking for a closing HTML tag in a DOM parser. For example, I may read an opening STYLE tag and want to skip through (or buffer) thousands of characters until I find a closing STYLE tag.
This approach provides incredibly high performance, but it's also incredibly complicated if you don't need it (plus you need to be well-versed in memory manipulation/management or you will create all sorts of bugs and instability).
I should note that the .Net string libraries are already incredibly efficient but you can optimize this approach for your own specific needs and achieve better performance (and I have validated this firsthand).
Method #2
Another alternative involves storing search terms in a Dictionary containing Lists of strings. Basically, you decide how long your search prefix needs to be, and read characters from the source string into a buffer until you meet that length. Then, you search your dictionary for all terms that match that string. If a match is found, you explore further by iterating through that List, if not, you know that you can discard the buffer and continue.
Because the Dictionary matches strings based on hash, the search is non-linear and ideal for handling a large number of possible matches.
I'm using this methodology to allow instantaneous (<1ms) searching of every airfield in the US by name, state, city, FAA code, etc. There are 13K airfields in the US, and I've created a map of about 300K permutations (again, a Dictionary with prefixes of varying lengths, each corresponding to a list of matches).
For example, Phoenix, Arizona's main airfield is called Sky Harbor with the short ID of KPHX. I store:
KP
KPH
KPHX
Ph
Pho
Phoe
Ar
Ari
Ariz
Sk
Sky
Ha
Har
Harb
There is a cost in terms of memory usage, but string interning probably reduces this somewhat and the resulting speed justifies the memory usage on data sets of this size. Searching happens as the user types and is so fast that I have actually introduced an artificial delay to smooth out the experience.
Send me a message if you have the need to dig into these methodologies.

Extension method for elegance
(arguably "prettier" at the call level)
I'll implement an extension method that allows you to call your implementation directly on the original string as seen here.
value = value.Remove(y);
// or
value = value.Remove("Hello", "You");
// effectively
string value = "HelloGoodByeSeeYouLater".Remove("Hello", "You");
The extension method is callable on any string value in fact, and therefore easily reusable.
Implementation of Extension method:
I'm going to wrap your own implementation (shown in your question) in an extension method for pretty or elegant points and also employ the params keyword to provide some flexbility passing the arguments. You can substitute somebody else's faster implementation body into this method.
static class EXTENSIONS {
static public string Remove(this string thisString, params string[] arrItems) {
// Whatever implementation you like:
if (thisString == null)
return null;
var temp = thisString;
foreach(string x in arrItems)
temp = temp.Replace(x, "");
return temp;
}
}
That's the brightest idea I can come up with right now that nobody else has touched on.

C# highest string

This seems so trivial but I'm not finding an answer with Google.
I'm after a high value for a string for a semaphore at the end of a sorted list of strings.
It seems to me that char.highest.ToString() should do it--but this compares low, not high.
Obviously it's not truly possible to create a highest possible string because it would always be lower than the same thing + more data but the strings I'm sorting are all valid pathnames and thus the symbols used are constrained.
In response to the comments:
In the pre-unicode days in Delphi I would simply have used #255. I simply want a string that will compare higher than any possible pathname. This should be trivial--why isn't it??
Response #2:
It's not the sorting that requires the sentinel, it's the processing afterwards. I have multiple lists that I am sort-of merging (a simplistic merge won't do the job.) and either I duplicate code or I have dummy values that always compare high.

A string representation of the highest character will only be one character long.
Why don't you just append it as a semaphore after sorting, rather than trying to make it something that will sort afterwards?
Alternatively, you could specify your own comparator that sorts your token after any other string, and calls the default comparator otherwise.

I had the same problem when trying to put null values at the bottom of a list in a LINQ OrderBy() statement. I ended up using...
Char.ConvertFromUtf32(0x10ffff)
...which worked a treat.

Something like this?
public static String Highest(this String value)
{
Char highest = '\0';
foreach (Char c in value)
{
highest = Math.Max(c, highest);
}
return new String(new Char[] { highest });
}

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.