Fastest way to check if a string is a substring in C#?

I need to check whether a list of items contains a string, so that the list gets filtered as the user types in a search box. In the text-changed event, I check whether the entered text is contained in one of the listbox items and filter accordingly, so
something like:
value.Contains(enteredText)
I was wondering if this is the fastest and most efficient way to filter out listbox items?
Is Contains() method the best way to search for substrings in C#?

I'd say that in all but very exceptional circumstances it's fast and efficient enough, and even in those exceptional circumstances the difference is likely to be purely academic. I'd be surprised if you came across any bottlenecks in your logic related to this; only then would it be worth looking at, and even then the chances are you'll end up looking elsewhere.

Contains is one of the cheapest steps in my code-completion filtering algorithm (step #6 of Part 6, where step #7 and the fuzzy matching described in the footnote are vastly more expensive), which has no problem keeping up with even a fast-typing user and thousands of items in the dropdown.
I highly doubt it will cause you problems.

Although this is not the fastest option globally, it is the fastest one for which you do not need to code anything. It should be sufficient for filtering drop-down items.
For longer texts, you may want to go with the KMP algorithm, which has linear time complexity. Note, however, that it would not make any difference for very short search strings.
For searches that have lots of matches (e.g. the ones you get for the first one or two characters), you may want to precompute a table that maps single letters and letter pairs to the rows of your drop-down list, giving a much faster lookup at the expense of more memory (a pretty standard tradeoff in programming in general).
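A minimal sketch of that pair-table idea (the class and method names here are made up for illustration): build, once, a map from each two-character sequence to the rows containing it, then answer short queries from the map and run Contains only on the surviving candidates.

using System.Collections.Generic;

static class PairIndex
{
    // Maps each lowercased letter pair to the indices of the rows containing it.
    public static Dictionary<string, List<int>> Build(IList<string> items)
    {
        var index = new Dictionary<string, List<int>>();
        for (int i = 0; i < items.Count; i++)
        {
            string s = items[i].ToLowerInvariant();
            for (int j = 0; j + 1 < s.Length; j++)
            {
                string pair = s.Substring(j, 2);
                List<int> rows;
                if (!index.TryGetValue(pair, out rows))
                    index[pair] = rows = new List<int>();
                if (rows.Count == 0 || rows[rows.Count - 1] != i)
                    rows.Add(i); // record each row at most once per pair
            }
        }
        return index;
    }
}

At query time, index[firstTwoChars] gives the candidate rows, and Contains only needs to run on those.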

Related

Huge Dictionary and sub string lookup

I have a dictionary with 500,000 keys and I have to compare them using Key.Contains("Description"). This is making my performance really slow. Is there an alternative way to perform a faster search?
I had a List before, but that performed even worse. I tried using an index on the List, but it did not improve performance much.
Other than storing all possible substrings of all possible keys as the keys in the dictionary (which you almost certainly wouldn't have enough memory to do) there really isn't much to be done besides iterating through the entire collection and doing the check on each item. Given that you're iterating the entire collection, there's not really much benefit to using a Dictionary over a List, at least for this specific operation (perhaps other operations you perform on this data benefit from it being in a Dictionary). They're both going to be quite slow. You simply have an inherently expensive operation that you're trying to perform.
If you can alter your requirements somehow to search for a string exactly equal to your search string then you can use the dictionary's hash based lookup, which is super fast, and if you could use a StartsWith or EndsWith operation instead of a full Contains then you could sort the data and use a binary search, but with a Contains operation none of those optimizations can be made.
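To illustrate the StartsWith point: once the keys are kept in a sorted List<string>, all keys sharing a prefix form a contiguous block, so a binary search finds the start of the block and a short scan collects the rest. A sketch under that assumption (distinct keys, sorted ordinally):

using System;
using System.Collections.Generic;

static IEnumerable<string> KeysStartingWith(List<string> sortedKeys, string prefix)
{
    // BinarySearch returns the bitwise complement of the insertion
    // point when the exact value is absent; either way we land on
    // the first key >= prefix.
    int pos = sortedKeys.BinarySearch(prefix, StringComparer.Ordinal);
    if (pos < 0) pos = ~pos;
    for (int i = pos; i < sortedKeys.Count; i++)
    {
        if (!sortedKeys[i].StartsWith(prefix, StringComparison.Ordinal))
            yield break;
        yield return sortedKeys[i];
    }
}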
If the search is performed multiple times, you may want to consider using extra collections holding just the items that match a predefined condition.
These collections would be populated at the same time your original dictionary is populated.
This could be a viable solution if you have a limited number of fixed searches.
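A sketch of that idea: maintain one side list per fixed search term and fill it as the main dictionary loads, so each Contains is paid once at load time rather than on every search ("Price" here is illustrative, not from the original question).

using System.Collections.Generic;

class PrefilteredStore
{
    public Dictionary<string, string> Main = new Dictionary<string, string>();

    // One candidate list per predefined search term.
    public Dictionary<string, List<string>> MatchesByTerm = new Dictionary<string, List<string>>
    {
        { "Description", new List<string>() },
        { "Price", new List<string>() }
    };

    public void Add(string key, string value)
    {
        Main[key] = value;
        foreach (var kv in MatchesByTerm)
            if (key.Contains(kv.Key))
                kv.Value.Add(key); // the O(n) substring check happens once, here
    }
}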
I've read that Regex adds extra overhead, but why don't you benchmark it yourself?
Something like this:
using System.Text.RegularExpressions;

var test = "Telle Carraige Sawmill Rh-ccxxH440xxx38.5Hyv-Op-rL-2008";
var matchCollection = Regex.Matches(test, "(Carraige|Sawmill)", RegexOptions.IgnoreCase);
// matchCollection.Count should be == 2

Efficiency between searching and foreach loops

I am working on a WPF application in C#. I am using the GetNextControl method to store all the child controls in a Control.ControlCollection. I want to loop through the results and fill in only the text boxes. I have thought of two ways to do this, but which would be more efficient?
Search once and store the results in an Control.ControlCollection.
Use a foreach loop to go through the collection and use multiple if/else statements to find the TextBox I am looking for and fill in the box with some text.
Or,
Search and store all the controls in a Control.ControlCollection.
Use the Find method of the collection to find a TextBox with a certain name and fill in some text in the TextBox.
I think that the first way would be slower because there are more comparisons to make, while the second method only searches.
Implement the easiest one. Do not worry about optimization until you have metrics to support the need.
If it is not fast enough/efficient enough, then get some good time measurements. Now it is time to consider alternate implementations.
Implement and time each of the alternates, picking the fastest/most efficient one.
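For what it's worth, the "easiest one" here is probably the single-pass type check. GetNextControl and Control.ControlCollection are WinForms APIs, so a WinForms sketch (with a hypothetical parent control) might look like:

using System.Windows.Forms;

foreach (Control c in parent.Controls)
{
    // Fill in only the text boxes; other control types are skipped.
    TextBox tb = c as TextBox;
    if (tb != null)
        tb.Text = "some text";
}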

Is it more or less efficient to perform a check before performing a Replace in C#?

This is an almost academic question but I'm curious as to its answer.
Suppose you have a loop that performs a routine replace on every row in a dataset. Let's say there's 10,000 such rows.
Is it more efficient to have something like this:
Row = Row.Replace('X', 'Y');
Or to check whether the row even contains the character that is to be replaced in the first place, like this:
if (Row.Contains('X')) Row = Row.Replace('X', 'Y');
Is there any difference in terms of efficiency? I realize that the difference might be very minor, but I'm interested in knowing whether one way is better than the other, regardless of how much better it may be. Also, would your answer be different if the probability of finding the character to be replaced were 10% versus 90%?
Your check, Row.Contains('X'), is an O(n) operation: it iterates over the entire string one character at a time to see whether that character exists.
Row.Replace('X', 'Y') works exactly the same way; it checks every single character, one character at a time.
So, if you have that check in place, you iterate over the string potentially twice. If you just replace, you iterate over the string once.
You need to measure first on a realistic dataset, then decide which is higher performance. If your typical dataset doesn't often have anything, then having the Contains() call may be faster (because although Replace also iterates through all chars in the string, there will be an extra string object created and garbage collected due to the immutability of strings), but if "X" is often present, the check becomes a waste and actually slows things down.
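A minimal sketch of such a measurement (LoadRepresentativeRows is hypothetical and stands in for your real 10,000-row dataset; each variant gets its own copy so the first run doesn't strip the 'X's before the second):

using System;
using System.Diagnostics;

string[] source = LoadRepresentativeRows();
string[] a = (string[])source.Clone();
string[] b = (string[])source.Clone();

var sw = Stopwatch.StartNew();
for (int i = 0; i < a.Length; i++)
    a[i] = a[i].Replace('X', 'Y');
sw.Stop();
Console.WriteLine("Replace only:   {0} ms", sw.ElapsedMilliseconds);

sw = Stopwatch.StartNew();
for (int i = 0; i < b.Length; i++)
    if (b[i].IndexOf('X') >= 0) // the char overload of Contains isn't available on older frameworks
        b[i] = b[i].Replace('X', 'Y');
sw.Stop();
Console.WriteLine("Contains first: {0} ms", sw.ElapsedMilliseconds);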
Also, this typically isn't the first place to look for and worry about performance problems. Things like chatty interfaces, network I/O, web services, databases, file I/O and GUI updates are going to hurt you orders of magnitude more than stuff like this.
If you were going to do something like this, and if Row came back from a database (as its name suggests), then getting the database to do the query might be another approach to save performance. E.g.
select MyTextColumn from MyTable where MyTextColumn like '%X%'
Then perform the replacement on all the results, because you know you only returned results where the replacement was needed.
This does introduce other concerns though - for example, in SQL Server, if the above example included an index on MyTextColumn, SQL Server won't be able to use that index because the like argument starts with a wildcard (it's not considered to be "sargable").
In summary, write for correctness, readability and maintenance first, then measure performance and make targeted improvements where they are found to be required.
The first option is faster. In order to check whether a substring is present, Contains first has to find it, and as there is no caching mechanism, why not replace directly? Otherwise you'd be searching twice; if 'X' is present, you would basically be doubling the effort.
Don't forget that strings in C# are IMMUTABLE. That means they cannot change.
For it to replace anything it has to create a new string in memory, and copy the data across, then garbage collect the old string later on.
Using Contains() first will prevent needless creation, copying, and garbage collection of string data, and will therefore perform faster.

Efficient insertion and search of strings

In an application I will have between about 3000 and 30000 strings.
After creation (read from files unordered) there will not be many strings that will be added often (but there WILL be sometimes!). Deletion of strings will also not happen often.
Comparing a string with the ones stored will occur frequently.
What kind of structure can I best use: a hashtable, a tree (red-black, splay, ...), or just an ordered list (maybe a string array)?
(Additional remark : a link to a good C# implementation would be appreciated as well)
It sounds like you simply need a hashtable. The HashSet<T> would thus seem to be the ideal choice. (You don't seem to require keys, but Dictionary<TKey, TValue> would be the right option if you did, of course.)
Here's a summary of the time complexities of the different operations on a HashSet<T> of size n. They're partially based on the fact that the type uses an array as the backing data structure.
Insertion: Typically O(1), but potentially O(n) if the array needs to be resized.
Deletion: O(1)
Exists (Contains): O(1) (given ideal hashtable buckets)
Someone correct me if any of these are wrong please. They are just my best guesses from what I know of the implementation/hashtables in general.
HashSet is very good for fast insertion and search speeds. Add, Remove and Contains are O(1).
Edit: Add assumes the array does not need to be resized. If it does, then as Noldorin has stated, it is O(n).
I used HashSet on a recent VB6-to-.NET 3.5 upgrade project (I didn't write the original) where I was iterating over a collection that had child items, and each child item could appear in more than one parent item. The application processed a list of items I wanted to send to an API that charges a lot of money per call.
I basically used the HashSet to keep track of items I'd already sent, to prevent us incurring an unnecessary charge. As the process was invoked several times (it is basically a batch job with multiple commands), I serialized the HashSet between invocations. This worked very well; I had a requirement to reuse as much of the existing code as possible, as it had been thoroughly tested, and the HashSet certainly performed very fast.
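A sketch of that persistence step, assuming the .NET 3.5-era BinaryFormatter (which is obsolete and unsafe for untrusted input in modern .NET):

using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

static HashSet<string> LoadSent(string path)
{
    if (!File.Exists(path))
        return new HashSet<string>();
    using (var fs = File.OpenRead(path))
        return (HashSet<string>)new BinaryFormatter().Deserialize(fs);
}

static void SaveSent(string path, HashSet<string> sent)
{
    using (var fs = File.Create(path))
        new BinaryFormatter().Serialize(fs, sent);
}

Each batch run loads the set, checks Contains before calling the API, adds newly sent items, and saves the set again on exit.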
If you're looking for real-time performance or optimal memory efficiency I'd recommend a radix tree or explicit suffix or prefix tree. Otherwise I'd probably use a hash.
Trees have the advantage of fixed bounds on worst-case lookup, insertion and deletion times (based on the length of the pattern you're looking up). Hash-based solutions have the advantage of being a whole lot easier to code (you get these out of the box in C#), cheaper to construct initially and, if properly configured, of having similar average-case performance. However, they do tend to use more memory and have non-deterministic lookup and insertion times (and, depending on the implementation, possibly deletion times).
The answers recommending HashSet<T> are spot on if your comparisons are just "is this string present in the set or not". You could even use different IEqualityComparer<string> implementations (probably choosing from the ones in StringComparer) for case-sensitivity etc.
Is this the only type of comparison you need, or do you need things like "where would this string appear in the set if it were actually an ordered list?" If you need that sort of check, then you'll probably want to do a binary search. (List<T> provides a BinarySearch method; I don't know why SortedList and SortedDictionary don't, as both would be able to search pretty easily. Admittedly a SortedDictionary search wouldn't be quite the same as a normal binary search, but it would still usually have similar characteristics I believe.)
As I say, if you only want "in the set or not" checking, the HashSet<T> is your friend. I just thought I'd bring up the rest in case :)
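A small illustration of the BinarySearch point, assuming the list is kept sorted with the same comparer used for searching:

using System;
using System.Collections.Generic;

var words = new List<string> { "apple", "cherry", "plum" };
int pos = words.BinarySearch("banana", StringComparer.Ordinal);
if (pos < 0)
{
    // A negative result is the bitwise complement of the index at
    // which the item would be inserted to keep the list sorted.
    Console.WriteLine("Would appear at index {0}", ~pos); // prints 1
    words.Insert(~pos, "banana"); // keeps the list sorted
}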
If you need to know "where would this string appear in the set if it were actually an ordered list" (as in Jon Skeet's answer), you could consider a trie. This solution can only be used for certain types of "string-like" data, and if the "alphabet" is large compared to the number of strings it can quickly lose its advantages. Cache locality could also be a problem.
This could be over-engineered for a set of only N = 30,000 items that is largely precomputed, however. You might even do better just allocating an array of k * N optional slots and filling it by leaving k empty spaces between each actual item (thus reducing the probability that your rare insertions will require reallocation), still leaving you with a variant of binary search and keeping your items in sorted order. If you need a precise "where would this string appear in the set", though, this wouldn't work: you would need O(n) time to examine each space before the item to check whether it was blank, or O(n) time on insert to update a "how many items are really before me" counter in each slot. It could give you very fast imprecise indexes, though, and those indexes would be stable across insertions and deletions.
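For reference, a minimal trie sketch for exact-match lookups (the class and member names are illustrative, not a library API):

using System.Collections.Generic;

sealed class TrieNode
{
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsWord;
}

sealed class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void Insert(string word)
    {
        var node = _root;
        foreach (char c in word)
        {
            TrieNode next;
            if (!node.Children.TryGetValue(c, out next))
                node.Children[c] = next = new TrieNode();
            node = next;
        }
        node.IsWord = true;
    }

    // Lookup and insertion are O(length of the word),
    // independent of how many words are stored.
    public bool Contains(string word)
    {
        var node = _root;
        foreach (char c in word)
            if (!node.Children.TryGetValue(c, out node))
                return false;
        return node.IsWord;
    }
}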

'Proper' collection to use to obtain items in O(1) time in C# .NET?

Something I do often if I'm storing a bunch of string values and I want to be able to find them in O(1) time later is:
foreach (String value in someStringCollection)
{
    someDictionary.Add(value, String.Empty);
}
This way, I can comfortably perform constant-time lookups on these string values later on, such as:
if (someDictionary.ContainsKey(someKey))
{
    // etc
}
However, I feel like I'm cheating by making the value String.Empty. Is there a more appropriate .NET Collection I should be using?
If you're using .NET 3.5, try HashSet. If you're not using .NET 3.5, try C5. Otherwise your current method is OK (bool as @leppie suggests is better, or not, as @JonSkeet suggests, dun dun dun!).
HashSet<string> stringSet = new HashSet<string>(someStringCollection);
if (stringSet.Contains(someString))
{
    ...
}
You can use HashSet<T> in .NET 3.5; otherwise I would just stick to your current method (actually I would prefer Dictionary<string, bool>, but one does not always have that luxury).
Something you might want to add is an initial size for your hash. I'm not sure whether C# is implemented differently from Java, but a hash usually has some default size, and if you add more entries than that, it extends the set. A properly sized hash is important for getting as close to O(1) as possible: the goal is to get close to one entry per bucket without making the table really huge. If you do some searching, you'll find suggested ratios for sizing the hash table when you know beforehand how many elements you will be adding, for example something like "the hash should be sized at 1.8x the number of elements to be added" (not the real ratio, just an example).
From Wikipedia:
With a good hash function, a hash table can typically contain about 70%–80% as many elements as it does table slots and still perform well. Depending on the collision resolution mechanism, performance can begin to suffer either gradually or dramatically as more elements are added. To deal with this, when the load factor exceeds some threshold, it is necessary to allocate a new, larger table, and add all the contents of the original table to this new table. In Java's HashMap class, for example, the default load factor threshold is 0.75.
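In C# terms, that sizing advice looks like the sketch below, assuming you know the element count up front. Note that Dictionary<TKey, TValue> has accepted an initial capacity since .NET 2.0, while HashSet<T> only gained a capacity constructor in .NET Framework 4.7.2 / .NET Core 2.0:

using System.Collections.Generic;

int expectedCount = 100000;

// Capacity supplied up front, so filling the collection avoids the
// repeated allocate-and-rehash steps described in the quote above.
var map = new Dictionary<string, bool>(expectedCount);
var set = new HashSet<string>(expectedCount); // newer frameworks only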
I should probably make this a question, because I see the problem so often. What makes you think that dictionaries are O(1)? Technically, the only thing likely to be anything like O(1) is access into a standard integer-indexed, fixed-bound array using an integer index value (there being no look-up in arrays implemented that way).
The presumption that anything which looks like an array reference is O(1), when the "index" is a value that must be looked up somehow behind the scenes, means it is not likely to be an O(1) scheme unless you are lucky enough to have a hash function over data with no collisions (and probably a lot of wasted cells).
I see these questions, and I even see answers that claim O(1) [not on this particular question, but I do see them around], with no justification or explanation of what is required to make sure O(1) is actually achieved.
Hmm, I guess this is a decent question. I will do that after I post this remark here.
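To make the point concrete, here is a contrived sketch (not production code) of a key type whose GetHashCode collides for everything, which degrades the dictionary to a linear scan of a single bucket:

using System;
using System.Collections.Generic;

sealed class BadKey
{
    public readonly int Value;
    public BadKey(int value) { Value = value; }
    public override int GetHashCode() { return 42; } // every key hashes alike
    public override bool Equals(object obj)
    {
        var other = obj as BadKey;
        return other != null && other.Value == Value;
    }
}

class Demo
{
    static void Main()
    {
        var dict = new Dictionary<BadKey, string>();
        for (int i = 0; i < 10000; i++)
            dict.Add(new BadKey(i), "x"); // each Add walks the whole bucket

        // This lookup compares against potentially all 10,000 entries:
        // O(n), not O(1), despite the dictionary interface.
        Console.WriteLine(dict.ContainsKey(new BadKey(9999)));
    }
}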
