I've implemented the following code to search a list of objects for a particular value:
List<customer> matchingContacts = cAllServer
    .Where(o => o.customerNum.Contains(searchTerm) ||
                o.personInv.lastname.Contains(searchTerm) ||
                o.personDel.lastname.Contains(searchTerm))
    .ToList();
Is there a quicker or cleaner way to implement this search?
Since you have to iterate through all of the list items, the search is O(n). Performance also depends on whether you are operating on an IQueryable collection (with or without lazy loading) or on an already-materialized IEnumerable collection. Because you are combining the conditions with "or", I'd advise checking first the property that is most likely to contain the value you are searching for, so that the predicate can short-circuit sooner; you iterate more quickly if you can decide that a particular entity is a match in 10 ms rather than 25 ms. There is also the perennial argument over which is faster, Contains or IndexOf. IndexOf should be a little faster, but I doubt you'll notice it unless you operate on lists with millions of elements. See: Is String.Contains() faster than String.IndexOf()?
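For illustration, a rough sketch of that reordering, reusing the names from the question and assuming the lastname properties are the most likely to match; it also uses an explicit ordinal IndexOf, which is essentially what Contains does internally, so any gain will be marginal:

// Sketch only: most-likely-to-match clause first, ordinal IndexOf instead of Contains.
List<customer> matchingContacts = cAllServer
    .Where(o => o.personInv.lastname.IndexOf(searchTerm, StringComparison.Ordinal) >= 0 ||
                o.personDel.lastname.IndexOf(searchTerm, StringComparison.Ordinal) >= 0 ||
                o.customerNum.IndexOf(searchTerm, StringComparison.Ordinal) >= 0)
    .ToList();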
I think this is fine as it is, but on the other hand I'd think twice about the need to convert to a list; you are already receiving an IEnumerable that lets you iterate through the items. Unless you need to move back and forth through the results or look items up by index, there's no need to convert it to a List.
This is a small optimization, though.
The one thing I'd suggest is to create a new "searchtext" column, prepopulated with (o.customerNum + "|" + o.personInv.lastname + "|" + o.personDel.lastname).ToUpper().
List<customer> matchingContacts = cAllServer
    .Where(o => o.searchtext.Contains(searchTerm))
    .ToList();
This performs one search instead of three (but on a longer string), and if you .ToUpper() searchTerm as well, you can perform a case-sensitive comparison, which might be trivially faster.
On the whole, I wouldn't expect this to be significantly faster.
Related
I'm building a custom textbox to enable mentioning people in a social media context. This means that I detect when somebody types "#" and search a list of contacts for the string that follows the "#" sign.
The easiest way would be to use LINQ, with something along the lines of Members.Where(x => x.Username.StartsWith(str). The problem is that the amount of potential results can be extremely high (up to around 50,000), and performance is extremely important in this context.
What alternative solutions do I have? Is there anything similar to a dictionary (a hashtable-based solution) that would allow me to use Key.StartsWith without iterating over every single entry? If not, what would be the fastest and most efficient way to achieve this?
Do you have to show a dropdown of 50,000 entries? If you can limit your dropdown, you can, for example, just display the first 10:
var filteredMembers = new List<MemberClass>();
foreach (var member in Members)
{
    if (member.Username.StartsWith(str)) filteredMembers.Add(member);
    if (filteredMembers.Count >= 10) break;
}
Alternatively:
You can try storing all your members' usernames in a trie, in addition to your collection. That should give you better performance than looping through all 50,000 elements.
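For illustration, here is a minimal trie sketch; the node and method names are my own (MemberClass and Username are taken from the question's code), and storing the member list on every prefix node trades a fair amount of memory for instant prefix lookups:

// Minimal prefix-tree sketch; every node along a username's path keeps a reference
// to that member, so a prefix search is just a walk down the tree.
public class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public List<MemberClass> Members = new List<MemberClass>();
}

public class UsernameTrie
{
    private readonly TrieNode _root = new TrieNode();

    public void Add(MemberClass member)
    {
        var node = _root;
        foreach (var c in member.Username.ToLowerInvariant())
        {
            TrieNode child;
            if (!node.Children.TryGetValue(c, out child))
            {
                child = new TrieNode();
                node.Children[c] = child;
            }
            node = child;
            node.Members.Add(member); // every prefix of the username points at this member
        }
    }

    public List<MemberClass> FindByPrefix(string prefix)
    {
        var node = _root;
        foreach (var c in prefix.ToLowerInvariant())
        {
            if (!node.Children.TryGetValue(c, out node))
                return new List<MemberClass>(); // no username starts with this prefix
        }
        return node.Members;
    }
}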
Assuming your usernames are unique, you can store your member information in a dictionary and use the usernames as the key.
This is a tradeoff of memory for performance of course.
It is not really clear where the data is stored in the first place. Are all the names in memory or in a database?
In case you store them in database, you can just use the StartsWith approach in the ORM, which would translate to a LIKE query on the DB, which would just do its job. If you enable full text on the column, you could improve the performance even more.
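As a sketch of what that looks like in an ORM (assuming Entity Framework here; db and Members are illustrative names, not from the question):

// StartsWith is translated to SQL LIKE 'str%', so the filtering runs in the database.
var matches = db.Members
    .Where(m => m.Username.StartsWith(str))
    .OrderBy(m => m.Username)
    .Take(10)   // cap the result set; 10 is an arbitrary choice
    .ToList();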
Now suppose all the names are already in memory. Remember that the CPU is extremely fast, so even looping through 50,000 entries takes just a few moments.
The StartsWith method is optimized and will return false as soon as it encounters a non-matching character, so finding the entries that actually match should be pretty fast. But you can still do better.
As others suggest, you could build a trie to store all the names and be able to search for matches pretty fast, but there is a disadvantage: building the trie requires reading all the names and creating the whole data structure, which is complex. You would also be restricted to a given set of characters, and an unexpected character would have to be dealt with separately.
You can, however, group the names into "buckets". Start with the first character and create a dictionary with the character as the key and a list of names as the value. You have now effectively narrowed every following search roughly 26 times (assuming the English alphabet). And you don't have to stop there: you can do the same on another level, for the second character in each group, then the third, and so on.
With each level you narrow each group significantly, and the subsequent search becomes much faster. But there is of course the up-front cost of building the data structure, so you always have to find the right trade-off for your case: more work up-front means faster searches; less work up-front means slower searches.
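A minimal sketch of the single-level version of this idea; MemberClass, Members, Username and str come from the question, and the 10-item cap is an arbitrary choice:

// Build the buckets once, up front.
Dictionary<char, List<MemberClass>> buckets = Members
    .Where(m => !string.IsNullOrEmpty(m.Username))
    .GroupBy(m => char.ToUpperInvariant(m.Username[0]))
    .ToDictionary(g => g.Key, g => g.ToList());

// At search time, only the bucket for the first typed character is scanned.
List<MemberClass> bucket = null;
var hits = (str.Length > 0 && buckets.TryGetValue(char.ToUpperInvariant(str[0]), out bucket))
    ? bucket.Where(m => m.Username.StartsWith(str, StringComparison.OrdinalIgnoreCase))
            .Take(10)
            .ToList()
    : new List<MemberClass>();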
Finally, when the user types, with each new letter she narrows the target group. Hence, you can always maintain the set of relevant names for the current input and cut it down with each successive keystroke. This will prevent you from having to go from the beginning each time and will improve the efficiency significantly.
Use BinarySearch
This is a pretty normal case, assuming that the data are stored in-memory, and here is a pretty standard way to handle it.
Use a normal List<string>. You don't need a HashTable or a SortedList. However, an IEnumerable<string> won't work; it has to be a list.
Sort the list beforehand (using LINQ, e.g. OrderBy( s => s)), e.g. during initialization or when retrieving it. This is the key to the whole approach.
Find the index of the best match using BinarySearch. Because the list is sorted, a binary search can find the best match very quickly and without scanning the whole list like Select/Where might.
Take the first N entries after the found index. Optionally you can truncate the list if not all N entries are a decent match, e.g. if someone typed "AZ" and there are only one or two items before "BA."
Example:
public static IEnumerable<string> Find(List<string> list, string firstFewLetters, int maxHits)
{
    var startIndex = list.BinarySearch(firstFewLetters);

    //If negative, there's no exact match. Take the bitwise complement (~) to get the index of the closest match.
    if (startIndex < 0)
    {
        startIndex = ~startIndex;
    }

    //Take maxHits items, or go till end of list
    var endIndex = Math.Min(
        startIndex + maxHits - 1,
        list.Count - 1
    );

    //Enumerate matching items
    for (int i = startIndex; i <= endIndex; i++)
    {
        var s = list[i];
        if (!s.StartsWith(firstFewLetters)) break; //This line is optional
        yield return s;
    }
}
I have a dictionary with 500,000 keys and I have to compare them using Key.Contains("Description"). This is making my performance really slow. Is there any faster alternative way to perform this search?
I had a List before, but that performed even worse. I tried indexing the List, but it did not improve performance much.
Other than storing all possible substrings of all possible keys as the keys in the dictionary (which you almost certainly wouldn't have enough memory to do) there really isn't much to be done besides iterating through the entire collection and doing the check on each item. Given that you're iterating the entire collection, there's not really much benefit to using a Dictionary over a List, at least for this specific operation (perhaps other operations you perform on this data benefit from it being in a Dictionary). They're both going to be quite slow. You simply have an inherently expensive operation that you're trying to perform.
If you can alter your requirements to search for a string exactly equal to your search string, then you can use the dictionary's hash-based lookup, which is extremely fast. If you could use a StartsWith or EndsWith operation instead of a full Contains, you could sort the data and use a binary search. But with a Contains operation, none of those optimizations can be made.
If the search is performed multiple times, you may want to consider using extra collections holding just the items that match a predefined condition.
These collections would be populated at the same time your original dictionary is populated.
This could be a viable solution if you have a limited number of fixed searches.
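As a sketch of that idea (the names source, all, descriptionKeys and MyValue are illustrative, not from the question):

// While the main dictionary is being populated, also keep a side list of the keys
// that satisfy the fixed, known-in-advance condition, so the Contains scan never
// has to run at query time.
var all = new Dictionary<string, MyValue>();
var descriptionKeys = new List<string>();

foreach (var pair in source)
{
    all[pair.Key] = pair.Value;
    if (pair.Key.Contains("Description"))
        descriptionKeys.Add(pair.Key);
}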
I've read that Regex adds some overhead, but why don't you benchmark it yourself?
Something like this:
var test = "Telle Carraige Sawmill Rh-ccxxH440xxx38.5Hyv-Op-rL-2008";
var matchCollection = Regex.Matches(test, "(Carraige|Sawmill)", RegexOptions.IgnoreCase);
//matchCollection.Count should be == 2
I know that this probably is micro-optimization, but still I wonder if there is any difference in using
var lastObject = myList.OrderBy(item => item.Created).Last();
or
var lastObject = myList.OrderByDescending(item => item.Created).First();
I am looking for answers for Linq to objects and Linq to Entities.
Assuming that both ways of sorting take equal time (and that's a big 'if'), then the first method would have the extra cost of doing a .Last(), potentially requiring a full enumeration.
And that argument probably holds even stronger for an SQL oriented LINQ.
(my answer is about Linq to Objects, not Linq to Entities)
I don't think there's a big difference between the two instructions, this is clearly a case of micro-optimization. In both cases, the collection needs to be sorted, which usually means a complexity of O(n log n). But you can easily get the same result with a complexity of O(n), by enumerating the collection and keeping track of the min or max value. Jon Skeet provides an implementation in his MoreLinq project, in the form of a MaxBy extension method:
var lastObject = myList.MaxBy(item => item.Created);
I'm sorry this doesn't directly answer your question, but...
Why not do a better optimization and use Jon Skeet's implementations of MaxBy or MinBy?
That will be O(n) as opposed to O(n log n) in both of the alternatives you presented.
In both cases it depends somewhat on your underlying collections. If you have knowledge up front about how the collections look before the order and select you could choose one over the other. For example, if you know the list is usually in an ascending (or mostly ascending) sorted order you could prefer the first choice. Or if you know you have indexes on the SQL tables that are sorted ascending. Although the SQL optimizer can probably deal with that anyway.
In a general case they are equivalent statements. You were right when you said it's micro-optimization.
Assuming OrderBy and OrderByDescending average the same performance, taking the first element would perform better than taking the last when the number of elements is large.
Just my two cents: since OrderBy or OrderByDescending has to iterate over all the objects anyway, there should be no difference. However, if it were me, I would probably just loop through all the items in a foreach, holding on to the highest item seen so far, which is an O(n) search instead of whatever the sort costs.
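Something along these lines (a sketch; MyItem stands in for whatever type myList actually holds):

// Single O(n) pass instead of sorting: keep the item with the latest Created value seen so far.
MyItem lastObject = null;
foreach (var item in myList)
{
    if (lastObject == null || item.Created > lastObject.Created)
        lastObject = item;
}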
Can I do this without looping through the whole list?
List<string> responseLines = new List<string>();
the list is then filled with around 300 lines of text.
next I want to search the list and create a second list of all lines that either start with "abc" or contain "xyz".
I know I can do a for each but is there a better / faster way?
You could use LINQ. This is no different performance-wise from using foreach -- that's pretty much what it does behind the scenes -- but you might prefer the syntax:
var query = responseLines.Where(s => s.StartsWith("abc") || s.Contains("xyz"))
                         .ToList();
(If you're happy dealing with an IEnumerable<string> rather than List<string> then you can omit the final ToList call.)
var newList = (from line in responseLines
               where line.StartsWith("abc") || line.Contains("xyz")
               select line).ToList();
Try this:
List<string> responseLines = new List<string>();
List<string> myLines = responseLines.Where(line => line.StartsWith("abc", StringComparison.InvariantCultureIgnoreCase) || line.Contains("xyz")).ToList();
The StartsWith and Contains calls short-circuit: Contains is only evaluated if StartsWith is not satisfied. This still iterates the whole list, but there is no way to avoid that if you want to check every item; at least it saves you from typing out a foreach.
Use LINQ:
List<string> list = responseLines.Where(x => x.StartsWith("abc") || x.Contains("xyz")).ToList();
Unless you need all the text for some reason, it would be quicker to inspect each line at the time when you were generating the List and discard the ones that don't match without ever adding them.
This depends on how the List is loaded as well - that code is not shown. It would be effective if you were reading from a text file, since then you could use your LINQ query to operate directly on the input data, with File.ReadLines as the source instead of the final List<string>:
var query = File.ReadLines("input.txt")
    .Where(s => s.StartsWith("abc") || s.Contains("xyz"))
    .ToList();
LINQ works well as far as offering you improved syntax for this sort of thing (See LukeH's answer for a good example), but it isn't any faster than iterating over it by hand.
If you need to do this operation often, you might want to come up with some kind of indexed data structure that watches for all "abc" or "xyz" strings as they come into the list, and can thereby use a faster algorithm for serving them up when asked, rather than iterating through the whole list.
If you don't have to do it often, it's probably a "premature optimization."
Quite simply, there is no possible algorithm that can guarantee you will never have to iterate through every item in the list. However, it is possible to improve the average number of items you need to iterate through - sorting the list before you begin your search. By doing so, the only times you would have to iterate through the entire list would be when it is filled with only "abc" and "xyz."
Assuming that it's not practical for you to have a pre-sorted list by the time you need to search through it, then the only way to improve the speed of your search would be to use a different data structure than a list - for example, a binary search tree.
I have 60k items that need to be checked against a 20k lookup list. Is there a collection object (like List, Hashtable) that provides an exceptionally fast Contains() method? Or will I have to write my own? In other words, does the default Contains() method just scan each item, or does it use a better search algorithm?
foreach (Record item in LargeCollection)
{
    if (LookupCollection.Contains(item.Key))
    {
        // Do something
    }
}
Note. The lookup list is already sorted.
In the most general case, consider System.Collections.Generic.HashSet as your default "Contains" workhorse data structure, because it takes constant time to evaluate Contains.
The actual answer to "What is the fastest searchable collection" depends on your specific data size, ordered-ness, cost-of-hashing, and search frequency.
If you don't need ordering, try HashSet<Record> (new to .Net 3.5)
If you do, use a List<Record> and call BinarySearch.
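Roughly, assuming Record exposes a string Key as in the question's snippet and LookupCollection is a collection of those keys:

// Hash-based option: Contains is O(1) on average.
var lookupSet = new HashSet<string>(LookupCollection);
foreach (Record item in LargeCollection)
{
    if (lookupSet.Contains(item.Key))
    {
        // Do something
    }
}

// Sorted option: BinarySearch is O(log n) per probe. (The question says the lookup
// list is already sorted, in which case the Sort call is unnecessary.)
var lookupList = new List<string>(LookupCollection);
lookupList.Sort();
foreach (Record item in LargeCollection)
{
    if (lookupList.BinarySearch(item.Key) >= 0)
    {
        // Do something
    }
}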
Have you considered List.BinarySearch(item)?
You said that your large collection is already sorted so this seems like the perfect opportunity? A hash would definitely be the fastest, but this brings about its own problems and requires a lot more overhead for storage.
You should read this blog that speed tested several different types of collections and methods for each using both single and multi-threaded techniques.
According to the results, a BinarySearch on a List and a SortedList were the top performers, consistently running neck and neck when looking something up as a "value".
When using a collection that allows "keys", the Dictionary, ConcurrentDictionary, HashSet, and Hashtable performed the best overall.
I've put a test together:
First, generate strings of 3 characters covering all possible combinations of A-Z and 0-9.
Fill each of the collections mentioned here with those strings.
Finally, search each collection for a random string (the same string for each collection) and time it.
This test simulates a lookup when there is guaranteed to be a result.
Then I changed the initial collection from all possible combinations to only 10,000 random 3-character combinations; this should give roughly a 1-in-4.6 hit rate for a random 3-character lookup, making it a test where a result is not guaranteed. I ran the test again:
IMHO Hashtable, although fastest, isn't always the most convenient, since you have to work with objects; a HashSet is so close behind that it's probably the one to recommend.
Just for fun (you know FUN) I ran with 1.68M rows (4 characters):
Keep both lists x and y in sorted order.
If x = y, do your action; if x < y, advance x; if y < x, advance y; continue until either list is empty.
The run time of this intersection walk is proportional to size(x) + size(y).
Don't run a .Contains() loop; that is proportional to size(x) * size(y), which is much worse.
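A sketch of that walk, assuming x and y are already-sorted List<string> instances:

// Merge-style intersection of two sorted lists: each iteration advances at least
// one index, so the loop runs at most size(x) + size(y) times.
int i = 0, j = 0;
while (i < x.Count && j < y.Count)
{
    int cmp = string.CompareOrdinal(x[i], y[j]);
    if (cmp == 0)
    {
        // match: do your action here
        i++;
        j++;
    }
    else if (cmp < 0)
    {
        i++; // advance x
    }
    else
    {
        j++; // advance y
    }
}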
If it's possible to sort your items, there is a much faster way to do this than doing key lookups into a hashtable or b-tree. Though if your items aren't sortable, you can't really put them into a b-tree anyway.
Anyway, if they are sortable, sort both lists; then it's just a matter of walking the lookup list in order.
Walk lookup list
While items in check list <= lookup list item
if check list item = lookup list item do something
Move to next lookup list item
If you're using .Net 3.5, you can make cleaner code using:
foreach (Record item in LookupCollection.Intersect(LargeCollection))
{
    // do stuff
}
I don't have .Net 3.5 here, so this is untested. It relies on an extension method. Note that LookupCollection.Intersect(LargeCollection) is probably not the same as LargeCollection.Intersect(LookupCollection) ... the latter is probably much slower.
This assumes LookupCollection is a HashSet
If you aren't worried about squeezing out every last bit of performance, the suggestion to use a HashSet or a binary search is solid. Your datasets just aren't large enough for this to be a problem 99% of the time.
But if this is just one of thousands of times you are going to do this, and performance is critical (and proven to be unacceptable using HashSet/binary search), you could certainly write your own algorithm that walks the sorted lists, doing comparisons as you go. Each list would be walked at most once, and even the pathological cases wouldn't be bad (once you went this route you'd probably find that the comparison, assuming it's a string or other non-integral value, is the real expense, and optimizing that would be the next step).