Recently I used a predicate to describe search logic and passed it to the Find method of a few Lists.
foreach (IHiscoreBarItemView item in _view.HiscoreItems)
{
    Predicate<Hiscore> matchOfHiscoreName =
        (h) => h.Information.Name.Equals(item.HiscoreName);

    var current = player.Hiscores.Find(matchOfHiscoreName);
    item.GetLogicEngine().ForceSetHiscoreValue(current as Skill);

    var goal = player.Goals.Find(matchOfHiscoreName);
    item.GetLogicEngine().ForceSetGoalHiscoreValue(goal as Skill);
}
Are there any benefits, apart from 'less code', from using the aforementioned approach over an alternative?
I am particularly interested in performance.
Thanks
Benefit of Find over using LINQ: It's available in .NET 2.0
Benefit of LINQ over Find: consistency with other sequences; query expression syntax etc
Benefit of Find over BinarySearch: The list doesn't have to be sorted, and you only need equality comparison
Benefit of BinarySearch over Find: BinarySearch is O(log n); Find is O(n)
Benefit of Find over a foreach loop: compactness and not repeating yourself
Benefit of a foreach loop over Find: Any other custom processing you want to perform
Of these, only Find vs BinarySearch has any real performance difference. Of course, if you could change from a List<T> to a Dictionary<TKey,TValue> then finding elements would become amortized O(1) ...
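To make those trade-offs concrete, here is a minimal sketch contrasting the three lookup strategies; the Hiscore shape and names below are simplified assumptions for illustration, not the asker's actual types:

using System;
using System.Collections.Generic;

class Hiscore
{
    public string Name { get; set; }
}

class LookupSketch
{
    static void Main()
    {
        var hiscores = new List<Hiscore>
        {
            new Hiscore { Name = "Attack" },
            new Hiscore { Name = "Defence" },
            new Hiscore { Name = "Mining" }
        };

        // 1. Find: linear scan, O(n); no ordering required, only equality.
        Hiscore byFind = hiscores.Find(h => h.Name == "Mining");

        // 2. BinarySearch: O(log n), but the list must first be sorted
        //    with the same comparer you search with.
        var byNameComparer = Comparer<Hiscore>.Create(
            (a, b) => string.CompareOrdinal(a.Name, b.Name));
        hiscores.Sort(byNameComparer);
        int index = hiscores.BinarySearch(new Hiscore { Name = "Mining" }, byNameComparer);
        Hiscore byBinarySearch = index >= 0 ? hiscores[index] : null;

        // 3. Dictionary keyed by name: amortized O(1) per lookup,
        //    at the cost of building (and maintaining) the dictionary.
        var byName = new Dictionary<string, Hiscore>();
        foreach (var h in hiscores) byName[h.Name] = h;
        Hiscore byDictionary;
        byName.TryGetValue("Mining", out byDictionary);

        Console.WriteLine("{0} / {1} / {2}",
            byFind.Name, byBinarySearch.Name, byDictionary.Name);
    }
}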
Related
I was charged with speeding up a text processing/normalization section of our code, and there were multiple sections that had multiple, configurable lists of "if you see this, replace with that", and they were implemented with big stacks of regexes. That looked like a good place to start - and it was.
I implemented a simple Trie loaded with the configuration entries and then had a
Match (string raw, int idx = 0)
function that skimmed the raw input, looking through the Trie for matches.
My first draft of the match function used a for loop and an indexer, i.e.
TrieNode node = Root;
// Walk the raw input from idx onward, descending the trie one char at a time.
for (; idx < raw.Length; idx++)
{
    TrieNode next;
    if (node.TryGetValue(raw[idx], out next))
        ...
That version was several orders of magnitude faster than the pile of regexes.
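For context, a bare-bones character trie of the kind described could look like the following sketch; the TrieNode, IsTerminal and Replacement names are assumptions, not the poster's actual types:

using System.Collections.Generic;

// Bare-bones character trie: each node maps a char to a child node.
// IsTerminal marks the end of a configured "if you see this" entry and
// Replacement holds the "replace with that" text.
class TrieNode
{
    private readonly Dictionary<char, TrieNode> _children =
        new Dictionary<char, TrieNode>();

    public bool IsTerminal { get; private set; }
    public string Replacement { get; private set; }

    public void Add(string key, string replacement)
    {
        TrieNode node = this;
        foreach (char c in key)
        {
            TrieNode next;
            if (!node._children.TryGetValue(c, out next))
            {
                next = new TrieNode();
                node._children.Add(c, next);
            }
            node = next;
        }
        node.IsTerminal = true;
        node.Replacement = replacement;
    }

    // Mirrors the TryGetValue call used in the match loop above.
    public bool TryGetValue(char c, out TrieNode next)
    {
        return _children.TryGetValue(c, out next);
    }
}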
I wanted to clean up and generalize the Trie, maybe make it configurable for either chars or words as tokens, and after all the generalizing I replaced the above with
foreach (var c in idx > 0 ? raw.Skip(idx) : raw)
{
...
and was surprised to see just how much overhead the change in iteration caused. I expected there to be some overhead but the foreach method was about 100x slower (4300 ms per run of 100 articles vs 40 ms with for loop) - just that change alone.
I've seen lots of articles from various time periods, ranging from "of course Linq and enumerators suck!" to "always use foreach because the performance is close enough and foreach is cooler".
None of the Stack Overflow posts I found were very current, so I thought I'd drop this note in a bottle.
I get that the enumerator allocation is going to add a little overhead, and Skip() is never going to be as fast as jumping right ahead with an indexer, but it was a pretty stark contrast.
I did find a debate about whether String should implement IReadOnlyList or not, which seems like it could have been the best of both worlds but that doesn't exist.
Is anyone else surprised that it has that much overhead?
I'm not surprised that Skip is orders of magnitude slower, since it is O(n) (it has to call MoveNext() idx times before it can yield anything) versus O(1) for the direct indexer.
I would not generalize this to "Linq sucks - use foreach". You could implement functionally the same code as Skip in your foreach and get roughly the same results. The problem is not that you're using Linq - the problem is that you're using Skip on a collection that supports direct access.
If you want to generalize it to use either chars or words as tokens, it may be simplest to convert raw to a List<T> and support either a list of chars or a list of strings - with what you have, there should not be a significant performance difference between the two.
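To make the "direct access vs Skip" point concrete, here is a sketch of the two iteration styles; the method names and the 'x' payload are purely illustrative:

using System.Collections.Generic;
using System.Linq;

static class IterationSketch
{
    // O(1) to reach idx: the indexer jumps straight to the starting point.
    public static int ScanWithFor(IReadOnlyList<char> tokens, int idx)
    {
        int hits = 0;
        for (int i = idx; i < tokens.Count; i++)
            if (tokens[i] == 'x') hits++;
        return hits;
    }

    // O(idx) just to reach the starting point: Skip advances the enumerator
    // one element at a time before anything is yielded.
    public static int ScanWithSkip(IEnumerable<char> tokens, int idx)
    {
        int hits = 0;
        foreach (char c in tokens.Skip(idx))
            if (c == 'x') hits++;
        return hits;
    }
}

Note that a string can be passed to ScanWithSkip but not to ScanWithFor, since String does not implement IReadOnlyList<char> (the debate mentioned above); a char[] or List<char> works for both.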
I've implemented the following code to search a list of objects for a particular value:
List<customer> matchingContacts = cAllServer
    .Where(o => o.customerNum.Contains(searchTerm) ||
                o.personInv.lastname.Contains(searchTerm) ||
                o.personDel.lastname.Contains(searchTerm))
    .ToList();
Is there a quicker or cleaner way to implement this search?
Since you will have to iterate through all of the list items, it will have O(n) complexity. Performance also depends on whether you are operating on an IQueryable collection (with or without lazy loading) or on an in-memory IEnumerable collection. I'd advise putting first the property that is most likely to contain the value you are searching for; because you are using the "or" operator, a match on the first condition short-circuits the remaining checks, which speeds up the "Contains" work. You move past each item more quickly if you can decide it is a match in 10 ms rather than 25 ms. There is also a popular argument about which is faster, Contains or IndexOf. IndexOf should be a little bit faster, but I doubt you'll notice it unless you operate on lists with millions of elements. See: Is String.Contains() faster than String.IndexOf()?
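For in-memory (LINQ to Objects) data, that might look like the sketch below; the assumption that lastname is the field most likely to match is purely illustrative, and IndexOf is used only to show the alternative the linked question discusses:

List<customer> matchingContacts = cAllServer
    .Where(o => o.personInv.lastname.IndexOf(searchTerm, StringComparison.Ordinal) >= 0
             || o.personDel.lastname.IndexOf(searchTerm, StringComparison.Ordinal) >= 0
             || o.customerNum.IndexOf(searchTerm, StringComparison.Ordinal) >= 0)
    .ToList();

// The || operator short-circuits, so an item that matches on the first
// condition never evaluates the other two.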
I think that this is fine as it is, but on the other hand I'd think twice about the need to convert to a list; you are already receiving an IEnumerable type that will let you iterate through the items. Unless you need to go back and forth through the results or search by index, there's no need to convert it to a List.
This is a small optimization, though.
The one thing I'd suggest is to create a new "searchtext" column, prepopulated with (o.customerNum + "|" + o.personInv.lastname + "|" + o.personDel.lastname).ToUpper().
List<customer> matchingContacts = cAllServer
    .Where(o => o.searchtext.Contains(searchTerm))
    .ToList();
This performs one search instead of three (but on a longer string), and if you .ToUpper() searchTerm as well, you get a case-insensitive match while still using the plain (case-sensitive) Contains, which might be trivially faster.
On the whole, I wouldn't expect this to be significantly faster.
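Spelled out, that could look something like the sketch below; searchtext is a hypothetical property added to customer, and in practice you would populate it once, when the objects are loaded:

// Populate the hypothetical searchtext property once, up front.
foreach (var o in cAllServer)
{
    o.searchtext = (o.customerNum + "|"
                  + o.personInv.lastname + "|"
                  + o.personDel.lastname).ToUpper();
}

// Upper-case the term once so the single Contains is effectively
// case-insensitive.
string term = searchTerm.ToUpper();
List<customer> matchingContacts = cAllServer
    .Where(o => o.searchtext.Contains(term))
    .ToList();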
I'm not sure if this is even possible (as far as I can tell, it isn't), but I have a dictionary and I want to find the Min() of the values but return either the KeyValuePair or the key. Currently I am using OrderBy, but I feel like that is going to be super inefficient once this grows large enough.
You can use MoreLinq and its MinBy method:
var pair = dictionary.MinBy(x => x.Value);
var key = pair.Key;
(And yes, this is O(N) rather than O(N log N) - more efficient than ordering. It won't need as much memory, either.)
I don't think there are any shortcuts and I would think that iterating every item looking for the min would be faster than sorting it first. You could easily compare speeds to find out for sure.
-EDIT: After the comment-
If you use MoreLinq.MinBy, your code will be more concise, but you won't avoid iterating every element.
https://github.com/morelinq/MoreLINQ/blob/master/MoreLinq/MinBy.cs
So, to answer your question more explicitly, you are best to iterate every element looking for the lowest value, and when you have finished iterating all elements, return the key.
I offer this answer in case you don't want to bring in MoreLinq for any reason.
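For completeness, a hand-rolled version of that single pass could look like this sketch (the string and int key/value types are assumptions for illustration):

using System;
using System.Collections.Generic;

static class MinKeySketch
{
    // One O(n) pass over the dictionary, tracking the key of the smallest value.
    public static string KeyOfMinValue(Dictionary<string, int> dictionary)
    {
        if (dictionary.Count == 0)
            throw new InvalidOperationException("The dictionary is empty.");

        string bestKey = null;
        int bestValue = 0;
        bool first = true;

        foreach (KeyValuePair<string, int> pair in dictionary)
        {
            if (first || pair.Value < bestValue)
            {
                bestKey = pair.Key;
                bestValue = pair.Value;
                first = false;
            }
        }
        return bestKey;
    }
}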
I know that this probably is micro-optimization, but still I wonder if there is any difference in using
var lastObject = myList.OrderBy(item => item.Created).Last();
or
var lastObject = myList.OrderByDescending(item => item.Created).First();
I am looking for answers for Linq to objects and Linq to Entities.
Assuming that both ways of sorting take equal time (and that's a big 'if'), then the first method would have the extra cost of doing a .Last(), potentially requiring a full enumeration.
And that argument probably holds even stronger for an SQL oriented LINQ.
(my answer is about Linq to Objects, not Linq to Entities)
I don't think there's a big difference between the two instructions; this is clearly a case of micro-optimization. In both cases, the collection needs to be sorted, which usually means a complexity of O(n log n). But you can easily get the same result with a complexity of O(n) by enumerating the collection and keeping track of the min or max value. Jon Skeet provides an implementation in his MoreLinq project, in the form of a MaxBy extension method:
var lastObject = myList.MaxBy(item => item.Created);
I'm sorry this doesn't directly answer your question, but...
Why not do a better optimization and use Jon Skeet's implementations of MaxBy or MinBy?
That will be O(n) as opposed to O(n log n) in both of the alternatives you presented.
In both cases it depends somewhat on your underlying collections. If you have knowledge up front about how the collections look before the order and select you could choose one over the other. For example, if you know the list is usually in an ascending (or mostly ascending) sorted order you could prefer the first choice. Or if you know you have indexes on the SQL tables that are sorted ascending. Although the SQL optimizer can probably deal with that anyway.
In a general case they are equivalent statements. You were right when you said it's micro-optimization.
Assuming OrderBy and OrderByDescending average the same performance, taking the first element will perform better than taking the last when the number of elements is large.
Just my two cents: since OrderBy and OrderByDescending have to iterate over all the objects anyway, there should be no difference. However, if it were me, I would probably just loop through all the items in a foreach with a compare to hold the highest item, which would be an O(n) search instead of whatever the sort's order of magnitude is.
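That loop could be as simple as the sketch below; the Item class with a Created property is an assumption standing in for whatever myList holds:

using System;
using System.Collections.Generic;

class Item
{
    public DateTime Created { get; set; }
}

static class LatestSketch
{
    // Single O(n) pass: keep the item with the latest Created seen so far.
    public static Item Latest(List<Item> myList)
    {
        Item lastObject = null;
        foreach (Item item in myList)
        {
            if (lastObject == null || item.Created > lastObject.Created)
                lastObject = item;
        }
        return lastObject;
    }
}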
I have 60k items that need to be checked against a 20k lookup list. Is there a collection object (like List, Hashtable) that provides an exceptionally fast Contains() method? Or will I have to write my own? In other words, does the default Contains() method just scan each item, or does it use a better search algorithm?
foreach (Record item in LargeCollection)
{
    if (LookupCollection.Contains(item.Key))
    {
        // Do something
    }
}
Note. The lookup list is already sorted.
In the most general case, consider System.Collections.Generic.HashSet as your default "Contains" workhorse data structure, because it takes constant time to evaluate Contains.
The actual answer to "What is the fastest searchable collection" depends on your specific data size, ordered-ness, cost-of-hashing, and search frequency.
If you don't need ordering, try HashSet<Record> (new to .Net 3.5)
If you do, use a List<Record> and call BinarySearch.
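For example, building a HashSet from the lookup keys once turns each Contains check into an O(1) operation; the string key type below is an assumption, so use whatever item.Key actually is:

// Build the lookup set once: O(m). Each Contains is then O(1),
// so the whole pass is O(n + m) instead of O(n * m).
var lookupSet = new HashSet<string>(LookupCollection);

foreach (Record item in LargeCollection)
{
    if (lookupSet.Contains(item.Key))
    {
        // Do something
    }
}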
Have you considered List.BinarySearch(item)?
You said that your lookup list is already sorted, so this seems like the perfect opportunity? A hash would definitely be the fastest, but it brings its own problems and requires a lot more overhead for storage.
You should read this blog that speed tested several different types of collections and methods for each using both single and multi-threaded techniques.
According to the results, a BinarySearch on a List and a SortedList were the top performers, consistently running neck and neck when looking up something as a "value".
When using a collection that allows for "keys", the Dictionary, ConcurrentDictionary, HashSet, and Hashtable performed the best overall.
I've put a test together:
First - generate strings of 3 chars with all of the possible combinations of A-Z and 0-9
Fill each of the collections mentioned here with those strings
Finally - search and time each collection for a random string (same string for each collection).
This test simulates a lookup when there is guaranteed to be a result.
Then I changed the initial collection from all possible combinations to only 10,000 random 3-character combinations, which should give roughly a 1-in-4.6 hit rate for a random 3-char lookup (so this is a test where there isn't guaranteed to be a result), and ran the test again:
IMHO Hashtable, although fastest, isn't always the most convenient when working with objects. But a HashSet is so close behind that it's probably the one to recommend.
Just for fun (you know FUN) I ran with 1.68M rows (4 characters):
Keep both lists x and y in sorted order.
If x = y, do your action; if x < y, advance x; if y < x, advance y; repeat until either list is empty.
The run time of this intersection is linear - proportional to size(x) + size(y) at worst.
Don't run a .Contains() loop; that is proportional to x * y, which is much worse.
If it's possible to sort your items, then there is a much faster way to do this than doing key lookups into a hashtable or b-tree. Though if your items aren't sortable, you can't really put them into a b-tree anyway.
Anyway, if they are sortable, sort both lists and then it's just a matter of walking the lookup list in order.
Walk the lookup list
    While the check list item <= the lookup list item
        If the check list item = the lookup list item, do something
        Move to the next check list item
    Move to the next lookup list item
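A concrete sketch of that walk, assuming both lists are List<string> already sorted with the same ordinal ordering (the method and variable names are illustrative):

using System.Collections.Generic;

static void WalkIntersection(List<string> lookupList, List<string> checkList)
{
    int i = 0;  // index into lookupList
    int j = 0;  // index into checkList
    while (i < lookupList.Count && j < checkList.Count)
    {
        int cmp = string.CompareOrdinal(checkList[j], lookupList[i]);
        if (cmp == 0)
        {
            // Match - do something with checkList[j]
            j++;
        }
        else if (cmp < 0)
        {
            j++;   // the check item sorts earlier; advance the check list
        }
        else
        {
            i++;   // the lookup item sorts earlier; advance the lookup list
        }
    }
}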
If you're using .Net 3.5, you can make cleaner code using:
foreach (Record item in LookupCollection.Intersect(LargeCollection))
{
    // do stuff
}
I don't have .Net 3.5 here, so this is untested. It relies on an extension method. Note that LookupCollection.Intersect(LargeCollection) is probably not the same as LargeCollection.Intersect(LookupCollection) ... the latter is probably much slower.
This assumes LookupCollection is a HashSet
If you aren't worried about squeezing out every last bit of performance, the suggestion to use a HashSet or binary search is solid. Your datasets just aren't large enough for this to be a problem 99% of the time.
But if this is just one of thousands of times you are going to do this, and performance is critical (and proven to be unacceptable using HashSet/binary search), you could certainly write your own algorithm that walks the sorted lists, doing comparisons as you go. Each list would be walked at most once, and even the pathological cases wouldn't be bad (once you went this route, you'd probably find that the comparison, assuming it's a string or other non-integral value, would be the real expense, and that optimizing that would be the next step).