I'm a complete LINQ newbie, so I don't know if my LINQ is incorrect for what I need to do or if my expectations of performance are too high.
I've got a SortedList of objects, keyed by int; SortedList as opposed to SortedDictionary because I'll be populating the collection with pre-sorted data. My task is to find either the exact key or, if there is no exact key, the one with the next higher value. If the search is too high for the list (e.g. highest key is 100, but search for 105), return null.
// The structure of this class is unimportant. Just using
// it as an illustration.
public class CX
{
    public int KEY;
    public DateTime DT;
}
static CX getItem(int i, SortedList<int, CX> list)
{
    var items =
        (from kv in list
         where kv.Key >= i
         select kv.Key);
    if (items.Any())
    {
        return list[items.Min()];
    }
    return null;
}
Given a list of 50,000 records, calling getItem 500 times takes about a second and a half. Calling it 50,000 times takes over 2 minutes. This performance seems very poor. Is my LINQ bad? Am I expecting too much? Should I be rolling my own binary search function?
First, your query is being evaluated twice (once for Any, and once for Min). Second, Min requires that it iterate over the entire list, even though the fact that it's sorted means that the first item will be the minimum. You should be able to change this:
if (items.Any())
{
    return list[items.Min()];
}
To this:
var firstKey =
    (from kv in list
     where kv.Key >= i
     select (int?)kv.Key).FirstOrDefault();
if (firstKey != null) return list[firstKey.Value];
return null;
UPDATE
Because you're selecting a value type, FirstOrDefault doesn't return a nullable object. I have altered your query to cast the selected value to an int? instead, allowing the resulting value to be checked for null. I would advocate this over using ContainsKey, as that would return true if your list contained a value for 0. For example, say you have the following values
0 2 4 6 8
If you were to pass in anything less than or equal to 8, then you would get the correct value. However, if you were to pass in 9, you would get 0 (default(int)), which is in the list but isn't a valid result.
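To make the pitfall concrete, here's a small sketch (my example, not from the original answer), assuming the keys above:

int plain = list.Keys.Where(k => k >= 9).FirstOrDefault();
// plain == 0: that's default(int), indistinguishable from the real key 0

int? guarded = list.Keys.Where(k => k >= 9).Select(k => (int?)k).FirstOrDefault();
// guarded == null: unambiguously "no such key"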
Writing a binary search on your own can be tough.
Fortunately, Microsoft already wrote a pretty robust one: Array.BinarySearch<T>. This is, in fact, the method that SortedList<TKey, TValue>.IndexOfKey uses internally. Only problem is, it takes a T[] argument, instead of any IList<T> (like SortedList<TKey, TValue>.Keys).
You know what, though? There's this great tool called Reflector that lets you look at .NET source code...
Check it out: a generic BinarySearch extension method on IList<T>, taken straight from the reflected code of Microsoft's Array.BinarySearch<T> implementation.
public static int BinarySearch<T>(this IList<T> list, int index, int length, T value, IComparer<T> comparer) {
    if (list == null)
        throw new ArgumentNullException("list");
    else if (index < 0 || length < 0)
        throw new ArgumentOutOfRangeException((index < 0) ? "index" : "length");
    else if (list.Count - index < length)
        throw new ArgumentException();

    int lower = index;
    int upper = (index + length) - 1;

    while (lower <= upper) {
        int adjustedIndex = lower + ((upper - lower) >> 1);
        int comparison = comparer.Compare(list[adjustedIndex], value);
        if (comparison == 0)
            return adjustedIndex;
        else if (comparison < 0)
            lower = adjustedIndex + 1;
        else
            upper = adjustedIndex - 1;
    }

    return ~lower;
}

public static int BinarySearch<T>(this IList<T> list, T value, IComparer<T> comparer) {
    return list.BinarySearch(0, list.Count, value, comparer);
}

public static int BinarySearch<T>(this IList<T> list, T value) where T : IComparable<T> {
    return list.BinarySearch(value, Comparer<T>.Default);
}
This will let you call list.Keys.BinarySearch and get the negative bitwise complement of the index you want in case the desired key isn't found (the below is taken basically straight from tzaman's answer):
int index = list.Keys.BinarySearch(i);
if (index < 0)
    index = ~index;
var item = index < list.Count ? list[list.Keys[index]] : null;
return item;
Using LINQ on a SortedList will not give you the benefit of the sort.
For optimal performance, you should write your own binary search.
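To illustrate, here's a minimal sketch of such a hand-rolled search against the question's SortedList<int, CX> (the method name is mine):

// Lower-bound binary search over the sorted Keys collection: returns the
// item for the first key >= i, or null if every key is smaller.
static CX GetItemOrNextHigher(int i, SortedList<int, CX> list)
{
    IList<int> keys = list.Keys;
    int lower = 0, upper = keys.Count - 1, found = -1;
    while (lower <= upper)
    {
        int mid = lower + ((upper - lower) >> 1);
        if (keys[mid] >= i)
        {
            found = mid;    // candidate; keep searching to the left
            upper = mid - 1;
        }
        else
        {
            lower = mid + 1;
        }
    }
    return found >= 0 ? list[keys[found]] : null;
}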
OK, just to give this a little more visibility - here's a more concise version of Adam Robinson's answer:
return list.FirstOrDefault(kv => kv.Key >= i).Value;
The FirstOrDefault function has an overload that accepts a predicate, which selects the first element satisfying a condition - you can use that to directly get the element you want, or null if it doesn't exist.
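One detail worth noting (my observation): when no element matches, FirstOrDefault yields default(KeyValuePair<int, CX>), whose Value is null for a reference type like CX, so this still returns null as the question requires. A quick sketch:

// Hypothetical miss: assuming no key is that large, the default
// KeyValuePair comes back and its Value is null.
var item = list.FirstOrDefault(kv => kv.Key >= int.MaxValue).Value;
// item == null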
Why not use the BinarySearch that's built into the List class?
var keys = list.Keys.ToList();
int index = keys.BinarySearch(i);
if (index < 0)
    index = ~index;
var item = index < keys.Count ? list[keys[index]] : null;
return item;
If the search target isn't in the list, BinarySearch returns the bitwise complement of the index of the next-higher item; we can use that to directly get you what you want by re-complementing the result if it's negative. If the resulting index equals Count, your search key was bigger than anything in the list.
This should be much faster than doing a LINQ where, since it's already sorted...
As comments have pointed out, the ToList call will force an evaluation of the whole list, so this is only beneficial if you do multiple searches without altering the underlying SortedList, and you keep the keys list around separately.
Using OrderedDictionary in PowerCollections you can get an enumerator that starts where the key you are looking for should be... if it's not there, you'll get the next closest node and can then navigate forwards/backwards from that in O(log N) time per navigation call.
This has the advantage of you not having to write your own search or even manage your own searches on top of a SortedList.
Related
In a checksum calculation algorithm I'm implementing, the input must be an even number of bytes; if it isn't, an extra zero byte must be packed at the end.
I do not want to modify the input data to my method by actually adding an element (and the input might be non-modifiable). Neither do I want to create a new data structure and copy the input.
I wondered if LINQ is a good option to create a lightweight IEnumerable something like:
void Calculate(IList<byte> input)
{
    IEnumerable<byte> items = ((input.Count & 1) == 0) ? input : X(input, 0x0);
    foreach (var i in items)
    {
        ...
    }
}
i.e. what would X(...) look like?
You can use this iterator (yield return) extension method to add extra items to the end of an IEnumerable<T> without needing to iterate over the elements first (which you would otherwise have to do in order to get a .Count value).
Note that you should check if input is an IReadOnlyCollection<T> or an IList<T>, because that means you can use a more optimal code path when the .Count can be known in advance.
public static IEnumerable<T> EnsureModuloItems<T>( this IEnumerable<T> source, Int32 modulo, T defaultValue = default )
{
    if( source is null ) throw new ArgumentNullException(nameof(source));
    if( modulo < 1 ) throw new ArgumentOutOfRangeException( nameof(modulo), modulo, message: "Value must be 1 or greater." );
    //
    Int32 count = 0;
    foreach( T item in source )
    {
        yield return item;
        count++;
    }

    // Pad with defaultValue until the total count is a multiple of modulo.
    Int32 remainder = count % modulo;
    if( remainder > 0 )
    {
        Int32 padding = modulo - remainder;
        for( Int32 i = 0; i < padding; i++ )
        {
            yield return defaultValue;
        }
    }
}
Used like so:
foreach( Byte b in input.EnsureModuloItems( modulo: 2, defaultValue: 0x00 ) )
{
}
You might use the Concat method for that:
IEnumerable<byte> items = input.Count() % 2 == 0 ? input : input.Concat(new[] { (byte)0x0 });
I've also changed your code a little: there is no Count property on IEnumerable<T>, so you should use the Count() method.
Since Concat() accepts an IEnumerable<T>, you have to pass it a List<T> or an array. Alternatively, you can make a simple extension method to wrap a single item as an IEnumerable<T>:
internal static class Ext
{
    public static IEnumerable<T> Yield<T>(this T item)
    {
        yield return item;
    }
}
and use it
IEnumerable<byte> items = input.Count() % 2 == 0 ? input : input.Concat(((byte)0x0).Yield());
However, according to the comments, a better option here may be the Append method:
IEnumerable<byte> items = input.Count() % 2 == 0 ? input : input.Append((byte)0x0);
I have code that needs to know that a collection is not empty and does not contain only one item.
In general, I want an extension of the form:
bool collectionHasAtLeast2Items = collection.AtLeast(2);
I can write an extension easily, enumerating over the collection and incrementing an indexer until I hit the requested size, or run out of elements, but is there something already in the LINQ framework that would do this? My thoughts (in order of what came to me) are:
bool collectionHasAtLeast2Items = collection.Take(2).Count() == 2; or
bool collectionHasAtLeast2Items = collection.Take(2).ToList().Count == 2;
These would seem to work, though the behaviour of taking more elements than the collection contains is not defined in the documentation (Enumerable.Take Method); however, it seems to do what one would expect.
It's not the most efficient solution: either enumerating once to take the elements, then enumerating again to count them (which is unnecessary), or enumerating once to take the elements, then constructing a list just to get its Count property, which isn't enumerator-y, as I don't actually want the list.
It's not pretty, as I always have to make two assertions: first taking 'x', then checking that I actually received 'x'. And it depends upon undocumented behaviour.
Or perhaps I could use:
bool collectionHasAtLeast2Items = collection.ElementAtOrDefault(1) != null;
However, that's not semantically clear. Maybe the best is to wrap that with a method name that means what I want. I'm assuming that this will be efficient; I haven't reflected on the code.
Some other thoughts are using Last(), but I explicitly don't want to enumerate through the whole collection.
Or maybe Skip(1).Any(), again not semantically completely obvious, but better than ElementAtOrDefault(1) != null, though I would think they produce the same result?
Any thoughts?
public static bool AtLeast<T>(this IEnumerable<T> source, int count)
{
    // Optimization for ICollection<T>
    var genericCollection = source as ICollection<T>;
    if (genericCollection != null)
        return genericCollection.Count >= count;

    // Optimization for ICollection
    var collection = source as ICollection;
    if (collection != null)
        return collection.Count >= count;

    // General case
    using (var en = source.GetEnumerator())
    {
        int n = 0;
        while (n < count && en.MoveNext()) n++;
        return n == count;
    }
}
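A quick usage sketch, matching the question's example:

// True when the sequence yields at least two items; enumeration stops
// as soon as two have been seen.
bool collectionHasAtLeast2Items = collection.AtLeast(2);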
You can use Count() >= 2 if your sequence implements ICollection.
Behind the scenes, the Enumerable.Count() extension method checks whether the sequence implements ICollection. If it does, the Count property is returned, so the target performance should be O(1).
Thus ((IEnumerable<T>)((ICollection)sequence)).Count() >= x should also be O(1).
You could use Count, but if performance is an issue, you will be better off with Take.
bool atLeastX = collection.Take(x).Count() == x;
Since Take (I believe) uses deferred execution, it will only go through the collection once.
abatishchev mentioned that Count is O(1) with ICollection, so you could do something like this and get the best of both worlds.
IEnumerable<int> col;
// set col
int x;
// set x

bool atLeastX;
if (col is ICollection<int>)
{
    atLeastX = col.Count() >= x;
}
else
{
    atLeastX = col.Take(x).Count() == x;
}
You could also use Skip/Any, in fact I bet it would be even faster than Take/Count.
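For completeness, a sketch of that Skip/Any form (my adaptation):

// At least x items: skip x - 1 of them, then check whether anything is left.
bool atLeastX = collection.Skip(x - 1).Any();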
I have a List<T> where T : IComparable<T>.
I want to write a List<T> GetFirstNElements(IList<T> list, int n) where T : IComparable<T> which returns the first n distinct largest elements (the list can have dupes), using expression trees.
In some performance-critical code I wrote recently, I had a very similar requirement - the candidate set was very large, and the number needed very small. To avoid sorting the entire candidate set, I use a custom extension method that simply keeps the n largest items found so far in a linked list. Then I can simply:
loop once over the candidates
if I haven't yet found "n" items, or the current item is better than the worst already selected, then add it (at the correct position) in the linked-list (inserts are cheap here)
if we now have more than "n" selected, drop the worst (deletes are cheap here)
then we are done; at the end of this, the linked-list contains the best "n" items, already sorted. No need to use expression-trees, and no "sort a huge list" overhead. Something like:
public static IEnumerable<T> TakeTopDistinct<T>(this IEnumerable<T> source, int count)
{
    if (source == null) throw new ArgumentNullException("source");
    if (count < 0) throw new ArgumentOutOfRangeException("count");
    if (count == 0) yield break;

    var comparer = Comparer<T>.Default;
    LinkedList<T> selected = new LinkedList<T>();
    foreach (var value in source)
    {
        if (selected.Count < count // need to fill
            || comparer.Compare(selected.Last.Value, value) < 0 // better candidate
           )
        {
            var tmp = selected.First;
            bool add = true;
            while (tmp != null)
            {
                var delta = comparer.Compare(tmp.Value, value);
                if (delta == 0)
                {
                    add = false; // not distinct
                    break;
                }
                else if (delta < 0)
                {
                    selected.AddBefore(tmp, value);
                    add = false;
                    if (selected.Count > count) selected.RemoveLast();
                    break;
                }
                tmp = tmp.Next;
            }
            if (add && selected.Count < count) selected.AddLast(value);
        }
    }
    foreach (var value in selected) yield return value;
}
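A usage sketch (sample data mine); note the result comes out largest-first, already sorted:

IEnumerable<int> candidates = new[] { 5, 1, 9, 9, 3, 7, 7, 2 };
var top5 = candidates.TakeTopDistinct(5).ToList(); // 9, 7, 5, 3, 2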
If I get the question right, you just want to sort the entries in the list.
Wouldn't it be possible for you to implement IComparable and use the Sort method of the List?
The code in IComparable can handle the comparison and everything you want to use to compare and sort, so you can just use the Sort mechanism at this point.
List<T> GetFirstNElements<T>(List<T> list, int n) where T : IComparable<T>
{
    // Sort descending so the largest elements come first.
    list.Sort((a, b) => b.CompareTo(a));
    List<T> returnList = new List<T>();
    for (int i = 0; i < n; i++)
    {
        returnList.Add(list[i]);
    }
    return returnList;
}
Wouldn't be the fastest code ;-)
The standard algorithm for doing so, which takes expected time O(list.Length), is described on Wikipedia as "quickfindFirstK" on this page:
http://en.wikipedia.org/wiki/Selection_algorithm#Selecting_k_smallest_or_largest_elements
This improves on Marc Gravell's answer because its expected running time is linear in the length of the input list, regardless of the value of n.
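For reference, here's a minimal quickselect sketch along those lines (all names are mine; it assumes 0 < k <= list.Count and ignores the "distinct" requirement, which would need an extra pass):

// After SelectTopK(list, k), the k largest elements occupy list[0..k-1]
// in arbitrary order; expected running time is O(n).
static void SelectTopK<T>(IList<T> list, int k) where T : IComparable<T>
{
    var rng = new Random();
    int lo = 0, hi = list.Count - 1;
    while (lo < hi)
    {
        int p = Partition(list, lo, hi, rng.Next(lo, hi + 1));
        if (p == k - 1) break;
        if (p < k - 1) lo = p + 1;
        else hi = p - 1;
    }
}

// Move everything larger than the pivot to the front; return the pivot's
// final position.
static int Partition<T>(IList<T> list, int lo, int hi, int pivotIndex) where T : IComparable<T>
{
    T pivot = list[pivotIndex];
    Swap(list, pivotIndex, hi);
    int store = lo;
    for (int i = lo; i < hi; i++)
    {
        if (list[i].CompareTo(pivot) > 0) // descending order: larger first
            Swap(list, i, store++);
    }
    Swap(list, store, hi);
    return store;
}

static void Swap<T>(IList<T> list, int i, int j)
{
    T tmp = list[i];
    list[i] = list[j];
    list[j] = tmp;
}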
I have two multisets, both IEnumerables, and I want to compare them.
string[] names1 = { "tom", "dick", "harry" };
string[] names2 = { "tom", "dick", "harry", "harry"};
string[] names3 = { "tom", "dick", "harry", "sally" };
string[] names4 = { "dick", "harry", "tom" };
Want names1 == names4 to return true (and self == self returns true obviously)
But all other combos return false.
What is the most efficient way? These can be large sets of complex objects.
I looked at doing:
var a = names1.OrderBy<MyCustomType, string>(v => v.Name);
var b = names4.OrderBy<MyCustomType, string>(v => v.Name);
return a == b;
First sort as you have already done, and then use Enumerable.SequenceEqual. You can use the first overload if your type implements IEquatable<MyCustomType> or overrides Equals; otherwise you will have to use the second form and provide your own IEqualityComparer<MyCustomType>.
So if your type does implement equality, just do:
return a.SequenceEqual(b);
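Spelled out against the question's attempt (with the asker's MyCustomType assumed):

var a = names1.OrderBy(v => v.Name);
var b = names4.OrderBy(v => v.Name);
return a.SequenceEqual(b); // true: same elements once order is normalized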
Here's another option that is faster and safer, and requires no sorting:
public static bool UnsortedSequencesEqual<T>(
    this IEnumerable<T> first,
    IEnumerable<T> second)
{
    return UnsortedSequencesEqual(first, second, null);
}

public static bool UnsortedSequencesEqual<T>(
    this IEnumerable<T> first,
    IEnumerable<T> second,
    IEqualityComparer<T> comparer)
{
    if (first == null)
        throw new ArgumentNullException("first");
    if (second == null)
        throw new ArgumentNullException("second");

    var counts = new Dictionary<T, int>(comparer);

    foreach (var i in first) {
        int c;
        if (counts.TryGetValue(i, out c))
            counts[i] = c + 1;
        else
            counts[i] = 1;
    }

    foreach (var i in second) {
        int c;
        if (!counts.TryGetValue(i, out c))
            return false;

        if (c == 1)
            counts.Remove(i);
        else
            counts[i] = c - 1;
    }

    return counts.Count == 0;
}
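A usage sketch with the question's arrays:

bool same = names1.UnsortedSequencesEqual(names4); // true: same multiset
bool diff = names1.UnsortedSequencesEqual(names2); // false: extra "harry"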
The most efficient way would depend on the datatypes. A reasonably efficient O(N) solution that's very short is the following:
var list1Groups = list1.ToLookup(i => i);
var list2Groups = list2.ToLookup(i => i);
return list1Groups.Count == list2Groups.Count
    && list1Groups.All(g => g.Count() == list2Groups[g.Key].Count());
The items are required to have a valid Equals and GetHashCode implementation.
If you want a faster solution, cdhowie's solution below is comparably fast at 10,000 elements, and pulls ahead by a factor of 5 for large collections of simple objects - probably due to better memory efficiency.
Finally, if you're really interested in performance, I'd definitely try the Sort-then-SequenceEqual approach. Although it has worse complexity, that's just a log N factor, and those can definitely be drowned out by differences in the constant for all practical data set sizes - and you might be able to sort in-place, use arrays or even incrementally sort (which can be linear). Even at 4 billion elements, the log-base-2 is just 32; that's a relevant performance difference, but the difference in constant factor could conceivably be larger. For example, if you're dealing with arrays of ints and don't mind modifying the collection order, the following is faster than either option even for 10000000 items (twice that and I get an OutOfMemory on 32-bit):
Array.Sort(list1);
Array.Sort(list2);
return list1.SequenceEqual(list2);
YMMV depending on machine, data-type, lunar cycle, and the other usual factors influencing microbenchmarks.
You could use a binary search tree to ensure that the data is sorted. That would make each insertion an O(log N) operation. Then you can run through each tree one item at a time and break as soon as you find a not-equal condition. This would also give you the added benefit of being able to first compare the size of the two trees, since duplicates would be filtered out. I'm assuming these are treated as sets, whereby {"harry", "harry"} == {"harry"}.
If you are counting duplicates, then do a quicksort or a mergesort first; that would then make your comparison operation an O(N) operation. You could of course compare the size first, as two sequences cannot be equal if the sizes are different. Since the data is sorted, the first non-equal condition you encounter would render the entire operation "not equal".
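If you do want the set semantics (duplicates ignored), here's a sketch using the BCL's SortedSet<T> rather than a hand-rolled tree:

// SortedSet discards duplicates on insertion; SetEquals then compares
// the two sets regardless of the original order.
var set1 = new SortedSet<string>(names1);
var set2 = new SortedSet<string>(names2);
bool equalAsSets = set1.SetEquals(set2); // true: the extra "harry" is filtered out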
cdhowie's answer is great, but here's a nice trick that makes it even better for types that expose a .Count property: compare that value before falling back to the IEnumerable<T> comparison. Just add this to your code in addition to his solution:
public static bool UnsortedSequencesEqual<T>(this IReadOnlyList<T> first, IReadOnlyList<T> second, IEqualityComparer<T> comparer = null)
{
    if (first.Count != second.Count)
    {
        return false;
    }

    return UnsortedSequencesEqual((IEnumerable<T>)first, (IEnumerable<T>)second, comparer);
}
Given a List<T> in C#, is there a way to extend it (within its capacity) and set the new elements to null? I'd like something that works like a memset. I'm not looking for sugar here, I want fast code. I know that in C the operation could be done in something like 1-3 asm ops per entry.
The best solution I've found is this:
list.AddRange(Enumerable.Repeat(null, count-list.Count));
however that is C# 3.0 (<3.0 is preferred) and might be generating and evaluating an enumerator.
My current code uses:
while(list.Count < lim) list.Add(null);
so that's the starting point for time cost.
The motivation for this is that I need to set the n'th element even if it is after the old Count.
The simplest way is probably by creating a temporary array:
list.AddRange(new T[size - count]);
Where size is the required new size, and count is the count of items in the list. However, for relatively large values of size - count, this can have bad performance, since it can cause the list to reallocate multiple times.(*) It also has the disadvantage of allocating an additional temporary array, which, depending on your requirements, may not be acceptable. You could mitigate both issues at the expense of more explicit code, by using the following methods:
public static class CollectionsUtil
{
    public static List<T> EnsureSize<T>(this List<T> list, int size)
    {
        return EnsureSize(list, size, default(T));
    }

    public static List<T> EnsureSize<T>(this List<T> list, int size, T value)
    {
        if (list == null) throw new ArgumentNullException("list");
        if (size < 0) throw new ArgumentOutOfRangeException("size");

        int count = list.Count;
        if (count < size)
        {
            int capacity = list.Capacity;
            if (capacity < size)
                list.Capacity = Math.Max(size, capacity * 2);

            while (count < size)
            {
                list.Add(value);
                ++count;
            }
        }

        return list;
    }
}
The only C# 3.0 here is the use of the "this" modifier to make them extension methods. Remove the modifier and it will work in C# 2.0.
Unfortunately, I never compared the performance of the two versions, so I don't know which one is better.
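A usage sketch matching the question's goal (list and lim as in the question):

// Grow the list to at least lim elements, padding with null (the default
// for reference types), with at most one reallocation.
list.EnsureSize(lim);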
Oh, and did you know you could resize an array by calling Array.Resize<T>? I didn't know that. :)
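For the record, a small sketch of that (array name mine):

var arr = new object[5];
Array.Resize(ref arr, 10); // arr.Length is now 10; the new slots are null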
Update:
(*) Using list.AddRange(array) will not cause an enumerator to be used. Looking further through Reflector showed that the array will be cast to ICollection<T>, and the Count property will be used so that allocation is done only once.
static IEnumerable<T> GetValues<T>(T value, int count) {
    for (int i = 0; i < count; ++i)
        yield return value;
}

list.AddRange(GetValues<object>(null, number_of_nulls_to_add));
This will work with 2.0+
Why do you want to do that?
The main advantage of a List is that it can grow as needed, so why do you want to add a number of null or default elements to it?
Isn't it better to use an array in this case?