I have an ordering query on a List that I call many times.
list = list.OrderBy().ToList();
In this code the ToList() method uses a lot of resources and takes a very long time. How can I speed this up with another ordering method that doesn't convert back to a list? Should I use the .Sort extension for arrays?
First of all, try to sort the list once, and keep it sorted.
To speed things up you can use Parallel LINQ.
See: http://msdn.microsoft.com/en-us/magazine/cc163329.aspx
A parallel OrderBy() looks like this:
var query = data.AsParallel().Where(x => p(x)).OrderBy(x => k(x)).ToList();
You only need to call ToList() once to get your sorted list. All future actions should use sortedList.
sortedList = list.OrderBy().ToList();
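On the question of whether .Sort is an option: List<T>.Sort sorts the existing list in place, so no second list is built. The sketch below is only an illustration under assumptions: the class and method names (SortOnceDemo, SortInPlace, InsertSorted) are hypothetical, and it uses a list of int with the default comparer. Note that List<T>.Sort is not a stable sort, whereas OrderBy is, so the two are not interchangeable when equal keys must keep their relative order.

```csharp
using System;
using System.Collections.Generic;

public static class SortOnceDemo
{
    // Sorts the list in place; no new list is allocated.
    public static void SortInPlace(List<int> list)
    {
        list.Sort();
    }

    // Inserts a value into an already sorted list so that it stays sorted.
    public static void InsertSorted(List<int> sorted, int value)
    {
        int index = sorted.BinarySearch(value);
        if (index < 0)
        {
            index = ~index; // bitwise complement of a negative result is the insertion point
        }
        sorted.Insert(index, value);
    }

    public static void Main()
    {
        var list = new List<int> { 5, 1, 4 };
        SortInPlace(list);      // sort once
        InsertSorted(list, 3);  // later additions keep the order
        Console.WriteLine(string.Join(", ", list)); // 1, 3, 4, 5
    }
}
```

Keeping the list sorted this way avoids re-sorting on every call, at the cost of an O(n) Insert for each new element.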
Considering this sample code
System.Collections.ArrayList fruits = new System.Collections.ArrayList();
fruits.Add("mango");
fruits.Add("apple");
fruits.Add("lemon");
IEnumerable<string> query = fruits.Cast<string>()
.OrderBy(fruit => fruit)
.Where(fruit => fruit.StartsWith("m"))
.Select(fruit => fruit);
I have two questions:
Do I need to write the last Select clause if Where returns the same type by itself? The example is from MSDN; why do they always write it?
What is the correct order of these methods? Does the order affect something? What if I swap Select and Where, or OrderBy?
No, the Select is not necessary if you are not actually transforming the returned type.
In this case, the ordering of the method calls could have an impact on performance. Sorting all the objects before filtering is sure to take longer than filtering and then sorting a smaller data set.
The .Select is unnecessary in this case because .Cast already guarantees that you're working with IEnumerable<string>.
The ordering of .OrderBy and .Where doesn't affect the results of the query, but in general if you use .Where first you'll get better performance because there will be fewer elements to sort.
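As a small check of that claim, the sketch below runs both orderings over the same input and compares the results. The class and method names are hypothetical, and "melon" is added to the question's sample data only so that more than one element passes the filter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FilterThenSortDemo
{
    // Sorts every element first, then filters (more work for the sort).
    public static List<string> SortThenFilter(IEnumerable<string> fruits)
    {
        return fruits.OrderBy(f => f).Where(f => f.StartsWith("m")).ToList();
    }

    // Filters first, so only the matching elements are sorted.
    public static List<string> FilterThenSort(IEnumerable<string> fruits)
    {
        return fruits.Where(f => f.StartsWith("m")).OrderBy(f => f).ToList();
    }

    public static void Main()
    {
        var fruits = new[] { "mango", "apple", "lemon", "melon" };
        bool same = SortThenFilter(fruits).SequenceEqual(FilterThenSort(fruits));
        Console.WriteLine(same); // True
    }
}
```

The output is identical either way; only the number of elements that reach the sort differs.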
I initially had a method that contained a LINQ query returning int[], which then got used later in a fashion similar to:
int[] result = something.Where(s => previousarray.Contains(s.field));
This turned out to be horribly slow, until the first array was retrieved as the native IQueryable<int>. It now runs very quickly, but I'm wondering how I'd deal with the situation if I was provided an int[] from elsewhere which then had to be used as above.
Is there a way to speed up the query in such cases? Converting to a List doesn't seem to help.
In LINQ to SQL, a Contains will be converted to a SELECT ... WHERE field IN (...) and should be relatively fast. In LINQ to Objects, however, it will call ICollection<T>.Contains if the source is an ICollection<T>, which for an array or list is still a linear scan.
When a LINQ to SQL result is treated as an IEnumerable instead of an IQueryable, you lose the LINQ provider, i.e., any further operations will be done in memory and not in the database.
As for why it's much slower in memory:
Array.Contains() is an O(n) operation so
something.Where(s => previousarray.Contains(s.field));
is O(p * s) where p is the size of previousarray and s is the size of something.
HashSet<T>.Contains() on the other hand is an O(1) operation. If you first create a hashset, you will see a big improvement on the .Contains operation as it will be O(s) instead of O(p * s).
Example:
var previousSet = new HashSet<int>(previousarray);
var result = something.Where(s => previousSet.Contains(s.field));
Contains on lists, arrays, IEnumerables etc. is an O(n) operation. On a HashSet<T> it is roughly O(1), so you should try to use one.
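To see the difference on your own machine, a rough timing sketch along these lines may help. It is not a rigorous benchmark: the array sizes are arbitrary, the class and method names are hypothetical, and the timings will vary, so no expected numbers are shown.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class ContainsDemo
{
    // Linear scan of 'previous' for every element of 'source': O(p * s).
    public static int CountWithArray(int[] source, int[] previous)
    {
        return source.Count(x => previous.Contains(x));
    }

    // One pass to build the set, then O(1) lookups: O(p + s).
    public static int CountWithHashSet(int[] source, int[] previous)
    {
        var set = new HashSet<int>(previous);
        return source.Count(x => set.Contains(x));
    }

    public static void Main()
    {
        int[] previous = Enumerable.Range(0, 20000).ToArray();
        int[] source = Enumerable.Range(10000, 20000).ToArray();

        var sw = Stopwatch.StartNew();
        int viaArray = CountWithArray(source, previous);
        Console.WriteLine("array:   {0} matches in {1} ms", viaArray, sw.ElapsedMilliseconds);

        sw.Restart();
        int viaSet = CountWithHashSet(source, previous);
        Console.WriteLine("hashset: {0} matches in {1} ms", viaSet, sw.ElapsedMilliseconds);
    }
}
```

Both methods return the same count; the gap in elapsed time should widen as the arrays grow.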
I am trying to create a method using LINQ that takes X products from the DB, so I am using the .Take method for that.
The thing is, in some situations I need to take all the products, so is there a wildcard I can give to .Take, or some other method that would bring me all the products in the DB?
Also, what happens if I do a .Take(50) and there are only 10 products in the DB?
My code looks something like this:
var ratingsToPick = context.RatingAndProducts
.ToList()
.OrderByDescending(c => c.WeightedRating)
.Take(pAmmount);
You could make the Take a separate, conditional call based on a flag:
IQueryable<RatingAndProducts> ratingsToPick = context.RatingAndProducts
.OrderByDescending(c => c.WeightedRating);
if (!takeAll)
ratingsToPick = ratingsToPick.Take(pAmmount);
var results = ratingsToPick.ToList();
If you don't include the Take, then it will simply take everything.
Note that you need to type the query variable as IQueryable<MyType> (or IEnumerable<MyType> for an in-memory source), because OrderByDescending returns an IOrderedQueryable (or IOrderedEnumerable) and the result of the Take call cannot be assigned back to a variable of that type. Typing it as IQueryable<MyType> rather than IEnumerable<MyType> also keeps the Take part of the database query. (Or you can simply work around this as appropriate based on your actual code.)
Also, as @Rene147 pointed out, you should move your ToList to the end. Otherwise it retrieves all items from the database every time, and the OrderByDescending and Take then operate on a List<> of objects in memory instead of running as a database query, which I assume is unintended.
Regarding your second question, about performing a Take(50) when only 10 entries are available: that might depend on your database provider, but in my experience they are smart enough not to throw exceptions and will simply give you whatever number of items are available. (I would suggest a quick test to make sure for your specific case.)
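For the in-memory case (LINQ to Objects), the behaviour is easy to confirm: Enumerable.Take returns as many elements as exist and does not throw when fewer are available. The sketch below shows only that case, with hypothetical class and method names; it does not exercise any database provider, so a test against your own provider is still worthwhile.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TakeDemo
{
    // Returns at most 'count' elements; fewer if the source runs out first.
    public static List<int> TakeUpTo(IEnumerable<int> source, int count)
    {
        return source.Take(count).ToList();
    }

    public static void Main()
    {
        IEnumerable<int> tenItems = Enumerable.Range(1, 10);
        Console.WriteLine(TakeUpTo(tenItems, 50).Count); // 10
    }
}
```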
Your current solution always loads all products from the database, because you are calling ToList() first. After loading all products, you take the first N in memory. In order to conditionally load only the first N products, you need to build the query step by step:
int? countToTake = 50;
// typed as IQueryable<T> so that the result of Take can be assigned back
IQueryable<RatingAndProducts> ratingsToPick = context.RatingAndProducts
.OrderByDescending(c => c.WeightedRating);
// conditionally take only first results
if (countToTake.HasValue)
ratingsToPick = ratingsToPick.Take(countToTake.Value);
var result = ratingsToPick.ToList(); // execute query
I have a very long list of Ids (integers) that represents all the items that are currently in my database:
var idList = GetAllIds();
I also have another huge generic list with items to add to the database:
List<T> itemsToAdd;
Now, I would like to remove all items from the generic list whose Id is already in the idList.
Currently idList is a simple array and I subtract the lists like this:
itemsToAdd.RemoveAll(e => idList.Contains(e.Id));
I am pretty sure that it could be a lot faster, so what datatypes should I use for both collections and what is the most efficient practice to subtract them?
Thank you!
LINQ could help, provided both sequences have the same element type (for example, when comparing Ids with Ids):
itemsToAdd.Except(idList)
Your code is slow because Contains on an array or a List<T> is O(n). So your total cost is O(itemsToAdd.Count*idList.Count).
You can make idList into a HashSet<T> which has O(1) .Contains. Or just use the Linq .Except extension method which does it for you.
Note that .Except will also remove all duplicates from the left side. i.e. new int[]{1,1,2}.Except(new int[]{2}) will result in just {1} and the second 1 was removed. But I assume it's no problem in your case because IDs are typically unique.
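The duplicate-removing behaviour is easy to verify with the numbers from the note above. The sketch below, with hypothetical class and method names, contrasts Except with a Where filter over a HashSet<int>, which keeps duplicates on the left side.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExceptDemo
{
    // Set difference: also collapses duplicates in 'left'.
    public static List<int> WithExcept(IEnumerable<int> left, IEnumerable<int> right)
    {
        return left.Except(right).ToList();
    }

    // Filter only: duplicates in 'left' are preserved.
    public static List<int> WithWhere(IEnumerable<int> left, IEnumerable<int> right)
    {
        var rightSet = new HashSet<int>(right);
        return left.Where(x => !rightSet.Contains(x)).ToList();
    }

    public static void Main()
    {
        int[] left = { 1, 1, 2 };
        int[] right = { 2 };
        Console.WriteLine(string.Join(", ", WithExcept(left, right))); // 1
        Console.WriteLine(string.Join(", ", WithWhere(left, right)));  // 1, 1
    }
}
```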
Temporarily convert idList to a HashSet<int> and use the same method, i.e.:
itemsToAdd.RemoveAll(e => idListHash.Contains(e.Id));
It should be much faster.
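Put together, the approach might look like the sketch below. The Item class, the RemoveExisting helper, and the sample Ids are all assumptions made for illustration; the question does not show the real item type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical item type; the question only says items have an Id.
public sealed class Item
{
    public int Id { get; set; }
}

public static class RemoveExistingDemo
{
    // Removes every item whose Id is already known; returns how many were removed.
    public static int RemoveExisting(List<Item> itemsToAdd, IEnumerable<int> existingIds)
    {
        var idSet = new HashSet<int>(existingIds); // built once, O(1) lookups afterwards
        return itemsToAdd.RemoveAll(e => idSet.Contains(e.Id));
    }

    public static void Main()
    {
        var itemsToAdd = new List<Item>
        {
            new Item { Id = 1 }, new Item { Id = 2 }, new Item { Id = 3 }, new Item { Id = 4 }
        };
        int[] idList = { 2, 4 };

        int removed = RemoveExisting(itemsToAdd, idList);
        Console.WriteLine(removed);                                            // 2
        Console.WriteLine(string.Join(", ", itemsToAdd.Select(i => i.Id)));    // 1, 3
    }
}
```

Unlike Except, this modifies the original list in place and leaves the order of the remaining items unchanged.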
Assuming the following premises are true:
idList and itemsToAdd may not contain duplicate values
you are using the .NET Framework 4.0
you could use a HashSet<T> this way:
var itemsToAddSet = new HashSet<T>(itemsToAdd); // assumes itemsToAdd and idList share the element type T
itemsToAddSet.ExceptWith(idList);
According to the documentation the ISet<T>.ExceptWith method is pretty efficient:
This method is an O(n) operation, where n is the number of elements in the other parameter.
In your case n is the number of items in idList.
You should use two HashSet<int>s.
Note that they're unique and unordered.
With a LINQ result like this:
var result = from x in Items select x;
List<T> list = result.ToList<T>();
However, ToList<T> is really slow. Is the conversion slow because it makes the list mutable?
In most cases I can manage with just my IEnumerable or a Parallel.DistinctQuery, but now I want to bind the items to a DataGridView, so I need something other than an IEnumerable. Any suggestions on how to gain performance on ToList, or on a replacement?
On 10 million records in the IEnumerable, the .ToList<T> takes about 6 seconds.
.ToList() is slow in comparison to what?
If you are comparing
var result = from x in Items select x;
List<T> list = result.ToList<T>();
to
var result = from x in Items select x;
you should note that since the query is evaluated lazily, the first line doesn't do much at all. It doesn't retrieve any records. Deferred execution makes this comparison completely unfair.
It's because LINQ likes to be lazy and do as little work as possible. This line:
var result = from x in Items select x;
despite your choice of name, isn't actually a result, it's just a query object. It doesn't fetch any data.
List<T> list = result.ToList<T>();
Now you've actually requested the result, hence it must fetch the data from the source and make a copy of it. ToList guarantees that a copy is made.
With that in mind, it's hardly surprising that the second line is much slower than the first.
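The laziness can be made visible with a counter. In the sketch below, the Source iterator and the Fetched counter are hypothetical stand-ins for a real data source; the point is only that nothing is read until ToList enumerates the query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredDemo
{
    // Counts how many items the source has actually produced.
    public static int Fetched;

    // Stand-in for a slow data source; each item is produced on demand.
    public static IEnumerable<int> Source(int count)
    {
        for (int i = 0; i < count; i++)
        {
            Fetched++;
            yield return i;
        }
    }

    public static void Main()
    {
        var query = from x in Source(5) select x; // only builds the query
        Console.WriteLine(Fetched);               // 0

        List<int> list = query.ToList();          // enumerates the source
        Console.WriteLine(Fetched);               // 5
    }
}
```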
No, it's not creating the list that takes time, it's fetching the data that takes time.
Your first code line doesn't actually fetch the data, it only sets up an IEnumerable that is capable of fetching the data. It's when you call the ToList method that it will actually get all the data, and that is why all the execution time is in the second line.
You should also consider if having ten million lines in a grid is useful at all. No user is ever going to look through all the lines, so there isn't really any point in getting them all. Perhaps you should offer a way to filter the result before getting any data at all.
I think it's because of memory reallocations: ToList cannot always know the size of the collection beforehand, so it cannot allocate enough storage for all items up front. Instead, it has to reallocate the List<T>'s storage as it grows.
If you can estimate the size of your result set, it can be faster to preallocate enough capacity using the List<T>(int) constructor overload and then add the items to it manually.
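A minimal sketch of that preallocation idea follows. The helper name ToListWithCapacity and the figure of 1000 items are assumptions for illustration. Whether this is measurably faster depends on the source; if fetching the data dominates, as the other answers suggest, the gain may be small.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PreallocateDemo
{
    // Copies 'source' into a list whose storage is reserved up front.
    public static List<int> ToListWithCapacity(IEnumerable<int> source, int expectedCount)
    {
        var list = new List<int>(expectedCount); // capacity, not count
        foreach (int item in source)
        {
            list.Add(item); // no reallocation while Count stays within the capacity
        }
        return list;
    }

    public static void Main()
    {
        // Where(...) hides the size of the range, like a typical lazy query.
        IEnumerable<int> query = Enumerable.Range(0, 1000).Where(x => x >= 0);
        List<int> list = ToListWithCapacity(query, 1000);
        Console.WriteLine(list.Count);    // 1000
        Console.WriteLine(list.Capacity); // 1000
    }
}
```

If the estimate is too low, the list still grows as usual, so an underestimate costs nothing beyond the reallocations it would have needed anyway.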