Processing a collection in sets - C#

I have a C# generic List of customer Ids (customerIdsList). Let's say its count is 25.
I need to pass these Ids in sets of 10 (a value which would be configurable and read from app.config)
to another method, ProcessCustomerIds(), which processes the customer Ids one by one.
That is, the first iteration will pass 10 Ids, the next will pass the next 10, the last one will pass the remaining 5 Ids, and so on and so forth.
How do I achieve this using LINQ?
Shall I be using Math.DivRem to do this?
int result = 0;
int quotient = Math.DivRem(customerIdsList.Count, 10, out result);
Output:
quotient=2
result=5
So I will iterate over customerIdsList twice and invoke ProcessCustomerIds() in each step.
And if the result value is greater than 0, I will do customerIdsList.Skip(25 - result) to get the last 5 customer Ids from the collection.
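Roughly, this is the kind of code I have in mind (a rough sketch; BatchSize is a hypothetical app.config key, and I'm assuming ProcessCustomerIds accepts a collection of Ids):
// Assumes an appSettings entry like <add key="BatchSize" value="10"/> and a
// reference to System.Configuration.
int batchSize = int.Parse(ConfigurationManager.AppSettings["BatchSize"]);

int remainder;
int fullBatches = Math.DivRem(customerIdsList.Count, batchSize, out remainder);

for (int i = 0; i < fullBatches; i++)
{
    ProcessCustomerIds(customerIdsList.Skip(i * batchSize).Take(batchSize).ToList());
}
if (remainder > 0)
{
    ProcessCustomerIds(customerIdsList.Skip(fullBatches * batchSize).ToList());
}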
Is there any other cleaner, more efficient way to do this? Please advise.

In our project, we have an extension method "Slice" which does exactly what you ask. It looks like this:
public static IEnumerable<IEnumerable<T>> Slice<T>(this IEnumerable<T> list, int size)
{
    var slice = new List<T>();
    foreach (T item in list)
    {
        slice.Add(item);
        if (slice.Count >= size)
        {
            yield return slice;
            slice = new List<T>();
        }
    }
    if (slice.Count > 0) yield return slice;
}
You use it like this:
customerIdsList.Slice(10).ToList().ForEach(ProcessCustomerIds);
An important feature of this implementation is that it supports deferred execution (contrary to an approach using GroupBy). Granted, this doesn't matter most of the time, but sometimes it does.
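A quick way to see the difference (a sketch; because Slice streams its input, asking for just the first slice does not force a full enumeration of the source):
// Only about 10 items of the huge range are ever produced here,
// because Slice yields each chunk as soon as it is full.
var firstSlice = Enumerable.Range(0, int.MaxValue)
                           .Slice(10)
                           .First();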

You could always use this to group the collection:
var n = 10;
var groups = customerIdsList
    .Select((id, index) => new { id, index = index / n })
    .GroupBy(x => x.index);
Then just run through the groups and issue the members of the group to the server one group at a time.
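For example (a sketch assuming ProcessCustomerIds accepts a collection of ids):
foreach (var batch in groups)
{
    // each group holds n consecutive ids (the last one may hold fewer)
    ProcessCustomerIds(batch.Select(x => x.id).ToList());
}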

Yes, you can use Skip and Take methods.
For example:
List<MyObject> list = ...;
int pageSize = 10;
// round up so the last, partial page is included
int pageCount = (list.Count + pageSize - 1) / pageSize;
for (int i = 0; i < pageCount; i++)
{
    int currentItem = i * pageSize;
    var query = (from obj in list orderby obj.Id select obj).Skip(currentItem).Take(pageSize);
    // call method
}
Remember to order the list if you want to use Skip and Take.

A simple extension:
public static class Extensions
{
    public static IEnumerable<IEnumerable<T>> Chunks<T>(this List<T> source, int size)
    {
        for (int i = 0; i < source.Count; i += size)
        {
            // for the final, possibly smaller chunk the Take is unnecessary
            yield return source.Count - i <= size
                ? source.Skip(i)
                : source.Skip(i).Take(size);
        }
    }
}
And then use it like:
var chunks = customerIdsList.Chunks(10);
foreach (var c in chunks)
{
    ProcessCustomerIds(c);
}

performance issue with System.Linq when subdividing a list into multiple lists

I wrote a method to subdivide a list of items into multiple lists using System.Linq.
When I run this method on 50,000 simple integers, it takes about 59.862 seconds.
Stopwatch watchresult0 = new Stopwatch();
watchresult0.Start();
var result0 = SubDivideListLinq(Enumerable.Range(0, 50000), 100).ToList();
watchresult0.Stop();
long elapsedresult0 = watchresult0.ElapsedMilliseconds;
So I tried to speed it up and wrote it with a simple loop iterating over each item in my list; it only needs 4 milliseconds:
Stopwatch watchresult1 = new Stopwatch();
watchresult1.Start();
var result1 = SubDivideList(Enumerable.Range(0, 50000), 100).ToList();
watchresult1.Stop();
long elapsedresult1 = watchresult1.ElapsedMilliseconds;
This is my Subdivide-method using Linq:
private static IEnumerable<List<T>> SubDivideListLinq<T>(IEnumerable<T> enumerable, int count)
{
    while (enumerable.Any())
    {
        yield return enumerable.Take(count).ToList();
        enumerable = enumerable.Skip(count);
    }
}
And this is my Subdivide-method with the foreach loop over each item:
private static IEnumerable<List<T>> SubDivideList<T>(IEnumerable<T> enumerable, int count)
{
    List<T> allItems = enumerable.ToList();
    List<T> items = new List<T>(count);
    foreach (T item in allItems)
    {
        items.Add(item);
        if (items.Count != count) continue;
        yield return items;
        items = new List<T>(count);
    }
    if (items.Any())
        yield return items;
}
Do you have any idea why my own implementation is so much faster than dividing with LINQ? Or am I doing something wrong?
And: as you can see, I know how to split lists, so this is not a duplicate of the related question. I want to know about the performance difference between LINQ and my implementation, not how to split lists.
If someone comes here with the same question:
I finally did some more research and found that the multiple enumeration with System.Linq is the cause of the poor performance.
When I enumerate it into an array to avoid the multiple enumeration, the performance gets much better (14 ms / 50k items):
T[] allItems = enumerable as T[] ?? enumerable.ToArray();
while (allItems.Any())
{
    yield return allItems.Take(count);
    allItems = allItems.Skip(count).ToArray();
}
Still, I won't use the LINQ approach, since it's slower.
Instead I wrote an extension method to subdivide my lists; it takes 3 ms for 50k items:
public static class EnumerableExtensions
{
    public static IEnumerable<List<T>> Subdivide<T>(this IEnumerable<T> enumerable, int count)
    {
        List<T> items = new List<T>(count);
        int index = 0;
        foreach (T item in enumerable)
        {
            items.Add(item);
            index++;
            if (index != count) continue;
            yield return items;
            items = new List<T>(count);
            index = 0;
        }
        if (index != 0 && items.Any())
            yield return items;
    }
}
As @AndreasNiedermair already wrote, this is also available in the MoreLINQ library as Batch. (But I won't add the library just for this one method.)
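For reference, if the package were added, usage would look roughly like this (a sketch; it requires a using MoreLinq; directive):
// Hypothetical usage of MoreLINQ's Batch; equivalent in spirit to Subdivide above.
var batches = Enumerable.Range(0, 50000).Batch(100);
foreach (var batch in batches)
{
    // process each batch of up to 100 items
}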
If you are after readability and performance, you may want to use this algorithm instead. In terms of speed it is really close to your non-LINQ version, and at the same time it is much more readable.
private static IEnumerable<List<T>> SubDivideListLinq<T>(IEnumerable<T> enumerable, int count)
{
    int index = 0;
    return enumerable.GroupBy(l => index++ / count).Select(l => l.ToList());
}
And its alternative:
private static IEnumerable<List<T>> SubDivideListLinq<T>(IEnumerable<T> enumerable, int count)
{
    int index = 0;
    return from l in enumerable
           group l by index++ / count into g
           select g.ToList();
}
Another alternative:
private static IEnumerable<List<T>> SubDivideListLinq<T>(IEnumerable<T> enumerable, int count)
{
    int index = 0;
    return enumerable.GroupBy(l => index++ / count,
                              item => item,
                              (key, result) => result.ToList());
}
On my computer I get 0.006 sec for the LINQ version versus 0.002 sec for the non-LINQ version, which is completely fair and makes it acceptable to use LINQ.
As a piece of advice, don't torture yourself with micro-optimizing code. Clearly no one is going to feel a difference of a few milliseconds, so write code that you and others can easily understand later.

How to page an array using LINQ?

If I have an array like this:
string[] mobile_numbers = plst.Where(r => !string.IsNullOrEmpty(r.Mobile))
.Select(r => r.Mobile.ToString())
.ToArray();
I want to page this array and loop through those pages.
Say the array count is 400 and I want to take the first 20, then the next 20, and so on until the end of the array, processing each set of 20 items.
How do I do this with LINQ?
Use the Skip and Take methods for paging (but keep in mind that it will iterate the collection for each page you take):
int pageSize = 20;
int pageNumber = 2;
var result = mobile_numbers.Skip(pageNumber * pageSize).Take(pageSize);
If you just need to split the array into 'pages', then consider using the MoreLinq (available from NuGet) Batch method:
var pages = mobile_numbers.Batch(pageSize);
If you don't want to use the whole library, then take a look at the Batch method implementation, or use this extension method:
public static IEnumerable<IEnumerable<T>> Batch<T>(
    this IEnumerable<T> source, int size)
{
    T[] bucket = null;
    var count = 0;
    foreach (var item in source)
    {
        if (bucket == null)
            bucket = new T[size];
        bucket[count++] = item;
        if (count != size)
            continue;
        yield return bucket;
        bucket = null;
        count = 0;
    }
    if (bucket != null && count > 0)
        yield return bucket.Take(count).ToArray();
}
Usage:
int pageSize = 20;
foreach (var page in mobile_numbers.Batch(pageSize))
{
    foreach (var item in page)
    {
        // use item
    }
}
You need a batching operator.
There is one in MoreLinq that you can use.
You would use it like this (for your example):
foreach (var batch in mobile_numbers.Batch(20))
process(batch);
batch in the above loop will be an IEnumerable of at most 20 items (the last batch may be smaller than 20; all the others will be 20 in length).
You can use .Skip(n).Take(x) to skip to the current index and take the amount required.
Take will only take the number available (i.e. what's left) when the number available is less than requested.
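For example (a small sketch showing the behaviour on the last, partial page):
var numbers = Enumerable.Range(1, 45).ToArray();
var lastPage = numbers.Skip(40).Take(20);   // yields only the remaining 5 items, no exception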

Is the order of execution of Linq the reason for this catch?

I have this function to repeat a sequence:
public static List<T> Repeat<T>(this IEnumerable<T> lst, int count)
{
    if (count < 0)
        throw new ArgumentOutOfRangeException("count");
    var ret = Enumerable.Empty<T>();
    for (var i = 0; i < count; i++)
        ret = ret.Concat(lst);
    return ret.ToList();
}
Now if I do:
var d = Enumerable.Range(1, 100);
var f = d.Select(t => new Person()).Repeat(10);
int i = f.Distinct().Count();
I expect i to be 100, but it's giving me 1000! My question, strictly, is why this is happening. Shouldn't LINQ be smart enough to figure out that it's the first selected 100 persons that I need to concatenate with the variable ret? I get the feeling that the Concat is being given preference over the Select when it is executed at ret.ToList().
Edit:
If I do this I get the correct result as expected:
var f = d.Select(t => new Person()).ToList().Repeat(10);
int i = f.Distinct().Count(); //prints 100
Edit again:
I have not overridden Equals. I'm just trying to get 100 unique persons (by reference, of course). My question is: can someone explain to me why LINQ does not do the Select operation first and then the concatenation (at the time of execution, of course)?
The problem is that unless you call ToList, the d.Select(t => new Person()) is re-enumerated each time Repeat goes through the list, creating duplicate Persons. This technique is known as deferred execution.
In general, LINQ does not assume that each time it enumerates a sequence it would get the same sequence, or even a sequence of the same length. If this effect is not desirable, you can always "materialize" the sequence inside your Repeat method by calling ToList right away, like this:
public static List<T> Repeat<T>(this IEnumerable<T> lstEnum, int count)
{
    if (count < 0)
        throw new ArgumentOutOfRangeException("count");
    var lst = lstEnum.ToList(); // Enumerate only once
    var ret = Enumerable.Empty<T>();
    for (var i = 0; i < count; i++)
        ret = ret.Concat(lst);
    return ret.ToList();
}
I could break my problem down to something simpler:
var d = Enumerable.Range(1, 100);
var f = d.Select(t => new Person());
Now essentially I am doing this:
f = f.Concat(f);
Mind you, the query hasn't been executed up to this point. At the time of execution, f is still the unexecuted d.Select(t => new Person()). So the last statement, at the time of execution, can be broken down to:
f = f.Concat(f);
//which is
f = d.Select(t => new Person()).Concat(d.Select(t => new Person()));
which obviously creates 100 + 100 = 200 new instances of Person. So
f.Distinct().ToList(); //yields 200, not 100
which is the correct behaviour.
Edit: I could rewrite the extension method as simply as:
public static IEnumerable<T> Repeat<T>(this IEnumerable<T> source, int times)
{
    source = source.ToArray();
    return Enumerable.Range(0, times).SelectMany(_ => source);
}
I used dasblinkenlight's suggestion to fix the issue.
Each Person object is a separate object. All 1000 are distinct.
What is the definition of equality for the Person type? If you don't override it, that definition will be reference equality, meaning all 1000 objects are distinct.
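To see the effect of reference equality directly (a small sketch using the same parameterless Person from the question):
// With default reference equality, every new Person() is a distinct element:
var people = Enumerable.Range(1, 3).Select(_ => new Person()).ToList();
Console.WriteLine(people.Distinct().Count());                 // 3 distinct references
Console.WriteLine(people.Concat(people).Distinct().Count());  // still 3: the same references repeated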

Searching with Linq

I have a collection of objects, each with an int Frame property. Given an int, I want to find the object in the collection that has the closest Frame.
Here is what I'm doing so far:
public static void Search(int frameNumber)
{
    var differences = (from rec in _records
                       select new { FrameDiff = Math.Abs(rec.Frame - frameNumber), Record = rec })
                      .OrderBy(x => x.FrameDiff);
    var closestRecord = differences.FirstOrDefault().Record;
    //continue work...
}
This is great and everything, except there are 200,000 items in my collection and I call this method very frequently. Is there a relatively easy, more efficient way to do this?
var closestRecord = _records.MinBy(rec => Math.Abs(rec.Frame - frameNumber));
using MinBy from MoreLINQ.
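If you would rather not pull in MoreLINQ just for this, a plain Aggregate does the same single-pass scan (a sketch; it assumes _records is not empty, since Aggregate without a seed throws on an empty sequence):
var closestRecord = _records.Aggregate((best, rec) =>
    Math.Abs(rec.Frame - frameNumber) < Math.Abs(best.Frame - frameNumber) ? rec : best);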
What you might want to try is to store the frames in a data structure that's sorted by Frame. Then you can do a binary search when you need to find the closest one to a given frameNumber.
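A sketch of that idea, assuming the records are kept sorted by Frame in two parallel lists that are rebuilt whenever the collection changes:
// Each lookup is then O(log n) instead of an O(n log n) sort per call.
static List<Record> _sortedRecords;   // records sorted ascending by Frame
static List<int> _sortedFrames;       // _sortedRecords[i].Frame, kept in sync

static Record FindClosest(int frameNumber)
{
    int i = _sortedFrames.BinarySearch(frameNumber);
    if (i >= 0) return _sortedRecords[i];              // exact hit

    i = ~i;                                            // index of the first larger frame
    if (i == 0) return _sortedRecords[0];
    if (i == _sortedFrames.Count) return _sortedRecords[_sortedFrames.Count - 1];

    // pick whichever neighbour is closer
    return (frameNumber - _sortedFrames[i - 1]) <= (_sortedFrames[i] - frameNumber)
        ? _sortedRecords[i - 1]
        : _sortedRecords[i];
}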
I don't know that I would use LINQ for this, at least not with an orderby.
static Record FindClosestRecord(IEnumerable<Record> records, int number)
{
    Record closest = null;
    int leastDifference = int.MaxValue;
    foreach (Record record in records)
    {
        int difference = Math.Abs(number - record.Frame);
        if (difference == 0)
        {
            return record; // exact match, return early
        }
        else if (difference < leastDifference)
        {
            leastDifference = difference;
            closest = record;
        }
    }
    return closest;
}
You can combine your statements into one, like this:
var closestRecord = (from rec in _records
                     select new { FrameDiff = Math.Abs(rec.Frame - frameNumber), Record = rec })
                    .OrderBy(x => x.FrameDiff)
                    .FirstOrDefault().Record;
Maybe you could divide your big item list into 5-10 smaller lists that are ordered by their FrameDiff or something similar?
That way the search is faster if you know which list you need to search.

fastest way to remove an item in a list

I have a list of User objects, and I have to remove ONE item from the list with a specific UserID.
This method has to be as fast as possible. Currently I am looping through each item and checking if the ID matches the UserID; if not, I add the item to my filteredItems collection.
List<User> allItems = GetItems();
for (int x = 0; x < allItems.Count; x++)
{
    if (specialUserID == allItems[x].ID)
        continue;
    else
        filteredItems.Add(allItems[x]);
}
If it really has to be as fast as possible, use a different data structure. List isn't known for efficiency of deletion. How about a Dictionary that maps ID to User?
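A sketch of that approach (it assumes User exposes an int ID property and that the IDs are unique, since ToDictionary throws on duplicate keys):
// Build the dictionary once (O(n)); each removal by ID is then close to O(1).
Dictionary<int, User> usersById = GetItems().ToDictionary(u => u.ID);

usersById.Remove(specialUserID);   // no scan of the whole collection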
Well, if you want to create a new collection to leave the original untouched, you have to loop through all the items.
Create the new list with the right capacity from the start, that minimises allocations.
Your program logic with the continue seems a bit backwards... just use the != operator instead of the == operator:
List<User> allItems = GetItems();
List<User> filteredItems = new List<User>(allItems.Count - 1);
foreach (User u in allItems)
{
    if (u.ID != specialUserID)
    {
        filteredItems.Add(u);
    }
}
If you want to change the original collection instead of creating a new, storing the items in a Dictionary<int, User> would be the fastest option. Both locating the item and removing it are close to O(1) operations, so that would make the whole operation close to an O(1) operation instead of an O(n) operation.
Use a hashtable. Lookup time is O(1) for everything, assuming a good hash algorithm with minimal collision potential. I would recommend something that implements IDictionary.
If you must transfer from one list to another, here is the fastest approach I've found:
var filtered = new List<SomeClass>(allItems);
for (int i = 0; i < filtered.Count; i++)
    if (filtered[i].id == 9999)
        filtered.RemoveAt(i);
I tried comparing your method, the method above, and a LINQ Where statement:
var allItems = new List<SomeClass>();
for (int i = 0; i < 10000000; i++)
    allItems.Add(new SomeClass() { id = i });

Console.WriteLine("Tests Started");
var timer = new Stopwatch();

timer.Start();
var filtered = new List<SomeClass>();
foreach (var item in allItems)
    if (item.id != 9999)
        filtered.Add(item);
var y = filtered.Last();
timer.Stop();
Console.WriteLine("Transfer to filtered list: {0}", timer.Elapsed.TotalMilliseconds);

timer.Reset();
timer.Start();
filtered = new List<SomeClass>(allItems);
for (int i = 0; i < filtered.Count; i++)
    if (filtered[i].id == 9999)
        filtered.RemoveAt(i);
var s = filtered.Last();
timer.Stop();
Console.WriteLine("Removal from filtered list: {0}", timer.Elapsed.TotalMilliseconds);

timer.Reset();
timer.Start();
var linqresults = allItems.Where(x => (x.id != 9999));
var m = linqresults.Last();
timer.Stop();
Console.WriteLine("linq list: {0}", timer.Elapsed.TotalMilliseconds);
The results were as follows:
Tests Started
Transfer to filtered list: 610.5473
Removal from filtered list: 207.5675
linq list: 379.4382
using the "Add(someCollection)" and using a ".RemoveAt" was a good deal faster.
Also, subsequent .RemoveAt calls are pretty cheap.
I know it's not the fastest, but what about the generic List's Remove() method (MSDN)? Does anybody know how it performs compared to, e.g., the example in the question?
Here's a thought: how about you don't remove it per se? What I mean is something like this:
public static IEnumerable<T> LoopWithExclusion<T>(this IEnumerable<T> list, Func<T, bool> excludePredicate)
{
    foreach (var item in list)
    {
        if (excludePredicate(item))
        {
            continue;
        }
        yield return item;
    }
}
The point being: whenever you need a "filtered" list, just call this extension method, which loops through the original list and returns all of the items EXCEPT the ones you don't want.
Something like this:
List<User> users = GetUsers();

//later in the code when you need the filtered list:
foreach (var user in users.LoopWithExclusion(u => u.Id == myIdToExclude))
{
    //do what you gotta do
}
Assuming the count of the list is even, I would:
(a) get the number of processors,
(b) divide the list into equal chunks, one per processor,
(c) spawn a thread for each processor with its chunk of data, terminating the other threads via a boolean flag once the predicate finds a match.
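For what it's worth, PLINQ can express a similar idea, since it partitions the source across cores for you, although for a single lookup the threading overhead usually outweighs any gain (a sketch, not a recommendation):
// Hypothetical PLINQ version of the parallel search described above.
var match = allItems.AsParallel().FirstOrDefault(u => u.ID == specialUserID);
if (match != null)
    allItems.Remove(match);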
public static void RemoveSingle<T>(this List<T> items, Predicate<T> match)
{
    int i = -1;
    while (++i < items.Count && !match(items[i])) ;
    if (i < items.Count)
    {
        // Overwrite the match with the last item and drop the tail,
        // so nothing after it has to be shifted.
        items[i] = items[items.Count - 1];
        items.RemoveAt(items.Count - 1);
    }
}
I cannot understand why the easiest, most straightforward and obvious solution (also the fastest among the List-based ones) wasn't given by anyone.
This code removes ONE item with a matching ID.
for (int i = 0; i < items.Count; i++)
{
    if (items[i].ID == specialUserID)
    {
        items.RemoveAt(i);
        break;
    }
}
If you have a list and you want to mutate it in place to remove an item matching a condition, the following is faster than any of the alternatives posted so far:
for (int i = allItems.Count - 1; i >= 0; i--)
    if (allItems[i].id == 9999)
        allItems.RemoveAt(i);
A Dictionary may be faster for some uses, but don't discount a List. For small collections, it will likely be faster, and for large collections it may save memory, which may in turn make your application faster overall. Profiling is the only way to determine which is faster in a real application.
Here is some code that is efficient if you have hundreds or thousands of items:
List<User> allItems = GetItems();

//Choose the correct loop here
if ((allItems.Count % 5) == 0 && (allItems.Count >= 5))
{
    for (int x = 0; x < allItems.Count; x = x + 5)
    {
        if (specialUserID != allItems[x].ID)
            filteredItems.Add(allItems[x]);
        if (specialUserID != allItems[x + 1].ID)
            filteredItems.Add(allItems[x + 1]);
        if (specialUserID != allItems[x + 2].ID)
            filteredItems.Add(allItems[x + 2]);
        if (specialUserID != allItems[x + 3].ID)
            filteredItems.Add(allItems[x + 3]);
        if (specialUserID != allItems[x + 4].ID)
            filteredItems.Add(allItems[x + 4]);
    }
}
Start by testing whether the size of the list is divisible by the largest unroll factor and work down to the smallest. If you want 10 if statements in the loop, first test whether the size of the list is bigger than ten and divisible by ten, then go down from there. For example, if you have 99 items you can use 9 if statements in the loop, and the loop will iterate 11 times instead of 99 times.
"if" statements are cheap and fast.
