Splitting a collection into equal batches in parallel using C#

I am trying to split a collection into batches of equal size. Below is the code.
public static List<List<T>> SplitIntoBatches<T>(List<T> collection, int size)
{
    var chunks = new List<List<T>>();
    var count = 0;
    var temp = new List<T>();
    foreach (var element in collection)
    {
        if (count++ == size)
        {
            chunks.Add(temp);
            temp = new List<T>();
            count = 1;
        }
        temp.Add(element);
    }
    chunks.Add(temp);
    return chunks;
}
Can we do it using Parallel.ForEach() for better performance, as we have around 1 million items in the list?
Thanks!

If performance is the concern, my thoughts (in increasing order of impact):
right-sizing the lists when you create them would save a lot of work: figure out the output batch sizes before you start copying, e.g. temp = new List<T>(thisChunkSize)
working with arrays would be more effective than working with lists - new T[thisChunkSize]
especially if you use a bulk copy (Array.Copy or CopyTo, or Buffer.BlockCopy, which only works for arrays of primitive types) rather than copying individual elements one by one
once you've calculated the offsets for each of the chunks, the individual block-copies could probably be executed in parallel, but I wouldn't assume it will be faster - memory bandwidth will be the limiting factor at that point
but the ultimate fix is: don't copy the data at all, but instead just create ranges over the existing data; for example, if using arrays, ArraySegment<T> would help; if you're open to using newer .NET features, this is a perfect fit for Memory<T>/Span<T> - creating memory/span ranges over an existing array is essentially free and instant - i.e. take a T[] and return List<Memory<T>> or similar.
Even if you can't switch to ArraySegment<T> / Memory<T> etc, returning something like that could still be used - i.e. List<ListSegment<T>> where ListSegment<T> is something like:
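readonly struct ListSegment<T> { // like ArraySegment<T>, but for List<T>
    public List<T> List { get; }
    public int Offset { get; }
    public int Count { get; }
    // get-only properties need a constructor to be initialised
    public ListSegment(List<T> list, int offset, int count)
    {
        List = list; Offset = offset; Count = count;
    }
}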
and have your code work with ListSegment<T> by processing the Offset and Count appropriately.
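A minimal sketch of the "ranges instead of copies" idea, assuming the data is already in a T[] (or can be converted once with ToArray()) and that Memory<T> is available in your target framework. The names BatchRanges and SplitIntoRanges are hypothetical, not from the answer above.

```csharp
using System;
using System.Collections.Generic;

public static class BatchRanges
{
    // Hypothetical helper: returns views over 'source'; no elements are copied.
    public static List<Memory<T>> SplitIntoRanges<T>(T[] source, int size)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));

        // Number of batches, rounding up for a final partial batch.
        int batchCount = source.Length / size + (source.Length % size == 0 ? 0 : 1);
        var ranges = new List<Memory<T>>(batchCount);

        for (int offset = 0; offset < source.Length; offset += size)
        {
            int length = Math.Min(size, source.Length - offset);
            ranges.Add(new Memory<T>(source, offset, length));
        }
        return ranges;
    }
}
```

Each Memory<T> only records the array, an offset and a length, so the cost is proportional to the number of batches rather than the number of items. Writes through a range are visible in the original array, which may or may not be what you want.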

Related

How to implement lazy shuffling of Lists in C#?

I am looking for an implementation of lazy shuffling in c#.
I only care about the time it takes to process the first couple of elements. I do not care whether or not the original list gets modified (i.e. removing elements would be fine). I do not care if the processing time gets longer as the iterator reaches the end of the list (as long as it stays within reasonable bounds of course).
Context: I have a large list that I want to get a relatively small number of random samples from. In most cases I only need the very first random element, but in some rare cases I need all elements from the list.
If possible I would like to implement this as an extension method, like this (but answers without extension methods are fine too):
public static class Program
{
public static IEnumerable<T> lazy_shuffle<T>(this IEnumerable<T> input, Random r)
{
//do the magic
return input;
}
static void Main(string[] args)
{
var start = DateTime.Now;
var shuffled = Enumerable.Range(0, 1000000).lazy_shuffle(new Random(123));
var enumerate = shuffled.GetEnumerator();
foreach (var i in Enumerable.Range(0, 5))
{
enumerate.MoveNext();
Console.WriteLine(enumerate.Current);
}
Console.WriteLine($"time for shuffling 1000000 elements was {(DateTime.Now - start).TotalMilliseconds}ms");
}
}
Note:
input.OrderBy(i => r.Next()) would not be good enough, as it needs to iterate over the entire list once to generate one random number for each element of the list.
this is not a duplicate of Lazy Shuffle Algorithms because my question has less tight bounds for the algorithms but instead requires an implementation in c#
this is not a duplicate of Randomize a List<T> because that question is about regular shuffling and not lazy shuffling.
update:
A Count exists. Random access to elements exists. It is not strictly an IEnumerable, and instead just a big List or Array. I have updated the question to say "list" instead of "IEnumerable". Only the output of the lazy shuffler needs to be enumerable; the source can be an actual list.
The selection should be fair, i.e. each element needs to have the same chance to be picked first.
mutation/modification of the source-list is fine
In the end I only need to take N random elements from the list, but I do not know the N beforehand
Since the original list can be modified, here is a very simple and efficient solution, based on this answer:
public static IEnumerable<T> Shuffle<T>(this IList<T> list, Random rng)
{
    for (int i = list.Count - 1; i >= 0; i--)
    {
        int swapIndex = rng.Next(i + 1);
        yield return list[swapIndex];
        list[swapIndex] = list[i];
    }
}
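As a usage sketch for the approach above: because the method is an iterator, only as many elements are processed as the caller actually consumes, so Take(5) performs five iterations regardless of list size. The names LazyShuffleSketch, LazyShuffle and SampleFive are hypothetical; the method body repeats the answer's technique so the block is self-contained. Note that the source list is overwritten as the enumeration proceeds.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyShuffleSketch
{
    public static IEnumerable<T> LazyShuffle<T>(this IList<T> list, Random rng)
    {
        for (int i = list.Count - 1; i >= 0; i--)
        {
            // Pick uniformly among the slots 0..i that have not been returned yet.
            int swapIndex = rng.Next(i + 1);
            yield return list[swapIndex];
            // Move the last live element into the picked slot; slot i is now dead.
            list[swapIndex] = list[i];
        }
    }

    // Draws five random elements from a list of one million without shuffling the rest.
    public static List<int> SampleFive()
    {
        var source = Enumerable.Range(0, 1000000).ToList();
        return source.LazyShuffle(new Random(123)).Take(5).ToList();
    }
}
```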

C# Adding and Removing elements to an array with an existing size

I just have a question. I noticed that unlike C++, C# is a bit complicated when it comes to array. One of the features or techniques I've been looking for in the array is that: I want to add elements or remove elements from it in a more efficient and simpler way.
Say for example, I have an array called 'food'.
string[] food = {"Bacon", "Cheese", "Patty", "Crabs"};
Then I decided to add more food. Problem with C# as I can see it is this isn't possible to do unless you do use an ArrayList. How about for an array itself? I want to use the array as some sort of inventory where I add things.
Thanks a lot!
You can't do that with arrays in C# without allocating a new array, because arrays are fixed in size.
If you want to be able to add/remove elements from a container, you could use List<T>. Alternatively you could use an ArrayList, but that is not recommended, since in most cases List<T> has a performance advantage.
Internally both use an array as the default container for your data. They also take care of resizing the container according to how much data you put in the collection or take out.
In your example, you would use a list like
List<string> food = new List<string> { "Bacon", "Cheese", "Patty", "Crabs" };
food.Add("Milk"); //Will add Milk to the list
food.Remove("Bacon"); //Will remove "Bacon"
See the List<T> documentation on MSDN.
Ideally, if you are going to have a variable size array of strings, a List would be better. All you would have to do is then call list.Add(""), list.Remove(""), and other equivalent methods.
But if you would like to keep using string arrays, you could create either a function or class that takes an array, creates a new array of either a larger or smaller size, repopulate that array with the values you had from the original array, and return the new array.
public string[] AddFood(string[] input, string item)
{
    // Allocate an array one slot larger and copy the existing items across.
    string[] result = new string[input.Length + 1];
    for (int i = 0; i < input.Length; i++)
    {
        result[i] = input[i];
    }
    result[result.Length - 1] = item;
    return result;
}
public string[] RemoveFood(string[] input, int index)
{
    // Allocate an array one slot smaller and skip the item at 'index'.
    string[] result = new string[input.Length - 1];
    for (int i = 0; i < result.Length; i++)
    {
        // Items before 'index' keep their position; later items shift down by one.
        result[i] = i < index ? input[i] : input[i + 1];
    }
    return result;
}
Again, I would highly recommend using a List instead. Note that Add appends items to the end of the list; if you need an item at a specific position, use Insert(index, item).
List<string> myFoods = new List<String>(food);
myFoods.Add("Apple");
myFoods.Remove("Bacon");
myFoods.AddRange(new string[] { "Peach", "Pineapple" });
myFoods.RemoveAt(2);
Console.WriteLine(myFoods[0]);
There is also ArrayList if you want a list more like an array, but it is older code and unfavoured.
ArrayList myFoods = new ArrayList(food);
myFoods.Add("Apple");
myFoods.Remove("Bacon");
myFoods.AddRange(new string[] { "Peach", "Pineapple" });
myFoods.RemoveAt(2);
Console.WriteLine(myFoods[0]);
I hope this helps.
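To actually answer the question, you just need to resize the array:
Array.Resize(ref array, newSize);
where newSize is the new length of the array. Note that Array.Resize allocates a new array of that length and copies the existing elements into it; it does not grow the original in place.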
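For illustration, a small sketch of appending one item with Array.Resize. The class and method names (ArrayResizeSketch, Append) are made up for this example. Because the array parameter is passed by value here, the caller's original array is left untouched and the grown array is returned.

```csharp
using System;

public static class ArrayResizeSketch
{
    public static string[] Append(string[] array, string item)
    {
        // Array.Resize creates a new array of the requested length, copies the
        // old elements across, and points the local 'array' variable at it.
        Array.Resize(ref array, array.Length + 1);
        array[array.Length - 1] = item;
        return array;
    }
}
```

A call such as food = ArrayResizeSketch.Append(food, "Milk") still copies every existing element on each call, which is why List<T> is usually the better fit for an inventory that keeps growing.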
To really add elements to an existing array without resizing you can't. Or, can you? Yes, but with some trickery, which at some point you might say is not worth it.
Consider allocating an array of the size you anticipate it could be. You obviously have to estimate well to avoid tons of unused space. Empty slots in the array would be marked by a sentinel value; for a string the obvious candidate is null. You'd know the "true" size of the array by keeping track of the first index of the sentinel. This suggests that an ArrayWrapper class would encapsulate the array and "true size".
That wrapper could add Add() and AddRange() that replace the sentinel values with real ones without allocating.
All that said, the drawback at some point will be that you have to allocate a new array. Doing this manually using the wrapper is pointless unless you have very specific requirements that allow you to reduce allocations.
So, for the most common cases, stick to a List<>, which does that for you. With the list you can construct it by calling the constructor that takes an initial capacity parameter. Adds will use the underlying array without reallocation until it hits the limit.
In a way that List<> is your wrapper that uses an allocation model the original authors decided would minimize allocations in most cases. That is likely to perform better than anything you write unless you can really leverage your domain.
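A short sketch of the initial-capacity point, under the assumption that you can estimate the number of items up front. CapacitySketch and FillsWithoutReallocation are hypothetical names used only for this example.

```csharp
using System;
using System.Collections.Generic;

public static class CapacitySketch
{
    // Returns true if adding 'expected' items never forced the list to grow its backing array.
    public static bool FillsWithoutReallocation(int expected)
    {
        // The constructor argument sets the size of the internal array up front.
        var inventory = new List<string>(expected);
        int initialCapacity = inventory.Capacity;

        for (int i = 0; i < expected; i++)
        {
            inventory.Add("item " + i);
        }

        // Capacity only changes when the list has to allocate a larger array.
        return inventory.Capacity == initialCapacity;
    }
}
```

Adding one more item than the estimated capacity would trigger a single reallocation, after which the list continues to work normally.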

Index Array Storage Memory

Is there a use case for storing index ranges when working with a potentially huge list?
Say we have a list of millions of records. These will be analysed and a sublist of indexes will be reported to the user. Rather than listing out a massive list of indexes, it would obviously be more legible to present:
Identified Rows: 10, 21, 10000-30000, 700000... etc to the user.
Now I can obviously create this string from the array of indexes but I'm wondering if it would also be more memory efficient to create the list in this format (and not creating a massive list of indexes in memory). Or is it not worth the processing overhead?
List<int> intList = new List<int>{1,2,3,4,5,6,7...};
vs
List<string> strList = new List<string>{"1-3000","3002","4000-5000"...};
To apply this I would imagine creating a List and when adding an item update/add to the list as necessary. Would require quite a bit of converting strings to int and vice-versa I think which is where this process may not be worth it.
Let me know if this isn't clear enough and I can potentially explain further.
UPDATE
I quite like Patrick Hofman's solution below using a list of ranges. What would be really cool would be to extend this so that .add(int) would modify the list of ranges correctly. I think this would be quite complicated though, correct?
I would opt to create a list of ranges. Depending on the number of singles in it, it might be more or less efficient:
public struct Range
{
public Range(int from, int to)
{
this.From = from;
this.To = to;
}
public int From { get; }
public int To { get; }
public static implicit operator Range(int v)
{
return new Range(v, v);
}
}
You can use it like this:
List<Range> l = new List<Range>{ 1, 2, 3, new Range(5, 3000) };
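On the follow-up about an Add(int) that keeps the ranges merged: one possible sketch is below. It is not part of the answer above; the class name IndexRangeList is hypothetical, it uses value tuples instead of the Range struct to stay self-contained, and it uses a linear scan for brevity (a binary search would be the obvious refinement for very long range lists).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class IndexRangeList
{
    // Invariant: sorted by From, inclusive bounds, no overlapping or adjacent ranges.
    private readonly List<(int From, int To)> _ranges = new List<(int From, int To)>();

    public void Add(int value)
    {
        // i becomes the index of the first range that starts after 'value'.
        int i = 0;
        while (i < _ranges.Count && _ranges[i].From <= value) i++;

        // Already covered by the range just before the insertion point.
        if (i > 0 && value <= _ranges[i - 1].To) return;

        bool touchesPrev = i > 0 && _ranges[i - 1].To + 1 == value;
        bool touchesNext = i < _ranges.Count && _ranges[i].From - 1 == value;

        if (touchesPrev && touchesNext)
        {
            // 'value' fills the one-element gap between two ranges: merge them.
            _ranges[i - 1] = (_ranges[i - 1].From, _ranges[i].To);
            _ranges.RemoveAt(i);
        }
        else if (touchesPrev) _ranges[i - 1] = (_ranges[i - 1].From, value);
        else if (touchesNext) _ranges[i] = (value, _ranges[i].To);
        else _ranges.Insert(i, (value, value));
    }

    // Formats the ranges like "1-5, 10".
    public override string ToString() =>
        string.Join(", ", _ranges.Select(r => r.From == r.To ? r.From.ToString() : r.From + "-" + r.To));
}
```

Memory use then depends on how many separate runs there are rather than on how many indexes were added, which is the trade-off the answer describes.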

Can you use List<List<struct>> to get around the 2gb object limit?

I'm running up against the 2gb object limit in c# (this applies even in 64 bit for some annoying reason) with a large collection of structs (est. size of 4.2 gig in total).
Now obviously using List is going to give me a list of size 4.2gb give or take, but would using a list made up of smaller lists, which in turn contain a portion of the structs, allow me to jump this limit?
My reasoning here is that it's only a hard-coded limit in the CLR that stops me instantiating a 9gig object on my 64bit platform, and it's entirely unrelated to system resources. Also Lists and Arrays are reference types, and so a List containing lists would only actually contain the references to each list. No one object therefore exceeds the size limit.
Is there any reason why this wouldn't work? I'd try this myself right now but I don't have a memory profiler on hand to verify.
Now obviously using List is going to give me a list of size 4.2gb give or take, but would using a list made up of smaller lists, which in turn contain a portion of the structs, allow me to jump this limit?
Yes - though, if you're trying to work around this limit, I'd consider using arrays yourself instead of letting the List<T> class manage the array.
The 2gb single object limit in the CLR is exactly that, a single object instance. When you make an array of a struct (which List<T> uses internally), the entire array is "one object instance" in the CLR. However, by using a List<List<T>> or a jagged array, each internal list/array is a separate object, which allows you to effectively have any size object you wish.
The CLR team actually blogged about this, and provided a sample BigArray<T> implementation that acts like a single List<T>, but does the "block" management internally for you. This is another option for getting >2gb lists.
Note that .NET 4.5 will have the option to provide larger than 2gb objects on x64, but it will be something you have to explicitly opt in to having.
The List holds references which are 4 or 8 bytes, depending on if you're running in 32-bit or 64-bit mode, therefore if you reference a 2GB object that would not increase the actual List size to 2 GB but it would only increase it by the number of bytes it is necessary to reference that object.
This will allow you to reference millions of objects and each object could be 2GB. If you have 4 objects in the List and each is 2 GB, then you would have 8 GB worth of objects referenced by the List, but the List object would have only used up an extra 4*8=32 bytes.
The number of references you can hold on a 32-bit machine before the List hits the 2GB limit is 536.87 million, on a 64-bit machine it's 268.43 million.
536 million references * 2 GB = A LOT OF DATA!
P.S. Reed pointed out, the above is true for reference types but not for value types. So if you're holding value types, then your workaround is valid. Please see the comment below for more info.
There's an interesting post around this subject here:
http://blogs.msdn.com/b/joshwil/archive/2005/08/10/450202.aspx
Which talks about writing your own 'BigArray' object.
In versions of .NET prior to 4.5, the maximum object size is 2GB. From 4.5 onwards you can allocate larger objects if gcAllowVeryLargeObjects is enabled. Note that the limit for string is not affected, but "arrays" should cover "lists" too, since lists are backed by arrays.
class HugeList<T>
{
private const int PAGE_SIZE = 102400;
private const int ALLOC_STEP = 1024;
private T[][] _rowIndexes;
private int _currentPage = -1;
private int _nextItemIndex = PAGE_SIZE;
private int _pageCount = 0;
private int _itemCount = 0;
#region Internals
private void AddPage()
{
if (++_currentPage == _pageCount)
ExtendPages();
_rowIndexes[_currentPage] = new T[PAGE_SIZE];
_nextItemIndex = 0;
}
private void ExtendPages()
{
if (_rowIndexes == null)
{
_rowIndexes = new T[ALLOC_STEP][];
}
else
{
T[][] rowIndexes = new T[_rowIndexes.Length + ALLOC_STEP][];
Array.Copy(_rowIndexes, rowIndexes, _rowIndexes.Length);
_rowIndexes = rowIndexes;
}
_pageCount = _rowIndexes.Length;
}
#endregion Internals
#region Public
public int Count
{
get { return _itemCount; }
}
public void Add(T item)
{
if (_nextItemIndex == PAGE_SIZE)
AddPage();
_itemCount++;
_rowIndexes[_currentPage][_nextItemIndex++] = item;
}
public T this[int index]
{
get { return _rowIndexes[index / PAGE_SIZE][index % PAGE_SIZE]; }
set { _rowIndexes[index / PAGE_SIZE][index % PAGE_SIZE] = value; }
}
#endregion Public
}

Set/extend List<T> length in c#

Given a List<T> in c#, is there a way to extend it (within its capacity) and set the new elements to null? I'd like something that works like a memset. I'm not looking for sugar here, I want fast code. I know that in C the operation could be done in something like 1-3 asm ops per entry.
The best solution I've found is this:
list.AddRange(Enumerable.Repeat(null, count-list.Count));
however that is c# 3.0 (<3.0 is preferred) and might be generating and evaluating an enumerator.
My current code uses:
while(list.Count < lim) list.Add(null);
so that's the starting point for time cost.
The motivation for this is that I need to set the n'th element even if it is after the old Count.
The simplest way is probably by creating a temporary array:
list.AddRange(new T[size - count]);
Where size is the required new size, and count is the count of items in the list. However, for relatively large values of size - count, this can have bad performance, since it can cause the list to reallocate multiple times.(*) It also has the disadvantage of allocating an additional temporary array, which, depending on your requirements, may not be acceptable. You could mitigate both issues at the expense of more explicit code, by using the following methods:
public static class CollectionsUtil
{
public static List<T> EnsureSize<T>(this List<T> list, int size)
{
return EnsureSize(list, size, default(T));
}
public static List<T> EnsureSize<T>(this List<T> list, int size, T value)
{
if (list == null) throw new ArgumentNullException("list");
if (size < 0) throw new ArgumentOutOfRangeException("size");
int count = list.Count;
if (count < size)
{
int capacity = list.Capacity;
if (capacity < size)
list.Capacity = Math.Max(size, capacity * 2);
while (count < size)
{
list.Add(value);
++count;
}
}
return list;
}
}
The only C# 3.0 here is the use of the "this" modifier to make them extension methods. Remove the modifier and it will work in C# 2.0.
Unfortunately, I never compared the performance of the two versions, so I don't know which one is better.
Oh, and did you know you could resize an array by calling Array.Resize<T>? I didn't know that. :)
Update:
(*) Using list.AddRange(array) will not cause an enumerator to be used. Looking further through Reflector showed that the array is cast to ICollection<T> and its Count property is used, so the allocation is done only once.
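Tying this back to the stated motivation (setting the n'th element even when it lies past the current Count), a possible helper built on the AddRange(new T[...]) approach is sketched below. ListPaddingSketch and SetAt are hypothetical names, and the code assumes a current compiler (nameof and extension methods) rather than the C# 2.0 target mentioned in the question.

```csharp
using System;
using System.Collections.Generic;

public static class ListPaddingSketch
{
    // Sets list[index], first padding the list with default(T) if it is too short.
    public static void SetAt<T>(this List<T> list, int index, T value)
    {
        if (list == null) throw new ArgumentNullException(nameof(list));
        if (index < 0) throw new ArgumentOutOfRangeException(nameof(index));

        int missing = index + 1 - list.Count;
        if (missing > 0)
        {
            // A new array is filled with default(T) (null for reference types),
            // and AddRange copies it in with at most one growth of the backing array.
            list.AddRange(new T[missing]);
        }

        list[index] = value;
    }
}
```

The temporary array is the cost mentioned earlier in the answer; the EnsureSize method avoids it at the price of one Add call per padded element.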
static IEnumerable<T> GetValues<T>(T value, int count) {
for (int i = 0; i < count; ++i)
yield return value;
}
list.AddRange(GetValues<object>(null, number_of_nulls_to_add));
This will work with 2.0+
Why do you want to do that?
The main advantage of a List is that it can grow as needed, so why do you want to add a number of null or default elements to it?
Isn't it better to use an array in this case?
