What does ParallelQuerys Count count?

What does ParallelQuerys Count count? - c#

I'm testing a self written element generator (ICollection<string>) and compare the calculated count to the actual count to get an idea if there's an error or not in my algorithm.
As this generator can generate lots of elements on demand I'm looking in Partitioner<string> and I have implemented a basic one which seems to also produce valid enumerators which together give the same amount of strings as calculated.
Now I want to test how this behaves if run parallel (again first testing for correct count):
MyGenerator generator = new MyGenerator();
MyPartitioner partitioner = new MyPartitioner(generator);
int isCount = partitioner.AsParallel().Count();
int shouldCount = generator.Count;
bool same = isCount == shouldCount; // false
I don't get why this count is not equal! What is the ParallelQuery<string> doing?
generator.Count() == generator.Count // true
partitioner.GetPartitions(xyz).Select(enumerator =>
{
int count = 0;
while (enumerator.MoveNext())
{
count++;
}
return count;
}).Sum() == generator.Count // true
So, I'm currently not seeing an error in my code. Next I tried to manualy count that ParallelQuery<string>:
int count = 0;
partitioner.AsParallel().ForAll(e => Interlocked.Increment(ref count));
count == generator.Count // true
Summed up: Everyone counts my enumerable correct, ParallelQuery.ForAll enumerates exactly generator.Count elements. But what does ParallelQuery.Count()?
If the correct count is something about 10k, ParallelQuery sees 40k.
internal sealed class PartialWordEnumerator : IEnumerator<string>
{
private object sync = new object();
private readonly IEnumerable<char> characters;
private readonly char[] limit;
private char[] buffer;
private IEnumerator<char>[] enumerators;
private int position = 0;
internal PartialWordEnumerator(IEnumerable<char> characters, char[] state, char[] limit)
{
this.characters = new List<char>(characters);
this.buffer = (char[])state.Clone();
if (limit != null)
{
this.limit = (char[])limit.Clone();
}
this.enumerators = new IEnumerator<char>[this.buffer.Length];
for (int i = 0; i < this.buffer.Length; i++)
{
this.enumerators[i] = SkipTo(state[i]);
}
}
private IEnumerator<char> SkipTo(char c)
{
IEnumerator<char> first = this.characters.GetEnumerator();
IEnumerator<char> second = this.characters.GetEnumerator();
while (second.MoveNext())
{
if (second.Current == c)
{
return first;
}
first.MoveNext();
}
throw new InvalidOperationException();
}
private bool ReachedLimit
{
get
{
if (this.limit == null)
{
return false;
}
for (int i = 0; i < this.buffer.Length; i++)
{
if (this.buffer[i] != this.limit[i])
{
return false;
}
}
return true;
}
}
public string Current
{
get
{
if (this.buffer == null)
{
throw new ObjectDisposedException(typeof(PartialWordEnumerator).FullName);
}
return new string(this.buffer);
}
}
object IEnumerator.Current
{
get { return this.Current; }
}
public bool MoveNext()
{
lock (this.sync)
{
if (this.position == this.buffer.Length)
{
this.position--;
}
if (this.position == -1)
{
return false;
}
IEnumerator<char> enumerator = this.enumerators[this.position];
if (enumerator.MoveNext())
{
this.buffer[this.position] = enumerator.Current;
this.position++;
if (this.position == this.buffer.Length)
{
return !this.ReachedLimit;
}
else
{
return this.MoveNext();
}
}
else
{
this.enumerators[this.position] = this.characters.GetEnumerator();
this.position--;
return this.MoveNext();
}
}
}
public void Dispose()
{
this.position = -1;
this.buffer = null;
}
public void Reset()
{
throw new NotSupportedException();
}
}
public override IList<IEnumerator<string>> GetPartitions(int partitionCount)
{
IEnumerator<string>[] enumerators = new IEnumerator<string>[partitionCount];
List<char> characters = new List<char>(this.generator.Characters);
int length = this.generator.Length;
int characterCount = this.generator.Characters.Count;
int steps = Math.Min(characterCount, partitionCount);
int skip = characterCount / steps;
for (int i = 0; i < steps; i++)
{
char c = characters[i * skip];
char[] state = new string(c, length).ToCharArray();
char[] limit = null;
if ((i + 1) * skip < characterCount)
{
c = characters[(i + 1) * skip];
limit = new string(c, length).ToCharArray();
}
if (i == steps - 1)
{
limit = null;
}
enumerators[i] = new PartialWordEnumerator(characters, state, limit);
}
for (int i = steps; i < partitionCount; i++)
{
enumerators[i] = Enumerable.Empty<string>().GetEnumerator();
}
return enumerators;
}

EDIT: I believe I have found the solution. According to the documentation on IEnumerable.MoveNext (emphasis mine):
If MoveNext passes the end of the collection, the enumerator is
positioned after the last element in the collection and MoveNext
returns false. When the enumerator is at this position, subsequent
calls to MoveNext also return false until Reset is called.
According to the following logic:
private bool ReachedLimit
{
get
{
if (this.limit == null)
{
return false;
}
for (int i = 0; i < this.buffer.Length; i++)
{
if (this.buffer[i] != this.limit[i])
{
return false;
}
}
return true;
}
}
The call to MoveNext() will return false only one time - when the buffer is exactly equal to the limit. Once you have passed the limit, the return value from ReachedLimit will start to become false again, making return !this.ReachedLimit return true, so the enumerator will continue past the end of the limit all the way until it runs out of characters to enumerate. Apparently, in the implementation of ParallelQuery.Count(), MoveNext() is called multiple times when it has reached the end, and since it starts to return a true value again, the enumerator happily continues returning more elements (this is not the case in your custom code that walks the enumerator manually, and apparently also is not the case for the ForAll call, so they "accidentally" return the correct results).
The simplest fix to this is to remember the return value from MoveNext() once it becomes false:
private bool _canMoveNext = true;
public bool MoveNext()
{
if (!_canMoveNext) return false;
...
if (this.position == this.buffer.Length)
{
if (this.ReachedLimit) _canMoveNext = false;
...
}
Now once it begins returning false, it will return false for every future call and this returns the correct result from AsParallel().Count(). Hope this helps!
The documentation on Partitioner notes (emphasis mine):
The static methods on Partitioner are all thread-safe and may
be used concurrently from multiple threads. However, while a created
partitioner is in use, the underlying data source should not be
modified, whether from the same thread that is using a partitioner or
from a separate thread.
From what I can understand of the code you have given, it would seem that ParallelQuery.Count() is most likely to have thread-safety issues because it may possibly be iterating multiple enumerators at the same time, whereas all the other solutions would require the enumerators to be run synchronized. Without seeing the code you are using for MyGenerator and MyPartitioner is it difficult to determine if thread-safety issues could be the culprit.
To demonstrate, I have written a simple enumerator that returns the first hundred numbers as strings. Also, I have a partitioner, that distributes the elements in the underlying enumerator over a collection of numPartitions separate lists. Using all the methods you described above on our 12-core server (when I output numPartitions, it uses 12 by default on this machine), I get the expected result of 100 (this is LINQPad-ready code):
void Main()
{
var partitioner = new SimplePartitioner(GetEnumerator());
GetEnumerator().Count().Dump();
partitioner.GetPartitions(10).Select(enumerator =>
{
int count = 0;
while (enumerator.MoveNext())
{
count++;
}
return count;
}).Sum().Dump();
var theCount = 0;
partitioner.AsParallel().ForAll(e => Interlocked.Increment(ref theCount));
theCount.Dump();
partitioner.AsParallel().Count().Dump();
}
// Define other methods and classes here
public IEnumerable<string> GetEnumerator()
{
for (var i = 1; i <= 100; i++)
yield return i.ToString();
}
public class SimplePartitioner : Partitioner<string>
{
private IEnumerable<string> input;
public SimplePartitioner(IEnumerable<string> input)
{
this.input = input;
}
public override IList<IEnumerator<string>> GetPartitions(int numPartitions)
{
var list = new List<string>[numPartitions];
for (var i = 0; i < numPartitions; i++)
list[i] = new List<string>();
var index = 0;
foreach (var s in input)
list[(index = (index + 1) % numPartitions)].Add(s);
IList<IEnumerator<string>> result = new List<IEnumerator<string>>();
foreach (var l in list)
result.Add(l.GetEnumerator());
return result;
}
}
Output:
100
100
100
100
This clearly works. Without more information it is impossible to tell you what is not working in your particular implementation.

Related

For-loop looping too often because it's not exiting properly (performance)

I currently have an array with a size of one million, and a method that parses new objects (each one with a key and value) or overwrites an object in case a new object has an already existing key. The thing is that the loop for the overwriting part is taking an extremely long time with an array this big because it is not exiting properly. For every element it parses it checks every index, so for a thousand elements to parse it would do a billion checks, which is obviously not desired behaviour.
private const int ARRAY_SIZE = 1000000;
private Paar[] pArr = new Paar[ARRAY_SIZE];
public void Put(string key, object value)
{
bool wasReplaced = false;
for (int i = 0; i < pArr.Length; i++)
{
if (pArr[i] != null && pArr[i].key == key)
{
pArr[i] = new Paar(key, value);
wasReplaced = true;
}
}
if (wasReplaced == false)
{
for (int i = 0; i < pArr.Length; i++)
{
if (pArr[i] == null)
{
pArr[i] = new Paar(key, value);
break;
}
else if (i >= pArr.Length)
{
throw new Exception("All slots full");
}
}
}
}
Edit: Note that I am using two loops to prevent the overwriting function from being able to parse an object with a duplicate key and new value into an empty index if there happens to be one before the duplicate key's index (for example if I set one or more random indexes to null).
string keyparse = "Key";
string valueparse = "Value";
Random rnd = new Random();
int wdh = 1000;
for (int i = 0; i < wdh; i++)
{
myMap.Put(keyparse + rnd.Next(1, 10000), valueparse + rnd.Next(1, 10000));
}
I tried to make the first part of the function add 1 to an int and break if the int reaches the number of elements parsed, but it does not seem to function properly, and neither does it affect the time taken by the function.
bool wasReplaced = false;
int count = 0;
for (int i = 0; i < pArr.Length; i++)
{
if (pArr[i] != null)
{
count += 1;
if (pArr[i] != null && pArr[i].key == key)
{
pArr[i] = new Paar(key, value);
count += 1;
break;
}
else if (count == 1000)
{
break;
}
}
}
I don't really know why this wouldn't work or how else I could approach this, so I'm kind of stuck here... Any help is greatly appreciated!

Use a Dictionary - but you write it's for practice ...
In that case:
Speedup 1: (Assuming your key is unique!)
for (int i = 0; i < pArr.Length; i++)
{
if (pArr[i] != null && pArr[i].key == key)
{
pArr[i] = new Paar(key, value);
// wasReplaced = true;
return;
}
}
Speedup 2: When traversing in the first for-loop, save the first position of an empty space. Then you can use that instantly and only have to iterate the array once.
int empty = -1;
for (int i = 0; i < pArr.Length; i++)
{
if (pArr[i] != null )
{
if (pArr[i].key == key)
{
pArr[i] = new Paar(key, value);
return;
}
}
else if( empty < 0 )
{
empty = i;
}
}
if( pArr >= 0 )
pArr[empty] = new Paar(key,value);
else // Array has been fully traversed without match and without empty space
throw new Exception("All slots full");
Speedup 3: You could keep track of the highest index used, so you can break the for-loop early.
Mind that these measures are only to speed up that part. I did not take into account many other considerations and possible techniques like hashing, thread-safety, etc.

There are several things may done here-
A minor improvement will be break your loop when you set wasReplaced = true
if (pArr[i] != null && pArr[i].key == key)
{
pArr[i] = new Paar(key, value);
wasReplaced = true;
break;
}
But it is not a good solution
Another solution is you may use TPL (Task Parallel Library), to use multithread and multicore. There is a ParallelForEach() loop there.
Another solution is use Dictionary<string, string> to store your values, instead of using an array. Dictionary<> uses hash map, you will get a better performance.
If you still want to use your own implementation with array, then try to use hashing or heap algorithm to store your data. By using hashing or heap you can store/get/update data by O(log(n)) time complexity.

private List<Paar> pArr = new list<Paar>();
public void Put(string key, object value)
{
(from p in pArr
where p!= null && p.key!=key select p).ToList()
.ForEach(x => x.key== key, x.Value==value);
}
Try with Linq Query

As written in a comment, I suggest you to use Dictionary instead of Paar class, so you can do something like:
int ARRAY_SIZE= 100000;
Dictionary<string, object> pArr = new Dictionary<string, object>()
public void Put(string key, string object)
{
if(pArr.ContainsKey(key))
pArr[key] = value;
else
{
if(pArr.Count() >= ARRAY_SIZE)
throw new Exception("All slots full");
pArr.Add(key, value)
}
}
If you need to use Paar class, you can use the .ToDictionary(x => x.key, x => x.value) to work with it.

Given that you don't want to use a dictionary and there may be a match anywhere within the array, then the best you can do is search the whole array until you find a match, and capture the index of the first null item along the way (for an insertion point if a match is not found):
public static void Put(string key, object value)
{
int insertionIndex = -1;
for (int i = 0; i < pArr.Length; i++)
{
if (pArr[i] != null)
{
if (pArr[i].Key == key)
{
insertionIndex = i;
break;
}
}
else if (insertionIndex < 0)
{
insertionIndex = i;
}
}
if (insertionIndex < 0)
{
throw new Exception("All slots full");
}
else
{
pArr[insertionIndex] = new Paar(key, value);
}
}

I solved this by declaring an instance variable and a local variable which are then to be compared. I also fixed a logical error I made (now the second loop counts properly and the first loop doesn't count anymore if it overwrites something but instead returns).
The method still functions the same but now runs much more efficiently (took 7 seconds for 5kk parses compared to the previous 12 seconds for 1k parses!).
int _objectCount = 0;
public void Put(string key, object value)
{
int localCount = 0;
for (int i = 0; i < pArr.Length; i++)
{
if (pArr[i] != null)
{
localCount++;
if (pArr[i].key == key)
{
pArr[i] = new Paar(key, value);
return;
}
}
if (localCount == _objectCount)
{
return;
}
}
for (int i = 0; i < pArr.Length; i++)
{
if (pArr[i] == null)
{
pArr[i] = new Paar(key, value);
_objectCount++;
return;
}
else if (i >= pArr.Length)
{
throw new Exception("All slots full");
}
}
}

Iterative Deepening Search in C#

I am trying to implement Iterative Deeping Search. I do not know what I am doing wrong, but I don't seem to be getting it right. I always end up with an infinite loop.
Can anyone point out my mistake?
I implemented the Depth-Limited Search and used it in my IDS code. DLS seems to be working fine on its own, but I do not understand IDS and why i'm ending up in an infinite loop.
public class IterativeDeepeningSearch<T> where T : IComparable
{
string closed;
public int maximumDepth;
public int depth = 0;
bool Found = false;
Stack<Vertex<T>> open;
public IterativeDeepeningSearch()
{
open = new Stack<Vertex<T>>();
}
public bool IDS(Vertex<T> startNode, Vertex<T> goalNode)
{
// loops through until a goal node is found
for (int _depth = 0; _depth < Int32.MaxValue; _depth++)
{
bool found = DLS(startNode, goalNode, _depth);
if (found)
{
return true;
}
}
// this will never be reached as it
// loops forever until goal is found
return false;
}
public bool DLS(Vertex<T> startNode, Vertex<T> goalNode, int _maximumDepth)
{
maximumDepth = _maximumDepth;
open.Push(startNode);
while (open.Count > 0 && depth < maximumDepth)
{
Vertex<T> node = open.Pop();
closed = closed + " " + node.Data.ToString();
if (node.Data.ToString() == goalNode.Data.ToString())
{
Debug.Write("Success");
Found = true;
break;
}
List<Vertex<T>> neighbours = node.Neighbors;
depth++;
if (neighbours != null)
{
foreach (Vertex<T> neighbour in neighbours)
{
if (!closed.Contains(neighbour.ToString()))
open.Push(neighbour);
}
Debug.Write("Failure");
}
}
Console.WriteLine(closed);
return Found;
}
}
}
PS: My Vertex Class just has two properties, Data and Children

The for loop iterates on _depth but in the DLS function you are passing depth, which is always 0
for (int _depth = 0; _depth < Int32.MaxValue; _depth++)
{
bool found = DLS(startNode, goalNode, depth);
if (found)
{
return true;
}
}

Postfix increment into if, c#

Code example:
using System;
public class Test {
public static void Main() {
int a = 0;
if(a++ == 0){
Console.WriteLine(a);
}
}
}
In this code the Console will write: 1. I can write this code in another way:
public static void Main() {
int a = 0;
if(a == 0){
a++;
Console.WriteLine(a);
}
}
These two examples work exactly the same (from what I know about postfix).
The problem is with this example coming from the Microsoft tutorials:
using System;
public class Document {
// Class allowing to view the document as an array of words:
public class WordCollection {
readonly Document document;
internal WordCollection (Document d){
document = d;
}
// Helper function -- search character array "text", starting
// at character "begin", for word number "wordCount". Returns
//false if there are less than wordCount words. Sets "start" and
//length to the position and length of the word within text
private bool GetWord(char[] text, int begin, int wordCount,
out int start, out int length) {
int end = text.Length;
int count = 0;
int inWord = -1;
start = length = 0;
for (int i = begin; i <= end; ++i){
bool isLetter = i < end && Char.IsLetterOrDigit(text[i]);
if (inWord >= 0) {
if (!isLetter) {
if (count++ == wordCount) {//PROBLEM IS HERE!!!!!!!!!!!!
start = inWord;
length = i - inWord;
return true;
}
inWord = -1;
}
} else {
if (isLetter) {
inWord = i;
}
}
}
return false;
}
//Indexer to get and set words of the containing document:
public string this[int index] {
get
{
int start, length;
if(GetWord(document.TextArray, 0, index, out start,
out length)) {
return new string(document.TextArray, start, length);
} else {
throw new IndexOutOfRangeException();
}
}
set {
int start, length;
if(GetWord(document.TextArray, 0, index, out start,
out length))
{
//Replace the word at start/length with
// the string "value"
if(length == value.Length){
Array.Copy(value.ToCharArray(), 0,
document.TextArray, start, length);
}
else {
char[] newText = new char[document.TextArray.Length +
value.Length - length];
Array.Copy(document.TextArray, 0, newText, 0, start);
Array.Copy(value.ToCharArray(), 0, newText, start, value.Length);
Array.Copy(document.TextArray, start + length, newText,
start + value.Length, document.TextArray.Length - start - length);
document.TextArray = newText;
}
} else {
throw new IndexOutOfRangeException();
}
}
}
public int Count {
get {
int count = 0, start = 0, length = 0;
while (GetWord(document.TextArray, start + length,
0, out start, out length)) {
++count;
}
return count;
}
}
}
// Class allowing the document to be viewed like an array
// of character
public class CharacterCollection {
readonly Document document;
internal CharacterCollection(Document d) {
document = d;
}
//Indexer to get and set character in the containing
//document
public char this[int index] {
get {
return document.TextArray[index];
}
set {
document.TextArray[index] = value;
}
}
//get the count of character in the containing document
public int Count {
get {
return document.TextArray.Length;
}
}
}
//Because the types of the fields have indexers,
//these fields appear as "indexed properties":
public WordCollection Words;
public readonly CharacterCollection Characters;
private char[] TextArray;
public Document(string initialText) {
TextArray = initialText.ToCharArray();
Words = new WordCollection(this);
Characters = new CharacterCollection(this);
}
public string Text {
get {
return new string(TextArray);
}
}
class Test {
static void Main() {
Document d = new Document(
"peter piper picked a peck of pickled peppers. How many pickled peppers did peter piper pick?"
);
//Change word "peter" to "penelope"
for(int i = 0; i < d.Words.Count; ++i){
if (d.Words[i] == "peter") {
d.Words[i] = "penelope";
}
}
for (int i = 0; i < d.Characters.Count; ++i) {
if (d.Characters[i] == 'p') {
d.Characters[i] = 'P';
}
}
Console.WriteLine(d.Text);
}
}
}
If I change the code marked above to this:
if (count == wordCount) {//PROBLEM IS HERE
start = inWord;
length = i - inWord;
count++;
return true;
}
I get an IndexOutOfRangeException, but I don't know why.

Your initial assumption is incorrect (that the two examples work exactly the same). In the following version, count is incremented regardless of whether or not it is equal to wordCount:
if (count++ == wordCount)
{
// Code omitted
}
In this version, count is ONLY incremented when it is equal to wordCount
if (count == wordCount)
{
// Other code omitted
count++;
}
EDIT
The reason this is causing you a failure is that, when you are searching for the second word (when wordCount is 1), the variable count will never equal wordCount (because it never gets incremented), and therefore the GetWord method returns false, which then triggers the else clause in your get method, which throws an IndexOutOfRangeException.

In your version of the code, count is only being incremented when count == wordCount; in the Microsoft version, it's being incremented whether the condition is met or not.

using System;
public class Test {
public static void Main() {
int a = 0;
if(a++ == 0){
Console.WriteLine(a);
}
}
}
Is not quite the same as:
public static void Main() {
int a = 0;
if(a == 0){
a++;
Console.WriteLine(a);
}
}
In the second case a++ is executed only if a == 0. In the first case a++ is executed every time we check the condition.

There is your mistake:
public static void Main() {
int a = 0;
if(a == 0){
a++;
Console.WriteLine(a);
}
}
It should be like this:
public static void Main() {
int a = 0;
if(a == 0){
a++;
Console.WriteLine(a);
}
else
a++;
}
a gets alwasy increased. This means, that in your code example count will get only increased when count == wordCount (In which case the method will return true anyway...). You basicly never increasing count.

How to improve compare of item in sorted List<MyItem> with item before and after current item?

Does anyone know about a good way to accomplish this task?
Currently i'm doing it more ore less this way, but i'm feeling someway unhappy with this code, unable to say what i could immediately improve.
So if anyone has a smarter way of doing this job i would be happy to know.
private bool Check(List<MyItem> list)
{
bool result = true;
//MyItem implements IComparable<MyItem>
list.Sort();
for (int pos = 0; pos < list.Count - 1; pos++)
{
bool previousCheckOk = true;
if (pos != 0)
{
if (!CheckCollisionWithPrevious(pos))
{
MarkAsFailed(pos);
result = false;
previousCheckOk = false;
}
else
{
MarkAsGood(pos);
}
}
if (previousCheckOk && pos != list.Count - 1)
{
if (!CheckCollisionWithFollowing(pos))
{
MarkAsFailed(pos);
result = false;
}
else
{
MarkAsGood(pos);
}
}
}
return result;
}
private bool CheckCollisionWithPrevious(int pos)
{
bool checkOk = false;
var previousItem = _Item[pos - 1];
// Doing some checks ...
return checkOk;
}
private bool CheckCollisionWithFollowing(int pos)
{
bool checkOk = false;
var followingItem = _Item[pos + 1];
// Doing some checks ...
return checkOk;
}
Update
After reading the answer from Aaronaught and a little weekend to refill full mind power i came up with the following solution, that looks far better now (and nearly the same i got from Aaronaught):
public bool Check(DataGridView dataGridView)
{
bool result = true;
_Items.Sort();
for (int pos = 1; pos < _Items.Count; pos++)
{
var previousItem = _Items[pos - 1];
var currentItem = _Items[pos];
if (previousItem.CollidesWith(currentItem))
{
dataGridView.Rows[pos].ErrorText = "Offset collides with item named " + previousItem.Label;
result = false;
sb.AppendLine("Line " + pos);
}
}
dataGridView.Refresh();
return result;
}

It's certainly possible to reduce the repetition:
private bool Check(List<MyItem> list)
{
list.Sort();
for (int pos = 1; pos < list.Count; pos++)
{
if (!CheckCollisionWithPrevious(list, pos))
{
MarkAsFailed();
return false;
}
MarkAsGood();
}
return true;
}
private bool CheckCollisionWithPrevious(List<MyItem> list, int pos)
{
bool checkOk = false;
var previousItem = list[pos - 1];
// Doing some checks ...
return checkOk;
}
Assuming that CheckCollisionWithPrevious and CheckCollisionWithFollowing perform essentially the same comparisons, then this will perform the same function with a lot less code.
I've also added the list as a parameter to the second function; it doesn't make sense to be taking it as a parameter in the first function, but then referencing a hard-coded member in the function it calls. If you're going to take a parameter, then pass that parameter down the chain.
As far as performance is concerned, though, you're re-sorting the list every time this happens; if it happens often enough, you might be better off using a sorted collection to begin with.
Edit: And just for good measure, if the whole point of this code is just to check for some kind of duplicate key, then you would be way better off using a data structure that prevents this in the first place, such as a Dictionary<TKey, TValue>.

Split a collection into `n` parts with LINQ? [duplicate]

This question already has answers here:
Split List into Sublists with LINQ
(34 answers)
Closed 1 year ago.
Is there a nice way to split a collection into n parts with LINQ?
Not necessarily evenly of course.
That is, I want to divide the collection into sub-collections, which each contains a subset of the elements, where the last collection can be ragged.

A pure linq and the simplest solution is as shown below.
static class LinqExtensions
{
public static IEnumerable<IEnumerable<T>> Split<T>(this IEnumerable<T> list, int parts)
{
int i = 0;
var splits = from item in list
group item by i++ % parts into part
select part.AsEnumerable();
return splits;
}
}

EDIT: Okay, it looks like I misread the question. I read it as "pieces of length n" rather than "n pieces". Doh! Considering deleting answer...
(Original answer)
I don't believe there's a built-in way of partitioning, although I intend to write one in my set of additions to LINQ to Objects. Marc Gravell has an implementation here although I would probably modify it to return a read-only view:
public static IEnumerable<IEnumerable<T>> Partition<T>
(this IEnumerable<T> source, int size)
{
T[] array = null;
int count = 0;
foreach (T item in source)
{
if (array == null)
{
array = new T[size];
}
array[count] = item;
count++;
if (count == size)
{
yield return new ReadOnlyCollection<T>(array);
array = null;
count = 0;
}
}
if (array != null)
{
Array.Resize(ref array, count);
yield return new ReadOnlyCollection<T>(array);
}
}

static class LinqExtensions
{
public static IEnumerable<IEnumerable<T>> Split<T>(this IEnumerable<T> list, int parts)
{
return list.Select((item, index) => new {index, item})
.GroupBy(x => x.index % parts)
.Select(x => x.Select(y => y.item));
}
}

Ok, I'll throw my hat in the ring. The advantages of my algorithm:
No expensive multiplication, division, or modulus operators
All operations are O(1) (see note below)
Works for IEnumerable<> source (no Count property needed)
Simple
The code:
public static IEnumerable<IEnumerable<T>>
Section<T>(this IEnumerable<T> source, int length)
{
if (length <= 0)
throw new ArgumentOutOfRangeException("length");
var section = new List<T>(length);
foreach (var item in source)
{
section.Add(item);
if (section.Count == length)
{
yield return section.AsReadOnly();
section = new List<T>(length);
}
}
if (section.Count > 0)
yield return section.AsReadOnly();
}
As pointed out in the comments below, this approach doesn't actually address the original question which asked for a fixed number of sections of approximately equal length. That said, you can still use my approach to solve the original question by calling it this way:
myEnum.Section(myEnum.Count() / number_of_sections + 1)
When used in this manner, the approach is no longer O(1) as the Count() operation is O(N).

This is same as the accepted answer, but a much simpler representation:
public static IEnumerable<IEnumerable<T>> Split<T>(this IEnumerable<T> items,
int numOfParts)
{
int i = 0;
return items.GroupBy(x => i++ % numOfParts);
}
The above method splits an IEnumerable<T> into N number of chunks of equal sizes or close to equal sizes.
public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> items,
int partitionSize)
{
int i = 0;
return items.GroupBy(x => i++ / partitionSize).ToArray();
}
The above method splits an IEnumerable<T> into chunks of desired fixed size with total number of chunks being unimportant - which is not what the question is about.
The problem with the Split method, besides being slower, is that it scrambles the output in the sense that the grouping will be done on the basis of i'th multiple of N for each position, or in other words you don't get the chunks in the original order.
Almost every answer here either doesn't preserve order, or is about partitioning and not splitting, or is plainly wrong. Try this which is faster, preserves order but a lil' more verbose:
public static IEnumerable<IEnumerable<T>> Split<T>(this ICollection<T> items,
int numberOfChunks)
{
if (numberOfChunks <= 0 || numberOfChunks > items.Count)
throw new ArgumentOutOfRangeException("numberOfChunks");
int sizePerPacket = items.Count / numberOfChunks;
int extra = items.Count % numberOfChunks;
for (int i = 0; i < numberOfChunks - extra; i++)
yield return items.Skip(i * sizePerPacket).Take(sizePerPacket);
int alreadyReturnedCount = (numberOfChunks - extra) * sizePerPacket;
int toReturnCount = extra == 0 ? 0 : (items.Count - numberOfChunks) / extra + 1;
for (int i = 0; i < extra; i++)
yield return items.Skip(alreadyReturnedCount + i * toReturnCount).Take(toReturnCount);
}
The equivalent method for a Partition operation here

I have been using the Partition function I posted earlier quite often. The only bad thing about it was that is wasn't completely streaming. This is not a problem if you work with few elements in your sequence. I needed a new solution when i started working with 100.000+ elements in my sequence.
The following solution is a lot more complex (and more code!), but it is very efficient.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Collections;
namespace LuvDaSun.Linq
{
public static class EnumerableExtensions
{
public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> enumerable, int partitionSize)
{
/*
return enumerable
.Select((item, index) => new { Item = item, Index = index, })
.GroupBy(item => item.Index / partitionSize)
.Select(group => group.Select(item => item.Item) )
;
*/
return new PartitioningEnumerable<T>(enumerable, partitionSize);
}
}
class PartitioningEnumerable<T> : IEnumerable<IEnumerable<T>>
{
IEnumerable<T> _enumerable;
int _partitionSize;
public PartitioningEnumerable(IEnumerable<T> enumerable, int partitionSize)
{
_enumerable = enumerable;
_partitionSize = partitionSize;
}
public IEnumerator<IEnumerable<T>> GetEnumerator()
{
return new PartitioningEnumerator<T>(_enumerable.GetEnumerator(), _partitionSize);
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
class PartitioningEnumerator<T> : IEnumerator<IEnumerable<T>>
{
IEnumerator<T> _enumerator;
int _partitionSize;
public PartitioningEnumerator(IEnumerator<T> enumerator, int partitionSize)
{
_enumerator = enumerator;
_partitionSize = partitionSize;
}
public void Dispose()
{
_enumerator.Dispose();
}
IEnumerable<T> _current;
public IEnumerable<T> Current
{
get { return _current; }
}
object IEnumerator.Current
{
get { return _current; }
}
public void Reset()
{
_current = null;
_enumerator.Reset();
}
public bool MoveNext()
{
bool result;
if (_enumerator.MoveNext())
{
_current = new PartitionEnumerable<T>(_enumerator, _partitionSize);
result = true;
}
else
{
_current = null;
result = false;
}
return result;
}
}
class PartitionEnumerable<T> : IEnumerable<T>
{
IEnumerator<T> _enumerator;
int _partitionSize;
public PartitionEnumerable(IEnumerator<T> enumerator, int partitionSize)
{
_enumerator = enumerator;
_partitionSize = partitionSize;
}
public IEnumerator<T> GetEnumerator()
{
return new PartitionEnumerator<T>(_enumerator, _partitionSize);
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
class PartitionEnumerator<T> : IEnumerator<T>
{
IEnumerator<T> _enumerator;
int _partitionSize;
int _count;
public PartitionEnumerator(IEnumerator<T> enumerator, int partitionSize)
{
_enumerator = enumerator;
_partitionSize = partitionSize;
}
public void Dispose()
{
}
public T Current
{
get { return _enumerator.Current; }
}
object IEnumerator.Current
{
get { return _enumerator.Current; }
}
public void Reset()
{
if (_count > 0) throw new InvalidOperationException();
}
public bool MoveNext()
{
bool result;
if (_count < _partitionSize)
{
if (_count > 0)
{
result = _enumerator.MoveNext();
}
else
{
result = true;
}
_count++;
}
else
{
result = false;
}
return result;
}
}
}
Enjoy!

Interesting thread. To get a streaming version of Split/Partition, one can use enumerators and yield sequences from the enumerator using extension methods. Converting imperative code to functional code using yield is a very powerful technique indeed.
First an enumerator extension that turns a count of elements into a lazy sequence:
public static IEnumerable<T> TakeFromCurrent<T>(this IEnumerator<T> enumerator, int count)
{
while (count > 0)
{
yield return enumerator.Current;
if (--count > 0 && !enumerator.MoveNext()) yield break;
}
}
And then an enumerable extension that partitions a sequence:
public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> seq, int partitionSize)
{
var enumerator = seq.GetEnumerator();
while (enumerator.MoveNext())
{
yield return enumerator.TakeFromCurrent(partitionSize);
}
}
The end result is a highly efficient, streaming and lazy implementation that relies on very simple code.
Enjoy!

I use this:
public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> instance, int partitionSize)
{
return instance
.Select((value, index) => new { Index = index, Value = value })
.GroupBy(i => i.Index / partitionSize)
.Select(i => i.Select(i2 => i2.Value));
}

As of .NET 6 you can use Enumerable.Chunk<TSource>(IEnumerable<TSource>, Int32).

This is memory efficient and defers execution as much as possible (per batch) and operates in linear time O(n)
public static IEnumerable<IEnumerable<T>> InBatchesOf<T>(this IEnumerable<T> items, int batchSize)
{
List<T> batch = new List<T>(batchSize);
foreach (var item in items)
{
batch.Add(item);
if (batch.Count >= batchSize)
{
yield return batch;
batch = new List<T>();
}
}
if (batch.Count != 0)
{
//can't be batch size or would've yielded above
batch.TrimExcess();
yield return batch;
}
}

There are lots of great answers for this question (and its cousins). I needed this myself and had created a solution that is designed to be efficient and error tolerant in a scenario where the source collection can be treated as a list. It does not use any lazy iteration so it may not be suitable for collections of unknown size that may apply memory pressure.
static public IList<T[]> GetChunks<T>(this IEnumerable<T> source, int batchsize)
{
IList<T[]> result = null;
if (source != null && batchsize > 0)
{
var list = source as List<T> ?? source.ToList();
if (list.Count > 0)
{
result = new List<T[]>();
for (var index = 0; index < list.Count; index += batchsize)
{
var rangesize = Math.Min(batchsize, list.Count - index);
result.Add(list.GetRange(index, rangesize).ToArray());
}
}
}
return result ?? Enumerable.Empty<T[]>().ToList();
}
static public void TestGetChunks()
{
var ids = Enumerable.Range(1, 163).Select(i => i.ToString());
foreach (var chunk in ids.GetChunks(20))
{
Console.WriteLine("[{0}]", String.Join(",", chunk));
}
}
I have seen a few answers across this family of questions that use GetRange and Math.Min. But I believe that overall this is a more complete solution in terms of error checking and efficiency.

protected List<List<int>> MySplit(int MaxNumber, int Divider)
{
List<List<int>> lst = new List<List<int>>();
int ListCount = 0;
int d = MaxNumber / Divider;
lst.Add(new List<int>());
for (int i = 1; i <= MaxNumber; i++)
{
lst[ListCount].Add(i);
if (i != 0 && i % d == 0)
{
ListCount++;
d += MaxNumber / Divider;
lst.Add(new List<int>());
}
}
return lst;
}

Great Answers, for my scenario i tested the accepted answer , and it seems it does not keep order. there is also great answer by Nawfal that keeps order.
But in my scenario i wanted to split the remainder in a normalized way,
all answers i saw spread the remainder or at the beginning or at the end.
My answer also takes the remainder spreading in more normalized way.
static class Program
{
static void Main(string[] args)
{
var input = new List<String>();
for (int k = 0; k < 18; ++k)
{
input.Add(k.ToString());
}
var result = splitListIntoSmallerLists(input, 15);
int i = 0;
foreach(var resul in result){
Console.WriteLine("------Segment:" + i.ToString() + "--------");
foreach(var res in resul){
Console.WriteLine(res);
}
i++;
}
Console.ReadLine();
}
private static List<List<T>> splitListIntoSmallerLists<T>(List<T> i_bigList,int i_numberOfSmallerLists)
{
if (i_numberOfSmallerLists <= 0)
throw new ArgumentOutOfRangeException("Illegal value of numberOfSmallLists");
int normalizedSpreadRemainderCounter = 0;
int normalizedSpreadNumber = 0;
//e.g 7 /5 > 0 ==> output size is 5 , 2 /5 < 0 ==> output is 2
int minimumNumberOfPartsInEachSmallerList = i_bigList.Count / i_numberOfSmallerLists;
int remainder = i_bigList.Count % i_numberOfSmallerLists;
int outputSize = minimumNumberOfPartsInEachSmallerList > 0 ? i_numberOfSmallerLists : remainder;
//In case remainder > 0 we want to spread the remainder equally between the others
if (remainder > 0)
{
if (minimumNumberOfPartsInEachSmallerList > 0)
{
normalizedSpreadNumber = (int)Math.Floor((double)i_numberOfSmallerLists / remainder);
}
else
{
normalizedSpreadNumber = 1;
}
}
List<List<T>> retVal = new List<List<T>>(outputSize);
int inputIndex = 0;
for (int i = 0; i < outputSize; ++i)
{
retVal.Add(new List<T>());
if (minimumNumberOfPartsInEachSmallerList > 0)
{
retVal[i].AddRange(i_bigList.GetRange(inputIndex, minimumNumberOfPartsInEachSmallerList));
inputIndex += minimumNumberOfPartsInEachSmallerList;
}
//If we have remainder take one from it, if our counter is equal to normalizedSpreadNumber.
if (remainder > 0)
{
if (normalizedSpreadRemainderCounter == normalizedSpreadNumber-1)
{
retVal[i].Add(i_bigList[inputIndex]);
remainder--;
inputIndex++;
normalizedSpreadRemainderCounter=0;
}
else
{
normalizedSpreadRemainderCounter++;
}
}
}
return retVal;
}
}

If order in these parts is not very important you can try this:
int[] array = new int[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
int n = 3;
var result =
array.Select((value, index) => new { Value = value, Index = index }).GroupBy(i => i.Index % n, i => i.Value);
// or
var result2 =
from i in array.Select((value, index) => new { Value = value, Index = index })
group i.Value by i.Index % n into g
select g;
However these can't be cast to IEnumerable<IEnumerable<int>> by some reason...

This is my code, nice and short.
<Extension()> Public Function Chunk(Of T)(ByVal this As IList(Of T), ByVal size As Integer) As List(Of List(Of T))
Dim result As New List(Of List(Of T))
For i = 0 To CInt(Math.Ceiling(this.Count / size)) - 1
result.Add(New List(Of T)(this.GetRange(i * size, Math.Min(size, this.Count - (i * size)))))
Next
Return result
End Function

This is my way, listing items and breaking row by columns
int repat_count=4;
arrItems.ForEach((x, i) => {
if (i % repat_count == 0)
row = tbo.NewElement(el_tr, cls_min_height);
var td = row.NewElement(el_td);
td.innerHTML = x.Name;
});

I was looking for a split like the one with string, so the whole List is splitted according to some rule, not only the first part, this is my solution
List<int> sequence = new List<int>();
for (int i = 0; i < 2000; i++)
{
sequence.Add(i);
}
int splitIndex = 900;
List<List<int>> splitted = new List<List<int>>();
while (sequence.Count != 0)
{
splitted.Add(sequence.Take(splitIndex).ToList() );
sequence.RemoveRange(0, Math.Min(splitIndex, sequence.Count));
}

Here is a little tweak for the number of items instead of the number of parts:
public static class MiscExctensions
{
public static IEnumerable<IEnumerable<T>> Split<T>(this IEnumerable<T> list, int nbItems)
{
return (
list
.Select((o, n) => new { o, n })
.GroupBy(g => (int)(g.n / nbItems))
.Select(g => g.Select(x => x.o))
);
}
}

below code returns both given number of chunks also with sorted data
static IEnumerable<IEnumerable<T>> SplitSequentially<T>(int chunkParts, List<T> inputList)
{
List<int> Splits = split(inputList.Count, chunkParts);
var skipNumber = 0;
List<List<T>> list = new List<List<T>>();
foreach (var count in Splits)
{
var internalList = inputList.Skip(skipNumber).Take(count).ToList();
list.Add(internalList);
skipNumber += count;
}
return list;
}
static List<int> split(int x, int n)
{
List<int> list = new List<int>();
if (x % n == 0)
{
for (int i = 0; i < n; i++)
list.Add(x / n);
}
else
{
// upto n-(x % n) the values
// will be x / n
// after that the values
// will be x / n + 1
int zp = n - (x % n);
int pp = x / n;
for (int i = 0; i < n; i++)
{
if (i >= zp)
list.Add((pp + 1));
else
list.Add(pp);
}
}
return list;
}

int[] items = new int[] { 0,1,2,3,4,5,6,7,8,9, 10 };
int itemIndex = 0;
int groupSize = 2;
int nextGroup = groupSize;
var seqItems = from aItem in items
group aItem by
(itemIndex++ < nextGroup)
?
nextGroup / groupSize
:
(nextGroup += groupSize) / groupSize
into itemGroup
select itemGroup.AsEnumerable();

Just came across this thread, and most of the solutions here involve adding items to collections, effectively materialising each page before returning it. This is bad for two reasons - firstly if your pages are large there's a memory overhead to filling the page, secondly there are iterators which invalidate previous records when you advance to the next one (for example if you wrap a DataReader within an enumerator method).
This solution uses two nested enumerator methods to avoid any need to cache items into temporary collections. Since the outer and inner iterators are traversing the same enumerable, they necessarily share the same enumerator, so it's important not to advance the outer one until you're done with processing the current page. That said, if you decide not to iterate all the way through the current page, when you move to the next page this solution will iterate forward to the page boundary automatically.
using System.Collections.Generic;
public static class EnumerableExtensions
{
/// <summary>
/// Partitions an enumerable into individual pages of a specified size, still scanning the source enumerable just once
/// </summary>
/// <typeparam name="T">The element type</typeparam>
/// <param name="enumerable">The source enumerable</param>
/// <param name="pageSize">The number of elements to return in each page</param>
/// <returns></returns>
public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> enumerable, int pageSize)
{
var enumerator = enumerable.GetEnumerator();
while (enumerator.MoveNext())
{
var indexWithinPage = new IntByRef { Value = 0 };
yield return SubPartition(enumerator, pageSize, indexWithinPage);
// Continue iterating through any remaining items in the page, to align with the start of the next page
for (; indexWithinPage.Value < pageSize; indexWithinPage.Value++)
{
if (!enumerator.MoveNext())
{
yield break;
}
}
}
}
private static IEnumerable<T> SubPartition<T>(IEnumerator<T> enumerator, int pageSize, IntByRef index)
{
for (; index.Value < pageSize; index.Value++)
{
yield return enumerator.Current;
if (!enumerator.MoveNext())
{
yield break;
}
}
}
private class IntByRef
{
public int Value { get; set; }
}
}

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

What does ParallelQuerys Count count? - c#

Related

For-loop looping too often because it's not exiting properly (performance)

Iterative Deepening Search in C#

Postfix increment into if, c#

How to improve compare of item in sorted List<MyItem> with item before and after current item?

Split a collection into `n` parts with LINQ? [duplicate]

Categories

Resources