How to implement lazy shuffling of Lists in C#? - c#

I am looking for an implementation of lazy shuffling in c#.
I only care about the time it takes to process the first couple of elements. I do not care whether or not the original list gets modified (i.e. removing elements would be fine). I do not care if the processing time gets longer as the iterator reaches the end of the list (as long as it stays within reasonable bounds of course).
Context: I have a large list from which I want to draw a relatively small number of random samples. In most cases I only need the very first random element, but in some rare cases I need all elements from the list.
If possible I would like to implement this as an extension method, like this (but answers without extension methods are fine too):
public static class Program
{
public static IEnumerable<T> lazy_shuffle<T>(this IEnumerable<T> input, Random r)
{
//do the magic
return input;
}
static void Main(string[] args)
{
var start = DateTime.Now;
var shuffled = Enumerable.Range(0, 1000000).lazy_shuffle(new Random(123));
var enumerate = shuffled.GetEnumerator();
foreach (var i in Enumerable.Range(0, 5))
{
enumerate.MoveNext();
Console.WriteLine(enumerate.Current);
}
Console.WriteLine($"time for shuffling 1000000 elements was {(DateTime.Now - start).TotalMilliseconds}ms");
}
}
Note:
input.OrderBy(i => r.Next()) would not be good enough, as it needs to iterate over the entire list once to generate one random number for each element of the list.
this is not a duplicate of Lazy Shuffle Algorithms because my question has less tight bounds for the algorithms but instead requires an implementation in c#
this is not a duplicate of Randomize a List<T> because that question is about regular shuffling and not lazy shuffling.
update:
A Count exists. Random access to elements exists. It is not strictly an IEnumerable, and instead just a big List or Array. I have updated the question to say "list" instead of "IEnumerable". Only the output of the lazy shuffler needs to be enumerable; the source can be an actual list.
The selection should be fair, i.e. each element needs to have the same chance to be picked first.
mutation/modification of the source-list is fine
In the end I only need to take N random elements from the list, but I do not know the N beforehand

Since the original list can be modified, here is a very simple and efficient solution, based on this answer:
public static IEnumerable<T> Shuffle<T>(this IList<T> list, Random rng)
{
for(int i = list.Count - 1; i >= 0; i--)
{
int swapIndex = rng.Next(i + 1);
yield return list[swapIndex];
list[swapIndex] = list[i];
}
}
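As a hedged variation on the answer above, the sketch below swaps the picked element to the end instead of overwriting it, so the source list remains a permutation of its original content after a partial enumeration. The names LazyShuffleExtensions and LazyShuffle are made up for this illustration; the selection logic is otherwise the same.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyShuffleExtensions
{
    // Hypothetical name; same idea as Shuffle above, but it swaps instead of overwriting.
    public static IEnumerable<T> LazyShuffle<T>(this IList<T> list, Random rng)
    {
        for (int i = list.Count - 1; i >= 0; i--)
        {
            // Pick uniformly among the elements that have not been returned yet.
            int swapIndex = rng.Next(i + 1);
            T picked = list[swapIndex];

            // Move the picked element to position i, which is never visited again.
            list[swapIndex] = list[i];
            list[i] = picked;

            yield return picked;
        }
    }
}
```

Because the method is an iterator, only as many random numbers and swaps are performed as elements are consumed, for example `bigList.LazyShuffle(new Random(123)).Take(5)`.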

Related

Looking for a data structure that is optimized for finding the next closest element

I have two classes, let's call them foo and bar, that both have a DateTime property called ReadingTime.
I then have long lists of these classes, let's say foos and bars, where foos is List<foo>, bars is List<bar>.
My goal is for every element in foos to find the events in bars that happened right before and right after foo.
Some code to clarify:
var foos = new List<foo>();
var bars = new List<bar>();
...
foreach (var foo in foos)
{
bar before = bars.Where(b => b.ReadingTime <= foo.ReadingTime).OrderByDescending(b => b.ReadingTime).FirstOrDefault();
bar after = bars.Where(b => b.ReadingTime > foo.ReadingTime).OrderBy(b => b.ReadingTime).FirstOrDefault();
...
}
My issue here is performance. Is it possible to use some other data structure than a list to speed up the comparisons? In particular the OrderBy statement every single time seems like a huge waste, having it pre-ordered should also speed up the comparisons, right?
I just don't know what data structure is best: SortedList, SortedSet, SortedDictionary etc., there seem to be so many. Also, all the information I find is on lookups, inserts, deletes, etc.; no one writes about finding the next closest element, so I'm not sure if anything is optimized for that.
I'm on .net core 3.1 if that matters.
Thanks in advance!
Edit: Okay so to wrap this up:
First I tried implementing @derloopkat's approach. For this I figured I needed a data type that could save the data in a sorted order so I just left it as IOrderedEnumerable (which is what LINQ returns). Probably not very smart, as that actually brought things to a crawl. I then tried going with SortedList. Had to remove some duplicates first which was no problem in my case. Thanks for the help @Olivier Rogier! This got me up to roughly 2x the original performance, though I suspect it's mostly the removed LINQ OrderBys. For now this is good enough, if/when I need more performance I'm going to go with what @CamiloTerevinto suggested.
Lastly @Aldert thank you for your time but I'm too noob and under too much time pressure to understand what you suggested. Still appreciate it and might revisit this later.
Edit2: Ended up going with @CamiloTerevinto's suggestion. Cut my runtime down from 10 hours to a couple of minutes.
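You don't need to sort bars ascending and descending on each iteration. Order bars just once before the loop by calling .OrderBy(f => f.ReadingTime), materialize the result (for example with .ToList()) so the sort is not repeated on every enumeration, and then use LastOrDefault() and FirstOrDefault().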
foreach (var foo in foos)
{
bar before = bars.LastOrDefault(b => b.ReadingTime <= foo.ReadingTime);
bar after = bars.FirstOrDefault(b => b.ReadingTime > foo.ReadingTime);
//...
}
This produces the same output as your code and runs faster.
For memory efficiency and strong typing, you can use the generic SortedDictionary or SortedList; the non-generic SortedList works with plain objects. Because you compare DateTime values, you don't need to implement a comparer.
What's the difference between SortedList and SortedDictionary?
SortedList<>, SortedDictionary<> and Dictionary<>
Difference between SortedList and SortedDictionary in C#
For speed optimization you can use a doubly linked list where each item points to the next and the previous items:
Doubly Linked List in C#
Linked List Implementation in C#
Using a linked list or a doubly linked list requires more memory, because each instance is embedded in a cell that also stores the next and previous references, but it can sometimes be the fastest way to parse and compare data, as well as to search, sort, reorder, add, remove and move items, because you manipulate linked references rather than an array.
You can also create powerful trees and manage data in a better way than with arrays.
You can use a binary search for quick lookups. Below is code where bars is sorted and each foo is looked up. You can read up on binary searches yourself and enhance the code by also sorting foos; in that case you can narrow the search range of bars...
The code generates two lists of 100 items each, then sorts bars and performs 100 binary searches.
using System;
using System.Collections.Generic;
namespace ConsoleApp2
{
class BaseReading
{
private DateTime readingTime;
public BaseReading(DateTime dt)
{
readingTime = dt;
}
public DateTime ReadingTime
{
get { return readingTime; }
set { readingTime = value; }
}
}
class Foo:BaseReading
{
public Foo(DateTime dt) : base(dt)
{ }
}
class Bar: BaseReading
{
public Bar(DateTime dt) : base(dt)
{ }
}
class ReadingTimeComparer: IComparer<BaseReading>
{
public int Compare(BaseReading x, BaseReading y)
{
return x.ReadingTime.CompareTo(y.ReadingTime);
}
}
class Program
{
static private List<BaseReading> foos = new List<BaseReading>();
static private List<BaseReading> bars = new List<BaseReading>();
static private Random ran = new Random();
static void Main(string[] args)
{
for (int i = 0; i< 100;i++)
{
foos.Add(new BaseReading(GetRandomDate()));
bars.Add(new BaseReading(GetRandomDate()));
}
var rtc = new ReadingTimeComparer();
bars.Sort(rtc);
foreach (BaseReading br in foos)
{
int index = bars.BinarySearch(br, rtc);
}
}
static DateTime GetRandomDate()
{
long randomTicks = ran.Next((int)(DateTime.MaxValue.Ticks >> 32));
randomTicks = (randomTicks << 32) + ran.Next();
return new DateTime(randomTicks);
}
}
}
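The remark above about also sorting foos can be taken one step further: once both lists are sorted, a single forward pass over bars is enough, with no search per element. The following is only a sketch under that assumption; MergeScan and FindNeighbors are hypothetical names, and it works on plain DateTime values rather than the classes above.

```csharp
using System;
using System.Collections.Generic;

public static class MergeScan
{
    // For each time in sortedFoos, returns the index in sortedBars of the last
    // element <= that time (Before) and of the first element > that time (After).
    // An index of -1 means no such element exists.
    // Both inputs must already be sorted ascending.
    public static List<(int Before, int After)> FindNeighbors(
        IReadOnlyList<DateTime> sortedFoos, IReadOnlyList<DateTime> sortedBars)
    {
        var result = new List<(int Before, int After)>(sortedFoos.Count);
        int next = 0; // index of the first bar strictly greater than the current foo

        foreach (DateTime foo in sortedFoos)
        {
            // The cursor only ever moves forward because foos are sorted too.
            while (next < sortedBars.Count && sortedBars[next] <= foo)
                next++;

            result.Add((next - 1, next < sortedBars.Count ? next : -1));
        }
        return result;
    }
}
```

The total work is proportional to the combined length of the two lists, plus the cost of sorting them once up front.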
The most direct built-in APIs in the .NET platform for finding the next closest element, with a computational complexity better than O(N), are the List.BinarySearch and Array.BinarySearch methods:
// Returns the zero-based index of item in the sorted List<T>, if item is found;
// otherwise, a negative number that is the bitwise complement of the index of
// the next element that is larger than item or, if there is no larger element,
// the bitwise complement of Count.
public int BinarySearch (T item, IComparer<T> comparer);
These APIs are not 100% robust, because the correctness of the results depends on whether the underlying data structure is already sorted, and the platform does not check or enforce this condition. It's up to you to ensure that the list or array is sorted with the correct comparer, before attempting to BinarySearch on it.
These APIs are also cumbersome to use: when a direct match is not found, you get the index of the next larger element as a bitwise complement, which is a negative number, and you have to apply the ~ operator to get the actual index. Subtract one from that index to get the closest item in the other direction.
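To make the index handling concrete, here is a minimal sketch built on List<T>.BinarySearch. NeighborSearch, UpperBound and FindNeighbors are hypothetical helper names, and the list is assumed to be sorted ascending with the default DateTime comparer. When the list contains duplicates, BinarySearch may return any of the matching indexes, so the sketch steps past equal values by hand.

```csharp
using System;
using System.Collections.Generic;

public static class NeighborSearch
{
    // Index of the first element strictly greater than 'time',
    // or sortedTimes.Count if there is none.
    public static int UpperBound(List<DateTime> sortedTimes, DateTime time)
    {
        int index = sortedTimes.BinarySearch(time);
        if (index < 0)
            return ~index; // no exact match: complement gives the next larger element

        // Exact match: move past all elements equal to 'time'.
        while (index < sortedTimes.Count && sortedTimes[index] == time)
            index++;
        return index;
    }

    // Before: index of the last element <= time, or -1.
    // After:  index of the first element > time, or -1.
    public static (int Before, int After) FindNeighbors(List<DateTime> sortedTimes, DateTime time)
    {
        int after = UpperBound(sortedTimes, time);
        return (after - 1, after < sortedTimes.Count ? after : -1);
    }
}
```

The linear step over equal values is fine for a handful of duplicates; with many duplicates a hand-written upper-bound binary search would be the safer choice.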
If you don't mind adding a third-party dependency to your app, you could consider the C5 library, which contains the TreeDictionary collection, with the interesting methods below:
// Find the entry in the dictionary whose key is the predecessor of the specified key.
public bool TryPredecessor(K key, out SCG.KeyValuePair<K, V> res);
// Find the entry in the dictionary whose key is the successor of the specified key.
public bool TrySuccessor(K key, out SCG.KeyValuePair<K, V> res);
There are also the TryWeakPredecessor and TryWeakSuccessor methods available, that consider an exact match as a predecessor or successor respectively. In other words they are analogous to the <= and >= operators.
C5 is a powerful and feature-rich library that offers lots of specialized collections; its main drawback is a somewhat idiosyncratic API.
You should get excellent performance by any of these options.

Splitting collection into equal batches in Parallel using c#

I am trying to split a collection into batches of equal size. Below is the code.
public static List<List<T>> SplitIntoBatches<T>(List<T> collection, int size)
{
var chunks = new List<List<T>>();
var count = 0;
var temp = new List<T>();
foreach (var element in collection)
{
if (count++ == size)
{
chunks.Add(temp);
temp = new List<T>();
count = 1;
}
temp.Add(element);
}
chunks.Add(temp);
return chunks;
}
Can we do it using Parallel.ForEach() for better performance, as we have around 1 million items in the list?
Thanks!
If performance is the concern, my thoughts (in increasing order of impact):
right-sizing the lists when you create them would save a lot of work, i.e. figure out the output batch sizes before you start copying, i.e. temp = new List<T>(thisChunkSize)
working with arrays would be more effective than working with lists - new T[thisChunkSize]
especially if you use BlockCopy (or CopyTo, which uses that internally) rather than copying individual elements one by one
once you've calculated the offsets for each of the chunks, the individual block-copies could probably be executed in parallel, but I wouldn't assume it will be faster - memory bandwidth will be the limiting factor at that point
but the ultimate fix is: don't copy the data at all, but instead just create ranges over the existing data; for example, if using arrays, ArraySegment<T> would help; if you're open to using newer .NET features, this is a perfect fit for Memory<T>/Span<T> - creating memory/span ranges over an existing array is essentially free and instant - i.e. take a T[] and return List<Memory<T>> or similar.
Even if you can't switch to ArraySegment<T> / Memory<T> etc, returning something like that could still be used - i.e. List<ListSegment<T>> where ListSegment<T> is something like:
readonly struct ListSegment<T> { // like ArraySegment<T>, but for List<T>
public List<T> List {get;}
public int Offset {get;}
public int Count {get;}
}
and have your code work with ListSegment<T> by processing the Offset and Count appropriately.
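As a rough illustration of the "don't copy the data at all" idea, the sketch below returns ArraySegment<T> ranges over an existing array. BatchExtensions and SplitIntoSegments are hypothetical names, and the source is assumed to be an array rather than a List<T>.

```csharp
using System;
using System.Collections.Generic;

public static class BatchExtensions
{
    // Splits 'source' into consecutive ranges of at most 'size' elements.
    // No elements are copied: each segment is a view over the original array.
    public static List<ArraySegment<T>> SplitIntoSegments<T>(T[] source, int size)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));

        // Right-size the outer list up front (ceiling division).
        var segments = new List<ArraySegment<T>>((source.Length + size - 1) / size);

        for (int offset = 0; offset < source.Length; offset += size)
        {
            int count = Math.Min(size, source.Length - offset);
            segments.Add(new ArraySegment<T>(source, offset, count));
        }
        return segments;
    }
}
```

Unlike the original SplitIntoBatches, this sketch returns an empty list for an empty input instead of a single empty batch, and changes to the source array remain visible through the segments.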

Finding 2-Tuple Combinations of IEnumerable<T> collection, C#

I would like to implement a method, that takes a collection of an unknown Type as a parameter and returns a Collection of 2-tuples which contains all possible distinct combinations from these elements (with no repetition). My Code:
public static IEnumerable<Tuple<T, T>> Get2Combinations<T>(this
IEnumerable<T> col)
{
/*foreach (var item1 in col)
{
col.GetEnumerator().MoveNext();
foreach (var item2 in col)
{
yield return new Tuple<T, T>(item1, item2);
}
}*/
for (int i = 0; i < col.Count(); i++)
{
for (int j = i + 1; j < col.Count(); j++)
{
yield return new Tuple<T, T>(col.ElementAt(i),
col.ElementAt(j));
}
}
}
What I'm doing is taking the first element and pairing it with every other one. Then, using the inner for loop, I loop through all the remaining ones. The problem I see is the method col.ElementAt(i). If we look into the source code, we see that if 'col' is of type IList, it gets the value at the given index directly, but for any other collection it would be very slow.
I attempted to deal with this using foreach loops (the commented section), which are efficient when using IEnumerable, but that part just doesn't work, because the enumerator is common to both the inner and the outer loop, and therefore this produces the set of all 2-tuples, some of which are repeated.
Would anyone give me some suggestions, how to improve this code?
The problem is that IEnumerable is designed to describe something you can iterate through (like a stream). It's not intended to support efficient random access (like an array).
Where you use Count() you are forcing the enumerable to iterate to its end, so in the case of a stream this will wait until the entire stream is read. Of course a stream might not support efficient direct access, or even buffer its content in memory (remember, it just promises to support enumeration), so subsequently calling ElementAt() could force it to re-read from the beginning to the position indicated.
The best way to solve this is to swap from IEnumerable to IList, which does support random access; clearly it could still perform poorly, but that's not the responsibility of your function.
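If the public signature has to stay on IEnumerable<T>, one possible compromise (a sketch, not the answerer's code) is to buffer the input into a list once and then index into that. The class name CombinationExtensions is made up; the method name is taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CombinationExtensions
{
    public static IEnumerable<Tuple<T, T>> Get2Combinations<T>(this IEnumerable<T> col)
    {
        // Reuse the input if it already supports indexing; otherwise
        // enumerate it exactly once into a buffer.
        IList<T> items = col as IList<T> ?? col.ToList();

        for (int i = 0; i < items.Count; i++)
        {
            for (int j = i + 1; j < items.Count; j++)
            {
                yield return Tuple.Create(items[i], items[j]);
            }
        }
    }
}
```

The buffering costs memory proportional to the input, which is small next to the roughly n*(n-1)/2 pairs that are produced.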

How to check if random values are unique?

C# code:
I have 20 random numbers between 1 and 100 in an array, and the program should check whether every value is unique. I need to use a separate method that returns true if the array contains only unique values and false otherwise. I would appreciate it if someone could help me with this.
bool allUnique = array.Distinct().Count() == array.Count(); // or array.Length
or
var uniqueNumbers = new HashSet<int>(array);
bool allUnique = uniqueNumbers.Count == array.Count();
A small alternative to @TimSchmelter's excellent answer that can run a bit more efficiently:
public static bool AllUniq<T> (this IEnumerable<T> data) {
HashSet<T> hs = new HashSet<T>();
return data.All(hs.Add);
}
This is essentially equivalent to the following foreach loop:
public static bool AllUniq<T> (this IEnumerable<T> data) {
HashSet<T> hs = new HashSet<T>();
foreach(T x in data) {
if(!hs.Add(x)) {
return false;
}
}
return true;
}
As soon as one hs.Add call fails (because the element already exists), the method returns false; if no duplicate is found, it returns true.
The reason that this can work faster is that it will stop the process from the moment a duplicate is found whereas the previously discussed approaches first construct a collection of unique numbers and then compare the size. Now if you iterate over large amount of numbers, constructing the entire distinct list can be computationally intensive.
Furthermore, note that there are cleverer ways to generate distinct random numbers than generate-and-test. For instance, interleave the generate and test procedures. I once had to fix a project that generated Sudokus this way; one had to wait entire days before it came up with a puzzle.
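One way to read "interleave the generate and test procedure" is to reject a duplicate at the moment it is drawn, rather than filling the whole array and checking afterwards. The sketch below assumes that reading; DistinctRandom and Generate are hypothetical names, and the range is assumed to be small enough that the integer arithmetic cannot overflow.

```csharp
using System;
using System.Collections.Generic;

public static class DistinctRandom
{
    // Returns 'count' distinct random values in [minInclusive, maxInclusive].
    public static int[] Generate(int count, int minInclusive, int maxInclusive, Random rng)
    {
        int range = maxInclusive - minInclusive + 1;
        if (count < 0 || count > range)
            throw new ArgumentOutOfRangeException(nameof(count));

        var seen = new HashSet<int>();
        var result = new int[count];
        int filled = 0;

        while (filled < count)
        {
            int candidate = rng.Next(minInclusive, maxInclusive + 1);

            // HashSet.Add returns false for a value that was already drawn,
            // so duplicates are discarded immediately.
            if (seen.Add(candidate))
                result[filled++] = candidate;
        }
        return result;
    }
}
```

For 20 values out of 100 the retries are rare; when count approaches the size of the range, a shuffle of the full range is usually the better fit.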
Here's a non linq solution
for(int i=0; i< YourArray.Length;i++)
{
for(int x=i+1; x< YourArray.Length; x++)
{
if(YourArray[i] == YourArray[x])
{
Console.WriteLine("Found repeated value");
}
}
}

Why do we need iterators in c#?

Can somebody provide a real life example regarding use of iterators. I tried searching google but was not satisfied with the answers.
You've probably heard of arrays and containers - objects that store a list of other objects.
But in order for an object to represent a list, it doesn't actually have to "store" the list. All it has to do is provide you with methods or properties that allow you to obtain the items of the list.
In the .NET framework, the interface IEnumerable is all an object has to support to be considered a "list" in that sense.
To simplify it a little (leaving out some historical baggage):
public interface IEnumerable<T>
{
IEnumerator<T> GetEnumerator();
}
So you can get an enumerator from it. That interface looks like this (again, simplifying slightly to remove distracting noise):
public interface IEnumerator<T>
{
bool MoveNext();
T Current { get; }
}
So to loop through a list, you'd do this:
var e = list.GetEnumerator();
while (e.MoveNext())
{
var item = e.Current;
// blah
}
This pattern is captured neatly by the foreach keyword:
foreach (var item in list)
// blah
But what about creating a new kind of list? Yes, we can just use List<T> and fill it up with items. But what if we want to discover the items "on the fly" as they are requested? There is an advantage to this, which is that the client can abandon the iteration after the first three items, and they don't have to "pay the cost" of generating the whole list.
To implement this kind of lazy list by hand would be troublesome. We would have to write two classes, one to represent the list by implementing IEnumerable<T>, and the other to represent an active enumeration operation by implementing IEnumerator<T>.
Iterator methods do all the hard work for us. We just write:
IEnumerable<int> GetNumbers(int stop)
{
for (int n = 0; n < stop; n++)
yield return n;
}
And the compiler converts this into two classes for us. Calling the method is equivalent to constructing an object of the class that represents the list.
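For comparison, a hand-written version of GetNumbers might look roughly like the sketch below. NumberSequence and NumberEnumerator are made-up names, and the classes the compiler actually generates are structured differently; the point is only the amount of ceremony that yield return removes.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Represents the "list": the numbers 0 .. stop-1, none of which are stored.
public sealed class NumberSequence : IEnumerable<int>
{
    private readonly int _stop;

    public NumberSequence(int stop) { _stop = stop; }

    public IEnumerator<int> GetEnumerator() => new NumberEnumerator(_stop);

    // The non-generic member is part of the "historical baggage" mentioned above.
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    // Represents one active enumeration, with its own position.
    private sealed class NumberEnumerator : IEnumerator<int>
    {
        private readonly int _stop;
        private int _current = -1; // positioned before the first element

        public NumberEnumerator(int stop) { _stop = stop; }

        public int Current => _current;
        object IEnumerator.Current => _current;

        public bool MoveNext()
        {
            if (_current + 1 >= _stop) return false;
            _current++;
            return true;
        }

        public void Reset() { _current = -1; }
        public void Dispose() { }
    }
}
```

A caller uses it like any other sequence, for example `foreach (int n in new NumberSequence(3))`.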
Iterators are an abstraction that decouples the concept of position in a collection from the collection itself. The iterator is a separate object storing the necessary state to locate an item in the collection and move to the next item in the collection. I have seen collections that kept that state inside the collection (i.e. a current position), but it is often better to move that state to an external object. Among other things it enables you to have multiple iterators iterating the same collection.
Simple example : a function that generates a sequence of integers :
static IEnumerable<int> GetSequence(int fromValue, int toValue)
{
if (toValue >= fromValue)
{
for (int i = fromValue; i <= toValue; i++)
{
yield return i;
}
}
else
{
for (int i = fromValue; i >= toValue; i--)
{
yield return i;
}
}
}
To do it without an iterator, you would need to create an array then enumerate it...
Iterate through the students in a class
The Iterator design pattern provides us with a common method of enumerating a list of items or array, while hiding the details of the list's implementation. This provides a cleaner use of the array object and hides unnecessary information from the client, ultimately leading to better code-reuse, enhanced maintainability, and fewer bugs. The iterator pattern can enumerate the list of items regardless of their actual storage type.
Iterate through a set of homework questions.
But seriously, Iterators can provide a unified way to traverse the items in a collection regardless of the underlying data structure.
Read the first two paragraphs here for a little more info.
A couple of things they're great for:
a) For 'perceived performance' while maintaining code tidiness - the iteration of something separated from other processing logic.
b) When the number of items you're going to iterate through is not known.
Although both can be done through other means, with iterators the code can be made nicer and tidier, as someone calling the iterator doesn't need to worry about how it finds the stuff to iterate through...
Real life example: enumerating directories and files, and finding the first [n] that fulfill some criteria, e.g. a file containing a certain string or sequence etc...
Beside everything else, to iterate through lazy-type sequences - IEnumerators. Each next element of such sequence may be evaluated/initialized upon iteration step which makes it possible to iterate through infinite sequences using finite amount of resources...
The canonical and simplest example is that it makes infinite sequences possible without the complexity of having to write the class to do that yourself:
// generate every prime number
public IEnumerator<int> GetPrimeEnumerator()
{
yield return 2;
var primes = new List<int>();
primes.Add(2); // the list is declared as 'primes' above
Func<int, bool> IsPrime = n => primes.TakeWhile(
p => p <= (int)Math.Sqrt(n)).FirstOrDefault(p => n % p == 0) == 0;
for (int i = 3; true; i += 2)
{
if (IsPrime(i))
{
yield return i;
primes.Add(i);
}
}
}
Obviously this would not be truly infinite unless you used a BigInteger instead of int, but it gives you the idea.
Writing this code (or similar) for each generated sequence would be tedious and error prone. The iterators do that for you. If the above example seems too complex, consider:
// generate every power of a number from start^0 to start^n
public IEnumerator<int> GetPowersEnumerator(int start)
{
yield return 1; // anything ^0 is 1
var x = start;
while(true)
{
yield return x;
x *= start;
}
}
They come at a cost though. Their lazy behaviour means you cannot spot common errors (null parameters and the like) until the generator is first consumed, rather than when it is created, unless you write wrapper functions that check first. The current implementation is also incredibly bad(1) if used recursively.
Writing enumerations over complex structures like trees and object graphs is much easier, as the state maintenance is largely done for you; you simply write code to visit each item and don't need to worry about getting back to it.
(1) I don't use this word lightly: an O(N) iteration can become O(N^2).
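The wrapper approach mentioned above for catching bad arguments early can be sketched as follows. SafeIterators, DoubleAll and DoubleAllIterator are hypothetical names: the public method is an ordinary method that validates immediately, and only the private method is an iterator.

```csharp
using System;
using System.Collections.Generic;

public static class SafeIterators
{
    // Not an iterator: the null check runs as soon as the method is called.
    public static IEnumerable<int> DoubleAll(IEnumerable<int> source)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        return DoubleAllIterator(source);
    }

    // The iterator body only starts running on the first MoveNext call.
    private static IEnumerable<int> DoubleAllIterator(IEnumerable<int> source)
    {
        foreach (int n in source)
            yield return n * 2;
    }
}
```

Had the check been placed inside the iterator itself, the exception would only surface when the caller first enumerates the result, possibly far away from the faulty call.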
An iterator is an easy way of implementing the IEnumerator interface. Instead of making a class that has the methods and properties required for the interface, you just make a method that returns the values one by one and the compiler creates a class with the methods and properties needed to implement the interface.
If you for example have a large list of numbers, and you want to return a collection where each number is multiplied by two, you can make an iterator that returns the numbers instead of creating a copy of the list in memory:
public IEnumerable<int> GetDouble() {
foreach (int n in originalList) yield return n * 2;
}
In C# 3 you can do something quite similar using extension methods and lambda expressions:
originalList.Select(n => n * 2)
Or using LINQ query syntax:
from n in originalList select n * 2
IEnumerator<Question> myIterator = listOfStackOverFlowQuestions.GetEnumerator();
while (myIterator.MoveNext())
{
Question q;
q = myIterator.Current;
if (q.Pertinent == true)
PublishQuestion(q);
else
SendMessage(q.Author.EmailAddress, "Your question has been rejected");
}
foreach (Question q in listOfStackOverFlowQuestions)
{
if (q.Pertinent == true)
PublishQuestion(q);
else
SendMessage(q.Author.EmailAddress, "Your question has been rejected");
}
