Random sample from an IEnumerable generated by yielding elements? - c#

I have a method that returns an IEnumerable, using the yield keyword to produce elements lazily as it executes. I don't always know how big the total collection is. It's similar to the standard Fibonacci example on Try .NET, except that it yields a finite number of elements. That said, because there's no way of knowing beforehand how many elements it will return, it could keep yielding more or less forever if there are too many.
When I looked for other questions about this topic here, one of the answers provided a clean LINQ query to randomly sample N elements from a collection. However, that answer assumed the collection was static. If you go to the Try .NET website and modify the code to use the random sampling implementation from that answer, you will get an infinite loop.
public static void Main()
{
    foreach (var i in Fibonacci().OrderBy(f => Guid.NewGuid()).Take(20))
    {
        Console.WriteLine(i);
    }
}
The query tries to order all the elements of the returned IEnumerable, but to order them it must first enumerate them all, and since there are infinitely many, it never finishes and never returns an ordered collection.
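For reference, the generator in question is a yield-based iterator along these lines (a sketch; the actual Try .NET sample may differ):
// An unbounded yield-based iterator: each value is produced lazily,
// and enumeration only stops when the consumer stops pulling.
public static IEnumerable<long> Fibonacci()
{
    long current = 0, next = 1;
    while (true)
    {
        yield return current;
        (current, next) = (next, current + next);
    }
}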
So what would be a good strategy for randomly sampling an IEnumerable with an unknown number of contained elements? Is it even possible?

If a sequence is infinite, you can't select N elements from it in less than infinite time such that every element in the sequence has the same chance of being selected.
However it IS possible to select N items from a sequence of unknown, but finite, length. You can do that using reservoir sampling.
Here's an example implementation:
/// <summary>
/// This uses Reservoir Sampling to select <paramref name="n"/> items from a sequence of items of unknown length.
/// The sequence must contain at least <paramref name="n"/> items.
/// </summary>
/// <typeparam name="T">The type of items in the sequence from which to randomly choose.</typeparam>
/// <param name="items">The sequence of items from which to randomly choose.</param>
/// <param name="n">The number of items to randomly choose from <paramref name="items"/>.</param>
/// <param name="rng">A random number generator.</param>
/// <returns>The randomly chosen items.</returns>
public static T[] RandomlySelectedItems<T>(IEnumerable<T> items, int n, System.Random rng)
{
    // See http://en.wikipedia.org/wiki/Reservoir_sampling for details.
    var result = new T[n];
    int index = 0;
    int count = 0;
    foreach (var item in items)
    {
        if (index < n)
        {
            result[count++] = item;
        }
        else
        {
            int r = rng.Next(0, index + 1);
            if (r < n)
                result[r] = item;
        }
        ++index;
    }
    if (index < n)
        throw new ArgumentException("Input sequence too short");
    return result;
}
This must still iterate over the entire sequence, however, so it does NOT work for an infinitely long sequence.
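For example, assuming the Fibonacci() iterator sketched earlier, you can first bound the sequence and then sample from it (the cutoff below is an arbitrary example, not part of the original answer):
// Bound the otherwise-infinite generator, then reservoir-sample 20 items.
var rng = new Random();
var sample = RandomlySelectedItems(Fibonacci().TakeWhile(f => f < 1_000_000_000), 20, rng);
foreach (var value in sample)
    Console.WriteLine(value);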
If you want to support input sequences longer than 2^31, you can use longs in the implementation like so:
public static T[] RandomlySelectedItems<T>(IEnumerable<T> items, int n, System.Random rng)
{
    // See http://en.wikipedia.org/wiki/Reservoir_sampling for details.
    var result = new T[n];
    long index = 0;
    int count = 0;
    foreach (var item in items)
    {
        if (index < n)
        {
            result[count++] = item;
        }
        else
        {
            long r = rng.NextInt64(0, index + 1);
            if (r < n)
                result[r] = item;
        }
        ++index;
    }
    if (index < n)
        throw new ArgumentException("Input sequence too short");
    return result;
}
Note that this implementation requires .NET 6.0 or higher because of rng.NextInt64().
Also note that there's no point in making n long because you can't have an array that exceeds ~2^31 elements, so it wouldn't be possible to fill it. You could in theory fix that by returning multiple arrays, but I'll leave that as an exercise for the reader. ;)
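If you're stuck on a runtime older than .NET 6, you could substitute a helper for rng.NextInt64(); the NextInt64Compat method below is a hypothetical shim (not part of the original answer) that derives a uniform long from random bytes via rejection sampling:
// Hypothetical pre-.NET 6 replacement for Random.NextInt64(min, max).
// Assumes maxExclusive > minInclusive.
public static long NextInt64Compat(this Random rng, long minInclusive, long maxExclusive)
{
    ulong range = (ulong)(maxExclusive - minInclusive);
    // Largest multiple of 'range' that fits below 2^64; draws at or
    // above this limit are rejected so the final modulo stays unbiased.
    ulong limit = ulong.MaxValue - (ulong.MaxValue % range);
    var buffer = new byte[8];
    ulong draw;
    do
    {
        rng.NextBytes(buffer);
        draw = BitConverter.ToUInt64(buffer, 0);
    } while (draw >= limit);
    return minInclusive + (long)(draw % range);
}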

Related

Combination of a list of lists so that each combination has unique elements

OK, so I have a list of lists, like the title says, and I want to make combinations of K lists in which every list has different elements from the rest.
Example:
I have the following list of lists:
{ {1,2,3} , {1,11} , {2,3,6} , {6,5,7} , {4,8,9} }
A valid 3-sized combination of these lists could be:
{ {1,11}, {4,8,9} ,{6,5,7} }
This is only ONE of the valid combinations, what I want to return is a list of all the valid combinations of K lists.
An invalid combination would be:
{ {1,11} ,{2, 3, 6}, {6, 5, 7} }
because the element 6 is present in the second and third list.
I already have code that does this, but it just generates all possible combinations and checks whether each is valid before adding it to a final result list. As this list of lists is quite large (153 lists), when K gets bigger the time taken grows ridiculously too (at K = 5 it takes about 10 minutes).
I want to see if there's an efficient way of doing this.
Below is my current code (the lists I want to combine are attribute of the class Item):
public void recursiveComb(List<Item> arr, int len, int startPosition, Item[] result)
{
    if (len == 0)
    {
        if (valid(result.ToList()))
        {
            //Here I add the result to final list
            //valid is just a function that checks if any list has repeated elements in other
        }
        return;
    }
    for (int i = startPosition; i <= arr.Count - len; i++)
    {
        result[result.Length - len] = arr[i];
        recursiveComb(arr, len - 1, i + 1, result);
    }
}
Use a HashSet
https://msdn.microsoft.com/en-us/library/bb359438(v=vs.110).aspx
to keep track of distinct elements as you build the output from the candidates in the input list of lists/tuples.
Accumulate an output list of non-overlapping tuples by iterating across the input list of tuples and evaluating each tuple as a candidate as follows:
For each input tuple, insert each of its elements into the HashSet. If an element you are trying to insert is already in the set, then the tuple fails the constraint and should be skipped; otherwise the tuple's elements are all distinct from the ones already in the output.
The HashSet effectively maintains a registry of the distinct items across your accepted list of tuples.
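Here is a minimal sketch of that greedy accumulation (my reading of the approach described above, not the answerer's actual code); note that it produces one valid combination rather than enumerating all K-sized combinations:
// Requires System.Collections.Generic and System.Linq.
// Greedily accepts each list whose elements don't collide with the
// elements of any previously accepted list.
public static List<List<int>> AccumulateDisjoint(List<List<int>> input)
{
    var seen = new HashSet<int>();
    var output = new List<List<int>>();
    foreach (var candidate in input)
    {
        // Skip the candidate if any of its elements is already registered.
        if (candidate.Any(e => seen.Contains(e)))
            continue;
        output.Add(candidate);
        seen.UnionWith(candidate); // register the accepted elements
    }
    return output;
}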
If I understood your code correctly, you are passing each List<int> from your input to the recursiveComb() function, which looks like this:
for (int i = 0; i < inputnestedList.Count; i++)
{
    recursiveComb();
    // Inside of recursiveComb() you are using one more for loop with recursion.
    // This I observed from your first parameter, i.e. List<int>
}
Correct me if I am wrong.
This leads to a time complexity of more than O(n^2).
Here is my simplest solution, with two for loops and no recursion:
List<List<int>> x = new List<List<int>>
{
    new List<int>(){1,2,3},
    new List<int>(){1,11},
    new List<int>(){2,3,6},
    new List<int>(){6,5,7},
    new List<int>(){4,8,9}
};
List<List<int>> result = new List<List<int>>();
var watch = Stopwatch.StartNew();
for (int i = 0; i < x.Count; i++)
{
    int temp = 0;
    for (int j = 0; j < x.Count; j++)
    {
        if (i != j && x[i].Intersect(x[j]).Any())
            temp++;
    }
    // This condition decides whether the elements of the ith list appear in other lists
    if (temp <= 1)
        result.Add(x[i]);
}
watch.Stop();
var elapsedMs = watch.Elapsed.TotalMilliseconds;
Console.WriteLine(elapsedMs);
When I print the execution time, the output is:
Execution Time: 11.4628
Check the execution time of your code; if it is higher than mine, then you can consider this the more efficient code.
Proof of code: DotNetFiddle
Happy coding
If I understood your problem correctly, then this will work:
/// <summary>
/// Get Unique List sets
/// </summary>
/// <param name="sets"></param>
/// <returns></returns>
public List<List<T>> GetUniqueSets<T>(List<List<T>> sets)
{
    List<List<T>> cache = new List<List<T>>();
    for (int i = 0; i < sets.Count; i++)
    {
        // add to cache if it's empty
        if (cache.Count == 0)
        {
            cache.Add(sets[i]);
            continue;
        }
        else
        {
            // check whether the current item is already in the cache and whether it intersects with any of the items in the cache
            var cacheItems = from item in cache
                             where item != sets[i] && item.Intersect(sets[i]).Count() == 0
                             select item;
            // if not, add it to the cache
            if (cacheItems.Count() == cache.Count)
            {
                cache.Add(sets[i]);
            }
        }
    }
    return cache;
}
Tested; it's fast and took 00:00:00.0186033 to find the sets.
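For instance, with the lists from the question (a hypothetical usage sketch):
var sets = new List<List<int>>
{
    new List<int>{1,2,3}, new List<int>{1,11}, new List<int>{2,3,6},
    new List<int>{6,5,7}, new List<int>{4,8,9}
};
// Greedily keeps each list that doesn't intersect anything kept so far:
// { {1,2,3}, {6,5,7}, {4,8,9} }
var unique = GetUniqueSets(sets);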

Quickest way to find position of item less than or equal to double in sorted list C#

I am exploring the fastest way to iterate through three sorted lists to find the position of the first item that is equal to or less than a double value. The lists contain two columns of doubles.
I have the following two working examples attached below; these are enclosed in a bigger while loop (which also modifies the currentPressure list, changing the [0] value). But considering the number of rows (500,000+) being parsed by the bigger while loop, the code below is too slow (one iteration of the three while loops takes >20 ms).
"allPressures" contains all rows while currentPressure is modified by the remaining code. The while loops are used to align the time from the Flow, Rpm and Position lists to the Time in the pressure list.
In other words, I am trying to find the quickest way to determine the x for which, for instance:
FlowList[x].Time <= currentPressure[0].Time
Any suggestions are greatly appreciated!
Examples:
for (int i = 0; i < allPressures.Count; i++)
{
    if (FlowList[i].Time >= currentPressure[0].Time)
    {
        fl = i;
        break;
    }
}

for (int i = 0; i < allPressures.Count; i++)
{
    if (RpmList[i].Time >= currentPressure[0].Time)
    {
        rp = i;
        break;
    }
}

for (int i = 0; i < allPressures.Count; i++)
{
    if (PositionList[i].Time >= currentPressure[0].Time)
    {
        bp = i;
        break;
    }
}
Using while loop:
while (FlowList[fl].Time < currentPressure[0].Time)
{
    fl++;
}

while (RpmList[rp].Time < currentPressure[0].Time)
{
    rp++;
}

while (PositionList[bp].Time < currentPressure[0].Time)
{
    bp++;
}
The problem is that you are doing a linear search. This means that in the worst case you are iterating over all the elements in your lists. This gives you a computational complexity of O(3*n), where n is the length of your lists and 3 is the number of lists you are searching.
Since your lists are sorted you can use the much faster binary search which has a complexity of O(log(n)) and in your case O(3*log(n)).
Luckily you don't have to implement it yourself, because .NET offers the helper method List.BinarySearch(). You will need the one that takes a custom comparer, because you want to compare PressureData objects.
Since you are looking for the index of the closest value that's less than your search value, you'll have to use a little trick: when BinarySearch() doesn't find a matching value, it returns the bitwise complement of the index of the next element that is larger than the search value. From this it's easy to find the previous element that is smaller than the search value.
Here is an extension method the implements this:
public static int FindMaxIndex<T>(
    this List<T> sortedList, T inclusiveUpperBound, IComparer<T> comparer = null)
{
    var index = sortedList.BinarySearch(inclusiveUpperBound, comparer);

    // The max value was found in the list. Just return its index.
    if (index >= 0)
        return index;

    // The max value was not found and "~index" is the index of the
    // next value greater than the search value.
    index = ~index;

    // There are values in the list less than the search value.
    // Return the index of the closest one.
    if (index > 0)
        return index - 1;

    // All values in the list are greater than the search value.
    return -1;
}
Test it at https://dotnetfiddle.net/kLZsM5
Use this method with a comparer that understands PressureData objects:
var pdc = Comparer<PressureData>.Create((x, y) => x.Time.CompareTo(y.Time));
var fl = FlowList.FindMaxIndex(currentPressure[0], pdc);
Here is a working example: https://dotnetfiddle.net/Dmgzsv

C#/.NET - Performance degrading with minimally-parallelized Quicksort

I am currently working on a recursively parallel Quicksort extension function for the List class. The code below represents the most basic thread distribution criteria I've considered because it should be the simplest to conceptually explain. It branches to the depth of the base-2 logarithm of the number of detected processors, and proceeds sequentially from there. Thus, each CPU should get one thread with a (roughly) equal, large share of data to process, avoiding excessive overhead time. The basic sequential algorithm is provided for comparison.
public static class Quicksort
{
    /// <summary>
    /// Helper class to hold information about when to parallelize
    /// </summary>
    /// <attribute name="maxThreads">Maximum number of supported threads</attribute>
    /// <attribute name="threadDepth">The depth to which new threads should
    /// automatically be made</attribute>
    private class ThreadInfo
    {
        internal int maxThreads;
        internal int threadDepth;

        public ThreadInfo(int length)
        {
            maxThreads = Environment.ProcessorCount;
            threadDepth = (int)Math.Log(maxThreads, 2);
        }
    }

    /// <summary>
    /// Helper function to perform the partitioning step of quicksort
    /// </summary>
    /// <param name="list">The list to partition</param>
    /// <param name="start">The starting index</param>
    /// <param name="end">The ending index</param>
    /// <returns>The final index of the pivot</returns>
    public static int Partition<T>(this List<T> list, int start, int end) where T : IComparable
    {
        int middle = (start + end) / 2;

        // Swap pivot and first item.
        var temp = list[start];
        list[start] = list[middle];
        list[middle] = temp;

        var pivot = list[start];
        var swapPtr = start;
        for (var cursor = start + 1; cursor <= end; cursor++)
        {
            if (list[cursor].CompareTo(pivot) < 0)
            {
                // Swap cursor element and designated swap element
                temp = list[cursor];
                list[cursor] = list[++swapPtr];
                list[swapPtr] = temp;
            }
        }

        // Swap pivot with final lower item
        temp = list[start];
        list[start] = list[swapPtr];
        list[swapPtr] = temp;

        return swapPtr;
    }

    /// <summary>
    /// Method to begin parallel quicksort algorithm on a Comparable list
    /// </summary>
    /// <param name="list">The list to sort</param>
    public static void QuicksortParallel<T>(this List<T> list) where T : IComparable
    {
        if (list.Count < 2048)
            list.QuicksortSequential();
        else
        {
            var info = new ThreadInfo(list.Count);
            list.QuicksortRecurseP(0, list.Count - 1, 0, info);
        }
    }

    /// <summary>
    /// Method to implement parallel quicksort recursion on a Comparable list
    /// </summary>
    /// <param name="list">The list to sort</param>
    /// <param name="start">The starting index of the partition</param>
    /// <param name="end">The ending index of the partition (inclusive)</param>
    /// <param name="depth">The current recursive depth</param>
    /// <param name="info">Structure holding decision-making info for threads</param>
    private static void QuicksortRecurseP<T>(this List<T> list, int start, int end, int depth,
                                             ThreadInfo info)
        where T : IComparable
    {
        if (start >= end)
            return;

        int middle = list.Partition(start, end);

        if (depth < info.threadDepth)
        {
            var t = Task.Run(() =>
            {
                list.QuicksortRecurseP(start, middle - 1, depth + 1, info);
            });
            list.QuicksortRecurseP(middle + 1, end, depth + 1, info);
            t.Wait();
        }
        else
        {
            list.QuicksortRecurseS(start, middle - 1);
            list.QuicksortRecurseS(middle + 1, end);
        }
    }

    /// <summary>
    /// Method to begin sequential quicksort algorithm on a Comparable list
    /// </summary>
    /// <param name="list">The list to sort</param>
    public static void QuicksortSequential<T>(this List<T> list) where T : IComparable
    {
        list.QuicksortRecurseS(0, list.Count - 1);
    }

    /// <summary>
    /// Method to implement sequential quicksort recursion on a Comparable list
    /// </summary>
    /// <param name="list">The list to sort</param>
    /// <param name="start">The starting index of the partition</param>
    /// <param name="end">The ending index of the partition (inclusive)</param>
    private static void QuicksortRecurseS<T>(this List<T> list, int start, int end) where T : IComparable
    {
        if (start >= end)
            return;

        int middle = list.Partition(start, end);

        // Now recursively sort the (approximate) halves.
        list.QuicksortRecurseS(start, middle - 1);
        list.QuicksortRecurseS(middle + 1, end);
    }
}
As far as I understand, this methodology should incur a one-time startup cost, then sort the rest of the data significantly faster than the sequential method. However, the parallel method takes significantly more time than the sequential method, and the gap actually widens as the load increases. Benchmarked on a list of ten million items on a 4-core CPU, the sequential method averages around 18 seconds to run to completion, while the parallel method takes upwards of 26 seconds. Increasing the allowed thread depth rapidly exacerbates the problem.
Any help in finding the performance hog is appreciated. Thanks!
The problem is CPU cache conflict, also known as "false sharing."
Unless the memory address of the pivot point happens to fall on a cache-line boundary, one of the threads will get exclusive ownership of the cache line in L1 or L2 and the other one will have to wait. This can make performance even worse than a serial solution. The problem is described well in this article:
...where threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time. [1,2] This causes real but invisible performance contention; whichever thread currently has exclusive ownership so that it can physically perform an update to the cache line will silently throttle other threads that are trying to use different (but, alas, nearby) data that sits on the same line.
(snip)
In most cases, the parallel code actually ran slower than the sequential code, and in no case did we get any better than a 42% speedup no matter how many cores we threw at the problem.
Grasping at straws.... you may be able to do better if you separate the list into two objects, and pin them in memory (or even just pad them within a struct) so they are far enough apart that they won't have a cache conflict.
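As a rough illustration of the padding idea (a hypothetical sketch, not measured against this code), you can keep each thread's hot state in a padded struct so two threads' data can't share a cache line:

using System.Runtime.InteropServices;

// Force each instance onto its own 64-byte slot (64 bytes is a common
// cache-line size; that figure is an assumption, not a runtime guarantee).
[StructLayout(LayoutKind.Explicit, Size = 64)]
public struct PaddedCounter
{
    [FieldOffset(0)]
    public long Value;
}

// Usage sketch: one padded slot per worker, so updates to different
// slots never contend for the same cache line.
// var counters = new PaddedCounter[Environment.ProcessorCount];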

C# the name brojevi does not exist in current context

I want to make a simple program (a lottery number generator) that takes numbers within a specific range and shuffles them "n" times; after each shuffle it selects one random number and moves it from the list of the given range to a new list, and it does this "n" times (until it selects a specific amount of numbers, 7 to be exact). I found an algorithm that does exactly that (an extension method for shuffling generic lists). But I'm not that into programming, and I have a problem with displaying the results (the list with the drawn numbers) in a TextBox or Label; I did, however, get it to work with a MessageBox. With TextBox/Label I get the error "The name * does not exist in the current context". I've googled for a solution but no help whatsoever.
Here's the code:
private void button1_Click(object sender, EventArgs e)
{
    List<int> numbers;
    numbers = Enumerable.Range(1, 39).ToList();
    numbers.Shuffle();
}

private void brojevi_TextChanged(object sender, EventArgs e)
{
}
} // end of Form class
} // end of namespace

/// <summary>
/// Class for shuffling lists
/// </summary>
/// <typeparam name="T">The type of list to shuffle</typeparam>
public static class ListShufflerExtensionMethods
{
    //for getting random values
    private static Random _rnd = new Random();

    /// <summary>
    /// Shuffles the contents of a list
    /// </summary>
    /// <typeparam name="T">The type of the list to sort</typeparam>
    /// <param name="listToShuffle">The list to shuffle</param>
    /// <param name="numberOfTimesToShuffle">How many times to shuffle the list, by default this is 7 times</param>
    public static void Shuffle<T>(this List<T> listToShuffle, int numberOfTimesToShuffle = 7)
    {
        //make a new list of the wanted type
        List<T> newList = new List<T>();

        //for each time we want to shuffle
        for (int i = 0; i < numberOfTimesToShuffle; i++)
        {
            //while there are still items in our list
            while (listToShuffle.Count >= 33)
            {
                //get a random number within the list
                int index = _rnd.Next(listToShuffle.Count);

                //add the item at that position to the new list
                newList.Add(listToShuffle[index]);

                //and remove it from the old list
                listToShuffle.RemoveAt(index);
            }

            //then copy all the items back in the old list again
            listToShuffle.AddRange(newList);

            //display contents of a list
            string line = string.Join(",", newList.ToArray());
            brojevi.Text = line;

            //and clear the new list
            //to make ready for next shuffling
            newList.Clear();
            break;
        }
    }
} // end of extension class
} // end of namespace
The problem is that brojevi (either a TextBox or a Label) isn't defined in the scope of the extension method; it is a Control, so it is defined in your Form. So when you shuffle your numbers, put them in the TextBox during the execution of the button1_Click event handler.
Remove the lines:
string line = string.Join(",", newList.ToArray());
brojevi.Text = line;
EDIT:
You could change the extension method like this to return the string of drawn items or the list of drawn items. Let's go for the list, because you might want to use the numbers for other things. Also, I don't see the point in shuffling 7 times, because you will only see the last shuffle; therefore I think one pass is enough. Check the code:
public static List<T> Shuffle<T>(this List<T> listToShuffle)
{
    //make a new list of the wanted type
    List<T> newList = new List<T>();

    //while there are still items in our list
    while (listToShuffle.Count >= 33)
    {
        //get a random number within the list
        int index = _rnd.Next(listToShuffle.Count);

        //add the item at that position to the new list
        newList.Add(listToShuffle[index]);

        //and remove it from the old list
        listToShuffle.RemoveAt(index);
    }

    //then copy all the items back in the old list again
    listToShuffle.AddRange(newList);

    return newList;
}
And in the button1_Click event handler we can have:
List<int> numbers;
numbers = Enumerable.Range(1, 39).ToList();
List<int> drawnNumbers = numbers.Shuffle();
string line = string.Join(",", drawnNumbers.ToArray());
brojevi.Text = line;
ListShufflerExtensionMethods doesn't know about your textbox (brojevi) because it's out of scope. You could restructure and make Shuffle return a value, then set the textbox's text in the caller's scope.

Why is removing by index from an IList performing so much worse than removing by item from an ISet?

Edit: I will add some benchmark results. Up to about 1000-5000 items in the list, IList and RemoveAt beat ISet and Remove, but that's nothing to worry about since the differences are marginal. The real fun begins when the collection size extends to 10000 and more, so I'm posting only that data.
I was answering a question here last night and faced a bizarre situation.
First a set of simple methods:
static Random rnd = new Random();

public static int GetRandomIndex<T>(this ICollection<T> source)
{
    return rnd.Next(source.Count);
}

public static T GetRandom<T>(this IList<T> source)
{
    return source[source.GetRandomIndex()];
}
Let's say I'm removing N items from a collection randomly. I would write this function:
public static void RemoveRandomly1<T>(this ISet<T> source, int countToRemove)
{
    int countToRemain = source.Count - countToRemove;
    var inList = source.ToList();
    int i = 0;
    while (source.Count > countToRemain)
    {
        source.Remove(inList.GetRandom());
        i++;
    }
}
or
public static void RemoveRandomly2<T>(this IList<T> source, int countToRemove)
{
    int countToRemain = source.Count - countToRemove;
    int j = 0;
    while (source.Count > countToRemain)
    {
        source.RemoveAt(source.GetRandomIndex());
        j++;
    }
}
As you can see, the first function is written for an ISet and the second for a normal IList. In the first function I'm removing by item from the ISet and in the second by index from the IList, both of which I believe are O(1). Why is the second function performing so much worse than the first, especially when the lists get bigger?
Odds (my take):
1) In the first function the ISet is converted to an IList (to get the random item from the IList), whereas no such conversion is performed in the second function.
Advantage IList.
2) In the first function a call to GetRandom is made, whereas in the second a call to GetRandomIndex is made; that's one step less again.
Though trivial, advantage IList.
3) In the first function, the random item is got from a separate list, so the obtained item might be already removed from ISet. This leads in more iterations in the while loop in the first function. In the second function, the random index is got from the source that is being iterated on, hence there are never repetitive iterations. I have tested this and verified this.
i > j always, advantage IList.
I thought the reason for this behaviour was that a List needs constant resizing when items are added or removed. But apparently not, according to some other testing. I ran:
public static void Remove1(this ISet<int> set)
{
    int count = set.Count;
    for (int i = 0; i < count; i++)
    {
        set.Remove(i + 1);
    }
}

public static void Remove2(this IList<int> lst)
{
    for (int i = lst.Count - 1; i >= 0; i--)
    {
        lst.RemoveAt(i);
    }
}
and found that the second function runs faster.
Test bed:
var f = Enumerable.Range(1, 100000);
var s = new HashSet<int>(f);
var l = new List<int>(f);

Benchmark(() =>
{
    //some examples...
    s.RemoveRandomly1(2500);
    l.RemoveRandomly2(2500);

    s.Remove1();
    l.Remove2();
}, 1);

public static void Benchmark(Action method, int iterations = 10000)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    for (int i = 0; i < iterations; i++)
        method();
    sw.Stop();
    MsgBox.ShowDialog(sw.Elapsed.TotalMilliseconds.ToString());
}
Just trying to understand what's going on with the two structures. Thanks!
Result:
var f = Enumerable.Range(1, 10000);
s.RemoveRandomly1(7500); => 5ms
l.RemoveRandomly2(7500); => 20ms
var f = Enumerable.Range(1, 100000);
s.RemoveRandomly1(7500); => 7ms
l.RemoveRandomly2(7500); => 275ms
var f = Enumerable.Range(1, 1000000);
s.RemoveRandomly1(75000); => 50ms
l.RemoveRandomly2(75000); => 925000ms
For most typical needs a list would do, though!
First off, IList and ISet aren't implementations of anything. I can write an IList or an ISet implementation that will run very differently, so the concrete implementations are what is important (List and HashSet in your case).
Accessing a List item by index is O(1), but removing it with RemoveAt is O(n).
Removing from the end of a List is fast because it doesn't have to copy anything; it just decrements the internal counter that stores how many items it has. (Note that List<T> does not automatically shrink the underlying array when items are removed; you'd have to call TrimExcess for that.) Once you hit the max capacity of the underlying array, it creates a new array of double the size and copies the elements over. The list tracks its logical size separately from the array's length, so unused slots appear as if they aren't there.
Randomly removing from a list means that it will have to copy all the array entries that come after the index so that they slide down one spot, which is inherently pretty slow, particularly as the size of the list gets bigger. If you have a List with 1 million entries, and you remove something at index 500,000, it has to copy the second half of the array down a spot.
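If the order of the list doesn't matter, a common workaround (a sketch, not from the original answer) is to swap the doomed element with the last one and remove from the end, turning the O(n) RemoveAt into O(1):

// O(1) unordered removal: overwrite the removed slot with the last
// element, then chop off the tail. Element order is not preserved.
public static void RemoveAtUnordered<T>(this List<T> list, int index)
{
    int last = list.Count - 1;
    list[index] = list[last]; // clobber the removed item
    list.RemoveAt(last);      // removing the tail copies nothing
}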
