Index Array Storage Memory - c#

Is there a use case for storing index ranges when talking about a potentially huge list?
Let's say we have a list of millions of records. These will be analysed and a sublist of indexes will be reported to the user. Rather than listing out a massive list of indexes, it would obviously be more legible to present:
Identified Rows: 10, 21, 10000-30000, 700000... etc. to the user.
Now I can obviously create this string from the array of indexes, but I'm wondering whether it would also be more memory efficient to build the list in this format in the first place (and not create a massive list of indexes in memory). Or is it not worth the processing overhead?
List<int> intList = new List<int> { 1, 2, 3, 4, 5, 6, 7, ... };
vs
List<string> strList = new List<string> { "1-3000", "3002", "4000-5000", ... };
To apply this I would imagine creating a list and, when adding an item, updating or appending to it as necessary. That would require quite a bit of converting strings to ints and vice versa, I think, which is where this process may not be worth it.
Let me know if this isn't clear enough and I can explain further.
UPDATE
I quite like Patrick Hofman's solution below using a list of ranges. What would be really cool would be to extend this so that .Add(int) would modify the list of ranges correctly. I think this would be quite complicated though, correct?

I would opt to create a list of ranges. Depending on the number of singles in it, it might be more or less efficient:
public struct Range
{
    public Range(int from, int to)
    {
        this.From = from;
        this.To = to;
    }

    public int From { get; }
    public int To { get; }

    public static implicit operator Range(int v)
    {
        return new Range(v, v);
    }
}
You can use it like this:
List<Range> l = new List<Range>{ 1, 2, 3, new Range(5, 3000) };
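Regarding the update about making Add(int) maintain the ranges: below is a minimal sketch of one way it could work, assuming values are added in ascending order (the RangeList wrapper and its member names are illustrative, not part of the original answer). A new value either extends the last range or starts a new single-value range; supporting insertion in arbitrary order (splitting and merging ranges in the middle of the list) is indeed considerably more involved, as suspected in the update.
// Hypothetical helper, not from the original answer. Assumes values arrive in
// ascending order; each value either extends the last range or starts a new one.
public class RangeList
{
    private readonly List<Range> _ranges = new List<Range>();

    public IReadOnlyList<Range> Ranges => _ranges;

    public void Add(int value)
    {
        if (_ranges.Count > 0)
        {
            Range last = _ranges[_ranges.Count - 1];
            if (value <= last.To)
            {
                return; // already covered by the last range
            }
            if (value == last.To + 1)
            {
                // Merge into the last range instead of adding a new entry.
                _ranges[_ranges.Count - 1] = new Range(last.From, value);
                return;
            }
        }
        _ranges.Add(new Range(value, value));
    }
}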


Removing oldest inserted data from sorted list

I have a List of objects that expands over time as more and more data gets added to it but will have a fixed maximum size at some point.
public class Calculation
{
    public List<DataUnit> DataUnits { get; set; } = new();

    public void AddSorted(DataUnit unit)
    {
        int index = DataUnits.BinarySearch(unit);
        DataUnits.Insert(index >= 0 ? index : ~index, unit);
    }

    public void AddData(DataUnit unit)
    {
        AddSorted(unit);
        if (DataUnits.Count > 30)
        {
            // i need some sort of solution here
            DataUnits.RemoveOldest();
        }
    }

    public void SomeCalculation()
    {
        // performs some calculation that is O(1) with a sorted list and O(N) with a non-sorted list
    }
}
The catches are restrictions on performance, RAM and time.
The actual code (not the dummy one above) will have a lot of data coming in (somewhere around 1000-2000 data units) within a 1 second time frame on a machine with limited resources.
We have to perform calculations on this list, which is just way faster to do on a sorted list. And since new data will be coming in each second, all the calculations have to be done within a second.
What would be the optimal way of implementing this?
I thought about sorting the list before each calculation separately, but I am afraid that an O(n log n) sort on every call will just not cut it.
My second idea would be a second list that just preserves the order of insertion, but with a lot of data I think we might hit a resource limit when holding two lists of those data units.
// EDIT
Additional Information about the DataUnit
public class DataUnit : IComparable
{
    public IComparable Data { get; set; }

    public int CompareTo(object? o)
    {
        if (o is DataUnit other)
        {
            return Data.CompareTo(other.Data);
        }
        return 0;
    }
}
Properties of the Calculation:
The SomeCalculation method performs an intensive calculation over all the present data. This calculation only works on a list sorted by its Data property (some index math magic, not important here).
So in the end, the DataUnits list must be sorted by Data. Since we have a lot more calls to SomeCalculation than to AddData, my implementation uses sorted insertion rather than sorting the list every time we call SomeCalculation.
Problem: the DataUnits list reaches a fixed size (for example 30 elements) and keeps that size afterwards. When one element gets added to the list, the oldest DataUnit object should be removed (and therefore not included in the calculation).
If the performance of SomeCalculation is indeed the most important factor, and additional overhead in AddData is acceptable, then your approach is fine: keep the list sorted at all times, and do not sort it each time you perform SomeCalculation.
To remove the oldest element: keep a fixed-size queue of DataUnit references alongside the sorted list. Besides inserting the new element into the sorted list, add it to the end of the queue. If the queue size exceeds the intended size, remove one element from the front of the queue, then search for it in the sorted list and remove it from the list.
Searching for the element to be removed can be done using binary search; removing it from the sorted list is O(n), but so is inserting a new element in the middle of the list, so the performance is similar.
You could actually remove the old element and insert the new one in one pass of the sorted list to slightly improve performance, but I wouldn't go for it, as the gains for a small list will be marginal.
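As a rough illustration of this queue-plus-sorted-list idea (a sketch under the question's assumptions of a 30-element cap and the DataUnit comparer shown above, not the asker's actual code):
public class Calculation
{
    private const int MaxSize = 30;                            // assumed fixed maximum size
    private readonly Queue<DataUnit> _insertionOrder = new();  // oldest element sits at the front

    public List<DataUnit> DataUnits { get; set; } = new();

    public void AddData(DataUnit unit)
    {
        // Keep the main list sorted by Data (same binary-search insertion as in the question).
        int index = DataUnits.BinarySearch(unit);
        DataUnits.Insert(index >= 0 ? index : ~index, unit);

        // Track insertion order separately; the queue only stores references.
        _insertionOrder.Enqueue(unit);

        if (_insertionOrder.Count > MaxSize)
        {
            // Drop the oldest element: locate it in the sorted list via binary search.
            // With duplicate Data values this may remove an equal element rather than
            // the exact instance, which the comparer treats as equivalent anyway.
            DataUnit oldest = _insertionOrder.Dequeue();
            int oldIndex = DataUnits.BinarySearch(oldest);
            if (oldIndex >= 0)
            {
                DataUnits.RemoveAt(oldIndex);
            }
        }
    }
}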
I'm not sure I understand you correctly, but according to MSDN, List.Add appends the element to the end of the list. Wouldn't it be easiest to simply Add elements (which will always go to the end), count them, and once the count reaches the threshold remove the first one (aka the oldest one)? Of course this expects that you do not change the order of elements in the list. Something like a cyclic buffer.
Something like this:
public void AddData(DataUnit unit)
{
    DataUnits.Add(unit);
    if (DataUnits.Count() > 30)
    {
        DataUnits.RemoveAt(0);
    }
}
Edit:
Mathew Watson's comment made me think a little bit. I believe this way you can have a sorted list and still remove the oldest element:
public class Calculation
{
    private DataUnit lastElement = null;

    public List<DataUnit> DataUnits { get; set; } = new();

    public void AddSorted(DataUnit unit)
    {
        int index = DataUnits.BinarySearch(unit);
        DataUnits.Insert(index >= 0 ? index : ~index, unit);
    }

    public void AddData(DataUnit unit)
    {
        AddSorted(unit);

        // you don't need to use LINQ, you may test it against Count
        if (DataUnits.Any() == false)
        {
            this.lastElement = unit;
        }

        if (DataUnits.Count > 30)
        {
            if (this.lastElement != null)
            {
                DataUnits.Remove(this.lastElement);
                this.lastElement = unit;
            }
        }
    }
}
because all you need to know is that between some element and the 30th element after it there were 30 additions (and no removals). So all you need to do is remember the current element, count 30 additions and then remove that element; the one you're adding becomes the new candidate for removal.

Looking for a data structure that is optimized for finding the next closest element

I have two classes, let's call them foo and bar, that both have a DateTime property called ReadingTime.
I then have long lists of these classes, let's say foos and bars, where foos is List<foo>, bars is List<bar>.
My goal is for every element in foos to find the events in bars that happened right before and right after foo.
Some code to clarify:
var foos = new List<foo>();
var bars = new List<bar>();
...
foreach (var foo in foos)
{
bar before = bars.Where(b => b.ReadingTime <= foo.ReadingTime).OrderByDescending(b => b.ReadingTime).FirstOrDefault();
bar after = bars.Where(b => b.ReadingTime > foo.ReadingTime).OrderBy(b => b.ReadingTime).FirstOrDefault();
...
}
My issue here is performance. Is it possible to use some other data structure than a list to speed up the comparisons? In particular, the OrderBy statement every single time seems like a huge waste; having the data pre-ordered should also speed up the comparisons, right?
I just don't know what data structure is best: SortedList, SortedSet, SortedDictionary, etc., there seem to be so many. Also, all the information I find is about lookups, inserts, deletes, etc.; no one writes about finding the next closest element, so I'm not sure if anything is optimized for that.
I'm on .net core 3.1 if that matters.
Thanks in advance!
Edit: Okay so to wrap this up:
First I tried implementing #derloopkat's approach. For this I figured I needed a data type that could save the data in a sorted order so I just left it as IOrderedEnumerable (which is what linq returns). Probably not very smart, as that actually brought things to a crawl. I then tried going with SortedList. Had to remove some duplicates first which was no problem in my case. Thanks for the help #Olivier Rogier! This got me up to roughly 2x the original performance, though I suspect it's mostly the removed linq OrderBys. For now this is good enough, if/when I need more performance I'm going to go with what #CamiloTerevinto suggested.
Lastly #Aldert thank you for your time but I'm too noob and under too much time pressure to understand what you suggested. Still appreciate it and might revisit this later.
Edit2: Ended up going with #CamiloTerevinto's suggestion. Cut my runtime down from 10 hours to a couple of minutes.
You don't need to sort bars ascending and descending on each iteration. Order bars just once before the loop by calling .OrderBy(f => f.ReadingTime) and then use LastOrDefault() and FirstOrDefault().
foreach (var foo in foos)
{
    bar before = bars.LastOrDefault(b => b.ReadingTime <= foo.ReadingTime);
    bar after = bars.FirstOrDefault(b => b.ReadingTime > foo.ReadingTime);
    //...
}
This produces the same output you get with your code and runs faster.
For memory performance and to have strong typing, you can use a SortedDictionary or a SortedList, though the latter manipulates objects. Because you compare DateTime values you don't need to implement a comparer.
What's the difference between SortedList and SortedDictionary?
SortedList<>, SortedDictionary<> and Dictionary<>
Difference between SortedList and SortedDictionary in C#
For speed optimization you can use a doubly linked list where each item references the next and the previous item:
Doubly Linked List in C#
Linked List Implementation in C#
Using a linked list or a doubly linked list requires more memory, because you store the next and previous references in a cell that wraps each instance, but it can sometimes be the fastest way to traverse and compare data, as well as to search, sort, reorder, add, remove and move items, because you don't manipulate an array, only linked references.
You can also build powerful trees and manage data in a better way than with arrays.
You can use binary search for quick lookup. Below is code where bars is sorted and each foo is looked up. You can do some reading on binary searches yourself and enhance the code by also sorting foos; in that case you can minimize the search range within bars...
The code generates two lists with 100 items each, then sorts bars and performs a binary search 100 times.
using System;
using System.Collections.Generic;

namespace ConsoleApp2
{
    class BaseReading
    {
        private DateTime readingTime;

        public BaseReading(DateTime dt)
        {
            readingTime = dt;
        }

        public DateTime ReadingTime
        {
            get { return readingTime; }
            set { readingTime = value; }
        }
    }

    class Foo : BaseReading
    {
        public Foo(DateTime dt) : base(dt)
        { }
    }

    class Bar : BaseReading
    {
        public Bar(DateTime dt) : base(dt)
        { }
    }

    class ReadingTimeComparer : IComparer<BaseReading>
    {
        public int Compare(BaseReading x, BaseReading y)
        {
            return x.ReadingTime.CompareTo(y.ReadingTime);
        }
    }

    class Program
    {
        static private List<BaseReading> foos = new List<BaseReading>();
        static private List<BaseReading> bars = new List<BaseReading>();
        static private Random ran = new Random();

        static void Main(string[] args)
        {
            for (int i = 0; i < 100; i++)
            {
                foos.Add(new BaseReading(GetRandomDate()));
                bars.Add(new BaseReading(GetRandomDate()));
            }

            var rtc = new ReadingTimeComparer();
            bars.Sort(rtc);

            foreach (BaseReading br in foos)
            {
                int index = bars.BinarySearch(br, rtc);
            }
        }

        static DateTime GetRandomDate()
        {
            long randomTicks = ran.Next((int)(DateTime.MaxValue.Ticks >> 32));
            randomTicks = (randomTicks << 32) + ran.Next();
            return new DateTime(randomTicks);
        }
    }
}
The only APIs available in the .NET platform for finding the next closest element, with a computational complexity better than O(N), are the List.BinarySearch and Array.BinarySearch methods:
// Returns the zero-based index of item in the sorted List<T>, if item is found;
// otherwise, a negative number that is the bitwise complement of the index of
// the next element that is larger than item or, if there is no larger element,
// the bitwise complement of Count.
public int BinarySearch (T item, IComparer<T> comparer);
These APIs are not 100% robust, because the correctness of the results depends on whether the underlying data structure is already sorted, and the platform does not check or enforce this condition. It's up to you to ensure that the list or array is sorted with the correct comparer, before attempting to BinarySearch on it.
These APIs are also cumbersome to use, because in case a direct match is not found you'll get the next largest element as a bitwise complement, which is a negative number, and you'll have to use the ~ operator to get the actual index. And then subtract one to get the closest item from the other direction.
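To make that bookkeeping concrete, here is a minimal sketch applied to the question's scenario (assuming bars is a List<bar> already sorted ascending by ReadingTime, that bar has a settable ReadingTime usable to build a search key, and that comparer orders bars by ReadingTime; the names mirror the question, but the helper itself is mine):
static (bar before, bar after) FindNeighbours(List<bar> bars, DateTime t, IComparer<bar> comparer)
{
    // Hypothetical search key: any bar-like instance carrying the target ReadingTime.
    var probe = new bar { ReadingTime = t };

    int index = bars.BinarySearch(probe, comparer);
    int insertionPoint = index >= 0 ? index : ~index;

    // "before" = last element with ReadingTime <= t, "after" = first with ReadingTime > t.
    // (With duplicate timestamps the exact element picked among equals is unspecified.)
    int beforeIndex = index >= 0 ? index : insertionPoint - 1;
    int afterIndex = index >= 0 ? index + 1 : insertionPoint;

    bar before = beforeIndex >= 0 ? bars[beforeIndex] : null;
    bar after = afterIndex < bars.Count ? bars[afterIndex] : null;
    return (before, after);
}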
If you don't mind adding a third-party dependency to your app, you could consider the C5 library, which contains the TreeDictionary collection, with the interesting methods below:
// Find the entry in the dictionary whose key is the predecessor of the specified key.
public bool TryPredecessor(K key, out SCG.KeyValuePair<K, V> res);
// Find the entry in the dictionary whose key is the successor of the specified key.
public bool TrySuccessor(K key, out SCG.KeyValuePair<K, V> res);
There are also the TryWeakPredecessor and TryWeakSuccessor methods available, that consider an exact match as a predecessor or successor respectively. In other words they are analogous to the <= and >= operators.
C5 is a powerful and feature-rich library that offers lots of specialized collections, its main con being a somewhat idiosyncratic API.
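For illustration, a usage sketch of the TreeDictionary approach on the question's data (my own example, assuming ReadingTime values are unique, since TreeDictionary keys must be distinct):
// Sketch only: build a C5.TreeDictionary keyed by ReadingTime, then query neighbours.
var lookup = new C5.TreeDictionary<DateTime, bar>();
foreach (var b in bars)
{
    lookup[b.ReadingTime] = b; // assumes unique ReadingTime values
}

foreach (var foo in foos)
{
    // Weak predecessor treats an exact match as "before" (<=); successor is strictly after (>).
    if (lookup.TryWeakPredecessor(foo.ReadingTime, out var beforeEntry))
    {
        bar before = beforeEntry.Value;
    }
    if (lookup.TrySuccessor(foo.ReadingTime, out var afterEntry))
    {
        bar after = afterEntry.Value;
    }
}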
You should get excellent performance by any of these options.

Splitting collection into equal batches in Parallel using c#

I am trying to split a collection into batches of equal size. Below is the code.
public static List<List<T>> SplitIntoBatches<T>(List<T> collection, int size)
{
    var chunks = new List<List<T>>();
    var count = 0;
    var temp = new List<T>();

    foreach (var element in collection)
    {
        if (count++ == size)
        {
            chunks.Add(temp);
            temp = new List<T>();
            count = 1;
        }
        temp.Add(element);
    }
    chunks.Add(temp);

    return chunks;
}
Can we do it using Parallel.ForEach() for better performance, as we have around 1 million items in the list?
Thanks!
If performance is the concern, my thoughts (in increasing order of impact):
right-sizing the lists when you create them would save a lot of work, i.e. figure out the output batch sizes before you start copying, i.e. temp = new List<T>(thisChunkSize)
working with arrays would be more effective than working with lists - new T[thisChunkSize]
especially if you use BlockCopy (or CopyTo, which uses that internally) rather than copying individual elements one by one
once you've calculated the offsets for each of the chunks, the individual block-copies could probably be executed in parallel, but I wouldn't assume it will be faster - memory bandwidth will be the limiting factor at that point
but the ultimate fix is: don't copy the data at all, but instead just create ranges over the existing data; for example, if using arrays, ArraySegment<T> would help; if you're open to using newer .NET features, this is a perfect fit for Memory<T>/Span<T> - creating memory/span ranges over an existing array is essentially free and instant - i.e. take a T[] and return List<Memory<T>> or similar.
Even if you can't switch to ArraySegment<T> / Memory<T> etc, returning something like that could still be used - i.e. List<ListSegment<T>> where ListSegment<T> is something like:
readonly struct ListSegment<T> // like ArraySegment<T>, but for List<T>
{
    public List<T> List { get; }
    public int Offset { get; }
    public int Count { get; }
}
and have your code work with ListSegment<T> by processing the Offset and Count appropriately.
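As a rough sketch of the zero-copy idea (my illustration, not the answerer's code; the names are arbitrary), chunking an array into Memory<T> windows might look like this:
// Splits an array into Memory<T> views without copying any elements:
// each chunk is just an (array, offset, length) window over the same backing array.
public static List<Memory<T>> SplitIntoMemoryChunks<T>(T[] source, int size)
{
    var chunks = new List<Memory<T>>((source.Length + size - 1) / size);
    for (int offset = 0; offset < source.Length; offset += size)
    {
        int length = Math.Min(size, source.Length - offset);
        chunks.Add(new Memory<T>(source, offset, length));
    }
    return chunks;
}
Each chunk's .Span can then be processed in place (including in parallel), since nothing was copied.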

What is the optimal collection if the keys are sequential integers but not zero-based?

I am specifically referring to using years as keys, but the question applies to any n-based sequential index.
Let's say I'm looking at apple harvests by year. I want to access my data by year, like so:
var harvests = GetLast50YearsOfHarvestData();
//1991 was a great year for apples
harvests[1991].ApplesPicked ...
The obvious answer is to use a Dictionary.
var harvests = new Dictionary<int, AppleHarvest>();
Yet I know that arrays are faster. Apple harvest software is normally very performance-tuned. I will not be searching, adding, or deleting from my collection. I will only ever be accessing by key.
AppleHarvest[] harvests;
...
harvests[24] //1991 was a great year for apples
harvests[49] //wait what year is this? 2017?
I know that my keys will always be sequential, without gaps, but working with an array requires extra logic to know what year the zero-based index corresponds to. My performance may still be superior, but I'd prefer to not have to deal with that extra layer.
What are the options for achieving essentially an n-based array?
Create your own collection type:
using System;
using System.Collections;
using System.Collections.Generic;

public class SequentialKeyedCollection<T> : IEnumerable<T>
{
    private T[] _innerArray;
    private int _startIndex;

    public SequentialKeyedCollection(int startIndex, int length)
    {
        _innerArray = new T[length];
        _startIndex = startIndex;
    }

    public T this[int index]
    {
        get => _innerArray[index - _startIndex];
        set => _innerArray[index - _startIndex] = value;
    }

    public int Length => _innerArray.Length;

    public int IndexOf(T item)
    {
        int i = Array.IndexOf(_innerArray, item);
        if (i < 0) return i; // Not found.
        return i + _startIndex;
    }

    public IEnumerator<T> GetEnumerator() => ((IEnumerable<T>)_innerArray).GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => ((IEnumerable<T>)_innerArray).GetEnumerator();
}
An option is to use an array of AppleHarvest where the element index indicates the year:
AppleHarvest[] apples = new AppleHarvest[3000];
// 1991 was a great year for apples
apples[1991] = GetAppleHarvestForYear(1991);
Of course, there will be some unused entries at the beginning of the array, but this overhead is really low.
Concerning algorithmic complexity, reading from an array is an O(1) operation.
Reading from a dictionary is also an O(1) operation, so the difference is only a constant multiplier, but arrays are faster.
To get the zero-based index, you can subtract a "base year" from any given year:
harvest[year - START_YEAR]
This is more easily readable than using a "magic number" and also more flexible since year is now a variable. You can even take this a step further by encapsulating the array in a class and creating a getter method which takes the year as a parameter and internally does the subtraction and the array access.

How many times the CompareTo method is called when a collection is sorted?

If a type implements IComparable<T> and you have a collection of this type with 100 elements. When you call the Sort method on this collection, how many times would the CompareTo method be called and how? Would it be used in this manner?
CompareTo(item0, item1);
CompareTo(item1, item2);
CompareTo(item2, item3);
CompareTo(item3, item4);
...
CompareTo(item97, item98);
CompareTo(item98, item99);
EDIT: Basically what I am trying to do is to turn this way of sorting into a value-based sorting where I assign some value to each item and then sort them. It's hard to explain but I am not able to use a -1,0,1 based sorting function for this problem. But all I have is a CompareTo function that I need to use to sort the items. So I need to generate some values for each item, and then the program will sort them from the smallest value to largest.
Well, you can't be 100% sure (with most sorting algorithms), as it will depend on the data. For example, certain sorting algorithms will only perform N comparisons (N being the size of the collection) if the data is already sorted, but will need many more if it's not.
The commonly used sorting algorithms, such as MergeSort, QuickSort, and HeapSort, are all O(n*log(n)), which is to say the number of comparisons will be on the order of the number of items times the log of the number of items. (The log base will be 2 for those algorithms.) While this won't be exact, it will scale with that relationship.
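For example, with the 100-element collection from the question you should expect roughly 100 × log2(100) ≈ 100 × 6.6 ≈ 660 comparisons, give or take a constant factor, rather than the 99 strictly adjacent comparisons sketched above.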
If you're interested in how many times it's called for a particular sorting operation you can use something like this:
public class LoggingComparer<T> : IComparer<T>
{
    private IComparer<T> otherComparer;

    public LoggingComparer(IComparer<T> otherComparer)
    {
        this.otherComparer = otherComparer;
    }

    public int Count { get; private set; }

    public int Compare(T x, T y)
    {
        Count++;
        return otherComparer.Compare(x, y);
    }
}
It will wrap another comparer but also count the number of compare calls. Here's an example usage:
var list = new List<int>() { 5, 4, 1, 2, 3, 8, 7, 9, 0 };
LoggingComparer<int> comparer = new LoggingComparer<int>(Comparer<int>.Default);
list.Sort(comparer);
Console.WriteLine(comparer.Count);
Piggybacking on Servy's answer: whatever the asymptotic complexity of the sorting algorithm's comparison operations is, that is roughly how many calls will be made to CompareTo(). Note that this is a growth pattern, not an exact number of operations.
