Removing oldest inserted data from sorted list - c#

I have a List of objects that expands over time as more and more data gets added to it but will have a fixed maximum size at some point.
public class Calculation
{
    public List<DataUnit> DataUnits { get; set; } = new();

    public void AddSorted(DataUnit unit)
    {
        int index = DataUnits.BinarySearch(unit);
        DataUnits.Insert(index >= 0 ? index : ~index, unit);
    }

    public void AddData(DataUnit unit)
    {
        AddSorted(unit);
        if (DataUnits.Count > 30)
        {
            // I need some sort of solution here
            DataUnits.RemoveOldest();
        }
    }

    public void SomeCalculation()
    {
        // performs some calculation that is O(1) with a sorted list and O(N) with an unsorted list
    }
}
The catch is that performance, RAM and time are all restricted.
The actual code (not the dummy one above) will have a lot of data coming in (around 1000-2000 DataUnits) within a 1 second time frame, on a machine with limited resources.
We have to perform calculations on this list, which is just way faster to do on a sorted list. And, since new data will be coming in every second, all the calculations have to be done within a second.
What would be the optimal way of implementing this?
I thought about sorting the list before each calculation, but I am afraid that O(n log n) per calculation will just not cut it.
My second idea was a second list that just preserves insertion order, but with a lot of data I think we might hit a resource limit holding two lists of those DataUnits.
// EDIT
Additional Information about the DataUnit
public class DataUnit : IComparable
{
    public IComparable Data { get; set; }

    public int CompareTo(object? o)
    {
        if (o is DataUnit other)
        {
            return Data.CompareTo(other.Data);
        }
        return 0;
    }
}
Properties about the Calculation
The SomeCalculation method performs an intensive calculation over all the present data. This calculation only works with a list sorted by the Data property (some index math magic, not important here).
So in the end, the DataUnits list must be sorted by Data. Since we have a lot more calls to SomeCalculation than to AddData, my implementation uses sorted insertion rather than sorting the list on every call to SomeCalculation.
Problem: the DataUnits list reaches a fixed size (for example 30 elements) and preserves that size from then on. When one element gets added to the list, the oldest DataUnit object should be removed (and therefore not be included in the calculation).

If the performance of SomeCalculation is indeed the most important, and additional overhead in AddData is acceptable, then your approach is ok. Keep the list sorted at all times, do not sort it each time you perform SomeCalculation.
To remove the oldest element, keep a fixed-size queue of DataUnit references alongside the sorted list: besides inserting each new element into the sorted list, add it to the end of the queue. If the queue size exceeds the intended size, remove one element from the front of the queue, then find it in the sorted list and remove it from the list.
Finding the element to be removed can be done with a binary search. Removing it from the sorted list is O(n), but so is inserting a new element into the middle of the list, so the performance is similar.
You could actually remove the old element and insert the new one in one pass of the sorted list, to slightly improve performance, but I wouldn't go for it, as the gains for a small list will be marginal.
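A minimal sketch of that queue-plus-sorted-list idea (the type name, the MaxSize constant and the layout are mine, assuming DataUnit implements IComparable as shown in the question):
public class CalculationWithQueue
{
    private const int MaxSize = 30;
    private readonly Queue<DataUnit> insertionOrder = new(); // oldest reference at the front

    public List<DataUnit> DataUnits { get; } = new();

    public void AddData(DataUnit unit)
    {
        // Sorted insert: O(log n) search plus O(n) shift.
        int index = DataUnits.BinarySearch(unit);
        DataUnits.Insert(index >= 0 ? index : ~index, unit);

        // Track arrival order in a parallel queue of references.
        insertionOrder.Enqueue(unit);

        if (insertionOrder.Count > MaxSize)
        {
            DataUnit oldest = insertionOrder.Dequeue();
            // BinarySearch finds *an* element with equal Data; if units with
            // equal Data are interchangeable for the calculation this is enough,
            // otherwise scan the neighboring equal elements for the exact reference.
            int oldIndex = DataUnits.BinarySearch(oldest);
            DataUnits.RemoveAt(oldIndex);
        }
    }
}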

I'm not sure I understand you correctly, but according to MSDN, List.Add appends the element to the end of the list. Wouldn't it be easiest to simply Add elements (which always go to the end), count them, and once the count reaches the threshold remove the first one (aka the oldest one)? Of course this assumes you do not change the order of elements in the list. Something like a cyclic buffer.
Something like this:
public void AddData(DataUnit unit)
{
    DataUnits.Add(unit);
    if (DataUnits.Count > 30)
    {
        DataUnits.RemoveAt(0);
    }
}
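For reference, the cyclic buffer mentioned above could look like this minimal sketch (my own type, not part of the answer); it keeps a fixed number of items in arrival order, silently overwriting the oldest:
public class CircularBuffer<T>
{
    private readonly T[] items;
    private int next; // index the next Add will write to

    public int Count { get; private set; }

    public CircularBuffer(int capacity)
    {
        items = new T[capacity];
    }

    public void Add(T item)
    {
        items[next] = item; // overwrites the oldest item once the buffer is full
        next = (next + 1) % items.Length;
        if (Count < items.Length) Count++;
    }
}
Note that this stores items in arrival order, not sorted order, so it only fits the simple Add/RemoveAt(0) variant above.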
Edit:
Mathew Watson's comment made me think a little bit. I believe this way you can have a sorted array and still remove the oldest element:
public class Calculation
{
    private DataUnit lastElement = null;

    public List<DataUnit> DataUnits { get; set; } = new();

    public void AddSorted(DataUnit unit)
    {
        int index = DataUnits.BinarySearch(unit);
        DataUnits.Insert(index >= 0 ? index : ~index, unit);
    }

    public void AddData(DataUnit unit)
    {
        AddSorted(unit);
        // remember the first element added; it is the current candidate for removal
        if (this.lastElement == null)
        {
            this.lastElement = unit;
        }
        if (DataUnits.Count > 30)
        {
            if (this.lastElement != null)
            {
                DataUnits.Remove(this.lastElement);
                this.lastElement = unit;
            }
        }
    }
}
because all you need to know is that between some element and the 30th element there were 30 additions (and no removals). So all you need to do is remember the current element, count 30 additions and then remove that element; the one you're adding becomes the new candidate for removal.

Looking for a data structure that is optimized for finding the next closest element

I have two classes, let's call them foo and bar, that both have a DateTime property called ReadingTime.
I then have long lists of these classes, let's say foos and bars, where foos is List<foo>, bars is List<bar>.
My goal is for every element in foos to find the events in bars that happened right before and right after foo.
Some code to clarify:
var foos = new List<foo>();
var bars = new List<bar>();
...
foreach (var foo in foos)
{
bar before = bars.Where(b => b.ReadingTime <= foo.ReadingTime).OrderByDescending(b => b.ReadingTime).FirstOrDefault();
bar after = bars.Where(b => b.ReadingTime > foo.ReadingTime).OrderBy(b => b.ReadingTime).FirstOrDefault();
...
}
My issue here is performance. Is it possible to use some other data structure than a list to speed up the comparisons? In particular the OrderBy statement every single time seems like a huge waste, having it pre-ordered should also speed up the comparisons, right?
I just don't know which data structure is best: SortedList, SortedSet, SortedDictionary etc., there seem to be so many. Also, all the information I find is about lookups, inserts, deletes, etc.; no one writes about finding the next closest element, so I'm not sure if anything is optimized for that.
I'm on .net core 3.1 if that matters.
Thanks in advance!
Edit: Okay so to wrap this up:
First I tried implementing #derloopkat's approach. For this I figured I needed a data type that could save the data in a sorted order so I just left it as IOrderedEnumerable (which is what linq returns). Probably not very smart, as that actually brought things to a crawl. I then tried going with SortedList. Had to remove some duplicates first which was no problem in my case. Thanks for the help #Olivier Rogier! This got me up to roughly 2x the original performance, though I suspect it's mostly the removed linq OrderBys. For now this is good enough, if/when I need more performance I'm going to go with what #CamiloTerevinto suggested.
Lastly #Aldert thank you for your time but I'm too noob and under too much time pressure to understand what you suggested. Still appreciate it and might revisit this later.
Edit2: Ended up going with #CamiloTerevinto's suggestion. Cut my runtime down from 10 hours to a couple of minutes.
You don't need to sort bars ascending and descending on each iteration. Order bars just once before the loop by calling .OrderBy(b => b.ReadingTime) and then use LastOrDefault() and FirstOrDefault().
bars = bars.OrderBy(b => b.ReadingTime).ToList();

foreach (var foo in foos)
{
    bar before = bars.LastOrDefault(b => b.ReadingTime <= foo.ReadingTime);
    bar after = bars.FirstOrDefault(b => b.ReadingTime > foo.ReadingTime);
    //...
}
This produces the same output you get with your code and runs faster.
For memory performance and strong typing, you can use a SortedDictionary or a SortedList. Because you compare DateTime values you don't need to implement a comparer.
What's the difference between SortedList and SortedDictionary?
SortedList<>, SortedDictionary<> and Dictionary<>
Difference between SortedList and SortedDictionary in C#
For speed optimization you can use a doubly linked list, where each item points to the next and the previous item:
Doubly Linked List in C#
Linked List Implementation in C#
Using a linked list or a doubly linked list requires more memory, because each cell stores references to the next and the previous items alongside the instance it embeds, but it can sometimes be the fastest way to traverse and compare data, as well as to search, sort, reorder, add, remove and move items, because you manipulate linked references instead of an array.
You can also build powerful trees and manage data in a better way than with arrays.
You can use binary search for quick lookup. Below is code where bars is sorted and each foo is looked up. You can do some reading on binary search yourself and enhance the code by also sorting foos; in that case you can minimize the search range of bars.
The code generates two lists with 100 items each, then sorts bars and does a binary search 100 times.
using System;
using System.Collections.Generic;

namespace ConsoleApp2
{
    class BaseReading
    {
        private DateTime readingTime;

        public BaseReading(DateTime dt)
        {
            readingTime = dt;
        }

        public DateTime ReadingTime
        {
            get { return readingTime; }
            set { readingTime = value; }
        }
    }

    class Foo : BaseReading
    {
        public Foo(DateTime dt) : base(dt)
        { }
    }

    class Bar : BaseReading
    {
        public Bar(DateTime dt) : base(dt)
        { }
    }

    class ReadingTimeComparer : IComparer<BaseReading>
    {
        public int Compare(BaseReading x, BaseReading y)
        {
            return x.ReadingTime.CompareTo(y.ReadingTime);
        }
    }

    class Program
    {
        static private List<BaseReading> foos = new List<BaseReading>();
        static private List<BaseReading> bars = new List<BaseReading>();
        static private Random ran = new Random();

        static void Main(string[] args)
        {
            for (int i = 0; i < 100; i++)
            {
                foos.Add(new BaseReading(GetRandomDate()));
                bars.Add(new BaseReading(GetRandomDate()));
            }

            var rtc = new ReadingTimeComparer();
            bars.Sort(rtc);

            foreach (BaseReading br in foos)
            {
                int index = bars.BinarySearch(br, rtc);
            }
        }

        static DateTime GetRandomDate()
        {
            long randomTicks = ran.Next((int)(DateTime.MaxValue.Ticks >> 32));
            randomTicks = (randomTicks << 32) + ran.Next();
            return new DateTime(randomTicks);
        }
    }
}
The only APIs available in the .NET platform for finding the next closest element, with a computational complexity better than O(N), are the List.BinarySearch and Array.BinarySearch methods:
// Returns the zero-based index of item in the sorted List<T>, if item is found;
// otherwise, a negative number that is the bitwise complement of the index of
// the next element that is larger than item or, if there is no larger element,
// the bitwise complement of Count.
public int BinarySearch (T item, IComparer<T> comparer);
These APIs are not 100% robust, because the correctness of the results depends on whether the underlying data structure is already sorted, and the platform does not check or enforce this condition. It's up to you to ensure that the list or array is sorted with the correct comparer, before attempting to BinarySearch on it.
These APIs are also cumbersome to use, because in case a direct match is not found you'll get the next largest element as a bitwise complement, which is a negative number, and you'll have to use the ~ operator to get the actual index. And then subtract one to get the closest item from the other direction.
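As a rough sketch of that pattern (a helper of my own, assuming bars is already sorted ascending by ReadingTime and reusing the Bar type from the answer above; with duplicate timestamps you may need to walk past equal elements):
static void FindNeighbors(List<Bar> bars, DateTime t, out Bar before, out Bar after)
{
    var comparer = Comparer<Bar>.Create((x, y) => x.ReadingTime.CompareTo(y.ReadingTime));
    int index = bars.BinarySearch(new Bar(t), comparer);

    if (index < 0)
        index = ~index;    // index of the first element larger than t
    else
        index = index + 1; // an exact match counts as "before" (<=), so "after" starts here

    before = index > 0 ? bars[index - 1] : null;     // last element <= t
    after = index < bars.Count ? bars[index] : null; // first element > t
}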
If you don't mind adding a third-party dependency to your app, you could consider the C5 library, which contains the TreeDictionary collection, with the interesting methods below:
// Find the entry in the dictionary whose key is the predecessor of the specified key.
public bool TryPredecessor(K key, out SCG.KeyValuePair<K, V> res);
// Find the entry in the dictionary whose key is the successor of the specified key.
public bool TrySuccessor(K key, out SCG.KeyValuePair<K, V> res);
There are also the TryWeakPredecessor and TryWeakSuccessor methods available, that consider an exact match as a predecessor or successor respectively. In other words they are analogous to the <= and >= operators.
C5 is a powerful and feature-rich library that offers lots of specialized collections, its main con being a somewhat idiosyncratic API.
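A hedged usage sketch of those methods, assuming the keys are the DateTime readings and the values the bar objects (and that duplicate timestamps have been dealt with, since dictionary keys must be unique):
using C5;
using SCG = System.Collections.Generic;

var tree = new TreeDictionary<DateTime, Bar>();
foreach (var b in bars)
    tree[b.ReadingTime] = b; // last one wins on duplicate timestamps

foreach (var f in foos)
{
    // TryWeakPredecessor treats an exact match as a predecessor (<=).
    if (tree.TryWeakPredecessor(f.ReadingTime, out SCG.KeyValuePair<DateTime, Bar> before))
    {
        // before.Value is the reading at or just before f.ReadingTime
    }

    // TrySuccessor is strictly greater (>).
    if (tree.TrySuccessor(f.ReadingTime, out SCG.KeyValuePair<DateTime, Bar> after))
    {
        // after.Value is the reading just after f.ReadingTime
    }
}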
You should get excellent performance by any of these options.

Index Array Storage Memory

Is there a use case for storing index ranges when talking about a potentially huge list?
Let's say we have a list with millions of records. These will be analysed and a sublist of indexes will be reported to the user. Rather than listing out a massive list of indexes, it would obviously be more legible to present:
Identified Rows: 10, 21, 10000-30000, 700000... etc. to the user.
Now I can obviously create this string from the array of indexes, but I'm wondering if it would also be more memory efficient to build the list in this format in the first place (and not create a massive list of indexes in memory). Or is it not worth the processing overhead?
List<int> intList = new List<int> { 1, 2, 3, 4, 5, 6, 7, ... };
vs
List<string> strList = new List<string> { "1-3000", "3002", "4000-5000", ... };
To apply this I would imagine creating such a list and, when adding an item, updating or adding entries as necessary. I think that would require quite a bit of converting strings to ints and vice versa, which is where this process may not be worth it.
Let me know if this isn't clear enough and I can potentially explain further.
UPDATE
I quite like Patrick Hofman's solution below using a list of ranges. What would be really cool would be to extend it so that .Add(int) modifies the list of ranges correctly. I think this would be quite complicated though, correct?
I would opt to create a list of ranges. Depending on the number of singles in it, it might be more or less efficient:
public struct Range
{
    public Range(int from, int to)
    {
        this.From = from;
        this.To = to;
    }

    public int From { get; }
    public int To { get; }

    public static implicit operator Range(int v)
    {
        return new Range(v, v);
    }
}
You can use it like this:
List<Range> l = new List<Range>{ 1, 2, 3, new Range(5, 3000) };
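Regarding the update in the question: a hedged sketch of what an Add(int) that folds a new value into an existing list of ranges might look like (a type of my own; it assumes the list is kept sorted with non-overlapping, non-adjacent ranges):
public class RangeList
{
    // Kept sorted by From, with no overlapping or adjacent ranges.
    private readonly List<Range> ranges = new List<Range>();

    public void Add(int value)
    {
        for (int i = 0; i < ranges.Count; i++)
        {
            Range r = ranges[i];
            if (value >= r.From && value <= r.To)
                return; // already covered
            if (value == r.To + 1)
            {
                // extend this range upward, merging with the next range if they now touch
                if (i + 1 < ranges.Count && ranges[i + 1].From == value + 1)
                {
                    ranges[i] = new Range(r.From, ranges[i + 1].To);
                    ranges.RemoveAt(i + 1);
                }
                else
                {
                    ranges[i] = new Range(r.From, value);
                }
                return;
            }
            if (value == r.From - 1)
            {
                ranges[i] = new Range(value, r.To); // extend this range downward
                return;
            }
            if (value < r.From)
            {
                ranges.Insert(i, new Range(value, value)); // new single before this range
                return;
            }
        }
        ranges.Add(new Range(value, value)); // larger than everything so far
    }
}
It is fiddly (as suspected in the question) but not unmanageable; the cost is an O(n) scan per Add, which a binary search on From could reduce.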

Does C# reserve three null instances ahead of time on a List<Object>?

I have the following:
public static class LocalFileModelList
{
    public static List<LocalFileModel> ModelList = new List<LocalFileModel>();
}

public class LocalFileModel
{
    public string Name { get; set; }
    public string Extension { get; set; }
}
Then a method to read all files from a directory.
public static List<LocalFileModel> GetAllFiles(string path)
{
    var files = Directory.GetFiles(path, "*.*");
    foreach (var file in files)
    {
        var extension = Path.GetExtension(file);
        var filename = Path.GetFileNameWithoutExtension(file);
        var model = new LocalFileModel
        {
            Name = filename,
            Extension = extension,
        };
        LocalFileModelList.ModelList.Add(model);
    }
    return LocalFileModelList.ModelList;
}
I noticed, as I step through my code, that when I create a new instance of LocalFileModel, populate it with data and add it to the list, the list automatically creates three additional null entries. Once those three are filled with their respective objects, it again creates three more null entries...
I just realized this now; is this normal?
List<T> has an internal array, with a certain capacity which is always equal to or greater than the number of items on the list.
list.Capacity >= list.Count
You can actually tell the list what capacity its internal array should be created with.
new List<int>(capacity: 5);
When an item is inserted, and the array is at its capacity, the list creates a new array with double the previous size to accommodate the new element. So, in your case, if you were to insert a 5th item, the list would allocate a new internal array with 8 slots (5 of which would be filled).
For more details, check the implementation here.
Yes. .NET and most other libraries allocate a list or a vector with extra space (capacity) so it doesn't constantly have to resize and copy the data. The Count determines what is accessible.
The default capacity is defined here as 4 (but the docs don't guarantee it):
http://referencesource.microsoft.com/#mscorlib/system/collections/generic/list.cs,aa9d469618cd43b0,references
The initial capacity of the internal array held by List<T> is 4 (currently; that is an implementation detail and may change), granted you added an initial value. Once you start filling the list, it resizes itself by a factor of 2 each time. That is why, when you know the minimum number of items ahead of time, you can use the overload taking int capacity (or use an array if it's really a fixed size).
List<T> is backed by an array. The default/initial capacity of List<T> is defined as 4 in the reference source, but it doesn't take effect until an item is added to the list.
public class List<T> : IList<T>, System.Collections.IList, IReadOnlyList<T>
{
    private const int _defaultCapacity = 4;
So when you add the first item to your List, the capacity is set to 4 by the following check in the EnsureCapacity method:
int newCapacity = _items.Length == 0 ? _defaultCapacity : _items.Length * 2;
Later on, each time the list is full and a new item is added, the capacity doubles.
That is why, when you add the first item, you can see three null spaces, three additional spaces reserved in your list.
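A quick way to watch this happen (a small sketch; the exact numbers reflect the current implementation and could change):
var list = new List<int>();
Console.WriteLine($"Count={list.Count}, Capacity={list.Capacity}"); // Count=0, Capacity=0

for (int i = 1; i <= 5; i++)
{
    list.Add(i);
    Console.WriteLine($"Count={list.Count}, Capacity={list.Capacity}");
}
// Count=1, Capacity=4
// Count=2, Capacity=4
// Count=3, Capacity=4
// Count=4, Capacity=4
// Count=5, Capacity=8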

Fast collection comparison

I have the following data type:
ISet<IEnumerable<Foo>>
So, I need to be able to create sets of sequences. E.g. this is ok:
ABC,AC,A
but this is not (since "AB" is repeated here"):
AB,A,ABC,BCA,AB
But, in order to do this - for the "set" to not contain duplicates - I need to wrap my IEnumerable in some other data type:
ISet<Seq>
//where
Seq : IEnumerable<Foo>, IEquatable<Seq>
Thus, I will be able to compare two sequences, and provide the Set data structure with a way of eliminating duplicates.
My question is: is there a fast data structure that allows for comparing sequences? I am thinking that somehow, when a Seq gets created or added to, some kind of cumulative value is computed.
In other words, is it possible to implement Seq in such a way that I could do this:
var seq1 = new Seq( IList<Foo> );
var seq2 = new Seq( IList<Foo> )
seq1.equals(seq2) // O(1)
Thanks.
I have provided an implementation of your sequence below. There are several points to note:
This only works if the IEnumerable<T> returns the same items every time it is enumerated, and those items are not mutated during the lifetime of this object.
The hash code is cached. The first time it is requested it is calculated (feel free to improve the hash algorithm if you know a better one) based on a full iteration of the underlying sequence. Because it only needs to be calculated once, this can effectively be considered O(1) if you compute it often. Adding to the set is likely to be a bit slower (first-time computation of the hash value), but searching and removing will be very quick.
The Equals method first compares the hash codes. If the hash codes are different then the objects cannot possibly be equal (provided the hash codes are properly implemented on all objects in the sequence, and nothing was mutated). As long as you have a low collision rate, and usually compare items that aren't actually equal, equals checks will rarely get past that hash code check. If they do, an iteration of the sequence is needed (there is no way around that). Because of that, Equals is likely to average O(1), even though its worst case is still O(n).
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public class Foo<T> : IEnumerable<T>
{
    private readonly IEnumerable<T> sequence;
    private int? myHashCode = null;

    public Foo(IEnumerable<T> sequence)
    {
        this.sequence = sequence;
    }

    public IEnumerator<T> GetEnumerator()
    {
        return sequence.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return sequence.GetEnumerator();
    }

    public override bool Equals(object obj)
    {
        Foo<T> other = obj as Foo<T>;
        if (other == null)
            return false;

        // if the hash codes are different we don't need to bother doing a deep equals check
        // the hash code is cached, so it's fast.
        if (GetHashCode() != other.GetHashCode())
            return false;

        return Enumerable.SequenceEqual(sequence, other.sequence);
    }

    public override int GetHashCode()
    {
        // note that the hash code is cached, so the underlying sequence
        // must not change.
        return myHashCode ?? populateHashCode();
    }

    private int populateHashCode()
    {
        const int somePrimeNumber = 37;
        int hash = 1;
        unchecked
        {
            foreach (T item in sequence)
            {
                hash = (hash * somePrimeNumber) + item.GetHashCode();
            }
        }
        myHashCode = hash;
        return hash;
    }
}
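A hypothetical usage example (the values are mine), showing the wrapper deduplicating sequences inside a set:
var set = new HashSet<Foo<char>>
{
    new Foo<char>(new List<char> { 'A', 'B' }),
    new Foo<char>(new List<char> { 'A' }),
    new Foo<char>(new List<char> { 'A', 'B', 'C' }),
};

// A second "AB" has the same cached hash and an equal sequence, so Add returns false.
bool added = set.Add(new Foo<char>(new List<char> { 'A', 'B' }));
Console.WriteLine(added); // False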
O(1) essentially means you are not allowed to compare the values of the elements. If you can represent a sequence as a list of immutable objects (with caching on add, so there are no duplicates across all instances) you can achieve it, as you'd only need to compare the first element; similar to how string interning works.
Insert would have to search all existing instances for the "current" element paired "with this next" element. Some sort of dictionary may be a reasonable approach...
EDIT: I think I simply tried to come up with a suffix tree.

C# (.Net 2.0) Micro-Optimization Part 2: Finding Contiguous Groups within a grid

I have a very simple function which takes in a matching bitfield, a grid, and a square. It used to use a delegate but I did a lot of recoding and ended up with a bitfield & operation to avoid the delegate while still being able to perform matching within reason. Basically, the challenge is to find all contiguous elements within a grid which match the match bitfield, starting from a specific "leader" square.
Square is a somewhat small (but not tiny) class. Any tips on how to push this to be even faster? Note that the grid itself is pretty small (500 elements in this test).
Edit: It's worth noting that this function is called over 200,000 times per second. In truth, in the long run my goal will be to call it less often, but that's really tough, considering that my end goal is to make the grouping system be handled with scripts rather than being hardcoded. That said, this function is always going to be called more than any other function.
Edit: To clarify, the function does not check if leader matches the bitfield, by design. The intention is that the leader is not required to match the bitfield (though in some cases it will).
Things tried unsuccessfully:
Initializing the dictionary and stack with a capacity.
Casting the int to an enum to avoid a cast.
Moving the dictionary and stack outside the function and clearing them each time they are needed. This makes things slower!
Things tried successfully:
Writing a hashcode function instead of using the default: Hashcodes are precomputed and are equal to x + y * parent.Width. Thanks for the reminder, Jim Mischel.
mquander's Technique: See GetGroupMquander below.
Further Optimization: Once I switched to HashSets, I got rid of the Contains test and replaced it with an Add test. Both Contains and Add have to seek the key, so just checking whether an Add succeeds is more efficient than calling Add only after a Contains check fails. That is: if (RetVal.Add(s)) curStack.Push(s);
public static List<Square> GetGroup(int match, Model grid, Square leader)
{
    Stack<Square> curStack = new Stack<Square>();
    Dictionary<Square, bool> Retval = new Dictionary<Square, bool>();
    curStack.Push(leader);
    while (curStack.Count != 0)
    {
        Square curItem = curStack.Pop();
        if (Retval.ContainsKey(curItem)) continue;
        Retval.Add(curItem, true);
        foreach (Square s in curItem.Neighbors)
        {
            if (0 != ((int)(s.RoomType) & match))
            {
                curStack.Push(s);
            }
        }
    }
    return new List<Square>(Retval.Keys);
}
=====
public static List<Square> GetGroupMquander(int match, Model grid, Square leader)
{
    Stack<Square> curStack = new Stack<Square>();
    Dictionary<Square, bool> Retval = new Dictionary<Square, bool>();
    Retval.Add(leader, true);
    curStack.Push(leader);
    while (curStack.Count != 0)
    {
        Square curItem = curStack.Pop();
        foreach (Square s in curItem.Neighbors)
        {
            if (0 != ((int)(s.RoomType) & match))
            {
                if (!Retval.ContainsKey(s))
                {
                    curStack.Push(s);
                    Retval.Add(s, true);
                }
            }
        }
    }
    return new List<Square>(Retval.Keys);
}
The code you posted assumes that the leader square matches the bitfield. Is that by design?
I assume your Square class has implemented a GetHashCode method that's quick and provides a good distribution.
You did say micro-optimization . . .
If you have a good idea how many items you're expecting, you'll save a little bit of time by pre-allocating the dictionary. That is, if you know you won't have more than 100 items that match, you can write:
Dictionary<Square, bool> Retval = new Dictionary<Square, bool>(100);
That will avoid having to grow the dictionary and re-hash everything. You can also do the same thing with your stack: pre-allocate it to some reasonable maximum size to avoid resizing later.
Since you say that the grid is pretty small it seems reasonable to just allocate the stack and the dictionary to the grid size, if that's easy to determine. You're only talking grid_size references each, so memory isn't a concern unless your grid becomes very large.
Adding a check to see if an item is in the dictionary before you do the push might speed it up a little. It depends on the relative speed of a dictionary lookup as opposed to the overhead of having a duplicate item in the stack. Might be worth it to give this a try, although I'd be surprised if it made a big difference.
if (0 != ((int)(s.RoomType) & match))
{
    if (!Retval.ContainsKey(s))
        curStack.Push(s);
}
I'm really stretching on this last one. You have that cast in your inner loop. I know that the C# compiler sometimes generates a surprising amount of code for a seemingly simple cast, and I don't know if that gets optimized away by the JIT compiler. You could remove that cast from your inner loop by creating a local variable of the enum type and assigning it the value of match:
RoomEnumType matchType = (RoomEnumType)match;
Then your inner loop comparison becomes:
if (0 != (s.RoomType & matchType))
No cast, which might shave some cycles.
Edit: Micro-optimization aside, you'll probably get better performance by modifying your algorithm slightly to avoid processing any item more than once. As it stands, items that do match can end up in the stack multiple times, and items that don't match can be processed multiple times. Since you're already using a dictionary to keep track of items that do match, you can keep track of the non-matching items by giving them a value of false. Then at the end you simply create a List of those items that have a true value.
public static List<Square> GetGroup(int match, Model grid, Square leader)
{
    Stack<Square> curStack = new Stack<Square>();
    Dictionary<Square, bool> Retval = new Dictionary<Square, bool>();
    curStack.Push(leader);
    Retval.Add(leader, true);
    int numMatch = 1;
    while (curStack.Count != 0)
    {
        Square curItem = curStack.Pop();
        foreach (Square s in curItem.Neighbors)
        {
            if (Retval.ContainsKey(s))
                continue;
            if (0 != ((int)(s.RoomType) & match))
            {
                curStack.Push(s);
                Retval.Add(s, true);
                ++numMatch;
            }
            else
            {
                Retval.Add(s, false);
            }
        }
    }

    // LINQ makes this easier, but since you're using .NET 2.0...
    List<Square> matches = new List<Square>(numMatch);
    foreach (KeyValuePair<Square, bool> kvp in Retval)
    {
        if (kvp.Value == true)
        {
            matches.Add(kvp.Key);
        }
    }
    return matches;
}
Here are a couple of suggestions -
If you're using .NET 3.5, you could change RetVal to a HashSet<Square> instead of a Dictionary<Square,bool>, since you're never using the values (only the keys) in the Dictionary. This would be a small improvement.
Also, if you changed the return type to IEnumerable<Square>, you could just return the HashSet directly. Depending on the usage of the results, it could potentially be faster in certain areas (and you can always call ToList() on the results if you really need a list).
However, there is a BIG optimization that could be added here -
Right now, you're always adding in every neighbor, even if that neighbor has already been processed. For example, when leader is processed, it adds in leader+1y, then when leader+1y is processed, it puts BACK in leader (even though you've already handled that Square), and next time leader is popped off the stack, you continue. This is a lot of extra processing.
Try adding:
foreach (Square s in curItem.Neighbors)
{
    if ((0 != ((int)(s.RoomType) & match)) && (!Retval.ContainsKey(s)))
    {
        curStack.Push(s);
    }
}
This way, if you've already processed the square of your neighbor, it doesn't get re-added to the stack, just to be skipped when it's popped later.
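Putting that together with the earlier HashSet suggestion (and the Add-returns-bool trick the question's edit mentions), a sketch of the combined version might look like this on .NET 3.5+ (Model and Square are the question's types):
public static List<Square> GetGroupHashSet(int match, Model grid, Square leader)
{
    Stack<Square> curStack = new Stack<Square>();
    HashSet<Square> retVal = new HashSet<Square>();
    retVal.Add(leader);
    curStack.Push(leader);
    while (curStack.Count != 0)
    {
        Square curItem = curStack.Pop();
        foreach (Square s in curItem.Neighbors)
        {
            if (0 != ((int)(s.RoomType) & match))
            {
                // Add returns false if the square was already seen,
                // so each square is pushed (and processed) at most once.
                if (retVal.Add(s))
                    curStack.Push(s);
            }
        }
    }
    return new List<Square>(retVal);
}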
