Clean up algorithm - C#

I've made a C# application which connects to my webcam and reads the images at the speed the webcam delivers them. I'm parsing the stream so that I end up with a few JPEGs per second.
I don't want to write all the webcam data to disk; I want to store the images in memory. The application will also act as a webserver to which I can supply a datetime in the query string, and the webserver must serve the image closest to that time which it still has in memory.
In my code I have this:
Dictionary<DateTime, byte[]> cameraImages;
of which DateTime is the timestamp of the received image and the bytearray is the jpeg.
All of that works; handling the web request works too. Basically I want to clean up that dictionary by keeping images according to their age.
Now I need an algorithm that cleans up the older images.
I can't really figure out an algorithm for it, partly because the datetimes don't fall on exact moments and I can't be sure that an image always arrives (sometimes the image stream is aborted for several minutes). But what I want to do is:
Keep all images for the first minute.
Keep 2 images per second for the first half hour.
Keep only one image per second if it's older than 30 minutes.
Keep only one image per 30 seconds if it's older than 2 hours.
Keep only one image per minute if it's older than 12 hours.
Keep only one image per hour if it's older than 24 hours.
Keep only one image per day if it's older than two days.
Remove all images older than one week.
The above intervals are just an example.
Any suggestions?

I think @Kevin Holditch's approach is perfectly reasonable and has the advantage that it would be easy to get the code right.
If there were a large number of images, or you otherwise wanted to think about how to do this "efficiently", I would propose a thought process like the following:
Create 7 queues, representing your seven categories. We take care to keep the images in each queue in sorted time order. The queue data structure is able to efficiently insert at its front and remove from its back; .NET's Queue would be perfect for this.
Each queue (call it Qi) has an "incoming" set and an "outgoing" set. The incoming set for queue 0 is the images from the camera; for any other queue it is equal to the outgoing set of queue i-1.
Each queue has rules on both its input and output side which determine whether the queue will admit new items from its incoming set and whether it should eject items from its back into its outgoing set. As a specific example, if Q3 is the queue for "keep only one image per 30 seconds if it's older than 2 hours", then Q3 iterates over its incoming set (which is the outgoing set of Q2) and only admits an item i where i's timestamp is 30 seconds or more away from Q3.First() (for this to work correctly the items need to be processed from highest to lowest timestamp). On the output side, we eject from Q3's tail any object older than 12 hours, and this becomes the input set for Q4.
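As a rough sketch of that cascade (my own naming and simplification, not a full implementation: I process the incoming images oldest-first and compare against the last admitted timestamp, which achieves the same thinning):
using System;
using System.Collections.Generic;

class Tier
{
    readonly Queue<KeyValuePair<DateTime, byte[]>> _images = new Queue<KeyValuePair<DateTime, byte[]>>();
    DateTime _lastAdmitted = DateTime.MinValue;

    public TimeSpan MinSpacing;   // e.g. 30 seconds for the "one per 30 seconds" tier
    public TimeSpan MaxAge;       // when an image gets older than this, it moves to the next tier

    // incoming must be in ascending time order; returns the images ejected to the next tier
    public List<KeyValuePair<DateTime, byte[]>> Process(IEnumerable<KeyValuePair<DateTime, byte[]>> incoming)
    {
        foreach (var image in incoming)
        {
            if (image.Key - _lastAdmitted >= MinSpacing)   // thin out to this tier's rate
            {
                _images.Enqueue(image);
                _lastAdmitted = image.Key;
            }
        }

        var ejected = new List<KeyValuePair<DateTime, byte[]>>();
        while (_images.Count > 0 && DateTime.Now - _images.Peek().Key > MaxAge)
            ejected.Add(_images.Dequeue());                // too old for this tier: hand it on
        return ejected;
    }
}
Chaining the tiers is then a loop: the list returned by tier i is fed into tier i+1, and whatever falls out of the last tier is discarded.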
Again, @Kevin Holditch's approach has the virtue of simplicity and is probably what you should do. I just thought you might find the above to be food for thought.

You could do this quite easily (although it may not be the most efficient way) by using LINQ.
E.g.
var firstMinImages = cameraImages.Where(
c => c.Key >= DateTime.Now.AddMinutes(-1));
Then do an equivalent query for every time interval. Combine them into one store of images and overwrite your existing store (presuming you don't want to keep the rest). This will work with your current criteria, as the number of images to keep gets progressively smaller over time.
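For illustration, a hedged sketch of that combination step (my code, using two of the question's example intervals; the remaining intervals would follow the same pattern):
var now = DateTime.Now;

// keep every image from the last minute
var lastMinute = cameraImages.Where(c => c.Key >= now.AddMinutes(-1));

// keep two images per second for everything between 1 and 30 minutes old
var upToHalfHour = cameraImages
    .Where(c => c.Key < now.AddMinutes(-1) && c.Key >= now.AddMinutes(-30))
    .GroupBy(c => new
    {
        Second = new DateTime(c.Key.Year, c.Key.Month, c.Key.Day, c.Key.Hour, c.Key.Minute, c.Key.Second),
        Half = c.Key.Millisecond / 500
    })
    .Select(g => g.First());

// ... the older intervals are built the same way ...

cameraImages = lastMinute.Concat(upToHalfHour)
    .ToDictionary(c => c.Key, c => c.Value);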

My strategy would be to group the elements into buckets that you plan to weed out, then pick one element from each bucket to keep... I have made an example of how to do this using a list of DateTimes and ints, but pics would work exactly the same way.
My Class used to store each Pic
class Pic
{
public DateTime when {get;set;}
public int val {get;set;}
}
and a sample of a few items in the list...
List<Pic> intTime = new List<Pic>();
intTime.Add(new Pic() { when = DateTime.Now, val = 0 });
intTime.Add(new Pic() { when = DateTime.Now.AddDays(-1), val = 1 });
intTime.Add(new Pic() { when = DateTime.Now.AddDays(-1.01), val = 2 });
intTime.Add(new Pic() { when = DateTime.Now.AddDays(-1.02), val = 3 });
intTime.Add(new Pic() { when = DateTime.Now.AddDays(-2), val = 4 });
intTime.Add(new Pic() { when = DateTime.Now.AddDays(-2.1), val = 5 });
intTime.Add(new Pic() { when = DateTime.Now.AddDays(-2.2), val = 6 });
intTime.Add(new Pic() { when = DateTime.Now.AddDays(-3), val = 7 });
Now I create a helper function to bucket and remove...
private static void KeepOnlyOneFor(List<Pic> intTime, Func<Pic, int> Grouping, DateTime ApplyBefore)
{
    // bucket everything older than the threshold, keeping the buckets in time order
    var groups = intTime.Where(a => a.when < ApplyBefore).OrderBy(a => a.when).GroupBy(Grouping);
    foreach (var r in groups)
    {
        // keep only the newest picture in each bucket and remove the rest
        var s = r.Where(a => a != r.LastOrDefault());
        intTime.RemoveAll(a => s.Contains(a));
    }
}
What this does is let you specify how to group the objects and set an age threshold for the grouping. Now, finally, to use it...
This will remove all but 1 picture per Day for any pics greater than 2 days old:
KeepOnlyOneFor(intTime, a => a.when.Day, DateTime.Now.AddDays(-2));
This will remove all but 1 picture for each Hour after 1 day old:
KeepOnlyOneFor(intTime, a => a.when.Hour, DateTime.Now.AddDays(-1));

If you are on .NET 4 you could use a MemoryCache for each interval, with CacheItemPolicy objects to expire entries when you want them to expire and UpdateCallbacks to move some of them to the next interval.
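A hedged sketch of that idea (my own code and tier split, not a drop-in implementation; MemoryCache checks expirations periodically, so the timing is approximate):
using System;
using System.Runtime.Caching;

class TieredImageCache
{
    readonly MemoryCache _fresh = new MemoryCache("fresh");        // keep everything for 1 minute
    readonly MemoryCache _halfHour = new MemoryCache("halfHour");  // ~2 images per second for 30 minutes

    public void Add(DateTime timestamp, byte[] jpeg)
    {
        var policy = new CacheItemPolicy
        {
            AbsoluteExpiration = DateTimeOffset.Now.AddMinutes(1),
            // fires shortly before the entry is removed, giving us a chance to demote it
            UpdateCallback = args => Demote(timestamp, jpeg)
        };
        _fresh.Add(timestamp.ToString("o"), jpeg, policy);
    }

    void Demote(DateTime timestamp, byte[] jpeg)
    {
        // bucket key with half-second resolution: at most 2 images per second survive,
        // because Add does nothing if the bucket key already exists
        string bucket = timestamp.ToString("yyyyMMddHHmmss") + (timestamp.Millisecond < 500 ? "A" : "B");
        _halfHour.Add(bucket, jpeg, new CacheItemPolicy
        {
            AbsoluteExpiration = DateTimeOffset.Now.AddMinutes(30)
            // a further UpdateCallback here would demote into the next, coarser tier, and so on
        });
    }
}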

Related

Tips for better performance of C# code

I wrote a small program to calculate the total value of some items on the Steam market. Everything works fine, but the program needs around 2 minutes to start because it takes some time to get the prices of the items.
Now I would like to know if I could change the code a little bit to optimize the program so it doesn't take 2 minutes to start; 1 minute or less would be great.
Between each connection to the website where I get the prices, I have to wait a few seconds (around 3.5 s seems to be best) or I get an error because of too many requests, so this is something I think I can't really change.
This is how I get the price from the website:
string urlsteammarkt = "https://steamcommunity.com/market/priceoverview/?appid=730&currency=3&market_hash_name=";
string Chroma()
{
    WebClient chroma = new WebClient();
    // get the information (srcChroma is a field declared in the rest of the code)
    srcChroma = chroma.DownloadString(urlsteammarkt + "Chroma%20Case");
    // cut out the min price
    srcChroma = srcChroma.Remove(0, 32);
    srcChroma = srcChroma.Remove(srcChroma.IndexOf('\\'));
    // replace -- when it's a round price (e.g. 13€ will be displayed as 13,--)
    srcChroma = srcChroma.Replace("--", "00");
    return srcChroma + "€\n";
}
Code is too long, here is the rest of the code:
https://pastebin.com/ZrTqjMd0
First of all: you are literally telling your program to sleep (just wait and do nothing) between requests, which is where most of those 2 minutes go.
And you need to learn programming, especially OOP (https://www.w3schools.com/cs/cs_oop.asp).
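For illustration only, a hedged sketch (my code, not part of this answer) of how the same scraping could run on a background task, so the program starts immediately while the prices fill in as they arrive; the item names are just examples and the 3.5-second spacing is kept to respect the rate limit:
using System;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    static readonly HttpClient http = new HttpClient();
    const string BaseUrl =
        "https://steamcommunity.com/market/priceoverview/?appid=730&currency=3&market_hash_name=";

    static async Task LoadPricesAsync(string[] items)
    {
        foreach (var item in items)
        {
            string json = await http.GetStringAsync(BaseUrl + item);
            // same crude string surgery as in the question; a JSON parser would be more robust
            json = json.Remove(0, 32);
            json = json.Remove(json.IndexOf('\\')).Replace("--", "00");
            Console.WriteLine($"{item}: {json}€");
            await Task.Delay(TimeSpan.FromSeconds(3.5));   // stay under the request limit
        }
    }

    static void Main()
    {
        var task = LoadPricesAsync(new[] { "Chroma%20Case", "Chroma%202%20Case" });
        Console.WriteLine("Started; prices will appear as they are fetched.");
        task.Wait();   // in a UI app you would simply not block here
    }
}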

How much data to hold in memory when looping through a large dataset

I am trying to create a trading simulator to test strategies over long periods of time.
I am using 1-minute data points. So if I were to run a simulation for, say, 10 years, that would be approx. 3,740,000 prices (the Price class is shown below). A simulation could be much longer than 10 years, but I'm using this as an example.
class Price
{
    DateTime DatePrice;
    double Open;
    double High;
    double Low;
    double Close;
}
My simulator works, however I can't help feeling the way I'm doing it isn't very optimal.
So currently what I do is grab a year's worth of prices from my SQL database, so approx. 374,400 prices. I do this because I don't want to use too much memory (this might be misguided, I have no idea).
Now when looping through time, the code will also make use of, let's say, the previous 10 prices. So at 2:30am the code will look back at the prices from 2:20am onwards; all the prices before that are now redundant. So it seems somewhat wasteful to me to hold 374,400 prices in memory.
time Close
00:00 102
00:01 99
00:02 100
...
02:20 84
02:21 88
So I have a loop that runs from my start date to my end date, checking at each step whether I need to download additional prices from the database.
List<Price> PriceList = Database.GetPrices(first year's worth of prices)
for (DateTime dtNow = dtStart; dtNow < dtEnd; dtNow = dtNow.AddMinutes(1))
{
    // run some calculations which don't take long
    // then check if PriceList[i] == PriceList.Count - 1
    // if so get more prices from the database and obviously reset i to zero,
    // bearing in mind I need to keep the previous 10 prices
}
What is the best solution for this kind of problem? Should I be getting prices from the database on another thread or something?
Let's do some math:
class Price
{
DateTime DatePrice;
double Open;
double High;
double Low;
double Close;
}
has a size of 8 (DateTime) + 4*8 (double) = 40 bytes for the members alone. Since it is a reference type you also need a method table pointer and a SyncBlock, which add another 16 bytes. Since you need to keep the pointer to the object (8 bytes on x64) somewhere, we get a total size per instance of 64 bytes.
If you want to have a 10-year history with 3.7 million instances you will need 237 MB of memory, which is not much in today's world.
You can shave off some overhead by switching from double to float, which needs only 4 bytes, and by going with a struct:
struct Price
{
DateTime DatePrice;
float Open;
float High;
float Low;
float Close;
}
You will need only 24 bytes per instance, with no big loss of precision, since the value range of stocks is not that high and you are interested in a long-term trend or pattern, not in 0.000000x fractions.
With this struct your 10-year time horizon will cost you only 88 MB, and it will keep the garbage collector off your data because it is opaque to the GC (no reference types inside your struct).
That simple optimization should be good enough for time horizons that span hundreds of years, even with today's computers and memory sizes. It would even fit into an x86 address space, but I would recommend running this on x64 because I suspect you will check not only one stock but several in parallel.
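As a minimal illustration of that layout (my code; the element count is just the question's 10-year figure):
using System;

struct Price
{
    public DateTime DatePrice;
    public float Open, High, Low, Close;
}

class Program
{
    static void Main()
    {
        const int tenYears = 3740000;         // ~1-minute bars over 10 years of trading days
        var history = new Price[tenYears];    // one contiguous allocation, 24 bytes per element
        Console.WriteLine((long)tenYears * 24 / (1024 * 1024) + " MiB");  // on the order of the ~88 MB mentioned above
    }
}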
If I were you, I would keep the problem of caching (which seems to be your concern) separate from the functionality.
I don't know how you currently fetch your data from the DB. I am guessing you are using some logic similar to:
DataAdapter.Fill(dataset);
List<Price> PriceList = dataset.Tables[0].SomeLinqQuery();
Instead of fetching all the prices at the same time, you can use something like the code below to fetch them incrementally and convert each fetched row into a Price object:
IDataReader rdr = IDbCommand.ExecuteReader();
while (rdr.Read())
{
    // convert the current row into a Price and hand it to the simulation
}
Now, to make access to the prices transparent, you might want to roll your own class which provides caching:
class FixedSizeCircularBuffer<T>
{
    void Add(T item) { }                           // make sure to dequeue automatically to keep the buffer size fixed
    public T GetPrevious(int relativePosition) { } // provide an indexer to access items relative to the current element
}

class CachedPrices
{
    FixedSizeCircularBuffer<Price> Cache;

    public CachedPrices()
    {
        // connect to the DB and do ExecuteReader
        // set the cache object to a good size
    }

    public Price this[int i]
    {
        get
        {
            if (i is in Cache)
                return Cache[i];
            else
                reader.Read();
            // store the newly fetched item in the cache
        }
    }
}
Once you have such infrastructure, you can pretty much use it to restrict how much pricing information is loaded, and keep your functionality separate from the caching mechanism. This gives you the flexibility to control how much memory you want to spend on pre-fetching prices versus the amount of data you can process.
Needless to say, this is just a guideline - you will have to understand it and implement it for yourself (a sketch of the circular buffer follows below).
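To make the circular-buffer part concrete, here is a minimal sketch (my own naming and code, just one way to fill in the guideline above):
using System;

class FixedSizeCircularBuffer<T>
{
    readonly T[] _items;
    int _next;    // index where the next item will be written
    int _count;

    public FixedSizeCircularBuffer(int capacity) { _items = new T[capacity]; }

    public void Add(T item)
    {
        _items[_next] = item;                     // overwrite the oldest slot when full
        _next = (_next + 1) % _items.Length;
        if (_count < _items.Length) _count++;
    }

    // relativePosition = 0 is the most recent item, 1 the one before, and so on
    public T GetPrevious(int relativePosition)
    {
        if (relativePosition >= _count) throw new ArgumentOutOfRangeException(nameof(relativePosition));
        int index = (_next - 1 - relativePosition + _items.Length * 2) % _items.Length;
        return _items[index];
    }
}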
From a time efficiency perspective, what would be optimal is for you to get back an initial batch of prices, start processing those, then immediately begin to retrieve the rest. The problem with checking for new data during your processing is that you have to delay your program every time you need new data.
If you really do care about memory, you need to remove prices from the list once you are done with them. This will allow the garbage collector to free up the consumed memory. Otherwise, with what you have, by the time your program is finishing and you have pulled back the last year of prices, you will have retrieved all of the prices and will be consuming as much memory as if you had fetched them all at once.
I believe you are being premature with your memory concerns. The only time I ever had to worry about memory and the garbage collector in .NET was when I had a long-running process where one step involved downloading PDFs. Even though I retrieved the PDFs as needed, they would eventually consume GBs of memory after running for a while and throw an exception once whatever memory limit .NET has for lists was hit.

Find blocks of same values in a collection and manipulate surrounding values

I really don't know how to summarize the question in the title, sorry. :)
Let's assume I have a collection (e.g. an ObservableCollection) containing thousands of objects. These objects consist of an ascending timestamp and a FALSE boolean value (very simplified).
Like this:
[0] 0.01, FALSE
[1] 0.02, FALSE
[2] 0.03, FALSE
[3] 0.04, FALSE
...
Now, let's assume that within this collection, there are blocks that have their flag set to TRUE.
Like this:
[2345] 23.46, FALSE
[2346] 23.47, FALSE
[2347] 23.48, FALSE
[2348] 23.49, TRUE
[2349] 23.50, TRUE
[2350] 23.51, TRUE
[2351] 23.52, TRUE
[2352] 23.53, TRUE
[2353] 23.54, FALSE
[2354] 23.55, FALSE
...
I need to find these blocks and set all flags within 1.5 seconds before and after each block to TRUE as well.
How can I achieve this while maintaining reasonable performance?
Matthias G's solution is correct, although quite slow – it seems to have n-squared complexity.
His algorithm first scans the input values to filter them by IsActive, retrieves the timestamps and puts them into a new list – this is O(n), at least. Then it scans the constructed list, which in the worst case may be the whole input – O(n) – and for every timestamp retrieved it scans the input values again to modify the appropriate ones – O(n^2) overall.
It also builds an additional list just to be scanned once and thrown away.
I'd propose a solution somewhat similar to merge sort. First scan the input values and for each active item push the appropriate time interval into a queue. You may delay the push to see whether the next interval overlaps the current one – in that case extend the interval instead of pushing. When the input list is done, push the last delayed interval. This way your queue will contain the (almost) minimal number of time intervals you need to modify.
Then scan the values again and compare each item's timestamp to the first interval in the queue. If the timestamp falls into the current interval, mark the item active. If it falls past the interval, remove the interval from the queue and compare the timestamp to the next one – and so on, until the item is in or before the current interval. Your input data are in chronological order, so the intervals will be in the same order too. This allows accomplishing the task in a single parallel pass through both lists (see the sketch below).
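For illustration, a sketch of that two-pass idea (my code; it assumes the Value class shown in the answer below, with double TimeStamp and bool IsActive, and a list already sorted by TimeStamp):
using System.Collections.Generic;

// assumes: public class Value { public double TimeStamp { get; set; } public bool IsActive { get; set; } }
static class BlockMarker
{
    public static void MarkAround(List<Value> values, double range)
    {
        // pass 1: build the merged [start, end] intervals around the active items
        var intervals = new Queue<double[]>();
        double[] current = null;
        foreach (var v in values)
        {
            if (!v.IsActive) continue;
            double start = v.TimeStamp - range, end = v.TimeStamp + range;
            if (current == null)
                current = new[] { start, end };
            else if (start <= current[1])
                current[1] = end;                  // overlaps the delayed interval: extend it
            else
            {
                intervals.Enqueue(current);        // no overlap: push the delayed interval
                current = new[] { start, end };
            }
        }
        if (current != null) intervals.Enqueue(current);

        // pass 2: walk the values once, advancing through the interval queue in parallel
        foreach (var v in values)
        {
            while (intervals.Count > 0 && v.TimeStamp > intervals.Peek()[1])
                intervals.Dequeue();               // this interval lies entirely behind us
            if (intervals.Count > 0 && v.TimeStamp >= intervals.Peek()[0])
                v.IsActive = true;
        }
    }
}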
Assuming you have a data structure like this:
Edit: Changed TimeStamp to double
public class Value
{
public double TimeStamp { get; set; }
public bool IsActive { get; set; }
}
And a list of these objects called values. Then you can search for active data sets and, for each of them, mark the values within a range around them as active:
double range = 1.5;
var activeTimeStamps = values.Where(value => value.IsActive)
.Select(value => value.TimeStamp)
.ToList();
foreach (var timeStamp in activeTimeStamps)
{
var valuesToMakeActive =
values.Where
(
value =>
value.TimeStamp >= timeStamp - range &&
value.TimeStamp <= timeStamp + range
);
foreach (var value in valuesToMakeActive)
{
value.IsActive = true;
}
}
Anyway, I guess there will be a solution with better performance..

Managing time series in C#

I wanted to get your opinion on what you think is the best way to manage time series in C#. I need a 2-dimensional, matrix-like structure with DateTime objects as the row index (ordered and without duplicates), where each column represents the stock value for the relevant DateTime. I would like to know whether any such object can handle missing data for a date: adding a column or a time series would add the missing date to the row index and insert "null" or "N/A" for missing values at existing dates.
A lot of stuff is already available in C# compared to C++ and I don't want to miss something obvious.
TeaFiles.Net is a library for time series storage in flat files. As I understand it, you only want to have the data in memory, in which case you would use a MemoryStream and pass it to the ctor.
// the time series item type
struct Tick
{
public DateTime Time;
public double Price;
public int Volume;
}
// create file and write some values
var ms = new MemoryStream();
using (var tf = TeaFile<Tick>.Create(ms))
{
tf.Write(new Tick { Price = 5, Time = DateTime.Now, Volume = 700 });
tf.Write(new Tick { Price = 15, Time = DateTime.Now.AddHours(1), Volume = 1700 });
// ...
}
ms.Position = 0; // reset the stream
// read typed
using (var tf = TeaFile<Tick>.OpenRead(ms))
{
Tick value = tf.Read();
Console.WriteLine(value);
}
https://github.com/discretelogics/TeaFiles.Net
You can install the library via the NuGet Package Manager ("TeaFiles.Net").
A VSIX sample project is also available in the VS Gallery.
You could use a mapping between the date and the stock value, such as Dictionary<DateTime, decimal>. This way the dates can be sparse.
If you need the prices of multiple stocks at each date, and not every stock appears for every date, then you could choose between Dictionary<DateTime, Dictionary<Stock, decimal>> and Dictionary<Stock, Dictionary<DateTime, decimal>>, depending on how you want to access the values afterwards (or even both if you don't mind storing the values twice).
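As a small illustration of the nested-dictionary variant (my example; the stock symbols are placeholders):
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        var prices = new Dictionary<DateTime, Dictionary<string, decimal>>();
        var day = new DateTime(2012, 5, 1);

        if (!prices.ContainsKey(day))
            prices[day] = new Dictionary<string, decimal>();
        prices[day]["MSFT"] = 31.2m;   // dates and stocks can both be sparse

        Dictionary<string, decimal> byStock;
        if (prices.TryGetValue(day, out byStock) && byStock.ContainsKey("AAPL"))
            Console.WriteLine(byStock["AAPL"]);
        else
            Console.WriteLine("N/A");  // no AAPL price stored for that date
    }
}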
The DateTime object in C# is a value type, which means it initializes to its default value, which is January 1 of year 1 at 00:00:00 (DateTime.MinValue).
If I understood you right, you need a data structure that holds DateTime objects that are ordered in some way, and when you insert a new object the adjacent DateTime objects will change to retain your order.
In this case I would focus more on the data structure than on the DateTime object.
Write a simple class that inherits from List<>, for example, and include the functionality you want on an insert or delete operation.
Something like:
public class DateTimeList : List<DateTime>
{
    public void InsertDateTime(int position, DateTime dateTime)
    {
        // insert the new object
        this.Insert(position, dateTime);
        // then take the adjacent objects (take care of integrity checks, i.e.
        // does the index/object exist? is it not null? etc.)
        DateTime previous = this.ElementAt<DateTime>(position - 1);
        // modify the previous DateTime object according to your needs.
        DateTime next = this.ElementAt<DateTime>(position + 1);
        // modify the next DateTime object according to your needs.
    }
}
As you mentioned in your comment to Marc's answer, I believe the SortedList is a more appropriate structure to hold your time series data.
UPDATE
As zmbq mentioned in his comment to Marc's question, the SortedList is implemented as an array, so if faster insertion/removal times are needed then the SortedDictionary would be a better choice.
See Jon Skeet's answer to this question for an overview of the performance differences.
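For illustration, a small sketch (my example, not from either answer) of a sparse series held in a SortedList, which keeps the dates ordered and simply has no entry for missing dates:
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        var series = new SortedList<DateTime, decimal>();
        series[new DateTime(2012, 1, 3)] = 101.5m;
        series[new DateTime(2012, 1, 2)] = 100.0m;   // inserted out of order, kept sorted by key

        decimal value;
        if (series.TryGetValue(new DateTime(2012, 1, 4), out value))
            Console.WriteLine(value);
        else
            Console.WriteLine("N/A");                // no value stored for that date
    }
}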
There is a time series library called TimeFlow, which allows smart creation and handling of time series.
The central TimeSeries class knows its timezone and is internally based on a sorted list of DateTimeOffset/decimal pairs with a specific frequency (minute, hour, day, month or even custom periods). The frequency can be changed during resample operations (e.g. hours -> days). It is also possible to combine time series using the standard operators (+, -, *, /) or advanced join operations using custom methods.
Furthermore, the TimeFrame class combines multiple time series of the same timezone and frequency (similar to Python's DataFrame, but restricted to time series) for easier access.
Additionally, there is the TimeFlow.Reporting library, which provides advanced reporting/visualization (currently Excel and WPF) of time frames.
Disclaimer: I am the creator of these libraries.

Get fast random access to binary files, but also sequential when needed. How to layout?

I have about 1 billion datasets, each with a DatasetKey and between 1 and 50,000,000 child entries (some objects); the average is about 100, but there are many fat tails.
Once the data is written, there is no update to the data, only reads.
I need to read the data by DatasetKey and one of the following:
Get number of child entries
Get first 1000 child entries (max if less than 1000)
Get first 5000 child entries (max if less than 5000)
Get first 100000 child entries (max if less than 100000)
Get all child entries
Each child entry has a size of about 20 bytes to 2KB (450 bytes averaged).
My layout I want to use would be the following:
I create a file of a size of at least 5MB.
Each file contains at least one DatasetKey, but if the file is still less than 5 MB I add new DatasetKeys (with their child entries) until I exceed 5 MB.
First I store a header that says at which file-offsets I will find what kind of data.
Furthermore, I plan to store serialized packages using protocol buffers:
One package for the first 1000 entries,
one for the next 4000 entries,
one for the next 95000 entries,
one for the remaining entries.
I store the file sizes in RAM (storing all the headers would require too much RAM on the machine I use).
When I need to access a specific DatasetKey I look in the RAM which file I need. Then I get the file size from the RAM.
When the file size is about 5 MB or less I will read the whole file into memory and process it.
If it is more than 5 MB, I will read only the first x KB to get the header. Then I load the position I need from disk.
How does this sound? Is this totally nonsense, or a good way to go?
Using this design I had the following in mind:
I want to store my data in my own binary files instead of a database, to make it easier to back up and process the files in the future.
I would have used PostgreSQL, but I figured out that storing binary data would make PostgreSQL's TOAST do more than one seek to access the data.
Storing one file for each DatasetKey takes too much time for writing all the values to disk.
The data is calculated in RAM (as the whole data set doesn't fit into RAM simultaneously, it is calculated block-wise).
The Filesize of 5MB is only a rough estimation.
What do you say?
Thank you for your help in advance!
edit
Some more background information:
DatasetKey is of type ulong.
A child entry (there are different types) is most of the time like the following:
public struct ChildDataSet
{
public string Val1;
public string Val2;
public byte Val3;
public long Val4;
}
I cannot tell exactly what data will be accessed. The plan is that users get access to the first 1000, 5000, 100000 or all data of particular DatasetKeys, based on their settings.
I want to keep the response time as low as possible and use as little disk space as possible.
Regarding random access (Marc Gravell's question):
I do not need access to element no. 123456 for a specific DatasetKey.
When storing more than one DatasetKey (with its child entries) in one file (the way I designed it so as not to create too many files),
I need random access to the first 1000 entries of a specific DatasetKey in that file, or the first 5000 (so I would read the 1000 and the 4000 packages).
I only need access to the following regarding one specific DatasetKey (uint):
1000 child entries (or all child entries if less than 1000)
5000 child entries (or all child entries if less than 5000)
100000 child entries (or all child entries if less than 100000)
all child entries
All other things I mentioned were just a design attempt on my part :-)
EDIT, streaming for one List in a class?
[ProtoContract]
public class ChildDataSet
{
    [ProtoMember(1)]
    public List<Class1> Val1;
    [ProtoMember(2)]
    public List<Class2> Val2;
    [ProtoMember(3)]
    public List<Class3> Val3;
}
Could I stream Val1 on its own, for example to get the first 5000 entries of Val1?
Go with a single file. At the front of the file, store the ID-to-offset mapping. Assuming your ID space is sparse, store an array of ID+offset pairs, sorted by ID. Use binary search to find the right entry. That is roughly log(n/K) seeks, where "K" is the number of ID+offset pairs you can store on a single disk block (though the OS might need an additional seek or two to find each block).
If you want spend some memory to reduce disk seeks, store an in-memory sorted array of every 10,000th ID. When looking up an ID, find the closest ID without going over. This will give you a 10,000-ID range in the header that you can binary search over. You can very precisely scale up/down your memory usage by increasing/decreasing the number of keys in the in-memory table.
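As a rough sketch of that lookup (my own naming; it assumes the in-memory table holds every 10,000th ID+offset pair of the sorted on-disk header):
struct IdOffset
{
    public uint Id;
    public long Offset;
}

static class SparseIndex
{
    // returns the index of the last in-memory entry whose Id is <= the requested id,
    // i.e. the start of the 10,000-entry header range that must be binary-searched on disk
    public static int FindRangeStart(IdOffset[] sparse, uint id)
    {
        int lo = 0, hi = sparse.Length - 1, result = 0;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (sparse[mid].Id <= id) { result = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        return result;
    }
}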
Dense ID space: But all of this is completely unnecessary if your ID space is relatively dense, which it seems it might be since you have 1 billion IDs out of a total possible ~4 billion (assuming uint is 32-bits).
The sorted array technique described above requires storing ID+offset for 1 billion IDs. Assuming offsets are 8 bytes, this requires 12 GB in the file header. If you went with a straight array of offsets it would require 32 GB in the file header, but now only a single disk seek (plus the OS's seeks) and no in-memory lookup table.
If 32 GB is too much, you can use a hybrid scheme where you use an array on the first 16 or 24 bits and use a sorted array for the last 16 or 8. If you have multiple levels of arrays, then you basically have a trie (as someone else suggested).
Note on multiple files: With multiple files you're basically trying to use the operating system's name lookup mechanism to handle one level of your ID-to-offset lookup. This is not as efficient as handling the entire lookup yourself.
There may be other reasons to store things as multiple files, though. With a single file, you need to rewrite your entire dataset if anything changes. With multiple files you only have to rewrite a single file. This is where the operating system's name lookup mechanism comes in handy.
But if you do end up using multiple files, it's probably more efficient for ID lookup to make sure they have roughly the same number of keys rather than the same file size.
The focus seems to be on the first n items; in which case, protobuf-net is ideal. Allow me to demonstrate:
using System;
using System.IO;
using System.Linq;
using ProtoBuf;
class Program
{
static void Main()
{
// invent some data
using (var file = File.Create("data.bin"))
{
var rand = new Random(12346);
for (int i = 0; i < 100000; i++)
{
// nothing special about these numbers other than convenience
var next = new MyData { Foo = i, Bar = rand.NextDouble() };
Serializer.SerializeWithLengthPrefix(file, next, PrefixStyle.Base128, Serializer.ListItemTag);
}
}
// read it back
using (var file = File.OpenRead("data.bin"))
{
MyData last = null;
double sum = 0;
foreach (var item in Serializer.DeserializeItems<MyData>(file, PrefixStyle.Base128, Serializer.ListItemTag)
.Take(4000))
{
last = item;
sum += item.Foo; // why not?
}
Console.WriteLine(last.Foo);
Console.WriteLine(sum);
}
}
}
[ProtoContract]
class MyData
{
[ProtoMember(1)]
public int Foo { get; set; }
[ProtoMember(2)]
public double Bar { get; set; }
}
In particular, because DeserializeItems<T> is a streaming API, it is easy to pull a capped quantity of data by using LINQ's Take (or just foreach with break).
Note, though, that the existing public dll won't love you for using struct; v2 does better there, but personally I would make that a class.
Create a solution with as many settings as possible. Then create a few test scripts and see which settings work best.
Create some settings for:
Original file size
Separate file headers
Caching strategy (how much and what in memory)
Why not try memory-mapped files, or SQL Server with FILESTREAM?
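For the memory-mapped route, a hedged sketch (my code; the file name, offset and package size are placeholders):
using System.IO.MemoryMappedFiles;

class Program
{
    static void Main()
    {
        using (var mmf = MemoryMappedFile.CreateFromFile("data.bin"))
        using (var view = mmf.CreateViewStream(4096, 65536))   // e.g. the "first 1000 entries" package of one DatasetKey
        {
            var package = new byte[65536];
            view.Read(package, 0, package.Length);
            // deserialize the package (e.g. with protocol buffers) from the byte array here
        }
    }
}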
