I am developing an app that utilizes very large lookup tables to speed up mathematical computations. The largest of these tables is an int[] that has ~10 million entries. Not all of the lookup tables are int[]. For example, one is a Dictionary with ~200,000 entries. Currently, I generate each lookup table once (which takes several minutes) and serialize it to disk (with compression) using the following snippet:
int[] lut = GenerateLUT();
lut.Serialize("lut");
where Serialize is defined as follows:
public static void Serialize(this object obj, string file)
{
    using (FileStream stream = File.Open(file, FileMode.Create))
    {
        using (var gz = new GZipStream(stream, CompressionMode.Compress))
        {
            var formatter = new BinaryFormatter();
            formatter.Serialize(gz, obj);
        }
    }
}
The annoyance I am having is that when launching the application, the deserialization of these lookup tables takes very long (upwards of 15 seconds). This kind of delay will annoy users, as the app is unusable until all the lookup tables are loaded. Currently the deserialization is as follows:
Dictionary<string, int> lut1 = (Dictionary<string, int>) Deserialize("lut1");
int[] lut2 = (int[]) Deserialize("lut2");
...
where Deserialize is defined as:
public static object Deserialize(string file)
{
    using (FileStream stream = File.Open(file, FileMode.Open))
    {
        using (var gz = new GZipStream(stream, CompressionMode.Decompress))
        {
            var formatter = new BinaryFormatter();
            return formatter.Deserialize(gz);
        }
    }
}
At first, I thought it might have been the gzip compression that was causing the slowdown, but removing it only shaved a few hundred milliseconds off the serialization/deserialization routines.
Can anyone suggest a way of speeding up the load times of these lookup tables upon the app's initial startup?
First, deserializing in a background thread will prevent the app from "hanging" while this happens. That alone may be enough to take care of your problem.
However, serialization and deserialization (especially of large dictionaries) tend to be very slow in general. Depending on the data structure, writing your own serialization code can dramatically speed this up, particularly if there are no shared references in the data structures.
That being said, depending on the usage pattern, a database might be a better approach. You could always make something more database-oriented and build the lookup table lazily from the DB (i.e., a lookup first checks the LUT, and if the entry doesn't exist, it is loaded from the DB and saved in the table). This would make startup instantaneous (at least in terms of the LUT), and probably still keep lookups fairly snappy.
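A minimal sketch of that lazy approach, assuming a hypothetical GetValueFromDatabase helper for the slow path:
private readonly Dictionary<string, int> lut = new Dictionary<string, int>();

public int Lookup(string key)
{
    int value;
    if (!lut.TryGetValue(key, out value))
    {
        value = GetValueFromDatabase(key); // slow path, hit at most once per key
        lut[key] = value;                  // cache it for subsequent lookups
    }
    return value;
}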
I guess the obvious suggestion is to load them in the background. Once the app has started, the user has opened their project, and selected whatever operation they want, there won't be much of that 15 seconds left to wait.
Just how much data are we talking about here? In my experience, it takes about 20 seconds to read a gigabyte from disk into memory. So if you're reading upwards of half a gigabyte, you're almost certainly running into hardware limitations.
If data transfer rate isn't the problem, then the actual deserialization is taking time. If you have enough memory, you can load all of the tables into memory buffers (using File.ReadAllBytes()) and then deserialize from a memory stream. That will allow you to determine how much time reading is taking, and how much time deserialization is taking.
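For example, a rough sketch of that measurement, reusing the question's deserialization logic but fed from an in-memory buffer (the file name is illustrative):
var readTimer = Stopwatch.StartNew();
byte[] raw = File.ReadAllBytes("lut2");
readTimer.Stop();

var deserializeTimer = Stopwatch.StartNew();
object table;
using (var memory = new MemoryStream(raw))
using (var gz = new GZipStream(memory, CompressionMode.Decompress))
{
    table = new BinaryFormatter().Deserialize(gz);
}
deserializeTimer.Stop();

Console.WriteLine("Read: {0} ms, Deserialize: {1} ms",
    readTimer.ElapsedMilliseconds, deserializeTimer.ElapsedMilliseconds);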
If deserialization is taking a lot of time, you could, if you have multiple processors, spawn multiple threads to do the deserialization in parallel. With such a system, you could potentially be deserializing one or more tables while loading the data for another. That pipelined approach could make your entire load/deserialization time almost as fast as a load-only pass.
Another option is to put your tables into, well, tables: real database tables. Even an engine like Access should yield pretty good performance, because you have an obvious index for every query. Now the app only has to read in data when it's actually about to use it, and even then it's going to know exactly where to look inside the file.
This might make the app's actual performance a bit lower, because you have to do a disk read for every calculation. But it would make the app's perceived performance much better, because there's never a long wait. And, like it or not, the perception is probably more important than the reality.
Why zip them?
Disk is bigger than RAM.
A straight binary read should be pretty quick.
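For the int[] tables, a straight binary round-trip could look something like this sketch (Buffer.BlockCopy avoids per-element work; no compression, no BinaryFormatter):
public static void SaveIntArray(int[] table, string file)
{
    var bytes = new byte[table.Length * sizeof(int)];
    Buffer.BlockCopy(table, 0, bytes, 0, bytes.Length);
    File.WriteAllBytes(file, bytes);
}

public static int[] LoadIntArray(string file)
{
    byte[] bytes = File.ReadAllBytes(file);
    var table = new int[bytes.Length / sizeof(int)];
    Buffer.BlockCopy(bytes, 0, table, 0, bytes.Length);
    return table;
}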
I am writing a console application which iterates through a binary tree and searches for new or changed files based on their md5 checksums.
The whole process is acceptably fast (14 seconds for ~70,000 files), but generating the checksums takes about 5 minutes, which is far too slow...
Any suggestions for improving this process? My hash function is the following:
private string getMD5(string filename)
{
    using (var md5 = new MD5CryptoServiceProvider())
    {
        if (File.Exists(filename))
        {
            try
            {
                var buffer = md5.ComputeHash(File.ReadAllBytes(filename));
                var sb = new StringBuilder();
                for (var i = 0; i < buffer.Length; i++)
                {
                    sb.Append(buffer[i].ToString("x2"));
                }
                return sb.ToString();
            }
            catch (Exception)
            {
                Program.logger.log("Error while creating checksum!", Program.logger.LOG_ERROR);
                return "";
            }
        }
        else
        {
            return "";
        }
    }
}
Well, the accepted answer is not valid, because there certainly are ways to improve your code's performance. It is valid for some of its other points, however.
The main bottleneck here, apart from disk I/O, is memory allocation. Here are some thoughts that should improve speed:
Do not read the entire file into memory for the calculation; that is slow, and it produces a lot of memory pressure via LOH objects. Instead, open the file as a stream and calculate the hash in chunks.
The reason you see a slowdown when using the ComputeHash stream override is that internally it uses a very small buffer (4 KB), so choose an appropriate buffer size (256 KB or more; find the optimal value by experimenting).
Use the TransformBlock and TransformFinalBlock methods to calculate the hash value. You can pass null for the outputBuffer parameter (see the sketch after this list).
Reuse that buffer for subsequent files' hash calculations, so there is no need for additional allocations.
Additionally, you can reuse the MD5CryptoServiceProvider, but the benefits are questionable.
Lastly, you can apply an async pattern to read chunks from the stream, so the OS reads the next chunk from disk while you calculate the partial hash of the previous chunk. Such code is more difficult to write, of course, and you'll need at least two buffers (reuse them as well), but it can have a great impact on speed.
As a minor improvement, do not check for file existence. I believe your function is called from some enumeration, and there is very little chance the file was deleted in the meantime.
All of the above is valid for medium to large files. If you instead have a lot of very small files, you can speed up the calculation by processing files in parallel. Actually, parallelization can also help with large files, but that is something to be measured.
And finally, if collisions don't bother you too much, you can choose a less expensive hash algorithm, CRC for example.
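A minimal sketch of the chunked TransformBlock approach with a reused buffer (the 256 KB size is only a starting point, and the shared buffer is not thread-safe as written; give each worker its own if you parallelize):
private static readonly byte[] sharedBuffer = new byte[256 * 1024]; // reused across calls

private static string GetMd5Chunked(string filename)
{
    using (var md5 = new MD5CryptoServiceProvider())
    using (var stream = new FileStream(filename, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, sharedBuffer.Length))
    {
        int read;
        while ((read = stream.Read(sharedBuffer, 0, sharedBuffer.Length)) > 0)
        {
            md5.TransformBlock(sharedBuffer, 0, read, null, 0); // outputBuffer can be null
        }
        md5.TransformFinalBlock(sharedBuffer, 0, 0);

        var sb = new StringBuilder(32);
        foreach (byte b in md5.Hash)
        {
            sb.Append(b.ToString("x2"));
        }
        return sb.ToString();
    }
}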
In order to create the hash, you have to read every last byte of the file. So this operation is disk-limited, not CPU-limited, and it scales proportionally with the size of the files. Multithreading will not help.
Unless the FS can somehow calculate and store the hash for you, there is just no way to speed this up. You are dependent on what the FS does for you to track changes.
Generally, programs that check for "changed files" (like backup routines) do not calculate the hash value for exactly that reason. They may still calculate and store it for validation purposes, but that is it.
Unless the user does some serious (NTFS-driver-loading-level) sabotage, the "last changed" date together with the file size is enough to detect changes. Maybe also check the archive bit, but that one is rarely used nowadays.
A minor improvement for these kinds of scenarios (list files and process them) is using EnumerateFiles rather than listing the files first. But at 14 seconds of listing vs. 5 minutes of processing, that will not have any relevant effect.
I have a list of files: List<string> Files in my C#-based WPF application.
Files contains ~1,000,000 unique file paths.
I ran a profiler on my application. When I try to do parallel operations, it's REALLY laggy because it's IO bound. It even lags my UI threads, despite not having dispatchers going to them (note the two lines I've marked down):
Files.AsParallel().ForAll(x =>
{
    char[] buffer = new char[0x100000];
    using (FileStream stream = new FileStream(x, FileMode.Open, FileAccess.Read)) // EXTREMELY SLOW
    using (StreamReader reader = new StreamReader(stream, true))
    {
        while (true)
        {
            int bytesRead = reader.Read(buffer, 0, buffer.Length); // EXTREMELY SLOW
            if (bytesRead <= 0)
            {
                break;
            }
        }
    }
});
These two lines of code take up ~70% of my entire profile test runs. I want to achieve maximum parallelization for IO while keeping performance such that it doesn't cripple my app's UI entirely. There is nothing else affecting my performance. Proof: using Files.ForEach doesn't cripple my UI, and WithDegreeOfParallelism helps out too (but I am writing an application that is supposed to be used on any PC, so I cannot assume a specific degree of parallelism for this computation); also, the PC I am on has a solid-state drive. I have searched on StackOverflow and found links that talk about using asynchronous IO read methods. I'm not sure how they apply in this case, though. Perhaps someone can shed some light? Also, how can you reduce the construction time of a new FileStream; is that even possible?
Edit: Well, here's something strange that I've noticed... the UI doesn't get crushed so badly when I swap Read for ReadAsync while still using AsParallel. Simply awaiting the task created by ReadAsync causes my UI thread to maintain some degree of usability. I think there is some sort of asynchronous scheduling done in this method to maintain optimal disk usage while not crushing existing threads. And on that note, is there ever a chance that the operating system contends for existing threads to do IO, such as my application's UI thread? I seriously don't understand why it's slowing my UI thread. Is the OS scheduling work from IO on my thread or something? Did they do something to the CLR to eat threads that haven't been explicitly affinitized using Thread.BeginThreadAffinity or something? Memory is not an issue; I am looking at Task Manager and there is plenty.
I don't agree with your assertion that you can't use WithDegreeOfParallelism because the app will be used on any PC. You can base it on the number of CPUs. By not using WithDegreeOfParallelism you are going to get crushed on some PCs.
You optimized for a solid-state disk where heads don't have to move. I don't think this unrestricted parallel design will hold up on a regular disk (any PC).
I would try a BlockingCollection with 3 queues: FileStream, StreamReader, and ObservableCollection. Limit the FileStream queue to something like 4; it just has to stay ahead of the StreamReader. And no parallelism (a rough sketch follows the code below).
A single head is a single head. It cannot read from 5 or 5,000 files faster than it can read from 1. On solid state there is no penalty for switching from file to file; on a regular disk there is a significant penalty, and if your files are fragmented there is a significant penalty (on a regular disk).
You don't show what you write the data to, but the next step would be to put the write in another queue with another BlockingCollection.
E.g. put sb.Append(text); in a separate queue.
But that may be more overhead than it is worth.
Keeping that head as close to 100% busy as possible on a single contiguous file is the best you are going to do.
private async Task<string> ReadTextAsync(string filePath)
{
    using (FileStream sourceStream = new FileStream(filePath,
        FileMode.Open, FileAccess.Read, FileShare.Read,
        bufferSize: 4096, useAsync: true))
    {
        StringBuilder sb = new StringBuilder();
        byte[] buffer = new byte[0x1000];
        int numRead;
        while ((numRead = await sourceStream.ReadAsync(buffer, 0, buffer.Length)) != 0)
        {
            string text = Encoding.Unicode.GetString(buffer, 0, numRead);
            sb.Append(text);
        }
        return sb.ToString();
    }
}
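And a rough sketch of the bounded BlockingCollection pipeline described above: one producer opens a handful of streams ahead of a single, non-parallel consumer that does the reads (names and buffer handling are illustrative):
public static void ProcessPipelined(IEnumerable<string> files)
{
    // Bounded to 4 so opening streams only stays slightly ahead of reading.
    using (var openStreams = new BlockingCollection<FileStream>(boundedCapacity: 4))
    {
        var opener = Task.Run(() =>
        {
            foreach (var path in files)
            {
                openStreams.Add(new FileStream(path, FileMode.Open, FileAccess.Read));
            }
            openStreams.CompleteAdding();
        });

        // Single consumer: one disk head, so no parallel reads.
        foreach (var stream in openStreams.GetConsumingEnumerable())
        {
            using (stream)
            using (var reader = new StreamReader(stream))
            {
                var buffer = new char[0x1000];
                while (reader.Read(buffer, 0, buffer.Length) > 0)
                {
                    // hand the data off to a third queue here
                    // (e.g. the one feeding the ObservableCollection)
                }
            }
        }

        opener.Wait();
    }
}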
File access is inherently not parallel. You can only benefit from parallelism if you process some files while reading others. It makes no sense to wait for the disk in parallel.
Instead of waiting 100,000 times for 1 ms of disk access, your program ends up waiting once for 100,000 ms = 100 s.
Unfortunately, it's a vague question without a reproducible code example. So it's impossible to offer specific advice. But my two recommendations are:
Pass a ParallelOptions instance where you've set the MaxDegreeOfParallelism property to something reasonably low, such as the number of cores in your system, or even that number minus one (see the sketch after these two points).
Make sure you aren't expecting too much from the disk. You should start with the known speed of the disk and controller, and compare that with the data throughput you're getting. Adjust the degree of parallelism even lower if it looks like you're already at or near the maximum theoretical throughput.
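A sketch of the first point, assuming the work is moved to Parallel.ForEach (with PLINQ, WithDegreeOfParallelism on the AsParallel query is the equivalent):
var options = new ParallelOptions
{
    MaxDegreeOfParallelism = Math.Max(1, Environment.ProcessorCount - 1)
};

Parallel.ForEach(Files, options, path =>
{
    // open and read one file here
});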
Performance optimization is all about setting realistic goals based on known limitations of the hardware, measuring your actual performance, and then looking at how you can improve the costliest elements of your algorithm. If you haven't done the first two steps yet, you really should start there. :)
I got it working; the problem was that I was trying to use an ExtendedObservableCollection with AddRange instead of calling Add multiple times in every UI dispatch... for some reason, the performance of the methods people list in "ObservableCollection doesn't support AddRange method, so I get notified for each item added, besides what about INotifyCollectionChanging?" is actually slower in my situation.
I think it's because forcing you to raise change notifications with .Reset (reload) instead of .Add (a diff) triggers some logic that causes bottlenecks.
I apologize for not posting the rest of the code; I was really thrown off by this, and I'll explain why in a moment. Also, a note for others who come across the same issue, since it might help: the main problem with profiling tools in this scenario is that they don't help much here. Most of your app's time will be spent reading files regardless. So you have to unit test all dispatchers separately.
I am looking for an approach to analyse custom log files.
Right now I have it implemented using LINQ and C#.NET. It only works on log files of up to 500 MB in size.
Each line of the log file is made in to an object that looks like
public class Metrics
{
    public DateTime Date { get; set; }
    public string Metrics1 { get; set; }
    public string Metrics2 { get; set; }
    :
    :
    public string Metrics9 { get; set; }
}
List<Metrics> MetricsList = new List<Metrics>();
MetricsList is populated.
Various LINQ queries are run on MetricsList to provide useful analytics.
It is observed that a Metrics object needs 300 bytes. I have approximately 4 million lines in a 500 MB log file, which makes MetricsList alone consume more than 1 GB of program memory.
My requirement is to parse and analyse files of up to 2 GB in size, which looks like it will consume 4 GB of memory.
Are there any better approaches or alternatives using Windows, Microsoft technologies, or open-source libraries?
I have done a similar task using SQLite. Install the System.Data.SQLite NuGet package (optionally, I used the Dapper NuGet package as a very efficient micro-ORM too) and then you have a very good tool for performing queries and generating your reports. The only thing you may not like is that you have to write SQL instead of LINQ (although there is LINQ for SQLite too, I have not used it).
This way, the memory consumption problem goes away too.
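For example, a minimal sketch of that setup, assuming the parsed log has already been loaded into a table named Metrics (the table and column names here are illustrative, not from the original post):
public class MetricsRow
{
    public DateTime Date { get; set; }
    public string Metrics1 { get; set; }
}

public static IEnumerable<MetricsRow> RowsSince(string dbPath, DateTime since)
{
    // Requires the System.Data.SQLite and Dapper NuGet packages.
    using (var connection = new SQLiteConnection("Data Source=" + dbPath))
    {
        connection.Open();
        // Dapper buffers the result by default, so disposing the connection here is safe.
        return connection.Query<MetricsRow>(
            "SELECT Date, Metrics1 FROM Metrics WHERE Date >= @since",
            new { since });
    }
}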
Usually you don't want to store files like that in memory (unless you have enough of it, of course), but rather process the data as you parse the file. I'd probably just install more memory and set the solution to 64-bit...
However, if that is not an option, you can always optimize memory usage a bit. .NET stores strings as char[] where a char is basically a 2-byte short. You can easily save a lot of memory by simply not storing it as char[] but as byte[] using Encoding.UTF8.GetBytes.
Also, each string or byte[] consumes 24 bytes (16 for the object itself, 8 for the pointer) in a 64-bit environment. That can add up if you have a lot of small strings. Instead of storing them as strings, you can also store a single byte[] and do the parsing in the getters.
So to conclude my advice is: buy more memory or process the data as you read/need it.
[Update+1]
Just noticed that you use a List. The easiest way to process-as-you-go is to read the file as an IEnumerable and use LINQ on that. Don't put it in a list first. E.g.:
public IEnumerable<Metrics> ReadFile(string path)
{
    using (var reader = new StreamReader(path))
    {
        string s;
        while ((s = reader.ReadLine()) != null)
        {
            yield return Parse(s); // Parse turns one log line into a Metrics object
        }
    }
}

int someAnalysis = ReadFile(path).Sum(a => a.Metrics1.Length); // or whatever you do
[Update+2]
Oh, I have another trick for you. Reading files can be a performance pain, since file IO is relatively slow. So instead of using the IEnumerable trick above, you can also use a compressed stream to store all the data in memory, and then use that during processing instead of the file.
For the people who are wondering whether I'm serious about this weird solution: it is a frequently used technique when you're building search technology and databases, simply because having more in (fast) memory means less (slow) disk IO. Furthermore, a log file will probably compress very nicely.
So: read the file into a DeflateStream on top of a MemoryStream, then read that back for the LINQ queries in the way discussed above (again, a DeflateStream on top of the MemoryStream).
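A rough sketch of that, assuming the compressed log fits comfortably in memory (method names are illustrative):
// Read the log file once and keep only the deflated bytes in memory.
public static byte[] LoadCompressed(string path)
{
    using (var compressed = new MemoryStream())
    {
        using (var source = File.OpenRead(path))
        using (var deflate = new DeflateStream(compressed, CompressionMode.Compress, leaveOpen: true))
        {
            source.CopyTo(deflate);
        }
        return compressed.ToArray();
    }
}

// Enumerate the lines again, decompressing on the fly, for the LINQ queries.
public static IEnumerable<string> ReadLines(byte[] compressedData)
{
    using (var memory = new MemoryStream(compressedData))
    using (var deflate = new DeflateStream(memory, CompressionMode.Decompress))
    using (var reader = new StreamReader(deflate))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}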
I am trying to export a StringDictionary to a text file. It has over one million records, and it takes over 3 minutes to export to the text file if I use a loop.
Is there a way to do that faster?
Regards
Well, it depends on what format you're using for the export, but in general, the biggest overhead for exporting large amounts of data is going to be I/O. You can reduce this by using a more compact data format, and by doing less manipulation of the data in memory (to avoid memory copies) if possible.
The first thing to check is to look at your disk I/O speed and do some profiling of the code that does the writing.
If you're maxing out your disk I/O (e.g., writing at a good percentage of disk speed, which would be many tens of megabytes per second on a modern system), you could consider compressing the data before you write it. This uses more CPU, but you write less to the disk when you do this. This will also likely increase the speed of reading the file, if you have the same bottleneck on the reading side.
If you're maxing out your CPU, you need to do less processing work on the data before writing it. If you're using a serialization library, for example, avoiding that and switching to a simpler, more specialized data format might help. Consider the simplest format you need: probably just a word for the length of the string, followed by the string data itself, repeated for every key and value.
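As a sketch of that kind of minimal format (BinaryWriter already writes a length prefix in front of each string; data stands for the StringDictionary being exported):
public static void Export(StringDictionary data, string file)
{
    using (var stream = File.Create(file))
    using (var writer = new BinaryWriter(stream))
    {
        writer.Write(data.Count); // record count up front
        foreach (DictionaryEntry pair in data)
        {
            writer.Write((string)pair.Key);
            writer.Write((string)pair.Value);
        }
    }
}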
Note that most dictionary constructs don't preserve the insertion order; this often makes them poor choices if you want repeatable file contents, but (depending on the size) we may be able to improve on the time... this (below) takes about 3.5 s (for the export) to write just under 30 MB:
StringDictionary data = new StringDictionary();
Random rand = new Random(123456);
for (int i = 0; i < 1000000; i++)
{
    data.Add("Key " + i, "Value = " + rand.Next());
}

Stopwatch watch = Stopwatch.StartNew();
using (TextWriter output = File.CreateText("foo.txt"))
{
    foreach (DictionaryEntry pair in data)
    {
        output.Write((string)pair.Key);
        output.Write('\t');
        output.WriteLine((string)pair.Value);
    }
    output.Close();
}
watch.Stop();
Obviously the performance will depend on the size of the actual data getting written.
I have a 1GB file containing pairs of string and long.
What's the best way of reading it into a Dictionary, and how much memory would you say it requires?
File has 62 million rows.
I've managed to read it using 5.5GB of ram.
Say 22 bytes overhead per Dictionary entry, that's 1.5GB.
long is 8 bytes, that's 500MB.
Average string length is 15 chars, each char 2 bytes, that's 2GB.
Total is about 4 GB; where does the extra 1.5 GB go?
The initial Dictionary allocation takes 256MB.
I've noticed that each 10 million rows I read, consume about 580MB, which fits quite nicely with the above calculation, but somewhere around the 6000th line, memory usage grows from 260MB to 1.7GB, that's my missing 1.5GB, where does it go?
Thanks.
It's important to understand what's happening when you populate a Hashtable. (The Dictionary uses a Hashtable as its underlying data structure.)
When you create a new Hashtable, .NET makes an array containing 11 buckets, which are linked lists of dictionary entries. When you add an entry, its key gets hashed, the hash code gets mapped on to one of the 11 buckets, and the entry (key + value + hash code) gets appended to the linked list.
At a certain point (and this depends on the load factor used when the Hashtable is first constructed), the Hashtable determines, during an Add operation, that it's encountering too many collisions, and that the initial 11 buckets aren't enough. So it creates a new array of buckets that's twice the size of the old one (not exactly; the number of buckets is always prime), and then populates the new table from the old one.
So there are two things that come into play in terms of memory utilization.
The first is that, every so often, the Hashtable needs to use twice as much memory as it's presently using, so that it can copy the table during resizing. So if you've got a Hashtable that's using 1.8GB of memory and it needs to be resized, it's briefly going to need to use 3.6GB, and, well, now you have a problem.
The second is that every hash table entry has about 12 bytes of overhead: pointers to the key, the value, and the next entry in the list, plus the hash code. For most uses, that overhead is insignificant, but if you're building a Hashtable with 100 million entries in it, well, that's about 1.2GB of overhead.
You can overcome the first problem by using the overload of the Dictionary's constructor that lets you provide an initial capacity. If you specify a capacity big enough to hold all of the entries you're going to be adding, the Hashtable won't need to be rebuilt while you're populating it. There's pretty much nothing you can do about the second.
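For the numbers in the question, that would be something like the following (62 million is the row count mentioned above):
// Pre-size the dictionary so it never has to rebuild its bucket table while loading.
var dictionary = new Dictionary<string, long>(62000000);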
Everyone here seems to be in agreement that the best way to handle this is to read only a portion of the file into memory at a time. Speed, of course, is determined by which portion is in memory and what parts must be read from disk when a particular piece of information is needed.
There is a simple way to decide which parts are best to keep in memory:
Put the data into a database.
A real one, like MSSQL Express, MySQL, or Oracle XE (all are free).
Databases cache the most commonly used information, so it's just like reading from memory. And they give you a single access method for in-memory or on-disk data.
Maybe you can convert that 1 GB file into a SQLite database with two columns, key and value. Then create an index on the key column. After that, you can query the database to get the values of the keys you provide.
Thinking about this, I'm wondering why you'd need to do it... (I know, I know... I shouldn't wonder why, but hear me out...)
The main problem is that there is a huge amount of data that presumably needs to be accessed quickly... The question is, will it essentially be random access, or is there some pattern that can be exploited to predict accesses?
In any case, I would implement this as a sliding cache. E.g., I would load as much as feasibly possible into memory to start with (with the selection of what to load based as much as possible on my expected access pattern), and then keep track of when each element was last accessed.
If I hit something that wasn't in the cache, then it would be loaded and replace the oldest item in the cache.
This would result in the most commonly used stuff being accessible in memory, but would incur additional work for cache misses.
In any case, without knowing a little more about the problem, this is merely a 'general solution'.
It may be that just keeping it in a local instance of a sql db would be sufficient :)
You'll need to specify the file format, but if it's just something like name=value, I'd do:
Dictionary<string, long> dictionary = new Dictionary<string, long>();
using (TextReader reader = File.OpenText(filename))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string[] bits = line.Split('=');
        // Error checking would go here
        long value = long.Parse(bits[1]);
        dictionary[bits[0]] = value;
    }
}
Now, if that doesn't work we'll need to know more about the file - how many lines are there, etc?
Are you using 64 bit Windows? (If not, you won't be able to use more than 3GB per process anyway, IIRC.)
The amount of memory required will depend on the length of the strings, number of entries etc.
I am not familiar with C#, but if you're having memory problems you might need to roll your own memory container for this task.
Since you want to store it in a dict, I assume you need it for fast lookup?
You have not clarified which one should be the key, though.
Let's hope you want to use the long values for keys. Then try this:
Allocate a buffer that's as big as the file. Read the file into that buffer.
Then create a dictionary with the long values (32 bit values, I guess?) as keys, with their values being a 32 bit value as well.
Now browse the data in the buffer like this:
Find the next key-value pair. Calculate the offset of its value in the buffer. Now add this information to the dictionary, with the long as the key and the offset as its value.
That way, you end up with a dictionary which might take maybe 10-20 bytes per record, and one larger buffer which holds all your text data.
At least with C++, this would be a rather memory-efficient way, I think.
Can you convert the 1G file into a more efficient indexed format, but leave it as a file on disk? Then you can access it as needed and do efficient lookups.
Perhaps you can memory map the contents of this (more efficient format) file, then have minimum ram usage and demand-loading, which may be a good trade-off between accessing the file directly on disc all the time and loading the whole thing into a big byte array.
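A small sketch of the memory-mapping idea, assuming .NET 4+ and a hypothetical pre-built file of fixed-size records so an offset can be computed directly:
// Requires System.IO.MemoryMappedFiles.
using (var mmf = MemoryMappedFile.CreateFromFile("lookup.dat", FileMode.Open))
using (var accessor = mmf.CreateViewAccessor())
{
    const int recordSize = 16;  // hypothetical fixed record: 8-byte key hash + 8-byte value
    long recordIndex = 12345;   // e.g. found via a binary search over the sorted records
    long value = accessor.ReadInt64(recordIndex * recordSize + 8);
    Console.WriteLine(value);
}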
Loading a 1 GB file in memory at once doesn't sound like a good idea to me. I'd virtualize the access to the file by loading it in smaller chunks only when the specific chunk is needed. Of course, it'll be slower than having the whole file in memory, but 1 GB is a real mastodon...
Don't read a 1 GB file into memory; even though you have 8 GB of physical RAM, you can still run into many problems (based on personal experience).
I don't know what you need to do, but find a workaround: read partially and process as you go. If that doesn't work for you, then consider using a database.
If you choose to use a database, you might be better served by a dbm-style tool, like Berkeley DB for .NET. They are specifically designed to represent disk-based hashtables.
Alternatively you may roll your own solution using some database techniques.
Suppose your original data file looks like this (dots indicate that string lengths vary):
[key2][value2...][key1][value1..][key3][value3....]
Split it into index file and values file.
Values file:
[value1..][value2...][value3....]
Index file:
[key1][value1-offset]
[key2][value2-offset]
[key3][value3-offset]
Records in index file are fixed-size key->value-offset pairs and are ordered by key.
Strings in values file are also ordered by key.
To get a value for key(N) you would binary-search for key(N) record in index, then read string from values file starting at value(N)-offset and ending before value(N+1)-offset.
Index file can be read into in-memory array of structs (less overhead and much more predictable memory consumption than Dictionary), or you can do the search directly on disk.
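A rough sketch of that lookup, assuming the index has already been loaded into a sorted in-memory array of structs (the IndexEntry layout and names are illustrative):
public struct IndexEntry
{
    public string Key;
    public long ValueOffset;
}

public static string Lookup(IndexEntry[] index, FileStream valuesFile, string key)
{
    int lo = 0, hi = index.Length - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        int cmp = string.CompareOrdinal(index[mid].Key, key);
        if (cmp == 0)
        {
            // The value runs from this entry's offset up to the next entry's offset.
            long start = index[mid].ValueOffset;
            long end = mid + 1 < index.Length ? index[mid + 1].ValueOffset : valuesFile.Length;
            var bytes = new byte[end - start];
            valuesFile.Seek(start, SeekOrigin.Begin);
            valuesFile.Read(bytes, 0, bytes.Length); // a production version should loop until all bytes are read
            return Encoding.UTF8.GetString(bytes);
        }
        if (cmp < 0) lo = mid + 1;
        else hi = mid - 1;
    }
    return null; // key not found
}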