Reading a large file into a Dictionary - C#

I have a 1GB file containing pairs of string and long.
What's the best way of reading it into a Dictionary, and how much memory would you say it requires?
File has 62 million rows.
I've managed to read it using 5.5GB of ram.
Say 22 bytes overhead per Dictionary entry, that's 1.5GB.
long is 8 bytes, that's 500MB.
Average string length is 15 chars, each char 2 bytes, that's 2GB.
Total is about 4GB, where does the extra 1.5 GB go to?
The initial Dictionary allocation takes 256MB.
I've noticed that every 10 million rows I read consume about 580MB, which fits quite nicely with the above calculation, but somewhere around the 6000th line, memory usage grows from 260MB to 1.7GB. That's my missing 1.5GB - where does it go?
Thanks.

It's important to understand what's happening when you populate a Hashtable. (Dictionary uses a hash table internally, so the same analysis applies.)
When you create a new Hashtable, .NET makes an array containing 11 buckets, which are linked lists of dictionary entries. When you add an entry, its key gets hashed, the hash code gets mapped on to one of the 11 buckets, and the entry (key + value + hash code) gets appended to the linked list.
At a certain point (and this depends on the load factor used when the Hashtable is first constructed), the Hashtable determines, during an Add operation, that it's encountering too many collisions and that the initial 11 buckets aren't enough. So it creates a new array of buckets that's roughly twice the size of the old one (not exactly twice, since the number of buckets is always prime), and then repopulates the new table from the old one.
So there are two things that come into play in terms of memory utilization.
The first is that, every so often, the Hashtable needs to use twice as much memory as it's presently using, so that it can copy the table during resizing. So if you've got a Hashtable that's using 1.8GB of memory and it needs to be resized, it's briefly going to need to use 3.6GB, and, well, now you have a problem.
The second is that every hash table entry has about 12 bytes of overhead: pointers to the key, the value, and the next entry in the list, plus the hash code. For most uses, that overhead is insignificant, but if you're building a Hashtable with 100 million entries in it, well, that's about 1.2GB of overhead.
You can overcome the first problem by using the overload of the Dictionary's constructor that lets you provide an initial capacity. If you specify a capacity big enough to hold all of the entries you're going to add, the Hashtable won't need to be rebuilt while you're populating it. There's pretty much nothing you can do about the second.
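For example, a one-line sketch (the 62 million capacity figure comes from the question; adjust it to your data):

// Pre-sizing means the bucket table is allocated once and is never rebuilt
// (and never briefly doubled) while the ~62 million entries are added.
Dictionary<string, long> dictionary = new Dictionary<string, long>(62000000);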

Everyone here seems to be in agreement that the best way to handle this is to read only a portion of the file into memory at a time. Speed, of course, is determined by which portion is in memory and what parts must be read from disk when a particular piece of information is needed.
There is a simple way to handle deciding which parts are best kept in memory:
Put the data into a database.
A real one, like SQL Server Express, MySQL, or Oracle XE (all are free).
Databases cache the most commonly used information, so it's just like reading from memory. And they give you a single access method for in-memory or on-disk data.

Maybe you can convert that 1 GB file into a SQLite database with two columns, key and value. Then create an index on the key column. After that, you can query the database to get the values for the keys you provide.
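A rough sketch of that conversion, assuming the System.Data.SQLite ADO.NET provider and a "key=value" line format; the file, table, and column names are placeholders:

using System.Data.SQLite;
using System.IO;

using (var connection = new SQLiteConnection("Data Source=lookup.db"))
{
    connection.Open();
    new SQLiteCommand("CREATE TABLE lookup (key TEXT, value INTEGER)", connection).ExecuteNonQuery();

    // One big transaction makes the bulk insert dramatically faster.
    using (var transaction = connection.BeginTransaction())
    using (var insert = new SQLiteCommand("INSERT INTO lookup VALUES (@k, @v)", connection))
    {
        foreach (string line in File.ReadLines("data.txt"))
        {
            string[] bits = line.Split('=');
            insert.Parameters.Clear();
            insert.Parameters.AddWithValue("@k", bits[0]);
            insert.Parameters.AddWithValue("@v", long.Parse(bits[1]));
            insert.ExecuteNonQuery();
        }
        transaction.Commit();
    }

    // Index after loading, so the bulk inserts don't pay index-maintenance costs.
    new SQLiteCommand("CREATE INDEX idx_key ON lookup (key)", connection).ExecuteNonQuery();
}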

Thinking about this, I'm wondering why you'd need to do it... (I know, I know... I shouldn't wonder why, but hear me out...)
The main problem is that there is a huge amount of data that needs to be presumably accessed quickly... The question is, will it essentially be random access, or is there some pattern that can be exploited to predict accesses?
In any case, I would implement this as a sliding cache. E.g. I would load as much as feasibly possible into memory to start with (with the selection of what to load based as much on my expected access pattern as possible) and then keep track of accesses to elements by time last accessed.
If I hit something that wasn't in the cache, then it would be loaded and replace the oldest item in the cache.
This would result in the most commonly used stuff being accessible in memory, but would incur additional work for cache misses.
In any case, without knowing a little more about the problem, this is merely a 'general solution'.
It may be that just keeping it in a local instance of a sql db would be sufficient :)
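For what it's worth, here is a minimal sketch of such a sliding (least-recently-used) cache; LoadFromDisk is a hypothetical stand-in for however you fetch a missing entry:

using System;
using System.Collections.Generic;

class SlidingCache
{
    private readonly int capacity;
    private readonly Dictionary<string, LinkedListNode<KeyValuePair<string, long>>> map =
        new Dictionary<string, LinkedListNode<KeyValuePair<string, long>>>();
    private readonly LinkedList<KeyValuePair<string, long>> order =
        new LinkedList<KeyValuePair<string, long>>();

    public SlidingCache(int capacity) { this.capacity = capacity; }

    public long Get(string key)
    {
        LinkedListNode<KeyValuePair<string, long>> node;
        if (map.TryGetValue(key, out node))
        {
            // Cache hit: move to the front, i.e. mark as most recently used.
            order.Remove(node);
            order.AddFirst(node);
            return node.Value.Value;
        }

        // Cache miss: load the value and evict the oldest entry if full.
        long value = LoadFromDisk(key);
        if (map.Count >= capacity)
        {
            map.Remove(order.Last.Value.Key);
            order.RemoveLast();
        }
        map[key] = order.AddFirst(new KeyValuePair<string, long>(key, value));
        return value;
    }

    private long LoadFromDisk(string key)
    {
        // Hypothetical: read the value from the file, an index, or a database.
        throw new NotImplementedException();
    }
}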

You'll need to specify the file format, but if it's just something like name=value, I'd do:
Dictionary<string, long> dictionary = new Dictionary<string, long>();
using (TextReader reader = File.OpenText(filename))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string[] bits = line.Split('=');
        // Error checking would go here
        long value = long.Parse(bits[1]);
        dictionary[bits[0]] = value;
    }
}
Now, if that doesn't work we'll need to know more about the file - how many lines are there, etc?
Are you using 64 bit Windows? (If not, you won't be able to use more than 3GB per process anyway, IIRC.)
The amount of memory required will depend on the length of the strings, number of entries etc.

I am not familiar with C#, but if you're having memory problems you might need to roll your own memory container for this task.
Since you want to store it in a dict, I assume you need it for fast lookup?
You have not clarified which one should be the key, though.
Let's hope you want to use the long values for keys. Then try this:
Allocate a buffer that's as big as the file. Read the file into that buffer.
Then create a dictionary with the long values as keys (a C# long is 64 bits, by the way) and 32-bit values.
Now browse the data in the buffer like this:
Find the next key-value pair. Calculate the offset of its value in the buffer. Now add this information to the dictionary, with the long as the key and the offset as its value.
That way, you end up with a dictionary which might take maybe 10-20 bytes per record, and one larger buffer which holds all your text data.
At least with C++, this would be a rather memory-efficient way, I think.
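Translated into C#, a sketch of that idea could look like this, assuming "string=long" lines with the long as the key; the dictionary stores only (offset, length) pairs pointing into the one big buffer:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

byte[] buffer = File.ReadAllBytes("data.txt");              // one big allocation
var index = new Dictionary<long, KeyValuePair<int, int>>(); // key -> (offset, length)

int lineStart = 0;
for (int i = 0; i <= buffer.Length; i++)
{
    if (i == buffer.Length || buffer[i] == (byte)'\n')
    {
        // Find the '=' separating the string from the long on this line.
        int eq = Array.IndexOf(buffer, (byte)'=', lineStart, i - lineStart);
        if (eq >= 0)
        {
            string numText = Encoding.ASCII.GetString(buffer, eq + 1, i - eq - 1);
            long key = long.Parse(numText.TrimEnd('\r'));
            index[key] = new KeyValuePair<int, int>(lineStart, eq - lineStart);
        }
        lineStart = i + 1;
    }
}

// Materialize a string only when it's actually needed:
// string s = Encoding.UTF8.GetString(buffer, pair.Key, pair.Value);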

Can you convert the 1G file into a more efficient indexed format, but leave it as a file on disk? Then you can access it as needed and do efficient lookups.
Perhaps you can memory map the contents of this (more efficient format) file, then have minimum ram usage and demand-loading, which may be a good trade-off between accessing the file directly on disc all the time and loading the whole thing into a big byte array.
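A small sketch of the memory-mapped variant, assuming .NET 4's System.IO.MemoryMappedFiles and a hypothetical fixed 16-byte record layout (8-byte key + 8-byte value):

using System.IO.MemoryMappedFiles;

using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile("index.bin"))
using (MemoryMappedViewAccessor accessor = mmf.CreateViewAccessor())
{
    const int recordSize = 16;
    long recordIndex = 42;  // whichever record your lookup has led you to
    long key = accessor.ReadInt64(recordIndex * recordSize);
    long value = accessor.ReadInt64(recordIndex * recordSize + 8);
    // Pages are faulted in on demand, so only the touched parts occupy RAM.
}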

Loading a 1 GB file in memory at once doesn't sound like a good idea to me. I'd virtualize the access to the file by loading it in smaller chunks only when the specific chunk is needed. Of course, it'll be slower than having the whole file in memory, but 1 GB is a real mastodon...

Don't read 1GB of file into memory, even though you have 8GB of physical RAM; you can still run into plenty of problems (based on personal experience).
I don't know what you need to do, but find a workaround: read the file partially and process it in pieces. If that doesn't work, then consider using a database.

If you choose to use a database, you might be better served by a dbm-style tool, like Berkeley DB for .NET. They are specifically designed to represent disk-based hashtables.
Alternatively you may roll your own solution using some database techniques.
Suppose your original data file looks like this (dots indicate that string lengths vary):
[key2][value2...][key1][value1..][key3][value3....]
Split it into index file and values file.
Values file:
[value1..][value2...][value3....]
Index file:
[key1][value1-offset]
[key2][value2-offset]
[key3][value3-offset]
Records in index file are fixed-size key->value-offset pairs and are ordered by key.
Strings in values file are also ordered by key.
To get a value for key(N) you would binary-search for key(N) record in index, then read string from values file starting at value(N)-offset and ending before value(N+1)-offset.
Index file can be read into in-memory array of structs (less overhead and much more predictable memory consumption than Dictionary), or you can do the search directly on disk.
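A sketch of that lookup, assuming fixed 8-byte keys so the index records stay fixed-size; the names and layout are illustrative only:

using System;
using System.IO;
using System.Text;

struct IndexRecord : IComparable<IndexRecord>
{
    public long Key;
    public long ValueOffset;
    public int CompareTo(IndexRecord other) { return Key.CompareTo(other.Key); }
}

static string Lookup(IndexRecord[] index, FileStream values, long key)
{
    int pos = Array.BinarySearch(index, new IndexRecord { Key = key });
    if (pos < 0) return null;                    // key not present

    long start = index[pos].ValueOffset;
    long end = pos + 1 < index.Length
        ? index[pos + 1].ValueOffset             // string ends where the next one starts...
        : values.Length;                         // ...or at end of file for the last record

    byte[] bytes = new byte[end - start];
    values.Seek(start, SeekOrigin.Begin);
    values.Read(bytes, 0, bytes.Length);
    return Encoding.UTF8.GetString(bytes);
}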

Related

StringBuilder used with PadLeft/Right OutOfMemoryException

All, I have the following Append which I perform when producing a single line for a fixed-width text file:
formattedLine.Append(this.reversePadding ?
    strData.PadLeft(this.maximumLength) :
    strData.PadRight(this.maximumLength));
This particular exception happens on the PadLeft() where this.maximumLength = 1,073,741,823 [the field length of an NVARCHAR(MAX) gathered from SQL Server]. formattedLine = "101102AA-1" at the time of the exception, so why is this happening? I should have a maximum allowed length of 2,147,483,647.
I am wondering if https://stackoverflow.com/a/1769472/626442 might be the answer here - however, I am managing memory with the appropriate Dispose() calls on all disposable objects, and using blocks where possible.
Note. This fixed text export is being done on a background thread.
Thanks for your time.
This particular exception happens on the PadLeft() where this.maximumLength = 1,073,741,823
Right. So you're trying to create a string with over a billion characters in.
That's not going to work, and I very much doubt that it's what you really want to do.
Note that each char in .NET is two bytes, and also strings in .NET are null-terminated... and have some other fields beyond the data (the length, for one). That means you'd need at least 2147483652 bytes + object overhead, which pushes you over the 2GB-per-object limit.
If you're running on a 64-bit version of Windows, in .NET 4.5, there's a special app.config setting of <gcAllowVeryLargeObjects> that allows arrays bigger than 2GB. However, I don't believe that will change your particular use case:
Using this element in your application configuration file enables arrays that are larger than 2 GB in size, but does not change other limits on object size or array size:
The maximum number of elements in an array is UInt32.MaxValue.
The maximum index in any single dimension is 2,147,483,591 (0x7FFFFFC7) for byte arrays and arrays of single-byte structures, and 2,146,435,071 (0X7FEFFFFF) for other types.
The maximum size for strings and other non-array objects is unchanged.
What would you want to do with such a string after creating it, anyway?
In order to allocate memory for this operation, the OS must find contiguous memory that is large enough to perform the operation.
Memory fragmentation can cause that to be impossible, especially when using a 32-bit .NET implementation.
I think there might be a better approach to what you are trying to accomplish. Presumably, this StringBuilder is going to be written to a file (that's what it sounds like from your description), and apparently, you are also potentially dealing with large (huge) database records.
You might consider a streaming approach that won't require allocating such a huge block of memory.
To accomplish this you might investigate the following:
The SqlDataReader class exposes a GetChars() method that allows you to read a chunk of a single large record at a time.
Then, instead of using a StringBuilder, use a StreamWriter (or some other TextWriter-derived class) to write each chunk to the output.
This will only require having one buffer-full of the record in your application's memory space at one time. Good luck!
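A sketch of how those two pieces could fit together; "connection" is assumed to be an open SqlConnection, and the table/column names are placeholders:

using System.Data;
using System.Data.SqlClient;
using System.IO;

using (SqlCommand command = new SqlCommand("SELECT BigColumn FROM SomeTable", connection))
using (SqlDataReader reader = command.ExecuteReader(CommandBehavior.SequentialAccess))
using (StreamWriter writer = new StreamWriter("output.txt"))
{
    char[] buffer = new char[8192];
    while (reader.Read())
    {
        long offset = 0;
        long read;
        // Pull the huge NVARCHAR(MAX) field through an 8K buffer, chunk by chunk.
        while ((read = reader.GetChars(0, offset, buffer, 0, buffer.Length)) > 0)
        {
            writer.Write(buffer, 0, (int)read);
            offset += read;
        }
    }
}

SequentialAccess tells the reader to stream the large column rather than buffer the entire row, so only one 8K chunk is in memory at a time.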

Storing a 16-byte string in 4 bytes of memory (compression) in RFID tags

I hope this question is not too vague. I am working on an RFID project and I am using passive tags. These tags store only 4 bytes of data (32 bits). I am trying to store more information, as a string, in the tag's data bank. I searched the internet for string compression algorithms but didn't find any of them suitable. Can someone please guide me through this issue? How can I save more data in this 4-byte data bank? Should I use some other storage strategy, and if so, what? Moreover, I am using C# on a handheld Windows CE device.
I'll appreciate if someone could help me...
It depends on your tag. For example, the Alien tag (http://www.alientechnology.com/docs/products/Alien-Technology-Higgs-3-ALN-9662-Short.pdf) has EPC memory. I think you are using the EPC memory, but you can also use the User Memory in your tag, in which case you don't have to compress anything. That said, I would rather not save much data on the tag itself: I use my own 32-bit coding on the tag and map it to the full data in my software, with the data saved on the hard disk. It is safer, too.
There is obviously no compression that can reduce arbitrary 16-byte values to 4-byte values. That's mathematically impossible; see the pigeonhole principle for details.
Store the actual data in some kind of database, and have the 4 bytes encode an integer that acts as a key for the row you want to refer to - for example an auto-increment primary key, or an index into an array. That works for up to 4 billion rows.
If you have less than 2^32 strings, simply enumerate them and then save the strings index (in your "dictionary") inside your 4 byte "Data Bank".
A compression scheme can't guarantee such high compression ratios.
The only way I can think of with 32 bits is to store an int in them and construct a local/remote URL out of it, which points to the actual data.
You could also make the stored value point to entries in a local look-up table on the device.
Unless you know a lot about the format of your string, it is impossible to do this. This is evident from the pigeonhole principle: you have a theoretical 2^128 different 16-byte strings, but only 2^32 different values to choose from.
In other words, no compression algorithm will guarantee that an arbitrary string in your possible input set will map to a 4-byte value in the output set.
It may be possible to devise an algorithm which will work in your particular case, but unless your data set is sufficiently restricted (at most 1 in 79,228,162,514,264,337,593,543,950,336 possible strings may be valid) and has a meaningful structure, then your only option is to store some mapping externally.
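To make the external-mapping idea concrete, a tiny hypothetical sketch in which the tag's 4 bytes hold nothing but an index into a table kept on the device (or a server):

using System;
using System.Collections.Generic;

List<string> table = new List<string>();     // index -> full string, kept off-tag

// Writing: register the string once, store its index on the tag.
string payload = "SerialNo=ABC-123;Batch=42";
table.Add(payload);
uint tagValue = (uint)(table.Count - 1);     // fits the 4-byte data bank

// Reading: take the 4 bytes back off the tag and look the string up.
byte[] fromTag = BitConverter.GetBytes(tagValue);
uint index = BitConverter.ToUInt32(fromTag, 0);
string resolved = table[(int)index];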

Alternative to Dictionary for lookup

I have a requirement to build a lookup table. I use a Dictionary containing 45M longs and 45M ints, with the long as key and the int as value. The size of the collection should be about 45M * 12 bytes, since a long is 8 bytes and an int is 4 bytes.
That's about 515 MB, but in fact the size of the process is 1.3 GB. The process contains only this lookup table.
Is there perhaps an alternative to Dictionary?
Thanks.
How much effort are you willing to spend?
You could use a
KeyValuePair<long, int>[] table = new KeyValuePair<long, int>[45000000];
then sort this on the first column (long Key) and use a binary search to find your values.
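A sketch of that approach; Array.Sort and Array.BinarySearch both accept a comparer, so one that orders the pairs by Key is all the plumbing needed (someKey is a placeholder):

using System;
using System.Collections.Generic;

class KeyComparer : IComparer<KeyValuePair<long, int>>
{
    public int Compare(KeyValuePair<long, int> x, KeyValuePair<long, int> y)
    {
        return x.Key.CompareTo(y.Key);
    }
}

// After filling the table:
KeyComparer comparer = new KeyComparer();
Array.Sort(table, comparer);

// Each lookup is O(log n) -- about 26 comparisons for 45M entries.
long someKey = 123456789L;
int pos = Array.BinarySearch(table, new KeyValuePair<long, int>(someKey, 0), comparer);
if (pos >= 0)
{
    int value = table[pos].Value;
}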
You could use a SortedList instead of a Dictionary which will be more memory efficient but may be marginally less CPU efficient, ignoring issues about measuring memory and why you need to load so much data in 1 go in the first place :)
Dictionaries have an underlying array that holds the data, but the size of that array must be larger than the number of items you have; this is where the lookup speed of a dictionary comes from. In fact, the size of the underlying array should be quite a bit larger than the number of items (25+%). Combine this with the fact that, as you add items, this underlying array is repeatedly discarded and recreated (to make it larger), and you probably have a fair amount of memory waiting to be garbage collected (meaning that if you actually need more memory the GC will reclaim it, but since you currently have enough, it's not bothering to).
Is this Dictionary consuming more memory than you can possibly allow it to, or are you just curious why it's more than you thought it would be? There are other options available to you (other answers and comments have listed some) that will use less memory but also be slower. Are you running into out of memory issues?
If your range is limited to long values of at most 10^12, the space problem is that you must use longs even though you need only a few bits more than an int can hold. If that's the case, you could do something like this:
Store your data in an array of 512 dictionaries:
var myData = new Dictionary<int,int>[512];
to reference the int associated with a long value (which I'll call "key" for this example), you would do the following:
myData[(int)(key & 511)].Add((int)(key >> 9), intValue);
int result = myData[(int)(key & 511)][(int)(key >> 9)];
Just how many dictionaries you create and the number of bits used in the bit fiddling might need to be adjusted to fit the true constraints of your data. Using this approach would reduce your memory usage by about a third.
Another approach, assuming that the data is static: use two sorted arrays- one of long and one of int. Make sure that item at index N in one is the value for the key at index N in the other. Use Array.BinarySearch to find the key values that you are looking for.
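A sketch of that parallel-arrays variant; Array.Sort has an overload that sorts a keys array and reorders a values array in tandem, and 45M longs plus 45M ints is exactly the ~515 MB computed in the question, with no per-entry overhead (someKey is a placeholder):

using System;

long[] keys = new long[45000000];
int[] values = new int[45000000];
// ... fill both arrays so that values[i] belongs to keys[i] ...

Array.Sort(keys, values);              // sorts keys, keeps values aligned

long someKey = 123456789L;
int pos = Array.BinarySearch(keys, someKey);
if (pos >= 0)
{
    int value = values[pos];
}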

which data structure is good for temporary big binary data storage?

Plan to have a data structure to store temporary binary data in the memory for analysis.
The max size of the data will be about 10MB.
data will be added at the end 408 bytes at a time.
no search, retrieve operations on those temporary binary data.
data will be wipe out and the storage will be reused for next analysis.
questions:
which structure is good for this purpose? byte[10MB], List<byte>(10MB), List<MyStruct>(24000), or something else?
how can I quickly wipe out the data (not List.Clear(), just set the values to 0) for a List or array?
if I call List.Clear(), does the memory for this List shrink, or is the capacity (memory) of the List still there, so that no allocation happens when I call List.AddRange() after the Clear()?
does List.Insert() make the List larger, or does it just replace the existing item?
You will have to describe what you are doing in more detail to get better answers, but it sounds like you are worried about efficiency/performance, so, taking your questions in order:
byte[]
No need to clear the array; just keep track of where the 'end' of your current cycle is.
n/a
n/a
If your data is usually the same size, and always under a certain size, use a byte array.
Create a byte[], and a int that lets you know where the end of the "full" part of that buffer stops and the "free" part starts. You never need to clear it; just overwrite what was there. The only problem with this is if your data is sometimes 100 kb, sometimes 10 MB, and sometimes a bit larger than you originally planned for.
List will be slower to use and larger in memory, although they handle various sizes of data out of the box.
using (System.IO.MemoryStream memStream = new System.IO.MemoryStream())
{
    // Do stuff with memStream here.
} // the using ensures proper disposal occurs here, so you don't have to worry about cleaning up.
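A minimal sketch of the byte[]-plus-index idea for the 408-byte records (the capacity figures are taken from the question):

using System;

class RecordBuffer
{
    private readonly byte[] data = new byte[24000 * 408];  // ~10MB, allocated once
    private int end;                                       // where the "full" part stops

    public void Append(byte[] record)   // each record is expected to be 408 bytes
    {
        Buffer.BlockCopy(record, 0, data, end, record.Length);
        end += record.Length;
    }

    public void Reset()                 // reuse the storage for the next analysis
    {
        end = 0;                        // no zeroing needed; old bytes get overwritten
    }
}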

Best approach to holding large editable documents in memory

I need to hold a representation of a document in memory, and am looking for the most efficient way to do this.
Assumptions
The documents can be pretty large, up to 100MB.
More often than not the document will remain unchanged (i.e. I don't want to do unnecessary up-front processing).
Changes will typically be quite close to each other in the document (i.e. as the user types).
It should be possible to apply changes fast (without copying the whole document).
Changes will be applied in terms of offsets and new/deleted text (not as line/col).
It has to work in C#.
Current considerations
Storing the data as a string. Easy to code, fast to set, very slow to update.
Array of lines. Moderately easy to code, slower to set (as we have to parse the string into lines), faster to update (as we can insert and remove lines easily, but finding offsets requires summing line lengths).
There must be a load of standard algorithms for this kind of thing (it's not a million miles from disk allocation and fragmentation).
Thanks for your thoughts.
I would suggest breaking the file into blocks. All blocks have the same length when you load them, but the length of each block may change if the user edits it. This avoids moving 100 megabytes of data if the user inserts one byte at the front.
To manage the blocks, just put them - together with the offset of each block - into a list. If the user modifies a block's length, you only need to update the offsets of the blocks after that one. To find an offset, you can use binary search.
File size: 100 MiB
Block Size: 16 kiB
Blocks: 6400
Finding an offset using binary search (worst case): 13 steps
Modifying a block (worst case): copy 16384 byte data and update 6400 block offsets
Modifying a block (average case): copy 8192 byte data and update 3200 block offsets
16 kiB block size is just a random example - you can balance the costs of the operations by choosing the block size, maybe based on the file size and the probability of operations. Doing some simple math will yield the optimal block size.
Loading will be quite fast, because you load fixed sized blocks, and saving should perform well, too, because you will have to write a few thousand blocks and not millions of single lines. You can optimize loading by loading blocks only on demand and you can optimize saving by only saving all blocks that changed (content or offset).
Finally, the implementation will not be too hard, either. You could just use the StringBuilder class to represent a block. But this solution will not work well for very long lines with lengths comparable to the block size or larger, because you would have to load many blocks and display only a small part, with the rest lying to the left or right of the window. I assume you would have to use a two-dimensional partitioning model in this case.
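A compact sketch of the block bookkeeping described above (StringBuilder blocks plus a sorted offset table; assumes offsets[0] == 0 and that position falls inside the document):

using System.Collections.Generic;
using System.Text;

static int FindBlock(List<int> offsets, int position)
{
    int pos = offsets.BinarySearch(position);
    return pos >= 0 ? pos : ~pos - 1;   // ~pos is the first block starting after position
}

static void Insert(List<StringBuilder> blocks, List<int> offsets, int position, string text)
{
    int b = FindBlock(offsets, position);
    blocks[b].Insert(position - offsets[b], text);

    // Only the offsets of the blocks after the edited one need updating.
    for (int i = b + 1; i < offsets.Count; i++)
        offsets[i] += text.Length;
}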
Good Math, Bad Math wrote an excellent article about ropes and gap buffers a while ago that details the standard methods for representing text files in a text editor, and even compares them for simplicity of implementation and performance. In a nutshell: a gap buffer - a large character array with an empty section immediately after the current position of the cursor - is your simplest and best bet.
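For illustration, a minimal gap-buffer sketch along the lines the article describes: one char array with an empty "gap" at the cursor, so typing is O(1) and only cursor movement shifts text (growing a full gap is omitted for brevity):

using System;

class GapBuffer
{
    private char[] buffer = new char[64];
    private int gapStart = 0;           // also the cursor position
    private int gapEnd = 64;            // first character after the gap

    public void MoveCursor(int position)
    {
        if (position < gapStart)        // shift the in-between text rightwards
        {
            int count = gapStart - position;
            Array.Copy(buffer, position, buffer, gapEnd - count, count);
            gapStart = position;
            gapEnd -= count;
        }
        else if (position > gapStart)   // shift the in-between text leftwards
        {
            int count = position - gapStart;
            Array.Copy(buffer, gapEnd, buffer, gapStart, count);
            gapStart = position;
            gapEnd += count;
        }
    }

    public void Insert(char c)
    {
        buffer[gapStart++] = c;         // typing just fills the gap: O(1)
    }

    public void Delete()                // backspace: grow the gap leftwards
    {
        if (gapStart > 0) gapStart--;
    }
}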
You might find this paper useful --- Data Structures for Text Sequences which describes and experimentally analyses a few standard algorithms, and compares [among other things] gap buffers and piece tables.
FWIW, it concludes piece tables are slightly better overall; though net.wisdom seems to prefer gap buffers.
I would suggest you to take a look at Memory Mapped Files (MMF).
Some pointers:
Memory Mapped Files .NET
http://msdn.microsoft.com/en-us/library/ms810613.aspx
I'd use a b-tree or skip list of lines, or larger blocks if you aren't going to edit much.
You don't incur much extra cost determining line ends on load, since you have to visit each character during loading anyway.
You can move lines within a node without much effort.
The total length of the text in each node is stored in the node, and changes propagated up to parent nodes.
Each line is represented by a data array, and start index, length and capacity. Line break/carriage returns aren't put in the data array. Common operations such as breaking lines only requires changes to the references into the array; editing lines requires a copy if capacity is exceeded. A similar structure might be used per line temporarily when editing that line, so you don't perform a copy on each key-press.
Off the top of my head, I would have thought an indexed linked list would be fairly efficient for this sort of thing unless you have some very long lines.
The linked list would give you an efficient way to store the data and add or remove lines as the user edits. The indexing allows you to quickly jump to a particular point in your file. This sort of idea lends itself well to undo/redo type operations too as it should be reasonably easy to sort edits into small atomic operations.
I'd agree with crisb's point though: it's probably better to get something simple working first and then see if it really is slow.
From your description it sounds a lot like your document is unformatted text only - so a StringBuilder would do fine.
If it's a formatted document, I would be inclined to use the MS Word APIs or similar and just offload your document processing to them - it will save you an awful lot of time, as document parsing can often be a pain in the a** :-)
I wouldn't get too worried about the performance yet - it sounds a lot like you haven't implemented one yet, so you also don't know what performance characteristics the rest of your app has - it may be that you can't actually afford to hold multiple documents in memory at all when you actually get round to profiling it.
