Related
i want a fast way in c# to remove a blocks of bytes in different places from binary file of size between 500MB to 1GB , the start and the length of bytes needed to be removed are in saved array
int[] rdiDataOffset= {511,15423,21047};
int[] rdiDataSize={102400,7168,512};
EDIT:
this is a piece of my code and it will not work correctly unless i put buffer size to 1:
while(true){
if (rdiDataOffset.Contains((int)fsr.Position))
{
int idxval = Array.IndexOf(rdiDataOffset, (int)fsr.Position, 0, rdiDataOffset.Length);
int oldRFSRPosition = (int)fsr.Position;
size = rdiDataSize[idxval];
fsr.Seek(size, SeekOrigin.Current);
}
int bufferSize = size == 0 ? 2048 : size;
if ((size>0) && (bufferSize > (size))) bufferSize = (size);
if (bufferSize > (fsr.Length - fsr.Position)) bufferSize = (int)(fsr.Length - fsr.Position);
byte[] buffer = new byte[bufferSize];
int nofbytes = fsr.Read(buffer, 0, buffer.Length);
fsr.Flush();
if (nofbytes < 1)
{
break;
}
}
No common file system provides an efficient way to remove chunks from the middle of an existing file (only truncate from the end). You'll have to copy all the data after the removal back to the appropriate new location.
A simple algorithm for doing this using a temp file (it could be done in-place as well but you have a riskier situation in case things go wrong).
Create a new file and call SetLength to set the stream size (if this is too slow you can Interop to SetFileValidData). This ensures that you have room for your temp file while you are doing the copy.
Sort your removal list in ascending order.
Read from the current location (starting at 0) to the first removal point. The source file should be opened without granting Write share permissions (you don't want someone mucking with it while you are editing it).
Write that content to the new file (you will likely need to do this in chunks).
Skip over the data not being copied
Repeat from #3 until done
You now have two files - the old one and the new one ... replace as necessary. If this is really critical data you might want to look a transactional approach (either one you implement or using something like NTFS transactions).
Consider a new design. If this is something you need to do frequently then it might make more sense to have an index in the file (or near the file) which contains a list of inactive blocks - then when necessary you can compress the file by actually removing blocks ... or maybe this IS that process.
If you're on the NTFS file system (most Windows deployments are) and you don't mind doing p/invoke methods, then there is a way, way faster way of deleting chunks from a file. You can make the file sparse. With sparse files, you can eliminate a large chunk of the file with a single call.
When you do this, the file is not rewritten. Instead, NTFS updates metadata about the extents of zeroed-out data. The beauty of sparse files is that consumers of your file don't have to be aware of the file's sparseness. That is, when you read from a FileStream over a sparse file, zeroed-out extents are transparently skipped.
NTFS uses such files for its own bookkeeping. The USN journal, for example, is a very large sparse memory-mapped file.
The way you make a file sparse and zero-out sections of that file is to use the DeviceIOControl windows API. It is arcane and requires p/invoke but if you go this route, you'll surely hide the uggles behind nice pretty function calls.
There are some issues to be aware of. For example, if the file is moved to a non-ntfs volume and then back, the sparseness of the file can disappear - so you should program defensively.
Also, a sparse file can appear to be larger than it really is - complicating tasks involving disk provisioning. A 5g sparse file that has been completely zeroed out still counts 5g towards a user's disk quota.
If a sparse file accumulates a lot of holes, you might want to occasionally rewrite the file in a maintenance window. I haven't seen any real performance troubles occur, but I can at least imagine that the metadata for a swiss-cheesy sparse file might accrue some performance degradation.
Here's a link to some doc if you're into the idea.
I'm using C# and I write my data into csv files (for further use). However my files have grown into a large scale and i have to transpose them. what's the easiest way to do that. in any program?
Gil
In increasing order of complexity (and also increasing order of ability to handle large files):
Read the whole thing into a 2-D array (or jagged array aka array-of-arrays).
Memory required: equal to size of file
Track the file offset within each row. Start by finding each (non-quoted) newline, storing the current position into a List<Int64>. Then iterate across all rows, for each row: seek to the saved position, copy one cell to the output, save the new position. Repeat until you run out of columns (all rows reach a newline).
Memory required: eight bytes per row
Frequent file seeks scattered across a file much larger than the disk cache results in disk thrashing and miserable performance, but it won't crash.
Like above, but working on blocks of e.g. 8k rows. This will create a set of files each with 8k columns. The input block and output all fit within disk cache, so no thrashing occurs. After building the stripe files, iterate across the stripes, reading one row from each and appending to the output. Repeat for all rows. This results in sequential scan on each file, which also has very reasonable cache behavior.
Memory required: 64k for first pass, (column count/8k) file descriptors for second pass.
Good performance for tables of up to several million in each dimension. For even larger data sets, combine just a few (e.g. 1k) of the stripe files together, making a smaller set of larger stripes, repeat until you have only a single stripe with all data in one file.
Final comment: You might squeeze out more performance by using C++ (or any language with proper pointer support), memory-mapped files, and pointers instead of file offsets.
It really depends. Are you getting these out of a database? The you could use a MySql import statement. http://dev.mysql.com/doc/refman/5.1/en/load-data.html
Or you could use could loop through the data add it to a file stream using streamwriter object.
StreamWriter sw = new StreamWriter('pathtofile');
foreach(String[] value in lstValueList){
String something = value[1] + "," + value[2];
sw.WriteLine(something);
}
I wrote a little proof-of-concept script here in python. I admit it's buggy and there are likely some performance improvements to be made, but it will do it. I ran it against a 40x40 file and got the desired result. I started to run it against something more like your example data set and it took too long for me to wait.
path = mkdtemp()
try :
with open('/home/user/big-csv', 'rb') as instream:
reader = csv.reader(instream)
for i, row in enumerate(reader):
for j, field in enumerate(row):
with open(join(path, 'new row {0:0>2}'.format(j)), 'ab') as new_row_stream:
contents = [ '{0},'.format(field) ]
new_row_stream.writelines(contents)
print 'read row {0:0>2}'.format(i)
with open('/home/user/transpose-csv', 'wb') as outstream:
files = glob(join(path, '*'))
files.sort()
for filename in files:
with open(filename, 'rb') as row_file:
contents = row_file.readlines()
outstream.writelines(contents + [ '\n' ])
finally:
print "done"
rmtree(path)
I have about 1 billion datasets that have a DatasetKey and each has between 1 and 50 000 000 child entries (some objects), average is about 100, but there are many fat tails.
Once the data is written, there is no update to the data, only reads.
I need to read the data by DatasetKey and one of the following:
Get number of child entries
Get first 1000 child entries (max if less than 1000)
Get first 5000 child entries (max if less than 5000)
Get first 100000 child entries (max if less than 100000)
Get all child entries
Each child entry has a size of about 20 bytes to 2KB (450 bytes averaged).
My layout I want to use would be the following:
I create a file of a size of at least 5MB.
Each file contains at least one DatasetKey, but if the file is still less than 5MB I add new DatasetKeys (with child entries) till I exceed the 5 MB.
First I store a header that says at which file-offsets I will find what kind of data.
Further I plan to store serialized packages using protocol-buffers.
One package for the first 1000 entries,
one for the next 4000 entries,
one for the next 95000 entries,
one for the next remaining entries.
I store the file sizes in RAM (storing all the headers would be to much RAM needed on the machine I use).
When I need to access a specific DatasetKey I look in the RAM which file I need. Then I get the file size from the RAM.
When the file-size is about 5MB or less I will read the whole file to memory and process it.
If it is more than 5MB I will read only the first xKB to get the header. Then I load the position I need from disk.
How does this sound? Is this totaly nonsense? Or a good way to go?
Using this design I had the following in mind:
I want to store my data in an own binary file instead a database to have it easier to backup and process the files in future.
I would have used postgresql but I figured out storing binary data would make postgresqls-toast to do more than one seek to access the data.
Storing one file for each DatasetKey needs too much time for writing all the values to disk.
The data is calculated in the RAM (as not the whole data is fitting simultaniously in the RAM, it is calculated block wise).
The Filesize of 5MB is only a rough estimation.
What do you say?
Thank you for your help in advance!
edit
Some more background information:
DatasetKey is of type ulong.
A child entry (there are different types) is most of the time like the following:
public struct ChildDataSet
{
public string Val1;
public string Val2;
public byte Val3;
public long Val4;
}
I cannot tell what data exactly is accessed. Planned is that the users get access to first 1000, 5000, 100000 or all data of particular DatasetKeys. Based on their settings.
I want to keep the response time as low as possible and use as less as possible disk space.
#Regarding random access (Marc Gravells question):
I do not need access to element no. 123456 for a specific DatasetKey.
When storing more than one DatasetKey (with the child entries) in one file (the way I designed it to have not to create to much files),
I need random access to to first 1000 entries of a specific DatasetKey in that file, or the first 5000 (so I would read the 1000 and the 4000 package).
I only need access to the following regarding one specific DatasetKey (uint):
1000 child entries (or all child entries if less than 1000)
5000 child entries (or all child entries if less than 5000)
100000 child entries (or all child entries if less than 100000)
all child entries
All other things I mentioned where just a design try from me :-)
EDIT, streaming for one List in a class?
public class ChildDataSet
{
[ProtoMember(1)]
public List<Class1> Val1;
[ProtoMember(2)]
public List<Class2> Val2;
[ProtoMember(3)]
public List<Class3> Val3;
}
Could I stream for Val1, for example get the first 5000 entries of Val1
Go with a single file. At the front of the file, store the ID-to-offset mapping. Assuming your ID space is sparse, store an array of ID+offset pairs, sorted by ID. Use binary search to find the right entry. Roughly log(n/K) seeks, where "K" is the number of ID+offset pairs you can store on a single disk block (though the OS might need an additional additional seek or two to find each block).
If you want spend some memory to reduce disk seeks, store an in-memory sorted array of every 10,000th ID. When looking up an ID, find the closest ID without going over. This will give you a 10,000-ID range in the header that you can binary search over. You can very precisely scale up/down your memory usage by increasing/decreasing the number of keys in the in-memory table.
Dense ID space: But all of this is completely unnecessary if your ID space is relatively dense, which it seems it might be since you have 1 billion IDs out of a total possible ~4 billion (assuming uint is 32-bits).
The sorted array technique described above requires storing ID+offset for 1 billion IDs. Assuming offsets are 8 bytes, this requires 12 GB in the file header. If you went with a straight array of offsets it would require 32 GB in the file header, but now only a single disk seek (plus the OS's seeks) and no in-memory lookup table.
If 32 GB is too much, you can use a hybrid scheme where you use an array on the first 16 or 24 bits and use a sorted array for the last 16 or 8. If you have multiple levels of arrays, then you basically have a trie (as someone else suggested).
Note on multiple files: With multiple files you're basically trying to use the operating system's name lookup mechanism to handle one level of your ID-to-offset lookup. This is not as efficient as handling the entire lookup yourself.
There may be other reasons to store things as multiple files, though. With a single file, you need to rewrite your entire dataset if anything changes. With multiple files you only have to rewrite a single file. This is where the operating system's name lookup mechanism comes in handy.
But if you do end up using multiple files, it's probably more efficient for ID lookup to make sure they have roughly the same number of keys rather than the same file size.
The focus seems to be on the first n items; in which case, protobuf-net is ideal. Allow me to demonstrate:
using System;
using System.IO;
using System.Linq;
using ProtoBuf;
class Program
{
static void Main()
{
// invent some data
using (var file = File.Create("data.bin"))
{
var rand = new Random(12346);
for (int i = 0; i < 100000; i++)
{
// nothing special about these numbers other than convenience
var next = new MyData { Foo = i, Bar = rand.NextDouble() };
Serializer.SerializeWithLengthPrefix(file, next, PrefixStyle.Base128, Serializer.ListItemTag);
}
}
// read it back
using (var file = File.OpenRead("data.bin"))
{
MyData last = null;
double sum = 0;
foreach (var item in Serializer.DeserializeItems<MyData>(file, PrefixStyle.Base128, Serializer.ListItemTag)
.Take(4000))
{
last = item;
sum += item.Foo; // why not?
}
Console.WriteLine(last.Foo);
Console.WriteLine(sum);
}
}
}
[ProtoContract]
class MyData
{
[ProtoMember(1)]
public int Foo { get; set; }
[ProtoMember(2)]
public double Bar { get; set; }
}
In particular, because DeserializeItems<T> is a streaming API, it is easy to pull a capped quantity of data by using LINQ's Take (or just foreach with break).
Note, though, that the existing public dll won't love you for using struct; v2 does better there, but personally I would make that a class.
Create a solution with as much settings as possible. Then create a few test script and see which settings works best.
Create some settings for:
Original file size
Seperate file headers
Caching strategy (how much and what in mem)
Why not try Memory-mapped files or SQL with FileStream?
Is there a library that I can use to perform binary search in a very big text file (can be 10GB).
The file is a sort of a log file - every row starts with a date and time. Therefore rows are ordered.
I started to write the pseudo-code on how to do it, but I gave up since it may seem condescending. You probably know how to write a binary search, it's really not complicated.
You won't find it in a library, for two reasons:
It's not really "binary search" - the line sizes are different, so you need to adapt the algorithm (e.g. look for the middle of the file, then look for the next "newline" and consider that to be the "middle").
Your datetime log format is most likely non-standard (ok, it may look "standard", but think a bit.... you probably use '[]' or something to separate the date from the log message, something like [10/02/2001 10:35:02] My message ).
On summary - I think your need is too specific and too simple to implement in custom code for someone to bother writing a library :)
As the line lengths are not guaranteed to be the same length, you're going to need some form of recognisable line delimiter e.g. carriage return or line feed.
The binary search pattern can then be pretty much your traditional algorithm. Seek to the 'middle' of the file (by length), seek backwards (byte by byte) to the start of the line you happen to land in, as identified by the line delimiter sequence, read that record and make your comparison. Depending on the comparison, seek halfway up or down (in bytes) and repeat.
When you identify the start index of a record, check whether it was the same as the last seek. You may find that, as you dial in on your target record, moving halfway won't get you to a different record. e.g. you have adjacent records of 100 bytes and 50 bytes respectively, so jumping in at 75 bytes always takes you back to the start of the first record. If that happens, read on to the next record before making your comparison.
You should find that you will reach your target pretty quickly.
You would need to be able to stream the file, but you would also need random access. I'm not sure how you accomplish this short of a guarantee that each line of the file contains the same number of bytes. If you had that, you could get a Stream of the object and use the Seek method to move around in the file, and from there you could conduct your binary search by reading in the number of bytes that constitute a line. But again, this is only valid if the lines are the same number of bytes. Otherwise, you would jump in and out of the middle of lines.
Something like
byte[] buffer = new byte[lineLength];
stream.Seek(lineLength * searchPosition, SeekOrigin.Begin);
stream.Read(buffer, 0, lineLength);
string line = Encoding.Default.GetString(buffer);
This shouldn't be too bad under the constraint that you hold an Int64 in memory for every line-feed in the file. That really depends upon how long the line of text is on average, given 1000 bytes per line you be looking at around (10,000,000,000 / 1000 * 4) = 40mb. Very big, but possible.
So try this:
Scan the file and store the ordinal offset of each line-feed in a List
Binary search the List with a custom comparer that scans to the file offset and reads the data.
If your file is static (or changes rarely) and you have to run "enough" queries against it, I believe the best approach will be creating "index" file:
Scan the initial file and take the datetime parts of the file plus their positions in the original (this is why has to be pretty static) encode them some how (for example: unix time (full 10 digits) + nanoseconds (zero-filled 4 digits) and line position (zero filed 10 digits). this way you will have file with consistent "lines"
preform binary search on that file (you may need to be a bit creative in order to achieve range search) and get the relevant location(s) in the original file
read directly from the original file starting from the given location / read the given range
You've got range search with O(log(n)) run-time :) (and you've created primitive DB functionality)
Needless to say that if the file data file is updated "too" frequently or you don't run "enough" queries against the index file you mat end up with spending more time on creating the index file than you are saving from the query file.
Btw, working with this index file doesn't require the data file to be sorted. As log files tend to be append only, and sorted, you may speed up the whole thing by simply creating index file that only holds the locations of the EOL marks (zero-filled 10 digits) in the data file - this way you can preform the binary search directly on the data-file (using the index file in order to determinate the seek positions in the original file) and if lines are appended to the log file you can simply add (append) their EOL positions to the index file.
The List object has a Binary Search method.
http://msdn.microsoft.com/en-us/library/w4e7fxsh%28VS.80%29.aspx
I have a 1GB file containing pairs of string and long.
What's the best way of reading it into a Dictionary, and how much memory would you say it requires?
File has 62 million rows.
I've managed to read it using 5.5GB of ram.
Say 22 bytes overhead per Dictionary entry, that's 1.5GB.
long is 8 bytes, that's 500MB.
Average string length is 15 chars, each char 2 bytes, that's 2GB.
Total is about 4GB, where does the extra 1.5 GB go to?
The initial Dictionary allocation takes 256MB.
I've noticed that each 10 million rows I read, consume about 580MB, which fits quite nicely with the above calculation, but somewhere around the 6000th line, memory usage grows from 260MB to 1.7GB, that's my missing 1.5GB, where does it go?
Thanks.
It's important to understand what's happening when you populate a Hashtable. (The Dictionary uses a Hashtable as its underlying data structure.)
When you create a new Hashtable, .NET makes an array containing 11 buckets, which are linked lists of dictionary entries. When you add an entry, its key gets hashed, the hash code gets mapped on to one of the 11 buckets, and the entry (key + value + hash code) gets appended to the linked list.
At a certain point (and this depends on the load factor used when the Hashtable is first constructed), the Hashtable determines, during an Add operation, that it's encountering too many collisions, and that the initial 11 buckets aren't enough. So it creates a new array of buckets that's twice the size of the old one (not exactly; the number of buckets is always prime), and then populates the new table from the old one.
So there are two things that come into play in terms of memory utilization.
The first is that, every so often, the Hashtable needs to use twice as much memory as it's presently using, so that it can copy the table during resizing. So if you've got a Hashtable that's using 1.8GB of memory and it needs to be resized, it's briefly going to need to use 3.6GB, and, well, now you have a problem.
The second is that every hash table entry has about 12 bytes of overhead: pointers to the key, the value, and the next entry in the list, plus the hash code. For most uses, that overhead is insignificant, but if you're building a Hashtable with 100 million entries in it, well, that's about 1.2GB of overhead.
You can overcome the first problem by using the overload of the Dictionary's constructor that lets you provide an initial capacity. If you specify a capacity big enough to hold all of the entries you're going to be added, the Hashtable won't need to be rebuilt while you're populating it. There's pretty much nothing you can do about the second.
Everyone here seems to be in agreement that the best way to handle this is to read only a portion of the file into memory at a time. Speed, of course, is determined by which portion is in memory and what parts must be read from disk when a particular piece of information is needed.
There is a simple method to handle deciding what's the best parts to keep in memory:
Put the data into a database.
A real one, like MSSQL Express, or MySql or Oracle XE (all are free).
Databases cache the most commonly used information, so it's just like reading from memory. And they give you a single access method for in-memory or on-disk data.
Maybe you can convert that 1 GB file into a SQLite database with two columns key and value. Then create an index on key column. After that you can query that database to get the values of the keys you provided.
Thinking about this, I'm wondering why you'd need to do it... (I know, I know... I shouldn't wonder why, but hear me out...)
The main problem is that there is a huge amount of data that needs to be presumably accessed quickly... The question is, will it essentially be random access, or is there some pattern that can be exploited to predict accesses?
In any case, I would implement this as a sliding cache. E.g. I would load as much as feasibly possible into memory to start with (with the selection of what to load based as much on my expected access pattern as possible) and then keep track of accesses to elements by time last accessed.
If I hit something that wasn't in the cache, then it would be loaded and replace the oldest item in the cache.
This would result in the most commonly used stuff being accessible in memory, but would incur additional work for cache misses.
In any case, without knowing a little more about the problem, this is merely a 'general solution'.
It may be that just keeping it in a local instance of a sql db would be sufficient :)
You'll need to specify the file format, but if it's just something like name=value, I'd do:
Dictionary<string,long> dictionary = new Dictionary<string,long>();
using (TextReader reader = File.OpenText(filename))
{
string line;
while ((line = reader.ReadLine()) != null)
{
string[] bits = line.Split('=');
// Error checking would go here
long value = long.Parse(bits[1]);
dictionary[bits[0]] = value;
}
}
Now, if that doesn't work we'll need to know more about the file - how many lines are there, etc?
Are you using 64 bit Windows? (If not, you won't be able to use more than 3GB per process anyway, IIRC.)
The amount of memory required will depend on the length of the strings, number of entries etc.
I am not familiar with C#, but if you're having memory problems you might need to roll your own memory container for this task.
Since you want to store it in a dict, I assume you need it for fast lookup?
You have not clarified which one should be the key, though.
Let's hope you want to use the long values for keys. Then try this:
Allocate a buffer that's as big as the file. Read the file into that buffer.
Then create a dictionary with the long values (32 bit values, I guess?) as keys, with their values being a 32 bit value as well.
Now browse the data in the buffer like this:
Find the next key-value pair. Calculate the offset of its value in the buffer. Now add this information to the dictionary, with the long as the key and the offset as its value.
That way, you end up with a dictionary which might take maybe 10-20 bytes per record, and one larger buffer which holds all your text data.
At least with C++, this would be a rather memory-efficient way, I think.
Can you convert the 1G file into a more efficient indexed format, but leave it as a file on disk? Then you can access it as needed and do efficient lookups.
Perhaps you can memory map the contents of this (more efficient format) file, then have minimum ram usage and demand-loading, which may be a good trade-off between accessing the file directly on disc all the time and loading the whole thing into a big byte array.
Loading a 1 GB file in memory at once doesn't sound like a good idea to me. I'd virtualize the access to the file by loading it in smaller chunks only when the specific chunk is needed. Of course, it'll be slower than having the whole file in memory, but 1 GB is a real mastodon...
Don't read 1GB of file into the memory even though you got 8 GB of physical RAM, you can still have so many problems. -based on personal experience-
I don't know what you need to do but find a workaround and read partially and process. If it doesn't work you then consider using a database.
If you choose to use a database, you might be better served by a dbm-style tool, like Berkeley DB for .NET. They are specifically designed to represent disk-based hashtables.
Alternatively you may roll your own solution using some database techniques.
Suppose your original data file looks like this (dots indicate that string lengths vary):
[key2][value2...][key1][value1..][key3][value3....]
Split it into index file and values file.
Values file:
[value1..][value2...][value3....]
Index file:
[key1][value1-offset]
[key2][value2-offset]
[key3][value3-offset]
Records in index file are fixed-size key->value-offset pairs and are ordered by key.
Strings in values file are also ordered by key.
To get a value for key(N) you would binary-search for key(N) record in index, then read string from values file starting at value(N)-offset and ending before value(N+1)-offset.
Index file can be read into in-memory array of structs (less overhead and much more predictable memory consumption than Dictionary), or you can do the search directly on disk.