I have a big file (around 8GB) with thousands of entries (an entry can be up to 1100 bytes).
I need to read each entry and thought about parallelizing this process by splitting the big file into multiple smaller files.
For reading the files I am using C# (FileStream object).
As far as I understand, it makes sense for every smaller file to be the same size.
The question now is what the optimal size for those smaller files is, or whether it even makes a difference if they are all 20MB or 50MB large.
In my research I came across various sizes (20MB, 8KB, 10MB, 8MB), but always without further explanation.
Update
The entries have a fixed size in bytes depending on the kind of entry, which is described in the first 23 bytes of each one.
So I read the first 23 bytes, which tell me how many bytes I have to read until the entry ends. Using this, I managed to split the file by kind of entry (34 files).
Since not every file has the same size, the different tasks for the files terminate at different times.
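The split loop described above can be sketched as follows. The actual 23-byte header layout isn't given, so `ParseEntryLength` (reading a little-endian int32 from the first four bytes) and `outputForKind` are hypothetical stand-ins:

```csharp
using System;
using System.IO;

// Hypothetical stand-in: the real header layout is not given here, so assume
// bytes 0..3 hold the remaining entry length as a little-endian int32.
int ParseEntryLength(byte[] header) => BitConverter.ToInt32(header, 0);

void SplitByEntry(Stream input, Func<byte[], Stream> outputForKind)
{
    var header = new byte[23];
    while (true)
    {
        int read = input.Read(header, 0, header.Length);
        if (read == 0) break;                              // clean EOF
        if (read < header.Length) throw new IOException("Truncated header");

        int bodyLength = ParseEntryLength(header);
        var body = new byte[bodyLength];
        int got = 0;
        while (got < bodyLength)                           // Read may return fewer bytes than asked
        {
            int n = input.Read(body, got, bodyLength - got);
            if (n == 0) throw new IOException("Truncated entry");
            got += n;
        }

        Stream output = outputForKind(header);             // one target file per entry kind
        output.Write(header, 0, header.Length);
        output.Write(body, 0, bodyLength);
    }
}

// demo: one 23-byte header (length = 5) followed by a 5-byte body
var demo = new MemoryStream();
var h = new byte[23];
BitConverter.GetBytes(5).CopyTo(h, 0);
demo.Write(h, 0, 23);
demo.Write(new byte[] { 1, 2, 3, 4, 5 }, 0, 5);
demo.Position = 0;
var outFile = new MemoryStream();
SplitByEntry(demo, _ => outFile);
Console.WriteLine(outFile.Length);                         // 28
```

The inner loop matters: `Stream.Read` is allowed to return fewer bytes than requested, so a single call isn't enough to guarantee a full entry.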
Related
I have two files of about 50GB each: an input and an output file.
I am using a Memory Mapped File to manage these two files.
The input file contains 3 million web pages, and after deciding on a permutation π of them, I have to write the web pages into the output file in the new order.
So, I can read the input file sequentially and write the web pages to different locations in the output file, according to the permutation π.
Or I can do the opposite: read the input file randomly according to the permutation π and write sequentially into the output file.
Which option is faster? Why?
TL;DR: Due to caching, all file-append operations are sequential. Even writes to the middle of files will be elevator sorted and performed at block size, etc.
Random writing tends to be faster than random reading for several reasons:
When a file grows, the filesystem can choose where to put the new block.
Writes don't have to be performed immediately; the write buffer can ensure that an entire block is written at once, meaning data won't be appended to an existing block that already has a location.
Your processing can't take place until reads complete, and reading relies on a predictive cache. The OS is good at pre-caching sequential reads but horrible at random reads. If your reads are smaller than a block, things are even worse: the actual amount of data read from the disk will be greater than the size of the file.
I have a large file of roughly 400 GB of size. Generated daily by an external closed system. It is a binary file with the following format:
byte[8]byte[4]byte[n]
Where n is equal to the int32 value of byte[4].
This file has no delimiters and to read the whole file you would just repeat until EOF. With each "item" represented as byte[8]byte[4]byte[n].
The file looks like
byte[8]byte[4]byte[n]byte[8]byte[4]byte[n]...EOF
byte[8] is a 64-bit number representing a period of time represented by .NET Ticks. I need to sort this file but can't seem to figure out the quickest way to do so.
Presently, I load the Ticks into a struct and the byte[n] start and end positions and read to the end of the file. After this, I sort the List in memory by the Ticks property and then open a BinaryReader and seek to each position in Ticks order, read the byte[n] value, and write to an external file.
At the end of the process I end up with a sorted binary file, but it takes FOREVER. I am using C# .NET and a pretty beefy server, but disk IO seems to be an issue.
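The index pass described above can be sketched like this (names are illustrative, not from the original code): scan the byte[8]byte[4]byte[n] stream once, keeping only the ticks plus each record's offset and total length, then sort that small index rather than the 400GB of data:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Scan byte[8]byte[4]byte[n] records, recording (ticks, offset, length)
// for each so the payloads can be fetched later in sorted order.
List<(long Ticks, long Offset, int Length)> BuildIndex(Stream s)
{
    var index = new List<(long Ticks, long Offset, int Length)>();
    var header = new byte[12];                    // byte[8] ticks + byte[4] length
    while (true)
    {
        long offset = s.Position;
        int read = s.Read(header, 0, 12);
        if (read == 0) break;
        if (read < 12) throw new IOException("Truncated record header");
        long ticks = BitConverter.ToInt64(header, 0);
        int length = BitConverter.ToInt32(header, 8);
        index.Add((ticks, offset, 12 + length));
        s.Seek(length, SeekOrigin.Current);       // skip the payload for now
    }
    index.Sort((a, b) => a.Ticks.CompareTo(b.Ticks));
    return index;
}

// demo: two records with ticks 20 then 10, payloads of 3 and 1 bytes
var ms = new MemoryStream();
void WriteRecord(long ticks, byte[] payload)
{
    ms.Write(BitConverter.GetBytes(ticks), 0, 8);
    ms.Write(BitConverter.GetBytes(payload.Length), 0, 4);
    ms.Write(payload, 0, payload.Length);
}
WriteRecord(20, new byte[] { 1, 2, 3 });
WriteRecord(10, new byte[] { 9 });
ms.Position = 0;
var index = BuildIndex(ms);
Console.WriteLine(index[0].Ticks);                // 10 — sorted by ticks
```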
Server Specs:
2x 2.6 GHz Intel Xeon (Hex-Core with HT) (24-threads)
32GB RAM
500GB RAID 1+0
2TB RAID 5
I've looked all over the internet and can only find examples where a huge file is 1GB (makes me chuckle).
Does anyone have any advice?
A great way to speed up this kind of file access is to memory-map the entire file into address space and let the OS take care of reading whatever bits of the file it needs. So do the same thing you're doing right now, except read from memory instead of using a BinaryReader/seek/read.
You've got lots of main memory, so this should provide pretty good performance (as long as you're using a 64-bit OS).
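A minimal sketch of the memory-mapped approach: map the file once, then read records through a view accessor instead of BinaryReader + Seek. The record layout follows the question's byte[8]byte[4]byte[n] format; the sample file is created here only to make the snippet self-contained.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

string path = Path.GetTempFileName();
using (var fs = File.OpenWrite(path))
{
    fs.Write(BitConverter.GetBytes(123456789L), 0, 8);   // byte[8]: ticks
    fs.Write(BitConverter.GetBytes(3), 0, 4);            // byte[4]: payload length
    fs.Write(new byte[] { 7, 8, 9 }, 0, 3);              // byte[n]: payload
}

long ticks;
int length;
byte[] payload;
using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
using (var view = mmf.CreateViewAccessor())
{
    ticks = view.ReadInt64(0);                           // read straight from the mapping
    length = view.ReadInt32(8);
    payload = new byte[length];
    view.ReadArray(12, payload, 0, length);
}
File.Delete(path);
Console.WriteLine($"{ticks} {length} {payload[2]}");     // 123456789 3 9
```

The OS pages data in on demand, so random access in ticks order turns into page faults served by the file cache rather than explicit Seek/Read calls.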
Use merge sort.
It's online and parallelizes well.
http://en.wikipedia.org/wiki/Merge_sort
If you can learn Erlang or Go, they could be very powerful and scale extremely well, given that you have 24 threads. Utilize async I/O. Use merge sort.
And since you have 32GB of RAM, try to load as much as you can into RAM, sort it there, then write back to disk.
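The "sort RAM-sized pieces, then merge" idea can be sketched as a k-way merge over sorted runs. Lists stand in for the temporary run files, and `PriorityQueue` requires .NET 6+:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// K-way merge: the priority queue holds (run index, position in run),
// prioritized by the current value, so the smallest head is popped each step.
List<long> MergeRuns(List<List<long>> runs)
{
    var result = new List<long>();
    var pq = new PriorityQueue<(int Run, int Pos), long>();
    for (int i = 0; i < runs.Count; i++)
        if (runs[i].Count > 0)
            pq.Enqueue((i, 0), runs[i][0]);
    while (pq.Count > 0)
    {
        var (run, pos) = pq.Dequeue();
        result.Add(runs[run][pos]);
        if (pos + 1 < runs[run].Count)                 // advance within that run
            pq.Enqueue((run, pos + 1), runs[run][pos + 1]);
    }
    return result;
}

var ticks = new long[] { 40, 10, 30, 20, 60, 50 };
int runSize = 2;                                       // stand-in for "as much as fits in RAM"
var runs = ticks.Chunk(runSize)
                .Select(r => r.OrderBy(t => t).ToList())
                .ToList();
var sorted = MergeRuns(runs);
Console.WriteLine(string.Join(",", sorted));           // 10,20,30,40,50,60
```

With real files, each run would be written to disk after its in-memory sort and the merge would stream from all runs at once, keeping both phases sequential on disk.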
I would do this in several passes. On the first pass, I would create a list of ticks, then distribute them evenly into many (hundreds of?) buckets. If you know ahead of time that the ticks are evenly distributed, you can skip this initial pass. On a second pass, I would split the records into these few hundred separate files of about the same size (these much smaller files represent groups of ticks in the order that you want). Then I would sort each file separately in memory. Then concatenate the files.
It is somewhat similar to a hash sort (I think).
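A sketch of the bucketing idea, assuming ticks are roughly evenly distributed: because the tick-to-bucket mapping is monotone, each bucket covers a contiguous slice of the sort order, so independently sorted buckets concatenate into a fully sorted result. Plain longs stand in for full records here:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Monotone range partitioning: scale a tick's position within [min, max]
// to a bucket number in [0, bucketCount - 1].
int BucketFor(long ticks, long min, long max, int bucketCount)
{
    if (max == min) return 0;
    return (int)((ticks - min) * (bucketCount - 1) / (max - min));
}

var ticksList = new List<long> { 50, 10, 90, 30, 70 };
long lo = ticksList.Min(), hi = ticksList.Max();       // the "first pass"
var buckets = ticksList
    .GroupBy(t => BucketFor(t, lo, hi, 4))             // the "second pass"
    .OrderBy(g => g.Key)
    .Select(g => g.OrderBy(t => t).ToList())           // sort each bucket in memory
    .ToList();
var sorted = buckets.SelectMany(t => t).ToList();      // concatenation is already sorted
Console.WriteLine(string.Join(",", sorted));           // 10,30,50,70,90
```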
I'm relatively new to C# and programming, so please bear with me. I'm working on an application where I need to read some files and process them in chunks (for example, data is processed in chunks of 48 bytes).
I would like to know which is better, performance-wise: to read the whole file into memory at once and then process it, to read the file in chunks and process them directly, or to read the data in larger chunks (multiple chunks of data which are then processed).
How I understand things so far:
Read whole file in memory
pros:
-It's fast, because the most time-expensive operation is seeking; once the head is in place, it can read quite fast
cons:
-It consumes a lot of memory
-It consumes a lot of memory in a very short time (this is what I am mainly afraid of, because I do not want it to noticeably impact overall system performance)
Read file in chunks
pros:
-It's easier (more intuitive) to implement
while(numberOfBytes2Read > 0)
read n bytes
process read data
-It consumes very little memory
cons:
-It could take much more time if the disk has to seek the file again and move the head to the appropriate position, which on average costs around 12ms.
I know that the answer depends on file size (and hardware). I assume it is better to read the whole file at once, but up to what file size is that true? What is the maximum recommended size to read into memory at once (in bytes, or relative to the hardware, for example as a % of RAM)?
Thank you for your answers and time.
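A middle ground between the two options above is to read in large chunks through one FileStream and hand the data to the processing step in 48-byte units. The 64KB chunk size and `ProcessUnit` are placeholders; only the buffering pattern is the point:

```csharp
using System;
using System.IO;

string path = Path.GetTempFileName();
File.WriteAllBytes(path, new byte[96]);          // two 48-byte units of zeros

int unitsProcessed = 0;
void ProcessUnit(byte[] buffer, int offset)      // stand-in for the real work
{
    unitsProcessed++;
}

const int UnitSize = 48;
const int ChunkSize = 64 * 1024;                 // large reads amortize seek cost
var chunk = new byte[ChunkSize];
using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
{
    int filled = 0;
    int read;
    while ((read = fs.Read(chunk, filled, chunk.Length - filled)) > 0)
    {
        filled += read;
        int offset = 0;
        while (filled - offset >= UnitSize)      // process only whole units
        {
            ProcessUnit(chunk, offset);
            offset += UnitSize;
        }
        // move any partial unit to the front for the next read
        Buffer.BlockCopy(chunk, offset, chunk, 0, filled - offset);
        filled -= offset;
    }
}
File.Delete(path);
Console.WriteLine(unitsProcessed);               // 2
```

This keeps memory use bounded by the chunk size while still issuing few, large reads to the disk.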
It is recommended to read files in buffers of 4K or 8K.
You should really never read a file all at once if you plan to write it back to another stream. Just read into a buffer and write the buffer back. This is especially true for web programming.
If you have to load the whole file since your operation (text-processing, etc) needs the whole content of the file, buffering does not really help, so I believe it is preferable to use File.ReadAllText or File.ReadAllBytes.
Why 4KB or 8KB?
This is closer to the underlying Windows operating system buffers. Files in NTFS are normally stored in 4KB or 8KB chunks on the disk, although you can choose 32KB chunks.
Your chunk needs to be just large enough: 48 bytes is of course too small; 4K is reasonable.
My application requires that a task is run everyday in which 100,000+ PDF (~ 50kb each) files need to be zipped. Currently, I'm using 7-zip and calling 7za.exe (the command line tool with 7-zip) to zip each file (files are located in many different folders).
What are the limitations in this approach and how can they be solved? Is there a file size or file count limit for a 7zip archive?
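One way to cut the per-file overhead in the approach above is batching: 7-Zip's command line accepts a list file via `@listfile`, so a single `7za.exe` invocation can archive many PDFs instead of spawning 100,000+ processes. Paths and names below are illustrative, and the process is only launched if the executable is actually present:

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Illustrative stand-in for the real PDF folders.
string pdfDir = Path.Combine(Path.GetTempPath(), "pdf-batch-demo");
Directory.CreateDirectory(pdfDir);
File.WriteAllText(Path.Combine(pdfDir, "a.pdf"), "demo");
File.WriteAllText(Path.Combine(pdfDir, "b.pdf"), "demo");

// Collect all paths into one list file for a single 7za invocation.
string listFile = Path.Combine(pdfDir, "files.txt");
File.WriteAllLines(listFile,
    Directory.EnumerateFiles(pdfDir, "*.pdf", SearchOption.AllDirectories));

string sevenZip = @"C:\Program Files\7-Zip\7za.exe";    // adjust to the local install
var psi = new ProcessStartInfo
{
    FileName = sevenZip,
    Arguments = $"a daily.7z @\"{listFile}\"",          // 'a' = add; @file = input list
    UseShellExecute = false,
};
if (File.Exists(sevenZip))                              // skip gracefully if 7-Zip is absent
{
    using (var p = Process.Start(psi)) p.WaitForExit();
}
Console.WriteLine(File.ReadAllLines(listFile).Length);  // 2
```

Batching also sidesteps per-process startup cost, though the header-memory limits discussed below still apply to the resulting archive.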
The limit on file size is 16 exabytes, or 16000000000 GB.
There is no hard limit on the number of files, but there is a practical limit in how it manages the headers for the files. The exact limit depends on the path lengths but on a 32-bit system you'll run into limits somewhere around a million files.
I'm not sure if any other format supports more. Regular zip has far smaller limits.
http://en.wikipedia.org/wiki/7-Zip
One notable limitation of 7-Zip is that, while it supports file sizes of up to 16 exabytes, it has an unusually high overhead allocating memory for files, on top of the memory requirements for performing the actual compression.
Approximately 1 kilobyte is required per file (More if the pathname is very long) and the file listing alone can grow to an order of magnitude greater than the memory required to do the actual compression. In real world terms, this means 32-bit systems cannot compress more than a million or so files in one archive as the memory requirements exceed the 2 GB process limit.
64-bit systems do not suffer from the same process size limitation, but still require several gigabytes of RAM to overcome this limitation. Archives created on such systems would be unusable on machines with less memory however.
I am writing a program to read and write a specific binary file format.
I believe I have it 95% working, but I am running into a strange problem.
In the screenshot I am showing a program I wrote that compares two files byte by byte. The very last byte should be 0 but is FFFFFFF.
Using a binary viewer I can see no difference in the files. They appear to be identical.
Also, Windows tells me the size of the files is different, but the size on disk is the same.
Can someone help me understand what is going on?
The original is on the left and my copy is on the right.
Possible answers:
You forgot to call Stream.Close() or Stream.Dispose().
Your code is mixing up text and other kinds of data (e.g. casting a -1 from a Read() method into a char, then writing it).
We need to see your code though...
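The second possibility is worth illustrating, because it produces exactly the reported symptom: `Stream.ReadByte()` returns an `int`, with -1 meaning end of file, and casting before checking turns that -1 into a spurious 0xFF byte at the end of the copy:

```csharp
using System;
using System.IO;

MemoryStream Buggy(byte[] data)
{
    var input = new MemoryStream(data);
    var output = new MemoryStream();
    for (int i = 0; i <= data.Length; i++)          // off-by-one reads past EOF
        output.WriteByte((byte)input.ReadByte());   // -1 at EOF becomes 0xFF
    return output;
}

MemoryStream Fixed(byte[] data)
{
    var input = new MemoryStream(data);
    var output = new MemoryStream();
    int c;
    while ((c = input.ReadByte()) != -1)            // test for EOF before casting
        output.WriteByte((byte)c);
    return output;
}

var source = new byte[] { 0x01, 0x00 };
Console.WriteLine(Buggy(source).Length);            // 3 — one spurious 0xFF appended
Console.WriteLine(Fixed(source).Length);            // 2
```

The buggy copy is one byte longer than the original, which is why the two file sizes disagree while a block-level view still looks identical.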
Size on disk vs Size
First of all you should note that the Size on disk is almost always different from the Size value because the Size on disk value reflects the allocated drive storage but the Size reflects the actual length of the file.
A disk drive splits its space into blocks of the same size. For example, if your drive works with 4KB blocks, then even the smallest file containing a single byte will still take up 4KB on the disk, as that is the minimum space it can allocate. Once you write 4KB + 1 byte, it will allocate another 4KB block of storage, making it 8KB on disk. Hence the Size on disk is always a multiple of 4KB. So the fact that the source and destination files have the same Size on disk does not mean the files are the same length. (Different drives have different block sizes; it is not always 4KB.)
The Size value is the actual defined length of the file data within the disk blocks.
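The block rounding described above, in one line of code (4KB is assumed as the allocation unit):

```csharp
using System;

// Size on disk = file length rounded up to the next multiple of the block size.
long SizeOnDisk(long fileLength, long blockSize) =>
    fileLength == 0 ? 0 : ((fileLength + blockSize - 1) / blockSize) * blockSize;

Console.WriteLine(SizeOnDisk(1, 4096));        // 4096
Console.WriteLine(SizeOnDisk(4097, 4096));     // 8192
Console.WriteLine(SizeOnDisk(3434, 4096));     // 4096
Console.WriteLine(SizeOnDisk(2008, 4096));     // 4096 — both files round to one block
```

Note that 3,434 and 2,008 bytes (the two Size values in the question) both round up to a single 4KB block, which is exactly why Size on disk matches while Size does not.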
Your Size Issue
As your Size values are different it means that the operating system has saved different lengths of data. Hence you have a fundamental problem with your copying routine and not just an issue with the last byte as you think at the moment. One of your files is 3,434 bytes and the other 2,008 which is a big difference. Your first step must be to work out why you have such a big difference.
If your hex comparing routine is simply looking at the block data then it will think they are the same length as it is comparing disk blocks rather than actual file length.