Archiving thousands of files and 7-Zip limitations - C#

My application requires a task that runs every day in which 100,000+ PDF files (~50 KB each) need to be zipped. Currently I'm using 7-Zip and calling 7za.exe (its command-line tool) to zip each file individually (the files are located in many different folders).
What are the limitations of this approach, and how can they be solved? Is there a file size or file count limit for a 7-Zip archive?

The limit on file size is 16 exabytes, or 16,000,000,000 GB.
There is no hard limit on the number of files, but there is a practical limit arising from how 7-Zip manages the headers for the files. The exact limit depends on the path lengths, but on a 32-bit system you'll run into it somewhere around a million files.
I'm not sure if any other format supports more. Regular zip has far smaller limits.
http://en.wikipedia.org/wiki/7-Zip
One notable limitation of 7-Zip is that, while it supports file sizes of up to 16 exabytes, it has an unusually high overhead allocating memory for files, on top of the memory requirements for performing the actual compression.
Approximately 1 kilobyte is required per file (more if the pathname is very long), and the file listing alone can grow to an order of magnitude more memory than is required to do the actual compression. In real-world terms, this means 32-bit systems cannot compress more than a million or so files in one archive, because the memory requirements exceed the 2 GB process limit.
64-bit systems do not suffer from the same process size limitation, but still require several gigabytes of RAM to overcome this limitation. However, archives created on such systems would be unusable on machines with less memory.
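A hedged sketch of one way to reduce the per-file overhead on the calling side: instead of launching 7za.exe once per PDF, write the paths to a list file and pass it with 7-Zip's @listfile syntax, so a single process archives the whole batch. The paths below are placeholders, and the exact arguments should be checked against your 7-Zip version.

using System.Diagnostics;
using System.IO;

// Collect the day's PDFs into a list file (all paths here are placeholders).
string listFile = Path.Combine(Path.GetTempPath(), "pdf_list.txt");
File.WriteAllLines(listFile,
    Directory.EnumerateFiles(@"C:\data\pdfs", "*.pdf", SearchOption.AllDirectories));

// One 7za.exe invocation archives everything named in the list file.
var psi = new ProcessStartInfo
{
    FileName = @"C:\Tools\7za.exe",
    Arguments = $"a -t7z C:\\archives\\daily.7z @\"{listFile}\"",
    UseShellExecute = false,
    CreateNoWindow = true
};

using (var process = Process.Start(psi))
{
    process.WaitForExit();
}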

Related

Is File.Exists() suitable for big directories?

I asked this question earlier but it was closed because it wasn't "focused". So I have deleted that question to provide what I hope is a more focused question:
I have a task where I need to look for an image file over a network. The folder this file is in is on a network share, and the folder can have 1 to 2 million images. Some of these images are 10 megabytes in size. I have no control over this folder, so I can't restructure it. I am just providing the application for the customer to look for image files in this big folder.
I was going to use the C# File.Exists() method to look up the file.
Is the performance of File.Exists affected by the number of files in the directory and/or the size of those files?
The performance of File.Exists() mostly depends on the underlying file system (of the machine at the other end) and of course the network. Any reasonable file system will implement it in such a way that size won't matter.
However, the total number of files may affect performance because of the indexing of a large number of entries. But again, a self-respecting file system will use some kind of logarithmic (or even constant-time) lookup, so it should be negligible (even with 5 million files and a logarithmic lookup, the FS has to examine at most about 23 entries; it's nothing). The network will definitely be the bottleneck here.
That being said, YMMV and I encourage you to simply measure it yourself.
In my experience the size of the images will not be a factor, but the number of them will be. Those folders are unreasonably large and are going to be slow for many different I/O operations, including just listing them.
That aside, this is such a simple operation to test that you really should just benchmark it yourself. A simple console application that connects to the network folder and checks for known existing files and known missing files will give you an idea of the time per operation you're looking at. It's not like you have to do a ton of implementation in order to test a single standard library function.
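A minimal sketch of such a benchmark, assuming placeholder UNC paths for one file you know exists and one you know does not:

using System;
using System.Diagnostics;
using System.IO;

// Time File.Exists() against a known-present and a known-absent file
// on the network share (the paths below are placeholders).
string[] probes =
{
    @"\\server\images\known_existing.jpg",
    @"\\server\images\definitely_missing.jpg"
};

const int Iterations = 100;
foreach (string path in probes)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < Iterations; i++)
    {
        File.Exists(path);
    }
    sw.Stop();
    Console.WriteLine($"{path}: {sw.Elapsed.TotalMilliseconds / Iterations:F2} ms per call");
}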

Optimal size of files for parallel reading

I have a big file (around 8GB) with thousands of entries (an entry can be up to 1100 bytes).
I need to read each entry and thought about parallelizing this process by splitting the big file into multiple smaller files.
For reading the files I am using C# (FileStream object).
As far as I understand, it makes sense for every smaller file to have the same size.
The question now is what the optimal size for those smaller files is, or whether it even makes a difference whether they are all 20 MB or 50 MB.
In my research I came across different sizes (20 MB, 8 KB, 10 MB, 8 MB), but always without further explanation.
Update
Each entry has a fixed size in bytes that depends on its kind, which is described in the entry's first 23 bytes.
So I read the first 23 bytes, from which I get how many bytes I have to read until the entry ends. This way I managed to split the file by the kind of entry (34 files); a sketch of the splitting loop follows below.
Since not every file has the same size, the different tasks for the files terminate at different times.
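A minimal sketch of that splitting loop, assuming the 23-byte header can be mapped to the remaining entry length by some lookup; GetEntryLength and the input file name are placeholders:

using System.IO;

using (var fs = new FileStream("big_input.bin", FileMode.Open, FileAccess.Read))
using (var reader = new BinaryReader(fs))
{
    while (fs.Position < fs.Length)
    {
        byte[] header = reader.ReadBytes(23);      // fixed-size header describing the entry kind
        int remaining = GetEntryLength(header);    // placeholder: maps the kind to the remaining byte count
        byte[] body = reader.ReadBytes(remaining); // the rest of this entry
        // ...append header + body to the output file for this entry kind
    }
}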

Sorting gigantic binary files with C#

I have a large file, roughly 400 GB in size, generated daily by an external closed system. It is a binary file with the following format:
byte[8]byte[4]byte[n]
Where n is equal to the int32 value of byte[4].
This file has no delimiters; to read the whole file you just repeat until EOF, with each "item" represented as byte[8]byte[4]byte[n].
The file looks like
byte[8]byte[4]byte[n]byte[8]byte[4]byte[n]...EOF
byte[8] is a 64-bit number representing a period of time expressed in .NET Ticks. I need to sort this file but can't seem to figure out the quickest way to do so.
Presently, I read to the end of the file, loading the Ticks and the byte[n] start and end positions into a struct for each item. After that, I sort the List in memory by the Ticks property, then open a BinaryReader, seek to each position in Ticks order, read the byte[n] value, and write it to an external file.
At the end of the process I end up with a sorted binary file, but it takes FOREVER. I am using C# .NET and a pretty beefy server, but disk IO seems to be an issue.
Server Specs:
2x 2.6 GHz Intel Xeon (Hex-Core with HT) (24-threads)
32GB RAM
500GB RAID 1+0
2TB RAID 5
I've looked all over the internet and can only find examples where a huge file is 1GB (makes me chuckle).
Does anyone have any advice?
A great way to speed up this kind of file access is to memory-map the entire file into address space and let the OS take care of reading whatever bits of the file it needs to. So do the same thing you're doing right now, except read from memory instead of using BinaryReader/seek/read.
You've got lots of main memory, so this should provide pretty good performance (as long as you're using a 64-bit OS).
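A minimal sketch of that idea, assuming the record layout described in the question (8-byte ticks, 4-byte length, then the payload); the file name and the shape of the index are illustrative:

using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

string path = "input.bin";                       // placeholder path
long fileLength = new FileInfo(path).Length;
var index = new List<(long Ticks, long Offset, int Length)>();

using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
using (var view = mmf.CreateViewAccessor(0, fileLength, MemoryMappedFileAccess.Read))
{
    long pos = 0;
    while (pos < fileLength)
    {
        long ticks = view.ReadInt64(pos);        // byte[8]: timestamp in .NET Ticks
        int n = view.ReadInt32(pos + 8);         // byte[4]: payload length
        index.Add((ticks, pos + 12, n));         // remember where the payload starts
        pos += 12 + n;
    }
}

index.Sort((a, b) => a.Ticks.CompareTo(b.Ticks));
// The write pass can then read each payload back through a view accessor
// in sorted order instead of calling BinaryReader.Seek for every item.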
Use merge sort.
It's online and parallelizes well.
http://en.wikipedia.org/wiki/Merge_sort
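For the external flavour of that idea, here is a minimal sketch of merging two already-sorted chunk files by their tick values, using the question's record layout; the file names are placeholders, and a real run would k-way merge many chunks:

using System.IO;

(long Ticks, int Length, byte[] Payload)? ReadRecord(BinaryReader r)
{
    if (r.BaseStream.Position >= r.BaseStream.Length) return null;
    long ticks = r.ReadInt64();
    int n = r.ReadInt32();
    return (ticks, n, r.ReadBytes(n));
}

void WriteRecord(BinaryWriter w, (long Ticks, int Length, byte[] Payload) rec)
{
    w.Write(rec.Ticks);
    w.Write(rec.Length);
    w.Write(rec.Payload);
}

using (var a = new BinaryReader(File.OpenRead("sorted_chunk_a.bin")))
using (var b = new BinaryReader(File.OpenRead("sorted_chunk_b.bin")))
using (var merged = new BinaryWriter(File.Create("merged.bin")))
{
    var ra = ReadRecord(a);
    var rb = ReadRecord(b);
    while (ra != null || rb != null)
    {
        // Take the record with the smaller tick value, or drain the remaining stream.
        if (rb == null || (ra != null && ra.Value.Ticks <= rb.Value.Ticks))
        {
            WriteRecord(merged, ra.Value);
            ra = ReadRecord(a);
        }
        else
        {
            WriteRecord(merged, rb.Value);
            rb = ReadRecord(b);
        }
    }
}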
If you can learn Erlang or Go, they could be very powerful and scale extremely well, since you have 24 threads. Use async I/O and a merge sort.
And since you have 32 GB of RAM, try to load as much as you can into RAM, sort it there, and then write it back to disk.
I would do this in several passes. On the first pass, I would create a list of ticks, then distribute them evenly into many (hundreds of?) buckets. If you know ahead of time that the ticks are evenly distributed, you can skip this initial pass. On the second pass, I would split the records into these few hundred separate files of about the same size (these much smaller files represent groups of ticks in the order that you want). Then I would sort each file separately in memory. Then concatenate the files.
It is somewhat similar to the hashsort (I think).
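A hedged sketch of that second pass, assuming the tick range (minTicks/maxTicks) was collected in the first pass and that a few hundred bucket files keep each bucket small enough to sort in memory; the file names and bucket count are illustrative:

using System.IO;

const int BucketCount = 256;
long minTicks = 0L;              // placeholder: smallest tick seen in pass 1
long maxTicks = long.MaxValue;   // placeholder: largest tick seen in pass 1
long bucketWidth = (maxTicks - minTicks) / BucketCount + 1;

var writers = new BinaryWriter[BucketCount];
for (int i = 0; i < BucketCount; i++)
    writers[i] = new BinaryWriter(File.Create($"bucket_{i:D3}.bin"));

using (var reader = new BinaryReader(File.OpenRead("input.bin")))   // placeholder name
{
    long total = reader.BaseStream.Length;
    while (reader.BaseStream.Position < total)
    {
        long ticks = reader.ReadInt64();          // byte[8]
        int n = reader.ReadInt32();               // byte[4]
        byte[] payload = reader.ReadBytes(n);     // byte[n]

        int bucket = (int)((ticks - minTicks) / bucketWidth);   // route by tick range
        writers[bucket].Write(ticks);
        writers[bucket].Write(n);
        writers[bucket].Write(payload);
    }
}

foreach (var w in writers)
    w.Dispose();
// Each bucket file is now small enough to sort entirely in memory,
// and the sorted buckets are simply concatenated in bucket order.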

Read whole file in memory VS read in chunks

I'm relatively new to C# and programming, so please bear with me. I'm working on an application where I need to read some files and process them in chunks (for example, data is processed in chunks of 48 bytes).
I would like to know what is better performance-wise: to read the whole file into memory at once and then process it, to read the file in chunks and process them directly, or to read the data in larger chunks (multiple chunks of data which are then processed).
How I understand things so far:
Read whole file in memory
pros:
-It's fast, because the most time-expensive operation is seeking; once the head is in place it can read quite fast
cons:
-It consumes a lot of memory
-It consumes a lot of memory in a very short time (this is what I am mainly afraid of, because I do not want it to noticeably impact overall system performance)
Read file in chunks
pros:
-It's easier (more intuitive) to implement, for example (stream, buffer, and ProcessChunk are placeholders):
int bytesRead;
while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
    ProcessChunk(buffer, bytesRead);   // handle the bytes just read
-It consumes very little memory
cons:
-It could take much more time if the disk has to seek to the file again and move the head to the appropriate position, which on average costs around 12 ms.
I know that the answer depends on file size (and hardware). I assume it is better to read the whole file at once, but up to what file size is that true? What is the maximum recommended size to read into memory at once (in bytes, or relative to the hardware, for example a percentage of RAM)?
Thank you for your answers and time.
It is recommended to read files in buffers of 4K or 8K.
You should really never read a file all at once if you want to write it back to another stream. Just read into a buffer and write the buffer back. This is especially true for web programming.
If you have to load the whole file because your operation (text processing, etc.) needs the whole content of the file, buffering does not really help, so I believe it is preferable to use File.ReadAllText or File.ReadAllBytes.
Why 4KB or 8KB?
This is closer to the underlying Windows operating system buffers. Files in NTFS are normally stored on disk in 4 KB or 8 KB clusters, although you can choose larger cluster sizes such as 32 KB.
Your chunk just needs to be large enough; 48 bytes is of course too small, and 4 KB is reasonable (a sketch follows below).
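A minimal sketch of chunked reading with a 4 KB buffer, as suggested above; the file name and ProcessChunk are placeholders for whatever per-48-byte processing the application does:

using System.IO;

byte[] buffer = new byte[4096];   // 4 KB chunk, close to the OS/NTFS cluster size
using (var fs = new FileStream("input.dat", FileMode.Open, FileAccess.Read,
                               FileShare.Read, bufferSize: 4096))
{
    int bytesRead;
    while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
    {
        ProcessChunk(buffer, bytesRead);   // placeholder: walk the buffer in 48-byte records
    }
}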

How to measure characteristics of file (hard-disk) I/O?

How can I measure the characteristics of file (hard-disk) I/O? For example, on a machine with a hard disk (with speed X), a Core i7 CPU (or whatever number of cores), and Y amount of RAM (with a Z Hz BIOS), what would be (on Windows):
Optimum number of files that can be written to the HD in parallel?
Optimum number of files that can be read from HD in parallel?
File-system facilities that help with faster writing. (For example: is there a feature or tool that lets you write batches of binary data to different sectors (or disks) and then bind them together as a file? I do not know much about the underlying file I/O in the OS, but it would be reasonable for such tools to exist!)
If such tools exist (as in the previous point), are they available in .NET too?
I want to write large files (streamed over the web or from another source) as fast (and as parallel) as possible! I am coding this in C#, and it acts like a download manager, so if streaming gets interrupted it can carry on later.
The answer (as so often) depends on your usage. The whole operating system is one big tradeoff between different use scenarios. For the NTFS filesystem one could mention the block size being set to 4 KB, NTFS storing files smaller than the block size in the MFT, the size of files, the number of files, fragmentation, etc.
If you are planning to write large files, then a block size of 64 KB may be good; that is, if you plan to read large amounts of data. If you read smaller amounts of data, then smaller sizes are good. The OS works in 4 KB pages, so 4 KB is good. Compression (and encryption?) as well as SQL and Exchange only work on 4 KB pages (IIRC).
If you write small files (<4 KB) they will be stored inside the MFT, so you don't have to make an extra jump. This is especially useful for write operations (reads may have the MFT cached). The MFT stores files as sequences of blocks (e.g. blocks 1000-1010, 2000-2010), so fragmentation makes the MFT bigger. Writing files to disk in parallel is one of the main causes of fragmentation; the other is deleting files. You may pre-allocate the required size for a file, and Windows will try to find a suitable place on the disk to counter fragmentation (see the sketch below). There are also real-time defragmentation programs like O&O Defrag.
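A minimal sketch of that pre-allocation idea, assuming the final size is known up front (for a download, from the Content-Length header); the path and size below are placeholders:

using System.IO;

long expectedLength = 2L * 1024 * 1024 * 1024;   // placeholder: 2 GB, e.g. from Content-Length
using (var fs = new FileStream(@"C:\downloads\bigfile.part", FileMode.Create, FileAccess.Write))
{
    fs.SetLength(expectedLength);   // reserve the space so NTFS can place the file contiguously
}
// Each worker can later open the same file with FileShare.ReadWrite, Seek() to its
// own byte range and write there, which also supports resuming after an interruption.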
Windows maps a binary stream pretty much directly to the physical location on the disk, so using different read/write methods will not yield as much of a performance boost as the other factors. For maximum speed, programs use techniques for mapping files directly into memory. See http://en.wikipedia.org/wiki/Memory-mapped_file
There is an option in Windows (under Device Manager, hard disks) to increase caching on the disk. This is dangerous, as it could damage the filesystem if the computer blue-screens or loses power, but it gives a big performance boost when writing smaller files (and on all writes). If the disk is busy this is especially valuable, as the seek time will decrease. Windows uses what is called the elevator algorithm, which basically means it moves the hard-disk heads back and forth over the surface, serving any application in the direction it is moving.
Hope this helps. :)
