Performance creating multiple small files - c#

I need a test app that will create a large number of small files on disk as fast as possible.
Will async operations help with creating the files, or only with writing them? Is there a way to speed up the whole process? (Writing everything to a single file is not an option.)

Wouldn't physical drive IO be the bottleneck here? You'll probably get different results if you write to a 4200rpm drive versus a 10,000rpm drive versus an ultrafast SSD.

It's hard for me to say without writing a test app myself, but disk access will be synchronized anyway, so it's not as if you'll have multiple threads writing to the disk at the same time. You could speed things up with threads if there were a fair amount of processing to do before writing out each file.

If it's possible to test your app using a ramdisk it would probably speed up things considerably.

If possible, don't write them all to the same directory. Many filesystems slow down when dealing with directories containing large numbers of files. (I once brought our file server at work, which normally happily serves the whole office, to its knees by writing thousands of files to the same directory.)
Instead, create a new directory for every 1,000 files or so.
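Not part of the original answers, but here is a rough C# sketch combining both ideas from this thread: async writes plus bucketing into subdirectories of about a thousand files each. The root path, file count, and payload are made up, File.WriteAllBytesAsync requires a newer .NET, and whether async actually beats plain synchronous writes here is something you would have to measure on your own disk.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

class SmallFileWriter
{
    // Hypothetical root folder and counts, purely for illustration.
    const string Root = @"C:\temp\manyfiles";
    const int FileCount = 10_000;
    const int FilesPerDirectory = 1000;

    static async Task Main()
    {
        byte[] payload = Encoding.UTF8.GetBytes("small file contents");

        var tasks = Enumerable.Range(0, FileCount).Select(async i =>
        {
            // Bucket files into subdirectories of ~1000 entries each.
            string dir = Path.Combine(Root, (i / FilesPerDirectory).ToString("D4"));
            Directory.CreateDirectory(dir);   // no-op if it already exists
            string path = Path.Combine(dir, $"file_{i:D6}.bin");

            // Async write; limiting the degree of concurrency may help on a
            // spinning disk, so treat this as a starting point, not an answer.
            await File.WriteAllBytesAsync(path, payload);
        });

        await Task.WhenAll(tasks);
        Console.WriteLine("done");
    }
}
```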

Related

Is File.Exists() suitable for big directories?

I asked this question earlier but it was closed because it wasn't "focused". So I have deleted that question to provide what I hope is a more focused question:
I have a task where I need to look for an image file over a network. The folder containing the file is on a network share and can hold 1 to 2 million images, some of them 10 megabytes in size. I have no control over this folder, so I can't restructure it; I am just providing the application the customer will use to look for image files in this large folder.
I was going to use the C# File.Exists() method to look up the file.
Is the performance of File.Exists affected by the number of files in the directory and/or the size of those files?
The performance of File.Exists() mostly depends on the underlying file system (of the machine at the other end) and, of course, on the network. Any reasonable file system will implement it in such a way that file size doesn't matter.
However, the total number of files may affect performance, because of the indexing of a large number of entries. But again, a self-respecting file system will use some kind of logarithmic (or even constant-time) lookup, so the effect should be negligible: even for 5 million files with a logarithmic lookup, the FS has to scan at most about 23 entries, which is nothing. The network will definitely be the bottleneck here.
That being said, YMMV and I encourage you to simply measure it yourself.
In my experience the size of the images will not be a factor, but the number of them will be. Those folders are unreasonably large and are going to be slow for many different I/O operations, including just listing them.
That aside, this is such a simple operation to test that you really should just benchmark it yourself. A simple console application that connects to the network folder and checks for known existing files and known missing files will give you an idea of the time per operation you're looking at. It's not as if you have to do a ton of implementation in order to test a single standard library function.
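A minimal benchmark along those lines might look like the sketch below. The UNC path and file names are placeholders, and repeated calls will be affected by caching on both ends, so treat the numbers as rough.

```csharp
using System;
using System.Diagnostics;
using System.IO;

class FileExistsBenchmark
{
    static void Main()
    {
        // Hypothetical share and file names - substitute real ones.
        string share = @"\\fileserver\images";
        string[] candidates = { "known_existing.jpg", "definitely_missing.jpg" };

        foreach (string name in candidates)
        {
            string path = Path.Combine(share, name);

            const int iterations = 100;
            int hits = 0;
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
            {
                if (File.Exists(path)) hits++;
            }
            sw.Stop();

            Console.WriteLine($"{name}: {hits}/{iterations} hits, " +
                              $"{sw.Elapsed.TotalMilliseconds / iterations:F2} ms per call");
        }
    }
}
```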

Concatenate large files using Win NT kernel API

I've been looking around for a way to concatenate large files (a few gigabytes) together without having to rewrite one of the files. I am sure the OS does something like this internally when manipulating the master file table. This is purely for an internal application where speed is critical, even at the cost of data integrity (the risk I accept in using undocumented APIs). The app processes a large amount of high-bandwidth, multi-channel Ethernet data, where a corrupt unit of work (a file, in this case) will not have a large impact on the overall processing results.
At the moment, combining files A and B into C costs A[Read] + B[Read] + C[Write]. Would any of you NT gurus shed some light on how to work around this and get at the MFT directly?
I have not been able to find any clues as to which API to explore and would appreciate some pointers. Although the app is managed, I would gladly explore native APIs and even set up lightweight VMs for testing.
Thanks in advance.
If you are appending file B to file A, all you have to do is open file A for write/append, seek to the end of the file, then read from B and write to A.
If you want to create file C as the concatenation of files A and B, then you are going to have to create C, copy A to C, then copy B to C.
There aren't any shortcuts.
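For the append case, the conventional (rewriting) approach is short in C#; this is only a sketch of that, not the MFT-level shortcut the question asks about, and the paths are hypothetical.

```csharp
using System.IO;

class AppendFiles
{
    // Appends fileB onto the end of fileA without creating a third file.
    // Error handling omitted for brevity.
    static void AppendTo(string fileA, string fileB)
    {
        using var target = new FileStream(fileA, FileMode.Append, FileAccess.Write);
        using var source = new FileStream(fileB, FileMode.Open, FileAccess.Read);
        source.CopyTo(target);   // streams B's bytes onto the end of A
    }
}
```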
That's not really something a file system would do. File systems allocate space for files in terms of clusters and blocks of data, not in terms of bytes. Concatenating two files like this would only work if both were exact multiples of the cluster size, and the FS might have other assumptions about how blocks are allocated to files under the covers. You might be able to do it yourself if you dismounted the file system and wrote a tool to directly manipulate its on-disk structures, but then you're risking corrupting the whole disk, not just a single file.
I don't know your exact situation, but would it be possible not to append the files together at all? Just keep writing files into some directory as you receive data, and keep an index.
Then, when the data is needed, use the index to piece the files together into one new file.
That way you only ever do the expensive merging on demand.
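A sketch of that idea, assuming a plain text index with one chunk path per line (the index format and paths are assumptions); you only pay the copy cost when the combined data is actually requested.

```csharp
using System.IO;

class OnDemandMerge
{
    // Concatenates the chunk files listed (one path per line) in indexPath
    // into a single output file, in order.
    static void Merge(string indexPath, string outputPath)
    {
        using var output = File.Create(outputPath);
        foreach (string chunkPath in File.ReadLines(indexPath))
        {
            using var chunk = File.OpenRead(chunkPath);
            chunk.CopyTo(output);
        }
    }
}
```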

Parallel Concurrent Binary Readers

I have a Parallel.ForEach loop creating binary readers on the same group of large data files.
I was just wondering whether it hurts performance that these readers are reading the same files in parallel (i.e., if they were reading exclusively different files, would it go faster?).
I am asking because there is a lot of disk I/O involved (I guess...).
Edit: I forgot to mention that I am using an Amazon EC2 instance and the data is on the C:\ disk assigned to it. I have no idea how that affects this issue.
Edit 2: I'll take measurements by duplicating the data folder, reading from the two different sources, and seeing what that gives.
It's not a good idea to read from the same disk using multiple threads. Since the disk's mechanical head has to move to seek each new reading location, multiple threads basically bounce it back and forth, which hurts performance.
The best approach is actually to read the files sequentially using a single thread and then hand the chunks off to a group of threads to process them in parallel.
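A rough sketch of that producer/consumer shape in C#, with hypothetical file paths, a made-up chunk size, and a trivial stand-in for the real per-chunk work:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class SequentialReadParallelProcess
{
    static void Main()
    {
        // Hypothetical input files and chunk size.
        string[] files = { @"C:\data\a.bin", @"C:\data\b.bin" };
        const int chunkSize = 1 << 20; // 1 MB

        using var chunks = new BlockingCollection<byte[]>(boundedCapacity: 8);

        // Consumers: process chunks in parallel. (The default partitioner may
        // buffer items; a custom partitioner can reduce that if it matters.)
        var workers = Task.Run(() =>
            Parallel.ForEach(chunks.GetConsumingEnumerable(), chunk =>
            {
                // Replace with your real per-chunk processing.
                Console.WriteLine($"processed {chunk.Length} bytes");
            }));

        // Producer: a single thread reads the files sequentially.
        var buffer = new byte[chunkSize];
        foreach (string file in files)
        {
            using var stream = File.OpenRead(file);
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                var chunk = new byte[read];
                Array.Copy(buffer, chunk, read);
                chunks.Add(chunk);
            }
        }

        chunks.CompleteAdding();
        workers.Wait();
    }
}
```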
It depends on where your files are. If you're using one mechanical hard-disk, then no - don't read files in parallel, it's going to hurt performance. You may have other configurations, though:
On a single SSD, reading files in parallel will probably not hurt performance, but I don't expect you'll gain anything either.
On two mirrored disks using RAID 1 and a half-decent RAID controller, you can read two files at once and gain considerable performance.
If your files are stored on a SAN, you can most definitely read a few at a time and improve performance.
You'll have to try it, but you have to be careful with this - if the files aren't large enough, the OS caching mechanisms are going to affect your measurements, and the second test run is going to be really fast.

save output of a compiled application to memory instead of hard disk

I have an application (an EXE file). While it runs, it generates some files (JPEG files) on the hard disk. We know that reading from and writing to the hard disk performs poorly.
Is there any solution that forces this application to save its output JPEG files to memory instead?
A solution that works under Windows and uses C# would be ideal.
Thanks.
The simplest option is probably not a programmatic one - it's just to use a RAM disk such as RAMDisk (there are others available, of course).
That way other processes get to use the results easily, without any messing around.
Since you don't have the source for the EXE and you can't/won't use a RAM disk, the next option is to improve the I/O performance of your machine:
Use an SSD or a RAID 0 array, or add lots of memory that can be used as a cache.
But without access to the source code for the application, this isn't really a programming question, because the only way you could 'program' a solution is to write your own RAM disk application - and you've said you can't use a RAM disk.
If you really need to make this solution programmatic, then you need to dig deep - depending on the application, you will have to hook a lot of the functions used by the EXE...
That is a really tough thing to do and is prone to problems on several fronts: permissions/rights, antivirus protection, and so on.
Starting points:
http://www.codeproject.com/KB/winsdk/MonitorWindowsFileSystem.aspx
http://msdn.microsoft.com/en-us/windows/hardware/gg462968.aspx

Efficient log backup program in C#

I am writing a log backup program in C#. The main objective is to take logs from multiple servers, copy and compress the files, and then move them to a central data storage server. I will have to move about 270 GB of data every 24 hours. I have a dedicated server to run this job and a 1 Gbps LAN. Currently I am reading lines from a (text) file, copying them into a buffered stream and writing them to the destination.
My last test copied about 2.5 GB of data in 28 minutes. This will not do. I will probably thread the program for efficiency, but I am looking for a better method to copy the files.
I was also playing with the idea of compressing everything first and then copying via a buffered stream. Really, I am just looking for a little advice from someone with more experience than me.
Any help is appreciated, thanks.
You first need to profile, as Umair said, so you can figure out how much of the 28 minutes is spent compressing versus transmitting. Also measure the compression rate (bytes/sec) with different compression libraries, and compare your transfer rate against other programs such as FileZilla to see whether you're close to your system's maximum bandwidth.
One good library to consider is DotNetZip, which lets you zip to a stream - handy for large files.
Once you get it fine-tuned for one thread, experiment with several threads and watch your processor utilization to see where the sweet spot is.
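Not DotNetZip itself, but the same zip-to-a-stream idea is available in the framework's System.IO.Compression; here is a sketch that also times the pass so you can compare compression rates. The paths and the choice of CompressionLevel.Fastest are assumptions.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;

class LogCompressor
{
    // Zips every file in sourceDir into a single archive at zipPath and
    // reports rough throughput. Paths are hypothetical.
    static void CompressAndMeasure(string sourceDir, string zipPath)
    {
        long inputBytes = 0;
        var sw = Stopwatch.StartNew();

        using (var zipStream = File.Create(zipPath))
        using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Create))
        {
            foreach (string file in Directory.EnumerateFiles(sourceDir))
            {
                inputBytes += new FileInfo(file).Length;
                var entry = archive.CreateEntry(Path.GetFileName(file), CompressionLevel.Fastest);
                using var entryStream = entry.Open();
                using var source = File.OpenRead(file);
                source.CopyTo(entryStream);   // streamed, so large logs never sit fully in memory
            }
        }

        sw.Stop();
        double mbPerSec = inputBytes / 1_048_576.0 / sw.Elapsed.TotalSeconds;
        Console.WriteLine($"Compressed {inputBytes} bytes in {sw.Elapsed}: {mbPerSec:F1} MB/s");
    }
}
```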
One solution is what you mentioned: compress the files into one zip file and then transfer it over the network. This will be much faster, since you are transferring a single file, and one of the principal bottlenecks during file transfers is often the destination's security checks.
So if you use one zip file, there is only one check.
In short:
Compress
Transfer
Decompress (if you need to)
This alone should bring you big benefits in terms of performance.
Compress the logs at the source and use TransmitFile (that's a native API - not sure if there's a framework equivalent, or how easy it is to P/Invoke) to send them to the destination. (Possibly HttpResponse.TransmitFile does the same in .NET?)
In any event, do not read your files line by line - read them in blocks (loop calling FileStream.Read for, say, 4 KB at a time until the read count == 0) and send those blocks straight to the network pipe.
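A sketch of that block-wise loop, assuming the receiving end is a plain TCP listener you control; the host, port, and 4 KB block size are placeholders.

```csharp
using System.IO;
using System.Net.Sockets;

class BlockCopyToNetwork
{
    // Streams a file to a destination in fixed-size blocks instead of line by line.
    static void SendFile(string path, string host, int port)
    {
        using var client = new TcpClient(host, port);
        using NetworkStream network = client.GetStream();
        using var file = File.OpenRead(path);

        var buffer = new byte[4096];
        int read;
        while ((read = file.Read(buffer, 0, buffer.Length)) > 0)
        {
            network.Write(buffer, 0, read);   // push each block straight to the pipe
        }
    }
}
```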
Try profiling your program... the bottleneck is often where you least expect it to be. As some clever guy said, "premature optimisation is the root of all evil".
Once, in a similar scenario at work, I was given the task of optimising the process. After profiling, the bottleneck turned out to be a call to a sleep function (which was used for synchronisation between threads!!!).
