How to estimate the time for creating a zip file in C#

I have a method for creating a zip file in my project and it's working perfectly. I want to know whether there is any way to estimate the approximate time needed to create that zip file. I know about Stopwatch, but I don't think I can use it for my requirement. Any ideas?

This is really impossible to answer.
The amount of time a ZIP process will need depends on many factors, for instance:
The compressibility of the file(s) to compress. Case in point: XML files zip very nicely, MP3 files hardly at all.
The number of files to compress.
The algorithm/implementation you use.
Whether the PC you are using is also doing other work (especially I/O).
...
The best you can do is ZIP a portion of the total data (say, 10%), then extrapolate to get an estimated time, then re-evaluate that estimate, say, every 10% of data or so.
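The extrapolation idea can be sketched roughly as follows. This is only a minimal illustration, assuming the built-in System.IO.Compression classes and a linear relationship between bytes processed and time; the ZipEta class and its method names are made up for this example, and the estimate is refreshed after every file rather than every 10% of the data.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, not part of any library.
public static class ZipEta
{
    // Linear extrapolation: remaining = elapsed * (bytes left / bytes done).
    // Returns null while nothing has been processed yet.
    public static TimeSpan? EstimateRemaining(TimeSpan elapsed, long bytesDone, long bytesTotal)
    {
        if (bytesDone <= 0 || bytesTotal <= 0) return null;
        if (bytesDone >= bytesTotal) return TimeSpan.Zero;
        double secondsPerByte = elapsed.TotalSeconds / bytesDone;
        return TimeSpan.FromSeconds(secondsPerByte * (bytesTotal - bytesDone));
    }

    // Zips the given files and reports a fresh estimate after each one.
    public static void CreateZip(string zipPath, IReadOnlyList<string> files, Action<TimeSpan?> report)
    {
        long total = 0;
        foreach (string f in files) total += new FileInfo(f).Length;

        long done = 0;
        var watch = Stopwatch.StartNew();
        using (var zipStream = new FileStream(zipPath, FileMode.Create))
        using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Create))
        {
            foreach (string f in files)
            {
                ZipArchiveEntry entry = archive.CreateEntry(Path.GetFileName(f));
                using (Stream source = File.OpenRead(f))
                using (Stream target = entry.Open())
                {
                    source.CopyTo(target);
                }
                done += new FileInfo(f).Length;
                report(EstimateRemaining(watch.Elapsed, done, total));
            }
        }
    }
}
```

Because compressibility varies from file to file, the early estimates can be well off; they should settle as more of the data has been processed.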

Related

Is my binary files caching method stupid?

In my C# application I have to read a huge number of binary files, but on the first run, reading those files using FileStream and BinaryReader takes a lot of time. The second time you run the app, reading the files is 4 times faster.
After reading the post "Slow reading hundreds of files", I decided to precache the binary files.
After reading another post, "How can I check if a program is running for the first time?", my app can now detect whether it is running for the first time, and if so I precache the files using the simple technique from "Caching a binary file in C#".
Is there another way of precaching a huge number of binary files?
Edit:
This is how I read and parse the files
f_strm = new FileStream(@location, FileMode.Open, FileAccess.Read);
readBinary = new BinaryReader(f_strm);
Parse(readBinary);
The Parse() function just contains a switch statement that I use to parse the data.
I don't do anything more complicated. As an example, I tried to read and parse 10,000 binary files of 601 KB each; it took 39 seconds and about 589,000 cycles to read and parse the files.
When I ran the app again, it took about 45,000 cycles and 1.5 seconds to read and parse them.
Edit:
By a "huge" number of files I mean millions of files. That is not always the case, but most of the time I have to deal with at least 10,000 files. The size of those files can be anywhere between 600 KB and 700 MB.

Just read them once and discard the results. That puts them into the OS cache and makes future reads very fast. Using the OS cache is automatic and very safe.
Or, make yourself a Dictionary<string, byte[]> where you store the file contents keyed by the file path. Be careful not to exhaust available memory or your app will fail or become very slow due to paging.
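Both options might look something like this. It is a sketch only: the CacheWarmer class, the 1 MB default buffer and the maxBytes budget are assumptions made for illustration, not anything from the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of any library.
public static class CacheWarmer
{
    // Option 1: read the file to the end in blocks and throw the data away,
    // so that the OS file cache holds it for later reads.
    // Returns the number of bytes read, e.g. for progress logging.
    public static long Touch(string path, int bufferSize = 1 << 20)
    {
        var buffer = new byte[bufferSize];
        long total = 0;
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, bufferSize, FileOptions.SequentialScan))
        {
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                total += read;
        }
        return total;
    }

    // Option 2: keep file contents in memory keyed by path, stopping
    // before the (assumed) byte budget would be exceeded.
    public static Dictionary<string, byte[]> Load(IEnumerable<string> paths, long maxBytes)
    {
        var cache = new Dictionary<string, byte[]>();
        long used = 0;
        foreach (string path in paths)
        {
            long size = new FileInfo(path).Length;
            if (used + size > maxBytes) break;
            cache[path] = File.ReadAllBytes(path);
            used += size;
        }
        return cache;
    }
}
```

With millions of files of up to 700 MB each, the dictionary variant can only ever hold a small subset, which is why the budget check is there.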

Fast CSV reader

I am working on a project that loads data from a CSV file, processes it and then saves it to disk. For fast reading of CSV data, I am using the Lumenworks CSV reader http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader. This works fine up to a point, but when I have a CSV file of 1 GB or more, it takes a long time. Is there any other way to read CSV faster?

Not a lot of info provided... so on the assumption that this is an IO limitation your options are:
Get Faster Storage [e.g. SSD, RAID].
Try compression - sometimes the time spent in compression [e.g. Zip] saves multiples in IO.
Try threading - particularly useful if doing computationally hard calculations - but probably a bad fit in this scenario.
Change the problem - do you need to read/write a 1GB file? Maybe you can change the data format [156 is a lot smaller than "156,"], maybe you only need to deal with smaller blocks, maybe the time taken honestly doesn't matter, etc.
Any others?
Hmm, you could try caching the output. I've experimented with MemoryMappedFiles and RAM drives, and you could do it with some simple threading. While this does potentially return sooner, it has huge risks and complexities.
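To try the compression option, something along these lines may be enough to measure whether it helps. It is a rough sketch using GZipStream from the framework rather than Zip; the CompressedCsv class is hypothetical, and the plain ReadLine loop does not handle quoted fields that contain line breaks, so a real CSV parser would have to sit on top of the decompressed stream.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, not part of any library.
public static class CompressedCsv
{
    // Streams lines out of a gzip-compressed file without loading it whole.
    public static IEnumerable<string> ReadLines(string gzPath)
    {
        using (var file = File.OpenRead(gzPath))
        using (var gzip = new GZipStream(file, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }

    // Writes lines to a gzip-compressed file, useful for preparing test data.
    public static void WriteLines(string gzPath, IEnumerable<string> lines)
    {
        using (var file = File.Create(gzPath))
        using (var gzip = new GZipStream(file, CompressionMode.Compress))
        using (var writer = new StreamWriter(gzip))
        {
            foreach (string line in lines) writer.WriteLine(line);
        }
    }
}
```

Whether this wins depends on the disk being the bottleneck: less data is read, but the CPU has to decompress it.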

When does it become worthwhile to spend the execution time to zip files?

We are using the #ziplib (found here) in an application that synchronizes files from a server for an occasionally connected client application.
My question is, with this algorithm, when does it become worthwhile to spend the execution time to do the actual zipping of files? Presumably, if only one small text file is being synchronized, the time to zip would not sufficiently reduce the size of the transfer and would actually slow down the entire process.
Since the zip time profile is going to change based on the number of files, the types of files and the size of those files, is there a good way to discover programmatically when I should zip the files and when I should just pass them as is? In our application, files will almost always be photos though the type of photo and size may well change.
I haven't written the actual file transfer logic yet. I expect to use System.Net.WebClient for this, but I am open to alternatives that save on execution time as well.
UPDATE: As this discussion develops, is "to zip, or not to zip" the wrong question? Should the focus be on replacing the older System.Net.WebClient method with compressed WCF traffic or something similar? The database synchronization portion of this utility already uses Microsoft Synchronization Framework and WCF, so I am certainly open to that. Anything we can do now to limit network traffic is going to be huge for our clients.

To determine whether it's useful to compress a file, you have to read the file anyway. While you are at it, you might as well zip it.
If you want to avoid useless zipping without reading the files, you could try to decide beforehand, based on other properties.
You could create an 'algorithm' that decides whether zipping is useful, for example based on file extension and size. So a .txt file of more than 1 KB can be zipped, but a .jpg file shouldn't be, regardless of its size. It is a lot of work to create such a list, though (you could also create a blacklist or whitelist and allow or deny all files not on the list).
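A blacklist version of that rule could look like the sketch below. The ZipDecision class and the extensions in the list are illustrative assumptions; only the 1 KB threshold and the .txt/.jpg examples come from the text above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of any library.
public static class ZipDecision
{
    // Illustrative blacklist of formats that are already compressed.
    private static readonly HashSet<string> AlreadyCompressed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { ".jpg", ".jpeg", ".png", ".gif", ".mp3", ".zip", ".gz" };

    // Files of 1 KB or less are passed through as they are.
    private const long MinSizeBytes = 1024;

    public static bool ShouldZip(string fileName, long sizeInBytes)
    {
        if (sizeInBytes <= MinSizeBytes) return false;
        return !AlreadyCompressed.Contains(Path.GetExtension(fileName));
    }
}
```

Since the decision uses only the name and the size, it can be made without opening the file.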

You probably have plenty of CPU time, so the only issue is: does it shrink?
If you can decrease the file you will save on (Disk and Network) I/O. That becomes profitable very quickly.
Alas, photos (jpeg) are already compressed so you probably won't see much gain.

You can write your own fairly simple heuristic analysis and reuse it as each subsequent file is processed. The collected statistics should be saved so that they remain useful between restarts.
A basic interface:
enum FileContentType
{
    PlainText,
    OfficeDoc,
    OfficeXlsx
}
// The name is ugly, so find a better one
public interface IHeuristicZipAnalyzer
{
    bool IsWorthToZip(int fileSizeInBytes, FileContentType contentType);
    void AddInfo(FileContentType contentType, int fileSizeInBytes, int finalZipSize);
}
You can then collect statistics by calling AddInfo(...) with the details of each file you have just zipped, and decide whether the next file is worth zipping by calling IsWorthToZip(...).
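One possible implementation, purely as a sketch: it keeps running totals per content type and zips only while the observed compression ratio stays below a cut-off. The RatioZipAnalyzer class and the 0.9 default are assumptions, the enum and interface are repeated so the block compiles on its own, and saving the statistics between restarts is left out.

```csharp
using System;
using System.Collections.Generic;

public enum FileContentType { PlainText, OfficeDoc, OfficeXlsx }

public interface IHeuristicZipAnalyzer
{
    bool IsWorthToZip(int fileSizeInBytes, FileContentType contentType);
    void AddInfo(FileContentType contentType, int fileSizeInBytes, int finalZipSize);
}

// Hypothetical implementation, shown only to illustrate the idea.
public sealed class RatioZipAnalyzer : IHeuristicZipAnalyzer
{
    // Per content type: [0] = total original bytes, [1] = total zipped bytes.
    private readonly Dictionary<FileContentType, long[]> totals =
        new Dictionary<FileContentType, long[]>();
    private readonly double maxUsefulRatio;

    public RatioZipAnalyzer(double maxUsefulRatio = 0.9)
    {
        this.maxUsefulRatio = maxUsefulRatio;
    }

    public void AddInfo(FileContentType contentType, int fileSizeInBytes, int finalZipSize)
    {
        if (!totals.TryGetValue(contentType, out long[] t))
            totals[contentType] = t = new long[2];
        t[0] += fileSizeInBytes;
        t[1] += finalZipSize;
    }

    // fileSizeInBytes is ignored in this sketch; a fuller version could
    // also skip files below some minimum size.
    public bool IsWorthToZip(int fileSizeInBytes, FileContentType contentType)
    {
        // With no statistics yet, zip once so there is something to learn from.
        if (!totals.TryGetValue(contentType, out long[] t) || t[0] == 0) return true;
        return (double)t[1] / t[0] <= maxUsefulRatio;
    }
}
```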

Sorting gigantic binary files with C#

I have a large file of roughly 400 GB, generated daily by an external closed system. It is a binary file with the following format:
byte[8]byte[4]byte[n]
Where n is equal to the int32 value of byte[4].
This file has no delimiters and to read the whole file you would just repeat until EOF. With each "item" represented as byte[8]byte[4]byte[n].
The file looks like
byte[8]byte[4]byte[n]byte[8]byte[4]byte[n]...EOF
byte[8] is a 64-bit number representing a period of time represented by .NET Ticks. I need to sort this file but can't seem to figure out the quickest way to do so.
Presently, I load the Ticks into a struct and the byte[n] start and end positions and read to the end of the file. After this, I sort the List in memory by the Ticks property and then open a BinaryReader and seek to each position in Ticks order, read the byte[n] value, and write to an external file.
At the end of the process I end up with a sorted binary file, but it takes FOREVER. I am using C# .NET and a pretty beefy server, but disk IO seems to be an issue.
Server Specs:
2x 2.6 GHz Intel Xeon (Hex-Core with HT) (24-threads)
32GB RAM
500GB RAID 1+0
2TB RAID 5
I've looked all over the internet and can only find examples where a huge file is 1GB (makes me chuckle).
Does anyone have any advice?

A great way to speed up this kind of file access is to memory-map the entire file into address space and let the OS take care of reading whatever bits from the file it needs to. So do the same thing as you're doing right now, except read from memory instead of using a BinaryReader/seek/read.
You've got lots of main memory, so this should provide pretty good performance (as long as you're using a 64-bit OS).
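As a sketch of the indexing step with a memory-mapped file (assuming the fields are little-endian, as BinaryWriter would produce them, and that the file is not empty; RecordIndex and MappedIndex are made-up names):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

public struct RecordIndex
{
    public long Ticks;   // the byte[8] field
    public long Offset;  // where the byte[n] payload starts
    public int Length;   // n, taken from the byte[4] field
}

// Hypothetical helper, not part of any library.
public static class MappedIndex
{
    // Walks the mapped file once and returns the index sorted by ticks.
    public static List<RecordIndex> Build(string path)
    {
        // Use the real file length: a view's capacity can be rounded up
        // to a page boundary and so be larger than the file.
        long fileLength = new FileInfo(path).Length;
        var index = new List<RecordIndex>();
        using (var map = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = map.CreateViewAccessor(0, fileLength, MemoryMappedFileAccess.Read))
        {
            long pos = 0;
            while (pos + 12 <= fileLength)
            {
                long ticks = view.ReadInt64(pos);
                int length = view.ReadInt32(pos + 8);
                index.Add(new RecordIndex { Ticks = ticks, Offset = pos + 12, Length = length });
                pos += 12 + length;
            }
        }
        index.Sort((a, b) => a.Ticks.CompareTo(b.Ticks));
        return index;
    }
}
```

The payloads can then be copied out in index order with view.ReadArray or a view stream. With 400 GB of data and 32 GB of RAM the random reads will still hit the disk, so this mainly removes the seek/read call overhead.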

Use merge sort.
It's online and parallelizes well.
http://en.wikipedia.org/wiki/Merge_sort

If you can learn Erlang or Go, they could be very powerful and scale extremely well, as you have 24 threads. Utilize Async I/O. Merge Sort.
And since you have 32GB of RAM, try to load as much as you can into RAM, sort it there, then write it back to disk.

I would do this in several passes. On the first pass, I would create a list of ticks, then distribute them evenly into many (hundreds?) buckets. If you know ahead of time that the ticks are evenly distributed, you can skip this initial pass. On a second pass, I would split the records into these few hundred separate files of about the same size (these much smaller files represent groups of ticks in the order that you want). Then I would sort each file separately in memory. Then concatenate the files.
It is somewhat similar to the hashsort (I think).
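A compact sketch of the split, sort and concatenate passes, assuming the ticks are spread roughly evenly between a known minimum and maximum (so the initial counting pass is skipped) and that each bucket file fits in memory. BucketSort and its parameters are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of any library.
public static class BucketSort
{
    // Sorts a file of [int64 ticks][int32 n][n bytes] records by ticks.
    public static void Sort(string input, string output, string tempDir,
                            long minTicks, long maxTicks, int bucketCount)
    {
        // Pass 1: route each record to the bucket file covering its tick range.
        var writers = new BinaryWriter[bucketCount];
        for (int i = 0; i < bucketCount; i++)
            writers[i] = new BinaryWriter(File.Create(BucketPath(tempDir, i)));
        double span = (double)maxTicks - minTicks + 1;
        using (var reader = new BinaryReader(File.OpenRead(input)))
        {
            long length = reader.BaseStream.Length;
            while (reader.BaseStream.Position < length)
            {
                long ticks = reader.ReadInt64();
                int n = reader.ReadInt32();
                byte[] payload = reader.ReadBytes(n);
                int b = (int)((ticks - minTicks) / span * bucketCount);
                b = Math.Max(0, Math.Min(bucketCount - 1, b)); // clamp out-of-range ticks
                writers[b].Write(ticks);
                writers[b].Write(n);
                writers[b].Write(payload);
            }
        }
        foreach (var w in writers) w.Dispose();

        // Pass 2: sort each bucket in memory and append it to the output.
        using (var writer = new BinaryWriter(File.Create(output)))
        {
            for (int i = 0; i < bucketCount; i++)
            {
                var records = new List<KeyValuePair<long, byte[]>>();
                using (var reader = new BinaryReader(File.OpenRead(BucketPath(tempDir, i))))
                {
                    long length = reader.BaseStream.Length;
                    while (reader.BaseStream.Position < length)
                    {
                        long ticks = reader.ReadInt64();
                        int n = reader.ReadInt32();
                        records.Add(new KeyValuePair<long, byte[]>(ticks, reader.ReadBytes(n)));
                    }
                }
                records.Sort((x, y) => x.Key.CompareTo(y.Key));
                foreach (var r in records)
                {
                    writer.Write(r.Key);
                    writer.Write(r.Value.Length);
                    writer.Write(r.Value);
                }
                File.Delete(BucketPath(tempDir, i));
            }
        }
    }

    private static string BucketPath(string dir, int i)
    {
        return Path.Combine(dir, "bucket" + i + ".bin");
    }
}
```

Both passes read and write sequentially, which is the point: the random seeks of the original approach are replaced by two streaming passes over the data.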

Efficient log backup program in C#

I am writing a log backup program in C#. The main objective is to take logs from multiple servers, copy and compress the files and then move them to a central data storage server. I will have to move about 270Gb of data every 24 hours. I have a dedicated server to run this job and a LAN of 1Gbps. Currently I am reading lines from a (text)file, copying them into a buffer stream and writing them to the destination.
My last test copied about 2.5Gb of data in 28 minutes. This will not do. I will probably thread the program for efficiency, but I am looking for a better method to copy the files.
I was also playing with the idea of compressing everything first and then using a stream buffer a bit to copy. Really, I am just looking for a little advice from someone with more experience than me.
Any help is appreciated, thanks.

You first need to profile as Umair said so that you can figure out how much of the 28 minutes is spent compressing vs. transmitting. Also measure the compression rate (bytes/sec) with different compression libraries, and compare your transfer rate against other programs such as Filezilla to see if you're close to your system's maximum bandwidth.
One good library to consider is DotNetZip, which allows you to zip to a stream, which can be handy for large files.
Once you get it fine-tuned for one thread, experiment with several threads and watch your processor utilization to see where the sweet spot is.

One solution is what you mentioned: compress the files into one zip file and then transfer it over the network. This will be much faster because you are transferring a single file, and often one of the principal bottlenecks during file transfers is the destination's security checks.
So if you use one zip file, there should be only one check.
In short:
Compress
Transfer
Decompress (if you need)
This alone should bring you big benefits in terms of performance.

Compress the logs at source and use TransmitFile (that's a native API - not sure if there's a framework equivalent, or how easy it is to P/Invoke this) to send them to the destination. (Possibly HttpResponse.TransmitFile does the same in .Net?)
In any event, do not read your files linewise - read the files in blocks (loop doing FileStream.Read for 4K - say - bytes until read count == 0) and send that direct to the network pipe.
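The block-wise loop could look like this. Note that it is not TransmitFile: it is a plain managed sketch that also compresses on the fly with GZipStream, and the BlockCopy class name is made up. The destination can be any writable stream, for example a NetworkStream.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, not part of any library.
public static class BlockCopy
{
    // Copies source to destination in fixed-size blocks, gzip-compressing
    // on the way. Returns the number of uncompressed bytes read.
    public static long CopyCompressed(Stream source, Stream destination, int blockSize = 4096)
    {
        var buffer = new byte[blockSize];
        long total = 0;
        // leaveOpen: true so the caller keeps control of the destination stream.
        using (var gzip = new GZipStream(destination, CompressionMode.Compress, true))
        {
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                gzip.Write(buffer, 0, read);
                total += read;
            }
        }
        return total;
    }
}
```

The 4K block size is just the figure from the answer; larger blocks (64 KB or more) are worth measuring too.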

Try profiling your program... the bottleneck is often where you least expect it to be. As some clever guy said, "Premature optimisation is the root of all evil".
Once, in a similar scenario at work, I was given the task of optimising the process. After profiling, the bottleneck turned out to be a call to a sleep function (which was used for synchronisation between threads!).
