I have the following code for creating split archives using 7zip.
Compression level: MX9
Split archive size: 1MB
static void Main(string[] args)
{
string zipFileName = @"D:\ZIP\zipfile.7z";
string temp = @"D:\ZIP\ZM.pdf";
ProcessStartInfo info = new ProcessStartInfo();
info.FileName = AppDomain.CurrentDomain.BaseDirectory + @"..\..\7za.exe";
/**
* Switch -mx0: Don't compress at all. This is called "copy mode."
* Switch -mx1: Low compression. This is called "fastest" mode.
* Switch -mx3: Fast compression mode. Will automatically set various parameters.
* Switch -mx5: Same as above, but "normal."
* Switch -mx7: This means "maximum" compression.
* Switch -mx9: This means "ultra" compression. You probably want to use this.
**/
info.Arguments = string.Format("a -t7z \"{0}\" \"{1}\" -v{2}k -{3}", zipFileName, temp, 1024, CompressionLevel.mx9);
info.WindowStyle = ProcessWindowStyle.Hidden;
Process process = Process.Start(info);
process.WaitForExit();
Console.WriteLine("Done zipping");
Console.ReadLine();
}
Normally for a 10MB file I get nine .7z files with extensions .7z.001, .7z.002, .7z.003 and so on. So for a 1MB file, I get one .7z file with the extension .7z.001. What I want to achieve is to eliminate the .001 extension if only a single file is generated. Is there any way to know how many split archives will be generated by 7zip based on its compression rate? I'm dealing with PDF files.
EDIT:
Basically what I want to do is to decide whether to create split archives or not. So I have to guess whether the resulting file will be greater than 1MB.
It is impossible to know what size the resulting files will have unless you are able to analyze the content of the file and check how well it can be compressed. (Which can, to my knowledge, only be done by actually compressing it.)
For example, a PDF file containing only text might compress better than one made up of compressed images. The best solutions would be to stop splitting the archives, or to check for the presence of .002 (etc.) files after compressing the input.
An alternative solution would be to compress the file in memory using the C# LZMA SDK and then split the files manually if appropriate.
You could try compressing various combinations of PDF files and averaging out the compression rate, and then you'd know the rough input size after which you'd end up with more than one archive.
That said, it won't be precise. A simpler approach is to wait for 7-Zip to finish, check how many volumes you have, and drop the .001 extension if there is only one.
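A minimal sketch of that check (untested; zipFileName is the same base name your code passes to 7za):
// After process.WaitForExit(): if no .002 volume exists, the .001 file
// is the whole archive, so drop the numeric suffix.
string firstVolume = zipFileName + ".001";
string secondVolume = zipFileName + ".002";
if (File.Exists(firstVolume) && !File.Exists(secondVolume))
{
    File.Move(firstVolume, zipFileName); // rename zipfile.7z.001 -> zipfile.7z
}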
Related
I have logic that downloads a group of files as a zip. The issue is that there is no progress indicator, so the user does not know how far along the download is.
This zip file doesn't exist beforehand; the user selects the files they want to download, and then I use the SharpZipLib NuGet package to create a zip
and stream it to the response.
It seems I need to set the Content-Length header for the browser to show a total-size progress indicator. The issue I'm having is that this value has to be exact; if it's too low or too high by even 1 byte, the file does not download properly. I can get an approximate
final size by adding all the file sizes together and using no compression, but I don't see a way to calculate the final zip size exactly.
I hoped I could just overestimate the final size a bit and the browser would allow that, but it doesn't: the file isn't downloaded properly, so you can't access it.
Here are some possible solutions I've come up with, but they have their own issues.
1 - I can create the zip on the server first and then stream it, so I know the exact size and can set the Content-Length (a rough sketch of this is shown after this list). The issue is that the user has to wait for all the files to be streamed to the web server and the zip to be created before I can start streaming it back to them. While this is going on, the user won't even see the download as started. It also means more memory usage on the web server, since it has to hold the entire zip file in memory.
2 - I can come up with my own progress UI: use the combined file sizes to get a rough final-size estimate, and then push updates to the user via SignalR as the files are streamed.
3 - I show the user the total file size before the download begins, so they at least have a way to judge for themselves how far along it is. But the browser gives no indication of progress, so if they forget the total and look at the browser's download progress, there is nothing to go on.
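Rough sketch of option 1 (untested; it would live in a method like ZipFilesToResponse below and reuses its response, filePathNames, zipFileName, GetBlobProperties and Container):
// Build the whole zip into a MemoryStream first so the exact Content-Length
// is known, then copy the buffered bytes to the response.
using (var buffer = new MemoryStream())
{
    using (var zipOutputStream = new ZipOutputStream(buffer))
    {
        zipOutputStream.IsStreamOwner = false; // keep the MemoryStream open after the zip is finished
        zipOutputStream.SetLevel(0);
        foreach (var file in filePathNames)
        {
            var entry = new ZipEntry(file.Item2)
            {
                DateTime = DateTime.Now,
                Size = GetBlobProperties(file.Item1).Length
            };
            zipOutputStream.PutNextEntry(entry);
            Container.GetBlockBlobReference(file.Item1).DownloadToStream(zipOutputStream);
        }
        zipOutputStream.Finish();
    }
    response.AddHeader("Content-Disposition", "attachment; filename=" + zipFileName);
    response.ContentType = "application/octet-stream";
    response.AddHeader("Content-Length", buffer.Length.ToString()); // exact length is now known
    buffer.Position = 0;
    buffer.CopyTo(response.OutputStream);
    response.End();
}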
These all have their own drawbacks. Is there a better way to do this, ideally so it's all handled by the browser?
Below is my ZipFilesToResponse method. It uses some objects that aren't shown here, for simplicity's sake. It also streams the files from Azure Blob Storage.
public void ZipFilesToResponse(HttpResponseBase response, IEnumerable<Tuple<string,string>> filePathNames, string zipFileName)
{
using (var zipOutputStream = new ZipOutputStream(response.OutputStream))
{
zipOutputStream.SetLevel(0); // 0 - store only to 9 - means best compression
response.BufferOutput = false;
response.AddHeader("Content-Disposition", "attachment; filename=" + zipFileName);
response.ContentType = "application/octet-stream";
Dictionary<string,long> sizeDictionary = new Dictionary<string, long>();
long totalSize = 0;
foreach (var file in filePathNames)
{
long size = GetBlobProperties(file.Item1).Length;
totalSize += size;
sizeDictionary.Add(file.Item1,size);
}
//The zip download breaks if we don't set the exact content length,
//and that isn't necessarily the sum of the content lengths.
//I don't see a simple way to get it right without downloading the entire file to the server first,
//so for now we won't include a Content-Length header.
//response.AddHeader("Content-Length",totalSize.ToString());
foreach (var file in filePathNames)
{
long size = sizeDictionary[file.Item1];
var entry = new ZipEntry(file.Item2)
{
DateTime = DateTime.Now,
Size = size
};
zipOutputStream.PutNextEntry(entry);
Container.GetBlockBlobReference(file.Item1).DownloadToStream(zipOutputStream);
response.Flush();
if (!response.IsClientConnected)
{
break;
}
}
zipOutputStream.Finish();
zipOutputStream.Close();
}
response.End();
}
I have a number of zipped files in a folder stored on a Samsung EVO 970 SSD. Each zip file is 2 GB+ (compressed) with 200K+ text files inside, each file being between 5 and 1.5 MB; essentially a large number of small text files.
Rather than extract each zip file and process each text file individually from the SSD, I'm trying to load each zip file fully into memory at the start of processing and then read each file as shown at the end here.
My (maybe naive) thinking is that if I could figure out a way to hold the whole zip file contents in RAM and process the text contents without decompressing the zip to disk, I would see a material boost in processing performance.
Currently it takes about 10 ms on average to process a single text file, even with the approach below.
var myMS = new MemoryStream();
using (var file = File.OpenRead(zipFile))
{
    file.CopyTo(myMS);
    using (var zip = new ZipArchive(myMS, ZipArchiveMode.Read))
    {
        foreach (var entry in zip.Entries)
        {
            using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
            {
                string fileContents = reader.ReadToEnd();
                //do something with the file
            }
        }
    }
}
My question is: does this approach make sense? Given that the total number of files across all the zip files is in the millions, I could be sitting here for a week waiting for processing to finish.
I have a WCF web service that saves files to a folder (about 200,000 small files).
After that, I need to move them to another server.
The solution I've found was to zip them then move them.
When I adopted this solution, I tested it with 20,000 files; zipping 20,000 files took only about 2 minutes, and moving the zip is really fast.
But in production, zipping 200,000 files takes more than 2 hours.
Here is my code to zip the folder :
using (ZipFile zipFile = new ZipFile())
{
zipFile.UseZip64WhenSaving = Zip64Option.Always;
zipFile.CompressionLevel = CompressionLevel.None;
zipFile.AddDirectory(this.SourceDirectory.FullName, string.Empty);
zipFile.Save(DestinationCurrentFileInfo.FullName);
}
I want to modify the WCF webservice, so that instead of saving to a folder, it saves to the zip.
I use the following code to test:
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories).Where(s => s.EndsWith(".aes")).Select(f => new FileInfo(f));
foreach (var additionFile in listAes)
{
using (var zip = ZipFile.Read(nameOfExistingZip))
{
zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
zip.AddFile(additionFile.FullName);
zip.Save();
}
file.WriteLine("Delay for adding a file : " + sw.Elapsed.TotalMilliseconds);
sw.Restart();
}
The first file added to the zip takes only 5 ms, but adding the 10,000th file takes 800 ms.
Is there a way to optimize this? Or do you have other suggestions?
EDIT
The example shown above is only for testing; in the WCF web service, I'll have different requests sending files that I need to add to the zip file.
As WCF is stateless, I will get a new instance of my class with each call, so how can I keep the zip file open to add more files?
I've looked at your code and immediately spotted the problem. The trouble with a lot of software developers nowadays is that they don't understand how stuff works, which makes it impossible to reason about it. In this particular case you don't seem to know how ZIP files work, so I'd suggest you first read up on the file format and then break down what happens under the hood.
Reasoning
Now that we're all on the same page about how ZIP files work, let's start the reasoning by breaking down your source code; we'll continue from there:
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories).Where(s => s.EndsWith(".aes")).Select(f => new FileInfo(f));
foreach (var additionFile in listAes)
{
// (1)
using (var zip = ZipFile.Read(nameOfExistingZip))
{
zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
// (2)
zip.AddFile(additionFile.FullName);
// (3)
zip.Save();
}
file.WriteLine("Delay for adding a file : " + sw.Elapsed.TotalMilliseconds);
sw.Restart();
}
(1) opens the ZIP file. You're doing this for every file you attempt to add.
(2) adds a single file to the ZIP file.
(3) saves the complete ZIP file.
On my computer this takes about an hour.
Now, not all of the file format details are relevant. We're looking for stuff that will get increasingly worse in your program.
Skimming over the file format specification, you'll notice that compression is based on Deflate, which doesn't require any information about the other files being compressed. Moving on, we'll look at how the 'file table' is stored in the ZIP file:
You'll notice here that there's a 'central directory' which lists the files in the ZIP file; it's basically stored as a list. Using this information, we can reason about the trivial way to update it when implementing steps (1)-(3) in this order:
Open the zip file, read the central directory
Append data for the (new) compressed file, store the pointer along with the filename in the new central directory.
Re-write the central directory.
Think about it for a moment: for file #1 you need 1 write operation; for file #2, you need to read (1 item), append (in memory) and write (2 items); for file #3, you need to read (2 items), append (in memory) and write (3 items); and so on. The directory writes alone add up to 1 + 2 + ... + N = N(N+1)/2 entries, which for 200,000 files is on the order of 2*10^10. This basically means that your performance will go down the drain as you add more files. You've already observed this; now you know why.
A possible solution
The solution shown at the end of this answer adds all files at once, but that might not work in your use case. An alternative is to implement a merge that basically merges 2 files together every time. This is more convenient if you don't have all the files available when you start the compression process.
Basically the algorithm then becomes:
Add a few files (say, 16; you can toy with this number) and store them in, say, 'file16.zip'.
Add more files. When you hit 16 files again, you have to merge the two files of 16 items into a single file of 32 items.
Merge files until you cannot merge any more. Basically, every time you have two files of N items, you create a new file of 2*N items.
Go to (2).
Again, we can reason about it. The first 16 files aren't a problem, we've already established that.
We can also reason about what will happen in our program. Because we're merging 2 files into 1 file, we don't have to do as many reads and writes. In fact, if you reason about it, you'll see that you have a file of 32 entries in 2 merges, 64 in 4 merges, 128 in 8 merges, 256 in 16 merges... hey, wait, we know this sequence, it's 2^N. Again, reasoning about it, we'll find that we need approximately 500 merges -- which is much better than the 200,000 operations we started with.
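If you go the merge route, a rough sketch of a single merge step could look like the following. Note this uses System.IO.Compression's ZipArchive rather than Ionic, MergeInto is a name I just made up, and Update mode buffers the archive contents in memory while it's open, so treat it as an illustration of the idea rather than production code.
// Requires: using System.IO.Compression; (note this ZipFile is the BCL one, not Ionic's)
// Append every entry of sourceZip to destZip. Each merge rewrites destZip
// once, which is what keeps the total number of rewrites low.
static void MergeInto(string destZip, string sourceZip)
{
    using (ZipArchive dest = ZipFile.Open(destZip, ZipArchiveMode.Update))
    using (ZipArchive source = ZipFile.Open(sourceZip, ZipArchiveMode.Read))
    {
        foreach (ZipArchiveEntry entry in source.Entries)
        {
            ZipArchiveEntry copy = dest.CreateEntry(entry.FullName, CompressionLevel.NoCompression);
            using (Stream from = entry.Open())
            using (Stream to = copy.Open())
            {
                from.CopyTo(to);
            }
        }
    }
}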
Hacking in the ZIP file
Yet another solution that might come to mind is to overallocate the central directory, creating slack space for future entries. However, this probably requires you to hack into the ZIP code and write your own ZIP file writer. The idea is that you basically overallocate the central directory to 200K entries before you get started, so that you can simply append in place.
Again, we can reason about it: adding a file now means adding the file data and updating some headers. It won't be as fast as the original solution because you'll need random disk IO, but it'll probably work fast enough.
I haven't worked this out, but it doesn't seem overly complicated to me.
The easiest solution is the most practical
Finally, the easiest possible solution, which I mentioned above: simply add all the files at once, which we can again reason about.
Implementation is quite easy, because now we don't have to do anything fancy; we can simply use the ZIP handler (I use Ionic's DotNetZip) as-is:
static void Main()
{
try { File.Delete(@"c:\tmp\test.zip"); }
catch { }
var sw = Stopwatch.StartNew();
using (var zip = new ZipFile(@"c:\tmp\test.zip"))
{
zip.UseZip64WhenSaving = Zip64Option.Always;
for (int i = 0; i < 200000; ++i)
{
string filename = "foo" + i.ToString() + ".txt";
byte[] contents = Encoding.UTF8.GetBytes("Hello world!");
zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
zip.AddEntry(filename, contents);
}
zip.Save();
}
Console.WriteLine("Elapsed: {0:0.00}s", sw.Elapsed.TotalSeconds);
Console.ReadLine();
}
Whop; that finishes in 4.5 seconds. Much better.
I can see that you just want to group the 200,000 files into one big single file, without compression (like a tar archive).
Two ideas to explore:
Experiment with file formats other than zip, as zip may not be the fastest for this. Tar (tape archive) comes to mind; it has natural speed advantages due to its simplicity, and it even has an append mode, which is exactly what you are after to ensure O(1) operations. SharpCompress is a library that will allow you to work with this format (and others).
If you have control over your remote server, you could implement your own file format. The simplest I can think of would be to zip each new file separately (to store the file metadata such as name, date, etc. in the file content itself), and then append each such zipped file to a single raw bytes file. You would just need to store the byte offsets (e.g. comma-separated in another txt file) to allow the remote server to split the huge file back into the 200,000 zipped files, and then unzip each of them to get the metadata. I guess this is also roughly what tar does behind the scenes :).
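For what it's worth, a rough sketch of that second idea could look like this (untested; the container/index paths and the AppendAsZippedBlob name are just placeholders):
// Requires: using System.IO.Compression;
// Zip each incoming file on its own, append the zipped bytes to one big
// container file, and record each blob's offset and length in a text index
// so the remote server can split the container back apart later.
static void AppendAsZippedBlob(string containerPath, string indexPath, string filePath)
{
    byte[] zippedBytes;
    using (var ms = new MemoryStream())
    {
        using (var archive = new ZipArchive(ms, ZipArchiveMode.Create, leaveOpen: true))
        {
            var entry = archive.CreateEntry(Path.GetFileName(filePath), CompressionLevel.NoCompression);
            using (var entryStream = entry.Open())
            using (var input = File.OpenRead(filePath))
            {
                input.CopyTo(entryStream);
            }
        }
        zippedBytes = ms.ToArray();
    }
    using (var container = new FileStream(containerPath, FileMode.Append, FileAccess.Write))
    {
        long offset = container.Position; // FileMode.Append starts at the end of the file
        container.Write(zippedBytes, 0, zippedBytes.Length);
        File.AppendAllText(indexPath, offset + "," + zippedBytes.Length + Environment.NewLine);
    }
}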
Have you tried zipping to a MemoryStream rather than to a file, only flushing to a file when you are done for the day? Of course for back-up purposes your WCF service would have to keep a copy of the received individual files until you are sure they have been "committed" to the remote server.
If you do need compression, 7-Zip (and fiddling with the options) is well worth a try.
You are opening the file repeatedly; why not loop through and add them all to one zip, then save it?
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories)
.Where(s => s.EndsWith(".aes"))
.Select(f => new FileInfo(f));
using (var zip = ZipFile.Read(nameOfExistingZip))
{
foreach (var additionFile in listAes)
{
zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
zip.AddFile(additionFile.FullName);
}
zip.Save();
}
If the files aren't all available right away, you could at least batch them together. So if you're expecting 200k files but have only received 10 so far, don't open the zip, add one file, and close it again; wait for a few more to come in and add them in batches.
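A rough sketch of that batching, reusing the same DotNetZip calls as above (PendingFiles, BatchSize and the method names are made up, and a real WCF service would also need locking around this):
// Collect incoming paths and only open/save the zip once a whole batch has arrived.
static readonly List<string> PendingFiles = new List<string>();
const int BatchSize = 16;
static void QueueFile(string path, string nameOfExistingZip)
{
    PendingFiles.Add(path);
    if (PendingFiles.Count >= BatchSize)
        FlushBatch(nameOfExistingZip);
}
static void FlushBatch(string nameOfExistingZip)
{
    using (var zip = ZipFile.Read(nameOfExistingZip))
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        foreach (var path in PendingFiles)
            zip.AddFile(path);
        zip.Save(); // one read/rewrite of the archive per batch instead of per file
    }
    PendingFiles.Clear();
}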
If you are OK with performance of 100 * 20,000 files, can't you simply partition your large ZIP into a 100 "small" ZIP files? For simplicity, create a new ZIP file every minute and put a time-stamp in the name.
You can zip all the files using the .NET TPL (Task Parallel Library), like this:
// Fragment from the linked article: sourceStream, tasks, taskCounter, sliceBytes,
// bufferRead, listOfMemStream, eventSignal and CompressStreamP are all defined there.
while (0 != (read = sourceStream.Read(bufferRead, 0, sliceBytes)))
{
    // Compress this slice of the input on a worker task.
    tasks[taskCounter] = Task.Factory.StartNew(() =>
        CompressStreamP(bufferRead, read, taskCounter, ref listOfMemStream, eventSignal));
    // eventSignal is presumably set by CompressStreamP once it has captured its
    // inputs, so the loop can safely reuse taskCounter and allocate a new buffer.
    eventSignal.WaitOne(-1);
    taskCounter++;
    bufferRead = new byte[sliceBytes];
}
Task.WaitAll(tasks); // wait for all compression tasks to finish
There is a compiled library and source code here:
http://www.codeproject.com/Articles/49264/Parallel-fast-compression-unleashing-the-power-of
I have 22k text (RTF) files which I must append into one final file.
The code looks something like this:
using (TextWriter mainWriter = new StreamWriter(mainFileName))
{
foreach (string currentFile in filesToAppend)
{
using (TextReader currentFileReader = new StreamReader(currentFile))
{
string fileContent = currentFileReader.ReadToEnd();
mainWriter.Write(fileContent);
}
}
}
Clearly, this opens a stream 22k times to read from the files.
My questions are :
1) in general, is opening a stream a slow operation? Is reading from a stream a slow operation ?
2) is there any difference if I read the file as byte[] and append it as byte[] than using the file text?
3) any better ideas to merge 22k files ?
Thanks.
1) in general, is opening a stream a slow operation?
No, not at all. Opening a stream is blazing fast, it's only a matter of reserving a handle from the underlying Operating System.
2) is there any difference if I read the file as byte[] and append it
as byte[] than using the file text?
Sure, it might be a bit faster, since you skip converting the bytes into strings using some encoding, but the improvement would be negligible (especially if you are dealing with really huge files) compared to what I suggest in the next point.
3) any ways to achieve this better ? ( merge 22k files )
Yes, don't load the contents of every single file in memory, just read it in chunks and spit it to the output stream:
using (var output = File.OpenWrite(mainFileName))
{
foreach (string currentFile in filesToAppend)
{
using (var input = File.OpenRead(currentFile))
{
input.CopyTo(output);
}
}
}
The Stream.CopyTo method from the BCL will take care of the heavy lifting in my example.
Probably the best way to speed this up is to make sure that the output file is on a different physical disk drive than the input files.
Also, you can get some increase in speed by creating the output file with a large buffer (BufferSize in the example is a constant you would define yourself, e.g. 64 KB). For example:
using (var fs = new FileStream(filename, FileMode.Create, FileAccess.Write, FileShare.None, BufferSize))
{
using (var mainWriter = new StreamWriter(fs))
{
// do your file copies here
}
}
That said, your primary bottleneck will be opening the files. That's especially true if those 22,000 files are all in the same directory, since NTFS has problems with very large directories. You're better off splitting that one large directory into, say, 22 directories with 1,000 files each: opening a file in a directory that contains tens of thousands of files is much slower than opening one in a directory that holds only a few hundred.
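For example, a quick way to bucket the files into subdirectories of 1,000 might look like this (sourceDir and bucketRoot are placeholder names):
// Spread the files across subdirectories of at most 1,000 files each,
// so no single NTFS directory gets huge.
var files = Directory.EnumerateFiles(sourceDir).ToList();
for (int i = 0; i < files.Count; i++)
{
    string bucket = Path.Combine(bucketRoot, (i / 1000).ToString("D3")); // 000, 001, 002, ...
    Directory.CreateDirectory(bucket); // no-op if it already exists
    File.Move(files[i], Path.Combine(bucket, Path.GetFileName(files[i])));
}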
What's slow about reading data from a file is that you aren't just moving electrons around, which can propagate a signal really fast. To read information from files on a hard drive you have to actually spin metal platters around and use magnetic heads to read data off them, and those platters spin far more slowly than electrons propagate signals through wires. Regardless of what mechanism you use in code to tell those disks to spin, you're still going to have to wait for them to go a-spinnin', and that's going to take time.
Whether you treat the data as bytes or text isn't particularly relevant, no.
Yes, this is an exact duplicate of this question, but the link given and accepted as the answer is not working for me. It returns incorrect values (a 2-minute MP3 is listed as 1'30, a 3-minute one as 2'20) with no obvious pattern.
So here it is again: how can I get the length of an MP3 using C#?
or
What am I doing wrong with the MP3Header class:
MP3Header mp3hdr = new MP3Header();
bool boolIsMP3 = mp3hdr.ReadMP3Information("1.mp3");
if(boolIsMP3)
Response.Write(mp3hdr.intLength);
Apparently this class computes the duration as fileSize / bitRate (roughly: duration in seconds = file size in bits / bitrate in bits per second). This can only work for a constant bitrate, and I assume your MP3 has a variable bitrate...
EDIT: have a look at TagLib Sharp, it can give you the duration.
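For example, something along these lines should work with TagLib Sharp (untested sketch; the duration is exposed through the file's Properties):
// Read the duration of an MP3 with TagLib Sharp.
var tagFile = TagLib.File.Create("1.mp3");
TimeSpan duration = tagFile.Properties.Duration;
Response.Write(duration.TotalSeconds);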
How have you ascertained the lengths of the MP3s which are "wrong"? I've often found that the header information can be wrong: there was a particular version of LAME which had this problem, for example.
If you bring the file's properties up in Windows Explorer, what does that show?
I wrapped an mp3 decoder library and made it available for .NET developers. You can find it here:
http://sourceforge.net/projects/mpg123net/
Included are samples that convert an mp3 file to PCM and read ID3 tags.
I guess you can use it to read the mp3 file's duration. The worst case is that you have to read all the frames and compute the duration yourself, for a VBR file.
To accurately determine mp3 duration, you HAVE TO read all the frames and calculate the total from their summed durations. There are lots of cases where people put various 'metadata' inside mp3 files, so if you estimate from bitrate and file size, you'll guess wrong.
I would consider using an external application to do this. Consider trying SoX: run it as soxi (the info variant) and parse that output. Given your options, I think you're better off trusting someone else who has spent the time to work out all the weirdness in mp3 files, unless this functionality is core to what you're doing. Good luck!
The second post in the thread might help you: http://social.msdn.microsoft.com/Forums/en-US/csharpgeneral/thread/c72033c2-c392-4e0e-9993-1f8991acb2fd
The length of a VBR file CAN'T be estimated at all. Every mp3 frame inside it can have a different bitrate, so from reading any one part of the file you can't know how dense the data is in any other part. The only way to determine the EXACT length of a VBR mp3 is to DECODE it in whole, OR (if you know how) read all the frame headers one by one and add up their decoded DURATIONS.
You would use the latter method only if the CPU you're using is a precious resource that you need to save. Otherwise, decode the whole file and you'll have the duration.
You can use my port of mpg123 to do the job: http://sourceforge.net/projects/mpg123net/
More: many mp3 files have "stuff" added to them, such as ID3 tags, and if you don't go through the whole file you could mistakenly include that data in the duration calculation.
Here is my solution in C# using the SoX sound processing library.
public static double GetAudioDuration(string soxPath, string audioPath)
{
double duration = 0;
var startInfo = new ProcessStartInfo(soxPath,
string.Format("\"{0}\" -n stat", audioPath));
startInfo.UseShellExecute = false;
startInfo.CreateNoWindow = true;
startInfo.RedirectStandardError = true;
startInfo.RedirectStandardOutput = true;
var process = Process.Start(startInfo);
process.WaitForExit();
string str;
using (var outputThread = process.StandardError)
str = outputThread.ReadToEnd();
if (string.IsNullOrEmpty(str))
using (var outputThread = process.StandardOutput)
str = outputThread.ReadToEnd();
try
{
string[] lines = str.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
string lengthLine = lines.First(line => line.Contains("Length (seconds)"));
// SoX prints a '.' decimal separator, so parse with the invariant culture.
duration = double.Parse(lengthLine.Split(':')[1], System.Globalization.CultureInfo.InvariantCulture);
}
catch (Exception)
{
// Parsing failed (unexpected output); fall through and return 0.
}
return duration;
}