Unzip file while reading it - c#

I have hundreds of CSV files zipped. This is great because they take very little space but when it is time to use them, I have to make some space on my HD and unzip them before I can process. I was wondering if it is possible with .NET to unzip a file while reading it. In other words, I would like to open a zip file, start to decompress the file and as we go, process the file.
So there would be no need for extra space on my drive. Any ideas or suggestions?

Yes. Zip is a streamed format which means that you can use the data as you decompress it rather than having to decompress everything first.
With .NET's System.IO.Compression classes you can apply the same kinds of compression used in zip files (Deflate and GZip) to any stream you like, but if you want to work with actual zip-format files you'll need a third-party library like this one (sharpziplib).
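For readers on newer frameworks, here is a minimal sketch of the "process while decompressing" idea. It assumes .NET 4.5 or later, where System.IO.Compression.ZipArchive is available (the answers here predate it); the class and method names are made up for illustration. Each CSV entry is read through a stream, so nothing is extracted to disk.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, not part of any library.
public static class ZipCsvReader
{
    // Lazily yields every line of every .csv entry in the archive.
    // Entries are decompressed as they are read; no temporary files are created.
    public static IEnumerable<string> ReadCsvLines(string zipPath)
    {
        using (FileStream file = File.OpenRead(zipPath))
        using (var archive = new ZipArchive(file, ZipArchiveMode.Read))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                if (!entry.FullName.EndsWith(".csv", StringComparison.OrdinalIgnoreCase))
                    continue;

                using (var reader = new StreamReader(entry.Open()))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                        yield return line;
                }
            }
        }
    }
}
```

Because the method is an iterator, the archive stays open only while the caller is enumerating, and memory use stays roughly constant regardless of the size of the CSV.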

A better solution might be to keep the files decompressed on the drive, but turn on compression on the file system level. This way you'll just be reading CSV files, and the OS will take care of making sure it doesn't take too much space.
Anyhoo, to answer your question, maybe the GZipStream class can help you.
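If the files can be stored as .gz rather than .zip, a sketch along these lines should work with the built-in GZipStream. It assumes one CSV per .gz file; the helper name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, not part of any library.
public static class GzipLineReader
{
    // Yields the lines of a gzip-compressed text file as they are decompressed.
    public static IEnumerable<string> ReadLines(string gzPath)
    {
        using (FileStream file = File.OpenRead(gzPath))
        using (var gzip = new GZipStream(file, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }
}
```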

sharpziplib allows for stream-based decompression - see this related question - the item provides similar stream-based Read methods, so you can process each item like you would with any stream.

I'm not sure about zip files, but you could use GZ format with GZipStream (works like any other input stream). Unfortunately, the entire System.IO.Compression namespace is only 2 classes (the other does DEFLATE).
EDIT: There's a class called ZipPackage. I'm not sure if it will let you do streaming decompression, but it might be worth looking into.
Also, take a look at #ziplib.

Related

Unpacking tar/BZ2 files using C#

I have a tar.bz2 file and I want to extract it to a directory. In the examples I only see option of compress or decompress however I want actually to extract or unpack.
Also tried ICSharpCode.SharpZipLib.BZip2 but I didn't find an option to unpack.
While you use a ZipInputStream for .zip files, you should use a BZip2InputStream for .bz2 files (and GZipInputStream for .gz files etc.).
Taken from:
How to decompress .bz2 file in C#?
Decompressing and unpacking are two different operations. A foo.tar.bz2 file is actually a foo.tar file which was then compressed using bz2.
So to get the individual files you have to reverse those steps: first decompress the archive (which you managed to do with sharpziplib), then untar the result of that decompression (which can also be done with sharpziplib). See the docs for details.

Stream dynamic zip files with resume support

Suppose I have a list of MP3 files on my server, and I want a user to be able to download several of them at once (chosen by whatever means). For this, I want to create the zip file dynamically while writing it to the output stream, using the DotNetZip or Ionic Zip libraries.
However, that is not a perfect solution if the zip file gets large, because in that scenario the server doesn't support resumable downloads. To overcome this, I need to handle the zip file structure internally and provide the resume support myself.
So, is there any open-source library I can use to stream resumable, dynamically created zip files directly to the output stream? Or, if possible, I would be happy if someone could explain the structure of a zip file, especially the header content and the data content.
Once a download has started, you should not alter the ZIP file anymore. Because then a resume will just result in a broken ZIP file. So make sure your dynamically created ZIP file stays available!
The issue of providing resume-functionality was solved in this article for .NET 1.1, and it is still valid and functional.
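The linked article is not reproduced here, but resumable downloads over HTTP generally rest on the Range request header: the client asks for "bytes=START-" and the server answers 206 Partial Content with that slice of the (unchanged) file. The sketch below only covers the parsing step for a single range; the class name is hypothetical, and multi-range requests are deliberately rejected.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not taken from the linked article.
public static class RangeHeader
{
    // Parses a single-range header such as "bytes=100-", "bytes=100-199" or
    // "bytes=-500" against a resource of totalLength bytes. Returns false when
    // the header is absent, malformed, multi-range, or not satisfiable.
    public static bool TryParse(string header, long totalLength, out long start, out long end)
    {
        start = 0;
        end = totalLength - 1;
        const string prefix = "bytes=";
        if (string.IsNullOrEmpty(header) ||
            !header.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            return false;

        string spec = header.Substring(prefix.Length).Trim();
        int dash = spec.IndexOf('-');
        if (dash < 0 || spec.Contains(",")) return false;

        string first = spec.Substring(0, dash);
        string last = spec.Substring(dash + 1);

        if (first.Length == 0)
        {
            // Suffix form: the final N bytes of the resource.
            long suffix;
            if (!long.TryParse(last, NumberStyles.None, CultureInfo.InvariantCulture, out suffix) || suffix == 0)
                return false;
            start = Math.Max(0, totalLength - suffix);
            return totalLength > 0;
        }

        if (!long.TryParse(first, NumberStyles.None, CultureInfo.InvariantCulture, out start))
            return false;
        if (last.Length > 0)
        {
            long parsedEnd;
            if (!long.TryParse(last, NumberStyles.None, CultureInfo.InvariantCulture, out parsedEnd))
                return false;
            end = Math.Min(parsedEnd, totalLength - 1);
        }
        return start <= end && start < totalLength;
    }
}
```

A handler would then seek to start in the stored zip file, send bytes up to and including end, and set the Content-Range header accordingly. This only works if the zip is byte-for-byte identical on every request, which is why the file must not change once a download has started.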

Is it possible to decompress a zip file while maintaining hierarchy using just .NET or some other built-in Windows API?

I have a zip file that contains folder hierarchies and files.
\images\
\images\1.jpg
\images\2.jpg
\something\something\a.exe
\something\something\b.exe
1.txt
I need to decompress the contents of this zip file to a location. I also need to preserve the structure of the zip file.
I've read about .NET's GZipStream and DeflateStream but I am of the opinion that it is too "complicated" for my purpose.
I've also used DotNetZip and SharpZipLib in the past for personal projects but since this is work related and I'm working at a huge company, I would have a hard time convincing legal to use these libraries.
Question:
Is it possible to decompress a zip file while maintaining hierarchy using just .NET or some other built-in Windows API?
PS: I've also read this but I think it's hacky because you'll need to produce another executable just to hide the progress dialog.
Thanks!
Check out if Ionic Zip helps?
DotNetZip would do what you want, but I understand your concerns about legal approval.
On a side note, It might be good for you to navigate the legal jungle associated with getting an open-source library approved for use in the company, just to understand what's involved. But I'll leave that up to you.
Getting back to rolling your own...
DotNetZip is pretty full featured, and it handles a number of scenarios you probably don't care about. Like Unicode filenames and comments, setting windows timestamps and permissions of extracted files, getting timestamps of zip files created on old unix systems, split archives, Encrypted archives, files over 2gb, or self-extracting archives, etc etc etc. Many zip files use none of those things.
Also DotNetZip does eventing and zip updates and zip creation - all the code associated with these things is probably not of interest to you, if you confine yourself just to the requirements you described in your question.
You could, though, grab the DotNetZip code and use it to help you roll your own solution. If you constrain yourself to JUST reading zip files and not dealing with all the possible special cases, the zip format is not difficult to parse.
Here's how to do it:
1. Open the zip file using new FileStream() or File.Open. You want a FileStream object.
2. Read 4 bytes. Verify that it is the zip-entry-header descriptor (0x04034b50). In the file, the order in which you will find these bytes is 50 4b 03 04. If you find a match, you're in business.
3. Read the rest of the entry header. Offsets are measured from the start of the header:
   - At offset 14 is a 4-byte CRC. Get it. (Same byte ordering as above.)
   - At offset 18 is the 4-byte length of the compressed blob. Get it. (N)
   - At offset 22 is the 4-byte length of the UNcompressed blob. Get it. (U)
   - At offset 26 is the 2-byte length of the filename. Get it. (L)
   - At offset 28 is the 2-byte length of the "extra field". Get it. (E)
   - At offset 30, right after the extra-field length, is the actual filename. Read L bytes for the filename, and call System.Text.Encoding.ASCII.GetString(). The result will include a directory path, with the backslashes replaced with slashes (unix style). String.Replace() the slashes.
   - After the filename comes the extra field. Seek E bytes to get beyond it; you can mostly ignore it. This is where the compressed data starts.
4. Open a System.IO.Compression.DeflateStream on the zip FileStream, using CompressionMode.Decompress, reading from the FileStream's current offset. Open a new FileStream, for output, with the file path you read in step 3. In a loop, call inflater.Read() and output.Write() to write the decompressed output of the DeflateStream to a filesystem file with the correct name. You will need to stop reading from the DeflateStream when you have read exactly U (uncompressed) bytes.
5. Check the uncompressed size (U) against the amount of data you actually wrote out from the DeflateStream (after decompression). They should match.
6. If you are fancy, you can check the CRC of the output against what was in the header.
7. Go to step 2 to look for the next entry in the file.
The most complicated part is step 3. Working code for that is easily found in this source module; look for the ReadHeader method.
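The steps above can be sketched roughly as follows. This is an illustration under the same simplifying assumptions as the answer, not DotNetZip's code: ASCII names, no encryption, no Zip64, no CRC check, and sizes present in the local header (archives that set general-purpose bit 3 and store sizes in a trailing data descriptor are rejected). The class name is hypothetical. Two details go beyond the prose: DeflateStream may read past the end of the compressed blob, so the file position is reset to data start + N before looking for the next header, and entry names are checked so they cannot escape the target directory.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical minimal extractor; handles only "stored" and "deflate" entries.
public static class MiniUnzip
{
    const uint LocalHeaderSignature = 0x04034b50;

    public static void ExtractAll(string zipPath, string targetDir)
    {
        string root = Path.GetFullPath(targetDir).TrimEnd(Path.DirectorySeparatorChar)
                      + Path.DirectorySeparatorChar;

        using (FileStream zip = File.OpenRead(zipPath))
        using (var reader = new BinaryReader(zip))
        {
            // The loop ends at the first non-entry signature (the central directory).
            while (zip.Length - zip.Position >= 30 && reader.ReadUInt32() == LocalHeaderSignature)
            {
                reader.ReadUInt16();                          // offset 4: version needed
                ushort flags = reader.ReadUInt16();           // offset 6: general-purpose flags
                ushort method = reader.ReadUInt16();          // offset 8: 0 = stored, 8 = deflate
                reader.ReadUInt32();                          // offset 10: DOS time and date
                reader.ReadUInt32();                          // offset 14: CRC-32 (not verified here)
                uint compressedSize = reader.ReadUInt32();    // offset 18: N
                uint uncompressedSize = reader.ReadUInt32();  // offset 22: U
                ushort nameLength = reader.ReadUInt16();      // offset 26: L
                ushort extraLength = reader.ReadUInt16();     // offset 28: E

                if ((flags & 0x0008) != 0)
                    throw new NotSupportedException("Sizes are stored in a data descriptor.");

                string name = Encoding.ASCII.GetString(reader.ReadBytes(nameLength));
                zip.Seek(extraLength, SeekOrigin.Current);    // skip the extra field
                long dataStart = zip.Position;

                string path = Path.GetFullPath(Path.Combine(root, name));
                if (!path.StartsWith(root, StringComparison.Ordinal))
                    throw new InvalidDataException("Entry escapes the target directory: " + name);

                if (name.EndsWith("/", StringComparison.Ordinal))
                {
                    Directory.CreateDirectory(path);          // directory entry
                }
                else
                {
                    Directory.CreateDirectory(Path.GetDirectoryName(path));
                    using (FileStream output = File.Create(path))
                    {
                        if (method == 8)
                        {
                            using (var inflater = new DeflateStream(zip, CompressionMode.Decompress, leaveOpen: true))
                                CopyBytes(inflater, output, uncompressedSize);
                        }
                        else if (method == 0)
                        {
                            CopyBytes(zip, output, uncompressedSize);
                        }
                        else
                        {
                            throw new NotSupportedException("Compression method " + method);
                        }
                    }
                }

                // DeflateStream buffers ahead, so jump explicitly to the next header.
                zip.Position = dataStart + compressedSize;
            }
        }
    }

    // Copies exactly count bytes, failing if the source ends early.
    static void CopyBytes(Stream source, Stream destination, long count)
    {
        var buffer = new byte[81920];
        while (count > 0)
        {
            int read = source.Read(buffer, 0, (int)Math.Min(buffer.Length, count));
            if (read == 0) throw new EndOfStreamException("Entry data ended early.");
            destination.Write(buffer, 0, read);
            count -= read;
        }
    }
}
```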
Maybe the full feature set of GZipStream is a bit complicated, but note that the sample on the MSDN page is exactly what you need. I mean this MSDN page (the 4.0 version), not the one you supplied in the question.
http://msdn.microsoft.com/en-us/library/system.io.compression.gzipstream.aspx#Y2750

C#: Archiving a File into Parts of 100MB

In my application, the user selects a big file (>100 mb) on their drive. I wish for the program to then take the file that was selected and chop it up into archived parts that are 100 mb or less. How can this be done? What libraries and file format should I use? Could you give me some sample code? After the first 100mb archived part is created, I am going to upload it to a server, then I will upload the next 100mb part, and so on until the upload is finished. After that, from another computer, I will download all these archived parts, and then I wish to connect them into the original file. Is this possible with the 7zip libraries, for example? Thanks!
UPDATE: From the first answer, I think I'm going to use SevenZipSharp, and I believe I understand now how to split a file into 100mb archived parts, but I still have two questions:
Is it possible to create the first 100mb archived part and upload it before creating the next 100mb part?
How do you extract a file with SevenZipSharp from multiple split archives?
UPDATE #2: I was just playing around with the 7-zip GUI and creating multi-volume/split archives, and I found that selecting the first one and extracting from it will extract the whole file from all of the split archives. This leads me to believe that paths to the subsequent parts are included in the first one (or is it consecutive?). However, I'm not sure if this would work directly from the console, but I will try that now, and see if it solves question #2 from the first update.
Take a look at SevenZipSharp. You can use it to create your split 7z files, upload them however you want, then extract them on the server side.
To split the archive look at the SevenZipCompressor.CustomParameters member, passing in "v100m". (you can find more parameters in the 7-zip.chm file from 7zip)
You can split the data into 100MB "packets" first, and then pass each packet into the compressor in turn, pretending that they are just separate files.
However, this sort of compression is usually stream-based. As long as the library you are using will do its I/O via a Stream-derived class, it would be pretty simple to implement your own Stream that "packetises" the data any way you like on the fly - as data is passed into your Write() method you write it to a file. When you exceed 100MB in that file, you simply close that file and open a new one, and continue writing.
Either of these approaches would allow you to easily upload one "packet" while continuing to compress the next.
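The "packetising" Stream idea can be sketched as below. This is a hypothetical class, not part of SevenZipSharp or any other library; the part naming (basePath.001, basePath.002, ...) is an assumption chosen for the example. It is write-only and rolls over to a new file whenever the current one reaches the size limit.

```csharp
using System;
using System.IO;

// Hypothetical write-only stream that spreads its bytes over numbered part
// files (basePath.001, basePath.002, ...) of at most partSize bytes each.
public sealed class SplitFileStream : Stream
{
    private readonly string _basePath;
    private readonly long _partSize;
    private FileStream _current;
    private long _writtenInPart;
    private int _partNumber;

    public SplitFileStream(string basePath, long partSize)
    {
        if (partSize <= 0) throw new ArgumentOutOfRangeException(nameof(partSize));
        _basePath = basePath;
        _partSize = partSize;
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        while (count > 0)
        {
            if (_current == null || _writtenInPart == _partSize)
                OpenNextPart();

            // Write only as much as still fits into the current part.
            int chunk = (int)Math.Min(count, _partSize - _writtenInPart);
            _current.Write(buffer, offset, chunk);
            _writtenInPart += chunk;
            offset += chunk;
            count -= chunk;
        }
    }

    private void OpenNextPart()
    {
        // Once disposed, the finished part is complete and could be uploaded.
        if (_current != null) _current.Dispose();
        _partNumber++;
        _current = File.Create(_basePath + "." + _partNumber.ToString("D3"));
        _writtenInPart = 0;
    }

    public override void Flush() { if (_current != null) _current.Flush(); }

    protected override void Dispose(bool disposing)
    {
        if (disposing && _current != null) { _current.Dispose(); _current = null; }
        base.Dispose(disposing);
    }

    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
}
```

Any compressor that writes sequentially to a Stream can be pointed at it, for example new GZipStream(new SplitFileStream(path, 100L * 1024 * 1024), CompressionMode.Compress). On the receiving side, concatenating the parts in order restores the original compressed stream. A compressor that needs to seek in its output will not work with this sketch.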
edit
Just to be clear - Decompression is just the reverse sequence of the above, so once you've got the compression code working, decompression will be easy.

Compress a file with GZipStream while maintaining its meta-data

How can I get the extension of compressed file after being compressed with System.IO.Compression.GZipStream?
For example, if the original file is named test.doc and compresses to test.gz, how do I know what file extension to use when decompressing?
There is no way to get the file name. In fact, there may never be a filename at all, for example if a piece of data is created in memory and then sent over a network connection.
Instead of replacing the file extension, why not append it, for example: test.doc.gz
Then you can simply strip it off when decompressing.
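A minimal sketch of that naming convention, using only GZipStream. The helper names are hypothetical, and Stream.CopyTo assumes .NET 4.0 or later.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper illustrating the "append .gz, strip .gz" convention.
public static class GzipNaming
{
    // Compresses e.g. "test.doc" to "test.doc.gz" and returns the new path.
    public static string Compress(string sourcePath)
    {
        string target = sourcePath + ".gz";
        using (FileStream input = File.OpenRead(sourcePath))
        using (FileStream output = File.Create(target))
        using (var gzip = new GZipStream(output, CompressionMode.Compress))
        {
            input.CopyTo(gzip);
        }
        return target;
    }

    // Decompresses e.g. "test.doc.gz" back to "test.doc" and returns that path.
    public static string Decompress(string gzPath)
    {
        if (!gzPath.EndsWith(".gz", StringComparison.OrdinalIgnoreCase))
            throw new ArgumentException("Expected a path ending in .gz", nameof(gzPath));

        // Stripping the suffix recovers the original name, extension included.
        string target = gzPath.Substring(0, gzPath.Length - 3);
        using (FileStream input = File.OpenRead(gzPath))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (FileStream output = File.Create(target))
        {
            gzip.CopyTo(output);
        }
        return target;
    }
}
```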
I had to do this some time ago. The solution is to use the J# libraries to do it. You still write it in C# however.
http://msdn.microsoft.com/en-us/magazine/cc164129.aspx
That's Microsoft's answer on the topic.
Not sure what is your question- I assume you want a mechanism to "remember" what the extension was before the compression took place?
If that is the question then the convention of test.doc compressing into test.doc.gz will work.
The test.gz is just a raw byte stream with no meta-data about what has been compressed (for example, original file name, extension etc). What you'd need to do is create an archive that contains the gzip stream and meta-data about each file contained in the archive.
The article linked to in Mech Software's answer provides a pretty reasonable way to implement this.
There was also this question (vaguely related) asked some time back which may help out:
How to compress a directory with the built in .net compression classes?
