Packaging files to be read without the need for extraction - C#

In my current project I'm dealing with a huge number of files (tens of billions of small files, between 1 and 30 KB each) as resources, and copying them to my customer is a time-consuming job. I'm searching for a packaging mechanism that can bundle each 1,000 or 10,000 of them into a single file, which would speed up copying because I'd be dealing with far fewer files. Reading them from my application should not require any extraction, and there should be no compression while I'm writing or changing them (because of performance, and the nature of the application, which is distributed with resources shared between clients). I have searched, and I know about the following ZIP libraries:
SharpZipLib
DotNetZip
System.IO.Packaging
But it seems the above libraries need, at the very least, to iterate through the files to access a file in the zip or package without extraction. I need to access the files by their address (folder structure hierarchy) inside the zip or package file! The following links are similar questions, which are answered by iterating through the zip file:
how-to-read-data-from-a-zip-file-without-having-to-unzip-the-entire-file
content-inside-zip-file
Does anyone have any idea or solution for this issue?
By the way, I'm coding in C# and the project is Windows Forms-based.

I would roll my own package format, with GZipStream or something else. You compress each file with GZipStream to get its byte values, and you create a header in your package format which contains, for each file, its name, starting position, and length. The header would probably sit at the beginning of your package. With that data you can look up the entry for the file you want, seek to the position of its compressed data, and read the byte array with the specified length.
But if you modify one file, you will need to recalculate the indexes of everything after the modified file.
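Since the asker wants no compression on writes, here is a minimal sketch of such a package format that just stores the raw bytes (wrap each body in a GZipStream as described above if you do want compression). It's untested and all names are made up for illustration:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    // Layout: entry count, then an index of (name, offset, length) records,
    // then the raw file bodies. Entries can be read in place with one Seek.
    static class SimplePackage
    {
        public static void Create(string packagePath, string[] files)
        {
            using (var fs = File.Create(packagePath))
            using (var w = new BinaryWriter(fs, Encoding.UTF8))
            {
                w.Write(files.Length);

                // Reserve the index by writing it once with dummy offsets,
                // then come back and rewrite it with the real values.
                long indexPos = fs.Position;
                foreach (var f in files) { w.Write(f); w.Write(0L); w.Write(0L); }

                var spans = new List<(long Offset, long Length)>();
                foreach (var f in files)
                {
                    long start = fs.Position;
                    using (var src = File.OpenRead(f)) src.CopyTo(fs);
                    spans.Add((start, fs.Position - start));
                }

                fs.Position = indexPos;
                for (int i = 0; i < files.Length; i++)
                {
                    w.Write(files[i]);
                    w.Write(spans[i].Offset);
                    w.Write(spans[i].Length);
                }
            }
        }

        public static byte[] Read(string packagePath, string name)
        {
            using (var fs = File.OpenRead(packagePath))
            using (var r = new BinaryReader(fs, Encoding.UTF8))
            {
                int count = r.ReadInt32();
                for (int i = 0; i < count; i++)
                {
                    string entry = r.ReadString();
                    long offset = r.ReadInt64(), length = r.ReadInt64();
                    if (entry == name)
                    {
                        fs.Seek(offset, SeekOrigin.Begin);  // jump straight to the data
                        return r.ReadBytes((int)length);
                    }
                }
            }
            return null;  // not found
        }
    }

For large packages you would read the index once into a Dictionary instead of scanning it linearly on every lookup.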

Related

Can you pre-compress data files to be inserted into a zip file at a later time to improve performance?

As part of our installer build, we have to zip thousands of large data files into about ten or twenty 'packages' with a few hundred (or even thousands of) files in each which are all dependent on being kept with the other files in the package. (They are versioned together if you will.)
Then during the actual install, the user selects which packages they want included on their system. This also lets them download updates to the packages from our site as one large, versioned file rather than asking them to download thousands of individual ones which could also lead to them being out of sync with others in the same package.
Since these are data files, some of them change regularly during the design and coding stages, meaning we then have to re-compress all files in that particular zip package, even if only one file has changed. This makes the packaging step of our installer build take well over an hour each time, with most of that going to re-compressing things that we haven't touched.
We've looked into leaving the zip packages alone, then replacing specific files inside them, but inserting and removing large files from the middle of a zip doesn't give us that much of a performance boost. (A little, but not enough that it's worth it.)
I'm wondering if it's possible to pre-process files down into a cached raw 'compressed state' that matches how they would be written to the zip package, but only the data itself, not the zip header info, etc.
My thinking is if that is possible, during our build step, we would first look for any data file that doesn't have a compressed cache associated with it, and if not, we would compress that file and write the result to the cache.
Next we would simply append all of the caches together in a file stream, adding any appropriate zip header needed for the files.
This would mean we are still recreating the entire zip during each build, but we are only recompressing data that has changed. The rest would just be written as-is which is very fast since it is a straight write-to-disk. And if a data file changes, its cache is destroyed, so next build-pass it would be recreated.
However, I'm not sure such a thing is possible. Is it, and if so, is there any documentation to show how one would go about attempting this?
Yes, that's possible. The most straightforward approach would be to zip each file individually into its own associated zip archive with one entry. When any file is modified, you replace its associated zip file to keep all of those up to date. Then you can write a simple program to take a set of those single entry zip files and merge them into a single zip file. You will need to refer to the documentation in the PKZip appnote. Take a look at that.
Now that you've read the appnote, what you need to do is use the local header, data, and central header from each individual zip file, write the local header and data as is sequentially to the new zip file, and save the central header and the offsets of the local headers in the new file. Then at the end of the new file save the current offset, write a new central directory using the central headers you saved, updating the offsets appropriately, and ending with a new end of central directory record with the offset of the start of the central directory.
Update:
I decided this was a useful enough thing to write. You can get it here.
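For illustration, here is a rough C# sketch of the merge described above. It is untested and assumes well-behaved single-entry zips: local header at offset 0 (as ordinary tools produce), no Zip64, no archive comments, fewer than 65,535 entries. Check every offset against the appnote before trusting it:

    using System;
    using System.Collections.Generic;
    using System.IO;

    static class ZipMerger
    {
        public static void Merge(string[] singleEntryZips, string outputPath)
        {
            var centralHeaders = new List<byte[]>();
            using (var output = File.Create(outputPath))
            {
                foreach (var path in singleEntryZips)
                {
                    byte[] zip = File.ReadAllBytes(path);

                    // End-of-central-directory record (signature 0x06054b50),
                    // assumed to occupy the last 22 bytes (no archive comment).
                    int eocd = zip.Length - 22;
                    if (BitConverter.ToUInt32(zip, eocd) != 0x06054b50)
                        throw new InvalidDataException("no EOCD where expected: " + path);

                    uint cdSize = BitConverter.ToUInt32(zip, eocd + 12);
                    uint cdOffset = BitConverter.ToUInt32(zip, eocd + 16);

                    // Everything before the central directory is the local header
                    // plus compressed data; copy it verbatim to the new file.
                    long newLocalOffset = output.Position;
                    output.Write(zip, 0, (int)cdOffset);

                    // Keep the single central header, patching the 4-byte
                    // local-header offset stored at position 42 within it.
                    byte[] central = new byte[cdSize];
                    Array.Copy(zip, cdOffset, central, 0, cdSize);
                    BitConverter.GetBytes((uint)newLocalOffset).CopyTo(central, 42);
                    centralHeaders.Add(central);
                }

                // Rebuild the central directory, then write a fresh EOCD.
                long cdStart = output.Position;
                foreach (var header in centralHeaders)
                    output.Write(header, 0, header.Length);
                long cdLength = output.Position - cdStart;

                var w = new BinaryWriter(output);        // little-endian, as zip requires
                w.Write(0x06054b50u);                    // EOCD signature
                w.Write((ushort)0); w.Write((ushort)0);  // disk numbers
                w.Write((ushort)centralHeaders.Count);   // entries on this disk
                w.Write((ushort)centralHeaders.Count);   // total entries
                w.Write((uint)cdLength);                 // central directory size
                w.Write((uint)cdStart);                  // central directory offset
                w.Write((ushort)0);                      // comment length
                w.Flush();
            }
        }
    }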
You could zip each file beforehand, and then "zip" them together with no compression at the end to quickly aggregate them into a distributable package. It won't be as efficient as compressing all the data at once, but should be faster to make modifications.
I cannot seem to locate an actual exe that implements this type of functionality. Most of the existing tools I've tried that can merge/update will reprocess (compress) the data stream, as you have already observed.
However, it seems what you describe can be done if you or someone else wants to write it. If you take a look at this link to the ZIP file format specification, you can get an overview of the structure you would have to parse out and process. It looks like you can pretty quickly go from file to file, gathering up the files of interest and discarding the rest, then merging in your new/updated files. You would still need to rebuild a new central directory (refer to section 4.3.6 of the linked document) within your new destination archive.
After a little more digging, I found that the DotNetZip Library forum has a message asking about the same type of functionality, with a description much like the one above. It also links to this document, which seems to indicate that support for this may be added to the DotNetZip library for you to experiment with further.

Access a zipped file without unzipping?

My program produces a log of info every hour that the system is running, containing various data like access times, data transfers, and any faults/warnings experienced. Unfortunately these log files can be anywhere from 10,000 KB to 25,000 KB in size, so I've begun zipping them individually once they're at least 24 hours old; this way my system has only 24 unzipped log files at any one time.
The issue I need to resolve is that part of this software is a 'Diagnostics' window, where the user can load log files from a selected date range (based on each file's creation time) and view their contents in an easy-to-read format. I understand that for the files to show up in the search there must be an exception allowing .zip files to be checked as well, but I cannot access any of the files' data to see whether a given .zip falls into the date range.
My question is: is there a way for me to access a zipped file's information (and, by extension, its contents) without having to unzip the files, do the search, and re-zip the files? It seems like too much work to unzip a hundred or more files when only one or two fall in your date range.
You should add a timestamp to the filename of each zipped file.
In general, when you zip a file you're putting its actual data into an unreadable format. Most zipping algorithms (and keep in mind there are very many) work at a very bit-hacky level, which is why you really need to unzip the files to get your original data out. (There's no such thing as a free lunch.)
Luckily, though, a file is not just a file! You're totally right: having to read a file to do things with it would be terrible. Imagine having to search a file system if you had to read each file to figure out where it sat in the directory.
There are a number of ways to access the metadata associated with your file, depending on what exact system you're on. For instance, on unix-style machines the command ls -l will get you the last-edited information.
That said, log files usually have names that start with a timestamp for this exact reason. If you want to keep your filenames pretty, though, going through the last-edited date is probably the way to go.
A good zip library (e.g. SharpZipLib) ought to allow you to iterate over the files contained in the archive without extracting them. This will allow you to query the associated file dates. For example, using the aforementioned SharpZipLib, you would just need to inspect the DateTime property of the ZipEntry objects contained in the archive.
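For example (a minimal, untested sketch against a recent SharpZipLib, where ZipFile is IDisposable; it assumes one log file per archive, and note that ZipEntry.DateTime is the last-modified time recorded in the zip, not the original file's creation time):

    using System;
    using ICSharpCode.SharpZipLib.Zip;

    // Decide whether a zipped log falls in the range without extracting it.
    static bool ArchiveFallsInRange(string zipPath, DateTime from, DateTime to)
    {
        using (var zip = new ZipFile(zipPath))
        {
            foreach (ZipEntry entry in zip)
            {
                if (entry.IsFile && entry.DateTime >= from && entry.DateTime <= to)
                    return true;
            }
        }
        return false;
    }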

Is it possible to decompress a zip file while maintaining hierarchy using just .NET or some other built-in Windows API?

I have a zip file that contains folder hierarchies and files.
\images\
\images\1.jpg
\images\2.jpg
\something\something\a.exe
\something\something\b.exe
1.txt
I need to decompress the contents of this zip file to a location. I also need to preserve the structure of the zip file.
I've read about .NET's GZipStream and DeflateStream, but I am of the opinion that they are too "complicated" for my purpose.
I've also used DotNetZip and SharpZipLib in the past for personal projects but since this is work related and I'm working at a huge company, I would have a hard time convincing legal to use these libraries.
Question:
Is it possible to decompress a zip file while maintaining hierarchy using just .NET or some other built-in Windows API?
PS: I've also read this, but I think it's hacky because you'd need to produce another executable just to hide the progress dialog.
Thanks!
Check out whether Ionic Zip helps.
DotNetZip would do what you want, but I understand your concerns about legal approval.
On a side note, it might be good for you to navigate the legal jungle associated with getting an open-source library approved for use in the company, just to understand what's involved. But I'll leave that up to you.
Getting back to rolling your own...
DotNetZip is pretty full-featured, and it handles a number of scenarios you probably don't care about: Unicode filenames and comments, setting Windows timestamps and permissions of extracted files, getting timestamps of zip files created on old unix systems, split archives, encrypted archives, files over 2 GB, self-extracting archives, etc. Many zip files use none of those things.
DotNetZip also does eventing, zip updates, and zip creation; the code associated with those things is probably not of interest to you if you confine yourself to just the requirements you described in your question.
You could, though, grab the DotNetZip code and use it to help you roll your own solution. If you constrain yourself to JUST reading zip files and not dealing with all the possible special cases, the zip format is not difficult to parse.
Here's how to do it:
1. Open the zip file using new FileStream() or File.Open(). You want a FileStream object.
2. Read 4 bytes. Verify that they are the zip-entry-header signature (0x04034b50). In the file, the order you will find these bytes is 50 4b 03 04. If you find a match, you're in business.
3. Read the rest of the entry header:
   - at offset 14 is a 4-byte CRC. Get it. (Same byte ordering as above.)
   - at offset 18, the 4-byte length of the compressed blob. Get it. (N)
   - at offset 22, the 4-byte length of the UNcompressed blob. Get it. (U)
   - at 26, the 2-byte length of the filename. Get it. (L)
   - at 28, the 2-byte length of the "extra field". Get it. (E)
   - at offset 30 is the actual filename. Read L bytes for the filename and call System.Text.Encoding.ASCII.GetString(). The result will include a directory path, with the backslashes replaced by forward slashes (unix style). String.Replace() the slashes.
   - after the filename comes the extra field; seek E bytes to get past it. You can mostly ignore it. Beyond it is where the compressed data starts.
4. Open a System.IO.Compression.DeflateStream() on the zip FileStream, using CompressionMode.Decompress, with the FileStream's current offset as the input position. Open a new FileStream for output, with the file path you read in step 3. In a loop, call inflater.Read() and output.Write() to write the decompressed output of the DeflateStream to a filesystem file with the correct name. You will need to stop reading from the DeflateStream once you have read exactly U (uncompressed) bytes.
5. Check the uncompressed size (U) against the number of bytes you actually wrote out from the DeflateStream (after decompression). They should match.
6. If you are fancy, you can check the CRC of the output against what was in the header.
7. Go to step 2 to look for the next entry in the file.
The most complicated part is step 3. Working code for that is easily found in this source module; look for the ReadHeader method.
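If it helps, here is a rough sketch of those steps in code. It is untested and deliberately ignores the special cases mentioned earlier (no Zip64, no encryption, no data descriptors, ASCII names only):

    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Text;

    static class MiniUnzip
    {
        public static void Extract(string zipPath, string destDir)
        {
            using (var fs = File.OpenRead(zipPath))
            using (var r = new BinaryReader(fs))
            {
                while (fs.Position + 4 <= fs.Length &&
                       r.ReadUInt32() == 0x04034b50)       // step 2: local entry signature
                {
                    r.ReadUInt16();                        // version needed to extract
                    ushort flags = r.ReadUInt16();
                    ushort method = r.ReadUInt16();        // 0 = stored, 8 = deflate
                    r.ReadUInt32();                        // DOS mod time/date
                    uint crc = r.ReadUInt32();             // step 6 could verify this
                    uint compSize = r.ReadUInt32();        // N
                    uint uncompSize = r.ReadUInt32();      // U
                    ushort nameLen = r.ReadUInt16();       // L
                    ushort extraLen = r.ReadUInt16();      // E

                    if ((flags & 0x08) != 0)
                        throw new NotSupportedException("data descriptors not handled here");

                    string name = Encoding.ASCII.GetString(r.ReadBytes(nameLen));
                    fs.Seek(extraLen, SeekOrigin.Current); // skip the extra field

                    string outPath = Path.Combine(destDir, name.Replace('/', Path.DirectorySeparatorChar));
                    long dataStart = fs.Position;

                    if (name.EndsWith("/"))                // directory entry, no data
                    {
                        Directory.CreateDirectory(outPath);
                    }
                    else
                    {
                        Directory.CreateDirectory(Path.GetDirectoryName(outPath));
                        using (var output = File.Create(outPath))
                        {
                            if (method == 8)               // step 4: inflate exactly U bytes
                            {
                                using (var inflater = new DeflateStream(fs, CompressionMode.Decompress, leaveOpen: true))
                                {
                                    var buf = new byte[8192];
                                    long remaining = uncompSize;
                                    int n;
                                    while (remaining > 0 &&
                                           (n = inflater.Read(buf, 0, (int)Math.Min(buf.Length, remaining))) > 0)
                                    {
                                        output.Write(buf, 0, n);
                                        remaining -= n;
                                    }
                                }
                            }
                            else                           // stored: copy N bytes verbatim
                            {
                                output.Write(r.ReadBytes((int)compSize), 0, (int)compSize);
                            }
                        }
                    }
                    // DeflateStream reads ahead, so reposition to the next entry explicitly.
                    fs.Position = dataStart + compSize;
                }
            }
        }
    }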
Maybe the full feature set of GZipStream is a bit complicated, but note that the sample on the MSDN page is exactly what you need. I mean this MSDN page (the 4.0 version), not the one you linked in the question.
http://msdn.microsoft.com/en-us/library/system.io.compression.gzipstream.aspx#Y2750
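Roughly, that sample boils down to the following. Bear in mind that GZip compresses a single stream; unlike a .zip archive it stores no filenames or folder hierarchy, so it won't by itself satisfy the "maintain hierarchy" requirement:

    using System.IO;
    using System.IO.Compression;

    static void Compress(string inputPath, string gzPath)
    {
        using (var input = File.OpenRead(inputPath))
        using (var output = File.Create(gzPath))
        using (var gz = new GZipStream(output, CompressionMode.Compress))
            input.CopyTo(gz);   // Stream.CopyTo is new in .NET 4
    }

    static void Decompress(string gzPath, string outputPath)
    {
        using (var input = File.OpenRead(gzPath))
        using (var gz = new GZipStream(input, CompressionMode.Decompress))
        using (var output = File.Create(outputPath))
            gz.CopyTo(output);
    }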

C#: Archiving a File into Parts of 100MB

In my application, the user selects a big file (>100 mb) on their drive. I wish for the program to then take the file that was selected and chop it up into archived parts that are 100 mb or less. How can this be done? What libraries and file format should I use? Could you give me some sample code? After the first 100mb archived part is created, I am going to upload it to a server, then I will upload the next 100mb part, and so on until the upload is finished. After that, from another computer, I will download all these archived parts, and then I wish to connect them into the original file. Is this possible with the 7zip libraries, for example? Thanks!
UPDATE: From the first answer, I think I'm going to use SevenZipSharp, and I believe I understand now how to split a file into 100mb archived parts, but I still have two questions:
Is it possible to create the first 100mb archived part and upload it before creating the next 100mb part?
How do you extract a file with SevenZipSharp from multiple split archives?
UPDATE #2: I was just playing around with the 7-zip GUI, creating multi-volume/split archives, and I found that selecting the first one and extracting it will extract the whole file from all of the split archives. This leads me to believe that paths to the subsequent parts are included in the first one (or is it simply consecutive?). However, I'm not sure whether this works directly from the console, but I will try that now and see if it answers question #2 from the first update.
Take a look at SevenZipSharp; you can use it to create your split 7z files, upload them however you want, then extract them on the server side.
To split the archive, look at the SevenZipCompressor.CustomParameters member, passing in "v100m". (You can find more parameters in the 7-zip.chm file that ships with 7-Zip.)
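Something along these lines (untested; the 7z.dll path is machine-specific, and the exact key/value split for the "v100m" switch is my guess, so verify it against the SevenZipSharp documentation):

    using SevenZip;

    // Point SevenZipSharp at the native 7z.dll (path is an assumption).
    SevenZipBase.SetLibraryPath(@"C:\Program Files\7-Zip\7z.dll");

    var compressor = new SevenZipCompressor();
    compressor.ArchiveFormat = OutArchiveFormat.SevenZip;
    // Volume switch from the answer above: split into 100 MB parts.
    compressor.CustomParameters.Add("v", "100m");
    compressor.CompressFiles(@"C:\out\bigfile.7z", @"C:\data\bigfile.bin");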
You can split the data into 100MB "packets" first, and then pass each packet into the compressor in turn, pretending that they are just separate files.
However, this sort of compression is usually stream-based. As long as the library you are using does its I/O via a Stream-derived class, it would be pretty simple to implement your own Stream that "packetises" the data any way you like on the fly: as data is passed into your Write() method, you write it to a file; when you exceed 100 MB in that file, you simply close it, open a new one, and continue writing.
Either of these approaches would allow you to easily upload one "packet" while continuing to compress the next.
Edit:
Just to be clear: decompression is simply the reverse of the above sequence, so once you've got the compression code working, decompression will be easy.
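As a sketch of that packetising idea (hypothetical class, untested): a write-only Stream that rolls over to a new numbered part file every 100 MB, which you could hand to any compressor that writes to a Stream:

    using System;
    using System.IO;

    class SplittingWriteStream : Stream
    {
        private const long MaxPartSize = 100L * 1024 * 1024;  // 100 MB per part
        private readonly string _baseName;
        private FileStream _current;
        private int _partIndex;
        private long _writtenInPart;

        public SplittingWriteStream(string baseName)
        {
            _baseName = baseName;
            OpenNextPart();
        }

        private void OpenNextPart()
        {
            _current?.Dispose();
            _current = File.Create($"{_baseName}.part{++_partIndex:D3}");
            _writtenInPart = 0;
        }

        public override void Write(byte[] buffer, int offset, int count)
        {
            while (count > 0)
            {
                if (_writtenInPart == MaxPartSize) OpenNextPart();
                // Never let a write straddle the 100 MB boundary.
                int chunk = (int)Math.Min(count, MaxPartSize - _writtenInPart);
                _current.Write(buffer, offset, chunk);
                _writtenInPart += chunk;
                offset += chunk;
                count -= chunk;
            }
        }

        public override bool CanRead => false;
        public override bool CanSeek => false;
        public override bool CanWrite => true;
        public override long Length => throw new NotSupportedException();
        public override long Position
        {
            get => throw new NotSupportedException();
            set => throw new NotSupportedException();
        }
        public override void Flush() => _current.Flush();
        public override int Read(byte[] b, int o, int c) => throw new NotSupportedException();
        public override long Seek(long o, SeekOrigin so) => throw new NotSupportedException();
        public override void SetLength(long v) => throw new NotSupportedException();
        protected override void Dispose(bool disposing)
        {
            if (disposing) _current?.Dispose();
            base.Dispose(disposing);
        }
    }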

Best process for auto-zipping of multiple MP3s

I've got a project which requires a fairly complicated process, and I want to make sure I know the best way to do it. I'm using ASP.NET C# with Adobe Flex 3. The app server is Mosso (a cloud server) and the file storage server is Amazon S3. The existing site can be viewed at NoiseTrade.com
I need to do this:
1. Allow users to upload MP3 files to an album "widget"
2. After the user has uploaded their album/widget, automatically zip the MP3s (for other users to download) and upload the zip along with the MP3 tracks to Amazon S3
I actually have this working already (using client-side processing in Flex), but it no longer works because of Adobe's Flash 10 "security" update. So now I need to implement it server-side.
The way I am thinking of doing this is:
1. Store the MP3s in a temporary folder on the app server
2. When the artist "publishes", create a zip of the files in that folder using a C# library
3. Start the Amazon S3 upload process (zip and MP3s), email the user when it is finished, and delete the temporary folder
The major problem I see with this approach is that if a user deletes or adds a track later on, I'll have to update the zip file, but the temporary files will no longer exist.
I'm at a loss at the best way to do this and would appreciate any advice you might have.
Thanks!
The bit about updating the zip but not having the temporary files if the user adds or removes a track leads me to suspect that you want to build zips containing multiple tracks, possibly complete albums. If this is incorrect and you're just putting a single mp3 into each zip, then StingyJack is right and you'll probably end up making the file (slightly) larger rather than smaller by zipping it.
If my interpretation is correct, then you're in luck. Command-line zip tools frequently have flags which can be used to add files to or delete files from an existing zip archive. You have not stated which library or other method you're using to do the zipping, but I expect that it probably has this capability as well.
MP3's are compressed. Why bother zipping them?
I would say it is not necessary to zip an already-compressed file format; you are only going to get about a five percent reduction in file size, give or take. MP3s don't really zip up: by their nature they have already squeezed out most of the compressible data.
DotNetZip can zip up files from C#/ASP.NET. I concur with the prior posters regarding the compressibility of MP3s: DotNetZip will automatically skip compression on MP3s and just store the file, for exactly this reason. It may still be worthwhile to use a zip as a packaging/archive container, aside from the compression.
If you change the zip file later (the user adds a track), you could grab the .zip file from S3 and just update it. DotNetZip can update zip files, too. But in that case you would have to pay the transfer cost into and out of S3.
DotNetZip can do all of this with in-memory handling of the zips, though that may not be feasible for large archives with lots of MP3s and lots of concurrent users.
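For instance, the create-then-update flow could look roughly like this with DotNetZip (an untested sketch; paths and method names on your side are hypothetical):

    using Ionic.Zip;

    void PublishAlbum(string tempFolder, string zipPath)
    {
        using (var zip = new ZipFile())
        {
            zip.AddDirectory(tempFolder);   // all the MP3s; DotNetZip stores rather than recompresses them
            zip.Save(zipPath);              // then upload zipPath to S3
        }
    }

    void ReplaceTrack(string zipPath, string newTrackPath)
    {
        using (var zip = ZipFile.Read(zipPath))  // e.g. after pulling the zip back from S3
        {
            zip.UpdateFile(newTrackPath);        // adds the entry, or replaces it if it exists
            zip.Save();
        }
    }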
