access a zipped file without unzipping? - c#

My program produces a log of info every hour that the system is running, containing data such as access times, data transfers, and any faults/warnings experienced. Unfortunately these log files can be anywhere from 10,000 KB to 25,000 KB in size, so I've begun zipping them individually once they're at least 24 hours old; this way my system has only 24 unzipped log files at any one time.
The issue I need to resolve is that part of this software is a 'Diagnostics' window, where the user can load log files from a selected date range (based on each file's creation time) and view their contents in an easy-to-read format. I understand that in order for the zipped files to show up in that search there must be an exception allowing .zip files to be checked as well, but I cannot access any of the file's data to see whether a given .zip file falls into the date range.
My question is: is there a way for me to access a zipped file's information (and, by extension, its contents) without having to unzip the files, do the search, and re-zip them? It seems like too much work to unzip a hundred or more files if only 1 or 2 fall in the date range.

You should add a timestamp to the filename of each zipped file.
In general, when you zip a file you're putting the actual data of the file into a format that is unreadable. Most zipping algorithms (keep in mind that there are very many) work on a very bit-hacky level, which is why you really need to unzip the files to get your original data out. (There's no such thing as a free lunch.)
Luckily though, a file is not just a file! Because you're totally right, having to read a file to do things with it would be terrible! Imagine having to search a file system if you had to read each file to figure out where in the directory it was.
There are a number of ways to access the metadata associated with your file depending on what exact system you're on. For instance, on Unix-style machines the command ls -l will get you the last-edited information.
That said, log files usually have names that start with a timestamp for this exact reason. If you want to keep your filenames pretty though, going through the last-edited date is probably the way to go.
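In C#, for example, the last-edited check might look something like this rough sketch (the folder path and the date range are placeholders):

// Hedged sketch: filter zipped log files by their last-write time without
// opening them. The folder path and date range are placeholder values.
using System;
using System.IO;
using System.Linq;

class LogFileFinder
{
    static void Main()
    {
        DateTime rangeStart = new DateTime(2013, 1, 1);   // example range
        DateTime rangeEnd   = new DateTime(2013, 1, 7);

        var candidates = Directory.GetFiles(@"C:\Logs", "*.zip")
            .Select(p => new { Path = p, Written = File.GetLastWriteTime(p) })
            .Where(f => f.Written >= rangeStart && f.Written <= rangeEnd)
            .OrderBy(f => f.Written);

        foreach (var f in candidates)
            Console.WriteLine("{0}  ({1})", f.Path, f.Written);
    }
}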

A good zip library (e.g. SharpZipLib) ought to allow you to iterate over the files contained in the archive without extracting them. This will allow you to query the associated file dates. For example, using the aforementioned SharpZipLib, you would just need to inspect the DateTime property of the ZipEntry objects contained in the archive.
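For instance, something like the following sketch (the archive path is a placeholder, and exact member names can vary a little between SharpZipLib versions):

// Sketch using SharpZipLib (ICSharpCode.SharpZipLib.Zip): list the entries of
// an archive and read their stored dates without extracting any data.
using System;
using ICSharpCode.SharpZipLib.Zip;

class ZipInspector
{
    static void Main()
    {
        using (var zf = new ZipFile(@"C:\Logs\2013-01-01.zip"))   // placeholder path
        {
            foreach (ZipEntry entry in zf)
            {
                // DateTime holds the timestamp recorded in the archive for this entry.
                Console.WriteLine("{0}  {1}  {2} bytes",
                    entry.Name, entry.DateTime, entry.Size);
            }
        }
    }
}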

Related

Find files in a directory by Date

We are working with an external accounts program which saves "printed documents" to a network share. Each "printed document" in the directory consists of 3 files:
XML file - Contains information about who printed the document, etc
Datareport file - Contain the actual data for the report layout file
Layout file - The report layout
We have a customer who has 120,000 files in the directory.
Currently, when the user wants to see all the "printed documents", the software loops over all the files in the directory and reads each XML file to see if the report is for this user... this takes 10 minutes each time.
We are trying to create a faster solution.
The only idea I can think of is to loop over the files and put the contents (file name, XML details) into a database table, recording a "Last Scanned Date".
The next time I loop through the files I can dismiss any items dated earlier than the "Last Scanned Date", or use a LINQ query (borrowed from another post):
DateTime lastCreatedDate = Convert.ToDateTime(Properties.Settings.Default["LastDateTime"]);
var filePaths = Directory.GetFiles(@"\\Printed\Reports\", "*_*.xml")
    .Select(p => new { Path = p, Date = System.IO.File.GetLastWriteTime(p) })
    .Where(x => x.Date >= lastCreatedDate)
    .OrderBy(x => x.Date);
Is there a quicker solution?
You could set up a Windows service which detects additions to the folder and then updates the database with the new entries. Thereafter, any queries on documents printed would be at the cost of a database query only.
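A minimal sketch of that watcher idea is below; the database update is left as a placeholder, and IndexNewDocument is a made-up name:

// Hedged sketch: a FileSystemWatcher that reacts to new XML files in the share
// and records them. IndexNewDocument stands in for whatever database insert you use.
using System;
using System.IO;

class PrintedDocumentWatcher
{
    static void Main()
    {
        var watcher = new FileSystemWatcher(@"\\Printed\Reports", "*_*.xml")
        {
            EnableRaisingEvents = true
        };

        watcher.Created += (s, e) =>
        {
            // Parse the XML once, up front, and store who it belongs to.
            IndexNewDocument(e.FullPath);
        };

        Console.WriteLine("Watching... press Enter to stop.");
        Console.ReadLine();
    }

    static void IndexNewDocument(string path)
    {
        // Placeholder: insert (path, user, print date) into the database here.
        Console.WriteLine("New document: " + path);
    }
}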
Based on your use case, it looked like what you were asking for was to have a system where a user could ask for all of their printed documents. I didn't see where date was a part of the solution.
I can think of multiple quick solutions:
1. Have a subdirectory for each user. As new files enter the main directory, parse each file and copy it to the appropriate user subdirectories (this allows a file to be associated with multiple users). This has the benefit of limiting the number of files per directory.
2. Have a mapping (either through a DB, a flat XML file, or a flat XML file per user) which maps each file to its user, plus a listing of files that have already been processed so you don't reprocess them; update the mapping with each new file. A sketch of this follows the note below.
3. Research document management databases if a more robust solution is desired. If you want to be able to search on a lot of different types of meta-data, a document management database would be a good idea.
NOTE - For ideas 1 and 2, you could process the new files as part of a service, a scheduled task, or whenever a user makes a request for documents.
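Here is the sketch of idea 2. The class shape, the GetUserFromXml helper and the PrintedBy element name are all made up for illustration, and persisting the mapping (database, XML file) is left out:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

class DocumentIndex
{
    // user -> files printed by that user
    readonly Dictionary<string, List<string>> filesByUser =
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);

    // files already parsed on a previous pass, so they are never reprocessed
    readonly HashSet<string> processed = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    public void ProcessNewFiles(string folder)
    {
        foreach (var path in Directory.EnumerateFiles(folder, "*_*.xml"))
        {
            if (!processed.Add(path))
                continue;

            string user = GetUserFromXml(path);
            if (!filesByUser.TryGetValue(user, out var list))
                filesByUser[user] = list = new List<string>();
            list.Add(path);
        }
    }

    public IEnumerable<string> FilesFor(string user)
    {
        return filesByUser.TryGetValue(user, out var list)
            ? list
            : Enumerable.Empty<string>();
    }

    static string GetUserFromXml(string path)
    {
        // Hypothetical: assumes the XML contains a <PrintedBy> element.
        var element = XDocument.Load(path).Descendants("PrintedBy").FirstOrDefault();
        return element != null ? element.Value : "unknown";
    }
}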
Perhaps it's the parsing of the XML that is taking a long time? You could do a basic "grep" of all the files for the user name/id and then do the actual XML parsing on just the matching files.
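For example (the share path and user id are placeholders):

// Sketch of the "grep first" idea: scan each file as plain text for the user id
// and only pay the cost of full XML parsing on the files that match.
using System;
using System.IO;
using System.Linq;

class QuickFilter
{
    static void Main()
    {
        string userId = "jsmith";                      // placeholder id
        var matches = Directory.EnumerateFiles(@"\\Printed\Reports", "*_*.xml")
            .Where(p => File.ReadLines(p).Any(line => line.Contains(userId)));

        foreach (var path in matches)
        {
            // Only now parse the XML properly to confirm and extract details.
            Console.WriteLine(path);
        }
    }
}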

Can you pre-compress data files to be inserted into a zip file at a later time to improve performance?

As part of our installer build, we have to zip thousands of large data files into about ten or twenty 'packages' with a few hundred (or even thousands of) files in each which are all dependent on being kept with the other files in the package. (They are versioned together if you will.)
Then during the actual install, the user selects which packages they want included on their system. This also lets them download updates to the packages from our site as one large, versioned file rather than asking them to download thousands of individual ones which could also lead to them being out of sync with others in the same package.
Since these are data files, some of them change regularly during the design and coding stages, meaning we then have to re-compress all files in that particular zip package, even if only one file has changed. This makes the packaging step of our installer build take well over an hour each time, with most of that going to re-compressing things that we haven't touched.
We've looked into leaving the zip packages alone, then replacing specific files inside them, but inserting and removing large files from the middle of a zip doesn't give us that much of a performance boost. (A little, but not enough that it's worth it.)
I'm wondering if it's possible to pre-process files down into a cached raw 'compressed state' that matches how they would be written to the zip package, but only the data itself, not the zip header info, etc.
My thinking is if that is possible, during our build step, we would first look for any data file that doesn't have a compressed cache associated with it, and if not, we would compress that file and write the result to the cache.
Next we would simply append all of the caches together in a file stream, adding any appropriate zip header needed for the files.
This would mean we are still recreating the entire zip during each build, but we are only recompressing data that has changed. The rest would just be written as-is which is very fast since it is a straight write-to-disk. And if a data file changes, its cache is destroyed, so next build-pass it would be recreated.
However, I'm not sure such a thing is possible. Is it, and if so, is there any documentation to show how one would go about attempting this?
Yes, that's possible. The most straightforward approach would be to zip each file individually into its own associated zip archive with one entry. When any file is modified, you replace its associated zip file to keep all of those up to date. Then you can write a simple program to take a set of those single entry zip files and merge them into a single zip file. You will need to refer to the documentation in the PKZip appnote. Take a look at that.
Now that you've read the appnote, what you need to do is use the local header, data, and central header from each individual zip file, write the local header and data as is sequentially to the new zip file, and save the central header and the offsets of the local headers in the new file. Then at the end of the new file save the current offset, write a new central directory using the central headers you saved, updating the offsets appropriately, and ending with a new end of central directory record with the offset of the start of the central directory.
Update:
I decided this was a useful enough thing to write. You can get it here.
You could zip each file beforehand, and then "zip" them together with no compression at the end to quickly aggregate them into a distributable package. It won't be as efficient as compressing all the data at once, but it should be faster to make modifications.
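A sketch of that approach using System.IO.Compression, assuming each data file has already been compressed into its own single-entry .zip in a cache folder (the paths here are placeholders):

// Each data file has already been compressed into its own .zip; the outer
// package stores those inner zips without recompressing them.
using System.IO;
using System.IO.Compression;

class PackageBuilder
{
    static void Main()
    {
        using (var package = ZipFile.Open(@"C:\build\package1.zip", ZipArchiveMode.Create))
        {
            foreach (var cached in Directory.GetFiles(@"C:\build\cache", "*.zip"))
            {
                // Store the already-compressed file as-is; compressing it again
                // would waste time for almost no size gain.
                package.CreateEntryFromFile(cached,
                    Path.GetFileName(cached),
                    CompressionLevel.NoCompression);
            }
        }
    }
}

The trade-off is that the consumer extracts twice (the outer package, then each inner zip) in exchange for a fast packaging step.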
I cannot seem to locate an actual exe that implements this type of functionality. It appears that most existing tools I've tried that have the ability to merge/update will reprocess (compress) the data stream, as you have already stated you saw.
However it seems what you describe can be done if you or someone wants to write it. If you take a look at this link for the ZIP file format specification, you can get an overview of the structure you would have to parse out and process. It looks like you can pretty quickly go from file to file gathering up and discarding the files of interest, then merging in your new/updated files. You would still need to rebuild a new central directory (refer to section 4.3.6 of the above linked document) within your new destination archive.
After a little more digging, the DotNetZip Library forum has a message asking about the same type of functionality which also gives a description just like I described above. It also links to this document which seems to indicate that support for that may be added to the DotNetZip library for you to further experiment with.

packaging files to be read without need for extraction

In my current project I'm dealing with a huge number of files (tens of billions of files, each small, between 1 and 30 KB) as resources, and copying them for my customer is a time-consuming job. I'm searching for a packaging mechanism that can help me package each 1,000 or 10,000 of them into a single file, resulting in faster copying because I would then be dealing with far fewer files. Also, reading them from my application should not require any extraction, and there should be no compression while I'm writing or changing them (because of performance and the nature of the application, which is distributed, with resources being shared between clients). I have searched, and I know about the following ZIP libraries:
SharpZipLib
DotNetZip
System.IO.Packaging
But it seems the above libraries need to iterate through the files, at least, to access a file in the zip or package without extraction. I need to access the files via their address (folder structure hierarchy) in the zip or package file! The following links are similar questions which are answered by iterating through the zip file:
how-to-read-data-from-a-zip-file-without-having-to-unzip-the-entire-file
content-inside-zip-file
Does anyone have any idea or solution for this issue?
By the way, I'm coding in C# and the project is Windows Forms-based.
I would do my own package format, with GZipStream or something else. You compress each file with GZipStream, take the resulting bytes, and create a header in your package format which contains, for each file, its name, starting position and length. With this data in your header, which would probably be at the beginning of your package, you can look up the entry for the file you want, seek to the position of its compressed data, and read the byte array with the specified length.
But if you modify one file, you will need to recalculate the index entries of every file that comes after the modified one.
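A minimal sketch of such a format, purely as an illustration; the layout (a trailing pointer to an index of name/offset/length triples) is an assumption, not an existing format:

using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

class SimplePackage
{
    public static void Write(string packagePath, IEnumerable<string> files)
    {
        var index = new List<(string Name, long Offset, long Length)>();

        using (var output = File.Create(packagePath))
        using (var writer = new BinaryWriter(output))
        {
            foreach (var file in files)
            {
                long offset = output.Position;
                using (var gz = new GZipStream(output, CompressionMode.Compress, leaveOpen: true))
                using (var source = File.OpenRead(file))
                {
                    source.CopyTo(gz);
                }
                index.Add((Path.GetFileName(file), offset, output.Position - offset));
            }

            // Index of entries, then a trailing pointer so a reader can find it.
            long indexOffset = output.Position;
            writer.Write(index.Count);
            foreach (var (name, offset, length) in index)
            {
                writer.Write(name);
                writer.Write(offset);
                writer.Write(length);
            }
            writer.Write(indexOffset);
        }
    }
}

A reader would read the trailing pointer, load the index, then seek to an entry's offset and decompress just that slice with a GZipStream.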

What's the best structure to conserve file related information?

I am building an interface whose primary function would be to act as a file renaming tool (the underlying task here is to manually classify each file within a folder according to rules that describe their content). So far, I have implemented a customized file explorer and a preview window for the files.
I now have to find a way to inform a user if a file has already been renamed (this will show up in the file explorer's listView). The program should be able to read as well as modify that state as the files are renamed. I simply do not know what method is optimal to save this kind of information, as I am not fully used to C#'s potential yet. My initial solution involved text files, but again, I do not know if there should be only one text file for all files and folders or simply a text file per folder indicating the state of its contained items.
A colleague suggested that I use an Excel spreadsheet and then simply import the row or columns corresponding to my query. I tried to find more direct data structures, but again I would feel a lot more comfortable with some outside opinion.
So, what do you think would be the best way to store this kind of data?
PS: There are many thousands of files, all of them TIFF images, located on a remote server to which I have complete access.
I'm not sure what you're asking for, but if you simply want to keep some file information such as name, date, size, etc., you could use the FileInfo class. It is marked as serializable, so you could easily write an array of them to an XML file by invoking the Serialize method of an XmlSerializer.
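One caveat: XmlSerializer needs a public parameterless constructor, which FileInfo does not have, so in practice a small class of your own holding just the fields you need is simpler. A minimal sketch:

// A small serializable class holding the file details you care about
// (a plain class is used rather than FileInfo itself, since XmlSerializer
// requires a public parameterless constructor).
using System;
using System.IO;
using System.Xml.Serialization;

public class FileRecord
{
    public string Path { get; set; }
    public bool Renamed { get; set; }
    public DateTime LastWriteTime { get; set; }
}

class MetadataStore
{
    public static void Save(string storePath, FileRecord[] records)
    {
        var serializer = new XmlSerializer(typeof(FileRecord[]));
        using (var stream = File.Create(storePath))
            serializer.Serialize(stream, records);
    }

    public static FileRecord[] Load(string storePath)
    {
        var serializer = new XmlSerializer(typeof(FileRecord[]));
        using (var stream = File.OpenRead(storePath))
            return (FileRecord[])serializer.Deserialize(stream);
    }
}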
I am not sure I understand your question, but from what I gather you basically want to store the meta-data for each file. If this is the case, I can make two suggestions.
1. Store the meta-data in a simple XML file, one XML file per folder if you have multiple folders; the XML file could be a hidden file. Then your custom application can load the file, if it exists, when you navigate to the folder and present the data to the user.
2. If you are using NTFS, and you know this will always be the case, you can store the meta-data for the file in an alternate file stream. This is not a .NET stream, but an extra stream of data that is stored and moved around with each file without impacting the actual file's content. The nice thing about this is that no matter where you move the file, the meta-data moves with it, as long as it stays on NTFS.
Here is more info on the file streams
http://msdn.microsoft.com/en-us/library/aa364404(VS.85).aspx
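A sketch of the alternate data stream idea. The stream name is made up; note that the "file:streamname" syntax shown here is accepted by the standard File APIs on modern .NET (.NET Core and later), while on the classic .NET Framework these calls reject the colon and you would need to P/Invoke CreateFile instead. NTFS only:

using System;
using System.IO;

class AdsMetadata
{
    const string StreamName = "classification.xml";   // assumed stream name

    public static void Write(string filePath, string metadataXml)
    {
        // Writes the metadata into an alternate data stream attached to the file.
        File.WriteAllText(filePath + ":" + StreamName, metadataXml);
    }

    public static string Read(string filePath)
    {
        try
        {
            return File.ReadAllText(filePath + ":" + StreamName);
        }
        catch (FileNotFoundException)
        {
            return null;   // the file has no metadata stream yet
        }
    }
}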
You could create an object-oriented structure and then serialize the root object to a binary file or to an XML file. You could represent just about any structure this way, so you wouldn't have to struggle with the
I do not know if there should be only one text file for all files and folders or simply a text file per folder indicating the state of its contained items.
design issues. You would just have one file containing all of the metadata that you need to store. If you want speedier opening/saving and smaller size, go with binary; if you want something that other people could open, view, and potentially write their own software against, use XML.
There's lots of variations on how to do this, but to get you started here is one article from a quick Google:
http://www.codeproject.com/KB/cs/objserial.aspx

How can I undelete a file using C#?

I'm trying to find some lost .jpg pictures. Here's a .bat file to set up a simplified version of my situation:
md TestSetup
cd TestSetup
md a
cd a
echo "Can we find this later?" > a.abc
del a.abc
cd..
rd a
What code would be needed to open the text file again? I'm actually looking for .jpeg files that were treated in a similar manner.
More details: I'm trying to recover picture files from a previous one-touch backup where the directories and files have been deleted; everything was saved in the backup with a single-character name, and every file has the same 3-letter extension. There is a current backup, but they need to view the previously deleted ones (or at least the .jpg files).
Here's how I was trying to approach it: C# code
To the best of my knowledge, most file recovery tools actually read the low-level filesystem format on the disk and try to piece together deleted files. This works because, at least in FAT, a deleted file still resides in the sector specifying the directory (just with a different first character to identify it as "deleted"). New files may overwrite these deleted entries and therefore make the file unrecoverable. That's just a little bit of theory.
There is a current backup but they need to view the previous deleted ones (or at least the .jpg files).
Unless there's a backup for that file at the time that you want to restore from, I believe you're going to have a hard time getting that file without resorting to a low-level filesystem read. And even then, you may be out of luck if enough revisions have been made (or it's not a trivial filesystem like FAT).
