Edit a gzipped (.gz) log text file in C#

I receive logs that are gzip-compressed (.gz) and contain text files.
One field, present on every row of the logs, needs to be edited to contain different data than what is currently there.
My thinking so far is to:
Unzip file
Read it
Edit it
Write it
Rezip it
But I guess there is a better way to do this. Is there any on-the-fly reading/editing of .gz log files available in C#?
Thanks!

Unless you want to work at the bit level, the method you suggest is the correct approach.
In case you are unfamiliar with the .NET libraries for this, here is a Code Project article:
http://www.codeproject.com/KB/files/GZipStream.aspx
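The unzip/read/edit/write/rezip steps from the question do not need a temporary uncompressed file: they can be chained as streams so that each row is decompressed, edited and recompressed as it passes through. The following is a minimal sketch using GZipStream from System.IO.Compression. The class name GzLogEditor, the ReplaceField helper, the separator and the field index are all hypothetical and only stand in for whatever the real log layout is. It assumes UTF-8 text and writes "\n" line endings, so the output must go to a different path than the input.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzLogEditor
{
    // Streams inputPath (.gz) into outputPath (.gz), passing every row through editLine.
    // Only one line is held in memory at a time. inputPath and outputPath must differ.
    public static void RewriteLines(string inputPath, string outputPath, Func<string, string> editLine)
    {
        using (var input = new GZipStream(File.OpenRead(inputPath), CompressionMode.Decompress))
        using (var reader = new StreamReader(input, Encoding.UTF8))
        using (var output = new GZipStream(File.Create(outputPath), CompressionLevel.Optimal))
        using (var writer = new StreamWriter(output, new UTF8Encoding(false)))
        {
            writer.NewLine = "\n"; // assumption: Unix line endings are acceptable in the output
            string line;
            while ((line = reader.ReadLine()) != null)
                writer.WriteLine(editLine(line));
        }
    }

    // Hypothetical helper: replaces the field at position index in a separator-delimited row.
    // Rows with too few fields are returned unchanged.
    public static string ReplaceField(string line, char separator, int index, string newValue)
    {
        var fields = line.Split(separator);
        if (index < fields.Length) fields[index] = newValue;
        return string.Join(separator.ToString(), fields);
    }
}
```

A call could look like GzLogEditor.RewriteLines("in.log.gz", "out.log.gz", l => GzLogEditor.ReplaceField(l, '\t', 1, "new value")), after which the output file can replace the original.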

Related

Uniquely Identify a file across different machines in a web application [closed]

I need to be able to uniquely identify a file.
The file would be uploaded to a web application and could be uploaded from anywhere. The file could be mailed to someone, renamed, edited and then uploaded from a different machine altogether, and I need a way of identifying that this was the file that was originally uploaded. I need a reliable way of identifying a file across different file systems.
I cannot use the filename as the identifier as the file could be renamed. I still need to be able to uniquely identify the file even though it was renamed.
I cannot use a hash of the file, as the hash would change if the file was edited.
I understand Linux has an inode number property and Windows has the IndexNumber. I can use NtQueryInformationFile to get the IndexNumber. The IndexNumber stayed the same when the file was edited and when it was renamed, but it was different when the file was moved from one folder to another.
From all the reading I have done, it seems like the IndexNumber is not reliable for all documents. I almost have a feeling that there is no unique identifier for a file that would be constant across different folders and machines and that would remain unchanged when the file is edited, renamed, etc. But here I am, Stack Overflow. Any help is appreciated.
Edit: Here is the business problem I am trying to solve. A user uploads a file to our web application, then inputs a bunch of metadata for the file, similar to adding tags to the file. We keep the file in blob storage, but the user still has his local copy, which he mails to another user. He may edit and rename the file before mailing it. When the other user uploads the file to our web application, is there any way we can identify that this was the original file, so as to pre-populate the metadata that the original user had entered?

The simple answer to your question is that it can't be done.
Let me summarize what you're trying to accomplish.
If I have a file on my office computer, and upload that to your web application, you want to store that into your system as a new file. Then, if I copy the file from my office computer to my home computer, edit the file contents, rename the file, and then upload it into your web application, you want to identify that this is the same file as the one I previously uploaded.
It can't be done.
Not with a 100% guarantee that you can identify this.
When you are uploading files to a web application, what is sent is this:
The name of the file
The length of the file
The contents of the file
Things such as alternate data streams (NTFS), from the other answer here, or inode or similar identifiers, from the comments, are not sent. Your web application will not see them. Nor would these things be "across multiple computers".
So bottom line, this is impossible.
Your options are:
Let the user uniquely pick the file they want to overwrite, meaning that the user could pick unrelated files and thus be "wrong"
Work out a reasonable chance that you identified the right file, accepting the chance that you identified incorrectly
Embed a unique id into the file itself, however since the file contents can be edited (and the id can be changed) this is not guaranteed
... other options that don't have a 100% guarantee of being right
The first option is of course the easiest.
The second option could use an approach similar to what git does when it tracks renames, but even this will fail depending on how much the file was edited between the uploads. Git fails in this respect too, except that "failure" there simply means it doesn't show you the full history of a file; it doesn't break down and become unusable.
The third option might work if the file should be edited by a program similar to Word or Excel or Photoshop, etc. You could embed the ID and just make sure that program doesn't change it. It would probably have a higher and acceptable chance of being right, but it might still be possible to edit.
So you will have to decide what would be acceptable to you, but you cannot create a system in which you are guaranteed to identify the file, even if it was renamed and the contents changed. Because at that point you have no guarantee that the user is simply trying to upload a different file altogether.
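To make the second option more concrete, here is a rough sketch of a similarity score between an uploaded file and a previously stored one. It is not git's actual rename-detection algorithm, only a simple stand-in: the Jaccard similarity of the sets of distinct lines, which assumes text content (binary formats would need a different comparison). The class name FileSimilarity, the BestMatch helper and the threshold parameter are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FileSimilarity
{
    // Jaccard similarity of the sets of distinct non-empty lines:
    // (lines in both) / (lines in either). 1.0 means identical sets, 0.0 means nothing shared.
    public static double LineJaccard(string a, string b)
    {
        var setA = new HashSet<string>(SplitLines(a));
        var setB = new HashSet<string>(SplitLines(b));
        if (setA.Count == 0 && setB.Count == 0) return 1.0;
        int common = setA.Count(setB.Contains);
        return (double)common / (setA.Count + setB.Count - common);
    }

    // Returns the key of the most similar known file, or null if none reaches the threshold.
    // The result is a guess, never a guarantee.
    public static string BestMatch(string uploaded, IDictionary<string, string> knownFiles, double threshold)
    {
        string bestKey = null;
        double bestScore = threshold;
        foreach (var pair in knownFiles)
        {
            double score = LineJaccard(uploaded, pair.Value);
            if (score >= bestScore) { bestScore = score; bestKey = pair.Key; }
        }
        return bestKey;
    }

    private static IEnumerable<string> SplitLines(string text)
    {
        return text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Whatever threshold is chosen, a heavily edited file will fall below it and an unrelated but similar file may rise above it, which is exactly the trade-off described above.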

On Windows with the NTFS file system, you could use alternate data streams (NTFS streams):
http://ntfs.com/ntfs-multiple.htm
stream: A sequence of bytes written to a file on the target file system. Every file stored on a volume that uses the file system contains at least one stream, which is normally used to store the primary contents of the file. Additional streams within the file can be used to store file attributes, application parameters, or other information specific to that file. Every file has a default data stream, which is unnamed by default. That data stream, and any other data stream associated with a file, can optionally be named.
https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-fscc/8ac44452-328c-4d7b-a784-d72afd19bd9f
There is not a lot of official documentation, however. But you can inject a GUID there to be able to track the file.
One limitation of this solution is that it only works on the NTFS file system; when the file is copied to, for example, a FAT file system, the information is lost.
You need to access native Win32 APIs, however. Check for example this SO answer:
https://stackoverflow.com/a/605167/4122889
Or this random blog:
https://blogs.msmvps.com/bsonnino/2016/11/24/alternate-data-streams-in-c/

Can you pre-compress data files to be inserted into a zip file at a later time to improve performance?

As part of our installer build, we have to zip thousands of large data files into about ten or twenty 'packages' with a few hundred (or even thousands of) files in each which are all dependent on being kept with the other files in the package. (They are versioned together if you will.)
Then during the actual install, the user selects which packages they want included on their system. This also lets them download updates to the packages from our site as one large, versioned file rather than asking them to download thousands of individual ones which could also lead to them being out of sync with others in the same package.
Since these are data files, some of them change regularly during the design and coding stages, meaning we then have to re-compress all files in that particular zip package, even if only one file has changed. This makes the packaging step of our installer build take well over an hour each time, with most of that going to re-compressing things that we haven't touched.
We've looked into leaving the zip packages alone, then replacing specific files inside them, but inserting and removing large files from the middle of a zip doesn't give us that much of a performance boost. (A little, but not enough that it's worth it.)
I'm wondering if it's possible to pre-process files down into a cached raw 'compressed state' that matches how it would be written to the zip package, but only the data itself, not the zip header info, etc.
My thinking is that, if that is possible, our build step would first look for any data file that doesn't have a compressed cache associated with it, compress that file, and write the result to the cache.
Next we would simply append all of the caches together in a file stream, adding any appropriate zip header needed for the files.
This would mean we are still recreating the entire zip during each build, but we are only recompressing data that has changed. The rest would just be written as-is which is very fast since it is a straight write-to-disk. And if a data file changes, its cache is destroyed, so next build-pass it would be recreated.
However, I'm not sure such a thing is possible. Is it, and if so, is there any documentation to show how one would go about attempting this?

Yes, that's possible. The most straightforward approach would be to zip each file individually into its own associated zip archive with one entry. When any file is modified, you replace its associated zip file to keep all of those up to date. Then you can write a simple program to take a set of those single entry zip files and merge them into a single zip file. You will need to refer to the documentation in the PKZip appnote. Take a look at that.
Now that you've read the appnote, what you need to do is use the local header, data, and central header from each individual zip file, write the local header and data as is sequentially to the new zip file, and save the central header and the offsets of the local headers in the new file. Then at the end of the new file save the current offset, write a new central directory using the central headers you saved, updating the offsets appropriately, and ending with a new end of central directory record with the offset of the start of the central directory.
Update:
I decided this was a useful enough thing to write. You can get it here.

You could zip each file beforehand, and then "zip" them together with no compression at the end to quickly aggregate them into a distributable package. It won't be as efficient as compressing all the data at once, but it should be faster when making modifications.
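A minimal sketch of this zip-of-zips idea, using ZipFile and ZipArchive from System.IO.Compression, might look as follows. The class name PackageBuilder, the method names and the cache layout (one ".zip" per data file in a cache directory) are assumptions for illustration. Staleness is judged by last-write time, which is only one possible policy, and the installer would have to unpack the inner zips after extracting the package.

```csharp
using System.IO;
using System.IO.Compression;

public static class PackageBuilder
{
    // Compresses one data file into its own single-entry zip (the per-file cache).
    // The work is skipped when the cache is at least as new as the data file.
    public static void EnsureCached(string dataFile, string cacheZip)
    {
        if (File.Exists(cacheZip) &&
            File.GetLastWriteTimeUtc(cacheZip) >= File.GetLastWriteTimeUtc(dataFile))
            return;

        File.Delete(cacheZip); // does nothing if the file is not there
        using (var zip = ZipFile.Open(cacheZip, ZipArchiveMode.Create))
            zip.CreateEntryFromFile(dataFile, Path.GetFileName(dataFile), CompressionLevel.Optimal);
    }

    // Aggregates every cached zip into one package. NoCompression is requested
    // because the entries are already compressed, so this step is mostly disk I/O.
    public static void BuildPackage(string cacheDir, string packagePath)
    {
        File.Delete(packagePath);
        using (var package = ZipFile.Open(packagePath, ZipArchiveMode.Create))
        {
            foreach (var cached in Directory.GetFiles(cacheDir, "*.zip"))
                package.CreateEntryFromFile(cached, Path.GetFileName(cached), CompressionLevel.NoCompression);
        }
    }
}
```

Unlike the raw header-and-data merge described in the first answer, this produces nested archives rather than one flat zip, which is the price of staying within the standard library.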

I cannot seem to locate an actual exe that implements this type of functionality. It appears that most existing tools I've tried that have the ability to merge/update will reprocess (recompress) the data stream, as you have already stated you saw.
However it seems what you describe can be done if you or someone wants to write it. If you take a look at this link for the ZIP file format specification, you can get an overview of the structure you would have to parse out and process. It looks like you can pretty quickly go from file to file gathering up and discarding the files of interest, then merging in your new/updated files. You would still need to rebuild a new central directory (refer to section 4.3.6 of the above linked document) within your new destination archive.
After a little more digging, the DotNetZip Library forum has a message asking about the same type of functionality which also gives a description just like I described above. It also links to this document which seems to indicate that support for that may be added to the DotNetZip library for you to further experiment with.

How to detect real-time change of text files?

I am starting to write a little PC tool to read log files using C# or Java. The log files will be in .txt format. An application is running and writing logs, and I want my tool to open the log at the same time and refresh automatically when a new line is written to the log file.
My challenge is: how do I detect the log file changes so that my tool can display them in real time? This is a general question, but pseudocode would be greatly appreciated!

You can use the FileSystemWatcher class (MSDN page). Be careful though, if you try to open the file while the other process is writing the file you will probably be denied access.

There are a bunch of these questions here, with answers too:
Reading file content changes in .NET
Your target is the FileStream object, I suppose.

I would poll the file and check its metadata for the last changed date, or, if you're using .NET, you could use the FileSystemWatcher class.
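The suggestions above can be combined into a small "tail" helper: remember how far the file has been read, and on every change notification (or polling tick) read only the bytes added since then. The sketch below opens the file with FileShare.ReadWrite, which reduces the chance of being denied access while the other process is writing, provided that process also permits shared reads. The class name LogTail and its members are hypothetical, and it assumes a UTF-8 log without a byte order mark whose writer appends "\n"-terminated lines.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public sealed class LogTail
{
    private readonly string _path;
    private long _position; // byte offset of the first unread byte

    public LogTail(string path) { _path = path; }

    // Returns the complete lines appended since the previous call.
    // A trailing partial line is left unread until its newline arrives.
    public List<string> ReadNewLines()
    {
        var lines = new List<string>();
        using (var stream = new FileStream(_path, FileMode.Open, FileAccess.Read,
                                           FileShare.ReadWrite | FileShare.Delete))
        {
            if (stream.Length < _position) _position = 0; // file was truncated or rotated
            var buffer = new byte[stream.Length - _position];
            stream.Seek(_position, SeekOrigin.Begin);
            int read = 0, n;
            while (read < buffer.Length && (n = stream.Read(buffer, read, buffer.Length - read)) > 0)
                read += n;
            if (read == 0) return lines;

            int end = Array.LastIndexOf(buffer, (byte)'\n', read - 1);
            if (end < 0) return lines; // no complete line yet
            _position += end + 1;
            foreach (var line in Encoding.UTF8.GetString(buffer, 0, end).Split('\n'))
                lines.Add(line.TrimEnd('\r'));
        }
        return lines;
    }

    // Raises onLine for every new line whenever the file system reports a change.
    // The caller keeps the returned watcher alive and disposes it when done.
    public static FileSystemWatcher Watch(string path, Action<string> onLine)
    {
        var tail = new LogTail(path);
        var watcher = new FileSystemWatcher(Path.GetDirectoryName(Path.GetFullPath(path)),
                                            Path.GetFileName(path));
        watcher.NotifyFilter = NotifyFilters.LastWrite | NotifyFilters.Size;
        watcher.Changed += (sender, e) =>
        {
            lock (tail) { foreach (var line in tail.ReadNewLines()) onLine(line); }
        };
        watcher.EnableRaisingEvents = true;
        return watcher;
    }
}
```

If change notifications turn out to be unreliable for a given setup (for example on some network shares), calling ReadNewLines from a timer gives the polling variant with the same reading logic.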

What's the best structure to conserve file related information?

I am building an interface whose primary function would be to act as a file renaming tool (the underlying task here is to manually classify each file within a folder according to rules that describe their content). So far, I have implemented a customized file explorer and a preview window for the files.
I now have to find a way to inform a user if a file has already been renamed (this will show up in the file explorer's listView). The program should be able to read as well as modify that state as the files are renamed. I simply do not know what method is optimal to save this kind of information, as I am not fully used to C#'s potential yet. My initial solution involved text files, but again, I do not know if there should be only one text file for all files and folders or simply a text file per folder indicating the state of its contained items.
A colleague suggested that I use an Excel spreadsheet and then simply import the row or columns corresponding to my query. I tried to find more direct data structures, but again I would feel a lot more comfortable with some outside opinion.
So, what do you think would be the best way to store this kind of data?
PS: There are many thousands of files, all of them TIFF images, located on a remote server to which I have complete access.

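I'm not sure what you're asking for, but if you simply want to keep some of a file's information, such as name, date, size, etc., you could use the FileInfo class. It is marked as serializable, but note that XmlSerializer requires a public parameterless constructor, which FileInfo does not have. To write the data to an XML file with the Serialize method of an XmlSerializer, you would therefore copy the properties you need into a small class of your own and serialize an array of those.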

I am not sure I understand your question, but from what I gather you basically want to store the metadata for each file. If this is the case, I can make two suggestions.
Store the metadata in a simple XML file. Use one XML file per folder if you have multiple folders; the XML file could be a hidden file. Then your custom application can load the file, if it exists, when you navigate to the folder and present the data to the user.
If you are using NTFS and you know this will always be the case, you can store the metadata for the file in a file stream. This is not a .NET stream, but an extra stream of data that can be stored and moved around with each file without impacting the actual file's content. The nice thing about this is that no matter where you move the file, the metadata will move with the file, as long as it is still on NTFS.
Here is more info on the file streams
http://msdn.microsoft.com/en-us/library/aa364404(VS.85).aspx

You could create an object-oriented structure and then serialize the root object to a binary file or to an XML file. You could represent just about any structure this way, so you wouldn't have to struggle with the "I do not know if there should be only one text file for all files and folders or simply a text file per folder indicating the state of its contained items" design issues. You would just have one file containing all of the metadata that you need to store. If you want speedier opening/saving and smaller size, go with binary; if you want something that other people could open, view and potentially write their own software against, use XML.
There are lots of variations on how to do this, but to get you started, here is one article from a quick Google search:
http://www.codeproject.com/KB/cs/objserial.aspx
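As an illustration of the XML route suggested in the answers above, here is a minimal sketch that keeps one state file per folder and serializes a list of plain objects with XmlSerializer. The FileState class, its properties and the ".filestate.xml" file name are hypothetical and would be adapted to whatever the renaming tool actually needs to track.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical record of what the tool knows about one file.
// XmlSerializer needs a public type with a public parameterless constructor.
public class FileState
{
    public string OriginalName { get; set; }
    public string CurrentName { get; set; }
    public bool Renamed { get; set; }
}

public static class FolderStateStore
{
    private const string FileName = ".filestate.xml"; // assumed name, one per folder
    private static readonly XmlSerializer Serializer = new XmlSerializer(typeof(List<FileState>));

    // Overwrites the folder's state file with the given list.
    public static void Save(string folder, List<FileState> states)
    {
        using (var stream = File.Create(Path.Combine(folder, FileName)))
            Serializer.Serialize(stream, states);
    }

    // Returns the stored list, or an empty list if the folder has no state file yet.
    public static List<FileState> Load(string folder)
    {
        var path = Path.Combine(folder, FileName);
        if (!File.Exists(path)) return new List<FileState>();
        using (var stream = File.OpenRead(path))
            return (List<FileState>)Serializer.Deserialize(stream);
    }
}
```

The file explorer could call Load when a folder is opened to mark already-renamed items in the list view, and Save after each rename.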

Is there an easy way to determine the type of a file without knowing the file's extension?

I have a table with a binary column which stores files of a number of different possible filetypes (PDF, BMP, JPEG, WAV, MP3, DOC, MPEG, AVI etc.), but no columns that store either the name or the type of the original file. Is there any easy way for me to process these rows and determine the type of each file stored in the binary column? Preferably it would be a utility that only reads the file headers, so that I don't have to fully extract each file to determine its type.
Clarification: I know that the approach here involves reading just the beginning of each file. I'm looking for a good resource (aka links) that can do this for me without too much fuss. Thanks.
Also, just C#/.NET on Windows, please. I'm not using Linux and can't use Cygwin (doesn't work on Windows CE, among other reasons).

You can use these tools to find the file format.
File Analyser
http://www.softpedia.com/get/Programming/Other-Programming-Files/File-Analyzer.shtml
What Format
http://www.jozy.nl/whatfmt.html
PE file format analyser
http://peid.has.it/
This website may be helpful for you.
http://mark0.net/onlinetrid.aspx
Note:
I have included the download links to make sure that you get the right tool name and information.
Please verify the source before you download them.
I have used a tool in the past, I think it was File Analyser, which will tell you the closest match.
Happy tooling.

This is not a complete answer, but a place to start would be a "magic numbers" library. Such a library examines the first few bytes of a file to determine a "magic number", which is compared against a list of known ones. This is (at least part of) how the file command on Linux systems works.

Someone else asked a similar question and posted the code used to do exactly this. You should be able to take what is posted here, and slightly modify it so that it pulls from your database.
https://stackoverflow.com/questions/58510
In addition to that, it looks like someone has written a library based on magic numbers to do this. However, the site appears to require registration, and some form of alternate access, in order to download the library. The documentation is available for free without registration, which may be helpful.
http://software.topcoder.com/catalog/c_component.jsp?comp=13249160&ver=2

The easiest way I know is to use the file command, which is also available on Windows with Cygwin.

A lot of file types have well-defined headers that begin the file. You could check the first few bytes to see how the file begins.
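A minimal sketch of such a header check in C# is shown below. It covers only a handful of well-known signatures for the formats mentioned in the question and is not a complete detector: for example, MP3 files without an ID3v2 tag and MPEG streams that do not start with a pack header are not recognized, and ZIP-based formats all report as ZIP. The class name FileTypeSniffer and the returned labels are hypothetical.

```csharp
using System;
using System.Text;

public static class FileTypeSniffer
{
    // A few well-known leading byte sequences ("magic numbers").
    private static readonly (string Type, byte[] Magic)[] Signatures =
    {
        ("PDF",  new byte[] { 0x25, 0x50, 0x44, 0x46 }),                          // "%PDF"
        ("PNG",  new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A }),
        ("JPEG", new byte[] { 0xFF, 0xD8, 0xFF }),
        ("GIF",  new byte[] { 0x47, 0x49, 0x46, 0x38 }),                          // "GIF8"
        ("BMP",  new byte[] { 0x42, 0x4D }),                                      // "BM"
        ("ZIP",  new byte[] { 0x50, 0x4B, 0x03, 0x04 }),                          // "PK\x03\x04"
        ("MP3",  new byte[] { 0x49, 0x44, 0x33 }),                                // "ID3" tag
        ("DOC",  new byte[] { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 }),  // OLE2 compound file
        ("MPEG", new byte[] { 0x00, 0x00, 0x01, 0xBA }),                          // program stream pack header
    };

    // header: the first bytes of the file (16 bytes are enough for this table).
    public static string Detect(byte[] header)
    {
        // RIFF containers carry their form type at offset 8 (WAV and AVI both start with "RIFF").
        if (header.Length >= 12 && Encoding.ASCII.GetString(header, 0, 4) == "RIFF")
        {
            var form = Encoding.ASCII.GetString(header, 8, 4);
            if (form == "WAVE") return "WAV";
            if (form == "AVI ") return "AVI";
        }
        foreach (var (type, magic) in Signatures)
            if (StartsWith(header, magic)) return type;
        return "unknown";
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
            if (data[i] != prefix[i]) return false;
        return true;
    }
}
```

For a binary database column, only the leading bytes need to be fetched before calling Detect. One way to do that, assuming SQL Server and ADO.NET, is SqlDataReader.GetBytes on a reader opened with CommandBehavior.SequentialAccess, so the full file is never pulled into memory.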

The easiest way to do this would be through access to a *nix (or Cygwin) system that has the 'file' command:
$ file visitors.*
visitors.html: HTML document text
visitors.png: PNG image data, 5360 x 2819, 8-bit colormap, non-interlaced
You could write a C# application that pipes the first X bytes of each binary column to the file command (using - as the file name).

You need to use some p/invoke interop code to call the SHGetFileInfo method from the Win32 API. This article may also help.
