SharpZipLib - progress through files during extract - C#

This has to be really easy, and it certainly seems to be a very frequently asked question, but I can't for the life of me find a straightforward answer.
I want to create a ProgressBar that shows a Zip file being extracted by SharpZipLib.
The FastZip and FastZipEvents classes give progress on individual files but not on position within the overall Zip. That is, if the Zip contains 200 files, which file is currently being extracted? I don't care about the progress through individual files (e.g. 20KB through 43KB in Foo.txt).
I think I could fudge a way of doing this by first creating a ZipFile to access its Count property, and then using ZipInputStream or FastZip to extract while keeping a progress count myself. But I think that means the Zip is effectively unzipped twice (once entirely into memory), and I don't like that.
Is there a clean way of doing this?

Regarding your last sentence: "I think that means the Zip is effectively unzipped twice."
Reading the content table of a zip file doesn't cost much at all, and it doesn't touch the contained files. You have probably noticed this when looking at a password-protected zip file: you only need to enter the password when you try to extract a file, while the entries/content table can be browsed just fine.
So I see nothing wrong with the approach of first reading the index/content table, storing the entry count (maybe even with compressed/uncompressed sizes?), and then using the stream-based API for the actual extraction.
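For illustration, here is a minimal sketch of that two-pass approach using SharpZipLib's ZipFile and ZipInputStream; the archive path and the progress-bar update are placeholders:
using System;
using System.IO;
using ICSharpCode.SharpZipLib.Zip;

// Pass 1: read only the central directory to get the entry count.
var zipFile = new ZipFile("archive.zip");
long totalEntries = zipFile.Count;
zipFile.Close();

// Pass 2: stream through the archive, reporting per-entry progress.
long entriesDone = 0;
using (var zis = new ZipInputStream(File.OpenRead("archive.zip")))
{
    ZipEntry entry;
    while ((entry = zis.GetNextEntry()) != null)
    {
        // ... extract the current entry from zis to disk here ...
        entriesDone++;
        int percent = (int)(entriesDone * 100 / totalEntries);
        // progressBar.Value = percent;  // placeholder UI update
    }
}
The first pass only parses the central directory at the end of the file, so it does not decompress anything; the data is read once, in the second pass.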

FYI: DotNetZip has an ExtractProgress event for this sort of thing. Code:
using (ZipFile zip = ZipFile.Read(ExistingZipFile))
{
    zip.ExtractProgress += MyExtractProgress;
    zip.ExtractAll(TargetDirectory);
}
The ExtractProgress handler looks like this:
private void MyExtractProgress(object sender, ExtractProgressEventArgs e)
{
    switch (e.EventType)
    {
        case ZipProgressEventType.Extracting_BeforeExtractEntry:
            // ...
            break;
        case ZipProgressEventType.Extracting_EntryBytesWritten:
            // ...
            break;
        case ZipProgressEventType.Extracting_AfterExtractEntry:
            // ...
            break;
    }
}
You could use it to drive the familiar two-progress-bar UI, with one bar showing progress for the archive and another showing progress for the individual file within the archive.
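A rough sketch of how those events could drive both bars; the bar controls (archiveBar, fileBar) are hypothetical, and the event-args properties (EntriesTotal, EntriesExtracted, BytesTransferred, TotalBytesToTransfer) are DotNetZip's as I recall them, so verify against your version:
private void MyExtractProgress(object sender, ExtractProgressEventArgs e)
{
    switch (e.EventType)
    {
        case ZipProgressEventType.Extracting_EntryBytesWritten:
            // Per-file bar: bytes written for the current entry.
            if (e.TotalBytesToTransfer > 0)
                fileBar.Value = (int)(e.BytesTransferred * 100 / e.TotalBytesToTransfer);
            break;
        case ZipProgressEventType.Extracting_AfterExtractEntry:
            // Archive bar: entries completed out of the total.
            archiveBar.Value = e.EntriesExtracted * 100 / e.EntriesTotal;
            break;
    }
}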

System.IO.Compression - Counting the number of files using ZipArchive is very slow

In order to update a progress bar with the number of files to extract, my program goes over a list of Zip files and collects the number of files in each. The combined number is approximately 22,000 files.
The code I am using:
foreach (string filepath in zipFiles)
{
    ZipArchive zip = ZipFile.OpenRead(filepath);
    archives.Add(zip);
    filesCounter += zip.Entries.Count;
}
However, it looks like zip.Entries.Count does some kind of traversal, and it takes ages for this count to complete (several minutes, and much, much more if the internet connection is not great).
To get a sense of how much this could improve, I compared the above to the performance of 7-Zip.
I took one of the zip files, which contains ~11,000 files and folders:
2 seconds to open the archive in 7-Zip.
1 second to get the file properties.
In the properties I can see 10,016 files + 882 folders, meaning it takes 7-Zip ~3 seconds to know there are 10,898 entries in the Zip file.
Any idea, suggestion, or alternative method that quickly counts the number of files will be appreciated.
Using DotNetZip to count is actually much faster, but due to some internal bureaucratic issues I can't use it.
I need a solution not involving third-party libraries; I can still use the Microsoft standard libraries.
My progress bar issue is solved by taking a new approach to the matter.
I simply accumulate all the Zip files' sizes, which serves as the maximum. Now, for each individual file that is extracted, I add its compressed size to the progress. This way the progress bar does not show me the number of files; it shows me progress through the total compressed size (e.g. if, in total, I have 4 GB to extract, then when the progress bar is 1/4 green I know I have extracted 1 GB). That looks like a better representation of reality.
foreach (string filepath in zipFiles)
{
    ZipArchive zip = ZipFile.OpenRead(filepath);
    archives.Add(zip);
    // Accumulating the Zip files' sizes.
    filesCounter += new FileInfo(filepath).Length;
}
// To utilize multiple processors it is possible to activate this loop
// in a thread for each ZipArchive -> currentZip!
// :
// :
foreach (ZipArchiveEntry entry in currentZip.Entries)
{
    // Doing my extract code here.
    // :
    // :
    // Accumulate the compressed size of each file.
    compressedFileSize += entry.CompressedLength;
    // Doing other stuff
    // :
    // :
}
So the issue of improving the performance of zip.Entries.Count is still open, and I am still interested in knowing how to solve this specific issue. (What does 7-Zip do to be so quick? Maybe it uses DotNetZip or other C++ libraries.)
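One standard-library-only possibility, sketched from the published ZIP format rather than any particular library: the total entry count is stored as a field in the "end of central directory" (EOCD) record at the tail of the file, so it can be read without enumerating entries. Caveat: for archives with more than 65,535 entries the 16-bit field reads 0xFFFF and the real count lives in a ZIP64 record, which this sketch does not handle:
using System;
using System.IO;

// Hypothetical helper: read the entry count from the EOCD record.
static int CountZipEntries(string path)
{
    using (var fs = File.OpenRead(path))
    {
        // The EOCD record is 22 bytes plus up to 64 KB of trailing comment.
        long start = Math.Max(0, fs.Length - (22 + 65535));
        fs.Position = start;
        byte[] tail = new byte[fs.Length - start];
        int read = 0;
        while (read < tail.Length)
        {
            int n = fs.Read(tail, read, tail.Length - read);
            if (n == 0) break; // unexpected EOF
            read += n;
        }

        // Scan backwards for the EOCD signature 0x06054B50 ("PK\x05\x06").
        for (int i = tail.Length - 22; i >= 0; i--)
        {
            if (tail[i] == 0x50 && tail[i + 1] == 0x4B &&
                tail[i + 2] == 0x05 && tail[i + 3] == 0x06)
            {
                // Total entry count is a UInt16 at offset 10 of the record.
                return BitConverter.ToUInt16(tail, i + 10);
            }
        }
        throw new InvalidDataException("EOCD record not found.");
    }
}
This is plausibly why 7-Zip reports the count so quickly: the count is a single field in the archive's trailer, not something derived by walking the entries.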

Is it better practice to work with information within a file / folder or to load that information first?

I have a folder of files and a spreadsheet with a list of file names. I need to go through each file in the folder and see if that file name exists on the spreadsheet. It seems I can either load all the information (file names and spreadsheet list) into lists and then search from there or I can just loop through the files, get the name as I go, then look through the spreadsheet itself.
As far as I can tell, the benefit to loading them first is that it may make the search code a bit cleaner, but if there are too many files it would be redundant and slower. Working directly with the files and spreadsheet would negate that intermediate step, but the search code would be a little bit messier.
Is there any other clear trade off that I am missing? Is there a best practice for this?
Thanks.
Be careful, as comparing two lists results in an O(n²) problem. This means that if you have 20 files, you will have to make 20 * 20 = 400 comparisons.
Therefore I suggest putting the filenames from the spreadsheet into a HashSet<string>. It has a constant access time of O(1), which reduces your problem to an O(n) problem.
// Gather the file names from the spreadsheet and insert them in a HashSet.
// (This is just simulated here.)
var fileNamesOnSpreadsheet = new HashSet<string>(StringComparer.OrdinalIgnoreCase) {
    "filename 1", "filename 2", "filename 3", "another filename"
};

string folder = @"C:\Data";
foreach (string file in Directory.EnumerateFiles(folder)) {
    if (fileNamesOnSpreadsheet.Contains(file)) {
        // file found in spreadsheet
    } else {
        // file missing from spreadsheet
    }
}
Note that Directory.EnumerateFiles returns the filenames including their paths and extensions. If you have the bare filenames in the spreadsheet, you can remove the path with
string fileNameOnly = Path.GetFileName(file);
You can also remove the extension with
string fileNameOnly = Path.GetFileNameWithoutExtension(file);
Note that this solution reads the files from the folder only once and gets the file names from the spreadsheet only once. Reading information from the file system is time-consuming, and so is extracting information from the spreadsheet.
Directory.EnumerateFiles does not even store the filenames in a collection; it delivers them one at a time as the foreach loop progresses.
So, this solution is very efficient.
See also:
Directory.EnumerateFiles Method
Big O notation - Wikipedia
For small numbers of names and a single search it takes so little time that optimizing the code is probably not worth it, and you can do whatever is easiest for you.
For the sheet, it could make sense to load the names into a list, because you will search it N times (once per file) and searching an in-memory list will be faster than searching the sheet. It might also make sense to keep the list sorted so that searches take O(log N) time instead of O(N).
As @JonathanWillcock1 notes, other in-memory data structures such as a dictionary might work even better, hiding the details of sorting and searching from you and making your code cleaner.
For the file names, you only look at each name once, so iterating through a directory listing function is all that you need; copying it to a list would double the work.
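A minimal sketch of the sorted-list idea from the answer above (the names are made up); List<T>.BinarySearch returns a non-negative index when the item is found:
using System;
using System.Collections.Generic;

// Sort once, then each lookup costs O(log N).
var namesFromSheet = new List<string> { "b.txt", "a.txt", "c.txt" };
namesFromSheet.Sort(StringComparer.OrdinalIgnoreCase);

bool found = namesFromSheet.BinarySearch("a.txt", StringComparer.OrdinalIgnoreCase) >= 0;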

System.IO.IOException C# when FileInfo and WriteAllLines

I want to trim some volume off my text log file when its size exceeds a maximum:
FileInfo f = new FileInfo(filename);
if (f.Length > 30*1024*1024)
{
    var lines = File.ReadLines(filename).Skip(10000);
    File.WriteAllLines(filename, lines);
}
But I get this exception:
System.IO.IOException: The process cannot access the file '<path>' because it is being used by another process.
Questions:
Do I need to close the FileInfo object before further work with the file?
Is there a more adequate method to rotate logs? (For example, an efficient way to obtain the number of lines instead of the byte size?)
File.ReadLines keeps the file open until you dispose of the returned IEnumerable<string>, so this has nothing to do with FileInfo.
If you need to write the lines back to the same file, fully enumerate the contents first:
var lines = File.ReadLines(filename).Skip(10000).ToList();
You mention "rotating logs"; have you considered rotating files instead? I.e. write to a fixed file, and when it gets "full" (by whatever criteria you deem full, like 1 GB in size, one day's worth of log entries, 100,000 lines, etc.), you rename the file and create a new, empty one.
You would probably want to rename existing rotated files as well, so as to keep the number of rotated files low.
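A minimal sketch of that rotation scheme (the file names and the size limit here are made up):
using System.IO;

const long MaxBytes = 30 * 1024 * 1024;
string log = "app.log";

if (File.Exists(log) && new FileInfo(log).Length > MaxBytes)
{
    // Shift older copies up and start a fresh file instead of
    // rewriting the active log in place.
    if (File.Exists("app.log.2")) File.Delete("app.log.2");
    if (File.Exists("app.log.1")) File.Move("app.log.1", "app.log.2");
    File.Move(log, "app.log.1");
    // The next write to "app.log" creates a new, empty file.
}
This avoids the read-rewrite cycle entirely, so there is no window where the log file is open for reading while you try to overwrite it.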

Having trouble saving multiple items to Isolated Storage

I have a note app with two pages:
MainPage.xaml — the creation of notes;
NoteList.xaml — a list of notes.
Notes are saved by means of IsolatedStorage and appear in NoteList.xaml (in a ListBox), but notes with the same name are not stored. How do I fix that?
I need to be able to add notes with the same name (but with different content).
Thanks!
Are you using the note name as the file name? If so... don't do that. Save each file with a unique name. There are myriad ways of doing this: you could use a GUID or a timestamp, or you could append a timestamp to the end of the file name. If you were so inclined, you could store all of the notes in a single formatted file, perhaps XML.
What you need is a way to uniquely identify each note without using:
a. The note's name
b. The note's contents
While using a timestamp might make sense for your application right now (since a user probably cannot create two disparate notes simultaneously), using a timestamp to identify each note could lead to problems down the line if you wanted to implement, say, a server-side component to your application. What happens if, in version 23 of your application (which obviously sells millions in the first months), you decide to allow users to collaborate on notes, and a Note is shared between two instances of your app that happened to be created at the EXACT same time? You'd have problems.
A reasonable solution for finding a unique identifier for each Note in your application is the Guid.NewGuid method. You should generate it when the user decides to "save" the note (or when your app saves the note the moment it's created, or at some set interval to allow for instant "drafts").
Now that we've sufficiently determined a method of uniquely identifying each Note that your application will allow a user to create, we need to think about how that data should be stored.
A great way to do this is through the use of XmlSerializer, or better yet the third-party library Json.Net. But for the sake of simplicity, I recommend doing something a bit easier.
A simpler method (using good ole' plain text) would be the following:
1: {Note.Name}
2: {Guid.ToString()}
3: {Note.Contents}
4: {Some delimiter}
When reading the file back from IsolatedStorage, you would go through it line by line, considering each "chunk" of lines between the start of the file (or the previous {Some delimiter}) and the next {Some delimiter} to be the data for one "Note".
Keep in mind there are some restrictions with this format. Mainly, you have to keep the user from having the last part of their note's contents be equal to the {Some delimiter} (which you are free to define arbitrarily, by the way). To this end, it may be helpful to use a string of characters the user is not likely to enter, such as "##&&ENDOFNOTE&&##". Regardless of how unlikely it is that the user will type that in, you need to check before you save to IsolatedStorage that the note's contents do not contain this string, because it would break your file format.
If you want a simple solution that works, use the above method. If you want a good solution that's scalable, use JSON or XML and figure out a file format that makes sense to you. I highly encourage you to look into JSON; its value reaches so much further than this isolated scenario.
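A minimal sketch of that plain-text format; the "NotesFile" name and the SaveNote helper are made up for illustration:
using System;
using System.IO;
using System.IO.IsolatedStorage;

void SaveNote(string name, string contents)
{
    using (var store = IsolatedStorageFile.GetUserStoreForApplication())
    using (var stream = store.OpenFile("NotesFile", FileMode.Append, FileAccess.Write))
    using (var writer = new StreamWriter(stream))
    {
        writer.WriteLine(name);                      // 1: {Note.Name}
        writer.WriteLine(Guid.NewGuid().ToString()); // 2: {Guid.ToString()}
        writer.WriteLine(contents);                  // 3: {Note.Contents}
        writer.WriteLine("##&&ENDOFNOTE&&##");       // 4: {Some delimiter}
    }
}
Because the GUID, not the name, identifies each note, two notes named identically but with different contents no longer collide.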
I've had a need to write notes to IsolatedStorage. What I did was write them to a file using IsolatedStorageFile: I write the date on which the note was written and then the note. From the list box I store them into two arrays; then, before exiting the app, I write them to a file.
// Requires: using System.IO; using System.IO.IsolatedStorage; using System.Globalization;
try
{
    using (IsolatedStorageFile storagefile = IsolatedStorageFile.GetUserStoreForApplication())
    {
        if (storagefile.FileExists("NotesFile"))
        {
            using (IsolatedStorageFileStream fileStream = storagefile.OpenFile("NotesFile", FileMode.Open, FileAccess.ReadWrite))
            {
                StreamWriter writer = new StreamWriter(fileStream);
                for (int i = 0; i < m_noteCount; i++)
                {
                    //writer.Write(m_arrNoteDate[i].ToShortDateString());
                    writer.Write(m_arrNoteDate[i].ToString("d", CultureInfo.InvariantCulture));
                    writer.Write(" ");
                    writer.Write(m_arrNoteString[i]);
                    writer.WriteLine("~`");
                }
                writer.Close();
            }
        }
    }
}
catch (IsolatedStorageException)
{
    // Handle storage errors here.
}

Find appended text from txt file

I want to write code so that, if there is a text file placed in a specified path and one of the users edits the file, enters new text, and saves it, I can get the text that was appended last.
Here I have the file size from both before and after the text was appended.
My text file's size is 1204 KB; from that I need to take just the last 200 KB of text. Is that possible?
This can only be done if you're monitoring the file size in real time, since files do not maintain their own histories.
If watching the files as they are modified is a possibility, you could use a FileSystemWatcher and calculate the increase in file size upon any modification. You could then read the bytes appended since the file last changed, which would be very straightforward.
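A minimal sketch of that read-from-the-old-length idea (the file name is a placeholder, and this assumes the file is only ever appended to, never edited in the middle):
using System.IO;

long lastLength = new FileInfo("watched.txt").Length;

string ReadAppendedText(string path)
{
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    {
        if (fs.Length <= lastLength)
            return string.Empty; // unchanged, or truncated rather than appended

        fs.Position = lastLength;   // skip everything we saw before
        using (var reader = new StreamReader(fs))
        {
            string appended = reader.ReadToEnd();
            lastLength = fs.Length; // remember the new size
            return appended;
        }
    }
}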
Do you know how big the file was before the user appended the text? If not, there's no way of telling... files don't maintain a revision history (in most file systems, anyway).
You can keep track of the file pointer. E.g., if you are using the C language, you can go to the end of the file using fseek(fp, 0, SEEK_END) and then use ftell(fp), which will give you the current position of the file pointer. After the user edits and saves the file, when you rerun the code you can compare the new position with the original position. If the new position is greater than the original position, seek to the original position and read the difference from there.
As @Jon Skeet alludes to in his answer, the only way to tell specifically what text was "appended" is by knowing how large the file was before it was changed. The rest of the characters are thus what was "appended".
Note that I quote "appended" above since I get two conflicting meanings from your question: edited and appended.
If the user only appends text, which is taken to mean "add more text only at the end", then the previous-size approach should in theory work.
However, if the user freely edits the text, by adding text in random spots, and perhaps even removing or changing existing text, then you need a whole 'nother approach to this.
If it's the latter, I might have something you could use: a binary patching implementation that can also be used to figure out, from an older copy of the same file, what was changed in a newer copy. It isn't easy to use and might not give you exactly what you want, but as I said, it's hard to tell exactly what your question is.
If your program is running the entire time, you could grab a copy of the file in memory. Then in a separate thread periodically read the new file and compare the two.
If you want your program to be notified when the file is changed, use FileSystemWatcher. However, it will only notify you when the file is changed while your program is running, and it will not provide you with the appended text; you will only get information about which file was changed.
FileSystemWatcher watcher = new FileSystemWatcher(Environment.CurrentDirectory, "test.txt");
while (true)
{
    var changedResult = watcher.WaitForChanged(WatcherChangeTypes.Changed);
    Console.WriteLine(changedResult.Name);
}
Or:
FileSystemWatcher watcher = new FileSystemWatcher(Environment.CurrentDirectory, "test.txt");
watcher.Changed += watcher_Changed;
watcher.EnableRaisingEvents = true; // events do not fire without this

static void watcher_Changed(object sender, FileSystemEventArgs e)
{
    Console.WriteLine(e.FullPath);
    Console.WriteLine(e.ChangeType);
}
The best solution, IMO, is to write a small app which has to be used to change the file in question. That application can then insert additional info into the file, which allows you to keep the entire revision history.
