Find a string in a zipped file without unzipping the file - C#

Is there a way to search for a string within the file(s) inside a zipped folder WITHOUT unzipping the files?
My situation is I have over 1 million files zipped by month of the year.
For example 2008_01, 2008_02, etc.
I need to extract/unzip only the files that contain specific serial numbers.
The only approach I can find is unzipping the data to a temporary location and searching there, but it takes me 45-60 minutes just to unzip the data manually, so I assume code would take just as long to perform that task. I also don't have that much space available.
Please help.

Unfortunately, there isn't a way to do this. The zip format maintains an uncompressed manifest that shows file names and directory structure, but the contents of the files themselves are compressed, and therefore any string inside a file won't match your search until the file is decompressed.
This same limitation exists with just about any general-purpose file compression format (7zip, gzip, rar, etc.). You're essentially reclaiming disk space at the expense of CPU cycles.
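To make the distinction concrete: the manifest (the zip central directory) can be read cheaply, so entry names and sizes are searchable without decompressing anything; only the file contents are off limits. A minimal sketch, with a hypothetical archive path:
using System;
using System.IO.Compression;

class ZipManifestDemo
{
    static void Main()
    {
        var zipPath = @"C:\Data\2008_01.zip"; // hypothetical path

        // OpenRead only parses the central directory; no file contents
        // are decompressed while listing the entries.
        using (var archive = ZipFile.OpenRead(zipPath))
        {
            foreach (var entry in archive.Entries)
                Console.WriteLine($"{entry.FullName} ({entry.Length} bytes)");
        }
    }
}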

Using some extension methods, you can scan through the Zip files. I don't think you can gain anything by trying to scan a single zip in parallel, but you could probably scan multiple zip files in parallel.
public static class ZipArchiveEntryExt {
    // Lazily yields the lines of a zip entry, decompressing on the fly
    // without extracting the entry to disk.
    public static IEnumerable<string> GetLines(this ZipArchiveEntry e) {
        using (var stream = e.Open())
        using (var sr = new StreamReader(stream)) {
            string line;
            while ((line = sr.ReadLine()) != null)
                yield return line;
        }
    }
}
public static class ZipArchiveExt {
    // Yields the names of the entries whose text contains target.
    public static IEnumerable<string> FilesContain(this ZipArchive arch, string target) {
        foreach (var entry in arch.Entries.Where(e => !e.FullName.EndsWith("/")))
            if (entry.GetLines().Any(line => line.Contains(target)))
                yield return entry.FullName;
    }

    // Extracts the matching entries to extractPath (flat, by entry name).
    public static void ExtractFilesContaining(this ZipArchive arch, string target, string extractPath) {
        if (!extractPath.EndsWith(Path.DirectorySeparatorChar.ToString(), StringComparison.Ordinal))
            extractPath += Path.DirectorySeparatorChar;
        foreach (var entry in arch.Entries.Where(e => !e.FullName.EndsWith("/")))
            if (entry.GetLines().Any(line => line.Contains(target)))
                entry.ExtractToFile(Path.Combine(extractPath, entry.Name));
    }
}
With these, you can search a zip file with:
var arch = ZipFile.OpenRead(zipPath);
var targetString = "Copyright";
var filesToExtract = arch.FilesContain(targetString);
You could also extract them to a particular path (assuming no filename conflicts) with:
var arch = ZipFile.OpenRead(zipPath);
var targetString = "Copyright";
arch.ExtractFilesContaining(targetString, @"C:\Temp");
You could modify ExtractFilesContaining to e.g. add the year-month to the file names to help avoid conflicts.
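As a sketch of the "multiple zip files in parallel" idea mentioned above: PLINQ can partition the list of archive paths across threads, and each thread opens its own ZipArchive (a single ZipArchive instance is not safe to share across threads). This assumes the FilesContain extension from above; the folder path is hypothetical:
var zipPaths = Directory.EnumerateFiles(@"C:\Data", "*.zip"); // hypothetical folder
var targetString = "Copyright";

var matches = zipPaths
    .AsParallel()
    .SelectMany(path =>
    {
        using (var arch = ZipFile.OpenRead(path))
        {
            // Materialize before the archive is disposed, since
            // FilesContain is lazy.
            return arch.FilesContain(targetString)
                       .Select(name => new { Zip = path, Entry = name })
                       .ToList();
        }
    })
    .ToList();

foreach (var m in matches)
    Console.WriteLine(m.Zip + ": " + m.Entry);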

Related

How to extract multi-volume archive within Azure Blob Storage?

I have a multi-volume archive stored in Azure Blob Storage that is split into a series of zips titled like this: Archive-Name.zip.001, Archive-Name.zip.002, ..., Archive-Name.zip.010. Each file is 250 MB and contains hundreds of PDFs.
Currently we are trying to iterate through each archive part and extract the PDFs. This works except when the last PDF in an archive part has been split across two parts; ZipFile in C# is unable to process the split file and throws an exception.
We tried reading all the archive parts into a single MemoryStream and then extracting the files, but the memory stream exceeds the 2 GB limit, so this method does not work either.
It is not feasible to download the archive into a machine's memory, extract it, then upload the PDFs to a new file. The extraction needs to be done in Azure, where the program will run.
This is the code we are currently using - it is unable to handle PDFs split between two archive parts.
public static void UnzipTaxForms(TextWriter log, string type, string fiscalYear)
{
    var folderName = "folderName";
    var outPutContainer = GetContainer("containerName");
    CreateIfNotExists(outPutContainer);

    var fileItems = ListFileItems(folderName);
    fileItems = fileItems.Where(i => i.Name.Contains(".zip")).ToList();
    foreach (var file in fileItems)
    {
        using (var ziped = ZipFile.Read(GetMemoryStreamFromFile(folderName, file.Name)))
        {
            foreach (var zipEntry in ziped)
            {
                using (var outPutStream = new MemoryStream())
                {
                    zipEntry.Extract(outPutStream);
                    var blockblob = outPutContainer.GetBlockBlobReference(zipEntry.FileName);
                    outPutStream.Seek(0, SeekOrigin.Begin);
                    blockblob.UploadFromStream(outPutStream);
                }
            }
        }
    }
}
Another note. We are unable to change the way the multi-volume archive is generated. Any help would be appreciated.
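No answer is shown for this one, but one direction worth sketching (an untested sketch under assumptions, not a definitive solution): volumes named .zip.001, .zip.002, ... are, in the 7-Zip-style splitting scheme, raw byte slices of one ordinary zip file. If that is the case here, a read-only Stream that presents the parts as a single seekable sequence would let System.IO.Compression.ZipArchive walk the entries one at a time, with no 2 GB MemoryStream. Each part stream is assumed to be seekable (blob read streams generally are), and all names are illustrative.
using System;
using System.IO;

// Presents several seekable, read-only streams as one long stream.
class ConcatenatedReadStream : Stream
{
    private readonly Stream[] parts;
    private readonly long[] starts;  // absolute offset where each part begins
    private readonly long totalLength;
    private long position;

    public ConcatenatedReadStream(params Stream[] parts)
    {
        this.parts = parts;
        starts = new long[parts.Length];
        long pos = 0;
        for (int i = 0; i < parts.Length; i++)
        {
            starts[i] = pos;
            pos += parts[i].Length;
        }
        totalLength = pos;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return true; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { return totalLength; } }
    public override long Position
    {
        get { return position; }
        set { position = value; }
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int total = 0;
        while (count > 0 && position < totalLength)
        {
            // Locate the part that contains the current position.
            int i = Array.FindLastIndex(starts, s => s <= position);
            Stream part = parts[i];
            part.Position = position - starts[i];
            int n = part.Read(buffer, offset, (int)Math.Min(count, part.Length - part.Position));
            if (n == 0) break;
            position += n; offset += n; count -= n; total += n;
        }
        return total;
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        if (origin == SeekOrigin.Begin) position = offset;
        else if (origin == SeekOrigin.Current) position += offset;
        else position = totalLength + offset;
        return position;
    }

    public override void Flush() { }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}

// Usage sketch (blob API names are illustrative):
//   var partStreams = orderedPartBlobs.Select(b => b.OpenRead()).ToArray();
//   using (var combined = new ConcatenatedReadStream(partStreams))
//   using (var zip = new ZipArchive(combined, ZipArchiveMode.Read))
//       foreach (var entry in zip.Entries)
//           { /* entry.Open() and upload to the output container */ }
If the volumes instead come from a scheme that writes per-volume headers (classic PKZIP spanning), simple concatenation will not work, and a library with explicit multi-volume support would be needed.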

File.OpenRead() directories, subdirectories, files, and then write folders and files to another directory

I am looking to read folders and files from a directory structure like so, e.g.:
C:\RootFolder
    SubFolder1
    SubFolder2
        File1
        File2
    SubFolder3
        File3
    Files....
    Files....
I would like to read both files and folders and write them to another directory. I can't use copy, because the directory I want to write to is remote, not local.
I read the files here.... I'd love to be able to read folders and files and write both to another directory.
// Note: FileInfo here is the asker's own type (name + content stream),
// not System.IO.FileInfo, and DisposeEach is a custom extension.
public static IEnumerable<FileInfo> GetFiles(string dir)
{
    return Directory.EnumerateFiles(dir, "*", SearchOption.AllDirectories)
        .Select(path =>
        {
            var stream = File.OpenRead(path);
            return new FileInfo(Path.GetFileName(path), stream);
        })
        .DisposeEach(c => c.Content);
}
This function writes files to a remote SFTP site:
public Task Write(IEnumerable<FileInfo> files)
{
    return Task.Run(() =>
    {
        using (var sftp = new SftpClient(this.sshInfo))
        {
            sftp.Connect();
            sftp.ChangeDirectory(this.remoteDirectory);
            foreach (var file in files)
            {
                sftp.UploadFile(file.Content, file.RelativePath);
            }
        }
    });
}
In this function I write the files read by the function above:
private async static Task SendBatch(Config config, Batch batch, IRemoteFileWriter writer)
{
    var sendingDir = GetClientSendingDirectory(config, batch.ClientName);
    Directory.CreateDirectory(Path.GetDirectoryName(sendingDir));
    Directory.Move(batch.LocalDirectory, sendingDir);
    Directory.CreateDirectory(batch.LocalDirectory);

    // Use RemoteFileWriter...
    var files = GetFiles(sendingDir);
    await writer.Write(files).ContinueWith(t =>
    {
        if (t.IsCompleted)
        {
            var zipArchivePath = GetArchiveDirectory(config, batch);
            ZipFile.CreateFromDirectory(
                sendingDir,
                zipArchivePath + " " +
                DateTime.Now.ToString("yyyy-MM-dd hh mm ss") + ".Zip"
            );
        }
    });
}
Thank you!
You are getting UnauthorizedAccessException: Access to the path 'C:\temp' is denied because you can't open a stream on a folder; a folder doesn't contain bytes.
From what I can understand, you are looking to copy the files from one folder to another.
This answer seems to cover what you are doing. https://stackoverflow.com/a/3822913/3634581
Once you have copied the directories, you can then create the zip file.
If you don't need to copy the files and can just create the zip, I would recommend that, since it will reduce disk IO and speed up the process.
ZipArchive (https://msdn.microsoft.com/en-us/library/system.io.compression.ziparchive(v=vs.110).aspx) can be used to create a zip file straight to a stream.
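A minimal sketch of that last suggestion: creating the zip straight onto an output stream, so no copied directory tree is needed. The paths are illustrative, and the output stream here is a local file, though it could be any writable stream:
using (var outStream = File.Create(@"C:\Temp\batch.zip"))
using (var zip = new ZipArchive(outStream, ZipArchiveMode.Create))
{
    var root = @"C:\RootFolder";
    foreach (var path in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
    {
        // Store each file under its path relative to the root, so the
        // folder structure is preserved inside the archive.
        var entryName = path.Substring(root.Length + 1);
        zip.CreateEntryFromFile(path, entryName);
    }
}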
I figured it out; here is the solution:
public Task Write(IEnumerable<FileInfo> files)
{
    return Task.Run(() =>
    {
        using (var sftp = new SftpClient(this.sshInfo))
        {
            sftp.Connect();
            sftp.ChangeDirectory(this.remoteDirectory);
            foreach (var file in files)
            {
                var parts = Path.GetDirectoryName(file.RelativePath)
                    .Split(new[] { Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar },
                           StringSplitOptions.RemoveEmptyEntries);
                sftp.ChangeDirectory(this.remoteDirectory);
                foreach (var p in parts)
                {
                    try
                    {
                        sftp.ChangeDirectory(p);
                    }
                    catch (SftpPathNotFoundException)
                    {
                        // Create the directory on first sight, then step into it.
                        sftp.CreateDirectory(p);
                        sftp.ChangeDirectory(p);
                    }
                }
                sftp.UploadFile(file.Content, Path.GetFileName(file.RelativePath));
            }
        }
    });
}
The key point of the solution was this:
var parts = Path.GetDirectoryName(file.RelativePath)
    .Split(new[] { Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar },
           StringSplitOptions.RemoveEmptyEntries);
We call Path.GetDirectoryName on the file's relative path to get the directory that correlates to the file.
We split that directory path to get the individual folder names, then create each folder obtained from the split (when it doesn't already exist) and upload the file into it.
I hope this helps others who encounter this issue.

Copy files from directory recursively smallest first

I want to copy all files in a directory to a destination folder using Visual C# .NET version 3.5, which can be done pretty easily (taken from this answer):
private static void Copy(string sourceDir, string targetDir)
{
    Directory.CreateDirectory(targetDir);
    foreach (var file in Directory.GetFiles(sourceDir))
        File.Copy(file, Path.Combine(targetDir, Path.GetFileName(file)));
    foreach (var directory in Directory.GetDirectories(sourceDir))
        Copy(directory, Path.Combine(targetDir, Path.GetFileName(directory)));
}
Now, there's one little problem here: I want it to sort all files smallest first, so that if the source path is a removable drive that gets unplugged after some time, it would still have as much data as possible copied. With the above algorithm, if it takes a directory containing big files first and continues with a directory containing many smaller ones afterwards, there is the chance that the user will unplug his drive while the software is still copying a big file, and nothing will stay on the drive except the incomplete big file.
My idea was to do multiple loops: first, every file path would be put in a dictionary along with its size, then the dictionary would be sorted, and then every file would be copied from source to destination (including folder creation).
I'm afraid this is not a very neat solution, since looping twice over the same data just doesn't seem right to me. Also, I'm not sure my dictionary can store that much information if the source folder has too many different files and subfolders.
Are there any better options to choose from?
You could use a simpler method, based on the fact that you can get all the files in a directory subtree just by asking for them, without using recursion.
The missing piece of the problem is the file size. That information can be obtained through the DirectoryInfo and FileInfo classes, while the ordering is just a LINQ instruction applied to the sequence of files, as in the following example.
private static void Copy(string sourceDir, string targetDir)
{
    DirectoryInfo di = new DirectoryInfo(sourceDir);
    foreach (FileInfo fi in di.GetFiles("*.*", SearchOption.AllDirectories).OrderBy(d => d.Length))
    {
        // Trim the leading separator, otherwise Path.Combine treats
        // leftOver as a rooted path and ignores targetDir.
        string leftOver = fi.DirectoryName.Replace(sourceDir, "").TrimStart(Path.DirectorySeparatorChar);
        string destFolder = Path.Combine(targetDir, leftOver);

        // Because the files are listed in order by size we could copy a
        // file deep down in the subtree before the ones at the top of
        // sourceDir. Luckily CreateDirectory doesn't throw if the
        // directory exists and automatically creates all the
        // intermediate folders required.
        Directory.CreateDirectory(destFolder);

        // Just write the intended copy parameters as a proof of concept.
        Console.WriteLine($"{fi.Name} with size = {fi.Length} -> Copy from {fi.DirectoryName} to {Path.Combine(destFolder, fi.Name)}");
    }
}
In this example I have replaced the File.Copy call with a Console.WriteLine just to have a proof of concept without copying anything, but the substitution is trivial.
Notice also that it is better to use EnumerateFiles instead of GetFiles, as explained in the MSDN documentation.
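For reference, the streaming variant is a drop-in change; note that OrderBy still has to consume the whole sequence before the first copy, so the gain is in memory use, not total enumeration time:
foreach (FileInfo fi in di.EnumerateFiles("*.*", SearchOption.AllDirectories).OrderBy(d => d.Length))
{
    // ... same body as above ...
}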
I hope this helps!
First, get all the files from the source directory, with recursion being optional. Then copy all the files, ordered by size, to the target directory.
void CopyFiles(string sourceDir, string targetDir, bool recursive = false)
{
    foreach (var file in GetFiles(sourceDir, recursive).OrderBy(f => f.Length))
    {
        var subDir = file.DirectoryName
            .Replace(sourceDir, String.Empty)
            .TrimStart(Path.DirectorySeparatorChar);
        var fullTargetDir = Path.Combine(targetDir, subDir);
        if (!Directory.Exists(fullTargetDir))
            Directory.CreateDirectory(fullTargetDir);
        file.CopyTo(Path.Combine(fullTargetDir, file.Name));
    }
}
IEnumerable<FileInfo> GetFiles(string directory, bool recursive)
{
    var files = new List<FileInfo>();
    if (recursive)
    {
        foreach (var subDirectory in Directory.GetDirectories(directory))
            files.AddRange(GetFiles(subDirectory, true));
    }
    foreach (var file in Directory.GetFiles(directory))
        files.Add(new FileInfo(file));
    return files;
}

Enumerate zipped contents of unzipped folder

I am trying to enumerate the zipped folders that are inside an unzipped folder using Directory.GetDirectories(folderPath).
The problem I have is that it does not seem to find the zipped folders; when I come to iterate over the string[], it is empty.
Is Directory.GetDirectories() the wrong way to go about this and if so what method serves this purpose?
Filepath example: C:\...\...\daily\daily\{series of zipped folders}
public void CheckZippedDailyFolder(string folderPath)
{
    if (string.IsNullOrEmpty(folderPath))
        throw new Exception("Folder path required");

    foreach (var folder in Directory.GetDirectories(folderPath))
    {
        var unzippedFolder = Compression.Unzip(folder + ".zip", folderPath);
        using (TextReader reader = File.OpenText(unzippedFolder + @"\" + new DirectoryInfo(folderPath).Name))
        {
            var csv = new CsvReader(reader);
            var field = csv.GetField(0);
            Console.WriteLine(field);
        }
    }
}
GetDirectories is the wrong thing to use. Explorer lies to you; zip files are actually files with a .zip extension, not real directories at the file system level.
Look at:
https://msdn.microsoft.com/en-us/library/system.io.compression.ziparchive.entries%28v=vs.110%29.aspx (ZipArchive.Entries) and/or
https://msdn.microsoft.com/en-us/library/system.io.compression.zipfile%28v=vs.110%29.aspx (ZipFile) to see how to deal with them.
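Putting that together, a minimal sketch: enumerate the *.zip files (not directories) in the folder, then list each archive's entries without extracting anything. The folder path is illustrative:
foreach (var zipPath in Directory.EnumerateFiles(@"C:\Data\daily\daily", "*.zip"))
{
    using (var archive = ZipFile.OpenRead(zipPath))
    {
        foreach (var entry in archive.Entries)
            Console.WriteLine(zipPath + ": " + entry.FullName);
    }
}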

c# zip file - Extract file last

Quick question: I need to extract a zip file and have a certain file extracted last.
More info: I know how to extract a zip file with C# (FW 4.5).
The problem I'm having now is that I have a zip file, and inside it there is always a file named (for example) "myFlag.xml", plus a few more files.
Since I need to support some old applications that listen to the folder I'm extracting to, I want to make sure that the XML file is always extracted last.
Is there something like an "exclude" option for the zip function that can extract everything but a certain file, so I can do that first and then extract only that file on its own?
Thanks.
You could loop over the ZipArchive entries with foreach, extract everything that doesn't match the flag file name, and then, after the loop is done, extract that last file.
Something like this:
private void TestUnzip_Foreach()
{
    using (ZipArchive z = ZipFile.Open("zipfile.zip", ZipArchiveMode.Read))
    {
        string LastFile = "lastFileName.ext";
        int curPos = 0;
        int lastFilePosition = -1;
        foreach (ZipArchiveEntry entry in z.Entries)
        {
            if (entry.Name != LastFile)
            {
                entry.ExtractToFile(@"C:\somewhere\" + entry.FullName);
            }
            else
            {
                // Remember where the flag file is; extract it after the loop.
                lastFilePosition = curPos;
            }
            curPos++;
        }
        if (lastFilePosition >= 0)
            z.Entries[lastFilePosition].ExtractToFile(@"C:\somewhere_else\" + LastFile);
    }
}
