Read a part of a file contained in a zip archive - c#

This question is similar to How to read data from a zip file without having to unzip the entire file excepted I'd like to go a bit further and read only a part of a file, i.e. get the file stream, and seek to a position for which I know the offset in bytes. I don't know if the zip format allows that in the first place.
I've tried to seek in the stream returned by ZipArchive.Entries[...].Open(), but it throws, saying the operation is not supported.
I can of course just read (and discard) the contents up to the point I'm interested in, but this is slow for big files.
EDIT: An example to make clear what I want to do:
Let's say I have a file archive.zip containing several files, one of them is bigfile.bin. I already know how to decompress bigfile.bin without decompressing the other files of archive.zip, no problem there. My question is: can I skip 10 000 000 bytes of bigfile.bin and start reading what remains? Those 10 000 000 bytes would be measured in the decompressed stream of course.
using (var archive = new ZipArchive("archive.zip"))
{
using (var data = archive.Entries.Single(e => e.Name == "bigfile.bin").Open())
{
data.Seek(10000000, SeekOrigin.Begin); // this is what I want to do but it doesn't work
data.Read(/*etc*/);
}
}

Related

Difference between loading a file from a path and from a stream C#

This is a follow up question to this question:
Difference between file path and file stream?
I didn't fully understand everything answered in the linked question.
I am using the Microsoft.SqlServer.Dac.BacPackage which contains a Load method with 2 overloads - one that receives a string path and one that receives a Stream.
This is the documentation of the Load method:
https://learn.microsoft.com/en-us/dotnet/api/microsoft.sqlserver.dac.bacpackage.load?view=sql-dacfx-150
What exactly is the difference between the two? Am I correct in assuming that the overloading of the string path saves all the file in the memory first, while the stream isn't? Are there other differences?
No, the file will not usually be fully loaded all at once.
A string path parameter normally means it will just open the file as a FileStream and pass it to the other version of the function. There is no reason why the stream should fully load the file into memory unless requested.
A Stream parameter means you open the file and pass the resulting Stream. You could also pass any other type of Stream, such as a network stream, a zip or decryption stream, a memory-backed stream, anything really.
Short answer:
The fact that you have two methods, one that accepts a filename and one that accepts a stream is just for convenience. Internally, the one with the filename will open the file as a stream and call the other method.
Longer answer
You can consider a stream as a sequence of bytes. The reason to use a stream instead of a byte[] or List<byte>, is, that if the sequence is really, really large, and you don't need to have access to all bytes at once, it would be a waste to put all bytes in memory before processing them.
For instance, if you want to calculate the checksum for all bytes in a file: you don't need to put all data in memory before you can start calculating the sum. In fact, anything that efficiently can deliver you the bytes one by one would suffice.
That is the reason why people would want to read a file as a stream.
The reason why people want a stream as input for their data, is that they want to give the caller the opportunity to specify the source of their data: callers can provide a stream that reads from a file, but also a stream with data from the internet, or from a database, or from a textBox, the procedure does not care, as long as it can read the bytes one by one or sometimes per chunk of bytes:
using (Stream fileStream = File.Open(fileName)
{
ProcessInputData(fileStream);
}
Or:
byte[] bytesToProcess = ...
using (Stream memoryStream = new MemoryStream(bytesToProcess))
{
ProcessInputData(memoryStream);
}
Or:
string operatorInput = this.textBox1.Text;
using (Stream memoryStream = new MemoryStream(operatorInput))
{
ProcessInputData(memoryStream);
}
Conclusioin
Methods use streams in their interface to indicate that they don't need all data in memory at once. One-by-one, or per chunk is enough. The caller is free to decide where the data comes from.

Appending bytes to an existing file C#

I am working on a steganography software in C#, more precisely for video files. My approach is to append the extra information at the end of a video file. However, I must read the whole video file in memory first. I used the File.ReadAllBytes() function in C# to read a video file (video around 200MB) into a byte array. Then I create a new array with the video's bytes and my data's bytes. But, this sometimes causes an OutOfMemoryException. And when it doesn't it is very slow. Is there a more efficient way to append bytes to an existing file in C# which will solve this issue? Thank you.
Open the file with FileMode.Append
var stream = new FileStream(path, FileMode.Append)
FileMode Enumeration
FileMode.Append:
Opens the file if it exists and seeks to the end of the file, or
creates a new file. This requires FileIOPermissionAccess.Append
permission. FileMode.Append can be used only in conjunction with
FileAccess.Write. Trying to seek to a position before the end of the
file throws an IOException exception, and any attempt to read fails
and throws a NotSupportedException exception.
Sure, it's easy:
using (var stream = File.Open(path, FileMode.Append))
{
stream.Write(extraData);
}
No need to read the file first.
I wouldn't class this as steganography though - that would involve making subtle changes to the video frames such that it's still a valid video and looks the same to the human eye, but the extra data is encoded within those frames so it can be extracted later.
attempt this method, I am unsure if it will yield faster results, but logically it should.
: https://stackoverflow.com/a/6862460/2835725

DotNetZip Creates corrupt archives (bad CRC)

There's a strange problem with DotNetZip that I can't seem to find a solution to.
I've searched for a few hours now and I just can't find anything on this, so here goes.
var ms = new MemoryStream();
using (var archive = new Ionic.Zip.ZipFile()) {
foreach (var file in files) {
// string byte[]
var entry = archive.AddEntry(file.Name, file.Data);
entry.ModifiedTime = DateTime.Now.AddYears(10); // Just for testing
}
archive.Save(ms);
}
return ms.GetBuffer();
I need to add the modified time, which is rather crucial, but right now I just have a dummy timestamp.
When I open the file with WinRAR, it says "Unexpected end of archive". Each individual file has checksum 00000000, and WinRAR says "The archive is either in unknown format or damaged". I can repair it, which brings it down 20% in size and makes everything OK. But that's not really useful..
When I make a breakpoint after adding all the entries, I can see in zip.Entries that all the entries have that same bad CRC, but all the data seems to be there.
So it shouldn't be the way I save the archive that's the problem.
I use my file collection elsewhere without problems, which adds to DotNetZip being weird. Well either that or I misunderstand something :)
GetBuffer is certainly wrong. It returns the internal buffer of the MemoryStream, which is often bigger than the actual content.
To return an array that only contains the actual content, use ToArray().
Or you could carefully handle the incompletely filled buffer in the consuming code. This would reduce GC pressure, since you don't need to allocate a whole new array for the return value.
If the zip-archive is large, I'd also consider saving to a file directly, instead of assembling the archive in-memory.

Reading a MemoryStream which contains multiple files

If I have a single MemoryStream of which I know I sent multiple files (example 5 files) to this MemoryStream. Is it possible to read from this MemoryStream and be able to break apart file by file?
My gut is telling me no since when we Read, we are reading byte by byte... Any help and a possible snippet would be great. I haven't been able to find anything on google or here :(
You can't directly, not if you don't delimit the files in some way or know the exact size of each file as it was put into the buffer.
You can use a compressed file such as a zip file to transfer multiple files instead.
A stream is just a line of bytes. If you put the files next to each other in the stream, you need to know how to separate them. That means you must know the length of the files, or you should have used some separator. Some (most) file types have a kind of header, but looking for this in an entire stream may not be waterproof either, since the header of a file could just as well be data in another file.
So, if you need to write files to such a stream, it is wise to add some extra information. For instance, start with a version number, then, write the size of the first file, write the file itself and then write the size of the next file, etc....
By starting with a version number, you can make alterations to this format. In the future you may decide you need to store the file name as well. In that case, you can increase version number, make up a new format, and still be able to read streams that you created earlier.
This is of course especially useful if you store these streams too.
Since you're sending them, you'll have to send them into the stream in such a way that you'll know how to pull them out. The most common way of doing this is to use a length specification. For example, to write the files to the stream:
write an integer to the stream to indicate the number of files
Then for each file,
write an integer (or a long if the files are large) to indicate the number of bytes in the file
write the file
To read the files back,
read an integer (n) to determine the number of files in the stream
Then, iterating n times,
read an integer (or long if that's what you chose) to determine the number of bytes in the file
read the file
You could use an IEnumerable<Stream> instead.
You need to implement this yourself, what you would want to do is write in some sort of 'delimited' into the stream. As you're reading, look for that delimited, and you'll know when you have hit a new file.
Here's a quick and dirty example:
byte[] delimiter = System.Encoding.Default.GetBytes("++MyDelimited++");
ms.Write(myFirstFile);
ms.Write(delimiter);
ms.Write(mySecondFile);
....
int len;
do {
len = ms.ReadByte(buffer, lastOffest, delimiter.Length);
if(buffer == delimiter)
{
// Close and open a new file stream
}
// Write buffer to output stream
} while(len > 0);

Is there a way to make this faster? MemoryStream vs FileStream

I am working with iTextSharp, and need to generate hundreds of thousands of RTF documents - the resulting files are between 5KB and 500KB.
I am listing 2 approaches below - the original approach wasn't necessarily slow, but I figured why write and retrieve to/from file to get the output string I need. I saw this other approach using MemoryStream, but it actually slowed things down. I essentially just need the outputted RTF content, so that I can run some filters on that RTF to clean up unnecessary formatting. The queries bringing back the data are very quick instant seeming . To generate a 1000 files (actually 2000 files are created in process) with original approach files takes about 15 minutes, the same with second approach takes about 25-30 minutes. The resulting files that I've run are averaging around 80KB.
Is there something wrong with the second approach? Seems like it should be faster than the first one, not slower.
Original approach:
RtfWriter2.GetInstance(doc, new FileStream(RTFFilePathName, FileMode.Create));
doc.Open();
//Add Tables and stuff here
doc.Close(); //It saves a file here to (RTFPathFileName)
StreamReader srRTF = new StreamReader(RTFFilePathName);
string rtfText = srRTF.ReadToEnd();
srRTF.Close();
//Do additional things with rtfText before writing to my final file
New approach, trying to speed it up but this is actually half as fast:
MemoryStream stream = new MemoryStream();
RtfWriter2.GetInstance(doc, stream);
doc.Open();
//Add Tables and stuff here
doc.Close();
string rtfText =
ASCIIEncoding.ASCII.GetString(stream.GetBuffer());
stream.Close();
//Do additional things with rtfText before writing to my final file
The second approach I am trying I found here:
iTextSharp - How to generate a RTF document in the ClipBoard instead of a file
How big your resulting stream is? MemoryStream performs a lot of memory copy operations while growing, so for large results it may take significantly longer to write data by small chunks compared with FileStream.
To verify if it is the problem set inital size of MemoryStream to some large value around resulting size and re-run the code.
To fix it you can pre-grow memory stream initially (if you know approximate output) or write your own stream that uses different scheme when growing. Also using temporary file might be good enough for your purposes as is.
Like Alexei said, its probably caused by fact, yo are creating MemoryStream every time, and every time it continously re-alocates memory as it grows. Try creating only 1 stream and reset it to begining before every write.
Also I think stream.GetBuffer() again returns new memory, so try using same StreamReader with your MemoryStream.
And it seems your code can be easily paralelised, so you can try run it using Paralel Extesions or using TreadPool.
And it seems little weird, you are writing your text as bytes in stream, then reading this stream as bytes and converting to text. Wouldnt it be possible to save your document directly as text?
A MemoryStream is not associated with a file, and has no concept of a filename. Basically, you can't do that.
You certainly can't cast between them; you can only cast upwards an downwards - not sideways; to visualise:
Stream
|
| |
FileStream MemoryStream
You can cast a MemoryStream to a Stream trivially, and a Stream to a MemoryStream via a type-check; but never a FileStream to a MemoryStream. That is like saying a dog is an animal, and an elephant is an animal, so we can cast a dog to an elephant.
You could subclass MemoryStream and add a Name property (that you supply a value for), but there would still be no commonality between a FileStream and a YourCustomMemoryStream, and FileStream doesn't implement a pre-existing interface to get a Name; so the caller would have to explicitly handle both separately, or use duck-typing (maybe via dynamic or reflection).
Another option (perhaps easier) might be: write your data to a temporary file; use a FileStream from there; then (later) delete the file.
I know this is old but there is a lot of misinformation in this thread.
It's all about buffer size. The internal buffers are significantly smaller with a memory stream vs a file stream. Smaller buffers cause more read\writes.
Just intilaize your memory stream with either a file stream or a byte array with a size of around 80k. Close the doc, set stream position to 0 and read to end the contents.
On a side note, get buffer will return the whole allocated buffer. So if you only wrote 1 byte and the buffer is 4k, you will have a lot of garbage in your string.

Categories