DotNetZip Creates corrupt archives (bad CRC) - c#

There's a strange problem with DotNetZip that I can't seem to find a solution to.
I've searched for a few hours now and I just can't find anything on this, so here goes.
var ms = new MemoryStream();
using (var archive = new Ionic.Zip.ZipFile()) {
foreach (var file in files) {
// string byte[]
var entry = archive.AddEntry(file.Name, file.Data);
entry.ModifiedTime = DateTime.Now.AddYears(10); // Just for testing
}
archive.Save(ms);
}
return ms.GetBuffer();
I need to add the modified time, which is rather crucial, but right now I just have a dummy timestamp.
When I open the file with WinRAR, it says "Unexpected end of archive". Each individual file has checksum 00000000, and WinRAR says "The archive is either in unknown format or damaged". I can repair it, which brings it down 20% in size and makes everything OK. But that's not really useful..
When I make a breakpoint after adding all the entries, I can see in zip.Entries that all the entries have that same bad CRC, but all the data seems to be there.
So it shouldn't be the way I save the archive that's the problem.
I use my file collection elsewhere without problems, which adds to DotNetZip being weird. Well either that or I misunderstand something :)

GetBuffer is certainly wrong. It returns the internal buffer of the MemoryStream, which is often bigger than the actual content.
To return an array that only contains the actual content, use ToArray().
Or you could carefully handle the incompletely filled buffer in the consuming code. This would reduce GC pressure, since you don't need to allocate a whole new array for the return value.
If the zip-archive is large, I'd also consider saving to a file directly, instead of assembling the archive in-memory.

Related

Reading pictures from DB using memorystreams and images often crashes with Out of Memory

I'm trying to figure out if there is something seriously wrong with the following code. It reads the binary from the database, stores it as a picture and associates with an object of an Animal record.
For each row (record of an animal):
byte[] ba = (byte[])x.ItemArray[1]; //reading binary from a DB row
using (MemoryStream m=new MemoryStream(ba))
{
Image i = Image.FromStream(m); //exception thrown occassionally
c.Photo = i;
listOfAnimals.Add(c);
}
First of all, with 18 pictures loaded (the JPG files have 105 Mb in total), the running app uses 2 gb of memory. With no pictures loaded, it is only 500 Mb.
Often the exception gets raised in the marked point, the source of which is System Drawing.
Could anyone help me optimize the code or tell me what the problem is? I must have used some wrong functions...
According to Image.FromStream Method
OutOfMemoryException
The stream does not have a valid image format.
Remarks
You must keep the stream open for the lifetime of the Image.
The stream is reset to zero if this method is called successively with the same stream.
For more information see: Loading an image from a stream without keeping the stream open and Returning Image using Image.FromStream
Try the following:
Create a method to convert byte[] to image
ConvertByteArrayToImage
public static Image ConvertByteArrayToImage(byte[] buffer)
{
using (MemoryStream ms = new MemoryStream(buffer))
{
return Image.FromStream(ms);
}
}
Then:
byte[] ba = (byte[])x.ItemArray[1]; //reading binary from a DB row
c.Photo = ConvertByteArrayToImage(ba);
listOfAnimals.Add(c);
Checking the documentation, a possible reason for out of memory exceptions are that the stream is not a valid image. If this is the case it should fail reliably for a given image, so check if any particular source image is causing this issue.
Another possibility should be that you simply run out of memory. Jpeg typically gets a 10:1 compression level, so 105Mib of compressed data could use +1Gib of memory. I would recommend switching to x64 if at all possible, I see be little reason not to do so today.
There could also be a memory leak, the best way to investigate this would be with a memory profiler. This might be in just about any part of your code, so it is difficult to know without profiling.
You might also need to care about memory fragmentation. Large datablocks are stored in the large object heap, and this is not automatically defragmented. So after running a while you might still have memory available, just not in any continuous block. Again, switching to x64 would mostly solve this problem.
Also, as mjwills comments, please do not store large files in the database. I just spent several hours recovering a huge database, something that would have been much faster if images where stored as files instead.

How do I access the data in a Avro.snz file with C#

I have an Avro.snz file whose
avro.codecs is snappy
This can be opened with com.databricks.avro in Spark but it seems snappy is unsupported by Apache.Avro and Confluent.Avro, they only have deflate and null. Although they can get me the Schema, I cannot get at the data.
The next method gets and error. Ironsnappy is unable to decompress the file too, it says the input is
using (Avro.File.IFileReader<generic> reader = Avro.File.DataFileReader<generic>.OpenReader(avro_path))
{
schema = reader.GetSchema();
Console.WriteLine(reader.HasNext()); //true
var hi = reader.Next(); // error
Console.WriteLine(hi.ElementAt(0).ToString()); // error
}
I'm starting to wonder if there is anything in the Azure HDInsight library, but I cant seem to find the nuget package that gives me a way to read Avro with support for Snappy compression.
I'm open to any solution, even if that means downloading the source for Apache.Avro and adding in Snappy support manually, but to be honest, I'm sort of a newbie and have no idea how compression even works let alone add support to a library.
Can anyone help?
Update:
Just adding the snappy codec to Apache.Avro and changing the DeflateStream to Ironsnappy stream failed. It gave Corrupt input again. Is there anything anywhere that can open Snappy compressed Avro files with C#?
Or how do I determine what part of the Avro is snappy compressed and pass that to Ironsnappy.
Ok, so not even any comments on this. But I eventually solved my problem. Here is how I solved it.
I tried Apache.Avro and Confluent version as well, but their .net version has no snappy support darn. But I can get the schema as that is uncompressed apparently.
Since Parquet.Net uses IronSnappy, I built/added out the snappy codec in Apache.Avro by basically cloning its deflate code and changing a few names. Failed. Corrupt input Ironsnappy says.
I research Avro and see that it is seperated by an uncompressed Schema, followed by the name of the compression codec of the data, then the data itself, which are divided into blocks. Well, I have no idea where a block starts and ends. Somehow the binary in the file gives that info somehow, but I still have no idea, I couldn't get it with a hex editor even. I think Apache.Avro takes a long or a varint somehow, and the hex editor I used doesn't give me that info.
I found the avro-tools.jar tool inside Apache.Avro. To make it easier to use, I made it an executable with launch4j totally superfluous move but whatever. Then I used that cat my avro into 1 row, uncompressed and snappy. I used that as my base and followed the flow of Apache.Avro in the debugger. While also tracking the index of bytes and such with the hex editor and the debugger in C#.
With 1 row, it is guaranteed 1 block. So I ran a loop on the byte start index and end index. I found my Snappy block and was able to decompress it with IronSnappy. I modified the codec portion of my Apache.Avro snappy codec code to make it work with 1 block. (which was basically whatever block Apache.Avro took minus 4 bytes which I assume is the Snappy CRC check which I ignored.
It fails with multi blocks. I found its because Apache.Avro always throws the deflate codec a 4096 byte array after the first block. I reduced it to read size and did the minus 4 size thing again. It worked.
Success! So basically it was copy over deflate as a template for snappy, reduce block byte by 4, then make sure to resize the byte array to block byte size before getting Ironsnappy to decompress.
public override byte[] Decompress(byte[] compressedData)
{
int snappySize = compressedData.Length - 4;
byte[] compressedSnappy_Data = new byte[snappySize];
System.Array.Copy(compressedData, compressedSnappy_Data, snappySize);
byte[] result = IronSnappy.Snappy.Decode(compressedSnappy_Data);
return result;
}
if (_codec.GetHashCode() == DataFileConstants.SnappyCodecHash)
{
byte[] snappyBlock = new byte[(int)_currentBlock.BlockSize];
System.Array.Copy(_currentBlock.Data, snappyBlock, (int)_currentBlock.BlockSize);
_currentBlock.Data = snappyBlock;
}
I didn't bother with actually using the checksum as I don't know how or need to? At least not right now. And I totally ignored the compress function.
but if you really want my compress function here it is
public override byte[] Compress(byte[] uncompressedData)
{
return new byte[0];
}
The simplest solution would be to use:
ResultModel resultObject = AvroConvert.Deserialize<ResultModel>(byte[] avroObject);
From https://github.com/AdrianStrugala/AvroConvert
null
deflate
snappy
gzip
codes are supported

Appending bytes to an existing file C#

I am working on a steganography software in C#, more precisely for video files. My approach is to append the extra information at the end of a video file. However, I must read the whole video file in memory first. I used the File.ReadAllBytes() function in C# to read a video file (video around 200MB) into a byte array. Then I create a new array with the video's bytes and my data's bytes. But, this sometimes causes an OutOfMemoryException. And when it doesn't it is very slow. Is there a more efficient way to append bytes to an existing file in C# which will solve this issue? Thank you.
Open the file with FileMode.Append
var stream = new FileStream(path, FileMode.Append)
FileMode Enumeration
FileMode.Append:
Opens the file if it exists and seeks to the end of the file, or
creates a new file. This requires FileIOPermissionAccess.Append
permission. FileMode.Append can be used only in conjunction with
FileAccess.Write. Trying to seek to a position before the end of the
file throws an IOException exception, and any attempt to read fails
and throws a NotSupportedException exception.
Sure, it's easy:
using (var stream = File.Open(path, FileMode.Append))
{
stream.Write(extraData);
}
No need to read the file first.
I wouldn't class this as steganography though - that would involve making subtle changes to the video frames such that it's still a valid video and looks the same to the human eye, but the extra data is encoded within those frames so it can be extracted later.
attempt this method, I am unsure if it will yield faster results, but logically it should.
: https://stackoverflow.com/a/6862460/2835725

Reading a MemoryStream which contains multiple files

If I have a single MemoryStream of which I know I sent multiple files (example 5 files) to this MemoryStream. Is it possible to read from this MemoryStream and be able to break apart file by file?
My gut is telling me no since when we Read, we are reading byte by byte... Any help and a possible snippet would be great. I haven't been able to find anything on google or here :(
You can't directly, not if you don't delimit the files in some way or know the exact size of each file as it was put into the buffer.
You can use a compressed file such as a zip file to transfer multiple files instead.
A stream is just a line of bytes. If you put the files next to each other in the stream, you need to know how to separate them. That means you must know the length of the files, or you should have used some separator. Some (most) file types have a kind of header, but looking for this in an entire stream may not be waterproof either, since the header of a file could just as well be data in another file.
So, if you need to write files to such a stream, it is wise to add some extra information. For instance, start with a version number, then, write the size of the first file, write the file itself and then write the size of the next file, etc....
By starting with a version number, you can make alterations to this format. In the future you may decide you need to store the file name as well. In that case, you can increase version number, make up a new format, and still be able to read streams that you created earlier.
This is of course especially useful if you store these streams too.
Since you're sending them, you'll have to send them into the stream in such a way that you'll know how to pull them out. The most common way of doing this is to use a length specification. For example, to write the files to the stream:
write an integer to the stream to indicate the number of files
Then for each file,
write an integer (or a long if the files are large) to indicate the number of bytes in the file
write the file
To read the files back,
read an integer (n) to determine the number of files in the stream
Then, iterating n times,
read an integer (or long if that's what you chose) to determine the number of bytes in the file
read the file
You could use an IEnumerable<Stream> instead.
You need to implement this yourself, what you would want to do is write in some sort of 'delimited' into the stream. As you're reading, look for that delimited, and you'll know when you have hit a new file.
Here's a quick and dirty example:
byte[] delimiter = System.Encoding.Default.GetBytes("++MyDelimited++");
ms.Write(myFirstFile);
ms.Write(delimiter);
ms.Write(mySecondFile);
....
int len;
do {
len = ms.ReadByte(buffer, lastOffest, delimiter.Length);
if(buffer == delimiter)
{
// Close and open a new file stream
}
// Write buffer to output stream
} while(len > 0);

Is there a way to make this faster? MemoryStream vs FileStream

I am working with iTextSharp, and need to generate hundreds of thousands of RTF documents - the resulting files are between 5KB and 500KB.
I am listing 2 approaches below - the original approach wasn't necessarily slow, but I figured why write and retrieve to/from file to get the output string I need. I saw this other approach using MemoryStream, but it actually slowed things down. I essentially just need the outputted RTF content, so that I can run some filters on that RTF to clean up unnecessary formatting. The queries bringing back the data are very quick instant seeming . To generate a 1000 files (actually 2000 files are created in process) with original approach files takes about 15 minutes, the same with second approach takes about 25-30 minutes. The resulting files that I've run are averaging around 80KB.
Is there something wrong with the second approach? Seems like it should be faster than the first one, not slower.
Original approach:
RtfWriter2.GetInstance(doc, new FileStream(RTFFilePathName, FileMode.Create));
doc.Open();
//Add Tables and stuff here
doc.Close(); //It saves a file here to (RTFPathFileName)
StreamReader srRTF = new StreamReader(RTFFilePathName);
string rtfText = srRTF.ReadToEnd();
srRTF.Close();
//Do additional things with rtfText before writing to my final file
New approach, trying to speed it up but this is actually half as fast:
MemoryStream stream = new MemoryStream();
RtfWriter2.GetInstance(doc, stream);
doc.Open();
//Add Tables and stuff here
doc.Close();
string rtfText =
ASCIIEncoding.ASCII.GetString(stream.GetBuffer());
stream.Close();
//Do additional things with rtfText before writing to my final file
The second approach I am trying I found here:
iTextSharp - How to generate a RTF document in the ClipBoard instead of a file
How big your resulting stream is? MemoryStream performs a lot of memory copy operations while growing, so for large results it may take significantly longer to write data by small chunks compared with FileStream.
To verify if it is the problem set inital size of MemoryStream to some large value around resulting size and re-run the code.
To fix it you can pre-grow memory stream initially (if you know approximate output) or write your own stream that uses different scheme when growing. Also using temporary file might be good enough for your purposes as is.
Like Alexei said, its probably caused by fact, yo are creating MemoryStream every time, and every time it continously re-alocates memory as it grows. Try creating only 1 stream and reset it to begining before every write.
Also I think stream.GetBuffer() again returns new memory, so try using same StreamReader with your MemoryStream.
And it seems your code can be easily paralelised, so you can try run it using Paralel Extesions or using TreadPool.
And it seems little weird, you are writing your text as bytes in stream, then reading this stream as bytes and converting to text. Wouldnt it be possible to save your document directly as text?
A MemoryStream is not associated with a file, and has no concept of a filename. Basically, you can't do that.
You certainly can't cast between them; you can only cast upwards an downwards - not sideways; to visualise:
Stream
|
| |
FileStream MemoryStream
You can cast a MemoryStream to a Stream trivially, and a Stream to a MemoryStream via a type-check; but never a FileStream to a MemoryStream. That is like saying a dog is an animal, and an elephant is an animal, so we can cast a dog to an elephant.
You could subclass MemoryStream and add a Name property (that you supply a value for), but there would still be no commonality between a FileStream and a YourCustomMemoryStream, and FileStream doesn't implement a pre-existing interface to get a Name; so the caller would have to explicitly handle both separately, or use duck-typing (maybe via dynamic or reflection).
Another option (perhaps easier) might be: write your data to a temporary file; use a FileStream from there; then (later) delete the file.
I know this is old but there is a lot of misinformation in this thread.
It's all about buffer size. The internal buffers are significantly smaller with a memory stream vs a file stream. Smaller buffers cause more read\writes.
Just intilaize your memory stream with either a file stream or a byte array with a size of around 80k. Close the doc, set stream position to 0 and read to end the contents.
On a side note, get buffer will return the whole allocated buffer. So if you only wrote 1 byte and the buffer is 4k, you will have a lot of garbage in your string.

Categories