Zlib compression incompatible between Ionic.Zip, System.IO.Compression and SharpCompress - c#

I am working on porting a file format (OMF) into C#. Part of the storage in the file is an array of zlib compressed data. An existing version of the file formatter uses a static method from Ionic.Zip to read the file, as follows:
public static byte[] Uncompress(this byte[] value)
{
    // Uncompress
    return ZlibStream.UncompressBuffer(value);
}
The project I am working on already uses SharpCompress, and using two different compression libraries seemed wasteful, so I figured I would rewrite it to use SharpCompress. SharpCompress does not have the UncompressBuffer static function that Ionic does, so I implemented it as follows, which seemed to be a pretty standard approach from my reading:
using (var originalStream = new MemoryStream(value))
{
    using (var decompressedStream = new MemoryStream())
    {
        using (var decompressor = new ZlibStream(originalStream, CompressionMode.Decompress))
        {
            decompressor.CopyTo(decompressedStream);
            return decompressedStream.ToArray();
        }
    }
}
I have also tried a similar approach using the System.IO.Compression.DeflateStream class, following the pattern shown in the Microsoft docs. However, in both cases, at the CopyTo call, I get an exception indicating there is an issue with the data:
For Zlib: 'Zlib Exception: Bad state (incorrect data check)'
For Windows: 'Block length does not match with its complement'
It could be that I am missing something that differentiates UncompressBuffer from this method of decompression, but it looks like UncompressBuffer works with internal portions of the Zlib class.
What am I doing wrong here? Is there a difference between the implementations of the two zip libraries that makes them incompatible?

The code below runs, which shows that Ionic.Zip and SharpCompress can at least round-trip each other's output. That suggests the problem lies in something specific to the payload you are trying to decompress.
class Program
{
    static void Main(string[] args)
    {
        var rnd = new Random(0);
        var raw = new byte[1024];
        rnd.NextBytes(raw);
        raw = System.Text.Encoding.UTF8.GetBytes(System.Convert.ToBase64String(raw));
        var ionicCompressed = Ionic.Zlib.ZlibStream.CompressBuffer(raw);
        var sharpCompressed = DoSharpCompress(raw);
        var ionicDecompressIonic = Ionic.Zlib.ZlibStream.UncompressBuffer(sharpCompressed);
        var ionicDecompressSharp = Ionic.Zlib.ZlibStream.UncompressBuffer(ionicCompressed);
        var sharpDecompressSharp = DoSharpDecompress(sharpCompressed);
        var sharpDecompressIonic = DoSharpDecompress(ionicCompressed);
        AssertEqual(ionicDecompressIonic, ionicDecompressSharp);
        AssertEqual(sharpDecompressSharp, sharpDecompressIonic);
        AssertEqual(ionicDecompressSharp, sharpDecompressIonic);
        AssertEqual(raw, sharpDecompressIonic);
        Console.WriteLine(System.Text.Encoding.UTF8.GetString(raw));
        Console.WriteLine(System.Text.Encoding.UTF8.GetString(sharpDecompressIonic));
        Console.WriteLine(System.Text.Encoding.UTF8.GetString(raw) == System.Text.Encoding.UTF8.GetString(sharpDecompressIonic));
        Console.ReadLine();
    }

    static byte[] DoSharpCompress(byte[] uncompressed)
    {
        var sc1 = new SharpCompress.Compressors.Deflate.ZlibStream(new MemoryStream(uncompressed), SharpCompress.Compressors.CompressionMode.Compress);
        var sc2 = new MemoryStream();
        sc1.CopyTo(sc2);
        return sc2.ToArray();
    }

    static byte[] DoSharpDecompress(byte[] compressed)
    {
        var sc1 = new SharpCompress.Compressors.Deflate.ZlibStream(new MemoryStream(compressed), SharpCompress.Compressors.CompressionMode.Decompress);
        var sc2 = new MemoryStream();
        sc1.CopyTo(sc2);
        return sc2.ToArray();
    }

    static bool AssertEqual(byte[] a, byte[] b)
    {
        if (!a.SequenceEqual(b))
            throw new Exception();
        return true;
    }
}
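One difference worth noting, separate from the test above: System.IO.Compression.DeflateStream works on raw deflate data and does not understand the two-byte zlib header or the trailing Adler-32 checksum, which is exactly the kind of mismatch that produces the 'Block length does not match with its complement' error. If the payload really is a standard zlib stream, a sketch along these lines may work with the built-in classes (on .NET 6+ there is also System.IO.Compression.ZLibStream, which handles the header itself):
// Sketch: decompress zlib-wrapped data with the built-in DeflateStream by
// skipping the 2-byte zlib header (e.g. 0x78 0x9C); the trailing Adler-32
// checksum is simply left unread. Assumes 'value' holds a standard zlib stream.
static byte[] InflateZlib(byte[] value)
{
    using (var input = new MemoryStream(value, 2, value.Length - 2))
    using (var deflate = new System.IO.Compression.DeflateStream(input, System.IO.Compression.CompressionMode.Decompress))
    using (var output = new MemoryStream())
    {
        deflate.CopyTo(output);
        return output.ToArray();
    }
}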

Related

How can you benefit from the new System.Buffers (Span, Memory) when doing File I/O and parsing

I'm currently looking at the new System.Buffers with Span, ReadOnlySpan, Memory, ReadOnlyMemory, ...
I understand that passing e.g. a ReadOnlySpan (ROS) can reduce heap allocations in many cases and make code perform better. Most examples regarding Span show the .AsSpan() and .Slice(...) calls, but that's it.
Once I have my data (e.g. byte[]) then I can create a Span or ReadOnlySpan from it and pass that to several methods/classes inside my library.
But how can File I/O be written using System.Buffers (Span/Memory/...)?
I've created two small (partial) examples to demonstrate the situation.
// Example 1:
using (var br = new BinaryReader(File.OpenRead(pathToFile))) {
    ReadFile(br);
}

private void ReadFile(BinaryReader br) {
    ParseHeader(...);
}

private void ParseHeader(BinaryReader br) {
    br.ReadBytes(...);
    br.ReadInt32();
    // ...
}
and
// Example 2:
public Foo GetFileAsFoo(string path) {
    using (var s = new FileStream(path, FileMode.Open, FileAccess.Read)) {
        return ReadAndGetFoo(s);
    }
}

public Foo ReadAndGetFoo(Stream file) {
    // copy to memorystream as filestream file I/O is slow
    var ms = new MemoryStream();
    file.CopyTo(ms);
    ms.Position = 0;
    Parser p = new Parser(ms);
    p.Read();
    return p.GetFoo();
}

public class Parser {
    private readonly Stream _s;

    public Parser(Stream stream) {
        _s = stream;
    }

    int Peek() {
        if (_s.Position >= _s.Length) return -1;
        int r = _s.ReadByte();
        _s.Seek(-1, SeekOrigin.Current);
        return r;
    }

    public void Read() {
        // logic here
    }

    public Foo GetFoo() {
        // ...
        return _Foo;
    }

    // other methods to parse
}
The question for the first example is mainly how I can get a ReadOnlySpan/Memory/... from:
using (var br = new BinaryReader(File.OpenRead(pathToFile)))
I'm aware of System.Buffers.Binary.BinaryPrimitives as a replacement for the BinaryReader, but this requires a ReadOnlySpan. How would I get my data from File.OpenRead as a span in the first place, similar to what I do with the BinaryReader?
What options are available?
I guess there is no class around BinaryPrimitives that keeps track of the 'position' the way a BinaryReader does?
https://learn.microsoft.com/en-us/dotnet/api/system.buffers.binary.binaryprimitives?view=net-5.0
The second example works on a Stream instead of a BinaryReader.
To keep it efficient, I first copy to a MemoryStream to reduce the I/O. (I know this one-time copy is slow and should be avoided.)
How could this file-as-Stream be read using System.Buffers (Span/Memory/...)?
(byte per byte is being read and parsed using Stream.ReadByte())
Hope to learn something!
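For illustration, a minimal sketch of one possible approach: read the whole file into a byte[] once and walk a ReadOnlySpan<byte> over it with BinaryPrimitives, tracking the position manually (BinaryPrimitives itself is stateless). The SpanParser type and its members here are hypothetical names, not an existing API; only File.ReadAllBytes and BinaryPrimitives are real.
using System;
using System.Buffers.Binary;
using System.IO;

// Hypothetical parser: walks a ReadOnlySpan<byte> while tracking the position itself.
ref struct SpanParser
{
    private ReadOnlySpan<byte> _data;
    private int _pos;

    public SpanParser(ReadOnlySpan<byte> data)
    {
        _data = data;
        _pos = 0;
    }

    public int ReadInt32()
    {
        int value = BinaryPrimitives.ReadInt32LittleEndian(_data.Slice(_pos, 4));
        _pos += 4;
        return value;
    }

    public ReadOnlySpan<byte> ReadBytes(int count)
    {
        var slice = _data.Slice(_pos, count);
        _pos += count;
        return slice;
    }
}

// Usage sketch:
// byte[] file = File.ReadAllBytes(pathToFile);   // one read, no BinaryReader
// var parser = new SpanParser(file);
// int header = parser.ReadInt32();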

Split binary file into chunks or parts for upload / download

I am using CouchDB, for various reasons, as a content management system to store files as binary data. There is no GridFS-like support as in MongoDB for uploading large files, so I need to upload files as chunks and then retrieve them as one file.
Here is my code:
public string InsertDataToCouchDb(string dbName, string id, string filename, byte[] image)
{
    var connection = System.Configuration.ConfigurationManager.ConnectionStrings["CouchDb"].ConnectionString;
    using (var db = new MyCouchClient(connection, dbName))
    {
        // HERE I NEED TO UPLOAD MY IMAGE BYTE[] AS CHUNKS
        var artist = new couchdb
        {
            _id = id,
            filename = filename,
            Image = image
        };
        var response = db.Entities.PutAsync(artist);
        return response.Result.Content._id;
    }
}
public byte[] FetchDataFromCouchDb(string dbName, string id)
{
    var connection = System.Configuration.ConfigurationManager.ConnectionStrings["CouchDb"].ConnectionString;
    using (var db = new MyCouchClient(connection, dbName))
    {
        // HERE I NEED TO RETRIEVE MY FULL IMAGE[] FROM CHUNKS
        var test = db.Documents.GetAsync(id, null);
        var doc = db.Serializer.Deserialize<couchdb>(test.Result.Content);
        return doc.Image;
    }
}
THANK YOU
Putting image data in a CouchDB document is a terrible idea. Just don't. This is the purpose of CouchDB attachments.
The potential of bloating the database with redundant blob data via document updates alone will surely have major, negative consequences for anything other than a toy database.
Further, there seems to be a lack of understanding of how async/await works: the code in the OP invokes async methods, e.g. db.Entities.PutAsync(artist), without an await - the call surely will fail every time (if the compiler even allows the code). I highly recommend grokking the Microsoft document Asynchronous programming with async and await.
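To illustrate the point, the OP's insert method could await the call instead of blocking on .Result (a sketch using the OP's own types, not code from the original answer):
public async Task<string> InsertDataToCouchDbAsync(string dbName, string id, string filename, byte[] image)
{
    var connection = System.Configuration.ConfigurationManager.ConnectionStrings["CouchDb"].ConnectionString;
    using (var db = new MyCouchClient(connection, dbName))
    {
        var artist = new couchdb { _id = id, filename = filename, Image = image };
        var response = await db.Entities.PutAsync(artist);   // await rather than .Result
        return response.Content._id;
    }
}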
Now as for "chunking": If the image data is so large that it needs to be otherwise streamed, the business of passing it around via a byte array looks bad. If the images are relatively small, just use Attachment.PutAsync as it stands.
Although Attachment.PutAsync at MyCouch v7.6 does not support streams (effectively chunking) there exists the Support Streams for attachments #177 PR, which does, and it looks pretty good.
Here's a one-page C# .NET Core console app that uploads a given file as an attachment to a specific document using the very efficient streaming provided by PR 177. Although the code uses PR 177, what matters most is that it uses attachments for blob data. Replacing the stream with a byte array is rather straightforward (see the short sketch after the program below).
MyCouch + PR 177
In a console get MyCouch sources and then apply PR 177
$ git clone https://github.com/danielwertheim/mycouch.git
$ cd mycouch
$ git pull origin 15a1079502a1728acfbfea89a7e255d0c8725e07
(I don't know git so there's probably a far better way to get a PR)
MyCouchUploader
With VS2019
Create a new .Net Core console app project and solution named "MyCouchUploader"
Add the MyCouch project pulled with PR 177 to the solution
Add the MyCouch project as MyCouchUploader dependency
Add the Nuget package "Microsoft.AspNetCore.StaticFiles" as a MyCouchUploader dependency
Replace the content of Program.cs with the following code:
using Microsoft.AspNetCore.StaticFiles;
using MyCouch;
using MyCouch.Requests;
using MyCouch.Responses;
using System;
using System.IO;
using System.Linq;
using System.Net;
using System.Security.Cryptography;
using System.Threading.Tasks;

namespace MyCouchUploader
{
    class Program
    {
        static async Task Main(string[] args)
        {
            // args: scheme, database, file path of asset to upload.
            if (args.Length < 3)
            {
                Console.WriteLine("\nUsage: MyCouchUploader scheme dbname filepath\n");
                return;
            }
            var opts = new
            {
                scheme = args[0],
                dbName = args[1],
                filePath = args[2]
            };
            Action<Response> check = (response) =>
            {
                if (!response.IsSuccess) throw new Exception(response.Reason);
            };
            try
            {
                // canned doc id for this app
                const string docId = "SO-68998781";
                const string attachmentName = "Image";
                DbConnectionInfo cnxn = new DbConnectionInfo(opts.scheme, opts.dbName)
                {
                    // timely fail if scheme is bad
                    Timeout = TimeSpan.FromMilliseconds(3000)
                };
                MyCouchClient client = new MyCouchClient(cnxn);
                // ensure db is there
                GetDatabaseResponse info = await client.Database.GetAsync();
                check(info);
                // delete doc for successive program runs
                DocumentResponse doc = await client.Documents.GetAsync(docId);
                if (doc.StatusCode == HttpStatusCode.OK)
                {
                    DocumentHeaderResponse del = await client.Documents.DeleteAsync(docId, doc.Rev);
                    check(del);
                }
                // sniff file for content type
                FileExtensionContentTypeProvider provider = new FileExtensionContentTypeProvider();
                if (!provider.TryGetContentType(opts.filePath, out string contentType))
                {
                    contentType = "application/octet-stream";
                }
                // create a hash for silly verification
                using var md5 = MD5.Create();
                using Stream stream = File.OpenRead(opts.filePath);
                byte[] fileHash = md5.ComputeHash(stream);
                stream.Position = 0;
                // Use PR 177, sea-locks:stream-attachments.
                DocumentHeaderResponse put = await client.Attachments.PutAsync(new PutAttachmentStreamRequest(
                    docId,
                    attachmentName,
                    contentType,
                    stream // :-D
                ));
                check(put);
                // verify
                AttachmentResponse verify = await client.Attachments.GetAsync(docId, attachmentName);
                check(verify);
                if (fileHash.SequenceEqual(md5.ComputeHash(verify.Content)))
                {
                    Console.WriteLine("Attachment verified.");
                }
                else
                {
                    throw new Exception(String.Format("Attachment failed verification with status code {0}", verify.StatusCode));
                }
            }
            catch (Exception e)
            {
                Console.WriteLine("Fail! {0}", e.Message);
            }
        }
    }
}
To run:
$ MyCouchUploader http://name:password@localhost:5984 dbname path-to-local-image-file
Use Fauxton to visually verify the attachment for the doc.
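If the data is already in memory as a byte[] rather than on disk, the same attachment call can be reused by wrapping the array in a MemoryStream. A small sketch against the PR 177 request type used above; imageBytes is a placeholder for your own buffer:
// Reuse the stream-based attachment upload for an in-memory byte[].
using (var stream = new MemoryStream(imageBytes))
{
    DocumentHeaderResponse put = await client.Attachments.PutAsync(
        new PutAttachmentStreamRequest(docId, attachmentName, contentType, stream));
    check(put);
}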

File API seems to always write corrupt files when used in a loop, except for the last file

I know the title is long, but it describes the problem exactly. I didn't know how else to explain it because this is totally out there.
I have a utility written in C# targeting .NET Core 2.1 that downloads and decrypts (AES encryption) files originally uploaded by our clients from our encrypted store, so they can be reprocessed through some of our services in the case that they fail. This utility is run via CLI using database IDs for the files as arguments, for example download.bat 101 102 103 would download 3 files with the corresponding IDs. I'm receiving byte data through a message queue (really not much more than a TCP socket) which describes a .TIF image.
I have a good reason to believe that the byte data is not ever corrupted on the server. That reason is when I run the utility with only one ID parameter, such as download.bat 101, then it works just fine. Furthermore, when I run it with multiple IDs, the last file that is downloaded by the utility is always intact, but the rest are always corrupted.
This odd behavior has persisted across two different implementations for writing the byte data to a file. Those implementations are below.
File.ReadAllBytes implementation:
private static void WriteMessageContents(FileServiceResponseEnvelope envelope, string destination, byte[] encryptionKey, byte[] macInitialVector)
{
    using (var inputStream = new MemoryStream(envelope.Payload))
    using (var outputStream = new MemoryStream(envelope.Payload.Length))
    {
        var sha512 = YellowAesEncryptor.DecryptStream(inputStream, outputStream, encryptionKey, macInitialVector, 0);
        File.WriteAllBytes(destination, outputStream.ToArray());
        _logger.LogStatement($"Finished writing [{envelope.Payload.Length} bytes] to [{destination}].", LogLevel.Debug);
    }
}
FileStream implementation:
private static void WriteMessageContents(FileServiceResponseEnvelope envelope, string destination, byte[] encryptionKey, byte[] macInitialVector)
{
    using (var inputStream = new MemoryStream(envelope.Payload))
    using (var outputStream = new MemoryStream(envelope.Payload.Length))
    {
        var sha512 = YellowAesEncryptor.DecryptStream(inputStream, outputStream, encryptionKey, macInitialVector, 0);
        using (FileStream fs = new FileStream(destination, FileMode.Create))
        {
            var bytes = outputStream.ToArray();
            fs.Write(bytes, 0, envelope.Payload.Length);
            _logger.LogStatement($"File byte content: [{string.Join(", ", bytes.Take(16))}]", LogLevel.Trace);
            fs.Flush();
        }
        _logger.LogStatement($"Finished writing [{envelope.Payload.Length} bytes] to [{destination}].", LogLevel.Debug);
    }
}
This method is called from a for loop which first receives the messages I described earlier and then feeds their payloads to the above method:
using (var requestSocket = new RequestSocket(fileServiceEndpoint))
{
    // Envelopes is constructed beforehand
    foreach (var envelope in envelopes)
    {
        var timer = Stopwatch.StartNew();
        requestSocket.SendMoreFrame(messageTypeBytes);
        requestSocket.SendMoreFrame(SerializationHelper.SerializeObjectToBuffer(envelope));
        if (!requestSocket.TrySendFrame(_timeout, signedPayloadBytes, signedPayloadBytes.Length))
        {
            var message = $"Timeout exceeded while processing [{envelope.ActionType}] request.";
            _logger.LogStatement(message, LogLevel.Error);
            throw new Exception(message);
        }
        var responseReceived = requestSocket.TryReceiveFrameBytes(_timeout, out byte[] responseBytes);
        ...
        var responseEnvelope = SerializationHelper.DeserializeObject<FileServiceResponseEnvelope>(responseBytes);
        ...
        _logger.LogStatement($"Received response with payload of [{responseEnvelope.Payload.Length} bytes].", LogLevel.Info);
        var destDir = downloadDetails.GetDestinationPath(responseEnvelope.FileId);
        if (!Directory.Exists(destDir))
            Directory.CreateDirectory(destDir);
        var dest = Path.Combine(destDir, idsToFileNames[responseEnvelope.FileId]);
        WriteMessageContents(responseEnvelope, dest, encryptionKey, macInitialVector);
    }
}
I also know that TIFs have a very specific header, which looks something like this in raw bytes:
[73, 73, 42, 0, 8, 0, 0, 0, 20, 0...
It always begins with "II" (73, 73) or "MM" (77, 77) followed by 42 (probably a Hitchhiker's reference). I analyzed the bytes written by the utility. The last file always has a header that resembles this one. The rest are always random bytes; seemingly jumbled or mis-ordered image binary data. Any insight on this would be greatly appreciated because I can't wrap my mind around what I would even need to do to diagnose this.
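For reference, a small header check along these lines (a hypothetical helper, based only on the TIFF magic bytes described above) makes it easy to log which downloads are corrupt:
// Returns true if the buffer starts with a plausible TIFF header:
// "II" (0x49 0x49) followed by 42 0, or "MM" (0x4D 0x4D) followed by 0 42.
static bool LooksLikeTiff(byte[] bytes)
{
    if (bytes == null || bytes.Length < 4) return false;
    bool littleEndian = bytes[0] == 0x49 && bytes[1] == 0x49 && bytes[2] == 42 && bytes[3] == 0;
    bool bigEndian = bytes[0] == 0x4D && bytes[1] == 0x4D && bytes[2] == 0 && bytes[3] == 42;
    return littleEndian || bigEndian;
}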
UPDATE
I was able to figure out this problem with the help of elgonzo in the comments. Sometimes it isn't a direct answer that helps, but someone picking your brain until you look in the right place.
All right, as I suspected this was a dumb mistake (I had severe doubts that the File API was simply this flawed for so long). I just needed help thinking through it. There was an additional bit of code which I didn't post that was biting me, when I was retrieving the metadata for the file so that I could then request the file from our storage box.
byte[] encryptionKey = null;
byte[] macInitialVector = null;
...
using (var conn = new SqlConnection(ConnectionString))
using (var cmd = new SqlCommand(uploadedFileQuery, conn))
{
    conn.Open();
    var reader = cmd.ExecuteReader();
    while (reader.Read())
    {
        FileServiceMessageEnvelope readAllEnvelope = null;
        var originalFileName = reader["UploadedFileClientName"].ToString();
        var fileId = Convert.ToInt64(reader["UploadedFileId"].ToString());
        //var originalFileExtension = originalFileName.Substring(originalFileName.IndexOf('.'));
        //_logger.LogStatement($"Scooped extension: {originalFileExtension}", LogLevel.Trace);
        envelopes.Add(readAllEnvelope = new FileServiceMessageEnvelope
        {
            ActionType = FileServiceActionTypeEnum.ReadAll,
            FileType = FileTypeEnum.UploadedFile,
            FileName = reader["UploadedFileServerName"].ToString(),
            FileId = fileId,
            WorkerAuthorization = null,
            BinaryTimestamp = DateTime.Now.ToBinary(),
            Position = 0,
            Count = Convert.ToInt32(reader["UploadedFileSize"]),
            SignerFqdn = _messengerConfig.FullyQualifiedDomainName
        });
        readAllEnvelope.SignMessage(_messengerConfig.PrivateKeyBytes, _messengerConfig.PrivateKeyPassword);
        signedPayload = new SecureMessage { Payload = new byte[0] };
        signedPayload.SignMessage(_messengerConfig.PrivateKeyBytes, _messengerConfig.PrivateKeyPassword);
        signedPayloadBytes = SerializationHelper.SerializeObjectToBuffer(signedPayload);
        encryptionKey = (byte[])reader["UploadedFileEncryptionKey"];
        macInitialVector = (byte[])reader["UploadedFileEncryptionMacInitialVector"];
    }
    conn.Close();
}
Eagle-eyed observers might realize that I have not properly coupled the encryptionKey and macInitialVector to the correct record, since each file has a unique key and vector. This means I was using the key for one of the files to decrypt all of them which is why they were all corrupt except for one file -- they were not properly decrypted. I solved this issue by coupling them together with the ID in a simple POCO and retrieving the appropriate key and vector for each file upon decryption.
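A minimal sketch of that fix, with hypothetical type and variable names (only the reader columns and the WriteMessageContents call come from the code above):
// Couple each file's key and IV to its ID so the correct pair is used at decryption time.
// cryptoByFileId is a Dictionary<long, FileCryptoInfo> built alongside the envelopes.
class FileCryptoInfo
{
    public long FileId { get; set; }
    public byte[] EncryptionKey { get; set; }
    public byte[] MacInitialVector { get; set; }
}

// While reading the metadata (inside the while (reader.Read()) loop):
// cryptoByFileId[fileId] = new FileCryptoInfo
// {
//     FileId = fileId,
//     EncryptionKey = (byte[])reader["UploadedFileEncryptionKey"],
//     MacInitialVector = (byte[])reader["UploadedFileEncryptionMacInitialVector"]
// };

// Later, per downloaded file:
// var crypto = cryptoByFileId[responseEnvelope.FileId];
// WriteMessageContents(responseEnvelope, dest, crypto.EncryptionKey, crypto.MacInitialVector);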

Compressing a string in C# and uncompressing in Python

I am trying to compress a large string on a client program in C# (.net 4) and send it to a server (django, python 2.7) using a PUT request.
Ideally I want to use the standard library at both ends, so I am trying to use gzip.
My C# code is:
public static string Compress(string s) {
    var bytes = Encoding.Unicode.GetBytes(s);
    using (var msi = new MemoryStream(bytes))
    using (var mso = new MemoryStream()) {
        using (var gs = new GZipStream(mso, CompressionMode.Compress)) {
            msi.CopyTo(gs);
        }
        return Convert.ToBase64String(mso.ToArray());
    }
}
The python code is:
s = base64.standard_b64decode(request)
buff = cStringIO.StringIO(s)
with gzip.GzipFile(fileobj=buff) as gz:
    decompressed_data = gz.read()
It's almost working, but the output is: {▯"▯c▯h▯a▯n▯g▯e▯d▯"▯} when it should be {"changed"}, i.e. every other letter is something weird.
If I take out every other character by doing decompressed_data[::2], then it works, but it's a bit of a hack, and clearly there is something else wrong.
I'm wondering if I need to base64 encode it at all for a PUT request? Is this only necessary for POST?
I think the main problem might be that C# uses UTF-16 encoded strings, which may yield a problem similar to yours. As with any other encoding problem, we might need a little luck here, but I guess you can solve this by doing:
decompressed_data = gz.read().decode('utf-16')
There, decompressed_data should be Unicode and you can treat it as such for further work.
UPDATE: This worked for me:
C Sharp
static void Main(string[] args)
{
    FileStream f = new FileStream("test", FileMode.CreateNew);
    using (StreamWriter w = new StreamWriter(f))
    {
        w.Write(Compress("hello"));
    }
}

public static string Compress(string s)
{
    var bytes = Encoding.Unicode.GetBytes(s);
    using (var msi = new MemoryStream(bytes))
    using (var mso = new MemoryStream())
    {
        using (var gs = new GZipStream(mso, CompressionMode.Compress))
        {
            msi.CopyTo(gs);
        }
        return Convert.ToBase64String(mso.ToArray());
    }
}
Python
import base64
import cStringIO
import gzip
f = open('test','rb')
s = base64.standard_b64decode(f.read())
buff = cStringIO.StringIO(s)
with gzip.GzipFile(fileobj=buff) as gz:
    decompressed_data = gz.read()
print decompressed_data.decode('utf-16')
Without decode('utf-16') it printed in the console:
>>>h e l l o
with it it did well:
>>>hello
Good luck, hope this helps!
It's almost working, but the output is: {▯"▯c▯h▯a▯n▯g▯e▯d▯"▯} when it should be {"changed"}
That's because you're using Encoding.Unicode to convert the string to bytes to start with.
If you can tell Python which encoding to use, you could do that - otherwise you need to use an encoding on the C# side which matches what Python expects.
If you can specify it on both sides, I'd suggest using UTF-8 rather than UTF-16. Even though you're compressing, it wouldn't hurt to make the data half the size (in many cases) to start with :)
I'm also somewhat suspicious of this line:
buff = cStringIO.StringIO(s)
s really isn't text data - it's compressed binary data, and should be treated as such. It may be okay - it's just worth checking whether there's a better way.
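If both sides can change, the UTF-8 variant suggested above is a small tweak to the original Compress method (a sketch; the matching Python side just decodes UTF-8 instead of UTF-16):
public static string Compress(string s)
{
    var bytes = Encoding.UTF8.GetBytes(s);   // UTF-8 instead of Encoding.Unicode (UTF-16)
    using (var msi = new MemoryStream(bytes))
    using (var mso = new MemoryStream())
    {
        using (var gs = new GZipStream(mso, CompressionMode.Compress))
        {
            msi.CopyTo(gs);
        }
        return Convert.ToBase64String(mso.ToArray());
    }
}
// Python side: decompressed_data = gz.read().decode('utf-8')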

ReadInt16 in C# returns something different than what writeShort in Java wrote

I'm writing a short to a file, using the following code in Java:
RandomAccessFile file = new RandomAccessFile("C:\\Users\\PC\\Desktop\\myFile.bin", "rw");
file.writeShort(11734);
file.close();
When I read it back in Java, I get the same (11734) number back. However, when I read the number in C# using the following code:
string p = "C:\\Users\\PC\\Desktop\\myFile.bin";
short s = new BinaryReader(File.OpenRead(p)).ReadInt16();
The variable s contains -10707.
How can this happen, and is there a way to retrieve the number I wrote to a file in Java, in C#?
BigEndian/LittleEndian problem. See below:
byte[] b = BitConverter.GetBytes((short)11734);
var s = BitConverter.ToInt16(new byte[] {b[1],b[0] }, 0);
s will be -10707
You can use IPAddress.HostToNetworkOrder to convert from one form to another.
var sh1 = IPAddress.HostToNetworkOrder((short)11734); //-10707
var sh2 = IPAddress.HostToNetworkOrder((short)-10707); //11734
You can also create your own BinaryReader:
public class MyBinaryReader : BinaryReader
{
    public MyBinaryReader(Stream s) : base(s)
    {
    }

    public override short ReadInt16()
    {
        return IPAddress.HostToNetworkOrder(base.ReadInt16());
    }
}
As others have said, this is an endianness problem.
My MiscUtil library includes EndianBinaryReader and EndianBitConverter which let you use the familiar API but with the flexibility of specifying the endianness:
using System;
using System.IO;
using MiscUtil.Conversion;
using MiscUtil.IO;

static class Test
{
    static void Main()
    {
        using (var stream = File.OpenRead("myfile.bin"))
        {
            var converter = new BigEndianBitConverter();
            var reader = new EndianBinaryReader(converter, stream);
            Console.WriteLine(reader.ReadInt16());
        }
    }
}
Take note that in Java you also have the option to employ a ByteBuffer.allocate(size).order(LITTLE_ENDIAN).asShortBuffer().
Your byte order is wrong:
11734 = 0x2DD6
-10707 = 0xD62D
http://en.wikipedia.org/wiki/Endianness
Java always uses network-order (aka big-endian) when it writes out shorts, ints and longs. My guess is that C# reads it little-endian.
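For completeness (not part of the original answers): on more recent .NET versions, System.Buffers.Binary.BinaryPrimitives makes the byte order explicit when reading:
using System.Buffers.Binary;
using System.IO;

byte[] bytes = File.ReadAllBytes("C:\\Users\\PC\\Desktop\\myFile.bin");
// Java's DataOutput.writeShort is big-endian, so read it back the same way.
short s = BinaryPrimitives.ReadInt16BigEndian(bytes);   // 11734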
