I am trying to compress a large string on a client program in C# (.net 4) and send it to a server (django, python 2.7) using a PUT request.
Ideally I want to use the standard library at both ends, so I am trying to use gzip.
My C# code is:
public static string Compress(string s) {
    var bytes = Encoding.Unicode.GetBytes(s);
    using (var msi = new MemoryStream(bytes))
    using (var mso = new MemoryStream()) {
        using (var gs = new GZipStream(mso, CompressionMode.Compress)) {
            msi.CopyTo(gs);
        }
        return Convert.ToBase64String(mso.ToArray());
    }
}
The python code is:
s = base64.standard_b64decode(request)
buff = cStringIO.StringIO(s)
with gzip.GzipFile(fileobj=buff) as gz:
    decompressed_data = gz.read()
It's almost working, but the output is: {▯"▯c▯h▯a▯n▯g▯e▯d▯"▯} when it should be {"changed"}, i.e. every other letter is something weird.
If I take out every other character by doing decompressed_data[::2], then it works, but it's a bit of a hack, and clearly there is something else wrong.
I'm wondering if I need to base64 encode it at all for a PUT request? Is this only necessary for POST?
I think the main problem might be that C# uses UTF-16 encoded strings (Encoding.Unicode is UTF-16LE). That would yield a problem exactly like yours. As with any other encoding problem we may need a little luck here, but I guess you can solve this by doing:
decompressed_data = gz.read().decode('utf-16')
There, decompressed_data should be Unicode and you can treat it as such for further work.
UPDATE: This worked for me:
C#
static void Main(string[] args)
{
    FileStream f = new FileStream("test", FileMode.CreateNew);
    using (StreamWriter w = new StreamWriter(f))
    {
        w.Write(Compress("hello"));
    }
}

public static string Compress(string s)
{
    var bytes = Encoding.Unicode.GetBytes(s);
    using (var msi = new MemoryStream(bytes))
    using (var mso = new MemoryStream())
    {
        using (var gs = new GZipStream(mso, CompressionMode.Compress))
        {
            msi.CopyTo(gs);
        }
        return Convert.ToBase64String(mso.ToArray());
    }
}
Python
import base64
import cStringIO
import gzip
f = open('test','rb')
s = base64.standard_b64decode(f.read())
buff = cStringIO.StringIO(s)
with gzip.GzipFile(fileobj=buff) as gz:
    decompressed_data = gz.read()
print decompressed_data.decode('utf-16')
Without decode('utf-16') it printed in the console:
>>>h e l l o
With it, it printed correctly:
>>>hello
Good luck, hope this helps!
It's almost working, but the output is: {▯"▯c▯h▯a▯n▯g▯e▯d▯"▯} when it should be {"changed"}
That's because you're using Encoding.Unicode to convert the string to bytes to start with.
If you can tell Python which encoding to use, you could do that - otherwise you need to use an encoding on the C# side which matches what Python expects.
If you can specify it on both sides, I'd suggest using UTF-8 rather than UTF-16. Even though you're compressing, it wouldn't hurt to make the data half the size (in many cases) to start with :)
I'm also somewhat suspicious of this line:
buff = cStringIO.StringIO(s)
s really isn't text data - it's compressed binary data, and should be treated as such. It may be okay - it's just worth checking whether there's a better way.
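For what it's worth, here is a minimal sketch of the UTF-8 variant, assuming you can change both ends: the only change to the question's Compress is the encoding, and the Python side would then end with decompressed_data.decode('utf-8').

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static string Compress(string s)
{
    // UTF-8 instead of Encoding.Unicode (UTF-16LE): for ASCII-range text this
    // halves the input and removes the interleaved zero bytes that showed up
    // as "every other letter is something weird"
    var bytes = Encoding.UTF8.GetBytes(s);
    using (var msi = new MemoryStream(bytes))
    using (var mso = new MemoryStream())
    {
        using (var gs = new GZipStream(mso, CompressionMode.Compress))
        {
            msi.CopyTo(gs);
        }
        return Convert.ToBase64String(mso.ToArray());
    }
}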
Related
I am working on porting a file format (OMF) into C#. Part of the storage in the file is an array of zlib compressed data. An existing version of the file formatter uses a static method from Ionic.Zip to read the file, as follows:
public static byte[] Uncompress(this byte[] value)
{
    // Uncompress
    return ZlibStream.UncompressBuffer(value);
}
The project I am working on already uses SharpCompress, and using two different compression libraries seemed wasteful, so I figured I would rewrite the method to use SharpCompress. SharpCompress does not have the UncompressBuffer static function that Ionic does, so I implemented it as follows, which seemed to be a pretty standard approach from my reading:
using (var originalStream = new MemoryStream(value))
{
    using (var decompressedStream = new MemoryStream())
    {
        using (var decompressor = new ZlibStream(originalStream, CompressionMode.Decompress))
        {
            decompressor.CopyTo(decompressedStream);
            return decompressedStream.ToArray();
        }
    }
}
I have also tried a similar approach using the System.IO.Compression.DeflateStream class, following the pattern provided in the MS docs. However, in both cases, at the CopyTo call I get an exception indicating there is an issue with the data:
For Zlib: 'Zlib Exception: Bad state (incorrect data check)'
For Windows: 'Block length does not match with its complement'
It could be that I am missing something that differentiates UncompressBuffer from this method of decompression, but it seems like UncompressBuffer works with internal portions of the Zlib class.
What am I doing wrong here? Is there a difference between the implementations of the 2 zip libraries that makes them incompatible?
The code below runs, which shows that there are at least baseline round-trip capabilities between Ionic.Zip and SharpCompress. That points to some specific subtlety in the payload you are trying to decompress.
class Program
{
    static void Main(string[] args)
    {
        var rnd = new Random(0);
        var raw = new byte[1024];
        rnd.NextBytes(raw);
        raw = System.Text.Encoding.UTF8.GetBytes(System.Convert.ToBase64String(raw));

        var ionicCompressed = Ionic.Zlib.ZlibStream.CompressBuffer(raw);
        var sharpCompressed = DoSharpCompress(raw);

        var ionicDecompressIonic = Ionic.Zlib.ZlibStream.UncompressBuffer(sharpCompressed);
        var ionicDecompressSharp = Ionic.Zlib.ZlibStream.UncompressBuffer(ionicCompressed);
        var sharpDecompressSharp = DoSharpDecompress(sharpCompressed);
        var sharpDecompressIonic = DoSharpDecompress(ionicCompressed);

        AssertEqual(ionicDecompressIonic, ionicDecompressSharp);
        AssertEqual(sharpDecompressSharp, sharpDecompressIonic);
        AssertEqual(ionicDecompressSharp, sharpDecompressIonic);
        AssertEqual(raw, sharpDecompressIonic);

        Console.WriteLine(System.Text.Encoding.UTF8.GetString(raw));
        Console.WriteLine(System.Text.Encoding.UTF8.GetString(sharpDecompressIonic));
        Console.WriteLine(System.Text.Encoding.UTF8.GetString(raw) == System.Text.Encoding.UTF8.GetString(sharpDecompressIonic));
        Console.ReadLine();
    }

    static byte[] DoSharpCompress(byte[] uncompressed)
    {
        var sc1 = new SharpCompress.Compressors.Deflate.ZlibStream(new MemoryStream(uncompressed), SharpCompress.Compressors.CompressionMode.Compress);
        var sc2 = new MemoryStream();
        sc1.CopyTo(sc2);
        return sc2.ToArray();
    }

    static byte[] DoSharpDecompress(byte[] compressed)
    {
        var sc1 = new SharpCompress.Compressors.Deflate.ZlibStream(new MemoryStream(compressed), SharpCompress.Compressors.CompressionMode.Decompress);
        var sc2 = new MemoryStream();
        sc1.CopyTo(sc2);
        return sc2.ToArray();
    }

    static bool AssertEqual(byte[] a, byte[] b)
    {
        if (!a.SequenceEqual(b))
            throw new Exception();
        return true;
    }
}
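As a follow-up diagnostic for the "incorrect data check" error: it may be worth verifying that the OMF buffer you extract actually starts with a zlib header before suspecting either library. Here is a hedged sketch (per RFC 1950, the first byte's low nibble must be 8 and the first two bytes, read big-endian, must be divisible by 31):

static bool LooksLikeZlib(byte[] data)
{
    // zlib CMF/FLG check (RFC 1950): compression method 8 (deflate), and
    // (CMF * 256 + FLG) divisible by 31. The usual first bytes are
    // 0x78 0x01, 0x78 0x9C or 0x78 0xDA.
    if (data == null || data.Length < 2) return false;
    return (data[0] & 0x0F) == 8 && ((data[0] << 8) | data[1]) % 31 == 0;
}

If this returns false, the slice is probably raw deflate data (no zlib wrapper) or offset incorrectly within the file; if it returns true but decompression still fails at the end of the stream, the slice may be truncated (the trailing Adler-32 checksum is likely what "incorrect data check" refers to).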
I'm trying to parse a crg-file in C#. The file is mixed with plain text and binary data. The first section of the file contains plain text while the rest of the file is binary (lots of floats), here's an example:
$
$ROAD_CRG
reference_line_start_u = 100
reference_line_end_u = 120
$
$KD_DEFINITION
#:KRBI
U:reference line u,m,730.000,0.010
D:reference line phi,rad
D:long section 1,m
D:long section 2,m
D:long section 3,m
...
$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
�#z����RA����\�l
...
I know I can read bytes starting at a specific offset but how do I find out which byte to start from? The last row before the binary section will always contain at least four dollar signs "$$$$". Here's what I've got so far:
using var fs = new FileStream(@"crg_sample.crg", FileMode.Open, FileAccess.Read);
var startByte = ??; // How to find out where to start?
using (BinaryReader reader = new BinaryReader(fs))
{
    reader.BaseStream.Seek(startByte, SeekOrigin.Begin);
    var f = reader.ReadSingle();
    Debug.WriteLine(f);
}
When you have a mixture of text data and binary data, you need to treat everything as binary. This means you should be using raw Stream access, or something similar, and using binary APIs to look through the text data (often looking for CR/LF/CRLF bytes as sentinels, although it sounds like in your case you could just look for the $$$$ using binary APIs, then decode the entire block before it, and scan forwards). When you think you have an entire line, you can use Encoding to parse each line - the most convenient API being encoding.GetString().

When you've finished looking through the text data as binary, you can continue parsing the binary data, again using the binary API. I would usually recommend against BinaryReader here too, because frankly it doesn't gain you much over the more direct API. The other problem you might want to think about is CPU endianness, but assuming that isn't a problem: BitConverter.ToSingle() may be your friend.
If the data is modest in size, you may find it easiest to use byte[] for the data; either via File.ReadAllBytes, or by renting an oversized byte[] from the array-pool, and loading it from a FileStream. The Stream API is awkward for this kind of scenario, because once you've looked at data: it has gone - so you need to maintain your own back-buffers. The pipelines API is ideal for this, when dealing with large data, but is an advanced topic.
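A minimal sketch of that approach, assuming the file fits comfortably in memory and that the marker line is terminated by a plain \n (both assumptions; adjust for \r\n files):

using System;
using System.IO;
using System.Text;

byte[] data = File.ReadAllBytes("crg_sample.crg");

// Scan the raw bytes for the "$$$$" sentinel; never decode first.
byte[] marker = Encoding.ASCII.GetBytes("$$$$");
int markerPos = -1;
for (int i = 0; i <= data.Length - marker.Length; i++)
{
    if (data[i] == marker[0] && data[i + 1] == marker[1] &&
        data[i + 2] == marker[2] && data[i + 3] == marker[3])
    {
        markerPos = i;
        break;
    }
}
if (markerPos < 0) throw new InvalidDataException("marker line not found");

// The binary section starts right after the newline that ends the marker line.
int newlinePos = Array.IndexOf(data, (byte)'\n', markerPos);
if (newlinePos < 0) throw new InvalidDataException("no newline after marker");
int binaryStart = newlinePos + 1;

// Now the text block can be decoded in one go...
string textSection = Encoding.ASCII.GetString(data, 0, binaryStart);

// ...and the floats read directly from the remaining bytes.
float firstValue = BitConverter.ToSingle(data, binaryStart);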
UPDATE: This code may not work as expected. Please review the valuable information in the comments.
using (var fs = new FileStream(@"crg_sample.crg", FileMode.Open, FileAccess.Read))
{
    using (StreamReader sr = new StreamReader(fs, Encoding.ASCII, true, 1, true))
    {
        var line = sr.ReadLine();
        while (!string.IsNullOrWhiteSpace(line) && !line.Contains("$$$$"))
        {
            line = sr.ReadLine();
        }
    }
    using (BinaryReader reader = new BinaryReader(fs))
    {
        // TODO: Start reading the binary data
    }
}
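For reference, the likely reason this may not work as expected is that StreamReader reads ahead: even when constructed with a buffer size of 1, the .NET implementation clamps the buffer to an internal minimum (128 bytes in the reference source), so by the time the marker line has been matched, the underlying FileStream's position has usually already advanced into the binary section and the BinaryReader starts from the wrong offset.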
Solution
I know this is far from the most optimized solution, but in my case it did the trick, and since the plain-text section of the file was known to be fairly small, it didn't cause any noticeable performance issues. Here's the code:
using var fileStream = new FileStream(@"crg_sample.crg", FileMode.Open, FileAccess.Read);
using var reader = new BinaryReader(fileStream);

var newLine = '\n';
var markerString = "$$$$";
var currentString = "";
var foundMarker = false;
var foundNewLine = false;

while (!foundNewLine)
{
    var c = reader.ReadChar();
    if (!foundMarker)
    {
        currentString += c;
        if (currentString.Length > markerString.Length)
            currentString = currentString.Substring(1);
        if (currentString == markerString)
            foundMarker = true;
    }
    else
    {
        if (c == newLine)
            foundNewLine = true;
    }
}
if (foundNewLine)
{
    // Read binary
}
Note: If you're dealing with larger or more complex files, you should probably take a look at Marc Gravell's answer and the comment sections.
I know the title is long, but it describes the problem exactly. I didn't know how else to explain it because this is totally out there.
I have a utility written in C# targeting .NET Core 2.1 that downloads and decrypts (AES encryption) files originally uploaded by our clients from our encrypted store, so they can be reprocessed through some of our services in the case that they fail. This utility is run via CLI using database IDs for the files as arguments, for example download.bat 101 102 103 would download 3 files with the corresponding IDs. I'm receiving byte data through a message queue (really not much more than a TCP socket) which describes a .TIF image.
I have a good reason to believe that the byte data is not ever corrupted on the server. That reason is when I run the utility with only one ID parameter, such as download.bat 101, then it works just fine. Furthermore, when I run it with multiple IDs, the last file that is downloaded by the utility is always intact, but the rest are always corrupted.
This odd behavior has persisted across two different implementations for writing the byte data to a file. Those implementations are below.
File.WriteAllBytes implementation:
private static void WriteMessageContents(FileServiceResponseEnvelope envelope, string destination, byte[] encryptionKey, byte[] macInitialVector)
{
    using (var inputStream = new MemoryStream(envelope.Payload))
    using (var outputStream = new MemoryStream(envelope.Payload.Length))
    {
        var sha512 = YellowAesEncryptor.DecryptStream(inputStream, outputStream, encryptionKey, macInitialVector, 0);
        File.WriteAllBytes(destination, outputStream.ToArray());
        _logger.LogStatement($"Finished writing [{envelope.Payload.Length} bytes] to [{destination}].", LogLevel.Debug);
    }
}
FileStream implementation:
private static void WriteMessageContents(FileServiceResponseEnvelope envelope, string destination, byte[] encryptionKey, byte[] macInitialVector)
{
    using (var inputStream = new MemoryStream(envelope.Payload))
    using (var outputStream = new MemoryStream(envelope.Payload.Length))
    {
        var sha512 = YellowAesEncryptor.DecryptStream(inputStream, outputStream, encryptionKey, macInitialVector, 0);
        using (FileStream fs = new FileStream(destination, FileMode.Create))
        {
            var bytes = outputStream.ToArray();
            fs.Write(bytes, 0, envelope.Payload.Length);
            _logger.LogStatement($"File byte content: [{string.Join(", ", bytes.Take(16))}]", LogLevel.Trace);
            fs.Flush();
        }
        _logger.LogStatement($"Finished writing [{envelope.Payload.Length} bytes] to [{destination}].", LogLevel.Debug);
    }
}
This method is called from a for loop which first receives the messages I described earlier and then feeds their payloads to the above method:
using (var requestSocket = new RequestSocket(fileServiceEndpoint))
{
    // Envelopes is constructed beforehand
    foreach (var envelope in envelopes)
    {
        var timer = Stopwatch.StartNew();
        requestSocket.SendMoreFrame(messageTypeBytes);
        requestSocket.SendMoreFrame(SerializationHelper.SerializeObjectToBuffer(envelope));
        if (!requestSocket.TrySendFrame(_timeout, signedPayloadBytes, signedPayloadBytes.Length))
        {
            var message = $"Timeout exceeded while processing [{envelope.ActionType}] request.";
            _logger.LogStatement(message, LogLevel.Error);
            throw new Exception(message);
        }
        var responseReceived = requestSocket.TryReceiveFrameBytes(_timeout, out byte[] responseBytes);
        ...
        var responseEnvelope = SerializationHelper.DeserializeObject<FileServiceResponseEnvelope>(responseBytes);
        ...
        _logger.LogStatement($"Received response with payload of [{responseEnvelope.Payload.Length} bytes].", LogLevel.Info);
        var destDir = downloadDetails.GetDestinationPath(responseEnvelope.FileId);
        if (!Directory.Exists(destDir))
            Directory.CreateDirectory(destDir);
        var dest = Path.Combine(destDir, idsToFileNames[responseEnvelope.FileId]);
        WriteMessageContents(responseEnvelope, dest, encryptionKey, macInitialVector);
    }
}
I also know that TIFs have a very specific header, which looks something like this in raw bytes:
[73, 73, 42, 0, 8, 0, 0, 0, 20, 0...
It always begins with "II" (73, 73) or "MM" (77, 77) followed by 42 (probably a Hitchhiker's reference). I analyzed the bytes written by the utility. The last file always has a header that resembles this one. The rest are always random bytes; seemingly jumbled or mis-ordered image binary data. Any insight on this would be greatly appreciated because I can't wrap my mind around what I would even need to do to diagnose this.
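One quick way to see which downloads went wrong is to probe each output file for that header. Here is a hedged sketch based only on the byte layout above:

static bool HasTiffHeader(byte[] bytes)
{
    // "II" (0x49 0x49) = little-endian, "MM" (0x4D 0x4D) = big-endian,
    // followed by the 16-bit value 42 in the matching byte order.
    if (bytes == null || bytes.Length < 4) return false;
    bool ii = bytes[0] == 0x49 && bytes[1] == 0x49;
    bool mm = bytes[0] == 0x4D && bytes[1] == 0x4D;
    if (!ii && !mm) return false;
    int magic = ii ? bytes[2] | (bytes[3] << 8) : (bytes[2] << 8) | bytes[3];
    return magic == 42;
}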
UPDATE
I was able to figure out this problem with the help of elgonzo in the comments. Sometimes it isn't a direct answer that helps, but someone picking your brain until you look in the right place.
All right, as I suspected, this was a dumb mistake (I had severe doubts that the File API was simply this flawed for so long); I just needed help thinking through it. There was an additional bit of code, which I didn't post, that was biting me: the part that retrieves the metadata for the file so that I can then request the file from our storage box.
byte[] encryptionKey = null;
byte[] macInitialVector = null;
...
using (var conn = new SqlConnection(ConnectionString))
using (var cmd = new SqlCommand(uploadedFileQuery, conn))
{
    conn.Open();
    var reader = cmd.ExecuteReader();
    while (reader.Read())
    {
        FileServiceMessageEnvelope readAllEnvelope = null;
        var originalFileName = reader["UploadedFileClientName"].ToString();
        var fileId = Convert.ToInt64(reader["UploadedFileId"].ToString());
        //var originalFileExtension = originalFileName.Substring(originalFileName.IndexOf('.'));
        //_logger.LogStatement($"Scooped extension: {originalFileExtension}", LogLevel.Trace);
        envelopes.Add(readAllEnvelope = new FileServiceMessageEnvelope
        {
            ActionType = FileServiceActionTypeEnum.ReadAll,
            FileType = FileTypeEnum.UploadedFile,
            FileName = reader["UploadedFileServerName"].ToString(),
            FileId = fileId,
            WorkerAuthorization = null,
            BinaryTimestamp = DateTime.Now.ToBinary(),
            Position = 0,
            Count = Convert.ToInt32(reader["UploadedFileSize"]),
            SignerFqdn = _messengerConfig.FullyQualifiedDomainName
        });
        readAllEnvelope.SignMessage(_messengerConfig.PrivateKeyBytes, _messengerConfig.PrivateKeyPassword);
        signedPayload = new SecureMessage { Payload = new byte[0] };
        signedPayload.SignMessage(_messengerConfig.PrivateKeyBytes, _messengerConfig.PrivateKeyPassword);
        signedPayloadBytes = SerializationHelper.SerializeObjectToBuffer(signedPayload);
        encryptionKey = (byte[])reader["UploadedFileEncryptionKey"];
        macInitialVector = (byte[])reader["UploadedFileEncryptionMacInitialVector"];
    }
    conn.Close();
}
Eagle-eyed observers might realize that I have not properly coupled the encryptionKey and macInitialVector to the correct record, since each file has a unique key and vector. This means I was using the key for one of the files to decrypt all of them which is why they were all corrupt except for one file -- they were not properly decrypted. I solved this issue by coupling them together with the ID in a simple POCO and retrieving the appropriate key and vector for each file upon decryption.
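For illustration, a hedged sketch of that coupling (FileCryptoInfo, Example, and CryptoByFileId are hypothetical names, not from the actual codebase):

using System.Collections.Generic;

// Hypothetical POCO tying each file's key material to its database ID.
class FileCryptoInfo
{
    public long FileId { get; set; }
    public byte[] EncryptionKey { get; set; }
    public byte[] MacInitialVector { get; set; }
}

class Example
{
    // One entry per record, built inside the reader loop, instead of
    // overwriting a single pair of locals on every iteration:
    static readonly Dictionary<long, FileCryptoInfo> CryptoByFileId = new Dictionary<long, FileCryptoInfo>();

    // Inside while (reader.Read()):
    //     CryptoByFileId[fileId] = new FileCryptoInfo
    //     {
    //         FileId = fileId,
    //         EncryptionKey = (byte[])reader["UploadedFileEncryptionKey"],
    //         MacInitialVector = (byte[])reader["UploadedFileEncryptionMacInitialVector"]
    //     };

    // Inside the download loop, per response:
    //     var info = CryptoByFileId[responseEnvelope.FileId];
    //     WriteMessageContents(responseEnvelope, dest, info.EncryptionKey, info.MacInitialVector);
}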
I need to convert a wav file to 8000 Hz, 16-bit, mono wav. I already have code that works well with the NAudio library, but I want to use a MemoryStream instead of a temporary file.
using System.IO;
using NAudio.Wave;

static void Main()
{
    var input = File.ReadAllBytes("C:/input.wav");
    var output = ConvertWavTo8000Hz16BitMonoWav(input);
    File.WriteAllBytes("C:/output.wav", output);
}

public static byte[] ConvertWavTo8000Hz16BitMonoWav(byte[] inArray)
{
    using (var mem = new MemoryStream(inArray))
    using (var reader = new WaveFileReader(mem))
    using (var converter = WaveFormatConversionStream.CreatePcmStream(reader))
    using (var upsampler = new WaveFormatConversionStream(new WaveFormat(8000, 16, 1), converter))
    {
        // todo: without saving to file using MemoryStream or similar
        WaveFileWriter.CreateWaveFile("C:/tmp_pcm_8000_16_mono.wav", upsampler);
        return File.ReadAllBytes("C:/tmp_pcm_8000_16_mono.wav");
    }
}
Not sure if this is the optimal way, but it works...
public static byte[] ConvertWavTo8000Hz16BitMonoWav(byte[] inArray)
{
    using (var mem = new MemoryStream(inArray))
    {
        using (var reader = new WaveFileReader(mem))
        {
            using (var converter = WaveFormatConversionStream.CreatePcmStream(reader))
            {
                using (var upsampler = new WaveFormatConversionStream(new WaveFormat(8000, 16, 1), converter))
                {
                    byte[] data;
                    using (var m = new MemoryStream())
                    {
                        upsampler.CopyTo(m);
                        data = m.ToArray();
                    }
                    using (var m = new MemoryStream())
                    {
                        // to create a proper WAV header (44 bytes), which begins with RIFF
                        var w = new WaveFileWriter(m, upsampler.WaveFormat);
                        // append WAV data body
                        w.Write(data, 0, data.Length);
                        // flush so the writer patches the RIFF/data size fields in the
                        // header before the buffer is copied out (the writer is deliberately
                        // not disposed here, because that would also dispose the MemoryStream)
                        w.Flush();
                        return m.ToArray();
                    }
                }
            }
        }
    }
}
It might be added (sorry, I can't comment yet due to lack of points) that NAudio ALWAYS writes 46-byte headers, which in certain situations can cause crashes. I want to add this in case someone encounters it while searching for a clue as to why NAudio wav files suddenly start crashing certain programs.
I encountered this problem after figuring out how to convert and/or resample wav with NAudio, was stuck on it for 2 days, and only figured it out with a hex editor.
(The 2 extra bytes are located at bytes 37 and 38, right before the data subchunk [d,a,t,a,size<4 bytes>].)
Here is a comparison of two wave file headers: on the left, one saved by NAudio (46 bytes); on the right, one by Audacity (44 bytes).
You can verify this by looking at the NAudio source in WaveFormat.cs at line 310, where instead of 16 bytes for the fmt chunk, 18+extra are reserved (+extra because some wav files contain even bigger headers than 46 bytes), yet NAudio always seems to write 46-byte headers and never 44 (the MS standard). It may also be noted that NAudio is in fact able to read 44-byte headers (line 210 in WaveFormat.cs).
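If a consuming program really insists on the 44-byte layout, one workaround is to patch the file afterwards. A hedged sketch, assuming a plain PCM file with the standard RIFF/fmt/data chunk order and nothing between fmt and data (offsets here are 0-based, so the two cbSize bytes mentioned above sit at offsets 36-37; StripExtraFmtBytes is a hypothetical helper name):

using System;

static byte[] StripExtraFmtBytes(byte[] wav)
{
    // Only rewrite files with an 18-byte PCM fmt chunk (NAudio's 46-byte
    // header); anything else is returned untouched. Assumes a little-endian
    // host, since all WAV size fields are little-endian.
    if (wav.Length < 46) return wav;
    int fmtSize = BitConverter.ToInt32(wav, 16);
    if (fmtSize != 18) return wav;

    var patched = new byte[wav.Length - 2];
    Array.Copy(wav, 0, patched, 0, 36);                 // RIFF header + 16 fmt bytes
    Array.Copy(wav, 38, patched, 36, wav.Length - 38);  // skip the 2 cbSize bytes

    BitConverter.GetBytes(16).CopyTo(patched, 16);      // fmt chunk size: 18 -> 16
    int riffSize = BitConverter.ToInt32(wav, 4) - 2;    // overall RIFF size shrinks by 2
    BitConverter.GetBytes(riffSize).CopyTo(patched, 4);
    return patched;
}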
I'm having trouble writing a static Deflate extension method that I would use to deflate a string using the BZip2 algorithm of the SharpZipLib library (runtime version: v2.0.50727).
I'm using .NET Framework 4.
This is my VB.NET code:
Public Function Deflate(ByVal text As String)
    Try
        Dim compressedData As Byte() = Convert.FromBase64String(text)
        System.Diagnostics.Debug.WriteLine(String.Concat("Compressed text data size: ", text.Length.ToString()))
        System.Diagnostics.Debug.WriteLine(String.Concat("Compressed byte data size: ", compressedData.Length.ToString()))
        Using compressedStream As MemoryStream = New MemoryStream(compressedData)
            Using decompressionStream As BZip2OutputStream = New BZip2OutputStream(compressedStream)
                Dim cleanData() As Byte
                Using decompressedStream As MemoryStream = New MemoryStream()
                    decompressionStream.CopyTo(decompressedStream) ' HERE THE ERROR!
                    cleanData = decompressedStream.ToArray()
                End Using
                decompressionStream.Close()
                compressedStream.Close()
                Dim cleanText As String = Encoding.UTF8.GetString(cleanData, 0, cleanData.Length)
                System.Diagnostics.Debug.WriteLine(String.Concat("After decompression text data size: ", cleanText.Length.ToString()))
                System.Diagnostics.Debug.WriteLine(String.Concat("After decompression byte data size: ", cleanData.Length.ToString()))
                Return cleanText
            End Using
        End Using
    Catch
        Return String.Empty
    End Try
End Function
The strange thing is that I wrote a C# counterpart of the same method, and it works perfectly!!! This is the code:
public static string Deflate(this string text)
{
    try
    {
        byte[] compressedData = Convert.FromBase64String(text);
        System.Diagnostics.Debug.WriteLine(String.Concat("Compressed text data size: ", text.Length.ToString()));
        System.Diagnostics.Debug.WriteLine(String.Concat("Compressed byte data size: ", compressedData.Length.ToString()));
        using (MemoryStream compressedStream = new MemoryStream(compressedData))
        using (BZip2InputStream decompressionStream = new BZip2InputStream(compressedStream))
        {
            byte[] cleanData;
            using (MemoryStream decompressedStream = new MemoryStream())
            {
                decompressionStream.CopyTo(decompressedStream);
                cleanData = decompressedStream.ToArray();
            }
            decompressionStream.Close();
            compressedStream.Close();
            string cleanText = Encoding.UTF8.GetString(cleanData, 0, cleanData.Length);
            System.Diagnostics.Debug.WriteLine(String.Concat("After decompression text data size: ", cleanText.Length.ToString()));
            System.Diagnostics.Debug.WriteLine(String.Concat("After decompression byte data size: ", cleanData.Length.ToString()));
            return cleanText;
        }
    }
    catch (Exception e)
    {
        return String.Empty;
    }
}
In the VB.NET version I get this error: "Stream does not support reading." (see the code above for where it occurs).
Where is the mistake?!! I cannot understand what's the difference between the two methods...
Thank you very much!
A game of spot the difference shows that in the first you are using BZip2OutputStream whereas the second is BZip2InputStream.
It seems reasonable that the output stream is used to write to, and so, as the error says, is not readable.
For what it's worth, there are a lot of good comparison tools out there. They won't cope with syntax differences, but the way the matching works, it shows up quite well when you are using totally different objects (in this case at least). The one I personally use and recommend is Beyond Compare.
You switched BZip2OutputStream and BZip2InputStream
In one version you are using a BZip2InputStream and in the other a BZip2OutputStream.