Generate Running Hash (or Checksum) in C#?

Preface:
I am doing a data-import that has a verify-commit phase. The idea: the verify phase takes data from various sources and runs various insert/update/validate operations against a database, then rolls back the transaction but produces a "verification hash/checksum". The commit phase repeats the same operations, but commits them only if it produces the same "verification hash/checksum". (The database will be running under the appropriate isolation levels.)
Restrictions:
Input reading and operations are forward-read-once only
Do not want to pre-create a stream (e.g. writing to MemoryStream not desirable) as there may be a lot of data. (It would work on our servers/load, but pretend memory is limited.)
Do not want to "create my own". (I am aware of available code like CRC-32 by Damien which I could use/modify but would prefer something "standard".)
And what I (think I am) looking for:
A way to generate a Hash (e.g. SHA1 or MD5?) or a Checksum (e.g. CRC32, but hopefully something stronger) based on input + operations. (The input/operations could themselves be hashed to values more fitting for the checksum generation, but it would be nice just to be able to "write to stream".)
So, the question is:
How to generate a Running Hash (or Checksum) in C#?
Also, while there are CRC32 implementations that can be modified for a running operation, what about running SHAx or MD5 hashes?
Am I missing some sort of handy Stream approach that could be used as an adapter?
(Critiques are welcome, but please also answer the above as applicable. Also, I would prefer not to deal with threads. ;-)

You can call HashAlgorithm.TransformBlock multiple times; calling TransformFinalBlock at the end then gives you the result over all the blocks.
Chunk up your input (by reading x bytes at a time from a stream) and call TransformBlock with each chunk.
EDIT (from the MSDN example):
public static void PrintHashMultiBlock(byte[] input, int size)
{
    SHA256Managed sha = new SHA256Managed();
    int offset = 0;
    while (input.Length - offset >= size)
        offset += sha.TransformBlock(input, offset, size, input, offset);
    sha.TransformFinalBlock(input, offset, input.Length - offset);
    // BytesToStr is the hex-encoding helper from the same MSDN sample.
    Console.WriteLine("MultiBlock {0:00}: {1}", size, BytesToStr(sha.Hash));
}
Sorry I don't have a stream-based example readily available, but for your case you're basically replacing input with your own chunk; size would be the number of bytes in that chunk, and you will have to keep track of the offset yourself.
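A minimal sketch of that chunked approach, assuming the data arrives through a readable Stream (the buffer size here is arbitrary):
using System;
using System.IO;
using System.Security.Cryptography;
public static byte[] HashStreamInChunks(Stream source)
{
    using (var sha = SHA256.Create())
    {
        byte[] buffer = new byte[8192];
        int read;
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            sha.TransformBlock(buffer, 0, read, null, 0); // the output buffer may be null
        sha.TransformFinalBlock(buffer, 0, 0); // empty final block; all data already hashed
        return sha.Hash;
    }
}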

Hashes have a build and a finalization phase. You can shove arbitrary amounts of data in during the build phase. The data can be split up as you like. Finally, you finish the hash operation and get your hash.
You can use a writable CryptoStream to write your data. This is the easiest way.
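A minimal sketch of that approach, writing through to Stream.Null since only the hash is wanted (someBytes stands in for whatever chunk you have on hand):
using (var md5 = MD5.Create())
using (var cs = new CryptoStream(Stream.Null, md5, CryptoStreamMode.Write))
{
    cs.Write(someBytes, 0, someBytes.Length); // repeat for each chunk as it streams past
    cs.FlushFinalBlock();
    byte[] hash = md5.Hash; // the accumulated hash over everything written
}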

You can generate an MD5 hash using the MD5CryptoServiceProvider's ComputeHash method. It takes a stream as input.
Create a memory or file stream, write your hash inputs to that, and then call the ComputeHash method when you are done.
var myStream = new MemoryStream();
// Blah blah, write to the stream...
myStream.Position = 0;
using (var csp = new MD5CryptoServiceProvider()) {
    var myHash = csp.ComputeHash(myStream);
}
EDIT: One possibility to avoid building up massive streams is calling this over and over in a loop and XORing the per-chunk results (note that XOR makes the combined value insensitive to chunk order):
// Assuming we had this somewhere:
Byte[] myRunningHash = new Byte[16];
// Later on, from above:
for (var i = 0; i < 16; i++) // MD5 hashes are 16-byte arrays
    myRunningHash[i] = (byte)(myRunningHash[i] ^ myHash[i]);
EDIT #2: Finally, building on #usr's answer below: HashCore and HashFinal are protected, so from outside the class you reach them through TransformBlock and TransformFinalBlock:
using (var csp = new MD5CryptoServiceProvider()) {
    // My example here uses a foreach loop, but an
    // event-driven stream-like approach is
    // probably more what you are doing here.
    foreach (byte[] someData in myDataThings)
        csp.TransformBlock(someData, 0, someData.Length, null, 0);
    csp.TransformFinalBlock(new byte[0], 0, 0);
    var myHash = csp.Hash;
}

This is the canonical way:
using System;
using System.Security.Cryptography;
using System.Text;
public string CreateHash(string sSourceData)
{
    byte[] sourceBytes;
    byte[] hashBytes;
    // create byte array from source data
    sourceBytes = ASCIIEncoding.ASCII.GetBytes(sSourceData);
    // calculate 16-byte hash code
    hashBytes = new MD5CryptoServiceProvider().ComputeHash(sourceBytes);
    string sOutput = ByteArrayToHexString(hashBytes);
    return sOutput;
}
static string ByteArrayToHexString(byte[] arrInput)
{
    // two hex chars per byte
    StringBuilder sOutput = new StringBuilder(arrInput.Length * 2);
    for (int i = 0; i < arrInput.Length; i++) // < Length, so the last byte is included
    {
        sOutput.Append(arrInput[i].ToString("X2"));
    }
    return sOutput.ToString();
}
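On newer .NET versions there is also System.Security.Cryptography.IncrementalHash, which is built for exactly this running-hash pattern; a minimal sketch (firstChunk and secondChunk are placeholders for your data):
using (var hasher = IncrementalHash.CreateHash(HashAlgorithmName.MD5))
{
    hasher.AppendData(firstChunk);
    hasher.AppendData(secondChunk); // append as many times as needed
    byte[] hash = hasher.GetHashAndReset();
}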

Related

How do I produce a Kinesis Data Stream HashKey from a PartitionKey?

I'm consuming data from a Kinesis Data Stream using a C# client. There are multiple instances of the C# client, one for each shard in the stream, each concurrently retrieves Kinesis Records using the GetRecords method.
The PutRecords call is being performed on an external system which is specifying an IMEI as the 'PartitionKey' value. The HashKey (and hence shard selection) is being performed by the KinesisClient using (from the docs) 'An MD5 hash function ... to map partition keys to 128-bit integer values and to map associated data records to shards using the hash key ranges of the shards'.
I have an external system update which is broadcasting IMEI related data to all of the running clients. My challenge is that I need to determine which client is processing the data for the IMEI, hence I need to apply the same 'MD5 hash function' as the KinesisClient is applying to the data. My intention is then to compare the hash to the HashKey range of the shard the client is processing, allowing the client to determine whether it is interested in the IMEI data, or not.
I've tried to work this out and have this code:
byte[] hash;
string imei = "123456789012345";
using (MD5 md5 = MD5.Create()) {
    hash = md5.ComputeHash(Encoding.UTF8.GetBytes(imei));
}
Console.WriteLine(new BigInteger(hash));
However, this gives me a negative value and my shard HashKey ranges are positive.
I really need the C# code which the KinesisClient is using to turn a PartitionKey into a HashKey, but I can't find it. Can anyone help me to work this out please?
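(From what I can tell, the negative value comes from BigInteger's byte[] constructor treating the array as little-endian two's complement, so a high leading bit reads as a sign bit. A sketch of the unsigned, big-endian interpretation that should match Java's new BigInteger(1, bytes), assuming the .NET Core 2.1+ overload is available:)
BigInteger hashKey = new BigInteger(hash, isUnsigned: true, isBigEndian: true);
Console.WriteLine(hashKey); // now non-negative, comparable to the shard HashKey ranges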
UPDATE: I'm making progress after finding the links I mentioned in the comments below. My code currently looks like this:
private static MD5 md5 = MD5.Create();
private static List<BigInteger> powersOfTen = null;
private static BigInteger MAXHASHKEY = BigInteger.Parse("340282366920938463463374607431768211455");

public String CreateExplicitHashKey(String partitionKey) {
    byte[] pkDigest = md5.ComputeHash(Encoding.UTF8.GetBytes(partitionKey));
    byte[] hashKeyBytes = new byte[16];
    for (int i = 0; i < pkDigest.Length; i++) {
        hashKeyBytes[16 - i - 1] = (byte)((int)pkDigest[i] & 0xFF);
    }
    BigInteger hashKey = new BigInteger(hashKeyBytes, true, false);
    if (powersOfTen == null) {
        powersOfTen = new List<BigInteger>();
        powersOfTen.Add(1);
        for (BigInteger i = 10; i < MAXHASHKEY; i *= i) {
            powersOfTen.Add(i);
        }
    }
    return BuildString(hashKey, powersOfTen.Count - 1).TrimStart('0');
}

private static string BuildString(BigInteger n, int m) {
    if (m == 0) return n.ToString();
    BigInteger remainder;
    BigInteger quotient = BigInteger.DivRem(n, powersOfTen[m], out remainder);
    return BuildString(quotient, m - 1) + BuildString(remainder, m - 1);
}
I've been able to verify the conversion using the examples offered by Will Haley over here. Now I'm seeking to verify the code to ensure it is doing the exact same conversion as is performed by the Kinesis PutRecord/PutRecords client methods, so I can be 100% confident.
Originally I'd expected to find a function for this in the Kinesis client library, but couldn't find one. I've made this suggestion over on GitHub

Does converting between byte[] and MemoryStream cause overhead?

I want to know if there's overhead when converting between byte arrays and streams (specifically MemoryStream, using MemoryStream.ToArray() and new MemoryStream(byte[])). I assume it temporarily doubles memory usage.
For example, I read as a stream, convert to bytes, and then convert to stream again.
But getting rid of that byte conversion will require a bit of a rewrite. I don't want to waste time rewriting it if it doesn't make a difference.
So, yes, you are correct in assuming that ToArray duplicates the memory in the stream.
If you do not want to do this (for efficiency reasons), you could modify the bytes directly in the stream. Take a look at this:
// create some bytes: 0,1,2,3,4,5,6,7...
var originalBytes = Enumerable.Range(0, 256).Select(Convert.ToByte).ToArray();
// Note: the stream must be created with publiclyVisible: true, or GetBuffer throws
// and TryGetBuffer returns false; plain new MemoryStream(byte[]) hides the buffer.
using (var ms = new MemoryStream(originalBytes, 0, originalBytes.Length, writable: true, publiclyVisible: true)) // ms references the bytes array, not a duplicate
{
    // var duplicatedBytes = ms.ToArray(); // copy of originalBytes array
    // If you don't want to duplicate the bytes but want to
    // modify the buffer directly, you could do this:
    var bufRef = ms.GetBuffer();
    for (var i = 0; i < bufRef.Length; ++i)
    {
        bufRef[i] = Convert.ToByte(bufRef[i] ^ 0x55);
    }
    // or this:
    /*
    ms.TryGetBuffer(out var buf);
    for (var i = 0; i < buf.Count; ++i)
    {
        buf[i] = Convert.ToByte(buf[i] ^ 0x55);
    }
    */
    // or this (works even without an exposable buffer):
    /*
    for (var i = 0; i < ms.Length; ++i)
    {
        ms.Position = i;
        var b = ms.ReadByte();
        ms.Position = i;
        ms.WriteByte(Convert.ToByte(b ^ 0x55));
    }
    */
}
// originalBytes will now be 85,84,87,86...
ETA:
Edited to add in Blindy's examples. Thanks! -- Totally forgot about GetBuffer and had no idea about TryGetBuffer
Does MemoryStream(byte[]) cause a memory copy?
No, it's a non-resizable stream, and as such no copy is necessary.
Does MemoryStream.ToArray() cause a memory copy?
Yes, by design it creates a copy of the active buffer. This is to cover the resizable case, where the buffer used by the stream is not the same buffer that was initially provided due to reallocations to increase/decrease its size.
Alternatives to MemoryStream.ToArray() that don't cause memory copy?
Sure, you have MemoryStream.TryGetBuffer(out ArraySegment<byte> buffer), which returns a segment pointing to the internal buffer, whether or not it's resizable. (It returns false if the stream was created from an array without publiclyVisible: true.) If it's non-resizable, it's a segment into your original array.
You also have MemoryStream.GetBuffer, which returns the entire internal buffer. Note that in the resizable case, this will be a lot larger than the actual used stream space, and you'll have to adjust for that in code.
And lastly, you don't always actually need a byte array; sometimes you just need to write it to another stream (a file, a socket, a compression stream, an HTTP response, etc). For this, you have MemoryStream.CopyTo[Async], which streams the contents out without allocating an intermediate array.
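For example, a quick sketch of that (ms stands for your MemoryStream; the destination here is a hypothetical file, but any writable Stream works):
ms.Position = 0; // rewind before copying
using (var destination = File.Create("output.bin"))
{
    ms.CopyTo(destination); // no intermediate byte[] is allocated
}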

How to read file bytes from byte offset?

If I am given a .cmp file and a byte offset 0x598, how can I read a file from this offset?
I can of course read the file's bytes like this
byte[] fileBytes = File.ReadAllBytes("upgradefile.cmp");
But how can I read it from byte offset 0x598?
To explain a bit more: the actual data that I have to read starts at this offset, and everything before it is just header data, so basically I have to read the file from that offset to the end.
Try code like this:
using (BinaryReader reader = new BinaryReader(File.Open("upgradefile.cmp", FileMode.Open)))
{
    long offset = 0x598;
    if (reader.BaseStream.Length > offset)
    {
        reader.BaseStream.Seek(offset, SeekOrigin.Begin);
        byte[] fileBytes = reader.ReadBytes((int)(reader.BaseStream.Length - offset));
    }
}
If you are not familiar with Streams, LINQ, or whatever, I have the simplest solution for you:
Read the entire file into memory (I hope you are dealing with small files):
byte[] fileBytes = File.ReadAllBytes("upgradefile.cmp");
Calculate how many bytes are present in the array after the given offset:
long startOffset = 0x598; // hex is just a human-friendly representation; decimal works too
long howManyBytesToRead = fileBytes.Length - startOffset;
Then just copy the data to a new array:
byte[] newArray = new byte[howManyBytesToRead];
long pos = 0;
for (long i = startOffset; i < fileBytes.Length; i++) // i must be long to match startOffset
{
    newArray[pos] = fileBytes[i];
    pos = pos + 1;
}
If you understand how it works, you can look at the Array.Copy method in the Microsoft documentation.
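With Array.Copy, the loop above collapses to a single call (same variables as in the steps above):
Array.Copy(fileBytes, startOffset, newArray, 0, howManyBytesToRead);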
By not using ReadAllBytes.
Get a stream, move to the position, read the rest of the file.
You basically complain that a convenience method, made to allow a one-line read of a whole file, is not what you want, ignoring that it is just that: a convenience method. The normal way to deal with files is opening them and using a Stream.
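A minimal sketch of that approach:
using (var fs = File.OpenRead("upgradefile.cmp"))
{
    fs.Seek(0x598, SeekOrigin.Begin); // skip the header
    byte[] data = new byte[fs.Length - 0x598];
    int total = 0, read;
    // Read may return fewer bytes than requested, so loop until everything is in
    while ((read = fs.Read(data, total, data.Length - total)) > 0)
        total += read;
}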

What is the best way to prep data for serial transmission?

I am working on a C# program which will communicate with a VFD using the Mitsubishi communication protocol.
I am preparing several methods to create an array of bytes to be sent out.
Right now, I have typed up more of a brute-force method of preparing and sending the bytes.
public void A(Int16 Instruction, byte WAIT, Int32 Data)
{
    byte[] A_Bytes = new byte[13];
    A_Bytes[0] = C_ENQ;
    A_Bytes[1] = 0x00;
    A_Bytes[2] = 0x00;
    A_Bytes[3] = BitConverter.GetBytes(Instruction)[0];
    A_Bytes[4] = BitConverter.GetBytes(Instruction)[1];
    A_Bytes[5] = WAIT;
    A_Bytes[6] = BitConverter.GetBytes(Data)[0];
    A_Bytes[7] = BitConverter.GetBytes(Data)[1];
    A_Bytes[8] = BitConverter.GetBytes(Data)[2];
    A_Bytes[9] = BitConverter.GetBytes(Data)[3];
    Int16 SUM = 0;
    for (int i = 0; i < 10; i++)
    {
        SUM += A_Bytes[i];
    }
    A_Bytes[10] = BitConverter.GetBytes(SUM)[0];
    A_Bytes[11] = BitConverter.GetBytes(SUM)[1];
    A_Bytes[12] = C_CR;
    itsPort.Write(A_Bytes, 0, 13);
}
However, something seems very inefficient about this, especially the fact that I call GetBytes() so often.
Is this a good method, or is there a vastly shorter/faster one?
MAJOR UPDATE:
Turns out the Mitsubishi structure is a little wonky in how it does all this.
Instead of working with bytes, it works with ASCII chars, so while ENQ is still 0x05, an instruction code of E1, for instance, is actually 0x45 and 0x31.
This might actually make things easier.
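For instance, the ASCII encoding gives exactly those bytes:
byte[] instructionBytes = Encoding.ASCII.GetBytes("E1"); // { 0x45, 0x31 }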
Even without changing your algorithm, this can be made a bit more efficient and a bit more C#-like. If concatenating two arrays bothers you, that part is of course optional.
var instructionBytes = BitConverter.GetBytes(instruction);
var dataBytes = BitConverter.GetBytes(data);
var contentBytes = new byte[] {
    C_ENQ, 0x00, 0x00, instructionBytes[0], instructionBytes[1], wait,
    dataBytes[0], dataBytes[1], dataBytes[2], dataBytes[3]
};
short sum = 0;
foreach (var byteValue in contentBytes)
{
    sum += byteValue;
}
var sumBytes = BitConverter.GetBytes(sum);
// ToArray is needed: Concat returns a lazy IEnumerable<byte>, not an array
var messageBytes = contentBytes.Concat(new byte[] { sumBytes[0], sumBytes[1], C_CR }).ToArray();
itsPort.Write(messageBytes, 0, messageBytes.Length);
What I would suggest though, if you find yourself writing a lot of code like this, is to consider wrapping this up into a Message class. This code would form the basis of your constructor. You could then vary behavior (make things longer, shorter etc) with inheritance (or composition) and deal with the message as an object rather than a byte array.
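A rough sketch of that shape (names are illustrative, not a fixed design; C_ENQ and C_CR are assumed to be the byte constants from your code, with using System.Linq and System.IO.Ports):
public class VfdMessage
{
    private readonly byte[] bytes;

    public VfdMessage(short instruction, byte wait, int data)
    {
        var instructionBytes = BitConverter.GetBytes(instruction);
        var dataBytes = BitConverter.GetBytes(data);
        var content = new byte[] {
            C_ENQ, 0x00, 0x00, instructionBytes[0], instructionBytes[1], wait,
            dataBytes[0], dataBytes[1], dataBytes[2], dataBytes[3]
        };
        short sum = 0;
        foreach (var b in content)
            sum += b;
        var sumBytes = BitConverter.GetBytes(sum);
        bytes = content.Concat(new[] { sumBytes[0], sumBytes[1], C_CR }).ToArray();
    }

    public void WriteTo(SerialPort port)
    {
        port.Write(bytes, 0, bytes.Length);
    }
}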
Incidentally, you may see marginal gains from using BinaryWriter rather than BitConverter (maybe?), but it's more hassle to use. Splitting the checksum manually with (byte)sum and (byte)(sum >> 8) is another option, which I think is actually the fastest and probably makes the most sense in your use case.

How to fill byte array with junk?

I am using this:
byte[] buffer = new byte[10240];
As I understand it, this initializes a 10 KB buffer array filled with 0s.
What's the fastest way to fill this array (or initialize it) with junk data every time?
I need to use that array more than 5000 times, filled each time with different junk data; that's why I am looking for a fast method to do it. The array size will also have to change every time.
Answering 'the fastest way' is impossible without knowing what properties your junk data has to have. Why isn't all zeroes valid junk data?
That said, this is a fast way to fill your array with meaningless numbers.
Random r = new Random();
r.NextBytes(buffer);
You might also look at implementing your own linear congruential generator if Random isn't fast enough for you. They're simple to implement and fast, but won't give high-quality random numbers. (It's unclear to me whether you need those or not.)
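A minimal LCG sketch (the multiplier and increment are the well-known Numerical Recipes constants; fine for junk data, not for statistics or crypto):
public class Lcg
{
    private uint state;

    public Lcg(uint seed)
    {
        state = seed;
    }

    public void NextBytes(byte[] buffer)
    {
        for (int i = 0; i < buffer.Length; i++)
        {
            state = state * 1664525u + 1013904223u; // Numerical Recipes LCG step
            buffer[i] = (byte)(state >> 24); // the high byte is the most random
        }
    }
}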
If you are happy with the data being random, but being created from a random seed buffer, then you could do the following:
public class RandomBufferGenerator
{
    private readonly Random _random = new Random();
    private readonly byte[] _seedBuffer;

    public RandomBufferGenerator(int maxBufferSize)
    {
        _seedBuffer = new byte[maxBufferSize];
        _random.NextBytes(_seedBuffer);
    }

    public byte[] GenerateBufferFromSeed(int size) // size must be <= maxBufferSize
    {
        int randomWindow = _random.Next(0, size);
        byte[] buffer = new byte[size];
        Buffer.BlockCopy(_seedBuffer, randomWindow, buffer, 0, size - randomWindow);
        Buffer.BlockCopy(_seedBuffer, 0, buffer, size - randomWindow, randomWindow);
        return buffer;
    }
}
I found it to be approx 60-70 times faster than generating a random buffer from scratch each time.
START: From seed buffer.
00:00:00.009 END : From seed buffer. (Items = 5,000; Per Second = 500,776.20)
START: From scratch.
00:00:00.604 END : From scratch. (Items = 5,000; Per Second = 8,276.95)
Update
The general idea is to create a RandomBufferGenerator once, and then use this instance to generate random buffers, e.g.:
RandomBufferGenerator generator = new RandomBufferGenerator(MaxBufferSize);
byte[] randomBuffer1 = generator.GenerateBufferFromSeed(10 * 1024);
byte[] randomBuffer2 = generator.GenerateBufferFromSeed(5 * 1024);
...
Look at the System.Random.NextBytes() method
As another option to consider, Marshal.AllocHGlobal will allocate unmanaged memory. It doesn't zero out the memory; you get whatever happened to be there, so it's very fast. Of course you now have to work with this memory using unsafe code, and if you need to pull it into the managed space you are better off with Random.NextBytes.
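A sketch of that route (the memory is uninitialized and must be freed manually):
IntPtr p = Marshal.AllocHGlobal(10240); // uninitialized unmanaged memory
try
{
    // work with it via Marshal.ReadByte/WriteByte or unsafe pointers
}
finally
{
    Marshal.FreeHGlobal(p);
}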
How junky does the data need to be? Do you mean random? If that is the case, just use the Random class.
