I was looking at the good example in this MSDN page: http://msdn.microsoft.com/en-us/library/system.security.cryptography.x509certificates.x509certificate2.aspx
Scroll halfway down to the example and look at the method:
// Decrypt a file using a private key.
private static void DecryptFile(string inFile, RSACryptoServiceProvider rsaPrivateKey)
You will notice the reader is reading only 3 bytes at a time while trying to read an int off the stream:
inFs.Seek(0, SeekOrigin.Begin);
inFs.Read(LenK, 0, 3);// <---- this should be 4
inFs.Seek(4, SeekOrigin.Begin);// <--- this line masks the bug for smaller ints
inFs.Read(LenIV, 0, 3); // <---- this should be 4
Since the next line seeks to position 4, the bug is getting masked. Am I getting it right, or is it intentional, i.e. some sort of weird optimization? Since we know that (for this example) the lengths of the AES key and IV are small enough to fit in 3 bytes, read only 3 and then skip to position 4, thus saving the read of 1 byte off the disk?
If it's an optimization... really??
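For reference, here is a hedged sketch of what the corrected reads would look like: both length headers are 4-byte little-endian ints, so all four bytes should be read. A MemoryStream with made-up lengths (256-byte key, 16-byte IV) stands in for the encrypted file here.

```csharp
using System;
using System.IO;

class HeaderReadSketch
{
    static void Main()
    {
        // Fake 8-byte header: key length 256, IV length 16.
        var header = new byte[8];
        BitConverter.GetBytes(256).CopyTo(header, 0);
        BitConverter.GetBytes(16).CopyTo(header, 4);

        using (var inFs = new MemoryStream(header))
        {
            byte[] lenK = new byte[4];
            byte[] lenIV = new byte[4];

            inFs.Seek(0, SeekOrigin.Begin);
            inFs.Read(lenK, 0, 4);          // 4 bytes, not 3
            inFs.Seek(4, SeekOrigin.Begin);
            inFs.Read(lenIV, 0, 4);         // 4 bytes, not 3

            int lengthK = BitConverter.ToInt32(lenK, 0);
            int lengthIV = BitConverter.ToInt32(lenIV, 0);
            Console.WriteLine(lengthK + " " + lengthIV); // 256 16
        }
    }
}
```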
I very much doubt it's an optimisation. Disk reads tend to be in chunks substantially larger than four bytes, and caching would pretty much invalidate this type of optimisation, except in the very rare case where the four bytes may cross two different disk "sectors" (or whatever the disk read resolution is).
That sort of paradigm tends to be seen where (for example) only three bytes are used and the implementation stores other information there.
Not saying that's the case here but you may want to look into the history of certain large companies in using fields for their own purposes, despite what standards say, a la "embrace, extend, extinguish" :-)
Related
Previously I asked a question about combining SHA1+MD5, but after that I understood that calculating SHA1 and then MD5 of a large file is not much faster than SHA256.
In my case, a 4.6 GB file takes about 10 minutes with the default SHA256 implementation (C#, Mono) on a Linux system.
public static string GetChecksum(string file)
{
    using (FileStream stream = File.OpenRead(file))
    {
        var sha = new SHA256Managed();
        byte[] checksum = sha.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}
Then I read this topic and changed my code according to what it said, to:
public static string GetChecksumBuffered(Stream stream)
{
    using (var bufferedStream = new BufferedStream(stream, 1024 * 32))
    {
        var sha = new SHA256Managed();
        byte[] checksum = sha.ComputeHash(bufferedStream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}
But it doesn't make much of a difference; it takes about 9 minutes.
Then I tried the sha256sum command on Linux for the same file: it takes about 28 seconds, and both the code above and the Linux command give the same result!
Someone advised me to read about the differences between a hash code and a checksum, and I reached this topic that explains the differences.
My Questions are :
What causes such a difference in time between the code above and Linux's sha256sum?
What does the code above actually do? (I mean, is it a hash-code calculation or a checksum calculation? Because if you search for how to get a hash code of a file, or a checksum of a file, in C#, both searches lead to the code above.)
Is there any motivated attack against sha256sum, even though SHA256 is collision resistant?
How can I make my implementation as fast as sha256sum in C#?
public string SHA256CheckSum(string filePath)
{
    using (SHA256 SHA256 = SHA256Managed.Create())
    {
        using (FileStream fileStream = File.OpenRead(filePath))
            return Convert.ToBase64String(SHA256.ComputeHash(fileStream));
    }
}
My best guess is that there's some additional buffering in the Mono implementation of the File.Read operation. Having recently looked into checksums on a large file, on a decent-spec Windows machine you should expect roughly 6 seconds per GB if all is running smoothly.
Oddly it has been reported in more than one benchmark test that SHA-512 is noticeably quicker than SHA-256 (see 3 below). One other possibility is that the problem is not in allocating the data, but in disposing of the bytes once read. You may be able to use TransformBlock (and TransformFinalBlock) on a single array rather than reading the stream in one big gulp—I have no idea if this will work, but it bears investigating.
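Here is a minimal sketch of the TransformBlock/TransformFinalBlock idea, with the caveat already stated above that I don't know whether it will actually be faster; it reads the stream in large chunks (the 1 MB buffer size is an arbitrary choice) and feeds each chunk to the hash incrementally rather than handing the whole stream to ComputeHash:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

class IncrementalHashSketch
{
    // Hash a stream in large chunks via TransformBlock/TransformFinalBlock.
    public static string GetChecksumIncremental(Stream stream)
    {
        using (var sha = SHA256.Create())
        {
            var buffer = new byte[1024 * 1024]; // 1 MB reads (arbitrary)
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                sha.TransformBlock(buffer, 0, read, null, 0);
            sha.TransformFinalBlock(new byte[0], 0, 0);
            return BitConverter.ToString(sha.Hash).Replace("-", string.Empty);
        }
    }

    static void Main()
    {
        // Empty input gives the well-known SHA-256 of the empty string
        // (e3b0c442...), so the plumbing can be sanity-checked.
        using (var ms = new MemoryStream())
            Console.WriteLine(GetChecksumIncremental(ms));
    }
}
```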
The difference between hashcode and checksum is (nearly) semantics. They both calculate a shorter 'magic' number that is fairly unique to the data in the input, though if you have 4.6GB of input and 64B of output, 'fairly' is somewhat limited.
A checksum is not secure, and with a bit of work you can figure out the input from enough outputs, work backwards from output to input and do all sorts of insecure things.
A Cryptographic hash takes longer to calculate, but changing just one bit in the input will radically change the output and for a good hash (e.g. SHA-512) there's no known way of getting from output back to input.
MD5 is breakable: you can fabricate an input to produce any given output, if needed, on a PC. SHA-256 is (probably) still secure, but won't be in a few years time—if your project has a lifespan measured in decades, then assume you'll need to change it. SHA-512 has no known attacks and probably won't for quite a while, and since it's quicker than SHA-256 I'd recommend it anyway. Benchmarks show it takes about 3 times longer to calculate SHA-512 than MD5, so if your speed issue can be dealt with, it's the way to go.
No idea, beyond those mentioned above. You're doing it right.
For a bit of light reading, see Crypto.SE: SHA512 faster than SHA256?
Edit in response to question in comment
The purpose of a checksum is to allow you to check if a file has changed between the time you originally wrote it, and the time you come to use it. It does this by producing a small value (512 bits in the case of SHA512) where every bit of the original file contributes at least something to the output value. The purpose of a hashcode is the same, with the addition that it is really, really difficult for anyone else to get the same output value by making carefully managed changes to the file.
The premise is that if the checksums are the same at the start and when you check it, then the files are the same, and if they're different the file has certainly changed. What you are doing above is feeding the file, in its entirety, through an algorithm that rolls, folds and spindles the bits it reads to produce the small value.
As an example: in the application I'm currently writing, I need to know if parts of a file of any size have changed. I split the file into 16K blocks, take the SHA-512 hash of each block, and store it in a separate database on another drive. When I come to see if the file has changed, I reproduce the hash for each block and compare it to the original. Since I'm using SHA-512, the chances of a changed file having the same hash are unimaginably small, so I can be confident of detecting changes in 100s of GB of data whilst only storing a few MB of hashes in my database. I'm copying the file at the same time as taking the hash, and the process is entirely disk-bound; it takes about 5 minutes to transfer a file to a USB drive, of which 10 seconds is probably related to hashing.
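The block-hashing scheme described above can be sketched roughly like this (this is my illustration, not the actual application code; block size and hash choice follow the description):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

class BlockHashSketch
{
    // Split the input into 16 KB blocks and keep one SHA-512 digest per block.
    static List<byte[]> HashBlocks(Stream input)
    {
        var hashes = new List<byte[]>();
        var block = new byte[16 * 1024];
        int read;
        using (var sha = SHA512.Create())
            while ((read = input.Read(block, 0, block.Length)) > 0)
                hashes.Add(sha.ComputeHash(block, 0, read));
        return hashes;
    }

    static void Main()
    {
        // 40 KB of zeros -> blocks of 16 KB, 16 KB and 8 KB.
        using (var ms = new MemoryStream(new byte[40 * 1024]))
            Console.WriteLine(HashBlocks(ms).Count); // 3
    }
}
```

To detect changes later, recompute each block's hash and compare it with the stored one; only matching positions with differing digests have changed.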
Lack of disk space to store hashes is a problem I can't solve in a post—buy a USB stick?
Way late to the party but seeing as none of the answers mentioned it, I wanted to point out:
SHA256Managed is an implementation of the System.Security.Cryptography.HashAlgorithm class, and all of the functionality related to the read operations are handled in the inherited code.
HashAlgorithm.ComputeHash(Stream) uses a fixed 4096 byte buffer to read data from a stream. As a result, you're not really going to see much difference using a BufferedStream for this call.
HashAlgorithm.ComputeHash(byte[]) operates on the entire byte array, but it resets the internal state after every call, so it can't be used to incrementally hash a buffered stream.
Your best bet would be to use a third party implementation that's optimized for your use case.
using (SHA256 SHA256 = SHA256Managed.Create())
{
    using (FileStream fileStream = System.IO.File.OpenRead(filePath))
    {
        string result = "";
        foreach (var hash in SHA256.ComputeHash(fileStream))
        {
            result += hash.ToString("x2");
        }
        return result;
    }
}
For Reference: https://www.c-sharpcorner.com/article/how-to-convert-a-byte-array-to-a-string/
I'm just about to build a simple chat application using One Time Pad.
I've already made the algorithm, and to encrypt the messages, I need some sort of key material that is the same on both sides. The distribution of the key material is supposed to happen with physical contact (e.g. USB dongle). So I would like to make some very large random key files, that the two clients can use to communicate. So my questions are:
I need a very secure random number/string generator. Do you know any good ones that I can use in C#?
And since the files are so big, how do I avoid loading the whole file into memory? I plan to read a chunk of the key material (e.g. 1 MB) and remove it from the file after it is read, so the same key is never used twice.
I should probably start with this: I assume this is a for-fun or exercise project, not an attempt to create something truly secure.
As owlstead says: Use RNGCryptoServiceProvider.
Removing the used key material from the file is much easier if you use it in reverse. If you need to encrypt 1024 bytes, read the last 1024 bytes from the file and truncate it. Simplified:
byte[] Encrypt(byte[] plain)
{
    using (FileStream keyFile = new FileStream(FileName, FileMode.Open))
    {
        // Read the last plain.Length bytes of the key file.
        keyFile.Seek(-plain.Length, SeekOrigin.End);
        byte[] key = new byte[plain.Length];
        int offset = 0;
        while (offset < key.Length)
            offset += keyFile.Read(key, offset, key.Length - offset);

        // XOR the plaintext with the key material, consumed back to front.
        byte[] encrypted = new byte[plain.Length];
        for (int i = 0; i < plain.Length; i++)
        {
            encrypted[i] = (byte)(plain[i] ^ key[plain.Length - 1 - i]);
        }

        // Truncate so the used key material can never be reused.
        keyFile.SetLength(keyFile.Length - plain.Length);
        return encrypted;
    }
}
You are trying to solve an issue that has already been solved: using AES is almost certainly as safe as using the One Time Pad. In that case your key becomes 16 bytes (although you will need a NONCE or IV depending on the mode).
You need to use a secure random number generator RNGCryptoServiceProvider.
Read the file into memory using MemoryMappedFile.
Just (atomically) store the offset in the file instead. E.g. use the first 8 bytes of the same file to store a ulong, so you can sync when needed. You may additionally overwrite the used bytes with zeros if you really want. Note that, e.g., on SSDs you may not actually overwrite the physical data.
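A rough sketch of that offset-header idea (the file name, chunk sizes, and helper name are all made up for illustration, and the two writes here are not actually atomic; a real version would need to handle partial reads and crashes):

```csharp
using System;
using System.IO;

class PadOffsetSketch
{
    // Keep a ulong "bytes consumed" counter in the first 8 bytes of the
    // pad file, instead of truncating the file after every use.
    static byte[] TakeKeyMaterial(string path, int count)
    {
        using (var fs = new FileStream(path, FileMode.Open))
        {
            var header = new byte[8];
            fs.Read(header, 0, 8);
            ulong offset = BitConverter.ToUInt64(header, 0);

            // Key material lives after the 8-byte header.
            fs.Seek(8 + (long)offset, SeekOrigin.Begin);
            var key = new byte[count];
            fs.Read(key, 0, count);

            // Advance the stored offset so this range is never reused.
            fs.Seek(0, SeekOrigin.Begin);
            fs.Write(BitConverter.GetBytes(offset + (ulong)count), 0, 8);
            return key;
        }
    }

    static void Main()
    {
        File.WriteAllBytes("pad.bin", new byte[8 + 16]); // zero header + 16 key bytes
        TakeKeyMaterial("pad.bin", 4);
        TakeKeyMaterial("pad.bin", 4);
        var header = File.ReadAllBytes("pad.bin");
        Console.WriteLine(BitConverter.ToUInt64(header, 0)); // 8
    }
}
```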
What you have designed is almost certainly not a One Time Pad. The generation of large quantities of truly random bytes is a far from trivial task. You should really be spending a few thousand dollars on a hardware card if you really must go down that route. Even the crypto quality C# RNG is not up to generating that sort of volume of true random data. It can generate a short key securely enough, but does not have enough entropy input to generate large quantities of true random data. As soon as the entropy runs out, its output reverts to pseudo-random and you no longer have a One Time Pad.
As @owlstead said, use AES, in either CBC or CTR mode. That is secure, easily available, and not designed by an amateur.
Motivated by this answer I was wondering what's going on under the curtain if one uses lots of FileStream.Seek(-1).
For clarity I'll repost the answer:
using (var fs = File.OpenRead(filePath))
{
    fs.Seek(0, SeekOrigin.End);
    int newLines = 0;
    while (newLines < 3)
    {
        fs.Seek(-1, SeekOrigin.Current);
        newLines += fs.ReadByte() == 13 ? 1 : 0; // look for \r
        fs.Seek(-1, SeekOrigin.Current);
    }
    byte[] data = new byte[fs.Length - fs.Position];
    fs.Read(data, 0, data.Length);
}
Personally, I would have read something like 2048 bytes into a buffer and searched that buffer for the character.
Using Reflector I found out that internally the method is using SetFilePointer.
Is there any documentation about windows caching and reading a file backwards? Does Windows buffer "backwards" and consult the buffer when using consecutive Seek(-1) or will it read ahead starting from the current position?
It's interesting that on the one hand most people agree with Windows doing good caching, but on the other hand every answer to "reading file backwards" involves reading chunks of bytes and operating on that chunk.
Going forward vs backward doesn't usually make much difference. The file data is read into the file system cache after the first read, you get a memory-to-memory copy on ReadByte(). That copy isn't sensitive to the file pointer value as long as the data is in the cache. The caching algorithm does however work from the assumption that you'd normally read sequentially. It tries to read ahead, as long as the file sectors are still on the same track. They usually are, unless the disk is heavily fragmented.
But yes, it is inefficient. You'll get hit with two pinvoke and API calls for each individual byte. There's a fair amount of overhead in that, those same two calls could also read, say, 65 kilobytes with the same amount of overhead. As usual, fix this only when you find it to be a perf bottleneck.
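The buffered alternative mentioned in the question (read one trailing chunk, then scan it in memory) could look roughly like this; the helper name, the 2048-byte chunk, and the three-\r cutoff are illustrative, and it assumes the lines of interest fit inside one chunk:

```csharp
using System;
using System.IO;
using System.Text;

class TailSketch
{
    // Two stream calls total, instead of three per byte: seek once,
    // read one chunk, then scan the buffer backwards for '\r'.
    static string LastLines(Stream fs, int lineCount)
    {
        int chunk = (int)Math.Min(2048, fs.Length);
        fs.Seek(-chunk, SeekOrigin.End);
        var buffer = new byte[chunk];
        fs.Read(buffer, 0, chunk);

        int newLines = 0, start = 0;
        for (int i = chunk - 1; i >= 0; i--)
        {
            if (buffer[i] == 13 && ++newLines == lineCount) // look for \r
            {
                start = i;
                break;
            }
        }
        return Encoding.ASCII.GetString(buffer, start, chunk - start);
    }

    static void Main()
    {
        var bytes = Encoding.ASCII.GetBytes("a\r\nb\r\nc\r\nd\r\ne");
        using (var ms = new MemoryStream(bytes))
            Console.WriteLine(LastLines(ms, 3).Replace("\r\n", "|")); // |c|d|e
    }
}
```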
Here is a pointer on File Caching in Windows
The behavior may also depend on where the file physically resides (hard disk, network, etc.) as well as on local configuration/optimization.
Another important source of information is the CreateFile API documentation: CreateFile Function
There is a good section named "Caching Behavior" that tells you how you can influence file caching, at least in the unmanaged world.
Suppose there is a string containing 255 characters, and there is a fixed-length byte pattern of, say, 64-128 bytes. I want to "dissolve" that 255-character string, byte by byte, into the fixed-length byte pattern. The byte pattern is like a formula-based "hash" or something similar, into which a formula-based algorithm dissolves the bytes. Later, when I need to extract the dissolved bytes from that fixed-length pattern, I would use the same algorithm's reverse, or extract, function. The algorithm works with special keys or passwords: it uses them to dissolve the bytes into the pattern, and the same keys are used to extract the bytes, with their original values, from the pattern. I ask for help from the coders here. Please also guide me through the steps so that I can understand what needs to be done. I only know VB.NET and C#.
For instance:
I have these three characters: "A", "B", "C"
The formula based fixed length super pattern (works like a whirlpool) is:
AJE83HDL389SB4VS9L3
Now I wish to "dissolve" (or "submerge") the characters "A", "B", "C" one by one into the above pattern to change it completely. After dissolving the characters, the super pattern changes drastically, just like a hash:
EJS83HDLG89DB2G9L47
I would then be able to extract the characters, from the last dissolved character to the first, using an extraction algorithm and the original keys that were used to dissolve the characters into this super pattern. After the extraction of all the characters, the super pattern resets to its original initial state. Each character insertion and removal has a unique pattern state.
After extraction of all characters, the super pattern goes back to its original state. This happens upon the removal of the characters by the extraction algorithm:
AJE83HDL389SB4VS9L3
This looks a lot like your previous question(s). The problem with them is that you seem to start from a half-baked solution.
So, what do you really want? Input, output, constraints?
To encrypt a string, use encryption (Rijndael). To transform the resulting byte[] data into a string (for transport), use Base64.
If you're happy having the 'keys' for the individual pieces of data determined for you, this can be done similarly to a one-time pad (though it's not one-time!): generate a random string as your 'base', then XOR your data strings with it. Each output is the 'key' to get the original data back, and the 'base' doesn't change. This doesn't produce output that's any smaller than the input, however (and that is impossible in the general case anyway), if that's what you're after.
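A minimal sketch of that XOR-with-a-random-base idea (all names here are my own; note again that reusing the base means this is emphatically not a one-time pad, so it is not secure):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

class XorBaseSketch
{
    // XOR with a repeating key; applying it twice restores the input.
    static byte[] Xor(byte[] data, byte[] key)
    {
        var output = new byte[data.Length];
        for (int i = 0; i < data.Length; i++)
            output[i] = (byte)(data[i] ^ key[i % key.Length]);
        return output;
    }

    static void Main()
    {
        // The random 'base', generated once and reused for every string.
        var baseKey = new byte[32];
        using (var rng = new RNGCryptoServiceProvider())
            rng.GetBytes(baseKey);

        byte[] data = Encoding.UTF8.GetBytes("secret");
        byte[] masked = Xor(data, baseKey);   // the per-string 'key'
        byte[] restored = Xor(masked, baseKey);

        Console.WriteLine(Encoding.UTF8.GetString(restored)); // secret
    }
}
```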
Like your previous question, you're not really being clear about what you want. Why not just ask a question about how to achieve your end goals, and let people provide answers describing how, or tell you why it's not possible?
Here are two cases:
Lossless compression (the exact bytes are decoded from the compressed data). In this case, Shannon entropy
clearly states that there can't be any algorithm that compresses data to rates better than the information entropy predicts.
Lossy compression (some of the original bytes are lost forever in the compression scheme, as in JPG image files; remember the 'image quality' setting?).
In this type of compression you can make better and better compression schemes, at the penalty of losing more and more of the original bytes.
(Taken to the extreme: compression down to zero bytes, with zero bytes "restored" afterwards. That scheme has already been invented too: the magical DELETE button, which moves information into a black hole. Sorry for the sarcasm.)
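The bound mentioned for the lossless case is Shannon's source coding theorem: the expected code length per symbol cannot go below the entropy of the source,

```
H(X) = -\sum_{x} p(x) \log_2 p(x), \qquad \mathbb{E}[L] \ge H(X)
```

so once a compressor's output rate reaches the entropy of the data, no further lossless gain is possible.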
I'm looking at the C# library called BitStream, which allows you to write and read any number of bits to a standard C# Stream object. I noticed what seemed to me a strange design decision:
When adding bits to an empty byte, the bits are added to the MSB of the byte. For example:
var s = new BitStream();
s.Write(true);
Debug.Assert(s.ToByteArray()[0] == 0x80); // and not 0x01
var s = new BitStream();
s.Write(0x7,0,4);
s.Write(0x3,0,4);
Debug.Assert(s.ToByteArray()[0] == 0x73); // and not 0x37
However, when referencing bits in a number given as the input, the first bit of the input number is the LSB. For example:
//s.Write(int input,int bit_offset, int count_bits)
//when referencing the LSB and the next bit we'll write
s.Write(data,0,2); //and not s.Write(data,data_bits_number,data_bits_number-2)
It seems inconsistent to me, since in this case, when "gradually" copying a byte as in the previous example (first the first four bits, then the last four bits), we will not get the original byte; we need to copy it "backwards" (first the last four bits, then the first four bits).
Is there a reason for that design that I'm missing? Any other implementation of bits stream with this behaviour? What are the design considerations for that?
It seems that ffmpeg bitstream behaves in a way I consider consistent. Look at the amount it shifts the byte before ORing it with the src pointer in the put_bits function.
As a side note:
The first byte added, is the first byte in the byte array. For example
var s = new BitStream();
s.Write(0x1,0,4);
s.Write(0x2,0,4);
s.Write(0x3,0,4);
Debug.Assert(s.ToByteArray()[0] == 0x12); // and not s.ToByteArray()[1] == 0x12
Here are some additional considerations:
In the case of the boolean, only one bit is required to represent true or false. When that bit gets added to the beginning of the stream, the bit stream is "1". When you extend that stream to byte length, it forces the padding of zero bits onto the end of the stream, even though those bits did not exist in the stream to begin with. Position in the stream is important information, just like the values of the bits, and a bit stream of "10000000", i.e. 0x80, safeguards the expectation that subsequent readers of the stream may have that the first bit they read is the first bit that was added.
Second, other data types like integers require more bits to represent so they are going to take up more room in the stream than booleans. Mixing different size data types in the same stream can be very tricky when they aren't aligned on byte boundaries.
Finally, if you are on Intel x86 your CPU architecture is "little-endian" which means LSB first like you are describing. If you need to store values in the stream as big-endian you'll need to add a conversion layer in your code - similar to what you've shown above where you push one byte at a time into the stream in the order you want. This is annoying, but commonly required if you need to interop with big-endian Unix boxes or as may be required by a protocol specification.
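To make the ordering distinction concrete, here is a small sketch (my own, not taken from the BitStream library) showing how the two packing orders produce the 0x73 / 0x37 results from the question when writing the nibble 0x7 followed by 0x3:

```csharp
using System;

class BitOrderSketch
{
    static void Main()
    {
        // MSB-first packing (what the library does): the first value
        // written lands in the high bits of the byte.
        byte msbFirst = 0;
        msbFirst |= (byte)(0x7 << 4); // first nibble -> high nibble
        msbFirst |= 0x3;              // second nibble -> low nibble
        Console.WriteLine(msbFirst.ToString("x2")); // 73

        // LSB-first packing: the first value lands in the low bits.
        byte lsbFirst = 0;
        lsbFirst |= 0x7;              // first nibble -> low nibble
        lsbFirst |= (byte)(0x3 << 4); // second nibble -> high nibble
        Console.WriteLine(lsbFirst.ToString("x2")); // 37
    }
}
```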
Hope that helps!
Is there a reason for that design that I'm missing? Any other implementation of bits stream with this behaviour? What are the design considerations for that?
I doubt there was any significant meaning behind the decision. Technically it just does not matter, so long as the writer and the reader agree on the ordering.
I agree with Elazar.
As he/she points out, this is a case where the reader and writer do NOT agree on the bit ordering. In fact, they're incompatible.