Predicting the length of an encrypted string - c#

I am using this for encryption: http://msdn.microsoft.com/en-us/library/system.security.cryptography.rijndaelmanaged.aspx
Is there a way I can predict what the encrypted text will look like? I am converting the encrypted output to text so I can store it in the db.
I just want to make sure the size of the database column is large enough.
I am limiting the text input to be 20 characters.

Are you using SQL Server 2005 or above? If so, you could just use VARCHAR(MAX) or NVARCHAR(MAX) for the column type.
If you want to be a bit more precise...
The maximum block size for RijndaelManaged is 256 bits (32 bytes).
Your maximum input size is 20 characters, so even if we assume a worst-case scenario of 4 bytes per character, that'll only amount to 80 bytes, which will then be padded up to a maximum of 96 bytes for the encryption process.
If you use Base64 encoding on the encrypted output that will create 128 characters from the 96 encrypted bytes. If you use hex encoding then that will create 192 characters from the 96 encrypted bytes (plus maybe a couple of extra characters if you're prefixing the hex string with "0x"). In either case a column width of 200 characters should give you more than enough headroom.
(NB: These are just off-the-top-of-my-head calculations. I haven't verified that they're actually correct!)
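If you want to sanity-check those numbers, here's a quick sketch (assuming PKCS7 padding, which always rounds up to the next full block):

// Rough worst-case size calculation for the scenario described above.
int maxInputBytes = 20 * 4;                     // 20 chars at a worst case of 4 bytes each
int blockSizeBytes = 32;                        // 256-bit block, the Rijndael maximum
int paddedBytes = ((maxInputBytes / blockSizeBytes) + 1) * blockSizeBytes;  // 96
int base64Chars = ((paddedBytes + 2) / 3) * 4;  // 128
int hexChars = paddedBytes * 2;                 // 192
Console.WriteLine($"{paddedBytes} bytes -> {base64Chars} Base64 chars, {hexChars} hex chars");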

For an unknown encryption algorithm with no information to be found online, I would write a little test program that encrypts a random set of maximum-length strings, finds the longest output, and then multiplies by a safety factor based on how likely the input length is to change and how representative the test run was.
Really generally speaking though, you're probably going to be in the 1.5x - 2x input length range.
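A minimal sketch of that approach (AES and Base64 here are just placeholders; substitute whatever algorithm and encoding you actually use):

// Encrypt many random maximum-length strings and record the longest encoded output.
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

var rng = new Random();
using var aes = Aes.Create();   // placeholder algorithm
int maxObserved = 0;
for (int i = 0; i < 10_000; i++)
{
    string input = new string(Enumerable.Range(0, 20)
        .Select(_ => (char)rng.Next(32, 127)).ToArray());   // random 20-char string
    byte[] plain = Encoding.UTF8.GetBytes(input);
    using var encryptor = aes.CreateEncryptor();
    byte[] cipher = encryptor.TransformFinalBlock(plain, 0, plain.Length);
    maxObserved = Math.Max(maxObserved, Convert.ToBase64String(cipher).Length);
}
Console.WriteLine($"Longest observed output: {maxObserved} characters");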

For this specific algorithm (with the default 128-bit block size), the length of the ciphertext in bytes will be:
((length + 16) / 16) * 16
using integer division. This meets the block-size and padding requirement.
I suggest you also prepend a random IV to the ciphertext; that will take another 16 bytes.
However, if you want to store this as characters in the database, you have to encode it, which will increase the size even more.
For base64, multiply it by 4/3. For hex, double it.
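Putting that together (a sketch; assumes the default 16-byte block, PKCS7 padding, a prepended 16-byte IV, and Base64 output):

// Worst-case stored length for a given plaintext length, per the formula above.
static int PredictBase64Length(int plaintextBytes)
{
    int padded = ((plaintextBytes + 16) / 16) * 16;   // integer division: round up to next block
    int withIv = padded + 16;                         // prepended random IV
    return ((withIv + 2) / 3) * 4;                    // Base64 expansion (4/3, rounded up)
}
// e.g. 80 plaintext bytes -> 96 padded + 16 IV = 112 bytes -> 152 Base64 characters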

Encryption will never increase the size of data beyond the minimum padding required.
If it does 'expand' the data, it is probably not a very good encryption algorithm.

Related

c#: How to represent a large data packet easily?

I have an array of bytes which I want to represent with some kind of structure so that using these bytes will be easier.
Currently the bytes come into an array in my application as:
Bytes 0-3: header,
Bytes 4-10: a representation of an ASCII number,
Bytes 11-17: another representation of an ASCII number,
Bytes 18-1000: binary data.
Is there a way to represent this so that I can, for example, just use something like MyArray.Header, MyArray.Number1, MyArray.Number2, MyArray.Data?
I think a Struct could handle this but I am not sure how to define it or how to use it afterwards. The bytes will be coming in via a network continuously.
Thanks for any help.
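One way to get that kind of access (a sketch; the property names mirror the question, and parsing the ASCII numbers with long.Parse is an assumption about the format) is a thin wrapper class over the raw array:

using System;
using System.Text;

// Wraps the raw packet bytes and exposes the fields by name.
class Packet
{
    private readonly byte[] _raw;
    public Packet(byte[] raw) { _raw = raw; }

    public byte[] Header => _raw[0..4];                                        // bytes 0-3
    public long Number1 => long.Parse(Encoding.ASCII.GetString(_raw, 4, 7));   // bytes 4-10
    public long Number2 => long.Parse(Encoding.ASCII.GetString(_raw, 11, 7));  // bytes 11-17
    public byte[] Data => _raw[18..];                                          // bytes 18 onward
}
// Usage: var packet = new Packet(buffer); then packet.Number1, packet.Data, etc.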

C# Creating your own hash algorithm - 99 documents, 0.0001 collision?

Looking at Wolfram's collision-to-bits graph with 99 documents, I'd need a 25.5-bit hashing algorithm to have a 0.0001 chance of a collision.
I looked at CRC-24 and was wondering if it could be improved to use even fewer characters. I have a big list of characters that can be used for the hash: basically all Unicode characters except for 4 or 5 characters.
Now how do you create your own hash algorithm based on a set of usable characters in C#?
EDIT:
Let me try to make the issue more precise: I have 99 strings. I want to cut them down to a maximum length of 64 characters. That can create duplicates, but they need to stay unique while maintaining their meaning. The idea was to create a hash as small as possible and replace the last characters of the truncated string with the hash of the original string. The hash should of course have a low probability of collision and be as short as possible. As I understand it, the more symbols the hash can use (a-z0-9, a-zA-Z0-9, or even all Unicode characters), the fewer characters the hash needs before collisions occur. I looked at SHA-1 (just trimming it) and CRC-32, but they don't use the "full potential" of Unicode characters.
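You generally don't need to invent a hash from scratch; you can truncate a standard hash and re-encode its bits in whatever alphabet you have available. A sketch (assumes .NET 5+ for SHA256.HashData; the alphabet here is just an example, substitute your own set of usable characters):

using System;
using System.Numerics;
using System.Security.Cryptography;
using System.Text;

// Truncate a standard hash to hashBytes bytes, then express it in a custom alphabet.
static string ShortHash(string input, string alphabet, int hashBytes)
{
    byte[] full = SHA256.HashData(Encoding.UTF8.GetBytes(input));
    var value = new BigInteger(full[..hashBytes], isUnsigned: true, isBigEndian: true);
    var sb = new StringBuilder();
    do
    {
        sb.Insert(0, alphabet[(int)(value % alphabet.Length)]);
        value /= alphabet.Length;
    } while (value > 0);
    return sb.ToString();
}
// e.g. ShortHash(original, "abcdefghijklmnopqrstuvwxyz0123456789", 4)
// keeps 32 bits, comfortably above the ~25.5 bits you calculated.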

Maximum UTF-8 string size given UTF-16 size

What is the formula for determining the maximum number of UTF-8 bytes required to encode a given number of UTF-16 code units (i.e. the value of String.Length in C# / .NET)?
I see 3 possibilities:
# of UTF-16 code units x 2
# of UTF-16 code units x 3
# of UTF-16 code units x 4
A UTF-16 code point is represented by either 1 or 2 code units, so we just need to consider the worst case scenario of a string filled with one or the other. If a UTF-16 string is composed entirely of 2 code unit code points, then we know the UTF-8 representation will be at most the same size, since the code points take up a maximum of 4 bytes in both representations, thus worst case is option (1) above.
So the interesting case to consider, which I don't know the answer to, is the maximum number of bytes that a single code unit UTF-16 code point can require in UTF-8 representation.
If all single code unit UTF-16 code points can be represented with 3 UTF-8 bytes, which my gut tells me makes the most sense, then option (2) will be the worst case scenario. If there are any that require 4 bytes then option (3) will be the answer.
Does anyone have insight into which is correct? I'm really hoping for (1) or (2) as (3) is going to make things a lot harder :/
UPDATE
From what I can gather, UTF-16 encodes all characters in the BMP in a single code unit, and all other planes are encoded in 2 code units.
It seems that UTF-8 can encode the entire BMP within 3 bytes and uses 4 bytes for encoding the other planes.
Thus it seems to me that option (2) above is the correct answer, and this should work:
string str = "Some string";
int maxUtf8EncodedSize = str.Length * 3;
Does that seem like it checks out?
The worst case for a single UTF-16 word is U+FFFF, which in UTF-16 is encoded just as-is (0xFFFF). In UTF-8 it is encoded to ef bf bf (three bytes).
The worst case for two UTF-16 words (a "surrogate pair") is U+10FFFF, which in UTF-16 is encoded as 0xDBFF 0xDFFF. In UTF-8 it is encoded to f4 8f bf bf (four bytes).
Therefore the worst case is a load of U+FFFF's which will convert a UTF-16 string of length 2N bytes to a UTF-8 string of length 3N bytes.
So yes, you are correct. I don't think you need to consider stuff like glyphs because that sort of thing is done after decoding from UTF8/16 to code points.
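You can check this empirically in C#; a string full of U+FFFF hits the 3-byte-per-code-unit bound exactly:

using System;
using System.Text;

string worstCase = new string('\uFFFF', 100);            // 100 UTF-16 code units
int utf8Bytes = Encoding.UTF8.GetByteCount(worstCase);   // 300: 3 bytes per code unit
Console.WriteLine(utf8Bytes);
// Note: Encoding.UTF8.GetMaxByteCount(n) returns a slightly larger figure,
// because it also allows for a leftover surrogate from a previous buffer.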
Properly formed UTF-8 can be up to 4 bytes per Unicode codepoint.
UTF-16 uses up to two 16-bit code units per Unicode codepoint.
Characters outside the basic multilingual plane (including emoji and languages added in more recent versions of Unicode) are represented in up to 21 bits, which results in 4-byte sequences in UTF-8 and, as it happens, also 4 bytes in UTF-16.
However, some environments do things weirdly. Since UTF-16 characters outside the basic multilingual plane take two 16-bit code units (detectable because they are always values in the range 0xD800 to 0xDFFF), some mistaken UTF-8 implementations, usually referred to as CESU-8, encode each of those two surrogates as its own 3-byte UTF-8 sequence, for a total of six bytes per codepoint. (I believe some early Oracle DB implementations did this, and I'm sure they weren't the only ones.)
There's one more minor wrench in things, which is that some glyphs are classified as combining characters, and multiple UTF-16 (or UTF-32) sequences are used when determining what gets displayed on the screen, but I don't think that applies in your case.
Based on your edit, it looks like you're trying to estimate the maximum byte length of a .NET encoding conversion. String.Length measures the total number of Chars, which are UTF-16 code units. As a worst-case estimate, therefore, I believe you can safely use count(Char) * 3, because each non-BMP character consumes two Chars and yields 4 bytes of UTF-8.
If you want to get the total number of UTF-32 codepoints represented, you should be able to do something like
var maximumUtf8Bytes = new System.Globalization.StringInfo(myString).LengthInTextElements * 4;
(My C# is a bit rusty as I haven't used a .Net environment much in the last few years, but I think that does the trick).

Any 32 or 64 bit hash function?

I want to create a method in C# which will return a string of at most 10-12 characters. I have tried using SHA1 and MD5, but they are 160 and 128 bits respectively and generate strings of 32 or more characters, which doesn't fulfill my requirements. Security is not the issue; I just need a small string that will remain unique.
You can truncate the string (the hash) to the length you want. You'll only make it weaker: as an extreme example, if you truncate it to one byte, you'll probably have a collision after about 16 elements are hashed, thanks to the birthday problem. Every part of a good hash is as "good" as every other part, so take the first x characters/bytes and live happy. See for example the discussion on Security Stack Exchange about how secure a truncated hash is.
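For example, here's a sketch that keeps the first 5 bytes (40 bits) of an MD5 hash and hex-encodes them into a 10-character string (Convert.ToHexString needs .NET 5+; use BitConverter.ToString on older versions):

using System;
using System.Security.Cryptography;
using System.Text;

static string ShortHash(string input)
{
    using var md5 = MD5.Create();
    byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(input));
    return Convert.ToHexString(hash, 0, 5);   // first 5 bytes -> 10 hex characters
}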

Succinct way to write a mixture of chars and bytes?

I'm trying to write an index file that follows the format of a preexisting (and immutable) text file.
The file is fixed length, with 11 bytes of string (in ASCII) followed by a 4-byte number, for a total of 15 bytes per line.
Perhaps I'm being a bit dim, but is there a simple way to do this? I get the feeling I need to open up two streams to write one line - one for the string and one for the bytes - but that feels wrong.
Any hints?
You can use BitConverter to convert between an int/long and an array of bytes. This way you would be able to write eleven bytes followed by four bytes, followed by eleven more bytes, and so on.
byte[] intBytes = BitConverter.GetBytes(intValue); // returns 4-byte array
Converting to bytes: BitConverter.GetBytes(int).
Converting back to int: BitConverter.ToInt32(byte[], int)
If you are developing a cross-platform solution, keep in mind the following note from the documentation (thanks to uriDium for the comment):
The order of bytes in the array returned by the GetBytes method depends on whether the computer architecture is little-endian or big-endian.
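You don't need two streams; a single stream accepts both the ASCII bytes and the converted integer. A sketch (field widths taken from the question):

using System;
using System.IO;
using System.Text;

// Writes one fixed-length 15-byte record: 11 ASCII characters + a 4-byte integer.
static void WriteRecord(Stream stream, string text, int value)
{
    byte[] textBytes = Encoding.ASCII.GetBytes(text.PadRight(11).Substring(0, 11));
    stream.Write(textBytes, 0, 11);
    byte[] intBytes = BitConverter.GetBytes(value);   // 4 bytes, machine endianness
    stream.Write(intBytes, 0, 4);
}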
