Creating URL ShortCode in C#

I am using this article to create a short code for a URL.
I've been working on this for a while and the pseudo code just isn't making sense to me. He states in "loop1" that I'm supposed to look from the first 4 bytes to the 4th 4 bytes, cast each group of bytes to an integer, and then convert that to bits. I end up with 32 bits for each group of 4 bytes, but "loop3" consumes 5 bits at a time, and 32 isn't evenly divisible by 5. I am not understanding what he's trying to say.
Then I noticed that he closes "loop2" at the bottom, after the short code has already been written to the database. That doesn't make sense to me, because I would be writing the same short code to the database over and over again.
And "loop1" loops forever; again, I'm not seeing why I would need to update the database infinitely.
I have tried to follow his example and run it through the debugger line by line, but it's still not making sense.
Here is the code I have so far, according to what I've been able to understand:
private void button1_Click(object sender, EventArgs e)
{
    string codeMap = "abcdefghijklmnopqrstuvwxyz012345"; // 32 bytes

    // Compute MD5 Hash
    MD5 md5 = MD5.Create();
    byte[] inputBytes = Encoding.ASCII.GetBytes(txtURL.Text);
    byte[] hash = md5.ComputeHash(inputBytes);

    // Loop from the first 4 bytes to the 4th 4 bytes
    byte[] FourBytes = new byte[4];
    for (int i = 0; i <= 3; i++)
    {
        FourBytes[i] = hash[i];
        //int CastedBytes = FourBytes[i];
        BitArray binary = new BitArray(FourBytes);
        int CastedBytes = 0;
        for (int ii = 0; i <= 5; i++)
        {
            CastedBytes = CastedBytes + ii;
        }
    }
}
Can someone help me figure out what I'm doing wrong so I can get this program working? I just need to convert URLs into short, unique 6-character codes.
Thanks.

Your MD5 hash is 128 bits. The idea is to represent those 128 bits in 6 characters, ideally without losing any information.
The codeMap contains 32 characters:
string codeMap = "abcdefghijklmnopqrstuvwxyz012345"
Note that 2^5 is also 32. The third loop is using 5 bits of the hash at a time, and converting those 5 bits to a character in the codeMap. For example, for the bit pattern
00001 00011 00100
  b     d     e
The algorithm uses 6 sets of 5 bits, so 30 bits in total; the remaining 2 bits of each 32-bit integer are "wasted".
Note though that the 128 bit MD5 is being taken 4 bytes at a time, and those 4 bytes are converted to an integer. That is one approach to consuming the bits of the MD5, but certainly not the only one. It involves bit masking and bit shifting.
You may find it more straightforward to use a BitArray for the implementation. While this is probably slightly less efficient, it will not likely matter. If you go that path, initialize the BitArray with the bits of your MD5 hash, and then just take 5 bits at a time, converting them to a number in the range 0..31 to use as an index into codeMap.
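For concreteness, here is a rough sketch of that BitArray approach. The method name GetShortCode is mine, and the bit ordering within each 5-bit group is an arbitrary choice, not the article's:
using System;
using System.Collections;
using System.Security.Cryptography;
using System.Text;

static string GetShortCode(string url)
{
    const string codeMap = "abcdefghijklmnopqrstuvwxyz012345"; // 32 chars = 2^5
    byte[] hash;
    using (MD5 md5 = MD5.Create())
        hash = md5.ComputeHash(Encoding.ASCII.GetBytes(url));

    BitArray bits = new BitArray(hash); // all 128 bits of the hash
    StringBuilder sb = new StringBuilder(6);
    for (int c = 0; c < 6; c++) // 6 characters, 5 bits each = 30 of the 128 bits
    {
        int index = 0;
        for (int b = 0; b < 5; b++)
            if (bits[c * 5 + b])
                index |= 1 << b; // assemble a number in 0..31
        sb.Append(codeMap[index]); // index into the 32-character map
    }
    return sb.ToString();
}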
This bit from the article is misleading
6 characters of short code can be used to map 32^6 (1,073,741,824) URLs so it is unlikely to be used up in the near future
Due to the possibility of hash collisions, the system can manage far fewer than 1 billion URLs without a significant risk of the same short URL being assigned to two long URLs. See the Birthday Problem for more.
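To put a rough number on that (my own back-of-the-envelope using the standard birthday approximation): with N = 32^6 possible codes, the probability of at least one collision among n URLs is roughly 1 - e^(-n^2 / 2N), which reaches 50% at n ≈ 1.177·sqrt(N) = 1.177·32^3 ≈ 38,600 URLs, nowhere near a billion.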

Unless you are expecting to have a hugely popular URL shortener, just use a base-16 or base-64 encoding of a database auto-increment column.
With 6 characters, base 16 provides about 16 million (16^6) unique URLs; base 64 provides 64^6 = 2^36, roughly 68 billion.
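A sketch of that idea, assuming a URL-safe 64-character alphabet of my own choosing and a made-up method name:
const string Alphabet =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

static string EncodeId(long id) // id comes from the auto-increment column
{
    if (id == 0) return Alphabet[0].ToString();
    var sb = new System.Text.StringBuilder();
    while (id > 0)
    {
        sb.Insert(0, Alphabet[(int)(id % 64)]); // each remainder picks one character
        id /= 64;
    }
    return sb.ToString();
}
Because the input is a unique column value rather than a hash, collisions are impossible by construction.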

Related

Any 32 or 64 bit hash function?

I want to create a method in C# that returns a string of at most 10-12 characters. I have tried SHA1 and MD5, but they are 160 and 128 bits respectively and generate hex strings of 40 and 32 characters, which doesn't fulfill my requirements. Security is not the issue; I just need a small string that will remain unique.
You can truncate the string (the hash) to the length you want. You'll only make it weaker: as an extreme example, if you truncate it to one byte, you'll probably have a collision after about 16 elements are hashed, thanks to the birthday problem. Every part of a good hash is as "good" as every other part, so take the first x characters/bytes and live happy. See, for example, the discussion on Security Stack Exchange about how secure a truncated hash will be.
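For example, a minimal sketch of the truncation (ShortHash is a made-up name; this assumes hex encoding, so 10 characters carry 40 bits of the hash):
using System.Security.Cryptography;
using System.Text;

static string ShortHash(string input, int length)
{
    using (SHA1 sha1 = SHA1.Create())
    {
        byte[] hash = sha1.ComputeHash(Encoding.UTF8.GetBytes(input));
        var sb = new StringBuilder();
        foreach (byte b in hash)
            sb.Append(b.ToString("x2")); // hex-encode the full 160 bits
        return sb.ToString().Substring(0, length); // keep only the first 'length' chars
    }
}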

Is a substring of a CSPRN also a CSPRN?

I want to generate a 4 character long CSPRN (Cryptographically Secure Pseudo Random Number) string. I know I can create an 8 character one by creating a 5 byte long random array and encoding it as base32:
string CSPRN = "";
System.Security.Cryptography.RandomNumberGenerator rng = new System.Security.Cryptography.RNGCryptoServiceProvider();
byte[] tokenData = new byte[5];
rng.GetBytes(tokenData);
CSPRN = Base32.ToBase32String(tokenData); //should produce a string 5bytes*1.6charsperbyte = 8 chars long.
If I now take a substring of the first 4 characters of CSPRN - is it still a CSPRN?
My best guess is that it is, but wondering if there is any "gotcha"s from taking a substring rather than generating a smaller number.
Yes, it is. First, let's look at your premise: base32 is a 5-bits-per-character encoding, so 5 bytes (40 bits) of random data encode to exactly 8 characters, each carrying 5 full bits of entropy.
The entropy per character doesn't suddenly drop when you take characters out of the string, of course. So if you take the first 4 characters, each still carries full entropy, giving you 4 characters full of entropy within the base32 alphabet (20 bits in total).
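You could also generate exactly the 4 characters you need instead of substringing. A sketch, assuming the RFC 4648 base32 alphabet:
using System.Security.Cryptography;

static string RandomBase32(int length)
{
    const string alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"; // 32 characters
    byte[] buf = new byte[length];
    using (var rng = RandomNumberGenerator.Create())
        rng.GetBytes(buf);
    var chars = new char[length];
    for (int i = 0; i < length; i++)
        chars[i] = alphabet[buf[i] % 32]; // 256 % 32 == 0, so no modulo bias
    return new string(chars);
}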

Locally unique identifier

Question: When you have a .NET GUID for inserting in a database, its structure is like this:
60 bits of timestamp,
48 bits of computer identifier,
14 bits of uniquifier, and
6 bits are fixed,
----
128 bits total
Now I have a problem with a GUID, because it's a 128-bit number, and some of the DBs I'm using only support 64-bit numbers.
Now I don't want to solve the dilemma by using an autoincrement bigint value, since I want to be able to do offline replication.
So I got the idea of creating a locally unique identifier class, which is basically a GUID downsized to a 64 bit value.
I came up with this:
day 9 bit (12*31=372 d)
year 8 bit (2266-2010 = 256 y)
seconds 17 bit (24*60*60=86400 s)
hostname 12 bit (2^12=4096)
random 18 bit (2^18=262144)
------------------------
64 bits total
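(For concreteness, packing those fields into a single long might look like the sketch below; PackLuid is a made-up name and the field order is arbitrary.)
static long PackLuid(int day, int year, int seconds, int hostHash, int random)
{
    long id = 0;
    id = (id << 9)  | (long)(day      & 0x1FF);   // 9 bits
    id = (id << 8)  | (long)(year     & 0xFF);    // 8 bits
    id = (id << 17) | (long)(seconds  & 0x1FFFF); // 17 bits
    id = (id << 12) | (long)(hostHash & 0xFFF);   // 12 bits
    id = (id << 18) | (long)(random   & 0x3FFFF); // 18 bits
    return id; // 9+8+17+12+18 = 64 bits total
}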
The timestamp fields are pretty much fixed at 34 bits, leaving me with 64-34 = 30 bits for the hostname + random number.
Now my questions:
1) Would you rather increase the hostname-hash bit size and decrease the random bit size, or increase the random bit size and decrease the hostname-hash bit size?
2) Is there a hash algorithm that reduces any string to n bits, with n ideally 12 or as near to it as possible?
Actually, .NET-generated GUIDs are 6 fixed bits and 122 bits of randomness.
You could consider just using 64 bits of randomness, with an increased chance of collision due to the smaller bit length. It would work better than a hash.
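A sketch of that (RandomId is a made-up name):
using System;
using System.Security.Cryptography;

static long RandomId()
{
    byte[] buf = new byte[8];
    using (var rng = RandomNumberGenerator.Create())
        rng.GetBytes(buf);
    return BitConverter.ToInt64(buf, 0); // 64 uniformly random bits
}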
If space isn't a concern, why not just use two 64-bit columns? Split the GUID in half, 8 bytes for each half, convert those to two 64-bit numbers, and store them in the two columns. If you ever need to upsize to another system, you'll still be unique; you'll just need to factor in rejoining the two columns.
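Splitting and rejoining might look like this (a sketch; Guid.ToByteArray has a mixed-endian layout, but round-tripping through the same methods keeps it consistent):
using System;

static void SplitGuid(Guid g, out long high, out long low)
{
    byte[] bytes = g.ToByteArray();        // 16 bytes
    high = BitConverter.ToInt64(bytes, 0); // first 8 bytes
    low  = BitConverter.ToInt64(bytes, 8); // last 8 bytes
}

static Guid JoinGuid(long high, long low)
{
    byte[] bytes = new byte[16];
    BitConverter.GetBytes(high).CopyTo(bytes, 0);
    BitConverter.GetBytes(low).CopyTo(bytes, 8);
    return new Guid(bytes);
}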
Why write your own? Why not just generate a uniformly random number? It will do the job nicely. Just grab the first X digits where X is whatever size you want... say 64-bits.
See here for info about RAND() vs. NEWID() in SQL Server, which is really just an indictment of GUIDs vs. random number generators. Also, see here if you need something more random than System.Random.

Bit/byte conversion

How many bits is a .NET string that's 10 characters in length? (.NET strings are UTF-16, right?)
On 32-bit systems:
4 bytes = Type pointer (Every object has one of these)
4 bytes = Lock (One of these too!)
4 bytes = Length (Need the length)
2 * Length bytes = Data (And the chars themselves)
=======================
12 + 2*Length bytes
=======================
96 + 16*Length bits
So 10 chars would = 256 bits = 32 bytes
I am not sure if the Lock grows to 64-bit on 64-bit systems. I kinda hope not, but you never know. The 64-bit structure overhead is therefore anywhere from 16-20 bytes (as opposed to the 12 bytes on 32-bit).
Every char in the string is two bytes in size, so if you are just converting the chars directly and not using any particular encoding, the answer is string.Length * 2 * 8
otherwise the result depends on the encoding, you can write:
int numbits = System.Text.Encoding.UTF8.GetByteCount(str)*8; //returns 80
or
int numbits = System.Text.Encoding.Unicode.GetByteCount(str)*8; //returns 160
If you are talking pure UTF-16, then:
10 characters = 20 bytes = 160 bits
This really needs context in order to be answered properly, because it all comes down to how you define "character" and how you store the data.
For example, if you define a character as a single letter from the user's point of view, it can be more than 2 bytes: the character Å can be two Unicode code points (U+0041 U+030A, Latin Capital A + Combining Ring Above), so it will require two .NET chars, or 4 bytes in UTF-16.
Even if you are talking about 10 .NET Char elements, if the string is in memory you have some object overhead (as already mentioned) plus a bit of alignment overhead (on 32-bit systems everything has to be aligned to a 4-byte boundary; on 64-bit the rules are more complicated), so you may have some empty bytes at the end.
If you are talking about a database or files, each database and file system has its own overhead.
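A quick illustration of the combining-character point (the Å example above in code):
using System;
using System.Text;

string s = "A\u030A"; // U+0041 + U+030A, renders as Å
Console.WriteLine(s.Length);                         // 2 (.NET chars)
Console.WriteLine(Encoding.Unicode.GetByteCount(s)); // 4 (bytes in UTF-16)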

Most significant bit

I haven't dealt with programming against hardware devices in a long while and have forgotten pretty much all the basics.
I have a spec of what I should send in a byte, and each bit is defined from the most significant bit (bit 7) to the least significant (bit 0). How do I build this byte? From MSB to LSB, or vice versa?
If these bits are being 'packeted' (which they usually are), then the order of bits is the native order, 0 being the LSB, and 7 being the MSB. Bits are not usually sent one-by-one, but as bytes (usually more than one byte...).
According to Wikipedia, bit ordering can sometimes be from 7 to 0, but this is probably the rarer case.
If you're going to write the whole byte at the same time, i.e. do a parallel transfer as opposed to a serial, the order of the bits doesn't matter.
If the transfer is serial, then you must find out which order the device expects the bits in, it's impossible to tell from the outside.
To just assemble a byte from eight bits, use bitwise OR to "add" bits one at a time:
byte value = 0;
value |= (byte)(1 << n); // 'n' is the index, with 0 as the LSB, of the bit to set; the cast is needed because '1 << n' is an int
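For example, filling in a hypothetical spec that names bits 7..0 (the field meanings here are made up):
byte control = 0;
control |= (byte)(1 << 7); // bit 7 (MSB): enable flag
control |= (byte)(1 << 2); // bit 2: mode select
control |= (byte)(1 << 0); // bit 0 (LSB): start bit
// control == 0x85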
If the spec says MSB, then build it MSB. Otherwise if the spec says LSB, then build it LSB. Otherwise, ask for more information.
