C#: String -> MD5 -> Hex

In languages like PHP or Python there are convenient functions to turn an input string into an output string that is the HEXed representation of it.
I find this a very common and useful task (password storing and checking, checksums of file content...), but in .NET, as far as I know, you can only work on byte streams.
A function to do the work is easy to write (e.g. http://blog.stevex.net/index.php/c-code-snippet-creating-an-md5-hash-string/), but I'd like to know if I'm missing something, am using the wrong pattern, or there is simply no such thing in .NET.
Thanks

The method you linked to seems right; a slightly different method is shown on the MSDN C# FAQ.
A comment suggests you can use:
System.Web.Security.FormsAuthentication.HashPasswordForStoringInConfigFile(string, "MD5");

Yes, you can only work with bytes (as far as I know). But you can turn those bytes easily into their hex representation by looping through them and doing something like:
myByte.ToString("x2");
And you can get the bytes that make up the string using:
System.Text.Encoding.UTF8.GetBytes(myString);
So it can be done in a couple of lines.
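Putting the two pieces together, a complete helper might look like this (a minimal sketch; the class and method names are mine, and UTF-8 is one reasonable encoding choice):

using System;
using System.Security.Cryptography;
using System.Text;

class Md5HexExample
{
    // Hash a string with MD5 and return the lowercase hex digest.
    static string Md5Hex(string input)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(input));
            StringBuilder sb = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
                sb.Append(b.ToString("x2"));
            return sb.ToString();
        }
    }

    static void Main()
    {
        Console.WriteLine(Md5Hex("hello")); // 5d41402abc4b2a76b9719d911017c592
    }
}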

One problem is with the very concept of "the HEXed representation of [a string]".
A string is a sequence of characters. How those characters are represented as individual bits depends on the encoding. The "native" encoding to .NET is UTF-16, but usually a more compact representation is achieved (while preserving the ability to encode any string) using UTF-8.
You can use Encoding.GetBytes to get the encoded version of a string once you've chosen an appropriate encoding - but the fact that there is that choice to make is the reason that there aren't many APIs which go straight from string to base64/hex or which perform encryption/hashing directly on strings. Any such APIs which do exist will almost certainly be doing the "encode to a byte array, perform appropriate binary operation, decode opaque binary data to hex/base64".
(That makes me wonder whether it wouldn't be worth writing a utility class which could take an encoding, a Func<byte[], byte[]> and an output format such as hex/base64 - that could represent an arbitrary binary operation applied to a string.)
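For what it's worth, here is a rough sketch of that idea (every name below is hypothetical, not an existing .NET API):

using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical utility: apply an arbitrary byte[] -> byte[] operation
// (hash, encryption, ...) to a string, with the encoding and the output
// format both made explicit rather than implicit.
static class BinaryStringOperation
{
    public static string Apply(string input,
                               Encoding encoding,
                               Func<byte[], byte[]> operation,
                               Func<byte[], string> outputFormat)
    {
        return outputFormat(operation(encoding.GetBytes(input)));
    }
}

// Usage: MD5 as lowercase hex.
// string hex = BinaryStringOperation.Apply(
//     "some input",
//     Encoding.UTF8,
//     bytes => { using (var md5 = MD5.Create()) return md5.ComputeHash(bytes); },
//     bytes => BitConverter.ToString(bytes).Replace("-", "").ToLowerInvariant());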

Related

Advantage in using SerialPort.ReadByte over ReadChar?

Of all the example code I have read online regarding SerialPorts, all of it uses ReadByte and then converts to a character, instead of using ReadChar in the first place.
Is there an advantage in doing this?
The SerialPort.Encoding property is often misunderstood. The default is ASCIIEncoding, which produces ? for byte values 0x80..0xFF, so people don't like getting those question marks. If you see code that instead converts each byte directly to a char, it is getting it really wrong: Unicode has lots of unprintable code points in that byte range, and the odds that the device actually meant to send those characters are zero. A string tends to be regarded as easier to handle than a byte[], and it is.
When you use ReadChar, it is based on the encoding you are using, as @Preston Guillot said. According to the documentation of ReadChar:
This method reads one complete character based on the encoding.
Use caution when using ReadByte and ReadChar together. Switching
between reading bytes and reading characters can cause extra data to
be read and/or other unintended behavior. If it is necessary to switch
between reading text and reading binary data from the stream, select a
protocol that carefully defines the boundary between text and binary
data, such as manually reading bytes and decoding the data.
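A minimal sketch of the "read bytes, decode explicitly" approach the documentation recommends (the port name, baud rate, buffer size, and encoding here are assumptions about the device):

using System;
using System.IO.Ports;
using System.Text;

class ReadBytesExample
{
    static void Main()
    {
        using (var port = new SerialPort("COM1", 9600))
        {
            port.Open();
            byte[] buffer = new byte[256];
            int read = port.Read(buffer, 0, buffer.Length);
            // Decode with the encoding the device actually uses:
            string text = Encoding.ASCII.GetString(buffer, 0, read);
            Console.WriteLine(text);
        }
    }
}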

How to double-decode UTF-8 bytes C#

I have a problem.
Unicode U+2019 is this character:
’
It is a right single quote.
It gets encoded as UTF8.
But I fear it gets double-encoded.
>>> u'\u2019'.encode('utf-8')
'\xe2\x80\x99'
>>> u'\xe2\x80\x99'.encode('utf-8')
'\xc3\xa2\xc2\x80\xc2\x99'
>>> u'\xc3\xa2\xc2\x80\xc2\x99'.encode('utf-8')
'\xc3\x83\xc2\xa2\xc3\x82\xc2\x80\xc3\x82\xc2\x99'
>>> print(u'\u2019')
’
>>> print('\xe2\x80\x99')
’
>>> print('\xc3\xa2\xc2\x80\xc2\x99')
’
>>> '\xc3\xa2\xc2\x80\xc2\x99'.decode('utf-8')
u'\xe2\x80\x99'
>>> '\xe2\x80\x99'.decode('utf-8')
u'\u2019'
This is the principle used above.
How can I do the bolded parts (the decode steps above) in C#?
How can I take a UTF-8 encoded string, convert it to a byte array, convert THAT back to a string, and then decode it again?
I tried this method, but the output is not suitable in ISO-8859-1, it seems...
string firstLevel = "’";
byte[] decodedBytes = Encoding.UTF8.GetBytes(firstLevel);
Console.WriteLine(Encoding.UTF8.GetChars(decodedBytes));
// ’
Console.WriteLine(decodeUTF8String(firstLevel));
//â�,��"�
//I was hoping for this:
//’
Understanding Update:
Jon's helped me with my most basic question: going from "’" to "’" and thence to "’". But I want to honor the recommendations at the heart of his answer:
understand what is happening
fix the original sin
I made an effort at number 1.
Encoding/Decoding
I get so confused by terms like these.
I confuse them with terms like encrypting/decrypting, simply because of the "En..." and "De..." prefixes.
I forget what they translate from, and what they translate to.
I confuse these start points and end points; it could be related to other vague terms like hex, character entities, code points, and character maps.
I wanted to settle the definitions at a basic level.
Encoding and Decoding in the context of this question is:
Decode
Corresponds to C# {Encoding}.GetString(bytesArray)
Corresponds to Python stringObject.decode({Encoding})
Takes bytes as input, and converts to a string representation as output, according to some conversion scheme called an "encoding", represented by {Encoding} above.
Bytes -> String
Encode
Corresponds to C# {Encoding}.GetBytes(stringObject)
Corresponds to Python stringObject.encode({Encoding})
The reverse of Decode.
String -> Bytes (except for Python 2, where a str is already bytes; see below)
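In C# terms, a minimal round trip of the two definitions above (UTF-8 chosen for illustration):

using System.Text;

string original = "\u2019";                         // ’ (right single quote)
byte[] encoded = Encoding.UTF8.GetBytes(original);  // Encode: String -> Bytes (E2 80 99)
string decoded = Encoding.UTF8.GetString(encoded);  // Decode: Bytes -> String ("’" again)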
Bytes vs Strings in Python
So Encode and Decode take us back and forth between bytes and strings.
While Python helped me understand what was going wrong, it could also confuse my understanding of the "fundamentals" of Encoding/Decoding.
Jon said:
It's a shame that Python hides [the difference between binary data and text data] to a large extent
I think this is what PEP means when it says:
Python's current string objects are overloaded. They serve to hold both sequences of characters and sequences of bytes. This overloading of purpose leads to confusion and bugs.
Python 3.* does not overload strings in this way:
Python 2.7
>>> #Encoding example. As a generalization, "Encoding" produces bytes.
>>> #In Python 2.7, strings are overloaded to serve as bytes
>>> type(u'\u2019'.encode('utf-8'))
<type 'str'>
Python 3.*
>>> #In Python 3.*, bytes and strings are distinct
>>> type('\u2019'.encode('utf-8'))
<class 'bytes'>
Another important (related) difference between Python 2 and 3, is their default encoding:
>>> import sys
>>> sys.getdefaultencoding()
Python 2
'ascii'
Python 3
'utf-8'
And while Python 2 says 'ascii', I think it means a specific type of ASCII:
It does not mean ISO-8859-1, which supports range(256) and is what Jon uses to decode (discussed below).
It means ASCII, the plainest variety, which covers only range(128).
And while Python 3 no longer overloads strings as both bytes and text, the interpreter still makes it easy to ignore what's happening and move between types, i.e.:
just put a u before a string literal in Python 2.* and it's a Unicode literal
just put a b before a string literal in Python 3.* and it's a bytes literal
Encoding and C#
Jon points out that C# uses UTF-16, to correct my "UTF-8 encoded string" comment above:
Every string is effectively UTF-16.
My understanding is: if C# has a string object "s", the computer memory actually holds the bytes corresponding to that character in the UTF-16 map. That is (including a byte-order mark??), feff0073.
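One way to inspect those bytes (a small sketch; note that GetBytes returns the code units without a byte-order mark, and GetPreamble returns the BOM separately):

using System;
using System.Text;

string s = "s";
Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(s)));            // 73-00 (UTF-16LE)
Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(s)));   // 00-73 (UTF-16BE)
Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble())); // FE-FF (the BOM)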
He also uses ISO-8859-1 in the hack method I requested.
I'm not sure why.
My head is hurting at the moment, so I'll return when I have some perspective.
I'll return to this post. I hope I'm explaining properly. I'll make it a Wiki?
You need to understand that fundamentally this is due to someone misunderstanding the difference between binary data and text data. It's a shame that Python hides that difference to a large extent - it's quite hard to accidentally perform this particular form of double-encoding in C#. Still, this code should work for you:
using System;
using System.Text;

class Test
{
    static void Main()
    {
        // Avoid encoding issues in the source file itself...
        string firstLevel = "\u00c3\u00a2\u00c2\u0080\u00c2\u0099";
        string secondLevel = HackDecode(firstLevel);
        string thirdLevel = HackDecode(secondLevel);
        Console.WriteLine("{0:x}", (int) thirdLevel[0]); // 2019
    }

    // Converts a string to a byte array using ISO-8859-1, then *decodes*
    // it using UTF-8. Any use of this method indicates broken data to start
    // with. Ideally, the source of the error should be fixed.
    static string HackDecode(string input)
    {
        byte[] bytes = Encoding.GetEncoding(28591)
                               .GetBytes(input);
        return Encoding.UTF8.GetString(bytes);
    }
}

C# big-endian UCS-2

The project I'm currently working on needs to interface with a client system that we don't make, so we have no control over how data is sent either way. The problem is that we're working in C#, which doesn't seem to have any support for UCS-2 and very little support for big-endian (as far as I can tell).
What I would like to know is if there's anything I overlooked in .NET, or something that someone else has made and released that we can use. If not, I will take a crack at encoding/decoding it in a custom method, if that's even possible.
But thanks for your time either way.
EDIT:
BigEndianUnicode does work to correctly decode the string; the problem was in receiving other data as big-endian. So far, using IPAddress.HostToNetworkOrder() as suggested elsewhere has allowed me to decode half of the string ("Merli?" is what comes up, and it should be "Merlin33069").
I'm combing through the short code to see if there's another length variable I missed.
RESOLUTION:
After working out that the big-endian variables were the main problem, I went back through and reviewed the details, and it seems that the lengths of the strings were sent as character counts, not byte counts (in UTF-16 a char is two bytes). All I needed to do was double it, and it worked out. Thank you all for your help.
string x = "abc";
byte[] data = Encoding.BigEndianUnicode.GetBytes(x);
In other direction:
string decodedX = Encoding.BigEndianUnicode.GetString(data);
It is not exactly UCS-2 but it is enough for most cases.
UPDATE: from the Unicode FAQ:
Q: What is the difference between UCS-2 and UTF-16?
A: UCS-2 is obsolete terminology which refers to a Unicode
implementation up to Unicode 1.1, before surrogate code points and
UTF-16 were added to Version 2.0 of the standard. This term should now
be avoided.
UCS-2 does not define a distinct data format, because UTF-16 and UCS-2
are identical for purposes of data exchange. Both are 16-bit, and have
exactly the same code unit representation.
Sometimes in the past an implementation has been labeled "UCS-2" to
indicate that it does not support supplementary characters and doesn't
interpret pairs of surrogate code points as characters. Such an
implementation would not handle processing of character properties,
code point boundaries, collation, etc. for supplementary characters.
EDIT: Now we know that the problem isn't in the encoding of the text data but in the encoding of the length. There are a few options:
Reverse the bytes and then use the built-in BitConverter code (which I assume is what you're using now; that or BinaryReader)
Perform the conversion yourself using repeated "add and shift" operations
Use my EndianBitConverter or EndianBinaryReader classes from MiscUtil, which are like BitConverter and BinaryReader, but let you specify the endianness.
You may be looking for Encoding.BigEndianUnicode. That's the big-endian UTF-16 encoding, which isn't strictly speaking the same as UCS-2 (as pointed out by Marc) but should be fine unless you give it strings including characters outside the BMP (i.e. above U+FFFF), which can't be represented in UCS-2 but are represented in UTF-16.
From the Wikipedia page:
The older UCS-2 (2-byte Universal Character Set) is a similar character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996. It produces a fixed-length format by simply using the code point as the 16-bit code unit and produces exactly the same result as UTF-16 for 96.9% of all the code points in the range 0-0xFFFF, including all characters that had been assigned a value at that time.
I find it highly unlikely that the client system is sending you characters where there's a difference (which is basically the surrogate pairs, which are permanently reserved for that use anyway).
UCS-2 is so close to UTF-16 that Encoding.BigEndianUnicode will almost always suffice.
The issue (comments) around reading the length prefix (as big-endian) is more correctly resolved via shift operations, which will do the right thing on all systems. For example:
Read4BytesIntoBuffer(buffer);
int len = (buffer[0] << 24) | (buffer[1] << 16) | (buffer[2] << 8) | buffer[3];
This will then work the same (at parsing a big-endian 4 byte int) on any system, regardless of local endianness.
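Putting that together, a sketch of a helper that reads a big-endian 4-byte length prefix from any Stream (the method name is mine):

using System.IO;

// Reads exactly 4 bytes and assembles them as a big-endian Int32,
// regardless of the local machine's endianness.
static int ReadBigEndianInt32(Stream stream)
{
    byte[] buffer = new byte[4];
    int offset = 0;
    while (offset < 4)
    {
        int read = stream.Read(buffer, offset, 4 - offset);
        if (read == 0)
            throw new EndOfStreamException("Stream ended inside the length prefix");
        offset += read;
    }
    return (buffer[0] << 24) | (buffer[1] << 16) | (buffer[2] << 8) | buffer[3];
}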

Really simple short string compression

Is there a really simple compression technique for strings up to about 255 characters in length (yes, I'm compressing URLs)?
I am not concerned with the strength of compression - I am looking for something that performs very well and is quick to implement. I would like something simpler than SharpZipLib: something that can be implemented with a couple of short methods.
I think the key question here is "Why do you want to compress URLs?"
Trying to shorten long urls for the address bar?
You're better off storing the original URL somewhere (database, text file...) alongside a hashcode of the non-domain part (MD5 is fine). You can then have a simple page (or some HTTPModule if you're feeling flashy) to read the MD5 and look up the real URL. This is how TinyURL and others work.
For example:
http://mydomain.com/folder1/folder2/page1.aspx
Could be shortened to:
http://mydomain.com/2d4f1c8a
Using a compression library for this will not work. The string will be compressed into a shorter binary representation, but converting this back to a string which needs to be valid as part of a URL (e.g. Base64) will negate any benefit you gained from the compression.
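To make the lookup-key idea concrete, here is a sketch (the helper name and the 8-hex-character token length are illustrative choices; collision handling and storage are omitted):

using System.Security.Cryptography;
using System.Text;

// Hash the non-domain part of the URL and keep a short hex prefix as
// the token; store token -> full URL in a database for the lookup page.
static string ShortToken(string path)
{
    using (MD5 md5 = MD5.Create())
    {
        byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(path));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 4; i++)        // 4 bytes -> 8 hex chars
            sb.Append(hash[i].ToString("x2"));
        return sb.ToString();
    }
}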
Storing lots of URLs in memory or on disk?
Use the built-in compression library within System.IO.Compression, or the ZLib library, which is simple and incredibly good. Since you will be storing binary data, the compressed output will be fine as-is. You'll need to uncompress it to use it as a URL.
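A minimal sketch of that, using the built-in DeflateStream (storage-oriented: the output is binary, not URL-safe):

using System.IO;
using System.IO.Compression;
using System.Text;

static byte[] CompressUrl(string url)
{
    using (var output = new MemoryStream())
    {
        using (var deflate = new DeflateStream(output, CompressionMode.Compress))
        {
            byte[] data = Encoding.UTF8.GetBytes(url);
            deflate.Write(data, 0, data.Length);
        } // DeflateStream must be closed before the output is complete
        return output.ToArray();
    }
}

static string DecompressUrl(byte[] compressed)
{
    using (var input = new MemoryStream(compressed))
    using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
    using (var reader = new StreamReader(deflate, Encoding.UTF8))
    {
        return reader.ReadToEnd();
    }
}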
As suggested in the accepted answer, using data compression does not work to shorten URL paths that are already fairly short.
DotNetZip has a DeflateStream class that exposes a static (Shared in VB) CompressString method. It's a one-line way to compress a string using DEFLATE (RFC 1951). The DEFLATE implementation is fully compatible with System.IO.Compression.DeflateStream, but DotNetZip compresses better. Here's how you might use it:
string[] orig = {
    "folder1/folder2/page1.aspx",
    "folderBB/folderAA/page2.aspx",
};

public void Run()
{
    foreach (string s in orig)
    {
        System.Console.WriteLine("original    : {0}", s);
        byte[] compressed = DeflateStream.CompressString(s);
        System.Console.WriteLine("compressed  : {0}", ByteArrayToHexString(compressed));
        string uncompressed = DeflateStream.UncompressString(compressed);
        System.Console.WriteLine("uncompressed: {0}\n", uncompressed);
    }
}
Using that code, here are my test results:
original : folder1/folder2/page1.aspx
compressed : 4bcbcf49492d32d44f03d346fa0589e9a9867a89c5051500
uncompressed: folder1/folder2/page1.aspx
original : folderBB/folderAA/page2.aspx
compressed : 4bcbcf49492d7272d24f03331c1df50b12d3538df4128b0b2a00
uncompressed: folderBB/folderAA/page2.aspx
So you can see the "compressed" byte array, when represented in hex, is longer than the original, about 2x as long. The reason is that each byte becomes 2 ASCII chars in hex.
You could compensate somewhat for that by using base-62 instead of base-16 (hex) to represent the number. In that case a-z and A-Z are also digits, giving you 0-9 (10) + a-z (+26) + A-Z (+26) = 62 total digits. That would shorten the output significantly. I haven't tried that yet.
EDIT
Ok, I tested the base-62 encoder. It shortens the hex string by about half. I figured it would cut it to 25% (62/16 ≈ 4), but that reasoning is off: each base-62 digit carries log2(62) ≈ 5.95 bits versus 4 bits per hex digit, so the best case is only about two-thirds of the hex length. In my tests, the resulting base-62 encoded string is about the same length as the original URL. So, no, using compression and then base-62 encoding is still not a good approach; you really want a hash value.
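For reference, one way to base-62 encode a byte array (a sketch using BigInteger; the alphabet is an arbitrary choice, and this is not the exact encoder tested above):

using System;
using System.Numerics;
using System.Text;

static string ToBase62(byte[] data)
{
    const string alphabet = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    // Append a zero high byte so BigInteger (little-endian) treats the data as unsigned.
    byte[] unsigned = new byte[data.Length + 1];
    Array.Copy(data, unsigned, data.Length);
    BigInteger value = new BigInteger(unsigned);
    StringBuilder sb = new StringBuilder();
    while (value > 0)
    {
        sb.Insert(0, alphabet[(int)(value % 62)]);
        value /= 62;
    }
    return sb.Length == 0 ? "0" : sb.ToString();
}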
I'd suggest looking in the System.IO.Compression Namespace. There's an article on CodeProject that may help.
I have just created a compression scheme that targets URLs and achieves around 50% compression (compared to base64 representation of the original URL text).
see http://blog.alivate.com.au/packed-url/
It would be great if someone from a big tech company built this out properly and published it for all to use. Google championed Protocol Buffers. This tool could save a lot of disk space for someone like Google, while still being scannable. Or perhaps the great captain himself? https://twitter.com/capnproto
Technically, I would call this a binary (bitwise) serialisation scheme for the data that underlies a URL: treat the URL as a text representation of conceptual data, then serialise that conceptual data model with a specialised serialiser. The outcome is, of course, a more compressed version of the original. This is very different from how a general-purpose compression algorithm works.
What's your goal?
A shorter URL? Try URL shorteners like http://tinyurl.com/ or http://is.gd/
Storage space? Check out System.IO.Compression. (Or SharpZipLib)
You can use the deflate algorithm directly, without any headers, checksums, or footers, as described in this question: Python: Inflate and Deflate implementations
In my test, this cuts down a 4100-character URL to 1270 base64 characters, allowing it to fit inside IE's 2000-character limit.
And here's an example of a 4000-character URL, which can't be solved with a hashtable since the applet can exist on any server.
I would start with trying one of the existing (free or open source) zip libraries, e.g. http://www.icsharpcode.net/OpenSource/SharpZipLib/
Zip should work well for text strings, and I am not sure if it is worth implementing a compression algorithm yourself...
Have you tried just using gzip?
No idea if it would work effectively with such short strings, but I'd say it's probably your best bet.
The open source library SharpZipLib is easy to use and will provide you with compression tools

Smart ASCII string representation

I have an app that converts a binary file into an ASCII file. With a profiler I found that I spend 25% of the time in Encoding.GetBytes(), which is called from BinaryWriter.Write(char[]). It is completely correct, since I have many constructs similar to this one:
m_writer.Write("some fancy long text".ToCharArray());
Do you have any smart idea how to avoid this encoding conversion?
I know that one idea would be to do something similar to this:
static readonly byte[] SOME_FANCY_LONG_TEXT = Encoding.ASCII.GetBytes("some fancy ...");
// ... and later
m_writer.Write(SOME_FANCY_LONG_TEXT);
but I have too many such entries to do it manually.
If you're creating a text file, why are you using BinaryWriter at all? Just use a TextWriter. BinaryWriter is meant for binary streams where you want to write primitives, strings etc in a simple way.
(Is all your text definitely going to be ASCII, by the way? You might want to consider using UTF-8 instead.)
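For example, a sketch of that suggestion (the file name and the ASCII encoding are assumptions):

using System.IO;
using System.Text;

// StreamWriter is a TextWriter; with an explicit encoding it writes
// strings directly, so no ToCharArray() call is needed.
using (var writer = new StreamWriter("output.txt", false, Encoding.ASCII))
{
    writer.Write("some fancy long text");
}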
