Reformatting hex string input - c#

I'm trying to parse a user-entered hex string so that I can emit a correctly formatted version to disk.
The "correct" format, in this case, looks like A1 F5 E1 C9 - space separated bytes with uppercase hex letters. However, the user input might not have spaces (A1F5E1C9), might have line breaks (A1 F5\nE1 C9), might have leading or trailing space (\nA1 F5 E1 C9\n\n\n), and might have dashes instead of spaces (A1-F5-E1-C9). It might have any combination of those variations, as well. Some of the numbers this will be used on are public keys, which can be quite long.
How can I handle reformatting this? The two semi-solutions I've been able to come up with so far are
BigInteger.Parse(value.Trim()
    .Replace(" ", "")
    .Replace("\n", "")
    .Replace("\r", ""),
    NumberStyles.HexNumber).ToString("X2");
which still doesn't produce a spaced-out string, or
string.Join(" ", Regex.Matches(a, @"([0-9A-Fa-f]{2})")
    .Cast<Match>()
    .Select(x => x.Captures[0].Value.ToUpper()))
which does work, but feels like it has a lot of extraneous overhead (Regex, LINQ).
Is the second method actually the best way to do this? Is there something obvious I'm overlooking?

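I don't know how long the hex strings can be in your case, but you can convert your string (after stripping out the unneeded characters) to a byte array and use the BitConverter class to convert it back to a properly formatted string. BitConverter.ToString separates the bytes with dashes, which you can then replace with spaces.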
It is described e.g. here:
How do you convert Byte Array to Hexadecimal String, and vice versa?
BitConverter class is described here: https://msdn.microsoft.com/en-us/library/system.bitconverter(v=vs.110).aspx
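A minimal sketch of that approach follows. The class and method names (HexFormatter, Normalize) are made up for illustration, and it assumes that anything that is not a hex digit (spaces, dashes, line breaks) can simply be discarded and that an odd number of digits should be treated as an error:

```csharp
using System;
using System.Linq;

static class HexFormatter
{
    // Hypothetical helper: "a1f5-e1\nc9 " -> "A1 F5 E1 C9".
    public static string Normalize(string input)
    {
        // Assumption: every non-hex character is a separator and can be dropped.
        char[] digits = input.Where(c => Uri.IsHexDigit(c)).ToArray();

        if (digits.Length % 2 != 0)
            throw new FormatException("Odd number of hex digits.");

        // Convert each pair of digits into one byte.
        byte[] bytes = new byte[digits.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(new string(digits, i * 2, 2), 16);

        // BitConverter.ToString yields "A1-F5-E1-C9"; swap the dashes for spaces.
        return BitConverter.ToString(bytes).Replace('-', ' ');
    }
}
```

Because it works on a byte array rather than a BigInteger, this should also preserve leading zero bytes, which may matter for keys.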

Related

How to Determine Unicode Characters from a UTF-16 String?

I have a string that contains an odd Unicode space character, but I'm not sure what character that is. I understand that in C# a string in memory is encoded using the UTF-16 format. What is a good way to determine which Unicode characters make up the string?
This question was marked as a possible duplicate of
Determine a string's encoding in C#
It's not a duplicate of this question because I'm not asking about what the encoding is. I already know that a string in C# is encoded as UTF-16. I'm just asking for an easy way to determine what the Unicode values are in the string.
The BMP characters are up to 2 bytes in length (values 0x0000-0xffff), so there's a good bit of coverage there. Characters from the Chinese, Thai, even Mongolian alphabets are there, so if you're not an encoding expert, you might be forgiven if your code only handles BMP characters. But all the same, characters like the one at http://www.fileformat.info/info/unicode/char/10330/index.htm won't be correctly handled by code that assumes every character fits into two bytes.
Unicode seems to identify characters as numeric code points. Not all code points actually refer to characters, however, because Unicode has the concept of combining characters (which I don’t know much about). However, each Unicode string, even some invalid ones (e.g., illegal sequence of combining characters), can be thought of as a list of code points (numbers).
In the UTF-16 encoding, each code point is encoded as a 2 or 4 byte sequence. In .net, Char might roughly correspond to either a 2 byte UTF-16 sequence or half of a 4 byte UTF-16 sequence. When Char contains half of a 4 byte sequence, it is considered a “surrogate” because it only has meaning when combined with another Char which it must be kept with. To get started with inspecting your .net string, you can get .net to tell you the code points contained in the string, automatically combining surrogate pairs together if necessary. .net provides Char.ConvertToUtf32 which is described the following way:
Converts the value of a UTF-16 encoded character or surrogate pair at a specified position in a string into a Unicode code point.
The documentation for Char.ConvertToUtf32(String s, Int32 index) states that an ArgumentException is thrown for the following case:
The specified index position contains a surrogate pair, and either the first character in the pair is not a valid high surrogate or the second character in the pair is not a valid low surrogate.
Thus, you can go character by character in a string and find all of the Unicode code points with the help of Char.IsHighSurrogate() and Char.ConvertToUtf32(). When you don’t encounter a high surrogate, the current character fits in one Char and you only need to advance one Char in your string. If you do encounter a high surrogate, the character requires two Char and you need to advance by two:
static IEnumerable<int> GetCodePoints(string s)
{
    for (var i = 0; i < s.Length; i += char.IsHighSurrogate(s[i]) ? 2 : 1)
    {
        yield return char.ConvertToUtf32(s, i);
    }
}
When you say “from a UTF-16 String”, that might imply that you have read in a series of bytes formatted as UTF-16. If that is the case, you would need to convert that to a .net string before passing to the above method:
GetCodePoints(Encoding.Unicode.GetString(myUtf16Blob));
Another note: depending on how you build your String instance, it is possible that it contains an illegal sequence of Char with regards to surrogate pairs. For such strings, Char.ConvertToUtf32() will throw an exception when encountered. However, I think that Encoding.GetString() will always either return a valid string or throw an exception. So, generally, as long as your String instances are from “good” sources, you needn’t worry about Char.ConvertToUtf32() throwing (unless you pass in random values for the index offset because your offset might be in the middle of a surrogate pair).
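To actually spot an odd space character, it may help to print each code point in the usual U+XXXX notation so it can be looked up. The sketch below repeats the GetCodePoints method so that it is self-contained; the class name CodePointDump and the method Describe are invented for this example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class CodePointDump
{
    // Same enumeration as above: step over two chars when a surrogate pair starts.
    static IEnumerable<int> GetCodePoints(string s)
    {
        for (var i = 0; i < s.Length; i += char.IsHighSurrogate(s[i]) ? 2 : 1)
        {
            yield return char.ConvertToUtf32(s, i);
        }
    }

    // Hypothetical helper: formats every code point as U+XXXX, separated by spaces.
    public static string Describe(string s)
    {
        return string.Join(" ", GetCodePoints(s).Select(cp => "U+" + cp.ToString("X4")));
    }
}
```

For example, Describe("a\u00A0b") gives "U+0061 U+00A0 U+0062", which reveals that the middle character is a no-break space rather than an ordinary one.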

Replace long integers in raw json with strings

I'm looking for a way to wrap all integers over 17 digits long in a json-formatted string in quotes (essentially making them strings when deserialized).
Someone facing the same issue in Javascript posted here Convert all the integer value to string in JSON
I suspect there is a way to use Regex.Replace() here but the need to understand the syntax and regex's between the two languages has me a bit lost.
So far I have:
string pattern = @"/:\s*(\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d+)\s*([,\}])/g";
content = Regex.Replace(content,pattern, @":""{1}""{2}");
Zero-width negative lookahead/lookbehind (https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx#grouping_constructs) is what you should be using to make sure there are no quotes at the start or end. That way you don't need to know about the exact JSON format when you do the replacement:
string pattern = @"(?<![""\w])(\d{17,})(?![""\w])";
content = Regex.Replace(content, pattern, "\"$1\"");
This solution won't care whether there is a space between the : and the number. It will also handle numbers in arrays [ 12345678901234567, 12345678901234567 ] or by themselves.
Regex still isn't an ideal solution unless you know what content will be passed into it, as this breaks as soon as you have a number included in a string value, e.g. "abc 12345678901234567 def".
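As a rough, self-contained sketch of the lookaround approach (the class and method names here are hypothetical), the pattern can be wrapped in a small helper. It inherits the limitation just mentioned, and it also makes no attempt to handle a leading minus sign or the fractional digits of a decimal number:

```csharp
using System;
using System.Text.RegularExpressions;

static class LongIntegerQuoter
{
    // Runs of 17 or more digits that are not directly next to a quote
    // or a word character (letter, digit, underscore).
    private static readonly Regex LongInteger =
        new Regex(@"(?<![""\w])(\d{17,})(?![""\w])");

    // Hypothetical helper: wraps each matching run of digits in double quotes.
    public static string QuoteLongIntegers(string json)
    {
        return LongInteger.Replace(json, "\"$1\"");
    }
}
```

Values that are already quoted are left alone because the lookbehind rejects a preceding quote, and shorter numbers never satisfy \d{17,}.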
wrap all integers over 17 digits long in a json-formatted string in quotes
I would use the following:
string pattern = "([^\"\\d])(\\d{17,})([^\"\\d])";
content = Regex.Replace(content, pattern, "$1\"$2\"$3");
The first line selects all numeric values of 17 digits or more (and ensures that they aren't already strings), capturing the character on each side so that it can be put back.
The second line wraps the digits in double quotes and restores the surrounding characters.
If your JSON is minified, the regex changes a little. We can use the following, which keeps the colon in the replacement so that the resulting JSON is still valid:
string pattern = ":(\\d{17,})";
content = Regex.Replace(content, pattern, ":\"$1\"");

How do I create a string with a surrogate pair inside of it?

I saw this post on Jon Skeet's blog where he talks about string reversing. I wanted to try the example he showed myself, but it seems to work... which leads me to believe that I have no idea how to create a string that contains a surrogate pair which will actually cause the string reversal to fail. How does one actually go about creating a string with a surrogate pair in it so that I can see the failure myself?
The simplest way is to use \U######## where the U is capital, and the # denote exactly eight hexadecimal digits. If the value exceeds 0000FFFF hexadecimal, a surrogate pair will be needed:
string myString = "In the game of mahjong \U0001F01C denotes the Four of circles";
You can check myString.Length to see that the one Unicode character occupies two .NET Char values. Note that the char type has a couple of static methods that will help you determine if a char is a part of a surrogate pair.
If you use a .NET language that does not have something like the \U######## escape sequence, you can use the method ConvertFromUtf32, for example:
string fourCircles = char.ConvertFromUtf32(0x1F01C);
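A small sketch of the checks mentioned above, using only static methods on char. The class name SurrogateChecks is invented for the example:

```csharp
using System;

static class SurrogateChecks
{
    // One Unicode character above U+FFFF, stored as two UTF-16 code units.
    public static readonly string FourCircles = char.ConvertFromUtf32(0x1F01C);

    public static void Main()
    {
        Console.WriteLine(FourCircles.Length);                      // number of Char values
        Console.WriteLine(char.IsHighSurrogate(FourCircles[0]));    // first half of the pair
        Console.WriteLine(char.IsLowSurrogate(FourCircles[1]));     // second half of the pair
        Console.WriteLine(char.IsSurrogatePair(FourCircles, 0));    // both halves together
        Console.WriteLine(char.ConvertToUtf32(FourCircles, 0).ToString("X")); // back to the code point
    }
}
```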
Addition: If your C# source file has an encoding that allows all Unicode characters, like UTF-8, you can just put the character directly in the file (by copy-paste). For example:
string myString = "In the game of mahjong 🀜 denotes the Four of circles";
The character is UTF-8 encoded in the source file (in my example) but will be UTF-16 encoded (surrogate pairs) when the application runs and the string is in memory.
(Not sure if Stack Overflow software handles my mahjong character correctly. Try clicking "edit" to this answer and copy-paste from the text there, if the "funny" character is not here.)
The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme (see this page for more information).
In the Unicode character encoding, characters are mapped to values between 0x000000 and 0x10FFFF. Internally, a UTF-16 encoding scheme is used to store strings of Unicode text in which two-byte (16-bit) code sequences are considered. Since two bytes can only contain the range of characters from 0x0000 to 0xFFFF, some additional complexity is used to store values above this range (0x010000 to 0x10FFFF).
This is done using pairs of code points known as surrogates. The surrogate characters are classified in two distinct ranges known as low surrogates and high surrogates, depending on whether they are allowed at the start or the end of the two-code sequence.
Try this yourself:
String surrogate = "abc" + Char.ConvertFromUtf32(Int32.Parse("2A601", NumberStyles.HexNumber)) + "def";
Char[] surrogateArray = surrogate.ToCharArray();
Array.Reverse(surrogateArray);
String surrogateReversed = new String(surrogateArray);
or this, if you want to stick with the blog example:
String surrogate = "Les Mise" + Char.ConvertFromUtf32(Int32.Parse("0301", NumberStyles.HexNumber)) + "rables";
Char[] surrogateArray = surrogate.ToCharArray();
Array.Reverse(surrogateArray);
String surrogateReversed = new String(surrogateArray);
and then check the string values with the debugger. Jon Skeet is damn right... strings and dates seem easy but they are absolutely NOT.
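For comparison, here is a hedged sketch that puts the naive char-array reversal next to one that reverses by text element using StringInfo.GetTextElementEnumerator. The class and method names are made up, and the second method is only one possible way to keep surrogate pairs (and a base letter with its combining marks) together, not necessarily the approach from the blog post:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class ReverseDemo
{
    // Reverses individual Char values, so a surrogate pair ends up low-then-high.
    public static string NaiveReverse(string s)
    {
        char[] chars = s.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    // Reverses whole text elements, so each surrogate pair stays in order.
    public static string TextElementReverse(string s)
    {
        var elements = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
            elements.Add(e.GetTextElement());
        elements.Reverse();
        return string.Concat(elements);
    }
}
```

With "ab" followed by U+2A601, the naive version starts with a lone low surrogate (an invalid sequence), while the text-element version starts with the intact character.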

Finding a word - String Operation or Linq

I have a string full of a few hundred words.
How would I get each "word" (this can also be a single letter, number or punctuation mark), removing each "word" from the string as it is found?
Is this possible?
Example:
String:
"this is a string full of words and letters and also some punctuation! and num6er5."
As far as the algorithm is concerned, there are exactly 15 words in the above string.
What you're trying to do is known as tokenizing.
In C#, the string Split() function works pretty well. If it's used like in Niedermair's code without any parameters, it returns an array of strings split (splitted?) by any spaces like this:
"I have spaces" -> {"I", "have", "spaces"}
You can also give any chars to split on as a parameter to Split() (for instance, ',' or ';' to handle csv files).
The Split() method pays no heed to what goes into the strings, so any letters, numbers and other chars will be handled.
About removing the words from the string: You might want to write the string into a buffer to achieve this, but I seriously think that's going too far. Strings are immutable which means any time you remove the "next word" you'll have to recreate the entire string object.
It will be a lot easier to just Split() the entire string, throw the string away, and work with the array from there on.
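A minimal sketch of that idea: split once, then consume the words from a queue, which gives the "remove each word as it is found" behaviour without rebuilding the string. The Tokenizer class is hypothetical, and it assumes words are separated by plain spaces only:

```csharp
using System;
using System.Collections.Generic;

static class Tokenizer
{
    // Hypothetical helper: splits on spaces and returns the words in order.
    public static Queue<string> Tokenize(string text)
    {
        // RemoveEmptyEntries drops the empty strings produced by repeated spaces.
        string[] words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return new Queue<string>(words);
    }

    public static void Main()
    {
        Queue<string> words = Tokenize(
            "this is a string full of words and letters and also some punctuation! and num6er5.");

        // Each Dequeue removes the next word from the remaining input.
        while (words.Count > 0)
            Console.WriteLine(words.Dequeue());
    }
}
```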

C# Build hexadecimal notation string

How do I build an escape sequence string in hexadecimal notation?
Example:
string s = "\x1A"; // this will create the hex-value 1A or dec-value 26
I want to be able to build strings with hex-values between 00 to FF like this (in this example 1B)
string s = "\x" + "1B"; // Unrecognized escape sequence
Maybe there's another way of making hexadecimal strings...
Please try to avoid the \x escape sequence. It's difficult to read because where it stops depends on the data. For instance, how much difference is there at a glance between these two strings?
"\x9Good compiler"
"\x9Bad compiler"
In the former, the "\x9" is tab - the escape sequence stops there because 'G' is not a valid hex character. In the second string, "\x9Bad" is all an escape sequence, leaving you with some random Unicode character and " compiler".
I suggest you use the \u escape sequence instead:
"\u0009Good compiler"
"\u0009Bad compiler"
(Of course for tab you'd use \t but I hope you see what I mean...)
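The difference between the two \x strings can be checked directly. This is only an illustrative sketch, and the class name EscapeDemo is invented:

```csharp
using System;

static class EscapeDemo
{
    public static readonly string Good = "\x9Good compiler"; // \x9 stops at 'G'
    public static readonly string Bad = "\x9Bad compiler";   // \x9Bad is consumed as one escape

    public static void Main()
    {
        Console.WriteLine(Good[0] == '\t');              // the first char is a tab
        Console.WriteLine(((int)Bad[0]).ToString("X4")); // code of the single escaped char
        Console.WriteLine(Good.Length + " vs " + Bad.Length);
    }
}
```

The \x escape takes between one and four hex digits, which is why "Bad" is swallowed while "Good" is not.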
This is somewhat aside from the original question of course, but that's been answered already :)
You don't store hexadecimal values in strings.
You can, but it would just be that, a string, and it would have to be parsed into an integer or a byte to actually read its value.
You can assign a hexadecimal value as a literal to an int or a byte though:
byte b = 0xFF;
int i = 0x1B;
So it's easily possible to pass a hexadecimal literal into your string:
string foo = String.Format("{0} hex test", 0x0BB);
Which would create the string "187 hex test".
But I don't think that's what you wanted?
There's an '\u' escape code for hexadecimal 16 bits unicode character codes.
Console.WriteLine( "Look, I'm so happy : \u263A" );
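Escape sequences are resolved at compile time, so they cannot be assembled by concatenation at run time. One possible way to get the same result from hex digits held in a string is to parse them and cast to char. The sketch below uses a hypothetical helper named FromHex and assumes the input is one or two hex digits (00 to FF):

```csharp
using System;
using System.Globalization;

static class HexChar
{
    // Hypothetical helper: "1B" -> the one-character string "\u001B".
    // Assumption: the input holds a value between 00 and FF.
    public static string FromHex(string hexDigits)
    {
        int code = int.Parse(hexDigits, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return ((char)code).ToString();
    }
}
```

With this, HexChar.FromHex("1B") produces the same string as the literal "\x1B".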
