HTML hex to polish characters

HTML hex to polish characters - c#

I'm downloading HTML file with polish characters, and parsing it to string by:
public static string HexToString(string hex)
{
var sb = new StringBuilder();
for (int i = 0; i < hex.Length; i += 2)
{
string hexdec = hex.Substring(i, 2);
int number = int.Parse(hexdec, NumberStyles.HexNumber);
char charToAdd = (char)number;
sb.Append(charToAdd);
}
return sb.ToString();
}
so when I found %21 I'm sending 21 to HexToString() and in return there is !, this is ok, but char ą is represented as %C4%85 (Ä) and I whant to get ą char

The problem here is that you are treating the hex codes as if they are UTF16 (which is the native format for char), but they are in fact UTF8.
This is easy to resolve using a UTF8 encoding.
First, let's write a handy StringToByteArray() method:
public static byte[] StringToByteArray(string hex)
{
return Enumerable.Range(0, hex.Length)
.Where(x => x%2 == 0)
.Select(x => Convert.ToByte(hex.Substring(x, 2), 16))
.ToArray();
}
Now you can convert the hex string to text like so:
string hexStr = "C485"; // Or whatever your input hex string is.
var bytes = StringToByteArray(hexStr);
string text = Encoding.UTF8.GetString(bytes);
// ...use text

Matthew is right, but you can also use this:
public static string ConvertHexToString(string HexValue)
{
var res = "";
var replacedHex = HexValue.Replace("%", String.Empty);
while (replacedHex.Length > 0)
{
res += System.Convert.ToChar(System.Convert.ToUInt32(replacedHex.Substring(0, 2), 16)).ToString();
replacedHex = replacedHex.Substring(2, replacedHex.Length - 2);
}
return res;
}

Related

C# How to convert \uxxxx to plaintext (hex escape sequence) [duplicate]

How can I convert this string:
This string contains the Unicode character Pi(π)
into an escaped ASCII string:
This string contains the Unicode character Pi(\u03a0)
and vice versa?
The current Encoding available in C# converts the π character to "?". I need to preserve that character.

This goes back and forth to and from the \uXXXX format.
class Program {
static void Main( string[] args ) {
string unicodeString = "This function contains a unicode character pi (\u03a0)";
Console.WriteLine( unicodeString );
string encoded = EncodeNonAsciiCharacters(unicodeString);
Console.WriteLine( encoded );
string decoded = DecodeEncodedNonAsciiCharacters( encoded );
Console.WriteLine( decoded );
}
static string EncodeNonAsciiCharacters( string value ) {
StringBuilder sb = new StringBuilder();
foreach( char c in value ) {
if( c > 127 ) {
// This character is too big for ASCII
string encodedValue = "\\u" + ((int) c).ToString( "x4" );
sb.Append( encodedValue );
}
else {
sb.Append( c );
}
}
return sb.ToString();
}
static string DecodeEncodedNonAsciiCharacters( string value ) {
return Regex.Replace(
value,
#"\\u(?<Value>[a-zA-Z0-9]{4})",
m => {
return ((char) int.Parse( m.Groups["Value"].Value, NumberStyles.HexNumber )).ToString();
} );
}
}
Outputs:
This function contains a unicode character pi (π)
This function contains a unicode character pi (\u03a0)
This function contains a unicode character pi (π)

For Unescape You can simply use this functions:
System.Text.RegularExpressions.Regex.Unescape(string)
System.Uri.UnescapeDataString(string)
I suggest using this method (It works better with UTF-8):
UnescapeDataString(string)

string StringFold(string input, Func<char, string> proc)
{
return string.Concat(input.Select(proc).ToArray());
}
string FoldProc(char input)
{
if (input >= 128)
{
return string.Format(#"\u{0:x4}", (int)input);
}
return input.ToString();
}
string EscapeToAscii(string input)
{
return StringFold(input, FoldProc);
}

As a one-liner:
var result = Regex.Replace(input, #"[^\x00-\x7F]", c =>
string.Format(#"\u{0:x4}", (int)c.Value[0]));

class Program
{
static void Main(string[] args)
{
char[] originalString = "This string contains the unicode character Pi(π)".ToCharArray();
StringBuilder asAscii = new StringBuilder(); // store final ascii string and Unicode points
foreach (char c in originalString)
{
// test if char is ascii, otherwise convert to Unicode Code Point
int cint = Convert.ToInt32(c);
if (cint <= 127 && cint >= 0)
asAscii.Append(c);
else
asAscii.Append(String.Format("\\u{0:x4} ", cint).Trim());
}
Console.WriteLine("Final string: {0}", asAscii);
Console.ReadKey();
}
}
All non-ASCII chars are converted to their Unicode Code Point representation and appended to the final string.

Here is my current implementation:
public static class UnicodeStringExtensions
{
public static string EncodeNonAsciiCharacters(this string value) {
var bytes = Encoding.Unicode.GetBytes(value);
var sb = StringBuilderCache.Acquire(value.Length);
bool encodedsomething = false;
for (int i = 0; i < bytes.Length; i += 2) {
var c = BitConverter.ToUInt16(bytes, i);
if ((c >= 0x20 && c <= 0x7f) || c == 0x0A || c == 0x0D) {
sb.Append((char) c);
} else {
sb.Append($"\\u{c:x4}");
encodedsomething = true;
}
}
if (!encodedsomething) {
StringBuilderCache.Release(sb);
return value;
}
return StringBuilderCache.GetStringAndRelease(sb);
}
public static string DecodeEncodedNonAsciiCharacters(this string value)
=> Regex.Replace(value,/*language=regexp*/#"(?:\\u[a-fA-F0-9]{4})+", Decode);
static readonly string[] Splitsequence = new [] { "\\u" };
private static string Decode(Match m) {
var bytes = m.Value.Split(Splitsequence, StringSplitOptions.RemoveEmptyEntries)
.Select(s => ushort.Parse(s, NumberStyles.HexNumber)).SelectMany(BitConverter.GetBytes).ToArray();
return Encoding.Unicode.GetString(bytes);
}
}
This passes a test:
public void TestBigUnicode() {
var s = "\U00020000";
var encoded = s.EncodeNonAsciiCharacters();
var decoded = encoded.DecodeEncodedNonAsciiCharacters();
Assert.Equals(s, decoded);
}
with the encoded value: "\ud840\udc00"
This implementation makes use of a StringBuilderCache (reference source link)

A small patch to #Adam Sills's answer which solves FormatException on cases where the input string like "c:\u00ab\otherdirectory\" plus RegexOptions.Compiled makes the Regex compilation much faster:
private static Regex DECODING_REGEX = new Regex(#"\\u(?<Value>[a-fA-F0-9]{4})", RegexOptions.Compiled);
private const string PLACEHOLDER = #"#!#";
public static string DecodeEncodedNonAsciiCharacters(this string value)
{
return DECODING_REGEX.Replace(
value.Replace(#"\\", PLACEHOLDER),
m => {
return ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString(); })
.Replace(PLACEHOLDER, #"\\");
}

To store actual Unicode codepoints, you have to first decode the String's UTF-16 codeunits to UTF-32 codeunits (which are currently the same as the Unicode codepoints). Use System.Text.Encoding.UTF32.GetBytes() for that, and then write the resulting bytes to the StringBuilder as needed,i.e.
static void Main(string[] args)
{
String originalString = "This string contains the unicode character Pi(π)";
Byte[] bytes = Encoding.UTF32.GetBytes(originalString);
StringBuilder asAscii = new StringBuilder();
for (int idx = 0; idx < bytes.Length; idx += 4)
{
uint codepoint = BitConverter.ToUInt32(bytes, idx);
if (codepoint <= 127)
asAscii.Append(Convert.ToChar(codepoint));
else
asAscii.AppendFormat("\\u{0:x4}", codepoint);
}
Console.WriteLine("Final string: {0}", asAscii);
Console.ReadKey();
}

You need to use the Convert() method in the Encoding class:
Create an Encoding object that represents ASCII encoding
Create an Encoding object that represents Unicode encoding
Call Encoding.Convert() with the source encoding, the destination encoding, and the string to be encoded
There is an example here:
using System;
using System.Text;
namespace ConvertExample
{
class ConvertExampleClass
{
static void Main()
{
string unicodeString = "This string contains the unicode character Pi(\u03a0)";
// Create two different encodings.
Encoding ascii = Encoding.ASCII;
Encoding unicode = Encoding.Unicode;
// Convert the string into a byte[].
byte[] unicodeBytes = unicode.GetBytes(unicodeString);
// Perform the conversion from one encoding to the other.
byte[] asciiBytes = Encoding.Convert(unicode, ascii, unicodeBytes);
// Convert the new byte[] into a char[] and then into a string.
// This is a slightly different approach to converting to illustrate
// the use of GetCharCount/GetChars.
char[] asciiChars = new char[ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
string asciiString = new string(asciiChars);
// Display the strings created before and after the conversion.
Console.WriteLine("Original string: {0}", unicodeString);
Console.WriteLine("Ascii converted string: {0}", asciiString);
}
}
}

How to XOR a MD5 hash and return a 32 character string?

How do I further encrypt a MD5 hash by XOR'ing it with a string of variable size (not bigger than 32 characters) ?
I would like the result of the XOR to be a 32 character string as well.
What i have tried so far is:
convert the md5 string to binary
convert second string to binary
pad second binary with 0's (to the left) until both binaries are of equal length
iterate the binary representations and XOR them
convert the XOR'ed result to a string
The approach may be wrong, im not sure how to do it. My problem is, when converting the result of the XOR, it is not a 32 character long string, as I would like it to be.
Sample code (equal length strings in this case):
class Program
{
static void Main(string[] args)
{
var md51 = ToBinary(ConvertToByteArray(CalculateMD5Hash("Maaa"), Encoding.ASCII));
var md52 = ToBinary(ConvertToByteArray(CalculateMD5Hash("Moo"), Encoding.ASCII));
List<int> xoredResult = new List<int>();
for (int i = 0; i < md51.Length; i++)
{
var string1 = md51[i];
var string2 = md52[i];
var xor = string1 ^ string2;
xoredResult.Add(xor);
}
var resultingString = string.Join("", xoredResult);
Console.WriteLine(resultingString.Length);
var data = GetBytesFromBinaryString(resultingString);
var text = Encoding.ASCII.GetString(data);
}
public static byte[] ConvertToByteArray(string str, Encoding encoding)
{
return encoding.GetBytes(str);
}
public static String ToBinary(Byte[] data)
{
return string.Join("", data.Select(byt => Convert.ToString(byt, 2).PadLeft(8, '0')));
}
public static Byte[] GetBytesFromBinaryString(String binary)
{
var list = new List<Byte>();
for (int i = 0; i < binary.Length; i += 8)
{
String t = binary.Substring(i, 8);
list.Add(Convert.ToByte(t, 2));
}
return list.ToArray();
}
public static string CalculateMD5Hash(string input)
{
// step 1, calculate MD5 hash from input
MD5 md5 = System.Security.Cryptography.MD5.Create();
byte[] inputBytes = System.Text.Encoding.ASCII.GetBytes(input);
byte[] hash = md5.ComputeHash(inputBytes);
// step 2, convert byte array to hex string
StringBuilder sb = new StringBuilder();
for (int i = 0; i < hash.Length; i++)
{
sb.Append(hash[i].ToString("X2"));
}
return sb.ToString();
}
}

xoring a string with what is essentially random bytes is not guaranteed to give you a valid string as a output. Your var text = Encoding.ASCII.GetString(data); is likely failing because you are passing it a non valid string in byte form. You must use something like var text = Convert.ToBase64String(data) to be able to represent the random data without loss of information in the process.

can't display Arabic characters in unity using c# [duplicate]

How can I convert this string:
This string contains the Unicode character Pi(π)
into an escaped ASCII string:
This string contains the Unicode character Pi(\u03a0)
and vice versa?
The current Encoding available in C# converts the π character to "?". I need to preserve that character.

This goes back and forth to and from the \uXXXX format.
class Program {
static void Main( string[] args ) {
string unicodeString = "This function contains a unicode character pi (\u03a0)";
Console.WriteLine( unicodeString );
string encoded = EncodeNonAsciiCharacters(unicodeString);
Console.WriteLine( encoded );
string decoded = DecodeEncodedNonAsciiCharacters( encoded );
Console.WriteLine( decoded );
}
static string EncodeNonAsciiCharacters( string value ) {
StringBuilder sb = new StringBuilder();
foreach( char c in value ) {
if( c > 127 ) {
// This character is too big for ASCII
string encodedValue = "\\u" + ((int) c).ToString( "x4" );
sb.Append( encodedValue );
}
else {
sb.Append( c );
}
}
return sb.ToString();
}
static string DecodeEncodedNonAsciiCharacters( string value ) {
return Regex.Replace(
value,
#"\\u(?<Value>[a-zA-Z0-9]{4})",
m => {
return ((char) int.Parse( m.Groups["Value"].Value, NumberStyles.HexNumber )).ToString();
} );
}
}
Outputs:
This function contains a unicode character pi (π)
This function contains a unicode character pi (\u03a0)
This function contains a unicode character pi (π)

For Unescape You can simply use this functions:
System.Text.RegularExpressions.Regex.Unescape(string)
System.Uri.UnescapeDataString(string)
I suggest using this method (It works better with UTF-8):
UnescapeDataString(string)

string StringFold(string input, Func<char, string> proc)
{
return string.Concat(input.Select(proc).ToArray());
}
string FoldProc(char input)
{
if (input >= 128)
{
return string.Format(#"\u{0:x4}", (int)input);
}
return input.ToString();
}
string EscapeToAscii(string input)
{
return StringFold(input, FoldProc);
}

As a one-liner:
var result = Regex.Replace(input, #"[^\x00-\x7F]", c =>
string.Format(#"\u{0:x4}", (int)c.Value[0]));

class Program
{
static void Main(string[] args)
{
char[] originalString = "This string contains the unicode character Pi(π)".ToCharArray();
StringBuilder asAscii = new StringBuilder(); // store final ascii string and Unicode points
foreach (char c in originalString)
{
// test if char is ascii, otherwise convert to Unicode Code Point
int cint = Convert.ToInt32(c);
if (cint <= 127 && cint >= 0)
asAscii.Append(c);
else
asAscii.Append(String.Format("\\u{0:x4} ", cint).Trim());
}
Console.WriteLine("Final string: {0}", asAscii);
Console.ReadKey();
}
}
All non-ASCII chars are converted to their Unicode Code Point representation and appended to the final string.

Here is my current implementation:
public static class UnicodeStringExtensions
{
public static string EncodeNonAsciiCharacters(this string value) {
var bytes = Encoding.Unicode.GetBytes(value);
var sb = StringBuilderCache.Acquire(value.Length);
bool encodedsomething = false;
for (int i = 0; i < bytes.Length; i += 2) {
var c = BitConverter.ToUInt16(bytes, i);
if ((c >= 0x20 && c <= 0x7f) || c == 0x0A || c == 0x0D) {
sb.Append((char) c);
} else {
sb.Append($"\\u{c:x4}");
encodedsomething = true;
}
}
if (!encodedsomething) {
StringBuilderCache.Release(sb);
return value;
}
return StringBuilderCache.GetStringAndRelease(sb);
}
public static string DecodeEncodedNonAsciiCharacters(this string value)
=> Regex.Replace(value,/*language=regexp*/#"(?:\\u[a-fA-F0-9]{4})+", Decode);
static readonly string[] Splitsequence = new [] { "\\u" };
private static string Decode(Match m) {
var bytes = m.Value.Split(Splitsequence, StringSplitOptions.RemoveEmptyEntries)
.Select(s => ushort.Parse(s, NumberStyles.HexNumber)).SelectMany(BitConverter.GetBytes).ToArray();
return Encoding.Unicode.GetString(bytes);
}
}
This passes a test:
public void TestBigUnicode() {
var s = "\U00020000";
var encoded = s.EncodeNonAsciiCharacters();
var decoded = encoded.DecodeEncodedNonAsciiCharacters();
Assert.Equals(s, decoded);
}
with the encoded value: "\ud840\udc00"
This implementation makes use of a StringBuilderCache (reference source link)

A small patch to #Adam Sills's answer which solves FormatException on cases where the input string like "c:\u00ab\otherdirectory\" plus RegexOptions.Compiled makes the Regex compilation much faster:
private static Regex DECODING_REGEX = new Regex(#"\\u(?<Value>[a-fA-F0-9]{4})", RegexOptions.Compiled);
private const string PLACEHOLDER = #"#!#";
public static string DecodeEncodedNonAsciiCharacters(this string value)
{
return DECODING_REGEX.Replace(
value.Replace(#"\\", PLACEHOLDER),
m => {
return ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString(); })
.Replace(PLACEHOLDER, #"\\");
}

To store actual Unicode codepoints, you have to first decode the String's UTF-16 codeunits to UTF-32 codeunits (which are currently the same as the Unicode codepoints). Use System.Text.Encoding.UTF32.GetBytes() for that, and then write the resulting bytes to the StringBuilder as needed,i.e.
static void Main(string[] args)
{
String originalString = "This string contains the unicode character Pi(π)";
Byte[] bytes = Encoding.UTF32.GetBytes(originalString);
StringBuilder asAscii = new StringBuilder();
for (int idx = 0; idx < bytes.Length; idx += 4)
{
uint codepoint = BitConverter.ToUInt32(bytes, idx);
if (codepoint <= 127)
asAscii.Append(Convert.ToChar(codepoint));
else
asAscii.AppendFormat("\\u{0:x4}", codepoint);
}
Console.WriteLine("Final string: {0}", asAscii);
Console.ReadKey();
}

You need to use the Convert() method in the Encoding class:
Create an Encoding object that represents ASCII encoding
Create an Encoding object that represents Unicode encoding
Call Encoding.Convert() with the source encoding, the destination encoding, and the string to be encoded
There is an example here:
using System;
using System.Text;
namespace ConvertExample
{
class ConvertExampleClass
{
static void Main()
{
string unicodeString = "This string contains the unicode character Pi(\u03a0)";
// Create two different encodings.
Encoding ascii = Encoding.ASCII;
Encoding unicode = Encoding.Unicode;
// Convert the string into a byte[].
byte[] unicodeBytes = unicode.GetBytes(unicodeString);
// Perform the conversion from one encoding to the other.
byte[] asciiBytes = Encoding.Convert(unicode, ascii, unicodeBytes);
// Convert the new byte[] into a char[] and then into a string.
// This is a slightly different approach to converting to illustrate
// the use of GetCharCount/GetChars.
char[] asciiChars = new char[ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
string asciiString = new string(asciiChars);
// Display the strings created before and after the conversion.
Console.WriteLine("Original string: {0}", unicodeString);
Console.WriteLine("Ascii converted string: {0}", asciiString);
}
}
}

Convert Binary String to its readable equivalence C#

I have a string array that contains some binary data, and I would like to convert the binary to its equivalent character representative.
Each element inside the array contains 8 bit "1 byte" of data and I need to know how to convert it to its character equivalence
Here is the string array:
IEnumerable<string> resultChunks = Enumerable.Range(0, result.Length / 8)
.Select(x => result.Substring(x * 8, 8));
string[] newRes = resultChunks.ToArray();
string tempRes="";
for (int i = 0; i < newRes.Length; i++)
{
tempRes+=Convert.ToString(newRes[i]);
}
Current "result" is "0010001111000100001010010011101111000111001100110110011100110110"

If your data is in a byte array then you could use this:
string result = Encoding.ASCII.GetString(bytes);

I guess, your string is base64 string.
Then you can use next method:
public static string FromBase64(string base64Str, Encoding encoding = null)
{
if (string.IsNullOrWhiteSpace(base64Str))
{
return string.Empty;
}
byte[] bytes;
try
{
bytes = System.Convert.FromBase64String(base64Str);
}
catch (FormatException)
{
return string.Empty;
}
var stringEncoding = encoding ?? Encoding.UTF8;
return stringEncoding.GetString(bytes);
}

Converting a paragraph to hex notatation, then back to string

How would you convert a parapraph to hex notation, and then back again into its original string form?
(C#)
A side note: would putting the string into hex format shrink it the most w/o getting into hardcore shrinking algo's?

What exactly do you mean by "hex notation"? That usually refers to encoding binary data, not text. You'd need to encode the text somehow (e.g. using UTF-8) and then encode the binary data as text by converting each byte to a pair of characters.
using System;
using System.Text;
public class Hex
{
static void Main()
{
string original = "The quick brown fox jumps over the lazy dog.";
byte[] binary = Encoding.UTF8.GetBytes(original);
string hex = BytesToHex(binary);
Console.WriteLine("Hex: {0}", hex);
byte[] backToBinary = HexToBytes(hex);
string restored = Encoding.UTF8.GetString(backToBinary);
Console.WriteLine("Restored: {0}", restored);
}
private static readonly char[] HexChars = "0123456789ABCDEF".ToCharArray();
public static string BytesToHex(byte[] data)
{
StringBuilder builder = new StringBuilder(data.Length*2);
foreach(byte b in data)
{
builder.Append(HexChars[b >> 4]);
builder.Append(HexChars[b & 0xf]);
}
return builder.ToString();
}
public static byte[] HexToBytes(string text)
{
if ((text.Length & 1) != 0)
{
throw new ArgumentException("Invalid hex: odd length");
}
byte[] ret = new byte[text.Length/2];
for (int i=0; i < text.Length; i += 2)
{
ret[i/2] = (byte)(ParseNybble(text[i]) << 4 | ParseNybble(text[i+1]));
}
return ret;
}
private static int ParseNybble(char c)
{
if (c >= '0' && c <= '9')
{
return c-'0';
}
if (c >= 'A' && c <= 'F')
{
return c-'A'+10;
}
if (c >= 'a' && c <= 'f')
{
return c-'A'+10;
}
throw new ArgumentOutOfRangeException("Invalid hex digit: " + c);
}
}
No, doing this would not shrink it at all. Quite the reverse - you'd end up with a lot more text! However, you could compress the binary form. In terms of representing arbitrary binary data as text, Base64 is more efficient than plain hex. Use Convert.ToBase64String and Convert.FromBase64String for the conversions.

public string ConvertToHex(string asciiString)
{
string hex = "";
foreach (char c in asciiString)
{
int tmp = c;
hex += String.Format("{0:x2}", (uint)System.Convert.ToUInt32(tmp.ToString()));
}
return hex;
}

While I can't help much on the C# implementation, I would highly recommend LZW as a simple-to-implement data compression algorithm for you to use.

Perhaps the answer can be more quickly reached if we ask: what are you really trying to do? Converting an ordinary string to a string of a hex representation seems like the wrong approach to anything, unless you are making a hexidecimal/encoding tutorial for the web.

static byte[] HexToBinary(string s) {
byte[] b = new byte[s.Length / 2];
for (int i = 0; i < b.Length; i++)
b[i] = Convert.ToByte(s.Substring(i * 2, 2), 16);
return b;
}
static string BinaryToHex(byte[] b) {
StringBuilder sb = new StringBuilder(b.Length * 2);
for (int i = 0; i < b.Length; i++)
sb.Append(Convert.ToString(256 + b[i], 16).Substring(1, 2));
return sb.ToString();
}

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

HTML hex to polish characters - c#

Related

C# How to convert \uxxxx to plaintext (hex escape sequence) [duplicate]

How to XOR a MD5 hash and return a 32 character string?

can't display Arabic characters in unity using c# [duplicate]

Convert Binary String to its readable equivalence C#

Converting a paragraph to hex notatation, then back to string

Categories

Resources