Encoding errors in embedded Json file


I have run into an issue and can't quite get my head around it.
I have this code:
public List<NavigationModul> LoadNavigation()
{
byte[] navBytes = NavigationResources.Navigation;
var encoding = GetEncoding(navBytes);
string json = encoding.GetString(navBytes);
List<NavigationModul> navigation = JsonConvert.DeserializeObject<List<NavigationModul>>(json);
return navigation;
}
public static Encoding GetEncoding(byte [] textBytes)
{
if (textBytes[0] == 0x2b && textBytes[1] == 0x2f && textBytes[2] == 0x76) return Encoding.UTF7;
if (textBytes[0] == 0xef && textBytes[1] == 0xbb && textBytes[2] == 0xbf) return Encoding.UTF8;
if (textBytes[0] == 0xff && textBytes[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
if (textBytes[0] == 0xfe && textBytes[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
if (textBytes[0] == 0 && textBytes[1] == 0 && textBytes[2] == 0xfe && textBytes[3] == 0xff) return Encoding.UTF32;
return Encoding.ASCII;
}
The goal is to load an embedded JSON file (NavigationResources.Navigation) from a resource file. The navigation file is an embedded file; we are just using the ResourceManager to avoid magic strings.
After loading the bytes of the embedded file and checking its encoding, I read the string from the bytes and pass it to JsonConvert.DeserializeObject.
Unfortunately, this fails due to invalid JSON. Long story short: the loaded JSON string still contains the encoding identification bytes (the byte order mark), and I can't figure out how to get rid of them.
I also tried converting the UTF-8 byte array to the default encoding before loading the string, but this only turns the encoding bytes into visible characters.
I talked to my peers, and they told me that they have run into the same problem reading embedded batch files, leading to broken batch files. They didn't know how to fix the problem either, but came up with a workaround for the batch files themselves (adding a blank line to the batch file to make it work).
Any suggestions on how to fix this?

Thanks to Alex K., I have a solution:
Cutting off the identification bytes before calling Encoding.GetString did the trick.
Here is the function I now use for the task:
public static string GetStringFromEncodedBytes(byte[] bytes) {
Encoding encoding = Encoding.Default;
int skipBytes = 0;
if (bytes[0] == 0x2b && bytes[1] == 0x2f && bytes[2] == 0x76) {
encoding = Encoding.UTF7;
skipBytes = 3;
}
if (bytes[0] == 0xef && bytes[1] == 0xbb && bytes[2] == 0xbf)
{
encoding = Encoding.UTF8;
skipBytes = 3;
}
if (bytes[0] == 0xff && bytes[1] == 0xfe)
{
encoding = Encoding.Unicode;
skipBytes = 2;
}
if (bytes[0] == 0xfe && bytes[1] == 0xff)
{
encoding = Encoding.BigEndianUnicode;
skipBytes = 2;
}
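if (bytes.Length >= 4 && bytes[0] == 0 && bytes[1] == 0 && bytes[2] == 0xfe && bytes[3] == 0xff)
{
// 00 00 FE FF is the big-endian UTF-32 BOM; Encoding.UTF32 is little-endian
encoding = new UTF32Encoding(true, true);
skipBytes = 4;
}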
return encoding.GetString(bytes.Skip(skipBytes).ToArray());
}

Here's a simpler approach, removing the BOM after decoding:
// Your data is always in UTF-8 apparently, so just rely on that.
string text = Encoding.UTF8.GetString(data);
if (text.StartsWith("\ufeff"))
{
text = text.Substring(1);
}
This has the downside of copying the string, of course.
Or if you do want to skip the bytes:
// Again, we're assuming UTF-8
int start = (data.Length >= 3 && data[0] == 0xef &&
data[1] == 0xbb && data[2] == 0xbf)
? 3 : 0;
string text = Encoding.UTF8.GetString(data, start, data.Length - start);
That way you don't need to use Skip and ToArray, and it avoids doing any extraneous copying.
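If the encoding is not known in advance, another option is to let StreamReader consume the BOM. The following is only a sketch under that assumption: BomAwareDecoder and Decode are hypothetical names, and UTF-8 is used as the fallback when no BOM is found.

```csharp
using System.IO;
using System.Text;

public static class BomAwareDecoder
{
    // Hypothetical helper: decodes a byte array and lets StreamReader detect
    // and skip a UTF-8/UTF-16/UTF-32 byte order mark if one is present.
    // Without a BOM, the bytes are decoded as UTF-8 (an assumption).
    public static string Decode(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, Encoding.UTF8, detectEncodingFromByteOrderMarks: true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With this, the JSON string passed to JsonConvert.DeserializeObject would be BomAwareDecoder.Decode(NavigationResources.Navigation), and no manual byte skipping is needed.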

Related

Unicode character is written with wrong byteorder

I'm trying to add a byte order mark to my string. But when I open the output file, the mark is reversed for some reason (0xFF 0xFE is written instead of 0xFE 0xFF). I wonder what could be the reason for this behaviour.
if (Globals.target_encoding.WebName == "utf-16" && !isURLFrame)
{
BOM = '\uFEFF'+BOM;
}
if (Globals.target_revision.number == 0x04)
{
processedString = BOM + source.text.Replace(Globals.target_separator.ToString(), '\0' + BOM) + '\0';
}
else
{
processedString = BOM + source.text + '\0';
}
if (!isURLFrame)
{
contents = new byte[1 + Globals.target_encoding.GetByteCount(processedString)];
contents[0] = targetEncodingValue(); // Encoding type
Array.Copy(Globals.target_encoding.GetBytes(processedString), 0, contents, 1, Globals.target_encoding.GetByteCount(processedString));
}
else
{
contents = new byte[Encoding.ASCII.GetByteCount(processedString)];
Array.Copy(Encoding.ASCII.GetBytes(processedString), 0, contents, 0, Encoding.ASCII.GetByteCount(processedString));
}
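One possible explanation, offered as an assumption because Globals.target_encoding is not shown: in .NET the encoding whose WebName is "utf-16" is the little-endian Encoding.Unicode, so the character U+FEFF is serialized as FF FE. The small sketch below (BomByteOrderDemo is a hypothetical name) shows the bytes each UTF-16 variant produces for that character.

```csharp
using System;
using System.Text;

public static class BomByteOrderDemo
{
    // Returns the bytes that the given encoding produces for U+FEFF,
    // formatted as hex pairs separated by dashes.
    public static string HexOfBom(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes("\uFEFF"));
    }
}
```

HexOfBom(Encoding.Unicode) gives "FF-FE", while HexOfBom(Encoding.BigEndianUnicode) gives "FE-FF". If the file format requires FE FF, the whole string has to be encoded as big-endian UTF-16.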

Decode cyrillic quoted-printable content

I'm using this sample to get mail from a server. The problem is that the response contains Cyrillic characters that I cannot decode.
Here is a header:
Content-type: text/html; charset="koi8-r"
Content-Transfer-Encoding: quoted-printable
And receive response function:
static void receiveResponse(string command)
{
try
{
if (command != "")
{
if (tcpc.Connected)
{
dummy = Encoding.ASCII.GetBytes(command);
ssl.Write(dummy, 0, dummy.Length);
}
else
{
throw new ApplicationException("TCP CONNECTION DISCONNECTED");
}
}
ssl.Flush();
byte[] bigBuffer = new byte[1024*16];
int bites = ssl.Read(bigBuffer, 0, bigBuffer.Length);
byte[] buffer = new byte[bites];
Array.Copy(bigBuffer, 0, buffer, 0, bites);
sb.Append(Encoding.ASCII.GetString(buffer));
string result = sb.ToString();
// here is an unsuccessful attempt at decoding
result = Regex.Replace(result, @"=([0-9a-fA-F]{2})",
m => m.Groups[1].Success
? Convert.ToChar(Convert.ToInt32(m.Groups[1].Value, 16)).ToString()
: "");
byte[] bytes = Encoding.Default.GetBytes(result);
result = Encoding.GetEncoding("koi8r").GetString(bytes);
}
catch (Exception ex)
{
throw new ApplicationException(ex.ToString());
}
}
How to decode stream correctly? In result string I got <p>=F0=D2=C9=D7=C5=D4 =D1 =F7=C1=CE=D1</p> instead of <p>Привет я Ваня</p>.

As @Max pointed out, you will need to decode the content using the encoding algorithm declared in the Content-Transfer-Encoding header.
In your case, it is the quoted-printable encoding.
You will need to decode the text of the message into an array of bytes and then convert that array of bytes into a string using the appropriate System.Text.Encoding. The name of the encoding to use will typically be specified in the Content-Type header as the charset parameter (in your case, koi8-r).
Since you already have the text as bytes (read into bigBuffer), simply perform the decoding on that:
byte[] buffer = new byte[bites];
int decodedLength = 0;
for (int i = 0; i < bites; i++) {
if (bigBuffer[i] == (byte) '=') {
if (bites > i + 2) {
// possible hex sequence
byte b1 = bigBuffer[i + 1];
byte b2 = bigBuffer[i + 2];
if (IsXDigit (b1) && IsXDigit (b2)) {
// decode
buffer[decodedLength++] = (byte) ((ToXDigit (b1) << 4) | ToXDigit (b2));
i += 2;
} else if (b1 == (byte) '\r' && b2 == (byte) '\n') {
// folded line, drop the '=\r\n' sequence
i += 2;
} else {
// error condition, just pass it through
buffer[decodedLength++] = bigBuffer[i];
}
} else {
// truncated? just pass it through
buffer[decodedLength++] = bigBuffer[i];
}
} else {
buffer[decodedLength++] = bigBuffer[i];
}
}
string result = Encoding.GetEncoding ("koi8-r").GetString (buffer, 0, decodedLength);
Custom functions:
static byte ToXDigit (byte c)
{
if (c >= 0x41) {
if (c >= 0x61)
return (byte) (c - (0x61 - 0x0a));
return (byte) (c - (0x41 - 0x0A));
}
return (byte) (c - 0x30);
}
static bool IsXDigit (byte c)
{
return (c >= (byte) 'A' && c <= (byte) 'F') || (c >= (byte) 'a' && c <= (byte) 'f') || (c >= (byte) '0' && c <= (byte) '9');
}
Of course, instead of writing your own hodge podge IMAP library, you could just use MimeKit and MailKit ;-)

Reading multi language text file in c#

I have to read a text file which can contain characters from the following languages: English, Japanese, Chinese, French, Spanish, German, Italian.
My task is simply to read the data and write it to a new text file (placing a newline character \n after every 100 characters).
I cannot use File.ReadAllText or File.ReadAllLines because the file can be larger than 500 MB, so I have written the following code:
using (var streamReader = new StreamReader(inputFilePath, Encoding.ASCII))
{
using (var streamWriter = new StreamWriter(outputFilePath,false))
{
char[] bytes = new char[100];
while (streamReader.Read(bytes, 0, 100) > 0)
{
var data = new string(bytes);
streamWriter.WriteLine(data);
}
MessageBox.Show("Completed");
}
}
Other than ASCII encoding I have tried UTF-7, UTF-8, UTF-32 and IBM500. But no luck in reading and writing multi language characters.
Please help me to achieve this.

You will have to take a look at the first 4 bytes of the file you are parsing.
These bytes give you a hint about which encoding you have to use.
Here is a helper method I have written for the task:
public static string GetStringFromEncodedBytes(this byte[] bytes) {
var encoding = Encoding.Default;
var skipBytes = 0;
if (bytes[0] == 0x2b && bytes[1] == 0x2f && bytes[2] == 0x76) {
encoding = Encoding.UTF7;
skipBytes = 3;
}
if (bytes[0] == 0xef && bytes[1] == 0xbb && bytes[2] == 0xbf) {
encoding = Encoding.UTF8;
skipBytes = 3;
}
if (bytes[0] == 0xff && bytes[1] == 0xfe) {
encoding = Encoding.Unicode;
skipBytes = 2;
}
if (bytes[0] == 0xfe && bytes[1] == 0xff) {
encoding = Encoding.BigEndianUnicode;
skipBytes = 2;
}
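if (bytes.Length >= 4 && bytes[0] == 0 && bytes[1] == 0 && bytes[2] == 0xfe && bytes[3] == 0xff) {
// 00 00 FE FF is the big-endian UTF-32 BOM; Encoding.UTF32 is little-endian
encoding = new UTF32Encoding(true, true);
skipBytes = 4;
}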
return encoding.GetString(bytes.Skip(skipBytes).ToArray());
}

This is a good enough start to get to the answer. If i is not equal to 100 you need to read more chars. No trouble with french chars like é - they are all handled in C# char class.
char[] soFlow = new char[100];
int posn = 0;
using (StreamReader sr = new StreamReader("a.txt"))
using (StreamWriter sw = new StreamWriter("b.txt", false))
while(sr.EndOfStream == false)
{
try {
int i = sr.Read(soFlow, 0, 100);
// Read may return fewer than 100 chars (i < 100), so write only what was read
posn += i;
sw.WriteLine(new string(soFlow, 0, i));
}
catch(Exception e){Console.WriteLine(e.Message);}
}
Spec: Read(Char[], Int32, Int32) Reads a specified maximum of characters from the current stream into a buffer, beginning at the specified index.
Certainly worked for me anyway :)
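As a sketch of the same idea that avoids the short-read problem, TextReader.ReadBlock can be used instead of Read, because it keeps reading until the buffer is full or the input ends. LineWrapper and Rewrap are hypothetical names, and the sketch assumes the caller opens the reader with the file's real encoding (for example UTF-8) rather than ASCII.

```csharp
using System.IO;

public static class LineWrapper
{
    // Copies text from reader to writer and inserts '\n' after every
    // `width` characters. ReadBlock only returns fewer than `width`
    // characters when the input is exhausted, so only the last chunk
    // can be short.
    public static void Rewrap(TextReader reader, TextWriter writer, int width = 100)
    {
        var buffer = new char[width];
        int read;
        while ((read = reader.ReadBlock(buffer, 0, width)) > 0)
        {
            writer.Write(buffer, 0, read);
            writer.Write('\n');
        }
    }
}
```

For files, this would be called with a StreamReader and a StreamWriter that both use the appropriate encoding, so the 500 MB input is never held in memory at once. Note that the width counts UTF-16 code units, so a character stored as a surrogate pair could still be split across two lines.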

Checking if Bytes are 0x00

What is the most readable (and idiomatic) way to write this method?
private bool BytesAreValid(byte[] bytes) {
var t = (bytes[0] | bytes[1] | bytes[2]);
return t != 0;
}
I need a function that tests that the first three bytes of a file are not 00 00 00.
I haven't done much byte manipulation. The code above doesn't seem correct to me, since t is inferred to be of type Int32.

t is type-inferred to be an Int32
Yup, because the | operator (like most operators) isn't defined for byte - the bytes are promoted to int values. (See section 7.11.1 of the C# 4 spec for details.)
But given that you only want to compare it with 0, that's fine anyway.
Personally I'd just write it as:
return bytes[0] != 0 || bytes[1] != 0 || bytes[2] != 0;
Or even:
return (bytes[0] != 0) || (bytes[1] != 0) || (bytes[2] != 0);
Both of these seem clearer to me.

private bool BytesAreValid(byte[] bytes) {
return !bytes.Take(3).SequenceEqual(new byte[] { 0, 0, 0 });
}

To anticipate variable array lengths and avoid null reference exceptions:
private bool BytesAreValid(byte[] bytes)
{
if (bytes == null) return false;
return !Array.Exists(bytes, x => x == 0);
}
Non-Linq version:
private bool BytesAreValid(byte[] bytes)
{
if (bytes == null) return false;
for (int i = 0; i < bytes.Length; i++)
{
if (bytes[i] == 0) return false;
}
return true;
}
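For comparison, here is a minimal sketch that stays with the question's requirement (the first three bytes must not all be zero) while still guarding against null and short arrays. ByteChecks and StartsWithNonZero are hypothetical names, and treating an array shorter than three bytes as invalid is an assumption.

```csharp
public static class ByteChecks
{
    // Returns true when at least one of the first `count` bytes is non-zero.
    // Null arrays and arrays shorter than `count` are treated as invalid.
    public static bool StartsWithNonZero(byte[] bytes, int count = 3)
    {
        if (bytes == null || bytes.Length < count) return false;
        for (int i = 0; i < count; i++)
        {
            if (bytes[i] != 0) return true;
        }
        return false;
    }
}
```

Unlike the two versions above, this ignores everything after the third byte, so a zero byte later in the file does not affect the result.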

Binary pattern comparison shortcut / fastest implementation in C#

I need to check a given byte or series of bytes for a particular sequence of bits as follows:
Can start with zero or more number of 0s.
Can start with zero or more number of 1s.
Must contain at least one 0 at the end.
In other words, if the value of bytes is not 0, then we are only interested in values that contain consecutive 1s followed by at least one 0 at the end.
I wrote the following code to do just that, but I want to make sure that it is highly optimized. I feel that the multiple checks within the if branches could be optimized, but I am not sure how. Please advise.
// The parameter [number] will NEVER be negative.
public static bool ConformsToPattern (System.Numerics.BigInteger number)
{
byte [] bytes = null;
bool moreOnesPossible = true;
if (number == 0) // 00000000
{
return (true); // All bits are zero.
}
else
{
bytes = number.ToByteArray();
if ((bytes [bytes.Length - 1] & 1) == 1)
{
return (false);
}
else
{
for (byte b=0; b < bytes.Length; b++)
{
if (moreOnesPossible)
{
if
(
(bytes [b] == 1) // 00000001
|| (bytes [b] == 3) // 00000011
|| (bytes [b] == 7) // 00000111
|| (bytes [b] == 15) // 00001111
|| (bytes [b] == 31) // 00011111
|| (bytes [b] == 63) // 00111111
|| (bytes [b] == 127) // 01111111
|| (bytes [b] == 255) // 11111111
)
{
// So far so good. Continue to the next byte with
// a possibility of more consecutive 1s.
}
else if
(
(bytes [b] == 128) // 10000000
|| (bytes [b] == 192) // 11000000
|| (bytes [b] == 224) // 11100000
|| (bytes [b] == 240) // 11110000
|| (bytes [b] == 248) // 11111000
|| (bytes [b] == 252) // 11111100
|| (bytes [b] == 254) // 11111110
)
{
moreOnesPossible = false;
}
else
{
return (false);
}
}
else
{
if (bytes [b] > 0)
{
return (false);
}
}
}
}
}
return (true);
}
IMPORTANT: The argument [number] sent to the function will NEVER be negative so no need to check for the sign bit.

I'm going to say that none of these answers are accounting for
00000010
00000110
00001110
00011110
00111110
01111110
00000100
00001100
00011100
00111100
01111100
etc, etc, etc.
Here's my byte array method:
public static bool ConformsToPattern(System.Numerics.BigInteger number)
{
bool foundStart = false, foundEnd = false;
int startPosition, stopPosition, increment;
if (number.IsZero || number.IsPowerOfTwo)
return true;
if (!number.IsEven)
return false;
byte[] bytes = number.ToByteArray();
if(BitConverter.IsLittleEndian)
{
startPosition = 0;
stopPosition = bytes.Length;
increment = 1;
}
else
{
startPosition = bytes.Length - 1;
stopPosition = -1;
increment = -1;
}
for(int i = startPosition; i != stopPosition; i += increment)
{
byte n = bytes[i];
for(int shiftCount = 0; shiftCount < 8; shiftCount++)
{
if (!foundEnd)
{
if ((n & 1) == 1)
foundEnd = true;
n = (byte)(n >> 1);
continue;
}
if (!foundStart)
{
if ((n & 1) == 0)
foundStart = true;
n = (byte)(n >> 1);
continue;
}
if (n == 0)
continue;
return false;
}
}
if (foundEnd)
return true;
return false;
}
Here's my BigInteger method:
public static bool ConformsToPattern(System.Numerics.BigInteger number)
{
bool foundStart = false;
bool foundEnd = false;
if (number.IsZero || number.IsPowerOfTwo)
return true;
if (!number.IsEven)
return false;
while (!number.IsZero)
{
if (!foundEnd)
{
if (!number.IsEven)
foundEnd = true;
number = number >> 1;
continue;
}
if (!foundStart)
{
if (number.IsEven)
foundStart = true;
number = number >> 1;
continue;
}
return false;
}
if (foundEnd)
return true;
return false;
}
Choose whichever works better for you. The byte array version is faster as of now. The BigInteger version serves as a 100% accurate reference.
If you're not worried about native endianness, remove that part of the code; note that BigInteger.ToByteArray is documented to return bytes in little-endian order, so the BitConverter.IsLittleEndian branch should not be needed. BigInteger already provides IsZero, IsEven and IsPowerOfTwo, so those are not extra calculations. I'm not sure whether that's the fastest way to shift right, since there is a byte-to-int cast, but right now I couldn't find another way. As for using byte vs short vs int vs long for the loop variables, that's up to you to change if you feel it'll work better. I'm not sure what kind of BigIntegers you'll be sending, so I think int is safe. You can modify the code to remove the inner for loop and just copy and paste its body 8 times, which might be faster, or you can move that into a static method.

How about something like this? If you find a one, the only things after that can be 1s until a 0 is found. After that, only 0s. This looks like it'll do the trick a little faster because it doesn't do unnecessary or conditions.
// The parameter [number] will NEVER be negative.
public static bool ConformsToPattern (System.Numerics.BigInteger number)
{
byte [] bytes = null;
bool moreOnesPossible = true;
bool foundFirstOne = false;
if (number == 0) // 00000000
{
return (true); // All bits are zero.
}
else
{
bytes = number.ToByteArray();
if ((bytes [bytes.Length - 1] & 1) == 1)
{
return (false);
}
else
{
for (byte b=0; b < bytes.Length; b++)
{
if (moreOnesPossible)
{
if(!foundFirstOne)
{
if
(
(bytes [b] == 1) // 00000001
|| (bytes [b] == 3) // 00000011
|| (bytes [b] == 7) // 00000111
|| (bytes [b] == 15) // 00001111
|| (bytes [b] == 31) // 00011111
|| (bytes [b] == 63) // 00111111
|| (bytes [b] == 127) // 01111111
|| (bytes [b] == 255) // 11111111
)
{
foundFirstOne = true;
// So far so good. Continue to the next byte with
// a possibility of more consecutive 1s.
}
else if
(
(bytes [b] == 128) // 10000000
|| (bytes [b] == 192) // 11000000
|| (bytes [b] == 224) // 11100000
|| (bytes [b] == 240) // 11110000
|| (bytes [b] == 248) // 11111000
|| (bytes [b] == 252) // 11111100
|| (bytes [b] == 254) // 11111110
)
{
moreOnesPossible = false;
}
else
{
return (false);
}
}
else
{
if(bytes [b] != 255) // 11111111
{
if
(
(bytes [b] == 128) // 10000000
|| (bytes [b] == 192) // 11000000
|| (bytes [b] == 224) // 11100000
|| (bytes [b] == 240) // 11110000
|| (bytes [b] == 248) // 11111000
|| (bytes [b] == 252) // 11111100
|| (bytes [b] == 254) // 11111110
)
{
moreOnesPossible = false;
}
}
}
}
else
{
if (bytes [b] > 0)
{
return (false);
}
}
}
}
}
return (true);
}

Here is the method I wrote myself. Not very elegant but pretty fast.
/// <summary>
/// Checks to see if this cell lies on a major diagonal of a power of 2.
/// ^[0]*[1]*[0]+$ denotes the regular expression of the binary pattern we are looking for.
/// </summary>
public bool IsDiagonalMajorToPowerOfTwo ()
{
byte [] bytes = null;
bool moreOnesPossible = true;
System.Numerics.BigInteger number = 0;
number = System.Numerics.BigInteger.Abs(this.X - this.Y);
if ((number == 0) || (number == 1)) // 00000000
{
return (true); // All bits are zero.
}
else
{
// The last bit should always be 0.
if (number.IsEven)
{
bytes = number.ToByteArray();
for (byte b=0; b < bytes.Length; b++)
{
if (moreOnesPossible)
{
switch (bytes [b])
{
case 001: // 00000001
case 003: // 00000011
case 007: // 00000111
case 015: // 00001111
case 031: // 00011111
case 063: // 00111111
case 127: // 01111111
case 255: // 11111111
{
// So far so good.
// Carry on testing subsequent bytes.
break;
}
case 128: // 10000000
case 064: // 01000000
case 032: // 00100000
case 016: // 00010000
case 008: // 00001000
case 004: // 00000100
case 002: // 00000010
case 192: // 11000000
case 096: // 01100000
case 048: // 00110000
case 024: // 00011000
case 012: // 00001100
case 006: // 00000110
case 224: // 11100000
case 112: // 01110000
case 056: // 00111000
case 028: // 00011100
case 014: // 00001110
case 240: // 11110000
case 120: // 01111000
case 060: // 00111100
case 030: // 00011110
case 248: // 11111000
case 124: // 01111100
case 062: // 00111110
case 252: // 11111100
case 126: // 01111110
case 254: // 11111110
{
moreOnesPossible = false;
break;
}
default:
{
return (false);
}
}
}
else
{
if (bytes [b] > 0)
{
return (false);
}
}
}
}
else
{
return (false);
}
}
return (true);
}

If I understand you correctly, you must have only 1 consecutive series of 1's followed by consecutive zeros.
So if it has to end in zero, it has to be even.
All the bytes in the middle must be all 1's and the first and last byte are your only special cases.
if (number.IsZero)
return true;
if (!number.IsEven)
return false;
var bytes = number.ToByteArray();
for (int i = 0; i < bytes.Length; i++)
{
if (i == 0) //first byte case
{
if (!(
(bytes[i] == 1) // 00000001
|| (bytes[i] == 3) // 00000011
|| (bytes[i] == 7) // 00000111
|| (bytes[i] == 15) // 00001111
|| (bytes[i] == 31) // 00011111
|| (bytes[i] == 63) // 00111111
|| (bytes[i] == 127) // 01111111
|| (bytes[i] == 255) // 11111111
))
{
return false;
}
}
else if (i == bytes.Length - 1) //last byte case
{
if (!(
(bytes[i] == 128) // 10000000
|| (bytes[i] == 192) // 11000000
|| (bytes[i] == 224) // 11100000
|| (bytes[i] == 240) // 11110000
|| (bytes[i] == 248) // 11111000
|| (bytes[i] == 252) // 11111100
|| (bytes[i] == 254) // 11111110
))
{
return false;
}
}
else //all bytes in the middle
{
if (bytes[i] != 255)
return false;
}
}

I'm a big fan of regular expressions, so I thought about simply converting the byte to a string and testing it against a regex. However, it's important to carefully define the pattern. By reading your question, I've come up with this one:
^(?:1*)(?:0+)$
Please, check it out:
public static bool ConformsToPattern(System.Numerics.BigInteger number)
{
byte[] ByteArray = number.ToByteArray();
Regex BinaryRegex = new Regex("^(?:1*)(?:0+)$", RegexOptions.Compiled);
return ByteArray.Where<byte>(x => !BinaryRegex.IsMatch(Convert.ToString(x, 2))).Count() > 0;
}
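The same pattern can also be tested on the whole number at once instead of byte by byte. The following is a sketch of a different technique, not the code from the answer above: for a non-negative even number whose set bits form one contiguous run, adding the lowest set bit carries through the whole run and leaves a single bit, i.e. a power of two. RunOfOnes is a hypothetical class name.

```csharp
using System;
using System.Numerics;

public static class RunOfOnes
{
    // Returns true when the binary form of `number` is zero, or a single
    // contiguous run of 1s followed by at least one 0.
    public static bool ConformsToPattern(BigInteger number)
    {
        if (number.Sign < 0) throw new ArgumentOutOfRangeException(nameof(number));
        if (number.IsZero) return true;    // all bits are zero
        if (!number.IsEven) return false;  // must end in at least one 0

        // number & -number isolates the lowest set bit (BigInteger bitwise
        // operators behave as if the values were in two's complement form).
        BigInteger lowestSetBit = number & -number;

        // Adding it to a single run of 1s collapses the run into one higher bit.
        return (number + lowestSetBit).IsPowerOfTwo;
    }
}
```

For example, 6 (110) gives 6 + 2 = 8, a power of two, so it conforms; 10 (1010) gives 10 + 2 = 12, which is not, so it does not. Whether this is faster than the byte-wise versions for a given workload would need to be measured.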

Not sure if this will be faster or slower than what you already have, but it's something to try (hope I didn't botch the logic)...
public bool ConformsToPattern(System.Numerics.BigInteger number) {
bool moreOnesPossible = true;
if (number == 0) {
return true;
}
else {
byte[] bytes = number.ToByteArray();
if ((bytes[bytes.Length - 1] & 1) == 1) {
return false;
}
else {
for (byte b = 0; b < bytes.Length; b++) {
if (moreOnesPossible) {
switch (bytes[b]) {
case 1:
case 3:
case 7:
case 15:
case 31:
case 63:
case 127:
case 255:
continue;
default:
switch (bytes[b]) {
case 128:
case 192:
case 224:
case 240:
case 248:
case 252:
case 254:
moreOnesPossible = false;
continue;
default:
return false;
}
}
}
else {
if (bytes[b] > 0) { return (false); }
}
}
}
}
return true;
}
