UTF-8 remove BOM - c#

I have an XML file with a UTF-8 BOM at the beginning of the file, which prevents me from using existing code that reads UTF-8 files.
How can I remove the BOM from the XML file in an easy way?
Here I have a variable xmlfile of type byte[] that I convert to a string. xmlfile contains the entire XML file.
byte[] xmlfile = ((Byte[])myReader["xmlSQL"]);
string xmlstring = Encoding.UTF8.GetString(xmlfile);

Great stuff DBC :) that worked well with your link. To fix my problem, where I had a UTF-8 BOM at the beginning of my XML file, I simply added a MemoryStream and a StreamReader, which automatically stripped the BOM from the XML bytes (htmlbytes).
Really easy to implement for existing code.
byte[] htmlbytes = ((Byte[])myReader["xmlMelding"]);
var memorystream = new MemoryStream(htmlbytes);
var s = new StreamReader(memorystream).ReadToEnd();

Encoding.GetString() has an overload that accepts an offset into the byte[] array. Simply check if the array starts with a BOM, and if so then skip it when calling GetString(), eg:
byte[] xmlfile = ((Byte[])myReader["xmlSQL"]);
int offset = 0;
if (xmlfile.Length >= 3 &&
    xmlfile[0] == 0xEF &&
    xmlfile[1] == 0xBB &&
    xmlfile[2] == 0xBF)
{
    offset += 3;
}
string xmlstring = Encoding.UTF8.GetString(xmlfile, offset, xmlfile.Length - offset);
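The same check can also be written without hard-coding the three bytes, by comparing against Encoding.UTF8.GetPreamble(). This is only a minimal sketch: the class and method names (BomHelper, DecodeUtf8WithoutBom) are made up for illustration and do not come from the posts above.

```csharp
using System;
using System.Text;

public static class BomHelper
{
    // Decodes UTF-8 bytes to a string, skipping a leading BOM if one is present.
    // (Hypothetical helper name, for illustration only.)
    public static string DecodeUtf8WithoutBom(byte[] data)
    {
        byte[] bom = Encoding.UTF8.GetPreamble(); // EF BB BF for Encoding.UTF8
        int offset = 0;

        if (data.Length >= bom.Length)
        {
            bool hasBom = true;
            for (int i = 0; i < bom.Length; i++)
            {
                if (data[i] != bom[i])
                {
                    hasBom = false;
                    break;
                }
            }
            if (hasBom)
            {
                offset = bom.Length;
            }
        }

        return Encoding.UTF8.GetString(data, offset, data.Length - offset);
    }
}
```

With the question's variable, the call would look like string xmlstring = BomHelper.DecodeUtf8WithoutBom(xmlfile);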

Related

UTF8 Character lost when written to file

I am creating an application to scan and merge CSV files. I am having an issue when writing the data to a new file. One of the fields contains the ö character, which is preserved until I write it to the new file. It then becomes the "actual" value: ö instead of the "expected" value: ö
I suspect that UTF-8 encoding is not the best thing to use, but I have yet to find a better working method. Any help with this would be much appreciated!
byte[] nl = new UTF8Encoding(true).GetBytes("\n");
using (FileStream file = File.Create(filepath))
{
    string text;
    byte[] info;
    for (int r = 0; r < data.Count; r++)
    {
        int c = 0;
        for (; c < data[r].Count - 1; c++)
        {
            text = data[r][c] + @",";
            text = text.Replace("\n", @"");
            text = text.Replace(@"☼", @"""");
            info = new UTF8Encoding(true).GetBytes(text);
            file.Write(info, 0, text.Length);
        }
        text = data[r][c];
        info = new UTF8Encoding(true).GetBytes(text);
        file.Write(info, 0, text.Length);
        file.Write(nl, 0, nl.Length);
    }
}
I might be mistaken, and this should probably go in a comment, but I can't comment yet. Text editors decode the binary data using a certain encoding, so what they display can be misleading. You can check the actual bytes you are writing to the file in a hex editor. Notepad++ has a hex editor plug-in that you could use.
BinaryWriter is easier to work with when it comes to writing bytes to a file. You can also set the encoding of the BinaryWriter. You'll want to set this to UTF-8.
Edit
I forgot to mention: when you write the data out as bytes, you are going to want to read it in as bytes as well. Use BinaryReader and set the encoding to UTF-8.
Once you have read the bytes in, use Encoding.UTF8.GetString() to convert them into a string.
You might be truncating the output since UTF-8 is multibyte.
Don't do this:
info = new UTF8Encoding(true).GetBytes(text);
file.Write(info, 0, text.Length);
Instead use info.Length.
info = new UTF8Encoding(true).GetBytes(text);
file.Write(info, 0, info.Length); // change this line
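An alternative that avoids counting bytes by hand is to let a StreamWriter do the encoding. The sketch below is only an illustration under that assumption; the class and method names (Utf8WriteDemo, WriteLines) are hypothetical, and the CSV escaping from the question is left out.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8WriteDemo
{
    // Writes each line as UTF-8, followed by "\n".
    // StreamWriter encodes the text itself, so no byte counts are passed around.
    // (Hypothetical helper, for illustration only.)
    public static void WriteLines(string path, string[] lines)
    {
        // UTF8Encoding(true) makes the writer emit a BOM at the start of the file,
        // matching the encoding object used in the question.
        using (var writer = new StreamWriter(path, false, new UTF8Encoding(true)))
        {
            foreach (string line in lines)
            {
                writer.Write(line);
                writer.Write("\n");
            }
        }
    }
}
```

For reference, the string "\u00F6," (ö followed by a comma) has a Length of 2 characters, but GetBytes returns 3 bytes for it, which is why passing text.Length to file.Write drops the last byte.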

Stream reader.Read number of character

Is there any stream reader class that reads only a given number of chars from a string, or bytes from a byte[]?
For example, reading a string:
string chunk = streamReader.ReadChars(5); // Read next 5 chars
or reading bytes:
byte[] bytes = streamReader.ReadBytes(5); // Read next 5 bytes
Note that the return type of this method or the name of the class does not matter. I just want to know whether something similar exists that I can use.
I have a byte[] from a MIDI file. I want to read this MIDI file in C#, but I need the ability to read a given number of bytes, or chars (if I convert it to hex), to validate the MIDI data and read from it more easily.
Thanks for the comments. I didn't know there is an overload of the Read method. I could achieve this with FileStream.
using (FileStream fileStream = new FileStream(path, FileMode.Open))
{
    byte[] chunk = new byte[4];
    fileStream.Read(chunk, 0, 4);
    string hexLetters = BitConverter.ToString(chunk); // the 4 bytes as hex, which is what I need!
}
You can achieve this by doing something like the code below, but I am not sure whether it is applicable to your problem.
StreamReader sr = new StreamReader(stream);
StringBuilder S = new StringBuilder();
while (true)
{
    S = S.Append(sr.ReadLine());
    if (sr.EndOfStream == true)
    {
        break;
    }
}
Once you have the value in "S", you can take substrings from it.
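Since the data is already in a byte[], a BinaryReader over a MemoryStream gets close to the ReadBytes/ReadChars calls the question asks for. The following is a rough sketch; the class and method names (ChunkReader, ReadHexPrefix, ReadAsciiPrefix) are invented for this example.

```csharp
using System;
using System.IO;
using System.Text;

public static class ChunkReader
{
    // Reads up to `count` bytes from the start of `data` and returns them
    // as hex pairs separated by dashes, the format BitConverter.ToString produces.
    public static string ReadHexPrefix(byte[] data, int count)
    {
        using (var reader = new BinaryReader(new MemoryStream(data)))
        {
            // ReadBytes may return fewer bytes if the stream ends early.
            byte[] chunk = reader.ReadBytes(count);
            return BitConverter.ToString(chunk);
        }
    }

    // Reads up to `count` ASCII characters from the start of `data`.
    public static string ReadAsciiPrefix(byte[] data, int count)
    {
        using (var reader = new BinaryReader(new MemoryStream(data), Encoding.ASCII))
        {
            return new string(reader.ReadChars(count));
        }
    }
}
```

A MIDI file starts with the four ASCII bytes "MThd", so ReadAsciiPrefix(data, 4) == "MThd" could serve as a first validity check.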

How to read utf-8 encoded string in C#?

My scenario is:
Create an email in Outlook Express and save it as .eml file;
Read the file as string in C# console application;
I'm saving the .eml file encoded in UTF-8. An example of the text I wrote is:
'Goiânia é badalação.'
It contains special characters like âéçã, which are Portuguese characters.
When I open the file with Notepad++ the text is shown like this:
'Goi=C3=A2nia =C3=A9 badala=C3=A7=C3=A3o.'
If I open it in Outlook Express again, it is shown normally, like the first version.
When I read the file in the console application, using UTF-8 decoding, the string is shown like the second version.
The code I am using is:
string text = File.ReadAllText(@"C:\fromOutlook.eml", Encoding.UTF8);
Console.WriteLine(text);
I tried all Encoding options and a lot of methods I found in the web but nothing works.
Can someone help me do this simple conversion?
'Goi=C3=A2nia =C3=A9 badala=C3=A7=C3=A3o.'
to
'Goiânia é badalação.'
string text = "Goi=C3=A2nia =C3=A9 badala=C3=A7=C3=A3o.";
byte[] bytes = new byte[text.Length * sizeof(char)];
System.Buffer.BlockCopy(text.ToCharArray(), 0, bytes, 0, bytes.Length);
char[] chars = new char[bytes.Length / sizeof(char)];
System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
Console.WriteLine(new string(chars));
In this utf-8 table you can see the hex. value of these characters, 'é' == 'c3 a9':
http://www.utf8-chartable.de/
Thanks.
var input = "Goi=C3=A2nia =C3=A9 badala=C3=A7=C3=A3o.";
var buffer = new List<byte>();
var i = 0;
while (i < input.Length)
{
    var character = input[i];
    if (character == '=')
    {
        // "=XX" is one byte written as two hex digits
        var part = input.Substring(i + 1, 2);
        buffer.Add(byte.Parse(part, System.Globalization.NumberStyles.HexNumber));
        i += 3;
    }
    else
    {
        buffer.Add((byte)character);
        i++;
    }
}
var output = Encoding.UTF8.GetString(buffer.ToArray());
Console.WriteLine(output); // prints: Goiânia é badalação.
Knowing the problem is quoted printable, I found a good decoder here:
http://www.dpit.co.uk/2011/09/decoding-quoted-printable-email-in-c.html
This works for me.
Thanks folks.
Update:
The above link is dead, here is a workable application:
How to convert Quoted-Print String

FileStream returning null characters every other character

I seem to be having some issues with a Filestream in C#.
I am trying to read the last line from a VERY large text file (10 MB) that is generated by an MSI installer.
The code I am using is:
string path = @"C:\uninstall.log";
byte[] buffer = new byte[100];
using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
{
    long len = fs.Length;
    fs.Seek(-100, SeekOrigin.End);
    fs.Read(buffer, 0, 100);
}
string foo = Encoding.UTF8.GetString(buffer);
Console.WriteLine("\"" + foo + "\"");
But the output looks similar to this:
H E L L O W O R L D ! ! ! B L A H B L A H
Apparently the stream that is read contains a '\0' (null) character every other character.
Does anyone know what is causing this?
Use Encoding.Unicode (the UnicodeEncoding class) instead. Your file is encoded in UTF-16, not UTF-8.
The file is probably a UTF-16 file, not a UTF-8 file. Just try using Encoding.Unicode instead of Encoding.UTF8.
Sounds like the file is actually UTF-16 encoded. Change Encoding.UTF8 to Encoding.Unicode in your GetString() call.
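Putting the answers together, the tail of the file could be read roughly as follows. This is a sketch under the assumption that the log is little-endian UTF-16 (which is what Encoding.Unicode decodes); the class and method names (Utf16Tail, ReadTail) are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf16Tail
{
    // Returns up to `byteCount` bytes from the end of the file, decoded as UTF-16 LE.
    // Assumes the file really is UTF-16 LE; a surrogate pair cut in half at the
    // start of the window would still decode to a replacement character.
    public static string ReadTail(string path, int byteCount)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            long start = Math.Max(0, fs.Length - byteCount);
            if (start % 2 != 0)
            {
                start++; // stay on a 2-byte code unit boundary
            }
            fs.Seek(start, SeekOrigin.Begin);

            var buffer = new byte[fs.Length - start];
            int read = 0;
            while (read < buffer.Length)
            {
                // Read may return fewer bytes than requested, so loop until done.
                int n = fs.Read(buffer, read, buffer.Length - read);
                if (n == 0)
                {
                    break;
                }
                read += n;
            }

            return Encoding.Unicode.GetString(buffer, 0, read);
        }
    }
}
```

The alignment step matters because starting on an odd byte offset would pair the high byte of one character with the low byte of the next and produce garbage.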

How can I determine the index in codepage 850 for a char in C#?

I have a text file which is encoded with codepage 850. I am reading this file the following way:
using (var reader = new StreamReader(filePath, Encoding.GetEncoding(850)))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        //...
    }
    //...
}
Now, for every character in the string line in the loop above, I need the zero-based index that the character has in codepage 850, something like:
for (int i = 0; i < line.Length; i++)
{
    int indexInCodepage850 = GetIndexInCodepage850(line[i]); // ?
    //...
}
Is this possible, and what could int GetIndexInCodepage850(char c) look like?
Use Encoding.GetBytes() on the line. CP850 is an 8-bit encoding, so the byte array should have just as many elements as the string had characters, and each element is the value of the character.
Just read the file as bytes, and you have the codepage 850 character codes:
byte[] data = File.ReadAllBytes(filePath);
You don't get it separated into lines, though. The character codes for CR and LF that you need to look for in the data are 13 and 10.
You don't need to.
You are already specifying the encoding in the streamreader constructor.
The string returned from reader.ReadLine() will already have been decoded using CP850.
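The GetBytes() suggestion above could look roughly like this. It is a sketch, not a definitive implementation: the class and method names (Cp850, GetIndexes) are invented here, and the RegisterProvider call is an assumption about the runtime. It is needed on .NET Core / .NET 5+, where code page 850 is not available by default, and can be dropped on the classic .NET Framework.

```csharp
using System;
using System.Text;

public static class Cp850
{
    static Cp850()
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // must be registered before Encoding.GetEncoding(850) can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Returns the CP850 byte value (0-255) for each character in `line`.
    // Characters that have no CP850 mapping are substituted by the encoding's
    // fallback, so the result is only meaningful for text that came from CP850.
    public static byte[] GetIndexes(string line)
    {
        Encoding cp850 = Encoding.GetEncoding(850);
        return cp850.GetBytes(line);
    }
}
```

Because CP850 is a single-byte encoding, the array has one element per character of the line, so GetIndexes(line)[i] plays the role of GetIndexInCodepage850(line[i]).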
