UTF8 Beginning of File characters are breaking serializer & readers - c#

Okay, I'm trying to work with UTF-8 text files. I'm constantly fighting the BOM characters that the writer drops in for UTF-8, which blow up pretty much anything I use to read the file, including serializers and other text readers.
I'm getting six leading bytes of data:
0xEF
0xBB
0xBF
0xEF
0xBB
0xBF
(Now that I'm looking at it, I realize there are two characters there. Is that the UTF-8 BOM marker? Am I double-encoding it?)
Notice the serializer encodes to UTF8, then the memory stream gets a string as UTF8, then I write the string to the file with UTF8... seems like a lot of redundancy. Thoughts?
// I'm storing this xml result to a database field (this one includes the BOF chars)
using (MemoryStream ms = new MemoryStream())
{
    Utility.SerializeXml(ms, root);
    xml = Encoding.UTF8.GetString(ms.ToArray());
}
//later on, I would take that xml and then write it out to a file like this:
File.WriteAllText(path, xml, Encoding.UTF8);
public static void SerializeXml(Stream output, object data)
{
    XmlSerializer xs = new XmlSerializer(data.GetType());
    XmlWriterSettings settings = new XmlWriterSettings();
    settings.Indent = true;
    settings.IndentChars = "\t";
    settings.Encoding = Encoding.UTF8;
    XmlWriter writer = XmlTextWriter.Create(output, settings);
    xs.Serialize(writer, data);
    writer.Flush();
    writer.Close();
}

Yeah, that's two BOMs. You're encoding to UTF-8 twice and each time adds a pseudo-BOM, due to the extremely unfortunate fact that:
Encoding.UTF8
means “UTF-8 with a pointless, meaningless U+FEFF stuck to the front to screw up your applications”. Try instead using
new UTF8Encoding(false)
which should give you a less sucky version.
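Applied to the question's code, that looks roughly like this. A hedged sketch, reusing the root, path and xml names from the question and inlining the serialization rather than calling the exact Utility.SerializeXml helper:
// Sketch of the fix: use a BOM-less UTF-8 encoding everywhere instead of Encoding.UTF8.
UTF8Encoding utf8NoBom = new UTF8Encoding(false);

XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.IndentChars = "\t";
settings.Encoding = utf8NoBom;                // XmlWriter writes no preamble now

string xml;
using (MemoryStream ms = new MemoryStream())
using (XmlWriter writer = XmlWriter.Create(ms, settings))
{
    new XmlSerializer(root.GetType()).Serialize(writer, root);
    writer.Flush();
    xml = utf8NoBom.GetString(ms.ToArray());  // no U+FEFF at the start of the string
}

// ...and again when the string is later written out to disk:
File.WriteAllText(path, xml, utf8NoBom);      // no second BOM either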

Yes, that is a BOM.
Yes, some older JDKs had a bug that blew up on UTF-8 BOM data, and two of them will confuse even a modern version of Java.
The solution I used was to stick a pushback stream on the front and filter it out.
Or use a more modern version of Java.
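In C# (the language of this thread), the same trick is to peek at the leading bytes yourself and skip them while they form a UTF-8 BOM. A hedged sketch, assuming the stream is seekable; the helper name is mine:
// Consume any leading UTF-8 BOMs (0xEF 0xBB 0xBF) before handing the stream to a
// parser that chokes on them. Loops because the file in the question has two BOMs.
static void SkipUtf8Boms(Stream input)
{
    while (true)
    {
        long pos = input.Position;
        byte[] buf = new byte[3];
        int read = input.Read(buf, 0, 3);
        bool isBom = read == 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF;
        if (!isBom)
        {
            input.Position = pos; // not a BOM: rewind so the content keeps these bytes
            return;
        }
    }
}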

The byte sequence 0xEF 0xBB 0xBF is the UTF-8 encoding of U+FEFF, which is the Unicode BOM (byte order mark). It is unnecessary in UTF-8, but crucial in UTF-16 or UTF-32.
You've got the same sequence twice.
The only good thing to do with them is ignore and/or delete them.
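If the data has already been decoded into a string, the stray BOMs show up as U+FEFF characters at the front and can simply be trimmed. A small sketch, not from the original answer:
// Remove any decoded BOM characters from the start of the string.
xml = xml.TrimStart('\uFEFF');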

Related

Why is XmlWriter not honoring the encoding I set?

This method is writing out an XML file (work specific). I have everything writing out exactly as I want it, except that I set it to write the file with UTF-8 (no BOM) encoding.
The XML declaration says UTF-8, but when I open the file in Notepad++, it shows as encoded in ANSI.
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.Encoding = new UTF8Encoding(false);
settings.NewLineOnAttributes = true;
using (var xmlWriter = XmlWriter.Create(@"c:\temp\myUIPB.xml", settings))
{
    xmlWriter.WriteStartDocument();
    xmlWriter.WriteStartElement("UIScript");
    // Write Event Nodes
    foreach (var eventNode in listBoxOutput.Items)
    {
        lbEvent myNode = (lbEvent)eventNode;
        XmlNode xn = myNode.workflowEvent;
        xn.WriteTo(xmlWriter);
    }
    xmlWriter.WriteFullEndElement();
    xmlWriter.WriteEndDocument();
    xmlWriter.Flush();
    xmlWriter.Close();
}
I would expect that if I set it to output in UTF-8, that the file that writes out is indeed encoded in UTF-8 instead of ANSI encoded.
Thoughts? Help?
A file using UTF-8 without a BOM and a file using ASCII encoding look identical if they contain only Latin characters and numbers.
A generic text editing program (like Notepad or Notepad++) will not be able to guess the encoding the way you'd like unless you give it hints (usually via an "Open with encoding" option in the file-open dialog).
Compliant XML parsers use the encoding attribute of the XML declaration (<?xml version="1.0" encoding="UTF-8"?>) to detect the correct encoding for files without a BOM. In your case you are likely getting a correct XML declaration, and a compliant XML parser will open the file correctly.
If you need all programs to detect UTF-8 correctly, include the BOM by passing true to the encoding's constructor.
Note that without a BOM, even a file containing characters with codes above 128 may have its encoding detected incorrectly.
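If you do want editors such as Notepad++ to identify the file as UTF-8, the change is just the constructor argument; a hedged sketch based on the question's settings:
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.NewLineOnAttributes = true;
settings.Encoding = new UTF8Encoding(true); // true = emit the byte order mark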

Generate zip file with xml content on the fly [duplicate]

I want to write a String to a Stream (a MemoryStream in this case) and read the bytes one by one.
stringAsStream = new MemoryStream();
UnicodeEncoding uniEncoding = new UnicodeEncoding();
String message = "Message";
stringAsStream.Write(uniEncoding.GetBytes(message), 0, message.Length);
Console.WriteLine("This:\t\t" + (char)uniEncoding.GetBytes(message)[0]);
Console.WriteLine("Differs from:\t" + (char)stringAsStream.ReadByte());
The (undesired) result I get is:
This: M
Differs from: ?
It looks like it's not being read correctly, as the first char of "Message" is 'M', which works when getting the bytes from the UnicodeEncoding instance but not when reading them back from the stream.
What am I doing wrong?
The bigger picture: I have an algorithm which will work on the bytes of a Stream, I'd like to be as general as possible and work with any Stream. I'd like to convert an ASCII-String into a MemoryStream, or maybe use another method to be able to work on the String as a Stream. The algorithm in question will work on the bytes of the Stream.
After you write to the MemoryStream and before you read it back, you need to Seek back to the beginning of the MemoryStream so you're not reading from the end.
UPDATE
After seeing your update, I think there's a more reliable way to build the stream:
UnicodeEncoding uniEncoding = new UnicodeEncoding();
String message = "Message";

// You might not want to use the outer using statement that I have;
// I wasn't sure how long you would need the MemoryStream object.
using (MemoryStream ms = new MemoryStream())
{
    var sw = new StreamWriter(ms, uniEncoding);
    try
    {
        sw.Write(message);
        sw.Flush(); // otherwise you are risking an empty stream
        ms.Seek(0, SeekOrigin.Begin);

        // Test and work with the stream here.
        // If you need to start back at the beginning, be sure to Seek again.
    }
    finally
    {
        sw.Dispose();
    }
}
As you can see, this code uses a StreamWriter to write the entire string (with proper encoding) out to the MemoryStream. This takes the hassle out of ensuring the entire byte array for the string is written.
Update: I ran into the empty-stream issue several times. It's enough to call Flush right after you've finished writing.
Try this "one-liner" from Delta's Blog, String To MemoryStream (C#).
MemoryStream stringInMemoryStream =
new MemoryStream(ASCIIEncoding.Default.GetBytes("Your string here"));
The string will be loaded into the MemoryStream, and you can read from it. See Encoding.GetBytes(...), which has also been implemented for a few other encodings.
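One caveat (my note, not the blog's): ASCIIEncoding.Default actually resolves to Encoding.Default, the system ANSI code page, rather than ASCII. If the string may contain non-ASCII characters, a UTF-8 variant of the same one-liner is probably safer:
// Same idea, with an explicit UTF-8 encoding instead of the ANSI default.
MemoryStream stringInMemoryStream =
    new MemoryStream(Encoding.UTF8.GetBytes("Your string here"));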
You're using message.Length, which returns the number of characters in the string, but you should be using the number of bytes to write. You should use something like:
byte[] messageBytes = uniEncoding.GetBytes(message);
stringAsStream.Write(messageBytes, 0, messageBytes.Length);
You're then reading a single byte and expecting to get a character from it just by casting to char. UnicodeEncoding will use two bytes per character.
As Justin says you're also not seeking back to the beginning of the stream.
Basically I'm afraid pretty much everything is wrong here. Please give us the bigger picture and we can help you work out what you should really be doing. Using a StreamWriter to write and then a StreamReader to read is quite possibly what you want, but we can't really tell from just the brief bit of code you've shown.
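For what it's worth, a hedged sketch of that StreamWriter/StreamReader round trip might look like this (variable names are mine, not the poster's):
using (var ms = new MemoryStream())
{
    // The writer is deliberately not disposed here, so it doesn't close the MemoryStream.
    var writer = new StreamWriter(ms, new UnicodeEncoding());
    writer.Write("Message");
    writer.Flush();                        // push everything into the MemoryStream

    ms.Position = 0;                       // rewind before reading
    var reader = new StreamReader(ms, new UnicodeEncoding());
    Console.WriteLine(reader.ReadToEnd()); // prints "Message"
}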
I think it would be a lot more productive to use a TextWriter, in this case a StreamWriter, to write to the MemoryStream. After that, as others have said, you need to "rewind" the MemoryStream using something like stringAsStream.Position = 0L;.
stringAsStream = new MemoryStream();
// Create a stream writer with UTF-16 (Unicode) encoding to write to the memory stream.
// leaveOpen: true keeps the MemoryStream usable after the writer is disposed.
using (StreamWriter sWriter = new StreamWriter(stringAsStream, UnicodeEncoding.Unicode, 1024, leaveOpen: true))
{
    sWriter.Write("Lorem ipsum.");
}
stringAsStream.Position = 0L; // rewind
Note that:
StreamWriter defaults to using an instance of UTF8Encoding unless specified otherwise. This instance of UTF8Encoding is constructed without a byte order mark (BOM)
Also, you usually don't have to create a new UnicodeEncoding(), since Encoding already exposes static instances in convenient UTF-8, UTF-16, and UTF-32 flavors (Encoding.UTF8, Encoding.Unicode, Encoding.UTF32).
And then, finally (as others have said) you're trying to convert the bytes directly to chars, which they are not. If I had a memory stream and knew it was a string, I'd use a TextReader to get the string back from the bytes. It seems "dangerous" to me to mess around with the raw bytes.
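Continuing the sketch above, reading the string back with a StreamReader (a TextReader) might look like this; the leaveOpen flag on the writer is what keeps stringAsStream usable here:
// Read the string back out with a TextReader instead of casting raw bytes.
using (var sReader = new StreamReader(stringAsStream, UnicodeEncoding.Unicode))
{
    Console.WriteLine(sReader.ReadToEnd()); // "Lorem ipsum."
}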
You need to reset the stream to the beginning:
stringAsStream.Seek(0, SeekOrigin.Begin);
Console.WriteLine("Differs from:\t" + (char)stringAsStream.ReadByte());
This can also be done by setting the Position property to 0:
stringAsStream.Position = 0;

Problems with strings in the CSV file

I have an application that reads information from a CSV file and writes it to the database. But some characters (for example: º, ç) come out garbled when they are saved to the database. Does anyone know how to fix this problem?
Thank you.
I'm using these lines of code to read the information from the CSV file:
string directory = @"C:\test.csv";
StreamReader stream = new StreamReader(directory);
string line = "";
line = stream.ReadLine();
string[] column = line.Split(';');
StreamReader defaults to UTF8 encoding and your file is in a different encoding. Try specifying it like this...
var encoding = Encoding.Unicode; // UTF-16
StreamReader stream = new StreamReader(directory, encoding);
Note that you need to know what encoding the file is in to properly read it... I'm just guessing that it might be UTF16 but obviously I can't know what it is.
You should specify the right encoding when reading the file. The default is UTF-8. Your file is probably encoded with a different encoding.
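Since º and ç are typical Windows-1252 (Latin) characters, the file is quite possibly in that ANSI code page; a hedged guess at the fix:
// Guess: the CSV was saved in the Windows-1252 ANSI code page.
using (StreamReader stream = new StreamReader(@"C:\test.csv", Encoding.GetEncoding(1252)))
{
    string line = stream.ReadLine();
    string[] column = line.Split(';');
}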
This is most likely related to the Encoding that is used when reading the file. By default, UTF8 is assumed as the Encoding. In order to read the file correctly, you need to specify the right encoding, e.g.:
string directory = @"C:\test.csv";
using (StreamReader stream = new StreamReader(directory, Encoding.ASCII))
{
    string line = "";
    line = stream.ReadLine();
    string[] column = line.Split(';');
}
You can try the following encodings (see this link for a complete list):
Encoding.Default for ANSI encoding based on the current Windows code page.
Encoding.ASCII for ASCII encoding.
Encoding.UTF* for different Unicode encodings.
Please note that I enclosed the StreamReader in a using block so that it is disposed when it is not needed anymore.

C# File Encoding Type changed?

I am creating a file with ASCII encoding, but when I test to get the Encoding type of that file, it is returning UTF8Encoding.
Can anyone explain the reason or point out my mistake?
CODE:
Creating File:
FileStream _textStream = File.Open("CreateAsciiFile.txt", FileMode.Create, FileAccess.Write);
StreamWriter _streamWriter = new StreamWriter(_textStream, System.Text.Encoding.ASCII);
Byte[] byteContent = BtyeTowrite(); // This returns the array of byte
foreach (var myByte in byteContent)
    _streamWriter.Write(System.Convert.ToChar(myByte));
Reading a file:
StreamReader sr = new StreamReader(@"C:\CreateAsciiFile.txt", true);
string LineText= sr.ReadLine();
System.Text.Encoding enc = sr.CurrentEncoding;
Here enc gives UTF8Encoding... But I am expecting ASCII ???
You need to read from the reader before querying the encoding, so try reading something before calling sr.CurrentEncoding. The StreamReader looks at the first bytes to try to guess the encoding, and because ASCII has no BOM it might not be recognizable as such, so you might get wrong results. For example, there is no difference (at the binary level) between an ASCII-encoded file and an ISO-8859-1 encoded file.
The answer is probably here:
Every valid ASCII character is also a valid UTF‑8 encoded Unicode character with the same binary value.
In other words, your ASCII file is both valid UTF-8 and ASCII. It is detected as UTF-8.
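A small sketch tying both answers together, using the same file as the question: the encoding has to be queried after a read, and a BOM-less ASCII file will still report UTF-8 because the two are byte-for-byte identical here.
using (StreamReader sr = new StreamReader(@"C:\CreateAsciiFile.txt", true))
{
    string lineText = sr.ReadLine();     // read first...
    Encoding enc = sr.CurrentEncoding;   // ...then ask for the encoding
    Console.WriteLine(enc.EncodingName); // still reports UTF-8 for pure ASCII content
}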

C# - Detecting encoding in a file, write change to file using the found encoding

I wrote a small program for iterating through a lot of files and applying some changes where a certain string match is found. The problem I have is that different files have different encodings. So what I would like to do is check the encoding, then overwrite the file in its original encoding.
What would be the prettiest way of doing that in C# .net 2.0?
My code looks very simple as of now;
String f1 = File.ReadAllText(fileList[i]).ToLower();
if (f1.Contains(oPath))
{
    f1 = f1.Replace(oPath, nPath);
    File.WriteAllText(fileList[i], f1, Encoding.Unicode);
}
I took a look at Auto encoding detect in C# which made me realize how I could detect encoding, but I am not sure how I could use that information to write in the same encoding.
Would greatly appreciate any help here.
Unfortunately, encoding is one of those subjects where there is not always a definitive answer. In many cases it's much closer to guessing the encoding than detecting it. Raymond Chen wrote an excellent blog post on this subject that is worth reading:
http://blogs.msdn.com/b/oldnewthing/archive/2007/04/17/2158334.aspx
The gist of the article is:
If the BOM (byte order mark) exists, then you're golden.
Else it's guesswork and heuristics.
However, I still think the best approach is the one Darin mentioned in the question you linked: let StreamReader guess for you rather than re-inventing the wheel. It only requires a very slight modification to your sample.
String f1;
Encoding encoding;
using (var reader = new StreamReader(fileList[i]))
{
    f1 = reader.ReadToEnd().ToLower();
    encoding = reader.CurrentEncoding;
}
if (f1.Contains(oPath))
{
    f1 = f1.Replace(oPath, nPath);
    File.WriteAllText(fileList[i], f1, encoding);
}
By default, .NET uses UTF-8. It is hard to detect the character encoding because most of the time .NET will just read it as UTF-8; I always have problems with ANSI.
My trick is to read the file as a stream, force it to be decoded as UTF-8, and check for the usual characters that should appear in the text. If they are found, it's UTF-8, otherwise ANSI... and I tell the user they can use only two encodings, either ANSI or UTF-8. Auto-detection doesn't work that well for my language :p
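A hedged sketch of that kind of heuristic. This is one common variant (try a strict UTF-8 decode and fall back to ANSI if the bytes are not valid UTF-8); the helper name is mine:
static Encoding GuessUtf8OrAnsi(string path)
{
    byte[] bytes = File.ReadAllBytes(path);
    try
    {
        // Second constructor argument = throw on invalid byte sequences.
        new UTF8Encoding(false, true).GetString(bytes);
        return Encoding.UTF8;    // decoded cleanly, treat as UTF-8
    }
    catch (DecoderFallbackException)
    {
        return Encoding.Default; // fall back to the system ANSI code page
    }
}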
I am afraid you will have to know the encoding. For UTF-based encodings, though, you can use StreamReader's built-in functionality.
Taken from here.
With regard to encodings - you will need to have identified the encoding in order to use the StreamReader. However, the StreamReader itself can help if you create it with one of the constructor overloads that allows you to supply the flag detectEncodingFromByteOrderMarks as true (or you can use Encoding.GetPreamble and look at the byte preamble yourself). Both these methods will only help auto-detect UTF-based encodings though - so any ANSI encodings with a specified codepage will probably not be parsed correctly.
Probably a bit late, but I encountered the same problem myself. Using the previous answers, I found a solution that works for me: it reads in the text using StreamReader's default encoding, extracts the encoding that was detected for that file, and uses a StreamWriter to write it back with the changes using the found encoding. It also removes and re-adds the ReadOnly flag.
string file = "File to open";
string text;
Encoding encoding;
string oldValue = "string to be replaced";
string replacementValue = "New string";

var attributes = File.GetAttributes(file);
File.SetAttributes(file, attributes & ~FileAttributes.ReadOnly);

using (StreamReader reader = new StreamReader(file, Encoding.Default))
{
    text = reader.ReadToEnd();
    encoding = reader.CurrentEncoding;
    reader.Close();
}

bool changedValue = false;
if (text.Contains(oldValue))
{
    text = text.Replace(oldValue, replacementValue);
    changedValue = true;
}

if (changedValue)
{
    using (StreamWriter write = new StreamWriter(file, false, encoding))
    {
        write.Write(text.ToString());
        write.Close();
    }
    File.SetAttributes(file, attributes | FileAttributes.ReadOnly);
}
The solution for all Germans => ÄÖÜäöüß
This function opens the file and determines the encoding by the BOM.
If the BOM is missing, the file will be interpreted as ANSI, but if it contains UTF-8 encoded German umlauts, it will be detected as UTF-8.
https://stackoverflow.com/a/69312696/9134997
