Why is XmlWriter not honoring the encoding I set? - c#

This method is writing out an XML file (work specific). I have everything writing out exactly as I want, except for the encoding: I set it to write the file with UTF-8 (no BOM) encoding.
The XML declaration says UTF-8, but when I open the file in Notepad++, it shows as encoded in ANSI.
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.Encoding = new UTF8Encoding(false);
settings.NewLineOnAttributes = true;
using (var xmlWriter = XmlWriter.Create(@"c:\temp\myUIPB.xml", settings))
{
    xmlWriter.WriteStartDocument();
    xmlWriter.WriteStartElement("UIScript");
    // Write Event Nodes
    foreach (var eventNode in listBoxOutput.Items)
    {
        lbEvent myNode = (lbEvent)eventNode;
        XmlNode xn = myNode.workflowEvent;
        xn.WriteTo(xmlWriter);
    }
    xmlWriter.WriteFullEndElement();
    xmlWriter.WriteEndDocument();
    xmlWriter.Flush();
    xmlWriter.Close();
}
I would expect that if I set it to output UTF-8, the file that is written out would indeed be encoded in UTF-8 rather than ANSI.
Thoughts? Help?

A file saved as UTF-8 without a BOM and an ASCII/ANSI-encoded file look identical if they contain only Latin characters and numbers.
A generic text editor (like Notepad or Notepad++) will not be able to guess the encoding the way you'd like (unless you provide some hints, usually via an "Open with encoding" option when opening the file).
Compliant XML parsers use the "encoding" part of the XML declaration (<?xml version="1.0" encoding="UTF-8"?>) to detect the correct encoding for files without a BOM. In your case you are likely getting a correct XML declaration, and a compliant XML parser will open the file correctly.
If you need all programs to detect UTF-8 correctly, write a BOM by passing true to the encoding's constructor.
Note that a file without a BOM, even one containing characters with codes above 128, may have its encoding detected incorrectly.
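For illustration, a minimal sketch of the same writer setup with a BOM (path and element name are taken from the question; the event-node loop is elided):
// Passing true to UTF8Encoding makes XmlWriter emit the 3-byte UTF-8 BOM,
// which lets editors such as Notepad++ identify the file as UTF-8.
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.Encoding = new UTF8Encoding(true); // true = write BOM
settings.NewLineOnAttributes = true;
using (var xmlWriter = XmlWriter.Create(@"c:\temp\myUIPB.xml", settings))
{
    xmlWriter.WriteStartDocument();
    xmlWriter.WriteStartElement("UIScript");
    // ... write event nodes as in the question ...
    xmlWriter.WriteFullEndElement();
    xmlWriter.WriteEndDocument();
}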

Related

UTF-8 CSV file created with C# shows Â characters in Excel

When a CSV file is generated using C# and opened in Microsoft Excel, it displays Â characters before special symbols, e.g. £
In Notepad++ the hex value for Â is: C2
So before writing the £ symbol to file, I have tried the following...
var test = "£200.00";
var replaced = test.Replace("\xC2", " ");
StreamWriter outputFile = File.CreateText("testoutput.csv"); // default UTF-8
outputFile.WriteLine(replaced);
outputFile.Close();
When opening the CSV file in Excel, I still see the "Â" character before the £ symbol (hex equivalent \xC2 \xA3); it made no difference.
Do I need to use a different encoding? or am I missing something?
Thank you @Evk and @Mortalier, your suggestions led me in the right direction...
I needed to update my StreamWriter so it would explicitly include the UTF-8 BOM at the beginning: http://thinkinginsoftware.blogspot.co.uk/2017/12/correctly-generate-csv-that-excel-can.html
So my code has changed from:
StreamWriter outputFile = File.CreateText("testoutput.csv"); // default UTF-8
To:
StreamWriter outputFile = new StreamWriter("testoutput.csv", false, new UTF8Encoding(true));
Or: Another solution I found here was to use a different encoding if you're only expecting latin characters...
http://theoldsewingfactory.com/2010/12/05/saving-csv-files-in-utf8-creates-a-characters-in-excel/
StreamWriter outputFile = new StreamWriter("testoutput.csv", false, Encoding.GetEncoding("Windows-1252"));
My system will most likely use latin & non-latin characters so I'm using the UTF-8 BOM solution.
Final code
var test = "£200.00";
StreamWriter outputFile = new StreamWriter("testoutput.csv", false, new UTF8Encoding(true));
outputFile.WriteLine(test);
outputFile.Close();
I tried your code and Excel does show Â£ in the cell.
Then I tried to open the CSV with LibreOffice Calc. At first there too it showed Â£, but
on import the program will ask you about encoding.
Once I chose UTF-8 the £ symbol was displayed correctly.
My guess is that in fact there is an issue with your encoding.
This might help with Excel https://superuser.com/questions/280603/how-to-set-character-encoding-when-opening-excel
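To see where the stray Â comes from, here is a small sketch (not from the original posts) that encodes £ as UTF-8 and then misreads those bytes as Windows-1252:
using System;
using System.Text;

// "£" is U+00A3. In UTF-8 it becomes the two bytes C2 A3.
byte[] utf8Bytes = Encoding.UTF8.GetBytes("£");   // { 0xC2, 0xA3 }

// If a program (e.g. Excel with no BOM to hint otherwise) decodes those bytes as
// Windows-1252, 0xC2 maps to 'Â' and 0xA3 maps to '£', so the cell shows "Â£".
// (On .NET Core / .NET 5+, code page encodings may require registering
// CodePagesEncodingProvider first.)
string misread = Encoding.GetEncoding(1252).GetString(utf8Bytes);
Console.WriteLine(misread);   // prints Â£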

C# .csv-file in WinForm with Ä, Ö, Ü [duplicate]

I'm using the code below to read a text file that contains foreign characters. The file is ANSI-encoded and looks fine in Notepad. The code below doesn't work: when the file values are read and shown in the DataGrid, the characters appear as squares. Could there be another problem elsewhere?
StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.ANSI);
using (reader = File.OpenText(inputFilePath))
Thanks
Update 1: I have tried all encodings found under System.Text.Encoding. and all fail to show the file correctly.
Update 2: I've changed the file encoding (resaved the file) to unicode and used System.Text.Encoding.Unicode and it worked just fine. So why did notepad read it correctly? And why didn't System.Text.Encoding.Unicode read the ANSI file?
You may also try the Default encoding, which uses the current system's ANSI codepage.
StreamReader reader = new StreamReader(inputFilePath, Encoding.Default, true);
When you try using the Notepad "Save As" menu with the original file, look at the encoding combo box. It will tell you which encoding notepad guessed is used by the file.
Also, if it is an ANSI file, the detectEncodingFromByteOrderMarks parameter will probably not help much.
I had the same problem and my solution was simple: instead of
Encoding.ASCII
use
Encoding.GetEncoding("iso-8859-1")
The answer was found here.
Edit: more solutions. This may be the more accurate one:
Encoding.GetEncoding(1252);
Also, in some cases this will work for you too if your OS default encoding matches file encoding:
Encoding.Default;
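A small sketch comparing these options (the file name is just an example):
using System.IO;
using System.Text;

string path = @"C:\data\umlauts.csv";   // hypothetical file

// ISO-8859-1 (Latin-1): bytes 0x80-0xFF map straight to U+0080-U+00FF.
string latin1 = File.ReadAllText(path, Encoding.GetEncoding("iso-8859-1"));

// Windows-1252: like Latin-1, but with printable characters in the 0x80-0x9F range.
string cp1252 = File.ReadAllText(path, Encoding.GetEncoding(1252));

// The system default: on .NET Framework this is the machine's ANSI code page
// (on .NET Core / .NET 5+ it is UTF-8), so it is only correct if it matches the file.
string systemDefault = File.ReadAllText(path, Encoding.Default);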
Yes, it could be an issue with the actual encoding of the file, probably Unicode. Try UTF-8, as that is the most common form of Unicode encoding. Otherwise, if the file is ASCII, then the standard ASCII encoding should work.
Using Encoding.Unicode won't accurately decode an ANSI file in the same way that a JPEG decoder won't understand a GIF file.
I'm surprised that Encoding.Default didn't work for the ANSI file if it really was ANSI - if you ever find out exactly which code page Notepad was using, you could use Encoding.GetEncoding(int).
In general, where possible I'd recommend using UTF-8.
Try a different encoding such as Encoding.UTF8. You can also try letting StreamReader find the encoding itself:
StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.UTF8, true);
Edit: Just saw your update. Try letting StreamReader do the guessing.
For Swedish Å Ä Ö the only solution from the ones above that worked was:
Encoding.GetEncoding("iso-8859-1")
Hopefully this will save someone time.
File.OpenText() always uses a UTF-8 StreamReader implicitly. Create your own StreamReader instance instead and specify the desired encoding, like:
using (StreamReader reader = new StreamReader(@"C:\test.txt", Encoding.Default))
{
    // ...
}
I solved my problem of reading Portuguese characters by changing the encoding of the source file in Notepad++.
C#
var url = System.Web.HttpContext.Current.Server.MapPath(@"~/Content/data.json");
string s = string.Empty;
using (System.IO.StreamReader sr = new System.IO.StreamReader(url, System.Text.Encoding.UTF8, true))
{
    s = sr.ReadToEnd();
}
I'm also reading an exported file which contains French and German text. I used Encoding.GetEncoding("iso-8859-1") with the detect-from-BOM flag set to true, and it worked without any problems.
For Arabic, I used Encoding.GetEncoding(1256). It is working well.
I had a similar problem with ProcessStartInfo and its StandardOutputEncoding property. For German console output I set it to code page 850; this way I could read output such as ausführen instead of ausf�hren.
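A minimal sketch of that approach (the command is just an example):
using System.Diagnostics;
using System.Text;

var psi = new ProcessStartInfo("cmd.exe", "/c dir")   // example command
{
    RedirectStandardOutput = true,
    UseShellExecute = false,
    // Code page 850 is the OEM code page used by many Western European consoles.
    StandardOutputEncoding = Encoding.GetEncoding(850)
};

using (var process = Process.Start(psi))
{
    string output = process.StandardOutput.ReadToEnd();
    process.WaitForExit();
}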

StreamReader is unable to correctly read extended character set (UTF8)

I am having an issue where I am unable to read a file that contains foreign characters. The file, I have been told, is encoded in UTF-8 format.
Here is the core of my code:
using (FileStream fileStream = fileInfo.OpenRead())
{
    using (StreamReader reader = new StreamReader(fileStream, System.Text.Encoding.UTF8))
    {
        string line;
        while (!string.IsNullOrEmpty(line = reader.ReadLine()))
        {
            hashSet.Add(line);
        }
    }
}
The file contains the word "achôcre" but when examining it during debugging it is adding it as "ach�cre".
(This is a profanity file so I apologize if you speak French. I for one, have no idea what that means)
The evidence clearly suggests that the file is not in UTF-8 format. Try System.Text.Encoding.Default and see if you get the correct text then — if you do, you know the file is in Windows-1252 (assuming that is your system default codepage). In that case, I recommend that you open the file in Notepad, then re-“Save As” it as UTF-8, and then you can use Encoding.UTF8 normally.
Another way to check what encoding the file is actually in is to open it in your browser. If the accents display correctly, then the browser has detected the correct character set — so look at the “View / Character set” menu to find out which one is selected. If the accents are not displaying correctly, then change the character set via that menu until they do.
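If re-saving by hand is not an option, here is a small sketch (file path assumed) of doing the same conversion in code:
using System.IO;
using System.Text;

// Read the file with Windows-1252 (the likely ANSI code page), then write it back
// out as UTF-8 so that Encoding.UTF8 reads it correctly afterwards.
string path = @"C:\data\profanity.txt";   // hypothetical path
string text = File.ReadAllText(path, Encoding.GetEncoding(1252));
File.WriteAllText(path, text, Encoding.UTF8);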

C# - Detecting encoding in a file, write change to file using the found encoding

I wrote a small program that iterates through a lot of files and applies some changes where a certain string match is found. The problem I have is that different files have different encodings, so what I would like to do is check the encoding, then overwrite the file in its original encoding.
What would be the prettiest way of doing that in C# .net 2.0?
My code looks very simple as of now;
String f1 = File.ReadAllText(fileList[i]).ToLower();
if (f1.Contains(oPath))
{
    f1 = f1.Replace(oPath, nPath);
    File.WriteAllText(fileList[i], f1, Encoding.Unicode);
}
I took a look at Auto encoding detect in C# which made me realize how I could detect encoding, but I am not sure how I could use that information to write in the same encoding.
Would greatly appreciate any help here.
Unfortunately encoding is one of those subjects where there is not always a definitive answer. In many cases it's much closer to guessing the encoding than detecting it. Raymond Chen did an excellent blog post on this subject that is worth reading:
http://blogs.msdn.com/b/oldnewthing/archive/2007/04/17/2158334.aspx
The gist of the article is: if the BOM (byte order mark) exists, you're golden; otherwise it's guesswork and heuristics.
However, I still think the best approach is the one Darin mentioned in the question you linked: let StreamReader guess for you rather than re-inventing the wheel. It only requires a very slight modification to your sample.
String f1;
Encoding encoding;
using (var reader = new StreamReader(fileList[i]))
{
    f1 = reader.ReadToEnd().ToLower();
    encoding = reader.CurrentEncoding;
}
if (f1.Contains(oPath))
{
    f1 = f1.Replace(oPath, nPath);
    File.WriteAllText(fileList[i], f1, encoding);
}
By default, .NET uses UTF-8. It is hard to detect the character encoding because most of the time .NET will read the file as UTF-8. I always have problems with ANSI.
My trick is to read the file as a stream, force it to be decoded as UTF-8, and look for the usual characters that should appear in the text. If they are found, it's UTF-8; otherwise it's ANSI. I tell the user they can use just two encodings, either ANSI or UTF-8; auto-detection doesn't quite work for my language :p
I am afraid you will have to know the encoding. For UTF-based encodings, though, you can use StreamReader's built-in functionality.
Taken from here:
With regard to encodings - you will need to have identified the encoding in order to use the StreamReader. However, the StreamReader itself can help if you create it with one of the constructor overloads that allows you to supply the flag detectEncodingFromByteOrderMarks as true (or you can use Encoding.GetPreamble and look at the byte preamble yourself). Both these methods will only help auto-detect UTF-based encodings though - so any ANSI encodings with a specified codepage will probably not be parsed correctly.
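A rough sketch of both options mentioned above (file path assumed; only UTF encodings with a BOM are detected this way):
using System.IO;
using System.Linq;
using System.Text;

string path = @"C:\data\input.txt";   // hypothetical path

// Option 1: let StreamReader detect a UTF BOM itself.
Encoding detected;
using (var reader = new StreamReader(path, Encoding.Default, detectEncodingFromByteOrderMarks: true))
{
    reader.Peek();                      // force the reader to inspect the first bytes
    detected = reader.CurrentEncoding;  // stays Encoding.Default if no BOM was found
}

// Option 2: compare the first bytes of the file with a known preamble yourself.
byte[] bomUtf8 = Encoding.UTF8.GetPreamble();   // EF BB BF
byte[] start = new byte[bomUtf8.Length];
using (var fs = File.OpenRead(path))
{
    fs.Read(start, 0, start.Length);
}
bool hasUtf8Bom = start.SequenceEqual(bomUtf8);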
Probably a bit late, but I encountered the same problem myself. Using the previous answers I found a solution that works for me. It reads in the text using StreamReader's default encoding, extracts the encoding used by that file, and uses StreamWriter to write it back with the changes using the found encoding. It also removes and re-adds the ReadOnly flag.
string file = "File to open";
string text;
Encoding encoding;
string oldValue = "string to be replaced";
string replacementValue = "New string";
var attributes = File.GetAttributes(file);
File.SetAttributes(file, attributes & ~FileAttributes.ReadOnly);
using (StreamReader reader = new StreamReader(file, Encoding.Default))
{
    text = reader.ReadToEnd();
    encoding = reader.CurrentEncoding;
    reader.Close();
}
bool changedValue = false;
if (text.Contains(oldValue))
{
    text = text.Replace(oldValue, replacementValue);
    changedValue = true;
}
if (changedValue)
{
    using (StreamWriter write = new StreamWriter(file, false, encoding))
    {
        write.Write(text.ToString());
        write.Close();
    }
    File.SetAttributes(file, attributes | FileAttributes.ReadOnly);
}
The solution for all Germans => ÄÖÜäöüß
This function opens the file and determines the encoding by the BOM.
If the BOM is missing, the file will be interpreted as ANSI, but if there are UTF-8 encoded German umlauts in it, it will be detected as UTF-8.
https://stackoverflow.com/a/69312696/9134997
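The linked answer isn't reproduced here, but as a rough illustration of the idea (not the linked code): check for a BOM first, and if there is none, try a strict UTF-8 decode and fall back to ANSI when it fails:
using System.IO;
using System.Text;

static Encoding GuessEncoding(string path)
{
    byte[] bytes = File.ReadAllBytes(path);

    // 1. A UTF-8 BOM (EF BB BF) settles it immediately.
    if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
        return new UTF8Encoding(true);

    // 2. No BOM: try a strict UTF-8 decode. Umlauts saved as UTF-8 form valid
    //    multi-byte sequences; lone ANSI bytes >= 0x80 usually make it throw.
    try
    {
        new UTF8Encoding(false, throwOnInvalidBytes: true).GetString(bytes);
        return new UTF8Encoding(false);
    }
    catch (DecoderFallbackException)
    {
        return Encoding.Default;   // treat as ANSI
    }
}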

UTF8 Beginning of File characters are breaking serializer & readers

Okay, I'm trying to work with UTF8 text files. I'm constantly fighting the BOM chars that the writer drops in for UTF8, which blows up pretty much anything I need to use to read the file including serializers and other text readers.
I'm getting a leading six bytes of data:
0xEF
0xBB
0xBF
0xEF
0xBB
0xBF
(Now that I'm looking at it, I realize there are two characters there. Is that the UTF-8 BOM marker? Am I double-encoding it?)
Notice the serializer encodes to UTF8, then the memory stream gets a string as UTF8, then I write the string to the file with UTF8... seems like a lot of redundancy. Thoughts?
//I'm storing this xml result to a database field. (this one includes the BOF chars)
using (MemoryStream ms = new MemoryStream())
{
    Utility.SerializeXml(ms, root);
    xml = Encoding.UTF8.GetString(ms.ToArray());
}
//later on, I would take that xml and then write it out to a file like this:
File.WriteAllText(path, xml, Encoding.UTF8);
public static void SerializeXml(Stream output, object data)
{
    XmlSerializer xs = new XmlSerializer(data.GetType());
    XmlWriterSettings settings = new XmlWriterSettings();
    settings.Indent = true;
    settings.IndentChars = "\t";
    settings.Encoding = Encoding.UTF8;
    XmlWriter writer = XmlTextWriter.Create(output, settings);
    xs.Serialize(writer, data);
    writer.Flush();
    writer.Close();
}
Yeah, that's two BOMs. You're encoding to UTF-8 twice and each time adds a pseudo-BOM, due to the extremely unfortunate fact that:
Encoding.UTF8
means “UTF-8 with a pointless, meaningless U+FEFF stuck to the front to screw up your applications”. Try instead using
new UTF8Encoding(false)
which should give you a less sucky version.
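Applied to the code in the question, the fix would look roughly like this (only the two lines that pick an encoding change):
// In SerializeXml: use a BOM-less UTF-8 encoding for the XmlWriter.
settings.Encoding = new UTF8Encoding(false);

// When writing the string to disk: again pass a BOM-less encoding,
// otherwise File.WriteAllText prepends its own EF BB BF preamble.
File.WriteAllText(path, xml, new UTF8Encoding(false));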
Yes that is a BOM.
Yes, some older JDKs had a bug that blew up on UTF-8 BOM data, and two of them will confuse even a modern version of Java.
The solution I used was to stick a pushback stream on the front and filter it out.
Or use a more modern version of Java.
The byte sequence 0xEF 0xBB 0xBF is the UTF-8 encoding of U+FEFF, which is the Unicode BOM (byte order mark). It is unnecessary in UTF-8, but crucial in UTF-16 or UTF-32.
You've got the same sequence twice.
The only good thing to do with them is ignore and/or delete them.
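If a stray U+FEFF has already made it into a string (for example XML text read back from the database field mentioned in the question), a small sketch of stripping it before handing the text to a parser:
// U+FEFF is what the UTF-8 bytes EF BB BF decode to; remove any leading occurrences.
string xmlText = xml;                   // 'xml' as in the question
while (xmlText.StartsWith("\uFEFF"))
    xmlText = xmlText.Substring(1);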
