Encoding xml as ISO-8859-1 - c#

I sent an xml file which I created while serializing an object and received a response that it is incorrect and not well-formed:
<?xml version="1.0" encoding="utf-8"?>
Moreover, I am supposed to use ISO-8859-1.
I assume that I not only have to change <?xml version="1.0" encoding="ISO-8859-1"?>, but additionally I have to create the file during serialization from the code already with encoding ISO-8859-1. Correct?
I am doint it this way:
XmlSerializer ser = new XmlSerializer(obj.GetType());
var encoding = Encoding.GetEncoding("ISO-8859-1");
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings
{
Indent = true,
OmitXmlDeclaration = false,
Encoding = encoding
};
XmlDocument xd = null;
using (MemoryStream memStm = new MemoryStream())
{
using (var xmlWriter = XmlWriter.Create(memStm, xmlWriterSettings))
{
ser.Serialize(xmlWriter, input);
}
memStm.Position = 0;
XmlReaderSettings settings = new XmlReaderSettings();
using (var xtr = XmlReader.Create(memStm, settings))
{
xd = new XmlDocument();
xd.Load(xtr);
}
}
byte[] file = encoding.GetBytes(xml.OuterXml);
I used a framework to find out what encoding my created files have and when I create them with ISO-8859-1 as above my encoding checker gives me ASCII, is that correct?

I sent an xml file which I created while serializing an object and received a response that it is incorrect and not well-formed:
The  chars represents a BOM (byte-order-mark) for utf-8 files. That BOM can be a part of utf-8 encoded files. So your xml is valid if read properly.
More information about BOM: http://www.unicode.org/faq/utf_bom.html#bom1
I assume that I not only have to change <?xml version="1.0" encoding="ISO-8859-1"?>, but additionally I have to create the file during serialization from the code already with encoding ISO-8859-1. Correct?
Correct.
I used a framework to find out what encoding my created files have and when I create them with ISO-8859-1 as above my encoding checker gives me ASCII, is that correct?
So, the encoding of a text file cannot be determined exactly, but only "guessed" by means of an analysis. Various encodings have the same code pages for the ASCII characters, therefore ASCII is suitable as result.

Related

Hidden character in saved XML [duplicate]

I'm generating an utf-8 XML file using XDocument.
XDocument xml_document = new XDocument(
new XDeclaration("1.0", "utf-8", null),
new XElement(ROOT_NAME,
new XAttribute("note", note)
)
);
...
xml_document.Save(#file_path);
The file is generated correctly and validated with an xsd file with success.
When I try to upload the XML file to an online service, the service says that my file is wrong at line 1; I have discovered that the problem is caused by the BOM on the first bytes of the file.
Do you know why the BOM is appended to the file and how can I save the file without it?
As stated in Byte order mark Wikipedia article:
While Unicode standard allows BOM in
UTF-8 it does not require or
recommend it. Byte order has no
meaning in UTF-8 so a BOM only
serves to identify a text stream or
file as UTF-8 or that it was converted
from another format that has a BOM
Is it an XDocument problem or should I contact the guys of the online service provider to ask for a parser upgrade?
Use an XmlTextWriter and pass that to the XDocument's Save() method, that way you can have more control over the type of encoding used:
var doc = new XDocument(
new XDeclaration("1.0", "utf-8", null),
new XElement("root", new XAttribute("note", "boogers"))
);
using (var writer = new XmlTextWriter(".\\boogers.xml", new UTF8Encoding(false)))
{
doc.Save(writer);
}
The UTF8Encoding class constructor has an overload that specifies whether or not to use the BOM (Byte Order Mark) with a boolean value, in your case false.
The result of this code was verified using Notepad++ to inspect the file's encoding.
First of all: the service provider MUST handle it, according to XML spec, which states that BOM may be present in case of UTF-8 representation.
You can force to save your XML without BOM like this:
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = new UTF8Encoding(false); // The false means, do not emit the BOM.
using (XmlWriter w = XmlWriter.Create("my.xml", settings))
{
doc.Save(w);
}
(Googled from here: http://social.msdn.microsoft.com/Forums/en/xmlandnetfx/thread/ccc08c65-01d7-43c6-adf3-1fc70fdb026a)
The most expedient way to get rid of the BOM character when using XDocument is to just save the document, then do a straight File read as a file, then write it back out. The File routines will strip the character out for you:
XDocument xTasks = new XDocument();
XElement xRoot = new XElement("tasklist",
new XAttribute("timestamp",lastUpdated),
new XElement("lasttask",lastTask)
);
...
xTasks.Add(xRoot);
xTasks.Save("tasks.xml");
// read it straight in, write it straight back out. Done.
string[] lines = File.ReadAllLines("tasks.xml");
File.WriteAllLines("tasks.xml",lines);
(it's hoky, but it works for the sake of expediency - at least you'll have a well-formed file to upload to your online provider) ;)
By UTF-8 Documents
String XMLDec = xDoc.Declaration.ToString();
StringBuilder sb = new StringBuilder(XMLDec);
sb.Append(xDoc.ToString());
Encoding encoding = new UTF8Encoding(false); // false = without BOM
File.WriteAllText(outPath, sb.ToString(), encoding);

c# - create and save an xml as UTF-8 (without BOM) [duplicate]

I'm generating an utf-8 XML file using XDocument.
XDocument xml_document = new XDocument(
new XDeclaration("1.0", "utf-8", null),
new XElement(ROOT_NAME,
new XAttribute("note", note)
)
);
...
xml_document.Save(#file_path);
The file is generated correctly and validated with an xsd file with success.
When I try to upload the XML file to an online service, the service says that my file is wrong at line 1; I have discovered that the problem is caused by the BOM on the first bytes of the file.
Do you know why the BOM is appended to the file and how can I save the file without it?
As stated in Byte order mark Wikipedia article:
While Unicode standard allows BOM in
UTF-8 it does not require or
recommend it. Byte order has no
meaning in UTF-8 so a BOM only
serves to identify a text stream or
file as UTF-8 or that it was converted
from another format that has a BOM
Is it an XDocument problem or should I contact the guys of the online service provider to ask for a parser upgrade?
Use an XmlTextWriter and pass that to the XDocument's Save() method, that way you can have more control over the type of encoding used:
var doc = new XDocument(
new XDeclaration("1.0", "utf-8", null),
new XElement("root", new XAttribute("note", "boogers"))
);
using (var writer = new XmlTextWriter(".\\boogers.xml", new UTF8Encoding(false)))
{
doc.Save(writer);
}
The UTF8Encoding class constructor has an overload that specifies whether or not to use the BOM (Byte Order Mark) with a boolean value, in your case false.
The result of this code was verified using Notepad++ to inspect the file's encoding.
First of all: the service provider MUST handle it, according to XML spec, which states that BOM may be present in case of UTF-8 representation.
You can force to save your XML without BOM like this:
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = new UTF8Encoding(false); // The false means, do not emit the BOM.
using (XmlWriter w = XmlWriter.Create("my.xml", settings))
{
doc.Save(w);
}
(Googled from here: http://social.msdn.microsoft.com/Forums/en/xmlandnetfx/thread/ccc08c65-01d7-43c6-adf3-1fc70fdb026a)
The most expedient way to get rid of the BOM character when using XDocument is to just save the document, then do a straight File read as a file, then write it back out. The File routines will strip the character out for you:
XDocument xTasks = new XDocument();
XElement xRoot = new XElement("tasklist",
new XAttribute("timestamp",lastUpdated),
new XElement("lasttask",lastTask)
);
...
xTasks.Add(xRoot);
xTasks.Save("tasks.xml");
// read it straight in, write it straight back out. Done.
string[] lines = File.ReadAllLines("tasks.xml");
File.WriteAllLines("tasks.xml",lines);
(it's hoky, but it works for the sake of expediency - at least you'll have a well-formed file to upload to your online provider) ;)
By UTF-8 Documents
String XMLDec = xDoc.Declaration.ToString();
StringBuilder sb = new StringBuilder(XMLDec);
sb.Append(xDoc.ToString());
Encoding encoding = new UTF8Encoding(false); // false = without BOM
File.WriteAllText(outPath, sb.ToString(), encoding);

Encode XDocumnet form win-1251 to utf-8

I try to convert XDocument from win-1 to utf-8. But in raw-view russian characters have bad view.
var encoding = new UTF8Encoding(false,false);
XmlTextWriter xmlTextWriter = new XmlTextWriter("F:\\File", Encoding.GetEncoding("windows-1251"));
document.Save(xmlTextWriter);
xmlTextWriter.Close();
xmlTextWriter = null;
string text = File.ReadAllText("F:\\File", Encoding.Default);
XDocument documentcode = XDocument.Parse(text);
xmlTextWriter = new XmlTextWriter(_Stream, encoding);
documentcode.Save(xmlTextWriter);
xmlTextWriter.Flush();
_Stream.Position = 0;
Headers.ContentType = new MediaTypeHeaderValue("application/xml");
This is the raw-view in SOAPUI
<?xml version="1.0" encoding="utf-8"?><StatObservationList><StatObservation><ObjectID>0b575ec1-7dea-41c4-a1f0-287190715ed2</ObjectID><Name>Тестовое статнаблюдение</Name><Code>GPPCode42</Code></StatObservation><StatObservation><ObjectID>3a871ea1-06ee-4991-a263-d643b424bdd4</ObjectID><Name>МиСП</Name><Code /></StatObservation></StatObservationList>
I think I've got it now. The text in your XDocument has, for whatever reason, been decoded incorrectly using Windows-1251.
Ideally, you need to go back to the source and ensure it is decoded properly (with UTF8). Converting this may not be an entirely loss-free process, as there are code points in the UTF8 that don't have a representation in Windows-1251 (a quick glance at the code page shows nothing for 0x98, for example).
However, to convert this after the fact the simplest way is just to get the text back, get the bytes for the encoding it was decoded with and then decode those with the correct encoding:
var windows1251 = Encoding.GetEncoding("windows-1251");
var utf8 = Encoding.UTF8;
var originalBytes = windows1251.GetBytes(document.ToString());
var correctXmlString = utf8.GetString(originalBytes);
var correctDocument = XDocument.Parse(correctXmlString);

XmlWriter.WriteNode(XmlReader) writes invalid XML using memory streams

I have a need to normalize XML streams to UTF-16. I use the following method:
All streams passed are byte streams: MemoryStream or FileStream. My problem is when I pass in a filestream containing the following (correctly encoded) XML as jobTicket:
<?xml version="1.0" encoding="utf-8"?>
<workflow>
<file>
<request name="create-temp-file" tag="очень">
</request>
<request name="create-temp-folder" tag="非常に">
</request>
</file>
</workflow>
ticketStreamU16 contains an XML declaration, complete with UTF-8 encoding declaration as UTF-16. This is not well formed XML.
public void EncodeJobTicket(Stream jobTicket, Stream ticketStreamU16)
{
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.Unicode;
if (jobTicket.CanSeek)
{
jobTicket.Position = 0;
}
using (XmlReader xmlRdr = XmlReader.Create(jobTicket))
using (XmlWriter xmlWtr = XmlWriter.Create(ticketStreamU16, settings))
{
xmlWtr.WriteNode(xmlRdr, false);
}
}
What am I missing? Shouldn't xmlWtr write the correct xml declartion? Do I have to look for a declaration and replace it?
You should remove the xml declaration from the read stream, because it copy it without change. Xml writer will write correct one for you:
using (XmlWriter xmlWtr = XmlWriter.Create(ticketStreamU16, settings))
{
xmlRdr.MoveToContent();
xmlWtr.WriteNode(xmlRdr, false);
}

Convert utf-8 XML document to utf-16 for inserting into SQL

I have an XML document that has been created using utf-8 encoding. I want to store that document in a sql 2008 xml column but I understand I need to convert it to utf-16 in order to do that.
I've tried using XDocument to do this but I'm not getting a valid XML result after the conversion. Here is what I've tried to do the conversion on (Utf8StringWriter is a small class that inherits from StringWriter and overloads Encoding):
XDocument xDoc = XDocument.Parse(utf8Xml);
StringWriter writer = new StringWriter();
XmlWriter xml = XmlWriter.Create(writer, new XmlWriterSettings()
{ Encoding = writer.Encoding, Indent = true });
xDoc.WriteTo(xml);
string utf16Xml = writer.ToString();
The data in the utf16Xml is invalid and when trying to insert into the database I get the error:
{"XML parsing: line 1, character 38, unable to switch the encoding"}
However the initial utf8Xml data is definitely valid and contains all the info I need.
UPDATE:
The initial XML is obtained by using XMLSerializer (with an Utf8StringWriter class) to create the xml string from an existing object model (engine). The code for this is:
public static void Serialise<T>(T engine, ref StringWriter writer)
{
XmlWriter xml = XmlWriter.Create(writer, new XmlWriterSettings() { Encoding = writer.Encoding });
XmlSerializer xs = new XmlSerializer(engine.GetType());
xs.Serialize(xml, engine);
}
I have to leave this like this as that code is out of my control to change.
Before I even send the utf16Xml string to the failing database call I can view it via the Visual Studio debugger and I notice that the entire string is not present and instead I get a string literal was not closed error on the XML viewer.
The error is on first line XDocument xDoc = XDocument.Parse(utf8Xml);. Most likely you converted utf8 stream into a string (utf8xml), but encoding specified in the string is still utf-8, so XML reader fails. If it is true than load XML directly from stream using Load instead of converting it to string first.
Set the encoding of the document to UTF-16 after you have parsed it from utf8xml
XDocument xDoc = XDocument.Parse(utf8Xml);
xDoc.Declaration.Encoding = "utf-16";
StringWriter writer = new StringWriter();
XmlWriter xml = XmlWriter.Create(writer, new XmlWriterSettings()
{ Encoding = writer.Encoding, Indent = true });
xDoc.WriteTo(xml);
string utf16Xml = writer.ToString();
Here's what I had to do to make it work. This just converts the XML to utf-16
string getUtf16Xml(System.Xml.XmlDocument xmlDoc)
{
System.Xml.Linq.XDocument xDoc = System.Xml.Linq.XDocument.Parse(xmlDoc.OuterXml);
xDoc.Declaration.Encoding = "utf-16";
return xDoc.ToString();
}
Then I can save the results to the DB.

Categories