Reading Xml files with umlaut chars


I asked a related question yesterday, "Writing encoded values for umlauts", and got a reply.
In the code from that reply, the Parse method works if the input is a string literal like so:
XDocument xDoc = XDocument.Parse("<description>Top Shelf-ÖÄÜookcase</description>");
To pass the input XML file as a string, I have to read it first, and reading fails if there are umlauts in the input XML.
How do I get past that?
I tried both the Load and Parse methods of XDocument.
Load:
Invalid character in the given encoding. Line 3, position 35.
Parse:
Data at the root level is invalid. Line 1, position 1.
Here is a sample xml after using CDATA:
<?xml version="1.0" encoding="utf-8"?>
<kal>
<description><![CDATA[Top Shelf-ÖÄÜookcase]]> </description>
</kal>

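Change the encoding to "iso-8859-1". The file declares utf-8, but the "Invalid character in the given encoding" error suggests its bytes are actually in a single-byte encoding such as ISO-8859-1.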

Have you tried wrapping the description data with a CDATA?
<description><![CDATA[Top Shelf-ÖÄÜookcase]]> </description>
Special characters don't particularly parse well in XML unless you wrap them with CDATA.

As Besi stated, you have to use the correct encoding of the XML file in order to handle the umlauts correctly.
Even though you said that creating the incoming XML file is not in your hands, you can still control the encoding used for parsing it by using a dedicated StreamReader:
// Create your XDocument.
XDocument doc;
// Set up a StreamReader for your file, specifying the encoding you need.
using (StreamReader reader = new StreamReader(@"C:\your-file.xml", System.Text.Encoding.GetEncoding("ISO-8859-1")))
{
    // Parse the string returned by StreamReader.ReadToEnd().
    doc = XDocument.Parse(reader.ReadToEnd());
}
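The same idea can be sketched without reading the whole file into a string first. This is a minimal sketch, not the answerer's code: the class and method names (EncodingDemo, LoadWithEncoding) are made up for illustration, and it assumes the file really is ISO-8859-1 even though its declaration says utf-8. Because the parser is handed already-decoded characters, the encoding named in the declaration is not used to decode the bytes.

```csharp
using System.IO;
using System.Text;
using System.Xml.Linq;

public static class EncodingDemo
{
    // Hypothetical helper: decodes the raw bytes with an explicitly chosen
    // encoding and lets XDocument parse the resulting characters.
    public static XDocument LoadWithEncoding(Stream input, Encoding encoding)
    {
        using (var reader = new StreamReader(input, encoding))
        {
            return XDocument.Load(reader);
        }
    }
}
```

It could be called as EncodingDemo.LoadWithEncoding(File.OpenRead(path), Encoding.GetEncoding("ISO-8859-1")), where path is whatever file you need to read.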

Related

"An error occurred while parsing EntityName" after grabbing content from valid XML

I am reading an XML string with XDocument
XmlReader reader = XmlReader.Create(new StringReader(xmltext));
reader.Read();
XDocument xdoc = XDocument.Load(reader);
Then I grab the content of some tags and put them within tags in a different string.
When I try to Load this string in the same way I did with the first, I get an error "An error occurred while parsing EntityName. Line 1, position 344.".
I think it should parse correctly since it has been parsed before, so I guess I am missing something here.
I am reading and copying the content of the first XML with (string)i.Element("field").
I am using .NET 4.

When I grab the content that I want to use for building another XML string, I use (string)i.Element("field"), which converts the element into its plain text value. The next parse no longer recognizes it as an element. I solved the problem by dropping the (string) cast and using just i.Element("field"), and this works.

It sounds like you've got something like this:
<OriginalDocument>
<Foo>A &amp; B</Foo>
</OriginalDocument>
That A &amp; B represents the text A & B. So when you grab the text from the element, you'll get the string "A & B". If you then use that to build a new element like this:
string foo = "<Foo>" + fooText + "</Foo>";
then you'll end up with invalid XML like this:
<Foo>A & B</Foo>
Basically, you shouldn't be constructing XML in text form. It's not clear what you're really trying to achieve, but you can copy an element from one place to another pretty easily in XElement form; you shouldn't need to build a string and then reparse it.
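As a rough illustration of copying in XElement form, the sketch below deep-copies an element into a new parent and, separately, builds an element from plain text so the API does the escaping. The names ElementCopyDemo, CopyFoo, BuildFoo and NewDocument are invented for this example.

```csharp
using System.Xml.Linq;

public static class ElementCopyDemo
{
    // Copies the <Foo> child of the source document into a new parent
    // element without building or reparsing any XML string by hand.
    public static XElement CopyFoo(string sourceXml)
    {
        XElement original = XElement.Parse(sourceXml);
        XElement foo = original.Element("Foo");
        // The XElement(XElement) constructor makes a deep copy.
        return new XElement("NewDocument", new XElement(foo));
    }

    // Builds <Foo> from plain text; the API escapes "&", "<" and ">"
    // when the element is serialized.
    public static XElement BuildFoo(string fooText)
    {
        return new XElement("Foo", fooText);
    }
}
```

With this, BuildFoo("A & B").ToString() produces <Foo>A &amp; B</Foo>, which can be parsed again without the EntityName error.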

So after spending hours on this issue:
it turns out that if you have an unescaped ampersand ("&") or any other XML special character within the text of your XML string, it will always fail when you try to read the XML.
To solve this, replace the special characters in the text values (not in the markup itself) with their escaped forms. Replace "&" first, otherwise the ampersands introduced by the other replacements get escaped a second time:
YourXmlString = YourXmlString.Replace("&", "&amp;").Replace("'", "&apos;").Replace("\"", "&quot;").Replace(">", "&gt;").Replace("<", "&lt;");
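If the XML really has to be assembled as text, one option is to let the framework escape each text value instead of chaining Replace calls. The following is only a sketch: EscapeDemo and WrapInElement are hypothetical names, and it assumes the element name passed in is already a valid XML name.

```csharp
using System.Security;

public static class EscapeDemo
{
    // Wraps a plain-text value in an element, escaping only the value.
    // SecurityElement.Escape replaces <, >, ", ' and & with their entities.
    public static string WrapInElement(string name, string text)
    {
        return "<" + name + ">" + SecurityElement.Escape(text) + "</" + name + ">";
    }
}
```

For example, WrapInElement("Foo", "A & B") returns <Foo>A &amp; B</Foo>.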

Save string to XML File

I want to save the following string in an XML File:
<text><![CDATA[<p>what is my pet name</p>]]></text>
When I am saving it, it looks like:
<text>&lt;![CDATA[&lt;p&gt;what is my pet name&lt;/p&gt;]]&gt;</text>
I have tried the File.WriteAllText() and XmlDocument.Save() methods but didn't get the proper result.
Basically, everywhere other than the opening and closing tags in the XML, < is replaced by &lt; and > is replaced by &gt;.

What is happening is that the XML parser is encoding your string. When you try to access the string later, it can be decoded again at that time.
What I suggest is that you either load the text into a new XmlDocument with XmlDocument.LoadXml(string s) and then import that into your current document, or leave it encoded.
You should not try to both use an XML parser, and manually add text at the same time.

I guess you add the CDATA manually and the XML writing mechanism correctly escapes your CDATA because it treats it as text content. Instead explicitly add a CDATA section with just the contents.
If you are using the old XML API (System.XML), then use this method to create the CDATA Section: http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.createcdatasection
Then append the node to the element just like in the example in the link.
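A minimal sketch of that approach with the System.Xml API, assuming the element is called text as in the question (the CDataDemo and BuildTextElement names are made up here):

```csharp
using System.Xml;

public static class CDataDemo
{
    // Builds <text> with an explicit CDATA section. Only the inner content
    // is passed in; the CDATA delimiters are added by the API.
    public static string BuildTextElement(string content)
    {
        var doc = new XmlDocument();
        XmlElement text = doc.CreateElement("text");
        doc.AppendChild(text);
        text.AppendChild(doc.CreateCDataSection(content));
        return doc.OuterXml;
    }
}
```

Calling BuildTextElement("<p>what is my pet name</p>") returns <text><![CDATA[<p>what is my pet name</p>]]></text>, with nothing inside the CDATA section escaped.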

XML is being written correctly.
XML has special characters that are reserved for commands, just like C# reserves words like "if" and "string".
XML is encoding your string for storage. What you need to do is when you retrieve your string, run it through a similar decode process.
Use this: HttpServerUtility.HtmlDecode(encodedString)
Reference:
Decode XML returned by a webservice (< and > are replaced with &lt; and &gt;)?
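When the value is read back through an XML API rather than as raw file text, the decoding happens automatically and no separate decode call is needed. A small sketch of that round trip, using LINQ to XML and invented names (RoundTripDemo, Store, Retrieve):

```csharp
using System.Xml.Linq;

public static class RoundTripDemo
{
    // Stores arbitrary text in a <text> element; special characters are
    // escaped when the element is serialized.
    public static string Store(string content)
    {
        return new XElement("text", content).ToString();
    }

    // Parses the serialized XML and returns the original, unescaped text.
    public static string Retrieve(string xml)
    {
        return XElement.Parse(xml).Value;
    }
}
```

Retrieve(Store(s)) gives back s unchanged, even though the stored form contains &lt; and &gt;.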

How to stop XDocument.Save writing escape chars

I'm reading XML data from a varchar column in a SQL db, into a linq to sql XElement belonging to an XDocument.
When I execute the XDocument.Save method, the XML is written to file but includes the escape characters. For example, ">" is changed to "&gt;".
Is there an easy way to prevent this?

First, there seems to be no reason to prevent it. Like kenny mentioned, unless special characters are XML encoded, no parser would be able to parse produced XML (because '<' or '>' characters means a lot for that parser). Second, when your parser decodes XML (e.g. you call XElement.Value), all special characters will be converted back to what they originally were. Finally, if you want to keep the original string (e.g. for purposes other than XML parsing), you can use CDATA, which in case of Linq2XML is represented by XCData class.
EDIT: As Rob pointed out, I might have gotten it wrong. If the point is to add existing XML to a document without the special characters being escaped, use the following code:
XDocument document = new XDocument();
var xmlFromDb = "<xml>content</xml>";
using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(xmlFromDb)))
{
    using (var reader = XmlReader.Create(stream))
    {
        reader.MoveToContent();
        document.Add(XElement.ReadFrom(reader));
    }
}
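Both options from this answer can also be sketched more compactly. This is an illustration under assumptions, not the answerer's code: it assumes the database value is a well-formed XML fragment with a single root, and the names SaveDemo, EmbedAsXml, EmbedAsCData and root are made up.

```csharp
using System.Xml.Linq;

public static class SaveDemo
{
    // Parses the stored string into real elements, so no markup is escaped
    // when the document is saved.
    public static XDocument EmbedAsXml(string xmlFromDb)
    {
        return new XDocument(new XElement("root", XElement.Parse(xmlFromDb)));
    }

    // Keeps the string verbatim as character data inside a CDATA section.
    public static XDocument EmbedAsCData(string raw)
    {
        return new XDocument(new XElement("root", new XCData(raw)));
    }
}
```

EmbedAsXml fails with an XmlException if the string is not well-formed, which is usually a useful signal that the data is text rather than markup and belongs in EmbedAsCData or in an ordinary escaped text node.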

Should I preserve &#xA; in XML element strings

I read XML files that sometimes contain elements like
<stringValue>text&#xA;text</stringValue>
XmlReader returns
text\ntext
for such strings.
So, when I rewrite the source XML later using XmlWriter, I don't get the same strings (there is no &#xA; in them).
Should I worry about all this or it's fine to allow string to be changed this way?

I would worry about it, yes, because you're manipulating the data. This means that if you do a round trip through the XML document, the text formatting won't be the same.
When saving back out to XML, you would need to make sure the same formatting is persisted.

&#xA; is the XML encoding for a new line character (\n). If your XML data has a new line in the text, then this notation is correct and the output from XmlWriter is correct. If the new line was not in the original XML data, I've been seeing an issue with IE10/IE11 using the XMLHttpRequest object inserting \r\n in the XML data.
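To check whether the difference matters for the data itself, it can help to compare what a parser returns for the character reference &#xA; and for a literal line feed in element content. A small sketch using LINQ to XML (NewlineDemo and ReadValue are illustrative names):

```csharp
using System.Xml.Linq;

public static class NewlineDemo
{
    // Returns the text value that parsing the given markup produces.
    public static string ReadValue(string xml)
    {
        return XElement.Parse(xml).Value;
    }
}
```

ReadValue("<stringValue>text&#xA;text</stringValue>") and ReadValue("<stringValue>text\ntext</stringValue>") both return "text\ntext", so in element content the two spellings carry the same value and differ only in the serialized form.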

Writing XMLDocument unescaped in C#

Currently I'm writing XHTML in an XmlDocument. This works perfectly, but I'm stuck on one problem. Some XmlText elements can contain things like &nbsp;. When I write such things to a stream, it uses the InnerXml instead of the InnerText value for such nodes. The problem is that the output is wrong because it is now outputting &amp;nbsp; instead of &nbsp;. How can I use XmlWriter and XmlDocument without this escaping when writing to a stream? I just want unescaped output.

If you use XmlWriter.WriteRaw, it won't perform any escaping - it assumes you've got raw XML.
For example:
using System;
using System.Xml;

class Test
{
    static void Main()
    {
        using (XmlWriter writer = XmlWriter.Create(Console.Out))
        {
            writer.WriteStartDocument();
            writer.WriteStartElement("root");
            writer.WriteRaw("<element>&nbsp;</element>");
            writer.WriteEndElement();
            writer.WriteEndDocument();
        }
    }
}
Output:
<?xml version="1.0" encoding="IBM437"?><root><element>&nbsp;</element></root>

You're almost certainly trying to solve the wrong problem here. If you want text with non-breaking spaces, then you should use the non-breaking space character. In a C# string literal you can write it as the escape sequence \u00A0, for example:
var xmldoc = new XmlDocument();
XmlElement test = xmldoc.CreateElement("test");
xmldoc.AppendChild(test);
XmlText nbsp = xmldoc.CreateTextNode("\u00A0");
test.AppendChild(nbsp);
HTML entities like nbsp are just a way to encode such characters in a non-Unicode text file. You shouldn't be using them when constructing an XML DOM. By the way, if you force .NET to write the above DOM to an ASCII-encoded file (via the proper XmlWriterSettings) then it will probably write the non-breaking space character as &#xA0;. In a UTF-8 encoded file (the default) it will just appear as a space.
If you force certain literal character sequences to appear in the XML output, then you risk creating invalid XML that cannot be loaded by conforming XML processors. For example, try to load <test>&nbsp;</test> into an empty XmlDocument. This will throw an exception. To be fair, you can declare such entities, and the XHTML schema does so. But I hope you see my point.
edit: XmlDocument is doing its job correctly. If it didn't escape characters such as & < > then you could create invalid XML that's impossible to load again. To force an XML entity in the output you should use XmlDocument.CreateEntityReference. The bug is in whatever code is using entities in XmlText nodes instead of generating XmlEntityReference nodes.
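A rough sketch of the CreateEntityReference approach, reusing the test element from the example above. EntityReferenceDemo and BuildWithEntity are hypothetical names, and the sketch assumes no DTD is attached to the document, so the entity itself stays undeclared.

```csharp
using System.Xml;

public static class EntityReferenceDemo
{
    // Appends an entity reference node (not a text node) to <test>, so the
    // serializer writes the reference itself instead of escaping an ampersand.
    public static string BuildWithEntity(string entityName)
    {
        var doc = new XmlDocument();
        XmlElement test = doc.CreateElement("test");
        doc.AppendChild(test);
        test.AppendChild(doc.CreateEntityReference(entityName));
        return doc.OuterXml;
    }
}
```

BuildWithEntity("nbsp") should serialize as . As noted above, that output only loads again in a processor that has the nbsp entity declared, for example through the XHTML DTD.
<test>
if (EntityReferenceDemo.BuildWithEntity("nbsp") != "<test>&nbsp;</test>") throw new System.Exception("entity reference output mismatch");
</test>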

Assuming you are using .NET 3.x, learn and use LINQ to XML. The API is very simple and more capable. That way you don't need to walk the DOM; instead you can just query the object tree.
Specifically, look into the XDocument class of the API.
