XML Extra free space - c#

Good day. I work with XML through C#'s XmlDocument, and after saving the document the following happens. Before saving:
<Name></Name>
After saving:
<Name>
</Name>
How do I remove the extra whitespace? I've tried setting doc.PreserveWhitespace = true; before saving and before loading. The result is not what I want: it removes all whitespace, and a large XML document becomes visually unreadable.
I have already tried the suggestions, with the same result. I also need the windows-1251 encoding. Why does XmlDocument do this? That whitespace is important for me and my program.
The problem is solved. Thank you all.

It can be done. You have to control the formatting options when you save the document:
XmlDocument doc = new XmlDocument();
// XmlTextWriter has no constructor that takes only a file name, so an encoding is passed as well
using (var wr = new XmlTextWriter(fileName, Encoding.UTF8))
{
    wr.Formatting = Formatting.None;
    doc.Save(wr);
}
Or you can fine-tune it further with XmlWriterSettings:
var settings = new XmlWriterSettings
{
    Indent = false,
    NewLineChars = String.Empty
};
using (var wr = XmlWriter.Create(fileName, settings))
{
    // XmlWriter has no Formatting property; the settings above control the layout
    doc.Save(wr);
}
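As a minimal sketch of the other route mentioned in the question: when PreserveWhitespace is set to true before the document is loaded, XmlDocument keeps the whitespace nodes from the source and does not re-indent on save, so an element written as <Name></Name> should come back unchanged. The class and method names below (WhitespaceDemo, RoundTrip) are hypothetical and exist only for illustration.

```csharp
using System.IO;
using System.Xml;

public static class WhitespaceDemo
{
    // Loads the XML with PreserveWhitespace enabled BEFORE loading,
    // then saves it to a string without any re-indentation.
    public static string RoundTrip(string xml)
    {
        var doc = new XmlDocument { PreserveWhitespace = true };
        doc.LoadXml(xml);

        using (var sw = new StringWriter())
        {
            doc.Save(sw);
            return sw.ToString();
        }
    }
}
```

Calling WhitespaceDemo.RoundTrip("<Root>\n  <Name></Name>\n</Root>") keeps the original layout. Setting the property only before saving is not enough, because the whitespace nodes were already dropped at load time.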

Related

Indenting Xml file edited with XDocument in C#

I have some code that edits an XML file.
When I save the file the new elements are not properly indented while existing elements are, e.g.:
Before:
<MyGroup>
<ExistingElement1>a value</ExistingElement1>
<ExistingElement2>something else</ExistingElement2>
</MyGroup>
After:
<MyGroup>
<ExistingElement1>a value</ExistingElement1>
<ExistingElement2>something else</ExistingElement2>
<NewElement>Inserted by code</NewElement></MyGroup>
New elements are added with XElement, like:
myGroup.Add(new XElement(ns + "NewElement", "Inserted by code"));
When saving the file I use XmlWriterSettings, as I want to avoid saving the XmlDeclaration:
var settings = new XmlWriterSettings
{
    OmitXmlDeclaration = true,
    Encoding = Encoding.UTF8,
    Indent = true
};
using (var writer = XmlWriter.Create(filePath, settings))
{
    xmlDoc.Save(writer);
}
So it looks like the Indent = true option is not working for the newly added elements. Does anyone know why?
BTW, when I open the file I use LoadOptions.PreserveWhitespace:
XDocument xmlDoc = XDocument.Load(filePath, LoadOptions.PreserveWhitespace);
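The thread gives no answer here, so the following is only a sketch of one likely explanation: with LoadOptions.PreserveWhitespace the existing whitespace is kept as text nodes, and an indenting XmlWriter leaves elements with mixed content alone, so the new element ends up glued to the closing tag. Loading without that option lets the writer re-indent the whole tree. IndentDemo and AddAndSave are hypothetical names, and the namespace from the question is left out for brevity.

```csharp
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class IndentDemo
{
    // Parses the XML with the given load options, appends a new element
    // to the root and saves the result with Indent = true.
    public static string AddAndSave(string xml, LoadOptions options)
    {
        XDocument doc = XDocument.Parse(xml, options);
        doc.Root.Add(new XElement("NewElement", "Inserted by code"));

        var settings = new XmlWriterSettings
        {
            OmitXmlDeclaration = true,
            Indent = true
        };

        var sb = new StringBuilder();
        using (var writer = XmlWriter.Create(sb, settings))
        {
            doc.Save(writer);
        }
        return sb.ToString();
    }
}
```

With LoadOptions.PreserveWhitespace the output should reproduce the layout shown in the question; with LoadOptions.None the new element gets its own indented line. The trade-off is that LoadOptions.None discards the file's original formatting.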

Weird character encoded characters (’) appearing from a feed

I've got a question regarding an XML feed and XSL transformation I'm doing. In a few parts of the outputted feed on an HTML page, I get weird characters (such as ’) appearing on the page.
On another site (that I don't own) that's using the same feed, it isn't getting these characters.
Here's the code I'm using to grab and return the transformed content:
string xmlUrl = "http://feedurl.com/feed.xml";
string xmlData = new System.Net.WebClient().DownloadString(xmlUrl);
string xslUrl = "http://feedurl.com/transform.xsl";
XsltArgumentList xslArgs = new XsltArgumentList();
xslArgs.AddParam("type", "", "specifictype");
string resultText = Utils.XslTransform(xmlData, xslUrl, xslArgs);
return resultText;
And my Utils.XslTransform function looks like this:
// The caller passes three arguments, so the XSLT arguments are accepted here and forwarded
public static string XslTransform(string data, string xslurl, XsltArgumentList xslArgs)
{
    TextReader textReader = new StringReader(data);
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.DtdProcessing = DtdProcessing.Ignore;
    XmlReader xmlReader = XmlReader.Create(textReader, settings);
    XmlReader xslReader = new XmlTextReader(Uri.UnescapeDataString(xslurl));
    XslCompiledTransform myXslT = new XslCompiledTransform();
    myXslT.Load(xslReader);
    StringBuilder sb = new StringBuilder();
    using (TextWriter tw = new StringWriter(sb))
    {
        myXslT.Transform(xmlReader, xslArgs, tw);
    }
    string transformedData = sb.ToString();
    return transformedData;
}
I'm not extremely knowledgeable with character encoding issues and I've been trying to nip this in the bud for a bit of time and could use any suggestions possible. I'm not sure if there's something I need to change with how the WebClient downloads the file or something going weird in the XslTransform.
Thanks!
Give HtmlEncode a try. So in this case you would reference System.Web and then make this change (just call the HtmlEncode function on the last line):
string xmlUrl = "http://feedurl.com/feed.xml";
string xmlData = new System.Net.WebClient().DownloadString(xmlUrl);
string xslUrl = "http://feedurl.com/transform.xsl";
XsltArgumentList xslArgs = new XsltArgumentList();
xslArgs.AddParam("type", "", "specifictype");
string resultText = Utils.XslTransform(xmlData, xslUrl, xslArgs);
return HttpUtility.HtmlEncode(resultText);
The character â is a marker of a multibyte UTF-8 sequence (’) that has been decoded with a single-byte encoding such as Windows-1252 instead of UTF-8. So, I guess, you generate the HTML in UTF-8 while the browser interprets it otherwise. I see two ways to fix it:
The simplest solution would be to update the XSLT to include the HTML meta tag that will hint the correct encoding to browser: <meta charset="UTF-8">.
If your transform already defines a different encoding in meta tag and you'd like to keep it, this encoding needs to be specified in the function that saves XML as file. I assume this function took ASCII by default in your example. If your XSLT was configured to generate XML files directly to disk, you could adjust it with XSLT instruction <xsl:output encoding="ASCII"/>.
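To illustrate where the stray â comes from, here is a small sketch that encodes a string as UTF-8 and then decodes the bytes with a single-byte encoding. It uses ISO-8859-1 because that encoding is available everywhere in .NET; a browser would more likely fall back to Windows-1252, which renders the second and third bytes differently but produces the same leading â. MojibakeDemo and MisdecodeUtf8 are hypothetical names.

```csharp
using System.Text;

public static class MojibakeDemo
{
    // Encodes the text as UTF-8, then decodes the bytes with the wrong
    // (single-byte) encoding, reproducing the garbling seen on the page.
    public static string MisdecodeUtf8(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        return latin1.GetString(utf8Bytes);
    }
}
```

The right single quotation mark (U+2019) is three bytes in UTF-8 (E2 80 99), so MojibakeDemo.MisdecodeUtf8("\u2019") yields three characters, the first of which is â (U+00E2).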
To use WebClient.DownloadString you have to know what encoding the server is going to use and tell the WebClient in advance. It's a bit of a Catch-22.
But there is no need to do that. Use WebClient.DownloadData or WebClient.OpenRead and let an XML library figure out which encoding to use.
using (var web = new WebClient())
using (var stream = web.OpenRead("http://unicode.org/repos/cldr/trunk/common/supplemental/windowsZones.xml"))
using (var reader = XmlReader.Create(stream, new XmlReaderSettings { DtdProcessing = DtdProcessing.Parse }))
{
    reader.MoveToContent();
    //… use reader as you will, including var doc = XDocument.ReadFrom(reader);
}

XDocument will not parse html entities (e.g. ) but XmlDocument will

I am currently converting our old parsers that run on XmlDocument to XDocument. I do this mainly to get LINQ querying and the added line number info.
My xml contains an element like this:
<?xml version="1.0"?>
<fulltext>
hello this is a failed textnode
and I don't know how to parse it.
</fulltext>
My problem is that while XmlDocument seems to have no problem reading that node with:
var xmlDocument = new XmlDocument();
var physicalPath = GetPhysicalPath(uploadFolderFile);
try
{
    xmlDocument.Load(physicalPath);
}
catch (XmlException xmlException)
{
    _log.Warn("Problems with the document", xmlException);
}
The example above parses the document fine, but when I try to do:
XDocument xmlDocument;
var physicalPath = GetPhysicalPath(uploadFolderFile);
var xmlStream = new System.IO.StreamReader(physicalPath);
try
{
    xmlDocument = XDocument.Load(xmlStream, LoadOptions.SetLineInfo | LoadOptions.SetBaseUri);
}
catch (XmlException xmlException) // the exception must be named to be logged below
{
    _log.Warn("Trying to clean document for HexaDecimal", xmlException);
}
It fails to read the document because of the character
The special character seems to be allowed in XML version 1.1 but changing the description doesn't help.
I have thought about just parsing the document with XmlDocument and then converting it; but that seems to be counterintuitive. Can anybody help with this problem?
Ok...so I sort of found a solution to this problem.
First of all I try to parse the xml using the following code:
private XDocument GetXmlDocument(String physicalPath)
{
    XDocument xmlDocument;
    var xmlStream = new System.IO.StreamReader(physicalPath);
    try
    {
        xmlDocument = XDocument.Load(xmlStream, LoadOptions.SetLineInfo);
    }
    catch (XmlException)
    {
        //_log.Warn("Trying to clean document for HexaDecimal", xmlException);
        xmlDocument = XmlSanitizingStream.TryToCleanXMLBeforeParsing(physicalPath);
    }
    return xmlDocument;
}
If it fails to load the document, then I will try to clean it using the technique used in this blogpost:
http://seattlesoftware.wordpress.com/2008/09/11/hexadecimal-value-0-is-an-invalid-character/
It will not remove the character I mentioned before, but it will remove any character not allowed by the XML standard.
Then, after sanitizing the XML, I add an XMLReader and set its settings to not check characters:
public static XDocument TryToCleanXMLBeforeParsing(String physicalPath)
{
    string xml;
    Encoding encoding;
    using (var reader = new XmlSanitizingStream(File.OpenRead(physicalPath)))
    {
        xml = reader.ReadToEnd();
        encoding = reader.CurrentEncoding;
    }
    byte[] encodedString;
    if (encoding.Equals(Encoding.UTF8)) encodedString = Encoding.UTF8.GetBytes(xml);
    else if (encoding.Equals(Encoding.UTF32)) encodedString = Encoding.UTF32.GetBytes(xml);
    else encodedString = Encoding.Unicode.GetBytes(xml);
    var ms = new MemoryStream(encodedString);
    ms.Flush();
    ms.Position = 0;
    var settings = new XmlReaderSettings {CheckCharacters = false};
    XmlReader xmlReader = XmlReader.Create(ms, settings);
    var xmlDocument = XDocument.Load(xmlReader);
    ms.Close();
    return xmlDocument;
}
Since I've cleaned the document removing illegal characters before I add the ignore characters to the reader, I am pretty sure that I do not read a malformed XML document. Worst case scenario is I get a malformed XML and it will throw an error anyways.
I only use this for parsing and it should only be used to read the data. This will not make the XML well-formed and will in many cases throw exceptions elsewhere in your code. I am only using this because I cannot change what the customer is sending us and I have to read it as is.

C# XDocument Load with multiple roots

I have an XML file with no root. I cannot change this. I am trying to parse it, but XDocument.Load won't do it. I have tried to set ConformanceLevel.Fragment, but I still get an exception thrown. Does anyone have a solution to this?
I tried with XmlReader, but things got messed up and I can't get it to work right. XDocument.Load works great, but not if I have a file with multiple roots.
XmlReader itself does support reading XML fragments, i.e.
var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
using (var reader = XmlReader.Create("fragment.xml", settings))
{
    // you can work with reader just fine
}
However XDocument.Load does not support reading of fragmented xml.
Quick and dirty way is to wrap the nodes under one virtual root before you invoke the XDocument.Parse. Like:
var fragments = File.ReadAllText("fragment.xml");
var myRootedXml = "<root>" + fragments + "</root>";
var doc = XDocument.Parse(myRootedXml);
This approach is limited to small xml files - as you have to read file into memory first; and concatenating large string means moving large objects in memory - which is best avoided.
If performance matters you should be reading nodes into XDocument one-by-one via XmlReader as explained in @Martin-Honnen's excellent answer (https://stackoverflow.com/a/18203952/2440262)
If you use API that takes for granted that XmlReader iterates over valid xml, and performance matters, you can use joined-stream approach instead:
using (var jointStream = new MultiStream())
using (var openTagStream = new MemoryStream(Encoding.ASCII.GetBytes("<root>"), false))
using (var fileStream =
    File.Open(@"fragment.xml", FileMode.Open, FileAccess.Read, FileShare.Read))
using (var closeTagStream = new MemoryStream(Encoding.ASCII.GetBytes("</root>"), false))
{
    jointStream.AddStream(openTagStream);
    jointStream.AddStream(fileStream);
    jointStream.AddStream(closeTagStream);
    using (var reader = XmlReader.Create(jointStream))
    {
        // now you can work with reader as if it is reading valid xml
    }
}
MultiStream - see for example https://gist.github.com/svejdo1/b9165192d313ed0129a679c927379685
Note: XDocument loads the whole xml into memory. So don't use it for large files - instead use XmlReader for iteration and load just the crispy bits as XElement via XNode.ReadFrom(...)
The only in-memory tree representations in the .NET Framework that can deal with fragments are XmlDocumentFragment in .NET's DOM implementation, for which you would need to create an XmlDocument and a fragment with e.g.
XmlDocument doc = new XmlDocument();
XmlDocumentFragment frag = doc.CreateDocumentFragment();
frag.InnerXml = stringWithXml; // for instance
// frag.InnerXml = File.ReadAllText("fragment.xml");
or XPathDocument, which you can create using an XmlReader with ConformanceLevel set to Fragment:
XPathDocument doc;
using (XmlReader xr =
    XmlReader.Create("fragment.xml",
        new XmlReaderSettings()
        {
            ConformanceLevel = ConformanceLevel.Fragment
        }))
{
    doc = new XPathDocument(xr);
}
// now create an XPathNavigator to read out data, e.g.
XPathNavigator nav = doc.CreateNavigator();
Obviously XPathNavigator is read-only.
If you want to use LINQ to XML then I agree with the suggestions made that you need to create an XElement as a wrapper. Instead of pulling in a string with the file contents you could however use XNode.ReadFrom with an XmlReader e.g.
public static class MyExtensions
{
    public static IEnumerable<XNode> ParseFragment(XmlReader xr)
    {
        xr.MoveToContent();
        XNode node;
        while (!xr.EOF && (node = XNode.ReadFrom(xr)) != null)
        {
            yield return node;
        }
    }
}
then
XElement root = new XElement("root",
    MyExtensions.ParseFragment(XmlReader.Create(
        "fragment.xml",
        new XmlReaderSettings() {
            ConformanceLevel = ConformanceLevel.Fragment })));
That might work better and more efficiently than reading everything into a string.
If you wanted to use XmlDocument.Load() then you would need to wrap the content in a root node.
Or you could try something like this, loading each top-level element into its own XmlDocument:
while (!xmlReader.EOF)
{
    if (xmlReader.NodeType == XmlNodeType.Element)
    {
        XmlDocument d = new XmlDocument();
        // ReadOuterXml already advances the reader past the element
        d.LoadXml(xmlReader.ReadOuterXml());
    }
    else
    {
        xmlReader.Read();
    }
}
An XML document cannot have more than one root element; exactly one is required. What you can do is take all the fragment elements, wrap them in a root element, and parse the result with XDocument.
This would be the best and easiest approach that one could think of.

How to get Xml as string from XDocument?

I am new to LINQ to XML. After you have built XDocument, how do you get the OuterXml of it like you did with XmlDocument?
You only need to use the overridden ToString() method of the object:
XDocument xmlDoc ...
string xml = xmlDoc.ToString();
This works with all XObjects, like XElement, etc.
I don't know when this changed, but today (July 2017) when trying the answers out, I got
"System.Xml.XmlDocument"
Instead of ToString(), you can use the originally intended way accessing the XmlDocument content: writing the xml doc to a stream.
XmlDocument xml = ...;
string result;
using (StringWriter writer = new StringWriter())
{
    xml.Save(writer);
    result = writer.ToString();
}
Several responses give a slightly incorrect answer.
XDocument.ToString() omits the XML declaration (and, according to @Alex Gordon, may return invalid XML if it contains encoded unusual characters like &).
Saving XDocument to StringWriter will cause .NET to emit encoding="utf-16", which you most likely don't want (if you save XML as a string, it's probably because you want to later save it as a file, and the de facto standard for saving files is UTF-8; .NET saves text files as UTF-8 unless specified otherwise).
@Wolfgang Grinfeld's answer is heading in the right direction, but it's unnecessarily complex.
Use the following:
var memory = new MemoryStream();
xDocument.Save(memory);
string xmlText = Encoding.UTF8.GetString(memory.ToArray());
This will return XML text with UTF-8 declaration.
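If a string is the target and only the utf-16 label in the declaration is the problem, another common sketch is a StringWriter subclass that reports UTF-8 as its encoding, since XDocument.Save takes the declared encoding from the writer. Utf8StringWriter and ToXmlString are hypothetical names introduced here, not part of the framework.

```csharp
using System.IO;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper: a StringWriter that reports UTF-8 instead of UTF-16,
// so the XML declaration written by Save says encoding="utf-8".
public sealed class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding => Encoding.UTF8;
}

public static class XDocumentStringExtensions
{
    // Returns the document as a string, including the XML declaration.
    public static string ToXmlString(this XDocument doc)
    {
        using (var writer = new Utf8StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```

Usage would be string xmlText = xDocument.ToXmlString();. Because the result never passes through a byte stream, no byte order mark ends up in the string.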
Doing XDocument.ToString() may not get you the full XML.
In order to get the XML declaration at the start of the XML document as a string, use the XDocument.Save() method:
var ms = new MemoryStream();
using (var xw = XmlWriter.Create(new StreamWriter(ms, Encoding.GetEncoding("ISO-8859-1"))))
    new XDocument(new XElement("Root", new XElement("Leaf", "data"))).Save(xw);
var myXml = Encoding.GetEncoding("ISO-8859-1").GetString(ms.ToArray());
Use ToString() to convert XDocument into a string:
string result = string.Empty;
XElement root = new XElement("xml",
    new XElement("MsgType", "<![CDATA[" + "text" + "]]>"),
    new XElement("Content", "<![CDATA[" + "Hi, this is Wilson Wu Testing for you! You can ask any question but no answer can be replied...." + "]]>"),
    new XElement("FuncFlag", 0)
);
result = root.ToString();
While @wolfgang-grinfeld's answer is technically correct (it also produces the XML declaration, unlike plain .ToString()), the code generates a UTF-8 byte order mark (BOM), which the XDocument.Parse(string) method cannot process; it throws a "Data at the root level is invalid. Line 1, position 1." error.
So here is another solution without the BOM:
var utf8Encoding =
    new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
using (var memory = new MemoryStream())
using (var writer = XmlWriter.Create(memory, new XmlWriterSettings
{
    OmitXmlDeclaration = false,
    Encoding = utf8Encoding
}))
{
    CompanyDataXml.Save(writer);
    writer.Flush();
    return utf8Encoding.GetString(memory.ToArray());
}
I found this example in the Microsoft .NET 6 documentation for the XDocument.Save method. I think it answers the original question (what is the XDocument equivalent of XmlDocument.OuterXml), and also addresses the concerns that others have pointed out already. By using XmlWriterSettings you can predictably control the string output.
https://learn.microsoft.com/en-us/dotnet/api/system.xml.linq.xdocument.save
StringBuilder sb = new StringBuilder();
XmlWriterSettings xws = new XmlWriterSettings();
xws.OmitXmlDeclaration = true;
xws.Indent = true;
using (XmlWriter xw = XmlWriter.Create(sb, xws)) {
    XDocument doc = new XDocument(
        new XElement("Child",
            new XElement("GrandChild", "some content")
        )
    );
    doc.Save(xw);
}
Console.WriteLine(sb.ToString());
Looking at these answers, I see a lot of unnecessary complexity and inefficiency in pursuit of generating the XML declaration automatically. But since the declaration is so simple, there isn't much value in generating it. Just KISS (keep it simple, stupid):
// Extension method
public static string ToStringWithDeclaration(this XDocument doc, string declaration = null)
{
    declaration ??= "<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n";
    return declaration + doc.ToString();
}
// Usage
string xmlString = doc.ToStringWithDeclaration();
// Or
string xmlString = doc.ToStringWithDeclaration("...");
Using XmlWriter instead of ToString() can give you more control over how the output is formatted (such as if you want indentation), and it can write to other targets besides string.
The reason to target a memory stream is performance. It lets you skip the step of storing the XML in a string (since you know the data must end up in a different encoding eventually, whereas string is always UTF-16 in C#). For instance, for an HTTP request:
// Extension method
public static ByteArrayContent ToByteArrayContent(
    this XDocument doc, XmlWriterSettings xmlWriterSettings = null)
{
    xmlWriterSettings ??= new XmlWriterSettings();
    using (var stream = new MemoryStream())
    {
        using (var writer = XmlWriter.Create(stream, xmlWriterSettings))
        {
            doc.Save(writer);
        }
        var content = new ByteArrayContent(stream.GetBuffer(), 0, (int)stream.Length);
        content.Headers.ContentType = new MediaTypeHeaderValue("text/xml");
        return content;
    }
}
// Usage (XDocument -> UTF-8 bytes)
var content = doc.ToByteArrayContent();
var response = await httpClient.PostAsync("/someurl", content);
// Alternative (XDocument -> string -> UTF-8 bytes)
var content = new StringContent(doc.ToStringWithDeclaration(), Encoding.UTF8, "text/xml");
var response = await httpClient.PostAsync("/someurl", content);
