XML stream processing - c#

I've been looking through the page but I was not able to find any answer to my problem:
I'm downloading an XML file from a server through a stream and processing it using XmlReader:
XmlReader xml = XmlReader.Create(XMLstream, settings);
while (xml.Read())
{
...
}
So the XML is being downloaded while it's being parsed, which is what I want for efficiency reasons (I don't usually need to read the full document).
The problem is that I sometimes hit an issue that has already come up on this website: the XML contains non-standard characters, which leads to exceptions.
I know how to filter these characters with XmlConvert.IsXmlChar(), and one way to perform this clean-up would be: download the full XML -> filter -> pass the filtered string to XmlReader.
The problem with this method is that I have to download the full XML file, and sometimes this could be up to 10MB in slow connections!
Is there any method (callback or something) to filter the information just before the chunk of XML code is parsed?
As far as I know, XmlReader.Read() manages the stream and then parses the result, but I need something in the middle.
Any idea?
Thank you very much in advance.

Related

Programmatically tell the difference between data

I'm converting files in bulk to XML, and each file is either XML, JSON, CSV or PSV. To do the conversion I need to know what data type the file is without looking at the file extension (some are coming from APIs). Someone suggested that I try parsing each file as each of the types until one succeeds, but that is pretty inefficient, and CSV can't be easily detected that way as it is essentially just a text file (same as PSV).
Does anyone have any ideas on what I can do? Thanks.
You can have some kind of "pre-parsing":
Whether it starts with an XML declaration or directly with the root node, the first character of an XML file should be <.
The first character of a JSON file can only be { if the JSON is built on an object, or [ if the JSON is built on an array.
For CSV and PSV (I guess PSV stands for Point-Separated Values?), each line of the file represents a specific record.
So by checking the first character, you can tell whether attempting XML or JSON parsing is pointless.
Parsing the first line of the file should be enough to decide if the file format is CSV or PSV.
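The pre-parsing idea above can be sketched roughly as follows. This is only an illustration: the names DataFormat, FormatSniffer and Sniff are hypothetical, and the PSV delimiter is an assumption (a pipe by default, passed as a parameter because the original question does not say what PSV stands for). Leading whitespace and a byte order mark are skipped before the first character is inspected.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical names; not part of any library.
public enum DataFormat { Unknown, Xml, Json, Csv, Psv }

public static class FormatSniffer
{
    // psvDelimiter is an assumption: '|' is used unless the caller says otherwise.
    public static DataFormat Sniff(string content, char psvDelimiter = '|')
    {
        // Skip a BOM and leading whitespace before looking at the first character.
        string text = content.TrimStart('\uFEFF', ' ', '\t', '\r', '\n');
        if (text.Length == 0) return DataFormat.Unknown;

        if (text[0] == '<') return DataFormat.Xml;
        if (text[0] == '{' || text[0] == '[') return DataFormat.Json;

        // Neither XML nor JSON: count delimiters on the first line only.
        string firstLine = new StringReader(text).ReadLine() ?? string.Empty;
        int commas = firstLine.Count(c => c == ',');
        int others = firstLine.Count(c => c == psvDelimiter);

        if (commas == 0 && others == 0) return DataFormat.Unknown;
        return commas >= others ? DataFormat.Csv : DataFormat.Psv;
    }
}
```

A first-character check like this is only a heuristic. It tells you which parser is worth trying, not that the file is well formed, so the real parse can still fail.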

C# XmlDocument with custom formatting

This code:
XmlNode columnNode = null;
columnNode = xmlDoc.CreateElement("SYSID");
columnNode.InnerText = ""; // Empty string
newRowNode.AppendChild(columnNode);
...does this:
<SYSID>
</SYSID>
And I would like to have this, when string is empty:
<SYSID></SYSID>
Is there any solution?
If you have another tool that requires that format, then the other tool is wrong - it is incapable of reading XML. So if you have control over the other tool, I'd suggest fixing it rather than trying to coerce your code into matching it.
If you can't fix the other tool...
If you're just building a Document to write it out to disk, then you can use a stream and write the elements directly yourself (as simple text). This will be faster (and may well be easier) than using an XmlDoc.
As an improvement on that, you may be able to use an XmlWriter to write elements, but when you go to write an empty element, write raw text to the stream (i.e. writer.WriteRaw("<SYSID></SYSID>\n")) so that you control the formatting for those particular elements.
If you need to build an in-memory XmlDocument, then to a large extent you have to put up with the formatting that it uses when you ask it to serialize to disk (aside from basic settings like PreserveWhitespace, you're asking the document to deal with storing the information, and so you lose a lot of control over the functionality that the XmlDocument encapsulates). The best suggestion I can think of in this case would be to write the XmlDocument to a MemoryStream and then post-process that memory stream to remove newlines from within empty elements. (Yuck!)
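The XmlWriter suggestion above can be sketched like this. It shows WriteRaw as described in the answer and, as an alternative not mentioned there, XmlWriter.WriteFullEndElement, which forces an explicit closing tag even when the element has no content. The class name and the ROW and NAME element names are made up for the example. Indentation is left off so the output is easy to predict.

```csharp
using System.Text;
using System.Xml;

// Hypothetical helper; ROW and NAME are illustrative element names.
public static class EmptyElementWriter
{
    public static string WriteRow()
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };

        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("ROW");

            // Option 1 (from the answer): write the empty element as raw text.
            writer.WriteRaw("<SYSID></SYSID>");

            // Option 2: let the writer emit <NAME></NAME> instead of <NAME />.
            writer.WriteStartElement("NAME");
            writer.WriteFullEndElement();

            writer.WriteEndElement(); // closes ROW
        }

        return sb.ToString(); // <ROW><SYSID></SYSID><NAME></NAME></ROW>
    }
}
```

WriteRaw bypasses the writer's checks, so anything passed to it has to be well formed already. WriteFullEndElement keeps those checks in place.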

XML exception on loading

I have an XML file. When I try to load it using .LOAD methods, I get this exception:
System.Xml.XmlException: data at root level invalid at position 1 line 1.
What I have at the beginning of the XML file is this:
<?xml version="1.0" standalone="yes" ?>
I think the string passed to LoadXml is constructed incorrectly, in one of these ways:
ignoring the BOM and forcing the wrong encoding
reading the BOM as the first character
constructing it by hand altogether, so that the first character is not <
Based on the last comment, I bet the code looks something like this (or some variation of it), instead of loading the XML directly from a Stream object (which would handle the encoding properly):
// My guess of how wrong code looks like! Not a solution!!!!
StreamReader r = new StreamReader(path, System.Text.Encoding.Unicode);
string xml = r.ReadToEnd();
XmlDocument d = new XmlDocument();
d.LoadXml(xml);
You should review the code that constructs the string you are passing to XmlDocument.LoadXml and check whether it is indeed valid XML. I'd recommend creating a small program that models the failing code and investigating its behavior.
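For contrast with the guessed code above, here is a minimal sketch of loading directly from a Stream, so that the parser detects the encoding from the BOM or the XML declaration itself. The class and method names are hypothetical.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper, shown only to illustrate loading from a Stream.
public static class XmlLoading
{
    public static XmlDocument LoadFromStream(Stream input)
    {
        var doc = new XmlDocument();
        // XmlDocument.Load(Stream) works on the raw bytes, so no manual
        // decoding into a string (and no stray BOM character) is involved.
        doc.Load(input);
        return doc;
    }
}
```

With a file on disk this would be called as XmlLoading.LoadFromStream(File.OpenRead(path)), where path is whatever file the original code was reading.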
Position 1 line 1 suggests a problem with the very first char it encounters.
I would suggest firstly confirming that no leading whitespace/other char is in there (sounds silly, but they can creep in easily).
It could also be a char encoding issue, causing that first char to not be read as a '<'.
I bet it's not there. I've found that when I've gotten this error the file or path is missing/incorrect.
Thanks for pouring in your suggestions. The problem was on the build server: the XML file was being pulled from a field called contents in a table called File, and I access the XML using the FileID. But the FileID on the build server is not the same as the FileID in my local database, so on the build server I was pulling the XML from a test record that held dummy data. Hence the error. I hope that makes sense. I have fixed the issue by dynamically finding the FileID and querying the contents.

C# Use Linq to Extract a single XML attribute for each XML file in a directory

How do I use Linq to extract a single XML attribute from each XML file in a directory and put that attribute value in a C# list? Do I have to loop through each file one by one? The XML files are quite large, so I'd like to do this without loading the entire file into memory.
Thanks,
j
Unless the files are massive (100 MB+) I would be unable to turn down the elegance of this code:
var result = Directory.GetFiles(filePath)
.Select(path => XDocument.Load(path))
.Select(doc => doc.Root.Element("A").Attribute("B").Value)
.ToList();
I really hope your XML files are not that big though...
You do have to go through every file, and this will mean at least parsing enough of the XML content of each file to get to the required attribute.
XDocument (i.e. LINQ to XML) will parse and load the complete document in each case, so you might be better off using an XmlReader instance directly. This will require more work: you will have to read the XML nodes until you get to the right one, keeping track of where you are.
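A rough sketch of the XmlReader approach, reusing the element name "A" and attribute name "B" from the LINQ example above. The class and method names are hypothetical. One difference to be aware of: XmlReader.ReadToFollowing stops at the first element with the given name anywhere in document order, whereas doc.Root.Element("A") only looks at direct children of the root.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper names; only XmlReader and the file APIs are real.
public static class AttributeScanner
{
    // Returns the attribute value from the first matching element, or null
    // if the element or the attribute is missing.
    public static string ReadFirstAttribute(Stream input, string elementName, string attributeName)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            // Parsing stops here; the rest of the file is never read.
            return reader.ReadToFollowing(elementName)
                ? reader.GetAttribute(attributeName)
                : null;
        }
    }

    public static List<string> ReadFromDirectory(string directory, string elementName, string attributeName)
    {
        var values = new List<string>();
        foreach (string path in Directory.EnumerateFiles(directory))
        {
            using (FileStream file = File.OpenRead(path))
            {
                string value = ReadFirstAttribute(file, elementName, attributeName);
                if (value != null) values.Add(value);
            }
        }
        return values;
    }
}
```

Each file still has to be opened one by one, but only as much of it is parsed as it takes to reach the element.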

XML Exception: Invalid Character(s)

I am working on a small project that is receiving XML data in string form from a long running application. I am trying to load this string data into an XDocument (System.Xml.Linq.XDocument), and then from there do some XML Magic and create an xlsx file for a report on the data.
On occasion, I receive the data that has invalid XML characters, and when trying to parse the string into an XDocument, I get this error.
[System.Xml.XmlException]
Message: '?', hexadecimal value 0x1C, is an invalid character.
Since I have no control over the remote application, you could expect ANY kind of character.
I am well aware that XML has a way where you can put characters in it such as &#x1C or something like that.
If at all possible I would SERIOUSLY like to keep ALL the data. If not, then so be it.
I have thought about editing the response string programmatically, then going back and trying to re-parse should an exception be thrown, but I have tried a few methods and none of them seem successful.
Thank you for your thought.
Code is something along the line of this:
TextReader tr;
XDocument doc;
string response; //XML string received from server.
...
tr = new StringReader (response);
try
{
doc = XDocument.Load(tr);
}
catch (XmlException e)
{
//handle here?
}
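You can use an XmlReader and set the XmlReaderSettings.CheckCharacters property to false. This will let you read the XML despite the invalid characters, and from there you can pass it to an XmlDocument or XDocument object. (Per the documentation, for text input this setting only turns off checking of character entity references such as &#x1C;; raw invalid characters in the text itself are still rejected.)
You can read a little more about it in my blog.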
To load the data to a System.Xml.Linq.XDocument it will look a little something like this:
XDocument xDocument = null;
XmlReaderSettings xmlReaderSettings = new XmlReaderSettings { CheckCharacters = false };
using (XmlReader xmlReader = XmlReader.Create(filename, xmlReaderSettings))
{
xmlReader.MoveToContent();
xDocument = XDocument.Load(xmlReader);
}
More information can be found here.
XML can handle just about any character, but there are ranges, control codes and such, that it won't.
Your best bet, if you can't get them to fix their output, is to sanitize the raw data you're receiving. You need to replace illegal characters with the character reference format you noted.
(You can't even resort to CDATA, as there is no way to escape these characters there.)
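One way this sanitizing step might look, as a sketch rather than a definitive fix: every character that XmlConvert.IsXmlChar rejects is replaced with a numeric character reference, and the result is parsed with CheckCharacters set to false so that those references are accepted. The class and method names are hypothetical. It assumes the invalid characters only occur in text or attribute values. Inside CDATA sections, comments or element names a reference would not be interpreted, as noted above. Surrogate code units are passed through untouched so that valid pairs survive.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper; a sketch under the assumptions stated in the lead-in.
public static class XmlSanitizer
{
    // Replaces characters that are not legal in XML 1.0 with &#x..; references.
    public static string EscapeInvalidChars(string raw)
    {
        var sb = new StringBuilder(raw.Length);
        foreach (char c in raw)
        {
            if (XmlConvert.IsXmlChar(c) || char.IsSurrogate(c))
                sb.Append(c);
            else
                sb.Append("&#x").Append(((int)c).ToString("X")).Append(';');
        }
        return sb.ToString();
    }

    // Parses the escaped text; CheckCharacters = false lets the reader
    // accept references to characters such as 0x1C.
    public static XDocument ParseLenient(string raw)
    {
        var settings = new XmlReaderSettings { CheckCharacters = false };
        using (var text = new StringReader(EscapeInvalidChars(raw)))
        using (XmlReader reader = XmlReader.Create(text, settings))
        {
            return XDocument.Load(reader);
        }
    }
}
```

The data is kept, but the resulting document still holds characters that are not legal XML, so writing it back out would need an XmlWriter with CheckCharacters turned off as well.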
Would something as described in this blog post be helpful?
Basically, he creates a sanitizing xml stream.
If your input is not XML, you should use something like Tidy or Tagsoup to clean the mess up.
They would take any input and try, hopefully, to make a useful DOM from it.
I don't know what the equivalent libraries are called on the .NET side.
Garbage In, Garbage Out. If the remote application is sending you garbage, then that's all you'll get. If they think they're sending XML, then they need to be fixed. In this case, you're not doing them any favors by working around their bug.
You should also make sure of what they think they're sending. What did the %1C mean to them? What did they want it to be?
IMHO the best solution would be to modify the code/program/whatever produced the invalid XML that is being fed to your program. Unfortunately this is not always possible. In this case you need to escape all characters < 0x20 before trying to load the document.
If you really can't fix the source XML data, consider taking an approach like I described in this answer. Basically, you create a TextReader subclass (e.g StripTextReader) that wraps an existing TextReader (tr) and discards invalid characters.
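A minimal sketch of such a wrapper, assuming that silently dropping the offending characters is acceptable. StripTextReader is the name used above; StripDemo and LoadLenient are made up for the example. Only Peek and Read are overridden, since the base TextReader implements its buffered Read on top of Read(). Surrogate code units are let through so that valid pairs are not destroyed, which means an unpaired surrogate would still reach the parser.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Wraps another TextReader and discards characters that are not legal XML.
public sealed class StripTextReader : TextReader
{
    private readonly TextReader _inner;

    public StripTextReader(TextReader inner) { _inner = inner; }

    private static bool IsAllowed(int c) =>
        XmlConvert.IsXmlChar((char)c) || char.IsSurrogate((char)c);

    public override int Peek()
    {
        int c;
        // Consume invalid characters until a valid one (or end of input) is next.
        while ((c = _inner.Peek()) != -1 && !IsAllowed(c)) _inner.Read();
        return c;
    }

    public override int Read()
    {
        int c;
        // Skip invalid characters; return the first valid one or -1.
        while ((c = _inner.Read()) != -1 && !IsAllowed(c)) { }
        return c;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}

// Hypothetical usage helper.
public static class StripDemo
{
    public static XDocument LoadLenient(string response)
    {
        using (var reader = new StripTextReader(new StringReader(response)))
        {
            return XDocument.Load(reader);
        }
    }
}
```

Because the filtering happens as characters are pulled, the same wrapper also fits the streaming case from the question at the top of the page: wrap the download stream in a StreamReader, wrap that in a StripTextReader, and pass the result to XmlReader.Create together with the settings. Nothing has to be downloaded in full first.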
It's a late answer, but it may help someone. When you read or serialize XML, it may have one invisible character at the beginning. XDocument doesn't like this invisible character.
So while reading the XML, just start reading from the first < character:
var myXml = XDocument.Parse(loadedString.Substring(loadedString.IndexOf("<")));
That's it and it loads just fine.
