I'm converting a large number of files to XML, and each file is either XML, JSON, CSV or PSV. To do the conversion I need to know what format each file is in without looking at the file extension (some come from APIs). Someone suggested that I try parsing each file as each type until one succeeds, but that is pretty inefficient, and CSV can't easily be validated by parsing because it is essentially just a text file (same as PSV).
Does anyone have any ideas on what I can do? Thanks.
You can have some kind of "pre-parsing":
Whether it starts with an XML declaration or directly with the root node, the first character of an XML file should be <.
First character of a JSON file can only be { if the JSON is built on an object, or [ if the JSON is built on an array.
For CSV and PSV (I guess PSV stands for pipe-separated values?), each line of the file represents a specific record.
So by checking the first character, you can tell when attempting XML and/or JSON parsing is pointless.
Parsing the first line of the file should be enough to decide if the file format is CSV or PSV.
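The pre-parsing idea above can be sketched roughly as follows. This is only a minimal sketch: the class and method names are made up for illustration, PSV is assumed to mean pipe-separated, and the delimiter count on the first line is a heuristic rather than a real parser.

```csharp
using System;
using System.IO;
using System.Linq;

public static class FormatSniffer
{
    // Hypothetical helper: returns "xml", "json", "csv", "psv" or "unknown"
    // after a quick look at the start of the content.
    public static string Detect(string content)
    {
        // Ignore a leading BOM character and whitespace before the first real character.
        string trimmed = content.TrimStart('\uFEFF', ' ', '\t', '\r', '\n');
        if (trimmed.Length == 0) return "unknown";

        char first = trimmed[0];
        if (first == '<') return "xml";
        if (first == '{' || first == '[') return "json";

        // For delimited text, only the first line is inspected.
        string firstLine = new StringReader(trimmed).ReadLine() ?? "";
        int commas = firstLine.Count(c => c == ',');
        int pipes = firstLine.Count(c => c == '|');

        if (commas == 0 && pipes == 0) return "unknown";
        // Assumption: PSV means pipe-separated values.
        return pipes > commas ? "psv" : "csv";
    }
}
```

A CSV file whose first field happens to start with < or [ would be misclassified, so the result is best treated as a hint about which real parser to try first.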
Related
I'm trying to print a string from a binary file to the screen using XAML labels, but when I display the file content I get a single "corrupted" character instead of the entire file content.
I think the problem is in reading the file. I can already change the label content using the most basic technique, and it has worked pretty well until today:
label.Text = mystring ;
The fact is, I have data in my binary files that isn't text (some random data that I don't care about) located at the start of the file. My theory is that my program starts reading, hits a non-ASCII character and stops reading.
I read using the File class, which may be the wrong thing:
label.Text = File.ReadAllText(my_file);
So I'm stuck now. I don't know exactly what I'm supposed to do.
Hope you can help me :D
I can't tell much without looking at the text, but it seems you need to specify the encoding.
Something like this:
string myText = File.ReadAllText(path, Encoding.Default);
You need to know how your binary file is structured, and you need to know the encoding of the strings. A Unicode text file often has a marker in its first two or three bytes (a byte order mark) that identifies its encoding. This way the system can tell whether it's UTF-8, UTF-16, ...
If you try to read a binary file, this information is not there. Instead the reading process will most probably find unexpected binary data, so you cannot read a binary file as text. If your file is structured so that binary data comes first and only text follows, just skip the first part and start reading at the start of the second part. But I don't think it is that easy:
if it really is binary data, chances are that the file structure is much more complicated and you need to do more work to read it.
if only the first two bytes are binary data, then maybe it's a text file and you can read it without problems; you may only need to pass the right encoding to the reading function.
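For the simple case described above (binary data first, text afterwards), a minimal sketch could look like this. The helper name, the header length and the choice of encoding are all assumptions; they depend entirely on how the file is really laid out.

```csharp
using System;
using System.IO;
using System.Text;

public static class BinaryTextReader
{
    // Hypothetical helper: decodes everything after the first headerLength bytes as text.
    public static string ReadTextAfterHeader(byte[] data, int headerLength, Encoding encoding)
    {
        if (headerLength < 0 || headerLength > data.Length)
            throw new ArgumentOutOfRangeException(nameof(headerLength));

        // Skip the binary part and decode only the remaining bytes.
        return encoding.GetString(data, headerLength, data.Length - headerLength);
    }

    // Convenience overload that reads the whole file as raw bytes first.
    public static string ReadTextAfterHeader(string path, int headerLength, Encoding encoding)
    {
        return ReadTextAfterHeader(File.ReadAllBytes(path), headerLength, encoding);
    }
}
```

Usage would be something like label.Text = BinaryTextReader.ReadTextAfterHeader(my_file, 16, Encoding.UTF8); where 16 is only a placeholder for the real size of the binary header.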
I am invoking a service that returns responses in XML format. The response doesn't follow the XML guidelines and contains some new lines and "\".
Due to the formatting issues, the deserialization is failing.
XML Format:
\r\n\r\n<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n<N><details><date>25042014</date><orderNumber>OrderNumber </orderNumber><Response>1</Response></details>
I worked around the problem by removing the new lines and "\" before deserialization, but I'm looking for a cleaner solution, if one exists.
The XML has to be well-formed before it can be deserialized (and, if a schema is used, it must match the XSD structure). The escape sequences and the leading new lines make the XML invalid, which in turn causes the deserialization to fail. As far as I know, there is no way around it except to read the content beforehand, remove the unwanted characters and sequences, and save it again, so that it can be successfully deserialized when read by an XmlDocument.
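The clean-up step can at least be kept in one small place. Here is a minimal sketch, assuming the response really contains literal backslash sequences (\r, \n and \") as in the sample above; the class and method names are made up for illustration.

```csharp
using System;
using System.Xml.Linq;

public static class ResponseCleaner
{
    // Hypothetical helper: removes literal "\r" and "\n" sequences, turns \" back
    // into plain quotes, trims surrounding whitespace, then parses the result.
    public static XDocument ParseEscapedXml(string raw)
    {
        string cleaned = raw
            .Replace("\\r", "")
            .Replace("\\n", "")
            .Replace("\\\"", "\"")
            .Trim();

        return XDocument.Parse(cleaned);
    }
}
```

Note that this would also strip a backslash followed by r or n from genuine text content. If the service is in fact returning the XML wrapped in a JSON string, decoding that string with a JSON library first would probably be cleaner, but that is only a guess about the service.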
I read XML files that sometimes contain elements like
<stringValue>text
text</stringValue>
XmlReader returns
text\ntext
for such strings.
So, when I rewrite the source XML later using XmlWriter I don't get the same strings (there is no newline character reference, such as &#10;, in them).
Should I worry about all this, or is it fine to allow the string to be changed this way?
I would worry about it, yes, because you're manipulating the data. This means that if you do a round trip through the XML document, the text formatting won't be the same.
You would need to make sure that the same formatting is persisted when saving back out to XML.
A character reference such as &#10; is the XML encoding for a newline character (\n). If your XML data has a newline in the text, then this notation is correct and the output from XmlWriter is correct. If the newline was not in the original XML data, note that I've been seeing an issue with IE10/IE11 where the XMLHttpRequest object inserts \r\n into the XML data.
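To see that the two notations carry the same text, here is a small sketch. It assumes the character reference in question is a line feed, written as &#10; (the original notation isn't visible above, so that is a guess), and the helper name is made up.

```csharp
using System;
using System.Xml.Linq;

public static class NewlineDemo
{
    // Hypothetical helper: parses a single element and returns its text value.
    public static string ReadValue(string xml)
    {
        return XElement.Parse(xml).Value;
    }
}
```

Both ReadValue("<stringValue>text&#10;text</stringValue>") and ReadValue("<stringValue>text\ntext</stringValue>") return the same string with a real line feed between the two words, so a parser sees no difference between the two files even though they differ byte for byte.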
I have an XML file. When I try to load it using .LOAD methods, I get this exception:
System.Xml.XmlException: data at root level invalid at position 1 line 1.
What I have at the beginning of the XML file is this:
<?xml version="1.0" standalone="yes" ?>
I think the string passed to LoadXml is constructed incorrectly, by either:
ignoring BOM and forcing wrong encoding
reading BOM as first character
constructed by hand altogether and first character is not <
Based on the last comment, I bet the code looks like the following (or some variation of it) instead of loading the XML directly from a Stream object (which would handle the encoding properly):
// My guess of how wrong code looks like! Not a solution!!!!
StreamReader r = new StreamReader(path, System.Text.Encoding.Unicode);
string xml = r.ReadToEnd();
XmlDocument d = new XmlDocument();
d.LoadXml(xml);
You should review the code that constructs the string you are passing to XmlDocument.LoadXml and check whether it is indeed valid XML. I'd recommend creating a small program that reproduces the failing code and investigating its behavior.
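For comparison, a minimal sketch of loading from a stream, so that the parser deals with the BOM and encoding itself instead of the code decoding to a string first. The helper name is made up, and the MemoryStream stands in for wherever the bytes really come from (a file, a database field, ...).

```csharp
using System;
using System.IO;
using System.Xml;

public static class XmlLoading
{
    // Hypothetical helper: hands the raw bytes to XmlDocument.Load(Stream),
    // which detects the encoding (including a BOM) on its own.
    public static XmlDocument LoadFromBytes(byte[] bytes)
    {
        var doc = new XmlDocument();
        using (var stream = new MemoryStream(bytes))
        {
            doc.Load(stream);
        }
        return doc;
    }
}
```

For a file on disk, calling doc.Load(path) directly should have the same effect.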
Position 1 line 1 suggests a problem with the very first char it encounters.
I would suggest firstly confirming that no leading whitespace/other char is in there (sounds silly, but they can creep in easily).
It could also be a char encoding issue, causing that first char to not be read as a '<'.
I bet it's not there. I've found that when I've gotten this error the file or path is missing/incorrect.
Thanks for all your suggestions. The problem was on the build server: the XML file was being pulled from a field called contents in a table called File. I am accessing the XML using the FileID, but the FileID there is not the same as the FileID in my local database. So, on the build server, I was pulling the XML from a test record which had dummy data, hence the error. Hope that makes sense. I have fixed the issue by dynamically finding the FileID and querying the contents.
I have an XML structure like this; some Student items contain invalid UTF-8 byte sequences, which may cause XML parsing to fail for the whole XML document.
What I want to do is filter out the Student items which contain invalid UTF-8 byte sequences, and keep the ones with valid byte sequences. Any advice or samples about how to do this in .NET (C# preferred)?
BTW: invalid byte sequences I mean => http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
<?xml version="1.0" encoding="utf-8"?>
<AllStudents>
<Student>
Mike
</Student>
<Student>
(Invalid name here)
</Student>
</AllStudents>
thanks in advance,
George
That's pretty hard to do. You won't get an XML parser to parse a document with invalid characters in it, so I think you're reduced to a couple of options:
Figure out why the encoding is wrong - a common problem is labeling the document as UTF-8 (or having no encoding declaration) when the document is actually written in Latin-1.
Take out the bad sections by hand.
Try and find a tag soup parser for .NET that will continue parsing after the error.
Reject the invalid XML document.
I don't know C#, so I'm afraid I can't give you code to do this, but the basic idea is to read the whole file as a UTF-8 text file, using a DecoderFallback to replace invalid sequences with either question mark characters or the Unicode character U+FFFD. Then write the file back out as a UTF-8 text file, and parse that.
Basically, you separate out the operation of "wiping out bad utf-8 sequences" from the operation of "parsing the xml file".
You should probably even be able to skip writing the file back out again before running the XML parser to read in the fixed data; there should be some way to write the file to an in-memory byte stream and parse that byte stream as XML. (Again, sorry for not knowing C#)
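In C#, the idea described above might be sketched like this. It is a minimal sketch, not a tested solution for the real data: the class and method names are made up, the document is assumed to be small enough to hold in memory, and any Student containing U+FFFD after decoding is treated as "invalid" (which would also drop a name that genuinely contained that character).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class Utf8Scrubber
{
    private const char Replacement = '\uFFFD';

    // Hypothetical helper: decodes leniently, parses, and returns only the
    // Student values that did not need any replacement character.
    public static List<string> ValidStudents(byte[] rawXml)
    {
        // UTF-8 decoder that substitutes U+FFFD for invalid byte sequences
        // instead of throwing or silently dropping them.
        Encoding lenientUtf8 = Encoding.GetEncoding(
            "utf-8",
            EncoderFallback.ReplacementFallback,
            new DecoderReplacementFallback(Replacement.ToString()));

        // GetString keeps a BOM as a leading U+FEFF character, so strip it before parsing.
        string text = lenientUtf8.GetString(rawXml).TrimStart('\uFEFF');
        XDocument doc = XDocument.Parse(text);

        return doc.Descendants("Student")
                  .Select(s => s.Value.Trim())
                  .Where(name => name.IndexOf(Replacement) < 0)
                  .ToList();
    }
}
```

Parsing straight from the decoded string avoids writing a cleaned-up file back out, which matches the in-memory suggestion above.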
This is very close to the XML encoding issue question.