converting Xml to txt - c#

I am currently working with an XML file that keeps race information in XML format like so
<Row xmlns="Practice2a">
<RecordType>Qualifying Classification</RecordType>
<_x0030_02150Position>3</_x0030_02150Position>
<Class>250</Class>
<_x0030_02150MachineNo>11</_x0030_02150MachineNo>
<RiderName>Kevin James</RiderName>
<Machine>Honda</Machine>
<_x0030_02150ToDBehind>29.680</_x0030_02150ToDBehind>
<_x0030_02150BestLapSpeed>97.1415157615475</_x0030_02150BestLapSpeed>
<_x0030_02150ToDBestLapTime>5:32.274</_x0030_02150ToDBestLapTime>
<_x0030_02150BestOnLap>7</_x0030_02150BestOnLap>
</Row>
I want to create a plain txt file with just some of the information , I just want in kind off in a table format e.g
pos Name racetime and BestLaptime
I have attempted to remove the tags from the file and create a txt file so now I get
I create a line count to possibly use as delimiters for extracting the right fields.
139 Qualifying Classification
140 3
141 250
142 11
Driver Name: Machine Type: Kevin James
145 Honda
146 29.680
147 97.1415157615475
148 5:32.274
My code is getting quite out of hand and I am wondering if there is a much better way to achieve this rather than adding 14 to count each time , that's how i am displaying Driver Name:" instead of a number.
Any pointers as to how you would go about this would be a great insight.

A quick solution would be to read your xml in XmlDocument (or even simpler to a dataset), and generate the text file in your c# code.
See:
Walkthrough: Reading XML Data into a Dataset
Read XML Attribute using XmlDocument
Alternate approach would be to define an xslt to reformat your xml to layout of your choice. Normally its a preferred approach for generating html docs from your xml datam, though could be used to transform into normal text reports. You can read more about it on
W3School- XSLT
XSLT Basics

You can parse and format it using LinqToXml:
using System.Xml.Linq
// [...]
// Load the XML, either from a string or from an url
var doc = XDocument.Parse(xmlString);
// or
var doc = XDocument.Load(new Uri(#"C:\myFile.xml"));
var result = String.Empty;
foreach (var el in doc.Descendants())
{
// do something with it and format the data to your liking... e.g.
result += FormatElement(el);
}
// or more compact
doc.Descendants().ToList().ForEach(el => result += FormatElement(el));
// [...]
private string FormatElement(XElement el)
{
return String.Format("{0}: {1}", el.Name, el.Value);
}
Of course you need to adapt the FormatElement method to your needs, but this scheme should work.

XML is designed so that some features are required and may be depended-on, while other things are the choice of the author of the document. Your scheme seems to get those features exactly backwards! Which line something appears on is not guaranteed by the standard. An entire legal XML file may occupy a single line.
The whole point of XML is that the use of a standard format allows for the use of common tools. The .NET Framework has (several) XML parsing components built in to it that can read this file and give you exactly the information you are looking for. You can then output that information as text in whatever format you like.
There is no reason to parse it yourself.
And remember, if your solution includes RegEx, then you've already lost.
(I'm kidding about that last part. Sort of.)

Related

Parsing XML file in C# - how to report errors

First I load the file in a structure
XElement xTree = XElement.Load(xml_file);
Then I create an enumerable collection of the elements.
IEnumerable<XElement> elements = xTree.Elements();
And iterate elements
foreach (XElement el in elements)
{
}
The problem is - when I fail to parse the element (a user made a typo or inserted a wrong value) - how can I report exact line in the file?
Is there any way to tie an element to its corresponding line in the file?
One way to do it (although not a proper one) –
When you find a wrong value, add an invalid char (e.g. ‘<’) to it.
So instead of: <ExeMode>Bla bla bla</ExeMode>
You’ll have: <ExeMode><Bla bla bla</ExeMode>
Then load the XML again with try / catch (System.Xml.XmlException ex).
This XmlException has LineNumber and LinePosition.
If there is a limited set of acceptable values, I believe XML Schemas have the concept of an enumerated type -- so write a schema for the file and have the parser validate against that. Assuming the parser you're using supports Schemas, which most should by now.
I haven't looked at DTDs in decades, but they may have the same kind of capability.
Otherwise, you would have to consider this semantic checking rather than syntactic checking, and that makes it your application's responsibility. If you are using a SAX parser and interpreting the data as you go, you may be able to get the line number; check your parser's features.
Otherwise the best answer I've found is to report the problem using an xpath to the affected node/token rather than a line number. You may be able to find a canned solution for that, either as a routine to run against a DOM tree or as a state machine you can run alongside your SAX code to track the path as you go.
(All the "maybe"s are because I haven't looked at what's currently available in a very long time, and because I'm trying to give an answer that is valid for all languages and parser implementations. This should still get you pointed in some useful directions.)

Invalid characters in XML values [duplicate]

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against some actual customer data, and it looks like the other product is allowing input from users that should be considered invalid. Anyways, I still have to try and figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder and I'm getting an error on input that looks like the following.
<xml>
...
<description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
...
</xml>
As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). Now, this description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...)
I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?
That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.
An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
Options, most desirable first:
Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)
Use a tolerant markup parser to cleanup the problem ahead of parsing as XML:
Standalone: xmlstarlet has robust recovering and repair capabilities credit: RomanPerekhrest
xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.
Python: Beautiful Soup is Python-based. See notes in the Differences between parsers section. See also answers to this question for more
suggestions for dealing with not-well-formed markup in Python,
including especially lxml's recover=True option.
See also this answer for how to use codecs.EncodedFile() to cleanup illegal characters.
Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.
.NET:
XmlReaderSettings.CheckCharacters can
be disabled to get past illegal XML character problems.
#jdweng notes that XmlReaderSettings.ConformanceLevel can be set to
ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.
#jdweng also reports that XmlReader.ReadToFollowing() can sometimes
be used to work-around XML syntactical issues, but note
rule-breaking warning in #3 below.
Microsoft.Language.Xml.XMLParser is said to be “error-tolerant”.
Go: Set Decoder.Strict to false as shown in this example by #chuckx.
PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.
Ruby: Nokogiri supports “Gentle Well-Formedness”.
R: See htmlTreeParse() for fault-tolerant markup parsing in R.
Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
Process the data as text manually using a text editor or
programmatically using character/string functions. Doing this
programmatically can range from tricky to impossible as
what appears to be
predictable often is not -- rule breaking is rarely bound by rules.
For invalid character errors, use regex to remove/replace invalid characters:
PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000‌​}-\u{FFFD}", ' ')
JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
For ampersands, use regex to replace matches with &: credit: blhsin, demo
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
Note that the above regular expressions won't take comments or CDATA
sections into account.
A standard XML parser will NEVER accept invalid XML, by design.
Your only option is to pre-process the input to remove the "predictably invalid" content, or wrap it in CDATA, prior to parsing it.
The accepted answer is good advice, and contains very useful links.
I'd like to add that this, and many other cases of not-wellformed and/or DTD-invalid XML can be repaired using SGML, the ISO-standardized superset of HTML and XML. In your case, what works is to declare the bogus THIS-IS-PART-OF-DESCRIPTION element as SGML empty element and then use eg. the osx program (part of the OpenSP/OpenJade SGML package) to convert it to XML. For example, if you supply the following to osx
<!DOCTYPE xml [
<!ELEMENT xml - - ANY>
<!ELEMENT description - - ANY>
<!ELEMENT THIS-IS-PART-OF-DESCRIPTION - - EMPTY>
]>
<xml>
<description>blah blah
<THIS-IS-PART-OF-DESCRIPTION>
</description>
</xml>
it will output well-formed XML for further processing with the XML tools of your choice.
Note, however, that your example snippet has another problem in that element names starting with the letters xml or XML or Xml etc. are reserved in XML, and won't be accepted by conforming XML parsers.
IMO these cases should be solved by using JSoup.
Below is a not-really answer for this specific case, but found this on the web (thanks to inuyasha82 on Coderwall). This code bit did inspire me for another similar problem while dealing with malformed XMLs, so I share it here.
Please do not edit what is below, as it is as it on the original website.
The XML format, requires to be valid a unique root element declared in the document.
So for example a valid xml is:
<root>
<element>...</element>
<element>...</element>
</root>
But if you have a document like:
<element>...</element>
<element>...</element>
<element>...</element>
<element>...</element>
This will be considered a malformed XML, so many xml parsers just throw an Exception complaining about no root element. Etc.
In this example there is a solution on how to solve that problem and succesfully parse the malformed xml above.
Basically what we will do is to add programmatically a root element.
So first of all you have to open the resource that contains your "malformed" xml (i. e. a file):
File file = new File(pathtofile);
Then open a FileInputStream:
FileInputStream fis = new FileInputStream(file);
If we try to parse this stream with any XML library at that point we will raise the malformed document Exception.
Now we create a list of InputStream objects with three lements:
A ByteIputStream element that contains the string: <root>
Our FileInputStream
A ByteInputStream with the string: </root>
So the code is:
List<InputStream> streams =
Arrays.asList(
new ByteArrayInputStream("<root>".getBytes()),
fis,
new ByteArrayInputStream("</root>".getBytes()));
Now using a SequenceInputStream, we create a container for the List created above:
InputStream cntr =
new SequenceInputStream(Collections.enumeration(str));
Now we can use any XML Parser library, on the cntr, and it will be parsed without any problem. (Checked with Stax library);

Validating an XML file without an xsd file

I found a lot of great questions on this subject. Unfortunately the answers all say to use an xsd file. I made an xsd file from an xml file using xsd.exe. I copied code from here and pasted into Visual studio and I got an error on the first line.
Not wanted to spend time figuring out why it would not run I decided to code the validation myself.
Here are the two points I am using:
Each left caret will have a right caret, so at the end of the file their will be an equal amount of left and right carets.
At the end of the file, if I either take the amount of left carets, or the amount of right carets subtract one from the total (Because the header does not have a backslash) and divide the total by two, I get the number of slashes.
I am encountering some problems though.
I am using string.count() This method also counts carets that are in the attributes (Which I do not want).
I compute the expected number of backslashes when I am done reading the file. If the numbers do not match I write "The expected number of slashes do not match" But I don't know where it is in the file.
I can't think of a way to fix these problems at the moment.
Does anyone have a better way to validate an xml file without using an xsd file?
Take note that WELL FORMED XML is different to VALID XML.
XML that adheres to the XML standard is considered well formed, while XML that adheres to a DTD is considered VALID.
If you just want to check if XML is well formed try this:
try
{
var file = "Your xml path";
var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Ignore, XmlResolver = null };
using (var reader = XmlReader.Create(new StreamReader(file), settings))
{
var document = new XmlDocument();
document.Load(reader);
}
}
catch (Exception exc)
{
//show the exception here
}
P.S: Well formedness of an XML is always a prerequisite of a valid XML.
Hope it helps!
When you say "caret", I think you must be talking about the "<" and ">" symbols, which in the XML world are usually referred to as "angle brackets".
When you talk about checking whether the carets match up, you are therefore talking about whether the file conforms to XML syntax. That's called "well-formedness checking" in the XML world. Validation is something different and deeper. You need a schema (either an XSD schema or some other kind) for validation, but for well-formedness checking all you need is an XML parser.
Don't try to implement the well-formedness checking yourself. (a) because it's not easy, (b) because parsers are readily available off-the-shelf, and (c) because you clearly don't have a very advanced understanding of the problem. Just run your file through an XML parser and it will do the job for you.

using linq for this:

i have just started learning linq because i like the sound of it. and so far i think im doing okay at it.
i was wondering if Linq could be used to find the following information in a file, like a group at a time or something:
Control
Text
Location
Color
Font
Control Size
example:
Label
"this is text that will
appear on a label control at runtime"
23, 77
-93006781
Tahoma, 9.0, Bold
240, 75
The above info will be in a plain file and wil have more than one type of control and many different sizes, font properties etc associated with each control listed. is it possible in Linq to parse the info in this txt file and then convert it to an actual control?
i've done this using a regex but regex is too much of a hassle to update/maintain.
thanks heaps
jase
Edit:
Since XML is for structured data, would Linq To XML be appropriate for this task? And would you please share with me any helpful/useful links that you may have? (Other than MSDN, because I am looking at that now. :))
Thank you all
If you are generating this data yourself, then I HIGHLY recommend you store this in an XML file. Then you can use XElement to parse this.
EDIT: This is exactly the type of thing that XML is designed for, structured data.
EDIT EDIT: In response to the second question, Linq to XML is exactly what your looking for:
For an example, here is a couple of links to code I have written that parses XML using XElements. It also creates a XML document.
Example 1 - Loading and Saving: have a look under the FromXML() and ToXML() methods.
Example 2 - Parsing a large XML doc: have a look under the ParseXml method.
Hope these get you going :D
LINQ is good for filtering off rows, selecting relevant columns etc.
Even if you use LINQ for this, you will still need regex to select the relevant text and do the parsing.

How to tell if a string is xml?

We have a string field which can contain XML or plain text. The XML contains no <?xml header, and no root element, i.e. is not well formed.
We need to be able to redact XML data, emptying element and attribute values, leaving just their names, so I need to test if this string is XML before it's redacted.
Currently I'm using this approach:
string redact(string eventDetail)
{
string detail = eventDetail.Trim();
if (!detail.StartsWith("<") && !detail.EndsWith(">")) return eventDetail;
...
Is there a better way?
Are there any edge cases this approach could miss?
I appreciate I could use XmlDocument.LoadXml and catch XmlException, but this feels like an expensive option, since I already know that a lot of the data will not be in XML.
Here's an example of the XML data, apart from missing a root element (which is omitted to save space, since there will be a lot of data), we can assume it is well formed:
<TableName FirstField="Foo" SecondField="Bar" />
<TableName FirstField="Foo" SecondField="Bar" />
...
Currently we are only using attribute based values, but we may use elements in the future if the data becomes more complex.
SOLUTION
Based on multiple comments (thanks guys!)
string redact(string eventDetail)
{
if (string.IsNullOrEmpty(eventDetail)) return eventDetail; //+1 for unit tests :)
string detail = eventDetail.Trim();
if (!detail.StartsWith("<") && !detail.EndsWith(">")) return eventDetail;
XmlDocument xml = new XmlDocument();
try
{
xml.LoadXml(string.Format("<Root>{0}</Root>", detail));
}
catch (XmlException e)
{
log.WarnFormat("Data NOT redacted. Caught {0} loading eventDetail {1}", e.Message, eventDetail);
return eventDetail;
}
... // redact
If you're going to accept not well formed XML in the first place, I think catching the exception is the best way to handle it.
One possibility is to mix both solutions. You can use your redact method and try to load it (inside the if). This way, you'll only try to load what is likely to be a well-formed xml, and discard most of the non-xml entries.
If your goal is reliability then the best option is to use XmlDocument.LoadXml to determine if it's valid XML or not. A full parse of the data may be expensive but it's the only way to reliably tell if it's valid XML or not. Otherwise any character you don't examine in the buffer could cause the data to be illegal XML.
Depends on how accurate a test you want. Considering that you already don't have the official <xml, you're already trying to detect something that isn't XML. Ideally you'd parse the text by a full XML parser (as you suggest LoadXML); anything it rejects isn't XML. The question is, do you care if you accept a non-XML string? For instance,
are you OK with accepting
<the quick brown fox jumped over the lazy dog's back>
as XML and stripping it? If so, your technique is fine. If not, you have to decide how tight a test you want and code a recognizer with that degree of tightness.
How is the data coming to you? What is the other type of data surrounding it? Perhaps there is a better way; perhaps you can tokenise the data you control, and then infer that anything that is not within those tokens is XML, but we'd need to know more.
Failing a cute solution like that, I think what you have is fine (for validating that it starts and ends with those characters).
We need to know more about the data format really.
If the XML contains no root element (i.e. it's an XML fragment, not a full document), then the following would be perfectly valid sample, as well - but wouldn't match your detector:
foo<bar/>baz
In fact, any text string would be valid XML fragment (consider if the original XML document was just the root element wrapping some text, and you take the root element tags away)!
try
{
XmlDocument myDoc = new XmlDocument();
myDoc.LoadXml(myString);
}
catch(XmlException ex)
{
//take care of the exception
}

Categories