Avoid skipping elements in xml reader - c#

Let's suppose that I have an XML document like this:
<articles>
<article>
<id>1</id>
<name>A1</name>
<price>10</price>
</article>
<article>
<id>2</id>
<name>A2</name>
</article>
<article>
<id>3</id>
<name>A3</name>
<price>30</price>
</article>
</articles>
As you can see, article A2 is missing its price tag.
I have a real-world case where I parse an XML file in which some tags are missing from some articles (which I didn't know beforehand). I wrote a very simple parser like this:
using (XmlReader reader = XmlReader.Create(new StringReader(myXml)))
{
    while (true)
    {
        bool articleExists = reader.ReadToFollowing("article");
        if (!articleExists) return;
        reader.ReadToFollowing("id");
        string id = reader.ReadElementContentAsString();
        reader.ReadToFollowing("name");
        string name = reader.ReadElementContentAsString();
        reader.ReadToFollowing("price");
        string price = reader.ReadElementContentAsString();
        //do something with these values
    }
}
But if there is no price tag in article A2, XmlReader will jump to the price tag in article A3, so articles get mixed up and some data is skipped, right?
How can I protect against this, so that a default value is used when a tag is absent from an article node?
I would still like to use XmlReader if possible. My real file is 200 MB, so I need a simple, fast and efficient solution that won't hang the system.
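One way to keep XmlReader for streaming while making missing tags harmless is to stop the reader at each article element, load only that one element with XNode.ReadFrom, and then look children up by name. The following is a minimal sketch under those assumptions, not a definitive implementation: the ArticleStreamer class, the ReadArticles method and the default price of "0" are names and values made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class ArticleStreamer
{
    // Streams <article> elements one at a time; only the current article is held in memory.
    // defaultPrice is a hypothetical fallback used when <price> is absent.
    public static IEnumerable<(string Id, string Name, string Price)> ReadArticles(
        TextReader input, string defaultPrice = "0")
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "article")
                {
                    // ReadFrom consumes the whole element and leaves the reader just
                    // after </article>, so Read() must not be called in this branch.
                    var article = (XElement)XNode.ReadFrom(reader);

                    // Casting a missing (null) XElement to string yields null,
                    // so ?? can supply the default.
                    yield return (
                        (string)article.Element("id"),
                        (string)article.Element("name"),
                        (string)article.Element("price") ?? defaultPrice);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    public static void Main()
    {
        string myXml = @"<articles>
  <article><id>1</id><name>A1</name><price>10</price></article>
  <article><id>2</id><name>A2</name></article>
  <article><id>3</id><name>A3</name><price>30</price></article>
</articles>";

        // Prints:
        // 1 A1 10
        // 2 A2 0
        // 3 A3 30
        foreach (var a in ReadArticles(new StringReader(myXml)))
            Console.WriteLine($"{a.Id} {a.Name} {a.Price}");
    }
}
```

Because each lookup is scoped to a single article, a missing price can never be satisfied by the next article's price. For a 200 MB file, pass a StreamReader over the file instead of a StringReader.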

Related

Retain HTML tags on JSON to XML conversion

I have a JSON object which I convert to XML using the following code:
private string ConvertFileToXml(string file)
{
string fileContent = File.ReadAllText(file);
XmlDocument doc = JsonConvert.DeserializeXmlNode(fileContent, "root");
// Retain html tags.
doc.InnerXml = HttpUtility.HtmlDecode(doc.InnerXml);
return XDocument.Parse(doc.InnerXml).ToString();
}
where the file contains the following JSON object:
{
"id": "2639",
"type": "www.stack.com",
"bodyXML": "\n<body><p>Democrats also want to “reinvigorate and modernise” US <ft-content type=\"http://www.stack.com/ontology/content/Article\" url=\"http://api.stack.com/content/d2c32614-61c6-11e7-91a7-502f7ee26895\">antitrust</ft-content> laws for a broad attack on corporations.</p>\n<p>Mr Schumer said the Democrats’ new look should appeal to groups that backed Mrs Clinton, such as the young and minority groups, and members of the white working-class who deserted Democrats for Mr Trump. </p>\n</body>",
"title": "Democrats seek to reclaim populist mantle from Donald Trump",
"standfirst": "New economic plan is pitched as an assault on growing corporate power",
"byline": "David J Lynch in Washington",
"firstPublishedDate": "2017-07-24T17:51:25Z",
"publishedDate": "2017-07-24T17:50:25Z",
"requestUrl": "http://api.stack.com/content/e8bec6dc-708d-11e7-aca6-c6bd07df1a3c",
"brands": [
"http://api.ft.com/things/dbb0bdae-1f0c-11e4-b0cb-b2227cce2b54"
],
"standout": {
"editorsChoice": false,
"exclusive": false,
"scoop": false
},
"canBeSyndicated": "yes",
"webUrl": "http://www.stack.com/cms/s/e8bec6dc-708d-11e7-aca6-c6bd07df1a3c.html"
}
and the output of the method generates this:
<root>
<id>2639</id>
<type>www.stack.com</type>
<bodyXML>
<p>Democrats also want to “reinvigorate and modernise” US <ft-content type="http://www.stack.com/ontology/content/Article" url="http://api.stack.com/content/d2c32614-61c6-11e7-91a7-502f7ee26895">antitrust</ft-content> laws for a broad attack on corporations.</p>
<p>Mr Schumer said the Democrats’ new look should appeal to groups that backed Mrs Clinton, such as the young and minority groups, and members of the white working-class who deserted Democrats for Mr Trump. </p>
</body></bodyXML>
<title>Democrats seek to reclaim populist mantle from Donald Trump</title>
<standfirst>New economic plan is pitched as an assault on growing corporate power</standfirst>
<byline>David J Lynch in Washington</byline>
<firstPublishedDate>2017-07-24T17:51:25Z</firstPublishedDate>
<publishedDate>2017-07-24T17:50:25Z</publishedDate>
<requestUrl>http://api.stack.com/content/e8bec6dc-708d-11e7-aca6-c6bd07df1a3c</requestUrl>
<brands>http://api.ft.com/things/dbb0bdae-1f0c-11e4-b0cb-b2227cce2b54</brands>
<standout>
<editorsChoice>false</editorsChoice>
<exclusive>false</exclusive>
<scoop>false</scoop>
</standout>
<canBeSyndicated>yes</canBeSyndicated>
<webUrl>http://www.stack.com/cms/s/e8bec6dc-708d-11e7-aca6-c6bd07df1a3c.html</webUrl>
</root>
Within the original "bodyXML" of the JSON, there is HTML text with HTML tags but they get crushed into HTML entities after the conversion. What I want to do is retain these HTML tags after conversion.
How do I do this?
Help would be much appreciated!
I don't think it's possible to have the 'encoded' HTML tags in the inner text of an XML node.
But it's possible to HTML-decode the inner text of that XML node after you parse the XmlDocument.
This will get you the text with all the HTML tags intact.
E.g.:
private static string ConvertFileToXml()
{
string fileContent = File.ReadAllText("text.json");
XmlDocument doc = JsonConvert.DeserializeXmlNode(fileContent, "root");
return System.Web.HttpUtility.HtmlDecode(doc.SelectSingleNode("root").SelectSingleNode("bodyXML").InnerText);
}
Namespace required : System.Web

Separating portions of XML file with text in C#

I need to write XML files with a tilde as a separator between portions, like so:
....
<Company>
<CompanyDetail>Blah</CompanyDetail>
<Phone>0000000000</Phone>
</Company>
~
<Company>
<CompanyDetail>Blah</CompanyDetail>
<Phone>0000000000</Phone>
</Company>
....
The way to do that normally in C# would be along the lines of
XmlTextWriter writer = new XmlTextWriter(fileName, Encoding.UTF8);
writer.Formatting = Formatting.Indented;
writer.WriteStartDocument();
int remainingCompanies = companyList.Count;
foreach (Company company in companyList)
{
    writer.WriteStartElement("Company");
    writer.WriteStartElement("CompanyDetail");
    writer.WriteString(company.companyDetail.ToString());
    writer.WriteEndElement();
    writer.WriteStartElement("Phone");
    writer.WriteString(company.phone.ToString());
    writer.WriteEndElement();
    writer.WriteEndElement();
    // Write the separator between companies, but not after the last one
    if (remainingCompanies-- > 1)
    {
        writer.WriteString("\n~\n");
    }
}
But whenever I do this, the resulting XML file ends up being poorly formatted like so:
<Company>
<CompanyDetail>FirstCompany</CompanyDetail>
<Phone>1111111111</Phone>
</Company>
~
<Company><CompanyDetail>SecondCompany</CompanyDetail><Phone>2222222222</Phone></Company>
~
<Company><CompanyDetail>ThirdCompany</CompanyDetail><Phone>3333333333</Phone></Company>
....
When there's much more information for each company than just CompanyDetail and Phone, you can imagine how difficult it gets to look through the single line to visually find what you need in the XML.
My current workaround is to replace the tilde with a comment, but how do I have a tilde separating parts of this XML file AND maintain clean formatting?
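As far as I know, the indenting writer only indents element content: once text such as the tilde is written into the parent, the rest of that parent counts as mixed content and is no longer indented, which would explain the single-line output after the first separator. One possible workaround is to give every company its own indenting XmlWriter over a shared TextWriter and write the tilde to the TextWriter directly. This is only a sketch under stated assumptions: the TildeSeparatedXml class and its Write method are made up for illustration, companies are passed as simple (Detail, Phone) tuples rather than the Company class from the question, and the enclosing elements elided by "...." in the question are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class TildeSeparatedXml
{
    // Writes each company as its own indented XML fragment, separated by a "~" line.
    public static string Write(IEnumerable<(string Detail, string Phone)> companies)
    {
        var settings = new XmlWriterSettings
        {
            Indent = true,
            OmitXmlDeclaration = true,
            // Fragment level lets each writer emit a stand-alone element.
            ConformanceLevel = ConformanceLevel.Fragment
        };

        var output = new StringWriter();
        bool first = true;
        foreach (var company in companies)
        {
            if (!first)
            {
                // The separator bypasses the XmlWriter, so it never becomes mixed content.
                output.WriteLine();
                output.WriteLine("~");
            }
            first = false;

            // Disposing the XmlWriter flushes it; the shared StringWriter stays open
            // because XmlWriterSettings.CloseOutput defaults to false.
            using (XmlWriter writer = XmlWriter.Create(output, settings))
            {
                writer.WriteStartElement("Company");
                writer.WriteElementString("CompanyDetail", company.Detail);
                writer.WriteElementString("Phone", company.Phone);
                writer.WriteEndElement();
            }
        }
        return output.ToString();
    }
}
```

The result can then be saved with something like File.WriteAllText(fileName, TildeSeparatedXml.Write(list)). Note that a file with tildes between top-level elements is not a well-formed XML document as a whole, so whatever reads it would have to split on the separator first.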

IgnoreWhiteSpace not ignoring whitespace at beginning of xml string

Question
Should whitespace be ignored at the beginning of my multi-line string literal xml?
Code
string XML = @"
<?xml version=""1.0"" encoding=""utf-8"" ?>";
using (StringReader stringReader = new StringReader(XML))
using (XmlReader xmlReader = XmlReader.Create(stringReader,
    new XmlReaderSettings() { IgnoreWhitespace = true }))
{
    xmlReader.MoveToContent();
    // further implementation withheld
}
Notice that there is white space before the XML declaration in the code above. It doesn't seem to be ignored despite my setting the IgnoreWhitespace property. Where am I going wrong?
Note: I get the same behaviour when the XML string has no line break, just a single leading space, as below. I know this will run if I remove the whitespace; my question is why the property doesn't take care of it.
string XML = @" <?xml version=""1.0"" encoding=""utf-8"" ?>";
The documentation says that the IgnoreWhitespace property "Gets or sets a value indicating whether to ignore insignificant white space." While that leading whitespace (and line break) might seem insignificant, XmlReader doesn't treat it that way. Just trim the XML before use, and you'll be fine.
As stated in comments and for clarity, change your code to:
string XML = @"<?xml version=""1.0"" encoding=""utf-8"" ?>";
using (StringReader stringReader = new StringReader(XML.Trim()))
using (XmlReader xmlReader = XmlReader.Create(stringReader,
    new XmlReaderSettings() { IgnoreWhitespace = true }))
{
    xmlReader.MoveToContent();
    // further implementation withheld
}
According to Microsoft's documentation on the XML declaration:
The XML declaration typically appears as the first line in an XML
document. The XML declaration is not required, however, if used it
must be the first line in the document and no other content or white
space can precede it.
The parse should fail for your code because white space precedes the XML declaration. Removing either the white space OR the xml declaration will result in a successful parse.
In other words it would be a bug if XmlReaderSettings were at odds with the documentation for XML Declaration - it is defined behavior.
Here's some code demonstrating the above rules.
using System;
using System.Xml;
using System.Xml.Linq;
public class Program
{
    public static void Main()
    {
        // The XML declaration is not required, however, if used it must
        // be the first line in the document and no other content or
        // white space can precede it.
        // here, no problem because this does not have an XML declaration
        string xml = @"
<xml></xml>";
        XDocument doc = XDocument.Parse(xml);
        Console.WriteLine(doc.Document.Declaration);
        Console.WriteLine(doc.Document);
        //
        // problem here because this does have an XML declaration
        //
        xml = @"
<?xml version=""1.0"" encoding=""utf-8"" ?><xml></xml>";
        try
        {
            doc = XDocument.Parse(xml);
            Console.WriteLine(doc.Document.Declaration);
            Console.WriteLine(doc.Document);
        }
        catch (Exception e)
        {
            Console.WriteLine(e.Message);
        }
    }
}

Getting data from xml file and comparing it to a text file

So I have two files: a mot file and an xml file. What I need to do with these files is to read data from the xml file and compare it to the mot file if it exists. That's the general idea.
Before anything else, for those who are unfamiliar with what a mot
file is (I don't also have much knowledge about it, just the basics)...
(From Wikipedia) A mot file (or a Motorola S-Record
file) is a file format that conveys binary information in ASCII Hex text form.
(from another source) An S-record file consists of a
sequence of specially formatted ASCII character strings. An S-record
will be less than or equal to 78 bytes in length.
The format of a S-Record is:
S | Type | Record Length | Address (starting address) | Data | Checksum
(e.g. S21404200047524D5354524D0000801410AA5AA555F9)
([parsed] S2 14 042000 47524D5354524D0000801410AA5AA555 F9)
The specific idea is that I have data AA BB CC DD and so on allocated in addresses 0x042000 ~ 0x04200F. What’s written in the xml would be:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<data-set xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<record>
<File name="Test.mot">
<Address id="042000">
<Data>AA</Data>
</Address>
</File>
</record>
<record>
<File name="Test.mot">
<Address id="042001">
<Data>BB CC DD</Data>
</Address>
</File>
</record>
<record>
<File name="Test.mot">
<Address id="042004">
<Data>EE FF</Data>
</Address>
</File>
</record>
</data-set>
Then the program would get the data and address from the XML and search the .mot file for any hits. So if the mot file has a record S214042000AABBCCDDEEFF01234567891A2B3C4D5EF9, this should match what's in the XML and the result would be true, or 1. If anything in the XML doesn't have a match, it would return false, or 0.
The problem is that I'm not well-versed in C#, much less XML, although I have a tiny bit of experience with both. I initially thought it would be something like this:
using (StreamReader sr = new StreamReader("Test.mot"))
{
String line =String.Empty;
while ((line = sr.ReadLine()) != null)
{
if (line.Contains("042004") & line.Contains("EE FF"))
{
Console.WriteLine("Success");
}
else
{
Console.WriteLine("Failure");
}
}
}
But obviously, it didn't give the result I expected, and Failure keeps popping up. Am I right to use StreamReader to read the .mot file? And for the XML file, will XmlDocument work? How do I get data from the XML and compare it with the .mot file? Could someone walk me through how to get this done, or point me to guides on how to start properly?
Let me know if I'm not clear on anything.
EDIT:
I thought of an idea. I'm not sure if it's doable, though. Let's say the program will read the mot S-Record file, and it will identify the type of the record. From there every record line listed in the file would be broken down as shown in the sample below:
sample record line: "S214042000AABBCCDDEEFF01234567891A2B3C4D5EF9"
S2 - type, which means there would be a 3-byte address
14 - record length
F9 - checksum
042000 - AA
042001 - BB
042002 - CC
042003 - DD
...
04200F - 5E
With this new list, I think or I hope it would be easier for the program to use the data in the XML to locate it in the mot file.
Tell me if this will work, or if there are any alternatives.
Correct me if I'm wrong, as this is full of assumptions:
The XML only gives the starting values of the data package in the mot file:
          ||||||||||||
S214042000AABBCCDDEEFF01234567891A2B3C4D5EF9
          AABBCCDDEEFF
You could read the XML and place each record in a Record class:
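public class Record
{
    // Public, so that code outside the class can set and read the values
    public string FileName { get; set; }
    public string Id { get; set; }
    public string Data { get; set; }
    public Record() { } // default constructor
}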
With the XmlDocument class you could read the XML.
Something like:
var document = new XmlDocument();
document.Load("your.xml"); // Load reads a file; LoadXml expects the XML text itself
var recordList = new List<Record>();
// "//record" matches every <record> element regardless of its parent
foreach (XmlNode r in document.SelectNodes("//record"))
{
    var file = r.SelectSingleNode("File"); // element names are case-sensitive
    var fileName = file.Attributes["name"].Value;
    var address = file.SelectSingleNode("Address");
    var id = address.Attributes["id"].Value;
    var data = address.SelectSingleNode("Data").InnerText.Replace(" ", "");
    recordList.Add(new Record { FileName = fileName, Id = id, Data = data });
}
Afterwards you can read every line of the mot file and compare by position,
since the address (042000 here) always occupies characters 5 to 10 of an S2 record:
var fn = "Test.mot";
// All XML records that refer to this mot file (requires using System.Linq)
var fileRecords = recordList.Where(r => r.FileName == fn).ToList();
using (StreamReader sr = new StreamReader(fn))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // Address: index 4, length 6. Data starts at index 10.
        bool match = fileRecords.Any(r =>
            line.Length >= 10 + r.Data.Length &&
            line.Substring(4, 6) == r.Id &&
            line.Substring(10, r.Data.Length) == r.Data);
        Console.WriteLine(match ? "Success" : "Failure");
    }
}
Let me know if it helped you out a bit
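The per-address breakdown proposed in the question's edit can be sketched as follows. This is an illustration under stated assumptions rather than a complete S-record parser: the SRecordSketch class and its AddLine and Matches methods are made-up names, lines are assumed to be well-formed, and the checksum is not verified.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class SRecordSketch
{
    // Adds every data byte of one S1/S2/S3 record to the map, keyed by absolute address.
    public static void AddLine(string line, Dictionary<long, byte> memory)
    {
        if (line.Length < 4 || line[0] != 'S') return;

        int addressBytes;
        switch (line[1])
        {
            case '1': addressBytes = 2; break;
            case '2': addressBytes = 3; break;
            case '3': addressBytes = 4; break;
            default: return; // other record types hold no memory data
        }

        // The count field covers address + data + checksum bytes.
        int count = int.Parse(line.Substring(2, 2), NumberStyles.HexNumber);
        long address = long.Parse(line.Substring(4, addressBytes * 2), NumberStyles.HexNumber);
        int dataStart = 4 + addressBytes * 2;
        int dataBytes = count - addressBytes - 1;

        for (int i = 0; i < dataBytes; i++)
        {
            memory[address + i] =
                byte.Parse(line.Substring(dataStart + i * 2, 2), NumberStyles.HexNumber);
        }
    }

    // True if the bytes in dataHex (e.g. "BB CC DD") are stored starting at addressHex.
    public static bool Matches(Dictionary<long, byte> memory, string addressHex, string dataHex)
    {
        long address = long.Parse(addressHex, NumberStyles.HexNumber);
        string[] bytes = dataHex.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < bytes.Length; i++)
        {
            byte expected = byte.Parse(bytes[i], NumberStyles.HexNumber);
            if (!memory.TryGetValue(address + i, out byte actual) || actual != expected)
                return false;
        }
        return true;
    }

    public static void Main()
    {
        var memory = new Dictionary<long, byte>();
        AddLine("S214042000AABBCCDDEEFF01234567891A2B3C4D5EF9", memory);

        Console.WriteLine(Matches(memory, "042000", "AA"));       // True
        Console.WriteLine(Matches(memory, "042001", "BB CC DD")); // True
        Console.WriteLine(Matches(memory, "042004", "EE FF"));    // True
        Console.WriteLine(Matches(memory, "042004", "EE 00"));    // False
    }
}
```

Unlike a comparison at a fixed position in the line, the lookup by address also finds data that starts in the middle of a record, such as the XML entries for 042001 and 042004.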

Reading xml file?

I have this XML file that I created programmatically using C#:
<Years>
<Year Year="2011">
<Month Month="10">
<Day Day="10" AccessStartTime="01:15 PM" ExitTime="01:15 PM" />
<Day Day="11" AccessStartTime="01:15 PM" ExitTime="01:15 PM" />
<Day Day="12" AccessStartTime="01:15 PM" ExitTime="01:15 PM" />
<Day Day="13" AccessStartTime="01:15 PM" ExitTime="01:15 PM" />
</Month>
<Month Month="11">
<Day Day="12" AccessStartTime="01:16 PM" ExitTime="01:16 PM" />
</Month>
</Year>
</Years>
I am having problems getting specific data from it with XmlReader, or maybe I am doing it the wrong way, because each time the reader reads a single line. What I want is a list of all days in a specific month and year.
Use Linq-XML or post the code you have tried.
var list = from ele in XDocument.Load(@"c:\filename.xml").Descendants("Year")
           select new
           {
               Year = (string)ele.Attribute("Year"),
               Month = (string)ele.Element("Month").Attribute("Month"),
               Day = (string)ele.Element("Month").Element("Day").Attribute("Day")
           };
foreach (var t in list)
{
    Console.WriteLine(t.Year + " " + t.Month + " " + t.Day);
}
I agree with AVD's suggestion of using LINQ to XML. Finding all the days for a specific year and month is simple:
XDocument doc = XDocument.Load("file.xml");
// Year elements are children of the root <Years> element, hence doc.Root
var days = doc.Root.Elements("Year").Where(y => (int) y.Attribute("Year") == year)
              .Elements("Month").Where(m => (int) m.Attribute("Month") == month)
              .Elements("Day");
(This assumes that Month and Year attributes are specified on all Month and Year elements.)
The result is a sequence of the Day elements for the specified month and year.
In most cases I'd actually write one method call per line, but in this case I thought it looked better to have one full filter of both element and attribute per line.
Note that in LINQ, some queries end up being more readable using query expressions, and some are more readable in the "dot notation" I've used above.
You asked for an explanation of AVD's code, so you may be similarly perplexed by mine - rather than explain the bits of LINQ to XML and LINQ that my code happens to use, I strongly recommend that you read good tutorials on both LINQ and LINQ to XML. They're wonderful technologies which will help your code all over the place.
Take a look at this example of how to represent the XML with a root node and how to read the data with XmlReader:
using System;
using System.Xml;
class Program
{
    static void Main()
    {
        // Create an XML reader for this file.
        using (XmlReader reader = XmlReader.Create("perls.xml"))
        {
            while (reader.Read())
            {
                // Only detect start elements.
                if (reader.IsStartElement())
                {
                    // Get element name and switch on it.
                    switch (reader.Name)
                    {
                        case "perls":
                            // Detect this element.
                            Console.WriteLine("Start <perls> element.");
                            break;
                        case "article":
                            // Detect this article element.
                            Console.WriteLine("Start <article> element.");
                            // Search for the attribute name on this current node.
                            string attribute = reader["name"];
                            if (attribute != null)
                            {
                                Console.WriteLine(" Has attribute name: " + attribute);
                            }
                            // Next read will contain text.
                            if (reader.Read())
                            {
                                Console.WriteLine(" Text node: " + reader.Value.Trim());
                            }
                            break;
                    }
                }
            }
        }
    }
}
Input text [perls.xml]
<?xml version="1.0" encoding="utf-8" ?>
<perls>
<article name="backgroundworker">
Example text.
</article>
<article name="threadpool">
More text.
</article>
<article></article>
<article>Final text.</article>
</perls>
Output
Start <perls> element.
Start <article> element.
Has attribute name: backgroundworker
Text node: Example text.
Start <article> element.
Has attribute name: threadpool
Text node: More text.
Start <article> element.
Text node:
Start <article> element.
Text node: Final text.
