Dealing with awkward XML layout in c# using XmlTextReader


I have an XML document that I'm trying to import using XmlTextReader in C#. My code works well except for one case: when the opening tag is not on the same line as the actual text content, for example with product_name:
<product>
<sku>27939</sku>
<product_name>
Sof-Therm Warm-Up Jacket
</product_name>
<supplier_number>ALNN1064</supplier_number>
</product>
My code to try to sort the XML document is as such:
while (reader.Read())
{
switch (reader.Name)
{
case "sku":
newEle = new XMLElement();
newEle.SKU = reader.ReadString();
break;
case "product_name":
newEle.ProductName = reader.ReadString();
break;
case "supplier_number":
newEle.SupplierNumber = reader.ReadString();
products.Add(newEle);
break;
}
}
I have tried almost everything I found in the XmlTextReader documentation
reader.MoveToElement();
reader.MoveToContent();
reader.MoveToNextAttribute();
and a couple of others that made less sense, but none of them seems to deal with this issue consistently. Obviously I could fix this one case, but then it would break the regular cases. So my question is: after I find the "product_name" tag, is there a way to move to the next line that contains text and extract it?
I should have mentioned that I output the result to an HTML table afterwards, and the element comes up blank, so I'm fairly certain it is not being read correctly.
Thanks in advance!

I think you will find Linq To Xml easier to use
var xDoc = XDocument.Parse(xmlstring); //or XDocument.Load(filename);
int sku = (int)xDoc.Root.Element("sku");
string name = (string)xDoc.Root.Element("product_name");
string supplier = (string)xDoc.Root.Element("supplier_number");
You can also convert your xml to dictionary
var dict = xDoc.Root.Elements()
.ToDictionary(e => e.Name.LocalName, e => (string)e);
Console.WriteLine(dict["sku"]);

It looks like you may need to remove the carriage returns, line feeds, tabs, and spaces before and after the text in the XML element. In your example, you have
<!-- 1. Original example -->
<product_name>
Sof-Therm Warm-Up Jacket
</product_name>
<!-- 2. It should probably be. If possible correct the XML generator. -->
<product_name>Sof-Therm Warm-Up Jacket</product_name>
<!-- 3a. If white space is important, then preserve it -->
<product_name xml:space='preserve'>
Sof-Therm Warm-Up Jacket
</product_name>
<!-- 3b. If white space is important, use CDATA -->
<product_name><![CDATA[
Sof-Therm Warm-Up Jacket
]]></product_name>
The XmlTextReader has a WhitespaceHandling property, but when I tested it, the value still included the returns and indentation:
reader.WhitespaceHandling = WhitespaceHandling.None;
An option is to use a method to remove the extra characters while you are parsing the document. This method removes the normal white space at the beginning and end of a string:
string TrimCrLf(string value)
{
return Regex.Replace(value, @"^[\r\n\t ]+|[\r\n\t ]+$", "");
}
// Then in your loop...
case "product_name":
// Trim the contents of the 'product_name' element to remove extra returns
newEle.ProductName = TrimCrLf(reader.ReadString());
break;
You can also use this method, TrimCrLf(), with Linq to Xml and the traditional XmlDocument. You can even make it an extension method:
public static class StringExtensions
{
public static string TrimCrLf(this string value)
{
return Regex.Replace(value, @"^[\r\n\t ]+|[\r\n\t ]+$", "");
}
}
// Use it like:
newEle.ProductName = reader.ReadString().TrimCrLf();
Regular expression explanation:
^ = Beginning of field
$ = End of field
[]+= Match 1 or more of any of the contained characters
\r = carriage return (0x0D / 13)
\n = line feed (0x0A / 10)
\t = tab (0x09 / 9)
' '= space (0x20 / 32)
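For comparison, here is a minimal sketch of the same trimming idea using LINQ to XML and string.Trim() instead of the regex. The helper name ReadTrimmed is hypothetical (it does not appear in the answers above), and the snippet is written as a top-level program, which assumes C# 9 or later.

```csharp
using System;
using System.Xml.Linq;

// Sample from the question: the text sits on its own line inside the element.
const string xml =
    "<product>\n<sku>27939</sku>\n<product_name>\nSof-Therm Warm-Up Jacket\n</product_name>\n" +
    "<supplier_number>ALNN1064</supplier_number>\n</product>";

XElement product = XDocument.Parse(xml).Root;
Console.WriteLine("[" + ReadTrimmed(product, "product_name") + "]");

// Hypothetical helper: returns the child's text without leading/trailing
// white space (spaces, tabs, CR, LF), or null when the child is missing.
static string ReadTrimmed(XElement parent, string childName)
{
    string value = (string)parent.Element(childName);
    return value?.Trim();
}
```

string.Trim() with no arguments removes all leading and trailing white-space characters, so it covers the \r, \n, \t and space cases listed above without a regular expression.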

I have run into a similar problem before when dealing with text that originated on a Mac platform due to reversed \r\n in newlines. Suggest you try Ryan's regex solution, but with the following regex:
"^[\r\n]+|[\r\n]+$"

Related

Extracting only tags from html text file

I'm working on a steganography method which hides text within HTML tags.
For example, given this tag: <heEAd>, I have to extract every character within the tag and then
analyze the case of each letter: if it is a capital, the bit is set to 1, otherwise 0. I also want to check for the matching closing /head tag at the end.
Here is the code:
WebClient client = new WebClient();
String htmlCode = client.DownloadString("url");
String Tags = "";
for(int i = 0; i < htmlCode.Length; i++){
if(htmlCode[i] ='<'){
if(htmlCode[i] = '>')
continue;
else{
Tags += htmlCode[i];
}
}
}
That logic is terrible, but how do I use IndexOf and LastIndexOf to get the desired substring? I tried to use them, but I'm missing something due to my lack of knowledge about C#.

I think you need to use regex.
I tried to do this once with Substring and it was a lot of work. Later I decided to use regex, and it was much easier.
var regex = new Regex(@"(?<=<head>).*(?=</head>)");
return regex.Matches(strInput);
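To connect the regex idea to the case-to-bit rule described in the question, here is a rough sketch that collects the names of opening tags and turns each letter into a bit (uppercase = 1, lowercase = 0). The function name TagCaseBits and the tag-name pattern are assumptions made for illustration; real HTML may need a proper parser. The snippet is a top-level program (C# 9 or later).

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

Console.WriteLine(TagCaseBits("<heEAd><title>Hi</title></heEAd>"));

// Hypothetical helper: one bit per letter of every opening tag name.
static string TagCaseBits(string html)
{
    var bits = new StringBuilder();
    // '<' followed directly by a tag name; closing tags ("</...") do not match.
    foreach (Match m in Regex.Matches(html, @"<([A-Za-z][A-Za-z0-9]*)"))
    {
        foreach (char c in m.Groups[1].Value)
        {
            if (char.IsLetter(c))
                bits.Append(char.IsUpper(c) ? '1' : '0');
        }
    }
    return bits.ToString();
}
```

Attribute names and values are ignored here; only the tag name itself contributes bits.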

How can I deal with parsing bad csv data?

I know that the data should be correct. I have no control over the data and my boss is just going to tell me that I need to figure out a way to deal with someone else's mistake. So please don't tell me it's not my problem that the data is bad, because it is.
Anywho, this is what I'm looking at:
"Words","email@email.com","","4253","57574","FirstName","","LastName, MD","","","576JFJD","","1971","","Words","Address","SUITE "A"","City","State","Zip","Phone","",""
Data has been scrubbed for confidentiality reasons.
So as you see, the data contains quotation marks and there are commas inside some of these quoted fields. So I cannot remove them. But the "Suite A""" is throwing off the parser. There are too many quotation marks. >.<
I'm using the TextFieldParser in the Microsoft.VisualBasic.FileIO namespace with these settings:
parser.HasFieldsEnclosedInQuotes = true;
parser.SetDelimiters(",");
parser.TextFieldType = FieldType.Delimited;
The error is
MalformedLineException: Line 9871 cannot be parsed using the current
delimiters.
I would like to scrub the data somehow to account for this but I'm not sure how to do it. Or maybe there's a way to just skip this line? Although I suspect my higher ups will not approve of me just skipping data that we might need.

If you are only trying to get rid of the stray " marks in your CSV, you can use the following regex to find them and replace them with '
String sourcestring = "source string to match with pattern";
String matchpattern = @"(?<!^|,)""(?!(,|$))";
String replacementpattern = @"$1'";
Console.WriteLine(Regex.Replace(sourcestring,matchpattern,replacementpattern,RegexOptions.Multiline));
Explanation:
@"(?<!^|,)""(?!(,|$))"; will find any " that is not preceded by the beginning of a line or a , and that is not followed by the end of a line or a ,

I am not familiar with TextFieldParser. However with CsvHelper, you can add a custom handler for invalid data:
var config = new CsvConfiguration();
config.IgnoreReadingExceptions = true;
config.ReadingExceptionCallback += (e, row) =>
{
// you can add some custom patching here if possible
// or, save the line numbers and add/edit them manually later.
};
using(var file = File.OpenText(".csv")) // CsvReader expects a TextReader
using(var reader = new CsvReader(file, config))
{
reader.GetRecords<YourDtoClass>();
}

My only addition to what everyone is saying (because we've all been there) is to try to rectify each new issue you encounter with code. There are some decent regex strings out there https://www.google.com/?ion=1&espv=2#q=c-sharp+regex+csv+clean or you could manually fix things using String.Replace (String.Replace("\"\"\"","").Replace("\"\","").Replace("\",,","\",") or such). Eventually, as you detect and find ways of correcting more and more mistakes, the share of lines you have to fix by hand will shrink substantially (most of your bad data will likely come from similar mistakes). Cheers!
PS - A rough idea (it's been a while, and the logic may need some tweaking as I'm writing from memory), but you'll get the gist:
public string[] parseCSVWithQuotes(string csvLine,int expectedNumberOfDataPoints)
{
string ret = "";
string thisChar = "";
string lastChar = "";
bool needleDown = true;
for(int i = 0; i < csvLine.Length; i++)
{
thisChar = csvLine.Substring(i, 1);
if (thisChar == "'"&&lastChar!="'")
needleDown = needleDown == true ? false : true;//when needleDown = true, characters are treated literally
if (thisChar == ","&&lastChar!=",") {
if (needleDown)
{
ret += "|";//convert literal comma to pipe so it doesn't cause another break on split
}else
{
ret += ",";//break on split is intended because the comma is outside the single quote
}
}
if (!needleDown && (thisChar == "\"" || thisChar == "*")) {//repeat for any undesired character or use RegEx
//do not add -- this eliminates any undesired characters outside single quotes
}
else
{
if ((lastChar == "'" || lastChar == "\"" || lastChar == ",") && thisChar == lastChar)
{
//do not add - this eliminates double characters
}else
{
ret += thisChar;
lastChar = thisChar;
//this character is not an undesired character, is not a double, is valid.
}
}
}
//we've cleaned as best we can
string[] parts = ret.Split(',');
if(parts.Length==expectedNumberOfDataPoints){
for(int i = 0; i < parts.Length; i++)
{
//go back and replace the temporary pipe with the literal comma AFTER split
parts[i] = parts[i].Replace("|", ",");
}
return parts;
}else{
//save ret to bad CSV log
return null;
}
}

I've had to do this before,
The first step is to parse the data using string.split(',')
The next step is to combine the segments that belong together.
What I essentially did was
make a new list representing the combined strings
if a string begins with a quote, push it onto your new list
if it does not begin with a quote, append it to the last string in your list
Bonus: throw exceptions when a string ends with a quote but the next one does not begin with a quote
Depending on what the rules are regarding what can actually appear in your data, you might have to change your code to account for that.
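The split-then-merge steps above could be sketched roughly as follows. The name MergeSegments is hypothetical, the "bonus" exception is left out, and the code assumes every field in the file is quoted, as in the question's sample. It is written as a top-level program (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;

string line = "\"Words\",\"LastName, MD\",\"SUITE \"A\"\",\"\"";
foreach (string field in MergeSegments(line))
    Console.WriteLine(field);

// Hypothetical helper: split on commas, then glue back together any segment
// that does not start with a quote (it belongs to the previous field).
static List<string> MergeSegments(string line)
{
    var fields = new List<string>();
    foreach (string segment in line.Split(','))
    {
        if (fields.Count == 0 || segment.StartsWith("\"", StringComparison.Ordinal))
            fields.Add(segment);                        // start of a new quoted field
        else
            fields[fields.Count - 1] += "," + segment;  // continuation of the last field
    }
    return fields;
}
```

The fields keep their surrounding quotes, and a field whose text contains a comma immediately followed by a quote would still be split incorrectly, so this only works when the data rules allow it.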

At the core of CSV's file format, each line is a row, each cell in that row is separated by a comma. In your case, your format also contains the (very unfortunate) stipulation that commas inside a pair of quotation marks do not count as separators and are instead part of the data. I say very unfortunate because a misplaced quotation mark affects the entire rest of the line, and since quotation marks in standard ASCII do not distinguish between open and closed, there really is nothing you can do to recover from this without knowing the original intent.
That is when you log a message in a way that the person who does know the original intent (the person that provided the data) can look at the file and correct the error:
if (parse_line(line, &data)) {
// save the data
} else {
// log the error
fprintf(stderr, "Bad line: %s\n", line);
}
And since your quotation marks aren't escaping newlines, you can keep on going with the next line after running into this error.
ADDENDUM: And if your company has a choice (i.e. your data is being serialized by a company tool) don't use CSV. Use something like XML or JSON with a much more clearly defined parsing mechanism.

I had to do this once as well. My approach was to go through a line and keep track of what I was reading.
Basically, I coded my own scanner that chops tokens off the input line, which gave me full control over my faulty .csv data.
This is what I did:
For each character on a line of input:
1. When outside of a string and meeting a comma => all of the previous string (which can be empty) is a valid token.
2. When outside of a string and meeting anything but a comma or a quote => now you have a real problem, unquoted text => handle as you see fit.
3. When outside of a string and meeting a quote => found the start of a string.
4. When inside of a string and meeting a comma => accept the comma as part of the string.
5. When inside of a string and meeting a quote => trouble starts here, mark this point.
6. Continue, and when meeting a comma (skipping white space if desired), close the string, 'unread' the comma and continue. (That will bring you to point 1.)
7. Or continue, and when meeting a quote => obviously, what was read must be part of the string; add it to the string, 'unread' the quote and continue. (That will bring you to point 5.)
8. Or continue and find only white space, then End Of Line ('\n') => the last quote must be the closing quote. Accept the string as a value.
9. Or continue and find non-white space, then End Of Line => now you have a real problem: you have the start of a string but it is not closed => handle the error as you see fit.
If the number of fields in your .csv file is fixed, you can count the commas you recognise as field separators, and when you see an End Of Line you know whether you have another problem or not.
With the stream of strings received from the input line you can build a 'clean' .csv line, and this way build a buffer of accepted and cleaned input that you can use in your already existing code.
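A simplified sketch of such a scanner is shown below. It is not the author's code: the names SplitLenient and ClosesField are hypothetical, and the rule is reduced to "a quote inside a string closes it only when the next non-space character is a comma or the end of the line". Written as a top-level program (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

string line = "\"Address\",\"SUITE \"A\"\",\"LastName, MD\",\"\"";
foreach (string field in SplitLenient(line))
    Console.WriteLine("[" + field + "]");

// Hypothetical helper: tolerant field splitter for lines with stray quotes.
static List<string> SplitLenient(string line)
{
    var fields = new List<string>();
    var current = new StringBuilder();
    bool inQuotes = false;
    for (int i = 0; i < line.Length; i++)
    {
        char c = line[i];
        if (!inQuotes)
        {
            if (c == ',') { fields.Add(current.ToString()); current.Clear(); } // end of field
            else if (c == '"' && current.Length == 0) inQuotes = true;          // opening quote
            else current.Append(c);                                             // unquoted text, kept as-is
        }
        else if (c == '"' && ClosesField(line, i + 1)) inQuotes = false;        // real closing quote
        else current.Append(c);                                                 // comma or stray quote inside the string
    }
    fields.Add(current.ToString());
    return fields;
}

// True when only spaces separate position 'next' from a comma or the end of the line.
static bool ClosesField(string line, int next)
{
    while (next < line.Length && line[next] == ' ') next++;
    return next == line.Length || line[next] == ',';
}
```

A stray quote that happens to be followed by a comma inside a field is still ambiguous, so counting the resulting fields against the expected number remains a useful check.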

Using C# and Regex to find and surround all words and numbers within some html text with a span

I need to surround every word in loaded html text with a span which will uniquely identify every word. The problem is that some content is not being handled by my regex pattern. My current problems include...
1) Special HTML characters like &rdquo; &ldquo; are treated as words.
2) Currency values. e.g. $2,500 end up as "2" "500" (I need "$2,500")
3) Double hyphened words. e.g. one-legged-man. end up "one-legged" "man"
I'm new to regular expressions and after looking at various other posts have derived the following pattern that seems to work for everything except the above exceptions.
What I have so far is:
string pattern = @"(?<!<[^>]*?)\b('\w+)|(\w+['-]\w+)|(\w+')|(\w+)\b(?![^<]*?>)";
string newText = Regex.Replace(oldText, pattern, delegate(Match m) {
wordCnt++;
return "<span data-wordno='" + wordCnt.ToString() + "'>" + m.Value + "</span>";
});
How can I fix/extend the above pattern to cater for these problems or should I be using a different approach all together?

A fundamental problem that you're up against here is that html is not a "regular language". This means that html is complex enough that you are always going to be able to come up with valid html that isn't recognized by any regular expression. It isn't a matter of writing a better regular expression; this is a problem that regex can't solve.
What you need is a dedicated html parser. You could try this nuget package. There are many others, but HtmlAgilityPack is quite popular.
Edit: Below is an example program using HtmlAgilityPack. When an HTML document is parsed, the result is a tree (aka the DOM). In the DOM, text is stored inside text nodes. So something like <p>Hello World</p> is parsed into a node that represents the p tag, with a child text node that holds "Hello World". So what you want to do is find all the text nodes in your document and then, for each node, split the text into words and surround the words with spans.
You can search for all the text nodes using an xpath query. The xpath I have below is /html/body//*[not(self::script)]/text(), which avoids the html head and any script tags in the body.
using System;
using System.Linq;
using HtmlAgilityPack;
class Program
{
static void Main(string[] args)
{
var doc = new HtmlDocument();
doc.Load(args[0]);
var wordCount = 0;
var nodes = doc.DocumentNode
.SelectNodes("/html/body//*[not(self::script)]/text()");
foreach (var node in nodes)
{
var words = node.InnerHtml.Split(' ');
var surroundedWords = words.Select(word =>
{
if (String.IsNullOrWhiteSpace(word))
{
return word;
}
else
{
return $"<span data-wordno={wordCount++}>{word}</span>";
}
});
// Rejoin with the separator used by Split so the spaces between words survive
var newInnerHtml = String.Join(" ", surroundedWords);
node.InnerHtml = newInnerHtml;
}
Console.WriteLine(doc.DocumentNode.InnerHtml);
}
}

Fix 1) by adding "negative look-behind assertions" (?<!\&). I believe they are needed at the beginning of the 1st, 3rd, and 4th alternatives in the original pattern above.
Fix 2) by adding a new alternative |(\$?(\d+[,.])+\d+) at the end of the pattern. This also handles non-dollar and decimal-pointed numbers at the same time.
Fix 3) by enhancing the (\w+['-]\w+) alternative to read instead ((\w+['-])+\w+).

How to correctly encode & in xml?

I'm web-requesting an XML document. XDocument.Load(stream) throws an exception because the XML contains a bare &, and the parser therefore expects a terminating ; as in &amp;.
I read the stream into a string and replaced & with &amp;, but that broke all the other correctly encoded special characters such as ø.
Is there a simple way to encode all disallowed chars in the string before parsing to XDocument?

Try CDATA Sections in xml
A CDATA section can only be used in places where you could have a text node.
<foo><![CDATA[Here is some data including < , > or & etc) ]]></foo>

This kind of method is not encouraged! The reason lies in your question:
(replacing & with &amp; turns &gt; into &amp;gt;)
A better suggestion than using regex is to modify the source code that generates such unencoded XML.
I have come across (.NET) code that uses string concatenation to build XML! (One should use the XML DOM instead.)
If you have access to the source code, it is better to go ahead with that, because repairing such half-encoded XML is not guaranteed to work perfectly.

@espvar,
This is an input XML:
<root><child>nospecialchars</child><specialchild>data&data</specialchild><specialchild2>You.. & I in this beautiful world</specialchild2>data&</root>
And the Main function:
string EncodedXML = encodeWithCDATA(XMLInput); //Calling our Custom function
XmlDocument xdDoc = new XmlDocument();
xdDoc.LoadXml(EncodedXML); //passed
The function encodeWithCDATA():
private string encodeWithCDATA(string stringXML)
{
if (stringXML.IndexOf('&') != -1)
{
// Position of the '>' that closes the last tag before the first '&'
int indexofClosingtag = stringXML.Substring(0, stringXML.IndexOf('&')).LastIndexOf('>');
// Offset of the next '<', counted from that '>'
int indexofNextOpeningtag = stringXML.Substring(indexofClosingtag).IndexOf('<');
// Wrap only the text between the two tags; starting at indexofClosingtag + 1 keeps the '>' out of the CDATA section
string CDATAsection = string.Concat("<![CDATA[", stringXML.Substring(indexofClosingtag + 1, indexofNextOpeningtag - 1), "]]>");
string encodedLeftPart = string.Concat(stringXML.Substring(0, indexofClosingtag+1), CDATAsection);
string UncodedRightPart = stringXML.Substring(indexofClosingtag+indexofNextOpeningtag);
// Recurse on the remainder, which may contain more '&' characters
return (string.Concat(encodedLeftPart, encodeWithCDATA(UncodedRightPart)));
}
else
{
return (stringXML);
}
}
Encoded XML (ie, xdDoc.OuterXml):
<root>
<child>nospecialchars</child>
<specialchild>
<![CDATA[data&data]]>
</specialchild>
<specialchild2>
<![CDATA[You.. & I in this beautiful world]]>
</specialchild2>
<![CDATA[data&]]>
</root>
All I have used is Substring, IndexOf, string.Concat and a recursive function call. Let me know if you don't understand any part of the code.
The sample XML that I have provided has data in the parent nodes as well, which is the kind of mixed content you see in HTML, e.g. <div>this is <b>bold</b> text</div>. My code also takes care of encoding data outside the <b> tag if it contains the special character &.
Please note that I have taken care of encoding '&' only, and the data cannot contain characters like '<' or '>' or single or double quotes.

XmlDocument throwing "An error occurred while parsing EntityName"

I have a function where I am passing a string as params called filterXML which contains '&' in one of the properties.
I know that XML will not recognize it and it will throw me an err. Here is my code:
public XmlDocument TestXMLDoc(string filterXml)
{
XmlDocument doc = new XmlDocument();
XmlNode root = doc.CreateElement("ResponseItems");
// put that root into our document (which is an empty placeholder now)
doc.AppendChild(root);
try
{
XmlDocument docFilter = new XmlDocument();
docFilter.PreserveWhitespace = true;
if (string.IsNullOrEmpty(filterXml) == false)
docFilter.LoadXml(filterXml); //ERROR THROWN HERE!!!
What should I change in my code to edit or parse filterXml? My filterXml looks like this:
<Testing>
<Test>CITY & COUNTY</Test>
</Testing>
I am changing my string value from & to &amp;. Here is my code for that:
string editXml = filterXml;
if (editXml.Contains("&"))
{
editXml.Replace('&', '&amp;');
}
But it gives me an error inside the if statement: Too many characters in character literal.

The file shown above is not well-formed XML because the ampersand is not escaped.
You can try with:
<Testing>
<Test>CITY &amp; COUNTY</Test>
</Testing>
or:
<Testing>
<Test><![CDATA[CITY & COUNTY]]></Test>
</Testing>

About the second question: there are two signatures for String.Replace, one that takes characters and one that takes strings. Single quotes build character literals, but "&amp;" is really a string in C# (it has five characters).
Does it work with double quotes? Note that Replace returns a new string, so the result has to be assigned:
editXml = editXml.Replace("&", "&amp;");
If you would like to be a bit more conservative, you could also write code to ensure that the &s you are replacing are not followed by one of
amp; quot; apos; gt; lt; or #
(but this would still not be a perfect filtering)
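The more conservative replacement described above could look roughly like this. The name EscapeBareAmpersands is hypothetical, and the pattern only knows the five predefined XML entities plus numeric character references, so it is a sketch rather than a complete filter. Written as a top-level program (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

Console.WriteLine(EscapeBareAmpersands("<Test>CITY & COUNTY &amp; &#248;</Test>"));

// Hypothetical helper: escape only those '&' characters that do not already
// start a predefined entity (amp, lt, gt, quot, apos) or a numeric reference.
static string EscapeBareAmpersands(string xml)
{
    return Regex.Replace(
        xml,
        @"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)",
        "&amp;");
}
```

Named HTML entities that are not predefined in XML would still be escaped by this pattern, and text such as "AT&T;" stays ambiguous, which is why fixing the producer of the XML is the more reliable route.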

To specify an ampersand in XML you should use &amp; since the ampersand sign ('&') has a special meaning in XML.
