Doing regex style compare while looping through a XML file in C# - c#

I have an XML file that I loop through, and when a child node matches I read the value of one of its attributes. The catch is that the values to match against contain a * or ? character, like a simple regex-style wildcard. Can someone tell me how to do this? For example, if a request comes in for g.portal.com, it should match the second node. I am using .NET 2.0.
Below is my XML file
<Test>
<Test Text="portal.com" Sample="1" />
<Test Text="*.portal.com" Sample="201309" />
<Test Text="portal-0?.com" Sample="201309" />
</Test>
XmlDocument xDoc = new XmlDocument();
xDoc.Load(PathToXMLFile);
foreach (XmlNode node in xDoc.DocumentElement.ChildNodes)
{
if (node.Attributes["Sample"].InnerText == value)
{
}
}

What you need to do is first convert each Text attribute into a valid Regex pattern and then use it to match your input. Something like this:
string input = "g.portal.com";
XmlNode foundNode = null;
foreach (XmlNode node in xDoc.DocumentElement.ChildNodes)
{
    string value = node.Attributes["Text"].Value;
    // Escape regex metacharacters first, then turn the escaped wildcards into regex tokens
    string pattern = Regex.Escape(value)
        .Replace(@"\*", ".*")
        .Replace(@"\?", ".");
    if (Regex.IsMatch(input, "^" + pattern + "$"))
    {
        foundNode = node;
        break; //remove if you want to continue searching
    }
}
After executing the above code, foundNode should contain the second node from the xml file.
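The conversion above can also be pulled out into a small helper so each pattern is compiled once and reused for many requests. This is only a sketch: the class and method names (WildcardMatcher, ToRegex) are made up for illustration, and it assumes * means "any run of characters" and ? means "exactly one character". It sticks to C# 2.0 syntax.

```csharp
using System;
using System.Text.RegularExpressions;

static class WildcardMatcher
{
    // Hypothetical helper: converts a wildcard pattern into an anchored Regex.
    // Assumption: '*' = any run of characters, '?' = exactly one character.
    public static Regex ToRegex(string wildcard)
    {
        string pattern = "^" + Regex.Escape(wildcard)
            .Replace(@"\*", ".*")   // Regex.Escape turned '*' into '\*'
            .Replace(@"\?", ".")    // and '?' into '\?'
            + "$";
        return new Regex(pattern, RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        Console.WriteLine(ToRegex("*.portal.com").IsMatch("g.portal.com"));    // True
        Console.WriteLine(ToRegex("*.portal.com").IsMatch("portal.com"));      // False
        Console.WriteLine(ToRegex("portal-0?.com").IsMatch("portal-01.com"));  // True
    }
}
```

RegexOptions.IgnoreCase is an assumption that host names should match case-insensitively; drop it if the comparison should be exact.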

So you have an XML file that sets up patterns, right? You'll want to feed those patterns into Regexes and then stream a number of requests through them. Did I get that correct?
Assuming the XML file doesn't change, its patterns only need to be converted into the corresponding Regexes once. For example *.portal.com would translate to
new Regex("\\w+\\.portal\\.com");
You'll just have to escape the dots, replace * with \\w+ and ? with \\w, if I guessed the semantics of your match patterns correctly.
Look up the correct replacements at http://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx

Related

C# regex find specific character between tags

really just started using regex and can only do basic things.
I want to convert hyphens in the element tag of an XML file to underscores. I already have a c# app that reads a config file with find and replace elements and does some other cleaning work in RegEx but can't figure this one out
so currently it will go in as
< convert-these-here > but-not-these < / convert-these-here >
And I want it spat out as
< convert_these_here > but-not-these < / convert_these_here >
The C# script just sucks in the file and reads it line by line; it doesn't look at it as an XML file
so basically i thought i just need a pattern that looks for any and all hyphens BETWEEN a < and >
Thanks
Ditch the regex. Parse your XML, and fix it. Using the XDocument class makes this really easy.
Say we start with the following XML document:
<this-is>
<an-xml>
<doc but-I="remain-untouched">look-at-me</doc>
</an-xml>
</this-is>
We can load it into an XDocument and fix up the element names.
var str = "<this-is><an-xml><doc but-I=\"remain-untouched\">look-at-me</doc></an-xml></this-is>";
var xdoc = XDocument.Parse(str);
foreach(var el in xdoc.Descendants())
{
var name = el.Name.LocalName;
name = name.Replace("-", "_");
el.Name = el.Name.Namespace + name;
}
var fixedXmlString = xdoc.ToString();
Now this gives us the following:
<this_is>
<an_xml>
<doc but-I="remain-untouched">look-at-me</doc>
</an_xml>
</this_is>

Using C# and Regex to find and surround all words and numbers within some html text with a span

I need to surround every word in loaded html text with a span which will uniquely identify every word. The problem is that some content is not being handled by my regex pattern. My current problems include...
1) Special html character entities like &rdquo; &ldquo; are treated as words.
2) Currency values. e.g. $2,500 end up as "2" "500" (I need "$2,500")
3) Double hyphened words. e.g. one-legged-man. end up "one-legged" "man"
I'm new to regular expressions and after looking at various other posts have derived the following pattern that seems to work for everything except the above exceptions.
What I have so far is:
string pattern = @"(?<!<[^>]*?)\b('\w+)|(\w+['-]\w+)|(\w+')|(\w+)\b(?![^<]*?>)";
string newText = Regex.Replace(oldText, pattern, delegate(Match m) {
wordCnt++;
return "<span data-wordno='" + wordCnt.ToString() + "'>" + m.Value + "</span>";
});
How can I fix/extend the above pattern to cater for these problems or should I be using a different approach all together?
A fundamental problem that you're up against here is that html is not a "regular language". This means that html is complex enough that you are always going to be able to come up with valid html that isn't recognized by any regular expression. It isn't a matter of writing a better regular expression; this is a problem that regex can't solve.
What you need is a dedicated html parser. You could try this nuget package. There are many others, but HtmlAgilityPack is quite popular.
Edit: Below is an example program using HtmlAgilityPack. When an HTML document is parsed, the result is a tree (aka the DOM). In the DOM, text is stored inside text nodes. So something like <p>Hello World</p> is parsed into a node to represent the p tag, with a child text node to hold the "Hello World". So what you want to do is find all the text nodes in your document, and then, for each node, split the text into words and surround the words with spans.
You can search for all the text nodes using an xpath query. The xpath I have below is /html/body//*[not(self::script)]/text(), which avoids the html head and any script tags in the body.
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var doc = new HtmlDocument();
        doc.Load(args[0]);
        var wordCount = 0;
        var nodes = doc.DocumentNode
            .SelectNodes("/html/body//*[not(self::script)]/text()");
        // SelectNodes returns null when nothing matches
        if (nodes == null) return;
        foreach (var node in nodes)
        {
            var words = node.InnerHtml.Split(' ');
            var surroundedWords = words.Select(word =>
            {
                if (String.IsNullOrWhiteSpace(word))
                {
                    return word;
                }
                else
                {
                    return $"<span data-wordno='{wordCount++}'>{word}</span>";
                }
            });
            // Rejoin with the separator used by Split so the spaces between words survive
            var newInnerHtml = String.Join(" ", surroundedWords);
            node.InnerHtml = newInnerHtml;
        }
        Console.WriteLine(doc.DocumentNode.InnerHtml);
    }
}
Fix 1) by adding "negative look-behind assertions" (?<!\&). I believe they are needed at the beginning of the 1st, 3rd, and 4th alternatives in the original pattern above.
Fix 2) by adding a new alternative |(\$?(\d+[,.])+\d+) at the end of the pattern. This also handles non-dollar and decimal-pointed numbers at the same time.
Fix 3) by enhancing the (\w+['-]\w+) alternative to read instead ((\w+['-])+\w+).

an error occurred while parsing entityname with '&'

I have the following program that opens an XML document with Visual C#. I can't open the XML because it contains an '&', and I don't know how to open it.
private void button1_Click(object sender, EventArgs e)
{
XmlDocument doc;
doc = new XmlDocument();
doc.Load("nuevo.xml");
XmlNodeList menus;
menus = doc.GetElementsByTagName("menu");
foreach (XmlNode unMenu in menus)
{
if (unMenu.Attributes["precio"].Value == "50")
{
//Console.WriteLine(unMenu.Attributes["type"].Value);
XPathNavigator navegador = doc.CreateNavigator();
XPathNodeIterator nodos = navegador.Select("/restaurante");
while (nodos.MoveNext())
{
Console.WriteLine(nodos.Current.OuterXml);
Console.WriteLine();
textBox1.Text = nodos.Current.OuterXml;
}
}
}
}
If you get the error
an error occurred while parsing entityname with '&'
then there is an unescaped "&" somewhere in the file that does not start a valid entity reference such as &amp;. This is not allowed in an XML document. You cannot open an invalid XML file with the XmlDocument (or XDocument) class.
There are several things you can do:
1) Make sure that the XML files are always valid before trying to read them. This however depends on your scenario and may not be possible.
2) Preprocess your XML file to fix the invalid content by replacing "&" with "&amp;". You can either do this manually or at run-time.
3) Use HtmlAgilityPack to parse the invalid file.
Personally, I would go with 1) if possible or 2) otherwise.
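Option 2) could look roughly like the sketch below. The class and method names (AmpersandRepair, EscapeBareAmpersands) and the sample XML are invented for illustration. It assumes the file only uses the five predefined entities and numeric character references; a document with custom DTD entities or with a literal & inside a CDATA section would need more care, because this would escape those too.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

static class AmpersandRepair
{
    // Matches an '&' that does NOT already start a predefined entity
    // (&amp; &lt; &gt; &quot; &apos;) or a numeric reference (&#38; / &#x26;).
    static readonly Regex BareAmpersand =
        new Regex(@"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Hypothetical helper: escapes only the bare ampersands in raw XML text.
    public static string EscapeBareAmpersands(string rawXml)
    {
        return BareAmpersand.Replace(rawXml, "&amp;");
    }

    static void Main()
    {
        // Made-up sample: one bare '&' and one that is already escaped.
        string raw = "<restaurante><menu precio=\"50\">Fish & chips &amp; salad</menu></restaurante>";

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(EscapeBareAmpersands(raw));   // would throw without the repair step
        Console.WriteLine(doc.DocumentElement.InnerText);   // Fish & chips & salad
    }
}
```

For a file on disk, the same idea would be to read the text with File.ReadAllText, repair it, and pass the result to LoadXml instead of calling Load on the path.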
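Replace all occurrences of & with &amp; in the XML.
So after spending hours on this issue: it turns out that if you have an ampersand symbol ("&") or any other XML special character within your XML text, it will always fail when you try to read the XML. To solve this, replace the special characters with their escaped string format. Apply this to the text values you put into the XML, not to the whole document, or the tags themselves get escaped. The ampersand has to be replaced first, otherwise the entities inserted by the other calls are escaped a second time.
YourXmlString = YourXmlString.Replace("&", "&amp;").Replace("'", "&apos;").Replace("\"", "&quot;").Replace(">", "&gt;").Replace("<", "&lt;");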

Dealing with awkward XML layout in c# using XmlTextReader

So I have an XML document I'm trying to import using XmlTextReader in C#, and my code works well except for one part: where the tag is not on the same line as the actual text/content, for example with product_name:
<product>
<sku>27939</sku>
<product_name>
Sof-Therm Warm-Up Jacket
</product_name>
<supplier_number>ALNN1064</supplier_number>
</product>
My code to try to sort the XML document is as such:
while (reader.Read())
{
switch (reader.Name)
{
case "sku":
newEle = new XMLElement();
newEle.SKU = reader.ReadString();
break;
case "product_name":
newEle.ProductName = reader.ReadString();
break;
case "supplier_number":
newEle.SupplierNumber = reader.ReadString();
products.Add(newEle);
break;
}
}
I have tried almost everything I found in the XmlTextReader documentation
reader.MoveToElement();
reader.MoveToContent();
reader.MoveToNextAttribute();
and a couple others that made less sense, but none of them seem to be able to consistently deal with this issue. Obviously I could fix this one case, but then it would break the regular cases. So my question is, would there be a way to have it after I find the "product_name" tag to go to the next line that contains text and extract it?
I should have mentioned, I am outputting it to an HTML table after and the element is coming up blank so I'm fairly certain it is not reading it correctly.
Thanks in advance!
I think you will find Linq To Xml easier to use
var xDoc = XDocument.Parse(xmlstring); //or XDocument.Load(filename);
int sku = (int)xDoc.Root.Element("sku");
string name = (string)xDoc.Root.Element("product_name");
string supplier = (string)xDoc.Root.Element("supplier_number");
You can also convert your xml to dictionary
var dict = xDoc.Root.Elements()
.ToDictionary(e => e.Name.LocalName, e => (string)e);
Console.WriteLine(dict["sku"]);
It looks like you may need to remove the carriage returns, line feeds, tabs, and spaces before and after the text in the XML element. In your example, you have
<!-- 1. Original example -->
<product_name>
Sof-Therm Warm-Up Jacket
</product_name>
<!-- 2. What it should probably be. If possible, correct the XML generator. -->
<product_name>Sof-Therm Warm-Up Jacket</product_name>
<!-- 3a. If white space is important, then preserve it -->
<product_name xml:space='preserve'>
Sof-Therm Warm-Up Jacket
</product_name>
<!-- 3b. If White space is important, use CDATA -->
<product_name><![CDATA[
Sof-Therm Warm-Up Jacket
]]></product_name>
The XmlTextReader has a WhitespaceHandling property, but when I tested it, it still included the returns and indentation:
reader.WhitespaceHandling = WhitespaceHandling.None;
An option is to use a method to remove the extra characters while you are parsing the document. This method removes the normal white space at the beginning and end of a string:
string TrimCrLf(string value)
{
return Regex.Replace(value, @"^[\r\n\t ]+|[\r\n\t ]+$", "");
}
// Then in your loop...
case "product_name":
// Trim the contents of the 'product_name' element to remove extra returns
newEle.ProductName = TrimCrLf(reader.ReadString());
break;
You can also use this method, TrimCrLf(), with Linq to Xml and the traditional XmlDocument. You can even make it an extension method:
public static class StringExtensions
{
public static string TrimCrLf(this string value)
{
return Regex.Replace(value, @"^[\r\n\t ]+|[\r\n\t ]+$", "");
}
}
// Use it like:
newEle.ProductName = reader.ReadString().TrimCrLf();
Regular expression explanation:
^ = Beginning of field
$ = End of field
[]+= Match 1 or more of any of the contained characters
\r = carriage return (0x0D / 13)
\n = line feed (0x0A / 10)
\t = tab (0x09 / 9)
' '= space (0x20 / 32)
I have run into a similar problem before when dealing with text that originated on a Mac platform due to reversed \r\n in newlines. Suggest you try Ryan's regex solution, but with the following regex:
"^[\r\n]+|[\r\n]+$"

Regex - remove text while replacing text with c#

I am attempting to learn regex by using it to edit some scripts I have.
My scripts contain like so
<person name="John">Will be out of town</person><person name="Julie">Will be in town.</person>
I need to replace the name values in the script - the addition to the name is always the same, but I might have names that I don't want to update.
Quick example of what I have:
string[] names = new string[2];
names[0] = "John-Example";
names[1] = "Paul-Example";
string ToFix = "<person name=\"John\">Will be out of town</person><person name=\"Julie\">Will be in town.</person>";
for (int i=0; i<names.Length; i++)
{
string Name = names[i];
ToFix = Regex.Replace(ToFix, "(<.*name=\")(" + Name.Replace("-Example", "") + ".*)(\".*>)", "$1" + Name + "$3", RegexOptions.IgnoreCase);
}
This works for the most part, but I have two problems with it. Sometime it removes too much, if I have multiple persons in the string, it will remove everything between the first person and the last person, as so:
Hello <person name="John">This is John</person><person name="Paul">This is Paul</person>
becomes
Hello <person name="John-Example">This is Paul</person>
Also, I would like to remove any extra text behind the name value and before the closing angle bracket, so that:
<person name="John" hello>
Should be corrected to:
<person name="John-Example">
I have read several articles on regex and feel that I am just missing something small here. How and why would I go about fixing this?
EDIT: I don't think these scripts that I am working with classify as XML - the entire script may or may not have <> tags. Back to my original goal with this question, can someone explain the behavior of the regex? And how would I remove extra text after the name value before the closing tag?
Your regex is too greedy. Try .*? rather than just .*
Also, please don't use regex to parse XML.
Here's an example of how to do what I think you want, using XDocument:
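// XDocument needs a single root element, so wrap the fragment before parsing
var xdoc = XDocument.Parse("<root>" + ToFix + "</root>");
foreach (var person in xdoc.Root.Elements("person"))
{
    var name = person.Attribute("name");
    if (name != null && person.LastAttribute != name)
    {
        person.RemoveAttributes();
        person.SetAttributeValue(name.Name, name.Value + "-Example");
    }
}
// Concatenate the child nodes to drop the temporary root again
var output = string.Concat(xdoc.Root.Nodes());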
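To see why the greedy .* swallows everything between the first and last person, here is a small sketch that runs the question's pattern next to a lazy variant and a variant built on negated character classes. The helper name (Rename) is made up, and the name "John" is hard-coded just to keep the comparison short.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyVsLazy
{
    // Hypothetical helper for this sketch: one case-insensitive replace.
    public static string Rename(string input, string pattern, string replacement)
    {
        return Regex.Replace(input, pattern, replacement, RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        string input = "Hello <person name=\"John\">This is John</person>"
                     + "<person name=\"Paul\">This is Paul</person>";

        // Greedy: the .* after "John" runs to the end of the string and backtracks
        // only as far as the LAST quote, so the match spans both person tags.
        Console.WriteLine(Rename(input, "(<.*name=\")(John.*)(\".*>)", "${1}John-Example${3}"));
        // Hello <person name="John-Example">This is Paul</person>

        // Lazy: .*? stops at the first position where the rest of the pattern fits.
        Console.WriteLine(Rename(input, "(<.*?name=\")(John.*?)(\".*?>)", "${1}John-Example${3}"));
        // Hello <person name="John-Example">This is John</person><person name="Paul">This is Paul</person>

        // Negated classes keep the match inside one tag and let the replacement
        // drop whatever follows the name value before the closing '>'.
        Console.WriteLine(Rename("<person name=\"John\" hello>",
            "(<[^>]*?name=\")John[^\"]*\"[^>]*>", "${1}John-Example\">"));
        // <person name="John-Example">
    }
}
```

The ${1} form is used instead of $1 so that a replacement name starting with a digit cannot be misread as part of the group number.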
