Find HTML / XML node using RegEx - c#

I am parsing a number of HTML documents, and within each need to try and extract a UK postal address. In order to do so I am parsing the HTML with AngleSharp and then looking for nodes with TextContent that match my RegEx:
var parser = new HtmlParser();
var source = "<html><head><title>Test Title</title></head><body><h1>Some example source</h1><p>This is a paragraph element and example postcode EC1A 4NP</body></html>";
var document = parser.Parse(source);
Regex searchTerm = new Regex("([A-PR-UWYZ][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)");
var list = document.All.Where(m => searchTerm.IsMatch((m.TextContent ?? "").ToUpper()));
This returns 3 results, the html, body and p elements. The only element I want to return is the p element as that has the innerText matching the regex correctly. There may also be more than one match on a page so I can't just return the last result. I am looking to just return any elements where the text in that element (not in any child nodes) matches the regex.
Edit
I don't know the document structure in advance, or even which tag the postcode will be in, which is why I'm using a regex. Once I have the result I plan to traverse the DOM to obtain the rest of the address, so I don't want to just treat the document as a string.

If you are looking to extract a particular node within a well-formed HTML/XML document, then have a look at XPath. There are some examples on MSDN.
You can use utility libraries such as HTML Tidy to clean up the HTML and make it well-formed if it isn't already.

OK, I took a different approach in the end. I searched the HTML document as a string with the regex, not to parse the HTML but simply to find the exact match value. Once I had that value, it was simple enough to use an XPath expression to return the node. In the example above, the regex search returns EC1A 4NP and the following XPath:
//*[contains(text(),'EC1A 4NP')]
returns the required node. For XPath ease, I switched from AngleSharp to HtmlAgilityPack for the HTML parsing
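For comparison, the original requirement (match only the text that belongs directly to an element, not to its children) can also be sketched without XPath. This is a minimal sketch, not the approach used above: it assumes the markup is already well-formed (for example after a tidy step, with the <p> closed) so that System.Xml.Linq can load it, and the class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class OwnTextMatcher
{
    // Returns the elements whose own text nodes (not text inside child elements)
    // match the pattern. Assumes the input is well-formed XML/XHTML.
    public static List<XElement> FindByOwnText(string wellFormedMarkup, Regex pattern)
    {
        XDocument doc = XDocument.Parse(wellFormedMarkup);
        return doc.Descendants()
            .Where(e => pattern.IsMatch(
                // Concatenate only the direct text children of this element.
                string.Concat(e.Nodes().OfType<XText>().Select(t => t.Value))))
            .ToList();
    }
}
```

With the sample document from the question (and a closing </p> added), this should return only the p element, because html and body have no text nodes of their own.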

I've had a quick look at the parser's documentation. Below is what you need to do if you want to check only the text in <p> tags.
var list = document.All.Where(m => m.LocalName.ToUpper() == "P" && searchTerm.IsMatch((m.TextContent ?? "").ToUpper()));

Related

C# regex to strip value enclosed in XML element

I am trying to figure out how to write a regex that will strip out the values enclosed in an xml tag. For example,
string xml = "<MyElement1 attribute=\"bla\"><MyElement1>12345</MyElement1></MyElement1>";
I want to know how to do the following:
match on MyElement1 nodes that do not have an attribute
So specifically, using my example I would match <MyElement1>12345</MyElement1> and replace <MyElement1> and </MyElement1> so that my final node looks like this: <MyElement1 attribute="bla">12345</MyElement1>
I've tried: [<][^>]*[>] but this matches on all elements. I'm not sure how to specify specific elements I want to match on.
I have made edits to make the question more focused and clearer as suggested based on the downvotes. I understand that I can use parse and navigate my document tree, but I prefer to use a regex replace of some sort because I want to apply this logic to any number of xml files with different tree structures, elements, and attributes.
Well you really don't need to use regular expressions, you just need to parse your XML using an XML parser.
One of the options you have would be to use the XDocument.Parse(xml) method and XElement: the first parses the string, and the second lets you read each element's tag and its value. An example for reading it would be the following:
string xml = "<MyElement1>12345</MyElement1><MyElement2>abcd</MyElement2><MyElement3>12345</MyElement3><MyElement4>12345</MyElement4>";
// wrap your elements in a root node (you seem to be missing one in your example)
var document = XDocument.Parse($"<root>{xml}</root>");
// get the root node and loop over its children (keeping only the XElement nodes)
foreach (var node in document.Root.Nodes().OfType<XElement>()) {
    // Name is the tag, Value is its text content
    Console.WriteLine($"{node.Name}: {node.Value}");
}
Note that for the example to parse the document correctly, you must add a rootnode, as xml can have only one rootnode in the document. In my sample, I enclosed the rootnode during the parsing
This sample code uses the System.Xml.Linq namespace, so don't forget to import that one.
One additional comment would be that your supplied XML code had an error in it (MyElemen4 opening tag with MyElement4 closing tag)
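The same parser-based idea can be applied to the nested example from the question. The following is only a sketch under stated assumptions: it treats "strip the inner element" as "replace every attribute-less element that is nested directly inside an element of the same name with its own content", and the class and method names are hypothetical.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class NestedElementUnwrapper
{
    // Unwraps elements that have no attributes and whose parent has the same name.
    public static string Unwrap(string xml)
    {
        XElement root = XElement.Parse(xml);

        // Materialise the list first so the tree is not modified while it is enumerated.
        var targets = root.Descendants()
            .Where(e => !e.HasAttributes && e.Parent != null && e.Parent.Name == e.Name)
            .ToList();

        foreach (XElement e in targets)
        {
            // Replace the element with its child nodes (here, the text "12345").
            e.ReplaceWith(e.Nodes().ToList());
        }

        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

Because the rule is expressed in terms of names and attributes rather than a fixed path, it does not depend on where in the tree the nested element appears.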
I would recommend using an XML parser, but if you want, you can use a simple regex like <([\w]*)>(.*?)<\/[\w]*>, which returns the name of the tag and the value inside.
Output:
Match 1
Full match 0-30 <MyElement1>12345</MyElement1>
Group 1. 1-11 MyElement1
Group 2. 12-17 12345
Match 2
Full match 30-59 <MyElement2>abcd</MyElement2>
Group 1. 31-41 MyElement2
Group 2. 42-46 abcd
Match 3
Full match 59-89 <MyElement3>12345</MyElement3>
Group 1. 60-70 MyElement3
Group 2. 71-76 12345
Match 4
Full match 89-118 <MyElemen4>12345</MyElement4>
Group 1. 90-99 MyElemen4
Group 2. 100-105 12345
Keep in mind that it doesn't take tag attributes into consideration. If you want to fetch a specific tag, you can replace [\w]* with the tag name you want.
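In C#, that pattern could be applied roughly as follows. This is a minimal sketch; the class and method names are hypothetical, and the pattern is the one quoted above, unchanged.

```csharp
using System.Text.RegularExpressions;

public static class TagValueRegexSketch
{
    // Group 1 is the opening tag name, group 2 is the text between the tags.
    // The closing tag name is not checked against the opening one.
    public static MatchCollection FindSimpleElements(string xml)
    {
        return Regex.Matches(xml, @"<([\w]*)>(.*?)<\/[\w]*>");
    }
}
```

Each Match exposes the tag name as Groups[1].Value and the inner value as Groups[2].Value. Since the closing name is not verified, a mismatched pair such as <MyElemen4>...</MyElement4> still matches, as the output above shows.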

Parsing HTML string using C#

I have a string with html text as shown below.
string htmlText = "<h1>This is heading 1</h1><p>This is some text.</p>" +
    "<hr><h2>This is heading 2</h2><p>This is some other text.</p><hr>";
Can we convert this HTML string into the text we would see in a browser after it has been parsed, so that we can use the parsed string wherever required?
Later I want to copy this data to a SharePoint list multiline rich text column. There I don't need these tags to come, but
This answer provides an example using HtmlAgilityPack, which is much more robust than rolling your own parsing or regular expressions.
XPATH is your friend :)
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>");
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    Console.WriteLine("text=" + node.InnerText);
}
Your question isn't entirely clear and cuts off at the end. But you can actually parse the data if you want. Just examine each character to find the tags using string indexes (e.g. htmlText[i]).
If you need something a little more robust, use HtmlMonkey or HtmlAgilityPack to parse it for you.
The best way is to use a regular expression to extract the inner text between HTML tags.
Something like this might work:
((.+?)</h.?>)+((.+?)</p.?>)

Regex to find iframe tags and retrieve attributes

I am trying to retrieve iframe tags and attributes from an HTML input.
Sample input
<div class="1"><iframe width="100%" height="427px" src="https://www.youtube.com/embed/1" frameborder="0" allowfullscreen=""></iframe></div>
<div class="2"><iframe width="100%" height="427px" src="https://www.youtube.com/embed/2" frameborder="0" allowfullscreen=""></iframe></div>
I have been trying to collect them using the following regex:
<iframe.+?width=[\"'](?<width>.*?)[\"']?height=[\"'](?<height>.*?)[\"']?src=[\"'](?<src>.*?)[\"'].+?>
This results in
This is exactly the format I want.
The problem is, if the HTML attributes are in a different order this regex won't work.
Is there any way to modify this regex to ignore the attribute order and return the iframes grouped in Matches so that I could iterate through them?
Here is a regex that will ignore the order of attributes:
(?<=<iframe[^>]*?)(?:\s*width=["'](?<width>[^"']+)["']|\s*height=["'](?<height>[^'"]+)["']|\s*src=["'](?<src>[^'"]+["']))+[^>]*?>
RegexStorm demo
C# sample code:
var rx = new Regex(@"(?<=<iframe[^>]*?)(?:\s*width=[""'](?<width>[^""']+)[""']|\s*height=[""'](?<height>[^'""]+)[""']|\s*src=[""'](?<src>[^'""]+[""']))+[^>]*?>");
var input = @"YOUR INPUT STRING";
var matches = rx.Matches(input).Cast<Match>().ToList();
Regular expressions match patterns, and the structure of your string defines which pattern to use. Thus, if you want to use regular expressions, order is important.
You can deal with this in 2 ways:
The good and recommended way is to not parse HTML with regular expressions (mandatory link), but rather use a parsing framework such as the HTML Agility Pack. This should allow you to process the HTML you need and extract any values you are after.
The second, bad, and not recommended way is to break your matching into two parts. First use something like <iframe(.+?)></iframe> to extract the entire iframe declaration, and then use multiple, smaller regular expressions to find the settings you are after. The above regex obviously fails if your iframe is structured like so: <iframe.../>. This should give you a hint as to why you should not do HTML parsing through regular expressions.
As stated, you should go with the first option.
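Purely for illustration, the two-part approach from the second option might look something like the sketch below. It is not the recommended route; the class name, method name, and both patterns are assumptions made for this example, and it will still break on input such as a ">" inside an attribute value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IframeAttributeSketch
{
    // Step 1: capture everything between "<iframe" and the next ">".
    private static readonly Regex IframeTag =
        new Regex(@"<iframe\b([^>]*)>", RegexOptions.IgnoreCase);

    // Step 2: capture quoted name="value" or name='value' pairs, in any order.
    private static readonly Regex Attribute =
        new Regex(@"([\w-]+)\s*=\s*(?:""([^""]*)""|'([^']*)')");

    // Returns one attribute dictionary per iframe tag found in the input.
    public static List<Dictionary<string, string>> Extract(string html)
    {
        var result = new List<Dictionary<string, string>>();
        foreach (Match tag in IframeTag.Matches(html))
        {
            var attrs = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            foreach (Match a in Attribute.Matches(tag.Groups[1].Value))
            {
                // Group 2 holds a double-quoted value, group 3 a single-quoted one.
                attrs[a.Groups[1].Value] =
                    a.Groups[2].Success ? a.Groups[2].Value : a.Groups[3].Value;
            }
            result.Add(attrs);
        }
        return result;
    }
}
```

Each dictionary can then be queried with TryGetValue for "width", "height", or "src", regardless of the order in which the attributes appeared.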
You can use this regex
<iframe[ ]+(([a-z]+) *= *['"]*([a-zA-Z0-9\/:\.%]*)['"]*[ ]*)*>
It matches each name=value pair repeatedly and stores them in the same order in the match, so you can iterate through the matches to get the names and values sequentially. It caters for most characters in the value, but you may add a few more if needed.
With Html Agility Pack (to be had via nuget):
using System;
using HtmlAgilityPack;

namespace Demo
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.Load("HTMLPage1.html"); // or .LoadHtml(/*contentstring*/);
            HtmlNodeCollection iframes = doc.DocumentNode.SelectNodes("//iframe");
            // SelectNodes returns null when nothing matches
            if (iframes == null) return;
            foreach (HtmlNode iframe in iframes)
            {
                Console.WriteLine(iframe.GetAttributeValue("width", "null"));
                Console.WriteLine(iframe.GetAttributeValue("height", "null"));
                Console.WriteLine(iframe.GetAttributeValue("src", "null"));
            }
        }
    }
}
You need to use an OR operator (|). See changes below
<iframe.+?width=[\"']((?<width>.*?)[\"']?)|(height=[\"'](?<height>.*?)[\"']?)|(src=[\"'](?<src>.*?)[\"']))*.+?>

How can I grab text before a tag with HTMLAgilityPack

Suppose I have this HTML string:
These are some links<br>1234 - <a id="1234" href="#">My Number 1</a><br>4321 - My Number 2...
I want to extract the text after the <br> tag (1234 -), the inner text of the <a> tag (My Number 1), and the id attribute of the <a> tag (1234) as well. I am using the HTMLAgilityPack to help parse the HTML data that I get.
So far I have tried doing this:
// mNodes = code to get html string I want to parse
HtmlNode mNumberListNodes = mNodes[1]; // mNodes[1] is equal to a string as shown above
List<HtmlNode> mNumberNodes = mNumberListNodes.Descendants("a").ToList();
I am using debugging points to stop and view the previous child nodes in each of the HtmlNode's, but I am not having any luck finding the corresponding number text.
Anyone have any experience using the HTMLAgilityPack in C# that could help me?
I believe the
mNodes.InnerText
property will give you all the text that is not in html tags, specifically the "1234" you want. Text itself is not a node in the DOM.
Assuming the code above is correct, to get the id value, use:
mNumberListNodes.Descendants("a").ToList()[0].Attributes["id"].Value
I've had pretty good success using XPath with this library, and also regular expressions.

Find and replace text inside xml document using regular expression

I am using a C# console app to get an XML document. Once the XmlDocument is loaded, I want to search for a specific href tag:
href="/abc/def
inside the XML document.
Once that node is found, I want to strip the tag completely and just show Hello.
Hello
I think I can simply get the tag using a regex. But can anyone please tell me how I can remove the href tag completely using a regex?
XML and HTML, same difference: tagged content. XML is stricter in its formatting.
For this use case I would use transformations and XPath queries to rebuild the document. As @Yahia stated, regex on tagged documents is typically a bad idea. The regex needed for parsing is far too complex to be effective as a generic solution.
The most popular technology for similar tasks is called XPath. (It is also a key component of XQuery and XSLT.) Would the following perhaps solve your task, too?
root.SelectSingleNode("//a[@href='/abc/def']").InnerText = "Hello";
You could try
string x = @"<?xml version='1.0'?>
<EXAMPLE>
<a href='/abc/def'>Hello</a>
</EXAMPLE>";
System.Xml.XmlDocument doc = new XmlDocument();
doc.LoadXml(x);
XmlNode n = doc.SelectSingleNode("//a[@href='/abc/def']");
XmlNode p = n.ParentNode;
p.RemoveChild(n);
System.Xml.XmlNode newNode = doc.CreateNode("element", "a", "");
newNode.InnerXml = "Hello";
p.AppendChild(newNode);
Not really sure if this is what you are trying to do but it should be enough to get you headed in right direction.
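If the goal is to drop the <a> tag entirely and keep only its text, a variation on the snippet above could replace the element with a text node instead of a new <a> element. This is a sketch only: the class and method names are hypothetical, and it assumes the href value contains no single quote, since it is placed directly into the XPath expression.

```csharp
using System.Xml;

public static class AnchorUnwrapSketch
{
    // Replaces the first <a> element with the given href by its inner text.
    public static string ReplaceAnchorWithText(string xml, string href)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        // Assumption: href contains no single quote.
        XmlNode anchor = doc.SelectSingleNode("//a[@href='" + href + "']");
        if (anchor != null)
        {
            // Swap the element for a plain text node carrying the same text.
            XmlText text = doc.CreateTextNode(anchor.InnerText);
            anchor.ParentNode.ReplaceChild(text, anchor);
        }

        return doc.DocumentElement.OuterXml;
    }
}
```

For the sample document above, the EXAMPLE element would then contain just the text Hello, with no <a> element left.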
