I have a webpage source code which has several occurrences of
<div class="detName">some unpredictable text</div>
I want to be able to get a collection of all the "some unpredictable text" values.
I tried something like:
var match = Regex.Match(pageSourceCode, @"<div class='detName'>/(A-Za-z0-9\-]+)\</div>", RegexOptions.IgnoreCase);
But I had no success. What would be a good solution for this issue?
Don't use regex to parse HTML; use HTML Agility Pack instead:
string html = "<div class=\"detName\">some unpredictable text</div>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[contains(@class,'detName')]");
foreach (var node in nodes)
{
    Console.WriteLine(node.InnerText);
}
var match = Regex.Match(pageSourceCode, @"(?<=<div class='detName'>)(.*)(?=</div>)", RegexOptions.IgnoreCase);
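If a regex is used anyway, note that the pattern above expects single quotes around detName while the sample markup uses double quotes, and that Regex.Match returns only the first hit. The following is a minimal sketch, not a robust parser: it accepts either quote style, uses a lazy quantifier, and collects every match. The class and method names (DetNameRegex, ExtractDetNames) are made up for illustration, and the pattern will still break on nested divs or on extra attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DetNameRegex
{
    // Hypothetical helper: returns the inner text of every
    // <div class="detName">...</div> found in the page source.
    public static List<string> ExtractDetNames(string pageSourceCode)
    {
        var results = new List<string>();

        // [""'] accepts a double or a single quote; (.*?) is lazy so each
        // match stops at the first </div> instead of the last one.
        var pattern = @"<div\s+class=[""']detName[""']\s*>(.*?)</div>";
        var options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        foreach (Match m in Regex.Matches(pageSourceCode, pattern, options))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }
}
```

Calling DetNameRegex.ExtractDetNames(pageSourceCode) returns the captured texts in document order; the HTML Agility Pack approach remains the safer choice for real pages.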
I am trying to parse a website. I need the links in an HTML file whose URLs contain some specific words. I know how to find "href" attributes, but I don't need all of them. Is there any way to do that? For example, can I use regex in HtmlAgilityPack?
HtmlNode links = document.DocumentNode.SelectSingleNode("//*[@id='navigation']/div/ul");
foreach (HtmlNode urls in document.DocumentNode.SelectNodes("//a[@href]"))
{
    this.dgvurl.Rows.Add(urls.Attributes["href"].Value);
}
I'm trying this for finding all links in HTML code.
If you have an HTML file like this:
<div class="a">
</div>
And suppose you're searching for the following words: theword and other. You can define a regular expression, then use LINQ to get the links whose href attribute matches it, like this:
Regex regex = new Regex("(theword|other)", RegexOptions.IgnoreCase);
HtmlNode node = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='a']");
// Select only anchors that have an href, so Attributes["href"] is never null
List<HtmlNode> nodeList = node.SelectNodes(".//a[@href]").Where(a => regex.IsMatch(a.Attributes["href"].Value)).ToList<HtmlNode>();
List<string> urls = new List<string>();
foreach (HtmlNode n in nodeList)
{
    urls.Add(n.Attributes["href"].Value);
}
Note that XPath has a contains function, but you'll have to repeat the condition for each word you're searching for, like this:
node.SelectNodes(".//a[contains(@href,'theword') or contains(@href,'other')]")
There's also a matches function in XPath. Unfortunately it's only available in XPath 2.0, and HtmlAgilityPack uses XPath 1.0. With XPath 2.0, you could do something like this:
node.SelectNodes(".//a[matches(@href,'(theword|other)')]")
I found this, and it works for me.
HtmlNode links = document.DocumentNode.SelectSingleNode("//*[@id='navigation']/div/ul");
foreach (HtmlNode urls in document.DocumentNode.SelectNodes("//a[@href]"))
{
    var temp = urls.Attributes["href"].Value;
    if (temp.Contains("some_word"))
    {
        dgv.Rows.Add(temp);
    }
}
I have an HTML document that I download with my web request client. Out of the entire HTML, I want to parse only this part:
<span class="sku">
<span class="fb">SKU :</span>118880101
</span>
I'm using HTML Agility Pack to retrieve this value: 118880101
And I've written something like this:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
return htmlDoc.DocumentNode.SelectNodes("//span[@class='sku']").ElementAt(0).InnerText;
And this returns me this value from HTML:
SKU :118880101
Literally like this, spaces included... How can I fix this logic with HTML Agility Pack so that I get only the 118880101 value?
Can someone help me out?
Edit: something like this would do the job:
skuRaw.Substring(skuRaw.LastIndexOf(':') + 1);
which takes everything after the ':' sign in the string that I receive... But I'm not sure if it's safe to use a string operation like this?
Try this:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var innerText = htmlDoc.DocumentNode.SelectNodes("//span[@class='sku']")
    .ElementAt(0).InnerText;
// Strip every non-digit character (requires System.Text.RegularExpressions)
return Regex.Replace(innerText, @"\D", "");
If you want to use only Html Agility Pack, try this:
var child = htmlDoc.DocumentNode.SelectNodes("//span[@class='fb']")
    .FirstOrDefault();
if (child != null)
{
    var parent = child.ParentNode;
    // Remove the "SKU :" label so only the number is left in the parent
    parent.RemoveChild(child);
    var innerText = parent.InnerText.Trim();
}
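The Substring idea from the question's edit is a plain string operation rather than a regex, and it is safe as long as the value itself never contains a colon. Here is a minimal sketch of it with the surrounding whitespace trimmed; the class and method names (SkuText, AfterLastColon) are hypothetical.

```csharp
using System;

public static class SkuText
{
    // Returns the text after the last ':' with surrounding whitespace removed.
    // If the input has no ':', LastIndexOf returns -1, so Substring(0) keeps
    // the whole string and only the trimming is applied.
    public static string AfterLastColon(string raw)
    {
        return raw.Substring(raw.LastIndexOf(':') + 1).Trim();
    }
}
```

For the InnerText shown in the question, SkuText.AfterLastColon(innerText) yields 118880101 without the label or the line breaks.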
I am trying to create a regular expression that returns the number of tables, or an array of tables. So far I have
@"<table>^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$</table>"
The html can be
<table>
<p id='p1'></p>
</table>
<table>
<p>abc</p>
</table>
For example, if I run the following code:
string str = "<table><p id='p1'></p></table><table><p>abc</p></table>";
Regex r = new Regex(@"/<table>^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$</table>/");
MatchCollection s = r.Matches(str);
Response.Write(s.Count);
Then it should write "2" since there are two tables.
The above regex isn't working as expected. The regex for parsing html seems to be ok, but I am having difficulty in combining the regex for html and regex that encapsulates html (table that encapsulates html elements)
I recommend using Html Agility Pack:
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
var nodes = htmlDocument.DocumentNode.SelectNodes("//table");
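To get the number the question asks for, the node collection can simply be counted. This is a small sketch assuming the HtmlAgilityPack NuGet package is referenced; the class and method names (TableCounter, CountTables) are made up here. SelectNodes returns null rather than an empty collection when nothing matches, so the count is guarded.

```csharp
using System;
using HtmlAgilityPack;

public static class TableCounter
{
    // Hypothetical helper: counts the <table> elements in an HTML string.
    public static int CountTables(string html)
    {
        var htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(html);

        // SelectNodes yields null (not an empty collection) when no node matches.
        HtmlNodeCollection nodes = htmlDocument.DocumentNode.SelectNodes("//table");
        return nodes?.Count ?? 0;
    }
}
```

In the question's code, Response.Write(TableCounter.CountTables(str)) would take the place of the regex-based count.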
In a C# app, I want to match every HTML "font" tag with "color" attribute.
I have the following text:
1<font color="red">2<font color="blue">3</font>4</font>56
And I want a MatchCollection containing the following items:
[0] <font color="red">234</font>
[1] <font color="blue">3</font>
But when I use this code:
Regex.Matches(result, "<font color=\"(.*)\">(.*)</font>");
The MatchCollection I get is the following one:
[0] <font color="red">2<font color="blue">3</font>4</font>
How can I get the MatchCollection I want using C#?
Thanks.
Regex on "HTML" is an antipattern. Just don't do it.
To steer you on the right path, look at what you can do with HTML Agility Pack:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"1<font color=""red"">2<font color=""blue"">3</font>4</font>56");
var fontElements = doc.DocumentNode.Descendants("font");
var newNodes = fontElements.Select(fe => {
    var newNode = fe.Clone();
    newNode.InnerHtml = fe.InnerText;
    return newNode;
});
var collection = newNodes.Select(n => n.OuterHtml);
Now, in collection we have the following strings:
<font color="red">234</font>
<font color="blue">3</font>
mmm... lovely.
MatchCollection m = Regex.Matches(result, "<font color=\"(.*?)\">(.*?)</font>");
// Add a ? after each * to make it lazy, then print the result to see what it captures.
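It is worth checking what the lazy pattern really returns on the nested input. The sketch below uses only System.Text.RegularExpressions and runs both the greedy pattern from the question and the lazy one against the sample text; the names FontRegexDemo, Greedy, Lazy and MatchValues are made up for illustration. Neither pattern produces the two items the question asks for, because a regex match consumes the inner tag while matching the outer one.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class FontRegexDemo
{
    public const string Input =
        "1<font color=\"red\">2<font color=\"blue\">3</font>4</font>56";

    // Pattern from the question: .* grabs as much as possible.
    public const string Greedy = "<font color=\"(.*)\">(.*)</font>";

    // Pattern with ? added: .*? grabs as little as possible.
    public const string Lazy = "<font color=\"(.*?)\">(.*?)</font>";

    // Returns the full text of every match of the pattern in Input.
    public static string[] MatchValues(string pattern)
    {
        return Regex.Matches(Input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}

// MatchValues(Greedy) gives one match:
//   <font color="red">2<font color="blue">3</font>4</font>
// MatchValues(Lazy) also gives one match, cut at the first closing tag:
//   <font color="red">2<font color="blue">3</font>
```

The lazy version stops at the first closing tag and the search then resumes after it, so the inner blue tag is never reported on its own. That is the practical reason the parser-based answers handle nesting better.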
Here is a way with Html Agility Pack and an XPath query that ensures the color attribute is present:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
String html = "1<font color=\"red\">2<font color=\"blue\">3</font>4</font>56";
htmlDoc.LoadHtml(html);
HtmlNodeCollection fontTags = htmlDoc.DocumentNode.SelectNodes(".//font[@color]");
foreach (HtmlNode fontTag in fontTags)
{
    Console.WriteLine(fontTag.InnerText);
}
I have a string variable that contains the entire HTML of a web page.
The web page contains links to other websites. I would like to create a list of all hrefs (web crawler style).
What is the best possible way to do it?
Will using any extension function help? What about using Regex?
Thanks in advance
Use a DOM parser such as the HTML Agility Pack to parse your document and find all links.
There's a good question on SO about how to use HTML Agility Pack available here. Here's a simple example to get you started:
string html = "your HTML here";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var links = doc.DocumentNode.DescendantNodes()
    .Where(n => n.Name == "a" && n.Attributes.Contains("href"))
    .Select(n => n.Attributes["href"].Value);
I think you'll find this answers your question to a T
http://msdn.microsoft.com/en-us/library/t9e807fx.aspx
:)
I would go with Regex.
Regex exp = new Regex(
    @"{href=}*{>}",
    RegexOptions.IgnoreCase);
string InputText = ""; // supply the page HTML here (e.g. fetched over HTTP)
MatchCollection MatchList = exp.Matches(InputText);
Try this Regex (should work):
var matches = Regex.Matches(html, @"href=""(.+?)""");
You can go through the matches and extract the captured URL.
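Going through the matches could look like the sketch below. It reuses the same pattern and reads capture group 1, which holds the URL between the quotes; the class and method names (HrefRegex, ExtractHrefs) are hypothetical. It only sees double-quoted href values and will also pick up href attributes on tags other than a, such as link.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefRegex
{
    // Hypothetical helper: collects the value of every double-quoted
    // href="..." attribute found in the HTML string.
    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(html, @"href=""(.+?)"""))
        {
            // Groups[0] is the whole match; Groups[1] is the captured URL.
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```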
Have you looked into using HTML Agility Pack? http://htmlagilitypack.codeplex.com/
With it you can simply use XPath to get all of the links on the page and put them into a list.
private List<string> ExtractAllAHrefTags(HtmlDocument htmlSnippet)
{
    List<string> hrefTags = new List<string>();
    // SelectNodes returns null when no node matches the XPath
    var links = htmlSnippet.DocumentNode.SelectNodes("//a[@href]");
    if (links == null) return hrefTags;
    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        hrefTags.Add(att.Value);
    }
    return hrefTags;
}
Taken from another post here - Get all links on html page?