Find text between known pattern

I have the source code of a web page, which contains several occurrences of
<div class="detName">some unpredictable text</div>
I want to be able to get a collection of all the "some unpredictable text" values.
I tried something like:
var match = Regex.Match(pageSourceCode, @"<div class='detName'>/(A-Za-z0-9\-]+)\</div>", RegexOptions.IgnoreCase);
But I had no success. What would be a good solution for this issue?

Don't use regex to parse HTML; use HTML Agility Pack instead:
string html = "<div class=\"detName\">some unpredictable text</div>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[contains(@class,'detName')]");
foreach (var node in nodes)
{
    Console.WriteLine(node.InnerText);
}

var match = Regex.Match(pageSourceCode, @"(?<=<div class='detName'>)(.*)(?=</div>)", RegexOptions.IgnoreCase);
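
Since the question asks for a collection, a sketch built on Regex.Matches may be closer to what is needed than a single Regex.Match. Note that the pattern above expects single quotes around detName while the markup in the question uses double quotes, and its greedy (.*) can run on to the last </div> on the line. The version below assumes the markup is written exactly as in the question and that no detName div contains a nested div; the names DetNameExtractor and ExtractDetNames are made up for illustration. For anything less regular, the HTML Agility Pack answer above is probably the safer route.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DetNameExtractor
{
    // Hypothetical helper: returns the text of every <div class="detName">...</div>.
    public static List<string> ExtractDetNames(string pageSourceCode)
    {
        // Lazy (.*?) stops at the first </div>; Singleline lets '.' span line breaks.
        var pattern = @"<div class=""detName"">(.*?)</div>";
        var results = new List<string>();
        foreach (Match m in Regex.Matches(pageSourceCode, pattern,
                     RegexOptions.IgnoreCase | RegexOptions.Singleline))
        {
            // Group 1 holds the text between the opening and closing tags.
            results.Add(m.Groups[1].Value);
        }
        return results;
    }
}
```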

Related

Get links with specific words from a HTML code in C#

I am trying to parse a website. I need the links in an HTML file that contain some specific words. I know how to find "href" attributes, but I don't need all of them. Is there any way to do that? For example, can I use regex with HtmlAgilityPack?
HtmlNode links = document.DocumentNode.SelectSingleNode("//*[@id='navigation']/div/ul");
foreach (HtmlNode urls in document.DocumentNode.SelectNodes("//a[@href]"))
{
    this.dgvurl.Rows.Add(urls.Attributes["href"].Value);
}
I'm trying this for finding all links in HTML code.

If you have an HTML file like this:
<div class="a">
</div>
And you're searching for, say, the words theword and other. You can define a regular expression, then use LINQ to get the links whose href attribute matches it, like this:
Regex regex = new Regex("(theword|other)", RegexOptions.IgnoreCase);
HtmlNode node = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='a']");
// GetAttributeValue returns "" for anchors without an href instead of throwing.
List<HtmlNode> nodeList = node.SelectNodes(".//a").Where(a => regex.IsMatch(a.GetAttributeValue("href", ""))).ToList();
List<string> urls = new List<string>();
foreach (HtmlNode n in nodeList)
{
    urls.Add(n.Attributes["href"].Value);
}
Note that XPath has a contains function, but you'll have to repeat the condition for each word you're searching for:
node.SelectNodes(".//a[contains(@href,'theword') or contains(@href,'other')]")
There's also a matches function in XPath; unfortunately, it's only available in XPath 2.0, and HtmlAgilityPack uses XPath 1.0. With XPath 2.0, you could do something like this:
node.SelectNodes(".//a[matches(@href,'(theword|other)')]")

I found this, and it works for me.
HtmlNode links = document.DocumentNode.SelectSingleNode("//*[@id='navigation']/div/ul");
foreach (HtmlNode urls in document.DocumentNode.SelectNodes("//a[@href]"))
{
    var temp = urls.Attributes["href"].Value;
    if (temp.Contains("some_word"))
    {
        dgv.Rows.Add(temp);
    }
}

HtmlAgilityPack: parse HTML and take value out of span tag and class name

I have an HTML document that I download via my web request client. Out of the entire HTML, I want to parse only this part:
<span class="sku">
<span class="fb">SKU :</span>118880101
</span>
I'm using HTML Agility Pack to retrieve this value: 118880101
And I've written something like this:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
return htmlDoc.DocumentNode.SelectNodes("//span[@class='sku']").ElementAt(0).InnerText;
And this returns the following value from the HTML:
SKU :118880101
Literally like this, spaces included... How can I fix this logic with HTML Agility Pack so that I get only the 118880101 value?
Can someone help me out?
Edit: something like this would do the job:
Substring(skuRaw.LastIndexOf(':') + 1);
which would take everything after the ':' sign in the string that I receive... But I'm not sure if it's safe to rely on string manipulation like this?

Try this:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var innerText = htmlDoc.DocumentNode.SelectNodes("//span[@class='sku']")
    .ElementAt(0).InnerText;
// Strip every non-digit character (requires System.Text.RegularExpressions).
return Regex.Replace(innerText, @"\D", "");
If you want to use only Html Agility Pack, try this:
var child = htmlDoc.DocumentNode.SelectNodes("//span[@class='fb']")
    .FirstOrDefault();
if (child != null)
{
    // Remove the label span so only the SKU text remains in the parent.
    var parent = child.ParentNode;
    parent.RemoveChild(child);
    var innerText = parent.InnerText.Trim();
}
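
On the Substring idea from the question: it is plain string manipulation rather than a regex, and it should hold up as long as the SKU itself never contains a colon. A minimal sketch, assuming the raw text looks like the "SKU :118880101" shown in the question (SkuParser and ExtractSku are hypothetical names):

```csharp
using System;

public static class SkuParser
{
    // Hypothetical helper: returns whatever follows the last ':' with whitespace trimmed.
    public static string ExtractSku(string skuRaw)
    {
        int colon = skuRaw.LastIndexOf(':');
        // LastIndexOf returns -1 when there is no colon, so Substring(0) keeps the whole string.
        return skuRaw.Substring(colon + 1).Trim();
    }
}
```

Unlike stripping all non-digits, this keeps letters or dashes if a SKU ever contains them.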

Regex for html tags that are encapsulated by table elements

I am trying to create a regular expression that returns the number of tables, or an array of tables. So far I have
@"<table>^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$</table>"
The HTML can be
<table>
<p id='p1'></p>
</table>
<table>
<p>abc</p>
</table>
For example, if I run the following code
string str = "<table><p id='p1'></p></table><table><p>abc</p></table>";
Regex r = new Regex(@"/<table>^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$</table>/");
MatchCollection s = r.Matches(str);
Response.Write(s.Count);
then it should write "2", since there are two tables.
The above regex isn't working as expected. The regex for parsing HTML seems to be OK, but I am having difficulty combining it with the regex for the enclosing table elements.

I recommend using Html Agility Pack:
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
var nodes = htmlDocument.DocumentNode.SelectNodes("//table");

Match nested HTML tags

In a C# app, I want to match every HTML "font" tag with "color" attribute.
I have the following text:
1<font color="red">2<font color="blue">3</font>4</font>56
And I want a MatchCollection containing the following items:
[0] <font color="red">234</font>
[1] <font color="blue">3</font>
But when I use this code:
Regex.Matches(result, "<font color=\"(.*)\">(.*)</font>");
The MatchCollection I get is the following one:
[0] <font color="red">2<font color="blue">3</font>4</font>
How can I get the MatchCollection I want using C#?
Thanks.

Regex on "HTML" is an antipattern. Just don't do it.
To steer you on the right path, look at what you can do with HTML Agility Pack:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"1<font color=""red"">2<font color=""blue"">3</font>4</font>56");
var fontElements = doc.DocumentNode.Descendants("font");
var newNodes = fontElements.Select(fe => {
    var newNode = fe.Clone();
    newNode.InnerHtml = fe.InnerText;
    return newNode;
});
var collection = newNodes.Select(n => n.OuterHtml);
Now, in collection we have the following strings:
<font color="red">234</font>
<font color="blue">3</font>
mmm... lovely.

MatchCollection m = Regex.Matches(result, "<font color=\"(.*?)\">(.*?)</font>");
// Add a ? after each * to make it lazy, then print the result to see what it matches.
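
It may be worth checking what the lazy version actually returns for the nested input. As far as I can tell, it yields a single match that ends at the first </font>, so it still does not produce the two items the question asks for. A small sketch to verify this (LazyFontDemo and FindFontTags are illustrative names only):

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LazyFontDemo
{
    // Hypothetical helper: collects the full text of every match of the lazy pattern.
    public static List<string> FindFontTags(string input)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(input, "<font color=\"(.*?)\">(.*?)</font>"))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

For the input from the question, this returns one item, <font color="red">2<font color="blue">3</font>, because the inner tag is consumed as part of the outer match.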

Here is a way with Html Agility Pack and an XPath query that ensures the color attribute is present:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
String html = "1<font color=\"red\">2<font color=\"blue\">3</font>4</font>56";
htmlDoc.LoadHtml(html);
HtmlNodeCollection fontTags = htmlDoc.DocumentNode.SelectNodes(".//font[@color]");
foreach (HtmlNode fontTag in fontTags)
{
    Console.WriteLine(fontTag.InnerText);
}

To search for strings within a string (search for all hrefs in HTML source)

I have a string variable that contains the entire HTML of a web page.
The web page contains links to other websites. I would like to create a list of all hrefs (like a web crawler).
What is the best possible way to do it?
Would an extension method help? What about using Regex?
Thanks in advance.

Use a DOM parser such as the HTML Agility Pack to parse your document and find all links.
There's a good question on SO about how to use HTML Agility Pack available here. Here's a simple example to get you started:
string html = "your HTML here";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var links = doc.DocumentNode.DescendantNodes()
    .Where(n => n.Name == "a" && n.Attributes.Contains("href"))
    .Select(n => n.Attributes["href"].Value);

I think you'll find this answers your question to a T
http://msdn.microsoft.com/en-us/library/t9e807fx.aspx
:)

I would go with Regex.
Regex exp = new Regex(
    @"href\s*=\s*""([^""]*)""",
    RegexOptions.IgnoreCase);
string InputText = ""; // supply with the HTML fetched over HTTP
MatchCollection MatchList = exp.Matches(InputText);

Try this Regex (should work):
var matches = Regex.Matches(html, @"href=""(.+?)""");
You can go through the matches and extract the captured URL.
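
Going through the matches might look like the sketch below. It assumes every href value is wrapped in double quotes, as the pattern requires; HrefExtractor and ExtractHrefs are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Hypothetical helper: returns the value of every double-quoted href in the input.
    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(html, @"href=""(.+?)"""))
        {
            // Group 1 is the text between the quotes.
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```

Single-quoted or unquoted href values are skipped by this pattern, which is one reason the parser-based answers tend to be more robust.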

Have you looked into using HTML Agility Pack? http://htmlagilitypack.codeplex.com/
With it, you can simply use XPath to get all of the links on the page and put them into a list.
private List<string> ExtractAllAHrefTags(HtmlDocument htmlSnippet)
{
    List<string> hrefTags = new List<string>();
    foreach (HtmlNode link in htmlSnippet.DocumentNode.SelectNodes("//a[@href]"))
    {
        HtmlAttribute att = link.Attributes["href"];
        hrefTags.Add(att.Value);
    }
    return hrefTags;
}
Taken from another post here - Get all links on html page?
