Get links with specific words from a HTML code in C# - c#

I am trying to parse a website. I need only the links in an HTML file that contain specific words. I know how to find "href" attributes, but I don't need all of them. Is there any way to do that? For example, can I use regex with HtmlAgilityPack?
HtmlNode links = document.DocumentNode.SelectSingleNode("//*[@id='navigation']/div/ul");
foreach (HtmlNode urls in document.DocumentNode.SelectNodes("//a[@href]"))
{
this.dgvurl.Rows.Add(urls.Attributes["href"].Value);
}
This is what I'm using to find all links in the HTML code.

If you have an HTML file like this:
<div class="a">
</div>
Suppose you're searching for the words theword and other. You can define a regular expression, then use LINQ to keep only the links whose href attribute matches it:
Regex regex = new Regex("(theword|other)", RegexOptions.IgnoreCase);
HtmlNode node = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='a']");
List<HtmlNode> nodeList = node.SelectNodes(".//a[@href]").Where(a => regex.IsMatch(a.Attributes["href"].Value)).ToList();
List<string> urls = new List<string>();
foreach (HtmlNode n in nodeList)
{
urls.Add(n.Attributes["href"].Value);
}
Note that XPath has a contains function, but you'll have to repeat the condition for each word you're searching for, like this:
node.SelectNodes(".//a[contains(@href,'theword') or contains(@href,'other')]")
There's also a matches function in XPath. Unfortunately it's only available in XPath 2.0, and HtmlAgilityPack uses XPath 1.0. With XPath 2.0, you could do something like this:
node.SelectNodes(".//a[matches(@href,'(theword|other)')]")
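
Since XPath 1.0 needs one contains condition per word, the expression can be generated from a word list instead of written by hand. The following is a minimal sketch, not part of the original answer: XPathHelper and BuildHrefContainsXPath are hypothetical names, and it assumes the words contain no single quotes, because XPath 1.0 string literals cannot escape them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class XPathHelper
{
    // Hypothetical helper: builds ".//a[contains(@href,'w1') or contains(@href,'w2')]"
    // from a list of words. Assumes no word contains a single quote.
    public static string BuildHrefContainsXPath(IEnumerable<string> words)
    {
        var conditions = new List<string>();
        foreach (string word in words)
        {
            if (word.Contains("'"))
            {
                throw new ArgumentException("Words must not contain single quotes.");
            }
            conditions.Add($"contains(@href,'{word}')");
        }

        if (conditions.Count == 0)
        {
            throw new ArgumentException("At least one word is required.");
        }

        // Join the per-word conditions with the XPath "or" operator.
        return ".//a[" + string.Join(" or ", conditions) + "]";
    }
}

public static class Program
{
    public static void Main()
    {
        string xpath = XPathHelper.BuildHrefContainsXPath(new[] { "theword", "other" });
        // Prints: .//a[contains(@href,'theword') or contains(@href,'other')]
        Console.WriteLine(xpath);
    }
}
```

The resulting string can be passed to node.SelectNodes(xpath) in place of the handwritten expression. Note that contains is case-sensitive, unlike the Regex with RegexOptions.IgnoreCase shown earlier.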

I found this, and it works for me:
HtmlNode links = document.DocumentNode.SelectSingleNode("//*[@id='navigation']/div/ul");
foreach (HtmlNode urls in document.DocumentNode.SelectNodes("//a[@href]"))
{
var temp = urls.Attributes["href"].Value;
if (temp.Contains("some_word"))
{
dgv.Rows.Add(temp);
}
}

Related

Retrieve all the ids for a given sentence using regex in c#

I am working on a .NET (C#) application that fetches and processes an HTML file. I need to get the ids of the HTML elements in that file, and I want to use a regular expression for that. I've tried some combinations, but with no luck.
For example, if I have the line:
<a href="#" id="thisAnchor" >Link to somewhere</a><div id="divToCollect">BigDiv</div>
I want to get: thisAnchor and divToCollect. I am using Regex:
Regex.Matches(currentLine, expression);
You should not use regex for that. Use HtmlAgilityPack and you will have no problems getting all the attributes you need:
string html = "<div id='divid'></div><a id='ancorid'></a>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var divIds = doc.DocumentNode
.Descendants("div")
.Where(div => div.Attributes["id"] != null)
.Select(div => div.Attributes["id"].Value)
.ToList();

How can I read an HTML file a Paragraph at a time?

I reckon it would be something like (pseudocode):
var pars = new List<string>();
string par;
while (not eof("Platypus.html"))
{
par = getNextParagraph();
pars.Add(par);
}
...where getNextParagraph() looks for the next "<p>" and continues until it finds "</p>", burning its bridges behind it ("cutting" the paragraph so that it is not found over and over again). Or some such.
Does anybody have insight on how exactly to do this / a better methodology?
UPDATE
I tried to use Aurelien Souchet's code.
I have the following usings:
using HtmlAgilityPack;
using HtmlDocument = System.Windows.Forms.HtmlDocument;
...but this code:
HtmlDocument doc = new HtmlDocument();
is rejected ("Cannot access private constructor 'HtmlDocument' here").
Also, both "doc.LoadHtml()" and "doc.DocumentNode" give the old "Cannot resolve symbol 'Bla'" error message.
UPDATE 2
Okay, I had to prepend "HtmlAgilityPack." so that the ambiguous reference was disambiguated.
As people suggest in the comments, I think HtmlAgilityPack is the best choice. It's easy to use, and good examples and tutorials are easy to find.
Here is what I would write:
//don't forget to add the reference
using HtmlAgilityPack;
//Function that takes the HTML as a string parameter and returns a list
//of strings with the content of each paragraph.
public List<string> GetParagraphsListFromHtml(string sourceHtml)
{
var pars = new List<string>();
//first create an HtmlDocument
HtmlDocument doc = new HtmlDocument();
//load the html (from a string)
doc.LoadHtml(sourceHtml);
//Select all the <p> nodes into an HtmlNodeCollection
HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes(".//p");
//SelectNodes returns null when nothing matches
if (paragraphs == null) return pars;
//Iterate over every node in the collection
foreach (HtmlNode paragraph in paragraphs)
{
//Add the InnerText to the list
pars.Add(paragraph.InnerText);
//Or paragraph.InnerHtml, depending on what you want
}
return pars;
}
It's just a basic example. If your HTML has nested paragraphs, this code may not work as expected; it all depends on the HTML you are parsing and what you want to do with it.
Hope it helps!

Find text between known pattern

I have a webpage source code which has several occurrences of
<div class="detName">some unpredictable text</div>
I want to be able to get a collection of all the "some unpredictable text" values.
I tried something like:
var match = Regex.Match(pageSourceCode, @"<div class='detName'>/(A-Za-z0-9\-]+)\</div>", RegexOptions.IgnoreCase);
But I had no success. What would be a good solution for this issue?
Don't use regex to parse HTML. You can use HTML Agility Pack:
string html = "<div class=\"detName\">some unpredictable text</div>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[contains(@class,'detName')]");
foreach (var node in nodes)
{
Console.WriteLine(node.InnerText);
}
var match = Regex.Match(pageSourceCode, @"(?<=<div class='detName'>)(.*)(?=</div>)", RegexOptions.IgnoreCase);

To search for strings within a string (search for all hrefs in HTML source)

I have a string variable that contains the entire HTML of a web page.
The web page contains links to other websites. I would like to create a list of all hrefs (like a web crawler).
What is the best way to do it?
Would an extension method help? What about using Regex?
Thanks in advance.
Use a DOM parser such as the HTML Agility Pack to parse your document and find all links.
There's a good question on SO about how to use HTML Agility Pack available here. Here's a simple example to get you started:
string html = "your HTML here";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var links = doc.DocumentNode.DescendantNodes()
.Where(n => n.Name == "a" && n.Attributes.Contains("href"))
.Select(n => n.Attributes["href"].Value);
I think you'll find this answers your question to a T
http://msdn.microsoft.com/en-us/library/t9e807fx.aspx
:)
I would go with Regex.
Regex exp = new Regex(
@"{href=}*{>}",
RegexOptions.IgnoreCase);
string InputText; //supply with HTTP
MatchCollection MatchList = exp.Matches(InputText);
Try this Regex (should work):
var matches = Regex.Matches(html, @"href=""(.+?)""");
You can go through the matches and extract the captured URL.
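
Going through the matches could look like the sketch below. It reuses the pattern from this answer; the HrefExtractor class and ExtractHrefs method are hypothetical names added for illustration, and the pattern only handles double-quoted href values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Hypothetical helper: collects the first capture group (the URL)
    // of every href="..." occurrence in the given HTML string.
    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match match in Regex.Matches(html, @"href=""(.+?)"""))
        {
            // Groups[0] is the whole match; Groups[1] is the text between the quotes.
            urls.Add(match.Groups[1].Value);
        }
        return urls;
    }
}

public static class Program
{
    public static void Main()
    {
        string html = "<a href=\"http://a.example/\">A</a> <a href=\"/b\">B</a>";
        foreach (string url in HrefExtractor.ExtractHrefs(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Because the pattern is not tied to anchor tags, it would also pick up href attributes on other elements such as link.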
Have you looked into using HtmlAgilityPack? http://htmlagilitypack.codeplex.com/
With this you can simply use XPath to get all of the links on the page and put them into a list.
private List<string> ExtractAllAHrefTags(HtmlDocument htmlSnippet)
{
List<string> hrefTags = new List<string>();
foreach (HtmlNode link in htmlSnippet.DocumentNode.SelectNodes("//a[@href]"))
{
HtmlAttribute att = link.Attributes["href"];
hrefTags.Add(att.Value);
}
return hrefTags;
}
Taken from another post here - Get all links on html page?

Regular Expression to get the SRC of images in C#

I'm looking for a regular expression to isolate the src value of an img.
(I know that this is not the best way to do this but this is what I have to do in this case)
I have a string which contains simple html code, some text and an image. I need to get the value of the src attribute from that string. I have managed only to isolate the whole tag till now.
string matchString = Regex.Match(original_text, @"(<img([^>]+)>)").Value;
string matchString = Regex.Match(original_text, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;
I know you say you have to use regex, but if possible I would really give this open source project a chance:
HtmlAgilityPack
It is really easy to use. I just discovered it and it helped me a lot, since I was doing some heavier HTML parsing. It basically lets you use XPath to get your elements.
Their example page is a little outdated, but the API is really easy to understand, and if you are a little familiar with XPath you will get your head around it in no time.
The code for your query would look something like this: (uncompiled code)
List<string> imgScrs = new List<string>();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlText);//or doc.Load(htmlFileStream)
var nodes = doc.DocumentNode.SelectNodes(@"//img[@src]");
foreach (var img in nodes)
{
HtmlAttribute att = img.Attributes["src"];
imgScrs.Add(att.Value);
}
I tried what Francisco Noriega suggested, but it looks like the HtmlAgilityPack API has changed. Here is how I solved it:
List<string> images = new List<string>();
WebClient client = new WebClient();
string site = "http://www.mysite.com";
var htmlText = client.DownloadString(site);
var htmlDoc = new HtmlDocument()
{
OptionFixNestedTags = true,
OptionAutoCloseOnEnd = true
};
htmlDoc.LoadHtml(htmlText);
foreach (HtmlNode img in htmlDoc.DocumentNode.SelectNodes("//img[@src]"))
{
HtmlAttribute att = img.Attributes["src"];
images.Add(att.Value);
}
This should capture all img tags and just the src part, no matter where it's located (before or after class, etc.), and it supports HTML/XHTML :D
<img.+?src="(.+?)".+?/?>
The regex you want should be along the lines of:
(<img.*?src="([^"]*)".*?>)
Hope this helps.
You can also use a lookbehind to do it without needing to pull out a group:
(?<=<img.*?src=")[^"]*
Remember to escape the quotes if needed.
This is what I use to get the tags out of strings:
</? *img[^>]*>
Here is the one I use:
<img.*?src\s*?=\s*?(?:(['"])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))[^>]*?>
The good part is that it matches any of the below:
<img src='test.jpg'>
<img src=test.jpg>
<img src="test.jpg">
It can also match some unexpected scenarios, like extra whitespace or extra attributes, e.g.:
<img src = "test.jpg" width="300">
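
To read the value out in C#, the named group src can be used. This is a minimal sketch under the assumption that only the first img tag is wanted; ImgSrcExtractor and GetFirstSrc are hypothetical names, and the pattern is the one above written as a verbatim string with the double quote doubled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgSrcExtractor
{
    // The pattern from the answer above as a C# verbatim string ("" stands for one ").
    private static readonly Regex ImgSrc = new Regex(
        @"<img.*?src\s*?=\s*?(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))[^>]*?>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the src value of the first <img> tag,
    // or null when the input contains no matching tag.
    public static string GetFirstSrc(string html)
    {
        Match match = ImgSrc.Match(html);
        return match.Success ? match.Groups["src"].Value : null;
    }
}

public static class Program
{
    public static void Main()
    {
        string[] samples =
        {
            "<img src='test.jpg'>",
            "<img src=test.jpg>",
            "<img src = \"test.jpg\" width=\"300\">"
        };
        foreach (string sample in samples)
        {
            // Each sample yields the same src value, with the quotes stripped.
            Console.WriteLine(ImgSrcExtractor.GetFirstSrc(sample));
        }
    }
}
```

Both alternatives of the pattern write into the same group name, which .NET allows, so the calling code does not need to know whether the value was quoted.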
