Parsing Hyperlinks from a webpage - c#

I have written the following code to parse hyperlinks from a given page.
WebClient web = new WebClient();
string html = web.DownloadString("http://www.msdn.com");
string[] separators = new string[] { "<a ", ">" };
List<string> hyperlinks= html.Split(separators, StringSplitOptions.None).Select(s =>
{
if (s.Contains("href"))
return s;
else
return null;
}).ToList();
The string split still has to be tweaked to return the URLs cleanly. My question is: is there some data structure, something along the lines of XmlReader, that can read HTML strings efficiently?
Any suggestions for improving the above code would also be helpful.
Thanks for your time.

Try HtmlAgilityPack:
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://www.msdn.com");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
Console.WriteLine(link.GetAttributeValue("href", null));
}
This will print every link on the page at that URL.
If you want to store the links in a list:
var linkList = doc.DocumentNode.SelectNodes("//a[@href]")
.Select(i => i.GetAttributeValue("href", null)).ToList();

You should be using a parser. The most widely used one is HtmlAgilityPack. Using that, you can interact with the HTML as a DOM.

Assuming you're dealing with well formed XHTML, you could simply treat
the text as an XML document. The framework is loaded with features to
do exactly what you're asking.
http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx
Does .NET framework offer methods to parse an HTML string?
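As a rough sketch of the "treat it as XML" idea, assuming the input really is well-formed XHTML, something like the following could work with System.Xml.Linq. The class and method names (XhtmlLinks, GetHrefs) are made up for illustration. Real-world pages are often not well-formed, and named entities such as &nbsp; are not defined in plain XML, so expect an XmlException on typical HTML.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the original answers.
public static class XhtmlLinks
{
    // Returns the href value of every <a> element that has one.
    // Elements are matched by local name so the XHTML namespace does not matter.
    // XDocument.Parse throws XmlException if the input is not well-formed XML.
    public static List<string> GetHrefs(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);
        return doc.Descendants()
                  .Where(e => e.Name.LocalName == "a")
                  .Select(e => (string)e.Attribute("href")) // null when href is absent
                  .Where(href => href != null)
                  .ToList();
    }
}
```

Calling XhtmlLinks.GetHrefs with the downloaded string would return the links only if the page parses as XML; for anything else a real HTML parser is the safer route.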

Refactored:
var html = new WebClient().DownloadString("http://www.msdn.com");
var separators = new[] { "<a ", ">" };
html.Split(separators, StringSplitOptions.None).Select(s => s.Contains("href") ? s : null).ToList();

Related

Retrieve all the ids for a given sentence using regex in c#

I am working on .NET (C#) software that fetches and processes an HTML file. I need to get the ids of the HTML elements in that file, and I want to use a regular expression for that. I've tried some combinations, but with no luck.
For example, if I have the line:
<a href="#" id="thisAnchor" >Link to somewhere</a><div id="divToCollect">BigDiv</div>
I want to get: thisAnchor and divToCollect. I am using Regex:
Regex.Matches(currentLine, expression);
You should not use regex for that, use HtmlAgilityPack and you will have no problems getting all the attributes you need:
string html = "<div id='divid'></div><a id='ancorid'></a>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var divIds = doc.DocumentNode
.Descendants("div")
.Where(div => div.Attributes["id"] != null)
.Select(div => div.Attributes["id"].Value)
.ToList();

How can I read an HTML file a Paragraph at a time?

I reckon it would be something like (pseudocode):
var pars = new List<string>();
string par;
while (not eof("Platypus.html"))
{
par = getNextParagraph();
pars.Add(par);
}
...where getNextParagraph() looks for the next "<p>" and continues until it finds "</p>", burning its bridges behind it ("cutting" the paragraph so that it is not found over and over again). Or some such.
Does anybody have insight on how exactly to do this / a better methodology?
UPDATE
I tried to use Aurelien Souchet's code.
I have the following usings:
using HtmlAgilityPack;
using HtmlDocument = System.Windows.Forms.HtmlDocument;
...but this code:
HtmlDocument doc = new HtmlDocument();
is unwanted ("Cannot access private constructor 'HtmlDocument' here")
Also, both "doc.LoadHtml()" and "doc.DocumentNode" give the old "Cannot resolve symbol 'Bla'" err msg
UPDATE 2
Okay, I had to prepend "HtmlAgilityPack." so that the ambiguous reference was disambiguated.
As people suggest in the comments, I think HtmlAgilityPack is the best choice: it's easy to use, and good examples and tutorials are easy to find.
Here is what I would write:
//don't forget to add the reference
using HtmlAgilityPack;
//Function that takes the html as a string in parameter and return a list
//of strings with the paragraphs content.
public List<string> GetParagraphsListFromHtml(string sourceHtml)
{
var pars = new List<string>();
//first create an HtmlDocument
HtmlDocument doc = new HtmlDocument();
//load the html (from a string)
doc.LoadHtml(sourceHtml);
//Select all the <p> nodes in a HtmlNodeCollection
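HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes(".//p");
//SelectNodes returns null (not an empty collection) when nothing matches
if (paragraphs == null) return pars;
//Iterates on every Node in the collection
foreach (HtmlNode paragraph in paragraphs)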
{
//Add the InnerText to the list
pars.Add(paragraph.InnerText);
//Or paragraph.InnerHtml depends what you want
}
return pars;
}
It's just a basic example. If your HTML contains nested paragraphs, this code may not work as expected; it all depends on the HTML you are parsing and what you want to do with it.
Hope it helps!

Searching for strings within a string (search for all hrefs in HTML source)

I have a string variable that contains the entire HTML of a web page.
The web page contains links to other websites. I would like to create a list of all hrefs (like a web crawler would).
What is the best way to do it?
Would an extension method help? What about using Regex?
Thanks in advance.
Use a DOM parser such as the HTML Agility Pack to parse your document and find all links.
There's a good question on SO about how to use HTML Agility Pack available here. Here's a simple example to get you started:
string html = "your HTML here";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var links = doc.DocumentNode.DescendantNodes()
.Where(n => n.Name == "a" && n.Attributes.Contains("href"))
.Select(n => n.Attributes["href"].Value);
I think you'll find this answers your question to a T
http://msdn.microsoft.com/en-us/library/t9e807fx.aspx
:)
I would go with Regex.
Regex exp = new Regex(
@"{href=}*{>}",
RegexOptions.IgnoreCase);
string InputText; //supply with HTTP
MatchCollection MatchList = exp.Matches(InputText);
Try this Regex (should work):
var matches = Regex.Matches(html, @"href=""(.+?)""");
You can go through the matches and extract the captured URL.
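Going through the matches might look roughly like the sketch below. It reuses the same pattern and reads capture group 1 from each match; the class and method names (HrefRegex, ExtractHrefs) are invented for illustration. Note that this pattern only sees double-quoted, lowercase href attributes, which is one reason the parser-based answers are more robust.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answers.
public static class HrefRegex
{
    // Collects the text captured by group 1 (the URL between the double quotes)
    // for every href="..." occurrence in the input.
    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(html, @"href=""(.+?)"""))
        {
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```

Single-quoted or unquoted href values are silently skipped by this pattern, so the result can be incomplete on real pages.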
Have you looked into using HtmlAgilityPack? http://htmlagilitypack.codeplex.com/
With this you can simply use XPath to get all of the links on the page and put them into a list.
private List<string> ExtractAllAHrefTags(HtmlDocument htmlSnippet)
{
List<string> hrefTags = new List<string>();
foreach (HtmlNode link in htmlSnippet.DocumentNode.SelectNodes("//a[@href]"))
{
HtmlAttribute att = link.Attributes["href"];
hrefTags.Add(att.Value);
}
return hrefTags;
}
Taken from another post here - Get all links on html page?

HTML Agility Pack Question (Attempting to parse string from source)

I am attempting to use the Agility Pack to parse certain bits of info from various pages. I am a little worried that it might be overkill for what I need; if that is the case, feel free to let me know. Anyway, I am attempting to parse a page from Motley Fool to get the name of a company based on the ticker. I will be parsing several pages to get stock info in a similar way.
The HTML that I want to parse looks like:
<h1 class="subHead">
Microsoft Corp <span>(NASDAQ:MSFT)</span>
</h1>
Also, the page I want to parse is: http://caps.fool.com/Ticker/MSFT.aspx
So, I guess my question is how do I simply get the Microsoft Corp from the html and should I even be using the agility pack to do things like this?
Edit: Current code
public String getStockName(String ticker)
{
String text ="";
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://caps.fool.com/Ticker/" + ticker + ".aspx");
var node = doc.DocumentNode.SelectSingleNode("/h1[@class='subHead']");
text = node.FirstChild.InnerText.Trim();
return text;
}
This would give you a list of all stock names, for your sample Html just of Microsoft:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load("test.html");
var nodes = doc.DocumentNode.SelectNodes("//h1[@class='subHead']");
foreach (var node in nodes)
{
string text = node.FirstChild.InnerText; //output: "Microsoft Corp"
string textAll = node.InnerText; //output: "Microsoft Corp (NASDAQ:MSFT)"
}
Edit based on updated question - this should work for you:
string text = "";
HtmlWeb web = new HtmlWeb();
string url = string.Format("http://caps.fool.com/Ticker/{0}.aspx", ticker);
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
var node = doc.DocumentNode.SelectSingleNode("//h1[@class='subHead']");
text = node.FirstChild.InnerText.Trim();
return text;
Use an XPath expression to select the element, then pick up the text.
foreach (var element in doc.DocumentNode.SelectNodes("//h1[@class='subHead']/span"))
{
Console.WriteLine (element.InnerText);
}

Regular Expression to get the SRC of images in C#

I'm looking for a regular expression to isolate the src value of an img.
(I know that this is not the best way to do this but this is what I have to do in this case)
I have a string which contains simple html code, some text and an image. I need to get the value of the src attribute from that string. I have managed only to isolate the whole tag till now.
string matchString = Regex.Match(original_text, @"(<img([^>]+)>)").Value;
string matchString = Regex.Match(original_text, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;
I know you say you have to use regex, but if possible i would really give this open source project a chance:
HtmlAgilityPack
It is really easy to use. I just discovered it and it helped me out a lot, since I was doing some heavier HTML parsing. It basically lets you use XPath expressions to get your elements.
Their example page is a little outdated, but the API is really easy to understand, and if you are a little bit familiar with XPath you will get your head around it in no time.
The code for your query would look something like this: (uncompiled code)
List<string> imgScrs = new List<string>();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlText);//or doc.Load(htmlFileStream)
var nodes = doc.DocumentNode.SelectNodes(@"//img[@src]");
foreach (var img in nodes)
{
HtmlAttribute att = img["src"];
imgScrs.Add(att.Value);
}
}
I tried what Francisco Noriega suggested, but it looks like the HtmlAgilityPack API has changed. Here is how I solved it:
List<string> images = new List<string>();
WebClient client = new WebClient();
string site = "http://www.mysite.com";
var htmlText = client.DownloadString(site);
var htmlDoc = new HtmlDocument()
{
OptionFixNestedTags = true,
OptionAutoCloseOnEnd = true
};
htmlDoc.LoadHtml(htmlText);
foreach (HtmlNode img in htmlDoc.DocumentNode.SelectNodes("//img"))
{
HtmlAttribute att = img.Attributes["src"];
images.Add(att.Value);
}
This should capture all img tags and just the src part, no matter where it's located (before or after class, etc.), and supports HTML/XHTML :D
<img.+?src="(.+?)".+?/?>
The regex you want should be along the lines of:
(<img.*?src="([^"]*)".*?>)
Hope this helps.
You can also use a lookbehind to do it without needing to pull out a group:
(?<=<img.*?src=")[^"]*
Remember to escape the quotes if needed.
This is what I use to get the tags out of strings:
</? *img[^>]*>
Here is the one I use:
<img.*?src\s*?=\s*?(?:(['"])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))[^>]*?>
The good part is that it matches any of the below:
<img src='test.jpg'>
<img src=test.jpg>
<img src="test.jpg">
And it can also match some unexpected scenarios like extra attributes, e.g:
<img src = "test.jpg" width="300">
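To try that last pattern against the sample tags, a small sketch using only System.Text.RegularExpressions could look like this. The class and method names (ImgSrcRegex, ExtractSrcValues) are hypothetical. It relies on .NET allowing the same group name (src) to appear in both alternatives, so this exact pattern may not port to other regex engines.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answers.
public static class ImgSrcRegex
{
    // Same pattern as in the answer above, written as a C# verbatim string
    // (the double quote inside the character class is doubled).
    private static readonly Regex Pattern = new Regex(
        @"<img.*?src\s*?=\s*?(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))[^>]*?>");

    // Returns the value of the named group "src" for every <img> tag matched.
    public static List<string> ExtractSrcValues(string html)
    {
        var values = new List<string>();
        foreach (Match m in Pattern.Matches(html))
        {
            values.Add(m.Groups["src"].Value);
        }
        return values;
    }
}
```

Passing each of the sample tags to ImgSrcRegex.ExtractSrcValues is a quick way to confirm which quoting styles the pattern handles before relying on it.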
