HTML Agility Pack Question (Attempting to parse string from source) - c#

I am attempting to use the Agility Pack to parse certain bits of info from various pages. I am a bit worried that this might be overkill for what I need; if that is the case, feel free to let me know. Anyway, I am attempting to parse a page from Motley Fool to get the name of a company based on its ticker. I will be parsing several pages to get stock info in a similar way.
The HTML that I want to parse looks like:
<h1 class="subHead">
Microsoft Corp <span>(NASDAQ:MSFT)</span>
</h1>
Also, the page I want to parse is: http://caps.fool.com/Ticker/MSFT.aspx
So, I guess my question is: how do I get just "Microsoft Corp" from the HTML, and should I even be using the Agility Pack for things like this?
Edit: Current code
public String getStockName(String ticker)
{
String text ="";
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://caps.fool.com/Ticker/" + ticker + ".aspx");
var node = doc.DocumentNode.SelectSingleNode("/h1[@class='subHead']");
text = node.FirstChild.InnerText.Trim();
return text;
}

This would give you a list of all stock names; for your sample HTML, that is just Microsoft:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load("test.html");
var nodes = doc.DocumentNode.SelectNodes("//h1[@class='subHead']");
foreach (var node in nodes)
{
string text = node.FirstChild.InnerText; //output: "Microsoft Corp"
string textAll = node.InnerText; //output: "Microsoft Corp (NASDAQ:MSFT)"
}
Edit based on updated question - this should work for you:
string text = "";
HtmlWeb web = new HtmlWeb();
string url = string.Format("http://caps.fool.com/Ticker/{0}.aspx", ticker);
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
var node = doc.DocumentNode.SelectSingleNode("//h1[@class='subHead']");
text = node.FirstChild.InnerText.Trim();
return text;
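A slightly more defensive variant of the same idea, as a rough sketch only: it parses an HTML string instead of fetching the page, and returns null when the heading is missing. The class and method names (StockNameSketch, ExtractStockName) are made up for illustration, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class StockNameSketch
{
    // Hypothetical helper: returns the text in front of the <span> inside
    // <h1 class="subHead">, or null when no such heading exists.
    public static string ExtractStockName(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//h1" searches the whole document, while "/h1" only matches an <h1>
        // sitting directly under the document root.
        HtmlNode heading = doc.DocumentNode.SelectSingleNode("//h1[@class='subHead']");
        if (heading == null || heading.FirstChild == null)
        {
            return null;
        }

        // The first child is the text node before the <span>; decode entities and trim.
        return HtmlEntity.DeEntitize(heading.FirstChild.InnerText).Trim();
    }

    public static void Main()
    {
        const string sample =
            "<html><body><h1 class=\"subHead\">\nMicrosoft Corp <span>(NASDAQ:MSFT)</span>\n</h1></body></html>";
        Console.WriteLine(ExtractStockName(sample));
    }
}
```

Checking for null before touching FirstChild matters because SelectSingleNode returns null when nothing matches, which is what happens if the page layout changes or the ticker does not exist.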

Use an XPath expression to select the element, then pick up the text.
foreach (var element in doc.DocumentNode.SelectNodes("//h1[@class='subHead']/span"))
{
Console.WriteLine (element.InnerText);
}

Related

HtmlAgilityPack parse HTML and take value out of span tag and class name

I have an HTML page that I download via my web request client. Out of the entire HTML, I want to parse only this part:
<span class="sku">
<span class="fb">SKU :</span>118880101
</span>
I'm using HTML Agility Pack to retrieve this value: 118880101
And I've written something like this:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
return htmlDoc.DocumentNode.SelectNodes("//span[@class='sku']").ElementAt(0).InnerText;
And this returns me this value from HTML:
SKU :118880101
Literally like this, whitespace included... How can I change this logic so that HTML Agility Pack gives me only the 118880101 value?
Can someone help me out?
Edit: a string operation like this would do it:
skuRaw.Substring(skuRaw.LastIndexOf(':') + 1);
which takes everything after the ':' character in the string I receive... but I'm not sure whether it is safe to rely on string manipulation like this?
Try this (it needs using System.Text.RegularExpressions;):
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var innerText = htmlDoc.DocumentNode.SelectNodes("//span[@class='sku']")
.ElementAt(0).InnerText;
return Regex.Replace(innerText, @"\D", ""); // strip every non-digit character
If you want to use only Html Agility Pack, try this:
var child = htmlDoc.DocumentNode.SelectSingleNode("//span[@class='fb']");
if (child != null)
{
var parent = child.ParentNode;
parent.RemoveChild(child);
var innerText = parent.InnerText;
}
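Another option that leaves the document untouched is to read only the text nodes that are direct children of the outer span. This is just a sketch under the assumption that the markup looks like the snippet in the question; SkuSketch and ExtractSku are hypothetical names, and the HtmlAgilityPack package is assumed to be referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class SkuSketch
{
    // Hypothetical helper: returns the outer span's own text, without the label span.
    public static string ExtractSku(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode sku = doc.DocumentNode.SelectSingleNode("//span[@class='sku']");
        if (sku == null)
        {
            return null;
        }

        // Keep only text nodes directly under the outer span; this skips the
        // nested <span class="fb">SKU :</span> element entirely.
        var ownText = sku.ChildNodes
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Select(n => n.InnerText);

        return string.Concat(ownText).Trim();
    }

    public static void Main()
    {
        const string sample =
            "<span class=\"sku\">\n<span class=\"fb\">SKU :</span>118880101\n</span>";
        Console.WriteLine(ExtractSku(sample));
    }
}
```

Compared with stripping non-digits, this keeps working if the SKU ever contains letters or dashes, since it relies on the structure rather than on the characters.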

How can I read an HTML file a Paragraph at a time?

I reckon it would be something like (pseudocode):
var pars = new List<string>();
string par;
while (not eof("Platypus.html"))
{
par = getNextParagraph();
pars.Add(par);
}
...where getNextParagraph() looks for the next "<p>" and continues until it finds "</p>", burning its bridges behind it ("cutting" the paragraph so that it is not found over and over again). Or some such.
Does anybody have insight on how exactly to do this / a better methodology?
UPDATE
I tried to use Aurelien Souchet's code.
I have the following usings:
using HtmlAgilityPack;
using HtmlDocument = System.Windows.Forms.HtmlDocument;
...but this code:
HtmlDocument doc = new HtmlDocument();
is unwanted ("Cannot access private constructor 'HtmlDocument' here")
Also, both "doc.LoadHtml()" and "doc.DocumentNode" give the old "Cannot resolve symbol 'Bla'" err msg
UPDATE 2
Okay, I had to prepend "HtmlAgilityPack." so that the ambiguous reference was disambiguated.
As people suggest in the comments, I think HtmlAgilityPack is the best choice: it's easy to use, and good examples and tutorials are easy to find.
Here is what I would write:
//don't forget to add the reference
using HtmlAgilityPack;
//Function that takes the html as a string in parameter and return a list
//of strings with the paragraphs content.
public List<string> GetParagraphsListFromHtml(string sourceHtml)
{
var pars = new List<string>();
//first create an HtmlDocument
HtmlDocument doc = new HtmlDocument();
//load the html (from a string)
doc.LoadHtml(sourceHtml);
//Select all the <p> nodes in a HtmlNodeCollection
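HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes(".//p");
//SelectNodes returns null when nothing matches, so return the empty list
if (paragraphs == null) return pars;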
//Iterates on every Node in the collection
foreach (HtmlNode paragraph in paragraphs)
{
//Add the InnerText to the list
pars.Add(paragraph.InnerText);
//Or paragraph.InnerHtml depends what you want
}
return pars;
}
It's just a basic example. If your HTML contains nested paragraphs, this code may not work as expected; it all depends on the HTML you are parsing and what you want to do with it.
Hope it helps!

Using htmlAgilityPack scraping all inner text from <a> tag [closed]

Closed 7 years ago.
I want to scrape all the words from the link http://search.freefind.com/siteindex.html?id=59478474&ltr=10240&fwr=0&pid=i&ics=1
I tried something like this:
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://search.freefind.com/siteindex.html?id=59478474&ltr=10240&fwr=0&pid=i&ics=1");
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//font[@class='search-index-font']//a");
if (nodes != null)
{
foreach (HtmlNode n in nodes)
{
link = n.InnerText;
my_link.Add(link);
MessageBox.Show(link);
}
}
else
MessageBox.Show("no wordfound ");
My expected output should look like:
a
aa
aachhe
aagrashi
aagun
aaj
aam
aanka
aankhi
aar
aashman
abāddhō
abāddhōtā
abadh
..
..
But it didn't work. It shows "no wordfound", which means SelectNodes returns null. How can I get all the text from the <a> tags in that case?
Can anyone tell me what should go in SelectNodes("")?
You need to select the first text node after each <script> tag (not the <a> tag, as you said) inside <font class='search-index-font'>. This XPath expression will do the trick:
//font[@class='search-index-font']/script/following-sibling::text()[1]
And this code:
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://search.freefind.com/siteindex.html?id=59478474&ltr=10240&fwr=0&pid=i&ics=1");
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//font[@class='search-index-font']/script/following-sibling::text()[1]");
will return text nodes you need:
a
aa
aachhe
aagrashi
aagun
aaj
aam
aanka
aankhi
aar
...
Try this:
doc.DocumentNode.SelectNodes("//a[@class='search-index-links']");
instead of
doc.DocumentNode.SelectNodes("//font[@class='search-index-font']//a");
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc =
web.Load("http://search.freefind.com/siteindex.html?id=59478474&ltr=10240&fwr=0&pid=i&ics=1");
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//font[@class='search-index-font']");
string link = string.Empty;
if (nodes != null)
{
foreach (var item in nodes)
{
var value =
item.Elements("script").ToList();
foreach (var items in value)
{
link += items.NextSibling.InnerText+ "\n";
}
}
MessageBox.Show(link);
}
else
MessageBox.Show("no wordfound ");
Your problem is
doc.DocumentNode.SelectNodes("//font[@class='search-index-font']//a");
returns null, as documented here.
That is because there are no a elements in font elements with a class attribute equal to search-index-font in the html you have loaded in doc.
If you change the XPath you pass to SelectNodes to select something that exists, then your code will take a different path. Without knowing what you want to achieve, I can't advise further.
You can use HAP to parse the valid html, i.e. use it to identify the script elements. Then you'll have to hand roll something to parse the inner text of the script tag to extract what you want.
Ultimately, what you want is a list of Bengali words.

Parsing Hyperlinks from a webpage

I have written following code to parse hyperlinks from a given page.
WebClient web = new WebClient();
string html = web.DownloadString("http://www.msdn.com");
string[] separators = new string[] { "<a ", ">" };
List<string> hyperlinks= html.Split(separators, StringSplitOptions.None).Select(s =>
{
if (s.Contains("href"))
return s;
else
return null;
}).ToList();
The string split still has to be tweaked to return URLs perfectly. My question is: is there some data structure, something along the lines of XmlReader, that could read HTML strings efficiently?
Any suggestion for improving above code would also be helpful.
Thanks for your time.
try HtmlAgilityPack
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://www.msdn.com");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
Console.WriteLine(link.GetAttributeValue("href", null));
}
This will print out every link on the page at that URL.
if you want to store the links in a list:
var linkList = doc.DocumentNode.SelectNodes("//a[@href]")
.Select(i => i.GetAttributeValue("href", null)).ToList();
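The href values you get back are whatever the page author wrote, so many of them may be relative ("/library") or not web links at all ("javascript:..."). Here is a small sketch, using only System.Uri from the framework, of how such a list could be normalised to absolute http/https URLs. LinkSketch and ToAbsoluteLinks are hypothetical names, and the filtering rule (keep only http and https) is my assumption about what "links" should mean here.

```csharp
using System;
using System.Collections.Generic;

public static class LinkSketch
{
    // Hypothetical helper: resolves raw href values against the page URL and
    // keeps only absolute http/https results.
    public static List<string> ToAbsoluteLinks(string baseUrl, IEnumerable<string> hrefs)
    {
        var baseUri = new Uri(baseUrl);
        var result = new List<string>();

        foreach (string href in hrefs)
        {
            if (string.IsNullOrWhiteSpace(href))
            {
                continue;
            }

            // Uri.TryCreate(Uri, string, out Uri) resolves relative references
            // against baseUri and leaves already-absolute URLs as they are.
            if (Uri.TryCreate(baseUri, href, out Uri absolute)
                && (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                result.Add(absolute.AbsoluteUri);
            }
        }

        return result;
    }

    public static void Main()
    {
        var links = ToAbsoluteLinks(
            "http://www.msdn.com",
            new[] { "/library", "javascript:void(0)", "http://example.com/a" });
        links.ForEach(Console.WriteLine);
    }
}
```

You would pass the list produced by the Select(...).ToList() call above as the hrefs argument, together with the URL you loaded the page from.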
You should be using a parser. The most widely used one is HtmlAgilityPack. Using that, you can interact with the HTML as a DOM.
Assuming you're dealing with well formed XHTML, you could simply treat
the text as an XML document. The framework is loaded with features to
do exactly what you're asking.
http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx
Does .NET framework offer methods to parse an HTML string?
refactored,
var html = new WebClient().DownloadString("http://www.msdn.com");
var separators = new[] { "<a ", ">" };
html.Split(separators, StringSplitOptions.None).Select(s => s.Contains("href") ? s : null).ToList();

Selecting attribute values with html Agility Pack

I'm trying to retrieve a specific image from a html document, using html agility pack and this xpath:
//div[@id='topslot']/a/img/@src
As far as I can see, it finds the src-attribute, but it returns the img-tag. Why is that?
I would expect the InnerHtml/InnerText or something to be set, but both are empty strings. OuterHtml is set to the complete img-tag.
Is there any documentation for Html Agility Pack?
You can directly grab the attribute if you use the HtmlNodeNavigator instead.
//Load document from some html string
HtmlDocument hdoc = new HtmlDocument();
hdoc.LoadHtml(htmlContent);
//Load navigator for current document
HtmlNodeNavigator navigator = (HtmlNodeNavigator)hdoc.CreateNavigator();
//Get value from given xpath
string xpath = "//div[@id='topslot']/a/img/@src";
string val = navigator.SelectSingleNode(xpath).Value;
Html Agility Pack does not support attribute selection.
You may use the method "GetAttributeValue".
Example:
//[...] code before needs to load a html document
HtmlAgilityPack.HtmlDocument htmldoc = e.Document;
//get all nodes "a" matching the XPath expression
HtmlNodeCollection AllNodes = htmldoc.DocumentNode.SelectNodes("*[@class='item']/p/a");
//show a messagebox for each node found that shows the content of attribute "href"
foreach (var MensaNode in AllNodes)
{
string url = MensaNode.GetAttributeValue("href", "not found");
MessageBox.Show(url);
}
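Applied to the markup from the question, the same GetAttributeValue approach might look like the sketch below: select the img element itself (dropping the trailing /@src step) and then read the attribute from the node. ImageSrcSketch and ExtractTopSlotImage are hypothetical names, the sample HTML is invented to match the XPath in the question, and the HtmlAgilityPack package is assumed to be referenced.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class ImageSrcSketch
{
    // Hypothetical helper: returns the src of the image inside div#topslot,
    // or an empty string when the image or the attribute is missing.
    public static string ExtractTopSlotImage(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select the element rather than the attribute, then read the attribute from it.
        HtmlNode img = doc.DocumentNode.SelectSingleNode("//div[@id='topslot']/a/img");
        if (img == null)
        {
            return string.Empty;
        }

        return img.GetAttributeValue("src", string.Empty);
    }

    public static void Main()
    {
        const string sample =
            "<html><body><div id=\"topslot\"><a href=\"/x\"><img src=\"banner.png\" alt=\"\"></a></div></body></html>";
        Console.WriteLine(ExtractTopSlotImage(sample));
    }
}
```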
Html Agility Pack will support it soon.
http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=204342
Reading and Writing Attributes with Html Agility Pack
You can both read and set attributes in HtmlAgilityPack. This example selects the <html> tag, checks whether the 'lang' (language) attribute exists, and then reads and writes it.
In the example below, "this.All" in doc.LoadHtml(this.All) is a string representation of an HTML document.
Read and write:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(this.All);
string language = string.Empty;
var nodes = doc.DocumentNode.SelectNodes("//html");
for (int i = 0; i < nodes.Count; i++)
{
if (nodes[i] != null && nodes[i].Attributes.Count > 0 && nodes[i].Attributes.Contains("lang"))
{
language = nodes[i].Attributes["lang"].Value; //Get attribute
nodes[i].Attributes["lang"].Value = "en-US"; //Set attribute
}
}
Read only:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(this.All);
string language = string.Empty;
var nodes = doc.DocumentNode.SelectNodes("//html");
foreach (HtmlNode a in nodes)
{
if (a != null && a.Attributes.Count > 0 && a.Attributes.Contains("lang"))
{
language = a.Attributes["lang"].Value;
}
}
I used the following way to obtain the attributes of an image.
var MainImageString = MainImageNode.Attributes.Where(i=> i.Name=="src").FirstOrDefault();
You can specify the attribute name to get its value; if you don't know the attribute name, set a breakpoint after you have fetched the node and inspect its attributes by hovering over it.
Hope I helped.
I just faced this problem and solved it using GetAttributeValue method.
//Selecting all tbody elements
IList<HtmlNode> nodes = doc.QuerySelectorAll("div.characterbox-main")[1]
.QuerySelectorAll("div table tbody");
//Iterating over them and getting the src attribute value of img elements.
var data = nodes.Select((node) =>
{
return new
{
name = node.QuerySelector("tr:nth-child(2) th a").InnerText,
imageUrl = node.QuerySelector("tr td div a img")
.GetAttributeValue("src", "default-url")
};
});
