Extract id style from html page using Html agility pack - c#

I have a C# application. I need to extract data from an HTML page and add it to my database. The HTML page contains some CSS code, and I am interested in all of the ids used in that CSS. How can I pull the ids out in my code? I tried something like this, but it doesn't seem to work:
var styles = document.DocumentNode.SelectNodes("//style");
foreach(HtmlNode node in styles)
{
var text = node.Attributes["id"];
}
I really appreciate any help!

More of a fishing rod than a fish, but that's all I got time to do ATM.
First, look at this tutorial: xpath on w3schools. I've done some work with XPath, and it was only after going through their tutorial that things started to make a bit of sense.
Then, please get this html agility test pack, it will let you quickly test your code against the page you're trying to parse.
From here, it should be a short way to get what you want.

Try this: access the Id property directly:
var styles = document.DocumentNode.SelectNodes("//*[@style]");
foreach(HtmlNode node in styles)
{
var text = node.Id;
}
Edit: expression changed to "//*[@style]", which gets you only elements with a style attribute.
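If what you actually want is the id selectors (#something) declared in the CSS text itself, rather than the Id of elements, one possible approach is to take the InnerText of each //style node and scan it with a regular expression. The sketch below only handles flat rules (no nested @media blocks, no braces inside comments), and CssIdSketch and ExtractIds are made-up names, so treat it as a starting point rather than a real CSS parser.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CssIdSketch
{
    // "selector list { declarations }" for flat rules only (assumption: no nested blocks).
    private static readonly Regex Rule = new Regex(@"([^{}]+)\{[^{}]*\}");

    // "#name" inside a selector list; declarations are skipped so colours like #fff are not matched.
    private static readonly Regex IdSelector = new Regex(@"#(-?[A-Za-z_][\w-]*)");

    // Returns the distinct id names used in selectors, in order of first appearance.
    public static List<string> ExtractIds(string css)
    {
        var ids = new List<string>();
        foreach (Match rule in Rule.Matches(css))
        {
            string selectors = rule.Groups[1].Value;
            foreach (Match id in IdSelector.Matches(selectors))
            {
                string name = id.Groups[1].Value;
                if (!ids.Contains(name))
                {
                    ids.Add(name);
                }
            }
        }
        return ids;
    }
}
```

Calling CssIdSketch.ExtractIds(node.InnerText) inside the foreach over the //style nodes would then give the id names for that block.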

Related

Parsing Html tags using c#

I have html code:
<p>Answer1</p>
<h2>Category1</h2>
<p>Answer2</p>
<p>Answer3</p>
I need to parse this so that each answer (p) is assigned to the nearest category (h2) above it.
If there is no category above it, the category should be null.
Like this:
obj1.category = null; obj1.answer = "Answer1";
obj2.category ="Category1"; obj2.answer = "Answer2";
obj3.category ="Category1"; obj3.answer = "Answer3";
I tried to solve this, but it was useless.
Use HTMLAgilityPack. It will parse the HTML and allow you to use LINQ to select whatever you need from the DOM structure.
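The loop behind this is short: walk the elements in order, remember the last h2 seen, and attach it to each p. Below is a rough sketch of that idea. Note that it uses System.Xml.Linq from the base class library rather than HTMLAgilityPack, so it only works when the fragment is well-formed XML. QaItem, CategorySketch and Parse are made-up names. With HTMLAgilityPack the same loop could presumably run over doc.DocumentNode.ChildNodes, checking node.Name and node.InnerText.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

public class QaItem
{
    public string Category;   // null when no h2 precedes the answer
    public string Answer;
}

public static class CategorySketch
{
    public static List<QaItem> Parse(string html)
    {
        // Wrap the fragment in a root element so it parses as a single XML element.
        XElement root = XElement.Parse("<root>" + html + "</root>");

        var items = new List<QaItem>();
        string current = null;   // most recent h2 text seen so far

        foreach (XElement el in root.Elements())
        {
            if (el.Name.LocalName == "h2")
            {
                current = el.Value;
            }
            else if (el.Name.LocalName == "p")
            {
                items.Add(new QaItem { Category = current, Answer = el.Value });
            }
        }
        return items;
    }
}
```

For the four lines in the question this yields three items: the first with a null category, the other two with "Category1".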
In addition to HTMLAgilityPack, I've also written a lightweight HTML parser for C#.
There's no big secret to the technique, but it's sort of detailed work. You just go through the text character by character and pull out HTML elements.
My parser is on Github as HtmlMonkey.
UPDATE:
I just added support for fairly advanced selectors to easily find nodes within a parsed document.

C# CsQuery as Html Documents Builder

So far I have used HtmlAgilityPack for building HTML documents.
The problem is that it is not stable (I get StackOverflowExceptions) and it doesn't support jQuery syntax.
What I am trying to use to build Html documents is CsQuery.
My question is:
Is it designed for building HTML documents?
I like the functions it offers, but I cannot render the modified html document.
For example:
CQ fragment= CQ.CreateFragment("<p>some text</p>");
CQ html = CQ.CreateFromFile(@"index.html");
CQ modified_html= html.Select("#test").Append(fragment);
That is, I want to append the fragment variable to the element with id "test".
The problem is that I expect modified_html.Render() to return the modified version (including <p>some text</p> added to the #test element), but it doesn't.
Is there any way to achieve this?
Actually it does. I checked with your code, and it does append <p>some text</p> to modified_html. The only possible issue I can think of is that there is no element with id = "test" in index.html. You may also want to save the modified HTML to a file so it is easier to examine the output:
modified_html.Save(@"index_modified.html");

Select "src" value with XPath to HtmlAgilityPack

I'm developing a crawling engine. My program crawls websites using XPath with HtmlAgilityPack. I need to get the src attribute of some images directly. You can see my simple code below, which is not working correctly. Thanks in advance!
PS: Please ignore " char problem, XPath patterns are provided by database.
Agility.DocumentNode.SelectSingleNode("//img[@id="product_photo"]/@src");
And this is the line I need to crawl (the *...* part shows the block to extract):
<img id="product_photo" src="*/images/thumb/4400/10280/st.jpg*">
Some pages provide the image in meta tags, so .Attributes["src"] won't work.
UPDATE: You can see my query and result here
You can't get the value of "src" or any other attribute just by calling:
Agility.DocumentNode.SelectSingleNode(yourXpath);
Read the attribute from the node it returns instead:
string s = Agility.DocumentNode.SelectSingleNode(yourXpath).Attributes["src"].Value;
That's because SelectSingleNode() in HtmlAgilityPack returns an HtmlNode (the element), not an attribute value, even when the XPath ends in /@src. So you must read the attribute from the returned node, or run a Regex over its OuterHtml after parsing to pull out just the "src".

How to generate xpath by looking for a string in an HTML document?

I have an HTML document, and I want to find the XPath to an element containing a certain string.
To elaborate a bit more:
My HTML document is created dynamically and I have no specific names for the divs. The divs I am interested in look like (more or less):
<div>Country: China</div>
<div>Type: Earphones</div>
I want to get the whole string "Country: China". In order to do so, I want to find the xpath to this div by searching for "Country:" in the HTML.
I hope I was specific enough... Thank you!
Here are a couple ways:
//div[contains(child::text(), "Country:")]
//div/child::text()[contains(., "Country:")]/parent::node()
If you want to try things out within a browser, try an in-browser XPath bookmarklet.
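To try the first expression from C# without extra packages, here is a small sketch. It uses XDocument and XPathSelectElements from the base class library as a stand-in, so it assumes the markup is well-formed XML. HtmlAgilityPack's SelectNodes takes XPath 1.0 expressions as well, so the same expression should carry over. XPathTextSketch and FindDivText are made-up names, and the search string is assumed not to contain a double quote.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathTextSketch
{
    // Returns the full text of every div whose first text child contains 'needle'.
    public static List<string> FindDivText(string xml, string needle)
    {
        XDocument doc = XDocument.Parse(xml);

        // Assumption: 'needle' has no double quote, so it can be embedded as an XPath string literal.
        string xpath = "//div[contains(child::text(), \"" + needle + "\")]";

        return doc.XPathSelectElements(xpath)
                  .Select(e => e.Value)
                  .ToList();
    }
}
```

With the two divs from the question wrapped in a root element, searching for "Country:" returns a single entry, "Country: China".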

Parse CSS out from <style> elements

Can someone tell me an efficient method of retrieving the CSS between <style> tags on a page of markup in .NET?
I've come up with a method which uses recursion, Split() and CompareTo(), but it is really long-winded, and I feel sure that there must be a far shorter (and cleverer) method of doing the same.
Please keep in mind that it is possible to have more than one <style> element on a page, and that the opening tag can take more than one form.
I'd probably go for HTML Agility Pack, which gives you DOM-style access to pages. It would be able to pick out your chunks of CSS data, but not actually parse this data into key/value pairs. You can get the relevant pieces of HTML using XPath-style expressions.
Edit: An example of typical usage of Html Agility Pack is shown below.
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
var nodes = doc.DocumentNode.SelectNodes("//a[@style]");
//now you can iterate over the selected nodes etc
Here's a C# CSS parser. Should do what you need.
http://www.codeproject.com/KB/recipes/CSSParser.aspx
Try Regex.
Go to http://gskinner.com/RegExr/
Paste in the HTML with the CSS, and use this expression at the top:
<style type=\"text/css\">(.*?)</style>
Here is the C# version:
using System.Text.RegularExpressions;
Match m = Regex.Match(this.textBox1.Text, "<style type=\"text/css\">(.*?)</style>", RegexOptions.Singleline);
if (m.Success)
{
string css = m.Groups[1].Value;
//do stuff
}
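Since the question mentions several style elements with differing opening tags, a slightly looser variant of the same regex idea might look like the sketch below. It accepts any attributes on the opening tag, ignores case, and collects every block instead of only the first. StyleBlockSketch and ExtractCss are made-up names, and like any regex over HTML it can be fooled by unusual markup (for example a literal "</style>" inside a comment).

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StyleBlockSketch
{
    // <style ...> with any attributes, lazily matched up to the next </style>.
    // Singleline lets '.' span newlines; IgnoreCase also covers <STYLE>.
    private static readonly Regex StyleBlock = new Regex(
        @"<style\b[^>]*>(.*?)</style\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the trimmed CSS text of every style element found, in document order.
    public static List<string> ExtractCss(string html)
    {
        var blocks = new List<string>();
        foreach (Match m in StyleBlock.Matches(html))
        {
            blocks.Add(m.Groups[1].Value.Trim());
        }
        return blocks;
    }
}
```

Passing the page source (for example this.textBox1.Text from the snippet above) to StyleBlockSketch.ExtractCss gives one string per style element.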
