HTML to DOM Library - c#

I am looking for a C# library that would translate the HTML code (and the css specified in the code) into a DOM tree for simpler parsing. I am looking for something similar to this one (which is in PHP):
http://simplehtmldom.sourceforge.net/
Of course I know I could embed a browser control, but I am looking for something more efficient.

Check out the HTML Agility Pack. It hasn't been updated in a while, but it still works very well.

I second Mr. Dorman on the HtmlAgilityPack. I did a brief blog post on web scraping some time ago; it mentions the 'pack, but mostly discusses other details. Depending on your application, it might be of some use.

We have used HTMLAgility here in our project to extract specific html tags with a given set of attributes using XPath and it has never failed us.

There is no way to get DOM with styles like that. Only option is "Selenium" framework that works with real browser.

Related

Deserializing schema.org microdata from HTML in C#

I am looking for a way to deserialize a set of offline HTML pages that have schema.org microdata embedded. How could I do this in C#? I have found Bam.Net.Schema.Org but there is almost no code that teaches me how to use it.
I have found several "parsers" for node.js but they are imperfect and not something I could use from C# - at least not in way I would prefer (semantic-schema-parser and node-microdata-scraper).
Suggestions are welcome. Should I simply create my own?
You can use HtmlAgilityPack to parse the html and than query the DOM for schema.org values.

Generating CSS Selector in Firefox

Well i used to use htmlagilitypack as well as xPath to scrap some info from websites but i have read that css selectors are much faster so i searched for good engine for css and i found CsQuery; However, i am still confused as i don't know how to get the css path of an element.
In xPath i have used a firefox plugin called xPath checker that returned a fine xPaths like this
id('yt-masthead-signin')/button
But i can't find an equivalent one for CSS. So if someone helped my i will really appreciate it because i don't find and answer on google for my question specifically.
Install Firebug + Firepath
Click the selecting button to select something on the page, then it can generate either xpath or css selector. However, you need some changes to make the generated ones more efficient.

Locating HTML Tags

I am trying to automate the testing of web forms. To that end I need to know how to use C# to dynamically locate input tags within the HTML page then assign values to them. I don't want to use XPath, because each time I will be using a different web form. I want to pass the web form's URL to Selenium and then automatically populate the fields. I've heard of HTMLAgilityPack. Would that help me? If so, how can I use it?
I appreciate your help.
I may have missed a crucial part of your question, however, have you looked at Selenium WebDriver?
If you write a test that handles a generic web form you can back your test by data that is dynamic. Therefore you can cater for changes in the page by using Data Driven Tests. I've written tests for many pages and there are always common actions, but I cater for each page differently though as there are different things on that page!
[EDIT]
Following on from your comments, I think looking into Selenium would be a good idea. The way to handle different pages is to have these element definitions ready in a 'definitions' class for each page. That way once you know what the page is, you just use the correct class for your definitions. It is best to know what elements you are going to be interacting with in your tests before the tests run. The point of automated UI testing is for a known set of actions to be performed and a correct result achieved.
I would suggest you look up some tutorials such as this and you can see my blog
though I wrote this when I was initially learning WatiN and then replaced it with Selenium (I like it better :P).
Html Agility Pack
This is an agile HTML parser that builds a read/write DOM and supports
plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor
XSLT to use it, don't worry...). It is a .NET code library that allows
you to parse "out of the web" HTML files. The parser is very tolerant
with "real world" malformed HTML. The object model is very similar to
what proposes System.Xml, but for HTML documents (or streams).
HtmlDocument doc = new HtmlDocument();
doc.Load(path);
foreach (HtmlNode input in doc.DocumentNode.SelectNodes("//input"))
// Your Code...

How can I make an application in c# collect data from a website?

First of all, I hope my question doesn't bother you. I really need to get and idea of how I can accomplish that, but unfortunatelly, I'm really a beginner, I'm crawling when it comes to programming. I'm struggling to learn it the best way I can. I'll thank you for any help you give me.
Here's the task: I was ordered to find a way to collect some data from a website using a c# application. This will be done everyday, in order to update the data which we'll use to calculate some financial index.
I know my question might sound vague, anyway, even telling me how I can be more precise will help me. I know I seem to know desperate, but putting appart all the personell issues, my scholarship kind of depends on it.
Thanks in advance! (Please, don't mind the bad English, I'm brasilian and my English might not be that good yet.)
First, your English is fine. In fact, I thought you were a native speaker until you said otherwise.
The term you're looking for is 'site scraping'. Observe this question: Options for HTML scraping?. The second answer points to an HTML agility pack library you can use.
Now, there are two possibilities here. The first is you have to parse the HTML and scrape your data out of it. This is more computationally intensive and depends on the layout of the page. If they change the way the site looks, it could break the scraper.
The second possibility is they provide some XML or JSON web service you can consume. In this case you aren't scraping anything, but are rather using a true data feed. If the layout of the site changes, you will not break. Whether your target site supports this form of data feed is up to the site.
If I understand your question, you're being asked to do some Web Scraping, where you 1) download the contents of a web page and 2) try to parse data from that content.
For step #1, you should look into using a WebClient object in C# to download the HTML from the web page. You can give a WebClient object the URL you want to download the content from and obtain a String containing the content (probably HTML) of the URL.
How you go about doing step #2 depends on what content is present at the web site. If you know of certain patterns you're looking for in the HTML, you can search the HTML string using various methods. A more general solution for parsing HTML data can be found through using the Html Agility Pack, which will let you handle the HTML as a tree structure (DOM).
Use the WebClient class to get the page.
Turn the html into xml.
Use XPath to select the data you are interested in.
Ok, this is a pretty straightforward app design, and a lot of the code exists that you can reuse. Since you're a beginner, I'll break down into steps of what you need to do and recommend approaches.
1) You will use classes from System.Net to pull the web pages (WebClient being the easiest to usse). You will want to have this part of the program run on a timer if you can (using the scheduled jobs feature of the OS) and have it just pull the pages and drop them in a folder.
2) You have a second job which will run separately, pulling unread files from that folder, parsing them (using the HtmlAgility pack library is best) and then storing them in an index of some kind (Lucene is best for that)
3) You have a front end application of some sort (web or desktop) which queries that index for the information you're looking for.

Good way to navigate the DOM

Can I please get some C# code samples on the best way to navigate the DOM? My target data could be buried any where in the node.
I am talking about HTML DOM.
Assuming you mean HTML, I'd go with http://www.codeplex.com/htmlagilitypack

Categories