I am trying to automate the testing of web forms. To that end, I need to know how to use C# to dynamically locate the input tags within an HTML page and then assign values to them. I don't want to use XPath, because each time I will be using a different web form. I want to pass the web form's URL to Selenium and then automatically populate the fields. I've heard of the Html Agility Pack. Would that help me? If so, how can I use it?
I appreciate your help.
I may have missed a crucial part of your question, but have you looked at Selenium WebDriver?
If you write a test that handles a generic web form, you can back it with dynamic data and cater for changes in the page by using data-driven tests. I've written tests for many pages and there are always common actions, but I still cater for each page individually, because each page contains different things!
[EDIT]
Following on from your comments, I think looking into Selenium would be a good idea. The way to handle different pages is to have these element definitions ready in a 'definitions' class for each page. That way once you know what the page is, you just use the correct class for your definitions. It is best to know what elements you are going to be interacting with in your tests before the tests run. The point of automated UI testing is for a known set of actions to be performed and a correct result achieved.
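For example, here is a minimal sketch of what such a per-page definitions class might look like with Selenium WebDriver; the page name, locators, and field values are all invented for illustration:

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// Hypothetical locator definitions for one known page.
public static class LoginPageDefinitions
{
    public static readonly By UserName = By.Id("username");                  // assumed element id
    public static readonly By Password = By.Id("password");                  // assumed element id
    public static readonly By Submit   = By.CssSelector("button[type='submit']");
}

public class LoginFormTest
{
    public void FillAndSubmit(string url, string user, string pass)
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl(url);
            driver.FindElement(LoginPageDefinitions.UserName).SendKeys(user);
            driver.FindElement(LoginPageDefinitions.Password).SendKeys(pass);
            driver.FindElement(LoginPageDefinitions.Submit).Click();
        }
    }
}

Once the test knows which page it is on, it simply picks the matching definitions class; the test logic itself stays generic.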
I would suggest you look up some tutorials such as this one, and you can also see my blog, though I wrote it when I was initially learning WatiN and later replaced it with Selenium (I like it better :P).
Html Agility Pack
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).
HtmlDocument doc = new HtmlDocument();
doc.Load(path);
foreach (HtmlNode input in doc.DocumentNode.SelectNodes("//input"))
{
    // Your Code...
}
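Since the original question is about assigning values to the inputs, here is a minimal sketch of how that could look with the Html Agility Pack; the file path and the values written into each field are placeholders, and note that SelectNodes returns null when nothing matches:

using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.Load("form.html");                               // placeholder path

var inputs = doc.DocumentNode.SelectNodes("//input");
if (inputs != null)                                  // SelectNodes returns null on no match
{
    foreach (HtmlNode input in inputs)
    {
        string name = input.GetAttributeValue("name", "");
        input.SetAttributeValue("value", "test-" + name);   // placeholder value
    }
}

doc.Save("form_filled.html");

Keep in mind the Html Agility Pack only edits the markup; it does not drive a browser, so for true end-to-end form testing Selenium is still the better fit.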
Related
I'm trying to build a headless browser in C#. C# has plenty of classes that are supposed to make this possible, like, for example, JScriptCodeProvider.
I am looking to get the IE XML DOM classes for the JavaScript code to work with. Can anyone tell me where to find those and, if possible, provide me with a workable example of what I'm trying to do?
Use the WebBrowser control. That should get you everything you need.
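A minimal sketch of what that looks like: the WinForms WebBrowser control loads the page, gives you the DOM, and lets you call into the page's JavaScript (the URL, element id, and script name below are placeholders):

using System;
using System.Windows.Forms;

var browser = new WebBrowser { ScriptErrorsSuppressed = true };
browser.DocumentCompleted += (sender, e) =>
{
    // The DOM is ready here; elements and script are both reachable.
    HtmlElement element = browser.Document.GetElementById("some-id");   // placeholder id
    if (element != null)
        Console.WriteLine(element.InnerText);

    // Invoke a JavaScript function already defined in the page.
    browser.Document.InvokeScript("someFunction");                      // placeholder name
};
browser.Navigate("http://example.com");                                 // placeholder URL
Application.Run();   // the control needs a Windows message loop (STA thread)

It is not truly headless (it hosts IE's engine inside your process), but it spares you from compiling and wiring up JScript yourself.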
First of all, I hope my question doesn't bother you. I really need to get an idea of how I can accomplish this, but unfortunately I'm really a beginner; I'm crawling when it comes to programming. I'm struggling to learn it the best way I can. I'll be thankful for any help you give me.
Here's the task: I was ordered to find a way to collect some data from a website using a C# application. This will be done every day, in order to update the data which we'll use to calculate a financial index.
I know my question might sound vague; anyway, even telling me how I can be more precise will help. I know I seem desperate, but putting aside all the personal issues, my scholarship kind of depends on it.
Thanks in advance! (Please don't mind the bad English, I'm Brazilian and my English might not be that good yet.)
First, your English is fine. In fact, I thought you were a native speaker until you said otherwise.
The term you're looking for is 'site scraping'. See this question: Options for HTML scraping? The second answer points to the Html Agility Pack library, which you can use.
Now, there are two possibilities here. The first is you have to parse the HTML and scrape your data out of it. This is more computationally intensive and depends on the layout of the page. If they change the way the site looks, it could break the scraper.
The second possibility is they provide some XML or JSON web service you can consume. In this case you aren't scraping anything, but are rather using a true data feed. If the layout of the site changes, you will not break. Whether your target site supports this form of data feed is up to the site.
If I understand your question, you're being asked to do some Web Scraping, where you 1) download the contents of a web page and 2) try to parse data from that content.
For step #1, you should look into using a WebClient object in C# to download the HTML from the web page. You can give a WebClient object the URL you want to download the content from and obtain a String containing the content (probably HTML) of the URL.
How you go about doing step #2 depends on what content is present at the web site. If you know of certain patterns you're looking for in the HTML, you can search the HTML string using various methods. A more general solution for parsing HTML data can be found through using the Html Agility Pack, which will let you handle the HTML as a tree structure (DOM).
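A minimal sketch of both steps, assuming the Html Agility Pack is referenced and using a placeholder URL and XPath expression:

using System;
using System.Net;
using HtmlAgilityPack;

// Step 1: download the page content as a string.
string html;
using (var client = new WebClient())
{
    html = client.DownloadString("http://example.com/prices");   // placeholder URL
}

// Step 2: parse the HTML into a tree (DOM) and pull out the data you need.
var doc = new HtmlDocument();
doc.LoadHtml(html);
var cells = doc.DocumentNode.SelectNodes("//table//td");          // placeholder XPath
if (cells != null)
{
    foreach (HtmlNode cell in cells)
        Console.WriteLine(cell.InnerText.Trim());
}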
Use the WebClient class to get the page.
Turn the HTML into XML.
Use XPath to select the data you are interested in. (A minimal sketch of these steps follows below.)
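A minimal sketch of those three steps; here the Html Agility Pack does the HTML-to-XML conversion (via its OptionOutputAsXml setting), and the URL and XPath expression are placeholders:

using System;
using System.IO;
using System.Net;
using System.Xml.Linq;
using System.Xml.XPath;
using HtmlAgilityPack;

// 1) Get the page.
string html;
using (var client = new WebClient())
    html = client.DownloadString("http://example.com/data");     // placeholder URL

// 2) Turn the (possibly messy) HTML into well-formed XML.
var doc = new HtmlDocument { OptionOutputAsXml = true };
doc.LoadHtml(html);
var writer = new StringWriter();
doc.Save(writer);

// 3) Select the data with XPath.
XDocument xml = XDocument.Parse(writer.ToString());
foreach (XElement node in xml.XPathSelectElements("//span"))     // placeholder XPath
    Console.WriteLine(node.Value);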
OK, this is a pretty straightforward app design, and a lot of the code already exists for you to reuse. Since you're a beginner, I'll break down what you need to do into steps and recommend approaches.
1) You will use classes from System.Net to pull the web pages (WebClient being the easiest to use); see the sketch after this list. You will want to have this part of the program run on a timer if you can (using the scheduled jobs feature of the OS) and have it just pull the pages and drop them in a folder.
2) You have a second job which runs separately, pulling unread files from that folder, parsing them (the Html Agility Pack library is best for this) and then storing them in an index of some kind (Lucene is best for that).
3) You have a front end application of some sort (web or desktop) which queries that index for the information you're looking for.
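A minimal sketch of step 1, the page-pulling job; the URL list, drop folder, and file-naming scheme are placeholders, and the OS scheduler (e.g. Windows Task Scheduler) would simply run this program once a day:

using System;
using System.IO;
using System.Net;

class PageFetcher
{
    static void Main()
    {
        string[] urls = { "http://example.com/quotes" };      // placeholder URLs
        string dropFolder = @"C:\scraper\inbox";              // placeholder folder
        Directory.CreateDirectory(dropFolder);

        using (var client = new WebClient())
        {
            foreach (string url in urls)
            {
                string fileName = DateTime.Now.ToString("yyyyMMdd_HHmmss") + "_" +
                                  new Uri(url).Host + ".html";
                client.DownloadFile(url, Path.Combine(dropFolder, fileName));
            }
        }
    }
}

The second job would then read the unprocessed files from that folder, parse them with the Html Agility Pack, and push the extracted fields into the Lucene index.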
I need an HTML parser which has the capability to identify errors in generated HTML and, if tags are not closed, close them and return valid HTML.
More detail: I am getting data from a database and breaking each record so that partial detail is shown on my website; clicking a "more" button then shows the complete content. After breaking the string, I then need to validate it.
I have already used the Html Agility Pack, but I am new to it. If this library can solve my issue, then please guide me on how (a tutorial), or suggest another library.
I don't think such a library exists. The problem is that some libraries can indeed identify errors in your HTML, but they can't fix them for you.
I think using the W3C validator as a service is the best starting point here. There's an open-source library which uses the API of the W3C validator to validate a document and gives you a response saying whether it is valid, along with any errors and warnings. I would start with this and then go on from there.
W3C Markup Validator library in C#
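If you would rather skip the wrapper library and call the validator's web service directly, here is a minimal sketch that POSTs a document to the W3C Nu validator and reads back its JSON report; the endpoint and the out=json parameter are the service's documented interface, but treat the exact details as an assumption to verify, and the file path is a placeholder:

using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class ValidatorCheck
{
    static async Task Main()
    {
        string html = File.ReadAllText("snippet.html");        // placeholder path

        using (var client = new HttpClient())
        {
            var content = new StringContent(html, Encoding.UTF8, "text/html");
            // out=json asks the validator for a machine-readable report.
            HttpResponseMessage response =
                await client.PostAsync("https://validator.w3.org/nu/?out=json", content);
            string report = await response.Content.ReadAsStringAsync();
            Console.WriteLine(report);   // contains a "messages" array of errors and warnings
        }
    }
}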
Here are a couple of validation programs from the World Wide Web Consortium, the W3C:
Windows: http://validator.w3.org/docs/install_win.html
UNIX / Linux: http://validator.w3.org/docs/install.html
You can also use their web services to validate your CSS, HTML, XML, XHTML, JavaScript and many other web technologies. The W3C is one of the bodies overseeing that the Internet stays highly interoperable and internet devices remain somewhat compatible with each other.
I'm trying to make an extended version of a WebBrowser with features like highlighting text and getting properties or attributes of elements for a web scraper. The WebBrowser's own functions don't help much at all, so if I could just find a way to go from an HtmlElement to a JavaScript element (like the one returned by document.getElementById), and back, and then add JavaScript functions to the HTML from my application, it would make the job a lot easier. Right now I'm modifying the page's HTML programmatically from C# and it's very messy.

I was thinking about assigning a unique id to each HTML element from my program and then calling the JavaScript document.getElementById to retrieve it. But that won't work: the elements might already have an id assigned, and I would mess up their HTML code. I don't know if I could give them some made-up attribute like my_very_own_that_i_hope_no_web_page_on_the_world_ever_uses_attribute and then find some JavaScript function like getElementByWhateverAttributeIWant, but I'm not sure that would work either.

I read something about expansion or extended attributes in the DOM documentation on MSDN, but I'm not sure what that is about. Maybe some of you have a better way.
It would be much easier to use a rendering engine like Trident (MSHTML) to get the data from the HTML document. Here is the link for Trident/MSHTML; you can also google for samples in C#.
This is not nearly as hard as you imagine. You don't have to modify the document at all.
Once the WebBrowser has loaded a page, it's kept internally as a tree with the document node at the root. This node is available to your program, and you can find any element you want (or just enumerate them all) by walking the tree.
If you can give a concrete example, I can supply some code.
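In the meantime, here is a minimal sketch of walking the loaded tree without touching the document at all; the URL is a placeholder:

using System;
using System.Windows.Forms;

class DomWalker
{
    static void Walk(HtmlElement element, int depth)
    {
        Console.WriteLine(new string(' ', depth * 2) + element.TagName);
        foreach (HtmlElement child in element.Children)
            Walk(child, depth + 1);
    }

    [STAThread]
    static void Main()
    {
        var browser = new WebBrowser();
        browser.DocumentCompleted += (sender, e) => Walk(browser.Document.Body, 0);
        browser.Navigate("http://example.com");   // placeholder URL
        Application.Run();                        // the control needs a message loop
    }
}

From any node you can read TagName, InnerText, and GetAttribute(...) to find the elements you care about, and you never have to inject ids or custom attributes into the page.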
I am looking for a C# library that would translate HTML code (and the CSS specified in it) into a DOM tree for simpler parsing. I am looking for something similar to this one (which is in PHP):
http://simplehtmldom.sourceforge.net/
Of course I know I could embed a browser control, but I am looking for something more efficient.
Check out the HTML Agility Pack. It hasn't been updated in a while, but it still works very well.
I second Mr. Dorman on the HtmlAgilityPack. I did a brief blog post on web scraping some time ago; it mentions the 'pack, but mostly discusses other details. Depending on your application, it might be of some use.
We have used the Html Agility Pack in our project to extract specific HTML tags with a given set of attributes using XPath, and it has never failed us.
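For instance, a minimal sketch of the kind of attribute-filtered XPath query this refers to; the file path, tag, and attribute names are placeholders:

using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.Load("page.html");                                    // placeholder path

// Select only the tags that carry a specific set of attributes.
var nodes = doc.DocumentNode.SelectNodes("//div[@class='quote' and @data-id]");   // placeholder XPath
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
        Console.WriteLine(node.GetAttributeValue("data-id", "") + ": " + node.InnerText.Trim());
}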
There is no way to get the DOM with styles applied like that. The only option is the Selenium framework, which works with a real browser.