How do I scrape a website for information?

How do I scrape a website for information? - c#

I want my program to automatically download only certain information off a website. After finding out that this is nearly impossible I figured it would be best if the program would just download the entire web page and then find the information that I needed inside of a string.
How can I find certain words/numbers after specific words? The word before the number I want to have is always the same. The number varies and that is the number I need in my program.

Sounds like screen scraping. I recommend using CSQuery https://github.com/jamietre/CsQuery (or HtmlAgilityPack if you want). Get the source, parse as object, loop over all text nodes and do your string comparison there. The actual way of doing this varies a LOT on how the source HTML is done.
Maby something like this untested example written from memory (CSQuery)
var dom = CQ.Create(stringWithHtml);
dom["*"].Each((i, e) =>
{
// handle only text nodes
if (e.NodeType == NodeType.TEXT_NODE) {
// do your check here
}
}

I've used HTML Agility Pack for multiple applications and it works well. Lots of options too.
It's a lovely HTML parser that is commonly recommended for this. It will take malformed HTML and massage it into XHTML and then a traversable DOM, like the XML classes. So, is very useful for the code you find in the wild.

Related

Parsing Html tags using c#

I have html code:
<p>Answer1</p>
<h2>Category1</h2>
<p>Answer2</p>
<p>Answer3</p>
I need to do parsing so that each answer (p) belongs to the category(h2) above.
If nothing is above, then the category will be null.
Look like this :
obj1.category = null; obj1.answer = "Answer1";
obj2.category ="Category1"; obj2.answer = "Answer2";
obj3.category ="Category1"; obj3.answer = "Answer3";
I tried to solve this, but it was useless.

Use HTMLAgilityPack. It will parse HTML and allow you do use LINQ to SELECT whatever you need from the DOM structure.

In addition to HTMLAgilityPack, I've also written a light weight HTML parse for C#.
There's no big secret to the technique, but it's sort of detailed work. You just go through the text character by character and pull out HTML elements.
My parser is on Github as HtmlMonkey.
UPDATE:
I just added support for fairly advanced selectors to easily find nodes within a parsed document.

How to retrieve data from an html string from a span tag by using Regular Expressions?

I need to retrieve some info from an html doc since the web service to get a json or an xml is still not ready. Im working with c# and using regular expressions to get the data i need from the html string. I've managed to get the div i want to work with from the whole html string but now i'm having trouble getting the info between the first span tag.
I've attempted to retrieve the data between ; and the first closing span tag but what i really want is the content between the first span tag.
Here's the regular expression i've written so far, but it's not working:
".*;(?<Content>(\r|\n|.)*)</span>"
I also tried this but didnt work either:
"<span class=""type"">(?<Content>(\r|\n|.)*)</span>"
Here is the div i want to retrieve the data from:
<div class="main">ABASASDFÓ 18/06/2014 17:38h Blabla Balbal <span class="type">15.80€ </span>+1.94 % +0.30€ | HOME <SPAN class="type2">11,398.70</span> +0.65 % +74.10</div>
EDIT: I can't use Htmlagilitypack since my client does not want us to use any external library. I've also heard about using the XmlReader but i'm not sure the structure of the html will match an xml one accordingly.

This regex will capture the string:
"<span class=\"type\">(?<Content>([^<]*))</span>"
Although, I agree with other answers, you should use something like Path instead of Regexes for parsing html.

Here's how it is done with a regex in Javascript. You should be able to adapt this for C# pretty easily.
var inner = html.match( /<span class="type"(?:\s+[a-z]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)))*\s*>([\S\s]*)<\/span>/i)[1];
Fiddle: http://jsfiddle.net/GarryPas/uk32r8vz/

You want to use XPath for that. Something like this:
div/span/text()
I understand not wanting some external 3rd party library in your solution, the solution to that is to go fetch the source code of the entire library:
https://htmlagilitypack.codeplex.com/
Now you don't have an external library, you have an internal library and you can use the right tool for the job!
XmlReader is a fairly low-level tool, it could technically do the job for you but what you're more after is "use XmlReader to do XPath" which is talked about here: https://msdn.microsoft.com/en-us/library/ms950778.aspx
The XPathReader class is the result of all that, which has been superseded by LINQ to XML: https://msdn.microsoft.com/en-ca/library/bb387098.aspx
So another option here is to try to use some LINQ to process your HTML file, but that might be tricky since HTML isn't good XML. Still, it's another option if you're looking for those.

Reading values from webpage programmatically

I don't know what it called, but i think this is possible
I am looking to write something(don't know the exact name) that will,
go to a webpage and select a value from drop-down box on that page and read values from that page after selection, I am not sure weather it called crawler or activity, i am new to this but i heard long time back from one of my friend this can be done,
can any one please give me a head start
Thanks

You need an HTTP client library (perhaps libcurl in C, or some C# wrapper for it, or some native C# HTTP client library like this).
You also need to parse the retrieved HTML content. So you probably need an HTML parsing library (maybe HTML agility pack).
If the targeted webpage is nearly fixed and has e.g. some comments to ease finding the relevant part, you might use simpler or ad-hoc parsing techniques.
Some sites might send a nearly empty static HTML client, with the actual page being dynamically constructed by Javascript scripts (Ajax). In that case, you are unlucky.
Maybe you want some web service ....

One simple way (but not the most efficient way) is to simply read the webpage as String using the WebClient, for example:
WebClient Web = new WebClient();
String Data = Web.DownloadString("Address");
Now since HTML is simply an XML document you can parse the string to a XDocument and look up the tag that represents the dropdown box. Parsing the string to XDocument is done this way:
XDocument xdoc = XDocument.Pase(Data);
Update:
If you want to read the result of the selected value, and that result is displayed within the page do this:
Get all the items as I explained.
If the page does not make use of models, then you can use your selected value as an argument for example :
www.somepage.com/Name=YourItem?
Read the page again and find the value

c# to transform an unordered list into a series of textboxes

My SQL database is outputting a blob of html that looks something like this:
<ul><li>Test 1</li><li>Test 2</li></ul>
Which I would like to transform into a list of textboxes which look something like this:
<textarea>Test 1</textarea><textarea>Test 2</textarea>
How do I do this? I've tried parsing into XML but this doesn't seem to work and seems a lengthy way of doing things.

I recently discovered the HTML Agility Pack which makes it very easy to do things like this:
var nodes= doc.DocumentNode.Descendants().Where(n => n.Name.StartsWith("li")).ToList();
You could then iterate over the collection and build your TextArea
foreach (HTML.HtmlNode node in nodes)
{
// ... build your text area here
}

From the wording of your question, I am inferring that you are primarily looking for a way to access the DOM of a snippet of HTML. There may be other tools for this, but I can recommend Html Agility Pack which is available as a Nuget package (http://htmlagilitypack.codeplex.com/).
Once you have the DOM structure loaded, you can access and modify it as you need, and you can generate output html as well.
This library is quite forgiving of poorly written HTML, so if your database values are less controlled (entered by users, for example), it is a good choice.

Suggestions on how build an HTML Diff tool?

In this post I asked if there were any tools that compare the structure (not actual content) of 2 HTML pages. I ask because I receive HTML templates from our designers, and frequently miss minor formatting changes in my implementation. I then waste a few hours of designer time sifting through my pages to find my mistakes.
The thread offered some good suggestions, but there was nothing that fit the bill. "Fine, then", thought I, "I'll just crank one out myself. I'm a halfway-decent developer, right?".
Well, once I started to think about it, I couldn't quite figure out how to go about it. I can crank out a data-driven website easily enough, or do a CMS implementation, or throw documents in and out of BizTalk all day. Can't begin to figure out how to compare HTML docs.
Well, sure, I have to read the DOM, and iterate through the nodes. I have to map the structure to some data structure (how??), and then compare them (how??). It's a development task like none I've ever attempted.
So now that I've identified a weakness in my knowledge, I'm even more challenged to figure this out. Any suggestions on how to get started?
clarification: the actual content isn't what I want to compare -- the creative guys fill their pages with lorem ipsum, and I use real content. Instead, I want to compare structure:
<div class="foo">lorem ipsum<div>
is different that
<div class="foo"><p>lorem ipsum<p><div>

The DOM is a data structure - it's a tree.

Run both files through the following Perl script, then use diff -iw to do a case-insensitive, whitespace-ignoring diff.
#! /usr/bin/perl -w
use strict;
undef $/;
my $html = <STDIN>;
while ($html =~ /\S/) {
if ($html =~ s/^\s*<//) {
$html =~ s/^(.*?)>// or die "malformed HTML";
print "<$1>\n";
} else {
$html =~ s/^([^<]+)//;
print "(text)\n";
}
}

#Mike - that would compare everything, including the content of the page, which isn't want the original poster wanted.
Assuming that you have access to the browser's DOM (by writing a Firefox/IE plugin or whatever), I would probably put all of the HTML elements into a tree, then compare the two trees. If the tag name is different, then the node is different. You might want to stop enumerating at a certain point (you probably don't care about span, bold, italic, etc. - maybe only worry about divs?), since some tags are really the content, rather than the structure, of the page.

If i was to tacke this issue I would do this:
Plan for some kind of a DOM for html pages. starts at lightweight and then add more as needed. I would use composite pattern for the data structure. i.e. every element has children collection of the base class type.
Create a parser to parse html pages.
Using the parser load html element to the DOM.
After the pages' been loaded up to the DOM, you have the hierachical snapshot of your html pages structure.
Keep iterating through every element on both sides till the end of the DOM. You'll find the diff in the structure, when you hit a mismatched of element type.
In your example you would have only a div element object loaded on one side, on the other side you would have a div element object loaded with 1 child element of type paragraph element. fire up your iterator, first you'll match up the div element, second iterator you'll match up paragraph with nothing. You've got your structural difference.

I think some of the suggestions above don't take into account that there are other tags in the HTML between two pages which would be textually different, but the resulting HTML markup is functionally equivalent. Danimal lists control IDs as an example.
The following two markups are functionlly identical, but would show up as different if you simply compared tags:
<div id="ctl00_TopNavHome_DivHeader" class="header4">foo</div>
<div class="header4">foo</div>
I was going to suggest Danimal write an HTML translation which looks for the HTML tags and converts both docs into a simplified version of both which omits ID tags and any other tags you designate as irrelevant. This’d likely have to be a work in progress, as you ignore certain attributes/tags and then run into new ones which you also want to ignore.
However, I like the idea of using the XmlSchemaInterface to boil it down to the XML schema, then use a diff tool which understands XML rules.

See http://www.semdesigns.com/Products/SmartDifferencer/index.html for a tool that is parameterized by langauge grammar, and produces deltas in terms of language elements (identifiers, expressions, statements, blocks, methods, ...) inserted, deleted, moved, replaced, or has identifiers substituted across it consistently. This tool ignores whitespace reformatting (e.g., different linebreaks or layouts) and semantically indistinguishable values (e.g., it knows that 0x0F and 15 are the same value).
This can be applied to HTML using an HTML parser.
EDIT: 9/12/2009. We've built an experimental SmartDiff tool using an HTML editor.

http://www.mugo.ca/Products/Dom-Diff
Works with FF 3.5. I haven't tested FF 3.6 yet.

This has been an excellent start. A few more clarifications/comments:
I probably don't care about IDs, since .net will mangle them
some of the structure will be in a repeater or other such control, so I might end up having more or fewer repeating elements
further thought:
I think a good start would be to assume the html is XHTML compliant. I could then infer the schema (using the new .net XmlSchemaInference methods), then diff the schemata. I can then look at the differences and consider whether or not they're significant.

My suggestion is just the basic way of doing it... Of course to tackle the issue you mentioned additional rules must be applied here... Which is in your case, we got a matching div element, and then apply attributes/property matching rules and what not...
To be honest, there are many and complicated rules that need to be applied for the comparison, and its not just a simple matching element to another element. For example what happens if you have duplicates.
e.g. 1 div element on one side, and 2 div element on the other side. How are you gonna match up which div elements matches together?
There are alot other complicated issues that you will find in the comparison word. Im speaking based of experience (part of my job is to maitain my company text comparison engine).

Take a look at beyond compare. It has an XML comparison feature that can help you out.

You may also have to consider that the 'content' itself could contain additional mark-up so it's probably worth stripping out everything within certain elements (like <div>s with certain IDs or classes) before you do your comparison. For example:
<div id="mainContent">
<p>lorem ipsum etc..</p>
</div>
and
<div id="mainContent">
<p>Here is some real content<img class="someImage" src="someImage.jpg" /></p>
<ul>
<li>and</li>
<li>some</li>
<li>more..</li>
</ul>
</div>

Pretty Diff can do this. It will compare the code structure only regardless of differences to white space, comments, or even content. Just be sure to check the option "Normalize Content and String Literals".
http://prettydiff.com/

I would use (or contribute to) html5lib and its SAX output. Just zip through the 2 SAX streams looking for mismatches and highlight the whole corresponding subtree.

I don't know any tool but I know there is a simple way to do this:
First, use a regular expression tool to strip off all the text in your HTML file. You can use this regular expression to search for the text (?<=^|>)[^><]+?(?=<|$) and replace them with an empty string (""), i.e. delete all the text. After this step, you will have all HTML markup tags. There are a lot of free regular expression tools out there.
Then, you repeat the first step for the original HTML file.
Last, you use a diff tool to compare the two sets of HTML markups. This will show what is missing between one set and the other.

If i were to do this, first i would learn HTML. (^-^) Then i would build a tool that strips out all of the actual content and then saves that as a file so it can be piped through WinDiff (or other merge tool).

Open each page in the browser and save them as .htm files. Compare the two using windiff.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How do I scrape a website for information? - c#

Related

Parsing Html tags using c#

How to retrieve data from an html string from a span tag by using Regular Expressions?

Reading values from webpage programmatically

c# to transform an unordered list into a series of textboxes

Suggestions on how build an HTML Diff tool?

Categories

Resources