Full HTML code from iframes using WebBrowser - C#

I need to get the HTML code of this site (with C#):
http://urbs-web.curitiba.pr.gov.br/centro/defmapalinhas.asp?l=n (only works with IE8)
Using the WebClient class, HttpWebRequest, or any other library, I cannot get at the HTML code that is generated dynamically.
So my only solution (I guess) would be to use the WebBrowser Control (WPF).
I have been trying and trying with mshtml.HTMLDocument and SHDocVw.IWebBrowser2, but it is a mess and I cannot find what I want in it: there seem to be many iframes, and inside them more iframes. So far I have tried:
IHTMLElementCollection elcol = htmlDoc.getElementsByTagName("iframe");
var test = htmlDoc.getElementsByTagName("HTML");
var test2 = doc.all;
but made no progress. Does anyone know how to help me?
Observation / trivia: this is the site that shows where all the buses pass in my city. The site is horrible, only works in IE8 and has serious problems. I would like to use this information to try to create a better service later, using Google Maps or Bing Maps.
Update: the site I was trying to get the information from is no longer available. The idea of getting the dynamically generated HTML source was abandoned, and I never found a solution using a WebBrowser control for WPF.
I believe that today there are other ways to solve this problem.

You need to use the "Frames" collection on the WebBrowser control. If I recall correctly, it returns all frames and iframes, and you then have to look at the Frames collection of every newly discovered frame as well. In other words, it's a recursive discovery loop: add each frame you find to your own array or collection, and for every "unsearched" frame, look at that frame's ".Frames" collection (each one is a typical collection with a .Count and so on). Keep going until there are no newly discovered frames left whose ".Frames" collection hasn't been searched.
Done that way, the function will discover arbitrarily nested frames; I've done this in a VB6 project (I'm happy to give you the source for it if you would like it). The nesting is not preserved in my example, but that's fine: the nesting structure isn't important, and you can work out which frame is which from the order in which the frames are added to the collection, since that order follows the frame hierarchy.
Once you have done that, getting the HTML source is pretty straightforward and I'm sure you know how to do it, probably via .DocumentText depending on the version of the WebBrowser control you are using.
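If it helps, here is a minimal C# sketch of that recursive discovery loop, assuming you have cast the WPF WebBrowser's Document property to mshtml.IHTMLDocument2; the cast, the queue-based traversal and the cross-domain handling are my assumptions, not the original VB6 code:
using System;
using System.Collections.Generic;
using mshtml;

static class FrameWalker
{
    // Collects the top-level document plus every frame/iframe document
    // reachable from it, however deeply nested.
    public static List<IHTMLDocument2> CollectDocuments(IHTMLDocument2 root)
    {
        var found = new List<IHTMLDocument2> { root };
        var toSearch = new Queue<IHTMLDocument2>();
        toSearch.Enqueue(root);

        while (toSearch.Count > 0)
        {
            IHTMLDocument2 current = toSearch.Dequeue();
            FramesCollection frames = current.frames;
            for (int i = 0; i < frames.length; i++)
            {
                object index = i;
                try
                {
                    var window = (IHTMLWindow2)frames.item(ref index);
                    IHTMLDocument2 frameDoc = window.document;
                    found.Add(frameDoc);
                    toSearch.Enqueue(frameDoc); // newly discovered frame: search its frames too
                }
                catch (UnauthorizedAccessException)
                {
                    // frame served from another domain; IE will not let us touch its DOM
                }
            }
        }
        return found;
    }
}
From each collected document you can then read something like ((IHTMLDocument3)doc).documentElement.outerHTML (or doc.body.outerHTML) to get its markup.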
Also, you say it is not possible to use an HTTP client to grab the source code directly? I must disagree: once you have the frame objects, you can read the URL from each one and do a URL-to-string call with any HTTP-client-like class or framework. The only ways they could prevent this on their end are by accepting requests only from a particular referrer (i.e. the referrer must be from their own domain for some of their files), or by checking the USER_AGENT and rejecting anything that isn't one of the browsers they expect. Unlikely, but possible.
However, both the referrer and the user agent can be set in the HTTP client you are using, so if they are imposing limits based on this sort of thing, you can spoof them very easily and send the values they expect. Again, this is low-probability stuff, but they may have set things up this way, especially if their data is proprietary.
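For example, a rough HttpWebRequest sketch; the header values shown are only examples of what you might send:
using System.IO;
using System.Net;

class FrameDownloader
{
    // Fetches one frame URL directly, pretending to be IE8 coming from the parent page.
    public static string UrlToString(string frameUrl)
    {
        var request = (HttpWebRequest)WebRequest.Create(frameUrl);
        request.Referer = "http://urbs-web.curitiba.pr.gov.br/centro/defmapalinhas.asp?l=n";
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)";

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}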
PS: My first visit to the site ended with IE crashing and reopening the tab :) Terrible site, I agree.

Related

Is there a way in C# to get a 'browser' to pre-process a WebRequest so you can work with the elements that you see when you 'View Source'

I want to make a web request to e.g. http://finance.yahoo.com/q?s=rb.l and extract the share price. However, the text returned is what comes back before the browser has processed it, and I need it processed first so that the <span></span> element I need to look for actually exists.
Is this possible, or should I be looking at doing it another way?
Similarly, any reliable free 15-minute-delayed stock service for the LSE, or any other way of obtaining this data given just the ticker code, would be great.
There are two questions here: first, how to programmatically access data on a page after letting its JavaScript run, as if it were being read by a real browser; second, how to get stock ticker information programmatically.
To answer the first question: You could use something like WebDriver .NET to literally instantiate a browser that opens the page, and then access elements on the page.
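For instance, a hedged sketch with the Selenium WebDriver bindings for .NET; the CSS selector here is a guess and will need to be adjusted to Yahoo's actual markup:
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;

class QuoteScraper
{
    static void Main()
    {
        // Starts a real browser, so the page's JavaScript runs before we read the DOM.
        using (IWebDriver driver = new FirefoxDriver())
        {
            driver.Navigate().GoToUrl("http://finance.yahoo.com/q?s=rb.l");

            // The selector below is hypothetical; inspect the page to find
            // the element that actually holds the share price.
            IWebElement priceSpan = driver.FindElement(By.CssSelector("span.price"));
            Console.WriteLine(priceSpan.Text);
        }
    }
}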
To answer the second question, I suggest you try to search for that question directly, since it's a common enough problem that you'll probably find a number of people who have answered it already.

Search engine optimization for database loaded using jQuery

I am currently optimizing my site for search engines. It is mainly a database driven site. I am using C# on the back end but database content is loaded via jQuery ajax and a web service. Therefore, my database content is not in html at the point that the bots will crawl it. My site is kind of like an online supermarket format in that there are thousands of items in my database, users can load a single one of these or more onto the web page at a time and the page does not change significantly once items are loaded.
My question is: how (if at all) can I get my database contents indexed? I was thinking of having an anchor that links to an .aspx page (e.g. called mydatabase) which loads all of my database items as one big HTML list. Then, using jQuery, I would make the anchor invisible to users. The data would still be accessible to users, just not via this link; it would be accessed through the jQuery interface I have created.
The thing is, I don't really want users to see this big, messy list. Would Google show this page, e.g. www.mysite.com/mydatabase.aspx, as a search result? Also, would Google see it as a "keyword-rich" spam page? I have done quite a lot of research but found nothing on this, only instructions for PHP. Please help; I'm not sure what to do and need to know the best way to go about this.
It's a shame you haven't taken the progressive enhancement approach, as you would have started with standard, crawlable HTML output and then layered the behaviour (AJAX) on top for the user experience.
Providing a single file (e.g. mydatabase.aspx) that lists all of your products adds no real value, for the reason you gave: it would just be a big useless list, with no editorial content or relevance for each link.
You're much better off taking another look at your information architecture and trying to ensure that each product is accessible by its own unique URL, then classifying the products into groups (result pages), being careful to think about pagination.
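As a hedged illustration (assuming ASP.NET MVC; the route names, controllers and parameters are made up), giving each product and each category listing its own crawlable URL might look like this:
using System.Web.Mvc;
using System.Web.Routing;

public class RouteConfig
{
    public static void RegisterRoutes(RouteCollection routes)
    {
        // e.g. /products/1234/organic-bananas
        routes.MapRoute(
            name: "ProductDetail",
            url: "products/{id}/{slug}",
            defaults: new { controller = "Products", action = "Detail", slug = UrlParameter.Optional });

        // e.g. /category/fruit?page=2
        routes.MapRoute(
            name: "CategoryListing",
            url: "category/{name}",
            defaults: new { controller = "Categories", action = "Index" });
    }
}
Each of these pages can render plain HTML for the crawler, while your jQuery interface keeps doing the interactive loading for users.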
You can still make this act like a single-page application using AJAX, but you'd want to look into HTML5's History API to achieve this in a search engine friendly way.

HTML Page Scraping

What's the best way to scrape a web page that has AJAX/dynamic loading of data?
For example: scraping a webpage that presents 20 images on load but loads more images as the user scrolls down the page (sort of like Facebook). In such a case, how do you scrape all the images, not just the first 20?
This is something that not even the major search engines have mastered yet. It's called "event-driven crawling".
Google even has a guide on what to do to help them crawl your AJAX sites better.
Best thing would be to read some open source crawlers and see what they do. But your chances of crawling even 80% are slim at best, unless you have a specific target in mind.
There are also some interesting reads at Crawljax.
Basically, you should look for the page's scripts and check whether they make any AJAX calls, then determine what kind of parameters they take and make repeated calls with incremented/decremented parameter values. This only works if the parameters follow a logical pattern, such as numbers, single letters, etc. It also depends on whether you're targeting a known site or just sending the crawler into the wild. If you know your target, you can inspect its DOM and customize your code for greater accuracy, as mentioned by wolf.
Good luck
Use a tool such as Fiddler or Wireshark to inspect the web request that is made when more items are loaded.
Then replicate the request in your code.
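As a rough sketch, once you know the endpoint the page calls, replay it with varying parameters; the URL and the offset/count parameter names below are hypothetical, so use whatever the inspector actually shows you:
using System;
using System.Net;

class AjaxReplay
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Walk the same paging parameter the page increments as you scroll.
            for (int offset = 0; offset < 200; offset += 20)
            {
                string url = string.Format(
                    "http://example.com/photos/load?offset={0}&count=20", offset);
                string json = client.DownloadString(url);
                Console.WriteLine("Got {0} characters for offset {1}", json.Length, offset);
                // parse the response here and pull out the image URLs
            }
        }
    }
}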
Update (thanks to pguardiario for his comment):
Note that Wireshark is a low-level network capture tool that offers a great deal of detail about the traffic (packets being exchanged, DNS lookups, and so on) and may be painful to use in this scenario, where you only want to see the HTTP requests.
So you're better off using Fiddler, or a similar tool built into a browser (e.g. Chrome's Network panel).
Crawljax is open source and can dynamically crawl Ajax-based content.

Easy way to replicate web page across machines?

I am trying to replicate a browser page to another browser on another machine. I basically want to reproduce a page exactly as it appears to a customer, for viewing by the website owner. I have done this before using some impersonation trickery, but found that it would throw the session state out of whack when the site owner switched customers. So I would like to stay away from cookie and authentication manipulation.
Anybody done anything like that? Is there a way to easily transfer the DOM to a webservice?
The technologies at my disposal are C#, JavaScript and WCF.
Is sending an image an option? If so, you can use the IECapt program to take a screenshot of the page and send it to the other machine:
http://iecapt.sourceforge.net/
If session state is getting messed up when the site owner changes customer roles, your implementation might be the problem. I'd probably try fixing how your session management works before solving a problem that is really a symptom of a deeper one, IMO.
Since you mentioned transferring the DOM to a webservice, I assume you need to inspect the page's source and not just its appearance. I recommend checking this link:
http://www.eggheadcafe.com/community/aspnet/7/10041011/view-source-of-a-web-page.aspx
It has a few suggestions for grabbing a page's source programmatically / screen-scraping.
Of course, a few more details might yield better answers. Specifically, does the customer submit their page to the owner (I imagine a scenario where a user of your site says "Hey, I'm having a problem! Take a look at this...") or is the owner looking at how the page renders when logged-in as a specific customer?
The easiest way is to post the innerHTML of the body tag to your web service, which your other page can poll (or use Comet, or something similar) to get it back. You'll have to be careful to load the right CSS in your clone page. Also, you'll need to think about how often you want it to update.
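Since WCF is in your toolset, here is a minimal sketch of what that service could look like; the contract name, operations and in-memory storage are my assumptions:
using System.Collections.Concurrent;
using System.ServiceModel;

[ServiceContract]
public interface IPageMirrorService
{
    // The customer's page POSTs its body HTML here.
    [OperationContract]
    void PushSnapshot(string sessionId, string bodyHtml);

    // The owner's page polls this to get the latest snapshot.
    [OperationContract]
    string PullSnapshot(string sessionId);
}

public class PageMirrorService : IPageMirrorService
{
    // one snapshot per customer session, kept in memory for simplicity
    private static readonly ConcurrentDictionary<string, string> Snapshots =
        new ConcurrentDictionary<string, string>();

    public void PushSnapshot(string sessionId, string bodyHtml)
    {
        Snapshots[sessionId] = bodyHtml;
    }

    public string PullSnapshot(string sessionId)
    {
        string html;
        return Snapshots.TryGetValue(sessionId, out html) ? html : null;
    }
}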
This is a bit of a hack, though. A better solution would have been to design the page from the start with this in mind (I'm assuming it is too late for that now?), so that anything that mutated the page would at the same time send a message back to the server describing what was changed. Or, if the page is not very interactive, store the canonical state of the page on the server and query it from both browsers with XHRs or similar.

Algorithm for reading the actual content of news articles and ignoring "noise" on the page?

I'm looking for an algorithm (or some other technique) to read the actual content of news articles on websites and ignore anything else on the page. In a nutshell, I'm reading an RSS feed programmatically from Google News. I'm interested in scraping the actual content of the underlying articles. On my first attempt I have the URLs from the RSS feed and I simply follow them and scrape the HTML from that page. This very clearly resulted in a lot of "noise", whether it be HTML tags, headers, navigation, etc. Basically, all the information that is unrelated to the actual content of the article.
Now, I understand this is an extremely difficult problem to solve; it would theoretically involve writing a parser for every website out there. What I'm interested in is an algorithm (I'd even settle for an idea) for maximizing the actual content I see when I download the article and minimizing the amount of noise.
A couple of additional notes:
Scraping the HTML is simply the first attempt I tried. I'm not sold that this is the best way to do things.
I don't want to write a parser for every website I come across, I need the unpredictability of accepting whatever Google provides through the RSS feed.
I know whatever algorithm I end up with is not going to be perfect, but I'm interested in the best possible solution.
Any ideas?
As long as you've accepted the fact that, given your requirements, whatever you try is going to be very sketchy, I'd recommend you look into Bayesian filtering. This technique has proven very effective at filtering spam out of email.
When reading news outside of my RSS reader, I often use Readability to filter out everything but the meat of the article. It is JavaScript-based, so the technique would not directly apply to your problem, but the algorithm has a high success rate in my experience and is worth a look. Hope this helps.
Take a look at templatemaker (Google code homepage). The basic idea is that you request a few different pages from the same site, then mark down what elements are common across the set of pages. From there you can figure out where the dynamic content is.
Try running diff on two pages from the same site to get an idea of how it works. The parts of the page that are different are the places where there is dynamic (interesting) content.
Here's what I would do, after checking the robots.txt file to make sure it's fine to scrape the article and parsing the document into an XML tree:
Make sure the article is not broken into many pages. If it is, 'print view', 'single page' or 'mobile view' links may help to bring it onto a single page. Of course, don't bother if you only want the beginning of the article.
Find the main content frame. To do that, I would count the amount of text in every tag. What we're looking for is a node that is big but consists of many small subnodes (a sketch of this step follows the list).
Next I would try to filter out any noise inside the content frame. The websites I read don't put much junk there, only useful images, but you do need to remove anything with inline JavaScript and any external links.
Optionally, flatten that into plain text (that is, walk the tree and write out all the elements; block elements start a new paragraph).
Guess the header. It's usually something in an h1 or h2, or at least something with a big font size, but you can simplify life by assuming that it somehow resembles the page title.
Finally, find the authors (something with names and an email address), the copyright notice (try the metadata or the word "copyright") and the site name. Assemble these somewhere together with the link to the original, and state clearly that it's probably fair use (or whatever legal doctrine you feel applies to you).
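Here is a hedged C# sketch of the "find the main content frame" step, using HtmlAgilityPack as an assumed dependency; the scoring heuristic is deliberately crude and only meant to illustrate the idea:
using System.Linq;
using HtmlAgilityPack;

class ContentFinder
{
    // Returns the element that most likely wraps the article body.
    public static HtmlNode FindMainContent(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants()
            .Where(n => n.Name == "div" || n.Name == "td" || n.Name == "article")
            .OrderByDescending(Score)
            .FirstOrDefault();
    }

    // Crude score: visible text length minus a penalty for link text,
    // since navigation and sidebars tend to be link-heavy.
    private static int Score(HtmlNode node)
    {
        int textLength = node.InnerText.Length;
        int linkTextLength = node.Descendants("a").Sum(a => a.InnerText.Length);
        return textLength - 3 * linkTextLength;
    }
}
The factor of 3 on the link penalty is an arbitrary weight; tune it against the sites you actually scrape.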
There is an almost perfect tool for this job, Boilerpipe.
In fact it has its own tag here, boilerpipe, though it's little used. Here's the description straight from the tag wiki:
The boilerpipe library for Java provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
The source is all there in the project if you just want to learn the algorithms and techniques, but in fact somebody has already ported it to C#, which is quite possibly perfect for your needs: NBoilerpipe.
BTE (Body Text Extraction) is a Python module that finds the portion of a document with the highest ratio of text to tags on a page.
http://www.aidanf.net/archive/software/bte-body-text-extraction
It's a nice, simple way of getting real text out of a website.
Here's my (probably naive) plan for how to approach this:
Assuming the RSS feed contains the opening words of the article, you could use these to locate the start of the article in the DOM. Walk back up the DOM a little (first parent DIV? first non-inline container element?) and snip. That should be the article.
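A hedged sketch of that idea using HtmlAgilityPack (an assumption; any DOM parser would do), matching the opening words and then climbing to the nearest div:
using System.Linq;
using HtmlAgilityPack;

class ArticleLocator
{
    // Finds the element that most likely wraps the article body, given the
    // opening words taken from the RSS item.
    public static HtmlNode FindArticleContainer(string html, string openingWords)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // the text node where the article body starts
        HtmlNode start = doc.DocumentNode
            .Descendants()
            .FirstOrDefault(n => n.NodeType == HtmlNodeType.Text &&
                                 n.InnerText.Contains(openingWords));
        if (start == null)
            return null;

        // walk back up to the first enclosing <div>; that should be the article
        HtmlNode node = start.ParentNode;
        while (node != null && node.Name != "div")
            node = node.ParentNode;
        return node;
    }
}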
Assuming you can get the document as XML (HtmlAgilityPack can help here), you could (for instance) grab all the descendant text from <p> elements with the following LINQ to XML:
// document is assumed to be an XDocument of the (XHTML) page
var articleText = document
    .Descendants(XName.Get("p", "http://www.w3.org/1999/xhtml"))
    .Select(p => p
        .DescendantNodes()
        .Where(n => n.NodeType == XmlNodeType.Text)   // keep only the text nodes
        .Select(t => t.ToString()))
    .Where(c => c.Any())                              // skip empty paragraphs
    .Select(c => c.Aggregate((a, b) => a + b))        // join the text within each <p>
    .Aggregate((a, b) => a + "\r\n\r\n" + b);         // blank line between paragraphs
We successfully used this formula for scraping, but it seems like the terrain you have to cross is considerably more inhospitable.
Obviously not a whole solution, but instead of trying to find the relevant content, it might be easier to disqualify non-relevant content. You could classify certain types of noise and work on smaller solutions that eliminate each of them. You could have advertisement filters, navigation filters, etc.
I think that the larger question is do you need to have one solution work on a wide range of content, or are you willing to create a framework that you can extend and implement on a site by site basis? On top of that, how often are you expecting change to the underlying data sources (i.e. volatility)?
You might want to look at Latent Dirichlet Allocation, which is an IR technique for generating topics from the text data you have. This should help you reduce the noise and get some precise information about what the page is about.
