regular expression: mine text data from other websites - c#

I want to crawl through, let's say, other companies' websites (car listings, for example) and extract read-only information into my local database. Then I want to be able to display this collected information on my website. Purely from a technology perspective, is there a .NET tool, program, etc. already out there that is generic enough for my purpose, or do I have to write it from scratch?
To do it effectively, I may need a WCF job that just mines data on a constant basis and refreshes the database, which then provides data to the website.
Also, is there a way to mask my calls to those websites? Would I create a "traffic burden" for my target websites? Would it impact their functionality if I am just harmlessly crawling them?
How do I make my requests look "human" instead of coming from a crawler?
Are there code examples out there on how to use a library that parses the DOM tree?
Can I send a request to a specific site and get a response as a DOM tree with the WebBrowser control?

Use HtmlAgilityPack to parse the HTML. Then use a Windows Service (not WCF) to run the long-running process.
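For anyone looking for a starting point, here is a minimal sketch of loading a page with HtmlAgilityPack and pulling nodes out with XPath. The URL and the XPath expression are placeholders to swap for the target site's actual markup:

    // Minimal HtmlAgilityPack sketch; URL and selector are placeholders.
    using System;
    using HtmlAgilityPack;

    class ScraperSketch
    {
        static void Main()
        {
            var web = new HtmlWeb();
            HtmlDocument doc = web.Load("http://example.com/cars"); // hypothetical URL

            // SelectNodes returns null when nothing matches the XPath.
            var listings = doc.DocumentNode.SelectNodes("//div[@class='listing']");
            if (listings == null) return;

            foreach (HtmlNode node in listings)
            {
                Console.WriteLine(node.InnerText.Trim());
            }
        }
    }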

I don't know how you'd affect a target site, but one nifty way to generate human-looking traffic is the WinForms WebBrowser control. I've used it a couple of times to grab things from Wikipedia, because my normal approach of using HttpWebRequest to perform the HTTP GET tripped a non-human filter there and I got blocked.
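If it helps, here is a rough sketch of driving the WinForms WebBrowser control from a console app so the request is made by a real browser engine. The URL is just an example, and a production crawler would need timeouts and error handling:

    // Fetch a page through the WebBrowser control; must run on an STA thread.
    using System;
    using System.Windows.Forms;

    class BrowserFetch
    {
        [STAThread]
        static void Main()
        {
            var browser = new WebBrowser { ScriptErrorsSuppressed = true };
            browser.DocumentCompleted += (s, e) =>
            {
                Console.WriteLine(browser.DocumentText); // full HTML after the page loads
                Application.ExitThread();                // stop the message loop
            };
            browser.Navigate("http://en.wikipedia.org/wiki/Main_Page"); // example URL
            Application.Run(); // pump messages until DocumentCompleted fires
        }
    }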

As far as affecting the target site, it totally depends on the site. If you crawl Stack Overflow enough times, fast enough, they'll ban your IP. If you do the same to Google, they'll start asking you to answer CAPTCHAs. Most sites have rate limiters, so you can only make a request so often.
As far as scraping the data out of the page: never use regular expressions; it's been said over and over. You should either use a library that parses the DOM tree or roll your own if you want. At a previous startup of mine, the way we approached the issue was to write an intermediary template language that told our scraper where the data was on the page, so that we knew what data, and what type of data, we were extracting (see the sketch below). The hard part, you'll find, is constantly changing and varying data. Once you have the parser working, it takes constant work to keep it working, even on the same site.
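To illustrate the template idea (not the actual language we wrote, just a stripped-down sketch using HtmlAgilityPack), you can keep a map of field names to selectors outside the scraping code, so only the template has to change when the page does. The field names and XPath expressions below are made up:

    // Hypothetical "template": field name -> XPath selector.
    using System.Collections.Generic;
    using HtmlAgilityPack;

    class TemplateScraper
    {
        static readonly Dictionary<string, string> Template = new Dictionary<string, string>
        {
            { "Make",  "//span[@class='make']"  },
            { "Model", "//span[@class='model']" },
            { "Price", "//span[@class='price']" },
        };

        // Apply the template to a parsed page and return the extracted values.
        static Dictionary<string, string> Extract(HtmlDocument doc)
        {
            var result = new Dictionary<string, string>();
            foreach (var field in Template)
            {
                HtmlNode node = doc.DocumentNode.SelectSingleNode(field.Value);
                result[field.Key] = node != null ? node.InnerText.Trim() : null;
            }
            return result;
        }
    }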

I use a fantastically flexible tool, Visual Web Ripper. It can output to Excel, SQL, or text, and take input from the same.

There is no generic tool that will extract the data from the web for you. This is not a trivial operation. In general, crawling the pages is not that difficult, but stripping/extracting the content you need is; that part has to be customized for every website.
We use professional tools dedicated to this; they are designed to feed the crawler with instructions about which areas within the web page contain the data you need to extract.
I have also seen Perl scripts designed to extract data from specific web pages. They can be highly effective, depending on the site you parse.
If you hit a site too frequently, you will be banned (at least temporarily).
To mask your IP you can try http://proxify.com/

Related

Design questions for RSS generating application

I've been tasked with creating a small .NET application to serve content from a DB through an RSS feed. The content will be updated from the DB at a fixed interval (say every 30 s or so). This will be my first time working with RSS, and I have somewhat limited web application skills. However, I'm pretty good with DBs and the DA layer, so I'm not exactly starting from scratch.
My questions are:
I want to decouple the content-updating process from the request-servicing process. Am I better off writing an independent Windows service to handle the DB-related content retrieval and XML transformation, or would using a background process in a web application be fine?
a. If the answer is a dedicated Windows service, will thread blocking be an issue as the service tries to update a page at the same time the page is being served?
b. If the answer is a background process, is there a way to share a collection or some type of in-memory object between the background process and the main application, so that on a client request the XML is generated in real time from objects in a collection?
Is a SOAP/REST web service a strong option for content delivery, or am I better off with a full web application with rss.aspx?
For transforming the content to XML, should I use the SyndicationFeed class or some form of XML template with substitution? There is a very limited number of fields (4-8) that will be updated routinely, so the XML will be relatively tiny.
Sorry if I seem all over the place on this. I'm just trying to think through a robust solution that's extensible and well designed. Thanks in advance, and please know I appreciate any thoughts/ideas on this project.
I have some experience building RSS systems, so let me try to answer your questions.
If by decoupling you mean "generating the XML files asynchronously", then it depends on how many different feeds you have. Based on what you describe, you'll be serving feeds based on queries to a database. If those queries have parameters, then you have as many different feeds as you have possible queries, and generating them offline will not work. Generally, I think most people generate feeds 'on the fly' as the requests come in.
I'm not familiar with rss.aspx, so I can't help much :)
The benefit of using your own XML templates is that you'll be able to extend (the X in XML!) your schema with other namespaces at some point, should you need to do that in the future.
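For question 3, here is a minimal sketch of the SyndicationFeed route using System.ServiceModel.Syndication; the titles, URLs and item fields are placeholders for whatever your DB query returns:

    // Build a feed object and write it out as RSS 2.0.
    using System;
    using System.ServiceModel.Syndication;
    using System.Xml;

    class FeedSketch
    {
        static void Main()
        {
            var feed = new SyndicationFeed(
                "Example feed",                       // placeholder title
                "Content pulled from the database",   // placeholder description
                new Uri("http://example.com/feed"));  // placeholder link

            feed.Items = new[]
            {
                new SyndicationItem("Item title", "Item summary",
                    new Uri("http://example.com/item/1"))
            };

            using (var writer = XmlWriter.Create(Console.Out))
            {
                new Rss20FeedFormatter(feed).WriteTo(writer);
            }
        }
    }

Swapping Rss20FeedFormatter for Atom10FeedFormatter is all it takes to emit Atom instead, which is one argument for SyndicationFeed over hand-rolled templates.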

How to load XML from external site periodically?

On a personal project I'm working on, I have a requirement to periodically save (to disk) an XML feed from an external site, and then parse the XML and render the contents in a particular format. Parsing the XML and rendering it is no problem - the confusion comes in finding the appropriate way to poll the external site/URL and store the XML periodically.
I have done a fair amount of research, but I've ended up even more stumped. My initial thought was to create a service that polls the external site and retrieves and stores the XML at prescribed intervals. I've never created a service before, so a) I'm not really sure where to start, and b) I'll be hosting the site through a hosting provider and I'm not sure this is a viable option.
The SO thread "writing a service to periodically retrieve XML and send SMS" seems to do exactly what I need, but I don't entirely understand the proposed solution.
I also found an article on delivering data across domains using an AJAX proxy, but this seems like overkill for what I need.
Does anyone have any recommendations on how to achieve this?
Read this, and when you're finished, I would suggest you read the XML via an HttpWebRequest instead of trying to download it. I assume you'll be able to do this and write the result to a file? If not, I can expand my answer a bit.
You'll definitely want to create a Windows service, as their sole purpose is to keep running in the background and periodically do stuff.
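As a starting point, a bare-bones sketch of such a service might look like the following. The URL, file path and interval are placeholders, and a real service would also need an installer and proper error handling:

    // Windows service that polls an external URL on a timer and saves the XML.
    using System.IO;
    using System.Net;
    using System.ServiceProcess;
    using System.Timers;

    public class XmlPollingService : ServiceBase
    {
        private readonly Timer _timer = new Timer(30 * 60 * 1000); // assumed 30-minute interval

        protected override void OnStart(string[] args)
        {
            _timer.Elapsed += (s, e) => FetchFeed();
            _timer.Start();
        }

        protected override void OnStop()
        {
            _timer.Stop();
        }

        private static void FetchFeed()
        {
            var request = (HttpWebRequest)WebRequest.Create("http://example.com/feed.xml"); // placeholder URL
            using (var response = request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                File.WriteAllText(@"C:\data\feed.xml", reader.ReadToEnd()); // placeholder path
            }
        }

        public static void Main()
        {
            ServiceBase.Run(new XmlPollingService());
        }
    }

Note that many shared hosting providers won't let you install a Windows service, so if that's your situation a scheduled task running on a separate machine may be the more viable route.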

How To Measure The Average Response Time On A SharePoint Site Collection Programmatically In C#

I am trying to create a widget for SharePoint that will show the average response time of a site collection. I have looked through the API and haven't been able to find anything. Does anyone know of an API call, at either the SPWeb or SPSite level, that will give me the average response time?
I believe that you could accomplish your purpose using some combination of the following two links.
SharePoint Web Analytics does not quite have the information you are looking for, but it does come with quite a lot of information about how your site is being used. This would allow you (even without looking at average response times) to speed up only the pages that are used most often; speeding up those pages would usually have the most dramatic effect on the average speed anyway.
Log Parser and the IIS logs will allow you to generate some reports on response time for the site. However, I am not very familiar with Log Parser, and I don't believe this is something you'd be able to do in real time.
The top answer shows you how to create the graphs, and the second answer shows how to create them on the fly, but not without some effort on your part.
Using a combination of those two answers, you might be able to generate nightly reports (or hourly, whatever your SLA is) and upload them to SharePoint.
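If you'd rather do the log crunching in C# than in Log Parser, a rough sketch might look like this. It assumes W3C-format IIS logs with the time-taken field enabled, and the column positions are guesses that depend on your logging configuration:

    // Average time-taken (ms) per URL from a W3C-format IIS log file.
    using System;
    using System.IO;
    using System.Linq;

    class IisLogAverages
    {
        static void Main()
        {
            var fields = File.ReadLines(@"C:\inetpub\logs\LogFiles\W3SVC1\u_ex120101.log") // placeholder path
                .Where(line => !line.StartsWith("#"))   // skip the #Fields/#Date header lines
                .Select(line => line.Split(' '));

            var averages = fields
                .Select(f => new { Url = f[4], TimeTaken = int.Parse(f[f.Length - 1]) }) // assumed columns
                .GroupBy(x => x.Url)
                .Select(g => new { Url = g.Key, AvgMs = g.Average(x => x.TimeTaken) });

            foreach (var row in averages.OrderByDescending(r => r.AvgMs))
                Console.WriteLine("{0}\t{1:F0} ms", row.Url, row.AvgMs);
        }
    }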

Parse website in Windows Phone 7

I found that after some modification I can use Html Agility Pack for parsing websites in WP7. I can use LINQ but not XPath. I want to ask if there is some other (maybe better) way to parse websites in WP7, and if there is a tutorial for it. Thanks.
Scraping websites is generally a bad idea unless you control them, as they can change their structure faster than you can update your app.
If you are scraping your own site, you'd be better off building an API to expose the data in a structured way that better meets your application's requirements.
If you really must do this, then HtmlAgilityPack is the best solution currently available.
If you really must do this, then you'll give your users a faster, better experience by building your own web service which acts as a proxy between your app and the other site.
The advantages of this are:
- needing to connect to the website less often (probably - assuming you can cache the parsed page)
- faster parsing of the site/page/data
- a faster app (as it has to do less processing)
- less data needing to be sent to the app
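For completeness, here is a small sketch of the LINQ style of querying HtmlAgilityPack (since XPath isn't available on WP7), with the HTML downloaded via WebClient. The URL and the elements being selected are just examples:

    // LINQ over the HtmlAgilityPack node tree instead of SelectNodes/XPath.
    using System;
    using System.Diagnostics;
    using System.Linq;
    using System.Net;
    using HtmlAgilityPack;

    public class Wp7Scraper
    {
        public void LoadLinks()
        {
            var client = new WebClient();
            client.DownloadStringCompleted += (s, e) =>
            {
                if (e.Error != null) return;

                var doc = new HtmlDocument();
                doc.LoadHtml(e.Result);

                // Grab every href on the page as a simple demonstration.
                var links = doc.DocumentNode
                    .Descendants("a")
                    .Select(a => a.GetAttributeValue("href", string.Empty))
                    .Where(href => href.Length > 0);

                foreach (var href in links)
                    Debug.WriteLine(href);
            };
            client.DownloadStringAsync(new Uri("http://example.com/")); // placeholder URL
        }
    }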

What is recommended for monitoring traffic to my asp.net application

I have an ASP.NET 3.5 application hosted on IIS 7.0. I'm looking for a comprehensive system to monitor traffic, down to page level at a minimum. Does .NET have any specific tools, is it better to write my own, or what systems/software is freely available to use?
Thanks
Use Google Analytics. It's a small piece of JavaScript code that you insert into each page you want to track. It's based on the Urchin analytics tracking software, which Google bought. They've been doing this for a long, long time.
As long as your site is referenced using a fully qualified domain name, Google Analytics can track what you need. It's got lots of flexibility with the filter mechanism as well (it lets you rewrite URLs based on query string parameters, etc.).
LOTS of functionality, well thought out, and a pretty good API if you need to track things other than clicks.
If you have access to the IIS logs, you can use a log analyzer to interpret the data. An example is the free AWStats analyzer:
http://awstats.sourceforge.net/
An alternative (and one I recommend) is Google Analytics (http://www.google.com/analytics). This relies on you embedding a small chunk of JavaScript in each page you want to track; Google then does the grunt work for you, presenting the results in an attractive Flash-rich site.
I'd suggest trying both and seeing which suits your needs. I'd definitely recommend against rolling your own system, as the above solutions are very mature and capable. Best of luck!
You'll need a client-side/JavaScript tracking service (such as Google Analytics, though there are other good free alternatives out there) because it runs even when the user clicks the back button and the previous page (on your site) is loaded from the browser cache rather than from the server. IIS won't "see" the reload since no request is made to it.
