I found that, after some modification, I can use Html Agility Pack for parsing websites in WP7. I can use LINQ but not XPath. I want to ask if there is some other (maybe better) way to parse websites in WP7, and if there is a tutorial for it. Thanks
Scraping websites is generally a bad idea, unless you control them, as they can change their structure faster than you can update your app.
If you are scraping your own site, you'd be better off building an API that exposes the data in a structured way which better meets your application's requirements.
If you really must do this, then HtmlAgilityPack is the best solution currently available.
If you really must do this, then you'll give your users a faster, better experience by building your own web service which acts as a proxy between your app and the other site.
The advantages of this are:
- needing to connect to the website less often (probably - assuming you can cache the parsed page)
- faster parsing of the site/page/data
- a faster app (as it has to do less processing)
- less data needing to be sent to the app
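For the HtmlAgilityPack route mentioned above, here is a minimal LINQ-only sketch of the kind of parsing you can do on WP7 (no XPath). The URL and the "news" class name are made up for illustration; swap in whatever the page you're scraping actually uses.

```csharp
using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

public class NewsScraper
{
    public void LoadLinks()
    {
        // WP7/Silverlight only exposes async networking, so download first...
        var client = new WebClient();
        client.DownloadStringCompleted += (s, e) =>
        {
            if (e.Error != null) return;

            // ...then parse the HTML string with HtmlAgilityPack
            var doc = new HtmlDocument();
            doc.LoadHtml(e.Result);

            // LINQ instead of XPath: all <a> hrefs inside <div class="news">
            var links = doc.DocumentNode
                .Descendants("div")
                .Where(d => d.GetAttributeValue("class", "") == "news")
                .SelectMany(d => d.Descendants("a"))
                .Select(a => a.GetAttributeValue("href", ""));

            foreach (var href in links)
                Console.WriteLine(href);
        };
        client.DownloadStringAsync(new Uri("http://example.com/news"));
    }
}
```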
I'd like to create a web application that allows users to work with graphs. (Retrieve data related to nodes, create new ones, drag them, etc.) I thought it would be a good idea to store the data in a graph database (e.g. neo4j) and display it with some JS framework (e.g. http://cytoscape.github.io/cytoscape.js/).
Currently I'm not sure which web application technology I should use. Since one requirement is to use Microsoft technologies wherever possible, I thought it might be a good idea to go with ASP.NET in C#. However, in the first chapter of my ASP.NET book the following is mentioned:
it’s worth noting that ASP.NET is not the best platform for writing complex, app-like client-side programs
So, which technology should I use to create my web application? Any recommendations?
Well, from my experience, I think one of the JVM-based languages like Java is a safe bet, if not the sexiest one. It works best with Neo4j, and Java 8 is really nice syntax-wise.
For JS-based frameworks, try Node.js and the Neo4j REST API; that should work well, too.
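For what it's worth, the Neo4j REST API can also be called from a C#/ASP.NET backend if the Microsoft-stack requirement wins out. A rough sketch, assuming a Neo4j 2.x server exposing the transactional Cypher endpoint at its default location (URL, credentials, and query are placeholders):

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class Neo4jRestSketch
{
    public static async Task QueryAsync()
    {
        using (var http = new HttpClient())
        {
            // One Cypher statement wrapped in the transactional endpoint's JSON format
            var payload = "{\"statements\":[{\"statement\":\"MATCH (n) RETURN n LIMIT 5\"}]}";
            var content = new StringContent(payload, Encoding.UTF8, "application/json");

            // Placeholder URL; add authentication if your server requires it
            var response = await http.PostAsync(
                "http://localhost:7474/db/data/transaction/commit", content);

            Console.WriteLine(await response.Content.ReadAsStringAsync());
        }
    }
}
```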
I'm creating a site that is running on a shared server and I need to find a good search engine. What search engine framework should I use to fit my requirements?
Here are some requirements
~100,000 documents need to be indexed
Shared Server (but can run ASP.Net and php apps)
Need to be able to restrict search results to specific tags, categories
Need to be able to sort by relevance + popularity, or relevance + date
A search is performed on every page load (although I might implement caching). The way it works is kind of like Stack Overflow: I have a main document, and then suggestions for related documents are loaded on the right. This occurs on every page.
The software is free and has very little budget for any type of hosted search solution (at this time, anyway)
Here are my thoughts
Zend Lucene search - performance is not good enough for such a large site
Google Custom Search - the number of sites/queries is limited
Solr, Sphinx, Java Lucene - I'm on a shared server, so I cannot install these
Lucene.Net - I'm not sure if this is possible. My hosting company allows me to run PHP and ASP.NET websites, but perhaps Lucene.Net has to run as a separate process?
MySQL full-text search - I am not aware of its performance for large sites like the one I have described
This seems like a tough bill to satisfy but I'm hoping I don't need to come up with an alternative design.
For this kind of feature set, and such a large number of documents, I would absolutely not go with MySQL's fulltext: I would definitely use some external indexing/searching solution (like Solr, Lucene, ...).
Like you said:
You have too many documents for Zend Lucene (pure PHP implementation).
MySQL fulltext -- ergh, not that powerful, slow, ...
Solr/Sphinx require you to install them
Not sure about Lucene.NET, but with that volume of data, can you really not get your own server, so you can install what you need to work properly?
And that's especially true if search is an important part of your application (it seems it is).
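On the Lucene.NET question specifically: it runs in-process inside an ASP.NET application, so no separate server process is needed, only write access to a folder for the index. A minimal sketch, assuming Lucene.Net 3.0.x (member names differ slightly between versions), with a placeholder index path and field names:

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

public static class LuceneNetSketch
{
    public static void IndexAndSearch()
    {
        var dir = FSDirectory.Open(new DirectoryInfo("App_Data/index")); // any writable folder
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        // Index one document with a category field for filtering
        using (var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var doc = new Document();
            doc.Add(new Field("title", "Sample document", Field.Store.YES, Field.Index.ANALYZED));
            doc.Add(new Field("category", "cars", Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.AddDocument(doc);
        }

        // Search the title field, restricted to one category
        using (var searcher = new IndexSearcher(dir, true))
        {
            var parser = new QueryParser(Version.LUCENE_30, "title", analyzer);
            var query = new BooleanQuery();
            query.Add(parser.Parse("sample"), Occur.MUST);
            query.Add(new TermQuery(new Term("category", "cars")), Occur.MUST);

            foreach (var hit in searcher.Search(query, 10).ScoreDocs)
                Console.WriteLine(searcher.Doc(hit.Doc).Get("title"));
        }
    }
}
```

Whether it performs well enough on a shared host for a search on every page load is another matter; caching the "related documents" results, as you mentioned, would help a lot.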
If I am not wrong, you are using WordPress. Will you be able to install MongoDB and the php-mongo extension on your server? If yes, then MongoDB full-text search with MongoLantern can be an efficient plugin for you. It can also be installed with WordPress and will override the WordPress search with MongoDB full-text search.
I have used it in a few of my projects and it seemed to work quite well. You can get the MongoLantern WP plugin from here: http://wordpress.org/extend/plugins/mongolantern/
I want to crawl through, let's say, other companies' websites, like ones for cars, and extract read-only information into my local database. Then I want to be able to display this collected information on my website. Purely from a technology perspective, is there a .NET tool, program, etc. already out there that is generic enough for my purpose, or do I have to write it from scratch?
To do it effectively, I may need a WCF job that just mines data on a constant basis and refreshes the database, which then provides data to the website.
Also, is there a way to mask my calls to those websites? Would I create a "traffic burden" for my target websites? Would it impact their functionality if I am just harmlessly crawling them?
How do I make my request look "human" instead of coming from Crawler?
Are there code examples out there on how to use a library that parses the DOM tree?
Can I send request to a specific site and get a response in terms of DOM with WebBrowser control?
Use HtmlAgilityPack to parse the HTML. Then use a Windows Service (not WCF) to run the long-running process.
I don't know about how you'd affect a target site, but one nifty way to generate human-looking traffic is the WinForms browser control. I've used it a couple of times to grab things from Wikipedia, because my normal approach of using HttpWebRequest to perform HTTP GETs tripped a non-human filter there and I got blocked.
As far as affecting the target site, it totally depends on the site. If you crawl Stack Overflow enough times, fast enough, they'll ban your IP. If you do the same to Google, they'll start asking you to answer captchas. Most sites have rate limiters, so you can only make a request so often.
As far as scraping the data out of the page: never use regular expressions; it's been said over and over. You should use either a library that parses the DOM tree, or roll your own if you want. In a previous startup of mine, the way we approached the issue was to write an intermediary template language that would tell our scraper where the data was on the page, so that we knew what data, and what type of data, we were extracting. The hard part, you'll find, is constantly changing and varying data. Once you have the parser working, it takes constant work to keep it working, even on the same site.
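To make the earlier points concrete, here is a rough sketch of a throttled crawl with a browser-like User-Agent, using HttpClient plus HtmlAgilityPack (which does support XPath in the full .NET Framework). The URLs, the delay, and the "listing-title" class are placeholders.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class PoliteCrawler
{
    public static async Task CrawlAsync()
    {
        var urls = new[] { "http://example.com/cars?page=1", "http://example.com/cars?page=2" };

        using (var http = new HttpClient())
        {
            // The default .NET User-Agent is an easy giveaway; send a browser-like one instead
            http.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36");

            foreach (var url in urls)
            {
                var html = await http.GetStringAsync(url);

                var doc = new HtmlDocument();
                doc.LoadHtml(html);

                // Placeholder XPath; adjust to the markup of the site you're scraping
                var titles = doc.DocumentNode.SelectNodes("//h2[@class='listing-title']");
                if (titles != null)
                    foreach (var t in titles)
                        Console.WriteLine(t.InnerText.Trim());

                // Throttle so the target site isn't hammered
                await Task.Delay(TimeSpan.FromSeconds(5));
            }
        }
    }
}
```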
I use a fantastically flexible tool, Visual Web Ripper. It outputs to Excel, SQL, and text, and takes input from the same.
There is no generic tool which will extract the data from the web for you; this is not a trivial operation. In general, crawling the pages is not that difficult, but stripping/extracting the content you need is, and that operation will have to be customized for every website.
We use professional tools dedicated to this; they are designed to feed the crawler with instructions about which areas within the web page contain the data you need to extract.
I have also seen Perl scripts designed to extract data from specific web pages. They can be highly effective, depending on the site you parse.
If you hit a site too frequently, you will be banned (At least temporarily).
To mask your IP you can try http://proxify.com/
I've made a little game as a web application in Silverlight using C#, and I simply would like to save the top ten scores of any of the users that go on it.
How can I write to a file and save it on my web hosting area? Is this possible?
I think this would be the best way, because I only need to store a name and score (csv file), and this would be extremely easy. I hope this is possible.
If not, could someone point me in the right direction for doing this with a database? I've created a template just in case, using MySQL with the features provided by my web hosts. Is there any easy way to do it that way?
Thanks in advance,
Lloyd
You can add a small WCF service to your website with an ISaveScores interface. The SL app can connect to the WCF service to post scores, and the WCF service can then store the data however you want. If you use a csv file, make sure you handle locking properly, since it is very possible for multiple requests to happen simultaneously.
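A rough sketch of what that contract and the locking could look like (the ISaveScores name comes from the suggestion above; the file path is a placeholder):

```csharp
using System;
using System.IO;
using System.ServiceModel;

[ServiceContract]
public interface ISaveScores
{
    [OperationContract]
    void SaveScore(string name, int score);
}

public class SaveScoresService : ISaveScores
{
    // Single lock object so concurrent requests don't interleave writes to the CSV
    private static readonly object CsvLock = new object();

    public void SaveScore(string name, int score)
    {
        lock (CsvLock)
        {
            // Append one "name,score" line; strip commas from the name to keep the CSV valid
            File.AppendAllText(@"App_Data\scores.csv",
                string.Format("{0},{1}{2}", name.Replace(",", " "), score, Environment.NewLine));
        }
    }
}
```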
EDIT
Since the host is Linux, just create yourself a rest service or some other service that silverlight can post to in the same way. Silverlight can talk to pretty much any type of service, so use the same technique in your environment.
You could do it with a service as Brian suggested (although it sounds like you might not have Windows hosting, so you may not be able to use WCF for it), which is probably the best way -- but if you wanted a simpler solution, you could also do it with just a postback to a particular page set up for the purpose.
Write a quick PHP page that looks for a name and score in the POST data and writes them to your MySQL database. Call it from your SL app with a web request. Then you just need another simple page to query the DB and list the results.
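The Silverlight side of that could look roughly like the sketch below; save_score.php and the domain are made-up names for the page described above.

```csharp
using System;
using System.Net;

public static class ScoreClient
{
    public static void SubmitScore(string name, int score)
    {
        // Silverlight networking is async-only, so handle the result in the completed event
        var client = new WebClient();
        client.Headers[HttpRequestHeader.ContentType] = "application/x-www-form-urlencoded";
        client.UploadStringCompleted += (s, e) =>
        {
            if (e.Error != null)
            {
                // TODO: surface the failure to the player
            }
        };

        string body = "name=" + Uri.EscapeDataString(name) + "&score=" + score;

        // Placeholder URL of the PHP page that inserts the row into MySQL
        client.UploadStringAsync(new Uri("http://www.example.com/save_score.php"), "POST", body);
    }
}
```

If that page lives on a different domain from the one hosting the Silverlight app, you'll also need a clientaccesspolicy.xml (or crossdomain.xml) file on that server so Silverlight is allowed to call it.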
I have an ASP.NET 3.5 application hosted on IIS 7.0. I'm looking for a comprehensive system to monitor traffic, down to page level at minimum. Does .NET have any specific tools, is it better to write my own, or what systems/software is freely available to use?
Thanks
Use Google Analytics. It's a small piece of JavaScript code that you insert into each page you want to track. It's based on the Urchin analytics tracking software, which Google bought; they've been doing this for a long, long time.
As long as your site is referenced using a fully qualified domain name, Google Analytics can track what you need. It's got lots of flexibility with the filter mechanism as well (lets you rewrite URLs based on query string parameters, etc.).
LOTS of functionality and well thought out, as well as a pretty good API if you need to do tracking on things other than clicks.
If you have access to the IIS logs, you can use a log analyzer to interpret the data. An example is the free AWStats analyzer:
http://awstats.sourceforge.net/
An alternative (and one I recommend) is Google Analytics (http://www.google.com/analytics). This relies on you embedding a small chunk of Javascript in each page you want tracking, then Google does the grunt work for you, presenting the results in an attractive Flash-rich site.
I'd suggest trying both and seeing which suits your needs. I'd definitely recommend against rolling your own system, as the above solutions are very mature and capable. Best of luck!
You'll need a client-side/JavaScript tracking service (such as Google Analytics, but there are other good free alternatives out there) because it runs even when the user clicks the back button and the previous page (on your site) is loaded from the browser cache rather than from the server. IIS won't "see" the reload since no request is made to it.