I'm currently trying to create a small scale CMS for my personal website and thought I'd like to try to make some sort of a page layout from a basic aspx file with some placeholders and load content based on the URL, without the use of url query strings such as ?pageid=1.
I'm trying to wrap my head around how this can be achieved without getting errors of a physical file not existing when I e.g. type in http://mywebsite.com/projects/w8apps/clock.
I've read a lot about BLOBs and storing files as binary data in the database, but I haven't come across a blog that points in the direction of using a so-called page layout and loading content based on the URL instead of a query string.
I'm not asking for a solution, just some hints - blogs mostly - which can point me in the right direction and help me achieve this goal.
To deal with loading a page with a URL that is more friendly, rather than ?page_id=1, you may want to have a look at this article about URL Rewriting and URL Mapping.
http://www.codeproject.com/Articles/18318/URL-Mapping-URL-Rewriting-Search-Engine-Friendly-U
Hope you can find a way of fitting this kind of code into your application!
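The idea behind URL mapping can be sketched without any framework: take the path part of the request, normalise it, and use it as the lookup key instead of a numeric page id. Below is a minimal sketch in Python (chosen only for brevity). The `PAGES` table, the `LAYOUT` string and `resolve_page` are hypothetical stand-ins for a database table, the .aspx page layout and the request handler; none of them come from the linked article.

```python
from urllib.parse import urlsplit

# Hypothetical content store: in a real CMS this would be a database table
# keyed by the page's path rather than by a numeric page id.
PAGES = {
    "projects/w8apps/clock": {"title": "Clock", "body": "Placeholder body text."},
    "": {"title": "Home", "body": "Welcome."},
}

# Hypothetical page layout with two placeholders.
LAYOUT = "<html><head><title>{title}</title></head><body>{body}</body></html>"


def resolve_page(url):
    """Return (status, html) for a friendly URL; no query string is involved."""
    path = urlsplit(url).path
    key = path.strip("/").lower()  # "/Projects/W8Apps/Clock/" -> "projects/w8apps/clock"
    page = PAGES.get(key)
    if page is None:
        return 404, LAYOUT.format(title="Not found", body="No such page.")
    return 200, LAYOUT.format(**page)
```

No physical file has to exist for `/projects/w8apps/clock`; the web server only needs to be configured to hand every such request to the one handler.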
Your question is too broad, but here are a couple of hints that will point you in the right direction.
Create clear specs before you start working on this. Do you really need to have URLs like http://mywebsite.com/projects/w8apps/clock ? If yes, then check out MVC, since it has the best support for this.
Storing binary files in the database doesn't have much to do with this. You first need to think about what your tables will look like, and that depends on what you are trying to achieve.
I'd suggest you install an open source CMS and analyze it first. You'll probably find a lot of better ideas this way. Just go to CodePlex and search for CMS.
I need to get information from a couple of web sites. For example, this site.
What would be the best way to get all the links from the page so that the information can be extracted?
Sometimes I need to click on a link to get the other links inside it.
I tried WatiN, and I tried doing the same from within Excel 2007 with the Web Data option.
Could you please suggest a better way that I am not aware of?
NCrawler might be very useful for deep crawling. You can also set MaxCrawlDepth to specify how deep it should go.
Have a look at WGet. It is an incredibly powerful tool for mining the content of a single page or an entire website. The options available allow you to dictate how many levels deep to follow in terms of links, what to do with static resources such as images, how to handle relative links, etc. It also does a very good job of mining pages which are generated dynamically, such as those served by CGI or ASP.
It's been around for many years in the 'nix world but executables compiled for Windows are readily available.
You would need to kick it off from .NET using Process.Start, but you could then pipe the results into multiple files (which mimic the original website structure), a single file, or into memory by capturing standard output. Then you can do subsequent analysis such as extracting the HREF attributes of anchor elements (if it is only links you are interested in) or grabbing the sort of table data evident in the link you provide in your question.
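As a rough illustration of that workflow, here is a sketch in Python, where `subprocess.run` plays the role of Process.Start. The helper names (`build_wget_command`, `fetch_to_memory`, `extract_links`) are made up for this example; the wget options used (`--recursive`, `--level`, `--convert-links`, `--no-parent`, `-q`, `-O -`) are standard ones, but check them against the build you install.

```python
import subprocess
from html.parser import HTMLParser


def build_wget_command(url, depth=2):
    """Build a wget invocation that mirrors `url`, following links `depth` levels deep."""
    return [
        "wget",
        "--recursive",       # follow links found in the pages
        f"--level={depth}",  # how many levels deep to follow
        "--convert-links",   # rewrite links so the local copy is browsable
        "--no-parent",       # never climb above the starting directory
        url,
    ]


def fetch_to_memory(url):
    """Fetch one page through wget and capture it from standard output."""
    # -q: quiet, -O -: write the document to standard output instead of a file
    result = subprocess.run(["wget", "-q", "-O", "-", url],
                            capture_output=True, text=True, check=True)
    return result.stdout


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links
```

Typical use would be `extract_links(fetch_to_memory(url))` for a single page, or running the command from `build_wget_command` and then walking the mirrored files.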
I realise this is not a 'pure' .NET solution but the power WGET offers more than compensates for this, in my opinion. I have used it myself in the past, in this way, for exactly the sort of thing I think you are trying to do.
I recommend using http://watin.org/. It is much simpler than wget :-)
Does anyone know of any open source code for contextualization in JS (JavaScript) or ASP.NET? That is, contextualization of content: determining "what" a piece of content is.
It's an interesting area, and I can't seem to find any previous projects on it.
I'd really appreciate any help.
Presumably you are looking to build something like a search engine that can find a relevant document in a sea of nondescript documents which do not contain any metadata, only their textual content.
Computers are notoriously bad at this kind of categorization, for the same reasons that they can identify spelling, but not grammar errors. It's a pattern matching problem that relies on human context to determine the correct solution.
Google is good at this because it relies on human behaviors to create relevance (like how many links from other sites a page has).
The closest thing I can think of that will do what you want (without actually attaching genuine metadata to each document by hand) is full text search. The Wikipedia article has several references to software that does this.
Depending on what you want to do, it may be easier to mine your page for context after the content has been rendered. That way you can be sure you have the same context the user has when viewing the page. Here is a post about a jQuery plugin that highlights target words on an HTML page.
Here are some other plugins you might want to review:
quickSearch plugin
QuickSilver Search plugin
I'm attempting to issue a 301 redirect to http://www.mysite.com/ when a user requests http://www.mysite.com/Default.aspx.
The issue I'm having is that every property I can find within Request (Request.Url, Request.RawUrl, etc) is identical for those two requests.
Edit for further clarification:
This is on a shared web host, I can't install ISAPI extensions.
One more edit. Apparently the first tech support guy I talked to at the host didn't know what he was talking about; they do have ISAPI Rewrite installed.
Steve Wortham's Blog describes how he addressed the same issue.
He used Ionic's free Isapi Rewrite Filter
Just use a url rewriting ISAPI filter. It's painless. This page on Scott Gu's blog contains pretty much all the info you need.
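Whatever filter you pick, the rule itself is small. The sketch below shows the decision logic in Python; `canonical_redirect` and the pattern are illustrative only, not the syntax of any particular rewrite filter. It has to run somewhere that still sees the originally requested path, which, as the question notes, is not the case for the Request properties inside the page.

```python
import re

# Matches a path that ends in "default.aspx" (any casing), at the site root
# or inside a folder, and captures the folder part.
DEFAULT_DOC = re.compile(r"^(?P<folder>/(?:[^/?#]+/)*)default\.aspx$", re.IGNORECASE)


def canonical_redirect(raw_path):
    """Return (301, location) if the path names the default document, else None."""
    match = DEFAULT_DOC.match(raw_path)
    if match:
        return 301, match.group("folder")
    return None
```

A request for `/Default.aspx` would be answered with a 301 to `/`, while a request for `/` passes through untouched, so there is no redirect loop.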
Whilst a lot of people cite SEO as the reason for wanting to do this, I'd just like to make a minor observation:
Search for the word "Welcome" on my site
This word appears on the homepage (default.aspx), and in a blog post. At the moment I'm seeing three results for this search:
The middle one is my homepage, and you'll note that it's only listed once - yet you can easily visit, and stay on /default.aspx if you browse to it (and indeed, the navigation on the site links to /Default.aspx).
If anything, it's more of an issue with reporting, as Google Analytics, for example, sees /default.aspx and /Default.aspx as different pages (but again, the search results are ignoring casing on a Windows server):
So, all I'm saying is that search engines are often brighter than SEO consultants give them credit for...
A bigger problem for my site is the fact that I have three domain names pointing at it, and I still haven't gotten around to putting a redirect in place for the other ones.
I'm looking for an algorithm (or some other technique) to read the actual content of news articles on websites and ignore anything else on the page. In a nutshell, I'm reading an RSS feed programmatically from Google News. I'm interested in scraping the actual content of the underlying articles. On my first attempt I took the URLs from the RSS feed, followed them and scraped the HTML from each page. This very clearly resulted in a lot of "noise", whether it be HTML tags, headers, navigation, etc. Basically all the information that is unrelated to the actual content of the article.
Now, I understand this is an extremely difficult problem to solve; it would theoretically involve writing a parser for every website out there. What I'm interested in is an algorithm (I'd even settle for an idea) for maximizing the actual content that I see when I download the article and minimizing the amount of noise.
A couple of additional notes:
Scraping the HTML is simply the first attempt I tried. I'm not sold that this is the best way to do things.
I don't want to write a parser for every website I come across; I need to be able to accept whatever unpredictable content Google provides through the RSS feed.
I know whatever algorithm I end up with is not going to be perfect, but I'm interested in a best possible solution.
Any ideas?
As long as you've accepted the fact that whatever you try is going to be very sketchy given your requirements, I'd recommend you look into Bayesian filtering. This technique has proven to be very effective at filtering spam out of email.
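To make the suggestion concrete, here is a minimal naive Bayes sketch in Python that labels a block of text as "content" or "noise". It assumes you have hand-labelled a few example blocks first; the `NaiveBayes` class and the two labels are invented for this example, and a real filter would need far more training data and better tokenising.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny two-class word-count classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"content": Counter(), "noise": Counter()}
        self.doc_counts = Counter()

    def train(self, label, text):
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def classify(self, text):
        vocab = set(self.word_counts["content"]) | set(self.word_counts["noise"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            # log prior, smoothed so an unseen label never gets log(0)
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            # +1 in the denominator leaves room for words never seen in training
            denom = sum(counts.values()) + len(vocab) + 1
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)
```

You would call `train("noise", ...)` on navigation, ads and footers, `train("content", ...)` on real article paragraphs, and then run `classify` on each block of a freshly downloaded page.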
When reading news outside of my RSS reader, I often use Readability to filter out everything but the meat of the article. It is JavaScript-based, so the technique would not directly apply to your problem, but the algorithm has a high success rate in my experience and is worth a look. Hope this helps.
Take a look at templatemaker (Google code homepage). The basic idea is that you request a few different pages from the same site, then mark down what elements are common across the set of pages. From there you can figure out where the dynamic content is.
Try running diff on two pages from the same site to get an idea of how it works. The parts of the page that are different are the places where there is dynamic (interesting) content.
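The diff idea can be tried in a few lines with Python's standard `difflib`. This is only a line-based sketch of the principle, not how templatemaker itself works, and `changed_lines` is a name made up for the example.

```python
import difflib


def changed_lines(page_a, page_b):
    """Return the lines of page_b that differ from page_a (the likely dynamic content)."""
    a_lines = page_a.splitlines()
    b_lines = page_b.splitlines()
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    dynamic = []
    for tag, _a1, _a2, b1, b2 in matcher.get_opcodes():
        # "equal" runs are the shared template; "delete" only concerns page_a
        if tag in ("replace", "insert"):
            dynamic.extend(b_lines[b1:b2])
    return dynamic
```

Real pages rarely break lines so conveniently, so in practice you would diff a token or DOM-node sequence rather than raw lines.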
Here's what I would do after I checked the robots.txt file to make sure it's fine to scrape the article and parsed the document as an XML tree:
Make sure the article is not broken into many pages. If it is, 'print view', 'single page' or 'mobile view' links may help to bring it onto a single page. Of course, don't bother if you only want the beginning of the article.
Find the main content frame. To do that, I would count the amount of information in every tag. What we're looking for is a node that is big but consists of many small subnodes.
Now I would try to filter out any noise inside the content frame. Well, the websites I read don't put any crap there, only useful images, but you do need to kill anything that has inline javascript and any external links.
Optionally, flatten that into plain text (that is, go into the tree and open all elements; block elements create a new paragraph).
Guess the header. It's usually something with h1, h2 or at least big font size, but you can simplify life by assuming that it somehow resembles the page title.
Finally, find the authors (something with names and email), the copyright notice (try metadata or the word copyright) and the site name. Assemble these somewhere together with the link to the original and state clearly it's probably fair use (or whatever legal doctrine you feel applies to you).
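The "find the main content frame" step above can be sketched as follows, assuming the page has already been turned into well-formed XML. One possible reading of "big but consists of many small subnodes" is used here: the element must have several children, no single child may hold more than half of its text, and among such elements the one with the most text wins. The function names and both thresholds are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def text_length(element):
    """Number of text characters under an element, ignoring surrounding whitespace."""
    return sum(len(chunk.strip()) for chunk in element.itertext())


def find_content_frame(root, min_children=3, max_share=0.5):
    """Return the largest element whose text is spread over many small children."""
    best, best_size = None, 0
    for element in root.iter():
        children = list(element)
        if len(children) < min_children:
            continue
        total = text_length(element)
        biggest_child = max(text_length(child) for child in children)
        if total == 0 or biggest_child > max_share * total:
            continue  # dominated by one child: the real frame is further down
        if total > best_size:
            best, best_size = element, total
    return best
```

Usage would be along the lines of `find_content_frame(ET.fromstring(xhtml_string))`; the remaining steps (noise removal, flattening, header and author guessing) then operate on the returned element.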
There is an almost perfect tool for this job, Boilerpipe.
In fact it has its own tag here though it's little used, boilerpipe. Here's the description right from the tag wiki:
The boilerpipe library for Java provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
The source is all there in the project if you just want to learn the algorithms and techniques, but in fact somebody has already ported it to C# which is quite possibly perfect for your needs: NBoilerpipe.
BTE (Body Text Extraction) is a Python module that finds the portion of a document with the highest ratio of text to tags on a page.
http://www.aidanf.net/archive/software/bte-body-text-extraction
It's a nice, simple way of getting real text out of a website.
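The underlying idea is easy to experiment with. The Python sketch below is not the BTE module's code, just one way to approximate it: split the page into tag tokens and word tokens, score each word +1 and each tag -1, and keep the contiguous run with the highest total (a maximum-subarray search). `body_text` and `TOKEN` are names invented for this example.

```python
import re

# A token is either a whole tag or a run of non-space characters outside tags.
TOKEN = re.compile(r"<[^>]*>|[^<\s]+")


def body_text(html):
    """Return the words of the token run that is richest in text and poorest in tags."""
    tokens = TOKEN.findall(html)
    best_sum, best_start, best_end = 0, 0, 0
    run_sum, run_start = 0, 0
    for i, token in enumerate(tokens):
        if run_sum <= 0:
            # a run that has dropped to zero or below is not worth extending
            run_sum, run_start = 0, i
        run_sum += -1 if token.startswith("<") else 1
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1
    words = [t for t in tokens[best_start:best_end] if not t.startswith("<")]
    return " ".join(words)
```

Because only one run is kept, an article interrupted by a large block of markup (an embedded widget, say) may be cut short.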
Here's my (probably naive) plan of how to approach this:
Assuming the RSS feed contains the opening words of the article, you could use these to locate the start of the article in the DOM. Walk back up the DOM a little (first parent DIV? first non-inline container element?) and snip. That should be the article.
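A sketch of that plan in Python, assuming the page is available as well-formed XML and the opening words sit directly inside one element (not split across inline tags). `find_article` and the `BLOCK_TAGS` set are assumptions made for the example; which tags count as a container is a judgment call.

```python
import xml.etree.ElementTree as ET

# Hypothetical choice of tags that count as a container worth snipping at.
BLOCK_TAGS = {"div", "article", "section", "td", "body"}


def find_article(root, opening_words):
    """Find the element holding the opening words, then walk up to its container."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    needle = " ".join(opening_words.split())  # collapse whitespace
    if not needle:
        return None
    for element in root.iter():
        own_text = " ".join((element.text or "").split())
        if needle in own_text:
            node = element
            while node in parent_of and node.tag not in BLOCK_TAGS:
                node = parent_of[node]
            return node
    return None
```

If the feed's summary has been edited or truncated mid-word, matching only the first handful of words is likely to be more robust than matching the whole summary.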
Assuming you can get the document as XML (HtmlAgilityPack can help here), you could (for instance) grab all descendant text from <p> elements with the following LINQ to XML:
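// Requires: using System.Linq; using System.Xml; using System.Xml.Linq;
// "document" is an XDocument holding the page as XHTML.
var articleText = document
    .Descendants(XName.Get("p", "http://www.w3.org/1999/xhtml"))
    // For each <p>, collect the text nodes anywhere beneath it
    .Select(p => p
        .DescendantNodes()
        .Where(n => n.NodeType == XmlNodeType.Text)
        .Select(t => t.ToString()))
    // Skip paragraphs that contain no text at all
    .Where(c => c.Any())
    // Join the fragments of each paragraph into one string
    .Select(c => c.Aggregate((a, b) => a + b))
    // Join the paragraphs with a blank line between them (throws if there are none)
    .Aggregate((a, b) => a + "\r\n\r\n" + b);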
We successfully used this formula for scraping, but it seems like the terrain you have to cross is considerably more inhospitable.
Obviously not a whole solution, but instead of trying to find the relevant content, it might be easier to disqualify non-relevant content. You could classify certain types of noises and work on coming up with smaller solutions that eliminate them. You could have advertisement filters, navigation filters, etc.
I think that the larger question is do you need to have one solution work on a wide range of content, or are you willing to create a framework that you can extend and implement on a site by site basis? On top of that, how often are you expecting change to the underlying data sources (i.e. volatility)?
You might want to look at Latent Dirichlet Allocation which is an IR technique to generate topics from text data that you have. This should help you reduce noise and get some precise information on what the page is about.