In C#, how can I extract the URLs of any images found when performing a search with Google? I'm writing a little app to get the artwork for my ripped CDs. I played around with the Amazon service but found the results I received were erratic. I can't be bothered to learn the whole Amazon API just for this simple little app, though, so I thought I'd try Google instead.
So far, I've performed the search and got the result page's source, but I'm not sure how to extract the URLs from it. I know I have to use a regex but have no idea what expression to use. All the ones I've found seem to be broken. Any help would be appreciated.
Try using the HTML Agility Pack. It works wonders on scraping content.
It lives here on Codeplex.
I used it to scrape a user ranking list from so.com, and loved it.
It will let you select a node of HTML and then query subnodes using XPath.
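For the image-artwork question at the top, a minimal sketch with the Html Agility Pack might look like this (the query URL and XPath are assumptions; Google's markup changes often):

    using System;
    using HtmlAgilityPack;

    var web = new HtmlWeb();
    // Fetch the results page; the query URL is just an illustration.
    HtmlDocument doc = web.Load("http://images.google.com/images?q=album+artwork");

    // Select every <img> that has a src attribute.
    var images = doc.DocumentNode.SelectNodes("//img[@src]");
    if (images != null) // SelectNodes returns null when nothing matches
    {
        foreach (HtmlNode img in images)
            Console.WriteLine(img.GetAttributeValue("src", ""));
    }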
I have a project where I need to create an HTML form (no problem) and then create a PDF file from the results using C#.
I have done this before in PHP using FPDF but this one needs to be C#. Ideally I want to put the code into a user control and then stick it in an Umbraco website.
Can anyone recommend a good way to do this? The PDF doesn't need to be fancy; it'll just display text. We aim to create a generic purchase order based on what the customer wants from the form, which can then be emailed to them to print off on headed paper.
Thanks
There are a couple of recent problems with iTextSharp. The most annoying is that the latest version deprecates the HTML parser, so everything now has to go through the XMLWorkerHelper singleton and parse via ParseXHtml. I find this a real pain: HTML pages that aren't well formed appear fine in a browser and parsed OK with the old method, but now crash out with an exception. So it necessitates an extra step to make sure your HTML is well formed (as XHTML) first. If you are generating your HTML from an ASPX page and then using Server.Execute() to get the stream, this might be useful to you for iTextSharp:
http://jwcooney.com/2012/12/30/generate-a-pdf-from-an-asp-net-web-page-using-the-itextsharp-xmlworker-namespace/
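For what it's worth, the XML Worker route boils down to something like this (a sketch, assuming iTextSharp 5.x with the XML Worker add-on, and that xhtml already holds well-formed markup):

    using System.IO;
    using iTextSharp.text;
    using iTextSharp.text.pdf;
    using iTextSharp.tool.xml;

    public static byte[] XhtmlToPdf(string xhtml)
    {
        using (var ms = new MemoryStream())
        {
            var document = new Document();
            PdfWriter writer = PdfWriter.GetInstance(document, ms);
            document.Open();

            // Throws a parse exception if the markup is not well formed,
            // hence the extra tidy-up step mentioned above.
            using (var reader = new StringReader(xhtml))
            {
                XMLWorkerHelper.GetInstance().ParseXHtml(writer, document, reader);
            }

            document.Close();
            return ms.ToArray();
        }
    }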
Be mindful that iTextSharp has a distinct lack of decent documentation for these modern changes (and the Java iText documents don't translate perfectly to C#), which makes the learning curve far too long and steep for practical use in short spaces of time. I've basically given up on that platform, though I may just create a baseline system to get something working lean while I learn another framework.
As a result, I'm looking at the Pdfizer and PDFsharp libraries. If I have some success, I'll report back.
Here is a library for converting HTML to PDF:
http://pdfcrowd.com/web-html-to-pdf-net/
I like the PDFsharp library. Not sure how it would work for your needs, though.
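Since the purchase order only needs plain text, a PDFsharp version could be as simple as this sketch (names and layout are purely illustrative):

    using PdfSharp.Drawing;
    using PdfSharp.Pdf;

    var document = new PdfDocument();
    PdfPage page = document.AddPage();

    using (XGraphics gfx = XGraphics.FromPdfPage(page))
    {
        var font = new XFont("Verdana", 12);
        // Draw each line of the order at a fixed vertical offset.
        gfx.DrawString("Purchase Order #1234", font, XBrushes.Black,
            new XPoint(40, 60));
        gfx.DrawString("1 x Widget .......... 9.99", font, XBrushes.Black,
            new XPoint(40, 90));
    }

    document.Save("order.pdf"); // e.g. attach this file to the customer email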
I am building a news reader with an option for users to share an article from a blog, website, etc. by entering a link to the page. For now I am using two methods to determine the content of the page:
1. I try to extract the RSS feed link from the page the user entered, and then match that URL against the items in the feed to get the right one.
2. If the site doesn't contain a feed, or the feed is malformed, or the entered address differs from the item link in the RSS (which happens in about 50% of cases, if not more), I try to find the og meta tags. That works great, but only bigger sites have them; smaller sites and blogs usually have the same meta description for the whole website.
I am wondering how, for example, Google does it. When a website doesn't contain a meta description, Google somehow determines by itself what the content of the page is for its search results.
I am using HtmlAgilityPack to extract content from pages and my own methods to clean the HTML down to text.
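For example, the og-tag lookup can be done with an XPath query like this (a sketch; the fallback when tags are missing is up to you):

    using HtmlAgilityPack;

    var web = new HtmlWeb();
    HtmlDocument doc = web.Load(url); // url: the address the user entered

    // Open Graph data lives in <meta property="og:..." content="..."> tags.
    var ogTitle = doc.DocumentNode
        .SelectSingleNode("//meta[@property='og:title']");

    string title = ogTitle != null
        ? ogTitle.GetAttributeValue("content", null)
        : null; // fall back to <title> or a heuristic when og tags are absent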
Can someone explain the logic or the best approach to this? If I try to crawl the page directly from the top, I usually end up with content from the sidebar, navigation, etc.
I ended up using Boilerpipe, which is written in Java. I imported it using IKVM, and it works well for pages that are formatted correctly, but it still has trouble with some pages where the content is scattered.
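For reference, the IKVM route boils down to something like this (a sketch; it assumes you have run ikvmc over the Boilerpipe jar and referenced the generated assembly plus the IKVM runtime):

    using System;
    using System.Net;
    using de.l3s.boilerpipe.extractors; // namespace IKVM generates from the jar

    string html = new WebClient().DownloadString("http://example.com/article");

    // ArticleExtractor is tuned for news and blog article pages.
    string articleText = ArticleExtractor.INSTANCE.getText(html);
    Console.WriteLine(articleText);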
Hi
I'm developing a small search-engine kind of application. It searches for content in Word documents. I need to implement a "view as HTML" option like the one in Gmail: when I click the link to the doc, it should open as an HTML page in a new browser window. Is there any way to achieve this?
I was able to open the Word doc in an iframe, but that does not suit my purpose.
My application uses ASP.NET and C#. Any help would be appreciated.
Regards
Vignesh
The easy, slow, memory-intensive, unscalable, unscalable (needs to be said twice) way of doing it would be to use the Office COM API to load the file and save it as HTML (or actually as text, since all you want to do is search it), but I really doubt you can pull this off on even a moderately used web site.
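For completeness, a sketch of that COM route (assumes Word is installed on the server and the Word 2010+ interop assembly is referenced; as said above, it won't scale):

    using Word = Microsoft.Office.Interop.Word;

    // Every call spins up a full Word instance: slow and memory-hungry.
    var app = new Word.Application();
    try
    {
        Word.Document doc = app.Documents.Open(@"C:\docs\input.doc", ReadOnly: true);
        doc.SaveAs2(@"C:\docs\output.html",
            Word.WdSaveFormat.wdFormatFilteredHTML);
        doc.Close();
    }
    finally
    {
        app.Quit();
    }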
Setting that aside, you're left with open-source parsers or with using the IFilter interface. I found an example of the latter: http://www.neowin.net/forum/topic/316480-reading-text-from-ms-word-files-in-c
Some of my website URLs are duplicated.
I need to know which of them are indexed by Google.
I need a function in C# that tells me which of my URLs are indexed.
In Google's search you can type:
site:yourdomain
And it will show you the results. You can use the Google Custom Search API to do this programmatically.
http://code.google.com/apis/customsearch/v1/overview.html
It returns JSON results that you can convert into C# objects using the DataContractJsonSerializer.
You'll need to sign up for an API key if you go this route.
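In that case, the call boils down to something like this sketch (the data contract is a hypothetical cut-down shape covering only the fields used here; YOUR_KEY and YOUR_CX are placeholders from the API console):

    using System;
    using System.Net;
    using System.Runtime.Serialization;
    using System.Runtime.Serialization.Json;

    [DataContract]
    public class SearchResult
    {
        [DataMember(Name = "items")]
        public Item[] Items { get; set; }
    }

    [DataContract]
    public class Item
    {
        [DataMember(Name = "link")]
        public string Link { get; set; }
    }

    // Ask the API for pages from your domain that Google has indexed.
    string url = "https://www.googleapis.com/customsearch/v1" +
                 "?key=YOUR_KEY&cx=YOUR_CX&q=site:yourdomain.com";

    using (var stream = new WebClient().OpenRead(url))
    {
        var serializer = new DataContractJsonSerializer(typeof(SearchResult));
        var result = (SearchResult)serializer.ReadObject(stream);

        foreach (var item in result.Items)
            Console.WriteLine(item.Link); // a URL Google has indexed
    }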
Edit
As for the Html Agility Pack, I have a blog post that shows how you can extract the links on a page:
Finding links on a Web page
I want to store Google search results (both title and link) in a database. The HTML of the search results looks like this:
<h3 class=r><a href="THEURL">THETITLE</a></h3>
And each page has 10 results. Can anyone show me how to retrieve THEURL and THETITLE?
Thank you so much!
You should give the Html Agility Pack a try. An HTML parser is the correct way to read HTML content, not regular expressions.
BUT, if you want to try at your own risk:
<h3 class=r><a .*? href="(?<url>[^"]*)".*?>(?<title>.*?)</a></h3>
You'll have problems with:
Line breaks
Unmatched tags
Minor HTML changes
So, good luck!
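If you do go the regex route anyway, applying the pattern with its named groups looks like this (same caveats as above; RegexOptions.Singleline at least takes care of the line breaks):

    using System;
    using System.Net;
    using System.Text.RegularExpressions;

    string html = new WebClient().DownloadString(
        "http://www.google.com/search?q=your+query");

    var pattern = new Regex(
        @"<h3 class=r><a .*? href=""(?<url>[^""]*)"".*?>(?<title>.*?)</a></h3>",
        RegexOptions.Singleline);

    foreach (Match m in pattern.Matches(html))
    {
        Console.WriteLine("{0} -> {1}",
            m.Groups["title"].Value, m.Groups["url"].Value);
    }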
For starters, I would not recommend using a regex for this; use the Html Agility Pack to do the parsing of the HTML document.
Hope this helps,
Best regards,
Tom.
Consider using the Google AJAX Search API instead. It will be easier on both you and Google's servers. There are some instructions for using it outside JavaScript environments. They don't give a C# example, but it shouldn't be difficult to adapt to your needs using one of the JSON APIs for C#.
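A rough sketch of calling it from C# (this assumes the v1.0 REST endpoint and response shape from that API's documentation, with Json.NET doing the parsing):

    using System;
    using System.Net;
    using Newtonsoft.Json.Linq;

    string url = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=" +
                 Uri.EscapeDataString("album artwork");
    string json = new WebClient().DownloadString(url);

    // Results live under responseData.results in the returned JSON.
    foreach (JToken result in JObject.Parse(json)["responseData"]["results"])
    {
        Console.WriteLine("{0} -> {1}",
            result["titleNoFormatting"], result["url"]);
    }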
If you do stick with HTML, I also recommend HTML Agility Pack.
You should also think about caching results, so you minimize unnecessary requests while avoiding stale data.