Some of my website URLs are duplicated.
I need to know which of them are indexed by Google.
I need a function in C# to find out which of my URLs are indexed.
In Google's search you can type:
site:yourdomain
and it will show you the results. You can use the Google Custom Search API to do this programmatically.
http://code.google.com/apis/customsearch/v1/overview.html
It returns JSON results that you can convert into C# objects using the DataContractJsonSerializer.
You'll need to sign up for an API key if you go this route.
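A minimal sketch of calling the API and deserializing the indexed URLs could look like this; YOUR_KEY, YOUR_CX, and yourdomain.com are placeholders to replace with your own values, and the data contract classes only model the fields read below:

```csharp
using System;
using System.IO;
using System.Net;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Json;

// Minimal shape of the v1 JSON response; only the fields read below.
[DataContract]
class SearchResult
{
    [DataMember(Name = "items")]
    public Item[] Items { get; set; }
}

[DataContract]
class Item
{
    [DataMember(Name = "link")]
    public string Link { get; set; }
}

class Program
{
    static void Main()
    {
        // YOUR_KEY / YOUR_CX are placeholders for your API key and
        // custom search engine id; yourdomain.com is your site.
        string url = "https://www.googleapis.com/customsearch/v1" +
                     "?key=YOUR_KEY&cx=YOUR_CX&q=" +
                     Uri.EscapeDataString("site:yourdomain.com");

        using (var client = new WebClient())
        using (var stream = new MemoryStream(client.DownloadData(url)))
        {
            var serializer = new DataContractJsonSerializer(typeof(SearchResult));
            var result = (SearchResult)serializer.ReadObject(stream);

            // Every link returned for the site: query is a URL Google has indexed.
            foreach (var item in result.Items ?? new Item[0])
                Console.WriteLine(item.Link);
        }
    }
}
```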
Edit
As for the Html Agility Pack, I have a blog post that shows how you can extract the links on a page:
Finding links on a Web page
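The gist of it (a generic sketch, not the post's exact code) is a single XPath query with the Html Agility Pack; example.com is a placeholder:

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var web = new HtmlWeb();
        var doc = web.Load("https://example.com/"); // placeholder URL

        // Every anchor that actually carries an href.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
            foreach (var link in links)
                Console.WriteLine(link.GetAttributeValue("href", ""));
    }
}
```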
I have an API that passes a search query to a website I use to look up products; I use the catalog number to obtain the device identifier. The response that is returned is HTML, and I need to extract one line from the HTML to write to a file. Is it possible to select a specific div from a Web API response?
My goal is to eventually loop over each product search, pull the one line I need, and then write it to an Excel file.
Here is an example of the API searching for a product, and the response it returns.
Here is the single line I need to extract out of the response; I then want to concatenate it to the URL and write out the whole link with each specific device identifier.
I hope this makes sense.
This is a parsing problem, and since the content you want to extract from is HTML, it's a straightforward task.
You have three main steps to get this done:
1. Parse the content, whether it's on the web or a downloaded file.
2. Use a selector to get the "a" tag you're looking for.
3. Extract the URL from the "href" attribute of the "a" tag.
I see you're using C#, so I would recommend this library: use its parser to parse the file, then its selector engine with a CSS selector to get your data.
Let me know if you still need more details.
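The answer's original library link isn't preserved here, but a parser with CSS-selector support such as AngleSharp fits the description. A rough sketch of the three steps; the inline HTML and the div.result selector are made-up placeholders for your actual response and markup:

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;

class Program
{
    static async Task Main()
    {
        // Placeholder HTML standing in for the API's real response.
        var html = "<div class=\"result\"><a href=\"/device/12345\">Device</a></div>";

        // 1. Parse the content.
        var context = BrowsingContext.New(Configuration.Default);
        var document = await context.OpenAsync(req => req.Content(html));

        // 2. Select the "a" tag with a CSS selector
        //    ("div.result a" is a placeholder; match your actual markup).
        var anchor = document.QuerySelector("div.result a");

        // 3. Extract the URL from the "href" attribute.
        var href = anchor?.GetAttribute("href");
        Console.WriteLine(href); // e.g. "/device/12345"
    }
}
```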
I want to capture blog posts from some blog sites. I know how to use HttpClient to get the HTML string, and then use the Html Agility Pack to capture the content under a specific HTML tag. But if you use a WebView to show this HTML string, you will find that it doesn't look good on mobile. For example, CSS styles will not load correctly, some code blocks will not auto-wrap, and some pictures will not show (an x appears instead).
Some advertisements also show, and I don't want them.
Does anyone know how to handle this? Any suggestions will be appreciated.
Try running the HTML string through something like Google Mobilizer. This should produce a more mobile-friendly HTML string, which you can then 'unpack' with the Agility Pack.
Ideally you should capture the HTML page and all its associated resources: CSS files, images, scripts, and so on.
Then update the HTML content so that resources are retrieved from your local data storage (for example, relative URLs will no longer work once you save the HTML page locally).
You may also send your HTTP request with a User-Agent header that corresponds to the one used by a Microsoft browser, in order to obtain the corresponding version from the website (if it does some kind of User-Agent sniffing).
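For instance, a minimal sketch of setting the User-Agent with HttpClient; the UA string below is just an example of a Microsoft mobile browser string, and the URL is a placeholder:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            // Example Microsoft mobile browser UA string; substitute the one
            // for whichever browser/version you want the site to target.
            client.DefaultRequestHeaders.TryAddWithoutValidation(
                "User-Agent",
                "Mozilla/5.0 (Windows Phone 10.0; Android 4.2.1; Microsoft; " +
                "Lumia 950) AppleWebKit/537.36 (KHTML, like Gecko) " +
                "Chrome/46.0.2486.0 Mobile Safari/537.36 Edge/13.10586");

            // Placeholder URL for the blog post you're capturing.
            var html = await client.GetStringAsync("https://example.com/post");
            Console.WriteLine(html.Length);
        }
    }
}
```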
Hello, I want to ask something: is there a way to read information from a website that I do not own, from code-behind?
For example, I want to read the title of every page on some website. Can I do it, and how?
Not a way of hacking; I just want to read the clear text, not the HTML code.
I don't know what to do or how to do it, and I need ideas.
Also, is there a way to search for a specific word across several websites, and an API I can use to search a website?
You still have to read the HTML since that's how the title is transmitted.
Use the HttpWebRequest class to make a request to the web server, the HttpWebResponse class to get the response back, and its GetResponseStream() method to read the response. Then you need to parse it in some way.
Look at the HtmlAgilityPack to parse the HTML. You can use it to get the title element out of the HTML and read it. You can then get all the anchor elements within the page and determine which ones on the same site you want to visit next to scan their titles.
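Putting those pieces together might look like this sketch; example.com is a placeholder for whatever site you're reading:

```csharp
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Fetch the raw HTML (the URL is a placeholder).
        var request = (HttpWebRequest)WebRequest.Create("https://example.com/");
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            var html = reader.ReadToEnd();

            // Parse with the Html Agility Pack.
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Read the page title.
            var title = doc.DocumentNode.SelectSingleNode("//title");
            Console.WriteLine(title != null ? title.InnerText : "(no title)");

            // Collect anchors to decide which same-site pages to visit next.
            var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
            if (anchors != null)
                foreach (var a in anchors)
                    Console.WriteLine(a.GetAttributeValue("href", ""));
        }
    }
}
```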
There is a powerful HTML parser available for .NET that you can use with XPath to read HTML pages:
HTML Agility Pack
Or you can use the built-in WebClient class to get the page data as a string and then do string manipulation.
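The WebClient route is only a few lines; the URL is a placeholder, and the title extraction is just an example of the string-manipulation approach:

```csharp
using System;
using System.Net;

class Program
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Download the page as one string (placeholder URL).
            string html = client.DownloadString("https://example.com/");

            // Naive string manipulation: grab whatever sits between the title tags.
            int start = html.IndexOf("<title>") + "<title>".Length;
            int end = html.IndexOf("</title>", start);
            if (start >= "<title>".Length && end > start)
                Console.WriteLine(html.Substring(start, end - start));
        }
    }
}
```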
I am building a news reader and I have an option for users to share an article from a blog, website, etc. by entering a link to the page. I am using two methods for now to determine the content of the page:
I try to extract the RSS feed link from the page the user entered and then match that URL in the feed to get the right item.
If the site doesn't contain a feed, or the feed is malformed, or the entered address differs from the item link in the RSS (which happens in about 50% of cases, if not more), I try to find og meta tags. That works great, but only bigger sites have them; smaller sites and blogs usually have the same meta description for the whole website.
I am wondering how, for example, Google does it? When a website doesn't contain a meta description, Google somehow determines by itself what the content of the page is for its search results.
I am using HtmlAgilityPack to extract stuff from pages and my own methods to clean HTML to text.
Can someone explain the logic or best approach to this? If I try to crawl the content directly from the top, I usually end up with content from the sidebar, navigation, etc.
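A minimal sketch of the og-tag lookup described above, using HtmlAgilityPack (placeholder URL; a generic illustration, not the asker's exact code):

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var web = new HtmlWeb();
        var doc = web.Load("https://example.com/article"); // placeholder URL

        // Open Graph tags, where present, carry the per-article title/description.
        var title = doc.DocumentNode
            .SelectSingleNode("//meta[@property='og:title']");
        var desc = doc.DocumentNode
            .SelectSingleNode("//meta[@property='og:description']");

        Console.WriteLine(title != null ? title.GetAttributeValue("content", "") : "(none)");
        Console.WriteLine(desc != null ? desc.GetAttributeValue("content", "") : "(none)");
    }
}
```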
I ended up using Boilerpipe, which is written in Java; I imported it using IKVM and it works well for pages that are formatted correctly, but it still has trouble with some pages where the content is scattered.
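Once the boilerpipe jar is converted with ikvmc, the extraction is essentially one call. The namespace below assumes IKVM carries the Java package names over unchanged; verify against your converted assembly:

```csharp
using System;
// Namespace as produced by ikvmc from the boilerpipe jar (assumed;
// check the actual assembly you generated).
using de.l3s.boilerpipe.extractors;

class Program
{
    static void Main()
    {
        // Placeholder HTML; in practice, the downloaded page source.
        string html = "<html><body><p>Article text...</p></body></html>";

        // ArticleExtractor is tuned for news/blog articles and strips
        // navigation, sidebars, and other boilerplate.
        string text = ArticleExtractor.INSTANCE.getText(html);
        Console.WriteLine(text);
    }
}
```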
In C#, how can I extract the URLs of any images found when performing a search with Google? I'm writing a little app to get the artwork for my ripped CDs. I played around with the Amazon service but found the results I received were erratic. I can't be bothered to learn the whole Amazon API just for this simple little app, so I thought I'd try Google instead.
So far, I've performed the search and got the result page's source, but I'm not sure how to extract the URLs from it. I know I have to use a regex but have no idea what expression to use; all the ones I've found seem to be broken. Any help would be appreciated.
Try using the HTML Agility Pack. It works wonders on scraping content.
It lives here on CodePlex.
I used it to scrape a user ranking list from so.com, and loved it.
It will let you select a node of HTML and then query subnodes using XPath.
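Rather than a regex, a sketch of pulling image URLs out of the results page with the Agility Pack; resultsHtml stands in for the page source you already downloaded:

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // resultsHtml stands in for the search results source you already have.
        string resultsHtml =
            "<html><body><img src=\"http://example.com/a.jpg\"/></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(resultsHtml);

        // Grab every <img> src attribute instead of fighting a regex.
        var imgs = doc.DocumentNode.SelectNodes("//img[@src]");
        if (imgs != null)
            foreach (var img in imgs)
                Console.WriteLine(img.GetAttributeValue("src", ""));
    }
}
```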