I have an API that can pass a search query to a website that I use to look up products. I use the catalog number to obtain the device identifier. The response that is returned is HTML, and I need to extract one line from the HTML to write to a file. Is it possible to select a specific div in a web API?
My goal is to eventually loop over each product search, pull the one line I need, and then write it to an Excel file.
Here is an example of the API searching a product, and the response: [screenshot: "api working"]
Here is the single line of code that I need to extract out of the response. I then want to concatenate it to the URL and write the whole link out with each specific device identifier: [screenshot: "Line of code I need"]
I hope this makes sense.
This is a parsing problem, and since the file/content you want to extract from is HTML, it should be a straightforward task.
You have three main steps to get this done:
1. Parse the content, whether it's on the web or a downloaded file.
2. Use a selector to get the "a" tag you're looking for.
3. Extract the URL from the "href" attribute of the "a" tag.
I see you're using C#, so I would recommend this library: use its parser to parse the file, then its selector support with a CSS selector to get your data.
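As a minimal sketch of those three steps, assuming Html Agility Pack (mentioned elsewhere in these answers) as the parser and an XPath expression in place of a CSS selector; the search URL and div class here are hypothetical placeholders:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class ProductLinkExtractor
{
    static async Task Main()
    {
        // 1. Parse the content: fetch the search response and load it.
        using var client = new HttpClient();
        string html = await client.GetStringAsync(
            "https://example.com/search?q=CATALOG123"); // hypothetical search URL

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // 2. Select the "a" tag; the div class is a placeholder,
        //    adjust it to match the real page structure.
        var anchor = doc.DocumentNode.SelectSingleNode("//div[@class='product-result']//a");

        // 3. Extract the URL from the "href" attribute and build the full link.
        string href = anchor?.GetAttributeValue("href", "");
        Console.WriteLine("https://example.com" + href);
    }
}
```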
Let me know if you still need more details.
How do I retrieve RSS Feeds based on a date range?
Specifically, how do I prepare the url so that I can get items that were published past a certain date?
www.pwop.com/feed.aspx?show=dotnetrocks&filetype=master&tags=Craftsmanship
Your question is more related to the HTTP API of the site, not RSS itself.
RSS is a predefined XML data format.
Most RSS URLs don't support filters; they expose a simple URL which returns the last X results in RSS format (X is usually between 10 and 50 results).
Some URLs allow you to specify categories or tags, as in your example, so the returned RSS XML will contain only results with those tags.
If you don't want to miss results, you need to keep querying the RSS URL every X minutes/hours, depending on how quickly the results update.
Another option is to contact the site and request full API access, or even ask them to implement a feature to filter by date.
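If you go the polling route, here is a minimal sketch of filtering client-side by publish date with .NET's SyndicationFeed, since the feed URL itself can't filter; the cutoff date is just an example:

```csharp
using System;
using System.Linq;
using System.Net;
using System.ServiceModel.Syndication;
using System.Xml;

class RssDateFilter
{
    static void Main()
    {
        string url = "http://www.pwop.com/feed.aspx?show=dotnetrocks&filetype=master&tags=Craftsmanship";
        var cutoff = new DateTimeOffset(2012, 1, 1, 0, 0, 0, TimeSpan.Zero); // example cutoff date

        // Download the feed and load it as RSS.
        using var client = new WebClient();
        using var stream = client.OpenRead(url);
        using var reader = XmlReader.Create(stream);
        SyndicationFeed feed = SyndicationFeed.Load(reader);

        // Keep only items published after the cutoff.
        foreach (var item in feed.Items.Where(i => i.PublishDate > cutoff))
            Console.WriteLine($"{item.PublishDate:d}  {item.Title.Text}");
    }
}
```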
Not all websites support it, but maybe there is a solution that can work:
Websites usually have a sitemap.xml (or sitemap.xml.gz or sitemap.gz) file that contains all the URLs, in bulk or grouped in some way (e.g., by category, tag, or month). The sitemap.xml can contain links to additional XMLs, and so on.
The main sitemap is typically located in the root of the site (e.g., https://news.bitcoin.com/sitemap.xml), but you can find more information about sitemaps here: https://www.sitemaps.org/protocol.html.
If a website has such an xml file, perhaps processing it will make it easier to extract the needed information without any special site crawler or API.
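A minimal sketch of reading such a sitemap in C#, assuming the standard sitemaps.org namespace:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using System.Xml.Linq;

class SitemapReader
{
    static async Task Main()
    {
        // Fetch the sitemap XML.
        using var client = new HttpClient();
        string xml = await client.GetStringAsync("https://news.bitcoin.com/sitemap.xml");

        // Standard namespace from https://www.sitemaps.org/protocol.html
        XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
        var doc = XDocument.Parse(xml);

        // A sitemap index nests further sitemaps; a plain sitemap lists pages.
        // Either way, the URLs live in <loc> elements.
        foreach (var loc in doc.Descendants(ns + "loc"))
            Console.WriteLine(loc.Value);
    }
}
```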
I have a PHP source file, so it has both PHP and HTML code. Now I have to find whether a PHP variable is used directly inside an HTML element or as an HTML attribute value. Initially, I thought regex could help... but I was unable to design the pattern, as there may be several scenarios.
As HtmlAgility is an HTML parser, will it be able to perform the task? After searching a while, I'm still not able to figure it out!
P.S.: I am processing the PHP source code as a text file in C#.
No, the web server will execute the PHP, rendering HTML, JavaScript, etc., and supply that to whatever HTTP client is making the request.
So the point is, by the time you request the PHP and get a response back, it's no longer the raw PHP code. For that you'd have to have direct access to the web server, or some other mechanism.
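Since you mention you already have the PHP source as a text file, here is a rough heuristic sketch (not a real parser, with a hypothetical input file): a PHP tag that falls after an unclosed "<" of an HTML tag is being used in an attribute position; otherwise it sits in element text.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

class PhpVarLocator
{
    static void Main()
    {
        // Hypothetical input file; this heuristic ignores comments, <script>
        // blocks, and quoting edge cases, so treat it as a starting point only.
        string source = File.ReadAllText("page.php");

        foreach (Match m in Regex.Matches(source, @"<\?(php)?.*?\?>", RegexOptions.Singleline))
        {
            // If the last '<' before the PHP tag is not yet closed by a '>',
            // the PHP code sits inside an HTML tag, i.e. an attribute position.
            string prefix = source.Substring(0, m.Index);
            bool insideTag = prefix.LastIndexOf('<') > prefix.LastIndexOf('>');

            Console.WriteLine($"{(insideTag ? "attribute value" : "element text")}: {m.Value}");
        }
    }
}
```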
Hello, I want to ask something... Is there a way to read some information, from code-behind, from a website that I do not own?
For example, I want to read the title of every page on some website... Can I do it, and how?
Not a way of hacking, just reading the clear text, no HTML code.
I don't know what to do or how to do it; I need ideas.
And is there a way to search for a specific word across several websites, and an API to use to search a website?
You still have to read the HTML since that's how the title is transmitted.
Use the HttpWebRequest class to make a request to the web server, the HttpWebResponse to get the response back, and the GetResponseStream() method to read the response. Then you need to parse it in some way.
Look at the HTMLAgilityPack in order to parse the HTML. You can use this to get the title element out of the HTML and read it. You can then get all the anchor elements within the page and determine which ones on their site you want to visit next to scan the titles.
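A minimal sketch of that flow, with a hypothetical URL:

```csharp
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class TitleReader
{
    static void Main()
    {
        // Request the page and read the raw HTML from the response stream.
        var request = (HttpWebRequest)WebRequest.Create("https://example.com/"); // hypothetical URL
        using var response = (HttpWebResponse)request.GetResponse();
        using var reader = new StreamReader(response.GetResponseStream());
        string html = reader.ReadToEnd();

        // Parse the HTML and pull the <title> text out.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        Console.WriteLine(doc.DocumentNode.SelectSingleNode("//title")?.InnerText);

        // The anchor elements tell you which pages you could visit next.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null)
            foreach (var a in anchors)
                Console.WriteLine(a.GetAttributeValue("href", ""));
    }
}
```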
There is a powerful HTML parser available for .NET that you can use with XPath to read HTML pages:
HTML Agility Pack
Or you can use the built-in WebClient class to get the page data as a string and then do string manipulation.
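A minimal sketch of the WebClient/string-manipulation approach, with a hypothetical URL:

```csharp
using System;
using System.Net;

class WebClientTitleReader
{
    static void Main()
    {
        // Download the whole page as a string.
        using var client = new WebClient();
        string html = client.DownloadString("https://example.com/"); // hypothetical URL

        // Naive string manipulation to pull out the <title> text.
        int start = html.IndexOf("<title>", StringComparison.OrdinalIgnoreCase);
        int end = html.IndexOf("</title>", StringComparison.OrdinalIgnoreCase);
        if (start >= 0 && end > start)
            Console.WriteLine(html.Substring(start + "<title>".Length, end - start - "<title>".Length));
    }
}
```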
I am attempting to retrieve data that is dynamically loaded onto a webpage using hashed links, i.e. http://www.westfield.com.au/au/retailers#page=5
My question is what technology is being used to load the data onto the page?
Secondly, how would one approach retrieving this data using C#?
My attempts so far have used WebClient to download the page at this link; unfortunately, the HTML file only contains the data from the very first page, no matter what page link I use.
What technology is being used to load the data onto the page?
JavaScript is used to load the data from a server, parse it into HTML and put it in the right place in the DOM.
Secondly, how would one approach retrieving this data using C#?
Make a request to: http://www.westfield.com.au/api/v1/countries/au/retail-chains/search.json?page=5, it will return a structured JSON document containing the data you need.
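A minimal sketch of calling that endpoint from C#; the exact JSON shape is an assumption here, so this just parses the document and prints it for inspection:

```csharp
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class RetailerFetcher
{
    static async Task Main()
    {
        // Hit the JSON endpoint the page's JavaScript calls, instead of the HTML page.
        using var client = new HttpClient();
        string json = await client.GetStringAsync(
            "http://www.westfield.com.au/api/v1/countries/au/retail-chains/search.json?page=5");

        // Inspect the structure, then drill into the properties you actually need.
        using JsonDocument doc = JsonDocument.Parse(json);
        Console.WriteLine(doc.RootElement.ToString());
    }
}
```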
If all you need is the JSON structure, Jon's answer sounds like a good place to start.
If you want a good stack for true rendered scraping, I'd use a combination of PhantomJS and Selenium to help bridge it to .NET.
This article is a great place to start.
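A minimal sketch of that stack, assuming the Selenium WebDriver and PhantomJS driver packages are installed:

```csharp
using System;
using OpenQA.Selenium.PhantomJS;

class RenderedScraper
{
    static void Main()
    {
        // PhantomJS is a headless browser, so the page's JavaScript runs and
        // loads the dynamic content before we read the DOM.
        using (var driver = new PhantomJSDriver())
        {
            driver.Navigate().GoToUrl("http://www.westfield.com.au/au/retailers#page=5");

            // PageSource now contains the rendered HTML, not just the first page's markup.
            Console.WriteLine(driver.PageSource);

            driver.Quit();
        }
    }
}
```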
Some of my website URLs are duplicated.
I need to know which of them are indexed by Google.
I need some function in C# to know which of my URLs are indexed.
In Google's search you can type:
site:yourdomain
And it will show you the results. You can use the Google Custom Search API programmatically to do this.
http://code.google.com/apis/customsearch/v1/overview.html
It returns JSON results that you can convert into C# objects using the DataContractJsonSerializer.
You'll need to sign up for an API key if you go this route.
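A minimal sketch of that route; YOUR_KEY and YOUR_CX are placeholders for the API key and search engine id you get when signing up, and the data-contract classes only model the fields used here:

```csharp
using System;
using System.Net;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Json;

// Partial shape of the Custom Search response; the real JSON has many more fields.
[DataContract]
class SearchResponse
{
    [DataMember(Name = "items")]
    public SearchItem[] Items { get; set; }
}

[DataContract]
class SearchItem
{
    [DataMember(Name = "link")]
    public string Link { get; set; }
}

class IndexChecker
{
    static void Main()
    {
        // Run a site: query through the API, just like typing it into Google.
        string url = "https://www.googleapis.com/customsearch/v1" +
                     "?key=YOUR_KEY&cx=YOUR_CX&q=" + Uri.EscapeDataString("site:yourdomain.com");

        using var client = new WebClient();
        using var stream = client.OpenRead(url);

        // Deserialize the JSON response into the C# objects above.
        var serializer = new DataContractJsonSerializer(typeof(SearchResponse));
        var response = (SearchResponse)serializer.ReadObject(stream);

        foreach (var item in response.Items ?? Array.Empty<SearchItem>())
            Console.WriteLine(item.Link); // URLs Google has indexed for the site
    }
}
```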
Edit
As for Html Agility Pack, I have a blog post that shows how you can extract the links on a page
Finding links on a Web page