Can HtmlAgility be used to fetch from Html source containing PHP code? - c#

I have a PHP source file, so it contains both PHP and HTML code. I need to find out whether a PHP variable is used directly inside an HTML element or as an HTML attribute value. Initially, I thought Regex could help... but I was unable to design a pattern, as there are too many possible scenarios.
As HtmlAgility is an HTML parser, will it be able to perform this task? After searching for a while, I'm still not able to figure it out!
P.S: I am processing the PHP source code as a text file in C#.

No, the web server will execute the PHP, rendering HTML, JavaScript, etc., and supply that to whatever HTTP client makes the request.
So the point is, by the time you request the PHP page and get a response back, it's no longer the raw PHP code. To get at the raw source you'd need direct access to the web server, or some other mechanism.
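Since the question processes the PHP file locally as text, here is a minimal sketch (an assumption on my part, not from the answer above) of loading that file with HtmlAgilityPack and checking where a given PHP variable shows up; how well this works depends on how the parser tolerates the inline `<?php ... ?>` fragments:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class PhpVariableScan
{
    static void Main()
    {
        // Hypothetical file name and variable; substitute your own.
        var doc = new HtmlDocument();
        doc.Load("page.php");

        const string phpVar = "$title";

        foreach (var node in doc.DocumentNode.Descendants()
                                .Where(n => n.NodeType == HtmlNodeType.Element))
        {
            // Variable used as (part of) an attribute value?
            foreach (var attr in node.Attributes.Where(a => a.Value.Contains(phpVar)))
                Console.WriteLine($"Attribute: <{node.Name} {attr.Name}=...>");

            // Variable used directly inside the element's text?
            if (node.ChildNodes.Any(c => c.NodeType == HtmlNodeType.Text
                                         && c.InnerText.Contains(phpVar)))
                Console.WriteLine($"Element text: <{node.Name}>");
        }
    }
}
```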

How to select specific DIV in web api response (html)

I have an API that can pass a search query to a website that I use to look up products. I use the catalog number to obtain the device identifier. The response that is returned is HTML, and I need to extract one line from it to write to a file. Is it possible to select a specific div in a web API response?
My goal is to eventually loop over each product search, pull the one line I need, and then write it to an Excel file.
Here is an example of the API searching for a product, and the response: api working
Here is the single line of code that I need to extract out of the response; I then want to concatenate it to the URL and write the whole link out with each specific device identifier: Line of code I need
I hope this makes sense.
This is a parsing problem, and since the content you want to extract from is HTML, it is a straightforward task.
You have three main steps to get this done.
1. Parse the content, whether it's on the web or in a downloaded file.
2. Use a selector to get the "a" tag you're looking for.
3. Extract the URL from the "href" attribute of the "a" tag.
I see you're using C#, so I would recommend this library; you would use its parser to parse the file, then its selector engine along with a CSS selector to get your data.
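The answer's library link didn't survive here; as an assumption, the sketch below uses HtmlAgilityPack with an XPath query instead of a CSS selector, but the three steps are the same:

```csharp
using System;
using HtmlAgilityPack;

class DeviceIdExtract
{
    static void Main()
    {
        // 'html' is the web API response body you already have as a string.
        string html = "<html><body><a href=\"/device/12345\">Device</a></body></html>";

        // Step 1: parse the content.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Step 2: select the "a" tag (the XPath filter is hypothetical;
        // adjust it to match the real markup).
        var link = doc.DocumentNode.SelectSingleNode("//a[contains(@href, 'device')]");
        if (link != null)
        {
            // Step 3: pull the URL out of the "href" attribute.
            string href = link.GetAttributeValue("href", "");
            // Concatenate to the site's base URL, as described in the question.
            Console.WriteLine("https://example.com" + href);
        }
    }
}
```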
Let me know if you still need more details.

Read something from website in code behind

Hello, I want to ask something... Is there a way to read some information, from code-behind, from a website that I do not own?
For example, I want to read the title of every page on some website. Can I do that, and how?
Not as a way of hacking; I just want to read the plain text, not the HTML code.
I don't know what to do or how to do it; I need ideas.
Also, is there a way to search for a specific word across several websites, and an API I can use to search a website?
You still have to read the HTML, since that's how the title is transmitted.
Use the HttpWebRequest class to make a request to the web server, the HttpWebResponse to get the response back, and the GetResponseStream() method to read the response body. Then you need to parse it in some way.
Look at the HTML Agility Pack to parse the HTML. You can use it to get the title element out of the HTML and read it. You can then get all the anchor elements within the page and determine which ones on the same site you want to visit next to scan their titles.
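Putting the two together, a minimal sketch (the URL is a placeholder) of fetching a page with HttpWebRequest and pulling out its title and anchors with the Agility Pack:

```csharp
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class TitleReader
{
    static void Main()
    {
        // Fetch the raw HTML with HttpWebRequest/HttpWebResponse.
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/");
        string html;
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            html = reader.ReadToEnd();
        }

        // Parse it with the HTML Agility Pack.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The page title, as plain text.
        var title = doc.DocumentNode.SelectSingleNode("//title");
        Console.WriteLine(title?.InnerText);

        // Anchor elements you might crawl next.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
            foreach (var a in links)
                Console.WriteLine(a.GetAttributeValue("href", ""));
    }
}
```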
There is a powerful HTML parser available for .NET that you can use with XPath to read HTML pages:
HTML Agility Pack
Or you can use the built-in WebClient class to get the page data as a string and then do string manipulation.
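For the WebClient route, the fetch itself is short (the URL is a placeholder):

```csharp
using System.Net;

// WebClient route: download the page as a string, then work on the text.
using (var client = new WebClient())
{
    string html = client.DownloadString("http://example.com/");
    // ...string manipulation or hand-rolled parsing here...
}
```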

Getting rendered html final source

I am developing a desktop application. I would like to grab a remote page's HTML source, but the remote page is largely rendered by JavaScript after the page loads.
I've been searching for a few days but could not find anything useful. I've studied and tried to apply the following suggestions, but I can only get the base HTML:
WebBrowser Threads don't seem to be closing
Get the final generated html source using c# or vb.net
View Generated Source (After AJAX/JavaScript) in C#
Is there any way to get all the data, like the HTML tab of Firebug's console shows?
Thank you in advance.
What are you trying to do? A web browser does more than just grab the HTML: the page needs to be parsed (which will likely download further files) and rendered. You could use the WebKit C# wrapper (http://webkitdotnet.sourceforge.net/); I have used this previously to get thumbnails of web pages.
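A separate option, offered here as an assumption rather than part of the answer above, is the WinForms WebBrowser control: its Document property exposes the live DOM (unlike DocumentText, which holds the original source), so reading it after DocumentCompleted picks up script-generated markup. A minimal sketch:

```csharp
using System;
using System.Windows.Forms;

class RenderedSourceGrab
{
    [STAThread]
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        browser.DocumentCompleted += (s, e) =>
        {
            // The live DOM, including changes scripts have made so far;
            // pages that keep loading via AJAX may need an extra delay or poll.
            Console.WriteLine(browser.Document?.Body?.Parent?.OuterHtml);
            Application.ExitThread();
        };
        browser.Navigate("http://example.com/"); // placeholder URL
        Application.Run(); // message pump so DocumentCompleted can fire
    }
}
```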

Retrieving Dynamically Loaded Data From a Website

I am attempting to retrieve data that is dynamically loaded onto a webpage using hashed links, e.g. http://www.westfield.com.au/au/retailers#page=5
My question is: what technology is being used to load the data onto the page?
Secondly, how would one approach retrieving this data using C#?
My attempts so far have used WebClient to download the page at this link; unfortunately, the HTML file only contains the data from the very first page, no matter which page link I use.
What technology is being used to load the data onto the page?
JavaScript is used to load the data from a server, parse it into HTML and put it in the right place in the DOM.
Secondly, how would one approach retrieving this data using C#?
Make a request to http://www.westfield.com.au/api/v1/countries/au/retail-chains/search.json?page=5; it will return a structured JSON document containing the data you need.
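A minimal sketch of that request, assuming Json.NET (Newtonsoft.Json) for parsing, which is an assumption here; any JSON library works, and the endpoint may have changed since the answer was written:

```csharp
using System;
using System.Net;
using Newtonsoft.Json.Linq; // Json.NET; an assumption, any JSON parser works

class RetailerSearch
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            string json = client.DownloadString(
                "http://www.westfield.com.au/api/v1/countries/au/retail-chains/search.json?page=5");

            // Property names are not verified here; inspect the actual
            // document to see where the retailer data lives.
            JObject doc = JObject.Parse(json);
            Console.WriteLine(doc.ToString());
        }
    }
}
```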
If all you need is the JSON structure, Jon's answer sounds like a good place to start.
If you want a good stack for true rendered scraping, I'd use a combination of PhantomJS and Selenium to help bridge it to .NET.
This article is a great place to start.
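A minimal sketch of that stack, assuming the Selenium WebDriver package's PhantomJS driver (since deprecated in favour of headless browsers, but current when this was written) and a phantomjs binary on the PATH:

```csharp
using System;
using OpenQA.Selenium.PhantomJS; // Selenium WebDriver's PhantomJS bindings

class RenderedScrape
{
    static void Main()
    {
        // Assumes phantomjs.exe is on the PATH or beside the executable.
        using (var driver = new PhantomJSDriver())
        {
            driver.Navigate().GoToUrl("http://www.westfield.com.au/au/retailers#page=5");

            // PageSource is the DOM after JavaScript has run, not the raw response.
            string renderedHtml = driver.PageSource;
            Console.WriteLine(renderedHtml);
        }
    }
}
```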

Using C# how do I get a list/array of all script tags (and their contents) on a webpage?

I am using HttpWebRequest to read a remote web page into a String, and I want to make a list of all its script tags (and their contents) for parsing.
What is the best method to do this?
The best method is to use an HTML parser such as the HTML Agility Pack.
From the site:
It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).
Sample applications:
Page fixing or generation. You can fix a page the way you want, modify the DOM, add nodes, copy nodes, well... you name it.
Web scanners. You can easily get to img/src or a/hrefs with a bunch of XPATH queries.
Web scrapers. You can easily scrape any existing web page into an RSS feed, for example, with just an XSLT file serving as the binding. An example of this is provided.
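With the page already in a string (as the question describes), a minimal Agility Pack sketch for collecting the script tags and their contents might look like this:

```csharp
using System;
using HtmlAgilityPack;

class ScriptTagList
{
    static void Main()
    {
        // 'html' would be the page you already fetched with HttpWebRequest.
        string html = "<html><head><script>var x = 1;</script></head><body></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard for that.
        var scripts = doc.DocumentNode.SelectNodes("//script");
        if (scripts != null)
        {
            foreach (var script in scripts)
                Console.WriteLine(script.InnerHtml); // the script's contents
        }
    }
}
```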
Use an XML parser to get all the script tags with their content.
Like this one: simple xml
