Read something from website in code behind - c#

Hello, I want to ask something: is there a way to read information from a website that I do not own from code-behind?
For example, I want to read the title of every page on some website. Can I do it, and how?
This is not about hacking; I just want to read the plain text, not the HTML code.
I don't know what to do or how to do it, so I need some ideas.
Also, is there a way to search for a specific word across several websites, and is there an API I could use to search a website?

You still have to read the HTML, since that's how the title is transmitted.
Use the HttpWebRequest class to make a request to the web server, HttpWebResponse to get the response back, and the GetResponseStream() method to read the response body. Then you need to parse it in some way.
Look at the Html Agility Pack to parse the HTML. You can use it to get the title element out of the HTML and read it. You can then get all the anchor elements within the page and determine which ones on the same site you want to visit next to scan their titles.
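A minimal sketch of the request/parse flow described above, assuming the HtmlAgilityPack NuGet package is installed; the URL is just a placeholder:

```csharp
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class TitleScraper
{
    static void Main()
    {
        // Request the page and read the raw HTML from the response stream.
        var request = (HttpWebRequest)WebRequest.Create("https://example.com/");
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd();

            // Parse the HTML and pull out the <title> element.
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            var titleNode = doc.DocumentNode.SelectSingleNode("//title");
            Console.WriteLine(titleNode?.InnerText);

            // Collect the anchors on the page so you can decide which
            // same-site links to visit next.
            var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
            if (anchors != null)
                foreach (var a in anchors)
                    Console.WriteLine(a.GetAttributeValue("href", ""));
        }
    }
}
```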

There is a powerful HTML parser available for .NET that you can use with XPath to read HTML pages:
HTML Agility Pack
Or
you can use the built-in WebClient class to get the page data as a string and then do string manipulation.
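A quick sketch of the string-manipulation route: download the page with WebClient and grab the title with a regex. This is a crude illustration (the URL is a placeholder); for anything beyond trivial cases the parser approach above is more robust:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

class WebClientTitle
{
    static void Main()
    {
        // Download the whole page as one string.
        using (var client = new WebClient())
        {
            string html = client.DownloadString("https://example.com/");

            // Crude string manipulation: grab whatever sits between
            // <title> and </title>.
            var match = Regex.Match(html, @"<title[^>]*>(.*?)</title>",
                                    RegexOptions.IgnoreCase | RegexOptions.Singleline);
            if (match.Success)
                Console.WriteLine(match.Groups[1].Value.Trim());
        }
    }
}
```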

Related

How to select specific DIV in web api response (html)

I have an API that passes a search query to a website that I use to look up products. I use the catalog number to obtain the device identifier. The response that is returned is HTML, and I need to extract one line from it to write to a file. Is it possible to select a specific div in a web API response?
My goal is to eventually loop over each product search, pull the one line I need, and then write it to an Excel file.
Here is an example of the API searching for a product, and the response.
Here is the single line of code that I need to extract out of the response. I then want to concatenate it to the URL and write the whole link out with each specific device identifier.
I hope this makes sense.
This is a parsing problem, and since the content you want to extract from is HTML, it is a straightforward task.
You have three main steps to get this done.
Parse the content, whether it's on the web or a downloaded file.
Use a selector to get the "a" tag you're looking for.
Extract the URL from the "href" attribute of the "a" tag.
I see you're using C#, so I would recommend this library; you would use its parser to parse the file, then its selector along with a CSS selector to get your data.
Let me know if you still need more details.
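The answer doesn't name the library it recommends, but the same three steps can be sketched with the Html Agility Pack (already used elsewhere in this thread), substituting an XPath selector for a CSS one. The URL, div id, and selector are made-up placeholders that would need to match the real page:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class DeviceIdExtractor
{
    static async Task Main()
    {
        // Step 1: parse the content returned by the API (hypothetical URL).
        using var http = new HttpClient();
        string html = await http.GetStringAsync("https://example.com/search?catalog=12345");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Step 2: select the <a> tag - here, the first anchor inside a
        // hypothetical <div id="result">; adjust the selector to the real markup.
        var anchor = doc.DocumentNode.SelectSingleNode("//div[@id='result']//a");

        // Step 3: read the href attribute and build the full link.
        if (anchor != null)
        {
            string href = anchor.GetAttributeValue("href", "");
            Console.WriteLine("https://example.com" + href);
        }
    }
}
```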

How to show appropriate page on mobile

I want to capture blog posts from some blog sites. I know how to use HttpClient to get the HTML string and then use the Html Agility Pack to capture the content under a specific HTML tag. But if you use a WebView to show this HTML string, you will find that it doesn't look good on mobile. For example, CSS styles will not load correctly, some code blocks will not auto-wrap, and some pictures will not show (an x appears instead).
Some advertisements will also show, but I don't want them.
Does anyone know how to handle this? Any suggestions will be appreciated.
Try running the HTML string through something like Google Mobilizer. This should produce a more mobile-friendly HTML string, which you can then 'unpack' with the Agility Pack.
Ideally you should capture the HTML page and all its associated resources: CSS files, images, scripts, ...
Then update the HTML content so that resources are retrieved from your local data storage (for example, relative URLs will no longer work once you have saved the HTML page locally).
You may also send your HTTP request with a User-Agent header that corresponds to the one used by a Microsoft browser in order to obtain the corresponding version from the website (if it does some kind of User-Agent sniffing).
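Setting the User-Agent header with HttpClient looks like this. The UA string below is an illustrative mobile-browser example, not an exact current browser string, and the URL is a placeholder:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class MobileFetch
{
    static async Task Main()
    {
        using var http = new HttpClient();

        // Pretend to be a mobile browser so that sites doing User-Agent
        // sniffing return their mobile layout.
        http.DefaultRequestHeaders.UserAgent.ParseAdd(
            "Mozilla/5.0 (Windows Phone 10.0) AppleWebKit/537.36 " +
            "(KHTML, like Gecko) Mobile Safari/537.36 Edge/15.0");

        string html = await http.GetStringAsync("https://example.com/blog/post");
        Console.WriteLine(html.Length);
    }
}
```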

How many times a word is present in a web page using htmlagility C#

I am developing a C# application that can scrape the contents of a web page and return all the words on the page. I am using the Html Agility Pack for it.
I want to know how I can tell how many times a word is present in a web page after scraping its contents.
You could treat the whole page/web request as a string and do something like this:
https://msdn.microsoft.com/en-us/library/bb546166.aspx
It might not be efficient, and it would search CSS class names and everything else, but it might be a starting point.
Otherwise you need to use the Agility Pack, scrape through each node, and check each bit of visible text.
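A sketch of the node-based approach: load the page with the Html Agility Pack, take only the visible text (skipping script and style contents), and count occurrences with a regex. The URL and the word "example" are placeholders:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class WordCounter
{
    static void Main()
    {
        var web = new HtmlWeb();
        var doc = web.Load("https://example.com/");

        // Take only the text nodes of the document, skipping the
        // contents of <script> and <style> elements.
        var textNodes = doc.DocumentNode.SelectNodes(
            "//text()[not(ancestor::script) and not(ancestor::style)]");
        if (textNodes == null) return;

        string text = string.Join(" ", textNodes.Select(n => n.InnerText));

        // Count whole-word, case-insensitive matches.
        int count = Regex.Matches(text, @"\bexample\b",
                                  RegexOptions.IgnoreCase).Count;
        Console.WriteLine($"'example' occurs {count} times");
    }
}
```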

c# how to download html which loads using ajax

Nowadays there are web pages built with AJAX-based frameworks that load content dynamically (lazy loading). I'm wondering if there is any way to download the HTML contents of such pages. When I try to download one with the Html Agility Pack, all I get is the header and an empty body; but when I inspect the element I can see the proper HTML/divs of the page, while the view-source still shows an empty body.
Is there any third-party library like the Html Agility Pack, or any other way?
You would need to be able to run the JavaScript that is inside the page, which, according to this answer, is not possible with the Html Agility Pack.
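One common workaround, not mentioned in the original answer, is to drive a real browser that executes the JavaScript and then hand the rendered HTML to the parser. A sketch using the Selenium WebDriver NuGet package (assumes ChromeDriver is installed; the URL is a placeholder):

```csharp
using System;
using OpenQA.Selenium.Chrome;
using HtmlAgilityPack;

class AjaxScraper
{
    static void Main()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");

        using (var driver = new ChromeDriver(options))
        {
            // The browser runs the page's JavaScript, so PageSource
            // contains the dynamically generated DOM, not the empty
            // body you get from a plain HTTP download.
            driver.Navigate().GoToUrl("https://example.com/spa-page");

            var doc = new HtmlDocument();
            doc.LoadHtml(driver.PageSource);
            Console.WriteLine(
                doc.DocumentNode.SelectSingleNode("//title")?.InnerText);
        }
    }
}
```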
You can see this example: Getting web content by Html Agility Pack: https://code.msdn.microsoft.com/Getting-web-content-by-bb07d17d...

Can HtmlAgility be used to fetch from Html source containing PHP code?

I have a PHP source file, so it contains both PHP and HTML code. I have to find out whether a PHP variable is used directly inside an HTML element or as an HTML attribute value. Initially I thought regex could help, but I was unable to design the pattern, as there may be several scenarios.
Since HtmlAgility is an HTML parser, will it be able to perform this task? After searching for a while, I'm still not able to figure it out!
P.S.: I am processing the PHP source code as a text file in C#.
No. The web server executes the PHP, rendering HTML, JavaScript, etc., and supplies that to whatever HTTP client makes the request.
So the point is, by the time you request the PHP page and get a response back, it's no longer the raw PHP code. To get that you'd need direct access to the web server, or some other mechanism.
