c# - HTML Parsing

I have a webpage whose entire HTML I need to parse to find any special tagging.
For example, I want to pull out every *.*.* element on that page. What's the best way to do this in C#?
However, these strings are dynamic because they result from a search query, so I can't just pull the source code and look for the string; the strings are in scripts that get pulled in dynamically.
Is there a way to get these strings? I just need to check whether they are in my already existing list. Maybe Selenium, or some other engine I'm not aware of, or another good approach to do this?
Thanks!
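For illustration, a minimal sketch of the Selenium route: load the page in a real browser so the scripts run, read the rendered DOM, and compare what turns up against the existing list. The URL, the regex pattern, and the sample tags below are placeholders.

```csharp
// Minimal sketch: Selenium renders the page (scripts included), then we scan
// the rendered HTML. Requires the Selenium.WebDriver package and a ChromeDriver.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class RenderedPageScan
{
    static void Main()
    {
        // Placeholder for the already existing list of strings to check against.
        var knownTags = new HashSet<string> { "1.2.3", "4.5.6" };

        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("https://example.com/search?q=test"); // placeholder URL

            // PageSource reflects the DOM after the scripts have run,
            // unlike a plain WebClient/HttpClient download.
            string html = driver.PageSource;

            // Example pattern for *.*.* style tokens; adjust to the real tagging.
            var found = Regex.Matches(html, @"\b\w+\.\w+\.\w+\b")
                             .Cast<Match>()
                             .Select(m => m.Value)
                             .Distinct();

            foreach (var tag in found.Where(knownTags.Contains))
                Console.WriteLine("Known tag on page: " + tag);
        }
    }
}
```

If the content only appears after a delay, an explicit wait on the relevant element is usually needed before reading PageSource.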

Related

Extract data from a website in winforms c#

I want to extract some data from a website, e.g. https://www.chefkoch.de/rezepte/drucken/512261146932016/Annas-Rouladen-mit-Seidenkloessen.html: the text on the left side and the ingredients table on the right.
I tried several approaches, such as downloading the page with a WebClient and pulling the parts out with a regex, but the problem is that if the page has more than one ingredients list, as in my example, I can't split them.
I also tried it with an HtmlDocument and getting the elements, but the elements don't have an id, only a class.
So is there any way to get these two things out of the website? I'm pretty new to HTML and that kind of stuff.
You should consider using some sort of web scraping library, like https://ironsoftware.com/csharp/webscraper/ or Selenium. That way you'll be able to target HTML elements and CSS classes to extract the data.
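For illustration, a minimal sketch with HtmlAgilityPack that selects by class instead of id; the class names in the XPath are assumptions and need to be replaced with the ones from the real recipe markup.

```csharp
// Minimal sketch with HtmlAgilityPack: select by class, keep each ingredients
// table separate. The class names used here are assumptions.
using System;
using HtmlAgilityPack;

class RecipeScraper
{
    static void Main()
    {
        var web = new HtmlWeb();
        var doc = web.Load("https://www.chefkoch.de/rezepte/drucken/512261146932016/Annas-Rouladen-mit-Seidenkloessen.html");

        // Text block on the left (hypothetical class name).
        var instructions = doc.DocumentNode.SelectSingleNode("//div[contains(@class,'instructions')]");
        if (instructions != null)
            Console.WriteLine(instructions.InnerText.Trim());

        // One node per ingredients table on the right (hypothetical class name),
        // so multiple lists stay separate instead of being split by regex.
        var tables = doc.DocumentNode.SelectNodes("//table[contains(@class,'ingredients')]");
        if (tables == null) return;

        foreach (var table in tables)
        {
            var rows = table.SelectNodes(".//tr");
            if (rows == null) continue;

            foreach (var row in rows)
                Console.WriteLine(row.InnerText.Trim());
            Console.WriteLine("----");
        }
    }
}
```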

How to convert a string to html attribute value as interpreted by a browser in C#

I need to validate user input for an href on the server side and make sure only http:// and https:// are allowed as the protocol (if one is specified at all). The objective is to eliminate possibly malicious code like javascript:... or anything alike.
What makes it difficult is the number of ways the colon could be encoded in such a string: a literal :, or entity forms such as &#58, &#58;, &#x3A;, &#x0003A. I'd like to transform the value and see it as browsers do before they render the page.
One option could be building a DOM document using AngleSharp, as it does a perfect job when parsing attributes. Then I could retrieve the value and validate it, but it seems somewhat of an overkill to build the whole DOM tree just to parse one value. Is there a way to use AngleSharp to parse just an attribute value? Or is there a library I could use just for this task?
I also found this question, but the method used there does not really parse the URIs the way browsers do.
You want the HtmlDecode() method. You may need to add a reference to the project to use it.
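As a sketch of that idea, decode first and then check the scheme. WebUtility.HtmlDecode (System.Net) works without an extra reference; HttpUtility.HtmlDecode (System.Web) is the variant that needs one. This handles common entity encodings but is not a full browser-equivalent parse.

```csharp
// Sketch: decode HTML entities, then accept only http/https (or relative) hrefs.
using System;
using System.Net;

static class HrefValidator
{
    public static bool IsAllowed(string rawHref)
    {
        // Turn encodings such as &#58; or &#x3A; back into a visible colon.
        string decoded = WebUtility.HtmlDecode(rawHref ?? string.Empty).Trim();

        // No colon at all: treat it as a relative link, which is fine here.
        if (!decoded.Contains(":"))
            return true;

        // Otherwise only absolute http/https URIs pass.
        return Uri.TryCreate(decoded, UriKind.Absolute, out Uri uri)
               && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

For example, IsAllowed("javascript&#58;alert(1)") returns false, while IsAllowed("https://example.com") returns true.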

How to search within a site with HtmlAgilityPack when the search URLs change

I am building a project to search for a specific driver on the Lenovo website (https://support.lenovo.com). The site changes the search URL while typing if a suitable product category is found.
This means, for example, if you search for "ideapad" it uses:
http://pcsupport.lenovo.com/api_v2/de/de/Product/GetProducts?productId=ideapad
If you search for "T540p 20BE" the URL changes to:
http://pcsupport.lenovo.com/de/de/products/laptops-and-netbooks/thinkpad-t-series-laptops/thinkpad-t540p/20be?linkTrack=Homepage%3ABody_Search+Products&searchType=4&keyWordSearch=T540p%2520Laptop%2520%2528ThinkPad%2529%2520-%2520Type%252020BE
First I tried to use the URL above, http://pcsupport.lenovo.com/api_v2/de/de/Product/GetProducts?productId=[Searchpattern]. You get back a JSON file which has further information on all model types of those devices. Not the response I needed.
What I need is a way to get back all available drivers for a given model.
As a response to the search you get an HTML document which contains all drivers in an embedded HTML page (screenshot omitted).
I have tried different approaches with Selenium, which work, but I need a usable way for my application. I tried HtmlAgilityPack and XPath, but could not get around the problem of the changing search URLs.
How can I get the version and the download link?
Update: Here is some example code. After not being able to get the correct syntax for the GET statement, I tried to enter the search text into the input box.
Goal: be able to paste the search text into the input box on the Lenovo site, or overcome the changing URLs (as mentioned above), and extract the needed information from the resulting driver page.
Edit: I just deleted the unneeded code part. Can someone give me a hint for a working approach to this? If HtmlWeb is not the best solution, what would you prefer?
You need to query for the corresponding elements and get the values you need. For example, for the version you need to query for the version DOM element (by tag, CSS class, or any other attribute) and read its InnerText property. For the download links you need to query for the download element and get the href attribute.
If you have any problems during development, add the code to the question so we can understand what you are doing and what you are querying for.
Edit: about the URL search part. First of all, you need to understand that your HtmlDocument is not a browser, and you cannot search for products by filling in the textbox on the site. You need to find another way to determine the corresponding URL for the entered product. One option is to get the data from all available search URLs, combine it, and search inside the combined result.
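For the extraction part, a minimal sketch with HtmlAgilityPack; the class names in the XPath are assumptions and have to be taken from the real driver page markup, and anything the page injects via script will not show up in a plain HtmlWeb download.

```csharp
// Minimal sketch: query the driver page for version text and download hrefs.
// The XPath class names are assumptions.
using System;
using HtmlAgilityPack;

class DriverPageParser
{
    static void Main()
    {
        // The resolved driver page URL for the model in question (example from above).
        string driverPageUrl = "https://pcsupport.lenovo.com/de/de/products/laptops-and-netbooks/thinkpad-t-series-laptops/thinkpad-t540p/20be";

        var web = new HtmlWeb();
        HtmlDocument doc = web.Load(driverPageUrl);

        var rows = doc.DocumentNode.SelectNodes("//div[contains(@class,'driver-item')]");
        if (rows == null) return;

        foreach (var row in rows)
        {
            var version = row.SelectSingleNode(".//span[contains(@class,'version')]");
            var link = row.SelectSingleNode(".//a[contains(@class,'download')]");

            Console.WriteLine("Version:  " + version?.InnerText.Trim());
            Console.WriteLine("Download: " + link?.GetAttributeValue("href", string.Empty));
        }
    }
}
```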

Deserializing schema.org microdata from HTML in C#

I am looking for a way to deserialize a set of offline HTML pages that have schema.org microdata embedded. How could I do this in C#? I have found Bam.Net.Schema.Org, but there is almost no code that teaches me how to use it.
I have found several "parsers" for Node.js (semantic-schema-parser and node-microdata-scraper), but they are imperfect and not something I could use from C#, at least not in the way I would prefer.
Suggestions are welcome. Should I simply create my own?
You can use HtmlAgilityPack to parse the HTML and then query the DOM for schema.org values.
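For illustration, a minimal sketch of that approach: every microdata item starts at an element with an itemscope attribute, and the values sit on itemprop elements below it. The file path is a placeholder for one of the offline pages.

```csharp
// Minimal sketch: read schema.org microdata (itemscope/itemtype/itemprop)
// from an offline HTML file with HtmlAgilityPack.
using System;
using HtmlAgilityPack;

class MicrodataReader
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.Load(@"C:\pages\page.html"); // placeholder path to an offline page

        var items = doc.DocumentNode.SelectNodes("//*[@itemscope]");
        if (items == null) return;

        foreach (var item in items)
        {
            Console.WriteLine("Item type: " + item.GetAttributeValue("itemtype", "(none)"));

            var props = item.SelectNodes(".//*[@itemprop]");
            if (props == null) continue;

            foreach (var prop in props)
            {
                string name = prop.GetAttributeValue("itemprop", "");
                // meta/link style properties carry their value in the content attribute,
                // everything else in the element text.
                string value = prop.GetAttributeValue("content", prop.InnerText.Trim());
                Console.WriteLine("  " + name + " = " + value);
            }
        }
    }
}
```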

Any way to associate a HtmlElement (.NET) to a JavaScript element?

I'm trying to make an extended version of a WebBrowser with features like highlighting text and getting properties or attributes of elements for a web scraper. The WebBrowser functions don't help much at all, so if I could just find a way to go from an HtmlElement to a JavaScript element (like the one returned by document.getElementById), and back, and then add JavaScript functions to the HTML from my application, it would make the job a lot easier.

Right now I'm manipulating the HTML programmatically from C# and it's very messy. I was thinking about setting a unique id on each HTML element from my program and then calling document.getElementById in JavaScript to retrieve it. But that won't work: they might already have an id assigned and I would mess up their HTML code. I don't know if I could give them some made-up attribute like my_very_own_that_i_hope_no_web_page_on_the_world_ever_uses_attribute and then find some JavaScript function getElementByWhateverAttributeIWant, but I'm not sure that would work either. I read something about expansion or extended attributes in the DOM documentation on MSDN, but I'm not sure what that is about. Maybe some of you have a better way.
It would be much easier to use a rendering engine like Trident (MSHTML) to get the data from the HTML document. Here is the link for Trident/MSHTML; you can Google for C# samples.
This is not nearly as hard as you imagine. You don't have to modify the document at all.
Once the WebBrowser has loaded a page, it's kept internally as a tree with the document node at the root. This node is available to your program, and you can find any element you want (or just enumerate them all) by walking the tree.
If you can give a concrete example, I can supply some code.
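For example, a minimal sketch of walking that tree, assuming a WinForms WebBrowser control named webBrowser1 whose DocumentCompleted event has already fired:

```csharp
// Minimal sketch: recursively walk the WebBrowser's document tree.
// Tag names and attributes are readable without touching JavaScript.
using System;
using System.Windows.Forms;

static class DomWalker
{
    public static void Walk(HtmlElement element, int depth = 0)
    {
        if (element == null) return;

        Console.WriteLine(new string(' ', depth * 2) + element.TagName
                          + " id=\"" + element.GetAttribute("id") + "\"");

        foreach (HtmlElement child in element.Children)
            Walk(child, depth + 1);
    }

    // Call this after the page has loaded, e.g. from DocumentCompleted.
    public static void Dump(WebBrowser webBrowser1)
    {
        if (webBrowser1.Document != null)
            Walk(webBrowser1.Document.Body);
    }
}
```

The managed HtmlDocument also exposes GetElementById and GetElementsByTagName if you prefer direct lookups over a full walk.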
