Extract data from a website in WinForms C#

I want to extract some data from a website, e.g. https://www.chefkoch.de/rezepte/drucken/512261146932016/Annas-Rouladen-mit-Seidenkloessen.html: the text on the left side and the ingredients table on the right.
I tried several approaches, such as downloading the page with a WebClient and picking out the parts with a regex, but the problem there is that when the table contains more than one list, as in my example, I can't split them.
I also tried it with an HtmlDocument and getting the elements, but the elements don't have an id, only a class.
So is there any way to get these two things out of the website? I'm pretty new to HTML and that kind of stuff.

You should consider using a web scraping library such as IronWebScraper (https://ironsoftware.com/csharp/webscraper/) or Selenium. That way you'll be able to target HTML elements and CSS classes to extract the data.
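For example, with Selenium you can locate both parts by their CSS classes. This is only a rough sketch: the class names .recipe-text and table.ingredients are placeholders, since the real ones have to be read off the page's markup (F12 in the browser).

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class RecipeScraper
{
    static void Main()
    {
        // Requires the Selenium.WebDriver NuGet package and a matching chromedriver.
        using var driver = new ChromeDriver();
        driver.Navigate().GoToUrl(
            "https://www.chefkoch.de/rezepte/drucken/512261146932016/Annas-Rouladen-mit-Seidenkloessen.html");

        // The class names below are placeholders -- inspect the page to find the
        // real classes on the instruction text and on the ingredients table.
        string instructions = driver.FindElement(By.CssSelector(".recipe-text")).Text;
        Console.WriteLine(instructions);

        // Each table row is read cell by cell, so multiple ingredient lists in
        // one table are no longer a problem for splitting.
        foreach (var row in driver.FindElements(By.CssSelector("table.ingredients tr")))
        {
            var cells = row.FindElements(By.TagName("td"));
            if (cells.Count >= 2)
                Console.WriteLine($"{cells[0].Text} {cells[1].Text}");
        }
    }
}
```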

Related

How to search within a site with HtmlAgilityPack when the search URLs change

I am building a project to search for a specific driver on the Lenovo support site (https://support.lenovo.com). This site changes the search URL while you type, as soon as a suitable product category is found.
This means, for example, that if you search for "ideapad" it uses:
http://pcsupport.lenovo.com/api_v2/de/de/Product/GetProducts?productId=ideapad
whereas if you search for "T540p 20BE" the URL changes to:
http://pcsupport.lenovo.com/de/de/products/laptops-and-netbooks/thinkpad-t-series-laptops/thinkpad-t540p/20be?linkTrack=Homepage%3ABody_Search+Products&searchType=4&keyWordSearch=T540p%2520Laptop%2520%2528ThinkPad%2529%2520-%2520Type%252020BE
First I tried to use the URL above, http://pcsupport.lenovo.com/api_v2/de/de/Product/GetProducts?productId=[Searchpattern]. You get back a JSON file with further information on all model types of that device, which is not the response I needed.
What I need is a way to get back all available drivers for a given model.
As a response to the search you get an HTML document which contains all drivers in an embedded HTML page.
I have tried different approaches with Selenium, which work, but I need a usable way for my application. I tried HtmlAgilityPack with XPath, but could not get around the problem of the changing search URLs.
How can I get the version and the download link?
Update: here is some example code. After not being able to work out the correct syntax for the GET request, I tried to enter the search text into the input box instead.
Goal: be able to paste the search text into the input box on the Lenovo site, or overcome the changing URLs (as mentioned above), and extract the needed information from the resulting driver page.
Edit: I just deleted the unneeded code part. Can someone give me a hint for a working approach to this? If HtmlWeb is not the best solution, what would you prefer?
You need to query for the corresponding elements and read the values you need. For example, for the version you query for the version DOM element (by tag, CSS class or any other attribute) and read its InnerText property. For the download links you query for the download element and read its href attribute.
If you have any problems during development, add the code to the question so we can see what you are doing and what you are querying for.
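A minimal sketch with HtmlAgilityPack, assuming the driver page has already been resolved to a concrete URL; the XPath expressions below are placeholders and have to be adapted to the real markup of the driver list.

```csharp
using System;
using HtmlAgilityPack;

class DriverScraper
{
    static void Main()
    {
        // URL taken from the question (query string omitted here).
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load(
            "http://pcsupport.lenovo.com/de/de/products/laptops-and-netbooks/thinkpad-t-series-laptops/thinkpad-t540p/20be");

        // Placeholder XPath: elements whose class hints at a version label.
        var versionNodes = doc.DocumentNode.SelectNodes("//span[contains(@class, 'version')]");
        if (versionNodes != null)
            foreach (var node in versionNodes)
                Console.WriteLine("Version: " + node.InnerText.Trim());

        // Placeholder XPath: anchors whose href looks like a driver download.
        var linkNodes = doc.DocumentNode.SelectNodes("//a[contains(@href, '.exe')]");
        if (linkNodes != null)
            foreach (var link in linkNodes)
                Console.WriteLine("Download: " + link.GetAttributeValue("href", ""));
    }
}
```

Note that if the driver list is injected by JavaScript, HtmlWeb only sees the initial HTML; in that case a browser-based tool such as Selenium is the more reliable route.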
Edit: about the URL search part. First of all, you need to understand that your HtmlDocument is not a browser, so you cannot search for products by filling in the textbox on the site. You need to find another way to get the corresponding URL for the entered product. One option is to get the data from all available search URLs, combine it, and search inside that combined data.
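For the combining idea, here is a sketch of pulling one of the GetProducts responses mentioned in the question with HttpClient; the shape of the JSON is not assumed, it is simply dumped so the relevant fields (model names and product URLs) can be identified first.

```csharp
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class LenovoProductLookup
{
    static async Task Main()
    {
        using var client = new HttpClient();

        // Endpoint taken from the question; "ideapad" is just an example search term.
        string json = await client.GetStringAsync(
            "http://pcsupport.lenovo.com/api_v2/de/de/Product/GetProducts?productId=ideapad");

        // Dump the structure first, then pick out the fields that map a model
        // name to its product URL and build a local lookup from them.
        using var doc = JsonDocument.Parse(json);
        Console.WriteLine(JsonSerializer.Serialize(doc.RootElement,
            new JsonSerializerOptions { WriteIndented = true }));
    }
}
```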

HTML Parsing in C#

I have a webpage and I need to parse through its entire HTML to find any special tagging.
For example, I want to pull out all of the *.*.* elements on that page. What's the best way to do this in C#?
However, these strings are dynamic because they result from a search query, so I can't just pull the source code and look for the string: they live in scripts that get pulled in dynamically.
Is there a way to get these strings? I just need to check whether they are in my already existing list. Maybe Selenium, or some other engine I'm not aware of, or another good approach to do this?
Thanks!
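A minimal sketch of the Selenium approach mentioned above: the page is rendered in a real browser, so text produced by the scripts ends up in the DOM, which can then be checked against the existing list. The URL, the wait time and the list contents are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class DynamicTextCheck
{
    static void Main()
    {
        // Hypothetical list of values already known to the application.
        var knownValues = new List<string> { "1.2.3", "4.5.6" };

        using var driver = new ChromeDriver();
        driver.Navigate().GoToUrl("https://example.com/results?q=term"); // placeholder URL

        // Crude pause so the scripts can finish; a WebDriverWait on a concrete
        // element is the more robust choice.
        Thread.Sleep(3000);

        // PageSource reflects the DOM after the scripts have run, unlike the
        // raw HTML you would get from WebClient/HttpClient.
        string rendered = driver.PageSource;

        var found = knownValues.Where(rendered.Contains).ToList();
        Console.WriteLine("Found: " + string.Join(", ", found));
    }
}
```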

Web automation using C# WebBrowser

I'm in the very early stages of attempting to automate data entry and collection from a website. I have a 16,000 line CSV file. For each line, I'd like to enter data from that line into a textarea on a webpage. The webpage can then perform some calculations with that data and spit out an answer that I'd collect. Specifically, on the webpage http://www.mirbase.org/search.shtml, I'd like to enter a sequence in the sequence text box at the bottom, press the "Search miRNAs" button and then collect results on the next page.
My plan as of right now is to use a C# WebBrowser control. My understanding is that I can access the individual elements in the HtmlDocument by id, name or coordinate. The last option is not ideal, because if I distribute this program to other people I can't be sure they'd be using the same coordinates. As for the other two options, the textarea has a name, but it's the same as the form name, so I don't know how to access it. The button I'd like to click has neither a name nor an id.
Does anyone have any ideas as to how to access the elements I need? I am by no means set on this method, so if there's an easier/better way I'm certainly open to suggestions.
The WebBrowser class is not designed for this, which is why you are running into these problems.
You need to look into a tool that is designed for web automation.
Since you are using C#, Selenium has a wonderful set of C# bindings, and it solves your problem because you'll be able to use different locators (specifically, locating an element by a CSS selector or XPath).
http://docs.seleniumhq.org/
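A rough sketch of that approach; the selectors below are assumptions (a single textarea for the sequence, and a submit button whose value contains "Search miRNAs") and should be checked against the page before relying on them.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class MirbaseSearch
{
    static void Main()
    {
        using var driver = new ChromeDriver();
        driver.Navigate().GoToUrl("http://www.mirbase.org/search.shtml");

        // Assumption: the sequence box is the only <textarea> on the page.
        var sequenceBox = driver.FindElement(By.TagName("textarea"));
        sequenceBox.SendKeys("UGAGGUAGUAGGUUGUAUAGUU"); // one line from the CSV

        // Assumption: the button is a submit input whose value contains "Search miRNAs".
        var searchButton = driver.FindElement(
            By.XPath("//input[@type='submit' and contains(@value, 'Search miRNAs')]"));
        searchButton.Click();

        // The results page is now loaded; pull whatever you need from it.
        Console.WriteLine(driver.Title);
    }
}
```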
Alternatively, check out mshtml (see MSHTML on MSDN).
You can use it together with the WebBrowser control.
Add a Microsoft.mshtml reference to your project and a using mshtml; declaration in your class.
Using mshtml you can easily get and set element properties.
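A rough sketch of the mshtml route, assuming a WinForms WebBrowser control that has finished loading the search page; the element lookup below (last textarea, submit button whose value contains "Search miRNAs") is an assumption and has to be matched to the actual markup.

```csharp
using System.Windows.Forms;
using mshtml; // COM reference to Microsoft.mshtml

static class MirbaseAutomation
{
    // Call this after the WebBrowser's DocumentCompleted event has fired
    // for http://www.mirbase.org/search.shtml.
    public static void FillAndSubmit(WebBrowser browser, string sequence)
    {
        var doc = (IHTMLDocument2)browser.Document.DomDocument;

        // Assumption: the sequence box is the last <textarea> on the page
        // (the question says it sits at the bottom).
        IHTMLTextAreaElement sequenceBox = null;
        foreach (IHTMLElement element in doc.all)
            if (element is IHTMLTextAreaElement textArea)
                sequenceBox = textArea;

        if (sequenceBox == null) return;
        sequenceBox.value = sequence;

        // Assumption: the matching button is a submit input whose value
        // contains "Search miRNAs".
        foreach (IHTMLElement element in doc.all)
        {
            if (element is IHTMLInputElement input &&
                input.type == "submit" &&
                (input.value ?? "").Contains("Search miRNAs"))
            {
                ((IHTMLElement)input).click();
                break;
            }
        }
    }
}
```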

Searching for the web pages that contain the word entered in a textbox

I have a textbox and a button on one page. I want to enter a word in the textbox and click the button. After clicking the button, I want to display the names of the web pages that contain the word entered in the textbox. Please tell me how to do this; I am using C#.
So you want to create a search engine internal to your website. There are a couple of different options.
You can use something like Google Custom Search, which requires no coding and uses Google's technology, which I think we can all agree does a pretty good job compared to other search engines. More information at http://www.google.com/cse/
Or you can implement it in .NET, for which I will try to give some pointers below.
A search engine in general consists of (some of) the following parts:
an index which is searched against
a query system which allows searches to be specified and results shown
a way to get documents into the index, like a crawler or some event that is handled when documents are created/published/updated
These are non-trivial things to create, especially if you want a rich feature set like stemming (returning documents containing plural forms of search terms), highlighting results, and indexing different document formats like PDF, RTF, HTML, etc., so you want to use something already made for this purpose. That would only leave the task of connecting and orchestrating the different parts, i.e. writing the flow-control logic.
You could use Lucene.Net, an open-source project with a lot of features. http://usoniandream.blogspot.com/2007/10/tutorial-implementing-lucenenet-search.html explains how to get started with it.
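A minimal indexing/search sketch with Lucene.Net, written against the 4.8 API (details differ between versions), just to show the moving parts; the field names, the page content and the search term are placeholders. It needs the Lucene.Net, Lucene.Net.Analysis.Common and Lucene.Net.QueryParser packages.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

class SiteSearch
{
    const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;

    static void Main()
    {
        var analyzer = new StandardAnalyzer(AppLuceneVersion);
        using var dir = FSDirectory.Open("search-index");

        // Index a page: store its URL and index its extracted text.
        using (var writer = new IndexWriter(dir, new IndexWriterConfig(AppLuceneVersion, analyzer)))
        {
            var doc = new Document();
            doc.Add(new StringField("url", "/pages/example.aspx", Field.Store.YES));
            doc.Add(new TextField("content", "text extracted from the page", Field.Store.YES));
            writer.AddDocument(doc);
        }

        // Search: parse the word from the textbox and list matching page URLs.
        using var reader = DirectoryReader.Open(dir);
        var searcher = new IndexSearcher(reader);
        var query = new QueryParser(AppLuceneVersion, "content", analyzer).Parse("text");
        foreach (var hit in searcher.Search(query, 10).ScoreDocs)
            Console.WriteLine(searcher.Doc(hit.Doc).Get("url"));
    }
}
```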
The other option is the Microsoft Indexing Service, which comes with Windows, but I would advise against it since it is difficult to tweak to work the way you want and the results are sub-optimal in my opinion.
You are going to need some sort of backing store and full-text indexing. To the best of my knowledge, C# alone is not enough.

An Easy Way to Consume a Twitter Search Feed With ASP.Net

Does anyone have any pointers on an easy way to consume a search.twitter.com feed with ASP.NET? I tried using the RSS Toolkit, but it doesn't provide anything for parsing the additional tags in the feed.
For example, I want to parse this feed, http://search.twitter.com/search.atom?q=c%23, and make it appear on a page just like it does in the Twitter search results (links and all).
Depending on what sort of hammer you prefer, you could use:
an XSL transform, which is easy if you know XSL and painful if you've never used it
loading the feed into an XmlDocument or XPathDocument and then iterating over the nodes you want (see the sketch below)
putting it into an XmlDataSource and binding that to a repeater
There are many other options too; these are just some of my preferred hammers.
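A small sketch of the second option, loading the Atom feed into an XmlDocument and walking the entries. The Atom namespace is standard; which child elements you pull out (title, link, author, published) depends on what you want to render, and the feed URL from the question has long since been retired by Twitter, so treat this purely as a parsing illustration.

```csharp
using System;
using System.Xml;

class TwitterFeed
{
    static void Main()
    {
        var doc = new XmlDocument();
        doc.Load("http://search.twitter.com/search.atom?q=c%23");

        // Atom elements live in this namespace, so an XmlNamespaceManager is needed.
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("atom", "http://www.w3.org/2005/Atom");

        foreach (XmlNode entry in doc.SelectNodes("//atom:entry", ns))
        {
            string title = entry.SelectSingleNode("atom:title", ns)?.InnerText;
            // Assumption: the per-entry web link uses rel='alternate', as is typical for Atom.
            string link = entry.SelectSingleNode("atom:link[@rel='alternate']", ns)
                               ?.Attributes["href"]?.Value;
            Console.WriteLine($"{title} -> {link}");
        }
    }
}
```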
