Well, I used to use HtmlAgilityPack with XPath to scrape some info from websites, but I have read that CSS selectors are much faster, so I searched for a good CSS engine and found CsQuery. However, I am still confused, as I don't know how to get the CSS path of an element.
In XPath I used a Firefox plugin called XPath Checker that returned fine XPaths like this:
id('yt-masthead-signin')/button
But I can't find an equivalent one for CSS. So if someone could help me, I would really appreciate it, because I can't find an answer on Google for my question specifically.
Install Firebug + Firepath
Click the selecting button to select something on the page; it can then generate either an XPath or a CSS selector. However, you will need to make some changes to the generated ones to make them more efficient.
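As a rough illustration of the CSS-selector side, here is a minimal CsQuery sketch that selects the same element as the XPath id('yt-masthead-signin')/button from the question; the inline HTML is a stand-in for a downloaded page:

using CsQuery;

class Example
{
    static void Main()
    {
        // Stand-in markup; in practice you would load a real page,
        // e.g. with CQ.CreateFromUrl(...).
        var html = "<div id='yt-masthead-signin'><button>Sign in</button></div>";
        CQ dom = CQ.Create(html);

        // CSS equivalent of the XPath id('yt-masthead-signin')/button
        CQ button = dom["#yt-masthead-signin > button"];
        System.Console.WriteLine(button.Text()); // "Sign in"
    }
}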
I am working on a website where no other locator works except using FindElements and taking the 3rd <a> element, so I was curious to try XPath for the first time.
I could get the XPath in Chrome, but when I use it in WebDriver, it says the element is not found.
I did a lot of searching and still couldn't find out what was wrong. So I tried the Facebook page and used the login field as a test. The XPath is //*[@id="email"]; it works perfectly in Chrome, but gives the same result in WebDriver.
C# code: driver.FindElement(By.XPath("//*[@id='email']"));
Any advice?
I can give a complete solution in Python, taking into account the features of React (which Facebook uses).
But you have C#, so you can use the equivalent of Python's driver.execute_script (executing JavaScript through Selenium); a C# version follows the snippet below.
driver.get("https://www.facebook.com/")
driver.execute_script("""
    document.getElementById("email").value = "lg@hosct.com";
    document.getElementById("u_0_2").click();
""")
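For reference, a rough C# equivalent using Selenium's IJavaScriptExecutor; the element ids come from the Python snippet above, and u_0_2 in particular is an auto-generated id that may differ on your page:

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class Program
{
    static void Main()
    {
        IWebDriver driver = new ChromeDriver();
        driver.Navigate().GoToUrl("https://www.facebook.com/");

        // Cast the driver to IJavaScriptExecutor to run the same JavaScript.
        var js = (IJavaScriptExecutor)driver;
        js.ExecuteScript(
            "document.getElementById('email').value = 'lg@hosct.com';" +
            // 'u_0_2' is the auto-generated id from the snippet above.
            "document.getElementById('u_0_2').click();");
    }
}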
I did another try with cleaner code:
driver.Url = "https://www.facebook.com/";
driver.FindElement(By.XPath("//*[@id='email']"));
It works now. The only difference between this and my earlier code is that before, I was visiting some other pages prior to the Facebook page; this seems to make a difference. Anyway, the above code works. If I encounter the issue again, I will post more detailed code.
I am currently using HtmlAgilityPack for some web scraping; however, I've encountered a website whose content is generated by script tags, and I am unable to load it for scraping. I have little experience with the web and am unsure how to properly load the webpage and convert it back into something HtmlAgilityPack can parse.
Pretty much, when I inspect an element in Chrome, there is a table, but HtmlAgilityPack reads a script tag.
Any help would be appreciated.
Thank you
I have had similar problems too. It is very annoying that there is not one unified method that works on all websites from a C# console.
However, depending on the site you are looking at, there may be some information in meta tags in the head section of the HTML. When I was making an application to get a YouTube subscription count, I found the count in a meta tag (I assume this information is there for scripts to use). This may be similar for the web page you are scraping.
To do this I first added a
doc.Save(/* path where the HTML file needs to go */);
then I opened the HTML document in Google Chrome, opened up dev tools, and did a search for "Subscriptions" (replace this with whatever you are looking for). Hopefully, depending on the website you are scraping, there will be a tag with some info in it for you.
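To make those steps concrete, here is a minimal HtmlAgilityPack sketch along these lines; the URL and the meta tag name are assumptions, so inspect the saved HTML to find the tag that actually carries your value:

using HtmlAgilityPack;

class MetaScrape
{
    static void Main()
    {
        var web = new HtmlWeb();
        // Hypothetical URL -- replace with the page you are scraping.
        HtmlDocument doc = web.Load("https://www.youtube.com/user/SomeChannel");

        // Save a copy so you can search it in Chrome's dev tools, as described above.
        doc.Save("page.html");

        // The meta tag name here is an assumption; use whichever one you find in the saved HTML.
        HtmlNode meta = doc.DocumentNode.SelectSingleNode("//head/meta[@name='description']");
        if (meta != null)
            System.Console.WriteLine(meta.GetAttributeValue("content", ""));
    }
}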
Good Luck! :)
How can I force HtmlAgilityPack to use Chrome's interpretation of something in XPath?
For example, these two XPaths point to the exact same element on the web page; however, they are completely different.
For Chrome:
/html/body[#class=' hasGoogleVoiceExt']/div[#class='fjfe-bodywrapper']/div[#id='fjfe-real-body']/div[#id='fjfe-click-wrapper']/div[#id='appbar']/div[#class='elastic']/div[#class='appbar-center']/div[#class='appbar-snippet-primary']/span
For Firefox:
//*[#id='appbar']/div/div[2]/div[1]/span
I would like to use Chrome's; however, I receive null for both queries.
The Html Agility Pack has no dependency on any browser whatsoever. It uses the .NET XPath implementation. You can't change this unless you rewrite it completely.
The HTML you see in a browser can be very different from the HTML you download from a URL, because the former may have been modified by dynamic code (JavaScript, DHTML).
If you can post the actual HTML or URL, we could help you more.
Here is what I found using an XPath copied from Chrome: I had to remove all of the tbody elements and double up the forward slashes, and then the following code would return the proper element.
doc.DocumentNode.SelectSingleNode(
"//html//body//center//table[3]//tr//td//table//tr//td//table//tr//td//table[3]//tr[3]//td[3]//table//tr//td//table");
I'm trying to make an extended version of a WebBrowser with features like highlighting text and getting properties or attributes of elements, for a web scraper. The WebBrowser functions don't help much at all, so if I could just find a way to go from an HtmlElement to a JavaScript element (like the one returned by document.getElementById), and back, and then add JavaScript functions to the HTML from my application, it would make the job a lot easier.
Right now I'm modifying the HTML programmatically from C# and it's very messy. I was thinking about setting some unique id on each HTML element from my program and then calling the JavaScript document.getElementById to retrieve it, but that won't work: they might already have an id assigned, and I would mess up their HTML code. I don't know if I can give them some made-up attribute like my_very_own_that_i_hope_no_web_page_on_the_world_ever_uses_attribute and then figure out if there is some JavaScript function getElementByWhateverAttributeIWant, but I'm not sure this would work. I read something about expando (extended) attributes in the DOM documentation on MSDN, but I'm not sure what that is about. Maybe some of you guys have a better way.
It would be much easier to use some rendering engine like Trident to get the data from the HTML document. Here is the link for Trident/MSHTML; you can Google around and find samples in C#.
This is not nearly as hard as you imagine. You don't have to modify the document at all.
Once the WebBrowser has loaded a page, it's kept internally as a tree with the document node at the root. This node is available to your program, and you can find any element you want (or just enumerate them all) by walking the tree.
If you can give a concrete example, I can supply some code.
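In the meantime, here is a minimal sketch of what walking that tree looks like, assuming a Form with a WebBrowser control named webBrowser1 whose DocumentCompleted event is wired to this handler:

using System;
using System.Windows.Forms;

public partial class ScraperForm : Form
{
    void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // Walk the whole tree recursively from the root...
        DumpTree(webBrowser1.Document.Body, 0);

        // ...or enumerate every element flat -- no ids or markup changes needed.
        foreach (HtmlElement el in webBrowser1.Document.All)
            if (el.TagName == "A")
                Console.WriteLine(el.GetAttribute("href"));
    }

    static void DumpTree(HtmlElement element, int depth)
    {
        Console.WriteLine(new string(' ', depth * 2) + element.TagName);
        foreach (HtmlElement child in element.Children)
            DumpTree(child, depth + 1);
    }
}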
I am looking for a C# library that would translate HTML code (and the CSS specified in it) into a DOM tree for simpler parsing. I am looking for something similar to this one (which is in PHP):
http://simplehtmldom.sourceforge.net/
Of course I know I could embed a browser control, but I am looking for something more efficient.
Check out the HTML Agility Pack. It hasn't been updated in a while, but it still works very well.
I second Mr. Dorman on the HtmlAgilityPack. I did a brief blog post on web scraping some time ago; it mentions the 'pack, but mostly discusses other details. Depending on your application, it might be of some use.
We have used HtmlAgilityPack here in our project to extract specific HTML tags with a given set of attributes using XPath, and it has never failed us.
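For illustration, a minimal sketch of that pattern (an XPath with an attribute predicate), using stand-in markup; in practice the document would come from HtmlWeb.Load:

using HtmlAgilityPack;

class AttributeScrape
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div class='item' data-id='1'>A</div><div class='item' data-id='2'>B</div>");

        // Select every div whose class attribute equals 'item'.
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='item']"))
            System.Console.WriteLine(node.GetAttributeValue("data-id", "?") + ": " + node.InnerText);
    }
}

Note that SelectNodes returns null when nothing matches, so guard for that on real pages.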
There is no way to get the DOM with styles like that. The only option is the Selenium framework, which works with a real browser.
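If you go that route, here is a minimal Selenium sketch of reading a computed style in C#; the URL and the h1 element are stand-ins:

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class ComputedStyle
{
    static void Main()
    {
        IWebDriver driver = new ChromeDriver();
        driver.Navigate().GoToUrl("https://example.com"); // hypothetical page

        // The browser has actually applied the CSS, so Selenium can return
        // computed values that a plain HTML parser never sees.
        IWebElement heading = driver.FindElement(By.TagName("h1"));
        System.Console.WriteLine(heading.GetCssValue("font-size"));

        driver.Quit();
    }
}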