I am new to C# programming. I am trying to scrape data from a div (I want to display the temperature from a web page in a Windows Forms application).
This is my code:
private void btnOnet_Click(object sender, EventArgs e)
{
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
HtmlWeb web = new HtmlWeb();
doc = web.Load("https://pogoda.onet.pl/");
var temperatura = doc.DocumentNode.SelectSingleNode("/html/body/div[1]/div[3]/div/section/div/div[1]/div[2]/div[1]/div[1]/div[2]/div[1]/div[1]/div[1]");
onet.Text = temperatura.InnerText;
}
This is the exception:
System.NullReferenceException:
temperatura was null.
You can use this:
public static bool TryGetTemperature(HtmlAgilityPack.HtmlDocument doc, out int temperature)
{
temperature = 0;
var temp = doc.DocumentNode.SelectSingleNode(
"//div[contains(@class, 'temperature')]/div[contains(@class, 'temp')]");
if (temp == null)
{
return false;
}
var text = temp.InnerText.EndsWith("&deg;") ?
temp.InnerText.Substring(0, temp.InnerText.Length - 5) :
temp.InnerText;
return int.TryParse(text, out temperature);
}
With XPath you can select your target more precisely. With your query, a small change in the HTML structure will make your application fail. Some points:
// searches anywhere in the document
You search for any div whose class contains "temperature" and, inside that node:
you search for a child div whose class contains "temp"
If you get that node (!= null), you try to convert the degrees to a number (removing the degree sign first)
And check:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
HtmlWeb web = new HtmlWeb();
doc = web.Load("https://pogoda.onet.pl/");
if (TryGetTemperature(doc, out int temperature))
{
onet.Text = temperature.ToString();
}
UPDATE
I updated TryGetTemperature slightly because the degree sign is HTML-encoded. The main problem, though, is the HTML itself. When you request the page source, you get HTML that the browser later updates dynamically, so the HTML you receive is of no use to you: it doesn't contain the temperature.
So I see two alternatives:
You can use a browser control (Common Controls -> WebBrowser, in the form designer's Toolbox alongside Button, Label...), add it to your form, and navigate to the page. It's not difficult, but you need to learn a few things: wait for the event that signals the page has finished downloading, then get the source code from the control. I suppose you'll also want to hide the browser control. Be careful: sometimes the browser doesn't work correctly when it is hidden. In that case, you can use a visible form positioned outside the desktop and handle the activation events so the window never becomes active. Also hide it from the task switcher (Alt+Tab). Things get harder this way, but sometimes it is the only way.
The simple way is to search for the location you want (e.g. Madryt) and look in DevTools at the request that is made (e.g. https://pogoda.onet.pl/prognoza-pogody/madryt-396099). Use this URL and you get valid HTML.
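On the encoded degree sign mentioned in the update: rather than cutting a fixed number of characters off the end, one option is to decode the entities first and then trim the sign. The following is only a minimal sketch under that assumption; `TemperatureText` and its `TryParse` method are hypothetical names, not part of HtmlAgilityPack, and the input is whatever `InnerText` returned for the temperature node.

```csharp
using System;
using System.Globalization;
using System.Net;

public static class TemperatureText
{
    // Hypothetical helper: turns text such as "23&deg;" or " -4° " into an int.
    public static bool TryParse(string rawInnerText, out int temperature)
    {
        temperature = 0;
        if (string.IsNullOrWhiteSpace(rawInnerText))
        {
            return false;
        }

        // Decode entities first, so "&deg;" becomes the single character '°'.
        string decoded = WebUtility.HtmlDecode(rawInnerText);

        // Drop surrounding whitespace and a trailing degree sign, if present.
        string digits = decoded.Trim().TrimEnd('°').Trim();

        // Accept an optional leading '-' for temperatures below zero.
        return int.TryParse(digits, NumberStyles.AllowLeadingSign,
            CultureInfo.InvariantCulture, out temperature);
    }
}
```

Inside TryGetTemperature this could replace the Substring step, e.g. `return TemperatureText.TryParse(temp.InnerText, out temperature);`. It works whether the degree sign arrives as an entity or as a literal character.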
Related
I'm having trouble with datascraping on this web address: http://patorjk.com/software/taag/#p=display&f=Graffiti&t=Type%20Something%20.
The problem: I've written code that is supposed to grab the contents of a certain node and display it on the console. However, the node and its contents seem to be unreachable. I know they exist, because I added a condition to my code that tells me whether nodes within a certain body are being found, and they are indeed found, but for some reason not displayed:
private static void getTextArt(string font, string word)
{
HtmlWeb web = new HtmlWeb();
//cureHtml method is just meant to return the http address
HtmlDocument htmlDoc = web.Load(cureHtml(font, word));
if(web.Load(cureHtml(font, word)) != null)
Console.WriteLine("Connection Established");
else
Console.WriteLine("Connection Failed!");
var nodes = htmlDoc.DocumentNode.SelectSingleNode(nodeXpath).ChildNodes;
foreach(HtmlNode node in nodes)
{
if(node != null)
Console.WriteLine("Node Found.");
else
Console.WriteLine("Node not found!");
Console.WriteLine(node.OuterHtml);
}
}
private const string nodeXpath = "//div[@id='maincontent']";
}
The Html displayed by the website looks like this:
The Html code within the website. Arrows point at the node I'm trying to reach and the content within it I'm trying to display on the console
When I run my code on console to check for the node and its contents and try to display the OuterHtml string of the Xpath, this is how console will display it to me:
Console Window Display
I hope some of you can explain why it behaves this way. I've spent two days searching Google trying to figure out the problem, to no avail. Thank you all in advance.
The content you desire is loaded dynamically.
Use the HtmlWeb.LoadFromBrowser() method instead. Also, check htmlDoc for null instead of calling Load a second time; your current check tests a different document from the one you go on to use.
HtmlDocument htmlDoc = web.LoadFromBrowser(cureHtml(font, word));
if (htmlDoc != null)
Console.WriteLine("Connection Established");
else
Console.WriteLine("Connection Failed!");
Also, you'll need to decode the result.
Console.WriteLine(WebUtility.HtmlDecode(node.OuterHtml));
If this doesn't work, then your cureHtml() method is broken, or you're targeting .NET Core :)
I'm trying to determine if there's specific text on the page. I'm doing this:
public static void WaitForPageToLoad(this IWebDriver driver, string textOnPage)
{
var pageSource = driver.PageSource.ToLower();
var timeOut = 0;
while (timeOut < 60)
{
Thread.Sleep(1000);
if (pageSource.Contains(textOnPage.ToLower()))
{
timeOut = 60;
}
}
}
The problem is that the web driver's PageSource property isn't updated after the initial load. The page I'm navigating to loads a bunch of data via JS after the page has already loaded. I don't control the site, so I'm trying to figure out a method to get the updated HTML.
You are trying to solve the wrong problem. You need to wait for the text to appear using an XPath locator:
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(60));
var xpath = $"//*[contains(., '{textOnPage}')]";
wait.Until(ExpectedConditions.ElementIsVisible(By.XPath(xpath)));
Do you really need to search the entire page?
I'll refer you to this answer: https://stackoverflow.com/a/41223770/1387701
which uses this code (Java bindings):
String Verifytext= driver.findElement(By.tagName("body")).getText().trim();
You can then check to see if the Verifytext contains the string you're checking for.
This works MUCH better if you can narrow the location of the text down to a particular webElement other than the body.
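A side note on the loop in the question: it reads PageSource once, before the loop starts, and never advances timeOut, so the check can never see newer content. A minimal sketch of a polling wait that re-evaluates its condition on every pass is shown here; `Polling.WaitUntil` is a hypothetical helper, not part of Selenium, and it assumes the driver returns the current DOM each time PageSource is read.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class Polling
{
    // Hypothetical helper: re-evaluates `condition` until it returns true
    // or `timeout` has elapsed. Returns false on timeout.
    public static bool WaitUntil(Func<bool> condition, TimeSpan timeout, TimeSpan interval)
    {
        var clock = Stopwatch.StartNew();
        while (true)
        {
            // The condition is invoked again on every pass, so it sees fresh state.
            if (condition())
            {
                return true;
            }
            if (clock.Elapsed >= timeout)
            {
                return false;
            }
            Thread.Sleep(interval);
        }
    }
}
```

With a web driver it might be called as `Polling.WaitUntil(() => driver.PageSource.IndexOf(textOnPage, StringComparison.OrdinalIgnoreCase) >= 0, TimeSpan.FromSeconds(60), TimeSpan.FromSeconds(1))`. The lambda reads PageSource on each pass, which is the part the original loop was missing.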
I am trying to get the value of the div class "darkgreen" which is 46.98. I tried the following code but am getting a Null exception.
Below is the code I am trying:
private void button1_Click(object sender, EventArgs e)
{
var doc = new HtmlWeb().Load("https://rotogrinders.com/grids/nba-defense-vs-position-cheat-sheet-1493632?site=fanduele");
HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@class='darkgreen']");
foreach (HtmlAgilityPack.HtmlNode node in nodes)
{
Console.WriteLine(node.InnerText);
}
}
If I run the same code but with doc.DocumentNode.SelectNodes("//div[@class='rgt-hdr colorize']") it does pull the header data with no error.
I am thinking that maybe child nodes may be a solution but I am not sure as I am unable to get it to work still.
Your problem is that the HTML you're looking at is created by JavaScript, and the HTML you load into your document variable is what exists before that script runs. If you view the page source in your web browser, you will see the exact HTML that gets loaded into your HtmlDocument variable.
The example below will give you the data (JSON) that is used to create the table. I don't know whether that is enough for whatever you're trying to do.
public static void Main(string[] args)
{
Console.WriteLine("Program Started!");
HtmlDocument doc;
doc = new HtmlWeb().Load("https://rotogrinders.com/grids/nba-defense-vs-position-cheat-sheet-1493632?site=fanduele");
HtmlNode node = doc.DocumentNode.SelectSingleNode("//section[@class='bdy content article full cflex reset long table-page']/following-sibling::script[1]");
int start = node.InnerText.IndexOf("[");
int length = node.InnerText.IndexOf("]") - start +1;
Console.WriteLine(node.InnerText.Substring(start, length));
Console.WriteLine("Program Ended!");
Console.ReadKey();
}
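Once the JSON text has been cut out of the script, it can be parsed rather than printed. The sketch below is only an illustration under stated assumptions: `EmbeddedJson` is a hypothetical helper, the property name passed in has to match whatever the real data uses (not known here), and it relies on System.Text.Json, which ships with .NET Core 3.0 and later or as a NuGet package. Unlike the example above, it cuts at the last `]` so that nested arrays are not truncated, which in turn assumes no other `]` follows the data in the script.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class EmbeddedJson
{
    // Hypothetical helper: returns the text from the first '[' to the last ']',
    // or null when the script text contains no such pair.
    public static string ExtractArray(string scriptText)
    {
        int start = scriptText.IndexOf('[');
        int end = scriptText.LastIndexOf(']');
        return (start >= 0 && end > start)
            ? scriptText.Substring(start, end - start + 1)
            : null;
    }

    // Collects the numeric value of `propertyName` from every object in the array.
    // JsonDocument.Parse throws JsonException if the extracted text is not valid JSON.
    public static List<double> ReadNumbers(string scriptText, string propertyName)
    {
        var numbers = new List<double>();
        string json = ExtractArray(scriptText);
        if (json == null)
        {
            return numbers;
        }

        using (JsonDocument parsed = JsonDocument.Parse(json))
        {
            foreach (JsonElement row in parsed.RootElement.EnumerateArray())
            {
                // Skip rows that are not objects or lack a numeric property.
                if (row.ValueKind == JsonValueKind.Object
                    && row.TryGetProperty(propertyName, out JsonElement value)
                    && value.ValueKind == JsonValueKind.Number)
                {
                    numbers.Add(value.GetDouble());
                }
            }
        }
        return numbers;
    }
}
```

In the example above, `EmbeddedJson.ReadNumbers(node.InnerText, "...")` would take the place of the Substring call, with the property name filled in after inspecting the printed JSON.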
Alternative solution
Alternatively, you can use Selenium with PhantomJS: load the HTML from the headless browser into your document variable, and then your XPath will work.
I am using HTMLElementCollection, HtmlElement to iterate through a website and using Get/Set attributes of a website HTML and returning it to a ListView. Is it possible to get values from website a and website b to return it to the ListView?
HtmlElementCollection oCol1 = oDoc.Body.GetElementsByTagName("input");
foreach (HtmlElement oElement in oCol1)
{
if (oElement.GetAttribute("id").ToString() == "search")
{
oElement.SetAttribute("value", m_sPartNbr);
}
if (oElement.GetAttribute("id").ToString() == "submit")
{
oElement.InvokeMember("click");
}
}
HtmlElementCollection oCol1 = oDoc.Body.GetElementsByTagName("tr");
foreach (HtmlElement oElement1 in oCol1)
{
if (oElement1.GetAttribute("data-mpn").ToString() == m_sPartNbr.ToUpper())
{
HtmlElementCollection oCol2 = oElement1.GetElementsByTagName("td");
foreach (HtmlElement oElement2 in oCol2)
{
if (oElement2 != null)
{
if (oElement2.InnerText != null)
{
if (oElement2.InnerText.StartsWith("$"))
{
string sPrice = oElement2.InnerText.Replace("$", "").Trim();
double dblPrice = double.Parse(sPrice);
if (dblPrice > 0)
m_dblPrices.Add(dblPrice);
}
}
}
}
}
}
As one of the comments mentioned, the better approach would be to use HttpWebRequest to send a GET request to www.bestbuy.com or whatever site. What it returns is the full HTML code (what you see), which you can then parse. This kind of approach keeps you from sending too many requests and getting blacklisted. If you need to click a button or type in a text field, it's best to mimic human input, also to avoid being blacklisted. I would suggest injecting a simple JavaScript snippet into the page header or body and executing it from the app, either to send an 'onClick' event from the button (which would then reply with a new page to parse or display) or to modify the text property of something.
This example is in C++/CX, but it originally came from a C# example. The script sets the username and password text fields, then clicks the login button:
String^ script = "document.getElementById('username-text').value='myUserName';document.getElementById('password-txt').value='myPassword';document.getElementById('btn-go').click();";
auto args = ref new Platform::Collections::Vector<Platform::String^>();
args->Append(script);
create_task(wv->InvokeScriptAsync("eval", args)).then([this](Platform::String^ response){
//LOGIN COMPLETE
});
//notes: wv = webview
EDIT:
As pointed out, the absolute best approach would be to get/request an API. I was surprised to see the site that mason pointed out for Best Buy developers. Personally, I have only tried to work with auto parts stores, who either laugh while saying I can't afford it or have no idea what I'm asking for and hang up (when calling corporate).
EDIT 2: In my code, the site used was AutoZone. I had to use Chrome Developer Tools (F12) to get the names of the username field, password field, and button. From the developer tools you can also watch what is sent from your computer to the site/server. This allows you to recreate everything and mimic JavaScript input and actions using POST/GET with HttpWebRequest.
I have a C# program that logs into a portal, and then needs to test if an element with a specific ID exists on the page or not. In order to test for this, I grab the HTML from the page and search for an element with a matching ID in the HTML.
However, whenever I try to access the HTML with this script, it always returns the HTML of the portal login page, and not the page after one logs in through the portal. I can confirm 100% that the program is logging into the portal, however for some reason it is still returning the wrong HTML.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
WebClient client = new WebClient();
string html = client.DownloadString(this.currentStepUrl);
doc.LoadHtml(html);
var foo = (from bar in doc.DocumentNode.DescendantNodes()
where bar.GetAttributeValue("id", null) == expected
select bar).FirstOrDefault();
if (foo != null)
{
currentTestCaseResults[0]++;
}
else
{
currentTestCaseResults[1]++;
}
Easy fix I guess:
Replaced everything but the if clause with
HtmlElement expectedElement = webBrowser2.Document.GetElementById(expected);