Having trouble displaying the node's content with HtmlAgilityPack - c#

I'm having trouble scraping data from this web address: http://patorjk.com/software/taag/#p=display&f=Graffiti&t=Type%20Something%20.
The problem is: I've written code that is supposed to grab the contents of a certain node and display them on the console. However, the node and its contents seem to be unreachable. I know they exist, because I added a check to my code that tells me whether nodes within a certain body are being found; the nodes are found but, for some reason, not displayed:
private static void getTextArt(string font, string word)
{
    HtmlWeb web = new HtmlWeb();
    //cureHtml method is just meant to return the http address
    HtmlDocument htmlDoc = web.Load(cureHtml(font, word));
    if(web.Load(cureHtml(font, word)) != null)
        Console.WriteLine("Connection Established");
    else
        Console.WriteLine("Connection Failed!");
    var nodes = htmlDoc.DocumentNode.SelectSingleNode(nodeXpath).ChildNodes;
    foreach(HtmlNode node in nodes)
    {
        if(node != null)
            Console.WriteLine("Node Found.");
        else
            Console.WriteLine("Node not found!");
        Console.WriteLine(node.OuterHtml);
    }
}
private const string nodeXpath = "//div[@id='maincontent']";
}
The Html displayed by the website looks like this:
[Image: the HTML code within the website. Arrows point at the node I'm trying to reach and the content within it that I'm trying to display on the console.]
When I run my code to check for the node and its contents and print the OuterHtml string of the XPath match, this is what the console displays:
[Image: console window display]
I hope some of you can explain why it behaves this way. I've spent two days searching Google trying to figure out the problem, to no avail. Thank you all in advance.

The content you want is loaded dynamically.
Use the HtmlWeb.LoadFromBrowser() method instead. Also, check htmlDoc for null instead of calling Load() a second time; the result of a second request tells you nothing about the document you actually stored in htmlDoc.
HtmlDocument htmlDoc = web.LoadFromBrowser(cureHtml(font, word));
if (htmlDoc != null)
    Console.WriteLine("Connection Established");
else
    Console.WriteLine("Connection Failed!");
Also, you'll need to decode the result.
Console.WriteLine(WebUtility.HtmlDecode(node.OuterHtml));
If this doesn't work, then your cureHtml() method is broken, or you're targeting .NET Core :)
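To make the null check and the decoding step concrete, here is a minimal sketch. It parses an inline HTML string instead of the live page, so the markup is a made-up stand-in (an assumption, not what the site really returns); only the id 'maincontent' comes from the question.

```csharp
using System;
using System.Net;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

class Program
{
    static void Main()
    {
        // Hypothetical stand-in for the rendered page; the real markup may differ.
        const string html =
            "<html><body><div id='maincontent'><pre>&lt;ascii art&gt;</pre></div></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches, so check before using it.
        HtmlNode container = doc.DocumentNode.SelectSingleNode("//div[@id='maincontent']");
        if (container == null)
        {
            Console.WriteLine("Node not found!");
            return;
        }

        foreach (HtmlNode child in container.ChildNodes)
        {
            // Decode entities such as &lt; and &gt; before printing.
            Console.WriteLine(WebUtility.HtmlDecode(child.OuterHtml));
        }
    }
}
```

Checking the result of SelectSingleNode itself, rather than each child inside the loop, is what tells you whether the XPath matched anything.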

Related

Scrape data from div in Windows.Form

I am new to C# programming. I am trying to scrape data from a div (I want to display the temperature from a web page in a Windows Forms application).
This is my code:
private void btnOnet_Click(object sender, EventArgs e)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    HtmlWeb web = new HtmlWeb();
    doc = web.Load("https://pogoda.onet.pl/");
    var temperatura = doc.DocumentNode.SelectSingleNode("/html/body/div[1]/div[3]/div/section/div/div[1]/div[2]/div[1]/div[1]/div[2]/div[1]/div[1]/div[1]");
    onet.Text = temperatura.InnerText;
}
This is the exception:
System.NullReferenceException:
temperatura was null.
You can use this:
public static bool TryGetTemperature(HtmlAgilityPack.HtmlDocument doc, out int temperature)
{
    temperature = 0;
    var temp = doc.DocumentNode.SelectSingleNode(
        "//div[contains(@class, 'temperature')]/div[contains(@class, 'temp')]");
    if (temp == null)
    {
        return false;
    }
    // The degree sign arrives encoded as the 5-character entity "&deg;", so strip 5 characters.
    var text = temp.InnerText.EndsWith("&deg;") ?
        temp.InnerText.Substring(0, temp.InnerText.Length - 5) :
        temp.InnerText;
    return int.TryParse(text, out temperature);
}
With XPath you can select your target more precisely. With your query, even a small change in the HTML structure will make your application fail. Some points:
// searches anywhere in the document
You search for any div whose class contains "temperature" and, inside that node:
you search for a child div whose class contains "temp"
If you get that node (!= null), you try to convert the degrees to an integer (after removing the degree sign)
And check:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
HtmlWeb web = new HtmlWeb();
doc = web.Load("https://pogoda.onet.pl/");
if (TryGetTemperature(doc, out int temperature))
{
    onet.Text = temperature.ToString();
}
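As a variation on the same idea, the sketch below decodes the entity with HtmlEntity.DeEntitize and trims the degree sign, rather than cutting a fixed number of characters. The inline markup is hypothetical (an assumption for illustration); the real page's structure and class names may differ.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

class Program
{
    static void Main()
    {
        // Hypothetical markup; the real page's structure and class names may differ.
        const string html =
            "<div class='weather temperature'><div class='temp'>21&deg;</div></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode temp = doc.DocumentNode.SelectSingleNode(
            "//div[contains(@class, 'temperature')]/div[contains(@class, 'temp')]");
        if (temp == null)
        {
            Console.WriteLine("Temperature node not found.");
            return;
        }

        // Turn "&deg;" into the degree sign, then trim it instead of counting characters.
        string text = HtmlEntity.DeEntitize(temp.InnerText).TrimEnd('°').Trim();

        if (int.TryParse(text, out int temperature))
            Console.WriteLine(temperature);
        else
            Console.WriteLine("Could not parse: " + text);
    }
}
```

Note that contains(@class, 'temp') also matches class names such as "temperature", so the query is looser than it looks.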
UPDATE
I updated TryGetTemperature a bit because the degree sign is encoded. The main problem is the HTML. When you request the source code, you get HTML that the browser later updates dynamically. So the HTML you get is of no use to you: it doesn't contain the temperature.
So, I see two alternatives:
You can use a browser control (WebBrowser, under Common Controls in the Forms Toolbox, alongside Button, Label...), insert it into your form, and navigate to the page. It's not difficult, but you need to learn a few things: wait for the event that signals the page has finished loading, then get the source code from the control. Also, I suppose you'll want to hide the browser control. Be careful: sometimes the browser doesn't work correctly when hidden. In that case, you can use a visible form positioned outside the desktop and handle the activation events so this window never gets activated. Also, hide it from the task switcher (Alt+Tab). Things get harder this way, but sometimes it is the only way.
The simple way is to search for the location you want (e.g. Madryt) and look in DevTools at the request that is made (e.g. https://pogoda.onet.pl/prognoza-pogody/madryt-396099). Use this URL and you get valid HTML.

C# Looking to obtain the <div> value using HtmlAgilityPack but receiving a System.NullReferenceException

I am trying to get the value of the div class "darkgreen" which is 46.98. I tried the following code but am getting a Null exception.
Below is the code I am trying:
private void button1_Click(object sender, EventArgs e)
{
    var doc = new HtmlWeb().Load("https://rotogrinders.com/grids/nba-defense-vs-position-cheat-sheet-1493632?site=fanduele");
    HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@class='darkgreen']");
    foreach (HtmlAgilityPack.HtmlNode node in nodes)
    {
        Console.WriteLine(node.InnerText);
    }
}
If I run the same code but with doc.DocumentNode.SelectNodes("//div[@class='rgt-hdr colorize']") it does pull the header data with no error.
I think child nodes may be a solution, but I am not sure, as I still can't get it to work.
Your problem is that the HTML you're looking at is created by JavaScript, and the HTML you load into your document variable is what exists before the JavaScript runs. If you look at the page source in your web browser, you will see the exact HTML that gets loaded into your HtmlDocument variable.
The example below will give you the data (JSON) that is used to create the table. I don't know whether that is enough for whatever you're trying to do.
public static void Main(string[] args)
{
    Console.WriteLine("Program Started!");
    HtmlDocument doc;
    doc = new HtmlWeb().Load("https://rotogrinders.com/grids/nba-defense-vs-position-cheat-sheet-1493632?site=fanduele");
    HtmlNode node = doc.DocumentNode.SelectSingleNode("//section[@class='bdy content article full cflex reset long table-page']/following-sibling::script[1]");
    int start = node.InnerText.IndexOf("[");
    int length = node.InnerText.IndexOf("]") - start +1;
    Console.WriteLine(node.InnerText.Substring(start, length));
    Console.WriteLine("Program Ended!");
    Console.ReadKey();
}
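Once you have the JSON text, you will probably want to read values out of it rather than just print it. The following is a rough sketch using System.Text.Json (built into .NET Core 3.0 and later, a NuGet package on .NET Framework). The script content and the field names "team" and "value" are invented for illustration; the real page's data will look different.

```csharp
using System;
using System.Text.Json;

class Program
{
    static void Main()
    {
        // Hypothetical script content; the real page's variable names and fields will differ.
        const string script = "var data = [{\"team\":\"ABC\",\"value\":46.98}];";

        // Take everything from the first '[' to the last ']' so nested arrays stay intact.
        int start = script.IndexOf('[');
        int end = script.LastIndexOf(']');
        if (start < 0 || end < start)
        {
            Console.WriteLine("No JSON array found.");
            return;
        }

        string json = script.Substring(start, end - start + 1);

        using (JsonDocument parsed = JsonDocument.Parse(json))
        {
            foreach (JsonElement row in parsed.RootElement.EnumerateArray())
            {
                string team = row.GetProperty("team").GetString();
                double value = row.GetProperty("value").GetDouble();
                Console.WriteLine(team + ": " + value);
            }
        }
    }
}
```

Using LastIndexOf for the closing bracket is a deliberate difference from the snippet above: stopping at the first ']' would cut the data short if any row contains an array of its own.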
Alternative solution
Alternatively, you can use Selenium with PhantomJS: load the HTML from the headless browser into your document variable, and then your XPath will work.
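A rough sketch of that approach follows. PhantomJS is no longer maintained, so this version swaps in headless Chrome through Selenium's ChromeDriver instead; that substitution is my assumption, not part of the original suggestion. It needs the Selenium.WebDriver and HtmlAgilityPack NuGet packages plus a ChromeDriver matching the installed Chrome, and the page may need an explicit wait before its scripts have finished.

```csharp
using System;
using HtmlAgilityPack;         // NuGet package: HtmlAgilityPack
using OpenQA.Selenium.Chrome;  // NuGet package: Selenium.WebDriver

class Program
{
    static void Main()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless"); // run Chrome without a visible window

        using (var driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(
                "https://rotogrinders.com/grids/nba-defense-vs-position-cheat-sheet-1493632?site=fanduele");

            // PageSource reflects the DOM after scripts have run (a wait may still be needed).
            var doc = new HtmlDocument();
            doc.LoadHtml(driver.PageSource);

            // SelectNodes returns null when nothing matches, so guard the loop.
            HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@class='darkgreen']");
            if (nodes == null)
            {
                Console.WriteLine("No matching nodes found.");
                return;
            }

            foreach (HtmlNode node in nodes)
            {
                Console.WriteLine(node.InnerText);
            }
        }
    }
}
```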

xPath is wrong given by the Browser or HTMLAgilityPack cannot use xPath?

I'm trying to get all languages from Google Translate. When I open Developer Tools and click one of the languages while the full language list is open (after clicking the arrow), it gives //*[@id=':7']/div/text() for Arabic, but I get null when I try to select the node:
async Task AddLanguages()
{
    try
    {
        // //*[@id=":6"]/div/text()
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);
        for (int i = 6; i <= 9; i++)
        {
            //*[@id=":6"]/div/text() //*[@id=":6"]/div/div
            Debug.WriteLine(i);
            var element = document.DocumentNode.SelectSingleNode("//*[@id=':7']/div/text()");
            Trace.WriteLine(element == null, "Element is null");
        }
    }
    catch (Exception e)
    {
        // Message text is Turkish: "Error!", "An error occurred while loading the languages."
        this.ShowMessageAsync("Hata!", "Dilleri yüklerken hata ortaya çıktı.");
    }
}
Element is null: True is output every time (I was trying to use a for loop to go through the languages, but it doesn't even work for a single one!)
I guess your xpath is wrong. You can try something like:
string Url = "https://translate.google.com/";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
var arabic = doc.DocumentNode.Descendants("div").FirstOrDefault(_ => _.ChildNodes.Any(node => node.Name.Equals("#text") && node.InnerText.Equals("Arabic")));
Since I can't comment yet... Have you tried clicking on the dropdown first before looking for the elements?
Clicking on //*[@id='gt-sl-gms'] or its inner div would make the elements visible.
That should work.
Anyway, I can't make $x work in the Google Chrome console. I'm currently getting an Uncaught TypeError. Not sure if that has anything to do with it.
Edit: Oh wait, I think I know your problem. On closer inspection, it seems that the element (div) has another div before the text, so try //*[@id=':7']/div/text()[2]

Get the documentdata from a webbrowser control in another application

I'm looking for a way to get the document information (or document text) from another application's WebBrowser control (and possibly alter it).
The other application is written in .net, but not by me.
I'm looking for an ability like this:
I would like an eventhandler for the OnDocumentCompleted that can get me the information of that document.
If possible, I would also like to intercept certain pages, add some HTML, and send them back to the second app to be displayed.
Searching the web pointed me towards using 'Hooks', but I couldn't find much about using hooks in this situation.
Hope you can help me out
Anthony
This code provides an example of HTML parsing that returns plain text (the parsing depends on page content).
private string GetPlainText(WebBrowser webBrowser)
{
    StringBuilder sb = new StringBuilder();
    // Pick out a heading.
    foreach (HtmlElement h1 in webBrowser.Document.GetElementsByTagName("H1"))
        sb.Append(h1.InnerText + ". ");
    // Select only some text, ignoring everything else.
    foreach (HtmlElement div in webBrowser.Document.GetElementsByTagName("DIV"))
        if (div.GetAttribute("classname") == "story-body")
            foreach (HtmlElement p in div.GetElementsByTagName("P"))
            {
                string classname = p.GetAttribute("classname");
                if (classname == "introduction" || classname == "") sb.Append(p.InnerText + " ");
            }
    return sb.ToString();
}
}

How to get XML-code of webpage that is opened in IE (without using WebRequest)?

I'm trying to get the XML text from a web page that is already open in IE. Web requests are not allowed because of the target page's security (a long, boring story with certificates etc.). I use a method that walks through all open pages and, if a page's URI matches, I need to get its XML.
Some time ago I needed to get the HTML code between the body tags. I used a method with IHTMLDocument2 like this:
private string GetSourceHTML()
{
    Regex reg = new Regex(patternURL);
    Match match;
    string result;
    foreach (SHDocVw.InternetExplorer ie in shellWindows)
    {
        match = reg.Match(ie.LocationURL.ToString());
        if (!string.IsNullOrEmpty(match.Value))
        {
            mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)ie.Document;
            result = doc.body.innerHTML.ToString();
            return result;
        }
    }
    result = string.Empty;
    return result;
}
So now I need to get the whole XML code of the target page. I've googled a lot, but didn't find anything useful. Any ideas? Thanks.
Have you tried this? It should get the HTML, which hopefully you could parse to XML?
Retrieving the HTML source code
