I am trying to access the title attribute in the colors section.
If you hover over any of the small images on the right side of the product, you will see that the tooltip shows the color name.
I've managed to navigate there, but I can't for the life of me figure out how to get the title attribute. Currently it prints out the nodes, but I want to access the title attribute.
How do I properly access the title attribute and print out the color that corresponds to each picture?
This is the test link I am using (AliExpress)
Console.WriteLine("Product URL: ");
// Declare the URL
string url = Console.ReadLine();
// HtmlWeb - a utility class to get an HTML document over HTTP
var web = new HtmlWeb();
// Load() downloads the specified HTML document from an Internet resource
var doc = web.Load(url);
var nodes = doc.DocumentNode.SelectNodes("//li[@class = 'item-sku-image']");
foreach (var node in nodes)
{
    //var colors = node.Attributes["/a[title]"].Value;
    Console.WriteLine(node);
}
Console.ReadLine();
You can try iterating over the following XPath: //li[@class = 'item-sku-image']/a/img/@title. Alternatively, replace //li[@class = 'item-sku-image'] with //li[@class = 'item-sku-image']/a/img and then read the title attribute of each node.
It should yield a series of strings which contain the title you are after.
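A minimal sketch of the second suggestion, assuming the color thumbnails are marked up as li.item-sku-image > a > img with the title on the img. The inline HTML and the color names are invented stand-ins for the real page, and ColorTitles.Extract is a hypothetical helper name. As far as I know, HtmlAgilityPack's SelectNodes hands back element nodes rather than attribute values, so the title is read from each img with GetAttributeValue.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class ColorTitles
{
    // Returns the title of every <img> found under li.item-sku-image > a.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var titles = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var images = doc.DocumentNode.SelectNodes("//li[@class='item-sku-image']/a/img");
        if (images == null)
        {
            return titles;
        }

        foreach (var img in images)
        {
            // The second argument is the fallback used when the attribute is missing.
            var title = img.GetAttributeValue("title", "");
            if (title.Length > 0)
            {
                titles.Add(title);
            }
        }
        return titles;
    }

    static void Main()
    {
        // Hypothetical markup, not copied from the real product page.
        const string sample =
            "<ul>" +
            "<li class='item-sku-image'><a><img title='Black' src='a.jpg'></a></li>" +
            "<li class='item-sku-image'><a><img title='Navy Blue' src='b.jpg'></a></li>" +
            "</ul>";

        foreach (var title in Extract(sample))
        {
            Console.WriteLine(title);
        }
    }
}
```

For a live page, the same XPath can be run against the document returned by web.Load(url), provided the thumbnails are present in the HTML the server sends.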
Related
I want to scrape an element's value from a URL with C#, but my program doesn't work.
I tested many code samples, but I could not get the element value.
using HtmlAgilityPack;
var web = new HtmlWeb();
var doc = web.Load("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=5516102131364383");
var element = doc.DocumentNode.SelectSingleNode("//div[contains(@class,'ltr inline')]");
MessageBox.Show(element.InnerText);
I am new to C# programming. I am trying to scrape data from a div (I want to display the temperature from a web page in a Forms application).
This is my code:
private void btnOnet_Click(object sender, EventArgs e)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    HtmlWeb web = new HtmlWeb();
    doc = web.Load("https://pogoda.onet.pl/");
    var temperatura = doc.DocumentNode.SelectSingleNode("/html/body/div[1]/div[3]/div/section/div/div[1]/div[2]/div[1]/div[1]/div[2]/div[1]/div[1]/div[1]");
    onet.Text = temperatura.InnerText;
}
This is the exception:
System.NullReferenceException:
temperatura was null.
You can use this:
public static bool TryGetTemperature(HtmlAgilityPack.HtmlDocument doc, out int temperature)
{
    temperature = 0;
    var temp = doc.DocumentNode.SelectSingleNode(
        "//div[contains(@class, 'temperature')]/div[contains(@class, 'temp')]");
    if (temp == null)
    {
        return false;
    }
    // The degree sign arrives as the 5-character entity "&deg;", so strip it before parsing.
    var text = temp.InnerText.EndsWith("&deg;") ?
        temp.InnerText.Substring(0, temp.InnerText.Length - 5) :
        temp.InnerText;
    return int.TryParse(text, out temperature);
}
With a query like this, you can select your target more precisely. With your absolute path, a small change in the HTML structure will make your application fail. Some points:
// searches anywhere in the document
You search for any div whose class contains "temperature" and, inside that node:
you search for a child div whose class contains "temp"
If you get that node (!= null), you try to convert the degrees to a number (removing the '°' first)
And check:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
HtmlWeb web = new HtmlWeb();
doc = web.Load("https://pogoda.onet.pl/");
if (TryGetTemperature(doc, out int temperature))
{
    onet.Text = temperature.ToString();
}
UPDATE
I updated TryGetTemperature a little because the degree sign is encoded. The main problem is the HTML. When you request the source, you get HTML that the browser later updates dynamically. So the HTML that you get is not useful to you: it doesn't contain the temperature.
So, I see two alternatives:
You can use a browser control (in Common Controls -> WebBrowser, in the Form tools next to Button, Label...), insert it into your form and navigate to the page. It's not difficult, but you need to learn a few things: wait for the event that signals the page has finished loading, then get the source from the control. I suppose you'll also want to hide the browser control. Be careful: sometimes the browser doesn't work correctly when it is hidden. In that case, you can use a visible form positioned off the desktop and handle the activation events so this window never becomes active. Also hide it from the task switcher (Alt+Tab). Things get harder this way, but sometimes it is the only way.
The simple way is to search for the location that you want (e.g. Madryt) and look in DevTools at the request that is made (e.g. https://pogoda.onet.pl/prognoza-pogody/madryt-396099). Use this URL and you get valid HTML.
I'm using C# and Agility Pack to scrape a website, but the result I'm getting differs from what I see in Firebug. I suppose this is because the site is using some Ajax.
// The HtmlWeb class is a utility class to get the HTML over HTTP
HtmlWeb htmlWeb = new HtmlWeb();
// Creates an HtmlDocument object from an URL
HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("http://www.saxobank.com/market-insight/saxotools/forex-open-positions");
// Targets a specific node
HtmlNode someNode = document.GetElementbyId("href");
// If there is no node with that Id, someNode will be null
richTextBox1.Text = document.DocumentNode.OuterHtml;
Could someone advise on how to do this correctly? What I'm getting back is just plain HTML. What I'm looking for is a
div id= "ctl00_MainContent_PositionRatios1_canvasContainer"
Any ideas?
If you have the ID, then use it - href is the name of an attribute, not the id.
HtmlNode someNode =
document.GetElementbyId("ctl00_MainContent_PositionRatios1_canvasContainer");
This will be the div with that id.
You can then select a child a node of that div and read its href attribute. Note the leading dot: ".//a" searches inside someNode, whereas "//a" would search the whole document:
var href = someNode.SelectNodes(".//a")[0].Attributes["href"].Value;
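A small sketch of looking up the container by id and then searching only inside it. Only the div id comes from the question; the inline HTML, the link targets, and the helper name FirstHrefInside are made up for illustration. It assumes the container is present in the HTML the server returns, which may not hold if the page builds it with JavaScript.

```csharp
using System;
using HtmlAgilityPack;

static class ScopedSearch
{
    // Returns the href of the first link inside the element with the given id,
    // or null when the element or a link inside it is missing.
    public static string FirstHrefInside(string html, string containerId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // GetElementbyId returns null when no element carries that id.
        var container = doc.GetElementbyId(containerId);

        // ".//a[@href]" is evaluated relative to the container;
        // "//a[@href]" would start again from the document root.
        var link = container?.SelectSingleNode(".//a[@href]");
        return link?.GetAttributeValue("href", "");
    }

    static void Main()
    {
        // Hypothetical page: one link outside the container, one inside it.
        const string sample =
            "<a href='/outside'>outside</a>" +
            "<div id='ctl00_MainContent_PositionRatios1_canvasContainer'>" +
            "<a href='/inside'>inside</a>" +
            "</div>";

        Console.WriteLine(FirstHrefInside(sample, "ctl00_MainContent_PositionRatios1_canvasContainer"));
    }
}
```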
This is the code to get the links:
private List<string> getLinks(HtmlAgilityPack.HtmlDocument document)
{
    List<string> mainLinks = new List<string>();
    var linkNodes = document.DocumentNode.SelectNodes("//a[@href]");
    if (linkNodes != null)
    {
        foreach (HtmlNode link in linkNodes)
        {
            var href = link.Attributes["href"].Value;
            mainLinks.Add(href);
        }
    }
    return mainLinks;
}
Sometimes the links I'm getting start with "/", for example:
"/videos?feature=mh"
Or
"//www.youtube.com/my_videos_upload"
I'm not sure whether a bare "/" means a proper site, or a link that starts with "/videos?...
Or "//www.youtube...
Each time, I need to get the links from a website that start with http or https; maybe a link that starts with just www also counts as a proper site. The question is: what should I treat as a proper site address and link, and what not?
I'm sure my getLinks function is not written the way it should be.
This is how im adding the links to the List:
private List<string> test(string url, int levels , DoWorkEventArgs eve)
{
HtmlAgilityPack.HtmlDocument doc;
HtmlWeb hw = new HtmlWeb();
List<string> webSites;// = new List<string>();
List<string> csFiles = new List<string>();
try
{
doc = hw.Load(url);
webSites = getLinks(doc);
webSites is a List<string>.
After a few runs I see entries in the list like "/" or, as above, "/videos... or "//www....
I'm not sure I understood your question, but
/Videos means it is accessing the Videos folder from the root of the host you are accessing,
e.g.:
www.somesite.com/Videos
There are absolute and relative URLs, so you are getting different flavors from different links. You need to convert each of them to an absolute URL (the Uri class will handle most of it for you).
foo/bar.txt - relative url from the same path as current page
../foo/bar.txt - relative path from one folder above current
/foo/bar.txt - server-relative path from root - same server, path starting from root
//www.sample.com/foo/bar.txt - absolute url with the same scheme (http/https) as current page
http://www.sample.com/foo/bar.txt - complete absolute url
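The conversion to absolute URLs can be sketched with System.Uri, resolving each href against the address of the page it came from. The page URL below and the helper name ToAbsolute are assumptions for illustration; the sample hrefs are the ones quoted in this thread plus a mailto link. Treating only http and https results as "proper" links is one possible rule, not the only one.

```csharp
using System;

static class LinkResolver
{
    // Resolves href against the URL of the page it was found on.
    // Returns null when href is empty, cannot be parsed, or is not http/https.
    public static Uri ToAbsolute(Uri pageUrl, string href)
    {
        if (string.IsNullOrWhiteSpace(href))
        {
            return null;
        }

        // Handles relative, server-relative, scheme-relative and absolute forms.
        if (!Uri.TryCreate(pageUrl, href, out Uri absolute))
        {
            return null;
        }

        bool isWeb = absolute.Scheme == Uri.UriSchemeHttp
                  || absolute.Scheme == Uri.UriSchemeHttps;
        return isWeb ? absolute : null;
    }

    static void Main()
    {
        // Hypothetical page address; use the URL that was actually loaded.
        var page = new Uri("https://www.youtube.com/user/example");

        string[] hrefs =
        {
            "/",
            "/videos?feature=mh",
            "//www.youtube.com/my_videos_upload",
            "http://www.sample.com/foo/bar.txt",
            "mailto:someone@example.com"
        };

        foreach (var href in hrefs)
        {
            Uri resolved = ToAbsolute(page, href);
            Console.WriteLine(href + " -> " + (resolved == null ? "(skipped)" : resolved.AbsoluteUri));
        }
    }
}
```

In getLinks, each href could be passed through such a helper before being added to the list, so the list only ever holds complete addresses.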
It looks like you are using a library that is able to parse/read html tags.
As far as I understand,
var href = link.Attributes["href"].Value;
is doing nothing but reading the value of the "href" attribute.
So assuming the website's source code is using links like href="/news"
it will grab and save even the relative links to your list.
Just view the target website's sourcecode and check it against your results.
I'm trying to parse this field, but can't get it to work. Current attempt:
var name = doc.DocumentNode.SelectSingleNode("//*[@id='my_name']").InnerHtml;
<h1 class="bla" id="my_name">namehere</h1>
Error: Object reference not set to an instance of an object.
Appreciate any help.
@John - I can assure you that the HTML is loaded correctly. I am trying to read my Facebook name for learning purposes. Here is a screenshot from the Firebug plugin. The version I am using is 1.4.0.
http://i54.tinypic.com/kn3wo.jpg
I guess the problem is that profile_name is a child node or something, that's why I'm not able to read it?
The reason your code doesn't work is because there is JavaScript on the page that is actually writing out the <h1 id='profile_name'> tag, so if you're requesting the page from a User Agent (or via AJAX) that doesn't execute JavaScript then you won't find the element.
I was able to get my own name using the following selector:
string name =
doc.DocumentNode.SelectSingleNode("//a[@id='navAccountName']").InnerText;
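Since SelectSingleNode returns null when the element is not in the downloaded HTML (for instance because JavaScript adds it later), the "Object reference not set" error can be avoided by checking the result before reading it. A minimal sketch, using the h1 from the question as sample input; TryGetInnerText is a hypothetical helper name.

```csharp
using System;
using HtmlAgilityPack;

static class SafeLookup
{
    // Tries to read the text of the first node matching xpath.
    // Returns false (and a null text) when nothing matches.
    public static bool TryGetInnerText(string html, string xpath, out string text)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null)
        {
            text = null;
            return false;
        }

        // DeEntitize turns entities such as &amp; back into plain characters.
        text = HtmlEntity.DeEntitize(node.InnerText).Trim();
        return true;
    }

    static void Main()
    {
        const string sample = "<h1 class=\"bla\" id=\"my_name\">namehere</h1>";

        if (TryGetInnerText(sample, "//*[@id='my_name']", out string name))
        {
            Console.WriteLine(name);
        }
        else
        {
            Console.WriteLine("Element not found in the downloaded HTML.");
        }
    }
}
```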
Try this:
var name = doc.DocumentNode.SelectSingleNode("//*[@id='my_name']").InnerHtml;
string name = doc.DocumentNode.SelectSingleNode("//h1[@id='my_name']").InnerText;
public async Task<List<string>> GetAllTagLinkContent(string content)
{
    string html = string.Format("<html><head></head><body>{0}</body></html>", content);
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    var nodes = doc.DocumentNode.SelectNodes("//*[@id='my_name']");
    // SelectNodes returns null when nothing matches, so fall back to an empty list.
    if (nodes == null)
    {
        return new List<string>();
    }
    return nodes.Select(r => r.InnerText).ToList();
}
It also works with "//a[@href]". You can try it as above. Hope that helps.