How to extract a link using XPath - C#

I'm trying to make an application where you input a web URL (http://www.explosm.net/comics/3104/) and it automatically saves a string with the first link it finds given the XPath (//*[@id="maincontent"]/div[2]/div[2]/div[1]/img), which is a picture I want to download.
I honestly have no clue where to even begin with this. I've tried the HtmlAgilityPack and the WebBrowser class, but I couldn't find anything to help me understand what to do and how to do it.
Any help will be greatly appreciated.

It is pretty easy with HTMLAgilityPack.
var w = new HtmlWeb();
var doc = w.Load("http://www.explosm.net/comics/3104/");
var imgNode = doc.DocumentNode.SelectSingleNode("//*[@id=\"maincontent\"]/div[2]/div[2]/div[1]/img");
var src = imgNode.GetAttributeValue("src", "");
The variable src will have the value http://www.explosm.net/db/files/Comics/Matt/Dont-be-a-dickhead.png.
All you have to do then is download the image:
var request = (HttpWebRequest)WebRequest.Create(src);
using (var response = request.GetResponse())
using (var stream = response.GetResponseStream())
{
    // Here you have an Image object (requires System.Drawing)
    Image img = Image.FromStream(stream);
    // And you can save it or do whatever you want
    img.Save(@"C:\file.png");
}
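As an aside, if you only need the file on disk rather than an Image object in memory, WebClient.DownloadFile is a shorter route (a minimal sketch; the target path is arbitrary):
using (var client = new WebClient())
{
    client.DownloadFile(src, @"C:\file.png");
}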


C# load html source as string

I'm trying to make an application that will read certain text from a website.
using AngleSharp.Parser.Html;
...
var source = @"
<html>
<head>
</head>
<body>
<td class=""period_slot_1"">
<strong>TG</strong>
</body>
</html>";
var parser = new HtmlParser();
var document = parser.Parse(source);
var strong = document.QuerySelector("strong");
MessageBox.Show(strong.TextContent); // Display text
From googling, I've successfully done the above. I copy-pasted part of the HTML into a variable to see if I could get the value I'm looking for.
It gets the value I want, which is the string "TG".
However, the website will have a different value from "TG" every time, so I need my program to read straight from the HTML of the live website each time.
Is it possible for me to load the whole HTML source into the source variable and make it work? If so, how can I do it, and what would be the best way to get what I want?
Thank you so much for reading the question.
I assume you mean you want to read directly from a page on the internet, given a URL. In that case you should do:
WebClient myClient = new WebClient();
Stream response = myClient.OpenRead("http://yahoo.com");
StreamReader reader = new StreamReader(response);
string source = reader.ReadToEnd();
var parser = new HtmlParser();
var document = parser.Parse(source);
var p = document.QuerySelector("p");
// I used 'p' instead of 'strong' because there's
// no 'strong' element on that page
MessageBox.Show(p.TextContent); // Display text
response.Close();
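As a side note, on .NET 4.5+ the same fetch can be written with HttpClient, which takes care of disposing the response stream (a minimal sketch, assuming an async calling context and a reference to System.Net.Http):
using (var client = new HttpClient())
{
    // GetStringAsync downloads the page body as a string
    string source = await client.GetStringAsync("http://yahoo.com");
    var parser = new HtmlParser();
    var document = parser.Parse(source);
    var p = document.QuerySelector("p");
    MessageBox.Show(p.TextContent);
}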

Image is not opening after converting with Aspose

I used the code below to convert a URL stream into a TIFF image, but after conversion the converted image will not open for preview. Any ideas?
var myRequest = (HttpWebRequest)WebRequest.Create("http://www.google.com");
myRequest.Method = "GET";
var myResponse = myRequest.GetResponse();
var responseStream = myResponse.GetResponseStream();
var memoryStream = new MemoryStream();
responseStream.CopyTo(memoryStream);
memoryStream.Position = 0; // rewind so the document can be read from the start
var loadOptions = new LoadOptions();
loadOptions.LoadFormat = LoadFormat.Html;
var doc = new Document(memoryStream, loadOptions);
var htmlOptions = new HtmlFixedSaveOptions();
htmlOptions.ExportEmbeddedCss = true;
htmlOptions.ExportEmbeddedFonts = true;
htmlOptions.ExportEmbeddedImages = true;
doc.Save(@"C:\out.tif", htmlOptions);
You are using HtmlFixedSaveOptions in the Save() method, so it will save as HTML. Open out.tif in any text editor and you will see HTML tags.
Use ImageSaveOptions in the Save() method to save in an image format. Even then, if you manually fetch the web page from the URL into a stream, you only get the HTML; without the CSS, the saved image will not look good. I would recommend letting Aspose handle the URL.
// If you provide a URL in string, Aspose will load the web page
var doc = new Aspose.Words.Document("http://www.google.com");
// If you just provide the TIF extension, it will save as TIFF image
doc.Save(@"c:\out.tif");
// To customize, you can use save options in the Save method
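For the explicit ImageSaveOptions route mentioned above, a minimal sketch might look like this (the output path is arbitrary; the ImageSaveOptions constructor takes the target SaveFormat):
var doc = new Aspose.Words.Document("http://www.google.com");
// Explicitly request TIFF output instead of relying on the file extension
var imageOptions = new Aspose.Words.Saving.ImageSaveOptions(Aspose.Words.SaveFormat.Tiff);
doc.Save(@"C:\out.tif", imageOptions);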
I work for Aspose as a Developer Evangelist.

Is it possible to navigate a site by clicking links and then downloading the correct piece?

I'll try to explain what exactly I mean. I'm working on a program and I'm trying to download a bunch of images automatically from this site.
Namely, I want to download the big square icons from the page you get when you click on a hero name there; for example, on the Darius page, the image in the top left named DariusSquare.png, and save it into a folder.
Is this possible, or am I asking too much of C#?
Thank you very much!
In general, everything is possible given enough time and money. In your case, you need very little of the former and none of the latter :)
What you need to do can be described in the following high-level steps:
Get all <a> tags within the table of heroes.
Use the WebClient class to navigate to the URLs these <a> tags point to (i.e. the values of their href attributes) and download the HTML.
You will need to find some wrapper element that is present on each hero page and that contains the hero's image. Then you should be able to get the image's src attribute and download it. Alternatively, perhaps each image has a common ID you can use?
I don't think anyone will provide you with exact code that performs these steps for you; you need to do some research of your own. Still, a rough sketch follows.
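A minimal sketch of those steps with Html Agility Pack and WebClient (the URL, table id, and wrapper class here are hypothetical, since the site's markup isn't shown):
using (var client = new WebClient())
{
    var doc = new HtmlDocument();
    doc.LoadHtml(client.DownloadString("http://example.com/heroes")); // hypothetical list page
    // Step 1: all <a> tags inside the (hypothetical) heroes table
    var links = doc.DocumentNode.SelectNodes("//table[@id='heroes']//a[@href]");
    if (links != null)
    {
        foreach (var link in links)
        {
            // Step 2: download the HTML of each hero page
            // (assumes absolute hrefs; prepend the site root if they are relative)
            var heroDoc = new HtmlDocument();
            heroDoc.LoadHtml(client.DownloadString(link.GetAttributeValue("href", "")));
            // Step 3: find the image inside some wrapper element and save it
            var img = heroDoc.DocumentNode.SelectSingleNode("//div[@class='hero-portrait']//img");
            if (img != null)
                client.DownloadFile(img.GetAttributeValue("src", ""),
                    Path.GetFileName(img.GetAttributeValue("src", "")));
        }
    }
}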
Yes, it's possible: do a C# web request and use the Html Agility Pack to find the image URL.
Then you can use another web request to download the image.
Example of downloading an image from a URL:
public static Image LoadImage(string url)
{
var backgroundUrl = url;
var request = WebRequest.Create(backgroundUrl);
var response = request.GetResponse();
var stream = response.GetResponseStream();
return Image.FromStream(stream);
}
Example using Html Agility Pack and getting some other data:
var request = (HttpWebRequest)WebRequest.Create(profileurl);
request.Method = "GET";
string result; // declared here so it is visible outside the reader block
using (var response = request.GetResponse())
{
using (var stream = response.GetResponseStream())
{
using (var reader = new StreamReader(stream, Encoding.UTF8))
{
result = reader.ReadToEnd();
}
var doc = new HtmlDocument();
doc.Load(new StringReader(result));
var root = doc.DocumentNode;
HtmlNode profileHeader = root.SelectSingleNode("//*[@id='profile-header']");
HtmlNode profileRight = root.SelectSingleNode("//*[@id='profile-right']");
string rankHtml = profileHeader.SelectSingleNode("//*[@id='best-team-1']").OuterHtml.Trim();
#region GetPlayerAvatar
var avatarMatch = Regex.Match(profileHeader.SelectSingleNode("/html/body/div/div[2]/div/div/div/div/div/span").OuterHtml, @"(portraits[^(h3)]+).*no-repeat;", RegexOptions.IgnoreCase);
if (avatarMatch.Success)
{
battleNetPlayerFromDB.PlayerAvatarCss = avatarMatch.Value;
}
#endregion
}
}

Read content of Web Browser in WPF

Hello developers. I want to read external content from a website, such as the elements between tags. I am using the WebBrowser control, and here is my code; however, this code just fills my WebBrowser control with the web page:
public MainWindow()
{
InitializeComponent();
wbMain.Navigate(new Uri("http://www.annonymous.com", UriKind.RelativeOrAbsolute));
}
You can use the Html Agility Pack library to parse any HTML formatted data.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(wbMain.DocumentText); // LoadHtml (not Load) takes an HTML string
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
NOTE: The SelectNodes method accepts XPath, not CSS or jQuery selectors.
var node = doc.DocumentNode.SelectNodes("id('my_element_id')");
As I understood from your question, you are only trying to parse the HTML data, and you don't need to show the actual web page.
If that is the case, then you can take a very simple approach and use HttpWebRequest:
var _plainText = string.Empty;
var _request = (HttpWebRequest)WebRequest.Create("http://www.google.com");
_request.Timeout = 5000;
_request.Method = "GET";
_request.ContentType = "text/plain";
using (var _webResponse = (HttpWebResponse)_request.GetResponse())
{
var _webResponseStatus = _webResponse.StatusCode;
var _stream = _webResponse.GetResponseStream();
using (var _streamReader = new StreamReader(_stream))
{
_plainText = _streamReader.ReadToEnd();
}
}
Try this:
dynamic doc = wbMain.Document;
var htmlText = doc.documentElement.InnerHtml;
edit: Taken from here.

Issue with HTMLAgilityPack parsing HTML using C#

I'm just trying to learn about HtmlAgilityPack and XPath. I'm attempting to get a list of companies (HTML links) from the NASDAQ website:
http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx
I currently have the following code:
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// Create a request for the URL.
WebRequest request = WebRequest.Create("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx");
// Get the response.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// Get the stream containing content returned by the server.
Stream dataStream = response.GetResponseStream();
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream);
// Read the content.
string responseFromServer = reader.ReadToEnd();
// Read into a HTML store read for HAP
htmlDoc.LoadHtml(responseFromServer);
HtmlNodeCollection tl = htmlDoc.DocumentNode.SelectNodes("//*[@id='indu_table']/tbody/tr[*]/td/b/a");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
Debug.Write(node.InnerText);
}
// Cleanup the streams and the response.
reader.Close();
dataStream.Close();
response.Close();
I've used an XPath addon for Chrome to get the XPath of;
//*[@id='indu_table']/tbody/tr[*]/td/b/a
When running my project, I get an unhandled XPath exception about the expression containing an invalid token.
I'm a little unsure what's wrong with it; I've tried putting a number in the tr[*] section above, but I still get the same error.
I've been looking at this for the last hour. Is it anything simple?
Thanks
Since the data comes from JavaScript, you have to parse the JavaScript and not the HTML, so the Agility Pack doesn't help that much, but it makes things a bit easier. The following is how it could be done using the Agility Pack and Newtonsoft Json.NET to parse the JavaScript.
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.Load(new WebClient().OpenRead("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx"));
List<string> listStocks = new List<string>();
HtmlNode scriptNode = htmlDoc.DocumentNode.SelectSingleNode("//script[contains(text(),'var table_body =')]");
if (scriptNode != null)
{
//Using Regex here to get just the array we're interested in...
string stockArray = Regex.Match(scriptNode.InnerText, "table_body = (?<Array>\\[.+?\\]);").Groups["Array"].Value;
JArray jArray = JArray.Parse(stockArray);
foreach (JToken token in jArray.Children())
{
listStocks.Add("http://www.nasdaq.com/symbol/" + token.First.Value<string>().ToLower());
}
}
To explain in a bit more detail, the data comes from one big JavaScript array on the page, var table_body = [....
Each stock is one element in the array and is an array itself.
["ATVI", "Activision Blizzard, Inc", 11.75, 0.06, 0.51, 3058125, 0.06, "N", "N"]
So by parsing the array, taking the first element of each row, and appending it to the fixed URL prefix, we get the same result as the JavaScript.
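As a worked example of that transformation on the row shown above:
// Parse one row of the table_body array
JArray row = JArray.Parse(@"[""ATVI"", ""Activision Blizzard, Inc"", 11.75, 0.06, 0.51, 3058125, 0.06, ""N"", ""N""]");
// The first element is the ticker symbol
string url = "http://www.nasdaq.com/symbol/" + row.First.Value<string>().ToLower();
// url is now "http://www.nasdaq.com/symbol/atvi"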
Why don't you just use the Descendants("a") method?
It's much simpler and more object-oriented: you'll just get a bunch of node objects.
Then you can just get the "href" attribute from those objects.
Sample code:
// Requires System.Linq
var hrefs = htmlDoc.DocumentNode.Descendants("a")
    .Select(a => a.GetAttributeValue("href", ""))
    .ToList();
If you just need list of links from certain webpage, this method will do just fine.
If you look at the page source for that URL, there's not actually an element with id='indu_table'. It appears to be generated dynamically (i.e. by JavaScript); the HTML that you get when loading directly from the server will not reflect anything changed by client script. This is probably why it's not working.
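A quick way to confirm this (a minimal sketch; WebClient is enough for a one-off check):
using (var client = new WebClient())
{
    string raw = client.DownloadString("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx");
    // If this prints False, the table element is built client-side and
    // never appears in the server-sent HTML
    Console.WriteLine(raw.Contains("<table id=\"indu_table\""));
}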
