C# load html source as string - c#

I'm trying to make an application that reads certain text from a website.
using AngleSharp.Parser.Html;
...
var source = @"
<html>
<head>
</head>
<body>
<td class=""period_slot_1"">
<strong>TG</strong>
</body>
</html>";
var parser = new HtmlParser();
var document = parser.Parse(source);
var strong = document.QuerySelector("strong");
MessageBox.Show(strong.TextContent); // Display text
From googling, I've successfully got the above working. I copied and pasted part of the HTML into a variable to see whether I could get the value I'm looking for.
It does get the value I want, which is the string "TG".
However, the website will show a different value in place of "TG" every time, so I need my program to read the live HTML of the website at that moment.
Is it possible to load the whole HTML source into the source variable and make this work? If so, how can I do it, and what would be the best way to get what I want?
Thank you so much for reading the question.

I assume you want to read the HTML directly from a page on the internet, given its URL. In that case you can do:
WebClient myClient = new WebClient();
Stream response = myClient.OpenRead("http://yahoo.com");
StreamReader reader = new StreamReader(response);
string source = reader.ReadToEnd();
var parser = new HtmlParser();
var document = parser.Parse(source);
var p = document.QuerySelector("p");
// I used 'p' instead of 'strong' because there's no
//strong on that page
MessageBox.Show(p.TextContent); // Display text
response.Close();

Related

How to get only plain text from HTML using C#?

Hi guys.
I'm trying to create an app that finds the most frequently used words in a string.
In my case, the string is HTML.
I can already get the HTML from a URI, for example "https://www.bbc.com/news/world-middle-east-57327591".
var url = "https://www.bbc.com/news/world-middle-east-57327591";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
The html variable holds the same HTML as the page source, which is fine.
But how do I get rid of all the styles, scripts, and other additional information, and keep only the plain text in a string variable?
I want my application to work not only for BBC pages, but for any HTML I can get from the net.
My idea is that I should get the text from every element such as <div>, <p>, <b>, <i>, <a>, because not all of the text is stored in <p> elements.
As per this answer, try the following:
var url = "https://www.bbc.com/news/world-middle-east-57327591";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
// Create a regex pattern that selects all HTML tags
string pattern = @"<(.|\n)*?>";
// Replace every tag found by that regex with nothing.
// Note: this removes the tags only; text inside <script> and <style> elements remains.
return Regex.Replace(html, pattern, string.Empty);
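Since the question also asks about dropping scripts and styles and then counting words, here is a minimal sketch of one way to do that with regexes alone. The class name HtmlText and the methods ToPlainText and WordCounts are hypothetical names made up for this illustration, not part of any library. Regexes are only an approximation of HTML parsing, so for messy real-world pages a parser such as AngleSharp or HtmlAgilityPack is likely more robust.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: reduces an HTML string to its visible text using regexes only.
    public static string ToPlainText(string html)
    {
        // Remove <script> and <style> elements together with their contents.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        // Remove HTML comments.
        text = Regex.Replace(text, @"<!--.*?-->", " ", RegexOptions.Singleline);
        // Replace every remaining tag with a space so adjacent words stay separated.
        text = Regex.Replace(text, @"<[^>]+>", " ");
        // Decode entities such as &amp; and collapse runs of whitespace.
        text = WebUtility.HtmlDecode(text);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    // Hypothetical helper: counts case-insensitive word occurrences in plain text.
    public static Dictionary<string, int> WordCounts(string text)
    {
        return Regex.Matches(text.ToLowerInvariant(), @"[\p{L}\p{Nd}']+")
            .Cast<Match>()
            .GroupBy(m => m.Value)
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

The most frequent words can then be read off with something like HtmlText.WordCounts(HtmlText.ToPlainText(html)).OrderByDescending(p => p.Value).Take(10).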

How to retrieve, with C#, the HTML of a page generated with AngularJS

I need to retrieve the HTML code of a page that uses AngularJS to process some information and generate a graph. I can easily retrieve the HTML using WebRequest, as in the example below, but the content (the graph) generated by AngularJS is not included in the returned page source.
WebRequest request = WebRequest.Create("http://localhost:36789/minhaapp#/index");
WebResponse response = request.GetResponse();
Stream data = response.GetResponseStream();
string html = String.Empty;
using (StreamReader sr = new StreamReader(data))
{
html = sr.ReadToEnd();
}
Has anyone ever experienced this?
Thank you in advance for your support.
At the end of your Page_Load method, call this getHTMLContent() method:
public string getHTMLContent()
{
StringBuilder sb = new StringBuilder();
StringWriter tw = new StringWriter(sb);
HtmlTextWriter hw = new HtmlTextWriter(tw);
panel.RenderControl(hw);
String html = sb.ToString();
return html;
}
The entire page is contained in an asp:Panel called panel. The way this works is from the RenderControl() method which you can read a little bit more about here. Simply put, it gets all the content within the asp:Panel tags (the whole page) and once used after the Page_Load event has been executed, it will get all the raw HTML for the page.
There's a headless browser called PhantomJS, which runs the site's JavaScript first, so you can get the source after the page has been rendered. Obviously, it will also slow the process down if you are doing this for a lot of websites.

How to extract a link using xpath

I'm trying to make an application where you input a web URL (http://www.explosm.net/comics/3104/) and it automatically saves a string with the first link it finds for a given XPath (//*[@id="maincontent"]/div[2]/div[2]/div[1]/img), which is a picture I want to download.
I honestly have no clue where to even begin with this. I've tried the HtmlAgilityPack and the WebBrowser class, but I couldn't find anything to help me understand what to do and how to do it.
Any help will be greatly appreciated.
It is pretty easy with HTMLAgilityPack.
var w = new HtmlWeb();
var doc = w.Load("http://www.explosm.net/comics/3104/");
var imgNode = doc.DocumentNode.SelectSingleNode("//*[@id=\"maincontent\"]/div[2]/div[2]/div[1]/img");
var src = imgNode.GetAttributeValue("src", "");
The variable src will have the value http://www.explosm.net/db/files/Comics/Matt/Dont-be-a-dickhead.png.
All you have to do then is download the image:
var request = (HttpWebRequest)WebRequest.Create(src);
var response = request.GetResponse();
var stream = response.GetResponseStream();
//Here you have an Image object
Image img = Image.FromStream(stream);
//And you can save it or do whatever you want
img.Save(@"C:\file.png");
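On this page the src attribute happens to be an absolute URL. On other pages it may be relative, and then it has to be resolved against the page address before it can be requested. A minimal sketch, assuming a hypothetical helper named UrlHelper.ToAbsolute (not part of HtmlAgilityPack), using the standard System.Uri class:

```csharp
using System;

public static class UrlHelper
{
    // Hypothetical helper: resolves an attribute value against the address of the page it came from.
    public static string ToAbsolute(string pageUrl, string src)
    {
        var baseUri = new Uri(pageUrl);
        // Uri(Uri, string) keeps an absolute src as it is and resolves a relative one against baseUri.
        return new Uri(baseUri, src).AbsoluteUri;
    }
}
```

The result of UrlHelper.ToAbsolute("http://www.explosm.net/comics/3104/", src) can then be passed to WebRequest.Create in place of src.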

Is it possible to navigate a site by clicking links and then downloading the correct piece?

I'll try to explain what exactly I mean. I'm working on a program and I'm trying to download a bunch of images automatically from this site.
Namely, I want to download the big square icons from the page you get when you click on a hero name there, for example on the Darius page the image in the top left with the name DariusSquare.png and save that into a folder.
Is this possible or am I asking too much from C#?
Thank you very much!
In general, everything is possible given enough time and money. In your case, you need very little of the former and none of the latter :)
What you need to do can be described in the following high-level steps:
1. Get all <a> tags within the table of heroes.
2. Use the WebClient class to navigate to the URLs these <a> tags point to (i.e. the values of their href attributes) and download the HTML.
3. Find some wrapper element that is present on each hero page and that contains the hero's image. Then you should be able to get the image's src attribute and download it. Alternatively, perhaps each image has a common ID you can use?
I don't think anyone will provide you with exact code that performs these steps for you. Instead, you need to do some research of your own.
Yes, it's possible: make a web request from C# and use the HTML Agility Pack to find the image URL.
Then you can use another web request to download the image.
Example of downloading an image from a URL:
public static Image LoadImage(string url)
{
var backgroundUrl = url;
var request = WebRequest.Create(backgroundUrl);
var response = request.GetResponse();
var stream = response.GetResponseStream();
return Image.FromStream(stream);
}
Example using html agility pack and getting some other data:
var request = (HttpWebRequest)WebRequest.Create(profileurl);
request.Method = "GET";
using (var response = request.GetResponse())
{
using (var stream = response.GetResponseStream())
{
using (var reader = new StreamReader(stream, Encoding.UTF8))
{
result = reader.ReadToEnd();
}
var doc = new HtmlDocument();
doc.Load(new StringReader(result));
var root = doc.DocumentNode;
HtmlNode profileHeader = root.SelectSingleNode("//*[@id='profile-header']");
HtmlNode profileRight = root.SelectSingleNode("//*[@id='profile-right']");
string rankHtml = profileHeader.SelectSingleNode("//*[@id='best-team-1']").OuterHtml.Trim();
#region GetPlayerAvatar
var avatarMatch = Regex.Match(profileHeader.SelectSingleNode("/html/body/div/div[2]/div/div/div/div/div/span").OuterHtml, @"(portraits[^(h3)]+).*no-repeat;", RegexOptions.IgnoreCase);
if (avatarMatch.Success)
{
battleNetPlayerFromDB.PlayerAvatarCss = avatarMatch.Value;
}
#endregion
}
}
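One detail in the avatar regex above is easy to misread: [^(h3)] is a negated character class, not a negated group. It matches any single character other than '(', 'h', '3' or ')'. Whether that was the intent here is not clear from the answer. A small sketch showing the behaviour; CharClassDemo and FirstRun are hypothetical names used only for this illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // Returns the first run of characters that contains none of '(', 'h', '3', ')'.
    public static string FirstRun(string input)
    {
        // The parentheses inside [...] are literal characters, not grouping.
        return Regex.Match(input, @"[^(h3)]+").Value;
    }
}
```

For example, FirstRun("avatar3x") stops at the '3', and FirstRun("path") stops at the 'h'.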

HtmlAgilityPack - how to grab <DIV> data in a large web page

I am trying to grab data from a web page, specifically from <div> elements with a particular class: <div class="personal_info">. Each page has 10 to 15 such <div>s, all with the class "personal_info" (as shown in the HTML below), and I want to extract all of them.
<div class="personal_info"><span class="bold">Rama Anand</span><br><br> Mobile: 9916184586<br>rama_asset@hotmail.com<br> Bangalore</div>
To do this I started using the HTML Agility Pack, as suggested by someone on Stack Overflow,
but I got stuck right at the beginning because I don't know the library well. My C# code looks like this:
HtmlAgilityPack.HtmlDocument docHtml = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlWeb docHFile = new HtmlWeb();
docHtml = docHFile.Load("http://127.0.0.1/2.html");
How do I continue from here so that the data from the DIVs whose class is "personal_info" can be grabbed? A suggestion with an example would be appreciated.
I can't check this right now, but isn't it:
var infos = from info in docHtml.DocumentNode.SelectNodes("//div[@class='personal_info']") select info;
To get a url loaded you can do something like:
var document = new HtmlAgilityPack.HtmlDocument();
var url = "http://www.google.com";
var request = (HttpWebRequest)WebRequest.Create(url);
using (var responseStream = request.GetResponse().GetResponseStream())
{
document.Load(responseStream, Encoding.UTF8);
}
Also note there is a fork that lets you use jQuery selectors in the Agility Pack.
IEnumerable<HtmlNode> myList = document.QuerySelectorAll(".personal_info");
http://yosi-havia.blogspot.com/2010/10/using-jquery-selectors-on-server-sidec.html
What happened to Where?
node.DescendantNodes().Where(node_it => node_it.Name=="div");
If you want the top node (root), use page.DocumentNode as "node".
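The XPath test @class='personal_info' only matches when the class attribute is exactly that string, so a div with class="personal_info highlighted" would be skipped. A common XPath 1.0 idiom matches the class as one token among several. The sketch below demonstrates it with the standard System.Xml.Linq classes on a small well-formed snippet. ClassXPath, DivWithClass and SelectText are hypothetical names, and XDocument only accepts well-formed markup, so for real pages the same XPath string would be passed to HtmlAgilityPack's SelectNodes instead (assuming it evaluates standard XPath 1.0 functions, which it should).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class ClassXPath
{
    // Builds an XPath that matches divs whose class attribute contains the given token.
    public static string DivWithClass(string className)
    {
        return "//div[contains(concat(' ', normalize-space(@class), ' '), ' " + className + " ')]";
    }

    // Parses well-formed XHTML and returns the trimmed text of every matching div.
    public static List<string> SelectText(string xhtml, string className)
    {
        var doc = XDocument.Parse(xhtml);
        return doc.XPathSelectElements(DivWithClass(className))
                  .Select(e => e.Value.Trim())
                  .ToList();
    }
}
```

With HtmlAgilityPack the call would look like docHtml.DocumentNode.SelectNodes(ClassXPath.DivWithClass("personal_info")).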
