I used this code before and it could extract XPath results from websites. But today, while debugging, I noticed it no longer retrieves the HTML from webtruyen.com. I checked http://webtruyen.com/robots.txt and found nothing suspicious there. I also tried adding a proxy to fetch the data, but it returned null. I don't know how to get XPath results from webtruyen.com. Can anyone help? I want to know how to read data from http://webtruyen.com.
My code:
string url = "http://webtruyen.com";
var web = new HtmlWeb();
var doc = web.Load(url);
string temps = "";
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
    temps = node.InnerHtml;
}
When I debug, it returns:
InnerHtml 'doc.DocumentNode.InnerHtml' threw an exception of type 'System.NullReferenceException' string {System.NullReferenceException}
My code with a custom user agent (my proxy attempt also returned null):
string url = "http://webtruyen.com";
var web = new HtmlWeb();
web.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Speedy Spider (http://www.entireweb.com/about/search_tech/speedy_spider/)";
var doc = web.Load(url);
string temps = "";
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
    temps = node.InnerHtml;
}
I had the same error using HtmlWeb.Load(), but I could easily solve your issue using HttpWebRequest (TL;DR: see step 3 for the working code).
Step 1) Using the following code:
HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
using (Stream s = hwr.GetResponse().GetResponseStream())
{ }
You see that you actually get a 403 Forbidden error (WebException).
Step 2)
HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
HtmlDocument doc = new HtmlDocument();
try
{
    using (Stream s = hwr.GetResponse().GetResponseStream())
    { }
}
catch (WebException wx)
{
    // the 403 response body still contains HTML; load it into the document
    doc.LoadHtml(new StreamReader(wx.Response.GetResponseStream()).ReadToEnd());
}
In doc.DocumentNode.OuterHtml, you see the HTML of the forbidden-error page, including the JavaScript that sets the cookie in your browser and refreshes the page.
Step 3) So in order to load the page outside of a browser, you have to set that cookie manually and request the page again. Meaning, with:
string cookie = string.Empty;
HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
try
{
    using (Stream s = hwr.GetResponse().GetResponseStream())
    { }
}
catch (WebException wx)
{
    // pull the cookie value out of the JavaScript on the 403 page
    cookie = Regex.Match(new StreamReader(wx.Response.GetResponseStream()).ReadToEnd(), "document.cookie = '(.*?)';").Groups[1].Value;
}

// retry the request with the cookie attached
hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
hwr.Headers.Add("Cookie", cookie);
HtmlDocument doc = new HtmlDocument();
using (Stream s = hwr.GetResponse().GetResponseStream())
using (StreamReader sr = new StreamReader(s))
{
    doc.LoadHtml(sr.ReadToEnd());
}
You get the page :)
Moral of the story, if your browser can do it, so can you.
I am trying to download whatever is between <code></code> tags on a website. Unfortunately, selecting nodes with "//code" returns null, and I don't know why. This is my code:
public void TAF_download()
{
    var html = @"https://www.aviationweather.gov/taf/data?ids=KDEN&format=raw&metars=off&layout=off/";
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    var TAF = doc.DocumentNode.SelectSingleNode("//code");
    Console.WriteLine(TAF.OuterHtml);
}
The argument to HtmlDocument.LoadHtml(string html) needs to be HTML, not a URL.
You may try this (no exception handling included):
public void TAF_download()
{
    var url = @"https://www.aviationweather.gov/taf/data?ids=KDEN&format=raw&metars=off&layout=off/";
    string html;
    var request = WebRequest.CreateHttp(url);
    using (var response = request.GetResponse())
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        html = reader.ReadToEnd();
    }
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    var TAF = doc.DocumentNode.SelectSingleNode("//code");
    Console.WriteLine(TAF.OuterHtml);
}
There is also an HtmlAgilityPack.HtmlWeb class that supports downloading a URL, but I generally don't use it myself (I actually forgot about it).
For example:
public void TAF_download()
{
    var url = @"https://www.aviationweather.gov/taf/data?ids=KDEN&format=raw&metars=off&layout=off/";
    HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load(url);
    var TAF = doc.DocumentNode.SelectSingleNode("//code");
    Console.WriteLine(TAF.OuterHtml);
}
With that said, you should look for a better data source, one that doesn't require scraping HTML... maybe one of the options listed here: https://www.aviationweather.gov/dataserver
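For instance, here is a minimal sketch of pulling TAFs from the data server as XML instead of scraping HTML. The endpoint, the parameter names, and the raw_text element are assumptions based on the ADDS data server documentation and may have changed, so verify them against the page linked above:
public static void TAF_download_from_dataserver()
{
    // Assumed ADDS data server endpoint and query parameters (verify before use).
    var url = "https://www.aviationweather.gov/adds/dataserver_current/httpparam"
            + "?dataSource=tafs&requestType=retrieve&format=xml"
            + "&stationString=KDEN&hoursBeforeNow=3";

    var request = WebRequest.CreateHttp(url);
    using (var response = request.GetResponse())
    {
        var xml = new System.Xml.XmlDocument();
        xml.Load(response.GetResponseStream());

        // "raw_text" is assumed to hold the raw TAF string in the response schema.
        foreach (System.Xml.XmlNode node in xml.SelectNodes("//raw_text"))
            Console.WriteLine(node.InnerText);
    }
}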
When debugging my UWP application, it throws a System.AccessViolationException with the message 'Attempted to read or write protected memory. This is often an indication that other memory is corrupt.' and a null stack trace. The exception is thrown when trying to add an "Accept-Language" header ("en-US") to the HtmlWeb object's pre-request headers (see the picture and code below). Running the same code using xUnit works fine. Does anyone recognize the problem?
Picture of thrown exception: https://i.imgur.com/gHkmR6q.png
public static HtmlNode GetHtmlNode(string url, string requestLanguage)
{
    var htmlWeb = new HtmlWeb();
    htmlWeb.PreRequest += (request) =>
    {
        // This line of code throws the exception (see the picture as well)
        request.Headers.Add("Accept-Language", requestLanguage);
        return true;
    };
    return htmlWeb.Load(url).DocumentNode;
}
I solved the issue by not using HtmlWeb at all, as shown below:
public static HtmlNode GetHtmlNode(string url, string requestLanguageCode)
{
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.UserAgent = "Mozilla";
        request.Accept = "text/html"; // the Accept property takes only the value, not the "Accept:" prefix
        request.Headers.Add("Accept-Language: " + requestLanguageCode);
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        var stream = response.GetResponseStream();
        using (var reader = new StreamReader(stream))
        {
            string html = reader.ReadToEnd();
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            return doc.DocumentNode;
        }
    }
    catch (WebException)
    {
        return null;
    }
}
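Usage is then straightforward; the URL here is only illustrative:
// Hypothetical call; GetHtmlNode returns null if the request fails with a WebException.
HtmlNode node = GetHtmlNode("https://example.com", "en-US");
if (node != null)
    Console.WriteLine(node.SelectSingleNode("//title")?.InnerText);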
I want to parse the site https://russiarunning.com/events?d=run in C# with HtmlAgilityPack.
I tried this:
string url = "https://russiarunning.com/events?d=run";
var web = new HtmlWeb();
var doc = web.Load(url);
But I have a problem: the site's content loads with a delay of about 1000 ms,
so when I use web.Load(url) I download the page without its content.
How can I set a timeout before downloading the page with HtmlAgilityPack?
Try this...
Create one class as below :
public class WebClientHelper : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        HttpWebRequest request = base.GetWebRequest(address) as HttpWebRequest;
        request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
        return request;
    }
}
and use it as below:
var data = new Helpers.WebClientHelper().DownloadString(Url);
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(data);
You can simply do this:
string url = "https://russiarunning.com/events?d=run";
var web = new HtmlWeb();
web.PreRequest = delegate(HttpWebRequest webReq)
{
    webReq.Timeout = 4000; // number of milliseconds
    return true;
};
var doc = web.Load(url);
More on Timeout property: https://learn.microsoft.com/en-us/dotnet/api/system.net.httpwebrequest.timeout?view=netframework-4.7.2
I'm working on an application that reads emails.
Is there a way besides WebDAV to get all emails from Exchange Server 2003 to my local machine?
The problem with WebDAV is that it doesn't get the body of undelivered emails.
CredentialCache creds = new CredentialCache();
creds.Add(new Uri(a), "NTLM",
    new NetworkCredential("xxxxx", "xxxxxx", "xxxxx.com"));

List<Mail> unreadMail = new List<Mail>();

string reqStr =
    @"<?xml version=""1.0""?>
    <g:searchrequest xmlns:g=""DAV:"">
      <g:sql>
        SELECT
          ""urn:schemas:mailheader:from"",
          ""urn:schemas:mailheader:to"",
          ""urn:schemas:httpmail:textdescription""
        FROM
          ""http://xxxx.com/exchange/xxxx/Inbox/""
        WHERE
          ""urn:schemas:httpmail:subject"" = 'Undeliverable: xxxx'
      </g:sql>
    </g:searchrequest>";

byte[] reqBytes = Encoding.UTF8.GetBytes(reqStr);

HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(a);
request.Credentials = creds;
request.Method = "SEARCH";
request.ContentLength = reqBytes.Length;
request.ContentType = "text/xml";
request.Timeout = 300000;

using (Stream requestStream = request.GetRequestStream())
{
    try
    {
        requestStream.Write(reqBytes, 0, reqBytes.Length);
    }
    catch
    {
    }
    finally
    {
        requestStream.Close();
    }
}

HttpWebResponse response = (HttpWebResponse)request.GetResponse();
using (Stream responseStream = response.GetResponseStream())
{
    try
    {
        XmlDocument document = new XmlDocument();
        document.Load(responseStream);

        XmlNamespaceManager nsmgr = new XmlNamespaceManager(document.NameTable);
        nsmgr.AddNamespace("a", "DAV:");
        nsmgr.AddNamespace("b", "urn:uuid:c2f41010-65b3-11d1-a29f-00aa00c14882/");
        nsmgr.AddNamespace("c", "xml:");
        nsmgr.AddNamespace("d", "urn:schemas:mailheader:");
        nsmgr.AddNamespace("e", "urn:schemas:httpmail:");

        XmlNodeList responseNodes = document.GetElementsByTagName("a:response");
        foreach (XmlNode responseNode in responseNodes)
        {
            XmlNode uriNode = responseNode.SelectSingleNode("child::a:href", nsmgr);
            XmlNode propstatNode = responseNode.SelectSingleNode("descendant::a:propstat[a:status='HTTP/1.1 200 OK']", nsmgr);
            if (propstatNode != null)
            {
                // read properties of this response, and load into a data object
                XmlNode fromNode = propstatNode.SelectSingleNode("descendant::d:from", nsmgr);
                XmlNode descNode = propstatNode.SelectSingleNode("descendant::e:textdescription", nsmgr);
                XmlNode toNode = propstatNode.SelectSingleNode("descendant::d:to", nsmgr);

                // make new data object
                Mail mail = new Mail();
                if (uriNode != null)
                    mail.Uri = uriNode.InnerText;
                if (fromNode != null)
                    mail.From = fromNode.InnerText;
                if (descNode != null)
                    mail.Body = descNode.InnerText;
                if (toNode != null)
                    mail.To = toNode.InnerText;
                unreadMail.Add(mail);
            }
        }
        var ac = unreadMail;
    }
    catch (Exception e)
    {
        string msg = e.Message;
    }
    finally
    {
        responseStream.Close();
    }
}
In the output XML I get an empty text description for undelivered emails:
<a:status>HTTP/1.1 404 Resource Not Found</a:status><a:prop><e:textdescription /></a:prop></a:propstat></a:response>
I see several options to communicate with Exchange servers - WebDAV is rather hard to use and is not well supported in later versions (2010); MS provides EWS, but that doesn't work with older versions.
From my POV you can use any of the following components (commercial!):
http://www.independentsoft.de/webdavex/index.html (WebDAV-based)
http://www.dimastr.com/redemption/home.htm (COM- / Extended MAPI-based)
http://www.afterlogic.com/mailbee-net/imap-component (IMAP4-based)
Another point:
When handling undeliverable mails, I have found that the body is sometimes provided as an attachment - in WebDAV this needs to be accessed via the X-MS-ENUMATT verb (but BEWARE: specific "attachments" like winmail.dat are automagically "decoded" by Outlook on display).
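A rough sketch of issuing that verb with HttpWebRequest, assuming it can be sent the same way as the SEARCH request above (the message URL and credentials are placeholders, and parsing the multipart response is left out):
// X-MS-ENUMATT is a WebDAV verb, so it can be set as the request method.
HttpWebRequest attReq = (HttpWebRequest)WebRequest.Create("http://xxxx.com/exchange/xxxx/Inbox/some-message.EML");
attReq.Credentials = new NetworkCredential("xxxxx", "xxxxxx", "xxxxx.com");
attReq.Method = "X-MS-ENUMATT";
using (var attResp = (HttpWebResponse)attReq.GetResponse())
using (var reader = new StreamReader(attResp.GetResponseStream()))
{
    // The response enumerates the message's attachments; the undeliverable
    // body should be among them if it was stored as an attachment.
    string raw = reader.ReadToEnd();
}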
You can also try sending an HTTP request to the server the same way OWA does (specifying the mail ID in it, of course) - then you will get HTML you can parse (see the sketch below).
Also, check whether the original message is in the attachments array of the "undelivered" email.
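For example, here is a sketch of an OWA-style request; the ?Cmd=Body URL command and the message URL are assumptions about the OWA 2003 URL interface, so verify them against your server:
// Hypothetical OWA 2003-style URL command appended to the message URL.
HttpWebRequest owaReq = (HttpWebRequest)WebRequest.Create("http://xxxx.com/exchange/xxxx/Inbox/some-message.EML?Cmd=Body");
owaReq.Credentials = new NetworkCredential("xxxxx", "xxxxxx", "xxxxx.com");
using (var owaResp = (HttpWebResponse)owaReq.GetResponse())
using (var reader = new StreamReader(owaResp.GetResponseStream()))
{
    // OWA returns the rendered HTML body, which can then be parsed.
    string html = reader.ReadToEnd();
}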
I am trying to download the contents of a website. However, for a certain webpage the returned string contains jumbled data with many � characters.
Here is the code I was originally using.
HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url);
req.Method = "GET";
req.UserAgent = "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))";
string source;
using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream()))
{
    source = reader.ReadToEnd();
}
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(source);
I also tried alternate implementations with WebClient, but still the same result:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
using (WebClient client = new WebClient())
using (var read = client.OpenRead(url))
{
    doc.Load(read, true);
}
From searching, I guess this might be an issue with encoding, so I tried both of the solutions linked below, but I still cannot get this to work:
http://blogs.msdn.com/b/feroze_daud/archive/2004/03/30/104440.aspx
http://bytes.com/topic/c-sharp/answers/653250-webclient-encoding
The offending site that I cannot seem to download is the United_States article on the English version of Wikipedia (http://en.wikipedia.org/wiki/United_States).
I have tried a number of other Wikipedia articles, though, and have not seen this issue.
Using the built-in loader in HtmlAgilityPack worked for me:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://en.wikipedia.org/wiki/United_States");
string html = doc.DocumentNode.OuterHtml; // no jumbled data here
Edit:
Using a standard WebClient with your user agent results in an HTTP 403 (Forbidden) - using this instead worked for me:
using (WebClient wc = new WebClient())
{
    wc.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
    string html = wc.DownloadString("http://en.wikipedia.org/wiki/United_States");
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
}
Also see this SO thread: WebClient forbids opening wikipedia page?
The response is gzip encoded.
Try the following to decode the stream:
UPDATE: Based on the comment by BrokenGlass, setting the following properties should solve your problem (it worked for me):
req.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
Old/Manual solution:
string source;
var response = req.GetResponse();
var stream = response.GetResponseStream();
try
{
    if (response.Headers.AllKeys.Contains("Content-Encoding")
        && response.Headers["Content-Encoding"].Contains("gzip"))
    {
        stream = new System.IO.Compression.GZipStream(stream, System.IO.Compression.CompressionMode.Decompress);
    }
    using (StreamReader reader = new StreamReader(stream))
    {
        source = reader.ReadToEnd();
    }
}
finally
{
    if (stream != null)
        stream.Dispose();
}
This is how I usually grab a page into a string (it's VB, but should translate easily):
Dim req As Net.WebRequest = Net.WebRequest.Create("http://www.cnn.com")
Dim resp As Net.HttpWebResponse = CType(req.GetResponse(), Net.HttpWebResponse)
Dim sr As New IO.StreamReader(resp.GetResponseStream())
Dim lcResults As String = sr.ReadToEnd()
and I haven't had the problems you're describing.
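A direct C# translation of that VB snippet, for reference:
// Same calls as the VB version above.
var req = System.Net.WebRequest.Create("http://www.cnn.com");
var resp = (System.Net.HttpWebResponse)req.GetResponse();
string lcResults;
using (var sr = new System.IO.StreamReader(resp.GetResponseStream()))
{
    lcResults = sr.ReadToEnd();
}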