I am trying to download the contents of a website, but for one particular page the returned string contains jumbled data with many � characters.
Here is the code I was originally using:
HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url);
req.Method = "GET";
req.UserAgent = "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))";
string source;
using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream()))
{
source = reader.ReadToEnd();
}
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(source);
I also tried an alternative implementation with WebClient, but got the same result:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
using (WebClient client = new WebClient())
using (var read = client.OpenRead(url))
{
doc.Load(read, true);
}
From searching, I guessed this might be an encoding issue, so I tried both of the solutions linked below, but still cannot get this to work.
http://blogs.msdn.com/b/feroze_daud/archive/2004/03/30/104440.aspx
http://bytes.com/topic/c-sharp/answers/653250-webclient-encoding
The offending page that I cannot seem to download is the United_States article on the English Wikipedia (http://en.wikipedia.org/wiki/United_States).
I have tried a number of other Wikipedia articles and have not seen this issue with them.
Using the built-in loader in HtmlAgilityPack worked for me:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://en.wikipedia.org/wiki/United_States");
string html = doc.DocumentNode.OuterHtml; // no jumbled data here
Edit:
Using a standard WebClient with your user-agent results in an HTTP 403 (Forbidden); using this instead worked for me:
using (WebClient wc = new WebClient())
{
wc.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
string html = wc.DownloadString("http://en.wikipedia.org/wiki/United_States");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
}
Also see this SO thread: WebClient forbids opening wikipedia page?
The response is gzip encoded.
Try the following to decode the stream:
UPDATE
Based on the comment by BrokenGlass, setting the following properties should solve your problem (it worked for me):
req.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
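For context, a minimal sketch of the full request with those two properties set might look like this (the URL and user agent here are placeholders, not taken from the question):
HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://en.wikipedia.org/wiki/United_States");
req.Method = "GET";
req.UserAgent = "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4";
req.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
string source;
// The framework decompresses the body transparently, so a plain StreamReader suffices.
using (var response = req.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    source = reader.ReadToEnd();
}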
Old/Manual solution:
string source;
var response = req.GetResponse();
var stream = response.GetResponseStream();
try
{
if (response.Headers.AllKeys.Contains("Content-Encoding")
&& response.Headers["Content-Encoding"].Contains("gzip"))
{
stream = new System.IO.Compression.GZipStream(stream, System.IO.Compression.CompressionMode.Decompress);
}
using (StreamReader reader = new StreamReader(stream))
{
source = reader.ReadToEnd();
}
}
finally
{
if (stream != null)
stream.Dispose();
}
This is how I usually grab a page into a string (it's VB, but should translate easily):
Dim req As Net.WebRequest = Net.WebRequest.Create("http://www.cnn.com")
Dim resp As Net.HttpWebResponse = CType(req.GetResponse(), Net.HttpWebResponse)
Dim sr As New IO.StreamReader(resp.GetResponseStream())
Dim lcResults As String = sr.ReadToEnd()
and I haven't had the problems you're seeing.
Related
I am using the following method to connect to an RSS feed.
var url = "http://blogs.mysite.com/feed/";
var sourceXmlFeed = "";
using (var wc = new WebClient())
{
wc.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
sourceXmlFeed = wc.DownloadString(url);
}
var xrs = new XmlReaderSettings();
xrs.CheckCharacters = false;
var xtr = new XmlTextReader(new System.IO.StringReader(sourceXmlFeed));
var xmlReader = XmlReader.Create(xtr, xrs);
SyndicationFeed feed = SyndicationFeed.Load(xmlReader);
However, I am getting bad characters in the output (see below), and I assume it is something to do with the encoding.
e.g. for 2015 we are â?ogoing for goldâ?
Anyone know how to fix this?
By the way, I am doing things this way because I have been unable to use the more direct approach below without causing "The remote server returned an error: (443)."
var xmlReader = XmlReader.Create("http://blogs.mysite.com/feed");
SyndicationFeed feed = SyndicationFeed.Load(xmlReader);
The encoding problem is caused by reading the XML as a string first: encoding detection for XML (which can use the XML declaration) differs from encoding detection for strings. Read the response stream directly instead:
WebClient webClient = null;
XmlReader xmlReader = null;
try
{
webClient = new WebClient();
webClient.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
xmlReader = XmlReader.Create(webClient.OpenRead(url));
// Read the XML here, because in the finally block the response stream and the reader will be closed.
}
finally
{
if (webClient != null)
{ webClient.Dispose(); }
if (xmlReader != null)
{ xmlReader.Dispose(); }
}
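For reference, a more compact equivalent using using blocks might look like this (a sketch; the feed URL and user agent are the asker's placeholders):
using (var webClient = new WebClient())
{
    webClient.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
    // Hand XmlReader the raw stream so it can sniff the encoding itself.
    using (var stream = webClient.OpenRead("http://blogs.mysite.com/feed/"))
    using (var xmlReader = XmlReader.Create(stream))
    {
        SyndicationFeed feed = SyndicationFeed.Load(xmlReader);
        // Consume feed.Items here, before the stream is disposed.
    }
}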
This code used to work and let me run XPath queries against the site, but when I debugged it today I found that it no longer gets any HTML back from webtruyen.com. I checked webtruyen.com/robots.txt and found nothing suspicious there. I also tried adding a proxy to get the data, but it still returns null. How can I read the data from http://webtruyen.com and query it with XPath?
My code:
string url = "http://webtruyen.com";
var web = new HtmlWeb();
var doc = web.Load(url);
String temps = "";
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
temps = node.InnerHtml;
}
When I debug, it returns:
InnerHtml 'doc.DocumentNode.InnerHtml' threw an exception of type 'System.NullReferenceException' string {System.NullReferenceException}
My code using a proxy:
string url = "http://webtruyen.com";
var web = new HtmlWeb();
web.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Speedy Spider (http://www.entireweb.com/about/search_tech/speedy_spider/)";
var doc = web.Load(url);
String temps = "";
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
temps = node.InnerHtml;
}
I get the same error using HtmlWeb.Load(), but the issue can be solved using HttpWebRequest (TL;DR: see Step 3 for the working code).
Step 1) Using the following code:
HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
using (Stream s = hwr.GetResponse().GetResponseStream())
{ }
You see that you actually get a 403 Forbidden error (WebException).
Step 2)
HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
HtmlDocument doc = new HtmlDocument();
try
{
using (Stream s = hwr.GetResponse().GetResponseStream())
{ }
}
catch (WebException wx)
{
doc.LoadHtml(new StreamReader(wx.Response.GetResponseStream()).ReadToEnd());
}
Looking at doc.DocumentNode.OuterHtml, you see the HTML of the Forbidden page, along with the JavaScript that sets a cookie in your browser and then refreshes the page.
Step 3) So, in order to load the page outside of a real browser, you have to set that cookie manually and re-request the page. That is, with:
string cookie = string.Empty;
HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
try
{
using (Stream s = hwr.GetResponse().GetResponseStream())
{ }
}
catch (WebException wx)
{
cookie = Regex.Match(new StreamReader(wx.Response.GetResponseStream()).ReadToEnd(), "document.cookie = '(.*?)';").Groups[1].Value;
}
hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
hwr.Headers.Add("Cookie", cookie);
HtmlDocument doc = new HtmlDocument();
using (Stream s = hwr.GetResponse().GetResponseStream())
using (StreamReader sr = new StreamReader(s))
{
doc.LoadHtml(sr.ReadToEnd());
}
You get the page :)
Moral of the story: if your browser can do it, so can you.
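As an aside, SelectNodes returns null rather than an empty collection when nothing matches, which is what produced the NullReferenceException in your debugger output. A small guard (a generic sketch) avoids that:
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a");
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        // process node.InnerHtml here
    }
}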
I am attempting to load a page I've received from an RSS feed and I receive the following WebException:
Cannot handle redirect from HTTP/HTTPS protocols to other dissimilar ones.
with an inner exception:
Invalid URI: The hostname could not be parsed.
Here's the code I'm using:
System.Net.HttpWebRequest req = (System.Net.HttpWebRequest)System.Net.HttpWebRequest.Create(url);
string source = String.Empty;
Uri responseURI;
try
{
req.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:31.0) Gecko/20100101 Firefox/31.0";
req.Headers.Add("Accept-Language", "en-us,en;q=0.5");
req.AllowAutoRedirect = true;
using (System.Net.WebResponse webResponse = req.GetResponse())
{
using (HttpWebResponse httpWebResponse = webResponse as HttpWebResponse)
{
responseURI = httpWebResponse.ResponseUri;
StreamReader reader;
if (httpWebResponse.ContentEncoding.ToLower().Contains("gzip"))
{
reader = new StreamReader(new GZipStream(httpWebResponse.GetResponseStream(), CompressionMode.Decompress));
}
else if (httpWebResponse.ContentEncoding.ToLower().Contains("deflate"))
{
reader = new StreamReader(new DeflateStream(httpWebResponse.GetResponseStream(), CompressionMode.Decompress));
}
else
{
reader = new StreamReader(httpWebResponse.GetResponseStream());
}
source = reader.ReadToEnd();
reader.Close();
}
}
}
catch (WebException we)
{
Console.WriteLine(url + "\n--\n" + we.Message);
return null;
}
I'm not sure if I'm doing something wrong or if there's something extra I need to be doing. Any help would be greatly appreciated! Let me know if there's more information you need.
UPDATE
So after following Jim Mischel's suggestions I've narrowed it down to a UriFormatException that claims Invalid URI: The hostname could not be parsed.
Here's the URL that's in the last "Location" Header: http:////www-nc.nytimes.com/
I can see why that fails, but I'm not sure why it causes trouble here when the original URL loads just fine in my browser. Is there something I'm missing, or something I should be doing to handle this strange URL?
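One possible workaround (a sketch, assuming that collapsing the run of slashes yields the intended host; it uses System.Text.RegularExpressions) is to disable automatic redirects and repair the Location header yourself before following it:
req.AllowAutoRedirect = false;
using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
{
    if ((int)resp.StatusCode >= 300 && (int)resp.StatusCode < 400)
    {
        string location = resp.Headers["Location"];
        // Collapse "http:////host" down to "http://host".
        location = Regex.Replace(location, @"^(https?:)/+", "$1//");
        HttpWebRequest next = (HttpWebRequest)WebRequest.Create(location);
        // ...then issue 'next' and read its response as before.
    }
}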
I am trying to get the result of the following JSON web service, https://mtgox.com/code/data/getDepth.php, into a string using the following code.
using (WebClient client = new WebClient())
{
string data = client.DownloadString("https://mtgox.com/code/data/getDepth.php");
}
but it always returns a timeout exception and no data. I plan to use fastJSON to turn the response into objects; I expected that to be the hard part, not retrieving the content of the page.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://mtgox.com/code/data/getDepth.php");
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
using (StreamReader sr = new StreamReader(response.GetResponseStream()))
{
string data = sr.ReadToEnd();
}
}
This also resulted in the same error. Can anyone point out what I am doing wrong?
Hmm, strange, this works great for me:
class Program
{
static void Main()
{
using (var client = new WebClient())
{
client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0) Gecko/20100101 Firefox/4.0";
var result = client.DownloadString("https://mtgox.com/code/data/getDepth.php");
Console.WriteLine(result);
}
}
}
Notice that I am specifying a User-Agent HTTP header, as it seems that the site expects it.
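If you prefer your HttpWebRequest version, the equivalent (a sketch) is to set the UserAgent property before calling GetResponse:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://mtgox.com/code/data/getDepth.php");
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0) Gecko/20100101 Firefox/4.0";
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader sr = new StreamReader(response.GetResponseStream()))
{
    string data = sr.ReadToEnd();
}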
I had a similar issue before; request.KeepAlive = false solved my problem. Try this:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://mtgox.com/code/data/getDepth.php");
request.KeepAlive = false;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
using (StreamReader sr = new StreamReader(response.GetResponseStream()))
{
string data = sr.ReadToEnd();
}
}
Okay so here is the deal. As the question states, I'm trying to POST a file to a webserver and am having a few issues.
I've tried posting this same file to the same web server using curl.exe and have had no issues. I've included the flags I used with curl in case they point out any potential reasons why I'm having trouble with the .NET classes.
curl.exe --user "myUser:myPass" --header "Content-Type: application/gzip"
--data-binary "@filename.txt.gz" --cookie "data=service; data-ver=2; date=20100212;
time=0900; location=1234" --output "out.txt" --dump-header "header.txt"
http://mysite/receive
I'm trying to use a .NET class like WebClient or HttpWebRequest to do the same thing. Here is a sample of the code I've tried. With WebClient I get a 505 HTTP Version Not Supported error, and with HttpWebRequest I get a 501 Not Implemented.
When trying it with a WebClient:
public void sendFileClient(string path){
string url = "http://mysite/receive";
WebClient wc = new WebClient();
string USERNAME = "myUser";
string PSSWD = "myPass";
NetworkCredential creds = new NetworkCredential(USERNAME, PSSWD);
wc.Credentials = creds;
wc.Headers.Set(HttpRequestHeader.ContentType, "application/gzip");
wc.Headers.Set("Cookie", "location=1234; date=20100226; time=1630; data=service; data-ver=2");
wc.UploadFile(url, "POST", path);
}
And while using an HttpWebRequest:
public Stream sendFile(string path)
{
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create("http://myserver/receive");
string USERNAME = "myUser";
string PSSWD = "myPass";
NetworkCredential creds = new NetworkCredential(USERNAME, PSSWD);
request.Credentials = creds;
request.Method = "POST";
request.ContentType = "application/gzip";
request.Headers.Set("Cookie", "location=1234; date=20100226; time=1630; data=service; data-ver=2");
FileInfo fInfo = new FileInfo(path);
long numBytes = fInfo.Length;
FileStream fStream = new FileStream(path, FileMode.Open, FileAccess.Read);
BinaryReader br = new BinaryReader(fStream);
byte[] data = br.ReadBytes((int)numBytes);
br.Close();
fStream.Close();
fStream.Dispose();
Stream wrStream = request.GetRequestStream();
BinaryWriter bw = new BinaryWriter(wrStream);
bw.Write(data);
bw.Close();
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
return response.GetResponseStream();
}
First, use a debugging proxy like Fiddler and inspect the requests and responses to see what differs between curl and System.Net.WebClient.
Also, you can try the following (although inspecting with the debugging proxy should let you pinpoint the difference):
Use the credential cache to set your credentials for basic authentication:
var cc = new CredentialCache();
cc.Add(new Uri(url),
"Basic",
new NetworkCredential("USERNAME", "PASSWORD"));
wc.Credentials = cc;
Set a user agent header:
string _UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)";
wc.Headers.Add(HttpRequestHeader.UserAgent, _UserAgent);
Change the protocol version on the WebRequest:
request.KeepAlive = false;
request.ProtocolVersion = HttpVersion.Version10;
There might be two other reasons for a 501.
1) The POST data contains Chinese characters (or some other non-ASCII characters), e.g.:
postData = "type=user&username=计算机学院&password=123&Submit=+登录+"
In order to post the message you might also add the following two lines:
Request.SendChunked = true;
Request.TransferEncoding = "GB2312";
but this also leads to a 501. In that case, delete those two lines and URL-encode the postData instead, like so:
postData = "type=user&username=%BC%C6%CB%E3%BB%FA%D1%A7%D4%BA&password=123&Submit=+%C8%B7%C8%CF+"
Another way to handle the encoding may be the following, although I haven't tested it yet:
string str = Encoding.GetEncoding("gb2312").GetString(tmpBytes);
2) If Response.StatusCode == HttpStatusCode.Redirect (that is, a 302), the following line is a must:
Request.AllowAutoRedirect = false;