Can't download utf-8 web content - c#

I have some simple code that fetches the response from a Vietnamese website, http://vnexpress.net, but there is a small problem. The first download works fine, but after that the content contains unknown symbols like this: �\b\0\0\0\0\0\0�\a`I�%&/m.... What is the problem?
string address = "http://vnexpress.net";
WebClient webClient = new WebClient();
webClient.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11 AlexaToolbar/alxg-3.1");
webClient.Encoding = System.Text.Encoding.UTF8;
return webClient.DownloadString(address);

You'll find that the response is GZipped. There doesn't appear to be a way to download that with WebClient, unless you create a derived class and modify the underlying HttpWebRequest to allow automatic decompression.
Here's how you'd do that:
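public class MyWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        // Cast rather than "as", so a non-HTTP request fails here with a clear exception.
        var req = (HttpWebRequest)base.GetWebRequest(address);
        // Let the framework decompress gzipped responses transparently.
        req.AutomaticDecompression = DecompressionMethods.GZip;
        return req;
    }
}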
And to use it:
string address = "http://vnexpress.net";
MyWebClient webClient = new MyWebClient();
webClient.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11 AlexaToolbar/alxg-3.1");
webClient.Encoding = System.Text.Encoding.UTF8;
return webClient.DownloadString(address);
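
If subclassing WebClient is not convenient, a rough alternative is to download the raw bytes and decompress them by hand. The sketch below assumes the body is either plain UTF-8 or gzip-compressed UTF-8; the `GzipHelper` class and its method names are made up for illustration and are not part of the framework.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper: decodes a response body that may or may not be gzipped.
public static class GzipHelper
{
    // A gzip stream starts with the two magic bytes 0x1F 0x8B.
    public static bool LooksGzipped(byte[] data)
    {
        return data != null && data.Length >= 2 && data[0] == 0x1F && data[1] == 0x8B;
    }

    // Decompresses the payload when it looks gzipped, then decodes it as UTF-8.
    public static string DecodeBody(byte[] data)
    {
        if (!LooksGzipped(data))
            return Encoding.UTF8.GetString(data);

        using (var input = new MemoryStream(data))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }
}
```

It would be used as `GzipHelper.DecodeBody(webClient.DownloadData(address))` in place of `DownloadString`. Setting `AutomaticDecompression` is still the simpler route when you control the request.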

Try this code and you should be fine:
string address = "http://vnexpress.net";
WebClient webClient = new WebClient();
webClient.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11 AlexaToolbar/alxg-3.1");
return Encoding.UTF8.GetString(Encoding.Default.GetBytes(webClient.DownloadString(address)));

DownloadString requires that the server correctly indicate the charset in the Content-Type response header. If you watch in Fiddler, you'll see that the server instead sends the charset inside a META Tag in the HTML response body:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
If you need to handle responses like this, you need to either parse the HTML yourself or use a library like FiddlerCore to do this for you.
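
As a rough illustration of the "parse the HTML yourself" option, the sketch below looks for a charset declaration in a META tag near the start of the body and decodes the bytes with it. The `CharsetSniffer` class, the 2048-byte scan window, and the regular expression are assumptions made for this example; a real HTML parser would be more robust.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: picks an encoding from a META charset declaration.
public static class CharsetSniffer
{
    // Matches both <meta charset="..."> and the http-equiv form with "charset=..." inside content.
    private static readonly Regex MetaCharset = new Regex(
        @"<meta[^>]*charset\s*=\s*[""']?\s*([A-Za-z0-9_\-]+)",
        RegexOptions.IgnoreCase);

    public static Encoding DetectEncoding(byte[] body, Encoding fallback)
    {
        // The tag itself is ASCII, so an ASCII decode of the first bytes is enough to find it.
        int count = Math.Min(body.Length, 2048);
        string head = Encoding.ASCII.GetString(body, 0, count);

        Match match = MetaCharset.Match(head);
        if (!match.Success)
            return fallback;

        try
        {
            return Encoding.GetEncoding(match.Groups[1].Value);
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name: fall back instead of failing.
            return fallback;
        }
    }

    // Decodes the body with the declared charset, defaulting to UTF-8.
    public static string Decode(byte[] body)
    {
        return DetectEncoding(body, Encoding.UTF8).GetString(body);
    }
}
```

With WebClient this would be called as `CharsetSniffer.Decode(webClient.DownloadData(address))`. It only helps once the bytes are already decompressed.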

Related

Downloading JSON with WebClient results in weird unicode-like characters?

So I can make a request in my browser to,
https://search.snapchat.com/lookupStory?id=itsmaxwyatt
and it will give me back JSON, but if I do it via web client, it seems to give me back a very obfuscated string? I can provide it all, but have truncated for now:
�x��ƽ���������o�Cj񦌁�_�����˗��89:�/�[��/� h��#l���ٗC��U.�gH�,����qOv�_� �_����σҭ
So, here is the C# code:
using var webClient = new WebClient();
webClient.Headers.Add ("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:89.0) Gecko/20100101 Firefox/89.0");
webClient.Headers.Add("Host", "search.snapchat.com");
webClient.DownloadString("https://search.snapchat.com/lookupStory?id=itsmaxwyatt");
I have also tried in a http rest client without any headers, and it still returns JSON.
Tried with encoding:
using var webClient = new WebClient();
webClient.Headers[HttpRequestHeader.AcceptEncoding] = "gzip";
webClient.Encoding = Encoding.UTF8;
webClient.Headers.Add ("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:89.0) Gecko/20100101 Firefox/89.0");
webClient.Headers.Add("Host", "search.snapchat.com");
Console.WriteLine(Encoding.UTF8.GetString(webClient.DownloadData("https://search.snapchat.com/lookupStory?id=itsmaxwyatt")));
Following @Progman's comment, all you need to do is the following:
class MyWebClient : WebClient
{
protected override WebRequest GetWebRequest(Uri address)
{
HttpWebRequest request = base.GetWebRequest(address) as HttpWebRequest;
request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
return request;
}
}
void Main()
{
using var webClient = new MyWebClient();
webClient.Headers.Add("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:89.0) Gecko/20100101 Firefox/89.0");
webClient.Headers.Add("Host", "search.snapchat.com");
var str = webClient.DownloadString("https://search.snapchat.com/lookupStory?id=itsmaxwyatt");
Debug.WriteLine(str);
}

C# Empty WebClient Downloadstring

I'm trying to download the HTML of a website as a string. The website has the following URL:
https://www.gastrobern.ch/de/service/aus-weiterbildung/wirtekurs/234/?oid=1937&lang=de
First I tried a simple WebClient request:
var wc = new WebClient();
string websitenstring = "";
websitenstring = wc.DownloadString("http://www.gastrosg.ch/default.asp?id=3020000&siteid=1&langid=de");
But websitenstring was empty. Then I read in some posts that I have to send some additional header information:
var wc = new WebClient();
string websitenstring = "";
wc.Headers[HttpRequestHeader.Accept] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8";
wc.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate, br";
wc.Headers[HttpRequestHeader.AcceptLanguage] = "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7";
wc.Headers[HttpRequestHeader.CacheControl] = "max-age=0";
wc.Headers[HttpRequestHeader.Host] = "www.gastrobern.ch";
wc.Headers[HttpRequestHeader.Upgrade] = "www.gastrobern.ch";
wc.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36";
websitenstring = wc.DownloadString("https://www.gastrobern.ch/de/service/aus-weiterbildung/wirtekurs/234/?oid=1937&lang=de");
I tried this, but still got no response. Then I also tried to set some cookies:
wc.Headers.Add(HttpRequestHeader.Cookie,
"CFID=10609582;" +
"CFTOKEN=32721418;" +
"_ga=GA1.2.37" +
"_ga=GA1.2.379124242.1539000256;" +
"_gid=GA1.2.358798732.1539000256;" +
"_dc_gtm_UA-1237799-1=1;");
But this didn't work either. I also found out that the browser somehow makes multiple requests, while my C# application makes just one and shows the headers of the first response.
But I don't know how to make a follow-up request. I'm thankful for every answer.
Try HttpClient instead.
Here is an example of how to use it:
public async static Task<string> GetString(string url)
{
HttpClient client = new HttpClient();
// ConfigureAwait(false) avoids a deadlock when the caller blocks on .Result
HttpResponseMessage message = await client.GetAsync(url).ConfigureAwait(false);
return await message.Content.ReadAsStringAsync().ConfigureAwait(false);
}
To call this method:
string dataFromServer = GetString("https://www.gastrobern.ch/de/service/aus-weiterbildung/wirtekurs/234/?oid=1937&lang=de").Result;
I checked this: dataFromServer contains the HTML content of that page.
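
A slightly fuller variant, sketched under a few assumptions: it reuses a single HttpClient, lets the handler decompress gzip/deflate responses, follows redirects, and keeps cookies between requests. The `PageFetcher` class and its members are hypothetical names for this example, and whether these settings are what the site in question needs is untested.

```csharp
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical wrapper around one shared HttpClient instance.
public static class PageFetcher
{
    public static HttpClientHandler CreateHandler()
    {
        return new HttpClientHandler
        {
            // Decompress gzip and deflate bodies before they reach the caller.
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate,
            // Follow redirects, as a browser would.
            AllowAutoRedirect = true,
            // Keep cookies set by earlier responses for later requests.
            UseCookies = true,
            CookieContainer = new CookieContainer()
        };
    }

    private static readonly HttpClient Client = new HttpClient(CreateHandler());

    public static async Task<string> GetStringAsync(string url)
    {
        using (HttpResponseMessage response = await Client.GetAsync(url).ConfigureAwait(false))
        {
            // Throws HttpRequestException for non-success status codes.
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync().ConfigureAwait(false);
        }
    }
}
```

From an async method it would be called as `string html = await PageFetcher.GetStringAsync(url);`.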

WPF c# WebCrawler

I'm trying to develop a web crawler to extract some information from my company's website, but I'm getting the error below:
An exception of type System.Net.WebException occurred in System.dll but was not handled in user code
Additional information: The remote server returned an error: (500) Internal Server Error.
Here are the request headers for the website:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Connection:keep-alive
Cookie:WASReqURL=https://:9446/ProcessPortal/jsp/index.jsp; com_ibm_bpm_process_portal_hash=null
Host:ca8webp.itau:9446
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17
Here is my code:
public bool Acessar(CredencialWAO credencial)
{
bool logado = true;
#region REQUISIÇÃO PAGINA INICIAL
ParametrosCrawler parametrosRequest = new ParametrosCrawler("https://ca8webp.itau:9446/ProcessPortal/login.jsp");
parametrosRequest.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
parametrosRequest.AcceptEncoding = "gzip, deflate, sdch";
parametrosRequest.AcceptLanguage = "en-US,en;q=0.8";
parametrosRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17";
parametrosRequest.CacheControl = "max-age=0";
}
protected override WebResponse GetWebResponse(WebRequest request)
{
var response = base.GetWebResponse(request);
AtualizarCookies(((HttpWebResponse)response).Cookies);
return response;
}
The error occurs when calling "base.GetWebResponse(request)"

I am getting different page in httpwebrequest in C#

I am making an HttpWebRequest to fetch page data from americanapparel.net using this code:
var request = (HttpWebRequest)WebRequest.Create("http://store.americanapparel.net/en/sports-bra_rsaak301?c=White");
request.UserAgent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/49.0.2623.108 Chrome/49.0.2623.108 Safari/537.36";
var response = request.GetResponse();
//cli.Headers.Add ("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
using (var reader = new StreamReader(response.GetResponseStream()))
{
var data = reader.ReadToEnd();
return data;
}
I am requesting data from this URL:
http://store.americanapparel.net/en/sports-bra_rsaak301?c=White
But the live page in the browser differs from the data returned to my HttpWebRequest.
How can I get the exact page data in C#?

UPS: AWB tracking via web crawling

Problem
Here at work, people spend a lot of time tracking AWBs (air waybills) from different carriers (UPS, FedEx, DHL, ...). I was asked to improve the process in order to save valuable time. I was thinking of using Excel as the platform, with Excel-DNA and C#, but my tests so far (crawling UPS) have had no success.
Tests
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create("https://wwwapps.ups.com/WebTracking/track?HTMLVersion=5.0&loc=es_MX&Requester=UPSHome&WBPM_lid=homepage%2Fct1.html_pnl_trk&trackNums=5007052424&track.x=Rastrear");
request.Method = "GET";
request.UserAgent = "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
request.Headers.Add("Accept-Language: es-ES,es;q=0.8");
request.Headers.Add("Accept-Encoding: gzip,deflate,sdch");
request.KeepAlive = false;
request.Referer = @"http://www.ups.com/";
request.ContentType = "text/html; charset=utf-8";
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader sr = new StreamReader(response.GetResponseStream());
Or...
using (var client = new WebClient())
{
var values = new NameValueCollection();
values.Add("HTMLVersion", "5.0");
values.Add("loc", "es_MX");
values.Add("Requester", "UPSHome");
values.Add("WBPM_lid", "homepage/ct1.html_pnl_trk");
values.Add("trackNums", "5007052424");
values.Add("track.x", "Rastrear");
client.Headers[HttpRequestHeader.Accept] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
client.Headers[HttpRequestHeader.AcceptEncoding] = "gzip,deflate,sdch";
client.Headers[HttpRequestHeader.AcceptLanguage] = "es-ES,es;q=0.8";
client.Headers[HttpRequestHeader.Referer] = @"http://www.ups.com/";
client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36";
string url = @"https://wwwapps.ups.com/WebTracking/track?";
byte[] result = client.UploadValues(url, values);
System.IO.File.WriteAllText(@"C:\UPSText.txt", Encoding.UTF8.GetString(result));
}
But neither of the above examples worked as expected.
Question
Is it possible to web-crawl UPS in order to keep track of AWBs?
Note
Currently, I have no access to UPS API.
I just finished writing my script for it. The trick is that there is another URL where you can include the tracking number directly in the URL and land on the results page. You will then have to parse the tables, as XML tags won't work; just offset from a header.