Could someone help with parsing website.
I have parsed lots of sites but this one is interesting, the inner code is generated dynamically with php file. So I tried to use WebClient like this:
WebClient client = new WebClient();
string postData = "getProducts=1&category=340&brand=0";
byte[] byteArray = Encoding.UTF8.GetBytes(postData);
client.Headers.Add("POST", "/ajax.php HTTP/1.1");
client.Headers.Add("Host", site);
client.Headers.Add("Connection", "keep-alive");
client.Headers.Add("Origin", "http://massup.ru");
client.Headers.Add("X-Requested-With", "XMLHttpRequest");
client.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11");
client.Headers.Add("Accept", "*/*");
client.Headers.Add("Content-Type", "application/x-www-form-urlencoded");
client.Headers.Add("Content-length", byteArray.Length.ToString());
client.Headers.Add("Referer", "http://massup.ru/category/proteini");
client.Headers.Add("Accept-Encoding", "gzip,deflate,sdch");
client.Headers.Add("Accept-Language", "ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4");
client.Headers.Add("Accept-Charset", "windows-1251,utf-8;q=0.7,*;q=0.3");
client.Headers.Add("Cookie", "cart=933a71dfee2baf8573dfc2094a937f0d; r_v=YToyOntpOjA7YTo3OntzOjU6Im1vZGVsIjtzOjI2OiIxMDAlIFdoZXkgUHJvdGVpbiA5MDgg0LPRgCI7czozOiJ1cmwiO3M6MzQ6Im11bHRpcG93ZXItMTAwLXdoZXktcHJvdGVpbi05MDgtZ3IiO3M6NToiYnJhbmQiO3M6MTA6Ik11bHRpcG93ZXIiO3M6ODoiY2F0ZWdvcnkiO3M6Mzk6ItCh0YvQstC%2B0YDQvtGC0L7Rh9C90YvQtSDQuNC30L7Qu9GP0YLRiyI7czo5OiJzY2F0ZWdvcnkiO3M6Mzc6ItCh0YvQstC%2B0YDQvtGC0L7Rh9C90YvQuSDQuNC30L7Qu9GP0YIiO3M6NToicHJpY2UiO3M6MToiMCI7czo0OiJpY29uIjtzOjM3OiJodHRwOi8vbWFzc3VwLnJ1L2ltYWdlcy9pY29uXzQ3NTIuanBnIjt9aToxO2E6Nzp7czo1OiJtb2RlbCI7czoxNzoiTWF0cml4IDIuMCA5ODQg0LMiO3M6MzoidXJsIjtzOjE2OiJtYXRyaXgtMi0wLTk4NC1nIjtzOjU6ImJyYW5kIjtzOjc6IlN5bnRyYXgiO3M6ODoiY2F0ZWdvcnkiO3M6Mzk6ItCh0YvQstC%2B0YDQvtGC0L7Rh9C90YvQtSDQuNC30L7Qu9GP0YLRiyI7czo5OiJzY2F0ZWdvcnkiO3M6Mzc6ItCh0YvQstC%2B0YDQvtGC0L7Rh9C90YvQuSDQuNC30L7Qu9GP0YIiO3M6NToicHJpY2UiO3M6NDoiMTE5MCI7czo0OiJpY29uIjtzOjM3OiJodHRwOi8vbWFzc3VwLnJ1L2ltYWdlcy9pY29uXzEwMDguanBnIjt9fQ%3D%3D; PHPSESSID=933a71dfee2baf8573dfc2094a937f0d");
Stream data = client.OpenRead("http://massup.ru/ajax.php");
StreamReader reader = new StreamReader(data);
string s = reader.ReadToEnd();
Console.WriteLine(s);
data.Close();
reader.Close();
But it gives me an error!
Could someone help me with this kind of parsing.
See my answer to C# https login and download file, which has working code that correctly handles HTTP POSTs, then clean up what you have based on it. After you've done that, if you still need help, post your updated code and a clearer description of what specific errors or exceptions you are seeing.
Related
I can't seem to get the hang of my HTTP POST methods. I have just learned how to do GET methods to retrieve webpages but now i'm trying to fill in information on the webpage and can't seem to get it working. The source code that comes back is always an invalid page (full of broken images/not the right information)
public static void jsonPOST(string url)
{
url = "http://treasurer.maricopa.gov/Parcel/TaxReceipt.aspx/GetTaxReceipt";
var httpWebRequest = (HttpWebRequest)WebRequest.Create(new Uri(url));
httpWebRequest.ContentType = "application/json; charset=utf-8";
httpWebRequest.Accept = "application/json, text/javascript, */*; q=0.01";
httpWebRequest.Headers.Add("Accept-Encoding: gzip, deflate");
httpWebRequest.CookieContainer = cookieJar;
httpWebRequest.Method = "POST";
httpWebRequest.Headers.Add("Accept-Language: en-US,en;q=0.5");
httpWebRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW65; Trident/7.0; MAM5; rv:11.0) like Gecko";
httpWebRequest.Referer = "http://treasurer.maricopa.gov/Parcel/TaxReceipt.aspx";
string postData = "{\"startDate\":\"1/1/2013\",\"parcelNumber\":\"17609419\"}";
byte[] bytes = System.Text.Encoding.ASCII.GetBytes(postData);
httpWebRequest.ContentLength = bytes.Length;
System.IO.Stream os = httpWebRequest.GetRequestStream();
os.Write(bytes, 0, bytes.Length); //Push it out there
os.Close();
System.Net.WebResponse resp = httpWebRequest.GetResponse();
if (resp == null)
{
Console.WriteLine("null");
}
System.IO.StreamReader sr = new System.IO.StreamReader(resp.GetResponseStream());
string source = sr.ReadToEnd().Trim();
}
EDIT: I updated the code to reflect my new problem. The problem i have now is that the source code is not what is coming back to me. I am getting just the raw JSON information in the source. Which i can use to deserialize the information i need to obtain, but i'm curious why the actual source code isn't coming back to me
The source code that comes back is always an invalid page (full of broken images/not the right information)
It sounds like you just get the Source code without thinking of relative paths. As long as there are relative paths on the site it will not show correctly at your copy. You have to replace all the relative paths before it is useful.
http://webdesign.about.com/od/beginningtutorials/a/aa040502a.htm
Remember crossdomain ajax can be a problem in that situation.
The main problem I think is that I am trying to get an output of a php script on an ssl protected website. Why doesn't the following code work?
string URL = "https://mtgox.com/api/0/data/ticker.php";
HttpWebRequest myRequest =
(HttpWebRequest)WebRequest.Create(URL);
myRequest.Method = "GET";
WebResponse myResponse = myRequest.GetResponse();
StreamReader _sr = new StreamReader(myResponse.GetResponseStream(),
System.Text.Encoding.UTF8);
string result = _sr.ReadToEnd();
//Console.WriteLine(result);
result = result.Replace('\n', ' ');
_sr.Close();
myResponse.Close();
Console.WriteLine(result);
It hangs at WebException was unhandeled The operation has timed out
You're hitting the wrong url. ssl is https://, but you're hitting http:// (note the lack of S). The site does redirect to the SSL version of the page, but your code is apparently not following that redirect.
Have added myRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11"; everything started working
I tried to search previous discussion about this issue but I didn't find one, maybe it's because I didn't use right keywords.
I am writing a small program which posts data to a webpage and gets the response. The site I'm posting data to does not provide an API. After some Googling I came up to the use of HttpWebRequest and HttpWebResponse. The code looks like this:
HttpWebRequest httpRequest = (HttpWebRequest)WebRequest.Create("https://www.site.com/index.aspx");
CookieContainer cookie = new CookieContainer();
httpRequest.CookieContainer = cookie;
String sRequest = "SomeDataHere";
httpRequest.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
httpRequest.Headers.Add("Accept-Encoding: gzip, deflate");
httpRequest.Headers.Add("Accept-Language: en-us,en;q=0.5");
httpRequest.Headers.Add("Cookie: SomecookieHere");
httpRequest.Host = "www.site.com";
httpRequest.Referer = "https://www.site.com/";
httpRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1";
httpRequest.ContentType = "application/x-www-form-urlencoded";
//httpRequest.Connection = "keep-alive";
httpRequest.ContentLength = sRequest.Length;
byte[] bytedata = Encoding.UTF8.GetBytes(sRequest);
httpRequest.ContentLength = bytedata.Length;
httpRequest.Method = "POST";
Stream requestStream = httpRequest.GetRequestStream();
requestStream.Write(bytedata, 0, bytedata.Length);
requestStream.Flush();
requestStream.Close();
HttpWebResponse httpWebResponse = (HttpWebResponse)httpRequest.GetResponse();
string sResponse;
using (Stream stream = httpWebResponse.GetResponseStream())
{
StreamReader reader = new StreamReader(stream, System.Text.Encoding.GetEncoding("iso-8859-1"));
sResponse = reader.ReadToEnd();
}
return sResponse;
I used firefox's firebug to get the header and data to post.
My question is, when I store and display the response using a string, all I got are garbled characters, like:
?????*??????xV?J-4Si1?]R?r)f?|??;????2+g???6?N-?????7??? ?6?? x???q v ??? j?Ro??_*?e*??tZN^? 4s?????? ??Pwc??3???|??_????_??9???^??#?Y??"?k??,?a?H?Lp?A?$ ;???C#????e6'?N???L7?j#???ph??y=?I??=(e?V?6C??
By reading the response header using FireBug I got the content type of response:
Content-Type text/html; charset=ISO-8859-1
And it is reflected in my code. I have even tried other encoding such as utf-8 and ascii, still no luck. Maybe I am in the wrong direction.
Please advise. A small code snippet will be even better.
Thanks you.
You're telling the server that you can accept compressed responses with httpRequest.Headers.Add("Accept-Encoding: gzip, deflate");. Try removing that line, and you should get a clear-text response.
HttpWebRequest does have built in support for gzip and deflate if you want to allow compressed responses. Remove the Accept-Encoding header line, and replace it with
httpRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
This will take care of adding the appropriate Accept-Encoding header for you, and handle decompressing the content automatically when you receive it.
I am attempting to view the source of http://simpledesktops.com/browse/desktops/2012/may/17/where-the-wild-things-are/ using the code:
String URL = "http://simpledesktops.com/browse/desktops/2012/may/17/where-the-wild-things-are/";
WebClient webClient = new WebClient();
webClient.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
webClient.Encoding = Encoding.GetEncoding("Windows-1255");
string download = webClient.DownloadString(URL);
webClient.Dispose();
Console.WriteLine(download);
When I run this, the console returns a bunch of nonsense that looks like it's been decoded incorrectly.
I've also attempted adding headers with no avail:
webClient.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
webClient.Headers.Add("Accept-Encoding", "gzip,deflate");
Other websites all returned the proper html source. I can also view the page's source through Chrome. What's going on here?
Response of that URL is gzipped, you should decompress it or set empty Accept-Encoding header, you don't need that user-agent field.
String URL = "http://simpledesktops.com/browse/desktops/2012/may/17/where-the-wild-things-are/";
WebClient webClient = new WebClient();
webClient.Headers.Add("Accept-Encoding", "");
string download = webClient.DownloadString(URL);
I've had the same thing bug me today.
Using a WebClient object to check whether a URL is returning something.
But my experience is different. I tried removing the Accept-Encoding, basically using the code #Antonio Bakula gave in his answer. But I kept getting the same error every time (InvalidOperationException)
So this did not work:
WebClient wc = new WebClient();
wc.Headers.Add("Accept-Encoding", "");
string result = wc.DownloadString(url);
But adding 'any' text as a User Agent instead did do the trick. This worked fine:
WebClient wc = new WebClient();
wc.Headers.Add(HttpRequestHeader.UserAgent, "My User Agent String");
System.IO.Stream stream = wc.OpenRead(url);
Your mileage may vary obviously, also of note. I'm using ASP.NET 4.0.30319.
I have an HttpWebRequest with a StreamReader that works very fine without using a WebProxy. When I use WebProxy, the StreamReader reads strange character instead of the actual html. Here is the code.
HttpWebRequest req = (HttpWebRequest)WebRequest.Create("https://URL");
req.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10";
req.Accept = "application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
req.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.3");
req.Headers.Add("Accept-Encoding", "gzip,deflate,sdch");
req.Headers.Add("Accept-Language", "en-US,en;q=0.8");
req.Method = "GET";
req.CookieContainer = new CookieContainer();
WebProxy proxy = new WebProxy("proxyIP:proxyPort");
proxy.Credentials = new NetworkCredential("proxyUser", "proxyPass");
req.Proxy = this.proxy;
HttpWebResponse res = (HttpWebResponse)req.GetResponse();
StreamReader reader = new StreamReader(res.GetResponseStream());
string html = reader.ReadToEnd();
Without using the WebProxy, the variable html holds the expected html string from the URL. But with a WebProxy, html holds a value like that:
"�\b\0\0\0\0\0\0��]r����s�Y����\0\tP\"]ki���ػ��-��X�\0\f���/�!�HU���>Cr���P$%�nR�� z�g��3�t�~q3�ٵȋ(M���14&?\r�d:�ex�j��p������.��Y��o�|��ӎu�OO.�����\v]?}�~������E:�b��Lן�Ԙ6+�l���岳�Y��y'ͧ��~#5ϩ�it�2��5��%�p��E�L����t&x0:-�2��i�C���$M��_6��zU�t.J�>C-��GY��k�O�R$�P�T��8+�*]HY\"���$Ō�-�r�ʙ�H3\f8Jd���Q(:�G�E���r���Rܔ�ڨ�����W�<]$����i>8\b�p� �\= 4\f�> �&��$��\v��C��C�vC��x�p�|\"b9�ʤ�\r%i��w#��\t�r�M�� �����!�G�jP�8.D�k�Xʹt�J��/\v!�r��y\f7<����\",\a�/IK���ۚ�r�����ҿ5�;���}h��+Q��IO]�8��c����n�AGڟu2>�
Since you are passing
req.Headers.Add("Accept-Encoding", "gzip,deflate,sdch");
I would say your proxy compress the stream before sending it back to you.
Inspect the headers of the Response to check the encoding.
Just use Gzip to decompress it.