I have a question that seems to have been asked before, but is a bit different. I'm trying to scrape data from this website, but the problem is that it seems to be loaded with AJAX. Because of that, my application is unable to find the IDs and classes in the HTML that I'm looking for.
You can reproduce this by inspecting an element or viewing the source. Whilst viewing the source I'm seeing a lot less than whilst inspecting an element.
I thought that I could track down the file that contains the AJAX to load this html by pressing F12, going to the network tab and selecting XHR, but I'm unable to find it.
My question is: how do I retrieve this data or find out what file is used to collect the data?
An example of my code (I'm unable to find the Timetable_toolbar_elementSelect_popup0):
private async Task GetHtmlDocument(string url)
{
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
//request.Credentials = new LoginCredentials().Credentials;
try
{
WebResponse myResponse = await request.GetResponseAsync();
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.Load(myResponse.GetResponseStream());
var test = htmlDoc.GetElementbyId("Timetable_toolbar_elementSelect_popup0");
}
catch (Exception e)
{
}
}
I was going to leave this as a comment. But it got too big and too badly formatted. So here we go.
Firstly, the site is updated dynamically using JavaScript that is invoked through an ajaxCommand.
If you can open a session and store the cookie containing the SESSIONID and the now "encrypted" school name, then you can call the AJAX commands like this:
https://roosters.windesheim.nl/ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13090&date=20171126&formatId=7&departmentId=0&filterId=-2
This does however require you to know what elementType is and what elementId is.
In this case elementId refers to the Klas (class) 1GLD, and formatId (7) refers to the Roosterformaat "Beknopt". You have to figure out what the remaining variables do. Even more important: if you succeed in making valid AJAX commands to the server, you won't get HTML back as a response; you will receive the data as JSON.
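As a rough sketch of how those parameters could be assembled in code (the class name TimetableRequest and the method BuildBody are made up for illustration; departmentId and filterId are simply left at the values seen in the request above, since I have not worked out what they mean):

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper: builds the parameter string for getWeeklyTimetable.
public static class TimetableRequest
{
    public static string BuildBody(int elementType, int elementId, DateTime date, int formatId)
    {
        var fields = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("ajaxCommand", "getWeeklyTimetable"),
            new KeyValuePair<string, string>("elementType", elementType.ToString(CultureInfo.InvariantCulture)),
            new KeyValuePair<string, string>("elementId", elementId.ToString(CultureInfo.InvariantCulture)),
            // The site expects the date as yyyyMMdd, e.g. 20171126.
            new KeyValuePair<string, string>("date", date.ToString("yyyyMMdd", CultureInfo.InvariantCulture)),
            new KeyValuePair<string, string>("formatId", formatId.ToString(CultureInfo.InvariantCulture)),
            // Assumption: copied unchanged from the observed request.
            new KeyValuePair<string, string>("departmentId", "0"),
            new KeyValuePair<string, string>("filterId", "-2"),
        };

        // URL-encode each key and value, then join them as key=value pairs.
        return string.Join("&", fields.Select(f =>
            Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value)));
    }
}
```

Calling TimetableRequest.BuildBody(1, 13090, new DateTime(2017, 11, 26), 7) gives the same parameter string as in the URL above, so you only have to swap in the elementId of the class you want.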
The easiest way to do what you want is to have all the classes in a separate file. And use that as reference point. Same goes for the other options.
Then use a headless browser like phantomjs.org with Selenium. This way you can find and click on the classes you want to scrape, load the HTML into an HtmlAgilityPack.HtmlDocument and then do what you need to do. Selenium/PhantomJS will keep track of your cookies.
This method is slower - but a lot easier to do.
EDIT Storing cookies from a webrequest - the easy way.
I am not well versed in this subject, but OP asked. If anybody has a better way of doing it, please edit.
HtmlDocument doc = new HtmlDocument(); // HtmlAgilityPack document used to hold the response below
CookieContainer cookies = new CookieContainer();
try
{
string webAddr = "https://roosters.windesheim.nl/WebUntis/";
var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
httpWebRequest.ContentType = "application/x-www-form-urlencoded; charset=utf-8"; // the body below is form-encoded, not JSON
httpWebRequest.Method = "POST";
httpWebRequest.CookieContainer = cookies;
httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest");
using (var streamWriter = new StreamWriter(httpWebRequest.GetRequestStream()))
{
string json = "ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13092&date=20171126&formatId=7&departmentId=0&filterId=-2";
streamWriter.Write(json);
streamWriter.Flush();
}
var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
{
cookies.Add(httpWebRequest.CookieContainer.GetCookies(httpWebRequest.RequestUri));
//cookies.Add(httpResponse.Cookies);
var responseText = streamReader.ReadToEnd();
doc.LoadHtml(responseText);
foreach(Cookie c in httpResponse.Cookies)
{
Console.WriteLine(c.ToString());
}
}
}
catch (WebException ex)
{
Console.WriteLine(ex.Message);
}
Console.WriteLine(doc.DocumentNode.InnerHtml);
Console.ReadKey();
Solution where you call the ajax method using a webrequest.
So I got bored and figured most of it out. What is missing below is how to identify the Klas by ID. The example below fetches the Klas '1GLD'. We need cookies so that the request knows which school we are fetching the Klas from. Also note that the code below returns only JSON, not HTML, since we are calling an AJAX method.
CookieContainer cookies = new CookieContainer();
try
{
string webAddr = "https://roosters.windesheim.nl/";
var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
httpWebRequest.ContentType = "application/json; charset=utf-8";
httpWebRequest.Method = "POST";
httpWebRequest.CookieContainer = cookies;
httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest");
var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
{
cookies.Add(httpWebRequest.CookieContainer.GetCookies(httpWebRequest.RequestUri));
}
}
catch (WebException ex)
{
Console.WriteLine(ex.Message);
}
//According to my web debugger the cookie will last until the 10th of December, so there is no need to fetch a new cookie until then.
//I noticed the URL ends with a Unix timestamp, so we just append the current Unix timestamp to each request.
long unixTimeStamp = new DateTimeOffset(DateTime.Now).ToUnixTimeMilliseconds() - 100;
//we are now ready to call the ajax method and get the JSON.
try
{
string webAddr = "https://roosters.windesheim.nl/WebUntis/Timetable.do?request.preventCache="+unixTimeStamp.ToString();
var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
httpWebRequest.ContentType = "application/x-www-form-urlencoded; charset=utf-8";
httpWebRequest.Method = "POST";
httpWebRequest.CookieContainer = cookies;
httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest");
using (var streamWriter = new StreamWriter(httpWebRequest.GetRequestStream()))
{
string json = "ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13090&date=20171126&formatId=7&departmentId=0&filterId=-2";
//The command below will return a JSON data structure containing all the classes (Klassen) and their IDs.
//string otherJson = "ajaxCommand=getPageConfig&type=1&filter=-2";
streamWriter.Write(json);
streamWriter.Flush();
}
var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
{
var responseText = streamReader.ReadToEnd();
//THE RESULTS GETS PRINTED HERE.
Console.Write(responseText);
}
}
catch (WebException ex)
{
Console.WriteLine(ex.Message);
}
Other solution with Selenium with Firefox driver.
This is much easier to do, but it also takes more time. It gives you HTML to work with, just as you requested. Not all of the Thread.Sleep calls are necessary, but I found them necessary in the last foreach loop.
public static void Main(string[] args)
{
HtmlDocument doc = new HtmlDocument();
//According to my web debugger the cookie will last until the 10th of December, so there is no need to fetch a new cookie until then.
//I noticed the URL ends with a Unix timestamp, so we just append the current Unix timestamp to each request.
long unixTimeStamp = new DateTimeOffset(DateTime.Now).ToUnixTimeMilliseconds() - 100;
string webAddr = "https://roosters.windesheim.nl/WebUntis/Timetable.do?request.preventCache="+unixTimeStamp.ToString();
var ffOptions = new FirefoxOptions();
ffOptions.BrowserExecutableLocation = @"C:\Program Files (x86)\Mozilla Firefox\firefox.exe";
ffOptions.LogLevel = FirefoxDriverLogLevel.Default;
ffOptions.Profile = new FirefoxProfile { AcceptUntrustedCertificates = true };
var service = FirefoxDriverService.CreateDefaultService();
var driver = new FirefoxDriver(service, ffOptions, TimeSpan.FromSeconds(120));
driver.Navigate().GoToUrl(webAddr);
driver.FindElement(By.XPath("//input[@id='school']")).SendKeys("Windesheim"+Keys.Enter);
Thread.Sleep(2000);
driver.FindElement(By.XPath("//span[@id='dijit_PopupMenuBarItem_0_text' and text() ='Lesrooster']")).Click();
driver.FindElement(By.XPath("//td[@id='dijit_MenuItem_0_text' and text() ='Klassen']")).Click();
Thread.Sleep(2000);
driver.FindElement(By.XPath("//div[@id='widget_Timetable_toolbar_elementSelect']//input[@class='dijitReset dijitInputField dijitArrowButtonInner']")).Click();
//we get all the options for Klase
doc.LoadHtml(driver.PageSource);
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@id='Timetable_toolbar_elementSelect_popup']/div[@item]");
List<String> options = new List<String>();
foreach (HtmlNode n in nodes)
{
options.Add(n.InnerText);
}
foreach(string s in options)
{
driver.FindElement(By.XPath("//input[@id='Timetable_toolbar_elementSelect']")).Clear();
driver.FindElement(By.XPath("//input[@id='Timetable_toolbar_elementSelect']")).SendKeys(s);
Thread.Sleep(2000);
driver.FindElement(By.XPath("//body")).SendKeys(Keys.Enter);
Thread.Sleep(2000);
doc.LoadHtml(driver.PageSource);
//Console.WriteLine(driver.Url); //Now we can see the id of the current Klase
}
Console.WriteLine(doc.DocumentNode.InnerHtml);
Console.ReadKey();
}
Last update
Using the Selenium solution I was able to get the ID's for all courses. I have included the file here so you can use it with your ajax and web requests.
Related
This is a pretty standard implementation of HttpWebRequest, but whenever I pass a certain URL to get the HTML, it comes back with nothing but special characters. An example of what comes back is below.
The site uses SSL, so I'm wondering if that has something to do with it, but I've never had this problem before and I've used this code with other SSL sites.
�
ServicePointManager.ServerCertificateValidationCallback = new System.Net.Security.RemoteCertificateValidationCallback(AcceptAllCertifications);
var request = (HttpWebRequest)WebRequest.Create(url);
using (var response = (HttpWebResponse)request.GetResponse())
{
Stream data = response.GetResponseStream();
HtmlDocument hDoc = new HtmlDocument();
using (StreamReader readURLContent = new StreamReader(data))
{
html = readURLContent.ReadToEnd();
hDoc.LoadHtml(html);
}
}
I can't really find anything about this specific issue, so I'm kind of lost. If anybody could point me in the right direction, that would be awesome.
Edit: here's an image of what it looks like since I can't copy paste it
My guess is that the response is compressed. If you use a web debugger like Charles or Fiddler, you can see how the requests are structured and what data they contain. That makes it a lot easier to replicate the HTTP requests later on when programming them. Try the following code.
try
{
string webAddr = url;
var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr);
httpWebRequest.ContentType = "text/html; charset=utf-8";
httpWebRequest.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0";
httpWebRequest.AllowAutoRedirect = true;
httpWebRequest.Method = "GET";
httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
using (var streamReader = new StreamReader(httpResponse.GetResponseStream(), Encoding.UTF8))
{
var responseText = streamReader.ReadToEnd();
doc.LoadHtml(responseText);
}
}
catch (WebException ex)
{
Console.WriteLine(ex.Message);
}
The code sets the encoding on the request. You can also set the encoding on the StreamReader when reading the response. And automatic decompression is enabled.
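To see why a compressed response looks like "special characters", here is a small self-contained sketch that needs no network. It gzips a string, which is roughly what the server does to the body, and then decompresses it again, which is what AutomaticDecompression does for you. The class and method names (GzipDemo, Compress, Decompress) are made up for the illustration.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Illustration only: simulates a gzip-compressed response body in memory.
public static class GzipDemo
{
    // Compress text the way a server would before sending "Content-Encoding: gzip".
    public static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            // ToArray still works after the GZipStream has closed the buffer.
            return buffer.ToArray();
        }
    }

    // Decompress the bytes and decode them as UTF-8 text.
    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

If you pass the output of Compress straight to Encoding.UTF8.GetString, you get the kind of garbage shown in the question; running it through Decompress first gives the original text back. A gzip body always starts with the bytes 0x1F 0x8B, which is a quick way to check whether that is what you are looking at.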
First, I tried posting this body with the Postman tool:
{"AO":"ECHO"}
It works fine. Then I wrote the same request in C#, but it does not work.
I also wrote the request in Python, and it works well.
But my project is in Microsoft C#, and I don't want to run a Python script from C# at all.
==== Python =========
import httplib
import json
import sys
data = '{"AO":"ECHO"}'
headers = {"Content-Type": "application/json", "Connection": "Keep-Alive" }
conn = httplib.HTTPConnection("http://10.10.10.1",1040)
conn.request("POST", "/guardian", data, headers)
response = conn.getresponse()
print response.status, response.reason
print response.msg
==== C# ============
var httpWebRequest = (HttpWebRequest)WebRequest.Create("http://10.10.10.1:1040/guardian");
httpWebRequest.Method = "POST";
httpWebRequest.ContentType = "application/json";
using (var streamWriter = new StreamWriter(httpWebRequest.GetRequestStream()))
{
string json = "{\"AO\":\"ECHO\"}";
streamWriter.Write(json);
streamWriter.Flush();
streamWriter.Close();
var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse();
using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
{
var result = streamReader.ReadToEnd();
Console.WriteLine(result);
}
}
I tried setting "ContentLength", but I still get a timeout exception.
I also tried RestSharp; it does not time out, but it returns null.
Can anyone please help?
var client = new RestClient("http://10.10.10.1:1040/guardian");
var request = new RestRequest();
request.Method = Method.POST;
request.AddHeader("Content-Type", "application/json");
request.Parameters.Clear();
request.RequestFormat = DataFormat.Json;
request.AddBody(new { AO = "ECHO" });
var response = client.Execute(request);
var content = response.Content;
Please help me.
I don't understand why it works fine in Python but not in C#.
I have tried many ways of making the request in C#, but they all fail with a timeout exception.
Python will automatically add Content-Length http header.
https://docs.python.org/2/library/httplib.html#httpconnection-objects
I think you might have to set this header manually in C#.
httpWebRequest.ContentLength = Encoding.UTF8.GetByteCount(json); // length in bytes; set it before calling GetRequestStream()
Depending on the server, you may have to set UserAgent as well.
httpWebRequest.UserAgent=".NET Framework Test Client";
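Note that Content-Length is the number of bytes in the body, not the number of characters. For plain ASCII JSON the two are the same, but they differ as soon as the body contains non-ASCII characters. A minimal sketch (ContentLengthDemo and BodyLength are names invented for this example, and it assumes the body is sent as UTF-8):

```csharp
using System;
using System.Text;

// Illustration only: computes the value to use for Content-Length.
public static class ContentLengthDemo
{
    // Number of bytes the body occupies when encoded as UTF-8.
    public static long BodyLength(string body)
    {
        return Encoding.UTF8.GetByteCount(body);
    }
}
```

For "{\"AO\":\"ECHO\"}" both the string length and the byte count are 13. For "{\"AO\":\"\u00C9CHO\"}" the string length is still 13, but the UTF-8 byte count is 14, and sending the wrong value can leave the server waiting for bytes that never arrive.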
I have done something similar to this before however I'm not sure how to do this with a bigger project.
I'm trying to return the titles of all the stuff on the front page of reddit.
From this site:
http://www.reddit.com/r/all.json
I pasted the data into
http://json2csharp.com/#
to find out the class I need.
From here, though, I'm not too sure how to proceed. If I wanted to return an array of all this data so I can easily get at the information, how could I do it?
Sorry for the vagueness of this question but I'm just at a loss and don't know what to do.
Use
using (var webClient = new System.Net.WebClient()) {
var json = webClient.DownloadString("http://www.reddit.com/r/all.json");
}
For older .NET versions:
var request = WebRequest.Create(url);
string text;
request.ContentType = "application/json; charset=utf-8";
var response = (HttpWebResponse) request.GetResponse();
using (var sr = new StreamReader(response.GetResponseStream()))
{
text = sr.ReadToEnd();
}
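Once you have the JSON string, you still need to pull the titles out of it. Here is one possible sketch using System.Text.Json, which assumes .NET Core 3.0 or later (on older frameworks you would use the json2csharp classes with a JSON library instead). It relies on the listing layout where each post sits under data.children[i].data.title; the class name RedditTitles and method ExtractTitles are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Illustration only: extracts post titles from a reddit listing JSON string.
public static class RedditTitles
{
    public static List<string> ExtractTitles(string json)
    {
        var titles = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            // Walk root -> data -> children, then read data.title of each child.
            JsonElement children = doc.RootElement
                .GetProperty("data")
                .GetProperty("children");

            foreach (JsonElement child in children.EnumerateArray())
            {
                titles.Add(child.GetProperty("data").GetProperty("title").GetString());
            }
        }
        return titles;
    }
}
```

You would pass it the string returned by DownloadString, and then call ToArray() on the result if you really need an array.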
I wrote a simple C# function to retrieve trade history from MtGox with the following API call:
https://data.mtgox.com/api/1/BTCUSD/trades?since=<trade_id>
documented here: https://en.bitcoin.it/wiki/MtGox/API/HTTP/v1#Multi_currency_trades
here's the function:
string GetTradesOnline(Int64 tid)
{
Thread.Sleep(30000);
// communicate
string url = "https://data.mtgox.com/api/1/BTCUSD/trades?since=" + tid.ToString();
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream());
string json = reader.ReadToEnd();
reader.Close();
reader.Dispose();
response.Close();
return json;
}
I'm starting at tid=0 (trade ID) to get the data from the very beginning. For each request, I receive a response containing 1000 trade details. I always send the trade ID from the previous response with the next request. It works fine for exactly 4 requests and responses, but after that the following line throws a System.Net.WebException saying "The operation has timed out":
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
here are the facts:
catching the exception and retrying keeps causing the same exception
the default HttpWebRequest .Timeout and .ReadWriteTimeout are already high enough (over a minute)
changing HttpWebRequest.KeepAlive to false didn't solve anything either
it seems to always work in the browser even while the function is failing
it has no problems retrieving the response from https://www.google.com
the amount of successful responses before the exceptions varies from day to day (but browser always works)
starting at the trade id that failed last time causes the exception immediately
calling this function from the main thread instead still caused the exception
running it on a different machine didn't work
running it from a different IP didn't work
increasing Thread.Sleep in between requests does not help
any ideas of what could be wrong?
I had the very same issue.
For me the fix was as simple as wrapping the HttpWebResponse code in a using block.
using (HttpWebResponse response = (HttpWebResponse) request.GetResponse())
{
// Do your processings here....
}
Details: This issue usually happens when several requests are made to the same host and the WebResponse is not disposed properly. The using block disposes the WebResponse object properly, which solves the issue.
There are two kinds of timeouts: client timeout and server timeout. Have you tried doing something like this:
request.Timeout = Timeout.Infinite;
request.KeepAlive = true;
Try something like this...
I just had similar trouble calling a REST service on a Linux server over SSL. After trying many different configurations, I found out that I had to send a UserAgent in the HTTP header.
Here is my final method for calling the REST API.
private static string RunWebRequest(string url, string json)
{
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
// Header
request.ContentType = "application/json";
request.Method = "POST";
request.AllowAutoRedirect = false;
request.KeepAlive = false;
request.Timeout = 30000;
request.ReadWriteTimeout = 30000;
request.UserAgent = "test.net";
request.Accept = "application/json";
request.ProtocolVersion = HttpVersion.Version11;
request.Headers.Add("Accept-Language","de_DE");
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls;
ServicePointManager.ServerCertificateValidationCallback = delegate { return true; };
byte[] bytes = Encoding.UTF8.GetBytes(json);
request.ContentLength = bytes.Length;
using (var writer = request.GetRequestStream())
{
writer.Write(bytes, 0, bytes.Length);
writer.Flush();
writer.Close();
}
var httpResponse = (HttpWebResponse)request.GetResponse();
using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
{
var jsonReturn = streamReader.ReadToEnd();
return jsonReturn;
}
}
This is not a solution, just an alternative:
These days I almost exclusively use WebClient instead of HttpWebRequest, especially WebClient.UploadString for POST and PUT, and WebClient.DownloadString. These simply take and return strings, so I don't have to deal with stream objects except when I get a WebException. I can also set the content type with WebClient.Headers["Content-type"] if necessary. The using statement also makes life easier by calling Dispose for me.
On the rare occasions where performance matters, I set System.Net.ServicePointManager.DefaultConnectionLimit high and use HttpClient with its async methods for simultaneous calls instead.
This is how I would do it now:
string GetTradesOnline(Int64 tid)
{
using (var wc = new WebClient())
{
return wc.DownloadString("https://data.mtgox.com/api/1/BTCUSD/trades?since=" + tid.ToString());
}
}
2 more POST examples
// POST
string SubmitData(string data)
{
string response;
using (var wc = new WebClient())
{
wc.Headers["Content-type"] = "text/plain";
response = wc.UploadString("https://data.mtgox.com/api/1/BTCUSD/trades", "POST", data);
}
return response;
}
// POST: easily url encode multiple parameters
string SubmitForm(string project, string subject, string sender, string message)
{
// url encoded query
NameValueCollection query = HttpUtility.ParseQueryString(string.Empty);
query.Add("project", project);
query.Add("subject", subject);
// url encoded data
NameValueCollection data = HttpUtility.ParseQueryString(string.Empty);
data.Add("sender", sender);
data.Add("message", message);
string response;
using (var wc = new WebClient())
{
wc.Headers[HttpRequestHeader.ContentType] = "application/x-www-form-urlencoded";
response = wc.UploadString( "https://data.mtgox.com/api/1/BTCUSD/trades?"+query.ToString()
, WebRequestMethods.Http.Post
, data.ToString()
);
}
return response;
}
Error handling
try
{
Console.WriteLine(GetTradesOnline(0));
string data = File.ReadAllText(@"C:\mydata.txt");
Console.WriteLine(SubmitData(data));
Console.WriteLine(SubmitForm("The Big Project", "Progress", "John Smith", "almost done"));
}
catch (WebException ex)
{
string msg;
if (ex.Response != null)
{
// read response HTTP body
using (var sr = new StreamReader(ex.Response.GetResponseStream())) msg = sr.ReadToEnd();
}
else
{
msg = ex.Message;
}
Log(msg);
}
For what it's worth, I was experiencing the same issues with timeouts every time I used it, even though calls went through to the server I was calling. The problem in my case was that I had Expect set to application/json, when that wasn't what the server was returning.
My method looks like this:
public string Request(string action, NameValueCollection parameters, uint? timeoutInSeconds = null)
{
parameters = parameters ?? new NameValueCollection();
ProvideCredentialsFor(ref parameters);
var data = parameters.ToUrlParams(); // my extension method converts the collection to a string, works well
byte[] dataStream = Encoding.UTF8.GetBytes(data);
string request = ServiceUrl + action;
var webRequest = (HttpWebRequest)WebRequest.Create(request);
webRequest.AllowAutoRedirect = false;
webRequest.Method = "POST";
webRequest.ContentType = "application/x-www-form-urlencoded";
webRequest.ContentLength = dataStream.Length;
webRequest.Timeout = (int)(timeoutInSeconds == null ? DefaultTimeoutMs : timeoutInSeconds * 1000);
webRequest.Proxy = null; // should make it faster...
using (var newStream = webRequest.GetRequestStream())
{
newStream.Write(dataStream, 0, dataStream.Length);
}
var webResponse = (HttpWebResponse)webRequest.GetResponse();
string uri = webResponse.Headers["Location"];
string result;
using (var sr = new StreamReader(webResponse.GetResponseStream()))
{
result = sr.ReadToEnd();
}
return result;
}
The server sends JSON in response. It works fine for small JSON, but when I request a large one - something goes wrong. By large one I mean something that takes 1-2 minutes to appear in a browser (google chrome, including server side generation time). It's actually 412KB of text. When I try to ask for the same JSON with the method above I get a web exception (timeout). I changed the timeout to 10 minutes (at least 5 times longer than chrome). Still the same.
Any ideas?
EDIT
This seems to have something to do with MS technologies. On IE this JSON also won't load.
Make sure you close your request. Otherwise, once you hit the maximum number of allowed connections (as low as four, for me, on one occasion), you have to wait for the earlier ones to time out. This is best done using
using (var response = webRequest.GetResponse()) {...