Scraping a website with an AJAX-based request in C#

I have a website like this: http://www.lfp.fr/ligue1/feuille_match/52255. I want to switch between the infoMatch and Statistiques tabs, but it shows me the data of the first page only, and when I use Firebug to check the response it gives me this:
GET showStatsJoueursMatch?matchId=52255&domId=112&extId=24&live=0&domNomClub=AJ+Auxerre&extNomClub=FC+Nantes
string url="http://www.lfp.fr/ligue1/feuille_match/52255";
string getData = "?matchId=52255&domId=112&extId=24&live=0&domNomClub=AJ+Auxerre&extNomClub=FC+Nantes";
System.Uri uriObj = new System.Uri(url);
String Methode = "GET";
lgRequest = (HttpWebRequest)WebRequest.CreateDefault(uriObj);
lgRequest = (HttpWebRequest)WebRequest.CreateDefault(uriObj);
lgRequest.Method = Methode;
lgRequest.ContentType = "text/html";
SetRequestHeader("Accept", "text/html");
SetRequestHeader("Cache-Control", "no-cache");
SetRequestHeader("Content-Length", getData.Length.ToString());
StreamWriter stream = new StreamWriter
(lgRequest.GetRequestStream(), Encoding.ASCII);
stream.Write(body);
stream.Close();
lgResponse = (HttpWebResponse)lgRequest.GetResponse();
But it gives me the error "Cannot send a content-body with this verb-type." And when I use "POST" as the method, it returns HTML, but only the first page's data, not the Statistiques.

Try the following address: http://www.lfp.fr/ligue1/feuille_match/showStatsJoueursMatch?matchId=52255&domId=112&extId=24&live=0&domNomClub=AJ+Auxerre&extNomClub=FC+Nantes
Like this:
using System;
using System.Net;

class Program
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            string result = client.DownloadString("http://www.lfp.fr/ligue1/feuille_match/showStatsJoueursMatch?matchId=52255&domId=112&extId=24&live=0&domNomClub=AJ+Auxerre&extNomClub=FC+Nantes");
            Console.WriteLine(result);
        }
    }
}
Notice that I have used a WebClient instead of a WebRequest, which makes the code much shorter and easier to understand.
Once you have downloaded the HTML from the remote site you might consider using an HTML parsing library such as HTML Agility Pack to extract the useful information from the markup you have scraped.
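For instance, here is a minimal sketch using HTML Agility Pack on the response above; the XPath is illustrative, since the exact markup of the stats fragment may differ:
using System;
using System.Net;
using HtmlAgilityPack;

class StatsScraper
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            string html = client.DownloadString("http://www.lfp.fr/ligue1/feuille_match/showStatsJoueursMatch?matchId=52255&domId=112&extId=24&live=0&domNomClub=AJ+Auxerre&extNomClub=FC+Nantes");
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            // Illustrative XPath: adjust to the real structure of the stats table.
            var rows = doc.DocumentNode.SelectNodes("//table//tr");
            if (rows != null)
            {
                foreach (var row in rows)
                    Console.WriteLine(row.InnerText.Trim());
            }
        }
    }
}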

Related

Getting the HTML code of a webpage

I am trying to get the HTML code of a webpage using its URL. I have written the following code; it works, but the resulting string doesn't match the code I see when I use Google Chrome's Inspect. I am not an HTML guru, but it seems to be different.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://fantasy.premierleague.com/a/leagues/standings/517292/classic");
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader stream = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding(response.CharacterSet));
string PageScript = stream.ReadToEnd();
The resulting script is as follows: https://ideone.com/DXzfKy
I am using these two lines to set the security protocol:
ServicePointManager.Expect100Continue = true;
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
If someone can tell me what I am looking at and what might be wrong, I will be grateful.
All you need to do is create an instance of WebClient, use it to open a stream from the URI, wrap that stream in a StreamReader, and finally read the contents out as plain text.
using (WebClient client = new WebClient())
using (Stream dataFromPage = client.OpenRead(new Uri("https://ideone.com/DXzfKy")))
using (StreamReader reader = new StreamReader(dataFromPage))
{
    string htmlContent = reader.ReadToEnd();
}
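Equivalently, WebClient.DownloadString collapses the stream handling into a single call:
string htmlContent = new WebClient().DownloadString("https://ideone.com/DXzfKy");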

C# WebClient returning error 404

I'm using the script below to retrieve HTML from a URL.
string webURL = @"https://nl.wiktionary.org/wiki/" + word.ToLower();
using (WebClient client = new WebClient())
{
string htmlCode = client.DownloadString(webURL);
}
The variable word can be any word. If there is no wiki page for the word being retrieved, the code fails with error 404, while retrieving the same URL with a browser opens a wiki page saying there is no entry for this item yet.
What I want is for the code to always get the HTML, even when the wiki page says there is no info yet. I do not want to work around the 404 error with a try/catch.
Does anyone have an idea why this is not working with WebClient?
Try this. You can catch the 404 error content in a try/catch block.
var word = Console.ReadLine();
string webURL = @"https://nl.wiktionary.org/wiki/" + word.ToLower();
using (WebClient client = new WebClient())
{
    try
    {
        string htmlCode = client.DownloadString(webURL);
    }
    catch (WebException exception)
    {
        string responseText = string.Empty;
        var responseStream = exception.Response?.GetResponseStream();
        if (responseStream != null)
        {
            using (var reader = new StreamReader(responseStream))
            {
                responseText = reader.ReadToEnd();
            }
        }
        Console.WriteLine(responseText);
    }
}
Console.ReadLine();
Since this wiki server uses case-sensitive URL mapping, just don't modify the case of the URL you harvest (remove ".ToLower()" from your code).
Ex.:
Lower case:
https://nl.wiktionary.org/wiki/categorie:onderwerpen_in_het_nynorsk
Result: HTTP 404 (Not Found)
Normal (unmodified) case:
https://nl.wiktionary.org/wiki/Categorie:Onderwerpen_in_het_Nynorsk
Result: HTTP 200 (OK)
Also, keep in mind that most (if not all) wiki servers (including this one) generate custom 404 pages, so in a browser they look like "normal" pages, but despite this they are served with a 404 HTTP status code.
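If you want to avoid the try/catch entirely, one alternative (my suggestion, not part of the original answers) is HttpClient, whose GetAsync does not throw on a 404 status, so the custom error page body is still readable:
using System;
using System.Net.Http;

class Program
{
    static void Main()
    {
        using (var client = new HttpClient())
        {
            // GetAsync returns normally on a 404; only network-level failures throw.
            HttpResponseMessage response = client.GetAsync("https://nl.wiktionary.org/wiki/Categorie:Onderwerpen_in_het_Nynorsk").Result;
            string html = response.Content.ReadAsStringAsync().Result;
            Console.WriteLine((int)response.StatusCode); // 200 or 404
            Console.WriteLine(html);                     // the page body either way
        }
    }
}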

Can't download webpage via C# WebClient or via request/response

I want to download webpages' HTML code, but I have problems with several links. For example: http://www.business-top.info/, http://azerizv.az/
I receive no HTML at all using either of these:
1. WebClient:
using (var client = new WebClient())
{
    client.Encoding = System.Text.Encoding.UTF8;
    string result = client.DownloadString(resultUrl);
    Console.WriteLine(result);
    Console.ReadLine();
}
2. HTTP request/response
var request = (HttpWebRequest)WebRequest.Create(resultUrl);
request.Method = "POST";
using (var response = (HttpWebResponse)request.GetResponse())
{
    using (var stream = response.GetResponseStream())
    {
        StreamReader sr = new StreamReader(stream, Encoding.UTF8);
        string data = sr.ReadToEnd();
        Console.WriteLine(data);
        Console.ReadLine();
    }
}
There are many such links, so I can't download the HTML manually by viewing each page's source in a browser.
Some pages load in stages. First they load the core of the page, and only then do they evaluate any JavaScript inside, which loads further content via AJAX. To scrape these pages you will need a more advanced scraping tool than a simple HTTP request sender (see the sketch below).
EDIT:
Here is a question on SO about the same problem that you are having now:
Jquery Ajax Web page scraping using c#
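One such tool is a headless browser. As a sketch (my suggestion, assuming the Selenium.WebDriver and ChromeDriver packages are installed), Selenium drives a real Chrome instance that executes the page's JavaScript before you read the markup:
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class Program
{
    static void Main()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless"); // no visible browser window
        using (IWebDriver driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl("http://www.business-top.info/");
            // PageSource reflects the DOM after the page load event; content that
            // arrives via late AJAX calls may still need an explicit WebDriverWait.
            string html = driver.PageSource;
            Console.WriteLine(html);
        }
    }
}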

How to read content from a webpage

I want to access a webpage and store its contents in a database.
This is the code I have tried for reading the contents of the webpage:
public static WebClient wClient = new WebClient();
public static TextWriter textWriter;

public static String readFromLink()
{
    string url = "http://www.ncedc.org/cgi-bin/catalog-search2.pl";
    HttpWebRequest webRequest = WebRequest.Create(url) as HttpWebRequest;
    webRequest.Method = "POST";
    System.Net.WebClient client = new System.Net.WebClient();
    byte[] data = client.DownloadData(url);
    string html = System.Text.Encoding.UTF8.GetString(data);
    return html;
}

public static bool WriteTextFile(String fileName, String t)
{
    try
    {
        textWriter = new StreamWriter(fileName);
    }
    catch (Exception)
    {
        Console.WriteLine("Data Save Unsuccessful: Could Not Create File");
        return false;
    }
    try
    {
        textWriter.WriteLine(t);
    }
    catch (Exception)
    {
        Console.WriteLine("Data Save Unsuccessful: Could Not Save Data");
        return false;
    }
    textWriter.Close();
    Console.WriteLine("Data Save Successful");
    return true;
}

static void Main(string[] args)
{
    String saveFile = "E:/test.txt";
    String reSultString = readFromLink();
    WriteTextFile(saveFile, reSultString);
    Console.ReadKey();
}
But this code gives me the output: "This script should be referenced with a METHOD of POST. REQUEST_METHOD=GET".
Please tell me how to resolve this.
You are mixing HttpWebRequest with System.Net.WebClient code. They are different classes. You can use WebClient.UploadValues to send a POST with WebClient. You will also need to provide some POST data:
// NameValueCollection lives in System.Collections.Specialized.
System.Net.WebClient client = new System.Net.WebClient();
NameValueCollection postData = new NameValueCollection();
postData.Add("format", "ncread");
postData.Add("mintime", "2002/01/01,00:00:00");
postData.Add("minmag", "3.0");
postData.Add("etype", "E");
postData.Add("outputloc", "web");
postData.Add("searchlimit", "100000");
byte[] data = client.UploadValues(url, "POST", postData);
string html = System.Text.Encoding.UTF8.GetString(data);
You can find out what parameters to pass by inspecting the POST message in Fiddler. And yes, as commented by @Chris Pitman, use File.WriteAllText(path, html);
I'm not sure if it's a fault on your side, as I get the same message just by opening the page. The page source does not contain any HTML, so I don't think you can do webRequest.Method = "POST". Have you spoken to the administrators of the site?
The .NET Framework provides a rich set of methods to access data stored on the web. First you will have to include the right namespaces:
using System.Text;
using System.Net;
using System.IO;
The HttpWebRequest object allows us to create a request to the URL, and the WebResponse allows us to read the response to the request.
We’ll use a StreamReader object to read the response into a string variable.
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(URL);
myRequest.Method = "GET";
WebResponse myResponse = myRequest.GetResponse();
StreamReader sr = new StreamReader(myResponse.GetResponseStream(), System.Text.Encoding.UTF8);
string result = sr.ReadToEnd();
sr.Close();
myResponse.Close();
In this code sample, the URL variable should contain the URL that you want to get, and the result variable will contain the contents of the web page. You may want to add some error handling as well for a real application.
As far as I can see, the URL you're requesting is a Perl script. I think it demands a POST with the search arguments before it delivers search results.

Reading remote file [C#]

I am trying to read a remote file using HttpWebRequest in a C# console application, but for some reason the response is empty: it never finds the URL.
This is my code:
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create("http://uo.neverlandsreborn.org:8000/botticus/status.ecl");
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
How come this is not possible?
The file only contains a string. Nothing more!
How are you reading the response data? Does it come back as successful but empty, or is there an error status?
If that doesn't help, try Wireshark, which will let you see what's happening at the network level.
Also, consider using WebClient instead of WebRequest - it does make it incredibly easy when you don't need to do anything sophisticated:
string url = "http://uo.neverlandsreborn.org:8000/botticus/status.ecl";
WebClient wc = new WebClient();
string data = wc.DownloadString(url);
You have to get the response stream and read the data out of that. Here's a function I wrote for one project that does just that:
// Note: ServerException is a custom exception type from the author's project.
private static string GetUrl(string url)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.CreateDefault(new Uri(url));
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        if (response.StatusCode != HttpStatusCode.OK)
            throw new ServerException("Server returned an error code (" + ((int)response.StatusCode).ToString() +
                ") while trying to retrieve a new key: " + response.StatusDescription);

        using (var sr = new StreamReader(response.GetResponseStream()))
        {
            return sr.ReadToEnd();
        }
    }
}
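Usage is then a one-liner; for example, with the URL from the question:
string data = GetUrl("http://uo.neverlandsreborn.org:8000/botticus/status.ecl");
Console.WriteLine(data);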
