I want to download web pages' HTML code, but I have problems with several links. For example: http://www.business-top.info/, http://azerizv.az/
I receive no HTML at all using either of these approaches:
1. WebClient:
using (var client = new WebClient())
{
    client.Encoding = System.Text.Encoding.UTF8;
    string result = client.DownloadString(resultUrl);
    Console.WriteLine(result);
    Console.ReadLine();
}
2. HttpWebRequest/HttpWebResponse:
var request = (HttpWebRequest)WebRequest.Create(resultUrl);
request.Method = "POST";
using (var response = (HttpWebResponse)request.GetResponse())
{
    using (var stream = response.GetResponseStream())
    {
        StreamReader sr = new StreamReader(stream, Encoding.UTF8);
        string data = sr.ReadToEnd();
        Console.WriteLine(data);
        Console.ReadLine();
    }
}
There are many such links, so I can't download the HTML manually by viewing each page's source in a browser.
Some pages load in stages: first the server returns the core of the page, and only then does the JavaScript inside it load further content via AJAX. To scrape such pages you need more than a simple HTTP request sender; you need a library or tool that can execute the page's scripts.
EDIT:
Here is a question on SO about the same problem you are having:
Jquery Ajax Web page scraping using c#
I followed the instructions at HTTP request with post to get the audio file from the site http://www.tudienabc.com/phat-am-tieng-nhat (the site lets you enter an English or Japanese word/phrase/sentence and generates an audio file, which appears as "/pronunciations/audio?file_name=1431134309.002.mp3&file_type=mp3" at line 129 of the HTML code after the postback).
However, the audio file I get from my own application is not the same as the one generated by the website. The file generated by the website can be played at www.tudienabc.com/pronunciations/ (such as www.tudienabc.com/pronunciations/audio?file_name=1431141268.9947.mp3&file_type=mp3), but the file generated by my application cannot (such as www.tudienabc.com/pronunciations/audio?file_name=1431141475.4908.mp3&file_type=mp3).
So what is wrong, and how do I get the correct audio file?
Here is my code:
var request = (HttpWebRequest)WebRequest.Create("http://www.tudienabc.com/phat-am-tieng-nhat");
var postData = "_method=POST&data[Pronun][text]=hello&data[Pronun][type]=3";
var data = Encoding.ASCII.GetBytes(postData);

request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";
request.ContentLength = data.Length;

using (var stream = request.GetRequestStream())
{
    stream.Write(data, 0, data.Length);
}

var response = (HttpWebResponse)request.GetResponse();
var responseString = new StreamReader(response.GetResponseStream()).ReadToEnd();

string marker = "pronunciations/audio?file_name=";
int m = responseString.IndexOf(marker) + marker.Length;
int n = responseString.IndexOf("&file_type=mp3", m); // search from m so both markers belong to the same link
string filename = responseString.Substring(m, n - m);
return filename;
Thank you,
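As an aside, the IndexOf arithmetic in the code above fails silently when either marker is missing. A regular expression makes the extraction more defensive; here is a small sketch (the sample string below is made up to mirror the markup described in the question):

```csharp
using System;
using System.Text.RegularExpressions;

class FileNameExtractor
{
    // Pull the file_name value out of the returned HTML, or null if it is absent.
    static string ExtractFileName(string html)
    {
        Match m = Regex.Match(html,
            @"pronunciations/audio\?file_name=([^&""']+)&file_type=mp3");
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        string sample =
            "<a href=\"/pronunciations/audio?file_name=1431141268.9947.mp3&file_type=mp3\">";
        Console.WriteLine(ExtractFileName(sample)); // 1431141268.9947.mp3
    }
}
```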
Their website generates the audio using JavaScript on the page:
<script>
    var wait = new waitGenerateAudio(
        '#progress_audio_placeholder',
        '/pronunciations/checkFinish/1431151184.739',
        'aGVsbG8gZnlyeWU=',
        '/pronunciations/audio?file_name=1431151184.739.mp3&file_type=mp3',
        'Tạo file lại'
    );
</script>
You will need to execute that JavaScript for the audio file to be created.
Check out
C# httpwebrequest and javascript
or
WebClient runs javascript
for ways of using a headless browser.
I suggest looking into a more versatile text-to-audio library:
https://gist.github.com/alotaiba/1728771
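Short of a full headless browser, one possible shortcut is to pull the URLs that waitGenerateAudio receives straight out of that inline script block and poll the checkFinish URL from C# yourself. Whether the server actually finishes generating the file without the page's JavaScript running is an assumption, so treat this only as a sketch of the extraction step:

```csharp
using System;
using System.Text.RegularExpressions;

class AudioUrlExtractor
{
    // Grab the checkFinish and audio URLs out of the inline script shown above.
    static string[] ExtractUrls(string html)
    {
        string check = Regex.Match(html, @"'(/pronunciations/checkFinish/[^']+)'").Groups[1].Value;
        string audio = Regex.Match(html, @"'(/pronunciations/audio\?[^']+)'").Groups[1].Value;
        return new[] { check, audio };
    }

    static void Main()
    {
        // Script text copied from the answer above.
        string script = @"var wait = new waitGenerateAudio(
            '#progress_audio_placeholder',
            '/pronunciations/checkFinish/1431151184.739',
            'aGVsbG8gZnlyeWU=',
            '/pronunciations/audio?file_name=1431151184.739.mp3&file_type=mp3',
            'retry');";

        foreach (var u in ExtractUrls(script))
            Console.WriteLine(u);
        // The client could then poll the first URL until the server reports
        // the file is ready, and download the second.
    }
}
```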
I am developing a small crawler that will only be used on our company's web site. The crawler takes a URL, crawls it, reads its content, and extracts the other URLs on that page; it then crawls those URLs the same way, reading their content as well.
I want to do all of these tasks simultaneously. Over a year ago I developed a multi-threaded file downloader which downloads files simultaneously.
Here is a snippet of it:
var list = new[]
{
    "http://google.com",
    "http://yahoo.com",
    "http://stackoverflow.com"
};

Parallel.ForEach(list, s =>
{
    using (var client = new WebClient())
    {
        Console.WriteLine("starting to download {0}", s);
        string result = client.DownloadString(s);
        Console.WriteLine("finished downloading {0}", s);
    }
});
It would be very helpful if someone could guide me on how to code this. Thanks.
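The missing piece relative to the downloader above is a thread-safe record of which URLs have already been visited, so the same page is never fetched twice. Below is a minimal level-by-level sketch; the network fetch and link extraction are stubbed out by an in-memory site map (all page names hypothetical) so that only the control flow is shown:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class CrawlerSketch
{
    // Stand-in for DownloadString + link extraction: page -> outgoing links.
    // A real crawler would fetch the URL here and parse its anchors.
    static readonly Dictionary<string, string[]> Site = new Dictionary<string, string[]>
    {
        { "/",  new[] { "/a", "/b" } },
        { "/a", new[] { "/b", "/c" } },
        { "/b", new[] { "/c" } },
        { "/c", new string[0] },
    };

    // Crawl level by level; returns how many distinct pages were visited.
    static int Crawl(string start)
    {
        var visited = new ConcurrentDictionary<string, bool>();
        var frontier = new List<string> { start };

        while (frontier.Count > 0)
        {
            var next = new ConcurrentBag<string>();
            // Each level of links is fetched in parallel, like the downloader above.
            Parallel.ForEach(frontier, url =>
            {
                if (!visited.TryAdd(url, true)) return; // another thread already took it
                foreach (var link in Site[url])
                    next.Add(link);
            });
            frontier = new List<string>(next);
        }
        return visited.Count;
    }

    static void Main()
    {
        Console.WriteLine("{0} pages crawled", Crawl("/")); // 4 pages crawled
    }
}
```

The ConcurrentDictionary is what keeps the crawl from looping forever on pages that link back to each other.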
Getting the HTML
public string getHTML(string url)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    using (var response = (HttpWebResponse)request.GetResponse())
    using (var sr = new StreamReader(response.GetResponseStream()))
    {
        return sr.ReadToEnd();
    }
}
To parse the markup, use a parser like HTML Agility Pack.
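Putting the two halves of this answer together: once getHTML has returned the markup, HTML Agility Pack can pull out the anchors for the crawler to follow. A sketch, assuming the HtmlAgilityPack NuGet package is referenced:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class LinkExtractor
{
    // Return the href of every <a> element in the given markup.
    static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null) // SelectNodes returns null when nothing matches
            foreach (var a in anchors)
                links.Add(a.GetAttributeValue("href", ""));
        return links;
    }

    static void Main()
    {
        string html = "<html><body><a href='/page1'>one</a><a href='/page2'>two</a></body></html>";
        foreach (var link in ExtractLinks(html))
            Console.WriteLine(link); // /page1 then /page2
    }
}
```

Each extracted link would then be fed back into the crawl loop.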
I need to send data to some URL using POST, but from C# code. Then I would like to display the result as a page.
By now I have some sample code, but it isn't working the way I want.
I'm using ASP.NET MVC as the client.
using (WebClient client = new WebClient())
{
    byte[] response = client.UploadValues(url, new NameValueCollection()
    {
        { "desc",       payment.Description },
        { "first_name", payment.FirstName },
        { "last_name",  payment.LastName },
        { "email",      payment.Email },
        { "client_ip",  payment.ClientIp },
    });
    var str = System.Text.Encoding.Default.GetString(response);
    Response.Write(str);
}
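On newer frameworks the same POST can also be written with HttpClient and FormUrlEncodedContent, which makes the exact body being sent easy to inspect. A sketch (the field values are made up, and url stands for whatever target the snippet above used):

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;

class FormPostSketch
{
    static void Main()
    {
        var fields = new Dictionary<string, string>
        {
            { "desc",       "Test payment" },
            { "first_name", "John" },
            { "last_name",  "Doe" },
            { "email",      "john@example.com" },
        };

        var content = new FormUrlEncodedContent(fields);

        // Inspect the application/x-www-form-urlencoded body before sending.
        string body = content.ReadAsStringAsync().Result;
        Console.WriteLine(body);

        // Sending it would look like this (synchronous for brevity):
        // using (var client = new HttpClient())
        // {
        //     var response = client.PostAsync(url, content).Result;
        //     string page = response.Content.ReadAsStringAsync().Result;
        // }
    }
}
```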
I have a website like this: http://www.lfp.fr/ligue1/feuille_match/52255. I want to switch between the infoMatch and Statistiques tabs, but my code only shows me the data of the first page. When I use Firebug to check the response, it shows me this request:
GET showStatsJoueursMatch?matchId=52255&domId=112&extId=24&live=0&domNomClub=AJ+Auxerre&extNomClub=FC+Nantes
string url = "http://www.lfp.fr/ligue1/feuille_match/52255";
string getData = "?matchId=52255&domId=112&extId=24&live=0&domNomClub=AJ+Auxerre&extNomClub=FC+Nantes";
var uriObj = new System.Uri(url);

lgRequest = (HttpWebRequest)WebRequest.CreateDefault(uriObj);
lgRequest.Method = "GET";
lgRequest.ContentType = "text/html";
SetRequestHeader("Accept", "text/html");
SetRequestHeader("Cache-Control", "no-cache");
SetRequestHeader("Content-Length", getData.Length.ToString());

StreamWriter stream = new StreamWriter(lgRequest.GetRequestStream(), Encoding.ASCII);
stream.Write(getData);
stream.Close();

lgResponse = (HttpWebResponse)lgRequest.GetResponse();
But this gives me the error "Cannot send a content-body with this verb-type." And when I use "POST" as the method, I get an HTML response, but only the first page's data, not the Statistiques tab.
Try the following address: http://www.lfp.fr/ligue1/feuille_match/showStatsJoueursMatch?matchId=52255&domId=112&extId=24&live=0&domNomClub=AJ+Auxerre&extNomClub=FC+Nantes
Like this:
using System;
using System.Net;

class Program
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            string result = client.DownloadString("http://www.lfp.fr/ligue1/feuille_match/showStatsJoueursMatch?matchId=52255&domId=112&extId=24&live=0&domNomClub=AJ+Auxerre&extNomClub=FC+Nantes");
            Console.WriteLine(result);
        }
    }
}
Notice that I have used a WebClient instead of a WebRequest, which makes the code much shorter and easier to understand.
Once you have downloaded the HTML from the remote site you might consider using an HTML parsing library such as HTML Agility Pack to extract the useful information from the markup you have scraped.
I'm trying to scrape a web page using C#; however, after the page loads, it executes some JavaScript that loads more elements into the DOM, which I also need to scrape. A standard scraper simply grabs the HTML of the page on load and doesn't pick up the DOM changes made by the JavaScript. How do I put in some sort of functionality to wait a second or two and then grab the resulting source?
Here is my current code:
private string ScrapeWebpage(string url, DateTime? updateDate)
{
    HttpWebRequest request = null;
    HttpWebResponse response = null;
    Stream responseStream = null;
    StreamReader reader = null;
    string html = null;
    try
    {
        // Create the request (which supports HTTP compression).
        request = (HttpWebRequest)WebRequest.Create(url);
        request.Pipelined = true;
        request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
        if (updateDate != null)
            request.IfModifiedSince = updateDate.Value;

        // Get the response and wrap the stream for decompression if needed.
        response = (HttpWebResponse)request.GetResponse();
        responseStream = response.GetResponseStream();
        if (response.ContentEncoding.ToLower().Contains("gzip"))
            responseStream = new GZipStream(responseStream, CompressionMode.Decompress);
        else if (response.ContentEncoding.ToLower().Contains("deflate"))
            responseStream = new DeflateStream(responseStream, CompressionMode.Decompress);

        // Read the HTML.
        reader = new StreamReader(responseStream, Encoding.Default);
        html = reader.ReadToEnd();
    }
    finally
    {
        // Dispose of the objects.
        if (response != null)
            response.Close();
        if (responseStream != null)
            responseStream.Dispose();
        if (reader != null)
            reader.Dispose();
    }
    return html;
}
Here's a sample URL:
http://www.realtor.com/realestateandhomes-search/geneva_ny#listingType-any/pg-4
You'll see that when the page first loads it says 134 listings found, and then after a second it says 187 properties found.
To execute the JavaScript, I use WebKit to render the page; it is the engine used by Chrome and Safari. Here is an example using its Python bindings.
WebKit also has .NET bindings, but I haven't used them.
The approach you have will not work regardless of how long you wait; you need a browser to execute the JavaScript (or something else that understands JavaScript).
Try this question:
What's a good tool to screen-scrape with Javascript support?
You would need to execute the JavaScript yourself to get this functionality. Currently, your code only receives whatever the server replies with for the URL you request. The rest of the listings "show up" because the browser downloads, parses, and executes the accompanying JavaScript.
The answer to this similar question says to use a web browser control to load the page and process it before scraping it, perhaps with some kind of timer delay to give the JavaScript time to execute and return its results.