I am developing a small crawler that will only be used on our company's web site. The crawler takes a URL, crawls it, reads the content of that page, and also extracts the other URLs on that page and starts crawling those too; the process continues the same way, reading each page's content and crawling the URLs found in it.
I want to do all of these tasks simultaneously. More than a year ago I developed a multi-threaded file downloader which downloads files simultaneously.
Here is a snippet from it:
var list = new[]
{
    "http://google.com",
    "http://yahoo.com",
    "http://stackoverflow.com"
};

Parallel.ForEach(list, s =>
{
    using (var client = new WebClient())
    {
        Console.WriteLine("starting to download {0}", s);
        string result = client.DownloadString(s);
        Console.WriteLine("finished downloading {0}", s);
    }
});
It would be very helpful if someone could guide me on how to write the code to achieve my goal. Thanks.
Getting the HTML
public string getHTML(string url)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        return sr.ReadToEnd();
    }
}
To parse the HTML, use a parser such as HTML Agility Pack.
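To tie this back to the original crawler goal, here is a minimal sketch of a concurrent crawler built from the same pieces: Parallel.ForEach for simultaneous downloads and HTML Agility Pack for link extraction. The start URL and the example.com domain filter are placeholders, and a real crawler would also want depth limits and politeness delays:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Threading.Tasks;
using HtmlAgilityPack;

class Crawler
{
    // Tracks URLs we have already seen so no page is crawled twice.
    static readonly ConcurrentDictionary<string, bool> visited =
        new ConcurrentDictionary<string, bool>();

    static void Main()
    {
        Crawl(new[] { "http://www.example.com/" });  // hypothetical start URL
    }

    static void Crawl(IEnumerable<string> urls)
    {
        // Keep only URLs we have not visited before.
        var pending = urls.Where(u => visited.TryAdd(u, true)).ToList();
        if (pending.Count == 0) return;

        var discovered = new ConcurrentBag<string>();

        // Download and parse the current batch in parallel.
        Parallel.ForEach(pending, url =>
        {
            try
            {
                string html;
                using (var client = new WebClient())
                {
                    html = client.DownloadString(url);
                }

                // TODO: read/process the page content here.

                // Extract links with HTML Agility Pack.
                var doc = new HtmlDocument();
                doc.LoadHtml(html);
                var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
                if (anchors == null) return;

                foreach (var a in anchors)
                {
                    Uri absolute;
                    if (Uri.TryCreate(new Uri(url), a.GetAttributeValue("href", ""), out absolute)
                        && absolute.Host.EndsWith("example.com"))  // stay on our own site (assumed filter)
                        discovered.Add(absolute.AbsoluteUri);
                }
            }
            catch (WebException ex)
            {
                Console.WriteLine("failed to crawl {0}: {1}", url, ex.Message);
            }
        });

        // Recurse into the newly discovered URLs; terminates because
        // the visited set only ever grows.
        Crawl(discovered);
    }
}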
Related
I want to download web pages' HTML code, but I have problems with several links. For example: http://www.business-top.info/, http://azerizv.az/
I receive no HTML at all using either of the following:
1. WebClient:
using (var client = new WebClient())
{
    client.Encoding = System.Text.Encoding.UTF8;
    string result = client.DownloadString(resultUrl);
    Console.WriteLine(result);
    Console.ReadLine();
}
2. HttpWebRequest/HttpWebResponse:
var request = (HttpWebRequest)WebRequest.Create(resultUrl);
request.Method = "POST";
using (var response = (HttpWebResponse)request.GetResponse())
using (var stream = response.GetResponseStream())
using (var sr = new StreamReader(stream, Encoding.UTF8))
{
    string data = sr.ReadToEnd();
    Console.WriteLine(data);
    Console.ReadLine();
}
There are many such links, so I can't fetch the HTML manually for each page via the browser's view-source.
Some pages load in stages: first they load the core of the page, and only then does the JavaScript inside it run and load further content via AJAX. To scrape these pages you will need a more advanced scraping tool than a simple HTTP request sender.
EDIT:
Here is a question on SO about the same problem that you are having now:
Jquery Ajax Web page scraping using c#
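One option is a headless browser. As a sketch (assuming the Selenium.WebDriver NuGet package and a chromedriver binary are installed; any headless browser would do), Selenium loads the page, lets its JavaScript run, and then hands you the resulting DOM:

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class Program
{
    static void Main()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");  // run Chrome without a visible window

        using (IWebDriver driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl("http://www.business-top.info/");
            // PageSource now contains the DOM after the page's scripts have run.
            Console.WriteLine(driver.PageSource);
        }
    }
}

For content that only arrives via AJAX some time after the initial load, you may still need an explicit wait before reading PageSource.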
I followed the instructions at: HTTP request with post to get the audio file from the site http://www.tudienabc.com/phat-am-tieng-nhat (this site lets you input an English or Japanese word/phrase/sentence and generates an audio file, which appears as "/pronunciations/audio?file_name=1431134309.002.mp3&file_type=mp3" at line 129 of the HTML code after the postback).
However, the audio file I get from my own application is not the same as the one generated by the website. The audio file (mp3) generated by the website can be played at www.tudienabc.com/pronunciations/ (such as: www.tudienabc.com/pronunciations/audio?file_name=1431141268.9947.mp3&file_type=mp3), but the audio file generated by my application cannot be played (such as: www.tudienabc.com/pronunciations/audio?file_name=1431141475.4908.mp3&file_type=mp3).
So, what is wrong? And how do I get the correct audio file?
Here is my code:
var request = (HttpWebRequest)WebRequest.Create("http://www.tudienabc.com/phat-am-tieng-nhat");
var postData = "_method=POST&data[Pronun][text]=hello&data[Pronun][type]=3";
var data = Encoding.ASCII.GetBytes(postData);
request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";
request.ContentLength = data.Length;
using (var stream = request.GetRequestStream())
{
stream.Write(data, 0, data.Length);
}
var response = (HttpWebResponse)request.GetResponse();
var responseString = new StreamReader(response.GetResponseStream()).ReadToEnd();
int m = responseString.IndexOf("pronunciations/audio?file_name=")+"pronunciations/audio?file_name=".Length;
int n = responseString.IndexOf("&file_type=mp3");
string filename = responseString.Substring(m, n - m);
return filename;
Thank you,
Their website generates the audio using ECMAScript:
<script>
var wait = new waitGenerateAudio(
    '#progress_audio_placeholder',
    '/pronunciations/checkFinish/1431151184.739',
    'aGVsbG8gZnlyeWU=',
    '/pronunciations/audio?file_name=1431151184.739.mp3&file_type=mp3',
    'Tạo file lại'
);
</script>
You will need to be able to run that JavaScript for the audio file to be created.
Check out
C# httpwebrequest and javascript
or
WebClient runs javascript
for how to use a headless browser.
I also suggest looking into a more versatile library for text to audio:
https://gist.github.com/alotaiba/1728771
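If the goal is simply to turn a word or phrase into audio, generating it locally can be simpler than scraping the site. A minimal sketch using the built-in System.Speech synthesizer (an alternative, not the website's method; Windows only, requires a reference to the System.Speech assembly, produces WAV rather than MP3, and voice quality and Japanese support depend on the voices installed):

using System.Speech.Synthesis;

class Program
{
    static void Main()
    {
        using (var synth = new SpeechSynthesizer())
        {
            // Write the spoken text to a WAV file instead of the speakers.
            synth.SetOutputToWaveFile("hello.wav");
            synth.Speak("hello");
        }
    }
}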
I have a website like this: http://www.lfp.fr/ligue1/feuille_match/52255 and I want to switch between the infoMatch and Statistiques tabs, but it shows me the data of the first page only. When I use Firebug to check the response, it gives me this:
GET showStatsJoueursMatch?matchId=52255&domId=112&extId=24&live=0&domNomClub=AJ+Auxerre&extNomClub=FC+Nantes
string url = "http://www.lfp.fr/ligue1/feuille_match/52255";
string getData = "?matchId=52255&domId=112&extId=24&live=0&domNomClub=AJ+Auxerre&extNomClub=FC+Nantes";
System.Uri uriObj = new System.Uri(url);
String Methode = "GET";

lgRequest = (HttpWebRequest)WebRequest.CreateDefault(uriObj);
lgRequest.Method = Methode;
lgRequest.ContentType = "text/html";
SetRequestHeader("Accept", "text/html");
SetRequestHeader("Cache-Control", "no-cache");
SetRequestHeader("Content-Length", getData.Length.ToString());

StreamWriter stream = new StreamWriter(lgRequest.GetRequestStream(), Encoding.ASCII);
stream.Write(getData);
stream.Close();
lgResponse = (HttpWebResponse)lgRequest.GetResponse();
But it gives me the error "Cannot send a content-body with this verb-type." And when I use "POST" as the method, I get an HTML response, but only the first page's data, not the Statistiques.
Try the following address: http://www.lfp.fr/ligue1/feuille_match/showStatsJoueursMatch?matchId=52255&domId=112&extId=24&live=0&domNomClub=AJ+Auxerre&extNomClub=FC+Nantes
Just like that:
using System;
using System.Net;

class Program
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            string result = client.DownloadString("http://www.lfp.fr/ligue1/feuille_match/showStatsJoueursMatch?matchId=52255&domId=112&extId=24&live=0&domNomClub=AJ+Auxerre&extNomClub=FC+Nantes");
            Console.WriteLine(result);
        }
    }
}
Notice that I have used a WebClient instead of a WebRequest, which makes the code much shorter and easier to understand.
Once you have downloaded the HTML from the remote site, you might consider using an HTML parsing library such as HTML Agility Pack to extract the useful information from the markup you have scraped.
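For instance, here is a minimal sketch of pulling rows out of the downloaded markup with HTML Agility Pack; the "//table//tr" XPath is a placeholder, and the real page structure will dictate the query:

using System;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            string result = client.DownloadString("http://www.lfp.fr/ligue1/feuille_match/showStatsJoueursMatch?matchId=52255&domId=112&extId=24&live=0&domNomClub=AJ+Auxerre&extNomClub=FC+Nantes");

            var doc = new HtmlDocument();
            doc.LoadHtml(result);

            // SelectNodes returns null when the XPath matches nothing.
            var rows = doc.DocumentNode.SelectNodes("//table//tr");
            if (rows != null)
            {
                foreach (var row in rows)
                    Console.WriteLine(row.InnerText.Trim());
            }
        }
    }
}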
I have a URL like this: "https://site.com/cgi-bin/somescript.pl?file=12345.pdf&type=application/pdf". When I go to this URL it displays a PDF in my browser in an iframe. Is it possible to save this PDF? I am thinking of getting the HTTP response stream and dumping it into a file. Please advise. Thanks.
Something like this should work:
const string FILE_PATH = "C:\\foo.pdf";
const string DOWNLOADER_URI = "https://site.com/cgi-bin/somescript.pl?file=12345.pdf&type=application/pdf";

using (var writeStream = File.OpenWrite(FILE_PATH))
{
    var httpRequest = WebRequest.Create(DOWNLOADER_URI) as HttpWebRequest;
    using (var httpResponse = httpRequest.GetResponse())
    {
        httpResponse.GetResponseStream().CopyTo(writeStream);
    }
}
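If you prefer not to manage the streams yourself, WebClient.DownloadFile wraps the whole request, response, and copy cycle in a single call; a sketch using the same (assumed) paths as above:

using (var client = new WebClient())
{
    client.DownloadFile(DOWNLOADER_URI, FILE_PATH);
}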
I'm writing a very simple application that is supposed to download files from the internet. I have the URLs and the file names to save them under in arrays, but my code doesn't work:
for (int i = 1; i < links.Length; i++)
{
    Uri uri = new Uri(links[i]);
    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(uri);
    webRequest.Method = "GET";
    HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse();
    Stream responseStream = webResponse.GetResponseStream();
    StreamReader responseStreamReader = new StreamReader(responseStream);
    String result = responseStreamReader.ReadToEnd();
    StreamWriter w = new StreamWriter(savepath + names[i]);
    w.Write(result);
    w.Close();
    break;
}
Example URL:
http://books.google.pl/books?id=yOz1ePt39WQC&pg=PA2&img=1&zoom=3&hl=pl&sig=ACfU3U0MDQtXGU_3YVqGvcsDiWLLcKh0KA&w=800&gbd=1
Example file name:
002.png
The files are supposed to be saved as PNG images, but instead I get something that begins with
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN""http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
Second question: how can I detect an HTTP 404 error when trying to download?
EDIT:
My bad, my links were incorrect. After replacing &amp; with & they are correct.
Example link (corrected):
http://books.google.pl/books?id=yOz1ePt39WQC&pg=PA2&img=1&zoom=3&hl=pl&sig=ACfU3U0MDQtXGU_3YVqGvcsDiWLLcKh0KA&w=800&gbd=1
Despite that, I still can't download the PNGs correctly.
They don't open, but at least they are not HTML pages.
I'm thinking that trying to save them as a string is not a good idea, but I don't know how else I could do it. Maybe using byte[] or something?
Have you tried WebClient.DownloadFile?
string url = "http://books.google.pl/books?id=yOz1ePt39WQC&pg=PA2&img=1&zoom=3&hl=pl&sig=ACfU3U0MDQtXGU_3YVqGvcsDiWLLcKh0KA&w=800&gbd=1";
string file = "002.png";
WebClient wc = new WebClient();
wc.DownloadFile(url, file);
will save the image in the application directory as 002.png.
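As for the second question, detecting a 404: WebClient (like HttpWebRequest) throws a WebException for HTTP error statuses, and the status code can be read off the response attached to the exception. A minimal sketch:

try
{
    using (var wc = new WebClient())
    {
        wc.DownloadFile(url, file);
    }
}
catch (WebException ex)
{
    // For protocol errors the failed response is attached to the exception.
    var response = ex.Response as HttpWebResponse;
    if (response != null && response.StatusCode == HttpStatusCode.NotFound)
    {
        Console.WriteLine("404 - not found: {0}", url);
    }
    else
    {
        throw;  // some other network failure
    }
}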