Download and encode HTML page into file - c#

I would like to download some web pages that use charset="UTF-8".
This page is a sample: http://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2003
I always end up with special characters like this:
BeyoncÃ© instead of Beyoncé
I tried the following code:
WebClient webClient = new WebClient();
webClient.Encoding = System.Text.Encoding.UTF8;
webClient.DownloadFile(url, fileName);
or this one:
WebClient client = new WebClient();
Byte[] pageData = client.DownloadData(url);
string pageHtml = Encoding.UTF8.GetString(pageData);
System.IO.File.WriteAllText(fileName, pageHtml);
What am I doing wrong?
I just want an easy way to download web pages and write them to files. Once that is done, I will extract data from these files, and obviously I want the "normal" characters I see on the original web page, not garbled ones.

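The problem is that the WriteAllText overload without an encoding argument writes UTF-8 without a byte order mark (BOM), so a program that opens the file may guess a different encoding.
You should pass the encoding explicitly; Encoding.UTF8 writes the BOM: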
System.IO.File.WriteAllText(fileName, pageHtml, Encoding.UTF8);
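
To make the symptom and the fix easier to see, here is a minimal sketch. The class and method names (Utf8Demo, Misdecode, WriteWithBom) are made up for illustration and are not part of the question or the answer. The first method reproduces the garbling by decoding UTF-8 bytes as ISO-8859-1; the second writes a file with Encoding.UTF8 so that the bytes on disk start with the BOM (EF BB BF).

```csharp
using System;
using System.IO;
using System.Text;

static class Utf8Demo
{
    // Decodes UTF-8 bytes with the wrong encoding (ISO-8859-1) to reproduce the garbling.
    public static string Misdecode(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
    }

    // Writes text as UTF-8 with a BOM and returns the bytes that ended up on disk.
    public static byte[] WriteWithBom(string path, string text)
    {
        File.WriteAllText(path, text, Encoding.UTF8);
        return File.ReadAllBytes(path);
    }
}
```

Calling Utf8Demo.Misdecode("Beyonc\u00E9") yields "BeyoncÃ©", which is the pattern from the question. It suggests the file itself may be fine and that the program reading it assumed the wrong encoding.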

Related

Some image URLs don't work with the WebClient DownloadData method in C#

Can anyone tell me why some of the URLs listed below fail to download, while other URLs on the same host work fine?
URLs
http://hositing/mylibrary/image.jpd.png
http://hositing/mylibrary/a_asd.png
http://hositing/mylibrary/a?asd.png
//is it because of the special characters?
If it is because of the special characters, is there a way to solve it? The URLs are entered by the client, so a bad one could crash my program.
If there is no way to solve it, I will skip processing whenever the URL contains a special character.
Some code for reference:
WebClient wc = new WebClient();
byte[] bytes = wc.DownloadData(@"http://hositing/mylibrary/image.jpd.png");
MemoryStream ms = new MemoryStream(bytes);
Image myImage = Image.FromStream(ms);
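
One possible way to guard against problematic user input is sketched below. This is an assumption about the cause, not a confirmed diagnosis. In a URL such as http://hositing/mylibrary/a?asd.png, everything after the ? is parsed as a query string, so the server is asked for /mylibrary/a. The helper names (UrlInput, TryParseHttpUrl, EscapeSegment) are hypothetical.

```csharp
using System;

static class UrlInput
{
    // Hypothetical helper: accepts only absolute http/https URLs.
    public static bool TryParseHttpUrl(string input, out Uri uri)
    {
        return Uri.TryCreate(input, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }

    // Escapes a single file name so that characters such as '?' stay part of the path.
    public static string EscapeSegment(string fileName)
    {
        return Uri.EscapeDataString(fileName);
    }
}
```

With this sketch, UrlInput.EscapeSegment("a?asd.png") gives "a%3Fasd.png". Wrapping the DownloadData call in a try/catch for WebException would additionally keep a failed download from crashing the program.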

How to save string content to a local file

I am developing an application that shows web pages through a web browser control.
When I click the save button, the web page, including its images, should be stored in local storage. It should be saved in .html format.
I have the following code:
WebRequest request = WebRequest.Create(txtURL.Text);
WebResponse response = request.GetResponse();
Stream data = response.GetResponseStream();
string html = String.Empty;
using (StreamReader sr = new StreamReader(data))
{
html = sr.ReadToEnd();
}
Now the string html contains the web page content. I need to save it into D:\Cache\
How do I save the HTML content to disk?
You can use this code to write your HTML string to a file:
var path = @"D:\Cache\myfile.html";
File.WriteAllText(path, html);
Further refinement: Extract the filename from your (textual) URL.
Update:
See Get file name from URI string in C# for details. The idea is:
var uri = new Uri(txtURL.Text);
var filename = uri.IsFile
? System.IO.Path.GetFileName(uri.LocalPath)
: "unknown-file.html";
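
Note that Uri.IsFile is true only for file:// URIs, so for an http URL the snippet above always falls back to "unknown-file.html". A hedged variant that takes the last path segment of any absolute URL could look like this; the names UrlFileNames and FromUrl are hypothetical, and the fallback name is simply reused from the snippet above.

```csharp
using System;
using System.IO;

static class UrlFileNames
{
    // Hypothetical helper: derives a local file name from the last path segment of a URL.
    public static string FromUrl(string url, string fallback)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
            return fallback;

        // LocalPath excludes the query string, e.g. "/docs/page.html".
        string name = Path.GetFileName(uri.LocalPath);
        return string.IsNullOrEmpty(name) ? fallback : name;
    }
}
```

For example, UrlFileNames.FromUrl("http://example.com/docs/page.html?x=1", "unknown-file.html") returns "page.html", while a URL ending in "/" returns the fallback. The example.com URL is a placeholder.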
You have to put the code below in the save button's click handler:
File.WriteAllText(path, browser.Document.Body.Parent.OuterHtml, Encoding.GetEncoding(browser.Document.Encoding));
Using Body.Parent should save the whole page instead of only part of it.
Check it.
As far as I know, there is nothing built into the .NET Framework for this.
So my approach would be as follows:
1. Use System.Net.HttpWebRequest to get the main HTML document as a string or stream (easy; you have already done this).
2. Load this into an HtmlAgilityPack document, where you can easily query the document to get lists of all image elements, stylesheet links, etc.
3. Make a separate web request for each of these files and save them to a subdirectory.
4. Finally, update all relevant links in the main page to point to the items in the subdirectory.
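
The link handling in the last two steps can be sketched with the standard library alone. This sketch assumes the links have already been extracted by whatever HTML parser is used (HtmlAgilityPack is not shown here), and the names ResourceLinks, Resolve and LocalPathFor are hypothetical.

```csharp
using System;
using System.IO;

static class ResourceLinks
{
    // Resolves a link found in the page (possibly relative) against the page URL.
    public static Uri Resolve(string pageUrl, string link)
    {
        return new Uri(new Uri(pageUrl), link);
    }

    // Maps a resolved resource URL to a relative path inside a local subdirectory.
    public static string LocalPathFor(Uri resource, string subdirectory)
    {
        string name = Path.GetFileName(resource.LocalPath);
        return subdirectory + "/" + name;
    }
}
```

Each resolved Uri would be downloaded to the path returned by LocalPathFor, and that same relative path would replace the original link in the saved page. Two resources with the same file name would collide in this sketch, so a real implementation would need a way to make names unique.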

Set encoding between PHP soap server and c# soap client

I have a PHP SOAP server (using nuSOAP with WSDL) that sends the content of an HTML page. Of course, the HTML can use different encodings, and this is where the problems appear. If I use a PHP SOAP client, I can send the encoding like this:
$clienteSOAP = new SoapClient ("http://test.mine.com/wsdl/filemanager?wsdl",
array ('encoding' => 'ISO-8859-15'));
$clienteSOAP->__soapCall ("Test.uploadHTML",
array (file_get_contents ('/home/КОЛЛЕКЦИЯ_РОДНИК_ПРЕМИУМ.html')));
As long as I set the correct encoding, this has never failed so far. But when I use a C# client, how can I set the encoding in the web service request? In C# the code is:
System.IO.StreamReader html = new System.IO.StreamReader (
"C:\\Documents and Settings\\КОЛЛЕКЦИЯ_РОДНИК_ПРЕМИУМ.html"
,System.Text.Encoding.GetEncoding("iso-8859-15"));
string contenido = html.ReadToEnd();
html.Close();
Test.FileManager upload = new Test.FileManager();
string resultado = upload.TestUploadHTML (contenido);
Test.FileManager is a web reference generated from the WSDL, and when I look at the uploaded HTML, some characters are incorrect.
Thanks in advance.
nuSOAP internally uses the PHP function xml_parser_create, which only supports ISO-8859-1, UTF-8 and US-ASCII. For this reason, the library does not work well with other encodings. Great PacHecoPe...
UPDATE: The best option, in my case, is to read the file in its original encoding and convert it to UTF-8:
System.IO.StreamReader html = new System.IO.StreamReader (
"C:\\Documents and Settings\\КОЛЛЕКЦИЯ_РОДНИК_ПРЕМИУМ.html"
,System.Text.Encoding.GetEncoding("iso-8859-15"));
string contenido = html.ReadToEnd();
html.Close();
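// Note: a .NET string is always UTF-16 in memory, so this round trip leaves the text unchanged;
// the actual conversion to UTF-8 happens when the request is serialized.
System.Text.UTF8Encoding encoder = new System.Text.UTF8Encoding();
byte[] bytes = System.Text.Encoding.UTF8.GetBytes (contenido);
string contenidoUTF8 = encoder.GetString(bytes);
// Create the proxy before setting its request encoding.
Test.FileManager upload = new Test.FileManager();
upload.RequestEncoding = System.Text.Encoding.GetEncoding("UTF-8");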
string resultado = upload.TestUploadHTML (contenidoUTF8);
UPDATE2: With encodings that do not convert cleanly to UTF-8, such as Big5, the code above does not work very well. For this reason, it is better not to convert to UTF-8 and instead to declare the parameter that carries the HTML content as base64Binary in the WSDL.
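
As a rough illustration of the base64 idea, the sketch below sends the file's raw bytes instead of a decoded string, so no character conversion happens on the way. The class name Base64Payload is hypothetical. A generated .NET proxy typically exposes a base64Binary parameter as byte[] and does the encoding itself, in which case only File.ReadAllBytes would be needed on the client side.

```csharp
using System;

static class Base64Payload
{
    // Encodes raw file bytes as base64 text; the original encoding is preserved byte for byte.
    public static string Encode(byte[] rawBytes)
    {
        return Convert.ToBase64String(rawBytes);
    }

    // Restores the exact original bytes on the receiving side.
    public static byte[] Decode(string payload)
    {
        return Convert.FromBase64String(payload);
    }
}
```

For instance, the two bytes 0xA4 0x40 encode to "pEA=" and decode back to the same two bytes, regardless of which character set they belong to.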

Unable to display languages other than English in System.Windows.Forms.WebBrowser

I am trying to use System.Windows.Forms.WebBrowser to display content in languages other than English, but the resulting encoding is incorrect. What should I do to display, for example, Russian?
I am downloading and displaying a string as follows:
System.Net.WebClient wc = new System.Net.WebClient();
webBrsr.DocumentText = wc.DownloadString(url);
The problem is with WebClient and how it interprets the string encoding. One solution is to download the data as raw bytes and decode it manually:
byte[] bytes = wc.DownloadData("http://news.google.com/news?edchanged=1&ned=ru_ru");
//You should really inspect the headers from the response to determine the exact encoding to use,
// this example just assumes UTF-8 which might work in most scenarios
String t = System.Text.Encoding.UTF8.GetString(bytes);
webBrsr.DocumentText = t;
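
Inspecting the response headers, as the comment above suggests, could be sketched as follows. The helper name CharsetSniffer.FromContentType is hypothetical, and falling back to UTF-8 when no usable charset is found is an assumption rather than a rule.

```csharp
using System;
using System.Net.Mime;
using System.Text;

static class CharsetSniffer
{
    // Hypothetical helper: picks an Encoding from a Content-Type header value,
    // falling back to UTF-8 when the charset is missing, malformed or unknown.
    public static Encoding FromContentType(string contentType)
    {
        if (string.IsNullOrWhiteSpace(contentType))
            return Encoding.UTF8;
        try
        {
            string charset = new ContentType(contentType).CharSet;
            return string.IsNullOrEmpty(charset) ? Encoding.UTF8 : Encoding.GetEncoding(charset);
        }
        catch (FormatException) { return Encoding.UTF8; }   // malformed header value
        catch (ArgumentException) { return Encoding.UTF8; } // unknown charset name
    }
}
```

After wc.DownloadData(url), the header value is available as wc.ResponseHeaders["Content-Type"], so the decoding step becomes CharsetSniffer.FromContentType(header).GetString(bytes). Pages that declare their encoding only in a meta tag are not covered by this sketch.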

Handle special chars

When I do
WebClient wc = new WebClient();
string content = wc.DownloadString(url);
File.WriteAllText(path, content);
And I open the file at path with Internet Explorer, special characters like ó appear as Ã³.
Is there a way to interpret those characters correctly?
You're downloading it in whatever content encoding is specified, but then saving it as UTF-8. If you want to save it to disk anyway, I suggest you use WebClient.DownloadFile directly instead. Then, as long as the encoding is also specified (correctly) in the HTML, it should be okay.
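
The difference between saving the bytes untouched and re-encoding them can be shown in a few lines. This is only a sketch with hypothetical names (RawSave, SaveUnchanged, ReencodeAsUtf8); SaveUnchanged stands in for what a byte-for-byte download to disk achieves.

```csharp
using System;
using System.IO;
using System.Text;

static class RawSave
{
    // Saves the downloaded bytes untouched, so the file keeps the server's encoding.
    public static void SaveUnchanged(string path, byte[] downloaded)
    {
        File.WriteAllBytes(path, downloaded);
    }

    // Decodes with the given source encoding and re-encodes as UTF-8,
    // which is what a decode followed by File.WriteAllText(path, text) amounts to.
    public static byte[] ReencodeAsUtf8(byte[] downloaded, Encoding source)
    {
        return Encoding.UTF8.GetBytes(source.GetString(downloaded));
    }
}
```

In ISO-8859-1, ó is the single byte 0xF3; after re-encoding it becomes the two bytes 0xC3 0xB3. A browser that still trusts an ISO-8859-1 declaration in the HTML then renders those two bytes as Ã³.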
