How to properly get the content of a website? - c#

I'm trying to read the content of a page and extract some information, but sometimes I get stuff like: &nbsp;Aur&eacute;lie (Verschuere)
I already do this:
string siteContent = "";
using (System.Net.WebClient client = new System.Net.WebClient())
{
    client.Encoding = System.Text.Encoding.UTF8;
    siteContent = client.DownloadString(edtReadFromUrl.Text);
}
It works when the page contains UTF-8 characters. Can't I get readable text with no HTML in it? That would be even easier.
Edit: This is not the same as the question it was marked as a duplicate of. The other solution returns strange characters too.

You could use an HTML parser to extract the meaning. For instance, with HtmlAgilityPack, you could:
HtmlDocument doc = new HtmlDocument();
string html;
using (var wc = new WebClient())
{
    html = wc.DownloadString("http://www.bbc.co.uk/news");
}
doc.LoadHtml(html);
// InnerText drops the tags and keeps only the text content of <body>.
string text = doc.DocumentNode.Element("html").Element("body").InnerText;
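The residue shown in the question (&nbsp;, &eacute;) is HTML escaping rather than a character-encoding failure, so it can be decoded once the text has been extracted. A minimal sketch, assuming the framework's System.Net.WebUtility.HtmlDecode is acceptable for this; the class name EntityDemo and the sample string are made up for illustration:

```csharp
using System;
using System.Net;

static class EntityDemo
{
    // Turns HTML character references such as &eacute; into the characters they stand for.
    public static string DecodeEntities(string text)
    {
        return WebUtility.HtmlDecode(text);
    }

    static void Main()
    {
        // Hypothetical sample resembling the output described in the question.
        string raw = "&nbsp;Aur&eacute;lie (Verschuere)";
        string decoded = DecodeEntities(raw);

        // &nbsp; becomes U+00A0 (no-break space); swap it for a plain space before trimming.
        Console.WriteLine(decoded.Replace('\u00A0', ' ').Trim());
    }
}
```

HtmlAgilityPack's InnerText leaves entities as they appear in the markup, so a decoding step like this is usually still needed after the parser has removed the tags.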

Related

Weird character encoded characters (â€™) appearing from a feed

I've got a question regarding an XML feed and an XSL transformation I'm doing. In a few parts of the feed output on an HTML page, I get weird characters (such as â€™) appearing on the page.
Another site (which I don't own) that uses the same feed doesn't get these characters.
Here's the code I'm using to grab and return the transformed content:
string xmlUrl = "http://feedurl.com/feed.xml";
string xmlData = new System.Net.WebClient().DownloadString(xmlUrl);
string xslUrl = "http://feedurl.com/transform.xsl";
XsltArgumentList xslArgs = new XsltArgumentList();
xslArgs.AddParam("type", "", "specifictype");
string resultText = Utils.XslTransform(xmlData, xslUrl, xslArgs);
return resultText;
And my Utils.XslTransform function looks like this:
static public string XslTransform(string data, string xslurl, XsltArgumentList xslArgs)
{
    TextReader textReader = new StringReader(data);
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.DtdProcessing = DtdProcessing.Ignore;
    XmlReader xmlReader = XmlReader.Create(textReader, settings);
    XmlReader xslReader = new XmlTextReader(Uri.UnescapeDataString(xslurl));
    XslCompiledTransform myXslT = new XslCompiledTransform();
    myXslT.Load(xslReader);
    StringBuilder sb = new StringBuilder();
    using (TextWriter tw = new StringWriter(sb))
    {
        // Pass the caller's arguments through so the three-argument call above compiles.
        myXslT.Transform(xmlReader, xslArgs, tw);
    }
    string transformedData = sb.ToString();
    return transformedData;
}
I'm not extremely knowledgeable with character encoding issues and I've been trying to nip this in the bud for a bit of time and could use any suggestions possible. I'm not sure if there's something I need to change with how the WebClient downloads the file or something going weird in the XslTransform.
Thanks!
Give HtmlEncode a try. So in this case you would reference System.Web and then make this change (just call the HtmlEncode function on the last line):
string xmlUrl = "http://feedurl.com/feed.xml";
string xmlData = new System.Net.WebClient().DownloadString(xmlUrl);
string xslUrl = "http://feedurl.com/transform.xsl";
XsltArgumentList xslArgs = new XsltArgumentList();
xslArgs.AddParam("type", "", "specifictype");
string resultText = Utils.XslTransform(xmlData, xslUrl, xslArgs);
return HttpUtility.HtmlEncode(resultText);
The character â marks the start of a multibyte UTF-8 sequence (â€™ is what the three bytes of ’ look like) when UTF-8-encoded text is decoded as a single-byte encoding such as Windows-1252. So, I guess, you generate an HTML file in UTF-8 while the browser interprets it otherwise. I see two ways to fix it:
The simplest solution would be to update the XSLT to include the HTML meta tag that hints the correct encoding to the browser: <meta charset="UTF-8">.
If your transform already defines a different encoding in a meta tag and you'd like to keep it, that encoding needs to be specified in the function that saves the XML as a file. I assume this function used ASCII by default in your example. If your XSLT was configured to generate XML files directly to disk, you could adjust it with the XSLT instruction <xsl:output encoding="ASCII"/>.
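The byte-level mechanism described above can be reproduced in a few lines. This is only an illustrative sketch: MojibakeDemo, Misread and Repair are made-up names, and ISO-8859-1 stands in for the single-byte encoding because it is available on every .NET runtime (Windows-1252 would additionally render the second and third bytes as € and ™).

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Decodes the UTF-8 bytes of 'text' with the wrong (single-byte) encoding.
    public static string Misread(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
    }

    // Reverses Misread: recover the original bytes, then decode them as UTF-8.
    public static string Repair(string garbled)
    {
        byte[] originalBytes = Encoding.GetEncoding("iso-8859-1").GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }

    static void Main()
    {
        string apostrophe = "\u2019";   // right single quotation mark
        string garbled = Misread(apostrophe);

        Console.WriteLine(garbled.Length);                 // 3: one character per UTF-8 byte
        Console.WriteLine(garbled[0] == '\u00E2');         // True: byte 0xE2 shows up as 'â'
        Console.WriteLine(Repair(garbled) == apostrophe);  // True
    }
}
```

Repairing after the fact only works while no bytes have been lost, so declaring or detecting the right encoding up front is the more robust fix.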
To use WebClient.DownloadString you have to know which encoding the server is going to use and tell the WebClient in advance. It's a bit of a Catch-22.
But there is no need to do that. Use WebClient.DownloadData or WebClient.OpenRead and let an XML library figure out which encoding to use.
using (var web = new WebClient())
using (var stream = web.OpenRead("http://unicode.org/repos/cldr/trunk/common/supplemental/windowsZones.xml"))
using (var reader = XmlReader.Create(stream, new XmlReaderSettings { DtdProcessing = DtdProcessing.Parse }))
{
    reader.MoveToContent();
    //… use reader as you will, including var doc = XDocument.ReadFrom(reader);
}
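To see why handing raw bytes to the XML reader helps, here is a small offline sketch that needs no network access. The feed content, the class name XmlEncodingDemo and the helper ReadRootText are hypothetical; the point is only that XmlReader reads the encoding from the XML declaration, while decoding with a guessed UTF-8 up front damages the text.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class XmlEncodingDemo
{
    // Reads the text of the root element, letting XmlReader pick the encoding
    // from the byte order mark or the XML declaration.
    public static string ReadRootText(byte[] xmlBytes)
    {
        using (var stream = new MemoryStream(xmlBytes))
        using (var reader = XmlReader.Create(stream))
        {
            reader.MoveToContent();
            return reader.ReadElementContentAsString();
        }
    }

    static void Main()
    {
        // Hypothetical feed that declares ISO-8859-1 and really is encoded that way.
        string xml = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><name>Aur\u00E9lie</name>";
        byte[] bytes = Encoding.GetEncoding("iso-8859-1").GetBytes(xml);

        // Guessing UTF-8 up front replaces the accented letter with U+FFFD ...
        Console.WriteLine(Encoding.UTF8.GetString(bytes).Contains("\uFFFD")); // True
        // ... while the XML reader honours the declaration.
        Console.WriteLine(ReadRootText(bytes) == "Aur\u00E9lie");             // True
    }
}
```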

how to remove htmldocument.cs not found error in html agility pack

class Response:
public string WebResponse(string url) // method that receives the link of a website and will parse some divs
{
    string html = string.Empty;
    try
    {
        HtmlDocument doc = new HtmlDocument(); // when execution reaches here it reports "HtmlDocument.cs not found" and opens a window to browse for the source
        WebClient client = new WebClient(); // even if I use HtmlWeb here, it reports "HtmlWeb.cs not found" instead
        html = client.DownloadString(url); // is this caused by a breakpoint? I set only one, in the method where I am parsing
        doc.LoadHtml(html);
    }
    catch (Exception)
    {
        html = string.Empty;
    }
    return html; // please help me remove this error using Html Agility Pack in a console application
}
Even if I make a new project and run the code, it gets stuck here. I have added the DLL too and it still gives me this error. Please help me remove it.
First of all, WebResponse is the name of an abstract class in System.Net. It is not a reserved word, but reusing the name for your own method is confusing. Second, in order to use WebResponse, a class has to inherit from it, i.e.
public class WR : WebResponse
{
//Code
}
Also, your current code has nothing to do with Html Agility Pack. If you want to load the HTML of a web page into an HtmlDocument, do the following:
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
try
{
    var temp = new Uri(url);
    var request = (HttpWebRequest)WebRequest.Create(temp);
    request.Method = "GET";
    using (var response = (HttpWebResponse)request.GetResponse())
    {
        using (var stream = response.GetResponseStream())
        {
            htmlDoc.Load(stream, Encoding.GetEncoding("iso-8859-9"));
        }
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}
Then, in order to get nodes in the HTML document, you have to use XPath like so:
HtmlNode node = htmlDoc.DocumentNode.SelectSingleNode("//body");
Console.WriteLine(node.InnerText);
That error sometimes comes from the version of the Html Agility Pack NuGet package you are using. Update NuGet from the Visual Studio gallery, then try installing Html Agility Pack again and run your project.
You can try cleaning and rebuilding the solution. This may fix the issue.

How to get unicode string with WebClient DownloadData?

Sorry for my bad English.
I am trying to get string data with this code:
WebClient wc = new WebClient();
byte[] buffer = wc.DownloadData("http://......");
string xml = Encoding.UTF8.GetString(buffer);
XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);
The string contains Unicode data. When I fetch it with a browser such as Firefox, everything is fine.
But in my code the string is broken and the XML file is useless. Some characters are changed to their
decimal values, and when reading the XML file those are the only characters that can be read. The others
are changed to strange signs.
Do you know what I can do?
Put your data into a stream:
var stream = new MemoryStream(buffer);
And load it with the Load method:
doc.Load(stream);
This will try to detect the correct encoding.
Or maybe WebClient.DownloadString will work as well.
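The stream-based loading suggested above can be tried without a server. The sketch below is an assumption-laden illustration: the XML content, XmlDocumentLoadDemo and LoadRootText are invented, and the buffer is UTF-16 with a byte order mark to stand in for a download that is not UTF-8.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class XmlDocumentLoadDemo
{
    // Loads the downloaded bytes through a stream so XmlDocument can detect
    // the encoding itself instead of relying on a hard-coded Encoding.UTF8.
    public static string LoadRootText(byte[] buffer)
    {
        var doc = new XmlDocument();
        using (var stream = new MemoryStream(buffer))
        {
            doc.Load(stream);
        }
        return doc.DocumentElement.InnerText;
    }

    static void Main()
    {
        // Hypothetical payload: UTF-16 (little endian) with a byte order mark.
        string text = "\u062A\u0647\u0631\u0627\u0646";
        string xml = "<?xml version=\"1.0\" encoding=\"utf-16\"?><city>" + text + "</city>";
        var utf16 = new UnicodeEncoding(false, true);
        byte[] bom = utf16.GetPreamble();
        byte[] body = utf16.GetBytes(xml);
        byte[] buffer = new byte[bom.Length + body.Length];
        Buffer.BlockCopy(bom, 0, buffer, 0, bom.Length);
        Buffer.BlockCopy(body, 0, buffer, bom.Length, body.Length);

        Console.WriteLine(LoadRootText(buffer) == text); // True
    }
}
```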

HtmlAgilityPack - How to set custom encoding when loading pages

Is it possible to set custom encoding when loading pages with the method below?
HtmlWeb hwWeb = new HtmlWeb();
HtmlDocument hd = hwWeb.Load("myurl");
I want to set encoding to "iso-8859-9".
I use C# 4.0 and WPF.
Edit: The question has been answered on MSDN.
I suppose you could try overriding the encoding in the HtmlWeb object.
Try this:
var web = new HtmlWeb
{
AutoDetectEncoding = false,
OverrideEncoding = myEncoding,
};
var doc = web.Load(myUrl);
Note: It appears that the OverrideEncoding property was added to HTML agility pack in revision 76610 so it is not available in the current release v1.4 (66017). The next best thing to do would be to read the page manually with the encodings overridden.
var document = new HtmlDocument();
using (var client = new WebClient())
{
    using (var stream = client.OpenRead(url))
    {
        var reader = new StreamReader(stream, Encoding.GetEncoding("iso-8859-9"));
        var html = reader.ReadToEnd();
        document.LoadHtml(html);
    }
}
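One caveat when trying the manual approach today: on .NET Core and .NET 5+, code-page encodings such as iso-8859-9 are, as far as I know, not registered by default, and Encoding.GetEncoding("iso-8859-9") throws until the code-pages provider is registered. The sketch below assumes such a runtime; on .NET Framework the RegisterProvider line is unnecessary and should be removed. TurkishEncodingDemo, ReadAll and the sample text are hypothetical, and a MemoryStream stands in for the stream returned by client.OpenRead.

```csharp
using System;
using System.IO;
using System.Text;

static class TurkishEncodingDemo
{
    // Decodes a byte stream with an explicitly chosen encoding.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // Assumption: .NET Core / .NET 5+, where code-page encodings must be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding turkish = Encoding.GetEncoding("iso-8859-9");
        string original = "\u0130stanbul, \u015Fi\u015Fe";   // Turkish sample text
        byte[] pageBytes = turkish.GetBytes(original);        // stands in for the downloaded page

        string decoded = ReadAll(new MemoryStream(pageBytes), turkish);
        Console.WriteLine(decoded == original);   // True
    }
}
```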
This is a simple version of the solution answered here (for some reason it got deleted).
A decent answer is over here which handles auto-detecting the encoding as well as some other nifty features:
C# and HtmlAgilityPack encoding problem

Extract data webpage

Folks,
I'm trying to extract data from a web page using C#. For the moment I use the Stream from the WebResponse and parse it as one big string. It's long and painful. Does someone know a better way to extract data from a web page? I saw WinHTTP, but it isn't for C#.
To download data from a web page it is easier to use WebClient:
string data;
using (var client = new WebClient())
{
data = client.DownloadString("http://www.google.com");
}
For parsing downloaded data, provided that it is HTML, you could use the excellent Html Agility Pack library.
And here's a complete example extracting all the links from a given page:
using System;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        using (var client = new WebClient())
        {
            string data = client.DownloadString("http://www.google.com");
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(data);
            // Select every <a> element that has an href attribute.
            var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
            foreach (HtmlNode link in nodes)
            {
                HtmlAttribute att = link.Attributes["href"];
                Console.WriteLine(att.Value);
            }
        }
    }
}
If the webpage is valid XHTML, you can read it into an XPathDocument and xpath your way quickly and easily straight to the data you want. If it's not valid XHTML, I'm sure there are some HTML parsers out there you can use.
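For the XHTML route, a rough sketch using only the framework's System.Xml.XPath types might look like this. The sample markup, XhtmlLinksDemo and ExtractLinks are hypothetical. Note that XHTML elements live in the http://www.w3.org/1999/xhtml namespace, so the XPath needs a prefix bound to it, and that a page with a DOCTYPE would additionally need DTD handling, which this sketch sidesteps.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.XPath;

static class XhtmlLinksDemo
{
    // Returns the href value of every <a> element in a well-formed XHTML string.
    public static List<string> ExtractLinks(string xhtml)
    {
        var links = new List<string>();
        using (var text = new StringReader(xhtml))
        {
            var doc = new XPathDocument(text);
            XPathNavigator nav = doc.CreateNavigator();

            // Bind a prefix to the XHTML namespace so the XPath can match the elements.
            var ns = new XmlNamespaceManager(nav.NameTable);
            ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

            XPathNodeIterator it = nav.Select("//x:a/@href", ns);
            while (it.MoveNext())
            {
                links.Add(it.Current.Value);
            }
        }
        return links;
    }

    static void Main()
    {
        // Hypothetical page content, without a DOCTYPE.
        string page = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>"
                    + "<a href=\"http://example.com/a\">A</a> <a href=\"/b\">B</a>"
                    + "</p></body></html>";

        foreach (string href in ExtractLinks(page))
        {
            Console.WriteLine(href);
        }
    }
}
```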
Found a similar question with an answer that should help.
Looking for C# HTML parser
