XML - help with RSS UTF-8 support - c#

I used this solution to read and parse an RSS feed from an ASP.NET website. This worked perfectly. However, when I try it on another site, an error occurs: "System does not support 'utf8' encoding." An extract of my code is included below.
private void Form1_Load(object sender, EventArgs e)
{
lblFeed.Text = ProcessRSS("http://buypoe.com/external.php?type=RSS2", "ScottGq");
}
public static string ProcessRSS(string rssURL, string feed)
{
WebRequest request = WebRequest.Create(rssURL);
WebResponse response = request.GetResponse();
StringBuilder sb = new StringBuilder("");
Stream rssStream = response.GetResponseStream();
XmlDocument rssDoc = new XmlDocument();
rssDoc.Load(rssStream);
XmlNodeList rssItems = rssDoc.SelectNodes("rss/channel/item");
string title = "";
string link = "";
...
The error occurs at "rssDoc.Load(rssStream);". Any help in encoding the xml correctly would be appreciated.

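Read the response through a StreamReader with an explicit encoding, then pass that reader to rssDoc.Load instead of the raw stream:
System.IO.StreamReader stream = new System.IO.StreamReader
(response.GetResponseStream(), System.Text.Encoding.GetEncoding("utf-8"));
rssDoc.Load(stream);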
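To see why reading through a StreamReader can help, here is a minimal, self-contained sketch. It assumes the feed declares its encoding as "utf8" (without the hyphen) in its XML declaration, which is what the error message suggests. The FeedLoader class and the sample feed are hypothetical and not taken from the question. When XmlDocument loads from a TextReader, the bytes have already been decoded, so the parser does not need to resolve the declared encoding name.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class FeedLoader
{
    // Hypothetical helper: decode the stream as UTF-8 ourselves, so the
    // parser does not have to resolve the encoding name in the declaration.
    public static XmlDocument LoadAsUtf8(Stream input)
    {
        XmlDocument doc = new XmlDocument();
        using (StreamReader reader = new StreamReader(input, Encoding.UTF8))
        {
            doc.Load(reader);
        }
        return doc;
    }

    // Counts the items in a small in-memory sample feed (made-up data).
    public static int CountSampleItems()
    {
        string xml =
            "<?xml version=\"1.0\" encoding=\"utf8\"?>" +
            "<rss><channel><item><title>Hello</title></item></channel></rss>";
        using (MemoryStream stream = new MemoryStream(Encoding.UTF8.GetBytes(xml)))
        {
            XmlDocument doc = LoadAsUtf8(stream);
            return doc.SelectNodes("rss/channel/item").Count;
        }
    }
}
```

In the question's code, this would amount to calling FeedLoader.LoadAsUtf8(rssStream) in place of creating the XmlDocument and calling rssDoc.Load(rssStream).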

Related

How search for HTML elements in StreamReader or String

I've been writing a simple web crawler, and I need to search for elements inside my StreamReader or string. For example, I need to get all the content inside a div with the id "bodyDiv". Which tool can help me with this?
private static string GetPage(string url)
{
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
request.UserAgent = "Simple crawler";
WebResponse response = request.GetResponse();
Stream stream = response.GetResponseStream();
StreamReader reader = new StreamReader(stream);
string htmlText = reader.ReadToEnd();
return htmlText;
}
I would use HtmlAgilityPack:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlText);
var div = doc.DocumentNode.SelectSingleNode("//div[@id='bodyDiv']");
if(div!=null)
{
var yourtext = div.InnerText;
}

Need Help in Understanding a confusion while making Web Crawlers to get total Links Count

I have started writing a web crawler. It was progressing well until I ran into something I can't understand. I have written the following code.
I am passing http://www.google.com as the URL string:
public void crawlURL(string URL, string depth)
{
if (!checkPageHasBeenCrawled(URL))
{
PageContent = getURLContent(URL);
MatchCollection matches = Regex.Matches(PageContent, "href=\"", RegexOptions.IgnoreCase);
int count = matches.Count;
}
}
private string getURLContent(string URL)
{
string content;
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(URL);
request.UserAgent = "Fetching contents Data";
WebResponse response = request.GetResponse();
Stream stream = response.GetResponseStream();
StreamReader reader = new StreamReader(stream);
content = reader.ReadToEnd();
reader.Close();
stream.Close();
return content;
}
Problem:
I am trying to get all the links on the page (http://www.google.com or any other website), but the regex finds fewer links than I expect. It reports 19 links, whereas when I searched the page source manually for "href=" I found 41 occurrences. I can't understand why the code gives a lower count.
I fixed and tested your regex pattern. The following should work more efficiently. Note that it only matches double-quoted absolute http or https URLs. It gets 11 matches from google.ca:
public void crawlURL(string URL)
{
PageContent = getURLContent(URL);
MatchCollection matches = Regex.Matches(PageContent, "(href=\"https?://[a-z0-9-._~:/?#\\[\\]@!$&'()*+,;=]+(?=\"|$))", RegexOptions.IgnoreCase);
foreach (Match match in matches)
Console.WriteLine(match.Value);
int count = matches.Count;
}
private string getURLContent(string URL)
{
string content;
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(URL);
request.UserAgent = "Fetching contents Data";
WebResponse response = request.GetResponse();
Stream stream = response.GetResponseStream();
StreamReader reader = new StreamReader(stream);
content = reader.ReadToEnd();
reader.Close();
stream.Close();
return content;
}
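One possible reason for the lower count in the question is that the pattern href=" only matches attributes written with a double quote directly after the equals sign. Links written with single quotes, without quotes, or with spaces around the equals sign are skipped. This is an assumption about the cause, not a confirmed diagnosis. The sketch below accepts all three quoting styles; LinkCounter and its method name are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkCounter
{
    // Matches href="...", href='...' and unquoted href=... values,
    // allowing whitespace around the equals sign.
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>""']+))",
        RegexOptions.IgnoreCase);

    // Returns the value of every href attribute found in the markup.
    public static List<string> ExtractHrefs(string html)
    {
        List<string> urls = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            urls.Add(match.Groups["url"].Value);
        }
        return urls;
    }
}
```

Calling LinkCounter.ExtractHrefs(PageContent).Count then gives a count that does not depend on the quoting style. For anything beyond counting, an HTML parser is usually more robust than a regex.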

how to copy all text from a certain webpage and save it to notepad C#

I have a C# Windows Forms app that launches a webpage based on some criteria.
Now I would like my app to automatically copy all the text from that page (which is in CSV format) and paste and save it in notepad.
Here is a link to an example of the data that needs to be copied:
http://www.wunderground.com/history/airport/FAJS/2012/10/28/DailyHistory.html?req_city=Johannesburg&req_state=&req_statename=South+Africa&format=1
Any Help will be appreciated.
You can use HttpClient, which is new in .NET 4.5. For example, to fetch the Google home page and save it to a file:
var httpClient = new HttpClient();
File.WriteAllText("C:\\google.txt",
httpClient.GetStringAsync("http://www.google.com")
.Result);
http://msdn.microsoft.com/en-us/library/fhd1f0sw.aspx combined with http://www.dotnetspider.com/resources/21720-Writing-string-content-file.aspx
public static void DownloadString ()
{
WebClient client = new WebClient();
string reply = client.DownloadString("http://www.wunderground.com/history/airport/FAJS/2012/10/28/DailyHistory.html?req_city=Johannesburg&req_state=&req_statename=South+Africa&format=1");
// Encode the text as UTF-8 bytes; casting each char to a byte would corrupt non-ASCII characters.
byte[] buffer = System.Text.Encoding.UTF8.GetBytes(reply);
// The using block closes the file even if the write throws.
using (FileStream fs = new FileStream(@"C:\Temp\tmp.txt", FileMode.Create))
{
fs.Write(buffer, 0, buffer.Length);
}
}
Edit: Adil uses the WriteAllText method, which is even better. So you will get something like this:
WebClient client = new WebClient();
string reply = client.DownloadString("http://www.wunderground.com/history/airport/FAJS/2012/10/28/DailyHistory.html?req_city=Johannesburg&req_state=&req_statename=South+Africa&format=1");
System.IO.File.WriteAllText (@"C:\Temp\tmp.txt", reply);
Simple way: use WebClient.DownloadFile and save as a .txt file:
var webClient = new WebClient();
webClient.DownloadFile("http://www.google.com",@"c:\google.txt");
You can use WebRequest to read the response stream into a string, and then write that string to a text file with File.WriteAllText.
WebRequest request = WebRequest.Create ("http://www.contoso.com/default.html");
request.Credentials = CredentialCache.DefaultCredentials;
HttpWebResponse response = (HttpWebResponse)request.GetResponse ();
Console.WriteLine (response.StatusDescription);
Stream dataStream = response.GetResponseStream ();
StreamReader reader = new StreamReader (dataStream);
string responseFromServer = reader.ReadToEnd ();
System.IO.File.WriteAllText (@"D:\path.txt", responseFromServer );
You may use a WebClient to do this:
System.Net.WebClient wc = new System.Net.WebClient();
byte[] raw = wc.DownloadData("http://www.wunderground.com/history/airport/FAJS/2012/10/28/DailyHistory.html?req_city=Johannesburg&req_state=&req_statename=South+Africa&format=1");
string webData = System.Text.Encoding.UTF8.GetString(raw);
The string webData then contains the complete text of the web page.
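The snippet above assumes the page is encoded as UTF-8. If that may not hold, one option is to use the charset the server declares in its Content-Type header. The sketch below is hypothetical (BodyDecoder and its parameters are not part of the answer) and falls back to UTF-8 when the charset is missing or not recognised.

```csharp
using System;
using System.Text;

public static class BodyDecoder
{
    // Decodes raw response bytes using the declared charset name,
    // falling back to UTF-8 when no usable name is given.
    public static string Decode(byte[] raw, string charset)
    {
        Encoding encoding = Encoding.UTF8;
        if (!string.IsNullOrEmpty(charset))
        {
            try
            {
                // Some servers wrap the charset value in quotes.
                encoding = Encoding.GetEncoding(charset.Trim('"', ' '));
            }
            catch (ArgumentException)
            {
                // Unknown or unsupported name: keep the UTF-8 fallback.
            }
        }
        return encoding.GetString(raw);
    }
}
```

When the page is fetched with HttpWebRequest, the charset is available from the HttpWebResponse.CharacterSet property and could be passed in as the second argument.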

Current Page HTML output using c#

I am working on an ASP.NET website. I need to get the current page's HTML output in the Page_Load event. I tried the following code, but I don't get any output; it just keeps executing.
protected void Page_Load(object sender, EventArgs e)
{
Http(Request.Url.ToString());
}
public void Http(string url)
{
if (url.Length > 0)
{
Uri myUri = new Uri(url);
// Create a 'HttpWebRequest' object for the specified url.
HttpWebRequest myHttpWebRequest = (HttpWebRequest)WebRequest.Create(myUri);
// Set the user agent as if we were a web browser
myHttpWebRequest.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4";
HttpWebResponse myHttpWebResponse = (HttpWebResponse)myHttpWebRequest.GetResponse();
var stream = myHttpWebResponse.GetResponseStream();
var reader = new StreamReader(stream);
var html = reader.ReadToEnd();
// Release resources of response object.
myHttpWebResponse.Close();
Response.Write(html);
}
}
What is wrong here?
Is there any other way to get the current page's HTML output using C#?
I tried the following code also:
protected void Page_Load(object sender, EventArgs e)
{
Page pp = this.Page;
StringWriter tw = new StringWriter();
HtmlTextWriter hw = new HtmlTextWriter(tw);
pp.RenderControl(hw);
string theOut = tw.ToString().Trim();
string FilePath = @"D:\Home.txt";
Stream s = new FileStream(FilePath, FileMode.Create);
StreamWriter sw = new StreamWriter(s);
sw.WriteLine(theOut);
sw.Close();
}
With this code I am able to get the HTML into the ".txt" file, but executing it causes the error "A page can have only one server-side Form tag." Can anybody help me solve this?
Well, you will have to bend the space-time continuum, because in the Page_Load event there is no HTML output yet, and naturally the request in your Http method (isn't that a really bad name?) will call Page_Load again.
Joking aside, you can't get the HTML output in the Page_Load event, since it hasn't been produced yet.
Update:
You can modify the output produced by the page with an HttpFilter; see this SO answer:
https://stackoverflow.com/a/10215626/351383
The Page_Render event is responsible for generating the HTML for the page, and the Unload event is called after it. In this event you should be able to get the HTML output of the page.
You can try this...
protected override void Render(HtmlTextWriter writer)
{
// Render the page into an in-memory buffer instead of the response.
StringBuilder renderedOutput = new StringBuilder();
StringWriter strWriter = new StringWriter(renderedOutput);
HtmlTextWriter tWriter = new HtmlTextWriter(strWriter);
base.Render(tWriter);
string html = renderedOutput.ToString();
// Save a copy of the rendered HTML in the page's folder.
string filename = Server.MapPath(".") + "\\data.txt";
using (StreamWriter sWriter = new StreamWriter(new FileStream(filename, FileMode.Create)))
{
sWriter.Write(html);
}
// Send the same HTML on to the real response.
writer.Write(html);
}

How can I download an XML file using C#?

Given this URL:
http://www.dreamincode.net/forums/xml.php?showuser=1253
How can I download the resulting XML file and have it loaded to memory so I can grab information from it using Linq?
Thanks for the help.
Why complicate things? This works:
var xml = XDocument.Load("http://www.dreamincode.net/forums/xml.php?showuser=1253");
Load string:
string xml = new WebClient().DownloadString(url);
Then load into XML:
XDocument doc = XDocument.Parse(xml);
For example:
[Test]
public void TestSample()
{
string url = "http://www.dreamincode.net/forums/xml.php?showuser=1253";
string xml;
using (var webClient = new WebClient())
{
xml = webClient.DownloadString(url);
}
XDocument doc = XDocument.Parse(xml);
// In the result, the profile with id 1253 has the name 'Nate'.
string name = doc.XPathSelectElement("/ipb/profile[id='1253']/name").Value;
Assert.That(name, Is.EqualTo("Nate"));
}
You can use the WebClient class:
WebClient client = new WebClient ();
Stream data = client.OpenRead ("http://example.com");
StreamReader reader = new StreamReader (data);
string s = reader.ReadToEnd ();
Console.WriteLine (s);
data.Close ();
reader.Close ();
Though using DownloadString is easier:
WebClient client = new WebClient ();
string s = client.DownloadString("http://example.com");
You can load the resulting string into an XmlDocument.
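Since the question asks about LINQ, here is a small sketch of querying the document once it is loaded. The element names (ipb, profile, id, name) are inferred from the XPath in the test above; the ProfileQuery class is made up for illustration, and the real feed may have a different shape.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class ProfileQuery
{
    // Returns the name of the profile with the given id, or null if
    // no such profile exists. Element names are assumptions.
    public static string FindName(XDocument doc, string id)
    {
        return doc.Descendants("profile")
                  .Where(p => (string)p.Element("id") == id)
                  .Select(p => (string)p.Element("name"))
                  .FirstOrDefault();
    }
}
```

For example, given XDocument.Parse("<ipb><profile><id>1253</id><name>Nate</name></profile></ipb>"), calling ProfileQuery.FindName(doc, "1253") returns "Nate". The same method works on a document obtained from XDocument.Load(url).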
