Bad characters in output from RSS feed - C#

I am using the following method to connect to an RSS feed:
var url = "http://blogs.mysite.com/feed/";
var sourceXmlFeed = "";
using (var wc = new WebClient())
{
wc.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
sourceXmlFeed = wc.DownloadString(url);
}
var xrs = new XmlReaderSettings();
xrs.CheckCharacters = false;
var xtr = new XmlTextReader(new System.IO.StringReader(sourceXmlFeed));
var xmlReader = XmlReader.Create(xtr, xrs);
SyndicationFeed feed = SyndicationFeed.Load(xmlReader);
However, I am getting bad characters (as below) in the output, and I assume it has something to do with the encoding.
eg. for 2015 we are â?ogoing for goldâ?
Anyone know how to fix this?
By the way, I am doing things this way because I have been unable to use the more direct approach below without getting "The remote server returned an error: (443)".
var xmlReader = XmlReader.Create("http://blogs.mysite.com/feed/");
SyndicationFeed feed = SyndicationFeed.Load(xmlReader);

The encoding problem is caused by reading the XML as a string, because encoding detection for XML (from the XML declaration) differs from encoding detection for strings.
WebClient webClient = null;
XmlReader xmlReader = null;
try
{
webClient = new WebClient();
webClient.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
xmlReader = XmlReader.Create(webClient.OpenRead(url));
// Read the XML here, because in the finally block the response stream and the reader will be closed.
}
finally
{
if (webClient != null)
{ webClient.Dispose(); }
if (xmlReader != null)
{ xmlReader.Dispose(); }
}
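A minimal sketch of the stream-based approach combined with SyndicationFeed (the feed URL is a placeholder; the user-agent is copied from the question):

```csharp
using System;
using System.Net;
using System.ServiceModel.Syndication;
using System.Xml;

class FeedLoader
{
    static void Main()
    {
        // Placeholder URL; substitute the real feed address.
        var url = "http://blogs.mysite.com/feed/";
        using (var webClient = new WebClient())
        {
            webClient.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
            // Passing the raw stream lets XmlReader pick up the encoding
            // from the XML declaration, rather than decoding to a string first.
            using (var stream = webClient.OpenRead(url))
            using (var xmlReader = XmlReader.Create(stream, new XmlReaderSettings { CheckCharacters = false }))
            {
                SyndicationFeed feed = SyndicationFeed.Load(xmlReader);
                foreach (var item in feed.Items)
                    Console.WriteLine(item.Title.Text);
            }
        }
    }
}
```

This needs a reference to the System.ServiceModel assembly for SyndicationFeed.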

StreamReader into DataTable

I want to read Atom XML, and I used the following code:
string str1 = "http://moss:133333/_vti_bin/ExcelRest.aspx/Document Library/OrdersExcel.xlsx/Model/Tables('Table1')?$format=atom";
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(str1);
req.UseDefaultCredentials = true;
req.PreAuthenticate = true;
req.Credentials = CredentialCache.DefaultCredentials;
req.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)";
WebResponse response = req.GetResponse();
Encoding enc = System.Text.Encoding.GetEncoding(1252);
StreamReader loResponseStream = new StreamReader(response.GetResponseStream(), enc);
string Response = loResponseStream.ReadToEnd();
The last line in the above code basically reads the whole stream into the string Response.
Now I do not know how to read that string's Atom XML into a DataTable.
DataTable.ReadXml reads an XML schema and data into the DataTable using the specified TextReader. Since the response has already been read into the string Response, a StringReader can supply it:
var reader = new System.IO.StringReader(Response);
var newTable = new DataTable();
newTable.ReadXml(reader);
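Note that DataTable.ReadXml expects the XML to carry an inline schema (or the table to already have one); raw Atom markup will not map onto columns by itself. A self-contained sketch of the round trip, using a hypothetical Orders table:

```csharp
using System;
using System.Data;
using System.IO;

class ReadXmlDemo
{
    static void Main()
    {
        // Hypothetical table, serialized with an inline schema.
        var table = new DataTable("Orders");
        table.Columns.Add("Id", typeof(int));
        table.Rows.Add(1);

        var sw = new StringWriter();
        table.WriteXml(sw, XmlWriteMode.WriteSchema);

        // ReadXml picks up both the schema and the row data.
        var newTable = new DataTable();
        newTable.ReadXml(new StringReader(sw.ToString()));
        Console.WriteLine(newTable.Rows.Count); // prints 1
    }
}
```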

httpwebrequest fails to load rss feed

I am attempting to load a page I've received from an RSS feed and I receive the following WebException:
Cannot handle redirect from HTTP/HTTPS protocols to other dissimilar ones.
with an inner exception:
Invalid URI: The hostname could not be parsed.
Here's the code I'm using:
System.Net.HttpWebRequest req = (System.Net.HttpWebRequest)System.Net.HttpWebRequest.Create(url);
string source = String.Empty;
Uri responseURI;
try
{
req.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:31.0) Gecko/20100101 Firefox/31.0";
req.Headers.Add("Accept-Language", "en-us,en;q=0.5");
req.AllowAutoRedirect = true;
using (System.Net.WebResponse webResponse = req.GetResponse())
{
using (HttpWebResponse httpWebResponse = webResponse as HttpWebResponse)
{
responseURI = httpWebResponse.ResponseUri;
StreamReader reader;
if (httpWebResponse.ContentEncoding.ToLower().Contains("gzip"))
{
reader = new StreamReader(new GZipStream(httpWebResponse.GetResponseStream(), CompressionMode.Decompress));
}
else if (httpWebResponse.ContentEncoding.ToLower().Contains("deflate"))
{
reader = new StreamReader(new DeflateStream(httpWebResponse.GetResponseStream(), CompressionMode.Decompress));
}
else
{
reader = new StreamReader(httpWebResponse.GetResponseStream());
}
source = reader.ReadToEnd();
reader.Close();
}
}
}
catch (WebException we)
{
Console.WriteLine(url + "\n--\n" + we.Message);
return null;
}
I'm not sure if I'm doing something wrong or if there's something extra I need to be doing. Any help would be greatly appreciated! Let me know if there's more information you need.
UPDATE
So after following Jim Mischel's suggestions I've narrowed it down to a UriFormatException that claims Invalid URI: The hostname could not be parsed.
Here's the URL that's in the last "Location" Header: http:////www-nc.nytimes.com/
I guess I can see why it fails, but I'm not sure why it causes trouble here when the original URL works just fine in my browser. Is there something I'm missing, or not doing, that would let me handle this strange URL?
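One way to work around a malformed Location header is to turn off automatic redirects and clean the URL yourself. A sketch, assuming the extra slashes are the only defect (add a depth guard before using this in earnest):

```csharp
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class RedirectFixer
{
    static string Fetch(string url)
    {
        var req = (HttpWebRequest)WebRequest.Create(url);
        req.AllowAutoRedirect = false; // handle redirects manually

        using (var resp = (HttpWebResponse)req.GetResponse())
        {
            if ((int)resp.StatusCode >= 300 && (int)resp.StatusCode < 400)
            {
                var location = resp.Headers["Location"];
                // Collapse a malformed "http:////host" into "http://host".
                location = Regex.Replace(location, @"^(https?:)/+", "$1//");
                return Fetch(location); // note: no loop guard in this sketch
            }
            using (var reader = new StreamReader(resp.GetResponseStream()))
                return reader.ReadToEnd();
        }
    }
}
```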

View a PDF from SSRS / RDL

I want to access an RDL to obtain the report as PDF. I have
static void Main()
{
string pdfOutputFIleName = @"C:\Development\testoutputAsPdf.pdf";
var urlAsPdf = @"http://serverName/ReportServer/Reserved.ReportViewerWebControl.axd?ExecutionID=xxx&Culture=1033&CultureOverrides=False&UICulture=9&UICultureOverrides=False&ReportStack=1&ControlID=yyy&OpType=Export&FileName=Bug+Status&ContentDisposition=OnlyHtmlInline&Format=PDF";
var client = new WebClient();
client.UseDefaultCredentials = true;
//client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
client.Headers.Add("Content-Type", "application/pdf");
Process(client, urlAsPdf, pdfOutputFIleName);
Console.Read();
}
private static void Process(WebClient client, string url, string outputFileName)
{
Stream data = client.OpenRead(url);
using (var reader = new StreamReader(data))
{
string output = reader.ReadToEnd();
using (Stream s = File.Create(outputFileName))
{
var writer = new StreamWriter(s);
writer.Write(output);
Console.WriteLine(output);
}
}
}
The URL works fine in my browser. The program runs. When I try to open the PDF I receive an error in Adobe:
There was an error opening this document. The file is damaged and
could not be repaired.
Since you already specify the output file name and the URL to read from, perhaps you can use the WebClient.DownloadFile(string address, string fileName) member:
private static void Process(WebClient client, string url, string outputFileName)
{
client.DownloadFile(url, outputFileName);
}
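The original Process corrupts the file because a PDF is binary data, and round-tripping it through StreamReader/StreamWriter decodes and re-encodes it as text. If you prefer the stream-based version, copying raw bytes avoids that (a sketch; Stream.CopyTo requires .NET 4 or later):

```csharp
private static void Process(WebClient client, string url, string outputFileName)
{
    // Copy raw bytes; never decode binary content as text.
    using (Stream data = client.OpenRead(url))
    using (Stream s = File.Create(outputFileName))
    {
        data.CopyTo(s);
    }
}
```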

C# Downloading website into string using C# WebClient or HttpWebRequest

I am trying to download the contents of a website. However for a certain webpage the string returned contains jumbled data, containing many � characters.
Here is the code I was originally using.
HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url);
req.Method = "GET";
req.UserAgent = "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))";
string source;
using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream()))
{
source = reader.ReadToEnd();
}
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(source);
I also tried alternate implementations with WebClient, but still the same result:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
using (WebClient client = new WebClient())
using (var read = client.OpenRead(url))
{
doc.Load(read, true);
}
From searching I guess this might be an issue with Encoding, so I tried both the solutions posted below but still cannot get this to work.
http://blogs.msdn.com/b/feroze_daud/archive/2004/03/30/104440.aspx
http://bytes.com/topic/c-sharp/answers/653250-webclient-encoding
The offending site that I cannot seem to download is the United_States article on the English version of Wikipedia (en.wikipedia.org/wiki/United_States).
Although I have tried a number of other wikipedia articles and have not seen this issue.
Using the built-in loader in HtmlAgilityPack worked for me:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://en.wikipedia.org/wiki/United_States");
string html = doc.DocumentNode.OuterHtml; // I don't see no jumbled data here
Edit:
Using a standard WebClient with your user-agent will result in an HTTP 403 (Forbidden); using this instead worked for me:
using (WebClient wc = new WebClient())
{
wc.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
string html = wc.DownloadString("http://en.wikipedia.org/wiki/United_States");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
}
Also see this SO thread: WebClient forbids opening wikipedia page?
The response is gzip encoded.
Try the following to decode the stream:
UPDATE
Based on the comment by BrokenGlass setting the following properties should solve your problem (worked for me):
req.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
Old/Manual solution:
string source;
var response = req.GetResponse();
var stream = response.GetResponseStream();
try
{
if (response.Headers.AllKeys.Contains("Content-Encoding")
&& response.Headers["Content-Encoding"].Contains("gzip"))
{
stream = new System.IO.Compression.GZipStream(stream, System.IO.Compression.CompressionMode.Decompress);
}
using (StreamReader reader = new StreamReader(stream))
{
source = reader.ReadToEnd();
}
}
finally
{
if (stream != null)
stream.Dispose();
}
This is how I usually grab a page into a string (it's VB, but should translate easily):
Dim req As Net.WebRequest = Net.WebRequest.Create("http://www.cnn.com")
Dim resp As Net.HttpWebResponse = req.GetResponse()
Dim sr As New IO.StreamReader(resp.GetResponseStream())
Dim lcResults As String = sr.ReadToEnd()
and I haven't had the problems you are.

Dotnet WebClient times out but browser works fine for JSON webservice

I am trying to get the result of the following json webservice https://mtgox.com/code/data/getDepth.php into a string using the following code.
using (WebClient client = new WebClient())
{
string data = client.DownloadString("https://mtgox.com/code/data/getDepth.php");
}
but it always returns a timeout exception and no data. I plan to use fastjson to turn the response into objects, and I expected that to be the hard part, not retrieving the content of the page.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://mtgox.com/code/data/getDepth.php");
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
using (StreamReader sr = new StreamReader(response.GetResponseStream()))
{
string data = sr.ReadToEnd();
}
}
This also resulted in the same error. Can anyone point out what I am doing wrong?
Hmm, strange, this works great for me:
class Program
{
static void Main()
{
using (var client = new WebClient())
{
client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0) Gecko/20100101 Firefox/4.0";
var result = client.DownloadString("https://mtgox.com/code/data/getDepth.php");
Console.WriteLine(result);
}
}
}
Notice that I am specifying a User Agent HTTP header as it seems that the site is expecting it.
I had a similar issue before; request.KeepAlive = false solved my problem. Try this:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://mtgox.com/code/data/getDepth.php");
request.KeepAlive = false;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
using (StreamReader sr = new StreamReader(response.GetResponseStream()))
{
string data = sr.ReadToEnd();
}
}
