"An error occurred while parsing EntityName" while Loading an XmlDocument - c#

I have written some code to parse RSS feeds for an ASP.NET C# application, and it works fine for all the RSS feeds I have tried, until I tried Facebook.
My code fails at the last line below...
WebRequest request = WebRequest.Create(url);
WebResponse response = request.GetResponse();
Stream rss = response.GetResponseStream();
XmlDocument xml = new XmlDocument();
xml.Load(rss);
...with the error "An error occurred while parsing EntityName. Line 12, position 53."
It is hard to work out what is at that position of the XML file, as the entire file is on one line, but it comes straight from Facebook and all characters appear to be encoded properly, except possibly one character (♥).
I don't particularly want to rewrite my RSS parser to use a different method. Any suggestions for how to bypass this error? Is there a way of turning off checking of the file?

Look at the downloaded stream. It doesn't contain the RSS feed, but an HTML page with a message about an incompatible browser. That's because when you download the URL like this, the user agent header is not set. If you set it, your code should work:
var request = (HttpWebRequest)WebRequest.Create(url);
request.UserAgent = "MyApplication";
var xml = new XmlDocument();
using (var response = request.GetResponse())
using (var rss = response.GetResponseStream())
{
xml.Load(rss);
}


Strange characters as a result of HttpWebResponse [duplicate]

This question already has an answer here:
Garbled httpWebResponse string when posting data to web form programmatically
(1 answer)
Closed 4 years ago.
I'm trying to create site parser for telegram bot. The exact code is:
var link = "https://www.detmir.ru/";
var request = HttpWebRequest.Create(link);
var resp = (HttpWebResponse)request.GetResponse();
string result;
using (var stream = resp.GetResponseStream())
{
using (var reader = new StreamReader(stream, Encoding.GetEncoding(resp.CharacterSet)))
result = reader.ReadToEnd();
}
File.WriteAllText(@"d:\1.txt", result);
Result is a set of strange symbols:
As far as I can tell, the main clue is in the encoding. I've tried Encoding.Default and Encoding.UTF8 with the same result.
But with other sites it works perfectly. Is there any trick to solve the issue with this particular website?
Update
In Google Chrome the source code of webpage shows correctly:
Google Chrome webpage source code
The content of the response is UTF-8, as the site reports, but it is compressed to improve throughput.
Enable automatic decompression:
var request = (HttpWebRequest)HttpWebRequest.Create(link);
request.AutomaticDecompression = DecompressionMethods.GZip;
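Putting the fix together with the rest of the question's code, a minimal sketch of the corrected download (the URL and output path are the ones from the question; accepting Deflate as well as GZip is an extra precaution, not something the answer requires):

```csharp
using System.IO;
using System.Net;
using System.Text;

class DecompressDemo
{
    static void Main()
    {
        var link = "https://www.detmir.ru/";
        var request = (HttpWebRequest)WebRequest.Create(link);

        // Let the runtime transparently decompress the response body.
        request.AutomaticDecompression =
            DecompressionMethods.GZip | DecompressionMethods.Deflate;

        using (var resp = (HttpWebResponse)request.GetResponse())
        using (var stream = resp.GetResponseStream())
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            string result = reader.ReadToEnd();
            File.WriteAllText(@"d:\1.txt", result);
        }
    }
}
```

With decompression enabled, the `CharacterSet`-based encoding from the original code also becomes usable again, since the reader now sees plain UTF-8 text instead of gzip bytes.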

Getting link from pastebin and downloading from link

I'm trying to get a link from a pastebin. Where the link is the only text in the raw paste. Then I want to download a file from the link in pastebin.
WebRequest request = WebRequest.Create("http://pastebin.com/raw/Dtdf2qMp");
WebResponse response = request.GetResponse();
System.IO.StreamReader reader = new System.IO.StreamReader(response.GetResponseStream());
Console.WriteLine(reader.ReadToEnd());
WebClient client = new WebClient();
client.DownloadFile(/* link gotten from pastebin here */, "c:\\File");
System.Threading.Thread.Sleep(5000);
Instead of dumping the text read to console output, you should assign it to a variable.
var pastebinOutput = reader.ReadToEnd();
Then just pass that as the link for the DownloadFile method. If you want to verify that what you got from the original pastebin is actually a URL, you can look into System.Uri's TryCreate method.
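A minimal sketch of that TryCreate check, assuming `pastebinOutput` holds the string read from the raw paste (it may carry trailing newline characters, hence the Trim):

```csharp
using System;

class UriCheckDemo
{
    static void Main()
    {
        // Hypothetical paste content for illustration.
        string pastebinOutput = "http://example.com/file.txt\r\n";

        Uri link;
        if (Uri.TryCreate(pastebinOutput.Trim(), UriKind.Absolute, out link)
            && (link.Scheme == Uri.UriSchemeHttp || link.Scheme == Uri.UriSchemeHttps))
        {
            Console.WriteLine("Valid download link: " + link);
            // Safe to hand off: client.DownloadFile(link, "c:\\File");
        }
        else
        {
            Console.WriteLine("The paste did not contain a usable URL.");
        }
    }
}
```

The scheme check matters because TryCreate will happily accept things like `file:///` URIs, which you probably don't want to feed to DownloadFile.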
I've got a solution. Assuming you have your link in the raw pastebin paste (mine is a .txt file saying 'it worked'), I suggest you copy and paste the code below exactly; if you get a file saying 'it worked', you can then change the pastebin link and file names. If you don't want to open the file, remove Process.Start; if you want to change the delay, just change the number (it's in milliseconds). You can also change the format from .txt to .exe or whatever your file is (or remove it so it's the default name in the download link):
WebRequest request = WebRequest.Create("https://pastebin.com/raw/QAWufg1z");
WebResponse response = request.GetResponse();
System.IO.StreamReader reader = new System.IO.StreamReader(response.GetResponseStream());
var pastebinOutput = reader.ReadToEnd();
WebClient client = new WebClient();
client.DownloadFile(pastebinOutput, @".\downloaded.txt");
MessageBox.Show("File should open automatically in the next minute. Please wait...");
await Task.Delay(3000); // 3000 = 3 seconds; note the containing method must be async
Process.Start(@".\downloaded.txt");

Reading information from a website c#

In the project I have in mind I want to be able to look at a website, retrieve text from that website, and do something with that information later.
My question is what is the best way to retrieve the data(text) from the website. I am unsure about how to do this when dealing with a static page vs dealing with a dynamic page.
From some searching I found this:
WebRequest request = WebRequest.Create("http://anysite.com");
// If required by the server, set the credentials.
request.Credentials = CredentialCache.DefaultCredentials;
// Get the response.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// Display the status.
Console.WriteLine(response.StatusDescription);
Console.WriteLine();
// Get the stream containing content returned by the server.
using (Stream dataStream = response.GetResponseStream())
{
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream, Encoding.UTF8);
// Read the content.
string responseString = reader.ReadToEnd();
// Display the content.
Console.WriteLine(responseString);
reader.Close();
}
response.Close();
So from running this on my own, I can see it returns the HTML code from a website, which is not exactly what I'm looking for. I eventually want to be able to type in a site (such as a news article) and get back the contents of the article. Is this possible in C# or Java?
Thanks
I hate to break this to you, but that's how webpages look: a long stream of HTML markup/content. This gets rendered by the browser into what you see on your screen. The only way I can think of is to parse the HTML yourself.
After a quick search on Google I found this Stack Overflow question:
What is the best way to parse html in C#?
I'm betting you figured this would be a bit easier than you expected, but that's the fun in programming: always challenging problems.
You can just use a WebClient:
using(var webClient = new WebClient())
{
string htmlFromPage = webClient.DownloadString("http://myurl.com");
}
In the above example htmlFromPage will contain the HTML which you can then parse to find the data you're looking for.
What you are describing is called web scraping, and there are plenty of libraries that do just that for both Java and C#. It doesn't really matter if the target site is static or dynamic since both output HTML in the end. JavaScript or Flash heavy sites on the other hand tend to be problematic.
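As a concrete sketch of the scraping approach, here is what extracting text could look like with the third-party HtmlAgilityPack library (installable via NuGet; the URL and the choice of pulling `<p>` elements are illustrative assumptions, not part of the question):

```csharp
using System;
using System.Net;
using HtmlAgilityPack; // third-party NuGet package

class ScrapeDemo
{
    static void Main()
    {
        string html;
        using (var webClient = new WebClient())
        {
            html = webClient.DownloadString("http://example.com/article");
        }

        // HtmlAgilityPack tolerates the malformed markup common on real sites.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath query for every <p> element in the page.
        var paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs != null)
        {
            foreach (var p in paragraphs)
                Console.WriteLine(p.InnerText.Trim());
        }
    }
}
```

For a news article you would typically narrow the XPath to the site's article container (e.g. a specific `div` class), which is exactly the per-site work that makes scraping dynamic layouts fiddly.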
Please try this,
System.Net.WebClient wc = new System.Net.WebClient();
string webData = wc.DownloadString("http://anysite.com");

StreamReader issues

I was trying to break my newly made servlet earlier and ended up breaking my own application; I kind of wish I hadn't bothered now!
response = (HttpWebResponse)request.GetResponse();
reader = new StreamReader(response.GetResponseStream());
String streamedXML = reader.ReadToEnd();
XmlDocument doc = new XmlDocument();
doc.LoadXml(streamedXML);
If I open up 10 windows or so, then rapidly request data from my servlets (this is the same 10 windows, returning the same data), I get an XML exception being thrown:
Unexpected end of file has occurred. The following elements are not closed:
The thing is, if I run this one at a time, or with a large gap between requests, it completes fine. Is this because my StreamReader is being overwhelmed by requests and starting new ones before others have finished? If so, is there a better way of writing this data?
Thanks.
You could try to fix this code or leave it to the experts and use a WebClient:
using (var client = new WebClient())
{
string streamedXML = client.DownloadString(sourceUrl);
...
}
And personally I would use XDocument instead of XmlDocument, but that depends.
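A minimal sketch of that XDocument variant, assuming the response is a standard RSS 2.0 document (the `channel`/`item`/`title` layout) and using a placeholder URL:

```csharp
using System;
using System.Linq;
using System.Net;
using System.Xml.Linq;

class XDocDemo
{
    static void Main()
    {
        string streamedXML;
        using (var client = new WebClient())
        {
            // Placeholder URL standing in for the servlet endpoint.
            streamedXML = client.DownloadString("http://example.com/feed.rss");
        }

        var doc = XDocument.Parse(streamedXML);

        // LINQ-style query: the title of every <item> in the feed.
        var titles = doc.Descendants("item")
                        .Select(item => (string)item.Element("title"));

        foreach (var title in titles)
            Console.WriteLine(title);
    }
}
```

The practical advantage over XmlDocument is the LINQ query surface; for namespaced feeds you would qualify the element names with an XNamespace.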
The StreamReader isn't overwhelmed. (It could only block or raise IO exceptions / out-of-memory exceptions.)
However, it would seem that the server it is talking to is overwhelmed.
Find out with Fiddler or in the server logs.
You could start by disposing of everything correctly and seeing if that helps:
using(response = (HttpWebResponse)request.GetResponse())
using(reader = new StreamReader(response.GetResponseStream()))
{
String streamedXML = reader.ReadToEnd();
XmlDocument doc = new XmlDocument();
doc.LoadXml(streamedXML);
}

How can I import a raw RSS feed in C#?

Does anyone know an easy way to import a raw XML RSS feed into C#? I am looking for an easy way to get the XML as a string so I can parse it with a regex.
Thanks,
-Greg
This should be enough to get you going...
using System.Net;

WebClient wc = new WebClient();
Stream st = wc.OpenRead("http://example.com/feed.rss");
using (StreamReader sr = new StreamReader(st)) {
string rss = sr.ReadToEnd();
}
If you're on .NET 3.5, you have built-in support for syndication feeds (RSS and Atom). Check out this MSDN Magazine article for a good introduction.
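A short sketch of that built-in support, using SyndicationFeed from the System.ServiceModel.Syndication namespace (you need an assembly reference: System.ServiceModel.Web on .NET 3.5, System.ServiceModel on later frameworks; the feed URL is a placeholder):

```csharp
using System;
using System.ServiceModel.Syndication;
using System.Xml;

class FeedDemo
{
    static void Main()
    {
        // SyndicationFeed.Load handles both RSS 2.0 and Atom 1.0.
        using (var reader = XmlReader.Create("http://example.com/feed.rss"))
        {
            SyndicationFeed feed = SyndicationFeed.Load(reader);
            foreach (SyndicationItem item in feed.Items)
            {
                Console.WriteLine(item.Title.Text);
                if (item.Links.Count > 0)
                    Console.WriteLine(item.Links[0].Uri);
            }
        }
    }
}
```

This sidesteps both the string handling and the regex entirely, since the feed arrives already parsed into typed items.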
If you really want to parse the string using a regex (and parsing XML is not what regexes were intended for), the easiest way to get the content is to use the WebClient class. It has a DownloadString method which is straightforward to use: just give it the URL of your feed. Check this link for an example of how to use it.
I would load the feed into an XmlDocument and use XPath instead of a regex, like so:
XmlDocument doc = new XmlDocument();
HttpWebRequest request = WebRequest.Create(feedUrl) as HttpWebRequest;
using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
{
StreamReader reader = new StreamReader(response.GetResponseStream());
doc.Load(reader);
// parse with XPath here
}
What are you trying to accomplish?
I found the System.ServiceModel.Syndication classes very helpful when working with feeds.
You might want to have a look at this: http://www.codeproject.com/KB/cs/rssframework.aspx
XmlDocument (located in System.Xml; you will need to add a reference to the DLL if it isn't added for you) is what you would use for getting the XML into C#. At that point, just call the InnerXml property, which gives the inner XML in string format, then parse it with the regex.
The best way to grab an RSS feed as the requested string would be to use the System.Net.HttpWebRequest class. Once you've set up the HttpWebRequest's parameters (URL, etc.), call the HttpWebRequest.GetResponse() method. From there, you can get a Stream with WebResponse.GetResponseStream(). Then you can wrap that stream in a System.IO.StreamReader and call StreamReader.ReadToEnd(). Voila.
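The steps above can be sketched as follows (the feed URL is a placeholder):

```csharp
using System;
using System.IO;
using System.Net;

class RawFeedDemo
{
    static void Main()
    {
        // 1. Set up the request with its URL.
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/feed.rss");

        // 2. GetResponse, 3. GetResponseStream, 4. wrap in a StreamReader.
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            // 5. ReadToEnd gives the whole feed as one string.
            string rss = reader.ReadToEnd();
            Console.WriteLine(rss);
        }
    }
}
```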
The RSS is just XML and can be streamed to disk easily. Go with Darrel's example; it's all you'll need.
