UTF-8 encoding issue - c#

I am trying to fetch data from an RSS feed (the feed is at http://www.bgsvetionik.com/rss/ ) in a C# WinForms application. Take a look at the following code:
public static XmlDocument FromUri(string uri)
{
XmlDocument xmlDoc;
WebClient webClient = new WebClient();
using (Stream rssStream = webClient.OpenRead(uri))
{
XmlTextReader reader = new XmlTextReader(rssStream);
xmlDoc = new XmlDocument();
xmlDoc.XmlResolver = null;
xmlDoc.Load(reader);
}
return xmlDoc;
}
Although xmlDoc.InnerXml contains an XML declaration with UTF-8 encoding, I get an escaped character reference (such as &#353;) instead of š, and so on.
How can I solve it?

The feed's data is incorrect. The character reference for š is inside a CDATA section, so the XML parser doesn't treat it as an entity.
If you look at the source XML, you'll find a mixture of entities and "raw" characters: for example, the word čišćenja in the middle of the first title is written partly with entities and partly with raw characters.
If you need to correct that, you'll have to do it yourself with a Replace call - the XML parser is doing exactly what it's meant to.
EDIT: For the replacement, you could get hold of all the HTML entities and replace them one by one, or just find out which ones are actually being used. Then do:
string text = element.Value.Replace("&#353;", "š")
.Replace(...);
Of course, this means that anything which is actually correctly escaped and should really be that text will get accidentally replaced... but such is the problem with broken data :(
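If the broken text uses standard HTML character references, one alternative to a chain of Replace calls is to decode them all in one pass. The following is a minimal sketch, assuming .NET 4.0 or later (for System.Net.WebUtility); the class name, method names, and the sample title are invented for illustration. The same caveat applies: text that was correctly escaped on purpose gets decoded too.

```csharp
using System;
using System.Net;
using System.Xml;

static class FeedTextFixer
{
    // Hypothetical helper: decodes HTML character references (such as &#353;)
    // that stayed undecoded because they sat inside a CDATA section.
    public static string DecodeEntities(string text)
    {
        return WebUtility.HtmlDecode(text);
    }

    // Hypothetical usage with an invented sample document.
    public static string Demo()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<item><title><![CDATA[&#269;i&#353;ćenja]]></title></item>");

        // The XML parser hands back the CDATA content verbatim, references included.
        string raw = doc.SelectSingleNode("/item/title").InnerText;

        return DecodeEntities(raw);
    }
}
```

In the question's code this could be applied to each title or description value after xmlDoc.Load(reader), instead of replacing entities one by one.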

Related

Read XML Value from a XML String with DOCTYPE [duplicate]

I want to extract a couple of links from an HTML page downloaded from the internet, and I think LINQ to XML would be a good solution for my case.
My problem is that I can't create an XmlDocument from the HTML. Using Load(string url) didn't work, so I downloaded the HTML to a string using:
public static string readHTML(string url)
{
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse res = (HttpWebResponse)req.GetResponse();
StreamReader sr = new StreamReader(res.GetResponseStream());
string html = sr.ReadToEnd();
sr.Close();
return html;
}
When I try to load that string using LoadXml(string xml) I get the exception
'--' is an unexpected token. The expected token is '>'
How should I read the HTML file so that it can be parsed as XML?
HTML simply isn't the same as XML (unless the HTML actually happens to be conforming XHTML or HTML5 in XML mode). The best way is to use an HTML parser to read the HTML. Afterwards you can transform it to LINQ to XML, or process it directly.
I haven't used it myself, but I suggest you take a look at SgmlReader. Here's a sample from their home page:
// setup SgmlReader
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader()
{
DocType = "HTML",
WhitespaceHandling = WhitespaceHandling.All,
CaseFolding = Sgml.CaseFolding.ToLower,
InputStream = reader
};
// create document
XmlDocument doc = new XmlDocument()
{
PreserveWhitespace = true,
XmlResolver = null
};
doc.Load(sgmlReader);
return doc;
If you want to extract some links from a page, as you mentioned, try using HTML Agility Pack.
This code gets a page from the web and extracts all links:
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load("http://www.stackoverflow.com");
HtmlNode[] links = document.DocumentNode.SelectNodes("//a").ToArray();
Open an HTML file from disk and get the URL for a specific link:
HtmlDocument document2 = new HtmlDocument();
document2.Load(@"C:\Temp\page.html");
HtmlNode link = document2.DocumentNode.SelectSingleNode("//a[@id='myLink']");
Console.WriteLine(link.Attributes["href"].Value);
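One thing to watch for with the snippets above: as far as I know, SelectNodes in HTML Agility Pack returns null rather than an empty collection when nothing matches, so calling ToArray() on the result can throw. Below is a hedged sketch of a null-safe link extractor. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class LinkExtractor
{
    // Hypothetical helper: returns the href value of every <a> element that has one.
    public static List<string> GetLinks(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var hrefs = new List<string>();

        // Only anchors that actually carry an href attribute.
        HtmlNodeCollection anchors = document.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return hrefs; // no matching nodes
        }

        foreach (HtmlNode anchor in anchors)
        {
            hrefs.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return hrefs;
    }
}
```

The string passed to GetLinks could be the result of the readHTML method from the question.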
HTML is not XML. HTML is based on SGML and is therefore not guaranteed to be well-formed XML (XML is itself a subset of SGML). You can only parse XHTML, i.e. XML-compatible HTML, as XML, and most websites are not written that way.
To work with HTML, you need to use an HTML parser.
If you know the nodes you're interested in I would use regex to extract the links from the string.

Insert double quotes around all html tag attribute

I am trying to convert HTML to XML, but the HTML tag attributes have no double quotes around their values,
so the conversion to XML gives me an error.
How can I add double quotes around all attribute values in my file?
I am using a VB.NET Windows Forms application.
Converting HTML to XML directly is unreliable: there are various corner cases where the conversion may fail.
The best way to convert HTML to XML would be to:
1> Extract the relevant data from the HTML using a parser such as HtmlAgilityPack
2> Store the extracted data as XML using an XML API such as XmlWriter or LINQ to XML.
I wonder what method you use to convert; you say nothing about that. Nevertheless, that method is obviously the core problem. And what do you plan to do once the HTML is converted into XML?
To tell the truth, no conversion is needed if the HTML is already well-formed XML (that is, XHTML). Simply load it into an XDocument, for example, and that's it. Nothing special to do.
Try this:
Install SgmlReader from NuGet.
If your HTML is in a string variable, as below, you will have to wrap it in a TextReader object.
Then use the installed package:
static XmlDocument HTMLTEST()
{
string html = "<table frame=all><tgroup></tgroup></table>";
TextReader reader = new StringReader(html);
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = System.Xml.WhitespaceHandling.All;
sgmlReader.InputStream = reader;
XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true; //false if you dont want whitespace
doc.XmlResolver = null;
doc.Load(sgmlReader);
return doc;
}
The input is an HTML string, and the return value is an XmlDocument.
The frame=all in your HTML will become frame="all".

Parsing web pages via C#, XmlDocument.LoadXml

I'm trying to download a web page and parse it. I need to reach every node of the HTML document. I used WebClient to download it, which works perfectly. Then I use the following code segment to parse the document:
WebClient client = new WebClient();
Stream data = client.OpenRead("http://web.cs.hacettepe.edu.tr/~bil339/");
StreamReader reader = new StreamReader(data);
string xml = reader.ReadToEnd();
data.Close();
reader.Close();
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xml);
On the last line, the program waits for some time, then crashes. It says there are errors in the HTML code: this wasn't expected, that shouldn't be here, etc.
Any suggestions to fix this? Other techniques to parse HTML code are welcome (In C#, of course.)
Use the HTML Agility Pack to parse HTML. HTML is generally not well-formed XML and can't be parsed as such. For instance, HTML permits unclosed tags such as <br> and unquoted attribute values, neither of which an XML parser accepts. The HTML Agility Pack is more forgiving.

finding the namespace from an xml stream in C#

I have an app that continuously receives an XML stream and uses it to process some information. So far I had only one namespace for all the streams, and I handled it easily as follows:
doc = new XPathDocument(ds + "/probe");
navigator = doc.CreateNavigator();
ns = new XmlNamespaceManager(navigator.NameTable);
ns.AddNamespace("m", "urn:namsp.org:namSpDev:1.1");
nodes = navigator.Select("//m:DataItem", ns);
while (nodes.MoveNext())
{
node = nodes.Current;
}
But now I have a problem: there is another stream that uses the namespace
"urn:namsp.org:namSpDev:1.2"
So my application has to inspect the stream to see which namespace it uses, and only then can it register the right one with
ns.AddNamespace("m", "urn:namsp.org:namSpDev:1.1");
How should I do this?
I tried calling doc.ToString() and using .Contains() to check for either namespace, but it doesn't work.
These links may be useful:
Detecting Xml namespace fast
Parsing XML with elements containing colon / namespace
How to Select XML Nodes with XML Namespaces from an XmlDocument?
What I finally did was retrieve the XML stream and convert it into a string. Then, using
string.Contains("xmlns")
I split the tag and used the tag identifier to get the value of the namespace. This works for me because the namespaces in the streams I use don't differ much.
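An alternative to searching the raw string for xmlns is to ask the parsed document which namespace its root element uses, and register that URI under the m prefix. The sketch below assumes the DataItem elements live in the same namespace as the root element; the class name, method names, and the Devices element used in the usage note are invented for illustration.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

static class NamespaceProbe
{
    // Hypothetical helper: returns the namespace URI of the document's root element.
    public static string GetRootNamespace(XPathNavigator navigator)
    {
        XPathNavigator root = navigator.SelectSingleNode("/*");
        return root == null ? string.Empty : root.NamespaceURI;
    }

    // Hypothetical usage: counts DataItem elements whichever namespace version is in use.
    public static int CountDataItems(string xml)
    {
        var doc = new XPathDocument(new StringReader(xml));
        XPathNavigator navigator = doc.CreateNavigator();

        var ns = new XmlNamespaceManager(navigator.NameTable);
        // Assumption: DataItem shares the root element's namespace.
        ns.AddNamespace("m", GetRootNamespace(navigator));

        return navigator.Select("//m:DataItem", ns).Count;
    }
}
```

With an invented document such as <Devices xmlns="urn:namsp.org:namSpDev:1.2"><DataItem/></Devices>, the same query works without hard-coding either the 1.1 or the 1.2 URI.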

How do I write an xml document to an asp.net response formatted nicely?

I have xml documents in a database field. The xml documents have no whitespace between the elements (no line feeds, no indenting).
I'd like to output them to the browser, formatted nicely. I would simply like linefeeds in there with some indenting. Is there an easy, preferably built-in way to do this?
I am using ASP.NET 3.5 and C#. This is what I have so far, which is outputting the document all in one line:
I'm about 99.9977% sure I am using the XmlWriter incorrectly. What I am accomplishing now can be done by writing directly to the response. But am I on the right track at least? :)
int id = Convert.ToInt32(Request.QueryString["id"]);
var auditLog = webController.DB.Manager.AuditLog.GetByKey(id);
var xmlWriterSettings = new XmlWriterSettings();
xmlWriterSettings.Indent = true;
xmlWriterSettings.OmitXmlDeclaration = true;
var xmlWriter = XmlWriter.Create(Response.OutputStream, xmlWriterSettings);
if (xmlWriter != null)
{
Response.Write("<pre>");
// ObjectChanges is a string property that contains an XML document
xmlWriter.WriteRaw(Server.HtmlEncode(auditLog.ObjectChanges));
xmlWriter.Flush();
Response.Write("</pre>");
}
This is the working code, based on dtb's answer:
int id = Convert.ToInt32(Request.QueryString["id"]);
var auditLog = webController.DB.Manager.AuditLog.GetByKey(id);
var xml = XDocument.Parse(auditLog.ObjectChanges, LoadOptions.None);
Response.Write("<pre>" + Server.HtmlEncode(xml.ToString(SaveOptions.None)) + "</pre>");
Thank you for helping me!
WriteRaw just writes the input unchanged to the underlying stream.
If you want to use built-in formatting, you first need to parse the XML and then convert it back to a string.
The simplest solution is possibly to use XLinq:
var xml = XDocument.Parse(auditLog.ObjectChanges);
Response.Write(Server.HtmlEncode(xml.ToString(SaveOptions.None)));
(This assumes auditLog.ObjectChanges is a string that represents well-formed XML.)
If you need more control over the formatting (indentation, line breaks), save the XDocument to a MemoryStream-backed XmlWriter, decode the MemoryStream back to a string, and write the string HTML-encoded.
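The approach with more control over formatting can be sketched roughly as follows. Note that this version swaps the MemoryStream for a StringBuilder-backed XmlWriter, which avoids the decoding step; the class name, method name, and the indentChars parameter are hypothetical.

```csharp
using System;
using System.Text;
using System.Xml;
using System.Xml.Linq;

static class XmlFormatter
{
    // Hypothetical helper: re-serializes well-formed XML with the given indentation.
    public static string Format(string rawXml, string indentChars)
    {
        XDocument doc = XDocument.Parse(rawXml);

        var settings = new XmlWriterSettings
        {
            Indent = true,
            IndentChars = indentChars,
            OmitXmlDeclaration = true
        };

        var builder = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(builder, settings))
        {
            doc.Save(writer);
        } // disposing the writer flushes everything into the builder

        return builder.ToString();
    }
}
```

In the page from the question this might be used as Response.Write("<pre>" + Server.HtmlEncode(XmlFormatter.Format(auditLog.ObjectChanges, "  ")) + "</pre>");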
If auditLog.ObjectChanges is the XML content that needs to be formatted, then you've stored it in an unformatted way. To format it, treat it as XML and write it through an XmlWriter. Then include the formatted XML in the response, HTML-encoded.
