HtmlAgilityPack: how to create indented HTML? - c#

So, I am generating html using HtmlAgilityPack and it's working perfectly, but html text is not indented. I can get indented XML however, but I need HTML. Is there a way?
HtmlDocument doc = new HtmlDocument();
// gen html
HtmlNode table = doc.CreateElement("table");
table.Attributes.Add("class", "tableClass");
HtmlNode tr = doc.CreateElement("tr");
table.ChildNodes.Append(tr);
HtmlNode td = doc.CreateElement("td");
td.InnerHtml = "—";
tr.ChildNodes.Append(td);
// write text, no indent :(
using(StreamWriter sw = new StreamWriter("table.html"))
{
table.WriteTo(sw);
}
// write xml, nicely indented but it's XML!
XmlWriterSettings settings = new XmlWriterSettings();
settings.OmitXmlDeclaration = true;
settings.Indent = true;
settings.ConformanceLevel = ConformanceLevel.Fragment;
using (XmlWriter xw = XmlTextWriter.Create("table.xml", settings))
{
table.WriteTo(xw);
}

Fast, Reliable, Pure C#, .NET Core compatible AngleSharp
You can parse it with AngleSharp, which provides a way to auto-indent:
var parser = new HtmlParser();
var document = parser.ParseDocument(text);
using (var writer = new StringWriter())
{
document.ToHtml(writer, new PrettyMarkupFormatter
{
Indentation = "\t",
NewLine = "\n"
});
var indentedText = writer.ToString();
}

No, and it's a "by design" choice. There is a big difference between XML (or XHTML, which is XML, not HTML), where whitespace most of the time has no specific meaning, and HTML.
This is not such a minor improvement, as changing whitespace can change the way some browsers render a given HTML chunk, especially malformed HTML (which is in general well handled by the library). And the Html Agility Pack was designed to preserve the way the HTML is rendered, not to minimize the way the markup is written.
I'm not saying it's not feasible or plain impossible. Obviously you can convert to XML and voilà (and you could write an extension method to make this easier), but the rendered output may be different in the general case.

As far as I know, HtmlAgilityPack cannot do this. But you could look through html tidy packs which are proposed in similar questions:
Html Agility Pack: make code look neat
Which is the best HTML tidy pack? Is there any option in HTML agility pack to make HTML webpage tidy?

I had the same experience: even though HtmlAgilityPack is great for reading and modifying HTML (or in my case ASP) files, you cannot create readable output with it.
However, I ended up writing a few lines of code which work for me:
Having a HtmlDocument named "m_htmlDocument" I create my HTML file as follows:
// the using block flushes and closes the file once all nodes are written
using (var file = new System.IO.StreamWriter(_sFullPath))
{
if (m_htmlDocument.DocumentNode != null)
foreach (var node in m_htmlDocument.DocumentNode.ChildNodes)
WriteNode(file, node, 0);
}
and
void WriteNode(System.IO.StreamWriter _file, HtmlNode _node, int _indentLevel)
{
// check parameter
if (_file == null) return;
if (_node == null) return;
// init
string INDENT = " ";
string NEW_LINE = System.Environment.NewLine;
// case: no children
if(_node.HasChildNodes == false)
{
for (int i = 0; i < _indentLevel; i++)
_file.Write(INDENT);
_file.Write(_node.OuterHtml);
_file.Write(NEW_LINE);
}
// case: node has children
else
{
// indent
for (int i = 0; i < _indentLevel; i++)
_file.Write(INDENT);
// open tag
_file.Write(string.Format("<{0} ",_node.Name));
if(_node.HasAttributes)
foreach(var attr in _node.Attributes)
_file.Write(string.Format("{0}=\"{1}\" ", attr.Name, attr.Value));
_file.Write(string.Format(">{0}",NEW_LINE));
// children
foreach(var chldNode in _node.ChildNodes)
WriteNode(_file, chldNode, _indentLevel + 1);
// close tag
for (int i = 0; i < _indentLevel; i++)
_file.Write(INDENT);
_file.Write(string.Format("</{0}>{1}", _node.Name,NEW_LINE));
}
}

Related

Weird character encoded characters (’) appearing from a feed

I've got a question regarding an XML feed and XSL transformation I'm doing. In a few parts of the outputted feed on an HTML page, I get weird characters (such as ’) appearing on the page.
On another site (that I don't own) that's using the same feed, it isn't getting these characters.
Here's the code I'm using to grab and return the transformed content:
string xmlUrl = "http://feedurl.com/feed.xml";
string xmlData = new System.Net.WebClient().DownloadString(xmlUrl);
string xslUrl = "http://feedurl.com/transform.xsl";
XsltArgumentList xslArgs = new XsltArgumentList();
xslArgs.AddParam("type", "", "specifictype");
string resultText = Utils.XslTransform(xmlData, xslUrl, xslArgs);
return resultText;
And my Utils.XslTransform function looks like this:
static public string XslTransform(string data, string xslurl, XsltArgumentList xslArgs)
{
TextReader textReader = new StringReader(data);
XmlReaderSettings settings = new XmlReaderSettings();
settings.DtdProcessing = DtdProcessing.Ignore;
XmlReader xmlReader = XmlReader.Create(textReader, settings);
XmlReader xslReader = new XmlTextReader(Uri.UnescapeDataString(xslurl));
XslCompiledTransform myXslT = new XslCompiledTransform();
myXslT.Load(xslReader);
StringBuilder sb = new StringBuilder();
using (TextWriter tw = new StringWriter(sb))
{
// pass the caller's arguments through instead of an empty list
myXslT.Transform(xmlReader, xslArgs, tw);
}
string transformedData = sb.ToString();
return transformedData;
}
I'm not extremely knowledgeable with character encoding issues and I've been trying to nip this in the bud for a bit of time and could use any suggestions possible. I'm not sure if there's something I need to change with how the WebClient downloads the file or something going weird in the XslTransform.
Thanks!
Give HtmlEncode a try. So in this case you would reference System.Web and then make this change (just call the HtmlEncode function on the last line):
string xmlUrl = "http://feedurl.com/feed.xml";
string xmlData = new System.Net.WebClient().DownloadString(xmlUrl);
string xslUrl = "http://feedurl.com/transform.xsl";
XsltArgumentList xslArgs = new XsltArgumentList();
xslArgs.AddParam("type", "", "specifictype");
string resultText = Utils.XslTransform(xmlData, xslUrl, xslArgs);
return HttpUtility.HtmlEncode(resultText);
The character â is the first byte of a multibyte UTF-8 sequence (’) shown as if the text were in a single-byte encoding such as Windows-1252. So, I guess, you generate an HTML file in UTF-8 while the browser interprets it otherwise. I see two ways to fix it:
The simplest solution would be to update the XSLT to include the HTML meta tag that tells the browser the correct encoding: <meta charset="UTF-8">.
If your transform already defines a different encoding in a meta tag and you'd like to keep it, that encoding needs to be specified in the function that saves the XML as a file. I assume this function used ASCII by default in your example. If your XSLT was configured to generate XML files directly to disk, you could adjust it with the XSLT instruction <xsl:output encoding="ASCII"/>.
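To make the byte-level picture concrete, here is a minimal sketch of how the stray characters arise. It is written as a C# 9+ top-level program, and it decodes with ISO-8859-1 only because that encoding is available on every .NET runtime; that choice is an assumption of this example, not something from the question. Windows-1252 behaves the same way for the first byte and maps the other two bytes (0x80 and 0x99) to € and ™, which gives the ’ seen on the page.

```csharp
using System;
using System.Text;

// U+2019 (right single quotation mark) is three bytes in UTF-8: E2 80 99.
byte[] utf8Bytes = Encoding.UTF8.GetBytes("\u2019");

// Decoding those bytes with a single-byte encoding turns every byte into its own character.
string wrong = Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);

// Decoding with the encoding that actually produced the bytes restores the character.
string right = Encoding.UTF8.GetString(utf8Bytes);

Console.WriteLine(BitConverter.ToString(utf8Bytes)); // E2-80-99
Console.WriteLine(wrong.Length);                     // 3
Console.WriteLine(wrong[0] == '\u00E2');             // True: the first character is â
Console.WriteLine(right == "\u2019");                // True
```

So the fix is always to make the writer and the reader agree on one encoding, whichever side you change.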
To use WebClient.DownloadString you have to know which encoding the server is going to use and tell the WebClient in advance. It's a bit of a Catch-22.
But there is no need to do that. Use WebClient.DownloadData or WebClient.OpenRead and let an XML library figure out which encoding to use.
using (var web = new WebClient())
using (var stream = web.OpenRead("http://unicode.org/repos/cldr/trunk/common/supplemental/windowsZones.xml"))
using (var reader = XmlReader.Create(stream, new XmlReaderSettings { DtdProcessing = DtdProcessing.Parse }))
{
reader.MoveToContent();
//… use reader as you will, including var doc = XDocument.ReadFrom(reader);
}
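The same idea can be tried without a network call. The sketch below (a C# 9+ top-level program) uses a made-up feed body whose XML declaration names ISO-8859-1; the bytes and element name are assumptions for illustration only. It compares decoding the bytes with a wrong guess against handing the raw bytes to the XML parser.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

// Hypothetical feed body: the declaration says ISO-8859-1 and the bytes match it.
Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
byte[] feedBytes = latin1.GetBytes(
    "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><title>caf\u00E9</title>");

// Turning the bytes into a string with the wrong encoding damages the text:
// the lone 0xE9 byte is not valid UTF-8 and becomes U+FFFD.
string guessed = Encoding.UTF8.GetString(feedBytes);

// Handing the raw bytes to the XML parser lets it honor the declaration.
string parsed;
using (var stream = new MemoryStream(feedBytes))
{
    parsed = XDocument.Load(stream).Root.Value;
}

Console.WriteLine(guessed.IndexOf('\uFFFD') >= 0); // True: the text was damaged
Console.WriteLine(parsed == "caf\u00E9");          // True: the parser got it right
```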

XML Extra free space

Good day. I work with XML through the C# XmlDocument class, and after saving the document I run into a problem. Before saving I have:
<Name></Name>
After saving:
<Name>
</Name>
How do I remove the extra whitespace? I've tried setting doc.PreserveWhitespace = true; before saving and before loading. The result is not what I want: it removes all whitespace, and the (large) XML document becomes visually unreadable.
I have already tried that, with the same result. I also need the windows-1251 encoding. Why does XmlDocument do this? That whitespace matters to me and my program.
The problem is solved. Thank you all.
It can be done. You've got to help control the formatting options when you save the document:
XmlDocument doc = new XmlDocument();
// XmlTextWriter needs an encoding along with the file name
using (var wr = new XmlTextWriter(fileName, Encoding.UTF8))
{
wr.Formatting = Formatting.None;
doc.Save(wr);
}
Or you can fine-tune it further with XmlWriterSettings:
var settings = new XmlWriterSettings
{
Indent = false,
NewLineChars = String.Empty
};
using (var wr = XmlWriter.Create(fileName, settings))
{
// XmlWriter has no Formatting property; the settings above control the layout
doc.Save(wr);
}
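As a quick check of the settings-based approach, here is a small sketch (a C# 9+ top-level program) that saves to a StringBuilder instead of a file so the result can be inspected. The sample document and the OmitXmlDeclaration setting are assumptions made to keep the output short.

```csharp
using System;
using System.Text;
using System.Xml;

var doc = new XmlDocument();
doc.LoadXml("<Root><Name></Name></Root>");

var settings = new XmlWriterSettings
{
    Indent = false,            // the writer inserts no line breaks or indentation
    OmitXmlDeclaration = true  // keeps the printed result short
};

var sb = new StringBuilder();
using (var wr = XmlWriter.Create(sb, settings))
{
    doc.Save(wr);
}

string saved = sb.ToString();
Console.WriteLine(saved);
```

With indentation off, the empty element should come back as <Name></Name> with nothing inserted between the tags. When writing to a file, XmlWriterSettings.Encoding can carry the target encoding; whether Encoding.GetEncoding("windows-1251") is available out of the box depends on the runtime.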

C# htmlagilitypack Node.InnerHTML not case correct, how to pull case correct

I'm using HtmlAgilityPack and following the standard procedure for loading a document and selecting a node. However, when I view the node, all the ASPX controls are in lowercase. Is there a way to get them in their original case? For example, when I look at <asp:RequiredFieldValidator it's returned as <asp:requiredfieldvalidator. This won't work because I'm mass-updating my controls.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(@"C:\my.ascx");
HtmlNodeCollection node_collection = doc.DocumentNode.SelectNodes("//div");
foreach (HtmlNode node in node_collection)
{
templateString = node.InnerHtml; //lower case happens here.....
}
Anybody?
All you need is to set OptionOutputOriginalCase to true before calling Load:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.OptionOutputOriginalCase = true;
doc.LoadHtml("<html><asp:RequiredFieldValidator></asp:RequiredFieldValidator></html>");
var html = doc.DocumentNode.InnerHtml;
Try changing your code to
var doc = new HtmlAgilityPack.HtmlDocument();
doc.OptionOutputOriginalCase = true;
doc.LoadHtml("<html><asp:Content ID='Content1' ContentPlaceHolderID='head' runat='Server'/></html>");
var html = doc.DocumentNode.InnerHtml;

C# XDocument Load with multiple roots

I have an XML file with no root. I cannot change this. I am trying to parse it, but XDocument.Load won't do it. I have tried to set ConformanceLevel.Fragment, but I still get an exception thrown. Does anyone have a solution to this?
I tried with XmlReader, but things are messed up and can't get it work right. XDocument.Load works great, but if I have a file with multiple roots, it doesn't.
XmlReader itself does support reading of xml fragment - i.e.
var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
using (var reader = XmlReader.Create("fragment.xml", settings))
{
// you can work with reader just fine
}
However XDocument.Load does not support reading of fragmented xml.
Quick and dirty way is to wrap the nodes under one virtual root before you invoke the XDocument.Parse. Like:
var fragments = File.ReadAllText("fragment.xml");
var myRootedXml = "<root>" + fragments + "</root>";
var doc = XDocument.Parse(myRootedXml);
This approach is limited to small xml files - as you have to read file into memory first; and concatenating large string means moving large objects in memory - which is best avoided.
If performance matters, you should read nodes into XDocument one by one via XmlReader, as explained in @Martin-Honnen's excellent answer (https://stackoverflow.com/a/18203952/2440262).
If you use API that takes for granted that XmlReader iterates over valid xml, and performance matters, you can use joined-stream approach instead:
using (var jointStream = new MultiStream())
using (var openTagStream = new MemoryStream(Encoding.ASCII.GetBytes("<root>"), false))
using (var fileStream =
File.Open(@"fragment.xml", FileMode.Open, FileAccess.Read, FileShare.Read))
using (var closeTagStream = new MemoryStream(Encoding.ASCII.GetBytes("</root>"), false))
{
jointStream.AddStream(openTagStream);
jointStream.AddStream(fileStream);
jointStream.AddStream(closeTagStream);
using (var reader = XmlReader.Create(jointStream))
{
// now you can work with reader as if it is reading valid xml
}
}
MultiStream - see for example https://gist.github.com/svejdo1/b9165192d313ed0129a679c927379685
Note: XDocument loads the whole xml into memory. So don't use it for large files - instead use XmlReader for iteration and load just the crispy bits as XElement via XNode.ReadFrom(...)
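A minimal sketch of that streaming pattern follows, written as a C# 9+ top-level program. The helper name StreamElements, the element name "item", and the inline sample fragment are all hypothetical; a file path could be passed to XmlReader.Create instead of the StringReader.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Yields each element with the given name, materializing one XElement at a time.
static IEnumerable<XElement> StreamElements(XmlReader reader, string name)
{
    reader.MoveToContent();
    while (!reader.EOF)
    {
        if (reader.NodeType == XmlNodeType.Element && reader.LocalName == name)
        {
            // ReadFrom consumes the element and leaves the reader on the node after it,
            // so no extra Read() call is needed in this branch.
            yield return (XElement)XNode.ReadFrom(reader);
        }
        else
        {
            reader.Read();
        }
    }
}

// Hypothetical rootless input.
const string fragment = "<item id=\"1\"/><other/><item id=\"2\"/>";
var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };

var ids = new List<string>();
using (var reader = XmlReader.Create(new StringReader(fragment), settings))
{
    foreach (XElement item in StreamElements(reader, "item"))
    {
        ids.Add((string)item.Attribute("id"));
    }
}

Console.WriteLine(string.Join(",", ids)); // 1,2
```

Only the matched elements are ever held in memory, which is the point of avoiding XDocument for large inputs.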
The only in-memory tree representations in the .NET framework that can deal with fragments are XmlDocumentFragment in .NET's DOM implementation, for which you would need to create an XmlDocument and a fragment with e.g.
XmlDocument doc = new XmlDocument();
XmlDocumentFragment frag = doc.CreateDocumentFragment();
frag.InnerXml = stringWithXml; // for instance
// frag.InnerXml = File.ReadAllText("fragment.xml");
or XPathDocument, which you can create using an XmlReader with ConformanceLevel set to Fragment:
XPathDocument doc;
using (XmlReader xr =
XmlReader.Create("fragment.xml",
new XmlReaderSettings()
{
ConformanceLevel = ConformanceLevel.Fragment
}))
{
doc = new XPathDocument(xr);
}
// now create an XPathNavigator to read out data, e.g.
XPathNavigator nav = doc.CreateNavigator();
Obviously XPathNavigator is read-only.
If you want to use LINQ to XML then I agree with the suggestions made that you need to create an XElement as a wrapper. Instead of pulling in a string with the file contents you could however use XNode.ReadFrom with an XmlReader e.g.
public static class MyExtensions
{
public static IEnumerable<XNode> ParseFragment(XmlReader xr)
{
xr.MoveToContent();
XNode node;
while (!xr.EOF && (node = XNode.ReadFrom(xr)) != null)
{
yield return node;
}
}
}
then
XElement root = new XElement("root",
MyExtensions.ParseFragment(XmlReader.Create(
"fragment.xml",
new XmlReaderSettings() {
ConformanceLevel = ConformanceLevel.Fragment })));
That might work better and more efficiently than reading everything into a string.
If you wanted to use XmlDocument.Load() then you would need to wrap the content in a root node.
or you could try something like this...
xmlReader.MoveToContent();
while (!xmlReader.EOF)
{
if (xmlReader.NodeType == XmlNodeType.Element)
{
// ReadOuterXml already advances the reader past the element
XmlDocument d = new XmlDocument();
d.LoadXml(xmlReader.ReadOuterXml());
}
else
{
xmlReader.Read();
}
}
An XML document cannot have more than one root element; exactly one root element is required. One thing you can do is get all the fragment elements, wrap them in a root element, and parse the result with XDocument.
This would be the best and easiest approach that one could think of.

HtmlAgilityPack - How to set custom encoding when loading pages

Is it possible to set custom encoding when loading pages with the method below?
HtmlWeb hwWeb = new HtmlWeb();
HtmlDocument hd = hwWeb.Load("myurl");
I want to set encoding to "iso-8859-9".
I use C# 4.0 and WPF.
Edit: The question has been answered on MSDN.
I suppose you could try overriding the encoding in the HtmlWeb object.
Try this:
var web = new HtmlWeb
{
AutoDetectEncoding = false,
OverrideEncoding = myEncoding,
};
var doc = web.Load(myUrl);
Note: It appears that the OverrideEncoding property was added to HTML agility pack in revision 76610 so it is not available in the current release v1.4 (66017). The next best thing to do would be to read the page manually with the encodings overridden.
var document = new HtmlDocument();
using (var client = new WebClient())
{
using (var stream = client.OpenRead(url))
{
var reader = new StreamReader(stream, Encoding.GetEncoding("iso-8859-9"));
var html = reader.ReadToEnd();
document.LoadHtml(html);
}
}
This is a simple version of the solution answered here (for some reason it got deleted).
A decent answer is over here which handles auto-detecting the encoding as well as some other nifty features:
C# and HtmlAgilityPack encoding problem
