C# .NET XMLWriter/Reader problem

I've been having problems writing XML and reading it in. I have a handwritten XML that gets read in fine, but after I write the XML it acts funny.
The output of the WriteXML: http://www.craigmouser.com/random/test.xml
It works if you hit enter after the <specials> tag, i.e. make <specials><special> look like
<specials>
<special>
If I step through it, when reading it, it goes to the start node of specials, then the next iteration reads it as an EndElement with name Shots. I have no idea where to go from here. Thanks in advance.
Code: Writing
public void SaveXMLFile(string filename, Bar b, Boolean saveOldData)
{
XmlWriter xml;
if(filename.Contains(".xml"))
{
xml = XmlWriter.Create(filename);
}
else
{
xml = XmlWriter.Create(filename + ".xml");
}
xml.WriteStartElement("AggievilleBar");
xml.WriteElementString("name", b.Name);
xml.WriteStartElement("picture");
xml.WriteAttributeString("version", b.PictureVersion.ToString());
xml.WriteEndElement();
xml.WriteElementString("location", b.Location.Replace(Environment.NewLine, "\n"));
xml.WriteElementString("news", b.News.Replace(Environment.NewLine, "\n"));
xml.WriteElementString("description", b.Description.Replace(Environment.NewLine, "\n"));
xml.WriteStartElement("specials");
xml.WriteString("\n"); //This line fixes the problem... ?!?!
foreach (Special s in b.Specials)
{
if (s.DayOfWeek > 0 || (s.DayOfWeek == -1
&& ((s.Date.CompareTo(DateTime.Today) < 0 && saveOldData )
|| s.Date.CompareTo(DateTime.Today) >= 0)))
{
xml.WriteStartElement("special");
xml.WriteAttributeString("dayofweek", s.DayOfWeek.ToString());
if (s.DayOfWeek == -1)
xml.WriteAttributeString("date", s.Date.ToString("yyyy-MM-dd"));
xml.WriteAttributeString("price", s.Price.ToString());
xml.WriteString(s.Name);
xml.WriteEndElement();
}
}
xml.WriteEndElement();
xml.WriteEndElement();
xml.Close();
}
Code: Reading
public Bar LoadXMLFile(string filename)
{
List<Special> specials = new List<Special>();
XmlReader xml;
try
{
xml = XmlReader.Create(filename);
}
catch (Exception)
{
MessageBox.Show("Unable to open file. If you get this error upon opening the program, we failed to pull down your current data. You will most likely be unable to save, but you are free to try. If this problem persists please contact us at pulsarproductionssupport@gmail.com",
"Error Opening File", MessageBoxButtons.OK, MessageBoxIcon.Error);
return null;
}
Bar current = new Bar();
Special s = new Special();
while (xml.Read())
{
if (xml.IsStartElement())
{
switch (xml.Name)
{
case "AggievilleBar":
current = new Bar();
break;
case "name":
if (xml.Read())
current.Name = xml.Value.Trim();
break;
case "picture":
if (xml.HasAttributes)
{
try
{
current.PictureVersion = Int32.Parse(xml.GetAttribute("version"));
}
catch (Exception)
{
MessageBox.Show("Error reading in the Picture Version Number.","Error",MessageBoxButtons.OK,MessageBoxIcon.Error);
}
}
break;
case "location":
if (xml.Read())
current.Location = xml.Value.Trim();
break;
case "news":
if (xml.Read())
current.News = xml.Value.Trim();
break;
case "description":
if (xml.Read())
current.Description = xml.Value.Trim();
break;
case "specials":
if (xml.Read())
specials = new List<Special>();
break;
case "special":
s = new Special();
if (xml.HasAttributes)
{
try
{
s.DayOfWeek = Int32.Parse(xml.GetAttribute(0));
if (s.DayOfWeek == -1)
{
s.Date = DateTime.Parse(xml.GetAttribute(1));
s.Price = Int32.Parse(xml.GetAttribute(2));
}
else
s.Price = Int32.Parse(xml.GetAttribute(1));
}
catch (Exception)
{
MessageBox.Show("Error reading in a special.", "Error", MessageBoxButtons.OK, MessageBoxIcon.Error);
}
}
if (xml.Read())
s.Name = xml.Value.Trim();
break;
}
}
else
{
switch (xml.Name)
{
case "AggievilleBar":
xml.Close();
break;
case "special":
specials.Add(s);
break;
case "specials":
current.Specials = specials;
break;
}
}
}
return current;
}

Without seeing your code it's hard to really give a straight answer to that question. However, I can suggest using Linq-to-XML instead of XMLReader/XMLWriter -- it's so much easier to work with when you don't have to read each node one at a time and determine what node you're working with, which sounds like the problem you're having.
For example, code like:
using (var reader = XmlReader.Create(...))
{
while (reader.Read())
{
if (reader.Name == "book" && reader.IsStartElement())
{
// endless, confusing nesting!!!
}
}
}
Becomes:
var elem = doc.Descendants("book").Descendants("title")
.Where(c => c.Attribute("name").Value == "C# Basics")
.FirstOrDefault();
For an introduction to LINQ-to-XML, check out http://www.c-sharpcorner.com/UploadFile/shakthee/2868/, or just search for "Linq-to-XML". Plenty of examples out there.
Edit: I tried your code and I was able to reproduce your problem. It seems that without a newline before the special tag, the first special element is read in as IsStartElement() == false. I wasn't sure why this is; even skimmed through the XML Specifications and didn't see any requirements about newlines before elements.
I rewrote your code in Linq-to-XML and it worked fine without any newlines:
var xdoc = XDocument.Load(filename);
var barElement = xdoc.Element("AggievilleBar");
var specialElements = barElement.Descendants("special").ToList();
var specials = new List<Special>();
specialElements.ForEach(s =>
{
var dayOfWeek = Convert.ToInt32(s.Attribute("dayofweek").Value);
var price = Convert.ToInt32(s.Attribute("price").Value);
var date = s.Attribute("date");
specials.Add(new Special
{
Name = s.Value,
DayOfWeek = dayOfWeek,
Price = price,
Date = date != null ? DateTime.Parse(date.Value) : DateTime.MinValue
});
});
var bar = new Bar() {
Name = barElement.Element("name").Value,
PictureVersion = Convert.ToInt32(barElement.Elements("picture").Single()
.Attribute("version").Value),
Location = barElement.Element("location").Value,
Description = barElement.Element("description").Value,
News = barElement.Element("news").Value,
Specials = specials
};
return bar;
Would you consider using Linq-to-XML instead of XMLReader? I've had my share of trouble with XMLReader in the past and once I switched to Linq-to-XML haven't looked back!
EDIT: I know this question is rather old now, but I just came across an article that reminded me of this question and might explain why this is happening: --> http://www.codeproject.com/KB/dotnet/pitfalls_xml_4_0.aspx
The author states:
In this light, a nasty difference between XmlReaders/Writers and XDocument is the way whitespace is treated. (See http://msdn.microsoft.com/en-us/library/bb387014.aspx.)
From msdn:
In most cases, if the method takes LoadOptions as an argument, you can optionally preserve insignificant white space as text nodes in the XML tree. However, if the method is loading the XML from an XmlReader, then the XmlReader determines whether white space will be preserved or not. Setting PreserveWhitespace will have no effect.
So perhaps, since you're loading using an XmlReader, the XmlReader is making the determination as to whether or not it should preserve white space. Most likely it IS preserving the white space which is why the newline (or lack thereof) makes a difference. And it doesn't seem like you can do anything to change it, so long as you're using an XmlReader! Very peculiar.
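Looking at the reading code in the question, the newline dependence appears to come from the extra xml.Read() inside case "specials". That call consumes whatever node follows the <specials> start tag. With a newline it is a whitespace node, which is harmless; without one it is the first <special> start tag, so the switch never sees it. A minimal sketch of a loop that lets only the while condition advance the reader is below; SpecialsDemo and ReadSpecialNames are hypothetical names, and only the special names are collected to keep it short.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SpecialsDemo
{
    // Collects the text of every <special> element.
    // Only the while condition calls Read(), so no node is consumed twice.
    public static List<string> ReadSpecialNames(string xmlText)
    {
        var names = new List<string>();
        string current = null; // non-null while inside a <special> element

        using (XmlReader xml = XmlReader.Create(new StringReader(xmlText)))
        {
            while (xml.Read())
            {
                switch (xml.NodeType)
                {
                    case XmlNodeType.Element:
                        if (xml.Name == "special")
                        {
                            if (xml.IsEmptyElement) names.Add(""); // <special/> has no end tag
                            else current = "";
                        }
                        break;
                    case XmlNodeType.Text:
                        if (current != null) current += xml.Value;
                        break;
                    case XmlNodeType.EndElement:
                        if (xml.Name == "special" && current != null)
                        {
                            names.Add(current.Trim());
                            current = null;
                        }
                        break;
                }
            }
        }
        return names;
    }
}
```

With this shape the same result should come back whether or not the writer emits a newline after <specials>, so the xml.WriteString("\n") workaround in the writer would no longer be needed.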

I'd recommend you use the XmlDocument class and its Load and Save methods, and then work with the XML tree instead of messing around with XmlReader and XmlWriter. In my experience using XmlDocument has fewer weird formatting problems.
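As a rough sketch of that suggestion, the specials could be read through XmlDocument and an XPath query instead of a node-by-node loop. The element and attribute names come from the question's writer code; BarXmlDocumentDemo, LoadSpecials and the tuple shape are made up for illustration, and the date attribute is left out.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Xml;

public static class BarXmlDocumentDemo
{
    // Loads the <special> elements from the bar XML text.
    // For a file on disk, doc.Load(filename) would replace doc.LoadXml(xmlText).
    public static List<(string Name, int DayOfWeek, int Price)> LoadSpecials(string xmlText)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xmlText);

        var specials = new List<(string Name, int DayOfWeek, int Price)>();
        foreach (XmlNode node in doc.SelectNodes("/AggievilleBar/specials/special"))
        {
            var element = (XmlElement)node;
            int dayOfWeek = int.Parse(element.GetAttribute("dayofweek"), CultureInfo.InvariantCulture);
            int price = int.Parse(element.GetAttribute("price"), CultureInfo.InvariantCulture);
            specials.Add((element.InnerText.Trim(), dayOfWeek, price));
        }
        return specials;
    }
}
```

Because the whole tree is in memory, whitespace between elements does not change which nodes the query returns.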

Related

When using MergeField FieldCodes in OpenXml SDK in C# why do field codes disappear or fragment?

I have been working successfully with the C# OpenXml SDK (Unofficial Microsoft Package 2.5 from NuGet) for some time now, but have recently noticed that the following line of code returns different results depending on what mood Microsoft Word appears to be in when the file gets saved:
var fields = document.Descendants<FieldCode>();
From what I can tell, when creating the document in the first place (using Word 2013 on Windows 8.1) if you use the Insert->QuickParts->Field and choose MergeField from the Field names left hand pane, and then provide a Field name in the field properties and click OK then the field code is correctly saved in the document as I would expect.
Then when using the aforementioned line of code I will receive a field code count of 1 field. If I subsequently edit this document (and even leave this field well alone) the subsequent saving could mean that this field code no longer is returned in my query.
Another case of the same curiousness is when I see the FieldCode nodes split across multiple items. So rather than seeing say:
" MERGEFIELD Author \\* MERGEFORMAT "
As the node name, I will see:
" MERGEFIELD Aut"
"hor \\* MERGEFORMAT"
Split as two FieldCode node values. I have no idea why this would be the case, but it certainly makes my ability to match nodes that much more exciting. Is this expected behaviour? A known bug? I don't really want to have to crack open the raw xml and edit this document to work until I understand what is going on. Many thanks all.
I came across this very problem myself, and found a solution that exists within OpenXML: a utility class called MarkupSimplifier which is part of the PowerTools for Open XML project. Using this class solved all the problems I was having that you describe.
The full article is located here.
Here are some pertinent excerpts:
Perhaps the most useful simplification that this performs is to merge adjacent runs with identical formatting.
It goes on to say:
Open XML applications, including Word, can arbitrarily split runs as necessary. If you, for instance, add a comment to a document, runs will be split at the location of the start and end of the comment. After MarkupSimplifier removes comments, it can merge runs, resulting in simpler markup.
An example of the utility class in use is:
SimplifyMarkupSettings settings = new SimplifyMarkupSettings
{
RemoveComments = true,
RemoveContentControls = true,
RemoveEndAndFootNotes = true,
RemoveFieldCodes = false,
RemoveLastRenderedPageBreak = true,
RemovePermissions = true,
RemoveProof = true,
RemoveRsidInfo = true,
RemoveSmartTags = true,
RemoveSoftHyphens = true,
ReplaceTabsWithSpaces = true,
};
MarkupSimplifier.SimplifyMarkup(wordDoc, settings);
I have used this many times with Word 2010 documents using VS2015 .Net Framework 4.5.2 and it has made my life much, much easier.
Update:
I have revisited this code and have found that it cleans up the runs in MERGEFIELDs but not in IF fields that reference merge fields, e.g.
{if {MERGEFIELD When39} = "Y???" "Y" "N" }
I have no idea why this might be so, and examination of the underlying XML offers no hints.
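Independently of any clean-up pass, a field's instruction text can be read by joining every w:instrText found between a w:fldChar of type begin and the following separate or end marker, which tolerates Word splitting the text across runs. The sketch below works on the raw WordprocessingML with LINQ to XML rather than the Open XML SDK; FieldCodeReader and ReadFieldInstructions are hypothetical names, and nested fields (such as an IF containing a MERGEFIELD) are deliberately not handled.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

public static class FieldCodeReader
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one string per complex field found under the given element,
    // built by concatenating all instrText fragments of that field.
    // Simplification: nested fields are not tracked separately.
    public static List<string> ReadFieldInstructions(XElement container)
    {
        var result = new List<string>();
        StringBuilder current = null; // non-null while inside a field's instruction part

        foreach (XElement el in container.Descendants())
        {
            if (el.Name == W + "fldChar")
            {
                string type = (string)el.Attribute(W + "fldCharType");
                if (type == "begin")
                {
                    current = new StringBuilder();
                }
                else if (current != null && (type == "separate" || type == "end"))
                {
                    result.Add(current.ToString());
                    current = null;
                }
            }
            else if (el.Name == W + "instrText" && current != null)
            {
                current.Append(el.Value);
            }
        }
        return result;
    }
}
```

Matching against the joined string avoids having to modify the document at all when the goal is only to find the fields.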
Word will often split a text run into multiple text runs for no reason I've ever understood. When searching, comparing, tidying etc., we preprocess the body with a method that combines multiple runs into a single text run.
/// <summary>
/// Combines the identical runs.
/// </summary>
/// <param name="body">The body.</param>
public static void CombineIdenticalRuns(W.Body body)
{
List<W.Run> runsToRemove = new List<W.Run>();
foreach (W.Paragraph para in body.Descendants<W.Paragraph>())
{
List<W.Run> runs = para.Elements<W.Run>().ToList();
for (int i = runs.Count - 2; i >= 0; i--)
{
W.Text text1 = runs[i].GetFirstChild<W.Text>();
W.Text text2 = runs[i + 1].GetFirstChild<W.Text>();
if (text1 != null && text2 != null)
{
string rPr1 = "";
string rPr2 = "";
if (runs[i].RunProperties != null) rPr1 = runs[i].RunProperties.OuterXml;
if (runs[i + 1].RunProperties != null) rPr2 = runs[i + 1].RunProperties.OuterXml;
if (rPr1 == rPr2)
{
text1.Text += text2.Text;
runsToRemove.Add(runs[i + 1]);
}
}
}
}
foreach (W.Run run in runsToRemove)
{
run.Remove();
}
}
I tried to simplify the document with PowerTools, but the result was a corrupted Word file. I wrote this routine to simplify only the field codes that have specific names; it works on all parts of the document (main document part, headers and footers):
internal static void SimplifyFieldCodes(WordprocessingDocument document)
{
var masks = new string[] { Constants.VAR_MASK, Constants.INP_MASK, Constants.TBL_MASK, Constants.IMG_MASK, Constants.GRF_MASK };
SimplifyFieldCodesInElement(document.MainDocumentPart.RootElement, masks);
foreach (var headerPart in document.MainDocumentPart.HeaderParts)
{
SimplifyFieldCodesInElement(headerPart.Header, masks);
}
foreach (var footerPart in document.MainDocumentPart.FooterParts)
{
SimplifyFieldCodesInElement(footerPart.Footer, masks);
}
}
internal static void SimplifyFieldCodesInElement(OpenXmlElement element, string[] regexpMasks)
{
foreach (var run in element.Descendants<Run>()
.Select(item => (Run)item)
.ToList())
{
var fieldChar = run.Descendants<FieldChar>().FirstOrDefault();
if (fieldChar != null && fieldChar.FieldCharType == FieldCharValues.Begin)
{
string fieldContent = "";
List<Run> runsInFieldCode = new List<Run>();
var currentRun = run.NextSibling();
while ((currentRun is Run) && currentRun.Descendants<FieldCode>().FirstOrDefault() != null)
{
var currentRunFieldCode = currentRun.Descendants<FieldCode>().FirstOrDefault();
fieldContent += currentRunFieldCode.InnerText;
runsInFieldCode.Add((Run)currentRun);
currentRun = currentRun.NextSibling();
}
// If the FieldCode spans more than one Run and is one we must change, put the complete text in the first Run and remove the rest
if (runsInFieldCode.Count > 1)
{
// Check the field code to see whether it is one we must simplify (so that TOC, PAGEREF, etc. are left unchanged)
bool applyTransform = false;
foreach (string regexpMask in regexpMasks)
{
Regex regex = new Regex(regexpMask);
Match match = regex.Match(fieldContent);
if (match.Success)
{
applyTransform = true;
break;
}
}
if (applyTransform)
{
var currentRunFieldCode = runsInFieldCode[0].Descendants<FieldCode>().FirstOrDefault();
currentRunFieldCode.Text = fieldContent;
runsInFieldCode.RemoveAt(0);
foreach (Run runToRemove in runsInFieldCode)
{
runToRemove.Remove();
}
}
}
}
}
}
Hope this helps!!!

I am trying to parse an XML file using C# in the .NET environment and it keeps skipping over elements

This is what a portion of the XML I am trying to parse looks like:
<azsa:Views>
<azsa:Spatial_Array>
<azsa:Spatial>
<azsa:ViewName>Spatial</azsa:ViewName>
<azsa:BBox>
<azsa:PointLo>
<azsa:x>0</azsa:x>
<azsa:y>0</azsa:y>
<azsa:z>0</azsa:z>
</azsa:PointLo>
<azsa:PointHi>
<azsa:x>2925</azsa:x>
<azsa:y>3375</azsa:y>
<azsa:z>2775</azsa:z>
</azsa:PointHi>
</azsa:BBox>
</azsa:Spatial>
</azsa:Spatial_Array>
</azsa:Views>
I have to read the x, y and z coordinates for both PointHi and PointLo.
I was using the XmlTextReader class to perform the task.
XmlTextReader reader = new XmlTextReader(openFileDialog1.FileName);
while (reader.Read())
{
reader.ReadToFollowing("azsa:Views");
reader.ReadToFollowing("azsa:Spatial_Array");
reader.ReadToFollowing("azsa:Spatial");
reader.ReadToFollowing("azsa:ViewName");
reader.ReadToFollowing("azsa:BBox");
reader.ReadToFollowing("azsa:PointLo");
reader.ReadToFollowing("azsa:x");
low[0] = (int)(Double.Parse(reader.ReadElementString()));
reader.ReadToFollowing("azsa:y");
low[1] = (int)(Double.Parse(reader.ReadElementString()));
reader.ReadToFollowing("azsa:z");
low[2] = (int)(Double.Parse(reader.ReadElementString()));
reader.ReadToFollowing("azsa:PointHi");
reader.ReadToFollowing("azsa:x");
high[0] = (int)(Double.Parse(reader.ReadElementString()));
reader.ReadToFollowing("azsa:y");
high[1] = (int)(Double.Parse(reader.ReadElementString()));
reader.ReadToFollowing("azsa:z");
high[2] = (int)(Double.Parse(reader.ReadElementString()));
}
The reader works perfectly until it gets to the first x in the PointLo and then it just skips to the y in PointHi instead. I have tried using descendants, subtrees and readinnerxml but it still does the same thing.
NOTE: 1. There is more code in the while loop for reading the remaining part of the XML but was not necessary for this problem so I have not included it in the post.
2. Changing the way the XML is organized is not possible because that's how they are required to be stored for the task I am performing.
3. XMLReader is the preferable method as I am dealing with a large number of documents and there is no scope for having this use cache memory.
I had a fairly similar issue a while back when reading subtrees. The solution in that scenario was to dispose the subtree XmlReaders. Granted, the situation here is slightly different, but could you consider an approach such as below (note that I removed the element prefixes for simplicity of testing, as well as read in the XML string rather than a file)?
It is certainly ugly looking, but this was more a proof of concept and could be tidied up a bit. It is also lacking the appropriate error checking, but again this was more for demonstration purposes. It does at least parse out the different point values.
As a side note, I think perhaps a lot of the ugliness could be abstracted away by making classes to represent the different components (or objects) within the XML stream, and making those classes responsible for parsing out their own properties.
Just one way (of many I'm sure) to skin a cat...
private void ParseXml(string xml)
{
double[] low = null;
double[] hi = null;
using (StringReader stringReader = new StringReader(xml))
{
using (XmlReader xmlReader = XmlReader.Create(stringReader))
{
while (xmlReader.Read())
{
if (xmlReader.NodeType != XmlNodeType.Element) continue;
if (xmlReader.Name == "PointLo")
{
low = ParsePoint(xmlReader);
}
else if (xmlReader.Name == "PointHi")
{
hi = ParsePoint(xmlReader);
}
}
}
}
}
private double[] ParsePoint(XmlReader xmlReader)
{
double[] point = new double[3];
using (XmlReader pointReader = xmlReader.ReadSubtree())
{
while (pointReader.Read())
{
if (pointReader.NodeType != XmlNodeType.Element) continue;
if (pointReader.Name == "x")
{
point[0] = GetDimensionValue(pointReader);
}
else if (pointReader.Name == "y")
{
point[1] = GetDimensionValue(pointReader);
}
else if (pointReader.Name == "z")
{
point[2] = GetDimensionValue(pointReader);
}
}
}
return point;
}
private double GetDimensionValue(XmlReader reader)
{
using (XmlReader dimensionReader = reader.ReadSubtree())
{
dimensionReader.Read();
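// Read through the subtree reader; the parent reader should not be used until the subtree reader is closed
return dimensionReader.ReadElementContentAsDouble();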
}
}
As I mentioned in the comments on manderson's solution, for some reason the reader does not see the y element as an element and instead sees it as a text node. I made the following changes to the while loop in ParsePoint():
while (pointReader.Read())
{
if (pointReader.NodeType == XmlNodeType.Element || pointReader.NodeType== XmlNodeType.Text)
{
if (pointReader.Name == "azsa:x")
{
point[0] = pointReader.ReadElementContentAsDouble();
}
else if (pointReader.Name == "")
{
point[1] = Double.Parse(pointReader.Value);
}
else if (pointReader.Name == "azsa:z")
{
point[2] = pointReader.ReadElementContentAsDouble();
}
}
}
While I am not claiming that this is the ideal way to do this, it works for the XML files I am dealing with. I also removed the GetDimensionValue method and just do the reading of the values/element contents in this method itself.
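A likely explanation for the skipping, assuming the real files have no whitespace between </azsa:x> and <azsa:y>: ReadElementString() and ReadElementContentAsDouble() leave the reader on the node after the closing tag, which is then already the <azsa:y> start tag. ReadToFollowing (or another Read()) moves on from there, so that y is passed over. The sketch below only calls Read() when the current node was not consumed by a content read. PointParser, ReadPoint and Parse are hypothetical names, and LocalName is used so the azsa: prefix does not have to be spelled out.

```csharp
using System.IO;
using System.Xml;

public static class PointParser
{
    // Reads the x, y and z children of the element the reader is positioned on.
    public static double[] ReadPoint(XmlReader reader)
    {
        var point = new double[3];
        using (XmlReader sub = reader.ReadSubtree())
        {
            sub.Read();             // positions on the point element itself
            bool more = sub.Read(); // first child node
            while (more)
            {
                if (sub.NodeType == XmlNodeType.Element)
                {
                    // ReadElementContentAsDouble already advances past the end tag,
                    // so skip the Read() at the bottom of the loop in these cases.
                    switch (sub.LocalName)
                    {
                        case "x": point[0] = sub.ReadElementContentAsDouble(); continue;
                        case "y": point[1] = sub.ReadElementContentAsDouble(); continue;
                        case "z": point[2] = sub.ReadElementContentAsDouble(); continue;
                    }
                }
                more = sub.Read();
            }
        }
        return point;
    }

    // Streams through the input and returns the PointLo and PointHi coordinates.
    public static (double[] Low, double[] High) Parse(TextReader input)
    {
        double[] low = null, high = null;
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element) continue;
                if (reader.LocalName == "PointLo") low = ReadPoint(reader);
                else if (reader.LocalName == "PointHi") high = ReadPoint(reader);
            }
        }
        return (low, high);
    }
}
```

This stays a forward-only streaming read, so it should still suit a large number of documents, and it does not depend on whether the files are indented.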

How to convert HTML to plain text? [duplicate]

I'm looking for C# code to convert an HTML document to plain text.
I'm not looking for simple tag stripping , but something that will output plain text with a reasonable preservation of the original layout.
The output should look like this:
Html2Txt at W3C
I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?
EDIT: I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does HTML-to-text conversion)! All it did was strip the tags, flatten the tables, etc. The output didn't look anything like what Html2Txt @ W3C produced. Too bad that source doesn't seem to be available.
I was looking to see if there is a more "canned" solution available.
EDIT 2: Thank you everybody for your suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture the stdout with ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process vs. doing it in code. Plus, Lynx is FAST!!
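The wrapper described in EDIT 2 could look roughly like this. It is a sketch only: LynxDump, BuildStartInfo and Run are made-up names, the path to lynx.exe is whatever the caller supplies, and the only Lynx switch used is the -dump switch mentioned above.

```csharp
using System.Diagnostics;

public static class LynxDump
{
    // Builds the start info for "lynx -dump <file or URL>" with stdout redirected.
    public static ProcessStartInfo BuildStartInfo(string lynxPath, string htmlPathOrUrl)
    {
        return new ProcessStartInfo
        {
            FileName = lynxPath,
            Arguments = "-dump \"" + htmlPathOrUrl + "\"",
            UseShellExecute = false,        // required for stream redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };
    }

    // Runs Lynx and returns the rendered plain text.
    public static string Run(string lynxPath, string htmlPathOrUrl)
    {
        using (Process process = Process.Start(BuildStartInfo(lynxPath, htmlPathOrUrl)))
        {
            // Read the output before waiting, so a full pipe buffer cannot block the child.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return output;
        }
    }
}
```

A call would then be something like LynxDump.Run(@"C:\tools\lynx\lynx.exe", "page.html"), where both paths are placeholders.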
Just a note about the HtmlAgilityPack for posterity. The project contains an example of converting HTML to text, which, as noted by the OP, does not handle whitespace at all like anyone writing HTML would envisage. There are full-text rendering solutions out there, noted by others on this question, which this is not (it cannot even handle tables in its current form), but it is lightweight and fast, which is all I wanted for creating a simple text version of HTML emails.
using System.IO;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
//small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
public static class HtmlToText
{
public static string Convert(string path)
{
HtmlDocument doc = new HtmlDocument();
doc.Load(path);
return ConvertDoc(doc);
}
public static string ConvertHtml(string html)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
return ConvertDoc(doc);
}
public static string ConvertDoc (HtmlDocument doc)
{
using (StringWriter sw = new StringWriter())
{
ConvertTo(doc.DocumentNode, sw);
sw.Flush();
return sw.ToString();
}
}
internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
foreach (HtmlNode subnode in node.ChildNodes)
{
ConvertTo(subnode, outText, textInfo);
}
}
public static void ConvertTo(HtmlNode node, TextWriter outText)
{
ConvertTo(node, outText, new PreceedingDomTextInfo(false));
}
internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
string html;
switch (node.NodeType)
{
case HtmlNodeType.Comment:
// don't output comments
break;
case HtmlNodeType.Document:
ConvertContentTo(node, outText, textInfo);
break;
case HtmlNodeType.Text:
// script and style must not be output
string parentName = node.ParentNode.Name;
if ((parentName == "script") || (parentName == "style"))
{
break;
}
// get text
html = ((HtmlTextNode)node).Text;
// is it in fact a special closing node output as text?
if (HtmlNode.IsOverlappedClosingElement(html))
{
break;
}
// check the text is meaningful and not a bunch of whitespaces
if (html.Length == 0)
{
break;
}
if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
{
html= html.TrimStart();
if (html.Length == 0) { break; }
textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
}
outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
{
outText.Write(' ');
}
break;
case HtmlNodeType.Element:
string endElementString = null;
bool isInline;
bool skip = false;
int listIndex = 0;
switch (node.Name)
{
case "nav":
skip = true;
isInline = false;
break;
case "body":
case "section":
case "article":
case "aside":
case "h1":
case "h2":
case "header":
case "footer":
case "address":
case "main":
case "div":
case "p": // stylistic - adjust as you tend to use
if (textInfo.IsFirstTextOfDocWritten)
{
outText.Write("\r\n");
}
endElementString = "\r\n";
isInline = false;
break;
case "br":
outText.Write("\r\n");
skip = true;
textInfo.WritePrecedingWhiteSpace = false;
isInline = true;
break;
case "a":
if (node.Attributes.Contains("href"))
{
string href = node.Attributes["href"].Value.Trim();
if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase)==-1)
{
endElementString = "<" + href + ">";
}
}
isInline = true;
break;
case "li":
if(textInfo.ListIndex>0)
{
outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
}
else
{
outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
}
isInline = false;
break;
case "ol":
listIndex = 1;
goto case "ul";
case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
endElementString = "\r\n";
isInline = false;
break;
case "img": //inline-block in reality
if (node.Attributes.Contains("alt"))
{
outText.Write('[' + node.Attributes["alt"].Value);
endElementString = "]";
}
if (node.Attributes.Contains("src"))
{
outText.Write('<' + node.Attributes["src"].Value + '>');
}
isInline = true;
break;
default:
isInline = true;
break;
}
if (!skip && node.HasChildNodes)
{
ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten){ ListIndex = listIndex });
}
if (endElementString != null)
{
outText.Write(endElementString);
}
break;
}
}
}
internal class PreceedingDomTextInfo
{
public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
{
IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
}
public bool WritePrecedingWhiteSpace {get;set;}
public bool LastCharWasSpace { get; set; }
public readonly BoolWrapper IsFirstTextOfDocWritten;
public int ListIndex { get; set; }
}
internal class BoolWrapper
{
public BoolWrapper() { }
public bool Value { get; set; }
public static implicit operator bool(BoolWrapper boolWrapper)
{
return boolWrapper.Value;
}
public static implicit operator BoolWrapper(bool boolWrapper)
{
return new BoolWrapper{ Value = boolWrapper };
}
}
As an example, the following HTML code...
<!DOCTYPE HTML>
<html>
<head>
</head>
<body>
<header>
Whatever Inc.
</header>
<main>
<p>
Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
</p>
<ol>
<li>
Please confirm this is your email by replying.
</li>
<li>
Then perform this step.
</li>
</ol>
<p>
Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
</p>
<ul>
<li>
a point.
</li>
<li>
another point, with a <a href="http://en.wikipedia.org/wiki/Hyperlink">hyperlink</a>.
</li>
</ul>
<p>
Sincerely,
</p>
<p>
The whatever.com team
</p>
</main>
<footer>
Ph: 000 000 000<br/>
mail: whatever st
</footer>
</body>
</html>
...will be transformed into:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
1. Please confirm this is your email by replying.
2. Then perform this step.
Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please:
* a point.
* another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
...as opposed to:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
Please confirm this is your email by replying.
Then perform this step.
Please solve this . Then, in any order, could you please:
a point.
another point, with a hyperlink.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
You could use this:
public static string StripHTML(string HTMLText, bool decode = true)
{
Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
var stripped = reg.Replace(HTMLText, "");
return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
}
Updated
Thanks for the comments; I have updated this function to improve it.
I've heard from a reliable source that, if you're doing HTML parsing in .Net, you should look at the HTML agility pack again..
http://www.codeplex.com/htmlagilitypack
Some sample on SO..
HTML Agility pack - parsing tables
What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other Text browsers...This is much harder to do than you would expect.
I had some decoding issues with HtmlAgility and I didn't want to invest time investigating it.
Instead I used that utility from the Microsoft Team Foundation API:
var text = HtmlFilter.ConvertToPlainText(htmlContent);
Have you tried http://www.aaronsw.com/2002/html2text/? It's Python, but open source.
Assuming you have well formed html, you could also maybe try an XSL transform.
Here's an example:
using System;
using System.IO;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml.Xsl;
class Html2TextExample
{
public static string Html2Text(XDocument source)
{
var writer = new StringWriter();
Html2Text(source, writer);
return writer.ToString();
}
public static void Html2Text(XDocument source, TextWriter output)
{
Transformer.Transform(source.CreateReader(), null, output);
}
public static XslCompiledTransform _transformer;
public static XslCompiledTransform Transformer
{
get
{
if (_transformer == null)
{
_transformer = new XslCompiledTransform();
var xsl = XDocument.Parse(@"<?xml version='1.0'?><xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" exclude-result-prefixes=""xsl""><xsl:output method=""html"" indent=""yes"" version=""4.0"" omit-xml-declaration=""yes"" encoding=""UTF-8"" /><xsl:template match=""/""><xsl:value-of select=""."" /></xsl:template></xsl:stylesheet>");
_transformer.Load(xsl.CreateNavigator());
}
return _transformer;
}
}
static void Main(string[] args)
{
var html = XDocument.Parse("<html><body><div>Hello world!</div></body></html>");
var text = Html2Text(html);
Console.WriteLine(text);
}
}
Because I wanted conversion to plain text with LF and bullets, I found this pretty solution on codeproject, which covers many conversion usecases:
Convert HTML to Plain Text
Yep, looks so big, but works fine.
The easiest would probably be tag stripping combined with replacement of some tags with text layout elements like dashes for list elements (li) and line breaks for br's and p's.
It shouldn't be too hard to extend this to tables.
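A minimal sketch of that idea, using only regular expressions from the base class library: line breaks for br and p, a dash for each li, then plain tag stripping and entity decoding. SimpleHtmlToText is a hypothetical name, and like any regex approach this assumes reasonably tidy HTML; it is not a renderer.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class SimpleHtmlToText
{
    public static string Convert(string html)
    {
        const RegexOptions Options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        // Drop script and style blocks entirely, including their content.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "", Options);

        // Source whitespace is not significant in HTML, so collapse it first.
        text = Regex.Replace(text, @"\s+", " ");

        // Replace a few layout tags with text equivalents.
        text = Regex.Replace(text, @"<br\s*/?>", "\n", Options);
        text = Regex.Replace(text, @"</p\s*>", "\n\n", Options);
        text = Regex.Replace(text, @"<li\b[^>]*>", "\n- ", Options);

        // Strip whatever tags remain, then decode entities such as &amp;.
        text = Regex.Replace(text, @"<[^>]+>", "");
        return WebUtility.HtmlDecode(text).Trim();
    }
}
```

Tables could be handled the same way, for example by turning each closing tr into a line break and each closing td into a tab.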
Here is the short sweet answer using HtmlAgilityPack. You can run this in LinqPad.
var html = "<div>..whatever html</div>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var plainText = doc.DocumentNode.InnerText;
I simply use HtmlAgilityPack in any .NET project that needs HTML parsing. It's simple, reliable, and fast.
Update - you are correct that the above removes tags but does not decode the escaped characters. This will do it:
var a = "This & that";
var result = System.Web.HttpUtility.HtmlDecode(a);
result.Dump();
Using the two together you can get the plain text from the HTML.
Another post suggests the HTML agility pack:
This is an agile HTML parser that
builds a read/write DOM and supports
plain XPATH or XSLT (you actually
don't HAVE to understand XPATH nor
XSLT to use it, don't worry...). It is
a .NET code library that allows you to
parse "out of the web" HTML files. The
parser is very tolerant with "real
world" malformed HTML. The object
model is very similar to what proposes
System.Xml, but for HTML documents (or
streams).
I have used Detagger in the past. It does a pretty good job of formatting the HTML as text and is more than just a tag remover.
This function converts "what you see in the browser" to plain text with line breaks. (If you want to see the result in the browser, just use the commented-out return value.)
public string HtmlFileToText(string filePath)
{
using (var browser = new WebBrowser())
{
string text = File.ReadAllText(filePath);
browser.ScriptErrorsSuppressed = true;
browser.Navigate("about:blank");
browser?.Document?.OpenNew(false);
browser?.Document?.Write(text);
return browser.Document?.Body?.InnerText;
//return browser.Document?.Body?.InnerText.Replace(Environment.NewLine, "<br />");
}
}
I don't know C#, but there is a fairly small & easy to read python html2txt script here: http://www.aaronsw.com/2002/html2text/
I have recently blogged on a solution that worked for me by using a Markdown XSLT file to transform the HTML Source. The HTML source will of course need to be valid XML first
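Try this simple approach: just call StripHTML(WebBrowserControl_name);
public string StripHTML(WebBrowser webp)
{
try
{
// Assumption: the original snippet never declared doc; here it is taken from the control's underlying MSHTML document.
IHTMLDocument2 doc = (IHTMLDocument2)webp.Document.DomDocument;
doc.execCommand("SelectAll", true, null);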
IHTMLSelectionObject currentSelection = doc.selection;
if (currentSelection != null)
{
IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
if (range != null)
{
currentSelection.empty();
return range.text;
}
}
}
catch (Exception ep)
{
//MessageBox.Show(ep.Message);
}
return "";
}
In GeneXus you can do it with a regex:
&pattern = '<[^>]+>'
&TSTRPNOT=&TSTRPNOT.ReplaceRegEx(&pattern,"")
If you are using .NET Framework 4.5, you can use System.Net.WebUtility.HtmlDecode(), which takes an HTML-encoded string and returns a decoded string.
Documented on MSDN at: http://msdn.microsoft.com/en-us/library/system.net.webutility.htmldecode(v=vs.110).aspx
You can use this in a Windows Store app as well.
You can use the WebBrowser control to render your HTML content in memory. After the LoadCompleted event has fired...
IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
string innerHTML = htmlDoc.body.innerHTML;
string innerText = htmlDoc.body.innerText;
This is another solution to convert HTML to Text or RTF in C#:
SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
string text = h.ConvertString(htmlString);
This library is not free; it is a commercial product, and it is my own product.

Analysing C# source with Irony

This is what my team and I chose to do for our school project, although we haven't actually decided how to parse the C# source files yet.
What we are aiming to do is perform a full analysis of a C# source file and produce a report.
The report should describe what is happening in the code.
It only has to contain:
string literals
method names
variable names
field names
etc.
I'm in charge of looking into the Irony library. To be honest, I don't know the best way to sort the data into a clean, readable report. I am using the C# grammar class that ships in the zip.
Is there any step at which I can properly identify each node's children (e.g. using directives, namespace declaration, class declaration, method body)?
Any help or advice would be very much appreciated. Thanks.
EDIT: Sorry, I forgot to say we need to analyse the method calls too.
Your main goal is to master the basics of formal languages. A good starting point can be found here. That article describes how to use Irony, taking the grammar of a simple numeric calculator as its example.
Suppose you want to parse a file containing C# code, and you know its path:
private void ParseForLongMethods(string path)
{
_parser = new Parser(new CSharpGrammar());
if (_parser == null || !_parser.Language.CanParse()) return;
_parseTree = null;
GC.Collect(); //to avoid disruption of perf times with occasional collections
_parser.Context.SetOption(ParseOptions.TraceParser, true);
try
{
string contents = File.ReadAllText(path);
_parser.Parse(contents);//, "<source>");
}
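catch (Exception ex)
{
// Parse failures are deliberately swallowed here; the finally block still reads whatever tree was built.
}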
finally
{
_parseTree = _parser.Context.CurrentParseTree;
TraverseParseTree();
}
}
And here is the traversal method itself, which collects some information from the nodes. This code counts the number of statements in every method of the class. If you have any questions, you are always welcome to ask me.
private void TraverseParseTree()
{
if (_parseTree == null) return;
ParseNodeRec(_parseTree.Root);
}
private void ParseNodeRec(ParseTreeNode node)
{
if (node == null) return;
string functionName = "";
if (node.ToString().CompareTo("class_declaration") == 0)
{
ParseTreeNode tmpNode = node.ChildNodes[2];
currentClass = tmpNode.AstNode.ToString();
}
if (node.ToString().CompareTo("method_declaration") == 0)
{
foreach (var child in node.ChildNodes)
{
if (child.ToString().CompareTo("qual_name_with_targs") == 0)
{
ParseTreeNode tmpNode = child.ChildNodes[0];
while (tmpNode.ChildNodes.Count != 0)
{ tmpNode = tmpNode.ChildNodes[0]; }
functionName = tmpNode.AstNode.ToString();
}
if (child.ToString().CompareTo("method_body") == 0) //method_declaration
{
int statementsCount = FindStatements(child);
//Register bad smell
if (statementsCount>(((LongMethodsOptions)this.Options).MaxMethodLength))
{
//function.StartPoint.Line
int functionLine = GetLine(functionName);
foundSmells.Add(new BadSmellRegistry(name, functionLine,currentFile,currentProject,currentSolution,false));
}
}
}
}
foreach (var child in node.ChildNodes)
{ ParseNodeRec(child); }
}
I'm not sure this is what you need, but you could use the CodeDom and CodeDom.Compiler namespaces to compile the C# code, and then analyze the results using Reflection, something like:
// Create the assembly in memory
CodeSnippetCompileUnit code = new CodeSnippetCompileUnit(classCode);
CSharpCodeProvider provider = new CSharpCodeProvider();
// compileParams was not declared in the original snippet; in-memory generation is assumed here
CompilerParameters compileParams = new CompilerParameters { GenerateInMemory = true };
CompilerResults results = provider.CompileAssemblyFromDom(compileParams, code);
foreach (var type in results.CompiledAssembly.GetTypes())
{
// Your analysis goes here
}
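As a rough sketch of what the analysis placeholder above might contain, the reflection calls below list the method and field names that a type declares. ReflectionReport and SampleBar are hypothetical names used only for illustration; they are not part of the original answer. Keep in mind that these calls only see metadata, so string literals and local variable names are not exposed by them and would still need a parser-based approach.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical helper: collects names declared directly on a type.
public static class ReflectionReport
{
    // Public and non-public, instance and static, but no inherited members.
    private const BindingFlags Declared =
        BindingFlags.Public | BindingFlags.NonPublic |
        BindingFlags.Instance | BindingFlags.Static | BindingFlags.DeclaredOnly;

    public static List<string> MethodNames(Type type)
    {
        return type.GetMethods(Declared)
                   .Where(m => !m.IsSpecialName)        // skip property/event accessors
                   .Where(m => !m.Name.Contains("<"))   // skip compiler-generated methods
                   .Select(m => m.Name)
                   .OrderBy(n => n, StringComparer.Ordinal)
                   .ToList();
    }

    public static List<string> FieldNames(Type type)
    {
        return type.GetFields(Declared)
                   .Where(f => !f.Name.Contains("<"))   // skip auto-property backing fields
                   .Select(f => f.Name)
                   .OrderBy(n => n, StringComparer.Ordinal)
                   .ToList();
    }
}

// Hypothetical sample type standing in for a type from the compiled assembly.
public class SampleBar
{
    private int _count;
    public string Name;
    public void Open() { _count++; }
    private static int Total() { return 0; }
}
```

Inside the foreach loop you could then call ReflectionReport.MethodNames(type) and ReflectionReport.FieldNames(type) for each compiled type and append the results to the report.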
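Update: In VS2015 you can use the new C# compiler (a.k.a. Roslyn) to do the same, for example:
// tree is assumed to be a SyntaxTree for the source being analysed, e.g. from CSharpSyntaxTree.ParseText(...)
var root = (CompilationUnitSyntax)tree.GetRoot();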
var compilation = CSharpCompilation.Create("HelloTDN")
.AddReferences(references: new[] { MetadataReference.CreateFromAssembly(typeof(object).Assembly) })
.AddSyntaxTrees(tree);
var model = compilation.GetSemanticModel(tree);
var nameInfo = model.GetSymbolInfo(root.Usings[0].Name);
var systemSymbol = (INamespaceSymbol)nameInfo.Symbol;
foreach (var ns in systemSymbol.GetNamespaceMembers())
{
Console.WriteLine(ns.Name);
}

How can I Convert HTML to Text in C#?

I'm looking for C# code to convert an HTML document to plain text.
I'm not looking for simple tag stripping, but for something that will output plain text with reasonable preservation of the original layout.
The output should look like this:
Html2Txt at W3C
I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?
EDIT: I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does HTML-to-text conversion)! All it did was strip the tags, flatten the tables, etc. The output didn't look anything like what Html2Txt @ W3C produced. Too bad that source doesn't seem to be available.
I was looking to see if there is a more "canned" solution available.
EDIT 2: Thank you, everybody, for your suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture the stdout with ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process vs. doing it in code. Plus, Lynx is FAST!!
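For anyone who wants to try the same route, here is a minimal sketch of the Process-based wrapper described in EDIT 2. It assumes lynx.exe is reachable on the PATH (or that you pass its full path); the LynxDump class and its method names are made up for illustration and are not the code the poster ended up with.

```csharp
using System;
using System.Diagnostics;

// Hypothetical wrapper around "lynx -dump".
public static class LynxDump
{
    // Builds the start info; kept separate so it can be inspected without launching anything.
    public static ProcessStartInfo BuildStartInfo(string htmlPath, string lynxPath = "lynx.exe")
    {
        return new ProcessStartInfo
        {
            FileName = lynxPath,                        // assumption: lynx is installed and on the PATH
            Arguments = "-dump \"" + htmlPath + "\"",   // -dump writes the rendered text to stdout
            UseShellExecute = false,                    // required for stream redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };
    }

    // Runs lynx and returns everything it wrote to standard output.
    public static string Run(string htmlPath, string lynxPath = "lynx.exe")
    {
        using (Process process = Process.Start(BuildStartInfo(htmlPath, lynxPath)))
        {
            // Read before waiting, so a full stdout pipe cannot block the child process.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }
}
```

Usage would be something like string text = LynxDump.Run(@"C:\temp\page.html"); where the path is just a placeholder.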
Just a note about the HtmlAgilityPack for posterity. The project contains an example of converting HTML to text which, as noted by the OP, does not handle whitespace at all the way anyone writing HTML would expect. There are full text-rendering solutions out there, noted by others in answers to this question; this is not one of them (it cannot even handle tables in its current form), but it is lightweight and fast, which is all I wanted for creating a simple text version of HTML emails.
using System.IO;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
//small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
public static class HtmlToText
{
public static string Convert(string path)
{
HtmlDocument doc = new HtmlDocument();
doc.Load(path);
return ConvertDoc(doc);
}
public static string ConvertHtml(string html)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
return ConvertDoc(doc);
}
public static string ConvertDoc (HtmlDocument doc)
{
using (StringWriter sw = new StringWriter())
{
ConvertTo(doc.DocumentNode, sw);
sw.Flush();
return sw.ToString();
}
}
internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
foreach (HtmlNode subnode in node.ChildNodes)
{
ConvertTo(subnode, outText, textInfo);
}
}
public static void ConvertTo(HtmlNode node, TextWriter outText)
{
ConvertTo(node, outText, new PreceedingDomTextInfo(false));
}
internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
string html;
switch (node.NodeType)
{
case HtmlNodeType.Comment:
// don't output comments
break;
case HtmlNodeType.Document:
ConvertContentTo(node, outText, textInfo);
break;
case HtmlNodeType.Text:
// script and style must not be output
string parentName = node.ParentNode.Name;
if ((parentName == "script") || (parentName == "style"))
{
break;
}
// get text
html = ((HtmlTextNode)node).Text;
// is it in fact a special closing node output as text?
if (HtmlNode.IsOverlappedClosingElement(html))
{
break;
}
// check the text is meaningful and not a bunch of whitespaces
if (html.Length == 0)
{
break;
}
if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
{
html= html.TrimStart();
if (html.Length == 0) { break; }
textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
}
outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
{
outText.Write(' ');
}
break;
case HtmlNodeType.Element:
string endElementString = null;
bool isInline;
bool skip = false;
int listIndex = 0;
switch (node.Name)
{
case "nav":
skip = true;
isInline = false;
break;
case "body":
case "section":
case "article":
case "aside":
case "h1":
case "h2":
case "header":
case "footer":
case "address":
case "main":
case "div":
case "p": // stylistic - adjust as you tend to use
if (textInfo.IsFirstTextOfDocWritten)
{
outText.Write("\r\n");
}
endElementString = "\r\n";
isInline = false;
break;
case "br":
outText.Write("\r\n");
skip = true;
textInfo.WritePrecedingWhiteSpace = false;
isInline = true;
break;
case "a":
if (node.Attributes.Contains("href"))
{
string href = node.Attributes["href"].Value.Trim();
if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase)==-1)
{
endElementString = "<" + href + ">";
}
}
isInline = true;
break;
case "li":
if(textInfo.ListIndex>0)
{
outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
}
else
{
outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
}
isInline = false;
break;
case "ol":
listIndex = 1;
goto case "ul";
case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
endElementString = "\r\n";
isInline = false;
break;
case "img": //inline-block in reality
if (node.Attributes.Contains("alt"))
{
outText.Write('[' + node.Attributes["alt"].Value);
endElementString = "]";
}
if (node.Attributes.Contains("src"))
{
outText.Write('<' + node.Attributes["src"].Value + '>');
}
isInline = true;
break;
default:
isInline = true;
break;
}
if (!skip && node.HasChildNodes)
{
ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten){ ListIndex = listIndex });
}
if (endElementString != null)
{
outText.Write(endElementString);
}
break;
}
}
}
internal class PreceedingDomTextInfo
{
public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
{
IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
}
public bool WritePrecedingWhiteSpace {get;set;}
public bool LastCharWasSpace { get; set; }
public readonly BoolWrapper IsFirstTextOfDocWritten;
public int ListIndex { get; set; }
}
internal class BoolWrapper
{
public BoolWrapper() { }
public bool Value { get; set; }
public static implicit operator bool(BoolWrapper boolWrapper)
{
return boolWrapper.Value;
}
public static implicit operator BoolWrapper(bool boolWrapper)
{
return new BoolWrapper{ Value = boolWrapper };
}
}
As an example, the following HTML code...
<!DOCTYPE HTML>
<html>
<head>
</head>
<body>
<header>
Whatever Inc.
</header>
<main>
<p>
Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
</p>
<ol>
<li>
Please confirm this is your email by replying.
</li>
<li>
Then perform this step.
</li>
</ol>
<p>
Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
</p>
<ul>
<li>
a point.
</li>
<li>
another point, with a <a href="http://en.wikipedia.org/wiki/Hyperlink">hyperlink</a>.
</li>
</ul>
<p>
Sincerely,
</p>
<p>
The whatever.com team
</p>
</main>
<footer>
Ph: 000 000 000<br/>
mail: whatever st
</footer>
</body>
</html>
...will be transformed into:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
1. Please confirm this is your email by replying.
2. Then perform this step.
Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please:
* a point.
* another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
...as opposed to:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
Please confirm this is your email by replying.
Then perform this step.
Please solve this . Then, in any order, could you please:
a point.
another point, with a hyperlink.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
You could use this:
public static string StripHTML(string HTMLText, bool decode = true)
{
Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
var stripped = reg.Replace(HTMLText, "");
return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
}
Updated: thanks for the comments, I have improved this function.
I've heard from a reliable source that, if you're doing HTML parsing in .NET, you should look at the HTML Agility Pack again.
http://www.codeplex.com/htmlagilitypack
Some samples on SO:
HTML Agility pack - parsing tables
What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other Text browsers...This is much harder to do than you would expect.
I had some decoding issues with HtmlAgility and I didn't want to invest time investigating it.
Instead I used that utility from the Microsoft Team Foundation API:
var text = HtmlFilter.ConvertToPlainText(htmlContent);
Have you tried http://www.aaronsw.com/2002/html2text/? It's Python, but open source.
Assuming you have well-formed HTML, you could also try an XSL transform.
Here's an example:
using System;
using System.IO;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml.Xsl;
class Html2TextExample
{
public static string Html2Text(XDocument source)
{
var writer = new StringWriter();
Html2Text(source, writer);
return writer.ToString();
}
public static void Html2Text(XDocument source, TextWriter output)
{
Transformer.Transform(source.CreateReader(), null, output);
}
public static XslCompiledTransform _transformer;
public static XslCompiledTransform Transformer
{
get
{
if (_transformer == null)
{
_transformer = new XslCompiledTransform();
var xsl = XDocument.Parse(@"<?xml version='1.0'?><xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" exclude-result-prefixes=""xsl""><xsl:output method=""html"" indent=""yes"" version=""4.0"" omit-xml-declaration=""yes"" encoding=""UTF-8"" /><xsl:template match=""/""><xsl:value-of select=""."" /></xsl:template></xsl:stylesheet>");
_transformer.Load(xsl.CreateNavigator());
}
return _transformer;
}
}
static void Main(string[] args)
{
var html = XDocument.Parse("<html><body><div>Hello world!</div></body></html>");
var text = Html2Text(html);
Console.WriteLine(text);
}
}
Because I wanted conversion to plain text with line feeds and bullets, I found this neat solution on CodeProject, which covers many conversion use cases:
Convert HTML to Plain Text
Yes, it looks big, but it works fine.
The easiest would probably be tag stripping combined with replacement of some tags with text layout elements like dashes for list elements (li) and line breaks for br's and p's.
It shouldn't be too hard to extend this to tables.
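A minimal sketch of that idea, assuming regex-based replacement is good enough for your input (it will not cope with tables, scripts, or badly broken markup). SimpleHtmlToText is a made-up name; the only library calls used are Regex.Replace and System.Net.WebUtility.HtmlDecode.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: tag stripping plus a few layout substitutions.
public static class SimpleHtmlToText
{
    public static string Convert(string html)
    {
        // Line breaks for <br> and for the end of each paragraph.
        string text = Regex.Replace(html, @"<br\s*/?>|</p\s*>", "\n", RegexOptions.IgnoreCase);

        // A dash in front of each list item (<li> or <li ...>, but not <link>).
        text = Regex.Replace(text, @"<li(\s[^>]*)?>", "\n- ", RegexOptions.IgnoreCase);

        // Strip every remaining tag.
        text = Regex.Replace(text, @"<[^>]+>", string.Empty);

        // Turn entities such as &amp; back into characters.
        return WebUtility.HtmlDecode(text).Trim();
    }
}
```

Extending it to tables would mean adding similar substitutions for tr and td before the generic tag-stripping step.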
Here is the short sweet answer using HtmlAgilityPack. You can run this in LinqPad.
var html = "<div>..whatever html</div>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var plainText = doc.DocumentNode.InnerText;
I simply use HtmlAgilityPack in any .NET project that needs HTML parsing. It's simple, reliable, and fast.
Update - you are correct that the above removes tags but does not decode the escaped characters. This will do it:
var a = "This &amp; that";
var result = System.Web.HttpUtility.HtmlDecode(a);
result.Dump();
Using the two together you can get the plain text from the HTML.
Another post suggests the HTML agility pack:
This is an agile HTML parser that
builds a read/write DOM and supports
plain XPATH or XSLT (you actually
don't HAVE to understand XPATH nor
XSLT to use it, don't worry...). It is
a .NET code library that allows you to
parse "out of the web" HTML files. The
parser is very tolerant with "real
world" malformed HTML. The object
model is very similar to what proposes
System.Xml, but for HTML documents (or
streams).
I have used Detagger in the past. It does a pretty good job of formatting the HTML as text and is more than just a tag remover.
This function converts "what you see in the browser" to plain text with line breaks. (If you want to see the result in the browser, use the commented-out return value instead.)
public string HtmlFileToText(string filePath)
{
using (var browser = new WebBrowser())
{
string text = File.ReadAllText(filePath);
browser.ScriptErrorsSuppressed = true;
browser.Navigate("about:blank");
browser?.Document?.OpenNew(false);
browser?.Document?.Write(text);
return browser.Document?.Body?.InnerText;
//return browser.Document?.Body?.InnerText.Replace(Environment.NewLine, "<br />");
}
}
I don't know C#, but there is a fairly small & easy to read python html2txt script here: http://www.aaronsw.com/2002/html2text/
I have recently blogged on a solution that worked for me by using a Markdown XSLT file to transform the HTML Source. The HTML source will of course need to be valid XML first
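Try this simple approach: just call StripHTML(WebBrowserControl_name);
public string StripHTML(WebBrowser webp)
{
try
{
// Assumption: the original snippet never declared doc; here it is taken from the control's underlying MSHTML document.
IHTMLDocument2 doc = (IHTMLDocument2)webp.Document.DomDocument;
doc.execCommand("SelectAll", true, null);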
IHTMLSelectionObject currentSelection = doc.selection;
if (currentSelection != null)
{
IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
if (range != null)
{
currentSelection.empty();
return range.text;
}
}
}
catch (Exception ep)
{
//MessageBox.Show(ep.Message);
}
return "";
}
In GeneXus you can do it with a regex:
&pattern = '<[^>]+>'
&TSTRPNOT=&TSTRPNOT.ReplaceRegEx(&pattern,"")
If you are using .NET Framework 4.5, you can use System.Net.WebUtility.HtmlDecode(), which takes an HTML-encoded string and returns a decoded string.
Documented on MSDN at: http://msdn.microsoft.com/en-us/library/system.net.webutility.htmldecode(v=vs.110).aspx
You can use this in a Windows Store app as well.
You can use the WebBrowser control to render your HTML content in memory. After the LoadCompleted event has fired...
IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
string innerHTML = htmlDoc.body.innerHTML;
string innerText = htmlDoc.body.innerText;
This is another solution to convert HTML to Text or RTF in C#:
SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
string text = h.ConvertString(htmlString);
This library is not free; it is a commercial product, and it is my own product.
