C# Reading from a custom file

Okay so, I have read that .INI files have become obsolete now and the .NET Framework creators want us to use .XML files. However, I feel that it would be difficult for some of my users to use .XML files so I thought of creating a custom config file.
I have a call that takes three parameters (it is actually the snippet function in Scintilla), like this:
Snippet.Insert("Name", "Code", 'char');
Now I want to insert all snippets from a file that users can edit themselves, including the name, code and character, but I have no clue how to do this. Maybe something like:
[Snippet1] [Snippet1Code] [Snippet1Char]
[Snippet2] [Snippet2Code] [Snippet2Char]
[Snippet3] [Snippet3Code] [Snippet3Char]
However, I don't know how I would read the values that way. If someone can tell me an efficient way to read snippets from a file, I would be really grateful.

As others have suggested, and similar to @gmcalab's approach, here is a quick example using LINQ to XML.
public class Snippet
{
public string Name { get; set; }
public string Code { get; set; }
public char Character { get; set; }
}
XDocument doc = XDocument.Parse(@"<snippets>
<snippet name='snippet1' character='a'>
<![CDATA[
// Code goes here
]]>
</snippet>
<snippet name='snippet2' character='b'>
<![CDATA[
// Code goes here
]]>
</snippet>
</snippets>");
List<Snippet> snippetsList = (from snippets in doc.Descendants("snippet")
select new Snippet
{
Name = snippets.Attribute("name").Value,
Character = Convert.ToChar(snippets.Attribute("character").Value),
Code = snippets.Value
}).ToList();
snippetsList.ForEach(s => Console.WriteLine(s.Code));

If you prefer INI files, I've read good things about Nini.
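If you would rather keep the bracketed, one-line-per-snippet layout proposed in the question, the following is a minimal sketch of how such lines could be read. The class name BracketSnippetParser and the exact line format are assumptions made for illustration, and the sketch cannot handle snippet code that itself contains a ']' character, which is one reason the XML approach is more robust.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: parses lines shaped like "[Name] [Code] [c]".
public static class BracketSnippetParser
{
    // Three bracketed fields per line; the last field must be a single character.
    private static readonly Regex LinePattern = new Regex(
        @"^\s*\[(?<name>[^\]]*)\]\s*\[(?<code>[^\]]*)\]\s*\[(?<ch>[^\]])\]\s*$");

    public static List<Tuple<string, string, char>> Parse(IEnumerable<string> lines)
    {
        var result = new List<Tuple<string, string, char>>();
        foreach (string line in lines)
        {
            Match m = LinePattern.Match(line);
            if (!m.Success)
            {
                continue; // skip blank or malformed lines
            }
            result.Add(Tuple.Create(
                m.Groups["name"].Value,
                m.Groups["code"].Value,
                m.Groups["ch"].Value[0]));
        }
        return result;
    }
}
```

Each resulting tuple could then be passed to Snippet.Insert(item.Item1, item.Item2, item.Item3), with the lines coming from File.ReadLines on the user's file.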

Why not just set up your snippets in the XML you talked about, then read it with XmlTextReader?
Pretty straightforward...
<snippets>
<snippet name="Name1" char="a">
<snippetCode>
for(int i = 0; i &lt; 10; i++) { // do work }
</snippetCode>
</snippet>
<snippet name="Name2" char="b">
<snippetCode>
foreach(var item in items) { // do work }
</snippetCode>
</snippet>
</snippets>
C#
public class Snippet
{
public string Name {get;set;}
public char Char { get; set;}
public string Value { get; set;}
}
List<Snippet> Snippets = new List<Snippet>();
XmlTextReader reader = new XmlTextReader ("snippets.xml");
Snippet snippet = new Snippet();
while (reader.Read())
{
// Do some work here on the data.
switch (reader.NodeType)
{
case XmlNodeType.Element:
if(reader.Name == "snippet"){
while (reader.MoveToNextAttribute())
{
if(reader.Name == "name"){
// get name attribute value (XML names are case-sensitive)
snippet.Name = reader.Value;
}
else if(reader.Name == "char"){
// get char attribute value
snippet.Char = Char.Parse(reader.Value);
}
}
}
break;
case XmlNodeType.Text: //Display the text in each element.
//Here we get the actual snippet code
snippet.Value = reader.Value;
break;
case XmlNodeType.EndElement:
if(reader.Name == "snippet"){
// Add snippet to list
Snippets.Add(snippet);
// Create a new Snippet object
snippet = new Snippet();
}
break;
}
}

Related

How to write and read list<> from text files in C#

English is not my native language and I am a newbie, so don't laugh at me.
I want to create a class in C# that helps me save data to a file and read it back easily. It works like this:
RealObject john = new RealObject("John");
john.AddCharacter("Full Name", "Something John");
john.AddCharacter("Grade", new List<double> { 9.9, 8.8, 7.7 });
await john.SaveToFileAsync("Test.ini");
RealObject student = new RealObject("John");
await student.ReadFromFileAsync("Test.ini");
Type valueType = student.GetCharacter("Grade").Type;
List<double> johnGrade = (List<double>) student.GetCharacter("Grade").Value;
The file "Test.ini" looks like this:
S_Obj_John
Name System.String Something John
Grade System.Collections.Generic.List`1[System.Double] 9.9;8.8;7.7
E_Obj_John
I have some questions:
Question 1. Can you give me some libraries that do this job for me, please?
Question 2. My code is too redundant, how can I optimize it?
2.1 Saving code: I have to write similar functions: ByteListToString, IntListToString, DoubleListToString,...
private static string ListToString(Type listType, object listValue)
{
string typeString = GetBaseTypeOfList(listType);
if (typeString == null)
{
return null;
}
switch (typeString)
{
case "Byte":
return ByteListToString(listValue);
..........
default:
return null;
}
}
private static string ByteListToString(object listValue)
{
List<byte> values = (List<byte>) listValue;
string text = "";
for (int i = 0; i < values.Count; i++)
{
if (i > 0)
{
text += ARRAY_SEPARATING_SYMBOL;
}
text += values[i].ToString();
}
return text;
}
2.2 Reading code: I have to write similar functions: StringToByteList, StringToIntList, StringToDoubleList,...
public static object StringToList(Type listType, string listValueString)
{
string typeString = GetBaseTypeOfList(listType);
if (typeString == null)
{
return null;
}
switch (typeString)
{
case "Byte":
return StringToByteList(listValueString);
..........
default:
return null;
}
}
private static List<byte> StringToByteList(string listValueString)
{
var valuesString = listValueString.Split(ARRAY_SEPARATING_SYMBOL);
List<byte> values = new List<byte>(valuesString.Length);
foreach (var v in valuesString)
{
byte tmp;
if (byte.TryParse(v, out tmp))
{
values.Add(tmp);
}
}
return values;
}
Thank you for your help
There are two common ways to "serialize" data, which is a fancy term for taking an object and turning it into a string. Then on the other side you can "deserialize" that string and turn it back into an object. Many folks like JSON because it is really simple; XML is still used and can be useful for complex structures, but for simple classes JSON is really nice.
I would check out https://www.json.org/ and explore; libraries exist that will serialize and deserialize for you, which is nice. Trying to do it with string manipulation is not recommended, as most people (including me) will mess it up.
The idea, though, is to start and end with objects: take an object and serialize it to save it to the file, then read that text back from the file and deserialize it into an object.
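As a rough illustration of that round trip, here is a minimal sketch using System.Text.Json, which is one library choice among several (it is built into .NET Core 3.0 and later). The Student class and the StudentStore helper are hypothetical names invented for this example; they are not part of the question's RealObject design.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Hypothetical data class standing in for the question's "John" object.
public class Student
{
    public string Name { get; set; }
    public string FullName { get; set; }
    public List<double> Grades { get; set; } = new List<double>();
}

// Hypothetical helper that saves and loads a Student as JSON text.
public static class StudentStore
{
    public static void Save(Student student, string path)
    {
        // Serialize: object -> JSON string, indented so users can read the file.
        var options = new JsonSerializerOptions { WriteIndented = true };
        File.WriteAllText(path, JsonSerializer.Serialize(student, options));
    }

    public static Student Load(string path)
    {
        // Deserialize: JSON string -> object; the list type is restored automatically.
        return JsonSerializer.Deserialize<Student>(File.ReadAllText(path));
    }
}
```

Because the serializer handles List<double> (and other element types) on its own, the per-type helpers such as ByteListToString and StringToByteList are no longer needed.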

Issues creating and writing data to a CSV file using C#

I'm using C# Code in Ranorex 5.4.2 to create a CSV file, have data gathered from an XML file and then have it write this into the CSV file. I've managed to get this process to work but I'm experiencing an issue where there are 12 blank lines created beneath the gathered data.
I have a file called CreateCSVFile which creates the CSV file and adds the headers in, the code looks like this:
writer.WriteLine("PolicyNumber,Surname,Postcode,HouseNumber,StreetName,CityName,CountyName,VehicleRegistrationPlate,VehicleMake,VehicleModel,VehicleType,DateRegistered,ABICode");
writer.WriteLine("");
writer.Flush();
writer.Close();
The next one to run is MineDataFromOutputXML. The program I am automating provides insurance quotes, and an output XML file is created containing the client's details. I've set up a mining process with a variable declared at the top, like this:
string _PolicyHolderSurname = "";
[TestVariable("3E92E370-F960-477B-853A-0F61BEA62B7B")]
public string PolicyHolderSurname
{
get { return _PolicyHolderSurname; }
set { _PolicyHolderSurname = value; }
}
and then there is another section of code which gathers the information from the XML file:
var QuotePolicyHolderSurname = (XmlElement)xmlDoc.SelectSingleNode("//cipSurname");
string QuotePolicyHolderSurnameAsString = QuotePolicyHolderSurname.InnerText.ToString();
PolicyHolderSurname = QuotePolicyHolderSurnameAsString;
Report.Info( "Policy Holder Surname As String = " + QuotePolicyHolderSurnameAsString);
Report.Info( "Quote Policy Holder Surname = " + QuotePolicyHolderSurname.InnerText);
The final file is called SetDataSource and it puts the information into the CSV file, there is a variable declared at the top like this:
string _PolicyHolderSurname = "";
[TestVariable("222D47D2-6F66-4F05-BDAF-7D3B9D335647")]
public string PolicyHolderSurname
{
get { return _PolicyHolderSurname; }
set { _PolicyHolderSurname = value; }
}
This is then the code that adds it into the CSV file:
string Surname = PolicyHolderSurname;
Report.Info("Surname = " + Surname);
dataConn.Rows.Add(new string[] { Surname });
dataConn.Store();
There are multiple items in the Mine and SetDataSource files and the output looks like this in Notepad++:
[Image: the CSV file after the code has been run]
I believe the problem lies in the CreateCSVFile and the writer.WriteLine function. I have commented this region out but it then produces the CSV with just the headers showing.
I've asked some of the developers I work with but most don't know C# very well and no one has been able to solve this issue yet. If it makes a difference this is on Windows Server 2012r2.
Any questions about this please ask, I can provide the whole files if needed, they're just quite long and repetitive.
Thanks
Ben Jardine
I had to do exactly the same thing in Ranorex. Since the question is a bit old I didn't check your code, but here is what I did, and it works. I found an example (probably on Stack Overflow) that creates a CSV file in C#, so here is my adaptation for use in a Ranorex UserCodeCollection:
[UserCodeCollection]
public class UserCodeCollectionDemo
{
[UserCodeMethod]
public static void ConvertXmlToCsv()
{
System.IO.File.Delete("E:\\Ranorex_test.csv");
XDocument doc = XDocument.Load("E:\\lang.xml");
string csvOut = string.Empty;
StringBuilder sColumnString = new StringBuilder(50000);
StringBuilder sDataString = new StringBuilder(50000);
foreach (XElement node in doc.Descendants(GetServerLanguage()))
{
foreach (XElement categoryNode in node.Elements())
{
foreach (XElement innerNode in categoryNode.Elements())
{
//"{0}," gives you the output in comma-separated format.
string sNodePath = categoryNode.Name + "_" + innerNode.Name;
sColumnString.AppendFormat("{0},", sNodePath);
sDataString.AppendFormat("{0},", innerNode.Value);
}
}
}
if ((sColumnString.Length > 1) && (sDataString.Length > 1))
{
sColumnString.Remove(sColumnString.Length-1, 1);
sDataString.Remove(sDataString.Length-1, 1);
}
string[] lines = { sColumnString.ToString(), sDataString.ToString() };
System.IO.File.WriteAllLines(@"E:\Ranorex_test.csv", lines);
}
}
For your information, a simplified version of my XML looks like this:
<LANGUAGE>
<ENGLISH ID="1033">
<TEXT>
<IDS_TEXT_CANCEL>Cancel</IDS_TEXT_CANCEL>
<IDS_TEXT_WARNING>Warning</IDS_TEXT_WARNING>
</TEXT>
<LOGINCLASS>
<IDS_LOGC_DLGTITLE>Log In</IDS_LOGC_DLGTITLE>
</LOGINCLASS>
</ENGLISH>
<FRENCH ID="1036">
<TEXT>
<IDS_TEXT_CANCEL>Annuler</IDS_TEXT_CANCEL>
<IDS_TEXT_WARNING>Attention</IDS_TEXT_WARNING>
</TEXT>
<LOGINCLASS>
<IDS_LOGC_DLGTITLE>Connexion</IDS_LOGC_DLGTITLE>
</LOGINCLASS>
</FRENCH>
</LANGUAGE>
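One thing the code above does not do is escape values, so a value containing a comma or a quote would shift the columns. The following is a minimal sketch of the usual CSV quoting convention (wrap the field in double quotes and double any embedded quotes); the CsvLine class is a hypothetical helper, not part of Ranorex.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for building one CSV line from raw field values.
public static class CsvLine
{
    // Quote a field only when it contains a comma, a quote, or a line break.
    public static string Escape(string field)
    {
        if (field == null)
        {
            return string.Empty;
        }
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // Join the escaped fields with commas; no trailing comma is produced.
    public static string Join(IEnumerable<string> fields)
    {
        return string.Join(",", fields.Select(Escape));
    }
}
```

Collecting the column names and values in two List<string> instances and calling CsvLine.Join on each would also remove the need to trim the trailing comma afterwards.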

How to convert HTML to plain text? [duplicate]

I'm looking for C# code to convert an HTML document to plain text.
I'm not looking for simple tag stripping, but something that will output plain text with a reasonable preservation of the original layout.
The output should look like this:
Html2Txt at W3C
I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?
EDIT: I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does HTML-to-text conversion)! All it did was strip the tags, flatten the tables, etc. The output didn't look anything like what Html2Txt @ W3C produced. Too bad that source doesn't seem to be available.
I was looking to see if there is a more "canned" solution available.
EDIT 2: Thank you everybody for your suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture the stdout with ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process vs. doing it in code. Plus, Lynx is FAST!!
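The approach described in EDIT 2 might look roughly like the sketch below. The LynxDump class and its method names are hypothetical, and the sketch assumes a Lynx executable is installed at the path you pass in; only the "-dump" switch mentioned above is used.

```csharp
using System.Diagnostics;

// Hypothetical wrapper around an external lynx executable.
public static class LynxDump
{
    // Builds the process settings: no shell, stdout redirected so it can be read.
    public static ProcessStartInfo BuildStartInfo(string lynxPath, string htmlPath)
    {
        return new ProcessStartInfo
        {
            FileName = lynxPath,                    // assumption: path to lynx.exe
            Arguments = "-dump \"" + htmlPath + "\"",
            UseShellExecute = false,                // required for stream redirection
            RedirectStandardOutput = true,          // capture the rendered text
            CreateNoWindow = true
        };
    }

    // Runs lynx and returns whatever it wrote to standard output.
    public static string Render(string lynxPath, string htmlPath)
    {
        using (Process process = Process.Start(BuildStartInfo(lynxPath, htmlPath)))
        {
            // Read before WaitForExit so a full pipe buffer cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }
}
```

Reading the output before calling WaitForExit matters: if the child fills the redirected pipe while the parent is only waiting, both sides can stall.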
Just a note about the HtmlAgilityPack for posterity. The project contains an example of converting HTML to text, which, as noted by the OP, does not handle whitespace at all the way anyone writing HTML would expect. There are full-text rendering solutions out there, noted by others on this question, which this is not (it cannot even handle tables in its current form), but it is lightweight and fast, which is all I wanted for creating a simple text version of HTML emails.
using System.IO;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
//small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
public static class HtmlToText
{
public static string Convert(string path)
{
HtmlDocument doc = new HtmlDocument();
doc.Load(path);
return ConvertDoc(doc);
}
public static string ConvertHtml(string html)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
return ConvertDoc(doc);
}
public static string ConvertDoc (HtmlDocument doc)
{
using (StringWriter sw = new StringWriter())
{
ConvertTo(doc.DocumentNode, sw);
sw.Flush();
return sw.ToString();
}
}
internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
foreach (HtmlNode subnode in node.ChildNodes)
{
ConvertTo(subnode, outText, textInfo);
}
}
public static void ConvertTo(HtmlNode node, TextWriter outText)
{
ConvertTo(node, outText, new PreceedingDomTextInfo(false));
}
internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
string html;
switch (node.NodeType)
{
case HtmlNodeType.Comment:
// don't output comments
break;
case HtmlNodeType.Document:
ConvertContentTo(node, outText, textInfo);
break;
case HtmlNodeType.Text:
// script and style must not be output
string parentName = node.ParentNode.Name;
if ((parentName == "script") || (parentName == "style"))
{
break;
}
// get text
html = ((HtmlTextNode)node).Text;
// is it in fact a special closing node output as text?
if (HtmlNode.IsOverlappedClosingElement(html))
{
break;
}
// check the text is meaningful and not a bunch of whitespaces
if (html.Length == 0)
{
break;
}
if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
{
html= html.TrimStart();
if (html.Length == 0) { break; }
textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
}
outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
{
outText.Write(' ');
}
break;
case HtmlNodeType.Element:
string endElementString = null;
bool isInline;
bool skip = false;
int listIndex = 0;
switch (node.Name)
{
case "nav":
skip = true;
isInline = false;
break;
case "body":
case "section":
case "article":
case "aside":
case "h1":
case "h2":
case "header":
case "footer":
case "address":
case "main":
case "div":
case "p": // stylistic - adjust as you tend to use
if (textInfo.IsFirstTextOfDocWritten)
{
outText.Write("\r\n");
}
endElementString = "\r\n";
isInline = false;
break;
case "br":
outText.Write("\r\n");
skip = true;
textInfo.WritePrecedingWhiteSpace = false;
isInline = true;
break;
case "a":
if (node.Attributes.Contains("href"))
{
string href = node.Attributes["href"].Value.Trim();
if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase)==-1)
{
endElementString = "<" + href + ">";
}
}
isInline = true;
break;
case "li":
if(textInfo.ListIndex>0)
{
outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
}
else
{
outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
}
isInline = false;
break;
case "ol":
listIndex = 1;
goto case "ul";
case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
endElementString = "\r\n";
isInline = false;
break;
case "img": //inline-block in reality
if (node.Attributes.Contains("alt"))
{
outText.Write('[' + node.Attributes["alt"].Value);
endElementString = "]";
}
if (node.Attributes.Contains("src"))
{
outText.Write('<' + node.Attributes["src"].Value + '>');
}
isInline = true;
break;
default:
isInline = true;
break;
}
if (!skip && node.HasChildNodes)
{
ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten){ ListIndex = listIndex });
}
if (endElementString != null)
{
outText.Write(endElementString);
}
break;
}
}
}
internal class PreceedingDomTextInfo
{
public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
{
IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
}
public bool WritePrecedingWhiteSpace {get;set;}
public bool LastCharWasSpace { get; set; }
public readonly BoolWrapper IsFirstTextOfDocWritten;
public int ListIndex { get; set; }
}
internal class BoolWrapper
{
public BoolWrapper() { }
public bool Value { get; set; }
public static implicit operator bool(BoolWrapper boolWrapper)
{
return boolWrapper.Value;
}
public static implicit operator BoolWrapper(bool boolWrapper)
{
return new BoolWrapper{ Value = boolWrapper };
}
}
As an example, the following HTML code...
<!DOCTYPE HTML>
<html>
<head>
</head>
<body>
<header>
Whatever Inc.
</header>
<main>
<p>
Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
</p>
<ol>
<li>
Please confirm this is your email by replying.
</li>
<li>
Then perform this step.
</li>
</ol>
<p>
Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
</p>
<ul>
<li>
a point.
</li>
<li>
another point, with a <a href="http://en.wikipedia.org/wiki/Hyperlink">hyperlink</a>.
</li>
</ul>
<p>
Sincerely,
</p>
<p>
The whatever.com team
</p>
</main>
<footer>
Ph: 000 000 000<br/>
mail: whatever st
</footer>
</body>
</html>
...will be transformed into:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
1. Please confirm this is your email by replying.
2. Then perform this step.
Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please:
* a point.
* another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
...as opposed to:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
Please confirm this is your email by replying.
Then perform this step.
Please solve this . Then, in any order, could you please:
a point.
another point, with a hyperlink.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
You could use this:
public static string StripHTML(string HTMLText, bool decode = true)
{
Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
var stripped = reg.Replace(HTMLText, "");
return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
}
Updated
Thanks for the comments; I have updated this function to improve it.
I've heard from a reliable source that, if you're doing HTML parsing in .NET, you should look at the HTML Agility Pack again.
http://www.codeplex.com/htmlagilitypack
A sample on SO:
HTML Agility pack - parsing tables
What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other text browsers. This is much harder to do than you would expect.
I had some decoding issues with HtmlAgility and I didn't want to invest time investigating it.
Instead I used that utility from the Microsoft Team Foundation API:
var text = HtmlFilter.ConvertToPlainText(htmlContent);
Have you tried http://www.aaronsw.com/2002/html2text/? It's Python, but open source.
Assuming you have well-formed HTML, you could also try an XSL transform.
Here's an example:
using System;
using System.IO;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml.Xsl;
class Html2TextExample
{
public static string Html2Text(XDocument source)
{
var writer = new StringWriter();
Html2Text(source, writer);
return writer.ToString();
}
public static void Html2Text(XDocument source, TextWriter output)
{
Transformer.Transform(source.CreateReader(), null, output);
}
public static XslCompiledTransform _transformer;
public static XslCompiledTransform Transformer
{
get
{
if (_transformer == null)
{
_transformer = new XslCompiledTransform();
var xsl = XDocument.Parse(@"<?xml version='1.0'?><xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" exclude-result-prefixes=""xsl""><xsl:output method=""html"" indent=""yes"" version=""4.0"" omit-xml-declaration=""yes"" encoding=""UTF-8"" /><xsl:template match=""/""><xsl:value-of select=""."" /></xsl:template></xsl:stylesheet>");
_transformer.Load(xsl.CreateNavigator());
}
return _transformer;
}
}
static void Main(string[] args)
{
var html = XDocument.Parse("<html><body><div>Hello world!</div></body></html>");
var text = Html2Text(html);
Console.WriteLine(text);
}
}
Because I wanted conversion to plain text with line feeds and bullets, I found this nice solution on CodeProject, which covers many conversion use cases:
Convert HTML to Plain Text
Yes, it looks big, but it works fine.
The easiest would probably be tag stripping combined with replacement of some tags with text layout elements like dashes for list elements (li) and line breaks for br's and p's.
It shouldn't be too hard to extend this to tables.
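A minimal sketch of that idea, using regular expressions, is shown below. The SimpleHtmlText class is a hypothetical name, the tag list is deliberately small, and regex-based stripping will misbehave on malformed or unusual markup, so treat it as an illustration rather than a robust converter.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: tag stripping plus a few layout substitutions.
public static class SimpleHtmlText
{
    public static string Convert(string html)
    {
        // Drop script and style blocks entirely, including their content.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        // Line breaks for <br> and for the end of each paragraph.
        text = Regex.Replace(text, @"<br\s*/?>", "\n", RegexOptions.IgnoreCase);
        text = Regex.Replace(text, @"</p\s*>", "\n", RegexOptions.IgnoreCase);
        // A dash on a new line for every list item.
        text = Regex.Replace(text, @"<li\b[^>]*>", "\n- ", RegexOptions.IgnoreCase);
        // Remove every remaining tag, then decode entities such as &amp;.
        text = Regex.Replace(text, @"<[^>]+>", "");
        return WebUtility.HtmlDecode(text).Trim();
    }
}
```

Extending it to tables would mean adding similar substitutions for row and cell tags before the final strip.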
Here is the short sweet answer using HtmlAgilityPack. You can run this in LinqPad.
var html = "<div>..whatever html</div>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var plainText = doc.DocumentNode.InnerText;
I simply use HtmlAgilityPack in any .NET project that needs HTML parsing. It's simple, reliable, and fast.
Update - you are correct that the above removes tags but does not decode the escaped characters. This will do it:
var a = "This &amp; that";
var result = System.Web.HttpUtility.HtmlDecode(a);
result.Dump();
Using the two together you can get the plain text from the HTML.
Another post suggests the HTML agility pack:
This is an agile HTML parser that
builds a read/write DOM and supports
plain XPATH or XSLT (you actually
don't HAVE to understand XPATH nor
XSLT to use it, don't worry...). It is
a .NET code library that allows you to
parse "out of the web" HTML files. The
parser is very tolerant with "real
world" malformed HTML. The object
model is very similar to what proposes
System.Xml, but for HTML documents (or
streams).
I have used Detagger in the past. It does a pretty good job of formatting the HTML as text and is more than just a tag remover.
This function converts "what you see in the browser" to plain text with line breaks. (If you want to see the result in the browser, use the commented-out return value instead.)
public string HtmlFileToText(string filePath)
{
using (var browser = new WebBrowser())
{
string text = File.ReadAllText(filePath);
browser.ScriptErrorsSuppressed = true;
browser.Navigate("about:blank");
browser?.Document?.OpenNew(false);
browser?.Document?.Write(text);
return browser.Document?.Body?.InnerText;
//return browser.Document?.Body?.InnerText.Replace(Environment.NewLine, "<br />");
}
}
I don't know C#, but there is a fairly small & easy to read python html2txt script here: http://www.aaronsw.com/2002/html2text/
I have recently blogged on a solution that worked for me by using a Markdown XSLT file to transform the HTML Source. The HTML source will of course need to be valid XML first
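Try the easy and usable way: just call StripHTML(WebBrowserControl_name);
public string StripHTML(WebBrowser webp)
{
try
{
// Assumption: "doc" (undefined in the original snippet) is the mshtml document behind the control.
IHTMLDocument2 doc = (IHTMLDocument2)webp.Document.DomDocument;
doc.execCommand("SelectAll", true, null);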
IHTMLSelectionObject currentSelection = doc.selection;
if (currentSelection != null)
{
IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
if (range != null)
{
currentSelection.empty();
return range.text;
}
}
}
catch (Exception ep)
{
//MessageBox.Show(ep.Message);
}
return "";
}
In GeneXus you can do it with a regex:
&pattern = '<[^>]+>'
&TSTRPNOT=&TSTRPNOT.ReplaceRegEx(&pattern,"")
If you are using .NET Framework 4.5 you can use System.Net.WebUtility.HtmlDecode(), which takes an HTML-encoded string and returns a decoded string.
Documented on MSDN at: http://msdn.microsoft.com/en-us/library/system.net.webutility.htmldecode(v=vs.110).aspx
You can use this in a Windows Store app as well.
You can use the WebBrowser control to render your HTML content in memory. After the LoadCompleted event has fired...
IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
string innerHTML = htmlDoc.body.innerHTML;
string innerText = htmlDoc.body.innerText;
This is another solution to convert HTML to Text or RTF in C#:
SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
string text = h.ConvertString(htmlString);
This library is not free; it is a commercial product, and it is my own product.

Reading XML node from C# when it's a resource in XAML

I'm having some trouble with the following. I'm a beginner, which is probably why.
I have a listbox that displays some pictures, it gets the paths of these pictures from an XML file. This XML file is defined as a resource in XAML. If a picture is selected and the user presses enter, I want to launch an external app with some parameters, including a path found in another node of that XML file (appath in the example below).
XML layout:
<picture>
<path></path>
<appath></appath>
</picture>
I can't seem to find the way to access the node from C#.
Any pointers greatly appreciated!
Thanks in advance,
J.
If you don't have any attributes in the picture node (an id of some sort), you first have to match on the path, which you should already have in your listbox, and then return the appath.
static string GetAppath(string xmlString, string picPath)
{
string appath = String.Empty;
XmlDocument xDoc = new XmlDocument();
xDoc.LoadXml(xmlString);
XmlNodeList xList = xDoc.SelectNodes("//picture"); // "//" finds picture nodes at any depth
foreach (XmlNode xNode in xList)
{
if (xNode["path"].InnerText == picPath)
{
appath = xNode["appath"].InnerText;
break;
}
}
return appath;
}
Assuming your xml file looks something like:
<?xml version="1.0" encoding="ISO-8859-1"?>
<pictures>
<picture>
<path></path>
<appath></appath>
</picture>
</pictures>
If your resource name is Pictures:
XElement resource = XElement.Parse(Properties.Resources.Pictures);
Using these extensions (just copy the class/file into the root directory of your project): http://searisen.com/xmllib/extensions.wiki
public class PicturesResource
{
XElement self;
public PicturesResource()
{ self = XElement.Parse(Properties.Resources.Pictures); }
public IEnumerable<Picture> Pictures
{ get { return self.GetEnumerable("picture", x => new Picture(x)); } }
}
public class Picture
{
XElement self;
public Picture(XElement self) { this.self = self; }
public string Path { get { return self.Get("path", string.Empty); } }
public string AppPath { get { return self.Get("appath", string.Empty); } }
}
You could then bind to the Pictures or do a lookup on them:
PicturesResource pictures = new PicturesResource();
foreach(Picture pic in pictures.Pictures)
{
string path = pic.Path;
string apppath = pic.AppPath;
}
Or searching for a particular picture:
Picture pic = pictures.Pictures.FirstOrDefault(p => p.Path == "some path");
if(pic != null)
{
// do something with pic
}
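If you would rather not depend on the third-party extensions, the same lookup can be sketched with plain LINQ to XML. The PictureLookup class is a hypothetical name, and the sketch assumes the XML text has already been obtained from the resource as a string.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: finds the appath that belongs to a given picture path.
public static class PictureLookup
{
    public static string FindAppath(string xml, string picturePath)
    {
        XElement root = XElement.Parse(xml);
        // DescendantsAndSelf also covers the case where <picture> is the root element.
        XElement match = root.DescendantsAndSelf("picture")
            .FirstOrDefault(p => (string)p.Element("path") == picturePath);
        // The explicit string conversion returns null when the element is missing.
        return match == null ? null : (string)match.Element("appath");
    }
}
```

The returned value could then be handed to Process.Start together with the other parameters when the user presses Enter on the selected item.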

How can I Convert HTML to Text in C#?

I'm looking for C# code to convert an HTML document to plain text.
I'm not looking for simple tag stripping, but something that will output plain text with a reasonable preservation of the original layout.
The output should look like this:
Html2Txt at W3C
I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?
EDIT: I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does HTML-to-text conversion)! All it did was strip the tags, flatten the tables, etc. The output didn't look anything like what Html2Txt @ W3C produced. Too bad that source doesn't seem to be available.
I was looking to see if there is a more "canned" solution available.
EDIT 2: Thank you everybody for your suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture the stdout with ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process vs. doing it in code. Plus, Lynx is FAST!!
Just a note about the HtmlAgilityPack for posterity. The project contains an example of converting HTML to text, which, as noted by the OP, does not handle whitespace at all the way anyone writing HTML would expect. There are full-text rendering solutions out there, noted by others on this question, which this is not (it cannot even handle tables in its current form), but it is lightweight and fast, which is all I wanted for creating a simple text version of HTML emails.
using System.IO;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
//small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
public static class HtmlToText
{
public static string Convert(string path)
{
HtmlDocument doc = new HtmlDocument();
doc.Load(path);
return ConvertDoc(doc);
}
public static string ConvertHtml(string html)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
return ConvertDoc(doc);
}
public static string ConvertDoc (HtmlDocument doc)
{
using (StringWriter sw = new StringWriter())
{
ConvertTo(doc.DocumentNode, sw);
sw.Flush();
return sw.ToString();
}
}
internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
foreach (HtmlNode subnode in node.ChildNodes)
{
ConvertTo(subnode, outText, textInfo);
}
}
public static void ConvertTo(HtmlNode node, TextWriter outText)
{
ConvertTo(node, outText, new PreceedingDomTextInfo(false));
}
internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
string html;
switch (node.NodeType)
{
case HtmlNodeType.Comment:
// don't output comments
break;
case HtmlNodeType.Document:
ConvertContentTo(node, outText, textInfo);
break;
case HtmlNodeType.Text:
// script and style must not be output
string parentName = node.ParentNode.Name;
if ((parentName == "script") || (parentName == "style"))
{
break;
}
// get text
html = ((HtmlTextNode)node).Text;
// is it in fact a special closing node output as text?
if (HtmlNode.IsOverlappedClosingElement(html))
{
break;
}
// check the text is meaningful and not a bunch of whitespaces
if (html.Length == 0)
{
break;
}
if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
{
html= html.TrimStart();
if (html.Length == 0) { break; }
textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
}
outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
{
outText.Write(' ');
}
break;
case HtmlNodeType.Element:
string endElementString = null;
bool isInline;
bool skip = false;
int listIndex = 0;
switch (node.Name)
{
case "nav":
skip = true;
isInline = false;
break;
case "body":
case "section":
case "article":
case "aside":
case "h1":
case "h2":
case "header":
case "footer":
case "address":
case "main":
case "div":
case "p": // stylistic - adjust as you tend to use
if (textInfo.IsFirstTextOfDocWritten)
{
outText.Write("\r\n");
}
endElementString = "\r\n";
isInline = false;
break;
case "br":
outText.Write("\r\n");
skip = true;
textInfo.WritePrecedingWhiteSpace = false;
isInline = true;
break;
case "a":
if (node.Attributes.Contains("href"))
{
string href = node.Attributes["href"].Value.Trim();
if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase)==-1)
{
endElementString = "<" + href + ">";
}
}
isInline = true;
break;
case "li":
if(textInfo.ListIndex>0)
{
outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
}
else
{
outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
}
isInline = false;
break;
case "ol":
listIndex = 1;
goto case "ul";
case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
endElementString = "\r\n";
isInline = false;
break;
case "img": //inline-block in reality
if (node.Attributes.Contains("alt"))
{
outText.Write('[' + node.Attributes["alt"].Value);
endElementString = "]";
}
if (node.Attributes.Contains("src"))
{
outText.Write('<' + node.Attributes["src"].Value + '>');
}
isInline = true;
break;
default:
isInline = true;
break;
}
if (!skip && node.HasChildNodes)
{
ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten){ ListIndex = listIndex });
}
if (endElementString != null)
{
outText.Write(endElementString);
}
break;
}
}
}
internal class PreceedingDomTextInfo
{
public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
{
IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
}
public bool WritePrecedingWhiteSpace {get;set;}
public bool LastCharWasSpace { get; set; }
public readonly BoolWrapper IsFirstTextOfDocWritten;
public int ListIndex { get; set; }
}
internal class BoolWrapper
{
public BoolWrapper() { }
public bool Value { get; set; }
public static implicit operator bool(BoolWrapper boolWrapper)
{
return boolWrapper.Value;
}
public static implicit operator BoolWrapper(bool boolWrapper)
{
return new BoolWrapper{ Value = boolWrapper };
}
}
As an example, the following HTML code...
<!DOCTYPE HTML>
<html>
<head>
</head>
<body>
<header>
Whatever Inc.
</header>
<main>
<p>
Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
</p>
<ol>
<li>
Please confirm this is your email by replying.
</li>
<li>
Then perform this step.
</li>
</ol>
<p>
Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
</p>
<ul>
<li>
a point.
</li>
<li>
another point, with a <a href="http://en.wikipedia.org/wiki/Hyperlink">hyperlink</a>.
</li>
</ul>
<p>
Sincerely,
</p>
<p>
The whatever.com team
</p>
</main>
<footer>
Ph: 000 000 000<br/>
mail: whatever st
</footer>
</body>
</html>
...will be transformed into:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
1. Please confirm this is your email by replying.
2. Then perform this step.
Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please:
* a point.
* another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
...as opposed to:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
Please confirm this is your email by replying.
Then perform this step.
Please solve this . Then, in any order, could you please:
a point.
another point, with a hyperlink.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
You could use this:
public static string StripHTML(string HTMLText, bool decode = true)
{
Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
var stripped = reg.Replace(HTMLText, "");
return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
}
Update: thanks for the comments. I have updated this function to improve it.
I've heard from a reliable source that, if you're doing HTML parsing in .NET, you should look at the HTML Agility Pack again:
http://www.codeplex.com/htmlagilitypack
A sample on SO:
HTML Agility pack - parsing tables
What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other text browsers. This is much harder to do than you would expect.
I had some decoding issues with HtmlAgilityPack and didn't want to spend time investigating them.
Instead I used this utility from the Microsoft Team Foundation API:
var text = HtmlFilter.ConvertToPlainText(htmlContent);
Have you tried http://www.aaronsw.com/2002/html2text/? It's Python, but open source.
Assuming you have well-formed HTML, you could also try an XSL transform.
Here's an example:
using System;
using System.IO;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml.Xsl;
class Html2TextExample
{
public static string Html2Text(XDocument source)
{
var writer = new StringWriter();
Html2Text(source, writer);
return writer.ToString();
}
public static void Html2Text(XDocument source, TextWriter output)
{
Transformer.Transform(source.CreateReader(), null, output);
}
public static XslCompiledTransform _transformer;
public static XslCompiledTransform Transformer
{
get
{
if (_transformer == null)
{
_transformer = new XslCompiledTransform();
var xsl = XDocument.Parse(@"<?xml version='1.0'?><xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" exclude-result-prefixes=""xsl""><xsl:output method=""html"" indent=""yes"" version=""4.0"" omit-xml-declaration=""yes"" encoding=""UTF-8"" /><xsl:template match=""/""><xsl:value-of select=""."" /></xsl:template></xsl:stylesheet>");
_transformer.Load(xsl.CreateNavigator());
}
return _transformer;
}
}
static void Main(string[] args)
{
var html = XDocument.Parse("<html><body><div>Hello world!</div></body></html>");
var text = Html2Text(html);
Console.WriteLine(text);
}
}
Because I wanted conversion to plain text with line feeds and bullets, I found this neat solution on CodeProject, which covers many conversion use cases:
Convert HTML to Plain Text
Yes, it looks big, but it works fine.
The easiest would probably be tag stripping combined with replacement of some tags with text layout elements like dashes for list elements (li) and line breaks for br's and p's.
It shouldn't be too hard to extend this to tables.
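To make that idea concrete, here is a minimal sketch of tag stripping with a few layout replacements. The class and method names (SimpleHtmlToText, ToPlainText) are made up for illustration and the choice of a dash as the bullet is an assumption. It relies only on regular expressions, so it will not cope with every piece of malformed HTML, for example attribute values that contain a ">" character.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class SimpleHtmlToText
{
    public static string ToPlainText(string html)
    {
        // Drop script and style blocks together with their contents.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Line breaks in the HTML source carry no meaning, so collapse whitespace runs first.
        text = Regex.Replace(text, @"\s+", " ");

        // <br> and closing </p> tags become line breaks.
        text = Regex.Replace(text, @"\s*<br\s*/?>\s*", "\n", RegexOptions.IgnoreCase);
        text = Regex.Replace(text, @"\s*</p\s*>\s*", "\n", RegexOptions.IgnoreCase);

        // Each list item starts on its own line, prefixed with a dash.
        text = Regex.Replace(text, @"\s*<li\b[^>]*>\s*", "\n- ", RegexOptions.IgnoreCase);

        // Strip every remaining tag.
        text = Regex.Replace(text, @"<[^>]+>", "");

        // Decode entities such as &amp; only after the tags are gone.
        return WebUtility.HtmlDecode(text).Trim();
    }
}
```

Calling SimpleHtmlToText.ToPlainText("<p>Hello<br/>world</p><ul><li>one</li><li>two</li></ul>") gives "Hello", "world", an empty line, "- one" and "- two" on separate lines. Tables could be handled the same way by mapping </td> to a tab and </tr> to a line break before the final strip.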
Here is the short sweet answer using HtmlAgilityPack. You can run this in LinqPad.
var html = "<div>..whatever html</div>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var plainText = doc.DocumentNode.InnerText;
I simply use HtmlAgilityPack in any .NET project that needs HTML parsing. It's simple, reliable, and fast.
Update - you are correct that the above removes tags but does not decode the escaped characters. This will do it:
var a = "This &amp; that";
var result = System.Web.HttpUtility.HtmlDecode(a);
result.Dump();
Using the two together you can get the plain text from the HTML.
Another post suggests the HTML agility pack:
"This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams)."
I have used Detagger in the past. It does a pretty good job of formatting the HTML as text and is more than just a tag remover.
This function converts "what you see in the browser" to plain text with line breaks. (If you want to see the result in the browser, use the commented-out return value instead.)
public string HtmlFileToText(string filePath)
{
using (var browser = new WebBrowser())
{
string text = File.ReadAllText(filePath);
browser.ScriptErrorsSuppressed = true;
browser.Navigate("about:blank");
browser?.Document?.OpenNew(false);
browser?.Document?.Write(text);
return browser.Document?.Body?.InnerText;
//return browser.Document?.Body?.InnerText.Replace(Environment.NewLine, "<br />");
}
}
I don't know C#, but there is a fairly small & easy to read python html2txt script here: http://www.aaronsw.com/2002/html2text/
I have recently blogged about a solution that worked for me: using a Markdown XSLT file to transform the HTML source. The HTML source will of course need to be valid XML first.
Try the easy and usable way: just call StripHTML(WebBrowserControl_name);
public string StripHTML(WebBrowser webp)
{
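try
{
// Assumes a WinForms WebBrowser and a reference to Microsoft.mshtml; the original snippet used `doc` without defining it.
IHTMLDocument2 doc = (IHTMLDocument2)webp.Document.DomDocument;
doc.execCommand("SelectAll", true, null);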
IHTMLSelectionObject currentSelection = doc.selection;
if (currentSelection != null)
{
IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
if (range != null)
{
currentSelection.empty();
return range.text;
}
}
}
catch (Exception ep)
{
//MessageBox.Show(ep.Message);
}
return "";
}
In Genexus you can do it with Regex:
&pattern = '<[^>]+>'
&TSTRPNOT=&TSTRPNOT.ReplaceRegEx(&pattern,"")
If you are using .NET Framework 4.5 you can use System.Net.WebUtility.HtmlDecode(), which takes an HTML-encoded string and returns the decoded string.
Documented on MSDN at: http://msdn.microsoft.com/en-us/library/system.net.webutility.htmldecode(v=vs.110).aspx
You can use this in a Windows Store app as well.
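Note that HtmlDecode only decodes entities; it does not remove tags. If you pair it with a tag-stripping regex, the order matters. The sketch below shows why, using two hypothetical helper methods (StripThenDecode and DecodeThenStrip are names I made up, and the regex is the simple "<[^>]+>" pattern rather than a full parser).

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helpers for illustration only.
public static class DecodeOrderExample
{
    // Strip real tags first, then decode: escaped markup such as &lt;b&gt; survives as literal text.
    public static string StripThenDecode(string html)
    {
        string withoutTags = Regex.Replace(html, "<[^>]+>", "");
        return WebUtility.HtmlDecode(withoutTags);
    }

    // Decode first, then strip: escaped markup turns into something that looks like a tag and is removed.
    public static string DecodeThenStrip(string html)
    {
        string decoded = WebUtility.HtmlDecode(html);
        return Regex.Replace(decoded, "<[^>]+>", "");
    }
}
```

For the input "<p>Use the &lt;b&gt; tag</p>", StripThenDecode returns "Use the <b> tag", while DecodeThenStrip loses the escaped "<b>" and leaves a double space in its place. Stripping first is therefore usually the safer order.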
You can use the WebBrowser control to render your HTML content in memory. After the LoadCompleted event has fired:
IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
string innerHTML = htmlDoc.body.innerHTML;
string innerText = htmlDoc.body.innerText;
This is another solution to convert HTML to Text or RTF in C#:
SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
string text = h.ConvertString(htmlString);
This library is not free; it is a commercial product, and it is my own product.
