Remove HTML special character from displayed text - C#

I have XML which I convert to plain text and then display with HTML formatting in a web browser.
At the end of each line the symbol ¶ appears. I would like to remove the symbol or replace it with a period.
Does anyone know how I could do this?
This is how I convert the XML to plain text:
XmlDocument doc = new XmlDocument();
doc.LoadXml(this.dataGridViewResult.SelectedRows[0].Cells["XMLEvent"].Value.ToString());
StringBuilder sb = new StringBuilder();
foreach (XmlNode node in doc.DocumentElement.ChildNodes)
{
    sb.Append(char.ToUpper(node.Name[0]));
    sb.Append(node.Name.Substring(1));
    sb.Append(' ');
    sb.AppendLine(node.InnerText);
}

Where does the '¶' appear? Is it when you open the converted text file in an editor?
Normally that sign is used to visualize the end of a line in a text editor, and it's not really part of your text. Many text editors have an option to show or hide line-ending markers.
However, if the output you are interested in is HTML, the character should not appear there.

Try this:
sb.AppendLine(node.InnerText.TrimEnd('¶'));
or
sb.AppendLine(node.InnerText.Replace("¶","."));

After the foreach loop, try:
sb.Replace("¶", ".");

Specifically in your case (assuming it's always at the end of the line), I'd use the Unicode escape, which keeps your source code free of non-ASCII characters:
sb.AppendLine(node.InnerText.Replace('\u00b6', '.'));
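If the pilcrow really is part of the text and always sits at the end of a line, a regular expression can target only those occurrences and leave any pilcrow in the middle of a line alone. The following is a minimal sketch; the names PilcrowCleaner and ReplaceTrailingPilcrow are made up for illustration and are not part of the original code.

```csharp
using System.Text.RegularExpressions;

public static class PilcrowCleaner
{
    // Replaces a pilcrow (U+00B6) only when it is the last character
    // before a line break or the end of the text.
    // Note: the replacement is treated as a regex replacement pattern,
    // so a literal "$" in it would need to be written as "$$".
    public static string ReplaceTrailingPilcrow(string text, string replacement = ".")
    {
        return Regex.Replace(text, @"\u00B6(?=\r?\n|$)", replacement);
    }
}
```

It could be applied once to the finished text, for example PilcrowCleaner.ReplaceTrailingPilcrow(sb.ToString()), instead of calling Replace on every node.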

Related

Breakdown of HTML RTF string for 3rd Party Formatting

I have decided to come here with my problem as my head is fried and I have a deadline. My basic scenario is that on our system we save RTF HTML in the database, for example:
This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text
Which renders as following:
This is Line 1 with more Bold and italic text
These HTML strings are exported to PDF and up until now the PDF renderer used could read and render this HTML correctly... Not any more. I am therefore having to do this the manual way and read each tag individually and apply the styling on the fly as I construct each paragraph. Fine.
My idea is to build a list of strings, for example:
"This is "
"<strong>Line 1</strong>"
" with more "
"<strong>Bold and <em>italic</em></strong>"
" text"
Each row either has an un-formatted string or contains all style tags for a given string.
I should then be able to build up my paragraph one string at a time, checking for tags and applying them when required.
I am however mentally failing at the first hurdle (Friday afternoon syndrome??) and cannot figure out how to build my list. I'm guessing I am going to use RegEx.
If someone could advise on how I might get a list like this, it would be greatly appreciated.
Edit
Following a Python example suggested below I have implemented the following, but this only gives me the elements surrounded by tags and none of the unformatted text:
var stringElements = Regex.Matches(paragraphString, @"(<(.*?)>.*?</\2>)", RegexOptions.Compiled)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();
So close...
I apologize up front, since my answer is written in Python, however I hope this provides you with some guidance.
import re
s = 'This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text'
matches = [i[0] for i in re.findall(r'(<(.*?)>.*?</\2>)', s)]
for i in matches:
    s = s.replace(i, '\n' + i + '\n')
print(s)
Gives:
This is
<strong> Line 1</strong>
with more
<strong>Bold and <em>italic</em></strong>
text
So I have found a solution by using the glorious Html Agility Pack:
var doc = new HtmlDocument();
doc.LoadHtml(paragraphString);
var htmlBody = doc.DocumentNode.SelectSingleNode(@"/p");
HtmlNodeCollection childNodes = htmlBody.ChildNodes;
List<string> elements = new List<string>();
foreach (var node in childNodes)
{
    elements.Add(node.OuterHtml);
}
As a note, I was previously removing the Paragraph tags surrounding the html from the paragraphString but have left them in for this example. So the string being passed in is actually:
<p>This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text</p>
I think the RegEx answer has some credibility and I am sure there is something in there that is just excluding the non 'noded' elements. This seems nicer though as you have access to the elements in a class-structure kinda way.
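For what it's worth, the regex attempt drops the unformatted text because Regex.Matches only returns the matched spans. One way to keep the text in between is to walk the matches and copy the gaps as well. This is only a sketch: the names TagSplitter and SplitKeepingText are hypothetical, it assumes tags without attributes, and it will not cope with a tag nested inside another tag of the same name, which is where an HTML parser is the safer choice.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagSplitter
{
    // Splits the input into top-level tagged chunks and the plain text between them.
    public static List<string> SplitKeepingText(string input)
    {
        var parts = new List<string>();
        int last = 0; // index just past the previous match

        foreach (Match m in Regex.Matches(input, @"<(\w+)>.*?</\1>"))
        {
            // Copy any plain text that sits before this tagged chunk.
            if (m.Index > last)
            {
                parts.Add(input.Substring(last, m.Index - last));
            }
            parts.Add(m.Value);
            last = m.Index + m.Length;
        }

        // Copy any plain text after the last tagged chunk.
        if (last < input.Length)
        {
            parts.Add(input.Substring(last));
        }
        return parts;
    }
}
```

For the example string from the question (without the surrounding paragraph tags) this yields five entries, alternating between plain text and tagged chunks.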

C# StringBuilder : i can't set a specific Microsoft Word line feed

I have created a template in Word: basically a Word document with some [Something] placeholders to be replaced by a C# console application, and I'm trying to set a specific character as a line feed.
Indeed, in Word, the line feed character you use determines whether a page break is allowed inside a table cell.
To be clear, when you click the "show all characters" button, you can see all the special characters.
The one I want to add is this one: [image: good linefeed]. But I have tried every Unicode character I found on Wikipedia, and I always get this kind of line feed: [image: bad return].
I use this library to manipulate the docx document: github.com/WordDocX/DocX
And this is the code I used in the "Examples" project of the DocX GitHub library:
private static void ModifyTemplate()
{
    // Loading the template
    using (DocX document = DocX.Load("D:\\DocX-master\\Examples\\docs\\Template.docx"))
    {
        var sb = new StringBuilder("");
        sb.Clear();
        sb.AppendLine("bla bla bla bla bla bla bla bla.");
        // Testing all the Unicode codes I found here: https://en.wikipedia.org/wiki/Newline
        sb.Append("u000D\u000D").Append("u000A\u000A").Append("u0085\u0085").Append("u000B\u000B").Append("u000C\u000C").Append("u2028\u2028").Append("u2029\u2029");
        sb.AppendLine("AppendLine");
        document.ReplaceText("[test]", sb.ToString());
        // Testing with a different encoding
        String test = Encoding.UTF8.GetString(Encoding.ASCII.GetBytes(sb.ToString()));
        document.ReplaceText("[test]", test);
        #region Saving the modified template on the disk
        // Save all changes to this document.
        document.SaveAs(@"docs\Result.docx");
        #endregion
    } // Release this document from memory.
}
My Word template is basically a new docx with the text [Test] and [Test2] inside.
The result with this code:
Sorry, I can't post images, so I can't show you the final result, but trust me: none of these Unicode characters seems to produce the right result.
To conclude, no matter which Unicode character I use, I cannot get the right line-break character.
Which Unicode character should I use? Or what encoding trick should I use to get the line feed character I am after?
It seems you need to use InsertParagraph:
Paragraph p1 = document.InsertParagraph();
p1.Append("New Para 1");
Paragraph p2 = document.InsertParagraph();
p2.Append("New Para 2");
Paragraph p3 = document.InsertParagraph();
p3.Append("New Para 3");
Thanks to Aruk's tip, I managed to solve this issue.
By using InsertParagraph on the specific cell I want, I get the line feed I wanted, the one Word can understand and use.
One issue remains: I lose the style of my former [something] tag, but that shouldn't be too hard to solve.
Thanks a lot.

When I assign InnerText to a string variable, all line breaks are removed

I use HtmlAgilityPack.
This is the part of my code that gets the inner text from a single node:
var edit = outDocument.DocumentNode.SelectSingleNode("//textarea[@id='wpTextbox1']//text()");
String _edit;
_edit = edit.InnerText.ToString().Trim();
[picture 1]
When I write _edit to a text file, all the text runs together (all line breaks are removed).
[picture 2]
I want the text with its line breaks.
How can I fix this?
Sorry for my bad English.
The HTML text that you receive uses the character "\n" as its newline, whereas Windows uses the two-character sequence "\r\n". That is why Notepad displays the text on a single line.
You need to make this change:
string _edit = edit.InnerText.Replace("\n", Environment.NewLine);
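One caveat with a plain Replace("\n", ...) is that text which already contains "\r\n" would end up with "\r\r\n". A slightly more defensive sketch normalizes every kind of line break in one pass; the names NewlineNormalizer and Normalize are assumptions for illustration, not part of HtmlAgilityPack.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NewlineNormalizer
{
    // Converts "\r\n", lone "\r" and lone "\n" to a single newline style.
    // When no style is given, the platform default (Environment.NewLine) is used.
    public static string Normalize(string text, string newline = null)
    {
        newline = newline ?? Environment.NewLine;
        // "\r\n" is listed first so that a Windows line break is matched as one unit.
        // The evaluator returns the newline literally, so no "$" substitution applies.
        return Regex.Replace(text, @"\r\n|\r|\n", m => newline);
    }
}
```

In the code above this would be used as string _edit = NewlineNormalizer.Normalize(edit.InnerText);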

Remove line-breaks after block tags with regular expression

I want to remove the line-breaks after block tags such as h1, h2, ul, blockquote etc. before converting them to PDF.
I am currently using string.Replace method as below. Is there a better solution with RegEx?
text = text.Replace("center]\r\n", "center]")
.Replace("li]\r\n", "li]")
.Replace("ol]\r\n", "ol]")
.Replace("ul]\r\n", "ul]")
.Replace("center]\n", "center]")
.Replace("li]\n", "li]")
.Replace("ol]\n", "ol]")
.Replace("ul]\n", "ul]")
.Replace("h1]\r\n", "h1]")
.Replace("h2]\r\n", "h2]")
.Replace("h3]\r\n", "h3]")
.Replace("h4]\r\n", "h4]")
.Replace("h1]\n", "h1]")
.Replace("h2]\n", "h2]")
.Replace("h3]\n", "h3]")
.Replace("h4]\n", "h4]")
.Replace("\r\n[h1]", "[h1]")
.Replace("\r\n[h2]", "[h2]")
.Replace("\r\n[h3]", "[h3]")
.Replace("\r\n[h4]", "[h4]")
.Replace("\n[h1]", "[h1]")
.Replace("\n[h2]", "[h2]")
.Replace("\n[h3]", "[h3]")
.Replace("\n[h4]", "[h4]")
;
Note that this is just one step of the process. There are many other custom tags such as blue, red, email doc which are already being parsed into HTML. The reason I am trying to remove line breaks is that I cannot use the line-break br tag. We must maintain the normal line breaks in the text document.
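How about the regex
((?:center|li|[ou]l|h[1-4])\])\r?\n|\r?\n(\[h[1-4]\])
and replace each match with the contents of capture groups 1 and 2 (only one of them takes part in any given match, and the other is empty). Note that string.Replace does not understand regular expressions, so this needs Regex.Replace:
text = Regex.Replace(text, @"((?:center|li|[ou]l|h[1-4])\])\r?\n|\r?\n(\[h[1-4]\])", "$1$2");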
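As a self-contained sketch of the regex approach, the pattern can be wrapped in a small helper that goes through Regex.Replace and emits both capture groups, so that a heading tag following a line break is kept. The names BlockTagCleaner and RemoveLineBreaks are hypothetical, and the tag list is simply the one from the question.

```csharp
using System.Text.RegularExpressions;

public static class BlockTagCleaner
{
    // Alternative 1: a block tag ending in "]" followed by a line break (tag kept in group 1).
    // Alternative 2: a line break followed by an opening heading tag (tag kept in group 2).
    private static readonly Regex LineBreakNearBlockTag = new Regex(
        @"((?:center|li|[ou]l|h[1-4])\])\r?\n|\r?\n(\[h[1-4]\])");

    public static string RemoveLineBreaks(string text)
    {
        // A group that did not take part in the match is substituted as an empty string.
        return LineBreakNearBlockTag.Replace(text, "$1$2");
    }
}
```

Line breaks that are not adjacent to one of the listed tags are left untouched, which keeps the normal line breaks of the text document.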

extracting just page text using HTMLAgilityPack

OK, so I am really new to the XPath queries used in HtmlAgilityPack.
Let's consider this page: http://health.yahoo.net/articles/healthcare/what-your-favorite-flavor-says-about-you. What I want is to extract just the page content and nothing else.
For that, I first remove the script and style tags.
Document = new HtmlDocument();
Document.LoadHtml(page);
TempString = new StringBuilder();
foreach (HtmlNode style in Document.DocumentNode.Descendants("style").ToArray())
{
    style.Remove();
}
foreach (HtmlNode script in Document.DocumentNode.Descendants("script").ToArray())
{
    script.Remove();
}
After that I try to use //text() to get all the text nodes.
foreach (HtmlTextNode node in Document.DocumentNode.SelectNodes("//text()"))
{
    TempString.AppendLine(node.InnerText);
}
However, not only am I getting more than just text, I am also getting numerous \r \n characters.
I would appreciate a little guidance in this regard.
If you consider that script and style nodes only have text nodes for children, you can use this XPath expression to get text nodes that are not in script or style tags, so that you don't need to remove the nodes beforehand:
//*[not(self::script or self::style)]/text()
You can further exclude text nodes that are only whitespace using XPath's normalize-space():
//*[not(self::script or self::style)]/text()[not(normalize-space(.)="")]
or the shorter
//*[not(self::script or self::style)]/text()[normalize-space()]
But you will still get text nodes that may have leading or trailing whitespace. This can be handled in your application as @aL3891 suggests.
If \r \n characters in the final string are the problem, you could just remove them after the fact (strings are immutable, so keep the returned value):
string result = TempString.ToString().Replace("\r", "").Replace("\n", "");
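Removing \r and \n outright can glue words together when a line break was the only separator between them ("first\nsecond" becomes "firstsecond"). An alternative sketch collapses every run of whitespace into a single space and trims the ends; WhitespaceCleaner and Collapse are made-up names for illustration.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceCleaner
{
    // Collapses any run of whitespace (spaces, tabs, \r, \n) into one space
    // and removes leading and trailing whitespace.
    public static string Collapse(string text)
    {
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

It could be applied to each node's InnerText before appending it, or once to TempString.ToString() at the end, depending on whether the line structure added by AppendLine should survive.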
