Remove all HTML formatting from a string - C#

I am trying to compare two strings, but I just realized that one of them already contains some HTML formatting.
How can I get these two strings to match when doing string1 == string2? (Note: I don't know up front what the HTML formatting is going to be.)
string1 = "This is a test";
string2 = "<font color=\"black\" size=\"1\">This is a test</font>";

Load the HTML into Html Agility Pack and extract only the text.
string html = "<html><body><div>test</div></body></html>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
string text = document.DocumentNode.InnerText;
This will not remove the content of <script> nodes, but you can easily remove the script nodes first.

string newText = System.Text.RegularExpressions.Regex.Replace(OldHtmlTextHere, "<[^>]*>", string.Empty);

Check out System.Web.HttpUtility.HtmlDecode, which decodes HTML entities such as &amp; back into plain characters.
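The regex and entity-decoding suggestions above can be combined. The following is a minimal sketch, not a robust HTML sanitizer: the class name HtmlText and the method Strip are hypothetical, and it uses System.Net.WebUtility.HtmlDecode (a base class library method that does the same job as HttpUtility.HtmlDecode) so that no reference to System.Web is needed. The regex only removes things that look like tags, so the content of <script> or <style> elements would remain.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: drop anything that looks like a tag,
    // then turn entities such as &amp; back into plain characters.
    public static string Strip(string html)
    {
        string withoutTags = Regex.Replace(html, "<[^>]*>", string.Empty);
        return WebUtility.HtmlDecode(withoutTags);
    }

    public static void Main()
    {
        string string1 = "This is a test";
        string string2 = "<font color=\"black\" size=\"1\">This is a test</font>";

        // Compare the plain string with the stripped version of the HTML one.
        Console.WriteLine(string1 == Strip(string2)); // prints True
    }
}
```

For anything beyond simple, well-formed fragments, the Html Agility Pack answer above is the safer choice.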

Related

How to identify html tags in html string

I have the HTML string below, and I am trying to identify the <br> tags at the start and at the end of the whole text inside it, using the following code:
var htmlString = "<p><span><br> text <b>text <br></b>text <br></span></p>";
var document = new HtmlDocument();
document.LoadHtml(htmlString);
var nodes = document.DocumentNode.SelectNodes("//br");
But this returns all the <br> nodes, whereas I want only the ones at the start and at the end of the text in this HTML string:
<p><span><br> text <b> text <br></b>text <br></span></p>
I expect 2 nodes but get 3, because the <br> tag in the middle of the text is counted as well.
Could anyone please help me achieve this? Many thanks in advance.

You can use the Split method to solve your problem. The code below prints the text between the first and the last <br> tags, and you can adapt the output to your requirements. A regex pattern might also work.
const string tag = "<br>";
var splitedHtmlString = htmlString.Split(tag);
StringBuilder builder = new StringBuilder();
for (int i = 1; i < splitedHtmlString.Length - 1; i++)
{
builder.Append(splitedHtmlString[i]);
builder.Append(tag);
}
builder.Remove(builder.ToString().Length - tag.Length, tag.Length);
Console.WriteLine(builder.ToString());
Output: text <b>text <br></b>text
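The same result can also be reached without splitting and rejoining, by locating the first and the last occurrence of the tag directly. This is only a sketch under the assumption that the tag is always written exactly as <br> (not <br/> or <br />); the class name BrTrimmer and the method BetweenOuterBreaks are hypothetical.

```csharp
using System;

public static class BrTrimmer
{
    // Hypothetical helper: returns the text between the first and the last
    // occurrence of the tag, or an empty string if there are fewer than two.
    public static string BetweenOuterBreaks(string html, string tag = "<br>")
    {
        int first = html.IndexOf(tag, StringComparison.OrdinalIgnoreCase);
        int last = html.LastIndexOf(tag, StringComparison.OrdinalIgnoreCase);
        if (first < 0 || first == last)
        {
            return string.Empty;
        }

        int start = first + tag.Length;
        return html.Substring(start, last - start);
    }

    public static void Main()
    {
        string htmlString = "<p><span><br> text <b>text <br></b>text <br></span></p>";

        // Inner <br> tags are left untouched; only the outer pair is used as a boundary.
        Console.WriteLine(BetweenOuterBreaks(htmlString));
    }
}
```

The indices of the first and last tag are also what the question asks for, so they could be returned instead of the substring.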

You can load your string into an HtmlDocument and select nodes with the HtmlAgilityPack library:
HtmlDocument document = new HtmlDocument();
document.LoadHtml("your html code");
var htmlTag = document.DocumentNode.SelectNodes("//br");

how to check string value if table tag is available and implement regex

I have this string variable, which contains both text and HTML tags. How do I apply a regex only within the HTML table tag? Is this possible?
string input = "Hello,\nTRAVEL DETAILS\n<table border=\"1\">\n<tr>\n<th align=\"center\">Initial Travel Date</th>\n<th align=\"center\">Reference Number</th>\n<th align=\"center\">First Name</th>\n<th align=\"center\">Surname</th>\n<th align=\"center\">Main Reason</th>\n<th align=\"center\">Client ID</th>\n</tr>\n<tr>\n<td align=\"center\">{TRV TRL INIT.trn}</td>\n<td align=\"center\">{TRV REF NO.trn}</td>\n<td align=\"center\">{TRV FIRST NM.trn}</td>\n<td align=\"center\">{TRV SURNAME.trn}</td>\n<td align=\"center\">Internal Meeting</td>\n<td align=\"center\">{TRV CLIEN ID.trn}</td>\n</tr>\n</table>";
string output = Regex.Replace(input, @"\t|\n|\r", "");
return output;
I only need to remove the "\n" inside the table element.

You can use the WebBrowser control to parse the HTML string, get the table chunk, and remove the newlines from there.
Or you can use IHTMLDocument, IHTMLDocument2, IHTMLDocument3 ... up to 8 to parse the HTML. You need to include Mshtml.dll in your project references, though.
Or use a third-party HTML parser.
Do not try to manipulate the raw string unless you want to write your own HTML parser.

I have found a way to eliminate the "\n" inside the table, although it ended up not using a regex. Here is the updated code:
string input = emailMessage.Message.Replace("\n<tr>\n", "<tr>").Replace("</th>\n", "</th>").Replace("\n</tr>", "</tr>")
.Replace("</td>\n", "</td>").Replace("\n</table>", "</table>");
string output = input;
return output;
Thank you for all the comments and suggestions.
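For reference, the original regex can be limited to the table element by matching the table first and applying the replacement only to each match. This is a sketch, not a substitute for a real parser: the class name TableNewlines and the method StripInsideTables are hypothetical, and the pattern assumes the tables are not nested and are properly closed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TableNewlines
{
    // Hypothetical helper: removes tabs and line breaks only inside
    // <table>...</table>, leaving the surrounding text untouched.
    public static string StripInsideTables(string input)
    {
        return Regex.Replace(
            input,
            @"<table\b.*?</table>",
            m => Regex.Replace(m.Value, @"\t|\n|\r", string.Empty),
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string input = "Hello,\nTRAVEL DETAILS\n<table border=\"1\">\n<tr>\n<td>A</td>\n</tr>\n</table>";

        // The "\n" characters after "Hello," and "TRAVEL DETAILS" are kept.
        Console.WriteLine(StripInsideTables(input));
    }
}
```

RegexOptions.Singleline lets the dot match newlines, so one match can span the whole multi-line table.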

Need data inside the Body tag, but no any other tag

Hi, I have a resume in HTML format.
I am reading the file using StreamReader, and I am removing the tags using the method below.
using (StreamReader sr = new StreamReader("\\Myfile.html"))
{
String line = sr.ReadToEnd();
string jj = Regex.Replace(line, "<.*?>", String.Empty);
}
It works very well.
However, my requirement is to get only the data inside the body tag,
without the body tag itself and without any tags inside it.

Don't use regex for HTML/XML parsing; use an HTML/XML parser. These questions explain well why you should not:
RegEx match open tags except XHTML self-contained tags
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
You can load the string into an HTML document using Html Agility Pack.
Here is a little example of how to do it:
public string ReplacePElement()
{
HtmlDocument doc = new HtmlDocument();
doc.Load(htmlFile); // htmlFile: path to the HTML file
foreach (HtmlNode body in doc.DocumentNode.SelectNodes("//body"))
{
// body.InnerText holds the text of the body with all tags removed
}
return doc.DocumentNode.OuterHtml;
}

Removing Breaklines at the start of string

I am using HtmlAgilityPack to format HTML for a text file. <br> nodes are replaced with "\r\n" so the text stays formatted in the file. I want all the line breaks before the first actual character to be removed, but my code doesn't do that. The final output for the test should be: Original:HelloCheck Expected:HelloCheck
html = "<br><br><br>Hello<br>Check";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode.SelectNodes("//br");
if (nodes != null)
{
foreach(var node in nodes)
{
node.ParentNode.ReplaceChild(doc.CreateTextNode("\r\n"), node);
}
}
html = doc.DocumentNode.InnerText.TrimStart('r','n');
OutputLog.WriteLine("trimmed: " + html);

I want all the breaklines before the first actual char to be removed
but my code doesnt do that.
You could easily do this using Regex, by removing the leading <br> tags before any further processing:
html = "<br><br><br>Hello<br>Check";
html = Regex.Replace(html, "^(?:<br>)+", string.Empty); // html is now "Hello<br>Check"
Then you can process your HTML as you want.
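One detail in the question's code is also worth noting: TrimStart('r','n') trims the letters r and n, not carriage returns and line feeds, so the leading "\r\n" sequences are never removed. A small sketch of the difference follows; the class name LeadingBreaks and the method TrimLeadingNewlines are hypothetical, and the sample text stands in for the InnerText produced after the <br> replacement.

```csharp
using System;

public static class LeadingBreaks
{
    // Hypothetical helper: removes carriage returns and line feeds
    // from the start of the text only.
    public static string TrimLeadingNewlines(string text)
    {
        return text.TrimStart('\r', '\n');
    }

    public static void Main()
    {
        string text = "\r\n\r\n\r\nHello\r\nCheck";

        // 'r' and 'n' are ordinary letters, so nothing is trimmed here.
        Console.WriteLine(text.TrimStart('r', 'n') == text); // prints True

        // With the escaped characters the leading line breaks are removed.
        Console.WriteLine(TrimLeadingNewlines(text) == "Hello\r\nCheck"); // prints True
    }
}
```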

How to get text from html nodes and solve character encoding issue?

I'm trying to get the inner text from this site, http://www.hurriyet.com.tr/yazarlar/22933964.asp,
with HtmlAgilityPack.
The HTML structure is
<div class="detailText">
<span class="yzrArticleDate">30 Mart 2014</span>
<h1 class="yazarArticleTitle">31 Mart sabahı için acil ihtiyaç listesi</h1>
<p></p><p><p >Akıl.<br />Sağduyu.<br />Barış.<br />
Özgürlük.<br />Kardeşlik.<br />Vicdan.<br />Huzur.............
and my current code is
string htmlContent = getsource(s);
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(htmlContent);
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").InnerText;
The problem is that the result includes the heading and the date, i.e. "30 Mart 2014" and "31 Mart sabahı için acil ihtiyaç listesi".
I want only the part that begins with
<p></p><p><p >Akıl.<br />
I tried different variations:
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").InnerHtml;
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").NextSibling.NextSibling.InnerText;
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']").LastSibling.InnerText;
My second question: if I manage to get this text, I will face a character encoding problem. How can I fix this?

The easiest solution would be to remove the nodes you don't want and then get InnerHtml/InnerText, as covered in "remove html node from htmldocument :HTMLAgilityPack".
var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailText']");
noa.RemoveChild(noa.SelectSingleNode("span"));
// remove the rest too...
var result = noa.InnerText;
There should be no encoding problem unless the site reports an invalid encoding, as C# strings are Unicode (UTF-16).
