Remove html markup from string - c#

I am reading a field from the database and displaying it in a GridView. The field contains <br/> tags in its text. I am trying to remove them in code, but when I check the value of e.Row.Cells[index].Text it doesn't contain <br/>; it has &lt;br/&gt; instead.
So I tried creating a function that removes any substring starting with < and ending with >, or starting with & and ending with ;. The code removes the &lt; and &gt;, but it is still showing br/.
Code:
index = gv.Columns.HeaderIndex("Message");
if (index > 0)
{
string message = RemoveHTMLMarkup(e.Row.Cells[index].Text);
e.Row.Cells[index].Text = message;
}
static string RemoveHTMLMarkup(string text)
{
return Regex.Replace(Regex.Replace(text, "<.+?>", string.Empty), "&.+?;", string.Empty);
}
How do I remove the <br/> tag?

Since this is a literal string, you (sh|c)ould only use String.Replace():
static string RemoveHTMLNewLines(string text)
{
return text.Replace("<br/>", string.Empty);
}
Or replace with Environment.NewLine if needed.

De-entitize (HTML-decode) the string, then use a regex to find and remove the expected tags.
Or, if you have enough time to study and use it, use the HtmlAgilityPack package.
About HtmlAgilityPack
Nuget Package Link
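The "de-entitize, then remove the expected tags" suggestion could be sketched roughly as follows. This is only an illustration: the class and method names are made up, and it assumes the cell text holds the entity-encoded form (&lt;br/&gt;) and that only <br> variants need to go.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question's code.
public static class MarkupCleaner
{
    public static string RemoveLineBreakTags(string cellText)
    {
        // Turn entity-encoded markup such as "&lt;br/&gt;" back into "<br/>".
        string decoded = WebUtility.HtmlDecode(cellText);

        // Remove <br>, <br/>, <br /> and upper-case variants.
        return Regex.Replace(decoded, @"<br\s*/?>", string.Empty, RegexOptions.IgnoreCase);
    }
}
```

Calling MarkupCleaner.RemoveLineBreakTags(e.Row.Cells[index].Text) would then replace the RemoveHTMLMarkup call. Note that the decoded text is no longer HTML-encoded, so whether it is safe to assign back to the cell depends on where the data comes from.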

Regardless of the question, I am curious about what this is:

Related

how to check string value if table tag is available and implement regex

I have a string variable that contains both text and HTML tags. How do I apply a regex only within the HTML table element? Is this possible?
string input = "Hello,\nTRAVEL DETAILS\n<table border=\"1\">\n<tr>\n<th align=\"center\">Initial Travel Date</th>\n<th align=\"center\">Reference Number</th>\n<th align=\"center\">First Name</th>\n<th align=\"center\">Surname</th>\n<th align=\"center\">Main Reason</th>\n<th align=\"center\">Client ID</th>\n</tr>\n<tr>\n<td align=\"center\">{TRV TRL INIT.trn}</td>\n<td align=\"center\">{TRV REF NO.trn}</td>\n<td align=\"center\">{TRV FIRST NM.trn}</td>\n<td align=\"center\">{TRV SURNAME.trn}</td>\n<td align=\"center\">Internal Meeting</td>\n<td align=\"center\">{TRV CLIEN ID.trn}</td>\n</tr>\n</table>";
string output = Regex.Replace(input, @"\t|\n|\r", "");
return output;
I only need to remove the "\n" characters inside the table element.
You can use the WebBrowser control to parse the HTML string, get the table chunk and remove the new lines from there.
Or you can utilise IHTMLDocument, IHTMLDocument2, IHtmlDocument3 ... up to 8 to parse the HTML. You need to include Mshtml.dll in your project references though.
Or use a 3rd party HTML parser.
Do not try to manipulate the raw string unless you wanna write your own HTML parser.
I have found a way to eliminate the "\n" inside the table, although it ended up not using a regex. Here is the updated code:
string input = emailMessage.Message.Replace("\n<tr>\n", "<tr>").Replace("</th>\n", "</th>").Replace("\n</tr>", "</tr>")
.Replace("</td>\n", "</td>").Replace("\n</table>", "</table>");
string output = input;
return output;
Thank you for all of the comments and suggestions.
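For comparison, here is a rough sketch of applying the original regex only inside the table element by using a MatchEvaluator. The class and method names are illustrative, and it assumes the tables are not nested (the lazy match stops at the first </table>), so the parser advice above still applies to anything less predictable.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; assumes well-formed, non-nested <table> elements.
public static class TableNewlines
{
    public static string StripInsideTables(string input)
    {
        return Regex.Replace(
            input,
            @"<table\b.*?</table>",                        // each whole table element
            m => Regex.Replace(m.Value, @"\t|\n|\r", ""),  // clean only the matched chunk
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }
}
```

Text outside the table, such as the "Hello,\nTRAVEL DETAILS\n" prefix, keeps its line breaks because the inner replacement only ever sees the matched table text.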

Extracting only tags from html text file

I'm working on a steganography method which hides text within HTML tags.
For example, given this tag: <heEAd>, I have to extract every character within the tag and then
analyze the case of each letter: if it is a capital the bit is set to 1, otherwise 0. I also want to check that the matching closing /head tag appears at the end.
here is the code :
WebClient client = new WebClient();
String htmlCode = client.DownloadString("url");
String Tags = "";
for(int i = 0; i < htmlCode.Length; i++){
if(htmlCode[i] ='<'){
if(htmlCode[i] = '>')
continue;
else{
Tags += htmlCode[i];
}
}
}
That logic is terrible, but how do I use IndexOf and LastIndexOf to get the desired substring? I tried to use them, but I'm missing something due to my lack of knowledge of C#.
I think you need to use a regex.
I once tried to do this with Substring and it was a lot of work. Later I decided to use a regex and it was much easier.
var regex = new Regex(@"(?<=<head>).*(?=</head>)");
return regex.Matches(strInput);
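Since the question is about the letters of the tag names themselves rather than the content between the tags, a regex that captures tag names may be closer to what is needed. The following is a minimal sketch under assumptions of my own: the names are hypothetical, closing tags are skipped, and only letters contribute bits.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating case-based bit extraction from tag names.
public static class TagCaseBits
{
    // Group 1 is "/" for a closing tag; group 2 is the tag name, e.g. "heEAd".
    private static readonly Regex TagName =
        new Regex(@"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>");

    public static string ExtractBits(string html)
    {
        var bits = new StringBuilder();
        foreach (Match m in TagName.Matches(html))
        {
            if (m.Groups[1].Value == "/")
                continue; // assumption: closing tags carry no payload

            foreach (char c in m.Groups[2].Value)
            {
                if (char.IsLetter(c))
                    bits.Append(char.IsUpper(c) ? '1' : '0');
            }
        }
        return bits.ToString();
    }
}
```

For the tag <heEAd> this yields "00110". Checking that each opening tag has a matching closing tag would need extra bookkeeping (for example a stack of names compared case-insensitively), which is left out here.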

why regex split add to pattern \r\n

I want to split the body of an article by HTML div tags, so I have a pattern that searches for divs.
The problem is that the result also seems to be split on \r\n.
string pattern = @"<div[^<>]*>(.*?)</div>";
string[] bodyParagraphsnew = Regex.Split(body, pattern,RegexOptions.None);
Response.Write("num of paragraph =" + bodyParagraphsnew.Length);
for (int i = 0; i < bodyParagraphsnew.Length; i++)
{
Response.Write("bodyParagraphs" + i + "= " + bodyParagraphsnew[i]+ Environment.NewLine);
}
When I debug this code I see a lot of "\r\n" entries in the bodyParagraphsnew array.
It looks as if the pattern also splits on the string "\r\n".
I tried replacing \r\n with an empty string, hoping the length of bodyParagraphsnew would change, but it did not. Instead of items containing "\r\n", the array now contains "" items.
Why?
Here is a link to an image that explains the problem: http://i.stack.imgur.com/Hxqki.gif
What you are seeing is the text that is between the end of the first </div> tag and the start of the next <div> tag. This is what Split does: it returns the text between the regular expression matches.
What is curious here, though, is that you will also get the text between the opening and closing tags, because the parentheses in your pattern form a capturing group. Consider the following program:
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main(string[] args)
{
string body = "<div>some text</div>\r\n<div>some more text</div>";
string pattern = @"<div[^>]*?>(.*?)</div>";
string[] bodyParagraphsnew = Regex.Split(body, pattern, RegexOptions.None);
Console.WriteLine("num of paragraph =" + bodyParagraphsnew.Length);
for (int i = 0; i < bodyParagraphsnew.Length; i++)
{
Console.WriteLine("bodyParagraphs {0}: '{1}'", i, bodyParagraphsnew[i]);
}
}
}
What you will get from this is:
"" - An empty string taken from before the first <div>.
"some text" - The contents of the first <div>, because of the capturing group.
"\r\n" - The text between the end of the first </div> and the start of the last <div>.
"some more text" - The contents of the second div, again because of the capturing group.
"" - An empty string taken from after the last </div>.
What you are probably after is the contents of the div tags. This can kind of be achieved using this code:
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main(string[] args)
{
string body = "<div>some text</div>\r\n<div>some more text</div>";
string pattern = @"<div[^>]*?>(.*?)</div>";
MatchCollection bodyParagraphsnew = Regex.Matches(body, pattern, RegexOptions.None);
Console.WriteLine("num of paragraph =" + bodyParagraphsnew.Count);
for (int i = 0; i < bodyParagraphsnew.Count; i++)
{
Console.WriteLine("bodyParagraphs {0}: '{1}'", i, bodyParagraphsnew[i].Groups[1].Value);
}
}
}
Note however that in HTML, div tags can be nested within each other. For example, the following is a valid HTML string:
string test = "<div>Outer div<div>inner div</div>outer div again</div>";
In this kind of situation regular expressions are not going to work! This is largely because HTML is not a regular language. To deal with it you would need to write a parser (of which regular expressions are only a small part). Personally I wouldn't bother, as there are plenty of open source HTML parsers already available, HTML Agility Pack for example.
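To make the nesting problem concrete, here is a small sketch (the class and method names are only illustrative) that runs the same pattern against the nested string above and returns what the capturing group picks up.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo of how the lazy pattern behaves on nested divs.
public static class NestedDivDemo
{
    public static string FirstDivContents(string html)
    {
        // The lazy (.*?) stops at the first </div> it meets,
        // which here belongs to the inner div, not the outer one.
        Match m = Regex.Match(html, @"<div[^>]*?>(.*?)</div>");
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

For the nested test string the group contains "Outer div<div>inner div": the inner opening tag is swallowed and "outer div again" is lost, which is why a real parser is the safer route.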
Two possibilities:
Use a List instead of an array and call List.Remove.
Go through your array, search for \r\n and remove it by index:
if(bodyParagraphsnew[i] == "\r\n")
{
bodyParagraphsnew = bodyParagraphsnew.Where(w => w != bodyParagraphsnew[i]).ToArray();
}
Not very nice, but maybe it is what you were looking for.
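A variation on the same idea, sketched here with made-up names, is to filter the split result in a single LINQ pass instead of rebuilding the array inside a loop. It assumes that every empty or whitespace-only entry is unwanted.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: split, then drop empty and whitespace-only pieces.
public static class SplitCleanup
{
    public static string[] SplitAndDropBlank(string body, string pattern)
    {
        return Regex.Split(body, pattern)
                    .Where(part => !string.IsNullOrWhiteSpace(part)) // removes "" and "\r\n"
                    .ToArray();
    }
}
```

With the two-div sample string and pattern from the answer above, this leaves just the two captured div contents.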

unable to find a substring in html after decode/normalize

I have a snippet of html held as a string "s", it's user generated and may come from multiple sources, so I can't control the encoding of characters etc.
I have a simple string "comparison", and I need to check if comparison exists as a substring of "s". "comparison" does not have any html tags or encoding.
I am decoding, normalizing, and using a regex to strip out html tags, but am still unable to find the substring even when I know it is there...
string s = "<p>this is my string.</p><p>my string is html with tags and <a href=\"someurl\">links</a> and encoding.</p><p>i want to&nbsp;find&nbsp;a&nbsp;substring but my comparison might not have tags & encoding.";
string comparison = "i want to find a substring";
string decode = HttpUtility.HtmlDecode(s);
string tagsreplaced = Regex.Replace(decode, "<.*?>", " ");
string normalized = tagsreplaced.Normalize();
Literal1.Text = normalized;
if (normalized.IndexOf(comparison) != -1)
{
Label1.Text = "substring found";
}
else
{
Label1.Text = "substring not found";
}
This is returning "substring not found". I can see by clicking view source that the string sent to the Literal absolutely includes the comparison string exactly as provided, so why isn't it being found?
Is there another way to achieve this?
The answer is that the HTML entity decoding decodes your &nbsp; to the non-breaking space character U+00A0 (bytes 0xC2 0xA0 in UTF-8), which is not a normal space character ' ' (which is 0x20). Verify this with the following program:
using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Web;
namespace TestStuff
{
class Program
{
static void Main(string[] args)
{
string s = "<p>this is my string.</p><p>my string is html with tags and <a href=\"someurl\">links</a> and encoding.</p><p>i want to&nbsp;find&nbsp;a&nbsp;substring but my comparison might not have tags & encoding.";
s = "i want to&nbsp;find&nbsp;a&nbsp;substring";
string comparison = "i want to find a substring";
string decode = HttpUtility.HtmlDecode(s);
string tagsreplaced = Regex.Replace(decode, "<.*?>", " ");
string normalized = tagsreplaced.Normalize();
Console.WriteLine("Dumping first string");
Console.WriteLine(normalized);
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(normalized)));
Console.WriteLine("Dumping second string");
Console.WriteLine(comparison);
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(comparison)));
if (normalized.IndexOf(comparison) != -1)
Console.WriteLine("substring found");
else
Console.WriteLine("substring not found");
Console.ReadLine();
return;
}
}
}
It dumps the UTF8 encodings of the two strings for you. You'll see as output:
Dumping first string
i want to find a substring
69-20-77-61-6E-74-20-74-6F-C2-A0-66-69-6E-64-C2-A0-61-C2-A0-73-75-62-73-74-72-69-6E-67
Dumping second string
i want to find a substring
69-20-77-61-6E-74-20-74-6F-20-66-69-6E-64-20-61-20-73-75-62-73-74-72-69-6E-67
substring not found
You see that the bytearrays do not match, therefore they aren't equal, therefore .IndexOf() is right to tell you that nothing was found.
So, the problem lies within the HTML itself: it contains non-breaking space characters which are not decoded to normal spaces. You can hack around it by replacing each "&nbsp;" with a " " in the string using String.Replace().
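A minimal sketch of that workaround follows. The names are hypothetical, it uses WebUtility.HtmlDecode from System.Net rather than HttpUtility so that it does not depend on System.Web, and it strips tags before decoding so that decoded entities are not mistaken for markup.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: reduce HTML to searchable plain text.
public static class HtmlTextSearch
{
    public static string ToPlainText(string html)
    {
        string noTags = Regex.Replace(html, "<.*?>", " ");  // crude tag removal, as in the question
        string decoded = WebUtility.HtmlDecode(noTags);     // "&nbsp;" becomes U+00A0
        return decoded.Replace('\u00A0', ' ');              // map non-breaking spaces to plain spaces
    }

    public static bool ContainsText(string html, string comparison)
    {
        return ToPlainText(html).IndexOf(comparison, StringComparison.Ordinal) != -1;
    }
}
```

Other whitespace variants (tabs, line breaks, repeated spaces) could still defeat the comparison, so collapsing runs of whitespace in both strings may be worth adding.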

Losing the 'less than' sign in HtmlAgilityPack loadhtml

I recently started experimenting with the HtmlAgilityPack. I am not familiar with all of its options, and I think that is why I am doing something wrong.
I have a string with the following content:
string s = "<span style=\"color: #0000FF;\"><</span>";
You see that in my span I have a 'less than' sign.
I process this string with the following code:
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(s);
But when I do a quick and dirty look in the span like this:
htmlDocument.DocumentNode.ChildNodes[0].InnerHtml
I see that the span is empty.
What option do I need to set to keep the 'less than' sign? I already tried this:
htmlDocument.OptionAutoCloseOnEnd = false;
htmlDocument.OptionCheckSyntax = false;
htmlDocument.OptionFixNestedTags = false;
but with no success.
I know it is invalid HTML. I am using this to fix invalid HTML and to HTML-encode the 'less than' signs.
Please point me in the right direction. Thanks in advance.
The Html Agility Pack detects this as an error and creates an HtmlParseError instance for it. You can read all errors using the ParseErrors property of the HtmlDocument class. So, if you run this code:
string s = "<span style=\"color: #0000FF;\"><</span>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(s);
doc.Save(Console.Out);
Console.WriteLine();
Console.WriteLine();
foreach (HtmlParseError err in doc.ParseErrors)
{
Console.WriteLine("Error");
Console.WriteLine(" code=" + err.Code);
Console.WriteLine(" reason=" + err.Reason);
Console.WriteLine(" text=" + err.SourceText);
Console.WriteLine(" line=" + err.Line);
Console.WriteLine(" pos=" + err.StreamPosition);
Console.WriteLine(" col=" + err.LinePosition);
}
It will display this (the corrected text first, and details about the error then):
<span style="color: #0000FF;"></span>
Error
code=EndTagNotRequired
reason=End tag </> is not required
text=<
line=1
pos=30
col=31
So you can try to fix this error, as you have all required information (including line, column, and stream position) but the general process of fixing (not detecting) errors in HTML is very complex.
As mentioned in another answer, the best solution I found was to pre-parse the HTML to convert orphaned < symbols to their HTML encoded value &lt;.
return Regex.Replace(html, "<(?![^<]+>)", "&lt;");
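Wrapped in a method, that pre-parse step might look like the sketch below. The class and method names are made up, and the replacement is assumed to be the entity &lt;. The pattern treats a '<' as orphaned when no '>' follows it before the next '<'.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the one-line regex fix.
public static class OrphanLessThan
{
    public static string Escape(string html)
    {
        // Negative lookahead: this '<' is NOT followed by non-'<' characters and then '>'.
        return Regex.Replace(html, "<(?![^<]+>)", "&lt;");
    }
}
```

The result can then be passed to HtmlDocument.LoadHtml. A stray '<' that happens to be followed later by a '>' in plain text (as in "1 < 2 > 0") would still look like a tag to this pattern, so it is a heuristic rather than a complete fix.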
Fix the markup, because your HTML string is invalid:
string s = "<span style=\"color: #0000FF;\">&lt;</span>";
Although it is true that the given html is invalid, HtmlAgilityPack should still be able to parse it. It is not an uncommon mistake on the web to forget to encode "<", and if HtmlAgilityPack is used as a crawler, then it should anticipate bad html. I tested the example in IE, Chrome and Firefox, and they all show the extra < as text.
I wrote the following method that you can use to preprocess the HTML string and replace all 'unclosed' '<' characters with "&lt;":
static string PreProcess(string htmlInput)
{
// Stores the index of the last unclosed '<' character, or -1 if the last '<' character is closed.
int lastGt = -1;
// This list will be populated with all the unclosed '<' characters.
List<int> gtPositions = new List<int>();
// Collect the unclosed '<' characters.
for (int i = 0; i < htmlInput.Length; i++)
{
if (htmlInput[i] == '<')
{
if (lastGt != -1)
gtPositions.Add(lastGt);
lastGt = i;
}
else if (htmlInput[i] == '>')
lastGt = -1;
}
if (lastGt != -1)
gtPositions.Add(lastGt);
// If no unclosed '<' characters are found, then just return the input string.
if (gtPositions.Count == 0)
return htmlInput;
// Build the output string, replacing each unclosed '<' character with "&lt;".
StringBuilder htmlOutput = new StringBuilder(htmlInput.Length + 3 * gtPositions.Count);
int start = 0;
foreach (int gtPosition in gtPositions)
{
htmlOutput.Append(htmlInput.Substring(start, gtPosition - start));
htmlOutput.Append("&lt;");
start = gtPosition + 1;
}
htmlOutput.Append(htmlInput.Substring(start));
return htmlOutput.ToString();
}
The string "s" is bad HTML.
string s = "<span style=\"color: #0000FF;\">&lt;</span>";
is the correct form.
