C# removing white spaces in an HTML string

Is it possible to remove all white spaces in the following HTML string in C#:
"
<html>
<body>
</body>
</html>
"
Thanks

When dealing with HTML or any markup for that matter, it's usually best to run it through a parser that truly understands the rules of that markup.
The first benefit is that it can tell you if your initial input data is garbage to start with.
If the parser is smart enough it might even be able to correct badly formed markup automatically, or accept it with relaxed rules.
You can then modify the parsed content and get the parser to write out the changes. This way you can be sure the markup rules are followed and you have correct output.
For some simple HTML markup scenarios, or for markup that is so badly formed that a parser balks at it straight away, you can fall back to hacking the input string with string replacements and the like. Which approach you take depends on your needs.
Here are a couple of tools that can help you out:
HTML Tidy
You can use HTML Tidy and just specify some options/rules on how you want your HTML to be tidied up (e.g. remove superfluous whitespace).
It's a Win32 DLL, but there are C# wrappers for it.
http://tidy.sourceforge.net
http://robertbeal.com/37/sanitising-html
C# version of HTML Tidy?
http://geekswithblogs.net/mnf/archive/2011/06/08/implementations-of-html-tidylib-for-.net.aspx
HtmlAgilityPack
You can use HtmlAgilityPack to parse HTML if you need to understand the structure better and perhaps do your own tidying up/restructuring.
http://html-agility-pack.net

myString = myString.Replace(System.Environment.NewLine, "");

You can use a regular expression to match white space characters for the replace:
s = Regex.Replace(s, @"\s+", String.Empty);
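Note that matching \s+ everywhere also removes whitespace inside text and between attributes. The following is a minimal sketch, not a definitive implementation, contrasting that with a pattern that only removes whitespace sitting between two tags. The class and method names (HtmlWhitespace, RemoveAll, RemoveBetweenTags) are made up for illustration, and the second pattern assumes reasonably well-formed markup with no literal '>' or '<' inside attribute values or script blocks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlWhitespace
{
    // Removes every whitespace character, including spaces inside text
    // content and the spaces that separate attributes from tag names.
    public static string RemoveAll(string html) =>
        Regex.Replace(html, @"\s+", string.Empty);

    // Removes only whitespace runs that sit directly between a '>' and
    // the next '<', leaving text content and attributes untouched.
    public static string RemoveBetweenTags(string html) =>
        Regex.Replace(html, @">\s+<", "><");
}
```

For the input "<html>\r\n<body>\r\n<p class=\"x\">Hello world</p>\r\n</body>\r\n</html>", RemoveAll also glues "p" to "class" and "Hello" to "world", while RemoveBetweenTags only drops the line breaks between the tags.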

I used this solution, which works well in my experience (see the test code below):
Add an extension method to trim the HTML string:
public static string RemoveSuperfluousWhitespaces(this string input)
{
if (input.Length < 3) return input;
var resultString = new StringBuilder(); // Using StringBuilder is much faster than using regular expressions here!
var inputChars = input.ToCharArray();
var index1 = 0;
var index2 = 1;
var index3 = 2;
// Remove superfluous white spaces from the html stream by the following replacements:
// '<no whitespace>' '>' '<whitespace>' ==> '<no whitespace>' '>'
// '<whitespace>' '<' '<no whitespace>' ==> '<' '<no whitespace>'
while (index3 < inputChars.Length)
{
var char1 = inputChars[index1];
var char2 = inputChars[index2];
var char3 = inputChars[index3];
if (!Char.IsWhiteSpace(char1) && char2 == '>' && Char.IsWhiteSpace(char3))
{
// drop whitespace character in char3
index3++;
}
else if (Char.IsWhiteSpace(char1) && char2 == '<' && !Char.IsWhiteSpace(char3))
{
// drop whitespace character in char1
index1 = index2;
index2 = index3;
index3++;
}
else
{
resultString.Append(char1);
index1 = index2;
index2 = index3;
index3++;
}
}
// (index3 >= inputChars.Length)
resultString.Append(inputChars[index1]);
resultString.Append(inputChars[index2]);
var str = resultString.ToString();
return str;
}
// 2) add test code:
[Test]
public void TestRemoveSuperfluousWhitespaces()
{
var html1 = "<td class=\"keycolumn\"><p class=\"mandatory\">Some recipe parameter name</p></td>";
var html2 = $"<td class=\"keycolumn\">{Environment.NewLine}<p class=\"mandatory\">Some recipe parameter name</p>{Environment.NewLine}</td>";
var html3 = $"<td class=\"keycolumn\">{Environment.NewLine} <p class=\"mandatory\">Some recipe parameter name</p> {Environment.NewLine}</td>";
var html4 = " <td class=\"keycolumn\"><p class=\"mandatory\">Some recipe parameter name</p></td>";
var html5 = "<td class=\"keycolumn\"><p class=\"mandatory\">Some recipe parameter name</p></td> ";
var compactedHtml1 = html1.RemoveSuperfluousWhitespaces();
compactedHtml1.Should().BeEquivalentTo(html1);
var compactedHtml2 = html2.RemoveSuperfluousWhitespaces();
compactedHtml2.Should().BeEquivalentTo(html1);
var compactedHtml3 = html3.RemoveSuperfluousWhitespaces();
compactedHtml3.Should().BeEquivalentTo(html1);
var compactedHtml4 = html4.RemoveSuperfluousWhitespaces();
compactedHtml4.Should().BeEquivalentTo(html1);
var compactedHtml5 = html5.RemoveSuperfluousWhitespaces();
compactedHtml5.Should().BeEquivalentTo(html1);
}

Related

Extracting only tags from html text file

I'm working on a steganography method which hides text within HTML tags.
For example, given this tag: <heEAd>, I have to extract every character within the tag and then
analyze the case of each letter: if it is a capital, the bit is set to 1, otherwise 0. I also want to check for the end, that is, the matching closing /head tag.
Here is the code:
WebClient client = new WebClient();
String htmlCode = client.DownloadString("url");
String Tags = "";
for(int i = 0; i < htmlCode.Length; i++){
if(htmlCode[i] ='<'){
if(htmlCode[i] = '>')
continue;
else{
Tags += htmlCode[i];
}
}
}
I know that logic is terrible, but how do I use IndexOf and LastIndexOf to get the desired substring? I tried to use them, but I'm missing something because of my limited knowledge of C#.
I think you need to use a regex.
I tried to do this once with Substring and it was a lot of work. Later I decided to use a regex, and it was much easier.
var regex = new Regex(@"(?<=<head>).*(?=</head>)");
return regex.Matches(strInput);
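For the case-to-bit step described in the question, a regex can capture the tag names directly. The following is a rough sketch under stated assumptions: the names TagCaseDecoder and ExtractBits are hypothetical, only opening tags are read, and checking for the matching closing tag is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class TagCaseDecoder
{
    // Captures the name of an opening tag such as <heEAd> or <BoDy class="x">.
    // Closing tags (</head>) and comments (<!-- -->) are skipped because the
    // character after '<' must be a letter.
    private static readonly Regex OpeningTagName =
        new Regex(@"<([A-Za-z][A-Za-z0-9]*)");

    // Returns one bit string per opening tag: '1' for an upper-case letter,
    // '0' for a lower-case letter. Digits in tag names (h1, h2) are ignored.
    public static List<string> ExtractBits(string html)
    {
        var result = new List<string>();
        foreach (Match match in OpeningTagName.Matches(html))
        {
            var bits = new StringBuilder();
            foreach (char c in match.Groups[1].Value)
            {
                if (char.IsLetter(c))
                    bits.Append(char.IsUpper(c) ? '1' : '0');
            }
            result.Add(bits.ToString());
        }
        return result;
    }
}
```

With this sketch, the tag <heEAd> yields the bit string "00110".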

Replace character references for invalid XML characters

I am projecting some data as XML from SQL Server using ADO.NET. Some of my data contains characters that are invalid in XML, such as CHAR(7) (known as BEL).
SELECT 'This is BEL: ' + CHAR(7) AS A FOR XML RAW
SQL Server encodes such invalid characters as numeric references:
<row A="This is BEL: &#x07;" />
However, even the encoded form is invalid under XML 1.0, and will give rise to errors in XML parsers:
var doc = XDocument.Parse("<row A=\"This is BEL: &#x07;\" />");
// XmlException: ' ', hexadecimal value 0x07, is an invalid character. Line 1, position 25.
I would like to replace all these invalid numeric references with the Unicode replacement character, '�'. I know how to do this for unencoded XML:
string str = "<row A=\"This is BEL: \u0007\" />";
if (str.Any(c => !XmlConvert.IsXmlChar(c)))
str = new string(str.Select(c => XmlConvert.IsXmlChar(c) ? c : '�').ToArray());
// <row A="This is BEL: �" />
Is there a straightforward way to make it work for encoded XML too? I would prefer to avoid having to HtmlDecode then HtmlEncode the whole string, in order not to risk introducing changes other than invalid character replacement.
Edit: The conversion needs to be done in my C# code, not SQL, in order for it to be implemented centrally.
I made another go at it using regular expressions. This should handle both decimal and hex character codes. Also, this will not affect anything but numerically encoded characters.
public string ReplaceXMLEncodedCharacters(string input)
{
const string pattern = @"&#(x?)([A-Fa-f0-9]+);";
MatchCollection matches = Regex.Matches(input, pattern);
int offset = 0;
foreach (Match match in matches)
{
int charCode = 0;
if (string.IsNullOrEmpty(match.Groups[1].Value))
charCode = int.Parse(match.Groups[2].Value);
else
charCode = int.Parse(match.Groups[2].Value, System.Globalization.NumberStyles.HexNumber);
char character = (char)charCode;
input = input.Remove(match.Index - offset, match.Length).Insert(match.Index - offset, character.ToString());
offset += match.Length - 1;
}
return input;
}
You can wrap the special characters in the CDATA tag. This informs the parser to ignore text within the tag. To use your example:
SELECT 'This is BEL: <![CDATA[' + CHAR(7) + ']]>' AS A FOR XML RAW
This will allow the XML to be parsed at the very least, albeit requiring a slight change to the document structure.
For reference, this is my solution. I've built on Tonkleton's answer, but modified it to match the internal implementation of HtmlDecode more closely. The code below ignores surrogate pairs.
// numeric character references
static readonly Regex ncrRegex = new Regex("&#x?[A-Fa-f0-9]+;");
static string ReplaceInvalidXmlCharacterReferences(string input)
{
if (input.IndexOf("&#") == -1) // optimization
return input;
return ncrRegex.Replace(input, match =>
{
string ncr = match.Value;
uint num;
var frmt = NumberFormatInfo.InvariantInfo;
bool isParsed =
ncr[2] == 'x' ? // the x must be lowercase in XML documents
uint.TryParse(ncr.Substring(3, ncr.Length - 4), NumberStyles.AllowHexSpecifier, frmt, out num) :
uint.TryParse(ncr.Substring(2, ncr.Length - 3), NumberStyles.Integer, frmt, out num);
return isParsed && !XmlConvert.IsXmlChar((char)num) ? "�" : ncr;
});
}

When using IndexOf and Substring, how do I pass the right start and end indexes? And how do I encode Hebrew chars?

I have this code:
string firstTag = "Forums2008/forumPage.aspx?forumId=";
string endTag = "</a>";
index = forums.IndexOf(firstTag, index1);
if (index == -1)
continue;
var secondIndex = forums.IndexOf(endTag, index);
result = forums.Substring(index + firstTag.Length + 12, secondIndex - (index + firstTag.Length - 50));
The string i want to extract from is for example:
הנקה
What I want to get is only the word after the title, this: הנקה
The second problem is that when I extract it, instead of Hebrew I see gibberish like this: ������
One powerful way to do this is to use Regular Expressions instead of trying to find a starting position and use a substring. Try out this code, and you'll see that it extracts the anchor tag's title:
var input = "הנקה";
var expression = new System.Text.RegularExpressions.Regex(@"title=\""([^\""]+)\""");
var match = expression.Match(input);
if (match.Success) {
Console.WriteLine(match.Groups[1]);
}
else {
Console.WriteLine("not found");
}
And for the curious, here is a version in JavaScript:
var input = 'הנקה';
var expression = new RegExp('title=\"([^\"]+)\"');
var results = expression.exec(input);
if (results) {
document.write(results[1]);
}
else {
document.write("not found");
}
Okay here is the solution using String.Substring() String.Split() and String.IndexOf()
String str = "הנקה"; // <== Assume this is passing string. Yes unusual scape sequence are added
int splitStart = str.IndexOf("title="); // < Where to start splitting
int splitEnd = str.LastIndexOf("</a>"); // < = Where to end
/* What we try to extract is this : title="הנקה">הנקה
* (Given without escape sequence)
*/
String extracted = str.Substring(splitStart, splitEnd - splitStart); // <=Extracting required portion
String[] splitted = extracted.Split('"'); // < = Now split with "
Console.WriteLine(splitted[1]); // <= Try to Out but yes will produce ???? But put a breakpoint here and check the values in split array
Now the problem: as you can see, I had to use escape sequences in an unusual way. You may ignore that, since you are simply passing in the string to scan.
This actually works, but you cannot see the result with the provided Console.WriteLine(splitted[1]);
If you put a breakpoint there and inspect the split array, you can see that the text is extracted.

C# and regular expressions: recursive replace until specific string

I have a recursive html text like:
string html = "<input id=\"txt0\" value=\"hello\"></input>some undefined text<input id=\"txt1\" value=\"world\"></input>";
that can be repeated n times (in the example n=2), but n is a variable number which is not known.
I would like to replace all text inside 'value' attribute (in the example 'hello' and 'world') with a text in an array, using regular expressions.
Regex rg = new Regex(which pattern?, RegexOptions.IgnoreCase);
int count= rg.Split(html).Length - 1; // in the example count = 2
for (int i = 0; i < count; i++)
{
html= rg.Replace(html, @"value=""" + myarray[i] + @""">", 1);
}
My problem is that I cannot find the right regex pattern to make these substitutions.
If I use something like:
Regex rg = new Regex(@"value="".*""", RegexOptions.IgnoreCase);
int count= rg.Split(html).Length - 1;
for (int i = 0; i < count; i++)
{
html= rg.Replace(html, @"value=""" + myarray[i] + @"""", 1);
}
I get html like
<input id="txt0" value="lorem ipsum"></input>
because .* in the pattern matches extra characters, while I need it to stop before the next '<input' occurrence.
The result should be something like:
<input id="txt0" value="lorem ipsum"></input>some undefined text<input id="txt1" value="another text"></input>
Any suggestion or help would be much appreciated.
Thanks!
Don't try to parse HTML with regex, as others pointed out in the comments.
Suppose you have an input with value <input id=txt2 value="x">.
<input id=txt1 value='<input id=txt2 value="x">' > would you easily be able to parse it?
Therefore, use an HTML parser. For your sample I will use Html Agility Pack:
string html = "<input id=\"txt0\" value=\"hello\"></input>some undefined text<input id=\"txt1\" value=\"world\"></input>";
var myarray = new List<string>() { "val111", "val222", "val333" };
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
int count = 0;
foreach(var inp in doc.DocumentNode.Descendants("input"))
{
if (inp.Attributes["value"] != null)
inp.Attributes["value"].Value = myarray[count++];
}
While I'm inclined to nudge you towards using an HTML Parser, IF your HTML input is as simple as it is in your example and you have no funky HTMLs like the one L.B has in his answer, the solution to your problem is to just be NOT greedy:
Regex rg = new Regex(@"value="".*?""", RegexOptions.IgnoreCase);
The question mark after .* makes the quantifier lazy, so the regex stops at the shortest possible match for your pattern.
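A lazy pattern can also be combined with a MatchEvaluator, so that each match takes the next array entry in a single pass without counting matches first. This is only a sketch for simple input like the example in the question: ValueReplacer and ReplaceValues are made-up names, and matches beyond the end of the array are left unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ValueReplacer
{
    // Replaces the contents of each value="..." attribute, in document order,
    // with the next entry of the replacements array.
    public static string ReplaceValues(string html, string[] replacements)
    {
        int next = 0;
        return Regex.Replace(
            html,
            @"value="".*?""",              // lazy: stops at the first closing quote
            match => next < replacements.Length
                ? "value=\"" + replacements[next++] + "\""
                : match.Value,             // no replacement left: keep the original
            RegexOptions.IgnoreCase);
    }
}
```

Called with the html string from the question and the array { "lorem ipsum", "another text" }, this produces the output the question asks for. The evaluator's return value is inserted literally, so characters such as $ in the replacement text are not interpreted.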

Losing the 'less than' sign in HtmlAgilityPack loadhtml

I recently started experimenting with the HtmlAgilityPack. I am not familiar with all of its options, and I think I am therefore doing something wrong.
I have a string with the following content:
string s = "<span style=\"color: #0000FF;\"><</span>";
You see that in my span I have a 'less than' sign.
I process this string with the following code:
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(s);
But when I do a quick and dirty look in the span like this:
htmlDocument.DocumentNode.ChildNodes[0].InnerHtml
I see that the span is empty.
What option do I need to set to maintain the 'less than' sign? I already tried this:
htmlDocument.OptionAutoCloseOnEnd = false;
htmlDocument.OptionCheckSyntax = false;
htmlDocument.OptionFixNestedTags = false;
but with no success.
I know it is invalid HTML. I am using this to fix invalid HTML and apply HtmlEncode to the 'less than' signs.
Please point me in the right direction. Thanks in advance.
The Html Agility Pack detects this as an error and creates an HtmlParseError instance for it. You can read all errors using the ParseErrors property of the HtmlDocument class. So, if you run this code:
string s = "<span style=\"color: #0000FF;\"><</span>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(s);
doc.Save(Console.Out);
Console.WriteLine();
Console.WriteLine();
foreach (HtmlParseError err in doc.ParseErrors)
{
Console.WriteLine("Error");
Console.WriteLine(" code=" + err.Code);
Console.WriteLine(" reason=" + err.Reason);
Console.WriteLine(" text=" + err.SourceText);
Console.WriteLine(" line=" + err.Line);
Console.WriteLine(" pos=" + err.StreamPosition);
Console.WriteLine(" col=" + err.LinePosition);
}
It will display this (the corrected text first, then the details of the error):
<span style="color: #0000FF;"></span>
Error
code=EndTagNotRequired
reason=End tag </> is not required
text=<
line=1
pos=30
col=31
So you can try to fix this error, as you have all required information (including line, column, and stream position) but the general process of fixing (not detecting) errors in HTML is very complex.
As mentioned in another answer, the best solution I found was to pre-parse the HTML to convert orphaned < symbols to their HTML-encoded value, &lt;.
return Regex.Replace(html, "<(?![^<]+>)", "&lt;");
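The pre-parse step can be wrapped in a small helper that runs before the string is handed to the parser. This is a sketch only: OrphanedLessThan and Encode are hypothetical names, and the pattern treats a '<' as orphaned when no '>' follows it before the next '<'. Text such as "1 < 2 and 3 > 2" would therefore still be mistaken for a tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OrphanedLessThan
{
    // Encodes every '<' that is not followed by a run of non-'<' characters
    // ending in '>', i.e. every '<' that does not appear to open a tag.
    public static string Encode(string html) =>
        Regex.Replace(html, "<(?![^<]+>)", "&lt;");
}
```

For the string from the question, only the stray '<' inside the span is replaced by the entity for '<'. The opening and closing span tags are left as they are, so the result can then be passed to LoadHtml.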
Fix the markup, because your HTML string is invalid:
string s = "<span style=\"color: #0000FF;\">&lt;</span>";
Although it is true that the given HTML is invalid, HtmlAgilityPack should still be able to parse it. Forgetting to encode "<" is not an uncommon mistake on the web, and if HtmlAgilityPack is used as a crawler, it should anticipate bad HTML. I tested the example in IE, Chrome and Firefox, and they all show the extra < as text.
I wrote the following method that you can use to preprocess the HTML string and replace all 'unclosed' '<' characters with "&lt;":
static string PreProcess(string htmlInput)
{
// Stores the index of the last unclosed '<' character, or -1 if the last '<' character is closed.
int lastGt = -1;
// This list will be populated with all the unclosed '<' characters.
List<int> gtPositions = new List<int>();
// Collect the unclosed '<' characters.
for (int i = 0; i < htmlInput.Length; i++)
{
if (htmlInput[i] == '<')
{
if (lastGt != -1)
gtPositions.Add(lastGt);
lastGt = i;
}
else if (htmlInput[i] == '>')
lastGt = -1;
}
if (lastGt != -1)
gtPositions.Add(lastGt);
// If no unclosed '<' characters are found, then just return the input string.
if (gtPositions.Count == 0)
return htmlInput;
// Build the output string, replacing every unclosed '<' character with "&lt;".
StringBuilder htmlOutput = new StringBuilder(htmlInput.Length + 3 * gtPositions.Count);
int start = 0;
foreach (int gtPosition in gtPositions)
{
htmlOutput.Append(htmlInput.Substring(start, gtPosition - start));
htmlOutput.Append("&lt;");
start = gtPosition + 1;
}
htmlOutput.Append(htmlInput.Substring(start));
return htmlOutput.ToString();
}
The string "s" is bad HTML.
string s = "<span style=\"color: #0000FF;\">&lt;</span>";
This version is valid.
