C# and regular expressions: recursive replace until specific string

I have a recursive html text like:
string html = "<input id=\"txt0\" value=\"hello\"></input>some undefined text<input id=\"txt1\" value=\"world\"></input>";
that can be repeated n times (in the example n=2), but n is a variable number which is not known.
I would like to replace all text inside 'value' attribute (in the example 'hello' and 'world') with a text in an array, using regular expressions.
Regex rg = new Regex(which pattern?, RegexOptions.IgnoreCase);
int count= rg.Split(html).Length - 1; // in the example count = 2
for (int i = 0; i < count; i++)
{
html= rg.Replace(html, @"value=""" + myarray[i] + @""">", 1);
}
My problem is that I cannot find the right regex pattern to make these substitutions.
If I use something like:
Regex rg = new Regex(@"value="".*""", RegexOptions.IgnoreCase);
int count= rg.Split(html).Length - 1;
for (int i = 0; i < count; i++)
{
html= rg.Replace(html, @"value=""" + myarray[i] + @"""", 1);
}
I get html like
<input id="txt0" value="lorem ipsum"></input>
because .* in the pattern matches too many characters, whereas I need it to stop before the next '<input' occurrence.
The result should be something like:
<input id="txt0" value="lorem ipsum"></input>some undefined text<input id="txt1" value="another text"></input>
A suggestion or any help would be much appreciated.
Thanks!

Don't try to parse HTML with regex, as others pointed out in the comments.
Suppose the value of an input is itself the markup <input id=txt2 value="x">:
<input id=txt1 value='<input id=txt2 value="x">' >
Would you be able to parse that easily with a regex?
Therefore, use an HTML parser. For your sample I will use Html Agility Pack:
string html = "<input id=\"txt0\" value=\"hello\"></input>some undefined text<input id=\"txt1\" value=\"world\"></input>";
var myarray = new List<string>() { "val111", "val222", "val333" };
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
int count = 0;
foreach(var inp in doc.DocumentNode.Descendants("input"))
{
if (inp.Attributes["value"] != null)
inp.Attributes["value"].Value = myarray[count++];
}

While I'm inclined to nudge you towards using an HTML Parser, IF your HTML input is as simple as it is in your example and you have no funky HTMLs like the one L.B has in his answer, the solution to your problem is to just be NOT greedy:
Regex rg = new Regex(@"value="".*?""", RegexOptions.IgnoreCase);
The question mark after .* makes the quantifier lazy, so Regex stops at the shortest possible match for your pattern, i.e. at the first closing quote.
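To illustrate the non-greedy approach, here is a minimal sketch, assuming the markup really is as simple as in the question. It uses a lazy .*? together with a MatchEvaluator, so each match takes the next array entry in a single pass and no counting loop is needed. The ValueReplacer class and the ReplaceValues method are names made up for this example:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ValueReplacer
{
    // Replaces the content of each value="..." attribute, in document order,
    // with the next entry of 'replacements'. Matches beyond the end of the
    // array are left unchanged.
    public static string ReplaceValues(string html, string[] replacements)
    {
        int index = 0;
        // .*? is lazy: it stops at the first closing quote, not the last one.
        var rg = new Regex(@"value="".*?""", RegexOptions.IgnoreCase);
        return rg.Replace(html, m =>
            index < replacements.Length
                ? "value=\"" + replacements[index++] + "\""
                : m.Value);
    }

    public static void Main()
    {
        string html = "<input id=\"txt0\" value=\"hello\"></input>some undefined text<input id=\"txt1\" value=\"world\"></input>";
        Console.WriteLine(ReplaceValues(html, new[] { "lorem ipsum", "another text" }));
    }
}
```

This still breaks on the "funky" inputs mentioned above (for example a quote character inside the value), so it is only a fallback for trusted, simple markup.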

Related

Extract ID and replace everything in `Example HTML`

I am new to regular expressions. I have the following text in my HTML and would like to replace it with something else.
Example HTML:
{{Object id='foo'}}
Extract the id into a variable like this:
string strId = "foo";
So far I have the following Regular Expression code that will capture the Example HTML:
string strStart = "Object";
string strFind = "{{(" + strStart + ".*?)}}";
Regex regExp = new Regex(strFind, RegexOptions.IgnoreCase);
Match matchRegExp = regExp.Match(html);
while (matchRegExp.Success)
{
//At this point, I have this variable:
//{{Object id='foo'}}
//I can find the id='foo' (see below)
//but not sure how to extract 'foo' and use it
string strFindInner = "id='(.*?)'"; //"{{Slider";
Regex regExpInner = new Regex(strFindInner, RegexOptions.IgnoreCase);
Match matchRegExpInner = regExpInner.Match(matchRegExp.Value.ToString());
//Do something with 'foo'
matchRegExp = matchRegExp.NextMatch();
}
I understand this might have a simple solution. I am hoping to gain more knowledge about regular expressions but, more importantly, I am hoping to receive a suggestion on how to approach this in a cleaner and more efficient way.
Thank you
Edit:
Is this an example that I could potentially use: c# regex replace
While I am not solving my initial question with regular expressions, I moved to a simpler solution using Substring, IndexOf and string.Split for the time being. I understand that my code needs to be cleaned up, but I thought I would post the answer that I have thus far.
string html = "<p>Start of Example</p>{{Object id='foo'}}<p>End of example</p>";
string strObject = "Object"; //Example; must match the name used in the placeholder
//When found, this will contain "{{Object id='foo'}}"
string strCode = "";
//ie: "id='foo'"
string strCodeInner = "";
//Tags will be a list, but in this example, only "id='foo'"
string[] tags = { };
//Looking for the following "{{Object "
string strFindStart = "{{" + strObject + " ";
int intFindStart = html.IndexOf(strFindStart);
//Then ending in the following
string strFindEnd = "}}";
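//Search for the end marker only after the start marker
int intFindEnd = intFindStart == -1 ? -1 : html.IndexOf(strFindEnd, intFindStart);
//Must find both Start and End conditions
if (intFindStart != -1 && intFindEnd != -1)
{
//Include the end marker itself in the extracted code
intFindEnd += strFindEnd.Length;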
strCode = html.Substring(intFindStart, intFindEnd - intFindStart);
//Remove Start and End
strCodeInner = strCode.Replace(strFindStart, "").Replace(strFindEnd, "");
//Split by spaces, this needs to be improved if more than IDs are to be used
//but for proof of concept this is perfect
tags = strCodeInner.Split(new char[] { ' ' });
}
Dictionary<string, string> dictTags = new Dictionary<string, string>();
foreach (string tag in tags)
{
string[] tagSplit = tag.Split(new char[] { '=' });
dictTags.Add(tagSplit[0], tagSplit[1].Replace("'", "").Replace("\"", ""));
}
//At this point, I can replace "{{Object id='foo'}}" with anything I'd like
//What I don't show is that I go into the website's database,
//get the object (ie: Slider) and return the html for slider with the ID of foo
html = html.Replace(strCode, strView);
/*
"html" variable may contain:
<p>Start of Example</p>
<p id="foo">This is the replacement text</p>
<p>End of example</p>
*/
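For comparison, the original regex question can also be handled with a single pattern that captures the id in a group, which avoids the inner Regex and the manual string slicing. This is only a sketch: the PlaceholderDemo class and the RenderObject helper are hypothetical, with RenderObject standing in for the database lookup that is not shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlaceholderDemo
{
    // Matches {{Object id='...'}} and captures the id in the named group "id".
    private static readonly Regex Placeholder =
        new Regex(@"\{\{Object\s+id='(?<id>[^']*)'\}\}", RegexOptions.IgnoreCase);

    // Hypothetical stand-in for the database lookup: builds replacement HTML.
    public static string RenderObject(string id)
    {
        return "<p id=\"" + id + "\">This is the replacement text</p>";
    }

    // Replaces every placeholder with the HTML rendered for its id.
    public static string Expand(string html)
    {
        return Placeholder.Replace(html, m => RenderObject(m.Groups["id"].Value));
    }

    public static void Main()
    {
        string html = "<p>Start of Example</p>{{Object id='foo'}}<p>End of example</p>";
        Console.WriteLine(Expand(html));
    }
}
```

Inside the callback, m.Groups["id"].Value is the extracted id ("foo" in the example), so it can be used for whatever lookup is needed before returning the replacement text.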

Why does Regex.Split add \r\n entries to the result?

I want to split the body of an article by HTML div tags, so I have a pattern that searches for divs.
The problem is that the result also contains \r\n entries.
string pattern = @"<div[^<>]*>(.*?)</div>";
string[] bodyParagraphsnew = Regex.Split(body, pattern,RegexOptions.None);
Response.Write("num of paragraph =" + bodyParagraphsnew.Length);
for (int i = 0; i < bodyParagraphsnew.Length; i++)
{
Response.Write("bodyParagraphs" + i + "= " + bodyParagraphsnew[i]+ Environment.NewLine);
}
When I debug this code I see a lot of "\r\n" entries in the bodyParagraphsnew array.
It looks as if the pattern also splits on the string "\r\n".
I tried replacing \r\n with an empty string, hoping that the length of bodyParagraphsnew would change, but it did not: the items that contained \r\n now contain "" instead.
Why?
Here is a link to an image that shows the problem: http://i.stack.imgur.com/Hxqki.gif
What you are seeing is the text that is between the end of the first </div> tag and the start of the next <div> tag. This is what Split does, it finds the text between the Regular Expression matches.
What is curious here though is that you are also going to get the text between the open and close tags because you put brackets in your string forming a capturing group. Consider the following program:
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main(string[] args)
{
string body = "<div>some text</div>\r\n<div>some more text</div>";
string pattern = @"<div[^>]*?>(.*?)</div>";
string[] bodyParagraphsnew = Regex.Split(body, pattern, RegexOptions.None);
Console.WriteLine("num of paragraph =" + bodyParagraphsnew.Length);
for (int i = 0; i < bodyParagraphsnew.Length; i++)
{
Console.WriteLine("bodyParagraphs {0}: '{1}'", i, bodyParagraphsnew[i]);
}
}
}
What you will get from this is:
"" - An empty string taken from before the first <div>.
"some text" - The contents of the first <div>, because of the capturing group.
"\r\n" - The text between the end of the first </div> and the start of the last <div>.
"some more text" - The contents of the second div, again because of the capturing group.
"" - An empty string taken from after the last </div>.
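The effect of the capturing group on Regex.Split can be seen by comparing it with a non-capturing group. The sketch below (the SplitDemo class and its method names are made up for illustration) splits the same body both ways: with (.*?) the div contents are inserted into the result, with (?:.*?) only the text between the matches remains.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Capturing group: Regex.Split also returns the captured div contents.
    public static string[] SplitCapturing(string body)
    {
        return Regex.Split(body, @"<div[^>]*?>(.*?)</div>");
    }

    // Non-capturing group: only the text between the matches is returned.
    public static string[] SplitNonCapturing(string body)
    {
        return Regex.Split(body, @"<div[^>]*?>(?:.*?)</div>");
    }

    public static void Main()
    {
        string body = "<div>some text</div>\r\n<div>some more text</div>";
        Console.WriteLine("capturing: " + SplitCapturing(body).Length + " items");
        Console.WriteLine("non-capturing: " + SplitNonCapturing(body).Length + " items");
    }
}
```

For this input the capturing version yields the five items listed above, while the non-capturing version yields three: "", "\r\n" and "".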
What you are probably after is the contents of the div tags. This can kind of be achieved using this code:
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main(string[] args)
{
string body = "<div>some text</div>\r\n<div>some more text</div>";
string pattern = @"<div[^>]*?>(.*?)</div>";
MatchCollection bodyParagraphsnew = Regex.Matches(body, pattern, RegexOptions.None);
Console.WriteLine("num of paragraph =" + bodyParagraphsnew.Count);
for (int i = 0; i < bodyParagraphsnew.Count; i++)
{
Console.WriteLine("bodyParagraphs {0}: '{1}'", i, bodyParagraphsnew[i].Groups[1].Value);
}
}
}
Note however that in HTML, div tags can be nested within each other. For example, the following is a valid HTML string:
string test = "<div>Outer div<div>inner div</div>outer div again</div>";
In this kind of situation regular expressions are not going to work! This is largely due to HTML not being a regular language. To deal with it you would need to write a parser (of which regular expressions are only a small part). Personally I wouldn't bother, as there are plenty of open source HTML parsers already available, HTML Agility Pack for example.
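To make the nesting problem concrete, here is a small sketch (NestedDivDemo is a name invented for this example) that runs the same div pattern over the nested string. The lazy (.*?) stops at the first </div> it meets, which belongs to the inner div, so the single match cuts the outer div in half:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedDivDemo
{
    // Returns the captured contents of every match of the div pattern.
    public static string[] DivContents(string html)
    {
        MatchCollection matches = Regex.Matches(html, @"<div[^>]*?>(.*?)</div>");
        var contents = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            contents[i] = matches[i].Groups[1].Value;
        }
        return contents;
    }

    public static void Main()
    {
        string test = "<div>Outer div<div>inner div</div>outer div again</div>";
        foreach (string content in DivContents(test))
        {
            Console.WriteLine("'" + content + "'");
        }
    }
}
```

The only match captures "Outer div<div>inner div"; the text "outer div again" and the real end of the outer div are never seen as div content.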
Two possibilities:
Use a List instead of an array and call List.Remove.
Go through your array, search for \r\n and remove those entries (this needs using System.Linq;):
bodyParagraphsnew = bodyParagraphsnew.Where(w => w != "\r\n").ToArray();
Not very nice, but maybe it is what you were looking for.

When using IndexOf and Substring, how do I get the right start and end indexes? And how do I encode Hebrew characters?

I have this code:
string firstTag = "Forums2008/forumPage.aspx?forumId=";
string endTag = "</a>";
index = forums.IndexOf(firstTag, index1);
if (index == -1)
continue;
var secondIndex = forums.IndexOf(endTag, index);
result = forums.Substring(index + firstTag.Length + 12, secondIndex - (index + firstTag.Length - 50));
The string i want to extract from is for example:
הנקה
What I want to get is only the word after the title: הנקה
The second problem is that when I extract it, instead of Hebrew I see gibberish like this: ������
One powerful way to do this is to use Regular Expressions instead of trying to find a starting position and use a substring. Try out this code, and you'll see that it extracts the anchor tag's title:
var input = "הנקה";
var expression = new System.Text.RegularExpressions.Regex(@"title=\""([^\""]+)\""");
var match = expression.Match(input);
if (match.Success) {
Console.WriteLine(match.Groups[1]);
}
else {
Console.WriteLine("not found");
}
And for the curious, here is a version in JavaScript:
var input = 'הנקה';
var expression = new RegExp('title=\"([^\"]+)\"');
var results = expression.exec(input);
if (results) {
document.write(results[1]);
}
else {
document.write("not found");
}
Okay, here is a solution using String.Substring(), String.Split() and String.IndexOf():
String str = "הנקה"; // Assume this is the string being passed in (with its quotes escaped)
int splitStart = str.IndexOf("title="); // Where to start extracting
int splitEnd = str.LastIndexOf("</a>"); // Where to stop
/* What we try to extract is this: title="הנקה">הנקה
* (shown without escape sequences)
*/
String extracted = str.Substring(splitStart, splitEnd - splitStart); // Extract the required portion
String[] splitted = extracted.Split('"'); // Now split on the double quote
Console.WriteLine(splitted[1]); // May print ???? in the console; put a breakpoint here and inspect the values in the split array
Note that I had to escape the quotes in the string literal in an unusual way. You can ignore that, since you are simply passing in the string to scan.
This actually works, but you cannot see the result through the provided Console.WriteLine(splitted[1]);
If you put a breakpoint there and inspect the split array, you can see that the text was extracted.
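As a sketch of the two ideas together: the code below extracts the title with a regex and sets the console output encoding before printing. The anchor string is hypothetical (its href is invented for illustration, since the real markup is not shown here), and TitleExtractor and TryExtractTitle are made-up names. Setting Console.OutputEncoding to UTF-8 may help when the console prints ????, assuming the console font can display Hebrew.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // Tries to read the title attribute of an anchor tag.
    public static bool TryExtractTitle(string anchorHtml, out string title)
    {
        Match match = Regex.Match(anchorHtml, @"title=""([^""]+)""");
        title = match.Success ? match.Groups[1].Value : string.Empty;
        return match.Success;
    }

    public static void Main()
    {
        // Hypothetical anchor: the href value is an assumption for this example.
        string anchor = "<a href=\"forumPage.aspx?forumId=1\" title=\"הנקה\">הנקה</a>";

        // Ask the console to emit UTF-8 so non-ASCII text is not replaced by '?'.
        Console.OutputEncoding = Encoding.UTF8;

        string title;
        Console.WriteLine(TryExtractTitle(anchor, out title) ? title : "not found");
    }
}
```

This does not cover text that was already decoded with the wrong encoding when the page was downloaded; in that case the fix has to happen where the bytes are turned into a string.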

C# removing white spaces in an HTML string

Is it possible to remove all whitespace from the following HTML string in C#?
"
<html>
<body>
</body>
</html>
"
Thanks
When dealing with HTML or any markup for that matter, it's usually best to run it through a parser that truly understands the rules of that markup.
The first benefit is that it can tell you if your initial input data is garbage to start with.
If the parser is smart enough it might even be able to correct badly formed markup automatically, or accept it with relaxed rules.
You can then modify the parsed content....and get the parser to write out the changes...this way you can be sure the markup rules are followed and you have correct output.
For some simple HTML markup scenarios or for markup that is so badly formed a parser just balks on it straight away, then yes you can revert to hacking the input string...with string replacements, etc....it all depends on your needs as to which approach you take.
Here are a couple of tools that can help you out:
HTML Tidy
You can use HTML Tidy and just specify some options/rules on how you want your HTML to be tidied up (e.g. remove superfluous whitespace).
It's a WIN32 DLL...but there are C# Wrappers for it.
http://tidy.sourceforge.net
http://robertbeal.com/37/sanitising-html
C# version of HTML Tidy?
http://geekswithblogs.net/mnf/archive/2011/06/08/implementations-of-html-tidylib-for-.net.aspx
HtmlAgilityPack
You can use HtmlAgilityPack to parse HTML if you need to understand the structure better and perhaps do your own tidying up/restructuring.
http://html-agility-pack.net
myString = myString.Replace(System.Environment.NewLine, "");
You can use a regular expression to match white space characters for the replace:
s = Regex.Replace(s, @"\s+", String.Empty);
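One thing to keep in mind with \s+ is that it also removes the spaces inside text and between attributes. A gentler variant, sketched below under the assumption that only the whitespace between tags is unwanted, replaces >\s+< with ><. The WhitespaceDemo class and its method names are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Removes every whitespace character, including spaces inside text.
    public static string RemoveAll(string html)
    {
        return Regex.Replace(html, @"\s+", string.Empty);
    }

    // Removes only whitespace that sits between a closing '>' and the next '<',
    // plus leading and trailing whitespace of the whole string.
    public static string RemoveBetweenTags(string html)
    {
        return Regex.Replace(html, @">\s+<", "><").Trim();
    }

    public static void Main()
    {
        string html = "\r\n<html>\r\n<body>\r\n<p>hello world</p>\r\n</body>\r\n</html>\r\n";
        Console.WriteLine(RemoveAll(html));
        Console.WriteLine(RemoveBetweenTags(html));
    }
}
```

With the sample input, RemoveAll turns "hello world" into "helloworld", while RemoveBetweenTags keeps the text intact and only collapses the line breaks between the tags.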
I used this solution (in my opinion it works well. See also test code):
Add an extension method to trim the HTML string:
public static string RemoveSuperfluousWhitespaces(this string input)
{
if (input.Length < 3) return input;
var resultString = new StringBuilder(); // Using StringBuilder is much faster than using regular expressions here!
var inputChars = input.ToCharArray();
var index1 = 0;
var index2 = 1;
var index3 = 2;
// Remove superfluous white spaces from the html stream by the following replacements:
// '<no whitespace>' '>' '<whitespace>' ==> '<no whitespace>' '>'
// '<whitespace>' '<' '<no whitespace>' ==> '<' '<no whitespace>'
while (index3 < inputChars.Length)
{
var char1 = inputChars[index1];
var char2 = inputChars[index2];
var char3 = inputChars[index3];
if (!Char.IsWhiteSpace(char1) && char2 == '>' && Char.IsWhiteSpace(char3))
{
// drop whitespace character in char3
index3++;
}
else if (Char.IsWhiteSpace(char1) && char2 == '<' && !Char.IsWhiteSpace(char3))
{
// drop whitespace character in char1
index1 = index2;
index2 = index3;
index3++;
}
else
{
resultString.Append(char1);
index1 = index2;
index2 = index3;
index3++;
}
}
// (index3 >= inputChars.Length)
resultString.Append(inputChars[index1]);
resultString.Append(inputChars[index2]);
var str = resultString.ToString();
return str;
}
// 2) add test code:
[Test]
public void TestRemoveSuperfluousWhitespaces()
{
var html1 = "<td class=\"keycolumn\"><p class=\"mandatory\">Some recipe parameter name</p></td>";
var html2 = $"<td class=\"keycolumn\">{Environment.NewLine}<p class=\"mandatory\">Some recipe parameter name</p>{Environment.NewLine}</td>";
var html3 = $"<td class=\"keycolumn\">{Environment.NewLine} <p class=\"mandatory\">Some recipe parameter name</p> {Environment.NewLine}</td>";
var html4 = " <td class=\"keycolumn\"><p class=\"mandatory\">Some recipe parameter name</p></td>";
var html5 = "<td class=\"keycolumn\"><p class=\"mandatory\">Some recipe parameter name</p></td> ";
var compactedHtml1 = html1.RemoveSuperfluousWhitespaces();
compactedHtml1.Should().BeEquivalentTo(html1);
var compactedHtml2 = html2.RemoveSuperfluousWhitespaces();
compactedHtml2.Should().BeEquivalentTo(html1);
var compactedHtml3 = html3.RemoveSuperfluousWhitespaces();
compactedHtml3.Should().BeEquivalentTo(html1);
var compactedHtml4 = html4.RemoveSuperfluousWhitespaces();
compactedHtml4.Should().BeEquivalentTo(html1);
var compactedHtml5 = html5.RemoveSuperfluousWhitespaces();
compactedHtml5.Should().BeEquivalentTo(html1);
}

Losing the 'less than' sign in HtmlAgilityPack loadhtml

I recently started experimenting with the HtmlAgilityPack. I am not familiar with all of its options, and I think that is why I am doing something wrong.
I have a string with the following content:
string s = "<span style=\"color: #0000FF;\"><</span>";
You see that in my span I have a 'less than' sign.
I process this string with the following code:
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(s);
But when I do a quick and dirty look in the span like this:
htmlDocument.DocumentNode.ChildNodes[0].InnerHtml
I see that the span is empty.
What option do I need to set to maintain the 'less than' sign? I already tried this:
htmlDocument.OptionAutoCloseOnEnd = false;
htmlDocument.OptionCheckSyntax = false;
htmlDocument.OptionFixNestedTags = false;
but with no success.
I know it is invalid HTML. I am using this to fix invalid HTML and to apply HTMLEncode to the 'less than' signs.
Please point me in the right direction. Thanks in advance.
The Html Agility Pack detects this as an error and creates an HtmlParseError instance for it. You can read all errors using the ParseErrors property of the HtmlDocument class. So, if you run this code:
string s = "<span style=\"color: #0000FF;\"><</span>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(s);
doc.Save(Console.Out);
Console.WriteLine();
Console.WriteLine();
foreach (HtmlParseError err in doc.ParseErrors)
{
Console.WriteLine("Error");
Console.WriteLine(" code=" + err.Code);
Console.WriteLine(" reason=" + err.Reason);
Console.WriteLine(" text=" + err.SourceText);
Console.WriteLine(" line=" + err.Line);
Console.WriteLine(" pos=" + err.StreamPosition);
Console.WriteLine(" col=" + err.LinePosition);
}
It will display this (the corrected text first, then the details about the error):
<span style="color: #0000FF;"></span>
Error
code=EndTagNotRequired
reason=End tag </> is not required
text=<
line=1
pos=30
col=31
So you can try to fix this error, as you have all required information (including line, column, and stream position) but the general process of fixing (not detecting) errors in HTML is very complex.
As mentioned in another answer, the best solution I found was to pre-parse the HTML to convert orphaned < symbols to their HTML encoded value &lt;.
return Regex.Replace(html, "<(?![^<]+>)", "&lt;");
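Here is a minimal, self-contained sketch of that pre-parse step, so the behaviour of the negative lookahead can be checked on the string from the question. The LessThanFixer class and the EncodeOrphanedLessThan method are names made up for this example. A '<' is left alone only when it is followed by at least one character other than '<' and then a '>', i.e. when it looks like the start of a tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LessThanFixer
{
    // Encodes every '<' that is not followed by "some non-'<' characters and a '>'".
    public static string EncodeOrphanedLessThan(string html)
    {
        return Regex.Replace(html, "<(?![^<]+>)", "&lt;");
    }

    public static void Main()
    {
        string s = "<span style=\"color: #0000FF;\"><</span>";
        Console.WriteLine(EncodeOrphanedLessThan(s));
    }
}
```

The result can then be passed to LoadHtml. Note that this is a heuristic: in text such as "a < b > c" the '<' is followed by a '>' later on, so it is treated as a tag start and not encoded.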
Fix the markup, because your HTML string is invalid:
string s = "<span style=\"color: #0000FF;\">&lt;</span>";
Although it is true that the given html is invalid, HtmlAgilityPack should still be able to parse it. It is not an uncommon mistake on the web to forget to encode "<", and if HtmlAgilityPack is used as a crawler, then it should anticipate bad html. I tested the example in IE, Chrome and Firefox, and they all show the extra < as text.
I wrote the following method that you can use to preprocess the html string and replace all 'unclosed' '<' characters with "&lt;":
static string PreProcess(string htmlInput)
{
// Stores the index of the last unclosed '<' character, or -1 if the last '<' character is closed.
int lastGt = -1;
// This list will be populated with all the unclosed '<' characters.
List<int> gtPositions = new List<int>();
// Collect the unclosed '<' characters.
for (int i = 0; i < htmlInput.Length; i++)
{
if (htmlInput[i] == '<')
{
if (lastGt != -1)
gtPositions.Add(lastGt);
lastGt = i;
}
else if (htmlInput[i] == '>')
lastGt = -1;
}
if (lastGt != -1)
gtPositions.Add(lastGt);
// If no unclosed '<' characters are found, then just return the input string.
if (gtPositions.Count == 0)
return htmlInput;
// Build the output string, replacing every unclosed '<' character with "&lt;".
StringBuilder htmlOutput = new StringBuilder(htmlInput.Length + 3 * gtPositions.Count);
int start = 0;
foreach (int gtPosition in gtPositions)
{
htmlOutput.Append(htmlInput.Substring(start, gtPosition - start));
htmlOutput.Append("&lt;");
start = gtPosition + 1;
}
htmlOutput.Append(htmlInput.Substring(start));
return htmlOutput.ToString();
}
The string "s" is bad HTML.
string s = "<span style=\"color: #0000FF;\">&lt;</span>";
is correct.
