Regex.Split returning whitespaces - c#

I want to export a View as an HTML document to the user on my ASP.NET page, with the option to export only part of the view.
To do that, I split the output with Regex.Split(). I wrote a regex that matches the part I want to cut out, and after splitting I put the 2 remaining parts back together.
The problem is that I get a list of 3 parts, of which the second contains " ". How can I change the code so that the output contains only 2 strings?
My Code:
textParts = Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->");
text = textParts[0] + textParts[1];
text contains HTML, CSS and jQuery Code. I wrote comments like <!--Graphic2--> around the blocks I want to cut out.
EDIT
I got it working now by using the Regex.Replace() Method. But I still don't know why Split isn't working how I expected.

You should consider parsing HTML with the proper tools, like HtmlAgilityPack.
The current question is about why Regex.Split returned 3 values. That is due to the presence of a capturing group in your pattern. Regex.Split returns the chunks between start/end of string and the matched chunks, and all captured substrings:
If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, if you split the string "plum-pear" on a hyphen placed within capturing parentheses, the returned array includes a string element that contains the hyphen.
So, Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->") matches the <!--Graphic2--> substring, then matches and captures into Group 1 any 0+ occurrences of any char, as many as possible, and then matches <!--EndDiscarded-->. These matches are removed and the substrings that are not matched are returned, but the last char captured into the repeated capturing group is returned as well.
So, if you plan to use regex for this task, you should consider rewriting it as @"(?s)<!--Graphic2-->.*?<!--EndDiscarded-->", or as @"<!--Graphic2-->[^<]*(?:<(?!!--EndDiscarded)[^<]*)*<!--EndDiscarded-->", which will be much more efficient, or even as @"<!--Graphic2-->[^<]*(?:<(?!!--(?:EndDiscarded|Graphic2))[^<]*)*<!--EndDiscarded-->", which will ensure no nested Graphic2 comments are matched.
As you can see, the complexity of the patterns rises when you want them to work more efficiently and safely. However, even these longer versions do not guarantee 100% safety.
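To make the difference visible, here is a minimal sketch that runs the original pattern next to two alternatives. The class and method names (SplitDemo, SplitCapturing and so on) are made up for illustration; only Regex.Split and the patterns come from the discussion above.

```csharp
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Pattern from the question: (.|\n) is a capturing group, so the last
    // character it captured is added to the array returned by Split.
    public static string[] SplitCapturing(string text) =>
        Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->");

    // Same pattern with a non-capturing group (?:...): nothing extra is returned.
    public static string[] SplitNonCapturing(string text) =>
        Regex.Split(text, @"<!--Graphic2-->(?:.|\n)*<!--EndDiscarded-->");

    // First rewrite suggested above: (?s) lets . match newlines, .*? is lazy.
    public static string[] SplitLazy(string text) =>
        Regex.Split(text, @"(?s)<!--Graphic2-->.*?<!--EndDiscarded-->");
}
```

For a hypothetical input such as "head<!--Graphic2-->chart <!--EndDiscarded-->tail", SplitCapturing should return three elements ("head", " ", "tail"), while the other two methods return only "head" and "tail".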

Related

Regex to match when not inside single quotes [duplicate]

I'm looking to write a regex (C#) that will match words that aren't surrounded by quotes. An example input string would be:
dbo.test line_length "quoted words" notquoted
And this needs to match
dbo.test
line_length
notquoted
So 3 separate matches and "quoted words" is not matched. The quoted phrase could be anywhere in the input...beginning, middle, end, etc.
I haven't been able to come up with a regex that matches words not in quotes where there could be a space in the quotes...I've been able to match something like: hello "world" and only get hello.
Is there a way to write the regex I'm trying to?
There are two ways to tackle this, depending on what you want to do with the output.
First, match (but don't capture) any text within quotation marks. (This is specifically matching the stuff that you DON'T want.)
Using the | pipe, use capture groups to select everything that you DO want to keep.
Example:
".*?"|(\b\S+\b)
The other option, using look-arounds, is to specifically look backward from the beginning of the words to ensure that the " doesn't appear there:
(?<!")(\b\S+\b)(?!")
This may have a problem when you start using multiple words, but this should get you on the right track, and you can indicate whether one of these methods works better for you than the other.
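As a rough sketch of the first approach in C#: run the alternation over the input and keep only the matches where capture group 1 actually participated. The UnquotedWords class and its Find method are hypothetical names used just for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class UnquotedWords
{
    // Left alternative: consume a quoted phrase without capturing it.
    // Right alternative: capture a run of non-whitespace into group 1.
    private const string Pattern = @""".*?""|(\b\S+\b)";

    public static List<string> Find(string input) =>
        Regex.Matches(input, Pattern)
             .Cast<Match>()
             .Where(m => m.Groups[1].Success)   // skip the quoted matches
             .Select(m => m.Groups[1].Value)
             .ToList();
}
```

Called on the sample input from the question, Find should yield dbo.test, line_length and notquoted, with the quoted phrase consumed by the left alternative and dropped.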

Remove all Trailing <br> Using Regex, Substitute Group not Returning Full Match

Here is the problem. I have a block of pasted HTML text. I need to remove trailing line breaks and white space from the text, even ones followed by closing tags. The text below is simply an example, but it closely represents the real text I'm dealing with.
EG:
This:
<span>Here is some<br></span><br>
<span><span>Here is some text</span><br><span><br> </span></span><br><br>
Becomes this:
<span>Here is some<br></span><br>
<span><span>Here is some text</span><span></span></span>
My first pass: I use Regex.Replace(htmlString, @"(?:\<br\s*?\>)*$", "") to get rid of the trailing line breaks. Now all I have left are the line breaks and white space stuck in front of closing tags.
I'm attempting to use this:
while (Regex.IsMatch(htmlString, @"(<br>|\s| )*(<[^>]*>)*$"))
{
    htmlString = Regex.Replace(htmlString, @"(<br>|\s| )*(<[^>]*>)*$", "$2");
}
The regex pattern is actually working great; the problem is that substituting matched group 2 gives back only a single closing span, so that I end up with the below:
<span>Here is some<br></span><br>
<span><span>Here is some text</span></span>
The problem is in the regular expression @"(<br>|\s| )*(<[^>]*>)*$". The second group is followed by a *, meaning the group is repeated, and so $2 only yields the last repetition of the group.
Putting the repetition inside a group will capture the whole repetition. Change the regular expression to @"(<br>|\s| )*((<[^>]*>)*)$".
Note that repeating the first group with a * may make the code spin on some input strings, as there is no guarantee that the Replace will change the text to a different string. As the first group is optional (i.e. zero or more repeats), the Replace might replace one string with exactly the same string. So I suggest changing the regular expression to @"(<br>|\s| )+((<[^>]*>)*)$", meaning that one or more occurrences of the first group are required.
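A small sketch of the difference, under a few assumptions: the first group is reduced to <br> and \s for brevity, and the class and method names are invented for this example. The loop is one possible way to repeat the replacement until nothing changes; it is not taken from the question.

```csharp
using System.Text.RegularExpressions;

public static class TrailingBreaks
{
    // Group 2 is itself repeated, so $2 holds only its last iteration.
    public static string ReplaceLastTagOnly(string html) =>
        Regex.Replace(html, @"(<br>|\s)+(<[^>]*>)*$", "$2");

    // Group 2 now wraps the whole repetition, so $2 holds every trailing tag.
    public static string ReplaceKeepingAllTags(string html) =>
        Regex.Replace(html, @"(<br>|\s)+((<[^>]*>)*)$", "$2");

    // Repeats the replacement until the string stops changing. Because the
    // first group uses +, every successful replacement removes something,
    // so the loop cannot spin forever.
    public static string TrimTrailingBreaks(string html)
    {
        string previous;
        do
        {
            previous = html;
            html = ReplaceKeepingAllTags(html);
        } while (html != previous);
        return html;
    }
}
```

With the input "text<br></span></span>", ReplaceLastTagOnly should leave "text</span>", while ReplaceKeepingAllTags leaves "text</span></span>".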
I guess you can use:
resultString = Regex.Replace(subjectString, @"<br>| |\n", "");

Optimizing a regex expression

I'm having issues where it's taking very long to run a match against this query. I'm trying to match up content that looks like the following:
One or more content paragraph of any length
Here is an optional paragraph
A single line or list item
A single line or list item
Here is my pattern. While it works for short expressions, it fails for longer ones.
^((.+[\r\n]?)+)\r\n\r\n([* -]*(.+)[\r\n]?)+$
My goal really is to separate out the first piece of content into a paragraph, and collect the last items into a list object using the matching pattern. I'm assuming two line breaks separate the paragraph(s) and a set of single-line items (only one line break).
Hope this isn't confusing. How can I optimize this regex? Thanks.
Time-consuming, inefficient backtracking can often be avoided by adding the ? modifier to the * and + quantifiers to make them match lazily or reluctantly, i.e. as few times as possible.
This can be particularly important when the quantifiers follow the . wildcard meta-character.
Try
(.+?)\r\n\r\n(?:[* -]*(.+?)(?:\r\n|$))+
with RegexOptions.Singleline so . matches any character including newlines.
(Alternatively use [\s\S] in place of the first .).
The first capture group will capture all that comes before the consecutive newlines, and then the next capture groups will capture each single line that follows. As in your regex, any leading *, - or space characters in the single lines will not be captured.
The paragraph(s) will be match.Groups[1].Value, the first captured single line will be match.Groups[2].Captures[0].Value, the second match.Groups[2].Captures[1].Value, etc.
If the line-endings may be simply \n, change \r\n to \r?\n.
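Put together, that could look roughly like the sketch below. ParagraphListParser and TryParse are hypothetical names; the pattern, the Singleline option and the Groups/Captures access are the ones described above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParagraphListParser
{
    // Singleline makes . match newlines, so group 1 can span several paragraphs.
    private static readonly Regex Pattern = new Regex(
        @"(.+?)\r\n\r\n(?:[* -]*(.+?)(?:\r\n|$))+",
        RegexOptions.Singleline);

    public static bool TryParse(string input, out string paragraphs, out List<string> items)
    {
        paragraphs = string.Empty;
        items = new List<string>();

        Match match = Pattern.Match(input);
        if (!match.Success)
            return false;

        // Group 1: everything before the first blank line.
        paragraphs = match.Groups[1].Value;

        // Group 2 is inside a repeated group, so each line is a separate Capture.
        foreach (Capture capture in match.Groups[2].Captures)
            items.Add(capture.Value);

        return true;
    }
}
```

The lazy quantifiers keep the engine from first swallowing the whole input and then backtracking, which is where the original pattern spends its time on long inputs.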
I'm not that good at regex, but yours seems quite optimized to me. To make it faster, though, use Split instead to separate the paragraph from the list:
res = yourstring.Split(new[] { "\r\n\r\n" }, StringSplitOptions.None);
paragraph = res[0];
list = res[1];
Then you can use a regex, or Split again, to separate the list items from each other.

UB: C#'s Regex.Match returns whole string instead of part when matching

Attention! This is NOT related to Regex problem, matches the whole string instead of a part
Hi all.
I'm trying to do
Match y = Regex.Match(someHebrewContainingLine, @"^.{0,9} - \[(.*)?\s\d{1,3}");
Aside from the other VS Hebrew quirks (how do you like ] being replaced with [ when editing the string?), it occasionally returns crazy results:
Match.Captures.Count = 1;
Match.Captures[0] = whole string! (not expected)
Match.Groups.Count = 2; (not expected)
Match.Groups[0] = whole string again! (not expected)
Match.Groups[1] = (.*)? value (expected).
Regex.Matches() acts the same way.
What can be a general reason for such behaviour? Note: it doesn't act this way on simple test strings like Regex.Match("-היי45--", "-(.{1,5})-") (the sample is displayed incorrectly! Please look at the page's source code), so there must be something in the regex that makes it greedy. The matched string contains [ .... ], but simply adding them to the test string doesn't cause the same effect.
I hit this problem when I first started using the .NET regex, too. The way to understand this is to understand that the Group member of Match is the nesting member. You have to traverse Groups in order to get down to lower captures. Groups also have Capture members. The Match is kind of like the top "Group" in that it represents the successful "match" of the whole string against your expression. The single input string can have multiple matches. The Captures member represents the match of your full expression.
Whenever you have a single capture group, as you have, Groups[1] will always be the data you are interested in. Look at this page. The source code in examples 2 and 3 is hardcoded to print out Groups[1].
Remember that a single capture can capture multiple substrings in a single match operation. If this were the case then you would see Match.Groups[1].Captures.Count be greater than 1. Also, I think if you passed in multiple matching lines of text to the single Match call, then you would see Match.Captures.Count be greater than 1, but each top-level Match.Captures would be the full string matched by your full expression.
There is one capture group in the pattern; that is group 1.
There is always group 0, which is the entire match.
Therefore there are a total of 2 groups.
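A short sketch of that, using the simple test pattern from the question on an ASCII string. GroupDemo and its methods are made-up names for illustration only.

```csharp
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Returns { group 0 (the entire match), group 1 } for the pattern -(.{1,5})-
    public static string[] WholeAndGroup(string input)
    {
        Match m = Regex.Match(input, "-(.{1,5})-");
        return new[] { m.Groups[0].Value, m.Groups[1].Value };
    }

    // Number of groups reported for a pattern with one pair of parentheses.
    public static int GroupCount(string input) =>
        Regex.Match(input, "-(.{1,5})-").Groups.Count;

    // A quantified group keeps every iteration in Captures,
    // while its Value is only the last iteration.
    public static int DigitCaptureCount(string input) =>
        Regex.Match(input, @"(\d)+").Groups[1].Captures.Count;
}
```

For "x-abc45--y", WholeAndGroup should return "-abc45-" and "abc45", and GroupCount should return 2; for "a123b", DigitCaptureCount should return 3.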
My test regex was different from the others in the project's scope (that's what happens when a Perl guy comes to C#), as it had no lookaheads/lookbehinds. So this discovery took some time.
Now, why we should call this Regex behaviour undocumented rather than undefined:
Let's do some matches against "1.234567890".
PCRE-like syntax: (.)\.2345678
lookahead syntax: (.)(?=\.\d)
When you do a normal match, the result is copied from the whole matched part of the line, no matter where you've put the parentheses; when lookaheads are present, only what does not belong to them is copied.
So, the matches will return:
PCRE: 1.2345678 (at 23:00, this looks like the original string, and that is when I started yelling here at SO)
lookahead: 1
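The two patterns can be compared directly with something like the following sketch (LookaheadDemo is a hypothetical name). The way I understand it, a lookahead only checks what follows and consumes nothing, so it never becomes part of Match.Value.

```csharp
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Consuming pattern: everything the pattern matched is part of Match.Value.
    public static Match Consuming(string input) =>
        Regex.Match(input, @"(.)\.2345678");

    // Lookahead: (?=...) asserts what follows without consuming it,
    // so Match.Value holds just the character matched by (.).
    public static Match WithLookahead(string input) =>
        Regex.Match(input, @"(.)(?=\.\d)");
}
```

In both cases Groups[1].Value is the part inside the parentheses; only Match.Value (group 0) differs.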
