C# Regular Expression excluding a string - c#

I have a collection of strings, and all I want from the regex is to collect every one that starts with http:
href="http://www.test.com/cat/1-one_piece_episodes/"href="http://www.test.com/cat/2-movies_english_subbed/"href="http://www.test.com/cat/3-english_dubbed/"href="http://www.exclude.com"
This is my regular expression pattern:
href="(.*?)[^#]"
and it returns this:
href="http://www.test.com/cat/1-one_piece_episodes/"
href="http://www.test.com/cat/2-movies_english_subbed/"
href="http://www.xxxx.com/cat/3-english_dubbed/"
href="http://www.exclude.com"
What pattern would exclude the last match, or more generally exclude any match that contains the exclude domain, such as href="http://www.exclude.com"?
EDIT:
for multiple exclusion
href="((?:(?!"|\bexclude\b|\bxxxx\b).)*)[^#]"

@ridgerunner and I would change the regex to:
href="((?:(?!\bexclude\b)[^"])*)[^#]"
It matches all href attributes as long as they don't end in # and don't contain the word exclude.
Explanation:
href=" # Match href="
( # Capture...
(?: # the following group:
(?! # Look ahead to check that the next part of the string isn't...
\b # the entire word
exclude # exclude
\b # (\b are word boundary anchors)
) # End of lookahead
[^"] # If successful, match any character except for a quote
)* # Repeat as often as possible
) # End of capturing group 1
[^#]" # Match a non-# character and the closing quote.
To allow multiple "forbidden words":
href="((?:(?!\b(?:exclude|this|too)\b)[^"])*)[^#]"
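As a rough C# sketch of how the multi-word pattern above might be applied to the sample input from the question. The class name HrefFilter, the method name Collect and the idea of building the alternation from a word list are illustrative assumptions, not part of the answer itself:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HrefFilter
{
    // Hypothetical helper: returns each href="..." attribute whose value
    // contains none of the forbidden words. Pass at least one word.
    public static List<string> Collect(string input, params string[] forbiddenWords)
    {
        // Escape each word so characters such as '.' are taken literally.
        string alternation = string.Join("|", forbiddenWords.Select(w => Regex.Escape(w)));

        // Same shape as the pattern above, with the word list spliced in.
        string pattern = "href=\"((?:(?!\\b(?:" + alternation + ")\\b)[^\"])*)[^#]\"";

        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }
}
```

Called as HrefFilter.Collect(input, "exclude", "xxxx") on the input string from the question, this should return only the three test.com attributes. Note that group 1 stops one character short of the closing quote, because the trailing [^#] consumes the last character of the URL.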

Your input doesn't look like a valid string literal (unless you escape the quotes inside it), but you can also do this without a regex:
string input = "href=\"http://www.test.com/cat/1-one_piece_episodes/\"href=\"http://www.test.com/cat/2-movies_english_subbed/\"href=\"http://www.test.com/cat/3-english_dubbed/\"href=\"http://www.exclude.com\"";
List<string> matches = new List<string>();
// Split on "href"; each remaining piece is the rest of one attribute, e.g. ="http://...".
foreach (var part in input.Split(new[] { "href" }, StringSplitOptions.RemoveEmptyEntries)) {
    if (!part.Contains("exclude.com"))
        matches.Add("href" + part);
}

Will this do the job?
href="(?!http://[^/"]+exclude.com)(.*?)[^#]"

Related

How to find in string all matches

Assume that I have the following string:
xx##a#11##yyy##bb#2##z
I'm trying to retrieve all occurrences of ##something#somethingElse##
(In my string I want to have 2 matches: ##a#11## and ##bb#2##)
I tried to get all matches using
Regex.Matches(MyString, ".*(##.*#.*##).*")
but it retrieves one match which is the whole row.
How can I get all matches from this string? Thanks.
Since you have .* at the start and end of your pattern, you only get the whole line match. Besides, .* in-between #s in your pattern is too greedy, and would grab all the expected matches into 1 match when encountered on a single line.
You may use
var results = Regex.Matches(MyString, "##[^#]*#[^#]*##")
.Cast<Match>()
.Select(m => m.Value)
.ToList();
NOTE: If there must be at least 1 char in between ## and #, and # and ##, replace * quantifier (matching 0+ occurrences) with + quantifier (matching 1+ occurrences).
NOTE2: To avoid matches inside ####..#....#####, you may add lookarounds: "(?<!#)##[^#]+#[^#]+##(?!#)"
Pattern details:
## - 2 # symbols
[^#]* / [^#]+ - a negated character class matching 0+ chars (or 1+ chars) other than #
# - a single #
[^#]* / [^#]+ - 0+ (or 1+) chars other than #
## - double # symbol.
BONUS: To get the contents between ## and ##, use a capturing group (a pair of unescaped parentheses around the part of the pattern you need to extract) and read Match.Groups[1].Value:
var results = Regex.Matches(MyString, @"##([^#]*#[^#]*)##")
.Cast<Match>()
.Select(m => m.Groups[1].Value)
.ToList();
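Building on the notes above, here is a small sketch that combines the lookaround variant with two capturing groups, so the part before and the part after the single # come back separately. The class name HashPairs, the method name Find and the use of Tuple are my own choices for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HashPairs
{
    // Hypothetical helper: finds every ##left#right## token and returns the
    // two parts separately. The lookarounds reject tokens that sit inside a
    // longer run of '#' characters.
    public static List<Tuple<string, string>> Find(string text)
    {
        return Regex.Matches(text, @"(?<!#)##([^#]+)#([^#]+)##(?!#)")
                    .Cast<Match>()
                    .Select(m => Tuple.Create(m.Groups[1].Value, m.Groups[2].Value))
                    .ToList();
    }
}
```

For the string from the question, xx##a#11##yyy##bb#2##z, this should give the pairs (a, 11) and (bb, 2).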
Regex101
Regex.Matches(MyString, "(##[^#]+#[^#]+##)")
(##[^#]+#[^#]+##)
Description
1st Capturing Group (##[^#]+#[^#]+##)
  ## matches the characters ## literally (case sensitive)
  [^#]+ matches a single character not present in the list below
    + Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
    # matches the character # literally (case sensitive)
  # matches the character # literally (case sensitive)
  [^#]+ matches a single character not present in the list below
    + Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
    # matches the character # literally (case sensitive)
  ## matches the characters ## literally (case sensitive)

How can i match inner expression on nested expression with regular expressions?

I have this code in C#.
This works:
string code = "dqwdSTART12sdaSTART12312ENDsdfSTARTasdsaENDasdaENDqwe";
string pattern = "START[^(START)(END)]*END";
But not this:
string code = "dqwdstart12sdastart12312endsdfstartasdsaendasdaendqwe";
string pattern = "start[^(start)(end)]*end";
How can I make the second case match? (Preferably in C#.)
The pattern [^(start)(end)] does not mean what you think. It does not mean "none of these words" but "none of the characters enclosed between [ and ]".
The only reason the first version worked is that the text between START and END contained only digits and lowercase letters, none of which appear in the class. If you add an uppercase letter such as S between them, it won't work.
Use this pattern instead:
START((?:(?!START|END).)*)END
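Use it with the g and i flags. In C# that means passing RegexOptions.IgnoreCase; Regex.Matches already returns every match, so no global flag is needed.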
START # "START"
( # Capturing Group (1)
(?: # Non Capturing Group
(?! # Negative Look-Ahead
START # "START"
| # OR
END # "END"
) # End of Negative Look-Ahead
. # Any character except line break
) # End of Non Capturing Group
* # (zero or more)(greedy)
) # End of Capturing Group (1)
END # "END"
If you want only the content between start and end, without matching the delimiters themselves, you can also try lookarounds:
(?<=start)(?:(?!start|end).)*(?=end)
See the demo:
http://regex101.com/r/yP3iB0/23
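A minimal C# sketch of the capturing-group pattern from the first answer, run against both test strings from the question. The class name InnerBlocks and the method name Find are made up for this example:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class InnerBlocks
{
    // Hypothetical helper: returns the content of every innermost
    // START...END pair, ignoring case. The negative lookahead is checked
    // before each character, so the match can never run across another
    // START or END.
    public static List<string> Find(string code)
    {
        return Regex.Matches(code, @"START((?:(?!START|END).)*)END", RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToList();
    }
}
```

Both the uppercase and the lowercase sample string should yield 12312 and asdsa. If the content can span several lines, RegexOptions.Singleline would also be needed so that the dot matches line breaks.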

Validator for file name with custom words between curly braces

I have regex:
[\w,\s-]+\.[A-Za-z]+$
and a filename:
test-file_name-5.pdf
And it works okay. But now I want to add something like this:
my-filename{time}.pdf
or this:
test{word}hello.pdf
and the regex should accept it.
If there is only an opening or only a closing curly brace, it should fail. The braces may contain a-z, A-Z and 0-9.
I tried with RegExr but couldn't do it.
You can use the following regex:
^[\w,\s-]+(?:(?:{[A-Za-z\d]+}[\w,\s-]*)?)*\.[A-Za-z]+$
Explanation:
^ # Assert position at the beginning of the string
[\w,\s-]+ # Beginning of the filename
(?: # Begin group
(?: # Begin group
{[A-Za-z\d]+} # Match {...} part
[\w,\s-]* # Followed by optional characters
)? # Make the group optional
)* # Repeat the group zero or more times
\.[A-Za-z]+ # Match the filename extension
$ # Assert position at the end of the string
This matches:
test-file_name-5.pdf
my-filename{23m}.pdf
test{word1}hello{word2}xyz.pdf
test{word}hello.pdf
But doesn't match:
foo-filename{23m.pdf
foo-filename23m}.pdf
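The lists above can be checked with a short C# sketch. It assumes a slightly simplified form of the regex: the braces are escaped for readability and the nested optional group is collapsed into a single repeated group, which should accept the same names. FileNameValidator and IsValid are hypothetical names:

```csharp
using System.Text.RegularExpressions;

public static class FileNameValidator
{
    // Assumed simplification of the regex above: (?:(?:X)?)* is written as (?:X)*.
    private static readonly Regex Pattern =
        new Regex(@"^[\w,\s-]+(?:\{[A-Za-z\d]+\}[\w,\s-]*)*\.[A-Za-z]+$");

    // Returns true when the name is made of plain name characters plus any
    // number of complete {word} placeholders, followed by an extension.
    public static bool IsValid(string fileName)
    {
        return Pattern.IsMatch(fileName);
    }
}
```

One consequence of the leading [\w,\s-]+ is that a name must start with at least one ordinary character, so a name consisting only of a placeholder and an extension is rejected.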

C#: Regex for string with enclosing single-quotes (and escaping by doubling the quotes)

I could not find a regex for my problem. The examples I find always escape with a backslash, but I need escaping by doubling the enclosing character.
Example: 'o''reilly'
Result: o'reilly
'(?:''|[^']*)*'
will match a quote-delimited string that may contain double-escaped quotes. So that's your regex to find those strings.
Explanation:
' # Match a single quote.
(?: # Either match... (use (?> instead of (?: if you can)
'' # a doubled quote
| # or
[^']* # anything that's not a quote
)* # any number of times.
' # Match a single quote.
To now remove the quotes correctly, you could do it in two steps:
First, search for (?<!')'(?!') to find all single quotes; replace them with nothing.
Explanation:
(?<!') # Assert that the previous character (if present) isn't a quote
' # Match a quote
(?!') # Assert that the next character (if present) isn't a quote
Second, search for '' and replace all with '.
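As an alternative sketch in C#, the content can be captured in a group and unescaped in one pass. This is a variation rather than the two-step method described above: it uses [^'] without the inner star, anchors the pattern to the whole input, and the names QuotedString and Unquote are hypothetical. One reason to consider it is that the two-step replacement seems to leave a stray quote when the literal ends in an escaped quote, as in 'a''':

```csharp
using System.Text.RegularExpressions;

public static class QuotedString
{
    // Hypothetical helper: returns the unescaped content of one complete
    // single-quoted literal, or null if the input is not such a literal.
    public static string Unquote(string literal)
    {
        // Group 1 holds everything between the enclosing quotes; inside it,
        // a quote may only appear doubled.
        Match m = Regex.Match(literal, @"\A'((?:''|[^'])*)'\z");
        return m.Success ? m.Groups[1].Value.Replace("''", "'") : null;
    }
}
```

With this, 'o''reilly' should come back as o'reilly, and an unterminated literal such as 'abc should give null.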

Regular expression to find separator dots in formula

The C# expression library I am using will not directly support my table/field parameter syntax:
The following are table/field parameter names that are not directly supported:
TableName1.FieldName1
[TableName1].[FieldName1]
[Table Name 1].[Field Name 1]
It accepts alphanumeric parameters without spaces, or most characters enclosed within square brackets. I would like to use C# regular expressions to replace the dot separators and neighboring brackets with a different delimiter, so the results would be as follows:
[TableName1|FieldName1]
[TableName1|FieldName1]
[Table Name 1|Field Name 1]
I also need to skip any string literals within single quotes, like:
'TableName1.FieldName1'
And, of course, ignore any numeric literals like:
12345.6789
EDIT: Thank you for your feedback on improving my question. Hopefully it is clearer now.
I've written a completely new answer, now that the problem is clarified:
You can do this in a single regex. It is quite bulletproof, I think, but as you can see, it's not exactly self-explanatory, which is why I've commented it liberally. Hope it makes sense.
You're lucky that .NET allows re-use of named capturing groups, otherwise you would have had to do this in several steps.
resultString = Regex.Replace(subjectString,
@"(?: # Either match...
(?<before> # (and capture into backref <before>)
(?=\w*\p{L}) # (as long as it contains at least one letter):
\w+ # one or more alphanumeric characters,
) # (End of capturing group <before>).
\. # then a literal dot,
(?<after> # (now capture again, into backref <after>)
(?=\w*\p{L}) # (as long as it contains at least one letter):
\w+ # one or more alphanumeric characters.
) # (End of capturing group <after>) and end of match.
| # Or:
\[ # Match a literal [
(?<before> # (now capture into backref <before>)
[^\]]+ # one or more characters except ]
) # (End of capturing group <before>).
\]\.\[ # Match literal ].[
(?<after> # (capture into backref <after>)
[^\]]+ # one or more characters except ]
) # (End of capturing group <after>).
\] # Match a literal ]
) # End of alternation. The match is now finished, but
(?= # only if the rest of the line matches either...
[^']*$ # only non-quote characters
| # or
[^']*'[^']*' # contains an even number of quote characters
[^']* # plus any number of non-quote characters
$ # until the end of the line.
) # End of the lookahead assertion.",
"[${before}|${after}]", RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

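To try the replacement against the examples from the question, here is a compact sketch that uses the same pattern with the comments and whitespace stripped. The class name DotSeparators and the method name Rewrite are assumptions for illustration. As far as I can tell, the lookahead accepts either no quotes or exactly two quotes after the match, so a line with several quoted literals after a field reference may need a repeated form such as (?:[^']*'[^']*')*[^']*$ instead:

```csharp
using System.Text.RegularExpressions;

public static class DotSeparators
{
    // The commented pattern from the answer, written without whitespace.
    // .NET allows the group names <before> and <after> to be reused in
    // both branches of the alternation.
    private const string Pattern =
        @"(?:(?<before>(?=\w*\p{L})\w+)\.(?<after>(?=\w*\p{L})\w+)" +
        @"|\[(?<before>[^\]]+)\]\.\[(?<after>[^\]]+)\])" +
        @"(?=[^']*$|[^']*'[^']*'[^']*$)";

    // Hypothetical helper: rewrites Table.Field and [Table].[Field]
    // references as [Table|Field], leaving quoted strings and numbers alone.
    public static string Rewrite(string formula)
    {
        return Regex.Replace(formula, Pattern, "[${before}|${after}]", RegexOptions.Multiline);
    }
}
```

For instance, DotSeparators.Rewrite("[Table Name 1].[Field Name 1]") should return [Table Name 1|Field Name 1], while 'TableName1.FieldName1' and 12345.6789 should come back unchanged.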