How to extract text with line continuation using Regex? - c#

How can I extract the following from a source string that uses the line continuation character "_", using Regex? Note that the line continuation character must be the last character on its line. Also, the search should start from the end of the string and stop at the first "(" encountered, because I am only interested in what happens at the end of the text.
Wanted Output:
var1, _
var2, _
var3
Source:
...
Func(var1, _
var2, _
var3

Try this
(?<=Func\()(?<match>(?:[^\r\n]+_\r\n)+[^\r\n]+)
Explanation
@"
(?<=               # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
   Func            # Match the characters “Func” literally
   \(              # Match the character “(” literally
)
(?<match>          # Match the regular expression below and capture its match into backreference with name “match”
   (?:             # Match the regular expression below
      [^\r\n]      # Match a single character NOT present in the list below
                   #    A carriage return character
                   #    A line feed character
         +         # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      _            # Match the character “_” literally
      \r           # Match a carriage return character
      \n           # Match a line feed character
   )+              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   [^\r\n]         # Match a single character NOT present in the list below
                   #    A carriage return character
                   #    A line feed character
      +            # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
"
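As a minimal sketch of how the pattern above might be used from C#: the class and method names below (ContinuationExtractor, ExtractArguments) are hypothetical and chosen only for illustration. The pattern is the one from the answer, so it assumes the call is named Func and that lines end with "\r\n".

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative and not from the original answer.
public static class ContinuationExtractor
{
    // Pattern from the answer above. It requires CRLF ("\r\n") line endings
    // and at least one line ending in the continuation character "_".
    private static readonly Regex ArgsPattern = new Regex(
        @"(?<=Func\()(?<match>(?:[^\r\n]+_\r\n)+[^\r\n]+)");

    // Returns the continued argument text, or an empty string when nothing matches.
    public static string ExtractArguments(string source)
    {
        Match m = ArgsPattern.Match(source);
        return m.Success ? m.Groups["match"].Value : string.Empty;
    }
}
```

Calling ContinuationExtractor.ExtractArguments on the source text from the question should return the three lines starting at var1, with the line breaks preserved.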

Related

Regex quotes and not a specific word before quotes

What I'm trying to do is simple, but at the same time it is not.
I have a RegEx in C# that finds all the text inside quotes.
But if a specific word exists before the quotes, it should ignore the whole match and continue to the next row.
It should also ignore the match if certain symbols appear inside the quotes.
Example -
My RegEx = @"(?<!Foo\()\""[^{}\r\n]*\""";
Text -
dontfindme1 = "Hello{}"
dontfindme2 = Foo("ABC")
findme1 = "Just a simple text to find"
findme2 = SuperFoo("WORKS")
Output example -
"ABC"
"Just a simple text to find"
"WORKS"
Now my problem is that I don't want to match quotes that are preceded by the name "Foo(".
And I don't want to match "{" or "}" or "(" or ")" or new lines.
I only need "ABC" not to be found, skipping to the next row.
You could use two negative lookaheads (?!...) to assert that the line contains neither { or } between double quotes nor Foo(
^(?!.*\bFoo\()(?!.*"[^"\r\n]*[{}][^"\r\n]*").*$
In C#: string pattern = @"^(?!.*\bFoo\()(?!.*""[^""\r\n]*[{}][^""\r\n]*"").*$";
Explanation
^ Assert the start of the string
(?! Negative lookahead, assert that what follows does not
.*\bFoo\( Match any character 0+ times followed by a word boundary and Foo(
) Close negative lookahead
(?! Negative lookahead, assert that what follows does not
.* Match any character 0+ times
"[^"\r\n]* Match a double quote, match 0+ times not ", \r, \n
[{}] Match { or }
[^"\r\n]*" Match 0+ times not ", \r, \n followed by matching a double quote
) Close negative lookahead
.* Match any character 0+ times
$ Assert the end of the string
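A rough sketch of applying this pattern line by line is shown here. It assumes RegexOptions.Multiline so that ^ and $ anchor at each line, which the answer does not state explicitly, and the names LineFilter and KeptLines are made up for this example. Note that the pattern keeps or rejects whole lines; it does not extract the quoted part.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class LineFilter
{
    // Pattern from the answer. RegexOptions.Multiline is an assumption made here
    // so that ^ and $ match at the start and end of every line.
    private static readonly Regex KeepLine = new Regex(
        @"^(?!.*\bFoo\()(?!.*""[^""\r\n]*[{}][^""\r\n]*"").*$",
        RegexOptions.Multiline);

    // Returns the lines that contain neither "Foo(" nor braces inside quotes.
    public static List<string> KeptLines(string text)
    {
        return KeepLine.Matches(text)
            .Cast<Match>()
            .Select(m => m.Value.TrimEnd('\r'))   // '.' also matches '\r' in .NET
            .Where(line => line.Length > 0)       // skip empty lines
            .ToList();
    }
}
```

With the four sample lines from the question, this should keep only the findme1 and findme2 lines, because \bFoo does not match inside SuperFoo.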

How to find in string all matches

Assume that I have the following string:
xx##a#11##yyy##bb#2##z
I'm trying to retrieve all occurrences of ##something#somethingElse##
(In my string I want to have 2 matches: ##a#11## and ##bb#2##)
I tried to get all matches using
Regex.Matches(MyString, ".*(##.*#.*##).*")
but it retrieves one match which is the whole row.
How can I get all matches from this string? Thanks.
Since you have .* at the start and end of your pattern, you only get the whole line match. Besides, .* in-between #s in your pattern is too greedy, and would grab all the expected matches into 1 match when encountered on a single line.
You may use
var results = Regex.Matches(MyString, "##[^#]*#[^#]*##")
.Cast<Match>()
.Select(m => m.Value)
.ToList();
NOTE: If there must be at least 1 char in between ## and #, and # and ##, replace * quantifier (matching 0+ occurrences) with + quantifier (matching 1+ occurrences).
NOTE2: To avoid matches inside ####..#....#####, you may add lookarounds: "(?<!#)##[^#]+#[^#]+##(?!#)"
Pattern details:
## - 2 # symbols
[^#]* / [^#]+ - a negated character class matching 0+ chars (or 1+ chars) other than #
# - a single #
[^#]* / [^#]+ - 0+ (or 1+) chars other than #
## - double # symbol.
BONUS: To get the contents inside ## and ##, use a capturing group, a pair of unescaped (...) around the part of the pattern you need to extract, and grab Match.Groups[1].Value:
var results = Regex.Matches(MyString, @"##([^#]*#[^#]*)##")
.Cast<Match>()
.Select(m => m.Groups[1].Value)
.ToList();
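To make the effect of the lookarounds from NOTE2 concrete, here is a small sketch that runs both variants (with the + quantifier from the first NOTE) side by side. The class name HashTokens and the strict parameter are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class HashTokens
{
    // Plain pattern: ## + non-# chars + # + non-# chars + ##.
    private const string Plain = "##[^#]+#[^#]+##";

    // Variant with lookarounds: the token must not touch another '#'.
    private const string Strict = "(?<!#)##[^#]+#[^#]+##(?!#)";

    // Returns every non-overlapping match of the chosen pattern.
    public static List<string> Find(string input, bool strict)
    {
        return Regex.Matches(input, strict ? Strict : Plain)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }
}
```

For the sample string from the question, both variants should find the same two tokens. For an input such as "####a#1#####", the plain pattern still finds a token in the middle of the run of # characters, while the strict variant finds none.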
Regex101
Regex.Matches(MyString, "(##[^#]+#[^#]+##)")
(##[^#]+#[^#]+##)
Description
1st Capturing Group (##[^#]+#[^#]+##)
## matches the characters ## literally (case sensitive)
Match a single character not present in the list below [^#]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
# matches the character # literally (case sensitive)
# matches the character # literally (case sensitive)
Match a single character not present in the list below [^#]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
# matches the character # literally (case sensitive)
## matches the characters ## literally (case sensitive)

Which regular expression will let me match just the first and last letters?

Examples:
i General Biology i
i General Biology
General Biology i
I need to catch any phrase that begins with a single letter or number, ends with a letter or number, or both begins and ends with a single letter or number so that I can pre-parse the data to this:
General Biology
I've tried tons of examples on Rubular but can't seem to figure this one out. I've used literal match groups to get those characters, but I don't want the match groups per se; I just want the regex to capture only those two letters.
You can use the following to achieve this:
String result = Regex.Replace(input, @"(?i)^[a-z0-9]\s+|\s+[a-z0-9]$", "");
Explanation:
This removes a single letter/number at the beginning/end of the string followed or preceded by whitespace.
(?i) # set flags for this block (case-insensitive)
^ # the beginning of the string
[a-z0-9] # any character of: 'a' to 'z', '0' to '9'
\s+ # whitespace (\n, \r, \t, \f, and " ") (1 or more times)
| # OR
\s+ # whitespace (\n, \r, \t, \f, and " ") (1 or more times)
[a-z0-9] # any character of: 'a' to 'z', '0' to '9'
$ # before an optional \n, and the end of the string
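As a quick illustration, the replacement above can be wrapped in a helper and tried on the three sample phrases from the question. The names TitleCleaner and StripEdgeMarkers are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class TitleCleaner
{
    // Removes a single leading letter/digit followed by whitespace, and a
    // single trailing letter/digit preceded by whitespace (case-insensitive).
    public static string StripEdgeMarkers(string input)
    {
        return Regex.Replace(input, @"(?i)^[a-z0-9]\s+|\s+[a-z0-9]$", "");
    }
}
```

Each of "i General Biology i", "i General Biology" and "General Biology i" should come back as "General Biology", and a phrase without such markers should be returned unchanged.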

RegEx to match a number in the second line

I need a regex to match a number in the second line. Similar input is like this:
^C1.1
xC20
SS3
M 4
The decimal pattern (-?\d+(\.\d+)?) matches all numbers, and the second number could be picked out in a loop in the code-behind, but I need a regular expression that directly gets the number in the second line.
/^[^\r\n]*\r?\n\D*?(-?\d+(\.\d+)?)/
This operates by capturing a single line at the beginning of the input:
^ Beginning of the string
[^\r\n]* Anything that isn't a line terminator
\r?\n A newline, optionally preceded by a carriage return
Then all the non digit characters, then your numbers.
Since you've now repeatedly changed your needs, try this on for size:
/(?<=\n\D*)-?\d+(\.\d+)?/
I was able to capture it with this regex.
.*\n\D*(\d*).*\n
Check out group 1 of anything that this matches:
^.*?\r\n.*?(\d+)
If that doesn't work, try this:
^.*?\r\n.*?(\d+)
Both are with multiline NOT set...
I would probably use the captured group in /^.*?\r?\n.*?(-?\d+(?:\.\d+)?)/ where…
^ # beginning of string
.*? # anything...
\r?\n # followed by a new line
.*? # anything...
( # followed by...
-? # an optional negative sign (minus)
\d+ # a number
(?: # -this part not captured explicitly-
\.\d+ # a dot and a number
)? # -and is optional-
)
If it is a flavor that supports lookbehind then there are other alternatives.
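A minimal sketch of the captured-group approach in C# follows, assuming the default options (Multiline not set) so that ^ only matches at the start of the input. The names SecondLineNumber and Find are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class SecondLineNumber
{
    // First line, a line break, then lazily skip to the first number on line two.
    // Multiline is deliberately NOT set, so ^ is the start of the whole input.
    private static readonly Regex Pattern = new Regex(
        @"^.*?\r?\n.*?(-?\d+(?:\.\d+)?)");

    // Returns the first number on the second line as text, or "" if there is none.
    public static string Find(string input)
    {
        Match m = Pattern.Match(input);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

Because . does not match a line feed, the second .*? cannot run past the end of the second line, so an input whose second line has no number should produce no match rather than picking up a number from a later line.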

Make Regex stop looking at \n

I have the following string:
"\t Product: ces DEVICE TYPE \nSometext" //between ":" and "ces" are 9 white spaces
I need to parse the part "DEVICE TYPE". I'm trying to do this with Regex. I use this expression, which works.
((?<=\bProduct:)(\W+\w+){3}\b)
this expression returns:
" ces DEVICE TYPE"
The problem is here: Some devices have a string like this:
"\t Product: ces DEVICETYPE \nSometext"
If I use the same expression to parse the device type I get this as result:
" ces DEVICETYPE \nSometext"
How do I get my regex to stop when a \n is found?
Perhaps this?
(?<=ces)[^\n]+
If all you want is what's after ces and before \n that is..
In .NET you can use RegexOptions.Multiline. This changes the behaviour of ^ and $.
Rather than meaning the start and end of your string, they now mean start and end of any line within your string.
Regex r = new Regex(@"(?<=\bProduct:).+$", RegexOptions.Multiline);
You could use:
(?m)((?<=\bProduct:).+)
Explanation:
(?m)((?<=\bProduct:).+)
Match the remainder of the regex with the options: ^ and $ match at line breaks (m) «(?m)»
Match the regular expression below and capture its match into backreference number 1 «((?<=\bProduct:).+)»
Assert that the regex below can be matched, with the match ending at this position (positive lookbehind) «(?<=\bProduct:)»
Assert position at a word boundary «\b»
Match the characters “Product:” literally «Product:»
Match any single character that is not a line break character «.+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
or
((?<=\bProduct:)[^\r\n]+)
Explanation
((?<=\bProduct:)[^\r\n]+)
Match the regular expression below and capture its match into backreference number 1 «((?<=\bProduct:)[^\r\n]+)»
Assert that the regex below can be matched, with the match ending at this position (positive lookbehind) «(?<=\bProduct:)»
Assert position at a word boundary «\b»
Match the characters “Product:” literally «Product:»
Match a single character NOT present in the list below «[^\r\n]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
A carriage return character «\r»
A line feed character «\n»
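The negated-class variant can be sketched in C# as below. The names ProductLine and AfterProduct are hypothetical, and the final Trim() is an addition made here to drop the padding around the value; it is not part of the answers above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class ProductLine
{
    // Returns the text after "Product:" up to the end of that line,
    // or an empty string if "Product:" is not found.
    public static string AfterProduct(string input)
    {
        // [^\r\n]+ stops at the first line break, so "Sometext" is never included.
        Match m = Regex.Match(input, @"(?<=\bProduct:)[^\r\n]+");

        // Trim() is an assumption of this sketch, used to remove surrounding spaces.
        return m.Success ? m.Value.Trim() : string.Empty;
    }
}
```

For both sample strings in the question, the result should stop before "\nSometext", regardless of whether the device type is one word or two.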
