Regex to stop parsing after semicolon is encountered - c#

I am using this regex to parse URL from a semicolon separated string.
\b(?:https?:|http?:|www\.)\S+\b
It is working fine if my input text is in these formats:
"Google;\"https://google.com\""
//output - https://google.com
"Yahoo;\"www.yahoo.com\""
//output - www.yahoo.com
but in this case it gives incorrect string
"https://google.com;\"https://google.com\""
//output - https://google.com;\"https://google.com
how can I stop the parsing when I encounter the ';' ?

Looking at your examples, I would just match any URL between quotation marks. Something like this:
(?<=")(?:https?:|www\.)[^"]*
You can try it out here
Or as others have said, split the input string by the semicolon character using string.Split, and check each string sequentially for your desired match.
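A minimal sketch of the lookbehind approach in C#, assuming the backslashes in the sample inputs are C# string escapes (so the actual text contains plain double quotes). The class and method names are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedUrlSketch
{
    // (?<=")  the URL must start right after a double quote
    // [^"]*   then run up to (not including) the closing quote
    private static readonly Regex QuotedUrl =
        new Regex(@"(?<="")(?:https?:|www\.)[^""]*");

    // Returns the first quoted URL, or an empty string when none is found.
    public static string ExtractUrl(string input)
    {
        Match match = QuotedUrl.Match(input);
        return match.Success ? match.Value : string.Empty;
    }
}
```

With the third input from the question, the unquoted URL before the semicolon is skipped because it is not preceded by a quote, so only the quoted one is returned.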

For your example data you might use a positive lookahead (?=) and a positive lookbehind (?<=)
(?<=")(?:https?:|www\.).+?(?=;?\\")
That would match
(?<=") Positive lookbehind to assert that what is on the left side is a double quote
(?:https?:|www\.) Match either http with an optional s or www.
.+? Match any character one or more times non greedy
(?=;?\\") Positive lookahead which asserts that what follows is an optional ; followed by \"

I would personally just modify the regex to look specifically for URLs and make the https:// protocol and the www. prefix optional. Using \S+ can be kind of iffy because it will grab every non-whitespace character, whereas a URL is limited in the characters it can contain.
Something like this should work great for your particular needs.
(https?:\/{2})?([w]{3}.)?\w+\.[a-zA-Z]+
This makes the http protocol (with an optional s) optional; when present, it must be immediately followed by ://. Then it will grab as many letters, numbers, and underscores as possible until the ., followed by the last set of characters to end it. You can exchange the [a-zA-Z] character set for an explicit set of domains if you'd prefer.

Related

Regex for alpha number string in c# accepting underscore and white spaces

I have already gone through many posts on SO, but I didn't find what I needed for my specific scenario.
I need a regex for an alphanumeric string
where the following conditions should be matched.
Valid string:
ameya123 (alphabets and numbers)
ameya (only alphabets)
AMeya12 (capital and lowercase alphabets and numbers)
Ameya_123 (alphabets and underscore and numbers)
Ameya_ 123 (alphabets, underscore and white spaces)
Invalid string:
123 (only numbers)
_ (only underscore)
(only space) (only white spaces)
any special character other than underscore
What I have tried till now:
(?=.*[a-zA-Z])(?=.*[0-9]*[\s]*[_]*)
The above regex works in an online regex editor, however it does not work in a data annotation in C#.
Please suggest.
Based on your requirements and not your attempt, what you are in need of is this:
^(?!(?:\d+|_+| +)$)[\w ]+$
The negative lookahead looks for undesired matches to fail the whole process. Those are strings containing digits only, underscores only or spaces only. If they never happen we want to have a match for ^[\w ]+$ which is nearly the same as ^[a-zA-Z0-9_ ]+$.
See live demo here
Explanation:
^ Start of line / string
(?! Start of negative lookahead
(?: Start of non-capturing group
\d+ Match digits
| Or
_+ Match underscores
| Or
[ ]+ Match spaces
)$ End of non-capturing group immediately followed by end of line / string (none of previous matches should be found)
) End of negative lookahead
[\w ]+$ Match a character inside the character set up to end of input string
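Note: in the demo's regex flavor, \w is a shorthand for [a-zA-Z0-9_] unless the u modifier is set. In .NET, \w also matches Unicode letters and digits unless RegexOptions.ECMAScript is used.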
One problem with your regex is that in annotations, the regex must match and consume the entire string input, while your pattern only contains lookarounds that do not consume any text.
You may use
^(?!\d+$)(?![_\s]+$)[A-Za-z0-9\s_]+$
See the regex demo. Note that \w (when used for a server-side validation, and thus parsed with the .NET regex engine) will also allow any Unicode letters, digits and some more stuff when validating on the server side, so I'd rather stick to [A-Za-z0-9_] to be consistent with both server- and client-side validation.
Details
^ - start of string (not necessary here, but good to have when debugging)
(?!\d+$) - a negative lookahead that fails the match if the whole string consists of digits
(?![_\s]+$) - a negative lookahead that fails the match if the whole string consists of underscores and/or whitespaces. NOTE: if you plan to only disallow ____ or " " like inputs, you need to split this lookahead into (?!_+$) and (?!\s+$)
[A-Za-z0-9\s_]+ - 1+ ASCII letters, digits, _ and whitespace chars
$ - end of string (not necessary here, but still good to have).
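As a quick sanity check, the pattern above can be run against the valid and invalid samples from the question. This is only a sketch using Regex.IsMatch directly rather than a data annotation; the class and method names are hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameValidatorSketch
{
    // Two lookaheads reject digit-only and underscore/whitespace-only input;
    // the character class then has to consume the whole string.
    private static readonly Regex Allowed =
        new Regex(@"^(?!\d+$)(?![_\s]+$)[A-Za-z0-9\s_]+$");

    public static bool IsValid(string input) => Allowed.IsMatch(input);
}
```

For example, IsValid("Ameya_ 123") returns true, while IsValid("123"), IsValid("_") and IsValid(" ") all return false.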
If I understand your requirements correctly, you need to match one or more letters (uppercase or lowercase), and possibly zero or more of digits, whitespace, or underscore. This implies the following pattern:
^[A-Za-z0-9\s_]*[A-Za-z][A-Za-z0-9\s_]*$
Demo
In the demo, I have replaced \s with \t \r, because \s was matching across all lines.
Unlike the answers given by @revo and @wiktor, I don't have a fancy looking explanation to the regex. I am beautiful even without my makeup on. Honestly, if you don't understand the pattern I gave, you might want to review a good regex tutorial.
This simple RegEx should do it:
[a-zA-Z]+[0-9_ ]*
One or more letters, followed by zero or more digits, underscores and spaces.
This one should be good:
[\w\s_]*[a-zA-Z]+[\w\s_]*

Regular Expression that matches on values after a pipe in between brackets

I'm still learning a lot about regex, so please forgive any naivety.
I've been using this site to test:
http://www.systemtextregularexpressions.com/regex.match
Basically, I'm having issues writing a regular expression that will match on any value after a pipe in between brackets.
Given an example string of:
"<div> \n [dont1.dont2|match1|match2] |dont3 [dont4] dont5. \n </div>"
Expected output would be a collection:
match1,
match2
The closest I've been able to get so far is:
(?!\[.*(\|)\])(?:\|)([\w-_.,:']*)
Above gives me the values, including the pipes, and dont3.
I've also tried this guy:
\|(.*(?=\]))
but it outputs:
|match1|match2
Here's one way of doing it:
(?<=\[[^\]]*\|)[^\]|]*
Here's the meaning of the pattern:
(?<=\[[^\]]*\|) - Lookbehind expression to ensure that any match must be preceded by an open bracket, followed by any number of non-close-bracket characters, followed by a pipe character
(?<= ... ) - Declares a lookbehind expression. Something matching the lookbehind must immediately precede the text in order for it to match. However, the part matched by the lookbehind is not included in the resulting match.
\[ - Matches an open bracket character
[^\]]* - Matches any number of non-close-bracket characters
\| - Matches a pipe character
[^\]|]* - Matches any number of characters which are neither close brackets nor pipe characters.
The lookbehind is greedy, so it will allow for any number of pipes between the open bracket and the matching text.
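Here is a small sketch of how that pattern might be used from C# on the sample string in the question. It relies on .NET allowing variable-length lookbehinds; the class and method names are only illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PipeValuesSketch
{
    // Each match is the text after a pipe that sits inside an open bracket.
    public static List<string> ValuesAfterPipe(string input)
    {
        return Regex.Matches(input, @"(?<=\[[^\]]*\|)[^\]|]*")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }
}
```

On the sample input this yields match1 and match2; |dont3 is skipped because the bracket has already been closed before that pipe.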
try this:
\[.*?(?:\|(?<mydata>.*?))+\]
note: the online tool will only show you the last capture inside a quantified () for a given match, but .NET will remember each capture of a group that matches multiple times
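To read every capture of the repeated group, something like the following should work. It is a sketch that goes through Group.Captures; the class and method names are made up:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PipeCapturesSketch
{
    // Group.Captures keeps one entry per iteration of the quantified group,
    // whereas Group.Value only exposes the last one.
    public static List<string> CapturedValues(string input)
    {
        Match match = Regex.Match(input, @"\[.*?(?:\|(?<mydata>.*?))+\]");
        return match.Groups["mydata"].Captures
                    .Cast<Capture>()
                    .Select(c => c.Value)
                    .ToList();
    }
}
```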
Try this:
^<div>\s*[^|]+|([^|]+)|([^|]+)

Regex to match comma separated string with no comma at the end of the line

I am trying to write a regex that will allow input of all characters on the keyboard (even space) but will restrict the input of a comma at the end of the line. I have tried this, which includes all the possible characters, but it still does not give me the correct output:
[RegularExpression("^([a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+,)*[a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+$", ErrorMessage = "Comma is not allowed at the end of {0} ")]
^.*[^,]$
.* matches any characters, so the pattern doesn't need to be that long.
^([a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+,)*[a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+(?<!,)$
Just add a lookbehind, (?<!,), at the end.
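The lookbehind idea can be tried in isolation with a deliberately simplified pattern. This sketch assumes single-line input (. does not match a newline without RegexOptions.Singleline) and uses hypothetical names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingCommaSketch
{
    // .* accepts anything on the line; (?<!,) then rejects the match
    // when the character just before the end of the input is a comma.
    private static readonly Regex NoTrailingComma = new Regex(@"^.*(?<!,)$");

    public static bool IsAllowed(string input) => NoTrailingComma.IsMatch(input);
}
```

IsAllowed("a, b, c") returns true, while IsAllowed("a, b, c,") returns false.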
a regex that will allow input of all characters on the keyboard(even space) but will restrict the input of comma at the end of the line.
Mind that you can type much more than what you typed using a keyboard. Basically, you want to allow any character but a comma at the end of the line.
So,
(?!,).(?=\r\n|\z)
This regex is checking each line (because of the (?=\r\n|\z) look-ahead), and the (?!,) look-ahead makes sure the last character (that we match using .) is not a comma. \z is an unambiguous string end anchor.
See regex demo
This will work even on a client side.
To also get the full line match, you can just add .* at the beginning of the pattern (as we are not using singleline flag, . does not match newline symbols):
.*(?!,).(?=\r\n|\z)
Or, to make it faster, use an atomic group or an inline multiline option with the ^ start-of-line anchor (these will not work on the client side):
(?>.*)(?!,).(?=\r\n|\z)
(?m)^.*?(?!,).(?=\r\n|\z) // The fastest of the last three
See demo

Regex pattern in C# with empty space

I am having issue with a reg ex expression and can't find the answer to my question.
I am trying to build a reg ex pattern that will pull in any matches that have # around them. for example #match# or #mt# would both come back.
This works fine for that. #.*?#
However I don't want matches on ## to show up. Basically if there is nothing between the pound signs don't match.
Hope this makes sense.
Thanks.
Please use + to match 1 or more symbols:
#+.+#+
UPDATE:
If you want to only match substrings that are enclosed with single hash symbols, use:
(?<!#)#(?!#)[^#]+#(?!#)
See regex demo
Explanation:
(?<!#)#(?!#) - a # symbol that is not preceded with a # (due to the negative lookbehind (?<!#)) and not followed by a # (due to the negative lookahead (?!#))
[^#]+ - one or more symbols other than # (due to the negated character class [^#])
#(?!#) - a # symbol not followed with another # symbol.
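A short sketch of that pattern in C#, collecting every match from a sample sentence. The class name, method name and sample text are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HashTokensSketch
{
    // Finds substrings enclosed in single # symbols; an empty ## pair is skipped.
    public static List<string> FindTokens(string input)
    {
        return Regex.Matches(input, @"(?<!#)#(?!#)[^#]+#(?!#)")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }
}
```

For the input "#match# and ## and #mt#" this returns #match# and #mt#, and the ## in the middle is ignored.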
Instead of using * to match between zero and unlimited characters, replace it with +, which will only match if there is at least one character between the #'s. The edited regex should look like this: #.+?#. Hope this helps!
Edit
Sorry for the incorrect regex, I had not expected multiple hash signs. This should work for your sentence: #+.+?#+
Edit 2
I am pretty sure I got it. Try this: (?<!#)#[^#].*?#. It might not work as expected with triple hashes though.
Try:
[^#]?#.+#[^#]?
The [^ character_group] construction matches any single character not included in the character group. Using the ? after it will let you match at the beginning/end of a string (since it matches the preceding character zero or one time). Check out the documentation here

C# Regex match on special characters

I know this stuff has been talked about a lot, but I'm having a problem trying to match the following...
Example input: "test test 310-315"
I need a regex expression that recognizes a number followed by a dash, and returns 310. How do I include the dash in the regex expression though. So the final match result would be: "310".
Thanks a lot - kcross
EDIT: Also, how would I do the same thing but with the dash preceding, but also take into account that the number following the dash could be a negative number... didnt think of this one when I wrote the question immediately. for example: "test test 310--315" returns -315 and "test 310-315" returns 315.
Regex regex = new Regex(@"\d+(?=\-)");
\d+ - Looks for one or more digits
(?=\-) - Makes sure it is followed by a dash
The @ just eliminates the need to escape the backslashes to keep the compiler happy.
Also, you may want this instead:
\d+(?=\-\d+)
This will check for one or more digits, followed by a dash, followed by one or more digits, but only match the first set.
In response to your comment, here's a regex that will check for a number following a -, while accounting for potential negative (-) numbers:
Regex regex = new Regex(@"(?<=\-)\-?\d+");
(?<=\-) - Positive lookbehind which will check and make sure there is a preceding -
\-? - Checks for either zero or one dashes
\d+ - One or more digits
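Both lookaround patterns can be tried against the inputs from the question. This is only a sketch with hypothetical names, and the dash is left unescaped since it has no special meaning outside a character class:

```csharp
using System;
using System.Text.RegularExpressions;

public static class DashNumberSketch
{
    // Digits that are immediately followed by a dash (the dash is not consumed).
    public static string NumberBeforeDash(string input) =>
        Regex.Match(input, @"\d+(?=-)").Value;

    // An optionally negative number that is immediately preceded by a dash.
    public static string NumberAfterDash(string input) =>
        Regex.Match(input, @"(?<=-)-?\d+").Value;
}
```

NumberBeforeDash("test test 310-315") gives 310, NumberAfterDash("test test 310--315") gives -315, and NumberAfterDash("test 310-315") gives 315.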
(?'number'\d+)- will work ( no need to escape ). In this example the group containing the single number is the named group 'number'.
if you want to match both groups with optional sign try:
@"(?'first'-?\d+)-(?'second'-?\d+)"
See it working here.
Just to describe, nothing complicated, just using -? to match an optional - and \d+ to match one or more digit. a literal - match itself.
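Reading the two named groups back in C# could look roughly like this. The class and method names are made up, and the tuple return type assumes C# 7 or later:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SignedPairSketch
{
    // Returns the number before and after the separating dash, signs included.
    public static (string First, string Second) Parse(string input)
    {
        Match match = Regex.Match(input, @"(?'first'-?\d+)-(?'second'-?\d+)");
        return (match.Groups["first"].Value, match.Groups["second"].Value);
    }
}
```

For "test test 310--315" the first group holds 310 and the second holds -315.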
here's some documentation that I use:
http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet
in the comments section of that page, it suggests escaping the dash with '\-'
make sure you escape your escape character \
You would escape the special meaning of - in regex language (means range) using a backslash (\). Since backslash has a special meaning in C# literals to escape quotes or be part of some characters, you need to escape that with another backslash(\). So essentially it would be \d+\\-.
\b\d*(?=\-) you will want to look ahead for the dash
\b = start at a word boundary
\d = match any decimal digit
* = match the previous as many times as needed
(?=\-) = look ahead for the dash
Edited for Formatting issue with the slash not showing after posting
