Regex: Match any punctuation character except . and _ - c#

Is there an easy way to match all punctuation except period and underscore, in a C# regex? Hoping to do it without enumerating every single punctuation mark.

Use Regex Subtraction
[\p{P}-[._]]
See the .NET Regex documentation. I'm not sure if other flavors support it.
C# example
string pattern = @"[\p{P}\p{S}-[._]]"; // added \p{S} to get ^,~ and ` (among others)
string test = @"_""'a:;%^&*~`bc!##.,?";
MatchCollection mx = Regex.Matches(test, pattern);
foreach (Match m in mx)
{
    Console.WriteLine("{0}: {1} {2}", m.Value, m.Index, m.Length);
}
Explanation
The pattern is a Character Class Subtraction. It starts with a standard character class like [\p{P}] and then adds a Subtraction Character Class like -[._], which says to remove the . and _. The subtraction is placed inside the [ ] after the standard class guts.

The answers so far do not respect ALL punctuation. This should work:
(?![\._])\p{P}
(Explanation: Negative lookahead to ensure that neither . nor _ are matched, then match any unicode punctuation character.)
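To compare the lookahead form with the subtraction form, here is a minimal sketch. The `FindAll` helper and the sample string are made up for illustration, and the snippet assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Top-level statements (C# 9 or later).
// Hypothetical helper: returns every match of the pattern as a string.
static string[] FindAll(string input, string pattern) =>
    Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();

string sample = "a.b_c,d!e"; // hypothetical test input

// Lookahead form: reject '.' and '_' first, then consume one punctuation character.
string[] viaLookahead = FindAll(sample, @"(?![._])\p{P}");

// Character class subtraction form (.NET syntax).
string[] viaSubtraction = FindAll(sample, @"[\p{P}-[._]]");

Console.WriteLine(string.Join(" ", viaLookahead));   // , !
Console.WriteLine(string.Join(" ", viaSubtraction)); // , !
```

Both patterns should give the same matches; the lookahead form is the one to reach for in flavors without class subtraction.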

Here is something a little simpler. Not words or white-space (where words include A-Za-z0-9 AND underscore).
[^\w\s.]

You could possibly use a negated character class like this:
[^0-9A-Za-z._\s]
This includes every character except those listed. You may need to exclude more characters (such as control characters), depending on your ultimate requirements.
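As a rough illustration of the "may need to exclude more characters" caveat, the sketch below runs the two negated classes against a string containing a non-ASCII letter. The `FindAll` helper and the sample input are assumptions for the demo; top-level statements need C# 9 or later.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns every match of the pattern as a string.
static string[] FindAll(string input, string pattern) =>
    Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();

string sample = "caf\u00E9; ok_1."; // "café; ok_1." (hypothetical input)

// ASCII-only negated class: the accented letter is not listed, so it matches too.
string[] asciiNegated = FindAll(sample, @"[^0-9A-Za-z._\s]");

// \w is Unicode-aware in .NET by default, so the accented letter is excluded here.
string[] wordNegated = FindAll(sample, @"[^\w\s.]");

Console.WriteLine(string.Join(" ", asciiNegated)); // é ;
Console.WriteLine(string.Join(" ", wordNegated));  // ;
```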

Related

Split String At Every Non-Letter/Non-Number Character

Imagine a string that contains special characters like $§%%,., numbers and letters.
I want to receive the letter and number chunks of an arbitrary string as an array of strings.
A good solution seems to be the use of regex, but I don't know how to express [numbers and letters]
// example
"abc" = {"abc"};
"ab .c" = {"ab", "c"}
"ab123,cd2, ,,%&$§56" = {"ab123", "cd2", "56"}
// try
string input = "jdahs32455$§&%$§df233§$fd";
string[] output = input.Split(Regex("makejunksfromstring"));
To extract chunks of 1 or more letters/digits you may use
[A-Za-z0-9]+ # ASCII only letters/digits
[\p{L}0-9]+ # Any Unicode letters and ASCII only digits
[\p{L}\p{N}]+ # Any Unicode letters/digits
See a regex demo.
C# usage:
string[] output = Regex.Matches(input, @"[\p{L}\p{N}]+").Cast<Match>().Select(x => x.Value).ToArray();
Yes, regex is indeed a good solution for this.
And in fact, to just match all standard words in the input sequence, this is all you need:
(\w+)
Let me quickly explain
\w matches any word character. For ASCII input that is equivalent to [a-zA-Z0-9_] (a through z, A through Z, 0-9 or _); in .NET it also matches Unicode letters and digits unless RegexOptions.ECMAScript is set. You might wanna go with [a-zA-Z0-9] instead to avoid that underscore.
Wrapping an expression in () means that you want to capture that part as a group.
The + means that you want sequences of 1 or more of the preceding characters.
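Here is a small sketch of the underscore difference described above. The `FindAll` helper and the input string are hypothetical; top-level statements need C# 9 or later.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns every match of the pattern as a string.
static string[] FindAll(string input, string pattern) =>
    Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();

string sample = "ab_12,cd"; // hypothetical input

// \w+ treats the underscore as a word character, so "ab_12" stays in one piece.
string[] wordChunks = FindAll(sample, @"\w+");

// The explicit class stops at the underscore.
string[] alnumChunks = FindAll(sample, @"[a-zA-Z0-9]+");

Console.WriteLine(string.Join(" | ", wordChunks));  // ab_12 | cd
Console.WriteLine(string.Join(" | ", alnumChunks)); // ab | 12 | cd
```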
Refer to a regular expression cheat sheet to see all the possibilities, such as
https://cheatography.com/davechild/cheat-sheets/regular-expressions/
Or any that you find online.
Also there are tools available to quickly test out your regular expressions, such as
https://regex101.com/ (quite well visualised matching)
or http://regexstorm.net/tester specifically for .NET

Problem with regex, how do I get all with \S up until a special character?

I've got the text:
192.168.20.31 Url=/flash/56553550_hi.mp4?token=(uniquePlayerReference=81781956||videoId=1)
And I'm trying to get the uniquePlayerReference and the videoId.
I've tried this regular expression:
(?<=uniquePlayerReference=)\S*
but it matches:
81781956||videoId=1)
And then I try and get the video id with this:
(?<=videoId=)\S*
But it matches the ) after the videoId.
My question is two fold:
1) How do I use the \S character and get it to stop at a character? (Essentially, what is the regex to do what I want?) I can't get it to stop at a defined character; I think I need to use a positive lookahead to match but not include the double pipe.
2) When should I use brackets?
The problem is the multiplicity operator you have here, the *, which means "zero or more, as many as possible". If you have an explicit count in mind you can use the {a,b} operator, where a is the minimum and b the maximum number of matches. With an unknown count, \S is too generic on its own: it also matches | and ), so nothing stops it before the next whitespace.
As for brackets, if you mean () you use them to capture a part of a match for backreferencing. Bit complicated, think you need to use a reference for that.
I think you want something like this:
/uniquePlayerReference=(\d+)\|\|videoId=(\d+)/i
and then backreference to \1 and \2 respectively.
Given that both id's are numeric you are probably better off using \d instead of \S. \d only matches numeric digits whereas \S matches any non-whitespace character.
What you might also do is a non-greedy match up to the character you do not want to match, like so:
uniquePlayerReference=(.*?)\|\|videoId=(.*?)\)
Note that I have escaped both the | and ) characters because otherwise they would have a special meaning inside a regex.
In C# you would use this like so: (which also answers your question what the brackets are for, they are meant to capture parts of the matched result).
Regex regex = new Regex(@"uniquePlayerReference=(.*?)\|\|videoId=(.*?)\)");
Match match = regex.Match(
    "192.168.20.31 Url=/flash/56553550_hi.mp4?token=(uniquePlayerReference=81781956||videoId=1)");
if (match.Success)
{
    string playerReference = match.Groups[1].Value;
    string videoId = match.Groups[2].Value;
    // Etc.
}
If the ID isn't just digits then you could use [^|] instead of \S, i.e.
(?<=uniquePlayerReference=)[^|]*
Then you can use
(?<=videoId=)[^)]*
For the video ID
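In C# the two lookbehind patterns could be used roughly like this. It is a sketch against the log line from the question; the variable names are my own, and top-level statements need C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

string line = "192.168.20.31 Url=/flash/56553550_hi.mp4?token=(uniquePlayerReference=81781956||videoId=1)";

// The lookbehind anchors the match just after the key; [^|]* then stops at the first pipe.
string playerReference = Regex.Match(line, @"(?<=uniquePlayerReference=)[^|]*").Value;

// Same idea, stopping at the closing parenthesis instead.
string videoId = Regex.Match(line, @"(?<=videoId=)[^)]*").Value;

Console.WriteLine(playerReference); // 81781956
Console.WriteLine(videoId);         // 1
```

Because the lookbehind is zero-width, the match value is just the ID and no capture group is needed.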
The \S means it matches any non-whitespace character, including the closing parenthesis. So if you had to use \S, you would have to explicitly say stop at the closing parenthesis, like this:
videoId=(\S+)\)
Therefore, you are better off using the \d, since what you are looking for are numeric:
uniquePlayerReference=(\d+)
videoId=(\d+)

Simple Regex Question

I am new to regex (15 minutes of experience) so I can't figure this one out. I just want something that will match an alphanumeric string with no spaces in it. For example:
"ThisIsMyName" should match, but
"This Is My Name" should not match.
^[a-zA-Z0-9]+$ will match any letters and any numbers with no spaces (or any punctuation) in the string. It will also require at least one alphanumeric character. This uses a character class for the matching. Breakdown:
^ #Match the beginning of the string
[ #Start of a character class
a-z #The range of lowercase letters
A-Z #The range of uppercase letters
0-9 #The digits 0-9
] #End of the character class
+ #Repeat the previous one or more times
$ #End of string
Further, if you want to "capture" the match so that it can be referenced later, you can surround the regex in parens (a capture group), like so:
^([a-zA-Z0-9]+)$
Even further: since you tagged this with C#, MSDN has a little howto for using regular expressions in .NET. It can be found here. You can also note the fact that if you run the regex with the RegexOptions.IgnoreCase flag then you can simplify it to:
^([a-z0-9]+)$
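A minimal sketch of checking a whole string in C#, with and without RegexOptions.IgnoreCase. The helper names are hypothetical; top-level statements need C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true when the whole string is one or more ASCII letters/digits.
static bool IsAlphanumeric(string s) => Regex.IsMatch(s, @"^[a-zA-Z0-9]+$");

// Same check, with the shorter class and case-insensitive matching.
static bool IsAlphanumericIgnoreCase(string s) =>
    Regex.IsMatch(s, @"^[a-z0-9]+$", RegexOptions.IgnoreCase);

Console.WriteLine(IsAlphanumeric("ThisIsMyName"));           // True
Console.WriteLine(IsAlphanumeric("This Is My Name"));        // False
Console.WriteLine(IsAlphanumeric(""));                       // False
Console.WriteLine(IsAlphanumericIgnoreCase("ThisIsMyName")); // True
```

Note that in .NET `$` also matches just before a trailing newline; use `\z` instead if that matters for your input.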
this will match any sequence of non-space characters:
\S+
Take a look at this link for a good basic Regex information source: http://regexlib.com/CheatSheet.aspx
They also have a handy testing tool that I use quite a bit: http://regexlib.com/RETester.aspx
That said, @eldarerathis' or @Nicolas Bottarini's answers should work for you.
I have just written a blog entry about regex, maybe it's something you may find useful:)
http://blogs.appframe.com/erikv/2010-09-23-Regular-Expression
Try using this regex to see if it works: (\w+)

Regex issue with reserved characters in c#

I've got a working regex that scans a chunk of text for a list of keywords defined in a db. I dynamically create my regex from the db to get this:
\b(?:keywords|from|database|with|esc\#ped|characters|\#ss|gr\#ss)\b
Notice that special characters are escaped. This works for the vast majority of cases, EXCEPT where the first character of the keyword is a regex special character like # or $. So in the above example, #ss will not be matched, but gr#ss and esc#ped will.
Any ideas how to get this regex to work for these special cases? I've tried both with and without escaping the special characters in the regex string, but to no avail.
Thanks in advance,
David
new Regex(@"(?<=^|\W)(?:keywords|from|database|with|esc#ped|characters|#ss|gr#ss)(?=\W|$)")
will match. It checks whether there is a non-word character (or beginning/end of string) before/after the keyword to be matched. I chose \W over \s because of punctuation and other non-word characters that might constitute a word boundary.
Edit: Even better (thanks to Alan Moore! - both versions will produce the same results):
new Regex(@"(?<!\w)(?:keywords|from|database|with|esc#ped|characters|#ss|gr#ss)(?!\w)")
Both will fail to match #ss in l#ss, which is probably what you want.
When you get the keywords from the database, escape them with Regex.Escape before creating the Regex string.
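Putting Regex.Escape together with the lookaround boundaries might look like the sketch below. The keyword array stands in for the database query, `$5` is a made-up extra keyword to show a leading `$`, and the test sentence is hypothetical. Top-level statements need C# 9 or later.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Stand-in for the keywords loaded from the database ("$5" is a hypothetical addition).
string[] keywords = { "keywords", "from", "database", "with", "esc#ped", "characters", "#ss", "gr#ss", "$5" };

// Escape each keyword so characters such as '#' and '$' are treated literally.
string alternation = string.Join("|", keywords.Select(Regex.Escape));

// (?<!\w) and (?!\w) replace \b, which cannot match next to a non-word character.
string pattern = @"(?<!\w)(?:" + alternation + @")(?!\w)";

string text = "#ss is gr#ss, l#ss costs $5"; // hypothetical input
string[] found = Regex.Matches(text, pattern).Cast<Match>().Select(m => m.Value).ToArray();

Console.WriteLine(string.Join(" ", found)); // #ss gr#ss $5
```

The `#ss` inside `l#ss` is skipped because a word character precedes it.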
The # does not denote a word boundary.
Use: (\s|^)(?:keywords|from|database|with|esc#ped|characters|#ss|gr#ss)(\s|$)
Tested with the following program:
static void Main(string[] args)
{
    string pattern = "(\\s|^)(?:keywords|from|database|with|esc#ped|characters|#ss|gr#ss)(\\s|$)";
    var matches = Regex.Matches("#ss is gr#ss is esc#ped keywordsnospace keywords", pattern);
    foreach (Match match in matches)
    {
        // The match includes the surrounding whitespace, so trim it to get the keyword.
        Console.WriteLine(match.Value.Trim());
    }
}
Giving the result:
#ss
gr#ss
esc#ped
keywords

regex for capturing digits and digit ranges

I have the following string:
Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)
I want to capture
2121,323.222
2–2.4
0.5
i.e. I want the above three results from the string.
Can anyone help me with this regex?
I noticed that the hyphen in 2–2.4kg is not really a hyphen; it's the Unicode character U+2013 (EN DASH).
So, here is another regex in C#
@"[0-9]+([,.\u2013-][0-9]+)*"
Test
MatchCollection matches = Regex.Matches("Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)", @"[0-9]+([,.\u2013-][0-9]+)*");
foreach (Match m in matches) {
    Console.WriteLine(m.Groups[0]);
}
Here are the results. My console does not support printing the Unicode character U+2013, so it shows as "?", but it is properly matched.
2121,323.222
2?2.4
0.5
Okay I didn't notice the C# tag until now. I will leave the answer but I know that's not what you expected, see if you can do something with it. Perhaps the title should have mentioned the programming language?
Sure:
Fat mass loss was (.*) greater for GPLC \((.*) vs. (.*)kg\)
Find your substrings in \1, \2 and \3.
If for Emacs, swap all parentheses and escaped parentheses.
How about something like this:
^.*((?:\d+,)*\d+(?:\.\d+)?).*(\d+(?:\.\d+)?(?:-\d+(?:\.\d+))?).*(\d+(?:\.\d+)).*$
A little more general, I think. I'm a little concerned about .* being greedy.
Fat mass loss was 2121,323.222 greater
for GPLC (2–2.4kg vs. 0.5kg)
a generalized extractor:
/\D+?([\d\,\.\-]+)/g
explanation:
/ # start pattern
\D+? # 1 or more non-digits, as few as possible
( # capture group 1
[\d,.-]+ # character class, 1 or more of digits, comma, period, hyphen
) # end capture group 1
/g # trailing regex g modifier (make regex continue after last match)
sorry I don't know c# well enough for a full writeup, but the pattern should plug right in.
see: http://www.radsoftware.com.au/articles/regexsyntaxadvanced.aspx for some implementation examples.
I came out with something like this atrocity:
-?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))?(?:[–-]-?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))?)?
Out of which -?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))? is repeated twice, with – in the middle (note that this is a long hyphen).
This should take care of dots and commas outside of numbers, eg: hello,23,45.2-7world - will capture 23,45.2-7.
It looks like you're trying to find all numbers in the string (possibly with commas inside the number), and all ranges of numbers such as "2-2.4". Here is a regex that should work:
\d+(?:[,.-]\d+)*
From C# 3, you can use it like this:
var input = "Fat mass loss was 2121,323.222 greater for GPLC (2-2.4kg vs. 0.5kg)";
var pattern = @"\d+(?:[,.-]\d+)*";
var matches = Regex.Matches(input, pattern);
// Declare the loop variable as Match: on .NET Framework, MatchCollection enumerates as object.
foreach (Match match in matches)
    Console.WriteLine(match.Value);
Hmm, this is a tricky question, especially because the input string contains the Unicode character – (EN DASH) instead of - (HYPHEN-MINUS). Therefore the correct regex to match the numbers in the original string would be:
\d+(?:[\u2013,.]\d+)*
A more generic approach would be:
\d+(?:[\p{Pd}\p{Pc}\p{Po}]\d+)*
which matches dash punctuation, connector punctuation and other punctuation. See here for more information about those.
An implementation in C# would look like this:
string input = "Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)";
try {
    Regex rx = new Regex(@"\d+(?:[\p{Pd}\p{Pc}\p{Po}]\d+)*", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    Match match = rx.Match(input);
    while (match.Success) {
        // matched text: match.Value
        // match start: match.Index
        // match length: match.Length
        match = match.NextMatch();
    }
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}
Let's try this one :
(?=\d)([0-9,.-]+)(?<=\d)
It captures all expressions containing only :
"[0-9,.-]" characters,
must start with a digit "(?=\d)",
must finish with a digit "(?<=\d)"
It works with a single digit expression and does not include beginning or trailing [.,-].
Hope this helps.
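A quick C# sketch of how the two lookarounds trim leading and trailing separators. The `FindAll` helper is hypothetical, and the first input uses an ASCII hyphen in place of the en dash from the question (an assumption; the class would need \u2013 added to handle the original text). Top-level statements need C# 9 or later.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns every match of the pattern as a string.
static string[] FindAll(string input, string pattern) =>
    Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();

string pattern = @"(?=\d)([0-9,.-]+)(?<=\d)";

// Question text with the en dash replaced by an ASCII hyphen (assumption).
string input = "Fat mass loss was 2121,323.222 greater for GPLC (2-2.4kg vs. 0.5kg)";
string[] numbers = FindAll(input, pattern);

// Trailing separators are given back when the lookbehind forces a backtrack.
string[] trimmed = FindAll("1,2, 3.", pattern);

Console.WriteLine(string.Join(" | ", numbers)); // 2121,323.222 | 2-2.4 | 0.5
Console.WriteLine(string.Join(" | ", trimmed)); // 1,2 | 3
```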
I got the solution to my problem.
The following is the Regex that gave my desired result:
(([0-9]+)([–.,-]*))+
