How to find abbreviations as words in a C# regular expression


I have been given a list of strings to find as whole "words" in my string. Generally, the \b anchor works, except when I'm trying to find the & character as a word or when the abbreviation ends with a dot. \b doesn't match between a space and the & character, or between a period and a following space.
For instance to find these strings:
&
b&w
bpi
p.
I'm trying to write something like:
\b((&)|(b&w)|(bpi)|(p\.))\b
In a test string:
my b&w and & and p. test.
I've also tried using \s to check for whitespace, but I don't want to capture the whitespace and I haven't been able to figure out how to avoid it. I believe it would then also need to check for the beginning and end of the string.

Instead of using word boundaries (\b), you could use lookaround assertions that check for whitespace or the beginning (^) or end ($) of the line, like so:
(?<=^|\s)([^\s]*)(?=\s|$)
Working regex example:
http://regex101.com/r/rJ0wU4
Test string:
my b&w and & and p. test.
Matches:
"my", "b&w", "and", "&", "and", "p.", "test."
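The same lookarounds can be combined with the specific abbreviation list from the question, so that only those terms are reported. The following is a minimal sketch under that assumption; the class and method names (AbbreviationFinder, Build, Find) are made up for illustration, and sorting longer terms first is an extra precaution not taken from the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class AbbreviationFinder
{
    // Builds (?<=^|\s)(?:term1|term2|...)(?=\s|$) from a list of literal terms.
    public static Regex Build(params string[] terms)
    {
        // Regex.Escape makes '.' and similar characters match literally.
        // Longer terms go first so a short term cannot shadow a longer one.
        string alternation = string.Join("|",
            terms.OrderByDescending(t => t.Length).Select(Regex.Escape));
        return new Regex(@"(?<=^|\s)(?:" + alternation + @")(?=\s|$)");
    }

    // Returns the matched terms in the order they appear in the input.
    public static string[] Find(string input, params string[] terms)
    {
        return Build(terms).Matches(input)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    static void Main()
    {
        string[] found = Find("my b&w and & and p. test.", "&", "b&w", "bpi", "p.");
        Console.WriteLine(string.Join(", ", found)); // prints: b&w, &, p.
    }
}
```

Because the lookarounds are zero-width, the surrounding whitespace is checked but never becomes part of the match.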

Try putting all the abbreviations in one group, like:
(^|\s+)(&|b&w|bpi|p\.)(\s+|$)
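With this pattern the surrounding whitespace is part of the overall match, so the abbreviation itself has to be read from the second capture group. A small sketch of that, with hypothetical names (GroupedAbbreviations, FindAll):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class GroupedAbbreviations
{
    // Group 1 = leading whitespace or start, group 2 = the abbreviation,
    // group 3 = trailing whitespace or end.
    static readonly Regex Pattern = new Regex(@"(^|\s+)(&|b&w|bpi|p\.)(\s+|$)");

    public static List<string> FindAll(string input)
    {
        var result = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            result.Add(m.Groups[2].Value); // skip the whitespace in groups 1 and 3
        }
        return result;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", FindAll("my b&w and & and p. test.")));
        // prints: b&w, &, p.
    }
}
```

One caveat of consuming the whitespace: when two abbreviations are separated by a single space, as in "b&w & p.", the space is used up by the first match and the second abbreviation ("&") is skipped. The lookaround form avoids this.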

Related

Match text not surrounded by & and ;

I am currently using the following regular expression:
(?<!&)[^&;]*(?!;)
To match text like this:
match1&lt;match2&gt;
And extract:
match1
match2
However, this seems to match an extra five empty strings. See Regex Storm.
How can I only match the two listed above?
Note the existing pattern ((?<=^|;)[^&]+) by @xanatos will only match matches 1 to 3 in the following string and not match4:
match1&lte;match2&lt;match;3+match&4

Try changing the * to a +:
(?<!&)[^&;]+(?!;)
More correct regex:
(?<=^|;)[^&]+
The basic idea here is that a "good" substring starts at the beginning of the string (^) or right after the ;, and ends when you encounter a & ([^&]+).
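As a rough illustration of that idea in C#, the sketch below runs the second pattern over an entity-encoded sample. The input string and the names EntityFreeText and Extract are assumptions made for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class EntityFreeText
{
    // A "good" run starts at the beginning of the string or right after ';'
    // and continues until the next '&'.
    public static string[] Extract(string input)
    {
        return Regex.Matches(input, @"(?<=^|;)[^&]+")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    static void Main()
    {
        // Assumed sample input: two text runs separated by &...; entities.
        string[] parts = Extract("match1&lt;match2&gt;");
        Console.WriteLine(string.Join(", ", parts)); // prints: match1, match2
    }
}
```

Using + instead of * is what removes the empty matches: every match must contain at least one character.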
Third version... though at this point we are demonstrating the old saying: if you have a problem and you decide to use regexes, now you have two problems:
(?<=^|;)([^&]|&(?=[^&;]*(?:&|$)))+

I have managed it with:
(?<Text>.+?)(?:&[^&;]*?;|$)
This seems to match all of the corner cases but it might not work with a case I can't think of at the moment.
This won't work if the string starts with a &...; pattern or is only that.
See Regex Storm.

Extract string from a pattern preceded by any length

I'm looking for a regular expression to extract a string from a file name
eg if filename format is "anythingatallanylength_123_TESTNAME.docx", I'm interested in extracting "TESTNAME" ... probably fixed length of 8. (btw, 123 can be any three digit number)
I think I can use regex match ...
".*_[0-9][0-9][0-9]_[A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z].docx$"
However this matches the whole thing. How can I just get "TESTNAME"?
Thanks

Use parentheses to capture a specific piece of the whole match.
You can also use curly braces to specify repeat counts, and \d for [0-9].
In C#:
var myRegex = new Regex(@".*_\d{3}_([A-Za-z]{8})\.docx$");
Now "TESTNAME", or whatever your 8-letter piece is, will be available as Groups[1] of the Match returned by myRegex.Match(...).
Also note, there will be a performance overhead for look-ahead and look-behind, as presented in some other solutions.
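A minimal sketch of reading the capture group, using the file name from the question. The names FileNamePart and Extract are hypothetical, and returning null on no match is a choice made only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class FileNamePart
{
    // Group 1 captures the eight letters between "_ddd_" and ".docx".
    static readonly Regex Pattern = new Regex(@".*_\d{3}_([A-Za-z]{8})\.docx$");

    // Returns the captured letters, or null when the name does not fit the format.
    public static string Extract(string fileName)
    {
        Match m = Pattern.Match(fileName);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        Console.WriteLine(Extract("anythingatallanylength_123_TESTNAME.docx"));
        // prints: TESTNAME
    }
}
```

Groups[0] is always the whole match; numbered capture groups start at 1.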

You can use a look-behind and a look-ahead to check parts without matching them:
(?<=_[0-9]{3}_)[A-Z]{8}(?=\.docx$)
Note that this is case-sensitive, you may want to use other character classes and/or quantifiers to fit your exact pattern.
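With the lookaround form, the match value itself is already the wanted piece, so no group index is needed. A short sketch with hypothetical names (LookaroundFileName, Extract):

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaroundFileName
{
    // The look-behind and look-ahead check the context without consuming it.
    static readonly Regex Pattern = new Regex(@"(?<=_[0-9]{3}_)[A-Z]{8}(?=\.docx$)");

    public static string Extract(string fileName)
    {
        Match m = Pattern.Match(fileName);
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        Console.WriteLine(Extract("anythingatallanylength_123_TESTNAME.docx"));
        // prints: TESTNAME
    }
}
```

As written, a lowercase name such as "x_123_testname.docx" is not matched; passing RegexOptions.IgnoreCase or widening the class to [A-Za-z] would change that.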

In your file name format "anythingatallanylength_123_TESTNAME.docx", the part you want sits between the last underscore and .docx. Since [A-Za-z] cannot match an underscore, earlier underscores in the name are not a problem, and I came up with the following solution.
Regex: (?<=_)[A-Za-z]*(?=\.docx$)
Flags used:
g global search
m multi-line search.
Explanation:
(?<=_) checks if there is an underscore before the file name.
(?=\.docx$) checks for extension at the end.
[A-Za-z]* checks the required match.

Thanks to @Lucero @noob @JamesFaix I came up with ...
@"(?<=.*_[0-9]{3}_)[A-Z]{8}(?=\.docx$)"
So: a look-behind (in brackets, starting with ?<=) for anything (i.e. zero or more of any character, denoted by ".*"), followed by an underscore, followed by three digits, followed by an underscore. That's the end of the look-behind. Then comes the part I actually need to match (eight letters). Finally, the look-ahead (in brackets, starting with ?=), which is the .docx extension at the end of the name.
Nice work, fellas. Thunderbirds are go.

How to insert spaces between characters using Regex?

Trying to learn a little more about using Regex (Regular expressions). Using Microsoft's version of Regex in C# (VS 2010), how could I take a simple string like:
"Hello"
and change it to
"H e l l o"
This could be a string of any letter or symbol, capitals, lowercase, etc., and there are no other letters or symbols following or leading this word. (The string consists of only the one word).
(I have read the other posts, but I can't seem to grasp Regex. Please be kind :) ).
Thanks for any help with this. (an explanation would be most useful).

You could do this with a regex alone; no other built-in C# string functions are needed.
Use either of the regexes below and replace each matched (zero-width) position with a space.
(?<=.)(?!$)
string result = Regex.Replace(yourString, @"(?<=.)(?!$)", " ");
Explanation:
(?<=.) Positive lookbehind asserts that the match must be preceded by a character.
(?!$) Negative lookahead which asserts that the match won't be followed by an end of the line anchor. So the boundaries next to all the characters would be matched but not the one which was next to the last character.
OR
You could also use word boundaries.
(?<!^)(\B|\b)(?!$)
string result = Regex.Replace(yourString, @"(?<!^)(\B|\b)(?!$)", " ");
Explanation:
(?<!^) Negative lookbehind which asserts that the match won't be at the start.
(\B|\b) Matches the boundary which exists between two word characters and two non-word characters (\B) or match the boundary which exists between a word character and a non-word character (\b).
(?!$) Negative lookahead asserts that the match won't be followed by an end of the line anchor.
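Both patterns match only positions, not characters, so the replacement inserts a space without removing anything. The sketch below wraps each one in a helper; the names Spacer, WithLookarounds and WithBoundaries are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class Spacer
{
    // Matches every position that has a character before it and is not the end.
    public static string WithLookarounds(string s)
    {
        return Regex.Replace(s, @"(?<=.)(?!$)", " ");
    }

    // Every position is either \B or \b, so this matches every position
    // except the start and the end of the string.
    public static string WithBoundaries(string s)
    {
        return Regex.Replace(s, @"(?<!^)(\B|\b)(?!$)", " ");
    }

    static void Main()
    {
        Console.WriteLine(WithLookarounds("Hello")); // prints: H e l l o
        Console.WriteLine(WithBoundaries("Hello"));  // prints: H e l l o
    }
}
```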

Regex.Replace("Hello", "(.)", "$1 ").TrimEnd();
Explanation
The dot character class matches every character of your string "Hello".
The parentheses around the dot are required so that we can refer to the captured character through the $n notation.
Each captured character is replaced by the replacement string. Our replacement string is "$1 " (notice the space at the end). Here $1 represents the first captured group in the input, therefore our replacement string will replace each character by that character plus one space.
This technique will add one space after the final character "o" as well, so we call TrimEnd() to remove that.
For the enthusiast, the same effect can be achieved through LINQ using this one-liner:
String.Join(" ", YourString.AsEnumerable())
or if you don't want to use the extension method:
String.Join(" ", YourString.ToCharArray())

It's very simple. To match any character, use . (dot), then replace it with that same character followed by one extra space.
Here the parentheses (...) form a capturing group whose content can be referenced as $index in the replacement.
Find what : "(.)"
Replace with "$1 "

Using Visual Studio 2013 regular expression find - how to invalidate multiple prefixes

I'm trying to find the "symbol" all in a bunch of C#, XML, and JS files. My project is huge and doing a naive search for "all" results in over 8,000 lines found so I'm trying to eliminate some of them.
For example, I don't want to match on "call" or "balloon" or "Balloon" (those are UX element styles).
Looking at the using Regular Expressions MSDN page (http://msdn.microsoft.com/en-us/library/vstudio/2k3te2cs%28v=vs.110%29.aspx) I found out how to invalidate on one of those but I can't figure out how to do it on multiple and make it case-insensitive.
I started off using:
(?!c)all
That filtered out call and similar words, but I can't get a version that filters out multiple prefixes to work.
(?!b|c)all
is the form I've been playing around with, trying to get it to ignore balloon as well. Ideally I could do something like this (warning: invalid regex below):
(?!b|c|B|C|)all
If anyone could point me in the right direction that would be great. The reason why I'm not looking for all surrounded by spaces is because I don't know if the reference I'm looking for is going to be:
.All
.all
("All")
(all)
and etc...

Have you tried: [aA][lL][lL]\b
This matches any casing of "all" ("all", "All", "ALL", ...) that ends at a word boundary.
Here is another reference:
Regular Expression to match specific string

The following regex: (?<!(b|c))all (with the IgnoreCase flag)
With the following input: ball all stall .all( "ALL"
Has the following matches: ball [all] st[all] .[all]( "[ALL]"

I think you are on the right track with lookarounds, but you want to use a character class with it. (More info: http://www.regular-expressions.info/charclass.html)
There are convenient shorthand character classes, like \w, to represent common classes. \w, for example, represents word characters and is roughly shorthand for [A-Za-z0-9_] (in .NET it also matches other Unicode letters and digits unless RegexOptions.ECMAScript is set).
\b represents a "word boundary": a position between a word character and a non-word character, including the beginning or end of the string when a word character sits there. It is zero-length and won't consume any characters.
Here are some examples using a word boundary, positive lookarounds, and negative lookarounds respectively:
\b[aA][lL][lL]\b
(?<=[^\w])[aA][lL][lL](?=[^\w])
(?<!\w)[aA][lL][lL](?!\w)
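Basically, these will find case-insensitive matches of "all" that are surrounded by non-alphanumeric characters. Note that the positive lookaround version requires an actual character on each side, so it will not match "all" at the very start or end of the input. If you want to exclude certain surrounding characters, you can replace \w with your own character class (e.g. to exclude surrounding quote marks, use [A-Za-z0-9_"] instead of \w).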
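To compare the approaches in C#, the sketch below runs two patterns over the sample input from the previous answer, using RegexOptions.IgnoreCase instead of spelling out [aA][lL][lL]. The names AllFinder and Find are hypothetical, and (?<![bc])all is simply a character-class spelling of that answer's (?<!(b|c))all.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class AllFinder
{
    // Returns every match of the pattern, ignoring case.
    public static string[] Find(string input, string pattern)
    {
        return Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    static void Main()
    {
        string input = "ball all stall .all( \"ALL\"";

        // Rejects only a preceding b or c, so the "all" inside "stall" still matches.
        Console.WriteLine(Find(input, @"(?<![bc])all").Length);     // prints: 4

        // Rejects any word character on either side.
        Console.WriteLine(Find(input, @"(?<!\w)all(?!\w)").Length); // prints: 3
    }
}
```

Which one fits depends on whether identifiers such as "stall" or "allocate" should count as hits in the search.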

Match text after colon

I want to match the word after the "type :".
What I have?
My actual pattern:
(?<=type\s:\s)(\w*)
Text:
"type : text,"
It works exactly as I want when I have just one whitespace before/after the colon...
"type_SPACE_:_SPACE_text"
But if I have two spaces or none, it doesn't work.
I already tried this, but it doesn't match:
(?<=type\s*:\s*)(\w*)
I also tried this, my best approach so far. But with it, the matched text contains the colon.
(?<=type)(\s*):(\s*)(.*)(?=,)
To do the test I use gskinner's tester...
http://gskinner.com/RegExr/

If you're doing this in C# and using the included Regex engine, your original regex should work, with a slight modification:
string myString = "type : something";
var match = Regex.Match(myString, @"(?<=type\s*:\s*)\w+");
Console.Write(match);
Edit: The reason the (?<=type\s*:\s*)\w* version wasn't working for you with multiple spaces is that \w* can match zero characters, so the first match is an empty string at a position right after the colon or between the following spaces, before the word itself.
You can view all the matches by using Regex.Matches; you'll see that your matched word is in there, but it's not the first result.
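The difference between \w* and \w+ can be checked directly. This is a small sketch relying on .NET's support for variable-length look-behind; the names TypeValue, Extract and FirstMatchWithStar are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class TypeValue
{
    // \w+ needs at least one word character, so empty matches are impossible.
    public static string Extract(string input)
    {
        Match m = Regex.Match(input, @"(?<=type\s*:\s*)\w+");
        return m.Success ? m.Value : null;
    }

    // The \w* variant, kept only to show the empty first match.
    public static string FirstMatchWithStar(string input)
    {
        return Regex.Match(input, @"(?<=type\s*:\s*)\w*").Value;
    }

    static void Main()
    {
        Console.WriteLine(Extract("type : text,"));   // prints: text
        Console.WriteLine(Extract("type:text,"));     // prints: text
        Console.WriteLine(Extract("type  :  text,")); // prints: text
        Console.WriteLine(FirstMatchWithStar("type  :  text,").Length); // prints: 0
    }
}
```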
