C# Regex syntax for pattern


I need to make sure a field has the proper syntax using Regex in C#, before proceeding. Here is my code:
string Description = "AB1234567,AB3456789;AB2345678";
Regex reg = new Regex("(AB.{7},?)*;?(AB.{7},?)*");
Match match = reg.Match(Description);
if (!match.Success)
{
//code to raise error
}
So, some syntax rules:
The field has elements of 2 letters (in this case AB) followed by 7 characters.
These elements are comma separated, on either the left or the right side of a ";". Which side they are on indicates their properties, but either side can be empty.
If the right side is not empty then ";" is mandatory, if empty it is optional.
The last element of each side cannot end with a ",".
Correct examples:
- AB1234567,AB3456789;AB2345678
- AB1234567,AB3456789;
- AB1234567
- ;AB2345678,AB34567890
Wrong examples:
- AB1234567,;AB2345678
- AB3456789;AB2345678,
My regular expression is not complete, but I cannot come up with how to consider all cases. What is the correct regular expression for this problem?

A straightforward and much simpler version that should work for all cases:
bool IsValid(string line)
{
if (string.IsNullOrEmpty(line)) return true;
return !line.Trim().EndsWith(",");
}
IEnumerable<string> GetTokens(string line)
{
var pattern = @"(AB\d{7}([,;]|[^0-9a-zA-Z]|$))";
var matches = Regex.Matches(line, pattern, RegexOptions.Singleline);
foreach (Match match in matches)
{
yield return match.Value;
}
}
string inputLine = ";AB2345678,AB34567890";
string[] leftRight = inputLine.Split(new[] { ';' });
string left = string.Empty, right = string.Empty;
if (leftRight.Length > 0) left = leftRight[0];
if (leftRight.Length > 1) right = leftRight[1];
bool isLeftValid = IsValid(left);
bool isRightValid = IsValid(right);
IEnumerable<string> leftTokens = null, rightTokens = null;
if (isLeftValid) leftTokens = GetTokens(left);
if (isRightValid) rightTokens = GetTokens(right);

I think your expression is almost correct, you just need to ensure that a comma is followed by another AB group. You can do that with a positive lookahead, like this:
^(AB.{7}(,(?=AB))?)*;?(AB.{7}(,(?=AB))?)*$
You also need the start and end anchors (^ and $); otherwise you will get multiple partial matches.
This expression will not match the ;AB2345678,AB34567890 sample, because its last group has 8 digits instead of 7.
Edit: If you want the AB groups in a nice collection, try
^((?<left>AB.{7})(,(?=AB))?)*;?((?<right>AB.{7})(,(?=AB))?)*$
Then match.Groups["left"].Captures and match.Groups["right"].Captures will give you the respective matched strings (an empty collection if that side had no elements; the Groups indexer never returns null). This is called a named capture.
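To make the named-capture idea concrete, here is a minimal sketch that wraps the pattern from this answer in a helper. The names FieldParser and TryParse are made up for illustration, and the pattern is used exactly as given above, so .{7} still accepts any seven characters rather than only digits.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class FieldParser
{
    // Pattern copied from the answer above; "left" and "right" are its named groups.
    private static readonly Regex Pattern = new Regex(
        @"^((?<left>AB.{7})(,(?=AB))?)*;?((?<right>AB.{7})(,(?=AB))?)*$");

    // Hypothetical helper: returns false for a malformed field,
    // otherwise fills the two arrays with the elements of each side.
    public static bool TryParse(string field, out string[] left, out string[] right)
    {
        Match match = Pattern.Match(field);
        if (!match.Success)
        {
            left = new string[0];
            right = new string[0];
            return false;
        }

        // A group inside a repeated construct keeps every capture, not just the last one.
        left = match.Groups["left"].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
        right = match.Groups["right"].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
        return true;
    }
}
```

Calling FieldParser.TryParse("AB1234567,AB3456789;AB2345678", out var left, out var right) should succeed with two elements on the left and one on the right, while both "wrong examples" from the question are rejected.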

Related

How to remove substring after occurrence of certain characters in a string?

I have the requirement as follows:
input => "Employee.Addresses[].Address.City"
output => "Employee.Addresses[].City"
(Address is removed which is present after [].)
input => "Employee.Addresses[].Address.Lanes[].Lane.Name"
output => "Employee.Addresses[].Lanes[].Name"
(Address is removed which is present after []. and Lane is removed which is present after [].)
How to do this in C#?

private static IEnumerable<string> Filter(string input)
{
var subWords = input.Split('.');
bool skip = false;
foreach (var word in subWords)
{
if (skip)
{
skip = false;
}
else
{
yield return word;
}
if (word.EndsWith("[]"))
{
skip = true;
}
}
}
And now you use it like this:
var filtered = string.Join(".", Filter(input));

How about a regular expression?
Regex rgx = new Regex(@"(?<=\[\])\..+?(?=\.)");
string output = rgx.Replace(input, String.Empty);
Explanation:
(?<=\[\]) //positive lookbehind for the brackets
\. //match literal period
.+? //match any character at least once, but as few times as possible
(?=\.) //positive lookahead for a literal period
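As a rough, self-contained version of the snippet above, the following sketch wraps the same pattern in a helper so it can be tried on both inputs from the question. PathCleaner and RemoveSegmentAfterBrackets are illustrative names, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PathCleaner
{
    // Same pattern as in the answer: a ".Segment" that directly follows "[]"
    // and is itself followed by another period.
    private static readonly Regex SegmentAfterBrackets =
        new Regex(@"(?<=\[\])\..+?(?=\.)");

    // Removes each such segment; everything else is left untouched.
    public static string RemoveSegmentAfterBrackets(string path)
    {
        return SegmentAfterBrackets.Replace(path, string.Empty);
    }
}
```

Because the lookahead requires a following period, a segment that is the last one in the path (for example "A[].B") is not removed.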

Your description of what you need is lacking. Please correct me if I have understood it incorrectly.
You need to find the pattern "[]." and then remove everything after this pattern until the next dot .
If this is the case, I believe using a Regular Expression could solve the problem easily.
So, the pattern "[]." can be written in a regular expression as
"\[\]\."
Then you need to find everything after this pattern until the next dot: ".*?\." (The .*? means any character, repeated as few times as possible, i.e. non-greedy, so the match stops at the first dot it finds).
So, the whole pattern would be:
var regexPattern = @"\[\]\..*?\.";
And you want to replace all matches of this pattern with "[]." (i.e. removing what was matched after the brackets until the dot).
So you call the Replace method in the Regex class:
var result = Regex.Replace(input, regexPattern, "[].");

How to check the string has specific set of character in it or not?

I have a requirement to check whether the incoming string starts with any letter followed by a "-".
sample code is:
string name = "e-rob";
if (name.Contains("[a-z]-"))
{
Console.WriteLine(name);
}
else
{
Console.WriteLine("no match found");
}
Console.ReadLine();
The above code is not working. It is not necessarily e- all the time; it could be any character and then -
How can I do this?

Try using some RegEx:
Regex reg = new Regex("^[a-zA-Z]-");
bool check = reg.IsMatch("e-rob");
Or even more concise:
if (Regex.IsMatch("e-rob", "^[a-zA-Z]-")) {
// do stuff for when it matches here
}
The ^[a-zA-Z] is where the magic happens. Breaking it down piece-by-piece:
^: tells it to start at the beginning of whatever it's checking the pattern against
[a-zA-Z]: tells it to check for one upper- or lower-case letter between A and Z
-: tells it to check for a "-" character directly after the letter
So e-rob or E-rob would both return true where abcdef-g would return false
Also, as a note, in order to use RegEx you need to include
using System.Text.RegularExpressions;
in your class file
Here's a great link to teach you a bit about RegEx which is the best tool ever when you're talking about matching patterns
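The breakdown above can be sketched as a small reusable check. PrefixCheck and StartsWithLetterAndDash are names invented for this example; the pattern is the one from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixCheck
{
    // ^        anchor at the start of the input
    // [a-zA-Z] exactly one ASCII letter
    // -        a literal dash right after it
    private static readonly Regex LetterThenDash = new Regex("^[a-zA-Z]-");

    // Returns true only when the string begins with one letter followed by "-".
    public static bool StartsWithLetterAndDash(string value)
    {
        return value != null && LetterThenDash.IsMatch(value);
    }
}
```

With this, "e-rob" and "E-rob" pass, while "abcdef-g" and "-rob" do not, because the dash must be the second character.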

Try Regex
Regex reg = new Regex("[a-z]-");
if (reg.IsMatch(name.Substring(0, 2))) // note: Substring throws if name is shorter than 2 characters
{...}

Another way to do this, kind of LINQish:
StartsWithLettersOrDash("Test123");
public bool StartsWithLettersOrDash(string str)
{
string alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
char [] alphas = (alphabet + alphabet.ToLower()).ToCharArray();
return alphas.Any(z => str.StartsWith(z.ToString() + "-"));
}

C# Regex match all occurrences

I'm trying to make a regular expression in C# that will match strings made up only of two-digit numeric character references (the form &#NN;), but my Regex stops at the first match, and I'd like to match the whole string.
I've been trying with a lot of ways to do this, currently, my code looks like this:
string sPattern = @"/&#\d{2};/";
Regex rExp = new Regex(sPattern);
MatchCollection mcMatches = rExp.Matches(txtInput.Text);
foreach (Match m in mcMatches) {
if (!m.Success) {
//Give Warning
}
}
And also tried lblDebug.Text = Regex.IsMatch(txtInput.Text, "(&#[0-9]{2};)+").ToString(); but it also only finds the first match.
Any tips?
Edit:
The end result I'm seeking is that strings like &# are flagged as incorrect. As it is now, since only the first match is made, my code marks such a string as correct.
Second Edit:
I changed my code to this
string sPattern = @"&#\d{2};";
Regex rExp = new Regex(sPattern);
MatchCollection mcMatches = rExp.Matches(txtInput.Text);
int iMatchCount = 0;
foreach (Match m in mcMatches) {
if (m.Success) {
iMatchCount++;
}
}
int iTotalStrings = txtInput.Text.Length / 5;
int iVerify = txtInput.Text.Length % 5;
if (iTotalStrings == iMatchCount && iVerify == 0) {
lblDebug.Text = "True";
} else {
lblDebug.Text = "False";
}
And this works the way I expected, but I still think this can be achieved in a better way.
Third Edit:
As @devundef suggested, the expression "^(&#\d{2};)+$" does the work I was hoping for, so with this, my final code looks like this:
string sPattern = @"^(&#\d{2};)+$";
Regex rExp = new Regex(sPattern);
lblDebug.Text = rExp.IsMatch(txtInput.Text).ToString();
I always neglect the start and end of string characters (^ / $).

Remove the / at the start and end of the expression.
string sPattern = @"&#\d{2};";
EDIT
I tested the pattern and it works as expected. Not sure what you want.
Two options:
&#\d{2}; => will give N matches in the string. On a string made of two such entities it will produce 2 matches, one per entity.
(&#\d{2};)+ => will match the whole string as one single match. On the same string it will produce 1 match covering both entities.
Edit 2:
What you want is not get the groups but know if the string is in the right format. This is the pattern:
Regex rExp = new Regex(@"^(&#\d{2};)+$");
var isValid = rExp.IsMatch("&#12;&#34;"); // e.g. a string made only of &#NN; entities: isValid = true
isValid = rExp.IsMatch("xyz"); // isValid = false
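The difference between counting matches and validating the whole string can be sketched as follows. EntityValidator and the sample inputs in the note below are assumptions made for illustration; the two patterns are the ones discussed in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EntityValidator
{
    // Unanchored: finds each &#NN; wherever it occurs.
    private static readonly Regex OneEntity = new Regex(@"&#\d{2};");

    // Anchored: the whole input must consist of one or more &#NN; entities.
    private static readonly Regex WholeString = new Regex(@"^(&#\d{2};)+$");

    public static int CountEntities(string input)
    {
        return OneEntity.Matches(input).Count;
    }

    public static bool IsValid(string input)
    {
        return WholeString.IsMatch(input);
    }
}
```

For an input such as "&#12;&#3", CountEntities still reports one match, which is why the unanchored check looked successful, while IsValid returns false. Note that in .NET $ also matches just before a final newline; \z can be used instead if that matters.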

Here you go: (&#\d{2};)+ This should work for one occurrence or more.

(&#\d{2};)*
Recommend: http://www.weitz.de/regex-coach/

How to find repeatable characters

I can't understand how to solve the following problem:
I have input string "aaaabaa" and I'm trying to search for string "aa" (I'm looking for positions of characters)
Expected result is
0 1 2 5
0: (aa)aabaa
1: a(aa)abaa
2: aa(aa)baa
5: aaaab(aa)
This problem is already solved by me using another approach (non-RegEx).
But I need a RegEx. I'm new to RegEx, so a Google search hasn't really helped me.
Any help appreciated! Thanks!
P.S.
I've tried to use (aa)* and "\b(\w+(aa))*\w+" but those expressions are wrong

You can solve this by using a lookahead
a(?=a)
will find every "a" that is followed by another "a".
If you want to do this more generally
(\p{L})(?=\1)
This will find every letter that is followed by the same letter. Each letter found is stored in a capturing group (because of the parentheses around it), and the positive lookahead assertion (the (?=...)) reuses that group through the backreference \1, which holds the matched character.
\p{L} matches any Unicode code point in the category "letter".
Code
String text = "aaaabaa";
Regex reg = new Regex(@"(\p{L})(?=\1)");
MatchCollection result = reg.Matches(text);
foreach (Match item in result) {
Console.WriteLine(item.Index);
}
Output
0
1
2
5

The following code should work with any regular expression without having to change the actual expression:
Regex rx = new Regex(@"(a)\1"); // or any other word you're looking for; verbatim string so \1 stays a backreference
int position = 0;
string text = "aaaaabbbbccccaaa";
int textLength = text.Length;
Match m = rx.Match(text, position);
while (m != null && m.Success)
{
Console.WriteLine(m.Index);
if (m.Index <= textLength)
{
m = rx.Match(text, m.Index + 1);
}
else
{
m = null;
}
}
Console.ReadKey();
It uses the option to change the start index of a regex search for each consecutive search. The actual problem comes from the fact that the Regex engine, by default, will always continue searching after the previous match. So it will never find a possible match within another match, unless you instruct it to by using a Look ahead construction or by manually setting the start index.
Another, relatively easy, solution is to just stick the whole expression in a forward look ahead:
string expression = @"(a)\1";
Regex rx2 = new Regex("(?=" + expression + ")");
MatchCollection ms = rx2.Matches(text);
var indexes = ms.Cast<Match>().Select(match => match.Index);
That way the engine will automatically advance the index by one for every match it finds.
From the docs:
When a match attempt is repeated by calling the NextMatch method, the regular expression engine gives empty matches special treatment. Usually, NextMatch begins the search for the next match exactly where the previous match left off. However, after an empty match, the NextMatch method advances by one character before trying the next match. This behavior guarantees that the regular expression engine will progress through the string. Otherwise, because an empty match does not result in any forward movement, the next match would start in exactly the same place as the previous match, and it would match the same empty string repeatedly.
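Putting the lookahead trick and the empty-match behaviour together, a minimal sketch for finding every (possibly overlapping) position of a literal word might look like this. OverlapFinder and FindAll are hypothetical names; Regex.Escape is used so the word is treated literally.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class OverlapFinder
{
    // Wraps the word in a lookahead so every match is empty; the engine then
    // advances one character at a time and reports each starting position.
    public static int[] FindAll(string text, string word)
    {
        var lookahead = new Regex("(?=" + Regex.Escape(word) + ")");
        return lookahead.Matches(text)
                        .Cast<Match>()
                        .Select(m => m.Index)
                        .ToArray();
    }
}
```

For the input from the question, OverlapFinder.FindAll("aaaabaa", "aa") yields the positions 0, 1, 2 and 5.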

Try this:
How can I find repeated characters with a regex in Java?
It is in Java, but both the regex and the non-regex way are there. C# regex syntax is very similar to Java's.

Formatting sentences in a string using C#

I have a string with multiple sentences. How do I capitalize the first letter of the first word in every sentence? Something like paragraph formatting in Word.
e.g. "this is some code. the code is in C#. "
The output must be "This is some code. The code is in C#".
One way would be to split the string on '.', capitalize the first letter of each part, and then rejoin.
Is there a better solution?

In my opinion, when it comes to potentially complex rules-based string matching and replacing - you can't get much better than a Regex-based solution (despite the fact that they are so hard to read!). This offers the best performance and memory efficiency, in my opinion - you'll be surprised at just how fast this'll be.
I'd use the Regex.Replace overload that accepts an input string, regex pattern and a MatchEvaluator delegate. A MatchEvaluator is a function that accepts a Match object as input and returns a string replacement.
Here's the code:
public static string Capitalise(string input)
{
//now the first character
return Regex.Replace(input, @"(?<=(^|[.;:])\s*)[a-z]",
(match) => { return match.Value.ToUpper(); });
}
The regex uses the (?<=) construct (zero-width positive lookbehind) to restrict matches to a-z characters preceded by the start of the string, or by the punctuation marks you want. In the [.;:] bit you can add any extra ones you want (e.g. [.;:?"] to add the ? and " characters).
This means, also, that your MatchEvaluator doesn't have to do any unnecessary string joining (which you want to avoid for performance reasons).
All the other stuff mentioned by one of the other answerers about using the RegexOptions.Compiled is also relevant from a performance point of view. The static Regex.Replace method does offer very similar performance benefits, though (there's just an additional dictionary lookup).
Like I say - I'll be surprised if any of the other non-regex solutions here will work better and be as fast.
EDIT
Have put this solution up against Ahmad's as he quite rightly pointed out that a look-around might be less efficient than doing it his way.
Here's the crude benchmark I did:
public string LowerCaseLipsum
{
get
{
//went to lipsum.com and generated 10 paragraphs of lipsum
//which I then initialised into the backing field with @"[lipsumtext]".ToLower()
return _lowerCaseLipsum;
}
}
[TestMethod]
public void CapitaliseAhmadsWay()
{
List<string> results = new List<string>();
DateTime start = DateTime.Now;
Regex r = new Regex(@"(^|\p{P}\s+)(\w+)", RegexOptions.Compiled);
for (int f = 0; f < 1000; f++)
{
results.Add(r.Replace(LowerCaseLipsum, m => m.Groups[1].Value
+ m.Groups[2].Value.Substring(0, 1).ToUpper()
+ m.Groups[2].Value.Substring(1)));
}
TimeSpan duration = DateTime.Now - start;
Console.WriteLine("Operation took {0} seconds", duration.TotalSeconds);
}
[TestMethod]
public void CapitaliseLookAroundWay()
{
List<string> results = new List<string>();
DateTime start = DateTime.Now;
Regex r = new Regex(@"(?<=(^|[.;:])\s*)[a-z]", RegexOptions.Compiled);
for (int f = 0; f < 1000; f++)
{
results.Add(r.Replace(LowerCaseLipsum, m => m.Value.ToUpper()));
}
TimeSpan duration = DateTime.Now - start;
Console.WriteLine("Operation took {0} seconds", duration.TotalSeconds);
}
In a release build, my solution was about 12% faster than Ahmad's (1.48 seconds as opposed to 1.68 seconds).
Interestingly, however, when it was done through the static Regex.Replace method, both were about 80% slower, and my solution was slower than Ahmad's.

Here's a regex solution that uses the punctuation category to avoid having to specify .!?" etc. although you should certainly check if it covers your needs or set them explicitly. Read up on the "P" category under the "Supported Unicode General Categories" section located on the MSDN Character Classes page.
string input = @"this is some code. the code is in C#? it's great! In ""quotes."" after quotes.";
string pattern = @"(^|\p{P}\s+)(\w+)";
// compiled for performance (might want to benchmark it for your loop)
Regex rx = new Regex(pattern, RegexOptions.Compiled);
string result = rx.Replace(input, m => m.Groups[1].Value
+ m.Groups[2].Value.Substring(0, 1).ToUpper()
+ m.Groups[2].Value.Substring(1));
If you decide not to use the \p{P} class you would have to specify the characters yourself, similar to:
string pattern = @"(^|[.?!""]\s+)(\w+)";
EDIT: below is an updated example to demonstrate 3 patterns. The first shows how all punctuations affect casing. The second shows how to pick and choose certain punctuation categories by using class subtraction. It uses all punctuations while removing specific punctuation groups. The third is similar to the 2nd but using different groups.
The MSDN link doesn't spell out what some of the punctuation categories refer to, so here's a breakdown:
P: all punctuations (comprises all of the categories below)
Pc: underscore _
Pd: dash -
Ps: open parenthesis, brackets and braces ( [ {
Pe: closing parenthesis, brackets and braces ) ] }
Pi: initial single/double quotes (MSDN says it "may behave like Ps/Pe depending on usage")
Pf: final single/double quotes (MSDN Pi note applies)
Po: other punctuation such as commas, colons, semi-colons and slashes ,, :, ;, \, /
Carefully compare how the results are affected by these groups. This should grant you a great degree of flexibility. If this doesn't seem desirable then you may use specific characters in a character class as shown earlier.
string input = @"foo ( parens ) bar { braces } foo [ brackets ] bar. single ' quote & "" double "" quote.
dash - test. Connector _ test. Comma, test. Semicolon; test. Colon: test. Slash / test. Slash \ test.";
string[] patterns = {
@"(^|\p{P}\s+)(\w+)", // all punctuation chars
@"(^|[\p{P}-[\p{Pc}\p{Pd}\p{Ps}\p{Pe}]]\s+)(\w+)", // all punctuation chars except Pc/Pd/Ps/Pe
@"(^|[\p{P}-[\p{Po}]]\s+)(\w+)" // all punctuation chars except Po
};
// compiled for performance (might want to benchmark it for your loop)
foreach (string pattern in patterns)
{
Console.WriteLine("*** Current pattern: {0}", pattern);
string result = Regex.Replace(input, pattern,
m => m.Groups[1].Value
+ m.Groups[2].Value.Substring(0, 1).ToUpper()
+ m.Groups[2].Value.Substring(1));
Console.WriteLine(result);
Console.WriteLine();
}
Notice that "Dash" is not capitalized using the last pattern and it's on a new line. One way to make it capitalized is to use the RegexOptions.Multiline option. Try the above snippet with that to see if it meets your desired result.
Also, for the sake of example, I didn't use RegexOptions.Compiled in the above loop. To use both options OR them together: RegexOptions.Compiled | RegexOptions.Multiline.
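To illustrate the effect of RegexOptions.Multiline described above, here is a small sketch that applies the third pattern (all punctuation except Po) with and without that option. SentenceCase and Capitalize are illustrative names, and the inputs mentioned in the note below are made up for the example, with an explicit \n instead of a line break inside a verbatim string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SentenceCase
{
    // Third pattern from the answer: start of input (or of a line, with Multiline),
    // or any punctuation except the Po category, followed by whitespace and a word.
    private const string Pattern = @"(^|[\p{P}-[\p{Po}]]\s+)(\w+)";

    // Upper-cases the first letter of each matched word, keeping the prefix as is.
    public static string Capitalize(string input, RegexOptions options)
    {
        return Regex.Replace(input, Pattern,
            m => m.Groups[1].Value
               + m.Groups[2].Value.Substring(0, 1).ToUpper()
               + m.Groups[2].Value.Substring(1),
            options);
    }
}
```

With RegexOptions.None, "one. two\nthree" should come back as "One. two\nthree", since the period is in Po and ^ only matches at the very start. With RegexOptions.Multiline, ^ also matches after the newline, so "three" becomes "Three" as well.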

You have a few different options:
Your approach of splitting the string, capitalizing and then re-joining
Using regular expressions to perform a replace of the expressions (which can be a bit tricky for case)
Write a C# iterator that iterates over each character and yields a new IEnumerable<char> with the first letter after a period in upper case. May offer benefit of a streaming solution.
Loop over each char and upper-case those that appear immediately after a period (whitespace ignored) - a StringBuilder may make this easier.
The code below uses the last approach, a loop with a StringBuilder:
public static string ToSentenceCase( string someString )
{
var sb = new StringBuilder( someString.Length );
bool wasPeriodLastSeen = true; // We want first letter to be capitalized
foreach( var c in someString )
{
if( wasPeriodLastSeen && !char.IsWhiteSpace( c ) )
{
sb.Append( char.ToUpper( c ) );
wasPeriodLastSeen = false;
}
else
{
if( c == '.' ) // you may want to expand this to other punctuation
wasPeriodLastSeen = true;
sb.Append( c );
}
}
return sb.ToString();
}

I don't know why, but I decided to give yield return a try, based on what LBushkin had suggested. Just for fun.
static IEnumerable<char> CapitalLetters(string sentence)
{
//capitalize first letter
bool capitalize = true;
char lastLetter;
for (int i = 0; i < sentence.Length; i++)
{
lastLetter = sentence[i];
yield return (capitalize) ? Char.ToUpper(sentence[i]) : sentence[i];
if (Char.IsWhiteSpace(lastLetter) && capitalize == true)
continue;
capitalize = false;
if (lastLetter == '.' || lastLetter == '!') //etc
capitalize = true;
}
}
To use it:
string sentence = new String(CapitalLetters("this is some code. the code is in C#.").ToArray());

Do your work in a StringBuilder.
Lowercase the whole thing.
Loop through and uppercase leading chars.
Call ToString.
