Order a set of regex patterns OR Get the biggest regex match


Given a list of regex patterns that describe numbers, sort them by the last 4 digits of the numbers they match.
I have a list of regex (phone number) patterns and I am trying to sort them by the last 4 characters. Here's a sample list of phone numbers:
8062
\+13066598273
4083100
408320[0-3]
408320[4-6]
752[234569]
\+13066598305
8059
I would like to re-order these numbers by the last 4 numbers so that I would end up with a list like this:
4083100
408320[0-3]
408320[4-6]
752[234569]
8059
8062
\+13066598273
\+13066598305
Now if my patterns were nothing but numbers, I could sort them easily in either SQL or my MVC C# project. In SQL, I could use ORDER BY RIGHT(pattern, 4), or in C# MVC, I could sort my IQueryable list with patterns = patterns.OrderByDescending(s => s.Substring(...etc...)).
Patterns are a bit more difficult. The brackets count as characters, so sorting by the last 4 characters doesn't work as well.
Are there any built-in utilities in C#, MVC, or SQL that allow me to convert regex patterns to the largest possible match?
Given a regex pattern, return the largest possible matching regex that matches my condition. For example, if I had the pattern 4[12]00[1-3], I'd have 6 possible results: 41001, 41002, 41003, 42001, 42002, 42003. Then I can get the biggest number possible, and use that for ordering in my original regex list.
The regex pattern doesn't contain anything like * or +, special characters that may cause infinite combinations.
There may be a C# library that does exactly what I ask for, sorting regex pattern strings.
EDIT:
I've accepted Diego's answer, but it took me a bit of time to wrap my head around it. For other readers who want to know what it's doing, this is what I think Diego is doing:
Make a range of ints, starting at 9999, all the way back to 0. [9999], [9998], [9997]...[0].
Replace the regex-ish part of the strings with a single character. Example, "500[1-5]" would become "500X", or "20[1-9]00[89]" would become "20X00X", so on, so forth.
Get the length of the "last" 4 characters + regex characters.
var len = lastNChars + pattern.Length - Regex.Replace(pattern, @"\[[^\]]+\]", "X").Length;
So for the pattern 20[1-9]00[89], the above formula translates to "len = 4 + 13 - 6", or 11.
Using the len variable from above, get a substring that represents the "last" 4 numbers of the phone number, even with regex characters. The original string = "20[1-9]00[89]", while the new substring = "[1-9]00[89]" (the 20 is gone now)
Enumerate through the array and compare each value to the substring regex pattern. [9999] doesn't match the regex pattern, [9998] doesn't match...[9997] doesn't match...aha! 9009 matches! The first match I get is going to be the biggest possible regex match, which is what I want.
So each regex pattern has been converted to its largest possible matching pattern. Now we can use C#/LINQ/other in-built methods that can sort by those sub-regex matches for us!
Thank god I'm dealing with only numbers. If I were trying to sort regex patterns that were actually words/had alpha characters, that would be way harder, and that array would be hella bigger (I think).

It is hard to find sample strings that match a regular expression without enumerating them all and testing them. I also don't think you will be able to find a C# library that sorts regexes. However, you can solve this problem for the special case of patterns that do not contain quantifiers (such as [0-9]+ or [3-6]{4}) as follows:
// Requires: using System; using System.Linq; using System.Text.RegularExpressions;
const int lastNChars = 4;
var patterns = new string[]{@"8062", @"\+13066598273", @"4083100",
    @"408320[0-3]", @"408320[4-6]", @"752[234569]",
    @"\+13066598305", @"8059"};
// Candidate numbers 9999, 9998, ..., 0 (largest first).
var range = Enumerable.Range(0, (int) Math.Pow(10, lastNChars))
    .Reverse().ToArray();
var sortedPatterns = patterns.OrderBy(pattern => {
    // Length of the pattern tail that covers the last N characters,
    // counting each [...] class as a single character.
    var len = lastNChars + pattern.Length
        - Regex.Replace(pattern, @"\[[^\]]+\]", "X").Length;
    // Get the biggest number in range that matches this regex:
    var biggestNumberMatched = range.FirstOrDefault(x =>
        Regex.IsMatch(x.ToString(new String('0', lastNChars)),
            pattern.Substring(pattern.Length - len, len))
    );
    return biggestNumberMatched;
}).ToArray();
After which, sortedPatterns contains the desired output.
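The brute-force search over 10,000 candidates can also be avoided. What follows is only a sketch under some assumptions: the patterns contain nothing but literal characters, backslash-escaped literals such as \+, and non-negated bracket classes of digits with ascending ranges. The names PatternSort, LargestMatch and SortKey are made up for this illustration. The idea is to build the largest matching string directly by taking the highest digit from each class, then use its last four characters as the sort key.

```csharp
using System;
using System.Linq;
using System.Text;

public static class PatternSort
{
    // Builds the largest string a quantifier-free pattern can match by
    // picking the highest digit found in every [...] class.
    // Assumption: classes hold only digits and ascending ranges, no negation.
    public static string LargestMatch(string pattern)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < pattern.Length; i++)
        {
            char c = pattern[i];
            if (c == '\\' && i + 1 < pattern.Length)
            {
                // Escaped literal such as \+ : keep the escaped character itself.
                i++;
                sb.Append(pattern[i]);
            }
            else if (c == '[')
            {
                int close = pattern.IndexOf(']', i);
                string body = pattern.Substring(i + 1, close - i - 1);
                // The largest digit written in the class is also its largest
                // member, because a range like 0-3 names its upper end.
                sb.Append(body.Where(char.IsDigit).Max());
                i = close;
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    // Sort key: the last n characters of the largest match, left-padded
    // with zeros when the match is shorter than n.
    public static string SortKey(string pattern, int n = 4)
    {
        string s = LargestMatch(pattern);
        return s.Length <= n ? s.PadLeft(n, '0') : s.Substring(s.Length - n);
    }
}
```

A possible use would be patterns.OrderBy(p => PatternSort.SortKey(p), StringComparer.Ordinal). Because every key has the same length, ordinal string comparison orders the keys the same way numeric comparison would.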

Here is one solution, credits to Matt Hamilton from this question:
var pList = new List<string>()
{
"01233[0-3]", "12356[1-3]", "55555[7-9]"
};
var paired =
pList.Select(x =>
new KeyValuePair<int, string>
(Int32.Parse(new String((new String(x.Where(Char.IsDigit).Reverse().ToArray()))
.Substring(0, 4).Reverse().ToArray())), x));
var pairedOrdered = paired.OrderByDescending(x => x.Key);
foreach(var kvp in pairedOrdered)
{
Console.WriteLine("Key: {0} Value: {1}", kvp.Key, kvp.Value);
}
Output:
Key: 5613 Value: 12356[1-3]
Key: 5579 Value: 55555[7-9]
Key: 3303 Value: 01233[0-3]

Related

how to extract a number only, not any operators

Hi, I am trying to match only the numbers in a string that contains operators.
However, the following regex is also giving me the operators and I don't know why.
For example, I have the string "2X/8" and I am trying to get rid of 8.
if(Regex.IsMatch(elements[i], @"\d"))
{
Console.WriteLine("Adding to numberstack:+ ", elements[i]);
numberStack.Push(elements[i]);
}
if (i >= elements.Length - 1)
{
Console.WriteLine("Inside the popper");
if ((i - 2) >= 0)
{
Console.WriteLine(numberStack.Peek());
if (elements[i - 1].Contains("/*") && elements[i - 2].Contains("X"))
{
numberStack.Pop();
}
}
}

The question seems a bit confusing to me, but I'm assuming the following (correct me if I'm wrong):
You are trying to discover if there is a number at the beginning of your string, and extract that number
"elements" is an array of string elements.
"numberStack" is a stack of string elements.
If so, notice you are pushing the "elements[i]" value to the stack, which I understand contains the whole string of the expression you are trying to evaluate ("2X/8" as per your example), and not only the number at the start of the expression. This would explain why you are getting not only the number "2" on your result, but the whole "2X/8" value.
There are several ways to extract the numbers for an expression, and selecting one of them depends on your specific needs.
As a quick example, if you just want to extract each set of numbers from a string, you can get all matches for your regular expression inside that string and iterate through them:
string theExpression = "2X/8+123";
MatchCollection matches = Regex.Matches(theExpression, @"\d+");
foreach (Match m in matches)
Console.WriteLine(m.Value);
Console.ReadLine();
This example would print the numbers 2, 8 and 123 from the given expression to the output.
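If the goal is to push only the numbers onto the stack and keep the operators elsewhere, one option is to tokenize the expression first. This is a rough sketch, not the asker's code: the Tokenizer class and its Split method are hypothetical names, and it assumes every non-digit, non-whitespace character counts as a one-character operator.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class Tokenizer
{
    // Splits an expression such as "2X/8" into a stack of number tokens
    // and a list of operator tokens.
    public static (Stack<string> Numbers, List<string> Operators) Split(string expression)
    {
        var numbers = new Stack<string>();
        var operators = new List<string>();
        // \d+ grabs a whole run of digits; [^\d\s] grabs any other single
        // character that is not whitespace.
        foreach (Match m in Regex.Matches(expression, @"\d+|[^\d\s]"))
        {
            if (char.IsDigit(m.Value[0]))
                numbers.Push(m.Value);
            else
                operators.Add(m.Value);
        }
        return (numbers, operators);
    }
}
```

With this shape, the stack never receives a token like "2X/8" as a whole, so popping it only ever removes a number.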

Get sub-strings from a string that are enclosed using some specified character

Suppose I have a string
Likes (20)
I want to fetch the sub-string enclosed in round brackets (in the above case it's 20) from this string. This sub-string can change dynamically at runtime. It might be any other number from 0 to infinity. To achieve this, my idea is to use a for loop that traverses the whole string: when a ( is found, it starts adding the characters to another character array, and when a ) is encountered, it stops adding characters and returns the array. But I think this might have poor performance. I know very little about regular expressions, so is there a regular expression solution available, or any function that can do this in an efficient way?

If you don't fancy using regex you could use Split:
string foo = "Likes (20)";
string[] arr = foo.Split(new char[]{ '(', ')' }, StringSplitOptions.None);
string count = arr[1];
Count = 20
This will work fine regardless of the number in the brackets ()
e.g:
Likes (242535345)
Will give:
242535345

Works also with pure string methods:
string result = "Likes (20)";
int index = result.IndexOf('(');
if (index >= 0)
{
result = result.Substring(index + 1); // take part behind (
index = result.IndexOf(')');
if (index >= 0)
result = result.Remove(index); // remove part from )
}

For a strict matching, you can do:
Regex reg = new Regex(@"^Likes \((\d+)\)$");
Match m = reg.Match(yourstring);
this way you'll have all you need in m.Groups[1].Value.
As suggested from I4V, assuming you have only that sequence of digits in the whole string, as in your example, you can use the simpler version:
var res = Regex.Match(str, @"\d+");
and in this case, you can get the value you are looking for with res.Value.
EDIT
In case the value enclosed in brackets is not just numbers, you can just change the \d with something like [\w\d\s] if you want to allow in there alphabetic characters, digits and spaces.

Even with Linq:
var s = "Likes (20)";
var s1 = new string(s.SkipWhile(x => x != '(').Skip(1).TakeWhile(x => x != ')').ToArray());

const string likes = "Likes (20)";
// The length is the distance between the two brackets, excluding both.
int likesCount = int.Parse(likes.Substring(likes.IndexOf('(') + 1, likes.IndexOf(')') - likes.IndexOf('(') - 1));

Matching when the part in parentheses is supposed to be a number:
string inputstring = "Likes (20)";
Regex reg = new Regex(@"\((\d+)\)");
string num = reg.Match(inputstring).Groups[1].Value;
Explanation:
By definition a regexp matches a substring, so unless you indicate otherwise the string you are looking for can occur at any place in your string.
\d stands for digits. It will match any single digit.
We want it to potentially be repeated several times, and we want at least one. The + sign is regexp for "previous symbol or group repeated 1 or more times".
So \d+ will match one or more digits. It will match 20.
To ensure that we get the number that is in parentheses, we say that it should be between ( and ). These are special characters in regexp, so we need to escape them.
\(\d+\) would match (20), and we are almost there.
Since we want the part inside the parentheses, not including the parentheses themselves, we tell regexp that the digits part is a single group.
We do that by using parentheses in our regexp. \((\d+)\) will still match (20), but now it will note that 20 is a subgroup of this match and we can fetch it by Match.Groups[].
For any string in parentheses, things get a little bit harder.
Regex reg = new Regex(@"\((.+)\)");
would work for many strings (the dot matches any character). But if the input is something like "This is an example(parantesis1)(parantesis2)", you would match (parantesis1)(parantesis2) with parantesis1)(parantesis2 as the captured subgroup. This is unlikely to be what you are after.
The solution can be to do the matching for "any character except a closing parenthesis":
Regex reg = new Regex(@"\(([^)]+)\)");
This will find (parantesis1) as the first match, with parantesis1 as .Groups[1].
It will still fail for nested parentheses, but since regular expressions are not the correct tool for nested parentheses, I feel that this case is a bit out of scope.
If you know that the string always starts with "Likes " before the group, then Save's solution is better.
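To see the "anything except a closing parenthesis" idea applied to every bracketed group in a string, here is a small sketch. The class name Parens and the method Extract are invented for this example, and nested parentheses are still not handled.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class Parens
{
    // Returns the text inside each non-nested (...) pair, in order.
    public static List<string> Extract(string input)
    {
        // [^)]+ cannot run past a closing parenthesis, so adjacent
        // groups are captured one at a time.
        return Regex.Matches(input, @"\(([^)]+)\)")
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToList();
    }
}
```

Calling Parens.Extract on "Likes (20)" yields a single element, "20", which can then be passed to int.Parse if a number is expected.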

How to cut rest of string after substring in c#?

I got a string like this (my readline):
alfa: 10662 beta: -64 gama: 70679 delta: 1001
I need to use some of these numbers as parameters, but the numbers can have varying lengths. I imagine I could extract the value of alfa with:
str1 = readline.Substring(6, 5);
But how would I get the value of gamma if the values of beta and alpha can vary?

You can use a regex to match all the name:value pairs and use capture groups to extract the names and values:
var readline = "alpha: 10662 beta: -64 gamma: 70679 delta: 1001";
var matches = Regex.Matches(readline, @"(?<parameter>\w*):\s*(?<value>-?\d*)");
var dictionary = new Dictionary<string,int>();
foreach (Match m in matches)
{
dictionary.Add(m.Groups["parameter"].Value,int.Parse(m.Groups["value"].Value));
}
Console.WriteLine(dictionary["gamma"]); // output: 70679

I would go about doing it a different way than using Substring. First, split on the separators to produce an array of keys/values with the keys in the even positions and the values in the odd positions. Then you can either iterate through the array by 2s, choosing the value associated with the desired key, or, if they are always in the same order, just choose the correct array element to convert.
Apply input validation as needed to make sure you don't have corrupt inputs.
var parameters = line.Split( new [] { ':', ' ' }, StringSplitOptions.RemoveEmptyEntries );
for (var i = 0; i < parameters.Length; i += 2 )
{
var key = parameters[i];
var value = int.Parse( parameters[i+1] );
// do something with the value based on the key
}
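As one way of adding the input validation mentioned above, the split-based loop could be wrapped as follows. This is a sketch only: ParameterLine and Parse are hypothetical names, and silently skipping a pair whose value is not an integer is an assumption about the desired behavior, not something the original answer specifies.

```csharp
using System;
using System.Collections.Generic;

public static class ParameterLine
{
    // Parses "name: value name: value ..." into a dictionary,
    // skipping any pair whose value is not a valid integer.
    public static Dictionary<string, int> Parse(string line)
    {
        var parts = line.Split(new[] { ':', ' ' }, StringSplitOptions.RemoveEmptyEntries);
        var result = new Dictionary<string, int>();
        // i + 1 < Length guards against a trailing key with no value.
        for (int i = 0; i + 1 < parts.Length; i += 2)
        {
            if (int.TryParse(parts[i + 1], out int value))
                result[parts[i]] = value;
        }
        return result;
    }
}
```

The value for a given key is then available through TryGetValue, regardless of how long the earlier values are.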

It seems like a good fit for a regular expression:
var regex = new Regex(@"(\w+)\:\s(-?\d+)");
var values = from pair in regex.Matches("alfa: 10662 beta: -64 gama: 70679 delta: 1001").OfType<Match>()
select new { Key = pair.Groups[1].Value, pair.Groups[2].Value };

I wouldn't use Substring for this; it will be more verbose and error prone.
At its simplest, it looks as though all of your data is separated by whitespace. Is this a fair assumption? Is the order of each variable always the same? If so, then you can simply split on whitespace and grab every other element.
If the data is not always of the same form, then I would use a regular expression instead. You can use something of the form:
/alfa: ([+-]?\d+)/
Which will capture the number after "alfa:" along with its sign, if any. You will need something a bit fancier for floating point values. Honestly, I very rarely use regular expressions, and when I write a non-trivial regex I always use RegexBuddy, so I don't want to write a comprehensive one here for you as it will take me too long =)
EDIT: See, Mark's Regex is much better than mine.

Regex: replace inner string

I'm working with X12 EDI Files (Specifically 835s for those of you in Health Care), and I have a particular vendor who's using a non-HIPAA compliant version (3090, I think). The problem is that in a particular segment (PLB- again, for those who care) they're sending a code which is no longer supported by the HIPAA Standard. I need to locate the specific code, and update it with a corrected code.
I think a Regex would be best for this, but I'm still very new to Regex, and I'm not sure where to begin. My current methodology is to turn the file into an array of strings, find the array that starts with "PLB", break that into an array of strings, find the code, and change it. As you can guess, that's very verbose code for something which should be (I'd think) fairly simple.
Here's a sample of what I'm looking for:
~PLB|1902841224|20100228|49>KC15X078001104|.08~
And here's what I want to change it to:
~PLB|1902841224|20100228|CS>KC15X078001104|.08~
Any suggestions?
UPDATE: After review, I found I hadn't quite defined my question well enough. The record above is an example, but it is not necessarily an exact formatting match: there are three things which could change between this record and some other (in another file) I'd have to fix. They are:
The Pipe (|) could potentially be any non-alpha numeric character. The file itself will define which character (normally a Pipe or Asterisk).
The > could also be any other non-alpha numeric character (most often : or >)
The set of numbers immediately following the PLB is an identifier, and could change in format and length. I've only ever seen numeric Ids there, but technically it could be alphanumeric, and it won't necessarily be 10 characters.
My Plan is to use String.Format() with my Regex match string so that | and > can be replaced with the correct characters.
And for the record. Yes, I hate ANSI X12.

Assuming that the "offending" code is always 49, you can use the following:
resultString = Regex.Replace(subjectString, @"(?<=~PLB\|\d{10}\|\d{8}\|)49(?=>\w+\|)", "CS");
This looks for 49 if it's the first element after a | delimiter, preceded by a group of 8 digits, another |, a group of 10 digits, yet another |, and ~PLB. It also looks if it is followed by >, then any number of alphanumeric characters, and one more |.
With the new requirements (and the lucky coincidence that .NET is one of the few regex flavors that allow variable repetition inside lookbehind), you can change that to:
resultString = Regex.Replace(subjectString, @"(?<=~PLB\1\w+\1\d{8}(\W))49(?=\W\w+\1)", "CS");
Now any non-alphanumeric character is allowed as separator instead of | or > (but in the case of | it has to be always the same one), and the restrictions on the number of characters for the first field have been loosened.
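When the separators are already known (for example, read from the file header), the pattern can also be assembled from them instead of relying on \W. The sketch below rests on a few assumptions: PlbFixer and ReplaceCode are names invented here, the offending code is always 49, the date field is always 8 digits, and the segment terminator directly precedes PLB with no line break in between.

```csharp
using System.Text.RegularExpressions;

public static class PlbFixer
{
    // Replaces the code 49 with CS in PLB segments, using the given
    // element separator, component separator and segment terminator.
    public static string ReplaceCode(string text, char element, char component, char terminator)
    {
        // Regex.Escape keeps separators such as | or * from being
        // interpreted as regex operators.
        string e = Regex.Escape(element.ToString());
        string c = Regex.Escape(component.ToString());
        string t = Regex.Escape(terminator.ToString());
        // {{8}} is an escaped brace pair, producing the quantifier {8}.
        string pattern = $@"(?<={t}PLB{e}\w+{e}\d{{8}}{e})49(?={c})";
        return Regex.Replace(text, pattern, "CS");
    }
}
```

Escaping each separator is the main design point: a plain String.Format with a raw | or * would silently change the meaning of the pattern.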

Another, similar approach that works on any valid X12 file to replace a single data value with another on a matching segment:
public void ReplaceData(string filePath, string segmentName,
    int elementPosition, int componentPosition,
    string oldData, string newData)
{
    string text = File.ReadAllText(filePath);
    Match match = Regex.Match(text,
        @"^ISA(?<e>.).{100}(?<c>.)(?<s>.)(\w+.*?\k<s>)*IEA\k<e>\d*\k<e>\d*\k<s>$");
    if (!match.Success)
        throw new InvalidOperationException("Not an X12 file");
    char elementSeparator = match.Groups["e"].Value[0];
    char componentSeparator = match.Groups["c"].Value[0];
    char segmentTerminator = match.Groups["s"].Value[0];
    var segments = text
        .Split(segmentTerminator)
        .Select(s => s.Split(elementSeparator)
            .Select(e => e.Split(componentSeparator)).ToArray())
        .ToArray();
    foreach (var segment in segments.Where(s => s[0][0] == segmentName &&
        s.Count() > elementPosition &&
        s[elementPosition].Count() > componentPosition &&
        s[elementPosition][componentPosition] == oldData))
    {
        segment[elementPosition][componentPosition] = newData;
    }
    File.WriteAllText(filePath,
        string.Join(segmentTerminator.ToString(), segments
            .Select(e => string.Join(elementSeparator.ToString(),
                e.Select(c => string.Join(componentSeparator.ToString(), c))
                    .ToArray()))
            .ToArray()));
}
The regular expression used validates a proper X12 interchange envelope and assures that all segments within the file contain at least a one character name element. It also parses out the element and component separators as well as the segment terminator.

Assuming that your code is always a two digit number that comes after a pipe character | and before the greater than sign > you can do it like this:
var result = Regex.Replace(yourString, @"(\|)(\d{2})(>)", @"$1CS$3");

Yes, you can break it down with a regex.
If I understand your example correctly, the 2 characters between the | and the > need to be letters and not digits.
~PLB\|\d{10}\|\d{8}\|(\d{2})>\w{14}\|\.\d{2}~
This pattern will match the old one and capture the characters between the | and the >, which you can then use to look up the replacement (in a db or something) and do a replace with the following pattern:
(?<=\|)\d{2}(?=>)

This will look for the ~PLB|#|#| at the start and replace the 2 numbers before the > with CS.
Regex.Replace(testString, @"(?<=~PLB\|[0-9]{10}\|[0-9]{8})(\|)([0-9]{2})(>)", @"$1CS$3")

The X12 protocol standard allows the specification of element and component separators in the header, so anything that hard-codes the "|" and ">" characters could eventually break. Since the standard mandates that the characters used as separators (and segment terminators, e.g., "~") cannot appear within the data (there is no escape sequence to allow them to be embedded), parsing the syntax is very simple. Maybe you're already doing something similar to this, but for readability...
// The original segment string (without segment terminator):
string segment = "PLB|1902841224|20100228|49>KC15X078001104|.08";
// Parse the segment into elements, then the fourth element
// into components (bounds checking is omitted for brevity):
var elements = segment.Split('|');
var components = elements[3].Split('>');
// If the first component is the bad value, replace it with
// the correct value (again, not checking bounds):
if (components[0] == "49")
components[0] = "CS";
// Reassemble the segment by joining the components into
// the fourth element, then the elements back into the
// segment string:
elements[3] = string.Join(">", components);
segment = string.Join("|", elements);
Obviously more verbose than a single regular expression but parsing X12 files is as easy as splitting strings on a single character. Except for the fixed length header (which defines the delimiters), an entire transaction set can be parsed with Split:
// Starting with a string that contains the entire 835 transaction set:
var segments = transactionSet.Split('~');
var segmentElements = segments.Select(s => s.Split('|')).ToArray();
// segmentElements contains an array of element arrays,
// each composite element can be split further into components as shown earlier
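The delimiters themselves can be read from the header rather than hard-coded. The sketch below takes its character positions from the regular expression in the ReplaceData answer above (ISA, one element separator, 100 further characters, then the component separator and the segment terminator), so treat those offsets as an assumption carried over from that answer. X12Delimiters and Read are hypothetical names.

```csharp
using System;

public static class X12Delimiters
{
    // Reads the element separator, component separator and segment
    // terminator from the start of an interchange.
    public static (char Element, char Component, char Segment) Read(string interchange)
    {
        // Offsets assumed from the pattern ^ISA(.).{100}(.)(.):
        // index 3, index 104 and index 105 respectively.
        if (interchange == null || interchange.Length < 106 ||
            !interchange.StartsWith("ISA", StringComparison.Ordinal))
        {
            throw new ArgumentException("Not an X12 interchange", nameof(interchange));
        }
        return (interchange[3], interchange[104], interchange[105]);
    }
}
```

The three characters returned can then be passed to Split in place of the literal '~', '|' and '>' used above.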

What I found is working is the following:
parts = original.Split(record);
for(int i = parts.Length -1; i >= 0; i--)
{
string s = parts[i];
string nString =String.Empty;
if (s.StartsWith("PLB"))
{
string[] elems = s.Split(elem);
if (elems[3].Contains("49" + subelem.ToString()))
{
string regex = string.Format(@"(\{0})49({1})", elem, subelem);
nString = Regex.Replace(s, regex, @"$1CS$2");
}
I'm still having to split my original file into a set of strings and then evaluate each string, but that seems to be working now.
If anyone knows how to get around that string.Split up at the top, I'd love to see a sample.

How can I get a regex match to only be added once to the matches collection?

I have a string which has several html comments in it. I need to count the unique matches of an expression.
For example, the string might be:
var teststring = "<!--X1-->Hi<!--X1-->there<!--X2-->";
I currently use this to get the matches:
var regex = new Regex("<!--X.-->");
var matches = regex.Matches(teststring);
The results of this is 3 matches. However, I would like to have this be only 2 matches since there are only two unique matches.
I know I can probably loop through the resulting MatchCollection and remove the extra Match, but I'm hoping there is a more elegant solution.
Clarification: The sample string is greatly simplified from what is actually being used. There can easily be an X8 or X9, and there are likely dozens of each in the string.

I would just use the Enumerable.Distinct Method for example like this:
string subjectString = "<!--X1-->Hi<!--X1-->there<!--X2--><!--X1-->Hi<!--X1-->there<!--X2-->";
var regex = new Regex(@"<!--X\d-->");
var matches = regex.Matches(subjectString);
var uniqueMatches = matches
.OfType<Match>()
.Select(m => m.Value)
.Distinct();
uniqueMatches.ToList().ForEach(Console.WriteLine);
Outputs this:
<!--X1-->
<!--X2-->
For regular expression, you could maybe use this one?
(<!--X\d-->)(?!.*\1.*)
Seems to work on your test string in RegexBuddy at least =)
// (<!--X\d-->)(?!.*\1.*)
//
// Options: dot matches newline
//
// Match the regular expression below and capture its match into backreference number 1 «(<!--X\d-->)»
// Match the characters “<!--X” literally «<!--X»
// Match a single digit 0..9 «\d»
// Match the characters “-->” literally «-->»
// Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!.*\1.*)»
// Match any single character «.*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// Match the same text as most recently matched by capturing group number 1 «\1»
// Match any single character «.*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»

It appears you're doing two different things:
Matching comments like /<!--X.-->/
Finding the set of unique comments
So it is fairly logical to handle these as two different steps:
var regex = new Regex("<!--X.-->");
var matches = regex.Matches(teststring);
var uniqueMatches = matches.Cast<Match>().Distinct(new MatchComparer());
class MatchComparer : IEqualityComparer<Match>
{
public bool Equals(Match a, Match b)
{
return a.Value == b.Value;
}
public int GetHashCode(Match match)
{
return match.Value.GetHashCode();
}
}

Extract the comments and store them in an array. Then you can filter out the unique values.
But I don’t know how to implement this in C#.

Depending on how many Xn's you have you might be able to use:
(\<!--X1--\>){1}.*(\<!--X2--\>){1}
That will only match each occurrence of the X1, X2 etc. once provided they are in order.

Capture the inner portion of the comment as a group. Then put those strings into a hashtable(dictionary). Then ask the dictionary for its count, since it will self weed out repeats.
var teststring = "<!--X1-->Hi<!--X1-->there<!--X2-->";
var tokens = new Dictionary<string, string>();
// The lazy (.*?) stops at the first "-->", so each comment is matched separately.
Regex.Replace(teststring, @"<!--(.*?)-->",
    match => {
        tokens[match.Groups[1].Value] = match.Groups[1].Value;
        return "";
    });
var uniques = tokens.Keys.Count;
By using the Regex.Replace construct you get to have a lambda called on each match. Since you are not interested in the replace, you don't set it equal to anything.
You must use Groups[1] because Groups[0] is the entire match.
I'm only repeating the same thing on both sides so that it's easier to put into the dictionary, which only stores unique keys.
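Since only the keys matter here, a HashSet<string> expresses the same idea without storing each value twice. This is a sketch under the assumption that comments never contain "-->" inside their body; UniqueComments and Count are names chosen for the example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class UniqueComments
{
    // Counts the distinct inner texts of HTML comments in the input.
    public static int Count(string input)
    {
        // Lazy (.*?) ends each match at the nearest "-->".
        var inner = Regex.Matches(input, @"<!--(.*?)-->")
            .Cast<Match>()
            .Select(m => m.Groups[1].Value);
        // A HashSet keeps one copy of each distinct string.
        return new HashSet<string>(inner).Count;
    }
}
```

For the test string in the question, this counts X1 once even though it appears twice.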

If you want a distinct Match list from a MatchCollection without converting to string, you can use something like this:
var distinctMatches = matchList.OfType<Match>().GroupBy(x => x.Value).Select(x =>x.First()).ToList();
I know it has been 12 years but sometimes we need this kind of solutions, so I wanted to share. C# evolved, .NET evolved, so it's easier now.
