Using RegEx to match Month-Day in C# - c#

Let me preface this by saying I am new to Regex and C# so I am still trying to figure it out. I also realize that Regex is a deep subject that takes time to understand. I have done a little research to figure this out but I don't have the time needed to properly study the art of Regex syntax as I need this program finished tomorrow. (no this is not homework, it is for my job)
I am using c# to search through a text file line by line and I am trying to use a Regex expression to check whether any lines contain any dates of the current month in the format MM-DD. The Regex expression is used within a method that is passed each line of the file.
Here is the method I am currently using:
private bool CheckTransactionDates(string line)
{
// in the actual code this is dynamically set based on other variables
string month = "12";
Regex regExPattern = new Regex(@"\s" + month + @"-\d(0[1-9]|[1-2][0-9]|3[0-1])\s");
Match match = regExPattern.Match(line);
return match.Success;
}
Essentially, I need it to match only when the date is preceded by a space and followed by a space, and only when it consists of the current month (in this case 12), a hyphen, and a valid day of the month ( " 12-01 " should match but " 12-99 " should not). There should always be 2 digits on either side of the hyphen.
This Regex (The only thing I can make match) will work, but also picks up items outside the necessary range:
Regex regExPattern = new Regex(@"\s" + month + @"-\d{2}\s");
I have also tried this without success:
Regex regExPattern = new Regex(@"\s" + month + @"-\d[01-30]{2}\s");
Can anyone tell me what I need to change to get the results I need?
Thanks in advance.

If you just need to find out if the line contains any valid match, something like this will work:
private bool CheckTransactionDates(string line)
{
// in the actual code this is dynamically set based on other variables
int month = DateTime.Now.Month;
// use the same month for the day count so the two values cannot disagree
int daysInMonth = DateTime.DaysInMonth(DateTime.Today.Year, month);
Regex pattern = new Regex(string.Format(@"{0:00}-(?<DAY>[0123][0-9])", month));
int day = 0;
foreach (Match match in pattern.Matches(line))
{
if (int.TryParse(match.Groups["DAY"].Value, out day))
{
// the pattern also admits 00, so check the lower bound as well
if (day >= 1 && day <= daysInMonth)
{
return true;
}
}
}
return false;
}
Here's how it works:
You determine the month to search for (here, I use the current month), and the number of days in that month.
Next, the regex pattern is built using a string.Format function that puts the left-zero-padded month, followed by dash, followed by any two digit number 00 to 39 (the [0123] for the first digit, the [0-9] for the second digit). This narrows the regex matches, but not conclusively for a date. The (?<DAY>...) that surrounds it creates a regex group, which will make processing it later easier. Note that I didn't check for a whitespace, in case the line begins with a valid date. You could easily add a space to the pattern, or modify the pattern to your specific needs.
Next, we check all possible matches on that line (pattern.Matches) in a loop.
If a match is found, we then try to parse it as an integer (it should always work, based on the pattern we are matching). We use the DAY group of that match that we defined in the pattern.
After parsing that match into an integer day, we check to see if that day is a valid number for the month specified. If it is, we return true from the function, as we found a valid date.
Finally, if we found no matches, or if none of the matches is valid, we return false from the function (only if we hadn't returned true earlier).
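If the day range should be enforced by the pattern itself, the attempt in the question is close: the stray \d in front of the group is what makes it demand three digits. The following is only a minimal sketch under some assumptions. The names DateMatcher and ContainsDateInMonth are made up for illustration, lookarounds replace the literal \s so that two adjacent dates can share one space, and the pattern accepts 01 to 31 for every month (so 02-31 would still pass).

```csharp
using System;
using System.Text.RegularExpressions;

public static class DateMatcher
{
    // Hypothetical helper: true when the line contains MM-DD for the given month,
    // surrounded by whitespace, with DD limited to 01-31 by the pattern.
    public static bool ContainsDateInMonth(string line, int month)
    {
        // (?<=\s) and (?=\s) require whitespace on both sides without consuming it.
        string pattern = string.Format(
            @"(?<=\s){0:00}-(0[1-9]|[12][0-9]|3[01])(?=\s)", month);
        return Regex.IsMatch(line, pattern);
    }

    public static void Main()
    {
        Console.WriteLine(ContainsDateInMonth(" 12-01 ", 12)); // True
        Console.WriteLine(ContainsDateInMonth(" 12-99 ", 12)); // False
        Console.WriteLine(ContainsDateInMonth(" 12-00 ", 12)); // False
    }
}
```

If the exact number of days in the month matters, the day still has to be checked in code, as in the answer above.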

One thing to note is that \s matches any white space character, not just a space:
\s match any white space character [\r\n\t\f ]
A Regex that literally looks for a space, such as " (12-\d{2}) ", matches only the space character. That said, I have to agree with the rest of the community on what to do with the matches: you need to go through every match and validate the date with a more robust approach:
var input = string.Format(
" 11-20 2690 E 28.76 12-02 2468 E* 387.85{0}11-15 3610 E 29.34 12-87 2534 E",
Environment.NewLine);
var pattern = string.Format(@" ({0}-\d{{2}}) ", DateTime.Now.ToString("MM"));
var lines = new List<string>();
foreach (var line in input.Split(new string[] { Environment.NewLine },
StringSplitOptions.RemoveEmptyEntries))
{
var m = Regex.Match(line, pattern);
if (!m.Success)
{
continue;
}
DateTime dt;
if (!DateTime.TryParseExact(m.Value.Trim(),
"MM-dd",
null,
DateTimeStyles.None,
out dt))
{
continue;
}
lines.Add(line);
}
The reason I went through the lines one at a time is because presumably you need to know what line is good and what line is bad. My logic may not exactly match what you need but you can easily modify it.

Related

Create a Regex pattern to validate this example expression (.1);

I am looking for some help in validating that this string is valid. I need a regex pattern that will catch any letters within the set of parenthesis. I also need to make sure there is a semi-colon at the end of the parentheses. Any ideas? My regex is absolutely terrible......
This is what I want to match:
Total Hours Worked (.5);
Total Hours Worked (.A);
Total Hours Worked (A);
First result should be false while the last 2 should be true.
This is what I have tried:
Match validateLettersAndSemiColon = Regex.Match(StringToMatch, "[a-z]);");
This is just an example using as input the following 3 strings:
Total Hours Worked (.5);
Total Hours Worked (.A);
Total Hours Worked (A);
I am not considering any nested inner parenthesis only that the possible combinations inside the parenthesis are letters and dot.
Here is a simple example:
string[] data = new string[] { "Total Hours Worked (.5);", "Total Hours Worked (.A);", "Total Hours Worked (A);" };
foreach (string input in data)
{
Console.WriteLine("Result for:" + input);
Match match = Regex.Match(input, @"\([a-z.]+\);$", RegexOptions.IgnoreCase);
if (match.Success)
{
Console.WriteLine("YES");
}
else
{
Console.WriteLine("NO");
}
}
In @"\([a-z.]+\);$", the \ before each parenthesis escapes it so that it is matched as a literal parenthesis. The [a-z.]+ means we want to match one or more letters or dots; you could restrict it further, but this should give you an idea. The $ at the end means the string must end with );
If you want to limit it to a single dot right after the opening parenthesis, you may use the regex below instead; it makes the dot a single optional character right after the (
@"\(\.?[a-z]+\);$"
The result of the above would be:
Total Hours Worked (.5);
NO
Total Hours Worked (.A);
YES
Total Hours Worked (A);
YES
Your regex is /\([^)]+\);/ or /\(.+?\)/ if you don't have nested parenthesis. It works even if you have two or more of these parenthesis group in the same line.
If you have nested parenthesis use /\(.+\);/, but this will not work if you have two or more parenthesis group in the same line.
In the end, if you have a string like:
(aba(cc);a);eeee(dd(e););
then it can be pretty hard to handle with a single regex.
Edit 1
If your parenthesis group you want to validate takes the whole string, you can use a ^ to signal the beginning of the string and a $ for the end. Thus the regex becomes
/^\([^)]+\);$/
Try following regex:
\([^0-9]+\)\s*;
This will match any characters within parenthesis except digits.
I would recommend to put \s* between ) and ; to allow space as in most of the programming language.
Try this;
string[] inputstrings = new string[] { "Total Hours Worked (.5);", "Total Hours Worked (.A);", "Total Hours Worked (A);" };//Collection of inputs.
Regex rgx = new Regex(@"\(\.?(?<StringValue>[a-zA-Z]*)\)\;{1}");//Regular expression to find all matches.
foreach (string input in inputstrings)//Iterate through each string in collection.
{
Match match = rgx.Match(input);
if (match.Success)//If a match is found.
{
string value = match.Groups[1].Value;//Capture first named group.
Console.WriteLine(value);//Display captured substring.
}
else//If nothing is found.
{
Console.WriteLine("A match was not found.");
}
}

Get sub-strings from a string that are enclosed using some specified character

Suppose I have a string
Likes (20)
I want to fetch the sub-string enclosed in round brackets (in above case its 20) from this string. This sub-string can change dynamically at runtime. It might be any other number from 0 to infinity. To achieve this my idea is to use a for loop that traverses the whole string and then when a ( is present, it starts adding the characters to another character array and when ) is encountered, it stops adding the characters and returns the array. But I think this might have poor performance. I know very little about regular expressions, so is there a regular expression solution available or any function that can do that in an efficient way?
If you don't fancy using regex you could use Split:
string foo = "Likes (20)";
string[] arr = foo.Split(new char[]{ '(', ')' }, StringSplitOptions.None);
string count = arr[1];
Count = 20
This will work fine regardless of the number in the brackets ()
e.g:
Likes (242535345)
Will give:
242535345
Works also with pure string methods:
string result = "Likes (20)";
int index = result.IndexOf('(');
if (index >= 0)
{
result = result.Substring(index + 1); // take part behind (
index = result.IndexOf(')');
if (index >= 0)
result = result.Remove(index); // remove part from )
}
For a strict matching, you can do:
Regex reg = new Regex(@"^Likes \((\d+)\)$");
Match m = reg.Match(yourstring);
this way you'll have all you need in m.Groups[1].Value.
As suggested by I4V, assuming you have only that one sequence of digits in the whole string, as in your example, you can use the simpler version:
var res = Regex.Match(str, @"\d+");
and in this case, you can get the value you are looking for with res.Value
EDIT
In case the value enclosed in brackets is not just numbers, you can just change the \d with something like [\w\d\s] if you want to allow in there alphabetic characters, digits and spaces.
Even with Linq:
var s = "Likes (20)";
var s1 = new string(s.SkipWhile(x => x != '(').Skip(1).TakeWhile(x => x != ')').ToArray());
const string likes = "Likes (20)";
int open = likes.IndexOf('(');
int close = likes.IndexOf(')', open + 1);
int likesCount = int.Parse(likes.Substring(open + 1, close - open - 1)); // length = distance between the brackets
Matching when the part in parentheses is supposed to be a number:
string inputstring = "Likes (20)";
Regex reg = new Regex(@"\((\d+)\)");
string num = reg.Match(inputstring).Groups[1].Value;
Explanation:
By definition, a regexp matches a substring, so unless you indicate otherwise, the string you are looking for can occur anywhere in your input.
\d stands for a digit. It will match any single digit.
We want it to potentially be repeated several times, and we want at least one. The + sign is regexp for "the previous symbol or group repeated 1 or more times".
So \d+ will match one or more digits. It will match 20.
To ensure that we get the number that is in parentheses, we say that it should be between ( and ). These are special characters in regexp, so we need to escape them.
\(\d+\) would match (20), and we are almost there.
Since we want the part inside the parentheses, not including the parentheses themselves, we tell regexp that the digits part is a single group.
We do that by using parentheses in our regexp. \((\d+)\) will still match (20), but now it will note that 20 is a subgroup of this match, and we can fetch it with Match.Groups[].
For any string in parentheses, things get a little bit harder.
Regex reg = new Regex(@"\((.+)\)");
This would work for many strings (the dot matches any character). But if the input is something like "This is an example(parantesis1)(parantesis2)", you would match (parantesis1)(parantesis2) with parantesis1)(parantesis2 as the captured subgroup. This is unlikely to be what you are after.
The solution can be to match "any character except a closing parenthesis":
Regex reg = new Regex(@"\(([^)]+)\)");
This will find (parantesis1) as the first match, with parantesis1 as .Groups[1].
It will still fail for nested parentheses, but since regular expressions are not the correct tool for nested parentheses, I feel that this case is a bit out of scope.
If you know that the string always starts with "Likes " before the group then Saves solution is better.
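To collect every bracketed value rather than only the first, the "anything except a closing parenthesis" idea can be combined with Regex.Matches. This is a rough sketch only; ParenExtractor and Extract are illustrative names, and nested parentheses are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParenExtractor
{
    // Hypothetical helper: returns the text inside each non-nested (...) group.
    public static List<string> Extract(string input)
    {
        var results = new List<string>();
        // [^)]+ stops at the first closing parenthesis,
        // so adjacent groups are reported separately.
        foreach (Match m in Regex.Matches(input, @"\(([^)]+)\)"))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }

    public static void Main()
    {
        foreach (string s in Extract("Likes (20) Shares (7)"))
        {
            Console.WriteLine(s); // prints 20, then 7
        }
    }
}
```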

Simple regex failing test to determine a 3 digit number

I have had a difficult time wrapping my head around regular expressions. In the following code, I used a Regex to determine if the data passed was a 1 to 3 digit number. The expression worked if the data started with a number (ex. "200"), but also passed if the data had a letter after the first digit (ex. "3A5"). I managed to handle the error with the Int32.TryParse() method, but it seems there should be an easier way.
if (LSK == MainWindow.LSK6R)
{
int ci;
int length = SP_Command.Length;
if (length > 3) return MainWindow.ENTRY_OUT_OF_RANGE; //Cannot be greater than 999
String pattern = @"[0-9]{1,3}"; //RegEx pattern for 1 to 3 digit number
if (Regex.IsMatch(SP_Command, pattern)) //Does not check for ^A-Z. See below.
{
bool test = Int32.TryParse(SP_Command, out ci); //Trying to parse A-Z. Only if
if (test) //it no letter will it succeed
{
FlightPlan.CostIndex = ci; //Update the flightplan CI
CI.Text = ci.ToString(); //Update the Init page
}
else return MainWindow.FORMAT_ERROR; //It contained a letter
}
else return MainWindow.FORMAT_ERROR; //It didn't fit the RegEx
}
Regex.IsMatch searches the input string for the pattern (and thus returns true for 3A5 because it finds 3).
You should also include start (^) and end ($) of string:
String pattern = @"^[0-9]{1,3}$";
Adding line begin/end should help.
^[0-9]{1,3}$
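A small sketch of the difference the anchors make, using the inputs from the question. CostIndexCheck and its two methods are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CostIndexCheck
{
    // Unanchored: succeeds if 1 to 3 digits occur anywhere in the input.
    public static bool ContainsDigits(string input)
    {
        return Regex.IsMatch(input, @"[0-9]{1,3}");
    }

    // Anchored: succeeds only if the whole input is 1 to 3 digits.
    public static bool IsOneToThreeDigits(string input)
    {
        return Regex.IsMatch(input, @"^[0-9]{1,3}$");
    }

    public static void Main()
    {
        Console.WriteLine(ContainsDigits("3A5"));      // True (finds "3")
        Console.WriteLine(IsOneToThreeDigits("3A5"));  // False
        Console.WriteLine(IsOneToThreeDigits("200"));  // True
        Console.WriteLine(IsOneToThreeDigits("1234")); // False
    }
}
```

With the anchored pattern, the separate length check and the TryParse failure branch in the question become redundant for validation, although TryParse is still a reasonable way to obtain the integer.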

How to find repeatable characters

I can't understand how to solve the following problem:
I have input string "aaaabaa" and I'm trying to search for string "aa" (I'm looking for positions of characters)
Expected result is
0 1 2 5
aa aabaa
a aa abaa
aa aa baa
aaaab aa
This problem is already solved by me using another approach (non-RegEx).
But I need a RegEx I'm new to RegEx so google-search can't help me really.
Any help appreciated! Thanks!
P.S.
I've tried to use (aa)* and "\b(\w+(aa))*\w+" but those expressions are wrong
You can solve this by using a lookahead
a(?=a)
will find every "a" that is followed by another "a".
If you want to do this more generally
(\p{L})(?=\1)
This will find every character that is followed by the same character. Every found letter is stored in a capturing group (because of the parentheses around it); this capturing group is then reused by the positive lookahead assertion (the (?=...)) via \1, which holds the matched character.
\p{L} is a unicode code point with the category "letter"
Code
String text = "aaaabaa";
Regex reg = new Regex(@"(\p{L})(?=\1)");
MatchCollection result = reg.Matches(text);
foreach (Match item in result) {
Console.WriteLine(item.Index);
}
Output
0
1
2
5
The following code should work with any regular expression without having to change the actual expression:
Regex rx = new Regex(@"(a)\1"); // or any other word you're looking for.
int position = 0;
string text = "aaaaabbbbccccaaa";
int textLength = text.Length;
Match m = rx.Match(text, position);
while (m != null && m.Success)
{
Console.WriteLine(m.Index);
if (m.Index <= textLength)
{
m = rx.Match(text, m.Index + 1);
}
else
{
m = null;
}
}
Console.ReadKey();
It uses the option to change the start index of a regex search for each consecutive search. The actual problem comes from the fact that the Regex engine, by default, will always continue searching after the previous match. So it will never find a possible match within another match, unless you instruct it to by using a Look ahead construction or by manually setting the start index.
Another, relatively easy, solution is to just stick the whole expression in a forward look ahead:
string expression = @"(a)\1";
Regex rx2 = new Regex("(?=" + expression + ")");
MatchCollection ms = rx2.Matches(text);
var indexes = ms.Cast<Match>().Select(match => match.Index);
That way the engine will automatically advance the index by one for every match it finds.
From the docs:
When a match attempt is repeated by calling the NextMatch method, the regular expression engine gives empty matches special treatment. Usually, NextMatch begins the search for the next match exactly where the previous match left off. However, after an empty match, the NextMatch method advances by one character before trying the next match. This behavior guarantees that the regular expression engine will progress through the string. Otherwise, because an empty match does not result in any forward movement, the next match would start in exactly the same place as the previous match, and it would match the same empty string repeatedly.
Try this:
How can I find repeated characters with a regex in Java?
It is in java, but the regex and non-regex way is there. C# Regex is very similar to the Java way.

Regex: replace inner string

I'm working with X12 EDI Files (Specifically 835s for those of you in Health Care), and I have a particular vendor who's using a non-HIPAA compliant version (3090, I think). The problem is that in a particular segment (PLB- again, for those who care) they're sending a code which is no longer supported by the HIPAA Standard. I need to locate the specific code, and update it with a corrected code.
I think a Regex would be best for this, but I'm still very new to Regex, and I'm not sure where to begin. My current methodology is to turn the file into an array of strings, find the array that starts with "PLB", break that into an array of strings, find the code, and change it. As you can guess, that's very verbose code for something which should be (I'd think) fairly simple.
Here's a sample of what I'm looking for:
~PLB|1902841224|20100228|49>KC15X078001104|.08~
And here's what I want to change it to:
~PLB|1902841224|20100228|CS>KC15X078001104|.08~
Any suggestions?
UPDATE: After review, I found I hadn't quite defined my question well enough. The record above is an example, but it is not necessarily a specific formatting match. There are three things which could change between this record and some other (in another file) I'd have to fix. They are:
The Pipe (|) could potentially be any non-alpha numeric character. The file itself will define which character (normally a Pipe or Asterisk).
The > could also be any other non-alpha numeric character (most often : or >)
The set of numbers immediately following the PLB is an identifier, and could change in format and length. I've only ever seen numeric Ids there, but technically it could be alpha numeric, and it won't necessarily be 10 characters.
My Plan is to use String.Format() with my Regex match string so that | and > can be replaced with the correct characters.
And for the record. Yes, I hate ANSI X12.
Assuming that the "offending" code is always 49, you can use the following:
resultString = Regex.Replace(subjectString, @"(?<=~PLB\|\d{10}\|\d{8}\|)49(?=>\w+\|)", "CS");
This looks for 49 if it's the first element after a | delimiter, preceded by a group of 8 digits, another |, a group of 10 digits, yet another |, and ~PLB. It also looks if it is followed by >, then any number of alphanumeric characters, and one more |.
With the new requirements (and the lucky coincidence that .NET is one of the few regex flavors that allow variable repetition inside lookbehind), you can change that to:
resultString = Regex.Replace(subjectString, @"(?<=~PLB\1\w+\1\d{8}(\W))49(?=\W\w+\1)", "CS");
Now any non-alphanumeric character is allowed as separator instead of | or > (but in the case of | it has to be always the same one), and the restrictions on the number of characters for the first field have been loosened.
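The backreference \1 appearing before its group looks odd, but .NET evaluates a lookbehind from right to left, so the (\W) next to 49 is captured first and the earlier \1 references are then compared against it. A minimal sketch that wraps the pattern from the answer; PlbFixer and Fix are illustrative names, and the second sample segment (asterisk and colon separators, shorter identifier) is invented for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlbFixer
{
    // Pattern from the answer above. The lookbehind is matched right to left,
    // so group 1 (the element separator) is captured before \1 is tested.
    private const string Pattern = @"(?<=~PLB\1\w+\1\d{8}(\W))49(?=\W\w+\1)";

    // Hypothetical helper: replaces the obsolete code 49 with CS in a PLB segment.
    public static string Fix(string segment)
    {
        return Regex.Replace(segment, Pattern, "CS");
    }

    public static void Main()
    {
        // Sample from the question (pipe and > separators).
        Console.WriteLine(Fix("~PLB|1902841224|20100228|49>KC15X078001104|.08~"));
        // Invented sample with * and : separators and a shorter identifier.
        Console.WriteLine(Fix("~PLB*AB12*20100228*49:KC15X078001104*.08~"));
    }
}
```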
Another, similar approach that works on any valid X12 file to replace a single data value with another on a matching segment:
public void ReplaceData(string filePath, string segmentName,
int elementPosition, int componentPosition,
string oldData, string newData)
{
string text = File.ReadAllText(filePath);
Match match = Regex.Match(text,
@"^ISA(?<e>.).{100}(?<c>.)(?<s>.)(\w+.*?\k<s>)*IEA\k<e>\d*\k<e>\d*\k<s>$");
if (!match.Success)
throw new InvalidOperationException("Not an X12 file");
char elementSeparator = match.Groups["e"].Value[0];
char componentSeparator = match.Groups["c"].Value[0];
char segmentTerminator = match.Groups["s"].Value[0];
var segments = text
.Split(segmentTerminator)
.Select(s => s.Split(elementSeparator)
.Select(e => e.Split(componentSeparator)).ToArray())
.ToArray();
foreach (var segment in segments.Where(s => s[0][0] == segmentName &&
s.Count() > elementPosition &&
s[elementPosition].Count() > componentPosition &&
s[elementPosition][componentPosition] == oldData))
{
segment[elementPosition][componentPosition] = newData;
}
File.WriteAllText(filePath,
string.Join(segmentTerminator.ToString(), segments
.Select(e => string.Join(elementSeparator.ToString(),
e.Select(c => string.Join(componentSeparator.ToString(), c))
.ToArray()))
.ToArray()));
}
The regular expression used validates a proper X12 interchange envelope and assures that all segments within the file contain at least a one character name element. It also parses out the element and component separators as well as the segment terminator.
Assuming that your code is always a two digit number that comes after a pipe character | and before the greater than sign > you can do it like this:
var result = Regex.Replace(yourString, @"(\|)(\d{2})(>)", @"$1CS$3");
You can break it down with regex yes.
If I understand your example correctly, the 2 characters between the | and the > need to be letters and not digits.
~PLB\|\d{10}\|\d{8}\|(\d{2})>\w{14}\|\.\d{2}~
This pattern will match the old one and capture the characters between the | and the >. Which you can then use to modify (lookup in a db or something) and do a replace with the following pattern:
(?<=\|)\d{2}(?=>)
This will look for the ~PLB|#|#| at the start and replace the 2 numbers before the > with CS.
Regex.Replace(testString, @"(?<=~PLB\|[0-9]{10}\|[0-9]{8})(\|)([0-9]{2})(>)", @"$1CS$3")
The X12 protocol standard allows the specification of element and component separators in the header, so anything that hard-codes the "|" and ">" characters could eventually break. Since the standard mandates that the characters used as separators (and segment terminators, e.g., "~") cannot appear within the data (there is no escape sequence to allow them to be embedded), parsing the syntax is very simple. Maybe you're already doing something similar to this, but for readability...
// The original segment string (without segment terminator):
string segment = "PLB|1902841224|20100228|49>KC15X078001104|.08";
// Parse the segment into elements, then the fourth element
// into components (bounds checking is omitted for brevity):
var elements = segment.Split('|');
var components = elements[3].Split('>');
// If the first component is the bad value, replace it with
// the correct value (again, not checking bounds):
if (components[0] == "49")
components[0] = "CS";
// Reassemble the segment by joining the components into
// the fourth element, then the elements back into the
// segment string:
elements[3] = string.Join(">", components);
segment = string.Join("|", elements);
Obviously more verbose than a single regular expression but parsing X12 files is as easy as splitting strings on a single character. Except for the fixed length header (which defines the delimiters), an entire transaction set can be parsed with Split:
// Starting with a string that contains the entire 835 transaction set:
var segments = transactionSet.Split('~');
var segmentElements = segments.Select(s => s.Split('|')).ToArray();
// segmentElements contains an array of element arrays,
// each composite element can be split further into components as shown earlier
What I found is working is the following:
parts = original.Split(record);
for(int i = parts.Length -1; i >= 0; i--)
{
string s = parts[i];
string nString =String.Empty;
if (s.StartsWith("PLB"))
{
string[] elems = s.Split(elem);
if (elems[3].Contains("49" + subelem.ToString()))
{
string regex = string.Format(@"(\{0})49({1})", elem, subelem);
nString = Regex.Replace(s, regex, #"$1CS$2");
}
I'm still having to split my original file into a set of strings and then evaluate each string, but that seems to be working now.
If anyone knows how to get around that string.Split up at the top, I'd love to see a sample.
