Search file contents for a match with regular expression - c#

I have a regular expression that matches dates in a format like: 26 August 2011
and I'm trying to read each line in a file and capture any line that contains a date in that format. But it does not seem to be working:
Regex test = new Regex(@"^((31(?!\ (Feb(ruary)?|Apr(il)?|June?|(Sep(?=\b|t)t?|Nov)(ember)?)))|((30|29)(?!\ Feb(ruary)?))|(29(?=\ Feb(ruary)?\ (((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00)))))|(0?[1-9])|1\d|2[0-8])\ (Jan(uary)?|Feb(ruary)?|Ma(r(ch)?|y)|Apr(il)?|Ju((ly?)|(ne?))|Aug(ust)?|Oct(ober)?|(Sep(?=\b|t)t?|Nov|Dec)(ember)?)\ ((1[6-9]|[2-9]\d)\d{2})$");
StreamReader file = new StreamReader(outputFile);
while ((line2 = file.ReadLine()) != null)
{
    lines.Add(line2);
    foreach (Match match in test.Matches(line2))
    {
        v += match.Value;
    }
}
Ok, so this is the scenario..
1st - If line contains: "26 August 2011", it returns that date.
2nd - If line contains : " some text etc 26 August 2011", it returns null.
Any idea how this issue can be tackled?

The leading ^ character in your regular expression says, "match starting at the beginning of the line." And the last character is $, meaning that the line has to end with the expression. So if your line contains anything other than a date in the format you specified, the regular expression isn't going to match.
Remove the ^ at the front and the $ at the end.
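To see the effect of the anchors in isolation, here is a minimal sketch. It uses a much simpler stand-in pattern (day, capitalised word, four-digit year) instead of the full date regex from the question, and the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Simplified stand-in for the question's date pattern (an assumption, not the original regex).
    private const string DatePattern = @"\d{1,2} [A-Z][a-z]+ \d{4}";

    // Anchored with ^ and $: the whole line must consist of the date.
    public static bool IsDateOnly(string line)
    {
        return Regex.IsMatch(line, "^" + DatePattern + "$");
    }

    // Unanchored: returns the first date found anywhere in the line, or null if there is none.
    public static string FindDate(string line)
    {
        Match m = Regex.Match(line, DatePattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string line = " some text etc 26 August 2011";
        Console.WriteLine(IsDateOnly(line)); // False
        Console.WriteLine(FindDate(line));   // 26 August 2011
    }
}
```

The anchored check rejects the second scenario from the question, while the unanchored search still returns the date.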

I'm guessing test is defined as Regex test=new Regex("26 August 2011");
Try this
StreamReader file = new StreamReader(outputFile);
while ((line2 = file.ReadLine()) != null)
{
    lines.Add(line2);
    if (test.IsMatch(line2))
    {
        v += line2;
    }
}
That said, you probably want to use a StringBuilder for performance (e.g. v = new StringBuilder()) and call v.Append(line2) instead of v += line2.
--UPDATE
Reading your updated question with the provided regex: if you keep your existing code and just remove the ^ at the beginning of the regex and the $ at the end, it will find all dates within the file regardless of their position in the line, if that is what you are after.

Related

Regex format returns empty result - C#

I have the text line below and I want to extract the date after the ",", i.e.
1 Sep 2015
Allocation/bundle report 10835.0000 Days report step 228, 1 Sep 2015
I wrote the regex code below, and it returns no matches.
Regex regexdate = new Regex(@"\Allocation/bundle\s+\report\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\,\+(\S)+\s+(\S)+\s+(\S)"); // to get dates
MatchCollection matchesdate = regexdate.Matches(text);
Can you advise what's wrong with the regex format I used?
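The \A is an anchor asserting the start of the string; you must have meant a plain A. Likewise, \r is a pattern matching a carriage return, so remove the backslash to turn \r into r. The \+ after the comma matches a literal plus sign where \s+ was intended. Finally, (\S)+ must be turned into (\S+) so that each group captures the whole token rather than only its last character.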
Use
@"Allocation/bundle\s+report\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\,\s+(\S+)\s+(\S+)\s+(\S+)"
See the regex demo
Note that the last part of the regex may be made a bit more specific to match 1+ digits, then some letters and then 4 digits: (\S+)\s+(\S+)\s+(\S+) -> (\d+)\s+(\p{L}+)\s+(\d{4})
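As a rough sketch of how the more specific pattern could be used, the snippet below pulls the three groups out and parses them into a DateTime. The class and method names are hypothetical, and the "d MMM yyyy" format with the invariant culture is an assumption about how the dates in the report are written.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class ReportDate
{
    // Corrected pattern with the more specific tail: digits, letters, four digits.
    private static readonly Regex Pattern = new Regex(
        @"Allocation/bundle\s+report\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+,\s+(\d+)\s+(\p{L}+)\s+(\d{4})");

    // Returns the parsed date, or null when the line does not match or the date is invalid.
    public static DateTime? Extract(string text)
    {
        Match m = Pattern.Match(text);
        if (!m.Success)
        {
            return null;
        }
        string raw = string.Join(" ", m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value);
        DateTime parsed;
        bool ok = DateTime.TryParseExact(raw, "d MMM yyyy", CultureInfo.InvariantCulture,
            DateTimeStyles.None, out parsed);
        return ok ? parsed : (DateTime?)null;
    }

    public static void Main()
    {
        string text = "Allocation/bundle report 10835.0000 Days report step 228, 1 Sep 2015";
        DateTime? date = Extract(text);
        Console.WriteLine(date.HasValue
            ? date.Value.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture)
            : "No date found.");
    }
}
```

Parsing the captured text also rejects strings that fit the pattern but are not real dates, such as "31 Feb 2015".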
Can you do it without Regex? Here's an example using a bit of help from LINQ.
// Needs: using System; using System.Linq; (Last() is a LINQ extension method)
var text = "Allocation/bundle report 10835.0000 Days report step 228, 1 Sep 2015";
var sDate = text.Split(',').Last().Trim();
if (string.IsNullOrEmpty(sDate))
{
    Console.WriteLine("No date found.");
}
else
{
    Console.WriteLine(sDate); // Prints "1 Sep 2015"
}

Regular Expression to trim just the last () from sentence

I am having difficulty to figure out a regular expression.
I have sentences:
"1A11 - Vehicle Engine Control Unit (VECU) (Behind Plate)"
"1A1K5 - Vehicle Rear View (Front View)"
I want to trim the trailing (----) part from each sentence. I have this regular expression to do so: @"\s*\([^)]*\)". The problem is that in my first sentence (VECU) is an abbreviation, so I need to keep it, and this regular expression doesn't work when there are two parentheticals, as in () (). How can I modify my regular expression to trim only the last () from the sentence?
if (!reportMode)
{
    //Look line by line for Title
    stream = GetStream(files);
    List<String> fileContent = new List<String>();
    using (StreamReader sr = new StreamReader(stream))
    {
        String line = "";
        Boolean isInThere = false;
        while (!sr.EndOfStream)
        {
            line = sr.ReadLine();
            if (line.Contains(title))
            {
                //check for exact match
                Int32 index = line.IndexOf(" - ");
                String revisedLine = line.Substring(index + 3).Trim();
                String str = Regex.Replace(revisedLine, @"\s*\([^\)]*\)", "").Trim();
                if (Regex.IsMatch(str, String.Format("^{0}$", title)))
                    isInThere = true;
            }
            fileContent.Add(line);
        }
You could anchor the regexp at the end of the line. This is usually done by adding a '$' sign at the end: "\s*\([^\)]*\)$". If the closing parenthesis is the last character of the string, this should do. Otherwise you can extend the expression to ignore trailing whitespace, for example with \s* before the $.
(Fixed regexp syntax, thanks Patrick)
--
MaxP
In case you need to remove the last parenthetical expression even when it is not at the very end of the string, you may use
Regex rx = new Regex(@"\s*\([^()]*\)(?=[^()]*$)");
String str = rx.Replace(revisedLine, "").Trim();
REGEX:
\s* - 0 or more whitespace symbols
\([^()]*\) - an opening round bracket, then any number of characters other than ( or ), then a closing round bracket
(?=[^()]*$) - a lookahead that checks that there are no ( or ) symbols between the match and the end of the string.
Mind that you do not need to escape the round brackets inside the character classes.
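Here is a small sketch that runs the lookahead version against the two sentences from the question. The class and method names are invented for the example; the sentences are shortened to the part after " - ", as the question's code does.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParenTrim
{
    // Matches a parenthetical only when no other bracket follows it before the end of the string.
    private static readonly Regex LastParenthetical =
        new Regex(@"\s*\([^()]*\)(?=[^()]*$)");

    public static string RemoveLast(string sentence)
    {
        return LastParenthetical.Replace(sentence, "").Trim();
    }

    public static void Main()
    {
        Console.WriteLine(RemoveLast("Vehicle Engine Control Unit (VECU) (Behind Plate)"));
        // Vehicle Engine Control Unit (VECU)
        Console.WriteLine(RemoveLast("Vehicle Rear View (Front View)"));
        // Vehicle Rear View
    }
}
```

Note that this pattern does not handle nested brackets such as "(a (b))"; it assumes the parentheticals are flat, as in the sample data.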

Using RegEx to match Month-Day in C#

Let me preface this by saying I am new to Regex and C# so I am still trying to figure it out. I also realize that Regex is a deep subject that takes time to understand. I have done a little research to figure this out but I don't have the time needed to properly study the art of Regex syntax as I need this program finished tomorrow. (no this is not homework, it is for my job)
I am using c# to search through a text file line by line and I am trying to use a Regex expression to check whether any lines contain any dates of the current month in the format MM-DD. The Regex expression is used within a method that is passed each line of the file.
Here is the method I am currently using:
private bool CheckTransactionDates(string line)
{
    // in the actual code this is dynamically set based on other variables
    string month = "12";
    Regex regExPattern = new Regex(@"\s" + month + @"-\d(0[1-9]|[1-2][0-9]|3[0-1])\s");
    Match match = regExPattern.Match(line);
    return match.Success;
}
Essentially I need it to match only if it is preceded by a space and followed by a space, and only if it is the current month (in this case 12), a hyphen, and a day of the month (" 12-01 " should match but not " 12-99 "). It should always be 2 digits on either side of the hyphen.
This Regex (the only thing I can make match) will work, but also picks up items outside the necessary range:
Regex regExPattern = new Regex(@"\s" + month + @"-\d{2}\s");
I have also tried this without success:
Regex regExPattern = new Regex(@"\s" + month + @"-\d[01-30]{2}\s");
Can anyone tell me what I need to change to get the results I need?
Thanks in advance.
If you just need to find out if the line contains any valid match, something like this will work:
private bool CheckTransactionDates(string line)
{
    // in the actual code this is dynamically set based on other variables
    int month = DateTime.Now.Month;
    int daysInMonth = DateTime.DaysInMonth(DateTime.Today.Year, month);
    Regex pattern = new Regex(string.Format(@"{0:00}-(?<DAY>[0123][0-9])", month));
    int day = 0;
    foreach (Match match in pattern.Matches(line))
    {
        if (int.TryParse(match.Groups["DAY"].Value, out day))
        {
            // day 00 passes the pattern but is not a valid day, so check both bounds
            if (day >= 1 && day <= daysInMonth)
            {
                return true;
            }
        }
    }
    return false;
}
Here's how it works:
You determine the month to search for (here, I use the current month), and the number of days in that month.
Next, the regex pattern is built using a string.Format function that puts the left-zero-padded month, followed by dash, followed by any two digit number 00 to 39 (the [0123] for the first digit, the [0-9] for the second digit). This narrows the regex matches, but not conclusively for a date. The (?<DAY>...) that surrounds it creates a regex group, which will make processing it later easier. Note that I didn't check for a whitespace, in case the line begins with a valid date. You could easily add a space to the pattern, or modify the pattern to your specific needs.
Next, we check all possible matches on that line (pattern.Matches) in a loop.
If a match is found, we then try to parse it as an integer (it should always work, based on the pattern we are matching). We use the DAY group of that match that we defined in the pattern.
After parsing that match into an integer day, we check to see if that day is a valid number for the month specified. If it is, we return true from the function, as we found a valid date.
Finally, if we found no matches, or if none of the matches is valid, we return false from the function (only if we hadn't returned true earlier).
One thing to note is that \s matches any white space character, not just a space:
\s match any white space character [\r\n\t\f ]
A Regex that literally looks for a space, like this one, would not: " (12-\d{2}) ". That said, I agree with the rest of the community on what to do with the matches: you're going to need to go through every match and validate the date with a better approach:
// Needs: using System.Globalization; using System.Text.RegularExpressions;
var input = string.Format(
    " 11-20 2690 E 28.76 12-02 2468 E* 387.85{0}11-15 3610 E 29.34 12-87 2534 E",
    Environment.NewLine);
var pattern = string.Format(@" ({0}-\d{{2}}) ", DateTime.Now.ToString("MM"));
var lines = new List<string>();
foreach (var line in input.Split(new string[] { Environment.NewLine },
    StringSplitOptions.RemoveEmptyEntries))
{
    var m = Regex.Match(line, pattern);
    if (!m.Success)
    {
        continue;
    }
    DateTime dt;
    if (!DateTime.TryParseExact(m.Value.Trim(),
        "MM-dd",
        null,
        DateTimeStyles.None,
        out dt))
    {
        continue;
    }
    lines.Add(line);
}
The reason I went through the lines one at a time is because presumably you need to know what line is good and what line is bad. My logic may not exactly match what you need but you can easily modify it.
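For comparison, here is a pattern-only sketch that is close to the first attempt in the question, with the stray \d before the day group removed. The lookarounds (?<!\S) and (?!\S) stand in for the surrounding \s so that a date at the very start or end of the line also counts; that choice, and the class and method names, are assumptions. Unlike the approaches above, it knows nothing about month lengths, so 02-31 would still pass.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MonthDayCheck
{
    // True when the line contains "MM-DD" for the given month with a day from 01 to 31,
    // delimited by whitespace or by the start/end of the line.
    public static bool ContainsDayOfMonth(string line, int month)
    {
        string pattern = @"(?<!\S)" + month.ToString("00")
            + @"-(0[1-9]|[12][0-9]|3[01])(?!\S)";
        return Regex.IsMatch(line, pattern);
    }

    public static void Main()
    {
        Console.WriteLine(ContainsDayOfMonth(" 12-01 ", 12)); // True
        Console.WriteLine(ContainsDayOfMonth(" 12-99 ", 12)); // False
    }
}
```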

How to find repeatable characters

I can't understand how to solve the following problem:
I have input string "aaaabaa" and I'm trying to search for string "aa" (I'm looking for positions of characters)
Expected result is the positions 0 1 2 5, with each match marked in brackets:
0: [aa]aabaa
1: a[aa]abaa
2: aa[aa]baa
5: aaaab[aa]
I have already solved this problem using another (non-RegEx) approach, but I need a RegEx.
I'm new to RegEx, so searching Google hasn't really helped me.
Any help appreciated! Thanks!
P.S.
I've tried to use (aa)* and "\b(\w+(aa))*\w+" but those expressions are wrong
You can solve this by using a lookahead
a(?=a)
will find every "a" that is followed by another "a".
If you want to do this more generally
(\p{L})(?=\1)
This will find every letter that is followed by the same letter. Each letter found is stored in a capturing group (because of the brackets around it), and this capturing group is then reused by the positive lookahead assertion (the (?=...)) through \1, which holds the character that was just matched.
\p{L} matches any Unicode code point in the category "letter".
Code
String text = "aaaabaa";
Regex reg = new Regex(@"(\p{L})(?=\1)");
MatchCollection result = reg.Matches(text);
foreach (Match item in result)
{
    Console.WriteLine(item.Index);
}
Output
0
1
2
5
The following code should work with any regular expression without having to change the actual expression:
Regex rx = new Regex(@"(a)\1"); // or any other word you're looking for.
int position = 0;
string text = "aaaaabbbbccccaaa";
int textLength = text.Length;
Match m = rx.Match(text, position);
while (m != null && m.Success)
{
    Console.WriteLine(m.Index);
    if (m.Index <= textLength)
    {
        m = rx.Match(text, m.Index + 1);
    }
    else
    {
        m = null;
    }
}
Console.ReadKey();
It uses the option to change the start index of a regex search for each consecutive search. The actual problem comes from the fact that the Regex engine, by default, will always continue searching after the previous match. So it will never find a possible match within another match, unless you instruct it to by using a Look ahead construction or by manually setting the start index.
Another, relatively easy, solution is to just wrap the whole expression in a lookahead:
string expression = @"(a)\1";
Regex rx2 = new Regex("(?=" + expression + ")");
MatchCollection ms = rx2.Matches(text);
var indexes = ms.Cast<Match>().Select(match => match.Index); // needs using System.Linq;
That way the engine will automatically advance the index by one for every match it finds.
From the docs:
When a match attempt is repeated by calling the NextMatch method, the regular expression engine gives empty matches special treatment. Usually, NextMatch begins the search for the next match exactly where the previous match left off. However, after an empty match, the NextMatch method advances by one character before trying the next match. This behavior guarantees that the regular expression engine will progress through the string. Otherwise, because an empty match does not result in any forward movement, the next match would start in exactly the same place as the previous match, and it would match the same empty string repeatedly.
Try this:
How can I find repeated characters with a regex in Java?
It is in Java, but both the regex and the non-regex approaches are there. C# regex syntax is very similar to Java's.

C# - Removing a Line that matches a Regex

I have some data.. it looks similar to this:
0423 222222 ADH, TEXTEXT
0424 1234 ADH,MORE TEXT
0425 98765 ADH, TEXT 3609
2000 98765-4 LBL,IUC,PCA,S/N
0010 99999-27 LBL,IUI,1.0x.25
9000 12345678 HERE IS MORE, TEXT
9010 123-123 SOMEMORE,TEXT1231
9100 SD178 YAYFOR, TEXT01
9999 90123 HEY:HOW-TO DOTHIS
And I would like to remove each entire line that begins with 9xxx. So far I have tried replacing the value using Regex. Here is what I have for that:
output = Regex.Replace(output, @"^9[\d]{3}\s+[\d*\-*\w*]+\s+[\d*\w*\-*\,*\:*\;*\.*\d*\w*]+", "");
However, this is really hard to read and it does not actually delete the entire line.
CODE:
Here is the section of the code I am using:
try
{
    // Resets the formattedTextRichTextBox so multiple files aren't loaded on top of each other.
    formattedTextRichTextBox.ResetText();
    foreach (string line in File.ReadAllLines(openFile.FileName))
    {
        // Uses regular expressions to find a line that has, digit(s), space(s), digit(s) + letter(s),
        // space(s), digit(s), space(s), any character (up to 25 times).
        Match theMatch = Regex.Match(line, @"^[\.*\d]+\s+[\d\w]+\s+[\d\-\w*]+\s+.{25}");
        if (theMatch.Success)
        {
            // Stores the matched value in string output.
            string output = theMatch.Value;
            // Replaces the text with the required layout.
            output = Regex.Replace(output, @"^[\.*\d]+\s+", "");
            //output = Regex.Replace(output, @"^9[\d]{3}\s+[\d*\-*\w*]+\s+[\d*\w*\-*\,*\:*\;*\.*\d*\w*]+", "");
            output = Regex.Replace(output, @"\s+", " ");
            // Sets the formattedTextRichTextBox to the string output.
            formattedTextRichTextBox.AppendText(output);
            formattedTextRichTextBox.AppendText("\n");
        }
    }
}
OUTCOME:
So what I would like the new data to look like is in this format (removed 9xxx):
0423 222222 ADH, TEXTEXT
0424 1234 ADH,MORE TEXT
0425 98765 ADH, TEXT 3609
2000 98765-4 LBL,IUC,PCA,S/N
0010 99999-27 LBL,IUI,1.0x.25
QUESTIONS:
Is there an easier way to go about this?
If so, can I use regex to go about this or must I use a different way?
Just reformulate the regex that tests your format to match everything that doesn't begin with 9 - that way lines starting with 9 are not added to the rich text box.
Try this (uses LINQ):
//Create a regex to identify lines that start with 9XXX
Regex rgx = new Regex(@"^9\d{3}");
//Below is the LINQ expression that filters out the lines that start with 9XXX
var validLines =
(
    //The following line specifies which enumeration to pick the data from
    from ln in File.ReadAllLines(openFile.FileName)
    //The following line specifies the filter applied to select the data
    where !rgx.IsMatch(ln)
    //The following line specifies what to select from the filtered data
    select ln
).ToArray(); //Converts the IEnumerable<string> to a string array (ln in the expression above is a String)
//Finally, join the filtered entries with "\n" using String.Join and append the result to the textbox
formattedTextRichTextBox.AppendText(String.Join("\n", validLines));
Yes, there is a simpler way. Just use the Regex.Replace method and provide the Multiline option.
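A minimal sketch of that idea, assuming the whole file content is held in one string. The class and method names are made up, and the \b after the four digits is an extra guard (not part of the question) so that a five-digit number starting with 9 is left alone.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineFilter
{
    // With RegexOptions.Multiline, ^ matches at the start of every line.
    // ".*" consumes the rest of the line (including a trailing \r) and "\n?" the line break.
    public static string RemoveNineThousandLines(string text)
    {
        return Regex.Replace(text, @"^9\d{3}\b.*\n?", "", RegexOptions.Multiline);
    }

    public static void Main()
    {
        string data =
            "0423 222222 ADH, TEXTEXT\n" +
            "9000 12345678 HERE IS MORE, TEXT\n" +
            "2000 98765-4 LBL,IUC,PCA,S/N\n" +
            "9999 90123 HEY:HOW-TO DOTHIS";
        Console.Write(RemoveNineThousandLines(data));
    }
}
```

Because the line break is part of the match, removed lines do not leave empty lines behind.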
Why don't you just match the leading 9xxx part and then use a wildcard to match the rest of the line? It would be a lot more readable.
output = Regex.Replace(output, @"^9\d{3}.*", "");
