how to handle new lines in Regular expressions - c#

I have an application written in C# that reads codes from a text file.
The codes generally follow the same pattern each time, for example:
QUES10100
From what I have written so far, the regular expression looks like this:
string expr = "^[A-Z]{4}[0-9]{5}$";
The problem is that when the codes are read from a text file (one per line), each code has the \r carriage-return character appended. I found this by setting a breakpoint to see what was really being passed through.
What am I missing from the expression above?
Also, if I add the codes individually, the \r characters are not appended, so that case is fine; I would therefore need an "or" somewhere in the expression.
Summary
What I have so far: ^[A-Z]{4}[0-9]{5}$
What I need: ^[A-Z]{4}[0-9]{5}$ OR ^[A-Z]{4}[0-9]{5}$ with \r characters accounted for.
Thanks. If anything needs clarifying, please let me know, as my experience with regex is very limited.
Update
string expr = "^[A-Z]{4}[0-9]{5}";
Regex regex = new Regex(expr , RegexOptions.IgnoreCase);
Match match = regex.Match( code );
if (!match.Success) //Pattern must match
{
MessageBox.Show("Code does not match the necessary pattern");
return false;
}
return true;

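Why do you want to use regex for that? Use File.ReadLines to read the file line by line (it strips the line terminators), and use the regex only for validation.
foreach (string line in File.ReadLines(@"c:\file path here")) {
    // Regex.IsMatch takes the input first, then the pattern
    if (Regex.IsMatch(line, expr)) {
        Console.WriteLine(line);
    }
}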

If you have no control over how the strings are read, you could also take a look at the String.Trim(char[] values) method, which lets you sanitize your string beforehand.
Something like the below:
string str = "....".Trim(new char[] {'\r', '\n'});
This is usually recommended (since almost anything is better than regex :)).
Then you would feed it to the regular expression you have built.
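As a minimal sketch of the trim-then-validate approach, assuming a code may arrive with a trailing \r or \n: the class and method names below (CodeValidator, IsValidCode) are made up for illustration, while the pattern is the one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CodeValidator
{
    // Pattern from the question: four letters followed by five digits.
    private static readonly Regex CodePattern =
        new Regex("^[A-Z]{4}[0-9]{5}$", RegexOptions.IgnoreCase);

    // Hypothetical helper: strips trailing line-break characters, then validates.
    public static bool IsValidCode(string code)
    {
        if (code == null) return false;
        string cleaned = code.TrimEnd('\r', '\n');
        return CodePattern.IsMatch(cleaned);
    }

    public static void Main()
    {
        Console.WriteLine(IsValidCode("QUES10100"));    // True
        Console.WriteLine(IsValidCode("QUES10100\r"));  // True
        Console.WriteLine(IsValidCode("QUES1010"));     // False
    }
}
```

If trimming is not an option, ending the pattern with \r?$ instead of $ should also accept an optional trailing carriage return.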

Related

any character between keywords with a space

I'm creating a blacklist of keywords that I want to check for in text files. However, I'm having trouble finding any regex documentation that helps with the following issue.
I have a set of blacklisted keywords:
welcome, goodbye, join us
I want to check some text files for any matches. I'm using the following regex to match exact words and also the pluralized version.
string.Format(@"\b{0}s*\b", keyword)
However, I've run into an issue matching keywords made of two words with any character in between. The above regex matches 'join us', but I need to match 'join#us' or 'join_us' for example as well.
Any help would be greatly appreciated.
I think that the "any character in between" requirement may cause you a lot of trouble. For example, let's consider this:
We want to find "my elf"... but you probably don't want to match "myself".
Anyway, if this is OK with you, replace the space character with a dot in the keyword using string.Replace.
. in regex will match any character.
If you are new to regexes, check this useful cheat sheet: http://www.mikesdotnetting.com/article/46/c-regular-expressions-cheat-sheet
To solve the issue with "myself" and "my elf", use something more careful than . in the regex. For example [^a-zA-Z], which matches anything except the letters a to z and A to Z, or maybe \W, which matches any non-word character, meaning anything except a-zA-Z0-9_, so it is roughly equivalent to [^a-zA-Z0-9_].
Also be careful about plural forms like city - cities and all the irregular ones.
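A small sketch of the [^a-zA-Z] idea from above, under the assumption that exactly one non-letter character may stand in for each space. The names KeywordPattern, Build and IsBlacklisted are hypothetical, and the s* suffix is carried over from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordPattern
{
    // Hypothetical helper: each space in the keyword may be replaced by
    // any single non-letter character (space, '#', '_', ...).
    public static string Build(string keyword)
    {
        // Escape each word separately so characters such as '.' stay literal.
        var parts = keyword.Split(' ').Select(part => Regex.Escape(part));
        return @"\b" + string.Join("[^a-zA-Z]", parts) + @"s*\b";
    }

    public static bool IsBlacklisted(string text, string keyword)
    {
        return Regex.IsMatch(text, Build(keyword), RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(IsBlacklisted("come join#us now", "join us")); // True
        Console.WriteLine(IsBlacklisted("come join_us now", "join us")); // True
        Console.WriteLine(IsBlacklisted("I did it myself", "my elf"));   // False
    }
}
```

Note that \W would not cover 'join_us', because the underscore counts as a word character; [^a-zA-Z] does.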
You could try something like this (I left only the {0} part of the regex):
var relevantChars = new char[]{',', '#'}; // add here anything you like
string.Format(@"{0}", keyword.Replace(" ", "(" + string.Join("|", relevantChars) + ")"));
If you're set on using pluralization, you will have to use the PluralizationService (see this answer for more details).
And seeing that you're using string.Format, I assume you're looping over your blacklist array.
So why not do it all in a neat method?
public static string GetBlacklistRegexString(string[] blacklist)
{
    // It seems that this service only supports English natively; to check later
    var ps = PluralizationService.CreateService(CultureInfo.GetCultureInfo("en"));
    // Using a StringBuilder for ease of use and performance,
    // even though it's not easy on the eye :p
    StringBuilder sb = new StringBuilder().Append(@"\b(");
    // We're just going to make a unique regex with all the words
    // and their plurals in a list, so we're looping here
    foreach (var word in blacklist)
    {
        // Using a dot wasn't careful indeed... Feel free to replace
        // "\W" with anything that does it for you. It will match
        // any non-word character (anything except letters, digits and '_')
        var regexPlural = ps.Pluralize(word).Replace(" ", @"\W");
        var regexWord = word.Replace(" ", @"\W");
        sb.Append(regexWord).Append('|').Append(regexPlural).Append('|');
    }
    sb.Remove(sb.Length - 1, 1); // removing the last '|'
    sb.Append(@")\b");
    return sb.ToString();
}
The usage is nothing surprising if you're already using regular expressions in .NET:
static void Main(string[] args)
{
string[] blacklist = {"Goodbye","Welcome","join us"};
string input = "Welcome, come join us at dummywebsite.com for fun and games, goodbye!";
//I assume that you want it case insensitive
Regex blacklistRegex = new Regex(GetBlacklistRegexString(blacklist), RegexOptions.IgnoreCase);
foreach (Match match in blacklistRegex.Matches(input))
{
Console.WriteLine(match);
}
Console.ReadLine();
}
The console shows the expected output:
Welcome
join us
goodbye
Edit: still have a problem (working on it later), if "man" is in your keywords, it will match the "men" in "women"... Weirdly I don't get this behaviour on regexhero.
Edit 2: duh, of course if I don't group the words with parenthesis, the word boundaries are just applied to the first and last one... Corrected.

Regex in C# How to replace only capture groups and not non-capture groups

I am writing a regular expression in Visual Studio 2013 Express using C#. I am trying to put single quotes around every single string that includes words and !@#$%^&*()_- except for:
and
or
not
empty()
notempty()
currentdate()
any string that already has single quotes around it.
Here is my regex and a sample of what it does:
https://regex101.com/r/nI1qP0/1
I want to put single quotes only around the capture groups and leave the non-capture groups untouched. I know this can be done with lookarounds, but I don't know how.
You can use this regex:
(?:'[^']*'|(?:\b(?:(?:not)?empty|currentdate)\(\)|and|or|not))|([!@#$%^&*_.\w-]+)
Here ignored matches are not captured and words to be quoted can be retrieved using Match.Groups[1]. You can then add quotes around Match.Groups[1] and get the whole input replaced as you want.
RegEx Demo
You need to use a match evaluator, or a callback method. The point is that you can examine the match and captured groups inside this method, and decide what action to take depending on your pattern.
So, add this callback method (may be non-static if the calling method is non-static):
public static string repl(Match m)
{
return !string.IsNullOrEmpty(m.Groups[1].Value) ?
m.Value.Replace(m.Groups[1].Value, string.Format("'{0}'", m.Groups[1].Value)) :
m.Value;
}
Then, use an overload of Regex.Replace with the match evaluator (=callback method):
var s = "'This is not captured' but this is and not or empty() notempty() currentdate() capture";
var rx = new Regex(@"(?:'[^']*'|(?:\b(?:(?:not)?empty|currentdate)\(\)|and|or|not))|([!@#$%^&*_.\w-]+)");
Console.WriteLine(rx.Replace(s, repl));
Note you can shorten the code with a lambda expression:
Console.WriteLine(rx.Replace(s, m => !string.IsNullOrEmpty(m.Groups[1].Value) ?
m.Value.Replace(m.Groups[1].Value, string.Format("'{0}'", m.Groups[1].Value)) :
m.Value));
See IDEONE demo
Instead of trying to ignore the strings with words and !@#$%^&*()_- in them, I just included them in my search, placed an extra single quote on either end, and then removed all instances of two single quotes, like so:
// Find any string of words and !@#$%^&*()_- in and out of quotes.
Regex getwords = new Regex(@"(^(?!and\b)(?!or\b)(?!not\b)(?!empty\b)(?!notempty\b)(?!currentdate\b)([\w!@#$%^&*())_-]+)|((?!and\b)(?!or\b)(?!not\b)(?!empty\b)(?!notempty\b)(?!currentdate\b)(?<=\W)([\w!@#$%^&*()_-]+)|('[\w\s!@#$%^&*()_-]+')))", RegexOptions.IgnoreCase);
// Find all cases of two single quotes
Regex getQuotes = new Regex(@"('')");
// Get string from user
Console.WriteLine("Type in a string");
string search = Console.ReadLine();
// Execute Expressions.
search = getwords.Replace(search, "'$1'");
search = getQuotes.Replace(search, "'");

Parsing a file or directory from an semi-random text

I've got a method that is going to perform some SVN commands (using SharpSVN) on a collection of files and/or directories, based on what the user has selected within a textbox on the form.
Quickly storing some highlighted text in a variable and looking at it, sample data might be like this:
Modified -- C:\\\folder\\\trunk\\\SubFolderOne\\\SubFolderTwo\\\SubThree\r\nModified --
C:\\\folder\\\trunk\\\SubFolderOne\\\SubFolderTwo\\\SubThree\\\myFile.cs
I'm trying to write a regex to parse out anything in between a space and the \r character, but I can't figure it out.
I thought the pattern would be something like this:
@"\s\S*\\r"
But using my sample data here it yields this as a result:
C:\\\folder\\\trunk\\\SubFolderOne\\\SubFolderTwo\\\SubThree\r
Then I'm just going to throw each result (ie proper path/file) into a collection of strings which will be used elsewhere in the application.
Is there a better way to do this using the Path class, hopefully?
One thing I can think of would be to split up the data using substring any time it finds \r\n, then simply drop the "prefix" (Modified --, NotVersioned --, Normal --) from the strings.
That seems really... poor though.
If it helps, I do know that the top-most directory will always be C:\\folder\\trunk
I would recommend that you split the string on "\r\n" and then match each string. For example:
Regex re = new Regex(@"\s(\S*?)$");
foreach (var line in s.Split(new[]{"\r\n"}, StringSplitOptions.RemoveEmptyEntries))
{
    // Match against the current line, not the whole input string
    Match m = re.Match(line);
    Console.WriteLine("{0},{1},'{2}'", m.Index, m.Length, m.Groups[1].Value);
}
That works when tested against your sample text.
You can use regex lookahead and lookbehind
String pattern = @"(?<=--\s)\S*(?=\\r|$)";
var result = Regex.Matches(input, pattern);
foreach (Match match in result)
{
Console.WriteLine(match.Value);
}
Parsing invalid values is not a job for the Path class. You should use either a regex or split and substring. Both ways are good; prefer the one you can easily read, explain and change.
var paths =
    Regex.Split(input, @"\\r\\n")
        .Select(row => row.Substring(row.LastIndexOf(' ') + 1, row.Length - row.LastIndexOf(' ') - 1));
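For comparison, a sketch of the split-and-strip idea without any regex. It assumes every line has the form "Status -- path" with real line breaks (not the escaped \r\n shown by the debugger), and it cuts at the " -- " separator so that paths containing spaces survive. StatusLineParser and ExtractPaths are illustrative names, not part of SharpSVN.

```csharp
using System;
using System.Collections.Generic;

public static class StatusLineParser
{
    private const string Separator = " -- ";

    // Hypothetical helper: returns the path part of each "Status -- path" line.
    public static List<string> ExtractPaths(string text)
    {
        var paths = new List<string>();
        string[] lines = text.Split(new[] { "\r\n", "\n" },
                                    StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            int index = line.IndexOf(Separator, StringComparison.Ordinal);
            if (index < 0) continue; // not a status line, skip it
            paths.Add(line.Substring(index + Separator.Length).Trim());
        }
        return paths;
    }

    public static void Main()
    {
        string input = "Modified -- C:\\folder\\trunk\\SubFolderOne\r\n" +
                       "Modified -- C:\\folder\\trunk\\SubFolderOne\\myFile.cs";
        foreach (string path in ExtractPaths(input))
        {
            Console.WriteLine(path);
        }
    }
}
```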

Convert C# regex Code to Java

I have found this Regex extractor code in C#.
Can someone tell me how this works, and how do I write the equivalent in Java?
// extract songtitle from metadata header.
// Trim was needed, because some stations don't trim the songtitle
fileName =
Regex.Match(metadataHeader,
"(StreamTitle=')(.*)(';StreamUrl)").Groups[2].Value.Trim();
This should be what you want.
// Create the Regex pattern
Pattern p = Pattern.compile("(StreamTitle=')(.*)(';StreamUrl)");
// Create a matcher that matches the pattern against your input
Matcher m = p.matcher(metadataHeader);
// if we found a match
if (m.find()) {
// the filename is the second group. (The `(.*)` part)
// trim() mirrors the Trim() call in the C# original
fileName = m.group(2).trim();
}
It pulls "MyTitle" from a string such as "StreamTitle='MyTitle';StreamUrl".
The () operators define match groups; there are three in your regex. The second one contains the string of interest and is retrieved via Groups[2].Value.
There are a few very good regex designers out there. The one I use is Rad Software's Regular Expression Designer (www.radsoftware.com.au). It is very useful for figuring out stuff like this (and it uses C# regexes).
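To make the group numbering concrete on the C# side, here is a small sketch that wraps the original expression in a helper. StreamTitleParser and ExtractTitle are made-up names, and the sample header is invented for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StreamTitleParser
{
    // Hypothetical helper around the expression from the question.
    public static string ExtractTitle(string metadataHeader)
    {
        Match match = Regex.Match(metadataHeader, "(StreamTitle=')(.*)(';StreamUrl)");
        // Groups[0] is the whole match, Groups[1] is "StreamTitle='",
        // Groups[2] is the title, and Groups[3] is "';StreamUrl".
        return match.Success ? match.Groups[2].Value.Trim() : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractTitle("StreamTitle=' MyTitle ';StreamUrl='';")); // MyTitle
    }
}
```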

C# regex need characters after \player_n\

I need a regex pattern that handles the following.
I get a response from a UDP server, it's a very long string and each word is separated by \, for example:
\g79g97\g879o\wot87gord\player_0\name0\g6868o\g78og89\g79g79\player_1\name1\gyuvui\yivyil\player_2\name2\g7g87\g67og9o\v78v9i7
I need the strings after \player_n\, so in the above example I would need name0, name1 and name2.
I know this is the second regex question of the day but I have the book (Mastering Regular Expressions) on order! Thank you.
UPDATE: elusive's regex pattern will suffice, and I can add match(0) to a textbox. However, what if I want to add all the matches to the text box?
textBox1.Text += match.Captures[0].ToString(); //this works fine.
How do I add all of the match captures to the text box? Sorry for being so lame; this Regex class is brand new to me.
Try this one:
\\player_\d+\\([^\\]+)
I think this test sample can help you:
string inp = #"\g79g97\g879o\wot87gord\player_0\name0\g6868o\g78og89\g79g79\player_1\name1\gyuvui\yivyil\player_2\name2\g7g87\g67og9o\v78v9i7";
string rex = #"[\w]*[\\]player_[0-9]+[\\](?<name>[A-Za-z0-9]*)\b";
Regex re = new Regex(rex);
Match mat = re.Match(inp);
for (Match m = re.Match(inp); m.Success; m = m.NextMatch())
{
Console.WriteLine(m.Groups["name"]);
}
You can take the name of the player from m.Groups["name"].
To get only the player name, you could use:
(?<=\\player_\d+\\)[^\\]+
This (?<=\\player_\d+\\) is something called a positive look-behind. It makes sure that the actual match [^\\]+ is preceded by the expression in the parentheses.
In this case, it's even specific to only a few regex engines (.NET being among them, luckily), in that it contains a variable length expression (due to \d+). Most regex engines only support fixed-length look-behind.
In any case, look-behind is not necessarily the best approach to this problem; match groups are simpler and easier to read.
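For the follow-up about collecting every match rather than only the first, a sketch using Regex.Matches with the capture-group pattern suggested earlier might look like this. PlayerNames and Extract are illustrative names, and the sample response is a shortened version of the one in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlayerNames
{
    // Group 1 captures everything after \player_n\ up to the next backslash.
    private static readonly Regex PlayerPattern =
        new Regex(@"\\player_\d+\\([^\\]+)");

    // Hypothetical helper: returns every player name found in the response.
    public static List<string> Extract(string response)
    {
        var names = new List<string>();
        foreach (Match match in PlayerPattern.Matches(response))
        {
            names.Add(match.Groups[1].Value);
        }
        return names;
    }

    public static void Main()
    {
        string response =
            @"\g79g97\player_0\name0\g6868o\player_1\name1\gyuvui\player_2\name2\g7g87";
        Console.WriteLine(string.Join(", ", Extract(response))); // name0, name1, name2
    }
}
```

In the form from the question, the result could then be assigned in one go, for example textBox1.Text = string.Join(Environment.NewLine, PlayerNames.Extract(response));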
