Match a string against an easy pattern - c#

I am trying to future-proof a program I am creating so that the pattern users must enter is not hard-coded. There is always a chance that the letter or number pattern can change, but when it does I need everyone to remain consistent. Plus, I want the managers to be able to control what goes in without relying on me. Is it possible to use regex or another string tool to compare input against a list stored in a database? I want it to be easy, so the patterns stored in the database would look like X###### or X######-X####### and so on.

Sure, just store the regular expression rules in a string column in a table and then load them into an IEnumerable<Regex> in your app. Then, a match is simply if ANY of those rules match. Beware that conflicting rules could be prone to greedy race (first one to be checked wins) so you'd have to be careful there. Also be aware that there are many optimizations that you could perform beyond my example, which is designed to be simple.
List<string> regexStrings = db.GetRegexStrings();
var result = new List<Regex>(regexStrings.Count);
foreach (var regexString in regexStrings)
{
result.Add(new Regex(regexString));
}
...
// The check
bool matched = result.Any(i => i.IsMatch(testInput));

You could store your patterns as-is in your database, and then translate them to regexes.
I don't know specifically what characters you'd need in your format, but let's suppose you just want # to stand for a digit and everything else to be taken literally. Here's some code for that:
public static Regex ConvertToRegex(string pattern)
{
var sb = new StringBuilder();
sb.Append("^");
foreach (var c in pattern)
{
switch (c)
{
case '#':
sb.Append(@"\d");
break;
default:
sb.Append(Regex.Escape(c.ToString()));
break;
}
}
sb.Append("$");
return new Regex(sb.ToString());
}
You can also use options like RegexOptions.IgnoreCase if that's what you need.
NB: Regex.Escape also escapes the # character, because # starts a comment when RegexOptions.IgnorePatternWhitespace is in effect. So I just went for the character-by-character approach.
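To tie the two answers above together, here is a minimal sketch that converts every stored mask to a regex and accepts the input if any of them matches. The class name MaskMatcher, the method MatchesAny, and the in-memory list standing in for the database rows are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class MaskMatcher
{
    // Translates a mask such as "X######" into an anchored regex:
    // '#' becomes a digit placeholder, every other character is matched literally.
    public static Regex ConvertToRegex(string mask)
    {
        var sb = new StringBuilder("^");
        foreach (char c in mask)
        {
            sb.Append(c == '#' ? @"\d" : Regex.Escape(c.ToString()));
        }
        sb.Append("$");
        return new Regex(sb.ToString());
    }

    // True when the input matches at least one of the masks.
    public static bool MatchesAny(IEnumerable<string> masks, string input)
    {
        return masks.Select(m => ConvertToRegex(m)).Any(r => r.IsMatch(input));
    }

    public static void Main()
    {
        // Hypothetical stand-in for the patterns loaded from the database.
        var masks = new List<string> { "X######", "X######-X#######" };

        Console.WriteLine(MatchesAny(masks, "X123456"));          // True
        Console.WriteLine(MatchesAny(masks, "X123456-X1234567")); // True
        Console.WriteLine(MatchesAny(masks, "Y123456"));          // False
    }
}
```

Two details may matter depending on your data: in .NET, \d also matches non-ASCII Unicode digits unless RegexOptions.ECMAScript is used, and $ also matches just before a trailing newline (use \z if that is a concern). In a real application you would probably build the Regex objects once and cache them rather than rebuilding them on every check.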

private bool TestMethod()
{
const string textPattern = "X###";
string text = textBox1.Text;
bool match = true;
if (text.Length == textPattern.Length)
{
char[] chrStr = text.ToCharArray();
char[] chrPattern = textPattern.ToCharArray();
int length = text.Length;
for (int i = 0; i < length; i++)
{
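if (chrPattern[i] == '#')
{
// '#' is a digit placeholder, so the input must have a digit here
if (!char.IsDigit(chrStr[i]))
{
return false;
}
}
else if (chrPattern[i] != chrStr[i])
{
return false;
}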
}
}
else
{
return false;
}
return match;
}
This is doing everything I need it to do now. Thanks for all the tips though. I will have to look into the regex more in the future.

Using MaskedTextProvider, you could do something like this:
using System.Globalization;
using System.ComponentModel;
string pattern = "X&&&&&&-X&&&&&&&";
string text = "Xabcdef-Xasdfghi";
var culture = CultureInfo.GetCultureInfo("sv-SE");
var matcher = new MaskedTextProvider(pattern, culture);
int position;
MaskedTextResultHint hint;
if (!matcher.Set(text, out position, out hint))
{
Console.WriteLine("Error at {0}: {1}", position, hint);
}
else if (!matcher.MaskCompleted)
{
Console.WriteLine("Not enough characters");
}
else if (matcher.ToString() != text)
{
Console.WriteLine("Missing literals");
}
else
{
Console.WriteLine("OK");
}
For a description of the format, see: http://msdn.microsoft.com/en-us/library/system.windows.forms.maskedtextbox.mask

Related

C# Regex to replace specific hashtags with certain block of text

I am a new C# developer and I am struggling to write a method that replaces a few specific hashtags in a sample of tweets with certain blocks of text. For example, if the tweet has a hashtag like #StPaulSchool, I want to replace this hashtag with the text "St. Paul School", without the '#' character.
I have a very small list of the words which I need to replace. If there is no match, then I would like to remove the hashtag (replace it with an empty string).
I am using the following method to parse the tweet and convert it into a formatted tweet but I don't know how to enhance it in order to handle the specific hashtags. Could you please tell me how to do that?
Here's the code:
public string ParseTweet(string rawTweet)
{
Regex link = new Regex(@"http(s)?://([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?");
Regex screenName = new Regex(@"@\w+");
Regex hashTag = new Regex(@"#\w+");
var words_to_replace = new string[] { "StPaulSchool", "AzharSchool", "WarwiSchool", "ManMet_School", "BrumSchool"};
var inputWords = new string[] { "St. Paul School", "Azhar School", "Warwick School", "Man Metapolian School", "Brummie School"};
string formattedTweet = link.Replace(rawTweet, delegate (Match m)
{
string val = m.Value;
//return string.Format("URL");
return string.Empty;
});
formattedTweet = screenName.Replace(formattedTweet, delegate (Match m)
{
string val = m.Value.Trim('@');
//return string.Format("USERNAME");
return string.Empty;
});
formattedTweet = hashTag.Replace(formattedTweet, delegate (Match m)
{
string val = m.Value;
//return string.Format("HASHTAG");
return string.Empty;
});
return formattedTweet;
}
The following code works for the hashtags:
static void Main(string[] args)
{
string longTweet = @"Long sentence #With #Some schools like #AzharSchool and spread out
over two #StPaulSchool lines ";
string result = Regex.Replace(longTweet, @"\#\w+", match => ReplaceHashTag(match.Value), RegexOptions.Multiline);
Console.WriteLine(result);
}
private static string ReplaceHashTag(string input)
{
switch (input)
{
case "#StPaulSchool": return "St. Paul School";
case "#AzharSchool": return "Azhar School";
default:
return input; // hashtag not recognized
}
}
If the list of hashtags to convert becomes very long, it would be more succinct to use a Dictionary, e.g.:
private static Dictionary<string, string> _hashtags
= new Dictionary<string, string>
{
{ "#StPaulSchool", "St. Paul School" },
{ "#AzharSchool", "Azhar School" },
};
and rewrite the body of the ReplaceHashTag method (with its parameter named hashtag) with this:
if (!_hashtags.ContainsKey(hashtag))
{
return hashtag;
}
return _hashtags[hashtag];
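Putting the dictionary and the regex together, a minimal sketch could look like the following. It removes unrecognized hashtags, as the question asks, rather than leaving them in place as the switch above does. The class name HashtagReplacer and the case-insensitive comparer are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HashtagReplacer
{
    // Hypothetical lookup table; keys are compared case-insensitively.
    private static readonly Dictionary<string, string> Hashtags =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "#StPaulSchool", "St. Paul School" },
            { "#AzharSchool", "Azhar School" },
        };

    public static string Replace(string tweet)
    {
        return Regex.Replace(tweet, @"#\w+", m =>
        {
            string replacement;
            // Known hashtags are expanded; unknown ones are replaced with an empty string.
            return Hashtags.TryGetValue(m.Value, out replacement)
                ? replacement
                : string.Empty;
        });
    }

    public static void Main()
    {
        Console.WriteLine(Replace("Open day at #AzharSchool and #StPaulSchool!"));
    }
}
```

TryGetValue does the lookup once instead of the ContainsKey-then-indexer pair. Note that removing a hashtag leaves the surrounding whitespace untouched, so you may want to collapse double spaces afterwards.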
I believe that using regular expressions makes this code unreadable and difficult to maintain. Moreover, you are using a regular expression to find a very simple pattern: strings that start with the hashtag (#) character.
I suggest a different approach: Break the sentence into words, transform each word according to your business rules, then join the words back together. Although this sounds like a lot of work, and it may be the case in another language, the C# String class makes this quite easy to implement.
Here is a basic example of a console application that does the requested functionality, the business rules are hard-coded, but this should be enough so you could continue:
static void Main(string[] args)
{
string text = "Example #First #Second #NoMatch not a word ! \nSecond row #Second";
string[] wordsInText = text.Split(' ');
IEnumerable<string> transformedWords = wordsInText.Select(selector: word => ReplaceHashTag(word: word));
string transformedText = string.Join(separator: " ", values: transformedWords);
Console.WriteLine(value: transformedText);
}
private static string ReplaceHashTag(string word)
{
if (!word.StartsWith(value: "#"))
{
return word;
}
string wordWithoutHashTag = word.Substring(startIndex: 1);
if (wordWithoutHashTag == "First")
{
return "FirstTransformed";
}
if (wordWithoutHashTag == "Second")
{
return "SecondTransformed";
}
return string.Empty;
}
Note that this approach gives you much more flexibility in chaining your logic, and by making small modifications you can make this code a lot more testable and incremental than the regular expression approach.

String to Sequence of Tokens

I'm parsing command sequence strings and need to convert each string into a string[] that will contain command tokens in the order that they're read.
The reason is that these sequences are stored in a database to instruct a protocol client to carry out a certain prescribed sequence for individual distant applications. There are special tokens in these strings that I need to add to the string[] by themselves because they don't represent data being transmitted; instead they indicate blocking pauses.
The sequences do not contain delimiters. There can be any number of special tokens anywhere in a command sequence, which is why I can't simply parse the strings with regex. Also, all of these special commands within the sequence are wrapped in ${}.
Here's an example of the data that I need to parse into tokens (P1 indicates blocking pause for one second):
"some data to transmit${P1}more data here"
Resulting array should look like this:
{ "some data to transmit", "${P1}", "more data here" }
I would think LINQ could help with this, but I'm not so sure. The only solution I can come up with would be to loop through each character until a $ is found and then detect if a special pause command is available and then parse the sequence from there using indexes.
One option is to use Regex.Split(str, @"(\${.*?})") and ignore the empty strings that you get when you have two special tokens next to each other.
Perhaps Regex.Split(str, @"(\${.*?})").Where(s => s != "") is what you want.
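For reference, a minimal runnable sketch of that suggestion follows. The class and method names are assumptions; the braces are escaped here only for readability, since .NET treats a brace that does not form a quantifier as a literal anyway.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CommandTokenizer
{
    public static string[] Tokenize(string sequence)
    {
        // The capturing group makes Regex.Split keep the ${...} tokens in the result;
        // the Where clause drops the empty strings between adjacent tokens.
        return Regex.Split(sequence, @"(\$\{.*?\})")
                    .Where(s => s.Length > 0)
                    .ToArray();
    }

    public static void Main()
    {
        foreach (string token in Tokenize("some data to transmit${P1}more data here"))
        {
            Console.WriteLine(token);
        }
        // some data to transmit
        // ${P1}
        // more data here
    }
}
```

This relies on the documented behavior that Regex.Split includes captured text in the returned array when the pattern contains capturing parentheses.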
Alright, so as was mentioned in the comments, I suggest you read about lexers. They have the power to do everything and more of what you described.
Since your requirements are so simple, I'll say that it is not too difficult to write the lexer by hand. Here's some pseudocode that could do it.
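IEnumerable<string> tokenize(string str) {
    var result = new List<string>();
    int start = 0;        // start of the pending plain-text run
    int tokenStart = -1;  // index of the '$' that may open a token
    int state = 0;        // 0 = plain text, 1 = seen '$', 2 = inside "${"
    for (int pos = 0; pos < str.Length; pos++) {
        char c = str[pos];
        switch (state) {
            case 0:
                if (c == '$') { state = 1; tokenStart = pos; }
                break;
            case 1:
                if (c == '{') { state = 2; }
                else if (c == '$') { tokenStart = pos; }  // "$$": the later '$' may open a token
                else { state = 0; }
                break;
            case 2:
                if (c == '}') {
                    // emit the text before the token (if any), then the token itself
                    if (tokenStart > start) {
                        result.Add(str.Substring(start, tokenStart - start));
                    }
                    result.Add(str.Substring(tokenStart, pos - tokenStart + 1));
                    start = pos + 1;
                    state = 0;
                }
                break;
        }
    }
    // whatever is left after the last token is plain data
    if (start < str.Length) {
        result.Add(str.Substring(start));
    }
    return result;
}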
Or something like that. I usually get the parameters of Substring wrong on the first try, but that's the general idea.
You can get a much more powerful (and easier to read) lexer by using something like ANTLR.
Using a little bit of Gabe's suggestion, I've come up with a solution that does exactly what I was looking to do:
string tokenPattern = @"(\${\w{1,4}})";
string cmdSequence = "${P}test${P}${P}test${P}${Cr}";
string[] tokenized = (from token in Regex.Split(cmdSequence, tokenPattern)
where token != string.Empty
select token).ToArray();
With the command sequence in the above example, the array contains this:
{ "${P}", "test", "${P}", "${P}", "test", "${P}", "${Cr}"}

Splitting a string into words in a culture neutral way

I've come up with the method below that aims to split a text of variable length into an array of words for further full-text index processing (stop word removal, followed by a stemmer). The results seem to be OK, but I would like to hear opinions on how reliable this implementation would be against texts in different languages. Would you recommend using a regex for this instead? Please note that I've opted against using String.Split() because that would require me to pass a list of all known separators, which is exactly what I was trying to avoid when I wrote the function.
P.S: I can't use a full blown full text search engine like Lucene.Net for several reasons (Silverlight, Overkill for project scope etc).
public string[] SplitWords(string Text)
{
bool inWord = false; // set inside the loop; also avoids indexing into an empty string
var result = new List<string>();
var sbWord = new StringBuilder();
for (int i = 0; i < Text.Length; i++)
{
Char c = Text[i];
// non separator char?
if(!Char.IsSeparator(c) && !Char.IsControl(c))
{
if (!inWord)
{
sbWord = new StringBuilder();
inWord = true;
}
if (!Char.IsPunctuation(c) && !Char.IsSymbol(c))
sbWord.Append(c);
}
// it is a separator or control char
else
{
if (inWord)
{
string word = sbWord.ToString();
if (word.Length > 0)
result.Add(word);
sbWord.Clear();
inWord = false;
}
}
}
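// Flush the last word when the text does not end with a separator
if (inWord && sbWord.Length > 0)
result.Add(sbWord.ToString());
return result.ToArray();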
}
Since you said in culture neutral way, I really doubt if Regular Expression (word boundary: \b) will do. I have googled a bit and found this. Hope it would be useful.
I am pretty surprised that there is no built-in Java BreakIterator equivalent...
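On the regex question, one possible sketch is to match runs of Unicode letters, combining marks, and decimal digits instead of listing separators. This is an assumption about what counts as a "word", not a full replacement for a BreakIterator: languages written without spaces between words (Chinese, Japanese, Thai, and others) come out as one long run, which is also true of the loop in the question. Unlike that loop, this version splits at punctuation inside a word instead of dropping it, so "don't" becomes "don" and "t".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordSplitter
{
    // \p{L} = any letter, \p{M} = combining mark, \p{Nd} = decimal digit.
    private static readonly Regex WordPattern = new Regex(@"[\p{L}\p{M}\p{Nd}]+");

    public static string[] SplitWords(string text)
    {
        return WordPattern.Matches(text)
                          .Cast<Match>()
                          .Select(m => m.Value)
                          .ToArray();
    }

    public static void Main()
    {
        // "w\u00f6rld" contains a non-ASCII letter (o with diaeresis).
        foreach (string word in SplitWords("Hello, w\u00f6rld! 42 items"))
        {
            Console.WriteLine(word);
        }
    }
}
```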

Is there a ReadWord() method in the .NET Framework?

I'd hate to reinvent something that was already written, so I'm wondering if there is a ReadWord() function somewhere in the .NET Framework that extracts words from some text delimited by white space and line breaks.
If not, do you have an implementation that you'd like to share?
string data = "Four score and seven years ago";
List<string> words = new List<string>();
WordReader reader = new WordReader(data);
while (true)
{
string word =reader.ReadWord();
if (string.IsNullOrEmpty(word)) return;
//additional parsing logic goes here
words.Add(word);
}
Not that I'm aware of directly. If you don't mind getting them all in one go, you could use a regular expression:
Regex wordSplitter = new Regex(@"\W+");
string[] words = wordSplitter.Split(data);
If you have leading/trailing whitespace you'll get an empty string at the beginning or end, but you could always call Trim first.
A different option is to write a method which reads a word based on a TextReader. It could even be an extension method if you're using .NET 3.5. Sample implementation:
using System;
using System.IO;
using System.Text;
public static class Extensions
{
public static string ReadWord(this TextReader reader)
{
StringBuilder builder = new StringBuilder();
int c;
// Ignore any trailing whitespace from previous reads
while ((c = reader.Read()) != -1)
{
if (!char.IsWhiteSpace((char) c))
{
break;
}
}
// Finished?
if (c == -1)
{
return null;
}
builder.Append((char) c);
while ((c = reader.Read()) != -1)
{
if (char.IsWhiteSpace((char) c))
{
break;
}
builder.Append((char) c);
}
return builder.ToString();
}
}
public class Test
{
static void Main()
{
// Give it a few challenges :)
string data = @"Four score and
seven years ago ";
using (TextReader reader = new StringReader(data))
{
string word;
while ((word = reader.ReadWord()) != null)
{
Console.WriteLine("'{0}'", word);
}
}
}
}
Output:
'Four'
'score'
'and'
'seven'
'years'
'ago'
Not as such, however you could use String.Split to split the string into an array of string based on a delimiting character or string. You can also specify multiple strings / characters for the split.
If you'd prefer to do it without loading everything into memory then you could write your own stream class that does it as it reads from a stream but the above is a quick fix for small amounts of data word splitting.
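A minimal sketch of the String.Split approach, assuming the words are separated only by spaces, tabs, and line breaks (the class name and the exact delimiter list are illustrative):

```csharp
using System;

public static class SplitExample
{
    public static string[] SplitWords(string data)
    {
        // RemoveEmptyEntries drops the empty strings produced by
        // consecutive delimiters and by leading/trailing whitespace.
        char[] delimiters = { ' ', '\t', '\r', '\n' };
        return data.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        foreach (string word in SplitWords("Four score and\r\n seven  years ago "))
        {
            Console.WriteLine("'{0}'", word);
        }
    }
}
```

With StringSplitOptions.RemoveEmptyEntries there is no need to call Trim first.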

How can I split this string into an array?

My string is as follows:
smtp:jblack@test.com;SMTP:jb@test.com;X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;
I need back:
smtp:jblack@test.com
SMTP:jb@test.com
X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;
The problem is that the semicolons separate the addresses and are also part of the X400 address. Can anyone suggest how best to split this?
PS: I should mention that the order differs, so it could be:
X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;;smtp:jblack@test.com;SMTP:jb@test.com
There can be more than 3 addresses (4, 5, up to 10 and so on), including an X500 address; however, they all start with either smtp:, SMTP:, X400 or X500.
EDIT: With the updated information, this answer certainly won't do the trick - but it's still potentially useful, so I'll leave it here.
Will you always have three parts, and you just want to split on the first two semi-colons?
If so, just use the overload of Split which lets you specify the number of substrings to return:
string[] bits = text.Split(new char[]{';'}, 3);
May I suggest building a regular expression
(smtp|SMTP|X400|X500):((?!smtp:|SMTP:|X400:|X500:).)*;?
or protocol-less
.*?:((?![^:;]*:).)*;?
in other words find anything that starts with one of your protocols. Match the colon. Then continue matching characters as long as you're not matching one of your protocols. Finish with a semicolon (optionally).
You can then parse through the list of matches splitting on ':' and you'll have your protocols. Additionally if you want to add protocols, just add them to the list.
Likely however you're going to want to specify the whole thing as case-insensitive and only list the protocols in their uppercase or lowercase versions.
The protocol-less version doesn't care what the names of the protocols are. It just finds them all the same, by matching everything up to, but excluding a string followed by a colon or a semi-colon.
Split by the following regex pattern:
string[] items = System.Text.RegularExpressions.Regex.Split(text, @";(?=\w+:)");
EDIT: A better version that accepts more special characters in the protocol name:
string[] items = System.Text.RegularExpressions.Regex.Split(text, @";(?=[^;:]+:)");
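As a quick check of the lookahead split, here is a minimal sketch run against both sample strings from the question. It assumes that no address value itself contains a colon after a semicolon; the class and method names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AddressSplitter
{
    public static string[] SplitAddresses(string text)
    {
        // Split only at a semicolon that is directly followed by "name:",
        // where name contains neither ';' nor ':'.
        return Regex.Split(text, @";(?=[^;:]+:)");
    }

    public static void Main()
    {
        string input1 = "smtp:jblack@test.com;SMTP:jb@test.com;X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;";
        string input2 = "X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;;smtp:jblack@test.com;SMTP:jb@test.com";

        foreach (string item in SplitAddresses(input1)) Console.WriteLine(item);
        foreach (string item in SplitAddresses(input2)) Console.WriteLine(item);
    }
}
```

Because the semicolons inside the X400 address are followed by KEY=value pairs rather than by a name and a colon, they are left alone, and the X400 entry keeps its trailing semicolon in both orderings.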
http://msdn.microsoft.com/en-us/library/c1bs0eda.aspx
Check there: you can specify the maximum number of substrings you want, so in your case you would do
text.Split(new char[]{';'}, 3);
Not the fastest if you are doing this a lot but it will work for all cases I believe.
string input1 = "smtp:jblack@test.com;SMTP:jb@test.com;X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;";
string input2 = "X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;;smtp:jblack@test.com;SMTP:jb@test.com";
Regex splitEmailRegex = new Regex(@"(?<key>\w+?):(?<value>.*?)(\w+:|$)");
List<string> sets = new List<string>();
while (input2.Length > 0)
{
Match m1 = splitEmailRegex.Matches(input2)[0];
string s1 = m1.Groups["key"].Value + ":" + m1.Groups["value"].Value;
sets.Add(s1);
input2 = input2.Substring(s1.Length);
}
foreach (var set in sets)
{
Console.WriteLine(set);
}
Console.ReadLine();
Of course many will claim Regex: Now you have two problems. There may even be a better regex answer than this.
You could always split on the colon and have a little logic to grab the key and value.
string[] bits = text.Split(':');
List<string> values = new List<string>();
for (int i = 1; i < bits.Length; i++)
{
string value = bits[i].Contains(';') ? bits[i].Substring(0, bits[i].LastIndexOf(';') + 1) : bits[i];
string key = bits[i - 1].Contains(';') ? bits[i - 1].Substring(bits[i - 1].LastIndexOf(';') + 1) : bits[i - 1];
values.Add(String.Concat(key, ":", value));
}
Tested it with both of your samples and it works fine.
This caught my curiosity .... So this code actually does the job, but again, wants tidying :)
My final attempt - stop changing what you need ;=)
static void Main(string[] args)
{
string fneh = "X400:C=US400;A= ;P=Test;O=Exchange;S=Jack;G=Black;x400:C=US400l;A= l;P=Testl;O=Exchangel;S=Jackl;G=Blackl;smtp:jblack@test.com;X500:C=US500;A= ;P=Test;O=Exchange;S=Jack;G=Black;SMTP:jb@test.com;";
string[] parts = fneh.Split(new char[] { ';' });
List<string> addresses = new List<string>();
StringBuilder address = new StringBuilder();
foreach (string part in parts)
{
if (part.Contains(":"))
{
if (address.Length > 0)
{
addresses.Add(semiColonCorrection(address.ToString()));
}
address = new StringBuilder();
address.Append(part);
}
else
{
address.AppendFormat(";{0}", part);
}
}
addresses.Add(semiColonCorrection(address.ToString()));
foreach (string emailAddress in addresses)
{
Console.WriteLine(emailAddress);
}
Console.ReadKey();
}
private static string semiColonCorrection(string address)
{
if ((address.StartsWith("x", StringComparison.InvariantCultureIgnoreCase)) && (!address.EndsWith(";")))
{
return string.Format("{0};", address);
}
else
{
return address;
}
}
Try these regexes. You can extract what you're looking for using named groups.
X400:(?<X400>.*?)(?:smtp|SMTP|$)
smtp:(?<smtp>.*?)(?:;+|$)
SMTP:(?<SMTP>.*?)(?:;+|$)
Make sure you specify case-insensitive matching when constructing them. They seem to work with the samples you gave.
Lots of attempts. Here is mine ;)
string src = "smtp:jblack@test.com;SMTP:jb@test.com;X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;";
Regex r = new Regex(@"
(?:^|;)smtp:(?<smtp>([^;]*(?=;|$)))|
(?:^|;)x400:(?<X400>.*?)(?=;x400|;x500|;smtp|$)|
(?:^|;)x500:(?<X500>.*?)(?=;x400|;x500|;smtp|$)",
RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
foreach (Match m in r.Matches(src))
{
if (m.Groups["smtp"].Captures.Count != 0)
Console.WriteLine("smtp: {0}", m.Groups["smtp"]);
else if (m.Groups["X400"].Captures.Count != 0)
Console.WriteLine("X400: {0}", m.Groups["X400"]);
else if (m.Groups["X500"].Captures.Count != 0)
Console.WriteLine("X500: {0}", m.Groups["X500"]);
}
This finds all smtp, x400 or x500 addresses in the string in any order of appearance. It also identifies the type of address ready for further processing. The appearance of the text smtp, x400 or x500 in the addresses themselves will not upset the pattern.
This works!
string input =
"smtp:jblack@test.com;SMTP:jb@test.com;X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;";
string[] parts = input.Split(';');
List<string> output = new List<string>();
foreach(string part in parts)
{
if (part.Contains(":"))
{
output.Add(part + ";");
}
else if (part.Length > 0)
{
output[output.Count - 1] += part + ";";
}
}
foreach(string s in output)
{
Console.WriteLine(s);
}
Do the semicolon (;) split and then loop over the result, re-combining each element where there is no colon (:) with the previous element.
string input = "X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G="
+"Black;;smtp:jblack@test.com;SMTP:jb@test.com";
string[] rawSplit = input.Split(';');
List<string> result = new List<string>();
//now the fun begins
string buffer = string.Empty;
foreach (string s in rawSplit)
{
if (buffer == string.Empty)
{
buffer = s;
}
else if (s.Contains(':'))
{
result.Add(buffer);
buffer = s;
}
else
{
buffer += ";" + s;
}
}
result.Add(buffer);
foreach (string s in result)
Console.WriteLine(s);
Here is another possible solution.
string[] bits = text.Replace(";smtp", "|smtp").Replace(";SMTP", "|SMTP").Replace(";X400", "|X400").Split(new char[] { '|' });
bits[0],
bits[1], and
bits[2]
will then contain the three parts in the order from your original string.
