Add wildcard to RegEx for phrases and text matching - c#

I have a text file which consists of:
stemmed words (e.g. manipulat - stemmed from "manipulating"), and
stemmed phrases which are usually two words or more (e.g.
"acknowledg him regard the invest" - stemmed from "acknowledging him
regarding the investment").
Each word/phrase is on its own line. My C# code reads each line in this text file, then for each line searches all rows in the DataTable for a match, i.e. if a word/phrase appears in any row of the DataTable, my system will flag that row.
For single word, it's easily done/matched using the algorithm I have. I can match "manipulat" to words like "manipulate", "manipulating", "manipulated" and "manipulation" if they appear in the DataTable rows.
But for phrases, my algorithm can only match exactly what it is. Here I mean if my phrase is "acknowledg him regard the invest", it will only search for the exact phrase, and it won't match/flag if "acknowledging him regarding the investment" exists in DataTable rows.
I have very little knowledge of both Regex and C#. I tried to modify the code below to use wildcards, but with no luck so far. I would appreciate it if anyone could help with this. Thank you in advance.
string[] words = File.ReadAllLines(sourceDirTemp + comboBox_filename.SelectedItem.ToString() + ".txt");
var query = LoadComments().AsEnumerable().Where(r =>
    words.Any(wordOrPhrase => Regex.IsMatch(r.Field<string>("Column_name"), @"\b"
        + Regex.Escape(wordOrPhrase) + @"\b", RegexOptions.IgnoreCase)));

When comparing the lines with the stem-words from your database using
RegEx you could extend your pattern in your code.
This will match 1 or more occurrences of any word character
\w+
This will match 0 or more occurrences of any word character
\w*
As Abbodanza already mentioned, this will match 0 or more occurrences of any character between a and z:
[a-z]*
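A minimal sketch of how these quantifiers behave when appended to a stem, assuming a .NET 6+ console program with top-level statements. The helper name MatchesStem and the sample words are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true if 'text' contains a word that starts with 'stem'
// and continues with whatever 'suffixPattern' allows.
static bool MatchesStem(string text, string stem, string suffixPattern) =>
    Regex.IsMatch(text, @"\b" + Regex.Escape(stem) + suffixPattern + @"\b",
                  RegexOptions.IgnoreCase);

foreach (string word in new[] { "manipulat", "manipulate", "manipulating", "Manipulation" })
{
    // \w* accepts the bare stem; \w+ requires at least one more character.
    bool zeroOrMore = MatchesStem(word, "manipulat", @"\w*");
    bool oneOrMore = MatchesStem(word, "manipulat", @"\w+");
    Console.WriteLine($"{word}: \\w* -> {zeroOrMore}, \\w+ -> {oneOrMore}");
}
```

Only the bare stem "manipulat" should differ between the two columns, since \w+ needs at least one extra character after the stem.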
EDIT:
If your algorithm works for single words you could split each phrase
string[] words = File.ReadAllLines(sourceDirTemp + comboBox_filename.SelectedItem.ToString() + ".txt");
foreach (var word in words)
{
    // moreOrOneWord.Length would allow you to check whether it is a phrase
    string[] moreOrOneWord = word.Split(' ');
    var query = LoadComments().AsEnumerable().Where(r =>
        moreOrOneWord.Any(wordOrPhrase => Regex.IsMatch(r.Field<string>("Column_name"), @"\b"
            + Regex.Escape(wordOrPhrase) + @"\b", RegexOptions.IgnoreCase)));
    // Do something with the query...
}
This should allow you to apply your algorithm to every single word in the text.
Here you can find an example to get started with regular expressions,
and here is a list of RegEx elements that you can use.
Hope this helps.

If you split the wordOrPhrase with space, and add \w* to match 0+ alphanumeric or underscore chars (or more specific pattern to only match letters like [\p{L}\p{M}]*) to each chunk, you could use
Regex.IsMatch(r.Field<string>("Column_name"),
    string.Join(" +", wordOrPhrase.Split()
        .Select(p => string.Format(@"\b{0}\w*\b", Regex.Escape(p)))),
    RegexOptions.IgnoreCase)
If wordOrPhrase is "acknowledg him regard the invest", the regex will be \backnowledg\w*\b +\bhim\w*\b +\bregard\w*\b +\bthe\w*\b +\binvest\w*\b and will find a match. See this IDEONE demo.
However, with this approach, himself will get matched with him (that would be turned into him\w*).
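One possible way around the him/himself caveat is to keep a small set of words that must match exactly and only add the \w* wildcard to the remaining stems. The sketch below assumes a .NET 6+ console program with top-level statements; the helper names and the contents of the exact-word set are hypothetical and not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: every stem may be followed by more word characters,
// except words listed in 'exactWords', which must match as whole words.
static string BuildStemPattern(string stemmedPhrase, ISet<string> exactWords)
{
    var parts = stemmedPhrase
        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
        .Select(stem => exactWords.Contains(stem)
            ? @"\b" + Regex.Escape(stem) + @"\b"
            : @"\b" + Regex.Escape(stem) + @"\w*");
    // \s+ tolerates any run of whitespace between the words.
    return string.Join(@"\s+", parts);
}

static bool ContainsStemmedPhrase(string text, string stemmedPhrase, ISet<string> exactWords) =>
    Regex.IsMatch(text, BuildStemPattern(stemmedPhrase, exactWords), RegexOptions.IgnoreCase);

// Assumption: short function words are not stems and should match exactly.
var shortWords = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "him", "the" };
string phrase = "acknowledg him regard the invest";

Console.WriteLine(BuildStemPattern(phrase, shortWords));
Console.WriteLine(ContainsStemmedPhrase(
    "Thanks for acknowledging him regarding the investment.", phrase, shortWords));
Console.WriteLine(ContainsStemmedPhrase(
    "Thanks for acknowledging himself regarding the investment.", phrase, shortWords));
```

Which words belong in the exact set depends on how the stemmer treats short words, so the set would have to be tuned against the real word list.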

Related

Find a string pattern using Regular expression

How can I use regular expressions to find if the string matches a pattern like [sometextornumber] is a [sometextornumber].
For instance, if the input is This is a test, the output should be this and test.
I was thinking something like ([a-zA-Z0-9]) is a([a-zA-Z0-9]) but looks like I am way off the correct path.
Your question is geared towards grabbing the first and last word of a sentence. If this is all you're going to be interested in, this pattern will suffice:
"^(\\w+)|(\\w+)$"
Pattern breakdown:
^ indicates the beginning of a line
^(\\w+) capture group for a word at the beginning of the line. \w is roughly equivalent to [a-zA-Z0-9_] (in .NET it also matches Unicode letters and digits), and the + says you want one or more of those characters.
| acts as an OR operator in Regex
$ indicates the end of a line
(\\w+)$ capture group for a word at the end of the line, using the same \w+ as above.
This pattern lets you ignore what's in between the first and last word, so it doesn't care about "is a", and each match gives you the word itself as the whole match (Groups[0]).
Usage:
string data = "This is going to be a test";
Match m = Regex.Match(data, "^(\\w+)|(\\w+)$");
while (m.Success)
{
Console.WriteLine(m.Groups[0]);
m = m.NextMatch();
}
Results:
This
test
If you're really only interested in the first and last word of a sentence, you also don't need to bother with Regex. Just split the sentence by a space and grab the first and last element of the array.
string[] dataPieces = data.Split(' ');
Console.WriteLine(dataPieces[0]);
Console.WriteLine(dataPieces[dataPieces.Length - 1]);
And the results are the same.
References:
https://msdn.microsoft.com/en-us/library/hs600312(v=vs.110).aspx
https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
Try this:
([a-zA-Z0-9]+) is a ([a-zA-Z0-9]+)
Edit:
You need a space after the a since it is another word. Without the + each group matches only a single character, so the pattern only matches from the last letter of the first word to the first letter of the last word. The + matches one or more of the preceding character class; keeping it inside the () makes each group capture the whole word.
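A short sketch of how the position of + relative to the parentheses changes what a group captures in .NET, written as a .NET 6+ console program with top-level statements. The helper name ExtractSubjectAndObject is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Returns the two captured words, or an empty array when the input does not fit.
static string[] ExtractSubjectAndObject(string input)
{
    Match m = Regex.Match(input, @"([a-zA-Z0-9]+) is a ([a-zA-Z0-9]+)");
    return m.Success
        ? new[] { m.Groups[1].Value, m.Groups[2].Value }
        : Array.Empty<string>();
}

string[] words = ExtractSubjectAndObject("This is a test");
Console.WriteLine(words[0]); // This
Console.WriteLine(words[1]); // test

// With the + outside the parentheses the group itself is repeated,
// and Group.Value holds only its last repetition (a single character).
Match repeated = Regex.Match("This is a test", @"([a-zA-Z0-9])+ is a ([a-zA-Z0-9])+");
Console.WriteLine(repeated.Groups[1].Value); // s
Console.WriteLine(repeated.Groups[2].Value); // t
```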
If you're looking to match a specific pattern such as "This" or "Test" you can simply do a case insensitive string compare.
From your question, I'm not sure that you necessarily need a regular expression here.
Here is a quick LINQpad:
var r = new Regex("(.*) is a (.*)");
var match = r.Match("This is a test");
match.Groups.OfType<Group>().Skip(1).Select(g=>g.Value).Dump();
That outputs:
IEnumerable<String> (2 items)
This
test

regex to find a word or words between spaces

I want to find the words in a sentence between spaces. So the words till the first space before and after the search word
This is anexampleof what I want should return anexampleof if my search word is example
I now have this regex "(?:^|\S*\s*)\S*" + searchword + "\S*(?:$|\s*\S*)" but this gives me an extra word in the beginning and the end.
'This is anexampleof what I want' --> returns 'is anexampleof what'
I tried to change the regex but I'm not good at it at all..
I'm using c#. Thx for the help.
Full C# code:
MatchCollection m1 = Regex.Matches(content, @"(?:^|\S*\s*)\S*" + searchword + @"\S*(?:$|\s*\S*)",
    RegexOptions.IgnoreCase | RegexOptions.Multiline);
You can simply leave out the non-capturing groups at the end:
@"\S*" + searchword + @"\S*";
Due to greediness you will get as many non-space characters on each side as possible.
Also, a non-capturing group does not mean that its contents are excluded from the match. All it does is not produce captures of sub-matches. If you want to check that something is there, but don't want to include it in the match, you want lookarounds:
@"(?<=^|\S*\s*)\S*" + searchword + @"\S*(?=$|\s*\S*)"
However these lookarounds don't really do anything in this case, because \s*\S* is satisfied with an empty string (because * makes both characters optional). But just for further reference... if you want to make assertions at the boundary of your match, which should not be part of the match... lookarounds are the way to go.
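To make the difference concrete, here is a small sketch comparing the plain \S* pattern, a non-capturing group, and a lookahead on the sentence from the question. It assumes a .NET 6+ console program with top-level statements; the helper FindAll and the two extra patterns are illustrative, not taken from the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: all match values of 'pattern' in 'content'.
static string[] FindAll(string content, string pattern) =>
    Regex.Matches(content, pattern, RegexOptions.IgnoreCase)
         .Cast<Match>()
         .Select(m => m.Value)
         .ToArray();

string sentence = "This is anexampleof what I want";
string word = Regex.Escape("example"); // escape in case the search word has metacharacters

// Greedy \S* on both sides: just the surrounding non-space characters.
Console.WriteLine(string.Join("|", FindAll(sentence, @"\S*" + word + @"\S*")));
// prints: anexampleof

// Non-capturing group: the following word is still part of the match.
Console.WriteLine(string.Join("|", FindAll(sentence, @"\S*" + word + @"\S*(?:\s+\S+)")));
// prints: anexampleof what

// Lookahead: a following word must exist, but it is not part of the match.
Console.WriteLine(string.Join("|", FindAll(sentence, @"\S*" + word + @"\S*(?=\s+\S+)")));
// prints: anexampleof
```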

Regular expression to match individual words in a phrase

I am using regular expressions for performing site search.
If I search for this : "Villas at Millwood" (this is a community name) and the corresponding community name is "Villas at Millwood" , I get the results.
If I search for "Millwood villas" , there are no results populated.
I mean, the phrase is taken as a whole and matched. Is there any way to match any occurrence of the individual words in the entered phrase, so that "millwood Villas" would still bring up the result "Villas at Millwood"?
Here is what I have to match the community name :
Regex.IsMatch(MarketingCommunityName.Trim(), pattern, RegexOptions.IgnoreCase)
where pattern is the entered search phrase and the MarketingCommunityName is the actual community name.
Thanks in Advance!
Although I think that you should Split your search pattern at a space, and then check every word separately, it would not be too hard to construct an order-independent regular expression from your search pattern:
var searchWords = searchString.Trim().Split(new Char[] {' '});
string pattern = @"^(?=.*" + String.Join(@")(?=.*", searchWords) + ")";
This constructs a regex that contains one lookahead assertion per search word. Each lookahead assertion starts from the beginning of the string and looks whether the search word shows up anywhere inside the string. Note that you will likely get problems, if your searchString contains regex meta-characters, so these should probably be escaped beforehand.
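A minimal sketch of the same lookahead construction with each search word escaped first, assuming a .NET 6+ console program with top-level statements. The helper name ContainsAllWords is hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: true if every word of 'searchString' occurs somewhere
// in 'text', in any order. One lookahead is generated per search word.
static bool ContainsAllWords(string text, string searchString)
{
    var lookaheads = searchString
        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
        .Select(w => "(?=.*" + Regex.Escape(w) + ")");
    string pattern = "^" + string.Concat(lookaheads);
    // Singleline lets .* cross line breaks in multi-line names.
    return Regex.IsMatch(text, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
}

Console.WriteLine(ContainsAllWords("Villas at Millwood", "millwood Villas")); // True
Console.WriteLine(ContainsAllWords("Villas at Millwood", "Millwood condos")); // False
Console.WriteLine(ContainsAllWords("C++ Villas", "villas c++"));              // True
```

Note that an empty search string yields the pattern ^, which matches everything, so a caller may want to handle that case separately.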
A regex pattern finding both words would be
\bMillwood\b.*\bvillas\b
where \b denotes the beginning or the end of a word and .* stands for any number of characters.
If you don't mind finding parts of words, you can drop the \b's:
Millwood.*villas
However, you would not find "villas of Millwood", for instance. This pattern would:
Millwood.*villas|villas.*Millwood
But if you want to expand this search to patterns consisting of more than three words, Regex is not the right choice for implementing this kind of fuzzy logic. I would count the number of distinct matching words and return the phrases that reach a minimum count (maybe at least 60% of the given words).
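The counting idea could be sketched roughly as follows, assuming a .NET 6+ console program with top-level statements. The helper name MatchedFraction, the sample names, and the 0.6 cutoff applied in the loop are illustrative assumptions.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: fraction of the distinct search words that occur
// as whole words in 'candidate' (0.0 when there are no search words).
static double MatchedFraction(string candidate, string searchString)
{
    string[] words = searchString
        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
        .Distinct(StringComparer.OrdinalIgnoreCase)
        .ToArray();
    if (words.Length == 0) return 0.0;

    int hits = words.Count(w =>
        Regex.IsMatch(candidate, @"\b" + Regex.Escape(w) + @"\b", RegexOptions.IgnoreCase));
    return (double)hits / words.Length;
}

string[] communityNames = { "Villas at Millwood", "Oak Hill Estates" };
string query = "millwood villas estates";
foreach (string name in communityNames)
{
    double fraction = MatchedFraction(name, query);
    string verdict = fraction >= 0.6 ? "kept" : "dropped";
    Console.WriteLine($"{name}: {fraction:0.00} -> {verdict}");
}
```

With this query the first name matches two of three words and the second only one, so only the first would pass a 60% threshold.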
Split the phrase and check every word:
pattern.Split(' ')
    .All(word => Regex.IsMatch(MarketingCommunityName.Trim(), word, RegexOptions.IgnoreCase));

Regex - Get all words that are not wrapped with a "/"

I'm really trying to learn regex, so here it goes.
I would really like to get all words in a string which do not have a "/" on either side.
For example, I need to do this to:
"Hello Great /World/"
I need to have the results:
"Hello"
"Great"
Is this possible in regex? If so, how do I do it? I think I would like the results to be stored in a string array :)
Thank you
Just use this regular expression \b(?<!/)\w+(?!/)\b:
var str = "Hello Great /World/ /I/ am great too";
var words = Regex.Matches(str, @"\b(?<!/)\w+(?!/)\b")
.Cast<Match>()
.Select(m=>m.Value)
.ToArray();
This will get you:
Hello
Great
am
great
too
var newstr = Regex.Replace("Hello Great /World/", @"/(\w+?)/", "");
If you really want an array of strings:
var words = Regex.Matches(newstr, @"\w+")
.Cast<Match>()
.Select(m => m.Value)
.ToArray();
I would first split the string into the array, then filter out matching words. This solution might also be cleaner than a big regexp, because you can spot the requirements for "word" and the filter better.
The big regexp solution would be something like word boundary - not a slash - many no-whitespaces - not a slash - word boundary.
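A minimal sketch of the split-then-filter approach, assuming words are separated by spaces and that "wrapped" means a slash at both ends of the token. It is written as a .NET 6+ console program with top-level statements, and the helper name WordsNotWrappedInSlashes is hypothetical.

```csharp
using System;
using System.Linq;

// Hypothetical helper: split on spaces, then drop tokens of the form /word/.
static string[] WordsNotWrappedInSlashes(string input) =>
    input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
         .Where(token => !(token.Length >= 2
                           && token[0] == '/'
                           && token[token.Length - 1] == '/'))
         .ToArray();

foreach (string word in WordsNotWrappedInSlashes("Hello Great /World/"))
{
    Console.WriteLine(word);
}
// prints:
// Hello
// Great
```

Because the filter is an ordinary predicate, the definition of "word" (for example stripping punctuation) can be changed without touching a regular expression.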
I would use a regex replace to replace all /[a-zA-Z]+/ with '' (nothing), then get all words.
Try this one:
(\s(?<!/)([A-Za-z]+)(?!/))|((?<!/)([A-Za-z]+)(?!/)\s)
Using this example excerpt:
The /character/ "_" (underscore/under-strike) can be /used/ in /variable/ names /in/ many /programming/ /languages/, while the /character/ "/" (slash/stroke/solidus) is typically not allowed.
...this expression matches any string of letters, numbers, underscores, or apostrophes (a fairly typical idea of a "word" in English) that does not have a / character both before and after it, i.e. that is not wrapped in "/":
\b([\w']+)\b(?<=(?<!/)\1|\1(?!/))
...and is the purest form, using only one character class to define "word" characters. It matches the example as follows:
Matched        Not Matched
-------------  -------------
The            character
_              used
underscore     variable
under          in
strike         programming
can            languages
be             character
in             stroke
names
many
while
the
slash
solidus
is
typically
not
allowed
If excluding /stroke/ is not desired, then adding a bit to the end limitation will allow it, depending upon how you want to define the beginning of a "next" word:
\b([\w']+)\b(?<=(?<!/)\1|\1(?!/([^\w])))
This changes (?!/) to (?!/([^\w])), which allows /something/ if it does have a letter, number, or underscore immediately after it. This would move stroke from the "Not Matched" to the "Matched" list above.
note: \w matches uppercase or lowercase letters, numbers and the underscore character
If you want to alter your concept for "word" from the above, simply exchange the characters and shorthand character classes contained in the [\w'] part of the expression to something like [a-zA-Z'] to exclude digits or [\w'-] to include hyphens, which would capture under-strike as a single match, rather than two separate matches:
\b([\w'-]+)\b(?<=(?<!/)\1|\1(?!/([^\w])))
IMPORTANT ALTERNATIVE!!! (I think)
I just thought of an alternative to matching any words that are not wrapped with / symbols: simply consume all of these symbols and the words that are surrounded by them (splitting). This has a few benefits: no lookaround means this could be used in more contexts (JavaScript did not support lookbehind before ES2018, and some flavors of regex don't support lookaround at all) while increasing efficiency; also, using a split expression means the direct result is a String array:
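string input = "The /character/ \"_\" (underscore/under-strike) can be..."; //etc...
// Non-capturing groups: with capturing groups, Regex.Split would also add the captured text to the result.
string[] resultsArray = Regex.Split(input, @"(?:[^\w'-]+?(?:/[\w]+/)?)+");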
voila!

Remove substring from a list of strings

I have a list of strings that contain banned words. What's an efficient way of checking if a string contains any of the banned words and removing it from the string? At the moment, I have this:
cleaned = String.Join(" ", str.Split().Where(b => !bannedWords.Contains(b,
StringComparer.OrdinalIgnoreCase)).ToArray());
This works fine for single banned words, but not for phrases (e.g. more than one word). Any instance of more than one word should also be removed. An alternative I thought of trying is to use the List's Contains method, but that only returns a bool and not an index of the matching word. If I could get an index of the matching word, I could just use String.Replace(bannedWords[i],"");
A simple String.Replace will not work as it will remove word parts. If "sex" is a banned word and you have the word "sextet", which is not banned, you should keep it as is.
Using Regex you can find whole words and phrases in a text with
string text = "A sextet is a musical composition for six instruments or voices.";
string word = "sex";
var matches = Regex.Matches(text, @"(?<=\b)" + word + @"(?=\b)");
The matches collection will be empty in this case.
You can use the Regex.Replace method
foreach (string word in bannedWords) {
    text = Regex.Replace(text, @"(?<=\b)" + word + @"(?=\b)", "");
}
Note: I used the following Regex pattern
(?<=prefix)find(?=suffix)
where 'prefix' and 'suffix' are both \b, which denotes word beginnings and ends.
If your banned words or phrases can contain special characters, it would be safer to escape them with Regex.Escape(word).
Using @zmbq's idea you could create a Regex pattern once with
string pattern =
    @"(?<=\b)(" +
    String.Join(
        "|",
        bannedWords
            .Select(w => Regex.Escape(w))
            .ToArray()) +
    @")(?=\b)";
var regex = new Regex(pattern); // Parsed once here; pass RegexOptions.Compiled to compile it to IL
and then apply it repeatedly to different texts with
string result = regex.Replace(text, "");
It doesn't work because you have conflicting definitions.
When you want to look for sub-sentences like more than one word you cannot split on whitespace anymore. You'll have to fall back on String.IndexOf()
If it's performance you're after, I assume you're not worried about one-time setup time, but rather about continuous performance. So I'd build one huge regular expression containing all the banned expressions and make sure it's compiled - that's as a setup.
Then I'd try to match it against the text, and replace every match with a blank or whatever you want to replace it with.
The reason for this, is that a big regular expression should compile into something comparable to the finite state automaton you would create by hand to handle this problem, so it should run quite nicely.
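A rough sketch of that setup: one alternation built from all banned words and phrases, compiled once and reused. It assumes a .NET 6+ console program with top-level statements; the helper names, the longest-first ordering, and the whitespace cleanup are illustrative choices, not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: one regex that matches any banned word or phrase
// as whole words. Longer entries come first so a phrase wins over its first word.
static Regex BuildBannedRegex(IEnumerable<string> banned)
{
    var alternatives = banned
        .Where(b => !string.IsNullOrWhiteSpace(b))
        .OrderByDescending(b => b.Length)
        .Select(b => Regex.Escape(b));
    string pattern = @"\b(?:" + string.Join("|", alternatives) + @")\b";
    return new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
}

// Removes every match, then collapses the gaps left behind into single spaces.
static string RemoveBanned(Regex bannedRegex, string text)
{
    string stripped = bannedRegex.Replace(text, "");
    return Regex.Replace(stripped, @"\s{2,}", " ").Trim();
}

Regex regex = BuildBannedRegex(new[] { "sex", "bad phrase" });
Console.WriteLine(RemoveBanned(regex, "A sextet is a bad phrase example of sex here"));
// prints: A sextet is a example of here
```

The Regex instance is the expensive part, so it would be created once at startup and then applied to each incoming text.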
Why don't you iterate through the list of banned words and look up each of them in the string by using the method string.IndexOf?
For example, you can remove the banned words and phrases with the following piece of code:
myForbWords.ForEach(delegate(string item) {
int occ = str.IndexOf(item);
if(occ > -1) str = str.Remove(occ, item.Length);
});
Type of myForbWords is List<string>.
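A variation on the same IndexOf idea that removes every occurrence rather than only the first one and ignores case, sketched as a .NET 6+ console program with top-level statements. The helper name RemoveAll is hypothetical, and like any plain substring search it also removes banned text inside longer words.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: repeatedly finds and removes each banned entry.
static string RemoveAll(string text, IEnumerable<string> banned)
{
    foreach (string item in banned)
    {
        if (string.IsNullOrEmpty(item)) continue; // an empty entry would loop forever

        int index;
        while ((index = text.IndexOf(item, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            text = text.Remove(index, item.Length);
        }
    }
    return text;
}

var forbidden = new List<string> { "bad phrase" };
string cleaned = RemoveAll("Bad phrase and bad phrase again", forbidden);
Console.WriteLine("[" + cleaned + "]");
// prints: [ and  again]
```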
