I want to find a word in a sentence together with everything attached to it, i.e. everything up to the first space before and after the search word.
For example, "This is anexampleof what I want" should return "anexampleof" if my search word is "example".
I now have this regex "(?:^|\S*\s*)\S*" + searchword + "\S*(?:$|\s*\S*)" but it gives me an extra word at the beginning and at the end.
'This is anexampleof what I want' --> returns 'is anexampleof what'
I tried to change the regex, but I'm not good at regular expressions at all.
I'm using C#. Thanks for the help.
Full C# code:
MatchCollection m1 = Regex.Matches(content, @"(?:^|\S*\s*)\S*" + searchword + @"\S*(?:$|\s*\S*)",
RegexOptions.IgnoreCase | RegexOptions.Multiline);
You can simply leave out the non-capturing groups at both ends:
@"\S*" + searchword + @"\S*";
Due to greediness you will get as many non-space characters on each side as possible.
Also, non-capturing groups do not exclude their contents from the match. All they do is suppress the capture of the sub-match. If you want to check that something is there, but don't want to include it in the match, you want lookarounds:
@"(?<=^|\S*\s*)\S*" + searchword + @"\S*(?=$|\s*\S*)"
However, these lookarounds don't really do anything in this case, because \s*\S* is satisfied with an empty string (the * makes both elements optional). But for future reference: if you want to make assertions at the boundaries of your match that should not be part of the match, lookarounds are the way to go.
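A minimal sketch of the simplified pattern, assuming the search word should be treated as literal text (hence the Regex.Escape call, which is not in the original code). The class and method names are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SurroundingToken
{
    // Returns every whitespace-delimited token that contains searchWord.
    public static List<string> Find(string content, string searchWord)
    {
        // \S* is greedy, so it extends to the nearest whitespace on each side.
        // Regex.Escape is an added assumption: the search word is literal text.
        string pattern = @"\S*" + Regex.Escape(searchWord) + @"\S*";
        return Regex.Matches(content, pattern, RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }
}
```

Calling SurroundingToken.Find("This is anexampleof what I want", "example") yields a single item, "anexampleof", without the neighbouring words.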
In C#, I want to use a regular expression to match any of these words:
string keywords = "(shoes|shirt|pants)";
I want to find the whole words in the content string. I thought this regex would do that:
if (Regex.Match(content, keywords + "\\s+",
RegexOptions.Singleline | RegexOptions.IgnoreCase).Success)
{
//matched
}
but it returns true for words like participants, even though I only want the whole word pants.
How do I match only those literal words?
You should add the word delimiter to your regex:
\b(shoes|shirt|pants)\b
In code:
Regex.Match(content, @"\b(shoes|shirt|pants)\b");
Try
Regex.Match(content, @"\b" + keywords + @"\b", RegexOptions.Singleline | RegexOptions.IgnoreCase)
\b matches on word boundaries. See here for more details.
You need a zero-width assertion on either side, checking that the characters before and after the word are not part of a word:
(?<=\W|^)(shoes|shirt|pants)(?=\W|$)
As others suggested, \b will work in place of (?<=\W|^) and (?=\W|$), including when the word is at the beginning or end of the input string.
Put a word boundary on each side of it using the \b metasequence.
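To illustrate the \b approach from the answers above, here is a small sketch. The helper name ContainsKeyword is hypothetical, and it assumes keywords is already a valid alternation such as "(shoes|shirt|pants)".

```csharp
using System.Text.RegularExpressions;

public static class WholeWord
{
    // keywords is expected to be an alternation such as "(shoes|shirt|pants)".
    public static bool ContainsKeyword(string content, string keywords)
    {
        // \b on both sides forces the match to start and end at a word boundary,
        // so "pants" inside "participants" is rejected.
        return Regex.IsMatch(content, @"\b" + keywords + @"\b", RegexOptions.IgnoreCase);
    }
}
```

With keywords set to "(shoes|shirt|pants)", the input "All participants welcome" gives false, while "Pants on sale" and the bare string "pants" both give true, which shows that \b also holds at the start and end of the input.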
I have a text file which consists of:
stemmed words (e.g. manipulat - stemmed from "manipulating"), and
stemmed phrases which are usually two words or more (e.g. "acknowledg him regard the invest" - stemmed from "acknowledging him regarding the investment").
Each word/phrase is on its own line. My C# code reads each line of this text file and, for each line, searches all rows in the DataTable for a match; i.e. if a word/phrase appears in any row of the DataTable, my system flags that row.
For single word, it's easily done/matched using the algorithm I have. I can match "manipulat" to words like "manipulate", "manipulating", "manipulated" and "manipulation" if they appear in the DataTable rows.
But for phrases, my algorithm can only match the exact text. That is, if my phrase is "acknowledg him regard the invest", it will only search for that exact phrase, and it won't match/flag a row if "acknowledging him regarding the investment" exists in the DataTable.
I have very little knowledge of both Regex and C#. I tried to modify the code below to use wildcards, but no luck so far. I would appreciate it if anyone could help with this. Thank you in advance.
string[] words = File.ReadAllLines(sourceDirTemp + comboBox_filename.SelectedItem.ToString() + ".txt");
var query = LoadComments().AsEnumerable().Where(r =>
words.Any(wordOrPhrase => Regex.IsMatch(r.Field<string>("Column_name"), @"\b"
+ Regex.Escape(wordOrPhrase) + @"\b", RegexOptions.IgnoreCase)));
When comparing the lines with the stem-words from your database using
RegEx you could extend your pattern in your code.
This will match 1 or more occurrences of any word character
\w+
This will match 0 or more occurrences of any word character
\w*
As Abbodanza already mentioned, this will match 0 or more occurrences of any character between a and z:
[a-z]*
EDIT:
If your algorithm works for single words you could split each phrase
string[] words = File.ReadAllLines(sourceDirTemp + comboBox_filename.SelectedItem.ToString() + ".txt");
foreach (var word in words)
{
    // moreOrOneWord.Length would allow you to check whether it is a phrase
    string[] moreOrOneWord = word.Split(' ');
    var query = LoadComments().AsEnumerable().Where(r =>
        moreOrOneWord.Any(wordOrPhrase => Regex.IsMatch(r.Field<string>("Column_name"), @"\b"
            + Regex.Escape(wordOrPhrase) + @"\b", RegexOptions.IgnoreCase)));
    // Do something with the query...
}
This should allow you to apply your algorithm to every single word in the text.
here you can find an example to start with regular expression.
and here is a List of RegEx elements that you can use.
Hope this can help
If you split the wordOrPhrase with space, and add \w* to match 0+ alphanumeric or underscore chars (or more specific pattern to only match letters like [\p{L}\p{M}]*) to each chunk, you could use
Regex.IsMatch(r.Field<string>("Column_name"),
string.Join(" +", wordOrPhrase.Split()
.Select(p => string.Format(@"\b{0}\w*\b", Regex.Escape(p)))),
RegexOptions.IgnoreCase)
If wordOrPhrase is "acknowledg him regard the invest", the regex will be \backnowledg\w*\b +\bhim\w*\b +\bregard\w*\b +\bthe\w*\b +\binvest\w*\b and will find a match. See this IDEONE demo.
However, with this approach, himself will get matched with him (that would be turned into him\w*).
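The pattern construction described above can be sketched as a pair of helpers. The names StemPhrase, BuildPattern and IsMatch are illustrative only, and the sketch assumes stems within a phrase are separated by plain spaces.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class StemPhrase
{
    // Turns "acknowledg him" into \backnowledg\w*\b +\bhim\w*\b
    public static string BuildPattern(string wordOrPhrase)
    {
        var chunks = wordOrPhrase
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            // Each stem may be followed by any number of word characters.
            .Select(p => @"\b" + Regex.Escape(p) + @"\w*\b");
        // Consecutive words must be separated by one or more spaces.
        return string.Join(" +", chunks);
    }

    public static bool IsMatch(string text, string wordOrPhrase)
    {
        return Regex.IsMatch(text, BuildPattern(wordOrPhrase), RegexOptions.IgnoreCase);
    }
}
```

StemPhrase.IsMatch("acknowledging him regarding the investment", "acknowledg him regard the invest") returns true. The caveat mentioned above also shows up here: a row containing "acknowledging himself regarding the investment" matches as well, because him is expanded to him\w*.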
I need to write a regex that matches the words versus, verse, vs., v. and v, but v should not match when it is part of another word.
@"\b((.*?)" + Regex.Unescape(xz) + @"[.,:/s]?)\b", RegexOptions.IgnoreCase))
Here I'll pass the array and test it.
This will get you close:
\b(?:(?:vers(?:e|us)|v)\b|vs\.|v\.)
One difficulty is in word boundaries vs. (heh) words that end in a period. See Regex using word boundary but word ends with a . (period) for other options.
Note that "verse" can also mean "poetry" so there could be false positives.
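A quick way to try the suggested pattern is sketched below. The wrapper class is hypothetical, and RegexOptions.IgnoreCase is carried over from the question rather than from the answer.

```csharp
using System.Text.RegularExpressions;

public static class VersusMatcher
{
    // Pattern taken from the answer above; IgnoreCase comes from the question.
    private static readonly Regex Versus = new Regex(
        @"\b(?:(?:vers(?:e|us)|v)\b|vs\.|v\.)", RegexOptions.IgnoreCase);

    public static bool ContainsVersus(string text)
    {
        return Versus.IsMatch(text);
    }
}
```

"Smith v Jones", "Smith vs. Jones" and "Smith versus Jones" all match, while "a very vivid view" does not, since none of its v characters forms a word on its own. Note that a bare "vs" without the period is not matched by this pattern.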
I got the following regex call -
MatchCollection matches = Regex.Matches(text, @"( And )|( Or )|( Not )");
I got a problem with a string like this - " And Or Not "
Only "And" will be matched but "Or" and "Not" will not be, just because they are not the first word.
The reason as far as I understand is because the first match is " And " including the trailing white space, because of that the Regex does not recognize it as a potential white space for the next match, and ignores it, just because it was part of the first match.
So if, for example, this was my string instead (two spaces between the words) - " And  Or  Not " - every word would have been matched.
Is there a way to somehow instruct the Regex to share the matched white-spaces between the matches?
Thanks!
Instead of looking explicitly for whitespace, you should look for word boundaries.
Just had to go look it up, but apparently, \b would be what you're looking for, e.g.:
@"(\bAnd\b)|(\bOr\b)|(\bNot\b)"
(Or, as @stema points out):
@"\b(And|Or|Not)\b"
The problem is that once a whitespace character has been matched, the regex continues after the end of the last match; the whitespace is, so to speak, "gone", because it has already been consumed.
What you can do is to use a lookahead, like this:
MatchCollection matches = Regex.Matches(s, @" (?:And|Or|Not)(?= )");
The lookahead does not match the space; it only looks ahead to check that a space follows. The expression will not match if there is no space.
But the results in your MatchCollection will not have this space at the end!
I would simplify the expression a little and use a look-ahead assertion (match something but don't make it part of the capture):
string text = " And Or Not ";
foreach (Match m in Regex.Matches(text, @"\s(And|Or|Not)(?=\s)")) {
Console.WriteLine(m.Value);
}
(note: I'm using \s instead of spaces)
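The difference between consuming the trailing space and only asserting it can be seen side by side in this sketch. The class and method names are invented for the comparison.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SharedWhitespace
{
    // Trailing space is part of the match, so it is unavailable to the next match.
    public static List<string> Consuming(string text)
    {
        return Regex.Matches(text, @" (?:And|Or|Not) ")
                    .Cast<Match>().Select(m => m.Value.Trim()).ToList();
    }

    // Trailing space is only asserted by the lookahead, so it can start the next match.
    public static List<string> WithLookahead(string text)
    {
        return Regex.Matches(text, @" (?:And|Or|Not)(?= )")
                    .Cast<Match>().Select(m => m.Value.Trim()).ToList();
    }
}
```

For the single-spaced input " And Or Not ", the consuming version skips "Or", because the space in front of it was already taken by the previous match, while the lookahead version returns all three keywords.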
I have a parsing question. I have a paragraph which contains instances of ":  word  ". So basically each instance is a colon, two spaces, a word (could be anything), then two more spaces.
When I find those instances, I want to convert the string so that I have:
1. A new line character after the colon and after the word.
2. The double space after the word removed.
3. All remaining double spaces replaced with new line characters.
I don't know exactly how to go about this. I'm using C#. Point 2 above is the one I'm having a hard time with.
Thanks
Assuming your original string is exactly in the form you described, this will do:
var newString = myString.Trim().Replace("  ", "\n");
The Trim() removes leading and trailing whitespace, taking care of the spaces at the end of the string.
Then, the Replace replaces each remaining "  " (two space characters) with a "\n" new line character.
The result is assigned to the newString variable. This is needed, as myString will not change - as strings in .NET are immutable.
I suggest you read up on the String class and all its methods and properties.
You can try
var str = ":  first  :  second  ";
var result = Regex.Replace(str, ":\\s{2}(?<word>[a-zA-Z0-9]+)\\s{2}",
":\n${word}\n");
Using RegularExpressions will give you exact matches on what you are looking for.
The regex match for a colon, two spaces, a word, then two more spaces is:
Dim reg As New Regex(":  [a-zA-Z]*  ")
[a-zA-Z] will look for any character within the alphabetical range. You can append 0-9 as well if you accept numbers within the word. The * afterwards indicates that there can be 0 or more instances of the preceding value.
[a-zA-Z]* will attempt to do a full match of any set of contiguous alpha characters.
Upon further reading, you may use [\w] in place of [a-zA-Z0-9] if that's what you are looking for. This will match any 'word' character.
source: http://msdn.microsoft.com/en-us/library/ms972966.aspx
You can retrieve all the matches using reg.Matches(inputString).
Review http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.replace.aspx for more information on regular expression replacements and your options from there out
edit: Before I was using \s to search for spaces. This will match any whitespace character including tabs, new lines and other. That is not what we want, so I reverted it back to search for exact space characters.
You can use string.TrimEnd - http://msdn.microsoft.com/en-us/library/system.string.trimend.aspx - to trim spaces at the end of the string.
The following is an example using Regular Expressions. See also this question for more info.
Basically the pattern string tells the regex to look for a colon followed by two spaces. Then we save in a capture group named "word" whatever the word is surrounded by two spaces on either side. Finally two more spaces are specified to finish the pattern.
The replace uses a lambda which says for every match, replace it with a colon, a new line, the "lone" word, and another newline.
string Paragraph = "Jackdaws love my big sphinx of quartz:  fizz  The quick onyx goblin jumps over the lazy dwarf. Where:  buzz  The crazy dogs.";
string Pattern = @":  (?<word>\S*)  ";
string Result = Regex.Replace(Paragraph, Pattern, m =>
{
var LoneWord = m.Groups[1].Value;
return #":" + Environment.NewLine + LoneWord + Environment.NewLine;
},
RegexOptions.IgnoreCase);
Input
Jackdaws love my big sphinx of quartz:  fizz  The quick onyx goblin jumps over the lazy dwarf. Where:  buzz  The crazy dogs.
Output
Jackdaws love my big sphinx of quartz:
fizz
The quick onyx goblin jumps over the lazy dwarf. Where:
buzz
The crazy dogs.
Note, for item 3 on your list, if you also want to replace individual occurrences of two spaces with newlines, you could do this:
Result = Result.Replace("  ", Environment.NewLine);
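Putting the two steps together, a compact sketch might look like the following. The class name is hypothetical, \S+ is used so that an empty "word" is not accepted, and "\n" replaces Environment.NewLine only to keep the result identical across platforms.

```csharp
using System.Text.RegularExpressions;

public static class ColonWordFormatter
{
    public static string Format(string paragraph)
    {
        // Step 1: colon, two spaces, a run of non-space characters, two spaces.
        // The word is kept and followed by a newline instead of the double space.
        string step1 = Regex.Replace(paragraph, @":  (?<word>\S+)  ",
            m => ":\n" + m.Groups["word"].Value + "\n");

        // Step 2: every remaining double space becomes a newline.
        return step1.Replace("  ", "\n");
    }
}
```

For the input "a:  fizz  The quick.  End", the result has four lines: "a:", "fizz", "The quick." and "End".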