Get only Whole Words from a .Contains() statement

I've used .Contains() to check whether a sentence contains a specific word, but I found something odd:
I wanted to find out whether the word "hi" was present in each of the following sentences:
The child wanted to play in the mud
Hi there
Hector had a hip problem
if (sentence.Contains("hi"))
{
//
}
I only want the SECOND sentence to match, but all three match because "child" and "hip" both contain "hi". How do I use .Contains() so that only whole words are picked out?

Try using Regex:
if (Regex.Match(sentence, @"\bhi\b", RegexOptions.IgnoreCase).Success)
{
//
}
This works just fine for me on your input text.

Here's a Regex solution:
Regex has a word boundary anchor, \b.
Also, if the search string might come from user input, consider escaping it with Regex.Escape.
This example should filter a list of strings the way you want.
string findme = "hi";
string pattern = @"\b" + Regex.Escape(findme) + @"\b";
Regex re = new Regex(pattern,RegexOptions.IgnoreCase);
List<string> data = new List<string> {
"The child wanted to play in the mud",
"Hi there",
"Hector had a hip problem"
};
var filtered = data.Where(d => re.IsMatch(d));

You could split your sentence into words - you could split at each space and then trim any punctuation. Then check if any of these words are 'hi':
var punctuation = sentence.Where(Char.IsPunctuation).Distinct().ToArray();
var words = sentence.Split().Select(x => x.Trim(punctuation));
var containsHi = words.Contains("hi", StringComparer.OrdinalIgnoreCase);
See a working demo here: https://dotnetfiddle.net/AomXWx

You could write your own extension method for string like:
static class StringExtension
{
public static bool ContainsWord(this string s, string word)
{
string[] ar = s.Split(' ');
foreach (string str in ar)
{
if (str.ToLower() == word.ToLower())
return true;
}
return false;
}
}
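The extension method above splits only on spaces, so a word followed by punctuation ("Hi," or "hi.") would not match. A minimal sketch of a regex-based variant follows. The names WholeWordExtensions and ContainsWholeWord are made up for illustration, and the sketch assumes the search word starts and ends with a letter, digit, or underscore, since that is what \b treats as a word character.

```csharp
using System.Text.RegularExpressions;

public static class WholeWordExtensions
{
    // Hypothetical helper: true when 'word' occurs in 's' as a whole word, ignoring case.
    public static bool ContainsWholeWord(this string s, string word)
    {
        // Regex.Escape keeps characters such as '.' or '+' in 'word' literal.
        string pattern = @"\b" + Regex.Escape(word) + @"\b";
        return Regex.IsMatch(s, pattern, RegexOptions.IgnoreCase);
    }
}
```

With this in place, "Hi there".ContainsWholeWord("hi") is true, while the "child" and "hip" sentences from the question return false.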

Related

c# trying to change first letter to uppercase but doesn't work

I have to convert the first letter of every word the user inputs into uppercase. I don't think I'm doing it right, so it doesn't work, but I'm not sure where it has gone wrong D: Thank you in advance for your help! ^^
static void Main(string[] args)
{
Console.Write("Enter anything: ");
string x = Console.ReadLine();
string pattern = "^";
Regex expression = new Regex(pattern);
var regexp = new System.Text.RegularExpressions.Regex(pattern);
Match result = expression.Match(x);
Console.WriteLine(x);
foreach(var match in x)
{
Console.Write(match);
}
Console.WriteLine();
}

If your exercise isn't regex operations, there are built-in utilities to do what you are asking:
System.Globalization.TextInfo ti = System.Globalization.CultureInfo.CurrentCulture.TextInfo;
string titleString = ti.ToTitleCase("this string will be title cased");
Console.WriteLine(titleString);
Prints:
This String Will Be Title Cased
If your exercise is about regex, see this previous StackOverflow answer: Sublime Text: Regex to convert Uppercase to Title Case?

First of all, your Regex "^" matches the start of a line. If you need to match each word in a multi-word line, you'll need a different Regex, e.g. "[A-Za-z]".
You're also not doing anything to actually change the first letter to upper case. Note that strings in C# are immutable (they cannot be changed after creation), so you will need to create a new string which consists of the first letter of the original string, upper cased, followed by the rest of the string. Give that part a try on your own. If you have trouble, post a new question with your attempt.
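One way to build that new string is sketched below, assuming words are separated by single spaces. The class and method names are illustrative and do not come from the question.

```csharp
using System;
using System.Linq;

public static class Capitalizer
{
    // Returns a new string: first character upper-cased, the rest unchanged.
    public static string CapitalizeWord(string word)
    {
        if (string.IsNullOrEmpty(word))
            return word;
        return char.ToUpper(word[0]) + word.Substring(1);
    }

    // Applies CapitalizeWord to every space-separated word and rejoins them.
    public static string CapitalizeWords(string input)
    {
        return string.Join(" ", input.Split(' ').Select(w => CapitalizeWord(w)));
    }
}
```

For example, Capitalizer.CapitalizeWords("hello big world") returns "Hello Big World". The original string is never modified; each call produces a new one.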

string pattern = "(?:^|(?<= ))(.)";
^ doesn't capture anything by itself. You can replace the captured letter with its uppercase form by applying a function to $1. See the demo:
https://regex101.com/r/uE3cC4/29
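In C#, "applying a function to $1" can be done by passing a match evaluator to Regex.Replace. The following is a sketch only; the class name RegexTitleCase and method name Apply are assumptions made for the example.

```csharp
using System.Text.RegularExpressions;

public static class RegexTitleCase
{
    public static string Apply(string input)
    {
        // Group 1 is the first character of the string, or the character right after a space.
        // The anchor and the lookbehind are zero-width, so only that one character is replaced.
        return Regex.Replace(input, @"(?:^|(?<= ))(.)", m => m.Groups[1].Value.ToUpper());
    }
}
```

RegexTitleCase.Apply("hello world") returns "Hello World".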

I would approach this using extension methods.
PHP has a nice function called ucfirst.
So I translated that into C#:
public static class StringExtensions
{
public static string UcFirst(this string s)
{
// Nothing to capitalize in a null or empty string
if (string.IsNullOrEmpty(s))
return s;
return char.ToUpper(s[0]) + s.Substring(1);
}
}
Usage:
[Test]
public void UcFirst()
{
string s = "john";
s = s.UcFirst();
Assert.AreEqual("John", s);
}
Obviously you would still have to split your sentence into a list and call UcFirst for each item in the list.
Google "C# extension methods" if you need help with what is going on.

One more way to do it with regex:
string input = "this string will be title cased, even if there are.cases.like.that";
string output = Regex.Replace(input, @"(?<!\w)\w", m => m.Value.ToUpper());

I hope this may help
public static string CapsFirstLetter(string inputValue)
{
char[] values = new char[inputValue.Length];
int count = 0;
foreach (char f in inputValue){
if (count == 0){
values[count] = Convert.ToChar(f.ToString().ToUpper());
}
else{
values[count] = f;
}
count++;
}
return new string(values);
}

Extract sentence with a keyword

So I've got this question.
Write a program that extracts from a text all sentences that contain a particular word.
We accept that the sentences are separated from each other by the character "." and the words are separated from one another by a character which is not a letter.
Sample text:
We are living in a yellow submarine. We don't have anything else. Inside the submarine is very tight. So we are drinking all the day. We will move out of it in 5 days.
Sample result:
We are living in a yellow submarine.
We will move out of it in 5 days.
This is my code so far.
public static string Extract(string str, string keyword)
{
string[] arr = str.Split('.');
string answer = string.Empty;
foreach(string sentence in arr)
{
var iter = sentence.GetEnumerator();
while(iter.MoveNext())
{
if(iter.Current.ToString() == keyword)
answer += sentence;
}
}
return answer;
}
Well it does not work. I call it with this code:
string example = "We are living in a yellow submarine. We don't have anything else. Inside the submarine is very tight. So we are drinking all the day. We will move out of it in 5 days.";
string keyword = "in";
string answer = Extract(example, keyword);
Console.WriteLine(answer);
which does not output anything. It's probably the iterator part since I'm not familiar with iterators.
Anyhow, the hint for the question says we should use split and IndexOf methods.

sentence.GetEnumerator() is returning a CharEnumerator, so you're examining each character in each sentence. A single character will never be equal to the string "in", which is why it isn't working. You'll need to look at each word in each sentence and compare with the term you're looking for.

Try:
public static string Extract(string str, string keyword)
{
string[] arr = str.Split('.');
string answer = string.Empty;
foreach(string sentence in arr)
{
//Add any other required punctuation characters for splitting words in the sentence
string[] words = sentence.Split(new char[] { ' ', ',' });
if(words.Contains(keyword))
{
answer += sentence;
}
}
return answer;
}

Your code goes through each sentence character by character using the iterator. Unless the keyword is a single-character word (e.g. "I" or "a") there will be no match.
One way of solving this is to use LINQ to check if a sentence has the keyword, like this:
foreach(string sentence in arr)
{
if(sentence.Split(' ').Any(w => w == keyword))
answer += sentence+". ";
}
Another approach would be using regular expressions to check for matches only on word boundaries. Note that you cannot use a plain Contains method, because doing so results in "false positives" (i.e. finding sentences where the keyword is embedded inside a longer word).
Another thing to note is the use of += for concatenation. This approach is inefficient, because many temporary throw-away objects get created. A better way of achieving the same result is using StringBuilder.
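The two suggestions above (word-boundary regex and StringBuilder) can be combined as in the sketch below. The class name SentenceExtractor is an assumption for the example; the matching is case-sensitive, as in the question, and each hit is written on its own line.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class SentenceExtractor
{
    public static string Extract(string text, string keyword)
    {
        // \b anchors keep "in" from matching inside "living" or "drinking".
        var wholeWord = new Regex(@"\b" + Regex.Escape(keyword) + @"\b");
        var answer = new StringBuilder();
        foreach (string sentence in text.Split('.'))
        {
            if (wholeWord.IsMatch(sentence))
                answer.Append(sentence.Trim()).AppendLine(".");
        }
        return answer.ToString();
    }
}
```

Called with the sample text and the keyword "in", this returns the first and last sentences, which matches the sample result in the question.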

string input = "We are living in a yellow submarine. We don't have anything else. Inside the submarine is very tight. So we are drinking all the day. We will move out of it in 5 days.";
var lookup = input.Split('.')
.Select(s => s.Split().Select(w => new { w, s }))
.SelectMany(x => x)
.ToLookup(x => x.w, x => x.s);
foreach(var sentence in lookup["in"])
{
Console.WriteLine(sentence);
}

I would split the input at the periods and then search each sentence for the given word.
string metin = "We are living in a yellow submarine. We don't have anything else. Inside the submarine is very tight. So we are drinking all the day. We will move out of it in 5 days.";
string[] metinDizisi = metin.Split('.');
string answer = string.Empty;
for (int i = 0; i < metinDizisi.Length; i++)
{
if (metinDizisi[i].Contains(" in "))
{
answer += metinDizisi[i];
}
}
Console.WriteLine(answer);

You can use sentence.Contains(keyword) to check if the string has the word you are looking for.
public static string Extract(string str, string keyword)
{
string[] arr = str.Split('.');
string answer = string.Empty;
foreach(string sentence in arr)
if(sentence.Contains(keyword))
answer+=sentence;
return answer;
}

You could split on the period to get a collection of sentences, then filter those with a regex containing the keyword.
var results = example.Split('.')
.Where(s => Regex.IsMatch(s, String.Format(@"\b{0}\b", keyword)));

Match and split string with regex

I want to validate an input string against a regular expression and then split it.
The input string can be any combination of the letter A and letter A followed by an exclamation mark. For example these are valid input strings: A, A!, AA, AA!, A!A, A!A!, AAA, AAA!, AA!A, A!AA, ... Any other characters should yield an invalid match.
My code would probably look something like this:
public string[] SplitString(string s)
{
Regex regex = new Regex(@"...");
if (!regex.IsMatch(s))
{
throw new ArgumentException("Wrong input string!");
}
return regex.Split(s);
}
What should my regex look like?
Edit - some examples:
input string "AAA", function should return an array of 3 strings ("A", "A", "A")
input string "A!AAA!", function should return an array of 4 strings ("A!", "A", "A", "A!")
input string "AA!b", function should throw an ArgumentException

Doesn't seem like a Regex is a good plan here. Have a look at this:
private bool ValidString(string myString)
{
char[] validChars = new char[] { 'A', '!' };
if (!myString.StartsWith("A"))
return false;
if (myString.Contains("!!"))
return false;
foreach (char c in myString)
{
if (!validChars.Contains(c))
return false;
}
return true;
}
private List<string> SplitMyString(string myString)
{
List<string> resultList = new List<string>();
if (ValidString(myString))
{
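string resultString = "";
foreach (char c in myString)
{
if (c == 'A')
{
// A new token starts at every 'A'; flush the previous one first
if (resultString.Length > 0)
resultList.Add(resultString);
resultString = "A";
}
else // c == '!'
{
resultList.Add(resultString + c);
resultString = "";
}
}
// Flush a trailing "A" that was not followed by '!'
if (resultString.Length > 0)
resultList.Add(resultString);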
}
return resultList;
}
The reason for Regex not being a good plan is that you can write the logic out in a few simple if-statements that compile and function a lot faster and cheaper. Also Regex isn't so good at repeating patterns for an unlimited length string. You'll either end up writing a long Regex or something illegible.
EDIT
At the end of my code you will have either a List<string> with the split input string, as in your question, or an empty List<string>. You can adjust it a little to throw an ArgumentException if that requirement is very important to you. Alternatively you can check Count on the list to see whether it was successful.

Regex regex = new Regex(@"^(A!|A)+$");
Edit:
Use something like http://gskinner.com/RegExr/ to play with Regular Expressions
Edit after comment:
OK, you have made it a bit clearer what you want. Don't approach it like that: a pattern that matches the entire input cannot also be used to split it, because the only match would be the entire input. Either use a separate regular expression for the split part, or use groups to get the matched values.
Example:
//Initial match part
Regex regex2 = new Regex(@"(A!)|(A)");
return regex2.Split(s);
And again, regular expressions are not always the answer. See how this might impact your application.
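As a sketch of the "separate expressions" idea, one pattern can validate the whole input and a second can collect the tokens with Regex.Matches instead of Split, which avoids the empty entries Split produces. The class name ASplitter is an assumption for the example, and \z is used instead of $ so that a trailing newline is rejected too.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ASplitter
{
    public static string[] SplitString(string s)
    {
        // Validate the whole input: one or more "A" tokens, each optionally followed by '!'.
        if (!Regex.IsMatch(s, @"^(?:A!?)+\z"))
            throw new ArgumentException("Wrong input string!");

        // Collect each token as its own match instead of splitting.
        return Regex.Matches(s, @"A!?").Cast<Match>().Select(m => m.Value).ToArray();
    }
}
```

For "A!AAA!" this returns the four strings "A!", "A", "A", "A!", and for "AA!b" it throws an ArgumentException, as the question's examples require.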

You could try something like:
Regex regex = new Regex(@"^[A!]+$");

((A+!?)+)
Try looking at Expresso http://www.ultrapico.com/Expresso.htm or Rad Software Regular Expression Designer http://www.radsoftware.com.au/regexdesigner/ for designing and testing regular expressions.

I think I have a solution that satisfies all examples. I've had to break it into two regular expressions (which I don't like)...
public string[] SplitString(string s)
{
Regex regex = new Regex(@"^[A!]+$");
if (!regex.IsMatch(s))
{
throw new ArgumentException("Wrong input string!");
}
return Regex.Split(s, @"(A!?)").Where(x => !string.IsNullOrEmpty(x)).ToArray();
}
Note the use of LINQ, which is required to remove the empty matches.

C# find exact-match in string

How can I search for an exact match in a string? For example, If I had a string with this text:
label
label:
labels
And I search for label, I only want to get the first match, not the other two. I tried the Contains and IndexOf method, but they also give me the 2nd and 3rd matches.

You can use a regular expression like this:
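bool contains = Regex.IsMatch("Hello1 Hello2", @"(^|\s)Hello(\s|$)"); // yields false
contains = Regex.IsMatch("Hello1 Hello", @"(^|\s)Hello(\s|$)"); // yields true
The (^|\s) and (\s|$) groups require whitespace or the start or end of the string on each side of the word, so only whole words match. A plain \b word boundary check would also accept "label:", because the colon counts as a boundary.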
I think the regex version should be faster than Linq.
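To make the difference between the two kinds of whole-word test concrete, here is a small sketch that runs both against the three lines from the question. The class and method names are assumptions made for the example.

```csharp
using System.Text.RegularExpressions;

public static class ExactMatchDemo
{
    // Whole-word test using \b: punctuation next to the word counts as a boundary.
    public static bool MatchesWordBoundary(string line, string word)
    {
        return Regex.IsMatch(line, @"\b" + Regex.Escape(word) + @"\b");
    }

    // Stricter test: the word must be surrounded by whitespace or the string's start/end.
    public static bool MatchesWhitespaceDelimited(string line, string word)
    {
        return Regex.IsMatch(line, @"(^|\s)" + Regex.Escape(word) + @"(\s|$)");
    }
}
```

Both methods accept "label" and reject "labels". They differ on "label:": the \b version accepts it, the whitespace-delimited version rejects it, which is the behaviour the question asks for.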

You can split the string (here a space is the right separator, but that depends on your data) and then use the Equals method to look for a match, e.g.:
private Boolean findString(String baseString, String stringToFind, String separator)
{
foreach (String str in baseString.Split(separator.ToCharArray()))
{
if(str.Equals(stringToFind))
{
return true;
}
}
return false;
}
It can be used like this:
findString("Label label Labels:", "label", " ");

It seems you've got a delimiter (crlf) between the words so you could include the delimiter as part of the search string.
If not then I'd go with Liviu's suggestion.

You could try a LINQ version:
string str = "Hello1 Hello Hello2";
string another = "Hello";
string retVal = str.Split(" \n\r".ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
.First(p => p.Equals(another));

Increasing Regex Efficiency

I have about 100k Outlook mail items that have about 500-600 chars per Body. I have a list of 580 keywords that I must search for in each body, then append the words found at the bottom.
I believe I've increased the efficiency of the majority of the function, but it still takes a lot of time. Even for 100 emails it takes about 4 seconds.
I run two functions for each keyword list (290 keywords each list).
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
foreach (string currWord in _keywordList)
{
bool isMatch = Regex.IsMatch(nSearch.InnerHtml, "\\b" + currWord + "\\b",
RegexOptions.IgnoreCase);
if (isMatch)
{
wordFound.Add(currWord);
}
}
return wordFound;
}
Is there any way I can increase the efficiency of this function?
The other thing that might be slowing it down is that I use HTML Agility Pack to navigate through some nodes and pull out the body (nSearch.InnerHtml). The _keywordList is a List item, and not an array.

I assume that the COM call nSearch.InnerHtml is pretty slow and you repeat the call for every single word that you are checking. You can simply cache the result of the call:
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
// cache inner HTML
string innerHtml = nSearch.InnerHtml;
foreach (string currWord in _keywordList)
{
bool isMatch = Regex.IsMatch(innerHtml, "\\b" + currWord + "\\b",
RegexOptions.IgnoreCase);
if (isMatch)
{
wordFound.Add(currWord);
}
}
return wordFound;
}
Another optimization would be the one suggested by Jeff Yates. E.g. by using a single pattern:
string pattern = @"(\b(?:" + string.Join("|", _keywordList) + @")\b)";

I don't think this is a job for regular expressions. You might be better off searching each message word by word and checking each word against your word list. With the approach you have, you're searching each message n times where n is the number of words you want to find - it's no wonder that it takes a while.
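A sketch of that word-by-word approach follows. It assumes every keyword is a single word and that the body is plain text (any HTML tags would be tokenized as words too). The names WordListSearch and FindKeywords are hypothetical, and a regex is used here only to tokenize the body, not to search for keywords.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordListSearch
{
    // 'keywords' should be built once with StringComparer.OrdinalIgnoreCase and reused for every message.
    public static List<string> FindKeywords(string body, HashSet<string> keywords)
    {
        var found = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        // \w+ pulls out runs of letters, digits, and underscores as candidate words.
        foreach (Match m in Regex.Matches(body, @"\w+"))
        {
            if (keywords.Contains(m.Value))
                found.Add(m.Value);
        }
        return new List<string>(found);
    }
}
```

Each message is scanned once, and every word costs one hash lookup, so the running time no longer grows with the number of keywords.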

Most of the time comes from matches that fail, so you want to minimize failures.
If the search keywords are infrequent, you can test for all of them at once (with the regexp \b(aaa|bbb|ccc|....)\b) and exclude the emails with no matches. For the ones that have at least one match, you then do a thorough search.

One thing you can easily do is match against all the words in one go by building an expression like:
\b(?:word1|word2|word3|....)\b
Then you can precompile the pattern and reuse it to look up all occurrences for each email (not sure how you do this with the .NET API, but there must be a way).
Another thing is instead of using the ignorecase flag, if you convert everything to lowercase, that might give you a small speed boost (need to profile it as it's implementation dependent). Don't forget to warm up the CLR when you profile.
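In .NET, one way to precompile and reuse a single combined pattern is to construct the Regex once with RegexOptions.Compiled and keep the instance around. The sketch below assumes the keywords are literal strings that start and end with word characters; the class name KeywordMatcher is made up for the example, and whether Compiled actually helps should be confirmed by profiling.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public sealed class KeywordMatcher
{
    private readonly Regex _regex;

    public KeywordMatcher(IEnumerable<string> keywords)
    {
        // Escape each keyword so it is matched literally, then join into one alternation.
        string alternation = string.Join("|", keywords.Select(k => Regex.Escape(k)));
        _regex = new Regex(@"\b(?:" + alternation + @")\b",
            RegexOptions.IgnoreCase | RegexOptions.Compiled);
    }

    // Returns each keyword found in 'text' once, in the casing used by the text.
    public List<string> Find(string text)
    {
        return _regex.Matches(text).Cast<Match>()
            .Select(m => m.Value)
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

Create one KeywordMatcher per keyword list and call Find for every email body, so the pattern is built and compiled only once.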

This may be faster. You can leverage Regex Groups like this:
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
// cache inner HTML
string innerHtml = nSearch.InnerHtml;
string pattern = "(\\b" + string.Join("\\b)|(\\b", _keywordList) + "\\b)";
Regex myRegex = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection myMatches = myRegex.Matches(innerHtml);
foreach (Match myMatch in myMatches)
{
// Group 0 represents the entire match so we skip that one
for (int i = 1; i < myMatch.Groups.Count; i++)
{
if (myMatch.Groups[i].Success)
wordFound.Add(_keywordList[i-1]);
}
}
return wordFound;
}
This way you're only using one regular expression. And the indices of the Groups should correlate with your _keywordList by an offset of 1, hence the line wordFound.Add(_keywordList[i-1]);
UPDATE:
After looking at my code again I just realized that putting the matches into Groups is really unnecessary, and Regex Groups have some overhead. Instead, you could remove the capturing parentheses from the pattern and simply add the matches themselves to the wordFound list. This would produce the same effect, but it'd be faster.
It'd be something like this:
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
// cache inner HTML
string innerHtml = nSearch.InnerHtml;
string pattern = "\\b(?:" + string.Join("|", _keywordList) + ")\\b";
Regex myRegex = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection myMatches = myRegex.Matches(innerHtml);
foreach (Match myMatch in myMatches)
{
wordFound.Add(myMatch.Value);
}
return wordFound;
}

Regular expressions can be optimized quite a bit when you just want to match against a fixed set of constant strings. Instead of several matches, e.g. against "winter", "win" or "wombat", you can just match against "w(in(ter)?|ombat)", for example (Jeffrey Friedl's book can give you lots of ideas like this). This kind of optimisation is also built into some programs, notably emacs ('regexp-opt'). I'm not too familiar with .NET, but I assume someone has programmed similar functionality - google for "regexp optimization".

If the regular expression is indeed the bottle neck, and even optimizing it (by concatenating the search words to one expression) doesn’t help, consider using a multi-pattern search algorithm, such as Wu-Manber.
I’ve posted a very simple implementation here on Stack Overflow. It’s written in C++ but since the code is straightforward it should be easy to translate it to C#.
Notice that this will find words anywhere, not just at word boundaries. However, this can be easily tested after you’ve checked whether the text contains any words; either once again with a regular expression (now you only test individual emails – much faster) or manually by checking the characters before and after the individual hits.
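The manual check of the characters before and after a hit could look like the sketch below. It uses String.IndexOf to find hits rather than a multi-pattern algorithm, purely to keep the example short, and it treats any letter or digit as part of a word. The names BoundaryCheck and ContainsWholeWord are assumptions.

```csharp
using System;

public static class BoundaryCheck
{
    // True when 'word' occurs in 'text' with no letter or digit directly before or after it.
    public static bool ContainsWholeWord(string text, string word)
    {
        if (string.IsNullOrEmpty(word))
            return false;
        int index = text.IndexOf(word, StringComparison.OrdinalIgnoreCase);
        while (index >= 0)
        {
            int end = index + word.Length;
            bool startOk = index == 0 || !char.IsLetterOrDigit(text[index - 1]);
            bool endOk = end == text.Length || !char.IsLetterOrDigit(text[end]);
            if (startOk && endOk)
                return true;
            // This hit was embedded in a longer word; keep looking after it.
            index = text.IndexOf(word, index + 1, StringComparison.OrdinalIgnoreCase);
        }
        return false;
    }
}
```

The same two boundary tests can be applied to each hit position reported by whichever search algorithm is used.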

If your problem is about searching for Outlook items containing a certain string, you should gain from using Outlook's own search facilities...
see:
http://msdn.microsoft.com/en-us/library/bb644806.aspx

If your keywords are straight literals, i.e. they do not contain further regex patterns, then another method may be more appropriate. The following code demonstrates one such method. It goes through each email only once, whereas your code went through each email 290 times (twice).
public List<string> FindKeywords(string emailbody, List<string> keywordList)
{
// may want to clean up the input a bit, such as replacing '.' and ',' with a space
// and remove double spaces
string[] emailBodyWords = emailbody.Split(' ');
// Intersect keeps the keywords that also appear as words in the body, ignoring case
return keywordList.Intersect(emailBodyWords, StringComparer.OrdinalIgnoreCase).ToList();
}

If you can use .Net 3.5+ and LINQ you could do something like this.
public static class HtmlNodeTools
{
public static IEnumerable<string> MatchedKeywords(
this HtmlNode nSearch,
IEnumerable<string> keywordList)
{
//// as regex
//var innerHtml = nSearch.InnerHtml;
//return keywordList.Where(kw =>
// Regex.IsMatch(innerHtml,
// @"\b" + kw + @"\b",
// RegexOptions.IgnoreCase)
// );
//would be faster if you don't need the pattern matching
// pad both the text and each keyword with spaces so only space-delimited words match
var innerHtml = " " + nSearch.InnerHtml + " ";
return keywordList.Where(kw => innerHtml.Contains(" " + kw + " "));
}
}
class Program
{
static void Main(string[] args)
{
var keyworkList = new string[] { "hello", "world", "nomatch" };
var h = new HtmlNode()
{
InnerHtml = "hi there hello other world"
};
var matched = h.MatchedKeywords(keyworkList).ToList();
//hello, world
}
}
... reused regex example ...
public static class HtmlNodeTools
{
public static IEnumerable<string> MatchedKeywords(
this HtmlNode nSearch,
IEnumerable<KeyValuePair<string, Regex>> keywordList)
{
// as regex
var innerHtml = nSearch.InnerHtml;
return from kvp in keywordList
where kvp.Value.IsMatch(innerHtml)
select kvp.Key;
}
}
class Program
{
static void Main(string[] args)
{
var keyworkList = new string[] { "hello", "world", "nomatch" };
var h = new HtmlNode()
{
InnerHtml = "hi there hello other world"
};
var keyworkSet = keyworkList.Select(kw =>
new KeyValuePair<string, Regex>(kw,
new Regex(
@"\b" + kw + @"\b",
RegexOptions.IgnoreCase)
)
).ToArray();
var matched = h.MatchedKeywords(keyworkSet).ToList();
//hello, world
}
}
