Extract sentence with a keyword - c#

So I've got this question.
Write a program that extracts from a text all sentences that contain a particular word.
We accept that the sentences are separated from each other by the character "." and the words are separated from one another by a character which is not a letter.
Sample text:
We are living in a yellow submarine. We don't have anything else. Inside the submarine is very tight. So we are drinking all the day. We will move out of it in 5 days.
Sample result:
We are living in a yellow submarine.
We will move out of it in 5 days.
This is my code so far.
public static string Extract(string str, string keyword)
{
string[] arr = str.Split('.');
string answer = string.Empty;
foreach(string sentence in arr)
{
var iter = sentence.GetEnumerator();
while(iter.MoveNext())
{
if(iter.Current.ToString() == keyword)
answer += sentence;
}
}
return answer;
}
Well it does not work. I call it with this code:
string example = "We are living in a yellow submarine. We don't have anything else. Inside the submarine is very tight. So we are drinking all the day. We will move out of it in 5 days.";
string keyword = "in";
string answer = Extract(example, keyword);
Console.WriteLine(answer);
which does not output anything. It's probably the iterator part since I'm not familiar with iterators.
Anyhow, the hint for the question says we should use split and IndexOf methods.

sentence.GetEnumerator() is returning a CharEnumerator, so you're examining each character in each sentence. A single character will never be equal to the string "in", which is why it isn't working. You'll need to look at each word in each sentence and compare with the term you're looking for.

Try:
public static string Extract(string str, string keyword)
{
string[] arr = str.Split('.');
string answer = string.Empty;
foreach(string sentence in arr)
{
//Add any other required punctuation characters for splitting words in the sentence
string[] words = sentence.Split(new char[] { ' ', ',' });
if(words.Contains(keyword))
{
answer += sentence;
}
}
return answer;
}

Your code goes through each sentence character by character using the iterator. Unless the keyword is a single-character word (e.g. "I" or "a") there will be no match.
One way of solving this is to use LINQ to check if a sentence has the keyword, like this:
foreach(string sentence in arr)
{
if(sentence.Split(' ').Any(w => w == keyword))
answer += sentence+". ";
}
Demo on ideone.
Another approach would be using regular expressions to check for matches only on word boundaries. Note that you cannot use a plain Contains method, because doing so results in "false positives" (i.e. finding sentences where the keyword is embedded inside a longer word).
Another thing to note is the use of += for concatenation. This approach is inefficient, because many temporary throw-away objects get created. A better way of achieving the same result is using StringBuilder.
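To make the two suggestions above concrete, here is a minimal sketch that combines a word-boundary regex with StringBuilder. The class name SentenceExtractor and the choice to put each matching sentence on its own line are assumptions for illustration, not part of the original exercise.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class SentenceExtractor
{
    // Returns every sentence of text that contains keyword as a whole word.
    // Assumption: sentences end with '.', as stated in the exercise.
    public static string Extract(string text, string keyword)
    {
        // \b anchors the match at word boundaries; Regex.Escape guards against
        // keywords that contain regex metacharacters.
        var wholeWord = new Regex(@"\b" + Regex.Escape(keyword) + @"\b");
        var result = new StringBuilder();
        foreach (string sentence in text.Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (wholeWord.IsMatch(sentence))
            {
                // Re-append the period that Split removed, one sentence per line.
                result.Append(sentence.Trim()).Append('.').Append('\n');
            }
        }
        return result.ToString();
    }
}
```

Called with the sample text and the keyword "in", this should return only the first and last sentences, because "anything", "Inside", "submarine" and "drinking" contain "in" only as part of a longer word.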

string input = "We are living in a yellow submarine. We don't have anything else. Inside the submarine is very tight. So we are drinking all the day. We will move out of it in 5 days.";
var lookup = input.Split('.')
.Select(s => s.Split().Select(w => new { w, s }))
.SelectMany(x => x)
.ToLookup(x => x.w, x => x.s);
foreach(var sentence in lookup["in"])
{
Console.WriteLine(sentence);
}

I would split the input at the periods and then search each sentence for the given word.
string metin = "We are living in a yellow submarine. We don't have anything else. Inside the submarine is very tight. So we are drinking all the day. We will move out of it in 5 days.";
string[] metinDizisi = metin.Split('.');
string answer = string.Empty;
for (int i = 0; i < metinDizisi.Length; i++)
{
if (metinDizisi[i].Contains(" in "))
{
answer += metinDizisi[i];
}
}
Console.WriteLine(answer);

You can use sentence.Contains(keyword) to check if the string has the word you are looking for.
public static string Extract(string str, string keyword)
{
string[] arr = str.Split('.');
string answer = string.Empty;
foreach(string sentence in arr)
if(sentence.Contains(keyword))
answer+=sentence;
return answer;
}

You could split on the period to get a collection of sentences, then filter those with a regex containing the keyword.
var results = example.Split('.')
.Where(s => Regex.IsMatch(s, String.Format(@"\b{0}\b", keyword)));

Related

Remove list of words from string

I have a list of words that I want to remove from a string. I use the following method:
string stringToClean = "The.Flash.2014.S07E06.720p.WEB-DL.HEVC.x265.RMTeam";
string[] BAD_WORDS = {
"720p", "web-dl", "hevc", "x265", "Rmteam", "."
};
var cleaned = string.Join(" ", stringToClean.Split(' ').Where(w => !BAD_WORDS.Contains(w, StringComparer.OrdinalIgnoreCase)));
but it is not working, and the following text is output:
The.Flash.2014.S07E06.720p.WEB-DL.HEVC.x265.RMTeam
For this it would be a good idea to create a reusable method that splits a string into words. I'll do this as an extension method of string. If you are not familiar with extension methods, read extension methods demystified
public static IEnumerable<string> ToWords(this string text)
{
// TODO implement
}
Usage will be as follows:
string text = "This is some wild text!";
List<string> words = text.ToWords().ToList();
var first3Words = text.ToWords().Take(3);
var lastWord = text.ToWords().LastOrDefault();
Once you've got this method, the solution to your problem will be easy:
IEnumerable<string> badWords = ...
string inputText = ...
IEnumerable<string> validWords = inputText.ToWords().Except(badWords);
Or maybe you want to use Except(badWords, StringComparer.OrdinalIgnoreCase);
The implementation of ToWords depends on what you would call a word: everything delimited by a dot? or do you want to support whitespaces? or maybe even new-lines?
The implementation for your problem: A word is any sequence of characters delimited by a dot.
public static IEnumerable<string> ToWords(this string text)
{
// find the next dot:
const char dot = '.';
int startIndex = 0;
int dotIndex = text.IndexOf(dot, startIndex);
while (dotIndex != -1)
{
// found a Dot, return the substring until the dot:
int wordLength = dotIndex - startIndex;
yield return text.Substring(startIndex, wordLength);
// find the next dot
startIndex = dotIndex + 1;
dotIndex = text.IndexOf(dot, startIndex);
}
// read until the end of the text. Return everything after the last dot:
yield return text.Substring(startIndex);
}
TODO:
Decide what you want to return if text starts with a dot ".ABC.DEF".
Decide what you want to return if the text ends with a dot: "ABC.DEF."
Check if the return value is what you want if text is empty.
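One possible way to settle the TODO items above is to let string.Split drop empty entries, so that leading, trailing, and doubled dots never produce empty words. This is only a sketch under that assumption; RemoveWords is a hypothetical helper name added for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordExtensions
{
    // Splits text on dots and drops empty entries, so ".ABC.DEF" and "ABC.DEF."
    // both yield just "ABC" and "DEF". Null or empty input yields no words.
    public static IEnumerable<string> ToWords(this string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return Enumerable.Empty<string>();
        }
        return text.Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Hypothetical helper: removes every word listed in badWords
    // (case-insensitive) and rejoins the remaining words with spaces.
    public static string RemoveWords(this string text, IEnumerable<string> badWords)
    {
        var blocked = new HashSet<string>(badWords, StringComparer.OrdinalIgnoreCase);
        return string.Join(" ", text.ToWords().Where(w => !blocked.Contains(w)));
    }
}
```

The sketch filters with Where rather than Except because Except has set semantics and would also collapse repeated valid words into one.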
Your split/join don't match up with your input.
That said, here's a quick one-liner:
string clean = BAD_WORDS.Aggregate(stringToClean, (acc, word) => acc.Replace(word, string.Empty));
This is basically a "reduce". Not fantastically performant but over strings that are known to be decently small I'd consider it acceptable. If you have to use a really large string or a really large number of "words" you might look at another option but it should work for the example case you've given us.
Edit: The downside of this approach is that it also matches partial words. For example, your token array has "720p", but the code I suggested here will also match inside "720px". There are ways around this. Instead of using string's implementation of Replace, you could use a regex that matches your delimiters, something like Regex.Replace(acc, $"[. ]{word}([. ])", "$1") (regex not confirmed, but it should be close; I added a capture for the delimiter in order to put it back for the next pass).

Get only Whole Words from a .Contains() statement

I've used .Contains() to find if a sentence contains a specific word however I found something weird:
I wanted to find if the word "hi" was present in a sentence which are as follows:
The child wanted to play in the mud
Hi there
Hector had a hip problem
if(sentence.Contains("hi"))
{
//
}
I only want the SECOND sentence to match, but all 3 match, since "child" has "hi" in it and "hip" has "hi" in it. How do I use .Contains() such that only whole words get picked out?
Try using Regex:
if (Regex.Match(sentence, @"\bhi\b", RegexOptions.IgnoreCase).Success)
{
//
};
This works just fine for me on your input text.
Here's a Regex solution:
Regex has a Word Boundary Anchor using \b
Also, if the search string might come from user input, you might consider escaping the string using Regex.Escape
This example should filter a list of strings the way you want.
string findme = "hi";
string pattern = @"\b" + Regex.Escape(findme) + @"\b";
Regex re = new Regex(pattern,RegexOptions.IgnoreCase);
List<string> data = new List<string> {
"The child wanted to play in the mud",
"Hi there",
"Hector had a hip problem"
};
var filtered = data.Where(d => re.IsMatch(d));
DotNetFiddle Example
You could split your sentence into words - you could split at each space and then trim any punctuation. Then check if any of these words are 'hi':
var punctuation = sentence.Where(Char.IsPunctuation).Distinct().ToArray();
var words = sentence.Split().Select(x => x.Trim(punctuation));
var containsHi = words.Contains("hi", StringComparer.OrdinalIgnoreCase);
See a working demo here: https://dotnetfiddle.net/AomXWx
You could write your own extension method for string like:
static class StringExtension
{
public static bool ContainsWord(this string s, string word)
{
string[] ar = s.Split(' ');
foreach (string str in ar)
{
if (str.ToLower() == word.ToLower())
return true;
}
return false;
}
}

c# trying to change first letter to uppercase but doesn't work

I have to convert the first letter of every word the user inputs into uppercase. I don't think I'm doing it right so it doesn't work but I'm not sure where has gone wrong D: Thank you in advance for your help! ^^
static void Main(string[] args)
{
Console.Write("Enter anything: ");
string x = Console.ReadLine();
string pattern = "^";
Regex expression = new Regex(pattern);
var regexp = new System.Text.RegularExpressions.Regex(pattern);
Match result = expression.Match(x);
Console.WriteLine(x);
foreach(var match in x)
{
Console.Write(match);
}
Console.WriteLine();
}
If your exercise isn't regex operations, there are built-in utilities to do what you are asking:
System.Globalization.TextInfo ti = System.Globalization.CultureInfo.CurrentCulture.TextInfo;
string titleString = ti.ToTitleCase("this string will be title cased");
Console.WriteLine(titleString);
Prints:
This String Will Be Title Cased
If your exercise does require regex, see this previous StackOverflow answer: Sublime Text: Regex to convert Uppercase to Title Case?
First of all, your Regex "^" matches the start of a line. If you need to match the first letter of each word in a multi-word line, you'll need a different Regex, e.g. @"\b[A-Za-z]".
You're also not doing anything to actually change the first letter to upper case. Note that strings in C# are immutable (they cannot be changed after creation), so you will need to create a new string which consists of the first letter of the original string, upper cased, followed by the rest of the string. Give that part a try on your own. If you have trouble, post a new question with your attempt.
string pattern = "(?:^|(?<= ))(.)";
^ doesn't capture anything by itself. You can replace the captured letters with uppercase ones by applying a function to $1. See demo.
https://regex101.com/r/uE3cC4/29
I would approach this using extension methods.
PHP has a nice method called ucfirst.
So I translated that into C#
public static string UcFirst(this string s)
{
var stringArr = s.ToCharArray(0, s.Length);
var char1ToUpper = char.Parse(stringArr[0]
.ToString()
.ToUpper());
stringArr[0] = char1ToUpper;
return string.Join("", stringArr);
}
Usage:
[Test]
public void UcFirst()
{
string s = "john";
s = s.UcFirst();
Assert.AreEqual("John", s);
}
Obviously you would still have to split your sentence into a list and call UcFirst for each item in the list.
Google "C# extension methods" if you need help with what is going on.
One more way to do it with regex:
string input = "this string will be title cased, even if there are.cases.like.that";
string output = Regex.Replace(input, @"(?<!\w)\w", m => m.Value.ToUpper());
I hope this may help
public static string CapsFirstLetter(string inputValue)
{
char[] values = new char[inputValue.Length];
int count = 0;
foreach (char f in inputValue){
if (count == 0){
values[count] = Convert.ToChar(f.ToString().ToUpper());
}
else{
values[count] = f;
}
count++;
}
return new string(values);
}

How to find the number of occurrences of a letter in only the first sentence of a string?

I want to find the number of occurrences of the letter "a" in only the first sentence. The code below counts "a" in all sentences, but I want it for only the first sentence.
static void Main(string[] args)
{
string text; int k = 0;
text = "bla bla bla. something second. maybe last sentence.";
foreach (char a in text)
{
char b = 'a';
if (b == a)
{
k += 1;
}
}
Console.WriteLine("number of a in first sentence is " + k);
Console.ReadKey();
}
This will split the string into an array separated by sentence-ending characters, then count the number of 'a' chars in the first element of the array (the first sentence).
var count = text.Split(new[] { '.', '!', '?' })[0].Count(c => c == 'a');
This example assumes a sentence is separated by a ., ? or !. If you have a decimal number in your string (e.g. 123.456), that will count as a sentence break. Breaking up a string into accurate sentences is a fairly complex exercise.
This is perhaps more verbose than what you were looking for, but hopefully it'll breed understanding as you read through it.
public static void Main()
{
//Make an array of the possible sentence enders. Doing this pattern lets us easily update
// the code later if it becomes necessary, or allows us easily to move this to an input
// parameter
string[] SentenceEnders = new string[] {"$", @"\.", @"\?", @"\!" /* Add Any Others */};
string WhatToFind = "a"; //What are we looking for? Regular Expressions Will Work Too!!!
string SentenceToCheck = "This, but not to exclude any others, is a sample."; //First example
string MultipleSentencesToCheck = @"
Is this a sentence
that breaks up
among multiple lines?
Yes!
It also has
more than one
sentence.
"; //Second Example
//This will split the input on all the enders put together(by way of joining them in [] inside a regular
// expression.
string[] SplitSentences = Regex.Split(SentenceToCheck, "[" + String.Join("", SentenceEnders) + "]", RegexOptions.IgnoreCase);
//SplitSentences is an array, with sentences on each index. The first index is the first sentence
string FirstSentence = SplitSentences[0];
//Now, split that single sentence on our matching pattern for what we should be counting
string[] SubSplitSentence = Regex.Split(FirstSentence, WhatToFind, RegexOptions.IgnoreCase);
//Now that it's split, it's split a number of times that matches how many matches we found, plus one
// (The "Left over" is the +1)
int HowMany = SubSplitSentence.Length - 1;
System.Console.WriteLine(string.Format("We found, in the first sentence, {0} '{1}'.", HowMany, WhatToFind));
//Do all this again for the second example. Note that ideally, this would be in a separate function
// and you wouldn't be writing code twice, but I wanted you to see it without all the comments so you can
// compare and contrast
SplitSentences = Regex.Split(MultipleSentencesToCheck, "[" + String.Join("", SentenceEnders) + "]", RegexOptions.IgnoreCase | RegexOptions.Singleline);
SubSplitSentence = Regex.Split(SplitSentences[0], WhatToFind, RegexOptions.IgnoreCase | RegexOptions.Singleline);
HowMany = SubSplitSentence.Length - 1;
System.Console.WriteLine(string.Format("We found, in the second sentence, {0} '{1}'.", HowMany, WhatToFind));
}
Here is the output:
We found, in the first sentence, 3 'a'.
We found, in the second sentence, 4 'a'.
You didn't define "sentence", but if we assume it's always terminated by a period (.), just add this inside the loop:
if (a == '.') {
break;
}
Expand from this to support other sentence delimiters.
Simply "break" the foreach(...) loop when you encounter a "." (period)
Well, assuming you define a sentence as being ended with a '.':
Use String.IndexOf() to find the position of the first '.'. After that, search in a Substring instead of the entire string.
Find the position of the '.' in the text (you can use Split).
Count the 'a' characters in the text from position 0 to the first '.'.
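The IndexOf and Substring steps described above could be sketched like this. The class and method names are made up for illustration, and the sketch assumes a sentence ends at the first '.'; if the text has no period, the whole text is treated as the first sentence.

```csharp
using System;
using System.Linq;

public static class FirstSentenceCounter
{
    // Counts occurrences of letter in the first sentence of text.
    // Assumption: the first sentence ends at the first '.'.
    public static int CountInFirstSentence(string text, char letter)
    {
        int end = text.IndexOf('.');
        // No period found: fall back to the whole text.
        string firstSentence = end == -1 ? text : text.Substring(0, end);
        return firstSentence.Count(c => c == letter);
    }
}
```

For the question's sample text "bla bla bla. something second. maybe last sentence." and the letter 'a', this counts only the three occurrences in "bla bla bla".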
string SentenceToCheck = "Hi, I can wonder this situation where I can do best";
//Here I am giving several way to find this
//Using Regular Expression
int HowMany = Regex.Split(SentenceToCheck, "a", RegexOptions.IgnoreCase).Length - 1;
int i = Regex.Matches(SentenceToCheck, "a").Count;
// Simple way
int Count = SentenceToCheck.Length - SentenceToCheck.Replace("a", "").Length;
//Linq
// Count with a predicate; Select(...).Count() would count every character, not just the matches
var _lamdaCount = SentenceToCheck.Count(t => char.ToUpper(t) == 'A');
// Case-insensitive query syntax
var _linqAIEnumareable = from _char in SentenceToCheck
where char.ToUpper(_char) == 'A'
select _char;
int a = _linqAIEnumareable.Count();
// Case-sensitive query syntax
var _linqCount = from g in SentenceToCheck
where g == 'a'
select g;
int b = _linqCount.Count();

C# Capitalizing string, but only after certain punctuation marks

I'm trying to find an efficient way to take an input string and capitalize the first letter after every punctuation mark (. : ? !) which is followed by a white space.
Input:
"I ate something. but I didn't:
instead, no. what do you think? i
think not! excuse me.moi"
Output:
"I ate something. But I didn't:
Instead, no. What do you think? I
think not! Excuse me.moi"
The obvious would be to split it and then capitalize the first char of every group, then concatenate everything. But it's uber ugly. What's the best way to do this? (I'm thinking Regex.Replace using a MatchEvaluator that capitalizes the first letter but would like to get more ideas)
Thanks!
Fast and easy:
static class Ext
{
public static string CapitalizeAfter(this string s, IEnumerable<char> chars)
{
var charsHash = new HashSet<char>(chars);
StringBuilder sb = new StringBuilder(s);
for (int i = 0; i < sb.Length - 2; i++)
{
if (charsHash.Contains(sb[i]) && sb[i + 1] == ' ')
sb[i + 2] = char.ToUpper(sb[i + 2]);
}
return sb.ToString();
}
}
Usage:
string capitalized = s.CapitalizeAfter(new[] { '.', ':', '?', '!' });
Try this:
string expression = @"[\.\?\!,]\s+([a-z])";
string input = "I ate something. but I didn't: instead, no. what do you think? i think not! excuse me.moi";
char[] charArray = input.ToCharArray();
foreach (Match match in Regex.Matches(input, expression,RegexOptions.Singleline))
{
charArray[match.Groups[1].Index] = Char.ToUpper(charArray[match.Groups[1].Index]);
}
string output = new string(charArray);
// "I ate something. But I didn't: instead, No. What do you think? I think not! Excuse me.moi"
I use an extension method.
public static string CorrectTextCasing(this string text)
{
// /[.:?!]\\s[a-z]/ matches letters following a space and punctuation,
// /^(?:\\s+)?[a-z]/ matches the first letter in a string (with optional leading spaces)
Regex regexCasing = new Regex("(?:[.:?!]\\s[a-z]|^(?:\\s+)?[a-z])", RegexOptions.Multiline);
// First ensure all characters are lower case.
// (In my case it comes all in caps; this line may be omitted depending upon your needs)
text = text.ToLower();
// Capitalize each match in the regular expression, using a lambda expression
text = regexCasing.Replace(text, s => s.Value.ToUpper());
// Return the new string.
return text;
}
Then I can do the following:
string mangled = "i'm A little teapot, short AND stout. here IS my Handle.";
string corrected = mangled.CorrectTextCasing();
// returns "I'm a little teapot, short and stout. Here is my handle."
Using the Regex / MatchEvaluator route, you could match on
@"[.:?!]\s[a-z]"
and capitalize the entire match.
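As a rough sketch of the Regex / MatchEvaluator route, the following variant uses a lookbehind so that only the letter itself is matched and replaced, instead of uppercasing the whole match. The class and method names are illustrative, and the pattern assumes lowercase ASCII letters only.

```csharp
using System.Text.RegularExpressions;

public static class Capitalizer
{
    // Uppercases a lowercase ASCII letter that follows '.', ':', '?' or '!'
    // plus exactly one whitespace character.
    public static string CapitalizeAfterPunctuation(string text)
    {
        // The lookbehind keeps the punctuation and whitespace out of the match,
        // so the evaluator only ever sees the single letter.
        return Regex.Replace(text, @"(?<=[.:?!]\s)[a-z]", m => m.Value.ToUpperInvariant());
    }
}
```

Because \s also matches a line break, a letter at the start of a new line directly after ':' is capitalized too, while "me.moi" is left alone since no whitespace follows the period.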
Where the text variable contains the string
string text = "I ate something. but I didn't: instead, no. what do you think? i think not! excuse me.moi";
string[] punctuators = { "?", "!", ",", "-", ":", ";", "." };
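for (int i = 0; i < punctuators.Length; i++)
{
int pos = text.IndexOf(punctuators[i]);
while (pos != -1)
{
// Only capitalize when the mark is followed by a space and another character;
// this also avoids reading past the end of the string
if (pos + 2 < text.Length && text[pos + 1] == ' ')
{
string upper = char.ToUpper(text[pos + 2]).ToString();
text = text.Remove(pos + 2, 1).Insert(pos + 2, upper);
}
pos = text.IndexOf(punctuators[i], pos + 1);
}
}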
