Extract keywords from text and exclude words - c#

I have this function to extract all words from a text:
public static string[] GetSearchWords(string text)
{
string pattern = @"\S+";
Regex re = new Regex(pattern);
MatchCollection matches = re.Matches(text);
string[] words = new string[matches.Count];
for (int i=0; i<matches.Count; i++)
{
words[i] = matches[i].Value;
}
return words;
}
I want to exclude a list of words from the returned array. The word list looks like this:
string strWordsToExclude="if,you,me,about,more,but,by,can,could,did";
How can I modify the above function so that it does not return words that are in my list?

string strWordsToExclude="if,you,me,about,more,but,by,can,could,did";
var ignoredWords = strWordsToExclude.Split(',');
return words.Except(ignoredWords).ToArray();
I think the Except method (from System.Linq) fits your needs.
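For illustration, here is a minimal sketch of how the exclusion could be folded into the original function. The class name KeywordFilter and the excludeCsv parameter are hypothetical, not from the question. One thing to be aware of: Except performs a set difference, so it also collapses repeated words. This sketch filters with Where and a HashSet instead, which keeps duplicates, and it assumes a case-insensitive comparison is wanted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordFilter
{
    // Hypothetical helper: returns the whitespace-delimited words of `text`
    // that are not listed in the comma-separated `excludeCsv`.
    public static string[] GetSearchWords(string text, string excludeCsv)
    {
        // Assumption: matching should ignore case ("If" is excluded by "if").
        var excluded = new HashSet<string>(
            excludeCsv.Split(','), StringComparer.OrdinalIgnoreCase);

        return Regex.Matches(text, @"\S+")
            .Cast<Match>()
            .Select(m => m.Value)
            .Where(w => !excluded.Contains(w))
            .ToArray();
    }
}
```

Calling KeywordFilter.GetSearchWords("If you read about cooking you can cook", strWordsToExclude) should leave read, cooking and cook.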

If you aren't forced to use Regex, you can use a little LINQ:
void Main()
{
var wordsToExclude = "if,you,me,about,more,but,by,can,could,did".Split(',');
string str = "if you read about cooking you can cook";
var newWords = GetSearchWords(str, wordsToExclude); // read, cooking, cook
}
string[] GetSearchWords(string text, IEnumerable<string> toExclude)
{
var words = text.Split();
return words.Where(word => !toExclude.Contains(word)).ToArray();
}
I'm assuming a word is a series of non-whitespace characters.

Related

Highlighting the multiple keywords

I am trying to highlight multiple keywords in a GridView. I tried a for loop, but it highlights only the first item from the array.
protected string HighlightText(string searchWord, string inputText)
{
// string[] strArray = new string[] { "Hello", "Welcome" };
string s = "d,s";
// Split string on commas.
// ... This will separate all the keywords.
string[] words = s.Split(',');
for (int i = 0; i < words.Length; i++)
{
//Console.WriteLine(word);
searchWord = words[i];
Regex expression = new Regex(searchWord.Replace(" ", "|"), RegexOptions.IgnoreCase);
return expression.Replace(inputText, new MatchEvaluator(ReplaceKeywords));
}
return string.Empty;
}
Thanks in advance.
This is the output I am getting: only the keyword "d" gets highlighted. I need to highlight the keyword "s" as well.
Can you try something like this, instead of looping over the keywords one by one?
string inputText = "this is keyword1 for test and keyword4 also";
Regex keywords = new Regex("keyword1|keyword2|keyword3|keyword4");
//keywords = keywords.Replace("|", "\b|\b"); //or use \b between keywords
foreach (Match match in keywords.Matches(inputText))
{
//get match.Index & match.Length for selection and color it
}
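As a rough sketch of the single-pattern idea applied to highlighting: join the keywords into one alternation and replace every match in a single pass. The question does not show ReplaceKeywords, so the class name Highlighter and the span markup below are assumptions made only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class Highlighter
{
    // Hypothetical helper: wraps every occurrence of any keyword in a span.
    // The keyword list is assumed to be non-empty.
    public static string HighlightText(string inputText, IEnumerable<string> keywords)
    {
        // Regex.Escape guards against keywords containing regex metacharacters.
        string pattern = string.Join("|", keywords.Select(Regex.Escape));

        return Regex.Replace(
            inputText,
            pattern,
            m => "<span class=\"highlight\">" + m.Value + "</span>", // assumed markup
            RegexOptions.IgnoreCase);
    }
}
```

Because the replacement happens once for the whole pattern, there is no early return inside a loop, which is what caused only the first keyword to be highlighted.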

How to check if a string contains all of the characters of a word

I want to check whether a string contains all of the characters of a given word, for example:
var inputString = "this is just a simple text string";
And say I have the word:
var word = "ts";
Now it should pick out the words that contain both t and s:
this just string
This is what I am working on:
var names = Regex.Matches(inputString, @"\S+ts\S+",RegexOptions.IgnoreCase);
However, this does not give me back the words I want. With just a single character such as t, it gives me back all of the words that contain t. With st instead of ts, it gives me back the word just.
Any idea how this can work?
Here is a LINQ solution which is easy on the eyes and more natural than regex.
var testString = "this is just a simple text string";
string[] words = testString.Split(' ');
var result = words.Where(w => "ts".All(w.Contains));
The result is:
this
just
string
You can use LINQ's Enumerable.All:
var input = "this is just a simple text string";
var token = "ts";
var results = input.Split().Where(str => token.All(c => str.Contains(c))).ToList();
foreach (var res in results)
Console.WriteLine(res);
Output:
// this
// just
// string
You can use this pattern.
(?=[^ ]*t)(?=[^ ]*s)[^ ]+
You can make regex dynamically.
var inputString = "this is just a simple text string";
var word = "ts";
string pattern = "(?=[^ ]*{0})";
string regpattern = string.Join("" , word.Select(x => string.Format(pattern, x))) + "[^ ]+";
var wineNames = Regex.Matches(inputString, regpattern ,RegexOptions.IgnoreCase);
Option without LINQ and Regex (just for fun):
string input = "this is just a simple text string";
char[] chars = { 't', 's' };
var array = input.Split();
List<string> result = new List<string>();
foreach(var word in array)
{
bool isValid = true;
foreach (var c in chars)
{
if (!word.Contains(c))
{
isValid = false;
break;
}
}
if(isValid) result.Add(word);
}

Parse for words starting with # character in a string

I have to write a program which parses a string for words starting with '#' and returns the words along with the # symbol.
I have tried something like:
char[] delim = { '#' };
string[] strArr = commenttext.Split(delim);
return strArr;
But it returns all the words without '#' in an array.
I need something pretty straightforward. No LINQ-like things.
If the string is "abc #ert #xyz" then I should get back #ert and #xyz.
If you define "word" as "separated by spaces" then this would work:
string[] strArr = commenttext.Split(' ')
.Where(w => w.StartsWith("#"))
.ToArray();
If you need something more complex, a Regular Expression might be more appropriate.
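A sketch of what a regular-expression version might look like, written without LINQ. The class name HashWords is hypothetical, and the pattern assumes a word counts only when its '#' comes at the start of the string or directly after whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HashWords
{
    // Hypothetical helper: returns every whitespace-delimited word that
    // starts with '#', keeping the '#' in the result.
    public static string[] Find(string text)
    {
        // (?<!\S) means "not preceded by a non-whitespace character",
        // so "a#b" is skipped while "#ert" is matched.
        MatchCollection matches = Regex.Matches(text, @"(?<!\S)#\S+");

        string[] result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            result[i] = matches[i].Value;
        }
        return result;
    }
}
```

For the input "abc #ert #xyz" this should return #ert and #xyz.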
"I need something pretty straightforward. No LINQ-like things."
The non-Linq equivalent would be:
var words = commenttext.Split(' ');
List<string> temp = new List<string>();
foreach(string w in words)
{
if(w.StartsWith("#"))
temp.Add(w);
}
string[] strArr = temp.ToArray();
If you're against using Linq, which you should not be unless you're required to use older .NET versions, an approach along these lines would suit your needs.
string[] words = commenttext.Split(' ');
for (int i = 0; i < words.Length; i++)
{
string word = words[i];
if (word.StartsWith("#"))
{
// save in array / list
}
}
const string test = "#Amir abcdef #Stack #C# mnop xyz";
var splited = test.Split(' ').Where(m => m.StartsWith("#")).ToList();
foreach (var b in splited)
{
Console.WriteLine(b); // prints the word together with its leading '#'
}
Console.ReadKey();

How do I check if a string contains a string from an array of strings?

So here is my example
string test = "Hello World, I am testing this string.";
string[] myWords = {"testing", "string"};
How do I check whether the string test contains any of these words? If it does, how do I replace each of those words with a number of asterisks equal to its length?
You can use a regex:
public string AstrixSomeWords(string test)
{
Regex regex = new Regex(@"\b\w+\b");
return regex.Replace(test, AsterixWord);
}
private string AsterixWord(Match match)
{
string word = match.Groups[0].Value;
if (myWords.Contains(word))
return new String('*', word.Length);
else
return word;
}
I have checked the code and it seems to work as expected.
If the number of words in myWords is large, you might consider using a HashSet for better performance.
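A minimal sketch of the HashSet variant, assuming the word list is passed in as a parameter rather than stored in a field. The names WordMasker and Mask are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordMasker
{
    // Hypothetical helper: replaces each whole word found in `wordsToMask`
    // with asterisks of the same length.
    public static string Mask(string text, IEnumerable<string> wordsToMask)
    {
        // HashSet lookups stay fast regardless of how many words are listed.
        var lookup = new HashSet<string>(wordsToMask);

        return Regex.Replace(
            text,
            @"\b\w+\b",
            m => lookup.Contains(m.Value) ? new string('*', m.Length) : m.Value);
    }
}
```

The comparison here is case-sensitive. Passing StringComparer.OrdinalIgnoreCase to the HashSet constructor would be one way to relax that.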
bool cont = false;
string test = "Hello World, I am testing this string.";
string[] myWords = { "testing", "string" };
foreach (string a in myWords)
{
if( test.Contains(a))
{
int no = a.Length;
test = test.Replace(a, new string('*', no));
}
}
var containsAny = myWords.Any(x => test.Contains(x));
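Something like this:
foreach (var word in myWords){
if(test.Contains(word)){
// The string constructor takes a char, not a string.
string astr = new string('*', word.Length);
// Strings are immutable, so the result must be assigned back.
test = test.Replace(word, astr);
}
}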
EDIT: Refined

Retrieve String Containing Specific substring C#

I have output in string format like the following:
"ABCDED 0000A1.txt PQRSNT 12345"
I want to retrieve the substring(s) containing .txt in the above string, e.g. for the string above it should return 0000A1.txt.
Thanks
You can either split the string at whitespace boundaries like it's already been suggested or repeatedly match the same regex like this:
var input = "ABCDED 0000A1.txt PQRSNT 12345 THE.txt FOO";
var match = Regex.Match (input, @"\b([\w\d]+\.txt)\b");
while (match.Success) {
Console.WriteLine ("TEST: {0}", match.Value);
match = match.NextMatch ();
}
Split will work if spaces are the separator. If you use other separators, you can add them as needed.
string input = "ABCDED 0000A1.txt PQRSNT 12345";
string filename = input.Split(' ').FirstOrDefault(f => System.IO.Path.HasExtension(f));
filename is "0000A1.txt", and this will work for any extension.
You may use c#, regex and pattern, match :)
Here is the code; plug it in and try it. Please comment.
string test = "afdkljfljalf dkfjd.txt lkjdfjdl";
string ffile = Regex.Match(test, @"([a-z0-9]+\.txt)").Groups[1].Value;
Console.WriteLine(ffile);
Reference: regexp
I did something like this:
string subString = "";
char period = '.';
char[] chArString;
int iSubStrIndex = 0;
if (myString != null)
{
chArString = new char[myString.Length];
chArString = myString.ToCharArray();
for (int i = 0; i < myString.Length; i ++)
{
if (chArString[i] == period)
iSubStrIndex = i;
}
subString = myString.Substring(iSubStrIndex);
}
Hope that helps.
First, split your string into an array:
char[] whitespace = new char[] { ' ', '\t' };
string[] ssizes = myStr.Split(whitespace);
Then find .txt in the array:
// Find the first element containing ".txt".
string value1 = Array.Find(ssizes,
element => element.Contains(".txt", StringComparison.Ordinal));
Now value1 will hold "0000A1.txt".
Happy coding.
