Check for word match count in random letters

I have a string of 500 random letters and a 10,000 dictionary word list.
I want to check the letters for word matches.
If there are 5 matches or greater I want it to return the list of matched words.
However, this foreach loop with Contains() doesn't seem to work correctly or return the correct matches. It also returns partial matches and single letters.
// 500 Random Letters
string letters = "bliduuwfhbgphwhsyzjnlfyizbjfeeepsbpgplpbhaegyepqcjhhotovnzdtlracxrwggbcmjiglasjvmscvxwazmutqiwppzcjhijjbguxfnduuphhsoffaqwtmhmensqmyicnciaoczumjzyaaowbtwjqlpxuuqknxqvmnueknqcbvkkmildyvosczlbnlgumohosemnfkmndtiubfkminlriytmbtrzhwqmovrivxxojbpirqahatmydqgulammsnfgcvgfncqkpxhgikulsjynjrjypxwvlkvwvigvjvuydbjfizmbfbtjprxkmiqpfuyebllzezbxozkiidpplvqkqlgdlvjbfeticedwomxgawuphocisaejeonqehoipzsjgbfdatbzykkurrwwtajeajeornrhyoqadljfjyizzfluetynlrpoqojxxqmmbuaktjqghqmusjfvxkkyoewgyckpbmismwyfebaucsfueuwgio"
// Dictionary Words List
string[] words = File.ReadAllText(@"C:\dictionarywords.txt").Split('\n');
// Word Matches List
List<string> matches = new List<string>();
// Check for Word matches in Letters
foreach (var x in words)
{
// Add to list if match
if (letters.Contains(x))
{
matches.Add(x);
}
}
// Return Matched Words if 5 or greater
if (matches.Count() >= 5)
{
textBox.Text = string.Join("\n", matches);
}
Examples
Word matches found by eye:
lid
hot
gum
hose
hat
Code Match Returns:
my
up
so
c
et
ms
am
me
s
x
n
b
...

Your code is working as intended. It IS finding those words, but it's also finding additional ones: Contains() does a plain substring match, so every short dictionary entry (single letters, two-letter words) that appears anywhere in the letters counts as a match. I'd suggest taking the words you don't want out of the list before searching. For comparison, a lot of people use this same approach in a profanity filter: if a sentence contains a word from the dictionary of curse words, that word is omitted. Give it a try with a much smaller list of words you've put in yourself and test the results.
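As a minimal sketch of that suggestion: the helper below drops short dictionary entries before the substring test. The names WordFinder, FindWords and the minLength parameter are hypothetical, not from the question. It also trims each entry, on the assumption that the dictionary file may use Windows line endings, in which case Split('\n') leaves a trailing '\r' on every word.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordFinder
{
    // Returns the dictionary words of at least minLength characters that
    // occur as a contiguous substring of letters.
    public static List<string> FindWords(string letters, IEnumerable<string> words, int minLength = 3)
    {
        return words
            .Select(w => w.Trim())                 // drop a stray '\r' or spaces
            .Where(w => w.Length >= minLength)     // skip single letters and very short words
            .Where(w => letters.IndexOf(w, StringComparison.Ordinal) >= 0)
            .Distinct()
            .ToList();
    }
}
```

With the question's variables this would be called as WordFinder.FindWords(letters, words), followed by the existing check for five or more matches.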

Related

Finding multiple occurrences of a word and printing out the entire line in which the word exists using c-sharp

Hey guys this is something I am trying to do for a while...
So I have a text file online which can be accessed through a link like this https://example.com/myfile.txt
the text file contains a few sentences, here are a few
This is a sentence
sentences are a combination of multiple words
we cannot imagine a world without sentences
words together form a sentence
here's a sentence orange is a fruit and I love it!
how's the day today?
Okay, those were all random sentences in the text file. Now, if I enter the word 'words' as the input (string input), I want it to print all the lines that contain the input. Here's an example:
sentences are a combination of multiple words
words together form a sentence

This is an obvious case to use Regex.
using System.Text.RegularExpressions;
string text = @"This is a sentence
sentences are a combination of multiple words
we cannot imagine a world without sentences
words together form a sentence
here's a sentence orange is a fruit and I love it!
how's the day today?";
string input = "words";
Regex regex = new Regex(@"^.*" + input + @".*$", RegexOptions.Multiline);
foreach (Match match in regex.Matches(text))
{
    Console.WriteLine(match.Value);
}
Explanation: @"^.*" + input + @".*$", RegexOptions.Multiline:
^.* match zero or more characters from start of line
input match the string in input string
.*$ match zero or more characters at end of line
RegexOptions.Multiline makes ^ and $ match start and end of line (normally they match start and end of whole text).
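A hedged variation on the same idea, in case the search term can contain regex metacharacters such as "." or "?" (an assumption; the question only shows plain words). Regex.Escape keeps the term literal, and [^\r\n]* is used in place of .* so that a '\r' from Windows line endings does not end up in the result. LineSearch and LinesContaining are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LineSearch
{
    // Returns every line of text that contains input as a literal substring.
    public static List<string> LinesContaining(string text, string input)
    {
        var regex = new Regex(
            @"^[^\r\n]*" + Regex.Escape(input) + @"[^\r\n]*",
            RegexOptions.Multiline);

        return regex.Matches(text)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }
}
```

Calling LineSearch.LinesContaining(text, "words") on the sample text should give the two lines from the question.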

So, just for the fun of it, you can use these functions. They do what you asked, and the second one allows you to pass a URI, so it will download the text file from a web address and then proceed with the search.
public static global::System.Collections.Generic.List<string> SearchFor(string[] Lines, string Term)
{
global::System.Collections.Generic.List<string> r = new global::System.Collections.Generic.List<string>(Lines.Length);
foreach (string l in Lines) { if (l.Contains(Term)) { r.Add(l.Trim()); } }
r.TrimExcess();
return r;
}
public static global::System.Collections.Generic.List<string> SearchFor(string Text, string Term, bool TextIsUri = false)
{
if (TextIsUri) { using (global::System.Net.WebClient w = new global::System.Net.WebClient()) { Text = w.DownloadString(new global::System.Uri(Text)); } }
return SearchFor(Text.Split(new char[] { '\n' }, global::System.StringSplitOptions.RemoveEmptyEntries), Term);
}
Surely you can also add some safety measures, like testing whether the string is empty and such, but I left them out so you can see the simplest code and improve on it.

String split on words and queued punctuation characters

Here is the pattern I use for now:
string pattern = @"^(\s+|\d+|\w+|[^\d\s\w])+$";
Regex regex = new Regex(pattern);
if (regex.IsMatch(inputString))
{
Match match = regex.Match(inputString);
foreach (Capture capture in match.Groups[1].Captures)
{
if (!string.IsNullOrWhiteSpace(capture.Value))
tmpList.Add(capture.Value);
}
}
return tmpList.ToArray<string>();
With this I retrieve an array of strings, one item for each word and one item for each punctuation character.
What I'd like to achieve now is grouping queued punctuation chars in only one item, i.e. for now if there are three dots one after the other, I get three items in my array each containing a dot. Ultimately I'd like to have one item with three dots (or any other punctuation char for that matter).

Try this regex:
^(\s+|\d+|\w+|[^\d\s\w]+)+$

Try with following pattern. I added an extra +. Let me know if you intended something else. Hope it helps.
string pattern = @"^(\s+|\d+|\w+|[^\d\s\w]+)+$";
For inputString = "abc;..cbe;aaa...kjaskjas" I get this result:
abc
;..
cbe
;
aaa
...
kjaskjas
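The same grouping can also be sketched with Regex.Matches instead of reading the captures of a repeated group. This is an alternative, not the answer's own code: the anchors and the \s+ branch are dropped, so whitespace is simply skipped between matches. Tokenizer and Tokenize are hypothetical names.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class Tokenizer
{
    // Splits input into runs of digits, runs of word characters,
    // and runs of consecutive punctuation; whitespace is discarded.
    public static string[] Tokenize(string input)
    {
        return Regex.Matches(input, @"\d+|\w+|[^\d\s\w]+")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }
}
```

Because each match is already a whole token, no IsNullOrWhiteSpace filter is needed afterwards.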

Removing words with special characters in them

I have a long string composed of a number of different words.
I want to go through all of them, and if the word contains a special character or number (except '-'), or starts with a Capital letter, I want to delete it (the whole word not just that character). For all intents and purposes 'foreign' letters can count as special characters.
The obvious solution is to run a loop through each word (after splitting it) and then a loop through each character - but I'm hoping there's a faster way of doing it? Perhaps using Regex but I've almost no experience with it.
Thanks
ADDED:
(What I want for example:)
Input: "this Is an Example of 5 words in an input like-so from example.com"
Output: {this,an,of,words,in,an,input,like-so,from}
(What I've tried so far)
List<string> response = new List<string>();
string[] splitString = text.Split(' ');
foreach (string s in splitString)
{
bool add = true;
foreach (char c in s.ToCharArray())
{
if (!(c.Equals('-') || (Char.IsLetter(c) && Char.IsLower(c))))
{
add = false;
break;
}
if (add)
{
response.Add(s);
}
}
}
Edit 2:
For me a word should be a number of characters (a..z) separated by a space. ,/./!/... at the end shouldn't count for the 'special character' condition (which is really mostly just to remove urls or the like)
So:
"I saw a dog. It was black!"
should result in
{saw,a,dog,was,black}

So you want to find all "words" that only contain characters a-z or -, for words that are separated by spaces?
A regex like this will find such words:
(?<!\S)[a-z-]+(?!\S)
To also allow for words that end with single punctuation, you could use:
(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))
Example (ideone):
var re = @"(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))";
var str = "this, Is an! Example of 5 words in an input like-so from example.com foo: bar?";
var m = Regex.Matches(str, re);
Console.WriteLine("Matched: ");
foreach (Match i in m)
Console.Write(i + " ");
Notice the punctuation in the string.
Output:
Matched:
this an of words in an input like-so from foo bar
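As a rough sketch, the second pattern above can be wrapped in a small helper that returns the words as a list. WordFilter and PlainWords are hypothetical names; the pattern itself is unchanged.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordFilter
{
    // Lowercase a-z and '-' only, delimited by whitespace or the string
    // boundaries, optionally followed by one punctuation mark that is
    // checked by the lookahead but not included in the match.
    private static readonly Regex PlainWord =
        new Regex(@"(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))");

    public static List<string> PlainWords(string text)
    {
        return PlainWord.Matches(text)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }
}
```

For the input "I saw a dog. It was black!" from the question's second edit, this yields saw, a, dog, was, black.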

How about this?
(?<=^|\s+)(?<word>[a-z-]+)(?=$|\s+)
Edit: Meant (?<=^|\s+)(?<word>[a-z\-]+)(?=(?:\.|,|!|\.\.\.)?(?:$|\s+))
Rules:
Word can only be preceded by start of line or some number of whitespace characters
Word can only be followed by end of line or some number of whitespace characters (Edit supports words ending with periods, commas, exclamation points, and ellipses)
Word can only contain lower case (latin) letters and dashes
The named group containing each word is "word"

Have a look at Microsoft's How to: Search Strings Using Regular Expressions (C# Programming Guide) - it's about regexes in C#.

List<string> strings = new List<string>() {"asdf", "sdf-sd", "sdfsdf"};
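// Iterate backwards so removals do not shift the items still to be checked;
// i >= 0 makes sure the first item is checked too.
for (int i = strings.Count - 1; i >= 0; i--)
{
    if (strings[i].Contains("-"))
    {
        strings.RemoveAt(i);
    }
}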

This could be a starting point. Right now it only checks for "." as a special character. This outputs: "this an of words in an like-so from"
string pattern = @"[A-Z]\w+|\w*[0-9]+\w*|\w*[\.]+\w*";
string line = "this Is an Example of 5 words in an in3put like-so from example.com";
System.Text.RegularExpressions.Regex r = new System.Text.RegularExpressions.Regex(pattern);
line = r.Replace(line,"");

You can do this in two ways, the white-list way and the black-list way. With a white-list you define the set of characters that you consider acceptable; with a black-list it's the opposite.
Let's assume the white-list way, and that you accept only the characters a-z, A-Z and the - character. Additionally you have the rule that the first character of a word cannot be an upper-case character.
With this you can do something like this:
string target = "This is a white-list example: (Foo, bar1)";
var matches = Regex.Matches(target, @"(?:\b)(?<Word>[a-z]{1}[a-zA-Z\-]*)(?:\b)");
string[] words = matches.Cast<Match>().Select(m => m.Value).ToArray();
Console.WriteLine(string.Join(", ", words));
Outputs:
// is, a, white-list, example

You can use look-aheads and look-behinds to do this. Here's a regex that matches your example:
(?<=\s|^)[a-z-]+(?=\s|$)
The explanation is: match one or more alphabetic characters (lowercase only, plus hyphen), as long as what comes before the characters is whitespace (or the start of the string), and as long as what comes after is whitespace or the end of the string.
All you need to do now is plug that into System.Text.RegularExpressions.Regex.Matches(input, regexString) to get your list of words.
Reference: http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet

C# Parsing Text Within Quotes

I'm developing a simple little search mechanism and I want to allow the user to search for chunks of text with spaces. For example, a user can search for the name of a person:
Name: John Smith
I then "John Smith".Split(' ') into an array of two elements, {"John","Smith"}. I then return all of the records that match "John" AND "Smith" first followed by records that match either "John" OR "Smith." I then return no records for no matches. This isn't a complicated scenario and I have this part working.
I'd now like to be able to allow the user to ONLY return records that match "John Smith"
I'd like to use a basic quote syntax for searching. So if a user wants to search for "John Smith" OR Pocahontas they would enter: "John Smith" Pocahontas. The order of terms is absolutely irrelevant; "John Smith" does not receive priority over Pocahontas because he comes first in the list.
I have two main trains of thought on how I should parse the input.
A) Using regular expression then parsing stuff (IndexOf, Split)
B) Using only the parsing methods
I think a logical point of action would be to find the stuff in quotes; then remove it from the original string and insert it into a separate list. Then all the stuff left over from the original string could be split on the space and inserted into that separate list. If there is either 1 quote or an odd number, it is simply removed from the list.
How do I find the matches from within a regex? I know about regex.Replace, but how would I iterate through the matches and insert them into a list? I know there is some neat way to do this using the MatchEvaluator delegate and LINQ, but I know basically nothing about regex in C#.

EDIT: Came back to this tab without refreshing and didn't realize this question was already answered... the accepted answer is better.
I think pulling out the stuff in quotes first with regex is a good idea. Maybe something like this:
String sampleInput = "\"John Smith\" Pocahontas Bambi \"Jane Doe\" Aladin";
//Create regex pattern
Regex regex = new Regex("\"([^\".]+)\"");
List<string> searches = new List<string>();
//Loop through all matches from regex
foreach (Match match in regex.Matches(sampleInput))
{
//add the match value for the 2nd group to the list
//(1st group is the entire match)
//(2nd group is the first parenthesis group in the defined regex pattern
// which in this case is the text inside the quotes)
searches.Add(match.Groups[1].Value);
}
//remove the matches from the input
sampleInput = regex.Replace(sampleInput, String.Empty);
//split the remaining input and add the result to our searches list
searches.AddRange(sampleInput.Split(new char[] {' '}, StringSplitOptions.RemoveEmptyEntries));

I needed the same functionality as Shawn but I didn't want to use regex. Here is a simple solution that uses Split() instead of regex, for anyone else needing this functionality.
This works because the Split method, by default, creates empty entries in the array for consecutive separators in the source string. If we split on the quote character, the result is an array where the even-indexed entries are individual words and the odd-indexed entries are the quoted phrases.
Example:
"John Smith" Pocahontas
Results in
item(0) = (empty string)
item(1) = John Smith
item(2) = Pocahontas
And
1 2 "3 4" 5 "6 7" "8 9"
Results in
item(0) = 1 2
item(1) = 3 4
item(2) = 5
item(3) = 6 7
item(4) = (a single space)
item(5) = 8 9
item(6) = (empty string)
Note that an unmatched quote will result in a phrase from the last quote to the end of the input string.
public static List<string> QueryToTerms(string query)
{
List<string> Result = new List<string>();
// split on the quote token
string[] QuoteTerms = query.Split('"');
// switch to denote if the current loop is processing words or a phrase
bool WordTerms = true;
foreach (string Item in QuoteTerms)
{
if (!string.IsNullOrWhiteSpace(Item))
if (WordTerms)
{
// Item contains words. parse them and ignore empty entries.
string[] WTerms = Item.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries);
foreach (string WTerm in WTerms)
Result.Add(WTerm);
}
else
// Item is a phrase.
Result.Add(Item);
// Alternate between words and phrases.
WordTerms = !WordTerms;
}
return Result;
}

Use a regex like this:
string input = "\"John Smith\" Pocahontas";
Regex rx = new Regex(@"(?<="")[^""]+(?="")|[^\s""]\S*");
for (Match match = rx.Match(input); match.Success; match = match.NextMatch()) {
// use match.Value here, it contains the string to be searched
}
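An alternative sketch that captures the quoted text in a group instead of relying on lookarounds, which may be easier to reason about when the input holds several quoted phrases. QueryParser and Parse are hypothetical names. A stray, unmatched quote is simply ignored, which is the behaviour the question asks for.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QueryParser
{
    // Group 1: the text inside a pair of double quotes.
    // Group 2: a bare word containing no whitespace and no quotes.
    private static readonly Regex Term =
        new Regex(@"""([^""]*)""|([^\s""]+)");

    public static List<string> Parse(string query)
    {
        var terms = new List<string>();
        foreach (Match m in Term.Matches(query))
        {
            string term = m.Groups[1].Success
                ? m.Groups[1].Value.Trim()
                : m.Groups[2].Value;
            if (term.Length > 0)
            {
                terms.Add(term);   // skip empty phrases such as ""
            }
        }
        return terms;
    }
}
```

The terms come back in input order, with each quoted phrase kept as a single entry.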

How can I get a regex match to only be added once to the matches collection?

I have a string which has several html comments in it. I need to count the unique matches of an expression.
For example, the string might be:
var teststring = "<!--X1-->Hi<!--X1-->there<!--X2-->";
I currently use this to get the matches:
var regex = new Regex("<!--X.-->");
var matches = regex.Matches(teststring);
The results of this is 3 matches. However, I would like to have this be only 2 matches since there are only two unique matches.
I know I can probably loop through the resulting MatchCollection and remove the extra Match, but I'm hoping there is a more elegant solution.
Clarification: The sample string is greatly simplified from what is actually being used. There can easily be an X8 or X9, and there are likely dozens of each in the string.

I would just use the Enumerable.Distinct Method for example like this:
string subjectString = "<!--X1-->Hi<!--X1-->there<!--X2--><!--X1-->Hi<!--X1-->there<!--X2-->";
var regex = new Regex(@"<!--X\d-->");
var matches = regex.Matches(subjectString);
var uniqueMatches = matches
.OfType<Match>()
.Select(m => m.Value)
.Distinct();
uniqueMatches.ToList().ForEach(Console.WriteLine);
Outputs this:
<!--X1-->
<!--X2-->
For regular expression, you could maybe use this one?
(<!--X\d-->)(?!.*\1.*)
Seems to work on your test string in RegexBuddy at least =)
// (<!--X\d-->)(?!.*\1.*)
//
// Options: dot matches newline
//
// Match the regular expression below and capture its match into backreference number 1 «(<!--X\d-->)»
// Match the characters “<!--X” literally «<!--X»
// Match a single digit 0..9 «\d»
// Match the characters “-->” literally «-->»
// Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!.*\1.*)»
// Match any single character «.»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// Match the same text as most recently matched by capturing group number 1 «\1»
// Match any single character «.»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»

It appears you're doing two different things:
Matching comments like /<!--X.-->/
Finding the set of unique comments
So it is fairly logical to handle these as two different steps:
var regex = new Regex("<!--X.-->");
var matches = regex.Matches(teststring);
var uniqueMatches = matches.Cast<Match>().Distinct(new MatchComparer());
class MatchComparer : IEqualityComparer<Match>
{
public bool Equals(Match a, Match b)
{
return a.Value == b.Value;
}
public int GetHashCode(Match match)
{
return match.Value.GetHashCode();
}
}

Extract the comments and store them in an array. Then you can filter out the unique values.
But I don’t know how to implement this in C#.

Depending on how many Xn's you have you might be able to use:
(\<!--X1--\>){1}.*(\<!--X2--\>){1}
That will only match each occurrence of the X1, X2 etc. once provided they are in order.

Capture the inner portion of the comment as a group. Then put those strings into a hashtable(dictionary). Then ask the dictionary for its count, since it will self weed out repeats.
var teststring = "<!--X1-->Hi<!--X1-->there<!--X2-->";
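var tokens = new Dictionary<string, string>();
// (.*?) is lazy so each comment is matched on its own rather than
// everything from the first <!-- to the last -->
Regex.Replace(teststring, @"<!--(.*?)-->",
    match => {
        tokens[match.Groups[1].Value] = match.Groups[1].Value;
        return "";
    });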
var uniques = tokens.Keys.Count;
By using the Regex.Replace construct you get to have a lambda called on each match. Since you are not interested in the replacement result, you don't assign it to anything.
You must use Groups[1] because Groups[0] is the entire match.
I'm only storing the same value as both key and value so that it's easier to put into the dictionary, which only stores unique keys.
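If only the number of distinct matches is needed, a HashSet<string> expresses the same "unique keys" idea without a dummy value. This is a minimal sketch; UniqueMatches and CountUnique are hypothetical names, and it compares whole match values rather than a captured group.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UniqueMatches
{
    // Counts the distinct match values of pattern in input.
    public static int CountUnique(string input, string pattern)
    {
        var seen = new HashSet<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            seen.Add(m.Value);   // Add ignores values that are already present
        }
        return seen.Count;
    }
}
```

For the question's test string, UniqueMatches.CountUnique(teststring, @"<!--X\d-->") should return 2.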

If you want a distinct Match list from a MatchCollection without converting to string, you can use something like this:
var distinctMatches = matchList.OfType<Match>().GroupBy(x => x.Value).Select(x =>x.First()).ToList();
I know it has been 12 years but sometimes we need this kind of solutions, so I wanted to share. C# evolved, .NET evolved, so it's easier now.
