Dictionary API (lexical) [closed] - c#

Does anyone know a good .NET dictionary API? I'm not interested in meanings, rather I need to be able to query words in a number of different ways - return words of x length, return partial matches and so on...

Grab the flat text file from an open source spellchecker like ASpell (http://aspell.net/) and load it into a List or whatever structure you like.
For example:
// requires: using System.Linq; (for ToList() and the query syntax)
List<string> words = System.IO.File.ReadAllLines("MyWords.txt").ToList();
// C# 3.0 (LINQ) example:
// get all words of length 5:
var fiveLetterWords = from word in words where word.Length == 5 select word;
// get partial matches on "foo"
var fooWords = from word in words where word.Contains("foo") select word;
// C# 2.0 example:
// get all words of length 5:
words.FindAll(delegate(string s) { return s.Length == 5; });
// get partial matches on "foo"
words.FindAll(delegate(string s) { return s.Contains("foo"); });

You might want to look for a Trie implementation. That will certainly help with "words starting with XYZ" as well as exact matches. You may well want to have all of your data in multiple data structures, each one tuned for the particular task - e.g. one for anagrams, one for "by length" etc. Natural language dictionaries are relatively small compared with RAM these days, so if you really want speedy lookup, that's probably the way to go.
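As a rough illustration of the trie idea, here is a minimal sketch. The class name WordTrie and its members are made up for this example, and it assumes the words are plain strings that fit comfortably in memory. It supports exact matches and "words starting with" queries.

```csharp
using System;
using System.Collections.Generic;

// Minimal trie sketch. WordTrie and its member names are hypothetical.
public class WordTrie
{
    private class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWord; // true if a word ends at this node
    }

    private readonly Node root = new Node();

    // Insert a word one character at a time, creating nodes as needed.
    public void Add(string word)
    {
        Node node = root;
        foreach (char c in word)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsWord = true;
    }

    // Exact match: the path must exist and end on a word marker.
    public bool Contains(string word)
    {
        Node node = Find(word);
        return node != null && node.IsWord;
    }

    // Prefix query: walk to the prefix node, then collect every word below it.
    public List<string> StartingWith(string prefix)
    {
        var results = new List<string>();
        Node node = Find(prefix);
        if (node != null)
        {
            Collect(node, prefix, results);
        }
        return results;
    }

    // Returns the node reached by following key, or null if the path does not exist.
    private Node Find(string key)
    {
        Node node = root;
        foreach (char c in key)
        {
            if (!node.Children.TryGetValue(c, out node))
            {
                return null;
            }
        }
        return node;
    }

    private static void Collect(Node node, string prefix, List<string> results)
    {
        if (node.IsWord)
        {
            results.Add(prefix);
        }
        foreach (KeyValuePair<char, Node> child in node.Children)
        {
            Collect(child.Value, prefix + child.Key, results);
        }
    }
}
```

The results of StartingWith come back in no particular order, so sort them if order matters. A separate structure, such as a dictionary keyed by word length, would still be needed for the "by length" queries.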

Depending on how involved your queries are going to be, it might be worth investigating WordNet, which is basically a semantic dictionary. It includes parts of speech, synonyms, and other types of relationships between the words.

NetSpell (http://www.loresoft.com/netspell/) is a spell checker that's written in .NET that has word listings in several languages that you could use.

I'm with Barry Fandango on this one, but you can do it without LINQ. .NET 2.0 has some nice filtering methods on the List<T> type. The one I suggest is
List<T>.FindAll(Predicate<T>) : List<T>
This method runs every element in the list through the predicate and returns a list of the elements for which the predicate returns true. So, load your words from an open source dictionary, as suggested, into a List<string>. To find all words of length 5...
List<string> words = LoadFromDictionary();
List<string> fiveLetterWords = words.FindAll(delegate(string word)
{
    return word.Length == 5;
});
Or for all words starting with "abc"...
List<string> words = LoadFromDictionary();
List<string> abcWords = words.FindAll(delegate(string word)
{
    return word.StartsWith("abc");
});

Related

C# checking if a word is in an English dictionary? [closed]

I am trying to go through a list of words and for each one determine if it is a valid English word (for Scrabble). I'm not sure how to approach this, do I have to go find a text file of all English words and then use file reading and parsing methods to manually build a data structure like a trie or hashmap - or can I find those premade somewhere? What is the simplest way to go about this?
You can use the NetSpell library to check this. It can be installed easily from the NuGet Package Manager Console with the following command:
PM> Install-Package NetSpell
Then loop through the words and check each one using the library:
NetSpell.SpellChecker.Dictionary.WordDictionary oDict = new NetSpell.SpellChecker.Dictionary.WordDictionary();
oDict.DictionaryFile = "en-US.dic";
oDict.Initialize();
string wordToCheck = "door";
NetSpell.SpellChecker.Spelling oSpell = new NetSpell.SpellChecker.Spelling();
oSpell.Dictionary = oDict;
if (!oSpell.TestWord(wordToCheck))
{
    // Word does not exist in dictionary
    ...
}
Since you're looking specifically for valid Scrabble words, there are a few APIs that validate words for Scrabble. If you use anything that's not for that intended purpose then it's likely going to leave out some words that are valid.
Here's one, here's another, and here's a separate question that lists available APIs.
So that I can add some value beyond just pasting links, I'd recommend wrapping this in your own interface so that you can swap these out in case one or another is unavailable (since they're all free services.)
public interface IScrabbleWordValidator
{
bool IsValidScrabbleWord(string word);
}
Make sure your code only depends on that interface, and then write implementations of it that call whatever APIs you use.
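One possible implementation, as a sketch only: an offline validator backed by a HashSet loaded from a word list. The class name WordListScrabbleValidator is hypothetical, and whether the list you load is actually a valid Scrabble word list is an assumption you would need to check yourself. The interface is repeated so the example compiles on its own.

```csharp
using System;
using System.Collections.Generic;

public interface IScrabbleWordValidator
{
    bool IsValidScrabbleWord(string word);
}

// Hypothetical offline implementation: validates against an in-memory word list.
public class WordListScrabbleValidator : IScrabbleWordValidator
{
    private readonly HashSet<string> words;

    public WordListScrabbleValidator(IEnumerable<string> wordList)
    {
        // Case-insensitive set, so "DOOR" and "door" are treated as the same word.
        words = new HashSet<string>(wordList, StringComparer.OrdinalIgnoreCase);
    }

    public bool IsValidScrabbleWord(string word)
    {
        return !string.IsNullOrEmpty(word) && words.Contains(word);
    }
}
```

It could be constructed with something like new WordListScrabbleValidator(System.IO.File.ReadLines("scrabble-words.txt")), where the file name is a placeholder for a one-word-per-line list. An implementation that calls a web API can be swapped in later without touching the calling code.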

Efficient way of counting every occurrences of every words from a URL [duplicate]

I am building a tool where the user enters a URL and the page text is retrieved.
The text is then parsed and its words are counted.
I am currently reading this article from Microsoft:
https://msdn.microsoft.com/en-us/library/bb546166.aspx
I can now get the text, and I am trying to think of an efficient way to count every word.
The article's example requires a search term, but I need to count every word, not one specific word.
Here is what I am thinking:
get the text and convert it to a string
split it on delimiters and store the pieces in an array
loop through the array and count the occurrences of each word
Would this be efficient?
Using Linq
If you have a small amount of data, you can just split on spaces and group the results:
var theString = MethodToGetStringFromUrl(urlString);
var wordCount = theString
    .Split(' ')
    .GroupBy(a => a)
    .Select(a => new { word = a.Key, Count = a.Count() });
See the fiddle for a working copy.
Some Experiments and Results
I experimented in .NET Fiddle, and using regexes actually decreased performance and increased memory use; see here for what I am talking about.
Other alternative
Because you are getting the response from a URL, it might be more performant to search within the stream directly, rather than converting it to a string first and then searching.
Don't optimize unless you need to
Why do you need a performant way to do this count? Have you run into any issues, or do you just think you will? A good rule of thumb is not to optimize prematurely. For more information, check out this good question on the topic: When is optimisation premature?
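For comparison, the split-and-loop approach outlined in the question can be done in a single pass with a Dictionary. This is a sketch under assumptions: the WordCounter name and the separator list are made up for illustration, and words are compared case-insensitively, which may or may not be what you want.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: counts every word in a string in one pass.
public static class WordCounter
{
    // Assumed delimiter set; adjust it to the text you expect.
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '"', '(', ')' };

    public static Dictionary<string, int> CountWords(string text)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (string word in text.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            int current;
            counts.TryGetValue(word, out current); // current stays 0 for a new word
            counts[word] = current + 1;
        }
        return counts;
    }
}
```

Each word costs one dictionary lookup and one update, so the work grows linearly with the length of the text. Whether that beats the GroupBy version in practice is something to measure rather than assume.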

How do I get two numbers between two words (C#) [closed]

I have a string "Building1Floor2", and it's always in that format. How do I cleanly get the building number (e.g. 1) and the floor number? I'm thinking I need a regex, but I'm not entirely sure that's the best way. I could just use the index if the format stays the same, but if I have a high floor number, e.g. 100, it will break.
P.S. I'm using C#.
Use a regex like this:
Building(\d+)Floor(\d+)
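In C#, a pattern like this can be applied with Regex.Match and the captured groups read by index. The sketch below assumes the whole string must match; the LocationParser name is made up for illustration. It uses [0-9] rather than \d because \d in .NET also matches non-ASCII digits, which int.TryParse would reject.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper around the Building/Floor pattern.
public static class LocationParser
{
    // Anchored so the entire input has to be in the expected format.
    private static readonly Regex Pattern = new Regex(@"^Building([0-9]+)Floor([0-9]+)$");

    public static bool TryParse(string input, out int building, out int floor)
    {
        building = 0;
        floor = 0;
        Match match = Pattern.Match(input);
        if (!match.Success)
        {
            return false;
        }
        // Groups[1] is the building number, Groups[2] the floor number.
        return int.TryParse(match.Groups[1].Value, out building)
            && int.TryParse(match.Groups[2].Value, out floor);
    }
}
```

Because the groups capture one or more digits, a floor such as 100 is handled the same way as a single-digit floor.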
Regex would be an ok option here if "Building" and "Floor" could change. e.g.: "Floor1Room23"
You could use "[A-Za-z]+([0-9]{1,})[A-Za-z]+([0-9]{1,})"
With those groupings, $1 would now be the Building number, and $2 would be Floor.
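If "Building" and "Floor" never change, however, a regex might be overkill; you could use plain string operations.
Find the index of the "F" and take the substrings around it.
int first = str.IndexOf("F");
// the building number sits between "Building" and the "F" of "Floor"
string building = str.Substring("Building".Length, first - "Building".Length);
string floor = str.Substring(first + "Floor".Length);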

Porter stemmer algorithm in information-retrieval [closed]

I need to create a simple search engine for my application. To simplify: we have a lot of texts, and I need to search them and show relevant results.
I based my work on this great article, extended a few things, and it works pretty well for me.
But I have a problem with stemming words to terms. For example, the words "annotation", "annotations", etc. are stemmed to "annot", so a search can give unexpected results:
"anno" - nothing
"annota" - nothing
etc.
Only the word "annot" gives a relevant result. So how should I improve my search to give the expected results? "annot" contains "anno", and "annota" is only slightly longer than "annot". Using Contains all the time obviously isn't the solution.
In the first case I could use a ternary search tree, but in the second case I don't know what to do.
Any ideas would be very helpful.
UPDATE
oleksii has pointed me to n-grams here, which may work for me, but I don't know how to index n-grams properly.
So the questions are:
Which data structure would be best for my needs?
How do I properly index my n-grams?
Stemming perhaps isn't very relevant here. Stemming will convert a plural to a singular form.
Given that you have a tokeniser, a stemmer and a cleaner (to remove stop words, perhaps punctuation and numbers, short words, etc.), what you are looking at is full-text search. I would advise you to get an off-the-shelf solution (like Elasticsearch, Lucene or Solr), but if you fancy a DIY approach I can suggest the following naive implementation.
Step 1
Create a search-orientated tokeniser. One example would be an n-gram tokeniser. It will take your word and split it into the following sequences:
annotation
1 - [a, n, n, o, t, a, ...]
2 - [an, nn, no, ot, ...]
3 - [ann, nno, not, ota, ...]
4 - [anno, nnot, nota, otat, ...]
....
Step 2
Sort n-grams for more efficient look-up
Step 3
Search n-grams for exact match using binary search
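The three steps above could be sketched roughly as follows. This is a naive illustration rather than a tuned index: NGramIndex and its members are hypothetical names, every (n-gram, term) pair is kept in one in-memory list, and the query is expected to be between the minimum and maximum n-gram lengths used when indexing.

```csharp
using System;
using System.Collections.Generic;

// Naive n-gram index sketch. All names here are hypothetical.
public class NGramIndex
{
    // Each entry pairs an n-gram (Key) with the term it came from (Value).
    private readonly List<KeyValuePair<string, string>> entries =
        new List<KeyValuePair<string, string>>();
    private bool sorted;

    // Step 1: split a word into all of its n-grams of length n.
    public static IEnumerable<string> NGrams(string word, int n)
    {
        for (int i = 0; i + n <= word.Length; i++)
        {
            yield return word.Substring(i, n);
        }
    }

    public void Add(string term, int minN, int maxN)
    {
        for (int n = minN; n <= maxN; n++)
        {
            foreach (string gram in NGrams(term, n))
            {
                entries.Add(new KeyValuePair<string, string>(gram, term));
            }
        }
        sorted = false;
    }

    public List<string> Search(string query)
    {
        // Step 2: sort the n-grams once so they can be binary searched.
        if (!sorted)
        {
            entries.Sort((a, b) => string.CompareOrdinal(a.Key, b.Key));
            sorted = true;
        }

        // Step 3: binary search for the first entry whose n-gram is >= query.
        int lo = 0;
        int hi = entries.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(entries[mid].Key, query) < 0)
            {
                lo = mid + 1;
            }
            else
            {
                hi = mid;
            }
        }

        // Collect the distinct terms of every entry that matches exactly.
        var terms = new List<string>();
        for (int i = lo; i < entries.Count && entries[i].Key == query; i++)
        {
            if (!terms.Contains(entries[i].Value))
            {
                terms.Add(entries[i].Value);
            }
        }
        return terms;
    }
}
```

With the term "annot" indexed for n from 3 to 5, a search for "anno" finds it. A query longer than the indexed term, such as "annota", still finds nothing, so the unstemmed words may need to be indexed as well if such queries should match.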

Replace word in string but ignore accented characters [closed]

If the input string is
Cat fish bannedword bread bánnedword mouse bãnnedword
It should output
Cat fish bread mouse
What would be the best way to do this without slowing down the performance?
There are a number of ways you can do this, but none of them (at least as far as I know) will work without some performance cost.
The most obvious way is to remove the accented characters first and then use a simple string.Replace(). As for removing accented characters, this or this Stack Overflow question should help you.
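A common way to strip accents in .NET, shown here as a sketch, is to decompose the string with Normalize(NormalizationForm.FormD) and drop the non-spacing marks. The AccentStripper name is made up for illustration, and this approach only handles accents that decompose into a base letter plus combining marks.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper that removes combining accent marks from a string.
public static class AccentStripper
{
    public static string RemoveDiacritics(string text)
    {
        // FormD splits an accented letter into its base letter plus combining marks.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Keep everything except the combining (non-spacing) marks.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                builder.Append(c);
            }
        }
        // Recompose whatever is left into the usual composed form.
        return builder.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

After this step a plain Replace("bannedword", "") removes every variant, but note that it also strips accents from the words you keep, which may not be acceptable.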
Another approach could be to split the string into an array of strings (each one a separate word) and then remove every word that equals the 'bannedword', using a comparison option that ignores accents.
Something like:
string[] splitInput = input.Split(' ');
StringBuilder output = new StringBuilder();
foreach (string word in splitInput)
{
    // Compare returns 0 when the words are equal; IgnoreNonSpace makes it ignore accents
    if (string.Compare(word, bannedWord, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace) != 0)
    {
        if (output.Length > 0) output.Append(' ');
        output.Append(word);
    }
}
string s_output = output.ToString();
// I've not tested it in Visual Studio, so there might be mistakes. A LINQ query could also simplify it (and potentially enable pluralization).
And finally, it should be possible to come up with a clever regex solution (probably the fastest way), but not being an expert on regex I can't help you with that. This might point you in the right direction, if you know at least something about regexes.
