Fast look up if word exists in dictionary text file

Fast look up if word exists in dictionary text file - c#

I have a large text file (~10mb) that has more or less every dictionary in a specific language, and each word is new line deliminated.
I want to do a really fast lookup to see if a word exists in a file -
What is the fastest way to do this without looping through each line?
It is sorted, and I can do all the pre-processing I want.
I considered doing some sort of Binary search, but I didnt know how I could do this, since all my lines are not a fixed number of bytes (and thus I wouldn't know where to jump the stream to). And surprisingly, I couldnt find a tool to do the fixed-width thing for me.
Any suggestions?
Thanks!

I'd suggest building a Trie from the dictionary. That gives you very quick lookups to see whether a word is in there.

A trie is a good bet if you don't mind using some more storage: http://en.wikipedia.org/wiki/Trie

Related

Would a Patricia Trie algorithm be the fastest way to do a 'prefix' search (i.e. Starts With search) in .NET?

I have a file that contains a lot of names of people and some meta for each person.
I'm trying to do a search for the people, given some search query.
So I'm first going to do an exact search (easy, implemented as a Dictionary with a Key Lookup).
If that fails (or fails to reach the max number of records specified), then I was hoping to try and go back over all the keys again and do a StartsWith check.
Now this works fine. But I was curious if there's a better way to do this. Looking around the interwebs it looks like there's a data structure called a Patricia Trie alg.
so ..
Is the Patricia Trie the recommended way of doing a StartsWith check over a large set of string-keys?
If yes, is there a nuget package for this?
No - i do no want to do a Contains check. nope. nada. zilch.
Edit 1:
Also, my data will be loaded in once at app startup so I'm happy to incur a slower 'build' step while creating/populating the data structure. Especially if runtime search-perf is better.
Also, I understand that the word 'performance' depends on heaps of factors, like the number of items stored and the size of the search queries etc. So this is more an academic question which I wish to do some code-comparison vs a microptimisation anti-pattern.
Edit 2:
Also, this previous SO question sorta talks about a similar problem I'm having but the answers seem to be talking about a substring search and not a prefix (i.e. StartsWith) search.

Need to parse C# textbox for objectionable words

I'm part of a small "message board" type project being built in a C# Web Form. I need to parse the user-entered text for objectionable words. This is my first C# project and I'm not sure how to split the words in the textbox.
It's been requested that I make an XML config file to contain the words to be screened for. Ideally, I would like to do a fark.com style replace. I've never made an XML config file and I really just need a place to start. All the config file information I've found has not been particularly applicable to this scenario.
Edit:
I ended up using a .txt file and splitting it on whitespace, then parsing the textbox on whitespace and comparing words. The project leader wanted a config file, but I pitched him on the simple solution and we went for it. Thanks for the replies.

An XML file won't scale well, especially if accessed concurrently. You'd better be using a database engine for such a task.

Making an XML config file just to filter a bunch of words probably isn't the best way to go there, considering it's most-likely just going to be a giant list of strings...
If it's not, have a look at the XmlDocument Class and the System.Xml namespace I assume you're aware of the format for XML documents but, if not, here is a simple example. The format is pretty much open to whatever XML tags you want, but the XmlDocument class I linked you to does have some fairly annoying catches that you'll come across while implementing it.
In terms of splitting the user text, it's fairly easy to hide "bad" words in another string so I'm not sure String.Split() is even what you want either. You will probably want to Regex it.
With that said, I came across this blog post a while ago that offers a simple profanity filter for .NET using Regex. Perhaps it will suit your needs.

Depends on how large this "bad words list" will be, and whether you expect it to change.
If it's pretty static, I would load the list from your XML file into some kind of in-memory collection. Then for each line of text you receive, parse the line into words, and then check each word for its existence in the collection.
If it's going to change frequently, and you need to pick up on those changes quickly, then you want more random access...that means a database. Hitting an XML repeatedly would be a performance drag.
Either way, split the string and react to each hit.
The string can be split up using something like:
myLineOfText.Split(new String[] { " " }, StringSplitOptions.RemoveEmptyEntries);

Word game mechanics in XNA

I'm planning on making a casual word game for WP7 using XNA. The game mechanics are fine enough for me to implement but it is just the checking to see if the word they make is actually a word or not.
I thought about having a text file and loading that into memory at the start, but surely this wouldn't be possible to keep in memory for a phone? Also how slow would it be to read from this to see if it is a word. How would they be stored in memory? Would it be best to use a dictionary/hashmap and each key is a word and i just check to see if that key exists? Or would it put them in an array?
Stuck on the best way to implement this, so any input is appreciated. Thanks

Depending on your phones hardware, you could probably just load up a text file into memory. The english language probably has only a couple hundred thousand words. Assuming your average word is around 5 characters or so, thats roughly a meg of data. You will have overhead managing that file in memory, but thats where specifics of hardware matter. BTW, it's not uncommon for current generation of phones to have a gig of RAM.
Please see the following related SO questions which require a text file for a dictionary of words.
Dictionary text file

Putting a text file into memory, even of a whole dictionary, shouldn't be too bad as seth flowers has said. Choosing an appropriate data structure to hold the words will be important.
I would not recommend a dictionary using words as keys... that's kind of silly honestly. If you only have keys and no values, what good is a dictionary? However, you may be on a good track with the Dictionary idea. The first thing I would try would be a Dictionary<char, string[]>, where the key is the first letter, and the value is a list of all words beginning with that letter. Of course, that array will be very long, and search time on the array slow (though lookup on the key should be zippy, as char hashes are unique). The advantage is that, if you use the proper .txt dictionary file and load each word in order, you will know that list is ordered by alphabet. So, you can use efficient search techniques like binary search, or any number of searches formulated for pre-sorted lists. It may not be that slow in the end.
If you want to go further, though, you can use the structure which underlies predictive text. It's called a Patricia Trie, or Radix Trie (Wikipedia). Starting with the first letter, you work your way through all possible branches until you either:
assemble the word the user entered, so it is a valid word
reach the end of the branch; this word does not exist.
'Tries' were made to address this sort of problem. I've never represented one in code, so I'm afraid I can't give you any pointers (ba dum tsh!), but there's likely a wealth of information on how to do it available on the internet. Using a Trie will likely be the most efficient solution, but if you find that an alphabet Dictionary like I mentioned above is sufficiently fast using binary search, you might just want to stick with that for now while you develop the actual gameplay. Getting bogged down with finding the best solution when just starting your game tends to bleed off your passion for getting it done. If you run into performance issues, then you make improvements-- at least that's my philosophy when designing games.
The nice thing is, since Windows Phone supports only essentially 2 different specs, once you test the app and see it runs smoothly on them, you really don't have to worry about optimizing for any worse conditions. So use what works!
P.S.: on Windows Phone, loading text files is tricky. Here is a post on the issue which should help you.

Parse numbers from large text, possibly without regex (performance critical)

I'm extremely familiar with regex before you all start answering with variations of: /d+
I want to know if there are alternatives to regex for parsing numbers out of a large text file.
I'm parsing through tons of huge files and need to do some group/location analysis on the positions of keywords. I'm now at the point where i need to start finding groups of numbers as well nested closely to my content of interest. I want to avoid regex if at all possible because this needs to be a speedy process.
It is possible to take chunks of a file to inspect for the numbers of interest. That however would require more work and add hard coded limits for searching. (i'd like to avoid this)
I'm open to any suggestions.
UPDATE
Sorry for the lack of sample data. For HIPAA reasons I'd rather not even consider scrambling the text and posting it.
A great substitute would be the HTML source of any stackoverflow.com question page. Imagine I needed to grab the reputation (score) of all people that posted an answer to a question. This also means that the comma (,) is needed as well. I can't remove the html to simplify the content because I'm using some density analysis to weed out unrelated content. Removing the HTML would mix content too close together.

Unless the file is some sort of SGML, then I don't know of any method (which is not to say there isn't, I just don't know of one)
However, it's not to say that you can't create your own parser; you could eliminate some of the overheads of the .Net regex library by writing something that only finds ranges of numbers.
Fundamentally, I guess that that's all any library would do, at the most basic level.
Might help if you can post a sample of the sort of data you'll be processing?

StreamReader.ReadLine() starting from the end of the stream

I'm working in C#/.NET and I'm parsing a file to check if one line matches a particular regex. Actually, I want to find the last line that matches.
To get the lines of my file, I'm currently using the System.IO.StreamReader.ReadLine() method but as my files are very huge, I would like to optimize a bit the code and start from the end of the file.
Does anyone know if there is in C#/.NET a similar function to ReadLine() starting from the end of the stream? And if not, what would be, to your mind, the easiest and most optimized way to do the job described above?

Funny you should mention it - yes I have. I wrote a ReverseLineReader a while ago, and put it in MiscUtil.
It was in answer to this question on Stack Overflow - the answer contains the code, although it uses other bits of MiscUtil too.
It will only cope with some encodings, but hopefully all the ones you need. Note that this will be less efficient than reading from the start of the file, if you ever have to read the whole file - all kinds of things may assume a forward motion through the file, so they're optimised for that. But if you're actually just reading lines near the end of the file, this could be a big win :)
(Not sure whether this should have just been a close vote or not...)

Since you are using a regular expression I think your best option is going to be to read the entire line into memory and then attempt to match it.
Perhaps if you provide us with the regular expression and a sample of the file contents we could find a better way to solve your problem.

"Easiest" -vs- "Most optimized"... I don't think you're going to get both
You could open the file and read each line. Each time you find one that fits your criteria, store it in a variable (replacing any earlier instance). When you finish, you will have the last line that matches.
You could also use a FileStream to set the position near the end of your file. Go through the steps above, and if no match is found, set your FileStream position earlier in your file, until you DO find a match.

This ought to do what you're looking for, it might be memory heavy for what you need, but I don't know what your needs are in that area:
string[] lines = File.ReadAllLines("C:\\somefilehere.txt");
IEnumerable<string> revLines = lines.Reverse();
foreach(string line in revLines) {
/*do whatever*/
}
It would still require reading every line at the outset, but it might be faster than doing a check on each one as you do so.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.