Custom Collation Ordering

Custom Collation Ordering - c#

Is there a way in .NET/C# to sort a List<string> according to a custom alphabetical order?
I have a list of words:
{ "badum", "śiram", "ðaur", "hor", "áltar", "aun" }
that I wish to sort in the following order:
{ "áltar", "aun", "badum", "śiram", "hor", "ðaur" }
By custom alphabetical order, I mean that I'm working on a constructed language with an alphabet that looks like this: ABZTMIGJLNKSOŚPRFUHDVEÐÞY. A C# implementation of the RuleBasedCollator found in Java would be perfect! If no such thing exists, a few pointers on writing a custom algorithm would be appreciated.
Thank you in advance.

I would definitely start with creating a RuleBasedCollator. Figuring out the rules that you want is one of the harder tasks.
There is a project that provides .net bindings over icu which may suit you.
If that doesn't meet your requirements, and you decide to write your own, the Unicode Collation Algorithm is a good resource. Keep in mind that natural language sorting conceptually (although many optimizations are possible) involves separate passes with increasing specificity. The first pass will look for so-called primary distinctions (usually ignoring differences of case and certain diacritics and punctuation) if there are no differences and the number of primary units in both strings is the same, then you can make the second pass, this time taking into account diacritic differences, if any. Next you process the case distinctions and finally punctuation distinctions.

You can pass custom sorter to the List.Sort() method:
List<string> foo = new List<string>();
foo.Sort((a, b) => a.CompareTo(b));
This will sort the list in place depending on which criteria you want to use (above obviously does just a regular string comparison).

Related

Data structure for searching strings

I am looking for the best data structure for the following case:
In my case I will have thousands of strings, however for this example I am gonna use two for obvious reasons. So let's say I have the strings "Water" and "Walter", what I need is when the letter "W" is entered both strings to be found, and when "Wat" is entered "Water" to be the only result. I did a research however I am still not quite sure which is the correct data structure for this case and I don't want to implement it if I am not sure as this will waste time. So basically what I am thinking right now is either "Trie" or "Suffix Tree". It seems that the "Trie" will do the trick but as I said I need to be sure. Additionally the implementation should not be a problem so I just need to know the correct structure. Also feel free to let me know if there is a better choice. As you can guess normal structures such as Dictionary/MultiDictionary would not work as that will be a memory killer. I am also planning to implement cache to limit the memory consumption. I am sorry there is no code but I hope I will get a answer. Thank you in advance.

You should user Trie. Tries are the foundation for one of the fastest known sorting algorithms (burstsort), it is also used for spell checking, and is used in applications that use text completion. You can see details here.

Practically, if you want to do auto suggest, then storing upto 3-4 chars should suffice.
I mean suggest as and when user types "a" or "ab" or "abc" and the moment he types "abcd" or more characters, you can use map.keys starting with "abcd" using c# language support lamda expressions.
Hence, I suggest, create a map like:
Map<char, <Map<char, Map<char, Set<string>>>>> map;
So, if user enters "a", you look for map[a] and finds all children.

Fastest way to check if a string is a substring C#?

I have a need to check if a list of items contains a string...so kind of like the list gets filtered as the user types in a search box. So, on the text changed event, I am checking if the entered text is contained in one of the listox items and filtering out...so
something like:
value.Contains(enteredText)
I was wondering if this is the fastest and most efficient way to filter out listbox items?
Is Contains() method the best way to search for substrings in C#?

I'd say that in all but very exceptional circumstances, it's fast and efficient enough, and even in such exceptional circumstances it's likely to be a purely academical problem. If you use it and come across any bottlenecks in your logic related to this then I'd be surprised, but only then would it be worth looking at, then chances are you'll be looking elsewhere.

Contains is one of the cheapest methods in my code completion filtering algorithm (Part 6 #6, where #7 and the fuzzy logic matching described in the footnote are vastly more expensive), which doesn't have problems keeping up with even a fast typing user and thousands of items in the dropdown.
I highly doubt it will cause you problems.

Although this is not the fastest option globally, it is the fastest one for which you do not need to code anything. It should be sufficient for filtering drop-down items.
For longer texts, you may want to go with the KMP Algorithm, which has a linear timing complexity. Note, however, that it would not make any difference for very short search strings.
For searches that have lots of matches (e.g. ones that you get for the first one to two characters) you may want to precompute a table that maps single letters and letter pairs to the rows in your drop-down list for a much faster look-up at the expense of using more memory (a pretty standard tradeoff in programming in general).

C# better way to do this?

Hi I have this code below and am looking for a prettier/faster way to do this.
Thanks!
string value = "HelloGoodByeSeeYouLater";
string[] y = new string[]{"Hello", "You"};
foreach(string x in y)
{
value = value.Replace(x, "");
}

You could do:
y.ToList().ForEach(x => value = value.Replace(x, ""));
Although I think your variant is more readable.

Forgive me, but someone's gotta say it,
value = Regex.Replace( value, string.Join("|", y.Select(Regex.Escape)), "" );
Possibly faster, since it creates fewer strings.
EDIT: Credit to Gabe and lasseespeholt for Escape and Select.

While not any prettier, there are other ways to express the same thing.
In LINQ:
value = y.Aggregate(value, (acc, x) => acc.Replace(x, ""));
With String methods:
value = String.Join("", value.Split(y, StringSplitOptions.None));
I don't think anything is going to be faster in managed code than a simple Replace in a foreach though.

It depends on the size of the string you are searching. The foreach example is perfectly fine for small operations but creates a new instance of the string each time it operates because the string is immutable. It also requires searching the whole string over and over again in a linear fashion.
The basic solutions have all been proposed. The Linq examples provided are good if you are comfortable with that syntax; I also liked the suggestion of an extension method, although that is probably the slowest of the proposed solutions. I would avoid a Regex unless you have an extremely specific need.
So let's explore more elaborate solutions and assume you needed to handle a string that was thousands of characters in length and had many possible words to be replaced. If this doesn't apply to the OP's need, maybe it will help someone else.
Method #1 is geared towards large strings with few possible matches.
Method #2 is geared towards short strings with numerous matches.
Method #1
I have handled large-scale parsing in c# using char arrays and pointer math with intelligent seek operations that are optimized for the length and potential frequency of the term being searched for. It follows the methodology of:
Extremely cheap Peeks one character at a time
Only investigate potential matches
Modify output when match is found
For example, you might read through the whole source array and only add words to the output when they are NOT found. This would remove the need to keep redimensioning strings.
A simple example of this technique is looking for a closing HTML tag in a DOM parser. For example, I may read an opening STYLE tag and want to skip through (or buffer) thousands of characters until I find a closing STYLE tag.
This approach provides incredibly high performance, but it's also incredibly complicated if you don't need it (plus you need to be well-versed in memory manipulation/management or you will create all sorts of bugs and instability).
I should note that the .Net string libraries are already incredibly efficient but you can optimize this approach for your own specific needs and achieve better performance (and I have validated this firsthand).
Method #2
Another alternative involves storing search terms in a Dictionary containing Lists of strings. Basically, you decide how long your search prefix needs to be, and read characters from the source string into a buffer until you meet that length. Then, you search your dictionary for all terms that match that string. If a match is found, you explore further by iterating through that List, if not, you know that you can discard the buffer and continue.
Because the Dictionary matches strings based on hash, the search is non-linear and ideal for handling a large number of possible matches.
I'm using this methodology to allow instantaneous (<1ms) searching of every airfield in the US by name, state, city, FAA code, etc. There are 13K airfields in the US, and I've created a map of about 300K permutations (again, a Dictionary with prefixes of varying lengths, each corresponding to a list of matches).
For example, Phoenix, Arizona's main airfield is called Sky Harbor with the short ID of KPHX. I store:
KP
KPH
KPHX
Ph
Pho
Phoe
Ar
Ari
Ariz
Sk
Sky
Ha
Har
Harb
There is a cost in terms of memory usage, but string interning probably reduces this somewhat and the resulting speed justifies the memory usage on data sets of this size. Searching happens as the user types and is so fast that I have actually introduced an artificial delay to smooth out the experience.
Send me a message if you have the need to dig into these methodologies.

Extension method for elegance
(arguably "prettier" at the call level)
I'll implement an extension method that allows you to call your implementation directly on the original string as seen here.
value = value.Remove(y);
// or
value = value.Remove("Hello", "You");
// effectively
string value = "HelloGoodByeSeeYouLater".Remove("Hello", "You");
The extension method is callable on any string value in fact, and therefore easily reusable.
Implementation of Extension method:
I'm going to wrap your own implementation (shown in your question) in an extension method for pretty or elegant points and also employ the params keyword to provide some flexbility passing the arguments. You can substitute somebody else's faster implementation body into this method.
static class EXTENSIONS {
static public string Remove(this string thisString, params string[] arrItems) {
// Whatever implementation you like:
if (thisString == null)
return null;
var temp = thisString;
foreach(string x in arrItems)
temp = temp.Replace(x, "");
return temp;
}
}
That's the brightest idea I can come up with right now that nobody else has touched on.

how to recognize similar words with difference in spelling

I want to filter out duplicate customer names from a database. A single customer may have more than one entry to the system with the same name but with little difference in spelling. So here is an example: A customer named Brook may have three entries to the system
with this variations:
Brook Berta
Bruck Berta
Biruk Berta
Let's assume we are putting this name in one database column.
I would like to know the different mechanisms to identify such duplications form say a 100,000 records. We may use regular expressions in C# to iterate through all records or some other pattern matching technique or we may export these records to what ever best fits for such queries (SQL with Regular Expression capabilities)).
This is what I thought as a solution
Write a C# code to iterate through each record
Get only the Consonant letters in order (in the above case: BrKBrt)
Search for the same Consonant pattern from the other records considering
similar sounding letters like (C,K) (C,S), (F, PH)
So please forward any ideas.

The Double Metaphone algorithm, published in 2000, is a new and improved version of the Soundex algorithm that was patented in 1918.
The article has links to Double Metaphone implementations in many languages.

Have a look at Soundex
There is a Soundex function in Transact-SQL (see http://msdn.microsoft.com/en-us/library/ms187384.aspx):
SELECT
SOUNDEX('brook berta'),
SOUNDEX('Bruck Berta'),
SOUNDEX('Biruk Berta')
returns the same value B620 for each of the example values

The obvious, established (and well documented) algorithms for finding string similarity are:
Levenstein distance
Soundex

I would consider writing something such as the "famous" python spell checker.
http://norvig.com/spell-correct.html
This will take a word and find all possible alternatives based on missing letters, adding letters, swapping letters, etc.

You might want to google for phonetic similarity algorithm and you'll find plenty of information about this. Including this article on Codeproject about implementing a solution in C#.

Look into soundex. It's a pretty standard library in most languages that does what you require, i.e. algorithmically identify phonetic similarity.
http://en.wikipedia.org/wiki/Soundex

There is a very nice R (just search for "R" in Google) package for Record Linkage. The standard examples target exactly your problem: R RecordLinkage
The C-Code for Soundex etc. is taken directly from PostgreSQL!

I would recommend Soundex and derived algorithms over Lev distance for this solution. Levenstein distance more appropriate for spell checking solutions imho.

How do I determine if two similar band names represent the same band?

I'm currently working on a project that requires me to match our database of Bands and venues with a number of external services.
Basically I'm looking for some direction on the best method for determining if two names are the same. For Example:
Our database venue name - "The Pig and Whistle"
service 1 - "Pig and Whistle"
service 2 - "The Pig & Whistle"
etc etc
I think the main differences are going to be things like missing "the" or using "&" instead of "and" but there could also be things like slightly different spelling and words in different orders.
What algorithms/techniques are commonly used in this situation, do I need to filter noise words or do some sort of spell check type match?
Have you seen any examples of something simlar in c#?
UPDATE: In case anyone is interested in a c# example there is a heap you can access by doing a google code search for Levenshtein distance

The canonical (and probably the easiest) way to do this is to measure the Levenshtein distance between the two strings. If the distance is small relative to the size of the string, it's probably the same string. Note that if you have to compare a lot of very small strings it'll be harder to tell whether they're the same or not. It works better with longer strings.
A smarter approach might be to compare the Levenshtein distance between the two strings but to assign a distance of zero to the more obvious transformations, like "and"/"&", "Snoop Doggy Dogg"/"Snoop", etc.

I did something like this a while ago, I used the the Discogs database (which is public domain), which also tracks artist aliases;
You can either:
Use an API call (namevariations field).
Download the monthly data dumps (*_artists.xml.gz) & import it in your database. This contains the same data, but is obviously a lot faster.
One advantage of this over the Levenshtein distance) solution is that you'll get a lot less false matches.
For example, Ryan Adams and Bryan Adams have a score of 2, which is quite good (lower is better matches, Pig and Whistle and Pig & Whistle has a score of 3), yet they're obviously different people.
While you could make a smarter algorithm (which also looks at string length, for example), using the alias DB is a lot simpler & less error-phone; after implementing this, I could completely remove the solution that was suggested in the other answer & had better matches.

soundex may also be useful

In bioinformatics we use this to compare DNA- or protein sequences all the time.
There are plenty of algorithms, you probably want to look at global alignments.
In this respect the Needleman-Wunsch algorithm is probably what you seek.
If you have particularly long recurring strings to compare you might also want to consider heuristic searches like BLAST.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.