Fastest way to search a list of names in C#

Fastest way to search a list of names in C# - c#

I have a list of perhaps 100,000 strings in memory in my application. I need to find the top 20 strings that contain a certain keyword (case insensitive). That's easy to do, I just run the following LINQ.
from s in stringList
where s.ToLower().Contains(searchWord.ToLower())
select s
However, I have a distinct feeling that I could do this much faster and I need to find the way to that, because I need to look up in this list multiple times per second.

Finding substrings (not complete matches) is surprisingly hard. There is nothing built-in to help you with this. I suggest you look into Suffix Trees data structures which can be used to find substrings efficiently.
You can pull searchWord.ToLower() out to a local variable to save tons of string operations, btw. You can also pre-calculate the lower-case version of stringList. If you can't precompute, at least use s.IndexOf(searchWord, StringComparison.InvariantCultureIgnoreCase) != -1. This saves on expensive ToLower calls.
You can also slap an .AsParallel on the query.

Another option, although it would require a fair amount of memory, would be to precompute something like a suffix array (a list of positions within the strings, sorted by the strings to which they point).
http://en.wikipedia.org/wiki/Suffix_array
This would be most feasible if the list of strings you're searching against is relatively static. The entire list of string indexes could be stored in a single array of tuples(indexOfString, positionInString), upon which you would perform a binary search, using String.Compare(keyword, 0, target, targetPos, keyword.Length).
So if you had 100,000 strings of average 20 length, you would need 100,000 * 20 * 2*sizeof(int) of memory for the structure. You could cut that in half by packing both indexOfString and positionInString into a single 32 bit int, for example with positionInString in the lowest 12 bits, and the indexOfString in the remaining upper bits. You'd just have to do a little bit fiddling to get the two values back out. It's important to note that the structure contains no strings or substrings itself. The strings you're searching against exist only once.
This would basically give you a complete index, and allow finding any substring very quickly (binary search over the index the suffix array represents), with a minimum of actual string comparisons.
If memory is dear, a simple optimization of the original brute force algorithm would be to precompute a dictionary of unique chars, and assign ordinal numbers to represent each. Then precompute a bit array for each string with the bits set for each unique char contained within the string. Since your strings are relatively short, there should be a fair amount of variability of the resuting BitArrays (it wouldn't work well if your strings were very long). You then simply compute the BitArray or your search keyword, and only search for the keyword in those strings where keywordBits & targetBits == keywordBits. If your strings are preconverted to lower case, and are just the English alphabet, the BitArray would likely fit within a single int. So this would require a minimum of additional memory, be simple to implement, and would allow you to quickly filter out strings within which you will definitely not find the keyword. This might be a useful optimization since string searches are fast, but you have so many of them to do using the brute force search.
EDIT For those interested, here is a basic implementation of the initial solution I proposed. I ran tests using 100,000 randomly generated strings of lengths described by the OP. Although it took around 30 seconds to construct and sort the index, once made, the speed of searching for keywords 3000 times was 49,805 milliseconds for brute force, and 18 milliseconds using the indexed search, so a couple thousand times faster. If you rarely build the list, then my simple, but relatively slow method of initially building the suffix array should be sufficient. There are smarter ways to build it that are faster, but would require more coding than my basic implementation below.
// little test console app
static void Main(string[] args) {
var list = new SearchStringList(true);
list.Add("Now is the time");
list.Add("for all good men");
list.Add("Time now for something");
list.Add("something completely different");
while (true) {
string keyword = Console.ReadLine();
if (keyword.Length == 0) break;
foreach (var pos in list.FindAll(keyword)) {
Console.WriteLine(pos.ToString() + " =>" + list[pos.ListIndex]);
}
}
}
~~~~~~~~~~~~~~~~~~
// file for the class that implements a simple suffix array
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Collections;
namespace ConsoleApplication1 {
public class SearchStringList {
private List<string> strings = new List<string>();
private List<StringPosition> positions = new List<StringPosition>();
private bool dirty = false;
private readonly bool ignoreCase = true;
public SearchStringList(bool ignoreCase) {
this.ignoreCase = ignoreCase;
}
public void Add(string s) {
if (s.Length > 255) throw new ArgumentOutOfRangeException("string too big.");
this.strings.Add(s);
this.dirty = true;
for (byte i = 0; i < s.Length; i++) this.positions.Add(new StringPosition(strings.Count-1, i));
}
public string this[int index] { get { return this.strings[index]; } }
public void EnsureSorted() {
if (dirty) {
this.positions.Sort(Compare);
this.dirty = false;
}
}
public IEnumerable<StringPosition> FindAll(string keyword) {
var idx = IndexOf(keyword);
while ((idx >= 0) && (idx < this.positions.Count)
&& (Compare(keyword, this.positions[idx]) == 0)) {
yield return this.positions[idx];
idx++;
}
}
private int IndexOf(string keyword) {
EnsureSorted();
// binary search
// When the keyword appears multiple times, this should
// point to the first match in positions. The following
// positions could be examined for additional matches
int minP = 0;
int maxP = this.positions.Count - 1;
while (maxP > minP) {
int midP = minP + ((maxP - minP) / 2);
if (Compare(keyword, this.positions[midP]) > 0) {
minP = midP + 1;
} else {
maxP = midP;
}
}
if ((maxP == minP) && (Compare(keyword, this.positions[minP]) == 0)) {
return minP;
} else {
return -1;
}
}
private int Compare(StringPosition pos1, StringPosition pos2) {
int len = Math.Max(this.strings[pos1.ListIndex].Length - pos1.StringIndex, this.strings[pos2.ListIndex].Length - pos2.StringIndex);
return String.Compare(strings[pos1.ListIndex], pos1.StringIndex, this.strings[pos2.ListIndex], pos2.StringIndex, len, ignoreCase);
}
private int Compare(string keyword, StringPosition pos2) {
return String.Compare(keyword, 0, this.strings[pos2.ListIndex], pos2.StringIndex, keyword.Length, this.ignoreCase);
}
// Packs index of string, and position within string into a single int. This is
// set up for strings no greater than 255 bytes. If longer strings are desired,
// the code for the constructor, and extracting ListIndex and StringIndex would
// need to be modified accordingly, taking bits from ListIndex and using them
// for StringIndex.
public struct StringPosition {
public static StringPosition NotFound = new StringPosition(-1, 0);
private readonly int position;
public StringPosition(int listIndex, byte stringIndex) {
this.position = (listIndex < 0) ? -1 : this.position = (listIndex << 8) | stringIndex;
}
public int ListIndex { get { return (this.position >= 0) ? (this.position >> 8) : -1; } }
public byte StringIndex { get { return (byte) (this.position & 0xFF); } }
public override string ToString() {
return ListIndex.ToString() + ":" + StringIndex;
}
}
}
}

There's one approach that would be a lot faster. But it would mean looking for exact word matches, rather than using the Contains functionality.
Basically, if you have the memory for it you could create a Dictionary of words which also reference some sort of ID (or IDs) for the strings in which the word is found.
So the Dictionary might be of type <string, List<int>>. The benefit here of course is that you're consolidating a lot of words into a smaller collection. And, the Dictionary is very fast with lookups since it's built on a hash table.
Now if this isn't what you're looking for you might search for in-memory full-text searching libraries. SQL Server supports full-text searching using indexing to speed up the process beyond traditional wildcard searches. But a pure in-memory solution would surely be faster. This still may not give you the exact functionality of a wildcard search, however.

In that case what you need is a reverse index.
If you are keen to pay much you can use database specific full text search index, and tuning the indexing to index on every subset of words.
Alternatively, you can use a very successful open source project that can achieve the same thing.
You need to pre-index the string using tokenizer, and build the reverse index file. We have similar use case in Java where we have to have a very fast autocomplete in a big set of data.
You can take a look at Lucene.NET which is a port of Apache Lucene (in Java).
If you are willing to ditch LINQ, you can use NHibernate Search. (wink).
Another option is to implement the pre-indexing in memory, with preprocessing and bypass of scanning unneeded, take a look at the Knuth-Morris-Pratt algorithm.

Related

How to speed up string operations and avoid slow loops

I am writing a code which makes a lot of combinations (Combinations might not be the right word here, sequences of string in the order they are actually present in the string) that already exist in a string. The loop starts adding combinations to a List<string> but unfortunately, my loop takes a lot of time when dealing with any file over 200 bytes. I want to be able to work with hundreds of MBs here.
Let me explain what I actually want in the simplest of ways.
Lets say I have a string that is "Afnan is awesome" (-> main string), what I would want is a list of string which encompasses different substring sequences of the main string. For example-> A,f,n,a,n, ,i,s, ,a,w,e,s,o,m,e. Now this is just the first iteration of the loop. With each iteration, my substring length increases, yielding these results for the second iteration -> Af,fn,na,n , i,is,s , a,aw,we,es,so,om,me. The third iteration would look like this: Afn,fna,nan,an ,n i, is,is ,s a, aw, awe, wes, eso, som, ome. This will keep going on until my substring length reaches half the length of my main string.
My code is as follows:
string data = File.ReadAllText("MyFilePath");
//Creating my dictionary
List<string> dictionary = new List<string>();
int stringLengthIncrementer = 1;
for (int v = 0; v < (data.Length / 2); v++)
{
for (int x = 0; x < data.Length; x++)
{
if ((x + stringLengthIncrementer) > data.Length) break; //So index does not go out of bounds
if (dictionary.Contains(data.Substring(x, stringLengthIncrementer)) == false) //So no repetition takes place
{
dictionary.Add(data.Substring(x, stringLengthIncrementer)); //To add the substring to my List<string> -> dictionary
}
}
stringLengthIncrementer++; //To increase substring length with each iteration
}
I use data.Length / 2 because I only need combinations at most half the length of the entire string. Note that I search the entire string for combinations, not half of it.
To further simplify what I am trying to do -> Suppose I have an input string =
"abcd"
the output would be =
a, b, c, d, ab, bc, cd, This rest will be cut out as it is longer than half the length of my primary string -> //abc, bcd, abcd
I was hoping if some regex method may help me achieve this. Anything that doesn't consist of loops. Anything that is exponentially faster than this? Some simple code with less complexity which is more efficient?
Update
When I used Hashset instead of List<string> for my dictionary, I did not experience any change of performance and also got an OutOfMemoryException:

You can use linq to simplify the code and very easily parallelize it, but it's not going to be orders of magnitude faster, as you would need to run it on files of 100s of MBs (that's very likely impossible).
var data = File.ReadAllText("MyFilePath");
var result = Enumerable.Range(1, data.Length / 2)
.AsParallel()
.Select(len => new HashSet<string>(
Enumerable.Range(0, data.Length - len + 1) //Adding the +1 here made it work perfectly
.Select(x => data.Substring(x, len))))
.SelectMany(t=>t)
.ToList();

General improvements, that you can do in your code to improve the performance (I don't consider if there're other more optimal solutions).
calculate data.Substring(x, stringLengthIncrementer) only once
as you do search, use SortedList, it will be faster.
initialize the List (or SortedList, or whatever) with calculated number of items. Like new List(CalucatedCapacity).
or you can try to write an algorithm that produces combinations without checking for duplicates.

You may be able to use HashSet combined with MoreLINQ's Batch feature (available on NuGet) to simplify the code a little.
public static void Main()
{
string data = File.ReadAllText("MyFilePath");
//string data = "Afnan is awesome";
var dictionary = new HashSet<string>();
for (var stringLengthIncrementer = 1; stringLengthIncrementer <= (data.Length / 2); stringLengthIncrementer++)
{
foreach (var skipper in Enumerable.Range(0, stringLengthIncrementer))
{
var batched = data.Skip(skipper).Batch(stringLengthIncrementer);
foreach (var batch in batched)
{
dictionary.Add(new string(batch.ToArray()));
}
}
}
Console.WriteLine(dictionary);
dictionary.ForEach(z => Console.WriteLine(z));
Console.ReadLine();
}
For this input:
"Afnan is awesome askdjkhaksjhd askjdhaksjsdhkajd asjsdhkajshdkjahsd asksdhkajshdkjashd aksjdhkajsshd98987ad asdhkajsshd98xcx98asdjaksjsd askjdakjshcc98z98asdsad"
performance is roughly 10x faster than your current code.

Sort Sequential Guids Generated by UuidCreateSequential

I tried to sort Guids generated by UuidCreateSequential, but I see the results are not correct, am I mising something? here is the code
private class NativeMethods
{
[DllImport("rpcrt4.dll", SetLastError = true)]
public static extern int UuidCreateSequential(out Guid guid);
}
public static Guid CreateSequentialGuid()
{
const int RPC_S_OK = 0;
Guid guid;
int result = NativeMethods.UuidCreateSequential(out guid);
if (result == RPC_S_OK)
return guid;
else throw new Exception("could not generate unique sequential guid");
}
static void TestSortedSequentialGuid(int length)
{
Guid []guids = new Guid[length];
int[] ids = new int[length];
for (int i = 0; i < length; i++)
{
guids[i] = CreateSequentialGuid();
ids[i] = i;
Thread.Sleep(60000);
}
Array.Sort(guids, ids);
for (int i = 0; i < length - 1; i++)
{
if (ids[i] > ids[i + 1])
{
Console.WriteLine("sorting using guids failed!");
return;
}
}
Console.WriteLine("sorting using guids succeeded!");
}
EDIT1:
Just to make my question clear, why the guid struct is not sortable using the default comparer ?
EDIT 2:
Also here are some sequential guids I've generated, seems they are not sorted ascending as presented by the hex string
"53cd98f2504a11e682838cdcd43024a7",
"7178df9d504a11e682838cdcd43024a7",
"800b5b69504a11e682838cdcd43024a7",
"9796eb73504a11e682838cdcd43024a7",
"c14c5778504a11e682838cdcd43024a7",
"c14c5779504a11e682838cdcd43024a7",
"d2324e9f504a11e682838cdcd43024a7",
"d2324ea0504a11e682838cdcd43024a7",
"da3d4460504a11e682838cdcd43024a7",
"e149ff28504a11e682838cdcd43024a7",
"f2309d56504a11e682838cdcd43024a7",
"f2309d57504a11e682838cdcd43024a7",
"fa901efd504a11e682838cdcd43024a7",
"fa901efe504a11e682838cdcd43024a7",
"036340af504b11e682838cdcd43024a7",
"11768c0b504b11e682838cdcd43024a7",
"2f57689d504b11e682838cdcd43024a7"

First off, let's re-state the observation: when creating sequential GUIDs with a huge time delay -- 60 billion nanoseconds -- between creations, the resulting GUIDs are not sequential.
am I missing something?
You know every fact you need to know to figure out what is going on. You're just not putting them together.
You have a service that provides numbers which are both sequential and unique across all computers in the universe. Think for a moment about how that is possible. It's not a magic box; someone had to write that code.
Imagine if you didn't have to do it using computers, but instead had to do it by hand. You advertise a service: you provide sequential globally unique numbers to anyone who asks at any time.
Now, suppose I ask you for three such numbers and you hand out 20, 21, and 22. Then sixty years later I ask you for three more and surprise, you give me 13510985, 13510986 and 13510987. "Wait just a minute here", I say, "I wanted six sequential numbers, but you gave me three sequential numbers and then three more. What gives?"
Well, what do you suppose happened in that intervening 60 years? Remember, you provide this service to anyone who asks, at any time. Under what circumstances could you give me 23, 24 and 25? Only if no one else asked within that 60 years.
Now is it clear why your program is behaving exactly as it ought to?
In practice, the sequential GUID generator uses the current time as part of its strategy to enforce the globally unique property. Current time and current location is a reasonable starting point for creating a unique number, since presumably there is only one computer on your desk at any one time.
Now, I caution you that this is only a starting point; suppose you have twenty virtual machines all in the same real machine and all trying to generate sequential GUIDs at the same time? In these scenarios collisions become much more likely. You can probably think of techniques you might use to mitigate collisions in these scenarios.

After researching, I can't sort the guid using the default sort or even using the default string representation from guid.ToString as the byte order is different.
to sort the guids generated by UuidCreateSequential I need to convert to either BigInteger or form my own string representation (i.e. hex string 32 characters) by putting bytes in most signification to least significant order as follows:
static void TestSortedSequentialGuid(int length)
{
Guid []guids = new Guid[length];
int[] ids = new int[length];
for (int i = 0; i < length; i++)
{
guids[i] = CreateSequentialGuid();
ids[i] = i;
// this simulates the delay between guids creation
// yes the guids will not be sequential as it interrupts generator
// (as it used the time internally)
// but still the guids should be in increasing order and hence they are
// sortable and that was the goal of the question
Thread.Sleep(60000);
}
var sortedGuidStrings = guids.Select(x =>
{
var bytes = x.ToByteArray();
//reverse high bytes that represents the sequential part (time)
string high = BitConverter.ToString(bytes.Take(10).Reverse().ToArray());
//set last 6 bytes are just the node (MAC address) take it as it is.
return high + BitConverter.ToString(bytes.Skip(10).ToArray());
}).ToArray();
// sort ids using the generated sortedGuidStrings
Array.Sort(sortedGuidStrings, ids);
for (int i = 0; i < length - 1; i++)
{
if (ids[i] > ids[i + 1])
{
Console.WriteLine("sorting using sortedGuidStrings failed!");
return;
}
}
Console.WriteLine("sorting using sortedGuidStrings succeeded!");
}

Hopefully I understood your question correctly. It seems you are trying to sort the HEX representation of your Guids. That really means that you are sorting them alphabetically and not numerically.
Guids will be indexed by their byte value in the database. Here is a console app to prove that your Guids are numerically sequential:
using System;
using System.Linq;
using System.Numerics;
class Program
{
static void Main(string[] args)
{
//These are the sequential guids you provided.
Guid[] guids = new[]
{
"53cd98f2504a11e682838cdcd43024a7",
"7178df9d504a11e682838cdcd43024a7",
"800b5b69504a11e682838cdcd43024a7",
"9796eb73504a11e682838cdcd43024a7",
"c14c5778504a11e682838cdcd43024a7",
"c14c5779504a11e682838cdcd43024a7",
"d2324e9f504a11e682838cdcd43024a7",
"d2324ea0504a11e682838cdcd43024a7",
"da3d4460504a11e682838cdcd43024a7",
"e149ff28504a11e682838cdcd43024a7",
"f2309d56504a11e682838cdcd43024a7",
"f2309d57504a11e682838cdcd43024a7",
"fa901efd504a11e682838cdcd43024a7",
"fa901efe504a11e682838cdcd43024a7",
"036340af504b11e682838cdcd43024a7",
"11768c0b504b11e682838cdcd43024a7",
"2f57689d504b11e682838cdcd43024a7"
}.Select(l => Guid.Parse(l)).ToArray();
//Convert to BigIntegers to get their numeric value from the Guids bytes then sort them.
BigInteger[] values = guids.Select(l => new BigInteger(l.ToByteArray())).OrderBy(l => l).ToArray();
for (int i = 0; i < guids.Length; i++)
{
//Convert back to a guid.
Guid sortedGuid = new Guid(values[i].ToByteArray());
//Compare the guids. The guids array should be sequential.
if(!sortedGuid.Equals(guids[i]))
throw new Exception("Not sequential!");
}
Console.WriteLine("All good!");
Console.ReadKey();
}
}

How to compare large string integer values

Currently I am working on a program that processes extremely large integernumbers .
To prevent hitting the intiger.maxvalue a script that processes strings as numbers, and splits them up into a List<int>as following
0 is the highest currently known value
list entry 0: 123 (hundred twenty three million)
list entry 1: 321 (three hundred twenty one thousand)
list entry 2: 777 (seven hundred seventy seven)
Now my question is: How would one check if the incoming string value is sub tractable from these values?
The start for subtraction I currently made is as following, but I am getting stuck on the subtracting part.
public bool Subtract(string value)
{
string cleanedNumeric = NumericAndSpaces(value);
List<string> input = new List<string>(cleanedNumeric.Split(' '));
// In case 1) the amount is bigger 2) biggest value exceeded by a 10 fold
// 3) biggest value exceeds the value
if (input.Count > values.Count ||
input[input.Count - 1].Length > values[0].ToString().Length ||
FastParseInt(input[input.Count -1]) > values[0])
return false;
// Flip the array for ease of comparison
input.Reverse();
return true;
}
EDIT
Current target for the highest achievable number in this program is a Googolplex And are limited to .net3.5 MONO

You should do some testing on this because I haven't run extensive tests but it has worked on the cases I've put it through. Also, it might be worth ensuring that each character in the string is truly a valid integer as this procedure would bomb given a non-integer character. Finally, it expects positive numbers for both subtrahend and minuend.
static void Main(string[] args)
{
// In subtraction, a subtrahend is subtracted from a minuend to find a difference.
string minuend = "900000";
string subtrahend = "900001";
var isSubtractable = IsSubtractable(subtrahend, minuend);
}
public static bool IsSubtractable(string subtrahend, string minuend)
{
minuend = minuend.Trim();
subtrahend = subtrahend.Trim();
// maybe loop through characters and ensure all are valid integers
// check if the original number is longer - clearly subtractable
if (minuend.Length > subtrahend.Length) return true;
// check if original number is shorter - not subtractable
if (minuend.Length < subtrahend.Length) return false;
// at this point we know the strings are the same length, so we'll
// loop through the characters, one by one, from the start, to determine
// if the minued has a higher value character in a column of the number.
int numberIndex = 0;
while (numberIndex < minuend.Length )
{
Int16 minuendCharValue = Convert.ToInt16(minuend[numberIndex]);
Int16 subtrahendCharValue = Convert.ToInt16(subtrahend[numberIndex]);
if (minuendCharValue > subtrahendCharValue) return true;
if (minuendCharValue < subtrahendCharValue) return false;
numberIndex++;
}
// number are the same
return true;
}

[BigInteger](https://msdn.microsoft.com/en-us/library/system.numerics.biginteger.aspx) is of aribtary size.
Run this code if you don't believe me
var foo = new BigInteger(2);
while (true)
{
foo = foo * foo;
}
Things get crazy. My debugger (VS2013) becomes unable to represent the number before it's done. ran it for a short time and got a number with 1.2 million digits in base 10 from ToString. It is big enough. There is a 2GB limit on object, which can be overriden in .NET 4.5 with the setting gcAllowVeryLargeObjects
Now what to do if you are using .NET 3.5? You basically need to reimplement BigInteger (obviously only taking what you need, there is a lot in there).
public class MyBigInteger
{
uint[] _bits; // you need somewhere to store the value to an arbitrary length.
....
You also need to perform maths on these arrays. here is the Equals method from BigInteger:
public bool Equals(BigInteger other)
{
AssertValid();
other.AssertValid();
if (_sign != other._sign)
return false;
if (_bits == other._bits)
// _sign == other._sign && _bits == null && other._bits == null
return true;
if (_bits == null || other._bits == null)
return false;
int cu = Length(_bits);
if (cu != Length(other._bits))
return false;
int cuDiff = GetDiffLength(_bits, other._bits, cu);
return cuDiff == 0;
}
It basically does cheap length and sign comparisons of the byte arrays, then, if that doesn't produce a difference hands off to GetDiffLength.
internal static int GetDiffLength(uint[] rgu1, uint[] rgu2, int cu)
{
for (int iv = cu; --iv >= 0; )
{
if (rgu1[iv] != rgu2[iv])
return iv + 1;
}
return 0;
}
Which does the expensive check of looping through the arrays looking for a difference.
All you math will have to follow this pattern and can largely be ripped of from the .Net source code.
Googleplex and 2GB:
Here the 2GB limit becomes a problem, because you will be needing an object size of 3.867×10^90 gigabyte. This the the point where you give up, or get clever and store objects as powers at the cost of not being able to represent a lot of them. *2
if you moderate your expectations, it doesn't actually change the maths of BigInteger to split _bits into multiple jagged arrays *1. You change the cheap checks a bit. Rather than checking the size of the array, you check the number of subarrays and then the size of the last one. Then the loop needs to be a bit more (but not much) more complex in that it does elementwise array comparison for each sub array. There are other changes as well, but it's by no means impossible and gets you out of the 2GB limit.
*1 Note use jagged arrays[][], not multidimensional arrays [,] which are still subject to the same limit.
*2 Ie give up on precision and store the mantissa and exponent. If you look how floating point numbers are implemented they can't represent all numbers between their max and min (as the number of real numbers in a range is 'bigger' than infinite). They make a complex trade off between precision and range. If you are wanting to do this, looking at float implementations will be a lot more useful than taking about integer representations like Biginteger.

Looking for a way to optimize this algorithm for parsing a very large string

The following class parses through a very large string (an entire novel of text) and breaks it into consecutive 4-character strings that are stored as a Tuple. Then each tuple can be assigned a probability based on a calculation. I am using this as part of a monte carlo/ genetic algorithm to train the program to recognize a language based on syntax only (just the character transitions).
I am wondering if there is a faster way of doing this. It takes about 400ms to look up the probability of any given 4-character tuple. The relevant method _Probablity() is at the end of the class.
This is a computationally intensive problem related to another post of mine: Algorithm for computing the plausibility of a function / Monte Carlo Method
Ultimately I'd like to store these values in a 4d-matrix. But given that there are 26 letters in the alphabet that would be a HUGE task. (26x26x26x26). If I take only the first 15000 characters of the novel then performance improves a ton, but my data isn't as useful.
Here is the method that parses the text 'source':
private List<Tuple<char, char, char, char>> _Parse(string src)
{
var _map = new List<Tuple<char, char, char, char>>();
for (int i = 0; i < src.Length - 3; i++)
{
int j = i + 1;
int k = i + 2;
int l = i + 3;
_map.Add
(new Tuple<char, char, char, char>(src[i], src[j], src[k], src[l]));
}
return _map;
}
And here is the _Probability method:
private double _Probability(char x0, char x1, char x2, char x3)
{
var subset_x0 = map.Where(x => x.Item1 == x0);
var subset_x0_x1_following = subset_x0.Where(x => x.Item2 == x1);
var subset_x0_x2_following = subset_x0_x1_following.Where(x => x.Item3 == x2);
var subset_x0_x3_following = subset_x0_x2_following.Where(x => x.Item4 == x3);
int count_of_x0 = subset_x0.Count();
int count_of_x1_following = subset_x0_x1_following.Count();
int count_of_x2_following = subset_x0_x2_following.Count();
int count_of_x3_following = subset_x0_x3_following.Count();
decimal p1;
decimal p2;
decimal p3;
if (count_of_x0 <= 0 || count_of_x1_following <= 0 || count_of_x2_following <= 0 || count_of_x3_following <= 0)
{
p1 = e;
p2 = e;
p3 = e;
}
else
{
p1 = (decimal)count_of_x1_following / (decimal)count_of_x0;
p2 = (decimal)count_of_x2_following / (decimal)count_of_x1_following;
p3 = (decimal)count_of_x3_following / (decimal)count_of_x2_following;
p1 = (p1 * 100) + e;
p2 = (p2 * 100) + e;
p3 = (p3 * 100) + e;
}
//more calculations omitted
return _final;
}
}
EDIT - I'm providing more details to clear things up,
1) Strictly speaking I've only worked with English so far, but its true that different alphabets will have to be considered. Currently I only want the program to recognize English, similar to whats described in this paper: http://www-stat.stanford.edu/~cgates/PERSI/papers/MCMCRev.pdf
2) I am calculating the probabilities of n-tuples of characters where n <= 4. For instance if I am calculating the total probability of the string "that", I would break it down into these independent tuples and calculate the probability of each individually first:
[t][h]
[t][h][a]
[t][h][a][t]
[t][h] is given the most weight, then [t][h][a], then [t][h][a][t]. Since I am not just looking at the 4-character tuple as a single unit, I wouldn't be able to just divide the instances of [t][h][a][t] in the text by the total no. of 4-tuples in the next.
The value assigned to each 4-tuple can't overfit to the text, because by chance many real English words may never appear in the text and they shouldn't get disproportionally low scores. Emphasing first-order character transitions (2-tuples) ameliorates this issue. Moving to the 3-tuple then the 4-tuple just refines the calculation.
I came up with a Dictionary that simply tallies the count of how often the tuple occurs in the text (similar to what Vilx suggested), rather than repeating identical tuples which is a waste of memory. That got me from about ~400ms per lookup to about ~40ms per, which is a pretty great improvement. I still have to look into some of the other suggestions, however.

In yoiu probability method you are iterating the map 8 times. Each of your wheres iterates the entire list and so does the count. Adding a .ToList() ad the end would (potentially) speed things. That said I think your main problem is that the structure you've chossen to store the data in is not suited for the purpose of the probability method. You could create a one pass version where the structure you store you're data in calculates the tentative distribution on insert. That way when you're done with the insert (which shouldn't be slowed down too much) you're done or you could do as the code below have a cheap calculation of the probability when you need it.
As an aside you might want to take puntuation and whitespace into account. The first letter/word of a sentence and the first letter of a word gives clear indication on what language a given text is written in by taking punctuaion charaters and whitespace as part of you distribution you include those characteristics of the sample data. We did that some years back. Doing that we shown that using just three characters was almost as exact (we had no failures with three on our test data and almost as exact is an assumption given that there most be some weird text where the lack of information would yield an incorrect result). as using more (we test up till 7) but the speed of three letters made that the best case.
EDIT
Here's an example of how I think I would do it in C#
class TextParser{
private Node Parse(string src){
var top = new Node(null);
for (int i = 0; i < src.Length - 3; i++){
var first = src[i];
var second = src[i+1];
var third = src[i+2];
var fourth = src[i+3];
var firstLevelNode = top.AddChild(first);
var secondLevelNode = firstLevelNode.AddChild(second);
var thirdLevelNode = secondLevelNode.AddChild(third);
thirdLevelNode.AddChild(fourth);
}
return top;
}
}
public class Node{
private readonly Node _parent;
private readonly Dictionary<char,Node> _children
= new Dictionary<char, Node>();
private int _count;
public Node(Node parent){
_parent = parent;
}
public Node AddChild(char value){
if (!_children.ContainsKey(value))
{
_children.Add(value, new Node(this));
}
var levelNode = _children[value];
levelNode._count++;
return levelNode;
}
public decimal Probability(string substring){
var node = this;
foreach (var c in substring){
if(!node.Contains(c))
return 0m;
node = node[c];
}
return ((decimal) node._count)/node._parent._children.Count;
}
public Node this[char value]{
get { return _children[value]; }
}
private bool Contains(char c){
return _children.ContainsKey(c);
}
}
the usage would then be:
var top = Parse(src);
top.Probability("test");

I would suggest changing the data structure to make that faster...
I think a Dictionary<char,Dictionary<char,Dictionary<char,Dictionary<char,double>>>> would be much more efficient since you would be accessing each "level" (Item1...Item4) when calculating... and you would cache the result in the innermost Dictionary so next time you don't have to calculate at all..

Ok, I don't have time to work out details, but this really calls for
neural classifier nets (Just take any off the shelf, even the Controllable Regex Mutilator would do the job with way more scalability) -- heuristics over brute force
you could use tries (Patricia Tries a.k.a. Radix Trees to make a space optimized version of your datastructure that can be sparse (the Dictionary of Dictionaries of Dictionaries of Dictionaries... looks like an approximation of this to me)

There's not much you can do with the parse function as it stands. However, the tuples appear to be four consecutive characters from a large body of text. Why not just replace the tuple with an int and then use the int to index the large body of text when you need the character values. Your tuple based method is effectively consuming four times the memory the original text would use, and since memory is usually the bottleneck to performance, it's best to use as little as possible.
You then try to find the number of matches in the body of text against a set of characters. I wonder how a straightforward linear search over the original body of text would compare with the linq statements you're using? The .Where will be doing memory allocation (which is a slow operation) and the linq statement will have parsing overhead (but the compiler might do something clever here). Having a good understanding of the search space will make it easier to find an optimal algorithm.
But then, as has been mentioned in the comments, using a 264 matrix would be the most efficent. Parse the input text once and create the matrix as you parse. You'd probably want a set of dictionaries:
SortedDictionary <int,int> count_of_single_letters; // key = single character
SortedDictionary <int,int> count_of_double_letters; // key = char1 + char2 * 32
SortedDictionary <int,int> count_of_triple_letters; // key = char1 + char2 * 32 + char3 * 32 * 32
SortedDictionary <int,int> count_of_quad_letters; // key = char1 + char2 * 32 + char3 * 32 * 32 + char4 * 32 * 32 * 32
Finally, a note on data types. You're using the decimal type. This is not an efficient type as there is no direct mapping to CPU native type and there is overhead in processing the data. Use a double instead, I think the precision will be sufficient. The most precise way will be to store the probability as two integers, the numerator and denominator and then do the division as late as possible.

The best approach here is to using sparse storage and pruning after each each 10000 character for example. Best storage strucutre in this case is prefix tree, it will allow fast calculation of probability, updating and sparse storage. You can find out more theory in this javadoc http://alias-i.com/lingpipe/docs/api/com/aliasi/lm/NGramProcessLM.html

C# fastest intersection of 2 sets of sorted numbers

I'm calculating intersection of 2 sets of sorted numbers in a time-critical part of my application. This calculation is the biggest bottleneck of the whole application so I need to speed it up.
I've tried a bunch of simple options and am currently using this:
foreach (var index in firstSet)
{
if (secondSet.BinarySearch(index) < 0)
continue;
//do stuff
}
Both firstSet and secondSet are of type List.
I've also tried using LINQ:
var intersection = firstSet.Where(t => secondSet.BinarySearch(t) >= 0).ToList();
and then looping through intersection.
But as both of these sets are sorted I feel there's a better way to do it. Note that I can't remove items from sets to make them smaller. Both sets usually consist of about 50 items each.
Please help me guys as I don't have a lot of time to get this thing done. Thanks.
NOTE: I'm doing this about 5.3 million times. So every microsecond counts.

If you have two sets which are both sorted, you can implement a faster intersection than anything provided out of the box with LINQ.
Basically, keep two IEnumerator<T> cursors open, one for each set. At any point, advance whichever has the smaller value. If they match at any point, advance them both, and so on until you reach the end of either iterator.
The nice thing about this is that you only need to iterate over each set once, and you can do it in O(1) memory.
Here's a sample implementation - untested, but it does compile :) It assumes that both of the incoming sequences are duplicate-free and sorted, both according to the comparer provided (pass in Comparer<T>.Default):
(There's more text at the end of the answer!)
static IEnumerable<T> IntersectSorted<T>(this IEnumerable<T> sequence1,
IEnumerable<T> sequence2,
IComparer<T> comparer)
{
using (var cursor1 = sequence1.GetEnumerator())
using (var cursor2 = sequence2.GetEnumerator())
{
if (!cursor1.MoveNext() || !cursor2.MoveNext())
{
yield break;
}
var value1 = cursor1.Current;
var value2 = cursor2.Current;
while (true)
{
int comparison = comparer.Compare(value1, value2);
if (comparison < 0)
{
if (!cursor1.MoveNext())
{
yield break;
}
value1 = cursor1.Current;
}
else if (comparison > 0)
{
if (!cursor2.MoveNext())
{
yield break;
}
value2 = cursor2.Current;
}
else
{
yield return value1;
if (!cursor1.MoveNext() || !cursor2.MoveNext())
{
yield break;
}
value1 = cursor1.Current;
value2 = cursor2.Current;
}
}
}
}
EDIT: As noted in comments, in some cases you may have one input which is much larger than the other, in which case you could potentially save a lot of time using a binary search for each element from the smaller set within the larger set. This requires random access to the larger set, however (it's just a prerequisite of binary search). You can even make it slightly better than a naive binary search by using the match from the previous result to give a lower bound to the binary search. So suppose you were looking for values 1000, 2000 and 3000 in a set with every integer from 0 to 19,999. In the first iteration, you'd need to look across the whole set - your starting lower/upper indexes would be 0 and 19,999 respectively. After you'd found a match at index 1000, however, the next step (where you're looking for 2000) can start with a lower index of 2000. As you progress, the range in which you need to search gradually narrows. Whether or not this is worth the extra implementation cost or not is a different matter, however.

Since both lists are sorted, you can arrive at the solution by iterating over them at most once (you may also get to skip part of one list, depending on the actual values they contain).
This solution keeps a "pointer" to the part of list we have not yet examined, and compares the first not-examined number of each list between them. If one is smaller than the other, the pointer to the list it belongs to is incremented to point to the next number. If they are equal, the number is added to the intersection result and both pointers are incremented.
var firstCount = firstSet.Count;
var secondCount = secondSet.Count;
int firstIndex = 0, secondIndex = 0;
var intersection = new List<int>();
while (firstIndex < firstCount && secondIndex < secondCount)
{
var comp = firstSet[firstIndex].CompareTo(secondSet[secondIndex]);
if (comp < 0) {
++firstIndex;
}
else if (comp > 0) {
++secondIndex;
}
else {
intersection.Add(firstSet[firstIndex]);
++firstIndex;
++secondIndex;
}
}
The above is a textbook C-style approach of solving this particular problem, and given the simplicity of the code I would be surprised to see a faster solution.

You're using a rather inefficient Linq method for this sort of task, you should opt for Intersect as a starting point.
var intersection = firstSet.Intersect(secondSet);
Try this. If you measure it for performance and still find it unwieldy, cry for further help (or perhaps follow Jon Skeet's approach).

I was using Jon's approach but needed to execute this intersect hundreds of thousands of times for a bulk operation on very large sets and needed more performance. The case I was running in to was heavily imbalanced sizes of the lists (eg 5 and 80,000) and wanted to avoid iterating the entire large list.
I found that detecting the imbalance and changing to an alternate algorithm gave me huge benifits over specific data sets:
public static IEnumerable<T> IntersectSorted<T>(this List<T> sequence1,
List<T> sequence2,
IComparer<T> comparer)
{
List<T> smallList = null;
List<T> largeList = null;
if (sequence1.Count() < Math.Log(sequence2.Count(), 2))
{
smallList = sequence1;
largeList = sequence2;
}
else if (sequence2.Count() < Math.Log(sequence1.Count(), 2))
{
smallList = sequence2;
largeList = sequence1;
}
if (smallList != null)
{
foreach (var item in smallList)
{
if (largeList.BinarySearch(item, comparer) >= 0)
{
yield return item;
}
}
}
else
{
//Use Jon's method
}
}
I am still unsure about the point at which you break even, need to do some more testing

try
firstSet.InterSect (secondSet).ToList ()
or
firstSet.Join(secondSet, o => o, id => id, (o, id) => o)

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.