I tried to sort Guids generated by UuidCreateSequential, but I see the results are not correct. Am I missing something? Here is the code:
private class NativeMethods
{
[DllImport("rpcrt4.dll", SetLastError = true)]
public static extern int UuidCreateSequential(out Guid guid);
}
public static Guid CreateSequentialGuid()
{
const int RPC_S_OK = 0;
Guid guid;
int result = NativeMethods.UuidCreateSequential(out guid);
if (result == RPC_S_OK)
return guid;
else throw new Exception("could not generate unique sequential guid");
}
static void TestSortedSequentialGuid(int length)
{
Guid[] guids = new Guid[length];
int[] ids = new int[length];
for (int i = 0; i < length; i++)
{
guids[i] = CreateSequentialGuid();
ids[i] = i;
Thread.Sleep(60000);
}
Array.Sort(guids, ids);
for (int i = 0; i < length - 1; i++)
{
if (ids[i] > ids[i + 1])
{
Console.WriteLine("sorting using guids failed!");
return;
}
}
Console.WriteLine("sorting using guids succeeded!");
}
EDIT 1:
Just to make my question clear: why is the Guid struct not sortable using the default comparer?
EDIT 2:
Also, here are some sequential GUIDs I've generated; it seems they are not sorted ascending as presented by the hex string:
"53cd98f2504a11e682838cdcd43024a7",
"7178df9d504a11e682838cdcd43024a7",
"800b5b69504a11e682838cdcd43024a7",
"9796eb73504a11e682838cdcd43024a7",
"c14c5778504a11e682838cdcd43024a7",
"c14c5779504a11e682838cdcd43024a7",
"d2324e9f504a11e682838cdcd43024a7",
"d2324ea0504a11e682838cdcd43024a7",
"da3d4460504a11e682838cdcd43024a7",
"e149ff28504a11e682838cdcd43024a7",
"f2309d56504a11e682838cdcd43024a7",
"f2309d57504a11e682838cdcd43024a7",
"fa901efd504a11e682838cdcd43024a7",
"fa901efe504a11e682838cdcd43024a7",
"036340af504b11e682838cdcd43024a7",
"11768c0b504b11e682838cdcd43024a7",
"2f57689d504b11e682838cdcd43024a7"
First off, let's re-state the observation: when creating sequential GUIDs with a huge time delay -- 60 billion nanoseconds -- between creations, the resulting GUIDs are not sequential.
am I missing something?
You know every fact you need to know to figure out what is going on. You're just not putting them together.
You have a service that provides numbers which are both sequential and unique across all computers in the universe. Think for a moment about how that is possible. It's not a magic box; someone had to write that code.
Imagine if you didn't have to do it using computers, but instead had to do it by hand. You advertise a service: you provide sequential globally unique numbers to anyone who asks at any time.
Now, suppose I ask you for three such numbers and you hand out 20, 21, and 22. Then sixty years later I ask you for three more and surprise, you give me 13510985, 13510986 and 13510987. "Wait just a minute here", I say, "I wanted six sequential numbers, but you gave me three sequential numbers and then three more. What gives?"
Well, what do you suppose happened in that intervening 60 years? Remember, you provide this service to anyone who asks, at any time. Under what circumstances could you give me 23, 24 and 25? Only if no one else asked within that 60 years.
Now is it clear why your program is behaving exactly as it ought to?
In practice, the sequential GUID generator uses the current time as part of its strategy to enforce the globally unique property. Current time and current location are a reasonable starting point for creating a unique number, since presumably there is only one computer on your desk at any one time.
Now, I caution you that this is only a starting point; suppose you have twenty virtual machines all in the same real machine and all trying to generate sequential GUIDs at the same time? In these scenarios collisions become much more likely. You can probably think of techniques you might use to mitigate collisions in these scenarios.
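For illustration only (this is not part of the answer above): in a version-1, time-based GUID the 60-bit timestamp is split across three fields, with the most significant bits stored in Data3 and the least significant in Data1, which is why neither the default comparer nor ToString() orders these GUIDs by creation time. A small sketch that pulls the timestamp back out:
// illustration: extract the RFC 4122 timestamp from a version-1 (time-based) GUID
// (assumes a little-endian machine, since BitConverter does no byte swapping)
Guid g = CreateSequentialGuid();
byte[] b = g.ToByteArray(); // Data1..Data3 come out little-endian here
uint timeLow = BitConverter.ToUInt32(b, 0); // least significant 32 bits of the timestamp
ushort timeMid = BitConverter.ToUInt16(b, 4); // middle 16 bits
ushort timeHi = (ushort)(BitConverter.ToUInt16(b, 6) & 0x0FFF); // top 12 bits (version nibble stripped)
ulong timestamp = ((ulong)timeHi << 48) | ((ulong)timeMid << 32) | timeLow;
// the timestamp counts 100 ns intervals since 1582-10-15, the epoch used by RFC 4122
DateTime createdUtc = new DateTime(1582, 10, 15, 0, 0, 0, DateTimeKind.Utc).AddTicks((long)timestamp);
Console.WriteLine(createdUtc);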
After researching, I found I can't sort the GUIDs using the default sort, or even using the default string representation from Guid.ToString(), because the byte order is different.
To sort the GUIDs generated by UuidCreateSequential I need to convert them to either a BigInteger or my own string representation (i.e. a 32-character hex string) with the bytes in most-significant to least-significant order, as follows:
static void TestSortedSequentialGuid(int length)
{
Guid[] guids = new Guid[length];
int[] ids = new int[length];
for (int i = 0; i < length; i++)
{
guids[i] = CreateSequentialGuid();
ids[i] = i;
// this simulates the delay between guid creations
// yes, the guids will not be consecutive, since the delay interrupts the generator
// (it uses the time internally)
// but they should still be in increasing order, and hence
// sortable - and that was the goal of the question
Thread.Sleep(60000);
}
var sortedGuidStrings = guids.Select(x =>
{
var bytes = x.ToByteArray();
// reverse the high bytes that represent the sequential part (the timestamp)
string high = BitConverter.ToString(bytes.Take(10).Reverse().ToArray());
// the last 6 bytes are just the node (MAC address); take them as-is
return high + BitConverter.ToString(bytes.Skip(10).ToArray());
}).ToArray();
// sort ids using the generated sortedGuidStrings
Array.Sort(sortedGuidStrings, ids);
for (int i = 0; i < length - 1; i++)
{
if (ids[i] > ids[i + 1])
{
Console.WriteLine("sorting using sortedGuidStrings failed!");
return;
}
}
Console.WriteLine("sorting using sortedGuidStrings succeeded!");
}
Hopefully I understood your question correctly. It seems you are trying to sort the HEX representation of your Guids. That really means that you are sorting them alphabetically and not numerically.
Guids will be indexed by their byte value in the database. Here is a console app to prove that your Guids are numerically sequential:
using System;
using System.Linq;
using System.Numerics;
class Program
{
static void Main(string[] args)
{
//These are the sequential guids you provided.
Guid[] guids = new[]
{
"53cd98f2504a11e682838cdcd43024a7",
"7178df9d504a11e682838cdcd43024a7",
"800b5b69504a11e682838cdcd43024a7",
"9796eb73504a11e682838cdcd43024a7",
"c14c5778504a11e682838cdcd43024a7",
"c14c5779504a11e682838cdcd43024a7",
"d2324e9f504a11e682838cdcd43024a7",
"d2324ea0504a11e682838cdcd43024a7",
"da3d4460504a11e682838cdcd43024a7",
"e149ff28504a11e682838cdcd43024a7",
"f2309d56504a11e682838cdcd43024a7",
"f2309d57504a11e682838cdcd43024a7",
"fa901efd504a11e682838cdcd43024a7",
"fa901efe504a11e682838cdcd43024a7",
"036340af504b11e682838cdcd43024a7",
"11768c0b504b11e682838cdcd43024a7",
"2f57689d504b11e682838cdcd43024a7"
}.Select(l => Guid.Parse(l)).ToArray();
//Convert to BigIntegers to get their numeric value from the Guid's bytes, then sort them.
BigInteger[] values = guids.Select(l => new BigInteger(l.ToByteArray())).OrderBy(l => l).ToArray();
for (int i = 0; i < guids.Length; i++)
{
//Convert back to a guid.
Guid sortedGuid = new Guid(values[i].ToByteArray());
//Compare the guids. The guids array should be sequential.
if(!sortedGuid.Equals(guids[i]))
throw new Exception("Not sequential!");
}
Console.WriteLine("All good!");
Console.ReadKey();
}
}
I have around 5,000,000 objects stored in a Dictionary<MyKey, MyValue>.
MyKey is a struct that packs each component of my key (5 different numbers) in the right-most 44 bits of an Int64 (ulong).
Since the ulong will always start with 20 zero-bits, my gut feeling is that returning the native Int64.GetHashCode() implementation is likely to collide more often than if the hash code implementation only considers the 44 bits that are actually in use (although mathematically, I wouldn't know where to begin to prove that theory).
This increases the number of calls to .Equals() and makes dictionary lookups slower.
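For concreteness, here is a hypothetical sketch of such a key struct; the five component widths (12 + 8 + 8 + 8 + 8 = 44 bits) are made up, since the question doesn't give them:
// hypothetical layout: five components packed into the low 44 bits of a ulong
struct MyKey : IEquatable<MyKey>
{
private readonly ulong packed; // bits 43..0 used, bits 63..44 always zero
public MyKey(uint a, byte b, byte c, byte d, byte e) // a uses 12 bits, the rest 8 each
{
packed = ((ulong)(a & 0xFFF) << 32) | ((ulong)b << 24) | ((ulong)c << 16) | ((ulong)d << 8) | e;
}
public bool Equals(MyKey other) { return packed == other.packed; }
public override bool Equals(object obj) { return obj is MyKey && Equals((MyKey)obj); }
public override int GetHashCode()
{
// same folding the framework uses for Int64: XOR the high and low 32-bit halves;
// with only 44 bits in use, the high half contributes just 12 varying bits
return (int)packed ^ (int)(packed >> 32);
}
}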
The .NET implementation of Int64.GetHashCode() looks like this:
public override int GetHashCode()
{
return (int)this ^ (int)(this >> 32);
}
How would I best implement GetHashCode()?
I couldn't begin to suggest a "best" way to hash 44-bit numbers. But, I can suggest a way to compare it to the 64-bit hash algorithm.
One way to do this is to simply check how many collisions you get for a set of numbers (as suggested by McKenzie et al. in Selecting a Hashing Algorithm). Unless you're going to test all possible values of your set, you'll need to judge whether the number of collisions you get is acceptable. This could be done in code with something like:
var rand = new Random(42);
var dict64 = new Dictionary<int, int>();
var dict44 = new Dictionary<int, int>();
for (int i = 0; i < 100000; ++i)
{
// get value between 0 and 0xfffffffffff (max 44-bit value)
var value44 = (ulong)(rand.NextDouble() * 0x0FFFFFFFFFFF);
var value64 = (ulong)(rand.NextDouble() * ulong.MaxValue);
var hash64 = value64.GetHashCode();
var hash44 = (int)value44 ^ (int)(value44 >> 32);
if (!dict64.ContainsKey(hash64)) // ContainsKey is a hash lookup; ContainsValue would scan the whole dictionary
{
dict64.Add(hash64,hash64);
}
if (!dict44.ContainsKey(hash44))
{
dict44.Add(hash44, hash44);
}
}
Trace.WriteLine(string.Format("64-bit hash: {0}, 64-bit hash with 44-bit numbers {1}", dict64.Count, dict44.Count));
In other words, consistently generate 100,000 random 64-bit values and 100,000 random 44-bit values, perform a hash on each and keep track of unique values.
In my test this generated 99998 unique values for 44-bit numbers and 99997 unique values for 64-bit numbers. So, that's one fewer collision for 44-bit numbers than for 64-bit numbers. I would expect fewer collisions with 44-bit numbers simply because there are fewer possible inputs.
I'm not going to tell you the 64-bit hash method is "best" for 44-bit; you'll have to decide if these results mean it's good for your circumstances.
Ideally you should be testing with realistic values that your application is likely to generate. Given those will all be 44-bit values, it's hard to compare that to the collisions ulong.GetHashCode() produces (i.e. you'd have identical results). If random values based on a constant seed isn't good enough, modify the code with something better.
While things might not "feel" right, science suggests there's no point in changing something without reproducible tests that prove a change is necessary.
Here's my attempt to answer this question, which I'm posting despite the fact that the answer is the opposite of what I was expecting. (Although I may have made a mistake somewhere - I almost hope so, and am open to criticism regarding my test technique.)
// Number of Dictionary hash buckets found here:
// http://stackoverflow.com/questions/24366444/how-many-hash-buckets-does-a-net-dictionary-use
const int CNumberHashBuckets = 4999559;
static void Main(string[] args)
{
Random randomNumberGenerator = new Random();
int[] dictionaryBuckets1 = new int[CNumberHashBuckets];
int[] dictionaryBuckets2 = new int[CNumberHashBuckets];
for (int i = 0; i < 5000000; i++)
{
ulong randomKey = (ulong)(randomNumberGenerator.NextDouble() * 0x0FFFFFFFFFFF);
int simpleHash = randomKey.GetHashCode();
BumpHashBucket(dictionaryBuckets1, simpleHash);
int superHash = ((int)(randomKey >> 12)).GetHashCode() ^ ((int)randomKey).GetHashCode();
BumpHashBucket(dictionaryBuckets2, superHash);
}
int collisions1 = ComputeCollisions(dictionaryBuckets1);
int collisions2 = ComputeCollisions(dictionaryBuckets2);
Console.WriteLine("simple hash: {0} collisions, alternative hash: {1} collisions", collisions1, collisions2);
}
private static void BumpHashBucket(int[] dictionaryBuckets, int hashedKey)
{
int bucketIndex = (int)((uint)hashedKey % CNumberHashBuckets);
dictionaryBuckets[bucketIndex]++;
}
private static int ComputeCollisions(int[] dictionaryBuckets)
{
int i = 0;
foreach (int dictionaryBucket in dictionaryBuckets)
i += Math.Max(dictionaryBucket - 1, 0);
return i;
}
I try to simulate how the processing done by Dictionary will work. The OP says he has "around 5,000,000" objects in a Dictionary, and according to the referenced source there will be either 4999559 or 5999471 "buckets" in the Dictionary.
Then I generate 5,000,000 random 44-bit keys to simulate the OP's Dictionary entries, and for each key I hash it two different ways: the simple ulong.GetHashCode() and an alternative way that I suggested in a comment. Then I turn each hash code into a bucket index using modulo - I assume that's how it is done by Dictionary. This is used to increment the pseudo buckets as a way of computing the number of collisions.
Unfortunately (for me) the results are not as I was hoping. With 4999559 buckets the simulation typically indicates around 1.8 million collisions, with my "super hash" technique actually having a few (around 0.01%) MORE collisions. With 5999471 buckets there are typically around 1.6 million collisions, and my so-called super hash gives maybe 0.1% fewer collisions.
So my "gut feeling" was wrong, and there seems to be no justification for trying to find a better hash code technique.
There are many how-tos on creating GUIDs that are SQL Server index friendly, for example this tutorial. Another popular method is the one (listed below) from the NHibernate implementation. So I thought it could be fun to write a test method that actually tests the sequential requirements of such code. But I fail - I don't know what makes a good SQL Server sequence, and I can't figure out how they are ordered.
For example, given the two different ways to create a sequential GUID below, how do I determine which is best (other than speed)? For example, it looks like both have the disadvantage that if the clock is set back 2 minutes (e.g. by a time server update), the sequences are suddenly broken. But would that also mean trouble for the SQL Server index?
I use this code to produce the sequential Guid:
public static Guid CombFromArticle()
{
var randomBytes = Guid.NewGuid().ToByteArray();
byte[] timestampBytes = BitConverter.GetBytes(DateTime.Now.Ticks / 10000L);
if (BitConverter.IsLittleEndian)
Array.Reverse(timestampBytes);
var guidBytes = new byte[16];
Buffer.BlockCopy(randomBytes, 0, guidBytes, 0, 10);
Buffer.BlockCopy(timestampBytes, 2, guidBytes, 10, 6);
return new Guid(guidBytes);
}
public static Guid CombFromNHibernate()
{
var destinationArray = Guid.NewGuid().ToByteArray();
var time = new DateTime(0x76c, 1, 1);
var now = DateTime.Now;
var span = new TimeSpan(now.Ticks - time.Ticks);
var timeOfDay = now.TimeOfDay;
var bytes = BitConverter.GetBytes(span.Days);
var array = BitConverter.GetBytes((long)(timeOfDay.TotalMilliseconds / 3.333333));
Array.Reverse(bytes);
Array.Reverse(array);
Array.Copy(bytes, bytes.Length - 2, destinationArray, destinationArray.Length - 6, 2);
Array.Copy(array, array.Length - 4, destinationArray, destinationArray.Length - 4, 4);
return new Guid(destinationArray);
}
The one from the article is slightly faster, but which creates the best sequence for SQL Server? I could populate 1 million records and compare the fragmentation, but I'm not even sure how to validate that properly. And in any case, I'd like to understand how I could write a test case that ensures the sequences are sequences as defined by SQL Server!
Also, I'd like some comments on these two implementations. What makes one better than the other?
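One way to test the "sequence as defined by SQL Server" requirement without populating a table: System.Data.SqlTypes.SqlGuid compares using SQL Server's uniqueidentifier sort order (the last byte group is the most significant). The following rough harness is just a sketch - it assumes the two Comb methods above are in scope, and that a few milliseconds between calls is enough to move the embedded timestamp:
using System;
using System.Data.SqlTypes;
using System.Linq;
using System.Threading;
static class CombOrderTest
{
// returns true if the generator produced values SQL Server would consider non-decreasing
static bool IsSqlServerAscending(Func<Guid> generator, int count)
{
var combs = new SqlGuid[count];
for (int i = 0; i < count; i++)
{
combs[i] = new SqlGuid(generator());
Thread.Sleep(5); // the COMBs above only carry a few ms of timestamp resolution
}
// SqlGuid.CompareTo implements SQL Server's uniqueidentifier ordering
return combs.Zip(combs.Skip(1), (a, b) => a.CompareTo(b) <= 0).All(ok => ok);
}
static void Main()
{
Console.WriteLine("Article COMB ascending: " + IsSqlServerAscending(CombFromArticle, 100));
Console.WriteLine("NHibernate COMB ascending: " + IsSqlServerAscending(CombFromNHibernate, 100));
}
}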
I generated sequential GUIDs for a SQL Server database once. I never looked at too many articles beforehand... still, it seems sound.
The first one I generate with a system function (to get a proper one), and the following ones I simply increment. You have to watch for overflow and the like, of course (also, a GUID has several fields).
Apart from that, there is nothing hard to consider. If 2 GUIDs are unique, so is a sequence of them - if you stay below a few million. Well, it is mathematics: even 2 GUIDs aren't guaranteed to be unique, at least not in the long run (if humanity keeps growing). So by using this kind of sequence, you probably increase the probability of a collision from nearly 0 to nearly 0 (but slightly more). If at all... ask a mathematician; it is the Birthday Problem (http://en.wikipedia.org/wiki/Birthday_problem), with an insane number of days.
It is in C, but that should be easily translatable to more comfortable languages. In particular, you don't need to worry about converting the wchar to a char.
GUID guid;
bool bGuidInitialized = false;
void incrGUID()
{
for (int i = 7; i >= 0; --i)
{
++guid.Data4[i];
if (guid.Data4[i] != 0)
return;
}
++guid.Data3;
if (guid.Data3 != 0)
return;
++guid.Data2;
if (guid.Data2 != 0)
return;
++guid.Data1;
if (guid.Data1 != 0)
return;
}
// ReturnCode, onlyOnceLogGUIDAlreadyDone and WR_cTools_LogTime come from the surrounding codebase this was taken from
int GenerateGUID(char *chGuid)
{
if (!bGuidInitialized)
{
CoCreateGuid(&guid);
bGuidInitialized = true;
}
else
incrGUID();
WCHAR temp[42];
StringFromGUID2(guid, temp, 42-1);
wcstombs(chGuid, &(temp[1]), 42-1);
chGuid[36] = 0;
if (!onlyOnceLogGUIDAlreadyDone)
{
onlyOnceLogGUIDAlreadyDone = true;
WR_cTools_LogTime(chGuid);
}
return ReturnCode;
}
I want to generate 1M random-looking unique alphanumeric keys and store them in a database. Each key will be 8 characters long, and only the subset "abcdefghijknpqrstuvxyz" plus the digits 0-9 will be used.
The letters l, m, o and w are ditched. m and w are left out because of limited printing space, as each key will be printed on a product in a very small space; dropping m and w made it possible to increase the letter size by 2pt, improving readability. l and o were dropped because they are easily mixed up with 1, i and 0 at the current printing size. We did some testing: the characters 1, i and 0 were always read correctly, while l and o caused too many mistakes. Capitals were left out for the same reason as m and w.
So why not a sequence? A few reasons: the keys can be registered afterwards, and we do not want anyone guessing the next key in the sequence and registering somebody else's key. Appearance: we don't need customers and competitors to know we only shipped a few thousand keys.
Is there a practical way to generate the keys, ensure the uniqueness of each key and store them in a database? Thanks!
Edit: #CodeInChaos pointed out a problem: System.Random isn't very secure, and the sequence could be reproduced without a great deal of difficulty. I've replaced Random with a secure generator here:
var possibilities = "abcdefghijknpqrstuvxyz0123456789".ToCharArray();
int goal = 1000000;
int codeLength = 8;
var codes = new HashSet<string>();
var random = new RNGCryptoServiceProvider();
while (codes.Count < goal)
{
var newCode = new char[codeLength];
for (int i = 0; i < codeLength; i++)
newCode[i] = possibilities[random.Next((byte)possibilities.Length)]; // the cast is needed because the Next extension below takes a byte
codes.Add(new string(newCode));
}
// now write codes to database
static class Extensions
{
public static byte Next(this RNGCryptoServiceProvider provider, byte maximum)
{
var b = new byte[1];
while (true)
{
provider.GetBytes(b);
if (b[0] < maximum)
return b[0];
}
}
}
(the Next method isn't very fast, but might be good enough for your purposes)
1 million isn't much these days and you can probably do that on a single machine fairly quickly. It's a one-time operation after all.
Take a hashtable (or hashset)
Generate random keys and put them into it as keys (or directly if a set) until the count is 1 million
Write them to the database
My quick and dirty testing code looked like this:
function new-key {-join'abcdefghijknpqrstuvxyz0123456789'[(0..7|%{random 32})]}
$keys = @{}
while ($keys.Count -lt 1000000) { $keys[(new-key)] = 1 }
But PowerShell is slow, so I'd expect C++ or C# to do very well here.
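A minimal C# sketch of the same idea (the alphabet and the 1,000,000 target come from the question; everything else here is just an assumption):
using System;
using System.Collections.Generic;
class KeyGenerator
{
static void Main()
{
const string alphabet = "abcdefghijknpqrstuvxyz0123456789";
const int keyLength = 8;
const int goal = 1000000;
var rng = new Random();
var keys = new HashSet<string>(); // the set guarantees uniqueness
var buffer = new char[keyLength];
while (keys.Count < goal)
{
for (int i = 0; i < keyLength; i++)
buffer[i] = alphabet[rng.Next(alphabet.Length)];
keys.Add(new string(buffer)); // duplicates are silently ignored
}
Console.WriteLine("Generated {0} unique keys", keys.Count);
// ...then bulk-insert the keys into the database
}
}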
Is there a practical way to generate the keys, ensure the uniqueness
of each key and store them in a database?
Since this is a single operation you could simply do the following:
1) Generate A Single Key
2) Verify that the generated key does not exist in the database.
3) If it does exist generate a new key.
3b) If it does not exist write it to the database
4) Go Back to Step 1
There are other choices of course, in the end it boils down to generating a key, and making sure it does not exist in the database.
You could in theory generate 10 million keys (in order to save processing power) and write them to a file. Once the keys are generated, just look at each one and see if it already exists in the database. You could likely program a tool that does this in less than 48 hours.
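A rough sketch of that generate/check/insert loop using plain ADO.NET; the table name, column name, connection string and the generateKey delegate are all placeholders, not something from this answer:
using System;
using System.Data;
using System.Data.SqlClient;
class KeyWriter
{
static void WriteKeys(string connectionString, Func<string> generateKey, int goal)
{
using (var conn = new SqlConnection(connectionString))
{
conn.Open();
var check = new SqlCommand("SELECT COUNT(*) FROM ProductKeys WHERE KeyValue = @k", conn);
check.Parameters.Add("@k", SqlDbType.NVarChar, 8);
var insert = new SqlCommand("INSERT INTO ProductKeys (KeyValue) VALUES (@k)", conn);
insert.Parameters.Add("@k", SqlDbType.NVarChar, 8);
int written = 0;
while (written < goal)
{
string candidate = generateKey(); // 1) generate a single key
check.Parameters["@k"].Value = candidate;
if ((int)check.ExecuteScalar() > 0) // 2) is it already in the database?
continue; // 3) yes - generate a new one
insert.Parameters["@k"].Value = candidate; // 3b) no - write it
insert.ExecuteNonQuery();
written++; // 4) go back to step 1
}
}
}
}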
I encountered a similar problem once... what I did was create a unique sequence YYYY/MM/DD/HH/MM/SS/millis/nano and get its hash code. After that I used the hash as the key. Your clients and your competitors won't be able to guess the next value. It might not be foolproof, but in my case it was enough!
To actually get the random string, you can use code similar to this:
Random rand = new Random(DateTime.Now.Millisecond); // note: new DateTime().Millisecond is always 0, which would give the same seed every run
String[] possibilities = {"a","b","c","d","e","f","g","h","i","j","k",
"n","p","q","r","s","t","u","v","x","y","z","0","1","2","3","4",
"5","6","7","8","9"}; // "l" is excluded, per the allowed character set in the question
for (int i = 0; i < 1000000; ++i)
{
System.Text.StringBuilder sb = new System.Text.StringBuilder();
for (int j = 0; j < 8; ++j)
{
sb.Append(possibilities[rand.Next(possibilities.Length)]);
}
if (!databaseContains(sb.ToString()))
databaseAdd(sb.ToString());
else
--i;
}
I have a list of perhaps 100,000 strings in memory in my application. I need to find the top 20 strings that contain a certain keyword (case insensitive). That's easy to do, I just run the following LINQ.
from s in stringList
where s.ToLower().Contains(searchWord.ToLower())
select s
However, I have a distinct feeling that I could do this much faster, and I need to find a way to do so, because I need to search this list multiple times per second.
Finding substrings (as opposed to complete matches) is surprisingly hard. There is nothing built-in to help you with this. I suggest you look into the suffix tree data structure, which can be used to find substrings efficiently.
You can pull searchWord.ToLower() out to a local variable to save tons of string operations, btw. You can also pre-calculate the lower-case version of stringList. If you can't precompute, at least use s.IndexOf(searchWord, StringComparison.InvariantCultureIgnoreCase) != -1. This saves on expensive ToLower calls.
You can also slap an .AsParallel on the query.
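Putting those suggestions together, the query might look like this sketch (stringList and searchWord are the variables from the question):
// avoids the per-item ToLower allocations and spreads the scan across cores
var top20 = stringList
.AsParallel()
.Where(s => s.IndexOf(searchWord, StringComparison.InvariantCultureIgnoreCase) >= 0)
.Take(20)
.ToList();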
Another option, although it would require a fair amount of memory, would be to precompute something like a suffix array (a list of positions within the strings, sorted by the strings to which they point).
http://en.wikipedia.org/wiki/Suffix_array
This would be most feasible if the list of strings you're searching against is relatively static. The entire list of string indexes could be stored in a single array of tuples (indexOfString, positionInString), upon which you would perform a binary search, using String.Compare(keyword, 0, target, targetPos, keyword.Length).
So if you had 100,000 strings of average 20 length, you would need 100,000 * 20 * 2*sizeof(int) of memory for the structure. You could cut that in half by packing both indexOfString and positionInString into a single 32 bit int, for example with positionInString in the lowest 12 bits, and the indexOfString in the remaining upper bits. You'd just have to do a little bit fiddling to get the two values back out. It's important to note that the structure contains no strings or substrings itself. The strings you're searching against exist only once.
This would basically give you a complete index, and allow finding any substring very quickly (binary search over the index the suffix array represents), with a minimum of actual string comparisons.
If memory is dear, a simple optimization of the original brute force algorithm would be to precompute a dictionary of unique chars, and assign ordinal numbers to represent each. Then precompute a bit array for each string with the bits set for each unique char contained within the string. Since your strings are relatively short, there should be a fair amount of variability in the resulting BitArrays (it wouldn't work well if your strings were very long). You then simply compute the BitArray of your search keyword, and only search for the keyword in those strings where keywordBits & targetBits == keywordBits. If your strings are pre-converted to lower case, and are just the English alphabet, the BitArray would likely fit within a single int. So this would require a minimum of additional memory, be simple to implement, and would allow you to quickly filter out strings in which you will definitely not find the keyword. This might be a useful optimization, since string searches are fast but you have so many of them to do using the brute force search. A sketch of the idea follows.
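This sketch assumes lower-case English letters only, so one bit per letter fits in an int; everything else about it is an assumption as well:
using System;
using System.Collections.Generic;
using System.Linq;
static class BitmaskPrefilter
{
// 26-bit mask of which letters a (lower-cased) string contains
static int LetterMask(string s)
{
int mask = 0;
foreach (char c in s)
if (c >= 'a' && c <= 'z')
mask |= 1 << (c - 'a');
return mask;
}
public static List<string> Find(List<string> stringList, string searchWord)
{
// in practice the masks would be precomputed once per list; shown inline here for brevity
int[] masks = stringList.Select(s => LetterMask(s.ToLowerInvariant())).ToArray();
string keyword = searchWord.ToLowerInvariant();
int keywordMask = LetterMask(keyword);
var matches = new List<string>();
for (int i = 0; i < stringList.Count; i++)
{
// if any letter of the keyword is missing from the string, skip the scan entirely
if ((masks[i] & keywordMask) != keywordMask)
continue;
if (stringList[i].IndexOf(keyword, StringComparison.Ordinal) >= 0)
matches.Add(stringList[i]);
}
return matches;
}
}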
EDIT For those interested, here is a basic implementation of the initial solution I proposed. I ran tests using 100,000 randomly generated strings of lengths described by the OP. Although it took around 30 seconds to construct and sort the index, once made, the speed of searching for keywords 3000 times was 49,805 milliseconds for brute force, and 18 milliseconds using the indexed search, so a couple thousand times faster. If you rarely build the list, then my simple, but relatively slow method of initially building the suffix array should be sufficient. There are smarter ways to build it that are faster, but would require more coding than my basic implementation below.
// little test console app
static void Main(string[] args) {
var list = new SearchStringList(true);
list.Add("Now is the time");
list.Add("for all good men");
list.Add("Time now for something");
list.Add("something completely different");
while (true) {
string keyword = Console.ReadLine();
if (keyword.Length == 0) break;
foreach (var pos in list.FindAll(keyword)) {
Console.WriteLine(pos.ToString() + " =>" + list[pos.ListIndex]);
}
}
}
~~~~~~~~~~~~~~~~~~
// file for the class that implements a simple suffix array
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Collections;
namespace ConsoleApplication1 {
public class SearchStringList {
private List<string> strings = new List<string>();
private List<StringPosition> positions = new List<StringPosition>();
private bool dirty = false;
private readonly bool ignoreCase = true;
public SearchStringList(bool ignoreCase) {
this.ignoreCase = ignoreCase;
}
public void Add(string s) {
if (s.Length > 255) throw new ArgumentOutOfRangeException("string too big.");
this.strings.Add(s);
this.dirty = true;
for (byte i = 0; i < s.Length; i++) this.positions.Add(new StringPosition(strings.Count-1, i));
}
public string this[int index] { get { return this.strings[index]; } }
public void EnsureSorted() {
if (dirty) {
this.positions.Sort(Compare);
this.dirty = false;
}
}
public IEnumerable<StringPosition> FindAll(string keyword) {
var idx = IndexOf(keyword);
while ((idx >= 0) && (idx < this.positions.Count)
&& (Compare(keyword, this.positions[idx]) == 0)) {
yield return this.positions[idx];
idx++;
}
}
private int IndexOf(string keyword) {
EnsureSorted();
// binary search
// When the keyword appears multiple times, this should
// point to the first match in positions. The following
// positions could be examined for additional matches
int minP = 0;
int maxP = this.positions.Count - 1;
while (maxP > minP) {
int midP = minP + ((maxP - minP) / 2);
if (Compare(keyword, this.positions[midP]) > 0) {
minP = midP + 1;
} else {
maxP = midP;
}
}
if ((maxP == minP) && (Compare(keyword, this.positions[minP]) == 0)) {
return minP;
} else {
return -1;
}
}
private int Compare(StringPosition pos1, StringPosition pos2) {
int len = Math.Max(this.strings[pos1.ListIndex].Length - pos1.StringIndex, this.strings[pos2.ListIndex].Length - pos2.StringIndex);
return String.Compare(strings[pos1.ListIndex], pos1.StringIndex, this.strings[pos2.ListIndex], pos2.StringIndex, len, ignoreCase);
}
private int Compare(string keyword, StringPosition pos2) {
return String.Compare(keyword, 0, this.strings[pos2.ListIndex], pos2.StringIndex, keyword.Length, this.ignoreCase);
}
// Packs index of string, and position within string into a single int. This is
// set up for strings no greater than 255 bytes. If longer strings are desired,
// the code for the constructor, and extracting ListIndex and StringIndex would
// need to be modified accordingly, taking bits from ListIndex and using them
// for StringIndex.
public struct StringPosition {
public static StringPosition NotFound = new StringPosition(-1, 0);
private readonly int position;
public StringPosition(int listIndex, byte stringIndex) {
this.position = (listIndex < 0) ? -1 : (listIndex << 8) | stringIndex;
}
public int ListIndex { get { return (this.position >= 0) ? (this.position >> 8) : -1; } }
public byte StringIndex { get { return (byte) (this.position & 0xFF); } }
public override string ToString() {
return ListIndex.ToString() + ":" + StringIndex;
}
}
}
}
There's one approach that would be a lot faster. But it would mean looking for exact word matches, rather than using the Contains functionality.
Basically, if you have the memory for it you could create a Dictionary of words which also reference some sort of ID (or IDs) for the strings in which the word is found.
So the Dictionary might be of type <string, List<int>>. The benefit here of course is that you're consolidating a lot of words into a smaller collection. And, the Dictionary is very fast with lookups since it's built on a hash table.
Now if this isn't what you're looking for you might search for in-memory full-text searching libraries. SQL Server supports full-text searching using indexing to speed up the process beyond traditional wildcard searches. But a pure in-memory solution would surely be faster. This still may not give you the exact functionality of a wildcard search, however.
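A sketch of that word index (splitting on whitespace only, which is an assumption - real tokenization would need more care):
using System;
using System.Collections.Generic;
using System.Linq;
static class WordIndex
{
// build: word -> indexes of the strings that contain it
public static Dictionary<string, List<int>> Build(IList<string> strings)
{
var index = new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);
for (int i = 0; i < strings.Count; i++)
{
var words = strings[i].Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries).Distinct(StringComparer.OrdinalIgnoreCase);
foreach (var word in words)
{
List<int> ids;
if (!index.TryGetValue(word, out ids))
index[word] = ids = new List<int>();
ids.Add(i);
}
}
return index;
}
// lookup is a single hash probe instead of scanning 100,000 strings,
// but it only finds whole-word matches, not arbitrary substrings
public static IEnumerable<string> Find(Dictionary<string, List<int>> index, IList<string> strings, string word)
{
List<int> ids;
return index.TryGetValue(word, out ids) ? ids.Select(i => strings[i]) : Enumerable.Empty<string>();
}
}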
In that case, what you need is a reverse index.
If you are keen to pay, you can use a database-specific full-text search index and tune the indexing to index on every subset of words.
Alternatively, you can use a very successful open source project that achieves the same thing.
You need to pre-index the strings using a tokenizer and build the reverse index file. We had a similar use case in Java where we needed a very fast autocomplete over a big set of data.
You can take a look at Lucene.NET which is a port of Apache Lucene (in Java).
If you are willing to ditch LINQ, you can use NHibernate Search. (wink).
Another option is to implement the pre-indexing in memory, with preprocessing that lets you bypass scanning where it isn't needed; take a look at the Knuth-Morris-Pratt algorithm.
I need to create a list of numbers from a range (for example from x to y) in a random order so that every order has an equal chance.
I need this for a music player I'm writing in C#, to create playlists in a random order.
Any ideas?
Thanks.
EDIT: I'm not interested in changing the original list, just in picking random indexes from a range in a random order so that every order has an equal chance.
Here's what I've written so far:
public static IEnumerable<int> RandomIndexes(int count)
{
if (count > 0)
{
int[] indexes = new int[count];
int indexesCountMinus1 = count - 1;
for (int i = 0; i < count; i++)
{
indexes[i] = i;
}
Random random = new Random();
while (indexesCountMinus1 > 0)
{
int currIndex = random.Next(0, indexesCountMinus1 + 1);
yield return indexes[currIndex];
indexes[currIndex] = indexes[indexesCountMinus1];
indexesCountMinus1--;
}
yield return indexes[0];
}
}
It's working, but the only problem is that I need to allocate an array in memory of size count. I'm looking for something that does not require memory allocation.
Thanks.
This can actually be tricky if you're not careful (i.e., using a naïve shuffling algorithm). Take a look at the Fisher-Yates/Knuth shuffle algorithm for proper distribution of values.
Once you have the shuffling algorithm, the rest should be easy.
Here's more detail from Jeff Atwood.
Lastly, here's Jon Skeet's implementation and description.
EDIT
I don't believe that there's a solution that satisfies your two conflicting requirements (first, to be random with no repeats and second to not allocate any additional memory). I believe you may be prematurely optimizing your solution as the memory implications should be negligible, unless you're embedded. Or, perhaps I'm just not smart enough to come up with an answer.
With that, here's code that will create an array of evenly distributed random indexes using the Knuth-Fisher-Yates algorithm (with a slight modification). You can cache the resulting array, or perform any number of optimizations depending on the rest of your implementation.
private static int[] BuildShuffledIndexArray( int size ) {
int[] array = new int[size];
Random rand = new Random();
for ( int currentIndex = array.Length - 1; currentIndex > 0; currentIndex-- ) {
int nextIndex = rand.Next( currentIndex + 1 );
Swap( array, currentIndex, nextIndex );
}
return array;
}
private static void Swap( IList<int> array, int firstIndex, int secondIndex ) {
if ( array[firstIndex] == 0 ) {
array[firstIndex] = firstIndex;
}
if ( array[secondIndex] == 0 ) {
array[secondIndex] = secondIndex;
}
int temp = array[secondIndex];
array[secondIndex] = array[firstIndex];
array[firstIndex] = temp;
}
NOTE: You can use ushort instead of int to halve the size in memory as long as you don't have more than 65,535 items in your playlist. You could always programmatically switch to int if the size exceeds ushort.MaxValue. If I, personally, added more than 65K items to a playlist, I wouldn't be shocked by increased memory utilization.
Remember, too, that this is a managed language. The VM will always reserve more memory than you are using to limit the number of times it needs to ask the OS for more RAM and to limit fragmentation.
EDIT
Okay, last try: we can look to tweak the performance/memory trade off: You could create your list of integers, then write it to disk. Then just keep a pointer to the offset in the file. Then every time you need a new number, you just have disk I/O to deal with. Perhaps you can find some balance here, and just read N-sized blocks of data into memory where N is some number you're comfortable with.
Seems like a lot of work for a shuffle algorithm, but if you're dead-set on conserving memory, then at least it's an option.
If you use a maximal linear feedback shift register, you will use O(1) of memory and roughly O(1) time. See here for a handy C implementation (two lines! woo-hoo!) and tables of feedback terms to use.
And here is a solution:
public class MaximalLFSR
{
private int GetFeedbackSize(uint v)
{
uint r = 0;
while ((v >>= 1) != 0)
{
r++;
}
if (r < 4)
r = 4;
return (int)r;
}
static uint[] _feedback = new uint[] {
0x9, 0x17, 0x30, 0x44, 0x8e,
0x108, 0x20d, 0x402, 0x829, 0x1013, 0x203d, 0x4001, 0x801f,
0x1002a, 0x2018b, 0x400e3, 0x801e1, 0x10011e, 0x2002cc, 0x400079, 0x80035e,
0x1000160, 0x20001e4, 0x4000203, 0x8000100, 0x10000235, 0x2000027d, 0x4000016f, 0x80000478
};
private uint GetFeedbackTerm(int bits)
{
if (bits < 4 || bits >= 28)
throw new ArgumentOutOfRangeException("bits");
return _feedback[bits];
}
public IEnumerable<int> RandomIndexes(int count)
{
if (count < 0)
throw new ArgumentOutOfRangeException("count");
int bitsForFeedback = GetFeedbackSize((uint)count);
Random r = new Random();
uint i = (uint)(r.Next(1, count - 1));
uint feedback = GetFeedbackTerm(bitsForFeedback);
int valuesReturned = 0;
while (valuesReturned < count)
{
if ((i & 1) != 0)
{
i = (i >> 1) ^ feedback;
}
else {
i = (i >> 1);
}
if (i <= count)
{
valuesReturned++;
yield return (int)(i-1);
}
}
}
}
Now, I selected the feedback terms (badly) at random from the link above. You could also implement a version that had multiple maximal terms and you select one of those at random, but you know what? This is pretty dang good for what you want.
Here is test code:
static void Main(string[] args)
{
while (true)
{
Console.Write("Enter a count: ");
string s = Console.ReadLine();
int count;
if (Int32.TryParse(s, out count))
{
MaximalLFSR lfsr = new MaximalLFSR();
foreach (int i in lfsr.RandomIndexes(count))
{
Console.Write(i + ", ");
}
}
Console.WriteLine("Done.");
}
}
Be aware that maximal LFSRs never generate 0. I've hacked around this by returning the i term - 1. This works well enough. Also, since you want to guarantee uniqueness, I ignore anything out of range - the LFSR only generates sequences up to powers of two, so in high ranges it will generate, at worst, 2x-1 too many values. These will get skipped - that will still be faster than Fisher-Yates-Knuth.
Personally, for a music player, I wouldn't generate a shuffled list, and then play that, then generate another shuffled list when that runs out, but do something more like:
Random random = new Random(); // the original snippet assumes this field exists
IEnumerable<Song> GetSongOrder(List<Song> allSongs)
{
var playOrder = new List<Song>();
while (true)
{
// this step assigns an integer weight to each song,
// corresponding to how likely it is to be played next:
// songs heard within the last 10 picks get a low weight.
// in a better implementation, this would look at the total number of
// songs as well, and provide a smoother ramp up/down.
var weights = allSongs.Select(x => playOrder.LastIndexOf(x) > playOrder.Count - 10 ? 1 : 50).ToArray();
int position = random.Next(weights.Sum());
for (int i = 0; i < allSongs.Count; i++)
{
position -= weights[i];
if (position < 0)
{
var song = allSongs[i];
playOrder.Add(song);
yield return song;
break;
}
}
// trim playOrder to prevent unbounded memory growth here as well.
if (playOrder.Count > allSongs.Count * 10)
playOrder = playOrder.Skip(allSongs.Count * 8).ToList();
}
}
This makes songs get picked essentially at random, as long as they haven't been recently played. This provides "smoother" transitions from the end of one shuffle to the next, because the first song of the next shuffle could be the same song as the last shuffle with 1/(total songs) probability, whereas this algorithm has a lower (and configurable) chance of hearing one of the last x songs again.
Unless you shuffle the original song list (which you said you don't want to do), you are going to have to allocate some additional memory to accomplish what you're after.
If you generate the random permutation of song indices beforehand (as you are doing), you obviously have to allocate some non-trivial amount of memory to store it, either encoded or as a list.
If the user doesn't need to be able to see the list, you could generate the random song order on the fly: After each song, pick another random song from the pool of unplayed songs. You still have to keep track of which songs have already been played, but you can use a bitfield for that. If you have 10000 songs, you just need 10000 bits (1250 bytes), each one representing whether the song has been played yet.
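A small sketch of that on-the-fly approach, with a BitArray tracking what has been played (everything beyond the idea in the answer above is an assumption):
using System;
using System.Collections;
using System.Collections.Generic;
static class OnTheFlyShuffle
{
// plays every song exactly once, in random order, using ~1 bit per song plus a counter -
// no precomputed permutation is stored
public static IEnumerable<int> PlayOrder(int songCount, Random random)
{
var played = new BitArray(songCount); // one bit per song
int remaining = songCount;
while (remaining > 0)
{
int candidate = random.Next(songCount);
if (played[candidate])
continue; // already played - draw again
// note: near the end most draws are rejected; picking directly from the
// unplayed songs, as the answer suggests, avoids that at the same memory cost
played[candidate] = true;
remaining--;
yield return candidate;
}
}
}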
I don't know your exact limitations, but I have to wonder if the memory required to store a playlist is significant compared to the amount required for playing audio.
There are a number of methods of generating permutations without needing to store the state. See this question.
I think you should stick to your current solution (the one in your edit).
To do a re-order with no repetitions without making your code behave unreliably, you have to track what you have already used - either directly, by keeping the unused indexes, or indirectly, by swapping from the original list.
I suggest checking it in the context of the working application, i.e. whether the memory it uses is of any significance compared to the memory used by other pieces of the system.
From a logical standpoint, it is possible. Given a list of n songs, there are n! permutations; if you assign each permutation a number from 1 to n! (or 0 to n!-1 :-D) and pick one of those numbers at random, you can then store the number of the permutation that you are currently using, along with the original list and the index of the current song within the permutation.
For example, if you have a list of songs {1, 2, 3}, your permutations are:
0: {1, 2, 3}
1: {1, 3, 2}
2: {2, 1, 3}
3: {2, 3, 1}
4: {3, 1, 2}
5: {3, 2, 1}
So the only data I need to track is the original list ({1, 2, 3}), the current song index (e.g. 1) and the index of the permutation (e.g. 3). Then, if I want to find the next song to play, I know it's the third (index 2, zero-based) song of permutation 3, i.e. Song 1.
However, this method relies on you having an efficient means of determining the ith song of the jth permutation, which until I've had a chance to think (or someone with a stronger mathematical background than I can interject) is equivalent to "then a miracle happens". But the principle is there.
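For what it's worth, the "miracle" step has a standard solution: write j in the factorial number system (the factoradic) and use its digits to pick elements out of the shrinking list of remaining songs. A sketch, assuming n is small enough that n! fits in a long (n <= 20):
using System;
using System.Collections.Generic;
static class PermutationIndexing
{
// returns element i of permutation number j (in lexicographic order) of the given list,
// without materializing the whole permutation
public static T ElementOfPermutation<T>(IList<T> items, long j, int i)
{
var remaining = new List<T>(items);
long radix = Factorial(remaining.Count - 1);
for (int position = 0; ; position++)
{
int pick = (int)(j / radix); // next factoradic digit
j %= radix;
if (position == i)
return remaining[pick];
remaining.RemoveAt(pick); // that element is used up for later positions
radix = remaining.Count > 1 ? radix / remaining.Count : 1;
}
}
static long Factorial(int n)
{
long f = 1;
for (int k = 2; k <= n; k++) f *= k;
return f;
}
}
With the example above ({1, 2, 3}, permutation 3, song index 2) this returns 1, matching the walkthrough.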
If memory were really a concern only after a certain number of records, and it's safe to say that once that memory boundary is reached there are enough items in the list that a few repeats won't matter (just as long as the same song is not repeated twice), I would use a combination method.
Case 1: If count < max memory constraint, generate the playlist ahead of time and use Knuth shuffle (see Jon Skeet's implementation, mentioned in other answers).
Case 2: If count >= max memory constraint, the song to be played will be determined at run time (I'd do it as soon as the song starts playing so the next song is already generated by the time the current song ends). Save the last [max memory constraint, or some token value] number of songs played, generate a random number (R) between 1 and song count, and if R = one of X last songs played, generate a new R until it is not in the list. Play that song.
Your max memory constraints will always be upheld, although performance can suffer in case 2 if you've played a lot of songs/get repeat random numbers frequently by chance.
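A sketch of Case 2's bookkeeping (the history size and when to draw are up to the caller; this only shows the bounded-memory part):
using System;
using System.Collections.Generic;
static class BoundedHistoryPicker
{
// endless stream of song indexes that never repeats any of the last historySize picks;
// historySize must be smaller than songCount or the rejection loop cannot finish
public static IEnumerable<int> Pick(int songCount, int historySize, Random random)
{
var recentQueue = new Queue<int>();
var recentSet = new HashSet<int>();
while (true)
{
int next;
do { next = random.Next(songCount); } while (recentSet.Contains(next));
yield return next;
recentQueue.Enqueue(next);
recentSet.Add(next);
if (recentQueue.Count > historySize)
recentSet.Remove(recentQueue.Dequeue());
}
}
}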
You could use a trick we use in SQL Server to order sets randomly, with the help of a GUID, like this. The values always come out evenly, randomly distributed.
private IEnumerable<int> RandomIndexes(int startIndexInclusive, int endIndexInclusive)
{
if (endIndexInclusive < startIndexInclusive)
throw new Exception("endIndex must be equal or higher than startIndex");
List<int> originalList = new List<int>(endIndexInclusive - startIndexInclusive);
for (int i = startIndexInclusive; i <= endIndexInclusive; i++)
originalList.Add(i);
return from i in originalList
orderby Guid.NewGuid()
select i;
}
You're going to have to allocate some memory, but it doesn't have to be a lot. You can reduce the memory footprint (the degree by which I'm unsure, as I don't know that much about the guts of C#) by using a bool array instead of int. Best case scenario this will only use (count / 8) bytes of memory, which isn't too bad (but I doubt C# actually represents bools as single bits).
public static IEnumerable<int> RandomIndexes(int count) {
Random rand = new Random();
bool[] used = new bool[count];
int i;
for (int counter = 0; counter < count; counter++) {
while (used[i = rand.Next(count)]); //i = some random unused value
used[i] = true;
yield return i;
}
}
Hope that helps!
As many others have said you should implement THEN optimize, and only optimize the parts that need it (which you check on with a profiler). I offer a (hopefully) elegant method of getting the list you need, which doesn't really care so much about performance:
using System;
using System.Collections.Generic;
using System.Linq;
namespace Test
{
class Program
{
static void Main(string[] a)
{
Random random = new Random();
List<int> list1 = Enumerable.Range(0, 20).ToList(); // source list (filled with sample data so the demo produces output)
List<int> list2 = new List<int>();
list2 = random.SequenceWhile((i) =>
{
if (list2.Contains(i))
{
return true; // already drawn - skip it
}
list2.Add(i);
return false; // new value - SequenceWhile yields it because shouldSkip returned false
},
() => list2.Count < list1.Count, // keep drawing until every index has been produced
list1.Count).ToList();
}
}
public static class RandomExtensions
{
public static IEnumerable<int> SequenceWhile(
this Random random,
Func<int, bool> shouldSkip,
Func<bool> continuationCondition,
int maxValue)
{
int current = random.Next(maxValue);
while (continuationCondition())
{
if (!shouldSkip(current))
{
yield return current;
}
current = random.Next(maxValue);
}
}
}
}
It is pretty much impossible to do it without allocating extra memory. If you're worried about the amount of extra memory allocated, you could always pick a random subset and shuffle between those. You'll get repeats before every song is played, but with a sufficiently large subset I'll warrant few people will notice.
const int MaxItemsToShuffle = 20;
public static IEnumerable<int> RandomIndexes(int count)
{
Random random = new Random();
int indexCount = Math.Min(count, MaxItemsToShuffle);
int[] indexes = new int[indexCount];
if (count > MaxItemsToShuffle)
{
int cur = 0, subsetCount = MaxItemsToShuffle;
for (int i = 0; i < count; i += 1)
{
if (random.NextDouble() < ((float)subsetCount / (float)(count - i))) // selection sampling: pick with probability (picks still needed) / (items still remaining)
{
indexes[cur] = i;
cur += 1;
subsetCount -= 1;
}
}
}
else
{
for (int i = 0; i < count; i += 1)
{
indexes[i] = i;
}
}
for (int i = indexCount; i > 0; i -= 1)
{
int curIndex = random.Next(0, i);
yield return indexes[curIndex];
indexes[curIndex] = indexes[i - 1];
}
}