Generate Short Random Unique Promotional Codes - C#

In a C# application I need to create UNIQUE promotional codes.
The promotional codes will be stored in a SQL table (SQL Server 2012).
Initially I thought of GUIDs, but they are too long to give to users.
I am considering a 6-character alphanumeric code, which with 36 symbols (A-Z, 0-9) gives 36^6 = 2,176,782,336 unique combinations.
What do you think?
But how do I generate the code so that it is UNIQUE? I am considering:
1. Use Random;
2. Use the ticks of the current DateTime;
3. Pre-generate all codes in the database and pick one randomly.
This last approach has the problem of occupying too much space.
What would be a good approach to generate the random unique code?
UPDATE 1
For the approach in 1 I came up with the following C# code:
Random random = new Random();
Int32 count = 20;
Int32 length = 5;
List<String> codes = new List<String>();
Char[] keys = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".ToCharArray();
while (codes.Count < count) {
    var code = Enumerable.Range(1, length)
        .Select(k => keys[random.Next(keys.Length)]) // Random char; the upper bound is exclusive, so no "- 1"
        .Aggregate("", (e, c) => e + c); // Join into a string
    codes.Add(code);
}
UPDATE 2
String source = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
while (codes.Count < count) {
    length = 5;
    StringBuilder builder = new StringBuilder(5);
    while (length-- > 0)
        builder.Append(source[random.Next(source.Length)]);
    codes.Add(builder.ToString());
}
Which approach do you think is faster?
Thank You,
Miguel

Eric Lippert showed how to use a multiplicative inverse to obfuscate sequential keys. Basically, the idea is to generate sequential keys (i.e. 1, 2, 3, 4, etc.) and then obfuscate them.
I showed a more convoluted way to do it in my article series Obfuscating Sequential Keys.
The beauty of this approach is that the keys appear random, but all you have to keep track of is a single number: the next sequential value to be generated.
YouTube uses a similar technique to generate their video IDs.
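As a rough illustration of the idea (not Eric Lippert's exact code), the sketch below scrambles a sequential id with a modular multiplication and then writes the result in base 36. The class name, multiplier and offset are hypothetical values chosen for the example. The only real requirement is that the multiplier shares no factor with 36^6, so it must not be divisible by 2 or 3. Decoding would use the modular inverse of the multiplier and is not shown. A linear scramble like this only hides the sequence from casual inspection.

```csharp
using System;

static class PromoCodeObfuscator
{
    // Assumed alphabet and code length; both are illustrative choices.
    private const string Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private const int CodeLength = 6;
    private const long Modulus = 2176782336L;   // 36^6, the number of possible codes

    // Hypothetical secret multiplier. It is odd and not divisible by 3, so it is
    // coprime with Modulus, which makes the mapping one-to-one.
    private const long Multiplier = 1234567891L;

    // Hypothetical offset so that id 0 does not map to "AAAAAA".
    private const long Offset = 987654321L;

    public static string Encode(long sequentialId)
    {
        if (sequentialId < 0 || sequentialId >= Modulus)
            throw new ArgumentOutOfRangeException(nameof(sequentialId));

        // The product stays below long.MaxValue because both factors are below 2^32.
        long scrambled = (sequentialId * Multiplier + Offset) % Modulus;

        // Write the scrambled value as a fixed-width base-36 string.
        var chars = new char[CodeLength];
        for (int i = CodeLength - 1; i >= 0; i--)
        {
            chars[i] = Alphabet[(int)(scrambled % Alphabet.Length)];
            scrambled /= Alphabet.Length;
        }
        return new string(chars);
    }
}
```

Under these assumptions, the only state to persist is the next sequential id; calling Encode on 0, 1, 2, ... never produces the same code twice until all 36^6 ids are used.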

Don't worry about generating a unique code. Instead . . .
1. Generate a random code.
2. Insert that code into a database. It belongs in a column that has a unique index.
3. Trap the error that results from trying to insert a duplicate value.
4. If you get a duplicate value error, generate another random code, and try again.
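The retry loop described above could be sketched like this. To keep the example self-contained, the database insert is represented by a hypothetical callback, tryInsert, which should return false when the unique index rejects the row. With SQL Server that would typically mean catching a SqlException whose Number is 2601 or 2627. The class name, method names and attempt limit are assumptions made for the illustration.

```csharp
using System;

static class UniqueCodeInserter
{
    private const string Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    public static string NewRandomCode(Random random, int length)
    {
        var chars = new char[length];
        for (int i = 0; i < length; i++)
            chars[i] = Alphabet[random.Next(Alphabet.Length)]; // upper bound is exclusive
        return new string(chars);
    }

    // tryInsert is a hypothetical callback: it runs the INSERT and returns false
    // when the database reports a duplicate key, true when the row was stored.
    public static string InsertUniqueCode(
        Func<string, bool> tryInsert, Random random, int length, int maxAttempts = 10)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            string code = NewRandomCode(random, length);
            if (tryInsert(code))
                return code; // stored successfully, so it is unique
        }
        throw new InvalidOperationException("Could not find an unused code; the code space may be nearly full.");
    }
}
```

The attempt limit is there so that a nearly exhausted code space fails loudly instead of looping forever.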

I'd go for number 1.
Number 2 is not that random (you have a lower and an upper limit), and number 3 is overkill.
Maybe you can use something like this:
DECLARE @Values varchar(100)
SET @Values = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
SELECT SUBSTRING(@Values, CAST(FLOOR(RAND()*LEN(@Values)) + 1 AS int), 1) +
    SUBSTRING(@Values, CAST(FLOOR(RAND()*LEN(@Values)) + 1 AS int), 1) +
    SUBSTRING(@Values, CAST(FLOOR(RAND()*LEN(@Values)) + 1 AS int), 1) +
    SUBSTRING(@Values, CAST(FLOOR(RAND()*LEN(@Values)) + 1 AS int), 1) +
    SUBSTRING(@Values, CAST(FLOOR(RAND()*LEN(@Values)) + 1 AS int), 1) +
    SUBSTRING(@Values, CAST(FLOOR(RAND()*LEN(@Values)) + 1 AS int), 1)
Use a table for storing previously generated codes to check for uniqueness.

Related

How to generate random numbers from a range when a button is used, but the numbers cannot repeat [duplicate]

This question already has answers here:
"order by newid()" - how does it work?
I am creating a quiz and I'm using a random number from a range of 1 - 20 (the primary keys):
Random r = new Random();
int rInt = r.Next(1, 9);
The numbers (primary keys) are then used in a query to select 5 random questions, but the problem is that I am getting repeated questions because the numbers repeat:
string SQL = "SELECT QuestionText,CorrectAnswer,WrongAnswer1,WrongAnswer2,WrongAnswer3 FROM Question Where QuestionID = " + rInt;
I have tried some methods to fix it, but it's not working and I'm running out of ideas. Does anyone have any suggestions?
Just ask the database for it:
string SQL = @"
SELECT TOP 5 QuestionText,CorrectAnswer,WrongAnswer1,WrongAnswer2,WrongAnswer3
FROM Question
ORDER BY NewID()";
If/when you outgrow this, there exists a more optimized solution as well:
string SQL = @"
WITH cte AS
(
    SELECT TOP 5 QuestionId FROM Question ORDER BY NEWID()
)
SELECT QuestionText,CorrectAnswer,WrongAnswer1,WrongAnswer2,WrongAnswer3
FROM cte c
JOIN Question q
    ON q.QuestionId = c.QuestionId
";
The second query will perform much better (assuming QuestionId is your primary key) because it will only have to read the primary index (which will likely already be in memory), generate the Guids, pick the top 5 using the most efficient method, then look up those 5 records using the primary key.
The first query should work just fine for smaller number of questions, but I believe it may cause a table scan, and some pressure on tempdb, so if your questions are varchar(max) and get very long, or you have tens of thousands of questions with a very small tempdb with some versions of Sql Server, it may not perform great.
Something like this might do the trick for you:
[ThreadStatic]
private static Random __random = null;
public int[] Get5RandomQuestions()
{
__random = __random ?? new Random(Guid.NewGuid().GetHashCode()); // approx one in 850 chance of seed collision
using (var context = new MyDBContext())
{
var questions = context.Questions.Select(x => x.Question_ID).ToArray();
return questions.OrderBy(_ => __random.Next()).Take(5).ToArray();
}
}
Another, server side approach:
private static Random _r = new Random();
...
var seed = _r.NextDouble();
using var context = new SomeContext();
var questions = context.Questions
.OrderBy(p => SqlFunctions.Checksum(p.Id * seed))
.Take(5);
Note: Checksum is not bulletproof; limitations apply. This approach should not be used to generate quiz questions in life or death situations.
As per request:
SqlFunctions.Checksum will essentially generate a hash and order by it
CHECKSUM([Id] * <seed>) AS [C1],
...
ORDER BY [C1] ASC
CHECKSUM (Transact-SQL)
The CHECKSUM function returns the checksum value computed over a table
row, or over an expression list. Use CHECKSUM to build hash indexes.
...
CHECKSUM computes a hash value, called the checksum, over its argument
list. Use this hash value to build hash indexes. A hash index will
result if the CHECKSUM function has column arguments, and an index is
built over the computed CHECKSUM value. This can be used for equality
searches over the columns.
Note, as mentioned before, Checksum is not bulletproof: it returns an int (take it for what it is). However, the chance of a collision or duplicate is extremely small for smaller data sets when it is used this way with a unique Id, and it is also fairly performant.
Running this on a production database with 10 million records many times, there were no collisions.
In regards to speed, it can get the top 5 in 75 ms; however, it is slower when the query is generated by EF.
The CTE solution tendered for NewId takes about 125 ms.
The Linq .Distinct() method is too nice to not use here
The easiest way I know of doing this would be like below, using a method to create an infinite stream of random numbers, which can then be nicely wrangled with Linq:
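using System;
using System.Collections.Generic;
using System.Linq;
IEnumerable<int> GenRandomNumbers()
{
    var random = new Random();
    while (true)
    {
        yield return random.Next(1, 21); // upper bound is exclusive, so this yields 1..20
    }
}
var numbers = GenRandomNumbers()
    .Distinct()
    .Take(5)
    .ToArray();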
Though it looks like the generator method will run forever because of its infinite loop, it only runs until 5 distinct numbers have been produced, because the sequence is evaluated lazily as it yields.
Try selecting all the questions from the db. Say you have them in a collection 'Questions'; you could then try
Questions.OrderBy(y => Guid.NewGuid()).Take(5).ToList()
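Another option, when the ids are just a contiguous range such as 1 to 20, is a partial Fisher-Yates shuffle: only the first few positions of the pool are shuffled, so each pick is distinct by construction. This is a minimal sketch; the class and method names are made up for the example, and it assumes the question ids really are the integers from min to max.

```csharp
using System;
using System.Linq;

static class QuestionPicker
{
    // Picks `count` distinct values from the inclusive range [min, max].
    public static int[] PickDistinct(Random random, int min, int max, int count)
    {
        int[] pool = Enumerable.Range(min, max - min + 1).ToArray();
        if (count > pool.Length)
            throw new ArgumentException("Cannot pick more values than the range contains.");

        for (int i = 0; i < count; i++)
        {
            // Choose one of the not-yet-picked positions (upper bound is exclusive)
            // and swap it into position i.
            int j = random.Next(i, pool.Length);
            (pool[i], pool[j]) = (pool[j], pool[i]);
        }
        return pool.Take(count).ToArray();
    }
}
```

For the quiz, QuestionPicker.PickDistinct(new Random(), 1, 20, 5) would return five different ids that can be passed to a single query.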

Generate 1M unique random keys with alphanumeric subset

I want to generate 1M random (appearing) unique alphanumeric keys and store them in a database. Each key will be 8 characters long, and only the subset "abcdefghijknpqrstuvxyz" plus the digits 0-9 will be used.
The letters l, m, o and w are ditched. m and w are left out because of limited printing space, as each key will be printed on a product in a very small space. Dropping m and w allowed us to increase the letter size by 2pt, improving readability. l and o were dropped because they are easily mixed up with 1, i and 0 at the current printing size. We did some testing: the characters 1, i and 0 were always read correctly, while l and o led to too many mistakes. Capitals were left out for the same reason as m and w.
So why not a sequence? A few reasons: The keys can be registered afterwards and we do not want anyone guessing the next key in the sequence and register somebody else's key. Appearance: we don't need customers and competition to know we only shipped a few thousand keys.
Is there a practical way to generate the keys, ensure the uniqueness of each key and store them in a database? Thanks!
Edit: @CodeInChaos pointed out a problem: System.Random isn't very secure, and the sequence could be reproduced without a great deal of difficulty. I've replaced Random with a secure generator here:
using System.Collections.Generic;
using System.Security.Cryptography;
var possibilities = "abcdefghijknpqrstuvxyz0123456789".ToCharArray();
int goal = 1000000;
int codeLength = 8;
var codes = new HashSet<string>();
var random = new RNGCryptoServiceProvider();
while (codes.Count < goal)
{
    var newCode = new char[codeLength];
    for (int i = 0; i < codeLength; i++)
        newCode[i] = possibilities[random.Next((byte)possibilities.Length)]; // cast needed: Next takes a byte
    codes.Add(new string(newCode)); // the HashSet silently ignores duplicates
}
// now write codes to database
static class Extensions
{
    // Rejection sampling: draw bytes until one falls below maximum, which avoids modulo bias
    public static byte Next(this RNGCryptoServiceProvider provider, byte maximum)
    {
        var b = new byte[1];
        while (true)
        {
            provider.GetBytes(b);
            if (b[0] < maximum)
                return b[0];
        }
    }
}
(the Next method isn't very fast, but might be good enough for your purposes)
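On newer runtimes the same idea can be written without the hand-rolled extension method. The sketch below assumes .NET Core 3.0 or later, where the static method RandomNumberGenerator.GetInt32 is available; the class and method names are invented for the example, and writing the keys to the database is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

static class ProductKeyGenerator
{
    // The 32-symbol alphabet from the question (no l, m, o, w, no capitals).
    private const string Alphabet = "abcdefghijknpqrstuvxyz0123456789";

    public static HashSet<string> Generate(int count, int keyLength)
    {
        var keys = new HashSet<string>();
        var buffer = new char[keyLength];
        while (keys.Count < count)
        {
            for (int i = 0; i < keyLength; i++)
            {
                // GetInt32 returns a cryptographically strong value in [0, Alphabet.Length).
                buffer[i] = Alphabet[RandomNumberGenerator.GetInt32(Alphabet.Length)];
            }
            keys.Add(new string(buffer)); // duplicates are ignored by the set
        }
        return keys;
    }
}
```

With 32^8 (about 1.1 trillion) possible keys, collisions among one million keys are rare, so the loop should finish after barely more than one million iterations.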
1 million isn't much these days and you can probably do that on a single machine fairly quickly. It's a one-time operation after all.
Take a hashtable (or hashset)
Generate random keys and put them into it as keys (or directly if a set) until the count is 1 million
Write them to the database
My quick and dirty testing code looked like this:
function new-key {-join'abcdefghijknpqrstuvxyz0123456789'[(0..7|%{random 32})]}
$keys = @{}
for(){$keys[(new-key)]=1}
But PowerShell is slow, so I'd expect C++ or C# do very well here.
Is there a practical way to generate the keys, ensure the uniqueness
of each key and store them in a database?
Since this is a single operation you could simply do the following:
1) Generate A Single Key
2) Verify that the generated key does not exist in the database.
3) If it does exist generate a new key.
3b) If it does not exist write it to the database
4) Go Back to Step 1
There are other choices of course, but in the end it boils down to generating a key and making sure it does not exist in the database.
You could in theory generate 10 million keys (in order to save processing power) and write them to a file. Once the keys are generated, just look at each one and see if it already exists in the database. You could likely program a tool that does this in less than 48 hours.
I encountered a similar problem once. What I did was create a unique sequence YYYY/MM/DD/HH/MM/SS/millis/nano and get its hash code. After that I used the hash as a key. Your clients and your competitors won't be able to guess the next value. It might not be foolproof, but in my case it was enough!
To actually get the random string, you can use code similar to this:
Random rand = new Random(); // new DateTime().Millisecond is always 0, which would fix the seed
String[] possibilities = {"a","b","c","d","e","f","g","h","i","j","k",
    "n","p","q","r","s","t","u","v","x","y","z","0","1","2","3","4",
    "5","6","7","8","9"}; // l, m, o and w are excluded, as the question requires
for (int i = 0; i < 1000000; ++i)
{
    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    for (int j = 0; j < 8; ++j)
    {
        sb.Append(possibilities[rand.Next(possibilities.Length)]);
    }
    if (!databaseContains(sb.ToString()))
        databaseAdd(sb.ToString());
    else
        --i; // duplicate: retry this iteration
}

Optimizing this C# algorithm (K Difference)

This is the problem I'm solving (it's a sample problem, not a real problem):
Given N numbers , [N<=10^5] we need to count the total pairs of
numbers that have a difference of K. [K>0 and K<1e9]
Input Format: 1st line contains N & K (integers). 2nd line contains N
numbers of the set. All the N numbers are assured to be distinct.
Output Format: One integer saying the no of pairs of numbers that have
a diff K.
Sample Input #00:
5 2
1 5 3 4 2
Sample Output #00:
3
Sample Input #01:
10 1
363374326 364147530 61825163 1073065718 1281246024 1399469912 428047635 491595254 879792181 1069262793
Sample Output #01:
0
I already have a solution (and I haven't been able to optimize it as well as I had hoped). Currently my solution gets a score of 12/15 when it is run, and I'm wondering why I can't get 15/15 (my solution to another problem wasn't nearly as efficient, but got all of the points). Apparently, the code is run using "Mono 2.10.1, C# 4".
So can anyone think of a better way to optimize this further? The VS profiler says to avoid calling String.Split and Int32.Parse. The calls to Int32.Parse can't be avoided, although I guess I could optimize tokenizing the array.
My current solution:
using System;
using System.Collections.Generic;
using System.Text;
using System.Linq;
namespace KDifference
{
class Solution
{
static void Main(string[] args)
{
char[] space = { ' ' };
string[] NK = Console.ReadLine().Split(space);
int N = Int32.Parse(NK[0]), K = Int32.Parse(NK[1]);
int[] nums = Console.ReadLine().Split(space, N).Select(x => Int32.Parse(x)).OrderBy(x => x).ToArray();
int KHits = 0;
for (int i = nums.Length - 1, j, k; i >= 1; i--)
{
for (j = 0; j < i; j++)
{
k = nums[i] - nums[j];
if (k == K)
{
KHits++;
}
else if (k < K)
{
break;
}
}
}
Console.Write(KHits);
}
}
}
Your algorithm is still O(n^2), even with the sorting and the early-out. And even if you eliminated the O(n^2) bit, the sort is still O(n lg n). You can use an O(n) algorithm to solve this problem. Here's one way to do it:
Suppose the set you have is S1 = { 1, 7, 4, 6, 3 } and the difference is 2.
Construct the set S2 = { 1 + 2, 7 + 2, 4 + 2, 6 + 2, 3 + 2 } = { 3, 9, 6, 8, 5 }.
The answer you seek is the cardinality of the intersection of S1 and S2. The intersection is {6, 3}, which has two elements, so the answer is 2.
You can implement this solution in a single line of code, provided that you have a sequence of integers, sequence, and an integer, difference:
int result = sequence.Intersect(from item in sequence select item + difference).Count();
The Intersect method will build an efficient hash table for you that is O(n) to determine the intersection.
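The same idea can be spelled out with an explicit HashSet, which may make the O(n) behaviour easier to see. This is only a sketch: the class and method names are invented, and the values are widened to long so that n + k cannot overflow for inputs near int.MaxValue.

```csharp
using System.Collections.Generic;

static class KDifference
{
    // Counts the pairs (a, b) with b - a == k, assuming the input numbers are distinct and k > 0.
    public static int CountPairs(IEnumerable<int> numbers, int k)
    {
        var set = new HashSet<long>();
        foreach (int n in numbers)
            set.Add(n);

        int count = 0;
        foreach (long n in set)
        {
            // Each pair is counted once, from its smaller element.
            if (set.Contains(n + k))
                count++;
        }
        return count;
    }
}
```

For the first sample input (1 5 3 4 2 with K = 2) this counts the pairs (1, 3), (3, 5) and (2, 4), giving 3.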
Try this (note, untested):
1. Sort the array
2. Start two indexes at 0
3. If the difference between the numbers at those two positions is equal to K, increase the count and increase one of the two indexes (if numbers aren't duplicated, increase both)
4. If the difference is larger than K, increase index #1
5. If the difference is less than K, increase index #2; if that would place it outside the array, you're done
6. Otherwise, go back to 3 and keep going
Basically, try to keep the two indexes apart by K value difference.
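The two-index walk above might look like the following in C#. It is a sketch under the problem's stated assumptions (distinct numbers, K > 0); the class and method names are made up, and the difference is computed as a long to avoid overflow. The sort makes it O(n log n) overall.

```csharp
using System;

static class KDifferenceTwoPointer
{
    public static int CountPairs(int[] numbers, int k)
    {
        var sorted = (int[])numbers.Clone();
        Array.Sort(sorted);

        int count = 0;
        int low = 0;   // index #1: the smaller element of the candidate pair
        int high = 0;  // index #2: the larger element of the candidate pair
        while (high < sorted.Length)
        {
            long diff = (long)sorted[high] - sorted[low];
            if (diff == k)
            {
                // Numbers are distinct, so both indexes can move on.
                count++;
                low++;
                high++;
            }
            else if (diff > k)
            {
                low++;   // gap too large: bring the lower index closer
            }
            else
            {
                high++;  // gap too small (this includes low == high): push the upper index out
            }
        }
        return count;
    }
}
```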
You should write up a series of unit-tests for your algorithm, and try to come up with edge cases.
This would allow you to do it in a single pass. Using hash sets is beneficial if there are many values to parse/check. You might also want to use a bloom filter in combination with hash sets to reduce lookups.
1. Initialize. Let A and B be two empty hash sets. Let c be zero.
2. Parse loop. Parse the next value v. If there are no more values the algorithm is done and the result is in c.
3. Back check. If v exists in A then increment c and jump back to 2.
4. Low match. If v - K > 0 then:
    insert v - K into A
    if v - K exists in B then increment c (and optionally remove v - K from B).
5. High match. If v + K < 1e9 then:
    insert v + K into A
    if v + K exists in B then increment c (and optionally remove v + K from B).
6. Remember. Insert v into B.
7. Jump back to 2.
// php solution for this k difference
function getEqualSumSubstring($l,$s) {
$s = str_replace(' ','',$s);
$l = str_replace(' ','',$l);
for($i=0;$i<strlen($s);$i++)
{
$array1[] = $s[$i];
}
for($i=0;$i<strlen($s);$i++)
{
$array2[] = $s[$i] + $l[1];
}
return count(array_intersect($array1,$array2));
}
echo getEqualSumSubstring("5 2","1 3 5 4 2");
Actually that's trivial to solve with a hashmap:
First put each number into a hashmap: dict((x, x) for x in numbers) in "pythony" pseudo code ;)
Now you just iterate through every number in the hashmap and check if number + K is in the hashmap. If yes, increase count by one.
The obvious improvement to the naive solution is to ONLY check for the higher (or lower) bound, otherwise you get the double results and have to divide by 2 afterwards - useless.
This is O(N) for creating the hashmap when reading the values in and O(N) when iterating through, i.e. O(N) and about 8loc in python (and it is correct, I just solved it ;-) )
Following Eric's answer, I am pasting the implementation of the Intersect method below; it is O(n):
private static IEnumerable<TSource> IntersectIterator<TSource>(IEnumerable<TSource> first, IEnumerable<TSource> second, IEqualityComparer<TSource> comparer)
{
Set<TSource> set = new Set<TSource>(comparer);
foreach (TSource current in second)
{
set.Add(current);
}
foreach (TSource current2 in first)
{
if (set.Remove(current2))
{
yield return current2;
}
}
yield break;
}

Performance issue with generation of random unique numbers

I have a situation whereby I need to create tens of thousands of unique numbers. However, these numbers must be 9 digits long and cannot contain any 0s. My current approach is to generate 9 digits (1-9) and concatenate them together, and if the number is not already in the list, add it. E.g.
public void generateIdentifiers(int quantity)
{
uniqueIdentifiers = new List<string>(quantity);
while (this.uniqueIdentifiers.Count < quantity)
{
string id = string.Empty;
id += random.Next(1,10);
id += random.Next(1,10);
id += random.Next(1,10);
id += " ";
id += random.Next(1,10);
id += random.Next(1,10);
id += random.Next(1,10);
id += " ";
id += random.Next(1,10);
id += random.Next(1,10);
id += random.Next(1,10);
if (!this.uniqueIdentifiers.Contains(id))
{
this.uniqueIdentifiers.Add(id);
}
}
}
However at about 400,000 the process really slows down as more and more of the generated numbers are duplicates. I am looking for a more efficient way to perform this process, any help would be really appreciated.
Edit: - I'm generating these - http://www.nhs.uk/NHSEngland/thenhs/records/Pages/thenhsnumber.aspx
As others have mentioned, use a HashSet<T> instead of a List<T>.
Furthermore, using StringBuilder instead of simple string operations will gain you another 25%. If you can use numbers instead of strings, you win, because it only takes a third or fourth of the time.
var quantity = 400000;
var uniqueIdentifiers = new HashSet<int>();
while (uniqueIdentifiers.Count < quantity)
{
int i=0;
i = i*10 + random.Next(1,10);
i = i*10 + random.Next(1,10);
i = i*10 + random.Next(1,10);
i = i*10 + random.Next(1,10);
i = i*10 + random.Next(1,10);
i = i*10 + random.Next(1,10);
i = i*10 + random.Next(1,10);
i = i*10 + random.Next(1,10);
i = i*10 + random.Next(1,10);
uniqueIdentifiers.Add(i);
}
It takes about 270 ms on my machine for 400,000 numbers and about 700 for 1,000,000. And this even without any parallelism.
Because of the use of a HashSet<T> instead of a List<T>, this algorithm runs in O(n), i.e. the duration grows linearly. 10,000,000 values therefore take about 7 seconds.
This suggestion may or may not be popular.... it depends on people's perspective. Because you haven't been too specific about what you need them for, how often, or the exact number, I will suggest a brute force approach.
I would generate a hundred thousand numbers - shouldn't take very long at all, maybe a few seconds? Then use Parallel LINQ to do a Distinct() on them to eliminate duplicates. Then use another PLINQ query to run a regex against the remainder to eliminate any with zeroes in them. Then take the top x thousand. (PLINQ is brilliant for ripping through large tasks like this). If needed, rinse and repeat until you have enough for your needs.
On a decent machine it will just about take you longer to write this simple function than it will take to run it. I would also query why you have 400K entries to test when you state you actually need "tens of thousands"?
The trick here is that you only need ten thousand unique numbers. Theoretically you could have 9^9 = 387,420,489 (almost 3.9E+08) possibilities, but why care if you need so many fewer?
Once you realize that you can cut down on the combinations that much then creating enough unique numbers is easy:
long[] numbers = { 1, 3, 5, 7, 9 }; //note that we just take a few digits, enough to create the number of combinations we might need
var list = (from i0 in numbers
            from i1 in numbers
            from i2 in numbers
            from i3 in numbers
            from i4 in numbers
            from i5 in numbers
            from i6 in numbers
            from i7 in numbers
            from i8 in numbers
            // nine digit positions, so 5^9 = 1,953,125 combinations
            select i0 + i1 * 10 + i2 * 100 + i3 * 1000 + i4 * 10000 + i5 * 100000 + i6 * 1000000 + i7 * 10000000 + i8 * 100000000).ToList();
This snippet creates a list of more than 1,000,000 valid unique numbers pretty much instantly.
Try avoiding checks making sure that you always pick up a unique number:
static char[] base9 = "123456789".ToCharArray();
static string ConvertToBase9(int value) {
int num = 9;
char[] result = new char[9];
for (int i = 8; i >= 0; --i) {
result[i] = base9[value % num];
value = value / num;
}
return new string(result);
}
public static void generateIdentifiers(int quantity) {
var uniqueIdentifiers = new List<string>(quantity);
// we have 387420489 (9^9) possible numbers of 9 digits in base 9.
// if we choose a number that is prime to that we can easily get always
// unique numbers
Random random = new Random();
int inc = 386000000;
int seed = random.Next(0, 387420489);
while (uniqueIdentifiers.Count < quantity) {
uniqueIdentifiers.Add(ConvertToBase9(seed));
seed += inc;
seed %= 387420489;
}
}
I'll try to explain the idea behind with small numbers...
Suppose you have at most 7 possible combinations. We choose a number that is prime to 7, e.g. 3, and a random starting number, e.g. 4.
At each round, we add 3 to our current number, and then we take the result modulo 7, so we get this sequence:
4 -> (4 + 3) % 7 = 0
0 -> (0 + 3) % 7 = 3
3 -> (3 + 3) % 7 = 6
6 -> (6 + 3) % 7 = 2
In this way, we generate all the values from 0 to 6 in a non-consecutive way. In my example, we are doing the same, but we have 9^9 possible combinations, and as a number prime to that I choose 386000000 (you just have to avoid multiples of 3).
Then, I pick up the number in the sequence and I convert it to base 9.
I hope this is clear :)
I tested it on my machine, and generating 400k unique values took ~ 1 second.
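To make the small example concrete, here is a sketch that walks a modulus with a fixed step and yields every value exactly once, provided the step and the modulus share no common factor. The class and method names are hypothetical; the arithmetic is the same as in generateIdentifiers above, only with the modulus, step and start passed in.

```csharp
using System.Collections.Generic;

static class CoprimeStepper
{
    // Yields `modulus` values starting at `start`, adding `step` each time (mod modulus).
    // When step and modulus are coprime, every value in [0, modulus) appears exactly once.
    public static IEnumerable<int> Walk(int modulus, int step, int start)
    {
        int current = start % modulus;
        for (int i = 0; i < modulus; i++)
        {
            yield return current;
            current = (int)(((long)current + step) % modulus); // long avoids int overflow
        }
    }
}
```

CoprimeStepper.Walk(7, 3, 4) yields 4, 0, 3, 6, 2, 5, 1, which covers all seven values before the sequence would repeat.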
Maybe this will be faster:
//we can generate a first number which, in base 9, will be between 88888888 and 888888888
//we can't start from zero because that would cause a large number of 1 digits at the beginning
int randNumber = random.Next((int)Math.Pow(9, 8) - 1, (int)Math.Pow(9, 9));
//now we convert our number to base 9, but we add 1 to each digit
StringBuilder builder = new StringBuilder();
for (int i=(int)Math.Pow(9,8); i>0;i= i/9)
{
builder.Append(randNumber / i +1);
randNumber = randNumber % i;
}
id = builder.ToString();
Looking at the solutions already posted, mine seems fairly basic. But it works, and generates 1 million values in approximately 1s (10 million in 11s).
public static void generateIdentifiers(int quantity)
{
HashSet<int> uniqueIdentifiers = new HashSet<int>();
while (uniqueIdentifiers.Count < quantity)
{
int value = random.Next(111111111, 999999999);
if (!value.ToString().Contains('0') && !uniqueIdentifiers.Contains(value))
uniqueIdentifiers.Add(value);
}
}
Use a string array or StringBuilder while working with string additions.
Moreover, your code is not efficient because, after generating many IDs, your list may already hold a newly generated ID, so the while loop will run more often than you need.
Use for loops and generate your IDs from the loop without randomizing. If random IDs are required, again use for loops to generate more than you need over a given interval, and select randomly from this list as many as you need.
Use the code below to have a static list and fill it when your program starts. I will add a second piece of code later to generate the random ID list. [I'm a little busy]
public static Random RANDOM = new Random();
public static List<int> randomNumbers = new List<int>();
public static List<string> randomStrings = new List<string>();
private void fillRandomNumbers()
{
    int i = 100;
    while (i < 1000)
    {
        if (i.ToString().Contains('0') == false)
        {
            randomNumbers.Add(i);
        }
        i++; // advance the counter; without this the loop never terminates
    }
}
I think the first thing would be to use StringBuilder instead of concatenation - you'll be pleasantly surprised.
Another thing - use a more efficient data structure, for example HashSet<> or HashTable.
If you could drop the quite odd requirement not to have zeros, then you could of course use just one random operation, and then format your resulting number the way you want.
I think @slugster is broadly right - although you could run two parallel processes, one to generate numbers, the other to verify them and add them to the list of accepted numbers when verified. Once you have enough, signal the original process to stop.
Combine this with other suggestions - using more efficient and appropriate data structures - and you should have something that works acceptably.
However the question of why you need such numbers is also significant - this requirement seems like one that should be analysed.
Something like this?
public List<string> generateIdentifiers2(int quantity)
{
var uniqueIdentifiers = new List<string>(quantity);
while (uniqueIdentifiers.Count < quantity)
{
var sb = new StringBuilder();
sb.Append(random.Next(11, 100));
sb.Append(" ");
sb.Append(random.Next(11, 100));
sb.Append(" ");
sb.Append(random.Next(11, 100));
var id = sb.ToString();
id = new string(id.ToList().ConvertAll(x => x == '0' ? char.Parse(random.Next(1, 10).ToString()) : x).ToArray());
if (!uniqueIdentifiers.Contains(id))
{
uniqueIdentifiers.Add(id);
}
}
return uniqueIdentifiers;
}

SQL huge selection of IDs - How to make it faster?

I have an array with a huge number of IDs that I would like to select from the DB.
The usual approach would be to do select blabla from xxx where yyy IN (ids) OPTION (RECOMPILE).
(The option recompile is needed, because SQL Server is not intelligent enough to see that putting this query in its query cache is a huge waste of memory.)
However, SQL Server is horrible at this type of query when the number of IDs is high; the parser that it uses is simply too slow.
Let me give an example:
SELECT * FROM table WHERE id IN (288525, 288528, 288529,<about 5000 ids>, 403043, 403044) OPTION (RECOMPILE)
Time to execute: ~1100 msec (This returns appx 200 rows in my example)
Versus:
SELECT * FROM table WHERE id BETWEEN 288525 AND 403044 OPTION (RECOMPILE)
Time to execute: ~80 msec (This returns appx 50000 rows in my example)
So even though I get 250 times more data back, it executes 14 times faster...
So I built this function to take my list of ids and build something that will return a reasonable compromise between the two (something that doesn't return 250 times as much data, yet still gives the benefit of parsing the query faster)
private const int MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH = 5;
public static string MassIdSelectionStringBuilder(
List<int> keys, ref int startindex, string colname)
{
const int maxlength = 63000;
if (keys.Count - startindex == 1)
{
string idstring = String.Format("{0} = {1}", colname, keys[startindex]);
startindex++;
return idstring;
}
StringBuilder sb = new StringBuilder(maxlength + 1000);
List<int> individualkeys = new List<int>(256);
int min = keys[startindex++];
int max = min;
sb.Append("(");
const string betweenAnd = "{0} BETWEEN {1} AND {2}\n";
for (; startindex < keys.Count && sb.Length + individualkeys.Count * 8 < maxlength; startindex++)
{
int key = keys[startindex];
if (key > max+MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH)
{
if (min == max)
individualkeys.Add(min);
else
{
if(sb.Length > 2)
sb.Append(" OR ");
sb.AppendFormat(betweenAnd, colname, min, max);
}
min = max = key;
}
else
{
max = key;
}
}
if (min == max)
individualkeys.Add(min);
else
{
if (sb.Length > 2)
sb.Append(" OR ");
sb.AppendFormat(betweenAnd, colname, min, max);
}
if (individualkeys.Count > 0)
{
if (sb.Length > 2)
sb.Append(" OR ");
string[] individualkeysstr = new string[individualkeys.Count];
for (int i = 0; i < individualkeys.Count; i++)
individualkeysstr[i] = individualkeys[i].ToString();
sb.AppendFormat("{0} IN ({1})", colname, String.Join(",",individualkeysstr));
}
sb.Append(")");
return sb.ToString();
}
It is then used like this:
List<int> keys; //Sort and make unique
...
for (int i = 0; i < keys.Count;)
{
string idstring = MassIdSelectionStringBuilder(keys, ref i, "id");
string sqlstring = string.Format("SELECT * FROM table WHERE {0} OPTION (RECOMPILE)", idstring);
However, my question is...
Does anyone know of a better/faster/smarter way to do this?
In my experience the fastest way was to pack numbers in binary format into an image. I was sending up to 100K IDs, which works just fine:
Mimicking a table variable parameter with an image
Yet it was a while ago. The following articles by Erland Sommarskog are up to date:
Arrays and Lists in SQL Server
If the list of IDs were in another table that was indexed, this would execute a whole lot faster using a simple INNER JOIN.
If that isn't possible, then try creating a TABLE variable like so:
DECLARE @tTable TABLE
(
    Id int
)
Store the IDs in the table variable first, then INNER JOIN to your table xxx. I have had limited success with this method, but it's worth a try.
You're using (key > max+MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH) as the check to determine whether to do a range fetch instead of an individual fetch. It appears that's not the best way to do that.
Let's consider the 4 ID sequences {2, 7}, {2, 8}, {1, 2, 7}, and {1, 2, 8}.
They translate into
ID BETWEEN 2 AND 7
ID IN (2, 8)
ID BETWEEN 1 AND 7
ID BETWEEN 1 AND 2 OR ID in (8)
The decision to fetch and filter the IDs 3-6 now depends only on the difference between 2 and 7/8. However, it does not take into account whether 2 is already part of a range or an individual ID.
I think the proper criterion is how many individual IDs you save. Converting two individuals into a range has a net benefit of 2 * Cost(individual) - Cost(range), whereas extending a range has a net benefit of Cost(individual) - Cost(range extension).
Adding recompile is not a good idea. Precompiling means SQL Server does not save your query results, but it does save the execution plan, thereby trying to make the query faster. If you add recompile, it will always have the overhead of compiling the query. Try creating a stored procedure, saving the query in it and calling it from there, as stored procedures are always precompiled.
Another dirty idea similar to Neils,
Have a indexed view which holds the IDs alone based on your business condition
And you can join the view with your actual table and get the desired result.
The efficient way to do this is to:
Create a temporary table to hold the IDs
Call a SQL stored procedure with a string parameter holding all the comma-separated IDs
The SQL stored procedure uses a loop with CHARINDEX() to find each comma, then SUBSTRING to extract the string between two commas and CONVERT to make it an int, and use INSERT INTO #Temporary VALUES ... to insert it into the temporary table
INNER JOIN the temporary table or use it in an IN (SELECT ID from #Temporary) subquery
Every one of these steps is extremely fast because a single string is passed, no compilation is done during the loop, and no substrings are created except the actual id values.
No recompilation is done at all when this is executed as long as the large string is passed as a parameter.
Note that in the loop you must track the prior and current comma in two separate values.
Off the cuff here - does incorporating a derived table help performance at all? I am not set up to test this fully, just wonder if this would optimize to use between and then filter the unneeded rows out:
Select * from
( SELECT *
FROM dbo.table
WHERE ID between <lowerbound> and <upperbound>) as range
where ID in (
1206,
1207,
1208,
1209,
1210,
1211,
1212,
1213,
1214,
1215,
1216,
1217,
1218,
1219,
1220,
1221,
1222,
1223,
1224,
1225,
1226,
1227,
1228,
<...>,
1230,
1231
)
