I have an array with a huge number of IDs that I would like to select out of the DB.
The usual approach would be to do SELECT blabla FROM xxx WHERE yyy IN (ids) OPTION (RECOMPILE).
(OPTION (RECOMPILE) is needed because SQL Server is not intelligent enough to see that keeping this query in its plan cache is a huge waste of memory.)
However, SQL Server is horrible at this type of query when the number of IDs is high; the parser it uses is simply too slow.
Let me give an example:
SELECT * FROM table WHERE id IN (288525, 288528, 288529,<about 5000 ids>, 403043, 403044) OPTION (RECOMPILE)
Time to execute: ~1100 msec (this returns approx. 200 rows in my example)
Versus:
SELECT * FROM table WHERE id BETWEEN 288525 AND 403044 OPTION (RECOMPILE)
Time to execute: ~80 msec (this returns approx. 50,000 rows in my example)
So even though I get 250 times more data back, it executes 14 times faster...
So I built this function to take my list of IDs and build something that is a reasonable compromise between the two (something that doesn't return 250 times as much data, yet still gets the benefit of the query parsing faster):
private const int MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH = 5;

public static string MassIdSelectionStringBuilder(
    List<int> keys, ref int startindex, string colname)
{
    const int maxlength = 63000;

    // Single remaining key: emit a plain equality test.
    if (keys.Count - startindex == 1)
    {
        string idstring = String.Format("{0} = {1}", colname, keys[startindex]);
        startindex++;
        return idstring;
    }

    StringBuilder sb = new StringBuilder(maxlength + 1000);
    List<int> individualkeys = new List<int>(256);
    int min = keys[startindex++];
    int max = min;
    sb.Append("(");
    const string betweenAnd = "{0} BETWEEN {1} AND {2}\n";

    // Walk the sorted keys, growing a BETWEEN range while the gap to the
    // next key is small enough, and collecting lone keys for an IN list.
    for (; startindex < keys.Count && sb.Length + individualkeys.Count * 8 < maxlength; startindex++)
    {
        int key = keys[startindex];
        if (key > max + MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH)
        {
            // Gap too large: flush the current range (or lone key) and start a new one.
            if (min == max)
                individualkeys.Add(min);
            else
            {
                if (sb.Length > 2)
                    sb.Append(" OR ");
                sb.AppendFormat(betweenAnd, colname, min, max);
            }
            min = max = key;
        }
        else
        {
            max = key;
        }
    }

    // Flush the final range or lone key.
    if (min == max)
        individualkeys.Add(min);
    else
    {
        if (sb.Length > 2)
            sb.Append(" OR ");
        sb.AppendFormat(betweenAnd, colname, min, max);
    }

    // Emit the collected lone keys as a single IN list.
    if (individualkeys.Count > 0)
    {
        if (sb.Length > 2)
            sb.Append(" OR ");
        string[] individualkeysstr = new string[individualkeys.Count];
        for (int i = 0; i < individualkeys.Count; i++)
            individualkeysstr[i] = individualkeys[i].ToString();
        sb.AppendFormat("{0} IN ({1})", colname, String.Join(",", individualkeysstr));
    }

    sb.Append(")");
    return sb.ToString();
}
It is then used like this:
List<int> keys; // sorted and de-duplicated
...
for (int i = 0; i < keys.Count;)
{
    string idstring = MassIdSelectionStringBuilder(keys, ref i, "id");
    string sqlstring = string.Format("SELECT * FROM table WHERE {0} OPTION (RECOMPILE)", idstring);
    // ... execute sqlstring and process the results ...
}
However, my question is...
Does anyone know of a better/faster/smarter way to do this?
In my experience the fastest way was to pack the numbers in binary format into an image parameter. I was sending up to 100K IDs, which works just fine:
Mimicking a table variable parameter with an image
Yet it was a while ago. The following article by Erland Sommarskog is up to date:
Arrays and Lists in SQL Server
If the list of IDs were in another, indexed table, this would execute a whole lot faster using a simple INNER JOIN.
If that isn't possible, then try creating a TABLE variable like so:
DECLARE @tTable TABLE
(
    Id int
)
Store the IDs in the table variable first, then INNER JOIN to your table xxx. I have had limited success with this method, but it's worth a try.
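A minimal sketch of that, reusing xxx and yyy from the question (the multi-row VALUES form needs SQL Server 2008+):
DECLARE @tTable TABLE (Id int PRIMARY KEY);

INSERT INTO @tTable (Id)
VALUES (288525), (288528), (288529); -- ... and the rest of your ids

SELECT t.*
FROM xxx t
INNER JOIN @tTable ids ON ids.Id = t.yyy;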
You're using (key > max + MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH) as the check to determine whether to do a range fetch instead of an individual fetch. It appears that's not the best way to do that.
Let's consider the four ID sequences {2, 7}, {2, 8}, {1, 2, 7}, and {1, 2, 8}.
They translate into:
ID BETWEEN 2 AND 7
ID IN (2, 8)
ID BETWEEN 1 AND 7
ID BETWEEN 1 AND 2 OR ID IN (8)
The decision to fetch and filter the IDs 3-6 now depends only on the difference between 2 and 7/8. However, it does not take into account whether 2 is already part of a range or an individual ID.
I think the proper criterion is how many individual IDs you save. Converting two individuals into a range has a net benefit of 2 * Cost(individual) - Cost(range), whereas extending a range has a net benefit of Cost(individual) - Cost(range extension).
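To make that concrete, here is a hedged C# sketch of the suggested check; every cost constant below is an assumption you would tune, and the point is only that the threshold should differ depending on whether the key would create a new range or widen an existing one:
// Decide whether the range [rangeMin, rangeMax] should absorb 'key'.
// All costs are in arbitrary "parse effort" units (assumed values).
const double CostIndividual = 1.0;     // one lone ID in an IN list
const double CostRange = 1.5;          // a whole new BETWEEN clause
const double CostRangeExtension = 0.1; // bumping an existing upper bound
const double CostPerExtraRow = 0.05;   // fetching a row only to discard it

static bool ShouldAbsorb(int key, int rangeMin, int rangeMax)
{
    int wastedRows = key - rangeMax - 1;        // e.g. IDs 3-6 in the {2, 7} case
    double benefit = (rangeMin == rangeMax)
        ? 2 * CostIndividual - CostRange        // two individuals become a range
        : CostIndividual - CostRangeExtension;  // an existing range widens
    return benefit > wastedRows * CostPerExtraRow;
}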
Adding RECOMPILE is not a good idea. Precompiling means SQL Server does not save your query results, but it does save the execution plan, thereby trying to make the query faster. If you add RECOMPILE, it will always have the overhead of compiling the query. Try creating a stored procedure, putting the query there, and calling it from your code, as stored procedures are always precompiled.
Another dirty idea, similar to Neils's:
Have an indexed view which holds the IDs alone, based on your business condition,
and join the view with your actual table to get the desired result.
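A hedged sketch of such a view (the view name and business condition are made up; indexed views require SCHEMABINDING, two-part names, and a unique clustered index):
CREATE VIEW dbo.vCandidateIds WITH SCHEMABINDING AS
    SELECT id
    FROM dbo.xxx
    WHERE SomeBusinessFlag = 1; -- hypothetical business condition
GO
-- The unique clustered index is what materializes the view.
CREATE UNIQUE CLUSTERED INDEX IX_vCandidateIds ON dbo.vCandidateIds (id);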
The efficient way to do this is to:
Create a temporary table to hold the IDs.
Call a SQL stored procedure with a string parameter holding all the comma-separated IDs.
In the stored procedure, use a loop with CHARINDEX() to find each comma, then SUBSTRING to extract the string between two commas and CONVERT to make it an int, and use INSERT INTO #Temporary VALUES ... to insert it into the temporary table.
INNER JOIN the temporary table, or use it in an IN (SELECT ID FROM #Temporary) subquery.
Every one of these steps is extremely fast, because a single string is passed, no compilation is done during the loop, and no substrings are created except the actual ID values.
No recompilation is done at all when this is executed, as long as the large string is passed as a parameter.
Note that in the loop you must track the prior and current comma positions in two separate variables; a sketch follows.
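A hedged T-SQL sketch of that procedure (the procedure name is made up, xxx and yyy are from the question, and DECLARE with an initializer needs SQL Server 2008+; production code would also guard against empty entries):
CREATE PROCEDURE dbo.SelectByIdList (@Ids varchar(max))
AS
BEGIN
    CREATE TABLE #Temporary (Id int PRIMARY KEY);

    DECLARE @prev int = 0;                    -- position of the prior comma
    DECLARE @curr int = CHARINDEX(',', @Ids); -- position of the current comma

    WHILE @curr > 0
    BEGIN
        INSERT INTO #Temporary
        VALUES (CONVERT(int, SUBSTRING(@Ids, @prev + 1, @curr - @prev - 1)));
        SET @prev = @curr;
        SET @curr = CHARINDEX(',', @Ids, @prev + 1);
    END;

    -- the last value, after the final comma
    INSERT INTO #Temporary
    VALUES (CONVERT(int, SUBSTRING(@Ids, @prev + 1, LEN(@Ids) - @prev)));

    SELECT t.*
    FROM xxx t
    INNER JOIN #Temporary tmp ON tmp.Id = t.yyy;
END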
Off the cuff here - does incorporating a derived table help performance at all? I am not set up to test this fully; I just wonder if it would optimize to use BETWEEN and then filter out the unneeded rows:
SELECT * FROM
    (SELECT *
     FROM dbo.table
     WHERE ID BETWEEN <lowerbound> AND <upperbound>) AS range
WHERE ID IN (
    1206, 1207, 1208, 1209, 1210, 1211, 1212, 1213,
    1214, 1215, 1216, 1217, 1218, 1219, 1220, 1221,
    1222, 1223, 1224, 1225, 1226, 1227, 1228, <...>,
    1230, 1231
)
I am creating a quiz and I'm using a random number from a range of 1 - 20 numbers (primary keys):
Random r = new Random();
int rInt = r.Next(1, 9);
The numbers (primary keys) are then used in a query to select 5 random questions, but the problem is that I am getting repeated questions because the numbers repeat:
string SQL = "SELECT QuestionText,CorrectAnswer,WrongAnswer1,WrongAnswer2,WrongAnswer3 FROM Question Where QuestionID = " + rInt;
I have tried some methods to fix it but it's not working and I'm running out of ideas. Does anyone have any suggestions?
Just ask the database for it:
string SQL = @"
    SELECT TOP 5 QuestionText, CorrectAnswer, WrongAnswer1, WrongAnswer2, WrongAnswer3
    FROM Question
    ORDER BY NEWID()";
If/when you outgrow this, there exists a more optimized solution as well:
string SQL = @"
    WITH cte AS
    (
        SELECT TOP 5 QuestionId FROM Question ORDER BY NEWID()
    )
    SELECT QuestionText, CorrectAnswer, WrongAnswer1, WrongAnswer2, WrongAnswer3
    FROM cte c
    JOIN Question q
        ON q.QuestionId = c.QuestionId
    ";
The second query will perform much better (assuming QuestionId is your primary key), because it only has to read the primary key index (which will likely already be in memory), generate the GUIDs, pick the top 5 using the most efficient method, and then look up those 5 records by primary key.
The first query should work just fine for a smaller number of questions, but I believe it may cause a table scan and put some pressure on tempdb, so if your questions are varchar(max) and get very long, or you have tens of thousands of questions with a very small tempdb on some versions of SQL Server, it may not perform great.
Something like this might do the trick for you:
[ThreadStatic]
private static Random __random = null;
public int[] Get5RandomQuestions()
{
__random = __random ?? new Random(Guid.NewGuid().GetHashCode()); // approx one in 850 chance of seed collision
using (var context = new MyDBContext())
{
var questions = context.Questions.Select(x => x.Question_ID).ToArray();
return questions.OrderBy(_ => __random.Next()).Take(5).ToArray();
}
}
Another, server side approach:
private static Random _r = new Random();
...
var seed = _r.NextDouble();
using var context = new SomeContext();
var questions = context.Questions
.OrderBy(p => SqlFunctions.Checksum(p.Id * seed))
.Take(5);
Note: Checksum is not bulletproof; limitations apply. This approach should not be used to generate quiz questions in life-or-death situations.
As per request:
SqlFunctions.Checksum will essentially generate a hash and order by it
CHECKSUM([Id] * <seed>) AS [C1],
...
ORDER BY [C1] ASC
CHECKSUM (Transact-SQL)
The CHECKSUM function returns the checksum value computed over a table
row, or over an expression list. Use CHECKSUM to build hash indexes.
...
CHECKSUM computes a hash value, called the checksum, over its argument
list. Use this hash value to build hash indexes. A hash index will
result if the CHECKSUM function has column arguments, and an index is
built over the computed CHECKSUM value. This can be used for equality
searches over the columns.
Note, as mentioned before, Checksum is not bulletproof - it returns an int (take it for what it is) - however, the chance of a collision or duplicate is extremely small for smaller data sets when using it this way with a unique Id, and it's also fairly performant.
Running this many times on a production database with 10 million records, there were no collisions.
In regards to speed, it can get the top 5 in 75 ms; however, it is slower when the query is generated by EF.
The CTE solution tendered for NewId is about 125 ms.
The LINQ .Distinct() method is too nice not to use here.
The easiest way I know of doing this is shown below, using a method that creates an infinite stream of random numbers, which can then be nicely wrangled with LINQ:
using System;
using System.Collections.Generic;
using System.Linq;

IEnumerable<int> GenRandomNumbers()
{
    var random = new Random();
    while (true)
    {
        yield return random.Next(1, 21); // upper bound is exclusive, so this yields 1-20
    }
}

var numbers = GenRandomNumbers()
    .Distinct()
    .Take(5)
    .ToArray();
Though it looks like the generator method will run forever because of its closed loop, it only runs until it has generated 5 distinct numbers, because of how it yields.
Try selecting all the questions from the DB. Say you have them in a collection Questions; you could then try:
Questions.OrderBy(y => Guid.NewGuid()).ToList()
I have a table of students in SQL Server and I want to execute a query like
SELECT *
FROM tbl_students;
but I don't want to write each column index - GetValue(0), GetValue(1) - in C# to get the result.
So far I have written the following statement:
Console.WriteLine("{0},\t{1}", sqlDReader.GetValue(0), sqlDReader.GetValue(1));
I just want to get all the column values without writing each column index number. Can't we simply get a string of the complete record, preferably with spaces or tabs in between?
You can. Assuming you are using C#, something like this would work.
If you know the number of columns:
string completeLine = "";
for (int i = 0; i < numCols; i++)
{
    completeLine += sqlDReader.GetValue(i).ToString();
    if (i < numCols - 1)
        completeLine += " ";
}
Console.WriteLine(completeLine);
This assumes all columns can be cast to a string, and that you know the number of columns. There are a bunch of more complex ways to do this.
To get the number of columns you can use the reader's FieldCount property.
Then have an outer loop for each row. Note: do not try to do this type of string concatenation on the entire table, as it will be slow (since strings are really not changeable); see the sketch below.
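As a hedged sketch of that outer loop (FieldCount is the data reader's column count, and a StringBuilder avoids the slow repeated concatenation):
var sb = new StringBuilder();
while (sqlDReader.Read())                  // outer loop: one pass per row
{
    for (int i = 0; i < sqlDReader.FieldCount; i++)
    {
        if (i > 0) sb.Append('\t');
        sb.Append(sqlDReader.GetValue(i)); // relies on each value's ToString()
    }
    sb.AppendLine();
}
Console.Write(sb.ToString());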
I have two tables in my database, Ticket and TicketNumbers. I would like to write LINQ to count the number of tickets that have numbers matching those passed into this function. Since we don't know how many numbers must be matched, the LINQ has to be dynamic... to my understanding.
public int CountPartialMatchingTicket(IList<int> numbers)
{
// where's the code? =_=;
}
Say for example there are 3 Tickets in the database now and I want to count up all those that have the numbers 3 and 4.
(1) 1 2 3 4
(2) 1 3 4 6 7
(3) 1 2 3
In this case the function should return 2, since ticket (1) and (2) have the matching numbers.
In another case, if asked to match 1, 2, 3, then again we should get back 2.
Here's what those two tables in the database look like:
Ticket:
TicketId, Name, AddDateTime
TicketNumbers:
Id, TicketId, Number
I've never used Dynamic Linq before, so just in case this is what I have at the top of my cs file.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using LottoGen.Models;
using System.Linq.Dynamic;
First things first, though: I don't even know how I should write the LINQ line for a fixed amount of numbers. I suppose the SQL would be like this:
SELECT TicketId, COUNT(0) AS Expr1
FROM TicketNumber
WHERE (Number = 3) OR (Number = 4)
GROUP BY TicketId
However, this isn't what I want either. The above query would get me tickets that have either a 3 or a 4 - but I just want the tickets that have BOTH numbers. And I guess it has to be nested somehow to return a single count. If I had to use my imagination to complete the function, it would be something like this:
public int CountPartialMatchingTicket(IList<int> numbers)
{
string query = "";
foreach(int number in numbers) {
query += "Number = " + number.ToString() + " AND ";
}
// I know.. there is a trailing AND.. lazy
int count = DbContext.TicketNumbers.Where(query).Count();
return count;
}
Oh, wait a minute. There's no Dynamic LINQ there... The above is looking like something I would do in PHP, and that query statement obviously does not do anything useful. What am I doing? :(
At the end of the day, I want to output a little table to the webpage looking like this:
Ticket    Matching Tickets
--------------------------
3 4       2
Trinity, help!
public int CountPartialMatchingTicket(IList<int> numbers)
{
    var arr = numbers.ToArray();
    int count = DbContext.Tickets
        .Count(tk => arr.All(n => tk.TicketNumbers.Any(tn => tn.Number == n)));
    return count;
}
UPDATE: "If you don't mind, how would I limit this query to just the tickets made on a particular AddDateTime (the whole day)?"
The part inside the Count() method call is the WHERE condition, so just extend that:
DateTime targetDate = ......;
DateTime tooLate = targetDate.AddDays(1);
int count = DbContext.Tickets
    .Count(tk =>
        targetDate <= tk.AddDateTime && tk.AddDateTime < tooLate
        && arr.All(n => tk.TicketNumbers.Any(tn => tn.Number == n)));
This is similar to James Curran's answer, but a little simpler, and it should produce a simpler WHERE IN-based query:
// count all tickets...
return DbContext.Tickets.Count(
// where any of their ticket numbers
tk => tk.TicketNumbers.Any(
// are contained in our list of numbers
tn => numbers.Contains(tn.Number)))
On a C# application I need to create UNIQUE Promotional codes.
The promotional codes will be store in a SQL Table (SQL Server 2012).
Initially I thought of GUIDs but they are too long to give to users.
I am considering a 6-character alphanumeric code, resulting in 36^6 = 2,176,782,336 unique combinations.
What do you think?
But how do I generate the code so it is UNIQUE? I am considering:
Use random;
Use ticks of current datetime
Pre-generate all codes in the database and pick one randomly ...
This last approach has the problem of occupying too much space.
What would be a good approach to generate the random unique code?
UPDATE 1
For approach 1 I came up with the following C# code:
Random random = new Random();
Int32 count = 20;
Int32 length = 5;
List<String> codes = new List<String>();
Char[] keys = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".ToCharArray();
while (codes.Count < count) {
    var code = Enumerable.Range(1, length)
        .Select(k => keys[random.Next(keys.Length)]) // random char (Next's upper bound is exclusive)
        .Aggregate("", (e, c) => e + c);             // join into a string
    codes.Add(code);
}
UPDATE 2
String source = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
while (codes.Count < count) {
    length = 5;
    StringBuilder builder = new StringBuilder(5);
    while (length-- > 0)
        builder.Append(source[random.Next(source.Length)]);
    codes.Add(builder.ToString());
}
Which approach do you think is faster?
Thank You,
Miguel
Eric Lippert showed how to use a multiplicative inverse to obfuscate sequential keys. Basically, the idea is to generate sequential keys (i.e. 1, 2, 3, 4, etc.) and then obfuscate them.
I showed a more convoluted way to do it in my article series Obfuscating Sequential Keys.
The beauty of this approach is that the keys appear random, but all you have to keep track of is a single number: the next sequential value to be generated.
YouTube uses a similar technique to generate their video IDs.
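A minimal C# sketch of the technique (the modulus/multiplier pair below is one valid example - the multiplier is coprime to the modulus, and I have verified the two constants are inverses mod 10^9; the articles above explain how to derive your own):
// Map a sequential key to an apparently random code by multiplying
// modulo a fixed modulus; map it back with the multiplicative inverse.
const long Modulus = 1_000_000_000;  // codes stay at most 9 digits
const long Multiplier = 387_420_489; // coprime to the modulus
const long Inverse = 513_180_409;    // Multiplier * Inverse mod Modulus == 1

static long Obfuscate(long sequentialId) => sequentialId * Multiplier % Modulus;
static long Recover(long publicCode) => publicCode * Inverse % Modulus;
Obfuscate(1) yields 387420489 and Obfuscate(2) yields 774840978, so consecutive keys look unrelated, yet Recover maps each code back to its sequential key.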
Don't worry about generating a unique code. Instead . . .
Generate a random code.
Insert that code into a database. It belongs in a column that has a unique index.
Trap the error that results from trying to insert a duplicate value.
If you get a duplicate value error, generate another random code, and try again.
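A hedged sketch of that retry loop, assuming System.Data.SqlClient and a hypothetical PromoCodes table with a unique index on its Code column (2601 and 2627 are SQL Server's unique-index and unique-constraint violation errors):
using System;
using System.Data.SqlClient;

static string InsertUniqueCode(SqlConnection connection, Func<string> generate)
{
    while (true)
    {
        string code = generate(); // your random 6-character generator
        try
        {
            using (var cmd = new SqlCommand(
                "INSERT INTO PromoCodes (Code) VALUES (@code)", connection))
            {
                cmd.Parameters.AddWithValue("@code", code);
                cmd.ExecuteNonQuery();
            }
            return code;          // insert succeeded, so the code is unique
        }
        catch (SqlException ex) when (ex.Number == 2601 || ex.Number == 2627)
        {
            // duplicate value: generate another random code and try again
        }
    }
}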
I'd go for number 1.
Number 2 is not that random (you have lower and upper limits), and number 3 is overkill.
Maybe you can use something like this:
DECLARE @Values varchar(100)
SET @Values = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
SELECT SUBSTRING(@Values, CAST(FLOOR(RAND()*LEN(@Values)) + 1 AS int), 1) +
       SUBSTRING(@Values, CAST(FLOOR(RAND()*LEN(@Values)) + 1 AS int), 1) +
       SUBSTRING(@Values, CAST(FLOOR(RAND()*LEN(@Values)) + 1 AS int), 1) +
       SUBSTRING(@Values, CAST(FLOOR(RAND()*LEN(@Values)) + 1 AS int), 1) +
       SUBSTRING(@Values, CAST(FLOOR(RAND()*LEN(@Values)) + 1 AS int), 1) +
       SUBSTRING(@Values, CAST(FLOOR(RAND()*LEN(@Values)) + 1 AS int), 1)
Use a table to store previously generated codes so you can check for uniqueness.
I have been struggling with deleting massive quantities of old data from a database. Each of 5 different tables has as many as 50M rows that need to be deleted. No single DELETE statement could handle that quantity of data, so I have to loop, deleting a few at a time. My question is whether there is any noticeable performance increase in looping within a stored procedure instead of looping in the application code. Now for the specifics: I am using DB2 (9.7 CE) and coding in C#. For my stored procedure I use:
--#SET TERMINATOR ;
DROP PROCEDURE myschema.purge_orders_before;
--#SET TERMINATOR #
CREATE PROCEDURE myschema.purge_orders_before (IN before_date TIMESTAMP)
DYNAMIC RESULT SETS 1
P1: BEGIN
DECLARE no_data SMALLINT DEFAULT 0;
DECLARE deadlock_encountered SMALLINT DEFAULT 0;
DECLARE deadlock_condition CONDITION FOR SQLSTATE '40001';
DECLARE CONTINUE HANDLER FOR NOT FOUND
SET no_data = 1;
-- The deadlock_encountered attribute is throw-away,
-- but a continue handler needs to do something,
-- i.e., it's not enough to just declare a handler,
-- it has to have an action in its body.
DECLARE CONTINUE HANDLER FOR deadlock_condition
SET deadlock_encountered = 1;
WHILE (no_data = 0 ) DO
DELETE FROM
(SELECT 1 FROM myschema.orders WHERE date < before_date FETCH FIRST 100 ROWS ONLY );
COMMIT;
END WHILE;
END P1
#
--#SET TERMINATOR ;
Whose approach was unceremoniously lifted from this thread. My programmatic approach is as follows:
public static void PurgeOrdersBefore( DateTime date ) {
using ( OleDbConnection connection = DatabaseUtil.GetInstance( ).GetConnection( ) ) {
connection.Open( );
OleDbCommand command = new OleDbCommand( deleteOrdersBefore, connection );
command.Parameters.Add( "@Date", OleDbType.DBTimeStamp ).Value = date;
int rows = 0;
int loopRows = 0;
int loopIterations = 0;
log.Info( "starting PurgeOrdersBefore loop" );
while ( true ) {
command.Transaction = connection.BeginTransaction( );
loopRows = command.ExecuteNonQuery( );
command.Transaction.Commit( );
if ( loopRows <= 0 ) {
break;
}
if ( log.IsDebugEnabled ) log.Debug( "purged " + loopRows + " in loop iteration " + loopIterations );
loopIterations++;
rows += loopRows;
}
if ( log.IsInfoEnabled ) log.Info( "purged " + rows + " orders in " + loopIterations + " loop iterations" );
}
}
I performed a VERY primitive test in which I printed a timestamp at the start and finish and broke out of the loop after 10,000 rows in each. The outcome of said test was that the stored procedure took slightly over 6 minutes to delete 10,000 rows and the programmatic approach took just under 5 minutes. Being as primitive as it was, I imagine the only conclusion I can draw is that there is likely to be very minimal difference in practice, and keeping the loop in the C# code allows for much more dynamic monitoring.
All that said, does anyone else have any input on the subject? Could you explain what kind of hidden benefits I might receive were I to use the stored procedure approach? In particular, if Serge Rielau keeps an eye on this site, I would love to hear what you have to say (he seems to be the ninja all the others refer to when it comes to DB2 nonsense like this...)
-------------- Edit ---------------------
How about an export of some sort followed by a LOAD REPLACE? Has anyone done that before? Is there an example that I could follow? What implications would that have?
If the number of records to delete is a large fraction of the total, it can be cheaper to copy the good records into a temporary table, empty the original table, and copy the temp table back. The optimal way to do this is not consistent across RDBMSes; for example, some support TRUNCATE and others do not.
Try using the TOP command. I assume that you have problems with the size of the log file (which is why you can't just use a single DELETE FROM table command).
So you could write your query like so:
DELETE TOP (10000)
FROM myschema.orders
WHERE date < before_date
Then loop over this command until the number of rows deleted is 0.