I have been struggling with deleting massive quantities of old data from a database. Each of 5 different tables has as many as 50M rows that need to be deleted. No single delete statement could handle that quantity of data, so I have to loop through, deleting a few at a time. My question is whether there is any noticeable performance increase in looping within a stored procedure instead of looping in the application code. Now for the specifics: I am using DB2 (9.7 CE) and coding in C#. For my stored procedure I use:
--#SET TERMINATOR ;
DROP PROCEDURE myschema.purge_orders_before;
--#SET TERMINATOR #
CREATE PROCEDURE myschema.purge_orders_before (IN before_date TIMESTAMP)
DYNAMIC RESULT SETS 1
P1: BEGIN
    DECLARE no_data SMALLINT DEFAULT 0;
    DECLARE deadlock_encountered SMALLINT DEFAULT 0;
    DECLARE deadlock_condition CONDITION FOR SQLSTATE '40001';
    DECLARE CONTINUE HANDLER FOR NOT FOUND
        SET no_data = 1;
    -- The deadlock_encountered attribute is throw-away,
    -- but a continue handler needs to do something,
    -- i.e., it's not enough to just declare a handler,
    -- it has to have an action in its body.
    DECLARE CONTINUE HANDLER FOR deadlock_condition
        SET deadlock_encountered = 1;
    WHILE (no_data = 0) DO
        DELETE FROM
            (SELECT 1 FROM myschema.orders WHERE date < before_date FETCH FIRST 100 ROWS ONLY);
        COMMIT;
    END WHILE;
END P1
#
--#SET TERMINATOR ;
The approach was unceremoniously lifted from this thread. My programmatic approach is as follows:
public static void PurgeOrdersBefore( DateTime date ) {
    using ( OleDbConnection connection = DatabaseUtil.GetInstance( ).GetConnection( ) ) {
        connection.Open( );
        OleDbCommand command = new OleDbCommand( deleteOrdersBefore, connection );
        command.Parameters.Add( "#Date", OleDbType.DBTimeStamp ).Value = date;
        int rows = 0;
        int loopRows = 0;
        int loopIterations = 0;
        log.Info( "starting PurgeOrdersBefore loop" );
        while ( true ) {
            command.Transaction = connection.BeginTransaction( );
            loopRows = command.ExecuteNonQuery( );
            command.Transaction.Commit( );
            if ( loopRows <= 0 ) {
                break;
            }
            if ( log.IsDebugEnabled ) log.Debug( "purged " + loopRows + " in loop iteration " + loopIterations );
            loopIterations++;
            rows += loopRows;
        }
        if ( log.IsInfoEnabled ) log.Info( "purged " + rows + " orders in " + loopIterations + " loop iterations" );
    }
}
I performed a VERY primitive test in which I printed a timestamp at the start and finish and broke out of the loop after 10,000 rows in each. The outcome of that test was that the stored procedure took slightly over 6 minutes to delete 10,000 rows and the programmatic approach took just under 5 minutes. Being as primitive as it was, I imagine the only conclusion I can draw is that there is likely to be very minimal difference in practice, and keeping the loop in the C# code allows for much more dynamic monitoring.
All that said, does anyone else have any input on the subject? Could you explain what kind of hidden benefits I might receive were I to use the stored procedure approach? In particular, if Serge Rielau keeps an eye on this site, I would love to hear what you have to say (it seems he is the ninja all the others refer to when it comes to DB2 nonsense like this...)
-------------- Edit ---------------------
How about an export of some sort followed by a LOAD REPLACE? Has anyone done that before? Is there an example that I could follow? What implications would that have?
If the number of records to delete is a large fraction of the total, it can be cheaper to copy the good records into a temporary table, empty the original table, and copy the temp table back. The optimal way to do this is not consistent across RDBMSes; for example, some support TRUNCATE and others do not.
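To make that concrete, here is a rough, untested C# sketch of the copy-good-records idea in the same OleDb style as the question. The scratch table myschema.orders_keep is hypothetical, and the TRUNCATE form is an assumption (DB2 9.7 appears to spell it TRUNCATE TABLE ... IMMEDIATE; other products differ), so treat it as an outline rather than working code:
// Rough, untested sketch of the "copy the keepers, empty, reload" idea.
// myschema.orders_keep is a hypothetical scratch table with the same columns as myschema.orders.
public static void PurgeByCopy( DateTime before ) {
    using ( OleDbConnection connection = DatabaseUtil.GetInstance( ).GetConnection( ) ) {
        connection.Open( );

        // 1. Copy the rows we want to KEEP into the scratch table.
        OleDbCommand keep = new OleDbCommand(
            "INSERT INTO myschema.orders_keep SELECT * FROM myschema.orders WHERE date >= ?",
            connection );
        keep.Parameters.Add( "#Date", OleDbType.DBTimeStamp ).Value = before;
        keep.ExecuteNonQuery( );

        // 2. Empty the original table. TRUNCATE avoids logging every row, but its
        //    syntax and availability vary by RDBMS (DB2 9.7: TRUNCATE TABLE ... IMMEDIATE).
        new OleDbCommand( "TRUNCATE TABLE myschema.orders IMMEDIATE", connection ).ExecuteNonQuery( );

        // 3. Move the kept rows back and empty the scratch table again.
        new OleDbCommand( "INSERT INTO myschema.orders SELECT * FROM myschema.orders_keep", connection ).ExecuteNonQuery( );
        new OleDbCommand( "DELETE FROM myschema.orders_keep", connection ).ExecuteNonQuery( );
    }
}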
Try using the TOP clause (this is SQL Server syntax; on DB2 the equivalent is the FETCH FIRST n ROWS ONLY form already used in your procedure). I assume that you have problems with the size of the log file, which is why you can't just use a single DELETE FROM table command.
So you could write your query like so:
DELETE TOP (10000)
FROM myschema.orders
WHERE date < before_date
Then loop over this command until the number of rows deleted is 0.
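In C#, that loop might look like the following sketch (assuming SQL Server, an open SqlConnection named connection, and a beforeDate cut-off; all names are placeholders, not code from the question):
// Illustration only: repeat the batched DELETE until no rows remain.
SqlCommand delete = new SqlCommand(
    "DELETE TOP (10000) FROM myschema.orders WHERE date < @before", connection );
delete.Parameters.Add( "@before", SqlDbType.DateTime ).Value = beforeDate;

int affected;
do {
    affected = delete.ExecuteNonQuery( );   // rows actually deleted in this pass
} while ( affected > 0 );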
Related
I have a table of students in SQL Server and I want to execute a query like
SELECT *
FROM tbl_students;
but I don't want to write each column index, GetValue(0), GetValue(1), etc., in C# to get the result. Currently I write the following statement:
Console.WriteLine("{0},\t{1}", sqlDReader.GetValue(0), sqlDReader.GetValue(1));
I just want to get all the column values without writing each column index number. Can't I simply get a string of the complete record, preferably with spaces or tabs in between?
You can. Assuming you are using C#, something like this would work.
If you know the number of columns:
string completeLine = "";
for(int i = 0 ; i < numCols ; i++)
{
completeLine += sqlDReader.GetValue(0).ToString();
if (i < numCols - 1)
completeLine += " ";
}
Console.WriteLine (completeLine);
This assumes all columns can be converted to a string, and that you know the number of columns. There are a bunch of more complex ways to do this.
To get the number of columns you can use sqlDReader.FieldCount.
Then have an outer loop for each row. Note: do not try to do this type of string concatenation over the entire table, as it will be slow (strings are immutable, so every += allocates a new one).
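If you do loop over many rows, a StringBuilder avoids the repeated allocations. A small sketch of my own (not part of the original answer), using the reader from the question and FieldCount so the column count is not hard-coded:
// Print every row tab-separated. Requires .NET 4+ for StringBuilder.Clear().
var sb = new System.Text.StringBuilder();
while (sqlDReader.Read())
{
    sb.Clear();
    for (int i = 0; i < sqlDReader.FieldCount; i++)
    {
        if (i > 0) sb.Append('\t');
        sb.Append(sqlDReader.GetValue(i));   // GetValue returns object; Append calls ToString
    }
    Console.WriteLine(sb.ToString());
}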
I apologize if the question is confusing as I am not really sure how to word this concept.
Currently I am doing something along the lines of the following MySQL statements. However, I am migrating this to be handled in C#: I plan to work with the data directly and insert the records into the database afterwards, instead of inserting first and then applying the following approach:
$db->exec('UPDATE `' . date('Y-m',time() - self::DAYS_TO_MERGE) . '` SET `Cost`=0, `Location`=\'Flat Rate World\' WHERE `Cost` IS NULL AND `Caller` IN (' . $FlatRateWO. ') AND SUBSTR(`Dialed`,1,7) IN (\'0114021\',\'0117095\');');
$db->exec('UPDATE `' . date('Y-m',time() - self::DAYS_TO_MERGE) . '` SET `Cost`=0, `Location`=\'Flat Rate World\' WHERE `Cost` IS NULL AND `Caller` IN (' . $FlatRateWO. ') AND SUBSTR(`Dialed`,1,6) IN (\'011420\',\'011420\',\'011852\',\'011353\',\'011353\',\'011972\',\'011972\',\'011379\',\'011379\',\'011351\',\'011351\',\'011886\');');
$db->exec('UPDATE `' . date('Y-m',time() - self::DAYS_TO_MERGE) . '` SET `Cost`=0, `Location`=\'Flat Rate World\' WHERE `Cost` IS NULL AND `Caller` IN (' . $FlatRateWO. ') AND SUBSTR(`Dialed`,1,5) IN (\'01154\',\'01154\',\'01161\',\'01161\',\'01143\',\'01143\',\'01132\',\'01132\',\'01186\',\'01186\',\'01145\',\'01145\',\'01133\',\'01133\',\'01149\',\'01149\',\'01130\',\'01130\',\'01136\',\'01136\',\'01131\',\'01131\',\'01147\',\'01148\',\'01148\',\'01182\',\'01182\',\'01165\',\'01165\',\'01134\',\'01134\',\'01141\',\'01141\',\'01146\',\'01146\',\'01166\',\'01166\',\'01144\');');
$db->exec('UPDATE `' . date('Y-m',time() - self::DAYS_TO_MERGE) . '` SET `Cost`=0, `Location`=\'Flat Rate World\' WHERE `Cost` IS NULL AND `Caller` IN (' . $FlatRateWO. ') AND SUBSTR(`Dialed`,1,4) IN (\'1787\');');
The above PHP code executes the queries in sequence based on the length of the starting digits, longest group first. That is, 0114021, being 7 digits long, gets processed before 011420, which is 6 digits long. This prevents cases where 0111234 has a different price to set than 011123.
This process works 100%, but it is very slow (around 0.63s/query on average over 100,000 records). The actual values come from a CSV file which I must pre-process and then insert into the database, so if I can do the above processing and calculations on the records before inserting them, I imagine that would save a lot of time.
The following is the above array converted to C#:
World = new List<string>() { "0114021", "0117095", "011420", "011852", "011353", "011972", "011972", "011379", "011351", "011886", "01154", "01161", "01143", "01132", "01186", "01145", "01133", "01149", "01130", "01136", "01131", "01147", "01148", "01182", "01165", "01134", "01141", "01146", "01166", "01144", "01135", "1787" };
What I would like to know is how I can accomplish this same task as efficiently as possible: comparing, for example, the following numbers to see if they start with anything in World, keeping in mind that I want the longest match returned first.
011353123456277 ... should match 011353
011351334478399 ... should match 01135
011326717788726 ... should match nothing -- not found.
I just tried the following code, with no success:
if ( World.All( s => "01197236718876321".Contains( s ) ) ) {
MessageBox.Show( "found" );
}
and
if ( World.All( s => s.Contains("01197236718876321") ) ) {
MessageBox.Show( "found" );
}
Using the example found here > Using C# to check if string contains a string in string array
The first example there uses nested foreach loops, which I would like to avoid. The LINQ example looks good, but I believe that question is the reverse of what I am trying to do.
The following code seems to work; however, I am not sure whether it respects the order of the items in the array. It seems to, but I would like confirmation, as I have no idea how to 'watch' what happens inside LINQ's magic:
string foundas = "";
string number = "01197236718876321";
if(World.Any(
b => {
if(number.StartsWith(b)) {
foundas = b;
return true;
} else {
return false;
}
}
) ) {
MessageBox.Show( foundas );
}
Aside
I will have a follow-up to this question, as the next part is a bit more complex: I grab groups of rates (about 10,000), also ordered by the length of the group, but they have a 'cost' field which I am currently calculating on.
I would check for all hits with StartsWith and then simply take the longest string in the result (via an aggregation). There might be something simpler than Aggregate.
var hit = World.Where( s => source.StartsWith(s) )
               .Aggregate( string.Empty, (max, cur) => max.Length > cur.Length ? max : cur );
if ( !string.IsNullOrEmpty(hit) )
    MessageBox.Show( "found " );
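As a possibly simpler alternative to the Aggregate call (my own untested sketch; source is the number being checked, as above), ordering the matches by length and taking the first also returns the longest prefix, and null when nothing matches:
// Longest matching prefix, or null when nothing in World matches "source".
string hit = World.Where( s => source.StartsWith(s) )
                  .OrderByDescending( s => s.Length )
                  .FirstOrDefault( );
if ( hit != null )
    MessageBox.Show( hit );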
I want to update my database, but I think my code takes a lot of time doing it: about 20 seconds or more per update. Is it possible to make it faster? If so, please help me.
This is my code:
for (int i = 0; i < listView1.Items.Count; i++)
{
    if (listView1.Items[i].SubItems[13].Text.ToString() == ("ACTIVE") || listView1.Items[i].SubItems[13].Text.ToString() == ("Active"))
    {
        for (int x = 0; x < listView1.Items[i].SubItems.Count; x++)
        {
            string a = listView1.Items[i].SubItems[7].Text;
            TimeSpan time = Convert.ToDateTime(DateTime.Now.ToString("MMMM dd, yyyy")) - Convert.ToDateTime(a.ToString());
            int days = (int)time.TotalDays;
            listView1.Items[i].SubItems[11].Text = days.ToString() + " day(s)";
            Class1.ConnectToDB();
            Class1.sqlStatement = "Update tblhd set aging = '" + days.ToString() + " day(s)" + "'";
            Class1.dbcommand = new SqlCommand(Class1.sqlStatement, Class1.dbconnection);
            Class1.dbcommand.ExecuteReader();
        }
    }
}
It seems that you could do it with a single update statement:
UPDATE tblhd SET aging = CAST(DATEDIFF(day, DateField, GETDATE()) AS varchar(10)) + ' day(s)' WHERE ItemId=...
But it's generally not a good idea to store user-friendly labels like 'day(s)' in the database.
Actually, it is hard to say what your SQL request is supposed to do.
- Why are you using a database?
- What are you storing there?
- Why are you inserting a 'day(s)' string into the database instead of an integer day value?
- Why are you updating ALL rows every time?
- Why are you updating (and overwriting) the same rows every time?
Please describe your model and scenario, so we understand how you want it to work and can help you.
For your information: right now your algorithm sets every row's aging value to the days value of the last ListView row. It overwrites previously stored and recently updated data, and thus this for loop is effectively useless.
Each iteration of your for loop makes a call to the DB, which is not an efficient way to do this. You can create a stored procedure that makes a single call to your DB.
- Do not open the connection multiple times.
- Use a using statement for connection creation: using (SqlConnection connection = Class1.ConnectToDB())
- Use SQL with parameters or stored procedures.
- Try to convert the day string into an int so that you do not have to convert it every time.
- Use ExecuteNonQuery instead of ExecuteReader (a sketch combining these points follows).
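A hedged sketch of what the loop could collapse into, assuming Class1.ConnectToDB() returns an open SqlConnection (in the original it seems to set a static field instead), that aging is changed to an integer column as suggested above, and that the date and status live in hypothetical DateField and Status columns:
// Illustration only: one parameterized, set-based UPDATE replaces the per-sub-item round trips.
using (SqlConnection connection = Class1.ConnectToDB())
using (SqlCommand command = new SqlCommand(
    "UPDATE tblhd SET aging = DATEDIFF(day, DateField, GETDATE()) WHERE Status = @status",
    connection))
{
    command.Parameters.Add("@status", SqlDbType.NVarChar, 20).Value = "ACTIVE";
    int updated = command.ExecuteNonQuery();   // an UPDATE needs no data reader
}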
StringBuilder query = new StringBuilder();
query.Append("CREATE TABLE #Codes (Code nvarchar(100) collate database_default ) ");
query.Append("Insert into #Codes (Code) ");
int lengthOfCodesArray = targetCodes.Length;
for (int index = 0; index < lengthOfCodesArray; index++)
{
    string targetCode = targetCodes[index];
    query.Append("Select N'" + targetCode + "' ");
    if (index != lengthOfCodesArray - 1)
    {
        query.Append("Union All ");
    }
}
query.Append("drop table #Codes ");
On cmd.ExecuteReader() I get:
There is insufficient system memory to run this query when creating temporary table
But the weird thing is that with 25k codes it is OK, while with 5k I get this error.
Initial size is 262 MB.
The length of each code is 15 on average.
This produces one giant statement, and of course it fails eventually.
You should do your INSERTs one at a time (no UNION ALL), at least until it's time to optimize.
I have a feeling that your ultimate answer is going to involve BULK INSERT, but I don't know enough about your application to be sure.
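If it does come to bulk loading, here is a hedged sketch from the C# side using SqlBulkCopy (my addition, not the answerer's). It assumes a permanent staging table dbo.Codes with a Code column exists (a local #temp table is only visible to the session that created it), and that connection is an open SqlConnection:
// Load targetCodes into a staging table with SqlBulkCopy
// instead of building one giant INSERT ... UNION ALL statement.
DataTable codes = new DataTable();
codes.Columns.Add("Code", typeof(string));
foreach (string code in targetCodes)
    codes.Rows.Add(code);

using (SqlBulkCopy bulk = new SqlBulkCopy(connection))
{
    bulk.DestinationTableName = "dbo.Codes";
    bulk.WriteToServer(codes);
}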
I have an array with a huge number of IDs that I would like to select out of the DB.
The usual approach would be to do select blabla from xxx where yyy IN (ids) OPTION (RECOMPILE).
(The OPTION (RECOMPILE) is needed because SQL Server is not intelligent enough to see that putting this query in its plan cache is a huge waste of memory.)
However, SQL Server is horrible at this type of query when the number of IDs is high; the parser it uses is simply too slow.
Let me give an example:
SELECT * FROM table WHERE id IN (288525, 288528, 288529,<about 5000 ids>, 403043, 403044) OPTION (RECOMPILE)
Time to execute: ~1100 msec (This returns appx 200 rows in my example)
Versus:
SELECT * FROM table WHERE id BETWEEN 288525 AND 403044 OPTION (RECOMPILE)
Time to execute: ~80 msec (This returns appx 50000 rows in my example)
So even though I get 250 times more data back, it executes 14 times faster...
So I built this function to take my list of ids and build something that will return a reasonable compromise between the two (something that doesn't return 250 times as much data, yet still gives the benefit of parsing the query faster)
private const int MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH = 5;

public static string MassIdSelectionStringBuilder(
    List<int> keys, ref int startindex, string colname)
{
    const int maxlength = 63000;
    if (keys.Count - startindex == 1)
    {
        string idstring = String.Format("{0} = {1}", colname, keys[startindex]);
        startindex++;
        return idstring;
    }
    StringBuilder sb = new StringBuilder(maxlength + 1000);
    List<int> individualkeys = new List<int>(256);
    int min = keys[startindex++];
    int max = min;
    sb.Append("(");
    const string betweenAnd = "{0} BETWEEN {1} AND {2}\n";
    for (; startindex < keys.Count && sb.Length + individualkeys.Count * 8 < maxlength; startindex++)
    {
        int key = keys[startindex];
        if (key > max + MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH)
        {
            if (min == max)
                individualkeys.Add(min);
            else
            {
                if (sb.Length > 2)
                    sb.Append(" OR ");
                sb.AppendFormat(betweenAnd, colname, min, max);
            }
            min = max = key;
        }
        else
        {
            max = key;
        }
    }
    if (min == max)
        individualkeys.Add(min);
    else
    {
        if (sb.Length > 2)
            sb.Append(" OR ");
        sb.AppendFormat(betweenAnd, colname, min, max);
    }
    if (individualkeys.Count > 0)
    {
        if (sb.Length > 2)
            sb.Append(" OR ");
        string[] individualkeysstr = new string[individualkeys.Count];
        for (int i = 0; i < individualkeys.Count; i++)
            individualkeysstr[i] = individualkeys[i].ToString();
        sb.AppendFormat("{0} IN ({1})", colname, String.Join(",", individualkeysstr));
    }
    sb.Append(")");
    return sb.ToString();
}
It is then used like this:
List<int> keys; //Sort and make unique
...
for (int i = 0; i < keys.Count;)
{
    string idstring = MassIdSelectionStringBuilder(keys, ref i, "id");
    string sqlstring = string.Format("SELECT * FROM table WHERE {0} OPTION (RECOMPILE)", idstring);
    // ... execute sqlstring and read the results ...
}
However, my question is...
Does anyone know of a better/faster/smarter way to do this?
In my experience the fastest way was to pack numbers in binary format into an image. I was sending up to 100K IDs, which works just fine:
Mimicking a table variable parameter with an image
Yet it was a while ago. The following articles by Erland Sommarskog are up to date:
Arrays and Lists in SQL Server
If the list of IDs were in another table that was indexed, this would execute a whole lot faster using a simple INNER JOIN.
If that isn't possible, then try creating a TABLE variable like so:
DECLARE @tTable TABLE
(
    Id int
)
Store the IDs in the table variable first, then INNER JOIN to your table xxx. I have had limited success with this method, but it's worth a try.
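A related option (my addition, not part of this answer) on SQL Server 2008 and later is a table-valued parameter, so the IDs never have to be rendered into SQL text at all. This sketch assumes a user-defined table type dbo.IdList with a single int column named Id already exists on the server, and that connection is an open SqlConnection:
// Pass the ID list as a table-valued parameter and join against it.
// Assumes: CREATE TYPE dbo.IdList AS TABLE (Id int PRIMARY KEY) exists on the server.
DataTable idTable = new DataTable();
idTable.Columns.Add("Id", typeof(int));
foreach (int id in keys)
    idTable.Rows.Add(id);

using (SqlCommand cmd = new SqlCommand(
    "SELECT t.* FROM [table] t INNER JOIN @ids i ON i.Id = t.id", connection))
{
    SqlParameter p = cmd.Parameters.AddWithValue("@ids", idTable);
    p.SqlDbType = SqlDbType.Structured;
    p.TypeName = "dbo.IdList";
    using (SqlDataReader reader = cmd.ExecuteReader())
    {
        // read the rows here
    }
}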
You're using (key > max+MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH) as the check to determine whether to do a range fetch instead of an individual fetch. It appears that's not the best way to do that.
Let's consider the 4 ID sequences {2, 7}, {2, 8}, {1, 2, 7}, and {1, 2, 8}.
They translate into
ID BETWEEN 2 AND 7
ID IN (2, 8)
ID BETWEEN 1 AND 7
ID BETWEEN 1 AND 2 OR ID IN (8)
The decision to fetch and filter the IDs 3-6 now depends only on the difference between 2 and 7/8. However, it does not take into account whether 2 is already part of a range or an individual ID.
I think the proper criterion is how many individual IDs you save. Converting two individuals into a range has a net benefit of 2 * Cost(individual) - Cost(range), whereas extending an existing range has a net benefit of Cost(individual) - Cost(range extension).
Adding RECOMPILE is not a good idea. Precompiling means SQL Server does not save your query results, but it does save the execution plan, thereby trying to make the query faster. If you add RECOMPILE then it always has the overhead of compiling the query. Try creating a stored procedure that holds the query and call it from there, as stored procedures are always precompiled.
Another dirty idea, similar to Neils':
Have an indexed view which holds the IDs alone, based on your business condition, and join the view with your actual table to get the desired result.
The efficient way to do this is to:
- Create a temporary table to hold the IDs.
- Call a SQL stored procedure with a string parameter holding all the comma-separated IDs.
- In the stored procedure, use a loop with CHARINDEX() to find each comma, SUBSTRING to extract the string between two commas, CONVERT to make it an int, and INSERT INTO #Temporary VALUES ... to insert it into the temporary table.
- INNER JOIN the temporary table, or use it in an IN (SELECT ID FROM #Temporary) subquery.
Every one of these steps is extremely fast because a single string is passed, no compilation is done during the loop, and no substrings are created except the actual ID values.
No recompilation is done at all when this is executed, as long as the large string is passed as a parameter.
Note that in the loop you must track the prior and the current comma positions in two separate variables (see the sketch below).
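A hedged sketch of that approach from C# (not the answerer's actual code): the whole batch below is sent as one command, the ID list travels as a single @idList parameter, and #Temporary, [table], and the column names are placeholders. Requires .NET 4+ for this string.Join overload.
// Pass the IDs as ONE string parameter and split them server-side with
// CHARINDEX/SUBSTRING into a temp table, then join against it.
const string sql = @"
    CREATE TABLE #Temporary (ID int);
    DECLARE @s nvarchar(max), @prev int, @curr int;
    SET @s = @idList + ',';                -- trailing comma simplifies the loop
    SET @prev = 1;
    SET @curr = CHARINDEX(',', @s, 1);
    WHILE @curr > 0
    BEGIN
        INSERT INTO #Temporary VALUES (CONVERT(int, SUBSTRING(@s, @prev, @curr - @prev)));
        SET @prev = @curr + 1;             -- position just after the comma we found
        SET @curr = CHARINDEX(',', @s, @prev);
    END;
    SELECT t.* FROM [table] t INNER JOIN #Temporary tmp ON tmp.ID = t.id;";

using (SqlCommand cmd = new SqlCommand(sql, connection))   // "connection" is an open SqlConnection
{
    cmd.Parameters.Add("@idList", SqlDbType.NVarChar, -1).Value = string.Join(",", keys);
    using (SqlDataReader reader = cmd.ExecuteReader())
    {
        // read the rows here
    }
}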
Off the cuff here - does incorporating a derived table help performance at all? I am not set up to test this fully; I just wonder whether this would optimize to use BETWEEN and then filter the unneeded rows out:
Select * from
( SELECT *
FROM dbo.table
WHERE ID between <lowerbound> and <upperbound>) as range
where ID in (
1206,
1207,
1208,
1209,
1210,
1211,
1212,
1213,
1214,
1215,
1216,
1217,
1218,
1219,
1220,
1221,
1222,
1223,
1224,
1225,
1226,
1227,
1228,
<...>,
1230,
1231
)