I need some help optimizing the following query.
My Problem
I am trying to clean up a table based on a size parameter (delete X MB from this table). The way I thought of implementing it is: iterate over the table starting with the oldest entry, get each row's size (I'm taking only blob columns into account), do the same for every linked table's rows, and once currentSize >= size, stop the query and return the list of GUIDs found.
Please note that this is part of a bigger query, so in the end I need the list of Ids.
What I've tried
First, I tried writing it with Entity Framework, but its execution took too long and I was only halfway to finishing it, so I wrote it directly in T-SQL.
Below is what I managed to write. However, when it runs against a SQL Azure database it throws a timeout exception. I know this is due to the DTU limitation, but I'm also wondering whether the query itself can be improved. I am no SQL expert and I need your help.
Current Query
DECLARE @maxSize int = 1
DECLARE @tempTable TABLE
(
    Id uniqueidentifier,
    Size float,
    Position int
)
DECLARE @currentId uniqueidentifier
DECLARE @maxIterations int
DECLARE @index int = 1
SET @maxIterations = (SELECT COUNT(Id) FROM WhereToDelete)
WHILE(@index < @maxIterations)
BEGIN
    INSERT INTO @tempTable
    SELECT MasterJobGUID, ISNULL(DATALENGTH(BlobColumn1),0) +
                          ISNULL(DATALENGTH(BlobColumn2),0) +
                          ISNULL(DATALENGTH(BlobColumn3),0) +
                          ISNULL(DATALENGTH(BlobColumn4),0),
           @index
    FROM WhereToDelete
    ORDER BY SomeColumn
    OFFSET @index ROWS
    FETCH NEXT 1 ROWS ONLY
    SET @index = @index + 1
    SET @currentId = (SELECT TOP 1 Id FROM @tempTable ORDER BY Position DESC)
    UPDATE @tempTable
    SET Size = Size + ( SELECT SUM(ISNULL(DATALENGTH(BlobColumn),0))
                        FROM LinkedTable
                        WHERE ParentId = @currentId )
    UPDATE @tempTable
    SET Size = Size + ( SELECT ISNULL(SUM(ISNULL(DATALENGTH(OtherBlobColumn),0)),0)
                        FROM OtherLinkedTable
                        WHERE OtherLinkedTableId IN
                        (
                            SELECT OtherLinkedTableId FROM SomeTable
                            WHERE SomeTableId IN
                            (
                                SELECT SomeTableId FROM SomeOtherTable
                                WHERE ParentId = @currentId
                            )
                        ))
    IF ((SELECT SUM(Size) FROM @tempTable) >= @maxSize*1000000)
    BEGIN
        BREAK;
    END
END
SELECT Id FROM @tempTable
You could try something like this
SELECT MasterJobGUID FROM (
    SELECT [MasterJobGUID], SUM(ISNULL(DATALENGTH(BlobColumn1),0) +
                                ISNULL(DATALENGTH(BlobColumn2),0) +
                                ISNULL(DATALENGTH(BlobColumn3),0) +
                                ISNULL(DATALENGTH(BlobColumn4),0))
           OVER (ORDER BY SomeColumn ROWS UNBOUNDED PRECEDING) SizeTotal
    FROM WhereToDelete) innerQuery
WHERE [SizeTotal] < @maxSize*1000000
That's using T-SQL window functions to compute the running total size and return only the rows that fit under the limit, in a single operation. It should be a lot more efficient than the row-by-row loop.
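If the linked tables also have to count toward the limit (as in the original loop), the per-row size can be extended with OUTER APPLY before taking the same running total. A sketch, reusing the names from the question and not tested against the real schema; the OtherLinkedTable chain could be folded in with a second OUTER APPLY in the same way:
DECLARE @maxSize int = 1

SELECT MasterJobGUID FROM (
    SELECT w.MasterJobGUID,
           SUM(ISNULL(DATALENGTH(w.BlobColumn1),0) +
               ISNULL(DATALENGTH(w.BlobColumn2),0) +
               ISNULL(DATALENGTH(w.BlobColumn3),0) +
               ISNULL(DATALENGTH(w.BlobColumn4),0) +
               ISNULL(l.LinkedSize,0))
           OVER (ORDER BY w.SomeColumn ROWS UNBOUNDED PRECEDING) AS SizeTotal
    FROM WhereToDelete w
    OUTER APPLY (
        -- total blob size of this row's children in LinkedTable
        SELECT SUM(ISNULL(DATALENGTH(BlobColumn),0)) AS LinkedSize
        FROM LinkedTable
        WHERE ParentId = w.MasterJobGUID
    ) l
) innerQuery
WHERE SizeTotal < @maxSize*1000000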
Related
I am writing a multi-instance background service in which I would like to process the most recent data first. For this I select the top 10 records in a batch, mark them with an instance-specific id, and then read them back for processing.
Because this process runs in multiple instances, the same records could be picked up by more than one instance and I could get duplicate results.
To avoid this I would like each instance to pick only records that are not already held under HOLDLOCK by another instance.
My current update and select statements look like this:
ALTER PROC dbo.GetRecordsForInstance @RecordCount INT = 10, @InstanceId varchar(max)
AS
BEGIN
    UPDATE t WITH (ROWLOCK)
    SET t.ProcessingStatus = @InstanceId, t.IsProcessing = 1
    FROM SomeTable t
    JOIN
    (
        SELECT TOP (@RecordCount) Id
        FROM dbo.SomeTable WITH (ROWLOCK, HOLDLOCK)
        WHERE IsProcessing = 0 AND IsCompleted = 0
        ORDER BY LastModifiedOn
    ) t1 ON t.Id = t1.Id

    SELECT * FROM SomeTable WITH (ROWLOCK) WHERE ProcessingStatus = @InstanceId
END;
You can use the READPAST hint to skip already-locked rows.
You cannot combine it with SERIALIZABLE though (which is what HOLDLOCK gives you); you would need to drop down to REPEATABLEREAD. That is not an issue in this case, as the extra guarantee SERIALIZABLE provides only concerns new data.
Further improvements:
You can update the t1 derived table directly; there is no need to re-join. Just select all the columns you need in that inner query.
You can combine the UPDATE and SELECT using OUTPUT.
To help prevent deadlocks in this type of query, you should have an index on your table over the columns you filter and sort by (IsProcessing, IsCompleted, LastModifiedOn), preferably with INCLUDE columns as well; see the index sketch after the procedure.
CREATE OR ALTER PROC dbo.GetRecordsForInstance
    @RecordCount INT = 10,
    @InstanceId varchar(max)
AS
UPDATE t
SET
    ProcessingStatus = @InstanceId,
    IsProcessing = 1
OUTPUT inserted.*
FROM (
    SELECT TOP (@RecordCount)
        *
    FROM dbo.SomeTable t WITH (ROWLOCK, REPEATABLEREAD, READPAST)
    WHERE IsProcessing = 0 AND IsCompleted = 0
    ORDER BY LastModifiedOn
) t;
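For the index mentioned above, a minimal sketch (the index name and INCLUDE list are assumptions; include whatever columns the final SELECT actually needs):
CREATE INDEX IX_SomeTable_ProcessingQueue
    ON dbo.SomeTable (IsProcessing, IsCompleted, LastModifiedOn)
    INCLUDE (ProcessingStatus); -- assumed INCLUDE column; extend to cover the queried columns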
I have a LINQ query (against in-memory objects) which uses a local dateTimeLast variable to keep state:
IEnumerable<CacheEntry> entries = await db.Caches.OrderBy(e => e.Time).ToListAsync();
DateTime? dateTimeLast = null;
IEnumerable<CacheEntry> progression = entries.Where(e =>
{
bool isProgress = ((dateTimeLast == null) || (dateTimeLast >= e.DateAndTime));
if (isProgress)
dateTimeLast = e.DateAndTime;
return isProgress;
});
var result = progression.ToList();
How can I rewrite that LINQ query into a plain T-SQL (SQL Server) query?
I do not know how to translate a Where condition with the state variable dateTimeLast into T-SQL.
The source table grew a lot in size and loading it all into memory is too slow now.
Of course the query is very simplified, so there would be additional WHERE conditions, like SELECT * FROM Caches WHERE <search_condition> ORDER BY Time, but they are not the issue.
The source table Caches has 2 columns: Time and DateAndTime (they are not related).
For example I looked at the LAG function, but it was not useful here.
You can achieve what you want with some SQL code using a combination of:
A @dateTimeLast variable, as suggested by Tetsuya Yamamoto
A temporary table to store the valid rows
A cursor to iterate over the rows of the table
You can use the following code as guidance (NOT tested, please consider it pseudocode); it assumes that the name of the table is entries and that it has an integer id column:
DECLARE @dateTimeLast DATETIME
DECLARE @isProgress BIT
DECLARE @id INT
DECLARE @entryDateTime DATETIME

-- create an empty copy of the table structure
SELECT TOP 0 * INTO #temp FROM entries

DECLARE the_cursor CURSOR FOR SELECT Id, DateAndTime FROM entries ORDER BY Time
OPEN the_cursor
FETCH NEXT FROM the_cursor INTO @id, @entryDateTime

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @isProgress = CASE WHEN (@dateTimeLast IS NULL) OR (@dateTimeLast >= @entryDateTime)
                           THEN 1 ELSE 0 END
    IF @isProgress = 1
    BEGIN
        INSERT INTO #temp SELECT * FROM entries WHERE id = @id
        SET @dateTimeLast = @entryDateTime
    END
    FETCH NEXT FROM the_cursor INTO @id, @entryDateTime
END

CLOSE the_cursor
DEALLOCATE the_cursor

SELECT * FROM #temp
Another option is to have the temporary table store just the ids of the rows, and at the end do something like SELECT * FROM entries WHERE id IN (SELECT id FROM #temp).
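A set-based alternative avoids the cursor entirely. A row passes the LINQ filter exactly when its DateAndTime is less than or equal to every earlier row's DateAndTime: a skipped row is always greater than the running minimum, so it can never lower it, which means the running minimum over all earlier rows equals the running minimum over the selected ones. On SQL Server 2012 and later that running minimum can be computed with a windowed MIN. A sketch, not tested, assuming the integer id column from above:
SELECT Id, Time, DateAndTime
FROM (
    SELECT Id, Time, DateAndTime,
           -- smallest DateAndTime among all strictly earlier rows (NULL for the first row)
           MIN(DateAndTime) OVER (ORDER BY Time
                                  ROWS BETWEEN UNBOUNDED PRECEDING
                                           AND 1 PRECEDING) AS PrevMin
    FROM Caches
) AS ordered
WHERE PrevMin IS NULL OR DateAndTime <= PrevMin
ORDER BY Time;
Any additional WHERE conditions would go inside the inner query so the running minimum is computed over the filtered set.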
How do I use CASE to check whether a specific column has been reached, and then INSERT INTO another table while copying the current data over into this second table?
This is because my first table, cases, only accepts inserts up to column 20 (in the real world these are actually upload files).
The two tables are by definition exactly the same, as I populated the second by scripting the first into a new table, etc. ID is an identity column in both.
For example:
--INSERT THE SAME DATA BUT DO NOT INSERT INTO COLUMN UploadNo1 to UploadNo20 into casesTwo as this should already have data (files from table cases).
INSERT INTO casesTwo --ONLY FILE FROM COLUMN UploadNo20
SELECT CAST(
CASE
WHEN No20 = 'UploadNo20'
THEN 1
ELSE 0
END)
FROM cases
There is no way to do this with CASE - you have to list the first 20 columns manually. You will need a dynamic query if you want to pick the columns dynamically.
Here is a sample:
DECLARE @DynamicSQL NVARCHAR(max);
DECLARE @ColumnsList NVARCHAR(max);

-- Concatenate the first 20 columns
-- (variable concatenation with ORDER BY is not officially guaranteed; MAXDOP 1 makes it behave in practice):
SELECT TOP 20
    @ColumnsList = ISNULL(@ColumnsList + ', ', '') + QUOTENAME(column_name)
FROM
    information_schema.columns
WHERE
    table_name = 'YourTable'
ORDER BY
    ordinal_position
OPTION
    (MAXDOP 1)

-- You can check the column list:
-- SELECT @ColumnsList

-- Build the query:
SET @DynamicSQL = N'SELECT ' + @ColumnsList + N' FROM YourTable'

-- ... and run it:
EXEC (@DynamicSQL)
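To actually move the data into casesTwo, as the question asks, the same column list can drive a dynamic INSERT. A sketch, assuming the first 20 columns of cases line up with casesTwo; if the identity ID column is among them, it would have to be dropped from the list or wrapped in SET IDENTITY_INSERT:
-- copy the first 20 columns from cases into casesTwo (assumed table names from the question)
SET @DynamicSQL = N'INSERT INTO casesTwo (' + @ColumnsList + N') '
                + N'SELECT ' + @ColumnsList + N' FROM cases'
EXEC (@DynamicSQL)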
In my C# application I'm executing multiple UPDATE queries to manipulate data in a database table, e.g. replacing a specific character set with a different one, inserting new characters, and removing characters. When such a query has executed I want to do two things: get the total row count of the affected rows, and get a ROW_NUMBER() result set of the affected rows. The first is quite simple and already working. The second, however, is something I haven't been able to figure out yet.
Here is an example of a query that I might use when I'm manipulating data:
UPDATE myTable
SET myColumn = STUFF(myColumn, fromCharPos, toCharPos,
REPLACE(SUBSTRING(myColumn, fromCharPos, toCharPos), charToReplace, charReplacement))
WHERE LEN(myColumn) >= fromCharPos;
This query replaces (in all the cells of a column) a character set with another character set within a specified character range.
When this query has executed I want to get a result set of the row numbers of the affected rows. Does anyone know how I can implement this?
Some things to consider:
It has to work on at least SQL Server 2005 and up.
The UPDATE statements are executed within a transaction.
If anything is unclear, please comment below so I'm able to improve my question.
Edit:
I noticed that it was not quite clear what I want to achieve.
Let's say we have a set of data that looks like this:
34.56.44.12.33
32.44.68
45.22.66.33.77
44.42.44
66.44.22.44.45
00.22.78
43.98.34.65.33
Now I want to replace the dots with underscores between character positions 9 and 12. That means only these rows will be affected by the query:
34.56.44.12.33 <--
32.44.68
45.22.66.33.77 <--
44.42.44
66.44.22.44.45 <--
00.22.78
43.98.34.65.33 <--
The thing I want to achieve is to get a row number result set of the affected rows. In my example that would be a result set like this:
Row_number()
1
3
5
7
This may help you:
CREATE TABLE #updatetablename
    (excolumn VARCHAR(100))

INSERT INTO #updatetablename
VALUES ('34.56.44.12.33'),
       ('32.44.68'),
       ('45.22.66.33.77'),
       ('44.42.44'),
       ('66.44.22.44.45'),
       ('00.22.78'),
       ('43.98.34.65.33')

DECLARE @temp TABLE
    (excolumn VARCHAR(100))
DECLARE @temp1 TABLE
    (row_num INT, excolumn VARCHAR(100))

INSERT INTO @temp1
SELECT Row_number() OVER (ORDER BY excolumn), *
FROM #updatetablename

UPDATE #updatetablename
SET excolumn = Replace(excolumn, '.', '_')
OUTPUT deleted.excolumn
INTO @temp
WHERE Len(excolumn) > 12

SELECT b.row_num AS updatedrows,
       a.excolumn
FROM @temp a
JOIN @temp1 b
    ON a.excolumn = b.excolumn
Updated
declare @table table(val varchar(500))
insert into @table values
('34.56.44.12.33'),
('32.44.68'),
('45.22.66.33.77'),
('44.42.44'),
('66.44.22.44.45'),
('00.22.78'),
('43.98.34.65.33')
--select * from @table

declare @temp table(rowid int, val varchar(500), createdate datetime)
insert into @temp
select ROW_NUMBER() over(order by val), val, GETDATE() from @table

declare @rowEffectedCount int = 0
--select ROW_NUMBER() over(order by val), val, GETDATE() from @table WHERE CHARINDEX('.', val, 9) > 0

-- replace the dots from position 9 onward with underscores, keeping the prefix intact
UPDATE @table
SET val = STUFF(val, CHARINDEX('.', val, 9), LEN(val),
                REPLACE(SUBSTRING(val, CHARINDEX('.', val, 9), LEN(val)), '.', '_'))
WHERE CHARINDEX('.', val, 9) > 0

set @rowEffectedCount = @@ROWCOUNT

select @rowEffectedCount as roweffected, * from @temp t1
where val not in (
    select val from @table )
Old one
It's quite simple, as I understand it.
You just add a select query before your update query; read the comments for more detail.
declare @rowEffectedCount int = 0
--you can use a temp table or a permanent history table to hold each portion of work.
--be careful that the structure stays the same, or save only the PK; then there is no issue
--fill the table using the same WHERE clause that filters the data to update:
select * into #t from myTable WHERE LEN(myColumn) >= fromCharPos
--or only the pk:
--select id into #t from myTable WHERE LEN(myColumn) >= fromCharPos

--now do the update
UPDATE myTable
SET myColumn = STUFF(myColumn, fromCharPos, toCharPos,
    REPLACE(SUBSTRING(myColumn, fromCharPos, toCharPos), charToReplace, charReplacement))
WHERE LEN(myColumn) >= fromCharPos;
set @rowEffectedCount = @@ROWCOUNT

select * from #t
--finally delete or truncate or drop the table (if you use a permanent table)
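Both variants can be collapsed into a single pass with the OUTPUT clause, which is available from SQL Server 2005 onward. A sketch, assuming myTable has an integer id key (the column name is an assumption; the other placeholders come from the question), and noting that @@ROWCOUNT right after the UPDATE still gives the total affected count:
DECLARE @affected TABLE (id int PRIMARY KEY)

UPDATE myTable
SET myColumn = STUFF(myColumn, fromCharPos, toCharPos,
    REPLACE(SUBSTRING(myColumn, fromCharPos, toCharPos), charToReplace, charReplacement))
OUTPUT inserted.id INTO @affected   -- capture the keys of the affected rows
WHERE LEN(myColumn) >= fromCharPos

-- number every row once, then keep only the affected ones
SELECT n.rn AS [Row_number()]
FROM (SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rn FROM myTable) n
JOIN @affected a ON a.id = n.id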
I have a query whose purpose is to find the records that exist in table item_location but not in table operation_detail for a specific month of the year:
SELECT il.item_id,
il.SEQUENCE,
SUM (il.quantity) AS quantity,
i.buy_price,
i.sell_price, i.item_name, i.unit_measure_id,
i.is_raw_item AS is_raw
FROM item_location il, item i
WHERE il.quantity <> 0
AND il.item_id = i.item_id
AND il.SEQUENCE = i.SEQUENCE
AND NOT EXISTS (
SELECT od.*
FROM operation_detail od, operation_header oh, rt_operation o
WHERE od.item_id = il.item_id
AND od.SEQUENCE = il.SEQUENCE
AND od.operation_header_id = oh.operation_header_id
AND oh.operation_type_id = o.operation_type_id
AND o.operation_stock IN ('I', 'O')
AND MONTH(oh.operation_date) = @MONTH
AND YEAR(oh.operation_date) = @YEAR)
GROUP BY il.item_id,
il.SEQUENCE,
i.buy_price,
i.sell_price,
i.item_name,
i.unit_measure_id,
i.is_raw_item
Note that running this query from the .NET platform using a DataAdapter gives a timeout; running it directly in SQL takes 40s.
My main problem is the timeout... any suggestions?
The default timeout for running the query is 30 seconds, and if your command takes longer it will be terminated. I guess you should optimize your query to run faster, but you can also increase the timeout for your data adapter:
dataAdapter.SelectCommand.CommandTimeout = 120; // Two minutes
To improve performance, avoid a non-SARGable WHERE clause. The main mistake that makes a query non-SARGable is applying functions directly to a column in the WHERE clause; such a predicate cannot use an index.
Look at this example, which declares new parameters so the query can use an index seek instead:
DECLARE @YEAR int = 1971,
        @MONTH int = 11,
        @StartDate datetime,
        @EndDate datetime

-- build 'YYYYMMDD' for the first day of the month, then add one month for the exclusive upper bound
SET @StartDate = CAST(CAST(@YEAR AS nvarchar(4)) + RIGHT('0' + CAST(@MONTH AS nvarchar(2)), 2) + '01' AS datetime)
SET @EndDate = DATEADD(month, 1, @StartDate)
SELECT
...
WHERE od.item_id = il.item_id
AND od.SEQUENCE = il.SEQUENCE
AND od.operation_header_id = oh.operation_header_id
AND oh.operation_type_id = o.operation_type_id
AND o.operation_stock IN ('I', 'O')
AND oh.operation_date >= @StartDate AND oh.operation_date < @EndDate
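For that range predicate to actually turn into a seek, there has to be an index keyed on the date column; a hedged sketch (the index name and INCLUDE list are assumptions):
CREATE INDEX IX_operation_header_date
    ON operation_header (operation_date)
    INCLUDE (operation_type_id) -- assumed, to cover the join to rt_operation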
Whenever I see queries with more than a few grouping columns, I start to think the query can be rewritten. In general, you should try to group only on key columns, storing the keys and aggregated results in a temp table, then join that temp table back to get the additional details. For example:
insert into
    #tmp
select
    key1, key2, sum(things) as sum_of_all_things
from
    [table]
group by
    key1, key2;
Then:
select
    [table].key1, [table].key2,
    tmp.sum_of_all_things,
    [table].other_stuff, [table].extra_data
from
    #tmp tmp
join
    [table] on tmp.key1 = [table].key1 and tmp.key2 = [table].key2
This avoids the overhead of sorting all the non-key columns (which is what happens as part of a GROUP BY operation).
Secondly, since you have a correlated subquery inside the NOT EXISTS clause, you should provide an index or indexes on the match predicate (in this case, item_id and SEQUENCE). EXISTS returns true as soon as the result set contains any rows, but the inner query still has to be re-executed for every row of the outer query, so you will need indexes to make this less torturous.
Since your inner query itself contains 3 joins, I would seriously consider running it separately and storing the results in another temp table.
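A minimal sketch of the index suggested above for the correlated subquery (the index name and INCLUDE column are assumptions):
CREATE INDEX IX_operation_detail_item
    ON operation_detail (item_id, SEQUENCE)
    INCLUDE (operation_header_id) -- assumed, to cover the join to operation_header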