I have a SQL Server database designed like this:
TableParameter
Id (int, PRIMARY KEY, IDENTITY)
Name1 (string)
Name2 (string, can be null)
Name3 (string, can be null)
Name4 (string, can be null)
TableValue
Iteration (int)
IdTableParameter (int, FOREIGN KEY)
Type (string)
Value (decimal)
As you can see, TableValue is linked to TableParameter.
TableParameter works like a multidimensional dictionary.
TableParameter is expected to hold a lot of rows (more than 300,000).
From my C# client program, I have to fill this database after each Compute() call:
for (int iteration = 0; iteration < 5000; iteration++)
{
Compute();
FillResultsInDatabase();
}
In the FillResultsInDatabase() method, I have to:
1. Check whether the label of my parameter already exists in TableParameter. If it doesn't, I have to insert a new one.
2. Insert the value into TableValue.
Step 1 takes a long time! I load the whole TableParameter table into an IEnumerable property and then, for each parameter, I call
.FirstOrDefault( x => x.Name1 == item.Name1 &&
x.Name2 == item.Name2 &&
x.Name3 == item.Name3 &&
x.Name4 == item.Name4 );
in order to detect whether it already exists (and then to get its Id).
Performance is very bad this way!
I've tried selecting with a WHERE clause to avoid loading every row of TableParameter, but performance is even worse!
How can I improve the performance of step 1?
For step 2, performance is still bad with classic INSERT statements. I am going to try SqlBulkCopy.
How can I improve the performance of step 2?
EDITED
I've tried with stored procedures:
CREATE PROCEDURE GetIdParameter
@Id int OUTPUT,
@Name1 nvarchar(50) = null,
@Name2 nvarchar(50) = null,
@Name3 nvarchar(50) = null
AS
SELECT TOP 1 @Id = Id FROM TableParameter
WHERE
TableParameter.Name1 = @Name1
AND
(@Name2 IS NULL OR TableParameter.Name2 = @Name2)
AND
(@Name3 IS NULL OR TableParameter.Name3 = @Name3)
GO
CREATE PROCEDURE CreateValue
@Iteration int,
@Type nvarchar(50),
@Value decimal(32, 18),
@Name1 nvarchar(50) = null,
@Name2 nvarchar(50) = null,
@Name3 nvarchar(50) = null
AS
DECLARE @IdParameter int
EXEC GetIdParameter @IdParameter OUTPUT,
@Name1, @Name2, @Name3
IF @IdParameter IS NULL
BEGIN
INSERT TableParameter (Name1, Name2, Name3)
VALUES
(@Name1, @Name2, @Name3)
SELECT @IdParameter = SCOPE_IDENTITY()
END
INSERT TableValue (Iteration, IdTableParameter, Type, Value)
VALUES
(@Iteration, @IdParameter, @Type, @Value)
GO
I still have the same performance... :-( (not acceptable)
If I understand what's happening, you're querying the database in step 1 to see whether the data is already there. I'd use a single call to a stored procedure that inserts the data if it is not there, so you just compute the results and pass them to the stored procedure.
Can you compute the results first, and then insert in batches?
Does the compute function take data from the database? If so, can you turn the operation into a set-based operation and perform it on the server itself? Or maybe part of it?
Remember that SQL Server is designed for operations on large data sets.
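A minimal sketch of what such a batched, set-based insert might look like. It assumes the computed results are first loaded into a staging table (the name dbo.StagingValue is hypothetical, as are the nvarchar(50) column sizes, which are borrowed from the stored procedures in the question), for example with SqlBulkCopy. The INTERSECT comparison is used because it treats two NULLs as equal, which a plain = does not.

```sql
-- Hypothetical staging table: one row per computed value in the current batch.
CREATE TABLE dbo.StagingValue (
    Iteration int            NOT NULL,
    Name1     nvarchar(50)   NOT NULL,
    Name2     nvarchar(50)   NULL,
    Name3     nvarchar(50)   NULL,
    Name4     nvarchar(50)   NULL,
    [Type]    nvarchar(50)   NOT NULL,
    [Value]   decimal(32,18) NOT NULL
);

-- (Load dbo.StagingValue here, for example with SqlBulkCopy from the client.)

-- Step 1: add only the parameters that do not exist yet.
-- INTERSECT treats NULLs as equal, so NULL Name2..Name4 are matched correctly.
INSERT INTO TableParameter (Name1, Name2, Name3, Name4)
SELECT DISTINCT s.Name1, s.Name2, s.Name3, s.Name4
FROM dbo.StagingValue AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM TableParameter AS p
    WHERE EXISTS (SELECT s.Name1, s.Name2, s.Name3, s.Name4
                  INTERSECT
                  SELECT p.Name1, p.Name2, p.Name3, p.Name4)
);

-- Step 2: insert all values in one statement, resolving the ids with a join.
INSERT INTO TableValue (Iteration, IdTableParameter, [Type], [Value])
SELECT s.Iteration, p.Id, s.[Type], s.[Value]
FROM dbo.StagingValue AS s
JOIN TableParameter AS p
  ON EXISTS (SELECT s.Name1, s.Name2, s.Name3, s.Name4
             INTERSECT
             SELECT p.Name1, p.Name2, p.Name3, p.Name4);

-- Empty the staging table before the next batch.
TRUNCATE TABLE dbo.StagingValue;
```

With this shape the client makes one bulk load and two statements per batch instead of one round trip per value. How many iterations to put in a batch is something to measure rather than assume.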
Edit: reflecting comments
Since the code is slow on the data inserts, and you suspect that it's because the insert has to search back before it can be done, I'd suggest that you may need to place SQL Indexes on the columns that you search on in order to improve searching speed.
However I have another idea.
Why don't you just insert the data without the check and then later when you read the data remove the duplicates in that query?
Given the fact that name2 - name3 can be null, would it be possible to restructure the parameter table:
TableParameter
Id (int, PRIMARY KEY, IDENTITY)
Name (string)
Dimension int
Now you can index it and simplify the query. (WHERE Name = 'TheNameIWant' AND Dimension = 2)
(And speaking of indices, you do have indexes on the name columns in the parameter table?)
Where do you commit your inserts? If you commit after every single statement, group multiple inserts into one transaction.
If you are the only one inserting values and speed is really of the essence, load all the values from the database into memory and check there.
just some ideas
hth
Mario
I must admit that I'm struggling to grasp the business process that you are trying to achieve here.
On initial review, it appears as if you are performing a data comparison within your application tier. I would advise against this and suggest that you let the Database Engine do what it is designed to do: manage and implement your data access.
As another poster has mentioned, I concur that you should look to create a Stored Procedure to handle your record insertion logic. The procedure can perform a simple check to see if your records already exist.
You should also consider:
Enforcing the insertion logic/rule by creating a Unique Constraint across the four name columns.
Creating a covering non-clustered index incorporating the four name columns.
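As a rough sketch of the two suggestions above, a single unique index can serve both purposes. This assumes the four name columns are nvarchar(50), as in the question's stored procedures, and that Id is the clustered primary key (so the index also carries Id and covers the lookup). The index name and the sample values are hypothetical, and creating the index will fail if duplicate rows already exist.

```sql
-- Assumption: Name1..Name4 are nvarchar(50); the index name is made up.
-- SQL Server treats NULLs as equal in a unique index, so a combination
-- such as ('Rate', NULL, NULL, NULL) can only be stored once.
CREATE UNIQUE NONCLUSTERED INDEX UX_TableParameter_Names
    ON TableParameter (Name1, Name2, Name3, Name4);

-- Hypothetical lookup values.
DECLARE @Name1 nvarchar(50) = N'Rate',
        @Name2 nvarchar(50) = N'EUR',
        @Name3 nvarchar(50) = NULL,
        @Name4 nvarchar(50) = NULL;

-- NULL-safe lookup of the parameter id; INTERSECT matches NULL with NULL.
SELECT p.Id
FROM TableParameter AS p
WHERE EXISTS (SELECT p.Name1, p.Name2, p.Name3, p.Name4
              INTERSECT
              SELECT @Name1, @Name2, @Name3, @Name4);
```

Whether the optimizer actually seeks on the index for this query is worth confirming in the execution plan.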
With regard to performance of your inserts, perhaps you can provide some metrics to qualify what it is that you are seeing and how you are measuring it?
To give you a yardstick the current ETL insertion record for SQL Server is approx 16 million rows per second. What sort of numbers are you expecting and wanting to see?
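The fastest way I know of so far is a bulk insert, but not just a series of separate INSERT lines. Try INSERT + SELECT + UNION ALL; it works pretty fast. (Use UNION ALL rather than plain UNION, because UNION removes duplicate rows, which costs time and silently drops repeated values.)
insert into myTable
select a1, b1, c1, ...
union all select a2, b2, c2, ...
union all select a3, b3, c3, ...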
Related
So let's say I have a table in SQL server that serves as a queue for items that need processing. Something like this:
Id (bigint)
BatchGuid (guid)
BatchProcessed (bit)
...
...along with some other columns describing the item that needs to be processed, etc. So there are many running consumers that add records to this table as needed to indicate that an item needs to be processed.
So now let's say I have a job that is in charge of getting a batch of items from this table and processing them. Say we want to let it process 10 at a time. Now also assume that this job can have many instances running at once, so it is concurrently accessing the table (along with any other consumers who may be adding new records to the queue).
I was planning to do something like this:
using(var tx = new Transaction(Isolation.Serializable))
{
var batchGuid = //newGuid
executeSql("update top(10) [QueueTable] set [BatchGuid] = batchGuid where [BatchGuid] is null");
var itemsToProcess = executeSql("select * from [QueueTable] where [BatchGuid] = batchGuid");
tx.Commit()
}
So basically what I'd be doing is starting a transaction as serializable, marking 10 items with a specific GUID, then getting those 10 items, then committing.
Is this a feasible strategy? I believe the isolation level of serializable will basically lock the whole table to prevent read/write until the transaction is complete - is this correct? Basically the transaction will block all other read/write operations on the table? I believe this is what I want in this case as I don't want to read dirty data and I don't want concurrent running jobs to stomp on each other when marking a batch of 10 to process.
Any insights as to whether I'm on the right track with this would be much appreciated. If there are better ways to accomplish this I'd welcome alternatives as well.
Serializable isolation mode does not necessarily lock the whole table. If you have an index on BatchGuid you will probably do ok, but if not then SQL will probably escalate to a table lock.
A few things you may want to look at:
Using the OUTPUT clause, you can combine your UPDATE and SELECT into one query
You may need to use UPDLOCK if you have multiple processes running this query
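A minimal sketch of how the UPDLOCK suggestion might look when combined with OUTPUT. It assumes the default READ COMMITTED isolation level (READPAST is not permitted under SERIALIZABLE) and the Id and BatchGuid columns from the question. The READPAST and ROWLOCK hints are an addition here, so that concurrent workers skip rows another worker has already locked instead of waiting on them.

```sql
DECLARE @batchGuid uniqueidentifier = NEWID();

-- Pick up to 10 unclaimed rows, locking them for update and skipping
-- any rows that another session currently holds locks on.
WITH q AS (
    SELECT TOP (10) Id, BatchGuid
    FROM dbo.QueueTable WITH (UPDLOCK, READPAST, ROWLOCK)
    WHERE BatchGuid IS NULL
    ORDER BY Id
)
UPDATE q
SET BatchGuid = @batchGuid
OUTPUT inserted.Id, inserted.BatchGuid;  -- returns the rows claimed by this batch
```

Because the claim and the read happen in one statement, no explicit transaction is needed just to mark and fetch the batch.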
You can do this in a single statement if you use the OUTPUT clause:
UPDATE TOP (10) [QueueTable]
SET [BatchGuid] = @batchGuid
OUTPUT inserted.*
WHERE [BatchGuid] IS NULL;
Or more specifically:
var itemsToProcess = executeSql("update top(10) [QueueTable] set [BatchGuid] = @batchGuid output inserted.* where [BatchGuid] is null");
It is personal preference, I suppose, but I have never been a fan of the UPDATE TOP (n) syntax, because you can't specify an ORDER BY, and in most cases when specifying TOP you want to specify an order. I much prefer using something like:
UPDATE q
SET [BatchGuid] = @batchGuid
OUTPUT inserted.*
FROM ( SELECT TOP (10) *
FROM dbo.QueueTable
WHERE BatchGuid IS NULL
ORDER BY ID
) AS q
ADDENDUM
In response to the comment, I don't believe there is any chance of a race condition, but I was not 100% certain. The reason I believe this is that, although the query reads as a SELECT and an UPDATE, that is just syntactic sugar: it is a single update, and it uses exactly the same plan and locks as the top query. However, since I don't know for sure, I decided to test:
First I set up a sample table in temp DB, and a logging table to log updated IDs
USE TempDB;
GO
CREATE TABLE dbo.T (ID BIGINT NOT NULL IDENTITY PRIMARY KEY, Col UNIQUEIDENTIFIER NULL);
INSERT dbo.T (Col)
SELECT TOP 1000000 NULL
FROM sys.all_objects a, sys.all_objects b;
CREATE TABLE dbo.T2 (ID BIGINT NOT NULL PRIMARY KEY);
Then in 10 different SSMS windows I ran this:
WHILE 1 = 1
BEGIN
DECLARE @ID UNIQUEIDENTIFIER = NEWID();
UPDATE T
SET Col = @ID
OUTPUT inserted.ID INTO dbo.T2 (ID)
FROM ( SELECT TOP 10 *
FROM dbo.T
WHERE Col IS NULL
ORDER BY ID
) t;
IF @@ROWCOUNT = 0
RETURN;
END
The whole process ran for 20 minutes, updating ~500,000 rows, before I stopped all 10 threads. Updating the same row twice would throw an error when inserting into T2 (a primary key violation), and all 10 threads had to be stopped manually, so this shows that there was no race condition. To confirm this, I ran the following:
SELECT Col, COUNT(*)
FROM dbo.T
WHERE Col IS NOT NULL
GROUP BY Col
HAVING COUNT(*) <> 10;
Which, as expected, returned no rows.
I am happy to be proved wrong and to concede I was lucky in that none of these 100,000 iterations clashed, but I don't believe it was luck. I really believe there is a single lock, therefore it doesn't matter if you have a transaction or not, you just need the correct isolation level.
I realise there have been debates about this, although I can't find a real definitive answer.
A lot of times, this question leads to a definition of what varchar(MAX) etc is, their actual limits, and what NOT to use.
What I want to find out is this:
I am giving the user an option to type/paste in whatever they want to in a text box, without limit.
This could be anything from a word, to a byte array dump of an image.
I need to then be able to quickly reference to this data at a later stage, using its TITLE or ID.
What would be the best way to go about storing this data into a DB? I have read that using varchar(MAX) stops indexing, and generally should not be used.
I'm not really well-off in SQL, so I imagine a possible solution would be to split this string into arrays of 4000, and store them like that.
Is this a good lead? Or am I missing something obvious?
General Model:
public string a_Title { get; set; }
public string a_Content { get; set; }
public string a_AdditionalInfo { get; set; }
Where a_Content will be stored as the unknown.
I believe you don't have to worry about your storage type right now. You really have a choice between varchar(n), varchar(max), nvarchar(n), nvarchar(max), even varbinary and text/ntext. All of them have their virtues and drawbacks. You can even chuck your blob-like strings into a file using FILESTREAM.
However, I believe you want to reference your data via some int or short varchar field, and you want to get your results relatively quick (not blazing-hot-get-me-data-yesterday-fast). There can be more or less optimal ways to do it, but don't think about that now.
I suggest this:
create a table structure that satisfies your needs
fill it with some dummy data
implement data access using your favorite data access layer
When you feel happy enough, create a performance test with, say, a few thousand accesses. This is the first point at which you want to optimize performance. Until then I would even forget about the indexing itself.
P.S. - don't split it into 4000-byte chunks; this is a very awkward workaround and it can cause random search misses (not exactly random, but you would only find the bug after a slow and painful debugging session)
VarChar(MAX) is intended for cases like yours.
Even if you chunk the data into NVarChar(4000) (or VarChar(8000)) segments to allow indexes to be built, what worth would those indexes really have?
You'll also be opening yourself up to the headache of figuring out where segments begin and end and then reconstituting them either in a nasty SQL statement or in some middle-tier client code.
Further, VarChar(MAX) and NVarChar(MAX) values are kept in-row when it's possible to do so. That means until a value exceeds 8,000 or 4,000 characters, respectively (and the row has room for it).
One approach would be to build two tables, and leverage VARCHAR(MAX):
CREATE TABLE table1 (ID INT PRIMARY KEY IDENTITY(1, 1), Name VARCHAR(1024) NOT NULL);
CREATE TABLE table1Text (ID INT PRIMARY KEY, data VARCHAR(MAX) NOT NULL);
and the ID in table1Text is a foreign key reference to table1. The benefit of this approach is that you can still index table1 and you can provide full-text searching on table1Text without ever having any issues of each table affecting one another.
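To make the relationship concrete, here is a small sketch of the same two tables with the foreign key declared explicitly, followed by an insert and a lookup by title. The sample title and text are made up for illustration.

```sql
CREATE TABLE table1 (
    ID   INT PRIMARY KEY IDENTITY(1, 1),
    Name VARCHAR(1024) NOT NULL
);

-- table1Text.ID is both the primary key and a foreign key to table1.ID,
-- so each title has at most one text row.
CREATE TABLE table1Text (
    ID   INT PRIMARY KEY REFERENCES table1 (ID),
    data VARCHAR(MAX) NOT NULL
);

-- Store a title, then its (arbitrarily long) content under the same ID.
INSERT INTO table1 (Name) VALUES ('Some title');
INSERT INTO table1Text (ID, data) VALUES (SCOPE_IDENTITY(), 'Arbitrarily long text goes here');

-- Fetch the content by title; the large column is only read for matching rows.
SELECT t.ID, t.Name, x.data
FROM table1 AS t
JOIN table1Text AS x ON x.ID = t.ID
WHERE t.Name = 'Some title';
```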
I'd be tempted to use varbinary(max) for this "catch-all" data column you're describing. The varbinary data type can store text (see below), byte arrays and entire files. The size limit is 2 GB, but the nice thing about varbinary (as opposed to the deprecated image type) is that you can return/download a portion of the data using a simple substring function. So, if you only wanted the first 100 MB of some data, you'd use:
declare @data varbinary(max)
select @data = substring(DataColumn, 1, 100000000) from SomeDataTable where ID = 1
To insert text into varbinary you'd use:
declare @test table (
data varbinary(max) not null,
datatype varchar(10) not null
)
-- insert varchar
insert into @test (data, datatype)
values (cast('Your Text Here' as varbinary(max)), 'varchar')
-- insert nvarchar
insert into @test (data, datatype) select cast(N'YourTextHere' as varbinary(max)), 'nvarchar'
-- see the results
select data, datatype from @test
select cast(data as varchar(max)) as data_to_varchar, datatype from @test
select cast(data as nvarchar(max)) as data_to_nvarchar, datatype from @test
This table schema in question is here: Oracle SQL: Selecting a single row with the latest date between multiple columns
I'm working with a table that has over 5 million entries. What is the fastest and most accurate way to upsert to this table AND return the last upserted row id using a stored procedure?
Most of what I've read recommends using the merge statement for upserts. However, merge doesn't support returning into.
In our table, we have the CREATE_DATE, CREATE_USER, UPDATE_DATE, and UPDATE_USER fields that are updated as expected. My thought was to create a stored procedure that returned the id of the row that has the latest date between those two columns and where the respective user data was equal to the current user data. This is what the people who answered the referring question helped me with (thanks!).
However, I'm concerned about the combined execution time vs other methods, as well as the huge gaps created in sequences due to merging. Calling a separate statement simply to get the id also seems a bit inefficient. However, almost everything I've read says that merge is much faster than the pre-merge upsert statements.
Note that these are being called through a c#/asp web application. Any help is appreciated :)
edit
Below is an example of the stored procedure I'm using for the Upsert. Note that the CREATE_DATE and UPDATE_DATE columns are updated with triggers.
create or replace
PROCEDURE P_SAVE_EXAMPLE_TABLE_ROW
(
pID IN OUT EXAMPLE_TABLE.ID%type,
--Other row params here
pUSER IN EXAMPLE_TABLE.CREATE_USER%type,
pPLSQLErrorNumber OUT NUMBER,
pPLSQLErrorMessage OUT VARCHAR2
)
AS
BEGIN
MERGE INTO USERS_WORKGROUPS_XREF USING dual ON (ID=pID)
WHEN NOT MATCHED THEN
INSERT (--OTHER COLS--, CREATE_USER) VALUES (--OTHER COLS--, pUSER)
WHEN MATCHED THEN
UPDATE SET
--OTHER COLS--
UPDATE_USER=pUSER
WHERE ID=pID;
EXCEPTION
WHEN OTHERS THEN
pID := 0;
pPLSQLErrorNumber := SQLCODE;
pPLSQLErrorMessage := SUBSTR(SQLERRM, 1, 256);
RETURN;
-- STATEMENT TO RETURN LAST AFFECTED ID INTO pID GOES HERE
END;
If you're trying to return the maximum value of a sequence-generated PK on the table then I'd just run a "Select max(id) .." directly afterwards. If other sessions are also modifying the table then maybe reading the currval of the sequence would be better.
We have an ASP.NET/MSSQL based web app which generates orders with sequential order numbers.
When a user saves a form, a new order is created as follows:
SELECT MAX(order_number) FROM order_table, call this max_order_number
set new_order_number = max_order_number + 1
INSERT a new order record, with this new_order_number (it's just a field in the order record, not a database key)
If I enclose the above 3 steps in single transaction, will it avoid duplicate order numbers from being created, if two customers save a new order at the same time? (And let's say the system is eventually on a web farm with multiple IIS servers and one MSSQL server).
I want to avoid two customers selecting the same MAX(order_number) due to concurrency somewhere in the system.
What isolation level should be used? Thank you.
Why not just use an Identity as the order number?
Edit:
As far as I know, you can make the current order_number column an Identity (you may have to reset the seed, it's been a while since I've done this). You might want to do some tests.
Here's a good read about what actually goes on when you change a column to an Identity in SSMS. The author mentions how this may take a while if the table already has millions of rows.
Using an identity is by far the best idea. I create all my tables like this:
CREATE TABLE mytable (
mytable_id int identity(1, 1) not null primary key,
name varchar(50)
)
The "identity" flag means, "Let SQL Server assign this number for me". The (1, 1) means that identity numbers should start at 1 and be incremented by 1 each time someone inserts a record into the table. Not Null means that nobody should be allowed to insert a null into this column, and "primary key" means that we should create a clustered index on this column. With this kind of a table, you can then insert your record like this:
-- We don't need to insert into mytable_id column; SQL Server does it for us!
INSERT INTO mytable (name) VALUES ('Bob Roberts')
But to answer your literal question, I can give a lesson about how transactions work. It's certainly possible, although not optimal, to do this:
-- Begin a transaction - everything within this region is committed or
-- rolled back as a single unit.
BEGIN TRANSACTION
DECLARE @id bigint
-- Retrieves the maximum order number from the table. A transaction alone does not
-- stop another session from reading the same MAX, so take an exclusive table lock
-- and hold it until the transaction ends.
SELECT @id = MAX(order_number) FROM order_table WITH (TABLOCKX, HOLDLOCK)
-- While this lock is held, no other session can change the order table
-- (or take the same lock), so the number cannot be handed out twice
INSERT INTO order_table (order_number) VALUES (ISNULL(@id, 0) + 1)
-- Committing the transaction releases your lock and allows other programs
-- to work on the order table
COMMIT TRANSACTION
Just keep in mind that declaring your table with an identity primary key column does this all for you automatically.
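The risk is two processes selecting the MAX(order_number) before one of them inserts the new order. A safer way is to do it in one step. Even a single statement can race under the default isolation level, so this version also takes a table lock for the duration of the statement:
INSERT INTO order_table
(order_number /*, other fields */)
SELECT ISNULL(MAX(order_number), 0) + 1 /*, other values */
FROM order_table WITH (TABLOCKX, HOLDLOCK)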
I agree with G_M; use an Identity field. When you add your record, just
INSERT INTO order_table (/* other fields */)
VALUES (/* other fields */) ; SELECT SCOPE_IDENTITY()
The return value from SCOPE_IDENTITY() will be your order number.
I have some C#/LINQ code that merges data from an Excel file into a database, and it needs better performance.
There are
1. A List read from excel file: List<Score> newScoreList
2. A DB table named Scores, primary keys peopleId and testDate
I need to merge data from the list to the table, and if there is any duplicate data, update it.
My current solution is:
1) Find the duplicate data with this LINQ expression:
var dupliData =
from newScore in newScoreList
from oldScore in db.Scores
where newScore.peopleId == oldScore.peopleId && newScore.testDate == oldScore.testDate
select oldScore;
2) Delete the duplicate data.
db.Scores.DeleteAllOnSubmit(dupliData);
3) Insert the new data from list.
db.Scores.InsertAllOnSubmit(newScoreList);
Could anybody give me a better solution?
I really hate stored procedures in general, but this is probably a perfect case for using one. My TSQL is rusty, but this should give an idea.
CREATE PROCEDURE dbo.InsertOrUpdateScore
(
@id as Int,
@date as DateTime,
@result as varchar(20)
)
AS
if not exists(SELECT id FROM Scores WHERE id = @id AND date = @date)
begin
INSERT INTO Scores (id, date, result) values (@id, @date, @result)
end
else
begin
UPDATE Scores
SET result = @result
WHERE id = @id AND date = @date
end
GO
Now in your LINQ server browser, select the Score entity, and change its INSERT and UPDATE behaviour to use the stored procedure you just created. Make sure the user accessing the database has EXECUTE permission to the SPROC.
This should perform quite a bit quicker than your version. You're trading an IN clause for N SELECTs on an index which may be quicker. However, the result set of the IN clause is not transported back to the client over the network, which could save quite a bit of time.
Profile exactly how long your method is taking before implementing this, so you can gauge if this is truly quicker.
I'm not sure if this is the only way to create a Score in your application, but you might want to consider the case where you're INSERTing a record that doesn't yet have an ID. You'll need to modify the SPROC to allow @id as null, and handle the INSERT appropriately.
Then it should just be:
db.Scores.InsertAllOnSubmit(newScoreList);
If you are using SQL 2008 you can use the Merge command
http://www.builderau.com.au/program/sqlserver/soa/Using-SQL-Server-2008-s-MERGE-statement/0,339028455,339283059,00.htm
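A minimal sketch of what such a MERGE might look like for this case, assuming SQL Server 2008 or later. The key columns peopleId and testDate come from the question. The result column and the @NewScores table variable are assumptions made for illustration; the rows from the Excel file would be loaded into it (or into a staging table) first.

```sql
-- Hypothetical holding area for the rows read from the Excel file.
DECLARE @NewScores TABLE (
    peopleId int         NOT NULL,
    testDate datetime    NOT NULL,
    result   varchar(20) NULL,
    PRIMARY KEY (peopleId, testDate)
);

-- (Fill @NewScores here.)

-- Update rows whose key already exists and insert the rest, in one statement.
MERGE dbo.Scores AS tgt
USING @NewScores AS src
    ON  tgt.peopleId = src.peopleId
    AND tgt.testDate = src.testDate
WHEN MATCHED THEN
    UPDATE SET result = src.result
WHEN NOT MATCHED BY TARGET THEN
    INSERT (peopleId, testDate, result)
    VALUES (src.peopleId, src.testDate, src.result);
```

This replaces the delete-then-insert round trips with a single set-based statement. Whether it is faster for a given file size is something to measure.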