Recompute big table in SQL database - c#

We are fighting this problem:
We have a big table (millions of rows) in SQL Server 2008 R2 (R2!).
We need to walk through this table, and each row needs to be recomputed by C# code (so the table has to be loaded into the C# application).
How do we do that if performance is crucial? Some kind of batching?

If performance is crucial, use SQL CLR to compute the new values; that way you can do the whole job with a single UPDATE statement. SQL CLR places restrictions on what code you can run, though, so this might not be an easy option.
There is another way if SQL CLR is not an option:
I assume there is an integer primary key. I'd do it like this:
request a batch of rows (10k or so)
compute the updates
send the updates
repeat
The trick is to keep track of which rows you already processed. Keep the ID value of the last row processed. Request a batch like this:
select top 10000 *
from T
where ID > @lastIDProcessed
order by ID
That will guarantee fast and reliable execution.
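Here is a minimal C# sketch of that keyset-paged loop, assuming an integer primary key ID, a single column Value to recompute, and a recompute delegate standing in for your actual C# logic (all of those names are placeholders):

using System;
using System.Data;
using System.Data.SqlClient;

static void RecomputeInBatches(string connectionString, Func<DataRow, object> recompute)
{
    const int batchSize = 10000;
    long lastIdProcessed = 0;

    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        while (true)
        {
            // Request the next batch, keyed on the last ID already processed.
            var batch = new DataTable();
            using (var select = new SqlCommand(
                "select top (@batchSize) ID, Value from T " +
                "where ID > @lastIdProcessed order by ID", conn))
            using (var adapter = new SqlDataAdapter(select))
            {
                select.Parameters.AddWithValue("@batchSize", batchSize);
                select.Parameters.AddWithValue("@lastIdProcessed", lastIdProcessed);
                adapter.Fill(batch);
            }

            if (batch.Rows.Count == 0)
                break;   // nothing left to process

            // Compute and send the updates for this batch.
            foreach (DataRow row in batch.Rows)
            {
                long id = Convert.ToInt64(row["ID"]);
                using (var update = new SqlCommand(
                    "update T set Value = @value where ID = @id", conn))
                {
                    update.Parameters.AddWithValue("@value", recompute(row));
                    update.Parameters.AddWithValue("@id", id);
                    update.ExecuteNonQuery();
                }
                lastIdProcessed = id;
            }
        }
    }
}

The per-row UPDATE keeps the sketch simple; if the round trips become the bottleneck, the updates themselves can also be sent in batches (for example via a table-valued parameter on SQL Server 2008).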

Related

C# - MySQL insert limit

I'm new to programming and databases. I've started learning MySQL and C#, so I created a really simple test program in C# to see how many inserts it can do in a minute (just a simple infinite loop inserting a short piece of text into one column). I am watching the dashboard in MySQL Workbench, and the problem is that the program can only manage about 1000 inserts/second. If I run 2-3 instances of the program at the same time, I see 2-3 * 1000 queries/second.
Is there any limit in MySQL?
There's no built-in limit to insert rates.
Lots of things come into play when you're trying for high insert rates. For example:
How big is each row you're inserting?
How complex are the indexes on your target table? Index updates take time during INSERT operations.
Which access method does your table use? MyISAM is transactionless, so a naive program can push more rows/sec. InnoDB has transactions, so doing your inserts in batches of 1000 or so, wrapped in BEGIN / COMMIT statements, can speed things up.
How fast are the disks / ssds on your server? How much RAM does it have?
How fast are your client machines and the network between them and the MySQL server?
Are other programs trying to read the target table at the same time you're doing inserts?
You've mentioned that your total insert rate scales up approximately linearly for 2-3 instances of your insert program. That means the bottleneck is in your insert program, not the server, at that scale.
C#, like many language frameworks, offers prepared statements. They are a way to write a query once and use it over and over with different data values. It's faster. It's also safer if your data comes from an untrusted source (look up SQL injection).
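As a rough illustration (not the poster's code), here is what a reused prepared statement can look like with the MySql.Data ADO.NET provider, combined with the batched-transaction idea above; the table tbl(a, b, c) and the connection string are assumptions:

using System.Collections.Generic;
using MySql.Data.MySqlClient;

static void InsertRows(string connectionString, IEnumerable<(int a, int b, int c)> rows)
{
    using (var conn = new MySqlConnection(connectionString))
    {
        conn.Open();
        using (var cmd = new MySqlCommand(
            "INSERT INTO tbl (a, b, c) VALUES (@a, @b, @c)", conn))
        {
            cmd.Parameters.Add("@a", MySqlDbType.Int32);
            cmd.Parameters.Add("@b", MySqlDbType.Int32);
            cmd.Parameters.Add("@c", MySqlDbType.Int32);
            cmd.Prepare();                      // parse the statement once, reuse it per row

            var tx = conn.BeginTransaction();
            cmd.Transaction = tx;
            int rowsInBatch = 0;

            foreach (var (a, b, c) in rows)
            {
                cmd.Parameters["@a"].Value = a;
                cmd.Parameters["@b"].Value = b;
                cmd.Parameters["@c"].Value = c;
                cmd.ExecuteNonQuery();

                if (++rowsInBatch == 1000)      // commit roughly every 1000 rows (helps with InnoDB)
                {
                    tx.Commit();
                    tx = conn.BeginTransaction();
                    cmd.Transaction = tx;
                    rowsInBatch = 0;
                }
            }
            tx.Commit();                        // commit the final partial batch
        }
    }
}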
MySQL lets you insert multiple rows with a single INSERT operation. Faster.
INSERT INTO tbl (a, b, c)
VALUES (1,2,3),(4,5,6),(7,8,9);
MySQL also offers the LOAD DATA INFILE statement. You can get astonishingly high bulk load rates with it if you need them.
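If you want to drive LOAD DATA INFILE from C#, a sketch under these assumptions: the MySql.Data provider, a table tbl(a, b, c), a CSV file already on the server, and a server configured to allow LOAD DATA INFILE for that location:

// Runs a LOAD DATA INFILE bulk load of a server-side CSV file into tbl(a, b, c).
// The path is concatenated directly into the SQL, so only use a trusted,
// non-user-supplied path here.
using MySql.Data.MySqlClient;

static void BulkLoadCsv(string connectionString, string serverSideCsvPath)
{
    using (var conn = new MySqlConnection(connectionString))
    {
        conn.Open();
        var sql = "LOAD DATA INFILE '" + serverSideCsvPath + "' " +
                  "INTO TABLE tbl " +
                  "FIELDS TERMINATED BY ',' " +
                  "LINES TERMINATED BY '\\n' " +
                  "(a, b, c)";
        using (var cmd = new MySqlCommand(sql, conn))
        {
            cmd.ExecuteNonQuery();
        }
    }
}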

I have roughly 30M rows to insert/update in SQL Server per day; what are my options?

I have roughly 30M rows to insert or update in SQL Server per day. What are my options?
If I use SqlBulkCopy, does it handle not inserting data that already exists?
In my scenario I need to be able to run this over and over with the same data without duplicating data.
At the moment I have a stored procedure with an update statement and an insert statement which read data from a DataTable.
What should I be looking for to get better performance?
The usual way to do something like this is to maintain a permanent work table (or tables) that have no constraints on them. Often these might live in a separate work database on the same server.
To load the data, you empty the work tables, blast the data in via BCP/bulk copy. Once the data is loaded, you do whatever cleanup and/or transforms are necessary to prep the newly loaded data. Once that's done, as a final step, you migrate the data to the real tables by performing the update/delete/insert operations necessary to implement the delta between the old data and the new, or by simply truncating the real tables and reloading them.
Another option, if you've got something resembling a steady stream of data flowing in, might be to set up a daemon to monitor for the arrival of data and then do the inserts. For instance, if your data arrives as flat files dropped into a directory via FTP or the like, the daemon can monitor the directory for changes and do the necessary work (as above) when stuff arrives.
One thing to consider, if this is a production system, is that doing massive insert/delete/update statements is likely to cause blocking while the transaction is in-flight. Also, a gigantic transaction failing and rolling back has its own disadvantages:
The rollback can take quite a while to process.
Locks are held for the duration of the rollback, so more opportunity for blocking and other contention in the database.
Worst of all, after everything that happens you've achieved no forward motion, so to speak: a lot of time and effort, and you're right back where you started.
So, depending on your circumstances, you might be better off doing your insert/update/deletes in smaller batches so as to guarantee that you achieve forward progress. 30 million rows over 24 hours works out to be c. 350 per second.
Bulk insert into a holding table, then perform either a single MERGE statement or an UPDATE plus an INSERT statement. Either way, you want to compare your source table to your holding table to see which action to perform.
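One way that can look, as a sketch only: SqlBulkCopy into a staging table, then a single MERGE. The table and column names (dbo.Staging_Rows, dbo.Target, Id, ColA, ColB) are placeholders for your schema:

// Bulk copy the incoming DataTable into an empty staging table, then MERGE it
// into the real table so existing rows are updated and new rows inserted.
using System.Data;
using System.Data.SqlClient;

static void UpsertViaStaging(string connectionString, DataTable incoming)
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();

        // 1. Empty the staging table and bulk load the new data.
        using (var truncate = new SqlCommand("truncate table dbo.Staging_Rows", conn))
            truncate.ExecuteNonQuery();
        using (var bulk = new SqlBulkCopy(conn))
        {
            bulk.DestinationTableName = "dbo.Staging_Rows";
            bulk.BatchSize = 10000;
            bulk.WriteToServer(incoming);
        }

        // 2. One MERGE to apply the delta to the real table.
        using (var merge = new SqlCommand(
            "merge dbo.Target as t " +
            "using dbo.Staging_Rows as s on t.Id = s.Id " +
            "when matched then update set t.ColA = s.ColA, t.ColB = s.ColB " +
            "when not matched then insert (Id, ColA, ColB) values (s.Id, s.ColA, s.ColB);", conn))
        {
            merge.CommandTimeout = 0;   // a large merge can exceed the 30-second default
            merge.ExecuteNonQuery();
        }
    }
}

If this runs against a busy production table, the single big MERGE can be split into smaller batches for the blocking and rollback reasons given in the other answer.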

DataReader and SQLCommand

As they say on the radio - long time listener first time caller....
Here's my issue. VS 2005, SQL Server 2005 database, Windows Forms app, C#. Big table - 780K records. I'll call it the source table. I need to loop through the source table, and for each record do something with another table, then write back to the source table that it was completed. I haven't got as far as updating the second table yet...
I loop through all records of the source table with a DataReader on connection object A. For each record I build an update statement to mark that record as processed in the source table, and use a SqlCommand against connection object B to do this update. So, different connection objects, because I know the DataReader wants exclusive use of its connection.
Here's the issue. After processing X records - where X seems to be about 60 - the update times out.
While writing this - funny how this happens, isn't it - my brain tells me this is to do with transaction isolation and/or locking... i.e. I'm reading through the source records using a DataReader but changing those same records... I can see this causing problems under different transaction isolation levels, so I'll look into that...
Anyone seen this and know how to solve it?
Cheers
Pete
Without more detail, the possibilities for solutions are very numerous. As iivel noted, you could perform all of the activities within the database itself -- if that is possible for the type of operations you must perform. I would just add that it would very likely be best to do such a thing using set-based operations rather than cursors.
If you must perform this operation outside of the database, then your source query can be executed without any locks at all, if that is a safe thing to do in your case. You can do that by setting the isolation level or by simply appending with (nolock) after your table name / alias in the query.
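To make the lock-free read concrete, a sketch along these lines, keeping the question's two-connection setup (SourceTable, its Processed flag and the per-record work are placeholders):

// Read the source rows without shared locks on one connection and send the
// updates on a second connection, so the reader and the updater don't block
// each other. Table and column names are placeholders.
using System.Data;
using System.Data.SqlClient;

static void ProcessSourceTable(string connectionString)
{
    using (var readConn = new SqlConnection(connectionString))    // connection A: reading
    using (var writeConn = new SqlConnection(connectionString))   // connection B: updating
    {
        readConn.Open();
        writeConn.Open();

        using (var select = new SqlCommand(
            "select ID from SourceTable with (nolock) where Processed = 0", readConn))
        using (var update = new SqlCommand(
            "update SourceTable set Processed = 1 where ID = @id", writeConn))
        {
            update.Parameters.Add("@id", SqlDbType.Int);

            using (var reader = select.ExecuteReader())
            {
                while (reader.Read())
                {
                    int id = reader.GetInt32(0);
                    // ... do the per-record work against the other table here ...
                    update.Parameters["@id"].Value = id;
                    update.ExecuteNonQuery();   // mark this row as processed
                }
            }
        }
    }
}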
There are several other options as well. For instance, you could operate in batches, instead, such as pulling 1000 records at a time from the source table into memory, disconnecting, and then performing whatever operations and updates you need. Once all 1000 records are processed, work on another set of 1000 records, and so on, until all records in the queue have been processed.
Sounds like you need to set the CommandTimeout property on the command object.
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlcommand.commandtimeout(v=vs.80).aspx
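For reference, that is a one-property change on the update command; the names below are placeholders and 30 seconds is the default:

// Give the update command more time before ADO.NET throws a timeout exception.
// writeConnection is assumed to be an already-open SqlConnection.
using System.Data.SqlClient;

var updateCommand = new SqlCommand(
    "update SourceTable set Processed = 1 where ID = @id", writeConnection);
updateCommand.CommandTimeout = 120;   // seconds; the default is 30, and 0 means wait indefinitely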
Is there any reason you can't have the select/updates in a cursor and return a final result tally to the app? If the process belongs in the DB, it is best to keep it there.
Alternatively, updating the command timeout as John mentioned is your only other bet.

Inserting Large volume of data in SQL Server 2005

We have an application (written in C#) that stores live stock market prices in the database (SQL Server 2005). It inserts about 1 million records in a single day. Now we are adding some more market segments, and the number of records will double (2 million/day).
Currently the average insertion rate is about 50 records per second, the maximum is 450 and the minimum is 0.
To check certain conditions I have used Service Broker (an asynchronous trigger) on my price table. It is running fine at this time (about 35% CPU utilization).
Now I am planning to create an in-memory dataset of current stock prices, on which we would like to do some simple calculations.
Currently I am using an XML batch insertion method (OPENXML in a stored proc).
I want to hear different members' views on this.
Please share how you would deal with such a situation.
Your question is about reading, but the title implies writing?
When reading, consider (but don't blindly use) temporary tables to cache data if you're going to do some processing. However, by simple calculations I assume you mean aggregates like AVG, MAX etc.?
It would generally be inane to drag data around, cache it in the client and aggregate it there.
For batch uploads:
SqlBulkCopy or similar into a staging table
a single write from the staging table into the final table
For a single-row upload, just insert it.
A million rows a day is a rounding error for what SQL Server (or Oracle, MySQL, DB2, etc.) is capable of.
Example: 35k transactions (not rows) per second

Is it a problem if I query SQL Server 2005 and 2000 again and again?

The Windows app I am constructing is for very low-end machines (a Celeron with at most 128 MB of RAM). Of the following two approaches, which one is best? (I don't want the application to become a memory hog on low-end machines.)
Approach one:
Query the database with Select GUID from Table1 where DateTime <= @givendate, which returns more than 300 thousand records (but only one field, i.e. GUID - 300 thousand GUIDs). Then run a loop over those GUIDs to carry out the next step of the software.
Approach two:
Query the database with Select Top 1 GUID from Table1 where DateTime <= @givendate, again and again, until all 300 thousand records are done. It returns only one GUID at a time, and I can then do the next step of the operation.
Which approach do you suggest will use fewer memory resources? (Speed / performance is not the issue here.)
PS: The database is also on the local machine (MSDE or 2005 Express edition).
I would go with a hybrid approach. I would select maybe 50 records at a time instead of just one. This way, you aren't loading the entire number of records, but you are also drastically reducing the number of calls to the database.
Go with approach 1 and use a SqlDataReader to iterate through the data without eating up memory.
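A small sketch of that, using the table and column names from the question (the ProcessGuid placeholder and connection string are assumptions); the reader streams rows one at a time, so only the current GUID is held in memory:

// Stream the GUIDs with a SqlDataReader instead of loading all 300k into memory.
using System;
using System.Data.SqlClient;

static void ProcessGuids(string connectionString, DateTime givenDate)
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        "select [GUID] from Table1 where [DateTime] <= @givendate", conn))
    {
        cmd.Parameters.AddWithValue("@givendate", givenDate);
        conn.Open();

        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                Guid id = reader.GetGuid(0);   // only one row is materialized at a time
                ProcessGuid(id);               // placeholder for the per-GUID work
            }
        }
    }
}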
If you only have 128 MB of RAM, I think approach 2 would be your best bet... that said, can't you do this set-based with a stored procedure? That way all the processing would happen on the server.
If memory use is a concern, I would consider caching the data to disk locally. You can then read the data from the files using a FileStream object.
Your number 2 solution will be really slow, and put a lot of burden on the db server.
I would use a paging-enabled stored procedure.
I would do it in chunks of 1000 rows and test from there until I find the best-performing batch size.
usp_GetGUIDS @from = 1, @to = 1000
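Client-side, that could look something like the sketch below; usp_GetGUIDS and its @from/@to parameters come from this answer, everything else is a placeholder, and the loop stops when a page comes back short:

// Pull the GUIDs in pages of 1000 via the paged stored procedure suggested above.
using System;
using System.Data;
using System.Data.SqlClient;

static void ProcessInPages(string connectionString)
{
    const int pageSize = 1000;
    int from = 1;

    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        while (true)
        {
            using (var cmd = new SqlCommand("usp_GetGUIDS", conn))
            {
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.Parameters.AddWithValue("@from", from);
                cmd.Parameters.AddWithValue("@to", from + pageSize - 1);

                int rowsInPage = 0;
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        rowsInPage++;
                        Guid id = reader.GetGuid(0);
                        // ... per-GUID work goes here ...
                    }
                }

                if (rowsInPage < pageSize)
                    break;              // last (possibly partial) page reached
            }
            from += pageSize;
        }
    }
}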
This may be a totally inappropriate approach for you, but if you're that worried about performance and your machine is low-spec, I'd try the following:
Move your SQL Server to another machine, as it eats up a lot of resources.
Alternatively, if you don't have that many records, store them as XML or in SQLite and get rid of SQL Server altogether?
