I am looking for advice on a design that allows scalability and reduces contention for the following requirement:
My current environment uses a messaging architecture with queues, Windows services, SQL Server 2014, C#/.NET, etc. We have a requirement to persist data into a staging table, run some validation on that data, and then, once validation is complete, load the data into a final table and do some more calculations on it there. Once this workflow is complete, those specific data rows are cleared from the staging table. A single load into the staging table can be a few thousand rows. Each message in the queue initiates this process, and we expect bursts of around 1,000 to 2,000 messages in short periods throughout the day.
Because we use queues and are trying to build a scalable architecture, within our Windows service we want to read a batch of messages off a queue (say 10 per read) and run a couple of threads in parallel, one per message. Each thread owns a transaction that is responsible for loading the staging table, validating the staged data in process (imagine mapping, enriching, etc.), shifting the data to the final target table, doing some calculations on the final table, and then deleting the related rows in the staging table. Sounds like lots of contention!
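To make that concrete, here is a rough sketch of the per-message pipeline I have in mind (the queue call, connection string and the Load/Validate/Move/Delete helpers are just placeholder names, not real APIs):

    using System.Data.SqlClient;
    using System.Threading.Tasks;

    // Sketch only: one SqlConnection + local SqlTransaction per message, so no MSDTC.
    var messages = queue.ReceiveBatch(10);                   // placeholder: read ~10 messages

    Parallel.ForEach(
        messages,
        new ParallelOptions { MaxDegreeOfParallelism = 2 },  // "a couple of threads"
        msg =>
        {
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                using (var tx = conn.BeginTransaction())
                {
                    try
                    {
                        LoadStaging(conn, tx, msg);        // bulk insert this message's rows
                        ValidateAndEnrich(conn, tx, msg);  // mapping, enrichment, validation
                        MoveToFinal(conn, tx, msg);        // insert into final table + calcs
                        DeleteStagingRows(conn, tx, msg);  // clear only this message's rows
                        tx.Commit();
                    }
                    catch
                    {
                        tx.Rollback();
                        throw;                             // let the service retry / dead-letter
                    }
                }
            }
        });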
Both tables will have a unique index across 3 columns and a PK, but not necessarily a 1-to-1 relationship. There are around 20 columns in the staging table and 15 in the final table, so they are not very wide. I would also prefer not to use MSDTC, as we will be using multiple SQL connections; I realize that in 2008 Microsoft introduced lightweight SQL transactions to reduce that burden.
I have not looked into locking on columnstore tables, and I would prefer not to use dirty reads as a hack.
If you have successfully done something similar to this before I would really like to hear your comments! PS please note SQL 2014, C#, TPL.
Thanks
Related
I'm bulk loading files so they can be tied to internal data. After I join the data, the loaded file data is no longer needed: I load the file, join the data, and basically truncate the file's rows. This is a multi-tenant situation, so data from one job gets truncated while another job's data remains (it all goes into one table, with a job id to track whose data is whose). The first thing I thought would help was temp tables, but all this work is being called by a WCF service that runs under an administrative account and (correct me if I'm wrong) the service keeps reusing the connection pool, so my tempdb table gets dropped between calls. All of this work is organized to return progress to the end user, so I'm finding it hard to batch the bulk operations together.
So here's my question: Is there a way to optimize bulk loads that will eventually get truncated to avoid high index fragmentation?
You could have a fresh partition (or table) for each loading operation. You can efficiently delete the contents of a partition or table.
You could create them as named tables in tempdb if you can tolerate complete data loss at any time (due to unexpected restart or failover which must be assumed to occur at any time).
Creating partitions or tables of course requires DDL rights.
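For example, something along these lines per load job (only a sketch; the work schema, table and column names, and the jobId convention are made up):

    using System;
    using System.Data;
    using System.Data.SqlClient;

    // Sketch: one throwaway work table per load job, so truncation/drop is cheap
    // and index fragmentation never accumulates in a shared table.
    string workTable = "work.Load_" + jobId.ToString("N");   // requires DDL rights

    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();

        // Dedicated, constraint-free table for this job's rows.
        new SqlCommand(
            "CREATE TABLE " + workTable +
            " (FileKey INT NOT NULL, Payload NVARCHAR(400) NULL);", conn).ExecuteNonQuery();

        // Bulk load the file into the per-job table.
        using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = workTable })
        {
            bulk.WriteToServer(fileRows);                     // fileRows: a DataTable placeholder
        }

        // ... join against the internal data here ...

        // Dropping (or truncating) the per-job table leaves the shared tables' indexes untouched.
        new SqlCommand("DROP TABLE " + workTable + ";", conn).ExecuteNonQuery();
    }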
I am storing an entity in two Azure Storage tables. The data is identical in both tables except that the RowKey and PartitionKey are different (this is for querying purposes).
Problem
When inserting into these tables, I need the operation to be transactional - the data must only commit if both inserts are successful.
CloudTable.ExecuteBatch(..) only works when entities belong to the same partition.
Is there no other way of doing this?
Short answer:
Unfortunately, no. Entity batch transactions are only supported against a single table (and a single partition) at a time, and there are some other restrictions as well.
Long answer:
We too have faced a similar problem where we had to insert data across multiple tables. What we did was implement a kind of eventual consistency. Instead of writing data directly into the tables, we write it to a queue and have a background worker role process it. Once the data is written to the queue, we assume it will eventually be persisted (there's also a caching engine involved, updated with the latest data, so the application can read the latest values). The background worker role keeps retrying the inserts (using InsertOrReplace semantics instead of plain Insert), and once all the data is written we simply delete the message from the queue.
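A stripped-down sketch of that pattern using the classic WindowsAzure.Storage SDK (the entity class, keys, table and queue are illustrative; message parsing is omitted):

    using Microsoft.WindowsAzure.Storage.Queue;
    using Microsoft.WindowsAzure.Storage.Table;

    // Illustrative entity; in practice you'd build the two entities with the
    // different PartitionKey/RowKey combinations you need for querying.
    public class ItemEntity : TableEntity
    {
        public ItemEntity() { }
        public ItemEntity(string pk, string rk) : base(pk, rk) { }
        public string Payload { get; set; }
    }

    public static class WriteBehindWorker
    {
        public static void ProcessOne(CloudQueue queue, CloudTable tableA, CloudTable tableB)
        {
            CloudQueueMessage msg = queue.GetMessage();
            if (msg == null) return;

            // Rebuild the two entities from the message (parsing omitted here).
            string payload = msg.AsString;
            var byCustomer = new ItemEntity("customer-123", "order-456") { Payload = payload };
            var byDate     = new ItemEntity("2014-01-01",   "order-456") { Payload = payload };

            // InsertOrReplace makes retries idempotent: if the worker crashed after
            // writing only one table, re-processing the same message is harmless.
            tableA.Execute(TableOperation.InsertOrReplace(byCustomer));
            tableB.Execute(TableOperation.InsertOrReplace(byDate));

            // Only delete the queue message once both writes have succeeded.
            queue.DeleteMessage(msg);
        }
    }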
I have roughly 30M rows per day to insert or update in SQL Server. What are my options?
If I use SqlBulkCopy, does it handle not inserting data that already exists?
In my scenario I need to be able to run this over and over with the same data without duplicating data.
At the moment I have a stored procedure with an update statement and an insert statement which reads data from a DataTable.
What should I be looking for to get better performance?
The usual way to do something like this is to maintain a permanent work table (or tables) that have no constraints on them. Often these might live in a separate work database on the same server.
To load the data, you empty the work tables, blast the data in via BCP/bulk copy. Once the data is loaded, you do whatever cleanup and/or transforms are necessary to prep the newly loaded data. Once that's done, as a final step, you migrate the data to the real tables by performing the update/delete/insert operations necessary to implement the delta between the old data and the new, or by simply truncating the real tables and reloading them.
Another option, if you've got something resembling a steady stream of data flowing in, might be to set up a daemon that monitors for the arrival of data and then does the inserts. For instance, if your data arrives as flat files dropped into a directory via FTP or the like, the daemon can monitor the directory for changes and do the necessary work (as above) when stuff arrives.
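For instance, a bare-bones watcher along these lines (the path, filter and LoadFile step are placeholders):

    using System.IO;

    // Sketch of a drop-folder "daemon"; LoadFile would run the bulk-load,
    // cleanup and migrate steps described above.
    var watcher = new FileSystemWatcher(@"C:\inbound", "*.csv");
    watcher.Created += (sender, e) =>
    {
        // FTP uploads raise Created before the file is fully written, so in practice
        // you would wait/retry until the file can be opened exclusively.
        LoadFile(e.FullPath);
    };
    watcher.EnableRaisingEvents = true;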
One thing to consider, if this is a production system, is that doing massive insert/delete/update statements is likely to cause blocking while the transaction is in-flight. Also, a gigantic transaction failing and rolling back has its own disadvantages:
The rollback can take quite a while to process.
Locks are held for the duration of the rollback, so more opportunity for blocking and other contention in the database.
Worst of all, after all that happens, you've achieved no forward motion, so to speak: a lot of time and effort and you're right back where you started.
So, depending on your circumstances, you might be better off doing your insert/update/deletes in smaller batches so as to guarantee that you achieve forward progress. 30 million rows over 24 hours works out to be c. 350 per second.
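For example, a delete applied in chunks so each transaction stays short (the table, column and variable names here are invented):

    using System.Data.SqlClient;

    // Sketch: delete in batches of 5,000 so locks are held only briefly and a
    // failure loses at most one small batch, not the whole run.
    const string batchedDelete =
        "DELETE TOP (5000) FROM dbo.TargetTable WHERE LoadId = @loadId;";

    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        int affected;
        do
        {
            using (var cmd = new SqlCommand(batchedDelete, conn))
            {
                cmd.Parameters.AddWithValue("@loadId", loadId);
                affected = cmd.ExecuteNonQuery();   // 0 when nothing is left to delete
            }
        } while (affected > 0);
    }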
Bulk insert into a holding table, then perform either a single MERGE statement or an UPDATE and an INSERT statement. Either way you want to compare the newly loaded (holding) data against the target table to see which action to perform for each row.
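Something along these lines, as a sketch only (the table and column names are invented, and you'd tune batch size, hints, etc. for your own workload):

    using System.Data;
    using System.Data.SqlClient;

    // Sketch: bulk copy into a holding table, then one MERGE to upsert into the
    // real table, and a truncate to reset the holding table for the next run.
    const string mergeSql = @"
    MERGE dbo.FinalTable AS t
    USING dbo.HoldingTable AS s
        ON t.BusinessKey = s.BusinessKey
    WHEN MATCHED THEN
        UPDATE SET t.Value = s.Value, t.UpdatedOn = s.UpdatedOn
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (BusinessKey, Value, UpdatedOn)
        VALUES (s.BusinessKey, s.Value, s.UpdatedOn);
    TRUNCATE TABLE dbo.HoldingTable;";

    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();

        using (var bulk = new SqlBulkCopy(conn)
               { DestinationTableName = "dbo.HoldingTable", BatchSize = 5000 })
        {
            bulk.WriteToServer(rows);       // rows: the DataTable you already build today
        }

        using (var cmd = new SqlCommand(mergeSql, conn))
        {
            cmd.CommandTimeout = 0;         // the merge can be long-running
            cmd.ExecuteNonQuery();
        }
    }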
We have an application (written in C#) that stores live stock market prices in the database (SQL Server 2005). It inserts about 1 million records in a single day. We are now adding some more market segments, and the number of records would double (2 million/day).
Currently the average insertion rate is about 50 records per second, the maximum is 450 and the minimum is 0.
To check certain conditions I have used Service Broker (an asynchronous trigger) on my price table. It is running fine at this time (about 35% CPU utilization).
Now I am planning to create an in-memory dataset of current stock prices, on which we would like to do some simple calculations.
Currently I am using an XML batch insertion method (OPENXML in a stored proc).
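Roughly, the batch insert looks like this (the table and column names and the pricesXml variable are just an illustration of the shape, not the real schema):

    using System.Data;
    using System.Data.SqlClient;

    // Sketch of the OPENXML-style batch insert (element-centric XML assumed).
    const string insertBatchSql = @"
    DECLARE @h INT;
    EXEC sp_xml_preparedocument @h OUTPUT, @xml;

    INSERT INTO dbo.Price (Symbol, LastPrice, QuoteTime)
    SELECT Symbol, LastPrice, QuoteTime
    FROM OPENXML(@h, '/Prices/P', 2)
    WITH (Symbol VARCHAR(20), LastPrice DECIMAL(18,4), QuoteTime DATETIME);

    EXEC sp_xml_removedocument @h;";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(insertBatchSql, conn))
    {
        cmd.Parameters.Add("@xml", SqlDbType.NVarChar, -1).Value = pricesXml;
        conn.Open();
        cmd.ExecuteNonQuery();
    }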
I want to know the different views of members on this.
Please share how you deal with such a situation.
Your question is about reading, but the title implies writing?
When reading, consider (but don't blindly use) temporary tables to cache data if you're going to do some processing. However, by "simple calculations" I assume you mean aggregates like AVG, MAX, etc.?
It would generally be a bad idea to drag the data around, cache it in the client, and aggregate it there.
If batch uploads:
SQLBulkCopy or similar to a staging table
Single write from staging to the final table with a set-based INSERT ... SELECT or MERGE
If single upload, just insert it
A million rows a day is a rounding error for what SQL Server (Oracle, MySQL, DB2, etc.) is capable of.
Example: 35k transactions (not rows) per second.
I have one big table (90k rows, roughly 60 MB) which holds info about free room capacities for about 50 hotels. This table has very few updates/inserts per hour.
My application sends async requests to this table (and joined tables) at most 30 times per second.
When I start 30 threads at one time (with the default AppPool class in .NET 3.5 / C#), each with a random valid SQL query string, only a few (roughly 4) are processed asynchronously and the other threads wait. Why?
Is it because of SQL Server 2008 table locking, or because of the .NET side? Or something else?
If it is a SQL problem, would it help if I split this big table into one table per hotel?
My goal is to have at least 10 threads served at a time.
This table is tiny. It doesn't even qualify as a "medium sized" table. It's trivial.
You could be full-table-scanning it 30 times per second, or copying the whole thing into RAM, and no server would be the slightest bit bothered.
If your data fits in RAM, databases are fast. If you aren't seeing that, you're doing something REALLY WRONG. That's why I also think the problems are all on the client side.
It is more than likely on the .NET side. If it were table locking, more threads would be processing, but they would be waiting on their queries to return. If I remember correctly, there's a property on thread pools that controls how many actual threads they create at once; if there are more pending work items than that number, they get in line and wait for running threads to finish. Check that.
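A quick way to inspect those limits (the exact numbers vary by machine and CLR version); note that the ADO.NET connection pool is a separate cap, defaulting to 100 connections per connection string:

    using System;
    using System.Threading;

    // Inspect the thread-pool limits the answer refers to.
    int minWorker, minIo, maxWorker, maxIo, freeWorker, freeIo;
    ThreadPool.GetMinThreads(out minWorker, out minIo);
    ThreadPool.GetMaxThreads(out maxWorker, out maxIo);
    ThreadPool.GetAvailableThreads(out freeWorker, out freeIo);

    Console.WriteLine("Min: {0}/{1}  Max: {2}/{3}  Available: {4}/{5}",
        minWorker, minIo, maxWorker, maxIo, freeWorker, freeIo);

    // If work items are queuing behind the minimum, raising it may help:
    ThreadPool.SetMinThreads(minWorker * 2, minIo * 2);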
Have you tried changing the transaction isolation level?
Even when reading from a table, SQL Server will take locks.
Try setting the isolation level to READ UNCOMMITTED and see if that improves the situation,
but be advised that you may then read "dirty" data; make sure you understand the ramifications if this turns out to be the solution.
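For example, scoped to a single query (the table and column names are invented; the WITH (NOLOCK) hint is the per-table equivalent):

    using System.Data.SqlClient;

    // Sketch: run one query under READ UNCOMMITTED and accept possible dirty reads.
    const string sql =
        "SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; " +
        "SELECT HotelId, RoomType, FreeRooms FROM dbo.RoomCapacity WHERE HotelId = @id;";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn))
    {
        cmd.Parameters.AddWithValue("@id", hotelId);
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                // ... consume the row ...
            }
        }
    }

    // Per-table alternative: SELECT ... FROM dbo.RoomCapacity WITH (NOLOCK) WHERE ...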
Rather than ask, measure. Each SQL query that your application actually submits creates a request on the server, and the sys.dm_exec_requests DMV shows the state of each request. When a request is blocked, the wait_type column shows a non-empty value. From this you can judge whether your requests are blocked or not, and if they are blocked you'll also see the reason why.
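For example (this needs VIEW SERVER STATE permission; connectionString is a placeholder):

    using System;
    using System.Data.SqlClient;

    // Dump the current requests and what, if anything, they are waiting on.
    const string dmvSql = @"
    SELECT session_id, status, command, wait_type, wait_time, blocking_session_id
    FROM sys.dm_exec_requests
    WHERE session_id <> @@SPID;";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(dmvSql, conn))
    {
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                Console.WriteLine("session {0}: {1} {2} wait={3} blocked_by={4}",
                    reader["session_id"], reader["status"], reader["command"],
                    reader["wait_type"], reader["blocking_session_id"]);
            }
        }
    }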