Bulk Loads, Indexes and Data Truncation: How do you optimize? - c#

I'm bulk loading files so they can be tied to internal data. After I join the data, the loaded copy in the database is of no further use: I load the file, join the data, and then essentially truncate what was loaded. This is a multi-tenant situation, so data from one job gets truncated while another job's data remains (everything goes into one table, with a job id to track who owns which rows). The first thing I thought would help was temp tables, but all this work is called by a WCF service that runs under an administrative account and (correct me if I'm wrong) the service keeps using the connection pool, so my tempdb table just gets dropped between calls. All this work is organized to return progress to the end user, so I'm finding it hard to batch the operations together.
So here's my question: Is there a way to optimize bulk loads that will eventually get truncated to avoid high index fragmentation?

You could have a fresh partition (or table) for each loading operation. You can efficiently delete the contents of a partition or table.
You could create them as named tables in tempdb if you can tolerate complete data loss at any time (due to unexpected restart or failover which must be assumed to occur at any time).
Creating partitions or tables of course requires DDL rights.
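A minimal sketch of the table-per-load idea follows. It uses Python's built-in sqlite3 module purely as a stand-in for SQL Server, and the table naming scheme, column names and helper functions are hypothetical. The point is only that each job gets its own table, so cleanup is a single DROP rather than a large DELETE against a shared, indexed table.

```python
import sqlite3

def staging_table_name(job_id: int) -> str:
    # Build the name from an integer only, so it is safe to splice into DDL.
    return f"load_job_{int(job_id)}"

def load_job(conn, job_id, rows):
    """Create a fresh table for this job and load its rows into it."""
    table = staging_table_name(job_id)
    conn.execute(f"CREATE TABLE {table} (ext_key TEXT PRIMARY KEY, payload TEXT)")
    conn.executemany(f"INSERT INTO {table} (ext_key, payload) VALUES (?, ?)", rows)
    conn.commit()
    return table

def finish_job(conn, job_id):
    """Discard the whole table once the join is done; other jobs are untouched."""
    conn.execute(f"DROP TABLE IF EXISTS {staging_table_name(job_id)}")
    conn.commit()

conn = sqlite3.connect(":memory:")
load_job(conn, 1, [("a", "x"), ("b", "y")])
load_job(conn, 2, [("c", "z")])
finish_job(conn, 1)  # job 1 is done; job 2's data remains

remaining = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
```

On SQL Server the equivalent would be a per-job table or partition that is dropped, truncated or switched out, which is why DDL rights are needed.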

Related

Safe pagination without a rowlock

I'm working with SQL Server and Entity Framework. I need to fetch a large number of records and process them.
I'm afraid of "out-of-memory" or other performance problems, so I want to implement the fetching and processing in batches. The problem is that between each fetch, the underlying data in the database might change, resulting in omitting elements (a record is removed, then next offset is applied to the select operation but the data in the database has 'moved to the left' and the first item of the next page is omitted).
This is important because the purpose of the select is not just presentation but further processing.
I thought about setting the transaction's isolation level to repeatable read to avoid changes coming from other users, but the processing takes a long time and that would block changes to the table for the whole duration.
I also thought about paginating by key rather than by offset (SQL where/limit). This way no data that was in the db initially could be omitted; only changes that happen after the pagination has started might or might not be picked up. However, the user is told the number of items to be processed at the beginning of the process, and this number might turn out to be incorrect, which we only learn at the end of the long process.
What would your advice be? Am I missing something?
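The key-based pagination mentioned in the question can be sketched roughly as below. This is an illustration only: it uses sqlite3 rather than SQL Server and Entity Framework, and the `items` table, its columns and the `iter_batches` helper are assumptions made for the example.

```python
import sqlite3

def iter_batches(conn, batch_size):
    """Yield batches ordered by id, resuming after the last id seen."""
    last_id = 0  # assumes ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(1, 8)])

seen = []
for n, batch in enumerate(iter_batches(conn, 3)):
    seen.extend(row[0] for row in batch)
    if n == 0:
        # Simulate another user deleting an already-processed row mid-run.
        conn.execute("DELETE FROM items WHERE id = 2")
```

Because each page starts from the last key seen rather than from a row count, deleting an already-processed row does not shift the next page. With an offset of 3, the same delete would have caused row 4 to be skipped.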

Effective transaction management

I am looking for advice on a design to effectively allow scalability and reduce contention for my requirement:
My current environment uses a messaging architecture with queues, Windows services, SQL Server 2014, C#/.NET, etc. We have a requirement to persist data into a staging table, validate that data, and, once validation is complete, load the data into a final table and do some more calculations on the data in the final table. Once this workflow is complete, we clear those specific rows from the staging table. A load into the staging table can be a few thousand rows. Each message in the queue initiates this process, and we expect bursts of around 1000 to 2000 messages in short periods throughout the day.
Due to the nature of using queues and trying to build a scalable architecture, within our Windows service we want to read a batch of messages off a queue (say 10 per read) and run several threads in parallel, one per message. Each thread owns a transaction that is responsible for loading the staging table, validating the staging data in process (imagine mapping, enriching, etc.), shifting the data to the final target table, doing some calculations on the final table, and then deleting the relevant rows in the staging table. Sounds like lots of contention!
Both tables will have a unique index across 3 columns and a PK, but not necessarily a 1-to-1 relationship. There are around 20 columns in staging and 15 in the final table, so not very wide... I would also prefer not to use MSDTC, as we will be using multiple SQL connections. I realize that in 2008 Microsoft introduced lightweight SQL transactions to reduce the burden.
I have not looked into locking on columnar store tables and would prefer not to use dirty reads as a hack.
If you have successfully done something similar to this before I would really like to hear your comments! PS please note SQL 2014, C#, TPL.
Thanks

Concurrent inserts into database

I'm building a program that takes push data from six different sources and inserts the data into a database. Each source has its own function to execute the inserts as soon as they come, but all sources write to the same table.
I would have the following questions:
If one source is currently writing to the table and another source begins to write at the same time is there any chance the inserts will block each other?
The table is also constantly being used to read the data via a view that joins some more tables to show the data, can this pose any problems?
Currently each source has its own DB connection to write data, would it be better to have only one connection, or have each use its own?
If one source is currently writing to the table and another source
begins to write at the same time is there any chance the inserts will
block each other?
It depends on the indexes. If the index keys have the same or contiguous values, you may see short-term blocking for the duration of the transaction.
The table is also constantly being used to read the data via a view
that join some more tables to show the data, can this pose any
problems?
It depends on the isolation level. No blocking will occur if any of the following is true:
the SELECT queries run at the READ COMMITTED isolation level and the READ_COMMITTED_SNAPSHOT database option is turned on
the SELECT queries don't touch uncommitted data
the SELECT queries run at the READ UNCOMMITTED isolation level
Even if blocking does occur, it may be short-lived if the INSERT transactions are short.
Currently each source has its own DB connection to write data, would
it be better to have only one connection, or have each use its own?
It depends on the problem you are trying to solve. A single connection will ensure inserts don't block or deadlock with each other, but that might not be an issue anyway.
Please find my answers inline below.
If one source is currently writing to the table and another source begins to write at the same time is there any chance the inserts will block each other?
In this case the other source will wait (its insert stays in a waiting state until the first one finishes).
The table is also constantly being used to read the data via a view that joins some more tables to show the data, can this pose any problems?
No problem.
Currently each source has its own DB connection to write data, would it be better to have only one connection, or have each use its own?
It's better to have one DB connection.
Blocking "each other", i.e. a deadlock, is not possible.
No problem. Only if the select is too slow can it delay the next insert.
No problem with different connections.

Transactional inserts across table partitions

I am storing an entity in two Azure Storage tables. The data is identical in both tables except that the RowKey and PartitionKey are different (this is for querying purposes).
Problem
When inserting into these tables, I need the operation to be transactional - the data must only commit if both inserts are successful.
CloudTable.ExecuteBatch(..) only works when entities belong to the same partition.
Is there no other way of doing this?
Short answer:
Unfortunately no. Entity batch transactions are supported only within a single table (and, as you noted, a single partition), and they are subject to some other restrictions as well.
Long answer:
We faced a similar problem, where we had to insert data across multiple tables. What we did was implement a form of eventual consistency. Instead of writing data directly into the tables, we write it to a queue and have a background worker role process it. Once the data is written to the queue, we assume it will eventually be persisted (a caching engine is also involved, which is updated with the latest data at this point so that the application can get the latest data). The background worker role keeps retrying the inserts (using InsertOrReplace semantics rather than plain Insert), and once all the data is written, we simply delete the message from the queue.
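The queue-and-retry pattern described above might look roughly like the sketch below. Everything in it is a hypothetical in-memory stand-in: `FlakyTable` is not the Azure Storage API, and the message fields are made up. It only illustrates the two ideas that make the pattern work: the writes are idempotent (insert-or-replace), and the message is deleted only after both writes succeed.

```python
from collections import deque

class FlakyTable:
    """Hypothetical in-memory stand-in for a storage table."""
    def __init__(self, failures=0):
        self.rows = {}
        self.failures = failures  # how many writes fail before one succeeds

    def insert_or_replace(self, partition_key, row_key, data):
        if self.failures > 0:
            self.failures -= 1
            raise IOError("simulated transient failure")
        # Overwriting an existing row makes retries harmless.
        self.rows[(partition_key, row_key)] = data

def process_queue(queue, by_customer, by_date, max_attempts=5):
    while queue:
        msg = queue[0]  # peek only; the message stays queued until both writes succeed
        for _ in range(max_attempts):
            try:
                by_customer.insert_or_replace(msg["customer"], msg["id"], msg)
                by_date.insert_or_replace(msg["date"], msg["id"], msg)
            except IOError:
                continue  # retry both writes; replaying the first one is safe
            queue.popleft()  # both tables are written, so delete the message
            break
        else:
            raise RuntimeError("giving up; message left on the queue")

queue = deque([{"id": "42", "customer": "acme", "date": "2015-01-01"}])
by_customer, by_date = FlakyTable(), FlakyTable(failures=2)
process_queue(queue, by_customer, by_date)
```

Between the first and second write the two tables briefly disagree, which is the eventual-consistency trade-off the answer describes.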

I have roughly 30M rows to insert/update in SQL Server per day; what are my options?

If I use SqlBulkCopy, does it handle not inserting data that already exists?
In my scenario I need to be able to run this over and over with the same data without duplicating data.
At the moment I have a stored procedure with an update statement and an insert statement which read data from a DataTable.
What should I be looking for to get better performance?
The usual way to do something like this is to maintain a permanent work table (or tables) that have no constraints on them. Often these might live in a separate work database on the same server.
To load the data, you empty the work tables, blast the data in via BCP/bulk copy. Once the data is loaded, you do whatever cleanup and/or transforms are necessary to prep the newly loaded data. Once that's done, as a final step, you migrate the data to the real tables by performing the update/delete/insert operations necessary to implement the delta between the old data and the new, or by simply truncating the real tables and reloading them.
Another option, if you've got something resembling a steady stream of data flowing in, might be to set up a daemon to monitor for the arrival of data and then do the inserts. For instance, if your data arrives as flat files dropped into a directory via FTP or the like, the daemon can monitor the directory for changes and do the necessary work (as above) when files arrive.
One thing to consider, if this is a production system, is that doing massive insert/delete/update statements is likely to cause blocking while the transaction is in-flight. Also, a gigantic transaction failing and rolling back has its own disadvantages:
The rollback can take quite a while to process.
Locks are held for the duration of the rollback, so more opportunity for blocking and other contention in the database.
Worst, after all that happens, you've achieved no forward motion, so to speak: a lot of time and effort and you're right back where you started.
So, depending on your circumstances, you might be better off doing your insert/update/deletes in smaller batches so as to guarantee that you achieve forward progress. 30 million rows over 24 hours works out to be c. 350 per second.
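As a rough illustration of the smaller-batches idea, the sketch below commits each batch in its own transaction, so a failure rolls back only the batch in flight. It uses sqlite3 as a stand-in for SQL Server; the `target` table, the batch size and the `apply_in_batches` helper are assumptions for the example. The last line reproduces the rate arithmetic above.

```python
import sqlite3

def apply_in_batches(conn, rows, batch_size):
    """Upsert rows in small transactions so earlier batches survive a later failure."""
    committed = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        with conn:  # commits on success, rolls back only this batch on error
            conn.executemany(
                "INSERT OR REPLACE INTO target (id, value) VALUES (?, ?)", batch)
        committed += len(batch)
    return committed

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT)")
rows = [(i, f"v{i}") for i in range(10)]

first_run = apply_in_batches(conn, rows, batch_size=4)
second_run = apply_in_batches(conn, rows, batch_size=4)  # re-running adds no duplicates
row_count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]

# 30 million rows spread evenly over 24 hours:
rows_per_second = 30_000_000 / (24 * 60 * 60)
```

The batch size is a tuning knob: larger batches mean fewer round trips but longer-held locks and more work lost on a rollback.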
Bulk insert into a holding table, then perform either a single MERGE statement or an UPDATE followed by an INSERT. Either way, you compare the holding table against your real table to decide which action to perform for each row.
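A small sketch of the holding-table approach, using the UPDATE-then-INSERT variant. It runs on sqlite3 as a stand-in, so it shows neither SqlBulkCopy nor T-SQL MERGE, and the `holding` and `target` table layouts are assumptions. Running it repeatedly with the same data leaves the target unchanged, which is the re-runnable behaviour the question asks for.

```python
import sqlite3

def load_and_merge(conn, rows):
    with conn:  # one transaction for the whole load-and-merge step
        # 1. Empty the constraint-free holding table and load the new data.
        conn.execute("DELETE FROM holding")
        conn.executemany("INSERT INTO holding (id, value) VALUES (?, ?)", rows)
        # 2. Update rows that already exist in the target.
        conn.execute(
            "UPDATE target SET value = "
            "(SELECT h.value FROM holding h WHERE h.id = target.id) "
            "WHERE id IN (SELECT id FROM holding)")
        # 3. Insert rows that are not in the target yet.
        conn.execute(
            "INSERT INTO target (id, value) "
            "SELECT h.id, h.value FROM holding h "
            "WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.id = h.id)")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("CREATE TABLE holding (id INTEGER, value TEXT)")  # no constraints
conn.execute("INSERT INTO target VALUES (1, 'old')")
conn.commit()

incoming = [(1, "new"), (2, "two")]
load_and_merge(conn, incoming)
load_and_merge(conn, incoming)  # second run with the same data changes nothing

result = conn.execute("SELECT id, value FROM target ORDER BY id").fetchall()
```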
