I am storing an entity in two Azure Storage tables. The data is identical in both tables except that the RowKey and PartitionKey are different (this is for querying purposes).
Problem
When inserting into these tables, I need the operation to be transactional - the data must only commit if both inserts are successful.
CloudTable.ExecuteBatch(..) only works when entities belong to the same partition.
Is there no other way of doing this?
Short answer:
Unfortunately, no. Entity batch transactions are supported only within a single table (and a single partition), along with some other restrictions.
Long answer:
We too have faced a similar problem where we had to insert data across multiple tables. What we did was implement a kind of eventual consistency. Instead of writing data directly into the tables, we write it to a queue and have a background worker role process it. Once the data is written to the queue, we assume it will eventually be persisted (a caching engine is also updated with the latest data at this point, so the application can read the latest data right away). The background worker role keeps retrying the inserts (using InsertOrReplace semantics instead of plain Insert), and once all the data is written, we simply delete the message from the queue.
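A rough sketch of that flow, assuming the classic Microsoft.WindowsAzure.Storage SDK and Newtonsoft.Json; the entity type, the queue name, and the way the second table's keys are derived are all illustrative:

```csharp
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Queue;
using Microsoft.WindowsAzure.Storage.Table;
using Newtonsoft.Json;

public class OrderEntity : TableEntity   // illustrative entity
{
    public string Data { get; set; }
}

public static class EventualWriter
{
    // Front end: enqueue the payload instead of writing the two tables directly.
    public static Task EnqueueAsync(CloudQueue queue, OrderEntity entity)
    {
        return queue.AddMessageAsync(new CloudQueueMessage(JsonConvert.SerializeObject(entity)));
    }

    // Background worker: retry until both tables are written, then delete the message.
    public static async Task ProcessNextAsync(CloudQueue queue, CloudTable firstTable, CloudTable secondTable)
    {
        CloudQueueMessage message = await queue.GetMessageAsync();
        if (message == null) return;

        OrderEntity source = JsonConvert.DeserializeObject<OrderEntity>(message.AsString);

        // InsertOrReplace makes each write idempotent, so retries after a partial failure are safe.
        var first  = new OrderEntity { PartitionKey = source.PartitionKey, RowKey = source.RowKey, Data = source.Data };
        var second = new OrderEntity { PartitionKey = source.RowKey, RowKey = source.PartitionKey, Data = source.Data }; // keys swapped purely for illustration

        await firstTable.ExecuteAsync(TableOperation.InsertOrReplace(first));
        await secondTable.ExecuteAsync(TableOperation.InsertOrReplace(second));

        // Only after both writes succeed is the message removed; otherwise it becomes
        // visible again after the visibility timeout and is retried.
        await queue.DeleteMessageAsync(message);
    }
}
```

The important property is that the queue message is only deleted after both writes succeed, so a crash between the two inserts just means the message is processed again.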
Related
I need to create a WebJob that handles some work that is not super time-sensitive. I use both DocumentDb and Azure Table Storage, so I deal with denormalized data that needs to be handled by some backend process to keep it consistent.
I have multiple uses for this WebJob so I'm trying to figure out the best way to pass data to it.
Which is the right approach for sending requests to my WebJob?
Approach #1: Create a generic object with properties that can store the data while the request is in the queue. In other words, some type of container that I use for transporting data through the queue.
Approach #2: Persist the data in some backend database and send a simple command via the queue.
One example I can think of is that I have a list of 15 entities that I need to update in my Azure Table Storage. This may take multiple read/write operations which will take time.
If I use approach #1, I'd "package" the list of 15 entities in an object and put it in a message in my queue. Depending on the situation, some messages may get a bit fat, which concerns me.
If I use approach #2, I'd save the IDs of the entities in a table somewhere (SQL or Azure Table Storage) and send some type of batch ID via a message. My WebJob would then receive the batch ID, retrieve the entities, and process them. Though this approach seems like the better one, I'm afraid it will be pretty slow.
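To illustrate approach #2 as I picture it, here is a hypothetical sketch assuming the WebJobs SDK storage bindings; the queue, tables, entity types, and property names are all invented:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.WindowsAzure.Storage.Table;

public class BatchItem : TableEntity   // PartitionKey = batch id, RowKey = item id
{
    public string TargetPartitionKey { get; set; }
    public string TargetRowKey { get; set; }
}

public static class Functions
{
    public static async Task ProcessBatch(
        [QueueTrigger("pending-batches")] string batchId,   // tiny message: just the batch id
        [Table("BatchItems")] CloudTable batchItems,         // tracking table holding the ~15 keys
        [Table("Entities")] CloudTable entities)             // the real data to update
    {
        var query = new TableQuery<BatchItem>().Where(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, batchId));

        TableContinuationToken token = null;
        do
        {
            TableQuerySegment<BatchItem> segment = await batchItems.ExecuteQuerySegmentedAsync(query, token);
            token = segment.ContinuationToken;

            foreach (BatchItem item in segment.Results)
            {
                // Fetch and update the target entity; the "update" here is purely illustrative.
                var fetch = await entities.ExecuteAsync(
                    TableOperation.Retrieve<DynamicTableEntity>(item.TargetPartitionKey, item.TargetRowKey));

                if (fetch.Result is DynamicTableEntity entity)
                {
                    entity.Properties["Processed"] = EntityProperty.GeneratePropertyForBool(true);
                    await entities.ExecuteAsync(TableOperation.Merge(entity));
                }
            }
        } while (token != null);
    }
}
```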
Please keep in mind that the primary use of this particular WebJob is to speed up response times for end users in situations that require multiple backend operations. I'm trying to handle them in a WebJob so that what's time-sensitive gets processed right away and the other "not-so-time-sensitive" operations can be handled by the WebJob.
I want my solution to be both very robust and as fast as possible -- though the job is not highly time sensitive, I still want to process the backend job quickly.
I am looking for advice on a design to effectively allow scalability and reduce contention for my requirement:
My current environment uses a messaging architecture with queues, Windows services, SQL Server 2014, C#/.NET, etc. We have a requirement to persist data into a staging table, do some validation on this data, and then, once validation is complete, load the data into a final table and do some more calculations on it there; once this workflow is complete, clear those specific data rows from the staging table. The load into the staging table can be a few thousand rows per load. Each message in the queue initiates this process, and we expect bursts of around 1,000 to 2,000 messages in short periods throughout the day.
Due to the nature of using queues and trying to build a scalable architecture, within our Windows service we want to read a batch of messages off a queue, say 10 per read, and run a couple of threads in parallel, one per message, with each thread owning a transaction that is responsible for loading the staging table, validating the staging data in process (imagine mapping, enriching, etc.), shifting the data to the final target table, doing some calculations on the final table, and then deleting the relevant rows in the staging table. Sounds like lots of contention!
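To make the intended per-message flow concrete, here is a rough sketch of what each worker thread would do; the table names and stored procedures are invented placeholders, and a real load would likely use SqlBulkCopy on the same connection and transaction:

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading.Tasks;

public static class StagingWorker
{
    // batchIds: one queue message (here reduced to a batch id) per unit of work.
    public static void ProcessBatch(IEnumerable<string> batchIds, string connectionString)
    {
        Parallel.ForEach(batchIds, new ParallelOptions { MaxDegreeOfParallelism = 10 }, batchId =>
        {
            // One connection and one local transaction per message keeps MSDTC out of the picture.
            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();
                using (SqlTransaction tx = connection.BeginTransaction())
                {
                    try
                    {
                        // 1. Load this message's rows into the staging table (hypothetical proc).
                        Execute(connection, tx, "EXEC dbo.LoadStaging @BatchId = @id", batchId);

                        // 2. Validate/enrich and move the rows into the final table (hypothetical proc).
                        Execute(connection, tx, "EXEC dbo.ValidateAndMoveToFinal @BatchId = @id", batchId);

                        // 3. Remove only this batch's rows from staging.
                        Execute(connection, tx, "DELETE FROM dbo.Staging WHERE BatchId = @id", batchId);

                        tx.Commit();
                    }
                    catch
                    {
                        tx.Rollback();
                        throw; // let the caller decide whether to re-queue the message
                    }
                }
            }
        });
    }

    private static void Execute(SqlConnection connection, SqlTransaction tx, string sql, string batchId)
    {
        using (var cmd = new SqlCommand(sql, connection, tx))
        {
            cmd.Parameters.AddWithValue("@id", batchId);
            cmd.ExecuteNonQuery();
        }
    }
}
```

Because every statement filters on the batch ID, each thread touches a disjoint set of rows, which limits (though does not eliminate) lock contention.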
Both tables will have a unique index across 3 columns and a PK, but not necessarily a 1-to-1 relationship. Around 20 columns in staging and 15 in the final table, so not very wide. I would also prefer not to use MSDTC, as we will be using multiple SQL connections; I realize Microsoft introduced lightweight (promotable) transactions to reduce that burden.
I have not looked into locking on columnstore tables and would prefer not to use dirty reads as a hack.
If you have successfully done something similar to this before, I would really like to hear your comments! PS: please note SQL Server 2014, C#, TPL.
Thanks
I'm building a program that takes push data from six different sources and inserts the data into a database. Each source has its own function to execute the inserts as soon as they come, but all sources write to the same table.
I would have the following questions:
If one source is currently writing to the table and another source begins to write at the same time is there any chance the inserts will block each other?
The table is also constantly being used to read the data via a view that joins some more tables to show the data; can this pose any problems?
Currently each source has its own DB connection to write data, would it be better to have only one connection, or have each use its own?
If one source is currently writing to the table and another source begins to write at the same time is there any chance the inserts will block each other?
It depends on the indexes. If the index keys have the same or contiguous values, you may see short-term blocking for the duration of the transaction.
The table is also constantly being used to read the data via a view that join some more tables to show the data, can this pose any problems?
It depends on the isolation level. No blocking will occur if any of the following is true:
the SELECT queries run at the READ_COMMITTED isolation level and the READ_COMMITTED_SNAPSHOT database option is turned on,
the SELECT queries don't touch the uncommitted (newly inserted) rows, or
the SELECT queries run at the READ_UNCOMMITTED isolation level.
Even if blocking does occur, it may be short-lived if the INSERT transactions are short.
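If you want to go the READ_COMMITTED_SNAPSHOT route, a one-off check and enable might look like this sketch; the database name is illustrative, and flipping the option needs either exclusive access or WITH ROLLBACK IMMEDIATE:

```csharp
using System.Data.SqlClient;

public static class IsolationSetup
{
    public static void EnableReadCommittedSnapshot(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // Report the current setting first.
            using (var check = new SqlCommand(
                "SELECT is_read_committed_snapshot_on FROM sys.databases WHERE name = 'MyAppDb'", connection))
            {
                if ((bool)check.ExecuteScalar()) return; // already on
            }

            // Flip the option; readers then see the last committed row version instead of blocking on writers.
            using (var enable = new SqlCommand(
                "ALTER DATABASE MyAppDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE", connection))
            {
                enable.ExecuteNonQuery();
            }
        }
    }
}
```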
Currently each source has its own DB connection to write data, would it be better to have only one connection, or have each use its own?
It depends on the problem you are trying to solve. A single connection will ensure the inserts don't block or deadlock each other, but as noted above, blocking might not be an issue anyway.
Please find the inline answers below.
If one source is currently writing to the table and another source begins to write at the same time is there any chance the inserts will block each other?
In this case the other source will wait (its insert will be in a waiting state until the first one completes).
The table is also constantly being used to read the data via a view that join some more tables to show the data, can this pose any problems?
No problem.
Currently each source has its own DB connection to write data, would it be better to have only one connection, or have each use its own?
It's better to have one DB connection.
Block "each other" i.e. dead-lock is not possible.
No problem. Only if select is too slow, it can delay next insert.
No problem with different connections.
I'm bulk loading files so they can be tied to internal data. After I join the data, the use of the database is moot. I load the file, join the data, and basically truncate that file's rows. This is a multi-tenant situation, so data from one job gets truncated while another job's data remains (all in one table, with a job ID to manage whose data is whose). Now, the first thing I thought would help with all this was temp tables, but all this work is being called by a WCF service that runs under an administrative account, and (correct me if I'm wrong) the service keeps reusing pooled connections, so my tempdb table just gets dropped between calls. All this work is organized to return progress to the end user, so I'm finding it hard to group the bulk operations together.
So here's my question: Is there a way to optimize bulk loads that will eventually get truncated to avoid high index fragmentation?
You could have a fresh partition (or table) for each loading operation. You can efficiently delete the contents of a partition or table.
You could create them as named tables in tempdb if you can tolerate complete data loss at any time (due to an unexpected restart or failover, which must be assumed possible at any time).
Creating partitions or tables of course requires DDL rights.
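As a sketch of the per-load table idea (table naming and columns are invented, the job ID is assumed to be an integer, and the caller needs DDL rights):

```csharp
using System.Data.SqlClient;

public static class PerJobStaging
{
    public static string CreateStagingTable(SqlConnection connection, int jobId)
    {
        // One physical table per job; the same pattern works for a named table in tempdb.
        string tableName = $"dbo.Staging_Job{jobId}";
        using (var cmd = new SqlCommand(
            $"CREATE TABLE {tableName} (RowId INT IDENTITY PRIMARY KEY, ExternalKey NVARCHAR(100), Payload NVARCHAR(MAX))",
            connection))
        {
            cmd.ExecuteNonQuery();
        }
        return tableName; // bulk load into this table, join against internal data, then drop it
    }

    public static void DropStagingTable(SqlConnection connection, string tableName)
    {
        // Dropping (or truncating) deallocates the pages outright, so no index
        // fragmentation is left behind for the other tenants sharing the database.
        using (var cmd = new SqlCommand($"DROP TABLE {tableName}", connection))
        {
            cmd.ExecuteNonQuery();
        }
    }
}
```

Clearing one tenant's data this way avoids the mass DELETEs that fragment a shared table's indexes.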
One of my co-workers is building a C# windows app that needs to grab a set of data and then row-by-row alter that data. If the system encounters a problem at any step in the process, it needs to roll back all of the changes.
The process he has created works well when dealing with smaller sets of data, but as soon as the number of rows get larger, it starts to puke.
The process of altering the data needs to happen in the windows app. What is the best way to handle large data changes atomically in a windows app?
Thank you.
Edit-
We are using a background thread with this process.
I apologize; I don't think I articulated the quandary we're in. We are using transactions with the system right now, but I don't know if we're using them effectively, so I'll definitely review the notes below.
We were thinking that we could spin off additional worker threads to get the work done more quickly, but assumed we would lose our atomic capabilities. Then we were thinking that maybe we could pull all the data into a DataTable, make changes in that object, and then persist the data to the database.
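Roughly what I had in mind, as a sketch only; the table, columns, and connection details are made up, and it needs a reference to System.Transactions:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using System.Transactions;

public static class BulkAlter
{
    public static void AlterAll(string connectionString)
    {
        using (var scope = new TransactionScope(TransactionScopeOption.Required,
                                                new TransactionOptions { Timeout = TimeSpan.FromMinutes(10) }))
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open(); // enlists in the ambient transaction

            var adapter = new SqlDataAdapter(
                "SELECT Id, Amount, Status FROM dbo.Items WHERE Status = 'Pending'", connection);
            var builder = new SqlCommandBuilder(adapter); // generates the UPDATE command from the SELECT

            var table = new DataTable();
            adapter.Fill(table);

            // Row-by-row alteration happens in memory.
            foreach (DataRow row in table.Rows)
            {
                row["Amount"] = (decimal)row["Amount"] * 1.1m; // illustrative change
                row["Status"] = "Processed";
            }

            // One round of updates; if anything throws before Complete(), everything rolls back.
            adapter.Update(table);
            scope.Complete();
        }
    }
}
```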
So I was just going to see if someone had a brilliant way to handle this type of situation. Thanks for the comments so far.
I would suggest that you take a look at .NET's transactional model.
Here is some additional reading that may be helpful as well:
Writing a Transactional Application
Doing Transactions in C# and ADO.NET
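As a bare-bones illustration of an ADO.NET transaction along the lines of those articles (the connection string, table, and SQL are placeholders):

```csharp
using System.Data.SqlClient;

public static class AtomicUpdate
{
    public static void Run(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (SqlTransaction tx = connection.BeginTransaction())
            {
                try
                {
                    using (var cmd = new SqlCommand(
                        "UPDATE dbo.Items SET Status = 'Processed' WHERE Status = 'Pending'", connection, tx))
                    {
                        cmd.ExecuteNonQuery();
                    }
                    // ...more commands, all attached to the same transaction...

                    tx.Commit();   // all changes become visible at once
                }
                catch
                {
                    tx.Rollback(); // nothing is persisted if any step fails
                    throw;
                }
            }
        }
    }
}
```

The key point is that every command in the row-by-row process must be attached to the same transaction, and the commit happens only once, at the very end.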
As the problem is not described in much detail in this question, I have to start with the "Is it plugged in?" type of suggestions.
If you haven't done so already, potentially long-running operations should be spun off into their own separate BackgroundWorker thread, rather than blocking the UI thread.
We had a similar situation in dealing with mass updates of a large database. The solution we used was:
Step through each record in the database. If it needs updating, create a dummy record with a new primary key.
Update the dummy record from the original, including the original's PK in the dummy.
When you have created dummy records for all updates, lock the database and replace all foreign keys of the updated records with the primary key of the dummy records. Unlock the database.
Delete all records with the old primary keys.
Note this technique works best if there is only a single reference to the primary key.
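A rough sketch of the technique against invented tables (Customers with an identity PK CustomerId plus a nullable OriginalCustomerId column, and Orders with a CustomerId foreign key); the "lock" step here is simply the transaction that performs the swap:

```csharp
using System.Data;
using System.Data.SqlClient;

public static class DummyRecordSwap
{
    public static void ApplyMassUpdate(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // Steps 1-2: create the dummy rows with new keys, carrying the original key along.
            // The WHERE clause stands in for "needs updating" and the UPPER() call for the actual change.
            Run(connection, null,
                @"INSERT INTO dbo.Customers (Name, Region, OriginalCustomerId)
                  SELECT UPPER(Name), Region, CustomerId
                  FROM dbo.Customers
                  WHERE OriginalCustomerId IS NULL AND Region = 'EU'");

            // Steps 3-4: repoint the foreign keys and delete the old rows, atomically.
            using (SqlTransaction tx = connection.BeginTransaction(IsolationLevel.Serializable))
            {
                Run(connection, tx,
                    @"UPDATE o SET o.CustomerId = c.CustomerId
                      FROM dbo.Orders o
                      JOIN dbo.Customers c ON c.OriginalCustomerId = o.CustomerId");
                Run(connection, tx,
                    @"DELETE src FROM dbo.Customers src
                      JOIN dbo.Customers dup ON dup.OriginalCustomerId = src.CustomerId");
                tx.Commit();
            }
        }
    }

    private static void Run(SqlConnection connection, SqlTransaction tx, string sql)
    {
        using (var cmd = new SqlCommand(sql, connection, tx))
        {
            cmd.ExecuteNonQuery();
        }
    }
}
```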