When NOT to use transactions? [duplicate] - c#

This question already has answers here:
How to Decide to use Database Transactions
I used to rarely use transactions until, all of a sudden, I was faced with a scenario that touched a lot of the database, so I panicked. Since then I have started wrapping in a transaction any logic that involves data manipulation statements (Insert, Update, Delete) and could end in unexpected failures or exceptions.
Here is the model that I keep on using:
using (var db = new X_Entity())
{
    using (var trans = db.Database.BeginTransaction())
    {
        try
        {
            #region Logic
            // Logic that might include at least one data manipulation statement,
            // e.g. db.Insert(), db.Update(), db.Delete()
            #endregion
            db.SaveChanges();
            trans.Commit();
        }
        catch (Exception exc)
        {
            trans.Rollback();
            HandleExceptions(exc);
        }
    }
}
Say there is a single Update, Insert, or Delete statement: should a transaction be used in that case? My understanding is that transactions are unnecessary in the following cases:
Get/Select/Join statements, etc.
When the data manipulation statement is not followed by error-prone logic
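(For illustration, a single-statement version without an explicit transaction might look like the sketch below; EF's SaveChanges already runs its changes inside its own implicit transaction. The Orders set and Order entity are made-up names, not from my actual code.)

using (var db = new X_Entity())
{
    // Hypothetical entity set; SaveChanges runs this single insert
    // inside EF's own implicit transaction, so BeginTransaction is skipped.
    db.Orders.Add(new Order { Total = 42m });
    db.SaveChanges();
}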

Transactions should be used to prevent your database changes from falling into an invalid state. What counts as an invalid state depends on your domain. A transaction is needed when more than one action is required to complete the change. For Get/Select there is no point in using transactions because you are not altering the state of the data; even if the query fails, the data remains in a valid state.
A simple example is a money transfer in a banking app. Let's say you transfer $100 to your friend; for this action to be complete (in a very simple scenario):
$100 must be deducted from your bank account
$100 must be added to your recipient's account
Unless both of these steps succeed atomically, the data ends up in an invalid state. To prevent that, you should use a transaction or transaction scope: when both steps are wrapped in a transaction, either both succeed or both fail.
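A minimal sketch of that transfer in code (hypothetical Account entity, BankContext, and account ids; the EF 6 BeginTransaction pattern from the question is reused):

using (var db = new BankContext())
using (var trans = db.Database.BeginTransaction())
{
    try
    {
        var sender = db.Accounts.Single(a => a.Id == senderId);
        var recipient = db.Accounts.Single(a => a.Id == recipientId);

        sender.Balance -= 100m;     // step 1: deduct from your account
        recipient.Balance += 100m;  // step 2: credit your recipient

        db.SaveChanges();
        trans.Commit();             // both steps persist together...
    }
    catch
    {
        trans.Rollback();           // ...or neither does
        throw;
    }
}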

Related

Shared transaction across multiple connections, or ReadUncommitted in PostgreSQL

I want to open several connections within a single transaction scope, so that each connection could see the changes done by the previous ones.
I need this for tests - real code writes to the database, and testing code verifies the data was actually inserted/updated. In the end I rollback transaction scope so that the real database is not affected.
This approach works fine in SQL Server, but doesn't seem to work in PostgreSQL (I use 9.3 with the Npgsql provider); below is a small example.
Here's the helper that runs an arbitrary query within a transaction scope:
private void RunQuery(string query, Action<IDataReader> process)
{
    using (var connection = new NpgsqlConnection(Config.ConnectionString))
    {
        connection.Open();
        connection.EnlistTransaction(Transaction.Current);
        using (var command = connection.CreateCommand())
        {
            command.CommandText = query;
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    process(reader);
                }
            }
        }
    }
}
...and here's the test code - it inserts into the users table and then checks whether the user was actually inserted:
using (var scope = new TransactionScope())
{
    // "tested scenario"
    int id = 0;
    RunQuery("INSERT INTO users (name) VALUES ('foo') RETURNING id;", reader => {
        id = (int)reader.GetValue(0);
    });

    // checking
    int id2 = 0;
    RunQuery("SELECT id, name FROM users WHERE id=" + id, reader => {
        id2 = (int)reader.GetValue(0);
    });
    Assert.That(id2, Is.Not.EqualTo(0));
}
The test above fails on Postgres because id2 is always zero. I tried the TransactionScope constructor overload that takes TransactionOptions with IsolationLevel.ReadUncommitted, but it doesn't seem to help. Note that if I run this against SQL Server (change NpgsqlConnection to SqlConnection, use SCOPE_IDENTITY to retrieve the id), then everything works just fine and id2 is not zero.
As you may expect, selects within the same connection do work for Postgres, but that's not what I need: my goal is to use multiple connections on a shared transaction scope. I also don't need multithreading; those connections happen sequentially.
First a disclaimer: while I know a bit about postgresql, I know very little about .NET.
I suspect you are conflating two related but separate concepts: distributed transactions and transaction isolation levels.
According to the .NET documentation, EnlistTransaction adds the connection to a distributed transaction. A distributed transaction is described as follows:
A distributed transaction is a transaction that affects several resources. For a distributed transaction to commit, all participants must guarantee that any change to data will be permanent. Changes must persist despite system crashes or other unforeseen events. If even a single participant fails to make this guarantee, the entire transaction fails, and any changes to data within the scope of the transaction are rolled back.
In a database, such transactions are implemented by a two-phase commit process amongst what are actually separate transactions in the database. All of the participating transactions are progressed to the end of the first phase by executing PREPARE TRANSACTION. Once they are all in this state, then they can be fully committed by executing COMMIT PREPARED. If any of them fails during PREPARE TRANSACTION, then they can all be rolled back by ROLLBACK PREPARED. This guarantees that either they are all committed, or they are all rolled back.
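Expressed as raw SQL sent from .NET, the two phases look roughly like this (a sketch only: Npgsql is assumed, 'tx1' is an arbitrary identifier, and the server must have max_prepared_transactions > 0):

using (var conn = new NpgsqlConnection(Config.ConnectionString))
{
    conn.Open();
    using (var cmd = conn.CreateCommand())
    {
        cmd.CommandText = "BEGIN";
        cmd.ExecuteNonQuery();

        cmd.CommandText = "INSERT INTO users (name) VALUES ('foo')";
        cmd.ExecuteNonQuery();

        // Phase one: the work becomes durable but is not yet visible as committed.
        cmd.CommandText = "PREPARE TRANSACTION 'tx1'";
        cmd.ExecuteNonQuery();

        // Phase two: make it permanent (ROLLBACK PREPARED 'tx1' would undo it instead).
        cmd.CommandText = "COMMIT PREPARED 'tx1'";
        cmd.ExecuteNonQuery();
    }
}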
When using middleware such as that provided by .NET, you do not see any of these details: the framework handles the two-phase commit for you.
So, you might be wondering what this has to do with the fact that you are not seeing changes made in one part of this distributed transaction reflected in another. The answer is probably nothing. The two transactions are actually completely separate - in fact it is possible for them to be on completely separate databases.
What you are trying to achieve - to be able to see changes made in one transaction from another prior to commit - is related to the level of transaction isolation.
The bad news for you is that it sounds like the isolation level you would like to have is 'read uncommitted', which is not supported in PostgreSQL.
Maybe you need to describe what you are trying to achieve, at a higher level - it is likely there is another (maybe better) way to achieve it.

Using Transactions or Locking in Entity Framework to ensure proper operation

I am fairly new to EF and SQL in general, so I could use some help clarifying this point.
Let's say I have a table "wallet" (and EF code first object Wallet) that has an ID and a balance. I need to do an operation like this:
if (wallet.balance > 100)
{
    doOtherChecksThatTake10Seconds();
    wallet.balance -= 50;
    context.SaveChanges();
}
As you can see, it checks to see if a condition is valid, then if so it has to do a bunch of other operations first that take a long time (in this exaggerated example we say 10 seconds), then if that passes it subtracts $50 from the wallet and saves the new data.
The issue is, there are other things happening that can change the wallet balance at any time (this is a web application). If this happens:
wallet.balance = 110;
this operation passes its "if" check because wallet.balance > 100
while it's doing the "doOtherChecksThatTake10Seconds()", a user transfers $40 out of their wallet
now wallet.balance = 70
"doOtherChecksThatTake10Seconds()" finishes, subtracts 50 from wallet.balance, and then saves the context with the new data.
In this case, the check of wallet.balance > 100 is no longer true, but the operation still happened because of the delay. I need to find a way of locking the table and not releasing it until the entire operation is finished, so nothing gets edited during. What is the most effective way to do this?
It should be noted that I have tried putting this operation within a TransactionScope(). I am not sure whether that has the intended effect, but I did notice it started causing a lot of deadlocks with an entirely different database operation that was running.
Use optimistic concurrency: http://msdn.microsoft.com/en-us/data/jj592904
// Object property:
public byte[] RowVersion { get; set; }

// Object configuration:
Property(p => p.RowVersion).IsRowVersion().IsConcurrencyToken();
This allows dirty reads, BUT when you go to update the record the system checks that the row version hasn't changed in the meantime; the update fails if someone has changed the record in the interim.
The row version is maintained by the DB each time a record changes.
This is out-of-the-box EF optimistic locking.
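Handling the conflict might then look roughly like this (a sketch: Wallet, WalletContext and walletId are hypothetical names, EF 6 is assumed, and DbUpdateConcurrencyException lives in System.Data.Entity.Infrastructure):

using (var context = new WalletContext())
{
    var wallet = context.Wallets.Single(w => w.Id == walletId);
    if (wallet.Balance > 100)
    {
        DoOtherChecksThatTake10Seconds();
        wallet.Balance -= 50;
        try
        {
            // EF adds the RowVersion value to the UPDATE's WHERE clause.
            context.SaveChanges();
        }
        catch (DbUpdateConcurrencyException)
        {
            // Someone changed the wallet while we were checking:
            // reload the entity and retry, or report a failure.
        }
    }
}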
You can use TransactionScope.
Import the namespace
using System.Transactions;
and use it like below:
public string InsertBrand()
{
    try
    {
        using (TransactionScope transaction = new TransactionScope())
        {
            // Do your operations here
            transaction.Complete();
            return "Mobile Brand Added";
        }
    }
    catch (Exception)
    {
        throw; // rethrow without losing the original stack trace
    }
}
Another approach could be to use one or more internal queues and have each queue consumed by a single thread (producer-consumer pattern). I use this approach in a booking system; it works quite well and is very easy.
In my case I have multiple queues (one per 'product') that are created and deleted dynamically, and multiple consumers, where only one consumer can be assigned to a given queue. This also allows higher concurrency to be handled. In a high-concurrency scenario with hundreds of thousands of users you could also use separate servers and queues such as MSMQ.
There might be a problem with this approach in a ticketing system where a lot of users want a ticket for the same concert, or in a shop when a new "Harry Potter" is released, but I don't have those scenarios.
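A minimal sketch of that pattern with a single in-memory queue and exactly one consumer thread (BlockingCollection comes from System.Collections.Concurrent; WithdrawalRequest and ProcessWithdrawal are made-up names):

var requests = new BlockingCollection<WithdrawalRequest>();

// One consumer: wallet updates can never interleave.
var consumer = Task.Run(() =>
{
    foreach (var request in requests.GetConsumingEnumerable())
    {
        ProcessWithdrawal(request);   // runs strictly one at a time
    }
});

// Producers (e.g. web requests) just enqueue:
requests.Add(new WithdrawalRequest { WalletId = 7, Amount = 50m });

// On shutdown:
requests.CompleteAdding();
consumer.Wait();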

Delay on Loading Contents while Using Transactions

I noticed some delay when loading content while using transactions to edit the content.
(Testing this situation is a bit hard for me, as I don't know how best to test it.)
I have some doubts about transaction usage:
There are some minor issues and things I should understand about transactions,
and these parts are related to this question:
When should we use transactions in a home-made CMS?
My case-specific notes:
Should I use transactions in any CMS, while we have sprocs for Insert, Update, Retrieve, ...?
Is it only necessary to use transactions when we are working on more than one table?
The Transaction strategy I used :
Adding-product method (which uses the add-product sproc):
TransactionOptions txOptions = new TransactionOptions();
using (TransactionScope txScope = new TransactionScope(TransactionScopeOption.Required, txOptions))
{
    try
    {
        connection.Open();
        command.ExecuteNonQuery();
        LastInserted = (int)pInsertedID.Value;
        txScope.Complete();
    }
    catch (Exception ex)
    {
        logErrors.Warn(ex.Message);
    }
    finally
    {
        command.Dispose();
        connection.Close();
    }
}
Transactions may help to ensure consistency of the database. For example, if a stored procedure used to add a product inserts data in more than one table, and something fails along the way, a transaction might be helpful to rollback the whole operation, thus the database is free of half-baked products (e.g. lacking some critical info in related tables).
Transaction scopes (TransactionScope) are used to provide an ambient implicit transaction for whatever code runs inside a code block. These scopes may help to severely simplify the code, however, they also may add complexities in multithreaded environments (unfortunately, I don't know quite a lot about such cases).
Therefore, the code you provided would probably make sense for ensuring the database's consistency, especially if the command uses more than one table. It may add some performance overhead; however, you are better off relying on gathered profiling data rather than gut feeling before doing any optimization (i.e. try to gather quantitative data on how much slower things are under transactions). Modern database engines usually handle transactions quite efficiently; in my own experience I have never had to remove a transaction because of its performance overhead.

Is checking rows affected count after database action (insert, update, delete) overkill?

Lately, in apps I've been developing, I have been checking the number of rows affected by an insert, update, or delete to the database and logging an error if the number is unexpected. For example, on a simple insert, update, or delete of one row, if any number other than one is returned from an ExecuteNonQuery() call, I consider that an error and log it. Also, I realize as I type this that I do not even try to roll back the transaction if that happens, which is not best practice and should definitely be addressed. Anyway, here's code to illustrate what I mean:
I'll have a data layer function that makes the call to the db:
public static int DLInsert(Person person)
{
    Database db = DatabaseFactory.CreateDatabase("dbConnString");
    using (DbCommand dbCommand = db.GetStoredProcCommand("dbo.Insert_Person"))
    {
        db.AddInParameter(dbCommand, "@FirstName", DbType.String, person.FirstName);
        db.AddInParameter(dbCommand, "@LastName", DbType.String, person.LastName);
        db.AddInParameter(dbCommand, "@Address", DbType.String, person.Address);
        return db.ExecuteNonQuery(dbCommand);
    }
}
Then a business layer call to the data layer function:
public static bool BLInsert(Person person)
{
    if (DLInsert(person) != 1)
    {
        // log exception
        return false;
    }
    return true;
}
And in the code-behind or view (I do both webforms and mvc projects):
if (BLInsert(person))
{
    // carry on as normal with whatever other code after successful insert
}
else
{
    // throw an exception that directs the user to one of my custom error pages
}
The more I use this type of code, the more I feel like it is overkill. Especially in the code-behind/view. Is there any legitimate reason to think a simple insert, update, or delete wouldn't actually modify the correct number of rows in the database? Is it more plausible to only worry about catching an actual SqlException and then handling that, instead of doing the monotonous check for rows affected every time?
Thanks. Hope you all can help me out.
UPDATE
Thanks everyone for taking the time to answer. I still haven't 100% decided what setup I will use going forward, but here's what I have taken away from all of your responses.
Trust the DB and .Net libraries to handle a query and do their job as they were designed to do.
Use transactions in my stored procedures to roll back the query on any errors, and potentially use RAISERROR to throw those errors back to the .NET code as a SqlException, which could be handled with a try/catch. This approach would replace the problematic return-code checking.
Would there be any issue with the second bullet point that I am missing?
I guess the question becomes, "Why are you checking this?" If it's just because you don't trust the database to perform the query, then it's probably overkill. However, there could exist a logical reason to perform this check.
For example, I worked at a company once where this method was employed to check for concurrency errors. When a record was fetched from the database to be edited in the application, it would come with a LastModified timestamp. Then the standard CRUD operations in the data access layer would include a WHERE LastModified = @LastModified clause when doing an UPDATE and check the record-modified count. If no record was updated, it would assume a concurrency error had occurred.
I felt it was kind of sloppy for concurrency checking (especially the part about assuming the nature of the error), but it got the job done for the business.
What concerns me more in your example is the structure of how this is being accomplished. The 1 or 0 being returned from the data access code is a "magic number." That should be avoided. It's leaking an implementation detail from the data access code into the business logic code. If you do want to keep using this check, I'd recommend moving the check into the data access code and throwing an exception if it fails. In general, return codes should be avoided.
Edit: I just noticed a potentially harmful bug in your code as well, related to my last point above. What if more than one record is changed? It probably won't happen on an INSERT, but could easily happen on an UPDATE. Other parts of the code might assume that != 1 means no record was changed. That could make debugging very problematic :)
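Moving the check into the data access code, as suggested above, could look something like this (a sketch; the choice of DataException and the message wording are mine, not a prescribed pattern):

public static void DLInsert(Person person)
{
    Database db = DatabaseFactory.CreateDatabase("dbConnString");
    using (DbCommand dbCommand = db.GetStoredProcCommand("dbo.Insert_Person"))
    {
        db.AddInParameter(dbCommand, "@FirstName", DbType.String, person.FirstName);
        db.AddInParameter(dbCommand, "@LastName", DbType.String, person.LastName);
        db.AddInParameter(dbCommand, "@Address", DbType.String, person.Address);

        int rowsAffected = db.ExecuteNonQuery(dbCommand);
        if (rowsAffected != 1)
        {
            // No magic return codes: surface the problem where it happened.
            throw new DataException(string.Format(
                "dbo.Insert_Person affected {0} rows; expected exactly 1.", rowsAffected));
        }
    }
}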
On the one hand, most of the time everything should behave the way you expect, and on those times the additional checks don't add anything to your application. On the other hand, if something does go wrong, not knowing about it means that the problem may become quite large before you notice it. In my opinion, the little bit of additional protection is worth the little bit of extra effort, especially if you implement a rollback on failure. It's kinda like an airbag in your car... it doesn't really serve a purpose if you never crash, but if you do it could save your life.
I've always preferred to RAISERROR in my sproc and handle exceptions rather than counting. This way, if you update a sproc to do something else, like logging/auditing, you don't have to worry about keeping the row counts in check.
Though if you like the second check in your code or would prefer not to deal with exceptions/RAISERROR, I've seen teams return 0 on successful sproc executions for every sproc in the db, and return another number otherwise.
It is absolutely overkill. You should trust that your core platform (.NET libraries, SQL Server) works correctly - you shouldn't be worrying about that.
Now, there are some related instances where you might want to test, like if transactions are correctly rolled back, etc.
If there is a need for that check, why not do it within the database itself? You save yourself a round trip and it's done at a more 'centralized' stage - if you check in the database, you can be assured it's applied consistently from any application that hits that database. Whereas if you put the logic in the UI, you need to make sure that every UI application hitting that particular database applies the correct logic, and does so consistently.

Multi threading C# application with SQL Server database calls

I have a SQL Server database with 500,000 records in table main. There are also three other tables called child1, child2, and child3. The many to many relationships between child1, child2, child3, and main are implemented via the three relationship tables: main_child1_relationship, main_child2_relationship, and main_child3_relationship. I need to read the records in main, update main, and also insert into the relationship tables new rows as well as insert new records in the child tables. The records in the child tables have uniqueness constraints, so the pseudo-code for the actual calculation (CalculateDetails) would be something like:
for each record in main
{
    find its child1-like qualities
    for each one of its child1 qualities
    {
        find the record in child1 that matches that quality
        if found
        {
            add a record to main_child1_relationship to connect the two records
        }
        else
        {
            create a new record in child1 for the quality mentioned
            add a record to main_child1_relationship to connect the two records
        }
    }
    ...repeat the above for child2
    ...repeat the above for child3
}
This works fine as a single-threaded app, but it is too slow. The processing in C# is pretty heavy duty and takes too long. I want to turn this into a multi-threaded app.
What is the best way to do this? We are using Linq to Sql.
So far my approach has been to create a new DataContext object for each batch of records from main and use ThreadPool.QueueUserWorkItem to process it. However, these batches are stepping on each other's toes because one thread adds a record and then the next thread tries to add the same one and ... I am getting all kinds of interesting SQL Server deadlocks.
Here is the code:
int skip = 0;
List<int> thisBatch;
Queue<List<int>> allBatches = new Queue<List<int>>();

do
{
    thisBatch = allIds
        .Skip(skip)
        .Take(numberOfRecordsToPullFromDBAtATime).ToList();
    allBatches.Enqueue(thisBatch);
    skip += numberOfRecordsToPullFromDBAtATime;
} while (thisBatch.Count() > 0);

while (allBatches.Count() > 0)
{
    RRDataContext rrdc = new RRDataContext();
    var currentBatch = allBatches.Dequeue();

    lock (locker)
    {
        runningTasks++;
    }

    System.Threading.ThreadPool.QueueUserWorkItem(x =>
        ProcessBatch(currentBatch, rrdc));

    lock (locker)
    {
        while (runningTasks > MAX_NUMBER_OF_THREADS)
        {
            Monitor.Wait(locker);
            UpdateGUI();
        }
    }
}
And here is ProcessBatch:
private static void ProcessBatch(List<int> currentBatch, RRDataContext rrdc)
{
    var topRecords = GetTopRecords(rrdc, currentBatch);
    CalculateDetails(rrdc, topRecords);
    rrdc.Dispose();

    lock (locker)
    {
        runningTasks--;
        Monitor.Pulse(locker);
    }
}
And
private static List<Record> GetTopRecords(RecipeRelationshipsDataContext rrdc, List<int> thisBatch)
{
    List<Record> topRecords;
    topRecords = rrdc.Records
        .Where(x => thisBatch.Contains(x.Id))
        .OrderBy(x => x.OrderByMe).ToList();
    return topRecords;
}
CalculateDetails is best explained by the pseudo-code at the top.
I think there must be a better way to do this. Please help. Many thanks!
Here's my take on the problem:
When using multiple threads to insert/update/query data in SQL Server, or any database, then deadlocks are a fact of life. You have to assume they will occur and handle them appropriately.
That's not to say we shouldn't attempt to limit the occurrence of deadlocks. It's easy to read up on the basic causes of deadlocks and take steps to prevent them, but SQL Server will always surprise you :-)
Some reasons for deadlocks:
Too many threads - try to limit the number of threads to a minimum, though of course we want more threads for maximum performance.
Not enough indexes. If selects and updates aren't selective enough, SQL will take out larger range locks than is healthy. Try to specify appropriate indexes.
Too many indexes. Updating indexes causes deadlocks, so try to reduce indexes to the minimum required.
Transaction isolation level too high. The default isolation level when using a .NET TransactionScope is Serializable, whereas the SQL Server default is Read Committed. Reducing the isolation level can help a lot (where appropriate, of course); a small sketch follows.
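For that last point, dropping the ambient isolation level down to SQL Server's own default might look like this (sketch; System.Transactions assumed):

var options = new TransactionOptions
{
    IsolationLevel = IsolationLevel.ReadCommitted,   // instead of TransactionScope's Serializable default
    Timeout = TransactionManager.DefaultTimeout
};

using (var scope = new TransactionScope(TransactionScopeOption.Required, options))
{
    // Queries and updates here now run at READ COMMITTED.
    scope.Complete();
}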
This is how I might tackle your problem:
I wouldn't roll my own threading solution; I would use the Task Parallel Library. My main method would look something like this:
using (var dc = new TestDataContext())
{
    // Get all the ids of interest.
    // I assume you mark successfully updated rows in some way
    // in the update transaction.
    List<int> ids = dc.TestItems.Where(...).Select(item => item.Id).ToList();
    var problematicIds = new List<ErrorType>();

    // Either allow the Task Parallel library to select what it considers
    // the optimum degree of parallelism by omitting the
    // ParallelOptions parameter, or specify what you want.
    Parallel.ForEach(ids, new ParallelOptions { MaxDegreeOfParallelism = 8 },
        id => CalculateDetails(id, problematicIds));
}
Execute the CalculateDetails method with retries for deadlock failures
private static void CalculateDetails(int id, List<ErrorType> problematicIds)
{
    try
    {
        // Handle deadlocks
        DeadlockRetryHelper.Execute(() => CalculateDetails(id));
    }
    catch (Exception e)
    {
        // Too many deadlock retries (or other exception).
        // Record so we can diagnose the problem or retry later.
        problematicIds.Add(new ErrorType(id, e));
    }
}
The core CalculateDetails method
private static void CalculateDetails(int id)
{
    // Creating a new DataContext is not expensive.
    // No need to create it outside of this method.
    using (var dc = new TestDataContext())
    {
        // TODO: adjust IsolationLevel to minimize deadlocks.
        // If you don't need to change the isolation level
        // then you can remove the TransactionScope altogether.
        using (var scope = new TransactionScope(
            TransactionScopeOption.Required,
            new TransactionOptions { IsolationLevel = IsolationLevel.Serializable }))
        {
            TestItem item = dc.TestItems.Single(i => i.Id == id);

            // work done here

            dc.SubmitChanges();
            scope.Complete();
        }
    }
}
And of course my implementation of a deadlock retry helper
public static class DeadlockRetryHelper
{
    private const int MaxRetries = 4;
    private const int SqlDeadlock = 1205;

    public static void Execute(Action action, int maxRetries = MaxRetries)
    {
        if (HasAmbientTransaction())
        {
            // A deadlock blows out the containing transaction,
            // so there is no point retrying if we are already in one.
            action();
            return;
        }

        int retries = 0;
        while (retries < maxRetries)
        {
            try
            {
                action();
                return;
            }
            catch (Exception e)
            {
                if (IsSqlDeadlock(e))
                {
                    retries++;
                    // Delay subsequent retries - not sure if this helps or not
                    Thread.Sleep(100 * retries);
                }
                else
                {
                    throw;
                }
            }
        }

        // Final attempt: let any exception propagate to the caller.
        action();
    }

    private static bool HasAmbientTransaction()
    {
        return Transaction.Current != null;
    }

    private static bool IsSqlDeadlock(Exception exception)
    {
        if (exception == null)
        {
            return false;
        }

        var sqlException = exception as SqlException;
        if (sqlException != null && sqlException.Number == SqlDeadlock)
        {
            return true;
        }

        if (exception.InnerException != null)
        {
            return IsSqlDeadlock(exception.InnerException);
        }

        return false;
    }
}
One further possibility is to use a partitioning strategy
If your tables can naturally be partitioned into several distinct sets of data, then you can either use SQL Server partitioned tables and indexes, or you can manually split your existing tables into several sets of tables. I would recommend SQL Server's partitioning, since the second option would be messy. Also, built-in partitioning is only available in Enterprise Edition.
If partitioning is possible for you, you could choose a partition scheme that breaks your data into, let's say, 8 distinct sets. You could then keep your original single-threaded code but run 8 threads, each targeting a separate partition. Now there won't be any deadlocks (or at least a minimal number of them); a sketch follows.
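In C#, reusing the helpers from your question, that might look roughly like this (the modulo split over 8 buckets is just an illustration, not SQL Server's partition function):

const int partitionCount = 8;

// One task per partition; rows in different partitions never compete,
// so deadlocks should all but disappear.
Parallel.For(0, partitionCount, partition =>
{
    using (var rrdc = new RRDataContext())
    {
        var idsInPartition = allIds.Where(id => id % partitionCount == partition).ToList();
        var records = GetTopRecords(rrdc, idsInPartition);
        CalculateDetails(rrdc, records);   // the original single-threaded logic
    }
});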
I hope that makes sense.
Overview
The root of your problem is that the L2S DataContext, like the Entity Framework's ObjectContext, is not thread-safe. As explained in this MSDN forum exchange, support for asynchronous operations in the .NET ORM solutions is still pending as of .NET 4.0; you'll have to roll your own solution, which, as you've discovered, isn't always easy to do when your framework assumes single-threadedness.
I'll take this opportunity to note that L2S is built on top of ADO.NET, which itself fully supports asynchronous operation - personally, I would much prefer to deal directly with that lower layer and write the SQL myself, just to make sure that I fully understood what was transpiring over the network.
SQL Server Solution?
That being said, I have to ask - must this be a C# solution? If you can compose your solution out of a set of insert/update statements, you can just send over the SQL directly and your threading and performance problems vanish.* It seems to me that your problems are related not to the actual data transformations to be made, but center around making them performant from .NET. If .NET is removed from the equation, your task becomes simpler. After all, the best solution is often the one that has you writing the smallest amount of code, right? ;)
Even if your update/insert logic can't be expressed in a strictly set-relational manner, SQL Server does have a built-in mechanism for iterating over records and performing logic - while they are justly maligned for many use cases, cursors may in fact be appropriate for your task.
If this is a task that has to happen repeatedly, you could benefit greatly from coding it as a stored procedure.
*of course, long-running SQL brings its own problems like lock escalation and index usage that you'll have to contend with.
C# Solution
Of course, it may be that doing this in SQL is out of the question - maybe your code's decisions depend on data that comes from elsewhere, for example, or maybe your project has a strict 'no-SQL-allowed' convention. You mention some typical multithreading bugs, but without seeing your code I can't really be helpful with them specifically.
Doing this from C# is obviously viable, but you need to deal with the fact that a fixed amount of latency will exist for each and every call you make. You can mitigate the effects of network latency by using pooled connections, enabling multiple active result sets, and using the asynchronous Begin/End methods for executing your queries. Even with all of those, you will still have to accept that there is a cost to shipping data from SQL Server to your application.
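As an illustration of the Begin/End pattern (a sketch; connectionString, the query, and DoOtherCpuBoundWork are placeholders, and older framework versions also wanted Asynchronous Processing=true in the connection string):

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("SELECT id FROM main WHERE processed = 0", connection))
{
    connection.Open();

    // Fire the query, overlap the network latency with local work,
    // then harvest the results.
    IAsyncResult pending = command.BeginExecuteReader();

    DoOtherCpuBoundWork();

    using (SqlDataReader reader = command.EndExecuteReader(pending))
    {
        while (reader.Read())
        {
            // consume rows
        }
    }
}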
One of the best ways to keep your code from stepping all over itself is to avoid sharing mutable data between threads as much as possible. That would mean not sharing the same DataContext across multiple threads. The next best approach is to lock critical sections of code that touch the shared data - lock blocks around all DataContext access, from the first read to the final write. That approach might just obviate the benefits of multithreading entirely; you can likely make your locking more fine-grained, but be ye warned that this is a path of pain.
Far better is to keep your operations separate from each other entirely. If you can partition your logic across 'main' records, that's ideal - that is to say, as long as there aren't relationships between the various child tables, and as long as one record in 'main' doesn't have implications for another, you can split your operations across multiple threads like this:
private IList<int> GetMainIds()
{
    using (var context = new MyDataContext())
        return context.Main.Select(m => m.Id).ToList();
}

private void FixUpSingleRecord(int mainRecordId)
{
    using (var localContext = new MyDataContext())
    {
        var main = localContext.Main.FirstOrDefault(m => m.Id == mainRecordId);

        if (main == null)
            return;

        foreach (var childOneQuality in main.ChildOneQualities)
        {
            // If child one is not found, create it
            // Create the relationship if needed
        }

        // Repeat for ChildTwo and ChildThree

        localContext.SubmitChanges();
    }
}

public void FixUpMain()
{
    var ids = GetMainIds();
    foreach (var id in ids)
    {
        var localId = id; // Avoid closing over the loop variable
        ThreadPool.QueueUserWorkItem(delegate { FixUpSingleRecord(localId); });
    }
}
Obviously this is as much a toy example as the pseudocode in your question, but hopefully it gets you thinking about how to scope your tasks such that there is no (or minimal) shared state between them. That, I think, will be the key to a correct C# solution.
EDIT Responding to updates and comments
If you're seeing data consistency issues, I'd advise enforcing transaction semantics - you can do this by using a System.Transactions.TransactionScope (add a reference to System.Transactions). Alternately, you might be able to do this on an ADO.NET level by accessing the inner connection and calling BeginTransaction on it (or whatever the DataConnection method is called).
You also mention deadlocks. That you're battling SQL Server deadlocks indicates that the actual SQL queries are stepping on each other's toes. Without knowing what is actually being sent over the wire, it's difficult to say in detail what's happening and how to fix it. Suffice to say that SQL deadlocks result from SQL queries, and not necessarily from C# threading constructs - you need to examine what exactly is going over the wire. My gut tells me that if each 'main' record is truly independent of the others, then there shouldn't be a need for row and table locks, and that Linq to SQL is likely the culprit here.
You can get a dump of the raw SQL emitted by L2S in your code by setting the DataContext.Log property to something, e.g. Console.Out. Though I've never personally used it, I understand that LINQPad offers L2S facilities, and you may be able to get at the SQL there, too.
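For example (a one-line sketch using the DataContext from your code):

var rrdc = new RRDataContext();
rrdc.Log = Console.Out;   // every SQL statement L2S generates is written to the console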
SQL Server Management Studio will get you the rest of the way there - using the Activity Monitor, you can watch for lock escalation in real time. Using the Query Analyzer, you can get a view of exactly how SQL Server will execute your queries. With those, you should be able to get a good notion of what your code is doing server-side, and in turn how to go about fixing it.
I would recommend moving all the XML processing into the SQL server, too. Not only will all your deadlocks disappear, but you will see such a boost in performance that you will never want to go back.
It will be best explained by an example. In this example I assume that the XML blob already is going into your main table (I call it closet). I will assume the following schema:
CREATE TABLE closet (id int PRIMARY KEY, xmldoc ntext)
CREATE TABLE shoe(id int PRIMARY KEY IDENTITY, color nvarchar(20))
CREATE TABLE closet_shoe_relationship (
closet_id int REFERENCES closet(id),
shoe_id int REFERENCES shoe(id)
)
And I expect that your data (main table only) initially looks like this:
INSERT INTO closet(id, xmldoc) VALUES (1, '<ROOT><shoe><color>blue</color></shoe></ROOT>')
INSERT INTO closet(id, xmldoc) VALUES (2, '<ROOT><shoe><color>red</color></shoe></ROOT>')
Then your whole task is as simple as the following:
INSERT INTO shoe(color)
SELECT DISTINCT CAST(CAST(xmldoc AS xml).query('//shoe/color/text()') AS nvarchar) AS color
FROM closet

INSERT INTO closet_shoe_relationship(closet_id, shoe_id)
SELECT closet.id, shoe.id
FROM shoe
JOIN closet ON CAST(CAST(closet.xmldoc AS xml).query('//shoe/color/text()') AS nvarchar) = shoe.color
But given that you will do a lot of similar processing, you can make your life easier by declaring your main blob as XML type, and further simplifying to this:
INSERT INTO shoe(color)
SELECT DISTINCT CAST(xmldoc.query('//shoe/color/text()') AS nvarchar)
FROM closet
INSERT INTO closet_shoe_relationship(closet_id, shoe_id)
SELECT closet.id, shoe.id
FROM shoe JOIN closet
ON CAST(xmldoc.query('//shoe/color/text()') AS nvarchar) = shoe.color
There are additional performance optimizations possible, like pre-computing repeatedly used XPath results in a temporary or permanent table, or converting the initial population of the main table into a BULK INSERT, but I don't expect that you will really need those to succeed.
SQL Server deadlocks are normal and to be expected in this type of scenario - MS's recommendation is that they should be handled on the application side rather than the database side.
However, if you need to make sure that a stored procedure is only run by one caller at a time, you can take a SQL mutex lock using sp_getapplock. Here's an example of how to implement this:
BEGIN TRAN

DECLARE @mutex_result int;
EXEC @mutex_result = sp_getapplock @Resource = 'CheckSetFileTransferLock',
    @LockMode = 'Exclusive';

IF (@mutex_result < 0)
BEGIN
    ROLLBACK TRAN
END

-- do some stuff

EXEC @mutex_result = sp_releaseapplock @Resource = 'CheckSetFileTransferLock'
COMMIT TRAN
This may be obvious, but looping through each tuple and doing your work in your servlet container involves a lot of per-record overhead.
If possible, move some or all of that processing to the SQL server by rewriting your logic as one or more stored procedures.
If
You don't have a lot of time to spend on this issue and need to fix it right now
You are sure that your code is written so that different threads will NOT modify the same record
You are not afraid
Then ... you can just add WITH (NOLOCK) to your queries so that SQL Server doesn't take the usual locks.
Use with caution :)
But anyway, you didn't tell us where the time is lost (in the single-threaded version). If it's in the code, I'd advise you to write everything directly in the database to avoid continuous data exchange. If it's in the database, I'd advise checking indexes (too many?), I/O, CPU, etc.
