I have an Excel file with a long list of usernames. Column A contains the old usernames and column B contains the new ones. I want to rename users in a SQL table based on the Excel file. My question is the following:
Is it OK to call SQL with a using statement multiple times within a loop where I iterate through the Excel file? Or is there a better way, where I open a single connection and run all the SQL update queries in one go?
The answer to this is always that "it depends" and "can you justify it".
Right or wrong aside, your business case may include circumstances that mean multiple connections are an acceptable solution.
When iterating a data list, although not the greatest performance, it is generally acceptable to execute individual statements that would only affect a single record in a database.
You might do this to capture specific error information about each row, and in your business logic you will not re-process the rows that did succeed.
If you need to fail the whole batch when one row fails then you would need to ensure that you use a transaction scope so that you can roll back the entire set.
You would, however, generally NOT create multiple connections. A standard code pattern is to create a connection outside of the loop and reuse the same connection for each transaction.
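using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    // Start a transaction that spans every record in the loop
    using (var tx = conn.BeginTransaction())
    {
        try
        {
            foreach (var record in dataRecords)
            {
                try
                {
                    // Execute your statements on the shared connection and transaction.
                    // The SQL text and record.OldName / record.NewName are placeholders.
                    using (var cmd = new SqlCommand(
                        "UPDATE users SET username = @newName WHERE username = @oldName", conn, tx))
                    {
                        cmd.Parameters.AddWithValue("@newName", record.NewName);
                        cmd.Parameters.AddWithValue("@oldName", record.OldName);
                        cmd.ExecuteNonQuery();
                    }
                }
                catch (SqlException)
                {
                    // Process the exception and record relevant information
                    // based on the input parameters for this record
                    throw; // or throw a new exception with formatted info...
                }
            }
            tx.Commit(); // every record succeeded
        }
        catch (Exception)
        {
            tx.Rollback(); // one record failed, so roll back the entire set
        }
    }
}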
A set-based approach usually offers greater performance, but it requires a bit of plumbing to set up. From a best-practices point of view, this advice from Gordon Linoff works well too.
Multiple transactions, or multiple executions within the same transaction, are acceptable; multiple connections, however, should be avoided.
You should load the Excel table into a table in the database with two columns (at least) called old_username and new_username.
Then you can run an update directly in the database. You haven't specified the database. But because of the C# tag, I'll provide SQL Server syntax for the update -- this syntax varies by database:
update u
    set username = nc.new_username
    from users u          -- the table you want to update
    join name_changes nc
        on u.username = nc.old_username;
That is, it is generally better to get the data into the database and do all the work there.
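As a rough illustration of the load-then-update approach from the C# side, here is a minimal sketch using SqlBulkCopy. It assumes the Excel rows have already been read into old/new pairs (reading the workbook is not shown), and it uses a session-local temp table named #name_changes instead of a permanent name_changes table. That name, the nvarchar(256) column size, and the UserRenamer class are assumptions made for the example; users and username come from the query above.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public static class UserRenamer
{
    // Builds an in-memory table whose columns match the staging table, in order.
    public static DataTable BuildNameChanges(IEnumerable<KeyValuePair<string, string>> pairs)
    {
        var table = new DataTable();
        table.Columns.Add("old_username", typeof(string));
        table.Columns.Add("new_username", typeof(string));
        foreach (var pair in pairs)
        {
            table.Rows.Add(pair.Key, pair.Value);
        }
        return table;
    }

    // Bulk-loads the pairs into a temp table and applies them with one UPDATE.
    // Returns the number of rows the UPDATE reported as affected.
    public static int ApplyRenames(string connectionString, DataTable nameChanges)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction())
            {
                try
                {
                    // Assumed schema: column sizes are a guess, adjust to the real users table.
                    using (var create = new SqlCommand(
                        "CREATE TABLE #name_changes (" +
                        "old_username nvarchar(256) NOT NULL, " +
                        "new_username nvarchar(256) NOT NULL);", conn, tx))
                    {
                        create.ExecuteNonQuery();
                    }

                    // Columns are mapped by ordinal position by default.
                    using (var bulk = new SqlBulkCopy(conn, SqlBulkCopyOptions.Default, tx))
                    {
                        bulk.DestinationTableName = "#name_changes";
                        bulk.WriteToServer(nameChanges);
                    }

                    int affected;
                    using (var update = new SqlCommand(
                        "UPDATE u SET username = nc.new_username " +
                        "FROM users u JOIN #name_changes nc " +
                        "ON u.username = nc.old_username;", conn, tx))
                    {
                        affected = update.ExecuteNonQuery();
                    }

                    tx.Commit();
                    return affected;
                }
                catch
                {
                    tx.Rollback(); // nothing is renamed if any step fails
                    throw;
                }
            }
        }
    }
}
```

Calling UserRenamer.ApplyRenames(connectionString, UserRenamer.BuildNameChanges(pairs)) sends the rows in one bulk operation and runs a single UPDATE, so a failure rolls back the whole rename.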
I have the following code that takes about an hour to run through a few hundred thousand rows:
public void Recording(int rowindex)
{
    using (OleDbCommand cmd = new OleDbCommand())
    {
        try
        {
            using (OleDbConnection connection = new OleDbConnection(Con))
            {
                cmd.Connection = connection;
                connection.Open();
                using (OleDbTransaction Scope = connection.BeginTransaction(SD.IsolationLevel.ReadCommitted))
                {
                    try
                    {
                        string Query = @"UPDATE [" + SetupAction.currentTable + "] set Description=@Description, Description_Department=@Description_Department, Accounts=@Accounts where ID=@ID";
                        cmd.Parameters.AddWithValue("@Description", VirtualTable.Rows[rowindex][4].ToString());
                        cmd.Parameters.AddWithValue("@Description_Department", VirtualTable.Rows[rowindex][18].ToString());
                        cmd.Parameters.AddWithValue("@Accounts", VirtualTable.Rows[rowindex][22].ToString());
                        cmd.Parameters.AddWithValue("@ID", VirtualTable.Rows[rowindex][0].ToString());
                        cmd.CommandText = Query;
                        cmd.Transaction = Scope;
                        cmd.ExecuteNonQuery();
                        Scope.Commit();
                    }
                    catch (OleDbException odex)
                    {
                        MessageBox.Show(odex.Message);
                        Scope.Rollback();
                    }
                }
            }
        }
        catch (OleDbException ex)
        {
            MessageBox.Show("SQL: " + ex);
        }
    }
}
It works as I expect it to. Today, however, the program crashed while running the query (in a for loop where rowindex is the index into a DataTable), the computer crashed too, and when I restarted the program, it said:
Multi-step OleDB operation generated errors: followed by my connection string.
The result is that the database is now completely unusable; even Microsoft Access's recovery methods can't seem to help here.
I've read that this may be caused when the data structure of the database is altered from what it expected it to be. My question is, how do I prevent this, since I can't really detect whether my program stopped functioning all of a sudden.
There could be a way for me to restructure it somehow, maybe there's a function I don't know about. Perhaps it is sending something of an empty query when the crash happens, but I don't know how to stop it.
The Jet/ACE database engine already attempts to avoid corruption and to automatically recover from catastrophic events (lost connections, computer crashing). Transactions can further protect against inconsistent data by committing (or discarding) multiple operations all together. But eventually some coincidental system failure may terminate an operation at a critical write position, thereby creating critical inconsistencies in the database file. Making regular and timely backups is part of an overall solution. For very long operations it might be worth making an automated copy of the entire database file prior to the operation.
Otherwise, an extreme alternative is to:
1. Create a second intermediate database into which all data is first inserted. (Only needs to be done once.)
2. In this intermediate database, create linked tables to relevant tables in the permanent, working database.
3. Also in the intermediate database, create an indexed local table that mirrors the linked table structure into which data will be inserted. OR if the intermediate database and table already exist, clear the local table (i.e. delete all rows).
4. Have your current software insert into the local intermediate table.
5. Run a single query which then updates the linked table from the temporary table. Wrap that update in a transaction.
Here's where the linked table has the benefit that it can be referenced in an SQL query just like any local table. You only have to explicitly open the intermediate database. In other words, just perform a simple query like UPDATE LocalTable INNER JOIN LinkedTable ON LocalTable.UpdateID = LinkedTable.ID SET LinkedTable.Data = LocalTable.Data
The benefit of this process is that the single query that updates one Access table from another can be very fast, possibly much faster than the multiple update operations in your code. This could reduce the likelihood that errors in your update code will negatively affect your database. This of course doesn't completely eliminate the random computer crash that can affect the database, but reducing the time during which multiple connections and update queries are executing might make it less likely.
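To make the last two ideas concrete (a file copy before the long operation, then one transactional query through the linked table), here is a minimal sketch. The class and method names, the backup file naming scheme, and both parameters are assumptions for illustration; LocalTable, LinkedTable, UpdateID, ID and Data are the placeholder names from the query above. It also assumes nothing else has the working database open while the copy is made.

```csharp
using System;
using System.Data.OleDb;
using System.Globalization;
using System.IO;

public static class LinkedTableUpdate
{
    // Derives a timestamped backup file name next to the original file.
    public static string BackupPathFor(string databasePath, DateTime timestamp)
    {
        string dir = Path.GetDirectoryName(databasePath) ?? "";
        string name = Path.GetFileNameWithoutExtension(databasePath);
        string ext = Path.GetExtension(databasePath);
        string stamp = timestamp.ToString("yyyyMMdd_HHmmss", CultureInfo.InvariantCulture);
        return Path.Combine(dir, name + "_" + stamp + ext);
    }

    // Copies the working database, then applies all staged changes in one query.
    public static int Run(string workingDbPath, string intermediateConnectionString)
    {
        // Fails rather than overwriting if a backup with the same name exists.
        File.Copy(workingDbPath, BackupPathFor(workingDbPath, DateTime.Now), false);

        using (var conn = new OleDbConnection(intermediateConnectionString))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction())
            using (var cmd = new OleDbCommand(
                "UPDATE LocalTable INNER JOIN LinkedTable " +
                "ON LocalTable.UpdateID = LinkedTable.ID " +
                "SET LinkedTable.Data = LocalTable.Data", conn, tx))
            {
                try
                {
                    int affected = cmd.ExecuteNonQuery();
                    tx.Commit();
                    return affected;
                }
                catch
                {
                    tx.Rollback();
                    throw;
                }
            }
        }
    }
}
```

The only connection opened is to the intermediate database; the working file is touched for the duration of the backup copy and the single UPDATE.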
I think your catch block is wrong, because if you get an exception other than OleDbException, you will not roll back the transaction.
try
{
    // ...
    Scope.Commit();
}
catch (Exception ex)
{
    MessageBox.Show(ex.Message);
    Scope.Rollback();
}
That is, Exception instead of OleDbException. Exceptions could come from anywhere, not necessarily OLE DB, and you still want to roll back everything you've done so far in that case.
That being said, if you have a few hundred thousand rows, I would seriously consider batching the update and processing just a few thousand rows per iteration, with a transaction per iteration.
In terms of transactional behavior, the main question would be: do you really want to roll back everything you have updated so far in case of failure, or just retry/continue where you left off? If the answer is that you want to retry/continue, then you will likely want to create a BatchUpdateTask table or similar, with all the information you need for each iteration.
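A minimal sketch of the batching idea: one connection for the whole run and one transaction per batch. The BatchedUpdater class, its parameters, and the choice to update only the Description column are assumptions made to keep the example short; the column indexes (0 for ID, 4 for Description) and the table-name concatenation mirror the question's code.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.OleDb;

public static class BatchedUpdater
{
    // Splits row indexes 0..rowCount-1 into (start, count) ranges of at most batchSize rows.
    public static List<Tuple<int, int>> GetBatches(int rowCount, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");
        var batches = new List<Tuple<int, int>>();
        for (int start = 0; start < rowCount; start += batchSize)
        {
            batches.Add(Tuple.Create(start, Math.Min(batchSize, rowCount - start)));
        }
        return batches;
    }

    // Opens the connection once and commits after every batch.
    public static void UpdateAll(string connectionString, string tableName, DataTable rows, int batchSize)
    {
        using (var connection = new OleDbConnection(connectionString))
        {
            connection.Open();
            foreach (var batch in GetBatches(rows.Rows.Count, batchSize))
            {
                using (var tx = connection.BeginTransaction(IsolationLevel.ReadCommitted))
                using (var cmd = new OleDbCommand(
                    "UPDATE [" + tableName + "] SET Description=@Description WHERE ID=@ID",
                    connection, tx))
                {
                    try
                    {
                        for (int i = batch.Item1; i < batch.Item1 + batch.Item2; i++)
                        {
                            // OLE DB binds parameters by position, so add them in query order.
                            cmd.Parameters.Clear();
                            cmd.Parameters.AddWithValue("@Description", rows.Rows[i][4].ToString());
                            cmd.Parameters.AddWithValue("@ID", rows.Rows[i][0].ToString());
                            cmd.ExecuteNonQuery();
                        }
                        tx.Commit();
                    }
                    catch (Exception)
                    {
                        // Only the current batch is lost; earlier batches stay committed.
                        tx.Rollback();
                        throw;
                    }
                }
            }
        }
    }
}
```

With this shape a failure leaves earlier batches committed, which is the retry/continue behavior; recording the last completed batch somewhere (for example in the BatchUpdateTask table mentioned above) is what would let a rerun resume from that point.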
I am working with a situation where we are dealing with money transactions.
For example, I have a table of users wallets, with their balance in that row.
UserId; Wallet Id; Balance
Now in our website and web services, every time a certain transaction happens, we need to:
check that there are enough funds available to perform that transaction;
deduct the cost of the transaction from the balance.
How and what is the correct way to go about locking that row / entity for the entire duration of my transaction?
From what I have read there are some solutions where EF marks an entity and then compares that mark when it saves it back to the DB, however what does it do when another user / program has already edited the amount?
Can I achieve this with EF? If not what other options do I have?
Would calling a stored procedure possibly allow for me to lock the row properly so that no one else can access that row in the SQL Server whilst program A has the lock on it?
EF doesn't have a built-in locking mechanism, so you would probably need to use a raw query like:
using (var scope = new TransactionScope(...))
{
    using (var context = new YourContext(...))
    {
        var wallet =
            context.ExecuteStoreQuery<UserWallet>("SELECT UserId, WalletId, Balance FROM UserWallets WITH (UPDLOCK) WHERE ...");
        // your logic
        scope.Complete();
    }
}
You can set the isolation level on the transaction in Entity Framework to ensure no one else can change it:
YourDataContext.Database.BeginTransaction(IsolationLevel.RepeatableRead)
RepeatableRead
Summary:
Locks are placed on all data that is used in a query, preventing other users from updating the data. Prevents non-repeatable reads but phantom rows are still possible.
The whole point of a transactional database is that the consumer of the data determines how isolated their view of the data should be.
Irrespective of whether your transaction is serialized, someone else can perform a dirty read on the same data that you just changed but did not commit.
You should first concern yourself with the integrity of your view, and only accept a degradation in the quality of that view to improve system performance where you are sure it is required.
Wrap everything in a TransactionScope with the Serializable isolation level and you personally cannot really go wrong. Only drop the isolation level when you see it is genuinely required (i.e. when getting things wrong sometimes is OK).
Someone asks about this here: SQL Server: preventing dirty reads in a stored procedure
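Putting the suggestions above together, a check-then-deduct operation might be sketched like this with plain ADO.NET inside a serializable TransactionScope. This is only a sketch under assumptions: the WalletService class, the int wallet key, the decimal balance type and the 30-second timeout are invented for the example, while UserWallets, WalletId, Balance and the UPDLOCK hint come from the answers above.

```csharp
using System;
using System.Data.SqlClient;
using System.Transactions;

public static class WalletService
{
    // The business rule, kept separate so it can be checked without a database.
    public static bool CanDebit(decimal balance, decimal cost)
    {
        return cost >= 0 && balance >= cost;
    }

    // Returns true if the cost was deducted, false if the wallet is missing or short of funds.
    public static bool TryDebit(string connectionString, int walletId, decimal cost)
    {
        var options = new TransactionOptions
        {
            IsolationLevel = IsolationLevel.Serializable,
            Timeout = TimeSpan.FromSeconds(30) // assumed value
        };

        using (var scope = new TransactionScope(TransactionScopeOption.Required, options))
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open(); // the connection enlists in the ambient transaction

            decimal balance;
            using (var read = new SqlCommand(
                "SELECT Balance FROM UserWallets WITH (UPDLOCK) WHERE WalletId = @id", conn))
            {
                read.Parameters.AddWithValue("@id", walletId);
                object result = read.ExecuteScalar();
                if (result == null || result == DBNull.Value) return false;
                balance = (decimal)result;
            }

            // Leaving without calling Complete() rolls the transaction back.
            if (!CanDebit(balance, cost)) return false;

            using (var write = new SqlCommand(
                "UPDATE UserWallets SET Balance = Balance - @cost WHERE WalletId = @id", conn))
            {
                write.Parameters.AddWithValue("@cost", cost);
                write.Parameters.AddWithValue("@id", walletId);
                write.ExecuteNonQuery();
            }

            scope.Complete();
            return true;
        }
    }
}
```

The update lock taken by the SELECT is held until the scope ends, so a second caller running the same code for the same wallet waits at its own SELECT instead of reading a balance that is about to change.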
I'm developing an ASP.NET app that analyzes Excel files uploaded by user. The files contain various data about customers (one row = one customer), the key field is CustomerCode. Basically the data comes in form of DataTable object.
At some point I need to get information about the specified customers from SQL and compare it to what user uploaded. I'm doing it the following way:
Make a comma-separated list of customers from CustomerCode column: 'Customer1','Customer2',...'CustomerN'.
Pass this string to SQL query IN (...) clause and execute it.
This was working okay until I ran into a "The query processor ran out of internal resources and could not produce a query plan" exception when trying to pass ~40000 items inside the IN (...) clause.
The trivial way seems to be:
Replace IN (...) with = 'SomeCustomerCode' in query template.
Execute this query 40000 times for each CustomerCode.
Do DataTable.Merge 40000 times.
Is there any better way to work around this problem?
Note: I can't do IN (SELECT CustomerCode FROM ... WHERE SomeConditions) because the data comes from Excel files and thus cannot be queried from DB.
"Table valued parameters" would be worth investigating, which let you pass in (usually via a DataTable on the C# side) multiple rows - the downside is that you need to formally declare and name the data shape on the SQL server first.
Alternatively, though: you could use SqlBulkCopy to throw the rows into a staging table, and then just JOIN to that table. If you have parallel callers, you will need some kind of session identifier on the row to distinguish between concurrent uses (and: don't forget to remove your session's data afterwards).
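For the table-valued parameter route, a minimal sketch could look like the following. The table type dbo.CustomerCodeList, the dbo.Customers table, the nvarchar(50) size and the CustomerLookup class are all assumptions for illustration; only CustomerCode comes from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public static class CustomerLookup
{
    // Assumed one-time setup on the server (hypothetical type name and size):
    //   CREATE TYPE dbo.CustomerCodeList AS TABLE (CustomerCode nvarchar(50) NOT NULL);

    // Builds the single-column table sent as the parameter, skipping blanks and exact duplicates.
    public static DataTable BuildCodeTable(IEnumerable<string> customerCodes)
    {
        var table = new DataTable();
        table.Columns.Add("CustomerCode", typeof(string));
        var seen = new HashSet<string>();
        foreach (var code in customerCodes)
        {
            if (!string.IsNullOrEmpty(code) && seen.Add(code))
            {
                table.Rows.Add(code);
            }
        }
        return table;
    }

    // Fetches all matching customers in one round trip by joining to the parameter.
    public static DataTable LoadCustomers(string connectionString, DataTable codes)
    {
        const string sql =
            "SELECT c.* FROM dbo.Customers c " +
            "JOIN @codes k ON k.CustomerCode = c.CustomerCode;";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            var parameter = cmd.Parameters.Add("@codes", SqlDbType.Structured);
            parameter.TypeName = "dbo.CustomerCodeList";
            parameter.Value = codes;

            var result = new DataTable();
            using (var adapter = new SqlDataAdapter(cmd))
            {
                adapter.Fill(result); // Fill opens and closes the connection itself
            }
            return result;
        }
    }
}
```

Because the codes arrive as rows rather than as literals in the query text, the statement stays the same size no matter how many customers the uploaded file contains.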
You shouldn't process too many records at once: as you mentioned, it causes errors, and such a big batch takes too long to run and leaves nothing to do in parallel. You shouldn't process only one record at a time either, because then the overhead of communicating with SQL Server will be too big. Choose something in the middle, e.g. process 10000 records at a time. You can even parallelize the processing: start running the SQL for the next 10000 in the background while you are processing the previous batch.
I have a Project Table, a Stakeholder Table, and an Association Table (which takes a ProjectID and a StakeholderID as foreign keys).
I want to delete a single Project but must first delete all that Project's rows in the Association Table.
Here is the method. ProjectRow is a strongly typed DataRow created with the DataSet Designer.
public void RemoveProject(ProjectRow project)
{
    try
    {
        var associations = from a in ds.Association.AsEnumerable()
                           where a.Project == project.ProjID
                           select a;
        foreach (DataRow assoc in associations)
        {
            assoc.Delete();
        }
        project.Delete();
        using (TransactionScope scope = new TransactionScope())
        {
            assocTableAdapter.Update(ds.Association);
            System.Threading.Thread.Sleep(40000); // to test the transaction.
            projTableAdapter.Update(ds.Project);
            scope.Complete();
        }
    }
    catch (Exception e)
    {
        Console.WriteLine(e.Message);
    }
}
This method does achieve the effect required (stops associations being added to the deleted project during the transaction) but it seems to place a read and write lock on all the tables so I cannot even read from the Project Table during the sleep period.
I would like to be able to add other Project/Stakeholder pairs to the Association Table during the transaction. How do I achieve this?
Cheers.
A few links follow, but in short: you can hint that you'd like row-level locking, and the database engine may or may not take the suggestion. However, since you're letting the library handle the deletes, who knows what it's doing (short of turning on Profiler and capturing the statements). It could very well be issuing table locks, or you may simply have the misfortune of the row locks escalating to page locks, with the rows you are trying to access in your query outside the transaction sitting on the same page.
Is it possible to force row level locking in SQL Server?
Why is SQL Server 2008 blocking SELECT's on long transaction INSERT's?
https://dba.stackexchange.com/questions/6512/difference-between-row-level-and-page-level-locking-and-consequences
What's a body to do? You need to balance your concurrency needs against your risk for bad data. Here's a fun poster about SQL Server Isolation Levels
History
I have a list of "records" (3,500) which I save to XML and compress on exit of the program. Since:
the number of the records increases
only around 50 records need to be updated on exit
saving takes about 3 seconds
I needed another solution -- embedded database. I chose SQL CE because it works with VS without any problems and the license is OK for me (I compared it to Firebird, SQLite, EffiProz, db4o and BerkeleyDB).
The data
The record structure: 11 fields, 2 of which make up the primary key (nvarchar + byte). The other fields are bytes, datetimes, doubles and ints.
I don't use any relations, joins, indices (except for primary key), triggers, views, and so on. It is flat Dictionary actually -- pairs of Key+Value. I modify some of them, and then I have to update them in database. From time to time I add some new "records" and I need to store (insert) them. That's all.
LINQ approach
I have blank database (file), so I make 3500 inserts in a loop (one by one). I don't even check if the record already exists because db is blank.
Execution time? 4 minutes, 52 seconds. I fainted (mind you: XML + compress = 3 seconds).
SQL CE raw approach
I googled a bit, and despite such claims as here:
LINQ to SQL (CE) speed versus SqlCe
stating that it is the fault of SQL CE itself, I gave it a try.
The same loop but this time inserts are made with SqlCeResultSet (DirectTable mode, see: Bulk Insert In SQL Server CE) and SqlCeUpdatableRecord.
The outcome? Do you sit comfortably? Well... 0.3 second (yes, fraction of the second!).
The problem
LINQ is very readable, and raw operations are quite the contrary. I could write a mapper which translates all column indexes to meaningful names, but it seems like reinventing the wheel -- after all it is already done in... LINQ.
So maybe there is a way to tell LINQ to speed things up? QUESTION -- how to do it?
The code
LINQ
foreach (var entry in dict.Entries.Where(it => it.AlteredByLearning))
{
    PrimLibrary.Database.Progress record = null;
    record = new PrimLibrary.Database.Progress();
    record.Text = entry.Text;
    record.Direction = (byte)entry.dir;
    db.Progress.InsertOnSubmit(record);
    record.Status = (byte)entry.LastLearningInfo.status.Value;
    // ... and so on
    db.SubmitChanges();
}
Raw operations
SqlCeCommand cmd = conn.CreateCommand();
cmd.CommandText = "Progress";
cmd.CommandType = System.Data.CommandType.TableDirect;
SqlCeResultSet rs = cmd.ExecuteResultSet(ResultSetOptions.Updatable);
foreach (var entry in dict.Entries.Where(it => it.AlteredByLearning))
{
    SqlCeUpdatableRecord record = null;
    record = rs.CreateRecord();
    int col = 0;
    record.SetString(col++, entry.Text);
    record.SetByte(col++, (byte)entry.dir);
    record.SetByte(col++, (byte)entry.LastLearningInfo.status.Value);
    // ... and so on
    rs.Insert(record);
}
Do more work per transaction.
Commits are generally very expensive operations for a typical relational database as the database must wait for disk flushes to ensure data is not lost (ACID guarantees and all that). Conventional HDD disk IO without specialty controllers is very slow in this sort of operation: the data must be flushed to the physical disk -- perhaps only 30-60 commits can occur a second with an IO sync between!
See the SQLite FAQ: INSERT is really slow - I can only do few dozen INSERTs per second. Ignoring the different database engine, this is the exact same issue.
Normally, LINQ2SQL creates a new implicit transaction inside SubmitChanges. To avoid this implicit transaction/commit (commits are expensive operations) either:
Call SubmitChanges less (say, once outside the loop) or;
Set up an explicit transaction scope (see TransactionScope).
One example of using a larger transaction context is:
using (var ts = new TransactionScope()) {
    // LINQ2SQL will automatically enlist in the transaction scope.
    // SubmitChanges now will NOT create a new transaction/commit each time.
    DoImportStuffThatRunsWithinASingleTransaction();
    // Important: Make sure to COMMIT the transaction.
    // (The transaction used for SubmitChanges is committed to the DB.)
    // This is when the disk sync actually has to happen,
    // but it only happens once, not 3500 times!
    ts.Complete();
}
However, the semantics of an approach using a single transaction or a single call to SubmitChanges are different than that of the code above calling SubmitChanges 3500 times and creating 3500 different implicit transactions. In particular, the size of the atomic operations (with respect to the database) is different and may not be suitable for all tasks.
For LINQ2SQL updates, changing the optimistic concurrency model (disabling it or just using a timestamp field, for instance) may result in small performance improvements. The biggest improvement, however, will come from reducing the number of commits that must be performed.
Happy coding.
I'm not positive on this, but it seems like the db.SubmitChanges() call should be made outside of the loop. Maybe that would speed things up?
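For what it's worth, the "call SubmitChanges once" variant could be sketched as below. ProgressRow and ProgressImporter are hypothetical stand-ins for the question's generated PrimLibrary.Database.Progress class and its surrounding code, with only three of the eleven fields mapped; the caller is assumed to supply an already-constructed DataContext.

```csharp
using System;
using System.Collections.Generic;
using System.Data.Linq;
using System.Data.Linq.Mapping;

// Hypothetical mapping for the Progress table (only three columns shown).
[Table(Name = "Progress")]
public class ProgressRow
{
    [Column(IsPrimaryKey = true)] public string Text;
    [Column(IsPrimaryKey = true)] public byte Direction;
    [Column] public byte Status;
}

public static class ProgressImporter
{
    // Queues every row, then submits once so all inserts share a single implicit transaction.
    public static int SaveAll(DataContext db, IEnumerable<ProgressRow> rows)
    {
        Table<ProgressRow> table = db.GetTable<ProgressRow>();
        int queued = 0;
        foreach (var row in rows)
        {
            table.InsertOnSubmit(row); // only tracked in memory at this point
            queued++;
        }
        db.SubmitChanges(); // one commit instead of one per record
        return queued;
    }
}
```

As noted in the answer above, this changes the unit of atomicity: if one insert fails, none of the queued rows are saved.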