Investigating Google Cloud Spanner Latency - are secondary indexes locking? - c#

I have an application which inserts hundreds of thousands of rows (each with only 3 columns) into Spanner using the following code, run concurrently in batches of 5,000.
public async Task ExecuteBatchInsertOrReplaceAsync<T>(List<T> items, SpannerConnection connection)
{
    // This will throw if the items count * column > 20,000. In this case, batch the batches.
    await connection.RunWithRetriableTransactionAsync(async transaction =>
    {
        await Task.WhenAll(items.Select(item => ExecuteInsertOrReplaceAsync(item, connection, transaction)));
    });
    Logger.LogInformation($"ExecuteBatchInsertOrReplaceAsync executed on {items.Count} items.");
}

public async Task<int> ExecuteInsertOrReplaceAsync<T>(T item, SpannerConnection connection, SpannerTransaction spannerTransaction = null)
{
    var parameters = new SpannerParameterCollection().CreateKeys<T>();
    parameters.PopulateFrom(item);
    await using var command = connection.CreateInsertOrUpdateCommand(TableName, parameters);
    command.Transaction = spannerTransaction;
    var count = await command.ExecuteNonQueryAsync();
    return count;
}
But when this is executed, Spanner runs with high latency, making the writes take more time than I'd like. Spanner monitoring shows a latency of around 40 s. My write throughput is about 14 MiB/s using 5 pods.
The table I'm inserting into has a single unique index. The Spanner docs suggest that high latency can be the result of table locking. Checking Spanner's lock stats with
SELECT CAST(s.row_range_start_key AS STRING) AS row_range_start_key,
       t.total_lock_wait_seconds,
       s.lock_wait_seconds,
       s.lock_wait_seconds / t.total_lock_wait_seconds AS frac_of_total,
       s.sample_lock_requests,
       t.interval_end
FROM spanner_sys.lock_stats_total_10minute t,
     spanner_sys.lock_stats_top_10minute s
WHERE t.interval_end = "2022-03-04T16:00:00Z"
shows me that there are indeed many locks being waited on for several seconds, each with sample_lock_requests = _Index_ix_my_index_name._exists,Exclusive.
So here is my question: is Spanner slowing down my writes because my unique secondary index is locking the table for each write, or could the latency be caused by something else? If I'm missing any key information, my apologies; please let me know.
Thanks

is spanner slowing down my writes because my unique secondary index is
locking the table for each write
Please note that Spanner does not lock the whole table; the lock granularity is row-and-column, or cell. More details about locking: https://cloud.google.com/spanner/docs/transactions#locking
or could the latency be caused by something else?
It is often hard to say what caused high latency in Spanner without knowing more details of your workload: the schema, the key ranges you are inserting, whether your database is currently empty or already has data, and if it has data, how that data is distributed, etc. Generally speaking, the unique index constraint is validated at commit time, so if you are not inserting a spread-out key range, there will be lock contention. But it is hard to conclude whether the 40 s latency is fully explained by this factor.
This page https://cloud.google.com/spanner/docs/bulk-loading has some information on best practices for bulk loading. If your database is currently empty and you are doing a one-time bulk load, it would be faster to drop the index and add it back after the data load.
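As a rough sketch of that idea with the Google.Cloud.Spanner.Data client (the connection string, table and column names here are placeholders; the index name is taken from the lock stats above):
// Sketch only: drop the unique index before a one-time bulk load into an empty
// database, then recreate it afterwards. Table/column names are hypothetical.
using var connection = new SpannerConnection(connectionString);
await connection.OpenAsync();

await connection.CreateDdlCommand("DROP INDEX ix_my_index_name").ExecuteNonQueryAsync();

// ... run the bulk insert code from above ...

await connection.CreateDdlCommand(
    "CREATE UNIQUE INDEX ix_my_index_name ON MyTable (MyColumn)").ExecuteNonQueryAsync();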
If you are inserting into a non-empty table, try to use a smaller mutation size per transaction, which might help.
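For instance, a minimal sketch of that, reusing the question's own methods (the method name and the chunk size of 500 are just illustrative, not a recommendation; Enumerable.Chunk needs .NET 6+):
// Sketch: split each 5,000-item batch into smaller per-transaction chunks.
public async Task ExecuteChunkedInsertOrReplaceAsync<T>(
    List<T> items, SpannerConnection connection, int chunkSize = 500)
{
    foreach (var chunk in items.Chunk(chunkSize))
    {
        await connection.RunWithRetriableTransactionAsync(async transaction =>
        {
            await Task.WhenAll(chunk.Select(item =>
                ExecuteInsertOrReplaceAsync(item, connection, transaction)));
        });
    }
}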
You could also use Key visualizer https://cloud.google.com/spanner/docs/key-visualizer to see whether your inserts cause hot spots or rolling hot spots, which usually contribute to high latency.
Please feel free to file a service ticket if you need more detailed help.

Related

C# Reading SQLite table concurrently

The goal here is to use SQL to read a SQLite database, uncompress a BLOB field, and parse the data. The parsed data is written to a different SQLite DB using EF6. Because the size of the incoming database could be 200,000 records or more, I want to do this all in parallel with 4 C# Tasks.
SQLite is in its default SERIALIZED mode. I am converting a working single background task into multiple tasks. The SQLite docs say to use a single connection and so I am using a single connection for all the tasks to read the database:
using var sqlite_datareader = sqlite_cmd.ExecuteReader();
while (sqlite_datareader.Read() && !Token.IsCancellationRequested)
{
....
}
However, each task reads each record of the database. Not what I want. I need each task to take the next record from the table.
Any ideas?
From SQLite's standpoint, the limiting factor is likely the raw disk or network I/O. Naively splitting the basic query into separate tasks or parts would mean more seeks, which makes things slower. So the fastest way to get the raw data out of the DB is a simple query over a single connection, just as the SQLite documentation says.
But now we want to do some meaningful processing on this data, and this part might benefit from parallel work. What you need to do to get good parallelization, therefore, is create a queuing system as you receive each record.
For this, you want a single thread to send the one SQL statement to the SQLite database and retrieve the results from the DataReader. That thread then queues an additional task for each record as quickly as possible, such that each task acts only on the received data for its one record... that is, the additional tasks neither know nor care whether the data came from a database or any other specific source.
The result is you'll end up with as many tasks as you have records. However, you don't have to run that many tasks all at once. You can tune it to 4 or whatever other number you want (2 × the number of CPU cores is a good rule of thumb to start with). And the easiest way to do this is to turn to ThreadPool.QueueUserWorkItem().
As we do this, one thing to remember is that the DataReader mutates itself with each read. So the main thread creating the queue must also be smart enough to copy the data into a new object on each read, so the individual threads don't end up looking at data that has already been swapped out for a later record.
using var sqlite_datareader = sqlite_cmd.ExecuteReader();
while (sqlite_datareader.Read())
{
    var temp = CopyDataFromReader(sqlite_datareader);
    ThreadPool.QueueUserWorkItem(a => ProcessRecord(temp));
}
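CopyDataFromReader here stands in for whatever copy you need; a minimal sketch that just snapshots the raw column values might look like this:
// Minimal sketch: copy the current row into a plain object[] so queued work
// items never read from the live (mutating) data reader.
private static object[] CopyDataFromReader(IDataReader reader)
{
    var values = new object[reader.FieldCount];
    reader.GetValues(values); // copies every column value of the current row
    return values;
}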
Additionally, each task itself has some overhead. If you have enough records, you may also gain some benefit from batching up a bunch of records before sending them to the queue:
const int batchSize = 50;
int count = 0;
var temp = new object[batchSize];

using var sqlite_datareader = sqlite_cmd.ExecuteReader();
while (sqlite_datareader.Read())
{
    temp[count] = CopyDataFromReader(sqlite_datareader);
    if (++count >= batchSize)
    {
        var batch = temp;                     // hand the filled array to the work item
        ThreadPool.QueueUserWorkItem(a => ProcessRecords(batch, batchSize));
        temp = new object[batchSize];         // fresh array for the next batch
        count = 0;
    }
}
if (count != 0)
{
    var lastBatch = temp;
    var lastCount = count;
    ThreadPool.QueueUserWorkItem(a => ProcessRecords(lastBatch, lastCount));
}
Finally, you probably want to do something with this data once it is no longer compressed. One option is to wait for all the items to finish, so you can stitch them back into a single IEnumerable of some variety (List, Array, DataTable, iterator, etc.). Another is to include all of that work within the ProcessRecord() method itself. Another is to use an event delegate to signal when each item is ready for further work.

Deleting rows by batches. How to open / reuse SQL Server connection?

What would be the most effective way to open/use a SQL Server connection if we're reading rows to be deleted in batches?
foreach(IEnumerable<Log> logsPage in LogsPages)
{
foreach(Log logEntry in logsPage)
{
// 1. get associated filenames
// 2. delete row
// 3. try delete each file
}
}
Log page size is about 5000 rows
Files associated with the log entries may vary in size. I don't think they are larger than say 500 Mb.
We use Dapper
Should we let Dapper open connections on each step of the foreach loop? I suppose SQL Server connection pooling takes place here?
Or should we open an explicit connection per batch?
If you're performing multiple database operations in a tight loop, it is usually preferable to open the connection for the duration of all the operations. Returning the connection to the pool can be beneficial in contested systems where there can be an indeterminate interval before the next database operation, but if you're doing lots of sequential operations, constantly fetching and returning connections from the pool (and executing sp_reset_connection, which happens behind the scenes) adds overhead for no good reason.
So to be explicit, I'd have the Open[Async]() here above the first foreach.
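A sketch of that shape with Dapper (the query texts, column names, Log.Id property, connectionString and the TryDeleteFile helper are all assumptions here; the point is the single open around the whole loop):
// Sketch: one connection/open for the entire batch run.
using var connection = new SqlConnection(connectionString);
await connection.OpenAsync();

foreach (IEnumerable<Log> logsPage in LogsPages)
{
    foreach (Log logEntry in logsPage)
    {
        // 1. get associated filenames (query is illustrative)
        var fileNames = await connection.QueryAsync<string>(
            "select FileName from LogFiles where LogId = @id", new { id = logEntry.Id });

        // 2. delete row
        await connection.ExecuteAsync(
            "delete from Logs where Id = @id", new { id = logEntry.Id });

        // 3. try delete each file (hypothetical helper)
        foreach (var fileName in fileNames)
        {
            TryDeleteFile(fileName);
        }
    }
}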
Note: for batching, you might find that there are ways of doing this with fewer round-trips, in particular making use of the IN re-writing in Dapper based on the ids. Since you mention SQL Server, this can be combined with setting SqlMapper.Settings.InListStringSplitCount to something positive (5, 10, etc. are reasonable choices; note that this is a global setting); for example, for a simple scenario:
connection.Execute("delete from Foo where Id in #ids",
new { ids = rows.Select(x => x.Id) });
is much more efficient than:
foreach (var row in rows)
{
    connection.Execute("delete from Foo where Id = @id",
        new { id = row.Id });
}
Without InListStringSplitCount, the first version will be re-written as something like:
delete from Foo where Id in (@ids0, @ids1, @ids2, ..., @idsN)
With InListStringSplitCount, the first version will be re-written as something like:
delete from Foo where Id in (select cast([value] as int) from string_split(@ids,','))
which allows the exact same query to be used many times, which is good for query-plan re-use.
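Enabling that is a one-line global setting (the value here is only an example):
// Global Dapper setting: rewrite integer in-list expansions to the string_split
// form shown above once the list is large enough (requires SQL Server 2016+).
SqlMapper.Settings.InListStringSplitCount = 10;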

Performance of Multiple Parallel (async) SqlBulkCopy inserts, against different tables, in a single Transaction

TL;DR
Why does running multiple SqlBulkCopy inserts, against unrelated tables, async & in parallel, on a single Transaction seem to behave as though it's running in series instead?
Context
I have some code that is calculating and storing a large volume of data.
The calculation is done up-front, so the storage section of the code gets handed this big pile of data to be stored.
My DB writes are being done with SqlBulkCopy.WriteToServerAsync which does the job nicely, in general.
Amongst the things I need to store are 6 tables which are business-related, but not SQL-related. As such, my write to them needs to be in a transaction, so that an error on any one write reverts the writes on all the others.
The performance of this code is fairly critical, so I want to be able to run the BulkInserts in parallel. There are no FKeys or any other tables being interacted with, (data integrity is managed by the code) so I don't see any reason that this shouldn't be possible.
What I've currently written
I thought I knew how to write all the code and have been able to get it all working, but there's a weird performance slow-down that I don't understand:
Happy to provide actual bits of code if you want, but this is already a very long Q, and the code would be pretty long too. LMK if you do want to see anything.
I can write:
"BulkInsert into each table sequentially, all in a single Transaction".
i.e. I open a new SqlConnection() and .BeginTransaction(),
then I foreach over the 6 tables, and await InsertToTable(transaction) each table before the foreach moves to the next one.
When the foreach concludes then I .Commit() the transaction and close the connection.
I have a large-volume test that runs this version in 184 seconds (95%, +/- 2.45s).
"BulkInsert into each table sequentially, with a new connection & Transaction for each table."
i.e. I foreach over the 6 tables, and await InsertToTable() each table before the foreach moves to the next one.
Inside each InsertToTable() call I open a new SqlConnection and BeginTransaction, and then I .Commit() and .Close() before returning from the method.
I have a large-volume test that runs this version in 185 seconds (95%, +/- 3.34s).
"BulkInsert into each table in parallel, with a new connection & Transaction for each table."
i.e. I initiate all 6 of my tasks by calling thisTableTask = InsertToTable() for each table, and capturing the Tasks but not awaiting them (yet).
I await Task.WhenAll() the 6 tasks captured.
Inside each InsertToTable() call I open a new SqlConnection and BeginTransaction, and then I .Commit() and .Close() before returning from the method (but note that the loop has moved on to the next table, because it doesn't await the Task immediately).
I have a large-volume test that runs this version in 144 seconds (95%, +/- 5.20s).
"BulkInsert into each table in parallel, all in a single Transaction".
i.e. I open a new SqlConnection() and .BeginTransaction().
Then I initiate all 6 of my tasks by calling thisTableTask = InsertToTable(transaction) for each table, and capturing the Tasks but not awaiting them (yet).
I await Task.WhenAll() the 6 tasks captured.
Once the WhenAll concludes then I .Commit() the transaction and close the connection.
I have a large-volume test that runs this version in 179 seconds (95%, +/- 1.78s).
In all cases the eventual BulkInsert looks like:
using (var sqlBulk = BuildSqlBulkCopy(tableName, columnNames, transactionToUse))
{
    await sqlBulk.WriteToServerAsync(dataTable);
}

private SqlBulkCopy BuildSqlBulkCopy(string tableName, string[] columnNames, SqlTransaction transaction)
{
    var bulkCopy = new SqlBulkCopy(transaction.Connection, SqlBulkCopyOptions.Default, transaction)
    {
        BatchSize = 10000,
        DestinationTableName = tableName,
        BulkCopyTimeout = 3600
    };

    foreach (var columnName in columnNames)
    {
        // Relies on setting up the data table with column names matching the database columns.
        bulkCopy.ColumnMappings.Add(columnName, columnName);
    }

    return bulkCopy;
}
Current Performance stats
As listed above
Sequential + single Tran = 184s
Sequential + separate Trans = 185s
Parallel + separate Tran = 144s
Parallel + single Tran = 179s
Those first 3 results all make sense to me.
#1 vs #2: As long as the inserts all succeed, the transactions don't change much. The DB is still doing all the same work, at the same points in time.
#2 vs #3: This was the entire point of running the inserts in parallel. By running the inserts in parallel, we spend less time waiting around for SQL to do its thing. We're making the DB do a lot of work in parallel, so it's not a 6-fold speed-up, but it's still plenty.
QUESTION:
Why is the last case so slow? And can I fix it?
Parallel + single Tran = 179s
That's almost as slow as doing it in series, and fully 25% slower than doing it in parallel with multiple transactions!
What's going on?
Why does running multiple SqlBulkCopy inserts, against unrelated tables, async & in parallel, on a single Transaction seem to behave as though it's running in series instead?
Non-Dupes:
SqlBulkCopy Multiple Tables Insert under single Transaction OR Bulk Insert Operation between Entity Framework and Classic Ado.net (Isn't running the queries in parallel)
Using SqlBulkCopy in one transaction for multiple, related tables (Tables are related and they're trying to read back out of them)
Parallel Bulk Inserting with SqlBulkCopy and Azure (that's talking about parallel load into a single table)
The only way to execute multiple commands concurrently on the same SQL Server connection/transaction is using Multiple Active Result Sets (MARS). MARS is used in the parallel single transaction case because you're using the same connection/transaction for each parallel bulk copy.
MARS executes SELECT and insert bulk operations as interleaved, not parallel, so you'll get about the same performance as serial execution. You need a distributed transaction with different connections for true parallel execution within the same transaction scope.
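A rough sketch of the distributed-transaction approach (the tables collection, its Name/Data properties and connectionString are assumptions; this must run inside an async method, and it requires distributed transaction / MSDTC support since multiple connections enlist in one TransactionScope):
// Sketch: true parallel bulk copies in one logical transaction by giving each
// task its own connection and letting them all enlist in an ambient
// TransactionScope (which escalates to a distributed transaction).
using (var scope = new TransactionScope(TransactionScopeAsyncFlowOption.Enabled))
{
    var tasks = tables.Select(async table =>
    {
        using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync(); // enlists in the ambient transaction

        using var bulkCopy = new SqlBulkCopy(connection)
        {
            DestinationTableName = table.Name,
            BatchSize = 10000,
            BulkCopyTimeout = 3600
        };
        await bulkCopy.WriteToServerAsync(table.Data);
    }).ToList();

    await Task.WhenAll(tasks);
    scope.Complete(); // commit (or roll back on exception) across all enlisted connections
}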

What is the best way to load huge result set in memory?

I am trying to load two huge result sets (source and target) coming from different RDBMSs, but the problem I am struggling with is getting those two huge result sets into memory.
Below are the queries to pull data from source and target:
Sql Server -
select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn
Oracle -
select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn
Records in Source : 12377200
Records in Target : 12266800
Following are the approaches i have tried with some statistics:
1) open data reader approach for reading source and target data:
Total jobs running in parallel = 3
Time taken by Job1 = 01:47:25
Time taken by Job2 = 01:47:25
Time taken by Job3 = 01:48:32
There is no index on Id Column.
Major time is spent here:
var dr = command.ExecuteReader();
Problems:
There are also timeout issues, for which I had to set the command timeout to 0 (infinite), which is bad.
2) Chunk by chunk reading approach for reading source and target data:
Total jobs = 1
Chunk size : 100000
Time Taken : 02:02:48
There is no index on Id Column.
3) Chunk by chunk reading approach for reading source and target data:
Total jobs = 1
Chunk size : 100000
Time Taken : 00:39:40
Index is present on Id column.
4) open data reader approach for reading source and target data:
Total jobs = 1
Index : Yes
Time: 00:01:43
5) open data reader approach for reading source and target data:
Total jobs running in parallel = 3
Index : Yes
Time: 00:25:12
I observed that while having an index on LinkedColumn does improve performance, the problem is we are dealing with a 3rd party RDBMS table which might not have an index.
We would like to keep the database server as free as possible, so the data reader approach doesn't seem like a good idea: there will be lots of jobs running in parallel, which will put a lot of pressure on the database server, and we don't want that.
Hence we want to fetch the records into application memory from both source and target and do a 1-to-1 record comparison, keeping the database server free.
Note: I want to do this in my c# application and don't want to use SSIS or Linked Server.
Update:
Source SQL query execution time in SQL Server Management Studio: 00:01:41
Target SQL query execution time in SQL Server Management Studio: 00:01:40
What is the best way to read a huge result set into memory?
Code:
static void Main(string[] args)
{
    // Running 3 jobs in parallel
    //Task<string>[] taskArray = { Task<string>.Factory.StartNew(() => Compare()),
    //    Task<string>.Factory.StartNew(() => Compare()),
    //    Task<string>.Factory.StartNew(() => Compare())
    //};
    Compare(); // Run single job
    Console.ReadKey();
}

public static string Compare()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();

    var srcConnection = new SqlConnection("Source Connection String");
    srcConnection.Open();
    var command1 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn", srcConnection);

    var tgtConnection = new SqlConnection("Target Connection String");
    tgtConnection.Open();
    var command2 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn", tgtConnection);

    var drA = GetReader(command1);
    var drB = GetReader(command2);

    stopwatch.Stop();
    string a = stopwatch.Elapsed.ToString(@"d\.hh\:mm\:ss");
    Console.WriteLine(a);
    return a;
}

private static IDataReader GetReader(SqlCommand command)
{
    command.CommandTimeout = 0;
    return command.ExecuteReader(); // Culprit
}
There is nothing (I know of) faster than a DataReader for fetching db records.
Working with large databases comes with its challenges; reading 10 million records in under 2 minutes is pretty good.
If you want faster you can:
jdwend's suggestion:
Use sqlcmd.exe and the Process class to run query and put results into a csv file and then read the csv into c#. sqlcmd.exe is designed to archive large databases and runs 100x faster than the c# interface. Using linq methods are also faster than the SQL Client class
Parallelize your queries and fetch concurrently, merging the results: https://shahanayyub.wordpress.com/2014/03/30/how-to-load-large-dataset-in-datagridview/
The easiest (and IMO the best for a select-everything load) is to throw hardware at it:
https://blog.codinghorror.com/hardware-is-cheap-programmers-are-expensive/
Also make sure you're testing on the PROD hardware, in release mode as that could skew your benchmarks.
This is a pattern that I use. It gets the data for a particular record set into a System.Data.DataTable instance and then closes and disposes all unmanaged resources ASAP. The pattern also works for other providers under System.Data, including System.Data.OleDb, System.Data.SqlClient, etc. I believe the Oracle client SDK implements the same pattern.
// don't forget these using directives
using System.Data;
using System.Data.SqlClient;

// here's the code.
var connectionstring = "YOUR_CONN_STRING";
var table = new DataTable("MyData");

using (var cn = new SqlConnection(connectionstring))
{
    cn.Open();
    using (var cmd = cn.CreateCommand())
    {
        cmd.CommandText = "Select [Fields] From [Table] etc etc";
        // your SQL statement here.
        using (var adapter = new SqlDataAdapter(cmd))
        {
            adapter.Fill(table);
        } // dispose adapter
    } // dispose cmd
    cn.Close();
} // dispose cn

foreach (DataRow row in table.Rows)
{
    // do something with the data set.
}
I think I would deal with this problem in a different way.
But first, let's make some assumptions:
According to your question description, you will get data from SQL Server and Oracle
Each query will return a bunch of data
You do not specify what the point of getting all that data in memory is, nor how it will be used.
I assume the data you will process is going to be used multiple times, and that you will not repeat both queries multiple times.
And whatever you do with the data, it probably won't all be displayed to the user at the same time.
Having these foundation points, I would proceed as follows:
Think of this problem as data processing.
Have a third database, or some other place with auxiliary database tables, where you can store all the results of the 2 queries.
To avoid timeouts, try to obtain the data using paging (get thousands at a time) and save them in these aux DB tables, NOT in "RAM" memory.
As soon as your logic completes all the data loading (import migration), you can start processing it.
Data processing is a key strength of database engines; they are efficient and have evolved over many years, so don't spend time reinventing the wheel. Use a stored procedure to "crunch/process/merge" the 2 auxiliary tables into only 1.
Now that you have all the "merged" data in a third aux table, you can use it for display or whatever else you need.
If you want to read it faster, use the lowest-level API to get the data. Avoid frameworks like LINQ and rely on a DataReader instead. Also check whether you need something like a dirty read (WITH (NOLOCK) in SQL Server).
If your data is very large, try to implement partial reads, for example by indexing your data; maybe you can apply a from/to date condition until everything has been selected.
After that, consider using threading in your system to parallelize the flow: one thread to read from job 1, another thread to read from job 2, as sketched below. This will cut a lot of time.
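As a rough illustration of that against the question's own code (just the reader part; command1/command2 come from the question, and this has to run inside an async method):
// Sketch: kick off both ExecuteReader calls at the same time instead of serially.
var sourceReaderTask = command1.ExecuteReaderAsync(); // source side
var targetReaderTask = command2.ExecuteReaderAsync(); // target side

await Task.WhenAll(sourceReaderTask, targetReaderTask);

using var drA = sourceReaderTask.Result;
using var drB = targetReaderTask.Result;
// ... stream both readers and compare row by row ...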
Technicalities aside, I think there is a more fundamental problem here.
select [...] order by LinkedColumn
I did observe that while having an index on LinkedColumn does improve performance, the problem is we are dealing with 3rd party RDBMS tables which might or might not have an index.
We would like to keep database server as free as possible
If you cannot ensure that the DB has a tree-based index on that column, it means the DB will be quite busy sorting your millions of elements. It's slow and resource hungry. Get rid of the ORDER BY in the SQL statement and perform the sort on the application side to get results faster and reduce the load on the DB... or ensure the DB has such an index!!!
...depending on whether this fetching is a common or a rare operation, you'll want to either enforce a proper index in the DB, or just fetch it all and sort it client side.
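As a rough illustration of the client-side option (the column types are assumptions, and it only works if the rows fit in memory):
// Sketch: read without ORDER BY, then sort in application memory.
// Assumes Id fits in a long and CompareColumn is a string.
var rows = new List<(long Id, string CompareColumn)>();

using (var cmd = new SqlCommand(
           "select Id as LinkedColumn, CompareColumn from Source", srcConnection))
using (var reader = cmd.ExecuteReader())
{
    while (reader.Read())
    {
        rows.Add((reader.GetInt64(0), reader.GetString(1)));
    }
}

rows.Sort((a, b) => a.Id.CompareTo(b.Id)); // sort client side instead of on the server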
I had a similar situation many years ago. Before I looked at the problem it took 5 days running continuously to move data between 2 systems using SQL.
I took a different approach.
We extracted the data from the source system into just a small number of files representing a flattened out data model and arranged the data in each file so it all naturally flowed in the proper sequence as we read from the files.
I then wrote a Java program that processed these flattened data files and produced individual table load files for the target system. So, for example, the source extract had less than a dozen data files from the source system which turned into 30 to 40 or so load files for the target database.
That process would run in just a few minutes and I incorporated full auditing and error reporting and we could quickly spot problems and discrepancies in the source data, get them fixed, and run the processor again.
The final piece of the puzzle was a multi-threaded utility I wrote that performed a parallel bulk load on each load file into the target Oracle database. This utility created a Java process for each table and used Oracle's bulk table load program to quickly push the data into the Oracle DB.
When all was said and done that 5 day SQL-SQL transfer of millions of records turned into just 30 minutes using a combination of Java and Oracle's bulk load capabilities. And there were no errors and we accounted for every penny of every account that was transferred between systems.
So, maybe think outside the SQL box and use Java, the file system, and Oracle's bulk loader. And make sure you're doing your file IO on solid state hard drives.
If you need to process large database result sets from Java, you can opt for JDBC to give you the low-level control required. On the other hand, if you are already using an ORM in your application, falling back to JDBC might imply some extra pain. You would be losing features such as optimistic locking, caching, automatic fetching when navigating the domain model, and so forth. Fortunately most ORMs, like Hibernate, have some options to help you with that. While these techniques are not new, there are a couple of possibilities to choose from.
A simplified example: let's assume we have a table (mapped to class "DemoEntity") with 100,000 records. Each record consists of a single column (mapped to the property "property" in DemoEntity) holding some random alphanumerical data of about ~2 KB. The JVM is run with -Xmx250m. Let's assume 250 MB is the overall maximum memory that can be assigned to the JVM on our system. Your job is to read all records currently in the table, do some not-further-specified processing, and finally store the result. We'll assume that the entities resulting from our bulk operation are not modified.

How to speed up LINQ inserts with SQL CE?

History
I have a list of "records" (3,500) which I save to XML and compress on exit of the program. Since:
the number of the records increases
only around 50 records need to be updated on exit
saving takes about 3 seconds
I needed another solution -- an embedded database. I chose SQL CE because it works with VS without any problems and the license is OK for me (I compared it to Firebird, SQLite, EffiProz, db4o and BerkeleyDB).
The data
The record structure: 11 fields, 2 of them making up the primary key (nvarchar + byte). The other fields are bytes, datetimes, doubles and ints.
I don't use any relations, joins, indices (except for the primary key), triggers, views, and so on. It is actually a flat Dictionary -- pairs of Key+Value. I modify some of them, and then I have to update them in the database. From time to time I add some new "records" and I need to store (insert) them. That's all.
LINQ approach
I have a blank database (file), so I make 3,500 inserts in a loop (one by one). I don't even check if the record already exists because the db is blank.
Execution time? 4 minutes, 52 seconds. I fainted (mind you: XML + compress = 3 seconds).
SQL CE raw approach
I googled a bit, and despite such claims as here:
LINQ to SQL (CE) speed versus SqlCe
stating that it is the fault of SQL CE itself, I gave it a try.
The same loop but this time inserts are made with SqlCeResultSet (DirectTable mode, see: Bulk Insert In SQL Server CE) and SqlCeUpdatableRecord.
The outcome? Are you sitting comfortably? Well... 0.3 seconds (yes, a fraction of a second!).
The problem
LINQ is very readable, while raw operations are quite the contrary. I could write a mapper which translates all column indexes to meaningful names, but it seems like reinventing the wheel -- after all, it is already done in... LINQ.
So maybe there is a way to tell LINQ to speed things up? QUESTION -- how to do it?
The code
LINQ
foreach (var entry in dict.Entries.Where(it => it.AlteredByLearning))
{
    PrimLibrary.Database.Progress record = null;

    record = new PrimLibrary.Database.Progress();
    record.Text = entry.Text;
    record.Direction = (byte)entry.dir;
    db.Progress.InsertOnSubmit(record);
    record.Status = (byte)entry.LastLearningInfo.status.Value;
    // ... and so on

    db.SubmitChanges();
}
Raw operations
SqlCeCommand cmd = conn.CreateCommand();
cmd.CommandText = "Progress";
cmd.CommandType = System.Data.CommandType.TableDirect;
SqlCeResultSet rs = cmd.ExecuteResultSet(ResultSetOptions.Updatable);

foreach (var entry in dict.Entries.Where(it => it.AlteredByLearning))
{
    SqlCeUpdatableRecord record = null;

    record = rs.CreateRecord();
    int col = 0;
    record.SetString(col++, entry.Text);
    record.SetByte(col++, (byte)entry.dir);
    record.SetByte(col++, (byte)entry.LastLearningInfo.status.Value);
    // ... and so on

    rs.Insert(record);
}
Do more work per transaction.
Commits are generally very expensive operations for a typical relational database, as the database must wait for disk flushes to ensure data is not lost (ACID guarantees and all that). Conventional HDD disk IO without specialty controllers is very slow for this sort of operation: the data must be flushed to the physical disk -- perhaps only 30-60 commits can occur per second with an IO sync in between!
See the SQLite FAQ: INSERT is really slow - I can only do few dozen INSERTs per second. Ignoring the different database engine, this is the exact same issue.
Normally, LINQ2SQL creates a new implicit transaction inside SubmitChanges. To avoid this implicit transaction/commit (commits are expensive operations) either:
Call SubmitChanges less (say, once outside the loop) or;
Setup an explicit transaction scope (see TransactionScope).
One example of using a larger transaction context is:
using (var ts = new TransactionScope())
{
    // LINQ2SQL will automatically enlist in the transaction scope.
    // SubmitChanges now will NOT create a new transaction/commit each time.
    DoImportStuffThatRunsWithinASingleTransaction();

    // Important: Make sure to COMMIT the transaction.
    // (The transaction used for SubmitChanges is committed to the DB.)
    // This is when the disk sync actually has to happen,
    // but it only happens once, not 3500 times!
    ts.Complete();
}
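The first option (fewer SubmitChanges calls), applied to the question's own loop, would look roughly like this:
// Sketch: queue all inserts, then submit once -- one implicit transaction/commit
// in total instead of one per record.
foreach (var entry in dict.Entries.Where(it => it.AlteredByLearning))
{
    var record = new PrimLibrary.Database.Progress();
    record.Text = entry.Text;
    record.Direction = (byte)entry.dir;
    record.Status = (byte)entry.LastLearningInfo.status.Value;
    // ... and so on
    db.Progress.InsertOnSubmit(record);
}
db.SubmitChanges(); // single commit for all ~3,500 inserts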
However, the semantics of an approach using a single transaction or a single call to SubmitChanges are different than that of the code above calling SubmitChanges 3500 times and creating 3500 different implicit transactions. In particular, the size of the atomic operations (with respect to the database) is different and may not be suitable for all tasks.
For LINQ2SQL updates, changing the optimistic concurrency model (disabling it or just using a timestamp field, for instance) may result in small performance improvements. The biggest improvement, however, will come from reducing the number of commits that must be performed.
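For reference, in LINQ to SQL the concurrency check is a per-column mapping; a sketch with attribute mapping (the class and property names are illustrative, based on the question's Progress table) might be:
// Sketch: either disable optimistic-concurrency checks on individual columns,
// or add a single rowversion column marked IsVersion = true so only it is checked.
// Attributes come from System.Data.Linq.Mapping.
[Table(Name = "Progress")]
public class Progress
{
    [Column(IsPrimaryKey = true)]
    public string Text { get; set; }

    [Column(UpdateCheck = UpdateCheck.Never)]
    public byte Status { get; set; }

    [Column(IsVersion = true)]
    public System.Data.Linq.Binary RowVersion { get; set; }
}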
Happy coding.
I'm not positive on this, but it seems like the db.SubmitChanges() call should be made outside of the loop. Maybe that would speed things up?
