Deleting large data without transaction logs in SQL Azure - C#

I want to frequently delete a large amount of data from an Azure SQL table using the code below. When deleting records, transaction log entries are created, which consume database storage. How can we perform the deletion without the transaction log consuming database storage?
Task.Run(async () =>
{
    long maxId = crumbManager.GetMaxId(fromDate, tenantId);
    var startingTime = DateTime.UtcNow;
    while (!cancellationToken.IsCancellationRequested && maxId > 0 && startingTime.AddHours(2) > DateTime.UtcNow)
    {
        try
        {
            var query = $@"delete top(10000) from Crumbs where CrumbId <= @maxId and TenantId = @tenantId";
            using (var con = new SqlConnection(connection))
            {
                con.Open();
                using (var cmd = new SqlCommand(query, con))
                {
                    cmd.Parameters.AddWithValue("@maxId", maxId);
                    cmd.Parameters.AddWithValue("@tenantId", tenantId);
                    cmd.CommandTimeout = 200;
                    var affected = cmd.ExecuteNonQuery();
                    if (affected == 0)
                    {
                        break;
                    }
                }
            }
        }
        catch (Exception ex)
        {
        }
        finally
        {
            await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken.Token);
        }
    }
});

You can't. Databases make their changes through a transaction log so that they can handle failures in the middle of a transaction, so even delete operations use space in the transaction log. Now, the transaction log only takes space (when using full recovery, as SQL Azure does for user databases) until the next log backup. Those backups happen every few minutes these days, so the time during which disk space is required for the log is minimal.
There are some operations which are minimally logged and use less space than row-by-row deletes. For example, if you truncate a table, or switch a partition out of a partitioned table (and then drop or truncate it), you generate much less log than deleting row by row. You would need to consider some design changes to your schema to enable this pattern, since you aren't just deleting all rows right now.
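As a rough illustration of the switch-and-truncate pattern (a hypothetical sketch: it assumes dbo.Crumbs is partitioned, that an empty dbo.Crumbs_Staging table with an identical schema and indexes exists on the same filegroup, and that you have already looked up the partition number to retire, e.g. from sys.partitions):

// Hypothetical sketch (System.Data.SqlClient): dbo.Crumbs_Staging is an assumed, empty
// table with a schema identical to dbo.Crumbs, created on the same filegroup.
static void SwitchOutOldPartition(string connection, int partitionNumber)
{
    using (var con = new SqlConnection(connection))
    {
        con.Open();

        // Switching a partition out is a metadata-only operation, so it generates almost no log.
        using (var switchCmd = new SqlCommand(
            $"ALTER TABLE dbo.Crumbs SWITCH PARTITION {partitionNumber} TO dbo.Crumbs_Staging;", con))
        {
            switchCmd.ExecuteNonQuery();
        }

        // Emptying the staging table is minimally logged, unlike deleting its rows one by one.
        using (var truncateCmd = new SqlCommand("TRUNCATE TABLE dbo.Crumbs_Staging;", con))
        {
            truncateCmd.ExecuteNonQuery();
        }
    }
}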
Ultimately, you should just focus on making sure that the operation you perform in SQL Azure is efficient. If you loop over a heap and delete K rows over and over, each batch can end up scanning the table again instead of doing a range scan. If you fix that, even without any of the fancy truncate/partition approaches, you may be able to improve the performance of the system over what you have now.
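For the plain batched-delete route, here is a minimal sketch (an assumption: CrumbId is the leading column of the clustered index; the connection string, tenant filter and timeout are reused from the question) that deletes contiguous key ranges, so each batch is a range scan rather than a fresh TOP scan:

// Sketch only: assumes CrumbId leads the clustered index, so each batch removes one
// contiguous key range (a range scan) instead of re-scanning the table for TOP rows.
static void DeleteInKeyRanges(string connection, long maxId, int tenantId, int rangeSize = 10000)
{
    const string sql = @"delete from Crumbs
                         where CrumbId > @low and CrumbId <= @high and TenantId = @tenantId";

    for (long high = maxId; high > 0; high -= rangeSize)
    {
        long low = Math.Max(0, high - rangeSize);
        using (var con = new SqlConnection(connection))
        using (var cmd = new SqlCommand(sql, con))
        {
            con.Open();
            cmd.Parameters.AddWithValue("@low", low);
            cmd.Parameters.AddWithValue("@high", high);
            cmd.Parameters.AddWithValue("@tenantId", tenantId);
            cmd.CommandTimeout = 200;
            cmd.ExecuteNonQuery();   // each call deletes at most one key range for this tenant
        }
    }
}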
Hope that helps explain how SQL works a bit.

Try to use batching techniques to minimize log usage.
declare
    @batch_size int,
    @del_rowcount int = 1
set @batch_size = 100
set nocount on;
while @del_rowcount > 0
begin
    begin tran
        delete top (@batch_size)
        from dbo.LargeDeleteTest
        set @del_rowcount = @@rowcount
        print 'Delete row count: ' + cast(@del_rowcount as nvarchar(32))
    commit tran
end
Dropping any foreign keys, deleting the rows, and then recreating the foreign keys can also speed things up.

Related

How to manually lock and unlock a table such that inserts are prevented

The code below works but does not prevent a different user from inserting a row and thus creating a duplicate ID.
The IDs for the table being updated are auto incremented and assigned. In the code below I do the following:
Get the next available ID (nextID)
Set the ID of each entity to nextID++
Bulk insert
How do I lock the table such that another user cannot insert while the three tasks above are running? I have seen similar questions that propose setting ISOLATION LEVEL READ COMMITTED; however, I don't think that will lock the table at the time I am getting the nextID.
public void BulkInsertEntities(List<Entity> entities)
{
    if (entities == null)
        throw new ArgumentNullException(nameof(entities));
    string tableName = "Entities";
    // -----------------------------------------------------------------
    // Prevent other users from inserting (but not reading) here
    // -----------------------------------------------------------------
    long lastID = GetLastID(tableName);
    entities.ForEach(x => x.ID = lastID++);
    using (SqlConnection con = new SqlConnection(db.Database.GetDbConnection().ConnectionString))
    {
        con.Open();
        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(con.ConnectionString, SqlBulkCopyOptions.KeepIdentity))
        {
            bulkCopy.DestinationTableName = tableName;
            DataTable tbl = DataUtil.ToDataTable<Entity>(entities);
            foreach (DataColumn col in tbl.Columns)
                bulkCopy.ColumnMappings.Add(col.ColumnName, col.ColumnName);
            bulkCopy.WriteToServer(tbl);
        }
    }
    // ---------------------------
    // Allow other users to insert
    // ---------------------------
}

protected long GetLastID(string tableName)
{
    long lastID = 0;
    using (var command = db.Database.GetDbConnection().CreateCommand())
    {
        command.CommandText = $"SELECT IDENT_CURRENT('{tableName}') + IDENT_INCR('{tableName}')";
        db.Database.OpenConnection();
        lastID = Convert.ToInt64(command.ExecuteScalar());
    }
    return lastID;
}
For identity-like functionality with more flexibility, you can create a named sequence:
create sequence dbo.MySequence as int
...and have a default constraint on the table: default(next value for dbo.MySequence).
The nice thing about this is that you can "burn" IDs and send them to clients so they have a key they can put into their data; then, when the data comes in pre-populated, no harm, no foul. It takes a little more work than identity fields, but it's not too terrible. By "burn" I mean you can get a new ID at any time by calling next value for dbo.MySequence anywhere you like. If you hold onto that value, you know it's not going to be assigned by the table; the table will get the next value after yours. You can then, at your leisure, insert a row with the value you got and held, knowing it's a legit key.
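To make that concrete, here is a hedged sketch (ReserveIdRange is a hypothetical helper, and it assumes the Entities.ID column now defaults to next value for dbo.MySequence) that reserves a whole block of IDs in one call using sys.sp_sequence_get_range:

// Hypothetical sketch: reserve a contiguous block of IDs from dbo.MySequence, assign them
// locally, then bulk insert. Concurrent callers get disjoint ranges, so no table lock is needed.
protected long ReserveIdRange(SqlConnection con, int rangeSize)
{
    const string sql = @"
        DECLARE @first sql_variant;
        EXEC sys.sp_sequence_get_range
             @sequence_name     = N'dbo.MySequence',
             @range_size        = @size,
             @range_first_value = @first OUTPUT;
        SELECT CAST(@first AS bigint);";

    using (var cmd = new SqlCommand(sql, con))
    {
        cmd.Parameters.AddWithValue("@size", rangeSize);
        return Convert.ToInt64(cmd.ExecuteScalar());   // first ID in the reserved block
    }
}

In BulkInsertEntities you would then call something like ReserveIdRange(con, entities.Count), assign firstId, firstId + 1, ... to the entities, and bulk copy with KeepIdentity as you already do; concurrent callers receive disjoint ranges, so there is nothing to lock.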
There is a feature in SQL Server called application locks. I've only rarely seen it used, but your scenario might be a suitable fit. Basically, the idea is that you'd put triggers on tables that start by testing for an outstanding app lock:
if ( applock_test( 'public', 'MyLock', 'Exclusive', 'Transaction' ) = 0 )   -- 0 = lock could not be granted, i.e. someone holds it
begin
    raiserror( ... )
    return
    --> or wait and retry
end
...and the long-running process that can't be interrupted gets the applock at the beginning and releases it at the end:
exec @rc = sp_getapplock @Resource = 'MyLock', @LockMode = 'Exclusive', @LockOwner = 'Session', @DbPrincipal = 'public'
if ( @rc >= 0 )
begin
    --> got the lock, do the damage...
    --> and then, after carefully handling the edge cases,
    --> and making sure we don't skip the release...
    exec sp_releaseapplock @Resource = 'MyLock', @LockOwner = 'Session', @DbPrincipal = 'public'
end
There are lots of variations: session-based locks that are auto-released when the session ends (beware of connection pooling), timeouts, multiple lock modes (shared, exclusive, etc.), and scoped locks (which may not apply to privileged db users).
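For reference, a minimal sketch of taking such a lock from the C# side around the bulk insert (the resource name 'EntitiesBulkInsert' and the 30-second timeout are assumptions; requires System.Data.SqlClient):

using (var con = new SqlConnection(connectionString))
{
    con.Open();

    const string acquireSql = @"
        DECLARE @rc int;
        EXEC @rc = sp_getapplock @Resource    = 'EntitiesBulkInsert',
                                 @LockMode    = 'Exclusive',
                                 @LockOwner   = 'Session',
                                 @LockTimeout = 30000;   -- milliseconds
        SELECT @rc;";

    using (var acquire = new SqlCommand(acquireSql, con))
    {
        // sp_getapplock returns >= 0 on success, negative on timeout/failure.
        if ((int)acquire.ExecuteScalar() < 0)
            throw new TimeoutException("Could not acquire the application lock.");
    }

    try
    {
        // ... GetLastID(tableName), assign IDs, SqlBulkCopy (ideally reusing this connection) ...
    }
    finally
    {
        using (var release = new SqlCommand(
            "EXEC sp_releaseapplock @Resource = 'EntitiesBulkInsert', @LockOwner = 'Session';", con))
        {
            release.ExecuteNonQuery();
        }
    }
}

Note that this only serializes writers that also take the lock (directly or via such a trigger); plain readers and any code path that skips the lock are unaffected.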

Difference between OracleBulkCopyOptions.Default and OracleBulkCopyOptions.UseInternalTransaction

Could anybody explain the difference between OracleBulkCopyOptions.Default and OracleBulkCopyOptions.UseInternalTransaction, and how I can roll back all the records if any bulk insert error happens in between?
I am using OracleBulkCopy to bulk insert (record counts range from 100,000 to 500,000) into an Oracle database. My requirement is that either all the records are inserted into the db or none of them are (roll back all records). I am using 25,000 as the BatchSize and 150 seconds as the BulkCopyTimeout. Below is my current code block.
public bool WriteExcelDataToServerRouteOne(DataTable excelTable)
{
    var columnMapping = from table in excelTable.Columns.Cast<DataColumn>()
                        select new OracleBulkCopyColumnMapping(table.ColumnName, table.ColumnName);
    using (var bulkcopy = new OracleBulkCopy(ConnectionString, OracleBulkCopyOptions.Default))
    {
        bulkcopy.DestinationTableName = DestinationTable;
        foreach (var mapping in columnMapping)
            bulkcopy.ColumnMappings.Add(mapping);
        bulkcopy.BulkCopyTimeout = TimeOut.Value;
        bulkcopy.BatchSize = BatchSize.Value;
        bulkcopy.WriteToServer(excelTable);
    }
    return true;
}
OracleBulkCopy doesn't support a transaction spanning all the records; it only supports per-batch transactions, and only if UseInternalTransaction is specified.
From the OracleBulkCopy class documentation:
If BatchSize > 0 and the UseInternalTransaction bulk copy option is specified, each batch of the bulk copy operation occurs within a transaction. If the connection used to perform the bulk copy operation is already part of a transaction, an InvalidOperationException exception is raised.
If BatchSize > 0 and the UseInternalTransaction option is not specified, rows are sent to the database in batches of size BatchSize, but no transaction-related action is taken.
For your question:
Could anybody explain what is the difference between OracleBulkCopyOptions.Default and OracleBulkCopyOptions.UseInternalTransaction
Default: Doesn't use transactions for batches.
UseInternalTransaction: Uses a transaction per batch if the batch size is greater than 0.
See:
OracleBulkCopyOptions Enumeration
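If you need true all-or-nothing behaviour, one hedged workaround (not something OracleBulkCopy gives you directly) is to bulk copy into a staging table and then move the rows with a single INSERT ... SELECT inside one transaction. DESTINATION_TABLE_STG and WriteExcelDataToServerStaged below are assumed names; ConnectionString, DestinationTable, TimeOut and BatchSize are reused from the question's code:

// Hedged sketch: DESTINATION_TABLE_STG is an assumed staging table with the same columns as the
// real destination. The bulk copy into staging doesn't need to be atomic; the final move does.
public bool WriteExcelDataToServerStaged(DataTable excelTable)
{
    using (var con = new OracleConnection(ConnectionString))
    {
        con.Open();

        using (var clear = con.CreateCommand())
        {
            clear.CommandText = "TRUNCATE TABLE DESTINATION_TABLE_STG";
            clear.ExecuteNonQuery();
        }

        using (var bulkcopy = new OracleBulkCopy(con, OracleBulkCopyOptions.Default))
        {
            bulkcopy.DestinationTableName = "DESTINATION_TABLE_STG";
            bulkcopy.BulkCopyTimeout = TimeOut.Value;     // reuse your existing settings
            bulkcopy.BatchSize = BatchSize.Value;
            // add ColumnMappings here, as in your current code, if the column order differs
            bulkcopy.WriteToServer(excelTable);
        }

        using (var tx = con.BeginTransaction())
        using (var move = con.CreateCommand())
        {
            move.Transaction = tx;
            move.CommandText = "INSERT INTO " + DestinationTable + " SELECT * FROM DESTINATION_TABLE_STG";
            try
            {
                move.ExecuteNonQuery();
                tx.Commit();
            }
            catch
            {
                tx.Rollback();
                throw;
            }
        }
    }
    return true;
}

If the INSERT ... SELECT fails, the real table is untouched; the staging table just gets truncated again on the next run.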

SQLite C# extremely slow on update

I'm really struggling to iron out this issue. When I use the following code to update my database for large numbers of records, it runs extremely slowly. I've got 500,000 records to update, which takes nearly an hour. During this operation, the journal file grows slowly with little change to the main SQLite db3 file - is this normal?
The operation only seems to be a problem when I have large numbers of records to update - it runs virtually instantly on smaller numbers of records.
Some other operations are performed on the database prior to this code running, so could they be somehow tying up the database? I've tried to ensure that all other connections are closed properly.
Thanks for any suggestions
using (SQLiteConnection sqLiteConnection = new SQLiteConnection("Data Source=" + _case.DatabasePath))
{
    sqLiteConnection.Open();
    using (SQLiteCommand sqLiteCommand = new SQLiteCommand("begin", sqLiteConnection))
    {
        sqLiteCommand.ExecuteNonQuery();
        sqLiteCommand.CommandText = "UPDATE CaseFiles SET areaPk = @areaPk, KnownareaPk = @knownareaPk WHERE mhash = @mhash";
        var pcatpk = sqLiteCommand.CreateParameter();
        var pknowncatpk = sqLiteCommand.CreateParameter();
        var pmhash = sqLiteCommand.CreateParameter();
        pcatpk.ParameterName = "@areaPk";
        pknowncatpk.ParameterName = "@knownareaPk";
        pmhash.ParameterName = "@mhash";
        sqLiteCommand.Parameters.Add(pcatpk);
        sqLiteCommand.Parameters.Add(pknowncatpk);
        sqLiteCommand.Parameters.Add(pmhash);
        foreach (CatItem CatItem in _knownFiless)
        {
            if (CatItem.FromMasterHashes == true)
            {
                pcatpk.Value = CatItem.areaPk;
                pknowncatpk.Value = CatItem.areaPk;
                pmhash.Value = CatItem.mhash;
            }
            else
            {
                pcatpk.Value = CatItem.areaPk;
                pknowncatpk.Value = null;
                pmhash.Value = CatItem.mhash;
            }
            sqLiteCommand.ExecuteNonQuery();
        }
        sqLiteCommand.CommandText = "end";
        sqLiteCommand.ExecuteNonQuery();
        sqLiteCommand.Dispose();
        sqLiteConnection.Close();
    }
    sqLiteConnection.Close();
}
The first thing is to ensure that you have an index on mhash.
Group commands into batches.
Use more than one thread.
Or:
Bulk import the records to a temporary table. Create an index on the mhash column. Perform a single update statement to update the records.
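A rough sketch of that temp-table approach (the Incoming table name and its column types are assumptions, and it presumes each mhash identifies one row; requires using System.Data and System.Data.SQLite):

// Hedged sketch: load the new values into a temporary table inside one transaction, index it,
// then update CaseFiles with a single set-based statement instead of 500,000 individual updates.
using (var con = new SQLiteConnection("Data Source=" + _case.DatabasePath))
{
    con.Open();
    using (var tx = con.BeginTransaction())
    {
        using (var cmd = new SQLiteCommand(con) { Transaction = tx })
        {
            cmd.CommandText = @"CREATE TEMP TABLE Incoming
                                (mhash TEXT, areaPk INTEGER, knownAreaPk INTEGER)";
            cmd.ExecuteNonQuery();

            cmd.CommandText = "INSERT INTO Incoming (mhash, areaPk, knownAreaPk) VALUES (@m, @a, @k)";
            var pm = cmd.Parameters.Add("@m", DbType.String);
            var pa = cmd.Parameters.Add("@a", DbType.Int64);
            var pk = cmd.Parameters.Add("@k", DbType.Int64);
            foreach (CatItem item in _knownFiless)
            {
                pm.Value = item.mhash;
                pa.Value = item.areaPk;
                pk.Value = item.FromMasterHashes ? (object)item.areaPk : DBNull.Value;
                cmd.ExecuteNonQuery();
            }

            cmd.Parameters.Clear();
            cmd.CommandText = "CREATE INDEX temp.idx_incoming_mhash ON Incoming(mhash)";
            cmd.ExecuteNonQuery();

            // One set-based update against the indexed temp table.
            cmd.CommandText = @"UPDATE CaseFiles
                                SET areaPk      = (SELECT areaPk      FROM Incoming WHERE Incoming.mhash = CaseFiles.mhash),
                                    KnownareaPk = (SELECT knownAreaPk FROM Incoming WHERE Incoming.mhash = CaseFiles.mhash)
                                WHERE mhash IN (SELECT mhash FROM Incoming)";
            cmd.ExecuteNonQuery();
        }
        tx.Commit();
    }
}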
You need to wrap everything inside a transaction, otherwise I believe SQLite will create and commit one for you for every update, hence the slowness. You clearly know that, looking at your code, but I am not sure that using "begin" and "end" commands achieves the same result here; you might end up with an empty transaction at the start and the finish instead of one wrapping everything. Try something like this instead, just in case:
using (SQLiteTransaction mytransaction = myconnection.BeginTransaction())
{
    using (SQLiteCommand mycommand = new SQLiteCommand(myconnection))
    {
        SQLiteParameter myparam = new SQLiteParameter();
        mycommand.CommandText = "YOUR QUERY HERE";
        mycommand.Parameters.Add(myparam);
        foreach (CatItem CatItem in _knownFiless)
        {
            ...
            mycommand.ExecuteNonQuery();
        }
    }
    mytransaction.Commit();
}
This part is most certainly your problem.
foreach (CatItem CatItem in _knownFiless)
{
....
sqLiteCommand.ExecuteNonQuery();
}
You are looping over a List(?) and executing a query against the database for each item. That is not a good way to do it, because database calls are quite expensive. So you should consider another way of updating these items.
The SQL code appears to be okay. The C# code is not wrong, but it has some redundancy (explicit close/dispose is not needed since you're using a using already).
There is a loop over _knownFiless (is the double s intended?); could that possibly be running slowly? It is unusual to run a query against the DB inside a loop; rather, you should create a single query with the respective set of parameters. Consider that (especially without an index on the hash) you will perform n * m operations (n being the iteration count of the loop, m being the table size).
Considering that m is around 500k, and assuming that m = n, you get 250,000,000,000 operations. That may well take an hour.
Former connections or operations should have no effect as far as I know.
You should also ensure that the internal structure of the database is not causing problems. Is there a compound index that is affected by this operation? Any foreign keys or complex constraints?

How does batch size affect bulk insert performance?

I am doing a bulk insert into a Sybase database by grouping insert queries and sending them to the database in batches, where the batch size is configurable. The code looks somewhat like this:
public static void InsertModelValueInBulk(DataSet modelValueData, int clsaId)
{
    int batchSize = Convert.ToInt32(ConfigurationManager.AppSettings["BatchSize"].ToString());
    IList<string> queryBuffer = new List<string>();
    using (var connection = GetAseConnection())
    {
        connection.Open();
        var tran = connection.BeginTransaction();
        try
        {
            for (int i = 0; i < modelValueData.Tables[0].Rows.Count; i++)
            {
                DataRow row = modelValueData.Tables[0].Rows[i];
                var insertItem = string.Format(@"select '{0}',{1},{2},{3},'{4}','{5}','{6}',{7}",
                    row["ModelValueID"], Convert.ToInt32(row["StockModelID"]), Convert.ToInt32(row["ModelItemID"]),
                    fyeStr, row["Period"], value, row["UpdatedUser"], clsaId);
                queryBuffer.Add(insertItem);
                if (queryBuffer.Count % (batchSize) == 0 && queryBuffer.Count > 0)
                {
                    var finalQuery = @"INSERT INTO InsertTable (ModelValueID, StockModelID, ModelItemID, FYE, Period, Value, UpdatedUser, id)
                        " + String.Join(" union ", queryBuffer.ToArray<string>());
                    using (var cmd = new AseCommand(finalQuery, connection, tran))
                    {
                        cmd.ExecuteNonQuery();
                    }
                    queryBuffer.Clear();
                }
            }
            tran.Commit();
        }
        catch
        {
            tran.Rollback();
            throw;
        }
        finally
        {
            tran.Dispose();
        }
    }
}
Using this, the performance observed for batch size vs. the time taken to insert 20,000 rows forms a J curve; sample data looks somewhat like this:
batch size 10 => operation completes in 30 sec
batch size 50 => 20 sec
batch size 100 => 10 sec
batch size 200 => 20 sec
batch size 500 => 30 sec
batch size 1000 => 1 min
I would like to understand the reason behind this J curve. Is it something to do with app server memory, some database server setting, or something else? What makes 100 optimal, and can this be tweaked further?
A bulk insert locks the table for the duration of the batch. Locks have a basic overhead, so small batches won't benefit nearly as much, but they do let other operations happen against the table between batches.
So larger batches are good, up to a point. Because it's a transaction, the data is not committed until the current batch is complete, and that means writing to the log file. Really large batches will cause the log to grow, which is IO-intensive, and it also increases contention as more of your log will be in use.
Something along those lines.
Edit: two other things (a sketch for the first point follows below):
1) Use parameterized inputs.
2) If you don't do #1, note that "union" forces a DISTINCT; use "union all" instead.
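For point 1, a heavily hedged sketch of a parameterized version of the inner insert, reusing connection and tran from the question's code. The @-style markers assume the AseClient driver accepts named parameters; some versions only take ? positional placeholders, in which case the markers and lookups need adjusting:

// Hedged sketch: one prepared, parameterized insert reused for every row inside the existing
// transaction, instead of concatenating literal values into a UNION.
const string insertSql =
    "INSERT INTO InsertTable (ModelValueID, StockModelID, ModelItemID, FYE, Period, Value, UpdatedUser, id) " +
    "VALUES (@modelValueId, @stockModelId, @modelItemId, @fye, @period, @value, @updatedUser, @clsaId)";

using (var cmd = new AseCommand(insertSql, connection, tran))
{
    // Create the parameters once, via the provider-agnostic IDbCommand API.
    string[] names = { "@modelValueId", "@stockModelId", "@modelItemId", "@fye",
                       "@period", "@value", "@updatedUser", "@clsaId" };
    foreach (var name in names)
    {
        var p = cmd.CreateParameter();
        p.ParameterName = name;
        cmd.Parameters.Add(p);
    }

    foreach (DataRow row in modelValueData.Tables[0].Rows)
    {
        cmd.Parameters["@modelValueId"].Value = row["ModelValueID"];
        cmd.Parameters["@stockModelId"].Value = Convert.ToInt32(row["StockModelID"]);
        cmd.Parameters["@modelItemId"].Value  = Convert.ToInt32(row["ModelItemID"]);
        cmd.Parameters["@fye"].Value          = fyeStr;        // from the original code
        cmd.Parameters["@period"].Value       = row["Period"];
        cmd.Parameters["@value"].Value        = value;         // from the original code
        cmd.Parameters["@updatedUser"].Value  = row["UpdatedUser"];
        cmd.Parameters["@clsaId"].Value       = clsaId;
        cmd.ExecuteNonQuery();
    }
}

If you keep the UNION batching instead, point 2 still applies: switching to "union all" avoids the server sorting each batch to remove duplicates.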
I see quite a few issues with your existing code. For example, on your commit I would not assume that commits will always be successful.
I would wrap all code that has the potential to fail or explode - the Commit, the Rollback, cmd.Execute - in a try/catch.
I would look at the select statement; personally I would create a stored procedure, and if you can't do that I would make the select string a const.
I would name my transactions personally, but that's up to you.
Does this line have the potential of changing during every method call?
int batchSize = Convert.ToInt32(ConfigurationManager.AppSettings["BatchSize"].ToString());
If not, I would make it a static field and not read the configuration every time you enter the method.
Try to refactor your code; it's starting to look a bit confusing to follow.

Improve large data import performance into SQLite with C#

I am using C# to import a CSV with 6-8 million rows.
My table looks like this:
CREATE TABLE [Data] ([ID] VARCHAR(100) NULL,[Raw] VARCHAR(200) NULL)
CREATE INDEX IDLookup ON Data(ID ASC)
I am using System.Data.SQLite to do the import.
Currently, to do 6 million rows it's taking 2 min 55 sec on Windows 7 32-bit with a Core 2 Duo 2.8 GHz and 4 GB RAM. That's not too bad, but I was just wondering if anyone could see a way of importing it quicker.
Here is my code:
public class Data
{
    public string IDData { get; set; }
    public string RawData { get; set; }
}

string connectionString = @"Data Source=" + Path.GetFullPath(AppDomain.CurrentDomain.BaseDirectory + "\\dbimport");
System.Data.SQLite.SQLiteConnection conn = new System.Data.SQLite.SQLiteConnection(connectionString);
conn.Open();
//Dropping and recreating the table seems to be the quickest way to get old data removed
System.Data.SQLite.SQLiteCommand command = new System.Data.SQLite.SQLiteCommand(conn);
command.CommandText = "DROP TABLE Data";
command.ExecuteNonQuery();
command.CommandText = @"CREATE TABLE [Data] ([ID] VARCHAR(100) NULL, [Raw] VARCHAR(200) NULL)";
command.ExecuteNonQuery();
command.CommandText = "CREATE INDEX IDLookup ON Data(ID ASC)";
command.ExecuteNonQuery();
string insertText = "INSERT INTO Data (ID,RAW) VALUES(@P0,@P1)";
SQLiteTransaction trans = conn.BeginTransaction();
command.Transaction = trans;
command.CommandText = insertText;
Stopwatch sw = new Stopwatch();
sw.Start();
using (CsvReader csv = new CsvReader(new StreamReader(@"C:\Data.txt"), false))
{
    var f = csv.Select(x => new Data() { IDData = x[27], RawData = String.Join(",", x.Take(24)) });
    foreach (var item in f)
    {
        command.Parameters.AddWithValue("@P0", item.IDData);
        command.Parameters.AddWithValue("@P1", item.RawData);
        command.ExecuteNonQuery();
    }
}
trans.Commit();
sw.Stop();
Debug.WriteLine(sw.Elapsed.Minutes + "Min(s) " + sw.Elapsed.Seconds + "Sec(s)");
conn.Close();
This is quite fast for 6 million records.
It seems that you are doing it the right way. Some time ago I read on sqlite.org that when inserting records you need to put these inserts inside a transaction; if you don't, your inserts will be limited to only about 60 per second! That is because each insert is treated as a separate transaction, and each transaction must wait for the disk to rotate fully. You can read the full explanation here:
http://www.sqlite.org/faq.html#q19
Actually, SQLite will easily do 50,000 or more INSERT statements per second on an average desktop computer. But it will only do a few dozen transactions per second. Transaction speed is limited by the rotational speed of your disk drive. A transaction normally requires two complete rotations of the disk platter, which on a 7200RPM disk drive limits you to about 60 transactions per second.
Comparing your time against the average stated above: at 50,000 inserts per second, 6 million rows should take about 2 min 00 sec, which is only a little faster than your time.
Transaction speed is limited by disk drive speed because (by default) SQLite actually waits until the data really is safely stored on the disk surface before the transaction is complete. That way, if you suddenly lose power or if your OS crashes, your data is still safe. For details, read about atomic commit in SQLite..
By default, each INSERT statement is its own transaction. But if you surround multiple INSERT statements with BEGIN...COMMIT then all the inserts are grouped into a single transaction. The time needed to commit the transaction is amortized over all the enclosed insert statements and so the time per insert statement is greatly reduced.
There is a hint in the next paragraph of the FAQ about how you could try to speed up the inserts:
Another option is to run PRAGMA synchronous=OFF. This command will cause SQLite to not wait on data to reach the disk surface, which will make write operations appear to be much faster. But if you lose power in the middle of a transaction, your database file might go corrupt.
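If you want to experiment with that, a small sketch of issuing the PRAGMAs on the connection from the question (journal_mode = MEMORY is an extra, optional assumption, not something the FAQ paragraph above mentions):

// Sketch: issue the PRAGMA on the same connection before starting the transaction. Trade-off:
// a power loss mid-import can corrupt the file, so only do this when you can rebuild from the CSV.
using (var pragma = new System.Data.SQLite.SQLiteCommand(conn))
{
    pragma.CommandText = "PRAGMA synchronous = OFF";
    pragma.ExecuteNonQuery();
    pragma.CommandText = "PRAGMA journal_mode = MEMORY";   // optional: keep the rollback journal in RAM
    pragma.ExecuteNonQuery();
}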
I always thought that SQLite was designed for "simple things"; 6 million records seems to me to be a job for some real database server like MySQL.
Counting records in a table in SQLite with so many records can take a long time. Just for your information, instead of using SELECT COUNT(*) you can always use SELECT MAX(rowid), which is very fast, but it is not accurate if you have been deleting records from that table.
EDIT.
As Mike Woodhouse stated, creating the index after you have inserted the records should speed up the whole thing; that is common advice for other databases, but I can't say for sure how well it works in SQLite.
One thing you might try is to create the index after the data has been inserted - typically it's much faster for databases to build an index in a single operation than to update it after each insert (or transaction).
I can't say that it'll definitely work with SQLite, but since it only needs two lines to move it's worth trying.
I'm also wondering if a 6 million row transaction might be going too far - could you change the code to try different transaction sizes? Say 100, 1000, 10000, 100000? Is there a "sweet spot"?
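Putting both ideas together, a rough sketch against the question's code: drop the CREATE INDEX from the setup section, commit in configurable chunks, and build the index once at the end (chunkSize and the explicit parameter objects are my own additions, not part of the original):

// Sketch only: chunked commits plus a deferred index build, reusing conn, insertText and the
// Data class from the question. Try different chunk sizes to find the sweet spot.
const int chunkSize = 100000;                      // try 1,000 / 10,000 / 100,000 and compare
int pending = 0;

SQLiteTransaction trans = conn.BeginTransaction();
command.Transaction = trans;
command.CommandText = insertText;
command.Parameters.Add(new SQLiteParameter("@P0"));
command.Parameters.Add(new SQLiteParameter("@P1"));

using (CsvReader csv = new CsvReader(new StreamReader(@"C:\Data.txt"), false))
{
    var f = csv.Select(x => new Data() { IDData = x[27], RawData = String.Join(",", x.Take(24)) });
    foreach (var item in f)
    {
        command.Parameters["@P0"].Value = item.IDData;
        command.Parameters["@P1"].Value = item.RawData;
        command.ExecuteNonQuery();

        if (++pending == chunkSize)
        {
            trans.Commit();                        // flush this chunk
            trans = conn.BeginTransaction();
            command.Transaction = trans;
            pending = 0;
        }
    }
}
trans.Commit();

// Build the index in one pass now that the data is loaded.
command.Transaction = null;
command.CommandText = "CREATE INDEX IDLookup ON Data(ID ASC)";
command.ExecuteNonQuery();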
You can gain quite some time when you bind your parameters in the following way:
...
string insertText = "INSERT INTO Data (ID,RAW) VALUES( ? , ? )"; // (1)
SQLiteTransaction trans = conn.BeginTransaction();
command.Transaction = trans;
command.CommandText = insertText;
//(2)------
SQLiteParameter p0 = new SQLiteParameter();
SQLiteParameter p1 = new SQLiteParameter();
command.Parameters.Add(p0);
command.Parameters.Add(p1);
//---------
Stopwatch sw = new Stopwatch();
sw.Start();
using (CsvReader csv = new CsvReader(new StreamReader(@"C:\Data.txt"), false))
{
    var f = csv.Select(x => new Data() { IDData = x[27], RawData = String.Join(",", x.Take(24)) });
    foreach (var item in f)
    {
        //(3)--------
        p0.Value = item.IDData;
        p1.Value = item.RawData;
        //-----------
        command.ExecuteNonQuery();
    }
}
trans.Commit();
...
Make the changes in sections 1, 2 and 3.
In this way parameter binding seems to be quite a bit faster.
Especially when you have a lot of parameters, this method can save quite some time.
I did a similar import, but I had my C# code just write the data to a CSV first and then ran the sqlite import utility. I was able to import over 300 million records in maybe 10 minutes this way.
Not sure whether this can be done directly from C# or not, though.
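For what it's worth, a hedged sketch of driving the command-line tool from C# (it assumes sqlite3 or sqlite3.exe is on the PATH and uses the documented .mode csv and .import dot-commands; databaseFilePath is a hypothetical variable pointing at the same file the connection string uses):

// Hedged sketch (requires System.Diagnostics): shell out to the sqlite3 command-line tool and
// feed it the import dot-commands over standard input.
var psi = new ProcessStartInfo
{
    FileName = "sqlite3",
    Arguments = "\"" + databaseFilePath + "\"",
    RedirectStandardInput = true,
    UseShellExecute = false
};

using (var proc = Process.Start(psi))
{
    proc.StandardInput.WriteLine(".mode csv");
    proc.StandardInput.WriteLine(".import C:/Data.txt Data");   // source file, then table name
    proc.StandardInput.WriteLine(".quit");
    proc.WaitForExit();
}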
