Is it possible to extract the code section inside my code and have it run in multiple threads?
The app copies data from a FoxPro db to our SQL Server over a network (the files are quite huge, so the bulk copy needs to happen in increments).
It works, but I'd like to bump up the speed a bit.
1) By either having the section I marked run in multiple threads, OR as an alternative,
2) not loop through each column in the datarow,
I went for the second option... (Updated code below)
CODE
private void BulkCopy(OleDbDataReader reader, string tableName, Table table)
{
if (Convert.ToBoolean(ConfigurationManager.AppSettings["CopyData"]))
{
Console.WriteLine(tableName + " BulkCopy Started.");
try
{
DataTable tbl = new DataTable();
foreach (Column col in table.Columns)
{
tbl.Columns.Add(col.Name, ConvertDataTypeToType(col.DataType));
}
int batch = 1;
int counter = 0;
DataRow tblRow = tbl.NewRow();
while (reader.Read())
{
counter++;
////This section changed
object[] obj = tblRow.ItemArray;
reader.GetValues(obj);
tblRow.ItemArray = obj;
////**********
tbl.LoadDataRow(tblRow.ItemArray, true);
if (counter == BulkInsertIncrement)
{
Console.WriteLine(tableName + " :: Batch >> " + batch);
counter = PerformInsert(tableName, tbl, batch);
batch++;
}
}
if (counter > 0)
{
Console.WriteLine(tableName + " :: Batch >> " + batch);
PerformInsert(tableName, tbl, counter);
}
tbl = null;
Console.WriteLine("BulkCopy Success!");
}
catch (Exception)
{
Console.WriteLine("BulkCopy Fail!");
}
finally
{
reader.Close();
reader.Dispose();
}
Console.WriteLine(tableName + " BulkCopy Ended.");
}
}
UPDATE
I went for the second option
I wasn't aware that inside the while (reader.Read()) loop I could do the following. It helped to greatly increase the app's performance:
while (reader.Read())
{
object[] obj = tblRow.ItemArray;
reader.GetValues(obj);
tblRow.ItemArray = obj;
tbl.LoadDataRow(tblRow.ItemArray, true);
}
There is no need to multithread if you take out the beginner mistakes you make. TONS of slow code everywhere.
tblRow[col.Name] = reader[col.Name];
SLOW. NEVER use the name - get the index outside the loop, then use the indices. This line has 2 (!) dictionary lookups for every row, taking more time than the row processing.
DataTables / DataSets are dead slow to start with (bad technological choice), but code like that really slows you down. Use a profiler to see the other bad elements.
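The advice about resolving indices once can be sketched roughly as follows. This is only an illustration, not the asker's code: the class name OrdinalCopy and the method CopyRows are made up for the example, and it assumes every column of the target DataTable also exists by name in the reader.

```csharp
using System;
using System.Data;

public static class OrdinalCopy
{
    // Hypothetical helper: copies all rows from reader into target,
    // resolving each column position once instead of once per row.
    public static int CopyRows(IDataReader reader, DataTable target)
    {
        // Name-based lookup happens here, outside the row loop.
        int[] ordinals = new int[target.Columns.Count];
        for (int c = 0; c < ordinals.Length; c++)
            ordinals[c] = reader.GetOrdinal(target.Columns[c].ColumnName);

        int copied = 0;
        while (reader.Read())
        {
            object[] values = new object[ordinals.Length];
            for (int c = 0; c < ordinals.Length; c++)
                values[c] = reader.GetValue(ordinals[c]); // index access only
            target.Rows.Add(values);
            copied++;
        }
        return copied;
    }
}
```

Because the mapping goes through column names once, the reader and the target table may list their columns in a different order.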
This may not be the answer you're after, but have you tried running the console application in release mode first, with just one try statement, and using indexes on the reader? It's probably not going to increase the speed a great deal by making it multi-threaded as SQL Server will be the main bottleneck.
Of course if you don't care too much about data integrity (for example your IDs aren't sequential) you could change the table locking type for inserts and spin up 3-4 threads to read from certain points in the table.
I don't think that your use case will significantly benefit from a parallel foreach. It would also be pretty hard to implement because of the OleDbDataReader that is used in your code.
But what you could do is schedule the inserts on a new thread, so that your loop does not block for the time SQL Server needs to insert your data.
You can use the Task.Factory.StartNew() method for this. But this will make error handling a bit more complex, in that when an insert fails you might already have processed more data, or in the worst case there is already another thread waiting with new inserts for the database.
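One possible shape for this, as a sketch only: the reading loop fills a buffer while the previous batch is inserted on a worker thread, and it waits for that insert before handing over the next batch, so at most one insert is in flight and a failure surfaces before more work is queued. The names OverlappedBatches, Run and insertBatch are hypothetical, and Task.Run is used here in place of Task.Factory.StartNew for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class OverlappedBatches
{
    // Hypothetical helper: insertBatch stands in for whatever performs the
    // bulk insert. Returns the number of batches handed over.
    public static int Run<T>(IEnumerable<T> rows, int batchSize, Action<List<T>> insertBatch)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        Task pending = Task.CompletedTask;
        var current = new List<T>(batchSize);
        int batches = 0;

        foreach (T row in rows)
        {
            current.Add(row);
            if (current.Count < batchSize) continue;

            pending.Wait();                        // rethrows if the previous insert failed
            List<T> full = current;
            pending = Task.Run(() => insertBatch(full));
            current = new List<T>(batchSize);      // keep reading into a fresh buffer
            batches++;
        }

        pending.Wait();
        if (current.Count > 0)                     // final partial batch
        {
            insertBatch(current);
            batches++;
        }
        return batches;
    }
}
```

Each batch gets its own list, so the worker never sees a buffer that the reading loop is still modifying.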
If you're using .NET 4, you could try using the TPL and convert the foreach loop into something like
Parallel.ForEach(table.Columns, col => { /* rest of function here */ });
ORIGINAL QUESTION:
I have some code which looks like this:
for (int i = start_i; i <= i_s; i++)
{
var json2 = JObject.Parse(RequestServer("query_2", new List<JToken>(){json1["result"]}));
foreach (var data_1 in json2["result"]["data_1"])
{
var json3 = JObject.Parse(RequestServer("query_3", new List<JToken>(){data_1, 1}));
foreach (var data_2 in json3["result"]["data_2"])
{
var data_1 = data_2["id"];
var index = data_2["other"];
}
foreach (var other in json3["result"]["other"])
{
var data_3_1 = other["data_3"]["data_3_1"];
var data_4 = other["data_4"];
var data_5 = other["data_5"];
foreach (var data_3_1 in other["data_3"]["data_3_1"])
{
//Console.WriteLine(data_3_1); <- very fast
insert_data((string)data_3_1); // <- very slow
}
}
}
}
This code was able to generate about 5000 WriteLines in less than a minute. However, I now want to insert that data into a database. When I try to do that, the code now takes much much longer to get through the 5000 sets of data.
My question is, how do I batch the database inserts into about 1000 inserts at a time, instead of doing one at a time? I have tried creating the insert statement using a StringBuilder, which is fine; what I can't figure out is how to generate 1000 at a time. I have tried using for loops up to 1000, and then trying to break out of the foreach loop before starting with the next 1000, but it just makes a big mess.
I have looked at questions like this example, but they are no good for my loop scenario. I know how to do bulk inserts at the SQL level; I just can't seem to figure out how to generate the bulk SQL inserts with the very specific nested loops in the example code above.
The 5000 records was just a test run. The end code will have to deal with millions, if not billions of inserts. Based on rough calculations, the end result will use about 500GB of drive space when inserted into a database, so I will need to batch an optimum amount into RAM before inserting into the database.
UPDATE 1:
This is what happens in insert_data:
public static string insert_data(string data_3_1)
{
string str_conn = @"server=localhost;port=3306;uid=username;password=password;database=database";
MySqlConnection conn = null;
conn = new MySqlConnection(str_conn);
conn.Open();
MySqlCommand cmd = new MySqlCommand();
cmd.Connection = conn;
cmd.CommandText = "INSERT INTO database_table (data_3_1) VALUES (@data_3_1)";
cmd.Prepare();
cmd.Parameters.AddWithValue("@data_3_1", data_3_1);
cmd.ExecuteNonQuery();
cmd.Parameters.Clear();
return null;
}
You're correct that doing bulk inserts in batches can be a big throughput win. Here's why it's a win: When you do INSERT operations one at a time, the database server does an implicit COMMIT operation after every insert, and that can be slow. So, if you can wrap every hundred or so INSERTs in a single transaction, you'll reduce that overhead.
Here's an outline of how to do that. I'll try to put it in the context of your code, but you didn't show your MySqlConnection object or query objects, so this solution of mine will necessarily be incomplete.
var batchSize = 100;
var batchCounter = batchSize;
var beginBatch = new MySqlCommand("START TRANSACTION;", conn);
var endBatch = new MySqlCommand("COMMIT;", conn);
beginBatch.ExecuteNonQuery();
for (int i = start_i; i <= i_s; i++)
{
....
foreach (var data_1 in json2["result"]["data_1"])
{
...
foreach (var other in json3["result"]["other"])
{
...
foreach (var data_3_1 in other["data_3"]["data_3_1"])
{
//Console.WriteLine(data_3_1); <- very fast
/****************** batch handling **********************/
if ( --batchCounter <= 0) {
/* commit one batch, start the next */
endBatch.ExecuteNonQuery();
beginBatch.ExecuteNonQuery();
batchCounter = batchSize;
}
insert_data((string)data_3_1); // <- very slow
}
}
}
}
/* commit the last batch. It's OK if it contains no records */
endBatch.ExecuteNonQuery();
If you want, you can try different values of batchSize to find a good value. But generally something like the 100 I suggest works well.
Batch sizes of 1000 are also OK. But the larger each transaction gets, the more server RAM it uses before it's committed, and the longer it might block other programs using the same MySQL server.
There's a nice and popular extension called MoreLinq that offers an extension method called Batch(int batchSize). To get an IEnumerable containing up to 1000 elements:
foreach (var upTo1000 in other["data_3"]["data_3_1"].Batch(1000))
{
// Build a query using the (up to) 1000 elements in upTo1000
}
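If pulling in MoreLinq is not an option, the same idea can be sketched with a small helper. The code below is only an illustration: InsertBatching, InBatches and BuildInsert are made-up names, and the generated statement assumes the server accepts a multi-row VALUES list with one parameter per row (@p0, @p1, ...). The table and column names are the ones from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class InsertBatching
{
    // Splits a sequence into lists of at most `size` items
    // (the same idea as MoreLinq's Batch).
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> source, int size)
    {
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
        var bucket = new List<T>(size);
        foreach (T item in source)
        {
            bucket.Add(item);
            if (bucket.Count == size)
            {
                yield return bucket;
                bucket = new List<T>(size);
            }
        }
        if (bucket.Count > 0) yield return bucket;   // last, possibly smaller batch
    }

    // Builds "INSERT ... VALUES (@p0), (@p1), ..." for one batch.
    // The values themselves are meant to be supplied as parameters.
    public static string BuildInsert(string table, string column, int rowCount)
    {
        var sql = new StringBuilder();
        sql.Append("INSERT INTO ").Append(table).Append(" (").Append(column).Append(") VALUES ");
        for (int i = 0; i < rowCount; i++)
        {
            if (i > 0) sql.Append(", ");
            sql.Append("(@p").Append(i).Append(')');
        }
        return sql.ToString();
    }
}
```

For each batch, the caller would set the command text to BuildInsert("database_table", "data_3_1", batch.Count) and add one parameter per element before executing it, so one round trip covers up to a whole batch.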
The best approach for me was using LOAD DATA LOCAL INFILE statement. To make it work first you have to turn ON MySQL server parameter local_infile.
I used mysql2 package for NodeJS and query function:
db.query({
sql: "LOAD DATA LOCAL INFILE .......",
infileStreamFactory: <readable stream which provides your data in flat file format>
}, function(err, results) {....});
The trick is to provide a readable stream properly. By default, LOAD DATA expects a tab-delimited text file. LOAD DATA also expects a file name, but in your case, if you provide a stream, the file name can be an arbitrary string.
I wrote a program some time ago that delimits and reads in pretty big text files. The program works, but the problem is that it basically freezes the computer and takes a long time to finish. On average each text file has around 10K to 15K lines, and each line represents a new row in a SQL table.
The way my program works is that I first read all of the lines (this is where delimiting happens) and store them in an array; after that I go through each array element and insert it into the SQL table. This is all done at once, and I suspect it is eating up too much memory, which is causing the program to freeze the computer.
Here is my code for reading file:
private void readFile()
{
//String that will hold each line read from the file
String line;
//Instantiate new stream reader
System.IO.StreamReader file = new System.IO.StreamReader(txtFilePath.Text);
try
{
while (!file.EndOfStream)
{
line = file.ReadLine();
if (!string.IsNullOrWhiteSpace(line))
{
if (this.meetsCondition(line))
{
badLines++;
continue;
} // end if
else
{
collection.readIn(line);
counter++;
} // end else
} // end if
} // end while
file.Close();
} // end try
catch (Exception exceptionError)
{
//Placeholder
}
Code for inserting:
for (int i = 0; i < counter; i++)
{
//Iterates through the collection array starting at first index and going through until the end
//and inserting each element into our SQL Table
//if (!idS.Contains(collection.getIdItems(i)))
//{
da.InsertCommand.Parameters["@Id"].Value = collection.getIdItems(i);
da.InsertCommand.Parameters["@Date"].Value = collection.getDateItems(i);
da.InsertCommand.Parameters["@Time"].Value = collection.getTimeItems(i);
da.InsertCommand.Parameters["@Question"].Value = collection.getQuestionItems(i);
da.InsertCommand.Parameters["@Details"].Value = collection.getDetailsItems(i);
da.InsertCommand.Parameters["@Answer"].Value = collection.getAnswerItems(i);
da.InsertCommand.Parameters["@Notes"].Value = collection.getNotesItems(i);
da.InsertCommand.Parameters["@EnteredBy"].Value = collection.getEnteredByItems(i);
da.InsertCommand.Parameters["@WhereReceived"].Value = collection.getWhereItems(i);
da.InsertCommand.Parameters["@QuestionType"].Value = collection.getQuestionTypeItems(i);
da.InsertCommand.Parameters["@AnswerMethod"].Value = collection.getAnswerMethodItems(i);
da.InsertCommand.Parameters["@TransactionDuration"].Value = collection.getTransactionItems(i);
da.InsertCommand.ExecuteNonQuery();
//}
//Updates the progress bar using the i in addition to 1
_worker.ReportProgress(i + 1);
} // end for
If you can map your collection to a DataTable then you could use SqlBulkCopy to import your data. SqlBulkCopy is the fastest way to import data from .NET into SQL Server.
Use SqlBulkCopy class for bulk inserts.
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx
You will cut down the time to mere seconds.
+1 for SqlBulkCopy as others have stated, but be aware that it requires INSERT permission. If you work in a strictly controlled environment, as I do, where you aren't allowed to use dynamic SQL, an alternative approach is to have your stored proc use table-valued parameters. That way you can still pass in chunks of records and have the proc do the actual inserting.
As an example of how to use the functionality of the SqlBulkCopy class (it is just pseudocode to convey the idea):
First change your collection class to host an internal DataTable, and in the constructor define the schema used by your readIn method
public class MyCollection
{
private DataTable loadedData = null;
public MyCollection()
{
loadedData = new DataTable();
loadedData.Columns.Add("Column1", typeof(string));
// ... and so on for every field expected
}
// A property to return the collected data
public DataTable GetData
{
get{return loadedData;}
}
public void readIn(string line)
{
// split the line into fields (splittedLine) using your file's delimiter
DataRow r = loadedData.NewRow();
r["Column1"] = splittedLine[0];
// ... and so on
loadedData.Rows.Add(r);
}
}
Finally, the code that uploads the data to your server:
using (SqlConnection connection = new SqlConnection(connectionString))
{
connection.Open();
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
{
bulkCopy.DestinationTableName = "destinationTable";
try
{
bulkCopy.WriteToServer(collection.GetData); // GetData is a property, not a method
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
}
}
As mentioned, using SqlBulkCopy will be faster than inserting one-by-one, but there are other things that you could look at:
Is there a clustered index on the table? If so, will you be inserting rows with values in the middle of that index? It's much more efficient to add values at the end of a clustered index, since otherwise it will have to rearrange data to insert in the middle (this is only for CLUSTERED indexes). One example I've seen used SSN as a clustered primary key. Since SSNs are distributed randomly, you are rearranging the physical structure on virtually every insert. Having a date as part of the clustered key may be OK if you are MOSTLY inserting data at the end (e.g. adding daily records).
Are there a lot of indexes on that table? It may be more efficient to drop the indexes, add the data, and re-add the indexes after the inserts (or just drop indexes you don't need).
I have an Excel document that has about 250000 rows which takes forever to import. I have done many variations of this import, however there are a few requirements:
- Need to validate the data in each cell
- Must check if a duplicate exists in the database
- If a duplicate exists, update the entry
- If no entry exists, insert a new one
I have used parallelization as much as possible however I am sure that there must be some way to get this import to run much faster. Any assistance or ideas would be greatly appreciated.
Note that the database is on a LAN, and yes I know I haven't used parameterized sql commands (yet).
public string BulkUserInsertAndUpdate()
{
DateTime startTime = DateTime.Now;
try
{
ProcessInParallel();
Debug.WriteLine("Time taken: " + (DateTime.Now - startTime));
}
catch (Exception ex)
{
return ex.Message;
}
return "";
}
private IEnumerable<Row> ReadDocument()
{
using (SpreadsheetDocument spreadSheetDocument = SpreadsheetDocument.Open(_fileName, false))
{
WorkbookPart workbookPart = spreadSheetDocument.WorkbookPart;
Sheet ss = workbookPart.Workbook.Descendants<Sheet>().SingleOrDefault(s => s.Name == "User");
if (ss == null)
throw new Exception("There was a problem trying to import the file. Please insure that the Sheet's name is: User");
WorksheetPart worksheetPart = (WorksheetPart)workbookPart.GetPartById(ss.Id);
OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
StringTablePart = workbookPart.SharedStringTablePart;
while (reader.Read())
{
if (reader.ElementType == typeof(Row))
{
do
{
if (reader.HasAttributes)
{
var rowNum = int.Parse(reader.Attributes.First(a => a.LocalName == "r").Value);
if (rowNum == 1)
continue;
var row = (Row)reader.LoadCurrentElement();
yield return row;
}
} while (reader.ReadNextSibling()); // Skip to the next row
break; // We just looped through all the rows so no need to continue reading the worksheet
}
}
}
}
private void ProcessInParallel()
{
// Use ConcurrentQueue to enable safe enqueueing from multiple threads.
var exceptions = new ConcurrentQueue<Exception>();
Parallel.ForEach(ReadDocument(), (row, loopState) =>
{
List<Cell> cells = row.Descendants<Cell>().ToList();
if (string.IsNullOrEmpty(GetCellValue(cells[0], StringTablePart)))
return;
// validation code goes here....
try
{
using (SqlConnection connection = new SqlConnection("user id=sa;password=D3vAdm!n#;server=196.30.181.143;database=TheUnlimitedUSSD;MultipleActiveResultSets=True"))
{
connection.Open();
SqlCommand command = new SqlCommand("SELECT count(*) FROM dbo.[User] WHERE MobileNumber = '" + mobileNumber + "'", connection);
var userCount = (int) command.ExecuteScalar();
if (userCount > 0)
{
// update
command = new SqlCommand("UPDATE [user] SET NewMenu = " + (newMenuIndicator ? "1" : "0") + ", PolicyNumber = '" + policyNumber + "', Status = '" + status + "' WHERE MobileNumber = '" + mobileNumber + "'", connection);
command.ExecuteScalar();
Debug.WriteLine("Update cmd");
}
else
{
// insert
command = new SqlCommand("INSERT INTO dbo.[User] ( MobileNumber , Status , PolicyNumber , NewMenu ) VALUES ( '" + mobileNumber + "' , '" + status + "' , '" + policyNumber + "' , " + (newMenuIndicator ? "1" : "0") + " )", connection);
command.ExecuteScalar();
Debug.WriteLine("Insert cmd");
}
}
}
catch (Exception ex)
{
exceptions.Enqueue(ex);
Debug.WriteLine(ex.Message);
loopState.Break();
}
});
// Throw the exceptions here after the loop completes.
if (exceptions.Count > 0)
throw new AggregateException(exceptions);
}
I would suggest that you do a bulk import WITHOUT any validation into an intermediary table, and only then do all the validation via SQL. Your spreadsheet's data will then be in a structure similar to a SQL table.
This is what I have done with industrial-strength imports of 3 million+ rows from Excel and CSV, with great success.
Mostly I'd suggest you check that your parallelism is optimal. Since your bottlenecks are likely to be disk IO on the Excel file and IO to the SQL Server, I'd suggest that it may not be. You've parallelised across those two processes (so each of them is reduced to the speed of the slowest); your parallel threads will be fighting over the database and potentially slowing each other down. There's no point having (say) eight threads if your hard disk can't keep up with one - it just creates overhead.
Two things I'd suggest. First: take out all the parallelism and see if it's actually helping. If you single-threadedly parse the whole file into a single Queue in memory, then run the whole thing into the database, you might find it's faster.
Then, I'd try splitting it to just two threads: one to process the incoming file to the Queue, and one to take the items from the Queue and push them into the database. This way you have one thread per slow resource that you're handling - so you minimise contention - and each thread is blocked by only one resource - so you're handling that resource as optimally as possible.
This is the real trick of multithreaded programming. Throwing extra threads at a problem doesn't necessarily improve performance. What you're trying to do is minimise the time that your program is waiting idly for something external (such as disk or network IO) to complete. If one thread only waits on the Excel file, and one thread only waits on the SQL server, and what they do in between is minimal (which, in your case, it is), you'll find your code will run as fast as those external resources will allow it to.
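The two-thread arrangement described above could look roughly like this. It is a sketch under assumptions, not a drop-in replacement: TwoStagePipeline, parsedRows and writeRow are hypothetical names, where parsedRows stands for the file-reading side (for example the ReadDocument() iterator) and writeRow for whatever sends one row to the database. A bounded BlockingCollection serves as the queue so that the parser cannot run arbitrarily far ahead of the writer.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class TwoStagePipeline
{
    // One task enumerates the parsed rows, one task writes them;
    // returns the number of rows written.
    public static int Run<T>(IEnumerable<T> parsedRows, Action<T> writeRow, int capacity = 1000)
    {
        using (var cts = new CancellationTokenSource())
        using (var queue = new BlockingCollection<T>(capacity))
        {
            Task producer = Task.Run(() =>
            {
                try
                {
                    foreach (T row in parsedRows)
                        queue.Add(row, cts.Token);   // blocks while the queue is full
                }
                finally
                {
                    queue.CompleteAdding();          // tells the consumer no more rows are coming
                }
            });

            int written = 0;
            Task consumer = Task.Run(() =>
            {
                try
                {
                    foreach (T row in queue.GetConsumingEnumerable())
                    {
                        writeRow(row);
                        written++;
                    }
                }
                catch
                {
                    cts.Cancel();                    // unblock the producer if writing fails
                    throw;
                }
            });

            Task.WaitAll(producer, consumer);        // rethrows failures as AggregateException
            return written;
        }
    }
}
```

With a single consumer, rows reach writeRow in the order they were parsed, and the writer can keep one connection and one command open for the whole run.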
Also, you mention it yourself, but using parameterised Sql isn't just a cool thing to point out: it will increase your performance. At the moment, you're creating a new SqlCommand for every insert, which has overhead. If you switch to a parameterised command, you can keep the same command throughout and just change the parameter values, which will save you some time. I don't think this is possible in a parallel ForEach (I doubt you can reuse the SqlCommand across threads), but it'd work fine with either of the approaches above.
Some tips for enhanced processing (as I believe this is what you need, not really a code fix).
Have Excel check for duplicate rows beforehand. It's a really decent tool for weeding out obsolete rows. If A and B were duplicates, you'd create A and then update it with B's data. This way, you can weed out A and only create B.
Don't process it as an .xls(x) file, convert it to a CSV. (if you haven't already).
Create some stored procedures on your database. I generally dislike stored procedures when used in projects for simple data retrieval, but they work wonders for automated scripts that need to run efficiently. Just add a Create function (I assume the update function will be unnecessary after you've weeded out the duplicates in tip 1).
Some tips I'm not sure will help your specific situation:
Use LINQ instead of creating command strings. LINQ automatically fine-tunes your queries. However, suddenly switching to LINQ is not something you can do in the blink of an eye, so you'll need to weigh the effort against how much you need it.
I know you said there is no Excel on the database server, but you can have the database process .csv files instead; no installed software is needed for CSV files. You can look into the following: http://dev.mysql.com/doc/refman/5.1/en/load-data.html
I'm really struggling to iron out this issue. When I use the following code to update my database for large numbers of records it runs extremely slow. I've got 500,000 records to update which takes nearly an hour. During this operation, the journal file grows slowly with little change on the main SQLite db3 file - is this normal?
The operation only seems to be a problem when I have large numbers of records to update - it runs virtually instantly on smaller numbers of records.
Some other operations are performed on the database prior to this code running, so could they be somehow tying up the database? I've tried to ensure that all other connections are closed properly.
Thanks for any suggestions
using (SQLiteConnection sqLiteConnection = new SQLiteConnection("Data Source=" + _case.DatabasePath))
{
sqLiteConnection.Open();
using (SQLiteCommand sqLiteCommand = new SQLiteCommand("begin", sqLiteConnection))
{
sqLiteCommand.ExecuteNonQuery();
sqLiteCommand.CommandText = "UPDATE CaseFiles SET areaPk = @areaPk, KnownareaPk = @knownareaPk WHERE mhash = @mhash";
var pcatpk = sqLiteCommand.CreateParameter();
var pknowncatpk = sqLiteCommand.CreateParameter();
var pmhash = sqLiteCommand.CreateParameter();
pcatpk.ParameterName = "@areaPk";
pknowncatpk.ParameterName = "@knownareaPk";
pmhash.ParameterName = "@mhash";
sqLiteCommand.Parameters.Add(pcatpk);
sqLiteCommand.Parameters.Add(pknowncatpk);
sqLiteCommand.Parameters.Add(pmhash);
foreach (CatItem CatItem in _knownFiless)
{
if (CatItem.FromMasterHashes == true)
{
pcatpk.Value = CatItem.areaPk;
pknowncatpk.Value = CatItem.areaPk;
pmhash.Value = CatItem.mhash;
}
else
{
pcatpk.Value = CatItem.areaPk;
pknowncatpk.Value = null;
pmhash.Value = CatItem.mhash;
}
sqLiteCommand.ExecuteNonQuery();
}
sqLiteCommand.CommandText = "end";
sqLiteCommand.ExecuteNonQuery();
sqLiteCommand.Dispose();
sqLiteConnection.Close();
}
sqLiteConnection.Close();
}
The first thing is to ensure that you have an index on mhash.
Group commands into batches.
Use more than one thread.
Or
Bulk import the records to a temporary table. Create an index on the mhash column. Perform a single update statement to update the records.
You need to wrap everything inside a transaction otherwise I believe SQLite will create and commit one for you for every update ... hence the slowness. You clearly know that looking at your code but I am not sure using "Begin" and "End" commands achieve the same result here, you might end up with empty transaction at start and finish instead of one wrapping everything. Try something like this instead just in case:
using (SQLiteTransaction mytransaction = myconnection.BeginTransaction())
{
using (SQLiteCommand mycommand = new SQLiteCommand(myconnection))
{
SQLiteParameter myparam = new SQLiteParameter();
mycommand.CommandText = "YOUR QUERY HERE";
mycommand.Parameters.Add(myparam);
foreach (CatItem CatItem in _knownFiless)
{
...
mycommand.ExecuteNonQuery();
}
}
mytransaction.Commit();
}
This part is most certainly your problem.
foreach (CatItem CatItem in _knownFiless)
{
....
sqLiteCommand.ExecuteNonQuery();
}
You are looping over a List(?) and executing a query against the database on each iteration. That is not a good way to do it, because database calls are quite expensive. Consider another way of updating these items.
The SQL code appears to be okay. The C# code is not wrong, but it has some redundancy (explicit close/dispose is not needed since you're using a using already).
There is a loop over _knownFiless (is the double s intended?); could that be what runs slowly? It is unusual to run a query against the DB inside a loop; rather, you should create one query with the respective set of parameters. Consider that (especially without an index on the hash) you will perform n * m operations (n being the iteration count of the loop, m being the table size).
Considering that m is around 500k, and assuming that m = n you will get 250,000,000,000 operations. That may well last an hour.
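As a rough worked version of that estimate, and of what an index on the hash column might change: the second line assumes a B-tree style index in which one lookup costs on the order of log2(m) comparisons, which is an idealisation and not a measured figure.

```latex
\begin{align*}
\text{without index:}\quad n \cdot m &= (5 \times 10^{5}) \cdot (5 \times 10^{5}) = 2.5 \times 10^{11} \\
\text{with index:}\quad n \cdot \log_2 m &\approx (5 \times 10^{5}) \cdot 19 \approx 9.5 \times 10^{6}
\end{align*}
```

Under these assumptions the indexed case does several orders of magnitude less work.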
Former connections or operations should have no effect as far as I know.
You should also ensure that the internal structure of the database is not causing problems. Is there a compound index that is affected by this operation? Any foreign keys / complex constraints?
I am doing a bulk insert into a Sybase database by grouping insert queries and sending them to the database in batches, where the batch size is configurable. The code looks somewhat like this:
public static void InsertModelValueInBulk(DataSet modelValueData, int clsaId)
{
int batchSize = Convert.ToInt32(ConfigurationManager.AppSettings["BatchSize"].ToString());
IList<string> queryBuffer = new List<string>();
using (var connection = GetAseConnection())
{
connection.Open();
var tran = connection.BeginTransaction();
try
{
for (int i = 0; i < modelValueData.Tables[0].Rows.Count; i++)
{
var insertItem = string.Format(@"select '{0}',{1},{2},{3},'{4}','{5}','{6}',{7}", row["ModelValueID"], Convert.ToInt32(row["StockModelID"]), Convert.ToInt32(row["ModelItemID"]),
fyeStr, row["Period"], value, row["UpdatedUser"], clsaId);
queryBuffer.Add(insertItem);
if (queryBuffer.Count % (batchSize) == 0 && queryBuffer.Count > 0)
{
var finalQuery = @"INSERT INTO InsertTable (ModelValueID, StockModelID, ModelItemID, FYE, Period, Value, UpdatedUser,id)
" + String.Join(" union ", queryBuffer.ToArray<string>());
using (var cmd = new AseCommand(finalQuery, connection, tran))
{
cmd.ExecuteNonQuery();
}
queryBuffer.Clear();
}
}
tran.Commit();
}
catch
{
tran.Rollback();
throw;
}
finally
{
tran.Dispose();
}
}
}
Using this, the time taken to insert 20000 rows plotted against batch size forms a J curve. Sample data looks something like this:
batch size 10 => 30 sec, 50 => 20 sec, 100 => 10 sec, 200 => 20 sec, 500 => 30 sec, 1000 => 1 min.
I would like to understand the reason behind this J curve. Is it something to do with app server memory, some database server setting, or something else? What makes 100 the optimum, and can this be tweaked further?
A bulk insert locks the table for the duration of each batch. Locks have a basic overhead, so small batches won't benefit nearly as much, but they do let other operations happen against the table in between batches.
So larger batches are good, to a point. Because it's a transaction, the data is not committed until the current batch is complete, which means writing to the log file. Really large batches will cause the log to grow, which is IO intensive; it also increases contention, as more of your log will be in use.
Something along those lines.
edit: Two other things
1) Use parameterized inputs
2) If you don't do #1, "union" causes a distinct. Use "union all"
I see quite a few issues with your existing code. For example, I would not assume that a Commit will always succeed.
I would wrap all code that has the potential to fail (Commit, Rollback, cmd.ExecuteNonQuery) in a try/catch.
I would look at the select statement; personally I would create a stored procedure, and if you can't do that I would make the select string a const.
I would name my transactions, personally, but that's up to you.
Does this line have the potential to change on every method call?
int batchSize = Convert.ToInt32(ConfigurationManager.AppSettings["BatchSize"].ToString());
If not, I would make it a static field and not read it every time you go into the method.
Try to refactor your code; it's starting to look a bit confusing to follow.