I have an Excel document that has about 250,000 rows, which takes forever to import. I have tried many variations of this import; however, there are a few requirements:
- Need to validate the data in each cell
- Must check if a duplicate exists in the database
- If a duplicate exists, update the entry
- If no entry exists, insert a new one
I have used parallelization as much as possible; however, I am sure that there must be some way to get this import to run much faster. Any assistance or ideas would be greatly appreciated.
Note that the database is on a LAN, and yes, I know I haven't used parameterized SQL commands (yet).
public string BulkUserInsertAndUpdate()
{
DateTime startTime = DateTime.Now;
try
{
ProcessInParallel();
Debug.WriteLine("Time taken: " + (DateTime.Now - startTime));
}
catch (Exception ex)
{
return ex.Message;
}
return "";
}
private IEnumerable<Row> ReadDocument()
{
using (SpreadsheetDocument spreadSheetDocument = SpreadsheetDocument.Open(_fileName, false))
{
WorkbookPart workbookPart = spreadSheetDocument.WorkbookPart;
Sheet ss = workbookPart.Workbook.Descendants<Sheet>().SingleOrDefault(s => s.Name == "User");
if (ss == null)
throw new Exception("There was a problem trying to import the file. Please ensure that the Sheet's name is: User");
WorksheetPart worksheetPart = (WorksheetPart)workbookPart.GetPartById(ss.Id);
OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
StringTablePart = workbookPart.SharedStringTablePart;
while (reader.Read())
{
if (reader.ElementType == typeof(Row))
{
do
{
if (reader.HasAttributes)
{
var rowNum = int.Parse(reader.Attributes.First(a => a.LocalName == "r").Value);
if (rowNum == 1)
continue;
var row = (Row)reader.LoadCurrentElement();
yield return row;
}
} while (reader.ReadNextSibling()); // Skip to the next row
break; // We just looped through all the rows so no need to continue reading the worksheet
}
}
}
}
private void ProcessInParallel()
{
// Use ConcurrentQueue to enable safe enqueueing from multiple threads.
var exceptions = new ConcurrentQueue<Exception>();
Parallel.ForEach(ReadDocument(), (row, loopState) =>
{
List<Cell> cells = row.Descendants<Cell>().ToList();
if (string.IsNullOrEmpty(GetCellValue(cells[0], StringTablePart)))
return;
// validation code goes here....
try
{
using (SqlConnection connection = new SqlConnection("user id=sa;password=D3vAdm!n#;server=196.30.181.143;database=TheUnlimitedUSSD;MultipleActiveResultSets=True"))
{
connection.Open();
SqlCommand command = new SqlCommand("SELECT count(*) FROM dbo.[User] WHERE MobileNumber = '" + mobileNumber + "'", connection);
var userCount = (int) command.ExecuteScalar();
if (userCount > 0)
{
// update
command = new SqlCommand("UPDATE [user] SET NewMenu = " + (newMenuIndicator ? "1" : "0") + ", PolicyNumber = '" + policyNumber + "', Status = '" + status + "' WHERE MobileNumber = '" + mobileNumber + "'", connection);
command.ExecuteScalar();
Debug.WriteLine("Update cmd");
}
else
{
// insert
command = new SqlCommand("INSERT INTO dbo.[User] ( MobileNumber , Status , PolicyNumber , NewMenu ) VALUES ( '" + mobileNumber + "' , '" + status + "' , '" + policyNumber + "' , " + (newMenuIndicator ? "1" : "0") + " )", connection);
command.ExecuteScalar();
Debug.WriteLine("Insert cmd");
}
}
}
catch (Exception ex)
{
exceptions.Enqueue(ex);
Debug.WriteLine(ex.Message);
loopState.Break();
}
});
// Throw the exceptions here after the loop completes.
if (exceptions.Count > 0)
throw new AggregateException(exceptions);
}
I would have suggested that you do a bulk import WITHOUT any validation to an intermediary table, and only then do all the validation via SQL. Your spreadsheet's data will now be in a similar structure to a SQL table.
This is what I have done with industrial-strength imports of 3 million+ rows from Excel and CSV, with great success.
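To make the staging-table approach concrete, here is a rough sketch using SqlBulkCopy and a set-based MERGE. The staging table name (dbo.UserStaging) and the idea of loading the sheet into a DataTable first are assumptions on my part; the target columns (MobileNumber, Status, PolicyNumber, NewMenu) come from your code.
// Sketch only: bulk-load the raw rows into a staging table, validate there with SQL,
// then upsert into dbo.[User] in a single set-based statement.
// Requires: using System.Data; using System.Data.SqlClient;
private void BulkLoadAndMerge(DataTable rows, string connectionString)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();

        // 1. Stream the whole DataTable into the staging table.
        using (var bulk = new SqlBulkCopy(connection) { DestinationTableName = "dbo.UserStaging", BatchSize = 10000 })
        {
            bulk.WriteToServer(rows);
        }

        // 2. (Validation queries against dbo.UserStaging would go here.)

        // 3. Duplicate check, update and insert as one MERGE on the server.
        const string mergeSql = @"
            MERGE dbo.[User] AS target
            USING dbo.UserStaging AS source
               ON target.MobileNumber = source.MobileNumber
            WHEN MATCHED THEN
                UPDATE SET target.Status = source.Status,
                           target.PolicyNumber = source.PolicyNumber,
                           target.NewMenu = source.NewMenu
            WHEN NOT MATCHED THEN
                INSERT (MobileNumber, Status, PolicyNumber, NewMenu)
                VALUES (source.MobileNumber, source.Status, source.PolicyNumber, source.NewMenu);";

        using (var merge = new SqlCommand(mergeSql, connection))
        {
            merge.CommandTimeout = 0; // large merges can exceed the default 30 seconds
            merge.ExecuteNonQuery();
        }
    }
}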
Mostly I'd suggest you check that your parallelism is optimal. Since your bottlenecks are likely to be disk IO on the Excel file and IO to the SQL server, I'd suggest that it may not be. You've parallelised across those two processes (so each of them is reduced to the speed of the slowest); your parallel threads will be fighting over the database and potentially slowing each other down. There's no point having (say) eight threads if your hard disk can't keep up with one - it just creates overhead.
Two things I'd suggest. First: take out all the parallelism and see if it's actually helping. If you single-threadedly parse the whole file into a single Queue in memory, then run the whole thing into the database, you might find it's faster.
Then, I'd try splitting it to just two threads: one to process the incoming file to the Queue, and one to take the items from the Queue and push them into the database. This way you have one thread per slow resource that you're handling - so you minimise contention - and each thread is blocked by only one resource - so you're handling that resource as optimally as possible.
This is the real trick of multithreaded programming. Throwing extra threads at a problem doesn't necessarily improve performance. What you're trying to do is minimise the time that your program is waiting idly for something external (such as disk or network IO) to complete. If one thread only waits on the Excel file, and one thread only waits on the SQL server, and what they do in between is minimal (which, in your case, it is), you'll find your code will run as fast as those external resources will allow it to.
Also, you mention it yourself, but using parameterised Sql isn't just a cool thing to point out: it will increase your performance. At the moment, you're creating a new SqlCommand for every insert, which has overhead. If you switch to a parameterised command, you can keep the same command throughout and just change the parameter values, which will save you some time. I don't think this is possible in a parallel ForEach (I doubt you can reuse the SqlCommand across threads), but it'd work fine with either of the approaches above.
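To illustrate the two-thread split and the reuse of a single parameterized command, a sketch along these lines could work. ReadDocument, GetCellValue, StringTablePart and the column names come from the question; the placeholder connection string, the parameter sizes and the single-statement upsert are my assumptions.
// Sketch: one thread parses the spreadsheet, one thread writes to SQL Server.
// Requires: using System.Collections.Concurrent; using System.Data; using System.Data.SqlClient; using System.Threading.Tasks;
var rows = new BlockingCollection<Row>(boundedCapacity: 1000);

// Producer: only ever waits on the Excel file.
var producer = Task.Factory.StartNew(() =>
{
    foreach (var row in ReadDocument())
        rows.Add(row);
    rows.CompleteAdding();
});

// Consumer: only ever waits on the database, reusing one parameterized command.
var consumer = Task.Factory.StartNew(() =>
{
    using (var connection = new SqlConnection("<your connection string>"))
    using (var command = new SqlCommand(
        @"UPDATE dbo.[User] SET NewMenu = @newMenu, PolicyNumber = @policyNumber, Status = @status
           WHERE MobileNumber = @mobileNumber;
          IF @@ROWCOUNT = 0
              INSERT INTO dbo.[User] (MobileNumber, Status, PolicyNumber, NewMenu)
              VALUES (@mobileNumber, @status, @policyNumber, @newMenu);", connection))
    {
        connection.Open();
        command.Parameters.Add("@mobileNumber", SqlDbType.VarChar, 20);   // sizes are assumptions
        command.Parameters.Add("@status", SqlDbType.VarChar, 50);
        command.Parameters.Add("@policyNumber", SqlDbType.VarChar, 50);
        command.Parameters.Add("@newMenu", SqlDbType.Bit);

        foreach (var row in rows.GetConsumingEnumerable())
        {
            var cells = row.Descendants<Cell>().ToList();
            // validation code goes here, as in the original loop...
            command.Parameters["@mobileNumber"].Value = GetCellValue(cells[0], StringTablePart);
            // ...set @status, @policyNumber and @newMenu from the other cells...
            command.ExecuteNonQuery();
        }
    }
});

Task.WaitAll(producer, consumer);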
Some tips for faster processing (as I believe this is what you need, not really a code fix):
Have Excel check for duplicate rows beforehand. It's a really decent tool for weeding out obsolete rows. If A and B were duplicates, you'd create A and then update it with B's data. This way, you can weed out A and only create B.
Don't process it as an .xls(x) file, convert it to a CSV. (if you haven't already).
Create some stored procedures on your database. I generally dislike stored procedures when used in projects for simple data retrieval, but they work wonders for automated scripts that need to run efficiently. Just add a Create procedure (I assume the update will be unnecessary after you've weeded out the duplicates in tip 1); see the sketch at the end of this answer.
Some tips I'm not sure will help your specific situation:
Use LINQ instead of creating command strings. LINQ automatically fine-tunes your queries. However, suddenly switching to LINQ is not something you can do in the blink of an eye, so you'll need to weigh the effort against how much you need it.
I know you said there is no Excel on the database server, but you can have the database process .csv files instead; no installed software is needed for CSV files. You can look into the following: http://dev.mysql.com/doc/refman/5.1/en/load-data.html
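Back to tip 3, the call to a create/upsert stored procedure from C# could look roughly like this. The procedure name dbo.usp_CreateUser and its parameter names and sizes are hypothetical; mobileNumber, status, policyNumber, newMenuIndicator and connection are as in the question's loop.
// Hypothetical sketch: call a stored procedure instead of building SQL strings.
// Requires: using System.Data; using System.Data.SqlClient;
using (var command = new SqlCommand("dbo.usp_CreateUser", connection))
{
    command.CommandType = CommandType.StoredProcedure;
    command.Parameters.Add("@MobileNumber", SqlDbType.VarChar, 20).Value = mobileNumber;
    command.Parameters.Add("@Status", SqlDbType.VarChar, 50).Value = status;
    command.Parameters.Add("@PolicyNumber", SqlDbType.VarChar, 50).Value = policyNumber;
    command.Parameters.Add("@NewMenu", SqlDbType.Bit).Value = newMenuIndicator;
    command.ExecuteNonQuery();
}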
My code is working; the function gives me the correct SELECT COUNT(*) value, but it still throws an ORA-25191 exception ("Cannot reference overflow table of an index-organized table")
at retVal = Convert.ToInt32(cmd.ExecuteScalar());
Since I use the function very often, the exceptions slow down my program tremendously.
private int getSelectCountQueryOracle(string Sqlquery)
{
try
{
int retVal = 0;
using (DataTable dataCount = new DataTable())
{
using (OracleCommand cmd = new OracleCommand(Sqlquery))
{
cmd.CommandType = CommandType.Text;
cmd.Connection = oraCon;
using (OracleDataAdapter dataAdapter = new OracleDataAdapter())
{
retVal = Convert.ToInt32(cmd.ExecuteScalar());
}
}
}
return retVal;
}
catch (Exception ex)
{
exceptionProtocol("Count Function", ex.ToString());
return 1;
}
}
This function is called in a foreach loop
// function call in foreach loop which goes through tablenames
foreach (DataRow row in dataTbl.Rows)
{...
tableNameFromRow = row["TABLE_NAME"].ToString();
tableRows=getSelectCountQueryOracle("select count(*) as 'count' from " +tableNameFromRow);
tableColumns = getSelectCountQueryOracle("SELECT COUNT(*) as 'count' FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name='" + tableNameFromRow + "'");
...}
dataTbl.Rows in this outer loop, in turn, comes from the query
SELECT * FROM USER_TABLES ORDER BY TABLE_NAME
If you're using a database-agnostic API like ADO.Net, you would almost always want to use the API's framework to fetch metadata rather than writing custom queries against each database's metadata tables. The various ADO.Net providers are much more likely to write data dictionary queries that handle all the various corner cases and are much more likely to be optimized than the queries you're likely to write. So rather than writing your own query to populate the dataTbl data table, you'd want to use the GetSchema method
DataTable dataTbl = connection.GetSchema("Tables");
If you want to keep your custom-coded data dictionary query for some reason, you'd need to filter out the IOT overflow tables since you can't query those directly.
select *
from user_tables
where iot_type IS NULL
or iot_type != 'IOT_OVERFLOW'
Be aware, however, that there are likely to be other tables that you don't want to try to get a count from. For example, the dropped column indicates whether a table has been dropped; presumably, you don't want to count the number of rows in an object in the recycle bin. So you'd want a dropped = 'NO' predicate as well. And you can't do a count(*) on a nested table, so you'd want a nested = 'NO' predicate as well if your schema happens to contain nested tables. There are probably other corner cases, depending on the exact set of features your particular schema makes use of, that the developers of the provider have added code for and that you'd have to deal with.
So I'd start with
select *
from user_tables
where ( iot_type IS NULL
or iot_type != 'IOT_OVERFLOW')
and dropped = 'NO'
and nested = 'NO'
but know that you'll probably need/want to add some additional filters depending on the specific features users make use of. I'd certainly much rather let the fine folks who develop the ADO.Net provider worry about all those corner cases than deal with finding all of them myself.
Taking a step back, though, I'd question why you're regularly doing a count(*) on every table in a schema and why you need an exact answer. In most cases where you're doing counts, you're either doing a one-off where you don't much care how long it takes (e.g. a validation step after a migration), or approximate counts would be sufficient (e.g. getting a list of the biggest tables in the system in order to triage some effort, or to track growth over time for projections), in which case you could just use the counts that are already stored in the data dictionary (user_tables.num_rows) from the last time that statistics were gathered.
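If approximate counts are acceptable, a sketch along these lines avoids the per-table COUNT(*) entirely. oraCon is the connection from the question; note that num_rows is NULL for tables whose statistics have never been gathered.
// Sketch: read approximate row counts straight from the data dictionary.
// Requires: using System.Collections.Generic;
var approxCounts = new Dictionary<string, long>();
using (OracleCommand cmd = new OracleCommand(
    "SELECT table_name, num_rows FROM user_tables " +
    "WHERE (iot_type IS NULL OR iot_type != 'IOT_OVERFLOW') " +
    "AND dropped = 'NO' AND nested = 'NO'", oraCon))
using (OracleDataReader rdr = cmd.ExecuteReader())
{
    while (rdr.Read())
    {
        string tableName = rdr.GetString(0);
        // num_rows is NULL until statistics are gathered for the table
        long rowCount = rdr.IsDBNull(1) ? 0L : Convert.ToInt64(rdr.GetValue(1));
        approxCounts[tableName] = rowCount;
    }
}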
This article helped me to solve my problem.
I've changed my query to this:
SELECT * FROM user_tables
WHERE iot_type IS NULL OR iot_type != 'IOT_OVERFLOW'
ORDER BY TABLE_NAME
Maybe this is not the best code I've ever written, but it is for a simple form whose goal is to write data to a remote server.
I have two MySqlConnections, both for a local database: localConnection is used to read the DB and updateConnection edits every single row. The problem is that when I try to update the database, the program raises a timeout and crashes.
I think the problem is generated by the while loop.
My intention is to read a single row, post it to the server, and update it if the server returns status 200.
Here's the code; it fails on updateCommand.ExecuteNonQuery();
// Local Database here.
localCommand.Parameters.Clear();
// 0 - Grab unsent emails
string receivedMessages = "SELECT * FROM EMAIL WHERE HASSENT = 0";
// Update connection init START
string updateConnectionString = "Server=" + this.localServer + ";Database=" + this.localDatabase + ";Uid=" + this.localUser + ";Pwd=" + this.localpassword;
MySqlConnection updateConnection = new MySqlConnection(updateConnectionString);
updateConnection.Open();
MySqlTransaction transaction = updateConnection.BeginTransaction();
MySqlCommand updateCommand = new MySqlCommand();
// Update connection init END
localCommand.Connection = localConnection;
localCommand.Prepare();
try
{
localCommand.CommandText = receivedMessages;
MySqlDataReader reader = localCommand.ExecuteReader();
while (reader.Read()) // Local db read
{
String EID = reader.GetString(0);
String message = reader.GetString(3);
String fromEmail = reader.GetString(6);
String toEmail= reader.GetString(12);
// 1 - Post Request via HttpWebRequest
var receivedResponse = JObject.Parse(toSend.setIsReceived(fromEmail, message, toEmail));
// 2 - Read the JSON response from the server
if ((int)receivedResponse["status"] == 200)
{
string updateInbox = "UPDATE EMAIL SET HASSENT = 1 WHERE EMAILID = @EID";
MySqlParameter EMAILID = new MySqlParameter("@EID", MySqlDbType.String);
EMAILID.Value = EID; // We use the same EID fetched above
updateCommand.Connection = updateConnection;
updateCommand.Parameters.Add(EMAILID);
updateCommand.Prepare();
updateCommand.CommandText = updateInbox;
updateCommand.ExecuteNonQuery();
}
else
{
// Notice the error....
}
}
}
catch (MySqlException ex)
{
transaction.Rollback();
// Notice...
}
finally
{
updateConnection.Close();
}
It is hard to tell exactly what's wrong here without doing some experiments.
There are two possibilities, though.
First, your program appears to be running on a web server, which necessarily constrains it to run for a limited amount of time. But, you loop through a possibly large result set, and do stuff of an uncontrollable duration for each item in that result set.
Second, you read a result set row by row from the MySQL server, and with a different connection try to update the tables behind that result set. This may cause a deadlock, in which the MySQL server blocks one of your update queries until the select query completes, thus preventing the completion of the select query.
How to cure this? First of all, try to handle a fixed and small number of rows in each invocation of this code. Change your select query to
SELECT * FROM EMAIL WHERE HASSENT = 0 LIMIT 10
and you'll handle ten records each time through.
Second, read the whole result set from the select query into a data structure, then loop over the items. In other words, don't nest the updates in the select (a sketch follows below).
Third, reduce the amount of data you handle by changing SELECT * to SELECT field, field, field.
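A rough sketch of the second suggestion, keeping the toSend.setIsReceived helper, the EMAIL table and the two connections from the question (the small Pending class is only for illustration):
// Sketch: read the pending rows first, close the reader, then post and update.
// Requires: using System.Collections.Generic; using MySql.Data.MySqlClient; using Newtonsoft.Json.Linq;
var pending = new List<Pending>();

localCommand.Connection = localConnection;
localCommand.CommandText = "SELECT * FROM EMAIL WHERE HASSENT = 0 LIMIT 10";
using (MySqlDataReader reader = localCommand.ExecuteReader())
{
    while (reader.Read())
    {
        pending.Add(new Pending
        {
            EID = reader.GetString(0),
            Message = reader.GetString(3),
            FromEmail = reader.GetString(6),
            ToEmail = reader.GetString(12)
        });
    }
} // the reader (and its locks) are released before any update runs

using (MySqlCommand updateCommand = new MySqlCommand(
    "UPDATE EMAIL SET HASSENT = 1 WHERE EMAILID = @EID", updateConnection))
{
    updateCommand.Parameters.Add("@EID", MySqlDbType.String);
    foreach (var item in pending)
    {
        var receivedResponse = JObject.Parse(toSend.setIsReceived(item.FromEmail, item.Message, item.ToEmail));
        if ((int)receivedResponse["status"] == 200)
        {
            updateCommand.Parameters["@EID"].Value = item.EID;
            updateCommand.ExecuteNonQuery();
        }
    }
}

class Pending
{
    public string EID { get; set; }
    public string Message { get; set; }
    public string FromEmail { get; set; }
    public string ToEmail { get; set; }
}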
I'm really struggling to iron out this issue. When I use the following code to update my database for large numbers of records it runs extremely slow. I've got 500,000 records to update which takes nearly an hour. During this operation, the journal file grows slowly with little change on the main SQLite db3 file - is this normal?
The operation only seems to be a problem when I have large numbers or records to update - it runs virtually instantly on smaller numbers of records.
Some other operations are performed on the database prior to this code running, so could they somehow be tying up the database? I've tried to ensure that all other connections are closed properly.
Thanks for any suggestions
using (SQLiteConnection sqLiteConnection = new SQLiteConnection("Data Source=" + _case.DatabasePath))
{
sqLiteConnection.Open();
using (SQLiteCommand sqLiteCommand = new SQLiteCommand("begin", sqLiteConnection))
{
sqLiteCommand.ExecuteNonQuery();
sqLiteCommand.CommandText = "UPDATE CaseFiles SET areaPk = #areaPk, KnownareaPk = #knownareaPk WHERE mhash = #mhash";
var pcatpk = sqLiteCommand.CreateParameter();
var pknowncatpk = sqLiteCommand.CreateParameter();
var pmhash = sqLiteCommand.CreateParameter();
pcatpk.ParameterName = "@areaPk";
pknowncatpk.ParameterName = "@knownareaPk";
pmhash.ParameterName = "@mhash";
sqLiteCommand.Parameters.Add(pcatpk);
sqLiteCommand.Parameters.Add(pknowncatpk);
sqLiteCommand.Parameters.Add(pmhash);
foreach (CatItem CatItem in _knownFiless)
{
if (CatItem.FromMasterHashes == true)
{
pcatpk.Value = CatItem.areaPk;
pknowncatpk.Value = CatItem.areaPk;
pmhash.Value = CatItem.mhash;
}
else
{
pcatpk.Value = CatItem.areaPk;
pknowncatpk.Value = null;
pmhash.Value = CatItem.mhash;
}
sqLiteCommand.ExecuteNonQuery();
}
sqLiteCommand.CommandText = "end";
sqLiteCommand.ExecuteNonQuery();
sqLiteCommand.Dispose();
sqLiteConnection.Close();
}
sqLiteConnection.Close();
}
The first thing is to ensure that you have an index on mhash.
Group commands into batches.
Use more than one thread.
Or, alternatively:
Bulk import the records to a temporary table. Create an index on the mhash column. Perform a single update statement to update the records.
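A minimal sketch of that alternative, assuming a plain staging table called Incoming; the CaseFiles columns, _knownFiless and sqLiteConnection come from the question.
// Sketch: load the new values into a staging table, index it, then run one set-based UPDATE.
// Requires: using System.Data.SQLite;
using (var cmd = new SQLiteCommand(sqLiteConnection))
using (var tx = sqLiteConnection.BeginTransaction())
{
    cmd.Transaction = tx;
    cmd.CommandText = "CREATE TABLE IF NOT EXISTS Incoming (mhash TEXT, areaPk INTEGER, knownAreaPk INTEGER)";
    cmd.ExecuteNonQuery();

    cmd.CommandText = "INSERT INTO Incoming (mhash, areaPk, knownAreaPk) VALUES (@mhash, @areaPk, @knownAreaPk)";
    cmd.Parameters.Add(new SQLiteParameter("@mhash"));
    cmd.Parameters.Add(new SQLiteParameter("@areaPk"));
    cmd.Parameters.Add(new SQLiteParameter("@knownAreaPk"));
    foreach (CatItem CatItem in _knownFiless)
    {
        cmd.Parameters["@mhash"].Value = CatItem.mhash;
        cmd.Parameters["@areaPk"].Value = CatItem.areaPk;
        cmd.Parameters["@knownAreaPk"].Value = CatItem.FromMasterHashes == true ? (object)CatItem.areaPk : DBNull.Value;
        cmd.ExecuteNonQuery();
    }

    cmd.Parameters.Clear();
    cmd.CommandText = "CREATE INDEX IF NOT EXISTS idx_incoming_mhash ON Incoming(mhash)";
    cmd.ExecuteNonQuery();

    // One set-based update instead of 500,000 individual ones.
    cmd.CommandText = @"UPDATE CaseFiles SET
            areaPk      = (SELECT i.areaPk      FROM Incoming i WHERE i.mhash = CaseFiles.mhash),
            KnownareaPk = (SELECT i.knownAreaPk FROM Incoming i WHERE i.mhash = CaseFiles.mhash)
        WHERE mhash IN (SELECT mhash FROM Incoming)";
    cmd.ExecuteNonQuery();

    cmd.CommandText = "DROP TABLE Incoming";
    cmd.ExecuteNonQuery();

    tx.Commit();
}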
You need to wrap everything inside a transaction, otherwise I believe SQLite will create and commit one for you for every update ... hence the slowness. You clearly know that, looking at your code, but I am not sure that using "begin" and "end" commands achieves the same result here; you might end up with an empty transaction at the start and at the finish instead of one wrapping everything. Try something like this instead, just in case:
using (SQLiteTransaction mytransaction = myconnection.BeginTransaction())
{
using (SQLiteCommand mycommand = new SQLiteCommand(myconnection))
{
SQLiteParameter myparam = new SQLiteParameter();
mycommand.CommandText = "YOUR QUERY HERE";
mycommand.Parameters.Add(myparam);
foreach (CatItem CatItem in _knownFiless)
{
...
mycommand.ExecuteNonQuery();
}
}
mytransaction.Commit();
}
This part is most certainly your problem.
foreach (CatItem CatItem in _knownFiless)
{
....
sqLiteCommand.ExecuteNonQuery();
}
You are looping over a List(?) and executing a query against the database for each item. That is not a good way to do it, because database calls are quite expensive. So you might consider using another way of updating these items.
The SQL code appears to be okay. The C# code is not wrong, but it has some redundancy (explicit close/dispose is not needed since you're using a using already).
There is a loop over _knownFiless (intended with a double s?); could that possibly be running slowly? It is unusual to run a query against the DB inside a loop; rather, you should create a single query with the respective set of parameters. Consider that (especially without an index on the hash) you will perform n * m operations (n being the iteration count of the loop, m being the table size).
Considering that m is around 500k, and assuming that m = n you will get 250,000,000,000 operations. That may well last an hour.
Former connections or operations should have no effect as far as I know.
You should also ensure that the internal structure of the database is not causing problems. Is there a compound index that is affected by this operation? Any foreign keys / complex constraints?
Is it possible to extract the marked section of my code and have it run in multiple threads?
The app copies data from a FoxPro DB to our SQL Server over a network (the files are quite huge, so the bulk copy needs to happen in increments).
It works, but I'd like to bump up the speed a bit.
1) by either having the section I marked run in multiple threads, OR, as an alternative,
2) by not looping through each column in the DataRow.
I went for the second option... (updated code below)
CODE
private void BulkCopy(OleDbDataReader reader, string tableName, Table table)
{
if (Convert.ToBoolean(ConfigurationManager.AppSettings["CopyData"]))
{
Console.WriteLine(tableName + " BulkCopy Started.");
try
{
DataTable tbl = new DataTable();
foreach (Column col in table.Columns)
{
tbl.Columns.Add(col.Name, ConvertDataTypeToType(col.DataType));
}
int batch = 1;
int counter = 0;
DataRow tblRow = tbl.NewRow();
while (reader.Read())
{
counter++;
////This section changed
object[] obj = tblRow.ItemArray;
reader.GetValues(obj);
tblRow.ItemArray = obj;
////**********
tbl.LoadDataRow(tblRow.ItemArray, true);
if (counter == BulkInsertIncrement)
{
Console.WriteLine(tableName + " :: Batch >> " + batch);
counter = PerformInsert(tableName, tbl, batch);
batch++;
}
}
if (counter > 0)
{
Console.WriteLine(tableName + " :: Batch >> " + batch);
PerformInsert(tableName, tbl, counter);
}
tbl = null;
Console.WriteLine("BulkCopy Success!");
}
catch (Exception)
{
Console.WriteLine("BulkCopy Fail!");
}
finally
{
reader.Close();
reader.Dispose();
}
Console.WriteLine(tableName + " BulkCopy Ended.");
}
}
UPDATE
I went for the second option
I wasn't aware that, inside the while (reader.Read()) loop, I could do the following. It helped to greatly increase the app's performance:
while (reader.Read())
{
object[] obj = tblRow.ItemArray;
reader.GetValues(obj);
tblRow.ItemArray = obj;
tbl.LoadDataRow(tblRow.ItemArray, true);
}
There is no need to multithread if you take out the beginner mistakes you are making. There is TONS of slow code everywhere.
tblRow[col.Name] = reader[col.Name];
SLOW. NEVER use the column name - get the indexes outside the loop, then use the indexes. This line has 2 (!) dictionary lookups for every row, taking more time than the row processing.
DataTables / DataSets are dead slow to start with (a bad technological choice), but code like that really slows you down. Use a profiler to find the other bad elements.
This may not be the answer you're after, but have you tried running the console application in release mode first, with just one try statement, and using indexes on the reader? It's probably not going to increase the speed a great deal by making it multi-threaded as SQL Server will be the main bottleneck.
Of course if you don't care too much about data integrity (for example your IDs aren't sequential) you could change the table locking type for inserts and spin up 3-4 threads to read from certain points in the table.
I don't think that your use case will significantly benefit from a parallel foreach. Also, it would be pretty hard to implement because of the OleDbDataReader that is used in your code.
But what you could do is schedule the inserts on a new thread, so that your loop does not block for the time SQL Server needs to insert your data.
You can use the Task.Factory.StartNew() method for this. But this will make error handling a bit more complex, in the sense that when an insert fails you might already have processed more data, or in the worst case there is already another thread waiting with new inserts for the database.
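A rough sketch of that idea, reusing PerformInsert and the loop variables from the question; handing off a DataTable.Copy() per batch and resetting the counter myself are assumptions.
// Sketch: hand each full batch to a background task so the reader loop keeps going.
// Requires: using System.Collections.Generic; using System.Data; using System.Threading.Tasks;
var pendingInserts = new List<Task>();
while (reader.Read())
{
    counter++;
    object[] obj = tblRow.ItemArray;
    reader.GetValues(obj);
    tbl.LoadDataRow(obj, true);

    if (counter == BulkInsertIncrement)
    {
        DataTable batchCopy = tbl.Copy();   // snapshot handed to the background task
        int batchNumber = batch;
        pendingInserts.Add(Task.Factory.StartNew(() => PerformInsert(tableName, batchCopy, batchNumber)));
        tbl.Clear();
        counter = 0;
        batch++;
    }
}
if (counter > 0)
    pendingInserts.Add(Task.Factory.StartNew(() => PerformInsert(tableName, tbl, batch)));

// Any insert failure surfaces here, possibly after later batches have already
// been read, so be prepared to log or retry per batch.
Task.WaitAll(pendingInserts.ToArray());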
If you're using .NET 4, you could try using the TPL, and convert the foreach loop into something like
Parallel.ForEach(table.Columns, col => { /* rest of the loop body here */ });
I am using C# to import a CSV with 6-8million rows.
My table looks like this:
CREATE TABLE [Data] ([ID] VARCHAR(100) NULL,[Raw] VARCHAR(200) NULL)
CREATE INDEX IDLookup ON Data(ID ASC)
I am using System.Data.SQLite to do the import.
Currently, doing 6 million rows takes 2 min 55 secs on Windows 7 32-bit, Core 2 Duo 2.8 GHz & 4 GB RAM. That's not too bad, but I was just wondering if anyone could see a way of importing it quicker.
Here is my code:
public class Data
{
public string IDData { get; set; }
public string RawData { get; set; }
}
string connectionString = @"Data Source=" + Path.GetFullPath(AppDomain.CurrentDomain.BaseDirectory + "\\dbimport");
System.Data.SQLite.SQLiteConnection conn = new System.Data.SQLite.SQLiteConnection(connectionString);
conn.Open();
//Dropping and recreating the table seems to be the quickest way to get old data removed
System.Data.SQLite.SQLiteCommand command = new System.Data.SQLite.SQLiteCommand(conn);
command.CommandText = "DROP TABLE Data";
command.ExecuteNonQuery();
command.CommandText = #"CREATE TABLE [Data] ([ID] VARCHAR(100) NULL,[Raw] VARCHAR(200) NULL)";
command.ExecuteNonQuery();
command.CommandText = "CREATE INDEX IDLookup ON Data(ID ASC)";
command.ExecuteNonQuery();
string insertText = "INSERT INTO Data (ID,RAW) VALUES(@P0,@P1)";
SQLiteTransaction trans = conn.BeginTransaction();
command.Transaction = trans;
command.CommandText = insertText;
Stopwatch sw = new Stopwatch();
sw.Start();
using (CsvReader csv = new CsvReader(new StreamReader(@"C:\Data.txt"), false))
{
var f = csv.Select(x => new Data() { IDData = x[27], RawData = String.Join(",", x.Take(24)) });
foreach (var item in f)
{
command.Parameters.AddWithValue("#P0", item.IDData);
command.Parameters.AddWithValue("#P1", item.RawData);
command.ExecuteNonQuery();
}
}
trans.Commit();
sw.Stop();
Debug.WriteLine(sw.Elapsed.Minutes + "Min(s) " + sw.Elapsed.Seconds + "Sec(s)");
conn.Close();
This is quite fast for 6 million records.
It seems that you are doing it the right way. Some time ago I read on sqlite.org that when inserting records you need to put the inserts inside a transaction; if you don't, your inserts will be limited to only about 60 per second! That is because each insert is treated as a separate transaction, and each transaction must wait for the disk to rotate fully. You can read the full explanation here:
http://www.sqlite.org/faq.html#q19
Actually, SQLite will easily do 50,000 or more INSERT statements per second on an average desktop computer. But it will only do a few dozen transactions per second. Transaction speed is limited by the rotational speed of your disk drive. A transaction normally requires two complete rotations of the disk platter, which on a 7200RPM disk drive limits you to about 60 transactions per second.
Comparing your time with the average stated above: at 50,000 inserts per second, 6 million rows should take about 2 min 0 sec, which is only a little faster than your time.
Transaction speed is limited by disk drive speed because (by default) SQLite actually waits until the data really is safely stored on the disk surface before the transaction is complete. That way, if you suddenly lose power or if your OS crashes, your data is still safe. For details, read about atomic commit in SQLite..
By default, each INSERT statement is its own transaction. But if you surround multiple INSERT statements with BEGIN...COMMIT then all the inserts are grouped into a single transaction. The time needed to commit the transaction is amortized over all the enclosed insert statements and so the time per insert statement is greatly reduced.
There is a hint in the next paragraph about how you could try to speed up the inserts:
Another option is to run PRAGMA synchronous=OFF. This command will cause SQLite to not wait on data to reach the disk surface, which will make write operations appear to be much faster. But if you lose power in the middle of a transaction, your database file might go corrupt.
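If you want to experiment with that, the pragma can be issued on the same connection before the transaction is started. A minimal sketch, with the durability trade-off noted in the quote above (the journal_mode pragma is an extra assumption beyond what the FAQ mentions):
// Sketch: relax durability for the duration of the bulk load only.
// If the process or OS crashes mid-import, the database file may be corrupted,
// so only do this when the data can simply be re-imported from the CSV.
using (var pragma = new System.Data.SQLite.SQLiteCommand(
    "PRAGMA synchronous = OFF; PRAGMA journal_mode = MEMORY;", conn))
{
    pragma.ExecuteNonQuery();
}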
I always thought that SQLite was designed for "simple things"; 6 million records seems to me to be a job for a real database server like MySQL.
Counting records in a SQLite table with so many rows can take a long time. Just for your information, instead of using SELECT COUNT(*) you can always use SELECT MAX(rowid), which is very fast, but it is not accurate if you have been deleting records from that table.
EDIT.
As Mike Woodhouse stated, creating the index after you have inserted the records should speed up the whole thing; that is common advice for other databases, but I can't say for sure how it works in SQLite.
One thing you might try is to create the index after the data has been inserted - typically it's much faster for databases to build an index in a single operation than to update it after each insert (or transaction).
I can't say that it'll definitely work with SQLite, but since it only needs two lines to move, it's worth trying.
I'm also wondering if a 6 million row transaction might be going too far - could you change the code to try different transaction sizes? Say 100, 1000, 10000, 100000? Is there a "sweet spot"?
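A sketch of both ideas combined, adapted from the code in the question; the 100,000-row batch size is just a starting point to measure against.
// Sketch: bind the parameters once, commit in fixed-size batches, and build the index last.
const int batchSize = 100000;            // hypothetical "sweet spot" - measure different values
int rowsInBatch = 0;

// ...create the Data table as before, but do NOT create the index yet...

command.CommandText = "INSERT INTO Data (ID,RAW) VALUES(@P0,@P1)";
var p0 = command.CreateParameter(); p0.ParameterName = "@P0"; command.Parameters.Add(p0);
var p1 = command.CreateParameter(); p1.ParameterName = "@P1"; command.Parameters.Add(p1);

SQLiteTransaction trans = conn.BeginTransaction();
command.Transaction = trans;
foreach (var item in f)                  // f is the CsvReader projection from the question
{
    p0.Value = item.IDData;
    p1.Value = item.RawData;
    command.ExecuteNonQuery();

    if (++rowsInBatch == batchSize)
    {
        trans.Commit();
        trans = conn.BeginTransaction();
        command.Transaction = trans;
        rowsInBatch = 0;
    }
}
trans.Commit();

// Build the index once, after all rows are in.
command.CommandText = "CREATE INDEX IDLookup ON Data(ID ASC)";
command.ExecuteNonQuery();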
You can gain quite some time when you bind your parameters in the following way:
...
string insertText = "INSERT INTO Data (ID,RAW) VALUES( ? , ? )"; // (1)
SQLiteTransaction trans = conn.BeginTransaction();
command.Transaction = trans;
command.CommandText = insertText;
//(2)------
SQLiteParameter p0 = new SQLiteParameter();
SQLiteParameter p1 = new SQLiteParameter();
command.Parameters.Add(p0);
command.Parameters.Add(p1);
//---------
Stopwatch sw = new Stopwatch();
sw.Start();
using (CsvReader csv = new CsvReader(new StreamReader(@"C:\Data.txt"), false))
{
var f = csv.Select(x => new Data() { IDData = x[27], RawData = String.Join(",", x.Take(24)) });
foreach (var item in f)
{
//(3)--------
p0.Value = item.IDData;
p1.Value = item.RawData;
//-----------
command.ExecuteNonQuery();
}
}
trans.Commit();
...
Make the changes in sections 1, 2 and 3.
In this way parameter binding seems to be quite a bit faster.
Especially when you have a lot of parameters, this method can save quite some time.
I did a similar import, but I let my C# code just write the data to a CSV first and then ran the sqlite import utility. I was able to import over 300 million records in maybe 10 minutes this way.
Not sure if this can be done directly from C# or not, though.
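It can at least be driven from C# by shelling out to the sqlite3 command-line shell, roughly like this; the paths and the assumption that sqlite3(.exe) is on the PATH are hypothetical.
// Sketch: invoke the sqlite3 CLI's .import command from C#.
// Assumes the CSV has no header row and the Data table already exists.
var psi = new System.Diagnostics.ProcessStartInfo
{
    FileName = "sqlite3",
    Arguments = "\"C:\\dbimport\" \".mode csv\" \".import C:/Data.txt Data\"",
    UseShellExecute = false,
    RedirectStandardError = true
};
using (var process = System.Diagnostics.Process.Start(psi))
{
    string errors = process.StandardError.ReadToEnd();
    process.WaitForExit();
    if (process.ExitCode != 0)
        throw new Exception("sqlite3 import failed: " + errors);
}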