I want to load large .DBF (Visual FoxPro) files into a DataTable.
For smaller files < 300MB it works fine with a fill command, and it runs pretty fast.
But for larger files I run out of memory and need to load them in smaller parts.
(Loading row 0...1000, then 1001..2000 and so on)
Based on some code I found on the internet I made this operation; the input start is the row to start reading from and max is the number of rows that I want to read.
The problem is that even if I just want to read 5 rows it takes around 30-60 seconds on my machine, due to the very slow execution of Command.ExecuteReader.
public DataTable LoadTable2(string folder, string table, int start, int max)
{
    string ConnectionString = "Provider=vfpoledb.1;Data Source=" + folder + "\\" + table;
    OleDbConnection Connection = new OleDbConnection(ConnectionString);
    Connection.Open();
    string dataString = String.Format("Select * from {0}", table);
    OleDbCommand Command = new OleDbCommand(dataString, Connection);
    //Takes very long time on large files.
    OleDbDataReader Reader = Command.ExecuteReader(CommandBehavior.SequentialAccess);
    DataSet ds = new DataSet();
    var dt = ds.Tables.Add(table);
    // Add the table columns.
    for (int i = 0; i < Reader.FieldCount; i++)
    {
        dt.Columns.Add(Reader.GetName(i), Reader.GetFieldType(i));
    }
    int intIdx = 0;
    int cnt = 0;
    while (Reader.Read())
    {
        if (intIdx >= start)
        {
            DataRow r = dt.NewRow();
            // Assign DataReader values to DataRow.
            for (int i = 0; i < Reader.FieldCount; i++)
                r[i] = Reader[i];
            dt.Rows.Add(r);
            cnt++;
        }
        if (cnt >= max)
        {
            break;
        }
        intIdx++;
    }
    Reader.Close();
    Connection.Close();
    return dt;
}
I have tested with both OLE DB and ODBC connections; no big difference.
The files are all on a local disk.
Does anyone have a good idea for how to make this much faster?
Best regards
Anders
I believe that with that driver (VFPOLEDB), you can change your query to specify the record numbers of interest. That way it would not be necessary to read through a bunch of records to get to the starting point, or to skip over any records; you would just read the entire requested result set. The query might look like this:
SELECT * from thetable where recno() >= 5000 and recno() <= 5500
I realized that I have this driver installed and just now tested it; it does work. However, I don't think it "optimizes" that statement. In theory, it could compute the record offsets directly from the record numbers, but (based on simple observation of a query on a larger dbf) it seems to do a full table scan. That said, with FoxPro you could create an index on recno(), and then it would be optimized.
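For what it's worth, a minimal sketch of how that paged call might look from C# with this provider (untested; it reuses the connection-string pattern from the question, assumes start is zero-based as in the question, and RECNO() is 1-based):
// Minimal sketch (illustrative): page through the DBF by record number so the
// provider only has to return the requested slice.
public DataTable LoadPage(string folder, string table, int start, int max)
{
    string connectionString = "Provider=vfpoledb.1;Data Source=" + folder + "\\" + table;
    string sql = String.Format(
        "Select * from {0} where recno() >= {1} and recno() <= {2}",
        table, start + 1, start + max);
    using (OleDbConnection connection = new OleDbConnection(connectionString))
    using (OleDbCommand command = new OleDbCommand(sql, connection))
    using (OleDbDataAdapter adapter = new OleDbDataAdapter(command))
    {
        connection.Open();
        DataTable dt = new DataTable(table);
        adapter.Fill(dt);   // only the requested record range is materialized
        return dt;
    }
}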
I am trying to get a lot of data from a SQL Server database using C#. I get something like 300K rows of data from a table that can contain hundreds of millions (I believe that's not far from the worst case).
I don't think the problem is the size of the database, because
command.ExecuteReader() itself takes next to no time, less than a second.
I tried this code:
public List<ResultPulser> GetReportResult(SqlConnection opCon, SqlCommand command,
    int minReport, int maxReport, int machineNumber)
{
    List<ResultPulser> results = new List<ResultPulser>();
    using (DataContext dc = new DataContext(opCon))
    {
        try
        {
            command.CommandText = "select * from ResultPulser " +
                "where CAST(SUBSTRING([ReportNumber], 0, 8) as int) = @machineNumber and " +
                "CAST(SUBSTRING([ReportNumber],8,LEN([ReportNumber])) as int) BETWEEN @minReport AND @maxReport";
            command.Parameters.Clear();
            command.Parameters.AddWithValue("@minReport", minReport);
            command.Parameters.AddWithValue("@maxReport", maxReport);
            command.Parameters.AddWithValue("@machineNumber", machineNumber);
            Stopwatch SW1 = Stopwatch.StartNew();
            SqlDataReader reader = command.ExecuteReader();
            SW1.Stop();
            DataTable table = new DataTable();
            Stopwatch SW2 = Stopwatch.StartNew();
            table.Load(reader);
            SW2.Stop();
            Stopwatch SW3 = Stopwatch.StartNew();
            ResultPulser[] report = new ResultPulser[table.Rows.Count];
            for (int i = 0; i < table.Rows.Count; i++)
            {
                DataRow dr = table.Rows[i];
                report[i] = new ResultPulser(Convert.ToInt64(dr[0]), dr[1].ToString().Trim(),
                    dr[2].ToString().Trim(), Convert.ToDateTime(dr[3]), Convert.ToDouble(dr[4]),
                    Convert.ToDouble(dr[5]), Convert.ToDouble(dr[6]), Convert.ToDouble(dr[7]),
                    Convert.ToDouble(dr[8]), Convert.ToDouble(dr[9]), Convert.ToInt64(dr[10]),
                    Convert.ToInt64(dr[11]), Convert.ToInt64(dr[12]), Convert.ToBoolean(dr[13]),
                    Convert.ToInt32(dr[14]));
            }
            SW3.Stop();
            reader.Close();
            return report.ToList();
        }
        catch (Exception ex)
        {
            LocalPulserDBManagerInstance.WriteLog(ex.StackTrace, ex.Message);
            throw ex;
        }
    }
}
But the next line table.Load(reader); takes about 20 seconds to complete.
I also tried like this:
public List<ResultPulser> GetReportResult(SqlConnection opCon, SqlCommand command,
    int minReport, int maxReport, int machineNumber)
{
    List<ResultPulser> results = new List<ResultPulser>();
    using (DataContext dc = new DataContext(opCon))
    {
        try
        {
            command.CommandText = "select * from ResultPulser " +
                "where CAST(SUBSTRING([ReportNumber], 0, 8) as int) = @machineNumber and " +
                "CAST(SUBSTRING([ReportNumber],8,LEN([ReportNumber])) as int) BETWEEN @minReport AND @maxReport";
            command.Parameters.Clear();
            command.Parameters.AddWithValue("@minReport", minReport);
            command.Parameters.AddWithValue("@maxReport", maxReport);
            command.Parameters.AddWithValue("@machineNumber", machineNumber);
            Stopwatch SW1 = Stopwatch.StartNew();
            SqlDataReader reader = command.ExecuteReader();
            SW1.Stop();
            DataTable table = new DataTable();
            Stopwatch SW2 = Stopwatch.StartNew();
            while (reader.Read())
            {
                results.Add(new ResultPulser(reader.GetInt64(0), reader.GetString(1).Trim(), reader.GetString(2).Trim(),
                    reader.GetDateTime(3), reader.GetDouble(4), reader.GetDouble(5), reader.GetDouble(6),
                    reader.GetDouble(7), reader.GetDouble(8), reader.GetDouble(9), reader.GetInt64(10),
                    reader.GetInt64(11), reader.GetInt64(12), reader.GetBoolean(13), reader.GetInt32(14)));
            }
            SW2.Stop();
            reader.Close();
            return results;
        }
        catch (Exception ex)
        {
            LocalPulserDBManagerInstance.WriteLog(ex.StackTrace, ex.Message);
            throw ex;
        }
    }
}
In this case, this code part takes about 16-17 seconds...
while (reader.Read())
{
    results.Add(new ResultPulser(reader.GetInt64(0), reader.GetString(1).Trim(), reader.GetString(2).Trim(),
        reader.GetDateTime(3), reader.GetDouble(4), reader.GetDouble(5), reader.GetDouble(6),
        reader.GetDouble(7), reader.GetDouble(8), reader.GetDouble(9), reader.GetInt64(10),
        reader.GetInt64(11), reader.GetInt64(12), reader.GetBoolean(13), reader.GetInt32(14)));
}
How can I optimize my code to do it faster than that?
This is your problem:
"where CAST(SUBSTRING([ReportNumber], 0, 8) as int) = #machineNumber and " +
"CAST(SUBSTRING([ReportNumber],8,LEN([ReportNumber])) as int) BETWEEN #minReport AND #maxReport";
By using these CAST(SUBSTRING(...)) expressions you're forcing full scan on this (presumably large) table.
You need to change it/optimize it so that it contains bare column names in WHERE.
From the predicates I assume the ReportNumber column has values of the form
MMMMMMMMR+
where MMMMMMMM is the machine number and R+ is one or more digits of the report number.
I'm not sure what the data type of the ReportNumber column is. It could be a bigint or some kind of string. From the query it seems to be a string, but you may just be relying on implicit conversions.
Anyway there's great potential to optimize this and bring it to elapsed time measured in single milliseconds.
I can only give you a starting point as there's not enough info.
The first part is easy:
"where CAST(SUBSTRING([ReportNumber], 0, 8) as int) = #machineNumber"
becomes
"where ReportNumber LIKE #machineNumber"
so this becomes a prefix search, and you change the C# part to pass the parameter as a prefix search for LIKE:
string machineNumberLikeString = machineNumber.ToString() + "%";
then pass it to the query like so:
command.Parameters.AddWithValue("@machineNumber", machineNumberLikeString);
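Putting that together with your existing code, a sketch of the reworked command might look like this (the report-number range is left as in the question for now, since only the machine-number predicate has been made index-friendly):
// Sketch: the machine-number predicate becomes a prefix LIKE, so an index on
// ReportNumber can be used; the BETWEEN part is unchanged from the question.
command.CommandText = "select * from ResultPulser " +
    "where ReportNumber LIKE @machineNumber and " +
    "CAST(SUBSTRING([ReportNumber],8,LEN([ReportNumber])) as int) BETWEEN @minReport AND @maxReport";
command.Parameters.Clear();
string machineNumberLikeString = machineNumber.ToString() + "%"; // prefix search, e.g. "1234567%"
command.Parameters.AddWithValue("@machineNumber", machineNumberLikeString);
command.Parameters.AddWithValue("@minReport", minReport);
command.Parameters.AddWithValue("@maxReport", maxReport);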
Then on the SQL side you can create an index to support fast searching for this query (but only if it is used quite often, which is a judgment call depending on your workload and requirements). Remember that indexes have their cost and the gain must outweigh it.
CREATE INDEX IX_ResultPulser_ReportNumber
ON ResultPulser (ReportNumber);
It will work assuming the column ReportNumber is of some string data type, e.g. VARCHAR or NVARCHAR. If it's numeric, it may or may not work, and a modified strategy would be required. But I'm out of time and I don't have enough info on the data and the table.
EDIT: It may not help if you're extracting 300 thousand rows in one query. SQL Server would need 300K key lookups to fetch the remaining columns for your SELECT *. You may consider clustering this table on the ReportNumber column (CREATE CLUSTERED INDEX ...).
And the last warning: do not use SELECT * (select star) in production code. Always explicitly specify which columns you require.
Hope that this will help you.
My C# code below works, however it takes close to an hour to process 180,000 rows of data. I'm looking for ways to improve performance. Is there a faster or better way to grow the array as new data is read, or could the SQL statement be more efficient? Thanks.
int row = 0;
string[,] timeSeriesData = new string[row, colSize];
using (OleDbConnection AccessConn = new OleDbConnection(strAccessConn))
{
    OleDbCommand cmdGetData = AccessConn.CreateCommand();
    cmdGetData.CommandText = sqlSELECT;
    AccessConn.Open();
    OleDbDataReader thisReader = cmdGetData.ExecuteReader();
    while (thisReader.Read())
    {
        string[,] tempArray = new string[row + 1, colSize];
        Array.Copy(timeSeriesData, tempArray, timeSeriesData.Length);
        timeSeriesData = tempArray;
        timeSeriesData[row, 0] = thisReader.GetDateTime(0).ToOADate().ToString();
        for (int j = 1; j < colSize; j++)
        {
            if (thisReader.IsDBNull(j))
            {
                timeSeriesData[row, j] = "-999";
            }
            else
            {
                timeSeriesData[row, j] = Convert.ToString(thisReader[j]);
            }
        }
        row++;
    }
    thisReader.Close();
    AccessConn.Close();
}
My SQL statement is usually something like this:
SELECT [TimeStamp], IIF([CH1Avg_Qual] IS NULL OR [CH1Avg_Qual]=0, [CH1Avg], NULL) AS Col1,
IIF([CH2Avg_Qual] IS NULL OR [CH2Avg_Qual]=0, [CH2Avg], NULL) AS Col2,
IIF([CH3Avg_Qual] IS NULL OR [CH3Avg_Qual]=0, [CH3Avg], NULL) AS Col3,
IIF([CH7Avg_Qual] IS NULL OR [CH7Avg_Qual]=0, [CH7Avg], NULL) AS Col4,
IIF([CH9Avg_Qual] IS NULL OR [CH9Avg_Qual]=0, [CH9Avg], NULL) AS Col5,
IIF([CH10Avg_Qual] IS NULL OR [CH10Avg_Qual]=0, [CH10Avg], NULL) AS Col6,
IIF([CH11Avg_Qual] IS NULL OR [CH11Avg_Qual]=0, [CH11Avg], NULL) AS Col7
FROM [DataTable] ORDER BY [TimeStamp]
It's likely that you have a lot of extra overhead in the round trip between the DB and the data reader. You're marshaling the data line by line through the connection, and that's usually the slowest part of the process.
You would be better off using a DataSet, pulling all the records into it, and then iterating the DataSet and building your array. The DataSet will be in local storage, so access should be faster.
Another alternative is to build the result you want using Access VBA and store it in a table. Then, just pull the whole table from Access, again using a DataSet. This would be preferable if you were using a small subset of the 180,000 records (I believe that isn't the case here), since you wouldn't have to truck all the records into the DataSet before you iterated them.
Thanks to @BobRodes for pointing me towards DataTable. The code below improved performance on one dataset from 18 minutes previously to 7 seconds using a DataTable :) The code is a lot simpler too.
DataTable timeSeriesDataDT = new DataTable();
using (OleDbConnection AccessConnDT = new OleDbConnection(strAccessConn))
{
    using (OleDbCommand cmdGetDT = new OleDbCommand(sqlSELECT, AccessConnDT))
    {
        AccessConnDT.Open();
        OleDbDataAdapter adapter = new OleDbDataAdapter(cmdGetDT);
        adapter.Fill(timeSeriesDataDT);
    }
}
I have the following code and it works fine. My problem is that the insert took more than three hours.
How can I optimize the insert into the SQL table?
foreach (var sheetName in GetExcelSheetNames(connectionString))
{
    using (OleDbConnection con1 = new OleDbConnection(connectionString))
    {
        var dt = new DataTable();
        string query = string.Format("SELECT * FROM [{0}]", sheetName);
        con1.Open();
        OleDbDataAdapter adapter = new OleDbDataAdapter(query, con1);
        adapter.Fill(dt);
        using (SqlConnection con = new SqlConnection(consString))
        {
            con.Open();
            for (int i = 2; i < dt.Rows.Count; i++)
            {
                for (int j = 1; j < dt.Columns.Count; j += 3)
                {
                    try
                    {
                        var s = dt.Rows[i][0].ToString();
                        var dt1 = DateTime.Parse(s, CultureInfo.GetCultureInfo("fr-FR"));
                        var s1 = dt.Rows[i][j].ToString();
                        var s2 = dt.Rows[i][j + 1].ToString();
                        var s3 = sheetName.Remove(sheetName.Length - 1);

                        SqlCommand command = new SqlCommand("INSERT INTO [Obj CA MPX] ([CA TTC],[VAL MRG TTC],[CA HT],[VAL MRG HT],[Rayon],[Date],[Code Site]) VALUES(@ca,@val,@catHT,@valHT,@rayon,@date,@sheetName)", con);
                        command.Parameters.Add("@date", SqlDbType.Date).Value = dt1;
                        command.Parameters.AddWithValue("@ca", s1);
                        command.Parameters.AddWithValue("@val", s2);
                        command.Parameters.AddWithValue("@rayon", dt.Rows[0][j].ToString());
                        command.Parameters.AddWithValue("@sheetName", s3);
                        command.Parameters.AddWithValue("@catHT", DBNull.Value);
                        command.Parameters.AddWithValue("@valHT", DBNull.Value);
                        command.ExecuteNonQuery();
                    }
                    catch (Exception)
                    {
                        // handle or log the failed row as needed
                    }
                }
            }
        }
    }
}
Maybe you should save it as a file and use bulk insert:
https://msdn.microsoft.com/de-de/library/ms188365%28v=sql.120%29.aspx
SQL Server has the option of using a Bulk Insert.
Here is a good article on importing a csv.
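If you would rather stay in C# than export a file first, SqlBulkCopy uses the same bulk-load path. A minimal sketch, assuming you first reshape the Excel data into a DataTable (stagingTable here is an illustrative name) whose columns match the destination table:
// Sketch: bulk-load a prepared DataTable instead of issuing one INSERT per row.
// stagingTable is illustrative; its columns are assumed to match [Obj CA MPX].
using (SqlConnection con = new SqlConnection(consString))
{
    con.Open();
    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(con))
    {
        bulkCopy.DestinationTableName = "[Obj CA MPX]";
        bulkCopy.BatchSize = 5000;      // rows per round trip
        bulkCopy.BulkCopyTimeout = 0;   // no timeout for large loads
        bulkCopy.WriteToServer(stagingTable);
    }
}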
You should first read this article from Eric Lippert: Which is faster?.
Keep this in mind while trying to optimize your process.
The insert took 3 hours, but did you insert 10 items or 900,000,000,000 items?
If it's the latter, maybe 3 hours is pretty good.
What is your database? SQL Server 2005 Express? SQL Server 2014 Enterprise?
The advice could differ.
Without more details, we will only be able to give you suggestions, that could or could not apply depending on your configuration.
Here are some off the top of my head:
Is the bottleneck on the DB side? Check the execution plan, add indexes if needed
Beware of AddWithValue, it can prevent the use of indexes in your query
If you are loading a lot of data into a non-live database, you could use a lighter recovery model to avoid generating a lot of useless log (a bulk load will automatically use BULK_LOGGED, or you could switch to the SIMPLE recovery model with ALTER DATABASE [YourDB] SET RECOVERY SIMPLE; don't forget to re-enable the FULL recovery model afterwards)
Are there other alternatives to loading data from an Excel file? Can't you use another database instead, or convert the Excel file to a CSV?
What does the performance monitor tell you? Maybe you need better hardware (more RAM, faster disks, RAID), or to move some heavily used files (mdf, ldf) onto separate disks.
You could copy the Excel file several times and use parallelization, loading into different tables that will be partitions of your final table.
This list could continue forever.
Here is an interesting article about optimizing data loading: We Loaded 1TB in 30 Minutes with SSIS, and So Can You
This article is focused on SSIS but some advices do not apply only to it.
You can put several (e.g. 100) inserts into a single string using a StringBuilder. Use an index suffix for the parameter names. Note that you can have a maximum of 2100 parameters for one query.
StringBuilder batch = new StringBuilder();
for (int i = 0; i < pageSize; i++)
{
    batch.AppendFormat(
        @"INSERT INTO [Obj CA MPX] ([CA TTC],[VAL MRG TTC], ...) VALUES(@ca{0},@val{0}, ...)",
        i);
    batch.AppendLine();
    batch.AppendLine();
}
SqlCommand command = new SqlCommand(batch.ToString(), con);
// append parameters, using the index
for (int i = 0; i < pageSize; i++)
{
    command.Parameters.Add("@date" + i, SqlDbType.Date).Value = dt1[i];
    command.Parameters.AddWithValue("@ca" + i, s1[i]);
    // ...
}
command.ExecuteNonQuery();
Of course this is not finished, you have to integrate the pages into your existing loops, which may not be too simple.
Alternatively, you do not use parameters and put the arguments directly into the query. This way, you can create much larger batches (I would put 1000 to 10000 inserts into one batch) and it's much easier to implement.
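A rough sketch of that literal-value variant, escaping string values and formatting dates invariantly (the loop and the rowsToInsert collection are illustrative, not the full integration into the existing code):
// Sketch: build one batch of INSERTs with literal values instead of parameters.
// Only do this with data you control; without parameters there is no protection
// against SQL injection, so quotes must be escaped and formats fixed explicitly.
StringBuilder batch = new StringBuilder();
foreach (var row in rowsToInsert) // rowsToInsert: illustrative, built from the Excel data
{
    batch.AppendFormat(
        CultureInfo.InvariantCulture,
        "INSERT INTO [Obj CA MPX] ([CA TTC],[VAL MRG TTC],[Date],[Code Site]) " +
        "VALUES ('{0}','{1}','{2:yyyy-MM-dd}','{3}');",
        row.Ca.Replace("'", "''"),
        row.Val.Replace("'", "''"),
        row.Date,
        row.CodeSite.Replace("'", "''"));
    batch.AppendLine();
}
using (SqlCommand command = new SqlCommand(batch.ToString(), con))
{
    command.ExecuteNonQuery();
}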
I have the following code to build a large string. It runs well on my development PC, taking about 2 seconds to complete all tasks, but when I put it on the production server, it takes 10-20 seconds depending on something unknown. The time is calculated from TimePoint1 to TimePoint2.
BTW: abstractTable.GetWordCountInList(listID) just contains one line: return 100;
All other functions on the production server work fine, except for this one.
The Oracle DB server and the web server are on the same machine, and the query returns about 5000 records. The production server's data is the same as on the development server.
public const string WordXMLItemTemplate = @"<S I=""{0}"" W=""{1}"" L=""{2}"" F=""0"" P=""0"" />";

int totalCounter = 0;
int wordCounter = 0;
int listID = 1;
string wordXML = string.Empty;
OracleConnection wordLibConn = ConnectionManager.GetNewConnection();
wordLibConn.Open();
try
{
    OracleCommand comm = new OracleCommand();
    comm.Connection = wordLibConn;
    comm.CommandText = "Select WordID from WordTable Order By Lower(Word) ASC";
    OracleDataReader dataReader = comm.ExecuteReader();
    try
    {
        // TimePoint 1
        while (dataReader.Read())
        {
            totalCounter++;
            wordCounter++;
            wordXML = wordXML + string.Format(WordXMLItemTemplate, totalCounter, dataReader[0].ToString(), listID);
            if (wordCounter % abstractTable.GetWordCountInList(listID) == 0)
            {
                wordCounter = 0;
                listID++;
            }
        }
    }
    finally
    {
        dataReader.Close();
    }
}
finally
{
    wordLibConn.Close();
}
// TimePoint 2
I have suffered from this problem for months; thanks a lot for any advice.
Make a list of the potential lines or code fragments that might cause this slowness. Measure all of them and look at the results of your measurements. Hopefully you will find the exact place where the stall occurs, and you will be closer to the solution.
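For example, a rough way to split the loop time between its two obvious candidates (the string concatenation and the GetWordCountInList call) is to time them separately with Stopwatch, using the names from the question's own code:
// Rough instrumentation sketch: time the two candidate fragments separately,
// then compare concatWatch.Elapsed and lookupWatch.Elapsed after the loop.
var concatWatch = new System.Diagnostics.Stopwatch();
var lookupWatch = new System.Diagnostics.Stopwatch();

while (dataReader.Read())
{
    totalCounter++;
    wordCounter++;

    concatWatch.Start();
    wordXML = wordXML + string.Format(WordXMLItemTemplate, totalCounter, dataReader[0].ToString(), listID);
    concatWatch.Stop();

    lookupWatch.Start();
    bool endOfList = (wordCounter % abstractTable.GetWordCountInList(listID) == 0);
    lookupWatch.Stop();

    if (endOfList)
    {
        wordCounter = 0;
        listID++;
    }
}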
I want to delete all rows in a datatable.
I use something like this:
foreach (DataRow row in dt.Rows)
{
    row.Delete();
}
TableAdapter.Update(dt);
It works fine, but it takes a lot of time if I have many rows.
Is there any way to delete all rows at once?
If you are running your code against a SQL Server database, then use this command:
string sqlTrunc = "TRUNCATE TABLE " + yourTableName;
SqlCommand cmd = new SqlCommand(sqlTrunc, conn);
cmd.ExecuteNonQuery();
This will be the fastest method; it will delete everything from your table and reset the identity counter to zero.
The TRUNCATE keyword is supported also by other RDBMS.
5 years later:
Looking back at this answer, I need to add something. The answer above is good only if you are absolutely sure about the source of the value in the yourTableName variable. This means you shouldn't get the value from your users, because they can type anything, and that leads to the SQL injection problems well described in this famous comic strip. Always present your users with a choice between hard-coded names (tables or other symbolic values) using a non-editable UI.
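One way to follow that advice in code is to accept only table names from a hard-coded list before building the TRUNCATE statement; a minimal sketch (the table names and method are illustrative):
// Sketch: allow only whitelisted table names before building the SQL text.
private static readonly HashSet<string> AllowedTables =
    new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Orders", "OrderLines" };

public static void TruncateTable(SqlConnection conn, string tableName)
{
    if (!AllowedTables.Contains(tableName))
        throw new ArgumentException("Unknown table name.", "tableName");

    using (SqlCommand cmd = new SqlCommand("TRUNCATE TABLE [" + tableName + "]", conn))
    {
        cmd.ExecuteNonQuery();
    }
}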
This will allow you to clear all the rows and maintain the format of the DataTable.
dt.Rows.Clear();
There is also
dt.Clear();
However, calling Clear() on the DataTable (dt) will remove the Columns and formatting from the DataTable.
Per code found in an MSDN question, an internal method is called by both the DataRowCollection and the DataTable, with a different boolean parameter:
internal void Clear(bool clearAll)
{
    if (clearAll) // true is sent from the Data Table call
    {
        for (int i = 0; i < this.recordCapacity; i++)
        {
            this.rows[i] = null;
        }
        int count = this.table.columnCollection.Count;
        for (int j = 0; j < count; j++)
        {
            DataColumn column = this.table.columnCollection[j];
            for (int k = 0; k < this.recordCapacity; k++)
            {
                column.FreeRecord(k);
            }
        }
        this.lastFreeRecord = 0;
        this.freeRecordList.Clear();
    }
    else // False is sent from the DataRow Collection
    {
        this.freeRecordList.Capacity = this.freeRecordList.Count + this.table.Rows.Count;
        for (int m = 0; m < this.recordCapacity; m++)
        {
            if ((this.rows[m] != null) && (this.rows[m].rowID != -1))
            {
                int record = m;
                this.FreeRecord(ref record);
            }
        }
    }
}
As someone mentioned, just use:
dt.Rows.Clear()
That's the easiest way to delete all rows from the table in the DBMS via a DataAdapter. But if you want to do it in one batch, you can set the DataAdapter's UpdateBatchSize to 0 (unlimited).
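A minimal sketch of that batched variant, assuming a plain SqlDataAdapter (adapter here is illustrative) whose DeleteCommand is already configured for batching:
// Sketch: mark every row deleted, then push all the DELETEs in a single batch.
adapter.UpdateBatchSize = 0;      // 0 = no limit, send everything in one batch
foreach (DataRow row in dt.Rows)
{
    row.Delete();                 // marks the row as deleted in the DataTable
}
adapter.Update(dt);               // executes the pending DELETEs as one batch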
Another way would be to use a simple SqlCommand with CommandText DELETE FROM Table:
using (var con = new SqlConnection(ConfigurationSettings.AppSettings["con"]))
using (var cmd = new SqlCommand())
{
    cmd.CommandText = "DELETE FROM Table";
    cmd.Connection = con;
    con.Open();
    int numberDeleted = cmd.ExecuteNonQuery(); // all rows deleted
}
But if you instead only want to remove the DataRows from the DataTable, you just have to call DataTable.Clear. That would prevent any rows from being deleted in the DBMS.
Why don't you just do it in SQL?
DELETE FROM SomeTable
Just use dt.Clear()
Also you can set your TableAdapter/DataAdapter to clear before it fills with data.
TableAdapter.Update(dt.Clone());
//or
dt=dt.Clone();
TableAdapter.Update(dt);
//or
dt.Rows.Clear();
dt.AcceptChanges();
TableAdapter.Update(dt);
If you are really concerned about speed and not worried about the data you can do a Truncate. But this is assuming your DataTable is on a database and not just a memory object.
TRUNCATE TABLE tablename
The difference is that this removes all rows without logging the individual row deletes, making the operation faster.
I had the same question. You can use the following code:
SqlConnection con = new SqlConnection();
con.ConnectionString = ConfigurationManager.ConnectionStrings["yourconnectionstringnamehere"].ConnectionString;
con.Open();
SqlCommand com = new SqlCommand();
com.Connection = con;
com.CommandText = "DELETE FROM [tablenamehere]";
com.ExecuteNonQuery();
con.Close();
But first you need to add the following using directives to your project:
using System.Configuration;
using System.Data.SqlClient;
This is the specific part of the code that deletes all rows in a table:
DELETE FROM [tablenamehere]
Replace tablenamehere with the name of your table.
I'm using MDaf; just use this code:
DataContext db = new DataContext(ConfigurationManager.ConnectionStrings["con"].ConnectionString);
db.TruncateTable("TABLE_NAME_HERE");
//or
db.Execute("TRUNCATE TABLE TABLE_NAME_HERE ");
Here is a clean and modern way to do it using Entity Framework, without SQL injection or raw T-SQL:
using (Entities dbe = new Entities())
{
    dbe.myTable.RemoveRange(dbe.myTable.ToList());
    dbe.SaveChanges();
}
Is there a Clear() method on the DataTable class?
I think there is. If there is, use that.
Use the DataTable.Clear() method from the DataTable class.