Inserting to SQL performance issues - C#

I wrote a program some time ago that delimits and reads in pretty big text files. The program works, but the problem is that it basically freezes the computer and takes a long time to finish. On average each text file has around 10K to 15K lines, and each line becomes a new row in a SQL table.
The way my program works is that I first read all of the lines (this is where the delimiting happens) and store them in an array; after that I go through each array element and insert it into the SQL table. This is all done at once, and I suspect it is eating up too much memory, which is causing the program to freeze the computer.
Here is my code for reading file:
private void readFile()
{
    //String that will hold each line read from the file
    string line;
    //A using block guarantees the file handle is released even if an exception is thrown
    using (System.IO.StreamReader file = new System.IO.StreamReader(txtFilePath.Text))
    {
        try
        {
            while (!file.EndOfStream)
            {
                line = file.ReadLine();
                if (!string.IsNullOrWhiteSpace(line))
                {
                    if (this.meetsCondition(line))
                    {
                        badLines++;
                        continue;
                    } // end if
                    else
                    {
                        collection.readIn(line);
                        counter++;
                    } // end else
                } // end if
            } // end while
        } // end try
        catch (Exception exceptionError)
        {
            //An empty catch hides failures; at minimum surface or log the error
            MessageBox.Show(exceptionError.Message);
        }
    }
} // end readFile
Code for inserting:
for (int i = 0; i < counter; i++)
{
    //Iterates through the collection array from the first index to the end,
    //inserting each element into our SQL table
    //if (!idS.Contains(collection.getIdItems(i)))
    //{
    da.InsertCommand.Parameters["@Id"].Value = collection.getIdItems(i);
    da.InsertCommand.Parameters["@Date"].Value = collection.getDateItems(i);
    da.InsertCommand.Parameters["@Time"].Value = collection.getTimeItems(i);
    da.InsertCommand.Parameters["@Question"].Value = collection.getQuestionItems(i);
    da.InsertCommand.Parameters["@Details"].Value = collection.getDetailsItems(i);
    da.InsertCommand.Parameters["@Answer"].Value = collection.getAnswerItems(i);
    da.InsertCommand.Parameters["@Notes"].Value = collection.getNotesItems(i);
    da.InsertCommand.Parameters["@EnteredBy"].Value = collection.getEnteredByItems(i);
    da.InsertCommand.Parameters["@WhereReceived"].Value = collection.getWhereItems(i);
    da.InsertCommand.Parameters["@QuestionType"].Value = collection.getQuestionTypeItems(i);
    da.InsertCommand.Parameters["@AnswerMethod"].Value = collection.getAnswerMethodItems(i);
    da.InsertCommand.Parameters["@TransactionDuration"].Value = collection.getTransactionItems(i);
    da.InsertCommand.ExecuteNonQuery();
    //}
    //Updates the progress bar using i plus 1
    _worker.ReportProgress(i + 1);
} // end for

If you can map your collection to a DataTable, then you could use SqlBulkCopy to import your data. SqlBulkCopy is the fastest way to import data from .NET into SQL Server.

Use the SqlBulkCopy class for bulk inserts.
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx
It will cut the time down to mere seconds.

+1 for SqlBulkCopy as others have stated, but be aware that it requires INSERT permission. If you work in a strictly controlled environment, as I do, where you aren't allowed to use dynamic SQL, an alternative approach is to have your stored proc use table-valued parameters. That way you can still pass in chunks of records and have the proc do the actual inserting.
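For illustration, here is a minimal sketch of that TVP approach; the type dbo.QuestionRowType, the proc dbo.usp_InsertQuestions, and the two columns are hypothetical stand-ins for your own schema:
// Assumes SQL-side objects roughly like:
//   CREATE TYPE dbo.QuestionRowType AS TABLE (Id INT, Question NVARCHAR(500) /* ... */);
//   CREATE PROCEDURE dbo.usp_InsertQuestions @Rows dbo.QuestionRowType READONLY
//   AS INSERT INTO QuestionsTable (Id, Question) SELECT Id, Question FROM @Rows;
DataTable rows = new DataTable();
rows.Columns.Add("Id", typeof(int));
rows.Columns.Add("Question", typeof(string));
// ... fill rows from your collection ...

using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand("dbo.usp_InsertQuestions", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    SqlParameter tvp = cmd.Parameters.AddWithValue("@Rows", rows);
    tvp.SqlDbType = SqlDbType.Structured; // marks the parameter as a TVP
    tvp.TypeName = "dbo.QuestionRowType"; // must match the SQL type name
    conn.Open();
    cmd.ExecuteNonQuery();                // one round trip per chunk of records
}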

As an example of how to use the functionality of the SqlBulkCopy class (this is just pseudocode to convey the idea):
First change your collection class to host an internal DataTable, and in the constructor define the schema used by your readIn method:
public class MyCollection
{
    private DataTable loadedData = null;

    public MyCollection()
    {
        loadedData = new DataTable();
        loadedData.Columns.Add("Column1", typeof(string));
        // ... and so on for every field expected
    }

    // A property to return the collected data
    public DataTable GetData
    {
        get { return loadedData; }
    }

    public void readIn(string line)
    {
        // Split the line into fields
        string[] splittedLine = line.Split(',');
        DataRow r = loadedData.NewRow();
        r["Column1"] = splittedLine[0];
        // ... and so on
        loadedData.Rows.Add(r);
    }
}
Finally, the code that uploads the data to your server:
using (SqlConnection connection = new SqlConnection(connectionString))
{
    connection.Open();
    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
    {
        bulkCopy.DestinationTableName = "destinationTable";
        try
        {
            // GetData is a property, so no parentheses
            bulkCopy.WriteToServer(collection.GetData);
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}

As mentioned, using SqlBulkCopy will be faster than inserting one-by-one, but there are other things that you could look at:
Is there a clustered index on the table? If so, will you be inserting rows whose key values fall in the middle of that index? It's much more efficient to add values at the end of a clustered index, since otherwise the server has to rearrange data to insert in the middle (this applies only to CLUSTERED indexes). One example I've seen used SSN as a clustered primary key. Since SSNs are distributed randomly, you rearrange the physical structure on virtually every insert. Having a date as part of the clustered key may be fine if you are MOSTLY inserting data at the end (e.g. adding daily records).
Are there a lot of indexes on that table? It may be more efficient to drop the indexes, add the data, and re-create the indexes after the inserts (or just drop indexes you don't need); a sketch of that pattern follows.
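As a minimal sketch, assuming SQL Server, a placeholder table dbo.MyTable, and a placeholder nonclustered index IX_MyTable_SomeColumn (disable indexes one by one rather than with ALL, because disabling the clustered index would make the table inaccessible):
using (SqlConnection conn = new SqlConnection(connectionString))
{
    conn.Open();
    // Disable the nonclustered index before the bulk load
    using (SqlCommand cmd = new SqlCommand(
        "ALTER INDEX IX_MyTable_SomeColumn ON dbo.MyTable DISABLE;", conn))
    {
        cmd.ExecuteNonQuery();
    }

    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(conn))
    {
        bulkCopy.DestinationTableName = "dbo.MyTable";
        bulkCopy.WriteToServer(collection.GetData);
    }

    // REBUILD re-enables the index and refreshes its statistics
    using (SqlCommand cmd = new SqlCommand(
        "ALTER INDEX IX_MyTable_SomeColumn ON dbo.MyTable REBUILD;", conn))
    {
        cmd.ExecuteNonQuery();
    }
}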


How do I batch 1000 inserts in the given loop scenario?

ORIGINAL QUESTION:
I have some code which looks like this:
for (int i = start_i; i <= i_s; i++)
{
    var json2 = JObject.Parse(RequestServer("query_2", new List<JToken>(){json1["result"]}));
    foreach (var data_1 in json2["result"]["data_1"])
    {
        var json3 = JObject.Parse(RequestServer("query_3", new List<JToken>(){data_1, 1}));
        foreach (var data_2 in json3["result"]["data_2"])
        {
            var id = data_2["id"];       // renamed from data_1, which would shadow the outer loop variable
            var index = data_2["other"];
        }
        foreach (var other in json3["result"]["other"])
        {
            var data_3 = other["data_3"]["data_3_1"]; // renamed to avoid clashing with the inner loop variable
            var data_4 = other["data_4"];
            var data_5 = other["data_5"];
            foreach (var data_3_1 in other["data_3"]["data_3_1"])
            {
                //Console.WriteLine(data_3_1); // <- very fast
                insert_data((string)data_3_1); // <- very slow
            }
        }
    }
}
This code was able to generate about 5000 WriteLines in less than a minute. However, I now want to insert that data into a database. When I try to do that, the code now takes much much longer to get through the 5000 sets of data.
My question is: how do I batch the database inserts into about 1000 inserts at a time, instead of doing them one at a time? I have tried creating the insert statement using a StringBuilder, which is fine; what I can't figure out is how to generate 1000 at a time. I have tried using for loops up to 1000 and then trying to break out of the foreach loop before starting the next 1000, but it just makes a big mess.
I have looked at questions like this example, but they are no good for my loop scenario. I know how to do bulk inserts at the SQL level; I just can't figure out how to generate the bulk SQL inserts from the very specific nested loops in the example code above.
The 5000 records was just a test run. The end code will have to deal with millions, if not billions of inserts. Based on rough calculations, the end result will use about 500GB of drive space when inserted into a database, so I will need to batch an optimum amount into RAM before inserting into the database.
UPDATE 1:
This is what happens in insert_data:
public static string insert_data(string data_3_1)
{
    string str_conn = @"server=localhost;port=3306;uid=username;password=password;database=database";
    using (MySqlConnection conn = new MySqlConnection(str_conn))
    {
        conn.Open();
        MySqlCommand cmd = new MySqlCommand();
        cmd.Connection = conn;
        cmd.CommandText = "INSERT INTO database_table (data_3_1) VALUES (@data_3_1)";
        cmd.Parameters.AddWithValue("@data_3_1", data_3_1);
        cmd.Prepare(); // prepare after the parameters are defined
        cmd.ExecuteNonQuery();
    }
    return null;
}
You're correct that doing bulk inserts in batches can be a big throughput win. Here's why it's a win: When you do INSERT operations one at a time, the database server does an implicit COMMIT operation after every insert, and that can be slow. So, if you can wrap every hundred or so INSERTs in a single transaction, you'll reduce that overhead.
Here's an outline of how to do that. I'll try to put it in the context of your code, but you didn't show your MySqlConnection object or query objects, so this solution of mine will necessarily be incomplete.
var batchSize = 100;
var batchCounter = batchSize;
var beginBatch = new MySqlCommand("START TRANSACTION;", conn);
var endBatch = new MySqlCommand("COMMIT;", conn);

beginBatch.ExecuteNonQuery();
for (int i = start_i; i <= i_s; i++)
{
    ....
    foreach (var data_1 in json2["result"]["data_1"])
    {
        ...
        foreach (var other in json3["result"]["other"])
        {
            ...
            foreach (var data_3_1 in other["data_3"]["data_3_1"])
            {
                //Console.WriteLine(data_3_1); // <- very fast
                /****************** batch handling **********************/
                if (--batchCounter <= 0)
                {
                    /* commit one batch, start the next */
                    endBatch.ExecuteNonQuery();
                    beginBatch.ExecuteNonQuery();
                    batchCounter = batchSize;
                }
                insert_data((string)data_3_1); // <- very slow
            }
        }
    }
}
/* commit the last batch. It's OK if it contains no records */
endBatch.ExecuteNonQuery();
If you want, you can try different values of batchSize to find a good value. But generally something like the 100 I suggest works well.
Batch sizes of 1000 are also OK. But the larger each transaction gets, the more server RAM it uses before it's committed, and the longer it might block other programs using the same MySQL server.
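A minimal sketch of the same idea using the connector's transaction API instead of raw START TRANSACTION/COMMIT commands; note that the inserts must run on the same connection as the transaction, so the per-call connection inside insert_data would have to be hoisted out (names mirror the question; valuesToInsert stands in for the nested loops):
const int batchSize = 100;
int pending = 0;
using (MySqlConnection conn = new MySqlConnection(str_conn))
{
    conn.Open();
    MySqlTransaction tx = conn.BeginTransaction();
    using (MySqlCommand cmd = new MySqlCommand(
        "INSERT INTO database_table (data_3_1) VALUES (@data_3_1)", conn, tx))
    {
        cmd.Parameters.Add("@data_3_1", MySqlDbType.VarChar);
        cmd.Prepare(); // prepared once, executed many times

        foreach (string value in valuesToInsert)
        {
            cmd.Parameters["@data_3_1"].Value = value;
            cmd.ExecuteNonQuery();
            if (++pending >= batchSize)
            {
                tx.Commit();                  // commit one batch...
                tx = conn.BeginTransaction(); // ...and start the next
                cmd.Transaction = tx;
                pending = 0;
            }
        }
        tx.Commit(); // commit the final, possibly partial batch
    }
}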
There's a nice and popular extension library called MoreLinq that offers an extension method called Batch(int batchSize). It splits the source into chunks; to get IEnumerables containing up to 1000 elements each:
foreach (var upTo1000 in other["data_3"]["data_3_1"].Batch(1000))
{
// Build a query using the (up to) 1000 elements in upTo1000
}
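Building on that, here is a hedged sketch that turns each batch into a single multi-row INSERT, which is what the question's StringBuilder approach was aiming at (the table and column names follow the question; conn and the parameter naming are illustrative):
foreach (var batch in other["data_3"]["data_3_1"].Batch(1000))
{
    var sql = new StringBuilder("INSERT INTO database_table (data_3_1) VALUES ");
    using (MySqlCommand cmd = new MySqlCommand { Connection = conn })
    {
        int n = 0;
        foreach (var item in batch)
        {
            if (n > 0) sql.Append(',');
            sql.Append("(@p").Append(n).Append(')');
            cmd.Parameters.AddWithValue("@p" + n, (string)item);
            n++;
        }
        cmd.CommandText = sql.ToString();
        cmd.ExecuteNonQuery(); // one round trip inserts the whole batch
    }
}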
The best approach for me was using the LOAD DATA LOCAL INFILE statement. To make it work, you first have to turn ON the MySQL server parameter local_infile.
I used mysql2 package for NodeJS and query function:
db.query({
sql: "LOAD DATA LOCAL INFILE .......",
infileStreamFactory: <readable stream which provides your data in flat file format>
}, function(err, results) {....});
The trick is to provide the readable stream properly. By default, LOAD DATA expects a tab-delimited text file. LOAD DATA also expects some file name, but in your case, if you provide a stream, the file name can be an arbitrary string.
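For a C# equivalent of the same technique, here is a sketch assuming the MySql.Data connector (MySqlBulkLoader wraps the same LOAD DATA statement; the table, file path, and column name are placeholders):
using (MySqlConnection conn = new MySqlConnection(str_conn + ";AllowLoadLocalInfile=true"))
{
    conn.Open();
    var loader = new MySqlBulkLoader(conn)
    {
        TableName = "database_table",
        FileName = @"C:\temp\data.tsv", // tab-delimited by default
        Local = true,                   // client-side file; requires local_infile=ON
        NumberOfLinesToSkip = 0
    };
    loader.Columns.Add("data_3_1");
    int rows = loader.Load(); // executes LOAD DATA LOCAL INFILE
}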

String or binary data would be truncated exception when inserting data [duplicate]

I am running data.bat file with the following lines:
Rem This batch file will populate tables
cd\program files\Microsoft SQL Server\MSSQL
osql -U sa -P Password -d MyBusiness -i c:\data.sql
The contents of the data.sql file is:
insert Customers
(CustomerID, CompanyName, Phone)
Values('101','Southwinds','19126602729')
There are 8 more similar lines for adding records.
When I run this with start > run > cmd > c:\data.bat, I get this error message:
1>2>3>4>5>....<1 row affected>
Msg 8152, Level 16, State 4, Server SP1001, Line 1
string or binary data would be truncated.
<1 row affected>
<1 row affected>
<1 row affected>
<1 row affected>
<1 row affected>
<1 row affected>
Also, I am a newbie obviously, but what do Level # and State # mean, and how do I look up error messages such as the one above (8152)?
From @gmmastros's answer:
Whenever you see the message....
string or binary data would be truncated
Think to yourself... The field is NOT big enough to hold my data.
Check the table structure for the customers table. I think you'll find that the length of one or more fields is NOT big enough to hold the data you are trying to insert. For example, if the Phone field is a varchar(8) field, and you try to put 11 characters in to it, you will get this error.
I had this issue although data length was shorter than the field length.
It turned out that the problem was having another log table (for audit trail), filled by a trigger on the main table, where the column size also had to be changed.
In one of the INSERT statements you are attempting to insert a too long string into a string (varchar or nvarchar) column.
If it's not obvious which INSERT is the offender by a mere look at the script, you could count the <1 row affected> lines that occur before the error message. The obtained number plus one gives you the statement number. In your case it seems to be the second INSERT that produces the error.
Just to contribute additional information: I had the same issue, and it was because the field wasn't big enough for the incoming data, and this thread helped me solve it (the top answer clarifies it all).
BUT it is very important to know what are the possible reasons that may cause it.
In my case I was creating the table with a field like this:
SELECT '' AS Period, * INTO #NewTable FROM Transactions
Therefore the field "Period" had a length of zero, causing the INSERT operations to fail. I changed it to 'XXXXXX', which is the length of the incoming data, and it now works properly (because the field now has a length of 6).
I hope this helps anyone with the same issue :)
Some of your data cannot fit into your database column (it is too small). It is not easy to find what is wrong. If you use C# and Linq2Sql, you can list the fields that would be truncated:
First create helper class:
public class SqlTruncationExceptionWithDetails : ArgumentOutOfRangeException
{
public SqlTruncationExceptionWithDetails(System.Data.SqlClient.SqlException inner, DataContext context)
: base(inner.Message + " " + GetSqlTruncationExceptionWithDetailsString(context))
{
}
/// <summary>
/// Part of the code is from the following link:
/// http://stackoverflow.com/questions/3666954/string-or-binary-data-would-be-truncated-linq-exception-cant-find-which-fiel
/// </summary>
/// <param name="context"></param>
/// <returns></returns>
static string GetSqlTruncationExceptionWithDetailsString(DataContext context)
{
StringBuilder sb = new StringBuilder();
foreach (object update in context.GetChangeSet().Updates)
{
FindLongStrings(update, sb);
}
foreach (object insert in context.GetChangeSet().Inserts)
{
FindLongStrings(insert, sb);
}
return sb.ToString();
}
public static void FindLongStrings(object testObject, StringBuilder sb)
{
foreach (var propInfo in testObject.GetType().GetProperties())
{
foreach (System.Data.Linq.Mapping.ColumnAttribute attribute in propInfo.GetCustomAttributes(typeof(System.Data.Linq.Mapping.ColumnAttribute), true))
{
if (attribute.DbType.ToLower().Contains("varchar"))
{
string dbType = attribute.DbType.ToLower();
int numberStartIndex = dbType.IndexOf("varchar(") + 8;
int numberEndIndex = dbType.IndexOf(")", numberStartIndex);
string lengthString = dbType.Substring(numberStartIndex, (numberEndIndex - numberStartIndex));
int maxLength = 0;
int.TryParse(lengthString, out maxLength);
string currentValue = (string)propInfo.GetValue(testObject, null);
if (!string.IsNullOrEmpty(currentValue) && maxLength != 0 && currentValue.Length > maxLength)
{
//string is too long
sb.AppendLine(testObject.GetType().Name + "." + propInfo.Name + " " + currentValue + " Max: " + maxLength);
}
}
}
}
}
}
Then prepare the wrapper for SubmitChanges:
public static class DataContextExtensions
{
public static void SubmitChangesWithDetailException(this DataContext dataContext)
{
//http://stackoverflow.com/questions/3666954/string-or-binary-data-would-be-truncated-linq-exception-cant-find-which-fiel
try
{
//this can fail on data truncation
dataContext.SubmitChanges();
}
catch (SqlException sqlException) //when (sqlException.Message == "String or binary data would be truncated.")
{
if (sqlException.Message == "String or binary data would be truncated.") //only for EN windows - if you are running different window language, invoke the sqlException.getMessage on thread with EN culture
throw new SqlTruncationExceptionWithDetails(sqlException, dataContext);
else
throw;
}
}
}
Prepare global exception handler and log truncation details:
protected void Application_Error(object sender, EventArgs e)
{
Exception ex = Server.GetLastError();
string message = ex.Message;
//TODO - log to file
}
Finally use the code:
Datamodel.SubmitChangesWithDetailException();
Another situation in which you can get this error is the following:
I had the same error, and the reason was that in an INSERT statement that received data from a UNION, the order of the columns was different from the original table. If you change the order in #table3 to a, b, c, you will fix the error.
select a, b, c into #table1
from #table0
insert into #table1
select a, b, c from #table2
union
select a, c, b from #table3
On SQL Server you can use SET ANSI_WARNINGS OFF like this:
using (SqlConnection conn = new SqlConnection("Data Source=XRAYGOAT\\SQLEXPRESS;Initial Catalog='Healthy Care';Integrated Security=True"))
{
conn.Open();
using (var trans = conn.BeginTransaction())
{
try
{
using (var cmd = new SqlCommand("", conn, trans))
{
cmd.CommandText = "SET ANSI_WARNINGS OFF";
cmd.ExecuteNonQuery();
cmd.CommandText = "YOUR INSERT HERE";
cmd.ExecuteNonQuery();
cmd.Parameters.Clear();
cmd.CommandText = "SET ANSI_WARNINGS ON";
cmd.ExecuteNonQuery();
trans.Commit();
}
}
catch (Exception)
{
trans.Rollback();
}
}
conn.Close();
}
I had the same issue. The length of my column was too short.
What you can do is either increase the length or shorten the text you want to put in the database.
I also had this problem occurring on the web application surface.
Eventually I found out that the same error message was coming from the SQL UPDATE statement on a specific table.
Finally I figured out that the column definition in the related history table(s) did not match the original table's column length for nvarchar types in some specific cases.
I had the same problem, even after increasing the size of the problematic columns in the table.
tl;dr: The length of the matching columns in corresponding Table Types may also need to be increased.
In my case, the error was coming from the Data Export service in Microsoft Dynamics CRM, which allows CRM data to be synced to an SQL Server DB or Azure SQL DB.
After a lengthy investigation, I concluded that the Data Export service must be using Table-Valued Parameters:
You can use table-valued parameters to send multiple rows of data to a Transact-SQL statement or a routine, such as a stored procedure or function, without creating a temporary table or many parameters.
As you can see in the documentation above, Table Types are used to create the data ingestion procedure:
CREATE TYPE LocationTableType AS TABLE (...);
CREATE PROCEDURE dbo.usp_InsertProductionLocation
@TVP LocationTableType READONLY
Unfortunately, there is no way to alter a Table Type, so it has to be dropped and recreated entirely. Since my table has over 300 fields (😱), I created a query to facilitate the creation of the corresponding Table Type based on the table's column definitions (just replace [table_name] with your table's name):
SELECT 'CREATE TYPE [table_name]Type AS TABLE (' + STRING_AGG(CAST(field AS VARCHAR(max)), ',' + CHAR(10)) + ');' AS create_type
FROM (
SELECT TOP 5000 COLUMN_NAME + ' ' + DATA_TYPE
+ IIF(CHARACTER_MAXIMUM_LENGTH IS NULL, '', CONCAT('(', IIF(CHARACTER_MAXIMUM_LENGTH = -1, 'max', CONCAT(CHARACTER_MAXIMUM_LENGTH,'')), ')'))
+ IIF(DATA_TYPE = 'decimal', CONCAT('(', NUMERIC_PRECISION, ',', NUMERIC_SCALE, ')'), '')
AS field
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = '[table_name]'
ORDER BY ORDINAL_POSITION) AS T;
After updating the Table Type, the Data Export service started functioning properly once again! :)
When I tried to execute my stored procedure, I had the same problem because the column I needed to add data to was shorter than the data I wanted to add.
You can increase the size of the column's data type or reduce the length of your data.
A 2016/2017 update will show you the bad value and column.
A new trace flag will swap the old error for a new 2628 error and will print out the column and offending value. Trace flag 460 is available in the latest cumulative updates for 2016 and 2017:
https://support.microsoft.com/en-sg/help/4468101/optional-replacement-for-string-or-binary-data-would-be-truncated
Just make sure that after you've installed the CU you enable the trace flag, either globally/permanently on the server or with DBCC TRACEON:
https://learn.microsoft.com/en-us/sql/t-sql/database-console-commands/dbcc-traceon-trace-flags-transact-sql?view=sql-server-ver15
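For illustration, a sketch of enabling the flag for the current session and observing the new error from C# (the demo table and column are hypothetical):
using (SqlConnection conn = new SqlConnection(connectionString))
{
    conn.Open();
    // Enable trace flag 460 for this session (add -1 to the arguments for global scope, given permissions)
    new SqlCommand("DBCC TRACEON (460);", conn).ExecuteNonQuery();
    try
    {
        // demo is a hypothetical table with a column col VARCHAR(4)
        new SqlCommand("INSERT INTO demo (col) VALUES ('too long');", conn).ExecuteNonQuery();
    }
    catch (SqlException ex)
    {
        // With the flag on, error 2628 names the table, column, and truncated value
        Console.WriteLine(ex.Number + ": " + ex.Message);
    }
}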
Another situation in which this error may occur is in SQL Server Management Studio. If you have "text" or "ntext" fields in your table, it can happen no matter what kind of field you are updating (for example a bit or integer). It seems that the Studio does not load entire "ntext" fields and also updates ALL fields instead of just the modified one.
To solve the problem, exclude the "text" or "ntext" fields from the query in Management Studio.
This error comes up only when the length of one of your values is greater than the field length specified in the SQL Server table structure.
To overcome this issue, you have to either reduce the length of the value or increase the length of the database table field.
If someone is encountering this error in a C# application, I have created a simple way of finding the offending fields by:
Getting the column width of all the columns of the table we're trying to insert into or update. (I'm getting this info directly from the database.)
Comparing the column widths to the width of the values we're trying to insert or update.
Assumptions/Limitations:
The column names of the table in the database match the C# entity fields. For example, if you have a column named SourceData in the database, you need to have your entity expose the same column name:
public class SomeTable
{
// Other fields
public string SourceData { get; set; }
}
You're inserting/ updating 1 entity at a time. It'll be clearer in the demo code below. (If you're doing bulk inserts/ updates, you might want to either modify it or use some other solution.)
Step 1:
Get the column width of all the columns directly from the database:
// For this, I took help from Microsoft docs website:
// https://learn.microsoft.com/en-us/dotnet/api/system.data.sqlclient.sqlconnection.getschema?view=netframework-4.7.2#System_Data_SqlClient_SqlConnection_GetSchema_System_String_System_String___
private static Dictionary<string, int> GetColumnSizesOfTableFromDatabase(string tableName, string connectionString)
{
var columnSizes = new Dictionary<string, int>();
using (var connection = new SqlConnection(connectionString))
{
// Connect to the database then retrieve the schema information.
connection.Open();
// You can specify the Catalog, Schema, Table Name, and Column Name to get the specified column(s).
// You can use four restrictions for Column, so you should create a 4-member array.
String[] columnRestrictions = new String[4];
// In the array, member 0 represents Catalog; member 1 represents Schema;
// member 2 represents Table Name; member 3 represents Column Name.
// Here we specify only the Table_Name of the columns whose schema information we want.
columnRestrictions[2] = tableName;
DataTable allColumnsSchemaTable = connection.GetSchema("Columns", columnRestrictions);
foreach (DataRow row in allColumnsSchemaTable.Rows)
{
var columnName = row.Field<string>("COLUMN_NAME");
//var dataType = row.Field<string>("DATA_TYPE");
var characterMaxLength = row.Field<int?>("CHARACTER_MAXIMUM_LENGTH");
// I'm only capturing columns whose Datatype is "varchar" or "char", i.e. their CHARACTER_MAXIMUM_LENGTH won't be null.
if(characterMaxLength != null)
{
columnSizes.Add(columnName, characterMaxLength.Value);
}
}
connection.Close();
}
return columnSizes;
}
Step 2:
Compare the column widths with the width of the values we're trying to insert/ update:
public static Dictionary<string, string> FindLongBinaryOrStringFields<T>(T entity, string connectionString)
{
var tableName = typeof(T).Name;
Dictionary<string, string> longFields = new Dictionary<string, string>();
var objectProperties = GetProperties(entity);
//var fieldNames = objectProperties.Select(p => p.Name).ToList();
var actualDatabaseColumnSizes = GetColumnSizesOfTableFromDatabase(tableName, connectionString);
foreach (var dbColumn in actualDatabaseColumnSizes)
{
var maxLengthOfThisColumn = dbColumn.Value;
var currentValueOfThisField = objectProperties.Where(f => f.Name == dbColumn.Key).First()?.GetValue(entity, null)?.ToString();
if (!string.IsNullOrEmpty(currentValueOfThisField) && currentValueOfThisField.Length > maxLengthOfThisColumn)
{
longFields.Add(dbColumn.Key, $"'{dbColumn.Key}' column cannot take the value of '{currentValueOfThisField}' because the max length it can take is {maxLengthOfThisColumn}.");
}
}
return longFields;
}
public static List<PropertyInfo> GetProperties<T>(T entity)
{
//The DeclaredOnly flag makes sure you only get properties of the object, not from the classes it derives from.
var properties = entity.GetType()
.GetProperties(System.Reflection.BindingFlags.Public
| System.Reflection.BindingFlags.Instance
| System.Reflection.BindingFlags.DeclaredOnly)
.ToList();
return properties;
}
Demo:
Let's say we're trying to insert someTableEntity of SomeTable class that is modeled in our app like so:
public class SomeTable
{
[Key]
public long TicketID { get; set; }
public string SourceData { get; set; }
}
And it's inside our SomeDbContext like so:
public class SomeDbContext : DbContext
{
public DbSet<SomeTable> SomeTables { get; set; }
}
This table in the DB has the SourceData field defined as varchar(16).
Now we'll try to insert a value that is longer than 16 characters into this field and capture this information:
public void SaveSomeTableEntity()
{
var connectionString = "server=SERVER_NAME;database=DB_NAME;User ID=SOME_ID;Password=SOME_PASSWORD;Connection Timeout=200";
using (var context = new SomeDbContext(connectionString))
{
var someTableEntity = new SomeTable()
{
SourceData = "Blah-Blah-Blah-Blah-Blah-Blah"
};
context.SomeTables.Add(someTableEntity);
try
{
context.SaveChanges();
}
catch (Exception ex)
{
if (ex.GetBaseException().Message == "String or binary data would be truncated.\r\nThe statement has been terminated.")
{
var badFieldsReport = "";
List<string> badFields = new List<string>();
// YOU GOT YOUR FIELDS RIGHT HERE:
var longFields = FindLongBinaryOrStringFields(someTableEntity, connectionString);
foreach (var longField in longFields)
{
badFields.Add(longField.Key);
badFieldsReport += longField.Value + "\n";
}
}
else
throw;
}
}
}
The badFieldsReport will have this value:
'SourceData' column cannot take the value of
'Blah-Blah-Blah-Blah-Blah-Blah' because the max length it can take is
16.
Kevin Pope's comment under the accepted answer was what I needed.
The problem, in my case, was that I had triggers defined on my table that would insert update/insert transactions into an audit table, but the audit table had a data type mismatch: a column that was VARCHAR(MAX) in the original table was stored as VARCHAR(1) in the audit table, so my triggers failed whenever I inserted anything longer than VARCHAR(1) into the original table column, and I would get this error message.
I used a different tactic: some fields are allocated 8K in places where only about 50-100 characters are actually used, so I truncate the values with LEFT() as they are loaded.
declare @NVPN_list as table (
    nvpn varchar(50)
    ,nvpn_revision varchar(5)
    ,nvpn_iteration INT
    ,mpn_lifecycle varchar(30)
    ,mfr varchar(100)
    ,mpn varchar(50)
    ,mpn_revision varchar(5)
    ,mpn_iteration INT
    ,mfr_order_num varchar(50)
    -- ...
)

INSERT INTO @NVPN_list
SELECT left(nvpn, 50) as nvpn
    ,left(nvpn_revision, 5) as nvpn_revision
    ,nvpn_iteration
    ,left(mpn_lifecycle, 30)
    ,left(mfr, 100)
    ,left(mpn, 50)
    ,left(mpn_revision, 5)
    ,mpn_iteration
    ,left(mfr_order_num, 50)
FROM [DASHBOARD].[dbo].[mpnAttributes] (NOLOCK) mpna
I wanted speed, since I load 28K records out of 1M total.
This error may occur when the field size is smaller than the data you are entering.
For example, if you have the data type nvarchar(7) and your value is 'aaaaddddf', then the error shown is:
string or binary data would be truncated
You simply can't beat SQL Server on this.
You can insert into a new table like this:
select foo, bar
into tmp_new_table_to_dispose_later
from my_table
and compare the table definition with the real table you want to insert the data into.
Sometimes it's helpful, sometimes it's not.
If you try inserting into the final/real table from that temporary table, it may just work (because data conversion can behave differently there than in SSMS, for example).
Another alternative is to insert the data in chunks: instead of inserting everything immediately, insert with TOP 1000 and repeat the process until you find the chunk with the error. At least you get better visibility into what's not fitting into the table.

Compare and get difference of 2 CSV files with C#

I'm reading a CSV file a few times a day. It's about 300MB, and each time I have to read through it, compare it with the existing data in the database, add new rows, hide old ones, and update existing ones. There is also a bunch of data that never gets touched.
I have access to all files, both old and new, and I'd like to compare the new one with the previous one and just update what's changed in the file. I have no idea what to do, and I'm using C# to do all my work. The one thing that might be most problematic is that a row from the previous file might be in another location in the new file even if it hasn't been updated at all. I want to avoid that problem as well, if possible.
Any idea would help.
Use one of the existing CSV parsers
Parse each row to a mapped class object
Override Equals and GetHashCode for your object
Keep a List<T> or HashSet<T> in memory; initialize it empty at the start.
On reading each line from the CSV file, check whether it exists in your in-memory collection (List, HashSet).
If the object doesn't exist in your in-memory collection, add it to the collection and insert it into the database.
If the object exists in your in-memory collection, ignore it. (The check is based on your Equals and GetHashCode implementations, and then it is as simple as if(inMemoryCollection.Contains(currentRowObject)); see the sketch below.)
I guess you have a Windows service reading CSV files periodically from a file location. You can repeat the above process every time you read a new CSV file. This way you will be able to maintain an in-memory collection of the previously inserted objects and ignore them, irrespective of their position in the CSV file.
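As a minimal sketch of that Equals/GetHashCode override and the HashSet check (the CsvRow shape, its three fields, and the ParseCsv/InsertIntoDatabase helpers are invented for illustration):
public class CsvRow
{
    public string Id { get; set; }
    public string Name { get; set; }
    public decimal Amount { get; set; }

    public override bool Equals(object obj)
    {
        // Two rows represent the same record when every mapped field matches
        return obj is CsvRow other
            && Id == other.Id
            && Name == other.Name
            && Amount == other.Amount;
    }

    public override int GetHashCode()
    {
        // Must be consistent with Equals so HashSet lookups work correctly
        return (Id ?? "").GetHashCode() ^ (Name ?? "").GetHashCode() ^ Amount.GetHashCode();
    }
}

// Usage: position in the file no longer matters, only content does
var seen = new HashSet<CsvRow>();
foreach (CsvRow row in ParseCsv(newFilePath)) // ParseCsv = your CSV parser + mapping step
{
    if (seen.Add(row)) // Add returns false when an equal row is already present
    {
        InsertIntoDatabase(row); // your DB writing routine
    }
}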
If you have a primary key defined for your data, then you can use a Dictionary<TKey, TValue> where the key is the unique field. This will give you better comparison performance, and you can skip the Equals and GetHashCode implementations.
As a backup to this process, your DB writing routine/stored procedure should be defined so that it first checks whether the record already exists in the table; if so, it updates the row, otherwise it inserts a new record. This would be an UPSERT.
Remember, if you end up maintaining an in-memory collection, keep clearing it periodically; otherwise you could end up with an out-of-memory exception.
Just curious: why do you have to compare the old file with the new file? Isn't the data from the old file in SQL Server already? (When you say database, you mean SQL Server, right? I'm assuming SQL Server because you use C#/.NET.)
My approach is simple:
Load new CSV file into a staging table
Use stored procs to insert, update, and set inactive rows
public static void ProcessCSV(FileInfo file)
{
    foreach (string line in ReturnLines(file))
    {
        //break the line up and parse the values into parameters
        string[] fields = line.Split(',');
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand command = conn.CreateCommand())
        {
            command.CommandType = CommandType.StoredProcedure;
            command.CommandText = "[dbo].sp_InsertToStaging";
            //values parsed from the line; adjust the indexes to your file layout
            command.Parameters.Add("@id", SqlDbType.BigInt).Value = long.Parse(fields[0]);
            command.Parameters.Add("@SomethingElse", SqlDbType.VarChar).Value = fields[1];
            //execute
            if (conn.State != ConnectionState.Open)
                conn.Open();
            try
            {
                command.ExecuteNonQuery();
            }
            catch (SqlException exc)
            {
                //throw or do something
            }
        }
    }
}
public static IEnumerable<string> ReturnLines(FileInfo file)
{
using (FileStream stream = File.Open(file.FullName, FileMode.Open, FileAccess.Read, FileShare.Read))
using (StreamReader reader = new StreamReader(stream))
{
string line;
while ((line = reader.ReadLine()) != null)
{
yield return line;
}
}
}
Now you write stored procs to insert, update, or set rows inactive based on Ids. You'll know a row was updated if Field_x(main_table) != Field_x(staging_table) for a particular Id, and so on.
Here's how you detect changes and updates between your main table and staging table.
/* SECTION: SET INACTIVE */
UPDATE main_table
SET IsActiveTag = 0
WHERE unique_identifier IN
(
SELECT a.unique_identifier
FROM main_table AS a INNER JOIN staging_table AS b
--inner join because you only want existing records
ON a.unique_identifier = b.unique_identifier
--detect any updates
WHERE a.field1 <> b.field1
OR a.field2 <> b.field2
OR a.field3 <> b.field3
--etc
)
/* SECTION: INSERT UPDATED AND NEW */
INSERT INTO main_table
SELECT *
FROM staging_table AS b
LEFT JOIN
(SELECT *
FROM main_table
--only get active records
WHERE IsActiveTag = 1) AS a
ON b.unique_identifier = a.unique_identifier
--select only records available in staging table
WHERE a.unique_identifier IS NULL
How big is the CSV file? If it's small, try the following:
string[] File1Lines = File.ReadAllLines(pathOfFileA);
string[] File2Lines = File.ReadAllLines(pathOfFileB);
List<string> NewLines = new List<string>();
// Note: this assumes both files have the same number of lines; guard the indexes otherwise
for (int lineNo = 0; lineNo < File1Lines.Length; lineNo++)
{
    if (!String.IsNullOrEmpty(File1Lines[lineNo]) &&
        !String.IsNullOrEmpty(File2Lines[lineNo]))
    {
        if (String.Compare(File1Lines[lineNo], File2Lines[lineNo]) != 0)
            NewLines.Add(File2Lines[lineNo]);
    }
    else if (!String.IsNullOrEmpty(File1Lines[lineNo]))
    {
        // line exists only in the old file; nothing to add
    }
    else
    {
        NewLines.Add(File2Lines[lineNo]);
    }
}
if (NewLines.Count > 0)
{
    File.WriteAllLines(newfilepath, NewLines);
}

Multithreading, DataReaders & Bulk Inserts... Can this app be multithreaded?

Is it possible to extract the code section inside my code and have it run in multiple threads?
The app copies over data from a FoxPro db to our SQL server over a network (the files are quite huge so the bulk copy needs to happen in increments...
It works, but I'd like to bump up the speed a bit.
1) by having the section I marked run in multiple threads, OR, as an alternative,
2) by not looping through each column in the DataRow.
I went for the second option... (updated code below)
CODE
private void BulkCopy(OleDbDataReader reader, string tableName, Table table)
{
if (Convert.ToBoolean(ConfigurationManager.AppSettings["CopyData"]))
{
Console.WriteLine(tableName + " BulkCopy Started.");
try
{
DataTable tbl = new DataTable();
foreach (Column col in table.Columns)
{
tbl.Columns.Add(col.Name, ConvertDataTypeToType(col.DataType));
}
int batch = 1;
int counter = 0;
DataRow tblRow = tbl.NewRow();
while (reader.Read())
{
counter++;
////This section changed
object[] obj = tblRow.ItemArray;
reader.GetValues(obj);
tblRow.ItemArray = obj;
////**********
tbl.LoadDataRow(tblRow.ItemArray, true);
if (counter == BulkInsertIncrement)
{
Console.WriteLine(tableName + " :: Batch >> " + batch);
counter = PerformInsert(tableName, tbl, batch);
batch++;
}
}
if (counter > 0)
{
Console.WriteLine(tableName + " :: Batch >> " + batch);
PerformInsert(tableName, tbl, counter);
}
tbl = null;
Console.WriteLine("BulkCopy Success!");
}
catch (Exception)
{
Console.WriteLine("BulkCopy Fail!");
}
finally
{
reader.Close();
reader.Dispose();
}
Console.WriteLine(tableName + " BulkCopy Ended.");
}
}
UPDATE
I went for the second option
I wasn't aware that, inside the while (reader.Read()) loop, I could do the following. It helped to greatly increase the app's performance.
while (reader.Read())
{
object[] obj = tblRow.ItemArray;
reader.GetValues(obj);
tblRow.ItemArray = obj;
tbl.LoadDataRow(tblRow.ItemArray, true);
}
There is no need to multithread if you take out the beginner mistakes you've made. There is a TON of slow code everywhere.
tblRow[col.Name] = reader[col.Name];
SLOW. NEVER look up by name - get the indexes outside the loop, then use the indexes. This line does 2 (!) dictionary lookups for every row, taking more time than the row processing itself.
DataTables/DataSets are dead slow to start with (a bad technology choice), but code like that really slows you down. Use a profiler to see the other bad elements.
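As a minimal sketch of the by-index pattern that answer describes (the column names Id and Name are placeholders):
// Resolve ordinals once, before the loop: GetOrdinal is the dictionary lookup
// you want to pay for exactly once per column, not twice per row.
int idIdx = reader.GetOrdinal("Id");
int nameIdx = reader.GetOrdinal("Name");
while (reader.Read())
{
    // Indexed access on both sides; no per-row name lookups
    tblRow[0] = reader.GetValue(idIdx);
    tblRow[1] = reader.GetValue(nameIdx);
}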
This may not be the answer you're after, but have you tried running the console application in release mode first, with just one try statement, and using indexes on the reader? It's probably not going to increase the speed a great deal by making it multi-threaded as SQL Server will be the main bottleneck.
Of course if you don't care too much about data integrity (for example your IDs aren't sequential) you could change the table locking type for inserts and spin up 3-4 threads to read from certain points in the table.
I don't think your use case will benefit significantly from a parallel foreach. It would also be pretty hard to implement, because of the OleDbDataReader used in your code.
But what you could do is schedule the inserts on a new thread, so that your loop does not block for the time SQL Server needs to insert your data.
You can use the Task.Factory.StartNew() method for this. But it will make error handling a bit more complex, in the sense that when an insert fails you might already have processed more data, or in the worst case another thread is already waiting with new inserts for the database.
If you're using .NET 4, you could try using the TPL and convert the foreach loop into something like:
Parallel.ForEach(table.Columns.Cast<Column>(), col => { /* rest of loop body here */ });

SqlBulkCopy Not Working

I have a DataSet populated from an Excel sheet. I wanted to use SqlBulkCopy to insert records into the Lead_Hdr table, where LeadId is the PK.
I am having following error while executing the code below:
The given ColumnMapping does not match up with any column in the
source or destination
string ConStr=ConfigurationManager.ConnectionStrings["ConStr"].ToString();
using (SqlBulkCopy s = new SqlBulkCopy(ConStr,SqlBulkCopyOptions.KeepIdentity))
{
if (MySql.State==ConnectionState.Closed)
{
MySql.Open();
}
s.DestinationTableName = "PCRM_Lead_Hdr";
s.NotifyAfter = 10000;
#region Comment
s.ColumnMappings.Clear();
#region ColumnMapping
s.ColumnMappings.Add("ClientID", "ClientID");
s.ColumnMappings.Add("LeadID", "LeadID");
s.ColumnMappings.Add("Company_Name", "Company_Name");
s.ColumnMappings.Add("Website", "Website");
s.ColumnMappings.Add("EmployeeCount", "EmployeeCount");
s.ColumnMappings.Add("Revenue", "Revenue");
s.ColumnMappings.Add("Address", "Address");
s.ColumnMappings.Add("City", "City");
s.ColumnMappings.Add("State", "State");
s.ColumnMappings.Add("ZipCode", "ZipCode");
s.ColumnMappings.Add("CountryId", "CountryId");
s.ColumnMappings.Add("Phone", "Phone");
s.ColumnMappings.Add("Fax", "Fax");
s.ColumnMappings.Add("TimeZone", "TimeZone");
s.ColumnMappings.Add("SicNo", "SicNo");
s.ColumnMappings.Add("SicDesc", "SicDesc");
s.ColumnMappings.Add("SourceID", "SourceID");
s.ColumnMappings.Add("ResearchAnalysis", "ResearchAnalysis");
s.ColumnMappings.Add("BasketID", "BasketID");
s.ColumnMappings.Add("PipeLineStatusId", "PipeLineStatusId");
s.ColumnMappings.Add("SurveyId", "SurveyId");
s.ColumnMappings.Add("NextCallDate", "NextCallDate");
s.ColumnMappings.Add("CurrentRecStatus", "CurrentRecStatus");
s.ColumnMappings.Add("AssignedUserId", "AssignedUserId");
s.ColumnMappings.Add("AssignedDate", "AssignedDate");
s.ColumnMappings.Add("ToValueAmt", "ToValueAmt");
s.ColumnMappings.Add("Remove", "Remove");
s.ColumnMappings.Add("Release", "Release");
s.ColumnMappings.Add("Insert_Date", "Insert_Date");
s.ColumnMappings.Add("Insert_By", "Insert_By");
s.ColumnMappings.Add("Updated_Date", "Updated_Date");
s.ColumnMappings.Add("Updated_By", "Updated_By");
#endregion
#endregion
s.WriteToServer(sourceTable);
s.Close();
MySql.Close();
}
I've encountered the same problem while copying data from Access to SQL Server 2005, and I found that the column mappings are case sensitive on both data sources, regardless of the databases' case sensitivity.
Well, is it right? Do the column names exist on both sides?
To be honest, I've never bothered with mappings. I like to keep things simple - I tend to have a staging table that looks like the input on the server, then I SqlBulkCopy into the staging table, and finally run a stored procedure to move the data from the staging table into the actual table; advantages:
no issues with live data corruption if the import fails at any point
I can put a transaction just around the SPROC
I can have the bcp work without logging, safe in the knowledge that the SPROC will be logged
it is simple ;-p (no messing with mappings)
As a final thought - if you are dealing with bulk data, you can get better throughput using an IDataReader (since this is a streaming API, whereas DataTable is a buffered API). For example, I tend to hook CSV imports up using CsvReader as the source for a SqlBulkCopy. Alternatively, I have written shims around XmlReader to present each first-level element as a row in an IDataReader - very fast.
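A sketch of that streaming pattern, assuming the CsvHelper package (its CsvDataReader implements IDataReader, so SqlBulkCopy never buffers the whole file; the file path and table name are placeholders):
using (var reader = new StreamReader("input.csv"))
using (var csv = new CsvHelper.CsvReader(reader, System.Globalization.CultureInfo.InvariantCulture))
using (var csvDataReader = new CsvHelper.CsvDataReader(csv)) // IDataReader over the CSV
using (var bulkCopy = new SqlBulkCopy(connectionString))
{
    bulkCopy.DestinationTableName = "dbo.StagingTable";
    bulkCopy.BatchSize = 10000;
    bulkCopy.WriteToServer(csvDataReader); // streams rows straight through
}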
The answer by Marc would be my recommendation (using a staging table). This ensures that if your source doesn't change, you'll have fewer issues importing in the future.
However, in my experience, you can check the following issues:
Column names match in source and table
That the column types match
If you think you did this and still have no success, you can try the following:
1 - Allow nulls in all columns in your table
2 - Comment out all column mappings
3 - Rerun, adding one column at a time, until you find where your issue is
That should bring out the bug.
One of the reasons is that SqlBulkCopy is case sensitive. Follow these steps:
First, find your column in the source table by using the "Contains" method in C#.
Once your destination column is matched with the source column, get the index of that column and give its column name to SqlBulkCopy.
For example:
//Get Column from Source table
string sourceTableQuery = "Select top 1 * from sourceTable";
DataTable dtSource = SQLHelper.SqlHelper.ExecuteDataset(transaction, CommandType.Text, sourceTableQuery).Tables[0]; // I use a SQL helper to execute the query; plain ADO.NET works too
for (int i = 0; i < destinationTable.Columns.Count; i++)
{
    //check if the destination column exists in the source table
    if (dtSource.Columns.Contains(destinationTable.Columns[i].ToString())) //Contains is not case sensitive
    {
        int sourceColumnIndex = dtSource.Columns.IndexOf(destinationTable.Columns[i].ToString()); //once the column is matched, get its index
        bulkCopy.ColumnMappings.Add(dtSource.Columns[sourceColumnIndex].ToString(), dtSource.Columns[sourceColumnIndex].ToString()); //use the source table's column name rather than the destination table's to avoid case-sensitivity issues
    }
}
bulkCopy.WriteToServer(destinationTable);
bulkCopy.Close();
I would go with the staging idea; however, here is my approach to handling the case-sensitive nature. I'm happy to be critiqued on my LINQ.
using (SqlConnection connection = new SqlConnection(conn_str))
{
connection.Open();
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
{
bulkCopy.DestinationTableName = string.Format("[{0}].[{1}].[{2}]", targetDatabase, targetSchema, targetTable);
var targetColumsAvailable = GetSchema(conn_str, targetTable).ToArray();
foreach (var column in dt.Columns)
{
if (targetColumsAvailable.Select(x => x.ToUpper()).Contains(column.ToString().ToUpper()))
{
var tc = targetColumsAvailable.Single(x => String.Equals(x, column.ToString(), StringComparison.CurrentCultureIgnoreCase));
bulkCopy.ColumnMappings.Add(column.ToString(), tc);
}
}
// Write from the source to the destination.
bulkCopy.WriteToServer(dt);
bulkCopy.Close();
}
}
and the helper method
private static IEnumerable<string> GetSchema(string connectionString, string tableName)
{
using (SqlConnection connection = new SqlConnection(connectionString))
using (SqlCommand command = connection.CreateCommand())
{
command.CommandText = "sp_Columns";
command.CommandType = CommandType.StoredProcedure;
command.Parameters.Add("#table_name", SqlDbType.NVarChar, 384).Value = tableName;
connection.Open();
using (var reader = command.ExecuteReader())
{
while (reader.Read())
{
yield return (string)reader["column_name"];
}
}
}
}
What I have found is that the columns in the table and the columns in the input must at least match. You can have more columns in the table and the input will still load. If you have fewer, you'll receive the error.
Thought a long time about answering...
Even if the column names match, including case, you will get the same error if the data types differ. So check the column names and their data types.
P.S.: staging tables are definitely the way to import.
