Issue: Multithreaded bulk insert - C#

I am trying to do a bulk insert with many threads. Using LINQ, I read 1000 rows at a time from a DataTable (called 'dt'), copy them into a new DataTable, and bulk-insert that into the database.
Here is the code that initializes the threads:
ManualResetEvent[] doneEvents = new ManualResetEvent[10];
BancoDAO[] fibArray = new BancoDAO[10];
for (int i = 0; i < 10; i++)
{
    doneEvents[i] = new ManualResetEvent(false);
    BancoDAO bd = new BancoDAO()
    {
        _doneEvent = doneEvents[i],
        dataTable = dt.AsEnumerable()
                      .Skip(i * 1000)
                      .Take(1000)
                      .CopyToDataTable<DataRow>()
    };
    fibArray[i] = bd;
    ThreadPool.QueueUserWorkItem(bd.ThreadPoolCallback, i);
}
WaitHandle.WaitAll(doneEvents);
As you can see, the insert occurs in the BancoDAO class. Here is the code:
public DataTable dataTable = new DataTable();
public ManualResetEvent _doneEvent;

public void ThreadPoolCallback(Object threadContext)
{
    int threadIndex = (int)threadContext;
    GravaTabelaThread(dataTable);
    _doneEvent.Set();
}

public static void GravaTabelaThread(DataTable dt)
{
    OracleConnection cteste = new OracleConnection(ConfigurationManager.ConnectionStrings["TesteUpload"].ToString());
    cteste.Open();
    OracleBulkCopy bcp = new OracleBulkCopy(cteste);
    bcp.DestinationTableName = "MAG_T_SORTIMENTO2";
    foreach (KeyValuePair<string, string> k in ColumnMappings())
    {
        bcp.ColumnMappings.Add(k.Key, k.Value);
    }
    try
    {
        bcp.WriteToServer(dt);
        bcp.Dispose();
    }
    catch (Exception ex)
    {
    }
    cteste.Close();
    // Now I open and close the connection every time I do the bulk insert.
    // I'm not reusing the same connection anymore.
}
The problem is: some threads insert the values into the database, but sometimes this exception is thrown (I'll translate the Oracle message from Portuguese into English, so please keep in mind this is a free translation):
{Oracle.DataAccess.Client.OracleException Error in row '1' column '1'
ORA-39776: fatal Direct Path API error while loading table USR_TRANSF.MAG_T_SORTIMENTO2
ORA-39781: Direct path stream loads are not allowed after another context loading the same table has been terminated
at Oracle.DataAccess.Client.OracleBulkCopy.PerformBulkCopy()
at Oracle.DataAccess.Client.OracleBulkCopy.WriteDataSourceToServer()
at Oracle.DataAccess.Client.OracleBulkCopy.WriteToServer(DataTable table, DataRowState rowState)
at Oracle.DataAccess.Client.OracleBulkCopy.WriteToServer(DataTable table)
at UploadArquivo.BancoDAO.GravaTabelaThread(DataTable dt) in c:\Users\Rafael.pinho\Desktop\UploadArquivo\UploadArquivo\UploadArquivo\BancoDAO.cs:line 49}

It appears from the ODP.NET Developer's Guide that the OracleBulkCopy class does a direct-path load. If that's the case, it is not really compatible with a multithreaded application: only one session can be doing a direct-path load on a particular object at any point in time. You could serialize your threads so that only one thread has an open load at any point in time, but that would largely defeat the purpose of multithreading on the client. On the other hand, since a direct-path insert is the most efficient way to load data, a single session should be able to load data about as quickly as you can pump it over the network (assuming, of course, that your database can process data that quickly, which I would expect).
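If you do decide to keep worker threads for preparing the batches, here is a minimal sketch of serializing just the load itself, reusing the names from the question; the static gate object is my addition, and it does mean the loads themselves run one at a time:
private static readonly object BulkCopyGate = new object();

public static void GravaTabelaThread(DataTable dt)
{
    lock (BulkCopyGate) // only one direct-path load on the table at a time
    {
        using (var conn = new OracleConnection(ConfigurationManager.ConnectionStrings["TesteUpload"].ConnectionString))
        {
            conn.Open();
            using (var bcp = new OracleBulkCopy(conn))
            {
                bcp.DestinationTableName = "MAG_T_SORTIMENTO2";
                foreach (KeyValuePair<string, string> k in ColumnMappings())
                {
                    bcp.ColumnMappings.Add(k.Key, k.Value);
                }
                bcp.WriteToServer(dt);
            }
        }
    }
}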

Related

How to get data from a SQL database using Dapper in parallel/multiple threads?

I am trying to get data from SQL Server using Dapper. I have a requirement to export 460K records stored in an Azure SQL database. I decided to get the data in batches, so I get 10K records in each batch. I planned to fetch the records in parallel, so I added the async methods to a list of tasks and did Task.WhenAll. The code works fine when I run it locally, but after deploying to a k8s cluster, I get a data read issue for some records. I am new to multithreading and I don't know how to handle this issue. I tried taking a lock inside the method, but then the system crashes. Below is my code; it might be clumsy because I was trying many solutions to fix the issue.
for (int i = 0; i < numberOfPages; i++)
{
    tableviewWithCondition.startRow = startRow;
    resultData.Add(_tableviewRepository.GetTableviewRowsByPagination(tableviewExportCondition.TableviewName, modelMappingGroups, tableviewWithCondition.startRow, builder, pageSize, appName, i));
    startRow += tableviewWithCondition.pageSize;
}
foreach (var task in resultData)
{
    if (task != null)
    {
        dataToExport.AddRange(task.Result);
    }
}
This is the method I implemented to get data from the Azure SQL database using Dapper.
public async Task<(IEnumerable<int> unprocessedData, IEnumerable<dynamic> rowData)> GetTableviewRowsByPagination(string tableName, IEnumerable<MappingGroup> tableviewAttributeDetails,
    int startRow, SqlBuilder builder, int pageSize = 100, AppNameEnum appName = AppNameEnum.OptiSoil, int taskNumber = 1)
{
    var _unitOfWork = _unitOfWorkServices.Build(appName.ToString());
    List<int> unprocessedData = new List<int>();
    try
    {
        var columns = tableviewAttributeDetails.Select(c => { return $"{c.mapping_group_value} [{c.attribute}]"; });
        var joinedColumn = string.Join(",", columns);
        builder.Select(joinedColumn);
        var selector = builder.AddTemplate($"SELECT /**select**/ FROM {tableName} with (nolock) /**innerjoin**/ /**where**/ /**orderby**/ OFFSET {startRow} ROWS FETCH NEXT {(pageSize == 0 ? 100 : pageSize)} ROWS ONLY");
        using (var connection = _unitOfWork.Connection)
        {
            connection.Open();
            var data = await connection.QueryAsync(selector.RawSql, selector.Parameters);
            Console.WriteLine($"data completed for task{taskNumber}");
            return (unprocessedData, data);
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Exception: {ex.Message}");
        if (ex.InnerException != null)
            Console.WriteLine($"InnerException: {ex.InnerException.Message}");
        Console.WriteLine($"Error in fetching from row {startRow}");
        unprocessedData.Add(startRow);
        return (unprocessedData, null);
    }
    finally
    {
        _unitOfWork.Dispose();
    }
}
The above code works fine locally, but on the server I get the issue below.
Exception: A transport-level error has occurred when sending the request to the server. (provider: TCP Provider, error: 35 - An internal exception was caught).
InnerException: The WriteAsync method cannot be called when another write operation is pending.
How can I avoid this issue when fetching data in parallel tasks?
You're using the same connection and trying to execute multiple commands over it (I'm assuming this because of the naming). Also, should you be disposing the unit of work? Rather than:
using (var connection = _unitOfWork.Connection)
{
    connection.Open();
    var data = await connection.QueryAsync(selector.RawSql, selector.Parameters);
    Console.WriteLine($"data completed for task{taskNumber}");
    return (unprocessedData, data);
}
create a new connection for each item, if this is what you truly want to do. I imagine, and this is an educated guess, that it's working locally because of timing.
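For example, a sketch of that change; _connectionString here is a placeholder for however you resolve the connection string, and a SqlConnection from System.Data.SqlClient is assumed:
using (var connection = new SqlConnection(_connectionString))
{
    await connection.OpenAsync(); // fresh connection per query, nothing shared across tasks
    var data = await connection.QueryAsync(selector.RawSql, selector.Parameters);
    Console.WriteLine($"data completed for task{taskNumber}");
    return (unprocessedData, data);
}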
Also look into Task.WhenAll; it's a better way to collect all the results. Rather than:
foreach (var task in resultData)
{
    if (task != null)
    {
        dataToExport.AddRange(task.Result);
    }
}
calling .Result on a task is usually bad practice.
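For example, a sketch assuming the calling method is made async and resultData holds the tasks queued in the loop above:
// Await all page tasks at once instead of blocking on .Result per task.
var results = await Task.WhenAll(resultData);
foreach (var (unprocessedData, rowData) in results)
{
    if (rowData != null)
    {
        dataToExport.AddRange(rowData);
    }
}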

C# thread safe class and list

The current need is to pull many records from a SQL database and then submit those records in much smaller blocks to an API call. The volume of data pulled from SQL is inconsistent, depending on how much data is loaded by another process, and it will sometimes be small enough to handle with only one worker. When the data pull is large (max 5K rows), it requires more threads querying the API to keep the speed up. The process works great, but when I run one worker at a time it is slow at large volumes. However, I'm finding that the list I pass to the class changes as multiple threads are launched. How can I achieve thread safety?
I've read about this and tried at length - locks and ConcurrentBag, for example - but I'm not quite sure where to apply them or how to use them to achieve what I am looking for.
Here is what I have:
class Start
{
    public void Execute()
    {
        SQLSupport sqlSupport = new SQLSupport();
        List<SQLDataList> sqlDataList = new List<SQLDataList>();
        List<SQLDataList> sqlAPIList = new List<SQLDataList>();
        sqlDataList = sqlSupport.sqlQueryReturnList<SQLDataList>("SELECT * FROM TABLE");
        if (sqlDataList.Count > 200)
        {
            int iRow = 0;
            int iRowSQLCount = 0;
            foreach (var item in sqlDataList)
            {
                if (iRowSQLCount == 100)
                {
                    APIProcess apiProcess = new APIProcess(sqlAPIList);
                    Thread thr = new Thread(new ThreadStart(apiProcess.Execute));
                    thr.Start();
                    sqlAPIList.Clear();
                    iRowSQLCount = 0;
                }
                sqlAPIList.Add(sqlDataList[iRow]);
                iRowSQLCount++;
                iRow++;
            }
        }
    }
}

class APIProcess
{
    List<SQLDataList> sqlAPIList = new List<SQLDataList>();

    public APIProcess(List<SQLDataList> sqlList)
    {
        sqlAPIList = sqlList;
    }

    public void Execute()
    {
        foreach (var item in sqlAPIList)
        {
            // loop through the list, interact with the API, update the list
            // and ultimately update SQL with the API data.
        }
    }
}
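Not an authoritative fix, but one way to see the problem in the code above: sqlAPIList.Clear() empties the very list instance the new thread is still iterating, because APIProcess stored a reference, not a copy. A minimal sketch that hands each thread its own snapshot, using the same names as above:
if (iRowSQLCount == 100)
{
    var batch = new List<SQLDataList>(sqlAPIList); // private copy for this thread
    var apiProcess = new APIProcess(batch);
    var thr = new Thread(new ThreadStart(apiProcess.Execute));
    thr.Start();
    sqlAPIList.Clear(); // safe now: the thread iterates its own copy
    iRowSQLCount = 0;
}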

Very slow foreach loop

I am working on an existing application. This application reads data from a huge file and then, after doing some calculations, stores the data in another table.
But the loop doing this (see below) takes a really long time. Since the file sometimes contains thousands of records, the entire process takes days.
Can I replace this foreach loop with something else? I tried using Parallel.ForEach and it did help. I am new to this, so I would appreciate your help.
foreach (record someRecord in someReport.r)
{
    try
    {
        using (var command = new SqlCommand("[procname]", sqlConn))
        {
            command.CommandTimeout = 0;
            command.CommandType = CommandType.StoredProcedure;
            command.Parameters.Add(…);
            IAsyncResult result = command.BeginExecuteReader();
            while (!result.IsCompleted)
            {
                System.Threading.Thread.Sleep(10);
            }
            command.EndExecuteReader(result);
        }
    }
    catch (Exception e)
    {
        …
    }
}
After reviewing the answers, I removed the async calls and edited the code as below. But this did not improve performance.
using (command = new SqlCommand("[sp]", sqlConn))
{
    command.CommandTimeout = 0;
    command.CommandType = CommandType.StoredProcedure;
    foreach (record someRecord in someReport.r)
    {
        command.Parameters.Clear();
        command.Parameters.Add(....);
        command.Prepare();
        using (dr = command.ExecuteReader())
        {
            while (dr.Read())
            {
                if ()
                {
                }
                else if ()
                {
                }
            }
        }
    }
}
Instead of hitting the SQL connection so many times, have you ever considered extracting the whole set of data from SQL Server and processing it in memory?
Edit: Decided to further explain what I meant.
You can do the following; pseudo code as follows:
Use a SELECT * to get all information from the database and store it in a list of your class, or a dictionary.
Do your foreach (record someRecord in someReport) and do the condition matching as usual, as in the sketch below.
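A rough sketch of that pseudo code; Record, connectionString, the table and the column names are placeholders, not from the original post:
var records = new List<Record>();
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("SELECT Id, Payload FROM SomeTable", conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            records.Add(new Record { Id = reader.GetInt32(0), Payload = reader.GetString(1) });
        }
    }
}
foreach (var someRecord in records)
{
    // condition matching happens in memory, with no round trip per record
}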
Step 1: Ditch the attempt at async. It isn't implemented properly, and you're blocking anyway, so just execute the procedure and see if that helps.
Step 2: Move the SqlCommand outside of the loop and reuse it for each iteration. That way you don't incur the cost of creating and destroying it for every item in your loop.
Warning: Make sure you reset/clear/remove parameters you don't need from the previous iteration. We did something like this with optional parameters and had 'bleed-thru' from the previous iteration because we didn't clean up parameters we didn't need!
Your biggest problem is that you're looping over this:
IAsyncResult result = command.BeginExecuteReader();
while (!result.IsCompleted)
{
    System.Threading.Thread.Sleep(10);
}
command.EndExecuteReader(result);
The entire idea of the asynchronous model is that the calling thread (the one doing this loop) should spin up ALL of the asynchronous tasks using the Begin method before starting to work with the results via the End method. If you are using Thread.Sleep() within your main calling thread to wait for an asynchronous operation to complete (as you are here), you're doing it wrong: each command, one at a time, is being called and then waited for before the next one starts.
Instead, try something like this:
public void BeginExecutingCommands(Report someReport)
{
    foreach (record someRecord in someReport.r)
    {
        var command = new SqlCommand("[procname]", sqlConn);
        command.CommandTimeout = 0;
        command.CommandType = CommandType.StoredProcedure;
        command.Parameters.Add(…);
        command.BeginExecuteReader(ReaderExecuted,
            new object[] { command, someReport, someRecord });
    }
}

void ReaderExecuted(IAsyncResult result)
{
    var state = (object[])result.AsyncState;
    var command = state[0] as SqlCommand;
    var someReport = state[1] as Report;
    var someRecord = state[2] as Record;
    try
    {
        using (SqlDataReader reader = command.EndExecuteReader(result))
        {
            // work with reader, command, someReport and someRecord to do what you need.
        }
    }
    catch (Exception ex)
    {
        // handle exceptions that occurred during the async operation here
    }
}
In SQL Server, on the other end of a write is a single disk. You can rarely write faster in parallel; in fact, writing in parallel often slows things down due to index fragmentation. If you can, sort the data by the primary (clustered) key prior to loading. For a big load, you can even disable the other indexes, load the data, then rebuild the indexes.
Not really sure what you are doing in the async code, but for sure it was not doing what you expected, as it was waiting on itself.
try
{
    using (var command = new SqlCommand("[procname]", sqlConn))
    {
        command.CommandTimeout = 0;
        command.CommandType = CommandType.StoredProcedure;
        foreach (record someRecord in someReport.r)
        {
            command.Parameters.Clear();
            command.Parameters.Add(…);
            using (var rdr = command.ExecuteReader())
            {
                while (rdr.Read())
                {
                    …
                }
            }
        }
    }
}
catch (…)
{
    …
}
As we were talking about in the comments, storing this data in memory and working with it there may be a more efficient approach.
So one easy way to do that is to start with Entity Framework. Entity Framework will automatically generate the classes for you based on your database schema. Then you can import a stored procedure which holds your SELECT statement. The reason I suggest importing a stored proc into EF is that this approach is generally more efficient than doing your queries in LINQ against EF.
Then run the stored proc and store the data in a List like this...
var data = db.MyStoredProc().ToList();
Then you can do anything you want with that data. Or, as I mentioned, if you're doing a lot of lookups on primary keys, use ToDictionary(), something like this...
var data = db.MyStoredProc().ToDictionary(k => k.MyPrimaryKey);
Either way, you'll be working with your data in memory at this point.
It seems that executing your SQL command puts a lock on some required resources, and that's what forced you to use async methods (my guess).
If the database is not in use, try getting exclusive access to it. Even then there may be some internal transactions due to data-model complexity; consider consulting the database designer.

Selecting a million records from SQL Server

We need to index (in ASP.NET) all the records stored in a SQL Server table. That table has around 2M records, with text (nvarchar) data in each row as well.
Is it okay to fetch all records in one go, as we need to index them (for search)? What is the other option (I want to avoid pagination)?
Note: I am not displaying these records; I just need all of them in one go so that I can index them via a background thread.
Do I need to set any long timeouts for my query? If yes, what is the most effective method for setting longer timeouts if I am running the query from an ASP.NET page?
If I needed something like this, just thinking about it from the database side, I'd probably export it to a file. Then that file can get moved around pretty easily. Moving around data sets that large is a huge pain to all involved. You can use SSIS, sqlcmd or even bcp in a batch command to get it done.
Then you just have to worry about what you're doing with it on the app side, with no worries about locking and everything else on the database side once you've exported it.
I don't think a page is a good place for this regardless; there should be a different process or program that does this. On a related note, maybe something like http://incubator.apache.org/lucene.net/ would help you?
Is it okay to fetch all records in one go as we need to index them (for search)? What is the other option (I want to avoid pagination)?
Memory Management Issue / Performance Issue
You can face a System.OutOfMemoryException if you bring back 2 million records, as you would be keeping all of those records in a DataSet, and the DataSet's memory lives in RAM. A streaming alternative is sketched below.
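If the indexing can consume one row at a time, a sketch of streaming with a DataReader instead of buffering everything; the table, columns, connectionString and IndexDocument here are hypothetical:
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("SELECT Id, Body FROM Docs", conn))
{
    cmd.CommandTimeout = 0; // no time limit for the long scan
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // only the current row is held in memory at a time
            IndexDocument(reader.GetInt32(0), reader.GetString(1));
        }
    }
}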
Do I need to set any long timeouts for my query? If yes, what is the most effective method for setting longer timeouts if I am running the query from an ASP.NET page?
using (System.Data.SqlClient.SqlCommand cmd = new System.Data.SqlClient.SqlCommand())
{
    cmd.CommandTimeout = 0;
}
Suggestion
It's better to filter the records at the database level... or fetch all records from the database once, save them to a file, and access that file for any intermediate operations.
What you describe is Extract, Transform, Load (ETL). There are two options I'm aware of:
SSIS, which is part of SQL Server
Rhino.ETL
I prefer Rhino.ETL as it's completely written in C#: you can create scripts in Boo, and it's much easier to test and compose ETL processes. The library is also built to handle large sets of data, so memory management is built in.
One final note: while ASP.NET might be the entry point to start the indexing process, I wouldn't run the process within ASP.NET, as it could take minutes or hours depending on the number of records and the processing.
Instead, have ASP.NET be the entry point that fires off a background task to process the records. Ideally the task would be completely independent of ASP.NET, so you avoid any timeout or shutdown issues.
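As a rough illustration only (a real deployment would use a separate service or process; StartIndexing_Click and IndexAllRecords are hypothetical names):
protected void StartIndexing_Click(object sender, EventArgs e)
{
    // Fire off the long-running work and return immediately. Note a worker
    // thread inside ASP.NET can still die on an app-pool recycle, which is
    // why a fully separate process is preferable.
    var worker = new Thread(IndexAllRecords) { IsBackground = false };
    worker.Start();
}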
Process your records in batches. You are going to have two main issues: (1) you need to index all of the existing records, and (2) you will want to update the index with records that were added, updated or deleted. It might sound easier to just drop the index and recreate it, but that should be avoided if possible. Below is an example of processing the [Production].[TransactionHistory] table from the AdventureWorks2008R2 database in batches of 10,000 records. It does not load all of the records into memory. Output on my local computer produces Processed 113443 records in 00:00:00.2282294. Obviously, this doesn't take into consideration a remote computer and processing time for each record.
class Program
{
    private static string ConnectionString
    {
        get { return ConfigurationManager.ConnectionStrings["db"].ConnectionString; }
    }

    static void Main(string[] args)
    {
        int recordCount = 0;
        int lastId = -1;
        bool done = false;
        Stopwatch timer = Stopwatch.StartNew();
        do
        {
            done = true;
            IEnumerable<TransactionHistory> transactionDataRecords = GetTransactions(lastId, 10000);
            foreach (TransactionHistory transactionHistory in transactionDataRecords)
            {
                lastId = transactionHistory.TransactionId;
                done = false;
                recordCount++;
            }
        } while (!done);
        timer.Stop();
        Console.WriteLine("Processed {0} records in {1}", recordCount, timer.Elapsed);
    }

    /// Get a new open connection
    private static SqlConnection GetOpenConnection()
    {
        SqlConnection connection = new SqlConnection(ConnectionString);
        connection.Open();
        return connection;
    }

    private static IEnumerable<TransactionHistory> GetTransactions(int lastTransactionId, int count)
    {
        const string sql = "SELECT TOP(@count) [TransactionID],[TransactionDate],[TransactionType] FROM [Production].[TransactionHistory] WHERE [TransactionID] > @LastTransactionId ORDER BY [TransactionID]";
        return GetData<TransactionHistory>((connection) =>
        {
            SqlCommand command = new SqlCommand(sql, connection);
            command.Parameters.AddWithValue("@count", count);
            command.Parameters.AddWithValue("@LastTransactionId", lastTransactionId);
            return command;
        }, DataRecordToTransactionHistory);
    }

    // function to convert a data record to the TransactionHistory object
    private static TransactionHistory DataRecordToTransactionHistory(IDataRecord record)
    {
        TransactionHistory transactionHistory = new TransactionHistory();
        transactionHistory.TransactionId = record.GetInt32(0);
        transactionHistory.TransactionDate = record.GetDateTime(1);
        transactionHistory.TransactionType = record.GetString(2);
        return transactionHistory;
    }

    private static IEnumerable<T> GetData<T>(Func<SqlConnection, SqlCommand> commandBuilder, Func<IDataRecord, T> dataFunc)
    {
        using (SqlConnection connection = GetOpenConnection())
        {
            using (SqlCommand command = commandBuilder(connection))
            {
                using (IDataReader reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        T record = dataFunc(reader);
                        yield return record;
                    }
                }
            }
        }
    }
}

public class TransactionHistory
{
    public int TransactionId { get; set; }
    public DateTime TransactionDate { get; set; }
    public string TransactionType { get; set; }
}

Need to know if my threading lock does what it is supposed to in .Net?

I have an application that, before it creates a thread, calls the database to pull X records. When the records are retrieved from the database, a locked flag is set so those records are not pulled again.
Once a thread has completed, it pulls some more records from the database. When I call the database from a thread, should I set a lock on that section of code so it is called by only one thread at a time? Here is an example of my code (I commented the area where I have the lock):
private void CreateThreads()
{
    for (var i = 1; i <= _threadCount; i++)
    {
        var adapter = new Dystopia.DataAdapter();
        var records = adapter.FindAllWithLocking(_recordsPerThread, _validationId, _validationDateTime);
        if (records != null && records.Count > 0)
        {
            var paramss = new ArrayList { i, records };
            ThreadPool.QueueUserWorkItem(ThreadWorker, paramss);
        }
        this.Update();
    }
}

private void ThreadWorker(object paramList)
{
    try
    {
        var parms = (ArrayList)paramList;
        var stopThread = false;
        var threadCount = (int)parms[0];
        var records = (List<Candidates>)parms[1];
        var runOnce = false;
        var adapter = new Dystopia.DataAdapter();
        var lastCount = records.Count;
        var runningCount = 0;
        while (_stopThreads == false)
        {
            if (records.Count > 0)
            {
                foreach (var record in records)
                {
                    var rec = record; // local copy so it can be passed by ref
                    var proc = new ProcRecords();
                    proc.Validate(ref rec);
                    adapter.Update(rec);
                    if (_stopThreads)
                    {
                        break;
                    }
                }
                // This is where I think I may need to sync the threads.
                // Is this correct?
                lock (this)
                {
                    records = adapter.FindAllWithLocking(_recordsPerThread, _validationId, _validationDateTime);
                }
            }
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}
SQL to Pull records:
WITH cte AS (
    SELECT TOP (@topCount) *
    FROM Candidates WITH (READPAST)
    WHERE
        isLocked = 0 and
        isTested = 0 and
        validated = 0
)
UPDATE cte
SET
    isLocked = 1,
    validationID = @validationId,
    validationDateTime = @validationDateTime
OUTPUT INSERTED.*;
You shouldn't need to lock your threads, as the database should be doing this for you on the request.
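For illustration, a hypothetical shape of FindAllWithLocking built on the SQL above: the UPDATE ... OUTPUT claims and returns rows in one atomic statement, so two concurrent callers cannot receive the same rows. Parameter and column types here are guesses:
public List<Candidates> FindAllWithLocking(int topCount, int validationId, DateTime validationDateTime)
{
    var results = new List<Candidates>();
    using (var conn = new SqlConnection(_connectionString))
    using (var cmd = new SqlCommand(LockAndTakeSql, conn)) // the CTE shown in the question
    {
        cmd.Parameters.AddWithValue("@topCount", topCount);
        cmd.Parameters.AddWithValue("@validationId", validationId);
        cmd.Parameters.AddWithValue("@validationDateTime", validationDateTime);
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                var c = new Candidates();
                // map the OUTPUT columns onto the Candidates object here
                results.Add(c);
            }
        }
    }
    return results;
}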
I see a few issues.
First, you are testing _stopThreads == false, but you have not revealed whether this is a volatile read. Read the second half of this answer for a good description of what I am talking about.
Second, the lock is pointless because adapter is a local reference to a non-shared object and records is a local reference which is just being replaced. I am assuming that the adapter makes a separate connection to the database, but if it shares an existing connection then some type of synchronization may need to take place, since ADO.NET connection objects are not typically thread-safe.
Now, you probably will need locking somewhere to publish the results from the work item. I do not see where the results are being published to the main thread so I cannot offer any guidance here.
By the way, I would avoid showing a message box from a ThreadPool thread, since it will hang that thread until the message box closes.
You shouldn't lock(this), since it's really easy to create deadlocks that way; you should create a separate lock object. If you search for "lock(this)" you can find numerous articles on why.
Here's an SO question on lock(this)
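For example, a minimal sketch of the separate lock object applied to the code above:
private static readonly object _recordsLock = new object();

// ...inside ThreadWorker, instead of lock(this):
lock (_recordsLock)
{
    records = adapter.FindAllWithLocking(_recordsPerThread, _validationId, _validationDateTime);
}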
