C# thread-safe class and list

The current need is to pull many records from a SQL database and then submit those records in much smaller blocks to an API call. The volume of data pulled from SQL varies with how much data another process has loaded, and will sometimes be small enough to handle with only one worker. When the data pull is large (max 5k rows), more threads are needed to query the API at acceptable speed. The process works great, but running one block at a time is slow at large volumes. However, I'm finding the list I pass to the class changes as multiple threads are launched. How can I achieve thread safety?
I've read about this and tried at length (locks and ConcurrentBag, for example), but I'm not quite sure where to apply them or how to use them to achieve what I'm looking for.
Here is what I have:
class Start
{
    public void Execute()
    {
        SQLSupport sqlSupport = new SQLSupport();
        List<SQLDataList> sqlDataList = new List<SQLDataList>();
        List<SQLDataList> sqlAPIList = new List<SQLDataList>();
        sqlDataList = sqlSupport.sqlQueryReturnList<SQLDataList>("SELECT * FROM TABLE");
        if (sqlDataList.Count > 200)
        {
            int iRow = 0;
            int iRowSQLCount = 0;
            foreach (var item in sqlDataList)
            {
                if (iRowSQLCount == 100)
                {
                    APIProcess apiProcess = new APIProcess(sqlAPIList);
                    Thread thr = new Thread(new ThreadStart(apiProcess.Execute));
                    thr.Start();
                    sqlAPIList.Clear();
                    iRowSQLCount = 0;
                }
                sqlAPIList.Add(sqlDataList[iRow]);
                iRowSQLCount++;
                iRow++;
            }
        }
    }
}
class APIProcess
{
    List<SQLDataList> sqlAPIList = new List<SQLDataList>();
    public APIProcess(List<SQLDataList> sqlList)
    {
        sqlAPIList = sqlList;
    }
    public void Execute()
    {
        foreach (var item in sqlAPIList)
        {
            // Loop through the list, interact with the API, update the list,
            // and ultimately update SQL with the API data.
        }
    }
}
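The race here: every APIProcess shares the one sqlAPIList reference, and Clear() empties it while the worker threads are still iterating. A minimal sketch of one fix, reusing the question's types: give each thread its own list and never clear a shared one.

if (sqlDataList.Count > 200)
{
    var chunk = new List<SQLDataList>();
    foreach (var item in sqlDataList)
    {
        chunk.Add(item);
        if (chunk.Count == 100)
        {
            var apiProcess = new APIProcess(chunk);   // this thread owns 'chunk' exclusively
            new Thread(apiProcess.Execute).Start();
            chunk = new List<SQLDataList>();          // allocate a fresh list instead of Clear()
        }
    }
    if (chunk.Count > 0)                              // don't drop the final partial block
        new Thread(new APIProcess(chunk).Execute).Start();
}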

Related

Should I use a ConcurrentQueue this way or individual threads

I'm doing what amounts to a glorified mail merge and then file conversion to PDF... Based on .NET 4.5, I see a couple of ways I can do the threading. The one using a thread-safe queue seems interesting (Plan A), but I can see a potential problem. What do you think? I'll try to keep it short, but include what is needed.
This works on the assumption that the database processing will take far more time than the PDF conversion.
In both cases, the database processing for each file is done in its own thread/task, but the PDF conversion could be done in many single threads/tasks (Plan B) or in a single long-running thread (Plan A). It is that PDF conversion I am wondering about. It is all in a try/catch statement, but that thread must not fail or everything fails (Plan A). Do you think that is a good idea? Any suggestions would be appreciated.
/* A class to process a file: */
public class c_FileToConvert
{
    public string InFileName { get; set; }
    public int FileProcessingState { get; set; }
    public string ErrorMessage { get; set; }
    public List<string> listData = null;

    public c_FileToConvert(string inFileName)
    {
        InFileName = inFileName;
        FileProcessingState = 0;
        ErrorMessage = ""; // yah, yah, yah - String.Empty
        listData = new List<string>();
    }

    public void doDbProcessing()
    {
        // Get the data from the database and put strings in this.listData
        DAL.getDataForFile(this.InFileName, this.ErrorMessage); // static function
        if (this.ErrorMessage != "")
            this.FileProcessingState = -1; // fatal error
        else // Open file and append strings to it
        {
            foreach (string s in this.listData)
            {
                // ...
            }
            FileProcessingState = 1; // enum DB_WORK_COMPLETE ...
        }
    }

    public void doPDFProcessing()
    {
        PDFConverter cPDFConverter = new PDFConverter();
        cPDFConverter.convertToPDF(InFileName, InFileName + ".PDF");
        FileProcessingState = 2; // enum PDF_WORK_COMPLETE ...
    }
}
/*** These only for Plan A ***/
public ConcurrentQueue<c_FileToConvert> ConncurrentQueueFiles = new ConcurrentQueue<c_FileToConvert>();
public bool bProcessPDFs;

public void doProcessing() // This is the main thread of the Windows Service
{
    List<c_FileToConvert> listcFileToConvert = new List<c_FileToConvert>();

    /*** Only for Plan A ***/
    bProcessPDFs = true;
    Task task1 = new Task(new Action(startProcessingPDFs)); // Start it and forget it
    task1.Start();

    while (1 == 1)
    {
        List<string> listFileNamesToProcess = new List<string>();
        DAL.getFileNamesToProcessFromDb(listFileNamesToProcess);
        foreach (string s in listFileNamesToProcess)
        {
            c_FileToConvert cFileToConvert = new c_FileToConvert(s);
            listcFileToConvert.Add(cFileToConvert);
        }

        foreach (c_FileToConvert c in listcFileToConvert)
            if (c.FileProcessingState == 0)
                new Thread(new ThreadStart(c.doDbProcessing)).Start();

        /** This is Plan A - throw it on the single long-running PDF processing thread **/
        foreach (c_FileToConvert c in listcFileToConvert)
            if (c.FileProcessingState == 1)
                ConncurrentQueueFiles.Enqueue(c);

        /*** This is Plan B - traditional thread for each file conversion ***/
        foreach (c_FileToConvert c in listcFileToConvert)
            if (c.FileProcessingState == 1)
                new Thread(new ThreadStart(c.doPDFProcessing)).Start();

        // Iterate backwards so RemoveAt() doesn't skip elements
        for (int iCount = listcFileToConvert.Count - 1; iCount >= 0; iCount--)
        {
            c_FileToConvert c = listcFileToConvert[iCount];
            if ((c.FileProcessingState == -1) || (c.FileProcessingState == 2))
            {
                DAL.updateProcessingState(c.FileProcessingState);
                listcFileToConvert.RemoveAt(iCount);
            }
        }
        Thread.Sleep(1000);
    }
}
public void startProcessingPDFs() /*** Only for Plan A ***/
{
    while (bProcessPDFs == true) // NOTE: busy-waits when the queue is empty
    {
        if (ConncurrentQueueFiles.IsEmpty == false)
        {
            c_FileToConvert cFileToConvert = null; // declared here so the catch can see it
            try
            {
                if (ConncurrentQueueFiles.TryDequeue(out cFileToConvert) == true)
                    cFileToConvert.doPDFProcessing();
            }
            catch (Exception e)
            {
                cFileToConvert.FileProcessingState = -1;
                cFileToConvert.ErrorMessage = e.Message;
            }
        }
    }
}
Plan A seems like a nice solution, but what if the Task fails somehow? Yes, the PDF conversion can be done with individual threads, but I want to reserve them for the database processing.
This was written in a text editor as the simplest code I could manage, so there may be mistakes, but I think I got the idea across.
How many files are you working with? 10? 100,000? If the number is very large, using 1 thread to run the DB queries for each file is not a good idea.
Threads are a very low-level control-flow construct, and I advise you to avoid a lot of messy and detailed thread spawning, joining, synchronizing, etc. in your application code. Keep it stupidly simple if you can.
How about this: put the data you need for each file in a thread-safe queue. Create another thread-safe queue for results. Spawn some number of threads which repeatedly pull items from the input queue, run the queries, convert to PDF, then push the output into the output queue. The threads should share absolutely nothing but the input and output queues.
You can pick any number of worker threads which you like, or experiment to see what a good number is. Don't create 1 thread for each file -- just pick a number which allows for good CPU and disk utilization.
OR, if your language/libraries have a parallel map operator, use that. It will save you a lot of messing around.
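A minimal sketch of that two-queue pattern, assuming the question's .NET 4.5 (BlockingCollection, Task). FileWork, FileResult, Process and GetFileNames are hypothetical stand-ins for the database and PDF work described above:

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

// The input and output queues are the only things the workers share
var input = new BlockingCollection<FileWork>(boundedCapacity: 100);
var output = new BlockingCollection<FileResult>();

// Fixed-size pool of workers; pick/tune the count for CPU and disk utilization
var workers = Enumerable.Range(0, 4)
    .Select(_ => Task.Run(() =>
    {
        foreach (var work in input.GetConsumingEnumerable()) // blocks until work arrives
            output.Add(Process(work));                       // DB queries + PDF conversion
    }))
    .ToArray();

// Producer: enqueue the work, then signal that no more is coming
foreach (var name in GetFileNames())
    input.Add(new FileWork(name));
input.CompleteAdding();   // lets GetConsumingEnumerable() complete
Task.WaitAll(workers);
output.CompleteAdding();  // consumers of 'output' can now drain and stop

Because the workers share nothing but the two queues, there is no locking in the application code at all; the bounded capacity on the input queue also stops the producer from racing ahead of the database work.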

Issue: Multithreaded bulk insert

I tried to do a bulk insert with many threads. Using LINQ, each thread reads 1000 rows of one DataTable (called 'dt') into a new DataTable and bulk-inserts it into the database.
Here is the code that initializes the threads:
ManualResetEvent[] doneEvents = new ManualResetEvent[10];
BancoDAO[] fibArray = new BancoDAO[10];
for (int i = 0; i < 10; i++)
{
    doneEvents[i] = new ManualResetEvent(false);
    BancoDAO bd = new BancoDAO()
    {
        _doneEvent = doneEvents[i],
        dataTable = (dt.AsEnumerable()
                       .Skip(i * 1000)
                       .Take(1000)
                    ).CopyToDataTable<DataRow>()
    };
    fibArray[i] = bd;
    ThreadPool.QueueUserWorkItem(bd.ThreadPoolCallback, i);
}
WaitHandle.WaitAll(doneEvents);
As you can see, the insert occurs in the BancoDAO class. Here is the code:
public DataTable dataTable = new DataTable();
public ManualResetEvent _doneEvent;

public void ThreadPoolCallback(Object threadContext)
{
    int threadIndex = (int)threadContext;
    GravaTabelaThread(dataTable);
    _doneEvent.Set();
}

public static void GravaTabelaThread(DataTable dt)
{
    OracleConnection cteste = new OracleConnection(ConfigurationManager.ConnectionStrings["TesteUpload"].ToString());
    cteste.Open();
    OracleBulkCopy bcp = new OracleBulkCopy(cteste);
    bcp.DestinationTableName = "MAG_T_SORTIMENTO2";
    foreach (KeyValuePair<string, string> k in ColumnMappings())
    {
        bcp.ColumnMappings.Add(k.Key, k.Value);
    }
    try
    {
        bcp.WriteToServer(dt);
        bcp.Dispose();
    }
    catch (Exception ex)
    {
        // NOTE: exceptions are silently swallowed here
    }
    cteste.Close();
    // Now I open and close the connection every time after doing the bulk insert into the database.
    // I'm no longer reusing the same connection.
}
The problem is: some threads insert the values into the database, but sometimes this exception is thrown (I'll translate the Oracle message from Portuguese into English, so please keep in mind this is a free translation):
{Oracle.DataAccess.Client.OracleException Error in row '1' column '1'
ORA-39776: fatal Direct Path API error when loading the table USR_TRANSF.MAG_T_SORTIMENTO2
ORA-39781: Direct path stream loads are not allowed after another context loading the same table has been terminated
   at Oracle.DataAccess.Client.OracleBulkCopy.PerformBulkCopy()
   at Oracle.DataAccess.Client.OracleBulkCopy.WriteDataSourceToServer()
   at Oracle.DataAccess.Client.OracleBulkCopy.WriteToServer(DataTable table, DataRowState rowState)
   at Oracle.DataAccess.Client.OracleBulkCopy.WriteToServer(DataTable table)
   at UploadArquivo.BancoDAO.GravaTabelaThread(DataTable dt) at c:\Users\Rafael.pinho\Desktop\UploadArquivo\UploadArquivo\UploadArquivo\BancoDAO.cs:line 49}
It appears from the ODP.NET Developer's Guide that the OracleBulkCopy class does a direct-path load. If that's the case, then it is not really compatible with a multi-threaded application: only one session can be doing a direct-path load on a particular object at any point in time. You could serialize your threads so that only one thread has an open transaction at any point in time, but it seems highly likely that this would defeat the purpose of multithreading on the client. On the other hand, since a direct-path insert is the most efficient way to load data, you should basically be able to load data as quickly as you can pump it over the network (assuming, of course, that your database can process data that quickly, which I would expect it could).
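If the thread structure has to stay, a minimal sketch of that serialization is to funnel every WriteToServer call through a single gate. _bulkCopyGate is an invented field, and the column mappings are elided; everything else follows the question's code:

private static readonly object _bulkCopyGate = new object();

public static void GravaTabelaThread(DataTable dt)
{
    using (var cteste = new OracleConnection(ConfigurationManager.ConnectionStrings["TesteUpload"].ToString()))
    {
        cteste.Open();
        using (var bcp = new OracleBulkCopy(cteste))
        {
            bcp.DestinationTableName = "MAG_T_SORTIMENTO2";
            // ... add ColumnMappings as in the original ...
            lock (_bulkCopyGate) // only one direct-path load at a time, avoiding ORA-39781
            {
                bcp.WriteToServer(dt);
            }
        }
    }
}

As the answer notes, this gives up most of the client-side parallelism, so it is only worth doing if the per-thread LINQ slicing work is itself expensive.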

C# OutOfMemory, Memory Mapped File or Temp Database

Seeking some advice, best practice etc...
Technology: C# .NET4.0, Winforms, 32 bit
I am seeking some advice on how best to tackle large data processing in my C# Winforms application, which experiences high memory usage (working set) and the occasional OutOfMemory exception.
The problem is that we perform a large amount of data processing "in-memory" when a "shopping-basket" is opened. In simplistic terms, when a "shopping-basket" is loaded we perform the following calculations:
For each item in the "shopping-basket", retrieve its historical price going all the way back to the date the item first appeared in stock (could be two months, two years or two decades of data). Historical price data is retrieved from text files, over the internet, or any format supported by a price plugin.
For each item, for each day since it first appeared in stock, calculate various metrics which build a historical profile for each item in the shopping-basket.
The result is that we can potentially perform hundreds, thousands or millions of calculations depending upon the number of items in the "shopping-basket". If the basket contains too many items, we run the risk of hitting an OutOfMemory exception.
A couple of caveats:
This data needs to be calculated for each item in the "shopping-basket" and the data is kept until the "shopping-basket" is closed.
Even though we perform steps 1 and 2 in a background thread, speed is important, as the number of items in the "shopping-basket" can greatly affect overall calculation speed.
Memory is reclaimed by the .NET garbage collector when a "shopping-basket" is closed. We have profiled our application and ensured that all references are correctly disposed and closed when a basket is closed.
After all the calculations are completed, the resultant data is stored in an IDictionary, where CalculatedData is a class whose properties are the individual metrics calculated by the above process.
Some ideas I've thought about;
Obviously my main concern is to reduce the amount of memory used by the calculations; however, the volume of memory used can only be reduced if I
1) reduce the number of metrics being calculated for each day, or
2) reduce the number of days used for the calculation.
Neither option is viable if we wish to fulfill our business requirements.
Memory Mapped Files
One idea has been to use memory-mapped files to store the data dictionary. Would this be possible/feasible, and how can we put it into place?
Use a temporary database
The idea is to use a separate (not in-memory) database which can be created for the life-cycle of the application. As "shopping-baskets" are opened we can persist the calculated data to the database for repeated use, alleviating the requirement to recalculate for the same "shopping-basket".
Are there any other alternatives that we should consider? What is best practice when it comes to calculations on large data and performing them outside of RAM?
Any advice is appreciated....
The easiest solution is a database, perhaps SQLite. Memory-mapped files don't automatically become dictionaries; you would have to code all the memory management yourself, and thereby fight the .NET GC itself for ownership of the data.
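To illustrate how low-level that is, here is a minimal sketch of the raw MemoryMappedFile API (System.IO.MemoryMappedFiles, available since .NET 4.0); the map name and capacity are arbitrary:

using System.IO.MemoryMappedFiles;

using (var mmf = MemoryMappedFile.CreateNew("basketCache", capacity: 1L << 30)) // 1 GB backing store
using (var accessor = mmf.CreateViewAccessor())
{
    accessor.Write(0, 42.0);                // write a double at byte offset 0
    double metric = accessor.ReadDouble(0); // reading it back is equally manual
}

Layout, offsets, growth and eviction are all your problem; nothing here resembles a dictionary keyed by item.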
If you're interested in trying the memory-mapped file approach, you can try it now. I wrote a small native .NET package called MemMapCache that in essence creates a key/value database backed by memory-mapped files. It's a bit of a hacky concept, but the MemMapCache.exe program keeps all the references to the memory-mapped files, so that if your application crashes you don't have to worry about losing the state of your cache.
It's very simple to use and you should be able to drop it into your code without too many modifications. Here is an example using it: https://github.com/jprichardson/MemMapCache/blob/master/TestMemMapCache/MemMapCacheTest.cs
Maybe it will at least help you figure out what you need to do for an actual solution.
Please let me know if you do end up using it. I'd be interested in your results.
However, long-term, I'd recommend Redis.
As an update for those stumbling upon this thread...
We ended up using SQLite as our caching solution. The SQLite database we employ exists separately from the main data store used by the application. We persist calculated data to the SQLite disk cache as it's required and have code controlling cache invalidation etc. This was a suitable solution for us, as we were able to achieve write speeds of around 100,000 records per second.
For those interested, this is the code that controls inserts into the disk cache. Full credit for this code goes to JP Richardson (shown answering a question here) and his excellent blog post.
internal class SQLiteBulkInsert
{
    #region Class Declarations
    private SQLiteCommand m_cmd;
    private SQLiteTransaction m_trans;
    private readonly SQLiteConnection m_dbCon;
    private readonly Dictionary<string, SQLiteParameter> m_parameters = new Dictionary<string, SQLiteParameter>();
    private uint m_counter;
    private readonly string m_beginInsertText;
    #endregion

    #region Constructor
    public SQLiteBulkInsert(SQLiteConnection dbConnection, string tableName)
    {
        m_dbCon = dbConnection;
        m_tableName = tableName;

        var query = new StringBuilder(255);
        query.Append("INSERT INTO ["); query.Append(tableName); query.Append("] (");
        m_beginInsertText = query.ToString();
    }
    #endregion

    #region Allow Bulk Insert
    private bool m_allowBulkInsert = true;
    public bool AllowBulkInsert { get { return m_allowBulkInsert; } set { m_allowBulkInsert = value; } }
    #endregion

    #region CommandText
    public string CommandText
    {
        get
        {
            if (m_parameters.Count < 1) throw new SQLiteException("You must add at least one parameter.");

            var sb = new StringBuilder(255);
            sb.Append(m_beginInsertText);

            foreach (var param in m_parameters.Keys)
            {
                sb.Append('[');
                sb.Append(param);
                sb.Append(']');
                sb.Append(", ");
            }
            sb.Remove(sb.Length - 2, 2);

            sb.Append(") VALUES (");

            foreach (var param in m_parameters.Keys)
            {
                sb.Append(m_paramDelim);
                sb.Append(param);
                sb.Append(", ");
            }
            sb.Remove(sb.Length - 2, 2);
            sb.Append(")");

            return sb.ToString();
        }
    }
    #endregion

    #region Commit Max
    private uint m_commitMax = 25000;
    public uint CommitMax { get { return m_commitMax; } set { m_commitMax = value; } }
    #endregion

    #region Table Name
    private readonly string m_tableName;
    public string TableName { get { return m_tableName; } }
    #endregion

    #region Parameter Delimiter
    private const string m_paramDelim = ":";
    public string ParamDelimiter { get { return m_paramDelim; } }
    #endregion

    #region AddParameter
    public void AddParameter(string name, DbType dbType)
    {
        var param = new SQLiteParameter(m_paramDelim + name, dbType);
        m_parameters.Add(name, param);
    }
    #endregion

    #region Flush
    public void Flush()
    {
        try
        {
            if (m_trans != null) m_trans.Commit();
        }
        catch (Exception ex)
        {
            throw new Exception("Could not commit transaction. See InnerException for more details", ex);
        }
        finally
        {
            if (m_trans != null) m_trans.Dispose();
            m_trans = null;
            m_counter = 0;
        }
    }
    #endregion

    #region Insert
    public void Insert(object[] paramValues)
    {
        if (paramValues.Length != m_parameters.Count)
            throw new Exception("The values array count must be equal to the count of the number of parameters.");

        m_counter++;

        if (m_counter == 1)
        {
            if (m_allowBulkInsert) m_trans = m_dbCon.BeginTransaction();

            m_cmd = m_dbCon.CreateCommand();
            foreach (var par in m_parameters.Values)
                m_cmd.Parameters.Add(par);

            m_cmd.CommandText = CommandText;
        }

        var i = 0;
        foreach (var par in m_parameters.Values)
        {
            par.Value = paramValues[i];
            i++;
        }

        m_cmd.ExecuteNonQuery();

        if (m_counter != m_commitMax)
        {
            // Do nothing; keep accumulating inside the open transaction
        }
        else
        {
            try
            {
                if (m_trans != null) m_trans.Commit();
            }
            catch (Exception)
            {
                // Commit errors are swallowed here; Flush() surfaces them instead
            }
            finally
            {
                if (m_trans != null)
                {
                    m_trans.Dispose();
                    m_trans = null;
                }
                m_counter = 0;
            }
        }
    }
    #endregion
}
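For context, hypothetical usage of the class above might look like this; the table name, columns and the calculatedRows source are invented for illustration:

var bulk = new SQLiteBulkInsert(connection, "CalculatedData");
bulk.AddParameter("ItemId", DbType.Int32);
bulk.AddParameter("Metric", DbType.Double);

foreach (var row in calculatedRows)
    bulk.Insert(new object[] { row.ItemId, row.Metric }); // auto-commits every CommitMax rows

bulk.Flush(); // commit whatever remains in the final partial batch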

Need to know if my threading lock does what it is supposed to in .Net?

I have an application that, before it creates a thread, calls the database to pull X amount of records. When the records are retrieved from the database, a locked flag is set so those records are not pulled again.
Once a thread has completed, it will pull some more records from the database. When I call the database from a thread, should I set a lock on that section of code so it is called by only one thread at a time? Here is an example of my code (I commented the area where I have the lock):
private void CreateThreads()
{
    for (var i = 1; i <= _threadCount; i++)
    {
        var adapter = new Dystopia.DataAdapter();
        var records = adapter.FindAllWithLocking(_recordsPerThread, _validationId, _validationDateTime);
        if (records != null && records.Count > 0)
        {
            var paramss = new ArrayList { i, records };
            ThreadPool.QueueUserWorkItem(ThreadWorker, paramss);
        }
        this.Update();
    }
}

private void ThreadWorker(object paramList)
{
    try
    {
        var parms = (ArrayList)paramList;
        var stopThread = false;
        var threadCount = (int)parms[0];
        var records = (List<Candidates>)parms[1];
        var runOnce = false;
        var adapter = new Dystopia.DataAdapter();
        var lastCount = records.Count;
        var runningCount = 0;

        while (_stopThreads == false)
        {
            if (records.Count > 0)
            {
                foreach (var record in records)
                {
                    var rec = record; // local copy; a foreach variable can't be passed by ref
                    var proc = new ProcRecords();
                    proc.Validate(ref rec);
                    adapter.Update(rec);
                    if (_stopThreads)
                    {
                        break;
                    }
                }
                //This is where I think I may need to sync the threads.
                //Is this correct?
                lock (this)
                {
                    records = adapter.FindAllWithLocking(_recordsPerThread, _validationId, _validationDateTime);
                }
            }
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}
SQL to Pull records:
WITH cte AS (
    SELECT TOP (@topCount) *
    FROM Candidates WITH (READPAST)
    WHERE
        isLocked = 0 and
        isTested = 0 and
        validated = 0
)
UPDATE cte
SET
    isLocked = 1,
    validationID = @validationId,
    validationDateTime = @validationDateTime
OUTPUT INSERTED.*;
You shouldn't need to lock your threads as the database should be doing this on the request for you.
I see a few issues.
First, you are testing _stopThreads == false, but you have not revealed whether this is a volatile read. Read the second half of this answer for a good description of what I am talking about.
Second, the lock is pointless because adapter is a local reference to a non-shared object and records is a local reference which is just being replaced. I am assuming that the adapter makes a separate connection to the database, but if it shares an existing connection then some type of synchronization may need to take place, since ADO.NET connection objects are not typically thread-safe.
Now, you probably will need locking somewhere to publish the results from the work item. I do not see where the results are being published to the main thread so I cannot offer any guidance here.
By the way, I would avoid showing a message box from a ThreadPool thread. The reason being that this will hang that thread until the message box closes.
You shouldn't lock(this), since it's really easy to create deadlocks that way; you should create a separate lock object. If you search for "lock(this)" you can find numerous articles on why.
Here's an SO question on lock(this)
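A minimal sketch of the separate-lock-object idea applied to the question's code (_recordsLock is an added, hypothetical field):

private readonly object _recordsLock = new object();

// ... inside ThreadWorker, replacing lock(this) ...
lock (_recordsLock)
{
    records = adapter.FindAllWithLocking(_recordsPerThread, _validationId, _validationDateTime);
}

Because only code that can see the private field can take the lock, no outside caller can accidentally deadlock against it the way they could with lock(this).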

Database operation via ThreadPool or async programming

I have a database with a large number of records (the database is updated via a web service).
Each record/row is of the form (id, xmlfilename, operation, parameters, operationId, status), i.e.
we have to perform the operation specified by "operation" on an XML file
specified by "xmlfilename", with the parameters for the operation specified by "parameters".
The status is initially "free" and is updated as and when required.
I have to perform these operations (one per row) using threads, i.e. one thread per 'id' worth of
entries: as long as there are 'id' rows in the db, fetch and perform the "operation".
How can I do this efficiently, with maximum parallelism and/or concurrency?
Is there a better way?
Which is best suited: ThreadPool, custom threads, or async programming?
Edit: Here is the pseudocode I tried:
string operationId = null;
// How do I check this? Another query?? Keep selecting operationIds
// until there are no more operations with status "Free".
// Note: the db is updating asynchronously.
while ((operationId = DBWorker.getNextFreeOperationId()) != null)
{
    // Retrieve all rows with operationid == operationId (e.g. 800) and
    // status == "free"; 1 thread per operationId????
    // There are multiple operations with the same operationIds.
    DataRowCollection dbRows = DBWorker.retrieveQueuedEntries(operationId);
    MyWorkItem workItem = new DBWorker.MyWorkItem();
    workItem.DataRows = dbRows;
    workItem.Event = new AutoResetEvent(false);
    // MyWorkItem.DoWork will do the necessary "operation"
    ThreadPool.QueueUserWorkItem(new WaitCallback(workItem.DoWork), workItem);
}
// --------------
MyWorkItem.DoWork(object x)
{
    // For brevity:
    foreach (DataRow row in this.DataRows)
    {
        performOperation(row); // use row["operation"] ..
    }
}
// ---------------
bool performOperation(DataRow row)
{
    string operation = (string)row["operation"]; // retrieve the others similarly
    switch (operation)
    {
        case Operations.Add: // call Add operation ..
            ...
    }
}
The above code uses .NET 2.0 and doesn't achieve multithreading. The scenario is that the database is being updated asynchronously from one end, while the above code runs as a Windows service at the same time.
Thanks,
Amit
If you're not using .NET 4.0 - use the ThreadPool.
I would create a function that receives the row it has to process (even better if you convert the rows to strongly typed objects in the DAL) and does what it has to do with it - something like:
using System;
using System.Threading;

namespace SmallConsoleAppForTests
{
    class Program
    {
        private static AutoResetEvent[] events;

        static void Main(string[] args)
        {
            int dataLength = 3;

            // Creating an array of AutoResetEvents for signalling that the processing is done
            events = new AutoResetEvent[dataLength];

            // Initializing the AutoResetEvent array to "not-set" values
            for (int i = 0; i < dataLength; i++)
                events[i] = new AutoResetEvent(false);

            // Processing the data
            for (int i = 0; i < dataLength; i++)
            {
                var data = new MyWorkItem { Event = events[i], Data = new MyDataClass() };
                ThreadPool.QueueUserWorkItem(x =>
                {
                    var workItem = (MyWorkItem)x;
                    try
                    {
                        // process the data
                    }
                    catch (Exception e)
                    {
                        // exception handling
                    }
                    finally
                    {
                        workItem.Event.Set();
                    }
                }, data);
            }

            // Wait until all the threads finish
            WaitHandle.WaitAll(events);
        }
    }

    public class MyWorkItem
    {
        public AutoResetEvent Event { get; set; }
        public MyDataClass Data { get; set; }
    }

    // You can also use DataRow instead
    public class MyDataClass
    {
        // data
    }
}
If you are using .NET 4.0 - look into Tasks and the Parallel Extensions (PLINQ).
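For example, a minimal sketch of the .NET 4.0 route, letting the TPL partition the rows; the degree of parallelism is an arbitrary choice, and dbRows/performOperation follow the question's pseudocode:

using System.Data;
using System.Linq;
using System.Threading.Tasks;

Parallel.ForEach(
    dbRows.Cast<DataRow>(),                             // DataRowCollection -> IEnumerable<DataRow>
    new ParallelOptions { MaxDegreeOfParallelism = 8 }, // tune for your DB and server
    row => performOperation(row));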
