I have inherited a WCF web service application that needs much better error tracking. What we do is query data from one system (AcuODBC) and send that data to another system (Salesforce). The query returns tens of thousands of complex objects as a List<T>. We then process this List<T> in batches of 200 records at a time, mapping the fields to another object type, and send each batch to Salesforce. After a batch completes, the next batch starts. Here's a brief example:
int intStart = 0, intEnd = 200;
//done in a loop, snipped for brevity
var leases = from i in trleases.GetAllLeases(branch).Skip(intStart).Take(intEnd)
select new sforceObject.SFDC_Lease() {
LeaseNumber = i.LeaseNumber.ToString(),
AccountNumber = i.LeaseCustomer,
Branch = i.Branch
(...)//about 150 properties
//do stuff with list and increment to next batch
intStart += 200;
However, the problem is that if one object has a bad field mapping (InvalidCastException), I would like to write the object that failed out to a log.
Question
Is there any way I can decipher which object of the 200 threw the exception? I could forgo the batch concept that was given to me, but I'd rather avoid that if possible for performance reasons.
This should accomplish what you are looking for with very minor code changes:
int intStart = 0, intEnd = 200, count = 0;
List<sforceObject.SFDC_Lease> leases = new List<sforceObject.SFDC_Lease>();
//done in a loop, snipped for brevity
foreach (var i in trleases.GetAllLeases(branch).Skip(intStart).Take(intEnd))
{
    try
    {
        count++;
        leases.Add(new sforceObject.SFDC_Lease()
        {
            LeaseNumber = i.LeaseNumber.ToString(),
            AccountNumber = i.LeaseCustomer,
            Branch = i.Branch
            (...) //about 150 properties
        });
    }
    catch (Exception ex)
    {
        // you now have your culprit, either as 'i' or via the index 'count'
    }
}
//do stuff with 'leases' and increment to next batch
intStart += 200;
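If you also want the offending record written to a log, the catch block above can dump the source object's public properties via reflection. This is just a sketch: the log file path is an example, and you would substitute whatever logging framework you already use.

catch (Exception ex)
{
    // 'i' is the source record that failed to map; dump its public properties
    // so the bad field value shows up alongside the exception.
    var sb = new System.Text.StringBuilder();
    sb.AppendLine(string.Format("Mapping failed for record #{0}: {1}", count, ex.Message));
    foreach (var prop in i.GetType().GetProperties())
    {
        sb.AppendLine(string.Format("  {0} = {1}", prop.Name, prop.GetValue(i, null)));
    }
    System.IO.File.AppendAllText(@"C:\logs\lease-mapping-errors.log", sb.ToString());
}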
I think you could set a flag in each property setter of the SFDC_Lease class, and use a static property to record it, like this:
public class SFDC_Lease
{
    public static string LastPropertySetted;

    private string _leaseNumber;
    public string LeaseNumber
    {
        get { return _leaseNumber; }
        set
        {
            LastPropertySetted = "LeaseNumber";
            _leaseNumber = value;
        }
    }
}
Please feel free to improve this design.
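Combined with the try/catch from the first answer, the static flag narrows things down further: it records the last setter that completed before the exception, which helps locate the failure among the roughly 150 mapped properties. A small sketch (Console.WriteLine stands in for whatever logging you use):

catch (Exception ex)
{
    // LastPropertySetted holds the name of the most recent setter that ran.
    Console.WriteLine(string.Format(
        "Record #{0} failed; last property successfully set was '{1}': {2}",
        count, SFDC_Lease.LastPropertySetted, ex.Message));
}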
Related
I'm doing what amounts to a glorified mail merge and then file conversion to PDF... Based on .NET 4.5, I see a couple of ways I can do the threading. The one using a thread-safe queue seems interesting (Plan A), but I can see a potential problem. What do you think? I'll try to keep it short, but include what is needed.
This works on the assumption that it will take far more time to do the database processing than the PDF conversion.
In both cases, the database processing for each file is done in its own thread/task, but the PDF conversion can be done either in many individual threads/tasks (Plan B) or in a single long-running thread (Plan A). It is that PDF conversion I am wondering about. It is all in a try/catch block, but in Plan A that thread must not fail or everything fails. Do you think that is a good idea? Any suggestions would be appreciated.
/* A class to process a file: */
public class c_FileToConvert
{
    public string InFileName { get; set; }
    public int FileProcessingState { get; set; }
    public string ErrorMessage { get; set; }
    public List<string> listData = null;

    public c_FileToConvert(string inFileName)
    {
        InFileName = inFileName;
        FileProcessingState = 0;
        ErrorMessage = ""; // yah, yah, yah - String.Empty
        listData = new List<string>();
    }

    public void doDbProcessing()
    {
        // get the data from database and put strings in this.listData
        DAL.getDataForFile(this.InFileName, this.ErrorMessage); // static function
        if (this.ErrorMessage != "")
            this.FileProcessingState = -1; // fatal error
        else // Open file and append strings to it
        {
            foreach (string s in this.listData)
            {
                ...
            }
            FileProcessingState = 1; // enum DB_WORK_COMPLETE ...
        }
    }

    public void doPDFProcessing()
    {
        PDFConverter cPDFConverter = new PDFConverter();
        cPDFConverter.convertToPDF(InFileName, InFileName + ".PDF");
        FileProcessingState = 2; // enum PDF_WORK_COMPLETE ...
    }
}
/*** These only for Plan A ***/
public ConcurrentQueue<c_FileToConvert> ConncurrentQueueFiles = new ConcurrentQueue<c_FileToConvert>();
public bool bProcessPDFs;

public void doProcessing() // This is the main thread of the Windows Service
{
    List<c_FileToConvert> listcFileToConvert = new List<c_FileToConvert>();

    /*** Only for Plan A ***/
    bProcessPDFs = true;
    Task task1 = new Task(new Action(startProcessingPDFs)); // Start it and forget it
    task1.Start();

    while (1 == 1)
    {
        List<string> listFileNamesToProcess = new List<string>();
        DAL.getFileNamesToProcessFromDb(listFileNamesToProcess);
        foreach (string s in listFileNamesToProcess)
        {
            c_FileToConvert cFileToConvert = new c_FileToConvert(s);
            listcFileToConvert.Add(cFileToConvert);
        }

        foreach (c_FileToConvert c in listcFileToConvert)
            if (c.FileProcessingState == 0)
            {
                Thread t = new Thread(new ThreadStart(c.doDbProcessing));
                t.Start();
            }

        /** This is Plan A - throw it on single long running PDF processing thread **/
        foreach (c_FileToConvert c in listcFileToConvert)
            if (c.FileProcessingState == 1)
                ConncurrentQueueFiles.Enqueue(c);

        /*** This is Plan B - traditional thread for each file conversion ***/
        foreach (c_FileToConvert c in listcFileToConvert)
            if (c.FileProcessingState == 1)
            {
                Thread t = new Thread(new ThreadStart(c.doPDFProcessing));
                t.Start();
            }

        // iterate backwards so RemoveAt doesn't skip elements
        for (int iCount = listcFileToConvert.Count - 1; iCount >= 0; iCount--)
        {
            c_FileToConvert c = listcFileToConvert[iCount];
            if ((c.FileProcessingState == -1) || (c.FileProcessingState == 2))
            {
                DAL.updateProcessingState(c.FileProcessingState);
                listcFileToConvert.RemoveAt(iCount);
            }
        }

        Thread.Sleep(1000);
    }
}
public void startProcessingPDFs() /*** Only for Plan A ***/
{
    while (bProcessPDFs == true)
    {
        if (ConncurrentQueueFiles.IsEmpty == false)
        {
            c_FileToConvert cFileToConvert = null;
            try
            {
                if (ConncurrentQueueFiles.TryDequeue(out cFileToConvert) == true)
                    cFileToConvert.doPDFProcessing();
            }
            catch (Exception e)
            {
                if (cFileToConvert != null)
                {
                    cFileToConvert.FileProcessingState = -1;
                    cFileToConvert.ErrorMessage = e.Message;
                }
            }
        }
    }
}
Plan A seems like a nice solution, but what if the Task fails somehow? Yes, the PDF conversion can be done with individual threads, but I want to reserve them for the database processing.
This was written in a text editor as the simplest code I could manage, so there may be mistakes, but I think I got the idea across.
How many files are you working with? 10? 100,000? If the number is very large, using 1 thread to run the DB queries for each file is not a good idea.
Threads are a very low-level control flow construct, and I advise you to avoid a lot of messy and detailed thread spawning, joining, synchronizing, etc. in your application code. Keep it stupidly simple if you can.
How about this: put the data you need for each file in a thread-safe queue. Create another thread-safe queue for results. Spawn some number of threads which repeatedly pull items from the input queue, run the queries, convert to PDF, then push the output into the output queue. The threads should share absolutely nothing but the input and output queues.
You can pick any number of worker threads which you like, or experiment to see what a good number is. Don't create 1 thread for each file -- just pick a number which allows for good CPU and disk utilization.
OR, if your language/libraries have a parallel map operator, use that. It will save you a lot of messing around.
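A minimal sketch of that queue-based approach in .NET 4.5, using BlockingCollection for the input queue. LoadDataForFile and ConvertToPdf are placeholders for your real database and PDF steps, and the worker count is just an example:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ConversionPipeline
{
    // Placeholders for the real work done per file.
    static void LoadDataForFile(string fileName) { /* DB queries */ }
    static void ConvertToPdf(string fileName) { /* PDF conversion */ }

    public static void Run(IEnumerable<string> fileNames, int workerCount)
    {
        var input = new BlockingCollection<string>();
        var results = new ConcurrentQueue<string>(); // output queue

        // Fixed pool of workers; they share nothing but the two queues.
        var workers = new Task[workerCount];
        for (int w = 0; w < workerCount; w++)
        {
            workers[w] = Task.Factory.StartNew(() =>
            {
                // GetConsumingEnumerable blocks until work arrives and ends
                // once CompleteAdding() has been called and the queue drains.
                foreach (string file in input.GetConsumingEnumerable())
                {
                    LoadDataForFile(file);
                    ConvertToPdf(file);
                    results.Enqueue(file + ".PDF");
                }
            }, TaskCreationOptions.LongRunning);
        }

        foreach (string f in fileNames) input.Add(f);
        input.CompleteAdding();
        Task.WaitAll(workers);
    }
}

With something like this, a failed conversion only affects the worker handling that file (wrap the body of the foreach in a per-item try/catch), and Parallel.ForEach is the built-in "parallel map" alternative if you don't need the explicit queues.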
I have a problem I have been stuck on for two days now. Maybe one of you can help me.
I am trying to get the start time of a window passed to a user-defined aggregate (UDA). Unfortunately, I don't know how to do this. The way I thought it should work looks like this:
var tot = from row in tumblingWin
select new
{
value = row.UserDefinedAggregate<Dataclass, Total2, double>(new StartBoundsConfig
{
Winstart = row.WinStart().Ticks
}) * processinginterval,
};
And the UDA looks like this:
public class Total2: CepAggregate<Dataclass,double>
{
private Dataclass lastone; //keep it, if needed for next window
private StartBoundsConfig _conf;
public Total2(StartBoundsConfig config)
{
_conf = config;
}
public override double GenerateOutput(IEnumerable<Dataclass> events)
{
//TODO check if value on window start => if not use last from previous as starting value
bool checkfirst = true;
long result = 0;
long tsone = 0;
foreach (var evts in events)
{
if (checkfirst == true)
{
tsone = evts.Gentime.Ticks;
checkfirst = false;
}
else
{
long tstwo = evts.Gentime.Ticks;
long delta = tstwo - tsone;
long value = (long) evts.Value;
result += delta*value;
tsone = tstwo;
}
lastone = evts;
}
return result;
}
}
I tried to pass the window start to the config of the UDA and read it from there.
Does anyone have an idea why this doesn't work and how I could get the start time of the window passed to the UDA so it can be used in the calculation there?
I am very grateful for any hint.
Joe
You need a time-sensitive UDA. Inherit from CepTimeSensitiveAggregate (see http://technet.microsoft.com/en-us/library/ee842915.aspx) and your GenerateOutput method will have the WindowDescriptor as part of its signature. As a bonus, you'll also get the temporal headers for the events, so you won't need to enqueue this information as part of your payload. While there are some edge use cases that would require that, in most cases you don't need it.
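A rough adaptation of Total2 along those lines. This is a sketch from memory of the StreamInsight extensibility API, so check the exact signatures against the documentation linked above; the integration logic is kept the same as in the question:

public class Total2 : CepTimeSensitiveAggregate<Dataclass, double>
{
    public override double GenerateOutput(
        IEnumerable<IntervalEvent<Dataclass>> events,
        WindowDescriptor windowDescriptor)
    {
        // The window start you were trying to pass in via StartBoundsConfig
        // is available directly from the WindowDescriptor.
        long winStart = windowDescriptor.StartTime.Ticks;

        bool checkfirst = true;
        long result = 0;
        long tsone = winStart;
        foreach (var evt in events)
        {
            if (checkfirst)
            {
                tsone = evt.StartTime.Ticks; // temporal header instead of payload Gentime
                checkfirst = false;
            }
            else
            {
                long tstwo = evt.StartTime.Ticks;
                long delta = tstwo - tsone;
                long value = (long)evt.Payload.Value;
                result += delta * value;
                tsone = tstwo;
            }
        }
        return result;
    }
}

The StartBoundsConfig plumbing in the query then becomes unnecessary.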
I have a queue that processes objects in a while loop. They are added asynchronously somewhere, like this:
myqueue.pushback(String value);
And they are processed like this:
while(true)
{
String path = queue.pop();
if(process(path))
{
Console.WriteLine("Good!");
}
else
{
queue.pushback(path);
}
}
Now, the thing is that I'd like to modify this to support a TTL-like (time-to-live) flag, so that a file path would be added no more than n times.
How could I do this, while keeping the bool process(String path) function signature? I don't want to modify that.
I thought about holding a map, or a list, that counts how many times the process function returned false for a path, and dropping the path from the list on the n-th false. I wonder how this can be done more dynamically; preferably, I'd like the TTL to decrement itself automatically on each new addition to the queue. I hope I am not talking nonsense.
Maybe use something like this:
class JobData
{
public string path;
public short ttl;
public static implicit operator String(JobData jobData) {jobData.ttl--; return jobData.path;}
}
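If the queue is changed to hold JobData instead of raw strings, the processing loop barely changes, since the implicit conversion lets process keep its bool process(String path) signature. A sketch building on the class above:

while (true)
{
    JobData job = queue.pop();
    if (process(job))          // implicit JobData -> string conversion; ttl is decremented here
    {
        Console.WriteLine("Good!");
    }
    else if (job.ttl > 0)      // drop the job once its TTL is used up
    {
        queue.pushback(job);
    }
}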
I like the idea of a JobData class, but there's already an answer demonstrating that, and the fact that you're working with file paths gives you another possible advantage. Certain characters are not valid in file paths, so you can choose one to use as a delimiter. The advantage here is that the queue type remains a string, and so you would not have to modify any of your existing asynchronous code. You can see a list of reserved path characters here:
http://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
For our purposes, I'll use the percent (%) character. Then you can modify your code as follows, and nothing else needs to change:
const int startingTTL = 100;
const string delimiter = "%";
while(true)
{
String[] path = queue.pop().Split(delimiter.ToCharArray());
int ttl = path.Length > 1 ? int.Parse(path[1]) - 1 : startingTTL;
if(process(path[0]))
{
Console.WriteLine("Good!");
}
else if (ttl > 0)
{
queue.pushback(string.Format("{0}{1}{2}", path[0], delimiter,ttl));
}
else
{
Console.WriteLine("TTL expired for path: {0}" path[0]);
}
}
Again, from a pure architecture standpoint, a class with two properties is a better design... but from a practical standpoint, YAGNI: this option means you can avoid going back and changing other asynchronous code that pushes into the queue. That code still only needs to know about the strings, and will work with this unmodified.
One more thing. I want to point out that this is a fairly tight loop, prone to running away with a CPU core. Additionally, if this is the .NET queue type and your tight loop gets ahead of your asynchronous producers and empties the queue, you'll throw an exception, which would break out of the while(true) block. You can solve both issues with code like this:
while(true)
{
try
{
String[] path = queue.pop().Split(delimiter.ToCharArray());
int ttl = path.Length > 1 ? int.Parse(path[1]) - 1 : startingTTL;
if(process(path[0]))
{
Console.WriteLine("Good!");
}
else if (ttl > 0)
{
queue.pushback(string.Format("{0}{1}{2}", path[0], delimiter,ttl));
}
else
{
Console.WriteLine("TTL expired for path: {0}" path[0]);
}
}
catch(InvalidOperationException ex)
{
//Queue.Dequeue throws InvalidOperation if the queue is empty... sleep for a bit before trying again
Thread.Sleep(100);
}
}
If the constraint is that bool process(String path) cannot be touched/changed, then put the functionality into myqueue. You can keep its public signatures of void pushback(string path) and string pop(), but internally you can track your TTL. You can either wrap the string paths in a JobData-like class that gets added to the internal queue, or you can keep a secondary Dictionary keyed by path. It could even be something as simple as saving the last popped path: if the subsequent push is the same path, you can assume it was a rejected/failed item. Also, in your pop method you can discard a path that has been rejected too many times and internally fetch the next path, so the calling code is blissfully unaware of the issue.
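A rough sketch of that idea, keeping the original pushback/pop surface. MaxAttempts is an arbitrary example limit, and locking around the internal collections is omitted for brevity:

class TtlQueue
{
    private readonly Queue<string> _queue = new Queue<string>();
    private readonly Dictionary<string, int> _attempts = new Dictionary<string, int>();
    private const int MaxAttempts = 5; // example limit

    public void pushback(string path)
    {
        int count;
        _attempts.TryGetValue(path, out count);
        if (count < MaxAttempts)          // silently drop paths pushed too many times
        {
            _attempts[path] = count + 1;  // counts the initial add plus every retry
            _queue.Enqueue(path);
        }
    }

    public string pop()
    {
        return _queue.Dequeue();
    }
}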
You could abstract/encapsulate the functionality of the "job manager". Hide the queue and implementation from the caller so you can do whatever you want without the callers caring. Something like this:
public static class JobManager
{
private static Queue<JobData> _queue = new Queue<JobData>();
static JobManager() { Task.Factory.StartNew(() => { StartProcessing(); }); }
public static void AddJob(string value)
{
//TODO: validate
_queue.Enqueue(new JobData(value));
}
private static void StartProcessing()
{
while (true)
{
if (_queue.Count > 0)
{
JobData data = _queue.Dequeue();
if (!process(data.Path))
{
data.TTL--;
if (data.TTL > 0)
_queue.Enqueue(data);
}
}
else
{
Thread.Sleep(1000);
}
}
}
private class JobData
{
public string Path { get; set; }
public short TTL { get; set; }
public JobData(string value)
{
this.Path = value;
this.TTL = DEFAULT_TTL;
}
}
}
Then your processing loop can handle the TTL value.
Edit - Added a simple processing loop. This code isn't thread safe, but should hopefully give you an idea.
Seeking some advice, best practice etc...
Technology: C# .NET4.0, Winforms, 32 bit
I am seeking some advice on how I can best tackle large data processing in my C# Winforms application which experiences high memory usage (working set) and the occasional OutOfMemory exception.
The problem is that we perform a large amount of data processing "in-memory" when a "shopping-basket" is opened. In simple terms, when a "shopping-basket" is loaded we perform the following calculations:
For each item in the "shopping-basket", retrieve its historical price going all the way back to the date the item first appeared in stock (this could be two months, two years or two decades of data). Historical price data is retrieved from text files, over the internet, or any other format supported by a price plugin.
For each item, for each day since it first appeared in stock, calculate various metrics which build up a historical profile for that item in the shopping-basket.
The result is that we can potentially perform hundreds, thousands or even millions of calculations depending upon the number of items in the "shopping-basket". If the basket contains too many items we run the risk of hitting an "OutOfMemory" exception.
A couple of caveats:
This data needs to be calculated for each item in the "shopping-basket" and the data is kept until the "shopping-basket" is closed.
Even though we perform steps 1 and 2 in a background thread, speed is important, as the number of items in the "shopping-basket" can greatly affect overall calculation speed.
Memory is reclaimed by the .NET garbage collector when a "shopping-basket" is closed. We have profiled our application and ensured that all references are correctly disposed and closed when a basket is closed.
After all the calculations are completed, the resultant data is stored in an IDictionary. CalculatedData is a class whose properties are the individual metrics calculated by the above process.
Some ideas I've thought about:
Obviously my main concern is to reduce the amount of memory being used by the calculations; however, the volume of memory used can only be reduced if I
1) reduce the number of metrics being calculated for each day or
2) reduce the number of days used for the calculation.
Neither of these options is viable if we wish to fulfill our business requirements.
Memory Mapped Files
One idea has been to use memory-mapped files to store the data dictionary. Would this be possible/feasible, and how could we put it into place?
Use a temporary database
The idea is to use a separate (not in-memory) database which can be created for the life-cycle of the application. As "shopping-baskets" are opened we can persist the calculated data to the database for repeated use, alleviating the requirement to recalculate for the same "shopping-basket".
Are there any other alternatives that we should consider? What is best practice when it comes to performing calculations on large data sets outside of RAM?
Any advice is appreciated....
The easiest solution is a database, perhaps SQLite. Memory-mapped files don't automatically become dictionaries; you would have to code all the memory management yourself, and thereby fight the .NET GC itself for ownership of the data.
If you're interested in trying the memory mapped file approach, you can try it now. I wrote a small native .NET package called MemMapCache that in essence creates a key/val database backed by MemMappedFiles. It's a bit of a hacky concept, but the program MemMapCache.exe keeps all references to the memory mapped files so that if your application crashes, you don't have to worry about losing the state of your cache.
It's very simple to use and you should be able to drop it in your code without too many modifications. Here is an example using it: https://github.com/jprichardson/MemMapCache/blob/master/TestMemMapCache/MemMapCacheTest.cs
It may at least help you further figure out what you need to do for an actual solution.
Please let me know if you do end up using it. I'd be interested in your results.
However, long-term, I'd recommend Redis.
As an update for those stumbling upon this thread...
We ended up using SQLite as our caching solution. The SQLite database we employ exists separately from the main data store used by the application. We persist calculated data to the SQLite database (diskCache) as it's required and have code controlling cache invalidation etc. This was a suitable solution for us, as we were able to achieve write speeds of around 100,000 records per second.
For those interested, this is the code that controls inserts into the diskCache. Full credit for this code goes to JP Richardson (shown answering a question here) for his excellent blog post.
internal class SQLiteBulkInsert
{
#region Class Declarations
private SQLiteCommand m_cmd;
private SQLiteTransaction m_trans;
private readonly SQLiteConnection m_dbCon;
private readonly Dictionary<string, SQLiteParameter> m_parameters = new Dictionary<string, SQLiteParameter>();
private uint m_counter;
private readonly string m_beginInsertText;
#endregion
#region Constructor
public SQLiteBulkInsert(SQLiteConnection dbConnection, string tableName)
{
m_dbCon = dbConnection;
m_tableName = tableName;
var query = new StringBuilder(255);
query.Append("INSERT INTO ["); query.Append(tableName); query.Append("] (");
m_beginInsertText = query.ToString();
}
#endregion
#region Allow Bulk Insert
private bool m_allowBulkInsert = true;
public bool AllowBulkInsert { get { return m_allowBulkInsert; } set { m_allowBulkInsert = value; } }
#endregion
#region CommandText
public string CommandText
{
get
{
if(m_parameters.Count < 1) throw new SQLiteException("You must add at least one parameter.");
var sb = new StringBuilder(255);
sb.Append(m_beginInsertText);
foreach(var param in m_parameters.Keys)
{
sb.Append('[');
sb.Append(param);
sb.Append(']');
sb.Append(", ");
}
sb.Remove(sb.Length - 2, 2);
sb.Append(") VALUES (");
foreach(var param in m_parameters.Keys)
{
sb.Append(m_paramDelim);
sb.Append(param);
sb.Append(", ");
}
sb.Remove(sb.Length - 2, 2);
sb.Append(")");
return sb.ToString();
}
}
#endregion
#region Commit Max
private uint m_commitMax = 25000;
public uint CommitMax { get { return m_commitMax; } set { m_commitMax = value; } }
#endregion
#region Table Name
private readonly string m_tableName;
public string TableName { get { return m_tableName; } }
#endregion
#region Parameter Delimiter
private const string m_paramDelim = ":";
public string ParamDelimiter { get { return m_paramDelim; } }
#endregion
#region AddParameter
public void AddParameter(string name, DbType dbType)
{
var param = new SQLiteParameter(m_paramDelim + name, dbType);
m_parameters.Add(name, param);
}
#endregion
#region Flush
public void Flush()
{
try
{
if (m_trans != null) m_trans.Commit();
}
catch (Exception ex)
{
throw new Exception("Could not commit transaction. See InnerException for more details", ex);
}
finally
{
if (m_trans != null) m_trans.Dispose();
m_trans = null;
m_counter = 0;
}
}
#endregion
#region Insert
public void Insert(object[] paramValues)
{
if (paramValues.Length != m_parameters.Count)
throw new Exception("The values array count must be equal to the count of the number of parameters.");
m_counter++;
if (m_counter == 1)
{
if (m_allowBulkInsert) m_trans = m_dbCon.BeginTransaction();
m_cmd = m_dbCon.CreateCommand();
foreach (var par in m_parameters.Values)
m_cmd.Parameters.Add(par);
m_cmd.CommandText = CommandText;
}
var i = 0;
foreach (var par in m_parameters.Values)
{
par.Value = paramValues[i];
i++;
}
m_cmd.ExecuteNonQuery();
if(m_counter != m_commitMax)
{
// Do nothing
}
else
{
try
{
if(m_trans != null) m_trans.Commit();
}
catch(Exception)
{ }
finally
{
if(m_trans != null)
{
m_trans.Dispose();
m_trans = null;
}
m_counter = 0;
}
}
}
#endregion
}
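For context, typical usage of this helper might look like the following. This is a sketch: the table name, column names, and connection string are made up, and you'd need references to System.Data and System.Data.SQLite:

using (var con = new SQLiteConnection("Data Source=diskCache.sqlite"))
{
    con.Open();

    var bulk = new SQLiteBulkInsert(con, "CalculatedData");
    bulk.AddParameter("ItemId", DbType.Int64);
    bulk.AddParameter("MetricDate", DbType.DateTime);
    bulk.AddParameter("MetricValue", DbType.Double);

    // calculatedRows stands in for whatever in-memory results you are persisting.
    foreach (var row in calculatedRows)
        bulk.Insert(new object[] { row.ItemId, row.MetricDate, row.MetricValue });

    bulk.Flush(); // commit whatever remains of the final batch
}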
I have an application that, before it creates a thread, calls the database to pull X amount of records. When the records are retrieved from the database, a locked flag is set so those records are not pulled again.
Once a thread has completed, it will pull some more records from the database. When I call the database from a thread, should I set a lock on that section of code so it is called by only one thread at a time? Here is an example of my code (I commented the area where I have the lock):
private void CreateThreads()
{
for(var i = 1; i <= _threadCount; i++)
{
var adapter = new Dystopia.DataAdapter();
var records = adapter.FindAllWithLocking(_recordsPerThread,_validationId,_validationDateTime);
if(records != null && records.Count > 0)
{
var paramss = new ArrayList { i, records };
ThreadPool.QueueUserWorkItem(ThreadWorker, paramss);
}
this.Update();
}
}
private void ThreadWorker(object paramList)
{
try
{
var parms = (ArrayList) paramList;
var stopThread = false;
var threadCount = (int) parms[0];
var records = (List<Candidates>) parms[1];
var runOnce = false;
var adapter = new Dystopia.DataAdapter();
var lastCount = records.Count;
var runningCount = 0;
while (_stopThreads == false)
{
if (records.Count > 0)
{
foreach (var record in records)
{
var rec = record; // local copy so it can be passed by ref
var proc = new ProcRecords();
proc.Validate(ref rec);
adapter.Update(rec);
if (_stopThreads)
{
break;
}
}
//This is where I think I may need to sync the threads.
//Is this correct?
lock(this){
records = adapter.FindAllWithLocking(_recordsPerThread, _validationId, _validationDateTime);
}
}
}
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
}
SQL to Pull records:
WITH cte AS (
SELECT TOP (@topCount) *
FROM Candidates WITH (READPAST)
WHERE
isLocked = 0 and
isTested = 0 and
validated = 0
)
UPDATE cte
SET
isLocked = 1,
validationID = @validationId,
validationDateTime = @validationDateTime
OUTPUT INSERTED.*;
You shouldn't need to lock in your threads, as the database should already be doing this locking for you when it handles the request.
I see a few issues.
First, you are testing _stopThreads == false, but you have not revealed whether this is a volatile read. Read the second half of this answer for a good description of what I am talking about.
Second, the lock is pointless, because adapter is a local reference to a non-shared object and records is a local reference which is just being replaced. I am assuming that the adapter makes a separate connection to the database, but if it shares an existing connection then some type of synchronization may need to take place, since ADO.NET connection objects are not typically thread-safe.
Now, you probably will need locking somewhere to publish the results from the work item. I do not see where the results are being published to the main thread so I cannot offer any guidance here.
By the way, I would avoid showing a message box from a ThreadPool thread, because it will hang that thread until the message box is closed.
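On the first point, the simplest fix is usually to mark the flag volatile (assuming _stopThreads is a plain bool field, as its usage suggests):

// Without 'volatile', the JIT may cache the field in a register and a
// worker thread might never observe the main thread setting it to true.
private volatile bool _stopThreads;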
You shouldn't lock(this), since it's really easy to create deadlocks that way; you should create a separate lock object instead. If you search for "lock(this)" you can find numerous articles on why.
Here's an SO question on lock(this)
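A minimal sketch of what the separate lock object looks like in practice:

// A private object that nothing outside this class can lock on, unlike
// 'this', which any caller could also lock and thereby cause a deadlock.
private readonly object _syncRoot = new object();

// ...
lock (_syncRoot)
{
    records = adapter.FindAllWithLocking(_recordsPerThread, _validationId, _validationDateTime);
}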