Executing part of code exactly 1 time inside Parallel.ForEach - c#

I have to query my company's CRM solution (Oracle's Right Now) for our 600k users, and update them there if they exist or create them if they don't. To know whether a user already exists in Right Now, I consume a third-party web service. With 600k users this is a real pain, because each call takes around 1 second to get a response. So I changed my code to use Parallel.ForEach, querying each record in just 0.35 seconds and adding it to a List<User> of records to be created or updated (Right Now is kinda dumb, so I need to separate them into two lists and call two distinct WS methods).
My code ran perfectly before multithreading, but took too long. The problem is that I can't make the batch too large or I get a timeout when I try to update or create via the web service. So I'm sending around 500 records at once, but when execution reaches the critical code part, it runs many times instead of once.
Parallel.ForEach(boDS.USERS.AsEnumerable(), new ParallelOptions { MaxDegreeOfParallelism = -1 }, row =>
{
    ...
    user = null;
    user = QueryUserById(row["USER_ID"].Trim());
    if (user == null)
    {
        isUpdate = false;
        gObject.ID = new ID();
    }
    else
    {
        isUpdate = true;
        gObject.ID = user.ID;
    }
    ... fill user attributes as generic fields ...
    gObject.GenericFields = listGenericFields.ToArray();
    if (isUpdate)
        listUserUpdate.Add(gObject);
    else
        listUserCreate.Add(gObject);
    if (i == batchSize - 1 || i == (boDS.USERS.Rows.Count - 1))
    {
        UpdateProcessingOptions upo = new UpdateProcessingOptions();
        CreateProcessingOptions cpo = new CreateProcessingOptions();
        upo.SuppressExternalEvents = false;
        upo.SuppressRules = false;
        cpo.SuppressExternalEvents = false;
        cpo.SuppressRules = false;
        RNObject[] results = null;
        // <Critical_code>
        if (listUserCreate.Count > 0)
        {
            results = _service.Create(_clientInfoHeader, listUserCreate.ToArray(), cpo);
        }
        if (listUserUpdate.Count > 0)
        {
            _service.Update(_clientInfoHeader, listUserUpdate.ToArray(), upo);
        }
        // </Critical_code>
        listUserUpdate = new List<RNObject>();
        listUserCreate = new List<RNObject>();
    }
    i++;
});
I thought about using lock or mutex, but that won't help me, since the threads will just wait and run that part afterwards anyway. I need some way to execute that part of the code only ONCE, on only ONE thread. Is it possible? Can anyone shed some light?
Thanks and kind regards,
Leandro

As you stated in the comments, you're declaring the variables outside of the loop body. That's where your race conditions originate.
Take the variable listUserUpdate, for example. It's accessed concurrently by parallel threads: while one thread is still adding to it in listUserUpdate.Add(gObject);, another thread could already be resetting it in listUserUpdate = new List<RNObject>(); or enumerating it in listUserUpdate.ToArray().
You really need to refactor that code to:
1. make each loop iteration as independent from the others as you can, by moving variables inside the loop body, and
2. access shared data in a synchronized way, using locks and/or concurrent collections (see the sketch below).
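As a minimal sketch of both points (QueryUserById, RNObject, the processing options, and the service calls are taken from the question; BuildGenericObject and the batch size of 500 are illustrative assumptions), you could collect results in concurrent collections inside the loop and issue the Create/Update calls once, on a single thread, after the loop completes:

// Requires: using System.Collections.Concurrent; using System.Linq;
var toCreate = new ConcurrentBag<RNObject>();
var toUpdate = new ConcurrentBag<RNObject>();

Parallel.ForEach(boDS.USERS.AsEnumerable(), row =>
{
    // All per-iteration state lives inside the loop body.
    var user = QueryUserById(row["USER_ID"].ToString().Trim());
    var gObject = BuildGenericObject(row);  // hypothetical helper that fills the generic fields
    if (user == null)
    {
        gObject.ID = new ID();
        toCreate.Add(gObject);              // ConcurrentBag.Add is thread-safe
    }
    else
    {
        gObject.ID = user.ID;
        toUpdate.Add(gObject);
    }
});

// Back on a single thread: send in batches of ~500 to stay under the WS timeout.
var cpo = new CreateProcessingOptions { SuppressExternalEvents = false, SuppressRules = false };
var upo = new UpdateProcessingOptions { SuppressExternalEvents = false, SuppressRules = false };

var createArray = toCreate.ToArray();
for (int offset = 0; offset < createArray.Length; offset += 500)
    _service.Create(_clientInfoHeader, createArray.Skip(offset).Take(500).ToArray(), cpo);

var updateArray = toUpdate.ToArray();
for (int offset = 0; offset < updateArray.Length; offset += 500)
    _service.Update(_clientInfoHeader, updateArray.Skip(offset).Take(500).ToArray(), upo);

This removes the need to run anything "exactly once" inside the parallel loop at all: the critical web-service calls simply happen after it.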

You can use the double-checked locking pattern. It's usually used for singletons, but you're not making a singleton here, so generic singleton helpers like Lazy<T> do not apply directly.
It works like this:
Separate out your shared data into some sort of class:
class QuerySharedData
{
    // All the write-once-read-many fields that need to be shared between threads

    public QuerySharedData()
    {
        // Compute all the write-once-read-many fields,
        // or use a static Create method if that's handy.
    }
}
In your outer class add the following:
readonly object padlock = new object();
volatile QuerySharedData data;
In your thread's callback delegate, do this:
if (data == null)
{
    lock (padlock)
    {
        if (data == null)
        {
            data = new QuerySharedData(); // this does all the work to initialize the shared fields
        }
    }
}
var localData = data;
Then use the shared query data from localData. By grouping the shared query data into a subordinate class, you avoid having to make each of its individual fields volatile.
More about volatile here: Part 4: Advanced Threading.
Update: my assumption here is that all the classes and fields held by QuerySharedData are read-only once initialized. If that's not true, for instance if you initialize a list once but then add to it from many threads, this pattern will not work for you. You will have to consider using things like Thread-Safe Collections.
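Putting the fragments above together, a minimal self-contained sketch of the pattern (the work done in the QuerySharedData constructor and the Parallel.For body are illustrative placeholders):

using System;
using System.Threading.Tasks;

class QuerySharedData
{
    // Write-once-read-many fields, fully computed in the constructor.
    public readonly string[] Lookup;

    public QuerySharedData()
    {
        Lookup = new[] { "expensive", "shared", "data" }; // placeholder for the real work
    }
}

class Worker
{
    private readonly object padlock = new object();
    private volatile QuerySharedData data;

    public void Run()
    {
        Parallel.For(0, 100, i =>
        {
            // Double-checked locking: the constructor runs exactly once,
            // on whichever thread gets there first; everyone else waits.
            if (data == null)
            {
                lock (padlock)
                {
                    if (data == null)
                        data = new QuerySharedData();
                }
            }
            var localData = data;
            Console.WriteLine(localData.Lookup.Length + " (iteration " + i + ")");
        });
    }
}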

Related

Chance of hitting the same function at the same time by two Threads/Tasks

Assuming the following case:
public Hashtable map = new Hashtable();

public void Cache(String fileName)
{
    if (!map.ContainsKey(fileName))
    {
        map.Add(fileName, new Object());
        _Cache(fileName);
    }
}

private void _Cache(String fileName)
{
    lock (map[fileName])
    {
        if (/* file already cached */)
            return;
        else
        {
            /* cache the file */
        }
    }
}
With the following consumers:

Task.Run(() => {
    Cache("A");
});
Task.Run(() => {
    Cache("A");
});

Would it be possible in any way for the Cache method to throw a duplicate-key exception, meaning that both tasks hit map.Add and try to add the same key?
Edit:
Would using the following data structure solve this concurrency problem?
public class HashMap<Key, Value>
{
    private HashSet<Key> Keys = new HashSet<Key>();
    private List<Value> Values = new List<Value>();

    public int Count => Keys.Count;

    public Boolean Add(Key key, Value value)
    {
        int oldCount = Keys.Count;
        Keys.Add(key);
        if (oldCount != Keys.Count)
        {
            Values.Add(value);
            return true;
        }
        return false;
    }
}
Yes, of course it would be possible. Consider the following fragment:
if (!map.ContainsKey(fileName))
{
    map.Add(fileName, new Object());
Thread 1 may execute if (!map.ContainsKey(fileName)) and find that the map does not contain the key, so it will proceed to add it; but before it gets the chance, Thread 2 may also execute if (!map.ContainsKey(fileName)), find that the map does not contain the key, and also proceed to add it. Of course, the second Add will fail.
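One way to close that check-then-add race atomically is ConcurrentDictionary.TryAdd (available since .NET 4); a minimal sketch, reusing the question's names:

using System.Collections.Concurrent;

class FileCache
{
    private readonly ConcurrentDictionary<string, object> map =
        new ConcurrentDictionary<string, object>();

    public void Cache(string fileName)
    {
        // TryAdd is atomic: of two racing threads, exactly one gets true,
        // so _Cache runs at most once per fileName.
        if (map.TryAdd(fileName, new object()))
        {
            _Cache(fileName);
        }
    }

    private void _Cache(string fileName)
    {
        /* cache the file, as in the question */
    }
}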
EDIT (after clarifications)
So, the problem seems to be how to keep the main map locked for as little time as possible, and how to prevent cached objects from being initialized twice.
This is a complex problem, so I cannot give you a ready-to-run answer (especially since I do not currently have a C# development environment handy), but generally speaking, I think you should proceed as follows:
1. Fully guard your map with lock().
2. Keep your map locked as little as possible: when an object is not found in the map, add an empty object to it and exit the lock immediately. This ensures the map will not become a point of contention for all requests coming in to the web server.
3. After the check-if-present-and-add-if-not fragment, you are holding an object which is guaranteed to be in the map. However, this object may or may not be initialized yet. That's fine; we will take care of that next.
4. Repeat the lock-and-check idiom, this time on the cached object: every incoming request interested in that specific object must lock it, check whether it is initialized, and if not, initialize it. Only the first request pays the penalty of initialization, and any requests that arrive before the object has been fully initialized will wait on their lock until it is. That's all very fine; it's exactly what you want. (A sketch of these steps follows.)
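A minimal sketch of those four steps (CacheEntry, mapLock, and LoadFile are illustrative names, not from the question):

using System.Collections.Generic;

class FileCache
{
    private class CacheEntry
    {
        public bool Initialized;          // guarded by the entry's own lock
        public object Value;
    }

    private readonly object mapLock = new object();
    private readonly Dictionary<string, CacheEntry> map =
        new Dictionary<string, CacheEntry>();

    public object Get(string fileName)
    {
        CacheEntry entry;
        lock (mapLock)                    // steps 1-2: hold the map lock briefly
        {
            if (!map.TryGetValue(fileName, out entry))
            {
                entry = new CacheEntry(); // empty placeholder, not yet initialized
                map.Add(fileName, entry);
            }
        }
        lock (entry)                      // step 4: lock-and-check on the entry itself
        {
            if (!entry.Initialized)
            {
                entry.Value = LoadFile(fileName); // hypothetical expensive load
                entry.Initialized = true;
            }
            return entry.Value;
        }
    }

    private object LoadFile(string fileName)
    {
        return new object();              // placeholder for the real caching work
    }
}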

How to implement locking in a shared cachecontroller?

I have a static class which handles the cache read/write for frequently used data.
The code is this:
public static T GetFromCache<T>(double seconds, string cacheId, Func<T> method) where T : class
{
    HttpContext ctx = HttpContext.Current;
    object temp = ctx.Cache[cacheId];
    if (temp == null)
    {
        lock (Sync)
        {
            temp = ctx.Cache[cacheId];
            if (temp == null)
            {
                temp = method.Invoke();
                AddToCache(temp as T, seconds, cacheId);
                return temp as T;
            }
        }
    }
    if (temp is T)
    {
        return (T)temp;
    }
    return null;
}
The code is used by various callers to read data from and write data to the cache.
Now I have a Sync object (private static readonly object Sync = new object();) which gets locked when data gets written to the cache.
As this code is called by multiple callers, I would like to create a list of Sync objects, one for each caller. (By "caller" I don't mean the user, but the calling code; I would identify a caller by the signature of the method parameter.)
The reason I want this is that every piece of calling code would have its own lock object; otherwise (I think) every call to this cache controller from different callers uses the same lock object, and then the caching of the list of countries would also lock the caching of the list of states. With two different lock objects, they would not get in each other's way.
I would then use the CacheItemRemovedCallback method to remove the lock items from the list.
The question is this: How can I do that?
Having one Sync object per caller defeats the purpose of synchronization: each caller would hold its own lock, so there is a chance that for the same cacheId you end up invoking the method multiple times, which might result in inconsistent data.
If you wish to keep one Sync object per user, then make use of session variables or a per-user cache or something similar; otherwise each user will end up messing with the other users' cacheId results.
If you have a scenario where there can be multiple readers of the data but only a single writer at a time, try ReaderWriterLockSlim; it is very fast compared to lock in a multi-user scenario (see the sketch below).
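A minimal sketch of the ReaderWriterLockSlim approach (the dictionary-backed store here is illustrative, not the HttpContext cache from the question):

using System.Collections.Generic;
using System.Threading;

public static class RwCache
{
    private static readonly ReaderWriterLockSlim CacheLock = new ReaderWriterLockSlim();
    private static readonly Dictionary<string, object> Store = new Dictionary<string, object>();

    public static object Read(string key)
    {
        CacheLock.EnterReadLock();       // many readers may hold this at once
        try
        {
            object value;
            Store.TryGetValue(key, out value);
            return value;
        }
        finally { CacheLock.ExitReadLock(); }
    }

    public static void Write(string key, object value)
    {
        CacheLock.EnterWriteLock();      // exclusive: blocks all readers and writers
        try { Store[key] = value; }
        finally { CacheLock.ExitWriteLock(); }
    }
}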
Update 1:
Considering the cacheId is unique and not shared among callers, you can use the following code.
No lock is needed here, because HttpContext.Cache is thread-safe: you can read and save values to the cache concurrently. But if the value reference itself is shared among more than one concurrent call, then please synchronize access to it.
public static T GetFromCache<T>(double seconds, string cacheId, Func<T> method) where T : class
{
    HttpContext ctx = HttpContext.Current;
    object temp = ctx.Cache[cacheId];
    if (temp == null)
    {
        temp = method.Invoke();
        AddToCache(temp as T, seconds, cacheId);
    }
    return temp as T;
}
Regards

How to handle lock statement if there is an external api call in between

I have the following code:
private static HashSet<SoloUser> soloUsers = new HashSet<SoloUser>();

public void findNewPartner(string School, string Major)
{
    lock (soloUsers)
    {
        SoloUser soloUser = soloUsers.FirstOrDefault(s => (s.School == School) && (s.Major == Major));
        MatchConnection matchConn;
        if (soloUser != null)
        {
            if (soloUser.ConnectionId != Context.ConnectionId)
            {
                soloUsers.Remove(soloUser);
            }
        }
        else
        {
            string sessionId = TokenHelper.GenerateSession();
            soloUser = new SoloUser
            {
                Major = Major,
                School = School,
                SessionId = sessionId,
                ConnectionId = Context.ConnectionId
            };
            soloUsers.Add(soloUser);
        }
    }
}
TokenHelper.GenerateToken(soloUser.Session) and TokenHelper.GenerateModeratorToken(session) could be hazardous because they may take a moment to generate a token. That would lock all users out for that moment, which could be a problem. Are there any workarounds for this logic so that I can still keep everything thread-safe?
EDIT:
I removed TokenHelper.GenerateToken(soloUser.Session) and TokenHelper.GenerateModeratorToken(session) because I realized they can happen outside the lock, but each SoloUser has a SessionId property that is generated per user. The GenerateSession method also takes a moment, and each user needs one of these SessionIds before being added to the collection.
You can move GenerateSession out of the lock if you can afford to take the lock twice and it's OK for a sessionId occasionally to be generated but never used.
Something like this:
public void findNewPartner(string School, string Major)
{
    SoloUser soloUser = null;
    lock (soloUsers)
    {
        soloUser = soloUsers.FirstOrDefault(s => (s.School == School) && (s.Major == Major));
    }

    string sessionId = null;
    // will we be creating a new soloUser?
    if (soloUser == null)
    {
        // then we'll need a new session for that new user
        sessionId = TokenHelper.GenerateSession();
    }

    lock (soloUsers)
    {
        soloUser = soloUsers.FirstOrDefault(s => (s.School == School) && (s.Major == Major));
        if (soloUser != null)
        {
            // whoops! Guess we don't need that sessionId after all. Oh well! Carry on...
            if (soloUser.ConnectionId != Context.ConnectionId)
            {
                soloUsers.Remove(soloUser);
            }
        }
        else
        {
            // use the sessionId computed earlier
            soloUser = new SoloUser
            {
                Major = Major,
                School = School,
                SessionId = sessionId,
                ConnectionId = Context.ConnectionId
            };
            soloUsers.Add(soloUser);
        }
    }
}
This basically takes a quick lock to see whether a new soloUser needs to be constructed; if so, a new session is generated outside the lock. We then reacquire the lock and perform the original set of operations, using the sessionId that was computed outside the lock when constructing a new soloUser.
This pattern can generate sessionIds that are never used. If two threads execute this function at the same time with the same school and major, both threads will generate session ids, but only one of them will successfully create a new soloUser and add it to the collection. The losing thread will find the soloUser in the collection, remove it, and not use the sessionId it just generated. At that point, both threads will refer to the same soloUser with the same sessionId, which appears to be the goal.
If sessionIds have resources associated with them (such as an entry in a database) but these resources will be cleaned up when the sessionId ages out, then collisions like this will produce a little extra noise but overall should not impact the system.
If the generated sessionIds have nothing associated with them that would require clean up or aging out, then you might consider losing the first lock in my example and just always generate a sessionId, whether it's needed or not. This is probably not a likely scenario, but I have used this sort of "promiscuous" trick in specialized cases before to avoid hopping in and out of high traffic locks. If it's cheap to create and expensive to lock, then create with abandon and be careful with the locks.
Make sure the cost of GenerateSession is high enough to justify this extra run-around. If GenerateSession takes nanoseconds to complete, you don't need all this - just leave it in the lock as originally written. If GenerateSession takes "a long time" (a second or more? 500ms or more? can't say), then moving it out of the lock is a good idea to prevent other uses of the shared list from having to wait.
The best solution is probably to use ConcurrentBag<T>.
See http://msdn.microsoft.com/en-us/library/dd381779
I am assuming you are using .NET 4.
Note that this is a bag, not a set, so you'll have to code around the "no duplicates" requirement yourself, and do so in a thread-safe way (see the sketch below).
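A minimal sketch of coding around the no-duplicates requirement thread-safely, using ConcurrentDictionary as a concurrent set (the uniqueness key and the helper method are illustrative):

using System.Collections.Concurrent;

// ConcurrentDictionary used as a concurrent set: the key enforces uniqueness
// and TryAdd is atomic, so only one of two racing threads wins per key.
private static readonly ConcurrentDictionary<string, SoloUser> soloUsers =
    new ConcurrentDictionary<string, SoloUser>();

public static bool TryAddSoloUser(SoloUser user)
{
    string key = user.School + "|" + user.Major;  // illustrative uniqueness key
    return soloUsers.TryAdd(key, user);           // false if a duplicate already exists
}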

Quartz.Net - update/delete jobs/triggers

I'm using Quartz to pull the latest tasks (from another source); it then adds each one as a job and creates triggers etc. per task. Easy.
However, sometimes tasks change and therefore already exist, in which case I would like to update them (to keep it simple, let's say just the Description). The code below updates a specific task's description with the given date.
private static void SetLastPull(DateTime lastPullDateTime)
{
    var lastpull = sched.GetJobDetail("db_pull", "Settings");
    if (lastpull != null)
    {
        lastpull.Description = lastPullDateTime.ToString();
    }
    else
    {
        var newLastPull = new JobDetail("db_pull", "Settings", typeof(IJob));
        newLastPull.Description = lastPullDateTime.ToString();
        var newLastPullTrigger = new CronTrigger("db_pull", "Settings", "0 0 0 * 12 ? 2099");
        sched.ScheduleJob(newLastPull, newLastPullTrigger);
    }
}
I'm assuming that after I do lastpull.Description = lastPullDateTime.ToString(); I should call something to save the changes to the database. Is there a way to do that in Quartz, or do I have to update it by other means?
You can't change (update) a job once it has been scheduled. You can only re-schedule it (with any changes you might want to make) or delete it and create a new one.
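A minimal sketch of the delete-and-recreate route, against the same Quartz.NET 1.x-style API the question uses (assuming sched is an IScheduler):

private static void SetLastPull(DateTime lastPullDateTime)
{
    // A scheduled job can't be edited in place: remove the old copy first.
    var lastpull = sched.GetJobDetail("db_pull", "Settings");
    if (lastpull != null)
    {
        sched.DeleteJob("db_pull", "Settings");   // also removes its triggers
    }

    // Then schedule a fresh job carrying the updated description.
    var newLastPull = new JobDetail("db_pull", "Settings", typeof(IJob));
    newLastPull.Description = lastPullDateTime.ToString();
    var newLastPullTrigger = new CronTrigger("db_pull", "Settings", "0 0 0 * 12 ? 2099");
    sched.ScheduleJob(newLastPull, newLastPullTrigger);
}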

Dual-queue producer-consumer in .NET (forcing member variable flush)

I have a thread which produces data in the form of simple object (record). The thread may produce a thousand records for each one that successfully passes a filter and is actually enqueued. Once the object is enqueued it is read-only.
I have one lock, which I acquire once the record has passed the filter, and I add the item to the back of the producer_queue.
On the consumer thread, I acquire the lock, confirm that the producer_queue is not empty,
set consumer_queue to equal producer_queue, create a new (empty) queue, and set it on producer_queue. Without any further locking I process consumer_queue until it's empty and repeat.
Everything works beautifully on most machines, but on one particular dual-quad server I see, in roughly 1 in 500k iterations, an object that is not fully initialized when I read it out of consumer_queue. The condition is so fleeting that when I dump the object after detecting the condition, the fields are correct 90% of the time.
So my question is this: how can I assure that the writes to the object are flushed to main memory when the queue is swapped?
Edit:
On the producer thread:
(producer_queue above is m_fillingQueue; consumer_queue above is m_drainingQueue)
private void FillRecordQueue()
{
    while (!m_done)
    {
        int count;
        lock (m_swapLock)
        {
            count = m_fillingQueue.Count;
        }
        if (count > 5000)
        {
            Thread.Sleep(60);
        }
        else
        {
            DataRecord rec = GetNextRecord();
            if (rec == null) break;
            lock (m_swapLock)
            {
                m_fillingQueue.AddLast(rec);
            }
        }
    }
}
In the consumer thread:
private DataRecord Next(bool remove)
{
    bool drained = false;
    while (!drained)
    {
        if (m_drainingQueue.Count > 0)
        {
            DataRecord rec = m_drainingQueue.First.Value;
            if (remove) m_drainingQueue.RemoveFirst();
            if (rec.Time < FIRST_VALID_TIME)
            {
                throw new InvalidOperationException("Detected invalid timestamp in Next(): " + rec.Time + " from record " + rec);
            }
            return rec;
        }
        else
        {
            lock (m_swapLock)
            {
                m_drainingQueue = m_fillingQueue;
                m_fillingQueue = new LinkedList<DataRecord>();
                if (m_drainingQueue.Count == 0) drained = true;
            }
        }
    }
    return null;
}
The consumer is rate-limited, so it can't get ahead of the producer.
The behavior I see is that sometimes the Time field reads as DateTime.MinValue; by the time I construct the string to throw the exception, however, it's perfectly fine.
Have you tried the obvious: is the microcode update applied on the fancy 8-core box (via a BIOS update)? Did you run Windows Update to get the latest processor driver?
At first glance it looks like you're locking your containers correctly, so I'm recommending the systems approach, since it sounds like you're not seeing this issue on a good ol' dual-core box.
Assuming these are in fact the only methods that interact with the m_fillingQueue variable, and that a DataRecord cannot be changed after GetNextRecord() creates it (read-only properties, hopefully?), the code at least on the face of it appears to be correct.
In that case I suggest that GregC's answer would be the first thing to check: make sure the failing machine is fully updated (OS / drivers / .NET Framework), because the lock statement should involve all the required memory barriers to ensure that the rec variable is fully flushed out of any caches before the object is added to the list.
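For reference, here is an annotated excerpt of why the lock matters; the code is the question's own, with comments describing the standard acquire/release behavior of the C# lock statement:

// Producer: rec is fully initialized *before* the lock is taken.
DataRecord rec = GetNextRecord();
lock (m_swapLock)
{
    m_fillingQueue.AddLast(rec);
}   // releasing the lock publishes all earlier writes (release semantics)

// Consumer: taking the same lock before the swap (acquire semantics)
// guarantees it sees every write the producer made before its release.
lock (m_swapLock)
{
    m_drainingQueue = m_fillingQueue;
    m_fillingQueue = new LinkedList<DataRecord>();
}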
