All,
I am using Parallel.ForEach as follows:
private void fillEventDifferencesParallels(IProducerConsumerCollection<IEvent> events, Dictionary<string, IEvent> originalEvents)
{
    Parallel.ForEach<IEvent>(events, evt =>
    {
        IEvent originalEventInfo = originalEvents[evt.EventID];
        evt.FillDifferences(originalEventInfo);
    });
}
OK, so the problem I'm having is that I have a list of 28 of these (a test sample; this should be able to scale to 200+), and the FillDifferences method is quite time-consuming (about 4 s per call). The average time for this to run in a normal foreach has been around 100-130 s. When I run the same thing in parallel, it takes the same amount of time and spikes my CPU (Intel i5, 2 cores, 2 threads per core), causing the app to become sluggish while this query is running (this is running on a thread that was spawned by the GUI thread).
So my question is: what am I doing wrong that is causing this to take the same amount of time? I read that List<T> wasn't thread-safe, so I rewrote this to use IProducerConsumerCollection. Are there any other pitfalls that may be causing this?
The FillDifferences method calls a static class that uses reflection to find out how many differences there are between the original and the modified object. The static class has no 'global' state, only variables local to the methods being invoked.
Some of you wanted to see what the FillDifferences() method called. This is where it ends up ultimately:
public List<IDifferences> ShallowCompare(object orig, object changed, string currentName)
{
    List<IDifferences> differences = new List<IDifferences>();
    foreach (MemberInfo m in orig.GetType().GetMembers())
    {
        List<IDifferences> temp = null;
        // Go through all MemberInfos until you find one that is a Property.
        if (m.MemberType == MemberTypes.Property)
        {
            PropertyInfo p = (PropertyInfo)m;
            string newCurrentName = "";
            if (currentName != null && currentName.Length > 0)
            {
                newCurrentName = currentName + ".";
            }
            newCurrentName += p.Name;
            object propertyOrig = null;
            object propertyChanged = null;
            // Find the property information from the orig object.
            if (orig != null)
            {
                propertyOrig = p.GetValue(orig, null);
            }
            // Find the property information from the changed object.
            if (changed != null)
            {
                propertyChanged = p.GetValue(changed, null);
            }
            // Send the property to find the differences, if any. This is a SHALLOW compare.
            temp = objectComparator(p, propertyOrig, propertyChanged, true, newCurrentName);
        }
        if (temp != null && temp.Count > 0)
        {
            foreach (IDifferences difference in temp)
            {
                addDifferenceToList(differences, difference);
            }
        }
    }
    return differences;
}
I believe you may be running into the cost of thread context switching. Since these tasks are long running I can imagine many threads are being created on the ThreadPool to handle them.
0ms == 1 thread
500ms == 2 threads
1000 ms == 3 threads
1500 ms == 4 threads
2000 ms == 5 threads
2500 ms == 6 threads
3000 ms == 7 threads
3500 ms == 8 threads
4000 ms == 9 threads
By 4000 ms only the first task has completed, so this process will continue. A possible solution is as follows:
System.Threading.ThreadPool.SetMaxThreads(4, 4);
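With .NET 4's Parallel.ForEach you can also cap the concurrency per call via ParallelOptions.MaxDegreeOfParallelism, which avoids the thread-injection ramp-up described above. A minimal sketch (the squaring loop is just a stand-in for the expensive per-event work):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ParallelCapDemo
{
    static void Main()
    {
        var options = new ParallelOptions
        {
            // Cap concurrency at the number of logical cores so long-running
            // work items don't trigger ThreadPool thread injection.
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        var results = new ConcurrentBag<int>();
        Parallel.ForEach(new[] { 1, 2, 3, 4, 5, 6, 7, 8 }, options, n =>
        {
            results.Add(n * n); // stand-in for the expensive FillDifferences call
        });

        Console.WriteLine(results.Count); // 8
    }
}
```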
Looking at what it's doing, the only time your threads aren't doing anything is when the OS switches them out to give another thread a go, so the gain of being able to run on another core is offset by the cost of all the context switches.
You'd have to add some logging to find out for definite, but I suspect the bottleneck is physical threads, unless you have one somewhere else you haven't posted.
If that's true, I'd be tempted to rejig the code: have two threads, one for finding properties to compare and one for comparing them, with a common queue between them. Maybe another one to throw classes in the list and collate the results.
Could be my old-time batch-processing head, though.
Related
I have the following code in C#:
(_StoreQueue is a ConcurrentQueue)
var S = _StoreQueue.FirstOrDefault(_ => _.TimeStamp == T);
if (S == null)
{
    lock (_QueueLock)
    {
        // try again
        S = _StoreQueue.FirstOrDefault(_ => _.TimeStamp == T);
        if (S == null)
        {
            S = new Store(T);
            _StoreQueue.Enqueue(S);
        }
    }
}
The system is collecting data in real time (fairly high frequency, around 300-400 calls/second) and puts it in bins (Store objects) that each represent a 5-second interval. These bins sit in a queue as they get written, and the queue gets emptied as data is processed and written out.
So, when data arrives, a check is done to see whether there is a bin for that timestamp (rounded to 5 seconds); if not, one is created.
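For concreteness, the 5-second rounding can be done on the timestamp's ticks. This is only an illustrative sketch (BinKey is a made-up name), but it shows how two readings arriving within the same interval map to the same bin key:

```csharp
using System;

class BinKeyDemo
{
    // Round a timestamp down to its 5-second bin boundary (illustrative;
    // the real code would key bins off the same rounded value).
    static long BinKey(DateTime t)
    {
        long fiveSeconds = TimeSpan.TicksPerSecond * 5;
        return (t.Ticks / fiveSeconds) * fiveSeconds;
    }

    static void Main()
    {
        var a = new DateTime(2012, 1, 1, 12, 0, 3);
        var b = new DateTime(2012, 1, 1, 12, 0, 4);
        var c = new DateTime(2012, 1, 1, 12, 0, 5);

        Console.WriteLine(BinKey(a) == BinKey(b)); // True  (same 5-second bin)
        Console.WriteLine(BinKey(b) == BinKey(c)); // False (next bin)
    }
}
```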
Since this is quite heavily multi-threaded, the system goes with the following logic:
If there is a bin, it is used to store the data.
If there is no bin, a lock is taken, and within that lock the check is done again to make sure a bin wasn't created by another thread in the meantime; if there is still no bin, one gets created.
With this system, the lock is used roughly once every 2k calls.
I am trying to see if there is a way to remove the lock, mostly because I suspect there has to be a better solution than the double check.
An alternative I have been considering is to create empty bins ahead of time, which would entirely remove the need for any locks, but the search for the right bin would become slower, as it would have to scan the list of pre-built bins to find the proper one.
Using a ConcurrentDictionary can fix the issue you are having. Here I assumed a type of double for your TimeStamp property, but it can be anything, as long as you make the ConcurrentDictionary key type match.
class Program
{
    static ConcurrentDictionary<double, Store> _StoreQueue = new ConcurrentDictionary<double, Store>();

    static void Main(string[] args)
    {
        var T = 17d;
        // Add the store with timestamp 17 only if it does not already exist.
        _StoreQueue.GetOrAdd(T, new Store(T));
    }

    public class Store
    {
        public double TimeStamp { get; set; }

        public Store(double timeStamp)
        {
            TimeStamp = timeStamp;
        }
    }
}
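One caveat with the snippet above: GetOrAdd(T, new Store(T)) constructs a Store even when the key already exists. The overload that takes a value factory defers construction (though the factory may still run more than once under contention, with only one result kept, so it should be side-effect free):

```csharp
using System;
using System.Collections.Concurrent;

class GetOrAddDemo
{
    static void Main()
    {
        var stores = new ConcurrentDictionary<double, string>();

        // The value factory only runs when the key is absent.
        var first = stores.GetOrAdd(17d, t => "bin@" + t);

        // Key already present: the factory is not invoked, the existing value is returned.
        var second = stores.GetOrAdd(17d, t => "a brand new bin");

        Console.WriteLine(first == second); // True
    }
}
```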
I have to query our company's CRM solution (Oracle's Right Now) for our 600k users, and update them there if they exist or create them if they don't. To know whether a user already exists in Right Now, I consume a third-party web service, and with 600k users this can be a real pain, since each call takes around 1 second to get a response. So I changed my code to use Parallel.ForEach, querying each record in just 0.35 seconds and adding it to a List<User> of records to be created or updated (Right Now is kinda dumb, so I need to separate them into 2 lists and call 2 distinct WS methods).
My code used to run perfectly before multithreading, but it took too long. The problem is that I can't make a batch too large or I get a timeout when I try to update or create via the web service. So I'm sending around 500 records at once, and when it runs the critical code part, it executes many times.
Parallel.ForEach(boDS.USERS.AsEnumerable(), new ParallelOptions { MaxDegreeOfParallelism = -1 }, row =>
{
    ...
    user = null;
    user = QueryUserById(row["USER_ID"].Trim());
    if (user == null)
    {
        isUpdate = false;
        gObject.ID = new ID();
    }
    else
    {
        isUpdate = true;
        gObject.ID = user.ID;
    }
    ... fill user attributes as generic fields ...
    gObject.GenericFields = listGenericFields.ToArray();
    if (isUpdate)
        listUserUpdate.Add(gObject);
    else
        listUserCreate.Add(gObject);
    if (i == batchSize - 1 || i == (boDS.USERS.Rows.Count - 1))
    {
        UpdateProcessingOptions upo = new UpdateProcessingOptions();
        CreateProcessingOptions cpo = new CreateProcessingOptions();
        upo.SuppressExternalEvents = false;
        upo.SuppressRules = false;
        cpo.SuppressExternalEvents = false;
        cpo.SuppressRules = false;
        RNObject[] results = null;
        // <Critical_code>
        if (listUserCreate.Count > 0)
        {
            results = _service.Create(_clientInfoHeader, listUserCreate.ToArray(), cpo);
        }
        if (listUserUpdate.Count > 0)
        {
            _service.Update(_clientInfoHeader, listUserUpdate.ToArray(), upo);
        }
        // </Critical_code>
        listUserUpdate = new List<RNObject>();
        listUserCreate = new List<RNObject>();
    }
    i++;
});
I thought about using a lock or mutex, but that isn't going to help me, since the threads will just wait and execute the code afterwards anyway. I need some way to execute that part of the code only ONCE, in only ONE thread. Is that possible? Can anyone shed some light?
Thanks and kind regards,
Leandro
As you stated in the comments, you're declaring the variables outside of the loop body. That's where your race conditions originate.
Let's take the variable listUserUpdate as an example. It's accessed concurrently by the parallel threads: while one thread is still adding to it in listUserUpdate.Add(gObject);, another thread could already be resetting the list in listUserUpdate = new List<RNObject>(); or enumerating it in listUserUpdate.ToArray().
You really need to refactor that code to:
1. make each iteration run as independently of the others as you can, by moving variables inside the loop body, and
2. access shared data in a synchronized way, using locks and/or concurrent collections.
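A rough sketch of that shape, using ConcurrentBag<T> in place of the plain lists (all names here are stand-ins, and the modulo check merely simulates the QueryUserById result):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

class RefactorSketch
{
    static void Main()
    {
        // Thread-safe collections: Add needs no lock.
        var createBag = new ConcurrentBag<string>();
        var updateBag = new ConcurrentBag<string>();

        Parallel.ForEach(Enumerable.Range(0, 100), id =>
        {
            // Everything this iteration needs lives inside the body,
            // so iterations share no mutable state.
            bool isUpdate = id % 2 == 0; // stand-in for QueryUserById(...) != null
            if (isUpdate)
                updateBag.Add("user" + id);
            else
                createBag.Add("user" + id);
        });

        // Batch the web-service calls once, after the parallel phase,
        // instead of inside the loop body.
        Console.WriteLine(createBag.Count + " " + updateBag.Count);
    }
}
```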
You can use the double-checked locking pattern. This is usually used for singletons, but you're not making a singleton here, so generic singleton helpers like Lazy<T> do not apply.
It works like this:
Separate out your shared data into some sort of class:
class QuerySharedData
{
    // All the write-once-read-many fields that need to be shared between threads

    public QuerySharedData()
    {
        // Compute all the write-once-read-many fields. Or use a static Create method if that's handy.
    }
}
In your outer class add the following:
object padlock = new object();
volatile QuerySharedData data;
In your thread's callback delegate, do this:
if (data == null)
{
    lock (padlock)
    {
        if (data == null)
        {
            data = new QuerySharedData(); // this does all the work to initialize the shared fields
        }
    }
}
var localData = data;
Then use the shared query data from localData. By grouping the shared query data into a subordinate class, you avoid the necessity of making its individual fields volatile.
More about volatile here: Part 4: Advanced Threading.
Update: my assumption here is that all the classes and fields held by QuerySharedData are read-only once initialized. If that is not true, for instance if you initialize a list once but add to it from many threads, this pattern will not work for you. You will have to consider using things like Thread-Safe Collections.
As always, I'm quite the noob, as I'm sure you will see from both my code and my question. For practice I'm currently writing a Xamarin.Android app for a game called Eve Online. People there mine resources from planets to make cash. These mines have to be reset at different intervals, and the real pros can have up to 30 characters doing it. Each character can have 5 planets, and usually there are at least 2 mines (extractors) on each, so there could be 300 timers going.
In my app you save your characters in an SQLite DB, and every hour an IntentService runs through the API and checks whether your timers are expired or not. This is how I do that:
public async Task PullPlanets(long KeyID, long CharacterID, string VCode, string CharName)
{
    XmlReader lesern = XmlReader.Create("https://api.eveonline.com/char/PlanetaryColonies.xml.aspx?keyID=" + KeyID + "&vCode=" + VCode + "&characterID=" + CharacterID);
    while (lesern.Read())
    {
        long planet = 0;
        string planetName;
        planet = Convert.ToInt64(lesern.GetAttribute("planetID"));
        planetName = lesern.GetAttribute("planetName");
        if ((planet != 0) && (planetName != null))
        {
            planets.Add(planet);
            planetNames.Add(planetName);
            await GetExpirationTimes(CharName, planet, planetName, KeyID, CharacterID, VCode);
        }
    }
    lesern.Close();
}

public async Task GetExpirationTimes(string CharName, long planetID, string planetName, long KeyID, long CharacterID, string VCode)
{
    string planet = planetID.ToString();
    XmlReader lesern = XmlReader.Create("https://api.eveonline.com/char/PlanetaryPins.xml.aspx?keyID=" + KeyID + "&vCode=" + VCode + "&characterID=" + CharacterID + "&planetID=" + planet);
    while (lesern.Read())
    {
        string expTime;
        expTime = lesern.GetAttribute("expiryTime");
        if ((expTime != null) && (expTime != "0001-01-01 00:00:00"))
        {
            allInfo.Add(new AllInfo(CharName, planetName, Convert.ToDateTime(expTime)));
        }
    }
    lesern.Close();
    SendOrderedBroadcast(stocksIntent, null);
}
}
After this, it sends the times back to my Activity, where they get added to an extractor. It seems to work pretty fine, although I've only been able to test with 2 characters with a total of 14 extractors so far. An AlarmManager in the Activity calls the service every hour, and it sends a notification. When the user opens the Activity, it pulls the list from the service, sorts it, and displays it. I would welcome input on whether this is the way to do it.
I do see a problem on the horizon, though. The Eve API blocks if an app surpasses 30 API calls per second. I'm pretty sure someone with 30 characters would do that. So, I'm wondering if I should add something to delay each call if a certain number is passed? This is how I make the first XML call:
var table = db.Table<CharsList>();
foreach (var e in table)
{
    long KeyIDOut = Convert.ToInt64(e.KeyID);
    long CharIDOut = Convert.ToInt64(e.CharacterID);
    string VCodeOut = e.VCode.ToString();
    string navnOut = e.Name.ToString();
    PullPlanets(KeyIDOut, CharIDOut, VCodeOut, navnOut);
}
CheckTimes();
}
Is it viable to add something like
if (table.Count > 10)
{
    foreach (var e in table)
    {
        // start the character's call
        Thread.Sleep(100);
    }
}
The service is an IntentService, not on the UI thread. I guess this would bring the calls under 30 a second, but I have never used Thread.Sleep and fear what else it could do in my code. Are there other things that could help me not blow the limit? Can this code handle 300 extractors?
I believe you are generally right in your approach. I had to do a similar thing for a Reddit client I was writing, except their limit is about one request per second.
The only problem I see with your setup is the assumption that Thread.Sleep sleeps for exactly the amount of time you give it. Early wakeups are possible in some cases, so what I would suggest is that you give it a smaller value, save the last time you accessed the service, and put a loop around the sleep call that terminates once enough time has passed.
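The save-the-last-access-time-and-loop idea might look roughly like this (a sketch with made-up names; a 40 ms minimum gap keeps you at roughly 25 calls/second, under the 30/second cap):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class RateLimiter
{
    private readonly TimeSpan _minInterval;
    private readonly Stopwatch _sinceLastCall = Stopwatch.StartNew();

    public RateLimiter(TimeSpan minInterval) { _minInterval = minInterval; }

    // Blocks until at least _minInterval has elapsed since the previous slot.
    public void WaitForSlot()
    {
        // Loop and re-check the clock rather than trusting a single Sleep.
        while (_sinceLastCall.Elapsed < _minInterval)
        {
            var remaining = _minInterval - _sinceLastCall.Elapsed;
            if (remaining > TimeSpan.Zero)
                Thread.Sleep(remaining);
        }
        _sinceLastCall.Restart();
    }
}

class Demo
{
    static void Main()
    {
        var limiter = new RateLimiter(TimeSpan.FromMilliseconds(40));
        var total = Stopwatch.StartNew();
        for (int i = 0; i < 5; i++)
        {
            limiter.WaitForSlot();
            // PullPlanets(...) would go here
        }
        // 5 slots at >= 40 ms apart: at least ~160 ms must have elapsed.
        Console.WriteLine(total.ElapsedMilliseconds >= 160);
    }
}
```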
Finally, if you are going to be firing up a lot of IntentServices for relatively short amounts of work, you might want to have a normal Service with a thread to handle the work; that way it only has to be created once, but it is still off the UI thread.
I was seeing some strange behavior in a multi-threaded application which I wrote and which was not scaling well across multiple cores.
The following code illustrates the behavior I am seeing. It appears that the heap-intensive operations do not scale across multiple cores; rather, they seem to slow down, i.e. using a single thread would be faster.
class Program
{
    public static Data _threadOneData = new Data();
    public static Data _threadTwoData = new Data();
    public static Data _threadThreeData = new Data();
    public static Data _threadFourData = new Data();

    static void Main(string[] args)
    {
        // Do heap intensive tests
        var start = DateTime.Now;
        RunOneThread(WorkerUsingHeap);
        var finish = DateTime.Now;
        var timeLapse = finish - start;
        Console.WriteLine("One thread using heap: " + timeLapse);

        start = DateTime.Now;
        RunFourThreads(WorkerUsingHeap);
        finish = DateTime.Now;
        timeLapse = finish - start;
        Console.WriteLine("Four threads using heap: " + timeLapse);

        // Do stack intensive tests
        start = DateTime.Now;
        RunOneThread(WorkerUsingStack);
        finish = DateTime.Now;
        timeLapse = finish - start;
        Console.WriteLine("One thread using stack: " + timeLapse);

        start = DateTime.Now;
        RunFourThreads(WorkerUsingStack);
        finish = DateTime.Now;
        timeLapse = finish - start;
        Console.WriteLine("Four threads using stack: " + timeLapse);

        Console.ReadLine();
    }

    public static void RunOneThread(ParameterizedThreadStart worker)
    {
        var threadOne = new Thread(worker);
        threadOne.Start(_threadOneData);
        threadOne.Join();
    }

    public static void RunFourThreads(ParameterizedThreadStart worker)
    {
        var threadOne = new Thread(worker);
        threadOne.Start(_threadOneData);
        var threadTwo = new Thread(worker);
        threadTwo.Start(_threadTwoData);
        var threadThree = new Thread(worker);
        threadThree.Start(_threadThreeData);
        var threadFour = new Thread(worker);
        threadFour.Start(_threadFourData);
        threadOne.Join();
        threadTwo.Join();
        threadThree.Join();
        threadFour.Join();
    }

    static void WorkerUsingHeap(object state)
    {
        var data = state as Data;
        for (int count = 0; count < 100000000; count++)
        {
            var property = data.Property;
            data.Property = property + 1;
        }
    }

    static void WorkerUsingStack(object state)
    {
        var data = state as Data;
        double dataOnStack = data.Property;
        for (int count = 0; count < 100000000; count++)
        {
            dataOnStack++;
        }
        data.Property = dataOnStack;
    }

    public class Data
    {
        public double Property { get; set; }
    }
}
This code was run on a Core 2 Quad (4 core system) with the following results:
One thread using heap: 00:00:01.8125000
Four threads using heap: 00:00:17.7500000
One thread using stack: 00:00:00.3437500
Four threads using stack: 00:00:00.3750000
So using the heap with four threads did 4 times the work but took almost 10 times as long. This means it would be twice as fast in this case to use only one thread?
Using the stack was much more in line with what I expected.
I would like to know what is going on here. Can the heap only be written to from one thread at a time?
The answer is simple: run outside of Visual Studio...
I just copied your entire program, and ran it on my quad core system.
Inside VS (Release Build):
One thread using heap: 00:00:03.2206779
Four threads using heap: 00:00:23.1476850
One thread using stack: 00:00:00.3779622
Four threads using stack: 00:00:00.5219478
Outside VS (Release Build):
One thread using heap: 00:00:00.3899610
Four threads using heap: 00:00:00.4689531
One thread using stack: 00:00:00.1359864
Four threads using stack: 00:00:00.1409859
Note the difference. Outside VS, the time taken is pretty much all the overhead of starting the threads. Your work in this case is too small to really test, and you're not using the high-performance counters, so it's not a perfect test.
Main rule of thumb: always do perf testing outside VS, i.e. use Ctrl+F5 instead of F5 to run.
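For what it's worth, System.Diagnostics.Stopwatch wraps the high-resolution performance counter (unlike DateTime.Now), and a warm-up pass before timing avoids measuring JIT compilation. A minimal harness (the summing loop is just filler work):

```csharp
using System;
using System.Diagnostics;

class BenchHarness
{
    static long DoWork()
    {
        long sum = 0;
        for (int i = 0; i < 10000000; i++) sum += i;
        return sum;
    }

    static void Main()
    {
        DoWork(); // warm-up: JIT-compile the method and touch caches before timing

        var sw = Stopwatch.StartNew(); // high-resolution timer
        long result = DoWork();
        sw.Stop();

        Console.WriteLine(result);
        Console.WriteLine("Elapsed ms: " + sw.ElapsedMilliseconds);
    }
}
```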
Aside from the debug-vs-release effects, there is something more you should be aware of.
You cannot effectively evaluate multi-threaded code for performance in 0.3s.
The point of threads is two-fold: effectively model parallel work in code, and effectively exploit parallel resources (cpus, cores).
You are trying to evaluate the latter. Given that thread start overhead is not vanishingly small in comparison to the interval over which you are timing, your measurement is immediately suspect. In most perf test trials, a significant warm-up interval is appropriate. This may sound silly to you - it's a computer program after all, not a lawnmower. But warm-up is absolutely imperative if you are really going to evaluate multi-thread performance: caches get filled, pipelines fill up, pools get filled, GC generations get filled. The steady-state, continuous performance is what you would like to evaluate. For purposes of this exercise, the program behaves like a lawnmower.
You could say - Well, no, I don't want to evaluate the steady state performance. And if that is the case, then I would say that your scenario is very specialized. Most app scenarios, whether their designers explicitly realize it or not, need continuous, steady performance.
If you truly need the perf to be good only over a single 0.3s interval, you have found your answer. But be careful to not generalize the results.
If you want general results, you need to have reasonably long warm up intervals, and longer collection intervals. You might start at 20s/60s for those phases, but here is the key thing: you need to vary those intervals until you find the results converging. YMMV. The valid times vary depending on the application workload and the resources dedicated to it, obviously. You may find that a measurement interval of 120s is necessary for convergence, or you may find 40s is just fine. But (a) you won't know until you measure it, and (b) you can bet 0.3s is not long enough.
[edit]Turns out, this is a release vs. debug build issue -- not sure why it is, but it is. See comments and other answers.[/edit]
This was very interesting -- I wouldn't have guessed there'd be that much difference. (similar test machine here -- Core 2 Quad Q9300)
Here's an interesting comparison -- add a decent-sized additional element to the 'Data' class -- I changed it to this:
public class Data
{
public double Property { get; set; }
public byte[] Spacer = new byte[8096];
}
It's still not quite the same time, but it's very close (running it for 10x as long results in 13.1s vs. 17.6s on my machine).
If I had to guess, I'd speculate that it's related to cross-core cache coherency (false sharing), at least if I'm remembering how CPU caches work. With the small version of Data, if a single cache line contains multiple instances of Data, the cores have to constantly invalidate each other's caches (worst case if they're all on the same cache line). With the spacer added, the objects' memory addresses are far enough apart that one CPU's write to a given address doesn't invalidate the caches of the other CPUs.
Another thing to note: the 4 threads start nearly concurrently, but they don't finish at the same time - another indication that there are cross-core issues at work here. Also, I'd guess that running on a multi-CPU machine of a different architecture would bring more interesting issues to light.
I guess the lesson from this is that in a highly concurrent scenario, if you're doing a bunch of work with a few small data structures, you should try to make sure they aren't all packed on top of each other in memory. Of course, there's really no way to make sure of that, but I'm guessing there are techniques (like adding spacers) that could be used to try to make it happen.
[edit]
This was too interesting -- I couldn't put it down. To test this out further, I thought I'd try varying-sized spacers, and use an integer instead of a double to keep the object without any added spacers smaller.
class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("name\t1 thread\t4 threads");
        RunTest("no spacer", WorkerUsingHeap, () => new Data());
        var values = new int[] { -1, 0, 4, 8, 12, 16, 20 };
        foreach (var sv in values)
        {
            var v = sv;
            RunTest(string.Format(v == -1 ? "null spacer" : "{0}B spacer", v), WorkerUsingHeap, () => new DataWithSpacer(v));
        }
        Console.ReadLine();
    }

    public static void RunTest(string name, ParameterizedThreadStart worker, Func<object> fo)
    {
        var start = DateTime.UtcNow;
        RunOneThread(worker, fo);
        var middle = DateTime.UtcNow;
        RunFourThreads(worker, fo);
        var end = DateTime.UtcNow;
        Console.WriteLine("{0}\t{1}\t{2}", name, middle - start, end - middle);
    }

    public static void RunOneThread(ParameterizedThreadStart worker, Func<object> fo)
    {
        var data = fo();
        var threadOne = new Thread(worker);
        threadOne.Start(data);
        threadOne.Join();
    }

    public static void RunFourThreads(ParameterizedThreadStart worker, Func<object> fo)
    {
        var data1 = fo();
        var data2 = fo();
        var data3 = fo();
        var data4 = fo();
        var threadOne = new Thread(worker);
        threadOne.Start(data1);
        var threadTwo = new Thread(worker);
        threadTwo.Start(data2);
        var threadThree = new Thread(worker);
        threadThree.Start(data3);
        var threadFour = new Thread(worker);
        threadFour.Start(data4);
        threadOne.Join();
        threadTwo.Join();
        threadThree.Join();
        threadFour.Join();
    }

    static void WorkerUsingHeap(object state)
    {
        var data = state as Data;
        for (int count = 0; count < 500000000; count++)
        {
            var property = data.Property;
            data.Property = property + 1;
        }
    }

    public class Data
    {
        public int Property { get; set; }
    }

    public class DataWithSpacer : Data
    {
        // -1 means no spacer array at all ("null spacer").
        public DataWithSpacer(int size) { Spacer = size == -1 ? null : new byte[size]; }
        public byte[] Spacer;
    }
}
Result:
name          1 thread          4 threads
no spacer     00:00:06.3480000  00:00:42.6260000
null spacer   00:00:06.2300000  00:00:36.4030000
0B spacer     00:00:06.1920000  00:00:19.8460000
4B spacer     00:00:06.1870000  00:00:07.4150000
8B spacer     00:00:06.3750000  00:00:07.1260000
12B spacer    00:00:06.3420000  00:00:07.6930000
16B spacer    00:00:06.2250000  00:00:07.5530000
20B spacer    00:00:06.2170000  00:00:07.3670000
No spacer = 1/6th the speed, null spacer = 1/5th, 0B spacer = 1/3rd, 4B spacer = full speed.
I don't know the full details of how the CLR allocates or aligns objects, so I can't speak to what these allocation patterns look like in real memory, but these definitely are some interesting results.
I have a thread which produces data in the form of simple object (record). The thread may produce a thousand records for each one that successfully passes a filter and is actually enqueued. Once the object is enqueued it is read-only.
I have one lock, which I acquire once the record has passed the filter, and I add the item to the back of the producer_queue.
On the consumer thread, I acquire the lock, confirm that the producer_queue is not empty, set consumer_queue to equal producer_queue, create a new (empty) queue, and set it as the producer_queue. Without any further locking, I process consumer_queue until it's empty, then repeat.
Everything works beautifully on most machines, but on one particular dual-quad server I see, in roughly 1 in 500k iterations, an object that is not fully initialized when I read it out of consumer_queue. The condition is so fleeting that when I dump the object after detecting the condition, the fields are correct 90% of the time.
So my question is this: how can I assure that the writes to the object are flushed to main memory when the queue is swapped?
Edit:
On the producer thread:
(producer_queue above is m_fillingQueue; consumer_queue above is m_drainingQueue)
private void FillRecordQueue() {
    while (!m_done) {
        int count;
        lock (m_swapLock) {
            count = m_fillingQueue.Count;
        }
        if (count > 5000) {
            Thread.Sleep(60);
        } else {
            DataRecord rec = GetNextRecord();
            if (rec == null) break;
            lock (m_swapLock) {
                m_fillingQueue.AddLast(rec);
            }
        }
    }
}
In the consumer thread:
private DataRecord Next(bool remove) {
    bool drained = false;
    while (!drained) {
        if (m_drainingQueue.Count > 0) {
            DataRecord rec = m_drainingQueue.First.Value;
            if (remove) m_drainingQueue.RemoveFirst();
            if (rec.Time < FIRST_VALID_TIME) {
                throw new InvalidOperationException("Detected invalid timestamp in Next(): " + rec.Time + " from record " + rec);
            }
            return rec;
        } else {
            lock (m_swapLock) {
                m_drainingQueue = m_fillingQueue;
                m_fillingQueue = new LinkedList<DataRecord>();
                if (m_drainingQueue.Count == 0) drained = true;
            }
        }
    }
    return null;
}
The consumer is rate-limited, so it can't get ahead of the producer.
The behavior I see is that sometimes the Time field is reading as DateTime.MinValue; by the time I construct the string to throw the exception, however, it's perfectly fine.
Have you tried the obvious: is the microcode update applied on the fancy 8-core box (via a BIOS update)? Did you run Windows Update to get the latest processor driver?
At first glance, it looks like you're locking your containers correctly. So I am recommending the systems approach, as it sounds like you're not seeing this issue on a good ol' dual-core box.
Assuming these are in fact the only methods that interact with the m_fillingQueue variable, and that DataRecord cannot be changed after GetNextRecord() creates it (read-only properties, hopefully?), then the code, at least on the face of it, appears to be correct.
In which case I suggest that GregC's answer would be the first thing to check: make sure the failing machine is fully updated (OS / drivers / .NET Framework), because the lock statement should involve all the required memory barriers to ensure that the rec variable is fully flushed out of any caches before the object is added to the list.