Understanding non-blocking thread synchronization and Thread.MemoryBarrier - C#

In this threading online book: http://www.albahari.com/threading/part4.aspx
there's an example of Thread.MemoryBarrier():
class Foo
{
    int _answer;
    bool _complete;

    void A()
    {
        _answer = 123;
        Thread.MemoryBarrier(); // Barrier 1
        _complete = true;
        Thread.MemoryBarrier(); // Barrier 2
    }

    void B()
    {
        Thread.MemoryBarrier(); // Barrier 3
        if (_complete)
        {
            Thread.MemoryBarrier(); // Barrier 4
            Console.WriteLine(_answer);
        }
    }
}
We had a discussion about whether any thread blocking is going on here.
I'm thinking there is some, especially given that
A full fence takes around ten nanoseconds on a 2010-era desktop.
On the other hand, a full fence is only supposed to disable instruction reordering and caching, which by the sound of it doesn't qualify as thread blocking (unlike lock, where it's clear that a thread waits for another to release the lock before continuing, and is blocked during that time).
About that thread "blocked state": I'm talking not in terms of whether the thread is put into a blocked state or not, but in terms of whether there is some thread synchronization happening, meaning one thread is not able to run while another isn't letting it do so, by means of MemoryBarrier in this case.
I'd also like to get a clear understanding of what each barrier achieves. For example, Barrier 2: how exactly does it provide a freshness guarantee, and how is it connected to Barrier 3? If someone would explain in detail what each barrier's purpose is here (what could possibly go wrong if 1 or 2 or 3 or 4 weren't there), I think I'd improve my understanding of this greatly.
EDIT: It's mostly clear now what 1, 2, and 3 do. However, what 4 does that 3 doesn't is still unclear.

The fact that instructions take time to execute does not imply that a thread is blocked. A thread is blocked when it is specifically put into a blocked state, which MemoryBarrier() does not do.
The processor instructions that actually prevent instruction reordering and flush the caches take time, because they must wait for the caches to become coherent again. During that time, the thread is still considered to be running.
Update: So let's take a look at what's actually happening in the example, and what each memory barrier actually does.
As the link says, 1 and 4 ensure that the correct answers are produced. That's because 1 ensures that the answers are flushed into memory, and 4 ensures that the read caches are flushed prior to retrieving the variables.
2 and 3 ensure that if A runs first, then B will always print the answers. Barrier 2 ensures that the write of true is flushed to memory, and barrier 3 ensures that the read caches are flushed before testing _complete's value.
The cache and memory flushing should be clear enough, so let's look at instruction reordering. The way the compiler, CLR and CPU know they can reorder instructions is by analyzing a set of instructions in sequence. When they see the barrier instruction in the middle of a sequence, they know that instructions can't move across that boundary. That ensures that in addition to cache freshness, the instructions occur in the correct order.
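To make the failure modes tangible, here is a small harness of my own (not from the book) with all four barriers removed. On a strongly ordered x86 machine you may never see it misbehave; on a weaker memory model, or under aggressive JIT optimization, B can observe _complete == true and still print a stale _answer of 0:

using System;
using System.Threading;

class FooNoBarriers
{
    int _answer;
    bool _complete;

    public void A()
    {
        _answer = 123;      // without Barrier 1, this store may be reordered...
        _complete = true;   // ...past this one by the compiler, JIT, or CPU
    }

    public void B()
    {
        if (_complete)                   // without Barrier 3, may read a stale false
            Console.WriteLine(_answer);  // without Barrier 4, may print 0
    }

    static void Main()
    {
        // Repeatedly race A against B on fresh instances; any "0" output is a reorder.
        for (int i = 0; i < 100000; i++)
        {
            var foo = new FooNoBarriers();
            var t1 = new Thread(foo.A);
            var t2 = new Thread(foo.B);
            t1.Start(); t2.Start();
            t1.Join(); t2.Join();
        }
    }
}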

Related

Is Interlocked.CompareExchange really faster than a simple lock?

I came across a ConcurrentDictionary implementation for .NET 3.5 (I'm sorry, I couldn't find the link right now) that uses this approach for locking:
var current = Thread.CurrentThread.ManagedThreadId;
while (Interlocked.CompareExchange(ref owner, current, 0) != current) { }
// PROCESS SOMETHING HERE
if (current != Interlocked.Exchange(ref owner, 0))
    throw new UnauthorizedAccessException("Thread had access to cache even though it shouldn't have.");
Instead of the traditional lock:
lock (lockObject)
{
    // PROCESS SOMETHING HERE
}
The question is: Is there any real reason for doing this? Is it faster or have some hidden benefit?
PS: I know there's a ConcurrentDictionary in the latest versions of .NET, but I can't use it in a legacy project.
Edit:
In my specific case, what I'm doing is just manipulating an internal Dictionary class in such a way that it's thread safe.
Example:
public bool RemoveItem(TKey key)
{
    // acquire the lock
    var current = Thread.CurrentThread.ManagedThreadId;
    while (Interlocked.CompareExchange(ref owner, current, 0) != current) { }

    // real processing starts here (entries is a regular `Dictionary` instance)
    var found = entries.Remove(key);

    // release and verify the lock
    if (current != Interlocked.Exchange(ref owner, 0))
        throw new UnauthorizedAccessException("Thread had access to cache even though it shouldn't have.");

    return found;
}
As @doctorlove suggested, this is the code: https://github.com/miensol/SimpleConfigSections/blob/master/SimpleConfigSections/Cache.cs
There is no definitive answer to your question. I would answer: it depends.
What the code you've provided is doing is:
wait for an object to be in a known state (threadId == 0 == no current work)
do work
set back the known state to the object
another thread can now do work too, because it can go from step 1 to step 2
As you've noted, the loop in the code is what actually does the "wait" step. The thread isn't blocked until it can access the critical section; it just burns CPU instead. Try replacing your processing (in your case, the call to Remove) with Thread.Sleep(2000), and you'll see the other "waiting" thread consuming all of one of your CPUs for 2 seconds in the loop.
Which one is better therefore depends on several factors, for example: How many concurrent accesses are there? How long does the operation take to complete? How many CPUs do you have?
I would use lock instead of Interlocked because it's much easier to read and maintain. The exception would be when you've got a piece of code called millions of times, and a particular use case where you're sure Interlocked is faster.
So you'll have to measure both approaches yourself. If you don't have time for that, then you probably don't need to worry about performance, and you should use lock.
Your CompareExchange sample code doesn't release the lock if an exception is thrown by "PROCESS SOMETHING HERE".
For this reason as well as the simpler, more readable code, I would prefer the lock statement.
You could rectify the problem with a try/finally, but this makes the code even uglier.
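For illustration, here is a sketch of what that try/finally variant would look like, keeping the question's original comparison logic:

var current = Thread.CurrentThread.ManagedThreadId;
while (Interlocked.CompareExchange(ref owner, current, 0) != current) { }
try
{
    // PROCESS SOMETHING HERE
}
finally
{
    // release the lock even if the processing throws
    if (current != Interlocked.Exchange(ref owner, 0))
        throw new UnauthorizedAccessException("Thread had access to cache even though it shouldn't have.");
}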
The linked ConcurrentDictionary implementation has a bug: it will fail to release the lock if the caller passes a null key, potentially leaving other threads spinning indefinitely.
As for efficiency, your CompareExchange version is essentially a spinlock, which can be efficient if threads are only likely to be blocked for short periods. But inserting into a managed dictionary can take a relatively long time, since it may be necessary to resize the dictionary. Therefore, IMHO, this isn't a good candidate for a spinlock, which can be wasteful, especially on a single-processor system.
A little bit late... I have read your sample, but in short:
Fastest to slowest MT synchronization:
Interlocked.* => a CPU atomic instruction. Can't be beaten if it is sufficient for your needs.
SpinLock => uses Interlocked behind the scenes and is really fast, but burns CPU while it waits. Do not use it for code that waits a long time (it is usually used to prevent thread switching for locks that guard quick actions); see the sketch below. If you often have to wait for more than one thread quantum, I would suggest going with lock.
Lock => the slowest, but easier to use and read than SpinLock. The instruction itself is very fast, but if it can't acquire the lock it will relinquish the CPU. Behind the scenes, it will do a WaitForSingleObject on a kernel object (CriticalSection), and Windows will give CPU time to the thread only when the lock is freed by the thread that acquired it.
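A minimal SpinLock usage sketch (System.Threading.SpinLock exists from .NET 4.0; it is a struct, so it must live in a field and never be copied):

using System.Threading;

class Counter
{
    private SpinLock _spin = new SpinLock();   // struct: keep it in a field, never copy it
    private int _value;

    public void Increment()
    {
        bool lockTaken = false;
        try
        {
            _spin.Enter(ref lockTaken);   // spins (burns CPU) until acquired
            _value++;                     // keep the protected region very short
        }
        finally
        {
            if (lockTaken) _spin.Exit();
        }
    }
}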
Have fun with MT!
The docs for the Interlocked class tell us it
"Provides atomic operations for variables that are shared by multiple threads. "
The theory is that an atomic operation can be faster than a lock. Albahari gives some further details on interlocked operations, stating that they are faster.
Note that Interlocked provides a "smaller" interface than Lock - see previous question here
Yes.
The Interlocked class offers atomic operations, which means they do not block other code like a lock does, because they don't really need to.
When you lock a block of code you want to make sure no two threads are in it at the same time; that means that when a thread is inside, all other threads wait to get in, which uses resources (CPU time and idle threads).
The atomic operations, on the other hand, do not need to block other atomic operations, because they are atomic. Conceptually each one is a single CPU operation; the next one simply goes in right after the previous one, and you're not wasting threads on just waiting. (By the way, that's why Interlocked is limited to very basic operations like Increment, Exchange, etc.)
I think a lock (which is a Monitor underneath) uses Interlocked to know if the lock is already taken, but it can't know that the actions inside it are atomic.
In most cases, though, the difference is not critical. But you need to verify that for your specific case.
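To make the "lock is a Monitor underneath" point concrete, here is roughly what the compiler expands a lock block into (the C# 3 / .NET 3.5 shape relevant to this question; C# 4 adds a lock-taken flag):

// What lock (lockObject) { ... } compiles to, approximately:
Monitor.Enter(lockObject);
try
{
    // PROCESS SOMETHING HERE
}
finally
{
    Monitor.Exit(lockObject);   // always released, even if the body throws
}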
Interlocked is faster, as already explained in other comments, and you can also define how the wait is implemented (e.g. SpinWait.SpinOnce, SpinWait.SpinUntil, Thread.Sleep, etc.) once the lock fails the first time. Also, if the code within the lock is expected to run without any possibility of crashing (custom code/delegates/resource resolution or allocation/events/unexpected code executed during the lock), and you are not going to catch the exception to let your software continue execution, the "try"/"finally" is also skipped, so there's extra speed there. lock(something) makes sure that if you catch the exception from outside, that something gets unlocked, just like "using" in C# makes sure the "used" disposable object is disposed when execution exits the block for whatever reason.
One important difference between lock and Interlocked.CompareExchange is how they can be used in async environments.
Async operations cannot be awaited inside a lock, as they can easily result in deadlocks if the thread that continues execution after the await is not the same one that originally acquired the lock.
This is not a problem with Interlocked, however, because nothing is "acquired" by a thread.
Another solution for asynchronous code, which may offer better readability than Interlocked, is a semaphore, as described in this blog post:
https://blog.cdemi.io/async-waiting-inside-c-sharp-locks/
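A minimal sketch of that semaphore pattern (SomeOperationAsync is a placeholder of mine; SemaphoreSlim.WaitAsync requires .NET 4.5+):

private static readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);

public async Task DoWorkAsync()
{
    await _gate.WaitAsync();          // asynchronous wait: no thread sits blocked
    try
    {
        await SomeOperationAsync();   // awaiting here is fine, unlike inside a lock
    }
    finally
    {
        _gate.Release();
    }
}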

C# memory model and non volatile variable initialized before the other thread creation

I have a question related to the C# memory model and threads. I am not sure if the following code is correct without the volatile keyword.
public class A
{
    private int variableA = 0;

    public A()
    {
        variableA = 1;
        // Thread.Start() returns void, so the thread has to be stored first
        var b = new Thread(new ThreadStart(() => printA()));
        b.Start();
    }

    private void printA()
    {
        System.Console.WriteLine(variableA);
    }
}
My concern is whether it is guaranteed that thread B will see variableA with the value 1 without using volatile. In the main thread I am only assigning 1 to variableA in the constructor. After that I am not touching variableA; it is used only in thread B, so locking is probably not necessary.
But is it guaranteed that the main thread will flush its cache and write the variableA contents to main memory, so the second thread can read the newly assigned value?
Additionally, is it guaranteed that the second thread will read the variableA contents from main memory? Might some compiler optimizations occur so that thread B reads the variableA contents from a cache instead of main memory? It could happen if the order of the instructions is changed.
For sure, adding volatile to the variableA declaration will make the code correct. But is it necessary? I am asking because I wrote some code where non-volatile variables are initialized in the constructor and used later by some Timer threads, and I am not sure if it is totally correct.
What about the same code in Java?
Thanks, Michal
There are a lot of places where implicit memory barriers are created. This is one of them. Starting a thread creates a full barrier. So the write to variableA will get committed before the thread starts, and the first reads will be acquired from main memory. Of course, in Microsoft's implementation of the CLR that is somewhat of a moot point because writes already have volatile semantics. But the same guarantee is not made in the ECMA specification, so it is theoretically possible that the Mono implementation could behave differently in this regard.
My concern is whether it is guaranteed that thread B will see variableA with the value 1 without using volatile.
In this case... yes. However, if you continue to use variableA in the second thread, there is no guarantee after the first read that it will see updates.
But is it guaranteed that the main thread will flush its cache and write the variableA contents to main memory, so the second thread can read the newly assigned value?
Yes.
Additionally, is it guaranteed that the second thread will read the variableA contents from main memory?
Yes, but only on the first read.
For sure, adding volatile to the variableA declaration will make the code correct. But is it necessary?
In this very specific and narrow case...no. But, in general it is advised that you use the volatile keyword in these scenarios. Not only will it make your code thread-safe as the scenario gets more complicated, but it also helps to document the fact that the field is going to be used by more than one thread and that you have considered the implications of using a lock-free strategy.
The same code in Java is definitely okay - the creation of a new thread acts as a sort of barrier, effectively. (All actions earlier in the program text than the thread creation "happen before" the new thread starts.)
I don't know what's guaranteed in .NET with respect to new thread creation, however. Even more worrying is the possibility of a delayed read when using Control.BeginInvoke and the like... I haven't seen any guarantees around memory barriers for those situations.
To be honest, I suspect it's fine. I suspect that anything which needs to coordinate between threads like this (either creating a new one or marshalling a call onto an existing one) will use a full memory barrier on both of the threads involved. However, you're absolutely right to be concerned, and I'm hoping that you'll get a more definitive answer from someone smarter than me. You might want to email Joe Duffy to get his point of view on this...
But is it guaranteed that the main thread will flush its cache and write the variableA contents to main memory,
Yes, this is guaranteed by the MS CLR memory model. Not necessarily so for other implementations of the CLI (ie, I'm not sure about Mono). The ECMA standard does not require it.
so the second thread can read the newly assigned value?
That requires that the cache has been refreshed. It is probably guaranteed by the creation of the thread (as Jon Skeet said). It is, however, not guaranteed by the previous point: the cache is flushed on each write, but not on each read.
You could make very sure by using Thread.VolatileRead(ref variableA), but Jeffrey Richter recommends using the Interlocked class instead. Note that VolatileWrite() is superfluous in MS.NET.
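A sketch of that Interlocked idiom (my example, not Richter's exact code): CompareExchange with an identical comparand and replacement value performs a full-fence read without ever changing the field:

// Fenced read of variableA: if it equals 0 it is "replaced" with 0 (a no-op);
// either way the original value is returned with full-fence semantics.
int snapshot = Interlocked.CompareExchange(ref variableA, 0, 0);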

Why do I need a memory barrier?

C# 4 in a Nutshell (highly recommended btw) uses the following code to demonstrate the concept of MemoryBarrier (assuming A and B were run on different threads):
class Foo
{
    int _answer;
    bool _complete;

    void A()
    {
        _answer = 123;
        Thread.MemoryBarrier(); // Barrier 1
        _complete = true;
        Thread.MemoryBarrier(); // Barrier 2
    }

    void B()
    {
        Thread.MemoryBarrier(); // Barrier 3
        if (_complete)
        {
            Thread.MemoryBarrier(); // Barrier 4
            Console.WriteLine(_answer);
        }
    }
}
They mention that Barriers 1 & 4 prevent this example from writing 0, and that Barriers 2 & 3 provide a freshness guarantee: they ensure that if B ran after A, reading _complete would evaluate to true.
I'm not really getting it. I think I understand why Barriers 1 & 4 are necessary: we don't want the write to _answer to be optimized and placed after the write to _complete (Barrier 1), and we need to make sure that _answer is not cached (Barrier 4). I also think I understand why Barrier 3 is necessary: if A ran until just after writing _complete = true, B would still need to refresh _complete to read the right value.
I don't understand, though, why we need Barrier 2! Part of me says that it's because perhaps Thread 2 (running B) already ran until (but not including) if (_complete), and so we need to ensure that _complete is refreshed.
However, I don't see how this helps. Isn't it still possible that _complete will be set to true in A and yet the B method will see a cached (false) version of _complete? I.e., if Thread 2 ran method B until after the first MemoryBarrier, and then Thread 1 ran method A until _complete = true but no further, and then Thread 2 resumed and tested if (_complete) -- couldn't that if evaluate to false?
Barrier #2 guarantees that the write to _complete gets committed immediately. Otherwise it could remain in a queued state, meaning that the read of _complete in B would not see the change caused by A, even though B effectively used a volatile read.
Of course, this example does not quite do justice to the problem, because A does nothing more after writing to _complete, which means that the write will be committed immediately anyway since the thread terminates early.
The answer to your question of whether the if could still evaluate to false is yes, for exactly the reasons you stated. But notice what the author says regarding this point.
Barriers 1 and 4 prevent this example from writing “0”. Barriers 2 and 3 provide a freshness guarantee: they ensure that if B ran after A, reading _complete would evaluate to true.
The emphasis on "if B ran after A" is mine. It certainly could be the case that the two threads interleave. But, the author was ignoring this scenario presumably to make his point regarding how Thread.MemoryBarrier works simpler.
By the way, I had a hard time contriving an example on my machine where barriers #1 and #2 would have altered the behavior of the program. This is because the memory model regarding writes was strong in my environment. Perhaps, if I had a multiprocessor machine, was using Mono, or had some other different setup I could have demonstrated it. Of course, it was easy to demonstrate that removing barriers #3 and #4 had an impact.
The example is unclear for two reasons:
It is too simple to fully show what's happening with the fences.
Albahari is including requirements for non-x86 architectures. See MSDN: "MemoryBarrier is required only on multiprocessor systems with weak memory ordering (for example, a system employing multiple Intel Itanium processors [which Microsoft no longer supports]).".
If you consider the following, it becomes clearer:
A memory barrier (a full barrier here - .NET doesn't provide a half-barrier) prevents read/write instructions from jumping the fence (due to various optimisations). It guarantees us that the code after the fence will execute after the code before the fence.
"This serializing operation guarantees that every load and store instruction that precedes in program order the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction is globally visible." See here.
x86 CPUs have a strong memory model and guarantee writes appear consistent to all threads / cores (therefore barriers #2 & #3 are unneeded on x86). But, we are not guaranteed that reads and writes will remain in coded sequence, hence the need for barriers #1 and #4.
Memory barriers are costly and often unnecessary (see the same MSDN article). I personally use Interlocked and volatile (make sure you know how to use them correctly!!), which work efficiently and are easy to understand.
P.S. This article explains the inner workings of x86 nicely.

Thread.VolatileRead Implementation

I'm looking at the implementation of the VolatileRead/VolatileWrite methods (using Reflector), and I'm puzzled by something.
This is the implementation for VolatileRead:
[MethodImpl(MethodImplOptions.NoInlining)]
public static int VolatileRead(ref int address)
{
    int num = address;
    MemoryBarrier();
    return num;
}
How come the memory barrier is placed after reading the value of "address"? Isn't it supposed to be the opposite: placed before reading the value, so that any pending writes to "address" will be completed by the time we make the actual read?
The same thing goes for VolatileWrite, where the memory barrier is placed before the assignment of the value. Why is that?
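For reference, the mirrored VolatileWrite body looks like this in the same decompilation (quoted from memory; worth re-checking in Reflector):

[MethodImpl(MethodImplOptions.NoInlining)]
public static void VolatileWrite(ref int address, int value)
{
    MemoryBarrier();
    address = value;
}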
Also, why do these methods have the NoInlining attribute? What could happen if they were inlined?
I thought that too until recently. Volatile reads aren't what you think they are - they're not about guaranteeing that they get the most recent value; they're about making sure that no read which is later in the program code is moved to before this read. That's what the spec guarantees - and likewise for volatile writes, it guarantees that no earlier write is moved to after the volatile one.
You're not alone in suspecting this code, but Joe Duffy explains it better than I can :)
My answer to this is to give up on lock-free coding other than by using things like PFX which are designed to insulate me from it. The memory model is just too hard for me - I'll leave it to the experts, and stick with things that I know are safe.
One day I'll update my threading article to reflect this, but I think I need to be able to discuss it more sensibly first...
(I don't know about the no-inlining part, btw. I suspect that inlining could introduce some other optimizations which aren't meant to happen around volatile reads/writes, but I could easily be wrong...)
Maybe I am oversimplifying, but I think the explanations about reordering and cache coherency and so on give too many details.
So, why does the MemoryBarrier come after the actual read?
I will try to explain this with an example that uses object instead of int.
One may think the correct sequence is:
Thread 1 creates the object (initializes its inner data).
Thread 1 then puts the object into a variable.
Then it "does a fence" and all threads see the new value.
Then, the read is something like this:
Thread 2 "does a fence".
Thread 2 reads the object instance.
Thread 2 is sure that it has all the inner data of that instance (as it started with a fence).
The biggest problem with this is:
Thread 1 creates the object and initializes it.
Thread 1 then puts the object into a variable.
Before the thread flushes the cache, the CPU itself flushes part of the cache... it commits only the address of the variable (not the contents of that variable).
At that moment, Thread 2 had already flushed its cache. So it is going to read everything from the main memory.
So, it reads the variable (it is there).
Then it reads the content (it is not there).
Finally, after all this, the CPU 1 executes the Thread 1 that does the fence.
So, what happens with the volatile write and read?
The volatile write makes the contents of the object go to memory immediately (it starts with the fence), and then it sets the variable (which may not go immediately to real memory).
Then, the volatile read will first clear the cache and then read the field. If it receives a value when reading the field, it is certain that the contents pointed to by that reference are really there.
Because of those little things, yes, it is possible that you do a VolatileWrite(1) and another thread still sees the value of zero. But as soon as other threads see the value of 1 (using a volatile read), all the other items that may be referenced are already there. You can't really tell otherwise, since when reading the old value (0 or null) you may simply not progress, considering that you don't yet have everything you need.
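To make that concrete, here is a sketch of the publication pattern being described (SomeData is a hypothetical class of mine; the object overloads of Thread.VolatileRead/VolatileWrite are used):

using System;
using System.Threading;

class SomeData { public int Value; }

class Publisher
{
    private object _data;   // accessed only through the Volatile* helpers

    public void Publish()
    {
        var item = new SomeData { Value = 123 };   // initialize the inner data first
        Thread.VolatileWrite(ref _data, item);     // fence, THEN store the reference
    }

    public void Consume()
    {
        var item = (SomeData)Thread.VolatileRead(ref _data);   // load, THEN fence
        if (item != null)
            Console.WriteLine(item.Value);   // if the reference was seen, Value is visible too
    }
}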
I have already seen discussions suggesting that, even if it flushes the caches twice, the right pattern would be:
MemoryBarrier - will flush other variables changed before this call
Write
MemoryBarrier - will guarantee that the write was flushed
The read will then need the same:
MemoryBarrier
Read - guarantees that we see the latest info... maybe one that was put there AFTER our memory barrier.
As something may have appeared after our MemoryBarrier and was already read, we must put another MemoryBarrier to access the contents.
Those could be two write fences or two read fences if those existed in .NET.
I am not sure about everything I said... this is a "compilation" of a lot of information I came across, and it does explain why VolatileRead and VolatileWrite appear to be reversed, and also guarantees that no invalid values are read when using them.

How do memory fences affect the "freshness" of data?

I have a question about the following code sample (taken from: http://www.albahari.com/threading/part4.aspx#_NonBlockingSynch)
class Foo
{
    int _answer;
    bool _complete;

    void A()
    {
        _answer = 123;
        Thread.MemoryBarrier(); // Barrier 1
        _complete = true;
        Thread.MemoryBarrier(); // Barrier 2
    }

    void B()
    {
        Thread.MemoryBarrier(); // Barrier 3
        if (_complete)
        {
            Thread.MemoryBarrier(); // Barrier 4
            Console.WriteLine(_answer);
        }
    }
}
This is followed by this explanation:
"Barriers 1 and 4 prevent this example from writing “0”. Barriers 2 and 3 provide a freshness guarantee: they ensure that if B ran after A, reading _complete would evaluate to true."
I understand how using the memory barriers affects instruction reordering, but what is this "freshness guarantee" that is mentioned?
Later in the article, the following example is also used:
static void Main()
{
    bool complete = false;
    var t = new Thread(() =>
    {
        bool toggle = false;
        while (!complete)
        {
            toggle = !toggle;
            // adding a call to Thread.MemoryBarrier() here fixes the problem
        }
    });
    t.Start();
    Thread.Sleep(1000);
    complete = true;
    t.Join(); // Blocks indefinitely
}
This example is followed by this explanation:
"This program never terminates because the complete variable is cached in a CPU register. Inserting a call to Thread.MemoryBarrier inside the while-loop (or locking around reading complete) fixes the error."
So again ... what happens here?
In the first case, Barrier 1 ensures _answer is written BEFORE _complete. Regardless of how the code is written, or how the compiler or CLR instructs the CPU, the memory bus read/write queues can reorder the requests. The Barrier basically says "flush the queue before continuing". Similarly, Barrier 4 makes sure _answer is read AFTER _complete. Otherwise CPU2 could reorder things and see an old _answer with a "new" _complete.
Barriers 2 and 3 are, in some sense, useless. Note that the explanation contains the word "after": i.e. "... if B ran after A ...". What does it mean for B to run after A? If B and A are on the same CPU, then sure, B can be after. But in that case, the same CPU means no memory barrier problems.
So consider B and A running on different CPUs. Now, very much like Einstein's relativity, the concept of comparing times at different locations/CPUs doesn't really make sense.
Another way of thinking about it - can you write code that can tell whether B ran after A? If so, you probably used memory barriers to do that. Otherwise, you can't tell, and it doesn't make sense to ask. It's also similar to Heisenberg's uncertainty principle - if you can observe it, you've modified the experiment.
But leaving physics aside, let's say you could open the hood of your machine and see that the actual memory location of _complete was true (because A had run). Now run B. Without Barrier 3, CPU2 might STILL NOT see _complete as true - i.e. not "fresh".
But you probably can't open your machine and look at _complete. Nor communicate your findings to B on CPU2. Your only communication is what the CPUs themselves are doing. So if they can't determine BEFORE/AFTER without barriers, asking "what happens to B if it runs after A, without barriers" makes no sense.
By the way, I'm not sure what you have available in C#, but what is typically done, and what is really needed for code sample #1, is a single release barrier on the write and a single acquire barrier on the read:
void A()
{
    _answer = 123;
    WriteWithReleaseBarrier(_complete, true); // "publish" values
}

void B()
{
    if (ReadWithAcquire(_complete)) // subscribe
    {
        Console.WriteLine(_answer);
    }
}
The word "subscribe" isn't often used to describe the situation, but "publish" is. I suggest you read Herb Sutter's articles on threading.
This puts the barriers in exactly the right places.
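In C# these two halves do exist: from .NET 4.5, Volatile.Write is a release-style store and Volatile.Read is an acquire-style load (a volatile field gives the same semantics implicitly). A sketch of the example rewritten in those terms:

void A()
{
    _answer = 123;
    Volatile.Write(ref _complete, true);   // release: the _answer store can't move below this
}

void B()
{
    if (Volatile.Read(ref _complete))      // acquire: the _answer load can't move above this
        Console.WriteLine(_answer);
}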
For code sample #2, this isn't really a memory barrier problem; it is a compiler optimization issue - the compiler is keeping complete in a register. A memory barrier would force it out, as would volatile, but probably so would calling an external function: if the compiler can't tell whether that external function modified complete or not, it will re-read it from memory. I.e., maybe pass the address of complete to some function (defined somewhere where the compiler can't examine its details):
while (!complete)
{
    some_external_function(&complete);
}
Even if the function doesn't modify complete, if the compiler isn't sure, it will need to reload its registers.
I.e. the difference between code sample #1 and code sample #2 is that sample #1 only has problems when A and B are running on separate threads, while sample #2 could have problems even on a single-threaded machine.
Actually, the other question would be: can the compiler completely remove the while loop? If it thinks complete is unreachable by other code, why not? I.e. if it decided to move complete into a register, it might as well remove the loop completely.
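A sketch of the usual fix for code sample #2: make the flag a volatile field (locals can't be volatile), which forbids both the register-caching and the loop-removal optimizations:

using System;
using System.Threading;

class Program
{
    static volatile bool _complete;   // volatile: must be re-read from memory each time

    static void Main()
    {
        var t = new Thread(() =>
        {
            bool toggle = false;
            while (!_complete)        // no longer cached in a register
                toggle = !toggle;
        });
        t.Start();
        Thread.Sleep(1000);
        _complete = true;
        t.Join();                     // now terminates
    }
}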
EDIT: To answer the comment from opc (my answer is too big for the comment block):
Barrier 3 forces the CPU to flush any pending read (and write) requests.
So imagine if there was some other reads before reading _complete:
void B()
{
    int x = a * b + c * d; // read a, b, c, d
    Thread.MemoryBarrier(); // Barrier 3
    if (_complete)
        ...
Without the barrier, the CPU might have all of these 5 read requests 'pending':
a,b,c,d,_complete
Without the barrier, the processor could reorder these requests to optimize memory access (ie if _complete and 'a' were on the same cache line or something).
With the barrier, the CPU gets a,b,c,d back from memory BEFORE _complete is even put in as a request. ENSURING 'b' (for example) is read BEFORE _complete - ie no reordering.
The question is - what difference does it make?
If a, b, c, d are independent from _complete, then it doesn't matter. All the barrier does is SLOW THINGS DOWN. So yeah, _complete is read later, so the data is fresher. Putting a Sleep(100) or some busy-wait for-loop in there before the read would make it "fresher" as well! :-)
So the point is - keep it relative. Does the data need to be read/written BEFORE/AFTER relative to some other data or not? That's the question.
And not to put down the author of the article - he does mention "if B ran after A...". It just isn't exactly clear whether he imagines that B running after A is crucial to the code, observable by the code, or just inconsequential.
Code sample #1:
Each processor core contains a cache with a copy of a portion of memory. It may take a bit of time for the cache to be updated. The memory barriers guarantee that the caches are synchronized with main memory. For example, if you didn't have barriers 2 and 3 here, consider this situation:
Processor 1 runs A(). It writes the new value of _complete to its cache (but not necessarily to main memory yet).
Processor 2 runs B(). It reads the value of _complete. If this value was previously in its cache, it may not be fresh (i.e., not synchronized with main memory), so it would not get the updated value.
Code sample #2:
Normally, variables are stored in memory. However, suppose a value is read multiple times in a single function: As an optimization, the compiler may decide to read it into a CPU register once, and then access the register each time it is needed. This is much faster, but prevents the function from detecting changes to the variable from another thread.
The memory barrier here forces the function to re-read the variable value from memory.
Calling Thread.MemoryBarrier() immediately refreshes the register caches with the actual values for variables.
In the first example, the "freshness" of _complete is provided by calling the method right after setting it and right before using it. In the second example, the initial false value of the variable complete will be cached in the thread's own space and needs to be resynchronized in order for the thread to immediately see the actual "outside" value from "inside" the running loop.
The "freshness" guarantee simply means that Barriers 2 and 3 force the values of _complete to be visible as soon as possible as opposed to whenever they happen to be written to memory.
It's actually unnecessary from a consistency point of view, since Barriers 1 and 4 ensure that answer will be read after reading complete.
