How do memory fences affect the "freshness" of data? - c#

I have a question about the following code sample (taken from: http://www.albahari.com/threading/part4.aspx#_NonBlockingSynch)
class Foo
{
    int _answer;
    bool _complete;

    void A()
    {
        _answer = 123;
        Thread.MemoryBarrier(); // Barrier 1
        _complete = true;
        Thread.MemoryBarrier(); // Barrier 2
    }

    void B()
    {
        Thread.MemoryBarrier(); // Barrier 3
        if (_complete)
        {
            Thread.MemoryBarrier(); // Barrier 4
            Console.WriteLine (_answer);
        }
    }
}
This is followed by the following explanation:
"Barriers 1 and 4 prevent this example from writing “0”. Barriers 2 and 3 provide a freshness guarantee: they ensure that if B ran after A, reading _complete would evaluate to true."
I understand how using the memory barriers affects instruction reordering, but what is this "freshness guarantee" that is mentioned?
Later in the article, the following example is also used:
static void Main()
{
    bool complete = false;
    var t = new Thread (() =>
    {
        bool toggle = false;
        while (!complete)
        {
            toggle = !toggle;
            // adding a call to Thread.MemoryBarrier() here fixes the problem
        }
    });
    t.Start();
    Thread.Sleep (1000);
    complete = true;
    t.Join(); // Blocks indefinitely
}
This example is followed by this explanation:
"This program never terminates because the complete variable is cached in a CPU register. Inserting a call to Thread.MemoryBarrier inside the while-loop (or locking around reading complete) fixes the error."
So again ... what happens here?

In the first case, Barrier 1 ensures _answer is written BEFORE _complete. Regardless of how the code is written, or how the compiler or CLR instructs the CPU, the memory bus read/write queues can reorder the requests. The Barrier basically says "flush the queue before continuing". Similarly, Barrier 4 makes sure _answer is read AFTER _complete. Otherwise CPU2 could reorder things and see an old _answer with a "new" _complete.
Barriers 2 and 3 are, in some sense, useless. Note that the explanation contains the word "after": ie "... if B ran after A, ...". What does it mean for B to run after A? If B and A are on the same CPU, then sure, B can be after. But in that case, same CPU means no memory barrier problems.
So consider B and A running on different CPUs. Now, very much like Einstein's relativity, the concept of comparing times at different locations/CPUs doesn't really make sense.
Another way of thinking about it - can you write code that can tell whether B ran after A? If so, you probably used memory barriers to do it. Otherwise, you can't tell, and it doesn't make sense to ask. It's also similar to Heisenberg's principle - if you can observe it, you've modified the experiment.
But leaving physics aside, let's say you could open the hood of your machine and see that the actual memory location of _complete was true (because A had run). Now run B. Without Barrier 3, CPU2 might STILL NOT see _complete as true. ie not "fresh".
But you probably can't open your machine and look at _complete. Nor communicate your findings to B on CPU2. Your only communication is what the CPUs themselves are doing. So if they can't determine BEFORE/AFTER without barriers, asking "what happens to B if it runs after A, without barriers" makes no sense.
By the way, I'm not sure what you have available in C#, but what is typically done, and what is really needed for Code sample #1, is a single release barrier on the write and a single acquire barrier on the read:
void A()
{
    _answer = 123;
    WriteWithReleaseBarrier(_complete, true); // "publish" values
}

void B()
{
    if (ReadWithAcquire(_complete)) // subscribe
    {
        Console.WriteLine (_answer);
    }
}
The word "subscribe" isn't often used to describe the situation, but "publish" is. I suggest you read Herb Sutter's articles on threading.
This puts the barriers in exactly the right places.
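In C# specifically, the System.Threading.Volatile class provides these half-fences. Here's a sketch of the same publish/subscribe pattern, assuming .NET 4.5 or later (my own illustration, not the article's code):

using System;
using System.Threading;

class Foo
{
    int _answer;
    bool _complete;

    void A()
    {
        _answer = 123;
        Volatile.Write (ref _complete, true);  // release: the write to _answer can't move below this
    }

    void B()
    {
        if (Volatile.Read (ref _complete))     // acquire: the read of _answer can't move above this
            Console.WriteLine (_answer);       // prints 123 if the flag was seen
    }
}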
For Code sample #2, this isn't really a memory barrier problem; it is a compiler optimization issue - the compiler is keeping complete in a register. A memory barrier would force it out, as would volatile, but probably so would calling an external function: if the compiler can't tell whether that external function modified complete or not, it will re-read it from memory. ie maybe pass the address of complete to some function (defined somewhere the compiler can't examine its details):
while (!complete)
{
    some_external_function(&complete);
}
Even if the function doesn't modify complete, if the compiler isn't sure, it will need to reload its registers.
ie the difference between sample #1 and sample #2 is that sample #1 only has problems when A and B are running on separate CPUs; sample #2 could have problems even on a single-processor machine.
Actually, the other question would be - can the compiler completely remove the while loop? If it thinks complete is unreachable by other code, why not? ie if it decided to move complete into a register, it might as well remove the loop completely.
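In C#, the idiomatic fix is the volatile keyword. A minimal sketch, under the assumption that the captured local is hoisted to a field (C# doesn't allow volatile locals):

using System;
using System.Threading;

class Program
{
    // volatile: the compiler/JIT may not cache this flag in a register
    static volatile bool _complete;

    static void Main()
    {
        var t = new Thread (() =>
        {
            bool toggle = false;
            while (!_complete)   // re-read from memory on every iteration
                toggle = !toggle;
        });
        t.Start();
        Thread.Sleep (1000);
        _complete = true;
        t.Join();                // now terminates
    }
}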
EDIT: To answer the comment from opc (my answer is too big for comment block):
Barrier 3 forces the CPU to flush any pending read (and write) requests.
So imagine there were some other reads before reading _complete:
void B()
{
    int x = a * b + c * d;  // read a,b,c,d
    Thread.MemoryBarrier(); // Barrier 3
    if (_complete)
        ...
Without the barrier, the CPU might have all of these 5 read requests 'pending':
a,b,c,d,_complete
Without the barrier, the processor could reorder these requests to optimize memory access (ie if _complete and 'a' were on the same cache line or something).
With the barrier, the CPU gets a,b,c,d back from memory BEFORE _complete is even put in as a request. ENSURING 'b' (for example) is read BEFORE _complete - ie no reordering.
The question is - what difference does it make?
If a,b,c,d are independent from _complete, then it doesn't matter. All the barrier does is SLOW THINGS DOWN. So yeah, _complete is read later. So the data is fresher. Putting a sleep(100) or some busy-wait for-loop in there before the read would make it 'fresher' as well! :-)
So the point is - keep it relative. Does the data need to be read/written BEFORE/AFTER relative to some other data or not? That's the question.
And not to put down the author of the article - he does mention "if B ran after A...". It just isn't exactly clear whether he is imagining that B-after-A is crucial to the code, observable by the code, or just inconsequential.

Code sample #1:
Each processor core contains a cache with a copy of a portion of memory. It may take a bit of time for the cache to be updated. The memory barriers guarantee that the caches are synchronized with main memory. For example, if you didn't have barriers 2 and 3 here, consider this situation:
Processor 1 runs A(). It writes the new value of _complete to its cache (but not necessarily to main memory yet).
Processor 2 runs B(). It reads the value of _complete. If this value was previously in its cache, it may not be fresh (i.e., not synchronized with main memory), so it would not get the updated value.
Code sample #2:
Normally, variables are stored in memory. However, suppose a value is read multiple times in a single function: As an optimization, the compiler may decide to read it into a CPU register once, and then access the register each time it is needed. This is much faster, but prevents the function from detecting changes to the variable from another thread.
The memory barrier here forces the function to re-read the variable value from memory.
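Concretely, this is the fix the article's own comment points at - a fence on every iteration:

while (!complete)
{
    Thread.MemoryBarrier(); // forces complete to be re-read from memory
    toggle = !toggle;
}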

Calling Thread.MemoryBarrier() immediately refreshes the registers and caches with the actual values of the variables.
In the first example, the "freshness" of _complete is provided by calling the method right after setting it and right before using it. In the second example, the initial false value of the variable complete will be cached in the thread's own space, and it needs to be resynchronized so that the running thread immediately sees the actual "outside" value from the "inside".

The "freshness" guarantee simply means that Barriers 2 and 3 force the values of _complete to be visible as soon as possible as opposed to whenever they happen to be written to memory.
It's actually unnecessary from a consistency point of view, since Barriers 1 and 4 ensure that answer will be read after reading complete.

Related

Does a MemoryBarrier guarantee memory visibility for all memory?

If I understand correctly, in C#, a lock block guarantees exclusive access to a set of instructions, but it also guarantees that any reads from memory reflect the latest version of that memory in any CPU cache. We think of lock blocks as protecting the variables read and modified within the block, which means:
Assuming you've properly implemented locking where necessary, those variables can only be read and written to by one thread at a time, and
Reads within the lock block see the latest versions of a variable and writes within the lock block become visible to all threads.
(Right?)
This second point is what interests me. Is there some magic by which only variables read and written in code protected by the lock block are guaranteed fresh, or do the memory barriers employed in the implementation of lock guarantee that all memory is now equally fresh for all threads? Pardon my mental fuzziness here about how caches work, but I've read that caches hold several multi-byte "lines" of data. I think what I'm asking is, does a memory barrier force synchronization of all "dirty" cache lines or just some, and if just some, what determines which lines get synchronized?
If I understand correctly, in C#, a lock block guarantees exclusive access to a set of instructions...
Right. The specification guarantees that.
but it also guarantees that any reads from memory reflect the latest version of that memory in any CPU cache.
The C# specification says nothing whatsoever about "CPU cache". You've left the realm of what is guaranteed by the specification, and entered the realm of implementation details. There is no requirement that an implementation of C# execute on a CPU that has any particular cache architecture.
Is there some magic by which only variables read and written in code protected by the lock block are guaranteed fresh, or do the memory barriers employed in the implementation of lock guarantee that all memory is now equally fresh for all threads?
Rather than try to parse your either-or question, let's say what is actually guaranteed by the language. A special effect is:
Any write to a variable, volatile or not
Any read of a volatile field
Any throw
The order of special effects is preserved at certain special points:
Reads and writes of volatile fields
locks
thread creation and termination
The runtime is required to ensure that special effects are ordered consistently with special points. So, if there is a read of a volatile field before a lock, and a write after, then the read can't be moved after the write.
So, how does the runtime achieve this? Beats the heck out of me. But the runtime is certainly not required to "guarantee that all memory is fresh for all threads". The runtime is required to ensure that certain reads, writes and throws happen in chronological order with respect to special points, and that's all.
The runtime is, in particular, not required to ensure that all threads observe the same order.
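A minimal sketch of the ordering rule just described (the fields and lock object are hypothetical, purely for illustration):

class Example
{
    readonly object _sync = new object();
    volatile bool _flag;
    int _data;

    void M()
    {
        bool f = _flag;  // read of a volatile field: a special effect
        lock (_sync)     // a special point: special effects stay ordered across it
        {
            _data = 1;   // write: the volatile read above can't be moved after this write
        }
    }
}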
Finally, I always end these sorts of discussions by pointing you here:
http://blog.coverity.com/2014/03/26/reordering-optimizations/
After reading that, you should have an appreciation for the sorts of horrid things that can happen even on x86 when you act casual about eliding locks.
Reads within the lock block see the latest versions of a variable and writes within the lock block are visible to all threads.
No, that's definitely a harmful oversimplification.
When you enter the lock statement, there's a memory fence which sort of means that you'll always read "fresh" data. When you exit the lock statement, there's a memory fence which sort of means that all the data you've written is guaranteed to be written to main memory and available to other threads.
The important point is that if multiple threads only ever read/write memory when they "own" a particular lock, then by definition one of them will have exited the lock before the next one enters it... so all those reads and writes will be simple and correct.
If you have code which reads and writes a variable without taking a lock, then there's no guarantee that it will "see" data written by well-behaved code (i.e. code using the lock), or that well-behaved threads will "see" the data written by that bad code.
For example:
private readonly object padlock = new object();
private int x;

public void A()
{
    lock (padlock)
    {
        // Will see changes made in A and B; may not see changes made in C
        x++;
    }
}

public void B()
{
    lock (padlock)
    {
        // Will see changes made in A and B; may not see changes made in C
        x--;
    }
}

public void C()
{
    // Might not see changes made in A, B, or C. Changes made here
    // might not be visible in other threads calling A, B or C.
    x = x + 10;
}
Now it's more subtle than that, but that's why using a common lock to protect a set of variables works.

What I do not understand about volatile and Memory-Barrier is

Loop hoisting a volatile read
I have read in many places that a volatile variable cannot be hoisted from a loop or if, but I cannot find this mentioned anywhere in the C# spec. Is this a hidden feature?
All writes are volatile in C#
Does this mean that all writes have the same properties with or without the volatile keyword? Eg do ordinary writes in C# have release semantics? And do all writes flush the store buffer of the processor?
Release semantics
Is this a formal way of saying that the store buffer of a processor is emptied when a volatile write is done?
Acquire semantics
Is this a formal way of saying that it should not load a variable into a register, but fetch it from memory every time?
In this article, Igoro speaks of "thread cache". I perfectly understand that this is imaginary, but is he in fact referring to:
Processor store buffer
loading variables into registers instead of fetching from memory every time
Some sort of processor cache (is this L1 and L2 etc)
Or is this just my imagination?
Delayed writing
I have read in many places that writes can be delayed. Is this because of the reordering and the store buffer?
Thread.MemoryBarrier
I understand that a side effect of Thread.MemoryBarrier is a "lock or" instruction when the JIT transforms the IL to asm, and that this is why a call to Thread.MemoryBarrier can solve the delayed write to main memory (in the while loop) in, for example, this example:
static void Main()
{
    bool complete = false;
    var t = new Thread (() =>
    {
        bool toggle = false;
        while (!complete) toggle = !toggle;
    });
    t.Start();
    Thread.Sleep (1000);
    complete = true;
    t.Join(); // Blocks indefinitely
}
But is this always the case? Will a call to Thread.MemoryBarrier always flush the store buffer and fetch updated values into the processor cache? I understand that the complete variable is not hoisted into a register and is fetched from the processor cache every time, but the processor cache is updated because of the call to Thread.MemoryBarrier.
Am I on thin ice here, or do I have some sort of understanding of volatile and Thread.MemoryBarrier?
That's a mouthful..
I'm gonna start with a few of your questions, and update my answer.
Loop hoisting a volatile
I have read in many places that a volatile variable cannot be hoisted from a loop or if, but I cannot find this mentioned anywhere in the C# spec. Is this a hidden feature?
MSDN says "Fields that are declared volatile are not subject to compiler optimizations that assume access by a single thread". This is kind of a broad statement, but it includes hoisting or "lifting" variables out of a loop.
All writes are volatile in C#
Does this mean that all writes have the same properties with or without the volatile keyword? Eg do ordinary writes in C# have release semantics? And do all writes flush the store buffer of the processor?
Regular writes are not volatile. They do have release semantics, but they don't flush the CPU's write-buffer. At least, not according to the spec.
From Joe Duffy's CLR 2.0 Memory Model
Rule 2: All stores have release semantics, i.e. no load or store may move after one.
I've read a few articles stating that all writes are volatile in C# (like the one you linked to), but this is a common misconception. From the horse's mouth (The C# Memory Model in Theory and Practice, Part 2):
Consequently, the author might say something like, “In the .NET 2.0 memory model, all writes are volatile—even those to non-volatile fields.” (...) This behavior isn’t guaranteed by the ECMA C# spec, and, consequently, might not hold in future versions of the .NET Framework and on future architectures (and, in fact, does not hold in the .NET Framework 4.5 on ARM).
Release semantics
Is this a formal way of saying that the store buffer of a processor is emptied when a volatile write is done?
No, those are two different things. If an instruction has "release semantics", then no store/load instruction will ever be moved below said instruction. The definition says nothing regarding flushing the write-buffer. It only concerns instruction re-ordering.
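A rough illustration of that one-directional constraint (my own sketch, using a volatile field for the release write):

class ReleaseDemo
{
    int _x, _y;
    volatile bool _flag;

    void M()
    {
        _x = 1;        // may NOT be moved below the release write
        _flag = true;  // volatile write: release semantics
        _y = 2;        // may still move above the release write - release is one-directional
    }
}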
Delayed writing
I have read in many places that writes can be delayed. Is this because of the reordering and the store buffer?
Yes. Write instructions can be delayed/reordered by either the compiler, the jitter or the CPU itself.
So a volatile write has two properties: release semantics, and store buffer flushing.
Sort of. I prefer to think of it this way:
The C# Specification of the volatile keyword guarantees one property: that reads have acquire-semantics and writes have release-semantics. This is done by emitting the necessary release/acquire fences.
Microsoft's actual C# implementation adds another property: reads will be fresh, and writes will be flushed to memory immediately and made visible to other processors. To accomplish this, the compiler emits an OpCodes.Volatile prefix, and the jitter picks this up and tells the processor not to keep this variable in its registers.
This means that a different C# implementation that doesn't guarantee immediacy will be a perfectly valid implementation.
Memory Barrier
bool complete = false;
var t = new Thread (() =>
{
    bool toggle = false;
    while (!complete) toggle = !toggle;
});
t.Start();
Thread.Sleep(1000);
complete = true;
t.Join(); // blocks
But is this always the case? Will a call to Thread.MemoryBarrier always flush the store buffer and fetch updated values into the processor cache?
Here's a tip: try to abstract yourself away from concepts like flushing the store buffer, or reading straight from memory. The concept of a memory barrier (or a full-fence) is in no way related to the two former concepts.
A memory barrier has one sole purpose: ensure that store/load instructions below the fence are not moved above the fence, and vice-versa. If C#'s Thread.MemoryBarrier just so happens to flush pending writes, you should think about it as a side-effect, not the main intent.
Now, let's get to the point. The code you posted (which blocks when compiled in Release mode and run without a debugger) can be fixed by introducing a full fence anywhere inside the while block. Why? Let's first unroll the loop. Here's what the first few iterations would look like:
if(complete) return;
toggle = !toggle;
if(complete) return;
toggle = !toggle;
if(complete) return;
toggle = !toggle;
...
Because complete is not marked as volatile and there are no fences, the compiler and the CPU are allowed to move the read of the complete field.
In fact, the CLR's Memory Model (see rule 6) allows loads to be deleted (!) when coalescing adjacent loads. So, this could happen:
if(complete) return;
toggle = !toggle;
toggle = !toggle;
toggle = !toggle;
...
Notice that this is logically equivalent to hoisting the read out of the loop, and that's exactly what the compiler may do.
By introducing a full-fence either before or after toggle = !toggle, you'd prevent the compiler from moving the reads up and merging them together.
if(complete) return;
toggle = !toggle;
#FENCE
if(complete) return;
toggle = !toggle;
#FENCE
if(complete) return;
toggle = !toggle;
#FENCE
...
In conclusion, the key to solving these issues is ensuring that the instructions will be executed in the correct order. It has nothing to do with how long it takes for other processors to see one processor's writes.
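For completeness, a sketch of one way to obtain that ordering in code (my own illustration, assuming .NET 4.5+ where Volatile.Read exists; any full fence inside the loop would do as well):

using System;
using System.Threading;

class Program
{
    static void Main()
    {
        bool complete = false;
        var t = new Thread (() =>
        {
            bool toggle = false;
            // An acquire read; in practice it also stops the JIT from
            // coalescing the reads and hoisting them out of the loop.
            while (!Volatile.Read (ref complete))
                toggle = !toggle;
        });
        t.Start();
        Thread.Sleep (1000);
        Volatile.Write (ref complete, true);
        t.Join();  // now terminates
    }
}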

Understanding non blocking thread synchronization and Thread.MemoryBarrier

In this threading online book: http://www.albahari.com/threading/part4.aspx
there's an example of Thread.MemoryBarrier():
class Foo
{
    int _answer;
    bool _complete;

    void A()
    {
        _answer = 123;
        Thread.MemoryBarrier(); // Barrier 1
        _complete = true;
        Thread.MemoryBarrier(); // Barrier 2
    }

    void B()
    {
        Thread.MemoryBarrier(); // Barrier 3
        if (_complete)
        {
            Thread.MemoryBarrier(); // Barrier 4
            Console.WriteLine (_answer);
        }
    }
}
We got into a discussion about whether there is any thread blocking going on or not.
I'm thinking there is some, especially given that
A full fence takes around ten nanoseconds on a 2010-era desktop.
On the other hand, a full fence is only supposed to disable instruction reordering and caching, which by the sound of it doesn't qualify as thread blocking (unlike lock, where it's clear that a thread waits for another to release the lock before it continues, and is blocked during that time).
About that thread 'blocked state': I'm talking not in terms of whether a thread is put into a blocked state or not, but whether there is some thread synchronization happening, meaning one thread is not able to run while another isn't letting it do so - by means of MemoryBarrier in this case.
Also, I'd like to get a clear understanding of what each barrier achieves. For example, Barrier 2 - how exactly does it provide the freshness guarantee, and how is it connected to Barrier 3? If someone would explain in detail each barrier's purpose here (what could possibly go wrong if 1 or 2 or 3 or 4 weren't there), I think I'd improve my understanding of this greatly.
EDIT: it's mostly clear now what 1, 2, and 3 do. However, what 4 does that 3 doesn't is still unclear.
The fact that instructions take time to execute does not imply that a thread is blocked. A thread is blocked when it is specifically put into a blocked state, which MemoryBarrier() does not do.
The processor instructions that actually prevent instruction reordering and flush the caches take time, because they must wait for the caches to become coherent again. During that time, the thread is still considered running.
Update: So let's take a look at what's actually happening in the example, and what each memory barrier actually does.
As the link says, 1 and 4 ensure that the correct answer is printed. That's because 1 ensures that the answer is flushed into memory, and 4 ensures that the read caches are flushed prior to retrieving the variables.
2 and 3 ensure that if A runs first, then B will always print the answer. Barrier 2 ensures that the write of true is flushed to memory, and Barrier 3 ensures that the read caches are flushed before testing _complete's value.
The cache and memory flushing should be clear enough, so let's look at instruction reordering. The way the compiler, CLR and CPU know they can reorder instructions is by analyzing a set of instructions in sequence. When they see the barrier instruction in the middle of a sequence, they know that instructions can't move across that boundary. That ensures that in addition to cache freshness, the instructions occur in the correct order.

Why do I need a memory barrier?

C# 4 in a Nutshell (highly recommended btw) uses the following code to demonstrate the concept of MemoryBarrier (assuming A and B were run on different threads):
class Foo
{
    int _answer;
    bool _complete;

    void A()
    {
        _answer = 123;
        Thread.MemoryBarrier(); // Barrier 1
        _complete = true;
        Thread.MemoryBarrier(); // Barrier 2
    }

    void B()
    {
        Thread.MemoryBarrier(); // Barrier 3
        if (_complete)
        {
            Thread.MemoryBarrier(); // Barrier 4
            Console.WriteLine(_answer);
        }
    }
}
they mention that Barriers 1 & 4 prevent this example from writing 0 and Barriers 2 & 3 provide a freshness guarantee: they ensure that if B ran after A, reading _complete would evaluate to true.
I'm not really getting it. I think I understand why Barriers 1 & 4 are necessary: we don't want the write to _answer to be optimized and placed after the write to _complete (Barrier 1) and we need to make sure that _answer is not cached (Barrier 4). I also think I understand why Barrier 3 is necessary: if A ran until just after writing _complete = true, B would still need to refresh _complete to read the right value.
I don't understand though why we need Barrier 2! Part of me says that it's because perhaps Thread 2 (running B) already ran until (but not including) if(_complete), and so we need to ensure that _complete is refreshed.
However, I don't see how this helps. Isn't it still possible that _complete will be set to true in A, and yet the B method will see a cached (false) version of _complete? Ie, if Thread 2 ran method B until just after the first MemoryBarrier, and then Thread 1 ran method A until _complete = true but no further, and then Thread 2 resumed and tested if(_complete) - could that if not result in false?
Barrier #2 guarantees that the write to _complete gets committed immediately. Otherwise it could remain in a queued state, meaning that the read of _complete in B would not see the change caused by A, even though B effectively used a volatile read.
Of course, this example does not quite do justice to the problem, because A does nothing more after writing to _complete, which means that the write will be committed immediately anyway since the thread terminates early.
The answer to your question of whether the if could still evaluate to false is yes for exactly the reasons you stated. But, notice what the author says regarding this point.
Barriers 1 and 4 prevent this example from writing “0”. Barriers 2 and 3 provide a freshness guarantee: they ensure that if B ran after A, reading _complete would evaluate to true.
The emphasis on "if B ran after A" is mine. It certainly could be the case that the two threads interleave. But, the author was ignoring this scenario presumably to make his point regarding how Thread.MemoryBarrier works simpler.
By the way, I had a hard time contriving an example on my machine where barriers #1 and #2 would have altered the behavior of the program. This is because the memory model regarding writes was strong in my environment. Perhaps, if I had a multiprocessor machine, was using Mono, or had some other different setup I could have demonstrated it. Of course, it was easy to demonstrate that removing barriers #3 and #4 had an impact.
The example is unclear for two reasons:
It is too simple to fully show what's happening with the fences.
Albahari is including requirements for non-x86 architectures. See MSDN: "MemoryBarrier is required only on multiprocessor systems with weak memory ordering (for example, a system employing multiple Intel Itanium processors [which Microsoft no longer supports])."
If you consider the following, it becomes clearer:
A memory barrier (full barriers here - .NET doesn't provide a half barrier) prevents read/write instructions from jumping the fence (due to various optimisations). This guarantees that the code after the fence will execute after the code before the fence.
"This serializing operation guarantees that every load and store instruction that precedes in program order the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction is globally visible." See here.
x86 CPUs have a strong memory model and guarantee writes appear consistent to all threads / cores (therefore barriers #2 & #3 are unneeded on x86). But, we are not guaranteed that reads and writes will remain in coded sequence, hence the need for barriers #1 and #4.
Memory barriers are inefficient and needn't be used (see the same MSDN article). I personally use Interlocked and volatile (make sure you know how to use them correctly!), which work efficiently and are easy to understand.
Ps. This article explains the inner workings of x86 nicely.
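For reference, a sketch of the Interlocked alternative mentioned above (my own illustration, not from the book; Interlocked operations imply full fences):

using System;
using System.Threading;

class Foo
{
    int _answer;
    int _complete;  // Interlocked has no bool overloads, so 0/1 stands in for false/true

    void A()
    {
        _answer = 123;
        Interlocked.Exchange (ref _complete, 1);  // full fence: publishes _answer as well
    }

    void B()
    {
        // CompareExchange(ref x, 0, 0) is the usual full-fence way to read x
        if (Interlocked.CompareExchange (ref _complete, 0, 0) == 1)
            Console.WriteLine (_answer);  // guaranteed to print 123
    }
}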

Thread.VolatileRead Implementation

I'm looking at the implementation of the VolatileRead/VolatileWrite methods (using Reflector), and I'm puzzled by something.
This is the implementation for VolatileRead:
[MethodImpl(MethodImplOptions.NoInlining)]
public static int VolatileRead(ref int address)
{
    int num = address;
    MemoryBarrier();
    return num;
}
How come the memory barrier is placed after reading the value of "address"? Isn't it supposed to be the opposite (placed before reading the value, so any pending writes to "address" will be completed by the time we make the actual read)?
The same goes for VolatileWrite, where the memory barrier is placed before the assignment of the value. Why is that?
Also, why do these methods have the NoInlining attribute? What could happen if they were inlined?
I thought that until recently. Volatile reads aren't what you think they are - they're not about guaranteeing that they get the most recent value; they're about making sure that no read which is later in the program code is moved to before this read. That's what the spec guarantees - and likewise for volatile writes, it guarantees that no earlier write is moved to after the volatile one.
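In other words (my own sketch of the guarantee; Thread.VolatileRead has no bool overload, so an int flag stands in):

using System.Threading;

class VolatileDemo
{
    int _data;
    int _flag;

    void Writer()
    {
        _data = 42;                              // an earlier write...
        Thread.VolatileWrite (ref _flag, 1);     // ...may not be moved after this volatile write
    }

    void Reader()
    {
        int f = Thread.VolatileRead (ref _flag); // a later read...
        int d = _data;                           // ...may not be moved before the volatile read
    }
}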
You're not alone in suspecting this code, but Joe Duffy explains it better than I can :)
My answer to this is to give up on lock-free coding other than by using things like PFX which are designed to insulate me from it. The memory model is just too hard for me - I'll leave it to the experts, and stick with things that I know are safe.
One day I'll update my threading article to reflect this, but I think I need to be able to discuss it more sensibly first...
(I don't know about the no-inlining part, btw. I suspect that inlining could introduce some other optimizations which aren't meant to happen around volatile reads/writes, but I could easily be wrong...)
Maybe I am oversimplifying, but I think the explanations about reordering and cache coherency and so on give too much detail.
So, why does the MemoryBarrier come after the actual read?
I will try to explain this with an example that uses object instead of int.
One may think the correct order is:
Thread 1 creates the object (initializes its inner data).
Thread 1 then puts the object into a variable.
Then it "does a fence" and all threads see the new value.
Then, the read is something like this:
Thread 2 "does a fence".
Thread 2 reads the object instance.
Thread 2 is sure that it has all the inner data of that instance (as it started with a fence).
The biggest problem with this is:
Thread 1 creates the object and initializes it.
Thread 1 then puts the object into a variable.
Before the thread flushes the cache, the CPU itself flushes part of the cache... but it commits only the variable holding the reference (not the contents of the object it points to).
At that moment, Thread 2 had already flushed its cache. So it is going to read everything from the main memory.
So, it reads the variable (it is there).
Then it reads the content (it is not there).
Finally, after all this, CPU 1 executes the part of Thread 1 that does the fence.
So, what happens with the volatile write and read?
The volatile write makes the contents of the object go to memory first (it starts with the fence), and only then sets the variable (which may not go immediately to real memory).
Then, the volatile read will first clear the cache. Then it reads the field. If it receives a value when reading the field, it is certain that the contents pointed to by that reference are really there.
Given those little details, yes, it is possible that you do a VolatileWrite(1) and another thread still sees the value of zero. But as soon as other threads see the value of 1 (using a volatile read), all the other items they may need through that reference are already there. You can't really tell the difference, because when reading the old value (0 or null) you may simply not progress, considering that you don't yet have everything you need.
I have already seen discussions arguing that, even if it flushes the caches twice, the right pattern would be:
MemoryBarrier - will flush other variables changed before this call
Write
MemoryBarrier - will guarantee that the write was flushed
The Read will then need the same:
MemoryBarrier
Read - Guarantees that we see the latest info... maybe one that was put AFTER our memory barrier.
As something may have appeared after our MemoryBarrier and was already read, we must put another MemoryBarrier to access the contents.
Those could be two write-fences or two read-fences, if such things existed in .NET.
I am not sure about everything I said... it is a "compilation" of a lot of information I picked up, but it really does explain why VolatileRead and VolatileWrite appear to be reversed, and also why no invalid values are read when using them.
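Putting that together, a sketch of the publish/consume pattern described above, with the fences placed the way VolatileWrite and VolatileRead place them (the names here are mine, purely for illustration):

using System;
using System.Threading;

class Publisher
{
    class Payload { public int Value; }

    Payload _item;  // the published reference

    void Produce()
    {
        var p = new Payload { Value = 42 };  // initialize the inner data first
        Thread.MemoryBarrier();              // fence BEFORE the write, as in VolatileWrite:
        _item = p;                           //   the contents are committed before the reference
    }

    void Consume()
    {
        var p = _item;           // read the reference first
        Thread.MemoryBarrier();  // fence AFTER the read, as in VolatileRead:
        if (p != null)           //   later reads can't have been satisfied early with stale data
            Console.WriteLine (p.Value);  // if the reference was seen, Value is there too
    }
}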
