According to this article, http://msdn.microsoft.com/en-us/magazine/cc163715.aspx,
this is the implementation of spinlock class:
class SpinLock
{
    volatile int isEntered; // non-zero if the lock is entered

    public void Enter()
    {
        while (Interlocked.CompareExchange(ref isEntered, 1, 0) != 0)
        {
            Thread.Sleep(0); // force a thread context switch
        }
    }

    public void Exit()
    {
        isEntered = 0;
    }
}
I know what volatile means and does, but I can't understand why it's here.
The last thing I want to ask is on a different topic: does reading an object's property count as an atomic operation? From my understanding, there are two reads here: first the object reference, and second the property itself.
After (re)studying memory models for a couple of days, I had the same vexing question about the use of volatile in SpinLock. I cannot find any problematic reordering of memory access that requires either of the fences entailed by the volatile access.
Acquire fence: Provided by Interlocked.CompareExchange().
Release fence: To get lock semantics, we want all side effects of the code "under the lock" to be visible when the lock is released. But the .NET memory model does not permit reordering of writes. Therefore, the writes from under the lock must become visible no later than the clearing of isEntered.
Update
After some more study based on the Core CLR discussions, I figured out where the disconnect is. First, here are the two popular sources stating either that writes include a release fence or simply that they cannot be reordered:
http://joeduffyblog.com/2007/11/10/clr-20-memory-model/
Rule 2: All stores have release semantics, i.e. no load or store may move after one.
https://learn.microsoft.com/en-us/archive/msdn-magazine/2005/october/understanding-low-lock-techniques-in-multithreaded-apps, in the "Strong Model 2: .NET Framework 2.0" section:
Writes cannot move past other writes from the same thread.
Turns out that the ".NET Framework 2 model" is not inherited by all later versions and ports of the framework. The most enlightening article ended up being this one: C# - The C# Memory Model in Theory and Practice, Part 2. The author explains the more relaxed CLR implementations for IA64 and ARM. Essentially, normal writes can be reordered, and thus memory fences are needed in some situations. (How they've decided to insert a release fence before a normal write to a reference type field on the heap strikes me as a desperate hack for maintaining compatibility with code written for x86, but it's well justified in the historical context and the context of the popular lock-free object publication pattern.)
Conclusion: volatile is needed for the SpinLock field in a generic ".NET" implementation to ensure that the writes made under the lock become visible before the lock is released, without depending on the stronger ".NET Framework 2 model" or on the CPU architecture.
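For completeness, here is a minimal sketch of my own (assuming a runtime where System.Threading.Volatile is available, which postdates the quoted article) showing how the same release guarantee can be expressed without marking the field volatile, by making only the releasing store explicit:

class SpinLockSketch
{
    int isEntered; // non-zero if the lock is entered

    public void Enter()
    {
        // CompareExchange is a full barrier, so it already supplies the acquire side.
        while (Interlocked.CompareExchange(ref isEntered, 1, 0) != 0)
        {
            Thread.Sleep(0); // yield to another ready thread
        }
    }

    public void Exit()
    {
        // Release semantics: writes made under the lock cannot move past this store.
        Volatile.Write(ref isEntered, 0);
    }
}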
After reading a few sources, I had the impression that volatile only affects reordering done by compiler/runtime during optimization. The C# 5.0 Reference indicates it's more than this:
These optimizations can be performed by the compiler, by the run-time system, or by hardware
And then I came across this little article
http://www.codeproject.com/Articles/31283/Volatile-fields-in-NET-A-look-inside
Related
It is my understanding that C# is a safe language and doesn't allow one to access unallocated memory, other than through the unsafe keyword. However, its memory model allows reordering when there is unsynchronized access between threads. This leads to race hazards where references to new instances appear to be available to racing threads before the instances have been fully initialized, and is a widely known problem for double-checked locking. Chris Brumme (from the CLR team) explains this in their Memory Model article:
Consider the standard double-locking protocol:
if (a == null)
{
    lock (obj)
    {
        if (a == null)
            a = new A();
    }
}
This is a common technique for avoiding a lock on the read of ‘a’ in the typical case. It works just fine on X86. But it would be broken by a legal but weak implementation of the ECMA CLI spec. It’s true that, according to the ECMA spec, acquiring a lock has acquire semantics and releasing a lock has release semantics.
However, we have to assume that a series of stores have taken place during construction of ‘a’. Those stores can be arbitrarily reordered, including the possibility of delaying them until after the publishing store which assigns the new object to ‘a’. At that point, there is a small window before the store.release implied by leaving the lock. Inside that window, other CPUs can navigate through the reference ‘a’ and see a partially constructed instance.
I've always been confused by what "partially constructed instance" means. Assuming that the .NET runtime clears out memory on allocation rather than garbage collection (discussion), does this mean that the other thread might read memory that still contains data from garbage-collected objects (like what happens in unsafe languages)?
Consider the following concrete example:
byte[] buffer = new byte[2];
Parallel.Invoke(
    () => buffer = new byte[4],
    () => Console.WriteLine(BitConverter.ToString(buffer)));
The above has a race condition; the output would be either 00-00 or 00-00-00-00. However, is it possible that the second thread reads the new reference to buffer before the array's memory has been initialized to 0, and outputs some other arbitrary string instead?
Let's not bury the lede here: the answer to your question is no, you will never observe the pre-allocated state of memory in the CLR 2.0 memory model.
I'll now address a couple of your non-central points.
It is my understanding that C# is a safe language and doesn't allow one to access unallocated memory, other than through the unsafe keyword.
That is more or less correct. There are some mechanisms by which one can access bogus memory without using unsafe -- via unmanaged code, obviously, or by abusing structure layout. But in general, yes, C# is memory safe.
However, its memory model allows reordering when there is unsynchronized access between threads.
Again, that's more or less correct. A better way to think about it is that C# allows reordering at any point where the reordering would be invisible to a single threaded program, subject to certain constraints. Those constraints include introducing acquire and release semantics in certain cases, and preserving certain side effects at certain critical points.
Chris Brumme (from the CLR team) ...
The late great Chris's articles are gems and give a great deal of insight into the early days of the CLR, but I note that there have been some strengthenings of the memory model since 2003 when that article was written, particularly with respect to the issue you raise.
Chris is right that double-checked locking is super dangerous. There is a correct way to do double-checked locking in C#, and the moment you depart from it even slightly, you are off in the weeds of horrible bugs that only repro on weak memory model hardware.
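For reference, a minimal sketch of the form that is generally considered correct (class and member names here are illustrative; A is the type from the quoted snippet). The key point is that the field holding the lazily created instance is declared volatile, so the publishing store has release semantics and the unsynchronized read has acquire semantics:

class LazyHolder
{
    static readonly object padlock = new object();
    static volatile A instance; // volatile is what makes the unsynchronized read safe

    public static A Instance
    {
        get
        {
            if (instance == null)             // first, unsynchronized check
            {
                lock (padlock)
                {
                    if (instance == null)     // second check, under the lock
                        instance = new A();   // volatile write: the constructor's stores cannot move after it
                }
            }
            return instance;
        }
    }
}

In practice, Lazy<T> packages this pattern up and is usually the simpler choice.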
does this mean that the other thread might read memory that still contains data from garbage-collected objects
I think your question is not specifically about the old weak ECMA memory model that Chris was describing, but rather about what guarantees are actually made today.
It is not possible for re-orderings to expose the previous state of objects. You are guaranteed that when you read a freshly-allocated object, its fields are all zeros.
This is made possible by the fact that all writes have release semantics in the current memory model; see this for details:
http://joeduffyblog.com/2007/11/10/clr-20-memory-model/
The write that initializes the memory to zero will not be moved forwards in time with respect to a read later.
I've always been confused by "partially constructed objects"
Joe discusses that here: http://joeduffyblog.com/2010/06/27/on-partiallyconstructed-objects/
Here the concern is not that we might see the pre-allocation state of an object. Rather, the concern here is that one thread might see an object while the constructor is still running on another thread.
Indeed, it is possible for the constructor and the finalizer to be running concurrently, which is super weird! Finalizers are hard to write correctly for this reason.
Put another way: the CLR guarantees you that its own invariants will be preserved. An invariant of the CLR is that newly allocated memory is observed to be zeroed out, so that invariant will be preserved.
But the CLR is not in the business of preserving your invariants! If you have a constructor which guarantees that field x is true if and only if y is non-null, then you are responsible for ensuring that this invariant is always observed to be true. If in some way this is observed by two threads, then one of those threads might observe the invariant being violated.
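To illustrate that last point with a small, hypothetical example (the types and names are mine): on a sufficiently weak but ECMA-conforming implementation, the scenario Chris Brumme describes above lets a racing reader observe the author's invariant broken, even though both fields start out safely zeroed.

class Pair
{
    public bool x;   // author's invariant: x == true if and only if y != null
    public object y;

    public Pair()
    {
        x = true;
        y = new object();
    }
}

static class Publisher
{
    public static Pair shared; // written by one thread, read by another, no synchronization

    public static void Writer() => shared = new Pair();

    public static void Reader()
    {
        Pair p = shared;
        if (p != null && p.x && p.y == null)
        {
            // Reachable on a weak implementation: the reader saw the write to x
            // but not yet the write to y, so the author's invariant appears violated.
            // The CLR's own invariant (freshly allocated memory reads as zero) still holds.
        }
    }
}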
This question is a follow-up to comments in this thread.
Let's assume we have the following code:
// (1)
lock (padlock)
{
    // (2)
}
var value = nonVolatileField; // (3)
Furthermore, let's assume that no instruction in (2) has any effect on the nonVolatileField and vice versa.
Can the reading instruction (3) be reordered in such a way that it ends up before the lock statement (1) or inside it (2)?
As far as I can tell, nothing in the C# Specification (§3.10) or the CLI Specification (§I.12.6.5) prohibits such reordering.
Please note that this is not the same question as this one. Here I am asking specifically about read instructions, because as far as I understand, they are not considered side-effects and have weaker guarantees.
I believe this is partially guaranteed by the CLI spec, although it's not as clear as it might be. From I.12.6.5:
Acquiring a lock (System.Threading.Monitor.Enter or entering a synchronized method) shall implicitly perform a volatile read operation, and releasing a lock (System.Threading.Monitor.Exit or leaving a synchronized method) shall implicitly perform a volatile write operation. See §I.12.6.7.
Then from I.12.6.7:
A volatile read has “acquire semantics” meaning that the read is guaranteed to occur prior to any references to memory that occur after the read instruction in the CIL instruction sequence. A volatile write has “release semantics” meaning that the write is guaranteed to happen after any memory references prior to the write instruction in the CIL instruction sequence.
So entering the lock should prevent (3) from moving to (1). Reading from nonVolatileField still counts as a "reference to memory", I believe. However, the read could still be performed before the volatile write when the lock exits, so it could still be moved to (2).
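For illustration, a minimal sketch (mine, not from the spec) of how to keep the read at (3) entirely outside the lock: an explicit full fence works because, per the CLR 2.0 rules quoted later in this thread, no load or store may cross Thread.MemoryBarrier in either direction.

// (1)
lock (padlock)
{
    // (2)
}
Thread.MemoryBarrier();       // full fence: nothing may cross it in either direction
var value = nonVolatileField; // (3) can no longer move up into (2)

A volatile read of nonVolatileField alone would not achieve this, since acquire semantics only prevent later accesses from moving before the read, not the read itself from moving earlier.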
The C#/CLI memory model leaves a lot to be desired at the moment. I'm hoping that the whole thing can be clarified significantly (and probably tightened up, to make some "theoretically valid but practically awful" optimizations invalid).
As far as .NET is concerned, entering a monitor (the lock statement) has acquire semantics, as it implicitly performs a volatile read, and exiting a monitor (the end of the lock block) has release semantics, as it implicitly performs a volatile write (see §12.6.5 Locks and Threads in Common Language Infrastructure (CLI) Partition I).
volatile bool areWeThereYet = false;

// In thread 1
// Accesses, usually writes: create objects, initialize them
areWeThereYet = true;

// In thread 2
if (areWeThereYet)
{
    // Accesses, usually reads: use created and initialized objects
}
When you write a value to areWeThereYet, all accesses before it were performed and not reordered to after the volatile write.
When you read from areWeThereYet, subsequent accesses are not reordered to before the volatile read.
In this case, when thread 2 observes that areWeThereYet has changed, it has a guarantee that the following accesses, usually reads, will observe the other thread's accesses, usually writes. Assuming there is no other code messing with the affected variables.
As for other synchronization primitives in .NET, such as SemaphoreSlim, although it's not explicitly documented, they would be rather useless if they didn't have similar semantics. Programs based on them could, in fact, not even work correctly on platforms or hardware architectures with a weaker memory model.
Many people share the thought that Microsoft ought to enforce a strong memory model on such architectures, similar to x86/amd64, so as to keep the current code base (Microsoft's own and their clients') compatible.
I cannot verify myself, as I don't have an ARM device with Microsoft Windows, much less with .NET Framework for ARM, but at least one MSDN magazine article by Andrew Pardoe, CLR - .NET Development for ARM Processors, states:
The CLR is allowed to expose a stronger memory model than the ECMA CLI specification requires. On x86, for example, the memory model of the CLR is strong because the processor’s memory model is strong. The .NET team could’ve made the memory model on ARM as strong as the model on x86, but ensuring the perfect ordering whenever possible can have a notable impact on code execution performance. We’ve done targeted work to strengthen the memory model on ARM—specifically, we’ve inserted memory barriers at key points when writing to the managed heap to guarantee type safety—but we’ve made sure to only do this with a minimal impact on performance. The team went through multiple design reviews with experts to make sure that the techniques applied in the ARM CLR were correct. Moreover, performance benchmarks show that .NET code execution performance scales the same as native C++ code when compared across x86, x64 and ARM.
I'm worried about the correctness of the seemingly-standard pre-C#6 pattern for firing an event:
EventHandler localCopy = SomeEvent;
if (localCopy != null)
    localCopy(this, args);
I've read Eric Lippert's Events and races and know that there is a remaining issue of calling a stale event handler, but my worry is whether the compiler/JITter is allowed to optimize away the local copy, effectively rewriting the code as
if (SomeEvent != null)
    SomeEvent(this, args);
with possible NullReferenceException.
According to the C# Language Specification, §3.10,
The critical execution points at which the order of these side effects must be preserved are references to volatile fields (§10.5.3), lock statements (§8.12), and thread creation and termination.
— so there are no critical execution points in the mentioned pattern, and the optimizer is not constrained by that.
The related answer by Jon Skeet (year 2009) states
The JIT isn't allowed to perform the optimization you're talking about in the first part, because of the condition. I know this was raised as a spectre a while ago, but it's not valid. (I checked it with either Joe Duffy or Vance Morrison a while ago; I can't remember which.)
— but comments refer to this blog post (year 2008): Events and Threads (Part 4), which basically says that CLR 2.0's JITter (and probably subsequent versions?) must not introduce reads or writes, so there must be no problem under Microsoft .NET. But this seems to say nothing about other .NET implementations.
[Side note: I don't see how non-introducing of reads proves the correctness of the said pattern. Couldn't JITter just see some stale value of SomeEvent in some other local variable and optimize out one of the reads, but not the other? Perfectly legitimate, right?]
Moreover, this MSDN article (year 2012): The C# Memory Model in Theory and Practice by Igor Ostrovsky states the following:
Non-Reordering Optimizations
Some compiler optimizations may introduce or eliminate certain memory operations. For example, the compiler might replace repeated reads of a field with a single read. Similarly, if code reads a field and stores the value in a local variable and then repeatedly reads the variable, the compiler could choose to repeatedly read the field instead.
Because the ECMA C# spec doesn’t rule out the non-reordering optimizations, they’re presumably allowed. In fact, as I’ll discuss in Part 2, the JIT compiler does perform these types of optimizations.
This seems to contradict Jon Skeet's answer.
As C# is no longer a Windows-only language, the question arises whether the validity of the pattern is a consequence of limited JITter optimizations in the current CLR implementation, or an expected property of the language.
So, the question is the following: is the pattern being discussed valid from the point of view of C#-the-language? (That determines whether a conforming compiler/runtime is required to prohibit certain kinds of optimizations.)
Of course, normative references on the topic are welcome.
According to the sources you provided and a few others in the past, it breaks down to this:
With the Microsoft implementation, you can rely on not having read introduction [1] [2] [3].
For any other implementation, it may have read introduction unless it states otherwise.
EDIT: Having re-read the ECMA CLI specification carefully, read introductions are possible, but constrained. From Partition I, 12.6.4 Optimization:
Conforming implementations of the CLI are free to execute programs using any technology that guarantees, within a single thread of execution, that side-effects and exceptions generated by a thread are visible in the order specified by the CIL. For this purpose only volatile operations (including volatile reads) constitute visible side-effects. (Note that while only volatile operations constitute visible side-effects, volatile operations also affect the visibility of non-volatile references.)
A very important part of this paragraph is in parentheses:
Note that while only volatile operations constitute visible side-effects, volatile operations also affect the visibility of non-volatile references.
So, if the generated CIL reads a field only once, the implementation must behave the same. If it introduces reads, it's because it can prove that the subsequent reads will yield the same result, even facing side effects from other threads. If it cannot prove that and it still introduces reads, it's a bug.
In the same manner, C# the language also constrains read introduction at the C#-to-CIL level. From the C# Language Specification Version 5.0, 3.10 Execution Order:
Execution of a C# program proceeds such that the side effects of each executing thread are preserved at critical execution points. A side effect is defined as a read or write of a volatile field, a write to a non-volatile variable, a write to an external resource, and the throwing of an exception. The critical execution points at which the order of these side effects must be preserved are references to volatile fields (§10.5.3), lock statements (§8.12), and thread creation and termination. The execution environment is free to change the order of execution of a C# program, subject to the following constraints:
Data dependence is preserved within a thread of execution. That is, the value of each variable is computed as if all statements in the thread were executed in original program order.
Initialization ordering rules are preserved (§10.5.4 and §10.5.5).
The ordering of side effects is preserved with respect to volatile reads and writes (§10.5.3). Additionally, the execution environment need not evaluate part of an expression if it can deduce that that expression’s value is not used and that no needed side effects are produced (including any caused by calling a method or accessing a volatile field). When program execution is interrupted by an asynchronous event (such as an exception thrown by another thread), it is not guaranteed that the observable side effects are visible in the original program order.
The point about data dependence is the one I want to emphasize:
Data dependence is preserved within a thread of execution. That is, the value of each variable is computed as if all statements in the thread were executed in original program order.
As such, looking at your example (similar to the one given by Igor Ostrovsky [4]):
EventHandler localCopy = SomeEvent;
if (localCopy != null)
    localCopy(this, args);
The C# compiler should not perform read introduction, ever. Even if it can prove that there are no interfering accesses, there's no guarantee from the underlying CLI that two sequential non-volatile reads on SomeEvent will have the same result.
Or, using the equivalent null conditional operator since C# 6.0:
SomeEvent?.Invoke(this, args);
The C# compiler should always expand it to the previous code (guaranteeing a unique, non-conflicting variable name) without performing read introduction, as that would reintroduce the race condition.
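To make this concrete, the expansion is expected to look roughly like the following (the temporary's real name is compiler-generated; tmp here is just a stand-in):

EventHandler tmp = SomeEvent;  // exactly one read of the field
if (tmp != null)
    tmp(this, args);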
The JIT compiler should only perform the read introduction if it can prove that there are no interfering accesses, depending on the underlying hardware platform, such that the two sequential non-volatile reads on SomeEvent will in fact have the same result. This may not be the case if, for instance, the value is not kept in a register and if the cache may be flushed between reads.
Such optimization, if local, can only be performed on plain (non-ref and non-out) parameters and non-captured local variables. With inter-method or whole program optimizations, it can be performed on shared fields, ref or out parameters and captured local variables that can be proven they are never visibly affected by other threads.
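As a small, hypothetical illustration of the local case: re-reading a plain parameter or a non-captured local is always safe, because nothing outside the current thread can observe or modify it.

static void Raise(EventHandler handler, object sender, EventArgs args)
{
    EventHandler local = handler; // neither 'handler' nor 'local' is shared or captured
    if (local != null)
        local(sender, args);      // the JIT could legally re-read 'handler' here instead of
                                  // 'local', because no other thread can change either of them
}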
So, there's a big difference whether it's you writing the following code or the C# compiler generating the following code, versus the JIT compiler generating machine code equivalent to the following code, as the JIT compiler is the only one capable of proving if the introduced read is consistent with the single thread execution, even facing potential side-effects caused by other threads:
if (SomeEvent != null)
SomeEvent(this, args);
An introduced read that may yield a different result is a bug, even according to the standard, as there would be an observable difference compared with executing the code in program order without the introduced read.
As such, if the comment in Igor Ostrovsky's example [4] is true, I say it's a bug.
[1]: A comment by Eric Lippert; quoting:
To address your point about the ECMA CLI spec and the C# spec: the stronger memory model promises made by CLR 2.0 are promises made by Microsoft. A third party that decided to make their own implementation of C# that generates code that runs on their own implementation of CLI could choose a weaker memory model and still be compliant with the specifications. Whether the Mono team has done so, I do not know; you'll have to ask them.
[2]: CLR 2.0 memory model by Joe Duffy, reiterating the next link; quoting the relevant part:
Rule 1: Data dependence among loads and stores is never violated.
Rule 2: All stores have release semantics, i.e. no load or store may move after one.
Rule 3: All volatile loads are acquire, i.e. no load or store may move before one.
Rule 4: No loads and stores may ever cross a full-barrier (e.g. Thread.MemoryBarrier, lock acquire, Interlocked.Exchange, Interlocked.CompareExchange, etc.).
Rule 5: Loads and stores to the heap may never be introduced.
Rule 6: Loads and stores may only be deleted when coalescing adjacent loads and stores from/to the same location.
[3]: Understand the Impact of Low-Lock Techniques in Multithreaded Apps by Vance Morrison, the latest snapshot I could get on the Internet Archive; quoting the relevant portion:
Strong Model 2: .NET Framework 2.0
(...)
All the rules that are contained in the ECMA model, in particular the three fundamental memory model rules as well as the ECMA rules for volatile.
Reads and writes cannot be introduced.
A read can only be removed if it is adjacent to another read to the same location from the same thread. A write can only be removed if it is adjacent to another write to the same location from the same thread. Rule 5 can be used to make reads or writes adjacent before applying this rule.
Writes cannot move past other writes from the same thread.
Reads can only move earlier in time, but never past a write to the same memory location from the same thread.
[4]: C# - The C# Memory Model in Theory and Practice, Part 2 by Igor Ostrovsky, where he shows a read introduction example that, according to him, the JIT may perform such that two consequent reads may have different results; quoting the relevant part:
Read Introduction
As I just explained, the compiler sometimes fuses multiple reads into one. The compiler can also split a single read into multiple reads. In the .NET Framework 4.5, read introduction is much less common than read elimination and occurs only in very rare, specific circumstances. However, it does sometimes happen.
To understand read introduction, consider the following example:
public class ReadIntro {
    private Object _obj = new Object();
    void PrintObj() {
        Object obj = _obj;
        if (obj != null) {
            Console.WriteLine(obj.ToString()); // May throw a NullReferenceException
        }
    }
    void Uninitialize() {
        _obj = null;
    }
}
If you examine the PrintObj method, it looks like the obj value will never be null in the obj.ToString expression. However, that line of code could in fact throw a NullReferenceException. The CLR JIT might compile the PrintObj method as if it were written like this:
void PrintObj() {
    if (_obj != null) {
        Console.WriteLine(_obj.ToString());
    }
}
Because the read of the _obj field has been split into two reads of the field, the ToString method may now be called on a null target.
Note that you won’t be able to reproduce the NullReferenceException using this code sample in the .NET Framework 4.5 on x86-x64. Read introduction is very difficult to reproduce in the .NET Framework 4.5, but it does nevertheless occur in certain special circumstances.
The optimizer is not allowed to transform code that stores a value in a local variable and later uses it into code where every use of that variable is replaced by the original expression used to initialize it. That's not a valid transformation to make, so it's not an "optimization". The expression can cause, or be dependent on, side effects, so that expression needs to be run, stored somewhere, and then used where specified. It would be an invalid transformation for the runtime to resolve the event to a delegate twice when your code only has it done once.
As far as reordering is concerned: the reordering of operations is quite complicated with respect to multiple threads, but the whole point of this pattern is that you're now doing the relevant logic in a single-threaded context. The value of the event is stored into a local, and that read can be ordered more or less arbitrarily with respect to any code running in other threads, but the read of that value into the local variable cannot be reordered with respect to subsequent operations of that same thread, namely the if check or the invocation of the delegate.
Given that, the pattern does indeed do what it intends to do, which is to take a snapshot of the event and invoke all handlers, if there are any, without throwing a NullReferenceException due to there not being any handlers.
A question like mine has been asked, but mine is a bit different. The question is, "Why is the volatile keyword not allowed in C# on types System.Double and System.Int64, etc.?"
On first blush, I answered my colleague, "Well, on a 32-bit machine, those types take at least two ticks to even enter the processor, and the .Net framework has the intention of abstracting away processor-specific details like that." To which he responds, "It's not abstracting anything if it's preventing you from using a feature because of a processor-specific problem!"
He's implying that a processor-specific detail should not show up to a person using a framework that "abstracts" details like that away from the programmer. So, the framework (or C#) should abstract away those and do what it needs to do to offer the same guarantees for System.Double, etc. (whether that's a Semaphore, memory barrier, or whatever). I argued that the framework shouldn't add the overhead of a Semaphore on volatile, because the programmer isn't expecting such overhead with such a keyword, because a Semaphore isn't necessary for the 32-bit types. The greater overhead for the 64-bit types might come as a surprise, so, better for the .Net framework to just not allow it, and make you do your own Semaphore on larger types if the overhead is acceptable.
That led to our investigating what the volatile keyword is all about. (see this page). That page states, in the notes:
In C#, using the volatile modifier on a field guarantees that all access to that field uses VolatileRead or VolatileWrite.
Hmmm... VolatileRead and VolatileWrite both support our 64-bit types! My question, then, is:
"Why is the volatile keyword not allowed in C# on types System.Double and System.Int64, etc.?"
He's implying that a processor-specific detail should not show up to a person using a framework that "abstracts" details like that away from the programmer.
If you are using low-lock techniques like volatile fields, explicit memory barriers, and the like, then you are entirely in the world of processor-specific details. You need to understand at a deep level precisely what the processor is and is not allowed to do as far as reordering, consistency, and so on, in order to write correct, portable, robust programs that use low-lock techniques.
The point of this feature is to say "I am abandoning the convenient abstractions guaranteed by single-threaded programming and embracing the performance gains possible by having a deep implementation-specific knowledge of my processor." You should expect less abstractions at your disposal when you start using low-lock techniques, not more abstractions.
You're going "down to the metal" for a reason, presumably; the price you pay is having to deal with the quirks of said metal.
Yes. The reason is that you can't even read a double or a long in one operation. I agree that it is a poor abstraction. I have a feeling the reason was that reading them atomically requires effort and would have demanded too much cleverness from the compiler, so they let you choose the best solution yourself: locking, Interlocked, etc.
Interestingly, they can actually be read atomically on 32-bit hardware using MMX registers, which is what the Java JIT compiler does, and they can be read atomically on a 64-bit machine. So I think it is a serious flaw in the design.
Not really an answer to your question, but...
I'm pretty sure that the MSDN documentation you've referenced is incorrect when it states that "using the volatile modifier on a field guarantees that all access to that field uses VolatileRead or VolatileWrite".
Directly reading or writing to a volatile field only generates a half-fence (an acquire-fence when reading and a release-fence when writing).
The VolatileRead and VolatileWrite methods use MemoryBarrier internally, which generates a full-fence.
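A minimal sketch of how these methods have commonly been described as being implemented (illustrative, not the actual BCL source); the full fence is what makes them stronger than a plain volatile field access:

public static int VolatileReadSketch(ref int address)
{
    int value = address;
    Thread.MemoryBarrier(); // full fence, stronger than the acquire half-fence of a volatile read
    return value;
}

public static void VolatileWriteSketch(ref int address, int value)
{
    Thread.MemoryBarrier(); // full fence, stronger than the release half-fence of a volatile write
    address = value;
}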
Joe Duffy knows a thing or two about concurrent programming; this is what he has to say about volatile:
(As an aside, many people wonder about the difference between loads and stores of variables marked as volatile and calls to Thread.VolatileRead and Thread.VolatileWrite. The difference is that the latter APIs are implemented stronger than the jitted code: they achieve acquire/release semantics by emitting full fences on the right side. The APIs are more expensive to call too, but at least allow you to decide on a callsite-by-callsite basis which individual loads and stores need the MM guarantees.)
The simple explanation is legacy. If you read this article - http://msdn.microsoft.com/en-au/magazine/cc163715.aspx - you'll find that the only implementation of the .NET Framework 1.x runtime was on x86 machines, so it made sense for Microsoft to implement it against the x86 memory model. x64 and IA64 were added later, so the base memory model has always been that of x86.
Could it have been implemented for x86? I'm actually not sure it could be fully implemented: a ref to a double returned from native code could be aligned to 4 bytes instead of 8, in which case all your guarantees of atomic reads/writes would no longer hold.
Starting with .NET Framework 4.5, it is possible to perform a volatile read or write on a long or double variable by using the Volatile.Read and Volatile.Write methods. Although it's not documented, these methods perform atomic reads and writes on long/double variables, as is evident from their implementation:
private struct VolatileIntPtr { public volatile IntPtr Value; }

[Intrinsic]
[NonVersionable]
public static long Read(ref long location) =>
#if TARGET_64BIT
    (long)Unsafe.As<long, VolatileIntPtr>(ref location).Value;
#else
    // On 32-bit machines, we use Interlocked, since an ordinary volatile read would not be atomic.
    Interlocked.CompareExchange(ref location, 0, 0);
#endif
Using these two methods is not as convenient as the volatile keyword, though: care is required not to forget to wrap every read and write of the field in Volatile.Read or Volatile.Write respectively.
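For example (the class and member names here are illustrative), every access to a shared long has to be routed through the helper methods by hand:

class Ticker
{
    private long _timestamp; // cannot be declared volatile, so the discipline is manual

    public void Update(long value) => Volatile.Write(ref _timestamp, value);

    public long Current => Volatile.Read(ref _timestamp); // atomic even on 32-bit platforms
}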
I have a question about the following code sample (m_value isn't volatile, and every thread runs on a separate processor)
void Foo() // executed by thread #1, BEFORE Bar() is executed
{
    Interlocked.Exchange(ref m_value, 1);
}

bool Bar() // executed by thread #2, AFTER Foo() is executed
{
    return m_value == 1;
}
Does using Interlocked.Exchange in Foo() guarantees that when Bar() is executed, I'll see the value "1"? (even if the value already exists in a register or cache line?) Or do I need to place a memory barrier before reading the value of m_value?
Also (unrelated to the original question), is it legal to declare a volatile member and pass it by reference to InterlockedXX methods? (the compiler warns about passing volatiles by reference, so should I ignore the warning in such case?)
Please Note, I'm not looking for "better ways to do things", so please don't post answers that suggest completely alternate ways to do things ("use a lock instead" etc.), this question comes out of pure interest..
Memory barriers don't particularly help you. They specify an ordering between memory operations, and in this case each thread only has one memory operation, so it doesn't matter. One typical scenario is writing non-atomically to fields in a structure, issuing a memory barrier, then publishing the address of the structure to other threads. The barrier guarantees that the writes to the structure's members are seen by all CPUs before they get its address.
What you really need are atomic operations, i.e. the InterlockedXXX functions, or volatile variables in C#. If the read in Bar were atomic, you could guarantee that neither the compiler nor the CPU performs any optimization that prevents it from reading either the value before the write in Foo, or the value after the write in Foo, depending on which gets executed first. Since you are saying that you "know" Foo's write happens before Bar's read, Bar would then always return true.
Without the read in Bar being atomic, it could be reading a partially updated value (i.e. garbage), or a cached value (either from the compiler or from the CPU), both of which may prevent Bar from returning true as it should.
Most modern CPUs guarantee that word-aligned reads are atomic, so the real trick is that you have to tell the compiler that the read is atomic.
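One hedged way to express that in the question's example (assuming m_value is an int field) is to make the read explicitly volatile, so neither the compiler nor the CPU may substitute a stale cached value:

bool Bar()
{
    // Acquire read: observes a value at least as recent as the Interlocked.Exchange
    // performed in Foo before this method runs.
    return Volatile.Read(ref m_value) == 1;
}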
The usual pattern for memory barrier usage matches what you would put in the implementation of a critical section, but split into pairs for the producer and consumer. As an example your critical section implementation would typically be of the form:
while (!pShared->lock.testAndSet_Acquire())
    ;
// (This loop should include all the normal critical-section stuff: spin, waste,
// pause() instructions, and a last-resort give-up-and-block on a resource
// until the lock is made available.)

// Access to shared memory.
pShared->foo = 1;
v = pShared->goo;

pShared->lock.clear_Release();
The acquire memory barrier above makes sure that any loads (pShared->goo) that may have been started before the successful lock modification are tossed, to be restarted if necessary.
The release memory barrier ensures that the load from goo into the (say, local) variable v is complete before the lock word protecting the shared memory is cleared.
You have a similar pattern in the typical producer/consumer atomic-flag scenario (it is difficult to tell from your sample whether that is what you are doing, but it should illustrate the idea).
Suppose your producer used an atomic variable to indicate that some other state is ready to use. You'll want something like this:
pShared->goo = 14;
pShared->atomic.setBit_Release();
Without a "write" barrier here in the producer you have no guarantee that the hardware isn't going to get to the atomic store before the goo store has made it through the cpu store queues, and up through the memory hierarchy where it is visible (even if you have a mechanism that ensures the compiler orders things the way you want).
In the consumer
if (pShared->atomic.compareAndSwap_Acquire(1, 1))
{
    v = pShared->goo;
}
Without a "read" barrier here you won't know that the hardware hasn't gone and fetched goo for you before the atomic access is complete. The atomic (ie: memory manipulated with the Interlocked functions doing stuff like lock cmpxchg), is only "atomic" with respect to itself, not other memory.
Now, the remaining thing that has to be mentioned is that the barrier constructs are highly unportable. Your compiler probably provides _acquire and _release variations for most of the atomic manipulation methods, and these are the sorts of ways you would use them. Depending on the platform you are using (ie: ia32), these may very well be exactly what you would get without the _acquire() or _release() suffixes. Platforms where this matters are ia64 (effectively dead except on HP where its still twitching slightly), and powerpc. ia64 had .acq and .rel instruction modifiers on most load and store instructions (including the atomic ones like cmpxchg). powerpc has separate instructions for this (isync and lwsync give you the read and write barriers respectively).
Now. Having said all this. Do you really have a good reason for going down this path? Doing all this correctly can be very difficult. Be prepared for a lot of self doubt and insecurity in code reviews and make sure you have a lot of high concurrency testing with all sorts of random timing scenerios. Use a critical section unless you have a very very good reason to avoid it, and don't write that critical section yourself.
I'm not completely sure, but I think Interlocked.Exchange will use the InterlockedExchange function of the Windows API, which provides a full memory barrier anyway.
This function generates a full memory barrier (or fence) to ensure that memory operations are completed in order.
The interlocked exchange operations guarantee a memory barrier.
The following synchronization functions use the appropriate barriers
to ensure memory ordering:
Functions that enter or leave critical sections
Functions that signal synchronization objects
Wait functions
Interlocked functions
(Source : link)
But you are out of luck with register variables. If m_value is in a register in Bar, you won't see the change to m_value. Due to this, you should declare shared variables 'volatile'.
If m_value is not marked as volatile, then there is no reason to think that the value read in Bar is fenced. Compiler optimizations, caching, or other factors could reorder the reads and writes. Interlocked exchange is only helpful when it is used in an ecosystem of properly fenced memory references. This is the whole point of marking a field volatile. The .Net memory model is not as straight forward as some might expect.
Interlocked.Exchange() should guarantee that the value is flushed to all CPUs properly - it provides its own memory barrier.
I'm surprised that the compiler is complaining about passing a volatile into Interlocked.Exchange() - the fact that you're using Interlocked.Exchange() should almost mandate a volatile variable.
The problem you might see is that if the compiler does some heavy optimization of Bar() and realizes that nothing changes the value of m_value it can optimize away your check. That's what the volatile keyword would do - it would hint to the compiler that that variable may be changed outside of the optimizer's view.
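A small sketch of what that would look like, including the compiler warning mentioned in the question (CS0420, "a reference to a volatile field will not be treated as volatile"), which is generally considered safe to suppress when the reference is consumed by an Interlocked method:

using System.Threading;

class Example
{
    private volatile int m_value;

    public void Foo()
    {
#pragma warning disable 420 // safe here: Interlocked.Exchange is itself a full fence
        Interlocked.Exchange(ref m_value, 1);
#pragma warning restore 420
    }

    public bool Bar() => m_value == 1; // volatile read: not hoisted or cached by the optimizer
}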
If you don't tell the compiler or runtime that m_value should not be read ahead of Bar(), it can and may cache the value of m_value ahead of Bar() and simply use the cached value. If you want to ensure that it sees the "latest" version of m_value, either shove in a Thread.MemoryBarrier() or use Thread.VolatileRead(ref m_value). The latter is less expensive than a full memory barrier.
Ideally you could shove in a ReadBarrier, but the CLR doesn't seem to support that directly.
EDIT: Another way to think about it is that there are really two kinds of memory barriers: compiler memory barriers, which tell the compiler how to sequence reads and writes, and CPU memory barriers, which tell the CPU how to sequence reads and writes. The Interlocked functions use CPU memory barriers. Even if the compiler treated them as compiler memory barriers, it still wouldn't matter, as in this specific case Bar() could have been compiled separately, unaware of the other uses of m_value that would require a compiler memory barrier.