Events and multithreading once again - C#

I'm worried about the correctness of the seemingly-standard pre-C#6 pattern for firing an event:
EventHandler localCopy = SomeEvent;
if (localCopy != null)
localCopy(this, args);
I've read Eric Lippert's Events and races and know that there is a remaining issue of calling a stale event handler, but my worry is whether the compiler/JITter is allowed to optimize away the local copy, effectively rewriting the code as
if (SomeEvent != null)
SomeEvent(this, args);
with possible NullReferenceException.
According to the C# Language Specification, §3.10,
The critical execution points at which the order of these side effects must be preserved are references to volatile fields (§10.5.3), lock statements (§8.12), and thread creation and termination.
— so there are no critical execution points in the mentioned pattern, and the optimizer is not constrained by that requirement.
The related answer by Jon Skeet (year 2009) states
The JIT isn't allowed to perform the optimization you're talking about in the first part, because of the condition. I know this was raised as a spectre a while ago, but it's not valid. (I checked it with either Joe Duffy or Vance Morrison a while ago; I can't remember which.)
— but comments refer to this blog post (year 2008): Events and Threads (Part 4), which basically says that CLR 2.0's JITter (and probably subsequent versions?) must not introduce reads or writes, so there must be no problem under Microsoft .NET. But this seems to say nothing about other .NET implementations.
[Side note: I don't see how non-introducing of reads proves the correctness of the said pattern. Couldn't JITter just see some stale value of SomeEvent in some other local variable and optimize out one of the reads, but not the other? Perfectly legitimate, right?]
Moreover, this MSDN article (year 2012): The C# Memory Model in Theory and Practice by Igor Ostrovsky states the following:
Non-Reordering Optimizations Some compiler optimizations may introduce or eliminate certain memory operations. For example, the compiler might replace repeated reads of a field with a single read. Similarly, if code reads a field and stores the value in a local variable and then repeatedly reads the variable, the compiler could choose to repeatedly read the field instead.
Because the ECMA C# spec doesn’t rule out the non-reordering optimizations, they’re presumably allowed. In fact, as I’ll discuss in Part 2, the JIT compiler does perform these types of optimizations.
This seems to contradict Jon Skeet's answer.
Now that C# is no longer a Windows-only language, the question arises whether the validity of the pattern is a consequence of limited JITter optimizations in the current CLR implementation, or an expected property of the language.
So, the question is the following: is the pattern being discussed valid from the point of view of C#-the-language? (That determines whether a compiler/runtime is required to refrain from certain kinds of optimizations.)
Of course, normative references on the topic are welcome.

According to the sources you provided and a few others in the past, it breaks down to this:
With the Microsoft implementation, you can rely on not having read introduction [1] [2] [3].
For any other implementation, it may have read introduction unless it states otherwise.
EDIT: Having re-read the ECMA CLI specification carefully, I find that read introduction is possible, but constrained. From Partition I, §12.6.4 Optimization:
Conforming implementations of the CLI are free to execute programs using any technology that guarantees, within a single thread of execution, that side-effects and exceptions generated by a thread are visible in the order specified by the CIL. For this purpose only volatile operations (including volatile reads) constitute visible side-effects. (Note that while only volatile operations constitute visible side-effects, volatile operations also affect the visibility of non-volatile references.)
A very important part of this paragraph is in parentheses:
Note that while only volatile operations constitute visible side-effects, volatile operations also affect the visibility of non-volatile references.
So, if the generated CIL reads a field only once, the implementation must behave as if it reads it only once. If it introduces reads, it's because it can prove that the subsequent reads will yield the same result, even in the face of side effects from other threads. If it cannot prove that and it still introduces reads, it's a bug.
In the same manner, C# the language also constrains read introduction at the C#-to-CIL level. From the C# Language Specification Version 5.0, 3.10 Execution Order:
Execution of a C# program proceeds such that the side effects of each executing thread are preserved at critical execution points. A side effect is defined as a read or write of a volatile field, a write to a non-volatile variable, a write to an external resource, and the throwing of an exception. The critical execution points at which the order of these side effects must be preserved are references to volatile fields (§10.5.3), lock statements (§8.12), and thread creation and termination. The execution environment is free to change the order of execution of a C# program, subject to the following constraints:
Data dependence is preserved within a thread of execution. That is, the value of each variable is computed as if all statements in the thread were executed in original program order.
Initialization ordering rules are preserved (§10.5.4 and §10.5.5).
The ordering of side effects is preserved with respect to volatile reads and writes (§10.5.3). Additionally, the execution environment need not evaluate part of an expression if it can deduce that that expression’s value is not used and that no needed side effects are produced (including any caused by calling a method or accessing a volatile field). When program execution is interrupted by an asynchronous event (such as an exception thrown by another thread), it is not guaranteed that the observable side effects are visible in the original program order.
The point about data dependence is the one I want to emphasize:
Data dependence is preserved within a thread of execution. That is, the value of each variable is computed as if all statements in the thread were executed in original program order.
As such, looking at your example (similar to the one given by Igor Ostrovsky [4]):
EventHandler localCopy = SomeEvent;
if (localCopy != null)
localCopy(this, args);
The C# compiler should not perform read introduction, ever. Even if it can prove that there are no interfering accesses, there's no guarantee from the underlying CLI that two sequential non-volatile reads on SomeEvent will have the same result.
Or, using the equivalent null conditional operator since C# 6.0:
SomeEvent?.Invoke(this, args);
The C# compiler should always expand this to the previous code (guaranteeing a unique, non-conflicting variable name) without performing read introduction, as that would reintroduce the race condition.
The JIT compiler should only perform the read introduction if it can prove that there are no interfering accesses, depending on the underlying hardware platform, such that the two sequential non-volatile reads on SomeEvent will in fact have the same result. This may not be the case if, for instance, the value is not kept in a register and if the cache may be flushed between reads.
Such an optimization, if local, can only be performed on plain (non-ref and non-out) parameters and non-captured local variables. With inter-method or whole-program optimizations, it can also be performed on shared fields, ref or out parameters, and captured local variables that can be proven never to be visibly affected by other threads.
So there's a big difference between you writing the following code (or the C# compiler generating it) and the JIT compiler generating machine code equivalent to the following code: only the JIT compiler can prove that the introduced read is consistent with single-threaded execution, even in the face of potential side effects caused by other threads:
if (SomeEvent != null)
SomeEvent(this, args);
An introduced read that may yield a different result is a bug, even according to the standard, as there's an observable difference compared with executing the code in program order without the introduced read.
As such, if the comment in Igor Ostrovsky's example [4] is true, I say it's a bug.
[1]: A comment by Eric Lippert; quoting:
To address your point about the ECMA CLI spec and the C# spec: the stronger memory model promises made by CLR 2.0 are promises made by Microsoft. A third party that decided to make their own implementation of C# that generates code that runs on their own implementation of CLI could choose a weaker memory model and still be compliant with the specifications. Whether the Mono team has done so, I do not know; you'll have to ask them.
[2]: CLR 2.0 memory model by Joe Duffy, reiterating the next link; quoting the relevant part:
Rule 1: Data dependence among loads and stores is never violated.
Rule 2: All stores have release semantics, i.e. no load or store may move after one.
Rule 3: All volatile loads are acquire, i.e. no load or store may move before one.
Rule 4: No loads and stores may ever cross a full-barrier (e.g. Thread.MemoryBarrier, lock acquire, Interlocked.Exchange, Interlocked.CompareExchange, etc.).
Rule 5: Loads and stores to the heap may never be introduced.
Rule 6: Loads and stores may only be deleted when coalescing adjacent loads and stores from/to the same location.
[3]: Understand the Impact of Low-Lock Techniques in Multithreaded Apps by Vance Morrison, the latest snapshot I could get on the Internet Archive; quoting the relevant portion:
Strong Model 2: .NET Framework 2.0
(...)
All the rules that are contained in the ECMA model, in particular the three fundamental memory model rules as well as the ECMA rules for volatile.
Reads and writes cannot be introduced.
A read can only be removed if it is adjacent to another read to the same location from the same thread. A write can only be removed if it is adjacent to another write to the same location from the same thread. Rule 5 can be used to make reads or writes adjacent before applying this rule.
Writes cannot move past other writes from the same thread.
Reads can only move earlier in time, but never past a write to the same memory location from the same thread.
[4]: C# - The C# Memory Model in Theory and Practice, Part 2 by Igor Ostrovsky, where he shows a read introduction example that, according to him, the JIT may perform such that two consequent reads may have different results; quoting the relevant part:
Read Introduction As I just explained, the compiler sometimes fuses multiple reads into one. The compiler can also split a single read into multiple reads. In the .NET Framework 4.5, read introduction is much less common than read elimination and occurs only in very rare, specific circumstances. However, it does sometimes happen.
To understand read introduction, consider the following example:
public class ReadIntro {
  private Object _obj = new Object();
  void PrintObj() {
    Object obj = _obj;
    if (obj != null) {
      Console.WriteLine(obj.ToString()); // May throw a NullReferenceException
    }
  }
  void Uninitialize() {
    _obj = null;
  }
}
If you examine the PrintObj method, it looks like the obj value will never be null in the obj.ToString expression. However, that line of code could in fact throw a NullReferenceException. The CLR JIT might compile the PrintObj method as if it were written like this:
void PrintObj() {
  if (_obj != null) {
    Console.WriteLine(_obj.ToString());
  }
}
Because the read of the _obj field has been split into two reads of the field, the ToString method may now be called on a null target.
Note that you won’t be able to reproduce the NullReferenceException using this code sample in the .NET Framework 4.5 on x86-x64. Read introduction is very difficult to reproduce in the .NET Framework 4.5, but it does nevertheless occur in certain special circumstances.

The optimizer is not allowed to take code that stores an expression in a local variable which is later used, and transform it so that every use of that variable re-evaluates the original initializing expression. That's not a valid transformation to make, so it's not an "optimization". The expression can cause, or depend on, side effects, so it needs to be evaluated once, stored somewhere, and then used wherever the variable is used. It would be an invalid transformation for the runtime to resolve the event to a delegate twice when your code only does it once.
As far as reordering is concerned: the reordering of operations is quite complicated with respect to multiple threads, but the whole point of this pattern is that you're now doing the relevant logic in a single-threaded context. The value of the event is stored into a local, and that read can be ordered more or less arbitrarily with respect to any code running in other threads, but the read of that value into the local variable cannot be reordered with respect to subsequent operations of that same thread, namely the if check or the invocation of the delegate.
Given that, the pattern does indeed do what it intends to do, which is to take a snapshot of the event and invoke all of its handlers, if there are any, without throwing a NullReferenceException when there are none.
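For reference, the pattern is often packaged as a small helper so the snapshot is taken exactly once at the call site; here is a minimal sketch (the Raise extension method and the Publisher class are hypothetical names, not part of any framework):

using System;

static class EventExtensions
{
    // Hypothetical helper: 'handler' already holds the snapshot taken when
    // the extension method was called, so the null check and the invocation
    // both operate on that single copy.
    public static void Raise(this EventHandler handler, object sender, EventArgs args)
    {
        if (handler != null)
            handler(sender, args);
    }
}

class Publisher
{
    public event EventHandler SomeEvent;

    protected void OnSomeEvent()
    {
        // The field is read once, to produce the extension method's first
        // argument; calling an extension method on a null delegate is fine.
        SomeEvent.Raise(this, EventArgs.Empty);
    }
}

Since C# 6, SomeEvent?.Invoke(this, EventArgs.Empty) expresses the same thing more directly, as noted above.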

Related

Can memory reordering cause C# to access unallocated memory?

It is my understanding that C# is a safe language and doesn't allow one to access unallocated memory, other than through the unsafe keyword. However, its memory model allows reordering when there is unsynchronized access between threads. This leads to race hazards where references to new instances appear to be available to racing threads before the instances have been fully initialized, and is a widely known problem for double-checked locking. Chris Brumme (from the CLR team) explains this in their Memory Model article:
Consider the standard double-locking protocol:
if (a == null)
{
    lock(obj)
    {
        if (a == null)
            a = new A();
    }
}
This is a common technique for avoiding a lock on the read of ‘a’ in the typical case. It works just fine on X86. But it would be broken by a legal but weak implementation of the ECMA CLI spec. It’s true that, according to the ECMA spec, acquiring a lock has acquire semantics and releasing a lock has release semantics.
However, we have to assume that a series of stores have taken place during construction of ‘a’. Those stores can be arbitrarily reordered, including the possibility of delaying them until after the publishing store which assigns the new object to ‘a’. At that point, there is a small window before the store.release implied by leaving the lock. Inside that window, other CPUs can navigate through the reference ‘a’ and see a partially constructed instance.
I've always been confused by what "partially constructed instance" means. Assuming that the .NET runtime clears out memory on allocation rather than garbage collection (discussion), does this mean that the other thread might read memory that still contains data from garbage-collected objects (like what happens in unsafe languages)?
Consider the following concrete example:
byte[] buffer = new byte[2];
Parallel.Invoke(
    () => buffer = new byte[4],
    () => Console.WriteLine(BitConverter.ToString(buffer)));
The above has a race condition; the output would be either 00-00 or 00-00-00-00. However, is it possible that the second thread reads the new reference to buffer before the array's memory has been initialized to 0, and outputs some other arbitrary string instead?
Let's not bury the lede here: the answer to your question is no, you will never observe the pre-allocated state of memory in the CLR 2.0 memory model.
I'll now address a couple of your non-central points.
It is my understanding that C# is a safe language and doesn't allow one to access unallocated memory, other than through the unsafe keyword.
That is more or less correct. There are some mechanisms by which one can access bogus memory without using unsafe -- via unmanaged code, obviously, or by abusing structure layout. But in general, yes, C# is memory safe.
However, its memory model allows reordering when there is unsynchronized access between threads.
Again, that's more or less correct. A better way to think about it is that C# allows reordering at any point where the reordering would be invisible to a single threaded program, subject to certain constraints. Those constraints include introducing acquire and release semantics in certain cases, and preserving certain side effects at certain critical points.
Chris Brumme (from the CLR team) ...
The late great Chris's articles are gems and give a great deal of insight into the early days of the CLR, but I note that there have been some strengthenings of the memory model since 2003 when that article was written, particularly with respect to the issue you raise.
Chris is right that double-checked locking is super dangerous. There is a correct way to do double-checked locking in C#, and the moment you depart from it even slightly, you are off in the weeds of horrible bugs that only repro on weak memory model hardware.
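For reference, a minimal sketch of the form usually cited as correct, with the field marked volatile so the publishing write has the required semantics (the Cache and Expensive names are made up for illustration; Lazy&lt;T&gt; is the simpler modern alternative):

class Expensive { /* expensive construction elided */ }

class Cache
{
    // volatile is the essential difference from the broken variant:
    // the publishing write to _instance cannot be observed before the
    // writes performed inside the Expensive constructor.
    private static volatile Expensive _instance;
    private static readonly object _padlock = new object();

    public static Expensive Instance
    {
        get
        {
            if (_instance == null)          // cheap, unlocked check
            {
                lock (_padlock)
                {
                    if (_instance == null)  // re-check under the lock
                        _instance = new Expensive();
                }
            }
            return _instance;
        }
    }
}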
does this mean that the other thread might read memory that still contains data from garbage-collected objects
I think your question is not specifically about the old weak ECMA memory model that Chris was describing, but rather about what guarantees are actually made today.
It is not possible for re-orderings to expose the previous state of objects. You are guaranteed that when you read a freshly-allocated object, its fields are all zeros.
This is made possible by the fact that all writes have release semantics in the current memory model; see this for details:
http://joeduffyblog.com/2007/11/10/clr-20-memory-model/
The write that initializes the memory to zero will not be moved forwards in time with respect to a read later.
I've always been confused by "partially constructed objects"
Joe discusses that here: http://joeduffyblog.com/2010/06/27/on-partiallyconstructed-objects/
Here the concern is not that we might see the pre-allocation state of an object. Rather, the concern here is that one thread might see an object while the constructor is still running on another thread.
Indeed, it is possible for the constructor and the finalizer to be running concurrently, which is super weird! Finalizers are hard to write correctly for this reason.
Put another way: the CLR guarantees you that its own invariants will be preserved. An invariant of the CLR is that newly allocated memory is observed to be zeroed out, so that invariant will be preserved.
But the CLR is not in the business of preserving your invariants! If you have a constructor which guarantees that field x is true if and only if y is non-null, then you are responsible for ensuring that this invariant is always observed to be true. If in some way this is observed by two threads, then one of those threads might observe the invariant being violated.
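A small hypothetical illustration of that last point: the program's own invariant below is "Ready is true only if Payload is non-null", and with plain reads and writes and no synchronization, a reader may observe it violated.

using System;

class Flags
{
    public bool Ready;      // program invariant: true only if Payload != null
    public object Payload;
}

class Program
{
    static readonly Flags _flags = new Flags();

    static void Writer()
    {
        _flags.Payload = new object();
        _flags.Ready = true;            // two ordinary, unordered writes
    }

    static void Reader()
    {
        if (_flags.Ready)
        {
            // The read of Payload may be satisfied before the read of Ready,
            // so this can report a violation even though Writer assigned
            // Payload first. The CLR only guarantees its own invariants
            // (e.g. zeroed memory for new objects), not this one.
            Console.WriteLine(_flags.Payload == null ? "invariant violated" : "ok");
        }
    }
}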

Can a read instruction after an unrelated lock statement be moved before the lock?

This question is a follow-up to comments in this thread.
Let's assume we have the following code:
// (1)
lock (padlock)
{
    // (2)
}
var value = nonVolatileField; // (3)
Furthermore, let's assume that no instruction in (2) has any effect on the nonVolatileField and vice versa.
Can the reading instruction (3) be reordered in such a way that it ends up before the lock statement (1) or inside it (2)?
As far as I can tell, nothing in the C# Specification (§3.10) and the CLI Specification (§I.12.6.5) prohibits such reordering.
Please note that this is not the same question as this one. Here I am asking specifically about read instructions, because as far as I understand, they are not considered side-effects and have weaker guarantees.
I believe this is partially guaranteed by the CLI spec, although it's not as clear as it might be. From I.12.6.5:
Acquiring a lock (System.Threading.Monitor.Enter or entering a synchronized method) shall implicitly perform a volatile read operation, and releasing a lock (System.Threading.Monitor.Exit or leaving a synchronized method) shall implicitly perform a volatile write operation. See §I.12.6.7.
Then from I.12.6.7:
A volatile read has “acquire semantics” meaning that the read is guaranteed to occur prior to any references to memory that occur after the read instruction in the CIL instruction sequence. A volatile write has “release semantics” meaning that the write is guaranteed to happen after any memory references prior to the write instruction in the CIL instruction sequence.
So entering the lock should prevent (3) from moving to (1). Reading from nonVolatileField still counts as a "reference to memory", I believe. However, the read could still be performed before the volatile write when the lock exits, so it could still be moved to (2).
The C#/CLI memory model leaves a lot to be desired at the moment. I'm hoping that the whole thing can be clarified significantly (and probably tightened up, to make some "theoretically valid but practically awful" optimizations invalid).
As far as .NET is concerned, entering a monitor (the lock statement) has acquire semantics, as it implicitly performs a volatile read, and exiting a monitor (the end of the lock block) has release semantics, as it implicitly performs a volatile write (see §12.6.5 Locks and Threads in Common Language Infrastructure (CLI) Partition I).
volatile bool areWeThereYet = false;

// In thread 1
// Accesses, usually writes: create objects, initialize them
areWeThereYet = true;

// In thread 2
if (areWeThereYet)
{
    // Accesses, usually reads: use created and initialized objects
}
When you write a value to areWeThereYet, all accesses before it were performed and not reordered to after the volatile write.
When you read from areWeThereYet, subsequent accesses are not reordered to before the volatile read.
In this case, when thread 2 observes that areWeThereYet has changed, it has a guarantee that the following accesses, usually reads, will observe the other thread's accesses, usually writes. Assuming there is no other code messing with the affected variables.
As for other synchronization primitives in .NET, such as SemaphoreSlim, although it is not explicitly documented, they would be rather useless if they didn't have similar semantics. Programs based on them could, in fact, not even work correctly on platforms or hardware architectures with a weaker memory model.
Many people share the view that Microsoft ought to enforce a strong memory model on such architectures, similar to x86/amd64, so as to keep the current code base (Microsoft's own and their clients') compatible.
I cannot verify this myself, as I don't have an ARM device with Microsoft Windows, much less with the .NET Framework for ARM, but at least one MSDN magazine article by Andrew Pardoe, CLR - .NET Development for ARM Processors, states:
The CLR is allowed to expose a stronger memory model than the ECMA CLI specification requires. On x86, for example, the memory model of the CLR is strong because the processor’s memory model is strong. The .NET team could’ve made the memory model on ARM as strong as the model on x86, but ensuring the perfect ordering whenever possible can have a notable impact on code execution performance. We’ve done targeted work to strengthen the memory model on ARM—specifically, we’ve inserted memory barriers at key points when writing to the managed heap to guarantee type safety—but we’ve made sure to only do this with a minimal impact on performance. The team went through multiple design reviews with experts to make sure that the techniques applied in the ARM CLR were correct. Moreover, performance benchmarks show that .NET code execution performance scales the same as native C++ code when compared across x86, x64 and ARM.

What is C#'s version of the GIL?

In the current implementation of CPython, there is an object known as the "GIL" or "Global Interpreter Lock". It is essentially a mutex that prevents two Python threads from executing Python code at the same time. This prevents two threads from being able to corrupt the state of the Python interpreter, but also prevents multiple threads from really executing together. Essentially, if I do this:
# Thread A
some_list.append(3)
# Thread B
some_list.append(4)
I can't corrupt the list, because at any given time, only one of those threads is executing, since they must hold the GIL to do so. Now, the items in the list might be added in some indeterminate order, but the point is that the list isn't corrupted, and two things will always get added.
So, now to C#. C# essentially faces the same problem as Python, so how does C# prevent this? I'd also be interested in hearing Java's story, if anyone knows it.
Clarification: I'm interested in what happens without explicit locking statements, especially to the VM. I am aware that locking primitives exist for both Java & C# - they exist in Python as well: The GIL is not used for multi-threaded code, other than to keep the interpreter sane. I am interested in the direct equivalent of the above, so, in C#, if I can remember enough... :-)
List<String> s = new List<String>();
// Reference to s is shared by two threads, which both execute this:
s.Add("hello");
// State of s?
// State of the VM? (And if sane, how so?)
Here's another example:
class A
{
    public String s;
}

// Thread A & B
some_A.s = some_other_value;

// some_A's state must change: how does it change?
// Is the VM still in good shape afterwards?
I'm not looking to write bad C# code; I understand the lock statements. Even in Python, the GIL doesn't give you magic multi-threaded code: you must still lock shared resources. But the GIL prevents Python's "VM" from being corrupted - it is this behavior that I'm interested in.
Most other languages that support threading don't have an equivalent of the Python GIL; they require you to use mutexes, either implicitly or explicitly.
Using lock, you would do this:
lock (some_list)
{
    some_list.Add(3);
}
and in thread 2:
lock (some_list)
{
    some_list.Add(4);
}
The lock statement ensures that the object inside the lock statement, some_list in this case, can only be accessed by a single thread at a time. See http://msdn.microsoft.com/en-us/library/c5kehkcz(VS.80).aspx for more information.
C# does not have an equivalent of Python's GIL.
Though they face the same issue, their design goals make them different. With the GIL, CPython ensures that operations such as appending to a list from two threads are simple, which also means that it allows only one thread to run at any time. This makes lists and dictionaries thread safe. Though this makes the job simpler and more intuitive, it makes it harder to exploit the advantage of multithreading on multicore machines.
With no GIL, C# does the opposite. It puts the burden of integrity on the developer of the program, but allows you to take advantage of running multiple threads simultaneously.
As per one of the discussions: the GIL in CPython is purely a design choice of having a big lock versus a lock per object and synchronisation to make sure that objects are kept in a coherent state. This consists of a trade-off: giving up the full power of multithreading. It has been observed that most problems do not suffer from this disadvantage, and there are libraries which help you solve exactly this issue when required. That means that, for a certain class of problems, the burden of utilizing multiple cores is passed to the developer, so that the rest can enjoy the simpler, more intuitive approach.
Note: other implementations, like IronPython, do not have a GIL.
It may be instructive to look at the documentation for the Java equivalent of the class you're discussing:
Note that this implementation is not synchronized. If multiple threads access an ArrayList instance concurrently, and at least one of the threads modifies the list structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more elements, or explicitly resizes the backing array; merely setting the value of an element is not a structural modification.) This is typically accomplished by synchronizing on some object that naturally encapsulates the list. If no such object exists, the list should be "wrapped" using the Collections.synchronizedList method. This is best done at creation time, to prevent accidental unsynchronized access to the list:
List list = Collections.synchronizedList(new ArrayList(...));
The iterators returned by this class's iterator and listIterator methods are fail-fast: if the list is structurally modified at any time after the iterator is created, in any way except through the iterator's own remove or add methods, the iterator will throw a ConcurrentModificationException. Thus, in the face of concurrent modification, the iterator fails quickly and cleanly, rather than risking arbitrary, non-deterministic behavior at an undetermined time in the future.
Note that the fail-fast behavior of an iterator cannot be guaranteed as it is, generally speaking, impossible to make any hard guarantees in the presence of unsynchronized concurrent modification. Fail-fast iterators throw ConcurrentModificationException on a best-effort basis. Therefore, it would be wrong to write a program that depended on this exception for its correctness: the fail-fast behavior of iterators should be used only to detect bugs.
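For comparison, List&lt;T&gt; in .NET has a similar best-effort version check in its enumerator; here is a small sketch (single-threaded modification shown, because with unsynchronized multi-threaded access the check may or may not fire and the list itself can still end up corrupted):

using System;
using System.Collections.Generic;

class FailFastDemo
{
    static void Main()
    {
        var list = new List<int> { 1, 2, 3 };
        try
        {
            foreach (var item in list)
                list.Add(item);     // invalidates the enumerator's version
        }
        catch (InvalidOperationException e)
        {
            // "Collection was modified; enumeration operation may not execute."
            Console.WriteLine(e.Message);
        }
    }
}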
Most complex data structures (for example, lists) can be corrupted when used from multiple threads without locking.
Since changes of references are atomic, a reference always stays a valid reference.
But there is a problem when interacting with security-critical code. So any data structures used by critical code must be one of the following:
Inaccessible from untrusted code, and locked/used correctly by trusted code
Immutable (String class)
Copied before use (valuetype parameters)
Written in trusted code and uses internal locking to guarantee a safe state
For example, critical code cannot trust a list accessible from untrusted code. If it gets passed a List, it has to create a private copy, do its precondition checks on the copy, and then operate on the copy.
I'm going to take a wild guess at what the question really means...
In Python, data structures in the interpreter would get corrupted (without the GIL) because Python uses a form of reference counting.
Both C# and Java use garbage collection and in fact they do use a global lock when doing a full heap collection.
Data can be marked and moved between "generations" without a lock. But to actually clean it up everything must come to a stop. Hopefully a very short stop, but a full stop.
Here is an interesting link on CLR garbage collection as of 2007:
http://vineetgupta.spaces.live.com/blog/cns!8DE4BDC896BEE1AD!1104.entry

Interlocked and Memory Barriers

I have a question about the following code sample (m_value isn't volatile, and every thread runs on a separate processor)
void Foo() // executed by thread #1, BEFORE Bar() is executed
{
    Interlocked.Exchange(ref m_value, 1);
}

bool Bar() // executed by thread #2, AFTER Foo() is executed
{
    return m_value == 1;
}
Does using Interlocked.Exchange in Foo() guarantee that when Bar() is executed, I'll see the value "1"? (Even if the value already exists in a register or cache line?) Or do I need to place a memory barrier before reading the value of m_value?
Also (unrelated to the original question), is it legal to declare a volatile member and pass it by reference to InterlockedXX methods? (the compiler warns about passing volatiles by reference, so should I ignore the warning in such case?)
Please note, I'm not looking for "better ways to do things", so please don't post answers that suggest completely alternative ways to do things ("use a lock instead", etc.); this question comes out of pure interest.
Memory barriers don't particularly help you. They specify an ordering between memory operations; in this case, each thread only has one memory operation, so it doesn't matter. One typical scenario is writing non-atomically to fields in a structure, issuing a memory barrier, then publishing the address of the structure to other threads. The barrier guarantees that the writes to the structure's members are seen by all CPUs before they get the address of it.
What you really need are atomic operations, i.e. the InterlockedXXX functions, or volatile variables in C#. If the read in Bar were atomic, you could guarantee that neither the compiler nor the CPU does any optimizations that prevent it from reading either the value before the write in Foo, or after the write in Foo, depending on which gets executed first. Since you are saying that you "know" Foo's write happens before Bar's read, Bar would always return true.
Without the read in Bar being atomic, it could be reading a partially updated value (i.e. garbage) or a cached value (either from the compiler or from the CPU), both of which may prevent Bar from returning true when it should.
Most modern CPUs guarantee that word-aligned reads are atomic, so the real trick is that you have to tell the compiler that the read is atomic.
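In C#, that usually means marking the field volatile (or using Volatile.Read); a minimal sketch of the question's example with that change:

using System.Threading;

class Example
{
    private volatile int m_value;   // the read in Bar is now a real, acquire-ordered read

    public void Foo()
    {
        // Passing a volatile field by ref produces warning CS0420; for the
        // Interlocked methods it is safe to suppress.
        Interlocked.Exchange(ref m_value, 1);
    }

    public bool Bar()
    {
        return m_value == 1;        // cannot be satisfied from a stale cached copy
    }
}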
The usual pattern for memory barrier usage matches what you would put in the implementation of a critical section, but split into pairs for the producer and consumer. As an example your critical section implementation would typically be of the form:
while (!pShared->lock.testAndSet_Acquire()) ;
// (This loop should include all the normal critical section stuff like
// spin, waste, pause() instructions, and last-resort give-up-and-block
// on a resource until the lock is made available.)

// Access to shared memory.
pShared->foo = 1;
v = pShared->goo;

pShared->lock.clear_Release();
The acquire memory barrier above makes sure that any loads (pShared->goo) that may have been started before the successful lock modification are tossed, to be restarted if necessary.
The release memory barrier ensures that the load from goo into the (say, local) variable v is complete before the lock word protecting the shared memory is cleared.
You have a similar pattern in the typical producer/consumer atomic flag scenario (it is difficult to tell from your sample whether that is what you are doing, but it should illustrate the idea).
Suppose your producer used an atomic variable to indicate that some other state is ready to use. You'll want something like this:
pShared->goo = 14;
pShared->atomic.setBit_Release();
Without a "write" barrier here in the producer you have no guarantee that the hardware isn't going to get to the atomic store before the goo store has made it through the cpu store queues, and up through the memory hierarchy where it is visible (even if you have a mechanism that ensures the compiler orders things the way you want).
In the consumer
if (pShared->atomic.compareAndSwap_Acquire(1, 1))
{
    v = pShared->goo;
}
Without a "read" barrier here you won't know that the hardware hasn't gone and fetched goo for you before the atomic access is complete. The atomic (ie: memory manipulated with the Interlocked functions doing stuff like lock cmpxchg), is only "atomic" with respect to itself, not other memory.
Now, the remaining thing that has to be mentioned is that the barrier constructs are highly unportable. Your compiler probably provides _acquire and _release variations for most of the atomic manipulation methods, and these are the sorts of ways you would use them. Depending on the platform you are using (ie: ia32), these may very well be exactly what you would get without the _acquire() or _release() suffixes. Platforms where this matters are ia64 (effectively dead except on HP where its still twitching slightly), and powerpc. ia64 had .acq and .rel instruction modifiers on most load and store instructions (including the atomic ones like cmpxchg). powerpc has separate instructions for this (isync and lwsync give you the read and write barriers respectively).
Now. Having said all this. Do you really have a good reason for going down this path? Doing all this correctly can be very difficult. Be prepared for a lot of self doubt and insecurity in code reviews and make sure you have a lot of high concurrency testing with all sorts of random timing scenerios. Use a critical section unless you have a very very good reason to avoid it, and don't write that critical section yourself.
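A rough C# translation of the producer/consumer flag idea above, using Volatile.Write/Volatile.Read for the release/acquire pairing (Shared, Goo and _ready are made-up names mirroring the sketch):

using System.Threading;

class Shared
{
    public int Goo;
    private int _ready;                 // 0 = not ready, 1 = ready

    public void Produce()
    {
        Goo = 14;
        Volatile.Write(ref _ready, 1);  // release: Goo is visible before the flag
    }

    public bool TryConsume(out int value)
    {
        if (Volatile.Read(ref _ready) == 1)   // acquire: Goo is not read early
        {
            value = Goo;
            return true;
        }
        value = 0;
        return false;
    }
}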
I'm not completely sure, but I think Interlocked.Exchange will use the InterlockedExchange function of the Windows API, which provides a full memory barrier anyway.
This function generates a full memory barrier (or fence) to ensure that memory operations are completed in order.
The interlocked exchange operations guarantee a memory barrier.
The following synchronization functions use the appropriate barriers to ensure memory ordering:
Functions that enter or leave critical sections
Functions that signal synchronization objects
Wait functions
Interlocked functions
(Source : link)
But you are out of luck with register variables. If m_value is in a register in Bar, you won't see the change to m_value. Due to this, you should declare shared variables 'volatile'.
If m_value is not marked as volatile, then there is no reason to think that the value read in Bar is fenced. Compiler optimizations, caching, or other factors could reorder the reads and writes. Interlocked exchange is only helpful when it is used in an ecosystem of properly fenced memory references. This is the whole point of marking a field volatile. The .NET memory model is not as straightforward as some might expect.
Interlocked.Exchange() should guarantee that the value is flushed to all CPUs properly - it provides its own memory barrier.
I'm surprised that the compiler is complaining about passing a volatile into Interlocked.Exchange() - the fact that you're using Interlocked.Exchange() should almost mandate a volatile variable.
The problem you might see is that if the compiler does some heavy optimization of Bar() and realizes that nothing changes the value of m_value it can optimize away your check. That's what the volatile keyword would do - it would hint to the compiler that that variable may be changed outside of the optimizer's view.
If you don't tell the compiler or runtime that m_value should not be read ahead of Bar(), it can and may cache the value of m_value ahead of Bar() and simply use the cached value. If you want to ensure that it sees the "latest" version of m_value, either shove in a Thread.MemoryBarrier() or use Thread.VolatileRead(ref m_value). The latter is less expensive than a full memory barrier.
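A minimal sketch of the Thread.VolatileRead option mentioned above, assuming m_value is an int field as in the question:

using System.Threading;

class Example
{
    private int m_value;

    public void Foo() => Interlocked.Exchange(ref m_value, 1);

    public bool Bar()
    {
        // An acquire-ordered read: the JIT cannot satisfy it from a value
        // cached before Bar() was entered.
        return Thread.VolatileRead(ref m_value) == 1;
    }
}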
Ideally you could shove in a ReadBarrier, but the CLR doesn't seem to support that directly.
EDIT: Another way to think about it is that there are really two kinds of memory barriers: compiler memory barriers that tell the compiler how to sequence reads and writes and CPU memory barriers that tell the CPU how to sequence reads and writes. The Interlocked functions use CPU memory barriers. Even if the compiler treated them as compiler memory barriers, it still wouldn't matter, as in this specific case, Bar() could have been separately compiled and not known of the other uses of m_value that would require a compiler memory barrier.

Why is volatile not enough?

I'm confused. Answers to my previous question seem to confirm my assumptions. But as stated here, volatile is not enough to assure atomicity in .NET. Either operations like increment and assignment in MSIL are not translated directly into a single native opcode, or multiple CPUs can simultaneously read and write to the same RAM location.
To clarify:
I want to know whether writes and reads are atomic on multiple CPUs.
I understand what volatile is about. But is it enough? Do I need to use interlocked operations if I want to get the latest value written by another CPU?
Herb Sutter recently wrote an article on volatile and what it really means (how it affects ordering of memory access and atomicity) in native C++, .NET, and Java environments. It's a pretty good read:
volatile vs. volatile
volatile in .NET does make access to the variable atomic.
The problem is, that's often not enough. What if you need to read the variable, and if it is 0 (indicating that the resource is free), you set it to 1 (indicating that it's locked, and other threads should stay away from it).
Reading the 0 is atomic. Writing the 1 is atomic. But between those two operations, anything might happen. You might read a 0, and then before you can write the 1, another thread jumps in, reads the 0, and writes a 1.
However, volatile in .NET does guarantee atomicity of accesses to the variable. It just doesn't guarantee thread safety for operations relying on multiple accesses to it. (Disclaimer: volatile in C/C++ does not even guarantee this. Just so you know. It is much weaker, and occasionally a source of bugs because people assume it guarantees atomicity.)
So you need to use locks as well, to group together multiple operations as one thread-safe chunk. (Or, for simple operations, the Interlocked operations in .NET may do the trick)
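For the 0/1 resource example above, the read-check-write can be collapsed into one atomic step with Interlocked.CompareExchange; a minimal sketch (the names are made up):

using System.Threading;

class Resource
{
    private int _state;     // 0 = free, 1 = locked

    public bool TryAcquire()
    {
        // Atomically: if _state is 0, set it to 1. The method returns the
        // previous value, so 0 means we won the race; no other thread can
        // slip in between the read and the write.
        return Interlocked.CompareExchange(ref _state, 1, 0) == 0;
    }

    public void Release()
    {
        Volatile.Write(ref _state, 0);
    }
}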
I might be jumping the gun, but it sounds to me as though you're confusing two issues here.
One is atomicity, which in my mind means that a single operation (that may require multiple steps) should not come in conflict with another such single operation.
The other is volatility: when is this value expected to change, and why?
Take the first. If your two-step operation requires you to read the current value, modify it, and write it back, you're most certainly going to want a lock, unless this whole operation can be translated into a single CPU instruction that can work on a single cache-line of data.
However, the second issue is: even when you're doing the locking thing, what will other threads see?
A volatile field in .NET is a field that the compiler knows can change at arbitrary times. In a single-threaded world, the change of a variable is something that happens at some point in a sequential stream of instructions, so the compiler knows when it has added code that changes it, or at least when it has called out to the outside world that may or may not have changed it, so that once the code returns, it might not be the same value it was before the call.
This knowledge allows the compiler to lift the value from the field into a register once, before a loop or similar block of code, and never re-read the value from the field for that particular code.
With multi-threading however, that might give you some problems. One thread might have adjusted the value, and another thread, due to optimization, won't be reading this value for some time, because it knows it hasn't changed.
So when you flag a field as volatile, you're basically telling the compiler that it shouldn't assume it has the current value of the field at any point; it must grab a fresh snapshot every time it needs the value.
Locks solve multiple-step operations, volatility handles how the compiler caches the field value in a register, and together they will solve more problems.
Also note that if a field contains something that cannot be read in a single CPU instruction, you're most likely going to want to lock read access to it as well.
For instance, if you're on a 32-bit CPU and writing a 64-bit value, that write operation will require two steps to complete, and if another thread on another CPU manages to read the 64-bit value before step 2 has completed, it will get half of the previous value and half of the new one, nicely mixed together, which can be even worse than getting an outdated one.
Edit: To answer the comment that volatile guarantees the atomicity of the read/write operation: that is true, in a way, because the volatile keyword cannot be applied to fields that are larger than 32 bits, which in effect makes the field readable/writable with a single CPU instruction on both 32-bit and 64-bit CPUs. And yes, it will prevent the value from being kept in a register as much as possible.
So part of the comment is wrong: volatile cannot be applied to 64-bit values.
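A small sketch of the 64-bit case just described: since a long field cannot be marked volatile, Interlocked.Read/Interlocked.Exchange are the usual way to avoid torn values on a 32-bit CPU (the Wide class is a made-up name for illustration).

using System.Threading;

class Wide
{
    private long _value;    // 64-bit: a plain read/write is two steps on a 32-bit CPU

    public void Set(long v) => Interlocked.Exchange(ref _value, v);

    public long Get()
    {
        // Reads the whole 64-bit value atomically, so a concurrent writer
        // can never be observed "half old, half new".
        return Interlocked.Read(ref _value);
    }
}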
Note also that volatile has some semantics regarding reordering of reads/writes.
For relevant information, see the MSDN documentation or the C# specification, found here, section 10.5.3.
On a hardware level, multiple CPUs can never write simultaneously to the same atomic RAM location. The size of an atomic read/write operation depends on the CPU architecture, but it is typically 1, 2 or 4 bytes on a 32-bit architecture. However, if you try reading the result back, there is always a chance that another CPU has made a write to the same RAM location in between. On a low level, spin-locks are typically used to synchronize access to shared memory. In a high-level language, such mechanisms may be called e.g. critical regions.
The volatile modifier just makes sure the variable is written back to memory immediately when it is changed (even if the value is to be used in the same function). A compiler will usually keep a value in an internal register for as long as possible if the value is to be reused later in the same function, and it is stored back to RAM when all modifications are finished or when the function returns. Volatile types are mostly useful when writing to hardware registers, or when you want to be sure a value is stored back to RAM in e.g. a multithreaded system.
Your question doesn't entirely make sense, because volatile specifies how the read happens, not the atomicity of multi-step processes. My car doesn't mow my lawn, either, but I try not to hold that against it. :)
The problem comes in with register-based cached copies of your variable's values.
When reading a value, the CPU will first see if it's in a register (fast) before checking main memory (slower).
Volatile tells the compiler to push the value out to main memory as soon as possible, and not to trust the cached register value. It's only useful in certain cases.
If you're looking for single-opcode writes, you'll need to use Interlocked.Increment and related methods. But they're fairly limited in what they can do in a single safe instruction.
The safest and most reliable bet is to lock() (if you can't do an Interlocked.* operation).
Edit: writes and reads are atomic if they're in a lock or an Interlocked.* statement. Volatile alone is not enough under the terms of your question.
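A small illustration of the contrast being drawn here: _count++ is a non-atomic read-modify-write and can lose updates under contention, while Interlocked.Increment performs the same update as a single atomic operation (the class and method names are made up).

using System.Threading;

class Counters
{
    private int _count;

    public void Racy() => _count++;                          // read, add, write: not atomic
    public void Safe() => Interlocked.Increment(ref _count); // single atomic operation
}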
Volatile is a compiler keyword that tells the compiler what to do. It does not necessarily translate into (essentially) bus operations that are required for atomicity. That is usually left up to the operating system.
Edit: to clarify, volatile is never enough if you want to guarantee atomicity. Or rather, it's up to the compiler to make it enough or not.
