Explanation of Thread.MemoryBarrier() Bug with OoOP - c#

Ok, so after reading Albahari's Threading in C#, I am trying to get my head around Thread.MemoryBarrier() and out-of-order processing.
Following Brian Gideon's answer on "Why we need Thread.MemoryBarrier()", he mentions that the following code causes the program to loop indefinitely in Release mode without a debugger attached.
class Program
{
    static bool stop = false;

    public static void Main(string[] args)
    {
        var t = new Thread(() =>
        {
            Console.WriteLine("thread begin");
            bool toggle = false;
            while (!stop)
            {
                // Thread.MemoryBarrier() or Console.WriteLine() fixes issue
                toggle = !toggle;
            }
            Console.WriteLine("thread end");
        });
        t.Start();
        Thread.Sleep(1000);
        stop = true;
        Console.WriteLine("stop = true");
        Console.WriteLine("waiting...");
        t.Join();
    }
}
My question is: why does adding a Thread.MemoryBarrier(), or even a Console.WriteLine(), in the while loop fix the issue?
I am guessing that because on a multi processor machine, the thread runs with its own cache of values, and never retrieves the updated value of stop because it has its value in cache?
Or is it that the main thread does not commit this to memory?
Also why does Console.WriteLine() fix this? Is it because it also implements a MemoryBarrier?

The compiler and CPU are free to optimize your code by re-ordering it in any way they see fit, as long as any changes are consistent for a single thread. This is why you never encounter issues in a single threaded program.
In your code you've got two threads that are using the stop flag. The compiler or CPU may choose to cache the value in a CPU register in the thread you create, since it can detect that you're not writing to it in that thread. What you need is some way to tell the compiler/CPU that the variable is being modified in another thread and therefore it shouldn't cache the value but should read it from memory.
There are a couple of easy ways to do this. One is by surrounding all access to the stop variable with a lock statement. This will create a full barrier and ensure that each thread sees the current value. Another is to use the Interlocked class to read/write the variable, as this also puts up a full barrier.
There are also certain methods, such as Wait and Join, that also put up memory barriers in order to prevent reordering. The Albahari book lists these methods.
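Since this document also discusses Java, the same fix can be sketched there with the atomic classes, which play the role of the Interlocked read/write. This is a hypothetical sketch, not the original poster's code; AtomicBoolean.get()/set() are fully fenced atomic accesses, so the loop must re-read the flag instead of caching it in a register:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class StopFlag {
    // AtomicBoolean stands in for Interlocked-style access: every
    // get()/set() is an atomic, fenced read/write of the flag.
    static final AtomicBoolean stop = new AtomicBoolean(false);

    static boolean runDemo() throws InterruptedException {
        Thread t = new Thread(() -> {
            boolean toggle = false;
            while (!stop.get()) {   // fresh read on every iteration
                toggle = !toggle;
            }
        });
        t.start();
        Thread.sleep(100);
        stop.set(true);             // write is visible to the worker
        t.join(5000);               // worker must terminate promptly
        return !t.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo() ? "stopped" : "hung");
    }
}
```

A plain non-volatile boolean in place of the AtomicBoolean could reproduce the original hang, depending on the JIT.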

It doesn't fix any issues. It's a fake fix, rather dangerous in production code, as it may work, or it may not work.
The core problem is in this line
static bool stop = false;
The variable that stops the while loop is not volatile, which means it may or may not be read from memory every time. It can be cached, so that only the last-read value is presented to the system (which may not be the actual current value).
This code
// Thread.MemoryBarrier() or Console.WriteLine() fixes issue
It may or may not fix the issue on different platforms. A memory barrier or console write just happens to force the application to read fresh values on a particular system. It may not be the same elsewhere.
Additionally, volatile and Thread.MemoryBarrier() only provide weak guarantees, which means they don't provide 100% assurance that a read value will always be the latest on all systems and CPUs.
Eric Lippert says
The true semantics of volatile reads
and writes are considerably more complex than I've outlined here; in
fact they do not actually guarantee that every processor stops what it
is doing and updates caches to/from main memory. Rather, they provide
weaker guarantees about how memory accesses before and after reads and
writes may be observed to be ordered with respect to each other.
Certain operations such as creating a new thread, entering a lock, or
using one of the Interlocked family of methods introduce stronger
guarantees about observation of ordering. If you want more details,
read sections 3.10 and 10.5.3 of the C# 4.0 specification.

That example does not have anything to do with out-of-order execution. It only shows the effect of the compiler possibly optimizing away the read of stop, which should be addressed by simply marking the variable volatile. See Memory Reordering Caught in the Act for a better example.

Let us start with some definitions. The volatile keyword produces an acquire-fence on reads and a release-fence on writes. These are defined as follows.
acquire-fence: A memory barrier in which other reads and writes are not allowed to move before the fence.
release-fence: A memory barrier in which other reads and writes are not allowed to move after the fence.
The method Thread.MemoryBarrier generates a full-fence. That means it produces both an acquire-fence and a release-fence. Frustratingly, the MSDN says this, though:
Synchronizes memory access as follows: The processor executing the
current thread cannot reorder instructions in such a way that memory
accesses prior to the call to MemoryBarrier execute after memory
accesses that follow the call to MemoryBarrier.
Interpreting this leads us to believe that it only generates a release-fence though. So what is it? A full fence or half fence? That is probably a topic for another question. I am going to work under the assumption that it is a full fence because a lot of smart people have made that claim. But, more convincingly, the BCL itself uses Thread.MemoryBarrier as if it produced a full-fence. So in this case the documentation is probably wrong. Even more amusingly the statement actually implies that instructions before the call can somehow be sandwiched between the call and instructions after it. That would be absurd. I say this in jest (but not really) that it might benefit Microsoft to have a lawyer review all documentation regarding threading. I am sure their legalese skills could be put to good use in that area.
Now I am going to introduce an arrow notation to help illustrate the fences in action. An ↑ arrow will represent a release-fence and a ↓ arrow will represent an acquire-fence. Think of the arrow head as pushing memory access away in the direction of the arrow. But, and this is important, memory accesses can move past the tail. Read the definitions of the fences above and convince yourself that the arrows visually represent those definitions.
Next we will analyze the loop only as that is the most important part of the code. To do this I am going to unwind the loop. Here is what it looks like.
LOOP_TOP:
// Iteration 1
read stop into register
jump-if-true to LOOP_BOTTOM
↑
full-fence // via Thread.MemoryBarrier
↓
read toggle into register
negate register
write register to toggle
goto LOOP_TOP
// Iteration 2
read stop into register
jump-if-true to LOOP_BOTTOM
↑
full-fence // via Thread.MemoryBarrier
↓
read toggle into register
negate register
write register to toggle
goto LOOP_TOP
...
// Iteration N
read stop into register
jump-if-true to LOOP_BOTTOM
↑
full-fence // via Thread.MemoryBarrier
↓
read toggle into register
negate register
write register to toggle
goto LOOP_TOP
LOOP_BOTTOM:
Notice that the call to Thread.MemoryBarrier is constraining the movement of some of the memory accesses. For example, the read of toggle cannot move before the read of stop or vice-versa, because those memory accesses are not allowed to move through an arrow head.
Now imagine what would happen if the full-fence were removed. The C# compiler, JIT compiler, or hardware now have a lot more liberty in moving the instructions around. In particular, the lifting optimization, known formally as loop-invariant code motion, is now allowed. Basically the compiler detects that stop is never modified, and so the read is bubbled up out of the loop. It is now effectively cached in a register. If the memory barrier were in place then the read would have to push up through an arrow head, and the specification specifically disallows that. This is much easier to visualize if you unwind the loop like I did above. Remember, the call to Thread.MemoryBarrier would be occurring on every iteration of the loop, so you cannot simply draw conclusions about what would happen from a single iteration.
The astute reader will notice that the compiler is free to swap the read of toggle and stop in such a manner that stop gets "refreshed" at the end of the loop instead of the beginning, but that is irrelevant to the contextual behavior of the loop. It has the exact same semantics and produces the same result.
My question is why, without adding a Thread.MemoryBarrier(), or even
Console.WriteLine() in the while loop fixes the issue?
Because the memory barrier places restrictions on the optimizations the compiler can perform. It would disallow loop invariant code motion. The assumption is that Console.WriteLine produces a memory barrier which is probably true. Without the memory barrier the C# compiler, JIT compiler, or hardware are free to hoist the read of stop up and outside of the loop itself.
I am guessing that because on a multi processor machine, the thread
runs with its own cache of values, and never retrieves the updated
value of stop because it has its value in cache?
In a nutshell...yes. Though keep in mind that it has nothing to do with the number of processors. This can be demonstrated with a single processor.
Or is it that the main thread does not commit this to memory?
No. The main thread will commit the write. The call to Thread.Join ensures that because it will create a memory barrier that disallows the movement of the write to fall below the join.
Also why does Console.WriteLine() fix this? Is it because it also
implements a MemoryBarrier?
Yes. It probably produces a memory barrier. I have been keeping a list of memory barrier generators here.

Related

Why don't all member variables need volatile for thread safety even when using Monitor? (why does the model really work?)

(I know they don't but I'm looking for the underlying reason this actually works without using volatile since there should be nothing preventing the compiler from storing a variable in a register without volatile... or is there...)
This question stems from the discord between two thoughts: without volatile, the compiler can in theory optimize any variable in various ways, including storing it in a CPU register, yet the docs say volatile is not needed when synchronization such as lock is used around variables. But in some cases there is seemingly no way the compiler/JIT can know whether your code path will use them. So the suspicion is that something else is really happening here to make the memory model "work".
In this example, what prevents the compiler/JIT from optimizing _count into a register and thus having the increment done on the register rather than directly to memory (later writing to memory after the Exit call)? If _count were volatile it would seem everything should be fine, but a lot of code is written without volatile. It makes sense that the compiler could know not to optimize _count into a register if it saw a lock or synchronization object in the method... but in this case the lock call is in another function.
Most documentation says you don't need to use volatile if you use a synchronization call like lock.
So what prevents the compiler from optimizing _count into a register and potentially updating just the register within the lock? I have a feeling that most member variables won't be optimized into registers for this exact reason, as otherwise every member variable would really need to be volatile unless the compiler could tell it shouldn't optimize (otherwise I suspect tons of code would fail). I saw something similar when looking at C++ years ago: local function variables got stored in registers, but class member variables did not.
So the main question is, is it really the only way this possibly works without volatile that the compiler/jit won't put class member variables in registers and thus volatile is then unnecessary?
(Please ignore the lack of exception handling and safety in the calls, but you get the gist.)
public class MyClass
{
    object _o = new object();
    int _count = 0;

    public void Increment()
    {
        Enter();
        // ... many usages of _count here...
        _count++;
        Exit();
    }

    // Let's pretend these functions are too big to inline and even call other
    // methods that actually make the monitor call (for example a base class
    // that implemented these).
    private void Enter() { Monitor.Enter(_o); }
    private void Exit() { Monitor.Exit(_o); }
    // ...
}
Entering and leaving a Monitor causes a full memory fence. Thus the CLR makes sure that all writing operations before the Monitor.Enter / Monitor.Exit become visible to all other threads and that all reading operations after the method call "happen" after it. That also means that statements before the call cannot be moved after the call and vice versa.
See http://www.albahari.com/threading/part4.aspx.
The best-guess answer to this question would appear to be that any variables stored in CPU registers are saved back to memory before any function call. This makes sense from a single-threaded compiler-design viewpoint, because otherwise the object might appear inconsistent to any other functions/methods/objects that used it.
So it may not be so much, as some people/articles claim, that compilers detect synchronization objects/classes and make non-volatile variables safe through their calls. (Perhaps they do when a lock or other synchronization object is used in the same method, but once the calls to those synchronization objects happen in another method, probably not.) Instead, it is likely that the mere fact of calling another method is enough to cause the values stored in CPU registers to be saved to memory, thus not requiring all variables to be volatile.
Also, I suspect (and others have suspected too) that fields of a class are not optimized as aggressively, due in part to these threading concerns.
Some notes (my understanding):
Thread.MemoryBarrier() is mostly a CPU instruction to ensure that writes/reads don't bypass the barrier from a CPU perspective. (This is not directly related to values stored in registers.) So it is probably not what directly causes variables to be saved from registers to memory, except by the mere fact that it is a method call, which, per the discussion here, would likely cause that to happen. Really, any method call could have that effect of causing all class fields in use to be saved from registers.
It is theoretically possible that the JIT/compiler could also take that method into account within the same method, to ensure variables are stored from CPU registers. But just following our simple proposed rule, any call to another method or class would result in saving register-held variables to memory. Plus, if someone wrapped that call in another method (maybe many methods deep), the compiler wouldn't likely analyze that deep to speculate on execution. The JIT could do something, but again it likely wouldn't analyze that deep, and both cases need to ensure that locks/synchronization work no matter what, so the simplest optimization rule is the likely answer.
Unless we have anyone who writes compilers who can confirm this, it's all a guess, but it's likely the best guess we have for why volatile is not needed.
If that rule is followed, synchronization objects just need to employ their own call to MemoryBarrier when they enter and leave, to ensure the CPU has the most up-to-date values from its write caches, so they get flushed and proper values can be read. On this site you will see that is what is suggested under implicit memory barriers: http://www.albahari.com/threading/part4.aspx
So what prevents the compiler from optimizing _count into a register
and potentially updating just the register within the lock?
There is nothing in the documentation that I am aware of that would preclude that from happening. The point is that the call to Monitor.Exit will effectively guarantee that the final value of _count will be committed to memory upon completion.
It makes sense the compiler could know not to optimize _count into a
register if it saw a lock or synchronization object in the method..
but in this case the lock call is in another function.
The fact that the lock is acquired and released in other methods is irrelevant from your point of view. The memory model defines a pretty rigid set of rules that must be adhered to regarding memory-barrier generators. The only consequence of putting those Monitor calls in another method is that the JIT compiler will have a harder time complying with those rules. But the JIT compiler must comply; period. If the method calls get too complex or are nested too deeply, then I suspect the JIT compiler punts on any heuristics it might have in this regard and says, "Forget it, I'm just not going to optimize anything!"
So the main question is, is it really the only way this possibly works
without volatile that the compiler/jit won't put class member
variables in registers and thus volatile is then unnecessary?
It works because the protocol is to acquire the lock prior to reading _count as well. If the readers do not do that then all bets are off.
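The protocol described in the answers above can be sketched in Java (which this thread also covers) with a synchronized counter; the class and names here are hypothetical, not from the question. Writer and readers both acquire the same monitor, so whatever register caching the JIT does inside the lock, the value is committed on monitor exit and re-read on monitor entry:

```java
public class Counter {
    private final Object gate = new Object();
    private int count = 0;

    public void increment() {
        synchronized (gate) { count++; }      // write published on exit
    }

    public int get() {
        synchronized (gate) { return count; } // fresh read on entry
    }

    static int runDemo() throws InterruptedException {
        Counter c = new Counter();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) c.increment();
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return c.get();                       // always 40000
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo());
    }
}
```

If the readers skipped the synchronized block, all bets would be off, exactly as the answer says.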

Thread Safety General Rules

A few questions about thread safety that I think I understand, but would like clarification on, if you could be so kind. The specific languages I program in are C++, C#, and Java. Hopefully keep these in mind when describing specific language keywords/features.
1) Cases of 1 writer, n readers. In cases such as n threads reading a variable, such as in a polled loop, and 1 writer updating this variable, is explicit locking required?
Consider:
// thread 1.
volatile bool bWorking = true;
void stopWork() { bWorking = false; }
// thread n
while (bWorking) {...}
Here, should it be enough to just have a memory barrier, accomplished with volatile? As I understand it, in my above-mentioned languages simple reads and writes to primitives will not be interleaved, so explicit locking is not required; however, memory consistency cannot be guaranteed without some explicit lock or volatile. Are my assumptions correct here?
2) Assuming my assumption above is correct, then it is only correct for simple reads and writes. That is, bWorking = x and x = bWorking are the ONLY safe operations? I.e., complex assignments involving unary operators (++, --) are unsafe here, as are +=, *=, etc.?
3) I assume if case 1 is correct, then it is not safe to expand that statement to also be safe for n writers and n readers when only assignment and reading is involved?
For Java:
1) A volatile variable is updated from/to "main memory" on each read/write, which means that a change by the updating thread will be seen by all reading threads on their next read. Also, updates are atomic (independent of variable type).
2) Yes, combined operations like ++ are not thread safe if you have multiple writers. For a single writing thread, there is no problem. (The volatile keyword makes sure that the update is seen by the other threads.)
3) As long as you only assign and read, volatile is enough - but if you have multiple writers, you can't be sure which value is the "final" one, or which will be read by which thread. Even the writing threads themselves can't reliably know that their own value is set. (If you only have boolean and will only set from true to false, there is no problem here.)
If you want more control, have a look at the classes in the java.util.concurrent.atomic package.
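A minimal sketch of that java.util.concurrent.atomic approach (the class and counts here are illustrative): incrementAndGet() makes the read-modify-write a single atomic operation, so multiple writers are safe without an explicit lock, unlike a plain ++ on a volatile field:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicDemo {
    static int runDemo() throws InterruptedException {
        AtomicInteger counter = new AtomicInteger(0);
        // Two writers; a plain int++ here could lose updates.
        Thread a = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) counter.incrementAndGet();
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) counter.incrementAndGet();
        });
        a.start(); b.start();
        a.join(); b.join();
        return counter.get();   // always 20000
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo());
    }
}
```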
Do the locking. You are going to need locking anyway if you are writing multi-threaded code. C# and Java make it fairly simple. C++ is a little more complex, but you should be able to use boost or make your own RAII classes. Given that you are going to be locking all over the place, don't try to see if there are a few places where you might be able to avoid it. Everything will work fine until you run the code on a 64-way processor using new Intel microcode on a Tuesday in March on some mission-critical customer system. Then bang.
People think that locks are expensive; they really aren't. The kernel devs spend a lot of time optimizing them, and compared to one disk read they are utterly trivial; yet nobody ever seems to expend this much effort analyzing every last disk read.
Add the usual statements about the evils of performance tuning, wise sayings from Knuth, Spolsky, etc., etc.
For C++
1) This is tempting to try, and will usually work. However, a few things to keep in mind:
You're doing it with a boolean, so that seems safest. Other POD types might not be so safe. E.g. it may take two instructions to set a 64-bit double on a 32-bit machine, so that would clearly not be thread safe.
If the boolean is the only thing you care about the threads sharing, this could work. If you're using it as a variant of the Double-Checked Lock Paradigm, you run into all the pitfalls therein. Consider:
std::string failure_message; // shared across threads
// some thread triggers the stop, and also reports why
failure_message = "File not found";
stopWork();
// all the other threads
while (bWorking) {...}
log << "Stopped work: " << failure_message;
This looks ok at first, because failure_message is set before bWorking is set to false. However, that may not be the case in practice. The compiler can rearrange the statements, and set bWorking first, resulting in thread unsafe access of failure_message. Even if the compiler doesn't, the hardware might. Multi-core cpus have their own caches, and thus things aren't quite so simple.
If it's just a boolean, it's probably ok. If it's more than that, it might have issues once in a while. How important is the code you're writing, and can you take that risk?
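The reordering hazard above can be avoided with proper fencing. Here is a hedged Java sketch of the same failure_message pattern done safely (names are hypothetical): the volatile write to running comes after the ordinary write to the message, and the reader's volatile read comes before its read of the message, so the Java memory model's happens-before rule guarantees a reader that observes running == false also sees the message:

```java
public class SafeStop {
    static volatile boolean running = true;
    static String failureMessage;           // plain field, published via the flag

    static void stopWork(String why) {
        failureMessage = why;               // 1: ordinary write
        running = false;                    // 2: volatile write (release)
    }

    static String runDemo() throws InterruptedException {
        final String[] seen = new String[1];
        Thread reader = new Thread(() -> {
            while (running) { }             // 3: volatile read (acquire)
            seen[0] = failureMessage;       // 4: guaranteed to be fresh
        });
        reader.start();
        Thread.sleep(50);
        stopWork("File not found");
        reader.join(5000);
        return seen[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo());
    }
}
```

In C++ the equivalent discipline would need std::atomic with release/acquire ordering; a plain volatile bool gives no such guarantee there.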
2) Correct, ++/--, +=, other operators will take multiple cpu instructions and will be thread unsafe. Depending on your platform and compiler, you may be able to write non-portable code to do atomic increments.
3) Correct, this would be unsafe in a general case. You can kinda squeak by when you have one thread, writing a single boolean once. As soon as you introduce multiple writes, you'd better have some real thread synchronization.
Note about cpu instructions
If an operation takes multiple instructions, your thread could be preempted between them -- and the operation would be partially complete. This is clearly bad for thread safety, and this is one reason why ++, +=, etc are not thread safe.
However, even if an operation takes a single instruction, that does not necessarily mean that it's thread safe. With multi-core and multi-cpu you have to worry about the visibility of a change -- when is the cpu cache flushed to main memory.
So while multiple instructions do imply not thread safe, it is false to assume that a single instruction implies thread safe.
With a 1-byte bool you might be able to get away without locking, but since you cannot guarantee the internals of the processor it would still be a bad idea. Certainly with anything beyond 1 byte, such as an integer, you couldn't: one processor could be updating it while another was reading it on another thread, and you could get inconsistent results. In C# I would use a lock { } statement around the access (read or write) to bWorking. If it were something more complex, for example I/O access to a large memory buffer, I'd use ReaderWriterLock or some variant of it. In C++, volatile won't help much, because it merely prevents certain kinds of optimizations, such as keeping the variable in a register, which would otherwise cause problems in multithreading. You still need to use a locking construct.
So in summary I would never read and write anything in a multithreaded program without locking it somehow.
Updating a bool is going to be atomic on any sensible extant system. However, once your writer has written, there's no telling how long before your reader will read, especially once you take into account multiple cores, caches, scheduler oddities, and so on.
Part of the problem with increments and decrements (++, --) and compound assignments (+=, *=) is that they are misleading. They imply that something is happening atomically when it is actually happening in several operations. But even simple assignments can be unsafe once you have stepped away from the purity of boolean variables. Guaranteeing that a write as simple as x = foo is atomic is up to the details of your platform.
I assume by thread safe you mean that readers will always see a consistent object no matter what the writers do. In your example this will always be the case, since booleans can only evaluate to two values, both valid, and the value only transitions once, from true to false. Thread safety is going to be more difficult in a more complicated scenario.

What is C#'s version of the GIL?

In the current implementation of CPython, there is an object known as the "GIL" or "Global Interpreter Lock". It is essentially a mutex that prevents two Python threads from executing Python code at the same time. This prevents two threads from being able to corrupt the state of the Python interpreter, but also prevents multiple threads from really executing together. Essentially, if I do this:
# Thread A
some_list.append(3)
# Thread B
some_list.append(4)
I can't corrupt the list, because at any given time, only one of those threads are executing, since they must hold the GIL to do so. Now, the items in the list might be added in some indeterminate order, but the point is that the list isn't corrupted, and two things will always get added.
So, now to C#. C# essentially faces the same problem as Python, so, how does C# prevent this? I'd also be interested in hearing Java's story, if anyone knows it.
Clarification: I'm interested in what happens without explicit locking statements, especially to the VM. I am aware that locking primitives exist for both Java & C# - they exist in Python as well: The GIL is not used for multi-threaded code, other than to keep the interpreter sane. I am interested in the direct equivalent of the above, so, in C#, if I can remember enough... :-)
List<String> s;
// Reference to s is shared by two threads, which both execute this:
s.Add("hello");
// State of s?
// State of the VM? (And if sane, how so?)
Here's another example:
class A
{
    public String s;
}
// Thread A & B
some_A.s = some_other_value;
// some_A's state must change: how does it change?
// Is the VM still in good shape afterwards?
I'm not looking to write bad C# code, I understand the lock statements. Even in Python, the GIL doesn't give you magic-multi-threaded code: you must still lock shared resources. But the GIL prevents Python's "VM" from being corrupted - it is this behavior that I'm interested in.
Most other languages that support threading don't have an equivalent of the Python GIL; they require you to use mutexes, either implicitly or explicitly.
Using lock, you would do this:
lock (some_list)
{
    some_list.Add(3);
}
and in thread 2:
lock (some_list)
{
    some_list.Add(4);
}
The lock statement ensures that the object inside the lock statement, some_list in this case, can only be accessed by a single thread at a time. See http://msdn.microsoft.com/en-us/library/c5kehkcz(VS.80).aspx for more information.
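Since the question also asks about Java's story, the direct analogue of C#'s lock statement is a synchronized block on the same object. A hedged sketch (class and counts are illustrative): with both threads synchronizing on the list, every append completes and none is lost:

```java
import java.util.ArrayList;
import java.util.List;

public class LockedAdd {
    static int runDemo() throws InterruptedException {
        List<Integer> someList = new ArrayList<>();
        Runnable addMany = () -> {
            for (int i = 0; i < 1_000; i++) {
                synchronized (someList) {   // one thread inside at a time
                    someList.add(i);
                }
            }
        };
        Thread t1 = new Thread(addMany);
        Thread t2 = new Thread(addMany);
        t1.start(); t2.start();
        t1.join(); t2.join();
        return someList.size();             // always 2000
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo());
    }
}
```

As with the Python example, the interleaving of elements is indeterminate, but the list is never corrupted.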
C# does not have an equivalent of Python's GIL.
Though they face the same issue, their design goals make them different. With the GIL, CPython ensures that operations such as appending to a list from two threads are simple, which also means that it allows only one thread to run at any time. This makes lists and dictionaries thread safe. Though this makes the job simpler and intuitive, it makes it harder to exploit the multithreading advantage on multicores.
With no GIL, C# does the opposite. It places the burden of integrity on the developer of the program, but allows you to take advantage of running multiple threads simultaneously.
As per one of the discussions: the GIL in CPython is purely a design choice of having one big lock versus a lock per object and synchronization to make sure that objects are kept in a coherent state. It is a trade-off: giving up the full power of multithreading. It has been observed that most problems do not suffer from this disadvantage, and there are libraries which help you solve this issue when required. That means that, for a certain class of problems, the burden of utilizing the multicore is passed to the developer so that the rest can enjoy the simpler, more intuitive approach.
Note: other implementations like IronPython do not have a GIL.
It may be instructive to look at the documentation for the Java equivalent of the class you're discussing:
Note that this implementation is not synchronized. If multiple threads access an ArrayList instance concurrently, and at least one of the threads modifies the list structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more elements, or explicitly resizes the backing array; merely setting the value of an element is not a structural modification.) This is typically accomplished by synchronizing on some object that naturally encapsulates the list. If no such object exists, the list should be "wrapped" using the Collections.synchronizedList method. This is best done at creation time, to prevent accidental unsynchronized access to the list:
List list = Collections.synchronizedList(new ArrayList(...));
The iterators returned by this class's iterator and listIterator methods are fail-fast: if the list is structurally modified at any time after the iterator is created, in any way except through the iterator's own remove or add methods, the iterator will throw a ConcurrentModificationException. Thus, in the face of concurrent modification, the iterator fails quickly and cleanly, rather than risking arbitrary, non-deterministic behavior at an undetermined time in the future.
Note that the fail-fast behavior of an iterator cannot be guaranteed as it is, generally speaking, impossible to make any hard guarantees in the presence of unsynchronized concurrent modification. Fail-fast iterators throw ConcurrentModificationException on a best-effort basis. Therefore, it would be wrong to write a program that depended on this exception for its correctness: the fail-fast behavior of iterators should be used only to detect bugs.
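The wrapping advice from that quote can be sketched as follows (counts are illustrative): the Collections.synchronizedList wrapper takes its internal lock on every add, so concurrent appends from two threads cannot corrupt the backing ArrayList:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SyncListDemo {
    static int runDemo() throws InterruptedException {
        List<Integer> list = Collections.synchronizedList(new ArrayList<>());
        Runnable addMany = () -> {
            for (int i = 0; i < 1_000; i++) list.add(i);  // locked internally
        };
        Thread t1 = new Thread(addMany);
        Thread t2 = new Thread(addMany);
        t1.start(); t2.start();
        t1.join(); t2.join();
        return list.size();   // always 2000; element order is indeterminate
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo());
    }
}
```

Note that iterating such a list still requires manually synchronizing on it, as the quoted documentation warns.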
Most complex data structures (for example, lists) can be corrupted when used without locking from multiple threads.
Since changes of references are atomic, a reference always stays a valid reference.
But there is a problem when interacting with security-critical code. So any data structures used by critical code must be one of the following:
Inaccessible from untrusted code, and locked/used correctly by trusted code
Immutable (String class)
Copied before use (valuetype parameters)
Written in trusted code and uses internal locking to guarantee a safe state
For example, critical code cannot trust a list accessible from untrusted code. If it gets passed a List, it has to create a private copy, do its precondition checks on the copy, and then operate on the copy.
I'm going to take a wild guess at what the question really means...
In Python data structures in the interpreter get corrupted because Python is using a form of reference counting.
Both C# and Java use garbage collection and in fact they do use a global lock when doing a full heap collection.
Data can be marked and moved between "generations" without a lock. But to actually clean it up everything must come to a stop. Hopefully a very short stop, but a full stop.
Here is an interesting link on CLR garbage collection as of 2007:
http://vineetgupta.spaces.live.com/blog/cns!8DE4BDC896BEE1AD!1104.entry

Interlocked and Memory Barriers

I have a question about the following code sample (m_value isn't volatile, and every thread runs on a separate processor)
void Foo() // executed by thread #1, BEFORE Bar() is executed
{
    Interlocked.Exchange(ref m_value, 1);
}

bool Bar() // executed by thread #2, AFTER Foo() is executed
{
    return m_value == 1;
}
Does using Interlocked.Exchange in Foo() guarantee that when Bar() is executed, I'll see the value 1? (Even if the value already exists in a register or cache line?) Or do I need to place a memory barrier before reading the value of m_value?
Also (unrelated to the original question), is it legal to declare a volatile member and pass it by reference to InterlockedXX methods? (The compiler warns about passing volatiles by reference, so should I ignore the warning in such a case?)
Please note, I'm not looking for "better ways to do things", so please don't post answers that suggest completely alternate approaches ("use a lock instead", etc.). This question comes out of pure interest.
Memory barriers don't particularly help you. They specify an ordering between memory operations, and in this case each thread only has one memory operation, so it doesn't matter. One typical scenario is writing non-atomically to fields in a structure, issuing a memory barrier, then publishing the address of the structure to other threads. The barrier guarantees that the writes to the structure's members are seen by all CPUs before they get its address.
What you really need are atomic operations, i.e. the InterlockedXXX functions, or volatile variables in C#. If the read in Bar were atomic, you could guarantee that neither the compiler nor the CPU does any optimization that prevents it from reading either the value from before the write in Foo or from after it, depending on which gets executed first. Since you are saying that you "know" Foo's write happens before Bar's read, Bar would always return true.
Without the read in Bar being atomic, it could be reading a partially updated value (i.e. garbage), or a cached value (either from the compiler or from the CPU), both of which may prevent Bar from returning true as it should.
Most modern CPUs guarantee that word-aligned reads are atomic, so the real trick is that you have to tell the compiler that the read is atomic.
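In C#, one way to tell the compiler and runtime not to cache or reorder the read is `Volatile.Read` (a sketch; the class and field names mirror the question's sample):

```csharp
using System.Threading;

class Flag
{
    private int m_value;

    public void Foo()                 // thread #1
    {
        // Interlocked.Exchange acts as a full fence on the write side.
        Interlocked.Exchange(ref m_value, 1);
    }

    public bool Bar()                 // thread #2
    {
        // Volatile.Read has acquire semantics and prevents the compiler/JIT
        // from hoisting the read out of a loop or caching it in a register.
        return Volatile.Read(ref m_value) == 1;
    }
}
```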
The usual pattern for memory barrier usage matches what you would put in the implementation of a critical section, but split into pairs for the producer and consumer. As an example your critical section implementation would typically be of the form:
while (!pShared->lock.testAndSet_Acquire())
    ;
// (This loop should include all the normal critical section stuff:
// spin, waste, pause() instructions, and a last-resort
// give-up-and-block on a resource until the lock is made available.)

// Access to shared memory.
pShared->foo = 1;
v = pShared->goo;

pShared->lock.clear_Release();
The acquire memory barrier above makes sure that any loads (pShared->goo) that may have been started before the successful lock modification are tossed, to be restarted if necessary.
The release memory barrier ensures that the load from goo into the (local say) variable v is complete before the lock word protecting the shared memory is cleared.
You have a similar pattern in the typical producer/consumer atomic-flag scenario (it is difficult to tell from your sample whether that is what you are doing, but it should illustrate the idea).
Suppose your producer used an atomic variable to indicate that some other state is ready to use. You'll want something like this:
pShared->goo = 14;
pShared->atomic.setBit_Release();
Without a "write" barrier here in the producer you have no guarantee that the hardware isn't going to get to the atomic store before the goo store has made it through the cpu store queues, and up through the memory hierarchy where it is visible (even if you have a mechanism that ensures the compiler orders things the way you want).
In the consumer
if (pShared->atomic.compareAndSwap_Acquire(1, 1))
{
    v = pShared->goo;
}
Without a "read" barrier here you won't know that the hardware hasn't gone and fetched goo for you before the atomic access is complete. The atomic (ie: memory manipulated with the Interlocked functions doing stuff like lock cmpxchg), is only "atomic" with respect to itself, not other memory.
Now, the remaining thing that has to be mentioned is that the barrier constructs are highly unportable. Your compiler probably provides _acquire and _release variations for most of the atomic manipulation methods, and these are the sorts of ways you would use them. Depending on the platform you are using (e.g. ia32), these may very well be exactly what you would get without the _acquire() or _release() suffixes. Platforms where this matters are ia64 (effectively dead except on HP, where it's still twitching slightly) and PowerPC. ia64 had .acq and .rel instruction modifiers on most load and store instructions (including the atomic ones like cmpxchg). PowerPC has separate instructions for this (isync and lwsync give you the read and write barriers respectively).
Now, having said all this: do you really have a good reason for going down this path? Doing all this correctly can be very difficult. Be prepared for a lot of self-doubt and insecurity in code reviews, and make sure you have a lot of high-concurrency testing with all sorts of random timing scenarios. Use a critical section unless you have a very, very good reason to avoid it, and don't write that critical section yourself.
I'm not completely sure, but I think Interlocked.Exchange will use the InterlockedExchange function of the Windows API, which provides a full memory barrier anyway.
This function generates a full memory barrier (or fence) to ensure that memory operations are completed in order.
The interlocked exchange operations guarantee a memory barrier.
The following synchronization functions use the appropriate barriers
to ensure memory ordering:
Functions that enter or leave critical sections
Functions that signal synchronization objects
Wait functions
Interlocked functions
(Source: link)
But you are out of luck with register variables. If m_value is in a register in Bar, you won't see the change to m_value. Due to this, you should declare shared variables 'volatile'.
If m_value is not marked as volatile, then there is no reason to think that the value read in Bar is fenced. Compiler optimizations, caching, or other factors could reorder the reads and writes. Interlocked exchange is only helpful when it is used in an ecosystem of properly fenced memory references. This is the whole point of marking a field volatile. The .NET memory model is not as straightforward as some might expect.
Interlocked.Exchange() should guarantee that the value is flushed to all CPUs properly - it provides its own memory barrier.
I'm surprised that the compiler is complaining about passing a volatile into Interlocked.Exchange() - the fact that you're using Interlocked.Exchange() should almost mandate a volatile variable.
The problem you might see is that if the compiler does some heavy optimization of Bar() and realizes that nothing changes the value of m_value it can optimize away your check. That's what the volatile keyword would do - it would hint to the compiler that that variable may be changed outside of the optimizer's view.
If you don't tell the compiler or runtime that m_value should not be read ahead of Bar(), it can and may cache the value of m_value ahead of Bar() and simply use the cached value. If you want to ensure that it sees the "latest" version of m_value, either shove in a Thread.MemoryBarrier() or use Thread.VolatileRead(ref m_value). The latter is less expensive than a full memory barrier.
Ideally you could shove in a ReadBarrier, but the CLR doesn't seem to support that directly.
EDIT: Another way to think about it is that there are really two kinds of memory barriers: compiler memory barriers that tell the compiler how to sequence reads and writes and CPU memory barriers that tell the CPU how to sequence reads and writes. The Interlocked functions use CPU memory barriers. Even if the compiler treated them as compiler memory barriers, it still wouldn't matter, as in this specific case, Bar() could have been separately compiled and not known of the other uses of m_value that would require a compiler memory barrier.

Why volatile is not enough?

I'm confused. Answers to my previous question seem to confirm my assumptions. But as stated here, volatile is not enough to assure atomicity in .NET. Either operations like increment and assignment in MSIL are not translated directly into a single native opcode, or many CPUs can simultaneously read and write to the same RAM location.
To clarify:
I want to know if writes and reads are atomic on multiple CPUs?
I understand what volatile is about. But is it enough? Do I need to use interlocked operations if I want to get the latest value written by another CPU?
Herb Sutter recently wrote an article on volatile and what it really means (how it affects ordering of memory access and atomicity) in the native C++, .NET, and Java environments. It's a pretty good read:
volatile vs. volatile
volatile in .NET does make access to the variable atomic.
The problem is, that's often not enough. What if you need to read the variable, and if it is 0 (indicating that the resource is free), you set it to 1 (indicating that it's locked, and other threads should stay away from it).
Reading the 0 is atomic. Writing the 1 is atomic. But between those two operations, anything might happen. You might read a 0, and then before you can write the 1, another thread jumps in, reads the 0, and writes a 1.
However, volatile in .NET does guarantee atomicity of accesses to the variable. It just doesn't guarantee thread safety for operations relying on multiple accesses to it. (Disclaimer: volatile in C/C++ does not even guarantee this. Just so you know. It is much weaker, and occasionally a source of bugs because people assume it guarantees atomicity. :))
So you need to use locks as well, to group together multiple operations as one thread-safe chunk. (Or, for simple operations, the Interlocked operations in .NET may do the trick)
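The 0-means-free, 1-means-locked example above can be collapsed into a single indivisible step with `Interlocked.CompareExchange` (a sketch; the class and method names are illustrative):

```csharp
using System.Threading;

class SpinFlag
{
    private int state;   // 0 = free, 1 = locked

    public bool TryAcquire()
    {
        // Atomically: if state == 0, set it to 1. The return value is what
        // was there before the exchange, so seeing 0 means we won the race.
        return Interlocked.CompareExchange(ref state, 1, 0) == 0;
    }

    public void Release()
    {
        Volatile.Write(ref state, 0);
    }
}
```

Two threads calling TryAcquire can never both observe 0, because the read and the write happen as one atomic operation, unlike a volatile read followed by a volatile write.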
I might be jumping the gun here but it sounds to me as though you're confusing two issues here.
One is atomicity, which in my mind means that a single operation (that may require multiple steps) should not come in conflict with another such single operation.
The other is volatility, when is this value expected to change, and why.
Take the first. If your two-step operation requires you to read the current value, modify it, and write it back, you're most certainly going to want a lock, unless this whole operation can be translated into a single CPU instruction that can work on a single cache-line of data.
However, the second issue is, even when you're doing the locking thing, what will other threads see.
A volatile field in .NET is a field that the compiler knows can change at arbitrary times. In a single-threaded world, the change of a variable is something that happens at some point in a sequential stream of instructions, so the compiler knows when it has added code that changes it, or at least when it has called out to the outside world that may or may not have changed it, so that once the code returns, it might not be the same value it was before the call.
This knowledge allows the compiler to lift the value from the field into a register once, before a loop or similar block of code, and never re-read the value from the field for that particular code.
With multi-threading however, that might give you some problems. One thread might have adjusted the value, and another thread, due to optimization, won't be reading this value for some time, because it knows it hasn't changed.
So when you flag a field as volatile you're basically telling the compiler that it shouldn't assume that it has the current value of this at any point, except for grabbing snapshots every time it needs the value.
Locks solve multiple-step operations, volatility handles how the compiler caches the field value in a register, and together they will solve more problems.
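That division of labor can be sketched in one class: a volatile flag for visibility, a lock for the multi-step update (a sketch, with illustrative names):

```csharp
using System.Threading;

class Worker
{
    private volatile bool stop;                    // re-read on every iteration,
                                                   // never cached in a register
    private readonly object gate = new object();
    private int processed;                         // multi-step updates guarded
                                                   // by the lock
    public void Run()
    {
        while (!stop)            // volatile read: the loop cannot be optimized
        {                        // into an infinite spin on a stale value
            lock (gate)          // lock groups the read-modify-write into
            {                    // one thread-safe chunk
                processed++;
            }
        }
    }

    public void Stop() { stop = true; }            // volatile write, visible to Run()
}
```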
Also note that if a field contains something that cannot be read in a single CPU instruction, you're most likely going to want to lock read-access to it as well.
For instance, if you're on a 32-bit CPU and writing a 64-bit value, that write operation will require two steps to complete, and if another thread on another CPU manages to read the 64-bit value before step 2 has completed, it will get half of the previous value and half of the new one, nicely mixed together, which can be even worse than getting an outdated one.
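A `long` on a 32-bit CLR is the classic case of such a torn read; `Interlocked.Read` and `Interlocked.Exchange` make the 64-bit accesses atomic (a sketch):

```csharp
using System.Threading;

class Counter64
{
    private long value;   // note: 'volatile' cannot be applied to long in C#

    public void Set(long v)
    {
        Interlocked.Exchange(ref value, v);   // atomic 64-bit write
    }

    public long Get()
    {
        // Atomic 64-bit read, even on a 32-bit CPU where a plain read
        // would be two separate 32-bit loads and could observe a torn value.
        return Interlocked.Read(ref value);
    }
}
```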
Edit: To answer the comment, that volatile guarantees the atomicity of the read/write operation, that's well, true, in a way, because the volatile keyword cannot be applied to fields that are larger than 32-bit, in effect making the field single-cpu-instruction read/writeable on both 32 and 64-bit cpu's. And yes, it will prevent the value from being kept in a register as much as possible.
So part of the comment is wrong, volatile cannot be applied to 64-bit values.
Note also that volatile has some semantics regarding reordering of reads/writes.
For relevant information, see the MSDN documentation or the C# specification, found here, section 10.5.3.
On a hardware level, multiple CPUs can never write simultaneously to the same atomic RAM location. The size of an atomic read/write operation depends on the CPU architecture, but is typically 1, 2 or 4 bytes on a 32-bit architecture. However, if you try reading the result back, there is always a chance that another CPU has made a write to the same RAM location in between. On a low level, spin-locks are typically used to synchronize access to shared memory. In a high-level language, such mechanisms may be called e.g. critical regions.
The volatile type just makes sure the variable is written immediately back to memory when it is changed (even if the value is to be used in the same function). A compiler will usually keep a value in an internal register for as long as possible if the value is to be reused later in the same function, and it is stored back to RAM when all modifications are finished or when a function returns. Volatile types are mostly useful when writing to hardware registers, or when you want to be sure a value is stored back to RAM in e.g. a multithread system.
Your question doesn't entirely make sense, because volatile specifies how the read happens, not the atomicity of multi-step processes. My car doesn't mow my lawn, either, but I try not to hold that against it. :)
The problem comes in with register-based cached copies of your variables' values.
When reading a value, the CPU will first see if it's in a register (fast) before checking main memory (slower).
Volatile tells the compiler to push the value out to main memory ASAP, and not to trust the cached register value. It's only useful in certain cases.
If you're looking for single-opcode writes, you'll need to use Interlocked.Increment and related methods, but they're fairly limited in what they can do in a single safe instruction.
The safest and most reliable bet is to lock() (if you can't use an Interlocked.* method).
Edit: Writes and reads are atomic if they're in a lock or an Interlocked.* statement. Volatile alone is not enough under the terms of your question.
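For the single-instruction case, `Interlocked.Increment` replaces a read-modify-write that volatile alone cannot make safe (a sketch; the thread count and iteration count are arbitrary):

```csharp
using System.Threading;
using System.Threading.Tasks;

class Program
{
    public static int Count()
    {
        int counter = 0;

        // 4 workers, 100,000 increments each. A plain counter++ here would
        // routinely lose updates (read, add, write is three steps);
        // Interlocked.Increment performs all three as one atomic operation.
        Parallel.For(0, 4, _ =>
        {
            for (int i = 0; i < 100_000; i++)
                Interlocked.Increment(ref counter);
        });

        return counter;
    }

    static void Main() => System.Console.WriteLine(Count());
}
```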
Volatile is a compiler keyword that tells the compiler what to do. It does not necessarily translate into (essentially) bus operations that are required for atomicity. That is usually left up to the operating system.
Edit: to clarify, volatile is never enough if you want to guarantee atomicity. Or rather, it's up to the compiler to make it enough or not.
