Thread Safety General Rules

Thread Safety General Rules - c#

A few questions about thread safety that I think I understand, but would like clarification on, if you could be so kind. The specific languages I program in are C++, C#, and Java. Hopefully keep these in mind when describing specific language keywords/features.
1) Cases of 1 writer, n readers. In cases such as n threads reading a variable, such as in a polled loop, and 1 writer updating this variable, is explicit locking required?
Consider:
// thread 1.
volatile bool bWorking = true;
void stopWork() { bWorking = false; }
// thread n
while (bWorking) {...}
Here, should it be enough to just have a memory barrier, and accomplish this with volatile? Since as I understand, in my above mentioned languages, simple reads and writes to primitives will not be interleaved so explicit locking is not required, however memory consistency cannot be guaranteed without some explicit lock, or volatile. Are my assumptions correct here?
2) Assuming my assumption above is correct, then it is only correct for simple reads and writes. That is bWorking = x... and x = bWorking; are the ONLY safe operations? IE complex assignments such as unary operators (++, --) are unsafe here, as are +=, *=, etc... ?
3) I assume if case 1 is correct, then it is not safe to expand that statement to also be safe for n writers and n readers when only assignment and reading is involved?

For Java:
1) a volatile variable is updated from/to the "main memory" on each reading writing, which means that the change by the updater thread will be seen by all reading threads on their next read. Also, updates are atomic (independent of variable type).
2) Yes, combined operations like ++ are not thread safe if you have multiple writers. For a single writing thread, there is no problem. (The volatile keyword makes sure that the update is seen by the other threads.)
3) As long as you only assign and read, volatile is enough - but if you have multiple writers, you can't be sure which value is the "final" one, or which will be read by which thread. Even the writing threads themselves can't reliably know that their own value is set. (If you only have boolean and will only set from true to false, there is no problem here.)
If you want more control, have a look at the classes in the java.util.concurrent.atomic package.

Do the locking. You are going to need to have locking anyway if you are writing multi-threaded code. C# and Java make it fairly simple. C++ is a little more complex but you should be able to use boost or make your own RAII classes. Given that you are going to be locking all over the place don't try to see if there are a few places where you might be able to avoid it. All will work fine until you run the code on a 64-way processor using new INtel microcode on a Tuesday in march on some misison critical customer system. Then bang.
People think that locks are expensive; they really aren't. The kernel devs spend a lot of time optimizing them and compared to one disk read they are utterly trivial; yet nobody ever seems to expend this much effort analyzing every last disk read
Add the usual statements about performance tuning evils, wise saying from Knuth, Spolsky ...... etc, etc,

For C++
1) This is tempting to try, and will usually work. However, a few things to keep in mind:
You're doing it with a boolean, so that seems safest. Other POD types might nor be so safe. E.g. it may take two instructions to set a 64-bit double on a 32-bit machine. So that would clearly be not thread safe.
If the boolean is the only thing you care about the threads sharing, this could work. If you're using it as a variant of the Double-Checked Lock Paradigm, you run into all the pitfalls therein. Consider:
std::string failure_message; // shared across threads
// some thread triggers the stop, and also reports why
failure_message = "File not found";
stopWork();
// all the other threads
while (bWorking) {...}
log << "Stopped work: " << failure_message;
This looks ok at first, because failure_message is set before bWorking is set to false. However, that may not be the case in practice. The compiler can rearrange the statements, and set bWorking first, resulting in thread unsafe access of failure_message. Even if the compiler doesn't, the hardware might. Multi-core cpus have their own caches, and thus things aren't quite so simple.
If it's just a boolean, it's probably ok. If it's more than that, it might have issues once in a while. How important is the code you're writing, and can you take that risk?
2) Correct, ++/--, +=, other operators will take multiple cpu instructions and will be thread unsafe. Depending on your platform and compiler, you may be able to write non-portable code to do atomic increments.
3) Correct, this would be unsafe in a general case. You can kinda squeak by when you have one thread, writing a single boolean once. As soon as you introduce multiple writes, you'd better have some real thread synchronization.
Note about cpu instructions
If an operation takes multiple instructions, your thread could be preempted between them -- and the operation would be partially complete. This is clearly bad for thread safety, and this is one reason why ++, +=, etc are not thread safe.
However, even if an operation takes a single instruction, that does not necessarily mean that it's thread safe. With multi-core and multi-cpu you have to worry about the visibility of a change -- when is the cpu cache flushed to main memory.
So while multiple instructions does imply not thread safe, it's false to assume that single instruction implies thread safe

With a 1-byte bool, you might be able to get away without using locking, but since you cannot guarantee the internals of the processor it'd still be a bad idea. Certainly with anything beyond 1 byte such as an integer you couldn't. One processor could be updating it while another was reading it on another thread, and you could get inconsistent results. In C# I would use a lock { } statement around the access (read or write) to bWorking. If it was something more complex, for example IO access to a large memory buffer, I'd use ReaderWriterLock or some variant of it. In C++ volatile won't help much, because that just prevents certain kinds of optimizations such as register variables which would totally cause problems in multithreading. You still need to use a locking construct.
So in summary I would never read and write anything in a multithreaded program without locking it somehow.

Updating a bool is going to be atomic on any sensible extant system. However, once your writer has written, there's no telling how long before your reader will read, especially once you take into account multiple cores, caches, scheduler oddities, and so on.
Part of the problem with increments and decrements (++, --) and compound assignments (+=, *=) is that they are misleading. They imply something is happening atomically that is actually happening in several operations. But even simple assignments can be unsafe one you have stepped away from the purity of boolean variables. Guaranteeing that a write as simple as x=foo is atomic is up to the details of your platform.
I assume by thread safe, you mean that readers will always see a consistent object no matter what the writers do. In your example this will always be the case since booleans can only evaluate to two values, both valid, and the value is only transitions once from true to false. Thread safety is going to be more difficult in a more complicated scenario.

Related

Explanation of Thread.MemoryBarrier() Bug with OoOP

Ok so after reading Albahari's Threading in C#, I am trying to get my head around Thread.MemoryBarrier() and Out-of-Order Processing.
Following Brian Gideon's answer on the Why we need Thread.MemoerBarrier() he mentions the following code causes the program to loop indefinitely on Release mode and without debugger attached.
class Program
{
static bool stop = false;
public static void Main(string[] args)
{
var t = new Thread(() =>
{
Console.WriteLine("thread begin");
bool toggle = false;
while (!stop)
{
// Thread.MemoryBarrier() or Console.WriteLine() fixes issue
toggle = !toggle;
}
Console.WriteLine("thread end");
});
t.Start();
Thread.Sleep(1000);
stop = true;
Console.WriteLine("stop = true");
Console.WriteLine("waiting...");
t.Join();
}
}
My question is why, without adding a Thread.MemoryBarrier(), or even Console.WriteLine() in the while loop fixes the issue?
I am guessing that because on a multi processor machine, the thread runs with its own cache of values, and never retrieves the updated value of stop because it has its value in cache?
Or is it that the main thread does not commit this to memory?
Also why does Console.WriteLine() fix this? Is it because it also implements a MemoryBarrier?

The compiler and CPU are free to optimize your code by re-ordering it in any way they see fit, as long as any changes are consistent for a single thread. This is why you never encounter issues in a single threaded program.
In your code you've got two threads that are using the stop flag. The compiler or CPU may choose to cache the value in a CPU register since in the thread you create since it can detect that your not writing to it in the thread. What you need is some way to tell the compiler/CPU that the variable is being modified in another thread and therefore it shouldn't cache the value but should read it from memory.
There are a couple of easy ways to do this. One if by surrounding all access to the stop variable in a lock statement. This will create a full barrier and ensure that each thread sees the current value. Another is to use the Interlocked class to read/write the variable, as this also puts up a full barrier.
There are also certain methods, such as Wait and Join, that also put up memory barriers in order to prevent reordering. The Albahari book lists these methods.

It doesn't fix any issues. It's a fake fix, rather dangerous in production code, as it may work, or it may not work.
The core problem is in this line
static bool stop = false;
The variable that stops a while loop is not volatile. Which means it may or may not be read from memory all the time. It can be cached, so that only the last read value is presented to a system (which may not be the actual current value).
This code
// Thread.MemoryBarrier() or Console.WriteLine() fixes issue
May or may not fix an issue on different platforms. Memory barrier or console write just happen to force application to read fresh values on a particular system. It may not be the same elsewhere.
Additionally, volatile and Thread.MemoryBarrier() only provide weak guarantees, which means they don't provide 100% assurance that a read value will always be the latest on all systems and CPUs.
Eric Lippert says
The true semantics of volatile reads
and writes are considerably more complex than I've outlined here; in
fact they do not actually guarantee that every processor stops what it
is doing and updates caches to/from main memory. Rather, they provide
weaker guarantees about how memory accesses before and after reads and
writes may be observed to be ordered with respect to each other.
Certain operations such as creating a new thread, entering a lock, or
using one of the Interlocked family of methods introduce stronger
guarantees about observation of ordering. If you want more details,
read sections 3.10 and 10.5.3 of the C# 4.0 specification.

That example does not have anything to do with out-of-order execution. It only shows the effect of possible compiler optimizing away the stop access, which should be addressed by simply marking the variable volatile. See Memory Reordering Caught in the Act for a better example.

Let us start with some definitions. The volatile keyword produces an acquire-fence on reads and a release-fence on writes. These are defined as follows.
acquire-fence: A memory barrier in which other reads and writes are not allowed to move before the fence.
release-fence: A memory barrier in which other reads and writes are not allowed to move after the fence.
The method Thread.MemoryBarrier generates a full-fence. That means it produces both the acquire-fence and the release-fence. Frustratingly the MSDN says this though.
Synchronizes memory access as follows: The processor executing the
current thread cannot reorder instructions in such a way that memory
accesses prior to the call to MemoryBarrier execute after memory
accesses that follow the call to MemoryBarrier.
Interpreting this leads us to believe that it only generates a release-fence though. So what is it? A full fence or half fence? That is probably a topic for another question. I am going to work under the assumption that it is a full fence because a lot of smart people have made that claim. But, more convincingly, the BCL itself uses Thread.MemoryBarrier as if it produced a full-fence. So in this case the documentation is probably wrong. Even more amusingly the statement actually implies that instructions before the call can somehow be sandwiched between the call and instructions after it. That would be absurd. I say this in jest (but not really) that it might benefit Microsoft to have a lawyer review all documentation regarding threading. I am sure their legalese skills could be put to good use in that area.
Now I am going to introduce an arrow notation to help illustrate the fences in action. An ↑ arrow will represent a release-fence and a ↓ arrow will represent an acquire-fence. Think of the arrow head as pushing memory access away in the direction of the arrow. But, and this is important, memory accesses can move past the tail. Read the definitions of the fences above and convince yourself that the arrows visually represent those definitions.
Next we will analyze the loop only as that is the most important part of the code. To do this I am going to unwind the loop. Here is what it looks like.
LOOP_TOP:
// Iteration 1
read stop into register
jump-if-true to LOOP_BOTTOM
↑
full-fence // via Thread.MemoryBarrier
↓
read toggle into register
negate register
write register to toggle
goto LOOP_TOP
// Iteration 2
read stop into register
jump-if-true to LOOP_BOTTOM
↑
full-fence // via Thread.MemoryBarrier
↓
read toggle into register
negate register
write register to toggle
goto LOOP_TOP
...
// Iteration N
read stop into register
jump-if-true to LOOP_BOTTOM
↑
full-fence // via Thread.MemoryBarrier
↓
read toggle into register
negate register
write register to toggle
goto LOOP_TOP
LOOP_BOTTOM:
Notice that the call to Thread.MemoryBarrier is constraining the movement of some of the memory access. For example, the read of toggle cannot move before the read of stop or vice-versa because those memory access are not allowed to move through an arrow head.
Now imagine what would happen if the full-fence were removed. The C# compiler, JIT compiler, or hardware are now have a lot more liberty in the moving the instructions around. In particular the lifting optimization, know formally as loop invariant code motion, is now allowed. Basically the compiler detects that stop is never modified and so the read is bubbled up out of the loop. It is now effectively cached into a register. If the memory barrier were in place then the read would have to push up through an arrow head and the specification specifically disallows that. This is much easier to visualize if you unwind the loop like I did above. Remember, the call to Thread.MemoryBarrier would be occurring on every iteration of the loop so you cannot simply draw conclusions about what would happen from only a single iteration.
The astute reader will notice that the compiler is free to swap the read of toggle and stop in such a manner that stop gets "refreshed" at the end of the loop instead of the beginning, but that is irrelevant to the contextual behavior of the loop. It has the exact same semantics and produces the same result.
My question is why, without adding a Thread.MemoryBarrier(), or even
Console.WriteLine() in the while loop fixes the issue?
Because the memory barrier places restrictions on the optimizations the compiler can perform. It would disallow loop invariant code motion. The assumption is that Console.WriteLine produces a memory barrier which is probably true. Without the memory barrier the C# compiler, JIT compiler, or hardware are free to hoist the read of stop up and outside of the loop itself.
I am guessing that because on a multi processor machine, the thread
runs with its own cache of values, and never retrieves the updated
value of stop because it has its value in cache?
In a nutshell...yes. Though keep in mind that it has nothing to do with the number of processors. This can be demonstrated with a single processor.
Or is it that the main thread does not commit this to memory?
No. The main thread will commit the write. The call to Thread.Join ensures that because it will create a memory barrier that disallows the movement of the write to fall below the join.
Also why does Console.WriteLine() fix this? Is it because it also
implements a MemoryBarrier?
Yes. It probably produces a memory barrier. I have been keeping a list of memory barrier generators here.

Threads communicating without locking

If I can guarantee myself that only one method in my entire app will ever write to a certain variable, then may I allow other methods in my app to safely read that value ?
If so, can I get away that stunt without locking the variable ?
In this context, what I'm doing (or, trying to do, or want to do) is for one method in one thread to put a value into the variable, and then other methods in other threads will read that value and make decisions.
A very nice option would be to lock against writes, while allowing reads.
Looked here MSDN page on lock and didn't see a way to do that.

As always, it depends a lot on the context.
a variable read in a tight loop may be stored in a register or local cache, so no change will be noticed unless you have a "fence"; volatile will fix this, but as a side-effect rather than by explicit intention; most people (including me) can't properly define what volatile means - so be very careful of using it as a "fix".
an oversize type (large struct) will not be atomic (for either read or write) - and cannot be handled safely without risk of tearing
an object or value might involve multiple sub-values; if they aren't changed atomically, it could cause problems
You might, however, find that Interlocked solves most of your problems without needing a lock. At the same time, an uncontested lock is insanely fast, and even a contested lock is still alarmingly fast. Frankly, I'm not sure that it is worth the thought you are giving it: a flat lock is almost certainly fast-enough, as long as you do the thinking first outside the lock, and only lock it when you know the changes you want to make.
There is also ReaderWriterLockSlim, but the number of cases where that actually improves performance is slim - in my experience, the simplest approach possible is usually the fastest, meaning either lock or Interlocked. ReaderWriterLockSlim is a more complex beast, designed for more complex scenarios, and has a little overhead because of it. Not massive amounts, but enough to make it worth looking carefully.

How to correctly read an Interlocked.Increment'ed int field?

Suppose I have a non-volatile int field, and a thread which Interlocked.Increments it. Can another thread safely read this directly, or does the read also need to be interlocked?
I previously thought that I had to use an interlocked read to guarantee that I'm seeing the current value, since, after all, the field isn't volatile. I've been using Interlocked.CompareExchange(int, 0, 0) to achieve that.
However, I've stumbled across this answer which suggests that actually plain reads will always see the current version of an Interlocked.Incremented value, and since int reading is already atomic, there's no need to do anything special. I've also found a request in which Microsoft rejects a request for Interlocked.Read(ref int), further suggesting that this is completely redundant.
So can I really safely read the most current value of such an int field without Interlocked?

If you want to guarantee that the other thread will read the latest value, you must use Thread.VolatileRead(). (*)
The read operation itself is atomic so that will not cause any problems but without volatile read you may get old value from the cache or compiler may optimize your code and eliminate the read operation altogether. From the compiler's point of view it is enough that the code works in single threaded environment. Volatile operations and memory barriers are used to limit the compiler's ability to optimize and reorder the code.
There are several participants that can alter the code: compiler, JIT-compiler and CPU. It does not really matter which one of them shows that your code is broken. The only important thing is the .NET memory model as it specifies the rules that must be obeyed by all participants.
(*) Thread.VolatileRead() does not really get the latest value. It will read the value and add a memory barrier after the read. The first volatile read may get cached value but the second would get an updated value because the memory barrier of the first volatile read has forced a cache update if it was necessary. In practice this detail has little importance when writing the code.

A bit of a meta issue, but a good aspect about using Interlocked.CompareExchange(ref value, 0, 0) (ignoring the obvious disadvantage that it's harder to understand when used for reading) is that it works regardless of int or long. It's true that int reads are always atomic, but long reads are not or may be not, depending on the architecture. Unfortunately, Interlocked.Read(ref value) only works if value is of type long.
Consider the case that you're starting with an int field, which makes it impossible to use Interlocked.Read(), so you'll read the value directly instead since that's atomic anyway. However, later in development you or somebody else decides that a long is required - the compiler won't warn you, but now you may have a subtle bug: The read access is not guaranteed to be atomic anymore. I found using Interlocked.CompareExchange() the best alternative here; It may be slower depending on the underlying processor instructions, but it is safer in the long run. I don't know enough about the internals of Thread.VolatileRead() though; It might be "better" regarding this use case since it provides even more signatures.
I would not try to read the value directly (i.e. without any of the above mechanisms) within a loop or any tight method call though, since even if the writes are volatile and/or memory barrier'd, nothing is telling the compiler that the value of the field can actually change between two reads. So, the field should be either volatile or any of the given constructs should be used.
My two cents.

You're correct that you do not need a special instruction to atomically read a 32bit integer, however, what that means is you will get the "whole" value (i.e. you won't get part of one write and part of another). You have no guarantees that the value won't have changed once you have read it.
It is at this point where you need to decide if you need to use some other synchronization method to control access, say if you're using this value to read a member from an array, etc.
In a nutshell, atomicity ensures an operation happens completely and indivisibly. Given some operation A that contained N steps, if you made it to the operation right after A you can be assured that all N steps happened in isolation from concurrent operations.
If you had two threads which executed the atomic operation A you are guaranteed you will see only the complete result of one of the two threads. If you want to coordinate the threads, atomic operations could be used to create the required synchronization. But atomic operations in and of themselves do not provide higher level synchronization. The Interlocked family of methods are made available to provide some fundamental atomic operations.
Synchronization is a broader kind of concurrency control, often built around atomic operations. Most processors include memory barriers which allow you to ensure all cache lines are flushed and you have a consistent view of memory. Volatile reads are a way to ensure consistent access to a given memory location.
While not immediately applicable to your problem, reading up on ACID (atomicity, consistency, isolation, and durability) with respect to databases may help you with the terminology.

Me, being paranoid, I do Interlocked.Add(ref incrementedField, 0) for int values

Yes everything you've read is correct. Interlocked.Increment is designed so that normal reads will not be false while making the changes to the field. Reading a field is not dangerous, writing a field is.

Interlocked and Memory Barriers

I have a question about the following code sample (m_value isn't volatile, and every thread runs on a separate processor)
void Foo() // executed by thread #1, BEFORE Bar() is executed
{
Interlocked.Exchange(ref m_value, 1);
}
bool Bar() // executed by thread #2, AFTER Foo() is executed
{
return m_value == 1;
}
Does using Interlocked.Exchange in Foo() guarantees that when Bar() is executed, I'll see the value "1"? (even if the value already exists in a register or cache line?) Or do I need to place a memory barrier before reading the value of m_value?
Also (unrelated to the original question), is it legal to declare a volatile member and pass it by reference to InterlockedXX methods? (the compiler warns about passing volatiles by reference, so should I ignore the warning in such case?)
Please Note, I'm not looking for "better ways to do things", so please don't post answers that suggest completely alternate ways to do things ("use a lock instead" etc.), this question comes out of pure interest..

Memory barriers don't particularly help you. They specify an ordering between memory operations, in this case each thread only has one memory operation so it doesn't matter. One typical scenario is writing non-atomically to fields in a structure, a memory barrier, then publishing the address of the structure to other threads. The Barrier guarantees that the writes to the structures members are seen by all CPUs before they get the address of it.
What you really need are atomic operations, ie. InterlockedXXX functions, or volatile variables in C#. If the read in Bar were atomic, you could guarantee that neither the compiler, nor the cpu, does any optimizations that prevent it from reading either the value before the write in Foo, or after the write in Foo depending on which gets executed first. Since you are saying that you "know" Foo's write happens before Bar's read, then Bar would always return true.
Without the read in Bar being atomic, it could be reading a partially updated value (ie. garbage), or a cached value (either from the compiler or from the CPU), both of which may prevent Bar from returning true which it should.
Most modern CPU's guarantee word aligned reads are atomic, so the real trick is that you have to tell the compiler that the read is atomic.

The usual pattern for memory barrier usage matches what you would put in the implementation of a critical section, but split into pairs for the producer and consumer. As an example your critical section implementation would typically be of the form:
while (!pShared->lock.testAndSet_Acquire()) ;
// (this loop should include all the normal critical section stuff like
// spin, waste,
// pause() instructions, and last-resort-give-up-and-blocking on a resource
// until the lock is made available.)
// Access to shared memory.
pShared->foo = 1
v = pShared-> goo
pShared->lock.clear_Release()
Acquire memory barrier above makes sure that any loads (pShared->goo) that may have been started before the successful lock modification are tossed, to be restarted if neccessary.
The release memory barrier ensures that the load from goo into the (local say) variable v is complete before the lock word protecting the shared memory is cleared.
You have a similar pattern in the typical producer and consumer atomic flag scenerio (it is difficult to tell by your sample if that is what you are doing but should illustrate the idea).
Suppose your producer used an atomic variable to indicate that some other state is ready to use. You'll want something like this:
pShared->goo = 14
pShared->atomic.setBit_Release()
Without a "write" barrier here in the producer you have no guarantee that the hardware isn't going to get to the atomic store before the goo store has made it through the cpu store queues, and up through the memory hierarchy where it is visible (even if you have a mechanism that ensures the compiler orders things the way you want).
In the consumer
if ( pShared->atomic.compareAndSwap_Acquire(1,1) )
{
v = pShared->goo
}
Without a "read" barrier here you won't know that the hardware hasn't gone and fetched goo for you before the atomic access is complete. The atomic (ie: memory manipulated with the Interlocked functions doing stuff like lock cmpxchg), is only "atomic" with respect to itself, not other memory.
Now, the remaining thing that has to be mentioned is that the barrier constructs are highly unportable. Your compiler probably provides _acquire and _release variations for most of the atomic manipulation methods, and these are the sorts of ways you would use them. Depending on the platform you are using (ie: ia32), these may very well be exactly what you would get without the _acquire() or _release() suffixes. Platforms where this matters are ia64 (effectively dead except on HP where its still twitching slightly), and powerpc. ia64 had .acq and .rel instruction modifiers on most load and store instructions (including the atomic ones like cmpxchg). powerpc has separate instructions for this (isync and lwsync give you the read and write barriers respectively).
Now. Having said all this. Do you really have a good reason for going down this path? Doing all this correctly can be very difficult. Be prepared for a lot of self doubt and insecurity in code reviews and make sure you have a lot of high concurrency testing with all sorts of random timing scenerios. Use a critical section unless you have a very very good reason to avoid it, and don't write that critical section yourself.

I'm not completely sure but I think the Interlocked.Exchange will use the InterlockedExchange function of the windows API that provides a full memory barrier anyway.
This function generates a full memory
barrier (or fence) to ensure that
memory operations are completed in
order.

The interlocked exchange operations guarantee a memory barrier.
The following synchronization functions use the appropriate barriers
to ensure memory ordering:
Functions that enter or leave critical sections
Functions that signal synchronization objects
Wait functions
Interlocked functions
(Source : link)
But you are out of luck with register variables. If m_value is in a register in Bar, you won't see the change to m_value. Due to this, you should declare shared variables 'volatile'.

If m_value is not marked as volatile, then there is no reason to think that the value read in Bar is fenced. Compiler optimizations, caching, or other factors could reorder the reads and writes. Interlocked exchange is only helpful when it is used in an ecosystem of properly fenced memory references. This is the whole point of marking a field volatile. The .Net memory model is not as straight forward as some might expect.

Interlocked.Exchange() should guarantee that the value is flushed to all CPUs properly - it provides its own memory barrier.
I'm surprised that the compiler is complaing about passing a volatile into Interlocked.Exchange() - the fact that you're using Interlocked.Exchange() should almost mandate a volatile variable.
The problem you might see is that if the compiler does some heavy optimization of Bar() and realizes that nothing changes the value of m_value it can optimize away your check. That's what the volatile keyword would do - it would hint to the compiler that that variable may be changed outside of the optimizer's view.

If you don't tell the compiler or runtime that m_value should not be read ahead of Bar(), it can and may cache the value of m_value ahead of Bar() and simply use the cached value. If you want to ensure that it sees the "latest" version of m_value, either shove in a Thread.MemoryBarrier() or use Thread.VolatileRead(ref m_value). The latter is less expensive than a full memory barrier.
Ideally you could shove in a ReadBarrier, but the CLR doesn't seem to support that directly.
EDIT: Another way to think about it is that there are really two kinds of memory barriers: compiler memory barriers that tell the compiler how to sequence reads and writes and CPU memory barriers that tell the CPU how to sequence reads and writes. The Interlocked functions use CPU memory barriers. Even if the compiler treated them as compiler memory barriers, it still wouldn't matter, as in this specific case, Bar() could have been separately compiled and not known of the other uses of m_value that would require a compiler memory barrier.

Why volatile is not enough?

I'm confused. Answers to my previous question seems to confirm my assumptions. But as stated here volatile is not enough to assure atomicity in .Net. Either operations like incrementation and assignment in MSIL are not translated directly to single, native OPCODE or many CPUs can simultaneously read and write to the same RAM location.
To clarify:
I want to know if writes and reads are atomic on multiple CPUs?
I understand what volatile is about. But is it enough? Do I need to use interlocked operations if I want to get latest value writen by other CPU?

Herb Sutter recently wrote an article on volatile and what it really means (how it affects ordering of memory access and atomicity) in the native C++. .NET, and Java environments. It's a pretty good read:
volatile vs. volatile

volatile in .NET does make access to the variable atomic.
The problem is, that's often not enough. What if you need to read the variable, and if it is 0 (indicating that the resource is free), you set it to 1 (indicating that it's locked, and other threads should stay away from it).
Reading the 0 is atomic. Writing the 1 is atomic. But between those two operations, anything might happen. You might read a 0, and then before you can write the 1, another thread jumps in, reads the 0, and writes an 1.
However, volatile in .NET does guarantee atomicity of accesses to the variable. It just doesn't guarantee thread safety for operations relying on multiple accesses to it. (Disclaimer: volatile in C/C++ does not even guarantee this. Just so you know. It is much weaker, and occasinoally a source of bugs because people assume it guarantees atomicity :))
So you need to use locks as well, to group together multiple operations as one thread-safe chunk. (Or, for simple operations, the Interlocked operations in .NET may do the trick)

I might be jumping the gun here but it sounds to me as though you're confusing two issues here.
One is atomicity, which in my mind means that a single operation (that may require multiple steps) should not come in conflict with another such single operation.
The other is volatility, when is this value expected to change, and why.
Take the first. If your two-step operation requires you to read the current value, modify it, and write it back, you're most certainly going to want a lock, unless this whole operation can be translated into a single CPU instruction that can work on a single cache-line of data.
However, the second issue is, even when you're doing the locking thing, what will other threads see.
A volatile field in .NET is a field that the compiler knows can change at arbitrary times. In a single-threaded world, the change of a variable is something that happens at some point in a sequential stream of instructions so the compiler knows when it has added code that changes it, or at least when it has called out to outside world that may or may not have changed it so that once the code returns, it might not be the same value it was before the call.
This knowledge allows the compiler to lift the value from the field into a register once, before a loop or similar block of code, and never re-read the value from the field for that particular code.
With multi-threading however, that might give you some problems. One thread might have adjusted the value, and another thread, due to optimization, won't be reading this value for some time, because it knows it hasn't changed.
So when you flag a field as volatile you're basically telling the compiler that it shouldn't assume that it has the current value of this at any point, except for grabbing snapshots every time it needs the value.
Locks solve multiple-step operations, volatility handles how the compiler caches the field value in a register, and together they will solve more problems.
Also note that if a field contains something that cannot be read in a single cpu-instruction, you're most likely going to want to lock read-access to it as well.
For instance, if you're on a 32-bit cpu and writing a 64-bit value, that write-operation will require two steps to complete, and if another thread on another cpu manages to read the 64-bit value before step 2 has completed, it will get half of the previous value and half of the new, nicely mixed together, which can be even worse than getting an outdated one.
Edit: To answer the comment, that volatile guarantees the atomicity of the read/write operation, that's well, true, in a way, because the volatile keyword cannot be applied to fields that are larger than 32-bit, in effect making the field single-cpu-instruction read/writeable on both 32 and 64-bit cpu's. And yes, it will prevent the value from being kept in a register as much as possible.
So part of the comment is wrong, volatile cannot be applied to 64-bit values.
Note also that volatile has some semantics regarding reordering of reads/writes.
For relevant information, see the MSDN documentation or the C# specification, found here, section 10.5.3.

On a hardware level, multiple CPUs can never write simultanously to the same atomic RAM location. The size of an atomic read/write operation dependeds on CPU architecture, but is typically 1, 2 or 4 bytes on a 32-bit architecture. However, if you try reading the result back there is always a chance that another CPU has made a write to the same RAM location inbetween. On a low level, spin-locks are typically used to synchronize access to shared memory. In a high level language, such mechanisms may be called e.g. critical regions.
The volatile type just makes sure the variable is written immediately back to memory when it is changed (even if the value is to be used in the same function). A compiler will usually keep a value in an internal register for as long as possible if the value is to be reused later in the same function, and it is stored back to RAM when all modifications are finished or when a function returns. Volatile types are mostly useful when writing to hardware registers, or when you want to be sure a value is stored back to RAM in e.g. a multithread system.

Your question doesn't entirely make sense, because volatile specifies the how the read happens, not atomicity of multi-step processes. My car doesn't mow my lawn, either, but I try not to hold that against it. :)

The problem comes in with register based cashed copies of your variable's values.
When reading a value, the cpu will first see if it's in a register (fast) before checking main memory (slower).
Volatile tells the compiler to push the value out to main memory asap, and not to trust the cached register value. It's only useful in certain cases.
If you're looking for single op code writes, you'll need to use the Interlocked.Increment related methods.. But they're fairly limited in what they can do in a single safe instruction.
Safest and most reliable bet is to lock() (if you can't do an Interlocked.*)
Edit: Writes and reads are atomic if they're in a lock or an interlocked.* statement. Volatile alone is not enough under the terms of your question

Volatile is a compiler keyword that tells the compiler what to do. It does not necessarily translate into (essentially) bus operations that are required for atomicity. That is usually left up to the operating system.
Edit: to clarify, volatile is never enough if you want to guarantee atomicity. Or rather, it's up to the compiler to make it enough or not.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.