I have a question about the following code sample (m_value isn't volatile, and every thread runs on a separate processor)
void Foo() // executed by thread #1, BEFORE Bar() is executed
{
Interlocked.Exchange(ref m_value, 1);
}
bool Bar() // executed by thread #2, AFTER Foo() is executed
{
return m_value == 1;
}
Does using Interlocked.Exchange in Foo() guarantee that when Bar() is executed, I'll see the value "1"? (Even if the value already exists in a register or cache line?) Or do I need to place a memory barrier before reading the value of m_value?
Also (unrelated to the original question), is it legal to declare a volatile member and pass it by reference to InterlockedXX methods? (the compiler warns about passing volatiles by reference, so should I ignore the warning in such case?)
Please note, I'm not looking for "better ways to do things", so please don't post answers suggesting completely different approaches ("use a lock instead", etc.); this question comes out of pure interest.
Memory barriers don't particularly help you. They specify an ordering between memory operations; in this case each thread only has one memory operation, so it doesn't matter. One typical scenario is writing non-atomically to fields in a structure, then a memory barrier, then publishing the address of the structure to other threads. The barrier guarantees that the writes to the structure's members are seen by all CPUs before they get its address.
What you really need are atomic operations, i.e. the InterlockedXXX functions, or volatile variables in C#. If the read in Bar were atomic, you could guarantee that neither the compiler nor the CPU performs any optimization that prevents it from reading either the value before the write in Foo, or the value after it, depending on which gets executed first. Since you are saying that you "know" Foo's write happens before Bar's read, Bar would always return true.
Without the read in Bar being atomic, it could be reading a partially updated value (i.e. garbage), or a cached value (either from the compiler or from the CPU), either of which may prevent Bar from returning true as it should.
Most modern CPUs guarantee that word-aligned reads are atomic, so the real trick is that you have to tell the compiler that the read is atomic.
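For illustration, a minimal sketch of the volatile-variable option in C#, using the m_value field from the question wrapped in a hypothetical class; a volatile read cannot be cached in a register or optimized away:
class Sample
{
    volatile int m_value;

    public void Foo() { m_value = 1; }           // thread #1: volatile (atomic, ordered) write
    public bool Bar() { return m_value == 1; }   // thread #2: volatile read, never cached in a register
}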
The usual pattern for memory barrier usage matches what you would put in the implementation of a critical section, but split into pairs for the producer and consumer. As an example your critical section implementation would typically be of the form:
while (!pShared->lock.testAndSet_Acquire())
    ; // (this loop should include all the normal critical-section stuff like
      // spin, waste, pause() instructions, and a last-resort
      // give-up-and-block on a resource until the lock is made available.)

// Access to shared memory.
pShared->foo = 1;
v = pShared->goo;

pShared->lock.clear_Release();
The acquire memory barrier above makes sure that any loads (pShared->goo) that may have been started before the successful lock modification are tossed, to be restarted if necessary.
The release memory barrier ensures that the load from goo into the (local say) variable v is complete before the lock word protecting the shared memory is cleared.
You have a similar pattern in the typical producer/consumer atomic-flag scenario (it is difficult to tell from your sample whether that is what you are doing, but it should illustrate the idea).
Suppose your producer used an atomic variable to indicate that some other state is ready to use. You'll want something like this:
pShared->goo = 14;
pShared->atomic.setBit_Release();
Without a "write" barrier here in the producer you have no guarantee that the hardware isn't going to get to the atomic store before the goo store has made it through the cpu store queues, and up through the memory hierarchy where it is visible (even if you have a mechanism that ensures the compiler orders things the way you want).
In the consumer
if ( pShared->atomic.compareAndSwap_Acquire(1,1) )
{
v = pShared->goo;
}
Without a "read" barrier here you won't know that the hardware hasn't gone and fetched goo for you before the atomic access is complete. The atomic (ie: memory manipulated with the Interlocked functions doing stuff like lock cmpxchg), is only "atomic" with respect to itself, not other memory.
Now, the remaining thing that has to be mentioned is that the barrier constructs are highly non-portable. Your compiler probably provides _acquire and _release variations for most of the atomic manipulation methods, and these are the sorts of ways you would use them. Depending on the platform you are using (e.g. ia32), these may very well be exactly what you would get without the _acquire() or _release() suffixes. Platforms where this matters are ia64 (effectively dead except on HP, where it's still twitching slightly) and PowerPC. ia64 had .acq and .rel instruction modifiers on most load and store instructions (including the atomic ones like cmpxchg). PowerPC has separate instructions for this (isync and lwsync give you the read and write barriers respectively).
Now, having said all this: do you really have a good reason for going down this path? Doing all this correctly can be very difficult. Be prepared for a lot of self-doubt and insecurity in code reviews, and make sure you have a lot of high-concurrency testing with all sorts of random timing scenarios. Use a critical section unless you have a very, very good reason to avoid it, and don't write that critical section yourself.
I'm not completely sure, but I think Interlocked.Exchange will use the InterlockedExchange function of the Windows API, which provides a full memory barrier anyway.
This function generates a full memory barrier (or fence) to ensure that memory operations are completed in order.
The interlocked exchange operations guarantee a memory barrier.
The following synchronization functions use the appropriate barriers to ensure memory ordering:
Functions that enter or leave critical sections
Functions that signal synchronization objects
Wait functions
Interlocked functions
(Source : link)
But you are out of luck with register variables. If m_value is in a register in Bar, you won't see the change to m_value. Due to this, you should declare shared variables 'volatile'.
If m_value is not marked as volatile, then there is no reason to think that the value read in Bar is fenced. Compiler optimizations, caching, or other factors could reorder the reads and writes. Interlocked exchange is only helpful when it is used in an ecosystem of properly fenced memory references. This is the whole point of marking a field volatile. The .NET memory model is not as straightforward as some might expect.
Interlocked.Exchange() should guarantee that the value is flushed to all CPUs properly - it provides its own memory barrier.
I'm surprised that the compiler is complaining about passing a volatile into Interlocked.Exchange() - the fact that you're using Interlocked.Exchange() should almost mandate a volatile variable.
The problem you might see is that if the compiler does some heavy optimization of Bar() and realizes that nothing changes the value of m_value it can optimize away your check. That's what the volatile keyword would do - it would hint to the compiler that that variable may be changed outside of the optimizer's view.
If you don't tell the compiler or runtime that m_value should not be read ahead of Bar(), it can and may cache the value of m_value ahead of Bar() and simply use the cached value. If you want to ensure that it sees the "latest" version of m_value, either shove in a Thread.MemoryBarrier() or use Thread.VolatileRead(ref m_value). The latter is less expensive than a full memory barrier.
Ideally you could shove in a ReadBarrier, but the CLR doesn't seem to support that directly.
EDIT: Another way to think about it is that there are really two kinds of memory barriers: compiler memory barriers that tell the compiler how to sequence reads and writes, and CPU memory barriers that tell the CPU how to sequence reads and writes. The Interlocked functions use CPU memory barriers. Even if the compiler treated them as compiler memory barriers, it still wouldn't matter, as in this specific case Bar() could have been separately compiled, unaware of the other uses of m_value that would require a compiler memory barrier.
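As a sketch of the two options mentioned above (Thread.MemoryBarrier and Thread.VolatileRead), using the m_value field from the question wrapped in a hypothetical class:
using System.Threading;

class Sample
{
    int m_value;

    public bool BarWithBarrier()
    {
        Thread.MemoryBarrier();                          // full fence before the read
        return m_value == 1;
    }

    public bool BarWithVolatileRead()
    {
        return Thread.VolatileRead(ref m_value) == 1;    // acquire-semantics read; cheaper than a full fence
    }
}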
Related
This question is a follow-up to comments in this thread.
Let's assume we have the following code:
// (1)
lock (padlock)
{
// (2)
}
var value = nonVolatileField; // (3)
Furthermore, let's assume that no instruction in (2) has any effect on the nonVolatileField and vice versa.
Can the reading instruction (3) be reordered in such a way that it ends up before the lock statement (1) or inside it (2)?
As far as I can tell, nothing in the C# Specification (§3.10) and the CLI Specification (§I.12.6.5) prohibits such reordering.
Please note that this is not the same question as this one. Here I am asking specifically about read instructions, because as far as I understand, they are not considered side-effects and have weaker guarantees.
I believe this is partially guaranteed by the CLI spec, although it's not as clear as it might be. From I.12.6.5:
Acquiring a lock (System.Threading.Monitor.Enter or entering a synchronized method) shall implicitly perform a volatile read operation, and releasing a lock
(System.Threading.Monitor.Exit or leaving a synchronized method) shall implicitly perform a volatile write operation. See §I.12.6.7.
Then from I.12.6.7:
A volatile read has “acquire semantics” meaning that the read is guaranteed to occur prior to any references to memory that occur after the read instruction in the CIL instruction sequence. A volatile write has “release semantics” meaning that the write is guaranteed to happen after any memory references prior to the write instruction in the CIL instruction sequence.
So entering the lock should prevent (3) from moving to (1). Reading from nonVolatileField still counts as a "reference to memory", I believe. However, the read could still be performed before the volatile write when the lock exits, so it could still be moved to (2).
The C#/CLI memory model leaves a lot to be desired at the moment. I'm hoping that the whole thing can be clarified significantly (and probably tightened up, to make some "theoretically valid but practically awful" optimizations invalid).
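To restate the conclusion against the question's example (a sketch; padlock and nonVolatileField as in the question, wrapped in a hypothetical class):
class Example
{
    readonly object padlock = new object();
    int nonVolatileField;

    void M()
    {
        lock (padlock)                        // (1) Monitor.Enter: implicit volatile read (acquire)
        {
            // (2) body
        }                                     //     Monitor.Exit: implicit volatile write (release)
        var value = nonVolatileField;         // (3) may be moved up into (2), but not above (1)
    }
}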
As far as .NET is concerned, entering a monitor (the lock statement) has acquire semantics, as it implicitly performs a volatile read, and exiting a monitor (the end of the lock block) has release semantics, as it implicitly performs a volatile write (see §12.6.5 Locks and Threads in Common Language Infrastructure (CLI) Partition I).
volatile bool areWeThereYet = false;
// In thread 1
// Accesses, usually writes: create objects, initialize them
areWeThereYet = true;
// In thread 2
if (areWeThereYet)
{
// Accesses, usually reads: use created and initialized objects
}
When you write a value to areWeThereYet, all accesses before it were performed and not reordered to after the volatile write.
When you read from areWeThereYet, subsequent accesses are not reordered to before the volatile read.
In this case, when thread 2 observes that areWeThereYet has changed, it has a guarantee that the following accesses, usually reads, will observe the other thread's accesses, usually writes. Assuming there is no other code messing with the affected variables.
As for other synchronization primitives in .NET, such as SemaphoreSlim, although not explicitly documented, it would be rather useless if they didn't have similar semantics. Programs based on them could, in fact, not even work correctly in platforms or hardware architectures with a weaker memory model.
Many people share the thought that Microsoft ought to enforce a strong memory model on such architectures, similar to x86/amd64, so as to keep the current code base (Microsoft's own and those of their clients) compatible.
I cannot verify myself, as I don't have an ARM device with Microsoft Windows, much less with .NET Framework for ARM, but at least one MSDN magazine article by Andrew Pardoe, CLR - .NET Development for ARM Processors, states:
The CLR is allowed to expose a stronger memory model than the ECMA CLI specification requires. On x86, for example, the memory model of the CLR is strong because the processor’s memory model is strong. The .NET team could’ve made the memory model on ARM as strong as the model on x86, but ensuring the perfect ordering whenever possible can have a notable impact on code execution performance. We’ve done targeted work to strengthen the memory model on ARM—specifically, we’ve inserted memory barriers at key points when writing to the managed heap to guarantee type safety—but we’ve made sure to only do this with a minimal impact on performance. The team went through multiple design reviews with experts to make sure that the techniques applied in the ARM CLR were correct. Moreover, performance benchmarks show that .NET code execution performance scales the same as native C++ code when compared across x86, x64 and ARM.
(I know they don't but I'm looking for the underlying reason this actually works without using volatile since there should be nothing preventing the compiler from storing a variable in a register without volatile... or is there...)
This question stems from the conflicting ideas that, without volatile, the compiler can in theory optimize any variable in various ways, including storing it in a CPU register, while the docs say volatile is not needed when using synchronization like lock around variables. But in some cases there is seemingly no way the compiler/JIT can know whether your code path will use them. So the suspicion is that something else is really happening here to make the memory model "work".
In this example, what prevents the compiler/JIT from optimizing _count into a register, so that the increment is done on the register rather than directly to memory (only writing to memory after the Exit call)? If _count were volatile, it would seem everything should be fine, but a lot of code is written without volatile. It makes sense that the compiler could know not to optimize _count into a register if it saw a lock or synchronization object in the method... but in this case the lock call is in another function.
Most documentation says you don't need to use volatile if you use a synchronization call like lock.
So what prevents the compiler from optimizing _count into a register and potentially updating just the register within the lock? I have a feeling that most member variables won't be optimized into registers for this exact reason, as otherwise every member variable would really need to be volatile unless the compiler could tell it shouldn't optimize (otherwise I suspect tons of code would fail). I saw something similar when looking at C++ years ago: local function variables got stored in registers, but class member variables did not.
So the main question is: is the only reason this works without volatile that the compiler/JIT won't put class member variables in registers, making volatile unnecessary?
(Please ignore the lack of exception handling and safety in the calls, but you get the gist.)
public class MyClass
{
object _o=new object();
int _count=0;
public void Increment()
{
Enter();
// ... many usages of _count here ...
_count++;
Exit();
}
// let's pretend these functions are too big to inline and even call other methods
// that actually make the monitor call (for example, a base class that implemented these)
private void Enter() { Monitor.Enter(_o); }
private void Exit() { Monitor.Exit(_o); } // let's pretend this function is too big to inline
// ...
// ...
}
Entering and leaving a Monitor causes a full memory fence. Thus the CLR makes sure that all writing operations before the Monitor.Enter / Monitor.Exit become visible to all other threads and that all reading operations after the method call "happen" after it. That also means that statements before the call cannot be moved after the call and vice versa.
See http://www.albahari.com/threading/part4.aspx.
The best-guess answer to this question would appear to be that any variables stored in CPU registers are saved to memory before any function is called. This makes sense because, from a single-threaded compiler-design viewpoint, that would be required; otherwise the object might appear inconsistent if it were used by other functions/methods/objects.
So it may not be so much, as some people/articles claim, that synchronization objects/classes are detected by the compiler and non-volatile variables are made safe through their calls (perhaps they are when a lock or other synchronization object is used in the same method, but probably not once the calls to those synchronization objects are in another method). Instead, it is likely that just the fact of calling another method is enough to cause the values stored in CPU registers to be saved to memory, and thus not all variables need to be volatile.
Also, I suspect (and others have suspected too) that fields of a class are not optimized as aggressively, due to some of these threading concerns.
Some notes (my understanding):
Thread.MemoryBarrier() is mostly a CPU instruction to ensure that writes/reads don't bypass the barrier from a CPU perspective. (It is not directly related to values stored in registers.) So it is probably not what directly causes variables to be saved from registers to memory, except insofar as it is a method call, which per the discussion here would likely cause that to happen; it could really have been any method call, though, that causes all the class fields in use to be saved from registers.
It is theoretically possible that the JIT/compiler could also take that method into account within the same method to ensure variables are written back from CPU registers. But following our simple proposed rule, any call to another method or class results in register-stored variables being saved to memory. Besides, if someone wrapped that call in another method (maybe many methods deep), the compiler wouldn't likely analyze that deep to speculate on execution. The JIT could do something, but again it likely wouldn't analyze that deep, and in both cases locks/synchronization need to work no matter what, so the simplest optimization is the likely answer.
Unless someone who writes the compilers can confirm this, it's all a guess, but it's likely the best guess we have for why volatile is not needed.
If that rule is followed, synchronization objects just need to make their own call to MemoryBarrier when they enter and leave, to ensure the CPU has the most up-to-date values from its write caches (so they get flushed and proper values can be read). On this site you will see that is what is suggested under implicit memory barriers: http://www.albahari.com/threading/part4.aspx
So what prevents the compiler from optimizing _count into a register and potentially updating just the register within the lock?
There is nothing in the documentation that I am aware of that would preclude that from happening. The point is that the call to Monitor.Exit will effectively guarantee that the final value of _count is committed to memory upon completion.
It makes sense that the compiler could know not to optimize _count into a register if it saw a lock or synchronization object in the method... but in this case the lock call is in another function.
The fact that the lock is acquired and released in other methods is irrelevant from your point of view. The memory model defines a pretty rigid set of rules that must be adhered to regarding memory-barrier generators. The only consequence of putting those Monitor calls in another method is that the JIT compiler will have a harder time complying with those rules. But the JIT compiler must comply; period. If the method calls get too complex or nested too deep, then I suspect the JIT compiler punts on any heuristics it might have in this regard and says, "Forget it, I'm just not going to optimize anything!"
So the main question is: is the only reason this works without volatile that the compiler/JIT won't put class member variables in registers, making volatile unnecessary?
It works because the protocol is to acquire the lock prior to reading _count as well. If the readers do not do that then all bets are off.
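As a sketch of that protocol, a reader added to the question's MyClass would take the same lock before reading _count (assuming the _o and _count fields from the code above):
public int GetCount()
{
    lock (_o)            // same lock the writer holds between Enter() and Exit()
    {
        return _count;
    }
}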
A few questions about thread safety that I think I understand, but would like clarification on, if you could be so kind. The specific languages I program in are C++, C#, and Java. Please keep these in mind when describing specific language keywords/features.
1) Cases of 1 writer, n readers. When n threads read a variable (such as in a polled loop) and 1 writer updates it, is explicit locking required?
Consider:
// thread 1.
volatile bool bWorking = true;
void stopWork() { bWorking = false; }
// thread n
while (bWorking) {...}
Here, should it be enough to just have a memory barrier, and accomplish this with volatile? As I understand it, in the languages mentioned above, simple reads and writes to primitives will not be interleaved, so explicit locking is not required; however, memory consistency cannot be guaranteed without some explicit lock or volatile. Are my assumptions correct here?
2) Assuming my assumption above is correct, then it is only correct for simple reads and writes. That is, bWorking = x and x = bWorking; are the ONLY safe operations? I.e., complex assignments such as unary operators (++, --) are unsafe here, as are +=, *=, etc.?
3) I assume if case 1 is correct, then it is not safe to expand that statement to also be safe for n writers and n readers when only assignment and reading is involved?
For Java:
1) A volatile variable is updated from/to "main memory" on each read/write, which means that a change by the updating thread will be seen by all reading threads on their next read. Also, updates are atomic (independent of variable type).
2) Yes, combined operations like ++ are not thread safe if you have multiple writers. For a single writing thread, there is no problem. (The volatile keyword makes sure that the update is seen by the other threads.)
3) As long as you only assign and read, volatile is enough - but if you have multiple writers, you can't be sure which value is the "final" one, or which will be read by which thread. Even the writing threads themselves can't reliably know that their own value is set. (If you only have boolean and will only set from true to false, there is no problem here.)
If you want more control, have a look at the classes in the java.util.concurrent.atomic package.
Do the locking. You are going to need locking anyway if you are writing multi-threaded code. C# and Java make it fairly simple. C++ is a little more complex, but you should be able to use Boost or make your own RAII classes. Given that you are going to be locking all over the place, don't try to find a few places where you might be able to avoid it. Everything will work fine until you run the code on a 64-way processor using new Intel microcode on a Tuesday in March on some mission-critical customer system. Then bang.
People think that locks are expensive; they really aren't. The kernel devs spend a lot of time optimizing them, and compared to one disk read they are utterly trivial; yet nobody ever seems to expend this much effort analyzing every last disk read.
Add the usual statements about the evils of performance tuning, wise sayings from Knuth, Spolsky, etc.
For C++
1) This is tempting to try, and will usually work. However, a few things to keep in mind:
You're doing it with a boolean, so that seems safest. Other POD types might not be so safe. E.g. it may take two instructions to set a 64-bit double on a 32-bit machine, so that would clearly not be thread safe.
If the boolean is the only thing you care about the threads sharing, this could work. If you're using it as a variant of the Double-Checked Lock Paradigm, you run into all the pitfalls therein. Consider:
std::string failure_message; // shared across threads
// some thread triggers the stop, and also reports why
failure_message = "File not found";
stopWork();
// all the other threads
while (bWorking) {...}
log << "Stopped work: " << failure_message;
This looks OK at first, because failure_message is set before bWorking is set to false. However, that may not be the case in practice. The compiler can rearrange the statements and set bWorking first, resulting in thread-unsafe access of failure_message. Even if the compiler doesn't, the hardware might. Multi-core CPUs have their own caches, and thus things aren't quite so simple.
If it's just a boolean, it's probably ok. If it's more than that, it might have issues once in a while. How important is the code you're writing, and can you take that risk?
2) Correct, ++/--, +=, other operators will take multiple cpu instructions and will be thread unsafe. Depending on your platform and compiler, you may be able to write non-portable code to do atomic increments.
3) Correct, this would be unsafe in a general case. You can kinda squeak by when you have one thread, writing a single boolean once. As soon as you introduce multiple writes, you'd better have some real thread synchronization.
Note about CPU instructions
If an operation takes multiple instructions, your thread could be preempted between them -- and the operation would be partially complete. This is clearly bad for thread safety, and this is one reason why ++, +=, etc are not thread safe.
However, even if an operation takes a single instruction, that does not necessarily mean that it's thread safe. With multi-core and multi-cpu you have to worry about the visibility of a change -- when is the cpu cache flushed to main memory.
So while multiple instructions do imply not thread safe, it is false to assume that a single instruction implies thread safe.
With a 1-byte bool, you might be able to get away without using locking, but since you cannot guarantee the internals of the processor it would still be a bad idea. Certainly with anything beyond 1 byte, such as an integer, you couldn't: one processor could be updating it while another thread on another processor reads it, and you could get inconsistent results. In C# I would use a lock { } statement around the access (read or write) to bWorking. If it were something more complex, for example I/O access to a large memory buffer, I'd use ReaderWriterLock or some variant of it. In C++, volatile won't help much, because it only prevents certain kinds of optimizations, such as register variables, which would totally cause problems in multithreading. You still need to use a locking construct.
So in summary I would never read and write anything in a multithreaded program without locking it somehow.
Updating a bool is going to be atomic on any sensible extant system. However, once your writer has written, there's no telling how long before your reader will read, especially once you take into account multiple cores, caches, scheduler oddities, and so on.
Part of the problem with increments and decrements (++, --) and compound assignments (+=, *=) is that they are misleading. They imply something is happening atomically that is actually happening in several operations. But even simple assignments can be unsafe once you have stepped away from the purity of boolean variables. Guaranteeing that a write as simple as x = foo is atomic is up to the details of your platform.
I assume by thread safe, you mean that readers will always see a consistent object no matter what the writers do. In your example this will always be the case, since booleans can only evaluate to two values, both valid, and the value only transitions once from true to false. Thread safety is going to be more difficult in a more complicated scenario.
Suppose I have a non-volatile int field, and a thread which Interlocked.Increments it. Can another thread safely read this directly, or does the read also need to be interlocked?
I previously thought that I had to use an interlocked read to guarantee that I'm seeing the current value, since, after all, the field isn't volatile. I've been using Interlocked.CompareExchange(int, 0, 0) to achieve that.
However, I've stumbled across this answer which suggests that plain reads will actually always see the current version of an Interlocked.Incremented value, and since reading an int is already atomic, there's no need to do anything special. I've also found a feature request for Interlocked.Read(ref int) that Microsoft rejected, further suggesting that this is completely redundant.
So can I really safely read the most current value of such an int field without Interlocked?
If you want to guarantee that the other thread will read the latest value, you must use Thread.VolatileRead(). (*)
The read operation itself is atomic, so that will not cause any problems, but without a volatile read you may get an old value from the cache, or the compiler may optimize your code and eliminate the read operation altogether. From the compiler's point of view it is enough that the code works in a single-threaded environment. Volatile operations and memory barriers are used to limit the compiler's ability to optimize and reorder the code.
There are several participants that can alter the code: compiler, JIT-compiler and CPU. It does not really matter which one of them shows that your code is broken. The only important thing is the .NET memory model as it specifies the rules that must be obeyed by all participants.
(*) Thread.VolatileRead() does not really get the latest value. It will read the value and add a memory barrier after the read. The first volatile read may get cached value but the second would get an updated value because the memory barrier of the first volatile read has forced a cache update if it was necessary. In practice this detail has little importance when writing the code.
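A small sketch of the suggestion above, assuming a hypothetical counter class (the field and method names are made up for illustration):
using System.Threading;

class Counter
{
    int _value;

    public void Increment()
    {
        Interlocked.Increment(ref _value);        // atomic read-modify-write with a full fence
    }

    public int ReadLatest()
    {
        return Thread.VolatileRead(ref _value);   // acquire-semantics read; not cached in a register
    }
}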
A bit of a meta issue, but a nice aspect of using Interlocked.CompareExchange(ref value, 0, 0) (ignoring the obvious disadvantage that it's harder to understand when used for reading) is that it works for both int and long. It's true that int reads are always atomic, but long reads are not, or may not be, depending on the architecture. Unfortunately, Interlocked.Read(ref value) only works if value is of type long.
Consider the case where you're starting with an int field, which makes it impossible to use Interlocked.Read(), so you read the value directly instead, since that's atomic anyway. However, later in development you or somebody else decides that a long is required. The compiler won't warn you, but now you may have a subtle bug: the read access is no longer guaranteed to be atomic. I found using Interlocked.CompareExchange() the best alternative here; it may be slower depending on the underlying processor instructions, but it is safer in the long run. I don't know enough about the internals of Thread.VolatileRead(), though; it might be "better" regarding this use case since it provides even more signatures.
I would not try to read the value directly (i.e. without any of the above mechanisms) within a loop or any tight method call though, since even if the writes are volatile and/or memory barrier'd, nothing is telling the compiler that the value of the field can actually change between two reads. So, the field should be either volatile or any of the given constructs should be used.
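For example, a sketch of that pattern with a hypothetical long field, where a plain read could be torn on a 32-bit platform:
using System.Threading;

class Totals
{
    long _total;

    public void Add(long amount)
    {
        Interlocked.Add(ref _total, amount);
    }

    public long Read()
    {
        // Atomic even on 32-bit platforms; Interlocked.Read(ref _total) would also
        // work here, but it only exists for long fields.
        return Interlocked.CompareExchange(ref _total, 0, 0);
    }
}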
My two cents.
You're correct that you do not need a special instruction to atomically read a 32-bit integer; however, what that means is that you will get the "whole" value (i.e. you won't get part of one write and part of another). You have no guarantee that the value won't have changed once you have read it.
It is at this point where you need to decide if you need to use some other synchronization method to control access, say if you're using this value to read a member from an array, etc.
In a nutshell, atomicity ensures an operation happens completely and indivisibly. Given some operation A that contained N steps, if you made it to the operation right after A you can be assured that all N steps happened in isolation from concurrent operations.
If you had two threads which executed the atomic operation A you are guaranteed you will see only the complete result of one of the two threads. If you want to coordinate the threads, atomic operations could be used to create the required synchronization. But atomic operations in and of themselves do not provide higher level synchronization. The Interlocked family of methods are made available to provide some fundamental atomic operations.
Synchronization is a broader kind of concurrency control, often built around atomic operations. Most processors include memory barriers which allow you to ensure all cache lines are flushed and you have a consistent view of memory. Volatile reads are a way to ensure consistent access to a given memory location.
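As a (deliberately simplistic) sketch of building synchronization from a single atomic operation, with made-up names, and not something to use in place of Monitor/lock in real code:
using System.Threading;

class SpinFlag
{
    int _held;                                   // 0 = free, 1 = held

    public void Enter()
    {
        // Atomically transition 0 -> 1; only the thread that saw 0 gets in.
        while (Interlocked.CompareExchange(ref _held, 1, 0) != 0)
        {
            Thread.SpinWait(1);
        }
    }

    public void Exit()
    {
        Volatile.Write(ref _held, 0);            // release the flag
    }
}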
While not immediately applicable to your problem, reading up on ACID (atomicity, consistency, isolation, and durability) with respect to databases may help you with the terminology.
Me, being paranoid, I do Interlocked.Add(ref incrementedField, 0) for int values
Yes, everything you've read is correct. Interlocked.Increment is designed so that normal reads will not observe an invalid value while the change to the field is being made. Reading a field is not dangerous; writing a field is.
I'm confused. Answers to my previous question seem to confirm my assumptions. But as stated here, volatile is not enough to assure atomicity in .NET. Either operations like increment and assignment in MSIL are not translated directly to a single native opcode, or many CPUs can simultaneously read and write to the same RAM location.
To clarify:
I want to know if writes and reads are atomic on multiple CPUs?
I understand what volatile is about. But is it enough? Do I need to use interlocked operations if I want to get the latest value written by another CPU?
Herb Sutter recently wrote an article on volatile and what it really means (how it affects ordering of memory accesses and atomicity) in the native C++, .NET, and Java environments. It's a pretty good read:
volatile vs. volatile
volatile in .NET does make access to the variable atomic.
The problem is, that's often not enough. What if you need to read the variable, and if it is 0 (indicating that the resource is free), you set it to 1 (indicating that it's locked, and other threads should stay away from it).
Reading the 0 is atomic. Writing the 1 is atomic. But between those two operations, anything might happen. You might read a 0, and then, before you can write the 1, another thread jumps in, reads the 0, and writes a 1.
However, volatile in .NET does guarantee atomicity of accesses to the variable. It just doesn't guarantee thread safety for operations relying on multiple accesses to it. (Disclaimer: volatile in C/C++ does not even guarantee this. Just so you know. It is much weaker, and occasionally a source of bugs because people assume it guarantees atomicity. :))
So you need to use locks as well, to group together multiple operations as one thread-safe chunk. (Or, for simple operations, the Interlocked operations in .NET may do the trick)
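For the simple flag case above, Interlocked.CompareExchange collapses the read and the write into one atomic step; a sketch with a hypothetical state field (0 = free, 1 = locked):
using System.Threading;

class Resource
{
    int _state;                                  // 0 = free, 1 = locked

    public bool TryAcquire()
    {
        // Only one thread can observe 0 and write 1; everyone else sees the non-zero old value.
        return Interlocked.CompareExchange(ref _state, 1, 0) == 0;
    }

    public void Release()
    {
        Volatile.Write(ref _state, 0);
    }
}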
I might be jumping the gun here but it sounds to me as though you're confusing two issues here.
One is atomicity, which in my mind means that a single operation (that may require multiple steps) should not come in conflict with another such single operation.
The other is volatility, when is this value expected to change, and why.
Take the first. If your two-step operation requires you to read the current value, modify it, and write it back, you're most certainly going to want a lock, unless this whole operation can be translated into a single CPU instruction that can work on a single cache-line of data.
However, the second issue is, even when you're doing the locking thing, what will other threads see.
A volatile field in .NET is a field that the compiler knows can change at arbitrary times. In a single-threaded world, the change of a variable is something that happens at some point in a sequential stream of instructions, so the compiler knows when it has added code that changes it, or at least when it has called out to the outside world, which may or may not have changed it, so that once the call returns, the field might not have the same value it had before the call.
This knowledge allows the compiler to lift the value from the field into a register once, before a loop or similar block of code, and never re-read the value from the field for that particular code.
With multi-threading however, that might give you some problems. One thread might have adjusted the value, and another thread, due to optimization, won't be reading this value for some time, because it knows it hasn't changed.
So when you flag a field as volatile you're basically telling the compiler that it shouldn't assume that it has the current value of this at any point, except for grabbing snapshots every time it needs the value.
Locks solve multiple-step operations, volatility handles how the compiler caches the field value in a register, and together they will solve more problems.
Also note that if a field contains something that cannot be read in a single cpu-instruction, you're most likely going to want to lock read-access to it as well.
For instance, if you're on a 32-bit cpu and writing a 64-bit value, that write-operation will require two steps to complete, and if another thread on another cpu manages to read the 64-bit value before step 2 has completed, it will get half of the previous value and half of the new, nicely mixed together, which can be even worse than getting an outdated one.
Edit: To answer the comment that volatile guarantees the atomicity of the read/write operation: that's true, in a way, because the volatile keyword cannot be applied to fields that are larger than 32 bits, in effect making the field readable/writable in a single CPU instruction on both 32- and 64-bit CPUs. And yes, it will prevent the value from being kept in a register as much as possible.
So part of the comment is wrong, volatile cannot be applied to 64-bit values.
Note also that volatile has some semantics regarding reordering of reads/writes.
For relevant information, see the MSDN documentation or the C# specification, found here, section 10.5.3.
On a hardware level, multiple CPUs can never write simultaneously to the same atomic RAM location. The size of an atomic read/write operation depends on the CPU architecture, but is typically 1, 2 or 4 bytes on a 32-bit architecture. However, if you try reading the result back, there is always a chance that another CPU has made a write to the same RAM location in between. On a low level, spin-locks are typically used to synchronize access to shared memory. In a high-level language, such mechanisms may be called e.g. critical regions.
The volatile type just makes sure the variable is written immediately back to memory when it is changed (even if the value is to be used in the same function). A compiler will usually keep a value in an internal register for as long as possible if the value is to be reused later in the same function, and it is stored back to RAM when all modifications are finished or when a function returns. Volatile types are mostly useful when writing to hardware registers, or when you want to be sure a value is stored back to RAM in e.g. a multithread system.
Your question doesn't entirely make sense, because volatile specifies how the read happens, not the atomicity of multi-step processes. My car doesn't mow my lawn, either, but I try not to hold that against it. :)
The problem comes in with register-based cached copies of your variables' values.
When reading a value, the CPU will first see if it's in a register (fast) before checking main memory (slower).
Volatile tells the compiler to push the value out to main memory as soon as possible, and not to trust the cached register value. It's only useful in certain cases.
If you're looking for single-opcode writes, you'll need to use Interlocked.Increment and related methods, but they're fairly limited in what they can do in a single safe instruction.
Safest and most reliable bet is to lock() (if you can't do an Interlocked.*)
Edit: Writes and reads are atomic if they're in a lock or an interlocked.* statement. Volatile alone is not enough under the terms of your question
Volatile is a compiler keyword that tells the compiler what to do. It does not necessarily translate into (essentially) bus operations that are required for atomicity. That is usually left up to the operating system.
Edit: to clarify, volatile is never enough if you want to guarantee atomicity. Or rather, it's up to the compiler to make it enough or not.