Can memory reordering cause C# to access unallocated memory?


It is my understanding that C# is a safe language and doesn't allow one to access unallocated memory, other than through the unsafe keyword. However, its memory model allows reordering when there is unsynchronized access between threads. This leads to race hazards where references to new instances appear to be available to racing threads before the instances have been fully initialized, and is a widely known problem for double-checked locking. Chris Brumme (from the CLR team) explains this in their Memory Model article:
Consider the standard double-locking protocol:
if (a == null)
{
    lock(obj)
    {
        if (a == null)
            a = new A();
    }
}
This is a common technique for avoiding a lock on the read of ‘a’ in the typical case. It works just fine on X86. But it would be broken by a legal but weak implementation of the ECMA CLI spec. It’s true that, according to the ECMA spec, acquiring a lock has acquire semantics and releasing a lock has release semantics.
However, we have to assume that a series of stores have taken place during construction of ‘a’. Those stores can be arbitrarily reordered, including the possibility of delaying them until after the publishing store which assigns the new object to ‘a’. At that point, there is a small window before the store.release implied by leaving the lock. Inside that window, other CPUs can navigate through the reference ‘a’ and see a partially constructed instance.
I've always been confused by what "partially constructed instance" means. Assuming that the .NET runtime clears out memory on allocation rather than garbage collection (discussion), does this mean that the other thread might read memory that still contains data from garbage-collected objects (like what happens in unsafe languages)?
Consider the following concrete example:
byte[] buffer = new byte[2];
Parallel.Invoke(
    () => buffer = new byte[4],
    () => Console.WriteLine(BitConverter.ToString(buffer)));
The above has a race condition; the output would be either 00-00 or 00-00-00-00. However, is it possible that the second thread reads the new reference to buffer before the array's memory has been initialized to 0, and outputs some other arbitrary string instead?

Let's not bury the lede here: the answer to your question is no, you will never observe the pre-allocated state of memory in the CLR 2.0 memory model.
I'll now address a couple of your non-central points.
It is my understanding that C# is a safe language and doesn't allow one to access unallocated memory, other than through the unsafe keyword.
That is more or less correct. There are some mechanisms by which one can access bogus memory without using unsafe -- via unmanaged code, obviously, or by abusing structure layout. But in general, yes, C# is memory safe.
However, its memory model allows reordering when there is unsynchronized access between threads.
Again, that's more or less correct. A better way to think about it is that C# allows reordering at any point where the reordering would be invisible to a single threaded program, subject to certain constraints. Those constraints include introducing acquire and release semantics in certain cases, and preserving certain side effects at certain critical points.
Chris Brumme (from the CLR team) ...
The late great Chris's articles are gems and give a great deal of insight into the early days of the CLR, but I note that there have been some strengthenings of the memory model since 2003 when that article was written, particularly with respect to the issue you raise.
Chris is right that double-checked locking is super dangerous. There is a correct way to do double-checked locking in C#, and the moment you depart from it even slightly, you are off in the weeds of horrible bugs that only repro on weak memory model hardware.
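For illustration, here is a sketch of the form usually recommended for double-checked locking in C#: the shared field is declared volatile, so the publishing write has release semantics and the unlocked read has acquire semantics. The names Settings and SettingsHolder are hypothetical, not taken from Chris's article, and Lazy<T> is shown as the alternative that avoids hand-written locking altogether.

```csharp
using System;
using System.Threading;

// Hypothetical type standing in for the 'A' of the quoted example.
public sealed class Settings
{
    public int Value = 42;
}

public static class SettingsHolder
{
    private static readonly object Gate = new object();

    // volatile: reads of this field are acquire reads, writes are release writes.
    private static volatile Settings instance;

    public static Settings Instance
    {
        get
        {
            Settings local = instance;      // single volatile read on the fast path
            if (local == null)
            {
                lock (Gate)
                {
                    local = instance;       // re-check under the lock
                    if (local == null)
                    {
                        local = new Settings();
                        instance = local;   // volatile write publishes the finished object
                    }
                }
            }
            return local;
        }
    }

    // Alternative: let the framework do the locking.
    private static readonly Lazy<Settings> LazySettings =
        new Lazy<Settings>(() => new Settings(), LazyThreadSafetyMode.ExecutionAndPublication);

    public static Settings LazyInstance => LazySettings.Value;
}
```

Unless profiling says otherwise, the Lazy<T> variant is probably the safer default, since there is nothing to get subtly wrong.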
does this mean that the other thread might read memory that still contains data from garbage-collected objects
I think your question is not specifically about the old weak ECMA memory model that Chris was describing, but rather about what guarantees are actually made today.
It is not possible for re-orderings to expose the previous state of objects. You are guaranteed that when you read a freshly-allocated object, its fields are all zeros.
This is made possible by the fact that all writes have release semantics in the current memory model; see this for details:
http://joeduffyblog.com/2007/11/10/clr-20-memory-model/
The write that initializes the memory to zero will not be moved forwards in time with respect to a read later.
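To make the guarantee concrete, here is the example from the question wrapped in a method that returns the string instead of printing it. This is only a sketch (ZeroedArrayDemo and RunOnce are hypothetical names); the race over which array is seen remains, but under the guarantee described above every byte that is read is zero.

```csharp
using System;
using System.Threading.Tasks;

public static class ZeroedArrayDemo
{
    // Runs the racy example once and returns what the reader saw.
    public static string RunOnce()
    {
        byte[] buffer = new byte[2];
        string printed = null;

        Parallel.Invoke(
            () => buffer = new byte[4],                      // writer: publishes a new array
            () => printed = BitConverter.ToString(buffer));  // reader: sees the old or the new array

        return printed;
    }
}
```

Calling RunOnce() repeatedly should only ever yield "00-00" or "00-00-00-00", never leftover data from a collected object.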
I've always been confused by "partially constructed objects"
Joe discusses that here: http://joeduffyblog.com/2010/06/27/on-partiallyconstructed-objects/
Here the concern is not that we might see the pre-allocation state of an object. Rather, the concern here is that one thread might see an object while the constructor is still running on another thread.
Indeed, it is possible for the constructor and the finalizer to be running concurrently, which is super weird! Finalizers are hard to write correctly for this reason.
Put another way: the CLR guarantees you that its own invariants will be preserved. An invariant of the CLR is that newly allocated memory is observed to be zeroed out, so that invariant will be preserved.
But the CLR is not in the business of preserving your invariants! If you have a constructor which guarantees that field x is true if and only if y is non-null, then you are responsible for ensuring that this invariant is always observed to be true. If in some way this is observed by two threads, then one of those threads might observe the invariant being violated.
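One way to keep such an invariant from being observed in a broken state is to finish construction before the reference escapes, and to publish it with an explicit release write. A minimal sketch, with hypothetical Connection and ConnectionRegistry types that are not from Joe's article:

```csharp
using System.Threading;

public sealed class Connection
{
    public readonly bool IsOpen;
    public readonly object Handle;

    public Connection()
    {
        // The constructor never hands out 'this', so no other thread
        // can see the object between these two assignments.
        Handle = new object();
        IsOpen = true;
    }

    // The invariant: IsOpen is true if and only if Handle is non-null.
    public bool InvariantHolds => IsOpen == (Handle != null);
}

public static class ConnectionRegistry
{
    private static Connection current;

    public static void Publish()
    {
        var connection = new Connection();        // fully constructed first
        Volatile.Write(ref current, connection);  // release write: publish afterwards
    }

    // Acquire read: returns null or a fully constructed instance.
    public static Connection TryGet() => Volatile.Read(ref current);
}
```

The design choice here is simply "construct, then publish"; the problems described above arise when the reference leaks while the constructor is still running.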

Related

Can a read instruction after an unrelated lock statement be moved before the lock?

This question is a follow-up to comments in this thread.
Let's assume we have the following code:
// (1)
lock (padlock)
{
// (2)
}
var value = nonVolatileField; // (3)
Furthermore, let's assume that no instruction in (2) has any effect on the nonVolatileField and vice versa.
Can the reading instruction (3) be reordered in such a way that it ends up before the lock statement (1) or inside it (2)?
As far as I can tell, nothing in the C# Specification (§3.10) and the CLI Specification (§I.12.6.5) prohibits such reordering.
Please note that this is not the same question as this one. Here I am asking specifically about read instructions, because as far as I understand, they are not considered side-effects and have weaker guarantees.

I believe this is partially guaranteed by the CLI spec, although it's not as clear as it might be. From I.12.6.5:
Acquiring a lock (System.Threading.Monitor.Enter or entering a synchronized method) shall implicitly perform a volatile read operation, and releasing a lock
(System.Threading.Monitor.Exit or leaving a synchronized method) shall implicitly perform a volatile write operation. See §I.12.6.7.
Then from I.12.6.7:
A volatile read has “acquire semantics” meaning that the read is guaranteed to occur prior to any references to memory that occur after the read instruction in the CIL instruction sequence. A volatile write has “release semantics” meaning that the write is guaranteed to happen after any memory references prior to the write instruction in the CIL instruction sequence.
So entering the lock should prevent (3) from moving to (1). Reading from nonVolatileField still counts as a "reference to memory", I believe. However, the read could still be performed before the volatile write when the lock exits, so it could still be moved to (2).
The C#/CLI memory model leaves a lot to be desired at the moment. I'm hoping that the whole thing can be clarified significantly (and probably tightened up, to make some "theoretically valid but practically awful" optimizations invalid).

As far as .NET is concerned, entering a monitor (the lock statement) has acquire semantics, as it implicitly performs a volatile read, and exiting a monitor (the end of the lock block) has release semantics, as it implicitly performs a volatile write (see §12.6.5 Locks and Threads in Common Language Infrastructure (CLI) Partition I).
volatile bool areWeThereYet = false;
// In thread 1
// Accesses, usually writes: create objects, initialize them
areWeThereYet = true;
// In thread 2
if (areWeThereYet)
{
    // Accesses, usually reads: use created and initialized objects
}
When you write a value to areWeThereYet, all accesses before it have already been performed and cannot be reordered to after the volatile write.
When you read from areWeThereYet, subsequent accesses cannot be reordered to before the volatile read.
In this case, when thread 2 observes that areWeThereYet has changed, it has a guarantee that the following accesses, usually reads, will observe the other thread's accesses, usually writes. This assumes there is no other code messing with the affected variables.
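The fragment above can be turned into a small runnable sketch. The names FlagDemo, payload and ready are made up for this example; the point is only that the ordinary write to payload happens before the volatile write to ready, so a thread that sees ready == true also sees the payload.

```csharp
using System.Threading;

public static class FlagDemo
{
    private static int payload;          // ordinary field, written before the flag
    private static volatile bool ready;  // volatile flag used for publication

    public static int Run()
    {
        payload = 0;
        ready = false;
        int observed = -1;

        var producer = new Thread(() =>
        {
            payload = 123;   // cannot be reordered after the volatile write below
            ready = true;    // volatile write (release)
        });

        var consumer = new Thread(() =>
        {
            while (!ready)   // volatile read (acquire)
            {
                Thread.Yield();
            }
            observed = payload;  // cannot be reordered before the volatile read above
        });

        consumer.Start();
        producer.Start();
        producer.Join();
        consumer.Join();
        return observed;
    }
}
```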
As for other synchronization primitives in .NET, such as SemaphoreSlim, although not explicitly documented, it would be rather useless if they didn't have similar semantics. Programs based on them could, in fact, not even work correctly in platforms or hardware architectures with a weaker memory model.
Many people share the thought that Microsoft ought to enforce a strong memory model on such architectures, similar to x86/amd64 as to keep the current code base (Microsoft's own and those of their clients) compatible.
I cannot verify myself, as I don't have an ARM device with Microsoft Windows, much less with .NET Framework for ARM, but at least one MSDN magazine article by Andrew Pardoe, CLR - .NET Development for ARM Processors, states:
The CLR is allowed to expose a stronger memory model than the ECMA CLI specification requires. On x86, for example, the memory model of the CLR is strong because the processor’s memory model is strong. The .NET team could’ve made the memory model on ARM as strong as the model on x86, but ensuring the perfect ordering whenever possible can have a notable impact on code execution performance. We’ve done targeted work to strengthen the memory model on ARM—specifically, we’ve inserted memory barriers at key points when writing to the managed heap to guarantee type safety—but we’ve made sure to only do this with a minimal impact on performance. The team went through multiple design reviews with experts to make sure that the techniques applied in the ARM CLR were correct. Moreover, performance benchmarks show that .NET code execution performance scales the same as native C++ code when compared across x86, x64 and ARM.

How Instances of immutable types are inherently thread-safe

I searched for "Why is the .NET String immutable?" and got this answer:
Instances of immutable types are inherently thread-safe, since no
thread can modify it, the risk of a thread modifying it in a way that
interfers with another is removed (the reference itself is a different
matter).
So I want to know: how are instances of immutable types inherently thread-safe?

Why Instances of immutable types are inherently thread-safe?
Because an instance of a string type can't be mutated across multiple threads. This effectively means that one thread changing the string won't result in that same string being changed in another thread, since a new string is allocated in the place the mutation is taking place.
Generally, everything becomes easier when you create an object once, and then only observe it. Once you need to modify it, a new local copy gets created.
Wikipedia:
Immutable objects can be useful in multi-threaded applications.
Multiple threads can act on data represented by immutable objects
without concern of the data being changed by other threads. Immutable
objects are therefore considered to be more thread-safe than mutable
objects.
@xanatos (and Wikipedia) point out that immutable isn't always thread-safe. We like to make that correlation because we say "any type which has persistent non-changing state is safe across thread boundaries", but that may not always be the case. Assume a type is immutable from the "outside", but internally needs to modify its state in a way which may not be safe when done in parallel from multiple threads, and may cause undetermined behavior. This means that although it is immutable, it is not thread-safe.
To conclude, immutable != thread-safe. But immutability does take you one step closer, when done right, towards being able to do multi-threaded work correctly.
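As a small illustration of "create once, then only observe", here is a sketch of an immutable type in C#. ImmutablePoint is a hypothetical class, not something from the framework; a "modification" produces a new instance and leaves the original untouched, so any thread already holding the original never sees it change.

```csharp
public sealed class ImmutablePoint
{
    // Get-only properties: assigned in the constructor, never afterwards.
    public int X { get; }
    public int Y { get; }

    public ImmutablePoint(int x, int y)
    {
        X = x;
        Y = y;
    }

    // "Mutation" returns a fresh object instead of changing this one.
    public ImmutablePoint WithX(int newX) => new ImmutablePoint(newX, Y);
}
```

For example, after var q = p.WithX(5), p still has its old X, and only code that was handed q sees the new value. The reference through which threads reach the object is a separate matter, as the quoted answer notes.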

The short answer:
Because you only write the data in 1 thread and always read it after writing in multiple threads. Because there is no read/write conflict possible, it's thread safe.
The long answer:
A string is essentially a pointer to a buffer of memory. Basically what happens is that you create a buffer, fill it with characters and then expose the pointer to the outside world.
Note that you cannot access the contents of the string before the string object itself is constructed, which enforces this ordering of 'write data', then 'expose pointer'. If you did it the other way around (I guess that's theoretically possible), problems might arise.
If another thread (let's say: CPU) reads the pointer, it is a 'new pointer' for the CPU, which therefore requires the CPU to go to the 'real' memory and then read the data. If it would take the pointer contents from cache, we would have had a problem.
The last piece of the puzzle has to do with memory management: we have to know it's a 'new' pointer. In .NET we know this is the case: memory on the heap is basically never re-used until a GC occurs. The garbage collector then does a mark, sweep and compact.
Now, you might argue that the 'compact' phase reuses pointers, therefore changing the contents of the pointers. While this is true, the GC also has to stop the threads and force a full memory fence, which in simple terms, flushes the CPU cache. After that, all memory access is guaranteed, which ensures you always have to go to memory after the GC phase completes.
As you can see there is no way to read the data by not reading it directly from memory (the way it was written). Since it's immutable, the contents remain the same for all threads until it's eventually collected. As such, it's thread safe.
I've seen some discussion about immutable here, that suggests you can change an internal state. Of course, the moment you start changing things, you can potentially introduce read/write conflicts.
The definition of immutable that I'm using here is that the contents stay constant after creation. That is: write once, read many, don't change (any) state after exposing the pointer. You get the picture.

One of the biggest problems in multithreaded code is two threads accessing the same memory cell at the same time, with at least one of them modifying it.
If none of the threads can modify a memory cell, the problem does not exist any longer.
Because an immutable variable is not modifiable, it can be used from several threads without any further measures (for example, locks).

Why can't I cut off the volatile from spinlock implementation?

According to this article, http://msdn.microsoft.com/en-us/magazine/cc163715.aspx,
this is the implementation of spinlock class:
class SpinLock
{
    volatile int isEntered; // non-zero if the lock is entered
    public void Enter()
    {
        while (Interlocked.CompareExchange(ref isEntered, 1, 0) != 0)
        {
            Thread.Sleep(0); // force a thread context switch
        }
    }
    public void Exit()
    {
        isEntered = 0;
    }
}
I know what volatile means and does, but I can't understand why it's here.
One last thing I want to ask, on another topic: does reading an object's property count as an atomic operation? From my understanding, there are two reads here: first the object reference, and second the property itself.

After (re)studying memory models for a couple of days, I had the same vexing question about the use of volatile in SpinLock. I cannot find any problematic reordering of memory access that requires either of the fences entailed by the volatile access.
Acquire fence: Provided by Interlocked.CompareExchange().
Release fence: To get lock semantics, we want all side effects of the code "under the lock" to be visible when the lock is released. But the .NET memory model does not permit reordering of writes. Therefore, the writes from under the lock must become visible no later than the clearing of isEntered.
Update
After some more study based on the Core CLR discussions, I figured out where the disconnect is. First, the two popular sources that state writes also include a release fence or just that they cannot be reordered:
http://joeduffyblog.com/2007/11/10/clr-20-memory-model/
Rule 2: All stores have release semantics, i.e. no load or store may move after one.
https://learn.microsoft.com/en-us/archive/msdn-magazine/2005/october/understanding-low-lock-techniques-in-multithreaded-apps, in the "Strong Model 2: .NET Framework 2.0" section:
Writes cannot move past other writes from the same thread.
Turns out that the ".NET Framework 2 model" is not inherited by all later versions and ports of the framework. The most enlightening article ended up being this one: C# - The C# Memory Model in Theory and Practice, Part 2. The author explains the more relaxed CLR implementations for IA64 and ARM. Essentially, normal writes can be reordered, and thus memory fences are needed in some situations. (How they've decided to insert a release fence before a normal write to a reference type field on the heap strikes me as a desperate hack for maintaining compatibility with code written for x86, but it's well justified in the historical context and the context of the popular lock-free object publication pattern.)
Conclusion: volatile is needed for the SpinLock field in a generic ".NET" implementation to ensure writes under the lock occur before the lock is released--without taking dependence on the stronger ".NET Framework 2 model" or the CPU architecture.
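To show where the release fence sits, here is a sketch of the same lock with the volatile modifier replaced by an explicit Volatile.Write in Exit. This is my own variation rather than the article's code: SimpleSpinLock and SpinLockDemo are hypothetical names, and the counter test only exercises the lock; it cannot prove the absence of reordering.

```csharp
using System.Threading;

public sealed class SimpleSpinLock
{
    private int isEntered; // non-zero while the lock is held

    public void Enter()
    {
        // Interlocked.CompareExchange acts as a full fence, so it covers the acquire side.
        while (Interlocked.CompareExchange(ref isEntered, 1, 0) != 0)
        {
            Thread.Sleep(0); // give up the rest of the time slice
        }
    }

    public void Exit()
    {
        // Release write: writes made under the lock cannot move after this store.
        Volatile.Write(ref isEntered, 0);
    }
}

public static class SpinLockDemo
{
    // Increments a shared counter from several threads under the lock.
    public static int Run(int threadCount, int iterations)
    {
        var spin = new SimpleSpinLock();
        int counter = 0;
        var threads = new Thread[threadCount];

        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(() =>
            {
                for (int j = 0; j < iterations; j++)
                {
                    spin.Enter();
                    counter++;      // plain read-modify-write, protected by the lock
                    spin.Exit();
                }
            });
            threads[i].Start();
        }

        foreach (var t in threads)
        {
            t.Join();
        }
        return counter;
    }
}
```

SpinLockDemo.Run(4, 10000) should return 40000 if the lock provides mutual exclusion.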

After reading a few sources, I had the impression that volatile only affects reordering done by compiler/runtime during optimization. The C# 5.0 Reference indicates it's more than this:
These optimizations can be performed by the compiler, by the run-time system, or by hardware
And then I came across this little article
http://www.codeproject.com/Articles/31283/Volatile-fields-in-NET-A-look-inside

What is C#'s version of the GIL?

In the current implementation of CPython, there is an object known as the "GIL" or "Global Interpreter Lock". It is essentially a mutex that prevents two Python threads from executing Python code at the same time. This prevents two threads from being able to corrupt the state of the Python interpreter, but also prevents multiple threads from really executing together. Essentially, if I do this:
# Thread A
some_list.append(3)
# Thread B
some_list.append(4)
I can't corrupt the list, because at any given time, only one of those threads is executing, since they must hold the GIL to do so. Now, the items in the list might be added in some indeterminate order, but the point is that the list isn't corrupted, and two things will always get added.
So, now to C#. C# essentially faces the same problem as Python, so, how does C# prevent this? I'd also be interested in hearing Java's story, if anyone knows it.
Clarification: I'm interested in what happens without explicit locking statements, especially to the VM. I am aware that locking primitives exist for both Java & C# - they exist in Python as well: The GIL is not used for multi-threaded code, other than to keep the interpreter sane. I am interested in the direct equivalent of the above, so, in C#, if I can remember enough... :-)
List<String> s;
// Reference to s is shared by two threads, which both execute this:
s.Add("hello");
// State of s?
// State of the VM? (And if sane, how so?)
Here's another example:
class A
{
public String s;
}
// Thread A & B
some_A.s = some_other_value;
// some_A's state must change: how does it change?
// Is the VM still in good shape afterwards?
I'm not looking to write bad C# code, I understand the lock statements. Even in Python, the GIL doesn't give you magic-multi-threaded code: you must still lock shared resources. But the GIL prevents Python's "VM" from being corrupted - it is this behavior that I'm interested in.

Most other languages that support threading don't have an equivalent of the Python GIL; they require you to use mutexes, either implicitly or explicitly.

Using lock, you would do this:
lock(some_list)
{
some_list.Add(3);
}
and in thread 2:
lock(some_list)
{
some_list.Add(4);
}
The lock statement ensures that only one thread at a time can execute a block guarded by a lock on the same object, some_list in this case. It does not stop code that skips the lock from touching the list, so every access has to take the same lock. See http://msdn.microsoft.com/en-us/library/c5kehkcz(VS.80).aspx for more information.
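A runnable sketch of the same idea, with many concurrent writers instead of two. LockedListDemo and AddConcurrently are hypothetical names, and I lock on a private gate object rather than on the list itself, which is a common convention but not something the snippets above require.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

public static class LockedListDemo
{
    // Adds 'count' items from parallel workers and returns the final list size.
    public static int AddConcurrently(int count)
    {
        var list = new List<int>();
        var gate = new object(); // private lock object, not visible to other code

        Parallel.For(0, count, i =>
        {
            lock (gate)          // only one worker at a time runs this block
            {
                list.Add(i);
            }
        });

        return list.Count;
    }
}
```

With the lock in place, AddConcurrently(10000) returns 10000. Without it, List<T>.Add is not safe to call concurrently, and items can be lost or an exception thrown.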

C# does not have an equivalent of Python's GIL.
Though they face the same issue, their design goals make them different.
With the GIL, CPython ensures that operations such as appending to a list from two threads are simple. It also means that only one thread is allowed to run at any time. This makes lists and dictionaries thread safe. Though this makes the job simpler and more intuitive, it makes it harder to exploit the advantages of multithreading on multicore machines.
With no GIL, C# does the opposite. It puts the burden of integrity on the developer of the program but allows you to take advantage of running multiple threads simultaneously.
As per one of the discussions:
The GIL in CPython is purely a design choice of having a big lock vs. a lock per object and synchronisation to make sure that objects are kept in a coherent state. This consists of a trade-off: giving up the full power of multithreading.
It has been observed that most problems do not suffer from this disadvantage, and there are libraries which help you solve exactly this issue when required.
That means for a certain class of problems, the burden of utilizing multiple cores is passed to the developer so that the rest can enjoy the simpler, more intuitive approach.
Note: Other implementations like IronPython do not have a GIL.

It may be instructive to look at the documentation for the Java equivalent of the class you're discussing:
Note that this implementation is not synchronized. If multiple threads access an ArrayList instance concurrently, and at least one of the threads modifies the list structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more elements, or explicitly resizes the backing array; merely setting the value of an element is not a structural modification.) This is typically accomplished by synchronizing on some object that naturally encapsulates the list. If no such object exists, the list should be "wrapped" using the Collections.synchronizedList method. This is best done at creation time, to prevent accidental unsynchronized access to the list:
List list = Collections.synchronizedList(new ArrayList(...));
The iterators returned by this class's iterator and listIterator methods are fail-fast: if the list is structurally modified at any time after the iterator is created, in any way except through the iterator's own remove or add methods, the iterator will throw a ConcurrentModificationException. Thus, in the face of concurrent modification, the iterator fails quickly and cleanly, rather than risking arbitrary, non-deterministic behavior at an undetermined time in the future.
Note that the fail-fast behavior of an iterator cannot be guaranteed as it is, generally speaking, impossible to make any hard guarantees in the presence of unsynchronized concurrent modification. Fail-fast iterators throw ConcurrentModificationException on a best-effort basis. Therefore, it would be wrong to write a program that depended on this exception for its correctness: the fail-fast behavior of iterators should be used only to detect bugs.

Most complex data structures (for example, lists) can be corrupted when used from multiple threads without locking.
Since changes of references are atomic, a reference always stays a valid reference.
But there is a problem when interacting with security-critical code. So any data structures used by critical code must be one of the following:
Inaccessible from untrusted code, and locked/used correctly by trusted code
Immutable (String class)
Copied before use (valuetype parameters)
Written in trusted code and uses internal locking to guarantee a safe state
For example, critical code cannot trust a list accessible from untrusted code. If it gets passed a List, it has to create a private copy, do its precondition checks on the copy, and then operate on the copy.

I'm going to take a wild guess at what the question really means...
In Python data structures in the interpreter get corrupted because Python is using a form of reference counting.
Both C# and Java use garbage collection and in fact they do use a global lock when doing a full heap collection.
Data can be marked and moved between "generations" without a lock. But to actually clean it up everything must come to a stop. Hopefully a very short stop, but a full stop.
Here is an interesting link on CLR garbage collection as of 2007:
http://vineetgupta.spaces.live.com/blog/cns!8DE4BDC896BEE1AD!1104.entry

Interlocked and Memory Barriers

I have a question about the following code sample (m_value isn't volatile, and every thread runs on a separate processor)
void Foo() // executed by thread #1, BEFORE Bar() is executed
{
    Interlocked.Exchange(ref m_value, 1);
}
bool Bar() // executed by thread #2, AFTER Foo() is executed
{
    return m_value == 1;
}
Does using Interlocked.Exchange in Foo() guarantee that when Bar() is executed, I'll see the value "1"? (even if the value already exists in a register or cache line?) Or do I need to place a memory barrier before reading the value of m_value?
Also (unrelated to the original question), is it legal to declare a volatile member and pass it by reference to InterlockedXX methods? (the compiler warns about passing volatiles by reference, so should I ignore the warning in such case?)
Please Note, I'm not looking for "better ways to do things", so please don't post answers that suggest completely alternate ways to do things ("use a lock instead" etc.), this question comes out of pure interest..

Memory barriers don't particularly help you. They specify an ordering between memory operations, in this case each thread only has one memory operation so it doesn't matter. One typical scenario is writing non-atomically to fields in a structure, a memory barrier, then publishing the address of the structure to other threads. The Barrier guarantees that the writes to the structures members are seen by all CPUs before they get the address of it.
What you really need are atomic operations, ie. InterlockedXXX functions, or volatile variables in C#. If the read in Bar were atomic, you could guarantee that neither the compiler, nor the cpu, does any optimizations that prevent it from reading either the value before the write in Foo, or after the write in Foo depending on which gets executed first. Since you are saying that you "know" Foo's write happens before Bar's read, then Bar would always return true.
Without the read in Bar being atomic, it could be reading a partially updated value (ie. garbage), or a cached value (either from the compiler or from the CPU), both of which may prevent Bar from returning true which it should.
Most modern CPU's guarantee word aligned reads are atomic, so the real trick is that you have to tell the compiler that the read is atomic.

The usual pattern for memory barrier usage matches what you would put in the implementation of a critical section, but split into pairs for the producer and consumer. As an example your critical section implementation would typically be of the form:
while (!pShared->lock.testAndSet_Acquire()) ;
// (this loop should include all the normal critical section stuff like
// spin, waste,
// pause() instructions, and last-resort-give-up-and-blocking on a resource
// until the lock is made available.)
// Access to shared memory.
pShared->foo = 1;
v = pShared->goo;
pShared->lock.clear_Release();
The acquire memory barrier above makes sure that any loads (pShared->goo) that may have been started before the successful lock modification are tossed, to be restarted if necessary.
The release memory barrier ensures that the load from goo into the (local, say) variable v is complete before the lock word protecting the shared memory is cleared.
You have a similar pattern in the typical producer and consumer atomic flag scenario (it is difficult to tell from your sample if that is what you are doing, but it should illustrate the idea).
Suppose your producer used an atomic variable to indicate that some other state is ready to use. You'll want something like this:
pShared->goo = 14;
pShared->atomic.setBit_Release();
Without a "write" barrier here in the producer you have no guarantee that the hardware isn't going to get to the atomic store before the goo store has made it through the cpu store queues, and up through the memory hierarchy where it is visible (even if you have a mechanism that ensures the compiler orders things the way you want).
In the consumer
if ( pShared->atomic.compareAndSwap_Acquire(1,1) )
{
    v = pShared->goo;
}
Without a "read" barrier here you won't know that the hardware hasn't gone and fetched goo for you before the atomic access is complete. The atomic (ie: memory manipulated with the Interlocked functions doing stuff like lock cmpxchg), is only "atomic" with respect to itself, not other memory.
Now, the remaining thing that has to be mentioned is that the barrier constructs are highly unportable. Your compiler probably provides _acquire and _release variations for most of the atomic manipulation methods, and these are the sorts of ways you would use them. Depending on the platform you are using (ie: ia32), these may very well be exactly what you would get without the _acquire() or _release() suffixes. Platforms where this matters are ia64 (effectively dead except on HP where it's still twitching slightly), and powerpc. ia64 had .acq and .rel instruction modifiers on most load and store instructions (including the atomic ones like cmpxchg). powerpc has separate instructions for this (isync and lwsync give you the read and write barriers respectively).
Now. Having said all this. Do you really have a good reason for going down this path? Doing all this correctly can be very difficult. Be prepared for a lot of self doubt and insecurity in code reviews and make sure you have a lot of high concurrency testing with all sorts of random timing scenarios. Use a critical section unless you have a very very good reason to avoid it, and don't write that critical section yourself.

I'm not completely sure but I think the Interlocked.Exchange will use the InterlockedExchange function of the windows API that provides a full memory barrier anyway.
This function generates a full memory
barrier (or fence) to ensure that
memory operations are completed in
order.

The interlocked exchange operations guarantee a memory barrier.
The following synchronization functions use the appropriate barriers
to ensure memory ordering:
Functions that enter or leave critical sections
Functions that signal synchronization objects
Wait functions
Interlocked functions
(Source : link)
But you are out of luck with register variables. If m_value is in a register in Bar, you won't see the change to m_value. Due to this, you should declare shared variables 'volatile'.

If m_value is not marked as volatile, then there is no reason to think that the value read in Bar is fenced. Compiler optimizations, caching, or other factors could reorder the reads and writes. Interlocked exchange is only helpful when it is used in an ecosystem of properly fenced memory references. This is the whole point of marking a field volatile. The .NET memory model is not as straightforward as some might expect.

Interlocked.Exchange() should guarantee that the value is flushed to all CPUs properly - it provides its own memory barrier.
I'm surprised that the compiler is complaining about passing a volatile into Interlocked.Exchange() - the fact that you're using Interlocked.Exchange() should almost mandate a volatile variable.
The problem you might see is that if the compiler does some heavy optimization of Bar() and realizes that nothing changes the value of m_value it can optimize away your check. That's what the volatile keyword would do - it would hint to the compiler that that variable may be changed outside of the optimizer's view.

If you don't tell the compiler or runtime that m_value should not be read ahead of Bar(), it can and may cache the value of m_value ahead of Bar() and simply use the cached value. If you want to ensure that it sees the "latest" version of m_value, either shove in a Thread.MemoryBarrier() or use Thread.VolatileRead(ref m_value). The latter is less expensive than a full memory barrier.
Ideally you could shove in a ReadBarrier, but the CLR doesn't seem to support that directly.
EDIT: Another way to think about it is that there are really two kinds of memory barriers: compiler memory barriers that tell the compiler how to sequence reads and writes and CPU memory barriers that tell the CPU how to sequence reads and writes. The Interlocked functions use CPU memory barriers. Even if the compiler treated them as compiler memory barriers, it still wouldn't matter, as in this specific case, Bar() could have been separately compiled and not known of the other uses of m_value that would require a compiler memory barrier.
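Putting the two halves together, a sketch of the question's Foo/Bar pair with a fenced read might look like this. SharedFlag is a hypothetical wrapper, and Volatile.Read (available since .NET 4.5) is used here in place of the Thread.VolatileRead mentioned above; that substitution is mine.

```csharp
using System.Threading;

public sealed class SharedFlag
{
    private int m_value; // deliberately not declared volatile

    // Writer side: Interlocked.Exchange is atomic and acts as a full fence.
    public void Foo()
    {
        Interlocked.Exchange(ref m_value, 1);
    }

    // Reader side: an acquire read, so the load goes to the field itself
    // instead of reusing a value cached in a register.
    public bool Bar()
    {
        return Volatile.Read(ref m_value) == 1;
    }
}
```

This keeps the field non-volatile, which also sidesteps the compiler warning about passing a volatile field by reference.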
