Does a thread's context refer to a thread's personal memory? If so, how is memory shared between multiple threads?
I'm not looking for code examples. I understand synchronization at a high level; I'm just confused about this term, and looking to gain some insight into what's actually happening behind the scenes.
The reason I thought/think each thread has some kind of private memory was because of the volatile keyword in Java and .NET, and how different threads can have different values for the same primitive if it's not used. That always implied private memory to me.
As I didn't realize the term was more general, I guess I'm asking how context-switching works in Java and C# specifically.
The reason I thought/think each thread has some kind of private memory was because of the volatile keyword in Java and .NET, and how different threads can have different values for the same primitive if it's not used. That always implied private memory to me.
OK, now we're getting to the source of your confusion. This is one of the most confusing parts about modern programming. You have to wrap your head around this contradiction:
All threads in a process share the same virtual memory address space, but
Any two threads can disagree at any time on the contents of that space
How can that be? Because
processors make local copies of memory pages for performance reasons, and only infrequently compare notes to make sure that all their copies say the same thing. If two threads are on two different processors then they can have completely inconsistent views of "the same" memory.
memory in single-threaded scenarios is typically thought of as "still" unless something causes it to change. This intuition serves you poorly in multithreaded processes. If there are multiple threads accessing memory you are best to treat all memory as constantly in a state of flux unless something is forcing it to remain still. Once you start thinking of all memory as changing all the time it becomes clear that two threads can have an inconsistent view. No two movies of the ocean during a storm are alike, even if it's the same storm.
compilers are free to make any optimization to code that would be invisible on a single threaded system. On a multi-threaded system, those optimizations can suddenly become visible, which can lead to inconsistent views of data.
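To make that last point concrete, here is a minimal sketch (the names are mine, not from the question) of a perfectly legal single-threaded optimization becoming visible with two threads. If the field below were not volatile, the JIT could read it once, cache it in a register, and spin forever:

using System;
using System.Threading;

class StopFlagDemo
{
    // Without 'volatile', the JIT is free to hoist the read of _stop out of
    // the loop -- invisible single-threaded, visible (the loop hangs) with
    // a second thread doing the write.
    private static volatile bool _stop;

    static void Main()
    {
        var worker = new Thread(() =>
        {
            while (!_stop) { }  // spins until it observes the volatile write
            Console.WriteLine("Worker saw the stop flag.");
        });
        worker.Start();

        Thread.Sleep(100);
        _stop = true;           // volatile write: the worker will eventually see it
        worker.Join();
    }
}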
If any of that is not clear, then start by reading my article explaining what "volatile" means in C#:
http://blogs.msdn.com/b/ericlippert/archive/2011/06/16/atomicity-volatility-and-immutability-are-different-part-three.aspx
And then read the section "The Need For Memory Models" in Vance's article here:
http://msdn.microsoft.com/en-us/magazine/cc163715.aspx
Now, as for the specific question as to whether a thread has its own block of memory, the answer is yes, in two ways. First, since a thread is a point of control, and since the stack is the reification of control flow, every thread has its own million-byte stack. That's why threads are so expensive. In .NET, those million bytes are actually committed to the page file every time you create a thread, so be careful about creating unnecessary threads.
Second, threads have the aptly named "thread local storage", which is a small section of memory associated with each thread that the thread can use to store interesting information. In C# you use the ThreadStatic attribute to mark a field as being local to a thread.
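For illustration, a minimal sketch of thread-local storage via ThreadStatic (the field and method names are hypothetical):

using System;
using System.Threading;

class ThreadLocalDemo
{
    [ThreadStatic]
    private static int _counter;  // each thread gets its own independent slot

    static void Main()
    {
        ThreadStart work = () =>
        {
            for (int i = 0; i < 5; i++)
                _counter++;       // touches only the current thread's copy
            Console.WriteLine($"Thread {Thread.CurrentThread.ManagedThreadId}: {_counter}");
        };

        var t1 = new Thread(work);
        var t2 = new Thread(work);
        t1.Start(); t2.Start();
        t1.Join(); t2.Join();
        // Both threads print 5: no shared state, so no races.
    }
}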
The actual makeup of a "thread context" is implementation-specific, but generally I have always understood a thread's context to refer to the current state of the thread and how it views memory at a specific time. This is what "context switching" is: saving and restoring the state of a thread (its context).
Memory is shared between the contexts; they are part of the same process.
I don't consider myself a huge expert on the topic, but this is what I have always understood that specific term to mean.
I have a singleton object that processes requests. Each request takes around one millisecond to complete, usually less. This object is not thread-safe, and it expects requests in a particular format, encapsulated in the Request class, and returns the result as a Response. This processor has another producer/consumer that sends/receives through a socket.
I implemented the producer/consumer approach to work fast:
Client prepares a RequestCommand command object that contains a TaskCompletionSource<Response> and the intended Request.
Client adds the command to the "request queue" (Queue<>) and awaits command.Completion.Task.
A different thread (an actual background Thread) pulls the command from the "request queue", processes the command.Request, generates a Response, and signals the command as done using command.Completion.SetResult(response).
Client continues working.
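For reference, a rough sketch of this pipeline (simplified: I'm showing a BlockingCollection where the actual code uses a Queue<> with its own signaling, and the Request/Response bodies are elided):

using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class RequestCommand
{
    public Request Request { get; }
    public TaskCompletionSource<Response> Completion { get; } =
        new TaskCompletionSource<Response>(TaskCreationOptions.RunContinuationsAsynchronously);
    public RequestCommand(Request request) => Request = request;
}

public sealed class Processor
{
    private readonly BlockingCollection<RequestCommand> _queue =
        new BlockingCollection<RequestCommand>();

    public Processor()
    {
        // The actual background Thread pulling commands off the queue.
        new Thread(() =>
        {
            foreach (var command in _queue.GetConsumingEnumerable())
                command.Completion.SetResult(Process(command.Request));
        }) { IsBackground = true }.Start();
    }

    public Task<Response> SendAsync(Request request)
    {
        var command = new RequestCommand(request);
        _queue.Add(command);
        return command.Completion.Task;  // the client awaits this
    }

    private Response Process(Request request) => new Response(); // the non-thread-safe work
}

public sealed class Request { }
public sealed class Response { }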
But when doing a small memory benchmark I see LOTS of these objects being created, topping the list of the most common objects in memory. Note that there is no memory leak; the GC cleans everything up nicely each time it triggers, but obviously so many objects being created this fast makes Gen 0 very big. I wonder if better memory usage may yield better performance.
I was considering converting some of these objects to structs to avoid allocations, especially now that there are some new features for working with them in C# 7.1. But I do not see a way of doing it.
Value types can be instantiated on the stack, but if they pass from thread to thread, they must be copied stackA->heap and heap->stackB, I guess. Also, when enqueuing in the queue, it goes from stack to heap.
The singleton object is truly asynchronous. There is some in-memory processing, but 90% of the time it needs to call outside and go through the internal producer/consumer.
ValueTask<> does not seem to fit here, because things are asynchronous.
TaskCompletionSource<> has a state, but it is typed as object, so a struct would be boxed.
The command also jumps from thread to thread.
Recycling objects only works for the command itself; its content cannot be recycled (the TaskCompletionSource<> and a string).
Is there any way I could leverage structs to reduce memory usage and/or improve performance? Any other options?
Value types can be instantiated on the stack, but if they pass from thread to thread, they must be copied stackA->heap and heap->stackB, I guess.
No, that's not at all true. But you have a deeper problem in your thinking here:
Immediately stop thinking of structs as living on the stack. When you make an int array with a million ints, you think those four million bytes of ints live on your one-million-byte stack? Of course not.
The truth is that stack vs heap has nothing whatsoever to do with value types. Instead of "stack and heap", start saying "short term allocation pool" and "long term allocation pool". Variables that have short lifetimes are allocated from the short term allocation pool, regardless of whether that variable contains an int or a reference to an object. Once you start thinking about variable lifetime correctly then your reasoning becomes entirely straightforward. Short-lived things live in the short term pool, obviously.
So: when you pass a struct from one thread to another, does it ever live "on the heap"? The question is nonsensical because values are not things that live on the heap. Variables are storage; variables store values.
So: Is it the case that turning classes into structs will improve performance because "those structs can live on the stack"? No, of course not. The relevant difference between reference types and value types is not where they live but how they are copied. Value types are copied by value, reference types are copied by reference, and reference copies are the fastest copies.
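A tiny sketch (with hypothetical types) of that copying difference:

struct PointStruct { public int X; }
class PointClass { public int X; }

class CopyDemo
{
    static void Main()
    {
        var s1 = new PointStruct { X = 1 };
        var s2 = s1;     // copies the whole struct: s2 is an independent value
        s2.X = 2;        // s1.X is still 1

        var c1 = new PointClass { X = 1 };
        var c2 = c1;     // copies only the reference: c1 and c2 are the same object
        c2.X = 2;        // c1.X is now 2

        System.Console.WriteLine($"{s1.X} {c1.X}");  // prints "1 2"
    }
}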
I see LOTS of these objects being created, topping the list of the most common objects in memory. Note that there is no memory leak; the GC cleans everything up nicely each time it triggers, but obviously so many objects being created this fast makes Gen 0 very big. I wonder if better memory usage may yield better performance.
OK, now we come to the sensible part of your question. This is an excellent observation and it is one which is testable with science. The first thing you should do is to use a profiler to determine what is the actual burden of gen 0 collections on the performance of your application.
It may be that this burden is not the slowest thing in your program and in fact it is irrelevant. In that case, you will now know to concentrate your efforts on the real problem, rather than chasing down memory allocation problems that aren't real problems.
Suppose you discover that gen 0 collections really are killing your performance; what can you do? Is the answer to make more things structs? That can work, but you have to be very careful:
If the structs themselves contain references, you've just pushed the problem off one level, you haven't solved it.
If the structs are larger than reference size -- and of course they almost always are -- then now you are copying them by copying the entire struct rather than copying a reference, and you've traded a GC time problem for a copy time problem. That might be a win, or a loss; use science to find out which it is.
When we were faced with this problem in Roslyn, we thought about it very carefully and did a lot of experiments. The strategy we went with was in general not to move things onto the stack. Rather, we identified how many small, short-lived objects there were active in memory at any one time, of each type -- using a profiler -- and then implemented a pooling strategy on those objects. You need a small object, you take it out of the pool. When you're done, you put it back in the pool. What happens is, you end up with O(number of objects active at any one time) in the pool, which quickly gets moved into the gen 2 heap; you then greatly lower your collection pressure on the gen 0 heap while increasing the cost of comparatively rare gen 2 collections.
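As an illustration only, a minimal sketch of such a pool; this is not the actual Roslyn implementation, and the names are hypothetical:

using System.Collections.Concurrent;

public sealed class ObjectPool<T> where T : class, new()
{
    private readonly ConcurrentBag<T> _items = new ConcurrentBag<T>();

    // Take an object from the pool, or allocate one if the pool is empty.
    public T Rent() => _items.TryTake(out var item) ? item : new T();

    // Hand an object back for reuse; the caller must reset its state first.
    public void Return(T item) => _items.Add(item);
}

The pooled instances survive collections and migrate to gen 2, so steady-state Rent/Return traffic allocates almost nothing on gen 0.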
I'm not saying that's the best choice for you. I'm saying that we had this same problem in Roslyn, and we solved it with science. You can do the same.
I searched for why .NET String is immutable and got this answer:
Instances of immutable types are inherently thread-safe; since no thread can modify them, the risk of a thread modifying one in a way that interferes with another is removed (the reference itself is a different matter).
So I want to know: how are instances of immutable types inherently thread-safe?
Why are instances of immutable types inherently thread-safe?
Because an instance of a string type can't be mutated across multiple threads. This effectively means that one thread changing the string won't result in that same string being changed in another thread, since a new string is allocated at the place the mutation is taking place.
Generally, everything becomes easier when you create an object once, and then only observe it. Once you need to modify it, a new local copy gets created.
Wikipedia:
Immutable objects can be useful in multi-threaded applications. Multiple threads can act on data represented by immutable objects without concern of the data being changed by other threads. Immutable objects are therefore considered to be more thread-safe than mutable objects.
#xanatos (and Wikipedia) point out that immutable isn't always thread-safe. We like to make that correlation because we say "any type which has persistent, non-changing state is safe across thread boundaries", but that may not always be the case. A type may be immutable from the "outside" but internally need to modify its state in a way which may not be safe when done in parallel from multiple threads, and may cause undetermined behavior. This means that although it is immutable, it is not thread-safe.
To conclude, immutable != thread-safe. But immutability does take you one step closer, when done right, towards being able to do multi-threaded work correctly.
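As a sketch of "done right": all state is assigned once in the constructor and never changes afterwards, so instances can be shared across threads freely (the type here is hypothetical):

public sealed class Point
{
    public int X { get; }   // get-only: assignable only in the constructor
    public int Y { get; }

    public Point(int x, int y) { X = x; Y = y; }

    // "Mutation" returns a new instance; the original is never touched,
    // so concurrent readers of this instance can never observe a change.
    public Point WithX(int newX) => new Point(newX, Y);
}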
The short answer:
Because you only write the data in one thread and always read it after writing in multiple threads. Because no read/write conflict is possible, it's thread-safe.
The long answer:
A string is essentially a pointer to a buffer of memory. Basically what happens is that you create a buffer, fill it with characters and then expose the pointer to the outside world.
Note that you cannot access the contents of the string before the string object itself is constructed, which enforces this ordering of 'write data', then 'expose pointer'. If you did it the other way around (I guess that's theoretically possible), problems might arise.
If another thread (let's say: CPU) reads the pointer, it is a 'new' pointer for the CPU, which therefore requires the CPU to go to the 'real' memory and then read the data. If it took the pointer contents from cache, we would have had a problem.
The last piece of the puzzle has to do with memory management: we have to know it's a 'new' pointer. In .NET we know this is the case: memory on the heap is basically never re-used until a GC occurs. The garbage collector then does a mark, sweep and compact.
Now, you might argue that the 'compact' phase reuses pointers, therefore changing the contents of the pointers. While this is true, the GC also has to stop the threads and force a full memory fence, which, in simple terms, flushes the CPU cache. After that, all memory accesses are guaranteed to go to main memory once the GC phase completes.
As you can see there is no way to read the data by not reading it directly from memory (the way it was written). Since it's immutable, the contents remain the same for all threads until it's eventually collected. As such, it's thread safe.
I've seen some discussion about immutable here, that suggests you can change an internal state. Of course, the moment you start changing things, you can potentially introduce read/write conflicts.
The definition of that I'm using here is to keep the contents constant after creation. That is: write once, read many, don't change (any) state after exposing the pointer. You get the picture.
One of the biggest problems in multi-threaded code is two threads accessing the same memory cell at the same time, with at least one of them modifying it.
If none of the threads can modify a memory cell, the problem does not exist any longer.
Because an immutable variable is not modifiable, it can be used from several threads without any further measures (for example, locks).
I don't know if the question is stupid or not; locking and the Monitor are kind of a black box to me.
But I'm dealing with a situation where I can either use the same lock object to lock everything all the time, or use an indefinite number of objects to lock at a more fine-grained level.
I know that the second way will reduce lock contention, but I may end up using 10K objects as locks, and I don't know if that has an impact or not.
Bottom line: do too many locks hurt locking, or does it have no impact?
Edit
I wrote a lib that maintains a graph of objects; the number could be very high. For now it's not thread-safe, mainly for the reason Eric stated in his comment.
I initially thought that if the user wanted to do some multi-threading then he/she would have to take care of the locking.
But now I'm wondering: if I had to make it thread-safe, what would be the best way to do it? (Note that making it thread-safe wouldn't be a short and easy ride for me, so testing both solutions is something I can't do easily.)
As the purpose is to make each object of the graph thread-safe, I could use the instance of the object as the lock when I want to access/modify its properties. I know it's the best way to reduce contention, but I don't know if it would scale as well as having only one lock for the whole graph.
I know there's a lot to consider: how many threads, and especially (I think) the chance of an object being accessed/changed by multiple threads at a time (which I estimate to be pretty low). But I can't find accurate information about locks and their overhead in such a case.
To get a clearer view of what's going on, I looked at the source code of the Monitor class and its C++ counterpart in clr/src/vm/syncblk.cpp in the Shared Source Common Language Infrastructure released by Microsoft.
To answer my own question: no, having a lot of locks doesn't hurt in any harmful way I could think of.
What I learned:
1) A lock that's already taken by the same thread is processed "almost free".
2) A lock that's taken for the first time is basically the cost of an InterlockedCompareExchange.
3) Multiple threads waiting for a lock is fairly cheap to track (a linked list is maintained, O(1) complexity).
4) A thread waiting for a lock to be released is by far the most costly case; the implementation first spin-waits to try to get out of it, but if that's not enough, a thread switch occurs, putting the thread to sleep until a mutex signals that it's time to wake up because the lock was released.
I got my answer by digging into 2): whether you always lock with the same object or with 10K different ones, it's basically the same (extra initialization is performed the first time you lock a given object, but it's not too bad). InterlockedCompareExchange doesn't care about being called on the same or different memory locations (AFAIK).
Contention is by far the most critical concern. Having many locks would reduce (drastically in my case) the chance of contention, so it can only be a good thing.
1) is also an important lesson learned: if I lock/unlock for each property change/access, I can improve performance by locking the object first, changing many properties, and then releasing the lock. This way there is only one InterlockedCompareExchange, and the lock/unlock inside the implementation of each property change/access will only increment an internal counter.
To dig deeper I would have to find more information about the implementation of InterlockedCompareExchange; I think it relies on a CPU-specific assembly instruction...
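To illustrate the coarsening in 1), a self-contained sketch (the Node type is hypothetical; it locks on the instance itself, matching my scenario above):

public sealed class Node
{
    private int _weight;

    public int Weight
    {
        get { lock (this) { return _weight; } }
        set { lock (this) { _weight = value; } }  // cheap re-acquisition if the caller already holds the lock
    }

    // One "real" InterlockedCompareExchange for the outer lock; the inner
    // lock(this) in each setter just bumps the Monitor's recursion count.
    public void Update(int w1, int w2)
    {
        lock (this)
        {
            Weight = w1;
            Weight = w2;
        }
    }
}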
Typically, performance concerns around locking are related to contention. Acquiring an uncontested lock is on the order of 10s of nanoseconds. Contention is the real performance killer. As you point out, having more locks (higher lock granularity) can improve performance by decreasing contention.
The drawback to having multiple locks is that lock management typically must be more complex. If multiple locks are required to perform an operation, there is an increased possibility of resource starvation issues like deadlock or livelock. Proper lock management, such as enforcing a lock acquisition order, can alleviate these issues.
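For instance, a minimal sketch (with hypothetical Account/Transfer names) of enforcing a lock acquisition order so the classic two-lock deadlock cannot occur:

public sealed class Account
{
    public int Id { get; }
    public decimal Balance { get; set; }
    public Account(int id, decimal balance) { Id = id; Balance = balance; }
}

public static class Bank
{
    public static void Transfer(Account from, Account to, decimal amount)
    {
        // Always acquire the lower-Id lock first. Two opposite concurrent
        // transfers then contend on the same first lock instead of each
        // holding one lock and waiting forever on the other.
        var first = from.Id < to.Id ? from : to;
        var second = first == from ? to : from;

        lock (first)
        {
            lock (second)
            {
                from.Balance -= amount;
                to.Balance += amount;
            }
        }
    }
}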
Absent more details, I would probably go with one lock, since the implementation is simpler, and monitor the performance of my application closely. Specifically, there are .NET performance counters related to lock contention which can help diagnose/detect contention-related perf issues.
As with all performance-related answers, I'd like to refer to this exceptional blog post by Eric Lippert: it depends. Have a look at his six questions; what are the answers in your case? Try what happens under your conditions.
Number of cores, contention, caching, etc. all matter, so see what happens for you in your case; it's really impossible to know beforehand.
For those not clicking on the link: run them horses!
I'm not talking about performance as in speed here, but rather as in what happens when the application has been running for a while. According to Lock (Monitor) internal implementation in .NET, the Monitor implementation is quite smart in .NET, so having internal locks for each object might seem a viable approach, since you said objects in the tens of thousands and not millions.
Bottom line: do too many locks hurt locking, or does it have no impact?
Not on its own, but it might be a reason to have a look at the architecture of your program; having a gazillion objects locked at the same time will cause overhead, though.
There are a lot of articles and discussions explaining why it is good to build thread-safe classes. It is said that if multiple threads access e.g. a field at the same time, there can only be some bad consequences. So, what is the point of keeping non thread-safe code? I'm focusing mostly on .NET, but I believe the main reasons are not language-dependent.
E.g. .NET static fields are not thread-safe. What would be the result if they were thread-safe by default? (without a need to perform "manual" locking). What are the benefits of using (actually defaulting to) non-thread-safety?
One thing that comes to my mind is performance (more of a guess, though). It's rather intuitive that, when a function or field doesn't need to be thread-safe, it shouldn't be. However, the question is: what for? Is thread-safety just an additional amount of code you always need to implement? In what scenarios can I be 100% sure that e.g. a field won't be used by two threads at once?
Writing thread-safe code:
Requires more skilled developers
Is harder and consumes more coding efforts
Is harder to test and debug
Usually has bigger performance cost
But! Thread-safe code is not always needed. If you can be sure that some piece of code will be accessed by only one thread, the list above becomes huge and unnecessary overhead. It is like renting a van to go to a neighboring city when there are two of you and not much luggage.
Thread safety comes with costs - you need to lock fields that might cause problems if accessed simultaneously.
In applications that make no use of threads, but need high performance when every CPU cycle counts, there is no reason to have thread-safe classes.
So, what is the point of keeping non thread-safe code?
Cost. Like you assumed, there usually is a penalty in performance.
Also, writing thread-safe code is more difficult and time consuming.
Thread safety is not a "yes" or "no" proposition. The meaning of "thread safety" depends upon context; does it mean "concurrent-read safe, concurrent write unsafe"? Does it mean that the application just might return stale data instead of crashing? There are many things that it can mean.
The main reason not to make a class "thread safe" is the cost. If the type won't be accessed by multiple threads, there's no advantage to putting in the work and increase the maintenance cost.
Writing thread-safe code is painfully difficult at times. For example, simple lazy loading requires two checks for '== null' and a lock. It's really easy to screw up.
[EDIT]
I didn't mean to suggest that threaded lazy loading was particularly difficult; it's the "Oh, and I didn't remember to lock that first!" moments, which come fast and hard once you think you're done with the locking, that are really the challenge.
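For reference, a minimal sketch of that lazy-load pattern: the two '== null' checks plus the lock (in modern .NET you would usually reach for Lazy<T> instead; the Cache name is hypothetical):

public sealed class Cache
{
    private static readonly object _gate = new object();
    private static volatile Cache _instance;  // volatile orders the publication of the new instance

    public static Cache Instance
    {
        get
        {
            if (_instance == null)             // first check: skip the lock once built
            {
                lock (_gate)
                {
                    if (_instance == null)     // second check: another thread may have won the race
                        _instance = new Cache();
                }
            }
            return _instance;
        }
    }

    private Cache() { }
}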
There are situations where "thread-safe" doesn't make sense. This consideration is in addition to the higher developer skill and increased time (development, testing, and runtime all take hits).
For example, List<T> is a commonly-used non-thread-safe class. If we were to create a thread-safe equivalent, how would we implement GetEnumerator? Hint: there is no good solution.
Turn this question on its head.
In the early days of programming there was no Thread-Safe code because there was no concept of threads. A program started, then proceeded step by step to the end. Events? What's that? Threads? Huh?
As hardware became more powerful, as concepts of what types of problems could be solved with software became more imaginative, and as developers grew more ambitious, the software infrastructure became more sophisticated. It also became much more top-heavy. And here we are today, with a sophisticated, powerful, and in some cases unnecessarily top-heavy software ecosystem which includes threads and "thread-safety".
I realize the question is aimed more at application developers than, say, firmware developers, but looking at the whole forest does offer insights into how that one tree evolved.
So, what is the point of keeping non thread-safe code?
By allowing for code that isn't thread safe you're leaving it up to the programmer to decide what the correct level of isolation is.
As others have mentioned this allows for complexity reduction and improved performance.
Rico Mariani wrote two articles, "Putting your synchronization at the correct level" and "Putting your synchronization at the correct level -- solution", that have a nice example of this in action.
In the article he has a method called DoWork(). In it he calls other classes' Read twice, Write twice, and then LogToSteam.
Read, Write, and LogToSteam all shared a lock and were thread-safe. This is good, except that because DoWork was also thread-safe, all the synchronizing work in each Read, Write, and LogToSteam was a complete waste of time.
This is all related to the nature of imperative programming: its side effects cause the need for this.
However, if you had a development platform where applications could be expressed as pure functions with no dependencies or side effects, then it would be possible to create applications where the threading was managed without developer intervention.
So, what is the point of keeping non thread-safe code?
The rule of thumb is to avoid locking as much as possible. The ideal code is re-entrant and thread-safe without any locking. But that would be utopia.
Coming back to reality, a good programmer tries their level best to use sectional locking as opposed to locking the entire context. An example would be locking a few lines of code at a time in various routines rather than locking everything in a function.
One also has to refactor the code to come up with a design that minimizes locking, if not gets rid of it entirely.
E.g. consider a foobar() function that gets new data on each call and uses a switch/case on the type of data to change a node in a tree. The locking can be mostly (if not completely) avoided, as each case statement would touch a different node in the tree. This may be a rather specific example, but I think it illustrates my point.
I'm confused. The answers to my previous question seem to confirm my assumptions. But as stated here, volatile is not enough to ensure atomicity in .NET. Either operations like incrementing and assignment in MSIL are not translated directly to a single native opcode, or many CPUs can simultaneously read and write to the same RAM location.
To clarify:
I want to know if writes and reads are atomic on multiple CPUs?
I understand what volatile is about. But is it enough? Do I need to use interlocked operations if I want to get the latest value written by another CPU?
Herb Sutter recently wrote an article on volatile and what it really means (how it affects ordering of memory access and atomicity) in the native C++, .NET, and Java environments. It's a pretty good read:
volatile vs. volatile
volatile in .NET does make access to the variable atomic.
The problem is, that's often not enough. What if you need to read the variable, and if it is 0 (indicating that the resource is free), set it to 1 (indicating that it's locked, and other threads should stay away from it)?
Reading the 0 is atomic. Writing the 1 is atomic. But between those two operations, anything might happen. You might read a 0, and then, before you can write the 1, another thread jumps in, reads the 0, and writes a 1.
However, volatile in .NET does guarantee atomicity of accesses to the variable. It just doesn't guarantee thread safety for operations relying on multiple accesses to it. (Disclaimer: volatile in C/C++ does not even guarantee this. Just so you know. It is much weaker, and occasionally a source of bugs because people assume it guarantees atomicity. :))
So you need to use locks as well, to group together multiple operations as one thread-safe chunk. (Or, for simple operations, the Interlocked operations in .NET may do the trick)
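For example, the read-0-then-write-1 sequence above becomes atomic with Interlocked.CompareExchange (a sketch with hypothetical names):

using System.Threading;

class SpinFlag
{
    private int _state;  // 0 = free, 1 = locked

    // Atomically: if _state is 0, set it to 1. Returns true if we got the
    // "lock". No other thread can interleave between the read and the write.
    public bool TryAcquire() =>
        Interlocked.CompareExchange(ref _state, 1, 0) == 0;

    public void Release() => Interlocked.Exchange(ref _state, 0);
}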
I might be jumping the gun here but it sounds to me as though you're confusing two issues here.
One is atomicity, which in my mind means that a single operation (that may require multiple steps) should not come in conflict with another such single operation.
The other is volatility: when is this value expected to change, and why?
Take the first. If your two-step operation requires you to read the current value, modify it, and write it back, you're most certainly going to want a lock, unless this whole operation can be translated into a single CPU instruction that can work on a single cache-line of data.
However, the second issue is: even when you're doing the locking thing, what will other threads see?
A volatile field in .NET is a field that the compiler knows can change at arbitrary times. In a single-threaded world, the change of a variable is something that happens at some point in a sequential stream of instructions, so the compiler knows when it has added code that changes it, or at least when it has called out to the outside world, which may or may not have changed it, so that once the call returns, the value might not be what it was before.
This knowledge allows the compiler to lift the value from the field into a register once, before a loop or similar block of code, and never re-read the value from the field for that particular code.
With multi-threading however, that might give you some problems. One thread might have adjusted the value, and another thread, due to optimization, won't be reading this value for some time, because it knows it hasn't changed.
So when you flag a field as volatile, you're basically telling the compiler that it shouldn't assume it has the current value of the field at any point; instead it should grab a snapshot every time it needs the value.
Locks solve multiple-step operations, volatility handles how the compiler caches the field value in a register, and together they will solve more problems.
Also note that if a field contains something that cannot be read in a single cpu-instruction, you're most likely going to want to lock read-access to it as well.
For instance, if you're on a 32-bit CPU and writing a 64-bit value, that write operation requires two steps to complete, and if another thread on another CPU manages to read the 64-bit value before step 2 has completed, it will get half of the previous value and half of the new one, nicely mixed together, which can be even worse than getting an outdated one.
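A small sketch of guarding against exactly that torn read (the names are hypothetical); Interlocked.Read and Interlocked.Exchange make 64-bit access atomic even on a 32-bit CPU:

using System.Threading;

class WideCounter
{
    private long _value;  // written in two 32-bit steps on a 32-bit CPU

    public void Write(long v) => Interlocked.Exchange(ref _value, v);  // atomic 64-bit store

    public long Read() => Interlocked.Read(ref _value);  // atomic 64-bit load, never torn
}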
Edit: To answer the comment that volatile guarantees the atomicity of the read/write operation: that's true, in a way, because the volatile keyword cannot be applied to fields that are larger than 32 bits, in effect making the field readable/writable in a single CPU instruction on both 32- and 64-bit CPUs. And yes, it will prevent the value from being kept in a register as much as possible.
So part of the comment is wrong: volatile cannot be applied to 64-bit values.
Note also that volatile has some semantics regarding reordering of reads/writes.
For relevant information, see the MSDN documentation or the C# specification, found here, section 10.5.3.
On a hardware level, multiple CPUs can never write simultaneously to the same atomic RAM location. The size of an atomic read/write operation depends on the CPU architecture, but is typically 1, 2, or 4 bytes on a 32-bit architecture. However, if you try reading the result back, there is always a chance that another CPU has made a write to the same RAM location in between. On a low level, spin-locks are typically used to synchronize access to shared memory. In a high-level language, such mechanisms may be called, e.g., critical regions.
The volatile type just makes sure the variable is written back to memory immediately when it is changed (even if the value is to be used in the same function). A compiler will usually keep a value in an internal register for as long as possible if the value is to be reused later in the same function, and it is stored back to RAM when all modifications are finished or when the function returns. Volatile types are mostly useful when writing to hardware registers, or when you want to be sure a value is stored back to RAM in, e.g., a multithreaded system.
Your question doesn't entirely make sense, because volatile specifies how the read happens, not the atomicity of multi-step processes. My car doesn't mow my lawn either, but I try not to hold that against it. :)
The problem comes in with register-based cached copies of your variables' values.
When reading a value, the CPU will first see if it's in a register (fast) before checking main memory (slower).
Volatile tells the compiler to push the value out to main memory asap, and not to trust the cached register value. It's only useful in certain cases.
If you're looking for single-opcode writes, you'll need to use Interlocked.Increment and related methods. But they're fairly limited in what they can do in a single safe instruction.
The safest and most reliable bet is to lock() (if you can't do an Interlocked.* operation).
Edit: Writes and reads are atomic if they're in a lock or an Interlocked.* statement. Volatile alone is not enough under the terms of your question.
Volatile is a compiler keyword that tells the compiler what to do. It does not necessarily translate into (essentially) bus operations that are required for atomicity. That is usually left up to the operating system.
Edit: to clarify, volatile is never enough if you want to guarantee atomicity. Or rather, it's up to the compiler to make it enough or not.