Multicore programming: the hard parts

Multicore programming: the hard parts - c#

I'm writing a book on multicore programming using .NET 4 and I'm curious to know what parts of multicore programming people have found difficult to grok or anticipate being difficult to grok?

What's a useful unit of work to parallelize, and how do I find/organize one?
All these parallelism primitives aren't helpful if you fork a piece of work that is smaller than the forking overhead; in fact, that buys you a nice slowdown instead of what you are expecting.
So one of the big problems is finding units of work that are obviously more expensive than the parallelism primitives. A key problem here is that nobody knows what anything costs to execute, including the parallelism primitives themselves. Clearly calibrating these costs would be very helpful. (As an aside, we designed, implemented, and daily use a parallel programming langauge, PARLANSE whose objective was to minimize the cost of the parallelism primitives by allowing the compiler to generate and optimize them, with the goal of making smaller bits of work "more parallelizable").
One might also consider discussion big-Oh notation and its applications. We all hope that the parallelism primitives have cost O(1). If that's the case, then if you find work with cost O(x) > O(1) then that work is a good candidate for parallelization. If your proposed work is also O(1), then whether it is effective or not depends on the constant factors and we are back to calibration as above.
There's the problem of collecting work into large enough units, if none of the pieces are large enough. Code motion, algorithm replacement, ... are all useful ideas to achieve this effect.
Lastly, there's the problem of synchnonization: when do my parallel units have to interact, what primitives should I use, and how much do those primitives cost? (More than you expect!).

I guess some of it depends on how basic or advanced the book/audience is. When you go from single-threaded to multi-threaded programming for the first time, you typically fall off a huge cliff (and many never recover, see e.g. all the muddled questions about Control.Invoke).
Anyway, to add some thoughts that are less about the programming itself, and more about the other related tasks in the software process:
Measuring: deciding what metric you are aiming to improve, measuring it correctly (it is so easy to accidentally measure the wrong thing), using the right tools, differentiating signal versus noise, interpreting the results and understanding why they are as they are.
Testing: how to write tests that tolerate unimportant non-determinism/interleavings, but still pin down correct program behavior.
Debugging: tools, strategies, when "hard to debug" implies feedback to improve your code/design and better partition mutable state, etc.
Physical versus logical thread affinity: understanding the GUI thread, understanding how e.g. an F# MailboxProcessor/agent can encapsulate mutable state and run on multiple threads but always with only a single logical thread (one program counter).
Patterns (and when they apply): fork-join, map-reduce, producer-consumer, ...
I expect that there will be a large audience for e.g. "help, I've got a single-threaded app with 12% CPU utilization, and I want to learn just enough to make it go 4x faster without much work" and a smaller audience for e.g. "my app is scaling sub-linearly as we add cores because there seems to be contention here, is there a better approach to use?", and so a bit of the challenge may be serving each of those audiences.

Since you write a whole book for multi-core programming in .Net.
I think you can also go beyond multi-core a little bit.
For example, you can use a chapter talking about parallel computing in a distributed system in .Net. Unlikely, there is no mature frameworks in .Net yet. DryadLinq is the closest. (On the other side, Hadoop and its friends in Java platform are really good.)
You can also use a chapter demonstrating some GPU computing stuff.

One thing that has tripped me up is which approach to use to solve a particular type of problem. There's agents, there's tasks, async computations, MPI for distribution - for many problems you could use multiple methods but I'm having difficulty understanding why I should use one over another.

To understand: low level memory details like the difference between acquire and release semantics of memory.
Most of the rest of the concepts and ideas (anything can interleave, race conditions, ...) are not that difficult with a little usage.
Of course the practice, especially if something is failing sometimes, is very hard as you need to work at multiple levels of abstraction to understand what is going on, so keep your design simple and as far as possible design out the need for locking etc. (e.g. using immutable data and higher level abstractions).

Its not so much theoretical details, but more the practical implementation details which trips people up.
What's the deal with immutable data structures?
All the time, people try to update a data structure from multiple threads, find it too hard, and someone chimes in "use immutable data structures!", and so our persistent coder writes this:
ImmutableSet set;
ThreadLoop1()
foreach(Customer c in dataStore1)
set = set.Add(ProcessCustomer(c));
ThreadLoop2()
foreach(Customer c in dataStore2)
set = set.Add(ProcessCustomer(c));
Coder has heard all their lives that immutable data structures can be updated without locking, but the new code doesn't work for obvious reasons.
Even if your targeting academics and experienced devs, a little primer on the basics of immutable programming idioms can't hurt.
How to partition roughly equal amounts of work between threads?
Getting this step right is hard. Sometimes you break up a single process into 10,000 steps which can be executed in parallel, but not all steps take the same amount of time. If you split the work on 4 threads, and the first 3 threads finish in 1 second, and the last thread takes 60 seconds, your multithreaded program isn't much better than the single-threaded version, right?
So how do you partition problems with roughly equal amounts of work between all threads? Lots of good heuristics on solving bin packing problems should be relevant here..
How many threads?
If your problem is nicely parallelizable, adding more threads should make it faster, right? Well not really, lots of things to consider here:
Even a single core processor, adding more threads can make a program faster because more threads gives more opportunities for the OS to schedule your thread, so it gets more execution time than the single-threaded program. But with the law of diminishing returns, adding more threads increasing context-switching, so at a certain point, even if your program has the most execution time the performance could still be worse than the single-threaded version.
So how do you spin off just enough threads to minimize execution time?
And if there are lots of other apps spinning up threads and competing for resources, how do you detect performance changes and adjust your program automagically?

I find the conceptions of synchronized data moving across worker nodes in complex patterns very hard to visualize and program.
Usually I find debugging to be a bear, also.

Related

Thread Affinity

I have a multithreaded program which consist of a C# interop layer over C++ code.
I am setting threads affinity (like in this post) and it works on part of my code, however on second part it doesn't work. Can Intel Compiler / IPP / MKL libs / inline assembly interfere with external affinity setting?
UPDATE:
I can't post code as it is whole environment with many many dlls. I set environment values: OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 IPP_NUM_THREADS=1. When it runs in single thread, it runs ok, but when I use number of C# threads and set affinity per thread (on a quad core machine), the initialization is going fine on separate cores, but during processing all threads start using the same core. Hope I am clear enough.
Thanks.

We've had this exact problem; we'd set our thread affinity to what we wanted, and the IPP/MKL functions would blow that away! The answer to your question is 'yes'.
Auto Parallelism
The issue is that, by default, the Intel libraries like to automatically execute multi-threaded versions of the routines. So, a single FFT gets computed by a number of threads setup by the library specifically for this purpose.
Intel's intent is that the programmer could get on with the job of writing a single threaded application, and the library would allow that single thread to benefit from a multicore processor by creating a number of threads for the maths work. A noble intent (your source code then need know nothing about the runtime hardware to extract the best achievable performance - handy sometimes), but a right bloody nuisance when one is doing one's own threading for one's own reasons.
Controlling the Library's Behaviour
Take a look at these Intel docs, section Support Functions / Threading Support Functions. You can either programmatically control the library's threading tendancies, or there's some environment variables you can set (like MKL_NUM_THREADS) before your program runs. Setting the number of threads was (as far as I recall) enough to stop the library doing its own thing.
Philosophical Essay Inspired By Answering Your Question (best ignored)
More or less everything Intel is doing in CPU design and software (e.g. IPP/MKL) is aimed at making it unnecessary for the programmer to Worry About Threads. You want good math performance? Use MKL. You want that for loop to go fast? Turn on Auto Parallelisation in ICC. You want to make the best use of cache? That's what Hyperthreading is for.
It's not a bad approach, and personally speaking I think that they've done a pretty good job. AMD too. Their architectures are quite good at delivering good real world performance improvements to the "Average Programmer" for the minimal investment in learning, re-writing and code development.
Irritation
However, the thing that irritates me a little bit (though I don't want to appear ungrateful!) is that whilst this approach works for the majority of programmers out there (which is where the profitable market is), it just throws more obstacles in the way of those programmers who want to spin their own parallelism. I can't blame Intel for that of course, they've done exactly the right thing; they're a market led company, they need to make things that will sell.
By offering these easy features the situation of there being too many under skilled and under trained programmers becomes more entrenched. If all programmers can get good performance without having to learn what auto parallelism is actually doing, then we'll never move on. The pool of really good programmers who actually know that stuff will remain really small.
Problem
I see this as a problem (though only a small one, I'll explain later). Computing needs to become more efficient for both economic and environmental reasons. Intel's approach allows for increased performance, and better silicon manufacturing techniques produces lower power consumption, but I always feel like it's not quite as efficient as it could be.
Example
Take the Cell processor at the heart of the PS3. It's something that I like to witter on about endlessly! However, IBM developed that with a completely different philosophy to Intel. They gave you no cache (just some fast static RAM instead to use as you saw fit), the architecture was pretty much pure NUMA, you had to do all your own parallelisation, etc. etc. The result was that if you really knew what you were doing you could get about 250GFLOPS out of the thing (I think the none-PS3 variants went to 320GLOPS), for 80Watts, all the way back in 2005.
It's taken Intel chips about another 6 or 7 years or so for a single device to get to that level of performance. That's a lot of Moores law growth. If the Cell got manufactured on Intel's latest silicon fab and was given as many transistors as Intel put into their big Xeons, it would still blow everything else away.
No Market
However, apart from PS3, Cell was a none-starter market proposition. IBM decided that it would never be a big enough seller to be worth their while. There just wasn't enough programmers out there who could really use it, and to indulge the few of us who could makes no commercial sense, which wouldn't please the shareholders.
Small Problem, Bigger Problem
I said earlier that this was only a small problem. Well, most of the world's computing isn't about high maths performance, it's become Facebook, Twitter, etc. That sort is all about I/O performance, and for that you don't need high maths performance. So in that sense the dependence on Intel Doing Everything For You so that the average programmer to get good math performance matters very little. There's just not enough maths being done to warrant a change in design philosophy.
In fact, I strongly suspect that the world will ultimately decide that you don't need a large chip at all, an ARM should do just fine. If that does come to pass then the market for Intel's very large chips with very good general purpose maths compute performance will vanish. Effectively those of use who want good maths performance are being heavily subsidised by those who want to fill enourmous data centres with Intel based hardware and put Intel PCs on every desktop.
We're simply lucky that Intel apparently has a desire to make sure that every big CPU they build is good at maths regardless of whether most of their users actually use that maths performance. I'm sure that desire has its foundations in marketing prowess and wanting the bragging rights, but those are not hard, commercially tangible artifacts that bring shareholder value.
So if those data centre guys decide that, actually, they'd rather save electricity and fill their data centres with ARMs, where does that leave Intel? ARMs are fine devices for the purpose for which they're intended, but they're not at the top of my list of Supercomputer chips. So where does that leave us?
Trend
My take on the current market trend is that 'Workstations' (PCs as we call them now) are going to start costing lots and lots of money, just like they did in the 1980s / early 90s.
I think that better supercomputers will become unaffordable because no one can spare the $10billions it would take to do the next big chip. If people stop having PCs there won't be a mass market for large all-out GPUs, so we won't even be able to use those instead. They're an exclusive thing, but super computers do play a vital role in our world and we do need them to get better. So who is going to pay for that? Not me, that's for sure.
Oops, that went on for quite a while...

Parallel code bad scalability

Recently I've been analyzing how my parallel computations actually speed up on 16-core processor. And the general formula that I concluded - the more threads you have the less speed per core you get - is embarassing me. Here are the diagrams of my cpu load and processing speed:
So, you can see that processor load increases, but speed increases much slower. I want to know why such an effect takes place and how to get the reason of unscalable behaviour.
I've made sure to use Server GC mode.
I've made sure that I'm parallelizing appropriate code as soon as code does nothing more than
Loads data from RAM (server has 96 GB of RAM, swap file shouldn't be hit)
Performs not complex calculations
Stores data in RAM
I've profiled my application carefully and found no bottlenecks - looks like each operation becomes slower as thread number grows.
I'm stuck, what's wrong with my scenario?
I use .Net 4 Task Parallel Library.

You will always get this kind of curve, it's called Amdahl's law.
The question is how soon it will level off.
You say you checked your code for bottlenecks, let's assume that's correct. Then there is still the memory bandwidth and other hardware factors.

The key to a linear scalability - in the context of where going from one to two cores doubles the throughput - is to use shared resources as little as possible. This means:
don't use hyperthreading (because the two threads share the same core resource)
tie every thread to a specific core (otherwise the OS will juggle the
threads between cores)
don't use more threads than there are cores (the OS will swap in and
out)
stay inside the core's own caches - nowadays the L1 & L2 caches
don't venture into the L3 cache or RAM unless it is absolutely
necessary
minimize/economize on critical section/synchronization usage
If you've come this far you've probably profiled and hand-tuned your code too.
Thread pools are a compromise and not suited for uncompromising, high-performance applications. Total thread control is.
Don't worry about the OS scheduler. If your application is CPU-bound with long computations that mostly does local L1 & L2 memory accesses it's a better performance bet to tie each thread to its own core. Sure the OS will come in but compared to the work being performed by your threads the OS work is negligible.
Also I should say that my threading experience is mostly from Windows NT-engine machines.
_______EDIT_______
Not all memory accesses have to do with data reads and writes (see comment above). An often overlooked memory access is that of fetching code to be executed. So my statement about staying inside the core's own caches implies making sure that ALL necessary data AND code reside in these caches. Remember also that even quite simple OO code may generate hidden calls to library routines. In this respect (the code generation department), OO and interpreted code is a lot less WYSIWYG than perhaps C (generally WYSIWYG) or, of course, assembly (totally WYSIWYG).

A general decrease in return with more threads could indicate some kind of bottle neck.
Are there ANY shared resources, like a collection or queue or something or are you using some external functions that might be dependent on some limited resource?
The sharp break at 8 threads is interesting and in my comment I asked if the CPU is a true 16 core or an 8 core with hyper threading, where each core appears as 2 cores to the OS.
If it is hyper threading, you either have so much work that the hyper threading cannot double the performance of the core, or the memory pipe to the core cannot handle twice the data through put.
Are the work performed by the threads even or are some threads doing more than others, that could also indicate resource starvation.
Since your added that threads query for data very often, that indicates a very large risk of waiting.
Is there any way to let the threads get more data each time? Like reading 10 items instead of one?

If you are doing memory intensive stuff, you could be hitting cache capacity.
You could maybe test this with mock algorithm which just processes same small bit if data over and over so it all should fit in cache.
If it indeed is cache, possible solutions could be making the threads work on same data somehow (like different parts of small data window), or just tweaking the algorithm to be more local (like in sorting, merge sort is generally slower than quick sort, but it is more cache friendly which still makes it better in some cases).

Are your threads reading and writing to items close together in memory? Then you're probably running into false sharing. If thread 1 works with data[1] and thread2 works with data[2], then even though in an ideal world we know that two consecutive reads of data[2] by thread2 will always produce the same result, in the actual world, if thread1 updates data[1] sometime between those two reads, then the CPU will mark the cache as dirty and update it. http://msdn.microsoft.com/en-us/magazine/cc872851.aspx. To solve it, make sure the data each thread is working with is adequately far away in memory from the data the other threads are working with.
That could give you a performance boost, but likely won't get you to 16x—there are lots of things going on under the hood and you'll just have to knock them out one-by-one. And really it's not that your algorithm is running at 30% speed when multithreaded; it's more that your single-threaded algorithm is running at 300% speed, enabled by all sorts of CPU and caching awesomeness that running multithreaded has a harder time taking advantage of. So there's nothing to be "embarrassed" about. But with some diligence, you can perhaps get the multithreaded version working at nearly 300% speed as well.
Also, if you're counting hyperthreaded cores as real cores, well, they're not. They only allow threads to swap really fast when one is blocked. But they'll never let you run at double speed unless your threads are getting blocked half the time anyway, in which case that already means you have opportunity for speedup.

Alternative to threads for implementing massive parallel calculation engine in c# or java?

Suppose there is a need for building a spreadsheet-like engine that needs to be ultra fast, each cell dependencies could be on parallel calculation branch. Could thread be created for each parallel branch ? Isn't thread costfull in term of memory. Easily you could think that with 1000 formulas rows or even 1 million you would have to create same number of threads is it realistic ?
If it isn't realistic is there an alternative to threads for this kind of scenario ?

In modern Java programming, you should avoid threads altogether, and instead use executors. The rest of the world calls them working queues. See Item 68 in Effective Java by Joshua Bloch.
Personally, I strongly prefer the APIs of Grand Central Dispatch. The Java version is called HawtDispatch. That API is simpler, and just works.

For CPU intensive tasks, the optimal number of threads is usually the same number of CPUs. The overhead of creating threads can be much higher than the work that thread does if you are not careful.
Its worth nothing that CPU is often not the main issue. Often memory bandwidth or cache utilisation is more of an issue, in which case having one thread efficiently written can out perform attempting to distribute work across many thread. If the work each thread does is CPU intensive, and uses relatively less memory bandwidth, having multiple threads can help.

Your best bet is Task Parallel Library or Fork/Join Framework in Java. They do use threads but optimize the number of threads and put work items on a work queue for you. They take care of a lot of low level optimization problems in really clever ways. You just use constructs like Parallel.For, etc.

The only thing besides threads that comes to mind are SIMD commands (unless you want to use special hardware which means you'd have to use a lower lvl Language). You'd have to use a external Library for these tough to gain access to the Processors/Gaphic Cards functions. Also CUDA or OpenCL might interest you.
On the other Hand you normally don't want to create that many threads as you described, you could use a thread Pool, with a fixed or dynamic amount of threads, that manages how many threads are created and executes tasks from a queue. Also there is a Fork/Join Feature in Java 7 which helps with thread management.
I'd say have a look at thread pools, with these you can balance out the overhead created from too many threads.
Since you are looking for Information this might help out a bit for threads too.

Please take also a look at Ateji PX. It's an extension to the java language for parallelization that may help you. It was a commercial product but meanwhile it has become available for free.

The Task Parallel Library can help you utilize the CPU as much as possible, and does most of the heavy lifting of thread creation for you.
If you have a very large number of (very) parallelizable computations, and you need the absolute best performance you can have, you will have too look beyond the cpu. There are alternatives that combines LINQ/TPL with the GPU such as MS. Accelerator and Brahma. See for example Utilizing the GPU with c#

How to compile C# for multiple processor machines? (With VS 2010 or csc.exe)

Greetings!
I've searched for compiler (csc.exe) options at MSDN and I found an answer here, at Stackoverflow, about compiling with multiple processors. But my problem is about compiling for multiple processors, as follows.
The university where I'm graduating has a 11 machine cluster (which has 6 quad-cores and 5 four-core bi-processed machines). It runs under linux, but I can install MONO there. And instead of compiling with multiple processors or cores, I want to compile for multiple processors machine. So:
Is there any particular detail on how to do it or the CLR on that system should handle the execution to spread it across the cores?
If there's a way to do this, how can I do it with VS2010 or with csc.exe command line compiler?
Thanks in advance and I'm sorry if this question makes no sense. I really don't know how to handle multiple cores, as I'm a mere physicist, not a computer scientist! :)

You don't need to compile any differently to take account of multiple cores. You need to write your code differently though, to use multiple threads. If you can use classes from .NET 4 in your environment (a recent version of Mono should support this) you can use the Task Parallel Library which makes this a bit easier.
Basically you don't get concurrency for free - you have to think about which bits of your code can sensibly run in parallel. You might want to read the output of the Patterns and Practices group for parallel programming. (The book is a very good starting point.)

Your supposition is correct; your question makes no sense.
It is not possible to magically parallelize arbitrary code; you need to modify the code to use multiple threads.
You can use explicitly multiple cores in C# by using the Thread or ThreadPool classes, or by using Parallel LINQ or the TPL.
There is no special compiler involved.

The CLR, by default, will not do anything special to spread the work out across multiple cores. YOU, in developing the application, are responsible for making the best use of your machine's resources. The .NET Framework does have several libraries and technologies that make multithreaded operations simple to implement: look up the Thread class, Delegate.BeginInvoke/EndInvoke, and the Task Parallel Library.

Since it is a cluster, you have to rely on some form of a message-passing parallelism, no compiler will transform your code automatically. At least, a good old MPI is supported: http://osl.iu.edu/research/mpi.net/

The answer to your question comes in the form of two seemingly-contradictory statements:
1: It already does
and
2: You can't
Modern operating systems and, thus, development environments, use threads. A thread, fundamentally, represents a single series of sequential steps (and words that don't start with "S") that the processor will execute. These threads are managed by the operating system and by the processor architecture, wherein the processor will execute some portion (or all) of a thread, save its state, then switch to another thread.
In the presence of multiple cores (whether by multi-core processors or simply multiple processors or both), it's actually possible for the computer to execute two threads at the same time, assuming that they read and write different locations in memory (threads that utilize the same resources require synchronization, which is a complex ballgame inside this one), by distributing threads across cores.
At the risk of using an overly simplistic simile, think of it this way: your code, as it stands right now, is just a very long list of steps to execute to accomplish a particular task. You've now taken this list of instructions into a room full of people (each representing a processing core), and you'd like to use each of these people as efficiently as possible. While a room full of PhD students might have the context and subject-matter knowledge to figure out how to break out your instructions into tasks for each individual person, you've taken your list to a room full of people who are excellent at following directions but entirely stupid when it comes to deduction. In this case, you need to bring a different set of instructions for each person that when all of them are executed, you end up with the same result.
Put simply, in order to have your code take advantage of multiple cores or processors, you have to break your work down into small, preferably atomic chunks of code. The specific method that you use to break up your code into multiple threads can vary; using System.Threading.ThreadPool or the more recently introduced Task Parallel Library can make some of these things easier, though there's always a tradeoff (as with everything) in either efficiency or control.
Going into much more detail than that would require looking at your actual code. You'd do better finding someone with experience writing solid, performant multithreaded code (if possible, someone with recent .NET experience doing this, as this will help make the determination about which libraries would be appropriate).

What is non-thread-safety for?

There are a lot of articles and discussions explaining why it is good to build thread-safe classes. It is said that if multiple threads access e.g. a field at the same time, there can only be some bad consequences. So, what is the point of keeping non thread-safe code? I'm focusing mostly on .NET, but I believe the main reasons are not language-dependent.
E.g. .NET static fields are not thread-safe. What would be the result if they were thread-safe by default? (without a need to perform "manual" locking). What are the benefits of using (actually defaulting to) non-thread-safety?
One thing that comes to my mind is performance (more of a guess, though). It's rather intuitive that, when a function or field doesn't need to be thread-safe, it shouldn't be. However, the question is: what for? Is thread-safety just an additional amount of code you always need to implement? In what scenarios can I be 100% sure that e.g. a field won't be used by two threads at once?

Writing thread-safe code:
Requires more skilled developers
Is harder and consumes more coding efforts
Is harder to test and debug
Usually has bigger performance cost
But! Thread-safe code is not always needed. If you can be sure that some piece of code will be accessed by only one thread the list above becomes huge and unnecessary overhead. It is like renting a van when going to neighbor city when there are two of you and not much luggage.

Thread safety comes with costs - you need to lock fields that might cause problems if accessed simultaneously.
In applications that have no use of threads, but need high performance when every cpu cycle counts, there is no reason to have safe-thread classes.

So, what is the point of keeping non thread-safe code?
Cost. Like you assumed, there usually is a penalty in performance.
Also, writing thread-safe code is more difficult and time consuming.

Thread safety is not a "yes" or "no" proposition. The meaning of "thread safety" depends upon context; does it mean "concurrent-read safe, concurrent write unsafe"? Does it mean that the application just might return stale data instead of crashing? There are many things that it can mean.
The main reason not to make a class "thread safe" is the cost. If the type won't be accessed by multiple threads, there's no advantage to putting in the work and increase the maintenance cost.

Writing threadsafe code is painfully difficult at times. For example, simple lazy loading requires two checks for '== null' and a lock. It's really easy to screw up.
[EDIT]
I didn't mean to suggest that threaded lazy loading was particularly difficult, it's the "Oh and I didn't remember to lock that first!" moments that come fast and hard once you think you're done with the locking that are really the challenge.

There are situations where "thread-safe" doesn't make sense. This consideration is in addition to the higher developer skill and increased time (development, testing, and runtime all take hits).
For example, List<T> is a commonly-used non-thread-safe class. If we were to create a thread-safe equivalent, how would we implement GetEnumerator? Hint: there is no good solution.

Turn this question on its head.
In the early days of programming there was no Thread-Safe code because there was no concept of threads. A program started, then proceeded step by step to the end. Events? What's that? Threads? Huh?
As hardware became more powerful, concepts of what types of problems could be solved with software became more imaginative and developers more ambitious, the software infrastructure became more sophisticated. It also became much more top-heavy. And here we are today, with a sophisticated, powerful, and in some cases unnecessarily top-heavy software ecosystem which includes threads and "thread-safety".
I realize the question is aimed more at application developers than, say, firmware developers, but looking at the whole forest does offer insights into how that one tree evolved.

So, what is the point of keeping non thread-safe code?
By allowing for code that isn't thread safe you're leaving it up to the programmer to decide what the correct level of isolation is.
As others have mentioned this allows for complexity reduction and improved performance.
Rico Mariani wrote two articles entitled "Putting your synchronization at the correct level" and
Putting your synchronization at the correct level -- solution that have a nice example of this in action.
In the article he has a method called DoWork(). In it he calls other classes Read twice Write twice and then LogToSteam.
Read, Write, and LogToSteam all shared a lock and were thread safe. This is good except for the fact that because DoWork was also thread safe all the synchronizing work in each Read, Write and LogToSteam was a complete waste of time.
This is all related to the nature Imperative Programming. Its side effects cause the need for this.
However if you had an development platform where applications could be expressed as pure functions where there were no dependencies or side effects then it would be possible to create applications where the threading was managed without developer intervention.

So, what is the point of keeping non thread-safe code?
The rule of thumb is to avoid locking as much as possible. The Ideal code is re-entrant and thread safe with out any locking. But that would be utopia.
Coming back to reality, a good programmer tries his level best to have a sectional locking as opposed to locking the entire context. An example would be to lock few lines of code at a time in various routines than locking everything in a function.
So Also, one has to refactor the code to come up with a design that would minimize the locking if not get rid of it in entirity.
e.g. consider a foobar() function that gets new data on each call and uses switch() case on a type of data to changes a node in a tree. The locking can be mostly avoided (if not completely) As each case statement would touch a different node in a tree. This may be a more specific example but i think it elaborates my point.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.