Background
My colleague thinks reads in multithreaded C# are reliable and will always give you the current, fresh value of a field, but I've always used locks because I was sure I'd experienced problems otherwise.
I spent some time googling and reading articles, but I must not have been giving Google the right search terms, because I didn't find exactly what I was after.
So I wrote the below program without locks in an attempt to prove why that's bad.
Question
I'm assuming the below is a valid test; if so, the results show that the reads aren't reliable/fresh.
Can someone explain what causes this (reordering, staleness, or something else)?
And can someone link me to official Microsoft documentation explaining why this happens and what the recommended solution is?
If the below isn't a valid test, what would be?
Program
There are two threads: one calls SetA and the other calls SetB. If the reads are unreliable without locks, then intermittently Foo's field "c" will be false.
using System;
using System.Threading.Tasks;

namespace SetASetBTestAB
{
    class Program
    {
        class Foo
        {
            public bool a;
            public bool b;
            public bool c;

            public void SetA()
            {
                a = true;
                TestAB();
            }

            public void SetB()
            {
                b = true;
                TestAB();
            }

            public void TestAB()
            {
                if (a && b)
                {
                    c = true;
                }
            }
        }

        static void Main(string[] args)
        {
            int timesCWasFalse = 0;
            for (int i = 0; i < 100000; i++)
            {
                var f = new Foo();
                var t1 = Task.Run(() => f.SetA());
                var t2 = Task.Run(() => f.SetB());
                Task.WaitAll(t1, t2);
                if (!f.c)
                {
                    timesCWasFalse++;
                }
            }
            Console.WriteLine($"timesCWasFalse: {timesCWasFalse}");
            Console.WriteLine("Finished. Press Enter to exit");
            Console.ReadLine();
        }
    }
}
Output
Release mode. Intel Core i7 6700HQ:
Run 1: timesCWasFalse: 8
Run 2: timesCWasFalse: 10
Of course it is not fresh. The average CPU nowadays has three layers of cache between each core's registers and the RAM, and it can take quite some time for a write to one cache to propagate to all of them.
And then there is the JIT compiler. Part of its job is dead code detection, and one of the first things it will do is cut out "useless" variables. For example, this code tries to force an OutOfMemoryException by running into the 2 GiB limit on 32-bit systems:
using System;

namespace OOM_32_forced
{
    class Program
    {
        static void Main(string[] args)
        {
            // Each short is 2 bytes and Int32.MaxValue is 2^31 - 1,
            // so this array needs just under 2^32 bytes (about 4 GiB),
            // well over the 2 GiB limit of a 32-bit process.
            short[] Array = new short[Int32.MaxValue];

            /* We need to actually access that array.
               Otherwise the JIT compiler and its optimisations
               may just skip the array definition and creation. */
            foreach (short value in Array)
                Console.WriteLine(value);
        }
    }
}
The thing is, if you cut out the output code, there is a decent chance that the JIT will remove the variable Array, including its instantiation. In fact, the JIT has a decent chance of reducing this program to doing nothing at all at runtime.
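One lighter-weight way to give the array a real use (my suggestion, not part of the original answer) is GC.KeepAlive, which counts as a reference to the variable without printing two billion values:

using System;

class Program
{
    static void Main()
    {
        short[] array = new short[Int32.MaxValue];

        // GC.KeepAlive is a real use of `array`, so neither the variable
        // nor the allocation can be discarded as dead code.
        GC.KeepAlive(array);
    }
}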
volatile primarily prevents the JIT from doing any such optimisations on that field. And it may also have some effect on how the CPU orders its reads and writes.
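That said, volatile alone may not be enough for the question's test: a volatile write followed by a volatile read of a different field can still effectively reorder under C#'s acquire/release semantics, which is exactly the pattern SetA/SetB use. A lock sidesteps the problem. Here is a minimal sketch of that fix (my own illustration, not taken from official documentation):

class Foo
{
    private readonly object _gate = new object();
    public bool a;
    public bool b;
    public bool c;

    public void SetA()
    {
        lock (_gate) // the lock provides both atomicity and visibility
        {
            a = true;
            TestAB();
        }
    }

    public void SetB()
    {
        lock (_gate)
        {
            b = true;
            TestAB();
        }
    }

    private void TestAB() // only ever called while holding _gate
    {
        if (a && b)
        {
            c = true;
        }
    }
}

Reading f.c in Main after Task.WaitAll is safe, because WaitAll itself establishes the necessary ordering; with the lock in place, timesCWasFalse should always be 0.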
Related
I've been using high-precision time to log the events of my program to the console. But soon I noticed that the program sometimes displays the time rounded to milliseconds and sometimes not! It occurs completely sporadically; it's the SAME CODE, NOT RECOMPILED, NOT EDITED BETWEEN RUNS:
using System;
using System.Threading;

namespace DateTimePrecisionTest
{
    class Program
    {
        static DateTime ProgramStartTimeGlobal;

        static void PrintConsoleLogGlobal()
        {
            string TimeStampText = ((DateTime.Now - ProgramStartTimeGlobal).TotalMilliseconds / 1000).ToString("0.000000");
            Console.WriteLine(String.Format("Global var: [ {0,10} ] ", TimeStampText));
        }

        static void PrintConsoleLogLocal(DateTime StartTime)
        {
            string TimeStampText = ((DateTime.Now - StartTime).TotalMilliseconds / 1000).ToString("0.000000");
            Console.WriteLine(String.Format("Local var: [ {0,10} ] ", TimeStampText));
        }

        static void Main(string[] args)
        {
            ProgramStartTimeGlobal = DateTime.Now;
            for (int i = 0; i < 20; i++)
            {
                PrintConsoleLogGlobal();
                PrintConsoleLogLocal(ProgramStartTimeGlobal);
                Thread.Sleep(512);
            }
            Console.ReadLine();
        }
    }
}
At first I thought it depended on whether I was printing the global or the local variable, but that doesn't seem to be the case.
The output is (values in brackets are in seconds):
These chaotic precision changes also occur in other programs that use this logging. One such program, for instance, executes tasks on a remote server (with unpredictable delays between them):
Why??
Internally, DateTime.Now (via UtcNow) depends on the Windows API GetSystemTimeAsFileTime. Unfortunately, it would appear (see community comments at bottom of that page) that the resolution of this clock can vary based on the activity of other programs on your system.
The timeBeginPeriod function is documented as:
This function affects a global Windows setting. Windows uses the lowest value (that is, highest resolution) requested by any process.
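If what you actually need is stable, high-resolution elapsed time rather than wall-clock time, one common workaround (a sketch of my own, with my own names like StopwatchLogDemo; not from the quoted documentation) is System.Diagnostics.Stopwatch, which uses the high-performance counter when the hardware provides one and is not tied to the system-wide timer resolution described above:

using System;
using System.Diagnostics;
using System.Threading;

class StopwatchLogDemo
{
    static readonly Stopwatch ProgramClock = Stopwatch.StartNew();

    static void PrintConsoleLog()
    {
        // Stopwatch.Elapsed has sub-millisecond resolution whenever
        // Stopwatch.IsHighResolution is true, independent of the system clock.
        Console.WriteLine("[ {0,10:0.000000} ]", ProgramClock.Elapsed.TotalSeconds);
    }

    static void Main()
    {
        for (int i = 0; i < 20; i++)
        {
            PrintConsoleLog();
            Thread.Sleep(512);
        }
    }
}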
I'm kind of new to the parallel programming classes in .NET 4.0. I was trying a simple for loop where I would normally get the numbers 0 to 99 printed sequentially with the usual for loop, but with Parallel.For I'm getting inconsistent output in a random, jumbled-up order.
Code:
using System;
using System.Threading.Tasks;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            Parallel.For(0, 100, i =>
            {
                //object sync = new object();
                //lock (sync)
                {
                    Console.WriteLine("Writing" + i);
                }
            });
            Console.Read();
        }
    }
}
One output on the console:
Writing0
Writing1
Writing2
Writing3
Writing4
Writing5
Writing6
Writing7
Writing8
Writing9
Writing10
Writing11
Writing12
Writing13
Writing14
Writing15
Writing16
Writing17
Writing18
Writing19
Writing20
Writing21
Writing22
Writing23
Writing24
Writing25
Writing26
Writing27
Writing28
Writing29
Writing30
Writing31
Writing32
Writing33
Writing34
Writing35
Writing36
Writing37
Writing38
Writing39
Writing40
Writing41
Writing42
Writing43
Writing44
Writing45
Writing46
Writing47
Writing48
Writing49
Writing50
Writing66
Writing67
Writing68
Writing70
Writing71
Writing72
Writing73
Writing74
Writing75
Writing76
Writing77
Writing78
Writing69
Writing82
Writing83
Writing84
Writing85
Writing86
Writing87
Writing88
Writing89
Writing90
Writing51
Writing52
Writing53
Writing54
Writing55
Writing91
Writing92
Writing93
Writing94
Writing95
Writing56
Writing57
Writing79
Writing80
Writing81
Writing58
Writing59
Writing96
Writing97
Writing98
Writing99
Writing60
Writing61
Writing62
Writing63
Writing64
Writing65
Thanks in advance for whatever help you can give me.
That is parallel computing. The tasks are queued up and each available processor gets one; when it is done, the next queued task is issued to it. There is no guarantee about the order in which the tasks are delivered to the processing units, nor about which one will be finished next, so parallelizable code is not the same as sequential code plus the keyword parallel. The algorithms need to be developed to run in parallel. In your simple example all the numbers from 0 to 99 are written, but the parallel loop does not write them in the natural order you'd expect.
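If you do need the output in the natural order, one common pattern (a sketch of my own, not part of the answer above; names like OrderedParallelDemo are mine) is to let the work run in parallel but have each iteration write only into its own slot, then print sequentially afterwards:

using System;
using System.Threading.Tasks;

class OrderedParallelDemo
{
    static void Main()
    {
        var results = new string[100];

        // The iterations still run in any order, but each one
        // writes only to its own slot in the array.
        Parallel.For(0, 100, i =>
        {
            results[i] = "Writing" + i;
        });

        // A sequential pass afterwards restores the natural order.
        foreach (var line in results)
            Console.WriteLine(line);
    }
}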
This is to be expected: the way you are doing it, the work is partitioned across the current thread as well as a number of others pulled from the thread pool.
If you wanted to do the same thing on a different thread but keep the writing in order, then you could try:
using System;
using System.Threading.Tasks;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            var t = Task.Factory.StartNew(() =>
            {
                for (var i = 0; i < 100; i++)
                {
                    //object sync = new object();
                    //lock (sync)
                    {
                        Console.WriteLine("Writing" + i);
                    }
                }
            });
            Console.Read();
        }
    }
}
In this example https://stackoverflow.com/a/9980346/93647 and here Why is my disruptor example so slow? (at the end of the question) there is one publisher which publishes items and one consumer.
But in my case the consumer's work is much more complicated and takes some time, so I want 4 consumers that process data in parallel.
So for example, if the producer produces the numbers 1,2,3,4,5,6,7,8,9,10,11..
I want consumer1 to catch 1,5,9,..., consumer2 to catch 2,6,10,..., consumer3 to catch 3,7,11,... and consumer4 to catch 4,8,12... (well, not exactly these numbers; the idea is that the data should be processed in parallel, and I don't care which particular number is processed by which consumer).
And remember, this needs to be done in parallel, because in the real application the consumers' work is pretty expensive. I expect the consumers to run on different threads to use the power of multicore systems.
Of course I can just create 4 ring buffers and attach 1 consumer to each ring buffer. That way I can use the original example. But I feel it wouldn't be correct. Likely it would be correct to create 1 publisher (1 ring buffer) and 4 consumers - as this is what I need.
Adding a link to a very similar question in Google Groups: https://groups.google.com/forum/#!msg/lmax-disruptor/-CLapWuwWLU/GHEP4UkxrAEJ
So we have two options:
one ring, many consumers (each consumer will "wake up" on every addition, and all consumers should have the same WaitStrategy)
many "one ring - one consumer" pairs (each consumer wakes up only on the data it should process, and each consumer can have its own WaitStrategy).
EDIT: I forgot to mention the code is partially taken from the FAQ. I have no idea if this approach is better or worse than Frank's suggestion.
The project is severely under-documented; that's a shame, as it looks nice.
Anyway, try the following snippet (based on your first link) - tested on Mono and it seems to be OK:
using System;
using System.Threading.Tasks;
using Disruptor;
using Disruptor.Dsl;

namespace DisruptorTest
{
    public sealed class ValueEntry
    {
        public long Value { get; set; }
    }

    public class MyHandler : IEventHandler<ValueEntry>
    {
        private static int _consumers = 0;
        private readonly int _ordinal;

        public MyHandler()
        {
            this._ordinal = _consumers++;
        }

        public void OnNext(ValueEntry data, long sequence, bool endOfBatch)
        {
            if ((sequence % _consumers) == _ordinal)
                Console.WriteLine("Event handled: Value = {0}, event {1} processed by {2}", data.Value, sequence, _ordinal);
            else
                Console.WriteLine("Event {0} rejected by {1}", sequence, _ordinal);
        }
    }

    class Program
    {
        private static readonly Random _random = new Random();
        private const int SIZE = 16; // Must be a power of 2
        private const int WORKERS = 4;

        static void Main()
        {
            var disruptor = new Disruptor.Dsl.Disruptor<ValueEntry>(() => new ValueEntry(), SIZE, TaskScheduler.Default);
            for (int i = 0; i < WORKERS; i++)
                disruptor.HandleEventsWith(new MyHandler());
            var ringBuffer = disruptor.Start();
            while (true)
            {
                long sequenceNo = ringBuffer.Next();
                ringBuffer[sequenceNo].Value = _random.Next();
                ringBuffer.Publish(sequenceNo);
                Console.WriteLine("Published entry {0}, value {1}", sequenceNo, ringBuffer[sequenceNo].Value);
                Console.ReadKey();
            }
        }
    }
}
From the specs of the ring buffer you will see that every consumer will try to process your ValueEntry. In your case you don't need that.
I solved it like this:
Add a processed field to your ValueEntry, and when a consumer takes an event it tests that field; if the event is already processed, it moves on to the next one.
Not the prettiest way, but that's how the buffer works.
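A rough sketch of that claim-flag idea (my own illustration: the Claimed field and ClaimingHandler are hypothetical names, and the publisher would have to reset the flag before a slot is reused):

using System.Threading;

public sealed class ValueEntry
{
    public long Value { get; set; }
    public int Claimed; // 0 = unclaimed, 1 = claimed
}

public class ClaimingHandler : IEventHandler<ValueEntry>
{
    public void OnNext(ValueEntry data, long sequence, bool endOfBatch)
    {
        // Only the first handler to flip the flag processes this event;
        // the others skip it and wait for the next sequence.
        if (Interlocked.CompareExchange(ref data.Claimed, 1, 0) == 0)
        {
            // expensive per-event work goes here
        }
    }
}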
I need some advice on how to do the following in either C# or VB.NET.
In C++, in my header file I do the following:
#define StartButtonPressed Input[0]==1 // Input is an array declared in .cpp file
In my .cpp file, I have code something like this:
if(StartButtonPressed)
// do something
The reason for doing this is to make my code easier to read.
I tried the same thing in C# but it gave an error. How can I do the same thing in C# and VB.NET?
Please advise. Thanks.
There is no good reason to use a macro for this in C++; you could just as easily make it a function and the code would be far cleaner:
bool IsStartButtonPressed()
{
    return Input[0] == 1;
}
Input should also probably be passed as an argument to the function, but it's hard to tell exactly where that is coming from.
You're best off creating a property in your class
protected bool StartButtonPressed {
    get { return Input[0] == 1; }
}
then your code can be as before
.
.
.
if(StartButtonPressed) {
.
.
.
}
However, for consistency with the .NET Framework, I'd suggest calling the property IsStartButtonPressed.
If you need it to be evaluated at the point of the if statement then you really need a function or a property. However, if this is a one-time evaluation you can use a field:
bool isStartButtonPressed = Input[0] == 1;
If you want many classes to have this functionality then I'd recommend a static function in another class, something like:
public static class ButtonChecker {
    public static bool IsPressed(int[] input) {
        return input[0] == 1;
    }
}
Then you call it anywhere with
if(ButtonChecker.IsPressed(Input)) {
.
.
}
But ultimately you cannot use macros the way you're used to in C/C++. And you shouldn't be worried about the performance of properties and functions like this: the CLR JIT compiler handles them very well.
Here is an example program:
using System;

namespace ConsoleApplication1 {
    public static class ButtonChecker {
        public static bool IsPressed(int[] input) {
            return input[0] == 1;
        }
    }

    static class Program {
        public static void Main() {
            int[] Input = new int[6] { 1, 0, 2, 3, 4, 1 };
            for (int i = 0; i < Input.Length; ++i) {
                Console.WriteLine("{0} Is Pressed = {1}", i, ButtonChecker.IsPressed(Input));
            }
            Console.ReadKey();
        }
    }
}
You could use an enum
public enum buttonCode
{
    startButton = 0,
    stopButton = 1
    // more button definitions
}
Then maybe one function
public bool IsButtonPressed(buttonCode b)
{
    return Input[(int)b] == 1;
}
Then your calls look like:
if (IsButtonPressed(buttonCode.startButton)) { }
The only changes needed to switch button codes are then in the enum, not spread across multiple functions.
Edited to Add:
If you want individually named functions, you could do this:
public bool IsStartButtonPressed()
{
    return Input[(int)buttonCode.startButton] == 1;
}
Still, all of the edits would be in the enum, not the functions.
Bjarne Stroustrup wrote:
The first rule about macros is: Do not use them if you do not have to. Almost every macro demonstrates a flaw in the programming language, in the program, or in the programmer.
It's worth noting two things here before saying anything else. The first is that "macro" can mean a very different thing in some other languages; one would not make the same statement about Lisp. The second is that Stroustrup is willing to take his share of the blame in saying that one reason for using macros is "a flaw in the programming language", so it's not like he's just being superior in condemning their use.
This case though isn't a flaw in the programming language, except that the language lets you do it in the first place (but has to, to allow other macros). The only purpose of this macro is to make the code harder to read. Just get rid of it. Replace it with some actual C# code like:
private bool StartButtonPressed
{
    get
    {
        return Input[0] == 1;
    }
}
Edit:
Seeing the comment above about wanting to be faster to code, I would do something like:
private enum Buttons
{
    Start = 0,
    Stop = 1,
    Pause = 2,
    /* ... */
}

private bool IsPressed(Buttons button)
{
    return Input[(int)button] == 1;
}
And then call e.g. IsPressed(Buttons.Start). Then I'd fix the C++ to use the same approach too (in C++ I would even be able to leave out the Buttons. where I wanted particularly great concision).
I was seeing some strange behavior in a multi threading application which I wrote and which was not scaling well across multiple cores.
The following code illustrates the behavior I am seeing. It appears that heap-intensive operations do not scale across multiple cores; rather, they seem to slow down, i.e. using a single thread would be faster.
using System;
using System.Threading;

class Program
{
    public static Data _threadOneData = new Data();
    public static Data _threadTwoData = new Data();
    public static Data _threadThreeData = new Data();
    public static Data _threadFourData = new Data();

    static void Main(string[] args)
    {
        // Do heap intensive tests
        var start = DateTime.Now;
        RunOneThread(WorkerUsingHeap);
        var finish = DateTime.Now;
        var timeLapse = finish - start;
        Console.WriteLine("One thread using heap: " + timeLapse);

        start = DateTime.Now;
        RunFourThreads(WorkerUsingHeap);
        finish = DateTime.Now;
        timeLapse = finish - start;
        Console.WriteLine("Four threads using heap: " + timeLapse);

        // Do stack intensive tests
        start = DateTime.Now;
        RunOneThread(WorkerUsingStack);
        finish = DateTime.Now;
        timeLapse = finish - start;
        Console.WriteLine("One thread using stack: " + timeLapse);

        start = DateTime.Now;
        RunFourThreads(WorkerUsingStack);
        finish = DateTime.Now;
        timeLapse = finish - start;
        Console.WriteLine("Four threads using stack: " + timeLapse);

        Console.ReadLine();
    }

    public static void RunOneThread(ParameterizedThreadStart worker)
    {
        var threadOne = new Thread(worker);
        threadOne.Start(_threadOneData);
        threadOne.Join();
    }

    public static void RunFourThreads(ParameterizedThreadStart worker)
    {
        var threadOne = new Thread(worker);
        threadOne.Start(_threadOneData);
        var threadTwo = new Thread(worker);
        threadTwo.Start(_threadTwoData);
        var threadThree = new Thread(worker);
        threadThree.Start(_threadThreeData);
        var threadFour = new Thread(worker);
        threadFour.Start(_threadFourData);
        threadOne.Join();
        threadTwo.Join();
        threadThree.Join();
        threadFour.Join();
    }

    static void WorkerUsingHeap(object state)
    {
        var data = state as Data;
        for (int count = 0; count < 100000000; count++)
        {
            var property = data.Property;
            data.Property = property + 1;
        }
    }

    static void WorkerUsingStack(object state)
    {
        var data = state as Data;
        double dataOnStack = data.Property;
        for (int count = 0; count < 100000000; count++)
        {
            dataOnStack++;
        }
        data.Property = dataOnStack;
    }

    public class Data
    {
        public double Property
        {
            get;
            set;
        }
    }
}
This code was run on a Core 2 Quad (4 core system) with the following results:
One thread using heap: 00:00:01.8125000
Four threads using heap: 00:00:17.7500000
One thread using stack: 00:00:00.3437500
Four threads using stack: 00:00:00.3750000
So using the heap with four threads did 4 times the work but took almost 10 times as long. This means it would be twice as fast in this case to use only one thread??????
Using the stack was much more as expected.
I would like to know what is going on here. Can the heap only be written to from one thread at a time?
The answer is simple - run outside of Visual Studio...
I just copied your entire program, and ran it on my quad core system.
Inside VS (Release Build):
One thread using heap: 00:00:03.2206779
Four threads using heap: 00:00:23.1476850
One thread using stack: 00:00:00.3779622
Four threads using stack: 00:00:00.5219478
Outside VS (Release Build):
One thread using heap: 00:00:00.3899610
Four threads using heap: 00:00:00.4689531
One thread using stack: 00:00:00.1359864
Four threads using stack: 00:00:00.1409859
Note the difference. The extra time in the build outside VS is pretty much all due to the overhead of starting the threads. Your work in this case is too small to really test, and you're not using the high performance counters, so it's not a perfect test.
Main rule of thumb - always do performance testing outside VS, i.e. use Ctrl+F5 instead of F5 to run.
Aside from the debug-vs-release effects, there is something more you should be aware of.
You cannot effectively evaluate multi-threaded code for performance in 0.3s.
The point of threads is two-fold: effectively model parallel work in code, and effectively exploit parallel resources (cpus, cores).
You are trying to evaluate the latter. Given that thread start overhead is not vanishingly small in comparison to the interval over which you are timing, your measurement is immediately suspect. In most perf test trials, a significant warm-up interval is appropriate. This may sound silly to you - it's a computer program after all, not a lawnmower. But warm-up is absolutely imperative if you are really going to evaluate multi-thread performance. Caches get filled, pipelines fill up, pools get filled, GC generations get filled. The steady-state, continuous performance is what you would like to evaluate. For purposes of this exercise, the program behaves like a lawnmower.
You could say - Well, no, I don't want to evaluate the steady state performance. And if that is the case, then I would say that your scenario is very specialized. Most app scenarios, whether their designers explicitly realize it or not, need continuous, steady performance.
If you truly need the perf to be good only over a single 0.3s interval, you have found your answer. But be careful to not generalize the results.
If you want general results, you need to have reasonably long warm up intervals, and longer collection intervals. You might start at 20s/60s for those phases, but here is the key thing: you need to vary those intervals until you find the results converging. YMMV. The valid times vary depending on the application workload and the resources dedicated to it, obviously. You may find that a measurement interval of 120s is necessary for convergence, or you may find 40s is just fine. But (a) you won't know until you measure it, and (b) you can bet 0.3s is not long enough.
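As a rough sketch of that measurement shape (my own illustration; SteadyStateTimer is a made-up name, and the 20s/60s figures above are starting points you would tune until results converge):

using System;
using System.Diagnostics;

static class SteadyStateTimer
{
    // Runs `work` untimed for the warm-up interval (filling caches, pools
    // and GC generations), then times it and reports sustained throughput.
    public static double MeasureIterationsPerSecond(Action work, TimeSpan warmUp, TimeSpan measure)
    {
        var clock = Stopwatch.StartNew();
        while (clock.Elapsed < warmUp)
            work();

        clock.Restart();
        long iterations = 0;
        while (clock.Elapsed < measure)
        {
            work();
            iterations++;
        }
        return iterations / clock.Elapsed.TotalSeconds;
    }
}

For example, SteadyStateTimer.MeasureIterationsPerSecond(work, TimeSpan.FromSeconds(20), TimeSpan.FromSeconds(60)) corresponds to the 20s/60s starting point suggested above.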
[edit]Turns out, this is a release vs. debug build issue -- not sure why it is, but it is. See comments and other answers.[/edit]
This was very interesting -- I wouldn't have guessed there'd be that much difference. (similar test machine here -- Core 2 Quad Q9300)
Here's an interesting comparison -- add a decent-sized additional element to the 'Data' class -- I changed it to this:
public class Data
{
    public double Property { get; set; }
    public byte[] Spacer = new byte[8096];
}
It's still not quite the same time, but it's very close (running it for 10x as long results in 13.1s vs. 17.6s on my machine).
If I had to guess, I'd speculate that it's related to cross-core cache coherency, at least if I'm remembering how CPU cache works. With the small version of 'Data', if a single cache line contains multiple instances of Data, the cores are having to constantly invalidate each other's caches (worst case if they're all on the same cache line). With the 'spacer' added, their memory addresses are sufficiently far enough apart that one CPU's write of a given address doesn't invalidate the caches of the other CPUs.
Another thing to note -- the 4 threads start nearly concurrently, but they don't finish at the same time -- another indication that there's cross-core issues at work here. Also, I'd guess that running on a multi-cpu machine of a different architecture would bring more interesting issues to light here.
I guess the lesson from this is that in a highly-concurrent scenario, if you're doing a bunch of work with a few small data structures, you should try to make sure they aren't all packed on top of each other in memory. Of course, there's really no way to make sure of that, but I'm guessing there are techniques (like adding spacers) that could be used to try to make it happen.
[edit]
This was too interesting -- I couldn't put it down. To test this out further, I thought I'd try varying-sized spacers, and use an integer instead of a double to keep the object without any added spacers smaller.
using System;
using System.Threading;

class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("name\t1 thread\t4 threads");
        RunTest("no spacer", WorkerUsingHeap, () => new Data());
        var values = new int[] { -1, 0, 4, 8, 12, 16, 20 };
        foreach (var sv in values)
        {
            var v = sv;
            RunTest(string.Format(v == -1 ? "null spacer" : "{0}B spacer", v), WorkerUsingHeap, () => new DataWithSpacer(v));
        }
        Console.ReadLine();
    }

    public static void RunTest(string name, ParameterizedThreadStart worker, Func<object> fo)
    {
        var start = DateTime.UtcNow;
        RunOneThread(worker, fo);
        var middle = DateTime.UtcNow;
        RunFourThreads(worker, fo);
        var end = DateTime.UtcNow;
        Console.WriteLine("{0}\t{1}\t{2}", name, middle - start, end - middle);
    }

    public static void RunOneThread(ParameterizedThreadStart worker, Func<object> fo)
    {
        var data = fo();
        var threadOne = new Thread(worker);
        threadOne.Start(data);
        threadOne.Join();
    }

    public static void RunFourThreads(ParameterizedThreadStart worker, Func<object> fo)
    {
        var data1 = fo();
        var data2 = fo();
        var data3 = fo();
        var data4 = fo();
        var threadOne = new Thread(worker);
        threadOne.Start(data1);
        var threadTwo = new Thread(worker);
        threadTwo.Start(data2);
        var threadThree = new Thread(worker);
        threadThree.Start(data3);
        var threadFour = new Thread(worker);
        threadFour.Start(data4);
        threadOne.Join();
        threadTwo.Join();
        threadThree.Join();
        threadFour.Join();
    }

    static void WorkerUsingHeap(object state)
    {
        var data = state as Data;
        for (int count = 0; count < 500000000; count++)
        {
            var property = data.Property;
            data.Property = property + 1;
        }
    }

    public class Data
    {
        public int Property { get; set; }
    }

    public class DataWithSpacer : Data
    {
        // size < 0 means no spacer array at all ("null spacer");
        // size 0 means an empty array ("0B spacer").
        public DataWithSpacer(int size) { Spacer = size < 0 ? null : new byte[size]; }
        public byte[] Spacer;
    }
}
Result:
name           1 thread          4 threads
no spacer      00:00:06.3480000  00:00:42.6260000
null spacer    00:00:06.2300000  00:00:36.4030000
0B spacer      00:00:06.1920000  00:00:19.8460000
4B spacer      00:00:06.1870000  00:00:07.4150000
8B spacer      00:00:06.3750000  00:00:07.1260000
12B spacer     00:00:06.3420000  00:00:07.6930000
16B spacer     00:00:06.2250000  00:00:07.5530000
20B spacer     00:00:06.2170000  00:00:07.3670000
No spacer = 1/6th the speed, null spacer = 1/5th the speed, 0B spacer = 1/3rd the speed, 4B spacer = full speed.
I don't know the full details of how the CLR allocates or aligns objects, so I can't speak to what these allocation patterns look like in real memory, but these definitely are some interesting results.