In this example https://stackoverflow.com/a/9980346/93647 and here Why is my disruptor example so slow? (at the end of the question) there is one publisher which publishes items and one consumer.
But in my case the consumer's work is much more complicated and takes some time, so I want four consumers that process the data in parallel.
So for example, if the producer produces the numbers 1,2,3,4,5,6,7,8,9,10,11...
I want consumer1 to catch 1,5,9,..., consumer2 to catch 2,6,10,..., consumer3 to catch 3,7,11,... and consumer4 to catch 4,8,12,... (well, not exactly these numbers; the idea is that the data should be processed in parallel, and I don't care which particular number is processed by which consumer).
And remember, this needs to be done in parallel, because in the real application the consumer's work is pretty expensive. I expect the consumers to execute on different threads, to use the power of multicore systems.
Of course I can just create four ring buffers and attach one consumer to each of them; that way I can use the original example. But I feel it wouldn't be correct. It seems correct to create one publisher (one ring buffer) and four consumers, as that is what I need.
Adding a link to a very similar question in Google Groups: https://groups.google.com/forum/#!msg/lmax-disruptor/-CLapWuwWLU/GHEP4UkxrAEJ
So we have two options:
- one ring, many consumers: each consumer will "wake up" on every addition, and all consumers must have the same WaitStrategy;
- many "one ring - one consumer" pairs: each consumer wakes up only on the data it should process, and each consumer can have its own WaitStrategy.
EDIT: I forgot to mention that the code is partially taken from the FAQ. I have no idea if this approach is better or worse than Frank's suggestion.
The project is severely under-documented, which is a shame because it looks nice.
Anyway, try the following snippet (based on your first link) - tested on Mono and it seems to be OK:
using System;
using System.Threading.Tasks;
using Disruptor;
using Disruptor.Dsl;

namespace DisruptorTest
{
    public sealed class ValueEntry
    {
        public long Value { get; set; }
    }

    public class MyHandler : IEventHandler<ValueEntry>
    {
        private static int _consumers = 0;
        private readonly int _ordinal;

        public MyHandler()
        {
            this._ordinal = _consumers++;
        }

        public void OnNext(ValueEntry data, long sequence, bool endOfBatch)
        {
            // Each handler only processes the sequences assigned to it round-robin.
            if ((sequence % _consumers) == _ordinal)
                Console.WriteLine("Event handled: Value = {0}, event {1} processed by {2}", data.Value, sequence, _ordinal);
            else
                Console.WriteLine("Event {0} rejected by {1}", sequence, _ordinal);
        }
    }

    class Program
    {
        private static readonly Random _random = new Random();
        private const int SIZE = 16; // Must be a power of 2
        private const int WORKERS = 4;

        static void Main()
        {
            var disruptor = new Disruptor.Dsl.Disruptor<ValueEntry>(() => new ValueEntry(), SIZE, TaskScheduler.Default);
            for (int i = 0; i < WORKERS; i++)
                disruptor.HandleEventsWith(new MyHandler());
            var ringBuffer = disruptor.Start();
            while (true)
            {
                long sequenceNo = ringBuffer.Next();
                ringBuffer[sequenceNo].Value = _random.Next();
                ringBuffer.Publish(sequenceNo);
                Console.WriteLine("Published entry {0}, value {1}", sequenceNo, ringBuffer[sequenceNo].Value);
                Console.ReadKey();
            }
        }
    }
}
From the specs of the ring buffer you will see that every consumer will try to process your ValueEvent. In your case you don't need that.
I solved it like this:
Add a processed field to your ValueEvent, and when a consumer takes an event it tests that field; if the event has already been processed, it moves on to the next one.
Not the prettiest way, but it's how the buffer works.
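A minimal sketch of that idea, assuming the same Disruptor.NET API as the snippet above (ClaimingHandler and TryClaim are illustrative names, not part of the library). Note that a plain bool field would let two consumers both see the event as unprocessed, so this version claims the event atomically with Interlocked.CompareExchange:

using System.Threading;
using Disruptor;

public sealed class ValueEvent
{
    public long Value { get; set; }
    private int _claimed; // 0 = unprocessed, 1 = taken by some consumer
    // NB: the publisher must reset _claimed to 0 when it reuses the slot.

    // Returns true for exactly one consumer per event.
    public bool TryClaim()
    {
        return Interlocked.CompareExchange(ref _claimed, 1, 0) == 0;
    }
}

public class ClaimingHandler : IEventHandler<ValueEvent>
{
    public void OnNext(ValueEvent data, long sequence, bool endOfBatch)
    {
        if (!data.TryClaim())
            return; // another consumer already took this event

        // ... expensive work here ...
    }
}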
Background
My colleague thinks reads in multithreaded C# are reliable and will always give you the current, fresh value of a field, but I've always used locks because I was sure I'd experienced problems otherwise.
I spent some time googling and reading articles, but I must not have been giving Google the right search terms, because I didn't find exactly what I was after.
So I wrote the program below without locks, in an attempt to demonstrate why that's bad.
Question
Assuming the below is a valid test, the results show that the reads aren't reliable/fresh.
Can someone explain what causes this (reordering, staleness, or something else)?
And can you link me to the official Microsoft documentation explaining why this happens and what the recommended solution is?
If the below isn't a valid test, what would be?
Program
There are two threads; one calls SetA and the other calls SetB. If the reads are unreliable without locks, then intermittently Foo's field c will be false.
using System;
using System.Threading.Tasks;

namespace SetASetBTestAB
{
    class Program
    {
        class Foo
        {
            public bool a;
            public bool b;
            public bool c;

            public void SetA()
            {
                a = true;
                TestAB();
            }

            public void SetB()
            {
                b = true;
                TestAB();
            }

            public void TestAB()
            {
                if (a && b)
                {
                    c = true;
                }
            }
        }

        static void Main(string[] args)
        {
            int timesCWasFalse = 0;
            for (int i = 0; i < 100000; i++)
            {
                var f = new Foo();
                var t1 = Task.Run(() => f.SetA());
                var t2 = Task.Run(() => f.SetB());
                Task.WaitAll(t1, t2);
                if (!f.c)
                {
                    timesCWasFalse++;
                }
            }
            Console.WriteLine($"timesCWasFalse: {timesCWasFalse}");
            Console.WriteLine("Finished. Press Enter to exit");
            Console.ReadLine();
        }
    }
}
Output
Release mode. Intel Core i7 6700HQ:
Run 1: timesCWasFalse: 8
Run 2: timesCWasFalse: 10
Of course it is not fresh. The average CPU nowadays has three layers of cache between each core's registers and the RAM, and it can take quite some time for a write to one cache to propagate to all of them.
And then there is the JIT compiler. Part of its job is dead-code detection, and one of the first things it will do is cut out "useless" variables. For example, this code tried to force an OOM exception by running into the 2 GiB limit on 32-bit systems:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace OOM_32_forced
{
    class Program
    {
        static void Main(string[] args)
        {
            // Each short is 2 bytes and Int32.MaxValue is 2^31 - 1,
            // so this would require almost 2^32 bytes (~4 GiB),
            // far above the 2 GiB limit of a 32-bit process.
            short[] Array = new short[Int32.MaxValue];

            /* Need to actually access that array.
               Otherwise the JIT compiler and its optimisations will just skip
               the array definition and creation. */
            foreach (short value in Array)
                Console.WriteLine(value);
        }
    }
}
The thing is that if you cut out the output part, there is a decent chance that the JIT will remove the variable Array, including its instantiation. The JIT has a decent chance of reducing this program to doing nothing at all at runtime.
volatile first of all prevents the JIT from doing such optimisations on that value, and it might even have some effect on how the CPU processes it.
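For what it's worth, in this particular test volatile alone would still not guarantee c ends up true: a volatile write followed by a volatile read can still be reordered (the classic store/load case), and that is exactly the pattern SetA and SetB use. A lock, as the question author suspected, does give the needed full fences. A minimal sketch of a locked version of Foo:

class Foo
{
    private readonly object _gate = new object();
    public bool a;
    public bool b;
    public bool c;

    public void SetA()
    {
        lock (_gate)
        {
            a = true;
            TestAB();
        }
    }

    public void SetB()
    {
        lock (_gate)
        {
            b = true;
            TestAB();
        }
    }

    private void TestAB()
    {
        // Whichever thread takes the lock second is guaranteed
        // to observe both writes, so c always becomes true.
        if (a && b)
        {
            c = true;
        }
    }
}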
How do I get the number of students in this school at any given point in time using the Rx idiom and without having to maintain state in the School class myself?
using System;
using System.Reactive.Subjects;
using System.Threading;

namespace SchoolManagementSystem
{
    public class School
    {
        private ISubject<Student> _subject = null;
        private int _maxNumberOfSeats;
        private int _numberOfStudentsAdmitted;

        public string Name { get; set; }

        public School(string name, int maxNumberOfSeats)
        {
            Name = name;
            _maxNumberOfSeats = maxNumberOfSeats;
            _numberOfStudentsAdmitted = 0;
            _subject = new ReplaySubject<Student>();
        }

        public void AdmitStudent(Student student)
        {
            try
            {
                if (student == null)
                    throw new ArgumentNullException("student");

                if (_numberOfStudentsAdmitted == _maxNumberOfSeats)
                {
                    _subject.OnCompleted();
                }

                // Obviously can't do this because this will
                // create a kind of dead lock in that it will
                // wait for the _subject to complete, but I am
                // using the same _subject to issue notifications.
                // _numberOfStudentsAdmitted = _subject.Count().Wait();

                // OR to keep track of state myself
                Interlocked.Increment(ref _numberOfStudentsAdmitted);
                _subject?.OnNext(student);
            }
            catch (Exception ex)
            {
                _subject.OnError(ex);
            }
        }

        public IObservable<Student> Students
        {
            get
            {
                return _subject;
            }
        }
    }
}
Or is this just not in keeping with the principles of components designed using Rx?
Is this something that should be the responsibility of the client (to get the count and do all side effects in the OnNext handler)? Should the observables simply act as stateless signal sources or gates, much like hardware interrupt routines that merely signal to the CPU that something of interest has happened?
In that case, we lose the criterion for the observable to signal completion. How then is it supposed to know when to complete?
You can use the Count() method on your _subject sequence. It will itself create an observable sequence where each value produced represents the latest total number of students in _subject.
You could then react to this sequence of student-count values. The Zip() operator could be useful in that regard, since it has the advantage of completing the resulting sequence when any of its inner sequences completes, which you can force with a TakeWhile.
The result looks something like this:

Observable.Zip(
    _subject.Select(student => student != null ? student : throw new ArgumentNullException("student")),
    _subject.Count().TakeWhile(studentCount => studentCount < _maxNumberOfSeats),
    (student, count) => student
);
All that would be left to do in the AdmitStudent method body would simply be to push any new student into the sequence with _subject?.OnNext(student) (like you already do), but without the extra logic. You could also modify this a bit to make sure that _subject itself also gets completed once the maximum student count is reached, but I'm not certain about your business rules, so I'll leave that for you to decide.
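Concretely, under that reading the method might shrink to something like this (a sketch; it assumes the null check stays a plain precondition rather than being routed through OnError):

public void AdmitStudent(Student student)
{
    if (student == null)
        throw new ArgumentNullException("student");

    _subject?.OnNext(student);
}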
One last thing I can recommend is to play with the extensions for Rx types and to have a look around this website, which uses them liberally.
I would like to know an alternative to toProcess.RemoveAll that runs in parallel. Today the code in my example works well, but sequentially, and I'd like it to run in parallel.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace ParallelTest
{
    using System.Threading;
    using System.Threading.Tasks;

    class Program
    {
        static void Main(string[] args)
        {
            List<VerifySomethingFromInternet> foo = new List<VerifySomethingFromInternet>();
            foo.Add(new VerifySomethingFromInternet(@"id1", true));
            foo.Add(new VerifySomethingFromInternet(@"id2", false));
            foo.Add(new VerifySomethingFromInternet(@"id3", true));
            foo.Add(new VerifySomethingFromInternet(@"id4", false));
            foo.Add(new VerifySomethingFromInternet(@"id5", true));
            foo.Add(new VerifySomethingFromInternet(@"id6", false));

            DoSomethingFromIntert bar = new DoSomethingFromIntert();
            bar.DoesWork(foo);
            Console.ReadLine();
        }
    }

    public class DoSomethingFromIntert
    {
        bool RemoveIFTrueFromInternet(VerifySomethingFromInternet vsfi)
        {
            Console.WriteLine(String.Format("Identification : {0} - Thread : {1}", vsfi.Identification, Thread.CurrentThread.ManagedThreadId));
            // Do some blocking work at internet
            return vsfi.IsRemovable;
        }

        public void DoesWork(List<VerifySomethingFromInternet> toProcess)
        {
            Console.WriteLine(String.Format("total : {0}", toProcess.Count));
            // Remove all entries for which the check returns true
            toProcess.RemoveAll(f => this.RemoveIFTrueFromInternet(f));
            Console.WriteLine(String.Format("total : {0}", toProcess.Count));
        }
    }

    public class VerifySomethingFromInternet
    {
        public VerifySomethingFromInternet(string id, bool remove)
        {
            this.Identification = id;
            this.IsRemovable = remove;
        }

        public string Identification { get; set; }
        public bool IsRemovable { get; set; }
    }
}
var newList = toProcess.AsParallel()
                       .Where(f => !this.RemoveIFTrueFromInternet(f))
                       .ToList();
toProcess = newList;
Probably this answers your question, but I'm not sure that it's really faster. Try and measure.
Note that this may change the order of the elements in the list. If you care about order, add AsOrdered after AsParallel. (Thanks to weston for the [implicit] hint).
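With AsOrdered added (it has to be applied directly to the result of AsParallel), the query from above becomes:

var newList = toProcess.AsParallel()
                       .AsOrdered() // preserve the original relative order of surviving items
                       .Where(f => !this.RemoveIFTrueFromInternet(f))
                       .ToList();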
List<T> isn't thread-safe, so there is no way to do this in parallel with this type of list.
You can use the thread-safe ConcurrentBag instead, but that one doesn't have a RemoveAll method, obviously.
You can also convert the list to an array, edit that, and turn it back into a list.
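A sketch of that array route, assuming the types from the question; each slot of the keep array is written by exactly one iteration, so no locking is needed:

// Probe every item in parallel, then rebuild the list sequentially.
var items = toProcess.ToArray();
var keep = new bool[items.Length];

Parallel.For(0, items.Length, i =>
{
    keep[i] = !this.RemoveIFTrueFromInternet(items[i]);
});

var result = new List<VerifySomethingFromInternet>();
for (int i = 0; i < items.Length; i++)
{
    if (keep[i])
        result.Add(items[i]);
}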
I tried to restructure your code a bit.
I used BlockingCollection to implement a producer/consumer scenario.
This does not remove items in parallel, but it may solve your problem by processing them in parallel; give it a try, you may love it.
using System;
using System.Collections.Concurrent;
using System.Threading;

class Program
{
    static void Main(string[] args)
    {
        DoSomethingFromIntert bar = new DoSomethingFromIntert();
        bar.Verify(@"id1", true);
        bar.Verify(@"id2", false);
        bar.Verify(@"id3", true);
        bar.Verify(@"id4", false);
        bar.Verify(@"id5", true);
        bar.Verify(@"id6", false);
        bar.Complete();
        Console.ReadLine();
    }
}

public class DoSomethingFromIntert
{
    BlockingCollection<VerifySomethingFromInternet> toProcess = new BlockingCollection<VerifySomethingFromInternet>();
    ConcurrentBag<VerifySomethingFromInternet> workinglist = new ConcurrentBag<VerifySomethingFromInternet>();

    public DoSomethingFromIntert()
    {
        // init four consumers; you may choose as many as you want
        ThreadPool.QueueUserWorkItem(DoesWork);
        ThreadPool.QueueUserWorkItem(DoesWork);
        ThreadPool.QueueUserWorkItem(DoesWork);
        ThreadPool.QueueUserWorkItem(DoesWork);
    }

    public void Verify(string param, bool flag)
    {
        // add to the processing list
        toProcess.TryAdd(new VerifySomethingFromInternet(param, flag));
    }

    public void Complete()
    {
        // mark the producer as complete and let the threads exit when finished verifying
        toProcess.CompleteAdding();
    }

    bool RemoveIFTrueFromInternet(VerifySomethingFromInternet vsfi)
    {
        Console.WriteLine(String.Format("Identification : {0} - Thread : {1}", vsfi.Identification, Thread.CurrentThread.ManagedThreadId));
        // Do some blocking work at internet
        return vsfi.IsRemovable;
    }

    private void DoesWork(object state)
    {
        Console.WriteLine(String.Format("total : {0}", toProcess.Count));
        foreach (var item in toProcess.GetConsumingEnumerable())
        {
            // do work
            if (!RemoveIFTrueFromInternet(item))
            {
                // keep the item if it survives the check
                // (ConcurrentBag exposes Add; TryAdd is only on the
                // IProducerConsumerCollection interface)
                workinglist.Add(item);
            }
            // no need to remove: GetConsumingEnumerable removes items automatically
        }
        // this line is reached only after toProcess.CompleteAdding() and when all items are consumed (verified)
        Console.WriteLine(String.Format("total : {0}", toProcess.Count));
    }
}
In short, it will start verifying the items as soon as you add them, and it will keep the successful items in a separate list.
Edit
As the foreach loop over GetConsumingEnumerable() does not end by default (it keeps waiting for the next element until CompleteAdding() is called), I added the Complete() method to the wrapper class to finish the verification loop once we have pushed all the elements.
The idea is to keep adding the verification elements to the class and let the consumer loops verify each of them in parallel; once you are done with all of the elements, call Complete() to let the consumers know that there are no more elements to be added, so they can terminate the foreach loop once the list is empty.
In your code, the removal of the elements is not the actual performance issue; the synchronous loop of the verification process is the hot spot. Removing from a list costs just a few ms, whereas the expensive part of the code is the blocking work at internet, so if we can make that parallel we can cut some of the precious time.
Be careful with the number of consumer threads you initialize; I used the thread pool, but it may still affect performance if used excessively. Decide on a number based on the machine's capability, e.g. the number of cores/processors, as in the sketch below.
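For example, the hard-coded four consumers in the constructor could be sized from the machine instead; a sketch:

public DoSomethingFromIntert()
{
    // one consumer per logical core rather than a hard-coded 4
    for (int i = 0; i < Environment.ProcessorCount; i++)
        ThreadPool.QueueUserWorkItem(DoesWork);
}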
more about BlockingCollection
I'm kind of new to the parallel programming classes in C# 4.0. I was trying a simple for loop where, with the usual for loop, I would normally get the numbers 0 to 99 printed sequentially; but with Parallel.For I'm getting inconsistent output in a random, jumbled-up order.
Code:
using System;
using System.Threading.Tasks;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            Parallel.For(0, 100, i =>
            {
                //object sync = new object();
                //lock (sync)
                {
                    Console.WriteLine("Writing" + i);
                }
            });
            Console.Read();
        }
    }
}
One output on the console:
Writing0
Writing1
Writing2
Writing3
Writing4
Writing5
Writing6
Writing7
Writing8
Writing9
Writing10
Writing11
Writing12
Writing13
Writing14
Writing15
Writing16
Writing17
Writing18
Writing19
Writing20
Writing21
Writing22
Writing23
Writing24
Writing25
Writing26
Writing27
Writing28
Writing29
Writing30
Writing31
Writing32
Writing33
Writing34
Writing35
Writing36
Writing37
Writing38
Writing39
Writing40
Writing41
Writing42
Writing43
Writing44
Writing45
Writing46
Writing47
Writing48
Writing49
Writing50
Writing66
Writing67
Writing68
Writing70
Writing71
Writing72
Writing73
Writing74
Writing75
Writing76
Writing77
Writing78
Writing69
Writing82
Writing83
Writing84
Writing85
Writing86
Writing87
Writing88
Writing89
Writing90
Writing51
Writing52
Writing53
Writing54
Writing55
Writing91
Writing92
Writing93
Writing94
Writing95
Writing56
Writing57
Writing79
Writing80
Writing81
Writing58
Writing59
Writing96
Writing97
Writing98
Writing99
Writing60
Writing61
Writing62
Writing63
Writing64
Writing65
Thanks in advance for whatever help you guys can give me.
That is parallel computing. The tasks are queued up and each available processor gets one; when it is done, the next queued task is issued to it. There is no guarantee about the order in which the tasks will be delivered to the processing units, and no guarantee which one will finish next; therefore parallelizable code is not the same as sequential code plus a "parallel" keyword. Algorithms need to be designed to run in parallel. In your simple example all numbers from 0 to 99 are written, but the parallel loop does not write them in the natural order you'd expect.
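If the output itself must stay ordered while the work runs in parallel, one common pattern (a sketch, not part of the original answer) is to compute into an index-addressed array and print afterwards:

using System;
using System.Threading.Tasks;

class Program
{
    static void Main(string[] args)
    {
        var results = new string[100];

        // The iterations still run in any order, but each one
        // writes only to its own slot, so no locking is needed.
        Parallel.For(0, results.Length, i =>
        {
            results[i] = "Writing" + i; // stand-in for real per-item work
        });

        // A sequential pass restores the natural order.
        foreach (var line in results)
            Console.WriteLine(line);

        Console.Read();
    }
}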
This is to be expected: the way you are doing it there, the work is partitioned across the current thread as well as a number of others pulled from the thread pool.
If you want to do the same thing on a different thread but keep the writing sequential, you could try:
using System;
using System.Threading.Tasks;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            var t = Task.Factory.StartNew(() =>
            {
                for (var i = 0; i < 100; i++)
                {
                    //object sync = new object();
                    //lock (sync)
                    {
                        Console.WriteLine("Writing" + i);
                    }
                }
            });
            Console.Read();
        }
    }
}
I was seeing some strange behavior in a multithreaded application I wrote, which was not scaling well across multiple cores.
The following code illustrates the behavior I am seeing. It appears that heap-intensive operations do not scale across multiple cores; rather, they seem to slow down, i.e. using a single thread would be faster.
using System;
using System.Threading;

class Program
{
    public static Data _threadOneData = new Data();
    public static Data _threadTwoData = new Data();
    public static Data _threadThreeData = new Data();
    public static Data _threadFourData = new Data();

    static void Main(string[] args)
    {
        // Do heap intensive tests
        var start = DateTime.Now;
        RunOneThread(WorkerUsingHeap);
        var finish = DateTime.Now;
        var timeLapse = finish - start;
        Console.WriteLine("One thread using heap: " + timeLapse);

        start = DateTime.Now;
        RunFourThreads(WorkerUsingHeap);
        finish = DateTime.Now;
        timeLapse = finish - start;
        Console.WriteLine("Four threads using heap: " + timeLapse);

        // Do stack intensive tests
        start = DateTime.Now;
        RunOneThread(WorkerUsingStack);
        finish = DateTime.Now;
        timeLapse = finish - start;
        Console.WriteLine("One thread using stack: " + timeLapse);

        start = DateTime.Now;
        RunFourThreads(WorkerUsingStack);
        finish = DateTime.Now;
        timeLapse = finish - start;
        Console.WriteLine("Four threads using stack: " + timeLapse);

        Console.ReadLine();
    }

    public static void RunOneThread(ParameterizedThreadStart worker)
    {
        var threadOne = new Thread(worker);
        threadOne.Start(_threadOneData);
        threadOne.Join();
    }

    public static void RunFourThreads(ParameterizedThreadStart worker)
    {
        var threadOne = new Thread(worker);
        threadOne.Start(_threadOneData);
        var threadTwo = new Thread(worker);
        threadTwo.Start(_threadTwoData);
        var threadThree = new Thread(worker);
        threadThree.Start(_threadThreeData);
        var threadFour = new Thread(worker);
        threadFour.Start(_threadFourData);
        threadOne.Join();
        threadTwo.Join();
        threadThree.Join();
        threadFour.Join();
    }

    static void WorkerUsingHeap(object state)
    {
        var data = state as Data;
        for (int count = 0; count < 100000000; count++)
        {
            // Read-modify-write through the heap object on every iteration
            var property = data.Property;
            data.Property = property + 1;
        }
    }

    static void WorkerUsingStack(object state)
    {
        var data = state as Data;
        double dataOnStack = data.Property;
        for (int count = 0; count < 100000000; count++)
        {
            dataOnStack++;
        }
        data.Property = dataOnStack;
    }

    public class Data
    {
        public double Property
        {
            get;
            set;
        }
    }
}
This code was run on a Core 2 Quad (4 core system) with the following results:
One thread using heap: 00:00:01.8125000
Four threads using heap: 00:00:17.7500000
One thread using stack: 00:00:00.3437500
Four threads using stack: 00:00:00.3750000
So using the heap with four threads did 4 times the work but took almost 10 times as long. That means it would be twice as fast in this case to use only one thread?
Using the stack behaved much more as expected.
I would like to know what is going on here. Can the heap only be written to from one thread at a time?
The answer is simple - run outside of Visual Studio...
I just copied your entire program, and ran it on my quad core system.
Inside VS (Release Build):
One thread using heap: 00:00:03.2206779
Four threads using heap: 00:00:23.1476850
One thread using stack: 00:00:00.3779622
Four threads using stack: 00:00:00.5219478
Outside VS (Release Build):
One thread using heap: 00:00:00.3899610
Four threads using heap: 00:00:00.4689531
One thread using stack: 00:00:00.1359864
Four threads using stack: 00:00:00.1409859
Note the difference. The extra time in the build outside VS is pretty much all due to the overhead of starting the threads. Your work in this case is too small to really test, and you're not using the high-performance counters, so it's not a perfect test.
Main rule of thumb: always do performance testing outside VS, i.e. use Ctrl+F5 instead of F5 to run.
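On the high-performance counter point: Stopwatch wraps it, so swapping it in for DateTime.Now is cheap. A sketch against the methods from the question (System.Diagnostics assumed imported):

// Stopwatch uses the high-resolution performance counter when available.
var sw = Stopwatch.StartNew();
RunFourThreads(WorkerUsingHeap);
sw.Stop();
Console.WriteLine("Four threads using heap: " + sw.Elapsed);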
Aside from the debug-vs-release effects, there is something more you should be aware of.
You cannot effectively evaluate multi-threaded code for performance in 0.3s.
The point of threads is two-fold: effectively model parallel work in code, and effectively exploit parallel resources (cpus, cores).
You are trying to evaluate the latter. Given that thread-start overhead is not vanishingly small in comparison to the interval over which you are timing, your measurement is immediately suspect. In most perf test trials, a significant warm-up interval is appropriate. This may sound silly to you - it's a computer program after all, not a lawnmower. But warm-up is absolutely imperative if you are really going to evaluate multi-thread performance. Caches get filled, pipelines fill up, pools get filled, GC generations get filled. The steady-state, continuous performance is what you would like to evaluate. For the purposes of this exercise, the program behaves like a lawnmower.
You could say - Well, no, I don't want to evaluate the steady state performance. And if that is the case, then I would say that your scenario is very specialized. Most app scenarios, whether their designers explicitly realize it or not, need continuous, steady performance.
If you truly need the perf to be good only over a single 0.3s interval, you have found your answer. But be careful to not generalize the results.
If you want general results, you need to have reasonably long warm up intervals, and longer collection intervals. You might start at 20s/60s for those phases, but here is the key thing: you need to vary those intervals until you find the results converging. YMMV. The valid times vary depending on the application workload and the resources dedicated to it, obviously. You may find that a measurement interval of 120s is necessary for convergence, or you may find 40s is just fine. But (a) you won't know until you measure it, and (b) you can bet 0.3s is not long enough.
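The shape of such a harness might look like this (a sketch; Measure is a hypothetical helper, and the warm-up/collect durations are the knobs to vary until the results converge):

using System;
using System.Diagnostics;

static TimeSpan Measure(Action workload, TimeSpan warmUp, TimeSpan collect)
{
    var sw = Stopwatch.StartNew();
    while (sw.Elapsed < warmUp)   // fill caches, pools, GC generations...
        workload();

    long iterations = 0;
    sw.Restart();
    while (sw.Elapsed < collect)  // ...then time only steady-state work
    {
        workload();
        iterations++;
    }
    sw.Stop();

    // Average time per workload invocation over the collection interval.
    return TimeSpan.FromTicks(sw.Elapsed.Ticks / Math.Max(1, iterations));
}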
[edit]Turns out, this is a release vs. debug build issue -- not sure why it is, but it is. See comments and other answers.[/edit]
This was very interesting -- I wouldn't have guessed there'd be that much difference. (similar test machine here -- Core 2 Quad Q9300)
Here's an interesting comparison -- add a decent-sized additional element to the 'Data' class -- I changed it to this:
public class Data
{
    public double Property { get; set; }
    public byte[] Spacer = new byte[8096];
}
It's still not quite the same time, but it's very close (running it for 10x as long results in 13.1s vs. 17.6s on my machine).
If I had to guess, I'd speculate that it's related to cross-core cache coherency, at least if I'm remembering correctly how CPU caches work. With the small version of 'Data', if a single cache line contains multiple instances of Data, the cores are having to constantly invalidate each other's caches (worst case if they're all on the same cache line). With the 'spacer' added, their memory addresses are sufficiently far apart that one CPU's write of a given address doesn't invalidate the caches of the other CPUs.
Another thing to note -- the 4 threads start nearly concurrently, but they don't finish at the same time -- another indication that there's cross-core issues at work here. Also, I'd guess that running on a multi-cpu machine of a different architecture would bring more interesting issues to light here.
I guess the lesson from this is that in a highly-concurrent scenario, if you're doing a bunch of work with a few small data structures, you should try to make sure they aren't all packed on top of each other in memory. Of course, there's really no way to make sure of that, but I'm guessing there are techniques (like adding spacers) that could be used to try to make it happen.
[edit]
This was too interesting -- I couldn't put it down. To test this out further, I thought I'd try varying-sized spacers, and use an integer instead of a double to keep the object without any added spacers smaller.
using System;
using System.Threading;

class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("name\t1 thread\t4 threads");
        RunTest("no spacer", WorkerUsingHeap, () => new Data());
        var values = new int[] { -1, 0, 4, 8, 12, 16, 20 };
        foreach (var sv in values)
        {
            var v = sv;
            RunTest(string.Format(v == -1 ? "null spacer" : "{0}B spacer", v), WorkerUsingHeap, () => new DataWithSpacer(v));
        }
        Console.ReadLine();
    }

    public static void RunTest(string name, ParameterizedThreadStart worker, Func<object> fo)
    {
        var start = DateTime.UtcNow;
        RunOneThread(worker, fo);
        var middle = DateTime.UtcNow;
        RunFourThreads(worker, fo);
        var end = DateTime.UtcNow;
        Console.WriteLine("{0}\t{1}\t{2}", name, middle - start, end - middle);
    }

    public static void RunOneThread(ParameterizedThreadStart worker, Func<object> fo)
    {
        var data = fo();
        var threadOne = new Thread(worker);
        threadOne.Start(data);
        threadOne.Join();
    }

    public static void RunFourThreads(ParameterizedThreadStart worker, Func<object> fo)
    {
        var data1 = fo();
        var data2 = fo();
        var data3 = fo();
        var data4 = fo();
        var threadOne = new Thread(worker);
        threadOne.Start(data1);
        var threadTwo = new Thread(worker);
        threadTwo.Start(data2);
        var threadThree = new Thread(worker);
        threadThree.Start(data3);
        var threadFour = new Thread(worker);
        threadFour.Start(data4);
        threadOne.Join();
        threadTwo.Join();
        threadThree.Join();
        threadFour.Join();
    }

    static void WorkerUsingHeap(object state)
    {
        var data = state as Data;
        for (int count = 0; count < 500000000; count++)
        {
            var property = data.Property;
            data.Property = property + 1;
        }
    }

    public class Data
    {
        public int Property { get; set; }
    }

    public class DataWithSpacer : Data
    {
        public DataWithSpacer(int size)
        {
            // size < 0 gives the "null spacer" case (field present, no array
            // allocated); new byte[-1] would throw at runtime
            Spacer = size < 0 ? null : new byte[size];
        }
        public byte[] Spacer;
    }
}
Result:

name         1 thread          4 threads
no spacer    00:00:06.3480000  00:00:42.6260000
null spacer  00:00:06.2300000  00:00:36.4030000
0B spacer    00:00:06.1920000  00:00:19.8460000
4B spacer    00:00:06.1870000  00:00:07.4150000
8B spacer    00:00:06.3750000  00:00:07.1260000
12B spacer   00:00:06.3420000  00:00:07.6930000
16B spacer   00:00:06.2250000  00:00:07.5530000
20B spacer   00:00:06.2170000  00:00:07.3670000

No spacer = 1/6th the speed, null spacer = 1/5th the speed, 0B spacer = 1/3rd the speed, 4B spacer = full speed.
I don't know the full details of how the CLR allocates or aligns objects, so I can't speak to what these allocation patterns look like in real memory, but these definitely are some interesting results.
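For completeness, one technique along the spacer lines is to pad each hot field out to a full cache line with an explicit layout. This is a sketch only: it assumes a 64-byte line, and the CLR still makes no guarantee about where instances land relative to each other, so it is best-effort rather than a contract:

using System.Runtime.InteropServices;

// Each counter occupies its own 64-byte block, so two counters
// used by different threads cannot share a cache line when stored
// contiguously (e.g. in an array).
[StructLayout(LayoutKind.Explicit, Size = 64)]
public struct PaddedCounter
{
    [FieldOffset(0)]
    public long Value;
}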