I am trying to deep clone a list of 100 multi-property objects, using the code below to perform the deep clone. The list creation and the list cloning happen in a loop, so each time around the loop the list changes its contents but remains fixed at 100 objects.
The problem is that each time around the loop, cloning the list takes what seems to be exponentially longer than the previous iteration.
public static void Main()
{
List<Chromosome<Gene>> population = Population.randomHalfAndHalf(rand, data, 100, opset);
for (int i = 0; i < numOfGenerations; i++)
{
offSpring = epoch.runEpoch(rand, population, SelectionMethod.Tournament, opset);
clone = DeepClone<List<Chromosome<Gene>>>(population);
clone.AddRange(offSpring);
population = fixPopulation(rand, clone, SelectionMethod.Tournament, 100);
}
//....REST OF CODE....
}
public static T DeepClone<T>(T obj)
{
using (var ms = new MemoryStream())
{
var formatter = new BinaryFormatter();
formatter.Serialize(ms, obj);
ms.Position = 0;
return (T)formatter.Deserialize(ms);
}
}
Some of you may be wondering why I am even cloning the object when I could just overwrite the original population. That is a valid point and one I have explored, but when I do that, the loop executes perfectly for about 8 iterations the first time I run it, then it idles and does nothing, so I stop it. The next time I execute it, it goes to 9 iterations, then stops and idles again, and so on each time I run it. If anyone has any ideas as to why that is happening, please share, as I really don't get it.
But my main problem is that cloning the object takes notably longer each time around the above loop: first by a few seconds, eventually up to 5 minutes and more.
Anybody have any ideas as to why this is happening?
EDIT: I have profiled the application while it was running. Over 90% of the work is being done by BinaryFormatter.Deserialize(memoryStream). Here is fixPopulation; it is doing nothing overly complex that would contribute to this problem.
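Given that the profiler blames BinaryFormatter.Deserialize, one likely culprit (an assumption, since the Chromosome internals aren't shown) is that the object graph reachable from the list grows each generation, e.g. if each chromosome keeps a reference to its parents, the formatter walks the entire ancestry on every clone. A quick way to check is to count the distinct objects reachable before each clone; the `Node` type with a `Parent` field below is a hypothetical stand-in for Chromosome<Gene>:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for Chromosome<Gene>: each child keeps a reference
// to its parent. If the real type does anything similar, the graph reachable
// from the population grows every generation even though the list itself
// stays at 100 entries -- and BinaryFormatter serializes the whole reachable
// graph, not just the list.
class Node
{
    public Node Parent;
}

class Program
{
    // Count distinct objects reachable through Parent links.
    public static int CountReachable(List<Node> roots)
    {
        var seen = new HashSet<Node>();
        foreach (var root in roots)
            for (var n = root; n != null && seen.Add(n); n = n.Parent) { }
        return seen.Count;
    }

    static void Main()
    {
        var population = new List<Node>();
        for (int i = 0; i < 100; i++) population.Add(new Node());

        for (int gen = 0; gen < 5; gen++)
        {
            Console.WriteLine("gen " + gen + ": reachable = " + CountReachable(population));
            // New generation: 100 children, each pointing at a parent.
            var next = new List<Node>();
            for (int i = 0; i < 100; i++) next.Add(new Node { Parent = population[i] });
            population = next;
        }
    }
}
```

If the reachable count climbs like this while the list stays at 100 entries, marking the back-reference [NonSerialized] or clearing it before cloning should flatten the clone time.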
private static List<Chromosome<Gene>> fixPopulation(Random rand, List<Chromosome<Gene>> oldPopulation, SelectionMethod selectionMethod, int populationSize)
{
if (selectionMethod == SelectionMethod.Tournament)
{
oldPopulation.Sort();
}
else
{
//NSGAII sorting method
}
oldPopulation.RemoveRange(populationSize, (oldPopulation.Count - populationSize));
for (int i = 0, n = populationSize / 2; i < n; i++)
{
int c1 = rand.Next(populationSize);
int c2 = rand.Next(populationSize);
// swap two chromosomes
Chromosome<Gene> temp = oldPopulation[c1];
oldPopulation[c1] = oldPopulation[c2];
oldPopulation[c2] = temp;
}
return oldPopulation;
}
You can use binary serialization to create a fast clone of your objects. Look at this:
public Entity Copy()
{
using (var memoryStream = new System.IO.MemoryStream())
{
var bFormatter = new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();
bFormatter.Serialize(memoryStream, this);
memoryStream.Seek(0, System.IO.SeekOrigin.Begin);
return (Entity)bFormatter.Deserialize(memoryStream);
}
}
Really easy, and it works!
I have a deserialization task that happens when the game starts. I basically need to pull some images from the persistent path and create a bunch of assets from them. The images can be large (10-50 MB) and there can be lots of them, so this can freeze my frame on a single task forever. I tried using Coroutines, but I might misunderstand how to use them properly.
Since Coroutines are really single-threaded, they are not exactly going to let me finish creating these assets while the UI is running. I also can't just create a new thread to do this work and jump back onto the main thread with a callback when done, because Unity won't let me access its API from another thread (I am creating Texture2D, Button(), parenting objects, etc.).
How the hell do I go about this? Do I really need to create a massive IEnumerator function and put a bunch of yield return null every other line of code? That seems a little excessive. Is there a way to call a time-consuming method that requires access to the main thread in Unity, and have Unity spread it across as many frames as needed so that it doesn't bog down the UI?
Here's an example of a Deserialize method:
public IEnumerator Deserialize()
{
// (Konrad) Deserialize Images
var dataPath = Path.Combine(Application.persistentDataPath, "Images");
if (File.Exists(Path.Combine(dataPath, "images.json")))
{
try
{
var images = JsonConvert.DeserializeObject<Dictionary<string, Item>>(File.ReadAllText(Path.Combine(dataPath, "images.json")));
if (images != null)
{
foreach (var i in images)
{
if (!File.Exists(Path.Combine(dataPath, i.Value.Name))) continue;
var bytes = File.ReadAllBytes(Path.Combine(dataPath, i.Value.Name));
var texture = new Texture2D(2, 2);
if (bytes.Length <= 0) continue;
if (!texture.LoadImage(bytes)) continue;
i.Value.Texture = texture;
}
}
Images = images;
}
catch (Exception e)
{
Debug.Log("Failed to deserialize Images: " + e.Message);
}
}
// (Konrad) Deserialize Projects.
if (Projects == null) Projects = new List<Project>();
if (File.Exists(Path.Combine(dataPath, "projects.json")))
{
try
{
var projects = JsonConvert.DeserializeObject<List<Project>>(File.ReadAllText(Path.Combine(dataPath, "projects.json")));
if (projects != null)
{
foreach (var p in projects)
{
AddProject(p);
foreach (var f in p.Folders)
{
AddFolder(f, true);
foreach (var i in f.Items)
{
var image = Images != null && Images.ContainsKey(i.ParentImageId)
? Images[i.ParentImageId]
: null;
if (image == null) continue;
i.ThumbnailTexture = image.Texture;
// (Konrad) Call methods that would normally be called by the event system
// as content is getting downloaded.
AddItemThumbnail(i, true); // creates new button
UpdateImageDescription(i, image); // sets button description
AddItemContent(i, image); // sets item Material
}
}
}
}
}
catch (Exception e)
{
Debug.Log("Failed to deserialize Projects: " + e.Message);
}
}
if (Images == null) Images = new Dictionary<string, Item>();
yield return true;
}
So this takes something like 10 s to complete. It needs to deserialize images from the drive, create button assets, set a bunch of parenting relationships, etc. I would appreciate any ideas.
PS: I haven't updated to the experimental .NET 4.6, so I am still on .NET 3.5.
OK, reading your comments below, I figured I would give this a try. I put the IO operations on a different thread; they don't need the Unity API, so I can finish them, store the byte[], and load the bytes into the Texture when done. Here's a try:
public IEnumerator Deserialize()
{
var dataPath = Path.Combine(Application.persistentDataPath, "Images");
var bytes = new Dictionary<Item, byte[]>();
var done = false;
new Thread(() => {
if (File.Exists(Path.Combine(dataPath, "images.json")))
{
var items = JsonConvert.DeserializeObject<Dictionary<string, Item>>(File.ReadAllText(Path.Combine(dataPath, "images.json"))).Values;
foreach (var i in items)
{
if (!File.Exists(Path.Combine(dataPath, i.Name))) continue;
var b = File.ReadAllBytes(Path.Combine(dataPath, i.Name));
if (b.Length <= 0) continue;
bytes.Add(i, b);
}
}
done = true;
}).Start();
while (!done)
{
yield return null;
}
var result = new Dictionary<string, Item>();
foreach (var b in bytes)
{
var texture = new Texture2D(2, 2);
if (!texture.LoadImage(b.Value)) continue;
b.Key.Texture = texture;
result.Add(b.Key.Id, b.Key);
}
Debug.Log("Finished loading images!");
Images = result;
// (Konrad) Deserialize Projects.
if (Projects == null) Projects = new List<Project>();
if (File.Exists(Path.Combine(dataPath, "projects.json")))
{
var projects = JsonConvert.DeserializeObject<List<Project>>(File.ReadAllText(Path.Combine(dataPath, "projects.json")));
if (projects != null)
{
foreach (var p in projects)
{
AddProject(p);
foreach (var f in p.Folders)
{
AddFolder(f, true);
foreach (var i in f.Items)
{
var image = Images != null && Images.ContainsKey(i.ParentImageId)
? Images[i.ParentImageId]
: null;
if (image == null) continue;
i.ThumbnailTexture = image.Texture;
// (Konrad) Call methods that would normally be called by the event system
// as content is getting downloaded.
AddItemThumbnail(i, true); // creates new button
UpdateImageDescription(i, image); // sets button description
AddItemContent(i, image); // sets item Material
}
}
}
}
}
if (Images == null) Images = new Dictionary<string, Item>();
yield return true;
}
I have to concede that it helps a little, but it's still not great. Looking at the profiler, I am getting a pretty big stall right out of the gate, and it's my Deserialize routine that is causing it. Any way to work around this?
There are two main ways to spread work across multiple frames:
multithreading and
coroutines
Multithreading has the limitation you pointed out, so a coroutine seems appropriate.
The key thing to remember with coroutines is that the next frame cannot begin until a yield statement runs. The flip side is that the number of yields you can hit per second is capped by your frame rate, so if you yield too early and too often, the work takes far too much real time to finish.
What you want is a frequent opportunity for the function to yield, but without the opportunity always being taken. The best way to do this is to use the Stopwatch class (be sure to use the fully qualified name or add a using directive at the top of your file) or something similar.
Here is an example modification of your second code snippet.
public IEnumerator Deserialize()
{
var dataPath = Path.Combine(Application.persistentDataPath, "Images");
var bytes = new Dictionary<Item, byte[]>();
var done = false;
new Thread(() => {
if (File.Exists(Path.Combine(dataPath, "images.json")))
{
var items = JsonConvert.DeserializeObject<Dictionary<string, Item>>(File.ReadAllText(Path.Combine(dataPath, "images.json"))).Values;
foreach (var i in items)
{
if (!File.Exists(Path.Combine(dataPath, i.Name))) continue;
var b = File.ReadAllBytes(Path.Combine(dataPath, i.Name));
if (b.Length <= 0) continue;
bytes.Add(i, b);
}
}
done = true;
}).Start();
while (!done)
{
yield return null;
}
// MOD: added stopwatch and started
System.Diagnostics.Stopwatch watch = new System.Diagnostics.Stopwatch();
int MAX_MILLIS = 5; // tweak this to prevent frame rate reduction
watch.Start();
var result = new Dictionary<string, Item>();
foreach (var b in bytes)
{
// MOD: Check if enough time has passed since last yield
if (watch.ElapsedMilliseconds > MAX_MILLIS)
{
watch.Reset();
yield return null;
watch.Start();
}
var texture = new Texture2D(2, 2);
if (!texture.LoadImage(b.Value)) continue;
b.Key.Texture = texture;
result.Add(b.Key.Id, b.Key);
}
Debug.Log("Finished loading images!");
Images = result;
// (Konrad) Deserialize Projects.
if (Projects == null) Projects = new List<Project>();
if (File.Exists(Path.Combine(dataPath, "projects.json")))
{
var projects = JsonConvert.DeserializeObject<List<Project>>(File.ReadAllText(Path.Combine(dataPath, "projects.json")));
if (projects != null)
{
foreach (var p in projects)
{
AddProject(p);
foreach (var f in p.Folders)
{
AddFolder(f, true);
foreach (var i in f.Items)
{
// MOD: check if enough time has passed since the last yield
if (watch.ElapsedMilliseconds > MAX_MILLIS)
{
watch.Reset();
yield return null;
watch.Start();
}
var image = Images != null && Images.ContainsKey(i.ParentImageId)
? Images[i.ParentImageId]
: null;
if (image == null) continue;
i.ThumbnailTexture = image.Texture;
// (Konrad) Call methods that would normally be called by the event system
// as content is getting downloaded.
AddItemThumbnail(i, true); // creates new button
UpdateImageDescription(i, image); // sets button description
AddItemContent(i, image); // sets item Material
}
}
}
}
}
if (Images == null) Images = new Dictionary<string, Item>();
yield return true;
}
Edit: Further notes for those wanting more general advice...
The two main systems are multithreading and coroutines. Their pros and cons are:
Coroutine Advantages:
Little setup.
No data-sharing or locking concerns.
Can perform any unity main-thread operation.
Multithreading Advantages:
Doesn't take time away from the main thread, leaving you as much CPU power as possible
Can utilize a full CPU core rather than whatever is left over from the main thread.
To sum up, coroutines are best for quick-and-dirty solutions or when modifications to Unity objects need to be made. However, if large amounts of processing need to be performed, it's best to offload as much as possible to another thread. Very few devices have fewer than two cores these days (it's probably safe to say none that are used to play games).
In this case, a hybrid solution was possible: offloading some work to a separate thread and keeping the Unity-dependent work on the main thread. This is a powerful solution, and coroutines can make it easy.
As an example, I made a voxel engine which offloaded running the algorithm onto a separate thread and then created the actual meshes on the main thread, allowing for a 50-70% reduction in how long it took to generate meshes and, perhaps more importantly, reducing the impact on the game's end performance. It did this with queues of jobs that were passed back and forth between the threads.
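The hybrid pattern described above can be sketched outside Unity as a plain producer/consumer handoff. Everything here is illustrative: the payloads are ints standing in for file reads, the Sleep stands in for waiting a frame, and Queue plus lock keeps it .NET 3.5-friendly:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Minimal sketch of the thread + main-thread hybrid: a worker thread does
// the heavy, Unity-free part and enqueues results; the main thread drains
// the queue in small slices, as a coroutine would between frames.
class Program
{
    static readonly Queue<int> results = new Queue<int>();
    static readonly object gate = new object();
    static volatile bool done;

    public static List<int> Run()
    {
        var worker = new Thread(() =>
        {
            for (int i = 0; i < 10; i++)
            {
                int payload = i * i;              // stands in for File.ReadAllBytes etc.
                lock (gate) results.Enqueue(payload);
            }
            done = true;                          // signal the "main thread"
        });
        worker.Start();

        var consumed = new List<int>();
        while (!done)                             // a coroutine would yield return null here
        {
            Drain(consumed);                      // main-thread-only work (Texture2D etc.)
            Thread.Sleep(1);                      // stands in for waiting a frame
        }
        worker.Join();
        Drain(consumed);                          // pick up anything left after 'done'
        return consumed;
    }

    static void Drain(List<int> into)
    {
        lock (gate)
            while (results.Count > 0) into.Add(results.Dequeue());
    }

    static void Main()
    {
        Console.WriteLine("consumed " + Run().Count + " items");
    }
}
```

In Unity, the while loop would live inside a coroutine with yield return null in place of the Sleep, and Drain would do the Texture2D/Button work in small time slices as in the Stopwatch example above.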
I have inherited a WCF web service application that needs much better error tracking. What we do is query data from one system (AcuODBC) and send that data to another system (Salesforce). This query returns tens of thousands of complex objects as a List<T>. We then process this List<T> in batches of 200 records at a time, mapping the fields to another object type, then sending that batch to Salesforce. After this is completed, the next batch starts. Here's a brief example:
int intStart = 0, intEnd = 200;
//done in a loop, snipped for brevity
var leases = from i in trleases.GetAllLeases(branch).Skip(intStart).Take(intEnd)
select new sforceObject.SFDC_Lease() {
LeaseNumber = i.LeaseNumber.ToString(),
AccountNumber = i.LeaseCustomer,
Branch = i.Branch
(...)//about 150 properties
};
//do stuff with list and increment to next batch
intStart += 200;
However, the problem is if one object has a bad field mapping (Invalid Cast Exception), I would like to print out the object that failed to a log.
Question
Is there any way I can decipher which object of the 200 threw the exception? I could forgo the batch concept that was given to me, but I'd rather avoid that if possible for performance reasons.
This should accomplish what you are looking for with very minor code changes:
int intStart = 0, intEnd = 200, count = 0;
List<sforceObject.SFDC_Lease> leases = new List<sforceObject.SFDC_Lease>();
//done in a loop, snipped for brevity
foreach(var i in trleases.GetAllLeases(branch).Skip(intStart).Take(intEnd)) {
try {
count++;
leases.Add(new sforceObject.SFDC_Lease() {
LeaseNumber = i.LeaseNumber.ToString(),
AccountNumber = i.LeaseCustomer,
Branch = i.Branch
(...)//about 150 properties
});
} catch (Exception ex) {
// you now have you culprit either as 'i' or from the index 'count'
}
}
//do stuff with 'leases' and increment to next batch
intStart += 200;
I think you could set a flag in each property setter of the SFDC_Lease class, recorded in a static property, like:
public class SFDC_Lease
{
public static string LastPropertySet;
private string leaseNumber;
public string LeaseNumber
{
get { return leaseNumber; }
set
{
LastPropertySet = "LeaseNumber";
leaseNumber = value;
}
}
}
(Note the backing field: assigning LeaseNumber to itself inside its own setter would recurse forever.) Please feel free to improve this design.
Any idea why referencing the Position property of a stream drastically increases IO time?
The execution time of:
sw.Restart();
fs = new FileStream("tmp", FileMode.Open);
var br = new BinaryReader(fs);
for (int i = 0; i < n; i++)
{
fs.Position+=0; //Should be a NOOP
a[i] = br.ReadDouble();
}
Debug.Write("\n");
Debug.Write(sw.ElapsedMilliseconds.ToString());
Debug.Write("\n");
fs.Close();
sw.Stop();
Debug.Write(a[0].ToString() + "\n");
Debug.Write(a[n - 1].ToString() + "\n");
is ~100 times slower than the equivalent loop without the fs.Position += 0; line.
Normally the intention in using Seek (or manipulating the Position property) is to speed things up when you don't need all the data in the file.
But if, e.g., you only need every second value in the file, it is apparently much faster to read the entire file and discard the data you don't need than to skip every second value by moving Stream.Position.
You're doing two things:
Fetching the position
Setting it
It's entirely possible that each of those performs a direct interaction with the underlying Win32 APIs, whereas normally you can read a fair amount of data without having to interop with native code at all, due to buffering.
I'm slightly surprised at the extent to which it's worse, but I'm not surprised that it is worse. I think it would be worth doing separate tests to find out which has more effect, the get or the set: write similar code which only reads the position or only writes it. That needn't affect the code you write later, but it may satisfy your curiosity a bit further.
You might also try
fs.Seek(0, SeekOrigin.Current);
... which is more likely to be ignored as a genuine no-op. But even so, skipping a single byte using fs.Seek(1, SeekOrigin.Current) may well be expensive again.
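One way to run such side-by-side tests is a small harness like this sketch (file size and iteration count are arbitrary; there is no warm-up or repetition, so treat the numbers as indicative only):

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Rough benchmark sketch: compare plain sequential reads, reads with a
// redundant Position set, and reads with Seek(0, Current).
class Program
{
    const int N = 100000;

    // Time reading N doubles, invoking `beforeEachRead` before each read.
    public static long Time(string path, Action<FileStream> beforeEachRead)
    {
        var sw = Stopwatch.StartNew();
        double last = 0;
        using (var fs = new FileStream(path, FileMode.Open))
        using (var br = new BinaryReader(fs))
        {
            for (int i = 0; i < N; i++)
            {
                beforeEachRead(fs);
                last = br.ReadDouble();
            }
        }
        sw.Stop();
        if (last != N - 1) throw new InvalidOperationException("bad read");
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        // Write N doubles to a scratch file.
        string path = Path.GetTempFileName();
        using (var bw = new BinaryWriter(File.Create(path)))
            for (int i = 0; i < N; i++) bw.Write((double)i);

        Console.WriteLine("plain:       " + Time(path, fs => { }) + " ms");
        Console.WriteLine("Position+=0: " + Time(path, fs => fs.Position += 0) + " ms");
        Console.WriteLine("Seek(0,Cur): " + Time(path, fs => fs.Seek(0, SeekOrigin.Current)) + " ms");
        File.Delete(path);
    }
}
```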
From the reflector:
public override long Position {
[SecuritySafeCritical]
get {
if (this._handle.IsClosed) {
__Error.FileNotOpen();
}
if (!this.CanSeek) {
__Error.SeekNotSupported();
}
if (this._exposedHandle) {
this.VerifyOSHandlePosition();
}
return this._pos + (long)(this._readPos - this._readLen + this._writePos);
}
set {
if (value < 0L) {
throw new ArgumentOutOfRangeException("value", Environment.GetResourceString("ArgumentOutOfRange_NeedNonNegNum"));
}
if (this._writePos > 0) {
this.FlushWrite(false);
}
this._readPos = 0;
this._readLen = 0;
this.Seek(value, SeekOrigin.Begin);
}
}
(Self-explanatory.) Essentially, every set causes a flush, and there is no check whether you're setting the same value the position already has. If you're not happy with FileStream, make your own stream proxy that would handle position updates more gracefully :)
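A minimal sketch of that proxy idea: a Stream wrapper whose Position setter returns early when the value isn't actually changing, so redundant sets never reach FileStream's flush-and-seek path. The name LazySeekStream is made up for this example:

```csharp
using System;
using System.IO;

// Proxy stream: skip the underlying seek entirely when the position
// isn't actually changing; delegate everything else unchanged.
class LazySeekStream : Stream
{
    private readonly Stream inner;
    public LazySeekStream(Stream inner) { this.inner = inner; }

    public override long Position
    {
        get { return inner.Position; }
        set { if (value != inner.Position) inner.Position = value; } // redundant set = no-op
    }

    public override bool CanRead => inner.CanRead;
    public override bool CanSeek => inner.CanSeek;
    public override bool CanWrite => inner.CanWrite;
    public override long Length => inner.Length;
    public override void Flush() => inner.Flush();
    public override int Read(byte[] buffer, int offset, int count) => inner.Read(buffer, offset, count);
    public override long Seek(long offset, SeekOrigin origin) => inner.Seek(offset, origin);
    public override void SetLength(long value) => inner.SetLength(value);
    public override void Write(byte[] buffer, int offset, int count) => inner.Write(buffer, offset, count);
}

class Program
{
    static void Main()
    {
        var lazy = new LazySeekStream(new MemoryStream(new byte[] { 1, 2, 3 }));
        lazy.Position += 0;                 // now a true no-op
        Console.WriteLine(lazy.ReadByte()); // prints 1
    }
}
```

Note the setter still pays for one Position get per call, so simply deleting the redundant fs.Position += 0 line is cheaper still; the wrapper only helps when the sets come from code you can't edit.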
I have an ArrayList which is continuously updated every second. I have to use the same ArrayList in two other threads and make local copies of it. I have done all of this, but I get weird IndexOutOfRange exceptions. What I have found out so far is that I have to provide some synchronization mechanism for the ArrayList to be used across multiple threads.
This is how I am making it synchronized:
for (int i = 0; i < Globls.iterationCount; i++)
{
if (bw_Obj.CancellationPending)
{
eve.Cancel = true;
break;
}
byte[] rawData4 = DMM4.IO.Read(4 * numReadings);
TempDisplayData_DMM4.Add(rawData4);
Globls.Display_DataDMM4 = ArrayList.Synchronized(TempDisplayData_DMM4);
Globls.Write_DataDMM4 = ArrayList.Synchronized(TempDisplayData_DMM4);
}
In the other thread I do the following to make local copies:
ArrayList Local_Write_DMM4 = new ArrayList();
Local_Write_DMM4 = new ArrayList(Globls.Write_DataDMM4);
Am I synchronizing the ArrayList the right way? Also, do I need to lock while copying the ArrayList as well:
lock (Globls.Display_DataDMM4.SyncRoot){Local_Temp_Display1 = new ArrayList(Globls.Display_DataDMM4);}
or is it safe for single operations? I haven't actually run this code; I need to run it over the weekend, and I don't want to see another exception :(. Please help me with this!
As #Trickery stated, the assignment needs to be locked, since the source array Globls.Write_DataDMM4 can be modified by another thread during enumeration.
It is essential, therefore, to lock both when populating the original array and when making your copy:
for (int i = 0; i < Globls.iterationCount; i++)
{
if (bw_Obj.CancellationPending)
{
eve.Cancel = true;
break;
}
byte[] rawData4 = DMM4.IO.Read(4 * numReadings);
TempDisplayData_DMM4.Add(rawData4);
lock (Globls.Display_DataDMM4.SyncRoot)
{
Globls.Write_DataDMM4 = ArrayList.Synchronized(TempDisplayData_DMM4);
}
}
and
lock (Globls.Display_DataDMM4.SyncRoot)
{
Local_Temp_Display1 = new ArrayList(Globls.Display_DataDMM4);
}
Yes, all operations on your ArrayList need to use a lock.
EDIT: Sorry, the site won't let me add a comment to your question for some reason.
I was seeing some strange behavior in a multithreaded application which I wrote and which was not scaling well across multiple cores.
The following code illustrates the behavior I am seeing. It appears the heap-intensive operations do not scale across multiple cores; rather, they seem to slow down, i.e. using a single thread would be faster.
class Program
{
public static Data _threadOneData = new Data();
public static Data _threadTwoData = new Data();
public static Data _threadThreeData = new Data();
public static Data _threadFourData = new Data();
static void Main(string[] args)
{
// Do heap intensive tests
var start = DateTime.Now;
RunOneThread(WorkerUsingHeap);
var finish = DateTime.Now;
var timeLapse = finish - start;
Console.WriteLine("One thread using heap: " + timeLapse);
start = DateTime.Now;
RunFourThreads(WorkerUsingHeap);
finish = DateTime.Now;
timeLapse = finish - start;
Console.WriteLine("Four threads using heap: " + timeLapse);
// Do stack intensive tests
start = DateTime.Now;
RunOneThread(WorkerUsingStack);
finish = DateTime.Now;
timeLapse = finish - start;
Console.WriteLine("One thread using stack: " + timeLapse);
start = DateTime.Now;
RunFourThreads(WorkerUsingStack);
finish = DateTime.Now;
timeLapse = finish - start;
Console.WriteLine("Four threads using stack: " + timeLapse);
Console.ReadLine();
}
public static void RunOneThread(ParameterizedThreadStart worker)
{
var threadOne = new Thread(worker);
threadOne.Start(_threadOneData);
threadOne.Join();
}
public static void RunFourThreads(ParameterizedThreadStart worker)
{
var threadOne = new Thread(worker);
threadOne.Start(_threadOneData);
var threadTwo = new Thread(worker);
threadTwo.Start(_threadTwoData);
var threadThree = new Thread(worker);
threadThree.Start(_threadThreeData);
var threadFour = new Thread(worker);
threadFour.Start(_threadFourData);
threadOne.Join();
threadTwo.Join();
threadThree.Join();
threadFour.Join();
}
static void WorkerUsingHeap(object state)
{
var data = state as Data;
for (int count = 0; count < 100000000; count++)
{
var property = data.Property;
data.Property = property + 1;
}
}
static void WorkerUsingStack(object state)
{
var data = state as Data;
double dataOnStack = data.Property;
for (int count = 0; count < 100000000; count++)
{
dataOnStack++;
}
data.Property = dataOnStack;
}
public class Data
{
public double Property
{
get;
set;
}
}
}
This code was run on a Core 2 Quad (4 core system) with the following results:
One thread using heap: 00:00:01.8125000
Four threads using heap: 00:00:17.7500000
One thread using stack: 00:00:00.3437500
Four threads using stack: 00:00:00.3750000
So using the heap with four threads did 4 times the work but took almost 10 times as long. This means it would be twice as fast in this case to use only one thread?
Using the stack behaved much more as expected.
I would like to know what is going on here. Can the heap only be written to from one thread at a time?
The answer is simple - run outside of Visual Studio...
I just copied your entire program, and ran it on my quad core system.
Inside VS (Release Build):
One thread using heap: 00:00:03.2206779
Four threads using heap: 00:00:23.1476850
One thread using stack: 00:00:00.3779622
Four threads using stack: 00:00:00.5219478
Outside VS (Release Build):
One thread using heap: 00:00:00.3899610
Four threads using heap: 00:00:00.4689531
One thread using stack: 00:00:00.1359864
Four threads using stack: 00:00:00.1409859
Note the difference. The extra time in the build outside VS is pretty much all due to the overhead of starting the threads. Your work in this case is too small to really test it, and you're not using the high-performance counters, so it's not a perfect test.
Main rule of thumb: always do perf testing outside VS, i.e. use Ctrl+F5 instead of F5 to run.
Aside from the debug-vs-release effects, there is something more you should be aware of.
You cannot effectively evaluate multi-threaded code for performance in 0.3s.
The point of threads is two-fold: effectively model parallel work in code, and effectively exploit parallel resources (cpus, cores).
You are trying to evaluate the latter. Given that thread-start overhead is not vanishingly small in comparison to the interval over which you are timing, your measurement is immediately suspect. In most perf test trials, a significant warm-up interval is appropriate. This may sound silly to you; it's a computer program after all, not a lawnmower. But warm-up is absolutely imperative if you are really going to evaluate multithreaded performance. Caches get filled, pipelines fill up, pools get filled, GC generations get filled. The steady-state, continuous performance is what you would like to evaluate. For purposes of this exercise, the program behaves like a lawnmower.
You could say - Well, no, I don't want to evaluate the steady state performance. And if that is the case, then I would say that your scenario is very specialized. Most app scenarios, whether their designers explicitly realize it or not, need continuous, steady performance.
If you truly need the perf to be good only over a single 0.3s interval, you have found your answer. But be careful to not generalize the results.
If you want general results, you need to have reasonably long warm up intervals, and longer collection intervals. You might start at 20s/60s for those phases, but here is the key thing: you need to vary those intervals until you find the results converging. YMMV. The valid times vary depending on the application workload and the resources dedicated to it, obviously. You may find that a measurement interval of 120s is necessary for convergence, or you may find 40s is just fine. But (a) you won't know until you measure it, and (b) you can bet 0.3s is not long enough.
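The warm-up-then-converge procedure above can be sketched as a small harness: run the workload untimed for a warm-up phase, then time batches until two consecutive measurements agree within a tolerance. The parameter values here are placeholders; real runs would size them so each phase lasts tens of seconds, per the 20s/60s starting point above:

```csharp
using System;
using System.Diagnostics;

// Sketch of the warm-up-then-measure idea: discard a warm-up phase, then
// time batches until two consecutive per-iteration timings agree within
// `tolerance` (a fraction, e.g. 0.05 for 5%).
static class Bench
{
    public static double Measure(Action work, int warmupIters, int batchIters, double tolerance)
    {
        for (int i = 0; i < warmupIters; i++) work();   // warm-up: caches, JIT, pools, GC gens

        double previous = TimeBatch(work, batchIters);
        for (int attempt = 0; attempt < 10; attempt++)
        {
            double current = TimeBatch(work, batchIters);
            if (Math.Abs(current - previous) <= tolerance * previous)
                return current;                          // converged
            previous = current;
        }
        return previous;                                 // best effort: did not converge
    }

    static double TimeBatch(Action work, int iters)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iters; i++) work();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds / iters;     // ms per iteration
    }
}

class Program
{
    static void Main()
    {
        int x = 0;
        double perIter = Bench.Measure(() => x++, 1000, 100000, 0.2);
        Console.WriteLine("steady-state ms/iter: " + perIter);
    }
}
```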
[edit]Turns out, this is a release vs. debug build issue -- not sure why it is, but it is. See comments and other answers.[/edit]
This was very interesting -- I wouldn't have guessed there'd be that much difference. (similar test machine here -- Core 2 Quad Q9300)
Here's an interesting comparison -- add a decent-sized additional element to the 'Data' class -- I changed it to this:
public class Data
{
public double Property { get; set; }
public byte[] Spacer = new byte[8096];
}
It's still not quite the same time, but it's very close (running it for 10x as long results in 13.1s vs. 17.6s on my machine).
If I had to guess, I'd speculate that it's related to cross-core cache coherency, at least if I'm remembering how CPU caching works. With the small version of 'Data', if a single cache line contains multiple instances of Data, the cores have to constantly invalidate each other's caches (worst case if they're all on the same cache line). With the 'spacer' added, their memory addresses are far enough apart that one CPU's write of a given address doesn't invalidate the caches of the other CPUs.
Another thing to note -- the 4 threads start nearly concurrently, but they don't finish at the same time -- another indication that there's cross-core issues at work here. Also, I'd guess that running on a multi-cpu machine of a different architecture would bring more interesting issues to light here.
I guess the lesson from this is that in a highly-concurrent scenario, if you're doing a bunch of work with a few small data structures, you should try to make sure they aren't all packed on top of each other in memory. Of course, there's really no way to make sure of that, but I'm guessing there are techniques (like adding spacers) that could be used to try to make it happen.
[edit]
This was too interesting -- I couldn't put it down. To test this out further, I thought I'd try varying-sized spacers, and use an integer instead of a double to keep the object without any added spacers smaller.
class Program
{
static void Main(string[] args)
{
Console.WriteLine("name\t1 thread\t4 threads");
RunTest("no spacer", WorkerUsingHeap, () => new Data());
var values = new int[] { -1, 0, 4, 8, 12, 16, 20 };
foreach (var sv in values)
{
var v = sv;
RunTest(string.Format(v == -1 ? "null spacer" : "{0}B spacer", v), WorkerUsingHeap, () => new DataWithSpacer(v));
}
Console.ReadLine();
}
public static void RunTest(string name, ParameterizedThreadStart worker, Func<object> fo)
{
var start = DateTime.UtcNow;
RunOneThread(worker, fo);
var middle = DateTime.UtcNow;
RunFourThreads(worker, fo);
var end = DateTime.UtcNow;
Console.WriteLine("{0}\t{1}\t{2}", name, middle-start, end-middle);
}
public static void RunOneThread(ParameterizedThreadStart worker, Func<object> fo)
{
var data = fo();
var threadOne = new Thread(worker);
threadOne.Start(data);
threadOne.Join();
}
public static void RunFourThreads(ParameterizedThreadStart worker, Func<object> fo)
{
var data1 = fo();
var data2 = fo();
var data3 = fo();
var data4 = fo();
var threadOne = new Thread(worker);
threadOne.Start(data1);
var threadTwo = new Thread(worker);
threadTwo.Start(data2);
var threadThree = new Thread(worker);
threadThree.Start(data3);
var threadFour = new Thread(worker);
threadFour.Start(data4);
threadOne.Join();
threadTwo.Join();
threadThree.Join();
threadFour.Join();
}
static void WorkerUsingHeap(object state)
{
var data = state as Data;
for (int count = 0; count < 500000000; count++)
{
var property = data.Property;
data.Property = property + 1;
}
}
public class Data
{
public int Property { get; set; }
}
public class DataWithSpacer : Data
{
public DataWithSpacer(int size) { Spacer = size == 0 ? null : new byte[size]; }
public byte[] Spacer;
}
}
Result:
1 thread vs. 4 threads
no spacer 00:00:06.3480000 00:00:42.6260000
null spacer 00:00:06.2300000 00:00:36.4030000
0B spacer 00:00:06.1920000 00:00:19.8460000
4B spacer 00:00:06.1870000 00:00:07.4150000
8B spacer 00:00:06.3750000 00:00:07.1260000
12B spacer 00:00:06.3420000 00:00:07.6930000
16B spacer 00:00:06.2250000 00:00:07.5530000
20B spacer 00:00:06.2170000 00:00:07.3670000
No spacer = 1/6th the speed, null spacer = 1/5th the speed, 0B spacer = 1/3rd the speed, 4B spacer = full speed.
I don't know the full details of how the CLR allocates or aligns objects, so I can't speak to what these allocation patterns look like in real memory, but these definitely are some interesting results.