I'm curious as to how IEnumerable differs from IObservable under the hood. I understand the pull and push patterns respectively, but how does C#, in terms of memory etc., notify subscribers (for IObservable) that they should receive the next bit of data in memory to process? How does the observed instance know it has had a change in data to push to the subscribers?
My question comes from a test I was performing reading in lines from a file. The file was about 6 MB in total.
Standard Time Taken: 4.7s, lines: 36587
Rx Time Taken: 0.68s, lines: 36587
How is Rx able to massively improve a normal iteration over each of the lines in the file?
private static void ReadStandardFile()
{
var timer = Stopwatch.StartNew();
var linesProcessed = 0;
foreach (var l in ReadLines(new FileStream(_filePath, FileMode.Open)))
{
var s = l.Split(',');
linesProcessed++;
}
timer.Stop();
_log.DebugFormat("Standard Time Taken: {0}s, lines: {1}",
timer.Elapsed.ToString(), linesProcessed);
}
private static void ReadRxFile()
{
var timer = Stopwatch.StartNew();
var linesProcessed = 0;
var query = ReadLines(new FileStream(_filePath, FileMode.Open)).ToObservable();
using (query.Subscribe((line) =>
{
var s = line.Split(',');
linesProcessed++;
}));
timer.Stop();
_log.DebugFormat("Rx Time Taken: {0}s, lines: {1}",
timer.Elapsed.ToString(), linesProcessed);
}
private static IEnumerable<string> ReadLines(Stream stream)
{
using (StreamReader reader = new StreamReader(stream))
{
while (!reader.EndOfStream)
yield return reader.ReadLine();
}
}
My hunch is the behavior you're seeing is reflecting the OS caching the file. I would imagine if you reversed the order of the calls you would see a similar difference in speeds, just swapped.
You could improve this benchmark by performing a few warm-up runs or by copying the input file to a temp file using File.Copy prior to testing each one. This way the file would not be "hot" and you would get a fair comparison.
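The warm-up suggestion could look roughly like the sketch below. It is not the asker's code: `WarmUpBenchmark`, `Measure`, `warmUpRuns` and `CountLines` are names invented here, and a small generated temp file stands in for the real 6 MB input.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

public static class WarmUpBenchmark
{
    // Runs the action a few times untimed so the OS file cache (and the JIT)
    // are in the same state for every candidate, then times one more run.
    public static TimeSpan Measure(Action action, int warmUpRuns = 3)
    {
        for (var i = 0; i < warmUpRuns; i++)
            action();

        var timer = Stopwatch.StartNew();
        action();
        timer.Stop();
        return timer.Elapsed;
    }

    // Hypothetical stand-in for the work being compared.
    public static int CountLines(string path)
    {
        var lines = 0;
        foreach (var line in File.ReadLines(path))
        {
            var fields = line.Split(',');
            lines++;
        }
        return lines;
    }

    public static void Main()
    {
        // Assumption: a small generated file instead of the real input.
        var path = Path.GetTempFileName();
        File.WriteAllLines(path, Enumerable.Range(0, 1000).Select(i => i + ",a,b"));
        try
        {
            var elapsed = Measure(() => CountLines(path));
            Console.WriteLine("Warm run: {0}", elapsed);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Passing both the standard and the Rx reader through the same `Measure` call should take the order of the two tests out of the comparison.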
I'd suspect that you're seeing some kind of internal optimization of the CLR. It probably caches the content of the file in memory between the two calls so that ToObservable can pull the content much faster...
Edit: Oh, the good colleague with the crazy nickname eeh ... @sixlettervariables was faster and he's probably right: it's rather the OS that's optimizing than the CLR.
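On the "under the hood" part of the question: when an observable is built from an enumerable, the push side can be as simple as the producer iterating the source and calling the observer's methods. The sketch below is hand-written against the BCL interfaces `IObservable<T>` and `IObserver<T>`; it is not Rx's actual `ToObservable` implementation (which also involves schedulers), and `EnumerableObservable`, `CountingObserver` and `PushDemo` are names made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hand-written observable over an IEnumerable<T>; not Rx's implementation.
public sealed class EnumerableObservable<T> : IObservable<T>
{
    private readonly IEnumerable<T> source;

    public EnumerableObservable(IEnumerable<T> source)
    {
        this.source = source;
    }

    // "Push" here is just the producer pulling from the enumerable
    // and calling the observer synchronously for every item.
    public IDisposable Subscribe(IObserver<T> observer)
    {
        foreach (var item in source)
            observer.OnNext(item);
        observer.OnCompleted();
        return NoopDisposable.Instance;
    }

    private sealed class NoopDisposable : IDisposable
    {
        public static readonly NoopDisposable Instance = new NoopDisposable();
        public void Dispose() { }
    }
}

// Observer that only counts what it is given.
public sealed class CountingObserver : IObserver<string>
{
    public int Count { get; private set; }
    public bool Completed { get; private set; }
    public void OnNext(string value) { Count++; }
    public void OnError(Exception error) { }
    public void OnCompleted() { Completed = true; }
}

public static class PushDemo
{
    public static void Main()
    {
        var observer = new CountingObserver();
        new EnumerableObservable<string>(new[] { "a,b", "c,d" }).Subscribe(observer);
        Console.WriteLine("{0} items, completed: {1}", observer.Count, observer.Completed);
    }
}
```

Seen this way there is no special memory notification involved: the same enumerator is consumed either way, which fits the answers' suspicion that the timing gap comes from file caching rather than from Rx itself.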
Related
I am working on a realtime simulation model. The models are written in unmanaged code, but the models are controlled by C# managed code, called the ExecutiveManager. An ExecutiveManager runs multiple models at a time, and controls the timing of the running models (for example, if a model has a "framerate" of 20 per second, the executive will tell the model when to start its next frame).
We are seeing a consistently high load on the CPU when running the simulation, it can get up to 100% and stay there on a machine that should be totally appropriate. I have used a processor profiler to determine where the issues are, and it pointed me to two methods: WriteMemoryRegion and ReadMemoryRegion. The ExecutiveManager makes the calls to these methods. Models have shared memory regions, and the ExecutiveManager is used to read and write these regions using these Methods. Both read and write make calls to Marshal.Copy, and my gut tells me that's where the issue is, but I don't want to trust my gut! We are going to do further testing to narrow things down more, but I wanted to do a quick sanity check on Marshal.Copy. WriteMemoryRegion and ReadMemoryRegion are called each frame, and furthermore they're called by each model in the ExecutiveManager, and each model typically has 6 shared regions. So for 10 models each with 6 regions running at 20 frames per second calling both WriteMemoryRegion and ReadMemoryRegion, that's 2400 calls of Marshal.Copy per second. Is this unreasonable, or could my problem lie elsewhere?
public async Task ReadMemoryRegion(MemoryRegionDefinition g) {
if (!cache.ContainsKey(g.Name)) {
cache.Add(g.Name, mmff.CreateOrOpen(g.Name, g.Size));
}
var mmf = cache[g.Name];
using (var stream = mmf.CreateViewStream())
using (var reader = brf.Create(stream)) {
var buffer = reader.ReadBytes(g.Size);
await WriteIcBuffer(g, buffer).ConfigureAwait(false);
}
}
private Task WriteIcBuffer(MemoryRegionDefinition g, byte[] buffer) {
Marshal.Copy(buffer, 0, new IntPtr(g.BaseAddress),
buffer.Length);
return Task.FromResult(0);
}
public async Task WriteMemoryRegion(MemoryRegionDefinition g) {
if (!cache.ContainsKey(g.Name)) {
if (g.Size > 0) {
cache.Add(g.Name, mmff.CreateOrOpen(g.Name, g.Size));
} else if (g.Size == 0){
throw new EmptyGlobalException($@"Global {g.Name} not
created as it does not contain any variables.");
} else {
throw new NegativeSizeGlobalException($@"Global {g.Name}
not created as it has a negative size.");
}
}
var mmf = cache[g.Name];
using (var stream = mmf.CreateViewStream())
using (var writer = bwf.Create(stream)) {
var buffer = await ReadIcBuffer(g);
writer.Write(buffer);
}
}
private Task<byte[]> ReadIcBuffer(MemoryRegionDefinition g) {
var buffer = new byte[g.Size];
Marshal.Copy(new IntPtr(g.BaseAddress), buffer, 0, g.Size);
return Task.FromResult(buffer);
}
I need to come up with a solution so that my processor isn't catching on fire. I'm very green in this area so all ideas are welcome. Again, I'm not sure Marshal.Copy is the issue, but it seems possible. Please let me know if you see other issues that could contribute to the processor problem.
I have a folder with many CSV files in it, which are around 3MB each in size.
example of content of one CSV:
afkla890sdfa9f8sadfkljsdfjas98sdf098,-1dskjdl4kjff;
afkla890sdfa9f8sadfkljsdfjas98sdf099,-1kskjd11kjsj;
afkla890sdfa9f8sadfkljsdfjas98sdf100,-1asfjdl1kjgf;
etc...
Now I have a Console app written in C#, that searches each CSV file for a certain string.
And those strings to search for are in a txt file.
example of search txt file:
-1gnmjdl5dghs
-17kn3mskjfj4
-1plo3nds3ddd
then I call the method to search each search string in all files in given folder:
private static object _lockObject = new object();
public static IEnumerable<string> SearchContentListInFiles(string searchFolder, List<string> searchList)
{
var result = new List<string>();
var files = Directory.EnumerateFiles(searchFolder);
Parallel.ForEach(files, (file) =>
{
var fileContent = File.ReadLines(file);
if (fileContent.Any(x => searchList.Any(y => x.ToLower().Contains(y))))
{
lock (_lockObject)
{
foreach (string searchFound in fileContent.Where(x => searchList.Any(y => x.ToLower().Contains(y))))
{
result.Add(searchFound);
}
}
}
});
return result;
}
The question now is: can I somehow improve the performance of this operation?
I have around 100GB of files to search through.
It takes approximately 1 hour to search all ~30.000 files with around 25 search strings, on an SSD and a good i7 CPU.
Would it make a difference to have larger or smaller CSV files? I just want this search to be as fast as possible.
UPDATE
I have tried every suggestion that you wrote, and this is what performed best for me (removing ToLower from the LINQ yielded the biggest performance boost; search time went from 1 hour to 16 minutes!):
public static IEnumerable<string> SearchContentListInFiles(string searchFolder, HashSet<string> searchList)
{
var result = new BlockingCollection<string>();
var files = Directory.EnumerateFiles(searchFolder);
Parallel.ForEach(files, (file) =>
{
var fileContent = File.ReadLines(file); //.Select(x => x.ToLower());
if (fileContent.Any(x => searchList.Any(y => x.Contains(y))))
{
foreach (string searchFound in fileContent.Where(x => searchList.Any(y => x.Contains(y))))
{
result.Add(searchFound);
}
}
});
return result;
}
Probably something like Lucene could be a performance boost: why don't you index your data so you can search it easily?
Take a look at Lucene .NET
You'll avoid searching the data sequentially. In addition, you can build several indexes over the same data so that you can get to certain results at light speed.
Try to:
Call .ToLower once per line instead of once for each element in searchList.
Scan each file once instead of making two passes with Any and Where. Collect the matches into a list first, then take the lock only to add them if any were found. In your sample you waste time on two passes and block all other threads while you search and add.
If you know the position where the match must be (in your sample you do), search from that position instead of scanning the whole string.
Use the producer-consumer pattern, for example with BlockingCollection<T>, so there is no need for a lock.
If you only need to match a field exactly, build a HashSet from searchList and call searchHash.Contains(fieldValue); this will speed things up dramatically.
So here is a sample (not tested):
using(var searcher = new FilesSearcher(
searchFolder: "path",
searchList: toLookFor))
{
searcher.SearchContentListInFiles();
}
here is the searcher:
public class FilesSearcher : IDisposable
{
private readonly BlockingCollection<string[]> filesInMemory;
private readonly string searchFolder;
private readonly string[] searchList;
public FilesSearcher(string searchFolder, string[] searchList)
{
// reader thread stores lines here
this.filesInMemory = new BlockingCollection<string[]>(
// limit count of files stored in memory, so if processing threads not so fast, reader will take a break and wait
boundedCapacity: 100);
this.searchFolder = searchFolder;
this.searchList = searchList;
}
public IEnumerable<string> SearchContentListInFiles()
{
// start reading;
// we do not need many threads here, probably 1 thread per storage device is the optimum
var filesReaderTask = Task.Factory.StartNew(ReadFiles, TaskCreationOptions.LongRunning);
// at least one processing thread, because the reader thread is IO bound
var taskCount = Math.Max(1, Environment.ProcessorCount - 1);
// start search threads
var tasks = Enumerable
.Range(0, taskCount)
.Select(x => Task<string[]>.Factory.StartNew(Search, TaskCreationOptions.LongRunning))
.ToArray();
// wait for results
Task.WaitAll(tasks);
// combine results
return tasks
.SelectMany(t => t.Result)
.ToArray();
}
private string[] Search()
{
// if you always get unique results use list
var results = new List<string>();
//var results = new HashSet<string>();
foreach (var content in this.filesInMemory.GetConsumingEnumerable())
{
// one pass by a file
var currentFileMatches = content
.Where(sourceLine =>
{
// lower-case the line once, without making a lowered copy of the whole file;
// the entries in searchList are assumed to be lower case already
var lower = sourceLine.ToLower();
return this.searchList.Any(lower.Contains);
});
// store current file matches
foreach (var currentMatch in currentFileMatches)
{
results.Add(currentMatch);
}
}
return results.ToArray();
}
private void ReadFiles()
{
var files = Directory.EnumerateFiles(this.searchFolder);
try
{
foreach (var file in files)
{
var fileContent = File.ReadLines(file);
// add file, or wait if filesInMemory are full
this.filesInMemory.Add(fileContent.ToArray());
}
}
finally
{
this.filesInMemory.CompleteAdding();
}
}
public void Dispose()
{
if (filesInMemory != null)
filesInMemory.Dispose();
}
}
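The HashSet suggestion from the list above can be sketched as follows. This is a minimal sketch under two assumptions that the question does not confirm: every line has the shape `<key>,<value>;` shown in the sample CSV, and a search string must equal the whole `<value>` field. `FieldSearch` and `LineMatches` are names made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class FieldSearch
{
    // Assumes each line looks like "<key>,<value>;" as in the sample CSV,
    // and that a search term must equal the whole <value> field.
    public static bool LineMatches(string line, HashSet<string> searchTerms)
    {
        var comma = line.IndexOf(',');
        if (comma < 0)
            return false;

        var value = line.Substring(comma + 1).TrimEnd(';');
        // One hash lookup instead of one Contains scan per search term.
        return searchTerms.Contains(value);
    }

    public static void Main()
    {
        var terms = new HashSet<string>(StringComparer.Ordinal)
        {
            "-1gnmjdl5dghs", "-17kn3mskjfj4", "-1plo3nds3ddd"
        };
        Console.WriteLine(LineMatches("afkla890sdfa9f8sadfkljsdfjas98sdf098,-1gnmjdl5dghs;", terms));
        Console.WriteLine(LineMatches("afkla890sdfa9f8sadfkljsdfjas98sdf099,-1kskjd11kjsj;", terms));
    }
}
```

With this approach the cost per line no longer grows with the number of search strings. If the search string may appear anywhere inside the field, this does not apply and a substring scan is still needed.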
This operation is first and foremost disk bound. Disk-bound operations do not benefit from multithreading. Indeed, all you will do is swamp the disk controller with a ton of conflicting requests at the same time, which a feature like NCQ then has to straighten out again.
If you had loaded all the files into memory first, your operation would be memory bound. And memory-bound operations do not benefit from multithreading either (usually; it goes into details of CPU and memory architecture here).
While a certain amount of multitasking is mandatory in programming, true multithreading only helps with CPU-bound operations. Nothing in there looks remotely CPU bound. So multithreading that search (one thread per file) will not make it faster, and will likely make it slower due to all the thread switching and synchronization overhead.
Which of the following approaches is better? I meant to ask, is it better to copy the stream locally, close it and do whatever operations that are needed to be done using the data? or just perform operations with the stream open? Assume that the input from the stream is huge.
First method:
public static int calculateSum(string filePath)
{
int sum = 0;
var list = new List<int>();
using (StreamReader sr = new StreamReader(filePath))
{
while (!sr.EndOfStream)
{
list.Add(int.Parse(sr.ReadLine()));
}
}
foreach(int item in list)
sum += item;
return sum;
}
Second method:
public static int calculateSum(string filePath)
{
int sum = 0;
using (StreamReader sr = new StreamReader(filePath))
{
while (!sr.EndOfStream)
{
sum += int.Parse(sr.ReadLine());
}
}
return sum;
}
If the file is modified often, then read the data in and then work with it. If it is not accessed often, then you are fine to read the file one line at a time and work with each line separately.
In general, if you can do it in a single pass, then do it in a single pass. You indicate that the input is huge, so it might not all fit into memory. If that's the case, then your first option isn't even possible.
Of course, there are exceptions to every rule of thumb. But you don't indicate that there's anything special about the file or the access pattern (other processes wanting to access it, for example) that prevents you from keeping it open longer than absolutely necessary to copy the data.
I don't know if your example is a real-world scenario or if you're just using the sum thing as a placeholder for more complex processing. In any case, if you're processing a file line-by-line, you can save yourself a lot of trouble by using File.ReadLines:
int sum = 0;
foreach (var line in File.ReadLines(filePath))
{
sum += int.Parse(line);
}
This does not read the entire file into memory at once. Rather, it uses an enumerator to present one line at a time, and only reads as much as it must to maintain a relatively small (probably four kilobyte) buffer.
I have a method like this :
public ConcurrentBag<FileModel> GetExceptionFiles(List<string> foldersPath, List<string> someList)
{
for (var i = 0; i < foldersPath.Count; i++)
{
var index = i;
new Thread(delegate()
{
foreach (var file in BrowseFiles(foldersPath[index]))
{
if (file.Name.Contains(someList[0]) || file.Name.Contains(someList[1]))
{
using (var fileStream = File.Open(file.Path, FileMode.Open))
using (var bufferedStream = new BufferedStream(fileStream))
using (var streamReader = new StreamReader(bufferedStream))
...
To give you more details:
This method starts n threads (= foldersPath.Count) and each thread reads all the files whose names contain the strings listed in someList.
Right now my list contains only 2 strings (conditions), which is why I'm doing:
file.Name.Contains(someList[0]) || file.Name.Contains(someList[1])
What I want to do now is replace this line with something that checks all elements in the list someList.
How can I do that?
Edit
Now that I replaced that line by if (someList.Any(item => file.Name.Contains(item)))
The next question is how I can optimize the performance of this code, knowing that each item in foldersPath is a separate hard drive on my network (there are never more than 5 hard drives).
You could use something like if (someList.Any(item => file.Name.Contains(item)))
This will iterate each item in someList, and check if any of the items are contained in the file name, returning a boolean value to indicate whether any matches were found or not
Firstly.
There is an old saying in computer science, "There are two hard problems in CS, Naming, Cache Invalidation and Off by One Errors."
Don't use for loops, unless you absolutely have to, the tiny perf gain you get isn't worth the debug time (assuming there is any perf gain in this version of .net).
Secondly
new Thread. Don't do that. Creating a thread is extremely slow and takes up lots of resources, especially for a short-lived process like this. On top of that, there is overhead in passing data between threads. Use ThreadPool.QueueUserWorkItem(WaitCallback) instead if you MUST use short-lived threads.
However, threads are an abstraction for CPU resources, and I honestly doubt you are CPU bound. Threading is going to cost you more than you think. Stick to a single thread. You ARE I/O bound, however, so make full use of asynchronous I/O.
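public async Task<IEnumerable<FileModel>> GetExceptionFiles(List<string> foldersPath, List<string> someList)
{
// yield return is not allowed in a method returning Task<IEnumerable<T>>, so collect the results instead
var results = new List<FileModel>();
foreach (var folderPath in foldersPath)
foreach (var file in BrowseFiles(folderPath))
{
// case-insensitive match on the file name
if (false == someList.Any(x => file.Name.IndexOf(x, StringComparison.InvariantCultureIgnoreCase) >= 0))
continue;
// useAsync: true opens the file for asynchronous I/O
using (var fileStream = new FileStream(file.Path, FileMode.Open, FileAccess.Read, FileShare.Read, 4096, useAsync: true))
using (var bufferedStream = new BufferedStream(fileStream))
using (var streamReader = new StreamReader(bufferedStream))
...
results.Add(new FileModel());
}
return results;
}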
I was seeing some strange behavior in a multi threading application which I wrote and which was not scaling well across multiple cores.
The following code illustrates the behavior I am seeing. It appears the heap intensive operations do not scale across multiple cores rather they seem to slow down. ie using a single thread would be faster.
class Program
{
public static Data _threadOneData = new Data();
public static Data _threadTwoData = new Data();
public static Data _threadThreeData = new Data();
public static Data _threadFourData = new Data();
static void Main(string[] args)
{
// Do heap intensive tests
var start = DateTime.Now;
RunOneThread(WorkerUsingHeap);
var finish = DateTime.Now;
var timeLapse = finish - start;
Console.WriteLine("One thread using heap: " + timeLapse);
start = DateTime.Now;
RunFourThreads(WorkerUsingHeap);
finish = DateTime.Now;
timeLapse = finish - start;
Console.WriteLine("Four threads using heap: " + timeLapse);
// Do stack intensive tests
start = DateTime.Now;
RunOneThread(WorkerUsingStack);
finish = DateTime.Now;
timeLapse = finish - start;
Console.WriteLine("One thread using stack: " + timeLapse);
start = DateTime.Now;
RunFourThreads(WorkerUsingStack);
finish = DateTime.Now;
timeLapse = finish - start;
Console.WriteLine("Four threads using stack: " + timeLapse);
Console.ReadLine();
}
public static void RunOneThread(ParameterizedThreadStart worker)
{
var threadOne = new Thread(worker);
threadOne.Start(_threadOneData);
threadOne.Join();
}
public static void RunFourThreads(ParameterizedThreadStart worker)
{
var threadOne = new Thread(worker);
threadOne.Start(_threadOneData);
var threadTwo = new Thread(worker);
threadTwo.Start(_threadTwoData);
var threadThree = new Thread(worker);
threadThree.Start(_threadThreeData);
var threadFour = new Thread(worker);
threadFour.Start(_threadFourData);
threadOne.Join();
threadTwo.Join();
threadThree.Join();
threadFour.Join();
}
static void WorkerUsingHeap(object state)
{
var data = state as Data;
for (int count = 0; count < 100000000; count++)
{
var property = data.Property;
data.Property = property + 1;
}
}
static void WorkerUsingStack(object state)
{
var data = state as Data;
double dataOnStack = data.Property;
for (int count = 0; count < 100000000; count++)
{
dataOnStack++;
}
data.Property = dataOnStack;
}
public class Data
{
public double Property
{
get;
set;
}
}
}
This code was run on a Core 2 Quad (4 core system) with the following results:
One thread using heap: 00:00:01.8125000
Four threads using heap: 00:00:17.7500000
One thread using stack: 00:00:00.3437500
Four threads using stack: 00:00:00.3750000
So using the heap with four threads did 4 times the work but took almost 10 times as long. This means it would be twice as fast in this case to use only one thread?
Using the stack was much more as expected.
I would like to know what is going on here. Can the heap only be written to from one thread at a time?
The answer is simple - run outside of Visual Studio...
I just copied your entire program, and ran it on my quad core system.
Inside VS (Release Build):
One thread using heap: 00:00:03.2206779
Four threads using heap: 00:00:23.1476850
One thread using stack: 00:00:00.3779622
Four threads using stack: 00:00:00.5219478
Outside VS (Release Build):
One thread using heap: 00:00:00.3899610
Four threads using heap: 00:00:00.4689531
One thread using stack: 00:00:00.1359864
Four threads using stack: 00:00:00.1409859
Note the difference. The extra time in the build outside VS is pretty much all due to the overhead of starting the threads. Your work in this case is too small to really test, and you're not using the high performance counters, so it's not a perfect test.
Main rule of thumb - always do perf. testing outside VS, ie: use Ctrl+F5 instead of F5 to run.
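As for the remark about high performance counters: `System.Diagnostics.Stopwatch` is the usual replacement for subtracting `DateTime.Now` values. The sketch below shows the idea; `StopwatchTiming` and `TimeOnThread` are hypothetical names, and the worker is a simplified stand-in for the question's loops.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class StopwatchTiming
{
    // Times a worker on its own thread using Stopwatch, which is backed by
    // the high-resolution counter when Stopwatch.IsHighResolution is true.
    public static TimeSpan TimeOnThread(ThreadStart worker)
    {
        var thread = new Thread(worker);
        var timer = Stopwatch.StartNew();
        thread.Start();
        thread.Join();
        timer.Stop();
        return timer.Elapsed;
    }

    public static void Main()
    {
        Console.WriteLine("High resolution: {0}", Stopwatch.IsHighResolution);
        var elapsed = TimeOnThread(() =>
        {
            double x = 0;
            for (var i = 0; i < 100000000; i++)
                x++;
        });
        Console.WriteLine("One thread: {0}", elapsed);
    }
}
```

The measured time still includes thread start-up, so it is only meaningful when the work is long compared to that overhead.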
Aside from the debug-vs-release effects, there is something more you should be aware of.
You cannot effectively evaluate multi-threaded code for performance in 0.3s.
The point of threads is two-fold: effectively model parallel work in code, and effectively exploit parallel resources (cpus, cores).
You are trying to evaluate the latter. Given that thread start overhead is not vanishingly small in comparison to the interval over which you are timing, your measurement is immediately suspect. In most perf test trials, a significant warm-up interval is appropriate. This may sound silly to you - it's a computer program after all, not a lawnmower. But warm-up is absolutely imperative if you are really going to evaluate multi-thread performance. Caches get filled, pipelines fill up, pools get filled, GC generations get filled. The steady-state, continuous performance is what you would like to evaluate. For purposes of this exercise, the program behaves like a lawnmower.
You could say - Well, no, I don't want to evaluate the steady state performance. And if that is the case, then I would say that your scenario is very specialized. Most app scenarios, whether their designers explicitly realize it or not, need continuous, steady performance.
If you truly need the perf to be good only over a single 0.3s interval, you have found your answer. But be careful to not generalize the results.
If you want general results, you need to have reasonably long warm up intervals, and longer collection intervals. You might start at 20s/60s for those phases, but here is the key thing: you need to vary those intervals until you find the results converging. YMMV. The valid times vary depending on the application workload and the resources dedicated to it, obviously. You may find that a measurement interval of 120s is necessary for convergence, or you may find 40s is just fine. But (a) you won't know until you measure it, and (b) you can bet 0.3s is not long enough.
[edit]Turns out, this is a release vs. debug build issue -- not sure why it is, but it is. See comments and other answers.[/edit]
This was very interesting -- I wouldn't have guessed there'd be that much difference. (similar test machine here -- Core 2 Quad Q9300)
Here's an interesting comparison -- add a decent-sized additional element to the 'Data' class -- I changed it to this:
public class Data
{
public double Property { get; set; }
public byte[] Spacer = new byte[8096];
}
It's still not quite the same time, but it's very close (running it for 10x as long results in 13.1s vs. 17.6s on my machine).
If I had to guess, I'd speculate that it's related to cross-core cache coherency, at least if I'm remembering how CPU cache works. With the small version of 'Data', if a single cache line contains multiple instances of Data, the cores are having to constantly invalidate each other's caches (worst case if they're all on the same cache line). With the 'spacer' added, their memory addresses are far enough apart that one CPU's write of a given address doesn't invalidate the caches of the other CPUs.
Another thing to note -- the 4 threads start nearly concurrently, but they don't finish at the same time -- another indication that there are cross-core issues at work here. Also, I'd guess that running on a multi-cpu machine of a different architecture would bring more interesting issues to light here.
I guess the lesson from this is that in a highly-concurrent scenario, if you're doing a bunch of work with a few small data structures, you should try to make sure they aren't all packed on top of each other in memory. Of course, there's really no way to make sure of that, but I'm guessing there are techniques (like adding spacers) that could be used to try to make it happen.
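One way to approximate the spacer idea without depending on where the allocator happens to place objects is to pad each item explicitly. The sketch below assumes a 64-byte cache line (not verified for the machines in this thread) and uses a 128-byte stride to stay on the safe side; `PaddedCounters`, `PaddedCounter` and `Run` are names made up for illustration.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

public static class PaddedCounters
{
    // Each counter occupies 128 bytes, so two neighbouring counters in an
    // array cannot share a cache line (assumed here to be 64 bytes).
    [StructLayout(LayoutKind.Explicit, Size = 128)]
    public struct PaddedCounter
    {
        [FieldOffset(0)]
        public long Value;
    }

    // Starts one thread per counter; every thread only touches its own slot.
    public static long[] Run(int threadCount, int iterations)
    {
        var counters = new PaddedCounter[threadCount];
        var threads = new Thread[threadCount];
        for (var t = 0; t < threadCount; t++)
        {
            var slot = t; // capture a copy for the closure
            threads[t] = new Thread(() =>
            {
                for (var i = 0; i < iterations; i++)
                    counters[slot].Value++;
            });
            threads[t].Start();
        }
        foreach (var thread in threads)
            thread.Join();

        var results = new long[threadCount];
        for (var t = 0; t < threadCount; t++)
            results[t] = counters[t].Value;
        return results;
    }

    public static void Main()
    {
        var results = Run(threadCount: 4, iterations: 100000000);
        Console.WriteLine(string.Join(", ", results));
    }
}
```

Because the structs sit in one array at a fixed stride, the distance between the hot fields is known, unlike separately allocated Data objects whose placement is up to the CLR.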
[edit]
This was too interesting -- I couldn't put it down. To test this out further, I thought I'd try varying-sized spacers, and use an integer instead of a double to keep the object without any added spacers smaller.
class Program
{
static void Main(string[] args)
{
Console.WriteLine("name\t1 thread\t4 threads");
RunTest("no spacer", WorkerUsingHeap, () => new Data());
var values = new int[] { -1, 0, 4, 8, 12, 16, 20 };
foreach (var sv in values)
{
var v = sv;
RunTest(string.Format(v == -1 ? "null spacer" : "{0}B spacer", v), WorkerUsingHeap, () => new DataWithSpacer(v));
}
Console.ReadLine();
}
public static void RunTest(string name, ParameterizedThreadStart worker, Func<object> fo)
{
var start = DateTime.UtcNow;
RunOneThread(worker, fo);
var middle = DateTime.UtcNow;
RunFourThreads(worker, fo);
var end = DateTime.UtcNow;
Console.WriteLine("{0}\t{1}\t{2}", name, middle-start, end-middle);
}
public static void RunOneThread(ParameterizedThreadStart worker, Func<object> fo)
{
var data = fo();
var threadOne = new Thread(worker);
threadOne.Start(data);
threadOne.Join();
}
public static void RunFourThreads(ParameterizedThreadStart worker, Func<object> fo)
{
var data1 = fo();
var data2 = fo();
var data3 = fo();
var data4 = fo();
var threadOne = new Thread(worker);
threadOne.Start(data1);
var threadTwo = new Thread(worker);
threadTwo.Start(data2);
var threadThree = new Thread(worker);
threadThree.Start(data3);
var threadFour = new Thread(worker);
threadFour.Start(data4);
threadOne.Join();
threadTwo.Join();
threadThree.Join();
threadFour.Join();
}
static void WorkerUsingHeap(object state)
{
var data = state as Data;
for (int count = 0; count < 500000000; count++)
{
var property = data.Property;
data.Property = property + 1;
}
}
public class Data
{
public int Property { get; set; }
}
public class DataWithSpacer : Data
{
// -1 means "no spacer array at all" (null); 0 allocates an empty array
public DataWithSpacer(int size) { Spacer = size < 0 ? null : new byte[size]; }
public byte[] Spacer;
}
}
Result:
name          1 thread          4 threads
no spacer     00:00:06.3480000  00:00:42.6260000
null spacer   00:00:06.2300000  00:00:36.4030000
0B spacer     00:00:06.1920000  00:00:19.8460000
4B spacer     00:00:06.1870000  00:00:07.4150000
8B spacer     00:00:06.3750000  00:00:07.1260000
12B spacer    00:00:06.3420000  00:00:07.6930000
16B spacer    00:00:06.2250000  00:00:07.5530000
20B spacer    00:00:06.2170000  00:00:07.3670000
No spacer = 1/6th the speed, null spacer = 1/5th the speed, 0B spacer = 1/3rd the speed, 4B spacer = full speed.
I don't know the full details of how the CLR allocates or aligns objects, so I can't speak to what these allocation patterns look like in real memory, but these definitely are some interesting results.