element size influencing C# collection performance? - c#

Given the task of improving the performance of a piece of code, I have come across the following phenomenon. I have a large collection of reference types in a generic Queue and I'm removing and processing the elements one by one, then adding them to another generic collection.
It seems the larger the elements are the more time it takes to add the element to the collection.
Trying to narrow down the problem to the relevant part of the code, I've written a test (omitting the processing of elements, just doing the insert):
class Small
{
public Small()
{
this.s001 = "001";
this.s002 = "002";
}
string s001;
string s002;
}
class Large
{
public Large()
{
this.s001 = "001";
this.s002 = "002";
...
this.s050 = "050";
}
string s001;
string s002;
...
string s050;
}
static void Main(string[] args)
{
const int N = 1000000;
var storage = new List<object>(N);
for (int i = 0; i < N; ++i)
{
//storage.Add(new Small());
storage.Add(new Large());
}
List<object> outCollection = new List<object>();
Stopwatch sw = new Stopwatch();
sw.Start();
for (int i = N-1; i > 0; --i)
{
outCollection.Add(storage[i]);
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
}
On the test machine, using the Small class, it takes about 25-30 ms to run, while it takes 40-45 ms with Large.
I know that outCollection has to grow from time to time to be able to store all the items, so there is some dynamic memory allocation. But giving it an initial capacity makes the difference even more obvious: 11-12 ms with Small and 35-38 ms with Large objects.
I am somewhat surprised, as these are reference types, so I was expecting the collections to work only with references to the Small/Large instances. I have read Eric Lippert's relevant article and know that references should not be treated as pointers. At the same time, AFAIK they are currently implemented as pointers, so their size and the collection's performance should be independent of element size.
I've decided to put up a question here hoping that someone could explain or help me to understand what's happening here. Aside from the performance improvement, I'm really curious what is happening behind the scenes.
Update:
Profiling data using the diagnostic tools didn't help me much, although I have to admit I'm not an expert using the profiler. I'll collect more data later today to find where the bottleneck is.
The pressure on the GC is quite high of course, especially with the Large instances. But once the instances are created and stored in the storage collection, and the program enters the loop, no further collection was triggered, and memory usage didn't increase significantly (outCollection was already pre-allocated).
Most of the CPU time is of course spent on memory allocation (JIT_New), around 62%, and the only other significant entry is System.Collections.Generic.List`1[System.__Canon].Add with about 7%.
With 1 million items the preallocated outCollection size is 8 million bytes (the same as the size of storage); one can suspect 64 bit addresses being stored in the collections.
Probably I'm not using the tools properly or don't have the experience to interpret the results correctly, but the profiler didn't help me to get closer to the cause.
If the loop is not triggering collections and it only copies pointers between 2 pre-allocated collections, how could the item size cause any difference? The cache hit/miss ratio is supposed to be more or less the same in both cases, as the loop is iterating over a list of "addresses" in both cases.
Thanks for all the help so far. I will collect more data and put an update here if I find anything.

I suspect that at least one action in the above (maybe some type checks) will require a de-reference. Then the fact that many Smalls are probably sat close together on the heap and thus sharing cache lines could account for some amount of difference (certainly many more of them could share a single cache line than Larges).
Added to which you are also accessing them in the reverse order in which they were allocated which maximises such a benefit.
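One way to probe the cache-locality guess above is to time the same copy in both directions over the same storage list. The following is only a sketch: the Payload class is a hypothetical stand-in for the question's Large class, and the timings it prints will depend on the machine, the runtime, and the build configuration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class LocalityProbe
{
    // Hypothetical stand-in for the question's Large class (fewer fields for brevity).
    sealed class Payload
    {
        public string A = "001", B = "002", C = "003", D = "004";
    }

    // Allocates n instances in order, as the question's setup loop does.
    public static List<object> Build(int n)
    {
        var storage = new List<object>(n);
        for (int i = 0; i < n; ++i) storage.Add(new Payload());
        return storage;
    }

    // Copies every reference into a pre-sized list and returns the elapsed
    // milliseconds. 'reverse' selects the iteration direction.
    public static long TimeCopy(List<object> storage, bool reverse, out int copied)
    {
        var output = new List<object>(storage.Count);
        var sw = Stopwatch.StartNew();
        if (reverse)
            for (int i = storage.Count - 1; i >= 0; --i) output.Add(storage[i]);
        else
            for (int i = 0; i < storage.Count; ++i) output.Add(storage[i]);
        sw.Stop();
        copied = output.Count;
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        var storage = Build(1000000);
        int copied;
        Console.WriteLine("forward: {0} ms", TimeCopy(storage, false, out copied));
        Console.WriteLine("reverse: {0} ms", TimeCopy(storage, true, out copied));
    }
}
```

If direction or payload size changes the numbers noticeably, that would be consistent with the loop touching the objects themselves and not just the references, but it would not prove which operation does the dereference.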

Related

Declaring a jagged array succeeds, but out of memory when declaring a multi-dimen array of same size

I get an out of memory exception when running this line of code:
double[,] _DataMatrix = new double[_total_traces, _samples_per_trace];
But this code completes successfully:
double[][] _DataMatrix = new double[_total_traces][];
for (int i = 0; i < _total_traces; i++)
{
_DataMatrix[i] = new double[_samples_per_trace];
}
My first question is why is this happening?
As a followup question, my ultimate goal is to run Principal Component Analysis (PCA) on this data. It's a pretty large dataset. The number of "rows" in the matrix could be a couple million. The number of "columns" will be around 50. I found a PCA library in the Accord.net framework that seems popular. It takes a jagged array as input (which I can successfully create and populate with data), but I run out of memory when I pass it to PCA - I guess because it is passing by value and creating a copy of the data(?). My next thought was to just write my own method to do the PCA so I wouldn't have to copy the data, but I haven't got that far yet. I haven't really had to deal with memory management much before, so I'm open to tips.
Edit: This is not a duplicate of the topic linked below, because that link did not explain how the memory of the two was stored differently and why one would cause memory issues despite them both being the same size.
In a 32-bit process it is difficult to find a contiguous range of addresses larger than a few hundred MB (see for example https://stackoverflow.com/a/30035977/613130). But it is easy to have scattered pieces of memory totalling a few hundred MB (or even 1 GB)...
The multidimensional array is a single slab of contiguous memory; the jagged array is a collection of small arrays (so of small pieces of memory).
Note that in a 64-bit process it is much easier to create an array of the maximum size permitted by .NET (around 2 GB or even more... see https://stackoverflow.com/a/2338797/613130)
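To put rough numbers on the difference, the sketch below computes the element data each layout needs. The sizes (2,000,000 rows by 50 columns) are assumptions taken from the question's "a couple million" rows and "around 50" columns, and object headers are ignored.

```csharp
using System;

static class ArrayFootprint
{
    // Bytes of element data in a rows x cols block of doubles (headers ignored).
    public static long ContiguousBytes(long rows, long cols)
    {
        return rows * cols * sizeof(double);
    }

    static void Main()
    {
        long rows = 2000000, cols = 50; // assumed sizes, see the note above
        Console.WriteLine("double[,]  needs one block of {0:N0} bytes",
            ContiguousBytes(rows, cols));
        Console.WriteLine("double[][] needs {0:N0} blocks of {1:N0} bytes each",
            rows, ContiguousBytes(1, cols));
    }
}
```

Under these assumptions the multidimensional array asks for a single block of 800,000,000 bytes, while the jagged array asks for many 400-byte rows plus one outer array of references, which is far easier to fit into a fragmented address space.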

Random access on .NET lists is slow, but what if I always reference the first element?

I know that in general, .NET Lists are not good for random access. I've always been told that an array would be best for that. I have a program that needs to continually (like more than a billion times) access the first element of a .NET list, and I am wondering if this will slow anything down, or it won't matter because it's the first element in the list. I'm also doing a lot of other things like adding and removing items from the list as I go along, but the List is never empty.
I'm using F#, but I think this applies to any .NET language (I am using .NET Lists, not F# Lists). My list is about 100 elements long.
In F#, the .NET list (System.Collections.Generic.List) is aptly aliased as ResizeArray, which leaves little doubt as to what to expect. It's an array that can resize itself, and not really a list in the CS-classroom understanding of the term. Any performance differences between it and a simple array most likely come from the fact that compiler can be more aggressive about optimizing array usage.
Back to your question. If you only access the first element of a list, it doesn't matter what you choose. Both a ResizeArray and a list (using F# lingo) have O(1) access to the first element (head).
A list would be a preferable choice if your other operations also work on the head element, i.e. you only add elements at the head. If you want to append elements to the end of the list, or mutate some elements that are already in it, you'd get better mileage out of a ResizeArray.
That said, a ResizeArray in idiomatic F# code is a rare sight. The usual approach favors (and doesn't suffer from using) immutable data structures, so seeing one usually would be a minor red flag for me.
There is not much difference between the performance of random access for an array and a list. Here's a test on my machine.
var list = Enumerable.Range(1, 100).ToList();
var array = Enumerable.Range(1, 100).ToArray();
int total = 0;
var sw = Stopwatch.StartNew();
for (int i = 0; i < 1000000000; i++) {
total ^= list[0];
}
Console.WriteLine("Time for list: {0}", sw.Elapsed);
sw.Restart();
for (int i = 0; i < 1000000000; i++) {
total ^= array[0];
}
Console.WriteLine("Time for array: {0}", sw.Elapsed);
This produces this output:
Time for list: 00:00:05.2002620
Time for array: 00:00:03.0159816
If you know you have a fixed size list, it makes sense to use an array, otherwise, there's not much cost to the list. (see update)
Update!
I found some pretty significant new information. After executing the script in release mode, the story changes quite a bit.
Time for list: 00:00:02.3048339
Time for array: 00:00:00.0805705
In this case, the performance of the array totally dominates the list. I'm pretty surprised, but the numbers don't lie.
Go with the array.
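One possible reading of the release-mode gap is that the JIT can simplify a loop that reads the same array slot on every iteration; that is an assumption, not something the numbers above prove. A variant that varies the index and prints the result may give a more representative comparison. The sketch below is hypothetical benchmark code, and the index mask of 63 is chosen only so that the index stays inside the 100-element collections.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

static class IndexBenchmark
{
    public static int XorList(List<int> list, int iterations)
    {
        int total = 0;
        for (int i = 0; i < iterations; i++)
            total ^= list[i & 63]; // index changes every iteration, stays in 0..63
        return total;
    }

    public static int XorArray(int[] array, int iterations)
    {
        int total = 0;
        for (int i = 0; i < iterations; i++)
            total ^= array[i & 63];
        return total;
    }

    static void Main()
    {
        var list = Enumerable.Range(1, 100).ToList();
        var array = list.ToArray();
        const int iterations = 1000000000;

        var sw = Stopwatch.StartNew();
        int a = XorList(list, iterations);
        Console.WriteLine("Time for list:  {0} (result {1})", sw.Elapsed, a);

        sw.Restart();
        int b = XorArray(array, iterations);
        Console.WriteLine("Time for array: {0} (result {1})", sw.Elapsed, b);
    }
}
```

Printing the result keeps the work observable, and both methods should report the same value, which is a quick check that the two loops really do the same thing.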

Struct vs class memory overhead

I'm writing an app that will create thousands of small objects and store them recursively in arrays. By "recursively" I mean that each instance of K will have an array of K instances, which will each have an array of K instances and so on; this array plus one int field are the only properties, plus some methods. I found that memory usage grows very fast even for a small amount of data (about 1MB), and when the data I'm processing is about 10MB I get the "OutOfMemoryException", not to mention when it's bigger (I have 4GB of RAM) :). So what do you suggest I do? I figured that if I created a separate class V to process those objects, so that instances of K would have only an array of K's plus one integer field, and made K a struct instead of a class, it should optimize things a bit - no garbage collection and stuff... But it's a bit of a challenge, so I'd rather ask you whether it's a good idea before I start a total rewrite :).
EDIT:
Ok, some abstract code
public void Add(string word) {
int i;
string shorterWord;
if (word.Length > 0) {
i = //something, it's really irrelevant
if (t[i] == null) {
t[i] = new MyClass();
}
shorterWord = word.Substring(1);
//end of word
if(shorterWord.Length == 0) {
t[i].WordEnd = END;
}
//saving the word letter by letter
t[i].Add(shorterWord);
}
}
}
When I looked into this more deeply, I worked from the following assumptions (they may be inexact; I'm getting old for a programmer). A class has extra memory consumption because a reference is required to address it: storing the reference takes an Int32-sized pointer on a 32-bit build. Class instances are always allocated on the heap (I can't remember whether C++ has other possibilities; I would venture yes).
The short answer, found in this article: an object has a 12-byte basic footprint plus 4 possibly unused bytes depending on your class (no doubt something to do with padding).
http://www.codeproject.com/Articles/231120/Reducing-memory-footprint-and-object-instance-size
Another issue you'll run into is that arrays also have an overhead. A possibility would be to manage your own offsets into a larger array or arrays, which in turn is getting closer to something a more efficient language would be better suited for.
I'm not sure if there are libraries that provide storage for small objects in an efficient manner. There probably are.
My take on it: use structs, manage your own offsets in a large array, and use proper packing instructions if that serves you (although I suspect this comes at a runtime cost of a few extra instructions each time you address unevenly packed data):
[StructLayout(LayoutKind.Sequential, Pack = 1)]
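As a minimal illustration of what the attribute changes, the sketch below compares the marshalled size of two otherwise identical structs. Both struct names are made up for the example, and Marshal.SizeOf reports the unmanaged layout size.

```csharp
using System;
using System.Runtime.InteropServices;

static class PackingDemo
{
    // Default packing: the int is aligned to 4 bytes, so padding follows the byte.
    [StructLayout(LayoutKind.Sequential)]
    struct DefaultPacked { public byte Flag; public int Value; }

    // Pack = 1: fields are laid out back to back with no padding.
    [StructLayout(LayoutKind.Sequential, Pack = 1)]
    struct TightlyPacked { public byte Flag; public int Value; }

    public static int DefaultSize() { return Marshal.SizeOf(typeof(DefaultPacked)); }
    public static int PackedSize() { return Marshal.SizeOf(typeof(TightlyPacked)); }

    static void Main()
    {
        Console.WriteLine("default: {0} bytes, Pack = 1: {1} bytes",
            DefaultSize(), PackedSize());
    }
}
```

With one byte and one int, the default layout takes 8 bytes and the packed layout takes 5. The saving only matters when fields of mixed sizes would otherwise force padding.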
Your stack is blowing up.
Do it iteratively instead of recursively.
You're not blowing the system stack up, you're blowing the code stack up; 10K function calls will blow it out of the water.
You need proper tail recursion, which is just an iterative hack.
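For the word-insertion code in the question, an iterative version could look like the sketch below. TrieNode is a hypothetical stand-in for the question's MyClass, and the index calculation (c - 'a', lower-case a-z only) is an assumption, since the original elides how i is computed.

```csharp
using System;

// Hypothetical node type; the question's MyClass is not shown in full.
sealed class TrieNode
{
    private readonly TrieNode[] children = new TrieNode[26]; // assumes a-z only
    private bool wordEnd;

    // Walks down one level per character in a loop: no recursion and no
    // Substring copies.
    public void Add(string word)
    {
        TrieNode node = this;
        foreach (char c in word)
        {
            int i = c - 'a';
            if (node.children[i] == null)
                node.children[i] = new TrieNode();
            node = node.children[i];
        }
        if (word.Length > 0)
            node.wordEnd = true; // mark the node of the last letter
    }

    public bool Contains(string word)
    {
        TrieNode node = this;
        foreach (char c in word)
        {
            node = node.children[c - 'a'];
            if (node == null) return false;
        }
        return node.wordEnd;
    }
}

static class TrieDemo
{
    static void Main()
    {
        var root = new TrieNode();
        root.Add("cat");
        Console.WriteLine(root.Contains("cat"));
        Console.WriteLine(root.Contains("ca"));
    }
}
```

Note that every node here still allocates a full child array up front, which may well be a large part of the memory cost described in the question; allocating the array lazily is one option to try.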
Make sure you have enough memory in your system: 100 MB or more, though it really depends on your system. A linked list of recursive objects is what you are looking at. If you keep recursing, it is going to hit the memory limit and an OutOfMemoryException will be thrown. Make sure you keep track of the memory usage of any program. Nothing is unlimited, especially memory. If memory is limited, save the data to disk.
It looks like there is infinite recursion in your code, which is why the out-of-memory exception is thrown. Check the code. Recursive code needs a start and an end. Otherwise it will go over 10 terabytes of memory at some point.
You can use a better data structure
i.e. each letter can be a byte (a-0, b-1 ...). Each word fragment can also be indexed, especially substrings - you should get away with significantly less memory (though with a performance penalty).
Just list your recursive algorithm and sanitize variable names. If you are doing BFS type of traversal and keep all objects in memory, you will run out of mem. For example, in this case, replace it with DFS.
Edit 1:
You can speed up the algo by estimating how many items you will generate then allocate that much memory at once. As the algo progresses, fill up the allocated memory. This reduces fragmentation and reallocation & copy-on-full-array operations.
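A small sketch of the pre-allocation idea, using List<int> as a stand-in for whatever collection holds the generated items. The helper counts how often the backing array has to grow for a given initial capacity.

```csharp
using System;
using System.Collections.Generic;

static class PreallocationDemo
{
    // Fills a list created with the given initial capacity and reports how
    // many times its backing array had to grow.
    public static int CountResizes(int initialCapacity, int itemCount)
    {
        var list = new List<int>(initialCapacity);
        int resizes = 0;
        int lastCapacity = list.Capacity;
        for (int i = 0; i < itemCount; i++)
        {
            list.Add(i);
            if (list.Capacity != lastCapacity)
            {
                resizes++;
                lastCapacity = list.Capacity;
            }
        }
        return resizes;
    }

    static void Main()
    {
        const int n = 1000000;
        Console.WriteLine("no estimate:   {0} resizes", CountResizes(0, n));
        Console.WriteLine("good estimate: {0} resizes", CountResizes(n, n));
    }
}
```

Each resize allocates a bigger array and copies the old contents, so a reasonable estimate up front avoids both the copying and the abandoned intermediate arrays.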
Nonetheless, after you are done operating on these generated words you should delete them from your datastructure so they can be GC-ed so you don't run out of mem.

Microsoft Visual C# 2008 Reducing number of loaded dlls

How can I reduce the number of loaded dlls when debugging in Visual C# 2008 Express Edition?
When running a visual C# project in the debugger I get an OutOfMemoryException due to fragmentation of 2GB virtual address space and we assume that the loaded dlls might be the reason for the fragmentation.
Brian Rasmussen, you made my day! :)
His proposal of "disabling the visual studio hosting process" solved the problem.
(for more information see history of question-development below)
Hi,
I need two big int-arrays to be loaded in memory with ~120 million elements (~470MB) each, and both in one Visual C# project.
When I'm trying to instantiate the 2nd Array I get an OutOfMemoryException.
I do have enough total free memory and after doing a web-search I thought my problem is that there aren't big enough contiguous free memory blocks on my system.
BUT! - when I'm instantiating only one of the arrays in one Visual C# instance and then open another Visual C# instance, the 2nd instance can instantiate an array of 470MB.
(Edit for clarification: In the paragraph above I meant running it in the debugger of Visual C#)
And the task-manager shows the corresponding memory usage-increase just as you would expect it.
So not enough contiguous memory blocks on the whole system isn't the problem. Then I tried running a compiled executable that instantiates both arrays which works also (memory usage 1GB)
Summary:
OutOfMemoryException in Visual C# using two big int arrays, but running the compiled exe works (mem usage 1GB) and two separate Visual C# instances are able to find two big enough contiguous memory blocks for my big arrays, but I need one Visual C# instance to be able to provide the memory.
Update:
First of all special thanks to nobugz and Brian Rasmussen, I think they are spot on with their prediction that "the Fragmentation of 2GB virtual address space of the process" is the problem.
Following their suggestions I used VMMap and listdlls for my short amateur-analysis and I get:
* 21 dlls listed for the "standalone"-exe. (the one that works and uses 1GB of memory.)
* 58 dlls listed for vshost.exe-version. (the version which is run when debugging and that throws the exception and only uses 500MB)
VMMap showed me the biggest free memory blocks for the debugger version to be 262, 175, 167, 155, and 108 MB.
So VMMap says that there is no contiguous 500 MB block. Going by the info about free blocks, I added ~9 smaller int-arrays, which added up to more than 1.2 GB of memory usage and actually did work.
So from that I would say that we can call "fragmentation of 2GB virtual address space" guilty.
From the listdll-output I created a small spreadsheet with hex-numbers converted to decimal to check free areas between dlls and I did find big free space for the standalone version inbetween (21) dlls but not for the vshost-debugger-version (58 dlls). I'm not claiming that there can't be anything else between and I'm not really sure if what I'm doing there makes sense but it seems consistent with VMMaps analysis and it seems as if the dlls alone already fragment the memory for the debugger-version.
So perhaps a solution would be if I would be able to reduce the number of dlls used by the debugger.
1. Is that possible?
2. If yes how would I do that?
You are battling virtual memory address space fragmentation. A process on the 32-bit version of Windows has 2 gigabytes of memory available. That memory is shared by code as well as data. Chunks of code are the CLR and the JIT compiler as well as the ngen-ed framework assemblies. Chunks of data are the various heaps used by .NET, including the loader heap (static variables) and the garbage collected heaps. These chunks are located at various addresses in the memory map. The free memory is available for you to allocate your arrays.
Problem is, a large array requires a contiguous chunk of memory. The "holes" in the address space, between chunks of code and data, are not large enough to allow you to allocate such large arrays. The first hole is typically between 450 and 550 Megabytes, that's why your first array allocation succeeded. The next available hole is a lot smaller. Too small to fit another big array, you'll get OOM even though you've got an easy gigabyte of free memory left.
You can look at the virtual memory layout of your process with the SysInternals' VMMap utility. Okay for diagnostics, but it isn't going to solve your problem. There's only one real fix, moving to a 64-bit version of Windows. Perhaps better: rethink your algorithm so it doesn't require such large arrays.
3rd update: You can reduce the number of loaded DLLs significantly by disabling the Visual Studio hosting process (project properties, debug). Doing so will still allow you to debug the application, but it will get rid of a lot of DLLs and a number of helper threads as well.
On a small test project the number of loaded DLLs went from 69 to 34 when I disabled the hosting process. I also got rid of 10+ threads. All in all a significant reduction in memory usage which should also help reduce heap fragmentation.
Additional info on the hosting process: http://msdn.microsoft.com/en-us/library/ms242202.aspx
The reason you can load the second array in a new application is that each process gets a full 2 GB virtual address space. I.e. the OS will swap pages to allow each process to address the total amount of memory. When you try to allocate both arrays in one process the runtime must be able to allocate two contiguous chunks of the desired size. What are you storing in the array? If you store objects, you need additional space for each of the objects.
Remember an application doesn't actually request physical memory. Instead each application is given an address space from which it can allocate virtual memory. The OS then maps the virtual memory to physical memory. It is a rather complex process (Russinovich spends 100+ pages on how Windows handles memory in his Windows Internals book). For more details on how Windows does this please see http://blogs.technet.com/markrussinovich/archive/2008/11/17/3155406.aspx
Update: I've been pondering this question for a while and it does sound a bit odd. When you run the application through Visual Studio, you may see additional modules loaded depending on your configuration. On my setup I get a number of different DLLs loaded during debug due to profilers and TypeMock (which essentially does its magic via the profiler hooks).
Depending on the size and load address of these they may prevent the runtime from allocating contiguous memory. Having said that, I am still a bit surprised that you get an OOM after allocating just two of those big arrays as their combined size is less than 1 GB.
You can look at the loaded DLLs using the listdlls tools from SysInternals. It will show you load addresses and size. Alternatively, you can use WinDbg. The lm command shows loaded modules. If you want size as well, you need to specify the v option for verbose output. WinDbg will also allow you to examine the .NET heaps, which may help you to pinpoint why memory cannot be allocated.
2nd Update: If you're on Windows XP, you can try to rebase some of the loaded DLLs to free up more contiguous space. Vista and Windows 7 use ASLR, so I am not sure you'll benefit from rebasing on those platforms.
This isn't an answer per se, but perhaps an alternative might work.
If the problem is indeed that you have fragmented memory, then perhaps one workaround would be to just use those holes, instead of trying to find a hole big enough for everything consecutively.
Here's a very simple BigArray class that doesn't add too much overhead (some overhead is introduced, especially in the constructor, in order to initialize the buckets).
The statistics for the array is:
Main executes in 404ms
static Program-constructor doesn't show up
The statistics for the class is:
Main took 473ms
static Program-constructor takes 837ms (initializing the buckets)
The class allocates a bunch of 8192-element arrays (13-bit indexes), which on 64-bit for reference types will fall below the LOH (large object heap) limit. If you're only going to use this for Int32, you can probably up this to 14 and probably even make it nongeneric, although I doubt it will improve performance much.
In the other direction, if you're afraid you're going to have a lot of holes smaller than the 8192-element arrays (64KB on 64-bit or 32KB on 32-bit), you can just reduce the bit-size for the bucket indexes through its constant. This will add more overhead to the constructor, and add more memory-overhead, since the outmost array will be bigger, but the performance should not be affected.
Here's the code:
using System;
using NUnit.Framework;
namespace ConsoleApplication5
{
class Program
{
// static int[] a = new int[100 * 1024 * 1024];
static BigArray<int> a = new BigArray<int>(100 * 1024 * 1024);
static void Main(string[] args)
{
int l = a.Length;
for (int index = 0; index < l; index++)
a[index] = index;
for (int index = 0; index < l; index++)
if (a[index] != index)
throw new InvalidOperationException();
}
}
[TestFixture]
public class BigArrayTests
{
[Test]
public void Constructor_ZeroLength_ThrowsArgumentOutOfRangeException()
{
Assert.Throws<ArgumentOutOfRangeException>(() =>
{
new BigArray<int>(0);
});
}
[Test]
public void Constructor_NegativeLength_ThrowsArgumentOutOfRangeException()
{
Assert.Throws<ArgumentOutOfRangeException>(() =>
{
new BigArray<int>(-1);
});
}
[Test]
public void Indexer_SetsAndRetrievesCorrectValues()
{
BigArray<int> array = new BigArray<int>(10001);
for (int index = 0; index < array.Length; index++)
array[index] = index;
for (int index = 0; index < array.Length; index++)
Assert.That(array[index], Is.EqualTo(index));
}
private const int PRIME_ARRAY_SIZE = 10007;
[Test]
public void Indexer_RetrieveElementJustPastEnd_ThrowsIndexOutOfRangeException()
{
BigArray<int> array = new BigArray<int>(PRIME_ARRAY_SIZE);
Assert.Throws<IndexOutOfRangeException>(() =>
{
array[PRIME_ARRAY_SIZE] = 0;
});
}
[Test]
public void Indexer_RetrieveElementJustBeforeStart_ThrowsIndexOutOfRangeException()
{
BigArray<int> array = new BigArray<int>(PRIME_ARRAY_SIZE);
Assert.Throws<IndexOutOfRangeException>(() =>
{
array[-1] = 0;
});
}
[Test]
public void Constructor_BoundarySizes_ProducesCorrectlySizedArrays()
{
for (int index = 1; index < 16384; index++)
{
BigArray<int> arr = new BigArray<int>(index);
Assert.That(arr.Length, Is.EqualTo(index));
arr[index - 1] = 42;
Assert.That(arr[index - 1], Is.EqualTo(42));
Assert.Throws<IndexOutOfRangeException>(() =>
{
arr[index] = 42;
});
}
}
}
public class BigArray<T>
{
const int BUCKET_INDEX_BITS = 13;
const int BUCKET_SIZE = 1 << BUCKET_INDEX_BITS;
const int BUCKET_INDEX_MASK = BUCKET_SIZE - 1;
private readonly T[][] _Buckets;
private readonly int _Length;
public BigArray(int length)
{
if (length < 1)
throw new ArgumentOutOfRangeException("length");
_Length = length;
int bucketCount = length >> BUCKET_INDEX_BITS;
bool lastBucketIsFull = true;
if ((length & BUCKET_INDEX_MASK) != 0)
{
bucketCount++;
lastBucketIsFull = false;
}
_Buckets = new T[bucketCount][];
for (int index = 0; index < bucketCount; index++)
{
if (index < bucketCount - 1 || lastBucketIsFull)
_Buckets[index] = new T[BUCKET_SIZE];
else
_Buckets[index] = new T[(length & BUCKET_INDEX_MASK)];
}
}
public int Length
{
get
{
return _Length;
}
}
public T this[int index]
{
get
{
return _Buckets[index >> BUCKET_INDEX_BITS][index & BUCKET_INDEX_MASK];
}
set
{
_Buckets[index >> BUCKET_INDEX_BITS][index & BUCKET_INDEX_MASK] = value;
}
}
}
}
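As a worked example of the index arithmetic in the indexer, the sketch below splits a flat index into a bucket number and an offset using the same 13-bit constants. The helper names are made up for the illustration.

```csharp
using System;

static class BucketMath
{
    const int BucketIndexBits = 13;
    const int BucketSize = 1 << BucketIndexBits;   // 8192 elements per bucket
    const int BucketIndexMask = BucketSize - 1;    // 8191, the low 13 bits

    // High bits select the bucket, low 13 bits select the slot inside it.
    public static int BucketOf(int index) { return index >> BucketIndexBits; }
    public static int OffsetOf(int index) { return index & BucketIndexMask; }

    // Element data per full bucket for a given element size in bytes.
    public static int BucketBytes(int elementSize) { return BucketSize * elementSize; }

    static void Main()
    {
        int index = 10000;
        Console.WriteLine("index {0} -> bucket {1}, offset {2}",
            index, BucketOf(index), OffsetOf(index));
        Console.WriteLine("bucket payload for 8-byte elements: {0} bytes",
            BucketBytes(8));
    }
}
```

Index 10000 lands in bucket 1 at offset 1808 (10000 - 8192). A full bucket of 8-byte elements holds 65,536 bytes of data, which stays below the 85,000-byte threshold at which .NET places an array on the large object heap.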
I had a similar issue once and what I ended up doing was using a list instead of an array. When creating the lists I set the capacity to the required sizes and I defined both lists BEFORE I tried adding values to them. I'm not sure if you can use lists instead of arrays but it might be something to consider. In the end I had to run the executable on a 64-bit OS, because when I added the items to the list the overall memory usage went above 2GB, but at least I was able to run and debug locally with a reduced set of data.
A question: Are all elements of your array occupied? If many of them contain some default value then maybe you could reduce memory consumption using an implementation of a sparse array that only allocates memory for the non-default values. Just a thought.
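A minimal sketch of such a sparse array, assuming a Dictionary-backed design. The SparseArray<T> class is hypothetical and not taken from any library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: stores only the entries that differ from default(T).
sealed class SparseArray<T>
{
    private readonly Dictionary<int, T> values = new Dictionary<int, T>();
    private readonly int length;

    public SparseArray(int length)
    {
        if (length < 1) throw new ArgumentOutOfRangeException("length");
        this.length = length;
    }

    public int Length { get { return length; } }
    public int StoredCount { get { return values.Count; } }

    public T this[int index]
    {
        get
        {
            CheckIndex(index);
            T value;
            return values.TryGetValue(index, out value) ? value : default(T);
        }
        set
        {
            CheckIndex(index);
            if (EqualityComparer<T>.Default.Equals(value, default(T)))
                values.Remove(index);   // default values take no space
            else
                values[index] = value;
        }
    }

    private void CheckIndex(int index)
    {
        if (index < 0 || index >= length) throw new IndexOutOfRangeException();
    }
}

static class SparseArrayDemo
{
    static void Main()
    {
        var a = new SparseArray<int>(120000000);
        a[5] = 42;
        Console.WriteLine("{0} {1} stored={2}", a[5], a[6], a.StoredCount);
    }
}
```

Each dictionary entry costs noticeably more than a plain 4-byte array slot, so this only pays off when most elements really do hold the default value.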
Each 32bit process has a 2GB address space (unless you ask the user to add /3GB in boot options), so if you can accept some performance drop-off, you can start a new process to get 2GB more in address space - well, a little less than that. The new process would be still fragmented with all the CLR dlls plus all the Win32 DLLs they use, so you can get rid of all address space fragmentation caused by CLR dlls by writing the new process in a native language e.g. C++. You can even move some of your calculation to the new process so you get more address space in your main app and less chatty with your main process.
You can communicate between your processes using any of the interprocess communication methods. You can find many IPC samples in the All-In-One Code Framework.
I have experience with two desktop applications and one mobile application hitting out-of-memory limits. I understand the issues. I do not know your requirements, but I suggest moving your lookup arrays into SQL CE. Performance is good, you will be surprised, and SQL CE is in-process. With the last desktop application, I was able to reduce my memory footprint from 2.1GB to 720MB, which had the benefit of speeding up the application due to significantly reducing page faults. (Your problem is fragmentation of the AppDomain's memory, which you have no control over.)
Honestly, I do not think you will be satisfied with performance after squeezing these arrays into memory. Don't forget, excessive page faults have a significant impact on performance.
If you do go SqlServerCe, make sure to keep the connection open to improve performance. Also, single row lookups (scalar) may be slower than returning a result set.
If you really want to know what is going on with memory, use CLR Profiler. VMMap is not going to help. The OS does not allocate memory to your application. The Framework does, by grabbing large chunks of OS memory for itself (caching the memory) and then allocating pieces of this memory to applications when needed.
CLR Profiler for the .NET Framework 2.0 at
https://github.com/MicrosoftArchive/clrprofiler

Back to basics; for-loops, arrays/vectors/lists, and optimization

I was working on some code recently and came across a method that had 3 for-loops that worked on 2 different arrays.
Basically, what was happening was a foreach loop would walk through a vector and convert a DateTime from an object, and then another foreach loop would convert a long value from an object. Each of these loops would store the converted value into lists.
The final loop would go through these two lists and store those values into yet another list because one final conversion needed to be done for the date.
Then after all that is said and done, The final two lists are converted to an array using ToArray().
Ok, bear with me, I'm finally getting to my question.
So, I decided to make a single for loop to replace the first two foreach loops and convert the values in one fell swoop (the third loop is quasi-necessary, although, I'm sure with some working I could also put it into the single loop).
But then I read the article "What your computer does while you wait" by Gustavo Duarte and started thinking about memory management and what the data was doing while it's being accessed in the for-loop where two lists are being accessed simultaneously.
So my question is, what is the best approach for something like this? Try to condense the for-loops so it happens in as few loops as possible, causing multiple data accesses for the different lists? Or allow the multiple loops and let the system bring in the data it's anticipating? These lists and arrays can be potentially large and looping through 3 lists, perhaps 4 depending on how ToArray() is implemented, can get very costly (O(n^3) ??). But from what I understood in said article and from my CS classes, having to fetch data can be expensive too.
Would anyone like to provide any insight? Or have I completely gone off my rocker and need to relearn what I have unlearned?
Thank you
The best approach? Write the most readable code, work out its complexity, and work out if that's actually a problem.
If each of your loops is O(n), then you've still only got an O(n) operation.
Having said that, it does sound like a LINQ approach would be more readable... and quite possibly more efficient as well. Admittedly we haven't seen the code, but I suspect it's the kind of thing which is ideal for LINQ.
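A rough idea of what a LINQ version might look like is sketched below. The real code is not shown, so everything here is an assumption: the inputs are taken to be arrays of boxed values, the names vectorA, vectorB, xArray and yArray are borrowed from the pseudocode posted further down, and ToOADate is only a placeholder for the unspecified final date conversion.

```csharp
using System;
using System.Linq;

static class PrepareDataSketch
{
    // Assumption: both inputs hold boxed values, as "convert ... from an
    // object" in the question suggests.
    public static void PrepareData(object[] vectorA, object[] vectorB,
                                   out double[] xArray, out long[] yArray)
    {
        // ToOADate stands in for the unspecified "final conversion" of the date.
        xArray = vectorA.Select(o => Convert.ToDateTime(o).ToOADate()).ToArray();
        yArray = vectorB.Select(o => Convert.ToInt64(o)).ToArray();
    }

    static void Main()
    {
        object[] a = { new DateTime(2020, 1, 1), new DateTime(2020, 1, 2) };
        object[] b = { 10L, 20L };
        double[] x;
        long[] y;
        PrepareData(a, b, out x, out y);
        Console.WriteLine("{0} dates, {1} values", x.Length, y.Length);
    }
}
```

Each pipeline makes a single pass over its input and needs no intermediate lists, so the overall work stays O(n).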
For reference, the article is What Your Computer Does While You Wait by Gustavo Duarte.
Also there's a guide to big-O notation.
It's impossible to answer the question without being able to see code/pseudocode. The only reliable answer is "use a profiler". Assuming what your loops are doing is a disservice to you and anyone who reads this question.
Well, you've got complications if the two vectors are of different sizes. As has already been pointed out, this doesn't increase the overall complexity of the issue, so I'd stick with the simplest code - which is probably 2 loops, rather than 1 loop with complicated test conditions re the two different lengths.
Actually, these length tests could easily make the two loops quicker than a single loop. You might also get better memory fetch performance with 2 loops - i.e. you are looking at contiguous memory - i.e. A[0],A[1],A[2]... B[0],B[1],B[2]..., rather than A[0],B[0],A[1],B[1],A[2],B[2]...
So in every way, I'd go with 2 separate loops ;-p
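To compare the two access patterns on real hardware, something like the following sketch could be timed. It uses plain int arrays and a simple sum as a stand-in for the actual conversions, which are assumptions made only for the example.

```csharp
using System;
using System.Diagnostics;

static class LoopLayout
{
    // One pass per array: A[0], A[1], ... then B[0], B[1], ...
    public static long SeparateLoops(int[] a, int[] b)
    {
        long sum = 0;
        for (int i = 0; i < a.Length; i++) sum += a[i];
        for (int i = 0; i < b.Length; i++) sum += b[i];
        return sum;
    }

    // One fused pass: A[0], B[0], A[1], B[1], ... plus tails for any
    // length difference between the two arrays.
    public static long FusedLoop(int[] a, int[] b)
    {
        long sum = 0;
        int common = Math.Min(a.Length, b.Length);
        for (int i = 0; i < common; i++) sum += (long)a[i] + b[i];
        for (int i = common; i < a.Length; i++) sum += a[i];
        for (int i = common; i < b.Length; i++) sum += b[i];
        return sum;
    }

    static void Main()
    {
        var a = new int[10000000];
        var b = new int[12000000];
        for (int i = 0; i < a.Length; i++) a[i] = i;
        for (int i = 0; i < b.Length; i++) b[i] = -i;

        var sw = Stopwatch.StartNew();
        long s1 = SeparateLoops(a, b);
        Console.WriteLine("separate: {0} ms (sum {1})", sw.ElapsedMilliseconds, s1);

        sw.Restart();
        long s2 = FusedLoop(a, b);
        Console.WriteLine("fused:    {0} ms (sum {1})", sw.ElapsedMilliseconds, s2);
    }
}
```

The fused version needs the extra tail loops once the lengths differ, which is the kind of added complexity the answer above warns about; which one runs faster has to be measured.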
Am I understanding you correctly in this?
You have these loops:
for (...){
// Do A
}
for (...){
// Do B
}
for (...){
// Do C
}
And you converted it into
for (...){
// Do A
// Do B
}
for (...){
// Do C
}
and you're wondering which is faster?
If not, some pseudocode would be nice, so we could see what you meant. :)
Impossible to say. It could go either way. You're right, fetching data is expensive, but locality is also important. The first version may be better for data locality, but on the other hand, the second has bigger blocks with no branches, allowing more efficient instruction scheduling.
If the extra performance really matters (as Jon Skeet says, it probably doesn't, and you should pick whatever is most readable), you really need to measure both options, to see which is fastest.
My gut feeling says the second, with more work being done between jump instructions, would be more efficient, but it's just a hunch, and it can easily be wrong.
Aside from cache thrashing on large functions, there may be benefits on tiny functions as well. This applies on any auto-vectorizing compiler (not sure if Java JIT will do this yet, but you can count on it eventually).
Suppose this is your code:
// if this compiles down to a raw memory copy with a bitmask...
Date morningOf(Date d) { return Date(d.year, d.month, d.day, 0, 0, 0); }
Date timestamps[N];
Date mornings[N];
// ... then this can be parallelized using SSE or other SIMD instructions
for (int i = 0; i != N; ++i)
mornings[i] = morningOf(timestamps[i]);
// ... and this will just run like normal
for (int i = 0; i != N; ++i)
doOtherCrap(mornings[i]);
For large data sets, splitting the vectorizable code out into a separate loop can be a big win (provided caching doesn't become a problem). If it was all left as a single loop, no vectorization would occur.
This is something that Intel recommends in their C/C++ optimization manual, and it really can make a big difference.
... working on one piece of data but with two functions can sometimes make it so that code to act on that data doesn't fit in the processor's low level caches.
for (i = 0; i < 10; i++) {
MyObject obj = array[i];
obj.functionreallybig1(); // pushes functionreallybig2 out of cache
obj.functionreallybig2(); // pushes functionreallybig1 out of cache
}
vs
for (i = 0; i < 10; i++) {
MyObject obj = array[i];
obj.functionreallybig1(); // this stays in the cache next time through loop
}
for (i = 0; i < 10; i++) {
MyObject obj = array[i];
obj.functionreallybig2(); // this stays in the cache next time through loop
}
But it was probably a mistake (usually this type of trick is commented)
When data is cyclically loaded and unloaded like this, it is called cache thrashing, btw.
This is a separate issue from the data these functions are working on, as typically the processor caches that separately.
I apologize for not responding sooner and providing any kind of code. I got sidetracked on my project and had to work on something else.
To answer anyone still monitoring this question;
Yes, like jalf said, the function is something like:
PrepareData(vectorA, VectorB, xArray, yArray):
listA
listB
foreach(value in vectorA)
convert values insert in listA
foreach(value in vectorB)
convert values insert in listB
listC
listD
for(int i = 0; i < listB.count; i++)
listC[i] = listB[i] converted to something
listD[i] = listA[i]
xArray = listC.ToArray()
yArray = listD.ToArray()
I changed it to:
PrepareData(vectorA, vectorB, ref xArray, ref yArray):
listA
listB
for(int i = 0; i < vectorA.count && i < vectorB.count; i++)
convert values insert in listA
convert values insert in listB
listC
listD
for(int i = 0; i < listB.count; i++)
listC[i] = listB[i] converted to something
listD[i] = listA[i]
xArray = listC.ToArray()
yArray = listD.ToArray()
Keeping in mind that the vectors can potentially have a large number of items. I figured the second one would be better, so that the program wouldn't have to loop n times 2 or 3 different times. But then I started to wonder about the affects (effects?) of memory fetching, or prefetching, or what have you.
So, I hope this helps to clear up the question, although a good number of you have provided excellent answers.
Thank you every one for the information. Thinking in terms of Big-O and how to optimize has never been my strong point. I believe I am going to put the code back to the way it was, I should have trusted the way it was written before instead of jumping on my novice instincts. Also, in the future I will put more reference so everyone can understand what the heck I'm talking about (clarity is also not a strong point of mine :-/).
Thank you again.
