There is a function that multiplies two matrices in the usual way:
public IMatrix Multiply(IMatrix m1, IMatrix m2)
{
var resultMatrix = new Matrix(m1.RowCount, m2.ColCount);
for (long i = 0; i < m1.RowCount; i++)
{
for (byte j = 0; j < m2.ColCount; j++)
{
long sum = 0;
for (byte k = 0; k < m1.ColCount; k++)
{
sum += m1.GetElement(i, k) * m2.GetElement(k, j);
}
resultMatrix.SetElement(i, j, sum);
}
}
return resultMatrix;
}
This function should be rewritten to run in parallel using Parallel.ForEach. I tried it this way:
public IMatrix Multiply(IMatrix m1, IMatrix m2)
{
// todo: feel free to add your code here
var resultMatrix = new Matrix(m1.RowCount, m2.ColCount);
Parallel.ForEach(m1.RowCount, row =>
{
for (byte j = 0; j < m2.ColCount; j++)
{
long sum = 0;
for (byte k = 0; k < m1.ColCount; k++)
{
sum += m1.GetElement(row, k) * m2.GetElement(k, j);
}
resultMatrix.SetElement(row, j, sum);
}
});
return resultMatrix;
}
But there is an error with the type argument in the loop. How can I fix it?
Just use Parallel.For instead of Parallel.ForEach; that should let you keep the exact same body as the non-parallel version:
Parallel.For(0, m1.RowCount, i =>
{
...
});
Note that only fairly large matrices will benefit from parallelization, so if you are working with 4x4 matrices for graphics, this is not the approach to take.
One problem with multiplying matrices is that the innermost loop has to read one value from each row of one of the matrices. This access pattern is hard for the processor to cache and can cause lots of cache misses. A fairly easy optimization is to copy an entire column to a temporary array and do all the computations that need this column before reading the next one. This makes all memory access linear and easy to cache. It does more work overall, but the better cache utilization easily makes it a win. There are even more cache-efficient methods, but the complexity also tends to increase.
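The column-copy idea can be sketched roughly as follows. This is only an illustration under assumptions: it uses plain double[,] arrays instead of the IMatrix interface from the question, and the class and method names are made up for the example. Each worker gets its own temporary column buffer through the thread-local overload of Parallel.For, so the buffer is not reallocated for every column.

```csharp
using System;
using System.Threading.Tasks;

public static class ColumnCopyMultiply
{
    // Hypothetical helper (not from the question): multiplies a (rows x inner)
    // by b (inner x cols), copying one column of b at a time into a buffer.
    public static double[,] Multiply(double[,] a, double[,] b)
    {
        int rows = a.GetLength(0), inner = a.GetLength(1), cols = b.GetLength(1);
        if (inner != b.GetLength(0))
            throw new ArgumentException("Inner dimensions must match.");

        var result = new double[rows, cols];

        Parallel.For(0, cols,
            () => new double[inner],            // one column buffer per worker
            (j, state, column) =>
            {
                // Copy column j of b so the inner loop below reads linearly.
                for (int k = 0; k < inner; k++)
                    column[k] = b[k, j];

                // Every result element in column j needs exactly this column.
                for (int i = 0; i < rows; i++)
                {
                    double sum = 0;
                    for (int k = 0; k < inner; k++)
                        sum += a[i, k] * column[k];
                    result[i, j] = sum;         // distinct cell per (i, j): no locking needed
                }
                return column;
            },
            column => { });

        return result;
    }

    public static void Main()
    {
        var a = new double[,] { { 1, 2 }, { 3, 4 } };
        var b = new double[,] { { 5, 6 }, { 7, 8 } };
        var c = Multiply(a, b);
        Console.WriteLine($"{c[0, 0]} {c[0, 1]} {c[1, 0]} {c[1, 1]}"); // 19 22 43 50
    }
}
```

Whether this actually beats the naive loop depends on matrix size and hardware, so measure before adopting it.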
Another optimization would be to use SIMD, but this might require platform specific code for best performance, and will likely involve more work. But you might be able to find libraries that are already optimized.
But perhaps most importantly, profile your code. It is quite easy to have simple things consume a lot of time. For example, you are using an interface, so you may have a virtual method call for each memory access that cannot be inlined, potentially causing a severe performance penalty compared to direct array access.
ForEach receives a collection, IEnumerable as the first argument and m1.RowCount is a number.
Probably Parallel.For() is what you wanted.
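To make the difference concrete, here is a small sketch (the class and method names are invented for illustration). Parallel.For takes the numeric bounds directly, while Parallel.ForEach needs a sequence, which you could build with Enumerable.Range if you really wanted to keep ForEach:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class RowLoopDemo
{
    // Parallel.For: the bounds are plain numbers.
    public static long[] SquaresWithFor(int rowCount)
    {
        var output = new long[rowCount];
        Parallel.For(0, rowCount, i =>
        {
            output[i] = (long)i * i;
        });
        return output;
    }

    // Parallel.ForEach: the first argument must be a collection,
    // so the row indices are materialised as a sequence first.
    public static long[] SquaresWithForEach(int rowCount)
    {
        var output = new long[rowCount];
        Parallel.ForEach(Enumerable.Range(0, rowCount), i =>
        {
            output[i] = (long)i * i;
        });
        return output;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", SquaresWithFor(4)));     // 0,1,4,9
        Console.WriteLine(string.Join(",", SquaresWithForEach(4))); // 0,1,4,9
    }
}
```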
I have two BitArray objects and want to check which values in the second one differ from the first, then return the indices of the changed bits. I have tried looping over each bit, but it takes too much time. For example, given these two objects:
BitArray a = new BitArray(new[] { true, false, true });
BitArray b = new BitArray(new[] { false, false, false });
I want to return 0,2, because BitArray b has two changes compared to BitArray a.
If performance is your main goal here, you're not going to get there using BitArray; that abstraction is simply not optimal. You probably want to drop to your own oversized integer buffers, use "xor" on each to compute the delta, then use bit operations on the delta (xor result).
However, on .NET Core 3, you have direct access to the x86 instruction set, giving us both SIMD and popcnt; we can combine those things nicely here, using a SIMD XOR and then popcnt on the delta (there is no SIMD popcnt AFAIK, but we can unroll manually):
// make sure these are multiples of 128-bit, so: 4; otherwise
// you'll have to deal with the leftover bits manually
uint[] left = new uint[16], right = new uint[16];
Random rand = new Random(12345);
for (int i = 0; i < left.Length; i++)
left[i] = (uint)rand.Next();
for (int i = 0; i < right.Length; i++)
right[i] = (uint)rand.Next();
// real(ish) code starts here
// loop over our `uint[]` as spans of Vector128<uint>
var lspan = MemoryMarshal.Cast<uint, Vector128<uint>>(left);
var rspan = MemoryMarshal.Cast<uint, Vector128<uint>>(right);
uint count = 0;
for(int i = 0; i < lspan.Length; i++)
{
// compute the bit delta
var delta = Sse2.Xor(lspan[i], rspan[i]);
// Vector128 is 4xUInt32, so: unroll
count += Popcnt.PopCount(delta.GetElement(0))
+ Popcnt.PopCount(delta.GetElement(1))
+ Popcnt.PopCount(delta.GetElement(2))
+ Popcnt.PopCount(delta.GetElement(3));
}
Console.WriteLine(count);
You could also use the more generic Vector<T> for the xor (which works on .NET Framework too, and can handle wider sizes than 128), but: no direct popcount then; example:
// loop over our `uint[]` as spans of Vector<uint>
var lspan = MemoryMarshal.Cast<uint, Vector<uint>>(left);
var rspan = MemoryMarshal.Cast<uint, Vector<uint>>(right);
for(int i = 0; i < lspan.Length; i++)
{
// compute the bit delta
var delta = lspan[i] ^ rspan[i];
// work with delta...
}
This (Vector<T>) will commonly give you SIMD widths of 256, or possibly even 512.
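The snippets above count the changed bits, while the question asks for their indices. A minimal sketch of the "xor, then bit operations on the delta" idea from the start of this answer could look like this. It assumes the bits are stored in ulong[] buffers (bit 0 of word 0 is index 0) and that System.Numerics.BitOperations is available (.NET Core 3.0 or later); the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

public static class BitDelta
{
    // Returns the indices of all bits that differ between the two buffers.
    public static List<int> ChangedIndices(ulong[] left, ulong[] right)
    {
        if (left.Length != right.Length)
            throw new ArgumentException("Buffers must have the same length.");

        var indices = new List<int>();
        for (int word = 0; word < left.Length; word++)
        {
            ulong delta = left[word] ^ right[word]; // set bits mark differences
            while (delta != 0)
            {
                // Position of the lowest set bit inside this 64-bit word.
                int bit = BitOperations.TrailingZeroCount(delta);
                indices.Add(word * 64 + bit);
                delta &= delta - 1;                 // clear that lowest set bit
            }
        }
        return indices;
    }

    public static void Main()
    {
        var a = new ulong[] { 0b101 };
        var b = new ulong[] { 0b000 };
        Console.WriteLine(string.Join(",", ChangedIndices(a, b))); // 0,2
    }
}
```

Words that are identical cost a single xor and compare, so the work is proportional to the number of words plus the number of changed bits.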
If the element count of both bit arrays is the same, you can use the following:
IList<int> differentIndices = new List<int>();
for(int i=0;i<a.Count;i++)
{
if (a[i] ^ b[i])
{
differentIndices.Add(i);
}
}
Just use XOR operation
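If you stay with BitArray, the XOR can also be done on the whole array at once with BitArray.Xor. A small sketch (the helper name is made up; note that Xor modifies the instance it is called on, so a copy is made first):

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class BitArrayDiff
{
    public static List<int> ChangedIndices(BitArray a, BitArray b)
    {
        // Copy a before xor-ing: BitArray.Xor mutates the current instance.
        BitArray delta = new BitArray(a).Xor(b);

        var indices = new List<int>();
        for (int i = 0; i < delta.Count; i++)
        {
            if (delta[i])
                indices.Add(i);
        }
        return indices;
    }

    public static void Main()
    {
        var a = new BitArray(new[] { true, false, true });
        var b = new BitArray(new[] { false, false, false });
        Console.WriteLine(string.Join(",", ChangedIndices(a, b))); // 0,2
    }
}
```

This still walks every bit to collect the indices, so it is mainly a readability change, not the faster approach discussed in the other answer.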
There are many operations on arrays that do not depend on the rank of an array. Iterators are also not always a suitable solution. Given the array
double[,] myarray = new double[10,5];
it would be desirable to realize the following workflow:
Reshape an array of Rank>1 to a linear array with rank=1 with the same number of elements. This should happen in place to be runtime efficient. Copying is not allowed.
Pass the reshaped array to a method defined for Rank=1 arrays only, e.g. Array.Copy()
Reshape result array to original rank and dimensions.
There is a similar question on this topic: How to reshape array in c#. The solutions there use memory copy operation with BlockCopy().
My question is:
Can this kind of reshaping be realized without memory copy? Or even in a temporary way like creating a new view on the data?
The wording of this is a little tough, yet surely pointers, unsafe and fixed would work. No memory copy, direct access, add pepper and salt to taste.
The CLR just won't let you cast an array the way you want; any other method you can think of will require allocating a new array and copying (which, mind you, can be lightning fast). The only other possible way to do this is to use fixed, which gives you a pointer to the contiguous elements that you can treat as a 1-dimensional array.
unsafe public static void SomeMethod(int* p, int size)
{
for (var i = 0; i < size; i++)
{
//Perform any linear operation
*(p + i) *= 10;
}
}
...
var someArray = new int[2,2];
someArray[0, 0] = 1;
someArray[0,1] = 2;
someArray[1, 0] = 3;
someArray[1, 1] = 4;
//Reshape an array to a linear array
fixed (int* p = someArray)
{
SomeMethod(p, 4);
}
//Reshape result array to original rank and dimensions.
for (int i = 0; i < 2; i++)
{
for (int j = 0; j < 2; j++)
{
Console.WriteLine(someArray[i, j]);
}
}
Output
10
20
30
40
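On newer runtimes there is possibly another route that avoids unsafe code: a Span<T> created over the first element of the multidimensional array. This is only a sketch under assumptions. It relies on the elements of a rectangular array being stored contiguously in row-major order, it needs MemoryMarshal.CreateSpan (.NET Core 2.1 or later), and it only helps with methods that accept a Span<T>, not ones that insist on a real T[]. The helper name is invented for the example.

```csharp
using System;
using System.Runtime.InteropServices;

public static class FlatView
{
    // Returns a linear, no-copy view over all elements of a 2D array.
    public static Span<double> AsFlatSpan(double[,] array)
    {
        if (array.Length == 0)
            return Span<double>.Empty;

        // Assumption: elements are laid out contiguously, row after row.
        return MemoryMarshal.CreateSpan(ref array[0, 0], array.Length);
    }

    public static void Main()
    {
        var grid = new double[2, 3];
        Span<double> flat = AsFlatSpan(grid);

        // Any rank-1 style operation works on the view...
        for (int i = 0; i < flat.Length; i++)
            flat[i] = i;

        // ...and the original array sees the result with its original shape.
        Console.WriteLine(grid[1, 2]); // 5
    }
}
```

Nothing has to be "reshaped back" afterwards, because the span and the array share the same memory.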
I have a piece of code that reads points from an STL file, applies a transformation matrix to them, and writes the results to another STL file. It all works, but it's too slow: about 5 minutes or more.
Here is the matrix multiplication code; it receives the two matrices and multiplies them:
public double[,] MultiplyMatrix(double[,] A, double[,] B)
{
int rA = A.GetLength(0);
int cA = A.GetLength(1);
int rB = B.GetLength(0);
int cB = B.GetLength(1);
double temp = 0;
double[,] kHasil = new double[rA, cB];
if (cA != rB)
{
MessageBox.Show("matrix can't be multiplied !!");
}
else
{
for (int i = 0; i < rA; i++)
{
for (int j = 0; j < cB; j++)
{
temp = 0;
for (int k = 0; k < cA; k++)
{
temp += A[i, k] * B[k, j];
}
kHasil[i, j] = temp;
}
}
return kHasil;
}
return kHasil;
}
My problem is that the whole thing is too slow: it has to read from an STL file, multiply all the points and write the results to another STL file, and it spends 5-10 minutes doing that. I see that commercial programs, like CloudCompare, do all these operations in a few seconds.
Can anyone tell me how I can do it faster? Is there any library to do that faster than my code?
Thank you! :)
I found this on the internet:
double[] iRowA = A[i];
double[] iRowC = C[i];
for (int k = 0; k < N; k++) {
double[] kRowB = B[k];
double ikA = iRowA[k];
for (int j = 0; j < N; j++) {
iRowC[j] += ikA * kRowB[j];
}
}
Then use PLINQ:
var source = Enumerable.Range(0, N);
var pquery = from num in source.AsParallel()
select num;
pquery.ForAll((e) => Popt(A, B, C, e));
Where Popt is our method name taking 3 jagged arrays (C = A * B) and the row to calculate (e). How fast is this:
Name    Milliseconds
Popt    187
Source is: Daniweb
That's over 12 times faster than our original code! With the magic of PLINQ we are creating 500 threads in this example and don't have to manage a single one of them, everything is handled for you.
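The quoted fragments do not show the Popt method itself, so here is a guess at how the pieces could fit together. Everything in this sketch beyond the inner loop and the PLINQ call quoted above (the class name, the Multiply wrapper, the way sizes are read from the arrays) is an assumption made for illustration.

```csharp
using System;
using System.Linq;

public static class JaggedMultiply
{
    // Computes row i of C = A * B using the i-k-j loop order quoted above,
    // so the innermost loop walks both kRowB and iRowC linearly.
    public static void Popt(double[][] A, double[][] B, double[][] C, int i)
    {
        double[] iRowA = A[i];
        double[] iRowC = C[i];
        for (int k = 0; k < B.Length; k++)
        {
            double[] kRowB = B[k];
            double ikA = iRowA[k];
            for (int j = 0; j < iRowC.Length; j++)
                iRowC[j] += ikA * kRowB[j];
        }
    }

    // Hypothetical wrapper: allocates C and computes its rows in parallel.
    public static double[][] Multiply(double[][] A, double[][] B)
    {
        int rows = A.Length, cols = B[0].Length;
        var C = new double[rows][];
        for (int i = 0; i < rows; i++)
            C[i] = new double[cols];

        // Each row of C is written by exactly one call, so no locking is needed.
        Enumerable.Range(0, rows).AsParallel().ForAll(i => Popt(A, B, C, i));
        return C;
    }

    public static void Main()
    {
        var A = new[] { new double[] { 1, 2 }, new double[] { 3, 4 } };
        var B = new[] { new double[] { 5, 6 }, new double[] { 7, 8 } };
        var C = Multiply(A, B);
        Console.WriteLine($"{C[0][0]} {C[0][1]} {C[1][0]} {C[1][1]}"); // 19 22 43 50
    }
}
```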
You have a couple of options:
Rewrite your code with jagged arrays (like double[][] A); it should give a ~2x increase in speed.
Write an unmanaged C/C++ DLL with the matrix multiplication code.
Use a third-party math library that has a native BLAS implementation under the hood. I suggest Math.NET Numerics; it can be switched to use Intel MKL, which is smoking fast.
Probably, the third option is the best.
Just for the record: CloudCompare is not a commercial product. It's a free open-source project. And there is no 'huge team of developers' (only a handful of them actually, doing this in their free time).
Here is our biggest secret: we use pure C++ code ;). And we rarely use multi-threading but for very lengthy processes (you have to take the thread management and processing time overhead into account).
Here are a few 'best practice' rules for the parts of the code that are called loads of times:
avoid any dynamic memory allocation
make as few (far) function calls as possible
always process the most probable case first in an 'if-then-else' branch
avoid very small loops (inline them if N = 2 or 3)
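Applied to the point-transformation problem from the question, those rules might translate into something like the sketch below. It is written in C# to match the question, not taken from CloudCompare, and it makes several assumptions: points are stored as consecutive x, y, z values in one flat array, the transformation is a 4x4 row-major matrix whose last row is 0 0 0 1, and all names are invented. No memory is allocated per point, and the tiny 3x4 product is written out by hand instead of looping.

```csharp
using System;

public static class PointTransform
{
    // Transforms all points in place. xyz holds x0,y0,z0,x1,y1,z1,...
    // m is a row-major 4x4 matrix; only its first three rows are used.
    public static void TransformInPlace(double[] xyz, double[] m)
    {
        if (m.Length != 16)
            throw new ArgumentException("Expected a 4x4 matrix (16 values).");

        // Read the matrix once, outside the hot loop.
        double m00 = m[0], m01 = m[1], m02 = m[2],  m03 = m[3];
        double m10 = m[4], m11 = m[5], m12 = m[6],  m13 = m[7];
        double m20 = m[8], m21 = m[9], m22 = m[10], m23 = m[11];

        for (int i = 0; i + 2 < xyz.Length; i += 3)
        {
            double x = xyz[i], y = xyz[i + 1], z = xyz[i + 2];

            // Unrolled matrix * (x, y, z, 1): no inner loops, no allocation.
            xyz[i]     = m00 * x + m01 * y + m02 * z + m03;
            xyz[i + 1] = m10 * x + m11 * y + m12 * z + m13;
            xyz[i + 2] = m20 * x + m21 * y + m22 * z + m23;
        }
    }

    public static void Main()
    {
        // Pure translation by (1, 2, 3).
        var m = new double[]
        {
            1, 0, 0, 1,
            0, 1, 0, 2,
            0, 0, 1, 3,
            0, 0, 0, 1
        };
        var points = new double[] { 1, 1, 1 };
        TransformInPlace(points, m);
        Console.WriteLine(string.Join(",", points)); // 2,3,4
    }
}
```

Compared with calling a general MultiplyMatrix for every point, this avoids creating a new result array per point, which is a likely contributor to the long run time.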
Given this simple piece of code and an array of 10 million random numbers:
static int Main(string[] args)
{
int size = 10000000;
int num = 10; //increase num to reduce number of buckets
int numOfBuckets = size/num;
int[] ar = new int[size];
Random r = new Random(); //initialize with random numbers
for (int i = 0; i < size; i++)
ar[i] = r.Next(size);
var s = new Stopwatch();
s.Start();
var group = ar.GroupBy(i => i / num);
var l = group.Count();
s.Stop();
Console.WriteLine(s.ElapsedMilliseconds);
Console.ReadLine();
return 0;
}
I did some performance testing on the grouping: when the number of buckets is 10k the execution time is about 0.7s, for 100k buckets it is 2s, and for 1m buckets it is 7.5s.
I wonder why that is. I imagine that if GroupBy is implemented using a hash table, there might be a problem with collisions. For example, the hash table might initially be prepared to work for, let's say, 1000 groups, and then as the number of groups grows it needs to increase its size and rehash. If this was the case, I could write my own grouping where I initialize the hash table with the expected number of buckets. I did that, but it was only slightly faster.
So my question is, why number of buckets influences groupBy performance that much?
EDIT:
Running in release mode changes the results to 0.55s, 1.6s, 6.5s respectively.
I also changed the group.ToArray to the piece of code below, just to force execution of the grouping:
foreach (var g in group)
array[g.Key] = 1;
where array is initialized before timer with appropriate size, the results stayed almost the same.
EDIT2:
You can see the working code from mellamokb in here pastebin.com/tJUYUhGL
I'm pretty certain this is showing the effects of memory locality (various levels of caching) and also object allocation.
To verify this, I took three steps:
Improve the benchmarking to avoid unnecessary parts and to garbage collect between tests
Remove the LINQ part by populating a Dictionary (which is effectively what GroupBy does behind the scenes)
Remove even Dictionary<,> and show the same trend for plain arrays.
In order to show this for arrays, I needed to increase the input size, but it does show the same kind of growth.
Here's a short but complete program which can be used to test both the dictionary and the array side - just flip which line is commented out in the middle:
using System;
using System.Collections.Generic;
using System.Diagnostics;
class Test
{
const int Size = 100000000;
const int Iterations = 3;
static void Main()
{
int[] input = new int[Size];
// Use the same seed for repeatability
var rng = new Random(0);
for (int i = 0; i < Size; i++)
{
input[i] = rng.Next(Size);
}
// Switch to PopulateArray to change which method is tested
Func<int[], int, TimeSpan> test = PopulateDictionary;
for (int buckets = 10; buckets <= Size; buckets *= 10)
{
TimeSpan total = TimeSpan.Zero;
for (int i = 0; i < Iterations; i++)
{
// Switch which line is commented to change the test
// total += PopulateDictionary(input, buckets);
total += PopulateArray(input, buckets);
GC.Collect();
GC.WaitForPendingFinalizers();
}
Console.WriteLine("{0,9}: {1,7}ms", buckets, (long) total.TotalMilliseconds);
}
}
static TimeSpan PopulateDictionary(int[] input, int buckets)
{
int divisor = input.Length / buckets;
var dictionary = new Dictionary<int, int>(buckets);
var stopwatch = Stopwatch.StartNew();
foreach (var item in input)
{
int key = item / divisor;
int count;
dictionary.TryGetValue(key, out count);
count++;
dictionary[key] = count;
}
stopwatch.Stop();
return stopwatch.Elapsed;
}
static TimeSpan PopulateArray(int[] input, int buckets)
{
int[] output = new int[buckets];
int divisor = input.Length / buckets;
var stopwatch = Stopwatch.StartNew();
foreach (var item in input)
{
int key = item / divisor;
output[key]++;
}
stopwatch.Stop();
return stopwatch.Elapsed;
}
}
Results on my machine:
PopulateDictionary:
10: 10500ms
100: 10556ms
1000: 10557ms
10000: 11303ms
100000: 15262ms
1000000: 54037ms
10000000: 64236ms // Why is this slower? See later.
100000000: 56753ms
PopulateArray:
10: 1298ms
100: 1287ms
1000: 1290ms
10000: 1286ms
100000: 1357ms
1000000: 2717ms
10000000: 5940ms
100000000: 7870ms
An earlier version of PopulateDictionary used an Int32Holder class, and created one for each bucket (when the lookup in the dictionary failed). This was faster when there was a small number of buckets (presumably because we were only going through the dictionary lookup path once per iteration instead of twice) but got significantly slower, and ended up running out of memory. This would contribute to fragmented memory access as well, of course. Note that PopulateDictionary specifies the capacity to start with, to avoid effects of data copying within the test.
The aim of using the PopulateArray method is to remove as much framework code as possible, leaving less to the imagination. I haven't yet tried using an array of a custom struct (with various different struct sizes) but that may be something you'd like to try too.
EDIT: I can reproduce the oddity of the slower result for 10000000 than 100000000 at will, regardless of test ordering. I don't understand why yet. It may well be specific to the exact processor and cache I'm using...
--EDIT--
The reason why 10000000 is slower than the 100000000 results has to do with the way hashing works. A few more tests explain this.
First off, let's look at the operations. There's Dictionary.FindEntry, which is used in the [] indexing and in Dictionary.TryGetValue, and there's Dictionary.Insert, which is used in the [] indexing and in Dictionary.Add. If we would just do a FindEntry, the timings would go up as we expect it:
static TimeSpan PopulateDictionary1(int[] input, int buckets)
{
int divisor = input.Length / buckets;
var dictionary = new Dictionary<int, int>(buckets);
var stopwatch = Stopwatch.StartNew();
foreach (var item in input)
{
int key = item / divisor;
int count;
dictionary.TryGetValue(key, out count);
}
stopwatch.Stop();
return stopwatch.Elapsed;
}
This implementation doesn't have to deal with hash collisions (because there are none), which makes it behave as we expect. Once we start dealing with collisions, the timings start to drop. If we have as many buckets as elements, there are obviously fewer collisions... To be exact, we can figure out exactly how many collisions there are by doing:
static TimeSpan PopulateDictionary(int[] input, int buckets)
{
int divisor = input.Length / buckets;
int c1, c2;
c1 = c2 = 0;
var dictionary = new Dictionary<int, int>(buckets);
var stopwatch = Stopwatch.StartNew();
foreach (var item in input)
{
int key = item / divisor;
int count;
if (!dictionary.TryGetValue(key, out count))
{
dictionary.Add(key, 1);
++c1;
}
else
{
count++;
dictionary[key] = count;
++c2;
}
}
stopwatch.Stop();
Console.WriteLine("{0}:{1}", c1, c2);
return stopwatch.Elapsed;
}
The result is something like this:
10:99999990
10: 4683ms
100:99999900
100: 4946ms
1000:99999000
1000: 4732ms
10000:99990000
10000: 4964ms
100000:99900000
100000: 7033ms
1000000:99000000
1000000: 22038ms
9999538:90000462 <<-
10000000: 26104ms
63196841:36803159 <<-
100000000: 25045ms
Note the value of '36803159'. This answers the question of why the last result (100000000 buckets) is faster than the one before it (10000000 buckets): it simply has to do fewer operations, and since caching fails anyway, that factor doesn't make a difference anymore.
10k the estimated execution time is 0.7s, for 100k buckets it is 2s, for 1m buckets it is 7.5s.
This is an important pattern to recognize when you profile code. It is one of the standard size vs execution time relationships in software algorithms. Just from seeing the behavior, you can tell a lot about the way the algorithm was implemented. And the other way around of course, from the algorithm you can predict the expected execution time. A relationship that's annotated in the Big Oh notation.
Speediest code you can get is amortized O(1), execution time barely increases when you double the size of the problem. The Dictionary<> class behaves that way, as John demonstrated. The increases in time as the problem set gets large is the "amortized" part. A side-effect of Dictionary having to perform linear O(n) searches in buckets that keep getting bigger.
A very common pattern is O(n). That tells you that there is a single for() loop in the algorithm that iterates over the collection. O(n^2) tells you there are two nested for() loops. O(n^3) has three, etcetera.
What you got is the one in between, O(log n). It is the standard complexity of a divide-and-conquer algorithm. In other words, each pass splits the problem in two, continuing with the smaller set. Very common; you see it in sorting algorithms. Binary search is the one you find in your textbook. Note how log₂(10) = 3.3, very close to the increment you see in your test. Perf starts to tank a bit for very large sets due to the poor locality of reference, a CPU cache problem that's always associated with O(log n) algorithms.
The one thing that John's answer demonstrates is that his guess cannot be correct, GroupBy() certainly does not use a Dictionary<>. And it is not possible by design, Dictionary<> cannot provide an ordered collection. Where GroupBy() must be ordered, it says so in the MSDN Library:
The IGrouping objects are yielded in an order based on the order of the elements in source that produced the first key of each IGrouping. Elements in a grouping are yielded in the order they appear in source.
Not having to maintain order is what makes Dictionary<> fast. Keeping order always costs O(log n), a binary tree in your textbook.
Long story short, if you don't actually care about order, and you surely would not for random numbers, then you don't want to use GroupBy(). You want to use a Dictionary<>.
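The ordering guarantee quoted from the MSDN Library above is easy to observe with a tiny example (the class and method names are made up). It only demonstrates the documented order of the groups and says nothing about how GroupBy achieves it internally:

```csharp
using System;
using System.Linq;

public static class GroupOrderDemo
{
    // Returns the group keys in the order GroupBy yields them.
    public static int[] GroupKeys(int[] source)
    {
        return source.GroupBy(x => x).Select(g => g.Key).ToArray();
    }

    public static void Main()
    {
        var source = new[] { 3, 1, 3, 2, 1 };

        // Keys appear in the order their first element occurs in the source,
        // not in sorted order.
        Console.WriteLine(string.Join(",", GroupKeys(source))); // 3,1,2
    }
}
```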
There are (at least) two influence factors: First, a hash table lookup only takes O(1) if you have a perfect hash function, which does not exist. Thus, you have hash collisions.
More important, I guess, are caching effects. Modern CPUs have large caches, so for the smaller bucket counts the hash table itself might fit into the cache. As the hash table is accessed frequently, this can have a strong influence on performance. If there are more buckets, more accesses to RAM might be necessary, which are slow compared to a cache hit.
There are a few factors at work here.
Hashes and groupings
The way grouping works is by creating a hash table. Each individual group then supports an 'add' operation, which adds an element to that group's list. To put it bluntly, it's like a Dictionary<Key, List<Value>>.
Hash tables are always overallocated. If you add an element to the hash, it checks if there is enough capacity, and if not, recreates the hash table with a larger capacity (To be exact: new capacity = count * 2 with count the number of groups). However, a larger capacity means that the bucket index is no longer correct, which means you have to re-build the entries in the hash table. The Resize() method in Lookup<Key, Value> does this.
The 'groups' themselves work like a List<T>. These too are overallocated, but are easier to reallocate. To be precise: the data is simply copied (with Array.Copy in Array.Resize) and a new element is added. Since there's no re-hashing or calculation involved, this is quite a fast operation.
The initial capacity of a grouping is 7. This means, for 10 elements you need to reallocate 1 time, for 100 elements 4 times, for 1000 elements 8 times, and so on. Because you have to re-hash more elements each time, your code gets a bit slower each time the number of buckets grows.
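The reallocation counts above follow from simple doubling. Here is a small sketch of that arithmetic. It models only the simplified rule stated in this answer (start at 7, double whenever the table is too small); the real Lookup code may use a slightly different growth formula, and the names are invented.

```csharp
using System;

public static class ReallocationModel
{
    // Counts how many times a table that starts at initialCapacity and doubles
    // on each resize must grow before it can hold the given number of groups.
    public static int CountResizes(int groups, int initialCapacity = 7)
    {
        long capacity = initialCapacity; // long avoids overflow for huge inputs
        int resizes = 0;
        while (capacity < groups)
        {
            capacity *= 2;
            resizes++;
        }
        return resizes;
    }

    public static void Main()
    {
        Console.WriteLine(CountResizes(10));   // 1  (7 -> 14)
        Console.WriteLine(CountResizes(100));  // 4  (7 -> 14 -> 28 -> 56 -> 112)
        Console.WriteLine(CountResizes(1000)); // 8  (... -> 896 -> 1792)
    }
}
```

Each resize re-hashes every entry present at that moment, which is why the later (larger) resizes dominate the cost.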
I think these overallocations are the largest contributors to the small growth in the timings as the number of buckets grows. The easiest way to test this theory is to do no overallocations at all (test 1), and simply put counters in an array. The result can be seen with the code for FixArrayTest below (or, if you like, FixBucketTest, which is closer to how groupings work). As you can see, the timings for # buckets = 10...10000 are the same, which is correct according to this theory.
Cache and random
Caching and random number generators aren't friends.
Our little test also shows that when the number of buckets grows above a certain threshold, memory comes into play. On my computer this is at an array size of roughly 4 MB (4 * number of buckets). Because the data is random, random chunks of RAM will be loaded into and evicted from the cache, which is a slow process. This also explains the large jump in the timings. To see this in action, change the random numbers to a sequence (called 'test 2'), and, because the data pages can now be cached, the speed will remain the same overall.
Note that hashes overallocate, so you will hit the mark before you have a million entries in your grouping.
Test code
static void Main(string[] args)
{
int size = 10000000;
int[] ar = new int[size];
//random number init with numbers [0,size-1]
var r = new Random();
for (var i = 0; i < size; i++)
{
ar[i] = r.Next(0, size);
//ar[i] = i; // Test 2 -> uncomment to see the effects of caching more clearly
}
Console.WriteLine("Fixed dictionary:");
for (var numBuckets = 10; numBuckets <= 1000000; numBuckets *= 10)
{
var num = (size / numBuckets);
var timing = 0L;
for (var i = 0; i < 5; i++)
{
timing += FixBucketTest(ar, num);
//timing += FixArrayTest(ar, num); // test 1
}
var avg = ((float)timing) / 5.0f;
Console.WriteLine("Avg Time: " + avg + " ms for " + numBuckets);
}
Console.WriteLine("Fixed array:");
for (var numBuckets = 10; numBuckets <= 1000000; numBuckets *= 10)
{
var num = (size / numBuckets);
var timing = 0L;
for (var i = 0; i < 5; i++)
{
timing += FixArrayTest(ar, num); // test 1
}
var avg = ((float)timing) / 5.0f;
Console.WriteLine("Avg Time: " + avg + " ms for " + numBuckets);
}
}
static long FixBucketTest(int[] ar, int num)
{
// This test shows that timings will not grow for the smaller numbers of buckets if you don't have to re-allocate
System.Diagnostics.Stopwatch s = new Stopwatch();
s.Start();
var grouping = new Dictionary<int, List<int>>(ar.Length / num + 1); // exactly the right size
foreach (var item in ar)
{
int idx = item / num;
List<int> ll;
if (!grouping.TryGetValue(idx, out ll))
{
grouping.Add(idx, ll = new List<int>());
}
//ll.Add(item); //-> this would complete a 'grouper'; however, we don't want the overallocator of List to kick in
}
s.Stop();
return s.ElapsedMilliseconds;
}
// Test with arrays
static long FixArrayTest(int[] ar, int num)
{
System.Diagnostics.Stopwatch s = new Stopwatch();
s.Start();
int[] buf = new int[(ar.Length / num + 1) * 10];
foreach (var item in ar)
{
int code = (item & 0x7FFFFFFF) % buf.Length;
buf[code]++;
}
s.Stop();
return s.ElapsedMilliseconds;
}
When executing bigger calculations, less physical memory is available on the computer, and counting the buckets gets slower with less memory; as you expand the number of buckets, your available memory decreases.
Try something like the following:
int size = 2500000; //10000000 divided by 4
int num = 10; //same divisor as in the question
int[] ar = new int[size];
//random number init with numbers [0,size-1]
System.Diagnostics.Stopwatch s = new Stopwatch();
s.Start();
for (int i = 0; i<4; i++)
{
//the lambda parameter must not reuse the loop variable name i
var group = ar.GroupBy(x => x / num);
//the number of expected buckets is size / num.
var l = group.ToArray();
}
s.Stop();
This calculates 4 times with smaller inputs.