I have a large array of primitive value-types. The array is in fact one-dimensional, but logically represents a 2-dimensional field. Reading from left to right, each value needs to become (the original value of the current cell) + (the result calculated in the cell to its left), with the obvious exception of the first element of each row, which keeps its original value.
I already have an implementation which accomplishes this, but is entirely iterative over the entire array and is extremely slow for large (1M+ elements) arrays.
Given the following example array,
0 0 1 0 0
2 0 0 0 3
0 4 1 1 0
0 1 0 4 1
Becomes
0 0 1 1 1
2 2 2 2 5
0 4 5 6 6
0 1 1 5 6
And so forth to the right, up to problematic sizes (1024x1024)
The array needs to be updated (ideally), but another array can be used if necessary. Memory footprint isn't much of an issue here, but performance is critical as these arrays have millions of elements and must be processed hundreds of times per second.
The individual cell calculations do not appear to be parallelizable given their dependence on values starting from the left, so GPU acceleration seems impossible. I have investigated PLINQ, but its requirement for indices makes it very difficult to implement.
Is there another way to structure the data to make it faster to process?
If efficient GPU processing is feasible using an innovative technique, this would be vastly preferable, as this is currently texture data which has to be pulled from and pushed back to the video card.
Proper coding and a bit of insight into how .NET works helps as well :-)
Some rules of thumb that apply in this case:
1. If you can hint the JIT that the indexing will never get out of bounds of the array, it will remove the extra branch.
2. Only parallelize it across multiple threads if it's really slow (e.g. >1 second). Otherwise task switching, cache flushes, etc. will probably just eat up the added speed and you'll end up worse off.
3. If possible, make memory access predictable, even sequential. If you need another array to achieve this, so be it; if you don't, prefer working in place.
4. Use as few IL instructions as possible if you want speed. Generally this seems to work.
5. Test multiple iterations. A single iteration might not be good enough.
Using these rules, you can make a small test case as follows. Note that I've upped the stakes to 4Kx4K since 1K is just so fast you cannot measure it :-)
public static void Main(string[] args)
{
    int width = 4096;
    int height = 4096;
    int[] ar = new int[width * height];

    Random rnd = new Random(213);
    for (int i = 0; i < ar.Length; ++i)
    {
        ar[i] = rnd.Next(0, 120);
    }

    // (5)...
    for (int j = 0; j < 10; ++j)
    {
        Stopwatch sw = Stopwatch.StartNew();

        int sum = 0;
        for (int i = 0; i < ar.Length; ++i) // (3) sequential access
        {
            if ((i % width) == 0)
            {
                sum = 0;
            }

            // (1) --> the JIT will notice this won't go out of bounds because [0<=i<ar.Length]
            // (5) --> '+=' is an expression generating a 'dup'; this creates less IL.
            ar[i] = (sum += ar[i]);
        }

        Console.WriteLine("This took {0:0.0000}s", sw.Elapsed.TotalSeconds);
    }
    Console.ReadLine();
}
One of these iterations will take roughly 0.0174 sec here, and since this is about 16x the worst case scenario you describe, I suppose your performance problem is solved.
If you really want to parallelize it to make it faster, I suppose that is possible, even though you will lose some of the optimizations in the JIT (specifically: (1)). However, if you have a multi-core system like most people, the benefits might outweigh this:
for (int j = 0; j < 10; ++j)
{
    Stopwatch sw = Stopwatch.StartNew();

    Parallel.For(0, height, (a) =>
    {
        // Seed the sum with the row's first element so the result matches the sequential version
        int sum = ar[width * a];
        for (var i = width * a + 1; i < width * (a + 1); i++)
        {
            ar[i] = (sum += ar[i]);
        }
    });

    Console.WriteLine("This took {0:0.0000}s", sw.Elapsed.TotalSeconds);
}
If you really, really need performance, you can compile it to C++ and use P/Invoke. Even if you don't use the GPU, I suppose the SSE/AVX instructions might already give you a significant performance boost that you won't get with .NET/C#. Also I'd like to point out that the Intel C++ compiler can automatically vectorize your code - even to Xeon PHI's. Without a lot of effort, this might give you a nice boost in performance.
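For illustration, the managed side of that P/Invoke route might look roughly like this; the DLL name and its export are hypothetical, and the native side would contain the (auto-)vectorized row-wise prefix sum:

using System.Runtime.InteropServices;

static class NativePrefixSum
{
    // Assumed native signature: void prefix_sum_rows(int* data, int width, int height)
    [DllImport("prefixsum.dll", CallingConvention = CallingConvention.Cdecl)]
    private static extern void prefix_sum_rows(int[] data, int width, int height);

    public static void Run(int[] ar, int width, int height)
    {
        // The int[] is pinned and passed to the native code as a pointer for the duration of the call.
        prefix_sum_rows(ar, width, height);
    }
}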
Well, I don't know too much about GPUs, but I see no reason why you can't parallelize it, as the dependencies only run from left to right.
There are no dependencies between rows.
0 0 1 0 0 - process on core1 |
2 0 0 0 3 - process on core1 |
-------------------------------
0 4 1 1 0 - process on core2 |
0 1 0 4 1 - process on core2 |
Although the above statement is not completely true: there are still hidden dependencies between rows when it comes to the memory cache.
It's possible that there will be cache thrashing. You can read about "false sharing" to understand the problem and see how to overcome it.
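To illustrate, a minimal sketch (assuming the flat ar array with width and height from the question, and a range partitioner from System.Collections.Concurrent) that hands each worker a contiguous block of rows, so different threads write to regions far apart in memory:

// Each worker gets a contiguous block of rows; writes from different threads
// land far apart in the array, which keeps false sharing unlikely.
Parallel.ForEach(Partitioner.Create(0, height), range =>
{
    for (int row = range.Item1; row < range.Item2; row++)
    {
        int sum = ar[row * width];
        for (int i = row * width + 1; i < (row + 1) * width; i++)
        {
            ar[i] = (sum += ar[i]);
        }
    }
});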
As @Chris Eelmaa told you, it is possible to do a parallel execution by row. Using Parallel.For, it could be rewritten like this:
static int[,] values = new int[,] {
    {0, 0, 1, 0, 0},
    {2, 0, 0, 0, 3},
    {0, 4, 1, 1, 0},
    {0, 1, 0, 4, 1}};

static void Main(string[] args)
{
    int rows = values.GetLength(0);
    int columns = values.GetLength(1);
    Parallel.For(0, rows, (row) =>
    {
        for (var column = 1; column < columns; column++)
        {
            values[row, column] += values[row, column - 1];
        }
    });
    for (var row = 0; row < rows; row++)
    {
        for (var column = 0; column < columns; column++)
        {
            Console.Write("{0} ", values[row, column]);
        }
        Console.WriteLine();
    }
}
But, as stated in your question, you have a one-dimensional array; in that case the code would be a bit faster:
static void Main(string[] args)
{
    var values = new int[1024 * 1024];

    Random r = new Random();
    for (int i = 0; i < 1024; i++)
    {
        for (int j = 0; j < 1024; j++)
        {
            values[i * 1024 + j] = r.Next(25);
        }
    }

    int rows = 1024;
    int columns = 1024;

    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        Parallel.For(0, rows, (row) =>
        {
            for (var column = 1; column < columns; column++)
            {
                values[(row * columns) + column] += values[(row * columns) + column - 1];
            }
        });
    }
    Console.WriteLine(sw.Elapsed);
}
But not as fast as a GPU. To use parallel GPU processing you will have to rewrite it in C++ AMP, or take a look at how to port this parallel for to CUDAfy: http://w8isms.blogspot.com.es/2012/09/cudafy-me-part-3-of-4.html
You may as well store the array as a jagged array, the memory layout will be the same. So, instead of,
int[] texture;
you have,
int[][] texture;
Isolate the row operation as,
private static Task ProcessRow(int[] row)
{
    var v = row[0];
    for (var i = 1; i < row.Length; i++)
    {
        v = row[i] += v;
    }
    return Task.FromResult(true);
}
then you can write a function that does,
Task.WhenAll(texture.Select(ProcessRow)).Wait();
If you want to remain with a 1-dimensional array, a similar approach will work, just change ProcessRow.
private static Task ProcessRow(int[] texture, int start, int limit)
{
    var v = texture[start];
    for (var i = start + 1; i < limit; i++)
    {
        v = texture[i] += v;
    }
    return Task.FromResult(true);
}
then once,
var rowSize = 1024;
var rows =
    Enumerable.Range(0, texture.Length / rowSize)
        .Select(i => Tuple.Create(i * rowSize, (i * rowSize) + rowSize))
        .ToArray();
then on each cycle,
Task.WhenAll(rows.Select(t => ProcessRow(texture, t.Item1, t.Item2))).Wait();
Either way, each row is processed in parallel.
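One caveat: because ProcessRow completes synchronously and returns an already-finished task, the Select above actually runs the rows one after another on the calling thread. If the goal is to spread the rows over the thread pool, a small variant (a sketch) is to wrap each call in Task.Run:

// Push each row onto the thread pool so the rows genuinely run concurrently.
Task.WhenAll(texture.Select(row => Task.Run(() => ProcessRow(row)))).Wait();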
Related
Given the following code:
public float[] weights;

public void Input(Neuron[] neurons)
{
    float output = 0;
    for (int i = 0; i < neurons.Length; i++)
        output += neurons[i].input * weights[i];
}
Is it possible to perform all the calculations in a single execution? For example that would be 'neurons[0].input * weights[0].value + neurons[1].input * weights[1].value...'
Coming from this topic - How to sum up an array of integers in C#, there is a way for simpler calculations, but the idea of my code is to iterate over the first array, multiply each element by the element at the same index in the second array, and add that to a running total.
Doing perf profiling, the line where the output is summed is very heavy on I/O and consumes 99% of my processing power. The stack should have enough memory for this, I am not worried about stack overflow, I just want to see it work faster for the moment (even if accuracy is sacrificed).
I think you are looking for AVX in C#
So you can actually calculate several values in one command.
That's SIMD for CPU cores. Take a look at this.
Here is an example from the website:
// Requires the System.Numerics vector types (using System.Numerics;)
public static int[] SIMDArrayAddition(int[] lhs, int[] rhs)
{
    var simdLength = Vector<int>.Count;
    var result = new int[lhs.Length];
    var i = 0;
    for (i = 0; i <= lhs.Length - simdLength; i += simdLength)
    {
        var va = new Vector<int>(lhs, i);
        var vb = new Vector<int>(rhs, i);
        (va + vb).CopyTo(result, i);
    }
    for (; i < lhs.Length; ++i)
    {
        result[i] = lhs[i] + rhs[i];
    }
    return result;
}
You can also combine it with the parallelism you already use.
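For the weighted-sum loop in the question itself, the same System.Numerics types apply; here is a rough sketch (assuming the neuron inputs have first been copied into a plain float[], since Vector<T> loads from arrays, and the method name is mine):

// SIMD dot product over two float arrays using System.Numerics.Vector<float>.
public static float DotProduct(float[] inputs, float[] weights)
{
    var simdLength = Vector<float>.Count;
    var acc = Vector<float>.Zero;
    int i = 0;
    for (; i <= inputs.Length - simdLength; i += simdLength)
    {
        acc += new Vector<float>(inputs, i) * new Vector<float>(weights, i);
    }
    float output = Vector.Dot(acc, Vector<float>.One); // horizontal sum of the accumulator lanes
    for (; i < inputs.Length; ++i)                     // scalar tail
    {
        output += inputs[i] * weights[i];
    }
    return output;
}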
I tried this code, but it takes so long that I cannot get the result:
public long getCounter([FromBody]object req)
{
    JObject param = Utility.GetRequestParameter(req);
    long input = long.Parse(param["input"].ToString());
    long counter = 0;
    for (long i = 14; i <= input; i++)
    {
        string s = i.ToString();
        if (s.Contains("14"))
        {
            counter += 1;
        }
    }
    return counter;
}
please help
We can examine all non-negative numbers < 10^10. Every such number can be represented with the sequence of 10 digits (with leading zeroes allowed).
How many numbers include 14
Dynamic programming solution. Let's find the number of sequences of a specific length that end with a specific digit and contain (or do not contain) the subsequence 14:
F(len, digit, 0) is the number of sequences of length len that end with digit and do not contain 14; F(len, digit, 1) is the number of such sequences that do contain 14. Initially F(0, 0, 0) = 1. The result is the sum of F(10, digit, 1) over all digits.
C++ code to play with: https://ideone.com/2aS17v. The answer seems to be 872348501.
How many times the numbers include 14
First, let's place 14 at the end of the sequence:
????????14
Every '?' can be replaced with any digit from 0 to 9. Thus, there are 10^8 numbers in the interval that contain 14 at the end. Then consider the ???????14?, ??????14??, ..., 14???????? numbers. There are 9 possible locations of the 14 sequence, so the answer is 10^8 * 9 = 900000000.
[Added by Matthew Watson]
Here's the C# version of the C++ implementation; it runs in less than 100ms:
using System;

namespace Demo
{
    public static class Program
    {
        public static void Main(string[] args)
        {
            const int M = 10;
            int[,,] f = new int[M + 1, 10, 2];
            f[0, 0, 0] = 1;

            for (int len = 1; len <= M; ++len)
            {
                for (int d = 0; d <= 9; ++d)
                {
                    for (int j = 0; j <= 9; ++j)
                    {
                        f[len, d, 0] += f[len - 1, j, 0];
                        f[len, d, 1] += f[len - 1, j, 1];
                    }
                }

                f[len, 4, 0] -= f[len - 1, 1, 0];
                f[len, 4, 1] += f[len - 1, 1, 0];
            }

            int sum = 0;
            for (int i = 0; i <= 9; ++i)
                sum += f[M, i, 1];

            Console.WriteLine(sum); // 872,348,501
        }
    }
}
If you want a brute-force solution, it could be something like this (notice that we should avoid time-consuming string operations like ToString and Contains):
int count = 0;

// Let's use all CPU's cores: Parallel.For
Parallel.For(0L, 10000000000L, (v) => {
    for (long x = v; x > 10; x /= 10) {
        // Get rid of ToString and Contains here
        if (x % 100 == 14) {
            Interlocked.Increment(ref count); // We want an atomic (thread safe) operation
            break;
        }
    }
});

Console.Write(count);
It returns 872348501 within 6 minutes (Core i7 with 4 cores at 3.2 GHz).
UPDATE
My parallel code calculated the result as 872,348,501 in 9 minutes on my 8-core Intel Core i7 PC.
(There is a much better solution above that takes less than 100ms, but I shall leave this answer here since it provides corroborating evidence for the fast answer.)
You can use multiple threads (one per processor core) to reduce the calculation time.
At first I thought that I could use AsParallel() to speed this up - however, it turns out that you can't use AsParallel() on sequences with more than 2^31 items.
(For completeness I'm including my faulty implementation using AsParallel at the end of this answer).
Instead, I've written some custom code to break the problem down into a number of chunks equal to the number of processors:
using System;
using System.Linq;
using System.Threading.Tasks;

namespace Demo
{
    class Program
    {
        static void Main()
        {
            int numProcessors = Environment.ProcessorCount;
            Task<long>[] results = new Task<long>[numProcessors];
            long count = 10000000000;
            long elementsPerProcessor = count / numProcessors;

            for (int i = 0; i < numProcessors; ++i)
            {
                long end;
                long start = i * elementsPerProcessor;

                if (i != (numProcessors - 1))
                    end = start + elementsPerProcessor;
                else // Last thread - go right up to the last element.
                    end = count;

                results[i] = Task.Run(() => processElements(start, end));
            }

            long sum = results.Select(r => r.Result).Sum();
            Console.WriteLine(sum);
        }

        static long processElements(long inclusiveStart, long exclusiveEnd)
        {
            long total = 0;
            for (long i = inclusiveStart; i < exclusiveEnd; ++i)
                if (i.ToString().Contains("14"))
                    ++total;
            return total;
        }
    }
}
The following code does NOT work because AsParallel() doesn't work on sequences with more than 2^31 items.
static void Main(string[] args)
{
    var numbersContaining14 =
        from number in numbers(0, 100000000000).AsParallel()
        where number.ToString().Contains("14")
        select number;

    Console.WriteLine(numbersContaining14.LongCount());
}

static IEnumerable<long> numbers(long first, long count)
{
    for (long i = first, last = first + count; i < last; ++i)
        yield return i;
}
You compute the count of numbers of a given length ending in 1, 4 or something else that don't contain 14. Then you can extend the length by 1.
Then the count of numbers that do contain 14 is the count of all numbers minus those that don't contain a 14.
private static long Count(int len) {
    long e1 = 0, e4 = 0, eo = 1;
    long N = 1;
    for (int n = 0; n < len; n++) {
        long ne1 = e4 + e1 + eo, ne4 = e4 + eo, neo = 8 * (e1 + e4 + eo);
        e1 = ne1; e4 = ne4; eo = neo;
        N *= 10;
    }
    return N - e1 - e4 - eo;
}
You can reduce this code a little, noting that eo = 8*e1 except for the first iteration, and then avoiding the local variables.
private static long Count(int len) {
    long e1 = 1, e4 = 1, N = 10;
    for (int n = 1; n < len; n++) {
        e4 += 8 * e1;
        e1 += e4;
        N *= 10;
    }
    return N - 9 * e1 - e4;
}
For both of these, Count(10) returns 872348501.
One easy way to calculate the answer is:
You can fix 14 at a place and count the combinations of the remaining digits to the right and left of it, and do this for all the possible positions where 14 can be placed such that the number is still less than 10000000000. Let's take an example:
***14*****
Here the '*'s before 14 can be filled in 900 ways and the '*'s after 14 can be filled in 10^5 ways, so the total occurrence will be 10^5 * 900.
Similarly you can fix 14 at the other positions to calculate the result. This solution is very fast, O(10) or simply O(1), while the previous solution was O(N), where N is 10000000000.
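To make the position-fixing idea concrete, here is a sketch (the helper names are mine). To avoid counting a number twice, the digits before the fixed 14 are restricted to strings that do not already contain 14, which is where the exact per-position counts come from:

// Count numbers < 10^10 containing "14", by the position of the first occurrence.
// For a first occurrence starting at digit p (0-based, leading zeros allowed), the p digits
// before it may be anything not already containing "14"; the 8 - p digits after it are free.
static long CountContaining14()
{
    long total = 0;
    long tail = 100000000;               // 10^8 combinations of the 8 free digits when p = 0
    for (int p = 0; p <= 8; p++)
    {
        total += SequencesWithout14(p) * tail;
        tail /= 10;
    }
    return total;                        // 872348501
}

// Number of digit strings of length len (leading zeros allowed) that do not contain "14".
static long SequencesWithout14(int len)
{
    long endsInOne = 0, other = 1;       // counts for the empty string
    for (int k = 0; k < len; k++)
    {
        long newEndsInOne = endsInOne + other;      // append '1'
        long newOther = 8 * endsInOne + 9 * other;  // append a non-'1' digit, but never '4' after '1'
        endsInOne = newEndsInOne;
        other = newOther;
    }
    return endsInOne + other;
}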
You can use the fact that in each block of 1000 (that is, from 1 to 1000, from 1001 to 2000, etc.)
"14" is found 19 times. So when you receive your input, divide it by 1000; for example, if you receive 1200, then 1200/1000
gives 1 with remainder 200, so we have 1 * 19 "14"s, and then you can loop over the remaining 200.
You can extend this to 10000 (that is, count how many "14"s there are in 10000 and fix that as a global constant): first divide by 10000 and apply the equation above, then divide the remainder by 1000, apply the equation again, and add the two results.
You can also use the fact that in every hundred (that is, from 1 to 100, from 201 to 300, ...) "14" is found only once, except for the second hundred (101 to 200).
Possible Duplicate:
c# Leaner way of initializing int array
Basically I would like to know if there is more efficient code than the one shown below,
private static int[] GetDefaultSeriesArray(int size, int value)
{
    int[] result = new int[size];
    for (int i = 0; i < size; i++)
    {
        result[i] = value;
    }
    return result;
}
where size can vary from 10 to 150000. For small arrays this is not an issue, but there should be a better way to do the above.
I am using VS2010 (.NET 4.0).
C#/CLR does not have a built-in way to initialize an array with non-default values.
Your code is as efficient as it could get if you measure in operations per item.
You can get potentially faster initialization if you initialize chunks of the huge array in parallel. This approach will need careful tuning due to the non-trivial cost of multithreaded operations.
Much better results can be obtained by analyzing your needs and potentially removing the whole initialization altogether. I.e. if the array normally contains a constant value, you can implement some sort of COW (copy-on-write) approach where your object initially has no backing array and simply returns the constant value; on a write to an element it would create a (potentially partial) backing array for the modified segment.
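A minimal sketch of that copy-on-write idea (the type and member names are made up for illustration):

// Reads return the constant until the first write forces a real backing array.
sealed class CowArray
{
    private readonly int _length;
    private readonly int _defaultValue;
    private int[] _backing; // created lazily on the first write

    public CowArray(int length, int defaultValue)
    {
        _length = length;
        _defaultValue = defaultValue;
    }

    public int this[int index]
    {
        get { return _backing == null ? _defaultValue : _backing[index]; }
        set
        {
            if (_backing == null)
            {
                _backing = new int[_length];
                for (int i = 0; i < _length; i++)
                    _backing[i] = _defaultValue;
            }
            _backing[index] = value;
        }
    }
}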
Slower but more compact code (that is potentially easier to read) would be to use Enumerable.Repeat. Note that ToArray will cause a significant amount of memory to be allocated for large arrays (which may also end up with allocations on the LOH) - see High memory consumption with Enumerable.Range?.
var result = Enumerable.Repeat(value, size).ToArray();
One way that you can improve speed is by utilizing Array.Copy. It's able to work at a lower level in which it's bulk assigning larger sections of memory.
By batching the assignments you can end up copying the array from one section to itself.
On top of that, the batches themselves can be quite effectively parallelized.
Here is my initial code. On my machine (which only has two cores) with a sample array of size 10 million items, I was getting a 15% or so speedup. You'll need to play around with the batch size (try to stay in multiples of your page size to keep it efficient) to tune it to the size of items that you have. For smaller arrays it'll end up almost identical to your code, as it won't get past filling up the first batch, but it also won't be (noticeably) worse in those cases either.
private const int batchSize = 1048576;

private static int[] GetDefaultSeriesArray2(int size, int value)
{
    int[] result = new int[size];

    //fill the first batch normally
    int end = Math.Min(batchSize, size);
    for (int i = 0; i < end; i++)
    {
        result[i] = value;
    }

    int numBatches = size / batchSize;

    Parallel.For(1, numBatches, batch =>
    {
        Array.Copy(result, 0, result, batch * batchSize, batchSize);
    });

    //handle partial leftover batch
    for (int i = numBatches * batchSize; i < size; i++)
    {
        result[i] = value;
    }

    return result;
}
Another way to improve performance is with a pretty basic technique: loop unrolling.
I have written some code to initialize an array with 20 million items; this is done repeatedly 100 times and an average is calculated. Without unrolling the loop, this takes about 44 ms. With loop unrolling of 10, the process finishes in 23 ms.
private void Looper()
{
    int repeats = 100;
    float avg = 0;

    ArrayList times = new ArrayList();

    for (int i = 0; i < repeats; i++)
        times.Add(Time());

    Console.WriteLine(GetAverage(times)); //44

    times.Clear();

    for (int i = 0; i < repeats; i++)
        times.Add(TimeUnrolled());

    Console.WriteLine(GetAverage(times)); //22
}

private float GetAverage(ArrayList times)
{
    long total = 0;
    foreach (var item in times)
    {
        total += (long)item;
    }
    return (float)total / times.Count;
}

private long Time()
{
    Stopwatch sw = new Stopwatch();
    int size = 20000000;
    int[] result = new int[size];

    sw.Start();
    for (int i = 0; i < size; i++)
    {
        result[i] = 5;
    }
    sw.Stop();

    Console.WriteLine(sw.ElapsedMilliseconds);
    return sw.ElapsedMilliseconds;
}

private long TimeUnrolled()
{
    Stopwatch sw = new Stopwatch();
    int size = 20000000;
    int[] result = new int[size];

    sw.Start();
    for (int i = 0; i < size; i += 10)
    {
        result[i] = 5;
        result[i + 1] = 5;
        result[i + 2] = 5;
        result[i + 3] = 5;
        result[i + 4] = 5;
        result[i + 5] = 5;
        result[i + 6] = 5;
        result[i + 7] = 5;
        result[i + 8] = 5;
        result[i + 9] = 5;
    }
    sw.Stop();

    Console.WriteLine(sw.ElapsedMilliseconds);
    return sw.ElapsedMilliseconds;
}
Enumerable.Repeat(value, size).ToArray();
Reading up, Enumerable.Repeat is 20 times slower than the OP's standard for loop, and the only thing I found which might improve its speed is
private static int[] GetDefaultSeriesArray(int size, int value)
{
    int[] result = new int[size];
    for (int i = 0; i < size; ++i)
    {
        result[i] = value;
    }
    return result;
}
NOTE the i++ is changed to ++i. i++ copies i, increments i, and returns the original value. ++i just returns the incremented value
As someone already mentioned, you can leverage parallel processing like this:
int[] result = new int[size];
Parallel.ForEach(result, x => x = value);
return result;
Sorry I had no time to do performance testing on this (don't have VS installed on this machine) but if you can do it and share the results it would be great.
EDIT: As per comment, while I still think that in terms of performance they are equivalent, you can try the parallel for loop:
Parallel.For(0, size, i => result[i] = value);
I need to find the intersection of two sorted integer arrays and do it very fast.
Right now, I am using the following code:
int i = 0, j = 0;
while (i < arr1.Count && j < arr2.Count)
{
    if (arr1[i] < arr2[j])
    {
        i++;
    }
    else
    {
        if (arr2[j] < arr1[i])
        {
            j++;
        }
        else
        {
            intersect.Add(arr2[j]);
            j++;
            i++;
        }
    }
}
Unfortunately it might take hours to do all the work.
How to do it faster? I found this article where SIMD instructions are used. Is it possible to use SIMD in .NET?
What do you think about:
http://docs.go-mono.com/index.aspx?link=N:Mono.Simd Mono.SIMD
http://netasm.codeplex.com/ NetASM(inject asm code to managed)
and something like http://www.atrevido.net/blog/PermaLink.aspx?guid=ac03f447-d487-45a6-8119-dc4fa1e932e1
EDIT:
When I say thousands, I mean the following (in code):
for (var i = 0; i < arrCollection1.Count - 1; i++)
{
    for (var j = i + 1; j < arrCollection2.Count; j++)
    {
        Intersect(arrCollection1[i], arrCollection2[j]);
    }
}
UPDATE
The fastest I got was 200 ms with arrays of size 10 million, using the unsafe version (last piece of code).
The test I did:
var arr1 = new int[10000000];
var arr2 = new int[10000000];

for (var i = 0; i < 10000000; i++)
{
    arr1[i] = i;
    arr2[i] = i * 2;
}

var sw = Stopwatch.StartNew();
var result = arr1.IntersectSorted(arr2);
sw.Stop();

Console.WriteLine(sw.Elapsed); // 00:00:00.1926156
Full Post:
I've tested various ways to do it and found this to be very good:
public static List<int> IntersectSorted(this int[] source, int[] target)
{
    // Set initial capacity to a "full-intersection" size
    // This prevents multiple re-allocations
    var ints = new List<int>(Math.Min(source.Length, target.Length));

    var i = 0;
    var j = 0;

    while (i < source.Length && j < target.Length)
    {
        // Compare only once and let compiler optimize the switch-case
        switch (source[i].CompareTo(target[j]))
        {
            case -1:
                i++;
                // Saves us a JMP instruction
                continue;
            case 1:
                j++;
                // Saves us a JMP instruction
                continue;
            default:
                ints.Add(source[i++]);
                j++;
                // Saves us a JMP instruction
                continue;
        }
    }

    // Free unused memory (sets capacity to actual count)
    ints.TrimExcess();
    return ints;
}
For further improvement you can remove the ints.TrimExcess();, which will also make a nice difference, but you should think if you're going to need that memory.
Also, if you know that you might break loops that use the intersections, and you don't have to have the results as an array/list, you should change the implementation to an iterator:
public static IEnumerable<int> IntersectSorted(this int[] source, int[] target)
{
    var i = 0;
    var j = 0;

    while (i < source.Length && j < target.Length)
    {
        // Compare only once and let compiler optimize the switch-case
        switch (source[i].CompareTo(target[j]))
        {
            case -1:
                i++;
                // Saves us a JMP instruction
                continue;
            case 1:
                j++;
                // Saves us a JMP instruction
                continue;
            default:
                yield return source[i++];
                j++;
                // Saves us a JMP instruction
                continue;
        }
    }
}
Another improvement is to use unsafe code:
public static unsafe List<int> IntersectSorted(this int[] source, int[] target)
{
    var ints = new List<int>(Math.Min(source.Length, target.Length));

    fixed (int* ptSrc = source)
    {
        var maxSrcAdr = ptSrc + source.Length;

        fixed (int* ptTar = target)
        {
            var maxTarAdr = ptTar + target.Length;

            var currSrc = ptSrc;
            var currTar = ptTar;

            while (currSrc < maxSrcAdr && currTar < maxTarAdr)
            {
                switch ((*currSrc).CompareTo(*currTar))
                {
                    case -1:
                        currSrc++;
                        continue;
                    case 1:
                        currTar++;
                        continue;
                    default:
                        ints.Add(*currSrc);
                        currSrc++;
                        currTar++;
                        continue;
                }
            }
        }
    }

    ints.TrimExcess();
    return ints;
}
In summary, the biggest performance hit was in the if-elses.
Turning it into a switch-case made a huge difference (about 2 times faster).
Have you tried something simple like this:
var a = Enumerable.Range(1, int.MaxValue / 100).ToList();
var b = Enumerable.Range(50, int.MaxValue / 100 - 50).ToList();

//var c = a.Intersect(b).ToList();
List<int> c = new List<int>();

var t1 = DateTime.Now;

foreach (var item in a)
{
    if (b.BinarySearch(item) >= 0)
        c.Add(item);
}

var t2 = DateTime.Now;
var tres = t2 - t1;
This piece of code takes one array of 21,474,836 elements and another one with 21,474,786.
If I use var c = a.Intersect(b).ToList(); I get an OutOfMemoryException.
The result product would be 461,167,507,485,096 iterations using a nested foreach.
But with this simple code, the intersection occurred in TotalSeconds = 7.3960529 (using one core).
Now I am still not happy, so I am trying to increase the performance by doing this in parallel; as soon as I finish I will post it.
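A sketch of what that parallel version might look like (using the TPL with a range partitioner from System.Collections.Concurrent, and per-range result lists so the workers don't contend on a shared list):

// Partition 'a' into index ranges, binary-search 'b' for each item,
// and merge the per-range results at the end.
var found = new ConcurrentBag<List<int>>();
Parallel.ForEach(Partitioner.Create(0, a.Count), range =>
{
    var local = new List<int>();
    for (int i = range.Item1; i < range.Item2; i++)
    {
        if (b.BinarySearch(a[i]) >= 0)
            local.Add(a[i]);
    }
    found.Add(local);
});
var intersection = found.SelectMany(x => x).ToList(); // note: order across ranges is not preserved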
Yorye Nathan gave me the fastest intersection of two arrays with the last "unsafe code" method. Unfortunately it was still too slow for me; I needed to make combinations of array intersections, which goes up to 2^32 combinations - pretty much, no? I made the following modifications and adjustments and the time dropped to 2.6x faster. You need to do some pre-optimization beforehand; for sure you can do it one way or another. I am using only indexes instead of the actual objects, ids, or some other abstract comparison. So, for example, if you have to intersect big numbers like these:
Arr1: 103344, 234566, 789900, 1947890,
Arr2: 150034, 234566, 845465, 23849854
put everything into one array:
Merged: 103344, 234566, 789900, 1947890, 150034, 845465, 23849854
and use, for intersection, the ordered indexes of the result array
Arr1Index: 0, 1, 2, 3
Arr2Index: 1, 4, 5, 6
Now we have smaller numbers with which we can build some other nice arrays. What I did, after taking the method from Yorye, was to take Arr2Index and expand it into what is theoretically a boolean array, but practically a byte array because of the memory size implications, as follows:
Arr2IndexCheck: 0, 1, 0, 0, 1, 1, 1
which is more or less a dictionary that tells me, for any index, whether the second array contains it.
As the next step I avoided memory allocation, which also took time; instead I pre-created the result array before calling the method, so during the process of finding my combinations I never instantiate anything. Of course you have to deal with the length of this array separately, so maybe you need to store it somewhere.
Finally the code looks like this:
public static unsafe int IntersectSorted2(int[] arr1, byte[] arr2Check, int[] result)
{
    int length;

    fixed (int* pArr1 = arr1, pResult = result)
    fixed (byte* pArr2Check = arr2Check)
    {
        int* maxArr1Adr = pArr1 + arr1.Length;
        int* arr1Value = pArr1;
        int* resultValue = pResult;

        while (arr1Value < maxArr1Adr)
        {
            if (*(pArr2Check + *arr1Value) == 1)
            {
                *resultValue = *arr1Value;
                resultValue++;
            }
            arr1Value++;
        }

        length = (int)(resultValue - pResult);
    }
    return length;
}
You can see the result array size is returned by the function; then you do what you wish with it (resize it, keep it). Obviously the result array has to have at least the size of the smaller of arr1 and arr2.
The big improvement is that I only iterate through the first array, which ideally has a smaller size than the second one, so you have fewer iterations. Fewer iterations means fewer CPU cycles, right?
So here is the really fast intersection of two ordered arrays, for when you need reaaaaalllyy high performance ;).
Are arrCollection1 and arrCollection2 collections of arrays of integers? In this case you should get some notable improvement by starting the second loop from i+1 as opposed to 0.
C# doesn't support SIMD. Additionally, and I haven't yet figured out why, DLLs that use SSE aren't any faster when called from C# than the non-SSE equivalent functions. Also, all SIMD extensions that I know of don't work with branching anyway, i.e. your "if" statements.
If you're using .net 4.0, you can use Parallel For to gain speed if you have multiple cores. Otherwise you can write a multithreaded version if you have .net 3.5 or less.
Here is a method similar to yours:
IList<int> intersect(int[] arr1, int[] arr2)
{
    IList<int> intersect = new List<int>();
    int i = 0, j = 0;
    int iMax = arr1.Length - 1, jMax = arr2.Length - 1;

    while (i < iMax && j < jMax)
    {
        while (i < iMax && arr1[i] < arr2[j]) i++;
        if (arr1[i] == arr2[j]) intersect.Add(arr1[i]);
        while (i < iMax && arr1[i] == arr2[j]) i++; //prevent redundant entries

        while (j < jMax && arr2[j] < arr1[i]) j++;
        if (arr1[i] == arr2[j]) intersect.Add(arr1[i]);
        while (j < jMax && arr2[j] == arr1[i]) j++; //prevent redundant entries
    }
    return intersect;
}
This one also prevents any entry from appearing twice. With two sorted arrays, both of size 10 million, it completed in about a second. The compiler is supposed to remove array bounds checks if you use array.Length in a for statement; I don't know if that works in a while statement, though.
Let's say I have the array
1,2,3,4,5,6,7,8,9,10,11,12
if my chunk size = 4,
then I want to be able to have a method that will output an array of ints, int[] a =
a[0] = 1
a[1] = 3
a[2] = 6
a[3] = 10
a[4] = 14
a[5] = 18
a[6] = 22
a[7] = 26
a[8] = 30
a[9] = 34
a[10] = 38
a[11] = 42
note that a[n] = a[n] + a[n-1] + a[n-2] + a[n-3], because the chunk size is 4 and thus I sum the last 4 items
I need to have the method without a nested loop
for (int i = 0; i < 12; i++)
{
    for (int k = i; k >= 0; k--)
    {
        // do summation
        counter++;
        if (counter == 4)
            break;
    }
}
For example, I don't want to have something like that, in order to make the code more efficient.
Also, the chunk size may change, so I cannot do:
a[3] = a[0] + a[1] + a[2] + a[3]
edit
The reason why I asked this question is that I need to implement checksum rolling for my data structures class. I basically open a file for reading, so I then have a byte array. Then I perform a hash function on parts of the file. Let's say the file is 100 bytes; I split it in chunks of 10 bytes and perform a hash function on each chunk, so I get 10 hashes. Then I need to compare those hashes with a second file that is similar. Let's say the second file has the same 100 bytes but with an additional 5, so it contains a total of 105 bytes. Because those extra bytes may have been in the middle of the file, if I perform the same algorithm that I did on the first file it is not going to work. I hope I explained myself correctly. And because some files are large, it is not efficient to have a nested loop in my algorithm.
Also, the real rolling hash functions are very complex. Most of them are in C++ and I have a hard time understanding them. That's why I want to create my own very simple hash function, just to demonstrate how checksum rolling works...
Edit 2
int chunckSize = 4;
int[] a = new int[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 }; // the bytes of the file
int[] b = new int[a.Length];   // array where we will place the checksums
int[] sum = new int[a.Length]; // array needed to avoid nested loop

for (int i = 0; i < a.Length; i++)
{
    int temp = 0;
    if (i == 0)
    {
        temp = 1;
    }

    sum[i] += a[i] + sum[i - 1 + temp];

    if (i < chunckSize)
    {
        b[i] = sum[i];
    }
    else
    {
        b[i] = sum[i] - sum[i - chunckSize];
    }
}
The problem with this algorithm is that with large files the sum will at some point be larger than int.MaxValue, so it is not going to work....
but at least now it is more efficient. Getting rid of that nested loop helped a lot!
edit 3
Based on edit 2 I have worked this out. It does not work with large files and the checksum algorithm is very bad, but at least I think it explains the rolling hashing that I am trying to demonstrate...
Part1(@"A:\fileA.txt");
Part2(@"A:\fileB.txt", null);
.....
// split the file in chunks and return the checksums of the chunks
private static UInt64[] Part1(string file)
{
    UInt64[] hashes = new UInt64[(int)Math.Pow(2, 20)];

    var stream = File.OpenRead(file);
    int chunckSize = (int)Math.Pow(2, 22); // 10 => kilobyte, 20 => megabyte, 30 => gigabyte, etc.
    byte[] buffer = new byte[chunckSize];
    int bytesRead;   // how many bytes were read
    int counter = 0; // counter

    while ( // while bytesRead > 0
        (bytesRead =
            (stream.Read(buffer, 0, buffer.Length)) // returns the number of bytes read or 0 if no bytes read
        ) > 0)
    {
        hashes[counter] = 0;
        for (int i = 0; i < bytesRead; i++)
        {
            hashes[counter] = hashes[counter] + buffer[i]; // simple algorithm, not realistic as a real file checksum
        }
        counter++;
    } // end while loop

    return hashes;
}

// split the file in chunks, rolling it. In reality this file will be on a different computer..
private static void Part2(string file, UInt64[] hash)
{
    UInt64[] hashes = new UInt64[(int)Math.Pow(2, 20)];

    var stream = File.OpenRead(file);
    int chunckSize = (int)Math.Pow(2, 22); // chunks must be as big as in the previous method
    byte[] buffer = new byte[chunckSize];
    int bytesRead;   // how many bytes were read
    int counter = 0; // counter

    UInt64[] sum = new UInt64[(int)Math.Pow(2, 20)];

    while ( // while bytesRead > 0
        (bytesRead =
            (stream.Read(buffer, 0, buffer.Length)) // returns the number of bytes read or 0 if no bytes read
        ) > 0)
    {
        for (int i = 0; i < bytesRead; i++)
        {
            int temp = 0;
            if (counter == 0)
                temp = 1;

            sum[counter] += (UInt64)buffer[i] + sum[counter - 1 + temp];

            if (counter < chunckSize)
            {
                hashes[counter] = (UInt64)sum[counter];
            }
            else
            {
                hashes[counter] = (UInt64)sum[counter] - (UInt64)sum[counter - chunckSize];
            }
            counter++;
        }
    } // end while loop

    // missing: compare the hashes arrays
}
Add an array r for the result, and initialize its first chunk members using a loop from 0 to chunk-1. Now observe that to get r[i+1] you can add a[i+1] to r[i], and subtract a[i-chunk+1]. Now you can do the rest of the items in a single non-nested loop:
for (int i = chunk - 1; i < N - 1; i++) {
    r[i+1] = a[i+1] + r[i] - a[i-chunk+1];
}
You can get this down to a single for loop, though that may not be good enough. To do that, just note that c[i+1] = c[i]-a[i-k+1]+a[i+1]; where a is the original array, c is the chunky array, and k is the size of the chunks.
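Putting the initialization and the recurrence together, a minimal sketch (the method name is mine; use long if the sums can overflow int):

// Chunked running sums in one pass, no nested loop (k is the chunk size).
static int[] ChunkSums(int[] a, int k)
{
    var c = new int[a.Length];
    for (int i = 0; i < a.Length; i++)
    {
        c[i] = a[i];
        if (i > 0) c[i] += c[i - 1];   // extend the running sum
        if (i >= k) c[i] -= a[i - k];  // drop the element that just left the window
    }
    return c;
}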
I understand that you want to compute a rolling hash function to hash every n-gram (where n is what you call the "chunk size"). Rolling hashing is sometimes called "recursive hashing". There is a wikipedia entry on the topic:
http://en.wikipedia.org/wiki/Rolling_hash
A common algorithm to solve this problem is Karp-Rabin. Here is some pseudo-code which you should be able to easily implement in C#:
B ← 37
s ← empty First-In-First-Out (FIFO) structure (e.g., a linked list)
x ← 0 (L-bit integer)
z ← 0 (L-bit integer)

for each character c do
    append c to s
    x ← (B x − B^n z + c) mod 2^L
    yield x
    if length(s) = n then
        remove oldest character y from s
        z ← y
    end if
end for
Note that because B^n is a constant, the main loop only does two multiplications, one subtraction and one addition. The "mod 2^L" operation can be done very fast (use a mask, or unsigned integers with L=32 or L=64, for example).
Specifically, your C# code might look like this, where n is the "chunk" size (just set B = 37, and Btothen = 37 raised to the power n):
r[0] = 0;
for (int i = 1; i < N; i++) {
    // note: a[i - n] assumes i >= n; the first n - 1 positions need separate handling
    r[i] = a[i] + B * r[i-1] - Btothen * a[i-n];
}
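For completeness, a fuller sketch (the names are mine) that guards the first n-1 positions and uses ulong so that the "mod 2^L" happens automatically on overflow with L = 64:

// Rolling (Karp-Rabin style) hashes of every n-byte window.
static ulong[] RollingHashes(byte[] a, int n)
{
    const ulong B = 37;
    ulong bToTheN = 1;
    for (int k = 0; k < n; k++) bToTheN *= B;   // B^n (exponentiation, not C#'s ^ operator)

    var r = new ulong[a.Length];
    ulong h = 0;
    for (int i = 0; i < a.Length; i++)
    {
        h = h * B + a[i];                       // bring the new byte in
        if (i >= n) h -= bToTheN * a[i - n];    // roll the oldest byte out
        r[i] = h;                               // hash of the last min(i + 1, n) bytes
    }
    return r;
}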
Karp-Rabin is not ideal however. I wrote a paper where better solutions are discussed:
Daniel Lemire and Owen Kaser, Recursive n-gram hashing is pairwise independent, at best, Computer Speech & Language 24 (4), pages 698-710, 2010.
http://arxiv.org/abs/0705.4676
I also published the source code (Java and C++, alas no C# but it should not be hard to go from Java to C#):
https://github.com/lemire/rollinghashjava
https://github.com/lemire/rollinghashcpp
How about storing off the last chunk_size values as you step through?
Allocate an array of size chunk_size, set them all to zero, then at each iteration of i set the element at i % chunk_size to your current element, and add up all the values?
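A small sketch of that idea (the method name is mine; instead of re-summing the buffer each step, it keeps a running total and subtracts the value that drops out):

// Keep the last chunkSize values in a ring buffer and maintain their running total.
static int[] ChunkSumsRingBuffer(int[] a, int chunkSize)
{
    var window = new int[chunkSize];     // last chunkSize values, initially zero
    var result = new int[a.Length];
    int sum = 0;
    for (int i = 0; i < a.Length; i++)
    {
        sum -= window[i % chunkSize];    // value that falls out of the window (0 at the start)
        window[i % chunkSize] = a[i];
        sum += a[i];
        result[i] = sum;
    }
    return result;
}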
using System;

class Sample {
    static void Main() {
        int chunckSize = 4;
        int[] a = new int[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 };
        int[] b = new int[a.Length];
        int sum = 0;
        int d = chunckSize * (chunckSize - 1) / 2;
        // note: this shortcut relies on the sample data being 1, 2, 3, ... (i.e. a[i] == i + 1)
        foreach (var i in a) {
            if (i < chunckSize) {
                sum += i;
                b[i - 1] = sum;
            } else {
                b[i - 1] = chunckSize * i - d;
            }
        }
        Console.WriteLine(String.Join(",", b)); //1,3,6,10,14,18,22,26,30,34,38,42
    }
}