Fast intersection of two sorted integer arrays - c#

I need to find the intersection of two sorted integer arrays and do it very fast.
Right now, I am using the following code:
int i = 0, j = 0;
while (i < arr1.Count && j < arr2.Count)
{
    if (arr1[i] < arr2[j])
    {
        i++;
    }
    else
    {
        if (arr2[j] < arr1[i])
        {
            j++;
        }
        else
        {
            intersect.Add(arr2[j]);
            j++;
            i++;
        }
    }
}
Unfortunately it can take hours to do all the work.
How can I do it faster? I found this article where SIMD instructions are used. Is it possible to use SIMD in .NET?
What do you think about:
http://docs.go-mono.com/index.aspx?link=N:Mono.Simd Mono.SIMD
http://netasm.codeplex.com/ NetASM (injects asm code into managed code)
and something like http://www.atrevido.net/blog/PermaLink.aspx?guid=ac03f447-d487-45a6-8119-dc4fa1e932e1
EDIT:
When I say thousands, I mean the following (in code):
for (var i = 0; i < arrCollection1.Count - 1; i++)
{
    for (var j = i + 1; j < arrCollection2.Count; j++)
    {
        Intersect(arrCollection1[i], arrCollection2[j]);
    }
}

UPDATE
The fastest I got was 200 ms with arrays of size 10 million, using the unsafe version (last piece of code).
The test I did:
var arr1 = new int[10000000];
var arr2 = new int[10000000];
for (var i = 0; i < 10000000; i++)
{
    arr1[i] = i;
    arr2[i] = i * 2;
}
var sw = Stopwatch.StartNew();
var result = arr1.IntersectSorted(arr2);
sw.Stop();
Console.WriteLine(sw.Elapsed); // 00:00:00.1926156
Full Post:
I've tested various ways to do it and found this to be very good:
public static List<int> IntersectSorted(this int[] source, int[] target)
{
    // Set initial capacity to a "full-intersection" size
    // This prevents multiple re-allocations
    var ints = new List<int>(Math.Min(source.Length, target.Length));
    var i = 0;
    var j = 0;
    while (i < source.Length && j < target.Length)
    {
        // Compare only once and let compiler optimize the switch-case
        switch (source[i].CompareTo(target[j]))
        {
            case -1:
                i++;
                // Saves us a JMP instruction
                continue;
            case 1:
                j++;
                // Saves us a JMP instruction
                continue;
            default:
                ints.Add(source[i++]);
                j++;
                // Saves us a JMP instruction
                continue;
        }
    }
    // Free unused memory (sets capacity to actual count)
    ints.TrimExcess();
    return ints;
}
For further improvement you can remove ints.TrimExcess(), which also makes a nice difference, but you should consider whether you are going to need that memory.
Also, if you know that you might break out of loops that use the intersection, and you don't need the results as an array/list, you should change the implementation to an iterator:
public static IEnumerable<int> IntersectSorted(this int[] source, int[] target)
{
    var i = 0;
    var j = 0;
    while (i < source.Length && j < target.Length)
    {
        // Compare only once and let compiler optimize the switch-case
        switch (source[i].CompareTo(target[j]))
        {
            case -1:
                i++;
                // Saves us a JMP instruction
                continue;
            case 1:
                j++;
                // Saves us a JMP instruction
                continue;
            default:
                yield return source[i++];
                j++;
                // Saves us a JMP instruction
                continue;
        }
    }
}
Another improvement is to use unsafe code:
public static unsafe List<int> IntersectSorted(this int[] source, int[] target)
{
    var ints = new List<int>(Math.Min(source.Length, target.Length));
    fixed (int* ptSrc = source)
    {
        var maxSrcAdr = ptSrc + source.Length;
        fixed (int* ptTar = target)
        {
            var maxTarAdr = ptTar + target.Length;
            var currSrc = ptSrc;
            var currTar = ptTar;
            while (currSrc < maxSrcAdr && currTar < maxTarAdr)
            {
                switch ((*currSrc).CompareTo(*currTar))
                {
                    case -1:
                        currSrc++;
                        continue;
                    case 1:
                        currTar++;
                        continue;
                    default:
                        ints.Add(*currSrc);
                        currSrc++;
                        currTar++;
                        continue;
                }
            }
        }
    }
    ints.TrimExcess();
    return ints;
}
In summary, the biggest performance hit was in the if-elses.
Turning them into a switch-case made a huge difference (about 2 times faster).

Have you tried something simple like this:
var a = Enumerable.Range(1, int.MaxValue/100).ToList();
var b = Enumerable.Range(50, int.MaxValue/100 - 50).ToList();
//var c = a.Intersect(b).ToList();
List<int> c = new List<int>();
var t1 = DateTime.Now;
foreach (var item in a)
{
    if (b.BinarySearch(item) >= 0)
        c.Add(item);
}
var t2 = DateTime.Now;
var tres = t2 - t1;
This piece of code takes one array of 21,474,836 elements and another of 21,474,786 elements.
If I use var c = a.Intersect(b).ToList(); I get an OutOfMemoryException.
The full cross product would be 461,167,507,485,096 iterations using a nested foreach,
but with this simple code the intersection completed in TotalSeconds = 7.3960529 (using one core).
I am still not happy, so I am trying to increase the performance by splitting this into parallel work; as soon as I finish I will post it.
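In the meantime, here is a minimal sketch of how the parallel version might look, assuming .NET 4 (Partitioner, Parallel.ForEach, ConcurrentBag); the method name and the chunk merging are mine, not the final code I will post:
// Hypothetical sketch of the parallel variant hinted at above (not the final code).
// Requires: using System.Collections.Concurrent; using System.Collections.Generic;
//           using System.Linq; using System.Threading.Tasks;
static List<int> IntersectWithBinarySearchParallel(List<int> a, List<int> b)
{
    var chunks = new ConcurrentBag<List<int>>();
    // Split the outer list into index ranges and run the binary searches of each range on its own core.
    Parallel.ForEach(Partitioner.Create(0, a.Count), range =>
    {
        var local = new List<int>();
        for (int k = range.Item1; k < range.Item2; k++)
        {
            if (b.BinarySearch(a[k]) >= 0)
                local.Add(a[k]);
        }
        chunks.Add(local);
    });
    // ConcurrentBag does not preserve chunk order, so sort once at the end.
    var result = chunks.SelectMany(x => x).ToList();
    result.Sort();
    return result;
}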

Yorye Nathan gave me the fastest intersection of two arrays with the last "unsafe code" method. Unfortunately it was still too slow for me; I needed to make combinations of array intersections, which goes up to 2^32 combinations, pretty much, no? I made the following modifications and adjustments and the time dropped to 2.6 times faster. You need to do some pre-optimization first; surely you can do it one way or another. I am using only indexes instead of the actual objects, ids, or some other abstract comparison. So, for example, if you have to intersect big numbers like these:
Arr1: 103344, 234566, 789900, 1947890,
Arr2: 150034, 234566, 845465, 23849854
put everything into one array:
Arr1: 103344, 234566, 789900, 1947890, 150034, 845465,23849854
and use, for the intersection, the ordered indexes of that combined array:
Arr1Index: 0, 1, 2, 3
Arr2Index: 1, 4, 5, 6
Now we have smaller numbers with which we can build some other nice arrays. What I did, after taking the method from Yorye, was to take Arr2Index and expand it into what is theoretically a boolean array, in practice a byte array because of the memory-size implications:
Arr2IndexCheck: 0, 1, 0, 0, 1, 1, 1
which is more or less a dictionary that tells me, for any index, whether the second array contains it.
As the next step I avoided memory allocation, which also takes time; instead I pre-created the result array before calling the method, so during the process of finding my combinations I never instantiate anything. Of course you have to deal with the length of this array separately, so you may need to store it somewhere.
Finally the code looks like this:
public static unsafe int IntersectSorted2(int[] arr1, byte[] arr2Check, int[] result)
{
    int length;
    fixed (int* pArr1 = arr1, pResult = result)
    fixed (byte* pArr2Check = arr2Check)
    {
        int* maxArr1Adr = pArr1 + arr1.Length;
        int* arr1Value = pArr1;
        int* resultValue = pResult;
        while (arr1Value < maxArr1Adr)
        {
            if (*(pArr2Check + *arr1Value) == 1)
            {
                *resultValue = *arr1Value;
                resultValue++;
            }
            arr1Value++;
        }
        length = (int)(resultValue - pResult);
    }
    return length;
}
You can see that the result array size is returned by the function; then you do what you wish with it (resize it, keep it). Obviously the result array has to have at least the size of the smaller of arr1 and arr2.
The big improvement is that I only iterate through the first array, which ideally should be smaller than the second one, so you have fewer iterations. Fewer iterations means fewer CPU cycles, right?
So here is a really fast intersection of two ordered arrays, for when you need really high performance ;).
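For completeness, here is a hedged usage sketch; the setup below (the index remapping, the byte check array and the pre-allocated result buffer) is my reconstruction of the pre-optimization described above, and the variable names are mine:
// Hedged usage sketch of IntersectSorted2; setup code is a reconstruction, not from the original post.
int[] allValues = { 103344, 234566, 789900, 1947890, 150034, 845465, 23849854 };
int[] arr1Index = { 0, 1, 2, 3 };   // indexes of Arr1's values in allValues
int[] arr2Index = { 1, 4, 5, 6 };   // indexes of Arr2's values in allValues

// Expand the second index array into a byte lookup ("Arr2IndexCheck").
var arr2Check = new byte[allValues.Length];
foreach (int idx in arr2Index)
    arr2Check[idx] = 1;

// Pre-create the result buffer once, outside the hot loop.
var result = new int[arr1Index.Length];
int count = IntersectSorted2(arr1Index, arr2Check, result);
// The first 'count' entries of 'result' are the intersecting indexes
// (here count == 1, index 1, i.e. the shared value 234566).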

Are arrCollection1 and arrCollection2 collections of arrays of integers? In that case you should get a notable improvement by starting the second loop from i+1 as opposed to 0.

C# doesn't support SIMD. Additionally, and I haven't yet figured out why, DLLs that use SSE aren't any faster when called from C# than the non-SSE equivalent functions. Also, all the SIMD extensions that I know of don't work with branching anyway, i.e. your "if" statements.
If you're using .NET 4.0, you can use Parallel.For to gain speed if you have multiple cores. Otherwise you can write a multithreaded version if you have .NET 3.5 or less.
Here is a method similar to yours:
IList<int> intersect(int[] arr1, int[] arr2)
{
    IList<int> intersect = new List<int>();
    int i = 0, j = 0;
    int iMax = arr1.Length - 1, jMax = arr2.Length - 1;
    while (i < iMax && j < jMax)
    {
        while (i < iMax && arr1[i] < arr2[j]) i++;
        if (arr1[i] == arr2[j]) intersect.Add(arr1[i]);
        while (i < iMax && arr1[i] == arr2[j]) i++; // prevent redundant entries
        while (j < jMax && arr2[j] < arr1[i]) j++;
        if (arr1[i] == arr2[j]) intersect.Add(arr1[i]);
        while (j < jMax && arr2[j] == arr1[i]) j++; // prevent redundant entries
    }
    return intersect;
}
This one also prevents any entry from appearing twice. With two sorted arrays, both of size 10 million, it completed in about a second. The compiler is supposed to remove array bounds checks if you use array.Length in a for statement; I don't know whether that works in a while statement, though.
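To illustrate the Parallel.For suggestion, here is a rough sketch of how the questioner's nested loop could be spread over cores, assuming .NET 4.0; intersect(...) stands for the pairwise routine above, and the collections are the questioner's:
// Hedged sketch of the Parallel.For idea, assuming .NET 4.0 (using System.Threading.Tasks).
// Each outer index runs on its own worker; avoid shared mutable state across iterations.
Parallel.For(0, arrCollection1.Count - 1, i =>
{
    for (int j = i + 1; j < arrCollection2.Count; j++)
    {
        var result = intersect(arrCollection1[i], arrCollection2[j]);
        // ... consume 'result' here, e.g. store it into a per-i slot of a preallocated array.
    }
});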

Related

How to shuffle string characters to right and left until int.MaxValue?

My task is to make an organized shuffle: from the source, all odd numbers go to the left and even numbers go to the right.
I have done it like this, and it is fine for the normal scenario:
public static string ShuffleChars(string source, int count)
{
    if (string.IsNullOrWhiteSpace(source) || source.Length == 0)
    {
        throw new ArgumentException(null);
    }
    if (count < 0)
    {
        throw new ArgumentException(null);
    }
    for (int i = 0; i < count; i++)
    {
        source = string.Concat(source.Where((item, index) => index % 2 == 0)) +
                 string.Concat(source.Where((item, index) => index % 2 != 0));
    }
    return source;
}
Now the problem is: what if count is int.MaxValue or another huge number in the millions? It will loop through a lot. How can I optimize the code in terms of speed and resource consumption?
You should be able to determine, from the string's length, how many iterations it takes before it's back to its original sort order. Then take the modulus of the iteration count and the input count, and only iterate that many times.
For example, a string that is three characters long is back to its original sort order in 2 iterations. If the input count was to do 11 iterations, we know that 11 % 2 == 1, so we only need to iterate one time.
Once you determine a formula for how many iterations it takes to reach the original sort order for any length of string, you can always reduce the number of iterations to that number or less.
Coming up with a formula will be tricky, however. A string with 14 characters takes 12 iterations until it matches itself, but a string with 15 characters only takes 4 iterations.
Therefore, a shortcut might be to simply start iterating until we reach the original sort order (or the specified count, whichever comes first). If we reach the count first, then we return that answer. Otherwise, we can determine the answer from the idea in the first paragraph - take the modulus of the input count and the iteration count, and return that answer.
This would require that we store the values from our iterations (in a dictionary, for example) so we can retrieve a specific previous value.
For example:
public static string ShuffleChars(string source, int count)
{
    string s = source;
    var results = new Dictionary<int, string>();
    for (int i = 0; i < count; i++)
    {
        s = string.Concat(s.Where((item, index) => index % 2 == 0)) +
            string.Concat(s.Where((item, index) => index % 2 != 0));
        // If we've cycled back to the original string, return the saved
        // value of the input count modulus the cycle length (i + 1)
        if (s == source)
        {
            int remainder = count % (i + 1);
            // A zero remainder means the requested count lands exactly on the original string
            return remainder == 0 ? source : results[remainder - 1];
        }
        // Otherwise, save the value for later
        else
        {
            results[i] = s;
        }
    }
    // If we get here it means we hit the requested count before
    // ever returning to the original sort order of the input
    return s;
}
Instead of creating new immutable strings on each loop, you could work with a mutable array of characters (char[]) and swap characters between places. This would be the most efficient in terms of memory consumption, but doing the swaps in a single array can be quite tricky. Using two arrays is much easier, because you can just copy characters from one array to the other and, at the end of each loop, swap the two arrays.
One more optimization you could do is to work with the indices of the char array instead of its values. I am not sure if this will make any difference in practice, since on modern 64-bit machines both char and int types occupy 8 bytes (AFAIK). It will surely make a difference on 32-bit machines, though. Here is an implementation with all these ideas put together:
public static string ShuffleChars(string source, int count)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (count < 0) throw new ArgumentOutOfRangeException(nameof(count));
    // Instantiate the two arrays
    int[] indices = new int[source.Length];
    int[] temp = new int[source.Length];
    // Initialize the indices array with incremented numbers
    for (int i = 0; i < indices.Length; i++)
        indices[i] = i;
    for (int k = 0; k < count; k++)
    {
        // Copy the odds to the temp array
        for (int i = 0, j = 0; j < indices.Length; i += 1, j += 2)
            temp[i] = indices[j];
        // Copy the evens to the temp array
        int lastEven = (indices.Length >> 1 << 1) - 1;
        for (int i = indices.Length - 1, j = lastEven; j >= 0; i -= 1, j -= 2)
            temp[i] = indices[j];
        // Swap the two arrays, using value tuples
        (indices, temp) = (temp, indices);
    }
    // Map the indices to characters from the source string
    return String.Concat(indices.Select(i => source[i]));
}

Is it possible to multiply two arrays as a single command for code performance?

Given the following code:
public float[] weights;
public void Input(Neuron[] neurons)
{
float output = 0;
for (int i = 0; i < neurons.Length; i++)
output += neurons[i].input * weights[i];
}
Is it possible to perform all the calculations in a single execution? For example, that would be neurons[0].input * weights[0] + neurons[1].input * weights[1] + ...
Coming from this topic - How to sum up an array of integers in C#, there is a way for simpler calculations, but the idea of my code is to iterate over the first array, multiply each element by the element at the same index in the second array, and add that to a running total.
Doing perf profiling, the line where the output is summed is very heavy and consumes 99% of my processing power. The stack should have enough memory for this; I am not worried about stack overflow, I just want to see it work faster for the moment (even if accuracy is sacrificed).
I think you are looking for AVX in C#,
so you can actually calculate several values in one instruction.
That's SIMD for CPU cores. Take a look at this.
Here is an example from the website:
public static int[] SIMDArrayAddition(int[] lhs, int[] rhs)
{
    var simdLength = Vector<int>.Count;
    var result = new int[lhs.Length];
    var i = 0;
    for (i = 0; i <= lhs.Length - simdLength; i += simdLength)
    {
        var va = new Vector<int>(lhs, i);
        var vb = new Vector<int>(rhs, i);
        (va + vb).CopyTo(result, i);
    }
    for (; i < lhs.Length; ++i)
    {
        result[i] = lhs[i] + rhs[i];
    }
    return result;
}
You can also combine it with the parallelism you already use.
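As a hedged sketch, the same Vector<T> idea could be adapted to the questioner's multiply-and-sum, assuming the neuron inputs have first been copied into a plain float[] (the method name is mine, not from the linked article):
// Hedged adaptation of the Vector<T> example to a dot product.
// Requires: using System.Numerics;
public static float DotProduct(float[] inputs, float[] weights)
{
    var simdLength = Vector<float>.Count;
    var acc = Vector<float>.Zero;
    int i = 0;
    // Multiply and accumulate one SIMD lane's worth of elements per step.
    for (; i <= inputs.Length - simdLength; i += simdLength)
    {
        acc += new Vector<float>(inputs, i) * new Vector<float>(weights, i);
    }
    // Horizontal sum of the vector accumulator.
    float output = Vector.Dot(acc, Vector<float>.One);
    // Handle the remaining elements that don't fill a full vector.
    for (; i < inputs.Length; i++)
    {
        output += inputs[i] * weights[i];
    }
    return output;
}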

Increasing sequence in one dimensional array

You're given an array of integers. Whenever you see a subsequence in which each element is bigger than the previous one by exactly one (2 3 4 5), you have to rewrite that subsequence in the result as a range like 2 - 5, followed by the rest of the array. So in general, when you have 1 2 3 5 8 10 11 12 13 14 15, the output should be something like 1-3 5 8 10-15.
I have my own idea but can't really implement it, so all I managed to do is:
static void CompactArray(int[] arr)
{
    int[] newArr = new int[arr.Length];
    int l = 0;
    for (int i = 0, k = 1; i < arr.Length; i += k, k = 1)
    {
        if (arr[i + 1] == arr[i] + 1)
        {
            int j = i;
            while (arr[j + 1] == arr[j] + 1)
            {
                j++;
                k++;
            }
            if (k > 1)
            {
            }
        }
        else if (k == 1)
        {
            newArr[i] = arr[i];
        }
    }
}
In short, I walk through the array checking whether the next element is one plus the current element; if so, I keep walking as long as the condition holds, and afterwards I rewrite the elements under those indices and move on to the next.
I expect that people will help me develop my own solution by giving suggestions instead of posting their own based on the tools the language provides, because I had that situation on a Russian forum and it didn't help me. I also hope my explanation is clear; English isn't my native language, so sorry for possible mistakes.
If I understand the problem correctly, you just need to print the result on the screen, so I'd start by declaring the variable which will hold our result string:
var result = string.Empty;
Not using another array to store the state will help us keep the code clean and much more readable.
Let's now focus on the main logic. We'd like to loop over the array.
for (int i = 0; i < array.Length; i++)
{
    // Let's store the initial index of current iteration.
    var beginningIndex = i;
    // Jump to the next element, as long as:
    // - it exists (i + 1 < array.Length)
    // - and it is greater than the current element by 1 (array[i] == array[i+1] - 1)
    while (i + 1 < array.Length && array[i] == array[i+1] - 1)
    {
        i++;
    }
    // If the current element is the same as the one we started with, add it to the result string.
    if (i == beginningIndex)
    {
        result += $"{array[i]} ";
    }
    // If it is a different element, add the range from the beginning element to the one we ended with.
    else
    {
        result += $"{array[beginningIndex]}-{array[i]} ";
    }
}
All that's left is printing the result:
Console.WriteLine(result);
Combining it all together would make the whole function look like:
static void CompactArray(int[] array)
{
    var result = string.Empty;
    for (int i = 0; i < array.Length; i++)
    {
        var beginningIndex = i;
        while (i + 1 < array.Length && array[i] == array[i+1] - 1)
        {
            i++;
        }
        if (i == beginningIndex)
        {
            result += $"{array[i]} ";
        }
        else
        {
            result += $"{array[beginningIndex]}-{array[i]} ";
        }
    }
    Console.WriteLine(result);
}

Quick Sort Implementation with large numbers [duplicate]

I learnt about quicksort and how it can be implemented both recursively and iteratively.
In the iterative method:
Push the range (0...n) into the stack
Partition the given array with a pivot
Pop the top element.
Push the partitions (index range) onto a stack if the range has more than one element
Do the above 3 steps, till the stack is empty
And the recursive version is the normal one defined in wiki.
I learnt that recursive algorithms are always slower than their iterative counterparts.
So, which method is preferred in terms of time complexity (memory is not a concern)?
Which one is fast enough to use in a programming contest?
Is the C++ STL sort() using a recursive approach?
In terms of (asymptotic) time complexity, they are both the same.
The rationale behind "recursive is slower than iterative" is the overhead of the recursion stack (saving and restoring the environment between calls).
However, these are a constant number of operations per call and do not change the number of "iterations".
Both recursive and iterative quicksort are O(n log n) average case and O(n^2) worst case.
EDIT:
Just for the fun of it, I ran a benchmark with the (Java) code attached to the post, and then I ran a Wilcoxon statistic test to check the probability that the running times are indeed distinct.
The results may be conclusive (P_VALUE = 2.6e-34, https://en.wikipedia.org/wiki/P-value; remember that the P_VALUE is P(T >= t | H), where T is the test statistic and H is the null hypothesis). But the answer is not what you expected.
The average of the iterative solution was 408.86 ms, while that of the recursive one was 236.81 ms.
(Note: I used Integer and not int as the argument to recursiveQsort() - otherwise the recursive version would have done much better, because it wouldn't have to box a lot of integers, which is also time consuming. I did it because the iterative solution has no choice but to do so.)
Thus, your assumption is not true: the recursive solution is faster (on my machine and in Java, at the very least) than the iterative one, with P_VALUE = 2.6e-34.
public static void recursiveQsort(int[] arr, Integer start, Integer end) {
    if (end - start < 2) return; // stop clause
    int p = start + ((end - start) / 2);
    p = partition(arr, p, start, end);
    recursiveQsort(arr, start, p);
    recursiveQsort(arr, p + 1, end);
}
public static void iterativeQsort(int[] arr) {
    Stack<Integer> stack = new Stack<Integer>();
    stack.push(0);
    stack.push(arr.length);
    while (!stack.isEmpty()) {
        int end = stack.pop();
        int start = stack.pop();
        if (end - start < 2) continue;
        int p = start + ((end - start) / 2);
        p = partition(arr, p, start, end);
        stack.push(p + 1);
        stack.push(end);
        stack.push(start);
        stack.push(p);
    }
}
private static int partition(int[] arr, int p, int start, int end) {
    int l = start;
    int h = end - 2;
    int piv = arr[p];
    swap(arr, p, end - 1);
    while (l < h) {
        if (arr[l] < piv) {
            l++;
        } else if (arr[h] >= piv) {
            h--;
        } else {
            swap(arr, l, h);
        }
    }
    int idx = h;
    if (arr[h] < piv) idx++;
    swap(arr, end - 1, idx);
    return idx;
}
private static void swap(int[] arr, int i, int j) {
    int temp = arr[i];
    arr[i] = arr[j];
    arr[j] = temp;
}
public static void main(String... args) throws Exception {
    Random r = new Random(1);
    int SIZE = 1000000;
    int N = 100;
    int[] arr = new int[SIZE];
    int[] millisRecursive = new int[N];
    int[] millisIterative = new int[N];
    for (int t = 0; t < N; t++) {
        for (int i = 0; i < SIZE; i++) {
            arr[i] = r.nextInt(SIZE);
        }
        int[] tempArr = Arrays.copyOf(arr, arr.length);
        long start = System.currentTimeMillis();
        iterativeQsort(tempArr);
        millisIterative[t] = (int) (System.currentTimeMillis() - start);
        tempArr = Arrays.copyOf(arr, arr.length);
        start = System.currentTimeMillis();
        recursiveQsort(tempArr, 0, arr.length);
        millisRecursive[t] = (int) (System.currentTimeMillis() - start);
    }
    int sum = 0;
    for (int x : millisRecursive) {
        System.out.println(x);
        sum += x;
    }
    System.out.println("end of recursive. AVG = " + ((double) sum) / millisRecursive.length);
    sum = 0;
    for (int x : millisIterative) {
        System.out.println(x);
        sum += x;
    }
    System.out.println("end of iterative. AVG = " + ((double) sum) / millisIterative.length);
}
Recursion is NOT always slower than iteration, and quicksort is a perfect example of that. The only way to do it iteratively is to create a stack structure, so you end up doing the same thing the compiler does when you use recursion, and you will probably do it worse than the compiler. Also, there will be more jumps if you don't use recursion (to pop and push values to the stack).
That's the solution I came up with in JavaScript. I think it works.
const myArr = [33, 103, 3, 726, 200, 984, 198, 764, 9]
document.write('initial order :', JSON.stringify(myArr), '<br><br>')
qs_iter(myArr)
document.write('_Final order :', JSON.stringify(myArr))

function qs_iter(items) {
  if (!items || items.length <= 1) {
    return items
  }
  var stack = []
  var low = 0
  var high = items.length - 1
  stack.push([low, high])
  while (stack.length) {
    var range = stack.pop()
    low = range[0]
    high = range[1]
    if (low < high) {
      var pivot = Math.floor((low + high) / 2)
      stack.push([low, pivot])
      stack.push([pivot + 1, high])
      while (low < high) {
        while (low < pivot && items[low] <= items[pivot]) low++
        while (high > pivot && items[high] > items[pivot]) high--
        if (low < high) {
          var tmp = items[low]
          items[low] = items[high]
          items[high] = tmp
        }
      }
    }
  }
  return items
}
Let me know if you find a mistake :)
Mister Jojo UPDATE:
This code just mixes values; it could in rare cases lead to a sort - in other words, never. For those who have a doubt, I put it in a snippet.

Segmented Aggregation within an Array

I have a large array of primitive value types. The array is in fact one-dimensional, but logically represents a 2-dimensional field. As you read from left to right, each value needs to become (the original value of the current cell) + (the result calculated in the cell to the left), with the obvious exception of the first element of each row, which is just the original value.
I already have an implementation which accomplishes this, but it iterates over the entire array and is extremely slow for large (1M+ element) arrays.
Given the following example array,
0 0 1 0 0
2 0 0 0 3
0 4 1 1 0
0 1 0 4 1
Becomes
0 0 1 1 1
2 2 2 2 5
0 4 5 6 6
0 1 1 5 6
And so forth to the right, up to problematic sizes (1024x1024).
The array should ideally be updated in place, but another array can be used if necessary. Memory footprint isn't much of an issue here, but performance is critical, as these arrays have millions of elements and must be processed hundreds of times per second.
The individual cell calculations do not appear to be parallelizable given their dependence on values starting from the left, so GPU acceleration seems impossible. I have investigated PLINQ, but its requirement for indices makes it very difficult to implement.
Is there another way to structure the data to make it faster to process?
If efficient GPU processing is feasible using an innovative technique, this would be vastly preferable, as this is currently texture data which has to be pulled from and pushed back to the video card.
Proper coding and a bit of insight into how .NET works help as well :-)
Some rules of thumb that apply in this case:
1. If you can hint the JIT that the indexing will never get out of bounds of the array, it will remove the extra branch.
2. You should only split the work over multiple threads if it's really slow (f.ex. >1 second). Otherwise task switching, cache flushes etc. will probably just eat up the added speed and you'll end up worse off.
3. If possible, make memory access predictable, even sequential. If you need another array, so be it; if not, prefer reusing the same one.
4. Use as few IL instructions as possible if you want speed. Generally this seems to work.
5. Test multiple iterations. A single iteration might not be good enough.
Using these rules, you can make a small test case as follows. Note that I've upped the stakes to 4Kx4K since 1Kx1K is just so fast you cannot measure it :-)
public static void Main(string[] args)
{
    int width = 4096;
    int height = 4096;
    int[] ar = new int[width * height];
    Random rnd = new Random(213);
    for (int i = 0; i < ar.Length; ++i)
    {
        ar[i] = rnd.Next(0, 120);
    }
    // (5)...
    for (int j = 0; j < 10; ++j)
    {
        Stopwatch sw = Stopwatch.StartNew();
        int sum = 0;
        for (int i = 0; i < ar.Length; ++i) // (3) sequential access
        {
            if ((i % width) == 0)
            {
                sum = 0;
            }
            // (1) --> the JIT will notice this won't go out of bounds because [0 <= i < ar.Length]
            // (4) --> '+=' is an expression generating a 'dup'; this creates less IL.
            ar[i] = (sum += ar[i]);
        }
        Console.WriteLine("This took {0:0.0000}s", sw.Elapsed.TotalSeconds);
    }
    Console.ReadLine();
}
One of these iterations will take roughly 0.0174 s here, and since this is about 16x the worst-case scenario you describe, I suppose your performance problem is solved.
If you really want to parallelize it to make it faster, I suppose that is possible, even though you will lose some of the optimizations in the JIT (specifically: (1)). However, if you have a multi-core system like most people, the benefits might outweigh these:
for (int j = 0; j < 10; ++j)
{
    Stopwatch sw = Stopwatch.StartNew();
    Parallel.For(0, height, (a) =>
    {
        // Start the running sum with the row's first element, which keeps its original value.
        int sum = ar[width * a];
        for (var i = width * a + 1; i < width * (a + 1); i++)
        {
            ar[i] = (sum += ar[i]);
        }
    });
    Console.WriteLine("This took {0:0.0000}s", sw.Elapsed.TotalSeconds);
}
If you really, really need performance, you can compile it to C++ and use P/Invoke. Even if you don't use the GPU, I suppose the SSE/AVX instructions might already give you a significant performance boost that you won't get with .NET/C#. Also, I'd like to point out that the Intel C++ compiler can automatically vectorize your code, even for Xeon Phis. Without a lot of effort, this might give you a nice boost in performance.
Well, I don't know too much about GPUs, but I see no reason why you can't parallelize it, since the dependencies only run from left to right.
There are no dependencies between rows:
0 0 1 0 0 - process on core1 |
2 0 0 0 3 - process on core1 |
-------------------------------
0 4 1 1 0 - process on core2 |
0 1 0 4 1 - process on core2 |
Although the above statement is not completely true: there are still hidden dependencies between rows when it comes to the memory cache, so it's possible that there will be cache thrashing. You can read about "cache false sharing" in order to understand the problem and see how to overcome it.
As @Chris Eelmaa said, it is possible to do a parallel execution by row. Using Parallel.For it could be rewritten like this:
static int[,] values = new int[,] {
    { 0, 0, 1, 0, 0 },
    { 2, 0, 0, 0, 3 },
    { 0, 4, 1, 1, 0 },
    { 0, 1, 0, 4, 1 } };

static void Main(string[] args)
{
    int rows = values.GetLength(0);
    int columns = values.GetLength(1);
    Parallel.For(0, rows, (row) =>
    {
        for (var column = 1; column < columns; column++)
        {
            values[row, column] += values[row, column - 1];
        }
    });
    for (var row = 0; row < rows; row++)
    {
        for (var column = 0; column < columns; column++)
        {
            Console.Write("{0} ", values[row, column]);
        }
        Console.WriteLine();
    }
}
Since, as stated in your question, you have a one-dimensional array, the following code would be a bit faster:
static void Main(string[] args)
{
    var values = new int[1024 * 1024];
    Random r = new Random();
    for (int i = 0; i < 1024; i++)
    {
        for (int j = 0; j < 1024; j++)
        {
            values[i * 1024 + j] = r.Next(25);
        }
    }
    int rows = 1024;
    int columns = 1024;
    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        Parallel.For(0, rows, (row) =>
        {
            for (var column = 1; column < columns; column++)
            {
                values[(row * columns) + column] += values[(row * columns) + column - 1];
            }
        });
    }
    Console.WriteLine(sw.Elapsed);
}
But it is not as fast as a GPU. To use parallel GPU processing you would have to rewrite it in C++ AMP or take a look at how to port this parallel for to CUDAfy: http://w8isms.blogspot.com.es/2012/09/cudafy-me-part-3-of-4.html
You may as well store the array as a jagged array; the memory layout will be the same. So, instead of,
int[] texture;
you have,
int[][] texture;
Isolate the row operation as,
private static Task ProcessRow(int[] row)
{
    var v = row[0];
    for (var i = 1; i < row.Length; i++)
    {
        v = row[i] += v;
    }
    return Task.FromResult(true);
}
then you can write a function that does,
Task.WhenAll(texture.Select(ProcessRow)).Wait();
If you want to remain with a 1-dimensional array, a similar approach will work, just change ProcessRow.
private static Task ProcessRow(int[] texture, int start, int limit)
{
    var v = texture[start];
    for (var i = start + 1; i < limit; i++)
    {
        v = texture[i] += v;
    }
    return Task.FromResult(true);
}
then once,
var rowSize = 1024;
var rows =
    Enumerable.Range(0, texture.Length / rowSize)
        .Select(i => Tuple.Create(i * rowSize, (i * rowSize) + rowSize))
        .ToArray();
then on each cycle.
Task.WhenAll(rows.Select(t => ProcessRow(texture, t.Item1, t.Item2))).Wait();
Either way, each row is processed in parallel.
