I need to implement this scenario in C#:
The matrix will be very large, maybe 10000x10000 or larger. I will use it as the distance matrix in a hierarchical clustering algorithm. In every iteration of the algorithm the matrix has to be updated (two rows joined into one, and two columns joined into one). With a simple double[,] or double[][] matrix these operations will be very "expensive".
Can anyone suggest a C# implementation for this scenario?
Do you have an algorithm picked out already? And what do you mean by expensive: memory or time expensive? If memory expensive: there is not much you can do in C#, though you could consider executing the calculation inside a database using temporary objects. If time expensive: you can use parallelism to join columns and rows.
That said, I think a simple double[,] array is the fastest and most memory-sparing option you can get in C#, because accessing an array element is an O(1) operation and arrays have the least memory and management overhead (compared to lists and dictionaries).
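For instance, a minimal sketch of the parallel row join (assuming, as in the answers below, that joining means summing the two rows; matrix, target, and source are placeholder names, not from the question):

using System.Threading.Tasks;

// Sketch: join row 'source' into row 'target' by summing, column by column.
static void JoinRows(double[,] matrix, int target, int source)
{
    int cols = matrix.GetLength(1);
    // Each column is independent, so the additions can run in parallel.
    Parallel.For(0, cols, j =>
    {
        matrix[target, j] += matrix[source, j];
    });
}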
As mentioned above, a basic double[,] is going to be the most effective way of handling this in C#.
Remember that C# sits on top of managed memory, and as such you have less fine-grained control over low-level (in terms of memory) operations than in something like plain C. Creating your own objects in C# to add functionality will only use more memory in this scenario, and will likely slow the algorithm down as well.
If you have yet to pick an algorithm, CURE seems to be a good bet. The choice of algorithm may affect your data structure choice, but that's not likely.
You will find that the algorithm determines the theoretical limits of 'cost' in any case. For example, you will read that CURE is bound by an O(n² log n) running time and O(n) memory use.
I hope this helps. If you can provide more detail, we might be able to assist further!
N.
It's not possible to 'merge' two rows or two columns in place; you'd have to copy the whole matrix into a new, smaller one, which is indeed unacceptably expensive.
You should probably just add the values of one row into the other and then ignore the old values, acting as if they were removed.
An array of arrays (double[][]) is actually faster than double[,], but it takes more memory.
The whole array-merging business might not be needed if you change the algorithm a bit, but this might help you:
public static void MergeMatrix()
{
    int size = 100;

    // Initialize the matrix.
    double[,] matrix = new double[size, size];
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            matrix[i, j] = ((double)i) + (j / 100.0);

    int rowMergeCount = 0, colMergeCount = 0;

    // Merge the last row into the one above it.
    for (int i = 0; i < size; i++)
        matrix[size - rowMergeCount - 2, i] += matrix[size - rowMergeCount - 1, i];
    rowMergeCount++;

    // Merge the last column into the one to its left.
    for (int i = 0; i < size; i++)
        matrix[i, size - colMergeCount - 2] += matrix[i, size - colMergeCount - 1];
    colMergeCount++;

    // Read the newly merged values.
    int newRows = size - rowMergeCount, newCols = size - colMergeCount;
    double[,] smaller = new double[newRows, newCols];
    for (int i = 0; i < newRows; i++)
        for (int j = 0; j < newCols; j++)
            smaller[i, j] = matrix[i, j];

    List<int> rowsMerged = new List<int>(), colsMerged = new List<int>();

    // Merge a row at a random position.
    rowsMerged.Add(15);
    int target = rowsMerged[rowsMerged.Count - 1];
    int source = target + 1;
    // Still using the original matrix, since its values are still useful.
    for (int i = 0; i < size; i++)
        matrix[target, i] += matrix[source, i];
    rowMergeCount++;

    // Merge a column at a random position.
    colsMerged.Add(37);
    target = colsMerged[colsMerged.Count - 1];
    source = target + 1;
    for (int i = 0; i < size; i++)
        matrix[i, target] += matrix[i, source];
    colMergeCount++;

    newRows = size - rowMergeCount;
    newCols = size - colMergeCount;
    smaller = new double[newRows, newCols];
    for (int i = 0, j = 0; i < newRows && j < size; i++, j++)
    {
        for (int k = 0, m = 0; k < newCols && m < size; k++, m++)
        {
            smaller[i, k] = matrix[j, m];
            Console.Write(matrix[j, m].ToString("00.00") + " ");
            // Merging columns is more expensive, because we have to check
            // for merged indices more often while reading.
            if (colsMerged.Contains(m)) m++;
        }
        if (rowsMerged.Contains(j)) j++;
        Console.WriteLine();
    }
    Console.Read();
}
In this code I use two 1D helper lists to calculate the index into a big array containing the data. Deleting rows/columns is really cheap, since I only need to remove the corresponding index from the helper lists. Of course the memory in the big array remains, so depending on your usage this is effectively a memory leak.
using System.Collections.Generic;
using System.Linq;

public class Matrix
{
    double[] data;
    List<int> cols;   // maps visible column index to an offset within a row
    List<int> rows;   // maps visible row index to the offset of that row's start

    private int GetIndex(int x, int y)
    {
        return rows[y] + cols[x];
    }

    public double this[int x, int y]
    {
        get { return data[GetIndex(x, y)]; }
        set { data[GetIndex(x, y)] = value; }
    }

    public void DeleteColumn(int x)
    {
        cols.RemoveAt(x);
    }

    public void DeleteRow(int y)
    {
        rows.RemoveAt(y);
    }

    public Matrix(int width, int height)
    {
        cols = new List<int>(Enumerable.Range(0, width));
        rows = new List<int>(Enumerable.Range(0, height).Select(i => i * width));
        data = new double[width * height];
    }
}
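A quick usage sketch (the values are made up, just to show the cheap deletes and the remapped indexing):

var m = new Matrix(5, 5);
m[2, 3] = 1.5;      // write through the indexer
m.DeleteRow(0);     // cheap: only the row-offset list shrinks
m.DeleteColumn(2);  // likewise for columns
double v = m[1, 2]; // indices now address the shrunken logical view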
Hm, to me this looks like a simple binary tree, where the left node represents the next value in the row and the right node represents the next value in the column.
So it should be easy to iterate over rows and columns and combine them.
Thank you for the answers.
At the moment I'm using this solution:
public class NodeMatrix
{
    public NodeMatrix Right { get; set; }
    public NodeMatrix Left { get; set; }
    public NodeMatrix Up { get; set; }
    public NodeMatrix Down { get; set; }

    public int I { get; set; }
    public int J { get; set; }
    public double Data { get; set; }

    public NodeMatrix(int I, int J, double Data)
    {
        this.I = I;
        this.J = J;
        this.Data = Data;
    }
}
List<NodeMatrix> list = new List<NodeMatrix>(10000);
Then I'm building the connections between the nodes. After that the matrix is ready.
This will use more memory, but I think operations like adding rows and columns, or joining rows and columns, will be far faster.
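For illustration, joining two vertically adjacent rows in such a linked structure might look roughly like the sketch below (an assumption on my part: values are combined by addition, and both rows have a node in every column):

// Sketch: merge the row starting at 'source' into the row starting at
// 'target' (directly above it), unlinking the source nodes as we go.
static void JoinRows(NodeMatrix target, NodeMatrix source)
{
    while (target != null && source != null)
    {
        target.Data += source.Data;   // combine the two cells
        target.Down = source.Down;    // splice the source node out vertically
        if (source.Down != null)
            source.Down.Up = target;
        target = target.Right;
        source = source.Right;
    }
}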
So I'm trying to have a 2D array of 1D arrays (so [,][]).
They all end up null in the end though, for some reason (most likely because the inner arrays are never assigned).
Edit1: Mistake fixed in code. Thanks.
Edit2: Fixed title, as it is in fact a 2D array of 1D arrays.
Example:
public Environment()
{
    _grid2D = new Object[20, 20][];
}
I also try to assign objects to them later in my code:
public void GenerateGrid()
{
    Random rand = new Random();
    for (int i = 0; i < 10; i++)
    {
        var obj = new InsertObject(rand.Next(0, 19), rand.Next(0, 19));
        // The inner array at this cell was never created, which is the problem:
        _grid2D[obj.XPos, obj.YPos][0] = obj;
    }
}
I am attempting to use this kind of array because I need what are effectively multiple planes of the 2D grid, stacked on top of each other. That way multiple game objects can exist in the same space, since each cell of the 2D array holds an array of objects (X and Y properties are already defined elsewhere).
This may be a little convoluted; there may well be a better approach.
I need a 20x20 grid, with multiple planes of this grid.
Randomly deciding the location is a design choice, and when the time comes that there are multiple objects in the same location, I will check for this and prevent / reassign a location (rand again).
I'm guessing, but I think what you are trying to do is to allow any number of "InsertObject" objects in each cell of the 2D array. I'm guessing, based on your access pattern of randomly selecting x,y coordinates for each object as you add them to the grid.
If that is the case, then use a List in each cell of the grid. If you want, you can allocate each List when you add the first item, and leave it sparse (so you don't create lists for cells with 0 objects in them.) Or you can do a first pass where you populate all the cells of the 2D array with empty lists. The strategy you go with depends on how much you care about efficiency, and whether you expect a sparse or dense population.
List<Object>[,] _grid2D = new List<Object>[20, 20];
Random rand = new Random();
for (int i = 0; i < 10; i++)
{
    int x = rand.Next(0, 20); // 0..19, covering the full 20x20 grid
    int y = rand.Next(0, 20);
    Object obj = new object();              // Replace with your InsertObject here.
    if (_grid2D[x, y] == null)              // If this cell's list doesn't exist yet...
    {
        _grid2D[x, y] = new List<Object>(); // ...then make one.
    }
    _grid2D[x, y].Add(obj);                 // Add the object to the list.
}
Just be careful when accessing the grid if you go with this sparse technique, as some grid cells may have no List created if they have 0 objects (_grid2D[x,y] may be null).
And if you don't want to allow multiple objects per grid cell, then you just need a 2D array of InsertObject objects. InsertObject[,] _grid2D = new InsertObject[20,20];
Two different implementations: one with a 3D array, and another with a list of 2D arrays.
static void Main(string[] args)
{
    int numberOfPanes = 50;
    var myGrid1 = GenerateGrid1(20, 20, numberOfPanes);
    var myGrid2 = GenerateGrid2(20, 20, numberOfPanes);
}

public static Object[,,] GenerateGrid1(int x, int y, int numberOfPanes)
{
    var grid = new Object[x, y, numberOfPanes];
    Random rand = new Random(Guid.NewGuid().GetHashCode());
    for (int k = 0; k < numberOfPanes; k++)
    {
        for (int i = 0; i < x; i++)
        {
            for (int j = 0; j < y; j++)
            {
                grid[i, j, k] = rand.Next(1, 20);
            }
        }
    }
    return grid;
}

public static List<int[,]> GenerateGrid2(int x, int y, int numberOfPanes)
{
    var multiPanes = new List<int[,]>();
    Random rand = new Random(Guid.NewGuid().GetHashCode());
    for (int k = 0; k < numberOfPanes; k++)
    {
        // Allocate a fresh pane each iteration; reusing one array would
        // make every list entry point at the same pane.
        var grid = new int[x, y];
        for (int i = 0; i < x; i++)
        {
            for (int j = 0; j < y; j++)
            {
                grid[i, j] = rand.Next(1, 20);
            }
        }
        multiPanes.Add(grid);
    }
    return multiPanes;
}
I'm trying to build a method that will find the sum of all the values in a 2D array. I'm very new to programming and can't find a good starting point for figuring out how it's done. Here is what I have so far (forgive me, I'm usually an English/history guy; logic isn't my forte):
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        int[,] myArray = new int[5, 6];
        FillArray(myArray);
        LargestValue(myArray);
    }

    // Fills the array with random values 1-15
    public static void FillArray(int[,] array)
    {
        Random rnd = new Random();
        for (int i = 0; i < array.GetLength(0); i++)
        {
            for (int j = 0; j < array.GetLength(1); j++)
            {
                array[i, j] = rnd.Next(1, 16);
            }
        }
    }

    // Finds the largest value in the array (using an IEnumerable cast
    // someone showed me, but I'm a little fuzzy on how it works...)
    public static void LargestValue(int[,] array)
    {
        FillArray(array);
        IEnumerable<int> allValues = array.Cast<int>();
        int max = allValues.Max();
        Console.WriteLine("Max value: " + max);
    }

    // Finds the sum of all values
    public int SumArray(int[,] array)
    {
        FillArray(array);
        // (stuck here: how do I total the values up and return them?)
    }
}
I guess I could try to find the sum of each row or column and add them up with a for loop? Add them up and return an int? If anyone could offer any insight, it would be greatly appreciated. Thanks!
Firstly, you don't need to call FillArray at the beginning of each method; you have already populated the array in the Main method, and you are passing that populated array to the other methods.
A loop similar to what you use to populate the array is the easiest to understand:
// Finds the sum of all values
public static int SumArray(int[,] array)
{
    int total = 0;
    // Iterate through the first dimension of the array
    for (int i = 0; i < array.GetLength(0); i++)
    {
        // Iterate through the second dimension
        for (int j = 0; j < array.GetLength(1); j++)
        {
            // Add the value at this location to the total
            // (+= is shorthand for total = total + <something>)
            total += array[i, j];
        }
    }
    return total;
}
Summing an array is easy if you know its length.
As a bonus, code is included to get the highest value too.
You could easily extend this to other kinds of statistics.
I assume below that Xlength and Ylength are integers, too, and known to you.
You could also replace them with literal numbers in the code.
int total = 0;
int max = 0; // assumes the values are non-negative
int t = 0;   // temp value
for (int x = 0; x < Xlength; x++)
{
    for (int y = 0; y < Ylength; y++)
    {
        t = yourArray[x, y];
        total = total + t;
        if (t > max) { max = t; } // an if on a single line
    }
}
Here is a link with an MSDN sample on how to retrieve unknown array lengths, and there is a nice site to have around when you start in C#: google ".net perls".
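In short, if the lengths aren't known in advance, GetLength returns them per dimension; a minimal sketch:

int[,] yourArray = new int[5, 7];
int Xlength = yourArray.GetLength(0); // size of the first dimension (5)
int Ylength = yourArray.GetLength(1); // size of the second dimension (7)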
I have read the question Performance of 2-dimensional array vs 1-dimensional array.
But its conclusion says they could be the same (depending on your own mapping function; C does this mapping automatically)...
I have a matrix which has 1,000 columns and 440,000,000 rows, where each element is a double, in C#...
If I am doing some computations in memory, which one would perform better? (Note that I have the memory needed to hold such a monstrous quantity of information.)
If what you're asking is which is better, a 2D array of size 1000x44000 or a 1D array of size 44000000, well what's the difference as far as memory goes? You still have the same number of elements! In the case of performance and understandability, the 2D is probably better. Imagine having to manually find each column or row in a 1D array, when you know exactly where they are in a 2D array.
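For reference, the manual mapping you'd be writing in the 1D case is a one-line formula (a sketch; the names are placeholders):

// Row-major mapping from (row, col) to a flat 1D index.
// 'cols' is the number of columns in the logical matrix.
static int FlatIndex(int row, int col, int cols)
{
    return row * cols + col;
}
// flat[FlatIndex(r, c, 1000)] then corresponds to rect[r, c].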
It depends on how many operations you are performing. In the example below, I'm setting the values of the array 2,500 times over. The size of the array is 1000 * 1000 * 3. The 1D array took 40 seconds and the 3D array took 1 min 39 s.
var startTime = DateTime.Now;
Test1D(new byte[1000 * 1000 * 3]);
Console.WriteLine("Total time taken 1D = " + (DateTime.Now - startTime));

startTime = DateTime.Now;
Test3D(new byte[1000, 1000, 3], 1000, 1000);
Console.WriteLine("Total time taken 3D = " + (DateTime.Now - startTime));

public static void Test1D(byte[] array)
{
    for (int c = 0; c < 2500; c++)
    {
        for (int i = 0; i < array.Length; i++)
        {
            array[i] = 10;
        }
    }
}

public static void Test3D(byte[,,] array, int w, int h)
{
    for (int c = 0; c < 2500; c++)
    {
        for (int i = 0; i < h; i++)
        {
            for (int j = 0; j < w; j++)
            {
                array[i, j, 0] = 10;
                array[i, j, 1] = 10;
                array[i, j, 2] = 10;
            }
        }
    }
}
The difference between double[1000,44000] and double[44000000] will not be significant.
You're probably better off with the [,] version (letting the compiler(s) figure out the addressing), but the pattern of your calculations is likely to have more impact (locality and cache use).
Also consider the array-of-arrays variant, double[1000][]. It is a known 'feature' of the Jitter that it cannot eliminate range checking in [,] arrays.
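A sketch of that jagged-array pattern, hoisting the row reference so the inner loop runs over a plain 1D array (the shape here assumes the 1000 x 44000 layout mentioned above):

double[][] m = new double[1000][];
for (int i = 0; i < m.Length; i++)
    m[i] = new double[44000];

double sum = 0;
for (int i = 0; i < m.Length; i++)
{
    double[] row = m[i];                 // hoist the row reference once
    for (int j = 0; j < row.Length; j++) // a pattern the JIT can bounds-check-eliminate
        sum += row[j];
}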
I am trying to figure out why "Choice A" performs better than "Choice B". My test shows something like 228 vs 830 or thereabouts; it's about a 4x difference. Looking at the IL, the untrained eye doesn't pick out the subtlety between the two calls.
Thank you,
Stephen
const int SIZE = 10000;

void Main()
{
    var sw = Stopwatch.StartNew();
    int[,] A = new int[SIZE, SIZE];
    int total, x, y;

    // Choice A (note: this first timing also includes the array allocation)
    total = 0;
    for (x = 0; x < SIZE; x++)
    {
        for (y = 0; y < SIZE; y++)
        {
            total += A[x, y];
        }
    }
    Console.WriteLine(sw.ElapsedMilliseconds);

    sw.Reset();
    sw.Start();

    // Choice B
    total = 0;
    for (y = 0; y < SIZE; y++)
    {
        for (x = 0; x < SIZE; x++)
        {
            total += A[x, y];
        }
    }
    Console.WriteLine(sw.ElapsedMilliseconds);
}
// Define other methods and classes here
OK, I broke this out so that the two would run independently of each other, to mitigate any caching and/or diagnostics effects... and B is ALWAYS coming in behind A.
using System;
using System.Diagnostics;

namespace ConsoleApplication1
{
    class ProgramA
    {
        const int SIZE = 10000;

        static void Main(string[] args)
        {
            var sw = Stopwatch.StartNew();
            int[,] A = new int[SIZE, SIZE];
            int total, x, y;

            // Choice A
            total = 0;
            for (x = 0; x < SIZE; x++)
            {
                for (y = 0; y < SIZE; y++)
                {
                    total += A[x, y];
                }
            }
            Console.WriteLine(sw.ElapsedMilliseconds);
            Console.ReadLine();
        }
    }

    class ProgramB
    {
        const int SIZE = 10000;

        static void Main(string[] args)
        {
            var sw = Stopwatch.StartNew();
            int[,] A = new int[SIZE, SIZE];
            int total, x, y;

            // Choice B
            total = 0;
            for (y = 0; y < SIZE; y++)
            {
                for (x = 0; x < SIZE; x++)
                {
                    total += A[x, y];
                }
            }
            Console.WriteLine(sw.ElapsedMilliseconds);
            Console.ReadLine();
        }
    }
}
At a guess, cache effects would be the big one here.
A two-dimensional array is laid out in memory like so:
(0, 0) (0, 1) (0, 2) (0, 3) (1, 0) (1, 1) (1, 2) ...
In option A, you're accessing successive elements in memory, which means that when the CPU fetches a cache line, it gets several elements you're about to use. Option B, by contrast, jumps around through memory, so it requires significantly more memory accesses once the array becomes larger than the cache.
Ahh, I think I remember.
If you think of a 2D array as a table in memory, the first index is the row, the second index is the column.
[0, 0] [0, 1] [0, 2] [0, 3]...
[1, 0] [1, 1] [1, 2] [1, 3]...
When you iterate over it with the row as the outer loop and the column as the inner loop, you walk straight through memory: for each row, you assign each column in turn. In the second scenario the values are visited in this order:
[0, 0] [1, 0] [2, 0] [3, 0]...
[0, 1] [1, 1] [2, 1] [3, 1]...
This is slower because, for each column, you touch only one element per row before jumping a whole row's width through memory to the next one.
Does that make sense?
Edit: This was one of the things I was looking for:
http://en.wikipedia.org/wiki/Row-major_order
"In row-major storage, a multidimensional array in linear memory is accessed such that rows are stored one after the other."
So when iterating one row at a time, the code isn't jumping around memory looking for the next row before assigning each column's value: it has the row, assigns all of its columns, then moves on to the next row in memory.
To expand upon the caching answers:
The values in question are 4 bytes each, and IIRC current memory architecture reads 16-byte lines from memory, assuming a properly populated motherboard. (I don't know about DDR3; its three-chip nature suggests the reads are even bigger.) Thus, when you read a line of memory, you get 4 values.
When you do it the first way, you use all of those values before going back to memory for the next line. Done the second way, you use only one of them, and it gets flushed from the on-chip cache long before it's called for again.
Is it more performant to have a two-dimensional array (type[,]) or an array of arrays (type[][]) in C#, particularly for initial allocation and item access?
Of course, if all else fails... test it! The following gives (in a "Release" build, at the console):
Size 1000, Repeat 1000
int[,] set: 3460
int[,] get: 4036 (chk=1304808064)
int[][] set: 2441
int[][] get: 1283 (chk=1304808064)
So a jagged array is quicker, at least in this test. Interesting! However, it is a relatively small factor, so I would still stick with whichever describes my requirement better. Except for some specific (high CPU/processing) scenarios, readability / maintainability should trump a small performance gain. Up to you, though.
Note that this test assumes you access the array much more often than you create it, so I have not included timings for creation, where I would expect rectangular to be slightly quicker unless memory is highly fragmented.
using System;
using System.Diagnostics;

static class Program
{
    static void Main()
    {
        Console.WriteLine("First is just for JIT...");
        Test(10, 10);
        Console.WriteLine("Real numbers...");
        Test(1000, 1000);
        Console.ReadLine();
    }

    static void Test(int size, int repeat)
    {
        Console.WriteLine("Size {0}, Repeat {1}", size, repeat);
        int[,] rect = new int[size, size];
        int[][] jagged = new int[size][];
        for (int i = 0; i < size; i++)
        { // don't count this in the metrics...
            jagged[i] = new int[size];
        }

        Stopwatch watch = Stopwatch.StartNew();
        for (int cycle = 0; cycle < repeat; cycle++)
        {
            for (int i = 0; i < size; i++)
            {
                for (int j = 0; j < size; j++)
                {
                    rect[i, j] = i * j;
                }
            }
        }
        watch.Stop();
        Console.WriteLine("\tint[,] set: " + watch.ElapsedMilliseconds);

        int sum = 0;
        watch = Stopwatch.StartNew();
        for (int cycle = 0; cycle < repeat; cycle++)
        {
            for (int i = 0; i < size; i++)
            {
                for (int j = 0; j < size; j++)
                {
                    sum += rect[i, j];
                }
            }
        }
        watch.Stop();
        Console.WriteLine("\tint[,] get: {0} (chk={1})", watch.ElapsedMilliseconds, sum);

        watch = Stopwatch.StartNew();
        for (int cycle = 0; cycle < repeat; cycle++)
        {
            for (int i = 0; i < size; i++)
            {
                for (int j = 0; j < size; j++)
                {
                    jagged[i][j] = i * j;
                }
            }
        }
        watch.Stop();
        Console.WriteLine("\tint[][] set: " + watch.ElapsedMilliseconds);

        sum = 0;
        watch = Stopwatch.StartNew();
        for (int cycle = 0; cycle < repeat; cycle++)
        {
            for (int i = 0; i < size; i++)
            {
                for (int j = 0; j < size; j++)
                {
                    sum += jagged[i][j];
                }
            }
        }
        watch.Stop();
        Console.WriteLine("\tint[][] get: {0} (chk={1})", watch.ElapsedMilliseconds, sum);
    }
}
I believe that [,] allocates one contiguous chunk of memory, while [][] means N+1 chunk allocations, where N is the size of the first dimension. So I would guess that [,] is faster for the initial allocation.
Access is probably about the same, except that [][] involves one extra dereference. Unless you're in an exceptionally tight loop, it's probably a wash. Now, if you're doing something like image processing, where you reference between rows rather than traversing row by row, locality of reference will play a big factor, and [,] will probably edge out [][], depending on your cache size.
As Marc Gravell mentioned, usage is key to evaluating the performance...
It really depends. The MSDN Magazine article, Harness the Features of C# to Power Your Scientific Computing Projects, says this:
Although rectangular arrays are generally superior to jagged arrays in terms of structure and performance, there might be some cases where jagged arrays provide an optimal solution. If your application does not require arrays to be sorted, rearranged, partitioned, sparse, or large, then you might find jagged arrays to perform quite well.
type[,] will work faster. Not only because of fewer offset calculations, but mainly because of less constraint checking, less memory allocation, and greater locality in memory. type[][] is not a single object; it's 1 + N objects that must be allocated and can end up far apart from each other.
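To make the allocation difference concrete, a small sketch:

// One object on the managed heap: a single contiguous block of 1000*1000 ints.
int[,] rect = new int[1000, 1000];

// 1 + N objects: one array of references plus 1000 separate row arrays,
// which the GC may place far apart on the heap.
int[][] jagged = new int[1000][];
for (int i = 0; i < jagged.Length; i++)
    jagged[i] = new int[1000];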