Can someone please guide me on how to perform matrix multiplication in C# on the GPU using OpenCL?
I have looked at the OpenCL example here:
https://www.codeproject.com/Articles/1116907/How-to-Use-Your-GPU-in-NET
But I am not sure how to proceed for matrix multiplication.
Yes, as doqtor says, you need to flatten the matrices into 1D arrays. Here is an example that passes several arguments to a kernel:
using System;
using Cloo;

class Program
{
    static string CalculateKernel
    {
        get
        {
            return @"
            kernel void Calc(global int* m1, global int* m2, int size)
            {
                for(int i = 0; i < size; i++)
                {
                    printf("" %d / %d\n"", m1[i], m2[i]);
                }
            }";
        }
    }

    static void Main(string[] args)
    {
        int[] r1 = new int[] { 1, 2, 3, 4 };
        int[] r2 = new int[] { 4, 3, 2, 1 };
        int rowSize = r1.Length;
        // pick the first platform
        ComputePlatform platform = ComputePlatform.Platforms[0];
        // create a context with all GPU devices
        ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu,
            new ComputeContextPropertyList(platform), null, IntPtr.Zero);
        // create a command queue with the first GPU found
        ComputeCommandQueue queue = new ComputeCommandQueue(context,
            context.Devices[0], ComputeCommandQueueFlags.None);
        // create a program with the OpenCL source
        ComputeProgram program = new ComputeProgram(context, CalculateKernel);
        // compile the OpenCL source
        program.Build(null, null, null, IntPtr.Zero);
        // load the chosen kernel from the program
        ComputeKernel kernel = program.CreateKernel("Calc");
        // allocate a memory buffer with the first int array
        ComputeBuffer<int> row1Buffer = new ComputeBuffer<int>(context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer, r1);
        // allocate a memory buffer with the second int array
        ComputeBuffer<int> row2Buffer = new ComputeBuffer<int>(context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer, r2);
        kernel.SetMemoryArgument(0, row1Buffer); // set the first integer array
        kernel.SetMemoryArgument(1, row2Buffer); // set the second integer array
        kernel.SetValueArgument(2, rowSize);     // set the array size
        // execute the kernel as a single task
        queue.ExecuteTask(kernel, null);
        // wait for completion
        queue.Finish();
        Console.WriteLine("Finished");
        Console.ReadKey();
    }
}
Another sample, this time reading the result back from the GPU buffer:
using System;
using System.Runtime.InteropServices;
using Cloo;

class Program
{
    static string CalculateKernel
    {
        get
        {
            // you could put your matrix algorithm here and take the result in array m3
            return @"
            kernel void Calc(global int* m1, global int* m2, int size, global int* m3)
            {
                for(int i = 0; i < size; i++)
                {
                    int val = m2[i];
                    printf("" %d / %d\n"", m1[i], m2[i]);
                    m3[i] = val * 4;
                }
            }";
        }
    }

    static void Main(string[] args)
    {
        int[] r1 = new int[] { 8, 2, 3, 4 };
        int[] r2 = new int[] { 4, 3, 2, 5 };
        int[] r3 = new int[4];
        int rowSize = r1.Length;
        // pick the first platform
        ComputePlatform platform = ComputePlatform.Platforms[0];
        // create a context with all GPU devices
        ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu,
            new ComputeContextPropertyList(platform), null, IntPtr.Zero);
        // create a command queue with the first GPU found
        ComputeCommandQueue queue = new ComputeCommandQueue(context,
            context.Devices[0], ComputeCommandQueueFlags.None);
        // create a program with the OpenCL source
        ComputeProgram program = new ComputeProgram(context, CalculateKernel);
        // compile the OpenCL source
        program.Build(null, null, null, IntPtr.Zero);
        // load the chosen kernel from the program
        ComputeKernel kernel = program.CreateKernel("Calc");
        // allocate read-only input buffers for the two int arrays
        ComputeBuffer<int> row1Buffer = new ComputeBuffer<int>(context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer, r1);
        ComputeBuffer<int> row2Buffer = new ComputeBuffer<int>(context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer, r2);
        // allocate an output buffer the kernel can write to
        // (WriteOnly here, since the kernel only writes m3; ReadOnly would break it)
        ComputeBuffer<int> resultBuffer = new ComputeBuffer<int>(context,
            ComputeMemoryFlags.WriteOnly | ComputeMemoryFlags.UseHostPointer, new int[4]);
        kernel.SetMemoryArgument(0, row1Buffer);   // set the first integer array
        kernel.SetMemoryArgument(1, row2Buffer);   // set the second integer array
        kernel.SetValueArgument(2, rowSize);       // set the array size
        kernel.SetMemoryArgument(3, resultBuffer); // set the result array
        // execute the kernel as a single task
        queue.ExecuteTask(kernel, null);
        // wait for completion
        queue.Finish();
        // pin r3 and read the result buffer back into it
        GCHandle arrCHandle = GCHandle.Alloc(r3, GCHandleType.Pinned);
        queue.Read<int>(resultBuffer, true, 0, r3.Length, arrCHandle.AddrOfPinnedObject(), null);
        Console.WriteLine("display result from gpu buffer:");
        for (int i = 0; i < r3.Length; i++)
            Console.WriteLine(r3[i]);
        arrCHandle.Free();
        row1Buffer.Dispose();
        row2Buffer.Dispose();
        resultBuffer.Dispose();
        kernel.Dispose();
        program.Dispose();
        queue.Dispose();
        context.Dispose();
        Console.WriteLine("Finished");
        Console.ReadKey();
    }
}
You just need to adapt the kernel program to compute the product of the two matrices; a sketch of such a kernel follows the sample output below.
Result of the last program:
8 / 4
2 / 3
3 / 2
4 / 5
display result from gpu buffer:
16
12
8
20
Finished
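To give an idea of what that adaptation could look like, here is a minimal sketch of a naive matrix-multiplication kernel for square size x size matrices flattened row-major, as in the samples above. The kernel name MatMul and the launch details are my own illustration, not code from the samples:

static string MatMulKernel
{
    get
    {
        return @"
        kernel void MatMul(global const int* a, global const int* b, int size, global int* c)
        {
            int gid = get_global_id(0); // one work item per output element
            int row = gid / size;
            int col = gid % size;
            int sum = 0;
            for (int k = 0; k < size; k++)
                sum += a[row * size + k] * b[k * size + col];
            c[gid] = sum;
        }";
    }
}

Because each work item computes one element of the result, you would enqueue it with a global work size of size * size, e.g. queue.Execute(kernel, null, new long[] { size * size }, null, null); instead of ExecuteTask, and read c back the same way r3 is read above.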
Flattening 2D to 1D is really easy; take this sample (Cast comes from LINQ, so you need using System.Linq;):
int[,] twoD = { { 1, 2,3 }, { 3, 4,5 } };
int[] oneD = twoD.Cast<int>().ToArray();
and see this link for going the other way (1D -> 2D); a minimal sketch is also below.
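In case that link is unavailable, here is a minimal sketch of the reverse conversion, assuming you know the row and column counts and the 1D array is laid out row-major:

static int[,] To2D(int[] oneD, int rows, int cols)
{
    int[,] twoD = new int[rows, cols];
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            twoD[r, c] = oneD[r * cols + c]; // row-major index
    return twoD;
}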
I found a very good reference source for using OpenCL with .NET.
The site is well structured and very useful, and it includes a matrix-multiplication case study.
OpenCL Tutorial
Related
I was migrating a function from a DLL to C# and I found a line of code that I don't understand:
< Module >.lm_minimize(((Vector<CalPoint>)calPoints).Size, n_par1, par, (ILMCallbacks)callbacks, (lm_data_type)data, control);
When I try to step into the class, there is hexadecimal (I guess auto-generated) code that I can't copy.
This is the code:
internal static void lm_minimize(
    int m_dat,
    int n_par,
    double[] par,
    ILMCallbacks callbacks,
    lm_data_type data,
    lm_control_type control)
{
    double[] numArray = new double[m_dat];
    double[] diag = new double[n_par];
    double[] qtf = new double[n_par];
    double[] fjac = new double[m_dat * n_par];
    double[] wa1 = new double[n_par];
    double[] wa2 = new double[n_par];
    double[] wa3 = new double[n_par];
    double[] wa4 = new double[m_dat];
    int[] ipvt = new int[n_par];
    control.info = 0;
    control.nfev = 0;
    \u003CModule\u003E.lm_lmdif(m_dat, n_par, par, numArray, control.ftol, control.xtol, control.gtol, control.maxcall * (n_par + 1), control.epsilon, diag, 1, control.stepbound, ref control.info, ref control.nfev, fjac, ipvt, qtf, wa1, wa2, wa3, wa4, callbacks, data);
    callbacks.lm_printout(n_par, par, m_dat, numArray, data, -1, 0, control.nfev);
    control.fnorm = \u003CModule\u003E.\u003FA0x6d25334e\u002Elm_enorm(m_dat, 0, numArray);
    if (control.info >= 0)
        return;
    control.info = 10;
}
And this code calls the function lm_lmdif. Part of that code looks like this:
\u003CModule\u003E.\u003FA0x6d25334e\u002E\u003F\u0024S1\u0040\u003F1\u003F\u003Flm_lmdif\u0040\u0040YMXHHP\u002401AN0NNNHN0HNA\u0024CAH10P\u002401AH00000A\u0024AAUILMCallbacks\u0040\u0040P\u0024AAVlm_data_type\u0040\u0040\u0040Z\u0040\u0024\u0024Q4IA |= 1U;
// ISSUE: fault handler
I wanted to know if anyone knows what the lm_minimize function is for; I can't find any documentation on it.
Thanks
I tried to migrate the library code to C#.
This is a follow on from this question.
I am trying to return an array of floating-point numbers from C to .NET. I will include both F# and C# code, so that people of either language can answer.
Unmanaged C code:
extern "C"
{
    __declspec(dllexport) void DisplayHelloFromDLL(c_float* P_x, c_float* x)
    {
        //printf("Hello from DLL !\n");
        //cout << "You gave me ... an int: " << i << endl;
        // Load problem data
        //c_float P_x[4] = { 4., 1., 1., 2., };  //covariance matrix
        c_int P_nnz = 4;                         //number of non-zero elements in covar
        c_int P_i[4] = { 0, 1, 0, 1, };          //row indices?
        c_int P_p[3] = { 0, 2, 4 };              //?
        c_float q[2] = { 1., 1., };              //linear terms
        c_float A_x[4] = { 1., 1., 1., 1., };    //constraint coefficients matrix
        c_int A_nnz = 4;                         //number of non-zero elements in constraints matrix
        c_int A_i[4] = { 0, 1, 0, 2, };          //row indices?
        c_int A_p[3] = { 0, 2, 4 };              //?
        c_float l[3] = { 1., 0., 0., };          //lower bounds
        c_float u[3] = { 1., 0.7, 0.7, };        //upper bounds
        c_int n = 2;                             //number of variables (x)
        c_int m = 3;                             //number of constraints
        // Problem settings
        OSQPSettings *settings = (OSQPSettings *)c_malloc(sizeof(OSQPSettings));
        // Structures
        OSQPWorkspace *work; // Workspace
        OSQPData *data;      // OSQPData
        // Populate data
        data = (OSQPData *)c_malloc(sizeof(OSQPData));
        data->n = n;
        data->m = m;
        data->P = csc_matrix(data->n, data->n, P_nnz, P_x, P_i, P_p);
        data->q = q;
        data->A = csc_matrix(data->m, data->n, A_nnz, A_x, A_i, A_p);
        data->l = l;
        data->u = u;
        // Define solver settings as default
        osqp_set_default_settings(settings);
        // Setup workspace
        work = osqp_setup(data, settings);
        // Solve problem
        osqp_solve(work);
        //return the value
        OSQPSolution* sol = work->solution;
        x = sol->x;
        // Clean workspace
        osqp_cleanup(work);
        c_free(data->A);
        c_free(data->P);
        c_free(data);
        c_free(settings);
    }
}
So all I have done is declare a parameter 'x', which I set after the results are calculated.
F# code
open System.Runtime.InteropServices

module ExternalFunctions =
    [<DllImport("TestLibCpp.dll")>]
    extern void DisplayHelloFromDLL(float[] i, [<In>][<Out>] float[] x)

[<EntryPoint>]
let main argv =
    let P_x = [|4.; 1.; 1.; 2.|]
    let mutable xResult: float[] = [|0.; 0.|]
    ExternalFunctions.DisplayHelloFromDLL(P_x, xResult)
    printfn "This is x:%A" xResult
    0 // return an integer exit code
C# code
class Program
{
    [DllImport("TestLibCpp.dll")]
    public static extern void DisplayHelloFromDLL(double[] i, [In, Out] double[] x);

    static void Main(string[] args)
    {
        Console.WriteLine("This is C# program");
        double[] P_x = new double[] { 4.0, 1.0, 1.0, 2.0 };
        double[] x = new double[] { 0.0, 0.0 };
        DisplayHelloFromDLL(P_x, x);
        Console.WriteLine("Finished");
    }
}
In both the F# and C# cases, the value of x is unchanged.
I have tried other variations, such as:
open System.Runtime.InteropServices

module ExternalFunctions =
    [<DllImport("TestLibCpp.dll")>]
    extern void DisplayHelloFromDLL(float[] i, [<In>][<Out>] float[]& x)

[<EntryPoint>]
let main argv =
    let P_x = [|4.; 1.; 1.; 2.|]
    let mutable xResult: float[] = [|0.0; 0.0|]
    ExternalFunctions.DisplayHelloFromDLL(P_x, &xResult)
    printfn "This is x:%A" xResult
    0 // return an integer exit code
But I suspect the problem is the C code, not the .NET code. I think the problem is that work->solution->x is a pointer, not an array, and that I somehow need to copy its contents into the array I passed in, but I can't work out how to do that. And again, I'm not totally sure this is even the problem to begin with.
Thanks to PetSerAl for pointing (pun intended) me in the right direction.
All I needed to do was change part of the C code to
// return the value: copy the solution into the caller-supplied buffer
OSQPSolution* sol = work->solution;
for (int i = 0; i != n; ++i)
    x[i] = sol->x[i];
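A related pattern, in case a native API hands back a raw pointer instead of filling a caller-supplied buffer: copy the unmanaged data into a managed array with Marshal.Copy while the pointer is still valid. The import below is hypothetical and only illustrates the pattern; it is not part of the OSQP wrapper above:

using System;
using System.Runtime.InteropServices;

class PointerCopyDemo
{
    // hypothetical native function returning a pointer to n doubles
    [DllImport("TestLibCpp.dll")]
    static extern IntPtr GetSolutionPointer(out int n);

    static double[] ReadSolution()
    {
        int n;
        IntPtr p = GetSolutionPointer(out n);
        double[] result = new double[n];
        // copy the unmanaged doubles into the managed array
        Marshal.Copy(p, result, 0, n);
        return result;
    }
}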
I am working with Unity 4.5, grabbing images as byte arrays (each byte represents one channel, 4 bytes per pixel, RGBA) and displaying them on a texture by converting the array to a Color32 array, using this loop:
img = new Color32[byteArray.Length / nChannels]; // nChannels being 4
for (int i = 0; i < img.Length; i++)
{
    img[i].r = byteArray[i * nChannels];
    img[i].g = byteArray[i * nChannels + 1];
    img[i].b = byteArray[i * nChannels + 2];
    img[i].a = byteArray[i * nChannels + 3];
}
Then, it is applied to the texture using:
tex.SetPixels32(img);
However, this slows down the application significantly (this loop is executed on every single frame), and I would like to know if there is any other way to speed up the copying process. I've found some people (Fast copy of Color32[] array to byte[] array) using the Marshal.Copy functions in order to do the reverse process (Color32 to byte array), but I have not been able to make it work to copy a byte array to a Color32 array. Does anybody know a faster way?
Thank you in advance!
Yes, Marshal.Copy is the way to go. I've answered a similar question here.
Here's a generic pair of methods to copy from struct[] to byte[] and vice versa:
private static byte[] ToByteArray<T>(T[] source) where T : struct
{
    // pin the source array so the GC cannot move it while we copy out of it
    GCHandle handle = GCHandle.Alloc(source, GCHandleType.Pinned);
    try
    {
        IntPtr pointer = handle.AddrOfPinnedObject();
        byte[] destination = new byte[source.Length * Marshal.SizeOf(typeof(T))];
        Marshal.Copy(pointer, destination, 0, destination.Length);
        return destination;
    }
    finally
    {
        if (handle.IsAllocated)
            handle.Free();
    }
}

private static T[] FromByteArray<T>(byte[] source) where T : struct
{
    T[] destination = new T[source.Length / Marshal.SizeOf(typeof(T))];
    // pin the destination array and copy the raw bytes into it
    GCHandle handle = GCHandle.Alloc(destination, GCHandleType.Pinned);
    try
    {
        IntPtr pointer = handle.AddrOfPinnedObject();
        Marshal.Copy(source, 0, pointer, source.Length);
        return destination;
    }
    finally
    {
        if (handle.IsAllocated)
            handle.Free();
    }
}
Use it as:
[StructLayout(LayoutKind.Sequential)]
public struct Demo
{
    public double X;
    public double Y;
}

private static void Main()
{
    Demo[] array = new Demo[2];
    array[0] = new Demo { X = 5.6, Y = 6.6 };
    array[1] = new Demo { X = 7.6, Y = 8.6 };
    byte[] bytes = ToByteArray(array);
    Demo[] array2 = FromByteArray<Demo>(bytes);
}
This code requires the unsafe compiler switch but should be fast. I think you should benchmark these answers...
var bytes = new byte[] { 1, 2, 3, 4 };
var colors = MemCopyUtils.ByteArrayToColor32Array(bytes);

public class MemCopyUtils
{
    unsafe delegate void MemCpyDelegate(byte* dst, byte* src, int len);
    static MemCpyDelegate MemCpy;

    static MemCopyUtils()
    {
        InitMemCpy();
    }

    static void InitMemCpy()
    {
        // bind a delegate to the internal Buffer.Memcpy(byte*, byte*, int) via reflection
        var mi = typeof(Buffer).GetMethod(
            name: "Memcpy",
            bindingAttr: BindingFlags.NonPublic | BindingFlags.Static,
            binder: null,
            types: new Type[] { typeof(byte*), typeof(byte*), typeof(int) },
            modifiers: null);
        MemCpy = (MemCpyDelegate)Delegate.CreateDelegate(typeof(MemCpyDelegate), mi);
    }

    public unsafe static Color32[] ByteArrayToColor32Array(byte[] bytes)
    {
        Color32[] colors = new Color32[bytes.Length / sizeof(Color32)];
        fixed (void* tempC = &colors[0])
        fixed (byte* pBytes = bytes)
        {
            byte* pColors = (byte*)tempC;
            MemCpy(pColors, pBytes, bytes.Length);
        }
        return colors;
    }
}
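Since this binds to a non-public method that a runtime update could remove, here is an alternative sketch (my addition, not part of the answer above) for runtimes that expose the public Buffer.MemoryCopy, i.e. .NET 4.6+ or Unity's newer scripting runtime:

public unsafe static Color32[] ByteArrayToColor32ArrayAlt(byte[] bytes)
{
    var colors = new Color32[bytes.Length / sizeof(Color32)];
    fixed (byte* pBytes = bytes)
    fixed (Color32* pColors = colors)
    {
        // public equivalent of the internal Buffer.Memcpy used above
        Buffer.MemoryCopy(pBytes, pColors, bytes.Length, bytes.Length);
    }
    return colors;
}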
Using Parallel.For may give you a significant performance increase.
img = new Color32[byteArray.Length / nChannels]; // nChannels being 4
Parallel.For(0, img.Length, i =>
{
    img[i].r = byteArray[i * nChannels];
    img[i].g = byteArray[i * nChannels + 1];
    img[i].b = byteArray[i * nChannels + 2];
    img[i].a = byteArray[i * nChannels + 3];
});
Example on MSDN
I haven't profiled it, but using fixed to ensure your memory doesn't get moved around and to remove bounds checks on array accesses might provide some benefit:
img = new Color32[byteArray.Length / nChannels]; // nChannels being 4
fixed (byte* ba = byteArray)
{
    fixed (Color32* c = img)
    {
        byte* byteArrayPtr = ba;
        Color32* colorPtr = c;
        for (int i = 0; i < img.Length; i++)
        {
            (*colorPtr).r = *byteArrayPtr++;
            (*colorPtr).g = *byteArrayPtr++;
            (*colorPtr).b = *byteArrayPtr++;
            (*colorPtr).a = *byteArrayPtr++;
            colorPtr++;
        }
    }
}
It might not provide much more benefit on 64-bit systems - I believe the bounds checking is more highly optimized there. Also, this is an unsafe operation, so take care.
public Color32[] GetColorArray(byte[] myByte)
{
    if (myByte.Length % 2 != 0)
        throw new Exception("Must have an even length");
    var colors = new Color32[myByte.Length / nChannels];
    for (var i = 0; i < myByte.Length; i += nChannels)
    {
        colors[i / nChannels] = new Color32(
            (byte)(myByte[i] & 0xF8),
            (byte)(((myByte[i] & 7) << 5) | ((myByte[i + 1] & 0xE0) >> 3)),
            (byte)((myByte[i + 1] & 0x1F) << 3),
            (byte)1);
    }
    return colors;
}
This worked about 30-50 times faster than just i++. The bit-mask "extras" are just styling for my own pixel format; the point is that the loop does in one statement, inside the for loop, what you were declaring in 4 lines, and it is much quicker. Cheers :)
Referenced code: Here
I'm trying to get this demo to build, but I get this error.
I've tried this with Mono and Visual Studio 2010; same problem.
The error occurs on this line:
program.Build(null, null, null, IntPtr.Zero);
EDIT
C#
using System;
using Cloo;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using System.IO;

namespace ClooTest
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            // pick the first platform
            ComputePlatform platform = ComputePlatform.Platforms[0];
            // create a context with all GPU devices
            ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu,
                new ComputeContextPropertyList(platform), null, IntPtr.Zero);
            // create a command queue with the first GPU found
            ComputeCommandQueue queue = new ComputeCommandQueue
            (
                context,
                context.Devices[0],
                ComputeCommandQueueFlags.None
            );
            // load the OpenCL source
            StreamReader streamReader = new StreamReader("kernels.cl");
            string clSource = streamReader.ReadToEnd();
            streamReader.Close();
            // create a program with the OpenCL source
            ComputeProgram program = new ComputeProgram(context, clSource);
            // compile the OpenCL source
            program.Build(null, null, null, IntPtr.Zero);
            // load the chosen kernel from the program
            ComputeKernel kernel = program.CreateKernel("helloWorld");
            // create a five-integer array and its length
            int[] message = new int[] { 1, 2, 3, 4, 5 };
            int messageSize = message.Length;
            // allocate a memory buffer with the message (the int array)
            ComputeBuffer<int> messageBuffer = new ComputeBuffer<int>(context,
                ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer, message);
            kernel.SetMemoryArgument(0, messageBuffer); // set the integer array
            kernel.SetValueArgument(1, messageSize);    // set the array size
            // execute the kernel
            queue.ExecuteTask(kernel, null);
            // wait for completion
            queue.Finish();
        }
    }
}
OpenCL
kernel void helloWorld(global read_only int* message, int messageSize) {
    for (int i = 0; i < messageSize; i++) {
        printf("%d", message[i]);
    }
}
EDIT
Yeah, printf probably isn't very well supported. I would suggest performing your "Hello World" with some simple number crunching instead. Maybe something like:
kernel void IncrementNumber(global float4 *celldata_in, global float4 *celldata_out) {
    int index = get_global_id(0);
    float4 a = celldata_in[index];
    a.w = a.w + 1;
    celldata_out[index] = a;
}
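Since that kernel uses get_global_id, it has to be enqueued with a global work size rather than ExecuteTask. A minimal host-side sketch, assuming the context, queue and kernel setup from the code above plus using System.Runtime.InteropServices; the Float4 struct, buffer names and cell count are my own illustration:

// 16-byte struct matching OpenCL's float4 layout
[StructLayout(LayoutKind.Sequential)]
struct Float4 { public float x, y, z, w; }

static void RunIncrementNumber(ComputeContext context, ComputeCommandQueue queue, ComputeKernel kernel)
{
    Float4[] cells = new Float4[16]; // 16 cells, all zeroed
    ComputeBuffer<Float4> inBuffer = new ComputeBuffer<Float4>(context,
        ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, cells);
    ComputeBuffer<Float4> outBuffer = new ComputeBuffer<Float4>(context,
        ComputeMemoryFlags.WriteOnly, cells.Length);
    kernel.SetMemoryArgument(0, inBuffer);
    kernel.SetMemoryArgument(1, outBuffer);
    // one work item per cell: global work size = number of cells
    queue.Execute(kernel, null, new long[] { cells.Length }, null, null);
    queue.Finish();
}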
I got stuck with an access-violation exception in managed code. The histogram pointer is not null and everything seems OK. I took the example of creating IntPtrs from http://www.emgu.com/forum/viewtopic.php?f=8&t=59
// initializing data
var random = new Random();
var array = new double[1000];
for (int i = 0; i < 1000; i++)
{
    array[i] = random.NextDouble();
}
var arrayPtr = GetDataPtr(array);
// initializing the ranges array
double[] rangesArray = { 0, 1 };
var rangesArrayPtr = GetRangesArrayPtr(rangesArray);
// creating and querying the histogram
var histogramStructure = CvInvoke.cvCreateHist(1, new[] { 20 }, HIST_TYPE.CV_HIST_ARRAY, rangesArrayPtr, true);
var histogram = CvInvoke.cvMakeHistHeaderForArray(1, new[] { 20 }, histogramStructure, arrayPtr, rangesArrayPtr, 1);
CvInvoke.cvNormalizeHist(histogram, 1.0);
CvInvoke.cvQueryHistValue_1D(histogram, 0); // getting the exception here
Helper methods:
private static IntPtr[] GetRangesArrayPtr(double[] array)
{
    // allocate unmanaged memory for the ranges and copy the managed array into it
    var ranges = new IntPtr[1];
    ranges[0] = Marshal.AllocHGlobal(array.Length * sizeof(double));
    Marshal.Copy(array, 0, ranges[0], array.Length);
    return ranges;
}

private static IntPtr GetDataPtr(double[] array)
{
    // allocate unmanaged memory for the data and copy the managed array into it
    var ptr = Marshal.AllocHGlobal(array.Length * sizeof(double));
    Marshal.Copy(array, 0, ptr, array.Length);
    return ptr;
}
I had the same problem in a recent project and solved it by copying the histogram values into a new array:
double[] histtemp = new double[255];
Histogram.MatND.ManagedArray.CopyTo(histtemp, 0);
Now you can access the histogram values in histtemp. I hope this helps future viewers.