In the worst case, does this sample allocate testCnt * xArray.Length storage in GPU global memory? How can I make sure only one copy of the array is transferred to the device? The GpuManaged attribute seems to serve this purpose, but it doesn't solve our unexpected memory consumption.
void Worker(int ix, byte[] array)
{
// process array - only read access
}
void Run()
{
var xArray = new byte[100];
var testCnt = 10;
Gpu.Default.For(0, testCnt, ix => Worker(ix, xArray));
}
EDIT
The main question in a more precise form:
Does each worker thread get a fresh copy of xArray or is there only one copy of xArray for all threads?
Your sample code should allocate 100 bytes of memory on the GPU and 100 bytes of memory on the CPU.
(.Net adds a bit of overhead, but we can ignore that)
Since you're using implicit memory, some resources need to be allocated to track that memory (basically where it lives: CPU or GPU).
Now, I assume you're seeing bigger memory consumption on the CPU side.
The reason for that is probably kernel compilation happening on the fly.
Alea GPU has to compile your IL code into LLVM IR; that LLVM IR is fed into the CUDA compiler, which in turn converts it into PTX.
This happens when you run a kernel for the first time.
All of the resources and unmanaged dlls are loaded into memory.
That's possibly what you're seeing.
testCnt has no effect on the amount of memory being allocated.
EDIT
One suggestion is to use memory in an explicit way.
It's faster and more efficient:
private static void Run()
{
var input = Gpu.Default.AllocateDevice<byte>(100);
var deviceptr = input.Ptr;
Gpu.Default.For(0, input.Length, i => Worker(i, deviceptr));
Console.WriteLine(string.Join(", ", Gpu.CopyToHost(input)));
}
private static void Worker(int ix, deviceptr<byte> array)
{
array[ix] = 10;
}
Try using explicit memory:
static void Worker(int ix, byte[] array)
{
// you must write something back, note, I changed your Worker
// function to static!
array[ix] += 1;
}
void Run()
{
var gpu = Gpu.Default;
var hostArray = new byte[100];
// set your host array
var deviceArray = gpu.Allocate<byte>(100);
// deviceArray is of type byte[], but on the host side its Length is 0:
// Debug.Assert(deviceArray.Length == 0);
// Debug.Assert(Gpu.ArrayGetLength(deviceArray) == 100);
Gpu.Copy(hostArray, deviceArray);
var testCnt = 10;
gpu.For(0, testCnt, ix => Worker(ix, deviceArray));
// you must copy memory back
Gpu.Copy(deviceArray, hostArray);
// check your result in hostArray
Gpu.Free(deviceArray);
}
Related
Using the Halcon 13 function FindNccModel in C# causes the following error:
HALCON error #6001: Not enough memory available in operator find_ncc_model
class Program
{
static void Main(string[] args)
{
HImage Image = new HImage(@"08_09_09_41_33_582_OK_000000153000.png");
double MidpointRow = 1053.5210373923057, MidpointCol = 1223.5205413999142;
int iCounter = 0;
while (true)
{
HNCCModel model = new HNCCModel(@"000000135000Mark_0.ncm");
HXLDCont hxCont = new HXLDCont();
hxCont.GenRectangle2ContourXld(
721.9213759213759,
1775.862648221344,
-0.99483767363676778,
72,
14.5);
HTuple htRowXLD, htColXLD;
hxCont.GetContourXld(out htRowXLD, out htColXLD);
HTuple htRadius = new HTuple();
htRadius = new HTuple(htRowXLD.TupleSub(MidpointRow).TuplePow(2) + htColXLD.TupleSub(MidpointCol).TuplePow(2)).TupleSqrt();
HRegion hrAnnulus = new HRegion();
hrAnnulus = hrAnnulus.GenAnnulus(MidpointRow, MidpointCol, htRadius.TupleMin() - 5.0, htRadius.TupleMax() + 5.0);
HImage hiTemp = Image.Clone();
HImage hiTemp2 = hiTemp.Rgb1ToGray();
HImage hiTemp3 = hiTemp2.ReduceDomain(hrAnnulus);
HTuple htRow, htColumn, Angle, Score;
model.FindNccModel(hiTemp3, -0.39, 6.29, 0.65, 1, 0, "true", 0, out htRow, out htColumn, out Angle, out Score);
hxCont.DisposeIfNotNull();
hrAnnulus.DisposeIfNotNull();
model.Dispose();
hiTemp.DisposeIfNotNull();
hiTemp2.DisposeIfNotNull();
hiTemp3.DisposeIfNotNull();
Console.WriteLine(iCounter++.ToString());
}
}
}
public static class DL_HalconUtilityClass
{
public static HRegion GenAnnulus(this HRegion region, double dCenterRow, double dCenterColumn, double dRadiusSmall, double dRadiusBig)
{
region.GenEmptyRegion();
if (dRadiusSmall > dRadiusBig)
{
throw new NotSupportedException("Wrong input parameters. Small radius is bigger than big radius.");
}
HRegion hrCircleSmall = new HRegion(dCenterRow, dCenterColumn, dRadiusSmall);
HRegion hrCircleBig = new HRegion(dCenterRow, dCenterColumn, dRadiusBig);
region = new HRegion();
region = hrCircleBig.Difference(hrCircleSmall);
hrCircleSmall.Dispose();
hrCircleBig.Dispose();
return region;
}
public static void DisposeIfNotNull(this HImage hiImage)
{
if (hiImage != null) hiImage.Dispose();
}
public static void DisposeIfNotNull(this HRegion hrRegion)
{
if (hrRegion != null) hrRegion.Dispose();
}
public static void DisposeIfNotNull(this HObject hoObject)
{
if (hoObject != null) hoObject.Dispose();
}
}
The function itself can run endlessly in a while loop, but combined with our program it causes a memory exception. On the other hand, the program itself can run endlessly without this function. It is also interesting that the error happens before the program reaches its typical 1.1 GB of memory, which suggests a memory leak.
I didn't find any references to this problem in Halcon documentation and upgrading to the newest Halcon 13 version or using Halcon XL did not help. Does anyone know what could cause this problem?
In your code you already manually dispose of most HALCON objects, as it is suggested to do. As you probably know this is necessary because the .NET garbage collector does not know about the amount of unmanaged memory handled by the HALCON library that might be used by the managed object.
However, you do not Dispose the HTuples that contain the results of FindNccModel: htRow, htColumn, Angle and Score.
You might also want to move the creation of the HNCCModel out of your while loop.
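As a sketch of what that loop body could look like with the output tuples released each iteration (assuming HTuple implements IDisposable, as in the HALCON/.NET binding):

```csharp
// After evaluating the FindNccModel results, release the unmanaged
// memory held by the four output tuples as well:
model.FindNccModel(hiTemp3, -0.39, 6.29, 0.65, 1, 0, "true", 0,
    out htRow, out htColumn, out Angle, out Score);
// ... use htRow / htColumn / Angle / Score ...
htRow.Dispose();
htColumn.Dispose();
Angle.Dispose();
Score.Dispose();
```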
Halcon has two memory management optimization system settings: global_mem_cache and temporary_mem_cache. The global_mem_cache setting had no influence, but setting the temporary_mem_cache parameter to "idle" or "shared" solved the problem.
Default setting is "exclusive" where temporary memory is cached locally for each thread. This is an excerpt from Halcon documentation:
'temporary_mem_cache' *), 'tsp_temporary_mem_cache'
This parameter controls the operating mode of the temporary memory cache. The temporary memory cache is used to speed up an application by caching memory used temporarily during the execution of an operator. For most applications the default setting ('exclusive') will produce the best results. The following modes are supported:
'idle' The temporary memory cache is turned off. This mode will use the least memory, but will also reduce performance compared to the other modes.
'shared' All temporary memory is cached globally in the temporary memory reservoir. This mode will use less memory than 'exclusive' mode, but will also generally offer less performance.
'exclusive' All temporary memory is cached locally for each thread. This mode will use the most memory, but will generally also offer the best performance.
'aggregate' Temporary memory blocks that are larger than the threshold set with the 'alloctmp_max_blocksize' parameter are cached in the global memory reservoir, while all smaller blocks are aggregated into a single block that is cached locally for each thread. If the global memory reservoir is disabled, the large blocks are freed instead. The aggregated block will be sized according to the temporary memory usage the thread has seen so far, but it will not be larger than 'alloctmp_max_blocksize' (if set) or smaller than 'alloctmp_min_blocksize' (if set). This mode balances memory usage and speed, but requires correctly setting 'alloctmp_min_blocksize' and 'alloctmp_max_blocksize' for the application's memory usage pattern for effectiveness.
Note that cache mode 'idle' is set in exclusive run mode, whereas the other modes are set in reentrant mode.
For backward compatibility, the values 'false' and 'true' are also accepted; they correspond to 'idle' and 'exclusive', respectively.
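The mode can be changed at runtime through HALCON's generic system-parameter call; a minimal sketch, assuming the HOperatorSet API from the HALCON/.NET binding:

```csharp
// Set the cache mode once at startup, before the processing loop.
// "idle" uses the least memory; "shared" caches globally across threads.
HOperatorSet.SetSystem("temporary_mem_cache", "idle");
```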
I recently encountered a problem with the memory my program uses. The cause is a string array used inside a method. More specifically, the program reads an integer array from an external file. Here is my code:
class Program
{
static void Main(string[] args)
{
int[] a = loadData();
for (int i = 0; i < a.Length; i++)
{
Console.WriteLine(a[i]);
}
Console.ReadKey();
}
private static int[] loadData()
{
string[] lines = System.IO.File.ReadAllLines(@"F:\data.txt");
int[] a = new int[lines.Length];
for (int i = 0; i < lines.Length; i++)
{
string[] temp = lines[i].Split(new char[]{','},StringSplitOptions.RemoveEmptyEntries);
a[i] = Convert.ToInt32(temp[0]);
}
return a;
}
}
File data.txt is about 7.4 MB and 574,285 lines. But when I run the program, the memory shown in Task Manager is 41.6 MB. It seems that the string array I read in loadData() (string[] lines) is not freed. How can I free it, since it is never used later?
You can call GC.Collect() after setting lines to null, but I suggest you look at all answers here, here and here. Calling GC.Collect() is something that you rarely want to do. The purpose of using a language such as C# is that it manages the memory for you. If you want granular control over the memory read in, then you could create a C++ dll and call into that from your C# code.
Instead of reading the entire file into a string array, you could read it line by line and perform the operations that you need to on that line. That would probably be more efficient as well.
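A minimal sketch of that line-by-line approach, using File.ReadLines (which streams lazily, so each line's string becomes garbage as soon as the next one is read, unlike ReadAllLines which materializes the whole string[]); the path is the one from the question:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class LoadDataStreaming
{
    // Parses the first comma-separated integer out of each non-empty line.
    public static int[] LoadData(IEnumerable<string> lines) =>
        lines.Select(line => line.Split(new[] { ',' },
                         StringSplitOptions.RemoveEmptyEntries))
             .Where(parts => parts.Length > 0)
             .Select(parts => Convert.ToInt32(parts[0]))
             .ToArray();

    static void Main()
    {
        // File.ReadLines yields one line at a time instead of holding
        // the entire file's contents in memory as ReadAllLines does.
        int[] a = LoadData(File.ReadLines(@"F:\data.txt"));
        Console.WriteLine(a.Length);
    }
}
```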
What problem does the 40MB of used memory cause? How often do you read the data? Would it be worth caching it for future use (assuming the 7MB is tolerable).
I have a huge array that is being analyzed differently by two threads:
Data is large- no copies allowed
Threads must process concurrently
Must disable bounds checking for maximum performance
Therefore, each thread looks something like this:
unsafe void Thread(UInt16[] data)
{
fixed(UInt16* pData = data)
{
UInt16* pDataEnd = pData + data.Length;
for(UInt16* pCur=pData; pCur != pDataEnd; pCur++)
{
// do stuff
}
}
}
Since there is no mutex (intentionally), I'm wondering if it's safe to use two fixed statements on the same data in parallel threads. Presumably the second fixed should return the same pointer as the first, because the memory is already pinned; and when the first completes, it won't really unpin the memory, because the second fixed() is still active. Has anyone tried this scenario?
According to "CLR via C#" it is safe to do so.
The compiler sets a 'pinned' flag on the pData variable (on the pointer, not on the array instance).
So multiple/recursive use should be OK.
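As an illustration (not from the original answer), here is a sketch of two threads each entering their own fixed block over the same array, reading disjoint halves; it must be compiled with /unsafe:

```csharp
using System;
using System.Threading;

class ParallelFixed
{
    // Each thread pins the array independently; the pin is tied to the
    // local pointer variable, so overlapping fixed blocks from two
    // threads are fine as long as the threads touch disjoint ranges.
    public static unsafe long SumRange(ushort[] data, int start, int count)
    {
        long sum = 0;
        fixed (ushort* p = data)
        {
            for (ushort* cur = p + start, end = p + start + count; cur != end; cur++)
                sum += *cur;
        }
        return sum;
    }

    static void Main()
    {
        var data = new ushort[1000];
        for (int i = 0; i < data.Length; i++) data[i] = 1;

        long sumA = 0, sumB = 0;
        var t1 = new Thread(() => sumA = SumRange(data, 0, 500));
        var t2 = new Thread(() => sumB = SumRange(data, 500, 500));
        t1.Start(); t2.Start();
        t1.Join(); t2.Join();
        Console.WriteLine(sumA + sumB); // 1000
    }
}
```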
Maybe instead of using fixed, you could use GCHandle.Alloc to pin the array:
// not inside your thread, but where you init your shared array
GCHandle handle = GCHandle.Alloc(anArray, GCHandleType.Pinned);
IntPtr intPtr = handle.AddrOfPinnedObject();
// your thread
void Worker(IntPtr pArray)
{
unsafe
{
UInt16* ptr = (UInt16*) pArray.ToPointer();
....
}
}
If all you need to do is
for(int i = 0; i < data.Length; i++)
{
// do stuff with data[i]
}
the bounds check is eliminated by the JIT compiler. So no need for unsafe code.
Note that this does not hold if your access pattern is more complex than that.
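To make the contrast concrete, here is a sketch of the pattern the JIT recognizes versus one it may not (exact behavior depends on the JIT version; treat the comments as a rule of thumb, not a guarantee):

```csharp
using System;

class BoundsCheckDemo
{
    // The JIT eliminates the range check when the loop compares i
    // directly against data.Length on a local array reference:
    public static long SumFast(int[] data)
    {
        long sum = 0;
        for (int i = 0; i < data.Length; i++)   // no per-element range check
            sum += data[i];
        return sum;
    }

    // Hoisting Length into a local, iterating backwards, or indexing
    // through a field may keep the check, depending on the JIT version:
    public static long SumMaybeChecked(int[] data)
    {
        long sum = 0;
        int n = data.Length;
        for (int i = 0; i < n; i++)
            sum += data[i];
        return sum;
    }

    static void Main()
    {
        var data = new[] { 1, 2, 3, 4, 5 };
        Console.WriteLine(SumFast(data));          // 15
        Console.WriteLine(SumMaybeChecked(data));  // 15
    }
}
```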
I'm getting interested in programming a VST plugin, and I have a basic knowledge of audio dsp's and FFT's.
I'd like to use VST.Net, and I'm wondering how to implement an FFT-based effect.
The process-code looks like
public override void Process(VstAudioBuffer[] inChannels, VstAudioBuffer[] outChannels)
If I'm correct, normally the FFT would be applied on the input, some processing would be done on the FFT'd data, and then an inverse-FFT would create the processed soundbuffer.
But since the FFT works on a specified buffer size that will most probably be different from the (arbitrary) number of input/output samples, how would you handle this?
FFT requires that your buffer size is a power of two, but to get around this you should just implement an internal buffer and work with that instead. So for instance:
// MyNiftyPlugin.h
#define MY_NUM_CHANNELS 2
#define MY_FFT_BUFFER_SIZE 1024
class MyNiftyPlugin : public AudioEffectX {
// ... stuff ...
private:
float internalBuffer[MY_NUM_CHANNELS][MY_FFT_BUFFER_SIZE];
long internalBufferIndex;
};
And then in your process loop:
// MyNiftyPlugin.cpp
void process(float **inputs, float **outputs, long sampleFrames) {
for(int frame = 0; frame < sampleFrames; ++frame) {
for(int channel = 0; channel < MY_NUM_CHANNELS; ++channel) {
internalBuffer[channel][internalBufferIndex] = inputs[channel][frame];
}
if(++internalBufferIndex >= MY_FFT_BUFFER_SIZE) {
doFftStuff(...);
internalBufferIndex = 0;
}
}
}
This will impose a bit of latency in your plugin, but the performance boost you can achieve by knowing the buffer size for FFT during compile time makes it worthwhile.
Also, this is a good workaround for hosts like FL Studio (aka "Fruity Loops") which are known to call process() with different blocksizes every time.
I was seeing some strange behavior in a multi threading application which I wrote and which was not scaling well across multiple cores.
The following code illustrates the behavior I am seeing. It appears that heap-intensive operations do not scale across multiple cores; rather, they seem to slow down, i.e. using a single thread would be faster.
class Program
{
public static Data _threadOneData = new Data();
public static Data _threadTwoData = new Data();
public static Data _threadThreeData = new Data();
public static Data _threadFourData = new Data();
static void Main(string[] args)
{
// Do heap intensive tests
var start = DateTime.Now;
RunOneThread(WorkerUsingHeap);
var finish = DateTime.Now;
var timeLapse = finish - start;
Console.WriteLine("One thread using heap: " + timeLapse);
start = DateTime.Now;
RunFourThreads(WorkerUsingHeap);
finish = DateTime.Now;
timeLapse = finish - start;
Console.WriteLine("Four threads using heap: " + timeLapse);
// Do stack intensive tests
start = DateTime.Now;
RunOneThread(WorkerUsingStack);
finish = DateTime.Now;
timeLapse = finish - start;
Console.WriteLine("One thread using stack: " + timeLapse);
start = DateTime.Now;
RunFourThreads(WorkerUsingStack);
finish = DateTime.Now;
timeLapse = finish - start;
Console.WriteLine("Four threads using stack: " + timeLapse);
Console.ReadLine();
}
public static void RunOneThread(ParameterizedThreadStart worker)
{
var threadOne = new Thread(worker);
threadOne.Start(_threadOneData);
threadOne.Join();
}
public static void RunFourThreads(ParameterizedThreadStart worker)
{
var threadOne = new Thread(worker);
threadOne.Start(_threadOneData);
var threadTwo = new Thread(worker);
threadTwo.Start(_threadTwoData);
var threadThree = new Thread(worker);
threadThree.Start(_threadThreeData);
var threadFour = new Thread(worker);
threadFour.Start(_threadFourData);
threadOne.Join();
threadTwo.Join();
threadThree.Join();
threadFour.Join();
}
static void WorkerUsingHeap(object state)
{
var data = state as Data;
for (int count = 0; count < 100000000; count++)
{
var property = data.Property;
data.Property = property + 1;
}
}
static void WorkerUsingStack(object state)
{
var data = state as Data;
double dataOnStack = data.Property;
for (int count = 0; count < 100000000; count++)
{
dataOnStack++;
}
data.Property = dataOnStack;
}
public class Data
{
public double Property
{
get;
set;
}
}
}
This code was run on a Core 2 Quad (4 core system) with the following results:
One thread using heap: 00:00:01.8125000
Four threads using heap: 00:00:17.7500000
One thread using stack: 00:00:00.3437500
Four threads using stack: 00:00:00.3750000
So using the heap with four threads did 4 times the work but took almost 10 times as long. Does this mean it would be twice as fast in this case to use only one thread?
Using the stack behaved much more as expected.
I would like to know what is going on here. Can the heap only be written to from one thread at a time?
The answer is simple - run outside of Visual Studio...
I just copied your entire program, and ran it on my quad core system.
Inside VS (Release Build):
One thread using heap: 00:00:03.2206779
Four threads using heap: 00:00:23.1476850
One thread using stack: 00:00:00.3779622
Four threads using stack: 00:00:00.5219478
Outside VS (Release Build):
One thread using heap: 00:00:00.3899610
Four threads using heap: 00:00:00.4689531
One thread using stack: 00:00:00.1359864
Four threads using stack: 00:00:00.1409859
Note the difference. The extra time in the build outside VS is pretty much all due to the overhead of starting the threads. Your workload in this case is too small to really test, and you're not using the high-performance counters, so it's not a perfect test.
Main rule of thumb: always do perf testing outside VS, i.e. use Ctrl+F5 instead of F5 to run.
Aside from the debug-vs-release effects, there is something more you should be aware of.
You cannot effectively evaluate multi-threaded code for performance in 0.3s.
The point of threads is two-fold: effectively model parallel work in code, and effectively exploit parallel resources (cpus, cores).
You are trying to evaluate the latter. Given that thread start overhead is not vanishingly small in comparison to the interval over which you are timing, your measurement is immediately suspect. In most perf test trials, a significant warm-up interval is appropriate. This may sound silly to you - it's a computer program after all, not a lawnmower. But warm-up is absolutely imperative if you are really going to evaluate multi-thread performance. Caches get filled, pipelines fill up, pools get filled, GC generations get filled. The steady-state, continuous performance is what you would like to evaluate. For purposes of this exercise, the program behaves like a lawnmower.
You could say - Well, no, I don't want to evaluate the steady state performance. And if that is the case, then I would say that your scenario is very specialized. Most app scenarios, whether their designers explicitly realize it or not, need continuous, steady performance.
If you truly need the perf to be good only over a single 0.3s interval, you have found your answer. But be careful to not generalize the results.
If you want general results, you need to have reasonably long warm up intervals, and longer collection intervals. You might start at 20s/60s for those phases, but here is the key thing: you need to vary those intervals until you find the results converging. YMMV. The valid times vary depending on the application workload and the resources dedicated to it, obviously. You may find that a measurement interval of 120s is necessary for convergence, or you may find 40s is just fine. But (a) you won't know until you measure it, and (b) you can bet 0.3s is not long enough.
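The warm-up-then-measure structure above can be sketched like this (the workload here is a hypothetical stand-in for the thread test, and Stopwatch replaces DateTime.Now as the high-resolution timer):

```csharp
using System;
using System.Diagnostics;

class WarmedUpTiming
{
    static void Main()
    {
        // Hypothetical workload standing in for the real thread test.
        Action work = () =>
        {
            double x = 0;
            for (int i = 0; i < 1_000_000; i++) x += i;
        };

        // Warm-up phase: run until JIT, caches, and pools reach steady state.
        for (int i = 0; i < 10; i++) work();

        // Measurement phase: time many repetitions with Stopwatch,
        // which uses the high-resolution performance counter.
        const int reps = 100;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < reps; i++) work();
        sw.Stop();

        Console.WriteLine($"{sw.Elapsed.TotalMilliseconds / reps:F3} ms/iteration");
    }
}
```

In a real trial you would lengthen both phases (e.g. the 20s/60s starting point above) and repeat until the per-iteration numbers converge.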
[edit]Turns out, this is a release vs. debug build issue -- not sure why it is, but it is. See comments and other answers.[/edit]
This was very interesting -- I wouldn't have guessed there'd be that much difference. (similar test machine here -- Core 2 Quad Q9300)
Here's an interesting comparison -- add a decent-sized additional element to the 'Data' class -- I changed it to this:
public class Data
{
public double Property { get; set; }
public byte[] Spacer = new byte[8096];
}
It's still not quite the same time, but it's very close (running it for 10x as long results in 13.1s vs. 17.6s on my machine).
If I had to guess, I'd speculate that it's related to cross-core cache coherency, at least if I'm remembering how CPU cache works. With the small version of 'Data', if a single cache line contains multiple instances of Data, the cores are having to constantly invalidate each other's caches (worst case if they're all on the same cache line). With the 'spacer' added, their memory addresses are sufficiently far enough apart that one CPU's write of a given address doesn't invalidate the caches of the other CPUs.
Another thing to note -- the 4 threads start nearly concurrently, but they don't finish at the same time -- another indication that there's cross-core issues at work here. Also, I'd guess that running on a multi-cpu machine of a different architecture would bring more interesting issues to light here.
I guess the lesson from this is that in a highly-concurrent scenario, if you're doing a bunch of work with a few small data structures, you should try to make sure they aren't all packed on top of each other in memory. Of course, there's really no way to make sure of that, but I'm guessing there are techniques (like adding spacers) that could be used to try to make it happen.
[edit]
This was too interesting -- I couldn't put it down. To test this out further, I thought I'd try varying-sized spacers, and use an integer instead of a double to keep the object without any added spacers smaller.
class Program
{
static void Main(string[] args)
{
Console.WriteLine("name\t1 thread\t4 threads");
RunTest("no spacer", WorkerUsingHeap, () => new Data());
var values = new int[] { -1, 0, 4, 8, 12, 16, 20 };
foreach (var sv in values)
{
var v = sv;
RunTest(string.Format(v == -1 ? "null spacer" : "{0}B spacer", v), WorkerUsingHeap, () => new DataWithSpacer(v));
}
Console.ReadLine();
}
public static void RunTest(string name, ParameterizedThreadStart worker, Func<object> fo)
{
var start = DateTime.UtcNow;
RunOneThread(worker, fo);
var middle = DateTime.UtcNow;
RunFourThreads(worker, fo);
var end = DateTime.UtcNow;
Console.WriteLine("{0}\t{1}\t{2}", name, middle-start, end-middle);
}
public static void RunOneThread(ParameterizedThreadStart worker, Func<object> fo)
{
var data = fo();
var threadOne = new Thread(worker);
threadOne.Start(data);
threadOne.Join();
}
public static void RunFourThreads(ParameterizedThreadStart worker, Func<object> fo)
{
var data1 = fo();
var data2 = fo();
var data3 = fo();
var data4 = fo();
var threadOne = new Thread(worker);
threadOne.Start(data1);
var threadTwo = new Thread(worker);
threadTwo.Start(data2);
var threadThree = new Thread(worker);
threadThree.Start(data3);
var threadFour = new Thread(worker);
threadFour.Start(data4);
threadOne.Join();
threadTwo.Join();
threadThree.Join();
threadFour.Join();
}
static void WorkerUsingHeap(object state)
{
var data = state as Data;
for (int count = 0; count < 500000000; count++)
{
var property = data.Property;
data.Property = property + 1;
}
}
public class Data
{
public int Property { get; set; }
}
public class DataWithSpacer : Data
{
public DataWithSpacer(int size) { Spacer = size == 0 ? null : new byte[size]; }
public byte[] Spacer;
}
}
Result:
1 thread vs. 4 threads
no spacer 00:00:06.3480000 00:00:42.6260000
null spacer 00:00:06.2300000 00:00:36.4030000
0B spacer 00:00:06.1920000 00:00:19.8460000
4B spacer 00:00:06.1870000 00:00:07.4150000
8B spacer 00:00:06.3750000 00:00:07.1260000
12B spacer 00:00:06.3420000 00:00:07.6930000
16B spacer 00:00:06.2250000 00:00:07.5530000
20B spacer 00:00:06.2170000 00:00:07.3670000
No spacer = 1/6th the speed, null spacer = 1/5th the speed, 0B spacer = 1/3rd the speed, 4B spacer = full speed.
I don't know the full details of how the CLR allocates or aligns objects, so I can't speak to what these allocation patterns look like in real memory, but these definitely are some interesting results.