TensorFlowSharp results GetValue() is very slow - C#

I am using TensorflowSharp to run evaluations using a neural network on an Android phone. I am building the project with Unity.
I am using the tensorflowsharp unity plugin listed under the requirements here: https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Using-TensorFlow-Sharp-in-Unity.md.
Everything is working, however extracting the result is very slow.
The network I am running is an autoencoder and the output is an image with dimensions of 128x128x16 (yes, there are a lot of output channels).
The evaluation is done in ~0.2 seconds, which is acceptable. However, when I need to extract the result data using results[0].GetValue(), it is VERY slow.
This is my code where I run the neural network:
var runner = session.GetRunner();
runner.AddInput(graph[INPUT_NAME][0], tensor).Fetch(graph[OUTPUT_NAME][0]);
var results = runner.Run();
float[,,,] heatmaps = results[0].GetValue() as float[,,,]; // <- this is SLOW
The problem:
The last line, where I convert the result to floats, takes ~1.2 seconds.
Can it really be true that reading the result data into a float array takes more than 5 times as long as the actual evaluation of the network?
Is there another way to extract the result values?

So I have found a solution to this. I still do not know why the GetValue() call is so slow, but I found another way to retrieve the data.
I chose to manually read the raw tensor data available at results[0].Data.
I created a small function to handle this as a drop-in for GetValue (here with the dimensions I am expecting hardcoded):
private float[,,,] TensorToFLoats(TFTensor tensor)
{
    // Copy the raw tensor bytes out of native memory.
    IntPtr resData = tensor.Data;
    UIntPtr dataSize = tensor.TensorByteSize;
    byte[] s_ImageBuffer = new byte[(int)dataSize];
    System.Runtime.InteropServices.Marshal.Copy(resData, s_ImageBuffer, 0, (int)dataSize);
    // Reinterpret every 4 bytes as one float.
    int floatsLength = s_ImageBuffer.Length / 4;
    float[] floats = new float[floatsLength];
    for (int n = 0; n < s_ImageBuffer.Length; n += 4)
        floats[n / 4] = BitConverter.ToSingle(s_ImageBuffer, n);
    // Unpack into the expected [batch, y, x, channel] shape.
    float[,,,] result = new float[1, 128, 128, 16];
    int i = 0;
    for (int y = 0; y < 128; y++)
        for (int x = 0; x < 128; x++)
            for (int p = 0; p < 16; p++)
                result[0, y, x, p] = floats[i++];
    return result;
}
Given this, I can replace the code in my question with the following:
var runner = session.GetRunner();
runner.AddInput(graph[INPUT_NAME][0], tensor).Fetch(graph[OUTPUT_NAME][0]);
var results = runner.Run();
float[,,,] heatmaps = TensorToFLoats(results[0]);
This is dramatically faster. Where GetValue took ~1 second, the TensorToFLoats function I created got the same data in ~0.02 seconds.
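A further variation on the approach above, offered as a sketch and not a measured result: Marshal.Copy has an overload that writes straight into a float[], and Buffer.BlockCopy accepts multidimensional arrays of primitives, so both the per-element BitConverter loop and the nested index loops could be replaced by block copies. The class and method names below are hypothetical, the shape is passed in instead of hardcoded, and the pointer is assumed to reference row-major 32-bit floats, which is what the loop order in the function above implies.

```csharp
using System;
using System.Runtime.InteropServices;

public static class TensorCopySketch
{
    // Copies d0*d1*d2*d3 floats from native memory into a float[,,,].
    // Assumption: 'data' points at row-major float32 values (hypothetical helper).
    public static float[,,,] ToFloat4D(IntPtr data, int d0, int d1, int d2, int d3)
    {
        int count = d0 * d1 * d2 * d3;

        // Native memory -> managed float[] in one call.
        float[] flat = new float[count];
        Marshal.Copy(data, flat, 0, count);

        // float[] -> float[,,,] as a raw byte copy; both are row-major.
        float[,,,] result = new float[d0, d1, d2, d3];
        Buffer.BlockCopy(flat, 0, result, 0, count * sizeof(float));
        return result;
    }
}
```

With the tensor from the question this would be called as TensorCopySketch.ToFloat4D(results[0].Data, 1, 128, 128, 16).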


Peak generations for WaveSurfer.js using CSCore

I am trying to generate peaks using CSCore for WaveSurfer.js. I am essentially just trying to get peak values so they can be fed to the WaveSurfer.js element as prerendered peaks. I am using CSCore as an alternative to AudioWaveForm.
Here is the code I am using:
var audioFile = CodecFactory.Instance.GetCodec("input.mp3");
var source = audioFile.ToSampleSource();
var peakMeter = new PeakMeter(source) { Interval = 40 };
var peakData = new float[source.Length / source.WaveFormat.BytesPerSample];
int read;
int i = 0;
while ((read = peakMeter.Read(peakData, i, peakData.Length - i)) > 0)
    i += read;
// Convert the peak values from dB to linear scale
for (int j = 0; j < peakData.Length; j++)
{
    decimal num = (decimal)peakData[j] * 100000;
    var e = $"{num},";
    File.AppendAllText("out.txt", e.ToString());
    //peakData[j] = (float)Math.Pow(10, peakData[j] / 20);
}
I am trying to get a CSV or an array of values. Is this correct? I am getting wildly different results. I am new to CSCore and C# as a whole, so any help would be appreciated.
When using the PeakMeter, you have to use its PeakCalculated event, which provides the peaks. It is fired while the samples are being read, which you are already doing with the Read method. So keep calling Read until it returns zero and collect the peaks using that event.
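To make concrete what "one peak per interval" means, here is a library-free sketch that does not use CSCore at all. It assumes a peak is the largest absolute sample value within each block of samples; the class name, method name, and block size are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class PeakSketch
{
    // Returns one peak (largest absolute sample) per block of 'blockSize' samples.
    // The last block may be shorter than 'blockSize'.
    public static List<float> BlockPeaks(float[] samples, int blockSize)
    {
        if (blockSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(blockSize));

        var peaks = new List<float>();
        for (int start = 0; start < samples.Length; start += blockSize)
        {
            int end = Math.Min(start + blockSize, samples.Length);
            float peak = 0f;
            for (int i = start; i < end; i++)
                peak = Math.Max(peak, Math.Abs(samples[i]));
            peaks.Add(peak);
        }
        return peaks;
    }
}
```

The resulting list can be written as CSV in a single call, for example with string.Join(",", peaks), instead of appending to the file once per value.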

Is it possible to multiply two arrays as a single command for code performance?

Given the following code:
public float[] weights;
public void Input(Neuron[] neurons)
{
    float output = 0;
    for (int i = 0; i < neurons.Length; i++)
        output += neurons[i].input * weights[i];
}
Is it possible to perform all the calculations in a single execution? For example that would be 'neurons[0].input * weights[0].value + neurons[1].input * weights[1].value...'
Coming from this topic - How to sum up an array of integers in C#, there is a way for simpler calculations, but the idea of my code is to iterate over the first array, multiply each element by the element at the same index in the second array, and add that to a running total.
Doing perf profiling, the line where the output is summed is very heavy on I/O and consumes 99% of my processing power. The stack should have enough memory for this, I am not worried about stack overflow, I just want to see it work faster for the moment (even if accuracy is sacrificed).
I think you are looking for AVX in C#.
So you can actually calculate several values in one instruction.
That's SIMD for CPU cores. Take a look at this
Here is an example from the website:
public static int[] SIMDArrayAddition(int[] lhs, int[] rhs)
{
    var simdLength = Vector<int>.Count;
    var result = new int[lhs.Length];
    var i = 0;
    // Process simdLength elements per iteration.
    for (i = 0; i <= lhs.Length - simdLength; i += simdLength)
    {
        var va = new Vector<int>(lhs, i);
        var vb = new Vector<int>(rhs, i);
        (va + vb).CopyTo(result, i);
    }
    // Handle the remaining elements one by one.
    for (; i < lhs.Length; ++i)
        result[i] = lhs[i] + rhs[i];
    return result;
}
You can also combine it with the parallelism you already use.
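The example above adds two arrays, while the question asks for a multiply-and-sum. A minimal sketch of the same SIMD pattern applied to that case follows. It assumes the neuron inputs have first been gathered into a plain float[] (Vector<float> cannot read a field out of an array of objects), and the class and method names are hypothetical. Because the additions happen in a different order than in the scalar loop, the floating-point result may differ slightly.

```csharp
using System;
using System.Numerics;

public static class DotProductSketch
{
    // Sum of inputs[i] * weights[i], processed Vector<float>.Count elements at a time.
    public static float Dot(float[] inputs, float[] weights)
    {
        if (inputs.Length != weights.Length)
            throw new ArgumentException("Arrays must have the same length.");

        int simdLength = Vector<float>.Count;
        var acc = Vector<float>.Zero;
        int i = 0;

        // Multiply and accumulate one vector's worth of elements per iteration.
        for (; i <= inputs.Length - simdLength; i += simdLength)
            acc += new Vector<float>(inputs, i) * new Vector<float>(weights, i);

        // Horizontal sum of the accumulator lanes.
        float sum = Vector.Dot(acc, Vector<float>.One);

        // Scalar loop for the leftover elements.
        for (; i < inputs.Length; i++)
            sum += inputs[i] * weights[i];
        return sum;
    }
}
```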

c# managedCuda 2d array to GPU

I'm new to CUDA and trying to figure out how to pass a 2D array to the kernel.
I have the following working code for a 1-dimensional array:
class Program
{
    static void Main(string[] args)
    {
        int N = 10;
        int deviceID = 0;
        CudaContext ctx = new CudaContext(deviceID);
        CudaKernel kernel = ctx.LoadKernel(@"doubleIt.ptx", "DoubleIt");
        kernel.GridDimensions = (N + 255) / 256;
        kernel.BlockDimensions = Math.Min(N,256);
        // Allocate input vectors h_A in host memory
        float[] h_A = new float[N];
        // Initialize input vectors h_A
        for (int i = 0; i < N; i++)
            h_A[i] = i;
        // Allocate vectors in device memory and copy vectors from host memory to device memory
        CudaDeviceVariable<float> d_A = h_A;
        CudaDeviceVariable<float> d_C = new CudaDeviceVariable<float>(N);
        // Invoke kernel
        kernel.Run(d_A.DevicePointer, d_C.DevicePointer, N);
        // Copy result from device memory to host memory
        float[] h_C = d_C;
        // h_C contains the result in host memory
    }
}
with the following kernel code:
__global__ void DoubleIt(const float* A, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] * 2;
}
As I said, everything works fine, but I want to work with a 2D array as follows:
// Allocate input vectors h_A in host memory
int W = 10;
float[][] h_A = new float[N][];
// Initialize input vectors h_A
for (int i = 0; i < N; i++)
{
    h_A[i] = new float[W];
    for (int j = 0; j < W; j++)
        h_A[i][j] = i*W+j;
}
I need the whole 2nd dimension to be handled by the same thread, so kernel.BlockDimensions must stay 1-dimensional and each kernel thread needs to get a 1D array with 10 elements.
So my bottom-line question is: how should I copy this 2D array to the device, and how do I use it in the kernel? (In this example there should be a total of 10 threads.)
Short answer: you shouldn't do it...
Long answer: Jagged arrays are difficult to handle in general. Instead of one contiguous segment of memory for your data, you have many small ones scattered around memory. What happens when you copy the data to the GPU? With one large contiguous segment, you call the cudaMemcpy/CopyToDevice functions and copy the entire block at once. But just as you allocate jagged arrays in a for loop, you would have to copy your data line by line into a CudaDeviceVariable<CUdeviceptr>, where each entry points to a CudaDeviceVariable<float>. In parallel, you maintain a host array CudaDeviceVariable<float>[] that manages your CUdeviceptrs on the host side. Copying data is already quite slow in general; doing it this way is probably a real performance killer...
To conclude: If you can, use flattened arrays and index the entries with y * DimX + x. Even better on the GPU side, use pitched memory, where the allocation is done so that each line starts on a "good" address: the index then becomes y * Pitch + x (simplified). The 2D copy methods in CUDA are made for these pitched memory allocations, where each line gets some additional bytes added.
For completeness: In C# you also have 2-dimensional arrays like float[,]. You can use these on the host side instead of flattened 1D arrays, but I wouldn't recommend it, as the ISO standard of .NET does not guarantee that the internal memory is actually contiguous, an assumption that managedCuda must make in order to use these arrays. The current .NET Framework doesn't have any internal weirdness, but who knows if it will stay like this...
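The flattened layout recommended above can be sketched on the host side as follows. This is a minimal illustration with hypothetical class and method names, not part of managedCuda: it packs a rectangular jagged array into one contiguous row-major buffer so that element (y, x) lives at index y * dimX + x.

```csharp
using System;

public static class FlattenSketch
{
    // Packs a rectangular jagged array into one contiguous row-major buffer,
    // so element (y, x) lives at index y * dimX + x.
    public static float[] Flatten(float[][] rows, int dimX)
    {
        var flat = new float[rows.Length * dimX];
        for (int y = 0; y < rows.Length; y++)
        {
            if (rows[y].Length != dimX)
                throw new ArgumentException("All rows must have length dimX.");
            Array.Copy(rows[y], 0, flat, y * dimX, dimX);
        }
        return flat;
    }
}
```

The resulting float[] can then be handed to the device in one step, the same way the 1D array is in the question (CudaDeviceVariable<float> d_A = h_A;), and a kernel thread i would read its row as A[i * W + j] for j from 0 to W - 1.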
This would realize the jagged array copy:
float[][] data_h;
CudaDeviceVariable<CUdeviceptr> data_d;
CUdeviceptr[] ptrsToData_h; //represents data_d on host side
CudaDeviceVariable<float>[] arrayOfarray_d; //Array of CudaDeviceVariables to manage memory, source for pointers in ptrsToData_h.
int sizeX = 512;
int sizeY = 256;
data_h = new float[sizeX][];
arrayOfarray_d = new CudaDeviceVariable<float>[sizeX];
data_d = new CudaDeviceVariable<CUdeviceptr>(sizeX);
ptrsToData_h = new CUdeviceptr[sizeX];
for (int x = 0; x < sizeX; x++)
{
    data_h[x] = new float[sizeY];
    arrayOfarray_d[x] = new CudaDeviceVariable<float>(sizeY);
    ptrsToData_h[x] = arrayOfarray_d[x].DevicePointer;
    //ToDo: init data on host...
}
//Copy the pointers once:
data_d.CopyToDevice(ptrsToData_h);
//Copy data:
for (int x = 0; x < sizeX; x++)
    arrayOfarray_d[x].CopyToDevice(data_h[x]);
//Call a kernel:
kernel.Run(data_d.DevicePointer /*, other parameters*/);
//kernel in *cu file:
//__global__ void kernel(float** data_d, ...)
This is a sample for CudaPitchedDeviceVariable:
int dimX = 512;
int dimY = 512;
float[] array_host = new float[dimX * dimY];
CudaPitchedDeviceVariable<float> arrayPitched_d = new CudaPitchedDeviceVariable<float>(dimX, dimY);
for (int y = 0; y < dimY; y++)
    for (int x = 0; x < dimX; x++)
        array_host[y * dimX + x] = x * y;
//Copy the host data into the pitched allocation:
arrayPitched_d.CopyToDevice(array_host);
kernel.Run(arrayPitched_d.DevicePointer, arrayPitched_d.Pitch, dimX, dimY);
//Corresponding kernel:
extern "C"
__global__ void kernel(float* data, size_t pitch, int dimX, int dimY)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dimX || y >= dimY)
        return;
    //pointer arithmetic: add y*pitch to char* pointer as pitch is given in bytes,
    //which gives the start of line y. Convert to float* and add x, to get the
    //value at entry x of line y:
    float value = *(((float*)((char*)data + y * pitch)) + x);
    *(((float*)((char*)data + y * pitch)) + x) = value + 1;
    //Or simpler if you don't like pointers:
    float* line = (float*)((char*)data + y * pitch);
    float value2 = line[x];
}

Accessing processed values from FFT

I am attempting to use Lomont FFT in order to return complex numbers to build a spectrogram / spectral density chart using c#.
I am having trouble understanding how to return values from the class.
Here is the code I have put together thus far which appears to be working.
int read = 0;
Double[] data;
byte[] buffer = new byte[1024];
FileStream wave = new FileStream(args[0], FileMode.Open, FileAccess.Read);
read = wave.Read(buffer, 0, 44);
read = wave.Read(buffer, 0, 1024);
data = new Double[read];
for (int i = 0; i < read; i+=2)
data[i] = BitConverter.ToInt16(buffer, i) / 32768.0;
LomontFFT LFFT = new LomontFFT();
LFFT.FFT(data, true);
What I am not clear on is, how to return/access the values from Lomont FFT implementation back into my application (console)?
Being pretty new to c# development, I'm thinking I am perhaps missing a fundamental aspect of understanding regarding how to retrieve processed values from the instance of the Lomont Class, or perhaps even calling it incorrectly.
Console.WriteLine(LFFT.A); // Returns 0
Console.WriteLine(LFFT.B); // Returns 1
I have been searching for a code snippet or explanation of how to do this, but so far have come up with nothing that I understand or explains this particular aspect of the issue I am facing. Any guidance would be greatly appreciated.
A subset of the results held in data array noted in the code above can be found below and based on my current understanding, appear to be valid:
What am I actually attempting to do? (perspective)
I am looking to load a wave file into a console application and return a spectrogram/spectral density chart/image as a jpg/png for further processing.
The wave files I am reading are mono in format
I receive slightly different results depending on which FFT is used.
Using RealFFT
for (int i = 0; i < read; i+=2)
    data[i] = BitConverter.ToInt16(buffer, i) / 32768.0;
LomontFFT LFFT = new LomontFFT();
LFFT.RealFFT(data, true);
for (int i = 0; i < buffer.Length / 2; i++)
    System.Console.WriteLine("{0}",
        Math.Sqrt(data[2 * i] * data[2 * i] + data[2 * i + 1] * data[2 * i + 1]));
Partial Result of RealFFT
Using FFT
for (int i = 0; i < read; i+=2)
    data[i] = BitConverter.ToInt16(buffer, i) / 32768.0;
double[] bufferB = new double[2 * data.Length];
for (int i = 0; i < data.Length; i++)
{
    bufferB[2 * i] = data[i];
    bufferB[2 * i + 1] = 0;
}
LomontFFT LFFT = new LomontFFT();
LFFT.FFT(bufferB, true);
for (int i = 0; i < bufferB.Length / 2; i++)
    System.Console.WriteLine("{0}",
        Math.Sqrt(bufferB[2 * i] * bufferB[2 * i] + bufferB[2 * i + 1] * bufferB[2 * i + 1]));
Partial Result of FFT:
Looking at the LomontFFT.FFT documentation:
Compute the forward or inverse Fourier Transform of data, with
data containing complex valued data as alternating real and
imaginary parts. The length must be a power of 2. The data is
modified in place.
This tells us a few things. First, the function expects complex-valued data, whereas your data is real. A quick fix for this is to create another buffer of twice the size and set all the imaginary parts to 0:
double[] buffer = new double[2*data.Length];
for (int i=0; i<data.Length; i++)
{
    buffer[2*i] = data[i];
    buffer[2*i+1] = 0;
}
The documentation also tells us that the computation is done in place. That means that after the call to FFT returns, the input array is replaced with the computed result. You could thus print the spectrum with:
LomontFFT LFFT = new LomontFFT();
LFFT.FFT(buffer, true);
for (int i = 0; i < buffer.Length/2; i++)
    System.Console.WriteLine("{0}",
        Math.Sqrt(buffer[2*i]*buffer[2*i]+buffer[2*i+1]*buffer[2*i+1]));
Note since your input data is real valued you could also use LomontFFT.RealFFT. In that case, given a slightly different packing rule, you would obtain the FFT results using:
LomontFFT LFFT = new LomontFFT();
LFFT.RealFFT(data, true);
System.Console.WriteLine("{0}", Math.Abs(data[0]));
for (int i = 1; i < data.Length/2; i++)
    System.Console.WriteLine("{0}",
        Math.Sqrt(data[2*i]*data[2*i]+data[2*i+1]*data[2*i+1]));
System.Console.WriteLine("{0}", Math.Abs(data[1]));
This would give you the non-redundant lower half of the spectrum (Unlike LomontFFT.FFT which provides the entire spectrum). Also, numerical differences on the order of double precision (around 1e-16 times the spectrum peak value) with respect to LomontFFT.FFT can be expected.
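The two bookkeeping steps around the complex FFT call, packing real samples into alternating real/imaginary slots and turning the in-place result into magnitudes, can be sketched without the Lomont class. The helper names below are hypothetical; the interleaved layout is the one described in the documentation quoted above.

```csharp
using System;

public static class SpectrumSketch
{
    // Packs real samples as [re0, 0, re1, 0, ...] so each sample becomes a
    // complex value with a zero imaginary part.
    public static double[] ToInterleavedComplex(double[] real)
    {
        var packed = new double[2 * real.Length];
        for (int i = 0; i < real.Length; i++)
            packed[2 * i] = real[i];   // imaginary slots stay 0.0
        return packed;
    }

    // Returns sqrt(re^2 + im^2) for each [re, im] pair of an interleaved array.
    public static double[] Magnitudes(double[] interleaved)
    {
        var mags = new double[interleaved.Length / 2];
        for (int i = 0; i < mags.Length; i++)
        {
            double re = interleaved[2 * i];
            double im = interleaved[2 * i + 1];
            mags[i] = Math.Sqrt(re * re + im * im);
        }
        return mags;
    }
}
```

In use, the in-place transform (for example LFFT.FFT(packed, true) as in the answer) would run between the two calls, and the magnitudes array is then what a spectrogram column would be drawn from.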

Character recognition using AForge.NET neural network

I am trying to recognize the digits 0 to 9 using AForge.NET. I have tried everything but I am still unable to get a result. Please look at my program and tell me why I am unable to recognize digits. The problem may be in the number of hidden layers, the learning rate, or the input data. I have tried changing the number of hidden layers and the learning rate. Please suggest ideas.
// opening file
OpenFileDialog open = new OpenFileDialog();
ActivationNetwork enactivation = new ActivationNetwork(new BipolarSigmoidFunction(1), 3886, 10, 10);
double[][] input = new double[10][];
double[][] output = new double[10][];
//generating input data using Feature class -- which code is given below
Feature feature = new Feature();
//iterating for all 10 digits.
for (int i = 0; i < 10; i++)
{
    Bitmap bitmap = new Bitmap(open.FileName);
    double[] features = feature.features(bitmap);
    input[i] = features;
    features = feature.features(bitmap);
    output[i] = feature.features(bitmap);
}
BackPropagationLearning learn = new BackPropagationLearning(enactivation);
learn.LearningRate = 0.005f;
learn.Momentum = 0.005f;
double errora;
int iteration = 0;
while (true)
{
    errora = learn.RunEpoch(input, output);
    if (errora < 0.0006)
        break;
    else if (iteration > 23000)
        break;
    iteration++;
    // Console.WriteLine("error {0} {1} ", errora, iteration);
}
double[] sample;
Bitmap temp = new Bitmap(open.FileName);
// providing input for computation using feature class
sample = feature.features(temp);
foreach (double daa in enactivation.Compute(sample))
    Console.WriteLine(daa);
Class Feature for providing input for training the neural network:
class Feature
{
    public double[] features(Bitmap bitmap)
    {
        double[] feature = new double[bitmap.Width * bitmap.Height];
        int featurec = 0;
        for (int vert = 0; vert < bitmap.Height; vert++)
        {
            for (int horizantal = 0; horizantal < bitmap.Width; horizantal++)
            {
                feature[featurec] = bitmap.GetPixel(horizantal, vert).ToArgb();
                if (feature[featurec] < 1)
                    feature[featurec] = -0.5;
                else
                    feature[featurec] = 0.5;
                featurec++;
            }
        }
        return feature;
    }
}
I haven't used aforge, but re. using backprop neural nets for this problem:
You need something like a 10x10 input grid with each cell in the grid getting 1/100 of the image
You need at least one, possibly 2, hidden layers
The net will train faster with a bias input - meaning a source of a fixed value - for each cell (this lets the cells train faster: Role of Bias in Neural Networks)
I'd never start in BP mode, but always run something like statistical annealing first. BP is for descending inside a local minimum once one is found
Have you successfully used aforge for other problems?
What happens when you try to train the net?
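The 10x10 input grid suggested above could be computed along these lines. This is a sketch under assumptions: the image is taken to be already converted to a double[height, width] matrix (for instance with the -0.5/0.5 values the Feature class produces), each cell's feature is the mean of the pixels that fall into it, and the class and method names are hypothetical.

```csharp
using System;

public static class GridFeatureSketch
{
    // Reduces a pixel matrix [height, width] to gridSize*gridSize features,
    // each the mean of the pixels falling into that grid cell.
    public static double[] CellAverages(double[,] pixels, int gridSize)
    {
        int height = pixels.GetLength(0);
        int width = pixels.GetLength(1);
        var sums = new double[gridSize * gridSize];
        var counts = new int[gridSize * gridSize];

        for (int y = 0; y < height; y++)
        {
            int cy = y * gridSize / height;      // grid row for this pixel row
            for (int x = 0; x < width; x++)
            {
                int cx = x * gridSize / width;   // grid column for this pixel
                sums[cy * gridSize + cx] += pixels[y, x];
                counts[cy * gridSize + cx]++;
            }
        }

        // Turn sums into means; cells that received no pixels stay 0.
        for (int c = 0; c < sums.Length; c++)
            if (counts[c] > 0)
                sums[c] /= counts[c];
        return sums;
    }
}
```

With gridSize set to 10 this yields 100 inputs per image instead of one input per pixel, so the network's input count in the question's ActivationNetwork constructor would shrink accordingly.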
