Using local workers in OpenCL for large matrix computation

Using local workers in OpenCL for large matrix computation - c#

I'm a newbie to using OpenCL (with the OpenCL.NET library) with Visual Studio C#, and am currently working on an application that computes a large 3D matrix. At each pixel in the matrix, 192 unique values are computed and then summed to yield the final value for that pixel. So, functionally, it is like a 4-D matrix, (161 x 161 x 161) x 192.
Right now I'm calling the kernel from my host code like this:
//C# host code
...
float[] BigMatrix = new float[161*161*161]; //1-D result array
CLCalc.Program.Variable dev_BigMatrix = new CLCalc.Program.Variable(BigMatrix);
CLCalc.Program.Variable dev_OtherArray = new CLCalc.Program.Variable(otherArray);
//...load some other variables here too.
CLCalc.Program.Variable[] args = new CLCalc.Program.Variable[7] {//stuff...}
//Here, I execute the kernel, with a 2-dimensional worker pool:
BigMatrixCalc.Execute(args, new int[2]{N*N*N,192});
dev_BigMatrix.ReadFromDeviceTo(BigMatrix);
Sample kernel code is posted below.
__kernel void MyKernel(
__global float * BigMatrix
__global float * otherArray
//various other variables...
)
{
int N = 161; //Size of matrix edges
int pixel_id = get_global_id(0); //The location of the pixel in the 1D array
int array_id = get_global_id(1); //The location within the otherArray
//Finding the x,y,z values of the pixel_id.
float3 p;
p.x = pixel_id % N;
p.y = ((pixel_id % (N*N))-p.x)/N;
p.z = (pixel_id - p.x - p.y*N)/(N*N);
float result;
//...
//Some long calculation for 'result' involving otherArray and p...
//...
BigMatrix[pixel_id] += result;
}
My code currently works, however I'm looking for speed for this application, and I'm not sure if my worker/group setup is the best approach (i.e. 161*161*161 and 192 for dimensions of the worker pool).
I've seen other examples of organizing the global worker pool into local worker groups to increase efficiency, but I'm not quite sure how to implement that in OpenCL.NET. I'm also not sure how this is different than just creating another dimension in the worker pool.
So, my question is: Can I use local groups here, and if so how would I organize them? In general, how is using local groups different than just calling an n-dimensional worker pool? (i.e. calling Execute(args, new int[]{(N*N*N),192}), versus having a local workgroup size of 192?)
Thanks for all the help!

I think a lot of performance is lost waiting on memory access. I have answered a similar SO question. I hope my post helps you out. Please ask any questions you have.
Optimizations:
The big boost in my version of your kernel comes from reading otherArray into local memory.
each work item computes 4 values in BigMatrix. This means they can be written at the same time, on the same cacheline. There is minimal loss of parallelism because there are still > 1M work items to execute.
...
#define N 161
#define Nsqr N*N
#define Ncub N*N*N
#define otherSize 192
__kernel void MyKernel(__global float * BigMatrix, __global float * otherArray)
{
//using 1 quarter of the total size of the matrix
//this work item will be responsible for computing 4 consecutive values in BigMatrix
//also reduces global size to (N^3)/4 ~= 1043000 for N=161
int global_id = get_global_id(0) * 4; //The location of the first pixel in the 1D array
int pixel_id;
//array_id won't be used anymore. work items will process BigMatrix[pixel_id] entirely
int local_id = get_local_id(0); //work item id within the group
int local_size = get_local_size(0); //size of group
float result[4]; //result cached for 4 global values
int i, j;
float3 p;
//cache the values in otherArray to local memory
//now each work item in the group will be able to read the values efficently
//each element in otherArray will be read a total of N^3 times, so this is important
//opencl specifies at least 16kb of local memory, so up to 4k floats will work fine
__local float otherValues[otherSize];
for(i=local_id; i<otherSize; i+= local_size){
otherValues[i] = otherArray[i];
}
mem_fence(CLK_LOCAL_MEM_FENCE);
//now this work item can compute the complete result for pixel_id
for(j=0;j<4;j++){
result[j] = 0;
pixel_id = global_id + j;
//Finding the x,y,z values of the pixel_id.
//TODO: optimize the calculation of p.y and p.z
//they will be the same most of the time for a given work item
p.x = pixel_id % N;
p.y = ((pixel_id % Nsqr)-p.x)/N;
p.z = (pixel_id - p.x - p.y*N)/Nsqr;
for(i=0;i<otherSize;i++){
//...
//Some long calculation for 'result' involving otherValues[i] and p...
//...
//result[j] += ...
}
}
//4 consecutive writes to BigMatrix will fall in the same cacheline (faster)
BigMatrix[global_id] += result[0];
BigMatrix[global_id + 1] += result[1];
BigMatrix[global_id + 2] += result[2];
BigMatrix[global_id + 3] += result[3];
}
Notes:
Global work size needs to be a multiple of four. Ideally, a multiple of 4*workgroupsize. This is because there is no error checking to see if each pixel_id falls within the range: 0..N^3-1. Unprocessed elements can be crunched by the cpu while you wait for the kernel to execute.
The work group size should be fairly large. This will force the cached values to be used more heavily and the benefit of caching the data in LDS will grow.
There is a further optimization to be done with the calculation of p.x/y/z in order to avoid too many costly division and modulo operations. see code below.
__kernel void MyKernel(__global float * BigMatrix, __global float * otherArray) {
int global_id = get_global_id(0) * 4; //The location of the first pixel in the 1D array
int pixel_id = global_id;
int local_id = get_local_id(0); //work item id within the group
int local_size = get_local_size(0); //size of group
float result[4]; //result cached for 4 global values
int i, j;
float3 p;
//Finding the initial x,y,z values of the pixel_id.
p.x = pixel_id % N;
p.y = ((pixel_id % Nsqr)-p.x)/N;
p.z = (pixel_id - p.x - p.y*N)/Nsqr;
//cache the values here. same as above...
//now this work item can compute the complete result for pixel_id
for(j=0;j<4;j++){
result[j] = 0;
//increment the x,y,and z values instead of computing them all from scratch
p.x += 1;
if(p.x >= N){
p.x = 0;
p.y += 1;
if(p.y >= N){
p.y = 0;
p.z += 1;
}
}
for(i=0;i<otherSize;i++){
//same i loop as above...
}
}

I have a few suggestions for you:
I think your code has a race condition. Your last line of code has the same element of BigMatrix being modified by multiple different work items.
If your matrix is truly 161x161x161, there is plenty of work items here to use those dimensions as your only dimensions. You already have > 4 million work items, which should be plenty of parallelism for your machine. You don't need 192 times that. Plus, if you don't split the computation of an individual pixel into multiple work items, you won't need to synchronize the final add.
If your global work size is not a nice multiple of a big power of 2, you might try to pad it out so that it becomes one. Even if you pass NULL as your local work size, some OpenCL implementations choose inefficient local sizes for global sizes that don't divide well.
If you don't need local memory or barriers for your algorithm, you can pretty much skip local workgroups.
Hope this helps!

Related

Faster Image Processing than Lock Bits

I've been working on an edge detection program in C#, and to make it run faster, I recently made it use lock bits. However, lockBits is still not as fast as I would like it to run. Although the problem could be my general algorithm, I'm also wondering if there is anything better than lockBits I can use for image processing.
In case the problem is the algorithm, here's a basic explanation. Go through an array of Colors (made using lockbits, which represent pixels) and for each Color, check the color of the eight pixels around that pixel. If those pixels do not match the current pixel closely enough, consider the current pixel an edge.
Here's the basic code that defines if a pixel is an edge. It takes in a Color[] of nine colors, the first of which is the pixel is to check.
public Boolean isEdgeOptimized(Color[] colors)
{
//colors[0] should be the checking pixel
Boolean returnBool = true;
float percentage = percentageInt; //the percentage used is set
//equal to the global variable percentageInt
if (isMatching(colors[0], colors[1], percentage) &&
isMatching(colors[0], colors[2], percentage) &&
isMatching(colors[0], colors[3], percentage) &&
isMatching(colors[0], colors[4], percentage) &&
isMatching(colors[0], colors[5], percentage) &&
isMatching(colors[0], colors[6], percentage) &&
isMatching(colors[0], colors[7], percentage) &&
isMatching(colors[0], colors[8], percentage))
{
returnBool = false;
}
return returnBool;
}
This code is applied for every pixel, the colors of which are fetched using lockbits.
So basically, the question is, how can I get my program to run faster? Is it my algorithm, or is there something I can use that is faster than lockBits?
By the way, the project is on gitHub, here

Are you really passing in a floating point number as a percentage to isMatching?
I looked at your code for isMatching on GitHub and well, yikes. You ported this from Java, right? C# uses bool not Boolean and while I don't know for sure, I don't like the looks of code that does that much boxing and unboxing. Further, you're doing a ton of floating point multiplication and comparison when you don't need to:
public static bool IsMatching(Color a, Color b, int percent)
{
//this method is used to identify whether two pixels,
//of color a and b match, as in they can be considered
//a solid color based on the acceptance value (percent)
int thresh = (int)(percent * 255);
return Math.Abs(a.R - b.R) < thresh &&
Math.Abs(a.G - b.G) < thresh &&
Math.Abs(a.B - b.B) < thresh;
}
This will cut down the amount of work you're doing per pixel. I still don't like it because I try to avoid method calls in the middle of a per-pixel loop especially an 8x per-pixel loop. I made the method static to cut down on an instance being passed in that isn't used. These changes alone will probably double your performance since we're doing only 1 multiply, no boxing, and are now using the inherent short-circuit of && to cut down the work.
If I were doing this, I'd be more likely to do something like this:
// assert: bitmap.Height > 2 && bitmap.Width > 2
BitmapData data = bitmap.LockBits(new Rectangle(0, 0, bitmap.Width, bitmap.Height),
ImageLockMode.ReadWrite, PixelFormat.Format24bppRgb);
int scaledPercent = percent * 255;
unsafe {
byte* prevLine = (byte*)data.Scan0;
byte* currLine = prevLine + data.Stride;
byte* nextLine = currLine + data.Stride;
for (int y=1; y < bitmap.Height - 1; y++) {
byte* pp = prevLine + 3;
byte* cp = currLine + 3;
byte* np = nextLine + 3;
for (int x = 1; x < bitmap.Width - 1; x++) {
if (IsEdgeOptimized(pp, cp, np, scaledPercent))
{
// do what you need to do
}
pp += 3; cp += 3; np += 3;
}
prevLine = currLine;
currLine = nextLine;
nextLine += data.Stride;
}
}
private unsafe static bool IsEdgeOptimized(byte* pp, byte* cp, byte* np, int scaledPecent)
{
return IsMatching(cp, pp - 3, scaledPercent) &&
IsMatching(cp, pp, scaledPercent) &&
IsMatching(cp, pp + 3, scaledPercent) &&
IsMatching(cp, cp - 3, scaledPercent) &&
IsMatching(cp, cp + 3, scaledPercent) &&
IsMatching(cp, np - 3, scaledPercent) &&
IsMatching(cp, np, scaledPercent) &&
IsMatching(cp, np + 3, scaledPercent);
}
private unsafe static bool IsMatching(byte* p1, byte* p2, int thresh)
{
return Math.Abs(p1++ - p2++) < thresh &&
Math.Abs(p1++ - p2++) < thresh &&
Math.Abs(p1 - p2) < thresh;
}
Which now does all kinds of horrible pointer mangling to cut down on array accesses and so on. If all of this pointer work makes you feel uncomfortable, you can allocate byte arrays for prevLine, currLine and nextLine and do a Marshal.Copy for each row as you go.
The algorithm is this: start one pixel in from the top and left and iterate over every pixel in the image except the outside edge (no edge conditions! Yay!). I keep pointers to the starts of each line, prevLine, currLine, and nextLine. Then when I start the x loop, I make up pp, cp, np which are previous pixel, current pixel and next pixel. current pixel is really the one we care about. pp is the pixel directly above it, np directly below it. I pass those into IsEdgeOptimized which looks around cp, calling IsMatching for each.
Now this all assume 24 bits per pixel. If you're looking at 32 bits per pixel, all those magic 3's in there need to be 4's, but other than that the code doesn't change. You could parameterize the number of bytes per pixel if you want so it could handle either.
FYI, the channels in the pixels are typically b, g, r, (a).
Colors are stored as bytes in memory. Your actual Bitmap, if it is a 24 bit image is stored as a block of bytes. Scanlines are data.Stride bytes wide, which is at least as large as 3 * the number of pixels in a row (it may be larger because scan lines are often padded).
When I declare a variable of type byte * in C#, I'm doing a few things. First, I'm saying that this variable contains the address of a location of a byte in memory. Second, I'm saying that I'm about to violate all the safety measures in .NET because I could now read and write any byte in memory, which can be dangerous.
So when I have something like:
Math.Abs(*p1++ - *p2++) < thresh
What it says is (and this will be long):
Take the byte that p1 points to and hold onto it
Add 1 to p1 (this is the ++ - it makes the pointer point to the next byte)
Take the byte that p2 points to and hold onto it
Add 1 to p2
Subtract step 3 from step 1
Pass that to Math.Abs.
The reasoning behind this is that, historically, reading the contents of a byte and moving forward is a very common operation and one that many CPUs build into a single operation of a couple instructions that pipeline into a single cycle or so.
When we enter IsMatching, p1 points to pixel 1, p2 points to pixel 2 and in memory they are laid out like this:
p1 : B
p1 + 1: G
p1 + 2: R
p2 : B
p2 + 1: G
p2 + 2: R
So IsMatching just does the the absolute difference while stepping through memory.
Your follow-on question tells me that you don't really understand pointers. That's OK - you can probably learn them. Honestly, the concepts really aren't that hard, but the problem with them is that without a lot of experience, you are quite likely to shoot yourself in the foot, and perhaps you should consider just using a profiling tool on your code and cooling down the worst hot spots and call it good.
For example, you'll note that I look from the first row to the penultimate row and the first column to the penultimate column. This is intentional to avoid having to handle the case of "I can't read above the 0th line", which eliminates a big class of potential bugs which would involve reading outside a legal memory block, which may be benign under many runtime conditions.

Instead of copying each image to a byte[], then copying to a Color[], creating another temp Color[9] for each pixel, and then using SetPixel to set the color, compile using the /unsafe flag, mark the method as unsafe, replace copying to a byte[] with Marshal.Copy to:
using (byte* bytePtr = ptr)
{
//code goes here
}
Make sure you replace the SetPixel call with setting the proper bytes. This isn't an issue with LockBits, you need LockBits, the issue is that you're being inefficient with everything else related to processing the image.

If you want to use parallel task execution, you can use the Parallel class in System.Threading.Tasks namespace. Following link has some samples and explanations.
http://csharpexamples.com/fast-image-processing-c/
http://msdn.microsoft.com/en-us/library/dd460713%28v=vs.110%29.aspx

You can split the image into 10 bitmaps and process each one, then finally combine them (just an idea).

get the result from array of discrete forurier transform

I just wrote the implementation of dft. Here is my code:
int T = 2205;
float[] sign = new float[T];
for (int i = 0, j = 0; i < T; i++, j++)
sign[i] = (float)Math.Sin(2.0f * Math.PI * 120.0f * i/ 44100.0f);
float[] re = new float[T];
float[] im = new float[T];
float[] dft = new float[T];
for (int k = 0; k < T; k++)
{
for (int n = 0; n < T; n++)
{
re[k] += sign[n] * (float)Math.Cos(2.0f* Math.PI * k * n / T);
im[k] += sign[n] * (float)Math.Sin(2.0f* Math.PI * k * n / T);;
}
dft[k] = (float)Math.Sqrt(re[k] * re[k] + im[k] * im[k]);
}
So the sampling freguency is 44100 Hz and I have a 50ms segment of a 120Hz sinus wave. According to the result I have a peak of the dft function at pont 7 and 2200. Did I do something wrong and if not, how should I interpret the results?
I tried the FFT method of AFORGE. Heres is my code.
int T = 2048;
float[] sign = new float[T];
AForge.Math.Complex[] input = new AForge.Math.Complex[T];
for (int i = 0; i < T; i++)
{
sign[i] = (float)Math.Sin(2.0f * Math.PI * 125.0f * i / 44100.0f);
input[i].Re = sign[i];
input[i].Im = 0.0;
}
AForge.Math.FourierTransform.FFT(input, AForge.Math.FourierTransform.Direction.Forward);
AForge.Math.FourierTransform.FFT(input, AForge.Math.FourierTransform.Direction.Backward);
I had expected to get the original sign but I got something different (a function with only positive values). Is that normal?
Thanks in advance!

Your code look correct, but it could be more efficient, DFT is often solved by FFT algorithm (fast-fourier transform, it's not a new transform, it's just an algorithm to solve DFT in more efficient way).
Even if you do not want to implement FFT (which is a bit harder to understand and it's harder to make it work on data which is not in form of 2^n) or use some open source code, you can make your implementation a bit fast, for example by seeing that 2.0f * Math.PI * K / T is a constant outside of inner loop, so you can compute it once for each k (move it outside inner loop) and then just multiply it by n in your cos/sin functions.
As for position and interpretation, you have changed your domain, now your X-axis, which is the index of data in table corresponds not to time but frequency. You have sampling of 44100Hz and you have captures 2205 samples, that means that every 1 sample represents a magnitude of your input signal at frequency equal to 44100Hz / 2205 = 20Hz. You have your magnitude peak at 7th point (index 6) because your signal is 120Hz, so 6 * 20Hz = 120Hz which is what you could expect.
Seconds peak might seem to represent some high frequency, but it's just a spurious signal, because your sampling rate is 44100Hz you can not measure frequencies higher than 44100Hz / 2 (Nyquist's law) which if you cut-off point, after that frequency DFT data is not valid. That's why, second half of your table is invalid and it's basically your first half but mirrored and you can ignore it.
Edit//
From your questions I can see that you are interested in audio processing, you might want to google NForge.Net library, which is a great opensource library for audio and visual processing and its author have many good articles on codeproject.com regarding many of it's features.

What is the most efficient way to avoid duplicate operations in a C# array?

I need to calculate distances between every pair of points in an array and only want to do that once per pair. Is what I've come up with efficient enough or is there a better way? Here's an example, along with a visual to explain what I'm trying to obtain:
e.g., first get segments A-B, A-C, A-D; then B-C, B-D; and finally, C-D. In other words, we want A-B in our new array, but not B-A since it would be a duplication.
var pointsArray = new Point[4];
pointsArray[0] = new Point(0, 0);
pointsArray[1] = new Point(10, 0);
pointsArray[2] = new Point(10, 10);
pointsArray[3] = new Point(0, 10);
// using (n * (n-1)) / 2 to determine array size
int distArraySize = (pointsArray.Length*(pointsArray.Length - 1))/2;
var distanceArray = new double[distArraySize];
int distanceArrayIndex = 0;
// Loop through points and get distances, never using same point pair twice
for (int currentPointIndex = 0; currentPointIndex < pointsArray.Length - 1; currentPointIndex++)
{
for (int otherPointIndex = currentPointIndex + 1;
otherPointIndex < pointsArray.Length;
otherPointIndex++)
{
double xDistance = pointsArray[otherPointIndex].X - pointsArray[currentPointIndex].X;
double yDistance = pointsArray[otherPointIndex].Y - pointsArray[currentPointIndex].Y;
double distance = Math.Sqrt(Math.Pow(xDistance, 2) + Math.Pow(yDistance, 2));
// Add distance to distanceArray
distanceArray[distanceArrayIndex] = distance;
distanceArrayIndex++;
}
}
Since this will be used with many thousands of points, I'm thinking a precisely dimensioned array would be more efficient than using any sort of IEnumerable.

If you have n points, the set of all pairs of points contains n * (n-1) / 2 elements. That's the number of operations you are doing. The only change I would do is using Parallel.ForEach() to do the operations in parallel.
Something like this (needs debugging)
int distArraySize = (pointsArray.Length * (pointsArray.Length - 1)) / 2;
var distanceArray = new double[distArraySize];
int numPoints = pointsArray.Length;
Parallel.ForEach<int>(Enumerable.Range(0, numPoints - 2),
currentPointIndex =>
{
Parallel.ForEach<int>(Enumerable.Range(currentPointIndex + 1, numPoints - 2),
otherPointIndex =>
{
double xDistance = pointsArray[otherPointIndex].X - pointsArray[currentPointIndex].X;
double yDistance = pointsArray[otherPointIndex].Y - pointsArray[currentPointIndex].Y;
double distance = Math.Sqrt(xDistance * xDistance + yDistance * yDistance);
int distanceArrayIndex = currentPointIndex * numPoints - (currentPointIndex * (currentPointIndex + 1) / 2) + otherPointIndex - 1;
distanceArray[distanceArrayIndex] = distance;
});
});

Looks good to me, but don't you have a bug?
Each of the inner iterations will overwrite the previous one almost completely, except for its first position. Won't it?
That is, in distanceArray[otherPointIndex] otherPointIndex gets values from currentPointIndex + 1 to pointsArray.Length - 1.
In your example, this will range on [0-3] instead of [0-6].

I've had to perform operations like this in the past, and I think your immediate reaction to high number crunch operations is "there must be a faster or more efficient way to do this".
The only other even remotely workable solution I can think of would be to hash the pair and place this hash in a HashSet, then check the HashSet before doing the distance calculation. However, this will likely ultimately work out worse for performance.
You're solution is good. As j0aqu1n points out, you're probably going to have to crunch the numbers one way or another, and in this case you aren't ever performing the same calculation twice.
It will be interesting to see if there are any other solutions to this.

I think, it's a bit faster to use xDistance*xDistance instead of Math.Pow(xDistance, 2).
Apart from this, if you really always need to calculate all distances, there is not much room for improvement.
If, OTOH, you sometimes don't need to calculate all, you could calculate the distances lazily when needed.

FFT Inaccuracy for C#

Ive been experimenting with the FFT algorithm. I use NAudio along with a working code of the FFT algorithm from the internet. Based on my observations of the performance, the resulting pitch is inaccurate.
What happens is that I have an MIDI (generated from GuitarPro) converted to WAV file (44.1khz, 16-bit, mono) that contains a pitch progression starting from E2 (the lowest guitar note) up to about E6. What results is for the lower notes (around E2-B3) its generally very wrong. But reaching C4 its somewhat correct in that you can already see the proper progression (next note is C#4, then D4, etc.) However, the problem there is that the pitch detected is a half-note lower than the actual pitch (e.g. C4 should be the note but D#4 is displayed).
What do you think may be wrong? I can post the code if necessary. Thanks very much! Im still beginning to grasp the field of DSP.
Edit: Here is a rough scratch of what Im doing
byte[] buffer = new byte[8192];
int bytesRead;
do
{
bytesRead = stream16.Read(buffer, 0, buffer.Length);
} while (bytesRead != 0);
And then: (waveBuffer is simply a class that is there to convert the byte[] into float[] since the function only accepts float[])
public int Read(byte[] buffer, int offset, int bytesRead)
{
int frames = bytesRead / sizeof(float);
float pitch = DetectPitch(waveBuffer.FloatBuffer, frames);
}
And lastly: (Smbpitchfft is the class that has the FFT algo ... i believe theres nothing wrong with it so im not posting it here)
private float DetectPitch(float[] buffer, int inFrames)
{
Func<int, int, float> window = HammingWindow;
if (prevBuffer == null)
{
prevBuffer = new float[inFrames]; //only contains zeroes
}
// double frames since we are combining present and previous buffers
int frames = inFrames * 2;
if (fftBuffer == null)
{
fftBuffer = new float[frames * 2]; // times 2 because it is complex input
}
for (int n = 0; n < frames; n++)
{
if (n < inFrames)
{
fftBuffer[n * 2] = prevBuffer[n] * window(n, frames);
fftBuffer[n * 2 + 1] = 0; // need to clear out as fft modifies buffer
}
else
{
fftBuffer[n * 2] = buffer[n - inFrames] * window(n, frames);
fftBuffer[n * 2 + 1] = 0; // need to clear out as fft modifies buffer
}
}
SmbPitchShift.smbFft(fftBuffer, frames, -1);
}
And for interpreting the result:
float binSize = sampleRate / frames;
int minBin = (int)(82.407 / binSize); //lowest E string on the guitar
int maxBin = (int)(1244.508 / binSize); //highest E string on the guitar
float maxIntensity = 0f;
int maxBinIndex = 0;
for (int bin = minBin; bin <= maxBin; bin++)
{
float real = fftBuffer[bin * 2];
float imaginary = fftBuffer[bin * 2 + 1];
float intensity = real * real + imaginary * imaginary;
if (intensity > maxIntensity)
{
maxIntensity = intensity;
maxBinIndex = bin;
}
}
return binSize * maxBinIndex;
UPDATE (if anyone is still interested):
So, one of the answers below stated that the frequency peak from the FFT is not always equivalent to pitch. I understand that. But I wanted to try something for myself if that was the case (on the assumption that there are times in which the frequency peak IS the resulting pitch). So basically, I got 2 softwares (SpectraPLUS and FFTProperties by DewResearch ; credits to them) that is able to display the frequency-domain for the audio signals.
So here are the results of the frequency peaks in the time domain:
SpectraPLUS
and FFT Properties:
This was done using a test note of A2 (around 110Hz). Upon looking at the images, they have frequency peaks around the range of 102-112 Hz for SpectraPLUS and 108 Hz for FFT Properties. On my code, I get 104Hz (I use 8192 blocks and a samplerate of 44.1khz ... 8192 is then doubled to make it complex input so in the end, I get around 5Hz for binsize, as compared to the 10Hz binsize of SpectraPLUS).
So now Im a bit confused, since on the softwares they seem to return the correct result but on my code, I always get 104Hz (note that I have compared the FFT function that I used with others such as Math.Net and it seems to be correct).
Do you think that the problem may be with my interpretation of the data? Or do the softwares do some other thing before displaying the Frequency-Spectrum? Thanks!

It sounds like you may have an interpretation problem with your FFT output. A few random points:
the FFT has a finite resolution - each output bin has a resolution of Fs / N, where Fs is the sample rate and N is the size of the FFT
for notes which are low on the musical scale, the difference in frequency between successive notes is relatively small, so you will need a sufficiently large N to discrimninate between notes which are a semitone apart (see note 1 below)
the first bin (index 0) contains energy centered at 0 Hz but includes energy from +/- Fs / 2N
bin i contains energy centered at i * Fs / N but includes energy from +/- Fs / 2N either side of this center frequency
you will get spectral leakage from adjacent bins - how bad this is depends on what window function you use - no window (== rectangular window) and spectral leakage will be very bad (very broad peaks) - for frequency estimation you want to pick a window function that gives you sharp peaks
pitch is not the same thing as frequency - pitch is a percept, frequency is a physical quantity - the perceived pitch of a musical instrument may be slightly different from the fundamental frequency, depending on the type of instrument (some instruments do not even produce significant energy at their fundamental frequency, yet we still perceive their pitch as if the fundamental were present)
My best guess from the limited information available though is that perhaps you are "off by one" somewhere in your conversion of bin index to frequency, or perhaps your FFT is too small to give you sufficient resolution for the low notes, and you may need to increase N.
You can also improve your pitch estimation via several techniques, such as cepstral analysis, or by looking at the phase component of your FFT output and comparing it for successive FFTs (this allows for a more accurate frequency estimate within a bin for a given FFT size).
Notes
(1) Just to put some numbers on this, E2 is 82.4 Hz, F2 is 87.3 Hz, so you need a resolution somewhat better than 5 Hz to discriminate between the lowest two notes on a guitar (and much finer than this if you actually want to do, say, accurate tuning). At a 44.1 kHz sample then you probably need an FFT of at least N = 8192 to give you sufficient resolution (44100 / 8192 = 5.4 Hz), probably N = 16384 would be better.

I thought this might help you. I made some plots of the 6 open strings of a guitar. The code is in Python using pylab, which I recommend for experimenting:
# analyze distorted guitar notes from
# http://www.freesound.org/packsViewSingle.php?id=643
#
# 329.6 E - open 1st string
# 246.9 B - open 2nd string
# 196.0 G - open 3rd string
# 146.8 D - open 4th string
# 110.0 A - open 5th string
# 82.4 E - open 6th string
from pylab import *
import wave
fs = 44100.0
N = 8192 * 10
t = r_[:N] / fs
f = r_[:N/2+1] * fs / N
gtr_fun = [329.6, 246.9, 196.0, 146.8, 110.0, 82.4]
gtr_wav = [wave.open('dist_gtr_{0}.wav'.format(n),'r') for n in r_[1:7]]
gtr = [fromstring(g.readframes(N), dtype='int16') for g in gtr_wav]
gtr_t = [g / float64(max(abs(g))) for g in gtr]
gtr_f = [2 * abs(rfft(g)) / N for g in gtr_t]
def make_plots():
for n in r_[:len(gtr_t)]:
fig = figure()
fig.subplots_adjust(wspace=0.5, hspace=0.5)
subplot2grid((2,2), (0,0))
plot(t, gtr_t[n]); axis('tight')
title('String ' + str(n+1) + ' Waveform')
subplot2grid((2,2), (0,1))
plot(f, gtr_f[n]); axis('tight')
title('String ' + str(n+1) + ' DFT')
subplot2grid((2,2), (1,0), colspan=2)
M = int(gtr_fun[n] * 16.5 / fs * N)
plot(f[:M], gtr_f[n][:M]); axis('tight')
title('String ' + str(n+1) + ' DFT (16 Harmonics)')
if __name__ == '__main__':
make_plots()
show()
String 1, fundamental = 329.6 Hz:
String 2, fundamental = 246.9 Hz:
String 3, fundamental = 196.0 Hz:
String 4, fundamental = 146.8 Hz:
String 5, fundamental = 110.0 Hz:
String 6, fundamental = 82.4 Hz:
The fundamental frequency isn't always the dominant harmonic. It determines the spacing between harmonics of a periodic signal.

I had a similar question and the answer for me was to use Goertzel instead of FFT. If you know what tones you are looking for (MIDI) Goertzel is capable of detecting the tones to within one sinus wave (one cycle). It does this by generating the sinus wave of the sound and "placing it on top of the raw data" to see if it exist. FFT samples large amounts of data to provide an aproximate frequency spectrum.

Musical pitch is different from frequency peak. Pitch is a psycho-perceptual phenomena that may depend more on the overtones and such. The frequency of what a human would call the pitch could be missing or quite small in the actual signal spectra.
And a frequency peak in a spectrum can be different from any FFT bin center. The FFT bin center frequencies will change in frequency and spacing depending only on the FFT length and sample rate, not the spectra in the data.
So you have at least 2 problems with which to contend. There are a ton of academic papers on frequency estimation as well as the separate subject of pitch estimation. Start there.

How do I calculate PI in C#?

How can I calculate the value of PI using C#?
I was thinking it would be through a recursive function, if so, what would it look like and are there any math equations to back it up?
I'm not too fussy about performance, mainly how to go about it from a learning point of view.

If you want recursion:
PI = 2 * (1 + 1/3 * (1 + 2/5 * (1 + 3/7 * (...))))
This would become, after some rewriting:
PI = 2 * F(1);
with F(i):
double F (int i) {
return 1 + i / (2.0 * i + 1) * F(i + 1);
}
Isaac Newton (you may have heard of him before ;) ) came up with this trick.
Note that I left out the end condition, to keep it simple. In real life, you kind of need one.

How about using:
double pi = Math.PI;
If you want better precision than that, you will need to use an algorithmic system and the Decimal type.

If you take a close look into this really good guide:
Patterns for Parallel Programming: Understanding and Applying Parallel Patterns with the .NET Framework 4
You'll find at Page 70 this cute implementation (with minor changes from my side):
static decimal ParallelPartitionerPi(int steps)
{
decimal sum = 0.0;
decimal step = 1.0 / (decimal)steps;
object obj = new object();
Parallel.ForEach(
Partitioner.Create(0, steps),
() => 0.0,
(range, state, partial) =>
{
for (int i = range.Item1; i < range.Item2; i++)
{
decimal x = (i - 0.5) * step;
partial += 4.0 / (1.0 + x * x);
}
return partial;
},
partial => { lock (obj) sum += partial; });
return step * sum;
}

There are a couple of really, really old tricks I'm surprised to not see here.
atan(1) == PI/4, so an old chestnut when a trustworthy arc-tangent function is
present is 4*atan(1).
A very cute, fixed-ratio estimate that makes the old Western 22/7 look like dirt
is 355/113, which is good to several decimal places (at least three or four, I think).
In some cases, this is even good enough for integer arithmetic: multiply by 355 then divide by 113.
355/113 is also easy to commit to memory (for some people anyway): count one, one, three, three, five, five and remember that you're naming the digits in the denominator and numerator (if you forget which triplet goes on top, a microsecond's thought is usually going to straighten it out).
Note that 22/7 gives you: 3.14285714, which is wrong at the thousandths.
355/113 gives you 3.14159292 which isn't wrong until the ten-millionths.
Acc. to /usr/include/math.h on my box, M_PI is #define'd as:
3.14159265358979323846
which is probably good out as far as it goes.
The lesson you get from estimating PI is that there are lots of ways of doing it,
none will ever be perfect, and you have to sort them out by intended use.
355/113 is an old Chinese estimate, and I believe it pre-dates 22/7 by many years. It was taught me by a physics professor when I was an undergrad.

Good overview of different algorithms:
Computing pi;
Gauss-Legendre-Salamin.
I'm not sure about the complexity claimed for the Gauss-Legendre-Salamin algorithm in the first link (I'd say O(N log^2(N) log(log(N)))).
I do encourage you to try it, though, the convergence is really fast.
Also, I'm not really sure about why trying to convert a quite simple procedural algorithm into a recursive one?
Note that if you are interested in performance, then working at a bounded precision (typically, requiring a 'double', 'float',... output) does not really make sense, as the obvious answer in such a case is just to hardcode the value.

What is PI? The circumference of a circle divided by its diameter.
In computer graphics you can plot/draw a circle with its centre at 0,0 from a initial point x,y, the next point x',y' can be found using a simple formula:
x' = x + y / h : y' = y - x' / h
h is usually a power of 2 so that the divide can be done easily with a shift (or subtracting from the exponent on a double). h also wants to be the radius r of your circle. An easy start point would be x = r, y = 0, and then to count c the number of steps until x <= 0 to plot a quater of a circle. PI is 4 * c / r or PI is 4 * c / h
Recursion to any great depth, is usually impractical for a commercial program, but tail recursion allows an algorithm to be expressed recursively, while implemented as a loop. Recursive search algorithms can sometimes be implemented using a queue rather than the process's stack, the search has to backtrack from a deadend and take another path - these backtrack points can be put in a queue, and multiple processes can un-queue the points and try other paths.

Calculate like this:
x = 1 - 1/3 + 1/5 - 1/7 + 1/9 (... etc as far as possible.)
PI = x * 4
You have got Pi !!!
This is the simplest method I know of.
The value of PI slowly converges to the actual value of Pi (3.141592165......). If you iterate more times, the better.

Here's a nice approach (from the main Wikipedia entry on pi); it converges much faster than the simple formula discussed above, and is quite amenable to a recursive solution if your intent is to pursue recursion as a learning exercise. (Assuming that you're after the learning experience, I'm not giving any actual code.)
The underlying formula is the same as above, but this approach averages the partial sums to accelerate the convergence.
Define a two parameter function, pie(h, w), such that:
pie(0,1) = 4/1
pie(0,2) = 4/1 - 4/3
pie(0,3) = 4/1 - 4/3 + 4/5
pie(0,4) = 4/1 - 4/3 + 4/5 - 4/7
... and so on
So your first opportunity to explore recursion is to code that "horizontal" computation as the "width" parameter increases (for "height" of zero).
Then add the second dimension with this formula:
pie(h, w) = (pie(h-1,w) + pie(h-1,w+1)) / 2
which is used, of course, only for values of h greater than zero.
The nice thing about this algorithm is that you can easily mock it up with a spreadsheet to check your code as you explore the results produced by progressively larger parameters. By the time you compute pie(10,10), you'll have an approximate value for pi that's good enough for most engineering purposes.

Enumerable.Range(0, 100000000).Aggregate(0d, (tot, next) => tot += Math.Pow(-1d, next)/(2*next + 1)*4)

using System;
namespace Strings
{
class Program
{
static void Main(string[] args)
{
/* decimal pie = 1;
decimal e = -1;
*/
var stopwatch = new System.Diagnostics.Stopwatch();
stopwatch.Start(); //added this nice stopwatch start routine
//leibniz formula in C# - code written completely by Todd Mandell 2014
/*
for (decimal f = (e += 2); f < 1000001; f++)
{
e += 2;
pie -= 1 / e;
e += 2;
pie += 1 / e;
Console.WriteLine(pie * 4);
}
decimal finalDisplayString = (pie * 4);
Console.WriteLine("pie = {0}", finalDisplayString);
Console.WriteLine("Accuracy resulting from approximately {0} steps", e/4);
*/
// Nilakantha formula - code written completely by Todd Mandell 2014
// π = 3 + 4/(2*3*4) - 4/(4*5*6) + 4/(6*7*8) - 4/(8*9*10) + 4/(10*11*12) - (4/(12*13*14) etc
decimal pie = 0;
decimal a = 2;
decimal b = 3;
decimal c = 4;
decimal e = 1;
for (decimal f = (e += 1); f < 100000; f++)
// Increase f where "f < 100000" to increase number of steps
{
pie += 4 / (a * b * c);
a += 2;
b += 2;
c += 2;
pie -= 4 / (a * b * c);
a += 2;
b += 2;
c += 2;
e += 1;
}
decimal finalDisplayString = (pie + 3);
Console.WriteLine("pie = {0}", finalDisplayString);
Console.WriteLine("Accuracy resulting from {0} steps", e);
stopwatch.Stop();
TimeSpan ts = stopwatch.Elapsed;
Console.WriteLine("Calc Time {0}", ts);
Console.ReadLine();
}
}
}

public static string PiNumberFinder(int digitNumber)
{
string piNumber = "3,";
int dividedBy = 11080585;
int divisor = 78256779;
int result;
for (int i = 0; i < digitNumber; i++)
{
if (dividedBy < divisor)
dividedBy *= 10;
result = dividedBy / divisor;
string resultString = result.ToString();
piNumber += resultString;
dividedBy = dividedBy - divisor * result;
}
return piNumber;
}

In any production scenario, I would compel you to look up the value, to the desired number of decimal points, and store it as a 'const' somewhere your classes can get to it.
(unless you're writing scientific 'Pi' specific software...)

Regarding...
... how to go about it from a learning point of view.
Are you trying to learning to program scientific methods? or to produce production software? I hope the community sees this as a valid question and not a nitpick.
In either case, I think writing your own Pi is a solved problem. Dmitry showed the 'Math.PI' constant already. Attack another problem in the same space! Go for generic Newton approximations or something slick.

#Thomas Kammeyer:
Note that Atan(1.0) is quite often hardcoded, so 4*Atan(1.0) is not really an 'algorithm' if you're calling a library Atan function (an quite a few already suggested indeed proceed by replacing Atan(x) by a series (or infinite product) for it, then evaluating it at x=1.
Also, there are very few cases where you'd need pi at more precision than a few tens of bits (which can be easily hardcoded!). I've worked on applications in mathematics where, to compute some (quite complicated) mathematical objects (which were polynomial with integer coefficients), I had to do arithmetic on real and complex numbers (including computing pi) with a precision of up to a few million bits... but this is not very frequent 'in real life' :)
You can look up the following example code.

I like this paper, which explains how to calculate π based on a Taylor series expansion for Arctangent.
The paper starts with the simple assumption that
Atan(1) = π/4 radians
Atan(x) can be iteratively estimated with the Taylor series
atan(x) = x - x^3/3 + x^5/5 - x^7/7 + x^9/9...
The paper points out why this is not particularly efficient and goes on to make a number of logical refinements in the technique. They also provide a sample program that computes π to a few thousand digits, complete with source code, including the infinite-precision math routines required.

The following link shows how to calculate the pi constant based on its definition as an integral, that can be written as a limit of a summation, it's very interesting:
https://sites.google.com/site/rcorcs/posts/calculatingthepiconstant
The file "Pi as an integral" explains this method used in this post.

First, note that C# can use the Math.PI field of the .NET framework:
https://msdn.microsoft.com/en-us/library/system.math.pi(v=vs.110).aspx
The nice feature here is that it's a full-precision double that you can either use, or compare with computed results. The tabs at that URL have similar constants for C++, F# and Visual Basic.
To calculate more places, you can write your own extended-precision code. One that is quick to code and reasonably fast and easy to program is:
Pi = 4 * [4 * arctan (1/5) - arctan (1/239)]
This formula and many others, including some that converge at amazingly fast rates, such as 50 digits per term, are at Wolfram:
Wolfram Pi Formulas

PI (π) can be calculated by using infinite series. Here are two examples:
Gregory-Leibniz Series:
π/4 = 1 - 1/3 + 1/5 - 1/7 + 1/9 - ...
C# method :
public static decimal GregoryLeibnizGetPI(int n)
{
decimal sum = 0;
decimal temp = 0;
for (int i = 0; i < n; i++)
{
temp = 4m / (1 + 2 * i);
sum += i % 2 == 0 ? temp : -temp;
}
return sum;
}
Nilakantha Series:
π = 3 + 4 / (2x3x4) - 4 / (4x5x6) + 4 / (6x7x8) - 4 / (8x9x10) + ...
C# method:
public static decimal NilakanthaGetPI(int n)
{
decimal sum = 0;
decimal temp = 0;
decimal a = 2, b = 3, c = 4;
for (int i = 0; i < n; i++)
{
temp = 4 / (a * b * c);
sum += i % 2 == 0 ? temp : -temp;
a += 2; b += 2; c += 2;
}
return 3 + sum;
}
The input parameter n for both functions represents the number of iteration.
The Nilakantha Series in comparison with Gregory-Leibniz Series converges more quickly. The methods can be tested with the following code:
static void Main(string[] args)
{
const decimal pi = 3.1415926535897932384626433832m;
Console.WriteLine($"PI = {pi}");
//Nilakantha Series
int iterationsN = 100;
decimal nilakanthaPI = NilakanthaGetPI(iterationsN);
decimal CalcErrorNilakantha = pi - nilakanthaPI;
Console.WriteLine($"\nNilakantha Series -> PI = {nilakanthaPI}");
Console.WriteLine($"Calculation error = {CalcErrorNilakantha}");
int numDecNilakantha = pi.ToString().Zip(nilakanthaPI.ToString(), (x, y) => x == y).TakeWhile(x => x).Count() - 2;
Console.WriteLine($"Number of correct decimals = {numDecNilakantha}");
Console.WriteLine($"Number of iterations = {iterationsN}");
//Gregory-Leibniz Series
int iterationsGL = 1000000;
decimal GregoryLeibnizPI = GregoryLeibnizGetPI(iterationsGL);
decimal CalcErrorGregoryLeibniz = pi - GregoryLeibnizPI;
Console.WriteLine($"\nGregory-Leibniz Series -> PI = {GregoryLeibnizPI}");
Console.WriteLine($"Calculation error = {CalcErrorGregoryLeibniz}");
int numDecGregoryLeibniz = pi.ToString().Zip(GregoryLeibnizPI.ToString(), (x, y) => x == y).TakeWhile(x => x).Count() - 2;
Console.WriteLine($"Number of correct decimals = {numDecGregoryLeibniz}");
Console.WriteLine($"Number of iterations = {iterationsGL}");
Console.ReadKey();
}
The following output shows that Nilakantha Series returns six correct decimals of PI with one hundred iterations whereas Gregory-Leibniz Series returns five correct decimals of PI with one million iterations:
My code can be tested >> here

Here is a nice way:
Calculate a series of 1/x^2 for x from 1 to what ever you want- the bigger number- the better pie result. Multiply the result by 6 and to sqrt().
Here is the code in c# (main only):
static void Main(string[] args)
{
double counter = 0;
for (double i = 1; i < 1000000; i++)
{
counter = counter + (1 / (Math.Pow(i, 2)));
}
counter = counter * 6;
counter = Math.Sqrt(counter);
Console.WriteLine(counter);
}

public double PI = 22.0 / 7.0;

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Using local workers in OpenCL for large matrix computation - c#

Related

Faster Image Processing than Lock Bits

get the result from array of discrete forurier transform

What is the most efficient way to avoid duplicate operations in a C# array?

FFT Inaccuracy for C#

How do I calculate PI in C#?

Categories

Resources