OpenCL Kernel Troubles - C#

Hi, I created two kernels for a simple matching deshredder program to be run with OpenCL and timed. The two kernels do what they are supposed to do, but one runs far slower than the other for a reason I cannot decipher. The only real difference is how I store the data being sent up and how the matching happens.
__kernel void Horizontal_Match_Orig(
    __global int* allShreds,
    __global int* matchOut,
    const unsigned int shredCount,
    const unsigned int pixelCount)
{
    int match = 0;
    int GlobalID = get_global_id(0);
    int currShred = GlobalID / pixelCount;
    int thisPixel = GlobalID - (currShred * pixelCount);
    int matchPixel = allShreds[GlobalID]; //currShred*pixelCount+thisPixel];
    for (int i = 0; i < shredCount; i++)
    {
        match = 0;
        if (matchPixel == allShreds[(i * pixelCount) + thisPixel])
        {
            if (matchPixel == 0)
            {
                match = match + 150;
            }
            else match = match + 1;
        }
        else match = match - 50;
        atomic_add(&matchOut[(currShred * shredCount) + i], match);
    }
}
This kernel gets the shred edges horizontally, so the pixels of one shred take up positions 0 to n in the array allShreds, and the pixels of the next shred are stored from position n+1 to m (where n = the number of pixels per shred, and m = n plus the next shred's pixel count). Each GPU thread gets one pixel to work with and matches it against the corresponding pixel of all the other shreds (including itself).
__kernel void Vertical(
    __global int* allShreds,
    __global int* matchOut,
    const int numShreds,
    const int pixelsPerEdge)
{
    int GlobalID = get_global_id(0);
    int myMatch = allShreds[GlobalID];
    int myShred = GlobalID % numShreds;
    int thisRow = GlobalID / numShreds;
    for (int matchShred = 0; matchShred < numShreds; matchShred++)
    {
        int match = 0;
        int matchPixel = allShreds[(thisRow * numShreds) + matchShred];
        if (myMatch == matchPixel)
        {
            if (myMatch == 0)
                match = 150;
            else
                match = 1;
        }
        else match = -50;
        atomic_add(&matchOut[(myShred * numShreds) + matchShred], match);
    }
}
This kernel gets the shred edges vertically, so the first pixels of all the shreds are stored in positions 0 to n, then the second pixels of all the shreds are stored from position n+1 to m (where n = the number of shreds, and m = the number of shreds added to n). The process is similar to the previous one: each thread gets a pixel and matches it against the corresponding pixel of each of the other shreds.
Both give the same correct results, tested against a purely sequential program. In theory they should run in roughly the same amount of time, with the possibility of the vertical one running faster as the atomic add shouldn't affect it as much... However, it runs far slower... Any ideas?
This is the code I use to start it (I am using a C# wrapper for it):
theContext.EnqueueNDRangeKernel(1, null, new int[] { minRows * shredcount }, null, out clEvent);
with the total global workload equaling the total number of pixels (#Shreds X #Pixels in each one).
Any help would be greatly appreciated

The two kernels do what they are supposed to do, but one runs far slower than the other for a reason I cannot decipher. The only real difference is how I store the data being sent up and how the matching happens.
And that makes all the difference. This is a classic memory coalescing problem. You haven't specified your GPU model or vendor in your question, so I'll have to remain vague, as actual numbers and behavior are completely hardware dependent, but the general idea is reasonably portable.
Work items in a GPU issue memory requests (reads and writes) together (by "warp" / "wavefront" / "sub-group") to the memory engine. This engine serves memory in transactions (power-of-two sized chunks of 16 to 128 bytes). Let's assume a size of 128 for the following example.
Enter memory access coalescing: if 32 work items of a warp read 4 bytes (int or float) that are consecutive in memory, the memory engine will issue a single transaction to serve all 32 requests. But for every read that is more than 128 bytes apart from another, another transaction needs to be issued. In the worst case, that's 32 transactions of 128 bytes each, which is way more expensive.
Your horizontal kernel does the following access:
allShreds[(i * pixelCount) + thisPixel]
The (i * pixelCount) is constant across work items, only thisPixel varies. Given your code and assuming work item 0 has thisPixel = 0, then work item 1 has thisPixel = 1 and so on. This means your work items are requesting adjacent reads, so you get a perfectly coalesced access. Similarly for the call to atomic_add.
On the other hand, your vertical kernel does the following accesses:
allShreds[(thisRow * numShreds) + matchShred]
// ...
matchOut[(myShred * numShreds) + matchShred]
matchShred and numShreds are constant across work items; only thisRow and myShred vary. This means adjacent work items access memory locations that are numShreds elements apart rather than consecutive, so the accesses are not coalesced.
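If you want to keep generating the data the way the vertical version does, one option this analysis suggests is to reorder it on the host before uploading, so that consecutive work items end up reading consecutive elements. A minimal C# sketch of that idea, with a hypothetical helper name (it simply converts the pixel-major layout described above into the shred-major layout the horizontal kernel already handles with coalesced accesses):
static int[] ToShredMajor(int[] pixelMajor, int numShreds, int pixelsPerEdge)
{
    var shredMajor = new int[pixelMajor.Length];
    for (int row = 0; row < pixelsPerEdge; row++)
    {
        for (int shred = 0; shred < numShreds; shred++)
        {
            // pixel-major index: row * numShreds + shred
            // shred-major index: shred * pixelsPerEdge + row
            shredMajor[shred * pixelsPerEdge + row] = pixelMajor[row * numShreds + shred];
        }
    }
    return shredMajor;
}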

Distribute quantities into buckets - Not evenly

I've been searching around for a solution to this, but I think because of how I'm thinking about it, my search phrases might be a bit loaded in favor of topics that aren't completely relevant.
I have a number, say 950,000. This represents an inventory of [widgets] within an entire system. I have about 200 "buckets" that should each receive a portion of this inventory such that there are no widgets left over.
What I would like to happen is for each bucket to receive different amounts. I don't have any solid code to show right now, but here's some pseudo code to illustrate what I've been thinking:
//List<BucketObject> _buckets is a collection of "buckets", each of which has a "quantity" property for holding these numbers.
int _widgetCnt = 950000;
int _bucketCnt = _buckets.Count; //LINQ
//To start, each bucket receives (_widgetCnt / _bucketCnt) or 4750.
for (int i = 0; i < _bucketCnt - 1; i++)
{
    int _rndAmt = _rnd.Next(1, _buckets[i].Quantity / 2); //Take SOME from this bucket...
    int _rndBucket = _rnd.Next(0, _bucketCnt - 1); //Get a random bucket index from the List<BucketObject> collection.
    _buckets.ElementAt(_rndBucket).Quantity += _rndAmt;
    _buckets.ElementAt(i).Quantity -= _rndAmt;
}
Is this a statistically/mathematically proper way to handle this, or is there a distribution formula out there that handles this? The kicker is that while this pseudo code would run 200 times (so each bucket has a chance to alter its quantities) it would have to run X number of times depending on the TYPE of widget (which currently stands at just 11 flavors, but is expected to expand significantly in the future).
{EDIT}
This system is for a commodity trading game. Quantities at the 200 shops must differ because the inventory will determine the price at that station. The distro can't be even because that would make all prices the same. Over time, prices will naturally get out of balance, but the inventory must start out off-balance. And all inventories have to be at least similar in scope (ie, no one shop can have 1 item, and another have 900,000)
Sure, there is a solution. You could use the Dirichlet distribution for such a task. A property of the distribution is that the sampled values sum to one:
Σ_i x_i = 1
So the solution would be to sample 200 values (equal to the number of buckets) from the Dirichlet distribution, then multiply each value by 950,000 (or whatever the total inventory is), and that gives you the number of items per bucket. If you want non-uniform sampling, you can tweak the alpha parameter of the Dirichlet sampling.
Items per bucket will have to be rounded up/down, of course, but that is pretty trivial.
I have Dirichlet sampling in C# somewhere; if you struggle to implement it, tell me and I will dig it out.
UPDATE
I found some code, .NET Core 2; below is the excerpt. I used it to sample Dirichlet RNs with the same alpha for every component; making them all different is trivial.
//
// Dirichlet sampling, using Gamma sampling from Math .NET
//
using MathNet.Numerics.Distributions;
using MathNet.Numerics.Random;
static void SampleDirichlet(double alpha, double[] rn)
{
    if (rn == null)
        throw new ArgumentException("SampleDirichlet:: Results placeholder is null");
    if (alpha <= 0.0)
        throw new ArgumentException($"SampleDirichlet:: alpha {alpha} is non-positive");
    int n = rn.Length;
    if (n == 0)
        throw new ArgumentException("SampleDirichlet:: Results placeholder is of zero size");
    var gamma = new Gamma(alpha, 1.0);
    double sum = 0.0;
    for (int k = 0; k != n; ++k) {
        double v = gamma.Sample();
        sum += v;
        rn[k] = v;
    }
    if (sum <= 0.0)
        throw new ApplicationException($"SampleDirichlet:: sum {sum} is non-positive");
    // normalize
    sum = 1.0 / sum;
    for (int k = 0; k != n; ++k) {
        rn[k] *= sum;
    }
}
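A minimal usage sketch under the question's numbers (200 buckets, 950,000 widgets), assuming the SampleDirichlet method above is in scope; the remainder fix-up at the end is just one simple way to keep the rounded parts summing to the exact total:
double[] weights = new double[200];     // one weight per bucket
SampleDirichlet(1.0, weights);          // alpha = 1.0 gives a fairly spread-out split

int total = 950000;
int[] perBucket = new int[weights.Length];
int assigned = 0;
for (int k = 0; k < weights.Length; k++)
{
    perBucket[k] = (int)Math.Floor(weights[k] * total);
    assigned += perBucket[k];
}
// Hand out whatever the flooring left over, one widget at a time.
for (int k = 0; assigned < total; k = (k + 1) % perBucket.Length)
{
    perBucket[k]++;
    assigned++;
}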

Optimize or propose C++, C# code using OMP to find all similar k motifs

Given a positive integer k (k ≤ 50), a DNA string s of length at most 5,000 representing a motif, and a DNA string t of length at most 50,000 representing a genome.
The problem consists of returning all substrings t′ of t such that the edit distance d_E(s,t′) is less than or equal to k. Each substring should be encoded by a pair containing its location in t followed by its length.
The edit distance between two strings is the minimum number of elementary operations (insertions, deletions, and substitutions) needed to transform one string into the other, for instance s = 'TGCATAT' and t' = 'ATCCGAT'.
There is a C++ implementation by user1131146 (account aban). Maybe it would be better to use that?
For instance, the input
2
ACGTAG
ACGGATCGGCATCGT
should output
1 4
1 5
1 6
For this example with k=2, the results mean:
For indices 1 to 4, d_E(s,t′)=2 (ADD T THEN one A BEFORE LAST t's G)
s = ACGTAG
t' = ACGG
For indices 1 to 5 d_E(s,t′)=2 (ADD G to end of t' THEN replace t's G at index 4 by T)
s = ACGTAG
t' = ACGGA
For indices 1 to 6, d_E(s,t′)=2 (REPLACE LAST t's T BY G THEN replace t's G at index 4 by T)
s = ACGTAG
t' = ACGGAT
Having a solution to get all substrings of a genome that are within a certain fixed distance of the desired motif, what would be the best way to parallelize it using OpenMP? As the strings become longer, the program takes too much time.
I have tested using #pragma omp parallel for and then a lock in the write-to-file section, and also #pragma omp critical. However, I do not know if I am parallelizing it correctly.
void alignment(vector<vi>&a, string &x, string y, int k){
string tx,ty;
int i,j;
int ylen=a[0].size();
for(i=1;i<a.size();i++){
for(j=max(1,i-k);j<=min(ylen,i+k);j++){
a[i][j] = max(x[i-1] == y[j-1]?a[i-1][j-1] : (a[i-1][j-1]-1), max(a[i-1][j]-1,a[i][j-1]-1));
}
}
}
int main()
{
int k = 23;
string s = "AATTAGCTAAGGTGTACGATGTCCCATTGTGTAAGATTAGGAACTCCATTTAGGTTACCTCCGTCTTAAGTGATGGACCGTGGGTAGCTGCGTCCGATGGACTCATGCAGCGCCCGGATACCTGCAGTATTTATTATATAGGTCTGCGCACCAAACGATTTCTTTCGTGGTCGGGATTCCGGGGTCCTCCGCTATTCAGAGAGCTAAATA";
string t = "ACAATGCAGCAATCCAGCGCCGGAATTTAAGAATAGGTCAGTTTGTAAGGCACTGTTCCCGTTATTCGTAATGCAGTATTAACGTTAATGCTCGAGACCATATTGGACGTCAGTATGCAGACCTGTGCTAGGGTGGTCTATTTCAAGATCACCGAGCTAGGCGCGTGAGCTAACAGGCCGTAATGGTGGCGCCCGCTCCCATAATCACTTCACGAAGCATTAGGTAGACTACCATTTAGGAAGCCCTCTCGCCCGCGTACTGGTTACAGCCCACTACAATGGATACTCCTTACTTCGGTGCAGGCAAGACTTCTACAAAGAAGCGTCCAAGAAGTTGTCGTAGCTCGTTCTTACCCCACCTGTATAAAATTGATCCAGTCGTACATATGACGATGCTGAGCCTCGGACTGGTAAATACAAGTCAAAGGACCAACCCATTACAGTATGAACTACCGGTGG";
time_t start = time(NULL);
std::ofstream out("output.txt");
ifstream someStream( "data.txt" );
string line;
getline( someStream, line ); int k = atoi(line.c_str() );
getline( someStream, line ); string s =line;
getline( someStream, line ); string t= line;
int slen=s.length(), tlen=t.length();
vector<vi>a( slen+1, vi(slen+k+1));
int i,j;
for(i=1;i<a.size();i++)
fill(a[i].begin(),a[i].end(),-999),a[i][0]=a[i-1][0]-1;
#pragma omp parallel for
{
for(j=1;j<a[0].size();j++)
{
a[0][j]=a[0][j-1]-1;
}
}
//cout << "init";
time_t endINIT = time(NULL);
cout<<"Execution Init Time: "<< (double)(endINIT-start)<<" Seconds"<<std::endl;
//omp_lock_t writelock;
//omp_init_lock(&writelock);
#pragma omp parallel for
{
for(i=0;i<=tlen-slen+k;i++)
{
alignment(a,s,t.substr(i,slen+k),k);
for(j=max(0,slen-k);j<=min(slen+k,tlen-i);j++)
{
if(a[slen][j]>=-k)
{
//omp_set_lock(&writelock);
//cout<<(i+1)<<' '<<j<<endl;
#pragma omp critical
{
out <<(i+1)<<' '<<j<<endl;
}
//omp_unset_lock(&writelock);
}
}
}
}
//omp_destroy_lock(&writelock);
time_t end = time(NULL);
cout<<"Execution Time: "<< (double)(end-start)<<" Seconds"<<std::endl;
out.close();
return 0;
}
I have not been able to complete this or optimize it. Is there a better way?
As mentioned in a comment to your post, in order to parallelize calls to alignment, each thread needs to have its own copy of a. You can do this with the firstprivate OpenMP clause:
#pragma omp parallel for firstprivate(a)
In alignment itself you repeat calculations in your loops that the optimizer might not be eliminating. The call to a.size() and the min in your loop conditions could be computed once and stored in local variables. The repeated indexing of a[i] and a[i-1] can also be lifted out of the inner loop.
int asize = int(a.size());
for(i = 1; i < asize; i++) {
int jend = min(ylen, i + k);
vi &cura = a[i];
vi &preva = a[i-1];
for(j = max(1, i - k);j <= jend; j++)
cura[j] = max(x[i-1] == y[j-1]?preva[j-1] : (preva[j-1]-1), max(preva[j]-1,cura[j-1]-1));
}
The Windows headers define max and min macros. If you're using those (and not the inline functions in the STL) that could also repeat code unnecessarily.
Another bottleneck could be the output of matches, depending on how frequently a match is found. One way to improve this is to store the i-1,j pairs into a new local variable (also include it in the private clause), then use a reduction clause to combine the results (although I'm not sure how that works with containers). Once you're done with your for loop you can output the results, possibly sorting them first.
Trying to parallelize the j loop that initializes a[0] is probably not worthwhile. The code needs to be fixed to get that to work (also mentioned in a comment), and the multiple threads could cause it to run slower if the overhead of starting the threads is too high, or if there is cache-line contention when multiple threads write to nearly adjacent values in memory. If the sample s in your code is typical, I'd just run this on one thread.
When you construct your a vector, you can include the -999 initial value in the constructor of the contained vector:
vector<vi> a(slen + 1, vi(slen + k + 1, -999));
then the initialization loop right below it would just have to set the value of the first element in each contained vector.
That should do for a start. Note: Code samples are suggestions, and have not been compiled or tested.
EDIT: I've worked through this, and the results are not what you are hoping for.
Initially I just ran your code (to get my baseline performance numbers). Using VC2010, maximum optimization, 64 bit compile.
cl /nologo /W4 /MD /EHsc /Ox /favor:INTEL64 sequencing.cpp
Since I don't have your data files, I randomly generated a t of 50,000 [AGCT] characters, s of 5,000, and used a k of 50. This took 135 seconds with no hits output. When I set t to be s plus 45,000 random characters, I got a lot of hits with no noticeable impact on the execution time.
I got this running with omp (using firstprivate instead of private to get the initial value of a copied in) and it crashed. Looking into that, I realized that calls to alignment depend on the result of the previous call. So this cannot be executed on multiple cores.
I did rewrite alignment to remove all the redundant computations and got a reduction of around 25% in the execution time (to 103 seconds):
void alignment(vector<vi>&a, const string &x, const string y, int k){
string tx,ty;
int i,j;
int ylen=int(a[0].size());
int asize = int(a.size());
for(i = 1; i < asize; i++) {
auto xi = x[i - 1];
int jend = min(ylen, i + k);
vi &cura = a[i];
const vi &preva = a[i-1];
j = max(1, i - k);
auto caj = cura[j - 1];
const auto *yp = &y[j - 1];
auto *pca = &cura[j];
auto *ppj = &preva[j - 1];
for(;j <= jend; j++) {
caj = *pca = max(*ppj - (xi != *yp), max(ppj[1] - 1, caj - 1));
++yp;
++pca;
++ppj;
}
}
}
The last thing I did was compile this using VS2015. This reduced the execution time another 28% to around 74 seconds.
Similar tweaks and adjustments can be made in main, but with no observable effect on performance as a majority of time is spent in alignment.
Execution times using a 32 bit binary are similar.
It did occur to me that since alignment works with a lot of data, it might be possible to run this on two (or more) threads and overlap the areas the threads work on (so that the second thread only accesses elements of the first thread's results once they are computed, but before the first thread has finished working through the entire array). However, that would require creating your own threads and some very careful synchronization between them to ensure the right data is available in the right place. I didn't try to make this work.
EDIT 2 The source for an optimized version of main:
int main()
{
// int k = 50;
// string s = "AATTAGCTAAGGTGTACGATGTCCCATTGTGTAAGATTAGGAACTCCATTTAGGTTACCTCCGTCTTAAGTGATGGACCGTGGGTAGCTGCGTCCGATGGACTCATGCAGCGCCCGGATACCTGCAGTATTTATTATATAGGTCTGCGCACCAAACGATTTCTTTCGTGGTCGGGATTCCGGGGTCCTCCGCTATTCAGAGAGCTAAATA";
// string t = "ACAATGCAGCAATCCAGCGCCGGAATTTAAGAATAGGTCAGTTTGTAAGGCACTGTTCCCGTTATTCGTAATGCAGTATTAACGTTAATGCTCGAGACCATATTGGACGTCAGTATGCAGACCTGTGCTAGGGTGGTCTATTTCAAGATCACCGAGCTAGGCGCGTGAGCTAACAGGCCGTAATGGTGGCGCCCGCTCCCATAATCACTTCACGAAGCATTAGGTAGACTACCATTTAGGAAGCCCTCTCGCCCGCGTACTGGTTACAGCCCACTACAATGGATACTCCTTACTTCGGTGCAGGCAAGACTTCTACAAAGAAGCGTCCAAGAAGTTGTCGTAGCTCGTTCTTACCCCACCTGTATAAAATTGATCCAGTCGTACATATGACGATGCTGAGCCTCGGACTGGTAAATACAAGTCAAAGGACCAACCCATTACAGTATGAACTACCGGTGG";
time_t start = time(NULL);
std::ofstream out("output.txt");
ifstream someStream( "data.txt" );
string line;
getline( someStream, line ); int k = atoi(line.c_str() );
getline( someStream, line ); string s =line;
getline( someStream, line ); string t= line;
int slen=int(s.length()), tlen=int(t.length());
int i,j;
vector<vi> a(slen + 1, vi(slen + k + 1, -999));
a[0][0]=0;
for(i=1;i<=slen;i++)
a[i][0]=-i;
{
int ej=int(a[0].size());
for(j=1;j<ej;j++)
a[0][j] = -j;
}
//cout << "init";
time_t endINIT = time(NULL);
cout<<"Execution Init Time: "<< (double)(endINIT-start)<<" Seconds"<<std::endl;
{
int endi=tlen-slen+k;
for(i=0;i<=endi;i++)
{
alignment(a,s,t.substr(i,slen+k),k);
int ej=min(slen+k,tlen-i);
j=max(0,slen-k);
const auto *aj = &a[slen][j];
for(;j<=ej;j++,++aj)
{
if(*aj>=-k)
{
//cout<<(i+1)<<' '<<j<<endl;
out <<(i+1)<<' '<<j<<endl;
}
}
}
}
time_t end = time(NULL);
cout<<"Execution Time: "<< (double)(end-start)<<" Seconds"<<std::endl;
out.close();
return 0;
}
The code for alignment is unchanged from what I listed above.

Video rate image construction from binary data performance

First things first:
I have a git repo over here that holds the code of my current efforts and an example data set
Background
The example data set holds a bunch of records in Int32 format. Each record is composed of several bit fields that basically hold info on events where an event is either:
The detection of a photon
The arrival of a synchronizing signal
Each Int32 record can be treated like following C-style struct:
struct {
    unsigned TimeTag  : 16;
    unsigned Channel  : 12;
    unsigned Route    : 2;
    unsigned Valid    : 1;
    unsigned Reserved : 1;
} TTTRrecord;
Whether we are dealing with a photon record or a sync event, time tag will always hold the time of the event relative to the start of the experiment (macro-time).
If a record is a photon, valid == 1.
If a record is a sync signal or something else, valid == 0.
If a record is a sync signal, sync type = channel & 7 will give either a value indicating start of frame or end of scan line in a frame.
The last relevant bit of info is that Timetag is 16 bit and thus obviously limited. If the Timetag counter rolls over, the rollover counter is incremented. This rollover (overflow) count can easily be obtained from channel overflow = Channel & 2048.
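For concreteness, a small C# sketch of decoding one record according to the struct layout above (the helper name is illustrative; the shifts and masks follow the bit widths listed, and the sync-type and overflow derivations are the ones just described):
// Decode one Int32 TTTR record into its bit fields.
static void DecodeRecord(int record, out int timeTag, out int channel, out int route, out int valid)
{
    timeTag = record & 0xFFFF;          // bits 0-15
    channel = (record >> 16) & 0xFFF;   // bits 16-27
    route   = (record >> 28) & 0x3;     // bits 28-29
    valid   = (record >> 30) & 0x1;     // bit 30
}

// Derived from Channel, as described above:
//   int syncType = channel & 7;              // start of frame / end of scan line
//   int overflow = (channel & 2048) >> 11;   // TimeTag rollover flag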
My Goal
These records come in from a high speed scanning microscope and I would like to use these records to reconstruct images from the recorded photon data, preferably at 60 FPS.
To do so, I obviously have all the info:
I can look over all available data, find all overflows, which allows me to reconstruct the sequential macro time for each record (photon or sync).
I also know when the frame started and when each line composing the frame ended (and thus also how many lines there are).
Therefore, to reconstruct a bitmap of size noOfLines * noOfLines I can process the bulk array of records line by line where each time I basically make a "histogram" of the photon events with edges at the time boundary of each pixel in the line.
Put another way, if I know Tstart and Tend of a line, and I know the number of pixels I want to spread my photons over, I can walk through all records of the line and check if the macro time of my photons falls within the time boundary of the current pixel. If so, I add one to the value of that pixel.
This approach works, current code in the repo gives me the image I expect but it is too slow (several tens of ms to calculate a frame).
What I tried already:
The magic happens in the function int[] Renderline (see repo).
public static int[] RenderlineV(int[] someRecords, int pixelduration, int pixelCount)
{
// Will hold the pixels obviously
int[] linePixels = new int[pixelCount];
// Calculate everything (sync, overflow, ...) from the raw records
int[] timeTag = someRecords.Select(x => Convert.ToInt32(x & 65535)).ToArray();
int[] channel = someRecords.Select(x => Convert.ToInt32((x >> 16) & 4095)).ToArray();
int[] valid = someRecords.Select(x => Convert.ToInt32((x >> 30) & 1)).ToArray();
int[] overflow = channel.Select(x => (x & 2048) >> 11).ToArray();
int[] absTime = new int[overflow.Length];
absTime[0] = 0;
Buffer.BlockCopy(overflow, 0, absTime, 4, (overflow.Length - 1) * 4);
absTime = absTime.Cumsum(0, (prev, next) => prev * 65536 + next).Zip(timeTag, (o, tt) => o + tt).ToArray();
long lineStartTime = absTime[0];
int tempIdx = 0;
for (int j = 0; j < linePixels.Length; j++)
{
int count = 0;
for (int i = tempIdx; i < someRecords.Length; i++)
{
if (valid[i] == 1 && lineStartTime + (j + 1) * pixelduration >= absTime[i])
{
count++;
}
}
// Avoid checking records in the raw data that were already binned to a pixel.
linePixels[j] = count;
tempIdx += count;
}
return linePixels;
}
Treating photon records in my data set as an array of structs and addressing members of my struct in an iteration was a bad idea. I could increase speed significantly (2X) by dumping all bitfields into an array and addressing these. This version of the render function is already in the repo.
I also realised I could improve the loop speed by making sure I refer to the .Length property of the array I am running through as this supposedly eliminates bounds checking.
The major speed loss is in the inner loop of this nested set of loops:
for (int j = 0; j < linePixels.Length; j++)
{
int count = 0;
lineStartTime += pixelduration;
for (int i = tempIdx; i < absTime.Length; i++)
{
//if (lineStartTime + (j + 1) * pixelduration >= absTime[i] && valid[i] == 1)
// Seems quicker to calculate the boundary before...
//if (valid[i] == 1 && lineStartTime >= absTime[i] )
// Quicker still...
if (lineStartTime > absTime[i] && valid[i] == 1)
{
// Slow... looking into linePixels[] each iteration is a bad idea.
//linePixels[j]++;
count++;
}
}
// Doing it here is faster.
linePixels[j] = count;
tempIdx += count;
}
Rendering 400 lines like this in a for loop takes roughly 150 ms in a VM (I do not have a dedicated Windows machine right now and I run a Mac myself, I know I know...).
I just installed Win10CTP on a 6 core machine and replacing the normal loops by Parallel.For() increases the speed by almost exactly 6X.
Oddly enough, the non-parallel for loop runs almost at the same speed in the VM or the physical 6 core machine...
Regardless, I cannot imagine that this function cannot be made quicker. I would first like to eke out every bit of efficiency from the line render before I start thinking about other things.
I would like to optimise the function that generates the line to the maximum.
Outlook
Until now, my programming dealt with rather trivial things so I lack some experience but things I think I might consider:
Matlab is/seems very efficient with vectorized operations. Could I achieve similar things in C#, e.g. by using Microsoft.Bcl.Simd? Is my case suited for something like this? Would I see gains even in my VM, or should I definitely move to real HW?
Could I gain from pointer arithmetic/unsafe code to run through my arrays?
...
Any help would be greatly, greatly appreciated.
I apologize beforehand for the quality of the code in the repo, I am still in the quick and dirty testing stage... Nonetheless, criticism is welcomed if it is constructive :)
Update
As some mentioned, absTime is ordered already. Therefore, once a record is hit that is no longer in the current pixel or bin, there is no need to continue the inner loop.
5X speed gain by adding a break...
for (int i = tempIdx; i < absTime.Length; i++)
{
//if (lineStartTime + (j + 1) * pixelduration >= absTime[i] && valid[i] == 1)
// Seems quicker to calculate the boundary before...
//if (valid[i] == 1 && lineStartTime >= absTime[i] )
// Quicker still...
if (lineStartTime > absTime[i] && valid[i] == 1)
{
// Slow... looking into linePixels[] each iteration is a bad idea.
//linePixels[j]++;
count++;
}
else
{
break;
}
}
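Taking that observation one step further, the whole line can be binned in a single forward pass over the sorted records, advancing the pixel index whenever a time boundary is crossed. A sketch of that idea (untested, reusing the variable names from the code above, with lineStartTime holding the start of the line, and skipping invalid records instead of breaking on them):
int[] linePixels = new int[pixelCount];
long lineEnd = lineStartTime + (long)pixelCount * pixelduration;
long boundary = lineStartTime + pixelduration;
int pixel = 0;
for (int i = 0; i < absTime.Length; i++)
{
    if (absTime[i] >= lineEnd)
        break;                          // past the end of this line
    while (absTime[i] >= boundary)      // advance to the pixel containing this record
    {
        pixel++;
        boundary += pixelduration;
    }
    if (valid[i] == 1)
        linePixels[pixel]++;
}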

Most probable bits in random integer

I've made the following experiment: I generated 10 million random numbers in C and in C#, and then counted how many times each of the 15 bits in the random integer is set. (I chose 15 bits because C supports random integers only up to 0x7fff.)
What I've got is this:
I have two questions:
Why are there 3 most probable bits? In the C case, bits 8, 10, and 12 are the most probable, and in C# bits 6, 8, and 11 are the most probable.
Also, it seems that the C# most probable bits are mostly shifted by 2 positions compared to the C most probable bits. Why is this? Because C# uses a different RAND_MAX constant, or what?
My test code for C:
void accumulateResults(int random, int bitSet[15]) {
int i;
int isBitSet;
for (i=0; i < 15; i++) {
isBitSet = ((random & (1<<i)) != 0);
bitSet[i] += isBitSet;
}
}
int main() {
int i;
int bitSet[15] = {0};
int times = 10000000;
srand(0);
for (i=0; i < times; i++) {
accumulateResults(rand(), bitSet);
}
for (i=0; i < 15; i++) {
printf("%d : %d\n", i , bitSet[i]);
}
system("pause");
return 0;
}
And test code for C#:
static void accumulateResults(int random, int[] bitSet)
{
int i;
int isBitSet;
for (i = 0; i < 15; i++)
{
isBitSet = ((random & (1 << i)) != 0) ? 1 : 0;
bitSet[i] += isBitSet;
}
}
static void Main(string[] args)
{
int i;
int[] bitSet = new int[15];
int times = 10000000;
Random r = new Random();
for (i = 0; i < times; i++)
{
accumulateResults(r.Next(), bitSet);
}
for (i = 0; i < 15; i++)
{
Console.WriteLine("{0} : {1}", i, bitSet[i]);
}
Console.ReadKey();
}
Many thanks! Btw, the OS is Windows 7, 64-bit architecture, and Visual Studio 2010.
EDIT
Many thanks to @David Heffernan. I made several mistakes here:
The seed in the C and C# programs was different (C was using zero, and C# the current time).
I didn't try the experiment with different values of the times variable to check the reproducibility of the results.
Here's what I got when I analyzed how the probability that the first bit is set depends on the number of times random() was called:
So as many noticed, the results are not reproducible and shouldn't be taken seriously
(except as some form of confirmation that the C/C# PRNGs are good enough :-) ).
This is just common or garden sampling variation.
Imagine an experiment where you toss a coin ten times, repeatedly. You would not expect to get five heads every single time. That's down to sampling variation.
In just the same way, your experiment will be subject to sampling variation. Each bit follows the same statistical distribution. But sampling variation means that you would not expect an exact 50/50 split between 0 and 1.
Now, your plot is misleading you into thinking the variation is somehow significant or carries meaning. You'd get a much better understanding of this if you plotted the Y axis of the graph starting at 0. That graph looks like this:
If the RNG behaves as it should, then each bit will follow the binomial distribution with probability 0.5. This distribution has variance np(1 − p). For your experiment this gives a variance of 2.5 million. Take the square root to get the standard deviation of around 1,500. So you can see simply from inspecting your results, that the variation you see is not obviously out of the ordinary. You have 15 samples and none are more than 1.6 standard deviations from the true mean. That's nothing to worry about.
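To make that check concrete, here is a small C# sketch (assuming the bitSet counts from the programs above) that compares each bit count against the binomial mean and the standard deviation just computed:
int times = 10000000;
double mean = times * 0.5;                       // expected count per bit
double stdDev = Math.Sqrt(times * 0.5 * 0.5);    // ~1,581 for 10 million trials, the ~1,500 mentioned above
for (int i = 0; i < bitSet.Length; i++)
{
    double z = (bitSet[i] - mean) / stdDev;
    Console.WriteLine("bit {0}: count {1}, z = {2:F2}", i, bitSet[i], z);
}
// Anything within about +/- 3 standard deviations is unremarkable.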
You have attempted to discern trends in the results. You have said that there are "3 most probable bits". That's only your particular interpretation of this sample. Try running your programs again with different seeds for your RNGs and you will have graphs that look a little different. They will still have the same quality to them. Some bits are set more than others. But there won't be any discernible patterns, and when you plot them on a graph that includes 0, you will see horizontal lines.
For example, here's what your C program outputs for a random seed of 98723498734.
I think this should be enough to persuade you to run some more trials. When you do so you will see that there are no special bits that are given favoured treatment.
You know that the deviation is about 2,500/5,000,000, which comes down to 0.05%?
Note that the frequency of each bit varies by only about 0.08% (-0.03% to +0.05%). I don't think I would consider that significant. If every bit were exactly equally probable, I would find the PRNG very questionable instead of just somewhat questionable. You should expect some level of variance in processes that are supposed to be more or less modelling randomness...

Looking for a way to optimize this algorithm for parsing a very large string

The following class parses through a very large string (an entire novel of text) and breaks it into consecutive 4-character strings that are stored as Tuples. Then each tuple can be assigned a probability based on a calculation. I am using this as part of a Monte Carlo/genetic algorithm to train the program to recognize a language based on syntax only (just the character transitions).
I am wondering if there is a faster way of doing this. It takes about 400 ms to look up the probability of any given 4-character tuple. The relevant method _Probability() is at the end of the class.
This is a computationally intensive problem related to another post of mine: Algorithm for computing the plausibility of a function / Monte Carlo Method
Ultimately I'd like to store these values in a 4d-matrix. But given that there are 26 letters in the alphabet that would be a HUGE task. (26x26x26x26). If I take only the first 15000 characters of the novel then performance improves a ton, but my data isn't as useful.
Here is the method that parses the text 'source':
private List<Tuple<char, char, char, char>> _Parse(string src)
{
var _map = new List<Tuple<char, char, char, char>>();
for (int i = 0; i < src.Length - 3; i++)
{
int j = i + 1;
int k = i + 2;
int l = i + 3;
_map.Add
(new Tuple<char, char, char, char>(src[i], src[j], src[k], src[l]));
}
return _map;
}
And here is the _Probability method:
private double _Probability(char x0, char x1, char x2, char x3)
{
var subset_x0 = map.Where(x => x.Item1 == x0);
var subset_x0_x1_following = subset_x0.Where(x => x.Item2 == x1);
var subset_x0_x2_following = subset_x0_x1_following.Where(x => x.Item3 == x2);
var subset_x0_x3_following = subset_x0_x2_following.Where(x => x.Item4 == x3);
int count_of_x0 = subset_x0.Count();
int count_of_x1_following = subset_x0_x1_following.Count();
int count_of_x2_following = subset_x0_x2_following.Count();
int count_of_x3_following = subset_x0_x3_following.Count();
decimal p1;
decimal p2;
decimal p3;
if (count_of_x0 <= 0 || count_of_x1_following <= 0 || count_of_x2_following <= 0 || count_of_x3_following <= 0)
{
p1 = e;
p2 = e;
p3 = e;
}
else
{
p1 = (decimal)count_of_x1_following / (decimal)count_of_x0;
p2 = (decimal)count_of_x2_following / (decimal)count_of_x1_following;
p3 = (decimal)count_of_x3_following / (decimal)count_of_x2_following;
p1 = (p1 * 100) + e;
p2 = (p2 * 100) + e;
p3 = (p3 * 100) + e;
}
//more calculations omitted
return _final;
}
}
EDIT - I'm providing more details to clear things up:
1) Strictly speaking I've only worked with English so far, but it's true that different alphabets will have to be considered. Currently I only want the program to recognize English, similar to what's described in this paper: http://www-stat.stanford.edu/~cgates/PERSI/papers/MCMCRev.pdf
2) I am calculating the probabilities of n-tuples of characters where n <= 4. For instance if I am calculating the total probability of the string "that", I would break it down into these independent tuples and calculate the probability of each individually first:
[t][h]
[t][h][a]
[t][h][a][t]
[t][h] is given the most weight, then [t][h][a], then [t][h][a][t]. Since I am not just looking at the 4-character tuple as a single unit, I wouldn't be able to just divide the instances of [t][h][a][t] in the text by the total no. of 4-tuples in the text.
The value assigned to each 4-tuple can't overfit to the text, because by chance many real English words may never appear in the text and they shouldn't get disproportionately low scores. Emphasizing first-order character transitions (2-tuples) ameliorates this issue. Moving to the 3-tuple and then the 4-tuple just refines the calculation.
I came up with a Dictionary that simply tallies the count of how often the tuple occurs in the text (similar to what Vilx suggested), rather than repeating identical tuples which is a waste of memory. That got me from about ~400ms per lookup to about ~40ms per, which is a pretty great improvement. I still have to look into some of the other suggestions, however.
In your probability method you are iterating the map 8 times. Each of your Wheres iterates the entire list, and so does the Count. Adding a .ToList() at the end would (potentially) speed things up. That said, I think your main problem is that the structure you've chosen to store the data in is not suited for the purpose of the probability method. You could create a one-pass version where the structure you store your data in calculates the tentative distribution on insert. That way, when you're done with the insert (which shouldn't be slowed down too much) you're done, or you could, as in the code below, do a cheap calculation of the probability when you need it.
As an aside, you might want to take punctuation and whitespace into account. The first letter/word of a sentence and the first letter of a word give a clear indication of what language a given text is written in, and by taking punctuation characters and whitespace as part of your distribution you include those characteristics of the sample data. We did that some years back. Doing so, we showed that using just three characters was almost as exact as using more (we tested up to 7): we had no failures with three on our test data, and "almost as exact" is an assumption, given that there must be some weird text where the lack of information would yield an incorrect result. The speed of three letters made that the best choice.
EDIT
Here's an example of how I think I would do it in C#
class TextParser{
private Node Parse(string src){
var top = new Node(null);
for (int i = 0; i < src.Length - 3; i++){
var first = src[i];
var second = src[i+1];
var third = src[i+2];
var fourth = src[i+3];
var firstLevelNode = top.AddChild(first);
var secondLevelNode = firstLevelNode.AddChild(second);
var thirdLevelNode = secondLevelNode.AddChild(third);
thirdLevelNode.AddChild(fourth);
}
return top;
}
}
public class Node{
private readonly Node _parent;
private readonly Dictionary<char,Node> _children
= new Dictionary<char, Node>();
private int _count;
public Node(Node parent){
_parent = parent;
}
public Node AddChild(char value){
if (!_children.ContainsKey(value))
{
_children.Add(value, new Node(this));
}
var levelNode = _children[value];
levelNode._count++;
return levelNode;
}
public decimal Probability(string substring){
var node = this;
foreach (var c in substring){
if(!node.Contains(c))
return 0m;
node = node[c];
}
// Dividing by _children.Count (the number of distinct continuations) does not give a
// probability; divide by the parent's total child count instead (needs System.Linq).
return ((decimal) node._count)/node._parent._children.Values.Sum(n => n._count);
}
public Node this[char value]{
get { return _children[value]; }
}
private bool Contains(char c){
return _children.ContainsKey(c);
}
}
the usage would then be:
var top = Parse(src);
top.Probability("test");
I would suggest changing the data structure to make that faster...
I think a Dictionary<char,Dictionary<char,Dictionary<char,Dictionary<char,double>>>> would be much more efficient, since you would be accessing each "level" (Item1...Item4) when calculating... and you could cache the result in the innermost Dictionary so next time you don't have to calculate at all.
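A rough sketch of that structure (hypothetical helper; here the innermost value is a running count, which you could replace with, or extend by, a cached probability):
// Requires: using System.Collections.Generic;
static Dictionary<char, Dictionary<char, Dictionary<char, Dictionary<char, int>>>> BuildCounts(string src)
{
    var counts = new Dictionary<char, Dictionary<char, Dictionary<char, Dictionary<char, int>>>>();
    for (int i = 0; i < src.Length - 3; i++)
    {
        if (!counts.TryGetValue(src[i], out var level2))
            counts[src[i]] = level2 = new Dictionary<char, Dictionary<char, Dictionary<char, int>>>();
        if (!level2.TryGetValue(src[i + 1], out var level3))
            level2[src[i + 1]] = level3 = new Dictionary<char, Dictionary<char, int>>();
        if (!level3.TryGetValue(src[i + 2], out var level4))
            level3[src[i + 2]] = level4 = new Dictionary<char, int>();
        level4.TryGetValue(src[i + 3], out int c);   // c is 0 if this 4-gram is new
        level4[src[i + 3]] = c + 1;                  // tally this 4-character sequence
    }
    return counts;
}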
OK, I don't have time to work out details, but this really calls for:
neural classifier nets (just take any off the shelf; even the Controllable Regex Mutilator would do the job with way more scalability) -- heuristics over brute force
you could use tries (Patricia tries, a.k.a. radix trees) to make a space-optimized version of your data structure that can be sparse (the Dictionary of Dictionaries of Dictionaries of Dictionaries... looks like an approximation of this to me)
There's not much you can do with the parse function as it stands. However, the tuples appear to be four consecutive characters from a large body of text. Why not just replace the tuple with an int and then use the int to index the large body of text when you need the character values? Your tuple-based method is effectively consuming four times the memory the original text would use, and since memory is usually the bottleneck to performance, it's best to use as little as possible.
You then try to find the number of matches in the body of text against a set of characters. I wonder how a straightforward linear search over the original body of text would compare with the LINQ statements you're using? The .Where will be doing memory allocation (which is a slow operation) and the LINQ statement will have parsing overhead (but the compiler might do something clever here). Having a good understanding of the search space will make it easier to find an optimal algorithm.
But then, as has been mentioned in the comments, using a 26^4 matrix would be the most efficient. Parse the input text once and create the matrix as you parse. You'd probably want a set of dictionaries:
SortedDictionary <int,int> count_of_single_letters; // key = single character
SortedDictionary <int,int> count_of_double_letters; // key = char1 + char2 * 32
SortedDictionary <int,int> count_of_triple_letters; // key = char1 + char2 * 32 + char3 * 32 * 32
SortedDictionary <int,int> count_of_quad_letters; // key = char1 + char2 * 32 + char3 * 32 * 32 + char4 * 32 * 32 * 32
Finally, a note on data types. You're using the decimal type. This is not an efficient type, as there is no direct mapping to a CPU-native type and there is overhead in processing the data. Use a double instead; I think the precision will be sufficient. The most precise way would be to store the probability as two integers, the numerator and denominator, and then do the division as late as possible.
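For that last point, a tiny sketch of what "divide as late as possible" can look like (illustrative type, not from the question's code):
// Keep raw counts as integers; only convert to a floating-point probability at the end.
struct Ratio
{
    public long Numerator;     // e.g. count of the longer tuple
    public long Denominator;   // e.g. count of the prefix tuple
    public double Value
    {
        get { return Denominator == 0 ? 0.0 : (double)Numerator / Denominator; }
    }
}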
The best approach here is to use sparse storage and pruning, for example after every 10,000 characters. The best storage structure in this case is a prefix tree; it allows fast calculation of the probability, updating, and sparse storage. You can find more theory in this javadoc: http://alias-i.com/lingpipe/docs/api/com/aliasi/lm/NGramProcessLM.html
