BinDCT implementation for a 32x32 matrix - c#

So I am playing a bit with DCT implementations and noticed they are (relatively) slow due to the necessary multiplications.
After googling a bit, I came across BinDCT, which results in very good approximations of the DCT and only uses bit shifts.
While scanning a paper about it (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.7.834&rep=rep1&type=pdf and http://www.docstoc.com/docs/130118150/Image-Compression-Using-BinDCT) and reading some code I found on Ohloh (http://code.ohloh.net/file?fid=vz-HijUWVLFS65NRaGZpLZwZFq8&cid=mt_ZjvIU0Us&s=&fp=461906&projSelected=true#L0), I noticed there are only implementations for an 8x8 matrix.
I am looking for an implementation of this BinDCT for a 32x32 matrix so I can use it in a faster variation of the perceptual hash algorithm (pHash).
I am no mathematician, and although I tried to understand what's going on in the paper and the C code I found, I just can't wrap my head around how to transform this implementation so that it applies to a 32x32 matrix.
Has anyone ever written one? Is it even possible?
I understand that extending the implementation requires a lot more bit shifting and tmp variables. But although I could try trial and error, I don't even understand the theory, so I would never know whether I was getting the correct result.
I am writing this in C#, but any language would suffice as it's all basic operations and can be easily translated.

1. You have a fixed input size,
so you multiply by the same weights every time.
Pre-compute them once and then use only those;
this ditches all the sin/cos operations at runtime.
2. A 2D DCT can be computed via 1D DCTs (similar to the FFT):
first do a DCT on the rows,
then on the columns of the DCT-ed rows,
then multiply by the normalization constant.
This converts O(N^4) to O(N^3). (A C# sketch combining points 1, 2 and 4 follows this list.)
3. Use a fast DCT.
Well, this is very tricky.
The fast algorithm is a fusion between the (I)DST and the (I)DCT.
There are a few papers about it,
but they are vague (and the equations differ between papers and are never complete).
I have actually never seen a working equation or program for it.
The only almost-functional approach is via the FFT,
but for small N there is no gain because of the switch to the complex domain,
and the values are not really a DCT, only a close approximation to it.
Of course, I am no expert in this field, so I may have overlooked something
in all those hundreds of pages of equations.
Anyway, with a fast implementation of the 2D (I)DCT combined with point 2,
the complexity is around O(N^2 log N).
4. Ditch the FPU multiplications.
You can take all the weights and convert them to fixed point, e.g. a1 = a0*1024
(or any other scale),
so:
x*a0 = (x*a1)/1024 = (x*a1)>>10
The same can be done for the input data,
so only integer operations remain.
On modern machines this approach can be slower than using the FPU (it depends on the platform and implementation).
5. Ditch the integer multiplications.
You can replace all multiplications with shift-and-add operations (look up binary multiplication),
but on modern machines this will actually slow things down.
Of course, if you are wiring this on some logic board/IO then it has its merit.
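To make points 1, 2 and 4 concrete, here is a minimal C# sketch (not the BinDCT itself - as I understand it, BinDCT proper replaces the multiplications with lifting/shift-add steps) of a separable 32x32 DCT-II with the cosine weights pre-computed once and scaled to fixed point, so the inner loops use only integer multiplications and shifts. All names and the 12-bit scale are illustrative choices, and the result is only an approximation of the floating-point DCT because of the rounding.

using System;

static class Dct32
{
    const int N = 32;
    const int Shift = 12;                       // fixed-point scale: weights * 4096 (point 4)
    static readonly long[,] W = BuildWeights(); // pre-computed once (point 1)

    static long[,] BuildWeights()
    {
        var w = new long[N, N];
        for (int k = 0; k < N; k++)
        {
            double s = (k == 0) ? Math.Sqrt(1.0 / N) : Math.Sqrt(2.0 / N);
            for (int n = 0; n < N; n++)
                w[k, n] = (long)Math.Round(
                    s * Math.Cos(Math.PI * (2 * n + 1) * k / (2.0 * N)) * (1 << Shift));
        }
        return w;
    }

    // 1D DCT-II of one length-32 vector; the loop uses only integer operations.
    static void Dct1D(long[] src, long[] dst)
    {
        for (int k = 0; k < N; k++)
        {
            long sum = 0;
            for (int n = 0; n < N; n++) sum += src[n] * W[k, n];
            dst[k] = sum >> Shift;              // same as dividing by 4096
        }
    }

    // 2D DCT = 1D DCT on the rows, then on the columns of the row results (point 2).
    public static long[,] Dct2D(byte[,] pixels)
    {
        var tmp = new long[N, N];
        var result = new long[N, N];
        var vec = new long[N];
        var outv = new long[N];

        for (int y = 0; y < N; y++)
        {
            for (int x = 0; x < N; x++) vec[x] = pixels[y, x];
            Dct1D(vec, outv);
            for (int x = 0; x < N; x++) tmp[y, x] = outv[x];
        }
        for (int x = 0; x < N; x++)
        {
            for (int y = 0; y < N; y++) vec[y] = tmp[y, x];
            Dct1D(vec, outv);
            for (int y = 0; y < N; y++) result[y, x] = outv[y];
        }
        return result;
    }
}

For pHash you would then keep only the top-left 8x8 block of low-frequency coefficients and compare them against their median, so the full 32x32 transform is the part this sketch tries to keep cheap.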

My only experience with applying matrices relates to manipulating 3D vectors, so I don't know the answer to your question directly. But in looking around, I did find this link to a blog where your specific issue is addressed. The comments at the bottom are from a number of people with knowledge in this area who could be a good pool of resources to chat with. Also, if you follow the links there is a lot of good image compression info.
The author appears to be heavily involved in photo forensics. He explains how pHash is more robust than the average hash and mentions using a 32 x 32 matrix.
This could be a really good starting point. Take care.
http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html

Related

Exocortex.dsp FFT vs mathematically strict DFT. C# in Unity3d

I'm calculating the autocorrelation of audio samples. The direct calculation of autocorrelation can be sped up from O(n^2) to O(n log n) by using the FFT, exploiting the convolution theorem. Both the forward and the inverse FFT are needed.
I made a test script in python, just to make sure I knew what I was doing, and it works. But in my C# version, it doesn't.
I know that many implementations of the FFT give answers that differ from the mathematically strict DFT. For instance, you may need to divide your results by N (the number of bins.)
... tried that, still didn't work ...
I've striven mightily to find some documentation about the details of Exocortex's FFT, to no avail. (cue someone finding it in < 1 sec...)
Does anyone out there know the details of the Exocortex implementation of FFT, and how to get the mathematically strict values for the DFT, and inverse DFT of a signal?
I've got my code working!
As everybody has probably guessed, there was an additional bug in my code, which was confounding my efforts to understand Exocortex's fft.
In Exocortex's forward fft, you need to divide by the fft size to get the canonical values for the transform. For the inverse fft, you need to multiply by the fft size. I believe this is the same as Apple's Accelerate, whereas in numpy and matlab you get the actual DFT values.
Perhaps the practice of requiring division or multiplication by the fft size is extremely widespread – if you know the situation, I invite you to comment.
Of course, many times people are only interested in fft values relative to each other, in which case scaling by a constant doesn't matter.
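To make the scaling concrete, here is a minimal sketch of the autocorrelation-via-FFT route described in the question, with the forward result divided by the FFT size and the inverse result multiplied by it, as described above. The Exocortex names used here (Fourier.FFT(ComplexF[], int, FourierDirection), the ComplexF struct with Re/Im fields, and FourierDirection.Backward) are written from memory, so verify them against the version of Exocortex.DSP you actually have.

using Exocortex.DSP;   // assumed namespace; check your Exocortex.DSP version

static float[] Autocorrelate(float[] samples)
{
    int n = samples.Length;                     // assumed to be a power of two
    var buf = new ComplexF[n];
    for (int i = 0; i < n; i++) { buf[i].Re = samples[i]; buf[i].Im = 0f; }

    Fourier.FFT(buf, n, FourierDirection.Forward);
    for (int i = 0; i < n; i++)
    {
        // divide by n to get the canonical forward DFT values (per the note above),
        // then take |X|^2 to form the power spectrum
        float re = buf[i].Re / n, im = buf[i].Im / n;
        buf[i].Re = re * re + im * im;
        buf[i].Im = 0f;
    }

    Fourier.FFT(buf, n, FourierDirection.Backward);
    var result = new float[n];
    for (int i = 0; i < n; i++)
        result[i] = buf[i].Re * n;              // multiply by n for the inverse, as above
    return result;
}

If you only care about the relative shape of the autocorrelation, the constant factors cancel out and you can skip the scaling entirely.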
If anyone out there knows where there is decent documentation for Exocortex DSP, I invite you to comment on that as well.

Perceptual image hashing

OK. This is part of a (non-English) OCR project. I have already completed preprocessing steps like deskewing, grayscaling, segmentation of glyphs etc., and am now stuck at the most important step: identification of a glyph by comparing it against a database of glyph images. I thus need to devise a robust and efficient perceptual image hashing algorithm.
For many reasons, the function I require won't be as complicated as required by the generic image comparison problem. For one, my images are always grayscale (or even B&W if that makes the task of identification easier). For another, those glyphs are more "stroke-oriented" and have simpler structure than photographs.
I have tried some of my own and some borrowed ideas for defining a good similarity metric. One method was to divide the image into a grid of M x N cells and take average "blackness" of each cell to create a hash for that image, and then take Euclidean distance of the hashes to compare the images. Another was to find "corners" in each glyph and then compare their spatial positions. None of them have proven to be very robust.
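For reference, here is a minimal sketch of that grid-based "average blackness" hash with Euclidean distance, roughly as described above; the 8x8 grid size and the method names are illustrative choices, and GetPixel is used only to keep the sketch short (LockBits would be much faster).

using System;
using System.Drawing;

static class GridHash
{
    public static double[] Compute(Bitmap glyph, int gridW = 8, int gridH = 8)
    {
        var hash = new double[gridW * gridH];
        var counts = new int[gridW * gridH];

        for (int y = 0; y < glyph.Height; y++)
            for (int x = 0; x < glyph.Width; x++)
            {
                int cx = x * gridW / glyph.Width;
                int cy = y * gridH / glyph.Height;
                Color c = glyph.GetPixel(x, y);
                double blackness = 1.0 - (c.R + c.G + c.B) / (3.0 * 255.0);
                hash[cy * gridW + cx] += blackness;
                counts[cy * gridW + cx]++;
            }

        for (int i = 0; i < hash.Length; i++)
            if (counts[i] > 0) hash[i] /= counts[i];    // average blackness per cell
        return hash;
    }

    public static double Distance(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.Sqrt(sum);                          // Euclidean distance between hashes
    }
}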
I know there are stronger candidates like SIFT and SURF out there, but I have three good reasons not to use them. One is that I guess they are proprietary (or somehow patented) and cannot be used in commercial apps. The second is that they are very general purpose and would probably be overkill for my somewhat simpler domain of images. The third is that there are no implementations available (I'm using C#). I have even tried to convert the pHash library to C#, but remained unsuccessful.
So I'm finally here. Does anyone know of any code (C#, C++, Java or VB.NET, but it shouldn't require any dependencies that cannot be used in the .NET world), library, algorithm, method or idea to create a robust and efficient hashing algorithm that can survive minor visual defects like translation, rotation, scaling, blur, spots etc.?
It looks like you've already tried something similar to this, but it may still be of some use:
https://www.memonic.com/user/aengus/folder/coding/id/1qVeq

Analyzing audio to create Guitar Hero levels automatically

I'm trying to create a Guitar-Hero-like game (something like this) and I want to be able to analyze an audio file given by the user and create levels automatically, but I am not sure how to do that.
I thought maybe I should use a BPM detection algorithm and place an arrow on a beat and a rail on some recurrent pattern, but I have no idea how to implement those.
Also, I'm using NAudio's BlockAlignReductionStream, which has a Read method that copies byte[] data, but what happens when I read a 2-channel audio file? Does it read 1 byte from the first channel and 1 byte from the second (because it says 16-bit PCM)? And does the same happen with 24-bit and 32-bit float?
Beat detection (or more specifically BPM detection)
Beat detection algorithm overview for using a comb filter:
http://www.clear.rice.edu/elec301/Projects01/beat_sync/beatalgo.html
Looks like they do:
A fast Fourier transform
Hanning Window, full-wave rectification
Multiple low pass filters; one for each range of the FFT output
Differentiation and half-wave rectification
Comb filter
Lots of algorithms you'll have to implement here. Comb filters are supposedly slow, though. The wiki article didn't point me at other specific methods.
Edit: This article has information on streaming statistical methods of beat detection. That sounds like a great idea: http://www.flipcode.com/misc/BeatDetectionAlgorithms.pdf - I'm betting they run better in real time, though are less accurate.
BTW I just skimmed and pulled out keywords. I've only toyed with FFT, rectification, and attenuation filters (low-pass filter). The rest I have no clue about, but you've got links.
This will all get you the BPM of the song, but it won't generate your arrows for you.
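For what it's worth, here is a minimal sketch of the streaming "sound energy" style of beat detection the flipcode PDF above describes: compare the energy of each short block against the average energy over roughly the last second and flag a beat when it spikes. The block size, the one-second history and the sensitivity constant are illustrative values to tune, not figures from the article.

using System;
using System.Collections.Generic;

class EnergyBeatDetector
{
    readonly int blockSize;
    readonly int historyBlocks;
    readonly double sensitivity;
    readonly Queue<double> history = new Queue<double>();

    public EnergyBeatDetector(int sampleRate, int blockSize = 1024, double sensitivity = 1.3)
    {
        this.blockSize = blockSize;
        this.sensitivity = sensitivity;
        historyBlocks = Math.Max(1, sampleRate / blockSize);    // roughly one second of blocks
    }

    // Feed one block of mono samples in the range [-1, 1]; returns true when a beat is detected.
    public bool Process(float[] block)
    {
        double energy = 0;
        for (int i = 0; i < blockSize && i < block.Length; i++)
            energy += block[i] * block[i];

        double average = 0;
        foreach (double e in history) average += e;
        bool beat = history.Count == historyBlocks &&
                    energy > sensitivity * (average / history.Count);

        history.Enqueue(energy);
        if (history.Count > historyBlocks) history.Dequeue();
        return beat;
    }
}

Counting the beats flagged per minute then gives you a rough BPM estimate.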
Level generation
As for "place an arrow on a beat and a rail on some recurrent pattern", that is going to be a bit trickier to implement to get good results.
You could go with a more aggressive content extraction approach, and try to pull the notes out of the song.
You'd need to use beat detection for this part too. This may be similar to BPM detection above, but at a different range, with a band-pass filter for the instrument range. You also would swap out or remove some parts of the algorithm, and would have to sample the whole song since you're not detecting a global BPM. You'd also need some sort of pitch detection.
I think this approach will be messy and will guarantee you need to hand-scrub the results for every song. If you're okay with this, and just want to avoid the initial hand transcription work, this will probably work well.
You could also try to go with a content generation approach.
Most procedural content generation has been done in a trial-and-error manner, with people publishing or patenting algorithms that don't completely suck. Often there is no real qualitative analysis that can be done on content generation algorithms because they generate aesthetics. So you'd just have to pick ones that seem to give pleasing sample results and try it out.
Most algorithms are centered around visual content generation, including terrain, architecture, humanoids, plants etc. There is some research on audio content generation, Generative Music, etc. Your requirements don't perfectly match either of these.
I think algorithms for procedural "dance steps" (if such a thing exists - I only found animation techniques) or Generative Music would be the closest match, if driven by the rhythms you detect in the song.
If you want to go down the composition generation approach, be prepared for a lot of completely different algorithms that are usually just hinted about, but not explained in detail.
E.g.:
http://tones.wolfram.com/about/faqs/howitworks.html
http://research.microsoft.com/en-us/um/redmond/projects/songsmith/

Determining Similarity of Edge-Detection-Processed Images

I was hoping that I could achieve some guidance from the stackoverflow community regarding a dilemma I have run into for my senior project. First off, I want to state that I am a novice programmer, and I'm sure some of you will quickly tell me this project was way over my head. I've quickly become well aware that this is probably true.
Now that that's out of the way, let me give some definitions:
Project Goal:
The goal of the project, like many others have sought to achieve in various SO questions (many of which have been very helpful to me in the course of this effort), is to detect
whether a parking space is full or available, eventually reporting such back to the user (ideally via an iPhone or Droid or other mobile app for ease of use -- this aspect was quickly deemed outside the scope of my efforts due to time constraints).
Tools in Use:
I have made heavy use of the resources of the AForge.Net library, which has provided me with all of the building blocks for bringing the project together in terms of capturing video from an IP camera, applying filters to images, and ultimately completing the goal of detection. As a result, you will have gathered that I have chosen to program in C#, mainly due to its ease of use for beginners. Other options included MATLAB/C++, C++ with OpenCV, and other alternatives.
The Problem
Here is where I have run into issues. Below is linked an image that has been pre-processed in the AForge Image Processing Lab. The sequence of filters and processes used was: Grayscale, Histogram Equalization, Sobel Edge Detection and finally Otsu Thresholding (though I'm not convinced the final step is needed).
http://i.stack.imgur.com/u6eqk.jpg
As you can tell from the image with the naked eye of course, there are sequences of detected edges which clearly are parked cars in the spaces I am monitoring with the camera. These cars are clearly defined by the pattern of brightened wheels, the sort of "double railroad track" pattern that essentially represents the outer edging of the side windows, and even the outline of the license plate in this instance. Specifically though, in a continuation of the project the camera chosen would be a PTZ to cover as much of the block as possible, and thus I'd just like to focus on the side features of the car (eliminating factors such as the license plate). Features such as a rectangle for a sunroof may also be considered, but obviously this is not a universal feature of cars, whereas the general window outline is.
We can all see that there are differences to these patterns, varying of course with car make and model. But, generally this sequence not only results in successful retrieval of the desired features, but also eliminates the road from view (important as I intend to use road color as a "first litmus test" if you will for detecting an empty space... if I detect a gray level consistent with data for the road, especially if no edges are detected in a region, I feel I can safely assume an empty space). My question is this, and hopefully it is generic enough to be practically beneficial to others out there on the site:
Focused Question:
Is there a way to take an image segment (via cropping) and then compare the detected edge sequence with future new frames from the camera? More specifically, is there a way to do this while allowing leeway, essentially creating a tolerance threshold for minor differences in edges?
Personal Thoughts/Brainstorming on The Question:
-- I'm sure there's a way to literally compare pixel-by-pixel -- crop to just the rectangle around your edges and then slide your cropped image through the new processed frame for comparison pixel-by-pixel, but that wouldn't help particularly unless you had an exact match to your detected edges.
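For what it's worth, a minimal sketch of that pixel-by-pixel comparison, but with leeway built in by summing absolute differences instead of requiring an exact match; the method name and the tolerance value are made up for illustration, and both bitmaps are assumed to be edge-detected images of the same size.

using System;
using System.Drawing;

static class EdgeCompare
{
    public static bool RoughlyMatches(Bitmap croppedEdges, Bitmap newFrameRegion, double tolerance = 0.1)
    {
        long totalDiff = 0;
        for (int y = 0; y < croppedEdges.Height; y++)
            for (int x = 0; x < croppedEdges.Width; x++)
                totalDiff += Math.Abs(croppedEdges.GetPixel(x, y).R - newFrameRegion.GetPixel(x, y).R);

        // normalize to [0, 1]; small differences in the edges still count as a match
        double avgDiff = totalDiff / (255.0 * croppedEdges.Width * croppedEdges.Height);
        return avgDiff < tolerance;
    }
}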
All help is appreciated, and I'm more than happy to clarify as needed as well.
Let me give it a shot.
You have two images. Let's call them BeforePic and AfterPic. For each of these two pictures you have an ROI (region of interest) - AKA a cropped segment.
You want to see if AfterPic.ROI is very different from BeforePic.ROI. By "very different" I mean that the difference is greater than some threshold.
If this is indeed your problem, then it should be split into three parts:
get BeforePic and AfterPic (and the ROI for each).
Translate the abstract concept of picture\edge difference into a numerical one.
compare the difference to some threshold.
The first part isn't really a part of your question, so I'll ignore it.
The last part basically comes down to finding the right threshold. Again, out of the scope of the question.
The second part is what I think is the heart of the question (I hope I'm not completely off here). For this I would use the algorithm ShapeContext (In the PDF, it'll be best for you to implement it up to section 3.3, as it gets too robust for your needs from 3.4 and on).
Shape Context is an image matching algorithm based on image edges, with great success rates.
Implementing this was my finals project, and it seems like a perfect match (no pun intended) for you. If your edges are good and your ROI is accurate, it won't fail you.
It may take some time to implement, but if done correctly, this will work perfectly for you.
Bear in mind that a poor implementation might run slowly; I've seen a worst case of 5 seconds per image. A good (yet not perfect) implementation, on the other hand, will take less than 0.1 seconds per image.
Hope this helps, and good luck!
Edit: I found an implementation of ShapeContext in C# on CodeProject, if it's of any interest.
I take on a fair number of machine vision problems in my work and the most important thing I can tell you is that simpler is better. The more complex the approach, the more likely it is for unanticipated boundary cases to create failures. In industry, we usually address this by simplifying conditions as much as possible, imposing rigid constraints that limit the number of things we need to consider. Granted, a student project is different than an industry project, as you need to demonstrate an understanding of particular techniques, which may well be more important than whether it is a robust solution to the problem you've chosen to take on.
A few things to consider:
Are there pre-defined parking spaces on the street? Do you have the option to manually pre-define the parking regions that will be observed by the camera? This can greatly simplify the problem.
Are you allowed to provide incorrect results when cars are parked illegally (taking up more than one spot, for instance)?
Are you allowed to provide incorrect results when there are unexpected environmental conditions, such as trash, pot holes, pooled water or snow in the space?
Do you need to support all categories of vehicles (cars, flat-bed trucks, vans, delivery trucks, motorcycles, mini electric cars, tripod vehicles, ?)
Are you allowed to take a baseline snapshot of the street with no cars present?
As to comparing two sets of edges, probably the most robust approach is known as geometric model finding (describing the edges of interest mathematically as a series of 'edgels', combining them into chains and comparing the geometry), but this is overkill for your application. I would look more toward thresholding the count of 'edge pixels' present in a parking region, or differencing from a baseline image (you need to be careful of image shift, however, since material expansion from outdoor temperature changes may cause the field of view to change slightly due to the camera mechanically moving).
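To illustrate the edge-pixel-count idea, here is a minimal sketch using the same AForge.NET filters the question already applies (grayscale, Sobel, Otsu) on a manually pre-defined parking rectangle. The AForge class names are written from memory, and the 5% occupancy threshold is just an illustrative number to tune against your own footage.

using System.Drawing;
using AForge.Imaging.Filters;

static class SpaceChecker
{
    // parkingSpace: a manually pre-defined rectangle for one space, in camera coordinates.
    public static bool LooksOccupied(Bitmap frame, Rectangle parkingSpace, double minEdgeFraction = 0.05)
    {
        Bitmap roi = new Crop(parkingSpace).Apply(frame);
        Bitmap gray = Grayscale.CommonAlgorithms.BT709.Apply(roi);
        Bitmap edges = new SobelEdgeDetector().Apply(gray);
        Bitmap binary = new OtsuThreshold().Apply(edges);

        // Count white (edge) pixels; GetPixel is slow but keeps the sketch short.
        int edgePixels = 0;
        for (int y = 0; y < binary.Height; y++)
            for (int x = 0; x < binary.Width; x++)
                if (binary.GetPixel(x, y).R > 0) edgePixels++;

        double fraction = (double)edgePixels / (binary.Width * binary.Height);
        return fraction > minEdgeFraction;      // lots of edges in the space => probably a car
    }
}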

Best Jpeg Encoder for Silverlight 4.0

I want to convert a WriteableBitmap to a JPEG stream, and it looks like there is no platform support. I can see a bunch of open-source encoder libraries on the web, and I want to get your opinion on which is the recommended one in terms of performance and reliability.
I have had good experience with FJCore.
I also blogged about it a while ago: http://kodierer.blogspot.com/2009/11/convert-encode-and-decode-silverlight.html
I've spent quite a bit of time with both FJCore and LibJpeg.Net. FJCore is easier to use, since it was ported over from Java, and has an object model that vaguely resembles what you'd expect to see in C#. However, LibJpeg.NET is by far the more complete library (it's based on the informally canonical libjpeg), and it's significantly faster as well. To give one example, FJCore uses a naive implementation of an inverse discrete cosine transform that involves something like 1024 multiplications and an additional 1024 additions for each 8x8 block. In contrast, LibJpeg.NET uses the high performance AAN algorithm which only takes 144 multiplications and 464 additions (see http://datasheets.chipdb.org/Intel/x86/MMX/MMX/AP528.HTM#AAN Algorithm). In addition, FJCore is fairly inefficient in how it uses memory, constantly recreating objects that could easily be re-used. At the same time, because FJCore has fewer optimizations, it's significantly easier to hack.
For my current project (which involves writing a video codec for Silverlight), I used FJCore as a starting point, fixed a whole bunch of its inefficiencies, replaced its IDCT algorithm with the one from LibJpeg.NET, and ended up with something that gave me about 10x the original performance.
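To make the arithmetic above concrete, here is a sketch of the kind of naive separable 8x8 inverse DCT being described: 8 rows x 8 outputs x 8 multiplications for the row pass, plus the same for the column pass, is 1024 multiplications per block. This is an illustration of the technique, not FJCore's actual code.

using System;

static class NaiveIdct8
{
    const int N = 8;
    static readonly double[,] C = BuildBasis();

    static double[,] BuildBasis()
    {
        var c = new double[N, N];
        for (int k = 0; k < N; k++)
            for (int n = 0; n < N; n++)
                c[k, n] = (k == 0 ? Math.Sqrt(1.0 / N) : Math.Sqrt(2.0 / N))
                          * Math.Cos(Math.PI * (2 * n + 1) * k / (2.0 * N));
        return c;
    }

    public static double[,] Transform(double[,] coeffs)
    {
        var tmp = new double[N, N];
        var pixels = new double[N, N];

        // Row pass: 8 rows * 8 outputs * 8 multiplications = 512
        for (int u = 0; u < N; u++)
            for (int x = 0; x < N; x++)
            {
                double s = 0;
                for (int v = 0; v < N; v++) s += coeffs[u, v] * C[v, x];
                tmp[u, x] = s;
            }

        // Column pass: another 512 multiplications
        for (int x = 0; x < N; x++)
            for (int y = 0; y < N; y++)
            {
                double s = 0;
                for (int u = 0; u < N; u++) s += tmp[u, x] * C[u, y];
                pixels[y, x] = s;
            }
        return pixels;
    }
}

Algorithms like AAN cut that multiplication count by exploiting the symmetry of the cosine basis and folding constant scale factors into the dequantization step.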
Ken, why don't you submit your updated code to the FJCore source?
http://code.google.com/p/fjcore/
