I'm trying to create a program which gets the various "notes" in a sound file (WAV or MP3) and can get the frequency and amplitude of each. I've been searching around for this, and of course there is the problem of distinguishing individual "notes" in a music file that isn't MIDI, but it seems that something along these lines can be done with NAudio or DirectSound. Any ideas?
Thanks!
What you are asking to do is extremely difficult.
Step one would be to convert your audio from the time domain to the frequency domain. That is, you take a block of samples and apply a Fourier transform (implemented in software as an FFT).
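Not a solution to the hard part, but as a concrete starting point, here is a minimal sketch of that first step using NAudio's FFT helper (NAudio.Dsp.FastFourierTransform is a real NAudio class; the assumption here is a power-of-two block of mono float samples in [-1, 1]):

```csharp
using System;
using NAudio.Dsp; // Complex, FastFourierTransform

static class Spectrum
{
    // Returns the magnitude spectrum for one block of samples
    public static float[] GetMagnitudes(float[] samples)
    {
        // FFT length must be a power of two; m = log2(length)
        int m = (int)Math.Round(Math.Log(samples.Length, 2.0));
        var fft = new Complex[samples.Length];
        for (int i = 0; i < samples.Length; i++)
        {
            // Window the block to reduce spectral leakage
            fft[i].X = (float)(samples[i] * FastFourierTransform.HammingWindow(i, samples.Length));
            fft[i].Y = 0f;
        }
        FastFourierTransform.FFT(true, m, fft);

        // Keep the first half of the spectrum (up to Nyquist)
        var magnitudes = new float[samples.Length / 2];
        for (int i = 0; i < magnitudes.Length; i++)
            magnitudes[i] = (float)Math.Sqrt(fft[i].X * fft[i].X + fft[i].Y * fft[i].Y);
        return magnitudes;
    }
}
// Bin i corresponds to frequency i * sampleRate / samples.Length.
```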
Next, you begin deciding what you call a note or not. This is not as simple as picking out the loudest of the frequencies! Different instruments have different timbres, which are created by their harmonics. If you had a song of nothing but sine waves, this would be much simpler. As it is, you'll find that you start seeing notes where your ear tells you they don't exist.
Now, psychoacoustics comes into play. It is entirely possible for humans to "hear" notes that do not even have a fundamental. This is particularly true in a musical context. If I were to take a trombone and start playing a scale downward, at some point the fundamental disappears or is mostly gone. However, you will still perceive that scale as going downward, even though the fundamental has all but disappeared. Things get really tricky at this point.
To answer your question, start with an FFT. Maybe this is sufficient for your needs. If not, begin reading the significant amount of technical literature on the subject.
I know that a certain delay between the "play order" and the actual start of the playback of a sound is inevitable.
However, for my current project, I must be able to start sound playback at a certain moment in time. This moment is known, so the solution to the problem is either to reduce the delay time as much as possible or to somehow predict the latency and start the sound somewhat earlier (depending on the predicted latency).
I describe the problem in detail here:
https://naudio.codeplex.com/discussions/662236
My current solution is to use NAudio to play a sound and simultaneously observe the sound output volume. This way I can measure the latency and use it to time the "play order" for the following sounds (a sketch of this metering approach is included below).
This way I get decent results (about 30 ms deviation from the intended play time), but I wanted to ask if you have better suggestions.
Best regards and many thanks
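To make the approach concrete, here is a minimal sketch of the "observe the output volume" idea using NAudio's WasapiLoopbackCapture, which captures whatever the sound card is actually rendering (the click.wav test file and the 0.01 onset threshold are assumptions; note the loopback path adds some capture latency of its own, so this measures an upper bound):

```csharp
using System;
using System.Diagnostics;
using NAudio.Wave;

var sw = new Stopwatch();

// Loopback capture "listens" to the device's own output
var loopback = new WasapiLoopbackCapture();
loopback.DataAvailable += (s, e) =>
{
    // Loopback typically delivers 32-bit IEEE float samples
    for (int i = 0; i + 4 <= e.BytesRecorded; i += 4)
    {
        float sample = BitConverter.ToSingle(e.Buffer, i);
        if (sw.IsRunning && Math.Abs(sample) > 0.01f) // onset above the noise floor
        {
            Console.WriteLine($"~{sw.ElapsedMilliseconds} ms from Play() to audible output");
            sw.Stop();
            break;
        }
    }
};
loopback.StartRecording();

var reader = new AudioFileReader("click.wav"); // hypothetical test sound
var output = new WaveOutEvent { DesiredLatency = 60 }; // lower = less delay
output.Init(reader);
sw.Start();
output.Play();
Console.ReadLine();
```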
I'm currently working on an idea for a game I have that involves beat detection. The engine I'm working with is Unity, and I've never had any experience with audio, coding-wise, so be gentle :)
I've looked at several articles and tested out several algorithms, including some of my own, but none were really successful or accurate enough, and I feel like I've been getting something wrong this entire time.
Specifically, I've tried implementing the ideas presented here:
http://archive.gamedev.net/archive/reference/programming/features/beatdetection/index.html
but with little success. I still think I'm skipping over something, and I can't quite pinpoint it.
If someone could provide an explanation of how to make an actually accurate beat detector, I would be very grateful.
EDIT:
Some people were confused as to what I'm having trouble with. Here is my latest try at detecting beats; I still don't understand why it's so inaccurate:
http://pastebin.com/BD8y9tfz
In this, I used equation (R1) from the link I posted above to compute the instant energy from the 1024 samples I took, then used (R3) to calculate the local average sound energy from the buffer containing all the previous instant-energy calculations. Then I checked whether there is a significant rise in instant energy compared to the local average sound energy: if there is, it means there is a beat; if there isn't, the program continues as usual (a sketch of this scheme follows below).
(Stupid reputation system doesn't let me post links and pictures.)
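For reference, here is a minimal C# sketch of that scheme as I understand it from the article (the 43-block history and the 1.3 sensitivity constant are the article's typical values, not guaranteed to fit your material):

```csharp
using System.Collections.Generic;
using System.Linq;

class SimpleBeatDetector
{
    private readonly Queue<float> history = new Queue<float>(); // ~1 second of instant energies
    private const int HistoryLength = 43;   // 44100 Hz / 1024 samples ≈ 43 blocks per second
    private const float Sensitivity = 1.3f; // beat if instant energy > 1.3 * local average

    // Call once per block of 1024 mono samples in [-1, 1]
    public bool IsBeat(float[] block)
    {
        float instant = 0f;                 // (R1): sum of squared samples
        foreach (float s in block)
            instant += s * s;

        bool beat = false;
        if (history.Count == HistoryLength)
        {
            float localAverage = history.Average(); // (R3): local average energy
            beat = instant > Sensitivity * localAverage;
            history.Dequeue();
        }
        history.Enqueue(instant);
        return beat;
    }
}
```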
Edit 2:
Added an implementation for (R4), (R5) and (R6), but it's still not working.
I added a bit of debugging, and for some reason the constant is ridiculously small; I get numbers like:
Constant: -103416
and Constant: -54793.28. I've got no clue why I'm getting these numbers. Any help?
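For what it's worth, here is a sketch of the variance-based constant as I read the article. Judging by the numbers quoted above, the linear fit is C = -0.0025714 * V + 1.5142857, which only yields a sensible C (between roughly 1.0 and 1.5) when the variance V is below about 200. A constant of -103416 implies V is around 4×10⁷, i.e. about five orders of magnitude too large, which usually means the energies were computed from raw PCM sample values rather than samples normalised to [-1, 1]:

```csharp
using System.Linq;

static class BeatConstants
{
    // energyHistory must be built from samples normalised to [-1, 1],
    // or the variance explodes and C goes hugely negative.
    public static float SensitivityConstant(float[] energyHistory)
    {
        float avg = energyHistory.Average();                        // <E>
        float v = energyHistory.Select(e => (e - avg) * (e - avg))  // variance of the history
                               .Average();
        return -0.0025714f * v + 1.5142857f;                        // the article's linear fit
    }
}
```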
I am trying to output audio samples, and I'm doing so with cswavplay from http://www.codeproject.com/KB/audio-video/cswavplay.aspx, which in turn seems to use DllImports from winmm.dll.
I did get it to play using 8-bit samples, but it fails miserably when I try to feed it 16-bit samples. I dug through the code as best I can, and I understand it as this:
I get a pointer to a buffer to fill each time cswavplay finishes playing the last buffer. It works for one iteration; it plays back one buffer, sometimes...
I get all sorts of funny exceptions, an AccessViolationException just now, for instance, when I tried to use a buffer size of 44100 to hear more clearly how much gets played. But when I put breakpoints in various places inside the WaveOut class (part of cswavplay), it seems none of the objects it uses, like the buffers and an instance of AutoResetEvent, are still alive on the second iteration. My best guess is that these problems are related to threading or GC. The exceptions seem pretty random, and I am far too inexperienced to fully comprehend what's going on.
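For what it's worth (a general observation about winmm.dll interop, not a diagnosis of cswavplay specifically): a classic cause of exactly these symptoms is the GC collecting the callback delegate or relocating the managed buffers while native code still holds pointers into them. The usual defensive pattern, sketched here with hypothetical names, is to keep the delegate rooted in a field and pin every buffer for the lifetime of its playback:

```csharp
using System;
using System.Runtime.InteropServices;

class PinnedAudioBuffer : IDisposable
{
    public byte[] Data { get; }
    private GCHandle handle;

    public PinnedAudioBuffer(int size)
    {
        Data = new byte[size];
        // Pin so the GC cannot relocate the array while winmm writes/reads it
        handle = GCHandle.Alloc(Data, GCHandleType.Pinned);
    }

    // Safe to pass to native code for as long as this object stays alive
    public IntPtr Pointer => handle.AddrOfPinnedObject();

    public void Dispose() => handle.Free(); // only after the device is done with it
}

// Likewise, the waveOut callback delegate must be stored in a field
// (not passed as a lambda/local), or the GC may collect it mid-playback.
```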
I'm asking for either of the following:
1) Wild guesses as to what could be the problem
2) Educated guesses as to what could be the problem
3) Pointers to an alternative way of outputting sound in realtime using C#
I'm not asking for a thorough bug hunt through software I didn't write, so don't mind cswavplay...
At the end of the day, I might be doing something wrong here, but it's hard to know when I don't get a relevant exception (along the lines of a BufferAllocationException or something)...
EDIT:
Thanks for all the suggestions about other sound APIs. They all seem to assume a .wav file. I'm sorry for not being clear: I'm not playing .wav files, I synthesize samples in real time. (One way to do that with NAudio is sketched below.)
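Addressing point 3 above: NAudio can pull synthesized samples on demand through its ISampleProvider interface, so no .wav file is involved. A minimal sketch generating a 440 Hz sine in real time (the frequency and amplitude are illustrative):

```csharp
using System;
using NAudio.Wave;

class SineProvider : ISampleProvider
{
    private double phase;
    public WaveFormat WaveFormat { get; } = WaveFormat.CreateIeeeFloatWaveFormat(44100, 1);

    // Called by the output device whenever it needs more samples
    public int Read(float[] buffer, int offset, int count)
    {
        for (int i = 0; i < count; i++)
        {
            buffer[offset + i] = (float)(0.25 * Math.Sin(phase));
            phase += 2 * Math.PI * 440.0 / WaveFormat.SampleRate;
        }
        return count; // returning 0 would end playback
    }
}

// Usage:
// var output = new WaveOutEvent();
// output.Init(new SineProvider());
// output.Play();
```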
DirectSound, and for .NET the XNA framework, come to mind. There are many high-quality samples out there showing how to play sound and animate graphics at the same time with .NET. (A sketch of real-time synthesis with XNA follows.)
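For example, XNA's DynamicSoundEffectInstance (a real Microsoft.Xna.Framework.Audio class; the sine generator around it is just an illustration) lets you submit 16-bit PCM buffers on demand:

```csharp
using System;
using Microsoft.Xna.Framework.Audio;

// Inside an XNA Game this just works; outside a Game you must call
// FrameworkDispatcher.Update() periodically for audio events to fire.
var dynamicSound = new DynamicSoundEffectInstance(44100, AudioChannels.Mono);
double phase = 0;

dynamicSound.BufferNeeded += (s, e) =>
{
    var buffer = new byte[4096]; // 2048 16-bit samples
    for (int i = 0; i < buffer.Length; i += 2)
    {
        short sample = (short)(Math.Sin(phase) * 0.25 * short.MaxValue);
        buffer[i] = (byte)(sample & 0xFF);      // little-endian low byte
        buffer[i + 1] = (byte)(sample >> 8);    // high byte
        phase += 2 * Math.PI * 440.0 / 44100.0;
    }
    dynamicSound.SubmitBuffer(buffer);
};

dynamicSound.Play();
```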
I'm having trouble with this problem: a program that detects a snare drum sound in a sound file, generates its waveform, and emphasizes the part where the snare drum was detected. Does such a program perhaps exist? :) This is the research/thesis assigned to me. I've been researching possible algorithms and I've seen some initial research. This falls within the field of sound detection, right? Can you please tell me some ideas, or point me to any material or code snippets that I can use? I really appreciate it. Thank you! :)
Percussion sounds have different characteristics. A kick drum has most of its energy in the lower part of the frequency spectrum, and cymbals/hats have most of theirs at the high end. A snare drum's distribution is generally fairly wide and similar in timbre to wideband noise. So to detect this, you'd have to perform a Fourier transform over the signal to do the analysis in the frequency domain rather than the time domain, and detect percussive wideband noise (one way to quantify "noise-like" is sketched below). You'd definitely be better off asking this on a DSP forum rather than a programmers' forum.
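A hedged sketch of one standard measure of how "noise-like" a spectrum is: spectral flatness, the geometric mean of the FFT magnitudes divided by their arithmetic mean. Values near 1 suggest wideband noise (snare-like); values near 0 suggest tonal content. The 0.5 threshold below is an assumption to tune:

```csharp
using System;

static class Snare
{
    public static double SpectralFlatness(double[] magnitudes)
    {
        // Work in logs to avoid underflow in the geometric mean
        double logSum = 0, sum = 0;
        foreach (double m in magnitudes)
        {
            double x = Math.Max(m, 1e-12); // guard against log(0)
            logSum += Math.Log(x);
            sum += x;
        }
        double geometricMean = Math.Exp(logSum / magnitudes.Length);
        double arithmeticMean = sum / magnitudes.Length;
        return geometricMean / arithmeticMean;
    }
}

// e.g. flag a frame as snare-like if SpectralFlatness(frame) > 0.5
// *and* its overall energy jumps relative to the recent average.
```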
I don't know if it helps you, but HERE you can get a .NET library (free for non-commercial use) which can do this sort of thing. Maybe you can use ILSpy to take a look at their algorithms...
Given two byte arrays of data captured from a microphone, how can I determine which one has more spikes of noise? I assume there is an algorithm I can apply to the data, but I have no idea where to start.
Getting down to it, I need to be able to determine when a baby is crying vs ambient noise in the room.
If it helps, I am using the Microsoft.Xna.Framework.Audio.Microphone class to capture the sound.
You can convert each sample (normalised to the range -1.0 to 1.0) into a decibel rating by applying the formula:
dB = 20 * log10(abs(sample))
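A tiny sketch of that conversion (assumes float samples in [-1, 1]; the floor value keeps log10(0) from producing negative infinity):

```csharp
using System;

static class Level
{
    public static double ToDecibels(float sample)
    {
        double magnitude = Math.Max(Math.Abs(sample), 1e-10);
        return 20.0 * Math.Log10(magnitude); // 0 dB = full scale, quieter is negative
    }
}
```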
To be honest, as long as you don't mind the occasional false positive and your microphone is set up OK, you should have no problem telling the difference between a baby crying and ambient background noise without going through the hassle of doing an FFT.
I'd recommend having a look at the source code for a noise gate, which does pretty much what you're after, with configurable attack times and thresholds (a bare-bones sketch follows).
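A bare-bones noise gate sketch (not from any particular library): the gate opens as soon as the level crosses the threshold and closes again after the signal has stayed below it for the release time. The threshold and release values are illustrative assumptions:

```csharp
using System;

class NoiseGate
{
    private readonly float threshold;    // linear amplitude, e.g. 0.1f
    private readonly int releaseSamples; // how long below threshold before closing
    private int quietCount;
    public bool IsOpen { get; private set; }

    public NoiseGate(float threshold, int sampleRate, double releaseSeconds = 0.2)
    {
        this.threshold = threshold;
        releaseSamples = (int)(sampleRate * releaseSeconds);
    }

    // Feed samples one at a time; returns true while the gate is open
    public bool Process(float sample)
    {
        if (Math.Abs(sample) >= threshold)
        {
            IsOpen = true;  // "attack": open immediately on a loud sample
            quietCount = 0;
        }
        else if (IsOpen && ++quietCount >= releaseSamples)
        {
            IsOpen = false; // "release": close after sustained quiet
        }
        return IsOpen;
    }
}
```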
First, use a Fast Fourier Transform to move the signal into the frequency domain.
Then check whether the signal in the typical "cry" frequencies is significantly higher than the other amplitudes.
The preprocessor of the Speex codec supports noise vs. signal detection, but I don't know if you can get it to work with XNA.
Or, if you really want some kind of loudness measure, calculate the sum of squares of the amplitudes of the frequencies you're interested in (for example 50-20000 Hz), and if the average of that over the last 30 seconds is significantly higher than the average over the last 10 minutes, or exceeds a certain absolute threshold, sound the alarm (sketched below).
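A sketch of that two-window comparison. It assumes you already produce one band-energy value per second (sum of squared FFT magnitudes in the band of interest); the 3x ratio and absolute threshold are illustrative assumptions to tune:

```csharp
using System.Collections.Generic;
using System.Linq;

class CryAlarm
{
    private readonly Queue<double> shortWindow = new Queue<double>(); // last 30 s
    private readonly Queue<double> longWindow = new Queue<double>();  // last 10 min
    private const double AbsoluteThreshold = 1000.0; // assumption, tune per setup

    // Call once per second with that second's band energy
    public bool Update(double bandEnergy)
    {
        Push(shortWindow, bandEnergy, 30);
        Push(longWindow, bandEnergy, 600);

        double recent = shortWindow.Average();
        double baseline = longWindow.Average();
        return recent > 3.0 * baseline || recent > AbsoluteThreshold;
    }

    private static void Push(Queue<double> q, double value, int capacity)
    {
        q.Enqueue(value);
        if (q.Count > capacity) q.Dequeue();
    }
}
```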
Louder at what point? The signal's average amplitude will tell you which one is louder on average, but that's kind of a dumb, brute-force way to go about it. It may work for you in practice, though.
Getting down to it, I need to be able to determine when a baby is crying vs ambient noise in the room.
Ok, so, I'm just throwing out ideas here; I am by no means an expert on audio processing.
If you know your input, i.e., a baby crying (relatively loud with a high pitch) versus ambient noise (relatively quiet), you should be able to analyze the signal in terms of pitch (frequency) and amplitude (loudness). Of course, if during the recording someone drops some pots and pans onto the kitchen floor, that will be tough to discern.
As a first pass, I would simply traverse the signal, maintaining a running standard deviation of pitch and amplitude throughout, and then set a flag when those deviations jump beyond some threshold that you will have to define. When they come back down, you may be able to safely assume that you captured the baby's cry (a sketch of the running-statistics part follows below).
Again, I'm just throwing out an idea here. You will have to see how it works in practice with actual data.
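A sketch of that running-statistics idea, using Welford's online algorithm so the mean and standard deviation update in a single pass. The 3-sigma flag is an illustrative assumption:

```csharp
using System;

class RunningStats
{
    private long n;
    private double mean, m2;

    public void Add(double x)
    {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean); // Welford update for the variance accumulator
    }

    public double Mean => mean;
    public double StdDev => n > 1 ? Math.Sqrt(m2 / (n - 1)) : 0.0;

    // Flag values far outside what we've seen so far (e.g. a cry onset)
    public bool IsOutlier(double x) => n > 1 && Math.Abs(x - mean) > 3.0 * StdDev;
}
```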
I agree with @Ed Swangren; it will take a lot of playing with samples of data from a lot of sources. To me, it sounds like the trick will be to limit, or hopefully eliminate, false positives. My experience with babies is that they are much louder crying than the environment. So keep track of the average measurements (frequency/amplitude/etc.) of the normal environment, and then classify how well the changes match the characteristics of a crying baby. Those characteristics change from kid to kid, so you'll probably want a system that 'learns'. Best of luck.
Update: you might find this library useful: http://naudio.codeplex.com/