How to convert sound wave to MIDI in C#?

Does anyone know how to convert an analog sound wave to a MIDI file?
I know this is different from converting WAV to MP3, but that's not important for now. I only want to understand the basic logic of the conversion.

I realize that you want to write your own wave-to-MIDI converter.
However, until I looked into it I had no idea that several companies sell software that performs wave-to-MIDI conversion.
For the benefit of those interested, here's a list of audio-to-MIDI programs.
I picked the WidiSoft web site to check out because it was at the top of the list and came up high in a Google search. English is not their first language, but you can download and try the software before you buy it.
This isn't a product review, but depending on your conversion needs, you should be able to find something that already exists.

A wave file stores the actual sound wave of some sound.
A MIDI file can be thought of as notes of music played on predefined instruments (stored on the computer or on the sound card).
Therefore, the sound that MIDI can generate is a subset of the sound that can be stored in a wave file. This means that you cannot convert wave to MIDI in the general case (although you can go the other way round).
If you know certain things about the waves you wish to convert, it might be possible. If, for example, you know that the wave contains only a piano, it might be possible to convert that into notes and from the notes to MIDI.

Use the ideas from the Genetic Programming: Evolution of Mona Lisa project, but applied to sounds instead of images.

It seems to me that you should first break the audio into small clips (maybe 0.25 s each). Then you could use an FFT to get as many of the frequencies and associated amplitudes as you need for each clip (maybe 8?). Then you could round each frequency to the nearest note. From there, the conversion to MIDI is pretty easy, at least at this basic level.
So you need to write C# software that samples the audio clip (done: see NAudio), and then obtain software (I don't know where) that performs the FFT on each clip. Converting a frequency to a note is a table lookup (a small formula works too, as sketched below). You can also find commercial software that does the conversion: https://www.celemony.com/en/melodyne/new-in-melodyne-5
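To illustrate the "round the frequency to the nearest note" step, here is a minimal C# sketch that maps a detected frequency to a MIDI note number using the standard A4 = 440 Hz reference. The helper names are my own; only the formula (note = 69 + 12·log2(f/440)) is standard.

```csharp
using System;

static class NoteMapper
{
    // Convert a frequency in Hz to the nearest MIDI note number (A4 = 440 Hz = note 69).
    public static int FrequencyToMidiNote(double frequencyHz)
    {
        double note = 69 + 12 * Math.Log(frequencyHz / 440.0, 2);
        return (int)Math.Round(note);
    }

    // Optional: name the note for debugging, e.g. 261.63 Hz -> "C4".
    static readonly string[] Names = { "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B" };
    public static string NoteName(int midiNote) => Names[midiNote % 12] + (midiNote / 12 - 1);
}

// Example: NoteMapper.NoteName(NoteMapper.FrequencyToMidiNote(261.63)) returns "C4".
```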

Related

How is sound data stored in .wav format?

I want to make a simple .wav player in C# for learning purposes. I want to get more insight into how audio is stored and played on the computer, so I want to play a .wav manually rather than with a simple call to a built-in function.
I've looked at the structure of .wav files and found some great resources. What I've found is that the wav file format stores the sound data starting from the 44th byte. It contains data about channels and sample rates in the preceding bytes, but that is not relevant to my question.
I found that this data is a sound wave. As far as I know, the height of a sample of a sound wave represents its frequency. But I don't get where the timbre comes from. If I only played sounds at the correct frequency for the correct amount of time, I would get beeps. I could play them simply with System.Console.Beep(freq, duration); but you could hardly call that music.
I have tried looking through multiple resources, but they only described the metadata and didn't cover what exactly is in the sound byte stream. I found a similar question and answer on this site, but it doesn't really answer that question; it is not even marked accepted, I believe because of that.
What exactly is the data in the wave byte stream, and how can you turn it into actual sound played on the computer?
You are mistaken: the height of a sample does not represent a frequency. As a matter of fact, the wav format doesn't use frequencies at all. wav basically works in the following way:
An analog signal is sampled at a specific rate. A common rate for wav is 44,100 Hz, so 44,100 samples are taken each second.
Each sample contains the height of the analog signal at the sample time. A common wav format is the 16-bit format, where 16 bits are used to store the height of the signal.
This all happens separately for each channel.
I'm not sure in which order the data is stored, but maybe some of the great resources you found will help you with that.
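As a concrete illustration of the 16-bit case described above, here is a minimal C# sketch that decodes the raw data section of a 16-bit PCM wav file into sample values. It assumes the simplistic "44-byte header" layout mentioned in the question (real files can contain extra chunks, so a proper parser should walk the chunk list); the file name and variable names are my own.

```csharp
using System;
using System.IO;

class WavSamples
{
    static void Main()
    {
        byte[] file = File.ReadAllBytes("test.wav");

        // Assumes a plain 44-byte canonical header followed by 16-bit PCM data.
        const int dataOffset = 44;
        int sampleCount = (file.Length - dataOffset) / 2;   // 2 bytes per 16-bit sample
        short[] samples = new short[sampleCount];

        for (int i = 0; i < sampleCount; i++)
        {
            // WAV stores samples little-endian. Each value is the "height" (amplitude)
            // of the signal at that instant, in the range -32768..32767.
            samples[i] = BitConverter.ToInt16(file, dataOffset + i * 2);
        }

        // For stereo files the samples are interleaved: left, right, left, right, ...
        Console.WriteLine($"Read {sampleCount} samples; first = {samples[0]}");
    }
}
```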
Adding to the above answer: the height of the sample is the volume when played back. It represents how far back or forward the speaker cone is pulled or pushed to re-create the vibration.
The timbre you refer to is determined by the frequency content of the audio wave.
There is a lot going on in audio; a simple drumbeat will produce sound on several frequencies, including harmonics, or repeated vibrations at different frequencies, but all of this is off topic for a programming site, so you will need to research sound, frequencies, and perhaps DSP.
What you need to know from a computer's perspective is that sound is stored as samples taken at a set rate; as long as we sample at twice the frequency of the sound we wish to capture, we will be able to reproduce the original. The samples record the current level (volume) of the audio at that moment in time; turning the samples back into audio is the job of the digital-to-analogue converter on your sound card.
The operating system looks after passing the samples to the hardware via the appropriate driver. On Windows, WASAPI and ASIO are two APIs you can use to pass the audio to the sound card. Look at open source projects like NAudio to see the code required to call these operating system APIs.
I hope this helps; I suspect the topic is broader than you first imagined.
For anyone who wants more clarification: the value of each sample in the data section of the .wav file represents the amplitude (volume) of the signal at that instant. Samples are played back at a rate defined near the start of the .wav file. A common rate is 44,100 hertz, which is used for CD-quality audio; 44,100 hertz means that a sample is played every 1/44,100 seconds.
If the samples rise linearly, the sound produced gets louder as the file plays. Regular musical sounds are generated by cyclical changes in amplitude, such as sine and saw waves.
Complex sounds such as drums, which produce multiple frequencies, are created by adding the individual sine and cosine waves into a single wave, as in a Fourier series. As an earlier answer said, this happens separately for each channel, which is normally used to create stereo or surround sound.
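To make the "cyclical changes in amplitude" idea concrete, here is a small C# sketch that fills a buffer with 16-bit samples of a 440 Hz sine wave at 44,100 Hz, which is exactly the kind of data the .wav data section contains. The frequency, duration and amplitude chosen here are arbitrary examples.

```csharp
using System;

class SineBuffer
{
    static void Main()
    {
        const int sampleRate = 44100;      // samples per second
        const double frequency = 440.0;    // A4, an arbitrary example pitch
        const double durationSeconds = 1.0;
        const short amplitude = 16000;     // peak level, well below short.MaxValue

        int count = (int)(sampleRate * durationSeconds);
        short[] samples = new short[count];

        for (int n = 0; n < count; n++)
        {
            // One full sine cycle every sampleRate/frequency samples.
            double t = (double)n / sampleRate;
            samples[n] = (short)(amplitude * Math.Sin(2 * Math.PI * frequency * t));
        }

        // Prepend a 44-byte RIFF/WAVE header and these bytes become a playable .wav file.
        Console.WriteLine($"Generated {count} samples of a {frequency} Hz sine wave.");
    }
}
```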
Some physics and math behind sound; I hope this also helps.
Sound, as we human beings perceive it, is simply a vibration of energy in air.
The higher the frequency of the vibration, the higher the pitch we perceive.
When a musical key (say, middle C on a piano) is hit, it creates a vibration in the air that moves up and down about 260 times a second. The higher the amplitude of this vibration, the louder we perceive it.
The difference between the same pitch played on a piano and on a violin is that, besides the main vibration, there are many smaller harmonic waves embedded in it. The different combinations of harmonic waves make the different sounds of an identical pitch (from a piano, a violin, a trumpet...).
Back to the main story: as other people described, the .wav format only stores the amplitude of the signal at each sample, from which you can first plot an amplitude-versus-time graph. This is also known as the time domain of a waveform. You can then use a mathematical tool called the Fourier transform to convert the time domain into the frequency domain, which tells you which frequencies are present in the waveform and with what relative strength.
For example, if you want to synthesize the sound of a violin playing C, first you need to analyse the distribution of its harmonic waves, then construct the frequency domain consisting of the main 260 Hz plus a series of sub-frequencies from its harmonics. Use the inverse Fourier transform to convert the frequency domain back to the time domain, and store the data in .wav format.
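A quick way to hear what the harmonic description above means, without a full inverse Fourier transform, is additive synthesis: sum a fundamental and a few harmonics directly in the time domain. The harmonic amplitudes below are made up purely for illustration; real instrument spectra look different.

```csharp
using System;

class AdditiveSynth
{
    static void Main()
    {
        const int sampleRate = 44100;
        const double fundamental = 261.63;                  // approximately middle C
        double[] harmonicGains = { 1.0, 0.5, 0.25, 0.125 }; // illustrative only, not a real violin

        int count = sampleRate;                             // one second of audio
        float[] buffer = new float[count];

        for (int n = 0; n < count; n++)
        {
            double t = (double)n / sampleRate;
            double sample = 0;
            for (int h = 0; h < harmonicGains.Length; h++)
            {
                // Harmonic h+1 sits at (h+1) times the fundamental frequency.
                sample += harmonicGains[h] * Math.Sin(2 * Math.PI * fundamental * (h + 1) * t);
            }
            buffer[n] = (float)(sample / harmonicGains.Length); // crude normalization
        }

        Console.WriteLine("Synthesized one second of a harmonic-rich tone.");
    }
}
```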

Audio pitch analysis

I'm very new to sound analysis; in fact, I'm doing it for the first time. All I need to do is analyse an mp3 file (or any other format) and detect where the pitch varies; simply put, I want to trim the audio file where high notes occur.
(image: sound wave)
I've tried NAudio and a few articles, but to no avail, so could someone point me in the right direction to a tutorial and tell me which API to use?
The first thing you must know is how pitch is related to the waveform.
Note: a simple single-channel waveform is represented by a simple byte array. It consists of a RIFF header with some necessary parameters followed by the wave data itself. More complicated waveforms (multichannel, higher bit depths) are represented in other ways (int instead of byte arrays, interleaved channels and so on).
So: in order to manipulate the audio pitch, you have to learn how the waveform is made and then write (or google) an algorithm that operates on the waveform's pitch based on that relationship.
If you are very new to audio programming, there is a great beginner tutorial: generating constant waveforms with C#.
You could use an FFT to get the spectrum of the recording. By checking the spectrum for specific frequencies, you can decide which parts of the audio contain high pitches (see the sketch after the links below).
Some theory:
http://en.wikipedia.org/wiki/Fourier_transform
http://en.wikipedia.org/wiki/Spectrogram
Some resources:
How to perform the FFT to a wave-file using NAudio
https://naudio.codeplex.com/discussions/257586
http://naudio.codeplex.com/discussions/242989
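As a rough starting point for the "check the spectrum for high pitches" idea, here is a self-contained C# sketch that estimates the dominant frequency of a block of mono samples with a naive DFT and flags blocks whose peak lies above a chosen cutoff. The cutoff and the helper names are my own assumptions; for real workloads use an FFT library (for example the one discussed in the NAudio links above), which is far faster.

```csharp
using System;

static class PitchScan
{
    // Returns the frequency (Hz) of the strongest bin in this block of mono samples.
    // Naive O(N^2) DFT: fine for a sketch, replace with an FFT for real workloads.
    public static double DominantFrequency(float[] block, int sampleRate)
    {
        int n = block.Length;
        int bestBin = 0;
        double bestMag = 0;
        for (int k = 1; k < n / 2; k++)                // skip DC, stop at Nyquist
        {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++)
            {
                double angle = 2 * Math.PI * k * t / n;
                re += block[t] * Math.Cos(angle);
                im -= block[t] * Math.Sin(angle);
            }
            double mag = re * re + im * im;
            if (mag > bestMag) { bestMag = mag; bestBin = k; }
        }
        return (double)bestBin * sampleRate / n;
    }

    // Example policy: treat anything above 1 kHz as a "high note" (arbitrary cutoff to tune).
    public static bool IsHighNote(float[] block, int sampleRate, double cutoffHz = 1000.0)
        => DominantFrequency(block, sampleRate) > cutoffHz;
}
```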
Try a C# binding to Aubio
aubio is a tool designed for the extraction of annotations from audio signals. Its features include segmenting a sound file before each of its attacks, performing pitch detection, tapping the beat and producing midi streams from live audio.
http://aubio.org/

C# and Audio Generation/Playback

I’m making an audio synthesizer and I’m having issues figuring out what to use for audio playback. I’m using physics and math to calculate the source waveforms and then need to feed that waveform to something which can play it as sound. I need something that can 1) play the waveforms I calculate and 2) play multiple sounds simultaneously (like holding one key down on a piano while pressing other keys). I’ve done a fair bit of research into this and I can’t find something that does both of those things. As far as I know, I have 5 potential options:
DirectSound. It can take a waveform (a short[]) as a parameter and play it as sound, and can play multiple sounds simultaneously. But it won’t work with .NET 4.5.
System.Media.SoundPlayer. It works with .NET 4.5 and has better audio quality than DirectSound, but it has to play sound from a .wav file and cannot play multiple sounds at once (nor can multiple instances of SoundPlayer). I 'trick' SoundPlayer into working by translating my waveform into .wav format in memory and then sending SoundPlayer a MemoryStream of the in-memory .wav file. Could I potentially achieve control over the playback by altering the stream? I cannot append bytes to the stream (I tried), but I could potentially make the stream an arbitrary size and just rewrite all the bytes in the stream with the next segment of audio data every time the end of the stream is reached.
System.Windows.Controls.MediaElement. I have not experimented with this yet, but from MSDN's documentation I don't see a way to send it a waveform in memory without saving it to disk first and then reading it; I don't think I can send it a stream.
System.Windows.Controls.MediaPlayer. I have not experimented with this either, but the documentation says it’s meant to be used as a companion to some kind of animation. I could potentially use this without doing any real (user-perceivable) animation to achieve my desired effect.
An open source solution. I’m hesitant to use an open source solution as I find they are typically poorly documented and not very maintainable, but I am open to ideas if there is one out there that is well documented and can do what I need.
Can anyone offer me any guidance on this or how to create flexible audio playback?
http://naudio.codeplex.com , without a doubt. Mark is a regular here on SO, the project is very much alive, and there are good code examples.
It works. We built some great stuff with it.
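For the two requirements in the question (playing calculated waveforms and playing several at once), a common NAudio pattern is to feed samples through an ISampleProvider and combine them with MixingSampleProvider. The sketch below uses NAudio types as I understand them (WaveOutEvent, MixingSampleProvider, SignalGenerator, SampleToWaveProvider, the Take extension); treat the exact names and overloads as assumptions to verify against the NAudio version you use.

```csharp
using System;
using System.Threading;
using NAudio.Wave;
using NAudio.Wave.SampleProviders;

class MixedPlayback
{
    static void Main()
    {
        var format = WaveFormat.CreateIeeeFloatWaveFormat(44100, 1);

        // The mixer keeps running even when inputs finish, so notes can be added at any time.
        var mixer = new MixingSampleProvider(format) { ReadFully = true };

        using var output = new WaveOutEvent();
        output.Init(new SampleToWaveProvider(mixer));   // adapt float samples to an IWaveProvider
        output.Play();

        // Two "keys held at the same time": SignalGenerator stands in for your own
        // physics-based ISampleProvider, which would fill float buffers the same way.
        mixer.AddMixerInput(new SignalGenerator(44100, 1)
            { Type = SignalGeneratorType.Sin, Frequency = 261.63, Gain = 0.2 }
            .Take(TimeSpan.FromSeconds(2)));
        mixer.AddMixerInput(new SignalGenerator(44100, 1)
            { Type = SignalGeneratorType.Sin, Frequency = 329.63, Gain = 0.2 }
            .Take(TimeSpan.FromSeconds(2)));

        Thread.Sleep(2500);   // let the notes finish before the program exits
    }
}
```

The key design point is that the mixer, not the output device, is what allows simultaneous sounds: each new note is just another input added to the running mixer.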

Find audio sample in audio file (spectrogram already exists)

I am trying to achieve the following:
Using Skype, call my mailbox (works)
Enter password and tell the mailbox that I want to record a new welcome message (works)
Now, my mailbox tells me to record the new welcome message after the beep
I want to wait for the beep and then play the new message (doesn't work)
How I tried to achieve the last point:
Create a spectrogram using FFT and sliding windows (works)
Create a "fingerprint" for the beep
Search for that fingerprint in the audio that comes from Skype
The problem I am facing is the following:
The results of the FFTs on the audio from Skype and on the reference beep are not the same in a digital sense, i.e. they are similar but not identical, even though the beep was extracted from a recording of the Skype audio. The following picture shows the spectrogram of the beep from the Skype audio on the left side and the spectrogram of the reference beep on the right side. As you can see, they are very similar, but not the same...
(picture: http://img27.imageshack.us/img27/6717/spectrogram.png)
I don't know how to continue from here. Should I average it, i.e. divide it into columns and rows and compare the averages of those cells, as described here? I am not sure this is the best way, because the author already states that it doesn't work very well with short audio samples, and the beep is less than a second long...
Any hints on how to proceed?
You should determine the peak frequency and its duration (possibly requiring a minimum power over that duration for the frequency, with RMS being the simplest measure).
This should be easy enough to measure. To make things even cleverer (but probably completely unnecessary for this simple matching task), you could assert the non-existence of other peaks during the window of the beep.
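Since the beep sits at a single known frequency, a lightweight way to implement "peak frequency plus duration" is the Goertzel algorithm, which measures the power at one target frequency per block; a beep is detected when that power stays above a threshold for enough consecutive blocks. The beep frequency, block size and threshold below are placeholders to tune against your own recording, and nextBlock is a stand-in for however you pull audio blocks from the Skype capture.

```csharp
using System;

static class BeepDetector
{
    // Power of a single target frequency in one block of mono samples (Goertzel algorithm).
    public static double GoertzelPower(float[] block, double targetHz, int sampleRate)
    {
        double w = 2.0 * Math.PI * targetHz / sampleRate;
        double coeff = 2.0 * Math.Cos(w);
        double s1 = 0, s2 = 0;
        foreach (float x in block)
        {
            double s0 = x + coeff * s1 - s2;
            s2 = s1;
            s1 = s0;
        }
        return s1 * s1 + s2 * s2 - coeff * s1 * s2;
    }

    // True once the target frequency has been loud enough for `requiredBlocks` blocks in a row.
    public static bool DetectBeep(Func<float[]> nextBlock, double beepHz, int sampleRate,
                                  double powerThreshold, int requiredBlocks)
    {
        int hits = 0;
        float[] block;
        while ((block = nextBlock()) != null)        // nextBlock() yields audio from the call
        {
            hits = GoertzelPower(block, beepHz, sampleRate) > powerThreshold ? hits + 1 : 0;
            if (hits >= requiredBlocks) return true;
        }
        return false;
    }
}
```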
Update
To compare a complete audio fragment, you'll want to use a convolution algorithm (a minimal direct-form sketch follows the package list below). I suggest using a ready-made library implementation instead of rolling your own.
The most common fast convolution algorithms use fast Fourier transform (FFT) algorithms via the circular convolution theorem. Specifically, the circular convolution of two finite-length sequences is found by taking an FFT of each sequence, multiplying pointwise, and then performing an inverse FFT. Convolutions of the type defined above are then efficiently implemented using that technique in conjunction with zero-extension and/or discarding portions of the output. Other fast convolution algorithms, such as the Schönhage–Strassen algorithm, use fast Fourier transforms in other rings.
Wikipedia lists http://freeverb3.sourceforge.net as an open source candidate
Edit Added link to API tutorial page: http://freeverb3.sourceforge.net/tutorial_lib.shtml
Additional resources:
http://en.wikipedia.org/wiki/Finite_impulse_response
http://dspguru.com/dsp/faqs/fir
Existing packages with relevant tools on debian:
brutefir - a software convolution engine
jconvolver - Convolution reverb Engine for JACK
libzita-convolver2 - C++ library implementing a real-time convolution matrix
teem-apps - Tools to process and visualize scientific data and images - command line tools
teem-doc - Tools to process and visualize scientific data and images - documentation
libteem1 - Tools to process and visualize scientific data and images - runtime
yorick-yeti - utility plugin for the Yorick language
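For reference, here is the direct (time-domain) form of the comparison the convolution approach performs: slide the reference beep over the recording and look for an offset where the normalized cross-correlation is high. It is O(N·M), which is fine for a sub-second beep; the FFT-based method quoted above does the same job much faster for long signals. The threshold value is an assumption to tune.

```csharp
using System;

static class Correlator
{
    // Returns the offset in `signal` where the reference `pattern` matches best,
    // using normalized cross-correlation, or -1 if no offset exceeds `threshold`.
    public static int FindBestMatch(float[] signal, float[] pattern, double threshold = 0.6)
    {
        double patternNorm = Math.Sqrt(Dot(pattern, 0, pattern, 0, pattern.Length));
        if (patternNorm == 0) return -1;

        int bestOffset = -1;
        double bestScore = threshold;

        for (int offset = 0; offset + pattern.Length <= signal.Length; offset++)
        {
            double windowNorm = Math.Sqrt(Dot(signal, offset, signal, offset, pattern.Length));
            if (windowNorm == 0) continue;

            double score = Dot(signal, offset, pattern, 0, pattern.Length) / (windowNorm * patternNorm);
            if (score > bestScore) { bestScore = score; bestOffset = offset; }
        }
        return bestOffset;
    }

    // Dot product of a[offsetA..] and b[offsetB..] over `length` samples.
    static double Dot(float[] a, int offsetA, float[] b, int offsetB, int length)
    {
        double sum = 0;
        for (int i = 0; i < length; i++) sum += a[offsetA + i] * b[offsetB + i];
        return sum;
    }
}
```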
First I'd smooth it a bit in the frequency direction, so that small variations in frequency become less relevant. Then simply take each frequency bin and subtract the two amplitudes. Square the differences and add them up. Perhaps normalize the signals first so that differences in total amplitude don't matter. Then compare the difference to a threshold.
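A minimal sketch of that comparison, assuming both inputs are magnitude arrays of the same length over the same frequency bins (for example, one column from each spectrogram); the smoothing width and any threshold you apply afterwards are arbitrary choices.

```csharp
using System;
using System.Linq;

static class SpectrumCompare
{
    // Squared distance between two normalized, lightly smoothed magnitude spectra.
    // Small values mean "these two spectrogram columns look alike".
    public static double Distance(double[] a, double[] b)
    {
        double[] sa = Normalize(Smooth(a));
        double[] sb = Normalize(Smooth(b));
        double sum = 0;
        for (int i = 0; i < sa.Length; i++)
        {
            double d = sa[i] - sb[i];
            sum += d * d;
        }
        return sum;
    }

    // Simple 3-bin moving average across frequency bins (the "smooth in frequency-direction" step).
    static double[] Smooth(double[] x)
    {
        var y = new double[x.Length];
        for (int i = 0; i < x.Length; i++)
        {
            double lo = i > 0 ? x[i - 1] : x[i];
            double hi = i < x.Length - 1 ? x[i + 1] : x[i];
            y[i] = (lo + x[i] + hi) / 3.0;
        }
        return y;
    }

    // Scale to unit energy so overall loudness differences don't matter.
    static double[] Normalize(double[] x)
    {
        double total = Math.Sqrt(x.Sum(v => v * v));
        return total == 0 ? x : x.Select(v => v / total).ToArray();
    }
}

// Usage idea: flag a beep when Distance(skypeColumn, referenceColumn) stays below
// a threshold you pick empirically over several consecutive columns.
```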

Real time audio playback from mic. c#

I am looking to create an application that will allow me to record from my mic and play the recording back through other PCs. At this point, however, I would just like it to play back on my own computer so I can get it working.
I have been looking at NAudio for the past few hours and it seems like it may be able to help me achieve this goal.
I am just wondering if anyone else has had any experience with this and if it is at all possible?
Thanks,
Stuart
There is an example project on codeproject doing this:
http://www.codeproject.com/KB/cs/Streaming_wave_audio.aspx
I don't know how low the latency is.
As a codec I'd recommend Speex (at least for speech). It's free, open source, and offers low latency and low bandwidth.
Bass Audio Library is another solid option worth looking into.
It is possible to do, but you are unlikely to get low latency with WaveIn/WaveOut (possibly better results with WASAPI). You could use the BufferedWaveProvider (in the latest source code) to store up the audio being recorded from the microphone and supply the output to the soundcard.
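A minimal sketch of that loop, using NAudio's WaveInEvent, BufferedWaveProvider and WaveOutEvent as I understand them; check the exact type names and properties against the NAudio version you use, and expect noticeable latency with this simple WaveIn/WaveOut route.

```csharp
using System;
using NAudio.Wave;

class MicLoopback
{
    static void Main()
    {
        var format = new WaveFormat(44100, 16, 1);           // 44.1 kHz, 16-bit, mono

        using var waveIn = new WaveInEvent { WaveFormat = format };
        var buffer = new BufferedWaveProvider(format)
        {
            DiscardOnBufferOverflow = true                    // drop audio rather than throwing if we fall behind
        };

        // Every captured block of microphone audio goes straight into the buffer...
        waveIn.DataAvailable += (s, e) => buffer.AddSamples(e.Buffer, 0, e.BytesRecorded);

        // ...and the output device pulls from that same buffer as it plays.
        using var waveOut = new WaveOutEvent { DesiredLatency = 200 };
        waveOut.Init(buffer);

        waveIn.StartRecording();
        waveOut.Play();

        Console.WriteLine("Looping microphone to speakers. Press Enter to stop.");
        Console.ReadLine();

        waveIn.StopRecording();
        waveOut.Stop();
    }
}
```

Sending the same buffered bytes over a socket instead of (or as well as) the local output is the natural next step, which is where the codec suggestions below come in.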
NAudio is great as a starting point for audio capture and playback, but as Mark pointed out, the latency might be an issue.
If you take the next step and want to send the audio data across the network, you will need a codec to compress the data, since PCM/WAV is uncompressed and for voice you only need a small fraction of the bandwidth that WAV requires.
As you are working with C#, there is a C# port of Speex available, called NSpeex, which might be worth having a look at.
