I want to make a simple .wav player in C# for learning purposes. I want to get more insight into how audio is stored and played on the computer, so I want to play a .wav manually rather than with a simple call to a built-in function.
I've looked at the structure of .wav files and found some great resources. What I've found is that the .wav format stores the sound data starting from the 44th byte. The preceding bytes contain information about channels and sample rates, but that is not relevant to my question.
I found that this data is a sound wave. As far as I know, the height of a sample of a sound wave represents its frequency. But I don't get where the timbre comes from. If I only played sounds at the correct frequency for the correct amount of time, I would get beeps. I could do that simply with System.Console.Beep(freq, duration);, but you could hardly call that music.
I have tried looking through multiple resources, but they only described the metadata and didn't cover what exactly is in the sound byte stream. I found a similar question and answer on this site, but it doesn't really answer that question; I believe that is why it is not even marked as accepted.
What exactly is the data in the wave byte stream, and how can you turn it into sound that is actually played on the computer?
You are mistaken: the height of a sample does not represent a frequency. As a matter of fact, the wav format doesn't use frequencies at all. wav basically works in the following way:
An analog signal is sampled at a specific frequency. A common frequency for wav is 44,100 Hz, so 44,100 samples will be created each second.
Each sample contains the height that the analog signal has at the sample time. A common wav format is the 16-bit format, where 16 bits are used to store the height of the signal.
This all occurs separately for each channel.
I'm not sure in which order the data is stored, but maybe some of the great resources you found will help you with that.
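For the common 16-bit PCM case, here is a rough sketch of how the raw bytes map to sample heights (my own illustration, assuming the simplest layout: a canonical 44-byte header followed by little-endian signed 16-bit samples; "test.wav" is just a placeholder path):

```csharp
using System;
using System.IO;

class ReadWavSamples
{
    static void Main()
    {
        // Assumes a canonical 44-byte header and 16-bit PCM data.
        // Real files can contain extra chunks, so a robust reader should walk
        // the chunk IDs instead of jumping straight to byte 44.
        byte[] bytes = File.ReadAllBytes("test.wav");   // placeholder path
        short channels = BitConverter.ToInt16(bytes, 22);
        int sampleRate = BitConverter.ToInt32(bytes, 24);

        double peak = 0;
        for (int i = 44; i + 1 < bytes.Length; i += 2)
        {
            // Each sample is a signed 16-bit "height"; stereo samples are typically
            // interleaved per frame: [left, right], [left, right], ...
            short sample = BitConverter.ToInt16(bytes, i);
            peak = Math.Max(peak, Math.Abs(sample / 32768.0));
        }
        Console.WriteLine($"{channels} channel(s) at {sampleRate} Hz, peak amplitude {peak:F3}");
    }
}
```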
Adding to the above answer: the height of the sample is the volume when played back. It represents how far backward or forward the speaker cone is pulled or pushed to re-create the vibration.
The timbre you refer to is determined by the mix of frequencies present in the audio wave.
There is a lot going on in audio. Even a simple drumbeat produces sound at several frequencies, including harmonics, i.e. repeated vibrations at different frequencies. All of this is off topic for a programming site, so you will need to research sound and frequencies and perhaps DSP.
What you need to know from a computer's perspective is that sound is stored as samples taken at a set rate; as long as we sample at at least twice the highest frequency we wish to capture (the Nyquist rate), we will be able to reproduce the original. The samples record the current level (volume) of the audio at that moment in time; turning the samples back into audio is the job of the digital-to-analogue converter found on your sound card.
The operating system looks after passing the samples to the hardware via the appropriate driver. On Windows, WASAPI and ASIO are two APIs you can use to pass the audio to the sound card. Look at open source projects like NAudio to see the code required to call these operating system APIs.
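For instance, a minimal playback loop with NAudio looks roughly like this (my own sketch, not the author's code; it assumes the NAudio package is installed and uses a placeholder file name):

```csharp
using System.Threading;
using NAudio.Wave;

class PlayWav
{
    static void Main()
    {
        // AudioFileReader decodes the file; WaveOutEvent hands the samples to the OS audio API.
        using var reader = new AudioFileReader("test.wav");   // placeholder path
        using var output = new WaveOutEvent();
        output.Init(reader);
        output.Play();
        while (output.PlaybackState == PlaybackState.Playing)
            Thread.Sleep(200);   // wait until playback finishes
    }
}
```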
I hope this helps. I suspect the topic is broader than you first imagined.
For anyone who wants more clarification: the values in the data section of the .wav file represent the amplitude (volume) of each sample. Samples are played at a certain rate defined near the start of the .wav file. A common rate is 44100 hertz, which is used for CD-quality audio. 44100 hertz means that a sample is played every 1/44100 of a second.
If the sample peaks rise steadily, the sound produced gets louder as the file plays. Regular musical sounds can be generated using cyclical changes in amplitude, such as sine and saw waves.
Complex sounds such as drums, which produce multiple frequencies, are created by adding the individual sine and cosine waves into a single wave, an idea described by Fourier series. As an earlier answer said, this process is repeated for each channel; multiple channels are normally used to create stereo or surround sound.
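To tie this back to the original question, here is a rough sketch that writes a two-second 440 Hz sine wave as a playable 16-bit mono .wav, header and all (the file name and tone parameters are arbitrary choices of mine):

```csharp
using System;
using System.IO;

class SineWav
{
    static void Main()
    {
        int sampleRate = 44100, seconds = 2;
        short channels = 1, bitsPerSample = 16;
        int dataSize = sampleRate * seconds * channels * bitsPerSample / 8;

        using var w = new BinaryWriter(File.Create("tone.wav"));
        // RIFF header
        w.Write("RIFF".ToCharArray()); w.Write(36 + dataSize); w.Write("WAVE".ToCharArray());
        // fmt chunk (PCM)
        w.Write("fmt ".ToCharArray()); w.Write(16); w.Write((short)1); w.Write(channels);
        w.Write(sampleRate); w.Write(sampleRate * channels * bitsPerSample / 8);
        w.Write((short)(channels * bitsPerSample / 8)); w.Write(bitsPerSample);
        // data chunk: a 440 Hz sine wave at half full-scale amplitude
        w.Write("data".ToCharArray()); w.Write(dataSize);
        for (int n = 0; n < sampleRate * seconds; n++)
        {
            double t = n / (double)sampleRate;
            short sample = (short)(0.5 * short.MaxValue * Math.Sin(2 * Math.PI * 440 * t));
            w.Write(sample);
        }
    }
}
```

Lowering the 0.5 factor makes the tone quieter; changing 440 changes the pitch, which is exactly the amplitude-versus-frequency distinction the answers above describe.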
Some physics and math behind how sound works; I hope this also helps.
Sound, as we human beings perceive it, is simply a vibration of energy in the air.
The higher the frequency of the vibration, the higher the pitch we perceive.
When a musical key (say, middle C on a piano) is hit, it creates a vibration in the air that moves up and down about 260 times a second. The higher the amplitude of this vibration, the louder we perceive it.
The difference between the same pitch played on a piano and on a violin is that, besides the main vibration wave, there are many small harmonic waves embedded in it. The different combinations of harmonic waves produce the different sounds of an identical pitch (from a piano, a violin, a trumpet...).
Back to the main story: as other people described, the .wav format only stores the amplitude of the energy level in each sample, from which you can first plot an amplitude-versus-time graph. This is also known as the time-domain representation of the waveform. You can then use a mathematical tool called the Fourier transform to convert the time domain into the frequency domain, which tells you which frequencies are present in the waveform and how strong each of them is.
For example, if you want to synthesize the sound of a violin playing C, first you need to analyse the distribution of its harmonic waves, then construct a frequency domain consisting of the main 260 Hz plus a series of sub-frequencies from its harmonics. Use the inverse Fourier transform to convert the frequency domain back to the time domain, and store the data in the .wav format.
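As a rough illustration of that last step (the harmonic amplitudes below are made up, not measured from a real violin), additive synthesis in C# is just summing sine waves into a sample buffer:

```csharp
using System;

class AdditiveSynth
{
    static void Main()
    {
        int sampleRate = 44100;
        double fundamental = 260.0;                       // roughly middle C
        double[] harmonicAmps = { 1.0, 0.5, 0.25, 0.12 }; // made-up relative strengths
        float[] buffer = new float[sampleRate];           // one second of audio

        for (int n = 0; n < buffer.Length; n++)
        {
            double t = n / (double)sampleRate, value = 0;
            for (int h = 0; h < harmonicAmps.Length; h++)
                value += harmonicAmps[h] * Math.Sin(2 * Math.PI * fundamental * (h + 1) * t);
            buffer[n] = (float)(value / harmonicAmps.Length); // keep within -1..1
        }
        // 'buffer' now holds the time-domain waveform; scale it to 16-bit integers and
        // write it after a RIFF/WAVE header (as in the earlier sketch) to get a .wav file.
        Console.WriteLine($"Synthesized {buffer.Length} samples.");
    }
}
```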
Related
I am trying to play multiple sound files at the same time using NAudio MixingSampleProvider's AddMixerInput method. I am using the AudioPlaybackEngine found at this page: https://gist.github.com/markheath/8783999
However, when I try to play different files, I get an exception while playing some sounds: "All mixer inputs must have the same WaveFormat". This is probably because the sound files do not all have the same sample rate (I see some files have 7046 as the sample rate instead of 44100).
Is there a way to modify the CachedSound class so that the sample rates are all converted to 44100 when reading the files?
Thanks!
You can't mix audio streams together without converting them to the same sample rate. In NAudio there are three ways of accessing a resampler:
WaveFormatConversionStream - this uses the ACM resampler component under the hood. Note that you may have issues resampling IEEE float with this one.
MediaFoundationResampler - this uses MediaFoundation. It allows you to adjust the resampling quality.
WdlResamplingSampleProvider - this is brand new in the latest NAudio, and offers fully managed resampling.
As well as matching sample rates, you also need matching channel counts. NAudio has various ways to turn mono into stereo and vice versa.
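A rough sketch of how the CachedSound class from that gist could be adapted (my own adaptation, not the gist's code; it leans on WdlResamplingSampleProvider and MonoToStereoSampleProvider from NAudio's sample providers and assumes a 44100 Hz stereo mixer):

```csharp
using System.Collections.Generic;
using System.Linq;
using NAudio.Wave;
using NAudio.Wave.SampleProviders;

class CachedSound
{
    public float[] AudioData { get; }
    public WaveFormat WaveFormat { get; }

    public CachedSound(string audioFileName, int mixerSampleRate = 44100)
    {
        using (var reader = new AudioFileReader(audioFileName))
        {
            ISampleProvider source = reader;

            // Bring every file to the mixer's sample rate...
            if (source.WaveFormat.SampleRate != mixerSampleRate)
                source = new WdlResamplingSampleProvider(source, mixerSampleRate);

            // ...and to the same channel count (mixer inputs must match on both).
            if (source.WaveFormat.Channels == 1)
                source = new MonoToStereoSampleProvider(source);

            WaveFormat = source.WaveFormat;

            var wholeFile = new List<float>();
            var buffer = new float[source.WaveFormat.SampleRate * source.WaveFormat.Channels];
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
                wholeFile.AddRange(buffer.Take(read));
            AudioData = wholeFile.ToArray();
        }
    }
}
```

With every CachedSound converted up front like this, the MixingSampleProvider inputs all share the same WaveFormat and the exception should go away.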
I'm very new to sound analysis; in fact, I'm doing it for the first time. All I need to do is analyse an mp3 file (or any other format) and detect how the pitch varies. Simply put, I want to trim the audio file where high notes occur.
(image: sound wave)
I've tried NAudio and a few articles, but to no avail, so I would appreciate it if someone could point me in the right direction towards a tutorial and a suitable API.
The first thing you must know is how pitch is related to the waveform.
Notice: a simple single-channel waveform is represented by a simple byte array. It consists of a RIFF header with some necessary parameters and the wave sequence itself. More complicated waveforms (multichannel, higher bit depth) are represented differently (ints instead of byte arrays, interleaved channels, and so on).
So: in order to manipulate audio pitch, you have to learn how the waveform is laid out (the second point above) and then write (or find) an algorithm that operates on the waveform's pitch in the way described in the first point.
If you are very new to audio programming, there is a great beginner tutorial: generating constant waveforms with C#.
You could use an FFT to get the spectrum of the recording. By checking the spectrum for specific frequencies, you could decide which parts of the audio contain high pitches; a rough code sketch follows the resource links below.
Some theory:
http://en.wikipedia.org/wiki/Fourier_transform
http://en.wikipedia.org/wiki/Spectrogram
Some resources:
How to perform the FFT to a wave-file using NAudio
https://naudio.codeplex.com/discussions/257586
http://naudio.codeplex.com/discussions/242989
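For example, something along these lines (a sketch only: it assumes the NAudio package, uses the FFT helper in NAudio.Dsp, treats interleaved stereo samples as one stream, and only analyses the first block of the file; a real pitch tracker would slide the window across the whole stream):

```csharp
using System;
using NAudio.Wave;
using NAudio.Dsp; // FastFourierTransform and Complex live here

class SpectrumSketch
{
    static void Main()
    {
        using var reader = new AudioFileReader("clip.mp3"); // placeholder file name
        const int fftLength = 4096;                         // must be a power of two
        var samples = new float[fftLength];
        int read = reader.Read(samples, 0, fftLength);      // first block only

        var fft = new Complex[fftLength];
        for (int i = 0; i < read; i++)
        {
            // Hann window reduces spectral leakage before the FFT
            double w = 0.5 * (1 - Math.Cos(2 * Math.PI * i / (fftLength - 1)));
            fft[i].X = (float)(samples[i] * w);
        }
        FastFourierTransform.FFT(true, (int)Math.Round(Math.Log(fftLength, 2)), fft);

        // Find the strongest bin and convert it to a frequency in Hz
        int peakBin = 0;
        double peakMag = 0;
        for (int i = 1; i < fftLength / 2; i++)
        {
            double mag = Math.Sqrt(fft[i].X * fft[i].X + fft[i].Y * fft[i].Y);
            if (mag > peakMag) { peakMag = mag; peakBin = i; }
        }
        double frequency = peakBin * reader.WaveFormat.SampleRate / (double)fftLength;
        Console.WriteLine($"Dominant frequency in this block: {frequency:F1} Hz");
    }
}
```

Running this over consecutive blocks gives you a dominant frequency per block, which is enough to mark the regions where high notes occur and trim around them.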
Try a C# binding to Aubio
aubio is a tool designed for the extraction of annotations from audio signals. Its features include segmenting a sound file before each of its attacks, performing pitch detection, tapping the beat and producing midi streams from live audio.
http://aubio.org/
I am trying to achieve the following:
Using Skype, call my mailbox (works)
Enter password and tell the mailbox that I want to record a new welcome message (works)
Now, my mailbox tells me to record the new welcome message after the beep
I want to wait for the beep and then play the new message (doesn't work)
How I tried to achieve the last point:
Create a spectrogram using FFT and sliding windows (works)
Create a "finger print" for the beep
Search for that fingerprint in the audio that comes from Skype
The problem I am facing is the following:
The results of the FFTs on the audio from Skype and on the reference beep are not identical in a digital sense, i.e. they are similar but not the same, even though the beep was extracted from a recording of the Skype audio. The following picture shows the spectrogram of the beep from the Skype audio on the left side and the spectrogram of the reference beep on the right side. As you can see, they are very similar, but not the same...
I uploaded a picture: http://img27.imageshack.us/img27/6717/spectrogram.png
I don't know how to continue from here. Should I average it, i.e. divide it into columns and rows and compare the averages of those cells, as described here? I am not sure this is the best way, because the author already states that it doesn't work very well with short audio samples, and the beep is less than a second in length...
Any hints on how to proceed?
You should determine the peak frequency and its duration (and possibly require a minimum power for that frequency over the duration, RMS being the simplest measure).
This should be easy enough to measure. To make things even more clever (but probably completely unnecessary for this simple matching task), you could assert the non-existence of other peaks during the window of the beep.
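As a rough sketch of the power part of that check (the window size and threshold are arbitrary, and the samples are assumed to already be mono floats in the -1..1 range), you would scan the stream window by window and flag the first one whose RMS exceeds the threshold:

```csharp
using System;

class RmsDetector
{
    // Returns the start index of the first window whose RMS power exceeds the
    // threshold, or -1 if none does. Combine this with a peak-frequency check
    // (from an FFT of the same window) to confirm it is actually the beep.
    static int FindLoudWindow(float[] samples, int windowSize, double threshold)
    {
        for (int start = 0; start + windowSize <= samples.Length; start += windowSize)
        {
            double sumSquares = 0;
            for (int i = 0; i < windowSize; i++)
                sumSquares += samples[start + i] * samples[start + i];
            double rms = Math.Sqrt(sumSquares / windowSize);
            if (rms > threshold)
                return start;
        }
        return -1;
    }
}
```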
Update
To compare a complete audio fragment, you'll want to use a convolution algorithm. I suggest using a ready-made library implementation instead of rolling your own.
The most common fast convolution algorithms use fast Fourier transform (FFT) algorithms via the circular convolution theorem. Specifically, the circular convolution of two finite-length sequences is found by taking an FFT of each sequence, multiplying pointwise, and then performing an inverse FFT. Convolutions of the type defined above are then efficiently implemented using that technique in conjunction with zero-extension and/or discarding portions of the output. Other fast convolution algorithms, such as the Schönhage–Strassen algorithm, use fast Fourier transforms in other rings.
Wikipedia lists http://freeverb3.sourceforge.net as an open source candidate
Edit: Added a link to the API tutorial page: http://freeverb3.sourceforge.net/tutorial_lib.shtml
Additional resources:
http://en.wikipedia.org/wiki/Finite_impulse_response
http://dspguru.com/dsp/faqs/fir
Existing packages with relevant tools on Debian:
brutefir - a software convolution engine
jconvolver - Convolution reverb Engine for JACK
libzita-convolver2 - C++ library implementing a real-time convolution matrix
teem-apps - Tools to process and visualize scientific data and images - command line tools
teem-doc - Tools to process and visualize scientific data and images - documentation
libteem1 - Tools to process and visualize scientific data and images - runtime
yorick-yeti - utility plugin for the Yorick language
First I'd smooth it a bit in the frequency direction so that small variations in frequency become less relevant. Then simply take each frequency and subtract the two amplitudes. Square the differences and add them up. Perhaps normalize the signals first so that differences in total amplitude don't matter. Then compare the result to a threshold.
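A minimal sketch of that comparison (the smoothing radius is mine, and both inputs are assumed to be magnitude spectra of equal length):

```csharp
using System;
using System.Linq;

class SpectrumCompare
{
    // Moving-average smoothing along the frequency axis.
    static double[] Smooth(double[] spectrum, int radius)
    {
        var result = new double[spectrum.Length];
        for (int i = 0; i < spectrum.Length; i++)
        {
            int lo = Math.Max(0, i - radius), hi = Math.Min(spectrum.Length - 1, i + radius);
            double sum = 0;
            for (int j = lo; j <= hi; j++) sum += spectrum[j];
            result[i] = sum / (hi - lo + 1);
        }
        return result;
    }

    // Normalize so total amplitude doesn't matter.
    static double[] Normalize(double[] s)
    {
        double total = s.Sum();
        return total > 0 ? s.Select(v => v / total).ToArray() : s;
    }

    // Normalize, smooth, then sum the squared per-bin differences.
    static double Distance(double[] a, double[] b, int radius = 2)
    {
        double[] na = Smooth(Normalize(a), radius);
        double[] nb = Smooth(Normalize(b), radius);
        return na.Zip(nb, (x, y) => (x - y) * (x - y)).Sum();
    }
}
```

You would then compare Distance(skypeSpectrum, referenceSpectrum) against a threshold tuned on a few recordings you know contain the beep.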
Does anyone know how to convert an analog sound wave to a MIDI file?
I know the process differs between WAV and MP3, but that's not important for now. I only want to learn the basic logic of the conversion.
I realize that you want to write your own wave to MIDI converter.
However, I had no idea that several companies have software that performs wave to MIDI conversion.
For the benefit of those interested, here's a list of audio to MIDI programs.
I picked the WidiSoft web site to check out because it was at the top of the list, and came up high in a Google search. English is not their first language. However, you can download and try the software before you buy it.
This isn't a product review, but depending on your conversion needs, you should be able to find something that already exists.
A wave file stores the actual sound wave of some sound.
A midi can be thought of as notes of music played on predefined instruments (stored on the computer or on the soundcard).
Therefore, the sound generated by midi is a subset of sound that can be stored in a wave file. This means that you can not convert wave to midi (although you can do it the other way round).
If you know certain things about the waves you wish to convert, it might be possible. If, for example, you know that the wave contains only a piano, it might be possible to convert that into notes and from those to midi.
Use the ideas from Genetic Programming: Evolution of Mona Lisa project, but applied over sounds instead of images.
It seems to me that you should first break the audio into small clips (maybe 0.25 s each). Then you could use an FFT to get as many of the frequencies and associated amplitudes as you require for each clip (maybe 8?). Then you could round the frequencies to the nearest note. From there, I think the conversion to MIDI is pretty easy, at least at this basic level. So you need to write C# software that samples the audio clip (done: see NAudio) and then obtain (I don't know where) software that performs the FFT on the clip. Converting a frequency to a note is a table lookup. You can also find commercial software that does the conversion: https://www.celemony.com/en/melodyne/new-in-melodyne-5
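The "table lookup" can actually be a one-line formula; a rough sketch using the usual A4 = 440 Hz convention (MIDI note 69), where each semitone is a factor of 2^(1/12):

```csharp
using System;

class FreqToMidi
{
    // MIDI note 69 is A4 = 440 Hz; each semitone up multiplies the frequency by 2^(1/12).
    static int FrequencyToMidiNote(double frequencyHz) =>
        (int)Math.Round(69 + 12 * Math.Log(frequencyHz / 440.0, 2));

    static void Main()
    {
        Console.WriteLine(FrequencyToMidiNote(261.63)); // 60 = middle C
        Console.WriteLine(FrequencyToMidiNote(440.0));  // 69 = A4
    }
}
```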
I need some help with an algorithm. I'm using an artificial neural network to read an electrocardiogram and trying to recognize some disturbances in the waves. That part is OK: I have the neural network and I can test it without problems.
What I'd like to do is give the user a function to open an electrocardiogram (import a JPEG) and have the program find the waves and convert them into the arrays that will feed my ANN, but there's the problem. I wrote some code that reads the image and transforms it into a binary image, but I can't find a nice way for the program to locate the waves, since their exact position can vary from hospital to hospital. I need some suggestions on which approaches to use.
If you've got the wave values in a list, you can use a Fourier transform or FFT (fast Fourier transform) to determine the frequency content at any particular time value. Disturbances typically create additional high-frequency content (i.e., sharp, steep waves) that you should be able to use to spot irregularities.
You'd have to assume a certain minimal contrast between the "signal" (the waves) and the background of the image. An edge-finding algorithm might be useful in that case. You could isolate the wave from the background and plot the wave.
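A very naive sketch of that idea (my own illustration: it assumes the image is already binarized so that only the trace is dark on a light background, and it simply takes the topmost dark pixel in each column as the wave's height; gridlines or labels would need to be removed first):

```csharp
using System.Drawing; // classic GDI+ Bitmap; on modern .NET this needs the System.Drawing.Common package

class TraceExtractor
{
    // For each column, return the row of the first dark pixel (or -1 if the column is empty).
    // The resulting array is the wave's height per x position, ready to be normalized
    // and fed to the neural network.
    static int[] ExtractTrace(Bitmap binaryImage, int darkThreshold = 128)
    {
        var trace = new int[binaryImage.Width];
        for (int x = 0; x < binaryImage.Width; x++)
        {
            trace[x] = -1;
            for (int y = 0; y < binaryImage.Height; y++)
            {
                if (binaryImage.GetPixel(x, y).GetBrightness() * 255 < darkThreshold)
                {
                    trace[x] = y;
                    break;
                }
            }
        }
        return trace;
    }
}
```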
This post by Rick Barraza deals with vector fields in Silverlight. You might be able to adapt the concept to your particular problem.