How does speech recognition work?
- ricardocelma83
- Feb 18, 2019
- 4 min read
Updated: Feb 23, 2019
The first step in speech recognition is to feed sound waves into a computer. Analog waves can’t be fed directly into a computer; they have to be converted into numbers. As an example, let’s use this sound clip of somebody saying “Hello”:
Sound waves are one-dimensional. At every moment in time, they have a single value based on the height of the wave. Let’s zoom in on one tiny part of the sound wave and take a look:
To turn this sound wave into numbers, we just record the height of the wave at equally-spaced points.
This is called sampling. We are taking a reading thousands of times a second and recording a number representing the height of the sound wave at that point in time. That’s basically all an uncompressed .wav audio file is.
CD-quality audio is sampled at 44.1 kHz (44,100 readings per second). For speech recognition, a sampling rate of 16 kHz (16,000 readings per second) is enough to cover the frequency range of human speech.
Let’s sample our “Hello” sound wave 16,000 times per second. Here are the first 100 samples:
Each number represents the amplitude of the sound wave at 1/16000th of a second intervals.
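If you want to poke at this yourself, here’s a minimal sketch of reading those samples out of an uncompressed .wav file using only the Python standard library. The filename “hello.wav” is just a placeholder, and a 16 kHz, 16-bit, mono recording is assumed:

```python
import wave
import struct

# Open a hypothetical 16 kHz mono recording of somebody saying "Hello"
with wave.open("hello.wav", "rb") as wav_file:
    sample_rate = wav_file.getframerate()      # e.g. 16000 readings per second
    num_samples = wav_file.getnframes()
    raw_bytes = wav_file.readframes(num_samples)

# Each 16-bit value is the height (amplitude) of the wave at that instant
samples = struct.unpack("<" + "h" * num_samples, raw_bytes)

print(sample_rate)     # 16000
print(samples[:100])   # the first 100 amplitude readings
```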
Can digital samples perfectly recreate the original analog sound wave? What about those gaps?
Thanks to the Nyquist theorem, we know that we can use math to perfectly reconstruct the original sound wave from the spaced-out samples, as long as we sample at least twice as fast as the highest frequency we want to record.
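As a quick sanity check of that idea, here’s a small NumPy sketch: a 3 kHz tone (well within the range of human speech) sampled at 16 kHz is still perfectly identifiable, because 3 kHz is below the 8 kHz Nyquist limit of that sampling rate:

```python
import numpy as np

sample_rate = 16000                       # samples per second
t = np.arange(sample_rate) / sample_rate  # one second of time points
tone = np.sin(2 * np.pi * 3000 * t)       # a pure 3 kHz sine wave

# Look at which frequency dominates the sampled signal
spectrum = np.abs(np.fft.rfft(tone))
freqs = np.fft.rfftfreq(len(tone), d=1 / sample_rate)

print(freqs[np.argmax(spectrum)])         # 3000.0 -- the tone survives sampling
```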
Pre-processing our Sampled Sound Data
We now have an array of numbers with each number representing the sound wave’s amplitude at 1/16,000th of a second intervals.
We could feed these numbers right into a neural network. But trying to recognize speech patterns by processing these samples directly is difficult. Instead, we can make the problem easier by doing some pre-processing on the audio data.
Let’s start by grouping our sampled audio into 20-millisecond-long chunks. Here’s our first 20 milliseconds of audio (i.e., our first 320 samples):
Plotting those numbers as a simple line graph gives us a rough approximation of the original sound wave for that 20 millisecond period of time:
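In code, that chunking step might look something like this rough sketch, reusing the `samples` array from the earlier snippet:

```python
import numpy as np

sample_rate = 16000
chunk_size = sample_rate // 50            # 20 ms of audio = 320 samples

# Slice the recording into consecutive 20 ms chunks, dropping any leftover tail
samples = np.asarray(samples, dtype=np.float32)
num_chunks = len(samples) // chunk_size
chunks = samples[:num_chunks * chunk_size].reshape(num_chunks, chunk_size)

first_chunk = chunks[0]                   # our first 20 milliseconds (320 samples)
```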
What we see is a complex mish-mash of different frequencies of sound. There are some low sounds, some mid-range sounds, and even some high-pitched sounds sprinkled in. But taken all together, these different frequencies mix to make up the complex sound of human speech.
To make this data easier for a neural network to process, we are going to break apart this complex sound wave into its component parts. We’ll break out the low-pitched parts, the next-lowest-pitched parts, and so on. Then by adding up how much energy is in each of those frequency bands (from low to high), we create a “fingerprint” of sorts for this audio snippet.
We do this using a mathematical operation called a Fourier transform. It breaks apart the complex sound wave into the simple sound waves that make it up. Once we have those individual sound waves, we add up how much energy is contained in each one.
The end result is a score of how important each frequency range is, from low pitch (i.e. bass notes) to high pitch. Each number below represents how much energy was in each 50 Hz band of our 20 millisecond audio clip:
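Here’s a rough sketch of that step for a single chunk. Conveniently, a 320-sample FFT at 16 kHz produces frequency bins that are spaced exactly 50 Hz apart, so each bin lines up with one of the 50 Hz bands described above:

```python
import numpy as np

spectrum = np.fft.rfft(first_chunk)   # break the chunk into its component frequencies
band_energy = np.abs(spectrum) ** 2   # energy in each 50 Hz band, low pitch to high

print(len(band_energy))               # 161 bands, covering 0 Hz up to 8 kHz
print(band_energy[:10])               # energy of the lowest-pitched bands
```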
If we repeat this process on every 20 millisecond chunk of audio, we end up with a spectrogram (each column from left-to-right is one 20ms chunk):
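In code, the whole spectrogram can be built in one shot by transforming every chunk. The log scaling below is a common extra step (assumed here, not required) that keeps quiet bands from being drowned out by loud ones:

```python
import numpy as np

# One row per 20 ms chunk, one column per 50 Hz frequency band
spectrogram = np.log(np.abs(np.fft.rfft(chunks, axis=1)) ** 2 + 1e-10)

print(spectrogram.shape)   # (number of 20 ms chunks, 161 frequency bands)
```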
A neural network can find patterns in this kind of data more easily than raw sound waves. So this is the data representation we’ll actually feed into our neural network.
Now that we have our audio in a format that’s easy to process, we will feed it into a deep neural network. The input to the neural network will be 20-millisecond audio chunks. For each little audio slice, it will try to figure out the letter that corresponds to the sound currently being spoken.
We’ll use a recurrent neural network — that is, a neural network that has a memory that influences future predictions. That’s because each letter it predicts should affect the likelihood of the next letter it will predict too. For example, if we have said “HEL” so far, it’s very likely we will say “LO” next to finish out the word “Hello”. It’s much less likely that we will say something unpronounceable next like “XYZ”. So having that memory of previous predictions helps the neural network make more accurate predictions going forward.
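To make this concrete, here is a deliberately simplified Keras sketch of that kind of recurrent network. The layer sizes and the 28-symbol output alphabet (26 letters, a space, and a blank) are assumptions for illustration, not the exact architecture a production system would use:

```python
import tensorflow as tf

num_bands = 161     # frequency bands per 20 ms chunk (from our spectrogram)
num_symbols = 28    # a-z, space, and a blank symbol

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, num_bands)),       # any number of 20 ms chunks
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(128, return_sequences=True)),   # memory across chunks
    tf.keras.layers.Dense(num_symbols, activation="softmax"),  # letter scores per chunk
])

# For each 20 ms chunk, the output is a probability for every possible letter
print(model.output_shape)   # (None, None, 28)
```

In practice, a network like this is typically trained with a CTC-style loss, which is what gives rise to the repeated letters and “_” blanks you’ll see in the next step.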
After we run our entire audio clip through the neural network (one chunk at a time), we’ll end up with a mapping of each audio chunk to the letters most likely spoken during that chunk. Here’s what that mapping looks like for me saying “Hello”:
Our neural net is predicting that one likely thing I said was “HHHEE_LL_LLLOOO”. But it also thinks that it was possible that I said “HHHUU_LL_LLLOOO” or even “AAAUU_LL_LLLOOO”.
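To turn those raw character sequences into readable text, we collapse runs of repeated characters and then drop the “_” blanks. A minimal sketch of that cleanup step (real decoders are more involved, but the idea is the same):

```python
def collapse(raw):
    """Collapse repeated characters, then strip the '_' blank symbols."""
    collapsed = []
    previous = None
    for char in raw:
        if char != previous:          # keep only the first of each repeated run
            collapsed.append(char)
        previous = char
    return "".join(c for c in collapsed if c != "_")

for guess in ["HHHEE_LL_LLLOOO", "HHHUU_LL_LLLOOO", "AAAUU_LL_LLLOOO"]:
    print(collapse(guess))            # HELLO, HULLO, AULLO
```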
That leaves us with three possible transcriptions: “Hello”, “Hullo”, and “Aullo”.
The trick is to combine these pronunciation-based predictions with likelihood scores based on a large database of written text (books, news articles, etc.). You throw out transcriptions that seem the least likely to be real and keep the transcription that seems the most realistic.
Of our possible transcriptions “Hello”, “Hullo” and “Aullo”, obviously “Hello” will appear more frequently in a database of text (not to mention in our original audio-based training data) and thus is probably correct. So we’ll pick “Hello” as our final transcription instead of the others.
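A toy version of that idea: score each candidate by how often it appears in a word-frequency table built from the text corpus. The counts below are invented purely for illustration:

```python
# Hypothetical word counts from a large text corpus
word_frequency = {"HELLO": 120000, "HULLO": 300, "AULLO": 0}

candidates = ["HELLO", "HULLO", "AULLO"]
best = max(candidates, key=lambda word: word_frequency.get(word, 0))
print(best)   # HELLO
```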