[PLUG] Current state of Linux voice recognition

John Jason Jordan johnxj at gmx.com
Wed Jun 28 23:24:09 UTC 2017


On Wed, 28 Jun 2017 10:48:47 -0500
Richard Owlett <rowlett at cloud85.net> wrote:

>On 06/28/2017 09:54 AM, Larry Brigman wrote:
>> Human voice frequency range tops out at 8khz.  Normal speech is
>> around 2-3khz.  

><chuckle> That's the theory that's been around "forever".
>I'm in possession of a factoid that prompts me to do some research 
>needing high resolution at high sample rates.

There is a lot to say about acoustic phonetics, and I'm not sure where
to start.

It is correct that most human speech tops out at about 3 kHz, but the
lower limit mentioned above is too high. In fact, an adult male with a
large head and vocal tract can produce speech sounds as low as 60 Hz
(e.g., a basso profundo). Most males start at about 85 Hz. For vowels,
sounds above 3 kHz are actually just harmonics, decreasing in volume
the higher the harmonic. The harmonics normally contribute little to
the perception of vowels. Some consonants, however, use much higher
frequencies, notably the stridents, of which English has an
embarras de richesse.
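
If it helps to see that harmonic picture concretely, here is a rough
Python/numpy sketch (the 120 Hz fundamental and the 1/n amplitude
roll-off are assumptions chosen purely for illustration) that builds a
vowel-like tone from harmonics and then measures how little of its
energy lies above 3 kHz:

import numpy as np

# A vowel-like tone: a fundamental plus harmonics whose amplitude
# falls off as 1/n. The roll-off and the 120 Hz fundamental are made
# up for illustration; the point is only that content above ~3 kHz is
# weak harmonic energy.
rate = 44100                       # samples per second
t = np.arange(0, 0.5, 1.0 / rate)  # half a second of signal
f0 = 120.0                         # an assumed male fundamental, in Hz

signal = np.zeros_like(t)
for n in range(1, 60):             # 60 harmonics reaches past 7 kHz
    signal += (1.0 / n) * np.sin(2 * np.pi * n * f0 * t)

# Share of total energy carried by partials above 3 kHz
spectrum = np.abs(np.fft.rfft(signal)) ** 2
freqs = np.fft.rfftfreq(len(signal), 1.0 / rate)
share = spectrum[freqs > 3000].sum() / spectrum.sum()
print(f"fraction of energy above 3 kHz: {share:.3%}")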

Every human language has its own phonetic inventory, that is, the set
of individual sounds the language uses. English has 41-43 phonemes
(depending on your dialect), and many of them have two or more
allophones. Of these, 10-11 (again, depending on dialect) are vowels,
plus there are a handful of diphthongs. Each sound has a specific set
of frequencies, plus other cues that hearers pick up, e.g., the length
of the sound.

Now I address the issue of frequency, starting with the vowels. When
you utter a vowel you actually produce three frequencies (called
formants) simultaneously. The lower two are the critical ones, and the
upper one could be considered a kind of checksum. The formants for the
vowel [i] (as in 'beet') average around 280, 2250, and 2900 Hz,
whereas for the vowel [ɪ] (as in 'bit') they average around 400, 1900,
and 2550 Hz.

Now here is the crucial point: it is the distance between the two
lower formants that makes our brains think 'oh, I just heard an [i]'
or 'I just heard an [ɪ].' Why is this important? Because every human
has a different 'fundamental frequency,' determined mostly by the size
of the vocal tract. Just as your 6th grade science teacher
demonstrated by pinging the sides of glasses filled with different
levels of water, the larger the volume of air, the lower the frequency
that will be produced. Men tend to have larger vocal tracts than
women, so males tend to have a lower fundamental frequency. If our
perception of vowels were determined just by the absolute frequencies
we wouldn't be able to understand anything. But the system works
because the distance between the lower two formants is identical
whether the speaker has a high or a low fundamental frequency. The
numbers I gave above for [i] are actually an average; for a man they
might be 120, 2090, and 2730 Hz, whereas for a woman they might be
380, 2350, and 2990 Hz. Note that for both speakers the difference
between the lower two formants is still 1970 Hz for [i].
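
To make the arithmetic explicit, here is a tiny Python sketch that
just takes the formant figures quoted above and prints the spacing
between the two lower formants; it comes out the same for [i]
regardless of which speaker's numbers you plug in:

# Formant figures quoted above, in Hz (F1, F2, F3).
formants = {
    ("i", "average"): (280, 2250, 2900),
    ("i", "male"):    (120, 2090, 2730),
    ("i", "female"):  (380, 2350, 2990),
    ("ɪ", "average"): (400, 1900, 2550),
}

# The cue the brain keys on: the spacing between the two lower
# formants, which stays put even as the absolute values shift with
# the speaker.
for (vowel, speaker), (f1, f2, f3) in formants.items():
    print(f"[{vowel}] {speaker:>7}: F2 - F1 = {f2 - f1} Hz")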

'Speaker normalization' is a term used by phoneticians to describe an
amazing feature of the human brain - the instantaneous unconscious
ability to perceive the fundamental frequency of a speaker the moment
they open their mouth and utter the first couple of sounds, even if you
have never heard the speaker before. 
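
Software has to work this out the hard way. As a very loose analogue,
here is a sketch of the classic autocorrelation trick for estimating a
speaker's fundamental frequency from a short stretch of samples; the
60-300 Hz search range and the synthetic test tone are assumptions for
illustration, and real pitch trackers are considerably more careful:

import numpy as np

def estimate_f0(samples, rate, lo=60.0, hi=300.0):
    # Crude autocorrelation pitch estimate over an assumed 60-300 Hz
    # range typical of adult speakers.
    samples = samples - samples.mean()
    corr = np.correlate(samples, samples, mode="full")
    corr = corr[len(corr) // 2:]       # keep non-negative lags only
    lag_min = int(rate / hi)           # shortest period considered
    lag_max = int(rate / lo)           # longest period considered
    lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return rate / lag

# Quick check on a synthetic 120 Hz tone (not real speech).
rate = 16000
t = np.arange(0, 0.1, 1.0 / rate)
tone = np.sin(2 * np.pi * 120 * t)
print(f"estimated fundamental: {estimate_f0(tone, rate):.1f} Hz")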

Now I turn to consonants. Consonants also have formants, but the upper
formants are the most important, and they can be much higher, even
higher than 3 kHz. For example, the upper formant for [s] (as in 'hiss')
ranges from around 4900 to 6000 Hz, depending on the speaker's
fundamental frequency and the vowel(s) that precede or follow it - which
leads me to problems with telephony.

A long time ago, when telephone systems were first being developed,
the telephone companies decided, for purely economic reasons, to limit
the bandwidth that their equipment could carry and reproduce to
300-3400 Hz. (Those figures are present-day standards; in the
beginning they weren't even that generous.) Equipment that could
handle a wider range would have been massively more expensive.
Unfortunately, this produces the famous expressions 's as in Sam' and
'f as in Frank,' because the equipment doesn't go high enough to
reproduce the upper formant of [s], making it impossible to
distinguish [s] from [f] over the telephone.
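
To see what that band limit does, here is a small Python sketch that
models an idealized telephone channel as a simple FFT mask (real
channels are much messier); a component placed in the [s] range simply
disappears, while one inside the passband comes through:

import numpy as np

rate = 16000
t = np.arange(0, 0.2, 1.0 / rate)

# Toy signal: a 1000 Hz component standing in for vowel energy, plus
# a 5500 Hz component standing in for the upper formant of [s]
# (illustrative values only).
signal = np.sin(2 * np.pi * 1000 * t) + 0.5 * np.sin(2 * np.pi * 5500 * t)

# Idealized telephone channel: keep only 300-3400 Hz.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), 1.0 / rate)
spectrum[(freqs < 300) | (freqs > 3400)] = 0
on_the_phone = np.fft.irfft(spectrum, n=len(signal))

def energy_near(sig, target, width=100):
    # Total spectral energy within +/- width Hz of a target frequency.
    spec = np.abs(np.fft.rfft(sig)) ** 2
    f = np.fft.rfftfreq(len(sig), 1.0 / rate)
    return spec[np.abs(f - target) < width].sum()

print("5500 Hz energy before:", round(energy_near(signal, 5500)))
print("5500 Hz energy after: ", round(energy_near(on_the_phone, 5500)))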

Having said all of that, there is a lot more to human speech
recognition than having equipment capable of adequate bandwidth. Our
human brains juggle so much input so rapidly that we have to use
shortcuts. Let me give you just one example: if you hear an article
(a, an, the), your brain knows that it always introduces a noun
phrase, so the next word absolutely must be a noun, a nominal
modifier, or an intensifier. If you speak a language, every word in
your lexicon is flagged with the categories it can be used as. This
means that as you try to decipher the next word you are hearing, you
can discard a vast amount of your lexicon as impossible.
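
In code terms, you can picture the lexicon as a table of words flagged
with the categories each can fill, so that hearing an article
immediately prunes the candidate set for the next word. A toy Python
sketch (the words and category labels are invented for illustration):

# A toy lexicon: each word is flagged with the categories it can fill.
lexicon = {
    "dog":    {"noun"},
    "red":    {"nominal_modifier"},
    "very":   {"intensifier"},
    "ran":    {"verb"},
    "the":    {"article"},
    "slowly": {"adverb"},
}

# After an article, only these categories can come next.
allowed_after_article = {"noun", "nominal_modifier", "intensifier"}

candidates = sorted(word for word, cats in lexicon.items()
                    if cats & allowed_after_article)
print(candidates)   # ['dog', 'red', 'very'] - 'ran' and 'slowly' are ruled out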


