Speech recognition technology
Most current technical textbooks emphasize the hidden Markov model (HMM) as the
underlying technology. The dynamic programming approach, the neural network-based
approach and the knowledge-based approach were studied intensively in the 1980s
and 1990s.
Performance of speech recognition systems
The performance of a speech recognition system is usually specified in terms
of accuracy and speed. Accuracy is measured with the word error rate, whereas
speed is measured with the real time factor.
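As a rough illustration of these two measures, the sketch below computes a word
error rate as the word-level edit distance (substitutions, deletions and insertions)
divided by the number of reference words, and a real time factor as processing time
divided by audio duration. The function names and the example sentences are
hypothetical, chosen only for this illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (all but the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[len(hyp)] / len(ref)


def real_time_factor(processing_seconds, audio_seconds):
    """Time spent recognizing divided by the duration of the audio."""
    return processing_seconds / audio_seconds


# Hypothetical example: one substitution and one deletion in six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
rtf = real_time_factor(30.0, 60.0)
```

A real time factor below 1 means the recognizer runs faster than the audio plays.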
Most speech recognition users would agree that dictation machines can achieve
very high performance in controlled conditions. Much of the confusion about
performance comes from using the terms speech recognition and dictation
interchangeably.
Speaker-dependent dictation systems that require a short period of training can
capture continuous speech with a large vocabulary at a normal pace with very high
accuracy. Most commercial companies claim that recognition software can achieve
between 98% and 99% accuracy (getting one or two words out of one hundred wrong)
if operated under optimal conditions. These optimal conditions usually mean that the
test subjects have 1) speaker characteristics that match the training data, 2)
proper speaker adaptation, and 3) a clean environment (e.g. office space). (This
explains why some users, especially those with accents, might find that the
recognition rate is noticeably lower than the expected 98% to 99%.)
Other, limited-vocabulary systems that require no training can recognize a small
number of words (for instance, the ten digits) from most speakers. Such systems
are popular for routing incoming phone calls to their destinations in large organizations.
Noisy channel formulation of statistical speech recognition
Many modern approaches, such as HMM-based and ANN-based speech recognition,
are based on the noisy channel formulation (see also Alternative formulation of
speech recognition). In that view, the task of a speech recognition system is to
search for the most likely word sequence given the acoustic signal. In other words,
the system searches for the most likely word sequence $W^*$ among all possible word
sequences $W$, given the acoustic signal $A$ (what some will call the observation
sequence, in hidden Markov model terminology):

    $W^* = \arg\max_W P(W \mid A)$

Based on Bayes' rule, the above formulation can be rewritten as

    $W^* = \arg\max_W \frac{P(A \mid W) P(W)}{P(A)}$

Because the acoustic signal is the same regardless of which word sequence is chosen,
$P(A)$ does not affect the maximization, and the above can usually be simplified to

    $W^* = \arg\max_W P(A \mid W) P(W)$

The term $P(A \mid W)$ is generally called the acoustic model. The term $P(W)$ is
generally known as the language model.
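To make the decision rule concrete, here is a minimal sketch that picks, from a
small list of candidate word sequences, the one with the highest combined acoustic
model and language model score. The candidate sentences and all probabilities are
invented for illustration, and a real recognizer searches a vastly larger space
than an explicit list.

```python
import math


def decode(candidates, acoustic_logprob, language_logprob):
    """Return the candidate W maximizing log P(A|W) + log P(W)."""
    return max(candidates,
               key=lambda w: acoustic_logprob[w] + language_logprob[w])


# Hypothetical scores for one utterance. The acoustic model slightly prefers
# the second candidate, but the language model strongly prefers the first.
acoustic = {
    "recognize speech": math.log(0.30),
    "wreck a nice beach": math.log(0.35),
}
language = {
    "recognize speech": math.log(0.010),
    "wreck a nice beach": math.log(0.0001),
}

best = decode(list(acoustic), acoustic, language)
```

Working with log probabilities turns the product of the two terms into a sum, which
avoids numerical underflow when many small probabilities are combined.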
Both acoustic modeling and language modeling are important areas of study in
modern statistical speech recognition. This entry focuses on the hidden Markov
model (HMM) because it is very widely used in many systems. (Language modeling
has many other applications, such as smart keyboards and document classification;
please refer to the corresponding entries.)
Approaches of statistical speech recognition
Hidden Markov model (HMM)-based speech recognition
Modern general-purpose speech recognition systems are generally based on hidden
Markov models (HMMs). An HMM is a statistical model that outputs a sequence of
symbols or quantities.
One possible reason why HMMs are used in speech recognition is that a speech
signal can be viewed as a piecewise stationary or short-time stationary signal.
That is, over a short interval on the order of 10 milliseconds, speech can be
approximated as a stationary process. Speech can thus be thought of as a Markov
model over a sequence of such stationary processes (known as states).
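The short-time stationarity assumption is usually exploited by cutting the signal
into short frames and treating each frame as one observation. The sketch below
splits a list of samples into non-overlapping 10 millisecond frames; the function
name and the 16 kHz sample rate are assumptions made for this example, and practical
systems often use overlapping windows instead.

```python
def split_into_frames(samples, sample_rate, frame_ms=10):
    """Cut a sampled signal into consecutive frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len < 1:
        raise ValueError("frame is shorter than one sample")
    # Any trailing samples that do not fill a whole frame are dropped.
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, frame_len)]


# One second of dummy samples at an assumed 16 kHz sample rate.
frames = split_into_frames(list(range(16000)), sample_rate=16000)
```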
Another reason why HMMs are popular is that they can be trained automatically
and are simple and computationally feasible to use. In the simplest possible setup
for speech recognition, the hidden Markov model outputs a sequence of n-dimensional
real-valued vectors (with n around, say, 13), one every 10 milliseconds. The
vectors, again in the simplest case, consist of cepstral coefficients, which are
obtained by taking a Fourier transform of a short-time window of speech,
decorrelating the spectrum using a cosine transform, and then keeping the first
(most significant) coefficients. In each state, the hidden Markov model typically
has a statistical distribution, a mixture of diagonal-covariance Gaussians, which
gives a likelihood for each observed vector. Each word or (in more general speech
recognition systems) each phoneme has a different output distribution; a hidden
Markov model for a sequence of words or phonemes is made by concatenating the
individually trained hidden Markov models for the separate words and phonemes.
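The feature extraction and state likelihood described above can be sketched in a
few lines. This is a simplified illustration, not a production front end: it uses a
naive Fourier transform, takes the logarithm of the magnitude spectrum before the
cosine transform (an assumption of this sketch, following the usual definition of
the cepstrum), and omits refinements such as windowing and filter banks. All
function names and parameter values are hypothetical.

```python
import cmath
import math


def cepstral_coefficients(frame, num_coeffs=13):
    """Fourier transform -> log magnitude -> cosine transform -> first coefficients."""
    n = len(frame)
    log_spectrum = []
    for k in range(n // 2 + 1):  # non-redundant bins of a real signal
        bin_value = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(frame))
        log_spectrum.append(math.log(abs(bin_value) + 1e-10))
    m = len(log_spectrum)
    # DCT-II of the log spectrum; keep only the first num_coeffs outputs.
    return [sum(v * math.cos(math.pi * c * (i + 0.5) / m)
                for i, v in enumerate(log_spectrum))
            for c in range(num_coeffs)]


def diag_gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of vector x under a mixture of diagonal-covariance Gaussians."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(ll)
    # Log-sum-exp over the mixture components for numerical stability.
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))


# A synthetic 64-sample frame (a pure tone) stands in for real speech.
frame = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
features = cepstral_coefficients(frame)

# A made-up two-component mixture over 13-dimensional vectors.
score = diag_gmm_log_likelihood(
    features,
    weights=[0.6, 0.4],
    means=[[0.0] * 13, [1.0] * 13],
    variances=[[1.0] * 13, [2.0] * 13],
)
```

Diagonal covariances keep the per-vector cost linear in the dimension, which is one
reason the cosine transform's decorrelating effect matters.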
The above is a very brief introduction to some of the more central aspects of
speech recognition. Modern speech recognition systems use a host of standard
techniques that would take too long to explain properly here, but to give a flavor,
a typical large-vocabulary continuous system would probably have the following parts.
It would need context dependency for the phones (so that phones with different left
and right contexts have different realizations); to handle unseen contexts it would
need tree clustering of the contexts. It would use cepstral normalization to
normalize for different recording conditions, and, depending on how much time the
system has to adapt to different speakers and conditions, it might use cepstral
mean and variance normalization for channel differences, vocal tract length
normalization (VTLN) for male-female normalization and maximum likelihood linear
regression (MLLR) for more general speaker adaptation.
The features would have delta and delta-delta coefficients to capture speech
dynamics and, in addition, might use heteroscedastic linear discriminant analysis
(HLDA); or the system might skip the delta and delta-delta coefficients and use
linear discriminant analysis (LDA), followed perhaps by HLDA or a global semitied
covariance transform (also known as a maximum likelihood linear transform (MLLT)).
A serious company with a large amount of training data would probably want to
consider discriminative training techniques such as maximum mutual information
(MMI), minimum phone error (MPE) or, for short utterances, minimum classification
error (MCE). If a large amount of speaker-specific enrollment data is available,
a more wholesale speaker adaptation can be done using maximum a posteriori (MAP)
adaptation or, at least, tree-based maximum likelihood linear regression.
Decoding of the speech (the term for what happens when the system is presented with
a new utterance and must compute the most likely source sentence) would probably
use the Viterbi algorithm to find the best path. There is a choice between
dynamically creating a combination hidden Markov model that includes both the
acoustic and language model information, or combining them statically beforehand
(the AT&T approach, for which their FSM toolkit might be useful). The static
approach is simpler to work with, but be warned that it is memory hungry.
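The Viterbi search mentioned above can be sketched for a small HMM as follows. This
is a textbook-style illustration in the log domain, not the decoder of any particular
toolkit; the two-state left-to-right model and all probabilities are invented for
the example, and the per-frame emission scores are assumed to have been computed
already by the acoustic model.

```python
import math

NEG_INF = float("-inf")


def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path and its log score.

    log_init[s]     : log probability of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit[t][s]  : log-likelihood of the observation at time t in state s
    """
    num_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(num_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(num_states):
            # Best predecessor state for arriving in s at time t.
            best_prev = max(range(num_states),
                            key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(num_states), key=lambda s: score[s])
    best_score = score[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_score


# Hypothetical left-to-right model: state 0 may stay or advance, state 1 stays.
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)],
             [NEG_INF, 0.0]]
log_emit = [[math.log(0.9), math.log(0.1)],
            [math.log(0.8), math.log(0.2)],
            [math.log(0.2), math.log(0.8)],
            [math.log(0.1), math.log(0.9)]]

path, best_score = viterbi(log_init, log_trans, log_emit)
```

In a full recognizer the state space combines acoustic and language model
information, which is where the dynamic versus static trade-off arises.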
Neural network-based speech recognition
Another approach to acoustic modeling is the use of neural networks.
They are capable of solving much more complicated recognition tasks, but do not
scale as well as HMMs when it comes to large vocabularies. Rather than being used
in general-purpose speech recognition applications, they are used where low-quality,
noisy data and speaker independence must be handled. Such systems can achieve greater
accuracy than HMM-based systems, as long as there is training data and the vocabulary
is limited. A more general approach using neural networks is phoneme recognition.
This is an active field of research, but the reported results are generally better
than for HMMs. There are also NN-HMM hybrid systems that use the neural network part
for phoneme recognition and the hidden Markov model part for sequence modeling.
Dynamic time warping (DTW)-based speech recognition
- Main article: Dynamic time warping
Dynamic time warping is an algorithm for measuring similarity between two sequences
which may vary in time or speed. For instance, similarities in walking patterns
would be detected even if in one video the person was walking slowly and in
another they were walking more quickly, or even if there were accelerations and
decelerations during the course of one observation. DTW has been applied to video,
audio, and graphics -- indeed, any data which can be turned into a linear representation
can be analyzed with DTW.
A well-known application is automatic speech recognition, where DTW is used to cope
with different speaking speeds. In general, it is a method that allows a computer to
find an optimal match between two given sequences (e.g. time series) under certain
restrictions; that is, the sequences are "warped" non-linearly to match each other.
This sequence alignment method is often used in the context of hidden Markov models.
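A minimal sketch of the basic dynamic time warping recurrence is given below, using
one-dimensional sequences and absolute difference as the local distance. Both choices
are assumptions made to keep the example short; speech applications would compare
feature vectors and often add constraints on the warping path.

```python
def dtw_distance(a, b):
    """Cost of the best non-linear alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# The same pattern at two different speeds aligns with zero cost.
fast = [0, 1, 2, 1, 0]
slow = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]
same_shape = dtw_distance(fast, slow)
different = dtw_distance([0, 1, 2], [0, 1, 3])
```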
Knowledge-based speech recognition
This method compares simple spoken words with the entries in a stored database
of commands.
For further information
Popular speech recognition conferences held each year or two include ICASSP,
Eurospeech/ICSLP (now named Interspeech) and the IEEE ASRU. Conferences in the field
of Natural Language Processing, such as ACL, NAACL, EMNLP, and HLT, are beginning
to include papers on speech processing. Important journals include the IEEE
Transactions on Speech and Audio Processing (now named IEEE Transactions on
Audio, Speech and Language Processing), Computer Speech and Language, and Speech
Communication. Books like "Fundamentals of Speech Recognition" by Lawrence Rabiner
can be useful for acquiring basic knowledge but may not be fully up to date (1993).
Another good source is "Statistical Methods for Speech Recognition" by Frederick
Jelinek, which is a more recent book (1998). Keep an eye on government-sponsored
competitions such as those organised by DARPA (the telephone speech evaluation
was most recently known as Rich Transcription). In terms of freely available resources,
the HTK book (and the accompanying HTK toolkit) is one place to start, both to learn
about speech recognition and to start experimenting (if you are very brave). You
could also search for Carnegie Mellon University's SPHINX toolkit.
Applications of speech recognition
- Command recognition - Voice user interface with the computer
- Dictation
- Interactive Voice Response
- Automotive speech recognition
- Medical Transcription
- Pronunciation Teaching in computer-aided language learning applications
- Automatic Translation
See also
- Guided Speech IVR
- Speech processing
- Audio visual speech recognition
- Speech verification
- Speaker identification
- Speech synthesis
- Speech Analytics
- Keyword spotting
- VoiceXML
- Macfarlane's Law - the conflict between typing and reading speed anticipated
the importance of speech recognition
References
- Ron Cole et al., "Survey of the State of the Art in Human Language Technology"
(1997)
Books
- Multilingual Speech Processing, edited by Tanja Schultz and Katrin Kirchhoff,
April 2006. Researchers and developers in industry and academia with different
backgrounds but a common interest in multilingual speech processing will find
an excellent overview of research problems and solutions detailed from theoretical
and practical perspectives. Chapters: 1. Introduction / 2. Language Characteristics
/ 3. Linguistic Data Resources / 4. Multilingual Acoustic Modeling /
5. Multilingual Dictionaries / 6. Multilingual Language Modeling /
7. Multilingual Speech Synthesis / 8. Automatic Language Identification /
9. Other Challenges
External links
- NIST Speech Group
- Sphinx Open Source Speech Recognition Engine
- Entropic/Cambridge Hidden Markov Model Toolkit
- Julius Open Source Speech Recognition Engine
- The SPRACHcore software package
- Open CV library, especially the multi-stream speech and vision combination
programs
- Xvoice: Speech control of X applications
- LT-World: Portal to information and resources on the internet
- LDC The Linguistic Data Consortium
- Evaluations and Language resources Distribution Agency
- OLAC Open Language Archives Community
- BAS Bavarian Archive for Speech Signals
- VoxForge - Free GPL Speech Corpus and Acoustic Model repository
My life with furniture - by Cathy Goodwin, Ph.D.
This article may be reprinted or reposted in its entirety if you also include
my resource box.
Every so often I think of writing a Back to School article. However, I now live
in a warm climate. The weather feels like a lazy summer school, not a serious winter
term. No need to lay in a supply of sweaters and sweatshirts.
But the real reason is that, increasingly, the lines are blurred between school
and Real Life. These days, student life often means spending a cozy evening with
your computer, e-mailing your classmates and posting your assignments to a website.
You might be catching a class on weekends, evenings or two-week learning modules.
Even traditional campus life is designed for grown-ups. Two years ago, the New
York Times Magazine carried a story about life in the New Dorms. Apparently some
upscale schools are decorating the dorms to look like yuppie condominiums, complete
with carpeting and what the Times calls "adult-sized refrigerators."
Meanwhile, a lot of grown-ups who are old enough to remember typing their term
papers are still living like students. Books, magazines and loose stacks of paper
are strewn everywhere.
Books call for bookshelves. A Real Student secretly misses the bricks and boards,
although today the bricks and boards cost more than particle board shelves and are
impossible to move.
When I lived in Alaska, I realized there was no point in buying Real Furniture.
You could equip a ten-room house for the cost of shipping the contents of a studio
apartment to the Lower 48. I ended up buying a couch from a graduating student and
added an extra futon to the Bedroom Set. When I moved to my next job, I fully intended
to do the same until a colleague asked me, "Isn't there a time in your life when
you stop buying used couches from students?"
A friend told me she had a similar experience when she visited a Real Furniture
Store, seeking bookshelves. The salesperson showed her a nice unit for $450. Seeing
that my friend was about to pass out, the salesperson explained, "This is a piece
of furniture that you will be proud to display in your home."
My friend left the store in a daze. Somehow, she explained later, she had never
thought of bookshelves as furniture.
I'd like to think we're all grown up now, but it's hard. For one thing, many
professions encourage us to live like a student with five term papers due at the
end of the term and no graduation in sight. If you're writing a book, teaching a
seminar, preparing for a court case, coaching a sports team or putting together
a sales presentation, there's always something more you could be doing, twenty-four
hours a day, seven days a week. People who have the souls of Real Students seem
attracted to those jobs.
Still, I see progress. A friend called to say he bought a house because he was
tired of living like a student and was ready to grow up. He was forty-five at the
time.
I myself have acquired some Real Furniture, including the Beautiful New Couch
I bought eight years ago, although I still insist that sleeping on a few layers
of futons is healthier than a conventional bedroom set. Thanks to my lawn service
person, who is a student, I have a real, grown-up yard. Recently, while walking
the dog, I met a young student who had transformed her rental cottage into a home
worthy of House Beautiful. I suggested she moonlight as a decorator to help those
who have graduated and finally decided to become adults.
We will never succeed completely. My friend with the house just called to say
that his two cats have shredded most of the trappings of his adult life. I understand
perfectly. My Beautiful New Couch has served as a place for me, my house-sitters
and my guests to take naps, and the cats have carried out extensive performance
tests on each cushion.
The moving companies see a couch as a challenge to their insurance guidelines.
I haven't been a student but the couch has gone through a reverse graduation: it
looks far more exhausted than its predecessor, the couch I bought, ten years ago,
from a student.
Cathy Goodwin, Ph.D., author, coach, speaker
Helps mid-career professionals move to career freedom
Nine Magic Keys to Career Freedom
http://www.movinglady.com/freedombook.html
Career Freedom Ezine mailto:subscribe@movinglady.com
email: cathy@movinglady.com