Audiovisual methods of automatic speech recognition (ASR) have been widely studied as they offer improved robustness and accuracy, especially in the presence of noise. Traditional audio-based ASR systems perform reasonably well in controlled lab environments. In many environments, however, such as offices or outdoors, the recognition performance decreases drastically due to background noise. One way to increase robustness with respect to acoustic signal distortion is to consider the visual speech modality jointly with the auditory modality. Previous studies have shown that in both ASR and human speech perception, the audio and visual sensory modalities have different strengths and weaknesses, and in fact to a large extent they complement each other [7]. Visible speech is usually most informative for just those distinctions that are most ambiguous auditorily. For example, perceiving place of articulation, such as the difference between /b/ and /d/, is difficult via sound but relatively easy via sight. On the other hand, voicing information, which is difficult to see visually, is relatively easy to resolve via sound. Thus, visible speech is to a large degree not redundant with auditory speech.

The primary motivation for using visual information is to improve the robustness of the system with respect to environmental variations. Thus, a major goal is that an audiovisual system should perform at least as well as its audio subsystem does, over the entire range of conditions which might
be encountered. This requirement implies that in situations
where the audio subsystem performs accurately, the role of the
visual information should be very limited, and as the audio
subsystem loses accuracy, the role of the visual information
should increase.
Since a system cannot "know" whether or not it is performing
accurately, some measure of confidence must accompany the
classification. A natural measure of confidence is the ratio of
the highest score (probability estimate) to the nearest competing
score. This confidence measure is easy to exploit such
that when confidence for either subsystem (audio or visual)
is high, then the decision of that subsystem carries a lot of weight, while if it is low, the other subsystem will have a
substantial effect. Note that, in a phoneme-based HMM, this
confidence and the associated decisions may be connected
with individual states or time steps.
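The confidence-weighted combination described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the particular normalization of the two confidence values into weights are assumptions for the example.

```python
import numpy as np

def confidence(scores):
    """Confidence as the ratio of the highest score to the
    nearest competing score, as described in the text."""
    top, runner_up = np.sort(np.asarray(scores, dtype=float))[::-1][:2]
    return top / runner_up

def fuse(audio_scores, visual_scores):
    """Combine per-class scores from the audio and visual subsystems,
    weighting each subsystem by its (normalized) confidence, so a
    confident subsystem dominates the decision."""
    ca = confidence(audio_scores)
    cv = confidence(visual_scores)
    wa, wv = ca / (ca + cv), cv / (ca + cv)
    combined = wa * np.asarray(audio_scores) + wv * np.asarray(visual_scores)
    return int(np.argmax(combined))

# A confident audio subsystem (0.8 vs 0.1) outweighs an
# uncertain visual subsystem (0.34 vs 0.33):
label = fuse([0.1, 0.8, 0.1], [0.34, 0.33, 0.33])
```

In a phoneme-based HMM, such a combination could in principle be applied per state or per time step, as the text notes.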
1.2. Integration of Audio and Visual Information
Several methods for integration of audio and visual sources
have been proposed (e.g. [5, 4, 6, 9, 10, 11]). Robert-Ribes
[4] has proposed a classification scheme for integration
strategies. Two broad classes of strategy are "early"
and "late" integration models. Early integration refers to
strategies which combine evidence from different modalities
prior to making any decisions, whereas late integration
strategies perform some sort of independent single-modality
scoring before combining evidence.
scoring before combining evidence. Although there remains much to be discovered concerning this process, the evidence
suggests that early integration strategies are the most successful.