Speech recognition, by both humans and machines,
benefits from visual observation of the face, especially
at low signal-to-noise ratios (SNRs). It has often
been noticed, however, that the audible and visible
correlates of a phoneme may be asynchronous;
perhaps for this reason, automatic speech recognition
architectures that allow asynchrony between the
audible phoneme and the visible viseme outperform
recognizers that allow no such asynchrony.
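To make concrete what an asynchrony-permitting structure looks like, the following toy Python sketch (our illustration, not an implementation from the paper; the function names joint_states and transitions and the max_async parameter are hypothetical) enumerates the joint (audio state, video state) pairs of a left-to-right word model. A synchronous recognizer is the special case in which the two state indices are forced to agree (max_async = 0); an asynchronous recognizer lets them drift apart by a bounded amount.

from itertools import product

def joint_states(num_units, max_async):
    """Joint (audio_state, video_state) pairs whose indices differ by at most
    max_async positions; max_async = 0 recovers the synchronous model."""
    return [(a, v) for a, v in product(range(num_units), repeat=2)
            if abs(a - v) <= max_async]

def transitions(states):
    """Left-to-right transitions: each stream may stay or advance by one step,
    provided the successor pair still satisfies the asynchrony constraint."""
    allowed = set(states)
    return [((a, v), (a + da, v + dv))
            for a, v in states
            for da, dv in product((0, 1), repeat=2)
            if (a + da, v + dv) in allowed]

if __name__ == "__main__":
    sync = joint_states(num_units=3, max_async=0)    # audio and video locked together
    loose = joint_states(num_units=3, max_async=1)   # streams may lag by one unit
    print(len(sync), "synchronous joint states:", sync)
    print(len(loose), "asynchronous joint states:", loose)
    print(len(transitions(loose)), "allowed transitions in the asynchronous model")

For a three-unit word, allowing one unit of asynchrony enlarges the joint state space from 3 to 7 states; that larger space is the structural price paid for modeling the audible and visible streams with separately evolving states.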
This paper proposes a new explanation for audio-visual
asynchrony and tests it using experimental
speech recognition systems. Specifically, we propose
that audio-visual asynchrony may be the result
of asynchrony between the gestures implemented
by different articulators, such that the most visibly
salient articulator (e.g., the lips) and the most audibly
salient articulator (e.g., the glottis) may, at
any given time, be dominated by gestures associated
with different phonemes. The proposed model of
audio-visual asynchrony is tested by implementing
an “articulatory-feature model” audio-visual speech
recognizer: a system with multiple hidden state variables,
each representing the gestures of one articulator.
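As a rough sketch of such a factored model (an illustration under our own assumptions rather than the recognizer's actual code; the articulator set, the dummy stay/advance probability, and the soft asynchrony penalty are all hypothetical), each articulator carries its own hidden gesture index, the audio observation is scored against the audibly salient articulator, the video observation against the visibly salient one, and a coupling term discourages, without forbidding, the articulators drifting apart:

import math

ARTICULATORS = ("lips", "tongue", "glottis")  # hypothetical feature set

def log_transition(state, next_state):
    """Factored transition: each articulator independently stays in its current
    gesture or advances to the next one (dummy 0.5/0.5 probabilities)."""
    total = 0.0
    for art in ARTICULATORS:
        step = next_state[art] - state[art]
        if step not in (0, 1):
            return -math.inf
        total += math.log(0.5)
    return total

def log_observation(state, audio_frame, video_frame, audio_model, video_model):
    """Audio likelihood conditioned on the audibly salient articulator (here,
    the glottis); video likelihood on the visibly salient one (the lips)."""
    return (audio_model(audio_frame, state["glottis"])
            + video_model(video_frame, state["lips"]))

def log_asynchrony_penalty(state, weight=1.0):
    """Soft coupling: penalize how far apart the articulators' gesture indices
    have drifted, rather than forcing them to agree."""
    spread = max(state.values()) - min(state.values())
    return -weight * spread

if __name__ == "__main__":
    # Dummy frame models: the score is highest when the frame matches the index.
    audio_model = lambda frame, g: -abs(frame - g)
    video_model = lambda frame, l: -abs(frame - l)
    state = {"lips": 1, "tongue": 1, "glottis": 0}   # lips ahead of the glottis
    nxt = {"lips": 2, "tongue": 1, "glottis": 1}
    score = (log_transition(state, nxt)
             + log_observation(nxt, audio_frame=1, video_frame=2,
                               audio_model=audio_model, video_model=video_model)
             + log_asynchrony_penalty(nxt))
    print("frame score for an asynchronous articulator configuration:", score)

The point of the factorization is that, in any given frame, the lip state and the glottal state need not correspond to the same phoneme, which is exactly the kind of asynchrony hypothesized above.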
The proposed system performs as well as a
standard audio-visual recognizer on a digit recognition
task; the best results are achieved by combining
the outputs of the two systems.