Variational auto-encoders (VAEs) are deep generative latent variable models that can be used for learning the distribution of complex data. VAEs have been successfully used to learn a probabilistic prior over speech signals, which is then used to perform speech enhancement.

Audio-visual fusion is one of the most challenging tasks that continues to attract substantial research interest in the field of audio-visual automatic speech recognition (AV-ASR). In the last few decades, many approaches for integrating the audio and video modalities have been proposed to enhance the performance of automatic speech recognition in both clean and noisy conditions. However, very few studies in the literature compare different fusion models for AV-ASR, and even less research work compares audio-visual fusion models for large vocabulary continuous speech recognition (LVCSR) using deep neural networks (DNNs). This paper reviews and compares the performance of five audio-visual fusion models: the feature fusion model, the decision fusion model, the multi-stream hidden Markov model (HMM), the coupled HMM, and the turbo decoder. A complete evaluation of these fusion models is conducted using a standard speaker-independent DNN-based LVCSR Kaldi recipe in three experimental setups: clean-train-clean-test, clean-train-noisy-test, and matched training. All experiments are conducted on the recently released NTCD-TIMIT audio-visual corpus, whose task is phone recognition in continuous speech. Using NTCD-TIMIT, with its freely available visual features and 37 clean and noisy acoustic signals, allows this study to serve as a common benchmark against which novel LVCSR AV-ASR models and approaches can be compared.
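Of the five fusion models compared, the two simplest can be illustrated compactly: feature fusion concatenates frame-synchronized audio and visual feature vectors before the DNN acoustic model, while decision fusion combines the log-likelihoods of separate audio-only and video-only classifiers with a stream weight. A minimal NumPy sketch, with purely illustrative dimensions, function names, and weight values (not taken from the paper):

```python
import numpy as np

def feature_fusion(audio_feats, visual_feats):
    """Concatenate frame-synchronized audio and visual features.

    audio_feats:  (T, Da) array, e.g. per-frame MFCCs
    visual_feats: (T, Dv) array, e.g. per-frame lip-region features
    Returns a (T, Da + Dv) fused matrix fed to a single acoustic model.
    """
    assert audio_feats.shape[0] == visual_feats.shape[0], "streams must be frame-aligned"
    return np.concatenate([audio_feats, visual_feats], axis=1)

def decision_fusion(audio_logprobs, visual_logprobs, lam=0.7):
    """Weighted log-linear combination of two single-modality classifiers.

    audio_logprobs, visual_logprobs: (T, num_states) log-likelihood arrays
    lam: audio stream weight in [0, 1]; in practice it is often tuned to
    favor audio in clean conditions and video in noisy conditions.
    """
    return lam * audio_logprobs + (1.0 - lam) * visual_logprobs

# Toy example: 100 frames, 13-dim audio and 32-dim visual features
fused = feature_fusion(np.zeros((100, 13)), np.zeros((100, 32)))
print(fused.shape)  # (100, 45)
```

The remaining models (multi-stream HMM, coupled HMM, turbo decoding) differ in where the combination happens: inside the HMM state likelihoods, across coupled state chains, or iteratively between the two decoders, rather than at the feature or output level as sketched here.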