
SLU papers

Notes organizing papers on spoken-language intent recognition (SLU).

JD AI paper

Building Robust Spoken Language Understanding by Cross Attention between Phoneme Sequence and ASR Hypothesis

Core method: adapts the cross-attention mechanism in the spirit of "Cross Attention Network for Few-shot Classification" (Ruibing Hou et al.); a minimal fusion sketch follows.
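A minimal sketch of this kind of cross-attention fusion, assuming PyTorch; the class name, dimensions, and residual design are illustrative assumptions, not the paper's exact architecture. The ASR-hypothesis token states attend over the phoneme states, and the fused representation can then be pooled and fed to an intent classifier.

```python
import torch
import torch.nn as nn

class PhonemeTextCrossAttention(nn.Module):
    """Hypothetical fusion layer: ASR-hypothesis token states attend over phoneme states."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_states, phoneme_states):
        # text_states:    (batch, text_len, dim)    encoded ASR-hypothesis tokens
        # phoneme_states: (batch, phoneme_len, dim) encoded phoneme sequence
        fused, _ = self.attn(query=text_states, key=phoneme_states, value=phoneme_states)
        # Residual fusion; the result can be pooled and passed to an intent classifier.
        return self.norm(text_states + fused)


# Usage with dummy tensors (shapes are illustrative).
layer = PhonemeTextCrossAttention()
text = torch.randn(2, 10, 256)
phones = torch.randn(2, 30, 256)
print(layer(text, phones).shape)  # torch.Size([2, 10, 256])
```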

lattice

  • [6] F. Ladhak, A. Gandhe, M. Dreyer, L. Mathias, A. Rastrow, and B. Hoffmeister, “LatticeRNN: Recurrent neural networks over lattices,” in Interspeech, 2016, pp. 695–699.
  • [7] C. Huang and Y. Chen, “Adapting pretrained transformer to lattices for spoken language understanding,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 845–852.
  • [8] C. Huang and Y. Chen, “Learning spoken language representations with neural lattice language modeling,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3764–3769.

n-best

  • [9] F. Peng, S. Roy, B. Shahshahani, and F. Beaufays, “Search results based n-best hypothesis rescoring with maximum entropy classification,” in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, 2013, pp. 422–427.
  • [10] A. J. Kumar, C. Morales, M.-E. Vidal, C. Schmidt, and S. Auer, “Use of knowledge graph in rescoring the n-best list in automatic speech recognition,” arXiv preprint arXiv:1705.08018, 2017.
  • [11] A. Ogawa, M. Delcroix, S. Karita, and T. Nakatani, “Rescoring n-best speech recognition list based on one-on-one hypothesis comparison using encoder-classifier model,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 6099–6103.
  • [12] A. Ogawa, M. Delcroix, S. Karita, and T. Nakatani, “Improved deep duel model for rescoring n-best speech recognition list using backward LSTMLM and ensemble encoders,” in INTERSPEECH, 2019, pp. 3900–3904.
  • [13] M. Li, X. Liu, W. Ruan, L. Soldaini, W. Hamza, and C. Su, “Multi-task learning of spoken language understanding by integrating n-best hypotheses with hierarchical attention,” in Proceedings of the 28th International Conference on Computational Linguistics: Industry Track, 2020, pp. 113–123.
  • [14] M. Li, W. Ruan, X. Liu, L. Soldaini, W. Hamza, and C. Su, “Improving spoken language understanding by exploiting ASR n-best hypotheses,” arXiv preprint arXiv:2001.05284, 2020.

word confusion network

  • [15] G. Tur, J. Wright, A. Gorin, G. Riccardi, and D. Hakkani-Tür, “Improving spoken language understanding using word confusion networks,” in Seventh International Conference on Spoken Language Processing, 2002.
  • [16] D. Hakkani-Tür, F. Béchet, G. Riccardi, and G. Tur, “Beyond ASR 1-best: Using word confusion networks in spoken language understanding,” Computer Speech & Language, vol. 20, no. 4, pp. 495–514, 2006.
  • [17] M. Henderson, M. Gašić, B. Thomson, P. Tsiakoulis, K. Yu, and S. Young, “Discriminative spoken language understanding using word confusion networks,” in 2012 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2012, pp. 176–181.
  • [18] G. Tür, A. Deoras, and D. Hakkani-Tür, “Semantic parsing using word confusion networks with conditional random fields,” in INTERSPEECH. Citeseer, 2013, pp. 2579–2583.
  • [19] P. G. Shivakumar and P. Georgiou, “Confusion2Vec: Towards enriching vector space word representations with representational ambiguities,” PeerJ Computer Science, vol. 5, p. e195, 2019.
  • [20] P. G. Shivakumar, M. Yang, and P. Georgiou, “Spoken language intent detection using Confusion2Vec,” arXiv preprint arXiv:1904.03576, 2019.

End2end SLU

  • [21] P. Haghani, A. Narayanan, M. Bacchiani, G. Chuang, N. Gaur, P. Moreno, R. Prabhavalkar, Z. Qu, and A. Waters, “From audio to semantics: Approaches to end-to-end spoken language understanding,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 720–726.
  • [22] D. Serdyuk, Y. Wang, C. Fuegen, A. Kumar, B. Liu, and Y. Bengio, “Towards end-to-end spoken language understanding,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5754–5758.
  • [23] L. Lugosch, M. Ravanelli, P. Ignoto, V. S. Tomar, and Y. Bengio, “Speech model pre-training for end-to-end spoken language understanding,” in Interspeech 2019, 2019.
  • [24] N. Tomashenko, A. Caubriere, and Y. Esteve, “Investigating adaptation and transfer learning for end-to-end spoken language understanding from speech,” in Interspeech 2019. ISCA, 2019, pp. 824–828.

using phoneme data

  • [25] P. Yenigalla, A. Kumar, S. Tripathi, C. Singh, S. Kar, and J. Vepa, “Speech emotion recognition using spectrogram & phoneme embedding,” in Interspeech, 2018, pp. 3688–3692.
  • [26] A. Fang, S. Filice, N. Limsopatham, and O. Rokhlenko, “Using phoneme representations to build predictive models robust to ASR errors,” in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 699–708.
  • [27] M. N. Sundararaman, A. Kumar, and J. Vepa, “Phoneme-BERT: Joint language modelling of phoneme sequence and ASR transcript,” arXiv preprint arXiv:2102.00804, 2021.

Standalone project: end-to-end intent recognition

Xiamen University Intelligent Speech Lab

“Towards End-to-End Spoken Language Understanding”

A spoken language understanding system is traditionally designed as a pipeline of several components.

  • First, the audio signal is processed by an automatic speech recognizer to produce a transcription or n-best hypotheses.
  • Second, given the recognition results, a natural language understanding system maps the text to structured data such as domain, intent, and slots for downstream consumers, e.g., dialog systems and hands-free applications.

These components are usually developed and optimized independently. In this paper, the authors present a study on an end-to-end learning system for spoken language understanding. With this unified approach, the semantic meaning can be inferred directly from audio features without an intermediate text representation.

The study showed that the trained model can achieve reasonably good results and demonstrated that it can capture semantic attention directly from the audio features. Index Terms: spoken language understanding, end-to-end training, recurrent neural networks.

End-to-end intent recognition goes directly from audio features to the spoken intent, removing the drawback of optimizing the two stages independently. Moreover, humans do not understand spoken intent word by word, so the end-to-end approach better mimics human processing (a minimal model sketch follows).
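A minimal sketch of such an end-to-end SLU model, assuming PyTorch; the feature dimension, BiGRU encoder, and mean pooling are illustrative choices, not the paper's exact setup. The model maps acoustic feature frames straight to intent logits with no intermediate transcript.

```python
import torch
import torch.nn as nn

class EndToEndIntentClassifier(nn.Module):
    """Sketch: acoustic feature frames -> BiGRU encoder -> mean pooling -> intent logits."""

    def __init__(self, feat_dim=80, hidden=256, num_intents=10):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_intents)

    def forward(self, feats):
        # feats: (batch, num_frames, feat_dim), e.g. log-Mel filter-bank features
        enc, _ = self.encoder(feats)          # (batch, num_frames, 2 * hidden)
        pooled = enc.mean(dim=1)              # average over time frames
        return self.classifier(pooled)        # intent logits, no intermediate transcript


model = EndToEndIntentClassifier()
dummy = torch.randn(4, 300, 80)               # 4 utterances, 300 frames, 80-dim features
print(model(dummy).shape)                      # torch.Size([4, 10])
```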

Acoustic features are generally vector representations in the spectral domain, extracted based on either the speech production mechanism or the human auditory perception mechanism.

Speech feature extraction: preprocessing (pre-emphasis, framing, windowing) -> Fourier transform -> power spectrum; a sketch of this front end follows.
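A NumPy sketch of this front end, assuming a 16 kHz mono waveform; the frame length, hop size, and pre-emphasis coefficient are common defaults, not values prescribed here.

```python
import numpy as np

def power_spectrum(signal, sample_rate=16000, frame_ms=25, hop_ms=10, n_fft=512):
    """Pre-emphasis -> framing -> windowing -> FFT -> power spectrum."""
    # 1. Pre-emphasis: boost high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing: split into overlapping frames.
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    num_frames = 1 + max(0, (len(emphasized) - frame_len) // hop_len)
    frames = np.stack([emphasized[i * hop_len : i * hop_len + frame_len]
                       for i in range(num_frames)])
    # 3. Windowing: apply a Hamming window to each frame.
    frames = frames * np.hamming(frame_len)
    # 4. FFT magnitude, then power spectrum.
    mag = np.abs(np.fft.rfft(frames, n_fft))
    return mag ** 2 / n_fft


# 1 second of random noise as a stand-in waveform.
spec = power_spectrum(np.random.randn(16000))
print(spec.shape)  # (num_frames, n_fft // 2 + 1)
```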

Acoustic features (a short extraction sketch follows the figure below):

  • MFCC: Mel-Frequency Cepstral Coefficients
  • PLP: Perceptual Linear Prediction coefficients
  • F-bank: Filter-bank features
  • Spectrogram
  • CQCC: Constant-Q Cepstral Coefficients

![](2021418_frequent_audio_feature.png =300x200)
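For reference, a short sketch of extracting two of the features listed above (MFCC and log-Mel F-bank) with librosa; the file path, sample rate, and coefficient counts are placeholder assumptions.

```python
import librosa

# "speech.wav" is a placeholder path; 16 kHz, 13 MFCCs, and 40 Mel bands are assumed defaults.
y, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # (13, num_frames)
fbank = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40))  # (40, num_frames) log-Mel F-bank
print(mfcc.shape, fbank.shape)
```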

Advantages of MFCC:

  • Combines the perceptual characteristics of human hearing with the speech production mechanism.
  • The first 12 MFCCs are typically used as the feature vector (i.e., F0 information is removed); the representation is very compact, since these 12 coefficients describe the information in one frame of speech.
  • Compared with F-bank features, the dimensions are less correlated, which makes it easier to build Gaussian mixture models (GMMs); see the toy sketch after this list.
  • Unfortunately, MFCC is not very robust to noise.
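A toy illustration of the GMM point above, assuming scikit-learn; random vectors stand in for real 13-dimensional MFCC frames, so this only shows the modelling pattern (a diagonal-covariance GMM over per-frame features), not a real result.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Random vectors stand in for 13-dimensional MFCC frames (num_frames x coefficients).
rng = np.random.default_rng(0)
frames = rng.standard_normal((1000, 13))
# Weakly correlated dimensions make a diagonal-covariance GMM a reasonable fit.
gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(frames)
print(gmm.score(frames))  # average log-likelihood per frame
```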

ASR

SpeechProcessForMachineLearning