Rule-template-based intent recognition methods generally require manually constructed rule templates and category information in order to classify user intent text [13]. For consumption intent recognition, Ramanand et al. [14] proposed a rule- and graph-based method for acquiring intent templates, which achieved good classification results in a single domain. Li et al. [15] found that, even within the same domain, different ways of expressing the same intent cause the number of rule templates to grow, consuming considerable manpower and resources. Therefore, although rule-template matching can guarantee recognition accuracy without large amounts of training data, it cannot avoid the high cost of reconstructing the templates whenever the intent categories change.
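
The following is a minimal, illustrative sketch of this kind of rule-template matching in Python; the intent names and regular-expression templates are hypothetical examples, not templates from the cited work.

    import re

    # Hypothetical hand-written templates: each intent category is matched by a few
    # regular-expression patterns that an expert must author and maintain.
    RULE_TEMPLATES = {
        "book_flight": [r"book .* flight", r"fly (to|from) \w+"],
        "check_weather": [r"weather (in|at) \w+", r"(rain|snow) (today|tomorrow)"],
    }

    def rule_based_intent(text: str) -> str:
        """Return the first intent whose templates match the text, else 'unknown'."""
        for intent, patterns in RULE_TEMPLATES.items():
            if any(re.search(p, text.lower()) for p in patterns):
                return intent
        return "unknown"

    print(rule_based_intent("I want to book a direct flight to Beijing"))  # book_flight

Because every new phrasing may require a new pattern, the template set grows quickly, which is exactly the maintenance cost described above.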

Methods based on statistical feature classification instead extract key features from the corpus text, such as character features, word features, and N-grams, and then train a classifier to perform intent classification. Commonly used classifiers include Naive Bayes (NB) [16], AdaBoost [17], Support Vector Machine (SVM) [18], and logistic regression [19].
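
As a concrete illustration of this pipeline, the sketch below uses scikit-learn to build word 1-/2-gram count features and train a Naive Bayes and a linear SVM classifier; the toy utterances and intent labels are invented for illustration, not data from the cited works.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy labeled utterances; a real system trains on an annotated corpus.
    texts = ["recommend some scenic spots", "any good attractions nearby",
             "check my phone bill", "how much phone credit is left"]
    labels = ["recommend_attractions", "recommend_attractions",
              "check_bill", "check_bill"]

    # ngram_range=(1, 2) extracts unigram and bigram (N-gram) count features.
    nb_model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB()).fit(texts, labels)
    svm_model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC()).fit(texts, labels)

    print(nb_model.predict(["recommend a scenic spot"]))  # expected: ['recommend_attractions']
    print(svm_model.predict(["check the phone bill"]))    # expected: ['check_bill']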

Chen Haochen [20] used SVM and Naive Bayes classifiers to classify consumption intent in Weibo corpora, and both reached F1 scores above 70%; however, both classifiers depend on manually engineered features, which is not only costly but also offers no guarantee of feature quality, and it further leads to data sparsity. Because SVM classifies multi-class data poorly and generalizes weakly, Jia Junhua [21] introduced the AdaBoost and PSO algorithms, using PSO to optimize the SVM parameters and AdaBoost to ensemble the resulting PSO-SVM classifiers into a strong AdaBoost-PSOSVM classifier, whose classification performance on the same dataset is clearly better than that of a plain SVM. Nevertheless, none of these methods can accurately capture the deeper semantic information in user text.
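
For illustration only, the sketch below shows the ensembling step of such an approach: a small hand-rolled AdaBoost loop over RBF-SVM base learners with fixed hyperparameters (the cited work additionally searches the SVM parameters with PSO, which is omitted here). Labels are assumed to be binary in {-1, +1}; all data are toy values.

    import numpy as np
    from sklearn.svm import SVC

    def adaboost_svm(X, y, n_rounds=5, C=1.0, gamma="scale"):
        """Classic binary AdaBoost using weighted RBF-SVM base learners."""
        w = np.full(len(y), 1.0 / len(y))              # sample weights
        learners, alphas = [], []
        for _ in range(n_rounds):
            clf = SVC(C=C, gamma=gamma).fit(X, y, sample_weight=w)
            pred = clf.predict(X)
            err = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)      # weight of this base learner
            w = w * np.exp(-alpha * y * pred)          # up-weight misclassified samples
            w = w / w.sum()
            learners.append(clf)
            alphas.append(alpha)
        return learners, np.array(alphas)

    def adaboost_predict(learners, alphas, X):
        scores = sum(a * clf.predict(X) for clf, a in zip(learners, alphas))
        return np.sign(scores)

    # Toy usage with 2-D points and labels in {-1, +1}.
    X = np.array([[0.0, 0.0], [0.1, 0.9], [0.9, 0.1], [1.0, 1.0], [0.2, 0.2], [0.8, 0.9]])
    y = np.array([-1, 1, 1, -1, -1, -1])
    learners, alphas = adaboost_svm(X, y, n_rounds=3)
    print(adaboost_predict(learners, alphas, X))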

Aiming at the problem that inaccurate speech recognition text lowers the accuracy of downstream intent recognition, this paper combines the N-best texts output by the speech recognition module with their acoustic model scores and language model scores, and proposes a discrete bucketing method and a weighted sentence vector method to improve the robustness of spoken intent recognition.
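
A minimal sketch of the weighted sentence vector idea follows, assuming each N-best hypothesis has already been encoded into a sentence embedding (for example by a pre-trained language model); the interpolation weight alpha and the softmax weighting are illustrative assumptions, not the exact formulation used later in the paper.

    import numpy as np

    def weighted_sentence_vector(sent_vecs, am_scores, lm_scores, alpha=0.5):
        """Combine N-best sentence vectors using score-derived softmax weights.

        sent_vecs : (N, d) array of sentence embeddings, one per hypothesis
        am_scores : (N,) acoustic model scores (log-domain)
        lm_scores : (N,) language model scores (log-domain)
        alpha     : interpolation weight between the two scores
        """
        sent_vecs = np.asarray(sent_vecs, dtype=float)
        combined = alpha * np.asarray(am_scores, float) + (1 - alpha) * np.asarray(lm_scores, float)
        combined = combined - combined.max()          # numerical stability for softmax
        weights = np.exp(combined) / np.exp(combined).sum()
        return weights @ sent_vecs                    # (d,) weighted sentence vector

    # Toy usage: 3 hypotheses with 4-dimensional embeddings.
    vecs = np.array([[0.9, 0.1, 0.0, 0.3],
                     [0.8, 0.2, 0.1, 0.4],
                     [0.1, 0.7, 0.9, 0.2]])
    print(weighted_sentence_vector(vecs, am_scores=[-12.3, -14.1, -15.8],
                                   lm_scores=[-8.2, -7.9, -9.5]))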

The speech recognition module produces N recognized texts for a single input audio clip, and each recognized text has a corresponding acoustic model score and language model score. To make better use of this information, the discrete bucketing method divides the corresponding N-best texts into different intervals according to their acoustic model scores and language model scores for the splicing experiments, thereby fusing the N-best text information with the corresponding sentence score information in a coarse-grained manner.
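
The sketch below illustrates one plausible reading of this bucketing-and-splicing step: hypotheses are mapped to coarse score intervals and concatenated with bucket tags. The number of buckets, the equal-width intervals, and tag/separator strings such as [AM0] and [SEP] are assumptions for illustration only.

    import numpy as np

    def bucketize(scores, n_buckets=3):
        """Map each score to a bucket index 0..n_buckets-1 using equal-width intervals."""
        scores = np.asarray(scores, dtype=float)
        edges = np.linspace(scores.min(), scores.max(), n_buckets + 1)[1:-1]
        return np.digitize(scores, edges)

    def splice_nbest(hypotheses, am_scores, lm_scores, n_buckets=3):
        """Concatenate N-best texts, prefixing each with its (acoustic, language) bucket tags."""
        am_buckets = bucketize(am_scores, n_buckets)
        lm_buckets = bucketize(lm_scores, n_buckets)
        pieces = [f"[AM{a}][LM{l}] {text}"
                  for text, a, l in zip(hypotheses, am_buckets, lm_buckets)]
        return " [SEP] ".join(pieces)

    nbest = ["check my phone bill", "check my phone bell", "checked my phone bill"]
    print(splice_nbest(nbest, am_scores=[-10.2, -11.7, -13.4], lm_scores=[-6.1, -7.4, -7.0]))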

Existing pre-trained language models are usually pre-trained on standard written text and therefore perform poorly on noisy speech recognition text. This paper proposes pre-training on speech recognition text combined with the corresponding phoneme information, which enhances the representation ability of the word vectors in the pre-trained language model at the pronunciation level and improves the robustness of spoken intent recognition.
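
A minimal sketch of constructing such a text-plus-phoneme pre-training input follows; it uses the pypinyin package to produce pinyin with tone numbers as a stand-in for the phoneme sequence, and the BERT-style [CLS]/[SEP] segment layout is an assumption about how the two views might be paired, not the paper's exact input format.

    from pypinyin import Style, lazy_pinyin

    def build_pretraining_input(asr_text: str) -> str:
        """Pair the ASR text with its phoneme (pinyin-with-tone) sequence as one input."""
        phonemes = lazy_pinyin(asr_text, style=Style.TONE3)   # e.g. ['cha2', 'xun2', ...]
        return "[CLS] " + " ".join(asr_text) + " [SEP] " + " ".join(phonemes) + " [SEP]"

    print(build_pretraining_input("查询话费"))
    # [CLS] 查 询 话 费 [SEP] cha2 xun2 hua4 fei4 [SEP]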

Research Background and Significance

With the rapid development of deep learning, natural language processing has also advanced enormously, and on some task-oriented NLP problems system performance already approaches human levels. To make human-machine interaction more natural, research on intelligent spoken dialogue systems has continued without interruption since it began. More and more intelligent dialogue systems have become part of everyday life, such as Apple's Siri assistant, Amazon's Echo smart speaker, and XiaoIce, the chatbot from Microsoft Research Asia. An intelligent spoken dialogue system consists of five main parts: automatic speech recognition (ASR), spoken language understanding (SLU), dialog management (DM), dialogue generation (DG), and text-to-speech (TTS).

The spoken language understanding module plays a bridging role in an intelligent dialogue system: it receives the text recognized by the speech recognition module and performs tasks such as domain recognition, intent recognition, and slot filling, and the processed results then guide the dialogue system's next action. Within the SLU module, domain recognition and slot filling can be configured optionally depending on how difficult the task is, but intent recognition is indispensable. An intent is what the speaker wants to express in one interaction with the dialogue system, that is, what the user wants to do; intents are usually named by extracting the verbs and nouns from the recognized text, for example "recommend attractions" or "check phone bill". Intent recognition is essentially a text classification task: given a set of manually predefined intent categories, an intent recognition model is trained on labeled data to predict the intent expressed by the user.

During his internship at JD.com, Qin Jie took part in robustness experiments on intent recognition for logistics outbound calls, as well as in surveying and reproducing related state-of-the-art papers.

Aiming at the problem that inaccurate speech recognition text lowers the prediction accuracy and robustness of downstream intent recognition models, this paper adopts deep learning algorithms and combines other useful information output by the speech recognition module, such as N-best texts, phonemes, and sentence score information, to conduct research on robust spoken intent recognition. The main work of this paper is as follows:

  1. Survey papers on improving intent recognition accuracy in spoken dialogue systems.
  2. Collect open-source data and materials related to intent recognition.
  3. Study the recognition information output by current speech recognition models, including phonemes, N-best information, acoustic model scores, and language model scores.