
definition

Few-shot classification aims to recognize unlabeled samples from unseen classes given only a few labeled samples.

Few-shot learning is a subfield of meta learning; the goal is to learn to learn, which is quite different from the supervised-learning mindset:

  • [Learn from abundant data] In domain 1, learn a distance function from a large labeled dataset.
  • [Predict from few samples] In domain 2, use that distance function to find the most similar support sample; its class is the prediction.

The name few-shot learning is somewhat misleading: no learning actually happens on the few samples, so few-shot prediction would be a better fit.

The samples in domain 1 form the training set.

The sample set in domain 2 is the support set, and the new instance to be predicted is the query sample.

The relationship among the three is shown in the figure below:

The support set has two attributes: k-way and n-shot.

k-way: the number of classes in the support set (6 in the figure above).

n-shot: the number of samples per class in the support set (1 in the figure above).
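To make the two-step recipe above concrete, here is a minimal sketch of nearest-prototype prediction for one k-way n-shot episode. It assumes a pretrained embedding function `embed` (learned on the large training set in domain 1) and uses cosine similarity as the distance function; all names are illustrative, not a specific paper's method.

```python
import numpy as np

def predict_query(embed, support_x, support_y, query_x):
    """Nearest-prototype prediction for one k-way n-shot episode."""
    classes = sorted(set(support_y))
    # Class prototype = mean embedding of that class's support samples
    prototypes = np.stack([
        np.mean([embed(x) for x, y in zip(support_x, support_y) if y == c], axis=0)
        for c in classes
    ])
    q = embed(query_x)
    # Cosine similarity plays the role of the learned "distance" function;
    # the query takes the label of the most similar prototype
    sims = prototypes @ q / (
        np.linalg.norm(prototypes, axis=1) * np.linalg.norm(q) + 1e-8
    )
    return classes[int(np.argmax(sims))]
```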

Further, how k and n affect prediction accuracy (figure below): in general, accuracy drops as k grows and improves as n grows.

Reference

Cross Attention Network for Few-shot Classification

A fresh start

Reinstalled the blog and, after half a day of tinkering, everything is ready.

I need to take notes well and write the blog well; that is the only way to go further.

Graduation, work, and love are all on track now.

It feels as if a lifetime of striving has finally reached its moment of joy.

As always: eat well, study well, play well, rest well.

Lin, thank you for staying with me through the hardest years. In the days ahead, I will keep moving toward your light.

JD

Thanks to JD. My abilities may not be enough yet, but in the days to come I will prove myself.

Feelings

Empty yourself, cleanse yourself; you are an upright and sunny person at heart.

NLP

Technical skill is the foundation you stand on.

Plan

Each link in the chain depends on the others; if one link is missing, they all fall.

  • Finish washing up and be in bed by 10:50 p.m.
  • Get up at 8:00, wash up & exercise
  • Start work on time at 9:00
  • After lunch, rest the eyes and take a nap
  • After dinner, take a walk and unwind
  • In the evening, if not busy, review the day's learning and study further
  • 22:00: music, exercise, time together

Travel

Dance

Rule-template-based intent recognition methods generally require manually constructed rule templates and category information to classify user intent text [13].
For consumption-intent recognition, Ramanand et al. [14] proposed rule- and graph-based methods to acquire intent templates, achieving good classification results within a single domain.
Li et al. [15] found that, even within one domain, varied expressions inflate the number of required rule templates, at great cost in manual effort. So although rule-template matching can deliver accurate recognition without large amounts of training data, it cannot avoid the high cost of rebuilding templates whenever the intent categories change.

Statistical-feature classification methods instead extract key features from the corpus, such as character, word, and N-gram features, and then train a classifier to perform intent classification. Commonly used methods include Naive Bayes (NB) [16], AdaBoost [17], Support Vector Machines (SVM) [18], and logistic regression [19].
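As a rough illustration of this feature-plus-classifier recipe, the sketch below builds TF-IDF character n-gram features and trains a Naive Bayes intent classifier with scikit-learn; the example texts and labels are made up, and an SVM could be dropped in the same way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative dataset (hypothetical utterances and intent labels)
texts = ["查询本月话费", "帮我推荐一个景点", "我想看账单"]
intents = ["query_bill", "recommend_poi", "query_bill"]

# Character 1-2 gram TF-IDF features + Naive Bayes classifier;
# sklearn.svm.LinearSVC could replace MultinomialNB for an SVM baseline
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 2)),
    MultinomialNB(),
)
clf.fit(texts, intents)
print(clf.predict(["查一下话费"]))
```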

Chen Haochen [20] applied SVM and Naive Bayes classifiers to consumption-intent classification on Weibo corpora, reaching F1 scores above 70% with both; however, both classifiers rely on hand-crafted features, which are costly to build, offer no guarantee of feature quality, and lead to data sparsity. Because SVMs handle multi-class data poorly and generalize weakly, Jia Junhua [21] introduced the AdaBoost and PSO algorithms, using PSO to optimize the SVM parameters and AdaBoost to ensemble the PSO-SVM classifiers into an AdaBoost-PSOSVM strong classifier, which clearly outperforms a plain SVM on the same dataset. Still, none of these methods can accurately capture the deeper semantics of user text.

Aiming at the problem of low downstream intent-recognition accuracy caused by inaccurate speech-recognition text, this paper combines the N-best text output by the speech recognition module with its acoustic-model and language-model scores, and proposes a discrete bucketing method and a weighted sentence-vector method to improve the robustness of spoken intent recognition.

The speech recognition module produces N recognized texts for each input audio, and each recognized text carries an acoustic-model score and a language-model score. To make better use of this information, the discrete bucketing method divides the N-best texts into intervals according to their acoustic-model and language-model scores and concatenates them for the splicing experiments, fusing the N-best text and its sentence-score information in a coarse-grained way.
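The snippet below is a minimal sketch of that bucketing step, assuming each hypothesis arrives as a (text, acoustic score, language-model score) tuple; the number of buckets and the [SEP] joiner are illustrative choices, not the exact setup used in the experiments.

```python
def bucket_nbest(nbest, num_buckets=3):
    """nbest: list of (text, am_score, lm_score) tuples for one utterance."""
    scored = [(text, am + lm) for text, am, lm in nbest]
    lo = min(s for _, s in scored)
    hi = max(s for _, s in scored)
    width = (hi - lo) / num_buckets or 1.0   # avoid zero width when all scores are equal
    buckets = [[] for _ in range(num_buckets)]
    for text, s in scored:
        idx = min(int((s - lo) / width), num_buckets - 1)
        buckets[idx].append(text)
    # Concatenate hypotheses bucket by bucket: a coarse-grained fusion of
    # N-best text and sentence-score information
    return [" [SEP] ".join(b) for b in buckets if b]
```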

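For the weighted sentence-vector method mentioned above, one plausible reading is to weight each hypothesis's sentence embedding by a softmax over its recognition scores; the sketch below assumes an external `encode` function and illustrates the idea only, not the paper's exact formulation.

```python
import numpy as np

def weighted_sentence_vector(nbest, encode):
    """nbest: list of (text, am_score, lm_score); encode: text -> np.ndarray."""
    texts = [t for t, _, _ in nbest]
    scores = np.array([am + lm for _, am, lm in nbest], dtype=float)
    # Softmax over sentence scores -> one weight per hypothesis
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    vecs = np.stack([encode(t) for t in texts])
    # Score-weighted average of the N-best sentence vectors
    return (weights[:, None] * vecs).sum(axis=0)
```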

Existing pre-trained language models are usually pre-trained on standard written text and perform poorly on noisy speech-recognition text. This paper proposes pre-training on speech-recognition text combined with the corresponding phoneme information, which strengthens the word vectors' ability to represent pronunciation in the pre-trained language model and improves the robustness of spoken intent recognition.
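As a rough illustration of pairing recognized text with its phoneme sequence, the sketch below uses the pypinyin package to approximate phonemes for Chinese text; the [SEP] layout and the example sentence are assumptions for illustration only, not the exact input scheme of this work.

```python
from pypinyin import lazy_pinyin

def build_pretrain_example(asr_text: str) -> str:
    # The pinyin sequence approximates the pronunciation of the text, so
    # acoustically confusable words end up with similar representations
    phonemes = lazy_pinyin(asr_text)
    return asr_text + " [SEP] " + " ".join(phonemes)

print(build_pretrain_example("查询本月话费"))
```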

Research Background and Significance

With the rapid development of deep learning, natural language processing has improved enormously, and on some task-oriented NLP problems machine performance is already close to human level. To make human-machine interaction more natural, research on intelligent spoken dialogue systems has never stopped, and more and more of them have entered daily life, such as Apple's Siri assistant, Amazon's Echo smart speaker, and Microsoft Research Asia's chatbot XiaoIce. An intelligent spoken dialogue system consists of five main components: automatic speech recognition (ASR), spoken language understanding (SLU), dialog management (DM), dialogue generation (DG), and text-to-speech (TTS).

[1] Chen Hongshen, Liu Xiaorui, Yin Dawei, et al. A survey on dialogue systems: recent advances and new frontiers[J]. SIGKDD Explorations, 2017, 19(2): 25-35.

The spoken language understanding module plays a bridging role in an intelligent dialogue system: it takes the text produced by the speech recognition module and performs domain classification, intent recognition, slot filling, and similar tasks, and the results then guide the system's next action. Within the SLU module, domain classification and slot filling can be configured selectively according to task difficulty, but intent recognition is indispensable. An intent is what the speaker wants to express in a single interaction with the dialogue system, i.e., what the user wants to do; intents are usually named by extracting the verbs and nouns in the recognized text, e.g., recommending attractions or querying a phone bill. Intent recognition is essentially a text classification task: given manually predefined intent categories, an intent recognition model is trained on labeled data to predict the intent the user expresses.

During his internship at JD, Qin Jie took part in robustness experiments on intent recognition for logistics outbound calls, as well as in surveying and reproducing related state-of-the-art papers.

Aiming at the low prediction accuracy and poor robustness of downstream intent recognition models caused by inaccurate speech-recognition text, this paper applies deep learning algorithms together with other useful information output by the speech recognizer, such as N-best text, phonemes, and sentence-level acoustic-model and language-model scores, to study robust spoken intent recognition. The main work of this paper is as follows:

  1. Studied papers on improving intent-recognition accuracy in spoken dialogue systems.
  2. Surveyed open-source datasets and materials for intent recognition.
  3. Investigated the recognition outputs of current ASR models, including phonemes, N-best lists, acoustic-model scores, and language-model scores.

Organizing papers on spoken intent recognition

JD AI paper

Building Robust Spoken Language Understanding by Cross Attention between Phoneme Sequence and ASR Hypothesis

Core method: Cross Attention Network for Few-shot Classification (Ruibing Hou et al.)

lattice

  • [6] F. Ladhak, A. Gandhe, M. Dreyer, L. Mathias, A. Rastrow, and B. Hoffmeister, "LatticeRNN: Recurrent neural networks over lattices," in Interspeech, 2016, pp. 695–699.
  • [7] C. Huang and Y. Chen, "Adapting pretrained transformer to lattices for spoken language understanding," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 845–852.
  • [8] C. Huang and Y. Chen, "Learning spoken language representations with neural lattice language modeling," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3764–3769.

n-best

  • [9] F. Peng, S. Roy, B. Shahshahani, and F. Beaufays, "Search results based n-best hypothesis rescoring with maximum entropy classification," in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, 2013, pp. 422–427.
  • [10] A. J. Kumar, C. Morales, M.-E. Vidal, C. Schmidt, and S. Auer, "Use of knowledge graph in rescoring the n-best list in automatic speech recognition," arXiv preprint arXiv:1705.08018, 2017.
  • [11] A. Ogawa, M. Delcroix, S. Karita, and T. Nakatani, "Rescoring n-best speech recognition list based on one-on-one hypothesis comparison using encoder-classifier model," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 6099–6103.
  • [12] A. Ogawa, M. Delcroix, S. Karita, and T. Nakatani, "Improved deep duel model for rescoring n-best speech recognition list using backward LSTMLM and ensemble encoders," in INTERSPEECH, 2019, pp. 3900–3904.
  • [13] M. Li, X. Liu, W. Ruan, L. Soldaini, W. Hamza, and C. Su, "Multi-task learning of spoken language understanding by integrating n-best hypotheses with hierarchical attention," in Proceedings of the 28th International Conference on Computational Linguistics: Industry Track, 2020, pp. 113–123.
  • [14] M. Li, W. Ruan, X. Liu, L. Soldaini, W. Hamza, and C. Su, "Improving spoken language understanding by exploiting asr n-best hypotheses," arXiv preprint arXiv:2001.05284, 2020.

word confusion network

  • [15] G. Tur, J. Wright, A. Gorin, G. Riccardi, and D. Hakkani-Tür, "Improving spoken language understanding using word confusion networks," in Seventh International Conference on Spoken Language Processing, 2002.
  • [16] D. Hakkani-Tür, F. Béchet, G. Riccardi, and G. Tur, "Beyond ASR 1-best: Using word confusion networks in spoken language understanding," Computer Speech & Language, vol. 20, no. 4, pp. 495–514, 2006.
  • [17] M. Henderson, M. Gašić, B. Thomson, P. Tsiakoulis, K. Yu, and S. Young, "Discriminative spoken language understanding using word confusion networks," in 2012 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2012, pp. 176–181.
  • [18] G. Tür, A. Deoras, and D. Hakkani-Tür, "Semantic parsing using word confusion networks with conditional random fields," in INTERSPEECH. Citeseer, 2013, pp. 2579–2583.
  • [19] P. G. Shivakumar and P. Georgiou, "Confusion2Vec: Towards enriching vector space word representations with representational ambiguities," PeerJ Computer Science, vol. 5, p. e195, 2019.
  • [20] P. G. Shivakumar, M. Yang, and P. Georgiou, "Spoken language intent detection using Confusion2Vec," arXiv preprint arXiv:1904.03576, 2019.

End2end SLU

  • [21] P. Haghani, A. Narayanan, M. Bacchiani, G. Chuang, N. Gaur, P. Moreno, R. Prabhavalkar, Z. Qu, and A. Waters, "From audio to semantics: Approaches to end-to-end spoken language understanding," in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 720–726.
  • [22] D. Serdyuk, Y. Wang, C. Fuegen, A. Kumar, B. Liu, and Y. Bengio, "Towards end-to-end spoken language understanding," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5754–5758.
  • [23] L. Lugosch, M. Ravanelli, P. Ignoto, V. S. Tomar, and Y. Bengio, "Speech model pre-training for end-to-end spoken language understanding," in Interspeech 2019, 2019.
  • [24] N. Tomashenko, A. Caubriere, and Y. Esteve, "Investigating adaptation and transfer learning for end-to-end spoken language understanding from speech," in Interspeech 2019. ISCA, 2019, pp. 824–828.

using phoneme data

  • [25] P. Yenigalla, A. Kumar, S. Tripathi, C. Singh, S. Kar, and J. Vepa, "Speech emotion recognition using spectrogram & phoneme embedding," in Interspeech, 2018, pp. 3688–3692.
  • [26] A. Fang, S. Filice, N. Limsopatham, and O. Rokhlenko, "Using phoneme representations to build predictive models robust to ASR errors," in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 699–708.
  • [27] M. N. Sundararaman, A. Kumar, and J. Vepa, "Phoneme-BERT: Joint language modelling of phoneme sequence and ASR transcript," arXiv preprint arXiv:2102.00804, 2021.

Independent project: end-to-end intent recognition

Xiamen University Intelligent Speech Lab

《TOWARDS END-TO-END SPOKEN LANGUAGE UNDERSTANDING》

A spoken language understanding system is traditionally designed as a pipeline of a number of components.

  • First, the audio signal is processed by an automatic speech recognizer for transcription or n-best hypotheses.
  • Second, with the recognition results, a natural language understanding system classifies the text to structured data as domain, intent and slots for down-streaming consumers, such as dialog system, hands-free applications.

These components are usually developed and optimized independently. In this paper, we present our study on an end-to-end learning system for spoken language understanding. With this unified approach, we can infer the semantic meaning directly from audio features without the intermediate text representation.

This study showed that the trained model can achieve reasonably good results and demonstrated that the model can capture semantic attention directly from the audio features.

Index Terms: spoken language understanding, end-to-end training, recurrent neural networks

End-to-end intent recognition goes directly from audio features to spoken intent, removing the drawback of optimizing the two stages independently. Moreover, humans do not understand spoken intent word by word, so the end-to-end approach better mirrors how humans process speech.
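To make the idea concrete, here is a minimal PyTorch sketch of an utterance-level classifier that maps a sequence of acoustic feature frames straight to intent logits; the BiLSTM architecture, feature dimension, and intent count are illustrative assumptions, not the paper's actual model.

```python
import torch
import torch.nn as nn

class End2EndIntent(nn.Module):
    def __init__(self, feat_dim=40, hidden=128, num_intents=10):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_intents)

    def forward(self, feats):              # feats: (batch, frames, feat_dim)
        out, _ = self.encoder(feats)       # encode the whole utterance
        pooled = out.mean(dim=1)           # average pooling over frames
        return self.classifier(pooled)     # intent logits

logits = End2EndIntent()(torch.randn(2, 200, 40))
print(logits.shape)  # (2, 10)
```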

Acoustic features are generally vector representations in the spectral domain, extracted on the basis of either the speech production mechanism or the human auditory perception mechanism.

Speech feature extraction: preprocessing (pre-emphasis, framing, windowing) -> Fourier transform -> power spectrum.

Acoustic features:

  • MFCC: Mel-Frequency Cepstral Coefficients
  • PLP: Perceptual Linear Prediction coefficients
  • F-bank: filter-bank features
  • Spectrogram
  • CQCC: Constant-Q Cepstral Coefficients

![](2021418_frequent_audio_feature.png =300x200)

Advantages of MFCC:

  • Combines the characteristics of human auditory perception with the speech production mechanism.
  • The first 12 MFCCs are usually used as the feature vector (i.e., the F0 information is removed), a very compact representation of the information in one frame of speech.
  • Lower correlation between dimensions than FBank features, which makes it easier to fit Gaussian mixture models (GMMs).
  • Unfortunately, MFCC is not very robust to noise.
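The sketch below shows one way to extract the F-bank and MFCC features described above with librosa; the file path and dimensions are placeholders, and librosa handles the framing, windowing, FFT, and Mel filtering steps internally.

```python
import librosa

# Load audio at 16 kHz (the file name is a placeholder)
y, sr = librosa.load("example.wav", sr=16000)

# 40-dimensional log-Mel filter bank (F-bank) features
fbank = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
)

# First 12 MFCCs, the compact per-frame representation mentioned above
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)

print(fbank.shape, mfcc.shape)  # (n_mels, frames), (n_mfcc, frames)
```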

ASR

SpeechProcessForMachineLearning

Welcome to Hexo! This is your very first post. Check documentation for more info. If you get any problems when using Hexo, you can find the answer in troubleshooting or you can ask me on GitHub.

Quick Start

Create a new post

$ hexo new "My New Post"

More info: Writing

Run server

$ hexo server

More info: Server

Generate static files

$ hexo generate

More info: Generating

Deploy to remote sites

$ hexo deploy

More info: Deployment

hexo help

hexo help

---

Store images in a folder named after the Markdown file, then reference them directly by file name.

https://zhuanlan.zhihu.com/p/104996801

  1. User cold start (key point)

**This is bold text**

*This is italic text*

***This is bold and italic text***

~~This is strikethrough text~~


> This is quoted content
> This is quoted content


![alt text](image URL "image title")
![](https://raw.githubusercontent.com/rejae/rejae.github.io/master/img/20191028attention.jpg)
![blockchain](https://ss0.bdstatic.com/70cFvHSh_Q1YnxGkpoWK1HF6hhy/it/u=702257389,1274025419&fm=27&gp=0.jpg "blockchain")


[link text](URL "title")
The title is optional.

Jianshu
Baidu


  • List item
  • List item
  • List item

1. List item
2. List item
3. List item


Nested lists
Indent the child level three spaces under its parent.

- First-level unordered list item
   - Second-level unordered list item
   - Second-level unordered list item
   - Second-level unordered list item
      - Third-level item

Table
Name|Skill|Rank
--|:--:|--:
Liu Bei|crying|eldest brother
Guan Yu|fighting|second brother
Zhang Fei|scolding|third brother


Code block

function fun(){
    echo "This is a really awesome line of code";
}
fun();


Flow chart

st=>start: Start
op=>operation: My Operation
cond=>condition: Yes or No?
e=>end
st->op->cond
cond(yes)->e
cond(no)->op

Greek letters: α β γ δ ε ζ η θ ι κ λ μ ν ξ ο π ρ σ τ υ φ χ ψ ω

Logarithm symbols: ㏒ ㏑

Plus and minus signs: ± ﹢