Comparing Data Augmentation and Annotation Standardization to Improve End-to-end Spoken Language Understanding Models
Leah Nicolich-Henkin, Taichi Nakatani, Zach Trozenski, Joel Whiteman, Nathan Susanj
All-neural end-to-end (E2E) Spoken Language Understanding (SLU) models can outperform traditional compositional SLU models, but they require high-quality training data with both audio and annotations. In particular, they struggle on “golden utterances”, which are essential for defining and supporting features but may lack sufficient training data. In this paper, we compare two data-centric AI methods for improving performance on golden utterances: improving the annotation quality of existing training utterances, and augmenting the training data with varying amounts of synthetic data. Our experimental results show improvements from both methods; in particular, augmenting with synthetic data is effective at addressing errors caused both by inconsistent training annotations and by a lack of training data. This method reduces the intent recognition error rate (IRER) on our golden utterance test set by 93% relative to the baseline, without negatively impacting other test metrics.
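For concreteness, a 93% relative improvement means the IRER falls to 7% of its baseline value. As a purely illustrative example (these numbers are not from the paper), a baseline IRER of 10.0% dropping to 0.7% after augmentation would correspond to:

$$
\text{relative improvement} = \frac{\mathrm{IRER}_{\text{baseline}} - \mathrm{IRER}_{\text{augmented}}}{\mathrm{IRER}_{\text{baseline}}} = \frac{10.0 - 0.7}{10.0} = 0.93.
$$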