Data-centric AI



      Comparing Data Augmentation and Annotation Standardization to Improve End-to-end Spoken Language Understanding Models

      Leah Nicolich-Henkin, Taichi Nakatani, Zach Trozenski, Joel Whiteman, Nathan Susanj

      All-neural end-to-end (E2E) Spoken Language Understanding (SLU) models can improve performance over traditional compositional SLU models, but they require high-quality training data with both audio and annotations. In particular, they struggle on “golden utterances”, which are essential for defining and supporting features but may lack sufficient training data. In this paper, we compare two data-centric AI methods for improving performance on golden utterances: improving the annotation quality of existing training utterances, and augmenting the training data with varying amounts of synthetic data. Our experimental results show improvements with both methods, and in particular that augmenting with synthetic data is effective in addressing errors caused by both inconsistent training data annotations and a lack of training data. This method improves intent recognition error rate (IRER) on our golden utterance test set by 93% relative to the baseline, with no negative impact on other test metrics.
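
      To make the "relative to the baseline" wording concrete, the sketch below shows one way such a figure could be computed. It is an illustration only: the abstract does not define IRER precisely, so the code assumes it is the fraction of test utterances whose predicted intent differs from the reference intent, and that relative improvement is (baseline - improved) / baseline. The function names and the toy intent labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def intent_error_rate(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of utterances whose predicted intent differs from the reference.

    Assumed definition of IRER; the paper may define it differently.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one utterance is required")
    errors = sum(p != r for p, r in zip(predicted, reference))
    return errors / len(reference)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative reduction in error rate: (baseline - improved) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Toy data only; these labels and predictions are made up for illustration.
    reference = ["PlayMusic", "SetTimer", "GetWeather", "PlayMusic"]
    baseline_pred = ["PlayMusic", "GetWeather", "SetTimer", "PlayMusic"]
    augmented_pred = ["PlayMusic", "SetTimer", "SetTimer", "PlayMusic"]

    base = intent_error_rate(baseline_pred, reference)
    aug = intent_error_rate(augmented_pred, reference)
    print(f"baseline IRER: {base:.2f}, augmented IRER: {aug:.2f}")
    print(f"relative improvement: {relative_improvement(base, aug):.0%}")
```

      Under this reading, a 93% relative improvement means the error rate on the golden utterance test set fell to 7% of its baseline value, whatever that baseline was.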

      This video is from the NeurIPS 2021 Data-centric AI workshop proceedings.



