Advances in Completely Automated Vowel Analysis for Sociophonetics: Using End-to-End Speech Recognition Systems With DARLA

Abstract

In recent decades, computational approaches to sociophonetic vowel analysis have been steadily increasing, and sociolinguists now frequently use semi-automated systems for phonetic alignment and vowel formant extraction, including FAVE (Forced Alignment and Vowel Extraction, Rosenfelder et al., 2011; Evanini et al., Proceedings of Interspeech, 2009), Penn Aligner (Yuan and Liberman, J. Acoust. Soc. America, 2008, 123, 3878), and DARLA (Dartmouth Linguistic Automation, Reddy and Stanford, DARLA Dartmouth Linguistic Automation: Online Tools for Linguistic Research, 2015a). Yet these systems still have a major bottleneck: manual transcription. For most modern sociolinguistic vowel alignment and formant extraction, researchers must first create manual transcriptions. This human step is painstaking, time-consuming, and resource intensive. If this manual step could be replaced with completely automated methods, sociolinguists could potentially tap into vast datasets that have previously been unexplored, including legacy recordings that are underutilized due to lack of transcriptions. Moreover, if sociolinguists could quickly and accurately extract phonetic information from the millions of hours of new audio content posted on the Internet every day, a virtual ocean of speech from newly created podcasts, videos, live-streams, and other audio content could inform research. How close are the current technological tools to achieving such groundbreaking changes for sociolinguistics? Prior work (Reddy et al., Proceedings of the North American Association for Computational Linguistics 2015 Conference, 2015b, 71–75) showed that an HMM-based automatic speech recognition (ASR) system, trained with CMU Sphinx (Lamere et al., 2003), was accurate enough for DARLA to uncover evidence of the US Southern Vowel Shift without any human transcription. Even so, because that ASR system relied on a small training set, it produced numerous transcription errors.
Six years have passed since that study, and in that time numerous end-to-end ASR algorithms have shown considerable improvement in transcription quality. One example of such a system is the RNN/CTC-based DeepSpeech from Mozilla (Hannun et al., 2014). (RNN stands for recurrent neural networks, the learning mechanism for DeepSpeech. CTC stands for connectionist temporal classification, the mechanism to merge phones into words.) The present paper combines DeepSpeech with DARLA to push the technological envelope and determine how well contemporary ASR systems can perform in completely automated vowel analyses with sociolinguistic goals. Specifically, we used these techniques on audio recordings from 352 North American English speakers in the International Dialects of English Archive (IDEA), extracting 88,500 tokens of vowels in stressed position from spontaneous, free speech passages. With this large dataset we conducted acoustic sociophonetic analyses of the Southern Vowel Shift and the Northern Cities Chain Shift in the North American IDEA speakers. We compared the results using three different sources of transcriptions: 1) IDEA’s manual transcriptions as the baseline “ground truth”, 2) the ASR built on CMU Sphinx used by Reddy et al. (Proceedings of the North American Association for Computational Linguistics 2015 Conference, 2015b, 71–75), and 3) the latest publicly available Mozilla DeepSpeech system. We input these three different transcriptions to DARLA, which automatically aligned and extracted the vowel formants from the 352 IDEA speakers. Our quantitative results indicate that newer ASR systems like DeepSpeech show considerable promise for sociolinguistic applications like DARLA. We found that DeepSpeech’s automated transcriptions had a significantly lower character error rate than those from the prior Sphinx system (a drop from 46% to 35%).
When we performed the sociolinguistic analysis of the extracted vowel formants from DARLA, we found that the automated transcriptions from DeepSpeech matched the results from the ground truth for the Southern Vowel Shift (SVS): five vowels showed a shift in both transcriptions, and two vowels did not show a shift in either transcription. The Northern Cities Shift (NCS) was more difficult to detect, but ground truth and DeepSpeech matched for four vowels: one vowel showed a clear shift, and three showed no shift in either transcription. Our study therefore shows how technology has made progress toward greater automation in vowel sociophonetics, while also showing what remains to be done. Our statistical modeling provides a quantified view of both the abilities and the limitations of a completely “hands-free” analysis of vowel shifts in a large dataset. Naturally, when comparing a completely automated system against a semi-automated system involving human manual work, there will always be a tradeoff between accuracy on the one hand and speed and replicability on the other [Kendall and Joseph, Towards best practices in sociophonetics (with Marianna DiPaolo), 2014]. The amount of “noise” that can be tolerated for a given study will depend on the particular research goals and researchers’ preferences. Nonetheless, our study shows that, for certain large-scale applications and research goals, a completely automated approach using publicly available ASR can produce meaningful sociolinguistic results across large datasets, and these results can be generated quickly, efficiently, and with full replicability.
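The abstract compares transcription sources by character error rate (CER) but does not define the metric here. The following is a minimal sketch of one common definition, assuming CER is the character-level edit distance between an ASR hypothesis and the manual reference, divided by the length of the reference. The function names are illustrative, and the authors' exact scoring procedure (for example, any text normalization applied before scoring) is not stated in the abstract.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn `reference` into `hypothesis`."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]  # distance from i reference characters to an empty hypothesis
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical transcriptions, not taken from the IDEA data.
    manual = "the cat sat"
    automatic = "the cat sad"
    print(f"CER = {character_error_rate(manual, automatic):.3f}")
```

Under this definition, a CER of 35% means that roughly 35 character edits per 100 reference characters are needed to recover the manual transcription. Because insertions are counted, the value can exceed 100% for very poor hypotheses.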

Author and article information

Contributors

Rolando Coto-Solano: URI : https://loop.frontiersin.org/people/895479/overview

James N. Stanford: URI : https://loop.frontiersin.org/people/528082/overview

Sravana K. Reddy: URI : https://loop.frontiersin.org/people/1423213/overview

Journal

Journal ID (nlm-ta): Front Artif Intell

Journal ID (iso-abbrev): Front Artif Intell

Journal ID (publisher-id): Front. Artif. Intell.

Title: Frontiers in Artificial Intelligence

Publisher: Frontiers Media S.A.

ISSN (Electronic): 2624-8212

Publication date (Electronic): 24 September 2021

Publication date Collection: 2021

Volume: 4

Electronic Location Identifier: 662097

Affiliations

Dartmouth College, Hanover, NH, United States

Author notes

Edited by: Joey Stanley, Brigham Young University, United States

Reviewed by: Morgan Sonderegger, McGill University, Canada

Eleanor Chodroff, University of York, United Kingdom

Valerie Fridland, University of Nevada, Reno, United States

*Correspondence: Rolando Coto-Solano, Rolando.A.Coto.Solano@Dartmouth.edu; James N. Stanford, James.N.Stanford@Dartmouth.edu; Sravana K. Reddy, sravana.reddy@gmail.com

This article was submitted to Language and Computation, a section of the journal Frontiers in Artificial Intelligence

Article

Publisher ID: 662097

DOI: 10.3389/frai.2021.662097

PMC ID: 8498339

SO-VID: 14e34054-2909-4b02-8880-c9c2df713b40

License:

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
