
      Plain-to-clear speech video conversion for enhanced intelligibility


          Abstract

Clearly articulated speech, relative to plain-style speech, has been shown to improve intelligibility. We examine whether visible speech cues in video alone can be systematically modified to enhance clear-speech visual features and improve intelligibility. We extract clear-speech visual features of English words varying in vowels, produced by multiple male and female talkers. Via a frame-by-frame image-warping-based video generation method with a controllable parameter (displacement factor), we apply the extracted clear-speech visual features to videos of plain speech to synthesize clear-speech videos. We evaluate the generated videos using a robust, state-of-the-art AI lip reader as well as human intelligibility testing. The contributions of this study are: (1) we successfully extract relevant visual cues for video modifications across speech styles and achieve enhanced intelligibility for AI; (2) this work suggests that universal, talker-independent clear-speech features may be used to modify any talker’s visual speech style; (3) we introduce the “displacement factor” as a way of systematically scaling the magnitude of displacement modifications between speech styles; and (4) the generated videos are high definition, making them ideal candidates for human-centric intelligibility and perceptual training studies.


                Author and article information

                Contributors
                shubam_sachdeva@sfu.ca
                haoyao_ruan@sfu.ca
                hamarneh@sfu.ca
                dawn.behne@ntnu.no
                jongman@ku.edu
                sereno@ku.edu
                yuew@sfu.ca
Journal
International Journal of Speech Technology (Int J Speech Technol)
Publisher: Springer US (New York)
ISSN: 1381-2416 (print); 1572-8110 (electronic)
Published: 28 January 2023
Volume 26, Issue 1, pp. 163-184
Affiliations
[1] Language and Brain Lab, Department of Linguistics, Simon Fraser University, Burnaby, BC, Canada
[2] Medical Image Analysis Research Group, School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
[3] NTNU Speech Lab, Department of Psychology, Norwegian University of Science and Technology, Trondheim, Norway
[4] KU Phonetics and Psycholinguistics Lab, Department of Linguistics, University of Kansas, Lawrence, KS, USA
                Author information
                http://orcid.org/0000-0003-3862-3767
Article
Article ID: 10018
DOI: 10.1007/s10772-023-10018-z
PMCID: PMC10042924
                © The Author(s) 2023

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

History
Received: 28 May 2022
Accepted: 8 January 2023
Funding
Funded by: Social Sciences and Humanities Research Council of Canada (FundRef: http://dx.doi.org/10.13039/501100000155); Award ID: 435-2019-1065
Funded by: Simon Fraser University (FundRef: http://dx.doi.org/10.13039/501100004326)
                Categories
                Article
                Custom metadata
                © Springer Science+Business Media, LLC, part of Springer Nature 2023

Keywords: video speech synthesis, speech style, intelligibility, AI lip reading, speech enhancement
