
      Deep causal speech enhancement and recognition using efficient long-short term memory Recurrent Neural Network

Research article
      PLOS ONE
      Public Library of Science


          Abstract

Long short-term memory (LSTM) networks have been used effectively to represent sequential data in recent years. However, LSTM still struggles to capture long-term temporal dependencies. In this paper, we propose an hourglass-shaped LSTM that captures long-term temporal correlations by reducing the feature resolution without data loss. We use skip connections between non-adjacent layers to avoid gradient decay, and an attention mechanism is incorporated into the skip connections to emphasize the essential spectral features and spectral regions. The proposed LSTM model is applied to speech enhancement and recognition. Because it uses no future information, the model is a causal system suitable for real-time processing. Combined spectral feature sets are used to train the LSTM model for improved performance, and the ideal ratio mask (IRM) is estimated as the training objective. Experimental evaluations using short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ) demonstrate that the proposed model, with its robust feature representation, achieves higher speech intelligibility and perceptual quality. On the TIMIT, LibriSpeech, and VoiceBank datasets, the proposed model improved STOI by 16.21%, 16.41%, and 18.33% over noisy speech, respectively, while PESQ improved by 31.1%, 32.9%, and 32%. In both seen and unseen noisy conditions, the proposed model outperformed existing deep neural networks (DNNs), including a baseline LSTM, a feedforward neural network (FDNN), a convolutional neural network (CNN), and a generative adversarial network (GAN). With the Kaldi toolkit for automatic speech recognition (ASR), the proposed model significantly reduced word error rates (WERs), reaching an average WER of 15.13% in noisy backgrounds.
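The abstract names the training objective (IRM estimation) and the causal constraint, but the paper's code is not reproduced here. Below is a minimal PyTorch sketch, under stated assumptions, of those two ingredients: an ideal ratio mask (IRM) target and a causal (unidirectional) LSTM mask estimator. The feature size, hidden size, β exponent, and all identifiers (ideal_ratio_mask, CausalLSTMMasker) are illustrative assumptions, not the authors' implementation; the hourglass resolution reduction and attention-gated skip connections described in the paper are omitted.

```python
import torch
import torch.nn as nn

def ideal_ratio_mask(clean_mag, noise_mag, beta=0.5):
    """IRM target per time-frequency unit: (S^2 / (S^2 + N^2))^beta.
    beta = 0.5 is the common choice in the IRM literature; the paper's
    exact exponent is not stated in the abstract (assumption)."""
    s2 = clean_mag ** 2
    n2 = noise_mag ** 2
    return (s2 / (s2 + n2 + 1e-8)) ** beta

class CausalLSTMMasker(nn.Module):
    """Unidirectional LSTM mapping noisy spectral features to a mask in
    [0, 1]. Leaving bidirectional=False (the default) keeps the model
    causal: the output at frame t depends only on frames <= t."""
    def __init__(self, n_feats=257, hidden=512, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, num_layers=layers,
                            batch_first=True)
        self.out = nn.Linear(hidden, n_feats)

    def forward(self, noisy_feats):           # (batch, frames, n_feats)
        h, _ = self.lstm(noisy_feats)
        return torch.sigmoid(self.out(h))     # estimated mask in [0, 1]

# Training step sketch: regress the estimated mask onto the IRM target.
model = CausalLSTMMasker()
noisy = torch.rand(4, 100, 257)   # dummy noisy magnitude features
clean = torch.rand(4, 100, 257)   # dummy clean magnitudes
noise = torch.rand(4, 100, 257)   # dummy noise magnitudes
loss = nn.functional.mse_loss(model(noisy),
                              ideal_ratio_mask(clean, noise))
loss.backward()
```

At inference time the estimated mask is applied pointwise to the noisy magnitude spectrogram and the waveform is resynthesized with the noisy phase; because each frame's mask depends only on past and present frames, the system can run in real time, which is the causality property the abstract emphasizes.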


                Author and article information

                Contributors
Roles: Conceptualization, Data curation, Investigation, Supervision
Roles: Funding acquisition, Investigation, Methodology, Software, Validation, Writing – original draft
Roles: Methodology, Project administration, Software, Writing – review & editing
Roles: Data curation, Formal analysis, Validation
Role: Editor
                Journal
PLOS ONE
Public Library of Science (San Francisco, CA, USA)
ISSN: 1932-6203
Published: 3 January 2024
Volume 19, Issue 1: e0291240
                Affiliations
                [1 ] School of Information Science and Engineering, NingboTech University, Ningbo, China
                [2 ] Department of Electrical Engineering, University of Engineering and Technology, Peshawar, Pakistan
Menoufia University, Egypt
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                Author information
                https://orcid.org/0000-0003-2864-2809
                https://orcid.org/0000-0002-5532-9175
                Article
Manuscript ID: PONE-D-23-08666
DOI: 10.1371/journal.pone.0291240
PMCID: PMC10763955
PMID: 38170703
© 2024 Li et al.

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
Received: 23 March 2023
Accepted: 25 August 2023
                Page count
                Figures: 6, Tables: 10, Pages: 19
                Funding
Funded by: Natural Science Foundation of Ningbo (funder ID: http://dx.doi.org/10.13039/100007834)
Award ID: Talent Introduction Fund Project of Ningbo Tech University, grant no. 20211009
This work is supported by the "Talent Introduction Fund Project of Ningbo Tech University" under grant no. 20211009. The funder, Dr. Li, is the main author of this work and contributed fully through simulations and original paper writing. He is the project leader of the "Talent Introduction Fund Project of Ningbo Tech University under grant no. 20211009."
                Categories
                Research Article
Social Sciences > Linguistics > Speech
Engineering and Technology > Signal Processing > Speech Signal Processing
Computer and Information Sciences > Neural Networks > Recurrent Neural Networks
Biology and Life Sciences > Neuroscience > Neural Networks > Recurrent Neural Networks
Computer and Information Sciences > Information Theory > Background Signal Noise
Engineering and Technology > Signal Processing > Background Signal Noise
Physical Sciences > Mathematics > Statistics > Statistical Noise
Physical Sciences > Mathematics > Applied Mathematics > Algorithms
Research and Analysis Methods > Simulation and Modeling > Algorithms
Biology and Life Sciences > Cell Biology > Cellular Types > Animal Cells > Neurons
Biology and Life Sciences > Neuroscience > Cellular Neuroscience > Neurons
Computer and Information Sciences > Artificial Intelligence > Machine Learning > Deep Learning
                Custom metadata
                All relevant data are within the paper.

