Estimation of the glottal source from coded telephone speech using deep neural networks

N. P. Narendra, Manu Airaksinen, Brad H Story, Paavo Alku

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Estimation of glottal source information can be performed non-invasively from speech by using glottal inverse filtering (GIF) methods. However, the existing GIF methods are sensitive even to slight distortions in speech signals under different realistic scenarios, for example, in coded telephone speech. Therefore, there is a need for robust GIF methods which could accurately estimate glottal flows from coded telephone speech. To address the issue of robust GIF, this paper proposes a new deep neural net-based glottal inverse filtering (DNN-GIF) method for estimation of the glottal source from coded telephone speech. The proposed DNN-GIF method utilizes both coded and clean versions of the speech signal during training. A DNN is used to map the speech features extracted from coded speech to the glottal flows estimated from the corresponding clean speech. The glottal flows are estimated from the clean speech by using quasi-closed phase (QCP) analysis. To generate coded telephone speech, the adaptive multi-rate (AMR) codec is utilized, which operates in two transmission bandwidths: narrow band (300 Hz - 3.4 kHz) and wide band (50 Hz - 7 kHz). The glottal source parameters were computed from the proposed and existing GIF methods by using vowels obtained from natural speech data as well as from artificial speech production models. The errors in glottal source parameters indicate that the proposed DNN-GIF method has considerably improved the glottal flow estimation under coded conditions for both low- and high-pitched vowels. The proposed DNN-GIF method can be utilized to accurately extract glottal source-based features from coded telephone speech, which can be used to improve the performance of speech technology applications such as speaker recognition, emotion recognition and telemonitoring of neurodegenerative diseases.
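The paired training setup described in the abstract (inputs taken from coded speech, targets taken from the corresponding clean speech) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names `train_mapping` and `predict`, the single tanh hidden layer, the layer sizes, the plain gradient-descent training, and the random placeholder arrays that stand in for real features and QCP-estimated glottal flow frames. AMR coding, feature extraction and QCP analysis are not shown.

```python
import numpy as np


def train_mapping(coded_features, clean_glottal_flows,
                  hidden=32, epochs=200, lr=0.1, seed=0):
    """Fit a one-hidden-layer regression network by gradient descent.

    coded_features:      (n_frames, d_in) features from coded speech (input)
    clean_glottal_flows: (n_frames, d_out) glottal flow frames estimated
                         from the matching clean speech (target)
    Returns the learned parameters and the per-epoch mean squared error.
    """
    rng = np.random.default_rng(seed)
    d_in = coded_features.shape[1]
    d_out = clean_glottal_flows.shape[1]
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d_in), (d_in, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, d_out))
    b2 = np.zeros(d_out)
    losses = []
    for _ in range(epochs):
        # Forward pass
        h = np.tanh(coded_features @ W1 + b1)
        pred = h @ W2 + b2
        err = pred - clean_glottal_flows
        losses.append(float(np.mean(err ** 2)))
        # Backward pass for the mean squared error
        g_pred = 2.0 * err / err.size
        g_W2 = h.T @ g_pred
        g_b2 = g_pred.sum(axis=0)
        g_h = (g_pred @ W2.T) * (1.0 - h ** 2)
        g_W1 = coded_features.T @ g_h
        g_b1 = g_h.sum(axis=0)
        # Gradient-descent update
        W1 -= lr * g_W1
        b1 -= lr * g_b1
        W2 -= lr * g_W2
        b2 -= lr * g_b2
    return (W1, b1, W2, b2), losses


def predict(params, coded_features):
    """Estimate glottal flow frames from coded-speech features."""
    W1, b1, W2, b2 = params
    return np.tanh(coded_features @ W1 + b1) @ W2 + b2


# Placeholder data only: random arrays standing in for real paired frames.
rng = np.random.default_rng(42)
features = rng.normal(size=(256, 20))                    # hypothetical coded-speech features
targets = np.tanh(features @ rng.normal(size=(20, 40)))  # hypothetical clean-speech flow frames
params, losses = train_mapping(features, targets)
estimated = predict(params, features)                    # shape (256, 40)
```

The point of the sketch is the pairing: at test time only coded speech is needed, because the network has learned to output flows that resemble those obtained from clean speech.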

Original language: English (US)
Pages (from-to): 95-104
Number of pages: 10
Journal: Speech Communication
Volume: 106
DOI: 10.1016/j.specom.2018.12.002
State: Published - Jan 1 2019

Keywords

  • Coded telephone speech
  • Deep neural network
  • Glottal inverse filtering
  • Glottal source estimation

ASJC Scopus subject areas

  • Software
  • Modeling and Simulation
  • Communication
  • Language and Linguistics
  • Linguistics and Language
  • Computer Vision and Pattern Recognition
  • Computer Science Applications

Cite this

Estimation of the glottal source from coded telephone speech using deep neural networks. / Narendra, N. P.; Airaksinen, Manu; Story, Brad H; Alku, Paavo.

In: Speech Communication, Vol. 106, 01.01.2019, p. 95-104.

Narendra, N. P. ; Airaksinen, Manu ; Story, Brad H ; Alku, Paavo. / Estimation of the glottal source from coded telephone speech using deep neural networks. In: Speech Communication. 2019 ; Vol. 106. pp. 95-104.
@article{7e3abd885b95442497f2db394916a1e7,
title = "Estimation of the glottal source from coded telephone speech using deep neural networks",
abstract = "Estimation of glottal source information can be performed non-invasively from speech by using glottal inverse filtering (GIF) methods. However, the existing GIF methods are sensitive even to slight distortions in speech signals under different realistic scenarios, for example, in coded telephone speech. Therefore, there is a need for robust GIF methods which could accurately estimate glottal flows from coded telephone speech. To address the issue of robust GIF, this paper proposes a new deep neural net-based glottal inverse filtering (DNN-GIF) method for estimation of the glottal source from coded telephone speech. The proposed DNN-GIF method utilizes both coded and clean versions of the speech signal during training. A DNN is used to map the speech features extracted from coded speech to the glottal flows estimated from the corresponding clean speech. The glottal flows are estimated from the clean speech by using quasi-closed phase (QCP) analysis. To generate coded telephone speech, the adaptive multi-rate (AMR) codec is utilized, which operates in two transmission bandwidths: narrow band (300 Hz - 3.4 kHz) and wide band (50 Hz - 7 kHz). The glottal source parameters were computed from the proposed and existing GIF methods by using vowels obtained from natural speech data as well as from artificial speech production models. The errors in glottal source parameters indicate that the proposed DNN-GIF method has considerably improved the glottal flow estimation under coded conditions for both low- and high-pitched vowels. The proposed DNN-GIF method can be utilized to accurately extract glottal source-based features from coded telephone speech, which can be used to improve the performance of speech technology applications such as speaker recognition, emotion recognition and telemonitoring of neurodegenerative diseases.",
keywords = "Coded telephone speech, Deep neural network, Glottal inverse filtering, Glottal source estimation",
author = "Narendra, {N. P.} and Manu Airaksinen and Story, {Brad H} and Paavo Alku",
year = "2019",
month = "1",
day = "1",
doi = "10.1016/j.specom.2018.12.002",
language = "English (US)",
volume = "106",
pages = "95--104",
journal = "Speech Communication",
issn = "0167-6393",
publisher = "Elsevier",

}

TY  - JOUR
T1  - Estimation of the glottal source from coded telephone speech using deep neural networks
AU  - Narendra, N. P.
AU  - Airaksinen, Manu
AU  - Story, Brad H
AU  - Alku, Paavo
PY  - 2019/1/1
Y1  - 2019/1/1
AB  - Estimation of glottal source information can be performed non-invasively from speech by using glottal inverse filtering (GIF) methods. However, the existing GIF methods are sensitive even to slight distortions in speech signals under different realistic scenarios, for example, in coded telephone speech. Therefore, there is a need for robust GIF methods which could accurately estimate glottal flows from coded telephone speech. To address the issue of robust GIF, this paper proposes a new deep neural net-based glottal inverse filtering (DNN-GIF) method for estimation of the glottal source from coded telephone speech. The proposed DNN-GIF method utilizes both coded and clean versions of the speech signal during training. A DNN is used to map the speech features extracted from coded speech to the glottal flows estimated from the corresponding clean speech. The glottal flows are estimated from the clean speech by using quasi-closed phase (QCP) analysis. To generate coded telephone speech, the adaptive multi-rate (AMR) codec is utilized, which operates in two transmission bandwidths: narrow band (300 Hz - 3.4 kHz) and wide band (50 Hz - 7 kHz). The glottal source parameters were computed from the proposed and existing GIF methods by using vowels obtained from natural speech data as well as from artificial speech production models. The errors in glottal source parameters indicate that the proposed DNN-GIF method has considerably improved the glottal flow estimation under coded conditions for both low- and high-pitched vowels. The proposed DNN-GIF method can be utilized to accurately extract glottal source-based features from coded telephone speech, which can be used to improve the performance of speech technology applications such as speaker recognition, emotion recognition and telemonitoring of neurodegenerative diseases.
KW  - Coded telephone speech
KW  - Deep neural network
KW  - Glottal inverse filtering
KW  - Glottal source estimation
UR  - http://www.scopus.com/inward/record.url?scp=85058619832&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85058619832&partnerID=8YFLogxK
U2  - 10.1016/j.specom.2018.12.002
DO  - 10.1016/j.specom.2018.12.002
M3  - Article
VL  - 106
SP  - 95
EP  - 104
JO  - Speech Communication
JF  - Speech Communication
SN  - 0167-6393
ER  -