Lecture 1: Speech Production-Perception Link via Energy Measure
This lecture establishes a link between the speech production and perception mechanisms via a suitable energy measure. To that effect, elements of speech production and the basics of human hearing as a process of detecting energy are first discussed briefly. In this context, the limitations of the usual energy measure in the traditional signal processing literature, the L2 norm of a signal, are discussed. The development of a new energy measure in the context of speech production, namely the Teager Energy Operator (TEO), is then presented. The capability of TEO w.r.t. AM-FM modeling and noise suppression is discussed. Furthermore, mathematical modeling of the cochlea is discussed along with its link with TEO, to bring out the link of TEO to both production and perception. Finally, various potential applications of TEO in speech, speaker and emotion recognition, stressed speech analysis, energy separation, etc. are discussed.
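As a rough illustration of the contrast between the two energy measures, the sketch below uses the standard discrete-time form of TEO, psi[x(n)] = x(n)^2 - x(n-1) x(n+1), which the abstract itself does not spell out. The function name `teager_energy` and the test-signal parameters are hypothetical, chosen only for this example. For a pure sinusoid A cos(omega n + phi), TEO returns the constant A^2 sin^2(omega), so it reflects both amplitude and frequency, whereas the L2-norm energy depends on amplitude alone.

```python
import numpy as np

def teager_energy(x):
    """Discrete-time Teager Energy Operator (standard form, assumed here):
    psi[x(n)] = x(n)^2 - x(n-1) * x(n+1).
    Returns an array two samples shorter than x (edges are dropped)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# Hypothetical test signal: a sinusoid with amplitude A and digital frequency omega.
A, omega, phi = 2.0, 0.3, 0.5
n = np.arange(200)
x = A * np.cos(omega * n + phi)

psi = teager_energy(x)               # constant, equal to A^2 * sin(omega)^2
expected = A ** 2 * np.sin(omega) ** 2

# The usual L2-norm energy ignores frequency: changing omega (away from 0 or pi)
# changes the TEO output but leaves the average squared amplitude near A^2 / 2.
mean_l2_energy = np.mean(x ** 2)
```

Because the operator needs only three neighbouring samples, it tracks energy almost instantaneously, which is what makes it attractive for AM-FM analysis.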
Lecture 2: Design of Speaker Recognition in Asian Languages: A case study in Indian languages
This lecture discusses the design of speaker recognition systems in Indian languages for tape-recorded speech, and ways of improving their performance, with emphasis on system features. The details of the experimental setup are discussed, such as the dialectal zones selected for data collection (for the Indian languages Marathi, Hindi, Urdu and Oriya), the corpora design, and the text material used for recordings in the different languages. The baseline ASR system, using LP-based features (such as LPC and LPCC) and filterbank-based features (such as MFCC) with polynomial classifiers of 2nd- or 3rd-order approximation, is described thereafter. A relative comparison of speaker identification experiments in monolingual, cross-lingual and multilingual modes is made. The spectral resolution problem associated with female speech is resolved to a large extent by employing filterbank-based features. The problem of speaker classification and language identification is identified from the standpoint of ASR, and a solution is obtained by modifying the structure of the polynomial classifier. The work on speaker classification is first supported by spectrogram analysis of voices from rural males, followed by experimental results in open-set and closed-set modes for different Indian languages. For speaker classification, the wavelet packet cepstrum and sub-band cepstrum are employed, and their performance is compared with that of MFCC. Furthermore, the effect of different speech coding standards on the performance of ASR is investigated. Finally, some conclusions and future research issues in speaker recognition are discussed.
Lecture 3: Spoofing Attacks in Automatic Speaker Verification (ASV)
Speech is the most powerful form of communication between humans, and it carries various levels of information such as linguistic content, emotion, acoustic environment, language, the speaker’s identity and health condition, etc. Automatic Speaker Verification (ASV) deals with verifying a claimed speaker’s identity with the help of machines. There are various research issues in speaker recognition, such as microphone variability, intersession variability, acoustic noise, etc. In addition, one of the most challenging but practical research issues in this area is the analysis of spoofing attacks and the development of countermeasures to alleviate such attacks. In this lecture, we will present an analysis of various spoofing attacks on ASV systems, along with work on the technological challenges posed by voice conversion (VC), speech synthesis (SS), replay, twins and professional mimics, including a detailed literature survey and the recent synergistic activities of the ASVspoof 2015 and ASVspoof 2017 Challenge campaigns at the INTERSPEECH conferences.
Lecture 4: Person Recognition from Humming
Voice biometrics refers to the task of identifying or verifying a person’s identity based on his or her voice with the help of machines. In this lecture, I will present our work addressing this problem using the humming signal rather than normal speech. This kind of biometric may be useful for persons with a speech disorder. In addition, this work may be useful for designing a person-dependent Query-by-Humming (QBH) system in the context of music information retrieval (MIR). The lecture will first give a brief overview of speaker recognition technology along with various research issues in this area. A feature set newly proposed by the speaker, viz., Variable length Teager Energy Based Features (VTMFCC), will be discussed. Furthermore, the development of a new feature extraction technique that implicitly exploits phase spectrum information along with magnitude spectrum information from the hum signal will be discussed. To that effect, we have modified the structure of the state-of-the-art feature set, viz., Mel Frequency Cepstral Coefficients (MFCC). In addition, a new energy measure, viz., the Variable length Teager Energy Operator (VTEO), is employed to compute the subband energies of the different time-domain subband signals (i.e., the outputs of the 24 triangular-shaped filters used in the Mel filterbank). Discriminatively trained polynomial classifiers of 2nd-order approximation are used as the basis for all person recognition experiments. The proposed feature set is evaluated (and found to be better than the state-of-the-art MFCC) under various experimental conditions, such as polynomial classifier order, dimension of the feature vector, signal degradation, class separability, and static vs. dynamic features. Finally, the lecture will conclude with future research directions and a brief mention of various sponsored projects in the area of speech processing and acoustics at DA-IICT Gandhinagar.
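The abstract does not define VTEO, so the following is only a sketch under an assumed definition: the TEO generalized with a dependency index i, psi_i[x(n)] = x(n)^2 - x(n-i) x(n+i), which reduces to the ordinary TEO for i = 1. The names `vteo` and `subband_energy`, and the choice of averaging the operator output over a frame, are hypothetical and are not taken from the lecture; in the actual VTMFCC pipeline the input would be one time-domain output of a Mel filterbank channel.

```python
import numpy as np

def vteo(x, i=1):
    """Variable length Teager Energy Operator (assumed definition):
    psi_i[x(n)] = x(n)^2 - x(n-i) * x(n+i), for dependency index i >= 1.
    Returns len(x) - 2*i samples; i = 1 gives the ordinary TEO."""
    x = np.asarray(x, dtype=float)
    if i < 1 or 2 * i >= len(x):
        raise ValueError("dependency index i must satisfy 1 <= i < len(x) / 2")
    return x[i:-i] ** 2 - x[:-2 * i] * x[2 * i:]

def subband_energy(subband_frame, i=1):
    """Hypothetical frame-level energy: mean absolute VTEO output of one
    subband signal (the averaging choice is an assumption of this sketch)."""
    return float(np.mean(np.abs(vteo(subband_frame, i))))

# Stand-in for one subband signal: a sinusoid, for which the operator is
# constant and equal to A^2 * sin(i * omega)^2.
A, omega = 1.5, 0.2
n = np.arange(400)
frame = A * np.cos(omega * n)
energy_i3 = subband_energy(frame, i=3)
```

A larger dependency index makes the operator look at samples further from the current one, which is the sense in which the energy measure has a variable length.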