Speech - Intelligibility, Privacy and Security

Speech intelligibility is a measure of how comprehensible speech is, whereas speech privacy concerns the lack of intelligibility of speech from other talkers. Speech security concerns meeting rooms in which confidential discussions take place and where there is a need to ensure adequate protection from casual overhearing or deliberate interception by a covert listener.

The effects of interference from combined noises on speech transmission have been investigated in a simulated open public space [2]. Sound fields for the dominant noises were predicted using a model of a typical urban square surrounded by buildings. Road traffic noise and two types of construction noise, one stationary and one impulsive, were selected as background noises. Listening tests were carried out with a group of adults, and the quality of speech transmission was evaluated using listening difficulty as well as intelligibility scores. Two factors that affect speech transmission performance were considered in the listening tests: (1) the temporal characteristics of the construction noise (stationary or impulsive) and (2) the levels of the construction and road traffic noises. The results indicated that word intelligibility scores and listening difficulty ratings were affected by the temporal characteristics of construction noise due to fluctuations in the background noise level. It was also observed that listening difficulty is unable to describe the speech transmission in noisy open public spaces, showing larger variation than word intelligibility scores.
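As a rough illustration of how the levels of two simultaneous noises combine, the sketch below sums incoherent sound pressure levels on an energy basis and derives a speech-to-noise ratio. This is the standard decibel summation rule, not a procedure taken from [2]; the function names and all numerical levels are hypothetical, and a single combined level does not capture the temporal fluctuations of impulsive noise that the study found to matter.

```python
import math


def combine_levels_db(levels_db):
    """Energetic (power) sum of incoherent sound pressure levels, in dB."""
    return 10.0 * math.log10(sum(10.0 ** (level / 10.0) for level in levels_db))


def speech_to_noise_ratio_db(speech_db, noise_levels_db):
    """Speech level minus the combined level of all background noises."""
    return speech_db - combine_levels_db(noise_levels_db)


# Hypothetical levels chosen for illustration only (not values from the study).
road_traffic_db = 60.0
construction_db = 60.0
speech_db = 65.0

total_noise_db = combine_levels_db([road_traffic_db, construction_db])
print(round(total_noise_db, 1))  # 63.0: two equal sources add about 3 dB
print(round(speech_to_noise_ratio_db(speech_db, [road_traffic_db, construction_db]), 1))
```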

A study has also been carried out to assess speech privacy in open-plan offices using two recently introduced single-number quantities: the spatial decay rate of speech, DL2,S, and the A-weighted sound pressure level of speech at a distance of 4 m, Lp,A,S,4m [3]. Open-plan offices were modelled using DL2,S values of 4, 8, and 12 dB, and Lp,A,S,4m was varied in three steps from 43 to 57 dB. Auditory experiments were conducted at three locations with source–receiver distances of 8, 16, and 24 m, while the background noise level was fixed at 30 dBA. A total of 20 subjects were asked to rate the speech intelligibility and listening difficulty of 240 Korean sentences in these conditions. The speech intelligibility scores were not affected by DL2,S or Lp,A,S,4m at a source–receiver distance of 8 m; however, listening difficulty ratings changed significantly with increasing DL2,S and Lp,A,S,4m values. At the other locations, the influences of DL2,S and Lp,A,S,4m on speech intelligibility and listening difficulty ratings were significant. It was also found that the speech intelligibility scores and listening difficulty ratings changed considerably with increasing distraction distance (rD). Furthermore, listening difficulty is more sensitive than intelligibility scores to variations in DL2,S and Lp,A,S,4m for sound fields with high speech transmission performance.
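To make the two single-number quantities concrete, the following sketch estimates the A-weighted speech level at each listening position, assuming the spatial decay rate is expressed in dB per doubling of distance (the usual convention for this quantity) and that the decay is linear on a logarithmic distance axis. The function name is hypothetical, only the two end values of the 43 to 57 dB range are used, and the sound fields actually simulated in [3] need not follow this idealised relation.

```python
import math


def speech_level_at_distance(lp_4m_db, decay_per_doubling_db, distance_m):
    """Speech level at distance_m, extrapolated from the level at 4 m.

    Assumes a constant decay (in dB) for every doubling of distance.
    """
    return lp_4m_db - decay_per_doubling_db * math.log2(distance_m / 4.0)


BACKGROUND_DBA = 30.0  # fixed background noise level reported in the text

for decay in (4.0, 8.0, 12.0):           # spatial decay rates from the text
    for lp_4m in (43.0, 57.0):           # end points of the reported range
        for distance in (8.0, 16.0, 24.0):
            level = speech_level_at_distance(lp_4m, decay, distance)
            snr = level - BACKGROUND_DBA
            print(f"decay={decay:4.1f} dB  Lp4m={lp_4m:4.1f} dB  "
                  f"r={distance:4.1f} m  level={level:5.1f} dBA  SNR={snr:+5.1f} dB")
```

Under these assumptions, a speech level of 57 dB at 4 m with a 4 dB decay rate gives 53 dBA at 8 m, well above the 30 dBA background, whereas 43 dB at 4 m with a 12 dB decay rate gives 19 dBA at 16 m, below the background level.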

Research into thresholds of information leakage introduces a new approach to providing speech security outside meeting rooms, where a covert listener might attempt to extract confidential information [4]. Decision-based experiments are used to establish a relationship between an objective measurement of the Speech Transmission Index (STI) and a subjective assessment relating to the threshold of information leakage. This threshold is defined for a specific percentage of English words that are identifiable with a maximum safe vocal effort (e.g. ‘normal’ speech) used by the meeting participants. The results demonstrate that it is possible to quantify an offset that links STI with a specific threshold of information leakage, which describes the percentage of words identified. The offsets for male talkers are shown to be approximately 10 dB larger than for female talkers. Hence, for speech security, it is possible to determine offsets for the threshold of information leakage using male talkers as the ‘worst case scenario’. The results show that a robust definition of the threshold of information leakage can be based upon 1%, 2% or 5% of words identified. For these percentages, results are presented for offset values corresponding to different STI values in a range from 0.1 to 0.3.

Research has been carried out to assess the intelligibility of noisy speech at very low signal-to-noise ratios (SNRs) and to evaluate the performance of a relatively new objective intelligibility metric, the Short-Time Objective Intelligibility metric (STOI), with these signals [5]. The STOI output, d, is monotonic with speech intelligibility, where d=1 indicates 100% speech intelligibility and d<1 indicates less than 100% intelligibility. Unlike STI, STOI is a correlation-based method that was designed to characterise Ideal Time-Frequency segregated (i.e., enhanced) speech. However, its developers also describe it as appropriate for degraded signals before enhancement. In this work, speech was degraded by four types of additive noise or ‘masker’: white Gaussian noise, a 400 Hz sine wave, white noise with a 400 Hz sine wave, and white noise with a 400 Hz sine wave and harmonics up to 3200 Hz. Listening tests involving normal-hearing human listeners were conducted for two male and two female British English talkers using four SNRs per masker, ranging from -10 dB to as low as -50 dB. These SNRs were chosen to obtain intelligibility scores lower than 10%, which are relevant in the context of speech security. A modified STOI was introduced which, unlike STOI, extends down to d=0 for signals in which listeners are unable to correctly identify words. This modification is easy to implement and results in improved STOI performance for highly noisy signals with or without tonal components.
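The degradation step described above, adding a masker to speech at a prescribed SNR, can be sketched as follows. This is a generic construction, not the signal-generation procedure of [5]: the function names are hypothetical, the "speech" is a placeholder tone standing in for a recording, and the relative level of the 400 Hz tone and the white noise within the masker is an assumption.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def white_plus_tone(n, fs, tone_hz=400.0, seed=0):
    """White Gaussian noise plus a sine wave (equal-ish levels are an assumption)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) + math.sin(2.0 * math.pi * tone_hz * i / fs)
            for i in range(n)]


def mix_at_snr(speech, masker, snr_db):
    """Scale the masker so that the speech-to-masker RMS ratio equals snr_db."""
    gain = rms(speech) / (rms(masker) * 10.0 ** (snr_db / 20.0))
    return [s + gain * m for s, m in zip(speech, masker)]


fs = 16000
n = 1600
# Placeholder "speech": a 200 Hz tone standing in for a real recording.
speech = [math.sin(2.0 * math.pi * 200.0 * i / fs) for i in range(n)]
masker = white_plus_tone(n, fs)

mixture = mix_at_snr(speech, masker, snr_db=-30.0)

# Check the achieved SNR by recovering the scaled masker from the mixture.
scaled_masker = [m - s for m, s in zip(mixture, speech)]
print(round(20.0 * math.log10(rms(speech) / rms(scaled_masker)), 1))  # -30.0
```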

In a follow-up study [6], the effectiveness of Ideal Binary Masking (IBM) algorithms in enhancing speech degraded by noise at very low SNRs was evaluated. Signals were enhanced using IBMs with a Local Criterion (LC) equal to zero or to the mixture SNR. Each signal was presented three times consecutively to normal-hearing human listeners to obtain percentages of words correctly identified. The performance of STOI and its modified form, STOIWC, was evaluated on the basis of the extent to which they correlated with these percentages. The findings indicate that the intelligibility of signals processed with LC = 0 is dependent on binary mask density (i.e., the number of ones in the mask). Further, at low mixture SNRs, LC = SNR resulted in better signal intelligibility than LC = 0. It was shown that STOI performance in predicting speech intelligibility is affected by mask density and the spread of STOI values.
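The role of the Local Criterion and of mask density can be illustrated with a minimal sketch of an ideal binary mask, assuming the common definition in which a time-frequency unit is kept (mask value 1) when its local speech-to-noise ratio in dB exceeds LC. The function names and the small energy matrices are hypothetical, density is expressed here as the proportion of ones, and the time-frequency decomposition used in [6] is not reproduced.

```python
import math


def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0):
    """Return a 0/1 mask: 1 where the local SNR (dB) exceeds the Local Criterion.

    speech_energy and noise_energy are equally sized 2-D lists of
    non-negative energies, one value per time-frequency unit.
    """
    mask = []
    for speech_row, noise_row in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(speech_row, noise_row):
            if s <= 0.0:
                keep = False          # no speech energy: never keep the unit
            elif n <= 0.0:
                keep = True           # speech without noise: always keep
            else:
                keep = 10.0 * math.log10(s / n) > lc_db
            row.append(1 if keep else 0)
        mask.append(row)
    return mask


def mask_density(mask):
    """Proportion of ones in the mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


# Hypothetical energies for a 2 x 2 grid of time-frequency units.
speech = [[4.0, 0.5], [0.01, 1.0]]
noise = [[1.0, 1.0], [1.0, 1.0]]

mask_lc0 = ideal_binary_mask(speech, noise, lc_db=0.0)
mask_lc_snr = ideal_binary_mask(speech, noise, lc_db=-10.0)  # LC set to an assumed mixture SNR

print(mask_lc0, mask_density(mask_lc0))        # [[1, 0], [0, 0]] 0.25
print(mask_lc_snr, mask_density(mask_lc_snr))  # [[1, 1], [0, 1]] 0.75
```

In this toy example, lowering LC from 0 dB to a negative mixture SNR retains more units, so the mask becomes denser.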

Research concerning speech produced in noise examined differences in intelligibility and listening difficulty between loud speech due to the presence of noise (the Lombard effect) and loud speech due to instruction (‘Speak in a loud voice’). A modified form of the listening difficulty measure was developed to improve its metrological properties: in [7], a ten-point listening difficulty scale and a statistical approach, designed to address the issues of saturation and listener variation, were presented. It was confirmed that speech produced in multi-talker babble noise, relative to speech produced in quiet, was associated with an increased fundamental frequency, increased spectral energy between 1 and 4 kHz relative to energy below 1 kHz, and increased vowel duration. However, only the proportion of high to low spectral energy reliably predicted listening difficulty for normal-hearing listeners.
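The proportion of high to low spectral energy mentioned above can be approximated as the ratio of energy in the 1 to 4 kHz band to energy below 1 kHz. The sketch below computes this from a plain discrete Fourier transform of a short frame; it is an illustration only, with hypothetical function names and a synthetic two-tone test signal, and it does not reproduce the spectral analysis used in [7].

```python
import math


def band_energy(signal, fs, f_lo, f_hi):
    """Sum of squared DFT magnitudes for bins with f_lo <= frequency < f_hi."""
    n = len(signal)
    energy = 0.0
    for k in range(n // 2 + 1):
        freq = k * fs / n
        if f_lo <= freq < f_hi:
            re = sum(x * math.cos(2.0 * math.pi * k * i / n) for i, x in enumerate(signal))
            im = -sum(x * math.sin(2.0 * math.pi * k * i / n) for i, x in enumerate(signal))
            energy += re * re + im * im
    return energy


def high_to_low_ratio_db(signal, fs):
    """Energy in 1-4 kHz relative to energy below 1 kHz, in dB."""
    high = band_energy(signal, fs, 1000.0, 4000.0)
    low = band_energy(signal, fs, 0.0, 1000.0)
    if low <= 0.0 or high <= 0.0:
        raise ValueError("both bands must contain energy")
    return 10.0 * math.log10(high / low)


# Synthetic test frame: a 200 Hz tone plus a 2 kHz tone at half the amplitude.
fs = 16000
n = 320  # 50 Hz bin spacing, so both tones fall exactly on DFT bins
signal = [math.sin(2.0 * math.pi * 200.0 * i / fs)
          + 0.5 * math.sin(2.0 * math.pi * 2000.0 * i / fs)
          for i in range(n)]

print(round(high_to_low_ratio_db(signal, fs), 2))  # -6.02
```

Halving the amplitude of the high-frequency component quarters its energy, which is why the ratio is about -6 dB; a Lombard-style shift of energy towards 1 to 4 kHz would raise this value.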

Selected publications

[1] El-Zeky L and Oldham DJ (1998) The use of virtual reflectors to improve speech intelligibility in open stage auditoria. Building Acoustics vol 5 issue 1 pp 57-68.

[2] Lee PJ and Jeon JY (2011) Evaluation of speech transmission in open public spaces affected by combined noises. Journal of the Acoustical Society of America vol 130 pp 1357-1366.

[3] Lee PJ and Jeon JY (2014) A laboratory study for assessing speech privacy in a simulated open plan office. Indoor Air vol 24 issue 3 pp 307-314.

[4] Robinson M, Hopkins C, Worrall K and Jackson T (2014) Thresholds of information leakage for speech security outside meeting rooms. Journal of the Acoustical Society of America vol 136 issue 3 pp 1149-1159.

[5] Graetzer S and Hopkins C (2017) An assessment of objective indicators of speech intelligibility in noise at low signal-to-noise ratios. Proceedings of the 24th International Congress on Sound and Vibration, London, 2017.

[6] Graetzer S and Hopkins C (2018) Evaluation of STOI for speech at low signal-to-noise ratios after enhancement with Ideal Binary Masks. Proceedings of the 25th International Congress on Sound and Vibration, Hiroshima, 2018.

[7] Graetzer S, Bottalico P and Hunter EJ (2017) Speech produced in noise: Relationship between listening difficulty and acoustic and durational parameters. Journal of the Acoustical Society of America vol 142 issue 2 pp 974-983.