In this paper we describe a system that separates signals by comparing the interaural time delays (ITDs) of their time-frequency components to a fixed threshold ITD. While in previous algorithms the fixed threshold ITD was obtained empirically from training data in a specific environment, in real environments the characteristics that affect the optimal value of this threshold are unknown and possibly time-varying. If these characteristics differ from those of the environment for which the ITD threshold was pre-computed, the performance of the source separation system degrades. We present an algorithm that chooses the threshold ITD that minimizes the cross-correlation of the target and interfering signals after a compressive nonlinearity. We demonstrate that this algorithm provides speech recognition accuracy that is much more robust to changes in environment than would be obtained using a fixed threshold ITD.
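The threshold selection described above can be sketched roughly as follows. This is an illustrative sketch, not the paper's implementation: the function names, the per-frame summation over channels, the power-law exponent of 0.1 used as the compressive nonlinearity, and the use of a Pearson correlation coefficient are all assumptions made here for the example.

```python
import math

def binary_mask(itds, threshold):
    """Return 1.0 for bins whose |ITD| is within the threshold (treated as target), else 0.0."""
    return [1.0 if abs(d) <= threshold else 0.0 for d in itds]

def split_power(frame_itds, frame_powers, threshold, exponent=0.1):
    """Split one frame's power into compressed target and interferer power (assumed form)."""
    mask = binary_mask(frame_itds, threshold)
    target = sum(p * m for p, m in zip(frame_powers, mask))
    interferer = sum(p * (1.0 - m) for p, m in zip(frame_powers, mask))
    return target ** exponent, interferer ** exponent

def correlation(xs, ys):
    """Pearson correlation across frames; returns 0.0 if either sequence is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def select_threshold(itds, powers, candidates, exponent=0.1):
    """Pick the candidate threshold with the smallest target/interferer correlation."""
    best_threshold, best_corr = None, None
    for th in candidates:
        pairs = [split_power(fi, fp, th, exponent) for fi, fp in zip(itds, powers)]
        corr = correlation([t for t, _ in pairs], [i for _, i in pairs])
        if best_corr is None or corr < best_corr:
            best_threshold, best_corr = th, corr
    return best_threshold
```

Here `itds` and `powers` are lists of frames, each frame a list of per-channel values. A real system would need to guard against degenerate candidates (for example, a threshold so wide that the interferer power is always zero).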
A novel power function-based power distribution normalization (PPDN) scheme is presented in this paper. This algorithm is based on the observation that the ratio of arithmetic mean to geometric mean differs greatly between clean and corrupt speech. A parametric power function is used to equalize this ratio. We also observe that a medium-duration window (around 100 ms) is better suited to normalization, so this medium-duration window is used for spectral analysis and re-synthesis. An online version can also be easily implemented using forgetting factors without a lookahead buffer. Experimental results show that this algorithm provides comparable or slightly better results than state-of-the-art algorithms such as vector Taylor series for speech recognition while requiring little computation. Thus, this algorithm is suitable both for real-time speech communication and as a real-time preprocessing stage for speech recognition systems.
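To make the arithmetic-mean-to-geometric-mean ratio concrete, here is a minimal sketch. The helper names and the simple form `p ** exponent` for the power function are assumptions for illustration; the abstract does not specify the parametric form actually used.

```python
import math

def am_gm_ratio(powers):
    """Ratio of arithmetic mean to geometric mean of strictly positive power values."""
    n = len(powers)
    arithmetic_mean = sum(powers) / n
    geometric_mean = math.exp(sum(math.log(p) for p in powers) / n)
    return arithmetic_mean / geometric_mean

def apply_power_function(powers, exponent):
    """Hypothetical power function: raise each power value to a fixed exponent."""
    return [p ** exponent for p in powers]
```

The ratio equals 1 for a flat power sequence and grows as the distribution becomes more peaked, so an exponent below 1 pulls the ratio toward 1 while an exponent above 1 pushes it away.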
In this paper we present a novel algorithm called Suppression of Slowly-varying components and the Falling edge of the power envelope (SSF) to enhance spectral features for robust speech recognition, especially in reverberant environments. This algorithm is motivated by the precedence effect and by the modulation frequency characteristics of the human auditory system. We describe two slightly different types of processing that differ in whether or not the falling edges of power trajectories are suppressed using a lowpassed power envelope signal. The SSF algorithms can be implemented for online processing. Speech recognition results show that this algorithm provides especially good robustness in reverberant environments.
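One way to picture the suppression of slowly-varying components is to subtract a lowpassed power envelope from the power trajectory of each channel and apply a floor. The sketch below is an assumption-laden illustration, not the published algorithm: the first-order recursive lowpass, the parameter names `forgetting` and `floor_coeff`, their values, and the floor relative to the input power are all choices made here.

```python
def suppress_slow_components(powers, forgetting=0.5, floor_coeff=0.01):
    """Subtract a lowpassed envelope from one channel's power trajectory, with a floor.

    powers: per-frame power values for a single frequency channel.
    forgetting: assumed forgetting factor of the first-order lowpass filter.
    floor_coeff: assumed floor, as a fraction of the input power in each frame.
    """
    output = []
    envelope = powers[0] if powers else 0.0  # initialize the envelope at the first frame
    for p in powers:
        # Lowpassed power envelope (online: uses only current and past frames).
        envelope = forgetting * envelope + (1.0 - forgetting) * p
        # Keep what rises above the envelope; never fall below the floor.
        output.append(max(p - envelope, floor_coeff * p))
    return output
```

With these assumed settings a steady input is reduced to its floor while a sudden onset passes through largely intact, which is the qualitative behavior one would expect from onset emphasis.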
This paper presents a new robust feature extraction algorithm based on a modified approach to power bias subtraction combined with applying a threshold to the power spectral density. The power bias level is selected as the level above which the signal power distribution is sharpest. The sharpness is measured using the ratio of the arithmetic mean to the geometric mean of medium-duration power. When subtracting this bias level, power flooring is applied to enhance robustness. These new ideas are employed to enhance our recently introduced feature extraction algorithm PNCC (Power Normalized Cepstral Coefficient). Although it is simpler than our previous PNCC, experimental results show that this new PNCC performs better than our previous implementation.
This paper presents a new feature extraction algorithm called PNCC that is based on auditory processing. Major new features of PNCC processing include the use of a power-law nonlinearity that replaces the traditional log nonlinearity used in MFCC processing, and a novel algorithm that suppresses background excitation by estimating medium-duration power, using the ratio of the arithmetic mean to the geometric mean, and subtracting the medium-duration background power. Experimental results demonstrate that PNCC processing provides substantial improvements in recognition accuracy compared to MFCC and PLP processing for various types of additive noise. The computational cost of PNCC is only slightly greater than that of conventional MFCC processing. Index Terms: Robust speech recognition, physiological modeling, rate-level curve, power function, ratio of arithmetic mean to geometric mean, power distribution normalization
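The practical difference between a log nonlinearity and a power-law nonlinearity shows up at very small power values. The sketch below compares the two; the exponent of 1/15 is an assumed illustrative value (the abstract does not state one), and the function names are hypothetical.

```python
import math

def log_nonlinearity(power):
    """Traditional log compression; unbounded below as power approaches zero."""
    return math.log(power)

def power_law_nonlinearity(power, exponent=1.0 / 15.0):
    """Power-law compression with an assumed exponent; approaches 0 as power approaches zero."""
    return power ** exponent

def output_change(nonlinearity, clean_power, noise_power):
    """How much the compressed output moves when additive noise power is added to a bin."""
    return abs(nonlinearity(clean_power + noise_power) - nonlinearity(clean_power))
```

For a low-power bin, a small amount of additive noise moves the log output far more than the power-law output, which gives one intuition for why the choice of nonlinearity matters in noise.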
This paper describes the structure and performance of a new signal processing scheme, motivated by the physiology of the peripheral auditory system, that improves speech recognition accuracy in the presence of broadband noise. An important attribute of the peripheral processing is a novel mechanism to represent the cycle-by-cycle synchrony in the response of low-frequency auditory-nerve fibers, in addition to the more conventional processing based on mean rate of response. It is shown that the use of the physiologically-motivated peripheral processing improves recognition accuracy in the presence of both broadband and transient noise, and that the use of the synchrony mechanism provides further improvement beyond that which is provided by the mean rate mechanism.
In this paper, we present a new two-microphone approach that improves speech recognition accuracy when speech is masked by other speech. The algorithm improves on previous systems that have been successful in separating signals based on differences in the arrival time of signal components at two microphones. The present algorithm differs from these efforts in that the signal selection takes place in the frequency domain. We observe that additional smoothing of the phase estimates over time and frequency is needed to support adequate speech recognition performance. We demonstrate that the algorithm described in this paper provides better recognition accuracy than time-domain-based signal separation algorithms, at less than 10 percent of the computational cost.
Almost all current automatic speech recognition (ASR) systems conventionally append delta and double-delta cepstral features to static cepstral features. In this work we describe a modified feature-extraction procedure in which the time-difference operation is performed in the spectral domain, rather than in the cepstral domain as is generally done at present. We argue that this approach based on "delta-spectral" features is needed because, even though delta-cepstral features capture dynamic speech information and generally greatly improve ASR accuracy, they are not robust to noise and reverberation. We support the validity of the delta-spectral approach both with observations about the modulation spectrum of speech and noise, and with objective experiments that document the benefit that the delta-spectral approach brings to a variety of currently popular feature extraction algorithms. We found that the use of delta-spectral features, rather than the more traditional delta-cepstral features, improves the effective SNR by between 5 and 8 dB for background music and white noise, and that recognition accuracy in reverberant environments improves as well.
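The core operation, a time difference taken on spectral trajectories rather than on cepstral coefficients, can be sketched as below. This covers only the differencing step under assumed conventions (a symmetric difference with span `span` and edge frames clamped); any further processing applied before or after differencing in the actual procedure is not shown.

```python
def delta_features(frames, span=1):
    """Symmetric time difference x[t + span] - x[t - span], per channel.

    frames: list of frames, each a list of per-channel spectral values
            (for example, compressed filterbank outputs).
    span: assumed half-width of the difference, in frames.
    Frame indices beyond either end are clamped to the first or last frame.
    """
    last = len(frames) - 1
    deltas = []
    for t in range(len(frames)):
        ahead = frames[min(t + span, last)]
        behind = frames[max(t - span, 0)]
        deltas.append([a - b for a, b in zip(ahead, behind)])
    return deltas
```

Applying the same function to a sequence of cepstral vectors would give conventional delta-cepstral features, so the only change illustrated here is the domain in which the difference is taken.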
In this paper, we present a noise robustness algorithm called Small Power Boosting (SPB). We observe that in the spectral domain, time-frequency bins with smaller power are more affected by additive noise. The conventional way of handling this problem is to estimate the noise from the test utterance and then apply normalization or subtraction. In our work, in contrast, we intentionally boost the power of time-frequency bins with small energy in both the training and testing datasets. Since time-frequency bins with small power no longer exist after this power boosting, the spectral distortion between the clean and corrupt test sets is reduced. This type of small power boosting is also closely related to physiological nonlinearity. We observe that when small power boosting is applied, suitable smoothing of the weighting becomes highly important. Our experimental results indicate that this simple idea is very helpful in very difficult noisy environments such as corruption by background music.
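The boosting idea can be illustrated with a simple floor: every time-frequency bin is raised to at least a fixed fraction of a reference power. The choice of the peak power as the reference, the parameter name `boost_coeff`, and its value are assumptions for this sketch, and the smoothing step mentioned in the abstract is not included.

```python
def boost_small_power(powers, boost_coeff=0.02):
    """Raise every bin to at least boost_coeff times the peak power (assumed rule).

    powers: flat list of time-frequency bin powers for one utterance.
    boost_coeff: assumed fraction of the peak power used as the floor.
    """
    if not powers:
        return []
    floor = boost_coeff * max(powers)
    return [max(p, floor) for p in powers]
```

Because the same boosting is applied to both training and test data, bins that would otherwise be dominated by noise in the test set end up at a similar level in both conditions.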
In this paper we present a new method of signal processing for robust speech recognition using two microphones. The method, loosely based on the human binaural hearing system, consists of passing the speech signals detected by two microphones through bandpass filtering. We develop a spatial masking function based on normalized cross-correlation, which provides rejection of off-axis interfering signals. To obtain improvements in reverberant environments, we add a temporal masking component, which is closely related to our previously described de-reverberation technique known as SSF. We demonstrate that this approach provides substantially better recognition accuracy than conventional binaural sound-source separation algorithms.
It is well known that binaural processing is very useful for separating incoming sound sources as well as for improving the intelligibility of speech in reverberant environments. This paper describes and compares a number of ways in which the classic model of interaural cross-correlation proposed by Jeffress, quantified by Colburn, and further elaborated by Blauert, Lindemann, and others, can be applied to improving the accuracy of automatic speech recognition systems operating in cluttered, noisy, and reverberant environments. Typical implementations begin with an abstraction of cross-correlation of the incoming signals after nonlinear monaural bandpass processing, but there are many alternative implementation choices that can be considered. These implementations differ in the ways in which an enhanced version of the desired signal is developed using binaural principles, in the extent to which specific processing mechanisms are used to impose suppression motivated by the precedence effect, and in the precise mechanism used to extract interaural time differences.
Papers by Chanwoo Kim