Microphone Enhancement for Speech Recognition



Costas Papadopoulos
Vice President of Research and Development
VXI Corporation


Overview

Somewhere between the plain analog microphone of yesterday and the digital version of tomorrow there is a high performance, cost effective solution for today that will improve speech recognition results immediately.

The familiar headset microphone, perhaps the one that comes with the software, will perform only adequately even when every measure of good acoustic design has been taken. Before discussing electronic improvements to the headset, we will take for granted some noise canceling mic fundamentals:

Factors Limiting Microphone Performance

    Cause                                                       Effect
1.  Inappropriate mic jack wiring at the computer               Voice signal too low
2.  Frequency response extending too low                        Poor real world noise canceling
3.  Mic sensitivity reaching high levels at high frequencies    Poor signal-to-noise ratio
4.  Limited, unpredictable dynamic range                        Distortion at higher voice levels
5.  Mic voltage inadequate or absent                            Low output or no output

The Parrott™ Translator™ Solution

These five limitations can be overcome by attaching a Translator™ unit (see Figure 1) between the microphone and the computer. About seventeen electronic components and one factory adjustment later, speech recognition accuracy is maximized 'out of the box.' We will briefly review each of the five problems and the corresponding Translator™ solution.

1. Inappropriate PC Wiring

Practically every microphone for voice enabled applications uses a 3.5 mm stereo plug that connects to a corresponding jack at the PC. 'Stereo' simply implies two connections (tip and ring) with a common ground (sleeve). Electret mics require an external bias voltage. This DC voltage is generally provided by the sound card or motherboard mic jack but in seemingly endless ways and levels. At times (see Figure 1 right section) bias voltage is at the ring terminal of the jack or it may be on the tip. Sometimes both tip and ring are trying to source one or two voltages. In the poorest but most frequent case, there is also a low value resistor from tip to ground. This termination is a remnant from dynamic microphones and a sad fact of life for electrets since DC voltage may be divided or wasted when the mic is connected.
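The bias division described above can be illustrated with ordinary voltage divider arithmetic. The following is a minimal sketch, assuming a bias source on the ring fed through a series resistor, a termination resistor from tip to ground, and a headset plug that ties tip and ring together. The function name and all component values are illustrative assumptions, not measurements from any particular sound card.

```python
def bias_at_jack(v_supply, r_bias_ohms, r_term_ohms=None):
    """Open-circuit bias voltage at the mic jack (hypothetical model).

    v_supply    -- DC bias source voltage inside the PC (assumed value)
    r_bias_ohms -- series resistor between the source and the jack (assumed)
    r_term_ohms -- termination resistor from tip to ground, or None if absent
    """
    if r_term_ohms is None:
        # No termination: the full bias voltage appears at the jack.
        return v_supply
    # Termination present: source resistor and termination form a divider.
    return v_supply * r_term_ohms / (r_bias_ohms + r_term_ohms)


# Assumed example values: 5 V source, 2.2 kohm bias resistor, 600 ohm termination.
V_SUPPLY = 5.0
R_BIAS = 2200.0
R_TERM = 600.0

print(f"Without termination: {bias_at_jack(V_SUPPLY, R_BIAS):.2f} V")
print(f"With termination:    {bias_at_jack(V_SUPPLY, R_BIAS, R_TERM):.2f} V")
```

With these assumed values, the termination leaves roughly 1.07 V of the original 5 V available to the microphone, which shows why a low value resistor at the tip is so costly for an electret.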

The mic plug on the headset cord is often wired in a way that is least painful for most users (tip and ring together) but optimum for few. We have all seen signal adapter or attenuator accessories provided with some headsets to address wiring incompatibility and prevent bias voltage division. Users may not know when to use these adapters or where to find them later. A transparent solution requiring no user steps is preferred. The Translator™ has a circuit section that obtains mic bias from tip or ring separately and provides a voice signal to either terminal. The circuit has two capacitor-diode branches which are isolated, symmetrical and impartial....

2.  Low Frequency Response Too Rich

Human speech 'dwells' at frequencies between a few hundred and a few thousand Hertz while most electret microphones are sensitive well below and above the voice band. At the same time wind noise, breath pops, traffic rumble, AC hum and PC monitors have strong components at very low frequencies. It is beneficial to place a high pass filter between the mic and the computer to reject these unavoidable real world sounds without reducing voice fidelity. In circuit design, such filtering is commonly done with a series capacitor. In speech recognition reality however, a series capacitor will block DC bias from the computer and the mic will be dead. For the Translator™, a shunt inductor-capacitor network (across the signal path) is applied instead. The filter that attenuates unwanted low frequency sounds also contributes to good 60 Hz rejection and provides an additional benefit as we will see later.
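To give a feel for how much a high pass filter can reduce 60 Hz hum while leaving the voice band nearly untouched, here is a minimal sketch using the textbook magnitude response of an ideal first-order high-pass filter. It is a generic illustration only: the actual order and corner frequency of the shunt inductor-capacitor network are not given in this article, and the 200 Hz corner below is an assumption.

```python
import math


def highpass_gain_db(freq_hz, corner_hz):
    """Gain in dB of an ideal first-order high-pass filter (freq_hz > 0)."""
    ratio = freq_hz / corner_hz
    magnitude = ratio / math.sqrt(1.0 + ratio * ratio)
    return 20.0 * math.log10(magnitude)


CORNER_HZ = 200.0  # assumed corner frequency, not a published specification

# Compare hum at 60 Hz with a few frequencies inside the voice band.
for f in (60.0, 300.0, 1000.0, 3000.0):
    print(f"{f:6.0f} Hz: {highpass_gain_db(f, CORNER_HZ):6.1f} dB")
```

Under these assumptions, 60 Hz is attenuated by roughly 10.8 dB while 1 kHz loses less than 0.2 dB. A higher order network would reject hum more steeply.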

3. High Frequencies Too Shrill

Many noise canceling electrets have a 'ski slope' microphone response rising steeply above 1 kHz, a feature required in communications, termed pre-emphasis. In speech recognition, voice is sampled at 11.025 kHz so only components up to half that frequency are needed. In addition, pre-emphasis is unnatural since voice will not be transmitted but digitized linearly at the PC. Low pass filtering (one or two shunt capacitors) is used in the Translator™ first to flatten and then to limit the microphone passband. Unwanted sensitivity to high frequency sounds and digital hash is avoided so signal-to-noise ratio is increased. Figure 2 shows mic output versus frequency with and without filtering. Clearly the enhanced response is flat in the middle and bounded above and below the voice band.
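The 'half that frequency' limit and the sizing of a shunt capacitor can be sketched with two short formulas: the Nyquist limit f_s / 2, and the first-order RC corner f_c = 1 / (2 pi R C). The source resistance and target corner below are assumptions chosen for illustration; the article does not state the actual component values.

```python
import math


def nyquist_hz(sample_rate_hz):
    """Highest frequency that can be represented at a given sample rate."""
    return sample_rate_hz / 2.0


def shunt_capacitance_farads(source_resistance_ohms, corner_hz):
    """Shunt capacitor giving a first-order low-pass corner at corner_hz."""
    return 1.0 / (2.0 * math.pi * source_resistance_ohms * corner_hz)


SAMPLE_RATE_HZ = 11025.0      # sampling rate quoted in the article
SOURCE_RESISTANCE = 2200.0    # assumed source resistance in ohms
TARGET_CORNER_HZ = 5000.0     # assumed corner, just below the Nyquist limit

print(f"Nyquist limit: {nyquist_hz(SAMPLE_RATE_HZ):.1f} Hz")
c = shunt_capacitance_farads(SOURCE_RESISTANCE, TARGET_CORNER_HZ)
print(f"Shunt capacitor: {c * 1e9:.1f} nF")
```

At 11.025 kHz sampling the Nyquist limit is 5512.5 Hz. With the assumed 2.2 kohm source resistance, a corner near 5 kHz calls for a shunt capacitor of about 14.5 nF.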

Figure 2. Microphone frequency response

4. Unpredictable Dynamic Range

Electrets are not all created equal! Even mics with the same part number will vary from unit to unit in key parameters. The DC current the mic needs from the sound card may range from one hundred to several hundred microamps for a given mic type. Low current units will have small signal swings leading to distortion (positive peak clipping) at high voice levels, for example in a noisy environment. High current mics will be starved (negative peak clipping) if current is limited and precious, quite common when a notebook computer is the source.

A current mirror transistor stage is used in the Translator™ to provide gain and to set all mics to a constant, optimum level at the factory. All mirrored current additional to what the electret draws is active and increases voice signal headroom. Stiffer output impedance is an added benefit because mic output level will be more consistent among PC sound cards.

5.  Inadequate Microphone Bias

Up to this point we have seen mic improvements when starting performance was fair or adequate. What about more desperate cases? What can cause them? Many times mic bias from the PC is poor: voltage too low, current too low or both. In other cases bias is turned off and must be software or hardware enabled. You are familiar with add-on battery box solutions for 'dead' mics. The unfortunate user will not know that a battery box may be needed until the most inopportune moment, right after software installation. 'All dressed up and no place to go....'

The high-pass filter section in the Translator™ allows for an optional method to add batteries without disturbing performance. Two alkaline 'AA' batteries can be inserted when needed. They will supply proper bias continuously for at least fifteen months. In case of doubt there is no harm if batteries are inserted when not needed, but the 'green' option should be tried first.
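The fifteen month figure can be sanity checked against the bias currents mentioned earlier (one hundred to several hundred microamps). The sketch below computes the largest average current that a battery of a given capacity could supply over that period. The 2500 mAh capacity is an assumed typical figure for an alkaline AA cell, not a value from this article, and two cells in series share the same capacity.

```python
HOURS_PER_MONTH = 365.25 / 12 * 24  # average month, 730.5 hours


def max_average_current_ua(capacity_mah, months):
    """Largest average drain (in microamps) that lasts the given months."""
    hours = months * HOURS_PER_MONTH
    return capacity_mah / hours * 1000.0


ASSUMED_CAPACITY_MAH = 2500.0  # assumed alkaline AA capacity
LIFETIME_MONTHS = 15           # continuous life quoted in the article

current = max_average_current_ua(ASSUMED_CAPACITY_MAH, LIFETIME_MONTHS)
print(f"Maximum average drain: {current:.0f} microamps")
```

Under this assumption the budget is roughly 228 microamps of continuous drain, which is the same order of magnitude as the electret bias currents discussed in section 4.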

Microphone Muting with 'Keep Alive' Signal

'Go to sleep' or 'stop listening' are familiar commands but not always convenient. A mute switch is often needed when we are interrupted or need to cough. A simple on/off may work in theory, but in practice a loud click or pop will be 'heard' by the software. The result can be a spurious short word appearing on the screen. The click is caused by the interruption of mic bias current by the switch. A silent or clickless mute function is familiar in the world of telephony headsets where a click heard by the calling party will sound as if the call was dropped.

A silent mute circuit (switch plus resistor and capacitor) can be added to a speech recognition headset. The transient click is avoided by maintaining bias current but diverting mic audio. Now a new situation may develop, where the computer 'hears' silence and increases gain progressively while searching for a voice signal. If the microphone is unmuted after a short interval, the software will return to normal almost in real time. After a longer period of total silence, unmuting will present a loud voice signal causing saturation and a long delay before normal recognition is restored. Any benefit or convenience from a clickless mute switch is lost.

Figure 3. Frequency spectrum of 'keep alive' signal

A tailored 'keep alive' mute instead of total silence will prevent an overload, saturation and delay. To avoid spurious word 'recognition,' this mute signal must be unlike speech but have a similar volume level. For example, a repetitive waveform of narrow, low frequency pulses will work well. The result in the frequency domain (Figure 3) is a constant picket fence of narrow spikes. This 'keep alive' will put the software in a stable but suspended state while muted. Recovery to normal dictation will be immediate when unmuted.
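The 'picket fence' spectrum of a repetitive narrow pulse train can be reproduced numerically. The sketch below builds a pulse train and computes its discrete Fourier transform with the standard library only. The pulse period, pulse width and analysis length are assumptions chosen so that the period divides the analysis length exactly; they are not the parameters of the actual 'keep alive' circuit.

```python
import cmath

SAMPLE_RATE_HZ = 11025  # sampling rate quoted in the article
N = 512                 # analysis length in samples (assumed)
PERIOD = 128            # samples between pulses (assumed), about 86 Hz
WIDTH = 2               # pulse width in samples (assumed)


def pulse_train(n_samples, period, width):
    """Repetitive train of narrow rectangular pulses with unit amplitude."""
    return [1.0 if (n % period) < width else 0.0 for n in range(n_samples)]


def dft_magnitudes(signal):
    """Magnitude of the DFT for bins 0 .. len(signal) // 2."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        # Only nonzero samples contribute, so zero samples are skipped.
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                  for i, x in enumerate(signal) if x)
        mags.append(abs(acc))
    return mags


mags = dft_magnitudes(pulse_train(N, PERIOD, WIDTH))
spacing_bins = N // PERIOD
spacing_hz = SAMPLE_RATE_HZ / PERIOD

print(f"Spikes every {spacing_bins} bins, about {spacing_hz:.1f} Hz apart")
for k in range(0, 5 * spacing_bins, spacing_bins):
    print(f"bin {k:3d}: magnitude {mags[k]:.2f}")
```

Energy appears only at multiples of the pulse repetition rate, with essentially nothing in between. The narrower the pulse, the more slowly the spike heights fall off with frequency.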

Microphone Performance in Computer Telephony

Headset amplifiers for CTI such as VXI Parrott™ 60V provide a hands free solution with a common headset for the computer and the telephone. Computer telephony operation can be sequential (dictate, take a call, dictate, make a call and so on) in a small office/home office or simultaneous (dictate a purchase order to the PC while listening to a customer) for a call center environment. A high degree of electrical isolation and noise immunity in the headset amplifier is vital when the telephone, the computer, AC power and ground are brought together. Just as nature abhors a vacuum, telephony abhors grounding!

After these obvious CTI design requirements have been met, headset microphone performance must be addressed. What is needed amounts to 'keep alive' functionality with Translator™ enhancement. The length of a call is unpredictable and a spoken 'go to sleep' command is undesirable. With the push of a button the user can direct voice to the computer or the telephone. While speech to the computer is muted, the 'keep alive' drone prevents delays caused by software instability.

Translator™ frequency response and dynamic range are emulated in the Parrott™ 60V amplifier with less novel but equally important circuitry. Mic jack wiring conflicts at the PC are largely avoided by obtaining power from the ubiquitous AC power adaptor. For the same reason, batteries are not included or required.

Finally, for users with limited use of their hands, a foot switch can easily be added to the 60V. Two types are available, so circumstances will determine whether a momentary or maintained (step on/step off) foot switch is needed.

Summary

In addition to good headset acoustics and ergonomics, there is a simple electronic solution to provide robust microphone performance in voice enabled applications: the Translator™. More speech recognition 'hits' are achieved with voice frequency shaping and increased dynamic range. Problems with sound card compatibility are eliminated. Optional batteries can easily be added when needed. A mute switch with 'keep alive' circuit is vital for maintaining software stability while the mic is idle. These enhancements to the 'bare' microphone are obtainable for dictation only, or for computer telephony needs.