Karaoke audio processing
September 19, 2017

Unraveling the Karaoke Technology

by Sujith P. (Senior Engineer, Media Client Technologies)

Take a peek into karaoke audio processing. Explore how the various components of a software-driven karaoke player improve perceptual quality, enabling truly entertaining experiences.

Fitted with disco balls, colors and lights, and powered by the latest technologies including high definition graphics and Bluetooth, today’s karaoke machines offer a complete package of fun and entertainment. Though next-generation video games threatened to push karaoke off into history’s graveyard in the early 2000s, karaoke managed to survive over the years. Today, we are seeing a cultural resurgence of the interactive entertainment system, as it continues to capture new audiences with many innovations inspired by the latest breed of smartphones and apps.

Singing along
Karaoke, which allows users to croon along to the recorded music of their favorite songs, means ‘empty orchestra’ in Japanese.
Read about the evolution of the technology @ blog

Curious about the inner workings of the karaoke player and how it enables even amateur singers to feel like a star? Read on.

Key building blocks of a karaoke player

The basic components of a karaoke system are the microphones, the processing unit and the speakers. The signal captured through the microphone is processed, mixed with the recorded audio and played through the speakers. The heart of the player is the processing unit, which comprises various signal processing components that help improve the perceptual quality of the audio. In essence, the system processes the voice and the recorded music through two separate paths, mixes them, and plays the output through the speakers (as shown in Figure 1).

While traditional karaoke machines relied on dedicated hardware processors/ASICs, we can now leverage advanced processing and audio software technology to build an effective software based processing system.

Figure 1. Processing unit: The heart of both hardware and software driven karaoke players

Take a look at some of the key components of the karaoke audio processing unit (common to both hardware and software based systems).

Vocal equalizer

The vocal equalizer consists of a set of filters to improve a voice recording and the quality of vocal mixing. The filters process the voice signal serially to change the character of a voice.

Here are the key filters:

Compressor filter: It supports three configurations – normal, powerful and soft – with each of them offering varying intelligibility, clarity and brightness to a voice.

De-esser filter: This is used to lower the level of sibilant sounds such as s, z, c and sh in a voice.

Sharpness filter: It applies one filter from a set of predefined filters, depending upon the set value, and helps change the sharpness of a voice.

Low cut filter: It applies one filter from a set of pre-defined high pass filters to the input signal. This helps reduce the low frequency noise component that is captured along with the voice, such as the unwanted sound from the floor if the singer is dancing.

High cut filter: It selects and applies one filter from a set of pre-defined low pass filters, to attenuate the high frequency component, which can be used to create the ‘radio effect’.

Power: This control is used to adjust the voice level while mixing voice with music, and also when the singers are in ‘lead and backing’ configurations.
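To make the low cut filter more concrete, here is a minimal sketch of a first-order high pass filter in Python. It is only an illustration: the function name low_cut, the one-pole design and the cutoff values are assumptions made for this example, and the pre-defined filters in an actual player are likely to be steeper.

```python
import math

def low_cut(samples, cutoff_hz, sample_rate):
    """First-order (one-pole) high pass filter.

    Illustrative only: attenuates content below cutoff_hz, such as
    floor rumble picked up by the microphone.
    """
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)   # analog RC time constant
    dt = 1.0 / sample_rate                   # sampling interval
    alpha = rc / (rc + dt)                   # filter coefficient, 0 < alpha < 1
    out = []
    prev_in = 0.0
    prev_out = 0.0
    for x in samples:
        # The output follows changes in the input but decays toward zero
        # when the input is constant, which removes DC and slow drift.
        y = alpha * (prev_out + x - prev_in)
        out.append(y)
        prev_in, prev_out = x, y
    return out
```

Since the filters in the vocal equalizer run serially, the output of a stage like this one would simply be passed as the input to the next filter in the chain.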

Dynamic Range Compressor

Dynamic range compression (DRC), also called audio level compression, volume compression, compression, or limiting, is a process that manipulates the dynamic range of an audio signal to render it suitable for a given listening environment.

Figure 2. Dynamic Range Compression (DRC)

We can adjust the characteristics and performance of the compressor by adjusting the following parameters:

Threshold: The input level above which the compression process starts.

Ratio: The ratio of the change in input level to the change in output level for signals above the threshold, which determines the amount of compression to be applied.

Attack time: The time the compressor takes to bring the signal down to the level determined by the ratio parameter, once the signal has risen above the threshold.

Release time: The time it takes to bring the level back to the normal level, once the signal has fallen below the threshold.

Make-up gain: The gain applied to the processed output to compensate for the reduction in loudness due to compression, and achieve the desired loudness level (as shown in Figure 2).

Auto gain: The gain that automatically compensates for the gain reduction due to compression.

De-esser: De-essing is a technique to reduce or eliminate the excessive prominence of sibilant consonants in human voice recordings. The de-essing components of the signal are separated and processed in parallel to the main processing path, and finally mixed with the main component. This enables us to use different controls over the de-essing component.
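The way threshold, ratio and make-up gain interact can be sketched as a static level curve. This is a simplified illustration, not the implementation of any particular player: the function name compressed_level_db is hypothetical, levels are assumed to be in decibels, and the attack and release times are deliberately left out.

```python
def compressed_level_db(input_db, threshold_db, ratio, makeup_gain_db=0.0):
    """Static compressor curve in decibels (attack/release not modeled).

    Below the threshold the level passes through unchanged. Above it,
    every `ratio` dB of extra input yields only 1 dB of extra output.
    """
    if input_db <= threshold_db:
        output_db = input_db
    else:
        output_db = threshold_db + (input_db - threshold_db) / ratio
    # Make-up gain restores the loudness lost to compression.
    return output_db + makeup_gain_db


# A -10 dB input with a -20 dB threshold and a 4:1 ratio:
# the 10 dB overshoot is reduced to 2.5 dB above the threshold.
print(compressed_level_db(-10.0, -20.0, 4.0))   # -17.5
```

A very high ratio makes the curve almost flat above the threshold, which is why limiting is often described as an extreme form of compression.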

Howling canceller

Figure 3. Cancel abnormal amplifications with Howling Canceller

The howling canceller compensates for the abnormal amplification that causes howling when sound from the speakers feeds back into the microphone. Working across the full frequency range, it detects and attenuates unwanted feedback of any frequency captured by the microphone from the loudspeakers. The attenuation rate and the maximum attenuation to be applied are configurable, based on the noise floor and on how fast the system has to suppress the howling.

Pitch shifter

The pitch shifter is used to control the fundamental frequency, or pitch, of a voice and of musical instruments without changing the tempo. It can also be used to create special effects by extending the range of an instrument, and it can change the pitch by -12 to +12 semitones. A pitch shifter can offer time domain, frequency domain and time-frequency domain algorithms. While the time domain algorithm has lower complexity, the frequency domain algorithm delivers better audio quality.

The latest breed of pitch shifters incorporates a transient detection technique that helps preserve transients while shifting pitch. It also maintains the position of formants, thus preserving the natural tonal quality of the pitch-shifted output.
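To put the -12 to +12 semitone range in perspective, the short Python sketch below converts a semitone shift into a frequency ratio. It assumes equal temperament, where each semitone corresponds to a factor of 2^(1/12); the function names are made up for this example, and the sketch covers only the arithmetic, not the pitch shifting algorithm itself.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a shift in semitones (equal temperament assumed)."""
    if not -12 <= semitones <= 12:
        raise ValueError("shift must be within -12 to +12 semitones")
    return 2.0 ** (semitones / 12.0)


def shift_frequency(frequency_hz, semitones):
    """Fundamental frequency after shifting by the given number of semitones."""
    return frequency_hz * semitones_to_ratio(semitones)


# +12 semitones is one octave up, so the frequency doubles.
print(shift_frequency(440.0, 12))    # 880.0
# -12 semitones is one octave down, so the frequency halves.
print(shift_frequency(440.0, -12))   # 220.0
```

The full range therefore spans one octave in either direction, which is typically enough to move a song into a comfortable key for the singer.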


Reverb

Reverberation is the collection of reflected sounds from the surfaces in an enclosure like a concert hall. The reverb effect serves to simulate the acoustics of various room types ranging from small rooms to large concert halls, thus adding a sense of depth and space to audio. Karaoke players could include parametric convolution reverb and convolution reverb based on a pre-recorded impulse response. In parametric convolution reverb, the impulse response is generated synthetically based on a room’s parameters, and then convolved with the input signal.

In convolution reverb, the signal is convolved with the impulse response of the desired environment. Each environment can be characterized by an impulse response by treating it as a linear time-invariant system. The quality of the reverb mainly depends on the quality of the impulse response. While the intensity of the reverb effect can be adjusted by controlling the reverb level from 0 to 100% reverb, the processing time and memory requirements depend on the length of the impulse response.
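A bare-bones version of convolution reverb can be sketched in a few lines of Python. Treat it as an illustration under stated assumptions: the function names are hypothetical, the reverb level is modeled as a simple linear blend between the dry and the reverberated signal (0.0 for 0%, 1.0 for 100%), and the convolution is computed directly rather than with the faster FFT-based methods that long impulse responses usually call for.

```python
def convolve(signal, impulse_response):
    """Direct convolution; cost grows with len(signal) * len(impulse_response)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, x in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += x * h
    return out


def apply_reverb(signal, impulse_response, reverb_level):
    """Blend the dry signal with its convolved (wet) version.

    reverb_level: 0.0 keeps only the dry signal, 1.0 only the wet signal
    (an assumed mapping of the 0 to 100% control).
    """
    wet = convolve(signal, impulse_response)
    # Pad the dry signal so that the reverb tail is not cut off.
    dry = list(signal) + [0.0] * (len(wet) - len(signal))
    return [(1.0 - reverb_level) * d + reverb_level * w
            for d, w in zip(dry, wet)]


# A single click through a toy two-tap "room": one direct path
# and one reflection at half amplitude.
print(apply_reverb([1.0, 0.0], [1.0, 0.5], 1.0))   # [1.0, 0.5, 0.0]
```

The nested loop makes it easy to see why processing time and memory scale with the length of the impulse response: every input sample touches every tap of the response.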


Delay

Delay is a classic, time based audio effect, where the sound is modified by adding the delayed output signal to the input at repeated intervals to create a decaying echo. The latest karaoke players can support any number of feedback repetitions, with a feedback interval of up to 500 ms. By changing the number of repetitions and the delay time interval, we can get a range of echo effects. We can also change the nature and intensity of the echo effect by adjusting the attenuation rate and the gain to be applied to the feedback.

Figure 4. The delay effect
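The decaying echo can be sketched as a sum of delayed, progressively attenuated copies of the input. The Python below is a simplified illustration rather than a description of any specific player: the function name and parameters are hypothetical, and the attenuation is modeled as a single gain applied once per repetition.

```python
def delay_effect(samples, sample_rate, delay_ms, num_repeats, feedback_gain):
    """Add num_repeats echoes, each delay_ms later and feedback_gain times
    quieter than the previous one (illustrative model)."""
    delay_samples = int(round(sample_rate * delay_ms / 1000.0))
    # Extend the output so that the last echo fits.
    out = [0.0] * (len(samples) + delay_samples * num_repeats)
    for k in range(num_repeats + 1):          # k = 0 is the original sound
        gain = feedback_gain ** k             # each repeat is attenuated again
        offset = k * delay_samples
        for n, x in enumerate(samples):
            out[offset + n] += gain * x
    return out


# One click, two echoes spaced 2 ms apart at a 1 kHz sample rate,
# each at half the level of the previous one.
print(delay_effect([1.0], 1000, 2, 2, 0.5))   # [1.0, 0.0, 0.5, 0.0, 0.25]
```

A feedback gain close to 1 gives long, slowly fading echoes, while a small gain makes the repeats die out almost immediately.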


Equalizer

An equalizer is a set of filters that modifies the frequency envelope of audio samples for practical or aesthetic reasons. The user can manipulate the sound using the equalizer controls to create the desired blend of sound characteristics like bass, treble and mid-range to suit personal preference and music genre. Since each filter controls a fixed range of audio frequencies, we can control a specific part of the spectrum without affecting the other parts. We can also control the gain for each band, thereby boosting or attenuating the region of interest.

Typical equalizers support three band equalization – bass, mid and treble – which serves as a simple but effective equalizing solution for an amateur. However, if you are a professional, you would require equalizers that offer more than three bands.
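A three band equalizer can be sketched by splitting the signal into bass, mid and treble components and applying a separate gain to each. The Python below is a minimal illustration: the function names, the one-pole filters and the 250 Hz and 4 kHz split frequencies are assumptions chosen for this example, and production equalizers typically use sharper filters.

```python
import math

def one_pole_lowpass(samples, cutoff_hz, sample_rate):
    """Simple one-pole low pass filter used here to split the bands."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, state = [], 0.0
    for x in samples:
        state += alpha * (x - state)
        out.append(state)
    return out


def db_to_gain(db):
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def three_band_eq(samples, sample_rate, bass_db=0.0, mid_db=0.0, treble_db=0.0,
                  low_split_hz=250.0, high_split_hz=4000.0):
    """Bass/mid/treble equalizer; split frequencies are assumed values."""
    low = one_pole_lowpass(samples, low_split_hz, sample_rate)
    low_and_mid = one_pole_lowpass(samples, high_split_hz, sample_rate)
    g_bass, g_mid, g_treble = (db_to_gain(bass_db), db_to_gain(mid_db),
                               db_to_gain(treble_db))
    out = []
    for x, lo, lm in zip(samples, low, low_and_mid):
        bass = lo            # content below the low split
        mid = lm - lo        # content between the two splits
        treble = x - lm      # content above the high split
        out.append(g_bass * bass + g_mid * mid + g_treble * treble)
    return out
```

Because the three bands add up to the original signal, leaving all gains at 0 dB passes the audio through unchanged, and each gain then affects only its own part of the spectrum.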

Volume control

Volume control is used for adjusting the sound level to a desired level.

In a nutshell

From the vintage home cassette machines of the 1980s and the laser disc based karaoke machines of the 1990s to today’s next-gen players driven by advanced software algorithms, karaoke has come a long way indeed. It is a fairly technical system with a large number of modules that need to work in sync with each other to deliver the perfect user experience. With efficient karaoke audio processing software, you can deliver superior value in terms of quality of experience.

Reach out to us @ [email protected] for more insights into the karaoke player. Check out Ittiam’s Audio Codecs, Audio Post Processing and Loudness Metering & Leveling solutions.