Exploring the Discrete Cosine Transform
The DCT is used extensively in lossy audio, image, and video compression. It can be found in MP3 audio compression, JPEG image compression, and MPEG-1/2 video compression. Let's take a closer look at the fundamentals of how a DCT is used in practice.
A DCT expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies.
For JPEG and MPEG video, here is the formal definition of an 8×8 DCT:
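In the notation commonly used in the JPEG literature, where g is the 8×8 block of input samples, G is the block of coefficients, and α(k) is 1/√2 for k = 0 and 1 otherwise:

```latex
G_{u,v} = \frac{1}{4}\,\alpha(u)\,\alpha(v)
  \sum_{x=0}^{7}\sum_{y=0}^{7} g_{x,y}
  \cos\!\left[\frac{(2x+1)u\pi}{16}\right]
  \cos\!\left[\frac{(2y+1)v\pi}{16}\right],
\qquad
\alpha(k) = \begin{cases}\dfrac{1}{\sqrt{2}} & k = 0\\[4pt] 1 & \text{otherwise}\end{cases}
```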
And the corresponding inverse DCT:
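With the same normalization, where α(k) is 1/√2 for k = 0 and 1 otherwise:

```latex
g_{x,y} = \frac{1}{4}
  \sum_{u=0}^{7}\sum_{v=0}^{7} \alpha(u)\,\alpha(v)\, G_{u,v}
  \cos\!\left[\frac{(2x+1)u\pi}{16}\right]
  \cos\!\left[\frac{(2y+1)v\pi}{16}\right]
```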
For reference, there are also several high-performance variants and approximations that reduce the number of mathematical operations required, at the cost of readability. For example:
- The LLM DCT: Practical fast 1-D DCT algorithms with 11 multiplications
- The AAN DCT: A Fast DCT-SQ Scheme for Images
So, what does it do?
I'm going to start by looking at a simpler one-dimensional example.
We will begin with a signal, a linear ramp, like this:
The input signal has been sampled 8 times. Each sample is independent of its neighboring samples; i.e., it represents the value of the signal at a specific point in space or time. A DCT transforms these discrete data points into a sum of cosine functions, each oscillating at a different frequency and with a different magnitude.
Specifically, the DCT will transform this set of 8 samples into 8 coefficients. Each coefficient multiplies a specific cosine frequency, scaling that function's magnitude and its corresponding impact on the reconstructed signal.
Let's look at the formal definition of the forward and inverse one-dimensional DCT that we will be using.
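For N samples, the forward DCT-II computes X_k = α(k) · √(2/N) · Σ xₙ · cos[π(n + ½)k / N], where α(0) = 1/√2 and α(k) = 1 otherwise. A direct JavaScript translation might look like the sketch below; it is a straightforward O(n²) loop for clarity, not an optimized implementation (production codecs use factored schemes like the LLM and AAN algorithms mentioned earlier):

```javascript
// Forward one-dimensional DCT (type II, orthonormal scaling).
// A direct O(n^2) translation of the definition; fine for 8 samples.
function dct1d(samples) {
  const N = samples.length;
  const coefficients = new Array(N);
  for (let k = 0; k < N; k++) {
    let sum = 0;
    for (let n = 0; n < N; n++) {
      sum += samples[n] * Math.cos((Math.PI / N) * (n + 0.5) * k);
    }
    // Scaling that makes the transform orthonormal: sqrt(1/N) for the
    // DC term (k = 0), sqrt(2/N) for every other coefficient.
    coefficients[k] = sum * Math.sqrt((k === 0 ? 1 : 2) / N);
  }
  return coefficients;
}
```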
This function takes an array of samples and returns an equal-length array of DCT coefficients. Now, we can use this function to transform our input signal. Something like:
const coefficients = dct1d([8, 16, 24, 32, 40, 48, 56, 64]);
The resulting array will be 8 elements long and will look like this:
Reconstructing the signal
Now that we have our set of coefficients, how do we transform it back into the original signal? For that, we use the inverse DCT.
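A matching inverse can be written the same way. This is again an O(n²) sketch assuming the orthonormal scaling, so that `idct1d(dct1d(x))` returns `x` up to floating-point rounding:

```javascript
// Inverse one-dimensional DCT (type III, the inverse of the
// orthonormal DCT-II). Direct O(n^2) sketch for clarity.
function idct1d(coefficients) {
  const N = coefficients.length;
  const samples = new Array(N);
  for (let n = 0; n < N; n++) {
    let sum = 0;
    for (let k = 0; k < N; k++) {
      // Same basis functions and scaling as the forward transform.
      sum += coefficients[k] *
             Math.sqrt((k === 0 ? 1 : 2) / N) *
             Math.cos((Math.PI / N) * (n + 0.5) * k);
    }
    samples[n] = sum;
  }
  return samples;
}
```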
Let's use it to reconstruct our signal:
const reconstructedSignal = idct1d(coefficients);
Again, this function returns as many samples as there are coefficients. And aside from some small floating-point rounding errors, the reconstructed signal is identical to the original.
Okay, but why?
Up to now, you may have noticed that each transformation has been of equivalent length; e.g., n samples become n coefficients and vice versa. So, how is this actually useful for compression?
Look again at the coefficients of our transformed signal:
Notice that the first two coefficients have a relatively large magnitude, while the rest are fairly close to zero. This is because our source signal was a simple ramp: its value increased by 8 units at each sample.
As such, most of the energy of the signal can be expressed in the lower frequencies, while the higher frequencies have less of an overall impact on the desired signal. The DCT exploits this property; this is referred to as energy compaction.
If our initial signal were composed of white noise, i.e. static, there would be little, if any, energy compaction. However, in many real-world signals, whether aural or visual, the samples tend to be somewhat ordered and better suited to this type of energy compaction.
We can use quantization to squash our coefficients, which are currently real numbers, into a smaller range of integers. As a simplistic implementation, we can divide each coefficient by 50 (the choice of 50 is completely arbitrary on my part; it can really be any number for our purposes) and truncate the result.
const quantizer = (v) => v / 50 | 0;
const quantized = coefficients.map(quantizer);
After quantization, we are left with only two non-zero coefficients, followed by a long run of zeros. This set can be run-length encoded into something much smaller than the original set of samples.
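To make that concrete, here is a minimal run-length encoder. The name `runLengthEncode` and the `[value, count]` pair format are my own illustrative choices, not part of any particular codec:

```javascript
// Collapse consecutive repeats into [value, count] pairs.
function runLengthEncode(values) {
  const runs = [];
  for (const v of values) {
    const last = runs[runs.length - 1];
    if (last !== undefined && last[0] === v) {
      last[1] += 1;      // extend the current run
    } else {
      runs.push([v, 1]); // start a new run
    }
  }
  return runs;
}
```

Applied to a quantized set like `[2, -1, 0, 0, 0, 0, 0, 0]`, this yields `[[2, 1], [-1, 1], [0, 6]]`: three pairs instead of eight values.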
This is fundamentally how the DCT is used for audio and visual compression.
If you have a keen eye, you may have noticed something interesting during the quantization step in the last section. We truncated our real values into integers, i.e. we threw away some data.
While that made the data more compressible, what effect does that have on our reconstructed signal? Let's find out!
First, we need to de-quantize our coefficients:
const dequantizer = (v) => v * 50;
const dequantized = quantized.map(dequantizer);
Notice they are not the same as the coefficients we calculated before, due to the truncation. Now, let's run the inverse DCT and see what signal we get back:
const decompressedSignal = idct1d(dequantized);
At first glance, the reconstructed signal appears similar. However, on closer inspection, you can see they are actually different. That is because we threw away some of the smaller, high-frequency coefficients that were subtly adjusting the reconstructed signal. Without those frequencies, the new signal drifts away from the original.
Still, compare them together on the same chart:
Not quite the same, but close, and that's the idea. We can use the DCT to transform our data into a more compressible set that can be reconstructed into a similar, though not identical, signal. We have, however, lost some of the original data in the process; that is what is meant by "lossy" compression.
By adjusting the quantization value (or using a quantization matrix), one can control the balance between compressibility and signal fidelity of the transformation.
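The trade-off can be demonstrated with a self-contained sketch (the 1-D transforms are restated here so the snippet runs on its own; `roundTrip` and its report format are illustrative inventions). It quantizes the same ramp with a few different divisors and measures how many coefficients become zero versus how far the reconstruction drifts:

```javascript
// How the quantization divisor trades compressibility (zeros) for fidelity.
function dct1d(s) {
  const N = s.length;
  return Array.from({ length: N }, (_, k) => {
    let sum = 0;
    for (let n = 0; n < N; n++) sum += s[n] * Math.cos((Math.PI / N) * (n + 0.5) * k);
    return sum * Math.sqrt((k === 0 ? 1 : 2) / N);
  });
}
function idct1d(c) {
  const N = c.length;
  return Array.from({ length: N }, (_, n) => {
    let sum = 0;
    for (let k = 0; k < N; k++) {
      sum += c[k] * Math.sqrt((k === 0 ? 1 : 2) / N) * Math.cos((Math.PI / N) * (n + 0.5) * k);
    }
    return sum;
  });
}

// Quantize with the given divisor, de-quantize, reconstruct, and report
// how many coefficients became zero plus the worst per-sample error.
function roundTrip(signal, divisor) {
  const quantized = dct1d(signal).map((v) => v / divisor | 0);
  const rebuilt = idct1d(quantized.map((v) => v * divisor));
  const maxError = Math.max(...rebuilt.map((v, i) => Math.abs(v - signal[i])));
  return { zeros: quantized.filter((v) => v === 0).length, maxError };
}

const signal = [8, 16, 24, 32, 40, 48, 56, 64];
for (const divisor of [10, 50, 100]) {
  const { zeros, maxError } = roundTrip(signal, divisor);
  console.log(`divisor ${divisor}: ${zeros} zero coefficients, max error ${maxError.toFixed(2)}`);
}
```

Larger divisors zero out more coefficients (better compression) at the cost of a larger reconstruction error.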
Let's explore the DCT a little deeper:
- Visualization: Behold the waveforms of the DCT and how the signal is regenerated.
- Take it to the Next Dimension: Look at some 2D versions of the DCT.
- Indiana Jones: Explore some visual implications/artifacts of compression.