The other day I was playing around in Matlab, and although I can’t remember what I set out to do I did end up making a small lossy audio compression/decompression system! It seemed like a good topic for a blog post.
The discrete cosine transformation
Before I show the code I’ll have to very briefly introduce the discrete cosine transform (DCT). We should be able to ignore the maths and implementation of the DCT and treat it as a magic box which comes with Matlab or octave. If your interested in the details (and they are interesting) this
book is a great place to start if you want more depth than wikipedia offers.
An audio sample is a sequence real numbers
X={x1,…xN}. The
DCT of this audio sample is the sequence,
DCT(X)=Y={y1,…,yN} such that
xn=∑k=1nykw(k)cos(π(2n−1)(k−1)2N)
where
w(k)=⎧⎩⎨1N√,2N−−√, k=1 otherwise.
Don’t worry too much about that expression. We just need note that the DCT represents the original signal as a sum of cosines, and that the coefficients specify the amplitude of these cosines.
If we have the DCT coefficients we can transform them back to the original sequence with the inverse discrete cosine transform (IDCT). This could be calculated with the above expression but more efficient algorithms exist for both the DCT and IDCT (these algorithms are based on the fast Fourier transform, which is again an interesting topic that I won’t get into).
The compression scheme
So what does this have to do with audio compression? The coefficients of the DCT are amplitudes of cosines that are “within” the original single. Small coefficients will result in cosines with small amplitudes, which we are less likely to hear. So instead of storing the original sample we could take the DCT of the sample, discard small coefficients, and keep that. We would store fewer numbers and so compress the audio data.
The decompression algorithm would be simple, we would simply take the IDCT of whatever we stored play that back. We will be missing some of the signal, but one of the properties of DCT’s is that a few of the larger coefficients account for a large amount of the power in the original signal. Also the coefficients we discard will usually be from quiet high frequency parts of the sound, which we hear less. These are some of the reasons why DCT is often used in compression.
Filling in the details
There are a few details we are missing. When compressing with DCTs you typically compress small slices (windows) of the audio at once. This is partly so that seeking through the compressed stream is easier but mostly because we want the coefficients in our window to represent frequencies we hear (with large window the majority of the coefficients would represent frequencies well out of the human hearing range).
In addition we need to consider the binary format of the data. We could store the results of the DCT as floating point values, but that would be 32 bits per coefficient – which seems a little high given that .wav format files are stored as 16 bit integers. So let’s instead linearly map the range of the DCT coefficients to 16 bit integers and store those instead.
We’ll have to store not just the coefficients, but their index too, let’s store them as 16 bit integers as well. It may seem inefficient to do this since a few bits of our integers will never be used. This is somewhat offset by Matlab/Octave saving files with gzip compression. This will be able to compress those runs of zeros caused by using an overly large integer fairly well. This is a bit of a kludge, we are keeping things simple, so using non Matlab data types would be out of the question. After some testing I realised that I could map the actual coefficients to n bit range (with n<16), store them in 16 bit integers, and still get a saving in space which was nearly as good as using real n bit integers! I think that pretty much cover it! Here's the code for compression:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
| % Simple DCT compression.
% Works in matlab with signal processing toolbox or octave.
% X : (audio) samples, vector with each element in [-1,1]
% window : window size, length(X) must be divisible by this.
% num_components : number of DCT components to store per window.
% coeff_bits: number of bits to use to store each coefficient.
function result = compress_dct(X, window, num_components, coeff_bits)
num_win = length(X)/window;
X = reshape(X, window, num_win); % reshape so each window is a row
Y = dct(X); % applies dct to each row
% find top components and their indices
[a, I] = sort(abs(Y), 'descend');
I = I(1:num_components, :);
% build struct
result.coeffs = int16(zeros(num_components, num_win));
result.ind = int16(I);
result.window = window;
result.coeff_bits = coeff_bits;
for i = 1:num_win
% store each coefficient (in [-1,1]) as an integer mapped to range
% (-2^(coeff_bits-1), 2^(coeff_bits_1))
result.coeffs(:,i) = int16(Y(I(:,i), i)*2^(coeff_bits-1));
end
end
|
Here’s the decompression function:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
| function X = decompress_dct(data)
num_win = size(data.coeffs, 2);
coeffs = double(data.coeffs)/(2^(data.coeff_bits-1)); % Rescale coeffs to [-1,1]
% Construct full DCT windows from sparse.
Y = zeros(data.window, num_win);
for i = 1:num_win
Y(data.ind(:,i),i) = coeffs(:,i);
end
% Inverse DCT each window.
X = idct(Y);
% Stitch windows into one long vector.
X = reshape(X, num_win*data.window, 1);
end
|
And here’s an example of using compression and decompression functions from Octave.
1
2
3
4
5
6
7
8
9
10
11
12
13
| window_size = 2048;
n_coeffs_keep = 100;
coeff_n_bits = 10;
% Load wav file, must be mono, number of samples divisible by window size.
X = wavread('bach_clip.wav');
% Compress wav
comp = compress_dct(X, window_size, n_coeffs_keep, coeff_n_bits);
% Save comp structure in a binary format with extra gzip compression
% so we can see how big it really is.
save -binary -z bach.mat comp
% Decompress and write back to wav for comparison.
Xdecomp = decompress_dct(comp);
wavwrite(Xdecomp, 44100, 16, 'bach_decomp.wav');
|
What does it sound like?
Here are some examples of a compressed piece of audio at various settings.
I realise that soundcloud will stream in mp3 which could obscure the results, but the compression artefacts are large enough to hear though mp3 and you can download the .wavs though soundcloud if you want. EDIT: The soundcloud player seems to be very noisy for some reason – I suggest you click download and listen to the .wavs.
The most annoying artifact, that occurs even in the higher bit rate example, is a slight “clicking” noise. I think this is caused by the windowing – the sample is not forced to be “continuous” over the boundaries of windows so you hear small clicks on windows where it does not line up. Aside from that the highest bitrate version is not totally awful to listen to, although even a fairly poor set of headphones I can hear “garbling”. The cool thing is that even two mid bit-rate streams are fairly intelligible (e.g you could probably understand speech), which is impressive considering the level of compression achieved and the simplicity of the code. The lowest bit rate stream is really bad, but it’s a good example of what very drastic lossy audio compression sounds like.
In conclusion I think it’s rather impressive how far you can get with lossy audio compression by only using the DCT and some generic lossless compression. A core part of MP3 audio compression is the DCT, but MP3 goes well beyond this to achieve much better results.
Related Posts