Hacker Guide/Audio Output
The audio output layer
Audio output overview
This chapter documents the audio output layer known under the "audio output 3" (aout3) codename. It was first released with VLC version 0.5.0. Previous versions use an older API, which is neither documented nor supported any longer. New code should be written only for aout3 and later.
The audio output's main purpose is to take sound samples from one or several decoders (called "input streams" in this chapter), to mix them and write them to an output device (called "output stream"). During this process, transformations may be needed or requested by the user; they are performed by audio filters.
(insert here a schematic of the data flow in aout3)
- Sample : A sample is an elementary piece of audio information, containing the value for all channels. For instance, a stream at 44100 Hz features 44100 samples per second, no matter how many channels are coded, nor the coding type of the coefficients.
- Frame : A set of samples of arbitrary size. Codecs usually have a fixed frame size (for instance an A/52 frame contains 1536 samples). Frames do not have much importance in the audio output, since it can manage buffers of arbitrary sizes. However, for undecoded formats, the developer must indicate the number of bytes required to carry a frame of n samples, since it depends on the compression ratio of the stream.
- Coefficient : A sample contains one coefficient per channel. For instance a stereo stream features 2 coefficients per sample. Many audio items (such as the float32 audio mixer) deal directly with the coefficients. Of course, an undecoded sample format doesn't have the notion of "coefficient", since a sample cannot be materialized independently in the stream.
- Resampling : Changing the number of samples per second of an audio stream.
- Downmixing/upmixing : Changing the configuration of the channels (see below).
Audio sample formats
The whole audio output can be viewed as a pipeline transforming one audio format to another in successive steps. Consequently, it is essential to understand what an audio sample format is.
The audio_sample_format_t structure is defined in include/audio_output.h. It contains the following members :
- i_format : Define the format of the coefficients. This is a FOURCC field. For instance 'fl32' (float32), 'fi32' (fixed32), 's16b' (signed 16-bit big endian), 's16l' (signed 16-bit little endian), AOUT_FMT_S16_NE (shortcut to either 's16b' or 's16l'), 'u16b', 'u16l', 's8 ', 'u8 ', 'ac3 ', 'spdi' (S/PDIF). Undecoded sample formats include 'a52 ', 'dts ', 'spdi', 'mpga' (MPEG audio layer I and II), 'mpg3' (MPEG audio layer III). An audio filter that converts one format to another is called, by definition, a "converter". Some converters play the role of a decoder (for instance a52tofloat32.c), but are in fact "audio filters".
- i_rate : Define the number of samples per second the audio output will have to deal with. Common values are 22050, 24000, 44100, 48000. i_rate is in Hz.
- i_physical_channels : Define the channels which are physically encoded in the buffer. This field is a bitmask of values defined in audio_output.h, for instance AOUT_CHAN_CENTER, AOUT_CHAN_LEFT, etc. Beware : the numeric value doesn't represent the number of coefficients per sample, see aout_FormatNbChannels() for that. The coefficients for each channel are always stored interleaved, because it is much easier for the mixer to deal with interleaved coefficients. Consequently, decoders which output planar data must implement an interleaving function. Coefficients must be output in the following order (WG-4 specification) : left, right, left surround, right surround, center, LFE.
- i_original_channels : Define the channels from the original stream which have been used to constitute a buffer. For instance, imagine your output plug-in only has mono output (AOUT_CHAN_CENTER), and your stream is stereo. You can either use both channels of the stream (i_original_channels == AOUT_CHAN_LEFT | AOUT_CHAN_RIGHT), or select one of them. i_original_channels uses the same bitmask as i_physical_channels, and also features two special bits : AOUT_CHAN_DOLBYSTEREO, which indicates whether the input stream is downmixed to Dolby surround sound, and AOUT_CHAN_DUALMONO, which indicates that the stereo stream is actually made up of two mono streams, only one of which should be selected (for instance, two languages on one VCD).
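The warning about i_physical_channels (the numeric value of the mask is not the channel count) can be illustrated with a minimal sketch. The EX_CHAN_* bit values below are hypothetical placeholders chosen for this example; the real AOUT_CHAN_* constants are defined in audio_output.h. Likewise, ex_count_channels() is only a stand-in for the kind of work aout_FormatNbChannels() is described as doing, not its actual implementation.

```c
#include <stdint.h>

/* Hypothetical bit values, for illustration only. The real AOUT_CHAN_*
 * constants are defined in audio_output.h. */
#define EX_CHAN_CENTER     0x001u
#define EX_CHAN_LEFT       0x002u
#define EX_CHAN_RIGHT      0x004u
#define EX_CHAN_REARLEFT   0x020u
#define EX_CHAN_REARRIGHT  0x040u
#define EX_CHAN_LFE        0x100u

/* Count the channels present in a bitmask: one coefficient per set bit. */
unsigned ex_count_channels(uint32_t i_mask)
{
    unsigned i_nb = 0;

    while (i_mask != 0) {
        i_mask &= i_mask - 1;   /* clear the lowest set bit */
        i_nb++;
    }
    return i_nb;
}
```

With these placeholder values, a stereo mask (EX_CHAN_LEFT | EX_CHAN_RIGHT) has the numeric value 6 but describes 2 coefficients per sample.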
For 16-bit integer format types, we make a distinction between big-endian and little-endian storage types. Floats are also stored in either big-endian or little-endian format, yet we make no such distinction for them. The reason is that samples are rarely stored in float32 format in a file and transferred from one machine to another, so we assume float32 always uses the native endianness.
Yet, samples are quite often stored as big-endian signed 16-bit integers, such as in DVD's LPCM format. So the LPCM decoder allocates an 's16b' input stream, and on little-endian machines, an 's16b'->'s16l' converter is automatically invoked by the input pipeline.
In most cases though, AOUT_FMT_S16_NE and AOUT_FMT_U16_NE should be used.
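As an illustration of what an 's16b'->'s16l' converter has to do, here is a minimal sketch of an in-place byte swap. The function name and signature are assumptions made for this example and are not taken from the VLC sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Swap the two bytes of every 16-bit coefficient in place.
 * The same routine converts 's16b' to 's16l' and back.
 * A trailing odd byte, if any, is left untouched. */
void ex_swap_s16_in_place(uint8_t *p_data, size_t i_nb_bytes)
{
    for (size_t i = 0; i + 1 < i_nb_bytes; i += 2) {
        uint8_t i_tmp = p_data[i];

        p_data[i] = p_data[i + 1];
        p_data[i + 1] = i_tmp;
    }
}
```

Since the output occupies exactly the same space as the input, a filter of this kind can work in place without requesting a new buffer.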
The aout core provides macros to compare two audio sample formats. AOUT_FMTS_IDENTICAL() tests if i_format, i_rate, i_physical_channels and i_original_channels are identical. AOUT_FMTS_SIMILAR() tests if i_rate and the channel configuration are identical, regardless of i_format (useful to write a pure converter filter).
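The two comparisons can be sketched as plain functions. This is an illustration only: ex_format_t is a reduced stand-in for audio_sample_format_t, the function names are hypothetical, and the sketch assumes that "similar" means both channel bitmasks match.

```c
#include <stdint.h>

/* Reduced stand-in for audio_sample_format_t (fields named as in the text). */
typedef struct {
    uint32_t i_format;             /* FOURCC */
    uint32_t i_rate;               /* Hz */
    uint32_t i_physical_channels;  /* bitmask */
    uint32_t i_original_channels;  /* bitmask */
} ex_format_t;

/* Same rate and same channel configuration; i_format may differ. */
int ex_fmts_similar(const ex_format_t *p_a, const ex_format_t *p_b)
{
    return p_a->i_rate == p_b->i_rate
        && p_a->i_physical_channels == p_b->i_physical_channels
        && p_a->i_original_channels == p_b->i_original_channels;
}

/* Similar, and the coefficient format matches as well. */
int ex_fmts_identical(const ex_format_t *p_a, const ex_format_t *p_b)
{
    return p_a->i_format == p_b->i_format && ex_fmts_similar(p_a, p_b);
}
```

A pure converter filter would typically accept a pair of formats that are similar but not identical, since it only changes i_format.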
The audio_sample_format_t structure then contains two additional parameters, which you are not supposed to write directly, except if you're dealing with undecoded formats. For PCM formats they are automatically filled in by aout_FormatPrepare(), which is called by the core functions when necessary.
- i_frame_length : Define the number of samples of the "natural" frame. For instance for A/52 it is 1536, since 1536 samples are compressed in an undecoded buffer. For PCM formats, the frame size is 1, because every sample in the buffer can be independently accessed.
- i_bytes_per_frame : Define the size (in bytes) of a frame. For A/52 it depends on the bitrate of the input stream (read in the sync info). For PCM formats it is the size of one sample : for instance, for stereo float32 samples, i_bytes_per_frame == 8 (i_frame_length == 1).
These last two fields (which are always meaningful as soon as aout_FormatPrepare() has been called) make it easy to calculate the size of an audio buffer : i_nb_samples * i_bytes_per_frame / i_frame_length.
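The size formula can be written as a small helper. The function name is hypothetical; only the arithmetic comes from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of a buffer holding i_nb_samples samples, given the two
 * frame fields of the sample format. */
size_t ex_buffer_size(size_t i_nb_samples,
                      size_t i_bytes_per_frame,
                      size_t i_frame_length)
{
    assert(i_frame_length != 0);   /* set by aout_FormatPrepare() or the decoder */
    return i_nb_samples * i_bytes_per_frame / i_frame_length;
}
```

For stereo float32 (i_bytes_per_frame == 8, i_frame_length == 1), 1024 samples need 8192 bytes. For an undecoded format with i_frame_length == 1536, a buffer of 1536 samples is exactly one frame, so its size equals i_bytes_per_frame.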
The input spawns a new audio decoder, say for instance an A/52 decoder. The A/52 decoder parses the sync info for format information (e.g. it finds 48 kHz, 5.1, 196 kbit/s), and creates a new aout "input stream" with aout_InputNew(). The sample format is :
- i_format = 'a52 '
- i_rate = 48000
- i_physical_channels = i_original_channels = AOUT_CHAN_LEFT | AOUT_CHAN_RIGHT | AOUT_CHAN_CENTER | AOUT_CHAN_REARLEFT | AOUT_CHAN_REARRIGHT | AOUT_CHAN_LFE
- i_frame_length = 1536
- i_bytes_per_frame = 24000
This input format won't be modified, and will be stored in the aout_input_t structure corresponding to this input stream : p_aout->pp_inputs->input. Since it is our first input stream, the aout core will try to configure the output device with this audio sample format (p_aout->output.output), to avoid unnecessary transformations.
The core will probe for an output module in the usual fashion, and what happens next depends on the output device. Either the device has the S/PDIF capability, in which case p_aout->output.output.i_format is set to 'spdi', or it is a PCM-only device, in which case the native sample format is requested, such as 'fl32' (for Darwin CoreAudio) or AOUT_FMT_S16_NE (for OSS). The output device may also have constraints on the number of channels or the rate. For instance, the p_aout->output.output structure may look like :
- i_format = AOUT_FMT_S16_NE
- i_rate = 44100
- i_channels = AOUT_CHAN_LEFT | AOUT_CHAN_RIGHT
- i_frame_length = 1
- i_bytes_per_frame = 4
Once we have an output format, we deduce the mixer format. It is strictly forbidden to change the audio sample format between the mixer and the output (because all transformations happen in the input pipeline), except for i_format. The reason is that we have only developed three mixers (float32 and S/PDIF, plus fixed32 for embedded devices which do not feature an FPU), so all other types must be cast into one of those. Still with our example, the p_aout->mixer.mixer structure looks like :
- i_format = 'fl32'
- i_rate = 44100
- i_channels = AOUT_CHAN_LEFT | AOUT_CHAN_RIGHT
- i_frame_length = 1
- i_bytes_per_frame = 8
The aout core will thus allocate an audio filter to convert 'fl32' to AOUT_FMT_S16_NE. This is the only audio filter in the output pipeline. It will also allocate a float32 mixer. Since only one input stream is present, the trivial mixer will be used (it only copies samples from the first input stream). Otherwise the core would have used a more precise float32 mixer.
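To give an idea of what such an output converter does, here is a minimal sketch of a float32 to signed 16-bit conversion. It assumes float32 coefficients are nominally in the range [-1.0, 1.0], which this chapter does not state, and the function name is hypothetical; the real filter may scale and round differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert float32 coefficients (assumed nominal range [-1.0, 1.0]) to
 * signed 16-bit native-endian integers, clamping out-of-range values. */
void ex_float32_to_s16(const float *p_in, int16_t *p_out, size_t i_nb_coeffs)
{
    for (size_t i = 0; i < i_nb_coeffs; i++) {
        float f = p_in[i] * 32768.0f;

        if (f >= 32767.0f)
            p_out[i] = 32767;
        else if (f <= -32768.0f)
            p_out[i] = -32768;
        else
            p_out[i] = (int16_t)f;   /* truncates toward zero */
    }
}
```

Note that the output is half the size of the input (2 bytes per coefficient instead of 4), which matches the i_bytes_per_frame values of 8 and 4 in the stereo example above.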
The last step of the initialization is to build an input pipeline. When several properties have to be changed, the aout core searches first for an audio filter capable of changing :
- All parameters ;
- i_format and i_physical_channels/i_original_channels ;
- i_format ;
If the whole transformation cannot be done by only one audio filter, it will allocate a second and maybe a third filter to deal with the rest. To follow up on our example, we will allocate two filters : a52tofloat32 (which will deal with the conversion and the downmixing), and a resampler. Quite often, for undecoded formats, the converter will also deal with the downmixing, for efficiency reasons.
When this initialization is over, the "decoder" plug-in can run its main loop. Typically the decoder requests a buffer of length i_nb_samples, and copies the undecoded samples there (using GetChunk()). The buffer then goes along the input pipeline, which will do the decoding (to 'fl32'), downmixing and resampling. Additional resampling will occur if complex latency issues in the output layer force us to go temporarily faster or slower to achieve perfect lipsync (this is decided on a per-buffer basis). At the end of the input pipeline, the buffer is placed in a FIFO, and the decoder thread runs the audio mixer.
The audio mixer then calculates whether it has enough samples to build a new output buffer. If it does, it mixes the input streams, and passes the buffer to the output layer. The buffer goes along the output pipeline (which in our case only contains a converter filter), and then it is put in the output FIFO for the device.
Regularly, the output device will fetch the next buffer from the output FIFO, either through a callback of the audio subsystem (Mac OS X's CoreAudio, SDL), or thanks to a dedicated audio output thread (OSS, ALSA...). This mechanism uses aout_OutputNextBuffer(), and gives the estimated playing date of the buffer. If the computed playing date isn't equal to the estimated playing date (with a small tolerance), the output layer changes the date of all buffers in the audio output module, triggering some resampling at the beginning of the input pipeline when the next buffer comes from the decoder. That way, audio and video streams are resynchronized. When the buffer has been played, it is finally released.
Mutual exclusion mechanism
The access to the internal structures must be carefully protected, because contrary to other objects in the VLC framework (input, video output, decoders...), the audio output doesn't have an associated thread. It means that parts of the audio output run in different threads (decoders, audio output IO thread, interface), and we do not control when the functions are called. Thus, much care must be taken to avoid concurrent access on the same part of the audio output, without creating a bottleneck which would cause latency problems at the output layer.
Consequently, we have set up a locking mechanism in four parts :
- p_aout->mixer_lock : This lock is taken when the audio mixer is entered. The decoder thread in which the mixer runs must hold the mutex during the mixing, until the buffer comes out of the output pipeline. Without holding this mutex, the interface thread cannot change the output pipeline, and a decoder cannot add a new input stream.
- p_input->lock : This lock is taken when a decoder calls aout_BufferPlay(), as long as the buffer is in the input pipeline. The interface thread cannot change the input pipeline without holding this lock.
- p_aout->output_fifo_lock : This lock must be taken to add or remove a packet from the output FIFO, or change its dates.
- p_aout->input_fifos_lock : This lock must be taken to add or remove a packet from one of the input FIFOs, or change its dates.
Having so many mutexes makes it easy to fall into deadlocks (i.e. when a thread has the mixer lock and wants the input fifos lock, and the other has the input fifos lock and wants the mixer lock). We could have worked with fewer locks (or even a single global lock), but for instance when the mixer is running, we do not want to block the audio output IO thread from picking up the next buffer. So for efficiency reasons we want to keep that many locks.
So we have set up a strong discipline in taking the locks. If you need several of the locks, you must take them in the order indicated above. For instance if you already hold the input fifos lock, it is strictly forbidden to try and take the mixer lock. You must first release the input fifos lock, then take the mixer lock, and finally retake the input fifos lock.
It might seem a big constraint, but the order has been chosen so that in most cases, it is the most natural order to take the locks.
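The ordering rule can be sketched with POSIX mutexes. Everything here is illustrative: ex_locks_t and the function names are hypothetical, and the VLC core uses its own mutex wrappers rather than raw pthread calls.

```c
#include <pthread.h>

/* Stand-ins for the locks, declared in the order in which they must be taken. */
typedef struct {
    pthread_mutex_t mixer_lock;
    pthread_mutex_t input_lock;
    pthread_mutex_t output_fifo_lock;
    pthread_mutex_t input_fifos_lock;
} ex_locks_t;

void ex_locks_init(ex_locks_t *p_locks)
{
    pthread_mutex_init(&p_locks->mixer_lock, NULL);
    pthread_mutex_init(&p_locks->input_lock, NULL);
    pthread_mutex_init(&p_locks->output_fifo_lock, NULL);
    pthread_mutex_init(&p_locks->input_fifos_lock, NULL);
}

/* Called by a thread that already holds input_fifos_lock and finds out
 * that it also needs mixer_lock. Taking mixer_lock directly would
 * violate the order, so the later lock is released first. */
void ex_take_mixer_while_holding_fifos(ex_locks_t *p_locks)
{
    pthread_mutex_unlock(&p_locks->input_fifos_lock);
    pthread_mutex_lock(&p_locks->mixer_lock);
    pthread_mutex_lock(&p_locks->input_fifos_lock);
}
```

Keep in mind that the FIFOs may have been modified by another thread between the unlock and the second lock, so anything read under the input fifos lock before the call has to be re-checked afterwards.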
The aout_buffer_t structure is only allocated by the aout core functions, and goes from the decoder to the output device. A new aout buffer is allocated in these circumstances :
- Whenever the decoder calls aout_BufferNew().
- In the input and output pipeline, when an audio filter requests a new output buffer (ie. when b_in_place == 0, see below).
- In the audio mixer, when a new output buffer is being prepared.
Most audio filters are able to place the output result in the same buffer as the input data, so most buffers can be reused that way, and we avoid massive allocations. However, some filters require the allocation of an output buffer.
The core functions are smart enough to determine if the buffer is ephemeral (for instance if it will only be used between two audio filters, and disposed of immediately thereafter), or if it will need to be shared among several threads (as soon as it will need to stay in an input or output FIFO).
In the first case, the aout_buffer_t structure and its associated buffer will be allocated on the thread's stack (via alloca()), whereas in the latter case they will be allocated on the process's heap (via malloc()). You, codec or filter developer, don't have to deal with the allocation or deallocation of the buffers.
The fields you'll probably need to use are : p_buffer (pointer to the raw data), i_nb_bytes (size of the significant portion of the data), i_nb_samples, start_date and end_date.
At first glance, you might be tempted to think that to calculate the starting date of a buffer, it is enough to regularly fetch the PTS i_pts from the input, and then : i_pts += i_nb_past_samples * 1000000 / i_rate. Well, I'm sorry to disappoint you, but you'll end up with rounding problems, resulting in an audible crack every few seconds.
Indeed, if you have 1536 samples per buffer (as is often the case for A/52) at 44.1 kHz, it gives : 1536 * 1000000 / 44100 = 34829.9319727891. The fractional part of this figure will drive you mad (note that with 48 kHz samples the result is an integer, so it will work well in many cases).
One solution could have been to work in nanoseconds instead of microseconds, but you'd only be making the problem 1000 times less frequent. The only exact solution is to add 34829 for every buffer, and keep the remainder of the division somewhere. For every buffer you accumulate the remainders, and when the sum reaches 44100, you add 34830 instead of 34829 and subtract 44100 from the sum. That way you don't have the rounding error which would occur in the long run (this is called the Bresenham algorithm).
The good news is, the audio output core provides a structure (audio_date_t) and functions to deal with it :
- aout_DateInit( audio_date_t * p_date, u32 i_divider ) : Initialize the Bresenham algorithm with the divider i_divider. Usually, i_divider will be the rate of the stream.
- aout_DateSet( audio_date_t * p_date, mtime_t new_date ) : Initialize the date, and set the remainder to 0. You will usually need this whenever you get a new PTS from the input.
- aout_DateMove( audio_date_t * p_date, mtime_t difference ) : Add or subtract microseconds from the stored date (used by the aout core when the output layer reports a lipsync problem).
- aout_DateGet( audio_date_t * p_date ) : Return the current stored date.
- aout_DateIncrement( audio_date_t * p_date, u32 i_nb_samples ) : Add i_nb_samples * 1000000 / i_divider microseconds to the stored date, taking into account rounding errors, and return the result.
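The remainder-carrying technique behind these functions can be sketched in a few lines. This is an illustration under stated assumptions, not the core's implementation: ex_date_t and the ex_date_* functions are hypothetical names that only mirror the roles of audio_date_t, aout_DateInit(), aout_DateSet() and aout_DateIncrement().

```c
#include <stdint.h>

typedef struct {
    int64_t  date;         /* current date, in microseconds */
    uint32_t i_divider;    /* usually the sample rate, in Hz */
    uint32_t i_remainder;  /* carried remainder, always < i_divider */
} ex_date_t;

void ex_date_init(ex_date_t *p_date, uint32_t i_divider)
{
    p_date->date = 0;
    p_date->i_divider = i_divider;
    p_date->i_remainder = 0;
}

/* Set a new reference date (e.g. a fresh PTS) and drop the remainder. */
void ex_date_set(ex_date_t *p_date, int64_t new_date)
{
    p_date->date = new_date;
    p_date->i_remainder = 0;
}

/* Advance the date by i_nb_samples samples. The part of the division
 * that does not fit in whole microseconds is carried to the next call
 * instead of being thrown away. */
int64_t ex_date_increment(ex_date_t *p_date, uint32_t i_nb_samples)
{
    uint64_t i_total = (uint64_t)i_nb_samples * 1000000u + p_date->i_remainder;

    p_date->date += (int64_t)(i_total / p_date->i_divider);
    p_date->i_remainder = (uint32_t)(i_total % p_date->i_divider);
    return p_date->date;
}
```

With a divider of 44100 and buffers of 1536 samples, the first call adds 34829 and the second adds 34830. After 44100 buffers the date has advanced by exactly 1536000000 microseconds, whereas adding a truncated 34829 each time would have lost 41100 microseconds.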
FIFOs are used at two places in the audio output : at the end of the input pipeline, before entering the audio mixer, to store the buffers which haven't been mixed yet ; and at the end of the output pipeline, to queue the buffers for the output device.
FIFOs store a chained list of buffers. They also keep the ending date of the last buffer, and whenever you pass a new buffer, they will enforce the time continuity of the stream by changing its start_date and end_date to match the FIFO's end_date (in case of stream discontinuity, the aout core will have to reset the date). The aout core provides functions to access the FIFO. Please understand that none of these functions use mutexes to protect exclusive access, so you must deal with race conditions yourself if you want to use them directly !
- aout_FifoInit( aout_instance_t * p_aout, aout_fifo_t * p_fifo, u32 i_rate ) : Initialize the FIFO pointers, and the audio_date_t with the appropriate rate of the stream (see above for an explanation of aout dates).
- aout_FifoPush( aout_instance_t * p_aout, aout_fifo_t * p_fifo, aout_buffer_t * p_buffer ) : Add p_buffer at the end of the chained list, update its start_date and end_date according to the FIFO's end_date, and update the internal end_date.
- aout_FifoSet( aout_instance_t * p_aout, aout_fifo_t * p_fifo, mtime_t date ) : Trash all buffers, and set a new end_date. Used when a stream discontinuity has been detected.
- aout_FifoMoveDates( aout_instance_t * p_aout, aout_fifo_t * p_fifo, mtime_t difference ) : Add or subtract microseconds from end_date and from start_date and end_date of all buffers in the FIFO. The aout core will use this function to force resampling, after lipsync issues.
- aout_FifoNextStart( aout_instance_t * p_aout, aout_fifo_t * p_fifo ) : Return the start_date which will be given to the next buffer passed to aout_FifoPush().
- aout_FifoPop( aout_instance_t * p_aout, aout_fifo_t * p_fifo ) : Return the first buffer of the FIFO, and remove it from the chained list.
- aout_FifoDestroy( aout_instance_t * p_aout, aout_fifo_t * p_fifo ) : Free all buffers in the FIFO.
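The way a FIFO enforces time continuity can be sketched as follows. This is a reduced model with hypothetical ex_* names: it covers only init, push and pop, leaves out locking and buffer ownership, and computes durations with the remainder-carrying technique described in the date section.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct ex_buffer {
    struct ex_buffer *p_next;
    int64_t  start_date;    /* microseconds */
    int64_t  end_date;      /* microseconds */
    uint32_t i_nb_samples;
} ex_buffer_t;

typedef struct {
    ex_buffer_t  *p_first;
    ex_buffer_t **pp_last;      /* where the next buffer gets linked */
    int64_t       end_date;     /* end date of the last pushed buffer */
    uint32_t      i_rate;       /* Hz */
    uint32_t      i_remainder;  /* carried remainder, < i_rate */
} ex_fifo_t;

/* Must be called on the FIFO in its final location: pp_last points
 * inside the structure itself. */
void ex_fifo_init(ex_fifo_t *p_fifo, uint32_t i_rate)
{
    p_fifo->p_first = NULL;
    p_fifo->pp_last = &p_fifo->p_first;
    p_fifo->end_date = 0;
    p_fifo->i_rate = i_rate;
    p_fifo->i_remainder = 0;
}

/* Append a buffer and overwrite its dates so that it starts exactly
 * where the previous one ended. */
void ex_fifo_push(ex_fifo_t *p_fifo, ex_buffer_t *p_buffer)
{
    uint64_t i_total = (uint64_t)p_buffer->i_nb_samples * 1000000u
                     + p_fifo->i_remainder;

    p_buffer->start_date = p_fifo->end_date;
    p_fifo->end_date += (int64_t)(i_total / p_fifo->i_rate);
    p_fifo->i_remainder = (uint32_t)(i_total % p_fifo->i_rate);
    p_buffer->end_date = p_fifo->end_date;

    p_buffer->p_next = NULL;
    *p_fifo->pp_last = p_buffer;
    p_fifo->pp_last = &p_buffer->p_next;
}

/* Detach and return the oldest buffer, or NULL if the FIFO is empty. */
ex_buffer_t *ex_fifo_pop(ex_fifo_t *p_fifo)
{
    ex_buffer_t *p_buffer = p_fifo->p_first;

    if (p_buffer == NULL)
        return NULL;
    p_fifo->p_first = p_buffer->p_next;
    if (p_fifo->p_first == NULL)
        p_fifo->pp_last = &p_fifo->p_first;
    p_buffer->p_next = NULL;
    return p_buffer;
}
```

Whatever dates the caller had written in the buffer are replaced on push, which is why a stream discontinuity has to be handled by resetting the FIFO's date rather than by dating the buffer differently.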
API for the decoders
The API between the audio output and the decoders is quite simple. As soon as the decoder has the required information to fill in an audio_sample_format_t, it can call : p_dec->p_aout_input = aout_InputNew( p_dec->p_fifo, &p_dec->p_aout, &p_dec->output_format ).
In the next operations, the decoder will need both p_aout and p_aout_input. To retrieve a buffer, it calls : p_buffer = aout_BufferNew( p_dec->p_aout, p_dec->p_aout_input, i_nb_frames ).
The decoder must at least fill in start_date (using an audio_date_t is recommended), and then it can play the buffer : aout_BufferPlay( p_dec->p_aout, p_dec->p_aout_input, p_buffer ). In case of error, the buffer can be deleted (without being played) with aout_BufferDelete( p_dec->p_aout, p_dec->p_aout_input, p_buffer ).
When the decoder dies, or the sample format changes, the input stream must be destroyed with : aout_InputDelete( p_dec->p_aout, p_dec->p_aout_input ).
API for the output module
An output module must implement a constructor, an optional destructor, and a p_aout->output.pf_play function. The constructor is the function which will be called when the module is loaded, and returns 0 if, and only if, the output device could be opened. The function may perform specific allocation in p_aout->output.p_sys, provided the structure is deallocated in the destructor.
In most cases, the p_aout->output.pf_play function does nothing (the only exception is when the samples can be processed immediately, without caring about dates, as in the file output). The job is then done by the IO callback which you are supposed to provide.
On modern sound architectures (such as Mac OS X CoreAudio or SDL), when the audio buffer starves, the operating system automatically calls a function from your application. On outdated sound architectures (such as OSS), you have to emulate this behavior. Then your constructor must spawn a new audio IO thread, which periodically calls the IO callback to transfer the data.
When it is called, the first job of the IO callback is to determine the date at which the next samples will be played. Again, on modern platforms this information is given by the operating system, whereas on others you have to deduce it from the state of the internal buffer. Then you call aout_OutputNextBuffer( p_aout, next_date, b_can_sleek ), which will return a pointer to the next buffer to write, or NULL if none is available. In the latter case, it is advised to write zeros to the DSP.
The value of the last parameter (b_can_sleek) changes the behavior of the function. When it is set to 0, aout_OutputNextBuffer() will run an internal machinery to compensate for possible drift. For instance if the PTS of the next buffer is 40 ms earlier than the date you ask, it means we are very late. So it will ask the input stage to downsample the incoming buffers, so that we can come back in sync. No specific behavior is thus expected from your module.
On the contrary, when b_can_sleek is set to 1, you tell the output layer not to take any action to compensate for drift. You will typically use this when you've just played silence, and you can deal with buffers which are too early by inserting zeros (zeros in this case will not break the audio continuity, since you were playing nothing before). Another use case is S/PDIF output. S/PDIF packets cannot be resampled for obvious reasons, so you must use b_can_sleek = 1.
Once you have a buffer, you just have to transfer it to the DSP, for instance : memcpy( dsp_buffer, p_buffer->p_buffer, p_buffer->i_nb_bytes ).
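The last step of the IO callback (copy the buffer, or write zeros when none is available) can be sketched as below. ex_out_buffer_t is a reduced stand-in for aout_buffer_t and ex_fill_dsp() is a hypothetical helper; the call to aout_OutputNextBuffer() itself is left out, and its result is simply passed in as p_buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reduced stand-in for aout_buffer_t (fields named as in the text). */
typedef struct {
    uint8_t *p_buffer;
    size_t   i_nb_bytes;
} ex_out_buffer_t;

/* Fill the device buffer with the next aout buffer, or with zeros when
 * p_buffer is NULL. Any space left after the copy is zeroed too.
 * Returns the number of bytes of real audio that were copied. */
size_t ex_fill_dsp(uint8_t *p_dsp, size_t i_dsp_size,
                   const ex_out_buffer_t *p_buffer)
{
    size_t i_copy = 0;

    if (p_buffer != NULL) {
        i_copy = p_buffer->i_nb_bytes < i_dsp_size ? p_buffer->i_nb_bytes
                                                   : i_dsp_size;
        memcpy(p_dsp, p_buffer->p_buffer, i_copy);
    }
    memset(p_dsp + i_copy, 0, i_dsp_size - i_copy);
    return i_copy;
}
```

Zeros are silence for signed and float formats only; with an unsigned format such as 'u8 ', silence is the mid-scale value, so a module outputting such a format would have to pad differently.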