Minimalist software decoder with state-of-the-art performance for the H.264/AVC video format.
Please note this is a work in progress and will be ready for use after making GStreamer/VLC plugins.
- Supports Progressive High and MVC 3D profiles, up to level 6.2
- Any resolution up to 8K UHD
- 8-bit 4:2:0 planar YUV output
- Slices and Arbitrary Slice Order
- Slice and frame multi-threading
- Per-slice reference picture list
- Memory Management Control Operations
- Long-term reference frames
- Windows: x86, x64
- Linux: x86, x64, ARM64
- Mac OS: x64
edge264 is entirely developed in C using 128-bit vector extensions and vector intrinsics, and can be compiled with GNU GCC or LLVM Clang. SDL2 runtime library may be used (optional) to enable display with edge264_test.
Here are the make options for tuning the compiled library file:
- CC- C compiler used to convert source files to object files (default- cc)
- CFLAGS- additional compilation flags passed to- CCand- TARGETCC
- TARGETCC- C compiler used to link object files into library file (default- CC)
- LDFLAGS- additional compilation flags passed to- TARGETCC
- TARGETOS- resulting file naming convention among- Windows|- Linux|- Darwin(default host)
- VARIANTS- comma-separated list of additional variants included in the library and selected at runtime (default- logs)- x86-64-v2- variant compiled for x86-64 microarchitecture level 2 (SSSE3, SSE4.1 and POPCOUNT)
- x86-64-v3- variant compiled for x86-64 microarchitecture level 3 (AVX2, BMI, LZCNT, MOVBE)
- logs- variant compiled with logging support in YAML format (headers and slices)
 
- BUILD_TEST- toggles compilation of- edge264_test(default- yes)
- FORCEINTRIN- enforce the use of intrinsics among- x86|- ARM64(for WebAssembly)
$ make CFLAGS="-march=x86-64" VARIANTS=x86-64-v2,x86-64-v3 BUILD_TEST=no # example x86 buildThe automated test program edge264_test can browse files in a given directory, decoding each <video>.264 file and comparing its output with each sibling file <video>.yuv if found. On the set of AVCv1, FRExt and MVC conformance bitstreams, 109/224 files are decoded without errors, the rest using yet unsupported features.
$ make
$ ./edge264_test --help # prints all options available
$ ffmpeg -i vid.mp4 -vcodec copy -bsf h264_mp4toannexb -an vid.264 # optional, converts from MP4 format
$ ./edge264_test -d vid.264 # replace -d with -b to benchmark instead of displayHere is a complete example that opens an input file in Annex B format from command line, and dumps its decoded frames in planar YUV order to standard output. See edge264_test.c for a more complete example which can also display frames.
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include "edge264.h"
int main(int argc, char *argv[]) {
	int fd = open(argv[1], O_RDONLY);
	struct stat st;
	fstat(fd, &st);
	uint8_t *buf = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	const uint8_t *nal = buf + 3 + (buf[2] == 0); // skip the [0]001 delimiter
	const uint8_t *end = buf + st.st_size;
	// auto threads, no logs, auto allocs
	Edge264Decoder *dec = edge264_alloc(-1, NULL, NULL, 0, NULL, NULL, NULL);
	Edge264Frame frm;
	int res;
	do {
		res = edge264_decode_NAL(dec, nal, end, 0, NULL, NULL, &nal);
		while (!edge264_get_frame(dec, &frm, 0)) {
			for (int y = 0; y < frm.height_Y; y++)
				write(1, frm.samples[0] + y * frm.stride_Y, frm.width_Y);
			for (int y = 0; y < frm.height_C; y++)
				write(1, frm.samples[1] + y * frm.stride_C, frm.width_C);
			for (int y = 0; y < frm.height_C; y++)
				write(1, frm.samples[2] + y * frm.stride_C, frm.width_C);
		}
	} while (res == 0 || res == ENOBUFS);
	edge264_free(&dec);
	munmap(buf, st.st_size);
	close(fd);
	return 0;
}const uint8_t * edge264_find_start_code(buf, end, four_byte)
Return a pointer to the next three or four byte (0)001 start code prefix, or end if not found.
- const uint8_t * buf- first byte of buffer to search into
- const uint8_t * end- first invalid byte past the buffer that stops the search
- int four_byte- if 0 seek a 001 prefix, otherwise seek a 0001
Edge264Decoder * edge264_alloc(n_threads, log_cb, log_arg, log_mbs, alloc_cb, free_cb, alloc_arg)
Allocate and initialize a decoding context.
- int n_threads- number of background worker threads, with 0 to disable multithreading and -1 to detect the number of logical cores at runtime
- void (* log_cb)(const char * str, void * log_arg)- if not NULL, a- fputs-compatible function pointer that- edge264_decode_NALwill call to log every header, SEI or macroblock (requires the- logsvariant otherwise fails at runtime, called from the same thread except macroblocks in multithreaded decoding)
- void * log_arg- custom value passed to- log_cb
- int log_mbs- set to 1 to enable logging of macroblocks
- void (* alloc_cb)(void ** samples, unsigned samples_size, void ** mbs, unsigned mbs_size, int errno_on_fail, void * alloc_arg)- if not NULL, a function pointer that- edge264_decode_NALwill call (on the same thread) instead of malloc to request allocation of samples and macroblock buffers for a frame (- errno_on_failis ENOMEM for mandatory allocations, or ENOBUFS for allocations that may be skipped to save memory but reduce playback smoothness)
- void (* free_cb)(void * samples, void * mbs, void * alloc_arg)- if not NULL, a function pointer that- edge264_decode_NALand- edge264_freewill call (on the same thread) to free buffers allocated through- alloc_cb
- void * alloc_arg- custom value passed to- alloc_cband- free_cb
int edge264_decode_NAL(dec, buf, end, non_blocking, free_cb, free_arg, next_NAL)
Decode a single NAL unit containing any parameter set or slice.
- Edge264Decoder * dec- initialized decoding context
- const uint8_t * buf- first byte of NAL unit (containing- nal_unit_type)
- const uint8_t * end- first byte past the buffer (max buffer size is 231-1 on 32-bit and 263-1 on 64-bit)
- int non_blocking- set to 1 if the current thread has other processing thus cannot block here
- void (* free_cb)(void * free_arg, int ret)- callback that may be called from another thread when multithreaded, to signal the end of parsing and release the NAL buffer
- void * free_arg- custom value that will be passed to- free_cb
- const uint8_t ** next_NAL- if not NULL and the return code is- 0|- ENOTSUP|- EBADMSG, will receive a pointer to the next NAL unit after the next start code in an Annex B stream
Return codes are:
- 0on success
- ENOTSUPon unsupported stream (decoding may proceed but could return zero frames)
- EBADMSGon invalid stream (decoding may proceed but could show visual artefacts, if you can check with another decoder that the stream is actually flawless, please consider filling a bug report 🙏)
- EINVALif the function was called with- dec == NULLor- dec->buf == NULL
- ENODATAif the function was called while- dec->buf >= dec->end
- ENOMEMif- mallocfailed to allocate memory
- ENOBUFSif more frames should be consumed with- edge264_get_frameto release a picture slot
- EWOULDBLOCKif the non-blocking function would have to wait before a picture slot is available
int edge264_get_frame(dec, out, borrow)
Fetch the next frame ready for output.
- Edge264Decoder * dec- initialized decoding context
- Edge264Frame *out- a structure that will be filled with data for the frame returned
- int borrow- if 0 the frame may be accessed until the next call to- edge264_decode_NAL, otherwise the frame should be explicitly returned with- edge264_return_frame. Note that access is not exclusive, it may be used concurrently as reference for other frames.
Return codes are:
- 0on success (one frame is returned)
- EINVALif the function was called with- dec == NULLor- out == NULL
- ENOMSGif there is no frame to output at the moment
typedef struct Edge264Frame {
	const uint8_t *samples[3]; // Y/Cb/Cr planes
	const uint8_t *samples_mvc[3]; // second view
	const uint8_t *mb_errors; // probabilities (0..100) for each macroblock to be erroneous, NULL if there are no errors, values are spaced by stride_mb in memory
	int8_t pixel_depth_Y; // 0 for 8-bit, 1 for 16-bit
	int8_t pixel_depth_C;
	int16_t width_Y;
	int16_t width_C;
	int16_t height_Y;
	int16_t height_C;
	int16_t stride_Y;
	int16_t stride_C;
	int16_t stride_mb;
	uint32_t FrameId;
	uint32_t FrameId_mvc; // second view
	int16_t frame_crop_offsets[4]; // {top,right,bottom,left}, useful to derive the original frame with 16x16 macroblocks
	void *return_arg;
} Edge264Frame;void edge264_return_frame(dec, return_arg)
Give back ownership of the frame if it was borrowed from a previous call to edge264_get_frame.
- Edge264Decoder * dec- initialized decoding context
- void * return_arg- the value stored inside the frame to return
void edge264_flush(dec)
For use when seeking, stop all background processing, flush all delayed frames while keeping them allocated, and clear the internal decoder state.
- Edge264Decoder * dec- initialized decoding context
void edge264_free(pdec)
Deallocate the entire decoding context, and unset the pointer.
- Edge264Decoder ** pdec- pointer to a decoding context, initialized or not
- Stress testing (in progress)
- Multithreading (in progress)
- Error recovery (in progress)
- Integration in VLC/ffmpeg/GStreamer
- ARM32
- PAFF and MBAFF
- 4:0:0, 4:2:2 and 4:4:4
- 9-14 bit depths with possibility of different luma/chroma depths
- Transform-bypass for macroblocks with QP==0
- SEI messages
- AVX-2 optimizations
I use edge264 to experiment on new programming techniques to improve performance and code size over existing decoders, and presented a few of these techniques at FOSDEM'24 and FOSDEM'25.
- Single header file - It contains all struct definitions, common constants and enums, SIMD aliases, inline functions and macros, and exported functions for each source file. To understand the code base you should look at this file first.
- Code blocks instead of functions - The main decoding loop is a forward pipeline designed as a DAG loosely resembling hardware decoders, with nodes being non-inlined functions and edges being tail calls. It helps mutualize code branches wherever possible, thus reduces code size to help fit in L1 cache.
- Tree branching - Directional intra modes are implemented with a jump table to the leaves of a tree then unconditional jumps down to the trunk. It allows sharing the bottom code among directional modes, to reduce code size.
- Global context register - The pointer to the main structure holding context data is assigned to a register when supported by the compiler (GCC).This technique was dropped as Clang eventually reached on-par performance, so there is little incentive to maintain this hack.
- Default neighboring values (search unavail_mb) - Tests for availability of neighbors are replaced with fake neighboring macroblocks around each frame. It reduces the number of conditional tests inside the main decoding loop, thus reduces code size and branch predictor pressure.
- Relative neighboring offsets (look for A4x4_int8and related variables) - Access to left/top macroblock values is done with direct offsets in memory instead of copying their values to a buffer beforehand. It helps to reduce the reads and writes in the main decoding loop.
- Parsing uneven block shapes (look at function parse_P_sub_mb) - Each Inter macroblock paving specified with mb_type and sub_mb_type is first converted to a bitmask, then iterated on set bits to fetch the correct number of reference indices and motion vectors. This helps to reduce code size and number of conditional blocks.
- Using vector extensions - GCC's vector extensions are used along vector intrinsics to write more compact code. All intrinsics from Intel are aliased with shorter names, which also provides an enumeration of all SIMD instructions used in the decoder.
- Register-saturating SIMD - Some critical SIMD algorithms use more simultaneous vectors than available registers, effectively saturating the register bank and generating stack spills on purpose. In some cases this is more efficient than splitting the algorithm into smaller bits, and has the additional benefit of scaling well with later CPUs.
- Piston cached bitstream reader - The bitstream bits are read in a size_t[2] intermediate cache with a trailing set bit to keep track of the number of cached bits, giving access to 32/64 bits per read from the cache, and allowing wide refills from memory.
- On-the-fly SIMD unescaping - The input bitstream is unescaped on the fly using vector code, avoiding a full preprocessing pass to remove escape sequences, and thus reducing memory reads/writes.
- Multiarch SIMD programming - Using vector extensions along with aliased intrinsics allows supporting both Intel SSE and ARM NEON with around 80% common code and few #if #else blocks, while keeping state-of-the-art performance for both architectures.
- The Structure of Arrays pattern - The frame buffer is stored with arrays for each distinct field rather than an array of structures, to express operations on frames with bitwise and vector operators (see AoS and SoA). The task buffer for multithreading also relies on it partially.
- Deferred error checking - Error detection is performed once in each type of NAL unit (search for returnstatements), by clamping all input values to their expected ranges, then expectingrbsp_trailing_bitafterwards (with very high probability of catching an error if the stream is corrupted). This design choice is discussed in A case about parsing errors.
Other yet-to-be-presented bits:
- Minimalistic API with FFI-friendly design (7 functions and 1 structure).
- The bitstream caches for CAVLC and CABAC (search for rbsp_reg) are stored in two size_t variables each, which may be mapped to Global Register Variables in the future.
- The decoding of input symbols is interspersed with their parsing (instead of parsing to a structthen decoding the data). It deduplicates branches and loops that are present in both parsing and decoding, and even eliminates the need to store some symbols (e.g. mb_type, sub_mb_type, mb_qp_delta).
With the help of a custom bitstream writer using the same YAML format edge264 outputs, a set of extensive tests are being created in tools/raw_tests to stress the darkest corners of this decoder. Use the following command to compile and run the tests (Python is required here):
make test
| General tests | Expected | Test files | 
|---|---|---|
| All supported types of NAL units with/without logging | All OK | supp-nals | 
| All unsupported types of NAL units with/without logging | All unsupp | unsupp-nals | 
| Maximal header log-wise | All OK | max-logs | 
| All conditions (incl. ignored) for detecting the start of a new frame | All OK | finish-frame | 
| nal_ref_idc=0 on NAL types 5, 6, 7, 8, 9, 10, 11, 12 and 15 | All OK | nal-ref-idc-0 | 
| Missing rbsp_trailing_bit for all supported NAL types | All OK | no-trailing-bit | 
| Surrounding the CPB/frame buffers with protected memory | All OK | page-boundaries | 
| SEI/slice referencing an uninitialized SPS/PPS | 1 OK, 4 errors | missing-ps | 
| Two non-ref frames with decreasing POC | All OK, any order | non-ref-dec-poc | 
| Horizontal/vertical cropping leaving zero space | All OK, 1x1 frames | zero-cropping | 
| P/B slice with nal_unit_type=5 or max_num_ref_frames=0 | 4 OK, 2 errors | no-refs-P-B-slice | 
| IDR slice with frame_num>0 | All OK, clamped to 0 | pos-frame-num-idr | 
| A ref that must bump out higher POCs to enter DPB (C.4.5.2) | All OK, check output order | poc-out-of-order | 
| Two ref frames with the same frame_num but differing POC, then a third frame referencing both | ||
| Gap in frame_num while gaps_in_frame_num_value_allowed_flag=0 | ||
| Stream starting with non-IDR I frame | ||
| Stream starting with P/B frame | ||
| Ref slice with delta_pic_order_cnt_bottom=-2**31, then a second frame referencing it | ||
| Two frames A/B with intersecting top/bottom POC intervals in all possible intersections | ||
| A 32-bit POC overflow between 2 frames | ||
| A B-frame referencing frames with more than 2**16 POC diff | ||
| num_ref_idx_active>15 in SPS then no override in slice for L0 and L1 | ||
| A slice with more ref_pic_list_modifications than num_ref_idx_active/16 for L0 and L1 | ||
| A slice with ref_pic_list_modifications duplicating a ref then referencing the second one | ||
| A slice with insufficient ref frames with and without override of num_ref_idx_active for L0 and L1 | ||
| A modification of RefPicList[0/1] to a non-existing short/long term frame, then referencing it in mb | ||
| 33 IDR with long_term_reference_flag=0/1 while max_num_ref_frames=0 (8.2.5.1) | ||
| A new reference while max_num_ref_frames are already all long-term | ||
| All combinations of mmco on all non-existing/short/long refs, with at least twice each mmco | ||
| Two fields of the same frame being assigned different long-term frame indices then referenced | ||
| While all max_num_ref_frames are long-term, a ref_pic_list_modification that references all of them | ||
| An IDR picture with POC>0 | ||
| A picture with mmco=5 decoded after a picture with greater POC (8.2.1) | ||
| A P/B frame with zero references before or received with a gap in frame_num equal to max_ref_frames | ||
| A P/B frame referencing a non-existing/erroneous ref | ||
| A B frame with colPic set to a non-existing frame | ||
| A current frame mmco'ed to long-term while all max_num_ref_frames are already long-term | ||
| A mmco marking a non-existing picture to long-term | ||
| All combinations of IntraNxNPredMode with A/B/C/D unavailability with asserts for out-of-bounds reads | ||
| A direct Inter reference from colPic that is not present in RefPicList0 | ||
| A residual block with all coeffs at maximum 32-bit values | ||
| Two slices of the same frame separated by a currPic reset (ex. AUD) | ||
| Two frames with the same POC yet differing TopFieldOrderCnt/BottomFieldOrderCnt | ||
| Differing mmcos on two slices of the same frame | ||
| Sending 2 IDR, then reaching the lowest possible POC, then getting all frames | ||
| Two slices with mmco=5 yet frame_num>0 (to make it look like a new frame) | ||
| POCs spaced by more than half max bits, such that relying on a stale prevPicOrderCnt yields wrong POC | ||
| Filling the DPB with 16 refs then setting max_num_ref_frames=1 and adding a new ref frame | ||
| Adding a frame cropping after decoding a frame | Crop should not apply retroactively | |
| Making a Direct ref_pic be used after it has been unreferenced | ||
| poc_type=2 and non-ref frame followed by non-ref pic, and the opposite (7.4.2.1.1) | ||
| direct_8x8_inference_flag=1 with frame_mbs_only_flag=0 | ||
| checking that a gap in frame_num with poc_type==0 does not insert refs in B slices | ||
| A SPS changing frame format while currPic>=0 | ||
| A frame allocator putting pic/mb allocs at start/end of a page boundary | 
| Parameter sets tests | Expected | Test files | 
|---|---|---|
| Invalid profile_idc=0/255 | ||
| Highest level_idc=255 | ||
| All unsupported values of chroma_format_idc | ||
| All unsupported values of bit_depth_luma/chroma | ||
| qpprime_y_zero_transform_bypass_flag=1 | ||
| All scaling lists default/fallback rules and repeated values for all indices, with residual macroblock | ||
| log2_max_frame_num=4 and a frame referencing another with the same frame_num%4 | 
| CAVLC tests | Expected | Test files | 
|---|---|---|
| All valid total_zeros=0-8-prefix+3-bit-suffix for TotalCoeffs in [0;15] for 4x4 and 2x2 | ||
| Invalid total_zeros=31/63/127-prefix for TotalCoeffs in [0;15] for 4x4 and 2x2 | ||
| All valid coeff_token=0-14-prefix+4-bit-suffix for nC=0/2/4, and valid 6-bit-values for nC=8 | ||
| Invalid coeff_token=31/63/127-prefix for nC=0/2/4, and invalid 6-bit-values for nC=8 | ||
| All valid levelCode=25-prefix+suffixLength-bit-suffix for all values of suffixLength | ||
| All valid run_before for all values of zerosLeft<=7 | ||
| Invalid run_before=31/63/127 for zerosLeft=7 | ||
| Macroblock of maximal size for all values of mb_type | ||
| mb_qp_delta=-26/25 that overflows on both sides | ||
| All valid inferences of nC for all values of nA/nB=unavail/other-slice/0-16 | ||
| All coded_block_pattern=[0;47] for I and P/B slices | ||
| All combinations of intra_chroma_pred_mode and Intra4x4/8x8/16x16PredMode with A/B-unavailability | ||
| All values of mb_type+sub_mb_types for I/P/B with ref_idx/mvds different than values from B_Direct | ||
| mvd=[-32768/0/32767,-32768/0/32767] in a single 16x16 macroblock | ||
| TotalCoeff=16 for a Intra16x16 AC block | ||
| A residual block with run_length=14 making zerosLeft negative | 
| CABAC tests | Expected | Test files | 
|---|---|---|
| Mixing CAVLC and CABAC in a same frame | ||
| Single slice with at least 8 cabac_zero_word | 
| MVC tests | Expected | Test files | 
|---|---|---|
| All wrong combinations of non_idr_flag with nal_unit_type=1/5 and nal_ref_idc=0/1 | ||
| nal_unit_type=14 then filler unit then nal_unit_type=1/5 | ||
| An nal_unit_type=5 view paired with a non_idr_flag=0 P view, or a non_idr_flag=1 view | ||
| Missing a base or non-base view | ||
| Receiving a SSPS yet only base views then | ||
| 16 ref base views while non base are non-refs | ||
| A SSPS with different pic_width_in_mbs/pic_height_in_mbs/chroma_format_idc than its SPS | ||
| A SSPS with num_views=1 | ||
| A non-base view with weighted_bipred_idc=2 | ||
| A non-base view with its base in RefPicList1[0] and direct_spatial_mv_pred_flag=0 (H.7.4.3) | ||
| A slice with num_ref_idx_l0_active>8 | ||
| svc_extension_flag=1 on a MVC stream | ||
| SSPS with additional_extension2_flag=1 and more trailing data | ||
| Gap in frame_num of 16 frames on both views | ||
| Specifying extra_frames=1 | ||
| Receiving a non-base view before its base | ||
| A stream sending non-base views after a few frames have been output | 
| Error recovery tests | Expected | Test files | 
|---|---|---|
| Tests to implement | ||
| A complete frame received twice | ||
| A slice of a frame received twice | ||
| Frame with correct and erroneous slice | ||
| All combinations erroneous/correct and all interval intersections on 2 slices | ||
| All failures of malloc | ||
| All (dis-)allowed bit positions at the end without rbsp_trailing_bit |