MIPS port improvements #4 #419

SiarheiVolkau · 2025-08-24T18:15:29Z

This patch series improves MIPS32 support (mostly r1-r5) and revises the MIPS DSP ASE support.

MIPS32 has 32x32=>64 bit multiplication and multiply-accumulate instructions,
so most of the DSP improvements can be adapted to MIPS32 too, although it has only
one accumulator register for such operations.

Performance was measured in QEMU by counting executed instructions. This isn't the best approach,
because QEMU isn't a cycle-accurate emulator. On the other hand, the MIPS architecture
has many releases with different internal designs, so performance may differ noticeably
between processors with the same instruction set; that is something the compiler should account for
(flag -mtune=).

QEMU performance (in instructions executed):

Architecture	Test	Initial, insns	Improved, insns	Difference, % (more is better)
MIPS32r1	test_opus_decode 1	192497750675	173969357000	9.62%
MIPS32r2+DSP	test_opus_decode 1	142939494587	140239758103	1.88%
MIPS32r1	test_opus_encode 1	447394529478	396319235982	11.42%
MIPS32r2+DSP	test_opus_encode 1	348482765364	328594898435	5.71%

Finally, MIPS32 and DSP performance was measured on a real MIPS 24KEc CPU. Here are the results:

Architecture	Test	Initial, time/s	Improved, time/s	Difference, % (more is better)
MIPS32r1	decode voip	61.10	54.33	11.08%
MIPS32r1	decode audio	45.86	42.72	6.85%
MIPS32r2+DSP	decode voip	46.99	46.97	0.04%
MIPS32r2+DSP	decode audio	38.73	38.47	0.67%
MIPS32r1	encode voip	622.10	599.39	3.65%
MIPS32r1	encode audio	374.61	364.62	2.67%
MIPS32r2+DSP	encode voip	569.82	574.01	-0.74%
MIPS32r2+DSP	encode audio	348.87	346.71	0.62%

MIPS32 has supported the multiply-accumulate pattern with 32x32=>64 bit data since release 1, although in contrast to the DSP extension it has only one accumulator register. Another disadvantage is that extracting the scaled result back from the accumulator requires 5 instructions, so for 16x16 multiply-accumulate it's faster to use normal multiplication (32x32=>32) and addition instructions.

GCC likes to shuffle the instructions of the mult+madd pattern apart and then reload the accumulator in between, so the default C implementation is far from optimal. GCC doesn't have builtin functions for the mult/madd/msub instructions, so inline assembly is used here.

MIPS32r6 doesn't have an accumulator register at all. Instead, it has a pair of instructions, MUL/MUH, which implement 32x32=>64 multiplication on general registers. The C implementation matches that processor much better, so there is no special version for it.

MIPS64 in turn has the mult+madd instructions, but only for compatibility with 32-bit binaries; taking the result back from the accumulator requires the same 5 instructions, and they must be written in assembly. Instead, it has 64x64=>64 multiplication on general registers, so C code should be good enough with the typical dmul+daddu instructions for multiply-accumulate. So there is no special version for it either.

Signed-off-by: Siarhei Volkau <[email protected]>
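The accumulate-then-extract trade-off described above can be illustrated in portable C. This is only a sketch of the arithmetic, not the inline assembly from the patch: the function name `inner_prod_q16` and the Q16 scaling are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* The final shift relies on arithmetic right shift of negative values,
 * which is implementation-defined in C (true for GCC and Clang). */
static_assert((-1 >> 1) == -1, "sketch assumes arithmetic right shift");

/* Hypothetical helper (not from the patch): inner product of two 32-bit
 * vectors, accumulated in 64 bits and scaled back by 16 bits at the end. */
static int32_t inner_prod_q16(const int32_t *x, const int32_t *y, int n)
{
    int64_t acc = 0;   /* plays the role of the 64-bit accumulator */

    assert(n >= 0);
    for (int i = 0; i < n; i++)
        acc += (int64_t)x[i] * y[i];   /* 32x32=>64 multiply-accumulate */

    /* The scaled result is extracted once, after the loop, so the costly
     * extraction step is not repeated for every element. */
    return (int32_t)(acc >> 16);
}
```

The point of the sketch is that the extraction cost is paid once per inner product rather than once per multiplication, which is where a single accumulator register is still useful.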

Maybe they were in use in 2014? Signed-off-by: Siarhei Volkau <[email protected]>

A typical implementation uses the accumulator register to get a 48+ bit product, but getting the scaled result back from the accumulator requires 5 instructions, so GCC typically emits:

    # MULT16_32_Q16
    mult a16, b32
    mflo r1
    mfhi r2
    srl  r1, r1, 16
    sll  r2, r2, 16
    or   result, r1, r2

But if we scale the 16-bit argument before the multiplication, we can get the result with one instruction (mfhi):

    # MULT16_32_Q16
    sll  a32, a16, 16
    mult a32, b32
    mfhi result

For MIPS32r6 it's even shorter:

    sll  a32, a16, 16
    muh  result, a32, b32

MIPS64 avoids using the accumulator here and can scale the result in a general register with a single instruction, so no special trick is needed.

Signed-off-by: Siarhei Volkau <[email protected]>
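To see why pre-scaling the 16-bit operand is safe, the two forms can be compared in plain C. This is a sketch with hypothetical function names, assuming arithmetic right shift of negative values; it models the arithmetic only, not the generated instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Right shift of negative values is implementation-defined in C;
 * GCC and Clang shift arithmetically, which this sketch relies on. */
static_assert((-1 >> 1) == -1, "sketch assumes arithmetic right shift");

/* Reference form: full product, then shift right by 16. */
static int32_t mult16_32_q16_ref(int16_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 16);
}

/* Pre-scaled form: move the 16-bit operand into the upper half first,
 * so the answer is simply the high 32 bits of the 64-bit product
 * (the part that mfhi or muh would deliver). */
static int32_t mult16_32_q16_hi(int16_t a, int32_t b)
{
    /* Same value as a << 16, written as a multiply to avoid
     * left-shifting a negative number. */
    int32_t a32 = (int32_t)a * 65536;
    return (int32_t)(((int64_t)a32 * b) >> 32);
}
```

Both forms compute floor(a*b / 2^16): scaling a by 2^16 and then dropping 32 low bits removes the same 16 bits as the reference shift, so no precision is lost.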

For non-DSP MIPS it's better to use the default MAC16_16 implementation, so there's no difference from the pure C implementation there. The real difference comes from manual tuning of the C code:
- unroll the loop one more time for dual_inner_prod
- replace the tail if-s by a switch in xcorr_kernel
- use 32-bit accumulators for the non-DSP variant

Why is the switch faster? Probably because the compiler doesn't have to track the j variable until the end of the loop and can replace the exit condition with something like x < &initial_x[N-3].

These changes increase overall opus_decode test execution speed by about 1% for both DSP and non-DSP versions. Measurements were done in QEMU by counting executed instructions. QEMU is not cycle accurate, and the real effect might be lower due to pipeline stalls.

Signed-off-by: Siarhei Volkau <[email protected]>
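The tail handling with a switch can be sketched on a simpler function than xcorr_kernel. The example below is an assumption-laden illustration (a hypothetical `dot16` helper, not code from the patch): the main loop is controlled by a pointer comparison, and the leftover elements are selected by `n & 3`, so no loop counter has to survive past the loop.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-bit dot product with a 4x unrolled body,
 * a 32-bit accumulator and a switch-based tail. */
static int32_t dot16(const int16_t *x, const int16_t *y, int n)
{
    int32_t sum = 0;
    const int16_t *end4;

    assert(n >= 0);
    end4 = x + (n & ~3);   /* end of the part handled 4 elements at a time */

    while (x < end4) {     /* exit condition is a plain pointer compare */
        sum += x[0] * y[0] + x[1] * y[1] + x[2] * y[2] + x[3] * y[3];
        x += 4;
        y += 4;
    }

    /* Tail: 0..3 leftover elements, chosen from n alone. */
    switch (n & 3) {
    case 3: sum += x[2] * y[2]; /* fall through */
    case 2: sum += x[1] * y[1]; /* fall through */
    case 1: sum += x[0] * y[0]; /* fall through */
    default: break;
    }
    return sum;
}
```

Whether this form is actually faster depends on the compiler and target; the commit message itself only offers it as a probable explanation.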

I observed that the overridden exp_rotation1 fully matches the default C version. The overridden renormalise_vector in turn differs only in the celt_inner_prod calculation, so instead of tuning the existing overrides it's better to remove them completely and tune celt_inner_prod instead. This change gives a minor performance improvement for both DSP and non-DSP MIPS.

Signed-off-by: Siarhei Volkau <[email protected]>

absq_s.ph performs two saturating ABS16 operations on a vector of two 16-bit values. cmp.lt.ph and pick.ph form a conditional move for a vector of two 16-bit values. The loop is also unrolled 2x for better performance.

The original C version can return 0x8000 (positive), whereas this one is limited to 0x7fff; since the result of this function is used in an ILOG2 context, this matters. As a quick fix, 1 is added to the result in the hope of returning 0x8000 if saturation happens.

Signed-off-by: Siarhei Volkau <[email protected]>
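The saturation issue can be reproduced with a small C model. All three helpers below are hypothetical stand-ins (a widened abs, one lane of a saturating abs, and a floor(log2) function), written only to show why 0x7fff versus 0x8000 matters for a log2-style consumer.

```c
#include <assert.h>
#include <stdint.h>

/* Plain C absolute value widened to 32 bits: |-32768| is 32768 (0x8000). */
static int32_t abs16_wide(int16_t v)
{
    return v < 0 ? -(int32_t)v : (int32_t)v;
}

/* Model of one lane of a saturating 16-bit abs: -32768 maps to 0x7fff. */
static int16_t abs16_sat(int16_t v)
{
    if (v == INT16_MIN)
        return INT16_MAX;
    return (int16_t)(v < 0 ? -v : v);
}

/* floor(log2(x)) for x > 0; stand-in for an ILOG2-style helper. */
static int ilog2_u32(uint32_t x)
{
    int r = 0;

    assert(x > 0);
    while (x > 1) {
        x >>= 1;
        r++;
    }
    return r;
}
```

With these definitions, the saturated maximum 0x7fff has a log2 of 14 while 0x8000 has a log2 of 15, and adding 1 to the saturated value restores 0x8000, which is the effect the quick fix is hoping for.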

dpaq_s.w.ph performs two 16x16=>Q31 multiplications and adds both results to the accumulator register. To get Q31, the result of each multiplication is shifted left by 1. It also saturates: for two 0x8000 (-32768) inputs the result is 0x7fffffff (the maximal positive Q31 value). This instruction is an ideal candidate for the celt/dual_inner_prod functions, although the data alignment isn't always good enough to use it.

Signed-off-by: Siarhei Volkau <[email protected]>
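The behaviour described above can be modelled in portable C. This is a sketch based only on the description in this commit message; the function names are hypothetical, the two source registers are modelled as packed 32-bit words, and side effects such as status flags are not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Extracting signed lanes relies on two's-complement narrowing, which is
 * implementation-defined in C (true for GCC and Clang). */
static_assert((int16_t)0x8000u == INT16_MIN, "sketch assumes two's-complement narrowing");

/* One 16x16=>Q31 product with saturation of the single overflowing case. */
static int32_t mul_q15_sat(int16_t a, int16_t b)
{
    if (a == INT16_MIN && b == INT16_MIN)
        return INT32_MAX;            /* -1.0 * -1.0 saturates to 0x7fffffff */
    return ((int32_t)a * b) * 2;     /* doubling gives Q31; cannot overflow here */
}

/* Model of the accumulate step: both lane products are added to a
 * 64-bit accumulator. rs and rt each hold two packed 16-bit values. */
static int64_t dpaq_s_w_ph_model(int64_t acc, uint32_t rs, uint32_t rt)
{
    int16_t rs_hi = (int16_t)(rs >> 16), rs_lo = (int16_t)(rs & 0xffffu);
    int16_t rt_hi = (int16_t)(rt >> 16), rt_lo = (int16_t)(rt & 0xffffu);

    return acc + mul_q15_sat(rs_hi, rt_hi) + mul_q15_sat(rs_lo, rt_lo);
}
```

Because each call consumes two samples from each input, an inner product built on it needs the inputs available as aligned pairs, which matches the alignment caveat in the commit message.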

MIPS32 has supported the multiply-accumulate pattern with 32x32=>64 bit data since release 1, although in contrast to the DSP extension it has only one accumulator register and no builtin functions in GCC.

Signed-off-by: Siarhei Volkau <[email protected]>

The simple loop with a shift by 1 is unrolled 2 times. The more complex loop is unrolled 4 times.

Signed-off-by: Siarhei Volkau <[email protected]>

The code is faster because it:
- avoids a 64-bit shift on each iteration
- matches the multiply-accumulate pattern
- might be autovectorized (not verified)

Signed-off-by: Siarhei Volkau <[email protected]>

I observed that silk_noise_shape_quantizer_del_dec, which is overridden for MIPS, has been refactored in the C version, and the most performance-critical parts have moved to the separate functions silk_noise_shape_quantizer_short_prediction and silk_NSQ_noise_shape_feedback_loop. So let's drop the current MIPS implementation and override only the critical parts mentioned above.

Signed-off-by: Siarhei Volkau <[email protected]>

MIPS32 has 32x32=>64 bit multiplication, but shifting the 64-bit result isn't trivial. The exception is a right shift by 32, which just means dropping the low register of the result. Since the second argument of silk_SMULWB is 16 bits wide, we can shift it left by 16 before the multiplication and then apply the technique above to the result.

Signed-off-by: Siarhei Volkau <[email protected]>
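A portable C model of this rewrite is given below. The function names are hypothetical and the reference form follows the usual "multiply by the low 16 bits, shift right by 16" meaning of silk_SMULWB; arithmetic right shift and two's-complement narrowing are assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Both properties are implementation-defined in C; GCC and Clang satisfy them. */
static_assert((-1 >> 1) == -1, "sketch assumes arithmetic right shift");
static_assert((int16_t)0x8000u == INT16_MIN, "sketch assumes two's-complement narrowing");

/* Reference form: a32 times the signed low half of b32, shifted right by 16. */
static int32_t smulwb_ref(int32_t a32, int32_t b32)
{
    return (int32_t)(((int64_t)a32 * (int16_t)b32) >> 16);
}

/* Pre-shifted form: moving b32 left by 16 discards its upper half and puts
 * the low half on top, so the result is the high word of the product. */
static int32_t smulwb_hi(int32_t a32, int32_t b32)
{
    int32_t b_scaled = (int32_t)((uint32_t)b32 << 16);
    return (int32_t)(((int64_t)a32 * b_scaled) >> 32);
}
```

A side effect worth noting is that the left shift also performs the truncation to 16 bits, so no separate sign extension of the second argument is needed in this form.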

The CLZ instruction first appeared in MIPS32, so silk_CLZ16 and silk_CLZ32 can benefit from it on any MIPS32+ target, not only with DSP.

Signed-off-by: Siarhei Volkau <[email protected]>
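For reference, count-leading-zeros helpers built on `__builtin_clz` might look like the sketch below. The names `clz32` and `clz16` are hypothetical and this is not the code from the patch; the zero check is needed because the result of `__builtin_clz(0)` is undefined.

```c
#include <assert.h>
#include <stdint.h>

/* __builtin_clz operates on unsigned int; the offsets below assume 32 bits. */
static_assert(sizeof(unsigned int) == 4, "sketch assumes 32-bit unsigned int");

/* Leading zeros of a 32-bit value; 32 for zero input. */
static int clz32(uint32_t x)
{
    return x ? __builtin_clz(x) : 32;
}

/* Leading zeros of a 16-bit value; 16 for zero input. The value is
 * zero-extended to 32 bits, so 16 is subtracted from the builtin's result. */
static int clz16(uint16_t x)
{
    return x ? __builtin_clz(x) - 16 : 16;
}
```

On targets with a hardware CLZ instruction, GCC and Clang can typically lower the builtin to that instruction plus the zero check.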

MIPS32+ can benefit from the same algorithm as used for the DSP extension, due to its multiply-accumulate nature.

Signed-off-by: Siarhei Volkau <[email protected]>

Signed-off-by: Siarhei Volkau <[email protected]>

When the MIPS version was forked from the C version in 2014, it contained minimal differences from the base C version (a single line of code); it was then disabled completely during further development. So it makes little sense to keep that header in the repo.

Signed-off-by: Siarhei Volkau <[email protected]>

SiarheiVolkau added 16 commits August 23, 2025 17:42

MIPS: fixed_generic_mipsr1.h remove unused functions

820474b


MIPS: unroll fft_downshift loops for performance

38f5a53


refactor: _celt_lpc performance improvement

9a185f2


MIPS: allow __builtin_clz for MIPS32+

58f24e2


MIPS: silk: optimize silk_warped_autocorrelation_FIX for MIPS32+

7a86791


MIPS: delete unused header prefilter_FIX_mipsr1.h

7f8dcff

