Skip to content

Conversation

SiarheiVolkau
Copy link
Contributor

This patch serie address support of MIPS32 (r1-r5 mostly) and revise MIPS DSP ASE support.

MIPS32 have 32x32=>64 bit multiplication and multiply-accumulate instructions
so most improvements for DSP might be adapted to MIPS32 too, although it has only
one accumulator register for such operations.

Performance measured in QEMU by counting instructions, this isn't best approach because
QEMU isn't cycle accurate emulator but from the other point of view MIPS architecture
has many releases with different internal architecture so performance might differ noticeable
on different processors with same instructions set, that's what compiler should be aware of
(flag -mtune=).

QEMU perfomance (in instructions executed):

Architecture Test Initial, insns Improved, insns Difference, % (more is better)
MIPS32r1 test_opus_decode 1 192497750675 173969357000 9.62%
MIPS32r2+DSP test_opus_decode 1 142939494587 140239758103 1.88%
MIPS32r1 test_opus_encode 1 447394529478 396319235982 11.42%
MIPS32r2+DSP test_opus_encode 1 348482765364 328594898435 5.71%

In the end MIPS32 and DSP performance has been measured on real MIPS 24KEc CPU here are results:

Architecture Test Initial, time/s Improved, time/s Difference, % (more is better)
MIPS32r1 decode voip 61.10 54.33 11.08%
MIPS32r1 decode audio 45.86 42.72 6.85%
MIPS32r2+DSP decode voip 46.99 46.97 0.04%
MIPS32r2+DSP decode audio 38.73 38.47 0.67%
MIPS32r1 encode voip 622.10 599.39 3.65%
MIPS32r1 encode audio 374.61 364.62 2.67%
MIPS32r2+DSP encode voip 569.82 574.01 -0.74%
MIPS32r2+DSP encode audio 348.87 346.71 0.62%

MIPS32 since release 1 has support for multiply-accumulate pattern
with 32x32=>64 bit data, although in constrast to DSP extension
it has only one accumulator register.

Another disadvantage is extract scaled result back from accumulator
requires 5 instructions so for 16x16 multiply-accumulate it's
faster to use normal multiptication (32x32=>32) and addition
instructions.

GCC likes to shuffle mult+madd pattern instructions away and then
reload accumulator in between, so default C implementation is far
from optimal.

GCC don't have builtin functions for mult/madd/msub instructions
so inline assembly is used here.

Regarding MIPS32r6 - it doesn't have accumulator register at all.
Instead, it has pair of instructions MUL/MUH which implement
32x32=>64 multiplication on a general registers. C implemetation
matches much better with that processor. So no special version
for it.

MIPS64 in turn has mult+madd instructions but only for compatibility
with 32-bit binaries, taking back result from accumulator requires
same 5 instructions and they must be written in assembly.
Instead, it has 64x64=>64 multiplication on general registers, so
C code shall be good enough with typical dmul+daddu instructions for
multiply-accumulate. So no special version for it.

Signed-off-by: Siarhei Volkau <[email protected]>
Maybe they were in use in 2014?

Signed-off-by: Siarhei Volkau <[email protected]>
It's typical implementation requires involving accumulator register
to get 48+ bit multiplication. but getting scaled result back from
accumulator requires 5 instructions, so typically GCC emits:
# MULT16_32_Q16
 mult a16, b32
 mflo r1
 mfhi r2
 srl  r1, r1, 16
 sll  r2, r2, 16
 or   result, r1, r2

but if we scale 16-bit argument before multiplication
we can get result in one instruction (mfhi):
# MULT16_32_Q16
 sll  a32, a16, 16
 mult a32, b32
 mfhi result

for MIPS32r6 it's even shorter:
 sll  a32, a16, 16
 muh  result, a32, b32

MIPS64 avoids using accumulator here and can scale
result in general register with single instruction.
So no special trick needed.

Signed-off-by: Siarhei Volkau <[email protected]>
For non-DSP MIPS it's worth to use default MAC16_16 implementation.
So there's no difference with pure C implementation.

The real difference goes from manual tuning C code:
- unroll loop one more time for dual_inner_prod
- replace tail if-s by switch in xcorr_kernel
- use 32-bit accumulators for non-DSP variant

Why switch is faster? Probably because compiler don't have to
track j variable till the end of the cycle and can replace exit
condition by something like x < &initial_x[N-3].

These changes increase overall opus_decode test execution speed by about
1% for both DSP and non-DSP versions.

Measurements done in QEMU by counting instructions executed.
QEMU is not cycle accurate and real effect might be lower due
to pipeline stalls.

Signed-off-by: Siarhei Volkau <[email protected]>
I observed that overridden exp_rotation1 is fully
matches default C version.

Overridden renormalise_vector in turn differs only
in celt_inner_prod calculation, so instead of tuning
existing overrides it's worth to remove them compltely
and tune celt_inner_prod instead.

This change gives minor performance improvement for
both DSP and non-DSP MIPS.

Signed-off-by: Siarhei Volkau <[email protected]>
absq_s.ph does two ABS16 with saturation on a vector of two 16-bit
values.
cmp.lt.ph and pick.ph form conditional move for vector of two 16-bit
values.

Also 2x loop unroll for better perfomance.

Original C version can return 0x8000 (positive) whereas this one
is limited by 0x7fff, since result of this function used in ILOG2
context this is important. As a quick fix 1 is added to the result
in hope to return 0x8000 if saturation happens.

Signed-off-by: Siarhei Volkau <[email protected]>
dpaq_s.w.ph does two 16x16=>Q31 multiplications with adding both results
to accumulator register.
For getting Q31, result of multiplication is shifted left by 1.
Also it does saturation: for two 0x8000 (-32768) inputs result is
0x7fffffff (maximal positive Q31).

This instruction is ideal candidate for celt/dual_inner_prod functions
although data alignment isn't always good to utilize it.

Signed-off-by: Siarhei Volkau <[email protected]>
MIPS32 since release 1 has support for multiply-accumulate pattern
with 32x32=>64 bit data, although in constrast to DSP extension
it has only one accumulator register and no builtin functions in GCC.

Signed-off-by: Siarhei Volkau <[email protected]>
Simple loop with shift by 1 unrolled 2 times.
More complex loop unrolled 4 times.

Signed-off-by: Siarhei Volkau <[email protected]>
The code faster because:
 - avoids 64-bit shift on each iteration
 - matches multiply-accumulate pattern
 - might be autovectorized (not verified)

Signed-off-by: Siarhei Volkau <[email protected]>
I observed that overridden silk_noise_shape_quantizer_del_dec is
refactored in C version, and most performance critical parts moved
to separate functions silk_noise_shape_quantizer_short_prediction
and silk_NSQ_noise_shape_feedback_loop.

So let's drop current mips implementation and override only
the critical parts mentioned above.

Signed-off-by: Siarhei Volkau <[email protected]>
MIPS32 has 32x32=>64bit multiplication, although shifting 64-bit result
isn't trivial so its worth to shift right 64-bit value to 32 it means
just drop LSB register of the result.

Since second argument of silk_SMULWB is 16-bit wide we can shift it
left by 16 before multiplication to apply technique above to the result.

Signed-off-by: Siarhei Volkau <[email protected]>
CLZ instruction first appeared in MIPS32

silk_CLZ16 and silk_CLZ32 can benefit of that, not DSP only.

Signed-off-by: Siarhei Volkau <[email protected]>
MIPS32+ can get benefit from same algorithm as used for DSP extension
due its multiply-accumulate nature.

Signed-off-by: Siarhei Volkau <[email protected]>
At the moment of fork mips version from C version in 2014 it contain
minimal differences with base C version (signle line of code), then it
was disabled completely during furhter development.

So it makes little sense to keep that header in repo.

Signed-off-by: Siarhei Volkau <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant