-
Notifications
You must be signed in to change notification settings - Fork 693
MIPS port improvements #4 #419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
SiarheiVolkau
wants to merge
16
commits into
xiph:main
Choose a base branch
from
SiarheiVolkau:patch-6
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
MIPS32 since release 1 has support for multiply-accumulate pattern with 32x32=>64 bit data, although in constrast to DSP extension it has only one accumulator register. Another disadvantage is extract scaled result back from accumulator requires 5 instructions so for 16x16 multiply-accumulate it's faster to use normal multiptication (32x32=>32) and addition instructions. GCC likes to shuffle mult+madd pattern instructions away and then reload accumulator in between, so default C implementation is far from optimal. GCC don't have builtin functions for mult/madd/msub instructions so inline assembly is used here. Regarding MIPS32r6 - it doesn't have accumulator register at all. Instead, it has pair of instructions MUL/MUH which implement 32x32=>64 multiplication on a general registers. C implemetation matches much better with that processor. So no special version for it. MIPS64 in turn has mult+madd instructions but only for compatibility with 32-bit binaries, taking back result from accumulator requires same 5 instructions and they must be written in assembly. Instead, it has 64x64=>64 multiplication on general registers, so C code shall be good enough with typical dmul+daddu instructions for multiply-accumulate. So no special version for it. Signed-off-by: Siarhei Volkau <[email protected]>
Maybe they were in use in 2014? Signed-off-by: Siarhei Volkau <[email protected]>
It's typical implementation requires involving accumulator register to get 48+ bit multiplication. but getting scaled result back from accumulator requires 5 instructions, so typically GCC emits: # MULT16_32_Q16 mult a16, b32 mflo r1 mfhi r2 srl r1, r1, 16 sll r2, r2, 16 or result, r1, r2 but if we scale 16-bit argument before multiplication we can get result in one instruction (mfhi): # MULT16_32_Q16 sll a32, a16, 16 mult a32, b32 mfhi result for MIPS32r6 it's even shorter: sll a32, a16, 16 muh result, a32, b32 MIPS64 avoids using accumulator here and can scale result in general register with single instruction. So no special trick needed. Signed-off-by: Siarhei Volkau <[email protected]>
For non-DSP MIPS it's worth to use default MAC16_16 implementation. So there's no difference with pure C implementation. The real difference goes from manual tuning C code: - unroll loop one more time for dual_inner_prod - replace tail if-s by switch in xcorr_kernel - use 32-bit accumulators for non-DSP variant Why switch is faster? Probably because compiler don't have to track j variable till the end of the cycle and can replace exit condition by something like x < &initial_x[N-3]. These changes increase overall opus_decode test execution speed by about 1% for both DSP and non-DSP versions. Measurements done in QEMU by counting instructions executed. QEMU is not cycle accurate and real effect might be lower due to pipeline stalls. Signed-off-by: Siarhei Volkau <[email protected]>
I observed that overridden exp_rotation1 is fully matches default C version. Overridden renormalise_vector in turn differs only in celt_inner_prod calculation, so instead of tuning existing overrides it's worth to remove them compltely and tune celt_inner_prod instead. This change gives minor performance improvement for both DSP and non-DSP MIPS. Signed-off-by: Siarhei Volkau <[email protected]>
absq_s.ph does two ABS16 with saturation on a vector of two 16-bit values. cmp.lt.ph and pick.ph form conditional move for vector of two 16-bit values. Also 2x loop unroll for better perfomance. Original C version can return 0x8000 (positive) whereas this one is limited by 0x7fff, since result of this function used in ILOG2 context this is important. As a quick fix 1 is added to the result in hope to return 0x8000 if saturation happens. Signed-off-by: Siarhei Volkau <[email protected]>
dpaq_s.w.ph does two 16x16=>Q31 multiplications with adding both results to accumulator register. For getting Q31, result of multiplication is shifted left by 1. Also it does saturation: for two 0x8000 (-32768) inputs result is 0x7fffffff (maximal positive Q31). This instruction is ideal candidate for celt/dual_inner_prod functions although data alignment isn't always good to utilize it. Signed-off-by: Siarhei Volkau <[email protected]>
MIPS32 since release 1 has support for multiply-accumulate pattern with 32x32=>64 bit data, although in constrast to DSP extension it has only one accumulator register and no builtin functions in GCC. Signed-off-by: Siarhei Volkau <[email protected]>
Simple loop with shift by 1 unrolled 2 times. More complex loop unrolled 4 times. Signed-off-by: Siarhei Volkau <[email protected]>
The code faster because: - avoids 64-bit shift on each iteration - matches multiply-accumulate pattern - might be autovectorized (not verified) Signed-off-by: Siarhei Volkau <[email protected]>
I observed that overridden silk_noise_shape_quantizer_del_dec is refactored in C version, and most performance critical parts moved to separate functions silk_noise_shape_quantizer_short_prediction and silk_NSQ_noise_shape_feedback_loop. So let's drop current mips implementation and override only the critical parts mentioned above. Signed-off-by: Siarhei Volkau <[email protected]>
MIPS32 has 32x32=>64bit multiplication, although shifting 64-bit result isn't trivial so its worth to shift right 64-bit value to 32 it means just drop LSB register of the result. Since second argument of silk_SMULWB is 16-bit wide we can shift it left by 16 before multiplication to apply technique above to the result. Signed-off-by: Siarhei Volkau <[email protected]>
CLZ instruction first appeared in MIPS32 silk_CLZ16 and silk_CLZ32 can benefit of that, not DSP only. Signed-off-by: Siarhei Volkau <[email protected]>
MIPS32+ can get benefit from same algorithm as used for DSP extension due its multiply-accumulate nature. Signed-off-by: Siarhei Volkau <[email protected]>
Signed-off-by: Siarhei Volkau <[email protected]>
At the moment of fork mips version from C version in 2014 it contain minimal differences with base C version (signle line of code), then it was disabled completely during furhter development. So it makes little sense to keep that header in repo. Signed-off-by: Siarhei Volkau <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This patch serie address support of MIPS32 (r1-r5 mostly) and revise MIPS DSP ASE support.
MIPS32 have 32x32=>64 bit multiplication and multiply-accumulate instructions
so most improvements for DSP might be adapted to MIPS32 too, although it has only
one accumulator register for such operations.
Performance measured in QEMU by counting instructions, this isn't best approach because
QEMU isn't cycle accurate emulator but from the other point of view MIPS architecture
has many releases with different internal architecture so performance might differ noticeable
on different processors with same instructions set, that's what compiler should be aware of
(flag -mtune=).
QEMU perfomance (in instructions executed):
In the end MIPS32 and DSP performance has been measured on real MIPS 24KEc CPU here are results: