[INFO] Code execution speed considerations for developers #4206

@DedeHai

Description

I want to collect some info here about things I have learned while writing code for the ESP32 family MCUs. Please feel free to add to this.

This is a work in progress.

Comparison of basic operations on the CPU architectures

| Operation | ESP32 @240 MHz (MOPS) | S3 @240 MHz (MOPS) | S2 @240 MHz (MOPS) | C3 @160 MHz (MOPS) |
|---|---|---|---|---|
| Integer Addition | 237.76 | 237.99 | 182.17 | 127.08 |
| Integer Multiply | 237.05 | 238.06 | 182.17 | 120.74 |
| Integer Division | 118.94 | 119.03 | 101.29 | 4.63 |
| Integer Multiply-Add | 158.49 | 158.66 | 136.63 | 127.22 |
| 64-bit Integer Addition | 19.50 | 20.81 | 18.11 | 36.82 |
| 64-bit Integer Multiply | 27.55 | 30.22 | 27.79 | 15.50 |
| 64-bit Integer Division | 2.71 | 2.71 | 2.65 | 1.02 |
| 64-bit Integer Multiply-Add | 19.80 | 21.88 | 19.16 | 20.30 |
| Float Addition | 237.55 | 238.04 | 7.77 | 1.93 |
| Float Multiply | 237.69 | 237.97 | 4.14 | 1.24 |
| Float Division | 1.42 | 4.47 | 0.86 | 0.79 |
| Float Multiply-Add | 474.85 | 475.91 | 6.43 | 1.76 |
| Double Addition | 6.50 | 6.18 | 6.51 | 1.51 |
| Double Multiply | 2.23 | 2.37 | 2.23 | 0.70 |
| Double Division | 0.48 | 0.54 | 0.30 | 0.41 |
| Double Multiply-Add | 5.65 | 5.61 | 5.65 | 1.40 |

(MOPS = million operations per second)

This table was generated using code from https://esp32.com/viewtopic.php?p=82090#

Even though the ESP32 and the S3 have hardware floating-point units, they still do floating-point division in software, so it should be avoided in speed-critical functions.

Edit (softhack007): "Float Multiply-Add" uses a special CPU instruction that combines multiplication and addition. It is generated by the compiler for expressions like `a = a + b * c;`

Why integer division on the C3 is so slow is unknown; the datasheet clearly states that it can do 32-bit integer division in hardware.

Bit shifts vs. division

Bit shifts are always faster than a division, since a shift is a single instruction. The compiler will replace a division with a bit shift wherever possible, so `var / 256` is equivalent to `var >> 8` if `var` is unsigned. If `var` is a signed integer, the two are only equivalent if its value is positive and the compiler can prove that at compile time. The reason: `-200 / 256 == 0` but `-200 >> 8 == -1`. So when using signed integers and a bit shift is possible, it is better to write the shift explicitly instead of leaving it to the compiler.
When the different rounding of signed bit shifts matters, use a normal division. On the ESP32-C3 there is a bit-manipulation trick that can be used instead. Here is an example of it that I use in the particle system:

```cpp
int32_t ximpulse = (impulse * dx) / 32767;                       // using a division
int32_t ximpulse = (impulse * dx + ((dx >> 31) & 32767)) >> 15;  // using the shift trick
```

`(dx >> 31)` extracts the sign bit as an all-ones mask, so the rounding correction is applied to negative values only, compensating for the negative bias of arithmetic right shifts. This is still about 2x faster than doing a division, but only on the C3.

Fixed point vs. float

Fixed-point math is less accurate than float, but for most operations it is accurate enough, and it runs much faster, especially for divisions.
When mixing float and integer math there is a pitfall: casting a negative float to an unsigned integer is undefined behaviour and leads to problems on some CPUs, see https://embeddeduse.com/2013/08/25/casting-a-negative-float-to-an-unsigned-int/
To avoid this problem, explicitly cast the float to a signed int before assigning it to an unsigned integer.

Modulo Operator: %

The modulo operator compiles to several instructions. A modulo by a power of two can be replaced with a bitwise AND (`&`), which is a single instruction. The rule for unsigned `n` is `n % 2^i == n & (2^i - 1)`; for example, `n % 2048 == n & 2047`.

Speed of different Memory Types

  • PSRAM: buffers are copied to a cache, so the first access to a buffer not yet in the cache is slow; consecutive accesses are fast, no matter if sequential or random, unless the buffer does not fit into the cache.
    • cached access on the ESP32 and S3 works for buffers up to ~26 kB
    • cached access on the S2 works only for buffers up to ~6 kB
    • random access to buffers that do not fit into the cache is slow and depends on cache misses; a 300% increase in access time is not uncommon, and it can go up to 2000% for fully random access.
  • FAST RTC MEMORY (not available on the classic ESP32): access is on par with normal DRAM; my tests showed no difference, even though Espressif states that access is slower. It can be used like normal DRAM; ~7 kB are available. malloc() will use it, but the system gives it low priority.
  • SLOW RTC MEMORY (not available on the C3): cannot be allocated dynamically; use RTC_NOINIT_ATTR to place a static buffer in this memory. Access is slow (200%-500% of DRAM access time); size is limited to ~4-6 kB.
