I want to collect some info here about things I have learned while writing code for the ESP32 family MCUs. Please feel free to add to this.
This is a work in progress.
## Comparison of basic operations on the CPU architectures
Operation | ESP32 @ 240 MHz (MOPS) | S3 @ 240 MHz (MOPS) | S2 @ 240 MHz (MOPS) | C3 @ 160 MHz (MOPS) |
---|---|---|---|---|
Integer Addition | 237.76 | 237.99 | 182.17 | 127.08 |
Integer Multiply | 237.05 | 238.06 | 182.17 | 120.74 |
Integer Division | 118.94 | 119.03 | 101.29 | 4.63 |
Integer Multiply-Add | 158.49 | 158.66 | 136.63 | 127.22 |
64-bit Integer Addition | 19.50 | 20.81 | 18.11 | 36.82 |
64-bit Integer Multiply | 27.55 | 30.22 | 27.79 | 15.50 |
64-bit Integer Division | 2.71 | 2.71 | 2.65 | 1.02 |
64-bit Integer Multiply-Add | 19.80 | 21.88 | 19.16 | 20.30 |
Float Addition | 237.55 | 238.04 | 7.77 | 1.93 |
Float Multiply | 237.69 | 237.97 | 4.14 | 1.24 |
Float Division | 1.42 | 4.47 | 0.86 | 0.79 |
Float Multiply-Add | 474.85 | 475.91 | 6.43 | 1.76 |
Double Addition | 6.50 | 6.18 | 6.51 | 1.51 |
Double Multiply | 2.23 | 2.37 | 2.23 | 0.70 |
Double Division | 0.48 | 0.54 | 0.30 | 0.41 |
Double Multiply-Add | 5.65 | 5.61 | 5.65 | 1.40 |
This table was generated using code from https://esp32.com/viewtopic.php?p=82090#
Even though the ESP32 and the S3 have hardware floating-point units, they still do floating-point division in software, so it should be avoided in speed-critical functions.
Edit (softhack007): "Float Multiply-Add" uses a special CPU instruction that combines addition and multiplication. It is generated by the compiler for expressions like `a = a + b * C;`
Why integer division on the C3 is so slow is unknown; the datasheet clearly states that it can do 32-bit integer division in hardware.
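Since float division runs in software even on the FPU-equipped chips, a common workaround when the divisor is fixed is to precompute its reciprocal once and multiply in the hot path. A minimal sketch (the helper name is mine, not from the post; the result can differ from a true division in the last bit):

```c
#include <stdint.h>

/* Division by a fixed float is much cheaper as a multiply:
 * precompute the reciprocal once, then multiply per element.
 * scale_by() is an illustrative helper, not an ESP-IDF API. */
static inline float scale_by(float x, float inv_divisor) {
    return x * inv_divisor;  /* single FPU multiply on ESP32/S3 */
}

/* Usage: instead of y = x / 3.0f inside a loop, do
 *   const float inv = 1.0f / 3.0f;  // one slow division, outside the loop
 *   y = scale_by(x, inv);           // fast multiply per element
 */
```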
## Bit shifts vs. division
Bit shifts are always faster than a division, as a shift is a single instruction. The compiler will replace divisions with bit shifts wherever possible, so `var / 256` is equivalent to `var >> 8` if `var` is unsigned. If `var` is a signed integer, the two are only equivalent if its value is positive and that fact is known at compile time. The reason: `-200 / 256 = 0` but `-200 >> 8 = -1`. So when using signed integers and a bit shift is possible, it is better to write the shift explicitly instead of leaving it to the compiler.
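The `-200` example can be checked with a minimal sketch (helper names are mine; it assumes `>>` on signed values is an arithmetic shift, as it is on GCC and Clang):

```c
#include <stdint.h>

/* Truncating division rounds toward zero; an arithmetic right shift
 * rounds toward negative infinity. The two disagree for negative values.
 * Assumes arithmetic right shift for signed types (GCC/Clang behaviour). */
static inline int32_t div256(int32_t v) { return v / 256; }
static inline int32_t shift8(int32_t v) { return v >> 8; }
```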
When the incorrect rounding of signed integer bit shifts matters, use a normal division. On the ESP32-C3 there is a bit-manipulation trick that can be used instead. Here is an example of it that I use in the particle system:

```c
int32_t ximpulse = (impulse * dx) / 32767;                       // division version
int32_t ximpulse = (impulse * dx + ((dx >> 31) & 32767)) >> 15;  // bit-shift version
```

`(dx >> 31)` extracts the sign bit as a mask, so the rounding correction is applied to negative values only, compensating for the asymmetry of right shifts that would otherwise bias results toward negative. This is still 2x faster than doing a division, but only on the C3.
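The correction can be verified with a small sketch (helper names are mine). Two caveats worth noting: the shift divides by 32768 rather than 32767, a ~0.003% difference that is negligible here, and the sign mask is taken from `dx`, so the result only matches truncating division when `impulse` is non-negative, as it is in the particle system:

```c
#include <stdint.h>

/* Rounding-corrected shift, assuming impulse >= 0 so the product
 * takes its sign from dx. Assumes arithmetic right shift for signed
 * types (GCC/Clang behaviour). */
static inline int32_t shift_div(int32_t impulse, int32_t dx) {
    return (impulse * dx + ((dx >> 31) & 32767)) >> 15;
}

/* What the correction reproduces exactly: truncating division by 32768. */
static inline int32_t trunc_div(int32_t impulse, int32_t dx) {
    return (impulse * dx) / 32768;
}
```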
## Fixed point vs. float
Using fixed-point math is less accurate, but for most operations it is accurate enough, and it runs much faster, especially for divisions.
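As a minimal illustration of the idea (Q16.16 format; the type and helper names are mine, not from any particular library), a value x is stored as the integer x * 65536:

```c
#include <stdint.h>

/* Minimal Q16.16 fixed-point sketch: value x is stored as x * 65536. */
typedef int32_t q16;

static inline q16 q16_from_int(int32_t i) { return i << 16; }

/* Widen to 64 bit so the intermediate product cannot overflow. */
static inline q16 q16_mul(q16 a, q16 b) { return (q16)(((int64_t)a * b) >> 16); }
static inline q16 q16_div(q16 a, q16 b) { return (q16)((((int64_t)a) << 16) / b); }
```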
When doing mixed math there is a pitfall: casting a negative float to an unsigned integer is undefined behaviour and leads to problems on some CPUs (see https://embeddeduse.com/2013/08/25/casting-a-negative-float-to-an-unsigned-int/). To avoid this, explicitly cast the float to a signed `int` first, then assign it to the unsigned integer.
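A minimal sketch of the safe two-step cast (the helper name is mine):

```c
#include <stdint.h>

/* Casting a negative float directly to an unsigned type is undefined
 * behaviour in C. Going through a signed int first is well defined:
 * truncation toward zero, then modular conversion to unsigned. */
static inline uint32_t float_to_u32(float f) {
    return (uint32_t)(int32_t)f;  /* cast to int first, then to unsigned */
}
```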
## Modulo Operator: %
The modulo operator compiles to several instructions. A modulo by a power of two (2^i) can be replaced with a bitwise AND (`&`), which is a single instruction. The rule is `n % 2^i == n & (2^i - 1)`; for example, `n % 2048 == n & 2047`. As with bit shifts, this only holds for unsigned (or non-negative) values.
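The identity can be checked with a small sketch (helper names are mine; unsigned values, where the two forms agree):

```c
#include <stdint.h>

/* n % 2^i == n & (2^i - 1) for unsigned n: here 2^i = 2048. */
static inline uint32_t mod2048_and(uint32_t n) { return n & 2047u; }  /* single AND */
static inline uint32_t mod2048_mod(uint32_t n) { return n % 2048u; }  /* several instructions */
```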
## Speed of different Memory Types
- PSRAM: buffers are copied to a cache, so the first access to a buffer not yet in cache is slow; consecutive accesses are fast, no matter if sequential or random, unless the buffer does not fit into the cache
  - cached access on the ESP32 and S3 works up to ~26 kB
  - cached access on the S2 works only up to ~6 kB
  - random access to buffers that do not fit into the cache is slow and depends on cache misses; a 300% increase in access time is not uncommon, and it can go up to 2000% for fully random access
- FAST RTC MEMORY (not available on the classic ESP32): access is on par with normal DRAM; my tests showed no difference, even though Espressif states that access is slower. It can be used like normal DRAM, with ~7 kB available. `malloc()` will use it, but the system gives it low priority.
- SLOW RTC MEMORY (not available on the C3): cannot be allocated dynamically; use `RTC_NOINIT_ATTR` to place a static buffer in this memory. Access is slow (200%-500% compared to DRAM), and size is limited to ~4 kB-6 kB.
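A minimal sketch of placing a static buffer there with `RTC_NOINIT_ATTR` (a real ESP-IDF attribute from `esp_attr.h`; the stub below only exists so the sketch also compiles on a host, and the buffer name and size are mine):

```c
#include <stdint.h>

#ifndef RTC_NOINIT_ATTR    /* provided by esp_attr.h when building for ESP-IDF */
#define RTC_NOINIT_ATTR    /* no-op stub so this sketch also compiles on a host */
#endif

/* Static buffer placed in RTC memory; contents survive deep sleep
 * and are NOT zeroed at boot (hence "NOINIT"). */
RTC_NOINIT_ATTR static uint8_t rtc_buf[4096];
```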