weird storage.erase_filesystem() problems on metro rp2350 #10104

jepler · 2025-02-26T23:37:00Z

CircuitPython version and board name

metro rp2350

Code/REPL

>>> storage.erase_filesystem()

Behavior

Usually freezes with the white LED on.

With a pico-probe attached, a variety of weird crashes and double faults are observed. For instance, on one occasion it crashed within a function in flash whose contents appeared to have been overwritten; on restart, the contents were restored (so possibly a problem with the XIP cache?).

(gdb) disas common_hal_os_urandom 
Dump of assembler code for function common_hal_os_urandom:
   0x100419a4 <+0>:    movs    r0, r0
   0x100419a6 <+2>:    movs    r0, r0
   0x100419a8 <+4>:    movs    r0, r0
   0x100419aa <+6>:    movs    r0, r0
   0x100419ac <+8>:    movs    r0, r0
   0x100419ae <+10>:    movs    r0, r0
   0x100419b0 <+12>:    movs    r0, r0
   0x100419b2 <+14>:    movs    r0, r0
   0x100419b4 <+16>:    movs    r0, r0
   0x100419b6 <+18>:    movs    r0, r0
   0x100419b8 <+20>:    movs    r0, r0
   0x100419ba <+22>:    movs    r0, r0
   0x100419bc <+24>:    movs    r0, r0
   0x100419be <+26>:    movs    r0, r0
   0x100419c0 <+28>:    str    r3, [r4, #0]

I believe that by trying various revisions I excluded my recent changes to auto-initialize HSTX & to interrupt handling. However, I encourage anyone picking up this issue to double check. Especially the interrupt handling change, which was designed to prevent HSTX display glitches during flash writes, could be related to this....

Description

No response

Additional information

No response


jepler · 2025-02-26T23:37:34Z

discord stream of consciousness from me: https://discord.com/channels/327254708534116352/327298996332658690/1344430703114321930

eightycc · 2025-02-28T16:37:51Z

There is a possible race condition. The global variable nesting_count used in common_hal_mcu_[en,dis]able_interrupts is volatile but accesses to it in these functions are not atomic. Since this variable is manipulated with interrupts partially enabled, it may get clobbered.

eightycc · 2025-03-01T14:00:09Z

The problem is reproducible on an Adafruit Feather RP2350 with an HSTX to DVI adapter attached. Reproducing the problem also requires initializing picodvi.Framebuffer. I used Scott's demo ruler app from the adapter tutorial.

eightycc · 2025-03-01T14:20:58Z

With stock 9.2.4 the problem does not occur; a build from the tip of main does fail. I protected nesting_count in common_hal_mcu_disable_interrupts() by completely disabling interrupts on entry with save_and_disable_interrupts() and re-enabling them on exit with restore_interrupts(), but this does not resolve the problem.
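For readers following along, here is a minimal host-side sketch of the kind of guard described above: the read-modify-write of nesting_count is wrapped in a short fully-masked section. It is not the actual CircuitPython code. The sim_* functions are hypothetical stand-ins for the Pico SDK's save_and_disable_interrupts() and restore_interrupts() so the model compiles off-target, and sim_port_irqs_masked is an invented flag standing in for the port's partial masking. As noted in the comment above, this change did not make the hang go away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-side stand-ins for the Pico SDK calls
 * save_and_disable_interrupts() / restore_interrupts(). */
static bool sim_irqs_enabled = true;

static uint32_t sim_save_and_disable_interrupts(void) {
    uint32_t state = sim_irqs_enabled ? 1u : 0u;
    sim_irqs_enabled = false;
    return state;
}

static void sim_restore_interrupts(uint32_t state) {
    sim_irqs_enabled = (state != 0u);
}

/* Nesting counter as named in the thread; everything else is assumed. */
static volatile uint32_t nesting_count = 0;
static bool sim_port_irqs_masked = false; /* models "interrupts partially enabled" */

void sketch_disable_interrupts(void) {
    /* Fully mask interrupts so the counter update cannot be interleaved. */
    uint32_t state = sim_save_and_disable_interrupts();
    if (nesting_count == 0) {
        sim_port_irqs_masked = true;
    }
    nesting_count++;
    sim_restore_interrupts(state);
}

void sketch_enable_interrupts(void) {
    uint32_t state = sim_save_and_disable_interrupts();
    assert(nesting_count > 0); /* unbalanced enable would be a caller bug */
    nesting_count--;
    if (nesting_count == 0) {
        sim_port_irqs_masked = false;
    }
    sim_restore_interrupts(state);
}
```

Saving and restoring the previous state, rather than unconditionally re-enabling, keeps the helper safe to call from a context where interrupts are already off.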

eightycc · 2025-03-02T12:54:29Z

I'm beginning to think that the problem is an interaction between DMA read activity to the HSTX peripheral and the XIP section while programming a flash block. It's possibly a starvation issue for XIP during flash programming. Going through the Pico SDK and RP2350 bootrom code with a fine-toothed comb, I'm not finding any nits that would cause breakage from the brief delay introduced by running the IRQ service routine for the framebuffer.

Admittedly there remains much hand waving in this explanation. I'm devising a stand-alone test to see if I can eliminate as many variables as possible.

In the meantime, I recommend backing out #10049 and explicitly turning off HSTX DMA during flash write operations.

jepler · 2025-03-02T15:46:24Z

We could put in a call to release displays if it only affects storage.erase_filesystem() but my worry is that it affects any flash writes...

eightycc · 2025-03-02T16:10:34Z

@jepler AFAICT it affects all flash writes. storage.erase_filesystem() performs a large number of back-to-back writes, which serves to hit the problem reliably. There's no error detection/reporting mechanism in the flash write SDK or bootrom code, so when it fails it does so silently. Sigh.

I'm giving myself a crash course on DMA/HSTX/TMDS operation. It's impressively complicated.

There is code in port_internal_flash_flush that pauses audio DMA activity during a flash write. Something similar for HSTX?

eightycc · 2025-03-02T16:29:55Z

There's another spot in ByteArray.c:write_page() that writes flash. This instance does not disable audio DMA.

erase_and_write_sector in ByteArray.c too.

We'll want to factor all flash writes into a single function.
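A rough host-side sketch of what "a single function" could look like follows. Every name here is hypothetical: sim_flash_program() stands in for the real flash programming call, and the pause/resume hooks stand in for whatever background DMA (audio, HSTX) would need quiescing. The point is only that each writer goes through one choke point, so a pause added there covers all of them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_FLASH_SIZE 4096u /* size of the simulated flash, arbitrary */
#define SKETCH_PAGE_SIZE  256u  /* page granularity assumed for the model */

static uint8_t sim_flash[SKETCH_FLASH_SIZE];
static int sim_dma_pause_depth = 0;
static int sim_writes_with_dma_running = 0;

/* Hypothetical hooks for pausing/resuming background DMA. */
static void sim_pause_background_dma(void) { sim_dma_pause_depth++; }
static void sim_resume_background_dma(void) {
    assert(sim_dma_pause_depth > 0);
    sim_dma_pause_depth--;
}

/* Stand-in for the real programming call; records unsafe use. */
static void sim_flash_program(uint32_t offset, const uint8_t *data, size_t len) {
    if (sim_dma_pause_depth == 0) {
        sim_writes_with_dma_running++;
    }
    memcpy(&sim_flash[offset], data, len);
}

/* The single choke point: validate, quiesce, program, resume. */
bool sketch_flash_write(uint32_t offset, const uint8_t *data, size_t len) {
    if (offset % SKETCH_PAGE_SIZE != 0 || len % SKETCH_PAGE_SIZE != 0) {
        return false;
    }
    if (offset > SKETCH_FLASH_SIZE || len > SKETCH_FLASH_SIZE - offset) {
        return false;
    }
    sim_pause_background_dma();
    sim_flash_program(offset, data, len);
    sim_resume_background_dma();
    return true;
}

/* Callers (filesystem flush, nvm page write, ...) become thin wrappers. */
bool sketch_nvm_write_page(uint32_t page_index, const uint8_t *page) {
    return sketch_flash_write(page_index * SKETCH_PAGE_SIZE, page, SKETCH_PAGE_SIZE);
}
```

Returning a bool is a choice made for this model so that a rejected write is visible to the caller; the thread notes that the real SDK write path reports nothing on failure.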

jepler · 2025-03-02T16:48:00Z

while it's true that the audio dma disable maybe should be done in nvm's write_page as well, this was added to fix bad audio output while interrupts were delayed, not to fix incorrect flash writes or bad xip after flash writes. For more background:

eightycc · 2025-03-02T17:11:54Z

Tangentially related, the code in supervisor_flash_pre_write() and supervisor_flash_post_write() is redundant with SDK 2.1. The MicroPython folks must have hit the same problem with PSRAM config across flash writes and fixed it in raspberrypi/pico-sdk#2082. Likewise, the necessary XIP cache clean went in with SDK 2.1 in raspberrypi/pico-sdk#2013.

dhalbert · 2025-03-03T14:19:47Z

I would like to do a 9.2.5 release pretty soon, but consider this a blocker. What is the best short-term way forward? Is it

"In the meantime, I recommend backing out #10049 and explicitly turning off HSTX DMA during flash write operations."

Or are we on the way to a fix?

Is there anything to be done with DMA priorities that could improve glitching?

eightycc · 2025-03-03T15:28:26Z

I've not nailed the root cause. Earlier I wrote that it appeared not to be IRQ related, but on closer examination this may not be correct. The flash write code in the SDK appears to assume that interrupts are completely disabled and the "victim" core is entirely quiescent. There's a very funky window where the bootrom is re-entered to re-initialize XIP that could be hazardous to interrupt.

I'm adding a third DMA channel to re-trigger the command DMA channel in framebuffer to eliminate the interrupt, but since I'm also climbing a steep learning curve it's slow going. If someone with more knowledge of RP2350 DMA wants to jump in, please do.

tannewt · 2025-03-03T19:17:58Z

I think the simplest thing is to disable the HSTX display before doing erase_filesystem. It resets anyway so just turn it off early.

Let's deal with other flash writes during HSTX separately. I haven't actually seen it myself, so I'm not convinced it is a big issue.

eightycc · 2025-03-04T18:53:50Z

Note to self: Constructing a new framebuffer via common_hal_picodvi_framebuffer_construct() after picodvi_autoconstruct() leaks memory and DMA channels. Attempting to create a large framebuffer will exhaust memory, resulting in a crash due to an attempt to de-init the clobbered framebuffer object. This needs a separate issue.

tannewt · 2025-03-04T19:05:58Z

This fix? tannewt@b4675f7#diff-06a92b3a928c9d1ed0731ea128ee37130e527aed7cf4ec3c100b0049de8b49b1R272-R274

I don't see how it leaks DMA channels.

eightycc · 2025-03-04T19:24:39Z

@tannewt Turns out it wasn't actually leaking DMA channels, it simply wasn't resetting the DMA channel numbers in the framebuffer object so it was attempting to un-claim them twice. It looked like a leak on first glance. Since zero is a valid DMA channel, I changed the channels in framebuffer to int and set them to -1 when not assigned. I did spot the other memory leak in the patch you referenced plus at least one more. Since I'm deep into this code, are there any other patches?
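The sentinel approach described above can be illustrated with a small host-side model. The struct and field names are hypothetical, and sim_dma_channel_claim()/sim_dma_channel_unclaim() are stand-ins that assert on a double claim or double un-claim so the failure mode is visible off-target. Because channel 0 is valid, -1 is used to mean "not assigned".

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NUM_DMA_CHANNELS 16 /* channel count assumed for this model */

static bool sim_claimed[SKETCH_NUM_DMA_CHANNELS];

/* Hypothetical stand-ins for DMA channel claim/un-claim. */
static void sim_dma_channel_claim(int ch) {
    assert(ch >= 0 && ch < SKETCH_NUM_DMA_CHANNELS);
    assert(!sim_claimed[ch]);
    sim_claimed[ch] = true;
}

static void sim_dma_channel_unclaim(int ch) {
    assert(ch >= 0 && ch < SKETCH_NUM_DMA_CHANNELS);
    assert(sim_claimed[ch]); /* a second un-claim would trip here */
    sim_claimed[ch] = false;
}

/* Hypothetical framebuffer object: channels are int, -1 means unassigned. */
typedef struct {
    int dma_pixel_channel;
    int dma_command_channel;
} sketch_framebuffer_t;

void sketch_framebuffer_init(sketch_framebuffer_t *fb) {
    fb->dma_pixel_channel = -1;
    fb->dma_command_channel = -1;
}

static void sketch_release_channel(int *ch) {
    if (*ch >= 0) {
        sim_dma_channel_unclaim(*ch);
        *ch = -1; /* reset so a repeated deinit is a no-op */
    }
}

void sketch_framebuffer_deinit(sketch_framebuffer_t *fb) {
    sketch_release_channel(&fb->dma_pixel_channel);
    sketch_release_channel(&fb->dma_command_channel);
}
```

With the reset in place, calling the deinit twice on the same object releases each channel exactly once.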

tannewt · 2025-03-04T19:30:54Z

"Since I'm deep into this code, are there any other patches?"

That's my working branch and has all of my changes.

eightycc · 2025-03-05T03:20:13Z

Found the root cause: In supervisor_flash_init() flash_do_cmd() is called without disabling interrupts. PR will follow shortly.
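One plausible shape for guarding such a call is sketched below as a host-side model; it is not taken from the PR. sim_flash_do_cmd() mirrors the argument list of the Pico SDK's flash_do_cmd() but only echoes bytes and records whether it ran with interrupts enabled, and the sim_* interrupt helpers are hypothetical stand-ins for save_and_disable_interrupts() and restore_interrupts().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side model of the global interrupt mask. */
static bool sim_irqs_enabled = true;
static bool sim_cmd_ran_with_irqs_enabled = false;

static uint32_t sim_save_and_disable_interrupts(void) {
    uint32_t state = sim_irqs_enabled ? 1u : 0u;
    sim_irqs_enabled = false;
    return state;
}

static void sim_restore_interrupts(uint32_t state) {
    sim_irqs_enabled = (state != 0u);
}

/* Stand-in for the flash command call: echoes tx into rx and notes
 * whether an interrupt could have fired while XIP was unavailable. */
static void sim_flash_do_cmd(const uint8_t *txbuf, uint8_t *rxbuf, size_t count) {
    if (sim_irqs_enabled) {
        sim_cmd_ran_with_irqs_enabled = true;
    }
    for (size_t i = 0; i < count; i++) {
        rxbuf[i] = txbuf[i];
    }
}

/* Wrapper: no interrupt handler can run from flash during the command. */
void sketch_flash_cmd_irq_safe(const uint8_t *txbuf, uint8_t *rxbuf, size_t count) {
    uint32_t state = sim_save_and_disable_interrupts();
    sim_flash_do_cmd(txbuf, rxbuf, count);
    sim_restore_interrupts(state);
}
```

The wrapper restores the saved state instead of forcing interrupts back on, so it behaves the same whether or not the caller had already masked them.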

jepler · 2025-03-05T14:07:28Z

Appreciate the sleuthing @eightycc !

jepler added the bug label Feb 26, 2025

tannewt added the crash and rp235x labels Feb 27, 2025

tannewt added this to the 9.2.x milestone Feb 27, 2025

eightycc mentioned this issue Mar 5, 2025

Fix RP2350 Hang During storage.erase_filesystem #10111

Merged

tannewt closed this as completed in #10111 Mar 5, 2025

eightycc mentioned this issue Mar 6, 2025

Fixes to RP2350 Framebuffer and handling of flash writes #10116

Draft
