Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

weird storage.erase_filesystem() problems on metro rp2350 #10104

Closed
jepler opened this issue Feb 26, 2025 · 19 comments · Fixed by #10111
Closed

weird storage.erase_filesystem() problems on metro rp2350 #10104

jepler opened this issue Feb 26, 2025 · 19 comments · Fixed by #10111

Comments

@jepler
Copy link
Member

jepler commented Feb 26, 2025

CircuitPython version and board name

metro rp2350

Code/REPL

>>> storage.erase_filesystem()

Behavior

Usually freezes with the white LED on.

with pico-probe a variety of weird crashes and double faults are observed. For instance on one occasion it crashed within a function in flash that appeared to have its content overwritten; but on restart, the content was restored (so problem with XIP cache?)

(gdb) disas common_hal_os_urandom 
Dump of assembler code for function common_hal_os_urandom:
   0x100419a4 <+0>:    movs    r0, r0
   0x100419a6 <+2>:    movs    r0, r0
   0x100419a8 <+4>:    movs    r0, r0
   0x100419aa <+6>:    movs    r0, r0
   0x100419ac <+8>:    movs    r0, r0
   0x100419ae <+10>:    movs    r0, r0
   0x100419b0 <+12>:    movs    r0, r0
   0x100419b2 <+14>:    movs    r0, r0
   0x100419b4 <+16>:    movs    r0, r0
   0x100419b6 <+18>:    movs    r0, r0
   0x100419b8 <+20>:    movs    r0, r0
   0x100419ba <+22>:    movs    r0, r0
   0x100419bc <+24>:    movs    r0, r0
   0x100419be <+26>:    movs    r0, r0
   0x100419c0 <+28>:    str    r3, [r4, #0]

I believe that by trying various revisions I excluded my recent changes to auto-initialize HSTX & to interrupt handling. However, I encourage anyone picking up this issue to double check. Especially the interrupt handling change, which was designed to prevent HSTX display glitches during flash writes, could be related to this....

Description

No response

Additional information

No response

@jepler jepler added the bug label Feb 26, 2025
@jepler
Copy link
Member Author

jepler commented Feb 26, 2025

@tannewt tannewt added this to the 9.2.x milestone Feb 27, 2025
@eightycc
Copy link

There is a possible race condition. The global variable nesting_count used in common_hal_mcu_[en,dis]able_interrupts is volatile but accesses to it in these functions are not atomic. Since this variable is manipulated with interrupts partially enabled, it may get clobbered.

@eightycc
Copy link

eightycc commented Mar 1, 2025

The problem is reproducible on an Adafruit Feather RP2350 with an HSTX to DVI adapter attached. Reproducing the problem also requires initializing picodvi.Framebuffer. I used Scott's demo ruler app from the adapter tutorial.

@eightycc
Copy link

eightycc commented Mar 1, 2025

Running with stock 9.2.4 problem does not occur. A top of main build does fail. I protected nesting_count in common_hal_mcu_disable_interrupts() by completely disabling interrupts at entry and re-enabling on exit by save_and_disable_interrupts() and restore_interrupts() but this does not resolve the problem.

@eightycc
Copy link

eightycc commented Mar 2, 2025

I'm beginning to think that the problem is interaction between DMA read access activity to the HSTX peripheral and the XIP section while programming a flash block. It's possibly a starvation issue for XIP during flash programming. Going through the Pico SDK and RP2350 bootrom code with a fine-toothed comb I'm not finding any nits that would cause breakage from the brief delay introduced by running the IRQ service routine for the frambuffer.

Admittedly there remains much hand waving in this explanation. I'm devising a stand-alone test to see if I can eliminate as many variables as possible.

In the meantime, I recommend backing out #10049 and explicitly turning off HSTX DMA during flash write operations.

@jepler
Copy link
Member Author

jepler commented Mar 2, 2025

We could put in a call to release displays if it only affects storage.erase_filesystem() but my worry is that it affects any flash writes...

@eightycc
Copy link

eightycc commented Mar 2, 2025

@jepler AFAIKT it affects all flash writes. storage.erase_filesystem() is performing a large number of back-to-back writes which serves to hit the problem reliably. There's no error detection/reporting mechanism in flash write SDK or bootrom code, so when it fails it does so silently. Sigh.

I'm giving myself a crash course on DMA/HSTX/TDMS operation. It's impressively complicated.

There is code in port_internal_flash_flush that pauses audio DMA activity during a flash write. Something similar for HSTX?

@eightycc
Copy link

eightycc commented Mar 2, 2025

There's another spot in ByteArray.c:write_page() that writes flash. This instance does not disable audio DMA.

erase_and_write_sector in ByteArray.c too.

We'll want to factor all flash writes into a single function.

@jepler
Copy link
Member Author

jepler commented Mar 2, 2025

while it's true that the audio dma disable maybe should be done in nvm's write_page as well, this was added to fix bad audio output while interrupts were delayed, not to fix incorrect flash writes or bad xip after flash writes. For more background:

@eightycc
Copy link

eightycc commented Mar 2, 2025

Tangentially related, the code in supervisor_flash_pre_write() and supervisor_flash_post_write() is redundant with SDK 2.1. The Micropython folks must have hit the same problem with PSRAM config across flash writes and fixed it in raspberrypi/pico-sdk#2082. Likewise, the necessary XIP cache clean went in with SDK 2.1 in raspberrypi/pico-sdk#2013.

@dhalbert
Copy link
Collaborator

dhalbert commented Mar 3, 2025

I would like to do a 9.2.5 release pretty soon, but consider this a blocker. What is the best short-term way forward? Is it

In the meantime, I recommend backing out #10049 and explicitly turning off HSTX DMA during flash write operations.

Or are we on the way to a fix?

Is there anything to be done with DMA priorities that could improve glitching?

@eightycc
Copy link

eightycc commented Mar 3, 2025

I've not nailed the root cause. Earlier I wrote that it appeared not to be IRQ related, but on closer examination this may not be correct. The flash write code in the SDK appears to assume that interrupts are completely disabled and the "victim" core is entirely quiescent. There's a very funky window where the bootrom is re-entered to re-initialize XIP that could be hazardous to interrupt.

I'm adding a third DMA channel to re-trigger the command DMA channel in framebuffer to eliminate the interrupt, but since I'm also climbing a steep learning curve it's slow going. If someone with more knowledge of RP2350 DMA wants to jump in, please do.

@tannewt
Copy link
Member

tannewt commented Mar 3, 2025

I think the simplest thing is to disable the HSTX display before doing erase_filesystem. It resets anyway so just turn it off early.

Let's deal with other flash writes during HSTX separately. I haven't actually see it myself so I'm wary it is a big issue.

@eightycc
Copy link

eightycc commented Mar 4, 2025

Note to self: Constructing a new framebuffer via common_hal_picodvi_framebuffer_construct() after picodvi_autoconstruct() leaks memory and dma channels. Attempting to create a large framebuffer will exhaust memory, resulting in a crash due to an attempt to de-init the clobbered framebuffer object. This needs a separate issue.

@tannewt
Copy link
Member

tannewt commented Mar 4, 2025

This fix? tannewt@b4675f7#diff-06a92b3a928c9d1ed0731ea128ee37130e527aed7cf4ec3c100b0049de8b49b1R272-R274

I don't see how it leaks DMA channels.

@eightycc
Copy link

eightycc commented Mar 4, 2025

@tannewt Turns out it wasn't actually leaking DMA channels, it simply wasn't resetting the DMA channel numbers in the framebuffer object so it was attempting to un-claim them twice. It looked like a leak on first glance. Since zero is a valid DMA channel, I changed the channels in framebuffer to int and set them to -1 when not assigned. I did spot the other memory leak in the patch you referenced plus at least one more. Since I'm deep into this code, are there any other patches?

@tannewt
Copy link
Member

tannewt commented Mar 4, 2025

Since I'm deep into this code, are there any other patches?

That's my working branch and has all of my changes.

@eightycc
Copy link

eightycc commented Mar 5, 2025

Found the root cause: In supervisor_flash_init() flash_do_cmd() is called without disabling interrupts. PR will follow shortly.

@jepler
Copy link
Member Author

jepler commented Mar 5, 2025

Appreciate the sleuthing @eightycc !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants