
@cwoffenden
Collaborator

@cwoffenden cwoffenden commented Sep 15, 2025

ProcessAudio() behaves as if it's running on the main thread, so spin locks in it also block MainLoop() from running*. I removed the previous workaround, a counter in the AW process callback that delayed any interaction between the audio and main threads.

The previous main thread's code is now run in a worker, which still tests the spinlocks from the AW's side.

*it may also just be that the main thread is used to schedule the callbacks.

@cwoffenden
Collaborator Author

cwoffenden commented Sep 15, 2025

test_glgears_proxy_jstarget failed; the audio locks tests didn't. Running locally with --repeat 1000, the original code eventually fails within the 1000 repetitions (and also on repeated runs; within 10 repetitions on a 2-core VM). The new code doesn't, though I'll test more on a variety of hardware/OS combos.

@cwoffenden cwoffenden marked this pull request as ready for review September 15, 2025 18:23
@cwoffenden
Collaborator Author

5000+ runs and counting. In a 2-core VM this PR just keeps on going, while the earlier code fails quickly.

The original code either fails here, taking more than 10s to acquire a lock it should get in the next frame or so:

result = emscripten_lock_busyspin_wait_acquire(&testLock, 10000);

And only after timing out will the main thread release the lock, so during spinning the main thread must be blocked via the audio thread. This isn't the case with a worker.
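
To make the failure mode concrete, here is a minimal portable C11 model of a busy-spin acquire with a timeout. It is a sketch, not Emscripten's implementation: model_lock_t, now_ms() and the model_* functions are hypothetical stand-ins, and only the call shape mirrors the emscripten_lock_busyspin_wait_acquire(&testLock, 10000) line above. What it illustrates is that the spinner can only succeed if the holder gets to run and release the lock before the deadline.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a lock word: 0 = free, 1 = held. */
typedef atomic_uint model_lock_t;

/* Wall-clock time in milliseconds (C11 timespec_get; not monotonic). */
double now_ms(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec * 1000.0 + (double)ts.tv_nsec / 1e6;
}

/* Single attempt: succeeds only if the lock word flips from 0 to 1. */
bool model_try_acquire(model_lock_t *lock) {
    assert(lock != NULL);
    unsigned expected = 0;
    return atomic_compare_exchange_strong(lock, &expected, 1u);
}

/* Spin until the lock is acquired or max_wait_ms has elapsed. */
bool model_busyspin_wait_acquire(model_lock_t *lock, double max_wait_ms) {
    const double deadline = now_ms() + max_wait_ms;
    do {
        if (model_try_acquire(lock)) return true;
    } while (now_ms() < deadline);
    return false; /* timed out: the holder never released in time */
}

void model_release(model_lock_t *lock) {
    assert(lock != NULL);
    atomic_store(lock, 0u);
}
```

In this model, if the holder's release can only run once the spinning thread yields, the loop burns the whole timeout and returns false, which would be consistent with the 10s failure described above.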

The other place the original code fails is after calling emscripten_force_exit(), where the browser can hang. I can't get a stack trace here because the browser is unresponsive. This also doesn't fail in this PR.

@cwoffenden cwoffenden changed the title from "[AUDIO_WORKLETS] Fix race condition in locks test" to "[AUDIO_WORKLETS] Move code off the main thread in locks test" Sep 16, 2025
@cwoffenden cwoffenden marked this pull request as draft September 16, 2025 13:41
@cwoffenden
Collaborator Author

cwoffenden commented Sep 16, 2025

I'm still marking the test as flaky: I've had one failure on CircleCI but more than 30'000 successes without a single failure locally. Trying again with a 1-CPU VM:

Ran 10000 tests in 7692.931s

@cwoffenden cwoffenden marked this pull request as ready for review September 16, 2025 16:47
@brendandahl
Collaborator

> I'm still marking the test as flaky: I've had one failure on CircleCI but more than 30'000 successes without a single failure locally. Trying again with a 1-CPU VM:

That could be from some other audio worklet bug; I've seen test_audio_worklet_strict and test_audio_worklet_pthreads_es6 flake recently on CircleCI.

@sbc100
Collaborator

sbc100 commented Sep 17, 2025

I've also seen plain old test_audio_worklet flake.

@juj
Collaborator

juj commented Sep 18, 2025

This flake seems to be Chrome-specific btw - the flake does not occur on Firefox. I wonder if there could be a Chrome bug or improvement possible?

@cwoffenden
Collaborator Author

I saw a few failures with the earlier code after the emscripten_force_exit() call, which I've not been able to recreate with this PR (I must've done 60'000+ runs this week with variations such as _strict, etc., and on multiple machines). I think this exit failure may be related to #25270, in that a message is pushed via the main thread but the system is mid-shutdown, so it fails.

We're not running anything from Emscripten 4 in production yet (only in development), but what we've seen for years in the logs are errors when unloading the page. Usually some timeout call or worker is still running after the page is partially unloaded.

@cwoffenden
Collaborator Author

> This flake seems to be Chrome-specific btw

If it is down to the worker's interaction with main, it might be why I see test_glgears_proxy_jstarget fail, for example:

https://app.circleci.com/pipelines/github/emscripten-core/emscripten/45690/workflows/b4d64825-b448-4839-b82b-5bf10942876e/jobs/1024231

I had this about every other run whilst trying to get all ticks for this PR.

@cwoffenden
Collaborator Author

> I've also seen plain old test_audio_worklet flake.

test_audio_worklet_post_function also fails:

https://app.circleci.com/pipelines/github/emscripten-core/emscripten/45544/workflows/d8913a61-19da-4a00-af07-21b25712c1e7/jobs/1020135

But this isn't even running any audio code; it's essentially a message-passing worker.

@brendandahl
Collaborator

#25312 seems to fix test_audio_worklet_post_function, but I don't really understand why since the worklet is not doing any processing.

@cwoffenden
Collaborator Author

cwoffenden commented Sep 19, 2025

> #25312 seems to fix test_audio_worklet_post_function

I added the same explicit shutdown here too to see how that goes.

So far the unrelated test_pthread_main_thread_blocking_join failed. For two more CI runs it worked fine.

@brendandahl
Collaborator

If you want to run more iterations faster I have a helper option --compile-once in https://github.com/brendandahl/emscripten/tree/compile-once . It still needs some work before I put it up for review, but it works for the browser audio tests.

@cwoffenden
Collaborator Author

This should be good to land, since it moves code off the main thread and adds the exit hang fix from #25312.

@cwoffenden cwoffenden marked this pull request as draft November 9, 2025 11:05
@cwoffenden
Collaborator Author

These changes are still relevant.

@cwoffenden cwoffenden marked this pull request as ready for review November 10, 2025 12:53
@cwoffenden cwoffenden requested a review from juj November 10, 2025 12:53
@sbc100
Collaborator

sbc100 commented Nov 10, 2025

This seems like it's working around a larger issue here, which is that audio worklet code cannot, in general, synchronize with the main thread.

Is that true? If so, should we not document that? It seems like a pretty huge issue TBH. Can we come up with a way to detect this kind of synchronization and warn about it? I can't think of any way sadly, since the APIs used are mostly just atomics and it's not possible to tell which thread is being interacted with.

I guess I'm ok landing this change to make the tests less flaky but we should have some plan to update the docs I think.

Would it be simpler to just say the audio worklet cannot ever block on other threads at all? It seems like that is likely the intent of the API anyway, that it should never block.
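
As an illustration of the "never block" reading, a real-time callback can make a single attempt to take a flag and fall back to its last known value when the other side holds it. The following is a minimal C11 sketch under that assumption; shared_params_t, render_try_read_gain() and process_block() are hypothetical names, not part of Emscripten or the test in this PR.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared state between a control thread and a render callback. */
typedef struct {
    atomic_flag busy; /* set while either side is touching gain */
    float gain;       /* example parameter written by the control side */
} shared_params_t;

/* Render side: one attempt only, never spins and never waits. */
bool render_try_read_gain(shared_params_t *p, float *out) {
    assert(p != NULL && out != NULL);
    if (atomic_flag_test_and_set_explicit(&p->busy, memory_order_acquire)) {
        return false; /* someone else holds it; caller keeps the old value */
    }
    *out = p->gain;
    atomic_flag_clear_explicit(&p->busy, memory_order_release);
    return true;
}

/* Applies the freshest gain available without ever blocking. */
void process_block(shared_params_t *p, float *last_gain, float *samples, int n) {
    float g;
    if (render_try_read_gain(p, &g)) {
        *last_gain = g; /* update only when the read succeeded */
    }
    for (int i = 0; i < n; i++) {
        samples[i] *= *last_gain;
    }
}
```

The trade-off in this sketch is that a parameter change may land one block late, but the callback's running time no longer depends on when another thread is scheduled.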

@sbc100
Collaborator

sbc100 commented Nov 10, 2025

(I guess the previous comment is really just about #24213)

@cwoffenden
Collaborator Author

cwoffenden commented Nov 10, 2025

I think until recently it was possible to use futexes between the main and audio thread*, but something changed in Chrome that broke this. Let me delve into the spec and see what it says (I've already forgotten the finer points since writing this), then think about how to document this.

*we shipped a product for years doing this before shifting more work off the main thread.

@sbc100
Collaborator

sbc100 commented Nov 10, 2025

> I think until recently it was possible to use futexes between the main and audio thread*, but something changed in Chrome that broke this. Let me delve into the spec and see what it says (I've already forgotten the finer points since writing this), then think about how to document this.
>
> *we shipped a product for years doing this before shifting more work off the main thread.

The spec seems pretty clear to me: "Implementations can run worklets wherever they choose (including on the main thread)." I think we (Emscripten) perhaps overlooked or misunderstood this.

@sbc100
Collaborator

sbc100 commented Nov 10, 2025

>> I think until recently it was possible to use futexes between the main and audio thread*, but something changed in Chrome that broke this. Let me delve into the spec and see what it says (I've already forgotten the finer points since writing this), then think about how to document this.
>> *we shipped a product for years doing this before shifting more work off the main thread.
>
> The spec seems pretty clear to me: "Implementations can run worklets wherever they choose (including on the main thread)." I think we (Emscripten) perhaps overlooked or misunderstood this.

Presumably this means that audio worklet work could run on any given web worker too.

@sbc100
Collaborator

sbc100 commented Nov 11, 2025

Actually I just heard back from the webaudio folks, who pointed out that the spec for AudioWorkletProcessor says: "This interface represents an audio processing code that runs on the audio rendering thread."

They also said that Chrome in particular runs it on a pool of backing threads, not the main thread.

This means that this change should not be necessary. Perhaps there is a bug in Chrome, or perhaps there is a bug in our stuff, but this seems like just sidestepping the issue rather than tracking it down.

@cwoffenden
Collaborator Author

> Actually I just heard back from webaudio folks
>
> [snip]

I'll look into it. There was a change in Chrome that seemed to break this all of a sudden.

@sbc100
Collaborator

sbc100 commented Nov 11, 2025

>> Actually I just heard back from webaudio folks
>>
>> [snip]
>
> I'll look into it. There was a change in Chrome that seemed to break this all of a sudden.

Did you notice it in your shipping code or just here in the unit tests?

@cwoffenden
Collaborator Author

> Did you notice it in your shipping code or just here in the unit tests?

Only in the unit test, but that might be a logic error fixed when going to a worker as a side effect. I'll look further into it and I'll check in with our devs that use this, because it's possible code moved off the main thread.

The unit test ended up being a test of the API rather than something to hammer the locks (ideally it'd be a queue, with multiple entries pushed from a timeout on the main thread and popped from the audio thread).
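
For reference, the queue idea could look roughly like the single-producer, single-consumer ring buffer below. This is a hedged C11 sketch, not code from the test: spsc_queue_t, QUEUE_CAPACITY and the spsc_* functions are hypothetical, and it assumes exactly one pushing thread (for example a main-thread timeout) and one popping thread (the audio callback). Neither side ever waits; a full or empty queue is reported to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 8u /* hypothetical size; must be a power of two */

typedef struct {
    int slots[QUEUE_CAPACITY];
    atomic_size_t head; /* next index to pop; written only by the consumer */
    atomic_size_t tail; /* next index to push; written only by the producer */
} spsc_queue_t;

void spsc_init(spsc_queue_t *q) {
    assert(q != NULL);
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Producer side: returns false instead of waiting when the queue is full. */
bool spsc_push(spsc_queue_t *q, int value) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QUEUE_CAPACITY) return false;
    q->slots[tail % QUEUE_CAPACITY] = value;
    /* Release publishes the slot write before the new tail becomes visible. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false instead of waiting when the queue is empty. */
bool spsc_pop(spsc_queue_t *q, int *out) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail) return false;
    *out = q->slots[head % QUEUE_CAPACITY];
    /* Release hands the slot back to the producer. */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

A test built on something like this would exercise the atomics heavily without either thread depending on the other being scheduled, which is a different property from the lock API coverage the current test provides.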

@cwoffenden
Collaborator Author

I extended the test to take a -DTEST_ON_WORKER flag to run on a worker; otherwise it runs on the main thread. The code is as close to identical as possible in both cases. On CI it still hangs on main (locally it's hard work to get it to hang; I need a Linux VM).
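
The shape of such a compile-time switch might look like the sketch below. It is an assumption-laden illustration rather than the test's code: only the TEST_ON_WORKER macro name comes from the comment above, a pthread stands in for whatever worker mechanism the real test uses, and control_loop(), run_control_side() and CONTROL_TICKS are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#ifdef TEST_ON_WORKER
#include <pthread.h> /* stand-in for the real worker mechanism */
#endif

#define CONTROL_TICKS 3 /* hypothetical number of loop iterations */

/* The same control-side body is used in both configurations. */
void *control_loop(void *arg) {
    int *ticks = arg;
    for (int i = 0; i < CONTROL_TICKS; i++) {
        (*ticks)++; /* placeholder for acquiring and releasing the test lock */
    }
    return NULL;
}

/* Runs the control side on a worker thread or inline on the calling thread. */
int run_control_side(void) {
    int ticks = 0;
#ifdef TEST_ON_WORKER
    pthread_t worker;
    if (pthread_create(&worker, NULL, control_loop, &ticks) != 0) return -1;
    pthread_join(worker, NULL);
#else
    control_loop(&ticks);
#endif
    return ticks;
}
```

Building once with -DTEST_ON_WORKER and once without keeps the control logic byte-for-byte identical, so a difference in behaviour can be attributed to which thread runs it.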

@cwoffenden
Collaborator Author

> This means that this change should not be necessary. Perhaps there is a bug in Chrome, or perhaps there is a bug in our stuff, but this seems like just sidestepping the issue rather than tracking it down.

Having spent some time with it, and having changed the code to allow switching between main and worker threads (harnessing the power of messy macros), I can't say why, but the assert will eventually fail from the audio thread in emscripten_lock_busyspin_wait_acquire() in the TEST_WAIT_ACQUIRE case. I never see it fail when spinning on the main or worker thread, nor when the main thread is replaced by a worker.

I will continue looking because I'd like an answer.

@cwoffenden cwoffenden marked this pull request as draft November 13, 2025 20:35
4 participants