Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infinite recursion with the stderr wrapper thing #844

Open
dlwh opened this issue Sep 11, 2024 · 8 comments
Open

Infinite recursion with the stderr wrapper thing #844

dlwh opened this issue Sep 11, 2024 · 8 comments

Comments

@dlwh
Copy link
Sponsor Contributor

dlwh commented Sep 11, 2024

Don't have a minimization yet, but it affects 0.11.5. Doesn't seem to be deterministic...

    for batch, tags in tqdm.tqdm(iterator, "eval"):
  File "/opt/levanter/.venv/lib/python3.10/site-packages/tqdm/std.py", line 1098, in __init__
    self.refresh(lock_args=self.lock_args)
  File "/opt/levanter/.venv/lib/python3.10/site-packages/tqdm/std.py", line 1347, in refresh
    self.display()
  File "/opt/levanter/.venv/lib/python3.10/site-packages/tqdm/std.py", line 1494, in display
    self.moveto(pos)
  File "/opt/levanter/.venv/lib/python3.10/site-packages/tqdm/std.py", line 1443, in moveto
    self.fp.write('\n' * n + _term_move_up() * -n)
  File "/opt/levanter/.venv/lib/python3.10/site-packages/tqdm/utils.py", line 196, in inner
    return func(*args, **kwargs)
  File "/opt/levanter/.venv/lib/python3.10/site-packages/equinox/_jit.py", line 162, in write
    self.stderr.write(data)
  File "/opt/levanter/.venv/lib/python3.10/site-packages/equinox/_jit.py", line 162, in write
    self.stderr.write(data)
  File "/opt/levanter/.venv/lib/python3.10/site-packages/equinox/_jit.py", line 162, in write
    self.stderr.write(data)
  [Previous line repeated 979 more times]
RecursionError: maximum recursion depth exceeded
@dlwh
Copy link
Sponsor Contributor Author

dlwh commented Sep 11, 2024

somehow i'm also getting segfaults and other memory corruption in multithreaded code in 0.11.5 and not 0.11.3 . Is there a less stompy way to suppress the error?

@patrick-kidger
Copy link
Owner

patrick-kidger commented Sep 12, 2024

Eek! That's not good.
I'm not sure how you ended up with an infinite recursion but I suppose the answer is "threading something something".

Perhaps we could arrange to replace stderr for only the main thread? That would cover the main use-case (of hiding the noisy default printout from eqx.error_if), hopefully without triggering your case here.

Absent a MWE then I'd be happy to take a PR on this, if you can identify such a fix.

@dlwh
Copy link
Sponsor Contributor Author

dlwh commented Sep 12, 2024

Mostly notes for me (or @hawkinsp said he could maybe take a look).

I have isolated it to 2188e01, which isn't surprising. The newer variant (with flush, doesn't fix it, except that it makes wandb happier)

A non-minimal, but reasonably reliable, reproducer of a crash is here: https://github.com/stanford-crfm/levanter/tree/reproduce_eqx_crash

Running bash scripts/repro_crash.sh will reproduce. It assumes a decent amount of setup of docker and such, but it's one line (after installing levanter). This "reproduction" is usually a stackless abort with no context at all. Previous versions will print the (supposedly) infinite recursion above, or some kind of memory corruption.

That branch separates the crash from tensorstore, ray, and wandb, which were my other suspects.

I'll think on the right way to address in Equinox. My suspicion is a judicious use of thread locals is in order.

@dlwh
Copy link
Sponsor Contributor Author

dlwh commented Sep 13, 2024

Hey @patrick-kidger what's the code path that makes the _FilteredStderr thing useful? I can't find any tests that fail if i turn it off or even a test that calls write to stderr inside filter_jit

@dlwh
Copy link
Sponsor Contributor Author

dlwh commented Sep 13, 2024

(in general this approach is clearly not threadsafe, though I still don't understand why it explodes.)

@patrick-kidger
Copy link
Owner

patrick-kidger commented Sep 13, 2024

All good questions! I've just poked at this more carefully and noticed that really what's going on is that JAX is making a logging.exception, which in turn writes to stderr. I've switched from filtering stderr to filtering the log, and I've added a test for this as well.

Take a look at #849 -- how does this behave for you? Notably I've not tried doing anything with threads just yet, we may need to tweak this further.

@dlwh
Copy link
Sponsor Contributor Author

dlwh commented Sep 13, 2024

OK, it's of course hard to know with multithreading issues, but it's not crashing with that branch. Thank you for the quick fix!

@patrick-kidger
Copy link
Owner

Sometimes 'not crashing' is the highest ideal we can aspire to in this industry.

Glad that works! We've picked up a few such bugfixes now so I'll do a new release shortly.

@patrick-kidger patrick-kidger mentioned this issue Sep 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants