fix: resolve three issues encountered during RL training#47
Open
Nyquist24 wants to merge 1 commit into aiming-lab:main from
Conversation
1. `compute_advantages` returns all zeros for single-sample batches (std=0 causes the normalisation to collapse); fall back to raw rewards
2. `PRMScorer._query_once` has no timeout, so a hanging API endpoint freezes the entire training loop indefinitely; add `asyncio.wait_for` with a configurable timeout (default 120s)
3. `SkillEvolver` uses a hardcoded placeholder API key as a fallback instead of raising early when `OPENAI_API_KEY` is not set
Summary
I've been running MetaClaw in RL mode (Kimi-K2.5 + GPT-5.2 as PRM). During this time I hit three independent issues that caused silent training failures: no crash, no obvious error, just the model not learning or the process hanging. Took me a while to trace each one, so I'm bundling the fixes here in case others run into the same thing.
- Fix `compute_advantages()` returning all-zero advantages for single-sample batches (std=0 edge case)
- Add a timeout to `PRMScorer._query_once()` to prevent the training loop from hanging on an unresponsive API
- Fix `SkillEvolver` so a missing `OPENAI_API_KEY` raises immediately

Problem
I ran into three separate issues while using MetaClaw for RL training on my personal setup (Kimi-K2.5 + GPT-5.2 as PRM judge, batch_size=4):
1. Single-sample batch produces zero gradient
When I set `rl.batch_size` to 1 for quick iteration during debugging, I noticed the model weights weren't updating at all. After digging into it, I found that `compute_advantages()` normalises rewards as `(r - mean) / (std + eps)`, but with a single sample, `std` is 0 and every advantage becomes ~0 regardless of the actual reward. The training step runs, datums are formatted, `forward_backward_async` is called, but the model learns nothing. The same thing happens if a batch has identical rewards (e.g. all +1).
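A minimal sketch of the edge case and the fallback behaviour (the function body here is illustrative, not the actual MetaClaw implementation; the PR description mentions both "raw" and "centred" rewards as the fallback, and this sketch uses raw rewards so identical-reward batches keep their signal):

```python
import statistics

EPS = 1e-8  # matches the (r - mean) / (std + eps) normalisation described above

def compute_advantages(rewards):
    """Normalise rewards to advantages, with a zero-std fallback (sketch)."""
    if len(rewards) == 1:
        # Single-sample batch: return the reward directly; there is
        # nothing to normalise against.
        return list(rewards)
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std < EPS:
        # Identical rewards: (r - mean) / (std + eps) would collapse every
        # advantage to ~0, so fall back to the raw rewards instead.
        return list(rewards)
    return [(r - mean) / (std + EPS) for r in rewards]
```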
Fix: When `std < 1e-8`, fall back to raw (centred) rewards instead of dividing by near-zero std. For single-sample batches, return the reward directly.

2. PRM scorer hangs indefinitely on slow API
My PRM endpoint (proxied through a corporate gateway) occasionally hangs for 5+ minutes without responding. When this happens, `_query_once` blocks inside `asyncio.to_thread()` forever; there's no timeout. The entire training loop freezes because `evaluate()` awaits all votes via `asyncio.gather`, and one hanging vote blocks everything. I had to manually kill the process twice before I traced it to this.

Fix: Wrap the `asyncio.to_thread` call with `asyncio.wait_for(timeout=self.timeout)`. Default timeout is 120s, configurable via the new `timeout` parameter. On timeout, the vote returns `None` (same as any other failure) and majority voting proceeds with the remaining votes.
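The timeout wrapper can be sketched like this (class and attribute names follow the PR description; the synchronous query function is a stand-in for the real API call):

```python
import asyncio

class PRMScorer:
    def __init__(self, query_fn, timeout: float = 120.0):
        self.query_fn = query_fn  # synchronous API call (stand-in)
        self.timeout = timeout    # new configurable timeout, default 120s

    async def _query_once(self, prompt):
        try:
            # Run the blocking call in a thread, but bound its wall-clock time.
            return await asyncio.wait_for(
                asyncio.to_thread(self.query_fn, prompt),
                timeout=self.timeout,
            )
        except asyncio.TimeoutError:
            # A hung vote is treated like any other failed vote: majority
            # voting in evaluate() proceeds with the remaining votes.
            return None
```

Note that `asyncio.wait_for` cancels the awaiting task but cannot interrupt the underlying thread, so a truly hung request still occupies a worker thread until it returns; the training loop, however, moves on.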
3. SkillEvolver silently uses a hardcoded placeholder API key
When I first enabled skill evolution, I forgot to set `OPENAI_API_KEY`. Instead of getting a clear error, the evolver used a hardcoded fallback key (`"aB7cD9eF2gH5iJ8kL1mN4oP6qR3sT0uV"`) and sent real requests to the API, which returned cryptic 401 errors buried in the logs. It took me a while to realise the issue was just a missing env var. The `if not api_key` check on line 90 never triggers because the default is non-empty.
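The corrected guard amounts to the following (a sketch with an assumed helper name, not the actual `SkillEvolver` code):

```python
import os

def load_api_key() -> str:
    # Default to "" (not a placeholder key) so a missing env var fails fast.
    api_key = os.environ.get("OPENAI_API_KEY", "")
    if not api_key:
        # Mirrors the existing EnvironmentError guard, which now fires
        # because the default is empty.
        raise EnvironmentError(
            "OPENAI_API_KEY is not set; export it before enabling skill evolution."
        )
    return api_key
```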
Fix: Change the default to `""` so the existing `EnvironmentError` on line 91 fires immediately with a clear message.

Changes
- `metaclaw/data_formatter.py`: handle zero-std edge case in `compute_advantages()`
- `metaclaw/prm_scorer.py`: add configurable timeout to `PRMScorer._query_once()`
- `metaclaw/skill_evolver.py`: remove hardcoded placeholder API key fallback

Test
- Pointed `prm_url` at a non-responsive endpoint: training loop no longer hangs, logs a timeout warning and continues
- Missing `OPENAI_API_KEY` now raises `EnvironmentError` immediately on `SkillEvolver` init