Fix BSH action scheme and PBR reward for correct trading behavior #485
Closed
rhamnett wants to merge 2 commits into
Conversation
BSH had a flawed action scheme (`Discrete(2)` toggle) and PBR had two critical bugs causing the agent to never learn profitable strategies:

1. BSH used a toggle (0/1) instead of explicit hold/buy/sell actions, making it impossible to hold a position without toggling state
2. PBR position=-1 after sell created a fake short signal that punished the agent for being safe in cash during rallies
3. PBR was completely blind to commission costs: identical reward signal whether commission was 0% or 100%

Fixes:

- BSH: `Discrete(3)` with 0=hold, 1=buy, 2=sell + position tracking
- PBR position: sell to 0 (cash/flat) instead of -1 (fake short)
- PBR init/reset: position=0 instead of -1
- Commission penalty: price × commission on trades, scaled to match the price-based reward units so the agent correctly learns trade costs
- AdvancedPBR: same position and commission fixes
- Tests: updated for `Discrete(3)` semantics, added commission penalty test

Verified: 270 tests pass; the reward signal has correct directional alignment with actual returns across all commission levels and trading strategies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
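The commission-aware reward described above can be sketched as a hypothetical simplification (the `pbr_reward` function name and signature are illustrative, not the library API):

```python
def pbr_reward(position: int, price_diff: float,
               traded: bool, price: float, commission: float) -> float:
    """Hypothetical sketch: position-scaled price change minus a
    commission penalty expressed in the same (price) units."""
    penalty = price * commission if traded else 0.0
    return position * price_diff - penalty

# Flat price, one buy at $100 with 0.3% commission: reward is about -0.3,
# so the agent needs roughly a $0.30 price rise to break even on the trade.
print(round(pbr_reward(1, 0.0, True, 100.0, 0.003), 6))  # -0.3

# In cash (position 0) with no trade: price moves produce zero reward.
print(pbr_reward(0, 5.0, False, 100.0, 0.003))  # 0.0
```

Keeping the penalty in price units means it subtracts directly from the `position × price_diff` term rather than needing a separate scaling factor.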
… implementation available. Set BSH3 as default action scheme used when running tests
Force-pushed e1e0ef3 to b708c37
Collaborator

I kept BSH as the original one and renamed the "fixed" BSH to BSH3. I also updated the tests to reference BSH3.
## Summary
The BSH (Buy/Sell/Hold) action scheme and PBR (Position-Based Return) reward scheme had fundamental bugs that made it impossible for an RL agent to learn a profitable trading strategy. These have likely been present since the original implementation.
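As a rough illustration of the fixed action semantics, here is a hypothetical sketch (class and method names are invented, not the library code):

```python
from enum import IntEnum

class BSHAction(IntEnum):
    HOLD = 0
    BUY = 1
    SELL = 2

class BSH3Sketch:
    """Hypothetical sketch of the Discrete(3) scheme: explicit
    hold/buy/sell with a _position flag (0 = cash, 1 = asset)."""

    def __init__(self) -> None:
        self._position = 0  # start flat, in cash

    def step(self, action: int) -> str:
        if action == BSHAction.BUY and self._position == 0:
            self._position = 1
            return "buy"    # cash -> asset
        if action == BSHAction.SELL and self._position == 1:
            self._position = 0
            return "sell"   # asset -> cash
        return "hold"       # HOLD, or a no-op buy/sell

scheme = BSH3Sketch()
print(scheme.step(BSHAction.BUY))   # buy
print(scheme.step(BSHAction.BUY))   # hold (already holding, no-op)
print(scheme.step(BSHAction.HOLD))  # hold
print(scheme.step(BSHAction.SELL))  # sell
```

Unlike a toggle, holding a position here simply means emitting `HOLD` (or a redundant `BUY`/`SELL`), so the agent never changes state by accident.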
### Bug 1: BSH used a binary toggle instead of explicit actions

The old BSH action space was `Discrete(2)` with a toggle mechanism: actions 0 and 1 swapped between holding cash and holding the asset whenever `abs(action - self.action) > 0`. This made it impossible to hold a position without toggling state.

**Fix:** BSH now uses `Discrete(3)` with explicit semantics:

- `0` = Hold (do nothing)
- `1` = Buy (cash → asset, no-op if already holding)
- `2` = Sell (asset → cash, no-op if already in cash)

Position is tracked via `_position` (0 = cash, 1 = asset), with balance checks before executing orders.

### Bug 2: PBR position=-1 after sell created a phantom short signal
This was the critical bug. The old PBR set `position = -1` after a sell action. The PBR reward formula is `reward = position × price_diff`. With `position = -1` after selling, the agent earned `-1 × price_diff`: rewarded when the price fell and punished when it rose, even though it was simply sitting in cash. This effectively modeled a short position, but BSH is a long-only system. The agent can only hold cash or hold the asset; it cannot short. The -1 position was a phantom signal that punished the agent for being safely in cash during rallies.

**Fix:** PBR now uses `position = 0` after a sell (cash/flat). The reward becomes:

- In the asset: `1 × price_diff`, so the agent participates in price moves
- In cash: `0 × price_diff = 0`, neutral, since the agent is simply not in the market

This correctly models long-only PnL attribution: `daily_PnL = position_size × price_change`.

### Bug 3: PBR was completely blind to commission costs
The old PBR had no commission awareness whatsoever. Whether the exchange charged 0% or 100% commission, the agent received identical reward signals, so it could never learn that every trade has a cost.

**Fix:** Trades now incur a penalty of `price × commission_rate`, keeping the penalty in the same units as the price-based reward.

Example: buying at price $100 with 0.3% commission costs `100 × 0.003 = 0.3` in reward, meaning the agent needs a $0.30 price rise just to break even on the trade, exactly matching reality.

## What Changed
- `tensortrade/env/default/actions.py`: `Discrete(2)` toggle → `Discrete(3)` with hold/buy/sell + `_position` tracking
- `tensortrade/env/default/rewards.py`: PBR and AdvancedPBR position and commission fixes, `_traded` flag
- `tests/.../test_actions.py`: updated for `Discrete(3)` semantics
- `tests/.../test_checkpoint.py`: `Discrete(3)` action space
- `tests/.../test_env_creation.py`: `Discrete(3)` action space

## Verification
Before fix (position=-1 after sell):
After fix (position=0 after sell):
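The directional effect of the position fix can be illustrated with simple arithmetic (a hypothetical sketch using the `position × price_diff` formula, not actual test output):

```python
# Prices rally while the agent sits in cash after selling.
prices = [100.0, 102.0, 105.0, 110.0]
diffs = [b - a for a, b in zip(prices, prices[1:])]  # [2.0, 3.0, 5.0]

# Old behavior: position = -1 after sell (phantom short).
old_rewards = [-1 * d for d in diffs]
# New behavior: position = 0 after sell (flat/cash).
new_rewards = [0 * d for d in diffs]

print(sum(old_rewards))  # -10.0: punished for safely holding cash in a rally
print(sum(new_rewards))  # 0.0: neutral, correctly models being out of market
```

With the fix, only time actually spent in the asset contributes to the reward, so reward and realized return move in the same direction.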
## Test plan
- `pytest tests/`: 270 pass, 2 skip
- `pytest tests/tensortrade/unit/env/default/test_actions.py`: 21 pass

🤖 Generated with Claude Code