
Fix BSH action scheme and PBR reward for correct trading behavior #485

Closed

rhamnett wants to merge 2 commits into master from fix/pbr-reward-position

Fix BSH action scheme and PBR reward for correct trading behavior#485
rhamnett wants to merge 2 commits into
masterfrom
fix/pbr-reward-position

Conversation

rhamnett (Contributor) commented Feb 7, 2026

Summary

The BSH (Buy/Sell/Hold) action scheme and PBR (Position-Based Return) reward scheme had fundamental bugs that made it impossible for an RL agent to learn a profitable trading strategy. These have likely been present since the original implementation.

Bug 1: BSH used a binary toggle instead of explicit actions

The old BSH action space was Discrete(2) with a toggle mechanism: whenever the incoming action differed from the stored one (abs(action - self.action) > 0), the scheme swapped between holding cash and holding asset. This meant:

  • There was no "hold" action — every step forced a potential state change
  • The mapping between action values and trading intent was implicit and confusing
  • The agent couldn't express "do nothing" without the toggle logic accidentally triggering trades
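For reference, the old toggle behavior can be sketched roughly as follows. This is a minimal illustration of the described logic, not the actual TensorTrade code; `toggle_step` and its signature are hypothetical:

```python
# Minimal sketch of the old Discrete(2) toggle described above.
# `current` is the side currently held (0 = cash, 1 = asset); any
# action that differs from it flips the position, so "hold" can only
# be expressed by repeating the previous action.
def toggle_step(current, action):
    """Return (new_position, traded) under the old toggle semantics."""
    if abs(action - current) > 0:  # action != current -> swap sides
        return action, True        # trade: cash <-> asset
    return current, False          # repeated action: no trade

assert toggle_step(0, 1) == (1, True)   # flips into the asset
assert toggle_step(1, 1) == (1, False)  # "hold" only by repetition
```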

Fix: BSH now uses Discrete(3) with explicit semantics:

  • 0 = Hold (do nothing)
  • 1 = Buy (cash → asset, no-op if already holding)
  • 2 = Sell (asset → cash, no-op if already in cash)

The position is tracked via _position (0=cash, 1=asset), with balance checks before executing orders.
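A minimal sketch of the fixed semantics (illustrative only; the class and method names here are hypothetical, and the real scheme in tensortrade/env/default/actions.py issues actual orders):

```python
# Hypothetical sketch of the Discrete(3) hold/buy/sell semantics with
# _position tracking; redundant actions are no-ops, as described above.
HOLD, BUY, SELL = 0, 1, 2

class BSH3Sketch:
    def __init__(self):
        self._position = 0  # 0 = cash, 1 = asset

    def step(self, action):
        """Apply one action; return True if a trade was executed."""
        if action == BUY and self._position == 0:
            self._position = 1   # cash -> asset
            return True
        if action == SELL and self._position == 1:
            self._position = 0   # asset -> cash
            return True
        return False             # HOLD, or a redundant buy/sell
```

Buying while already long, or selling while already in cash, simply does nothing instead of accidentally toggling state.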

Bug 2: PBR position=-1 after sell created a phantom short signal

This was the critical bug. The old PBR set position = -1 after a sell action. The PBR reward formula is:

reward = position × (price_t - price_{t-1})

With position = -1 after selling:

  • During a price rally: reward = (-1) × (positive) = negative reward for being in cash
  • During a price drop: reward = (-1) × (negative) = positive reward for being in cash

This effectively modeled a short position — but BSH is a long-only system: the agent can only hold cash or hold the asset; it cannot short. The -1 position was a phantom signal that:

  1. Punished the agent for selling before rallies — teaching it "never sell, even to take profits"
  2. Rewarded the agent for selling before drops — but with a fake signal disconnected from actual PnL
  3. Made a +13% profitable strategy show negative cumulative reward

Fix: PBR now uses position = 0 after sell (cash/flat). The reward becomes:

  • Long (position=1): 1 × price_diff — agent participates in price moves
  • Cash (position=0): 0 × price_diff = 0 — neutral, agent is simply not in the market

This correctly models long-only PnL attribution: daily_PnL = position_size × price_change.
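In code form, the fixed long-only reward reduces to the following sketch of the formula above (not the library implementation):

```python
# Long-only PBR reward after the fix: position is 1 (long) or 0 (cash),
# so periods spent in cash contribute exactly zero reward.
def pbr_reward(position, price_now, price_prev):
    return position * (price_now - price_prev)

assert pbr_reward(1, 110.0, 100.0) == 10.0  # long in a rally: +10
assert pbr_reward(0, 115.0, 110.0) == 0.0   # in cash: neutral, not -5
```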

Bug 3: PBR was completely blind to commission costs

The old PBR had no commission awareness whatsoever. Whether the exchange charged 0% or 100% commission, the agent received identical reward signals. This meant:

  • The agent couldn't learn that trading has a cost
  • Excessive churning (rapid buy/sell cycles) was never penalized
  • Strategies that looked profitable in reward space were actually unprofitable after fees

Fix: Trades now incur a penalty of price × commission_rate, keeping the penalty in the same units as the price-based reward:

if self._traded:
    reward -= current_price * self.commission

Example: buying at price $100 with 0.3% commission costs 100 × 0.003 = 0.3 in reward, meaning the agent needs a $0.30 price rise just to break even on the trade — exactly matching reality.
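The break-even arithmetic can be checked with a small sketch (`step_reward` is a hypothetical helper combining the position reward with the penalty shown above):

```python
# One-step reward with the commission penalty described above.
def step_reward(position, price_now, price_prev, traded, commission):
    reward = position * (price_now - price_prev)
    if traded:
        reward -= price_now * commission  # penalty in price units
    return reward

# Buy at $100 with 0.3% commission, then the price rises $0.30:
buy = step_reward(1, 100.0, 100.0, traded=True, commission=0.003)
rise = step_reward(1, 100.3, 100.0, traded=False, commission=0.003)
assert abs(buy + 0.3) < 1e-9   # the trade itself costs 0.3 reward
assert abs(buy + rise) < 1e-9  # a $0.30 rise exactly breaks even
```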

What Changed

  • tensortrade/env/default/actions.py: BSH: Discrete(2) toggle → Discrete(3) with hold/buy/sell + _position tracking
  • tensortrade/env/default/rewards.py: PBR: position=0 after sell (not -1), commission penalty, _traded flag
  • tensortrade/env/default/rewards.py: AdvancedPBR: same position and commission fixes
  • tests/.../test_actions.py: 21 new tests covering BSH actions, PBR position semantics, commission penalty, and regression guards
  • tests/.../test_checkpoint.py: updated for Discrete(3) action space
  • tests/.../test_env_creation.py: updated for Discrete(3) action space

Verification

Before fix (position=-1 after sell):

Buy at 100, hold to 110 (+10%), sell to cash
Price continues to 115...
Cash hold reward = (-1) × (115-110) = -5  ← WRONG: punished for being safe!

After fix (position=0 after sell):

Buy at 100, hold to 110 (+10%), sell to cash  
Price continues to 115...
Cash hold reward = (0) × (115-110) = 0   ← CORRECT: neutral, not participating

  • All 270 existing tests pass (2 skipped as expected)
  • 21 new tests added, all passing
  • Reward signal verified to have correct directional alignment across all commission levels and trading strategies
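The before/after numbers can be replayed end to end with an illustrative script, assuming the simple position-times-price-diff formula described above:

```python
# Replay the scenario: buy at 100, price goes to 110, sell to cash,
# price continues to 115. Per-step positions under the old scheme
# (-1 after sell) vs the fixed scheme (0 after sell); commission is
# ignored here to isolate the position bug.
prices = [100.0, 110.0, 115.0]
old_positions = [1, -1]  # long, then phantom short after selling
new_positions = [1, 0]   # long, then flat (cash) after selling

def cumulative_reward(positions, prices):
    return sum(p * (prices[i + 1] - prices[i])
               for i, p in enumerate(positions))

assert cumulative_reward(old_positions, prices) == 5.0   # +10 rally, -5 phantom
assert cumulative_reward(new_positions, prices) == 10.0  # matches actual PnL
```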

Test plan

  • pytest tests/ — 270 pass, 2 skip
  • pytest tests/tensortrade/unit/env/default/test_actions.py — 21 pass
  • Regression test: cash position gives exactly 0 reward during rallies (not negative)
  • Regression test: long position in rising market gives positive reward
  • Commission penalty scales correctly with price

🤖 Generated with Claude Code

rhamnett and others added 2 commits February 7, 2026 16:59
BSH had a flawed action scheme (Discrete(2) toggle) and PBR had two
critical bugs causing the agent to never learn profitable strategies:

1. BSH used a toggle (0/1) instead of explicit hold/buy/sell actions,
   making it impossible to hold a position without toggling state

2. PBR position=-1 after sell created a fake short signal that punished
   the agent for being safe in cash during rallies

3. PBR was completely blind to commission costs — identical reward signal
   whether commission was 0% or 100%

Fixes:
- BSH: Discrete(3) with 0=hold, 1=buy, 2=sell + position tracking
- PBR position: sell to 0 (cash/flat) instead of -1 (fake short)
- PBR init/reset: position=0 instead of -1
- Commission penalty: price x commission on trades, scaled to match the
  price-based reward units so the agent correctly learns trade costs
- AdvancedPBR: same position and commission fixes
- Tests: updated for Discrete(3) semantics, added commission penalty test

Verified: 270 tests pass, reward signal has correct directional alignment
with actual returns across all commission levels and trading strategies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… implementation available.

Set BSH3 as default action scheme used when running tests
carlogrisetti force-pushed the fix/pbr-reward-position branch from e1e0ef3 to b708c37 on February 8, 2026 at 01:41
carlogrisetti (Collaborator) left a comment:


Looks good to me

carlogrisetti (Collaborator) commented:
I kept BSH as the original one and renamed the "fixed" BSH to BSH3. I also updated the tests to reference BSH3.

rhamnett closed this Feb 13, 2026