
Fix BSH action scheme and PBR reward for correct trading behavior #485

Closed

rhamnett wants to merge 2 commits into master from fix/pbr-reward-position

Fix BSH action scheme and PBR reward for correct trading behavior#485
rhamnett wants to merge 2 commits into
masterfrom
fix/pbr-reward-position

Conversation

rhamnett (Contributor) commented Feb 7, 2026

Summary

The BSH (Buy/Sell/Hold) action scheme and PBR (Position-Based Return) reward scheme had fundamental bugs that made it impossible for an RL agent to learn a profitable trading strategy. These have likely been present since the original implementation.

Bug 1: BSH used a binary toggle instead of explicit actions

The old BSH action space was Discrete(2) with a toggle mechanism: whenever the incoming action differed from the stored one (abs(action - self.action) > 0), the scheme swapped between holding cash and holding asset. This meant:

  • There was no "hold" action — every step forced a potential state change
  • The mapping between action values and trading intent was implicit and confusing
  • The agent couldn't express "do nothing" without the toggle logic accidentally triggering trades
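For reference, the old toggle behavior can be sketched roughly as follows. This is a minimal illustration of the described logic, not the actual TensorTrade code; `toggle_step` and its signature are hypothetical:

```python
# Minimal sketch of the old Discrete(2) toggle described above.
# `current` is the side currently held (0 = cash, 1 = asset); any
# action that differs from it flips the position, so "hold" can only
# be expressed by repeating the previous action.
def toggle_step(current, action):
    """Return (new_position, traded) under the old toggle semantics."""
    if abs(action - current) > 0:  # action != current -> swap sides
        return action, True        # trade: cash <-> asset
    return current, False          # repeated action: no trade

assert toggle_step(0, 1) == (1, True)   # flips into the asset
assert toggle_step(1, 1) == (1, False)  # "hold" only by repetition
```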

Fix: BSH now uses Discrete(3) with explicit semantics:

  • 0 = Hold (do nothing)
  • 1 = Buy (cash → asset, no-op if already holding)
  • 2 = Sell (asset → cash, no-op if already in cash)

The position is tracked via _position (0=cash, 1=asset), with balance checks before executing orders.
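A minimal sketch of the fixed semantics (illustrative only; the class and method names here are hypothetical, and the real scheme in tensortrade/env/default/actions.py issues actual orders):

```python
# Hypothetical sketch of the Discrete(3) hold/buy/sell semantics with
# _position tracking; redundant actions are no-ops, as described above.
HOLD, BUY, SELL = 0, 1, 2

class BSH3Sketch:
    def __init__(self):
        self._position = 0  # 0 = cash, 1 = asset

    def step(self, action):
        """Apply one action; return True if a trade was executed."""
        if action == BUY and self._position == 0:
            self._position = 1   # cash -> asset
            return True
        if action == SELL and self._position == 1:
            self._position = 0   # asset -> cash
            return True
        return False             # HOLD, or a redundant buy/sell
```

Buying while already long, or selling while already in cash, simply does nothing instead of accidentally toggling state.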

Bug 2: PBR position=-1 after sell created a phantom short signal

This was the critical bug. The old PBR set position = -1 after a sell action. The PBR reward formula is:

reward = position × (price_t - price_{t-1})

With position = -1 after selling:

  • During a price rally: reward = (-1) × (positive) = negative reward for being in cash
  • During a price drop: reward = (-1) × (negative) = positive reward for being in cash

This effectively modeled a short position — but BSH is a long-only system: the agent can only hold cash or hold the asset; it cannot short. The -1 position was a phantom signal that:

  1. Punished the agent for selling before rallies — teaching it "never sell, even to take profits"
  2. Rewarded the agent for selling before drops — but with a fake signal disconnected from actual PnL
  3. Made a +13% profitable strategy show negative cumulative reward

Fix: PBR now uses position = 0 after sell (cash/flat). The reward becomes:

  • Long (position=1): 1 × price_diff — agent participates in price moves
  • Cash (position=0): 0 × price_diff = 0 — neutral, agent is simply not in the market

This correctly models long-only PnL attribution: daily_PnL = position_size × price_change.
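In code form, the fixed long-only reward reduces to the following sketch of the formula above (not the library implementation):

```python
# Long-only PBR reward after the fix: position is 1 (long) or 0 (cash),
# so periods spent in cash contribute exactly zero reward.
def pbr_reward(position, price_now, price_prev):
    return position * (price_now - price_prev)

assert pbr_reward(1, 110.0, 100.0) == 10.0  # long in a rally: +10
assert pbr_reward(0, 115.0, 110.0) == 0.0   # in cash: neutral, not -5
```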

Bug 3: PBR was completely blind to commission costs

The old PBR had no commission awareness whatsoever. Whether the exchange charged 0% or 100% commission, the agent received identical reward signals. This meant:

  • The agent couldn't learn that trading has a cost
  • Excessive churning (rapid buy/sell cycles) was never penalized
  • Strategies that looked profitable in reward space were actually unprofitable after fees

Fix: Trades now incur a penalty of price × commission_rate, keeping the penalty in the same units as the price-based reward:

if self._traded:
    reward -= current_price * self.commission

Example: buying at price $100 with 0.3% commission costs 100 × 0.003 = 0.3 in reward, meaning the agent needs a $0.30 price rise just to break even on the trade — exactly matching reality.
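The break-even arithmetic can be checked with a small sketch (`step_reward` is a hypothetical helper combining the position reward with the penalty shown above):

```python
# One-step reward with the commission penalty described above.
def step_reward(position, price_now, price_prev, traded, commission):
    reward = position * (price_now - price_prev)
    if traded:
        reward -= price_now * commission  # penalty in price units
    return reward

# Buy at $100 with 0.3% commission, then the price rises $0.30:
buy = step_reward(1, 100.0, 100.0, traded=True, commission=0.003)
rise = step_reward(1, 100.3, 100.0, traded=False, commission=0.003)
assert abs(buy + 0.3) < 1e-9   # the trade itself costs 0.3 reward
assert abs(buy + rise) < 1e-9  # a $0.30 rise exactly breaks even
```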

What Changed

  • tensortrade/env/default/actions.py: BSH: Discrete(2) toggle → Discrete(3) with hold/buy/sell + _position tracking
  • tensortrade/env/default/rewards.py: PBR: position=0 after sell (not -1), commission penalty, _traded flag
  • tensortrade/env/default/rewards.py: AdvancedPBR: same position and commission fixes
  • tests/.../test_actions.py: 21 new tests covering BSH actions, PBR position semantics, commission penalty, and regression guards
  • tests/.../test_checkpoint.py: updated for Discrete(3) action space
  • tests/.../test_env_creation.py: updated for Discrete(3) action space

Verification

Before fix (position=-1 after sell):

Buy at 100, hold to 110 (+10%), sell to cash
Price continues to 115...
Cash hold reward = (-1) × (115-110) = -5  ← WRONG: punished for being safe!

After fix (position=0 after sell):

Buy at 100, hold to 110 (+10%), sell to cash  
Price continues to 115...
Cash hold reward = (0) × (115-110) = 0   ← CORRECT: neutral, not participating

  • All 270 existing tests pass (2 skipped as expected)
  • 21 new tests added, all passing
  • Reward signal verified to have correct directional alignment across all commission levels and trading strategies
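The before/after numbers can be replayed end to end with an illustrative script, assuming the simple position-times-price-diff formula described above:

```python
# Replay the scenario: buy at 100, price goes to 110, sell to cash,
# price continues to 115. Per-step positions under the old scheme
# (-1 after sell) vs the fixed scheme (0 after sell); commission is
# ignored here to isolate the position bug.
prices = [100.0, 110.0, 115.0]
old_positions = [1, -1]  # long, then phantom short after selling
new_positions = [1, 0]   # long, then flat (cash) after selling

def cumulative_reward(positions, prices):
    return sum(p * (prices[i + 1] - prices[i])
               for i, p in enumerate(positions))

assert cumulative_reward(old_positions, prices) == 5.0   # +10 rally, -5 phantom
assert cumulative_reward(new_positions, prices) == 10.0  # matches actual PnL
```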

Test plan

  • pytest tests/ — 270 pass, 2 skip
  • pytest tests/tensortrade/unit/env/default/test_actions.py — 21 pass
  • Regression test: cash position gives exactly 0 reward during rallies (not negative)
  • Regression test: long position in rising market gives positive reward
  • Commission penalty scales correctly with price

🤖 Generated with Claude Code

rhamnett and others added 2 commits February 7, 2026 16:59
BSH had a flawed action scheme (Discrete(2) toggle) and PBR had two
critical bugs causing the agent to never learn profitable strategies:

1. BSH used a toggle (0/1) instead of explicit hold/buy/sell actions,
   making it impossible to hold a position without toggling state

2. PBR position=-1 after sell created a fake short signal that punished
   the agent for being safe in cash during rallies

3. PBR was completely blind to commission costs — identical reward signal
   whether commission was 0% or 100%

Fixes:
- BSH: Discrete(3) with 0=hold, 1=buy, 2=sell + position tracking
- PBR position: sell to 0 (cash/flat) instead of -1 (fake short)
- PBR init/reset: position=0 instead of -1
- Commission penalty: price x commission on trades, scaled to match the
  price-based reward units so the agent correctly learns trade costs
- AdvancedPBR: same position and commission fixes
- Tests: updated for Discrete(3) semantics, added commission penalty test

Verified: 270 tests pass, reward signal has correct directional alignment
with actual returns across all commission levels and trading strategies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… implementation available.

Set BSH3 as default action scheme used when running tests
carlogrisetti force-pushed the fix/pbr-reward-position branch from e1e0ef3 to b708c37 on February 8, 2026 at 01:41
carlogrisetti (Collaborator) left a comment:


Looks good to me

carlogrisetti (Collaborator) commented:
I kept BSH as the original one and renamed the "fixed" BSH to BSH3. I also updated the tests to reference BSH3.

rhamnett closed this Feb 13, 2026