
Add GTX 1060 (Pascal sm_61) support#321

Open
raj1003 wants to merge 2 commits into karpathy:master from raj1003:gtx1060-support

Conversation


@raj1003 raj1003 commented Mar 18, 2026

  • Replace Flash Attention 3 with PyTorch SDPA (F.scaled_dot_product_attention)
  • Remove kernels dependency
  • Downgrade PyTorch 2.9.1 -> 2.5.1+cu121 (last version supporting sm_61)
  • Switch bfloat16 -> float16 (Pascal lacks bf16 support)
  • Disable torch.compile (Triton requires sm_70+)
  • Rewrite optimizer steps for eager mode with float32 master weights
  • Reduce model size for 6GB VRAM: DEPTH 4, DEVICE_BATCH_SIZE 8, TOTAL_BATCH_SIZE 32K, MAX_SEQ_LEN 512, WINDOW_PATTERN L
  • Reduce EVAL_TOKENS for faster validation

Verified: trains to completion in ~6min, 985MB VRAM, val_bpb ~3.21
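The core substitution in the first bullet can be sketched as follows: PyTorch's built-in `F.scaled_dot_product_attention` dispatches to whatever kernel the hardware supports (math/mem-efficient fallbacks on Pascal), so it can stand in for Flash Attention 3 without a separate `kernels` dependency. This is an illustrative sketch, not the PR's exact code; the reference implementation is included only to show what SDPA computes.

```python
import math
import torch
import torch.nn.functional as F

def sdpa_attention(q, k, v):
    # Causal self-attention via PyTorch SDPA. On sm_61 hardware this falls
    # back to the math/mem-efficient backends rather than Flash Attention.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

def manual_attention(q, k, v):
    # Naive reference: softmax(QK^T / sqrt(d)) V with a causal mask.
    T, d = q.size(-2), q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return scores.softmax(dim=-1) @ v
```

Because SDPA matches the naive formula numerically, swapping it in changes only the kernel, not the model.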

raj1003 added 2 commits March 17, 2026 20:49
- Replace Flash Attention 3 with PyTorch SDPA (F.scaled_dot_product_attention)
- Remove kernels dependency
- Downgrade PyTorch 2.9.1 -> 2.5.1+cu121 (last version supporting sm_61)
- Switch bfloat16 -> float16 (Pascal lacks bf16 support)
- Disable torch.compile (Triton requires sm_70+)
- Rewrite optimizer steps for eager mode with float32 master weights
- Reduce model size for 6GB VRAM: DEPTH 4, DEVICE_BATCH_SIZE 8,
  TOTAL_BATCH_SIZE 32K, MAX_SEQ_LEN 512, WINDOW_PATTERN L
- Reduce EVAL_TOKENS for faster validation

Verified: trains to completion in ~6min, 985MB VRAM, val_bpb ~3.21
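The "float32 master weights" bullet refers to a standard mixed-precision pattern: the model's parameters stay in float16, but each update is accumulated in a float32 shadow copy so small gradient steps don't underflow at half precision. A minimal SGD-style sketch (the helper name and learning rate are illustrative, not the PR's actual optimizer code):

```python
import torch

def fp16_master_step(params_fp16, masters_fp32, lr=0.1):
    """One eager-mode update with float32 master weights (hypothetical helper).

    params_fp16: the fp16 tensors the model computes with.
    masters_fp32: matching fp32 copies that hold the authoritative values.
    """
    with torch.no_grad():
        for p16, p32 in zip(params_fp16, masters_fp32):
            if p16.grad is None:
                continue
            # Accumulate the update in float32, then write back to fp16.
            p32 -= lr * p16.grad.float()
            p16.copy_(p32.half())
```

Running compiled optimizer steps is what Triton handles on sm_70+; on Pascal this eager loop does the same arithmetic without `torch.compile`.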
- Add detect_gpu_config() that queries GPU compute capability and VRAM
- 4 tiers: high (H100/A100 40GB+), mid-high (RTX 4090/3090 16GB+),
  mid (RTX 3070/2080 8GB+), low (GTX 1060 <8GB)
- Auto-select: attention backend (FA3 vs SDPA), dtype (bf16 vs fp16),
  torch.compile (sm_70+ only), model depth, batch sizes, seq length
- FA3 restored as optional dep for Ampere/Hopper GPUs
- prepare.py reads MAX_SEQ_LEN/EVAL_TOKENS from env vars set by train.py
- Optimizer supports both compiled (CPU tensor) and eager (float32) paths
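The four-tier auto-selection described above could be structured roughly like this: query compute capability and VRAM once, then map them to a tier that drives the attention backend, dtype, and compile choices. The tier thresholds and function names here are assumptions mirroring the commit message, not the PR's exact code.

```python
def pick_tier(sm_major: int, vram_gb: float) -> str:
    # Illustrative thresholds following the PR's four tiers.
    if sm_major >= 8 and vram_gb >= 40:
        return "high"      # H100/A100 40GB+: FA3, bf16, torch.compile
    if sm_major >= 8 and vram_gb >= 16:
        return "mid-high"  # RTX 4090/3090
    if sm_major >= 7 and vram_gb >= 8:
        return "mid"       # RTX 3070/2080
    return "low"           # GTX 1060-class: SDPA, fp16, eager mode

def detect_gpu_config() -> str:
    # Lazy import so the pure tier logic is testable without CUDA.
    import torch
    if not torch.cuda.is_available():
        return "low"
    major, _minor = torch.cuda.get_device_capability(0)
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 2**30
    return pick_tier(major, vram_gb)
```

A GTX 1060 (sm_61, 6GB) lands in the "low" tier, which is what selects the SDPA/fp16/eager path from the first commit.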