This repository contains the code for the experiments in *Emergent Linear Representations in World Models of Self-Supervised Sequence Models*.
How do sequence models represent their decision-making process? Prior work suggests that Othello-playing neural networks learned nonlinear models of the board state (Li et al., 2023a). In this work, we provide evidence of a closely related linear representation of the board. In particular, we show that probing for “my colour” vs. “opponent’s colour” may be a simple yet powerful way to interpret the model’s internal state. This precise understanding of the internal representations allows us to control the model’s behaviour with simple vector arithmetic. Linear representations enable significant interpretability progress, which we demonstrate with further exploration of how the world model is computed.
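To make the probing idea concrete, here is a minimal, self-contained sketch of a per-square linear probe trained on cached residual-stream activations. It is illustrative only, not the repo's implementation (see `mech_int/board_probe.py` for that); the hidden size, the three-class (empty / mine / theirs) labelling, and all variable names are assumptions.

```python
# Hypothetical sketch of a "my colour vs. opponent's colour" linear probe.
# Assumes cached activations of shape (batch, d_model) from some layer and
# integer labels of shape (batch, 64) with classes {0: empty, 1: mine, 2: theirs}.
import torch
import torch.nn as nn

d_model, n_squares, n_classes = 512, 64, 3  # d_model is an assumption

# One linear map per board square: activation -> (empty / mine / theirs) logits.
probe = nn.Linear(d_model, n_squares * n_classes)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def probe_step(activations: torch.Tensor, labels: torch.Tensor) -> float:
    """One training step; activations: (B, d_model), labels: (B, 64)."""
    logits = probe(activations).view(-1, n_squares, n_classes)
    loss = loss_fn(logits.flatten(0, 1), labels.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```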
To set up the environment:

```bash
conda env create -f environment.yml
conda activate mech_int_othello
```
Download the relevant data:
- OthelloGPT: We only analyze the synthetic model. Save this checkpoint to the root directory of this repo.
- Sequence data: Refer to the code here to generate training and validation data. Keep this data in `./data` (a rough loading sketch follows this list).
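The on-disk format is determined by the generation code linked above; purely as an assumption-laden sketch, if each file under `./data` were a pickle holding a list of move sequences, loading could look like this:

```python
# Hypothetical loader; the actual file format and naming depend on the
# data-generation code referenced above.
import pickle
from pathlib import Path

def load_games(data_dir: str = "./data") -> list:
    """Collect move sequences from every .pickle file in data_dir."""
    games = []
    for path in sorted(Path(data_dir).glob("*.pickle")):
        with open(path, "rb") as f:
            games.extend(pickle.load(f))
    return games
```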
This repo uses bits and pieces from Othello World.
All of the experiments in our paper can be found in the `mech_int` directory.

- `board_probe.py` contains training and evaluation scripts for our linear probes. See `train()` and `evaluate()`.
- `train_flipped.py` contains training and evaluation scripts for our `Flipped` probes. See `train()` and `evaluate()`.
- `intervene.py`, `intervene_blank.py`, and `intervene_flipped.py` contain our intervention experiments (a rough sketch of the idea follows this list).
- `tl_othello_utils.py` contains various utility functions.
- `./figures/` contains various notebooks that were used to create our figures.
- `./probes/` contains all of our probes.
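To illustrate the vector-arithmetic interventions performed by the scripts above, the sketch below nudges a module's output along a probe direction using a standard PyTorch forward hook. It is not the repo's implementation; the hooked module, the probe direction, and the scale are placeholders, and it assumes the hooked module returns a single tensor whose last dimension is the model width.

```python
# Hypothetical sketch of intervening on activations via vector arithmetic:
# add a scaled probe direction to a module's output during the forward pass.
import torch

def make_direction_hook(direction: torch.Tensor, scale: float = 1.0):
    """Build a forward hook that adds `scale * direction` (normalised) to the output."""
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        # Assumes `output` is a tensor of shape (..., d_model).
        return output + scale * unit

    return hook

# Usage sketch (module path, probe_dir, and scale are hypothetical):
# handle = model.blocks[6].register_forward_hook(make_direction_hook(probe_dir, scale=4.0))
# logits = model(tokens)
# handle.remove()
```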