
Commit b1ecdcf

rueckstiess and Thomas Rueckstiess authored
Reproducing all experiments (#6)
* folder structure, 3 subdirs, splitting README.
* ddxplus experiments, currently fails in run_origami.py:164.
* added DDXPlus experiments
* [wip] adding Codenet experiment code.
* fixed code and added readme.
* add dropped accuracy calculations
* update README to include RAM estimate.
* added .env.local and load_secrets() to codenet experiments
* rename EXPERIMENTS.md to README.md

---------

Co-authored-by: Thomas Rueckstiess <[email protected]>
1 parent 4ef5a93 commit b1ecdcf

60 files changed: +1176 -70 lines changed

experiments/README.md

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
# Reproducing the results from our paper

This directory contains the code and instructions to reproduce the experiments from our paper:
[ORIGAMI: A generative transformer architecture for predictions from semi-structured data](https://arxiv.org/abs/2412.17348).

There are 3 sub-directories, each with its own `README.md` file:

- [`json2vec`](./json2vec/README.md) contains the experiments from section 3.1, where we compare ORiGAMi on standard tabular benchmarks converted to JSON against various baselines and the json2vec models from [A Framework for End-to-End Learning on Semantic Tree-Structured Data](https://arxiv.org/abs/2002.05707) by William Woof and Ke Chen.
- [`ddxplus`](./ddxplus/README.md) contains the experiments from section 3.2 for a medical diagnosis task on patient information. This experiment demonstrates prediction of multi-token values representing arrays of possible pathologies.
- [`codenet`](./codenet/README.md) contains the experiments from section 3.3 related to a Java code classification task. Here we demonstrate the model's ability to deal with complex and deeply nested JSON objects.

### Experiment Tracking

We use the open source library [guild.ai](https://guild.ai) for experiment management and result tracking.

### Datasets

We bundled all datasets used in the paper in a [MongoDB dump file](). To reproduce the results, you first
need MongoDB installed on your system (or on a remote server). Then download the dump file, unzip it, and restore it into your MongoDB instance:

```
mongorestore dump/
```

This assumes your `mongod` server is running on `localhost` on the default port 27017 and without authentication. If your setup differs, consult the [documentation](https://www.mongodb.com/docs/database-tools/mongorestore/) for `mongorestore` on how to restore the data.

If your database setup (URI, port, authentication) differs, also make sure to update the [`.env.local`](.env.local) file in each sub-directory accordingly.
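
As a rough illustration of how the `.env.local` settings could be consumed, the sketch below parses `KEY=value` lines into a dictionary. The `parse_env` helper is hypothetical and is not the repository's `load_secrets()` implementation; it only assumes the simple `KEY=value` format with optional double quotes, using the values from `ddxplus/.env.local` as input.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines (hypothetical helper, not origami's load_secrets)."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # skip blank lines and comments
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines without '='
        # drop surrounding whitespace and optional double quotes
        settings[key.strip()] = value.strip().strip('"')
    return settings


example = 'MONGO_URI="mongodb://localhost:27017"\nDATABASE=ddxplus\n'
settings = parse_env(example)
print(settings["MONGO_URI"])
```

Changing the URI in one place this way keeps scripts free of hard-coded connection strings.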
File renamed without changes.

experiments/codenet/README.md

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
# CodeNet Java Experiments

In this experiment, we convert Java code snippets from the [CodeNet](https://developer.ibm.com/exchanges/data/all/project-codenet/) dataset into Abstract Syntax Trees and store them as JSON objects.
We then train an ORiGAMi model on these ASTs for a classification task, where the programming problem ID is the target label. More details on the dataset and classification task can be found
in the paper [CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks](https://arxiv.org/abs/2105.12655) by Ruchir Puri et al.

First, make sure you have restored the datasets from the mongo dump file as described in [../README.md](../README.md). All commands (see below) must be run from the `codenet` directory.

### Training and evaluating the model

Due to resource constraints, we did not perform hyperparameter optimization. We use a model with 4 transformer layers, 4 heads and an embedding dimensionality of 192. All parameters are
configured as defaults in the `guild.yml` file.

To run the training and evaluation on the test set, use:

```bash
guild run train
```

Note: Training with the default parameters requires an estimated 50 GB of GPU memory.

experiments/codenet/guild.yml

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
train:
  description: Train a model on the codenet Java dataset
  main: train
  flags-dest: namespace:flags
  flags:
    n_batches: 200000
    n_problems: 250
    batch_size: 8
    learning_rate: 1e-3
    n_embd: 192
    max_tokens: 4000
    max_length: 4000
    eval_every: 1000

  # matches the output of the print_guild_scalars() helper function
  output-scalars:
    - step: '\| step: (\step)'
    - '\| (\key): (\value)'
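
To make the `output-scalars` patterns more concrete, here is a minimal sketch of the kind of log line they are meant to capture. The `format_scalars` and `parse_scalars` helpers are hypothetical (they are not `print_guild_scalars()`), and the `KEY` and `VALUE` regular expressions are rough stand-ins for Guild's special `\key` and `\value` patterns, not their exact definitions.

```python
import re

# Rough stand-ins for Guild's special patterns (assumptions, not Guild's exact definitions)
KEY = r"[^ \t]+"
VALUE = r"[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?"
SCALAR_RE = re.compile(rf"\| ({KEY}): ({VALUE})")


def format_scalars(**scalars) -> str:
    """Hypothetical formatter: one '| key: value' segment per scalar, on a single line."""
    return " ".join(f"| {key}: {value}" for key, value in scalars.items())


def parse_scalars(line: str) -> dict[str, float]:
    """Extract every '| key: value' pair from a log line as floats."""
    return {key: float(value) for key, value in SCALAR_RE.findall(line)}


line = format_scalars(step=1000, batch_loss="0.1234", lr="1.00e-03")
parsed = parse_scalars(line)
print(line)
print(parsed)
```

The first pattern in `guild.yml` singles out the `step` value so that all other scalars on the same line are recorded against that step.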

experiments/codenet/train.py

Lines changed: 201 additions & 0 deletions
@@ -0,0 +1,201 @@
from pathlib import Path
from types import SimpleNamespace

from pymongo import MongoClient
from sklearn.pipeline import Pipeline

from origami.inference import Predictor
from origami.model import ORIGAMI
from origami.model.vpda import ObjectVPDA
from origami.preprocessing import (
    DFDataset,
    DocPermuterPipe,
    DocTokenizerPipe,
    PadTruncTokensPipe,
    TargetFieldPipe,
    TokenEncoderPipe,
    UpscalerPipe,
    load_df_from_mongodb,
)
from origami.utils.common import set_seed
from origami.utils.config import GuardrailsMethod, ModelConfig, PositionEncodingMethod, TrainConfig
from origami.utils.guild import load_secrets, print_guild_scalars

# populated by guild
flags = SimpleNamespace()
secrets = load_secrets()

# for reproducibility
set_seed(1234)

TARGET_FIELD = "problem"
UPSCALE = 2

client = MongoClient(secrets["MONGO_URI"])
collection = client["codenet_java"].train

target_problems = collection.distinct(TARGET_FIELD)
num_problems = len(target_problems)

target_problems = target_problems[: flags.n_problems]
print(f"training on {flags.n_problems} problems (out of {num_problems})")

# load data into dataframe for train/test

# use the configured URI (same as the client above) instead of a hard-coded localhost
train_docs_df = load_df_from_mongodb(
    secrets["MONGO_URI"],
    "codenet_java",
    "train",
    filter={"problem": {"$in": target_problems}},
    projection={"_id": 0, "filePath": 0},
)

test_docs_df = load_df_from_mongodb(
    secrets["MONGO_URI"],
    "codenet_java",
    "test",
    filter={"problem": {"$in": target_problems}},
    projection={"_id": 0, "filePath": 0},
)

num_train_inst = len(train_docs_df)
num_test_inst = len(test_docs_df)

# create train and test pipelines
pipes = {
    # --- train only ---
    "upscaler": UpscalerPipe(n=UPSCALE),
    "permuter": DocPermuterPipe(shuffle_arrays=True),
    # --- test only ---
    "target": TargetFieldPipe(TARGET_FIELD),
    # --- train and test ---
    "tokenizer": DocTokenizerPipe(path_in_field_tokens=False),
    "padding": PadTruncTokensPipe(length=flags.max_length),
    "encoder": TokenEncoderPipe(max_tokens=flags.max_tokens),
}

train_pipeline = Pipeline(
    [(name, pipes[name]) for name in ("target", "upscaler", "permuter", "tokenizer", "padding", "encoder")],
    verbose=True,
)
test_pipeline = Pipeline([(name, pipes[name]) for name in ("target", "tokenizer", "padding", "encoder")], verbose=True)

# process train and test data (first fit both, then transform)
train_pipeline.fit(train_docs_df)
test_pipeline.fit(test_docs_df)

train_df = train_pipeline.transform(train_docs_df)
test_df = test_pipeline.transform(test_docs_df)

# drop docs columns to save space
train_df.drop(columns=["docs"], inplace=True)
test_df.drop(columns=["docs"], inplace=True)

# drop all rows where the tokens array doesn't end in 0 (longer than max_length)
train_df = train_df[train_df["tokens"].apply(lambda x: x[-1] == 0)]
test_df = test_df[test_df["tokens"].apply(lambda x: x[-1] == 0)]

# get stateful objects
encoder = pipes["encoder"].encoder
block_size = pipes["padding"].length

# print data stats
print(
    f"dropped {(1 - (len(train_df) / (UPSCALE * num_train_inst))) * 100:.2f}% training instances, and "
    f"{(1 - (len(test_df) / num_test_inst)) * 100:.2f}% test instances."
)
print(f"vocab size {encoder.vocab_size}")
print(f"block size {block_size}")

# confirm that all targets are in the vocabulary
for target in train_df["target"].unique():
    enc = encoder.encode(target)
    assert target == encoder.decode(enc), f"token {target} not represented in vocab."

for target in test_df["target"].unique():
    enc = encoder.encode(target)
    assert target == encoder.decode(enc), f"token {target} not represented in vocab."

# create datasets, VPDA and model

# model and train configs
model_config = ModelConfig.from_preset("small")
model_config.position_encoding = PositionEncodingMethod.KEY_VALUE
model_config.vocab_size = encoder.vocab_size
model_config.block_size = block_size
model_config.n_embd = flags.n_embd
model_config.mask_field_token_losses = False
model_config.tie_weights = False
model_config.guardrails = GuardrailsMethod.STRUCTURE_ONLY
model_config.fuse_pos_with_mlp = True

train_config = TrainConfig()
train_config.learning_rate = flags.learning_rate
train_config.batch_size = flags.batch_size
train_config.n_warmup_batches = 100
train_config.eval_every = flags.eval_every

# datasets
train_dataset = DFDataset(train_df)
test_dataset = DFDataset(test_df)

vpda = ObjectVPDA(encoder)
model = ORIGAMI(model_config, train_config, vpda=vpda)

# load model checkpoint if it exists
checkpoint_file = Path("./gpt-codenet-snapshot.pt")
if checkpoint_file.is_file():
    model.load("gpt-codenet-snapshot.pt")
    print(f"loading existing checkpoint at batch_num {model.batch_num}...")


# create a predictor
predictor = Predictor(model, encoder, TARGET_FIELD)


def progress_callback(model):
    # report training scalars after every batch
    print_guild_scalars(
        step=f"{int(model.batch_num)}",
        epoch=model.epoch_num,
        batch_num=model.batch_num,
        batch_dt=f"{model.batch_dt * 1000:.2f}",
        batch_loss=f"{model.loss:.4f}",
        lr=f"{model.learning_rate:.2e}",
    )
    # every eval_every batches: evaluate on a test sample and save a snapshot
    if model.batch_num % train_config.eval_every == 0:
        try:
            # train_acc = predictor.accuracy(train_dataset.sample(n=100))
            test_acc = predictor.accuracy(test_dataset.sample(n=100), show_progress=True)
            print_guild_scalars(
                step=f"{int(model.batch_num)}",
                # train_acc=f"{train_acc:.4f}",
                test_acc=f"{test_acc:.4f}",
            )
            # print(f"Train accuracy @ 100: {train_acc:.4f}, Test accuracy @ 100: {test_acc:.4f}")
        except AssertionError as e:
            print(e)
            print("continuing...")

        model.save("gpt-codenet-snapshot.pt")
        print("model saved to gpt-codenet-snapshot.pt")


model.set_callback("on_batch_end", progress_callback)

try:
    model.train_model(train_dataset, batches=flags.n_batches)
except KeyboardInterrupt:
    pass

# final save
model.save("gpt-codenet-snapshot.pt")
print("model saved to gpt-codenet-snapshot.pt")

test_acc = predictor.accuracy(test_dataset, show_progress=True)
print_guild_scalars(
    step=f"{int(model.batch_num / train_config.eval_every)}",
    test_acc=f"{test_acc:.4f}",
)

# the value below is a fraction like test_acc above, not a percentage
dropped_ratio = 1 - (len(test_df) / num_test_inst)
print(f"Final test accuracy when taking into account the dropped instances: {(1 - dropped_ratio) * test_acc:.4f}")
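
The truncation filter and the final accuracy adjustment in `train.py` can be illustrated in isolation. The sketch below is a pure-Python stand-in, assuming the padding token is encoded as `0` (as the `x[-1] == 0` check suggests); `pad_or_truncate`, `fits` and `adjusted_accuracy` are hypothetical helpers, not part of the `origami` package, and the numbers are made up for illustration.

```python
PAD_ID = 0  # assumption: padding token is encoded as 0, as the `x[-1] == 0` check suggests


def pad_or_truncate(tokens: list[int], length: int) -> list[int]:
    """Pad with PAD_ID or cut off at `length` (stand-in for PadTruncTokensPipe)."""
    return (tokens + [PAD_ID] * length)[:length]


def fits(tokens: list[int]) -> bool:
    """A sequence that still ends in padding was not truncated."""
    return tokens[-1] == PAD_ID


def adjusted_accuracy(acc_on_kept: float, n_kept: int, n_total: int) -> float:
    """Count every dropped instance as a misclassification."""
    return acc_on_kept * n_kept / n_total


sequences = [[5, 6, 7], [5, 6, 7, 8, 9, 10]]  # made-up token ids
padded = [pad_or_truncate(seq, length=4) for seq in sequences]
kept = [seq for seq in padded if fits(seq)]
print(kept)
print(adjusted_accuracy(0.9, n_kept=80, n_total=100))
```

Note that under this check a sequence whose length equals the block size exactly is dropped as well, since it has no trailing padding token.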

experiments/ddxplus/.env.local

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
MONGO_URI="mongodb://localhost:27017"
DATABASE=ddxplus

experiments/ddxplus/README.md

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
# DDXPlus Experiments

In this experiment we train the model on the [DDXPlus dataset](https://arxiv.org/abs/2205.09148), a dataset for automated medical diagnosis. We devise a task to predict the most likely differential diagnoses for each instance, a multi-label prediction task.

For ORiGAMi, we reformat the dataset into JSON format with two different representations:

- A flat representation, in which we store the evidences and their values as strings.
- An object representation, where the evidences are stored as an object containing array values.

We compare our model against baselines: Logistic Regression, Random Forests, XGBoost, LightGBM. The baselines are trained on a
flat representation by converting the evidence-value strings into a multi-label binary matrix. We wrap each model in a scikit-learn
[MultiOutputClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html).

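The conversion of evidence-value strings into a multi-label binary matrix can be sketched in plain Python as follows. This is only an illustration of the encoding idea; the `binarize` helper and the evidence strings are made up, and the repository's baseline code may build the matrix differently (for example with scikit-learn's `MultiLabelBinarizer`).

```python
def binarize(samples: list[list[str]]) -> tuple[list[str], list[list[int]]]:
    """Map each sample's evidence-value strings to a 0/1 row over the sorted vocabulary."""
    vocabulary = sorted({item for sample in samples for item in sample})
    column = {item: idx for idx, item in enumerate(vocabulary)}
    matrix = []
    for sample in samples:
        row = [0] * len(vocabulary)
        for item in sample:
            row[column[item]] = 1  # mark the evidence as present
        matrix.append(row)
    return vocabulary, matrix


# made-up evidence strings, for illustration only
samples = [["fever", "cough=dry"], ["cough=dry"], ["fever", "pain=chest"]]
vocabulary, matrix = binarize(samples)
print(vocabulary)
print(matrix)
```

Each row of the matrix then serves as the feature vector for one patient, with one column per distinct evidence-value string.
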
First, make sure you have restored the datasets from the mongo dump file as described in [../README.md](../README.md). All commands (see below) must be run from the `ddxplus` directory.

## ORiGAMi

We train a model with the `medium` size preset by default: 6 layers, 6 heads and an embedding dimensionality of 192. To train with other model sizes, append `model_size=<size>` to the command, using one of the following options: `xs`, `small`, `medium`, `large`, `xl`.

To train and evaluate ORiGAMi on the flat evidences structure, run the following:

```bash
guild run origami:train evidences=flat eval_data=test seed="[1, 2, 3, 4, 5]"
```

For the object representation of evidences, run instead:

```bash
guild run origami:train evidences=object eval_data=test seed="[1, 2, 3, 4, 5]"
```

This will repeat the training and evaluation 5 times with different random seeds and evaluate on the test set.

## Baselines

### Hyperparameter optimization

First perform hyperparameter optimization (HPO). Choose the model as one of `lr` (Logistic Regression), `rf` (Random Forest), `xgb` (XGBoost), `lgb` (LightGBM), set the number of trial runs with `--max-trials <num>`, and name the run with `--label <label>`, e.g. for Logistic Regression:

```bash
NUMPY_EXPERIMENTAL_DTYPE_API=1 guild run lr:hyperopt --optimizer random --max-trials 20 --label <label>
```

To find the best parameters on the validation dataset, use:

```bash
guild compare -Fl <label> -u
```

Sort the `f1_val_mean` column in descending order (press `S` key) and pick the run ID (first column) of the best configuration.

Get the hyperparameters (= flags) with `guild runs info <run-id>`.

### Evaluate best hyperparameters on test dataset

Once the optimal hyperparameters are found, run the model with them, e.g.:

```bash
guild run lr:train <param1>=<value1> <param2>=<value2> ...
```

Replace the `<param>` and `<value>` placeholders with the optimal hyperparameters. You can ignore `model_name` and `n_random_seeds` here.
By default, the evaluation is done 5 times with different random seeds.

The `<metric>_test_mean` and `<metric>_test_val` scores show the evaluation on the test dataset, where `<metric>` is one of `f1`, `precision`, `recall`.
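
How a `<metric>_test_mean` score aggregates the per-seed runs can be sketched as below. The `summarize` helper is hypothetical and the F1 values are placeholders, not results from the paper; whether the experiment code reports a sample or population standard deviation is not stated here, so the sample version is an assumption.

```python
from statistics import mean, stdev


def summarize(scores: list[float]) -> tuple[float, float]:
    """Return the mean and sample standard deviation over per-seed scores."""
    return mean(scores), stdev(scores)


f1_per_seed = [0.80, 0.82, 0.81, 0.79, 0.83]  # placeholder values, not results from the paper
f1_mean, f1_std = summarize(f1_per_seed)
print(f"f1_test_mean={f1_mean:.4f} (std {f1_std:.4f})")
```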
File renamed without changes.
