
Commit 7854f1a

Author: Frederikke Marin
remove mention of tfrecords from other files

1 parent d5fc399

File tree

3 files changed: +4 -7 lines changed


README.md

Lines changed: 2 additions & 2 deletions

@@ -35,7 +35,7 @@ We recommend installing BEND in a conda environment with Python 3.10.
 
 ### 3. Computing embeddings
 
-For training downstream models, it is practical to precompute and save the embeddings to avoid recomputing them at each epoch. As embeddings can grow large when working with genomes, we use TFRecords as the format.
+For training downstream models, it is practical to precompute and save the embeddings to avoid recomputing them at each epoch. As embeddings can grow large when working with genomes, we use tar files as the format.
 Firstly download the desired data from the [data folder](https://sid.erda.dk/cgi-sid/ls.py?share_id=eXAmVvbRSW) and place it in BEND/ (for ease of use maintain the same folder structure).
 To precompute the embeddings for all models and tasks, run :
 ```
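The tar-file format the change above switches to can be sketched with the standard library. This is an illustrative layout only (one pickled embedding per archive member, with made-up sample IDs), not BEND's actual serialization format:

```python
# Illustrative sketch: pack precomputed embeddings into a .tar.gz shard,
# one member per sequence. Member naming and the pickle payload are
# assumptions for illustration, not BEND's exact on-disk layout.
import io
import os
import pickle
import tarfile
import tempfile

def write_embedding_shard(path, embeddings):
    """embeddings: iterable of (sample_id, vector) pairs."""
    with tarfile.open(path, "w:gz") as tar:
        for sample_id, vector in embeddings:
            payload = pickle.dumps(vector)
            info = tarfile.TarInfo(name=f"{sample_id}.pkl")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

def read_embedding_shard(path):
    """Yield (sample_id, vector) pairs back from a shard."""
    with tarfile.open(path, "r:gz") as tar:
        for member in tar.getmembers():
            vector = pickle.loads(tar.extractfile(member).read())
            yield member.name[: -len(".pkl")], vector

shard_path = os.path.join(tempfile.mkdtemp(), "train.tar.gz")
write_embedding_shard(shard_path, [("seq0", [0.1, 0.2]), ("seq1", [0.3, 0.4])])
recovered = dict(read_embedding_shard(shard_path))
print(recovered["seq0"])  # [0.1, 0.2]
```

Writing once and streaming members back avoids recomputing embeddings at each epoch, which is the motivation stated in the README text above.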
@@ -89,7 +89,7 @@ embedder = HyenaDNAEmbedder('pretrained_models/hyenadna/hyenadna-tiny-1k-seqlen'
 #### Training and evaluating supervised models
 
 It is first required that the [above step (computing the embeddings)](#2-computing-embeddings) is completed.
-The embeddings should afterwards be located in `BEND/data/{task_name}/{embedder}/*tfrecords`
+The embeddings should afterwards be located in `BEND/data/{task_name}/{embedder}/*tar.gz`
 
 To run a downstream task run (from `BEND/`):
 ```
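Locating the `*tar.gz` shards under `BEND/data/{task_name}/{embedder}/` is a simple glob. A hedged sketch, where the task and embedder names (`gene_finding`, `dnabert`) are purely illustrative:

```python
# Hedged sketch: collect per-split tar.gz shards from the directory layout
# described above. Task/embedder names are made up for illustration.
import os
import tempfile
from pathlib import Path

def find_shards(data_root, task_name, embedder):
    return sorted(Path(data_root, task_name, embedder).glob("*tar.gz"))

# Build a throwaway directory tree matching data/{task_name}/{embedder}/.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "gene_finding", "dnabert"))
for split in ("train", "valid", "test"):
    Path(root, "gene_finding", "dnabert", f"{split}.tar.gz").touch()

shard_names = [p.name for p in find_shards(root, "gene_finding", "dnabert")]
print(shard_names)  # ['test.tar.gz', 'train.tar.gz', 'valid.tar.gz']
```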

bend/utils/data_downstream.py

Lines changed: 1 addition & 1 deletion

@@ -86,7 +86,7 @@ def return_dataloader(data : Union[str, list],
 Parameters
 ----------
 data : Union[str, list]
-    Path to tfrecord or list of paths to tar files.
+    Path to single tar file or list of paths to tar files.
 batch_size : int, optional
     Batch size. The default is 8.
 num_workers : int, optional
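The `Union[str, list]` contract in the docstring above implies a normalization step: a single tar path gets wrapped so downstream code always sees a list of shards. A minimal sketch of that pattern (the helper name is hypothetical, not part of BEND):

```python
from typing import Union

# Hypothetical helper illustrating the str-or-list contract from the
# docstring: wrap a single tar path into a one-element list.
def normalize_shards(data: Union[str, list]) -> list:
    return [data] if isinstance(data, str) else list(data)

print(normalize_shards("data/gene_finding/dnabert/train.tar.gz"))
# ['data/gene_finding/dnabert/train.tar.gz']
```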

docs/source/hydra.rst

Lines changed: 1 addition & 4 deletions

@@ -135,10 +135,7 @@ Below is an example of one such config file.
 num_workers : 0
 padding_value : -100
 shuffle : 5000
-data_dir : ./data/${task}/${embedder}/ # directory where the tf reoc
-train_data : [train.tfrecord] # list of tfrecords to be used for training
-valid_data : [valid.tfrecord] # list of tfrecords to be used for validation
-test_data : [test.tfrecord] # list of tfrecords to be used for testing
+data_dir : ./data/${task}/${embedder}/ # directory where the tar files are stored
 # cross_validation : 1 # which number fold to run for Cross validation (use either this or the above train/test/valid options)
 params: # training arguments
 epochs: 100
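At run time, Hydra/OmegaConf resolves the `${task}` and `${embedder}` interpolations in `data_dir` to concrete names. The substitution can be sketched with the standard library, since Hydra-style `${key}` placeholders happen to match `string.Template` syntax; the values below are illustrative only, and real resolution is done by OmegaConf, not this helper:

```python
# Sketch of Hydra-style ${key} interpolation via string.Template.
# Values are illustrative; OmegaConf performs the real resolution.
from string import Template

def resolve(template, **values):
    return Template(template).substitute(values)

data_dir = resolve("./data/${task}/${embedder}/",
                   task="gene_finding", embedder="dnabert")
print(data_dir)  # ./data/gene_finding/dnabert/
```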
