
Commit 7854f1a

Author: Frederikke Marin
remove mention of tfrecords from other files

1 parent d5fc399

File tree

3 files changed: +4 -7 lines changed


README.md

Lines changed: 2 additions & 2 deletions

@@ -35,7 +35,7 @@ We recommend installing BEND in a conda environment with Python 3.10.
 
 ### 3. Computing embeddings
 
-For training downstream models, it is practical to precompute and save the embeddings to avoid recomputing them at each epoch. As embeddings can grow large when working with genomes, we use TFRecords as the format.
+For training downstream models, it is practical to precompute and save the embeddings to avoid recomputing them at each epoch. As embeddings can grow large when working with genomes, we use tar files as the format.
 Firstly download the desired data from the [data folder](https://sid.erda.dk/cgi-sid/ls.py?share_id=eXAmVvbRSW) and place it in BEND/ (for ease of use maintain the same folder structure).
 To precompute the embeddings for all models and tasks, run :
 ```
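The tar-file format the change above switches to can be sketched with the standard library. This is an illustrative layout only (one pickled embedding per archive member, with made-up sample IDs), not BEND's actual serialization format:

```python
# Illustrative sketch: pack precomputed embeddings into a .tar.gz shard,
# one member per sequence. Member naming and the pickle payload are
# assumptions for illustration, not BEND's exact on-disk layout.
import io
import os
import pickle
import tarfile
import tempfile

def write_embedding_shard(path, embeddings):
    """embeddings: iterable of (sample_id, vector) pairs."""
    with tarfile.open(path, "w:gz") as tar:
        for sample_id, vector in embeddings:
            payload = pickle.dumps(vector)
            info = tarfile.TarInfo(name=f"{sample_id}.pkl")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

def read_embedding_shard(path):
    """Yield (sample_id, vector) pairs back from a shard."""
    with tarfile.open(path, "r:gz") as tar:
        for member in tar.getmembers():
            vector = pickle.loads(tar.extractfile(member).read())
            yield member.name[: -len(".pkl")], vector

shard_path = os.path.join(tempfile.mkdtemp(), "train.tar.gz")
write_embedding_shard(shard_path, [("seq0", [0.1, 0.2]), ("seq1", [0.3, 0.4])])
recovered = dict(read_embedding_shard(shard_path))
print(recovered["seq0"])  # [0.1, 0.2]
```

Writing once and streaming members back avoids recomputing embeddings at each epoch, which is the motivation stated in the README text above.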
@@ -89,7 +89,7 @@ embedder = HyenaDNAEmbedder('pretrained_models/hyenadna/hyenadna-tiny-1k-seqlen'
 #### Training and evaluating supervised models
 
 It is first required that the [above step (computing the embeddings)](#2-computing-embeddings) is completed.
-The embeddings should afterwards be located in `BEND/data/{task_name}/{embedder}/*tfrecords`
+The embeddings should afterwards be located in `BEND/data/{task_name}/{embedder}/*tar.gz`
 
 To run a downstream task run (from `BEND/`):
 ```
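Locating the `*tar.gz` shards under `BEND/data/{task_name}/{embedder}/` is a simple glob. A hedged sketch, where the task and embedder names (`gene_finding`, `dnabert`) are purely illustrative:

```python
# Hedged sketch: collect per-split tar.gz shards from the directory layout
# described above. Task/embedder names are made up for illustration.
import os
import tempfile
from pathlib import Path

def find_shards(data_root, task_name, embedder):
    return sorted(Path(data_root, task_name, embedder).glob("*tar.gz"))

# Build a throwaway directory tree matching data/{task_name}/{embedder}/.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "gene_finding", "dnabert"))
for split in ("train", "valid", "test"):
    Path(root, "gene_finding", "dnabert", f"{split}.tar.gz").touch()

shard_names = [p.name for p in find_shards(root, "gene_finding", "dnabert")]
print(shard_names)  # ['test.tar.gz', 'train.tar.gz', 'valid.tar.gz']
```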

bend/utils/data_downstream.py

Lines changed: 1 addition & 1 deletion

@@ -86,7 +86,7 @@ def return_dataloader(data : Union[str, list],
 Parameters
 ----------
 data : Union[str, list]
-    Path to tfrecord or list of paths to tar files.
+    Path to single tar file or list of paths to tar files.
 batch_size : int, optional
     Batch size. The default is 8.
 num_workers : int, optional
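The `Union[str, list]` contract in the docstring above implies a normalization step: a single tar path gets wrapped so downstream code always sees a list of shards. A minimal sketch of that pattern (the helper name is hypothetical, not part of BEND):

```python
from typing import Union

# Hypothetical helper illustrating the str-or-list contract from the
# docstring: wrap a single tar path into a one-element list.
def normalize_shards(data: Union[str, list]) -> list:
    return [data] if isinstance(data, str) else list(data)

print(normalize_shards("data/gene_finding/dnabert/train.tar.gz"))
# ['data/gene_finding/dnabert/train.tar.gz']
```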

docs/source/hydra.rst

Lines changed: 1 addition & 4 deletions

@@ -135,10 +135,7 @@ Below is an example of one such config file.
 num_workers : 0
 padding_value : -100
 shuffle : 5000
-data_dir : ./data/${task}/${embedder}/ # directory where the tf reoc
-train_data : [train.tfrecord] # list of tfrecords to be used for training
-valid_data : [valid.tfrecord] # list of tfrecords to be used for validation
-test_data : [test.tfrecord] # list of tfrecords to be used for testing
+data_dir : ./data/${task}/${embedder}/ # directory where the tar files are stored
 # cross_validation : 1 # which number fold to run for Cross validation (use either this or the above train/test/valid options)
 params: # training arguments
 epochs: 100
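At run time, Hydra/OmegaConf resolves the `${task}` and `${embedder}` interpolations in `data_dir` to concrete names. The substitution can be sketched with the standard library, since Hydra-style `${key}` placeholders happen to match `string.Template` syntax; the values below are illustrative only, and real resolution is done by OmegaConf, not this helper:

```python
# Sketch of Hydra-style ${key} interpolation via string.Template.
# Values are illustrative; OmegaConf performs the real resolution.
from string import Template

def resolve(template, **values):
    return Template(template).substitute(values)

data_dir = resolve("./data/${task}/${embedder}/",
                   task="gene_finding", embedder="dnabert")
print(data_dir)  # ./data/gene_finding/dnabert/
```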
