
Commit 7e41b66

feat: re-shuffle Nanoset sample indices within epochs
More details in huggingface#247: "Nanoset's index builder does not re-shuffle dataset and sample indices within epochs when training secondary, third, etc epochs. It instead concatenates a copy of the same indices for any repeated data. This commit adds unique within-epoch shuffling for each epoch."

Squashed commit of the following:

commit f73d111
Author: Thomas Bouvier <[email protected]>
Date:   Sun Mar 23 01:19:15 2025 +0100

    docs: document the shuffling process in Nanoset

commit f060414
Author: Lauler <[email protected]>
Date:   Thu Nov 28 09:04:17 2024 +0100

    Simplify random seed in epoch data for reproducibility

commit eab4770
Author: Lauler <[email protected]>
Date:   Sun Nov 24 14:29:31 2024 +0100

    Add shuffling for subsequent epochs when data is repeated
1 parent 0f6ad3d
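To make the change concrete, here is a minimal sketch (illustrative helper names, not Nanoset's actual code) contrasting the old behaviour, one shuffle repeated verbatim for every epoch, with the new per-epoch re-shuffle:

```python
import numpy as np

def repeated_indices(indices: np.ndarray, num_epochs: int, seed: int) -> np.ndarray:
    """Old behaviour: a single shuffle, concatenated verbatim for every epoch."""
    rng = np.random.RandomState(seed)
    rng.shuffle(indices)  # in-place; every epoch sees the same order
    return np.concatenate([indices for _ in range(num_epochs)])

def reshuffled_indices(indices: np.ndarray, num_epochs: int, seed: int) -> np.ndarray:
    """New behaviour: a fresh shuffle per epoch, seeded with seed + epoch."""
    epochs = []
    for epoch in range(num_epochs):
        rng = np.random.RandomState(seed + epoch)  # distinct seed per epoch
        rng.shuffle(indices)           # in-place shuffle of the running order
        epochs.append(indices.copy())  # copy, so this epoch keeps its own order
    return np.concatenate(epochs)

print(repeated_indices(np.arange(8), num_epochs=2, seed=1234))   # two identical halves
print(reshuffled_indices(np.arange(8), num_epochs=2, seed=1234)) # halves differ
```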

File tree

2 files changed: +19 −9 lines changed


docs/nanoset.md (+1 −1)

````diff
@@ -107,7 +107,7 @@ dataset_index = [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
 dataset_index = [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
 dataset_sample_index = [0, 0, 0, 1, 0, 0, 1, 1, 2, 0, 1, 1, 3, 0, 1, 1, 4, 0, 0, 1]
 ```
-Then, we **shuffle with the same permutation both indexes** and concatenate them `number of epochs` times, which is defined by `train split num samples` / `number of samples per epoch`.
+Then, we **shuffle both indexes with the same permutation** for a given epoch and repeat this process, concatenating the resulting indexes `number of epochs` times (defined by `train split num samples` / `number of samples per epoch`).
 ```
 Given:
````

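The "same permutation" pairing described in the docs works because re-seeding `np.random.RandomState` with the same value replays the identical in-place shuffle on two equal-length arrays. A hypothetical standalone sketch, using the example arrays above (the seed value is illustrative):

```python
import numpy as np

seed = 1234  # illustrative seed, not a value the docs prescribe
dataset_index = np.array([1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1])
dataset_sample_index = np.array([0, 0, 0, 1, 0, 0, 1, 1, 2, 0, 1, 1, 3, 0, 1, 1, 4, 0, 0, 1])

# Two fresh generators seeded identically produce the same shuffle,
# so each (dataset, sample) pair survives the reordering intact.
np.random.RandomState(seed).shuffle(dataset_index)
np.random.RandomState(seed).shuffle(dataset_sample_index)

# Equivalent alternative: draw one explicit permutation and index both arrays,
# which makes the shared ordering visible at a glance.
perm = np.random.RandomState(seed).permutation(len(dataset_index))
dataset_index, dataset_sample_index = dataset_index[perm], dataset_sample_index[perm]
```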
src/nanotron/data/nanoset.py (+18 −8)

```diff
@@ -111,14 +111,24 @@ def build_nanoset_index(self) -> np.ndarray:
         dataset_index, dataset_sample_index = build_nanoset_index_helper(
             n_samples=samples_per_epoch, weights=self.dataset_weights, dataset_sizes=self.dataset_lengths
         )
-        # Shuffle the indexes the same way
-        numpy_random_state = np.random.RandomState(self.random_seed)
-        numpy_random_state.shuffle(dataset_index)
-        numpy_random_state = np.random.RandomState(self.random_seed)
-        numpy_random_state.shuffle(dataset_sample_index)
-        # Concatenate num_epochs the shuffled indexes
-        dataset_index = np.concatenate([dataset_index for _ in range(num_epochs)])
-        dataset_sample_index = np.concatenate([dataset_sample_index for _ in range(num_epochs)])
+
+        # Shuffle the indices of each epoch with a different random seed, then concatenate them
+        dataset_indices = []
+        dataset_sample_indices = []
+        for num_epoch in range(num_epochs):
+            # Shuffle the dataset and sample indices of this epoch with the same seed
+            numpy_random_state = np.random.RandomState(self.random_seed + num_epoch)
+            numpy_random_state.shuffle(dataset_index)
+            numpy_random_state = np.random.RandomState(self.random_seed + num_epoch)
+            numpy_random_state.shuffle(dataset_sample_index)
+
+            dataset_indices.append(dataset_index.copy())  # copy, so later epochs don't overwrite this one
+            dataset_sample_indices.append(dataset_sample_index.copy())
+
+        # Concatenate the within-epoch shuffled indices
+        dataset_index = np.concatenate(dataset_indices)
+        dataset_sample_index = np.concatenate(dataset_sample_indices)
+
         # Just keep the necessary samples
         dataset_index = dataset_index[: self.train_split_num_samples]
         dataset_sample_index = dataset_sample_index[: self.train_split_num_samples]
```

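A quick self-contained check (hypothetical, not part of the commit) that the `self.random_seed + num_epoch` scheme stays reproducible across runs while giving each epoch its own ordering:

```python
import numpy as np

def build_index(seed: int, samples_per_epoch: int, num_epochs: int) -> np.ndarray:
    """Per-epoch seeded re-shuffle, mirroring the loop above on a toy index."""
    index = np.arange(samples_per_epoch)
    parts = []
    for epoch in range(num_epochs):
        rng = np.random.RandomState(seed + epoch)  # distinct seed per epoch
        rng.shuffle(index)
        parts.append(index.copy())
    return np.concatenate(parts)

n = 64
a = build_index(seed=1234, samples_per_epoch=n, num_epochs=3)
b = build_index(seed=1234, samples_per_epoch=n, num_epochs=3)
assert np.array_equal(a, b)                   # same seed -> same index (reproducible)
assert not np.array_equal(a[:n], a[n:2 * n])  # epochs are shuffled differently
```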