
Commit 7e41b66

feat: re-shuffle Nanoset sample indices within epochs
More details in huggingface#247: "Nanoset's index builder does not re-shuffle dataset and sample indices within epochs when training secondary, third, etc epochs. It instead concatenates a copy of the same indices for any repeated data. This commit adds unique within-epoch shuffling for each epoch."

Squashed commit of the following:

commit f73d111
Author: Thomas Bouvier <[email protected]>
Date:   Sun Mar 23 01:19:15 2025 +0100

    docs: document the shuffling process in Nanoset

commit f060414
Author: Lauler <[email protected]>
Date:   Thu Nov 28 09:04:17 2024 +0100

    Simplify random seed in epoch data for reproducibility

commit eab4770
Author: Lauler <[email protected]>
Date:   Sun Nov 24 14:29:31 2024 +0100

    Add shuffling for subsequent epochs when data is repeated
1 parent 0f6ad3d
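To make the change concrete, here is a minimal sketch (illustrative helper names, not Nanoset's actual code) contrasting the old behaviour, one shuffle repeated verbatim for every epoch, with the new per-epoch re-shuffle:

```python
import numpy as np

def repeated_indices(indices: np.ndarray, num_epochs: int, seed: int) -> np.ndarray:
    """Old behaviour: a single shuffle, concatenated verbatim for every epoch."""
    rng = np.random.RandomState(seed)
    rng.shuffle(indices)  # in-place; every epoch sees the same order
    return np.concatenate([indices for _ in range(num_epochs)])

def reshuffled_indices(indices: np.ndarray, num_epochs: int, seed: int) -> np.ndarray:
    """New behaviour: a fresh shuffle per epoch, seeded with seed + epoch."""
    epochs = []
    for epoch in range(num_epochs):
        rng = np.random.RandomState(seed + epoch)  # distinct seed per epoch
        rng.shuffle(indices)           # in-place shuffle of the running order
        epochs.append(indices.copy())  # copy, so this epoch keeps its own order
    return np.concatenate(epochs)

print(repeated_indices(np.arange(8), num_epochs=2, seed=1234))   # two identical halves
print(reshuffled_indices(np.arange(8), num_epochs=2, seed=1234)) # halves differ
```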

File tree

2 files changed: +19 −9 lines changed


docs/nanoset.md (+1 −1)

````diff
@@ -107,7 +107,7 @@ dataset_index = [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
 dataset_index = [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
 dataset_sample_index = [0, 0, 0, 1, 0, 0, 1, 1, 2, 0, 1, 1, 3, 0, 1, 1, 4, 0, 0, 1]
 ```
-Then, we **shuffle with the same permutation both indexes** and concatenate them `number of epochs` times, which is defined by `train split num samples` / `number of samples per epoch`.
+Then, we **shuffle both indexes with the same permutation** for a given epoch and repeat this process, concatenating the resulting indexes `number of epochs` times (defined by `train split num samples` / `number of samples per epoch`).
 ```
 Given:
````

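The "same permutation" pairing described in the docs works because re-seeding `np.random.RandomState` with the same value replays the identical in-place shuffle on two equal-length arrays. A hypothetical standalone sketch, using the example arrays above (the seed value is illustrative):

```python
import numpy as np

seed = 1234  # illustrative seed, not a value the docs prescribe
dataset_index = np.array([1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1])
dataset_sample_index = np.array([0, 0, 0, 1, 0, 0, 1, 1, 2, 0, 1, 1, 3, 0, 1, 1, 4, 0, 0, 1])

# Two fresh generators seeded identically produce the same shuffle,
# so each (dataset, sample) pair survives the reordering intact.
np.random.RandomState(seed).shuffle(dataset_index)
np.random.RandomState(seed).shuffle(dataset_sample_index)

# Equivalent alternative: draw one explicit permutation and index both arrays,
# which makes the shared ordering visible at a glance.
perm = np.random.RandomState(seed).permutation(len(dataset_index))
dataset_index, dataset_sample_index = dataset_index[perm], dataset_sample_index[perm]
```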
src/nanotron/data/nanoset.py (+18 −8)

```diff
@@ -111,14 +111,24 @@ def build_nanoset_index(self) -> np.ndarray:
         dataset_index, dataset_sample_index = build_nanoset_index_helper(
             n_samples=samples_per_epoch, weights=self.dataset_weights, dataset_sizes=self.dataset_lengths
         )
-        # Shuffle the indexes the same way
-        numpy_random_state = np.random.RandomState(self.random_seed)
-        numpy_random_state.shuffle(dataset_index)
-        numpy_random_state = np.random.RandomState(self.random_seed)
-        numpy_random_state.shuffle(dataset_sample_index)
-        # Concatenate num_epochs the shuffled indexes
-        dataset_index = np.concatenate([dataset_index for _ in range(num_epochs)])
-        dataset_sample_index = np.concatenate([dataset_sample_index for _ in range(num_epochs)])
+
+        # Shuffle the indices of each epoch with a different random seed, then concatenate them
+        dataset_indices = []
+        dataset_sample_indices = []
+        for num_epoch in range(num_epochs):
+            # Shuffle the dataset and sample indices of this epoch with the same seed
+            numpy_random_state = np.random.RandomState(self.random_seed + num_epoch)
+            numpy_random_state.shuffle(dataset_index)
+            numpy_random_state = np.random.RandomState(self.random_seed + num_epoch)
+            numpy_random_state.shuffle(dataset_sample_index)
+
+            dataset_indices.append(dataset_index.copy())  # copy, so later epochs don't overwrite this one
+            dataset_sample_indices.append(dataset_sample_index.copy())
+
+        # Concatenate the within-epoch shuffled indices
+        dataset_index = np.concatenate(dataset_indices)
+        dataset_sample_index = np.concatenate(dataset_sample_indices)
+
         # Just keep the necessary samples
         dataset_index = dataset_index[: self.train_split_num_samples]
         dataset_sample_index = dataset_sample_index[: self.train_split_num_samples]
```

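A quick self-contained check (hypothetical, not part of the commit) that the `self.random_seed + num_epoch` scheme stays reproducible across runs while giving each epoch its own ordering:

```python
import numpy as np

def build_index(seed: int, samples_per_epoch: int, num_epochs: int) -> np.ndarray:
    """Per-epoch seeded re-shuffle, mirroring the loop above on a toy index."""
    index = np.arange(samples_per_epoch)
    parts = []
    for epoch in range(num_epochs):
        rng = np.random.RandomState(seed + epoch)  # distinct seed per epoch
        rng.shuffle(index)
        parts.append(index.copy())
    return np.concatenate(parts)

n = 64
a = build_index(seed=1234, samples_per_epoch=n, num_epochs=3)
b = build_index(seed=1234, samples_per_epoch=n, num_epochs=3)
assert np.array_equal(a, b)                   # same seed -> same index (reproducible)
assert not np.array_equal(a[:n], a[n:2 * n])  # epochs are shuffled differently
```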