Replies: 1 comment
-
Small datasets aren't the best fit for DL models. Reducing the complexity of the model (via its hyperparameters) is recommended when the dataset is too small. Any engineered features that capture domain knowledge will also help the model learn better.
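As a minimal sketch of what "reducing complexity" can look like in pytorch_tabular: a small, regularized CategoryEmbeddingModel with early stopping. The dataframe and column names below are hypothetical stand-ins (not from this discussion), and parameter names follow the library's config classes; check them against your installed version.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from pytorch_tabular import TabularModel
from pytorch_tabular.config import DataConfig, OptimizerConfig, TrainerConfig
from pytorch_tabular.models import CategoryEmbeddingModelConfig

# Hypothetical stand-in for a small clinical dataset (~1,500 rows).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feat_1": rng.normal(size=1500),
    "feat_2": rng.normal(size=1500),
    "cat_1": rng.integers(0, 3, size=1500).astype(str),
    "target": rng.integers(0, 2, size=1500),
})
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

data_config = DataConfig(
    target=["target"],
    continuous_cols=["feat_1", "feat_2"],
    categorical_cols=["cat_1"],
)

# Keep the network small and regularized: fewer/narrower layers and dropout
# reduce the number of parameters a few hundred training rows must constrain.
model_config = CategoryEmbeddingModelConfig(
    task="classification",
    layers="32-16",   # much smaller than the library default
    dropout=0.2,
)

# Early stopping keeps the model from overfitting the tiny training set.
trainer_config = TrainerConfig(
    batch_size=64,
    max_epochs=200,
    early_stopping="valid_loss",
    early_stopping_patience=10,
)

tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=OptimizerConfig(),
    trainer_config=trainer_config,
)
tabular_model.fit(train=train_df, validation=val_df)
```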
-
As I've observed, models are often trained on fairly large datasets, for instance, using the load_covertype_dataset function, where the dataset contains over 580k samples. Currently, I am working on a problem involving the prognosis of injury severity for patients with traumatic brain injury (TBI). In my case, the dataset is quite limited, with only 1,500 samples and 70 features per sample. However, given that this is real-world data from patients in a hospital, this is actually a substantial amount of data.
I've tested all the models on my dataset, but I've noticed that they are extremely sensitive when trained on small datasets. Even a slight adjustment can result in significant changes to the model's outcomes. For example, when I change parameters like the number of layers in the CategoryEmbeddingModel or the number of attention blocks in the FTTransformer, the model almost completely fails to learn anything: the loss and accuracy values remain unchanged, showing no improvement or variation at all.
Is there any way I can improve the stability of models when working with small datasets? I would really appreciate any suggestions. Thank you very much!
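For reference, the kind of change being described would look roughly like this in pytorch_tabular. This is only an illustrative sketch: the concrete values are not the settings actually used in the post, and the parameter names follow the library's CategoryEmbeddingModelConfig / FTTransformerConfig classes as documented.

```python
from pytorch_tabular.models import CategoryEmbeddingModelConfig, FTTransformerConfig

# A shallower/narrower MLP for the CategoryEmbeddingModel.
cem_config = CategoryEmbeddingModelConfig(
    task="classification",
    layers="64-32",      # illustrative depth/width, not the poster's actual setting
)

# Fewer attention blocks (and heads) for the FTTransformer.
ft_config = FTTransformerConfig(
    task="classification",
    num_attn_blocks=2,   # illustrative; the library default is larger
    num_heads=4,
)
```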