Large dataset running out of vram? #760
-
@12bitmisfit your whole dataset should never go into VRAM at once. All that sits in VRAM is your model and optimizer parameters, a single input batch, and whatever needs to be remembered during a forward/backward pass. If you're running out even with batch size 1, it may be one of, or a combination of:
- a model that is too large for your card,
- an input resolution that is too large,
- a very large number of classes (see the next comment).
Or bugs like:
- keeping references to tensors that are still attached to the computation graph (e.g. accumulating losses without .item() or .detach()),
- running validation without torch.no_grad().
BTW you can't limit the VRAM being used explicitly (that just doesn't make sense in this context). You can only do it implicitly by controlling the things I mentioned above. Hope this helps.
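For reference, here is a minimal sketch (not this repo's actual train.py) of the standard dataset/dataloader pattern. Only the current batch is ever moved to the GPU, so the on-disk size of the dataset has no bearing on VRAM. The path and the stand-in model are placeholders:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = torch.device('cuda')

# The Dataset only indexes files on disk; nothing is loaded to the GPU here.
dataset = datasets.ImageFolder(
    'path/to/train',  # hypothetical path
    transform=transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ]),
)
# The DataLoader yields one CPU batch at a time via worker processes.
loader = DataLoader(dataset, batch_size=32, num_workers=4, shuffle=True)

model = torch.nn.Linear(3 * 224 * 224, 1000).to(device)  # stand-in for a real model

for images, targets in loader:
    # Only this single batch (plus model/optimizer state and activations)
    # occupies VRAM, never the whole dataset.
    images = images.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    loss = torch.nn.functional.cross_entropy(model(images.flatten(1)), targets)
    loss.backward()
    break  # one step is enough to illustrate the point
```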
-
@12bitmisfit to add to @alexander-soare's response, the dataset itself should not impact GPU memory usage if used via the dataset / dataloader setup as in this repo's training script. However, the number of classes does impact memory use. 357K classes (if that's what you actually have) is... insane. You will have to hit the literature on how to deal with this situation, as it will not be easy. It'll be approx 400-700M params just for the classifier, and you have to do a softmax over a dim of 357K, which is also non-trivial. You'll likely need some fancy model parallelism and approximate/hierarchical solutions.
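To put rough numbers on that, here is the back-of-the-envelope arithmetic for the classifier head alone. The feature widths below are just typical examples, not tied to any particular backbone:

```python
# Rough sizing of a Linear classifier head for 357K classes.
num_classes = 357_000
for feat_dim in (1280, 2048):  # example feature widths; depends on the backbone
    params = feat_dim * num_classes + num_classes  # weight matrix + bias
    print(f'feat_dim={feat_dim}: {params / 1e6:.0f}M params, '
          f'{params * 4 / 1e9:.2f} GB in fp32 (optimizer state adds 2-3x more)')
```

With feat_dim=1280 that is roughly 457M parameters (~1.8 GB in fp32), and with feat_dim=2048 roughly 731M parameters (~2.9 GB), before counting gradients and optimizer state.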
-
I'm trying to train on a very large dataset. I can train on smaller portions of it by moving them into another directory and pointing train.py at that, but if I try to use the whole dataset at once it won't run, because I don't have enough VRAM. I have an RTX 3090 with 24 GB of VRAM, so I can't do much better VRAM-wise without getting into ridiculous pricing. I know I have to define num_classes when I have more than 1000 sub-folders in train/validation, and it runs just fine with fairly large datasets, but I'd like to be able to train against all 357K sub-folders I have.
Is there something I'm missing about how to limit the amount of VRAM being used? Is it trying to load all the images into VRAM for training? Is there a way I can limit this? I've tried silly things like reducing the batch size to 1, but the problem seems to be that I just can't fit the dataset into VRAM.
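In case it helps anyone else, here is a quick sanity check for whether building a dataset allocates any GPU memory at all (the path is a placeholder):

```python
import torch
from torchvision import datasets, transforms

print(f'{torch.cuda.memory_allocated() / 1e6:.0f} MB allocated')  # ~0 MB

# Building the dataset only indexes file paths; no image data touches the GPU.
ds = datasets.ImageFolder('path/to/train', transform=transforms.ToTensor())
print(len(ds), 'images indexed')
print(f'{torch.cuda.memory_allocated() / 1e6:.0f} MB allocated')  # still ~0 MB

# A single batch-size-1 input is what actually lands in VRAM during training.
x = torch.zeros(1, 3, 224, 224, device='cuda')
print(f'{torch.cuda.memory_allocated() / 1e6:.0f} MB allocated')  # small, fixed amount
```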