Closed
Description
I am using SageMaker Object2Vec to train on a 2 GB dataset.
On an ml.p2.xlarge instance, training took 12 hours for 4 epochs at about 5,000 samples/sec.
I am now using a larger instance, ml.p2.16xlarge, and it trains at only 400 samples/sec.
I expected ml.p2.16xlarge to train faster.
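To put the slowdown in perspective, here is a rough back-of-the-envelope sketch. It assumes the reported samples/sec figures are steady averages and that wall-clock time is dominated by training itself (both are assumptions, not something the logs confirm). The variable names are only illustrative.

```python
# Figures reported above; everything derived from them is an estimate.
HOURS_ON_P2_XLARGE = 12        # wall-clock time for the full job
EPOCHS = 4
RATE_P2_XLARGE = 5000          # samples/sec on ml.p2.xlarge
RATE_P2_16XLARGE = 400         # samples/sec on ml.p2.16xlarge

# Assumption: throughput stayed roughly constant for the whole job.
total_samples = HOURS_ON_P2_XLARGE * 3600 * RATE_P2_XLARGE
samples_per_epoch = total_samples // EPOCHS

# Projected time for the same work at the slower observed rate.
projected_hours = total_samples / RATE_P2_16XLARGE / 3600
slowdown = RATE_P2_XLARGE / RATE_P2_16XLARGE

print(f"samples processed (est.): {total_samples:,}")
print(f"samples per epoch (est.): {samples_per_epoch:,}")
print(f"projected hours on ml.p2.16xlarge: {projected_hours:.0f}")
print(f"slowdown factor: {slowdown:.1f}x")
```

Under those assumptions, the same 4 epochs would take roughly 150 hours on ml.p2.16xlarge, about 12.5 times longer than on ml.p2.xlarge.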
This is what I see in the logs
only 114 out of 240 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:739: only 114 out of 240 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: .vvvvvvvv.......
2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: v.vvvvvvv.......
2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: vv.vvvvvv.......
2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: vvv.vvvvv.......
2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: vvvv.vvvv.......
2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: vvvvv.vvv.......
2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: vvvvvv.vv.......
2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: vvvvvvv.v.......
2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: vvvvvvvv........
2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: ..........vvvvvv
2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: .........v.vvvvv
2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: .........vv.vvvv
2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: .........vvv.vvv
2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: .........vvvv.vv
2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: .........vvvvv.v
2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: .........vvvvvv.
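The matrix printed by comm.h:748 can be checked against the "114 out of 240" summary with a short sketch. It assumes that `v` marks a GPU pair with direct (peer-to-peer) access and `.` marks a pair without it, which is my reading of the log rather than something stated in it. The helper names are hypothetical.

```python
# Rows copied from the comm.h:748 log lines above (one row per GPU, 16 GPUs).
P2P_ROWS = [
    ".vvvvvvvv.......",
    "v.vvvvvvv.......",
    "vv.vvvvvv.......",
    "vvv.vvvvv.......",
    "vvvv.vvvv.......",
    "vvvvv.vvv.......",
    "vvvvvv.vv.......",
    "vvvvvvv.v.......",
    "vvvvvvvv........",
    "..........vvvvvv",
    ".........v.vvvvv",
    ".........vv.vvvv",
    ".........vvv.vvv",
    ".........vvvv.vv",
    ".........vvvvv.v",
    ".........vvvvvv.",
]


def count_p2p_pairs(rows):
    """Return (enabled, total) ordered GPU pairs, excluding a GPU with itself."""
    n = len(rows)
    enabled = sum(row.count("v") for row in rows)  # assumes 'v' = enabled
    return enabled, n * (n - 1)


def peers(rows, gpu):
    """Indices of GPUs that the given GPU can reach directly."""
    return [j for j, flag in enumerate(rows[gpu]) if flag == "v"]


enabled, total = count_p2p_pairs(P2P_ROWS)
print(f"{enabled} out of {total} GPU pairs have direct access")
print("GPU 0 peers: ", peers(P2P_ROWS, 0))
print("GPU 15 peers:", peers(P2P_ROWS, 15))
```

Read this way, the matrix splits the 16 GPUs into two groups (GPUs 0 to 8 and GPUs 9 to 15) with direct access only inside each group, which matches the 114 enabled pairs reported in the warning.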
System information
- SageMaker Python SDK version: 1.71.0
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): Object2Vec
- Custom Docker image (Y/N): N