SageMaker Object2Vec training throughput #1757

@adityagupta970

Description

I am using SageMaker Object2Vec to train on a 2 GB dataset.

On an ml.p2.xlarge instance, training took 12 hours for 4 epochs, running at about 5000 samples/sec.

I have now switched to a larger instance, ml.p2.16xlarge, and it trains at only 400 samples/sec.

I expected ml.p2.16xlarge to train faster, not slower.
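To put the gap in numbers, here is a rough back-of-the-envelope sketch using only the figures reported above. It assumes the throughput was constant and that the whole 12 hours was spent processing samples, neither of which is confirmed; the constant and function names are just illustrative.

```python
# Back-of-the-envelope check using only the figures reported in this issue.
# Assumption: throughput is constant and all 12 hours were spent on training.
HOURS_ON_P2_XLARGE = 12
EPOCHS = 4
RATE_P2_XLARGE = 5000     # samples/sec reported on ml.p2.xlarge
RATE_P2_16XLARGE = 400    # samples/sec reported on ml.p2.16xlarge


def total_samples(hours, samples_per_sec):
    """Samples processed in `hours` at a constant rate."""
    return int(hours * 3600 * samples_per_sec)


def hours_needed(samples, samples_per_sec):
    """Hours required to process `samples` at a constant rate."""
    return samples / samples_per_sec / 3600


samples = total_samples(HOURS_ON_P2_XLARGE, RATE_P2_XLARGE)
samples_per_epoch = samples // EPOCHS
projected_hours = hours_needed(samples, RATE_P2_16XLARGE)
slowdown = RATE_P2_XLARGE / RATE_P2_16XLARGE

print(f"samples processed over {EPOCHS} epochs: {samples:,}")
print(f"samples per epoch: {samples_per_epoch:,}")
print(f"projected hours on ml.p2.16xlarge: {projected_hours:.1f}")
print(f"slowdown factor: {slowdown:.1f}x")
```

Under those assumptions the same 4 epochs would take roughly 150 hours on ml.p2.16xlarge, a 12.5x slowdown.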

This is what I see in the logs:

[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:739: only 114 out of 240 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off

2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: .vvvvvvvv.......

2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: v.vvvvvvv.......

2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: vv.vvvvvv.......

2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: vvv.vvvvv.......

2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: vvvv.vvvv.......

2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: vvvvv.vvv.......

2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: vvvvvv.vv.......

2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: vvvvvvv.v.......

2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: vvvvvvvv........

2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: ..........vvvvvv

2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: .........v.vvvvv

2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: .........vv.vvvv

2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: .........vvv.vvv

2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: .........vvvv.vv

2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: .........vvvvv.v

2020-07-27T23:03:49.956-07:00
[06:03:49] /opt/brazil-pkg-cache/packages/AIAlgorithmsMXNet/AIAlgorithmsMXNet-1.3.x_Cuda_10.1.x.672.0/AL2012/generic-flavor/src/src/kvstore/././comm.h:748: .........vvvvvv.
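The 16 rows printed from comm.h:748 look like a GPU-to-GPU access matrix. Below is a small sketch that tallies it, assuming (my reading of the log, not something the log states) that `v` marks a pair with direct access enabled and `.` marks a pair without it. The rows are copied from the log lines above; the variable names are illustrative.

```python
# Rows copied verbatim from the comm.h:748 log lines above.
# Assumption: 'v' = direct (P2P) access enabled, '.' = not enabled.
rows = [
    ".vvvvvvvv.......",
    "v.vvvvvvv.......",
    "vv.vvvvvv.......",
    "vvv.vvvvv.......",
    "vvvv.vvvv.......",
    "vvvvv.vvv.......",
    "vvvvvv.vv.......",
    "vvvvvvv.v.......",
    "vvvvvvvv........",
    "..........vvvvvv",
    ".........v.vvvvv",
    ".........vv.vvvv",
    ".........vvv.vvv",
    ".........vvvv.vv",
    ".........vvvvv.v",
    ".........vvvvvv.",
]

num_gpus = len(rows)
total_pairs = num_gpus * (num_gpus - 1)           # ordered pairs, no self-pairs
enabled_pairs = sum(row.count("v") for row in rows)

# Group GPUs that share the same peer set (each GPU plus the peers it can reach).
groups = sorted(
    {frozenset([i] + [j for j, c in enumerate(row) if c == "v"])
     for i, row in enumerate(rows)},
    key=min,
)
group_sizes = [len(g) for g in groups]

print(f"{enabled_pairs} out of {total_pairs} GPU pairs enabled")
for g in groups:
    print("group:", sorted(g))
```

The tally matches the warning (114 out of 240), and the pattern splits the 16 GPUs into two groups, GPUs 0-8 and GPUs 9-15, with no direct access between the groups. Whether that split explains the drop to 400 samples/sec is not something the log confirms.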

System information

  • SageMaker Python SDK version: 1.71.0
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): Object2Vec
  • Custom Docker image (Y/N): N
