InternalError only on H100 GPU #1435

@ethancalamia

Hello!

I am working on a 3-year MMM that I intend to run on an H100 GPU in Databricks. Last week, I began getting the error message shown below when calling prior_distribution.PriorDistribution().

InternalError: {{function_node __wrapped__Mul_device_/job:localhost/replica:0/task:0/device:GPU:0}} 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE' [Op:Mul] name: 
[Trace ID: 00-3f7351ac07be6b603522816cce6b1166-b3f8b265419a170f-00]

I have tested this on multiple base environments with no improvement. The exact same notebook runs without this error on an A10 GPU in Databricks. I would prefer the H100 because it offers more GPUs and memory to help with convergence, but I am able to run iterations on the A10 cluster.
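
To make the H100 and A10 clusters easier to compare, a minimal sketch like the one below could be run on both and the outputs diffed. It is not part of the original notebook: the helper name `collect_env_report` and the report keys are hypothetical, and it assumes a TensorFlow 2.x install (it returns a partial report if TensorFlow is missing). It also attempts a bare elementwise multiply on `GPU:0`, since the failing op in the trace is `Mul`, which may help show whether the error reproduces outside the model code.

```python
import importlib
import platform


def collect_env_report():
    """Gather version and GPU details that can be diffed across clusters."""
    report = {
        "python": platform.python_version(),
        "tensorflow": None,
        "cuda_build": None,
        "cudnn_build": None,
        "gpus": [],
        "gpu_mul": "not attempted",
    }
    try:
        tf = importlib.import_module("tensorflow")
    except ImportError:
        return report  # TensorFlow is not installed in this environment

    report["tensorflow"] = tf.__version__

    # CUDA/cuDNN versions this TensorFlow binary was built against.
    build = tf.sysconfig.get_build_info()
    report["cuda_build"] = build.get("cuda_version")
    report["cudnn_build"] = build.get("cudnn_version")

    gpus = tf.config.list_physical_devices("GPU")
    for dev in gpus:
        details = tf.config.experimental.get_device_details(dev)
        report["gpus"].append({
            "name": dev.name,
            "device_name": details.get("device_name"),
            "compute_capability": details.get("compute_capability"),
        })

    if gpus:
        # Bare multiply on the first GPU, independent of any model code.
        try:
            with tf.device("/GPU:0"):
                product = tf.constant([1.0, 2.0]) * tf.constant([3.0, 4.0])
            report["gpu_mul"] = product.numpy().tolist()
        except Exception as exc:  # record the failure instead of raising
            report["gpu_mul"] = f"failed: {type(exc).__name__}: {exc}"

    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

If the bare multiply also fails on the H100 cluster, that would suggest the problem sits in the TensorFlow/CUDA/driver combination rather than in the model code, though this check alone cannot confirm the cause.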

Could someone please assist with how to resolve this?

Thanks!
