InternalError only on H100 GPU

Hello!

I am working on a 3-year MMM that I intend to run on an H100 GPU in Databricks. Last week, I began getting the error shown below when calling `prior_distribution.PriorDistribution()`.

```
InternalError: {{function_node __wrapped__Mul_device_/job:localhost/replica:0/task:0/device:GPU:0}} 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE' [Op:Mul] name: 
[Trace ID: 00-3f7351ac07be6b603522816cce6b1166-b3f8b265419a170f-00]

```

I have tested this on multiple base environments with no improvement. The exact same notebook runs without this error on an A10 GPU in Databricks. I would prefer the H100 cluster because its additional GPUs and memory help with convergence, but I am able to run iterations on the A10 cluster.
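Since the failure appears on only one GPU type, it may help triage to compare the TensorFlow build details and the GPU properties reported on each cluster. The sketch below is a hypothetical diagnostic helper (`collect_env_report` is an illustrative name, not part of any library). It assumes TensorFlow is the backend in use, as the `[Op:Mul]` trace suggests, and it does not by itself confirm or rule out any particular cause.

```python
import importlib
import json


def collect_env_report():
    """Gather TensorFlow/CUDA build details and visible GPU properties.

    Hypothetical helper for comparing two clusters; returns a plain dict.
    """
    report = {"tensorflow": None, "build": {}, "gpus": []}
    try:
        tf = importlib.import_module("tensorflow")
    except ImportError:
        # TensorFlow is not installed here; return the empty report.
        return report

    report["tensorflow"] = tf.__version__

    # CUDA/cuDNN keys are only present on GPU builds, hence .get().
    build = tf.sysconfig.get_build_info()
    report["build"] = {
        "cuda_version": build.get("cuda_version"),
        "cudnn_version": build.get("cudnn_version"),
    }

    for gpu in tf.config.list_physical_devices("GPU"):
        details = tf.config.experimental.get_device_details(gpu)
        report["gpus"].append({
            "name": gpu.name,
            "device_name": details.get("device_name"),
            # (major, minor) tuple when available; H100 is 9.0, A10 is 8.6.
            "compute_capability": details.get("compute_capability"),
        })
    return report


if __name__ == "__main__":
    print(json.dumps(collect_env_report(), indent=2))
```

Running this in a notebook cell on both the H100 and the A10 cluster and attaching the two outputs would show whether the clusters differ in anything besides the GPU model.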

Could someone please assist with how to resolve this?

Thanks!

InternalError only on H100 GPU #1435
