Open
Description
Hello!
I am working on a 3-year MMM that I intend to run on an H100 GPU in Databricks. Last week, I began getting the error message shown below when calling prior_distribution.PriorDistribution().
InternalError: {{function_node __wrapped__Mul_device_/job:localhost/replica:0/task:0/device:GPU:0}} 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE' [Op:Mul] name:
[Trace ID: 00-3f7351ac07be6b603522816cce6b1166-b3f8b265419a170f-00]
I have tested this on multiple base environments with no improvement. The exact same notebook runs without this error on an A10 GPU in Databricks. I would prefer to use the H100 cluster because its additional GPUs and memory help with convergence, but in the meantime I am able to run iterations on the A10 cluster.
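In case it helps with triage, here is a minimal sketch for checking whether the failure reproduces outside the MMM code. It assumes TensorFlow is the backend (the `Op:Mul` in the traceback suggests so). The helper name `collect_gpu_diagnostics` is hypothetical and not part of any library. It reports the TensorFlow version, the CUDA/cuDNN versions the build was compiled against, the details of each visible GPU, and then attempts a single elementwise multiply on the first GPU, which is the same kind of op that fails above.

```python
"""Hypothetical diagnostic helper; not part of any library."""


def collect_gpu_diagnostics():
    """Return a dict describing the TensorFlow/GPU setup and a tiny GPU multiply."""
    info = {"tensorflow_installed": False}
    try:
        import tensorflow as tf
    except ImportError:
        # Nothing more to report if TensorFlow is not importable here.
        return info

    info["tensorflow_installed"] = True
    info["tf_version"] = tf.__version__

    # CUDA/cuDNN versions this TensorFlow binary was built against
    # (the keys are absent on CPU-only builds, hence .get()).
    build = tf.sysconfig.get_build_info()
    info["cuda_build_version"] = build.get("cuda_version")
    info["cudnn_build_version"] = build.get("cudnn_version")

    # Device name and compute capability for each visible GPU.
    gpus = tf.config.list_physical_devices("GPU")
    info["gpus"] = [
        dict(tf.config.experimental.get_device_details(gpu)) for gpu in gpus
    ]

    if gpus:
        try:
            # A single elementwise multiply pinned to the first GPU.
            with tf.device("/GPU:0"):
                product = tf.constant([1.0, 2.0]) * tf.constant([3.0, 4.0])
            info["gpu_mul"] = product.numpy().tolist()
        except Exception as err:  # record the failure instead of raising
            info["gpu_mul_error"] = repr(err)

    return info


if __name__ == "__main__":
    for key, value in collect_gpu_diagnostics().items():
        print(f"{key}: {value}")
```

Running this on both the H100 and the A10 cluster and comparing the two outputs should show whether the multiply already fails in plain TensorFlow on the H100, and whether the two clusters differ in TensorFlow or CUDA build versions.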
Could someone please help me resolve this?
Thanks!