Cannot use GPU on CST in Poly-Grames node006/node007 #61

Open
jcohenadad opened this issue Jul 8, 2022 · 22 comments

Comments

@jcohenadad
Member

jcohenadad commented Jul 8, 2022

@4rnaudB in addition to documenting your debugging, can you pls also describe a way to test a simulation (eg: open a 'dummy' project in CST, click on solver, etc.)
--> Presentation on how to set up the GPU acceleration

Relevant documentation: https://updates.cst.com/downloads/GPU_Computing_Guide_2022.pdf

@JustinDeM

The GPU on Node006 and Node007 is a Tesla V100S-PCIE-32GB:
Screen Shot 2022-07-08 at 11 24 23
which, unlike the Tesla V100-PCIE-32GB, is not directly supported. We will have to do the following (from https://updates.cst.com/downloads/GPU_Computing_Guide_2022.pdf):
Screen Shot 2022-07-08 at 11 26 21

@JustinDeM

I couldn't find out whether the Tesla V100S can run CUDA 11.2 code, but I suspect it can, since it is more powerful than the Tesla V100. If so, to enable the GPU in CST, administrator privilege would be required to set environment variables.
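One way to collect the facts needed for that comparison (exact GPU name, driver version, memory) is to ask the NVIDIA driver directly. The following is a minimal sketch, assuming `nvidia-smi` is installed and on the PATH of the node; it was not run on node006/node007, and the helper names `parse_gpu_csv` and `query_gpus` are made up for illustration. It only reads information and does not change any setting.

```python
import csv
import io
import shutil
import subprocess

# Standard nvidia-smi query fields; the order defines the CSV columns.
QUERY_FIELDS = ["name", "driver_version", "memory.total"]


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output (no header) into a list of dicts."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        dict(zip(QUERY_FIELDS, (cell.strip() for cell in row)))
        for row in rows
        if row  # skip blank lines
    ]


def query_gpus():
    """Return one dict per GPU, or [] if nvidia-smi is missing or fails."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []
    result = subprocess.run(
        [exe, "--query-gpu=" + ",".join(QUERY_FIELDS), "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    gpus = query_gpus()
    if not gpus:
        print("No GPU reported (nvidia-smi missing or failed).")
    for gpu in gpus:
        print(gpu)
```

The reported name and driver version can then be checked by hand against the hardware and driver tables in the GPU Computing Guide linked above.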

@jcohenadad
Member Author

jcohenadad commented Jul 8, 2022

administrator privilege would be required to set environment variables

so we need to loop in JS-- i'll take care of it

EDIT: email sent-- waiting for an answer

@jcohenadad
Member Author

testing on node006, I tried launching HWAccDiagnostics, but got this:
image
Likely related to #61 (comment)

@jcohenadad
Member Author

@JustinDeM @4rnaudB did you manage to run HWAccDiagnostics? Have you tried installing CUDA drivers?

@4rnaudB
Member

4rnaudB commented Jul 8, 2022

trying it right now

@JustinDeM

We get this error message :
Screen Shot 2022-07-08 at 12 42 24

@4rnaudB
Member

4rnaudB commented Jul 8, 2022

It seems we don't have access to C:\, so we can't run it. There is another test you can do if you have access to see if the GPU is recognized.
image

@4rnaudB
Member

4rnaudB commented Jul 8, 2022

Also, this morning we got an error on node006 saying that no GPU was recognized, but we don't get the same error on node007 when running the simulation.

@JustinDeM

When we run the Check GPU Computing macro on Node007, we get this :
Screen Shot 2022-07-08 at 12 53 15

(The Tesla V100S is a Volta-series GPU.)

@jcohenadad
Member Author

There is another test you can do if you have access to see if the GPU is recognized.

I don't have access-- hopefully a Poly admin will help us ASAP

@jcohenadad
Member Author

When we run the Check GPU Computing macro on Node007, we get this :

Ah! That's interesting, when I ran it on node006 I got the error message reported in #61 (comment)

@jcohenadad
Member Author

jcohenadad commented Jul 8, 2022

#61 (comment) is pretty encouraging because the GPU is recognized! What is ECC? Can it be disabled?

EDIT: https://www.nvidia.com/content/Control-Panel-Help/vLatest/en-us/mergedProjects/nvwks/To_change_the_ecc_state.htm
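Besides the NVIDIA Control Panel route linked above, the ECC state can also be read from the command line with `nvidia-smi`. The sketch below assumes `nvidia-smi` is on the PATH and that GPU index 0 is the card of interest; the function names are hypothetical and nothing here was run on node006/node007. It only reads the state and prints the command an administrator could run; it does not change ECC itself, since `nvidia-smi -e 0` needs administrator rights and the new mode only takes effect after a reboot.

```python
import shutil
import subprocess


def parse_ecc_line(line):
    """Parse one 'current, pending' CSV line from nvidia-smi."""
    current, pending = (part.strip() for part in line.split(","))
    return {"current": current, "pending": pending}


def ecc_disable_command(gpu_index=0):
    """Build (but do not run) the command that requests ECC off."""
    return ["nvidia-smi", "-i", str(gpu_index), "-e", "0"]


def query_ecc(gpu_index=0):
    """Return the ECC state of one GPU, or None if it cannot be read."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    result = subprocess.run(
        [
            exe,
            "-i", str(gpu_index),
            "--query-gpu=ecc.mode.current,ecc.mode.pending",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0 or not result.stdout.strip():
        return None
    return parse_ecc_line(result.stdout.strip().splitlines()[0])


if __name__ == "__main__":
    state = query_ecc(0)
    if state is None:
        print("Could not read ECC state (nvidia-smi missing or failed).")
    else:
        print("ECC current:", state["current"], "| pending:", state["pending"])
        print("To request ECC off (admin, then reboot):",
              " ".join(ecc_disable_command(0)))
```

A "pending" value that differs from "current" means a change has been requested but the machine has not been rebooted yet.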

@JustinDeM

JustinDeM commented Jul 8, 2022

Error Correction Code. We are trying to change it, but the Nvidia Control Panel won't open. It could be related to the fact that someone else is using a lot of the resources of Node007 right now:
Screen Shot 2022-07-08 at 13 06 33

@jcohenadad
Copy link
Member Author

jcohenadad commented Jul 8, 2022

It could be related to the fact that someone else is using a lot of the resources of Node007 right now:

Good point-- we should ask the GRAMES admin if they have a way to allocate resources via a priority list, as done in Compute Canada with SLURM. --> #63

@jcohenadad
Member Author

Answer from JS:

This server was set up not long ago. I saw the message on GitHub; I will add the variable a bit later… For now, only ADS has been used on these servers.

@4rnaudB
Member

4rnaudB commented Jul 27, 2022

@4rnaudB in addition to documenting your debugging, can you pls also describe a way to test a simulation (eg: open a 'dummy' project in CST, click on solver, etc.)

Relevant documentation: https://updates.cst.com/downloads/GPU_Computing_Guide_2022.pdf

Should this be an issue or a powerpoint? (I'm planning to put images for a step by step process)

@jcohenadad
Member Author

Should this be an issue or a powerpoint? (I'm planning to put images for a step by step process)

hum, good question. If you prefer creating a PPTX, then I suggest you do it with Gslide, put the document here and link it in this issue?

@4rnaudB
Member

4rnaudB commented Jul 27, 2022

Sounds good

@4rnaudB
Member

4rnaudB commented Jul 27, 2022

The problem is that the link to the google slide would be at the end of the issue so it might not be very intuitive/practical to add it in this issue.

@jcohenadad
Member Author

The problem is that the link to the google slide would be at the end of the issue so it might not be very intuitive/practical to add it in this issue.

so how about adding it at the top? #61 (comment)

@4rnaudB
Member

4rnaudB commented Jul 27, 2022

Oh nice, didn't know I could edit your comment.


3 participants