After running ./images/init_venv.sh, I run this command:
python3.10 -m torch.distributed.run --standalone --nnodes 1 --nproc_per_node 1 tml/projects/twhin/run.py --config_yaml_path="tml/projects/twhin/config/local.yaml" --save_dir="tml/model"
and it fails with this error:
Traceback (most recent call last):
File "/DATA/jupyter/personal/twitter/tml/projects/twhin/run.py", line 105, in
app.run(main)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/DATA/jupyter/personal/twitter/tml/projects/twhin/run.py", line 97, in main
run(
File "/DATA/jupyter/personal/twitter/tml/projects/twhin/run.py", line 70, in run
ctl.train(
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/core/custom_training_loop.py", line 192, in train
outputs = train_step_fn()
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/core/custom_training_loop.py", line 57, in step_fn
outputs = pipeline.progress(data_iterator)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/core/train_pipeline.py", line 582, in progress
(losses, output) = cast(Tuple[torch.Tensor, Out], self._model(batch_i))
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/projects/twhin/models/models.py", line 147, in forward
outputs = self.model(batch)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/model_parallel.py", line 255, in forward
return self._dmp_wrapped_module(*args, **kwargs)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/projects/twhin/models/models.py", line 39, in forward
outs = self.large_embeddings(batch.nodes)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/common/modules/embedding/embedding.py", line 55, in forward
pooled_embs = self.ebc(sparse_features)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/types.py", line 594, in forward
dist_input = self.input_dist(ctx, *input, **kwargs).wait().wait()
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/embeddingbag.py", line 424, in input_dist
input_dist(
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/sharding/rw_sharding.py", line 303, in forward
) = bucketize_kjt_before_all2all(
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/embedding_sharding.py", line 169, in bucketize_kjt_before_all2all
) = torch.ops.fbgemm.block_bucketize_sparse_features(
File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/_ops.py", line 442, in __call__
return self._op(*args, **kwargs or {})
NotImplementedError: Could not run 'fbgemm::block_bucketize_sparse_features' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'fbgemm::block_bucketize_sparse_features' is only available for these backends: [CPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].
Here is my CUDA version:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06 Driver Version: 470.129.06 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:3B:00.0 Off | 0 |
| N/A 33C P0 41W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Here is my pip list:
Package Version
absl-py 1.4.0
aiofiles 22.1.0
aiohttp 3.8.3
aiosignal 1.3.1
appdirs 1.4.4
arrow 1.2.3
asttokens 2.2.1
astunparse 1.6.3
async-timeout 4.0.2
attrs 22.1.0
backcall 0.2.0
black 22.6.0
cachetools 5.3.0
cblack 22.6.0
certifi 2022.12.7
cfgv 3.3.1
charset-normalizer 2.1.1
click 8.1.3
cmake 3.25.0
Cython 0.29.32
decorator 5.1.1
distlib 0.3.6
distro 1.8.0
dm-tree 0.1.6
docker 6.0.1
docker-pycreds 0.4.0
docstring-parser 0.8.1
exceptiongroup 1.1.0
executing 1.2.0
fbgemm-gpu-cpu 0.3.2
filelock 3.8.2
fire 0.5.0
flatbuffers 1.12
frozenlist 1.3.3
fsspec 2022.11.0
gast 0.4.0
gcsfs 2022.11.0
gitdb 4.0.10
GitPython 3.1.31
google-api-core 2.8.2
google-auth 2.16.0
google-auth-oauthlib 0.4.6
google-cloud-core 2.3.2
google-cloud-storage 2.7.0
google-crc32c 1.5.0
google-pasta 0.2.0
google-resumable-media 2.4.1
googleapis-common-protos 1.56.4
grpcio 1.51.1
h5py 3.8.0
hypothesis 6.61.0
identify 2.5.17
idna 3.4
importlib-metadata 6.0.0
iniconfig 2.0.0
iopath 0.1.10
ipdb 0.13.11
ipython 8.10.0
jedi 0.18.2
Jinja2 3.1.2
keras 2.9.0
Keras-Preprocessing 1.1.2
libclang 15.0.6.1
libcst 0.4.9
Markdown 3.4.1
MarkupSafe 2.1.1
matplotlib-inline 0.1.6
moreorless 0.4.0
multidict 6.0.4
mypy 1.0.1
mypy-extensions 0.4.3
nest-asyncio 1.5.6
ninja 1.11.1
nodeenv 1.7.0
numpy 1.22.0
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
oauthlib 3.2.2
opt-einsum 3.3.0
packaging 22.0
pandas 1.5.3
parso 0.8.3
pathspec 0.11.0
pathtools 0.1.2
pexpect 4.8.0
pickleshare 0.7.5
pip 23.1
platformdirs 3.0.0
pluggy 1.0.0
portalocker 2.6.0
portpicker 1.5.2
pre-commit 3.0.4
prompt-toolkit 3.0.36
protobuf 3.20.2
psutil 5.9.4
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 10.0.1
pyasn1 0.4.8
pyasn1-modules 0.2.8
pydantic 1.9.0
pyDeprecate 0.3.2
Pygments 2.14.0
pyparsing 3.0.9
pyre-extensions 0.0.27
pytest 7.2.1
pytest-mypy 0.10.3
python-dateutil 2.8.2
pytz 2022.6
PyYAML 6.0
requests 2.28.1
requests-oauthlib 1.3.1
rsa 4.9
scikit-build 0.16.3
sentry-sdk 1.16.0
setproctitle 1.3.2
setuptools 65.5.0
six 1.16.0
smmap 5.0.0
sortedcontainers 2.4.0
stack-data 0.6.2
stdlibs 2022.10.9
tabulate 0.9.0
tensorboard 2.9.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorflow 2.9.3
tensorflow-estimator 2.9.0
tensorflow-io-gcs-filesystem 0.30.0
termcolor 2.2.0
toml 0.10.2
tomli 2.0.1
torch 1.13.1
torchmetrics 0.11.0
torchrec 0.3.2
torchsnapshot 0.1.0
torchx 0.3.0
tqdm 4.64.1
trailrunner 1.2.1
traitlets 5.9.0
typing_extensions 4.4.0
typing-inspect 0.8.0
urllib3 1.26.13
usort 1.0.5
virtualenv 20.19.0
wandb 0.13.11
wcwidth 0.2.6
websocket-client 1.4.2
Werkzeug 2.2.3
wrapt 1.14.1
yarl 1.8.2
zipp 3.12.1
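One thing that stands out to me in the list above: the installed fbgemm build is fbgemm-gpu-cpu, which as far as I understand is the CPU-only wheel, so it would not register CUDA kernels such as fbgemm::block_bucketize_sparse_features. Here is the quick check I did (the package swap at the end is only my guess, not a verified fix):

```shell
# Show which fbgemm build is installed in the venv.
# The CPU-only wheel is published as fbgemm-gpu-cpu; the CUDA wheel as fbgemm-gpu.
pip list | grep -i fbgemm

# If only fbgemm-gpu-cpu appears, replacing it with the CUDA build (version
# matched to torchrec 0.3.2) might register the missing CUDA kernels -- this
# is my assumption, not something I have confirmed:
# pip uninstall -y fbgemm-gpu-cpu
# pip install fbgemm-gpu==0.3.2
```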
Thank you very much for helping me with this configuration problem!