Why did I get this error message? #138

Open

zhuchenxi opened this issue Apr 20, 2023 · 0 comments

When I run ./images/init_venv.sh and then run this command:
python3.10 -m torch.distributed.run --standalone --nnodes 1 --nproc_per_node 1 tml/projects/twhin/run.py --config_yaml_path="tml/projects/twhin/config/local.yaml" --save_dir="tml/model"

it fails with this error:
Traceback (most recent call last):
  File "/DATA/jupyter/personal/twitter/tml/projects/twhin/run.py", line 105, in <module>
    app.run(main)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/DATA/jupyter/personal/twitter/tml/projects/twhin/run.py", line 97, in main
    run(
  File "/DATA/jupyter/personal/twitter/tml/projects/twhin/run.py", line 70, in run
    ctl.train(
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/core/custom_training_loop.py", line 192, in train
    outputs = train_step_fn()
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/core/custom_training_loop.py", line 57, in step_fn
    outputs = pipeline.progress(data_iterator)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/core/train_pipeline.py", line 582, in progress
    (losses, output) = cast(Tuple[torch.Tensor, Out], self._model(batch_i))
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/projects/twhin/models/models.py", line 147, in forward
    outputs = self.model(batch)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/model_parallel.py", line 255, in forward
    return self._dmp_wrapped_module(*args, **kwargs)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/projects/twhin/models/models.py", line 39, in forward
    outs = self.large_embeddings(batch.nodes)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/tml/common/modules/embedding/embedding.py", line 55, in forward
    pooled_embs = self.ebc(sparse_features)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/types.py", line 594, in forward
    dist_input = self.input_dist(ctx, *input, **kwargs).wait().wait()
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/embeddingbag.py", line 424, in input_dist
    input_dist(
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/sharding/rw_sharding.py", line 303, in forward
    ) = bucketize_kjt_before_all2all(
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torchrec/distributed/embedding_sharding.py", line 169, in bucketize_kjt_before_all2all
    ) = torch.ops.fbgemm.block_bucketize_sparse_features(
  File "/DATA/jupyter/personal/twitter/tml/tml_venv/lib/python3.10/site-packages/torch/_ops.py", line 442, in __call__
    return self._op(*args, **kwargs or {})
NotImplementedError: Could not run 'fbgemm::block_bucketize_sparse_features' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'fbgemm::block_bucketize_sparse_features' is only available for these backends: [CPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].
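
As a first diagnostic, a minimal snippet like the one below (just a sketch; it assumes torch and fbgemm_gpu import cleanly inside tml_venv) should show which torch/CUDA build is active and whether the fbgemm op is registered at all:

import torch
import fbgemm_gpu  # importing fbgemm_gpu registers the fbgemm::* custom ops with the torch dispatcher

print(torch.__version__, torch.version.cuda)              # torch.version.cuda is None on a CPU-only torch wheel
print(torch.cuda.is_available())                          # True only if torch can actually see the GPU
print(fbgemm_gpu.__file__)                                # shows which FBGEMM build is actually imported
print(torch.ops.fbgemm.block_bucketize_sparse_features)   # fails if the op is not registered at all

I also notice that my pip list below shows fbgemm-gpu-cpu 0.3.2, which as far as I know is the CPU-only build of FBGEMM, so that may be why only the CPU backend is listed for this op, but I am not sure.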

Here is my CUDA version:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   33C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Here is my pip list:
Package Version


absl-py 1.4.0
aiofiles 22.1.0
aiohttp 3.8.3
aiosignal 1.3.1
appdirs 1.4.4
arrow 1.2.3
asttokens 2.2.1
astunparse 1.6.3
async-timeout 4.0.2
attrs 22.1.0
backcall 0.2.0
black 22.6.0
cachetools 5.3.0
cblack 22.6.0
certifi 2022.12.7
cfgv 3.3.1
charset-normalizer 2.1.1
click 8.1.3
cmake 3.25.0
Cython 0.29.32
decorator 5.1.1
distlib 0.3.6
distro 1.8.0
dm-tree 0.1.6
docker 6.0.1
docker-pycreds 0.4.0
docstring-parser 0.8.1
exceptiongroup 1.1.0
executing 1.2.0
fbgemm-gpu-cpu 0.3.2
filelock 3.8.2
fire 0.5.0
flatbuffers 1.12
frozenlist 1.3.3
fsspec 2022.11.0
gast 0.4.0
gcsfs 2022.11.0
gitdb 4.0.10
GitPython 3.1.31
google-api-core 2.8.2
google-auth 2.16.0
google-auth-oauthlib 0.4.6
google-cloud-core 2.3.2
google-cloud-storage 2.7.0
google-crc32c 1.5.0
google-pasta 0.2.0
google-resumable-media 2.4.1
googleapis-common-protos 1.56.4
grpcio 1.51.1
h5py 3.8.0
hypothesis 6.61.0
identify 2.5.17
idna 3.4
importlib-metadata 6.0.0
iniconfig 2.0.0
iopath 0.1.10
ipdb 0.13.11
ipython 8.10.0
jedi 0.18.2
Jinja2 3.1.2
keras 2.9.0
Keras-Preprocessing 1.1.2
libclang 15.0.6.1
libcst 0.4.9
Markdown 3.4.1
MarkupSafe 2.1.1
matplotlib-inline 0.1.6
moreorless 0.4.0
multidict 6.0.4
mypy 1.0.1
mypy-extensions 0.4.3
nest-asyncio 1.5.6
ninja 1.11.1
nodeenv 1.7.0
numpy 1.22.0
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
oauthlib 3.2.2
opt-einsum 3.3.0
packaging 22.0
pandas 1.5.3
parso 0.8.3
pathspec 0.11.0
pathtools 0.1.2
pexpect 4.8.0
pickleshare 0.7.5
pip 23.1
platformdirs 3.0.0
pluggy 1.0.0
portalocker 2.6.0
portpicker 1.5.2
pre-commit 3.0.4
prompt-toolkit 3.0.36
protobuf 3.20.2
psutil 5.9.4
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 10.0.1
pyasn1 0.4.8
pyasn1-modules 0.2.8
pydantic 1.9.0
pyDeprecate 0.3.2
Pygments 2.14.0
pyparsing 3.0.9
pyre-extensions 0.0.27
pytest 7.2.1
pytest-mypy 0.10.3
python-dateutil 2.8.2
pytz 2022.6
PyYAML 6.0
requests 2.28.1
requests-oauthlib 1.3.1
rsa 4.9
scikit-build 0.16.3
sentry-sdk 1.16.0
setproctitle 1.3.2
setuptools 65.5.0
six 1.16.0
smmap 5.0.0
sortedcontainers 2.4.0
stack-data 0.6.2
stdlibs 2022.10.9
tabulate 0.9.0
tensorboard 2.9.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorflow 2.9.3
tensorflow-estimator 2.9.0
tensorflow-io-gcs-filesystem 0.30.0
termcolor 2.2.0
toml 0.10.2
tomli 2.0.1
torch 1.13.1
torchmetrics 0.11.0
torchrec 0.3.2
torchsnapshot 0.1.0
torchx 0.3.0
tqdm 4.64.1
trailrunner 1.2.1
traitlets 5.9.0
typing_extensions 4.4.0
typing-inspect 0.8.0
urllib3 1.26.13
usort 1.0.5
virtualenv 20.19.0
wandb 0.13.11
wcwidth 0.2.6
websocket-client 1.4.2
Werkzeug 2.2.3
wrapt 1.14.1
yarl 1.8.2
zipp 3.12.1
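
If the CPU-only FBGEMM wheel is the cause, one thing I could try (just a guess, not a confirmed fix; the fbgemm-gpu package name and the 0.3.2 version are assumed to match the torchrec 0.3.2 I have installed) is swapping it for the CUDA build:

pip uninstall fbgemm-gpu-cpu
pip install fbgemm-gpu==0.3.2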

Thank you very much for helping me with this configuration problem.
