
dataset issue #1

Open
WeianMao opened this issue Nov 6, 2022 · 1 comment

WeianMao commented Nov 6, 2022

Thanks for your great work, but I have run into a bug.

I used the dataset provided by hfai, in ffrecord format.

From the ffrecord dataloader I get the (pdb_code, mmcif_string, bfd_hits, mgnify_hits, pdb70_hits, uniref90_hits) tuple for each sample. However, for the sample '2ljb', the content of the ffrecord file differs from the original OpenFold dataset: the ffrecord copy contains only chains A and B, while the original mmCIF has four chains, A, B, C, and D. A quick check like the sketch below confirms this.
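Here is a minimal sketch of how to list the chains in the mmcif_string returned by the ffrecord dataloader, assuming OpenFold's `mmcif_parsing` module (the exact call sites in this repo may differ):

```python
# Minimal check: parse the mmcif_string from the ffrecord sample and list
# the chains it actually contains (assumes openfold.data.mmcif_parsing).
from openfold.data import mmcif_parsing

def list_chains(file_id: str, mmcif_string: str):
    result = mmcif_parsing.parse(file_id=file_id, mmcif_string=mmcif_string)
    if result.mmcif_object is None:
        raise ValueError(f"failed to parse {file_id}: {result.errors}")
    return sorted(result.mmcif_object.chain_to_seqres.keys())

# For '2ljb', the ffrecord copy gives ['A', 'B'], while the original 2ljb.cif
# from the OpenFold dataset gives ['A', 'B', 'C', 'D'].
```
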
As a result, when I start training, I receive this error:
```
Traceback (most recent call last):
  File "/home/wayne/miniconda3/envs/hfof/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/wayne/project/folding/alphafold-optimized/train_fold.py", line 100, in main
    for idx, batch in enumerate(dataloader):
  File "/home/wayne/project/folding/alphafold-optimized/openfold/data/data_modules.py", line 485, in _batch_prop_gen
    for batch in iterator:
  File "/home/wayne/miniconda3/envs/hfof/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/wayne/miniconda3/envs/hfof/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/home/wayne/miniconda3/envs/hfof/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/wayne/miniconda3/envs/hfof/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/wayne/miniconda3/envs/hfof/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/wayne/miniconda3/envs/hfof/lib/python3.8/site-packages/ffrecord-1.4.0+0cebd18-py3.8-linux-x86_64.egg/ffrecord/torch/dataloader.py", line 155, in fetch
    data = self.dataset[indexes]
  File "/home/wayne/project/folding/alphafold-optimized/hfaidataset.py", line 58, in __getitem__
    mmcifdata = self.transform(*mmcifdata)
  File "/home/wayne/project/folding/alphafold-optimized/openfold/data/data_modules.py", line 151, in __call__
    data = self._parse_mmcif(
  File "/home/wayne/project/folding/alphafold-optimized/openfold/data/data_modules.py", line 130, in _parse_mmcif
    data = self.data_pipeline.process_mmcif_hfai(
  File "/home/wayne/project/folding/alphafold-optimized/openfold/data/data_pipeline.py", line 820, in process_mmcif_hfai
    mmcif_feats = make_mmcif_features(mmcif, chain_id)
  File "/home/wayne/project/folding/alphafold-optimized/openfold/data/data_pipeline.py", line 97, in make_mmcif_features
    input_sequence = mmcif_object.chain_to_seqres[chain_id]
KeyError: 'D'
```

Because the file ID is 2ljb_D, the transform function tries to index chain D in 2ljb.cif, but there is no chain D in that cif file. I checked the original OpenFold dataset, and its mmCIF does contain a chain D, so I think something is wrong with the ffrecord dataset. How can this be fixed? Thank you very much.
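As a temporary workaround (not a fix for the underlying ffrecord data), one could guard the chain lookup before `make_mmcif_features` runs and fall back to a chain that actually exists, or skip the sample entirely. This helper is hypothetical, not part of the repo:

```python
import random

def resolve_chain_id(mmcif_object, requested_chain_id: str) -> str:
    """Return the requested chain if the parsed mmCIF contains it; otherwise
    fall back to a random chain that does exist (hypothetical workaround so
    training does not crash on samples like 2ljb_D)."""
    available = list(mmcif_object.chain_to_seqres.keys())
    if requested_chain_id in available:
        return requested_chain_id
    return random.choice(available)
```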

By the way, I noticed that a cache path, `cachePath = f"../full_dataset/mmcif_parse_cache/{file_id}.pkl"`, has been added in the hfai implementation. I am worried that it will cause heavy I/O and make the ffrecord dataset pointless. Am I wrong? Thanks!
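For context on the I/O concern: the cachePath string suggests a per-sample pickle cache roughly like the sketch below (this is my guess at the pattern, not the actual hfai code). If so, every `__getitem__` still reads or writes one small file on shared storage, which is exactly the access pattern ffrecord is designed to avoid:

```python
import os
import pickle

def load_or_parse_mmcif(file_id, mmcif_string, parse_fn,
                        cache_dir="../full_dataset/mmcif_parse_cache"):
    # Presumed shape of the parse cache (an assumption based on the cachePath
    # above): one pickle per sample, touched on every dataset access.
    cache_path = os.path.join(cache_dir, f"{file_id}.pkl")
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    mmcif_object = parse_fn(file_id=file_id, mmcif_string=mmcif_string)
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_path, "wb") as f:
        pickle.dump(mmcif_object, f)
    return mmcif_object
```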

WeianMao commented Nov 6, 2022

The random seed I used is 4242022.
