oemol.GetConfs() consuming large amount of memory even when no conformers are present #1855

lilyminium · 2024-04-05T06:43:03Z

Describe the bug

Not a bug per se, but could impact on toolkit usability for large molecules -- while debugging openforcefield/openff-nagl#101 I saw that converting molecules to and from OpenEye consumes a large amount of memory that is not seen with RDKit. For a 5177 atom protein, calling Molecule.from_openeye consumes about 800 MiB. Memray attributes most of this to oeconf.GetCoords, even though no conformers are generated or attached at any point to the molecule. Would it be possible to check for conformers before calling conf.GetCoords? (It may be that this triggers the same memory-consuming process, though!)

To Reproduce

mre.py (also attached):

from openff.toolkit import Molecule

protein = Molecule.from_smiles(
    "CC[C@H](C)[C@H](NC(=O)CNC(=O)CNC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](CC(=O)[O-])NC(=O)[C@H](CC(=O)[O-])NC(=O)[C@H](CCCNC(N)=[NH2+])NC(=O)CNC(=O)[C@H](CS)NC(=O)[C@@H](NC(=O)[C@H](CCCNC(N)=[NH2+])NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@@H](NC(=O)[C@H](CC(N)=O)NC(=O)[C@H](Cc1c[nH]c2ccccc12)NC(=O)[C@H](CC(C)C)NC(=O)CNC(=O)[C@H](CCCNC(N)=[NH2+])NC(=O)[C@H](CCC(=O)[O-])NC(=O)[C@@H](NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@@H](NC(=O)[C@@H]1CCCN1C(=O)[C@H](CC(N)=O)NC(=O)[C@@H](NC(=O)[C@H](C)NC(=O)[C@H](C)NC(=O)[C@@H]1CCCN1C(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@H](CCC(=O)[O-])NC(=O)[C@H](CC(C)C)NC(=O)[C@@H](NC(=O)CNC(=O)[C@H](CC(N)=O)NC(=O)[C@H](CCC(=O)[O-])NC(=O)[C@H](C)NC(=O)[C@H](Cc1ccc(O)cc1)NC(=O)[C@@H](NC(=O)[C@H](CCC(=O)[O-])NC(=O)[C@H](CO)NC(=O)[C@H](C)NC(=O)[C@@H]1CCCN1C(=O)[C@@H](NC(=O)[C@H](CO)NC(=O)[C@@H](NC(=O)CNC(=O)[C@H](CO)NC(=O)[C@H](CC(N)=O)NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](Cc1cnc[nH]1)NC(=O)CNC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](CCCC[NH3+])NC(=O)[C@@H](NC(=O)CNC(=O)[C@H](CC(=O)[O-])NC(=O)[C@H](C)NC(=O)[C@@H](NC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](CCCC[NH3+])NC(=O)[C@H](CC(N)=O)NC(=O)[C@H](C)NC(=O)[C@@H](NC(=O)[C@H](CO)NC(=O)[C@@H](NC(=O)CNC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](CCCC[NH3+])NC(=O)[C@@H]1CCCN1C(=O)[C@H](CC(=O)[O-])NC(=O)[C@H](CO)NC(=O)[C@@H](NC(=O)[C@H](CC(=O)[O-])NC(=O)[C@H](CC(=O)[O-])NC(=O)CNC(=O)[C@H](CC(C)C)NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@H](CCCC[NH3+])NC(=O)[C@@H](NC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](CCC(=O)[O-])NC(=O)CNC(=O)[C@H](CC(N)=O)NC(=O)[C@@H](NC(=O)CNC(=O)CNC(=O)[C@H](C)NC(=O)[C@H](C)NC(=O)[C@H](CC(N)=O)NC(=O)[C@@H](NC(=O)[C@H](CC(=O)[O-])NC(=O)[C@H](CCCC[NH3+])NC(=O)[C@H](C)NC(=O)[C@H](C)NC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](CCC(=O)[O-])NC(=O)[C@H](C)NC(=O)CNC(=O)[C@H](CCCC[NH3+])NC(=O)[C@H](CCC(N)=O)NC(=O)[C@@H](NC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](C)NC(=O)CNC(=O)[C@H](Cc1ccccc1)NC(=O)[C@H](C)NC(=O)[C@H](C)NC(=O)[C@H](CC(N)=O)NC(=O)[C@@H]1CCCN1C(=O)CNC(=O)[C@@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@@H]1CCCN1C(=O)[C@H](C)NC(=O)CNC(=O)[C@@H]
(NC(=O)[C@H](C)NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@H](CC(=O)[O-])NC(=O)[C@@H]([NH3+])CCSC)C(C)C)C(C)C)[C@@H](C)CC)C(C)C)[C@@H](C)O)[C@@H](C)CC)[C@@H](C)CC)[C@@H](C)CC)[C@@H](C)CC)[C@@H](C)CC)C(C)C)C(C)C)[C@@H](C)CC)C(C)C)C(C)C)C(C)C)C(C)C)C(C)C)C(C)C)[C@@H](C)CC)C(C)C)[C@@H](C)CC)[C@@H](C)CC)[C@@H](C)O)[C@@H](C)O)C(C)C)[C@@H](C)O)[C@@H](C)O)[C@@H](C)O)C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](C)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](Cc1c[nH]cn1)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](C)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@H](C(=O)N[C@@H](C)C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@@H](Cc1cnc[nH]1)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@H](C(=O)N1CCC[C@H]1C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)NCC(=O)N[C@@H](CCC(N)=O)C(=O)NCC(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](C)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@H](C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](CC(N)=O)C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@H](C(=O)N[C@@H](CCSC)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](CCC(=O)[O-])C(=O)NCC(=O)N[C@H](C(=O)N[C@@H](CC(N)=O)C(=O)N[C@H](C(=O)N[C@@H](CS[C@H]1CC(=O)N(c2ccc3c(c2)C(=O)OC32c3ccc(O)cc3Oc3cc(O)ccc32)C1=O)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CO)C(=O)N[C@@H](C)C(=O)N[C@@H](CC(C)C)C(=O)N[C@H](C(=O)N[C@@H](CO)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CCSC)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@H](C(=O)N[C@@H](CO)C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](Cc1c[nH]c2ccccc12)C(=O)NCC(=O)NCC(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](Cc1c[nH]cn1)C(=O)N[C@H](C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@@H](CC(C)C)C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@@H](CCCNC(N)=[NH2+])C(=O)N[C@@H](CCC(N)=O)C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](CC(=O)[O-
])C(=O)N[C@@H](CCC(N)=O)C(=O)NCC(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](C)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CC(C)C)C(=O)N[C@H](C(=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC(=O)[O-])C(=O)NCC(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@@H](CO)C(=O)N[C@@H](CC(N)=O)C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](C)C(=O)N[C@@H](CO)C(=O)N[C@H](C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](C)C(=O)N[C@H](C(=O)N[C@@H](CCC(=O)[O-])C(=O)NCC(=O)N[C@H](C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CC(N)=O)C(=O)N[C@H](C(=O)N[C@@H](Cc1ccccc1)C(=O)NCC(=O)N1CCC[C@H]1C(=O)N[C@@H](CC(=O)[O-])C(=O)N1CCC[C@H]1C(=O)N[C@H](C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCNC(N)=[NH2+])C(=O)N1CCC[C@H]1C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](CC(N)=O)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](CC(C)C)C(=O)N[C@H](C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CC(N)=O)C(=O)N1CCC[C@H]1C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](C)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@H](C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](CO)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](CCSC)C(=O)N[C@@H](CCC(N)=O)C(=O)N[C@@H](C)C(=O)N[C@H](C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)NCC(=O)N[C@@H](CO)C(=O)N[C@H](C(=O)N[C@@H](CCC(=O)[O-])C(=O)N1CCC[C@H]1C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@H](C(=O)N[C@@H](C)C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](C)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CCCC[NH3+])C(=O)NCC(=O)N[C@@H](CO)C(=O)N[C@@H](Cc1ccccc1)C(=O)N1CCC[C@H]1C(=O)N[C@H](C(=O)N[C@@H](C)C(=O)N[C@@H](CC(C)C)C(=O)NCC(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@H](C(=O)N[C@@H](CO)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](CCCC[NH3+])C(=O)NCC(=O)N[C@@H](CC(=O)[O-])C(=O)N1CCC[C@H]1C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CC(C)C)C(=O)N1CCC[C@
H]1C(=O)NCC(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@H](C(=O)N[C@@H](CCSC)C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](Cc1c[nH]c2ccccc12)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CCCC[NH3+])C(=O)NCC(=O)N1CCC[C@H]1C(=O)N[C@@H](CC(=O)[O-])C(=O)NCC(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@H](C(=O)N[C@@H](Cc1ccc(O)cc1)C(=O)N[C@H](C(=O)N[C@@H](CCC(N)=O)C(=O)N[C@@H](CCC(N)=O)C(=O)NC)[C@@H](C)CC)[C@@H](C)O)C(C)C)[C@@H](C)CC)[C@@H](C)O)C(C)C)C(C)C)[C@@H](C)CC)[C@@H](C)O)C(C)C)[C@@H](C)O)[C@@H](C)O)[C@@H](C)O)C(C)C)[C@@H](C)CC)C(C)C)[C@@H](C)CC)C(C)C)[C@@H](C)CC)[C@@H](C)CC)[C@@H](C)O)[C@@H](C)CC)[C@@H](C)CC)C(C)C)[C@@H](C)CC)C(C)C)C(C)C)C(C)C)[C@@H](C)O)C(C)C)[C@@H](C)O)[C@@H](C)O)[C@@H](C)CC)[C@@H](C)CC)C(C)C"
)
oemol = protein.to_openeye()
offmol = Molecule.from_openeye(oemol)

Requires memray installed:

$ python -m memray run mre.py
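
For a rough check without memray, the standard library's `tracemalloc` can report peak memory for a single call. This is a minimal sketch, and `peak_mib` is a hypothetical helper name. Note that `tracemalloc` only sees allocations made through Python's own allocators, so memory allocated inside a compiled extension may be undercounted relative to what memray reports.

```python
import tracemalloc


def peak_mib(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 2**20


# Example with a pure-Python allocation; swap in the conversion call of interest.
_, peak = peak_mib(lambda: [float(i) for i in range(100_000)])
print(f"peak: {peak:.2f} MiB")
```

With the toolkit installed, the same helper could wrap something like `Molecule.from_openeye(oemol)` from the script above.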

The screenshot points to this line:

openff-toolkit/openff/toolkit/utils/openeye_wrapper.py

Line 1329 in 97af593

off_atom_coords = conf.GetCoords()[oe_id]
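
If that line runs once per atom (an assumption; the surrounding loop is not shown here), each `conf.GetCoords()` call would build a fresh mapping of every atom's coordinates just to read one entry, so total allocation would grow quadratically with atom count. The sketch below uses a hypothetical stand-in class, `FakeConf`, rather than OpenEye itself, to show that pattern next to a hoisted alternative that calls `GetCoords()` once per conformer.

```python
class FakeConf:
    """Hypothetical stand-in for an OpenEye conformer; NOT the real API."""

    def __init__(self, n_atoms):
        self.n_atoms = n_atoms
        self.calls = 0  # how many times GetCoords() was invoked

    def GetCoords(self):
        # Assumed behaviour: every call builds a new dict covering all atoms.
        self.calls += 1
        return {i: (0.0, 0.0, 0.0) for i in range(self.n_atoms)}


def coords_per_atom_call(conf):
    """Pattern under suspicion: one full GetCoords() call per atom."""
    positions = {}
    for oe_id in range(conf.n_atoms):
        positions[oe_id] = conf.GetCoords()[oe_id]
    return positions


def coords_hoisted(conf):
    """Alternative: fetch the mapping once, then index into it."""
    all_coords = conf.GetCoords()
    return {oe_id: all_coords[oe_id] for oe_id in range(conf.n_atoms)}


slow, fast = FakeConf(200), FakeConf(200)
assert coords_per_atom_call(slow) == coords_hoisted(fast)
print(slow.calls, fast.calls)  # 200 calls versus 1
```

Both functions return the same positions; only the number of full-coordinate mappings built along the way differs.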

Output

Computing environment (please complete the following information):

Operating system
Output of running conda list

  Name                     Version    Build         Channel
─────────────────────────────────────────────────────────────────
  openff-amber-ff-ports    0.0.4      pyhca7485f_0  conda-forge
  openff-forcefields       2024.03.0  pyhca7485f_0  conda-forge
  openff-interchange-base  0.3.25     pyhd8ed1ab_0  conda-forge
  openff-models            0.1.2      pyhca7485f_0  conda-forge
  openff-nagl              0.3.6      pyhd8ed1ab_0  conda-forge
  openff-nagl-base         0.3.6      pyhd8ed1ab_0  conda-forge
  openff-nagl-models       0.1.2      pyhd8ed1ab_0  conda-forge
  openff-recharge          0.5.2      pyhd8ed1ab_0  conda-forge
  openff-toolkit-base      0.15.2     pyhd8ed1ab_0  conda-forge
  openff-units             0.2.2      pyhca7485f_0  conda-forge
  openff-utilities         0.1.12     pyhd8ed1ab_0  conda-forge

Additional context

mre.zip

Manifest:

mre.py (includes the protein SMILES)
memray-mre.py.10332.bin: the output of memray
memray-flamegraph-mre.py.10332.html: the interactive graph in the screenshot

mattwthompson · 2024-04-05T14:59:18Z

I tried implementing this since it should be easy, but it's not. Simply adding a NumConfs() call doesn't do the trick. I don't know how to check for OpenEye's annoying "courtesy conformer" without calling out to GetConfs(), which I understand to be the problem:

In [32]: oemol = Molecule.from_smiles("CCO").to_openeye()

In [33]: oemol.NumConfs()
Out[33]: 1

In [34]: [*oemol.GetConfs()][0].GetCoords()
Out[34]:
{0: (0.0, 0.0, 0.0),
 1: (0.0, 0.0, 0.0),
 2: (0.0, 0.0, 0.0),
 3: (0.0, 0.0, 0.0),
 4: (0.0, 0.0, 0.0),
 5: (0.0, 0.0, 0.0),
 6: (0.0, 0.0, 0.0),
 7: (0.0, 0.0, 0.0),
 8: (0.0, 0.0, 0.0)}

In [35]: molecule = Molecule.from_smiles("O=S(=O)(N)c1c(Cl)cc2c(c1)S(=O)(=O)NCN2")

In [36]: molecule.generate_conformers(n_conformers=1)

In [37]: oemol = molecule.to_openeye()

In [38]: oemol.NumConfs()
Out[38]: 1

In [39]: [*oemol.GetConfs()][0].GetCoords()
Out[39]:
{0: (1.8719326257705688, 3.7204949855804443, 2.2212681770324707),
 1: (1.2912099361419678, 4.097604274749756, 0.9475870132446289),
 2: (0.3753527104854584, 5.218091011047363, 0.8554574251174927),
 3: (2.534075975418091, 4.290732383728027, -0.20339979231357574),
 4: (0.4765625, 2.689453125, 0.296875),
 5: (-0.5654296875, 2.794921875, -0.62060546875),
 6: (-1.133737325668335, 4.327646255493164, -1.178165078163147),
 7: (-1.181640625, 1.642578125, -1.119140625),
 8: (-0.7685546875, 0.360595703125, -0.7216796875),
 9: (0.280029296875, 0.290283203125, 0.2086181640625),
 10: (0.90185546875, 1.43359375, 0.71923828125),
 11: (0.84521484375, -1.2734375, 0.802734375),
 12: (1.9853515625, -1.6318359375, -0.0174713134765625),
 13: (0.93994140625, -1.193359375, 2.24609375),
 14: (-0.484130859375, -2.26953125, 0.403076171875),
 15: (-0.9970703125, -2.099609375, -0.96826171875),
 16: (-1.4326171875, -0.7451171875, -1.2314453125),
 17: (2.4708824157714844, 5.089303016662598, -0.8458374738693237),
 18: (3.4932398796081543, 4.066527366638184, 0.08694052696228027),
 19: (-2.0012941360473633, 1.7379204034805298, -1.8306996822357178),
 20: (1.7089277505874634, 1.3463486433029175, 1.4417718648910522),
 21: (-1.2067643404006958, -2.400458812713623, 1.1223875284194946),
 22: (-0.20102126896381378, -2.3644607067108154, -1.6715291738510132),
 23: (-1.8237838745117188, -2.7984254360198975, -1.1371092796325684),
 24: (-2.3384859561920166, -0.6081215143203735, -1.6641637086868286)}
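
The two outputs above suggest one possible heuristic: treat a lone conformer whose coordinates are all exactly zero as the courtesy conformer. The helper below, `looks_like_courtesy_conformer`, is a hypothetical name and operates on a plain dict shaped like the `GetCoords()` output shown. It still needs that one `GetCoords()` call, so it does not avoid the cost on its own, and it would misclassify a genuine conformer that happens to sit entirely at the origin.

```python
def looks_like_courtesy_conformer(coords):
    """Return True if every atom sits at exactly (0.0, 0.0, 0.0).

    `coords` maps atom index -> (x, y, z). An empty mapping also returns
    True, since it carries no real coordinates.
    """
    return all(tuple(xyz) == (0.0, 0.0, 0.0) for xyz in coords.values())


# Shaped like the ethanol output: nine atoms, all at the origin.
placeholder = {i: (0.0, 0.0, 0.0) for i in range(9)}
# Shaped like a generated conformer (values rounded for brevity).
generated = {0: (1.87, 3.72, 2.22), 1: (1.29, 4.10, 0.95)}

print(looks_like_courtesy_conformer(placeholder))
print(looks_like_courtesy_conformer(generated))
```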

mattwthompson · 2024-04-05T14:59:24Z

Okay, actually, thinking about this a little more clearly: using GetConfs (which returns an iterator over all conformers) might be the issue if it isn't lazy. I can't tell from the docs and the SWIG magic whether it yields conformers one at a time like a generator or builds them all at once, more like a list.
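
One way to probe this empirically, sketched here with plain Python stand-ins rather than OpenEye objects, is to pull a single item and see how much work the source did. `make_confs` and the `built` counter are hypothetical; with the real toolkit one would instead compare memory or timing for taking only the first conformer against exhausting the iterator.

```python
built = []  # records every "conformer" that gets constructed


def make_confs(n):
    """Lazy source: constructs each item only when it is requested."""
    for i in range(n):
        built.append(i)
        yield i


# Lazy: asking for one item constructs one item.
built.clear()
first = next(iter(make_confs(1000)))
lazy_built = len(built)

# Eager: wrapping in list() constructs all of them before the first is read.
built.clear()
first_eager = list(make_confs(1000))[0]
eager_built = len(built)

print(lazy_built, eager_built)  # 1 versus 1000
```

Note that an expression like `[*oemol.GetConfs()][0]` unpacks the whole iterator into a list before indexing, so it behaves eagerly regardless of how the underlying iterator is implemented.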

mattwthompson · 2024-04-05T15:08:32Z

There's also GetConfIter, which only exists when there are two or more conformers. That could be a useful branching point, but it doesn't help in the single-conformer case, where it can't tell a real conformer from the courtesy one.

lilyminium · 2024-04-08T23:31:28Z

Hm, the courtesy conformer is annoying. This is low priority at best, since it's a very moderate amount of memory even for a decent-sized protein. Thanks for looking into it!

lilyminium mentioned this issue Apr 5, 2024

Very large molecules consume huge amounts of memory openforcefield/openff-nagl#101

Closed
