|
1 | 1 | # RFdiffusion3 — Input specification (dialect **2**) |
2 | 2 |
|
3 | | -> **TL;DR** |
4 | | -> Inputs are now defined with a single `InputSpecification` class. |
5 | | -> Selections like “what’s fixed?”, “what’s sequence-free?”, “which atoms are donors/acceptors?” are all expressed with the same **InputSelection** mini-language. |
6 | | -> Everything is reproducibly logged back out alongside your generation. |
7 | | -
|
8 | 3 | --- |
9 | 4 |
|
10 | | -- [What changed (high level)](#what-changed-high-level) |
| 5 | +## Contents |
11 | 6 | - [Quick start](#quick-start) |
| 7 | +- [InputSpecification fields](#inputspecification-fields) |
12 | 8 | - [The `InputSelection` mini-language](#the-inputselection-mini-language) |
13 | | -- [Full schema: `InputSpecification`](#full-schema-inputspecification) |
14 | | -- [Common recipes (cookbook)](#common-recipes-cookbook) |
| 9 | +- [Unindexing specifics](#unindexing-specifics) |
15 | 10 | - [Partial diffusion](#partial-diffusion) |
16 | | -- [Symmetry](#symmetry) |
17 | | -- [Origin (`ori_token`) and initialization](#origin-ori_token-and-initialization) |
18 | | -- [Validation & error messages](#validation--error-messages) |
19 | | -- [Metadata & logging](#metadata--logging) |
20 | | -- [Legacy configs (dialect=1) & migration guide](#legacy-configs-dialect1--migration-guide) |
21 | | -- [Multi-example files](#multi-example-files) |
| 11 | +- [Debugging recommendations](#debugging-recommendations) |
22 | 12 | - [FAQ / gotchas](#faq--gotchas) |
23 | 13 |
|
24 | 14 | --- |
25 | 15 |
|
26 | | -## How it works (high level) |
| 16 | +## Quick start |
27 | 17 |
|
28 | | -- **Unified selections.** All per-residue/atom choices now use **InputSelection**: |
29 | | - - You can pass `true`/`false`, a **contig string** (`"A1-10,B5-8"`), or a **dictionary** (`{"A1-10": "ALL", "B5": "N,CA,C,O"}`). |
30 | | - - Selection fields include: `select_fixed_atoms`, `select_unfixed_sequence`, `select_buried`, `select_partially_buried`, `select_exposed`, `select_hbond_donor`, `select_hbond_acceptor`, `select_hotspots`. |
31 | | -- **Clearer unindexing.** For **unindexed** motifs you typically either fix `"ALL"` atoms or explicitly choose subsets such as `"TIP"`/`"BKBN"`/explicit atom lists via a **dictionary** (see examples). |
32 | | - When using `unindex`, only **the atoms you mark as fixed** are carried over from the input. |
33 | | -- **Reproducibility.** The exact specification and the **sampled contig** are logged back into the output JSON. We also log useful counts (atoms, residues, chains). |
34 | | -- **Safer parsing.** You’ll now get early, informative errors if: |
35 | | - - You pass unknown keys, |
36 | | - - A selection doesn’t match any atoms, |
37 | | - - Indexed and unindexed motifs overlap, |
38 | | - - Mutually exclusive selections overlap (e.g., two RASA bins for the same atom). |
39 | | -- **Backwards compatible.** Add `"dialect": 1` to keep your old configs running while you migrate. (Deprecated.) |
| 18 | +JSON inputs take the following top-level structure:
| 19 | +```json |
| 20 | +{ |
| 21 | + "spec-1": { // First design configuration |
| 22 | + "input": "<path/to/pdb>", |
| 23 | + "contig": "50-80,/0,A1-100", // Diffuses length 50-80 monomer in chain A & selects indices A1 -> A100 in input pdb to have fixed coordinates and sequences |
| 24 | + "select_unfixed_sequence": "A20-35", // Converts selected indices in input to have unfixed sequence (inputs become atom14). |
| 25 | + "ligand": "HAX,OAA" // Selects ligands HAX and OAA based on res name in the input
| 26 | + }, |
| 27 | + "spec-2": { |
| 28 | + // ... args for the second (independent) configuration for design. |
| 29 | + } |
| 30 | +} |
| 31 | +``` |
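
The `//` comments in the example above are there for explanation; a strict JSON parser rejects them, and whether the RFdiffusion3 loader tolerates them is not stated here. One way to sidestep the question is to build the file programmatically. The sketch below is only an illustration: `write_specs` and the output filename are hypothetical names, and the spec values are copied from the example above.

```python
import json
import tempfile
from pathlib import Path

# Two independent design configurations, keyed by an arbitrary name.
specs = {
    "spec-1": {
        "input": "<path/to/pdb>",             # placeholder path, as in the example above
        "contig": "50-80,/0,A1-100",          # 50-80 diffused residues, then A1-A100 fixed
        "select_unfixed_sequence": "A20-35",  # unfix the sequence of A20-A35
        "ligand": "HAX,OAA",                  # ligands selected by residue name
    },
    "spec-2": {},  # args for the second (independent) configuration go here
}


def write_specs(specs, path):
    """Hypothetical helper: serialise the specs as strict JSON (no comments)."""
    path = Path(path)
    path.write_text(json.dumps(specs, indent=2))
    return path


out_path = write_specs(specs, Path(tempfile.gettempdir()) / "rfd3_specs.json")
loaded = json.loads(out_path.read_text())  # round-trip to confirm the file is valid JSON
```

Keeping the annotations as Python comments means they never reach the file that is passed to the model.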
40 | 32 |
|
41 | | ---- |
| 33 | +## InputSpecification fields |
| 34 | + |
| 35 | +Below is a table of all of the inputs that the `InputSpecification` accepts. Use these fields to describe what RFdiffusion3 should do with your inputs. |
42 | 36 |
|
43 | | -## InputSpecification |
44 | 37 |
|
45 | 38 | | Field | Type | Description | |
46 | 39 | | -------------------------------------------------------------- | ----------------- | --------------------------------------------------------------------- | |
|
67 | 60 | | `partial_t` | `float?` | Noise (Å) for partial diffusion, enables partial diffusion | |
68 | 61 |
|
69 | 62 |
|
70 | | -## Quick start |
71 | | - |
72 | | -### Minimal JSON example |
73 | | - |
74 | | -```json |
75 | | -{ |
76 | | - "": { |
77 | | - "input": "path/to/template.pdb", |
78 | | - "contig": "A1-80", |
79 | | - "length": "150-180", |
80 | | - "select_fixed_atoms": true, |
81 | | - "select_unfixed_sequence": "A20-35", |
82 | | - "ligand": "HAX,OAA", |
83 | | - "dialect": 2 |
84 | | - } |
85 | | -} |
86 | | -``` |
87 | | -### Mininmal YAML example |
88 | | -``` |
89 | | -input: path/to/template.pdb |
90 | | -contig: A1-80 |
91 | | -length: 150-180 |
92 | | -select_fixed_atoms: true |
93 | | -select_unfixed_sequence: A20-35 |
94 | | -ligand: HAX,OAA |
95 | | -dialect: 2 |
96 | | -
|
97 | | -``` |
98 | | - |
99 | | -### Python API |
100 | | -``` |
101 | | -from rfd3.inference.input_parsing import create_atom_array_from_design_specification |
102 | | -
|
103 | | -atom_array, metadata = create_atom_array_from_design_specification( |
104 | | - input="path/to/template.pdb", |
105 | | - contig="A1-80", |
106 | | - length="150-180", |
107 | | - select_fixed_atoms=True, |
108 | | - select_unfixed_sequence="A20-35", |
109 | | - dialect=2, |
110 | | -) |
111 | | -``` |
| 63 | +A few notes on the above: |
| 64 | +- **Unified selections.** All per-residue/atom choices now use **InputSelection**: |
| 65 | + - You can pass `true`/`false`, a **contig string** (`"A1-10,B5-8"`), or a **dictionary** (`{"A1-10": "ALL", "B5": "N,CA,C,O"}`). |
| 66 | + - Selection fields include: `select_fixed_atoms`, `select_unfixed_sequence`, `select_buried`, `select_partially_buried`, `select_exposed`, `select_hbond_donor`, `select_hbond_acceptor`, `select_hotspots`. |
| 67 | +- **Clearer unindexing.** For **unindexed** motifs you typically either fix `"ALL"` atoms or explicitly choose subsets such as `"TIP"`/`"BKBN"`/explicit atom lists via a **dictionary** (see examples). |
| 68 | + When using `unindex`, only **the atoms you mark as fixed** are carried over from the input. |
| 69 | +- **Reproducibility.** The exact specification and the **sampled contig** are logged back into the output JSON. We also log useful counts (atoms, residues, chains). |
| 70 | +- **Safer parsing.** You’ll now get early, informative errors if: |
| 71 | + - You pass unknown keys, |
| 72 | + - A selection doesn’t match any atoms, |
| 73 | + - Indexed and unindexed motifs overlap, |
| 74 | + - Mutually exclusive selections overlap (e.g., two RASA bins for the same atom). |
| 75 | +- **Backwards compatible.** Add `"dialect": 1` to keep your old configs running while you migrate. (Deprecated.) |
112 | 76 |
|
| 77 | +--- |
113 | 78 | ## The InputSelection mini-language |
114 | 79 |
|
115 | | -Fields which are specified as `InputSelection` are fields which can take either: `Bool, List, Dict`. |
116 | | -Dictionaries are the most expressive and can also take special : |
| 80 | +Fields marked as `InputSelection` accept a boolean, a contig-style string, or a dictionary. Dictionaries are the most expressive and can also use shorthand values like `ALL`, `TIP`, or `BKBN`:
117 | 81 | ```yaml |
118 | 82 | select_fixed_atoms: |
119 | | - A1-2: BKBN |
| 83 | + A1-2: BKBN # equivalent to 'N,CA,C,O' |
120 | 84 | A3: N,CA,C,O,CB # specific atoms by atom name |
121 | 85 | B5-7: ALL # Selects all atoms within B5,B6 and B7 |
122 | | - B10: TIP # selects common tipatom for residue (constants.py) |
| 86 | + B10: TIP # selects common tip atom for residue (constants.py) |
123 | 87 | LIG: '' # selects no atoms (i.e. unfixes the atoms for ligands named `LIG`) |
124 | 88 | ``` |
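
To make the dictionary form concrete, here is a minimal sketch of how such a selection could be expanded into one entry per residue. It is not the parser used by RFdiffusion3: `expand_selection` and its helpers are hypothetical names, `BKBN` is expanded to `N,CA,C,O` as stated in the comment above, and `ALL` and `TIP` are passed through unchanged because resolving them depends on the residue type (the tip atoms live in `constants.py`).

```python
import re

BKBN_ATOMS = ["N", "CA", "C", "O"]  # expansion of the BKBN shorthand

# One chain letter, a residue number, and an optional "-end" for a range.
_RESIDUES = re.compile(r"^([A-Za-z])(\d+)(?:-(\d+))?$")


def expand_key(key):
    """Expand 'A1-2' into ['A1', 'A2']; other keys (e.g. a ligand name) pass through."""
    match = _RESIDUES.match(key)
    if match is None:
        return [key]
    chain, start = match.group(1), int(match.group(2))
    end = int(match.group(3)) if match.group(3) else start
    if end < start:
        raise ValueError(f"descending range in selection key: {key!r}")
    return [f"{chain}{i}" for i in range(start, end + 1)]


def expand_value(value):
    """Turn a selection value into a list of atom names, or keep ALL/TIP symbolic."""
    value = value.strip()
    if value == "BKBN":
        return list(BKBN_ATOMS)
    if value in ("ALL", "TIP"):
        return value  # residue-dependent, so left unresolved in this sketch
    if value == "":
        return []     # empty string selects no atoms
    return [atom.strip() for atom in value.split(",")]


def expand_selection(selection):
    """Map every residue (or ligand name) in the selection to its atom choice."""
    expanded = {}
    for key, value in selection.items():
        for residue in expand_key(key):
            expanded[residue] = expand_value(value)
    return expanded


selection = {"A1-2": "BKBN", "A3": "N,CA,C,O,CB", "B5-7": "ALL", "B10": "TIP", "LIG": ""}
expanded = expand_selection(selection)
```

Here `expanded["A2"]` holds the four backbone atom names, `expanded["B6"]` stays as the string `"ALL"`, and `expanded["LIG"]` is an empty list.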
125 | 89 |
|
126 | | -[Diagram] |
| 90 | +<p align="center"> |
| 91 | + <img src=".assets/input_selection.png" alt="InputSelection language for foundry"> |
| 92 | +</p> |
127 | 93 |
|
128 | 94 | ## Unindexing specifics |
129 | 95 |
|
130 | 96 | `unindex` marks motif tokens whose relative sequence placement is unknown to the model (useful for scaffolding around active sites, etc.). |
131 | 97 | Use a string to list the unindexed components and where breaks occur. |
132 | 98 | Use a dictionary if you want to fix specific atoms of those residues; atoms not fixed are not copied from the input (they will be diffused). |
133 | | -Breaks between unindexed components follow the contig conventions you’re used to. For example: |
134 | | - |
135 | | -`"A244,A274,A320,A329,A375"` |
| 99 | +Breaks between unindexed components follow the contig conventions you’re used to. For example, `"A244,A274,A320,A329,A375"` lists five unindexed components; internal “breakpoints” are inferred and logged. (Offset syntax like `A11-12` or `A11,0,A12` still ties residues.)
| 100 | +You can specify consecutive residues as e.g. `A11-12` (instead of `A11,A12`); this ties the two components together in sequence (or at least leaks to the model that the residues are adjacent in sequence).
| 101 | +Similarly, you can manually specify the number of residues separating two components, e.g. `A11,0,A12` (0 sequence offset, equivalent to just `A11-12`), or `A11,3,A12` (3-residue separation).
| 102 | +From our initial tests this only leads to a slight bias in the model, but newer models may show better adherence! |
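
The grouping rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the RFdiffusion3 parser: `parse_unindex` is a hypothetical name, and it assumes that a bare integer between two residue tokens is a sequence offset while a plain comma starts a new, untied component.

```python
import re

# One chain letter, a residue number, and an optional "-end" for a range.
_RESIDUES = re.compile(r"^([A-Za-z])(\d+)(?:-(\d+))?$")


def _expand(token):
    """Expand 'A11-12' into ['A11', 'A12'] and 'A22' into ['A22']."""
    match = _RESIDUES.match(token)
    if match is None:
        raise ValueError(f"unrecognised unindex token: {token!r}")
    chain, start = match.group(1), int(match.group(2))
    end = int(match.group(3)) if match.group(3) else start
    return [f"{chain}{i}" for i in range(start, end + 1)]


def parse_unindex(spec):
    """Group an unindex string into tied components.

    Returns a list of groups; each group is a list of (residue, position)
    pairs, where position is the index relative to the start of that group.
    """
    groups = []
    gap = None  # offset waiting to be applied to the next residue token
    for token in (t.strip() for t in spec.split(",")):
        if token.isdigit():
            if not groups or gap is not None:
                raise ValueError("an offset must sit between two residue tokens")
            gap = int(token)
            continue
        if gap is None:
            groups.append([])  # plain comma: relative placement unknown
            position = 0
        else:
            position = groups[-1][-1][1] + gap + 1  # skip `gap` residues
            gap = None
        for residue in _expand(token):
            groups[-1].append((residue, position))
            position += 1
    if gap is not None:
        raise ValueError("trailing offset without a following residue")
    return groups


tied = parse_unindex("A11-12,A22")
spaced = parse_unindex("A11,3,A12")
```

In this sketch `tied` contains two groups (`A11` and `A12` together, `A22` alone), and `spaced` contains a single group in which `A12` sits four positions after `A11`.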
| 103 | + |
| 104 | +## Partial diffusion
| 105 | +To enable partial diffusion, you can pass `partial_t` with any example. This sets the *noise level* in *angstroms* for the sampler: |
| 106 | +- The `specification.partial_t` argument can be specified from JSON or the command line. |
| 107 | +- Partial diffusion fixes/unfixes ligands and nucleic acids as normal: by default non-protein components are fixed, and they must still be specified explicitly.
| 108 | +- By default, the CA-aligned `ca_rmsd_to_input` will be logged.
| 109 | +- Currently, partial diffusion subsets the inference schedule based on `partial_t`, so `inference_sampler.num_timesteps` affects how many steps are used but is not equal to the number of steps actually run.
| 110 | + |
| 111 | +In the following example, RFD3 will noise the input by 15 angstroms and constrain atoms of three residues (in the output shown further below, one of the 8 diffusion outputs shifted its sequence index by one residue):
| 112 | +```json |
| 113 | +{ |
| 114 | + "partial_diffusion": { |
| 115 | + "input": "paper_examples/7v11.cif", |
| 116 | + "ligand": "OQO", |
| 117 | + "partial_t": 15.0, |
| 118 | + "unindex": "A431,A572-573", |
| 119 | + "select_fixed_atoms": { |
| 120 | + "A431": "TIP", |
| 121 | + "A572": "BKBN", |
| 122 | + "A573": "BKBN" |
| 123 | + } |
| 124 | + } |
| 125 | +} |
| 126 | +``` |
| 127 | +Below is an example of what the output should look like (diffusion outputs in teal, original native in navajo white): |
| 128 | +<p align="center"> |
| 129 | + <img src=".assets/partial_diff.png" alt="Partial diffusion" width=650> |
| 130 | +</p> |
136 | 131 |
|
137 | | -lists multiple unindexed components; internal “breakpoints” are inferred and logged. (Offset syntax like A11-12 or A11,0,A12 still ties residues.) |
| 132 | +## Debugging recommendations |
| 133 | +- For unindexed scaffolding, you can use the option `cleanup_guideposts=False` to keep the model's outputs for the guideposts. The guideposts are saved as separate chains based on whether their relative indices were leaked to the model: e.g. for `unindex=A11-12,A22`, you should see `A11` and `A12` indexed together on one chain and `A22` on its own chain, indicating the model was told that `A11` and `A12` are immediately next to one another in sequence but that their distance to `A22` is unknown.
| 134 | +- To see the full 14 diffused virtual atoms you can use `cleanup_virtual_atoms=False`. Default is to discard them for the sake of downstream processing. |
| 135 | +- To see the trajectories, you can use `dump_trajectories=True`. This can be useful if the outputs look strange but the config is correct, or if you want to make cool gifs of course! Trajectories do not have sequence labels and contain virtual atoms. |
138 | 136 |
|
139 | | -# Appendix |
140 | 137 | ## FAQ / gotchas |
141 | 138 | <details> |
142 | | - <summary><b>Do I need select_fixed_atoms & select_unfixed_sequence every time?</b></summary> |
143 | | - |
144 | | - No. Defaults apply when input present. |
| 139 | + <details> |
| 140 | + <summary><b>Can I guide on secondary structure?</b></summary> |
| 141 | + Currently no; future models may support this. However, you can use `is_non_loopy: true` to reduce loops. We find this produces many more helices and fewer loops (and fewer sheets).
145 | 142 | </details> |
146 | 143 |
|
147 | | -<details> |
148 | 144 | <summary><b>Do I need select_fixed_atoms & select_unfixed_sequence every time?</b></summary> |
149 | | - |
| 145 | + |
150 | 146 | No. Defaults apply when an input is present.
151 | 147 | </details> |
152 | 148 |
|
153 | 149 | <details> |
154 | | - <summary><b>What does "ALL" vs "TIP" in unindex mean?</b></summary> |
155 | | - |
156 | | - - **`ALL`** → copy full residue |
157 | | - - **`TIP`** → fix only sidechain tip atoms |
158 | | - </details> |
159 | | - |
160 | | - <details> |
161 | | - <summary><b>Can selections overlap?</b></summary> |
162 | | - |
163 | | - Only certain ones (fixed vs unfixed) may; RASA & donor/acceptor cannot. |
164 | | - </details> |
165 | | - |
166 | | - <details> |
167 | | - <summary><b>How to fix backbone but redesign sidechains?</b></summary> |
| 150 | + <summary><b>Why "Input provided but unused"?</b></summary> |
168 | 151 |
|
169 | | - `redesign_motif_sidechains: true` |
| 152 | + This indicates you gave an input pdb / cif (not `input: null`) but no contig, unindex, ligand or partial_t. |
170 | 153 | </details> |
171 | 154 |
|
172 | 155 | <details> |
173 | | - <summary><b>Why "Input provided but unused"?</b></summary> |
| 156 | + <summary><b>What do the logged bfactors mean?</b></summary> |
174 | 157 |
|
175 | | - You gave input but no contig, unindex, or partial_t. |
| 158 | + The sequence head from RFD3 logs its confidence for each token in the output structure; you can run `spectrum b` in `pymol` to see it. It usually doesn't mean much, but high entropy (uncertain sequence assignment) can hint that the model has gone far out of distribution.
176 | 159 | </details> |
| 160 | +</details> |
177 | 161 |
|
178 | | -## Shorthand atoms for easy specification |
179 | | -Keyword Expands to |
180 | | -BKBN N, CA, C, O |
181 | | -TIP Residue-specific “tip” atoms |
182 | | -ALL All atoms of each residue |
| 162 | +Let us know if you have any additional questions; we'd be happy to answer them!
183 | 163 |
|
| 164 | +## Further examples of InputSelection syntax |
184 | 165 |
|
| 166 | +Below is a reference for more examples of different ways you can specify inputs to select from your pdb in configs; we hope the community can find use in this flexible system for future models! |
| 167 | +<p align="center"> |
| 168 | + <img src=".assets/input_selection_large.png" alt="Input selection syntax" width=650> |
| 169 | +</p> |