This repository hosts the code used to replicate the paper *Augmenting Decompiler Output with Learned Variable Names and Types*. It is a fork of the original DIRTY implementation written by Chen et al. While most of the model code remains identical, we add support for generating a training dataset with the Ghidra decompiler, allowing researchers without an IDA Pro license to train their own DIRTY model. The original README gives clear instructions for downloading and running the pre-trained DIRTY model, but its instructions for training your own model are somewhat unclear. This README explicitly covers all the steps necessary to train a DIRTY model from scratch.
This is @edmcman's fork of the original DIRTY-Ghidra repository. It features a number of improvements and bug fixes, and also includes the ability to perform inference on new examples.
Most people probably just want to use DIRTY-Ghidra to predict variable names and types for their own binaries. If that is you, follow these instructions:
- Clone this repository to `DIRTY_DIR`.
- Optional but highly recommended: Create a virtual environment (venv) with `python -m venv /path/to/venv; source /path/to/venv/bin/activate`. This will prevent DIRTY from interfering with your system Python packages.
- Install the requirements via `pip install -r requirements.txt`.
- Install Ghidra.
- Install Ghidrathon. Make sure you configure Ghidrathon (`python ghidrathon_configure.py`) using the venv from step 2.
- Download the latest model from Hugging Face (`pip install huggingface_hub[cli] && huggingface-cli download --repo-type model ejschwartz/dirty-ghidra --local-dir $DIRTY_DIR/dirty`).
- If on Linux, run `mkdir ~/ghidra_scripts && ln -s $DIRTY_DIR/scripts/DIRTY_infer.py ~/ghidra_scripts/DIRTY_infer.py`.
- Open a function in Ghidra and run the script `DIRTY_infer.py` in the Script Manager.
- Optionally, assign the script to a keyboard shortcut.
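Putting the steps above together, a typical setup session on Linux might look like the following sketch. The venv location, `$DIRTY_DIR`, and the clone URL are placeholders to adapt to your machine; Ghidra and Ghidrathon still need to be installed separately as described above.

```bash
# Placeholder locations -- adjust for your machine
export DIRTY_DIR=$HOME/DIRTY-ghidra
git clone <this-repository-url> "$DIRTY_DIR"

# Create and activate a virtual environment so DIRTY's packages stay isolated
python -m venv $HOME/dirty-venv
source $HOME/dirty-venv/bin/activate

# Install the Python requirements
cd "$DIRTY_DIR"
pip install -r requirements.txt

# Download the pre-trained model from Hugging Face
pip install 'huggingface_hub[cli]'
huggingface-cli download --repo-type model ejschwartz/dirty-ghidra --local-dir "$DIRTY_DIR/dirty"

# Expose the inference script to Ghidra's Script Manager (Linux)
mkdir -p ~/ghidra_scripts
ln -s "$DIRTY_DIR/scripts/DIRTY_infer.py" ~/ghidra_scripts/DIRTY_infer.py
```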
Requirements:
- Linux with Python 3.10+
- PyTorch ≥ 1.5.1
- Ghidrathon ≥ 4.0.0
Install the Python dependencies with:
pip install -r requirements.txt
A few system libraries are required by the Python packages. On Ubuntu, you can install them with:
apt install pkg-config libsentencepiece-dev libprotobuf-dev
The first step in training DIRTY is to obtain an unprocessed DIRT dataset. Instructions can be found in the `dataset-gen-ghidra` folder.
Once we have an unprocessed dataset, we preprocess it to generate the training samples the model will train on.
# inside the `dirty` directory
python3 -m utils.preprocess [-h] [options] INPUT_FOLDER INPUT_FNAMES TARGET_FOLDER
Given the path to `INPUT_FOLDER`, which contains the unprocessed dataset, and the path to `INPUT_FNAMES`, which lists the names of all files you want to process, this script creates the preprocessed dataset in `TARGET_FOLDER`.

`TARGET_FOLDER` will contain the following files:
- `train-shard-*.tar`: archive of the training dataset samples
- `dev.tar`: archive of the validation dataset
- `test.tar`: archive of the test dataset
- `typelib.json`: list of all types contained across the three datasets
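For example, assuming the unprocessed dataset and file list produced by `dataset-gen-ghidra` live under the illustrative paths below, a preprocessing run might look like:

```bash
# inside the `dirty` directory; input/output paths are illustrative
python3 -m utils.preprocess ../dataset-gen-ghidra/output ../dataset-gen-ghidra/files.txt data/preprocessed
```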
We also need to build a vocabulary of tokens that the model will understand:
# inside the `dirty` directory
python3 -m utils.vocab [-h] --use-bpe [options] TRAIN_FILES_TAR PATH_TO_TYPELIB_JSON TARGET_DIRECTORY/vocab.bpe10000
This script generates vocabulary files located in `TARGET_DIRECTORY`. It is recommended to prefix the vocab files with `vocab.bpe10000` to match the expected vocabulary filenames in the model config files.
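Continuing the illustrative paths from the previous step, a vocabulary-building run might look like the following; whether the train shards must be passed as a quoted glob or as a single tar file depends on how your dataset was sharded, so check `utils/vocab.py` if in doubt:

```bash
# inside the `dirty` directory; paths are illustrative
python3 -m utils.vocab --use-bpe "data/preprocessed/train-shard-*.tar" data/preprocessed/typelib.json data/preprocessed/vocab.bpe10000
```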
Finally, let's move our dataset and vocabulary files to the directory expected by the model config files.
# inside the `dirty` directory
mkdir -p data1/
mv PATH_TO_TRAIN_SHARDS_TAR PATH_TO_DEV_TAR PATH_TO_TEST_TAR PATH_TO_VOCAB_BPE10000 data1/
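With the illustrative paths used above, this amounts to something like:

```bash
# inside the `dirty` directory; paths are illustrative
mkdir -p data1/
# the trailing * on the vocab prefix is an assumption that picks up all files sharing that prefix
mv data/preprocessed/train-shard-*.tar data/preprocessed/dev.tar data/preprocessed/test.tar data/preprocessed/vocab.bpe10000* data1/
```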
We can now train our own DIRTY model and test its performance. Follow the steps starting at the Train DIRTY section of the original README.
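For reference, the upstream DIRTY code base launches training from the `dirty` directory via `exp.py` with a jsonnet config. The exact config file name and any extra flags may differ in this fork, so treat the following as a sketch rather than the definitive command:

```bash
# inside the `dirty` directory; config file name is an assumption based on the upstream DIRTY README
python exp.py train multitask.xfmr.jsonnet
```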