This repository hosts the code used to replicate the paper *Augmenting Decompiler Output with Learned Variable Names and Types*. It is a fork of the original DIRTY implementation written by Chen et al. While most of the model code remains identical, we add support for generating a training dataset with the Ghidra decompiler, allowing researchers without an IDA Pro license to train their own DIRTY model. The original README gives clear instructions for downloading and running the pre-trained DIRTY model, but its instructions for training your own model are somewhat unclear. This README explicitly covers all the steps necessary to train a DIRTY model from scratch.
This is @edmcman's fork of the original DIRTY-Ghidra repository. It features a number of improvements and bug fixes, and also includes the ability to perform inference on new examples.
Most people probably just want to use DIRTY-Ghidra to predict variable names and types for their own binaries. If that is you, follow these instructions:
- Clone this repository to `DIRTY_DIR`.
- Optional but highly recommended: Create a virtual environment (venv) with `python -m venv /path/to/venv; source /path/to/venv/bin/activate`. This will prevent DIRTY from interfering with your system Python packages.
- Install the requirements via `pip install -r requirements.txt`.
- Install Ghidra.
- Install Ghidrathon. Make sure you configure Ghidrathon (`python ghidrathon_configure.py`) using the venv from step 2.
- Download the latest model from Hugging Face (`pip install huggingface_hub[cli] && huggingface-cli download --repo-type model ejschwartz/dirty-ghidra --local-dir $DIRTY_DIR/dirty`).
- If on Linux, run `mkdir ~/ghidra_scripts && ln -s $DIRTY_DIR/scripts/DIRTY_infer.py ~/ghidra_scripts/DIRTY_infer.py`.
- Open a function in Ghidra and run the script `DIRTY_infer.py` in the Script Manager.
- Optionally, assign the script to a keyboard shortcut.
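Putting the steps above together, a typical setup session on Linux might look like the following sketch. The venv location, `$DIRTY_DIR`, and the clone URL are placeholders to adapt to your machine; Ghidra and Ghidrathon still need to be installed separately as described above.

```bash
# Placeholder locations -- adjust for your machine
export DIRTY_DIR=$HOME/DIRTY-ghidra
git clone <this-repository-url> "$DIRTY_DIR"

# Create and activate a virtual environment so DIRTY's packages stay isolated
python -m venv $HOME/dirty-venv
source $HOME/dirty-venv/bin/activate

# Install the Python requirements
cd "$DIRTY_DIR"
pip install -r requirements.txt

# Download the pre-trained model from Hugging Face
pip install 'huggingface_hub[cli]'
huggingface-cli download --repo-type model ejschwartz/dirty-ghidra --local-dir "$DIRTY_DIR/dirty"

# Expose the inference script to Ghidra's Script Manager (Linux)
mkdir -p ~/ghidra_scripts
ln -s "$DIRTY_DIR/scripts/DIRTY_infer.py" ~/ghidra_scripts/DIRTY_infer.py
```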
Requirements:
- Linux with Python 3.10+
- PyTorch ≥ 1.5.1
- Ghidrathon ≥ 4.0.0
Install the Python dependencies with:
pip install -r requirements.txt
A few system libraries are required by the Python packages. On Ubuntu, you can install them with:
apt install pkg-config libsentencepiece-dev libprotobuf-dev
The first step in training DIRTY is to obtain an unprocessed DIRT dataset. Instructions can be found in the `dataset-gen-ghidra` folder.
Once we have an unprocessed dataset, we preprocess it to generate the training samples the model will train on.
# inside the `dirty` directory
python3 -m utils.preprocess [-h] [options] INPUT_FOLDER INPUT_FNAMES TARGET_FOLDER
Given the path to `INPUT_FOLDER`, which contains the unprocessed dataset, and the path to `INPUT_FNAMES`, which lists the names of all files you want to process, this script creates the preprocessed dataset in `TARGET_FOLDER`.

`TARGET_FOLDER` will contain the following files:
- `train-shard-*.tar`: archive of the training dataset samples
- `dev.tar`: archive of the validation dataset
- `test.tar`: archive of the test dataset
- `typelib.json`: list of all types contained across the three datasets
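For example, assuming the unprocessed dataset and file list produced by `dataset-gen-ghidra` live under the illustrative paths below, a preprocessing run might look like:

```bash
# inside the `dirty` directory; input/output paths are illustrative
python3 -m utils.preprocess ../dataset-gen-ghidra/output ../dataset-gen-ghidra/files.txt data/preprocessed
```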
We also need to build a vocabulary of tokens that the model will understand:
# inside the `dirty` directory
python3 -m utils.vocab [-h] --use-bpe [options] TRAIN_FILES_TAR PATH_TO_TYPELIB_JSON TARGET_DIRECTORY/vocab.bpe10000
This script generates vocabulary files located in `TARGET_DIRECTORY`. It is recommended to prefix the vocab files with `vocab.bpe10000` to match the expected vocabulary filenames in the model config files.
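Continuing the illustrative paths from the previous step, a vocabulary-building run might look like the following; whether the train shards must be passed as a quoted glob or as a single tar file depends on how your dataset was sharded, so check `utils/vocab.py` if in doubt:

```bash
# inside the `dirty` directory; paths are illustrative
python3 -m utils.vocab --use-bpe "data/preprocessed/train-shard-*.tar" data/preprocessed/typelib.json data/preprocessed/vocab.bpe10000
```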
Finally, let's move our dataset and vocabulary files to the directory expected by the model config files.
# inside the `dirty` directory
mkdir -p data1/
mv PATH_TO_TRAIN_SHARDS_TAR PATH_TO_DEV_TAR PATH_TO_TEST_TAR PATH_TO_VOCAB_BPE10000 data1/
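With the illustrative paths used above, this amounts to something like:

```bash
# inside the `dirty` directory; paths are illustrative
mkdir -p data1/
# the trailing * on the vocab prefix is an assumption that picks up all files sharing that prefix
mv data/preprocessed/train-shard-*.tar data/preprocessed/dev.tar data/preprocessed/test.tar data/preprocessed/vocab.bpe10000* data1/
```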
We can now train our own DIRTY model and test its performance. Follow the steps starting at the Train DIRTY section of the original README.
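For reference, the upstream DIRTY code base launches training from the `dirty` directory via `exp.py` with a jsonnet config. The exact config file name and any extra flags may differ in this fork, so treat the following as a sketch rather than the definitive command:

```bash
# inside the `dirty` directory; config file name is an assumption based on the upstream DIRTY README
python exp.py train multitask.xfmr.jsonnet
```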