We present the Transformers for Actions, Phases, Steps, and Instrument Segmentation (TAPIS) model, a generalized architecture designed to tackle all the tasks proposed in the GraSP benchmark. Our method uses a localized instrument segmentation baseline, applied on independent keyframes, that acts as a region proposal network and provides pixel-precise instrument masks and their corresponding segment embeddings. In addition, our model uses a global video feature extractor on time windows centered on a keyframe to compute a class embedding and a sequence of spatio-temporal embeddings. A frame classification head uses the class embedding to classify the middle frame of the time window into a phase or a step, while a region classification head interrelates the global spatio-temporal features with the localized region embeddings for atomic action prediction or instrument region classification. The following subsections explain the details of our proposed architecture.
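As a rough illustration of the flow described above, the sketch below wires the two heads together in PyTorch. Module names, the embedding dimension, and the cross-attention used to interrelate region and global features are illustrative assumptions, not the actual TAPIS implementation.

```python
# Minimal sketch of the TAPIS heads (illustrative assumptions, not the real code).
import torch.nn as nn

class TAPISSketch(nn.Module):
    def __init__(self, num_frame_classes, num_region_classes, dim=768):
        super().__init__()
        self.frame_head = nn.Linear(dim, num_frame_classes)          # phases / steps
        self.region_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.region_head = nn.Linear(dim, num_region_classes)        # instruments / actions

    def forward(self, class_emb, spatiotemporal_emb, region_embs):
        # class_emb:          (B, dim)        global class embedding of the time window
        # spatiotemporal_emb: (B, T*H*W, dim) global spatio-temporal tokens
        # region_embs:        (B, R, dim)     segment embeddings from the keyframe proposals
        frame_logits = self.frame_head(class_emb)
        # Localized region embeddings attend to the global spatio-temporal features
        # before region classification.
        attended, _ = self.region_attn(region_embs, spatiotemporal_emb, spatiotemporal_emb)
        region_logits = self.region_head(attended)
        return frame_logits, region_logits
```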
This work is an extended and consolidated version of three previous works:
- Towards Holistic Surgical Scene Understanding, MICCAI 2022, Oral. Code here.
- Winning solution of the 2022 SAR-RARP50 challenge
- MATIS: Masked-Attention Transformers for Surgical Instrument Segmentation, ISBI 2023, Oral. Code here.
Please follow these steps to run TAPIS:
$ conda create --name tapis python=3.8 -y
$ conda activate tapis
$ conda install pytorch==2.4.1 torchvision==0.19.1 pytorch-cuda=12.4 -c pytorch -c nvidia
# (for older cuda versions)
# conda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=11.1 -c pytorch -c nvidia
$ git clone https://github.com/BCV-Uniandes/GraSP
$ cd GraSP/TAPIS
$ pip install -r requirements.txt
$ pip install 'git+https://github.com/facebookresearch/fvcore'
$ pip install 'git+https://github.com/facebookresearch/fairscale'
$ python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
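As an optional sanity check (not part of the official setup), you can verify that the main dependencies import correctly and that a GPU is visible:

```python
# Optional post-install check: core packages import and CUDA is available.
import torch
import torchvision
import detectron2

print("torch:", torch.__version__, "| torchvision:", torchvision.__version__)
print("detectron2:", detectron2.__version__)
print("CUDA available:", torch.cuda.is_available())
```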
In this Google Drive link, you will find a compressed archive with our preprocessed data files, region proposals, and pretrained models, along with a README describing the data structures and files. Download the archive and uncompress it with the following command:
$ tar -xzvf TAPIS.tar.gz
Then, place the extracted files in a directory named GraSP inside this repository's data directory. Please also include the video frames in a directory named "frames", and place the original annotations in the "annotations" directory next to the region predictions. In the end, the repository must have the following structure:
TAPIS
|
|__configs
| ...
|__data
| |__GraSP
| |__annotations
| | |__fold1_train_preds.json
| | |__fold1_val_preds.json
| | |__fold2_train_preds.json
| | |__fold2_val_preds.json
| | |__train_train_preds.json
| | |__test_val_preds.json
| | |__grasp_long-term_fold1.json
| | |__grasp_long-term_fold2.json
| | |__grasp_long-term_train.json
| | |__grasp_long-term_test.json
| | |__grasp_short-term_fold1.json
| | |__grasp_short-term_fold2.json
| | |__grasp_short-term_train.json
| | |__grasp_short-term_test.json
| |
| |__features
| | |__fold1_train_region_features.pth
| | |__fold1_val_region_features.pth
| | |__fold2_train_region_features.pth
| | |__fold2_val_region_features.pth
| | |__train_train_region_features.pth
| | |__test_val_region_features.pth
| |
| |__frame_lists
| | |__fold1.csv
| | |__fold2.csv
| | |__train.csv
| | |__test.csv
| |
| |__frames
| | |__CASE001
| | | |__000000000.jpg
| | | |__000000002.jpg
| | | ...
| | |__CASE002
| | | ...
| | ...
| |
| |__pretrained_models
| | |__fold1
| | | |__ACTIONS.pyth
| | | |__LONG.pyth
| | | |__PHASES.pyth
| | | |__STEPS.pyth
| | | |__INSTRUMENTS.pyth
| | | |__SEGMENTATION_BASELINE
| | | | |__r50.pth
| | | | |__swinl.pth
| | |__fold2
| | | ...
| | |__train
| | | |__ACTIONS.pyth
| | | |__LONG.pyth
| | | |__INSTRUMENTS.pyth
| | | |__SEGMENTATION_BASELINE
| | | | |__swinl.pth
|
|__region_proposals
|__run_files
|__tapis
|__tools
Feel free to use soft/hard linking to other paths or to modify the directory structure, names, or locations of the files. However, you may also have to alter the .yaml config files or the bash running scripts.
Task | cross-val mAP | test mAP | config | run file | model path |
---|---|---|---|---|---|
Phases | 71.36 | 76.72 | PHASES | phases | TAPIS/pretrained_models/PHASES |
Steps | 50.74 | 52.01 | STEPS | steps | TAPIS/pretrained_models/STEPS |
Instruments | 90.28 | 89.09 | INSTRUMENTS | instruments | TAPIS/pretrained_models/INSTRUMENTS |
Actions | 35.46 | 39.50 | ACTIONS | actions | TAPIS/pretrained_models/ACTIONS |
We provide bash scripts with default parameters to evaluate each GraSP task. Please first download our preprocessed data files and pretrained models as instructed earlier, then run the command corresponding to the desired task:
# Run the script corresponding to the desired task to evaluate
$ sh run_files/grasp_<actions/instruments/phases/steps/long-term/short-term_rpn>.sh
You can easily modify the bash scripts to train our models. Just set `TRAIN.ENABLE True` in the desired script to enable training, and set `TEST.ENABLE False` to avoid testing before training. You might also want to modify `TRAIN.CHECKPOINT_FILE_PATH` to point to the model weights you want to use as initialization. You can modify the config files or the bash scripts to alter the architecture design, training schedule, video input design, etc. We provide documentation for each hyperparameter in the defaults script.
Although our code is configured to evaluate the model's performance after each epoch, you can also evaluate your model's predictions directly with our evaluation code. For this purpose, run the evaluate script and provide the required paths in the arguments, as documented in the script. You can run this script on the output files of the detectron2 library using the `--filter` argument, or you can provide your predictions in the following format:
[
{"<frame/name>":
{
# For long-term tasks
"<phase/step>_score_dist": [class_1_score, ..., class_N_score],
# For short-term tasks
"instances":
[
{
"bbox": [x_min, y_min, x_max, y_max],
"<instruments/actions>_score_dist": [class_1_score, ..., class_N_score],
# For instrument segmentation
"segment" <Segmentation in RLE format>
}
]
}
},
...
]
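For reference, here is a minimal sketch of how a predictions file in this format could be assembled in Python. The frame key, class counts, scores, and mask are placeholders, the score-distribution key names follow the templates above (adjust them to your task), and pycocotools is only needed if you include RLE segments.

```python
# Sketch: assembling a predictions file in the expected format (placeholder values).
import json
import numpy as np
from pycocotools import mask as mask_utils

# Dummy binary instrument mask encoded as RLE (replace with real model output).
binary_mask = np.zeros((480, 854), dtype=np.uint8)
binary_mask[100:200, 300:500] = 1
rle = mask_utils.encode(np.asfortranarray(binary_mask))
rle["counts"] = rle["counts"].decode("utf-8")  # make the RLE JSON-serializable

predictions = [
    {
        "CASE001/000000000.jpg": {                      # placeholder frame name
            "phase_score_dist": [0.05, 0.90, 0.05],     # one score per class
            "instances": [
                {
                    "bbox": [300.0, 100.0, 500.0, 200.0],        # x_min, y_min, x_max, y_max
                    "instruments_score_dist": [0.1, 0.8, 0.1],   # one score per class
                    "segment": rle,                              # only for segmentation
                }
            ],
        }
    }
]

with open("my_predictions.json", "w") as f:
    json.dump(predictions, f)
```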
You can run the `evaluate.py` script as follows:
$ python evaluate.py --coco_anns_path /path/to/coco/annotations/json \
--pred-path /path/to/predictions/json or pth \
--output_path /path/to/output/directory \
--tasks <instruments/actions/phases/steps> \
--metrics <mAP/mAP@0.5IoU_box/mAP@0.5IoU_segm/mIoU/mAP_pres> \
(optional) --masks-path /path/to/segmentation/masks \
# Optional for detectron2 outputs
--filter \
--selection <topk/thresh/cls_thresh/...> \
--selection_info <filtering info>
Our instrument segmentation baseline is wholly based on Mask2Former, so we recommend checking their repo for details on their implementation.
To run our baseline, first go to the region proposal directory and install the corresponding dependencies. You must have already installed all the required dependencies of the main TAPIS code. The following is an example of how to install dependencies correctly.
$ conda activate tapis
$ cd ./region_proposals
$ pip install -r requirements.txt
$ cd mask2former/modeling/pixel_decoder/ops
$ sh make.sh
$ cd ../../../..
The original Mask2Former code does not accept segmentation annotations in RLE format; hence, to run our baseline, you must first transform our RLE masks into polygons with the rle_to_polygon.py script as follows:
$ python rle_to_polygon.py --data_path /path/to/GraSP/annotations
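Conceptually, the conversion decodes each RLE mask and extracts its polygon contours; the standalone sketch below illustrates the idea (it is not the actual rle_to_polygon.py implementation).

```python
# Sketch: converting one RLE segment into COCO-style polygons (illustrative only).
import cv2
from pycocotools import mask as mask_utils

def rle_to_polygons(rle):
    binary = mask_utils.decode(rle)  # (H, W) uint8 binary mask
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    polygons = []
    for contour in contours:
        if contour.size >= 6:  # a valid polygon needs at least 3 points
            polygons.append(contour.flatten().astype(float).tolist())
    return polygons
```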
Then, to run the training code, run the train_net.py script, indicating the path to a configuration file in the configs directory with the `--config-file` argument. You should also indicate the path to the GraSP dataset with the `DATASETS.DATA_PATH` option, the path to the pretrained weights with the `MODEL.WEIGHTS` option, and the desired output path with the `OUTPUT_DIR` option. Download the pretrained Mask2Former weights for instance segmentation on the COCO dataset from the Mask2Former repo. Use the following command to train our baseline:
$ python train_net.py --num-gpus <number of GPUs> \
--config-file configs/grasp/<config file name>.yaml \
DATASETS.DATA_PATH path/to/grasp/dataset \
MODEL.WEIGHTS path/to/pretrained/model/weights \
OUTPUT_DIR output/path
You can modify most hyperparameters by changing the values in the configuration files or using command options; please check the Detectron2 library and the original Mask2Former repo for further details on configuration files and options.
To run the evaluation code, use the `--eval-only` argument and the TAPIS model weights provided in the data link. Run the following command to evaluate our baseline:
$ python train_net.py --num-gpus <number of GPUs> --eval-only \
--config-file configs/grasp/<config file name>.yaml \
DATASETS.DATA_PATH path/to/grasp/dataset \
MODEL.WEIGHTS path/to/pretrained/model/weights \
OUTPUT_DIR output/path
Note: You can easily run our segmentation baseline on a custom dataset by modifying the `register_surgical_dataset` function in the train_net.py script to register the dataset in COCO JSON format. Once again, we recommend checking the Detectron2 library and the original Mask2Former repo for more details on registering your dataset.
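For a dataset that is already in COCO JSON format, the registration can be as simple as detectron2's built-in helper; the dataset names and paths below are placeholders, and the actual register_surgical_dataset function may organize things differently.

```python
# Sketch: registering a custom COCO-format dataset with detectron2 (placeholder paths).
from detectron2.data.datasets import register_coco_instances

register_coco_instances(
    "my_surgical_train", {},
    "path/to/annotations/train.json",  # COCO JSON annotations
    "path/to/frames",                  # image root directory
)
register_coco_instances(
    "my_surgical_val", {},
    "path/to/annotations/val.json",
    "path/to/frames",
)
```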
Our code can either calculate region features on the fly during training and validation or load precalculated (stored) region features. Our published results are based on stored region features, as calculating features on the fly significantly increases computational complexity and slows training down. Our code stores the region features corresponding to the predicted segments in the same results files in the output directory of the segmentation baseline. You can then use the match_annots_n_preds.py script to filter predictions, assign region features to ground-truth instances for training, and parse predictions into the files TAPIS needs. Use the code as follows:
$ python match_annots_n_preds.py
To calculate region features on the fly, we provide an example of how to configure our code in the `run_files/grasp_short-term_rpn.sh` file.
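If you want to inspect a stored feature file before plugging it into TAPIS, a quick check like the following works; the internal layout of the .pth files is not documented here, so treat the dictionary access as an assumption to verify.

```python
# Quick inspection of a precomputed region-feature file (layout is an assumption).
import torch

features = torch.load("data/GraSP/features/fold1_train_region_features.pth",
                      map_location="cpu")
print(type(features))
if isinstance(features, dict):
    key = next(iter(features))
    value = features[key]
    print("example key:", key)
    print("value type:", type(value), getattr(value, "shape", None))
```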
You can also run the segmentation baseline on the Endovis 2017 and Endovis 2018 datasets, as done in our previous MATIS paper. We recommend checking the paper and the MATIS repo.
To run our segmentation baseline on the Endovis 2017 and 2018 datasets, please download the preprocessed frames, instance annotations, and pretrained models from this link, as instructed in the MATIS repo. Then run the segmentation baseline as previously instructed, but use the provided configuration files for Endovis 2017 or Endovis 2018 and indicate the path to the downloaded data with the `DATASETS.DATA_PATH` option.
If you have any doubts, questions, issues, or comments, please email [email protected].
If you find GraSP or TAPIS useful for your research (or their previous versions, PSI-AVA, TAPIR, and MATIS), please include the following BibTeX citations in your papers.
@article{ayobi2024pixelwise,
title={Pixel-Wise Recognition for Holistic Surgical Scene Understanding},
author={Nicol{\'a}s Ayobi and Santiago Rodr{\'i}guez and Alejandra P{\'e}rez and Isabela Hern{\'a}ndez and Nicol{\'a}s Aparicio and Eug{\'e}nie Dessevres and Sebasti{\'a}n Peña and Jessica Santander and Juan Ignacio Caicedo and Nicol{\'a}s Fern{\'a}ndez and Pablo Arbel{\'a}ez},
year={2024},
url={https://arxiv.org/abs/2401.11174},
eprint={2401.11174},
journal={arXiv},
primaryClass={cs.CV}
}
@InProceedings{ayobi2023matis,
author={Nicol{\'a}s Ayobi and Alejandra P{\'e}rez-Rond{\'o}n and Santiago Rodr{\'i}guez and Pablo Arbel{\'a}ez},
booktitle={2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI)},
title={MATIS: Masked-Attention Transformers for Surgical Instrument Segmentation},
year={2023},
pages={1-5},
doi={10.1109/ISBI53787.2023.10230819}
}
@InProceedings{valderrama2020tapir,
author={Natalia Valderrama and Paola Ruiz and Isabela Hern{\'a}ndez and Nicol{\'a}s Ayobi and Mathilde Verlyck and Jessica Santander and Juan Caicedo and Nicol{\'a}s Fern{\'a}ndez and Pablo Arbel{\'a}ez},
title={Towards Holistic Surgical Scene Understanding},
booktitle={Medical Image Computing and Computer Assisted Intervention -- MICCAI 2022},
year={2022},
publisher={Springer Nature Switzerland},
address={Cham},
pages={442--452},
isbn={978-3-031-16449-1}
}