This project develops a Hierarchical Deep Temporal Model for recognizing complex multi-agent group activities in sports video.
While the foundational logic is inspired by Ibrahim et al. (CVPR 2016), this project provides a modern, enhanced implementation. In addition, it introduces a Global-Local Feature Fusion layer that explicitly integrates whole-frame environmental context with individual player dynamics, achieving 90.2% accuracy, a significant improvement over the 81.9% reported in the original study.
- Model Architecture
- Experiments
- Dataset
- Installation & Setup
- Project Structure
- References
- License & Citation
Figure 1: Hierarchical Fusion Architecture
- Processes 12 parallel player crops using a pretrained ResNet-50 and a per-player LSTM to capture individual motion dynamics.
- Players are split into two groups of 6 (Team 1 and Team 2) and max-pooled separately to maintain team-level spatial logic.
- Processes the whole frame to provide environmental context, such as net position, court boundaries, and ball trajectory.
- Features from the player-specific LSTMs are concatenated with global frame features before being fed into a final Group LSTM for activity classification.
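The pipeline above can be sketched end-to-end in PyTorch. This is a minimal illustration, not the repo's actual model code: class and variable names are invented, and it assumes ResNet-50 features have already been extracted for the player crops and whole frames.

```python
import torch
import torch.nn as nn

class HierarchicalFusionModel(nn.Module):
    """Illustrative sketch of the two-stage global-local fusion architecture."""

    def __init__(self, feat_dim=2048, hidden=512, num_classes=8):
        super().__init__()
        # Stage 1: per-player temporal model over precomputed crop features.
        self.player_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Global branch: temporal model over whole-frame features.
        self.frame_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Stage 2: group LSTM over [team1 | team2 | frame] fused features.
        self.group_lstm = nn.LSTM(3 * hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, player_feats, frame_feats):
        # player_feats: [B, 12, T, D] backbone features per player crop
        # frame_feats:  [B, T, D] whole-frame backbone features
        B, P, T, D = player_feats.shape
        out, _ = self.player_lstm(player_feats.reshape(B * P, T, D))
        out = out.reshape(B, P, T, -1)
        team1 = out[:, : P // 2].amax(dim=1)  # max-pool Team 1 -> [B, T, H]
        team2 = out[:, P // 2 :].amax(dim=1)  # max-pool Team 2 -> [B, T, H]
        frame_out, _ = self.frame_lstm(frame_feats)
        # Late fusion of team-level dynamics with global context.
        fused = torch.cat([team1, team2, frame_out], dim=-1)  # [B, T, 3H]
        group_out, _ = self.group_lstm(fused)
        return self.classifier(group_out[:, -1])  # classify from last step
```

In practice `feat_dim=2048` matches ResNet-50's pooled output; smaller dimensions are used below only to keep the example fast.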
We conducted an ablation study to quantify the contribution of each component to the final 90.2% accuracy. Each baseline isolates a specific feature (Spatial vs. Temporal and Global vs. Local) to justify the architectural choices of the B8 model.
- Logic: Only the middle frame of the video sequence is processed by the ResNet-50 backbone.
- Goal: Evaluate the performance of a single "spatial snapshot" of the court without any temporal or player-level information.
- Logic: The middle frame crops for all 12 players are processed by ResNet-50 and then max-pooled.
- Goal: Assess the importance of a single snapshot of player-level visual features without motion context.
- Logic: The full sequence of whole frames is fed into an LSTM.
- Goal: Measure the importance of temporal information in the global environment (ball and net) without focusing on individual players.
- Logic: Sequences of player crops are fed into parallel LSTMs; the final hidden state of each player is then pooled.
- Goal: Isolate the impact of temporal modeling at the individual player level.
- Logic: Players are pooled in each individual frame first, and the resulting sequence of team features is fed into a final LSTM.
- Goal: Test if temporal reasoning is more effective after spatial information has been summarized across the team.
- Logic: Player hidden states are pooled into one single group, concatenated with the global frame features, and processed by a second-stage LSTM.
- Goal: Evaluate the importance of late-fusion between player dynamics and environmental context.
- Logic: Player hidden states are pooled into two distinct groups (Team 1 & Team 2), concatenated with the global frame, and fed into the Group LSTM.
- Goal: Our final architecture, designed to reduce confusion between the "Left" and "Right" court classes via explicit team-aware pooling and multi-scale global fusion.
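The only difference between B7 and B8 is the pooling stage: B7 collapses all 12 player hidden states into a single group vector, while B8 pools each team of six separately before fusing with the global frame feature. A minimal sketch (tensor names are illustrative, not from the repo):

```python
import torch

# Final per-player LSTM hidden states: batch of 2 clips, 12 players, 512-dim.
player_h = torch.randn(2, 12, 512)
frame_h = torch.randn(2, 512)  # global whole-frame feature

# B7: one group -> a single pooled player vector per clip.
b7 = torch.cat([player_h.amax(dim=1), frame_h], dim=-1)  # [2, 1024]

# B8: team-aware pooling -> Team 1 (players 0-5) and Team 2 (players 6-11)
# keep separate identities before fusion with the global context.
team1 = player_h[:, :6].amax(dim=1)   # [2, 512]
team2 = player_h[:, 6:].amax(dim=1)   # [2, 512]
b8 = torch.cat([team1, team2, frame_h], dim=-1)  # [2, 1536]
```

Preserving two separate team vectors is what lets the downstream Group LSTM tell left-court from right-court activities apart.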
| Baseline | Accuracy | Weighted F1 Score |
|---|---|---|
| B1 | 74.05% | 73.98% |
| B3 | 81.90% | 81.61% |
| B4 | 75.02% | 75.07% |
| B5 | 80.78% | 79.11% |
| B6 | 79.96% | 79.84% |
| B7 | 87.58% | 87.58% |
| B8 | 90.20% | 90.23% |
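The table reports weighted F1, i.e. per-class F1 averaged with each class's support as its weight. A stdlib-only sketch of the metric, intended to behave like scikit-learn's `f1_score(average='weighted')`:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged by class support (support = true count per class)."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n - tp
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (n / total) * f1  # weight each class by its support
    return score
```

Unlike plain accuracy, this keeps rare classes such as "Right Winpoint" from being drowned out by the frequent pass/set classes.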
Figure 2: Baseline 8 Confusion Matrix
All experiments were trained using Kaggle GPU environments. Pretrained checkpoints for all baselines are available for download on this Kaggle Dataset.
The project utilizes a dataset of high-resolution volleyball match sequences sourced from public YouTube videos. The data is designed for hierarchical action recognition, linking individual player movements to global team strategies.
- 96,620 annotated frames sourced from 4,831 video clips across 55 videos.
- Each clip includes a global category label (one of 8 group activity classes) and precise bounding boxes with action tags for every player (one of 9 individual action classes).
Dataset available at Drive, Kaggle
Figure 3: Visualization of spatial annotations showing the "Left Set" group activity and individual player actions like "blocking," "standing," and "waiting."
The dataset reflects the natural distribution of a volleyball match.
| Group Activity Class | No. of Instances |
|---|---|
| Left Pass | 826 |
| Right Pass | 801 |
| Right Set | 644 |
| Left Spike | 642 |
| Left Set | 632 |
| Right Spike | 623 |
| Left Winpoint | 367 |
| Right Winpoint | 295 |
| Action Class | No. of Instances |
|---|---|
| Standing | 39,152 |
| Moving | 5,170 |
| Waiting | 3,647 |
| Blocking | 3,018 |
| Digging | 2,377 |
| Spiking | 1,327 |
| Setting | 1,325 |
| Falling | 1,243 |
| Jumping | 359 |
To ensure valid generalization, the data is split by Video ID. This prevents the model from "memorizing" specific court backgrounds or jersey colors during training.
To ensure fair comparison, we utilize the original author's data split.
| Split | Count | Video IDs |
|---|---|---|
| Train | 24 | 1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36, 38, 39, 40, 41, 42, 48, 50, 52, 53, 54 |
| Validation | 15 | 0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51 |
| Test | 16 | 4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47 |
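The split table above can be encoded directly. A small helper (hypothetical, not a repo function) that routes a clip to its split by video ID, with sanity checks that the splits are disjoint and cover all 55 videos:

```python
# Video-ID-based split (IDs taken from the table above); every clip inherits
# the split of its source video, so no court background or jersey colors
# leak between train and test.
TRAIN = {1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36, 38, 39, 40,
         41, 42, 48, 50, 52, 53, 54}
VAL = {0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51}
TEST = {4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47}

def split_of(video_id: int) -> str:
    """Return which split a clip belongs to, given its source video ID."""
    for name, ids in (("train", TRAIN), ("val", VAL), ("test", TEST)):
        if video_id in ids:
            return name
    raise ValueError(f"unknown video id: {video_id}")

# Sanity checks: splits are disjoint and together cover videos 0..54.
assert TRAIN.isdisjoint(VAL) and TRAIN.isdisjoint(TEST) and VAL.isdisjoint(TEST)
assert TRAIN | VAL | TEST == set(range(55))
```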
- Clone the repository:
```bash
git clone https://github.com/MOH-YAHIA/Group-Activity-Recognition.git
cd group-activity-recognition
```
- Install required dependencies:
```bash
pip install -r requirements.txt
```
- Download the Dataset
- Update the directory paths in the .yaml configuration files (located in the /config folder) to point to your local dataset location.
- If the selected baseline requires a pretrained backbone (e.g., B8), update the backbone path within the specific baseline configuration file.
- You can run any baseline training script by specifying the module name. Replace <script_name> with the baseline you wish to train (e.g., train_b1, train_b8).
```bash
python -m train.<script_name>
```
group-activity-recognition/
├── config/ # Configuration parameters for each baseline experiment
├── logs/ # Training progress logs
├── models/ # Definitions for all model architectures (B1–B8)
├── outputs/ # Generated results for each baseline
│ └── Bx/ # Folders per baseline (e.g., B1, B8)
│ └── confusion_matrix,report # Confusion matrices and classification reports
├── scripts/ # Core execution scripts
│ ├── train.py # Main training pipeline
│ ├── val.py # Main evaluation pipeline
│ ├── train_b7_8.py # Specialized training for Fusion models (B7 & B8)
│ ├── eval_b7_8.py # Specialized evaluation for Fusion models
│ └── final_report.py # Script for generating classification reports and confusion matrices
├── train/ # Individual training routines for each baseline
├── utils/ # Data pipelines and processing utilities
│ ├── boxinfo.py # Annotation string parsing and processing
│ ├── volleyball_annot_loader.py # Metadata aggregation and directory traversal
│ ├── base_dataset.py # Base class for shared image/person level logic
│ ├── image_level_dataset.py # Dataset for whole-frame context models
│ ├── person_level_dataset.py # Dataset for individual player-crop sequences
│ ├── person_image_level_dataset.py # Multi-modal (global + local) dataset
│ ├── logger.py # Centralized logging configuration
│ └── volleyball-exploration.ipynb # Jupyter notebook for dataset EDA
├── .gitignore # Specifies intentionally untracked files to ignore
├── requirements.txt # Project dependencies
└── README.md # Project documentation
Mostafa S. Ibrahim, Srikanth Muralidharan, Zhiwei Deng, Arash Vahdat, Greg Mori.
> A Hierarchical Deep Temporal Model for Group Activity Recognition.
> CVPR 2016, ArXiv
This project is licensed under the BSD 2-Clause License - see the LICENSE file for details.
This work builds upon the Volleyball Dataset and the hierarchical framework proposed by Ibrahim et al. If you use this codebase or the dataset, please cite the following original publications:
@inproceedings{msibrahiCVPR16deepactivity,
author = {Mostafa S. Ibrahim and Srikanth Muralidharan and Zhiwei Deng and Arash Vahdat and Greg Mori},
title = {A Hierarchical Deep Temporal Model for Group Activity Recognition},
booktitle = {2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2016}
}
@article{msibrahiPAMI16deepactivity,
author = {Mostafa S. Ibrahim and Srikanth Muralidharan and Zhiwei Deng and Arash Vahdat and Greg Mori},
title = {Hierarchical Deep Temporal Models for Group Activity Recognition},
journal = {arXiv preprint arXiv:1607.02643},
year = {2016}
}