This project develops a Hierarchical Deep Temporal Model for recognizing complex multi-agent group activities in sports video.
While the foundational logic is inspired by Ibrahim et al. (CVPR 2016), this project provides a modern, enhanced implementation. In addition, it introduces a Global-Local Feature Fusion layer that explicitly integrates whole-frame environmental context with individual player dynamics, achieving 90.2% accuracy, a significant improvement over the 81.9% reported in the original study.
- Model Architecture
- Experiments
- Dataset
- Installation & Setup
- Project Structure
- References
- License & Citation
Figure 1: Hierarchical Fusion Architecture
- Processes 12 parallel player crops using a pretrained ResNet-50 and a per-player LSTM to capture individual motion dynamics.
- Players are split into two groups of 6 (Team 1 and Team 2) and max-pooled separately to maintain team-level spatial logic.
- Processes the whole frame to provide environmental context, such as net position, court boundaries, and ball trajectory.
- Features from the player-specific LSTMs are concatenated with global frame features before being fed into a final Group LSTM for activity classification.
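The pipeline above can be sketched end-to-end in PyTorch. This is a minimal illustration, not the repo's actual model code: class and variable names are invented, and it assumes ResNet-50 features have already been extracted for the player crops and whole frames.

```python
import torch
import torch.nn as nn

class HierarchicalFusionModel(nn.Module):
    """Illustrative sketch of the two-stage global-local fusion architecture."""

    def __init__(self, feat_dim=2048, hidden=512, num_classes=8):
        super().__init__()
        # Stage 1: per-player temporal model over precomputed crop features.
        self.player_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Global branch: temporal model over whole-frame features.
        self.frame_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Stage 2: group LSTM over [team1 | team2 | frame] fused features.
        self.group_lstm = nn.LSTM(3 * hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, player_feats, frame_feats):
        # player_feats: [B, 12, T, D] backbone features per player crop
        # frame_feats:  [B, T, D] whole-frame backbone features
        B, P, T, D = player_feats.shape
        out, _ = self.player_lstm(player_feats.reshape(B * P, T, D))
        out = out.reshape(B, P, T, -1)
        team1 = out[:, : P // 2].amax(dim=1)  # max-pool Team 1 -> [B, T, H]
        team2 = out[:, P // 2 :].amax(dim=1)  # max-pool Team 2 -> [B, T, H]
        frame_out, _ = self.frame_lstm(frame_feats)
        # Late fusion of team-level dynamics with global context.
        fused = torch.cat([team1, team2, frame_out], dim=-1)  # [B, T, 3H]
        group_out, _ = self.group_lstm(fused)
        return self.classifier(group_out[:, -1])  # classify from last step
```

In practice `feat_dim=2048` matches ResNet-50's pooled output; smaller dimensions are used below only to keep the example fast.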
We conducted an ablation study to quantify the contribution of each component to the final 90.2% accuracy. Each baseline isolates a specific feature (Spatial vs. Temporal and Global vs. Local) to justify the architectural choices of the B8 model.
- Logic: Only the middle frame of the video sequence is processed by the ResNet-50 backbone.
- Goal: Evaluate the performance of a single "spatial snapshot" of the court without any temporal or player-level information.
- Logic: The middle frame crops for all 12 players are processed by ResNet-50 and then max-pooled.
- Goal: Assess the importance of a single snapshot of player-level visual features without motion context.
- Logic: The full sequence of whole frames is fed into an LSTM.
- Goal: Measure the importance of temporal information in the global environment (ball and net) without focusing on individual players.
- Logic: Sequences of player crops are fed into parallel LSTMs; the final hidden state of each player is then pooled.
- Goal: Isolate the impact of temporal modeling at the individual player level.
- Logic: Players are pooled in each individual frame first, and the resulting sequence of team features is fed into a final LSTM.
- Goal: Test if temporal reasoning is more effective after spatial information has been summarized across the team.
- Logic: Player hidden states are pooled into one single group, concatenated with the global frame features, and processed by a second-stage LSTM.
- Goal: Evaluate the importance of late-fusion between player dynamics and environmental context.
- Logic: Player hidden states are pooled into two distinct groups (Team 1 & Team 2), concatenated with the global frame, and fed into the Group LSTM.
- Goal: Our final architecture, designed to reduce confusion between the "Left" and "Right" court classes via explicit team-aware pooling and multi-scale global fusion.
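The only difference between B7 and B8 is the pooling stage: B7 collapses all 12 player hidden states into a single group vector, while B8 pools each team of six separately before fusing with the global frame feature. A minimal sketch (tensor names are illustrative, not from the repo):

```python
import torch

# Final per-player LSTM hidden states: batch of 2 clips, 12 players, 512-dim.
player_h = torch.randn(2, 12, 512)
frame_h = torch.randn(2, 512)  # global whole-frame feature

# B7: one group -> a single pooled player vector per clip.
b7 = torch.cat([player_h.amax(dim=1), frame_h], dim=-1)  # [2, 1024]

# B8: team-aware pooling -> Team 1 (players 0-5) and Team 2 (players 6-11)
# keep separate identities before fusion with the global context.
team1 = player_h[:, :6].amax(dim=1)   # [2, 512]
team2 = player_h[:, 6:].amax(dim=1)   # [2, 512]
b8 = torch.cat([team1, team2, frame_h], dim=-1)  # [2, 1536]
```

Preserving two separate team vectors is what lets the downstream Group LSTM tell left-court from right-court activities apart.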
| Baseline | Accuracy | Weighted F1 Score |
|---|---|---|
| B1 | 74.05% | 73.98% |
| B3 | 81.90% | 81.61% |
| B4 | 75.02% | 75.07% |
| B5 | 80.78% | 79.11% |
| B6 | 79.96% | 79.84% |
| B7 | 87.58% | 87.58% |
| B8 | 90.20% | 90.23% |
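The table reports weighted F1, i.e. per-class F1 averaged with each class's support as its weight. A stdlib-only sketch of the metric, intended to behave like scikit-learn's `f1_score(average='weighted')`:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged by class support (support = true count per class)."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n - tp
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (n / total) * f1  # weight each class by its support
    return score
```

Unlike plain accuracy, this keeps rare classes such as "Right Winpoint" from being drowned out by the frequent pass/set classes.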
Figure 2: Baseline 8 Confusion Matrix
All experiments were trained using Kaggle GPU environments. Pretrained checkpoints for all baselines are available for download on this Kaggle Dataset.
The project utilizes a dataset of high-resolution volleyball match sequences sourced from public YouTube videos. The data is designed for hierarchical action recognition, linking individual player movements to global team strategies.
- 96,620 annotated frames sourced from 4,831 video clips across 55 videos.
- Each clip includes a global category label (one of 8 group activity classes) and precise bounding boxes with action tags for every player (one of 9 individual action classes).
Dataset available at Drive, Kaggle
Figure 3: Visualization of spatial annotations showing the "Left Set" group activity and individual player actions like "blocking," "standing," and "waiting."
The dataset reflects the natural distribution of a volleyball match.
| Group Activity Class | No. of Instances |
|---|---|
| Left Pass | 826 |
| Right Pass | 801 |
| Right Set | 644 |
| Left Spike | 642 |
| Left Set | 632 |
| Right Spike | 623 |
| Left Winpoint | 367 |
| Right Winpoint | 295 |
| Action Class | No. of Instances |
|---|---|
| Standing | 39,152 |
| Moving | 5,170 |
| Waiting | 3,647 |
| Blocking | 3,018 |
| Digging | 2,377 |
| Spiking | 1,327 |
| Setting | 1,325 |
| Falling | 1,243 |
| Jumping | 359 |
To ensure valid generalization, the data is split by Video ID. This prevents the model from "memorizing" specific court backgrounds or jersey colors during training.
To ensure fair comparison, we utilize the original author's data split.
| Split | Count | Video IDs |
|---|---|---|
| Train | 24 | 1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36, 38, 39, 40, 41, 42, 48, 50, 52, 53, 54 |
| Validation | 15 | 0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51 |
| Test | 16 | 4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47 |
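The split table above can be encoded directly. A small helper (hypothetical, not a repo function) that routes a clip to its split by video ID, with sanity checks that the splits are disjoint and cover all 55 videos:

```python
# Video-ID-based split (IDs taken from the table above); every clip inherits
# the split of its source video, so no court background or jersey colors
# leak between train and test.
TRAIN = {1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36, 38, 39, 40,
         41, 42, 48, 50, 52, 53, 54}
VAL = {0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51}
TEST = {4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47}

def split_of(video_id: int) -> str:
    """Return which split a clip belongs to, given its source video ID."""
    for name, ids in (("train", TRAIN), ("val", VAL), ("test", TEST)):
        if video_id in ids:
            return name
    raise ValueError(f"unknown video id: {video_id}")

# Sanity checks: splits are disjoint and together cover videos 0..54.
assert TRAIN.isdisjoint(VAL) and TRAIN.isdisjoint(TEST) and VAL.isdisjoint(TEST)
assert TRAIN | VAL | TEST == set(range(55))
```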
- Clone the repository:
```bash
git clone https://github.com/MOH-YAHIA/Group-Activity-Recognition.git
cd group-activity-recognition
```
- Install required dependencies:
```bash
pip install -r requirements.txt
```
- Download the Dataset
- Update the directory paths in the .yaml configuration files (located in the /config folder) to point to your local dataset location.
- If the selected baseline requires a pretrained backbone (e.g., B8), update the backbone path within the specific baseline configuration file.
- You can run any baseline training script by specifying the module name. Replace <script_name> with the baseline you wish to train (e.g., train_b1, train_b8).
```bash
python -m train.<script_name>
```
group-activity-recognition/
├── config/ # Configuration parameters for each baseline experiment
├── logs/ # Training progress logs
├── models/ # Definitions for all model architectures (B1–B8)
├── outputs/ # Generated results for each baseline
│ └── Bx/ # Folders per baseline (e.g., B1, B8)
│ └── confusion_matrix,report # Confusion matrices and classification reports
├── scripts/ # Core execution scripts
│ ├── train.py # Main training pipeline
│ ├── val.py # Main evaluation pipeline
│ ├── train_b7_8.py # Specialized training for Fusion models (B7 & B8)
│ ├── eval_b7_8.py # Specialized evaluation for Fusion models
│ └── final_report.py # Script for generating classification reports and confusion matrices
├── train/ # Individual training routines for each baseline
├── utils/ # Data pipelines and processing utilities
│ ├── boxinfo.py # Annotation string parsing and processing
│ ├── volleyball_annot_loader.py # Metadata aggregation and directory traversal
│ ├── base_dataset.py # Base class for shared image/person level logic
│ ├── image_level_dataset.py # Dataset for whole-frame context models
│ ├── person_level_dataset.py # Dataset for individual player-crop sequences
│ ├── person_image_level_dataset.py # Multi-modal (global + local) dataset
│ ├── logger.py # Centralized logging configuration
│ └── volleyball-exploration.ipynb # Jupyter notebook for dataset EDA
├── .gitignore # Specifies intentionally untracked files to ignore
├── requirements.txt # Project dependencies
└── README.md # Project documentation
Mostafa S. Ibrahim, Srikanth Muralidharan, Zhiwei Deng, Arash Vahdat, Greg Mori.
> A Hierarchical Deep Temporal Model for Group Activity Recognition.
> CVPR 2016, ArXiv
This project is licensed under the BSD 2-Clause License - see the LICENSE file for details.
This work builds upon the Volleyball Dataset and the hierarchical framework proposed by Ibrahim et al. If you use this codebase or the dataset, please cite the following original publications:
@inproceedings{msibrahiCVPR16deepactivity,
author = {Mostafa S. Ibrahim and Srikanth Muralidharan and Zhiwei Deng and Arash Vahdat and Greg Mori},
title = {A Hierarchical Deep Temporal Model for Group Activity Recognition},
booktitle = {2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2016}
}
@article{msibrahiPAMI16deepactivity,
author = {Mostafa S. Ibrahim and Srikanth Muralidharan and Zhiwei Deng and Arash Vahdat and Greg Mori},
title = {Hierarchical Deep Temporal Models for Group Activity Recognition},
journal = {arXiv preprint arXiv:1607.02643},
year = {2016}
}