A Hierarchical Deep Temporal Model for Group Activity Recognition

This project develops a Hierarchical Deep Temporal Model designed to solve complex multi-agent coordination in sports video analytics.

While the core logic is inspired by Ibrahim et al. (CVPR 2016), this project provides a modern, enhanced implementation. In addition, it introduces a Global-Local Feature Fusion layer that explicitly integrates whole-frame environmental context with individual player dynamics, reaching 90.2% accuracy, a significant improvement over the 81.9% reported in the original study.


Model Architecture

Figure 1: Hierarchical Fusion Architecture

Local Branch: Individual & Team Dynamics

  • Processes 12 parallel player crops using a pretrained ResNet-50 and a per-player LSTM to capture individual motion dynamics.
  • Players are split into two groups of 6 (Team 1 and Team 2) and max-pooled separately to maintain team-level spatial logic.
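
The sketch below illustrates this local branch in PyTorch. It is a minimal, illustrative version rather than the repository's exact code: the hidden size, the ordering of the 12 crops into two teams of 6, and the tensor layout are assumptions.

```python
# Minimal sketch of the local branch (illustrative, not the repo's exact code):
# shared ResNet-50 features per player crop, a shared per-player LSTM, and
# team-wise max pooling over two groups of 6 players. Dimensions are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class LocalBranch(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc head
        self.player_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, crops):
        # crops: (batch, players=12, time, 3, H, W), players ordered Team 1 then Team 2
        b, p, t, c, h, w = crops.shape
        feats = self.backbone(crops.reshape(b * p * t, c, h, w)).flatten(1)  # (b*p*t, 2048)
        out, _ = self.player_lstm(feats.view(b * p, t, -1))                  # (b*p, t, hidden)
        out = out.view(b, p, t, -1)
        team1 = out[:, :6].max(dim=1).values    # (b, t, hidden) -- pool Team 1
        team2 = out[:, 6:].max(dim=1).values    # (b, t, hidden) -- pool Team 2
        return torch.cat([team1, team2], dim=-1)  # (b, t, 2 * hidden)
```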

Global Branch: Environmental Context

  • Processes the whole frame to provide environmental context, such as net position, court boundaries, and ball trajectory.

Late Fusion Architecture: Context Integration

  • Features from the player-specific LSTMs are concatenated with global frame features before being fed into a final Group LSTM for activity classification.
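
As a rough sketch of the fusion step (again illustrative, with assumed dimensions), per-frame team features such as those from the local-branch sketch above are concatenated with per-frame whole-frame features and passed through the second-stage Group LSTM:

```python
# Minimal late-fusion sketch (illustrative): concatenate per-frame team features with
# per-frame global features, then classify from the Group LSTM's final hidden state.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, team_dim=1024, global_dim=2048, hidden_dim=512, num_classes=8):
        super().__init__()
        self.group_lstm = nn.LSTM(team_dim + global_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, team_feats, global_feats):
        # team_feats:   (batch, time, team_dim)   -- pooled player dynamics per frame
        # global_feats: (batch, time, global_dim) -- whole-frame features per frame
        fused = torch.cat([team_feats, global_feats], dim=-1)
        _, (h_n, _) = self.group_lstm(fused)
        return self.classifier(h_n[-1])  # logits over the 8 group activities
```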

Experiments

We conducted an ablation study to quantify the contribution of each component to the final 90.2% accuracy. Each baseline isolates a specific factor (spatial vs. temporal, global vs. local) to justify the architectural choices of the B8 model.

Baseline 1: Static Global Snapshot (B1)

  • Logic: Only the middle frame of the video sequence is processed by the ResNet-50 backbone.
  • Goal: Evaluate the performance of a single "spatial snapshot" of the court without any temporal or player-level information.
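
Conceptually, B1 reduces to a single-image classifier. A minimal sketch (assuming an ImageNet-pretrained ResNet-50 with its head swapped for the 8 group activity classes) is:

```python
# Illustrative B1: classify the group activity from the single middle frame only.
import torch.nn as nn
import torchvision.models as models

def build_b1(num_classes=8):
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # replace the ImageNet head
    return model  # input: (batch, 3, H, W) middle frames -> group activity logits
```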

Baseline 3: Static Player Crops (B3)

  • Logic: The middle frame crops for all 12 players are processed by ResNet-50 and then max-pooled.
  • Goal: Assess the importance of a single snapshot of player-level visual features without motion context.

Baseline 4: Global Temporal Dynamics (B4)

  • Logic: The full sequence of whole frames is fed into an LSTM.
  • Goal: Measure the importance of temporal information in the global environment (ball and net) without focusing on individual players.

Baseline 5: Hierarchical Person Temporal (B5)

  • Logic: Sequences of player crops are fed into parallel LSTMs; the final hidden state of each player is then pooled.
  • Goal: Isolate the impact of temporal modeling at the individual player level.

Baseline 6: Temporal Post-Pooling (B6)

  • Logic: Players are pooled in each individual frame first, and the resulting sequence of team features is fed into a final LSTM.
  • Goal: Test if temporal reasoning is more effective after spatial information has been summarized across the team.
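
The difference between B5 and B6 is purely the order of temporal modeling and player pooling. The sketch below makes that ordering concrete; the shapes and the shared LSTM are assumptions to keep the example short, not the repository's exact code.

```python
# Illustrative contrast: where does player pooling happen relative to the LSTM?
import torch.nn as nn

# Assumed dims; in practice B5 and B6 use separate LSTMs (shared here only for brevity).
lstm = nn.LSTM(input_size=2048, hidden_size=512, batch_first=True)

def b5_style(player_feats):
    # player_feats: (batch, players, time, feat=2048) -- CNN features per crop
    b, p, t, f = player_feats.shape
    out, _ = lstm(player_feats.reshape(b * p, t, f))   # temporal modeling per player
    last = out[:, -1].view(b, p, -1)
    return last.max(dim=1).values                      # pool players AFTER the LSTM (B5)

def b6_style(player_feats):
    team_seq = player_feats.max(dim=1).values          # pool players per frame FIRST (B6)
    out, _ = lstm(team_seq)                            # then one LSTM over the team sequence
    return out[:, -1]                                  # (batch, hidden) clip representation
```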

Baseline 7: Unified Global-Local Fusion (B7)

  • Logic: Player hidden states are pooled into one single group, concatenated with the global frame features, and processed by a second-stage LSTM.
  • Goal: Evaluate the importance of late-fusion between player dynamics and environmental context.

Baseline 8: Final Team-Aware Fusion (B8)

  • Logic: Player hidden states are pooled into two distinct groups (Team 1 & Team 2), concatenated with the global frame, and fed into the Group LSTM.
  • Goal: Our final architecture, designed to reduce confusion between the "Left" and "Right" court classes via explicit team-aware pooling and multi-scale global fusion.

Evaluation Results

| Baseline | Accuracy | Weighted F1 Score |
| --- | --- | --- |
| B1 | 74.05% | 73.98% |
| B3 | 81.90% | 81.61% |
| B4 | 75.02% | 75.07% |
| B5 | 80.78% | 79.11% |
| B6 | 79.96% | 79.84% |
| B7 | 87.58% | 87.58% |
| B8 | 90.20% | 90.23% |

Figure 2: Baseline 8 Confusion Matrix


All experiments were trained in Kaggle GPU environments. Pretrained checkpoints for all baselines are available for download from this Kaggle Dataset.

Dataset

The project utilizes a dataset of high-resolution volleyball match sequences sourced from public YouTube videos. The data is designed for hierarchical action recognition, linking individual player movements to global team strategies.

  • 96,620 annotated frames sourced from 4,831 video clips across 55 videos.
  • Each clip includes a group activity label (8 group activity classes) and precise bounding boxes with action tags for every player (9 individual action classes).

The dataset is available on Drive and Kaggle.


Annotation Example

Figure 3: Visualization of spatial annotations showing the "Left Set" group activity and individual player actions such as "blocking," "standing," and "waiting."
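
For orientation, the sketch below is a minimal, hypothetical parser for such annotations. It assumes each line of a clip's annotation file lists the frame name, the group activity label, and then repeated (x, y, w, h, action) tuples, one per player; check the actual loaders in utils/boxinfo.py and utils/volleyball_annot_loader.py, since the real file layout may differ.

```python
# Hypothetical annotation parser (assumed line layout; verify against utils/boxinfo.py
# and utils/volleyball_annot_loader.py before relying on it).
def parse_annotation_line(line: str):
    tokens = line.split()
    frame_name, group_activity = tokens[0], tokens[1]
    players = []
    for i in range(2, len(tokens), 5):          # assume 5 tokens per player
        x, y, w, h = map(int, tokens[i:i + 4])  # bounding box
        action = tokens[i + 4]                  # individual action label
        players.append({"box": (x, y, w, h), "action": action})
    return frame_name, group_activity, players
```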


Class Distribution

The dataset reflects the natural distribution of a volleyball match.

1. Group Activity Classes (Macro-Level)

| Group Activity Class | No. of Instances |
| --- | --- |
| Left Pass | 826 |
| Right Pass | 801 |
| Right Set | 644 |
| Left Spike | 642 |
| Left Set | 632 |
| Right Spike | 623 |
| Left Winpoint | 367 |
| Right Winpoint | 295 |

2. Individual Action Classes (Micro-Level)

| Action Class | No. of Instances |
| --- | --- |
| Standing | 39,152 |
| Moving | 5,170 |
| Waiting | 3,647 |
| Blocking | 3,018 |
| Digging | 2,377 |
| Spiking | 1,327 |
| Setting | 1,325 |
| Falling | 1,243 |
| Jumping | 359 |

Data Split

To ensure valid generalization, the data is split by Video ID, which prevents the model from "memorizing" specific court backgrounds or jersey colors during training. For a fair comparison, we use the original authors' data split.

| Split | Count | Video IDs |
| --- | --- | --- |
| Train | 24 | 1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36, 38, 39, 40, 41, 42, 48, 50, 52, 53, 54 |
| Validation | 15 | 0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51 |
| Test | 16 | 4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47 |
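
A small helper like the one below (illustrative only) enforces this split by assigning each clip to train/val/test from its parent Video ID, using the IDs in the table above.

```python
# Illustrative video-level split helper using the IDs from the table above.
TRAIN_IDS = {1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36, 38, 39, 40, 41, 42, 48, 50, 52, 53, 54}
VAL_IDS   = {0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51}
TEST_IDS  = {4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47}

def split_of(video_id: int) -> str:
    if video_id in TRAIN_IDS:
        return "train"
    if video_id in VAL_IDS:
        return "val"
    if video_id in TEST_IDS:
        return "test"
    raise ValueError(f"Unknown video id: {video_id}")
```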

Installation & Setup

1. Clone & Install dependencies

  1. Clone the repository:
git clone https://github.com/MOH-YAHIA/Group-Activity-Recognition.git
cd Group-Activity-Recognition
  2. Install required dependencies:
pip install -r requirements.txt

2. Dataset Preparation

  1. Download the Dataset
  2. Update the directory paths in the .yaml configuration files (located in the /config folder) to point to your local dataset location.
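
After editing the configs, a quick sanity check like the sketch below can confirm the paths resolve. The file name config/b1.yaml and the dataset_root key are placeholders; use the file and key names that actually exist under /config (PyYAML is assumed to be installed).

```python
# Illustrative sanity check: load a baseline config and verify its dataset path exists.
import os
import yaml

with open("config/b1.yaml") as f:        # hypothetical config file name
    cfg = yaml.safe_load(f)

dataset_root = cfg.get("dataset_root")   # placeholder key name; check the real config keys
assert dataset_root and os.path.isdir(dataset_root), f"Dataset path not found: {dataset_root}"
```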

3. Usage & Execution

  1. If the selected baseline requires a pretrained backbone (e.g., B8), update the backbone path within the specific baseline configuration file.
  2. You can run any baseline training script by specifying the module name. Replace <script_name> with the baseline you wish to train (e.g., train_b1, train_b8).
python -m train.<script_name>

Project Structure

group-activity-recognition/
├── config/              # Configuration parameters for each baseline experiment
├── logs/                # Training progress logs
├── models/              # Definitions for all model architectures (B1–B8)
├── outputs/             # Generated results for each baseline
│   └── Bx/              # Folders per baseline (e.g., B1, B8)
│       └── confusion_matrix,report      # Confusion matrices and classification reports
├── scripts/             # Core execution scripts
│   ├── train.py         # Main training pipeline
│   ├── val.py           # Main evaluation pipeline
│   ├── train_b7_8.py    # Specialized training for Fusion models (B7 & B8)
│   ├── eval_b7_8.py     # Specialized evaluation for Fusion models
│   └── final_report.py  # Script for generating classification reports and confusion matrices
├── train/               # Individual training routines for each baseline
├── utils/               # Data pipelines and processing utilities
│   ├── boxinfo.py       # Annotation string parsing and processing
│   ├── volleyball_annot_loader.py   # Metadata aggregation and directory traversal
│   ├── base_dataset.py  # Base class for shared image/person level logic
│   ├── image_level_dataset.py       # Dataset for whole-frame context models
│   ├── person_level_dataset.py      # Dataset for individual player-crop sequences
│   ├── person_image_level_dataset.py# Multi-modal (global + local) dataset
│   ├── logger.py        # Centralized logging configuration
│   └── volleyball-exploration.ipynb # Jupyter notebook for dataset EDA
├── .gitignore           # Specifies intentionally untracked files to ignore
├── requirements.txt     # Project dependencies
└── README.md            # Project documentation

References

Mostafa S. Ibrahim, Srikanth Muralidharan, Zhiwei Deng, Arash Vahdat, Greg Mori. A Hierarchical Deep Temporal Model for Group Activity Recognition. CVPR 2016. ArXiv

License & Citation

License

This project is licensed under the BSD 2-Clause License - see the LICENSE file for details.

Acknowledgments & Attribution

This work builds upon the Volleyball Dataset and the hierarchical framework proposed by Ibrahim et al. If you use this codebase or the dataset, please cite the following original publications:

@inproceedings{msibrahiCVPR16deepactivity,
  author    = {Mostafa S. Ibrahim and Srikanth Muralidharan and Zhiwei Deng and Arash Vahdat and Greg Mori},
  title     = {A Hierarchical Deep Temporal Model for Group Activity Recognition.},
  booktitle = {2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2016}
}

@article{msibrahiPAMI16deepactivity,
  author    = {Mostafa S. Ibrahim and Srikanth Muralidharan and Zhiwei Deng and Arash Vahdat and Greg Mori},
  title     = {Hierarchical Deep Temporal Models for Group Activity Recognition.},
  journal   = {arXiv preprint arXiv:1607.02643},
  year      = {2016}
}
