Skip to content

NVIDIA NeMo Canary-Qwen-2.5B is an English speech recognition model that achieves state-of-the art performance on multiple English speech benchmarks.

Notifications You must be signed in to change notification settings

sruckh/canary-qwen-2.5b-RunPod

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

15 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Canary-Qwen-2.5B RunPod Deployment

A dockerized deployment of NVIDIA's Canary-Qwen-2.5B speech recognition model for RunPod, featuring a Gradio interface for audio transcription and LLM-powered text analysis.

Docker Hub Build Status License

🎯 Overview

This container provides:

  • ASR Mode: Speech-to-text transcription with punctuation and capitalization
  • LLM Mode: Question-answering and analysis of transcribed text
  • Gradio Interface: User-friendly web interface for audio upload and processing
  • RunPod Optimized: Slim container with host-based model and dependency management

πŸš€ Quick Start on RunPod

1. Deploy Container

To deploy, use the following command, which mounts a local directory for caching:

# Use the built container image from Docker Hub
docker run -d \
  --name canary-qwen \
  --gpus all \
  -p 7860:7860 \
  -v /workspace/cache:/root/.cache:rw \
  -e GRADIO_SHARE=true \
  gemneye/canary-qwen-2.5b-runpod:latest

πŸ“ Directory Structure

/workspace/
β”œβ”€β”€ models/                 # Model files (2.5B parameters)
β”‚   └── canary-qwen-2.5b/
β”œβ”€β”€ cache/                  # HuggingFace cache
β”œβ”€β”€ data/                   # Temporary audio files
β”œβ”€β”€ logs/                   # Application logs
β”œβ”€β”€ venv/                   # Python virtual environment
β”œβ”€β”€ activate.sh             # Environment activation script
β”œβ”€β”€ launch.sh               # Application launch script
└── env_vars.sh             # Environment variables

🎡 Using the Interface

Audio Requirements

  • Format: WAV, FLAC, MP3, or other common audio formats
  • Sample Rate: Automatically resampled to 16kHz
  • Channels: Automatically converted to mono
  • Duration: Optimal performance with <40 seconds
  • Language: English only

Interface Features

1. Audio Upload

  • Drag and drop audio files
  • Record directly using microphone
  • Automatic transcription on upload

2. Transcription (ASR Mode)

  • High-accuracy speech-to-text
  • Automatic punctuation and capitalization
  • Real-time processing

3. LLM Analysis

  • Ask questions about the transcribed content
  • Summarization and analysis
  • Content extraction and insights

Example Prompts

"Summarize the main points discussed in the audio."
"What is the speaker's tone and emotion?"
"Extract any important dates, names, or numbers mentioned."
"What questions does the speaker ask?"
"Identify the key topics covered in this audio."

πŸ› οΈ Development and Customization

Building the Container

# Clone and build
git clone https://github.com/sruckh/canary-qwen-2.5b-RunPod.git
cd canary-qwen-2.5b-RunPod
docker build -t canary-qwen-2.5b .

Local Development (No GPU)

For development without GPU access:

# Edit docker-compose.yml to comment out GPU sections
# Then run
docker-compose up --build

Customizing the Interface

Edit app.py to modify:

  • UI layout and styling
  • Processing parameters
  • Model configuration
  • Response formatting

πŸ“Š Performance Specifications

Model Performance

  • Parameters: 2.5 billion
  • RTFx: 418 (Real-time factor)
  • WER: 5.63% average on benchmarks
  • Languages: English only

Hardware Requirements

  • GPU: NVIDIA GPU with CUDA support
  • Memory: 8GB+ GPU memory recommended
  • Storage: 10GB+ for model and cache
  • Network: For model download and Gradio share

Benchmarks

Dataset WER
AMI 10.18%
GigaSpeech 9.41%
LibriSpeech Clean 1.60%
LibriSpeech Other 3.10%
Earnings22 10.42%

πŸ”§ Configuration

Environment Variables

Variable Default Description
GRADIO_SHARE false Enable Gradio share links
GRADIO_SERVER_NAME 0.0.0.0 Server bind address
GRADIO_SERVER_PORT 7860 Server port
MODEL_PATH /models/canary-qwen-2.5b Model directory
HF_HOME /root/.cache HuggingFace cache
CUDA_VISIBLE_DEVICES 0 GPU selection

Model Configuration

The model supports two modes:

  1. ASR Mode: Direct speech-to-text transcription
  2. LLM Mode: Text analysis using the underlying language model

πŸ› Troubleshooting

Common Issues

Model Not Loading

# Check model path
ls -la /models/canary-qwen-2.5b/

# Check GPU availability
nvidia-smi

# Check logs
docker logs canary-qwen

Audio Processing Errors

  • Ensure audio file is valid format
  • Check file size (<100MB recommended)
  • Verify audio duration (<40 seconds optimal)

Memory Issues

  • Reduce batch size in model configuration
  • Ensure sufficient GPU memory (8GB+)
  • Monitor memory usage with nvidia-smi

Performance Optimization

  • Use 16kHz mono audio for best performance
  • Keep audio segments under 40 seconds
  • Enable GPU acceleration
  • Use SSD storage for model files

πŸ“„ License

This project uses the NVIDIA Canary-Qwen-2.5B model under the CC-BY-4.0 license.

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

πŸ“ž Support

For issues and questions:

  • Check the troubleshooting section
  • Review application logs
  • Verify RunPod environment configuration
  • Ensure proper model installation

Note: This container is optimized for RunPod deployment with GPU acceleration. Local development without GPU is possible but will have limited functionality.

About

NVIDIA NeMo Canary-Qwen-2.5B is an English speech recognition model that achieves state-of-the art performance on multiple English speech benchmarks.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published