Voxy AI is a full-stack, self-hosted application for advanced AI-powered audio generation. It provides three core capabilities:
- Generate Speech from Text (Text-to-Speech): Convert written text into natural-sounding audio.
- Change Voices (Voice Conversion): Transform an existing audio clip from one voice to another.
- Generate Sound Effects: Create custom sound effects from descriptive text prompts.

Key engineering highlights:

- Scalable AI Inference APIs: Designed and deployed Python FastAPI inference services, containerized with Docker and orchestrated via Docker Compose on GPU-accelerated AWS EC2 instances (see the service sketch after this list).
- Cloud Infrastructure: Built a robust cloud infrastructure on AWS (EC2, S3, ECR, IAM) for hosting AI models, managing container images, and securely storing generated audio outputs.
- Secure Audio Handling: Integrated AWS S3 for storing generated audio and used pre-signed URLs to give clients direct, time-limited access to their files (see the pre-signed URL sketch after this list).
- API Development: Developed API endpoints for each AI service, with Pydantic input validation and API key authentication (see the validation and auth sketch after this list).
- Model Deployment: Containerized complex PyTorch-based AI models into production-ready Docker images, ensuring consistent runtime environments across development and deployment.
- Service Orchestration: Configured Docker Compose to manage the multi-service AI backend, defining service dependencies, port mappings, and resource allocations.
- Audio Processing Tools: Integrated ffmpeg for audio format handling and phonemizer (with the espeak-ng backend) for phonetic transcription in the TTS pipeline (see the preprocessing sketch after this list).
- Model Fine-tuning: Built custom fine-tuning pipelines for StyleTTS2 and SeedVC, including WhisperX transcription for data preparation and training-configuration management for personalized voice generation (see the data-preparation sketch after this list).
- Frontend Development: Built a responsive frontend with Next.js and React that manages user interactions and audio history, integrating securely with the Python AI backend through a Node.js proxy.
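
A minimal sketch of how one of the FastAPI inference services could be bootstrapped inside its container. The module layout, service title, and health route shown here are illustrative assumptions rather than the actual Voxy AI code:

```python
# Sketch: skeleton of a GPU-aware FastAPI inference service (names are assumptions).
import torch
from fastapi import FastAPI

app = FastAPI(title="Voxy AI - TTS service")

# On a GPU EC2 instance with the NVIDIA runtime exposed to the container this
# resolves to "cuda"; falling back to CPU keeps local development usable.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"


@app.get("/health")
def health() -> dict:
    # Lightweight liveness check for Docker Compose health checks or a reverse proxy.
    return {"status": "ok", "device": DEVICE}
```

Inside the container such a service would typically be started with something like `uvicorn main:app --host 0.0.0.0 --port 8000`.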
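
The pre-signed URL flow for generated audio can be sketched with boto3 roughly as follows; the bucket name, key layout, and expiry shown are placeholders, and credentials are assumed to come from the EC2 instance's IAM role:

```python
# Sketch: upload a generated clip to S3 and hand back a time-limited download link.
import boto3

s3 = boto3.client("s3")  # picks up credentials from the instance's IAM role


def upload_and_presign(local_path: str, bucket: str, key: str, expires_s: int = 3600) -> str:
    """Upload the audio file, then return a URL that expires after `expires_s` seconds."""
    s3.upload_file(local_path, bucket, key)
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_s,
    )


# Example (placeholder names):
# url = upload_and_presign("out/tts_123.wav", "voxy-audio-outputs", "tts/123.wav")
```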
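
Input validation and API key authentication in FastAPI might look roughly like this; the request fields, header name, environment variable, and endpoint path are assumptions made for illustration:

```python
# Sketch: Pydantic request validation plus API-key authentication on a TTS endpoint.
import os

from fastapi import Depends, FastAPI, HTTPException, Security
from fastapi.security import APIKeyHeader
from pydantic import BaseModel, Field

app = FastAPI()
api_key_header = APIKeyHeader(name="X-API-Key")


def require_api_key(api_key: str = Security(api_key_header)) -> str:
    # Compare against a key injected via the environment (e.g. by Docker Compose).
    if api_key != os.environ.get("VOXY_API_KEY"):
        raise HTTPException(status_code=401, detail="Invalid API key")
    return api_key


class TTSRequest(BaseModel):
    # Field constraints reject empty or oversized inputs before inference runs.
    text: str = Field(min_length=1, max_length=2000)
    voice_id: str = Field(min_length=1)


@app.post("/tts")
def generate_speech(req: TTSRequest, _: str = Depends(require_api_key)) -> dict:
    # Placeholder body: the real handler runs the TTS model and uploads the result to S3.
    return {"status": "queued", "characters": len(req.text)}
```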
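
A rough sketch of how ffmpeg and phonemizer could be wired into preprocessing; the target sample rate and helper names are assumptions, not the project's actual pipeline code:

```python
# Sketch: normalize uploaded audio with ffmpeg and phonemize text for the TTS model.
import subprocess

from phonemizer import phonemize


def to_mono_wav(src: str, dst: str, sample_rate: int = 24000) -> None:
    """Convert any input format to a mono WAV at the (assumed) model sample rate."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(sample_rate), dst],
        check=True,
    )


def to_phonemes(text: str) -> str:
    """Phonetic transcription via the espeak-ng backend, as fed to the TTS frontend."""
    return phonemize(text, language="en-us", backend="espeak", strip=True)


# Example: to_mono_wav("upload.mp3", "clean.wav"); to_phonemes("Hello from Voxy AI")
```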
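
Fine-tuning data preparation could start from WhisperX transcriptions along these lines. The manifest layout, model size, and directory names are assumptions; StyleTTS2 and SeedVC each expect their own training-list format:

```python
# Sketch: transcribe training clips with WhisperX and write a simple manifest file.
from pathlib import Path

import whisperx

DEVICE = "cuda"
model = whisperx.load_model("large-v2", DEVICE, compute_type="float16")


def build_manifest(clip_dir: str, out_path: str, speaker: str) -> None:
    """Write one 'wav_path|transcript|speaker' line per clip (assumed manifest layout)."""
    lines = []
    for wav in sorted(Path(clip_dir).glob("*.wav")):
        audio = whisperx.load_audio(str(wav))
        result = model.transcribe(audio, batch_size=16)
        text = " ".join(seg["text"].strip() for seg in result["segments"])
        lines.append(f"{wav}|{text}|{speaker}")
    Path(out_path).write_text("\n".join(lines), encoding="utf-8")


# Example: build_manifest("data/my_voice", "data/train_list.txt", speaker="0")
```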
I spent a summer building this project, including time spent researching papers to choose the models and planning the architecture.
Adding my plan below (sorry, you might have to zoom in for a clearer view, but it is detailed and accompanied by explanations of the rationale behind my choice of specific tools and technologies):