snehajain1011/voxy_ai

Voxy AI

Voxy AI is a full-stack, self-hosted application for AI-powered audio generation: text-to-speech, voice conversion, and sound-effect generation.

Features

  • Generate Speech from Text (Text-to-Speech): Convert written text into natural-sounding audio.
  • Change Voices (Voice Conversion): Transform an existing audio clip from one voice to another.
  • Generate Sound Effects: Create custom sound effects from descriptive text prompts.

Technical Implementation Details

  • Scalable AI Inference APIs: Designed and deployed scalable AI inference APIs with Python FastAPI, containerized using Docker, and orchestrated via Docker Compose on AWS EC2 instances leveraging GPU acceleration.
  • Cloud Infrastructure: Built a robust cloud infrastructure on AWS (EC2, S3, ECR, IAM) for hosting AI models, managing container images, and securely storing generated audio outputs.
  • Secure Audio Handling: Implemented secure data handling for audio assets by integrating AWS S3 for storage and utilizing pre-signed URLs to enable direct, time-limited client access to generated audio.
  • API Development: Developed API endpoints for each AI service, including input validation with Pydantic and API key authentication for secure access.
  • Model Deployment: Containerized complex PyTorch-based AI models into production-ready Docker images, ensuring consistent runtime environments across development and deployment.
  • Service Orchestration: Configured Docker Compose to manage the multi-service AI backend, defining service dependencies, port mappings, and resource allocations for optimal performance.
  • Audio Processing Tools: Integrated essential audio processing tools such as ffmpeg for audio format handling and phonemizer (with espeak-ng) for phonetic transcription within the TTS pipeline.
  • Model Fine-tuning: Built custom fine-tuning pipelines for the StyleTTS2 and Seed-VC models, covering data preparation (transcription with WhisperX) and management of training configurations for personalized voice generation.
  • Frontend Development: Developed the responsive frontend application using Next.js and React, integrating securely with the Python AI backend via a Node.js proxy to manage user interactions and audio history.
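The API-key check and the time-limited audio links described above can be illustrated with a small, standard-library-only sketch. It is not the project's code and it is not the AWS signing algorithm: it uses a simplified HMAC scheme to show the idea behind pre-signed URLs (a link that carries its own expiry and a signature the server can verify). The secret, API key, base URL, and all function names are hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical values for illustration only; none of these come from the project.
SIGNING_SECRET = b"demo-signing-secret"
EXPECTED_API_KEY = "demo-api-key"
BASE_URL = "https://audio.example.com"


def is_valid_api_key(provided: str) -> bool:
    """Compare keys in constant time so timing does not leak key contents."""
    return hmac.compare_digest(provided.encode(), EXPECTED_API_KEY.encode())


def _sign(path: str, expires: int) -> str:
    """HMAC-SHA256 over the path and expiry; a stand-in for real S3 signing."""
    message = f"{path}:{expires}".encode()
    return hmac.new(SIGNING_SECRET, message, hashlib.sha256).hexdigest()


def make_signed_url(path: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Return a URL for `path` that stops verifying after `ttl_seconds`."""
    issued_at = time.time() if now is None else now
    expires = int(issued_at) + ttl_seconds
    query = urlencode({"expires": expires, "signature": _sign(path, expires)})
    return f"{BASE_URL}{path}?{query}"


def verify_signed_url(url: str, now: Optional[float] = None) -> bool:
    """Accept the URL only if it has not expired and the signature matches."""
    current = time.time() if now is None else now
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    if current > expires:
        return False
    expected = _sign(parsed.path, expires)
    return hmac.compare_digest(signature.encode(), expected.encode())


if __name__ == "__main__":
    if is_valid_api_key("demo-api-key"):
        link = make_signed_url("/tts/output.wav", ttl_seconds=300)
        print(link, verify_signed_url(link))
```

With S3 itself, the signing step would normally be delegated to boto3's `generate_presigned_url` (with its `ExpiresIn` parameter) rather than hand-rolled as it is here; the sketch only shows why such a link can be handed directly to a client without exposing storage credentials.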

I spent a summer building this project, including time researching papers to decide which models to use and planning the architecture.

My plan is below. You may need to zoom in for a clearer view, but the diagrams are detailed and include explanations of why I chose specific tools and technologies:

FEATURE OVERVIEW

image

FINETUNING

image image

API DEPLOYMENT

image

FLOW

image

IN-DEPTH FLOW

image image image

FUTURE SCOPE

image
