Voxy AI is a full-stack, self-hosted application for advanced AI-powered audio generation. It provides three core capabilities:
- Generate Speech from Text (Text-to-Speech): Convert written text into natural-sounding audio.
- Change Voices (Voice Conversion): Transform an existing audio clip from one voice to another.
- Generate Sound Effects: Create custom sound effects from descriptive text prompts.

Key engineering highlights:

- Scalable AI Inference APIs: Designed and deployed Python FastAPI inference services, containerized with Docker and orchestrated via Docker Compose on GPU-accelerated AWS EC2 instances (see the service sketch after this list).
- Cloud Infrastructure: Built a robust cloud infrastructure on AWS (EC2, S3, ECR, IAM) for hosting AI models, managing container images, and securely storing generated audio outputs.
- Secure Audio Handling: Integrated AWS S3 for storing generated audio and used pre-signed URLs to give clients direct, time-limited access to their files (see the pre-signed URL sketch after this list).
- API Development: Developed API endpoints for each AI service, with Pydantic input validation and API key authentication (see the validation and auth sketch after this list).
- Model Deployment: Containerized complex PyTorch-based AI models into production-ready Docker images, ensuring consistent runtime environments across development and deployment.
- Service Orchestration: Configured Docker Compose to manage the multi-service AI backend, defining service dependencies, port mappings, and resource allocations.
- Audio Processing Tools: Integrated ffmpeg for audio format handling and phonemizer (with the espeak-ng backend) for phonetic transcription in the TTS pipeline (see the preprocessing sketch after this list).
- Model Fine-tuning: Built custom fine-tuning pipelines for StyleTTS2 and SeedVC, including WhisperX transcription for data preparation and training-configuration management for personalized voice generation (see the data-preparation sketch after this list).
- Frontend Development: Built a responsive frontend with Next.js and React that manages user interactions and audio history, integrating securely with the Python AI backend through a Node.js proxy.
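
A minimal sketch of how one of the FastAPI inference services could be bootstrapped inside its container. The module layout, service title, and health route shown here are illustrative assumptions rather than the actual Voxy AI code:

```python
# Sketch: skeleton of a GPU-aware FastAPI inference service (names are assumptions).
import torch
from fastapi import FastAPI

app = FastAPI(title="Voxy AI - TTS service")

# On a GPU EC2 instance with the NVIDIA runtime exposed to the container this
# resolves to "cuda"; falling back to CPU keeps local development usable.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"


@app.get("/health")
def health() -> dict:
    # Lightweight liveness check for Docker Compose health checks or a reverse proxy.
    return {"status": "ok", "device": DEVICE}
```

Inside the container such a service would typically be started with something like `uvicorn main:app --host 0.0.0.0 --port 8000`.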
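
The pre-signed URL flow for generated audio can be sketched with boto3 roughly as follows; the bucket name, key layout, and expiry shown are placeholders, and credentials are assumed to come from the EC2 instance's IAM role:

```python
# Sketch: upload a generated clip to S3 and hand back a time-limited download link.
import boto3

s3 = boto3.client("s3")  # picks up credentials from the instance's IAM role


def upload_and_presign(local_path: str, bucket: str, key: str, expires_s: int = 3600) -> str:
    """Upload the audio file, then return a URL that expires after `expires_s` seconds."""
    s3.upload_file(local_path, bucket, key)
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_s,
    )


# Example (placeholder names):
# url = upload_and_presign("out/tts_123.wav", "voxy-audio-outputs", "tts/123.wav")
```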
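
Input validation and API key authentication in FastAPI might look roughly like this; the request fields, header name, environment variable, and endpoint path are assumptions made for illustration:

```python
# Sketch: Pydantic request validation plus API-key authentication on a TTS endpoint.
import os

from fastapi import Depends, FastAPI, HTTPException, Security
from fastapi.security import APIKeyHeader
from pydantic import BaseModel, Field

app = FastAPI()
api_key_header = APIKeyHeader(name="X-API-Key")


def require_api_key(api_key: str = Security(api_key_header)) -> str:
    # Compare against a key injected via the environment (e.g. by Docker Compose).
    if api_key != os.environ.get("VOXY_API_KEY"):
        raise HTTPException(status_code=401, detail="Invalid API key")
    return api_key


class TTSRequest(BaseModel):
    # Field constraints reject empty or oversized inputs before inference runs.
    text: str = Field(min_length=1, max_length=2000)
    voice_id: str = Field(min_length=1)


@app.post("/tts")
def generate_speech(req: TTSRequest, _: str = Depends(require_api_key)) -> dict:
    # Placeholder body: the real handler runs the TTS model and uploads the result to S3.
    return {"status": "queued", "characters": len(req.text)}
```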
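
A rough sketch of how ffmpeg and phonemizer could be wired into preprocessing; the target sample rate and helper names are assumptions, not the project's actual pipeline code:

```python
# Sketch: normalize uploaded audio with ffmpeg and phonemize text for the TTS model.
import subprocess

from phonemizer import phonemize


def to_mono_wav(src: str, dst: str, sample_rate: int = 24000) -> None:
    """Convert any input format to a mono WAV at the (assumed) model sample rate."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(sample_rate), dst],
        check=True,
    )


def to_phonemes(text: str) -> str:
    """Phonetic transcription via the espeak-ng backend, as fed to the TTS frontend."""
    return phonemize(text, language="en-us", backend="espeak", strip=True)


# Example: to_mono_wav("upload.mp3", "clean.wav"); to_phonemes("Hello from Voxy AI")
```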
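
Fine-tuning data preparation could start from WhisperX transcriptions along these lines. The manifest layout, model size, and directory names are assumptions; StyleTTS2 and SeedVC each expect their own training-list format:

```python
# Sketch: transcribe training clips with WhisperX and write a simple manifest file.
from pathlib import Path

import whisperx

DEVICE = "cuda"
model = whisperx.load_model("large-v2", DEVICE, compute_type="float16")


def build_manifest(clip_dir: str, out_path: str, speaker: str) -> None:
    """Write one 'wav_path|transcript|speaker' line per clip (assumed manifest layout)."""
    lines = []
    for wav in sorted(Path(clip_dir).glob("*.wav")):
        audio = whisperx.load_audio(str(wav))
        result = model.transcribe(audio, batch_size=16)
        text = " ".join(seg["text"].strip() for seg in result["segments"])
        lines.append(f"{wav}|{text}|{speaker}")
    Path(out_path).write_text("\n".join(lines), encoding="utf-8")


# Example: build_manifest("data/my_voice", "data/train_list.txt", speaker="0")
```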
I spent a summer building this project, including time spent researching papers to choose the models and planning the architecture.
Adding my plan below (sorry, you might have to zoom in for a clearer view, but it is detailed and accompanied by explanations of the rationale behind my choice of specific tools and technologies):