The AI AudioBook Generator transforms text documents into expressive, human-like audiobooks. It extracts text from PDF, DOCX, and TXT files, rewrites it in a storytelling style using the Gemini LLM, and converts the result into natural speech using Coqui TTS, with pyttsx3 as an offline fallback.
- Multi-Format Support: Upload PDF, DOCX, and TXT documents
- AI-Powered Narration: Gemini LLM enhances text into audiobook-style narration
- Natural Speech Generation: Coqui TTS for high-quality voice synthesis
- Offline Capability: pyttsx3 fallback for offline usage
- Secure Configuration: Environment-based API key management
- User-Friendly UI: Clean, interactive Streamlit interface
- Centralized Configuration: Easy settings management via config.py
- Students for learning on the go
- Professionals for consuming reports and documents
- Visually impaired users for accessible content
- Content creators for repurposing written content
- Anyone who prefers listening over reading
AI-AudioBook-Generator/
│
├── app.py               # Streamlit user interface
├── config.py            # Loads .env and manages global settings
├── llm_enrichment.py    # Gemini AI narration enhancement
├── text_extraction.py   # PDF/DOCX/TXT text extraction
├── tts_generator.py     # Coqui TTS + pyttsx3 audio generation
├── requirements.txt     # Python dependencies
├── .env.example         # Environment variables template
└── README.md            # Project documentation
- Python 3.11 or higher
- Git
- Internet connection (for Gemini API and Coqui TTS)
- API keys for Gemini AI
git clone https://github.com/Harsha-2005/AI-AudioBook-Generator.git
cd AI-AudioBook-Generator
# Windows
python -m venv venv
venv\Scripts\activate
# macOS/Linux
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Windows:
- Install eSpeak NG
- Add it to PATH
- Restart terminal
Ubuntu/Linux:
sudo apt update
sudo apt install espeak-ng
macOS:
brew install espeak
- Visit Google AI Studio
- Generate a Gemini API key
- (Optional) Get OpenAI API key if needed
Create a .env file in the project root:
# API Keys
GEMINI_API_KEY=your_gemini_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
# TTS Configuration
TTS_ENGINE=coqui
TTS_OUTPUT_FORMAT=wav
# Application Settings
DEBUG_MODE=False
Add .env to .gitignore to keep your keys secure!
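A minimal sketch of how these variables might be read at startup, assuming python-dotenv (listed in the technology stack). The `Settings` dataclass and `load_settings` function are hypothetical names used for illustration; the project's actual config.py may be organized differently.

```python
import os
from dataclasses import dataclass

try:
    from dotenv import load_dotenv  # provided by python-dotenv
except ImportError:  # keep the sketch usable when the package is missing
    load_dotenv = None


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values defined in .env."""
    gemini_api_key: str
    tts_engine: str
    tts_output_format: str
    debug_mode: bool


def load_settings() -> Settings:
    """Read settings from the environment (hypothetical helper)."""
    if load_dotenv is not None:
        # Copies .env entries into os.environ; existing variables are not overridden.
        load_dotenv()
    api_key = os.getenv("GEMINI_API_KEY", "")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to .env")
    return Settings(
        gemini_api_key=api_key,
        tts_engine=os.getenv("TTS_ENGINE", "coqui").lower(),
        tts_output_format=os.getenv("TTS_OUTPUT_FORMAT", "wav").lower(),
        debug_mode=os.getenv("DEBUG_MODE", "False").strip().lower() == "true",
    )
```

Failing fast on a missing key surfaces configuration problems at startup rather than in the middle of audiobook generation.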
streamlit run app.py
- Upload Document: Drag and drop or select a PDF, DOCX, or TXT file
- Preview Text: Review the extracted text before processing
- Generate Audiobook: Click the "Generate Audiobook" button
- Listen/Download: Play the audio directly or download the generated file
Upload Document → Extract Text → AI Narration Enhancement → Convert to Speech → Download Audiobook
- PDF: Uses PyPDF2 & pdfplumber for accurate text extraction
- DOCX: Leverages python-docx for Word document parsing
- TXT: Direct file reading with encoding detection
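The per-format extraction described above could be sketched as a small dispatcher keyed on file extension. This is an illustration under stated assumptions, not the project's text_extraction.py: the function names are hypothetical, only pdfplumber is used for PDFs, and the TXT reader uses a simple encoding fallback chain rather than true encoding detection.

```python
from pathlib import Path


def read_txt(path: Path) -> str:
    """Read a text file, trying a few common encodings in order."""
    data = path.read_bytes()
    for encoding in ("utf-8-sig", "cp1252"):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return data.decode("latin-1")  # latin-1 maps every byte, so this cannot fail


def read_pdf(path: Path) -> str:
    import pdfplumber  # imported lazily so TXT-only use needs no extra packages

    with pdfplumber.open(path) as pdf:
        # extract_text() can return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def read_docx(path: Path) -> str:
    from docx import Document  # provided by python-docx

    document = Document(str(path))
    return "\n".join(paragraph.text for paragraph in document.paragraphs)


# Hypothetical mapping from file extension to reader function.
EXTRACTORS = {".pdf": read_pdf, ".docx": read_docx, ".txt": read_txt}


def extract_text(path: str) -> str:
    """Pick a reader by file extension and return the extracted text."""
    file_path = Path(path)
    extractor = EXTRACTORS.get(file_path.suffix.lower())
    if extractor is None:
        raise ValueError(f"Unsupported file type: {file_path.suffix}")
    return extractor(file_path)
```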
- Gemini LLM rewrites text into audiobook-style narration
- Intelligent chunking prevents token overflow
- Adds expressive elements for better listening experience
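One way to chunk text before sending it to the LLM is to group whole sentences up to a size limit, so no request exceeds the model's input budget and no sentence is cut in half unless it is longer than the limit itself. The sketch below is an assumption about how this could work: `chunk_text` is a hypothetical name, and the limit is measured in characters as a rough stand-in for tokens.

```python
import re


def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be rewritten by the LLM separately and the results joined in order. A smaller `max_chars` lowers peak memory and request size at the cost of more API calls.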
- Primary: Coqui TTS for natural, human-like speech
- Fallback: pyttsx3 for offline functionality
- Configurable output formats (WAV, MP3)
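The primary/fallback arrangement can be sketched as a loop over engines that moves on when one fails. This is a hedged illustration rather than the project's tts_generator.py: the function names are hypothetical, and the Coqui model name is an assumption chosen only as an example.

```python
from typing import Callable, Sequence

Synthesizer = Callable[[str, str], None]  # (text, output_path) -> None


def synthesize_with_coqui(text: str, output_path: str) -> None:
    from TTS.api import TTS  # Coqui TTS; imported lazily because it is heavy

    # Assumed model name; the first use downloads the model, which needs internet.
    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
    tts.tts_to_file(text=text, file_path=output_path)


def synthesize_with_pyttsx3(text: str, output_path: str) -> None:
    import pyttsx3  # offline engine backed by the system speech driver

    engine = pyttsx3.init()
    engine.save_to_file(text, output_path)
    engine.runAndWait()  # blocks until the queued file has been written


def generate_audio(
    text: str,
    output_path: str,
    engines: Sequence[Synthesizer] = (synthesize_with_coqui, synthesize_with_pyttsx3),
) -> str:
    """Try each engine in order and return the name of the one that succeeded."""
    errors = []
    for engine in engines:
        try:
            engine(text, output_path)
            return engine.__name__
        except Exception as exc:  # broad on purpose: import, network, and driver errors all trigger fallback
            errors.append(f"{engine.__name__}: {exc}")
    raise RuntimeError("All TTS engines failed: " + "; ".join(errors))
```

Passing the engines as a parameter keeps the fallback logic testable without installing either TTS library.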
- Streamlit-based interactive UI
- Real-time progress tracking
- Audio playback and download options
python -m pytest tests/
- Text extraction from all supported formats
- LLM narration enhancement
- Audio synthesis with both TTS engines
- Error handling and fallback mechanisms
Streamlit Cloud:
- Push code to GitHub
- Connect to Streamlit Cloud
- Add environment variables in settings
Hugging Face Spaces:
- Create new Space
- Select Streamlit SDK
- Upload code and configure secrets
# Build Docker image
docker build -t audiobook-generator .
# Run container
docker run -p 8501:8501 audiobook-generator
# Run as background service
nohup streamlit run app.py --server.port 8501 &
| Category | Technology |
|---|---|
| Programming Language | Python 3.11 |
| UI Framework | Streamlit |
| AI/ML | Google Gemini LLM |
| Speech Synthesis | Coqui TTS, pyttsx3 |
| Text Extraction | PyPDF2, pdfplumber, python-docx |
| Configuration | python-dotenv |
| Document Parsing | PyMuPDF, docx2txt |
- API Key Errors
  - Verify the .env file exists and contains correct keys
  - Check API key validity at Google AI Studio
- TTS Engine Issues
  - Ensure eSpeak NG is properly installed
  - Check internet connection for Coqui TTS
- Memory Issues
  - Reduce chunk size in config.py for large documents
  - Close other applications to free up memory
Enable debug mode in .env for detailed logging:
DEBUG_MODE=True
| Document Size | Processing Time | Audio Duration |
|---|---|---|
| 1-10 pages | 1-2 minutes | 5-15 minutes |
| 10-50 pages | 3-5 minutes | 15-60 minutes |
| 50+ pages | 5-10+ minutes | 60+ minutes |
- Multi-voice selection for different characters
- Multi-language support for global accessibility
- Background music mixing options
- Chapter-wise audio segmentation
- Cloud storage integration for saving audiobooks
- User authentication and library management
- Mobile app development
- OCR support for scanned documents
This project is licensed under the MIT License - see the LICENSE file for details.
- Google Gemini AI
- Coqui TTS
- Streamlit
- All open-source libraries used in this project
Harsha Pavan Maddala
- GitHub: @Harsha-2005
- LinkedIn: Harsha Pavan Maddala