Harsha-2005/AudioBook-Generator

🎧 AI AudioBook Generator

The AI AudioBook Generator turns text documents into narrated audiobooks. It extracts text from PDF, DOCX, and TXT files, rewrites it in a storytelling style with the Gemini LLM, and converts the result to speech with Coqui TTS or, as an offline fallback, pyttsx3.

✨ Features

  • 📄 Multi-Format Support: Upload PDF, DOCX, and TXT documents
  • 🤖 AI-Powered Narration: Gemini LLM enhances text into audiobook-style narration
  • 🎙️ Natural Speech Generation: Coqui TTS for high-quality voice synthesis
  • 📴 Offline Capability: pyttsx3 fallback for offline usage
  • 🔒 Secure Configuration: Environment-based API key management
  • 🎨 User-Friendly UI: Clean, interactive Streamlit interface
  • ⚙️ Centralized Configuration: Easy settings management via config.py

🎯 Target Audience

  • Students for learning on the go
  • Professionals for consuming reports and documents
  • Visually impaired users for accessible content
  • Content creators for repurposing written content
  • Anyone who prefers listening over reading

🏗️ System Architecture

AI-AudioBook-Generator/
│
├── app.py                    # Streamlit user interface
├── config.py                 # Loads .env and manages global settings
├── llm_enrichment.py         # Gemini AI narration enhancement
├── text_extraction.py        # PDF/DOCX/TXT text extraction
├── tts_generator.py          # Coqui TTS + pyttsx3 audio generation
├── requirements.txt          # Python dependencies
├── .env.example              # Environment variables template
└── README.md                 # Project documentation

📋 Prerequisites

  • Python 3.11 or higher
  • Git
  • Internet connection (for the Gemini API and for downloading Coqui TTS models)
  • A Gemini API key

🚀 Quick Installation

1️⃣ Clone the Repository

git clone https://github.com/Harsha-2005/AI-AudioBook-Generator.git
cd AI-AudioBook-Generator

2️⃣ Create Virtual Environment

# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python -m venv venv
source venv/bin/activate

3️⃣ Install Dependencies

pip install -r requirements.txt

4️⃣ System Dependencies

Windows:

  • Install eSpeak NG
  • Add it to PATH
  • Restart terminal

Ubuntu/Linux:

sudo apt update
sudo apt install espeak-ng

macOS:

brew install espeak

🔐 Configuration

1. Get API Keys

  1. Visit Google AI Studio
  2. Generate a Gemini API key
  3. (Optional) Get an OpenAI API key if needed

2. Configure Environment

Create a .env file in the project root:

# API Keys
GEMINI_API_KEY=your_gemini_api_key_here
OPENAI_API_KEY=your_openai_api_key_here

# TTS Configuration
TTS_ENGINE=coqui
TTS_OUTPUT_FORMAT=wav

# Application Settings
DEBUG_MODE=False

⚠️ Important: Add .env to .gitignore to keep your keys secure!
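
As a rough illustration of how config.py might read these values, here is a minimal sketch. It assumes python-dotenv (listed in the Tech Stack) is installed and falls back to plain environment variables if it is not. The `Settings` class and `load_settings` function are hypothetical names chosen for this example, not the project's actual code; the defaults mirror the sample .env above.

```python
import os
from dataclasses import dataclass

try:
    # python-dotenv is optional in this sketch; without it, only real
    # environment variables are read.
    from dotenv import load_dotenv
except ImportError:
    load_dotenv = None


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values defined in .env."""
    gemini_api_key: str
    tts_engine: str
    tts_output_format: str
    debug_mode: bool


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    if env is None:
        if load_dotenv is not None:
            load_dotenv()  # Loads variables from a .env file, if one is found.
        env = os.environ
    return Settings(
        gemini_api_key=env.get("GEMINI_API_KEY", ""),
        tts_engine=env.get("TTS_ENGINE", "coqui").lower(),
        tts_output_format=env.get("TTS_OUTPUT_FORMAT", "wav").lower(),
        debug_mode=env.get("DEBUG_MODE", "False").strip().lower() in ("1", "true", "yes"),
    )
```

Passing a plain dictionary to `load_settings` makes the parsing easy to check without touching the real environment.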

🖥️ Usage

Start the Application

streamlit run app.py

Using the Web Interface

  1. Upload Document: Drag and drop or select a PDF, DOCX, or TXT file
  2. Preview Text: Review the extracted text before processing
  3. Generate Audiobook: Click the "Generate Audiobook" button
  4. Listen/Download: Play the audio directly or download the generated file

Workflow

Upload Document → Extract Text → AI Narration Enhancement → Convert to Speech → Download Audiobook
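
One way to picture this workflow in code is as three stages called in order. The sketch below is an assumption about the overall shape, not the project's actual implementation: `run_pipeline` and its parameters are hypothetical, and each stage is passed in as a plain function so it can be swapped or tested on its own.

```python
from typing import Callable


def run_pipeline(
    document_path: str,
    extract: Callable[[str], str],
    enhance: Callable[[str], str],
    synthesize: Callable[[str, str], str],
    output_path: str = "audiobook.wav",
) -> str:
    """Run extraction, narration enhancement, and speech synthesis in order.

    Returns whatever the synthesize stage returns (here, the audio file path).
    """
    raw_text = extract(document_path)
    if not raw_text.strip():
        # Stop early rather than sending empty text to the LLM and TTS stages.
        raise ValueError(f"No text could be extracted from {document_path!r}")
    narration = enhance(raw_text)
    return synthesize(narration, output_path)
```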

🧠 How It Works

1. Text Extraction

  • PDF: Uses PyPDF2 & pdfplumber for accurate text extraction
  • DOCX: Leverages python-docx for Word document parsing
  • TXT: Direct file reading with encoding detection
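
The extraction step could be sketched as a dispatch on file extension. This is a minimal illustration under assumptions: the function names are hypothetical, only pdfplumber is shown for PDFs, and "encoding detection" is approximated by trying a short list of encodings in order. The PDF and DOCX readers import their third-party packages lazily, so the TXT path works without them.

```python
from pathlib import Path


def read_txt(path, encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Decode a text file by trying each encoding until one succeeds."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path!r} with any of {encodings}")


def read_pdf(path):
    """Extract text page by page with pdfplumber (third-party)."""
    import pdfplumber
    with pdfplumber.open(path) as pdf:
        # extract_text() may return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def read_docx(path):
    """Join paragraph text from a Word document using python-docx (third-party)."""
    import docx
    document = docx.Document(path)
    return "\n".join(paragraph.text for paragraph in document.paragraphs)


EXTRACTORS = {".txt": read_txt, ".pdf": read_pdf, ".docx": read_docx}


def extract_text(path):
    """Pick an extractor based on the file extension."""
    suffix = Path(path).suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
    return extractor(path)
```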

2. AI Narration Enhancement

  • Gemini LLM rewrites text into audiobook-style narration
  • Intelligent chunking prevents token overflow
  • Adds expressive elements for better listening experience
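
To illustrate the chunking idea, here is a minimal sketch that splits text at sentence boundaries so each piece stays under a size limit before it is sent to the LLM. It is an assumption about the approach, not the project's code: `chunk_text` is a hypothetical name, and the limit is counted in characters as a simple stand-in for tokens.

```python
import re


def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking after sentences where possible."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Split after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is cut into fixed-size pieces.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Lowering `max_chars` produces more, smaller requests, which is the same trade-off as reducing the chunk size for large documents.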

3. Text-to-Speech Generation

  • Primary: Coqui TTS for natural, human-like speech
  • Fallback: pyttsx3 for offline functionality
  • Configurable output formats (WAV, MP3)
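
The primary/fallback arrangement might look roughly like the sketch below. The function names are hypothetical and the Coqui model name is only an example choice; the project may select a different model or structure this differently. Both engines are imported lazily, so a missing package or a failed model download simply moves on to the next engine.

```python
import logging

logger = logging.getLogger(__name__)


def coqui_engine(text: str, output_path: str) -> None:
    """Synthesize with Coqui TTS. The model name is an assumed example."""
    from TTS.api import TTS
    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
    tts.tts_to_file(text=text, file_path=output_path)


def pyttsx3_engine(text: str, output_path: str) -> None:
    """Synthesize offline with pyttsx3."""
    import pyttsx3
    engine = pyttsx3.init()
    engine.save_to_file(text, output_path)
    engine.runAndWait()


def synthesize(text, output_path, engines=(coqui_engine, pyttsx3_engine)) -> str:
    """Try each engine in order and return the name of the one that succeeded."""
    errors = []
    for engine in engines:
        try:
            engine(text, output_path)
            return engine.__name__
        except Exception as exc:
            # Any failure (missing package, no network, audio driver error)
            # is logged and the next engine is tried.
            logger.warning("%s failed: %s", engine.__name__, exc)
            errors.append(f"{engine.__name__}: {exc}")
    raise RuntimeError("All TTS engines failed: " + "; ".join(errors))
```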

4. User Interface

  • Streamlit-based interactive UI
  • Real-time progress tracking
  • Audio playback and download options
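
A Streamlit flow matching this description could be sketched as follows. This is an illustration under assumptions, not the contents of app.py: `render_app`, `extract_text`, and `generate_audio` are hypothetical names, and the Streamlit module is passed in as `st` so the flow can be exercised with a stub. A spinner stands in for progress tracking here.

```python
def render_app(st, extract_text, generate_audio):
    """Render the upload -> preview -> generate -> play/download flow.

    `st` is the streamlit module; `extract_text(name, data)` returns a string
    and `generate_audio(text)` returns WAV bytes (both hypothetical helpers).
    """
    st.title("AI AudioBook Generator")
    uploaded = st.file_uploader("Upload a document", type=["pdf", "docx", "txt"])
    if uploaded is None:
        return None  # Nothing uploaded yet.

    # Show the extracted text so the user can review it before processing.
    text = extract_text(uploaded.name, uploaded.getvalue())
    st.text_area("Extracted text", value=text, height=300)

    if not st.button("Generate Audiobook"):
        return None  # Wait for the user to click the button.

    with st.spinner("Generating audio..."):
        audio_bytes = generate_audio(text)

    st.audio(audio_bytes, format="audio/wav")
    st.download_button(
        "Download audiobook",
        data=audio_bytes,
        file_name="audiobook.wav",
        mime="audio/wav",
    )
    return audio_bytes
```

In a real app.py this would be called after `import streamlit as st`, for example `render_app(st, my_extract, my_generate)`.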

🧪 Testing

Unit Testing

python -m pytest tests/

Test Coverage

  • ✅ Text extraction from all supported formats
  • ✅ LLM narration enhancement
  • ✅ Audio synthesis with both TTS engines
  • ✅ Error handling and fallback mechanisms

🚀 Deployment Options

Option 1: Streamlit Cloud

  1. Push code to GitHub
  2. Connect to Streamlit Cloud
  3. Add environment variables in settings

Option 2: Hugging Face Spaces

  1. Create new Space
  2. Select Streamlit SDK
  3. Upload code and configure secrets

Option 3: Docker

# Build Docker image
docker build -t audiobook-generator .

# Run container
docker run -p 8501:8501 audiobook-generator

Option 4: Local Deployment

# Run as background service
nohup streamlit run app.py --server.port 8501 &

🧰 Tech Stack

Category              | Technology
Programming Language  | Python 3.11
UI Framework          | Streamlit
AI/ML                 | Google Gemini LLM
Speech Synthesis      | Coqui TTS, pyttsx3
Text Extraction       | PyPDF2, pdfplumber, python-docx
Configuration         | python-dotenv
Document Parsing      | PyMuPDF, docx2txt

🔧 Troubleshooting

Common Issues

  1. API Key Errors

    • Verify .env file exists and contains correct keys
    • Check API key validity at Google AI Studio
  2. TTS Engine Issues

    • Ensure eSpeak NG is properly installed
    • Check internet connection for Coqui TTS
  3. Memory Issues

    • Reduce chunk size in config.py for large documents
    • Close other applications to free up memory

Logs

Enable debug mode in .env for detailed logging:

DEBUG_MODE=True
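
How the flag maps to log output is not shown in this README. A minimal sketch using the standard `logging` module, with the hypothetical helper name `configure_logging`, might look like this:

```python
import logging
import os


def configure_logging(env=None) -> int:
    """Set the root log level from DEBUG_MODE and return the level chosen."""
    env = os.environ if env is None else env
    debug = env.get("DEBUG_MODE", "False").strip().lower() in ("1", "true", "yes")
    level = logging.DEBUG if debug else logging.INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # Replace any handlers configured earlier (Python 3.8+).
    )
    return level
```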

📈 Performance Metrics

Document Size | Processing Time | Audio Duration
1-10 pages    | 1-2 minutes     | 5-15 minutes
10-50 pages   | 3-5 minutes     | 15-60 minutes
50+ pages     | 5-10+ minutes   | 60+ minutes

🔮 Future Enhancements

  • 🎭 Multi-voice selection for different characters
  • 🌍 Multi-language support for global accessibility
  • 🎵 Background music mixing options
  • 📖 Chapter-wise audio segmentation
  • ☁️ Cloud storage integration for saving audiobooks
  • 👥 User authentication and library management
  • 📱 Mobile app development
  • 🔍 OCR support for scanned documents

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

👨‍💻 Author

Harsha Pavan Maddala
