An intelligent academic paper discovery and extraction platform that uses Google Gemini AI to analyze research topics and automatically download relevant Open Access PDFs found through Crossref and Unpaywall.
- AI-Powered Topic Analysis: Uses Google Gemini 2.0 Flash to analyze research descriptions and break them into relevant sub-topics
- Automated PDF Scraping: Automatically searches and downloads Open Access PDFs from Crossref and Unpaywall
- Intelligent Keyword Generation: Converts natural language descriptions into effective academic search keywords
- Multi-Topic Processing: Handles complex research topics by breaking them into manageable sub-topics
- Real-time Progress Tracking: Shows live scraping progress and results
- Organized Download: Collects all PDFs into a single downloadable zip file
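As a rough illustration of the "Organized Download" feature, the sketch below bundles a folder of downloaded PDFs into one zip archive using only the standard library. The function name `zip_pdfs` and the flat archive layout are assumptions made for this example; the project's own packaging code may differ.

```python
import zipfile
from pathlib import Path


def zip_pdfs(pdf_dir, zip_path):
    """Bundle every .pdf file in pdf_dir into one zip archive.

    Returns the list of file names that were added (hypothetical helper).
    """
    pdf_dir = Path(pdf_dir)
    added = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for pdf in sorted(pdf_dir.glob("*.pdf")):
            # Store only the file name so the archive has a flat layout.
            archive.write(pdf, arcname=pdf.name)
            added.append(pdf.name)
    return added
```

Calling `zip_pdfs("downloads", "papers.zip")` would then produce a single archive that a web route can offer for download.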
- User Input: Enter a detailed research topic or concept in natural language
- AI Analysis: Gemini 2.0 Flash analyzes the description and breaks it into relevant sub-topics
- Keyword Generation: For each sub-topic, the system generates specific academic keywords
- PDF Scraping: Searches Crossref and Unpaywall databases for Open Access PDFs using the generated keywords
- Collection: Downloads and organizes all found PDFs
- Download: Provides a single zip file containing all collected research papers
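The search and lookup steps above can be sketched as follows. This is a minimal illustration, not the code in `scrape_pdfs.py`: the helper names are hypothetical, and it assumes the public Crossref works endpoint and the Unpaywall v2 endpoint, whose JSON responses expose `message.items[].DOI` and `best_oa_location.url_for_pdf` respectively. Only URL building and response parsing are shown, so the block runs without network access.

```python
from urllib.parse import quote, urlencode

CROSSREF_WORKS_URL = "https://api.crossref.org/works"
UNPAYWALL_URL = "https://api.unpaywall.org/v2/"


def crossref_search_url(keywords, rows=20):
    """Build a Crossref works query for a list of keywords."""
    params = {"query": " ".join(keywords), "rows": rows}
    return CROSSREF_WORKS_URL + "?" + urlencode(params)


def dois_from_crossref(payload):
    """Collect DOIs from a Crossref works response (message -> items -> DOI)."""
    items = payload.get("message", {}).get("items", [])
    return [item["DOI"] for item in items if "DOI" in item]


def unpaywall_lookup_url(doi, email):
    """Build the Unpaywall lookup URL for one DOI (Unpaywall asks for an email)."""
    return UNPAYWALL_URL + quote(doi, safe="/") + "?" + urlencode({"email": email})


def pdf_url_from_unpaywall(record):
    """Return the direct PDF link of the best Open Access location, or None."""
    location = record.get("best_oa_location") or {}
    return location.get("url_for_pdf")
```

In a real run, each URL would be fetched with something like `requests.get(url, timeout=30).json()` and the parsed result passed to the matching helper; records without an Open Access PDF simply yield `None` and can be skipped.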
- Python 3.7 or higher
- Flask
- Requests
- python-dotenv
- tqdm
- Clone the repository:
  git clone <repository-url>
  cd pdf-research-scraper
- Install required packages:
  pip install -r requirements.txt
- Create a .env file in the project root with your Gemini API key:
  GEMINI_API_KEY=your_gemini_api_key_here
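For reference, here is one way the application could read the key from the .env file at startup. This is a sketch under assumptions: the helper names are invented for illustration, and rejecting the placeholder value is an extra safeguard, not something the project is known to do. `load_dotenv` comes from the python-dotenv package listed in the requirements.

```python
import os


def read_api_key(env=None):
    """Return GEMINI_API_KEY from a mapping (defaults to os.environ), or raise."""
    env = os.environ if env is None else env
    key = (env.get("GEMINI_API_KEY") or "").strip()
    # Treat the README placeholder as "not configured" (assumed safeguard).
    if not key or key == "your_gemini_api_key_here":
        raise RuntimeError("GEMINI_API_KEY is missing; add it to your .env file")
    return key


def load_key_from_dotenv():
    """Load .env from the current directory, then read the key."""
    from dotenv import load_dotenv  # provided by the python-dotenv package

    load_dotenv()
    return read_api_key()
```

Failing early with a clear message is usually easier to debug than an authentication error raised later by the first API call.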
Visit the live application: https://pdf-research-scraper.onrender.com/
- Start the application:
  python app.py
- Open your browser and navigate to http://localhost:5000
- Enter your research topic in detail and click "Start Scraping"
- Monitor the progress in real time
- Download the collected PDFs when the process completes
pdf-research-scraper/
├── app.py # Flask web application
├── scrape_pdfs.py # PDF scraping logic
├── .env # API key storage (not in version control)
├── .gitignore # Git ignore file
├── requirements.txt # Python dependencies
├── README.md # This file
└── templates/
    └── index.html # Web interface
- Go to Google AI Studio
- Create an account or sign in
- Navigate to API Keys section
- Create a new API key
- Copy the key and add it to your .env file
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a pull request
This project is licensed under the MIT License.