PDF Research Scraper

An academic paper discovery and extraction platform that uses Google Gemini AI to analyze research topics and automatically download relevant Open Access PDFs from Crossref and Unpaywall.

Features

AI-Powered Topic Analysis: Uses Google Gemini 2.0 Flash to analyze research descriptions and break them into relevant sub-topics
Automated PDF Scraping: Searches Crossref and Unpaywall and downloads the Open Access PDFs it finds
Intelligent Keyword Generation: Converts natural language descriptions into effective academic search keywords
Multi-Topic Processing: Handles complex research topics by breaking them into manageable sub-topics
Real-time Progress Tracking: Shows live scraping progress and results
Organized Download: Collects all PDFs into a single downloadable zip file
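
The topic analysis and keyword generation features could be driven by a single prompt to the Gemini REST API. The sketch below is an illustration under stated assumptions, not the code in scrape_pdfs.py: the prompt wording, the JSON reply shape, and the helper names `build_request` and `parse_subtopics` are hypothetical. It only builds a request body and parses a reply, so it runs without a network call.

```python
import json

# Public REST endpoint for generateContent; the model name follows this README.
GEMINI_URL = (
    "https://generativelanguage.googleapis.com/v1beta/"
    "models/gemini-2.0-flash:generateContent"
)


def build_request(description):
    """Build a generateContent body asking for sub-topics and keywords as JSON."""
    # Assumed prompt: the real application may phrase this differently.
    prompt = (
        "Break the research description below into sub-topics. "
        'Reply with JSON only, shaped like '
        '[{"subtopic": "...", "keywords": ["..."]}].\n\n' + description
    )
    return {"contents": [{"parts": [{"text": prompt}]}]}


def parse_subtopics(response):
    """Map each sub-topic to its keyword list, given a generateContent reply."""
    text = response["candidates"][0]["content"]["parts"][0]["text"].strip()
    fence = "`" * 3
    # Models often wrap JSON in a Markdown code fence; remove it if present.
    if text.startswith(fence):
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[len("json"):]
    items = json.loads(text)
    return {item["subtopic"]: item["keywords"] for item in items}
```

With the `requests` package, the body would typically be sent as `requests.post(GEMINI_URL, params={"key": api_key}, json=build_request(description))`, and the decoded JSON reply passed to `parse_subtopics`.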

How It Works

User Input: Enter a detailed research topic or concept in natural language
AI Analysis: Gemini 2.0 Flash analyzes the description and breaks it into relevant sub-topics
Keyword Generation: For each sub-topic, the system generates specific academic keywords
PDF Scraping: Searches Crossref and Unpaywall databases for Open Access PDFs using the generated keywords
Collection: Downloads and organizes all found PDFs
Download: Provides a single zip file containing all collected research papers
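
The search step above can be sketched as two lookups: a Crossref query that returns DOIs for the keywords, then an Unpaywall lookup per DOI that reports an Open Access PDF link when one exists. This is a minimal sketch, not the project's actual scraping logic; the helper names and the `rows` default are assumptions. The endpoints and the JSON fields (`message.items[].DOI`, `best_oa_location.url_for_pdf`) are those of the public Crossref and Unpaywall APIs.

```python
from typing import List, Optional
from urllib.parse import quote

CROSSREF_WORKS_URL = "https://api.crossref.org/works"
UNPAYWALL_URL = "https://api.unpaywall.org/v2/"


def crossref_params(keywords: List[str], rows: int = 20) -> dict:
    """Query parameters for a Crossref works search (rows=20 is an assumption)."""
    return {"query": " ".join(keywords), "rows": rows}


def extract_dois(crossref_json: dict) -> List[str]:
    """Collect the DOI of every item in a Crossref works response."""
    items = crossref_json.get("message", {}).get("items", [])
    return [item["DOI"] for item in items if "DOI" in item]


def unpaywall_url(doi: str, email: str) -> str:
    """Unpaywall lookup URL; the API requires a contact email parameter."""
    return f"{UNPAYWALL_URL}{quote(doi)}?email={quote(email)}"


def pick_pdf_url(unpaywall_json: dict) -> Optional[str]:
    """Return the direct PDF link of the best Open Access copy, if any."""
    best = unpaywall_json.get("best_oa_location") or {}
    return best.get("url_for_pdf")
```

In use, `requests.get(CROSSREF_WORKS_URL, params=crossref_params(keywords), timeout=30).json()` would feed `extract_dois`, and each DOI would be resolved with `requests.get(unpaywall_url(doi, email), timeout=30).json()` before calling `pick_pdf_url`. Papers without an Open Access copy yield `None` and are skipped.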

Prerequisites

Python 3.7 or higher
Flask
Requests
python-dotenv
tqdm

Installation

Clone the repository:

```
git clone <repository-url>
cd pdf-research-scraper
```

Install required packages:
```
pip install -r requirements.txt
```
Create a .env file in the project root with your Gemini API key:
```
GEMINI_API_KEY=your_gemini_api_key_here
```
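
At startup the application has to read this key. A minimal sketch of one way to do that with python-dotenv follows; the function name `load_api_key` and the error messages are hypothetical, and the real app.py may load the key differently. The import is guarded so the snippet also runs where python-dotenv is not installed and the variable is set in the shell instead.

```python
import os

try:
    # python-dotenv reads KEY=value pairs from a .env file into os.environ.
    from dotenv import load_dotenv
except ImportError:  # fall back to variables already set in the environment
    load_dotenv = None

PLACEHOLDER = "your_gemini_api_key_here"


def load_api_key() -> str:
    """Return GEMINI_API_KEY, failing early if it is missing or a placeholder."""
    if load_dotenv is not None:
        load_dotenv()  # does not override variables that are already set
    key = os.getenv("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to .env")
    if key == PLACEHOLDER:
        raise RuntimeError("GEMINI_API_KEY still holds the placeholder value")
    return key
```

Failing at startup gives a clearer error than a rejected request halfway through a scraping run.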

Usage

Online (Deployed Version)

Visit the live application: https://pdf-research-scraper.onrender.com/

Local Setup

Start the application:
```
python app.py
```
Open your browser and navigate to http://localhost:5000
Enter your research topic in detail and click "Start Scraping"
Monitor the progress in real time
Download the collected PDFs when the process completes
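
The final download is a single zip file. One way such an archive could be assembled in memory with the standard library is sketched below; the function name `build_zip` and the choice to keep PDFs as a name-to-bytes mapping are assumptions, not a description of how app.py does it.

```python
import io
import zipfile
from typing import Dict


def build_zip(pdfs: Dict[str, bytes]) -> bytes:
    """Pack PDF files, given as {archive name: file bytes}, into one zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in pdfs.items():
            archive.writestr(name, data)
    # getvalue() is read after the with block so the zip directory is written.
    return buffer.getvalue()
```

Building the archive in memory avoids leaving temporary files on the server; for large collections, writing to a file on disk would use less memory.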

Project Structure

```
pdf-research-scraper/
├── app.py              # Flask web application
├── scrape_pdfs.py      # PDF scraping logic
├── .env                # API key storage (not in version control)
├── .gitignore          # Git ignore file
├── requirements.txt    # Python dependencies
├── README.md           # This file
└── templates/
    └── index.html      # Web interface
```

How to Get a Gemini API Key

Go to Google AI Studio
Create an account or sign in
Navigate to the API Keys section
Create a new API key
Copy the key and add it to your .env file

Contributing

Fork the repository
Create a feature branch
Commit your changes
Push to the branch
Create a pull request

License

This project is licensed under the MIT License.

Akshay-gurav-31/PDF-Research-Scraper
