A Python tool for tracking IEEE VIS conference papers across multiple academic sources (arXiv, Semantic Scholar)

ctsilva/vis-paper-tracker

VIS Paper Tracker

A Python tool that automatically tracks the online availability of IEEE VIS 2025 conference papers across multiple academic sources (arXiv, Semantic Scholar). The tool scrapes paper information directly from the VIS 2025 website and maintains an incremental index that records when each paper first appears online, together with its URL and abstract.

🚀 Quick Start Guide

New user? Follow these 4 simple steps:

  1. Install dependencies: pip3 install beautifulsoup4 requests
  2. Get papers: python3 vis_paper_tracker.py --scrape
  3. Start tracking: python3 vis_paper_tracker.py --add-papers vis2025_papers.json
  4. Search & see results: python3 vis_paper_tracker.py --update --report

Daily updates: Just run python3 vis_paper_tracker.py --update --report

Features

  • Automatic Web Scraping: Fetches papers directly from IEEE VIS 2025 website with proper encoding handling
  • Multi-Source Search: Searches arXiv and Semantic Scholar APIs
  • Paper Classification: Detects Full Papers and Short Papers (posters are excluded)
  • Session Tracking: Captures conference session information
  • Progress Logging: Optional detailed logging with statistics
  • Data Export: JSON index and CSV files for analysis
  • Analysis Tools: Built-in author analysis with statistics and visualizations
  • Incremental Updates: Tracks discovery dates and maintains search history
  • Robust Progress Saving: Saves progress every 10 papers to prevent data loss
  • Data Cleaning: Built-in author name formatting and encoding issue fixes

Quick Start

Prerequisites

  • Python 3.7 or higher
  • Internet connection

Installation

  1. Download or clone this repository
  2. Install required dependencies:
pip3 install beautifulsoup4 requests

First Time Setup (Complete Workflow)

Follow these steps in order:

# Step 1: Scrape papers from IEEE VIS 2025 website
python3 vis_paper_tracker.py --scrape

# Step 2: Add papers to tracking system (use the file created in step 1)
python3 vis_paper_tracker.py --add-papers vis2025_papers.json

# Step 3: Search for papers online (this takes time due to rate limiting)
python3 vis_paper_tracker.py --update

# Step 4: See results
python3 vis_paper_tracker.py --report

Daily Updates (After Setup)

Once you've done the initial setup, just run:

# Check for newly available papers
python3 vis_paper_tracker.py --update --report

Understanding the Data Workflow

The tool uses a two-stage data system:

Stage 1: Seed Data (one-time setup)

  • vis2025_papers.json - Static list of papers from VIS 2025 website
  • Created by: --scrape command
  • Contains: titles, authors, sessions, paper types
  • Used once to populate the tracking system

Stage 2: Live Tracking (ongoing updates)

  • paper_tracking_data/paper_index.json - Live database with search results
  • paper_tracking_data/paper_availability.csv - Spreadsheet export for analysis
  • Updated by: --update command
  • Contains: search status, URLs, abstracts, discovery dates
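
As a rough illustration of what an incremental update might do to a single entry in paper_index.json, the sketch below merges a new search hit into an existing record, keeping the first discovery date and the longer abstract. The field names (`found`, `first_found`, `url`, `abstract`, `search_history`) and the `merge_result` helper are assumptions made for this example and may not match the tool's actual schema.

```python
from datetime import datetime, timezone
from typing import Optional


def merge_result(record: dict, hit: Optional[dict], now: Optional[str] = None) -> dict:
    """Merge one search hit into a tracking record (hypothetical schema)."""
    now = now or datetime.now(timezone.utc).isoformat()
    updated = dict(record)
    # Log every check, whether or not the paper was found.
    updated["search_history"] = list(record.get("search_history", [])) + [now]
    if hit is None:
        return updated
    updated["found"] = True
    # Keep the earliest discovery date; later runs never overwrite it.
    if not updated.get("first_found"):
        updated["first_found"] = now
    # Keep the first URL found, otherwise take the new one.
    updated["url"] = record.get("url") or hit.get("url")
    # Keep the longest abstract seen so far.
    old, new = record.get("abstract") or "", hit.get("abstract") or ""
    updated["abstract"] = new if len(new) > len(old) else old
    return updated
```

Passing `now` explicitly makes the function easy to test; in normal use it would default to the current UTC time.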

Optional Files

  • Log files - Detailed activity logs (if you use --log-file)

Data Flow

--scrape → vis2025_papers.json → --add-papers → paper_tracking_data/ → --update → results
   ↑              ↑                    ↑               ↑              ↑         ↑
 One-time      Seed data         Populate       Live tracking    Search     Reports

Common Commands

First Time Only

# Get papers from VIS website and start tracking
python3 vis_paper_tracker.py --scrape
python3 vis_paper_tracker.py --add-papers vis2025_papers.json

Regular Use

# Search for papers and see results
python3 vis_paper_tracker.py --update --report

# With detailed logging
python3 vis_paper_tracker.py --update --report --log-file daily_check.log

Maintenance

# Fix formatting issues in author names
python3 vis_paper_tracker.py --clean-authors

# Re-scrape if new papers are added to VIS website
python3 vis_paper_tracker.py --scrape
# Note: This overwrites vis2025_papers.json with fresh data

Advanced Options

# Use custom data directory
python3 vis_paper_tracker.py --data-dir my_tracking_data --update

# Enable debug logging
python3 vis_paper_tracker.py --update --debug

# Just generate a report (no searching)
python3 vis_paper_tracker.py --report

Data Format

Input Papers (JSON)

{
  "title": "Paper Title",
  "authors": "Author1, Author2, Author3",
  "session": "Session Name",
  "award": "Award Type",
  "paper_type": "Full Paper"
}
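
A quick way to sanity-check the seed file is to count papers per type. The sketch below assumes vis2025_papers.json is a JSON array of objects shaped like the record above (only a single record is shown here, so the top-level array is an assumption); `summarize_seed` is a hypothetical helper, not part of the tool.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_seed(path: str = "vis2025_papers.json") -> Counter:
    """Count papers per paper_type in the seed file."""
    papers = json.loads(Path(path).read_text(encoding="utf-8"))
    # Fall back to "Unknown", the type used when none can be determined.
    return Counter(p.get("paper_type") or "Unknown" for p in papers)


# Usage (after running --scrape):
#   for paper_type, count in summarize_seed().most_common():
#       print(f"{paper_type}: {count}")
```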

Paper Types

  • Full Paper: Main conference papers
  • Short Paper: Shorter research contributions
  • Unknown: When type cannot be determined

Note: Posters are automatically excluded from tracking as they are typically not published as citable papers.

Output Files

  • vis2025_papers.json: Scraped paper list from VIS 2025 website
  • paper_tracking_data/paper_index.json: Persistent tracking database
  • paper_tracking_data/paper_availability.csv: Export for data analysis
  • Log files (optional): Detailed status reports and statistics

Project Structure

After setup, your project directory will look like this:

vis-paper-tracker/
├── vis_paper_tracker.py        # Main script
├── analyze_authors.py          # Author analysis and visualization script
├── vis2025_papers.json         # Seed data (created by --scrape)
├── paper_tracking_data/        # Live tracking database
│   ├── paper_index.json        # Detailed search results
│   └── paper_availability.csv  # Spreadsheet export
├── README.md                   # This documentation
├── requirements.txt            # Python dependencies
├── LICENSE                     # MIT license
└── .gitignore                  # Git ignore rules

How It Works

  1. Web Scraping: Fetches papers from IEEE VIS 2025 website with session names and paper types (excludes posters)
  2. Author Cleaning: Removes double commas and normalizes author lists
  3. Multi-Source Search: Searches arXiv first, then Semantic Scholar for each paper
  4. Fuzzy Matching: Compares titles by word overlap and accepts a match at a 70% threshold
  5. Discovery Tracking: Records first discovery dates and maintains search history
  6. Progress Saving: Automatically saves progress every 10 papers to prevent data loss
  7. Abstract Storage: Keeps the longest abstract found across sources
  8. Progress Logging: Optional detailed logging with paper type and session statistics
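
The fuzzy matching step can be sketched as follows. This is one plausible reading of "70% word overlap", not the tool's confirmed implementation: it assumes the overlap is measured as the fraction of the wanted title's words that also appear in the candidate title, that words are lowercased alphanumeric runs, and that a score of exactly 0.7 counts as a match. The `title_words` and `titles_match` names are hypothetical.

```python
import re


def title_words(title: str) -> set:
    """Lowercase a title and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def titles_match(wanted: str, candidate: str, threshold: float = 0.7) -> bool:
    """Return True if enough of the wanted title's words appear in the candidate."""
    wanted_words = title_words(wanted)
    if not wanted_words:
        return False
    overlap = len(wanted_words & title_words(candidate)) / len(wanted_words)
    return overlap >= threshold
```

Word-set overlap tolerates differences in case, punctuation, and subtitles, but it ignores word order, so very short titles built from common words can produce false matches.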

Data Analysis

The tracker includes analysis scripts to explore patterns in the collected data:

Author Analysis

Analyze authorship patterns with the included analyze_authors.py script:

# Basic author analysis
python3 analyze_authors.py

# Show distribution statistics
python3 analyze_authors.py --stats-only

# Generate histogram (requires matplotlib)
pip3 install matplotlib numpy
python3 analyze_authors.py --histogram

# Top 10 authors with interactive plot
python3 analyze_authors.py --top-n 10 --histogram --show-plot

Sample Output:

TOP AUTHORS BY PAPER COUNT
- Kwan-Liu Ma: 8 papers (7 full, 1 short) - 24 collaborators
- Huamin Qu: 8 papers (7 full, 1 short) - 35 collaborators
- Cindy Xiong Bearfield: 8 papers (5 full, 3 short) - 31 collaborators

PAPER COUNT DISTRIBUTION
- 1 paper: 886 authors (82.1%)
- 2 papers: 131 authors (12.1%)
- 3+ papers: 62 authors (5.8%)

Outputs:

  • Console report with top authors and collaboration statistics
  • author_analysis.csv - Detailed spreadsheet with all authors
  • author_histogram.png - Visual distribution chart

Analysis Options

# Analysis script options
python3 analyze_authors.py --help

# Key parameters:
--top-n 20              # Number of top authors to show
--histogram             # Generate visual histogram
--stats-only            # Show statistics without full report
--csv filename.csv      # Custom CSV output filename
--hist-file plot.png    # Custom histogram filename
--show-plot             # Display plot interactively

Custom Analysis

The tracking data is stored in standard JSON/CSV formats, making it easy to:

  • Import into R, Python pandas, or Excel for custom analysis
  • Create visualizations with your preferred tools
  • Analyze collaboration networks, temporal patterns, or subject areas
  • Compare productivity across institutions or research groups
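
Before writing a custom analysis, it helps to check which columns the export actually contains. The sketch below uses only the standard library and makes no assumption about column names; `describe_csv` is a hypothetical helper, and the default path is the export location listed under Output Files.

```python
import csv


def describe_csv(path: str = "paper_tracking_data/paper_availability.csv"):
    """Return the column names and the number of data rows in the exported CSV."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return list(reader.fieldnames or []), len(rows)


# Usage (after at least one --update run):
#   columns, n_rows = describe_csv()
#   print(columns, n_rows)
```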

Troubleshooting

Common Issues

Error: "No such file or directory"

# Make sure you use the correct filename
ls *.json  # See what files exist
python3 vis_paper_tracker.py --add-papers vis2025_papers.json  # Use existing file

Error: "ModuleNotFoundError"

# Install missing dependencies
pip3 install beautifulsoup4 requests

Slow performance during updates

  • This is normal! The tool waits 1.5 seconds between API calls to respect rate limits
  • A full update of 290 papers takes ~7-8 minutes
  • Progress is saved every 10 papers, so interruptions won't lose much work
  • You'll see progress counters like "Checking (45/290): ..." showing current status
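
The timing follows from the delay: 290 papers × 1.5 s = 435 s, or about 7.25 minutes before network latency is added. The sketch below shows one way such a loop could combine the delay with a checkpoint every 10 papers. It is an illustration under assumptions, not the tool's code: `update_all`, `check`, and `index_path` are hypothetical names, and the real tool may make more than one API call per paper.

```python
import json
import time
from pathlib import Path

DELAY_SECONDS = 1.5     # delay between API calls, as stated above
CHECKPOINT_EVERY = 10   # save interval, as stated above


def estimate_minutes(n_papers: int, delay: float = DELAY_SECONDS) -> float:
    """Lower bound on run time: one delay per paper, ignoring network latency."""
    return n_papers * delay / 60


def update_all(papers, check, index_path, delay=DELAY_SECONDS, sleep=time.sleep):
    """Check each paper in turn, pausing between calls and saving periodically."""
    results = []
    total = len(papers)
    for i, paper in enumerate(papers, start=1):
        print(f"Checking ({i}/{total}): {paper['title']}")
        results.append(check(paper))
        # Write a checkpoint every CHECKPOINT_EVERY papers and at the very end.
        if i % CHECKPOINT_EVERY == 0 or i == total:
            Path(index_path).write_text(json.dumps(results, indent=2), encoding="utf-8")
        # No need to wait after the last paper.
        if i < total:
            sleep(delay)
    return results
```

With this layout an interruption loses at most the results gathered since the last checkpoint, that is, fewer than 10 papers.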

No papers found

  • Check your internet connection
  • Try running with debug logging: --debug
  • Some papers may not be available on arXiv or Semantic Scholar yet

Double commas in author names

# Clean up existing data
python3 vis_paper_tracker.py --clean-authors
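
For reference, the kind of cleanup involved can be sketched in a few lines. This is a minimal illustration, assuming authors are stored as a single comma-separated string as in the input format; `clean_authors` here is a hypothetical function and may differ from what --clean-authors does internally.

```python
import re


def clean_authors(authors: str) -> str:
    """Collapse repeated commas and tidy whitespace in an author string."""
    # Split on commas, trim each name, and drop the empty pieces that
    # doubled commas (",,") or trailing commas leave behind.
    names = [re.sub(r"\s+", " ", part).strip() for part in authors.split(",")]
    return ", ".join(name for name in names if name)


# Usage:
#   clean_authors("Author1,, Author2 ,  Author3,")
```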

Garbled characters in paper titles (like âThey Arenât Built For Meâ)

  • This is from encoding issues in older scraped data
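  • Re-scrape to get clean data, then rebuild the tracking database. Note that removing paper_tracking_data/ also discards any recorded discovery dates and search history: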
python3 vis_paper_tracker.py --scrape
rm -rf paper_tracking_data/
python3 vis_paper_tracker.py --add-papers vis2025_papers.json

Papers show "Added" but report shows 0 papers

  • This was a bug that has been fixed
  • If you encounter this, update to the latest version of the code

Getting Help

If you encounter issues:

  1. Run with debug logging: python3 vis_paper_tracker.py --debug --update
  2. Check the log file (written only if you pass --log-file) for detailed error messages
  3. Verify your Python version: python3 --version (needs 3.7+)

API Information

  • arXiv API: No authentication required, 15-second timeout
  • Semantic Scholar: No API key needed, returns top 5 results
  • IEEE VIS 2025: Direct HTML scraping, no authentication needed
  • Rate Limiting: 1.5 second delay between API requests
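
To make these settings concrete, the sketch below builds title queries against the public arXiv and Semantic Scholar endpoints. Several details are assumptions: the tool itself depends on requests, while this example uses the standard library's urllib so it runs without extra packages; the `ti:` title search, the arXiv `max_results` value, the Semantic Scholar `fields` list, and applying the 15-second timeout to both sources are illustrative choices; and the helper names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

ARXIV_URL = "http://export.arxiv.org/api/query"
S2_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
TIMEOUT_SECONDS = 15  # stated above for arXiv; reused for both sources here


def arxiv_query_url(title: str, max_results: int = 5) -> str:
    """Build an arXiv API URL that searches the title field."""
    params = {"search_query": f'ti:"{title}"', "max_results": max_results}
    return f"{ARXIV_URL}?{urlencode(params)}"


def semantic_scholar_query_url(title: str, limit: int = 5) -> str:
    """Build a Semantic Scholar search URL that returns the top `limit` results."""
    params = {"query": title, "limit": limit, "fields": "title,abstract,url"}
    return f"{S2_URL}?{urlencode(params)}"


def fetch(url: str, timeout: float = TIMEOUT_SECONDS) -> bytes:
    """Send the request and return the raw response body."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


# Usage (performs a network call):
#   body = fetch(arxiv_query_url("Some Paper Title"))
```

arXiv responds with an Atom XML feed and Semantic Scholar with JSON, so the two bodies need different parsers before titles can be compared.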

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Authors

Claudio Silva & Claude

Acknowledgments

  • IEEE VIS 2025 conference organizers
  • arXiv and Semantic Scholar for their open APIs
