Repository for a bibliographic reference matching tool designed to match references from Crossref JSON and TEI XML files against OpenCitations Meta. It implements a heuristic-based approach that retrieves and validates bibliographic metadata even when citation records are incomplete or inconsistent, and it generates comprehensive reports with detailed statistics.
- Features
- Architecture Overview
- Installation
- Configuration
- Usage
- Workflow
- Scoring System
- Output Files
- Logging System
- Error Handling
- Advanced Features
- Troubleshooting
- Multi-Format Support: Processes Crossref JSON and TEI XML files
- Async Architecture: Concurrent processing with asyncio and aiohttp for high performance
- Intelligent SPARQL Matching: 6 query strategies with early stopping when threshold is met
- Sophisticated Scoring System: Weighted scoring (max 48 points) based on DOI, title, authors, year, volume, and pages
- GROBID Integration: Enriches references using GROBID for unstructured text parsing
- Comprehensive Logging: Multi-file logging system with 5 specialized logs
- Rate Limiting: Token bucket algorithm (2.5 req/s, burst of 10)
- Concurrent Processing: Semaphore-controlled parallelism (10 concurrent references)
- Dynamic Threshold: Automatic threshold adjustment (90% trigger)
- Detailed Statistics: Match rates, field contributions, query type distribution
┌─────────────────────────────────────────────────────────────────┐
│ Reference Matching Tool │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ 1. Reference Extraction Phase │
└─────────────────────────────────────────┘
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│JSON/TEI Files│ + │ GROBID │
│ │ │ Fallback │
└──────────────┘ └──────────────┘
│ │
└────────┬────────┘
▼
┌─────────────────────────────────────────┐
│ 2. Reference Normalization Phase │
│ - Clean titles, authors, DOIs │
│ - Normalize text (Unicode, accents) │
│ - Extract numeric fields (year, pages) │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ 3. SPARQL Query Construction Phase │
│ │
│ Query Execution (sequential): │
│ 1. year_and_doi │
│ 2. doi_title │
│ 3. author_title │
│ 4. year_author_page │
│ 5. year_volume_page │
│ 6. year_author_volume │
│ │
│ Early stop when score >= threshold │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ 4. OpenCitations SPARQL Query Phase │
│ - Async query execution │
│ - Rate limiting │
│ - Token bucket algorithm │
│ - Error handling │
│ - Max 3 retries with backoff │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ 5. Candidate Scoring Phase │
│ │
│ Scoring Components: │
│ ├─ DOI Exact Match: 15 pts │
│ ├─ Title Similarity: 14-10 pts │
│ ├─ Author Match: 7 pts │
│ ├─ Year Match: 1 pt │
│ ├─ Volume Match: 3 pts │
│ └─ Page Match: 8 pts │
│ │
│  Threshold: 26/48 points (54.2%)        │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ 6. Result Generation & Export Phase │
│ - CSV matched references │
│ - CSV unmatched references │
│ - HTML processing report │
│ - Statistics text file │
│ - 5 specialized log files │
└─────────────────────────────────────────┘
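The pipeline above maps naturally onto an asyncio skeleton. The sketch below is a simplified illustration only, not the tool's actual code: the function names (match_reference, run_sparql_query, score_candidate) are hypothetical stand-ins, and the real implementation adds GROBID fallback, rate limiting, and logging.

```python
import asyncio
import aiohttp

MAX_CONCURRENT_REFERENCES = 10   # semaphore-controlled parallelism
QUERY_STRATEGIES = [             # tried in order, stopping early on a good match
    "year_and_doi", "doi_title", "author_title",
    "year_author_page", "year_volume_page", "year_author_volume",
]

async def run_sparql_query(session, strategy, reference):
    """Placeholder: the real tool builds and executes a SPARQL query here."""
    return []

def score_candidate(reference, candidate):
    """Placeholder: the real tool applies the weighted scoring described below."""
    return 0

async def match_reference(session, semaphore, reference, threshold=26):
    """Try each query strategy in turn; stop as soon as the threshold is met."""
    async with semaphore:
        for strategy in QUERY_STRATEGIES:
            candidates = await run_sparql_query(session, strategy, reference)
            best = max((score_candidate(reference, c) for c in candidates), default=0)
            if best >= threshold:
                return {"reference": reference, "score": best, "query_type": strategy}
        return None  # unmatched

async def match_all(references):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_REFERENCES)
    async with aiohttp.ClientSession() as session:
        tasks = [match_reference(session, semaphore, ref) for ref in references]
        return await asyncio.gather(*tasks)
```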
- Python 3.8+
- GROBID (optional, used as a fallback parser for unstructured references)
- Internet connection (for OpenCitations SPARQL endpoint)
```bash
# Install dependencies
pip install -r requirements.txt
```

```bash
# Process a single Crossref JSON file
python ReferenceMatchingTool.py crossref_references.json \
    --output output_file.csv \
    --threshold 26 \
    --use-grobid \
    --grobid-config grobid_config.json
```

```bash
# Process a TEI XML file
python ReferenceMatchingTool.py references.tei.xml \
    --output output_file.csv \
    --threshold 26 \
    --use-grobid \
    --grobid-config grobid_config.json
```

```bash
# Batch mode: process every JSON/XML file in a directory
python ReferenceMatchingTool.py input_directory/ \
    --batch \
    --output output_directory/ \
    --threshold 26 \
    --use-grobid
```

```bash
# Disable DOI-based queries
python ReferenceMatchingTool.py crossref_references.json \
    --output matches.csv \
    --no-doi
```

```bash
# Custom rate limiting
python ReferenceMatchingTool.py crossref_references.json \
    --output matches.csv \
    --rate-limit 1.5 \
    --burst-size 5
```

| Argument | Type | Description | Default |
|---|---|---|---|
| input | str | Required. Path to input Crossref JSON or TEI XML file, or directory for batch processing | - |
| --output, -o | str | Output CSV file path (single mode) or directory (batch mode). Auto-generated if not specified | Auto-generated |
| --threshold, -t | int | Minimum matching score (0-48) required to consider a reference as matched. Lower = more permissive | 26 |
| --use-grobid | flag | Enable GROBID fallback to extract metadata from unstructured citation text when initial matching fails | False |
| --grobid-config | str | Path to GROBID configuration JSON file. If not specified, searches: current directory (grobid_config.json), ~/.grobid/config.json, script directory, parent directories (up to 3 levels), and GROBID_CONFIG_PATH environment variable | None (auto-search) |
| --batch, -b | flag | Enable batch mode to process all JSON/XML files in the input directory concurrently | False |
| --use-doi | flag | Include DOI-based queries (year_and_doi, doi_title) in the matching strategy. Enabled by default | True |
| --no-doi | flag | Disable DOI-based queries. Useful when DOI metadata is unreliable or missing | - |
| --timeout | int | Maximum time in seconds to wait for each SPARQL query response before timing out | 600 |
| --max-retries | int | Number of retry attempts for failed SPARQL queries (handles transient network errors) | 3 |
| --batch-size | int | Number of files to process simultaneously in each batch. Lower values reduce memory usage | 3 |
| --pause-duration | int | Delay in seconds between processing batches to avoid overwhelming the server | 10 |
| --error-threshold | int | Maximum number of consecutive server errors (5xx) before stopping batch processing | 10 |
| --log-level | str | Verbosity of logging output: DEBUG (detailed), INFO (standard), WARNING, or ERROR (minimal) | INFO |
| --rate-limit | float | Maximum SPARQL queries per second to respect OpenCitations API rate limits | 2.5 |
| --burst-size | int | Maximum number of concurrent requests allowed in the token bucket before rate limiting kicks in | 10 |
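--rate-limit and --burst-size configure the token-bucket limiter mentioned in the features. A minimal sketch of the idea, assuming an asyncio context (class and method names are illustrative, not the tool's API):

```python
import asyncio
import time

class TokenBucket:
    """Allow bursts up to `burst_size`, then throttle to `rate` requests per second."""

    def __init__(self, rate: float = 2.5, burst_size: int = 10):
        self.rate = rate
        self.capacity = burst_size
        self.tokens = float(burst_size)
        self.last_refill = time.monotonic()

    async def acquire(self) -> None:
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, never exceeding the bucket capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Not enough tokens: sleep roughly until the next one becomes available.
            await asyncio.sleep((1 - self.tokens) / self.rate)
```

Every SPARQL request would await acquire() before hitting the endpoint; the default values reproduce 2.5 req/s with a burst of 10, and a 429 response additionally empties the bucket (see Error Handling).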
┌────────────────────────────────────────────────────────────┐
│ Step 1: INPUT PROCESSING │
├────────────────────────────────────────────────────────────┤
│ │
│ Crossref JSON TEI XML File │
│ │ │ │
│ └────────────────────┘ │
│ ▼ │
│ Parse & Extract References │
│ │ │
│ ┌─────────────────┴─────────────────┐ │
│ ▼ ▼ │
│ Crossref Format TEI Format │
│ (JSON structure) (biblStruct) │
└────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Step 2: REFERENCE NORMALIZATION │
├────────────────────────────────────────────────────────────┤
│ │
│ For each reference: │
│ 1. Clean title (remove punctuation, normalize case) │
│ 2. Normalize DOI (strip prefix, lowercase) │
│ 3. Extract authors (parse names, handle formats) │
│ 4. Validate year (check range 1700-current+1) │
│ 5. Normalize Unicode (remove accents, special chars) │
│ 6. Extract volume/page numbers │
│ │
└────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Step 3: QUERY STRATEGY SELECTION │
├────────────────────────────────────────────────────────────┤
│ │
│ Query Execution Order (sequential, early stop): │
│ │
│ 1. year_and_doi (if DOI + year available) │
│ 2. doi_title (if DOI + title available) │
│ 3. author_title (if author + title available) │
│ 4. year_author_page (if year + author + page available) │
│ 5. year_volume_page (if year + volume + page available) │
│ 6. year_author_volume (if year + author + vol available) │
│ │
│ Early stop when: score >= threshold │
│ Grobid fallback: if initial match fails │
│  No-year attempt: if a suspicious year is found             │
│ │
└────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Step 4: SPARQL QUERY EXECUTION │
├────────────────────────────────────────────────────────────┤
│ │
│ For selected query: │
│ 1. Construct SPARQL query with filters │
│ 2. Apply rate limiting (2.5 req/sec) │
│ 3. Execute query against OpenCitations endpoint │
│ 4. Handle errors: │
│ - 429 (Rate Limit): Exponential backoff │
│ - 5xx (Server Error): Retry with delay │
│ - Timeout: Retry with extended timeout │
│ 5. Parse results (extract candidates) │
│ │
│ Max retries: 3 │
│ Timeout: 600 seconds │
│ │
└────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Step 5: CANDIDATE SCORING │
├────────────────────────────────────────────────────────────┤
│ │
│ For each candidate from SPARQL results: │
│ │
│ ┌─────────────────────────────────────────┐ │
│ │ DOI Matching (15 points max) │ │
│ │ - Exact match: +15 pts │ │
│ └─────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────┐ │
│ │ Title Similarity (14 points max) │ │
│ │ - 100% match: +14 pts │ │
│ │ - 95-99%: +13 pts │ │
│ │ - 90-94%: +13 pts │ │
│ │ - 85-89%: +12 pts │ │
│ │ - 80-84%: +11 pts │ │
│ │ - 75-79%: +10 pts │ │
│ │ - <75%: +0 pts │ │
│ └─────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────┐ │
│ │ Author Matching (7 points max) │ │
│ │ - Any exact surname match: +7 pts │ │
│ └─────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────┐ │
│ │ Year Matching (1 point max) │ │
│ │ - Exact year: +1 pt │ │
│ │ - Adjacent: +0 pts │ │
│ └─────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────┐ │
│ │ Volume Matching (3 points max) │ │
│ │ - Exact match: +3 pts │ │
│ └─────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────┐ │
│ │ Page Matching (8 points max) │ │
│ │ - Start OR End page match: +8 pts │ │
│ └─────────────────────────────────────────┘ │
│ │
│ TOTAL SCORE: Sum of all components (max 48 points) │
│  THRESHOLD: 26 points (54.2% of maximum)                    │
│ ADJUSTED THRESHOLD: 90% of 26 │
│ │
│  Early stop: accept the first candidate scoring >= 90% of 26│
│ │
└────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Step 6: RESULT COMPILATION & OUTPUT │
├────────────────────────────────────────────────────────────┤
│ │
│ Generate outputs: │
│ │
│ 1. CSV Matched References │
│ - reference_id, article_title │
│ - matched_title, score │
│ - matched_doi, meta_id │
│ - query_type │
│ │
│ 2. CSV Unmatched References │
│ - All reference metadata │
│ - Best score achieved │
│ - Score breakdown (original/grobid/no-year) │
│ - GROBID attempt status │
│ │
│ 3. Statistics Text File │
│ - Total references, match rate │
│ - Field availability stats │
│ - Query type distribution │
│ - GROBID fallback statistics │
│ │
│ 4. Log Files (5 specialized logs) │
│ - reference_matching_main.log │
│ - reference_matching_authors.log │
│ - reference_matching_queries.log │
│ - reference_matching_scores.log │
│ - reference_matching_errors.log │
│ │
└────────────────────────────────────────────────────────────┘
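Step 2 (normalization) comes down to a handful of string-cleaning helpers. A standard-library sketch of what such helpers might look like (function names are assumptions; the tool itself may rely on unidecode for accent stripping, as noted under Troubleshooting):

```python
import re
import unicodedata
from datetime import datetime
from typing import Optional

def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def normalize_doi(doi: str) -> str:
    """Strip URL/prefix forms and lowercase (DOI matching is case-insensitive)."""
    doi = doi.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)\s*", "", doi)

def validate_year(year) -> Optional[int]:
    """Accept years in a plausible range (1700 .. current year + 1), else None."""
    try:
        value = int(str(year).strip()[:4])
    except (TypeError, ValueError):
        return None
    return value if 1700 <= value <= datetime.now().year + 1 else None
```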
Reference Input:
Title: "Deep Learning in Medical Imaging"
DOI: 10.1234/example.2020.001
Authors: ["Smith, J.", "Johnson, M."]
Year: 2020
Volume: 15
Pages: 123-145
OpenCitations Candidate:
Title: "Deep Learning in Medical Imaging"
DOI: 10.1234/example.2020.001
Authors: ["Smith, John", "Johnson, Mary"]
Year: 2020
Volume: 15
Start Page: 123
End Page: 145
Score Calculation:
✓ DOI exact match: +15 points
✓ Title 100% match: +14 points
✓ Author match (Smith): +7 points
✓ Year exact: +1 point
✓ Volume match: +3 points
✓ Page match (123): +8 points
─────────────────────────────────
TOTAL: 48 points ✓ MATCH
Reference Input:
Title: "Machine learning techniques"
Authors: ["Doe, A."]
Year: 2019
Volume: 12
Pages: 45-67
OpenCitations Candidate:
Title: "Machine Learning Techniques for Data Analysis"
Authors: ["Doe, Alice", "Brown, Bob"]
Year: 2019
Volume: 12
Start Page: 45
Score Calculation:
✗ DOI not available: +0 points
✓ Title 85% match: +12 points
✓ Author match (Doe): +7 points
✓ Year exact: +1 point
✓ Volume match: +3 points
✓ Page match (45): +8 points
─────────────────────────────────
TOTAL: 31 points ✓ MATCH
Reference Input:
Title: "Quantum computing review"
Year: 2021
OpenCitations Candidate:
Title: "A Comprehensive Review of Quantum Computing"
Year: 2021
Score Calculation:
✗ DOI not available: +0 points
✓ Title 80% match: +11 points
✗ No author data: +0 points
✓ Year exact: +1 point
✗ No volume: +0 points
✗ No page: +0 points
─────────────────────────────────
TOTAL: 12 points ✗ NO MATCH
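The worked examples follow directly from the component weights. A condensed, illustrative sketch of such a scoring function (field names and the difflib-based title similarity are assumptions, not the tool's exact implementation):

```python
from difflib import SequenceMatcher

TITLE_TIERS = [(100, 14), (95, 13), (90, 13), (85, 12), (80, 11), (75, 10)]

def title_points(ref_title: str, cand_title: str) -> int:
    similarity = SequenceMatcher(None, ref_title.lower(), cand_title.lower()).ratio() * 100
    for cutoff, points in TITLE_TIERS:
        if similarity >= cutoff:
            return points
    return 0

def score_candidate(ref: dict, cand: dict) -> int:
    """Weighted score, max 48: DOI 15, title 14, authors 7, year 1, volume 3, pages 8."""
    score = 0
    if ref.get("doi") and ref["doi"] == cand.get("doi"):
        score += 15
    if ref.get("title") and cand.get("title"):
        score += title_points(ref["title"], cand["title"])
    ref_surnames = {a.split(",")[0].strip().lower() for a in ref.get("authors", [])}
    cand_surnames = {a.split(",")[0].strip().lower() for a in cand.get("authors", [])}
    if ref_surnames & cand_surnames:                  # any exact surname match
        score += 7
    if ref.get("year") and ref["year"] == cand.get("year"):
        score += 1
    if ref.get("volume") and ref["volume"] == cand.get("volume"):
        score += 3
    pages = {cand.get("start_page"), cand.get("end_page")} - {None}
    if ref.get("first_page") in pages or ref.get("last_page") in pages:
        score += 8
    return score  # matched if score >= threshold (default 26)
```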
Tabular format for matched references:
reference_id,article_title,matched_title,score,matched_doi,meta_id,query_type
ref_1,"Machine Learning in Healthcare","Machine Learning in Healthcare Applications",35,"10.1234/mlh.2020","https://opencitations.net/meta/br/...",author_title
ref_3,"Neural Networks in Medicine","Neural Networks in Medical Imaging",42,"10.5678/nnm.2021","https://opencitations.net/meta/br/...",doi_title
References that didn't meet the threshold:
reference_id,year,volume,first_page,first_author_lastname,article_title,volume_title,journal_title,doi,unstructured,best_score,score_original,score_after_grobid,score_without_year,grobid_attempted,threshold_failed
ref_2,2019,12,45,Doe,"Deep Learning Review",,,10.9999/dlr.2019,,12,12,N/A,N/A,No,Yes
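When tuning --threshold it can help to look at near-misses in this file. A quick inspection snippet, assuming pandas is available and using the column names shown above (the file name is whatever your run produced):

```python
import pandas as pd

unmatched = pd.read_csv("unmatched_references.csv")  # hypothetical file name
unmatched["best_score"] = pd.to_numeric(unmatched["best_score"], errors="coerce")

# References that came close to the 26-point threshold are good candidates for
# manual review or for a re-run with a slightly lower --threshold.
near_misses = (unmatched[unmatched["best_score"] >= 20]
               .sort_values("best_score", ascending=False))
print(near_misses[["reference_id", "article_title", "best_score", "grobid_attempted"]])
```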
Text file with comprehensive statistics:
Total references: 25
Matches found: 18 (72.0%)
Errors: 0
References with author: 22/25
References with title: 25/25
References with DOI: 15/25
References with year: 24/25
References with volume: 20/25
References with page: 18/25
Query Type Distribution:
author_title: 8 (44.4%)
year_and_doi: 6 (33.3%)
year_volume_page: 4 (22.2%)
GROBID fallbacks attempted: 3
GROBID successes: 2
An HTML report with comprehensive statistics, field contributions, query type distribution, and visualizations is also generated (processing_report.html).
The tool uses 5 specialized log files for different aspects:
┌─────────────────────────────────────────────────────────────┐
│ LOG FILES STRUCTURE │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. reference_matching_main.log │
│ ├─ All processing events │
│ ├─ Initialization messages │
│ ├─ Progress updates │
│ └─ General workflow logs │
│ │
│ 2. reference_matching_authors.log │
│ ├─ Author extraction details │
│ ├─ Name parsing and normalization │
│ ├─ Author matching results │
│ └─ Filter: Messages containing "AUTHOR" or "👤" │
│ │
│ 3. reference_matching_queries.log │
│ ├─ SPARQL query construction │
│ ├─ Query execution details │
│ ├─ API response summaries │
│ └─ Filter: Messages containing "SPARQL", "QUERY", │
│ "🔍", or "🔨" │
│ │
│ 4. reference_matching_scores.log │
│ ├─ Score calculation details │
│ ├─ Field-by-field scoring │
│ ├─ Match/no-match decisions │
│ └─ Filter: Messages containing "SCORE", "MATCH", "🎯" │
│ │
│ 5. reference_matching_errors.log │
│ ├─ All WARNING and ERROR messages │
│ ├─ Exception tracebacks │
│ ├─ API failures │
│ └─ Validation errors │
│ │
└─────────────────────────────────────────────────────────────┘
- Each log file has a maximum size of 10 MB
- Keeps 5 backup files (rotating)
- UTF-8 encoding for international characters
- Automatic timestamp and line number tracking
2025-11-07 14:23:15 - INFO - [match_reference:1234] - Starting match for reference #1
2025-11-07 14:23:15 - DEBUG - [normalize_doi:567] - 🔨 Normalized DOI: 10.1234/example
2025-11-07 14:23:16 - INFO - [execute_sparql_query:890] - 🔍 QUERY: author_title
2025-11-07 14:23:17 - DEBUG - [calculate_score:1112] - 🎯 SCORE: DOI=15, Title=13, Author=7
2025-11-07 14:23:17 - INFO - [match_reference:1245] - ✓ MATCH found with score 35/48
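A setup like this can be reproduced with the standard logging module: one RotatingFileHandler per file plus a keyword filter for the specialized logs. The sketch below is illustrative, not the tool's actual code (the authors log would be added the same way with its own keywords):

```python
import logging
from logging.handlers import RotatingFileHandler

class KeywordFilter(logging.Filter):
    """Pass only records whose message contains one of the given keywords."""
    def __init__(self, keywords):
        super().__init__()
        self.keywords = keywords
    def filter(self, record):
        message = record.getMessage()
        return any(k in message for k in self.keywords)

def add_log_file(logger, path, keywords=None, level=logging.DEBUG):
    handler = RotatingFileHandler(path, maxBytes=10 * 1024 * 1024,  # 10 MB per file
                                  backupCount=5, encoding="utf-8")  # 5 rotating backups
    handler.setLevel(level)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s - %(levelname)s - [%(funcName)s:%(lineno)d] - %(message)s"))
    if keywords:
        handler.addFilter(KeywordFilter(keywords))
    logger.addHandler(handler)

logger = logging.getLogger("reference_matching")
logger.setLevel(logging.DEBUG)
add_log_file(logger, "reference_matching_main.log")
add_log_file(logger, "reference_matching_queries.log", ["SPARQL", "QUERY", "🔍", "🔨"])
add_log_file(logger, "reference_matching_scores.log", ["SCORE", "MATCH", "🎯"])
add_log_file(logger, "reference_matching_errors.log", level=logging.WARNING)
```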
┌─────────────────────────────────────────────────────────────┐
│ ERROR HANDLING MATRIX │
├──────────────────┬──────────────────────────────────────────┤
│ Error Type │ Recovery Strategy │
├──────────────────┼──────────────────────────────────────────┤
│ Rate Limit (429) │ • Exponential backoff: min(60, 2^n * 5s) │
│ │ • Reset token bucket to 0 │
│ │ • Max 3 retries │
│ │ • Log retry attempts and wait time │
├──────────────────┼──────────────────────────────────────────┤
│ Server Error │ • Retry with exponential backoff + jitter│
│ (500, 502, 503, │ • Wait: 2^attempt + random(0, 1) seconds │
│ 504) │ • Max 3 retries │
│ │ • Log server status and response │
├──────────────────┼──────────────────────────────────────────┤
│ Timeout │ • Retry with fixed 2s delay │
│ │ • Same timeout value on each retry │
│ │ • Max 3 retries │
│ │ • Log timeout occurrence │
├──────────────────┼──────────────────────────────────────────┤
│ Network Error │ • Exponential backoff: 2^attempt seconds │
│ (ClientError) │ • Max 3 retries │
│ │ • Log network error details │
│ │ • Raise QueryExecutionError if persistent│
├──────────────────┼──────────────────────────────────────────┤
│ GROBID Failure │ • Log extraction failure/error │
│ │ • Continue without GROBID enrichment │
│ │ • Mark as unmatched if all attempts fail │
├──────────────────┼──────────────────────────────────────────┤
│ JSON Parse Error │ • Try multiple encodings (utf-8, latin-1)│
│ (Input file) │ • Log encoding and parse errors │
│ │ • Raise error if all encodings fail │
│ │ • No retry for malformed input files │
├──────────────────┼──────────────────────────────────────────┤
│ JSON Parse Error │ • Caught by generic Exception handler │
│ (SPARQL response)│ • Exponential backoff: 2^attempt seconds │
│ │ • Max 3 retries │
│ │ • Log error type and message │
└──────────────────┴──────────────────────────────────────────┘
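The retry policy in the matrix condenses to roughly the following, shown here as an illustrative sketch (the endpoint URL, exception type, and function name are assumptions; the real tool also applies the token bucket and specialized logging):

```python
import asyncio
import random
import aiohttp

SPARQL_ENDPOINT = "https://opencitations.net/meta/sparql"  # assumed endpoint URL

async def execute_sparql(session: aiohttp.ClientSession, query: str,
                         max_retries: int = 3, timeout: int = 600):
    for attempt in range(max_retries):
        try:
            async with session.post(
                SPARQL_ENDPOINT,
                data={"query": query},
                headers={"Accept": "application/sparql-results+json"},
                timeout=aiohttp.ClientTimeout(total=timeout),
            ) as resp:
                if resp.status == 429:                        # rate limited
                    await asyncio.sleep(min(60, (2 ** attempt) * 5))
                    continue
                if resp.status >= 500:                        # transient server error
                    await asyncio.sleep(2 ** attempt + random.random())
                    continue
                resp.raise_for_status()
                # SPARQL endpoints reply with application/sparql-results+json
                return await resp.json(content_type=None)
        except asyncio.TimeoutError:
            await asyncio.sleep(2)                            # fixed delay, same timeout
        except aiohttp.ClientError:
            await asyncio.sleep(2 ** attempt)                 # network error backoff
    raise RuntimeError("SPARQL query failed after retries")   # stand-in for QueryExecutionError
```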
Symptom: Match rate < 30%
Possible causes:
✗ Incomplete reference metadata
✗ Non-standard citation formats
✗ References not in OpenCitations
✗ PDF extraction errors
Solutions:
✓ Check reference quality in input
✓ Verify DOIs are correct
✓ Enable GROBID fallback
✓ Review author extraction logs
✓ Try different query strategies
Symptom: Frequent 429 errors
Possible causes:
✗ Too many concurrent requests
✗ Insufficient pause between batches
Solutions:
✓ Reduce requests_per_second
✓ Increase pause_duration
✓ Reduce batch_size
✓ Check rate limiting logs
Symptom: "Cannot connect to GROBID server"
Possible causes:
✗ GROBID server not running
✗ Wrong server URL in config
✗ Network/firewall issues
Solutions:
✓ Start GROBID server: docker run -d -p 8070:8070 grobid/grobid
✓ Check grobid_config.json URL
✓ Test connection: curl http://localhost:8070/api/isalive
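If curl is not available, the same health check can be done from Python, reusing the /api/isalive path from the command above (the URL should match the one in grobid_config.json):

```python
import urllib.request

GROBID_URL = "http://localhost:8070"  # adjust to the URL in grobid_config.json

try:
    with urllib.request.urlopen(f"{GROBID_URL}/api/isalive", timeout=5) as resp:
        print("GROBID alive:", resp.read().decode().strip())
except OSError as exc:
    print("Cannot reach GROBID:", exc)
```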
Symptom: UnicodeDecodeError or garbled text
Possible causes:
✗ Non-UTF-8 input files
✗ Special characters in titles
Solutions:
✓ Save input files (JSON/XML) as UTF-8
✓ Enable text normalization
✓ Check log files for details
✓ Use unidecode for accents
- Complete Metadata: Input records with complete metadata have a higher chance of matching
- Standardized Formats: Input data that follows standard citation formats has a higher chance of matching
- Enable GROBID: Better extraction for difficult documents
- Adjust Threshold: Lower threshold (e.g., 22) for more matches (precision/recall tradeoff)
- Batch Processing: Default batch_size is 3, can be increased for higher throughput
- Concurrent Queries: Increase max_concurrent_queries cautiously; too much parallelism can trigger repeated errors and actually slow processing down
- Checkpoint Frequently: Save progress every 25-50 references
- Skip Slow Queries: Set a shorter timeout to skip slow queries (precision/recall tradeoff)