veriXiv

veriXiv (Verify x arXiv, pronounced "Veri-kive") is a scalable and efficient research paper verifier that scores the reproducibility of your paper and compares it against semantically similar papers in your research domain.

Motivation

A 2023 study by Magnusson et al., "Reproducibility in NLP: What Have We Learned from the Checklist?", empirically examined the relationship between how reproducible a paper's results are and its acceptance at reputable ML (NLP) conferences such as EMNLP.

In particular, they found a correlation between the number of "Yes" marks a paper received on a reproducibility rubric and its acceptance to the conference.

veriXiv Implementation

We used the same rubric and few-shot prompted a strong large language model (LLM) to read an input paper and score its reproducibility against that rubric.

Our system then embeds the paper's abstract, finds the top-k most semantically similar papers, and compares the input paper's reproducibility against theirs.
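The retrieval step can be illustrated with a minimal, self-contained sketch. It assumes cosine similarity as the metric and uses toy 3-dimensional vectors in place of real bge-base-en-v1.5 embeddings (which have 768 dimensions). The function names and paper IDs are hypothetical; in the actual system this lookup is handled by Cloudflare Vectorize rather than by hand-written code.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_similar(query_vec, corpus, k=3):
    """Return the k (paper_id, similarity) pairs most similar to query_vec.

    corpus maps a paper ID to the embedding of that paper's abstract.
    """
    scored = [(pid, cosine_similarity(query_vec, vec)) for pid, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings (hypothetical IDs and values, for illustration only).
corpus = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.9, 0.1, 0.0],
    "paper_c": [0.0, 1.0, 0.0],
}
neighbors = top_k_similar([0.8, 0.0, 0.1], corpus, k=2)
print(neighbors)
```

A vector database performs the same ranking with an approximate index, which is what keeps the lookup fast once the corpus grows beyond what a linear scan can handle.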

The intuition is that semantically similar papers likely belong to the same or a similar research area. Scoring their reproducibility captures the status quo for that area and gives a baseline against which you can see how easy your paper would be to reimplement compared to related work.
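One plausible way to turn the neighbors' scores into a baseline is sketched below. The README does not specify how the comparison is computed, so everything here is an assumption: the numeric scores, the use of the mean as the baseline, and the function name `compare_to_neighbors` are all hypothetical.

```python
from statistics import mean

def compare_to_neighbors(input_score, neighbor_scores):
    """Summarize how an input paper's score sits relative to its neighbors."""
    if not neighbor_scores:
        raise ValueError("need at least one neighbor score")
    baseline = mean(neighbor_scores)
    # Count neighbors whose score the input paper matches or exceeds.
    matched = sum(1 for s in neighbor_scores if input_score >= s)
    return {
        "baseline": baseline,
        "delta": input_score - baseline,
        "fraction_matched_or_beaten": matched / len(neighbor_scores),
    }

# Hypothetical scores: the input paper scored 7, its four neighbors 5, 8, 6, 9.
summary = compare_to_neighbors(7, [5, 8, 6, 9])
print(summary)
```

A positive `delta` would indicate that the input paper reports more reproducibility details than is typical for its neighborhood, and a negative one that it reports fewer.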

The main technologies used were:

Cloudflare Workers AI (embedding model, bge-base-en-v1.5)
Cloudflare Vectorize (vector database)
Google Gemini API (2.5 Flash, 2.5 Pro)
Flask (API endpoint for scoring)
SQLite (caching)

Pipeline

The image below shows the high-level pipeline for our system.

Scoring Endpoint

The image below shows the high-level scoring endpoint, which is called to produce reproducibility scores for the input paper and for its top-k semantically similar papers. For arXiv papers, we cache these scores by arXiv ID so that we do not spend additional tokens regenerating scores we have already produced.
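The caching idea can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not the project's actual schema: the table layout, the `get_or_score` helper, the stand-in scoring function, and the placeholder arXiv ID are all assumptions.

```python
import json
import sqlite3

def open_cache(path=":memory:"):
    """Open (or create) a score cache keyed by arXiv ID."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scores ("
        "arxiv_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    return conn

def get_or_score(conn, arxiv_id, score_fn):
    """Return cached scores for arxiv_id, calling score_fn only on a miss."""
    row = conn.execute(
        "SELECT payload FROM scores WHERE arxiv_id = ?", (arxiv_id,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    scores = score_fn(arxiv_id)  # the expensive LLM call would happen here
    conn.execute(
        "INSERT OR REPLACE INTO scores (arxiv_id, payload) VALUES (?, ?)",
        (arxiv_id, json.dumps(scores)),
    )
    conn.commit()
    return scores

# Stand-in for the LLM scorer; records each call so cache hits are visible.
calls = []
def fake_score(arxiv_id):
    calls.append(arxiv_id)
    return {"code_released": "Complete"}

conn = open_cache()
first = get_or_score(conn, "0000.00000", fake_score)   # placeholder ID: miss
second = get_or_score(conn, "0000.00000", fake_score)  # same ID: cache hit
print(first == second, len(calls))
```

The second lookup is served from the cache, so the scorer runs only once for a given ID.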

We also implemented a "page reference" feature. When the model grades a paper for reproducibility using the rubric, we wanted researchers using the software to be able to find where the relevant implementation details appear in the paper. To make this easy, we needed the exact page number for each piece of evidence.

However, we discovered that asking the LLM to report page numbers directly was very prone to hallucination.

Instead, we opted for a sliding-window approach in which an arXiv paper is chunked by page. Each page is fed into the LLM separately for reproducibility scoring, and the final grade for each rubric field is the maximum score that field received on any page.
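The per-field maximum can be sketched as follows. The README mentions "Complete" and "Partial" grades; the third level "Missing", the numeric ordering, and the rubric field names used here are assumptions made for illustration.

```python
# Assumed ordering of grades; "Missing" is a hypothetical lowest level.
LEVELS = {"Missing": 0, "Partial": 1, "Complete": 2}

def aggregate_pages(page_scores):
    """Combine per-page rubric grades by keeping the highest grade per field."""
    final = {}
    for page in page_scores:
        for field, grade in page.items():
            if field not in final or LEVELS[grade] > LEVELS[final[field]]:
                final[field] = grade
    return final

# One dict per page, in page order (hypothetical rubric fields).
pages = [
    {"hyperparameters": "Missing", "code_released": "Partial"},
    {"hyperparameters": "Complete", "code_released": "Missing"},
    {"hyperparameters": "Partial", "code_released": "Partial"},
]
final_grades = aggregate_pages(pages)
print(final_grades)
```

Taking the maximum means a rubric item counts as satisfied if any single page provides the evidence, which matches how a reader would judge the paper as a whole.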

To find the pages on which the model found evidence for a specific rubric item, we look first for pages that have that field marked as "Complete", and then for pages that have it marked as "Partial". We found that having the model score each page individually and recording where it finds evidence was less prone to noise and hallucination than other approaches.
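A minimal sketch of the page lookup is shown below. It reflects one reading of the description above: pages graded "Complete" are listed first, followed by pages graded "Partial". The function name, the 1-based page numbering, and the rubric field names are assumptions.

```python
def evidence_pages(page_scores, field):
    """Return 1-based page numbers with evidence for a field, best grade first."""
    complete = [
        number for number, page in enumerate(page_scores, start=1)
        if page.get(field) == "Complete"
    ]
    partial = [
        number for number, page in enumerate(page_scores, start=1)
        if page.get(field) == "Partial"
    ]
    return complete + partial

# One dict of per-page grades per page, in page order (hypothetical fields).
page_grades = [
    {"hyperparameters": "Missing", "code_released": "Partial"},
    {"hyperparameters": "Complete", "code_released": "Missing"},
    {"hyperparameters": "Partial", "code_released": "Partial"},
]
print(evidence_pages(page_grades, "hyperparameters"))
print(evidence_pages(page_grades, "code_released"))
```

Because the page number comes from the position of the chunk rather than from the model's output, the reference cannot be hallucinated; the model only decides whether the evidence is present on the page it was given.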
