Dashboard stats by nschneid · Pull Request #7400 · acl-org/acl-anthology

nschneid · 2026-02-02T05:51:44Z

Per #5324, which envisions a dashboard to track statistics of the Anthology database, this adds dashboard_stats.py to collect statistics at a snapshot in time.

The script produces a CSV file (in a new directory, stats) as well as a summary:

Overall
-------

2110 events, 253 (12%) with 100+ papers
3331 volumes, 482 (14%) with DOIs
119948 papers, 59269 (49%) with DOIs
115594 people:     98150 (85%) unverified,     14299 (12%) verified with ORCID,     3145 (3%) verified without ORCID
... 47 people have a registered degree institution
... Of the ORCID-verified people, 608 (4%) have an ORCID-based ID
432224 authorships (3.6 per paper; 14358 solo-authored papers; max authors per paper = 125)
... 157772 (37%) verified
... 115497 (27%) with ORCID
... 89897 (21%) explicit author ID at paper level

2025
----
181 events, 19 (10%) with 100+ papers: aacl acl arabicnlp bea clicit coling dravidianlangtech emnlp findings ijcnlp inlg jeptalnrecital mtsummit naacl nodalida ranlp semeval wmt ws
225 volumes, 90 (40%) with DOIs
14547 papers, 10344 (71%) with DOIs
39200 people publishing that year:     27210 (69%) unverified,     11557 (29%) verified with ORCID,     433 (1%) verified without ORCID
... 37 of these people have a registered degree institution
... Of the ORCID-verified people, 544 (5%) have an ORCID-based ID
72957 authorships (5.0 per paper; 492 solo-authored papers; max authors per paper = 92)
... 29445 (40%) verified
... 28139 (39%) with ORCID
... 22253 (31%) explicit author ID at paper level

The plan is to run the script on the first of every month. Then the stats can be compared over time.

github-actions · 2026-02-02T06:13:36Z

Build successful. Some useful links:

Complete site preview: https://preview.aclanthology.org/dashboard-stats
Potential changes of interest:

This preview will be removed when the branch is merged.

mbollmann

I’m trying to be kind, but this is extremely hard to read and understand (and, by extension, to maintain) and has a ton of bad practices.

For one, the entire script should be wrapped in a __name__ guard and be better documented.
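A minimal sketch of what such a guard could look like. The function names `collect_stats` and `main` are hypothetical placeholders, not names from the actual script, and the body only counts records to keep the example self-contained:

```python
"""Collect snapshot statistics (hypothetical module docstring)."""


def collect_stats(records: list[dict]) -> dict[str, int]:
    """Return summary counts for the given records.

    Placeholder logic only: the real script would iterate over the
    Anthology data instead of a plain list.
    """
    return {"papers": len(records)}


def main() -> None:
    """Entry point: gather the statistics and print them."""
    stats = collect_stats([])  # assumption: real data would be loaded here
    for name, value in stats.items():
        print(f"{name}: {value}")


# Nothing above runs on import; the work only starts when the file
# is executed directly as a script.
if __name__ == "__main__":
    main()
```

With this layout, the collection logic can be imported and tested without triggering a full run.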

I would note that there is no rush to make a script like this to collect stats, as we can always go back in time with git (as long as the library version is compatible with the data, which is clear from the versioning).

bin/dashboard_stats.py

mbollmann · 2026-02-02T12:21:57Z

bin/dashboard_stats.py

+papers_by_year, doi_papers_by_year = Counter(), Counter()
+solo_papers_by_year, max_authors_by_year = Counter(), Counter()
+volumes_by_year, doi_volumes_by_year, events_by_year, big_events_by_year = (
+    Counter(),
+    Counter(),
+    Counter(),
+    Counter(),
+)
+authorships_by_year, explicit_authorships_by_year, verif_authorships_by_year = (
+    Counter(),
+    Counter(),
+    Counter(),
+)
+orcid_authorships_by_year, orcid_suffix_authorships_by_year = Counter(), Counter()
+papers_by_venue_year = defaultdict(Counter)
+uniq_authors_by_year, uniq_verif_authors_by_year = defaultdict(set), defaultdict(set)
+uniq_orcid_authors_by_year, uniq_orcid_suffix_authors_by_year = defaultdict(
+    set
+), defaultdict(set)
+uniq_degree_authors_by_year = defaultdict(set)
+big_event_names_by_year = defaultdict(set)


Shouldn't this be a dictionary rather than dozens of separate variables?

It's not clear to me what all of these do; this should be documented.

BTW, if this were a dictionary, the keys could be natural-language descriptors instead, aiding readability.
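One way this could look, as a rough sketch: the descriptor strings and the `record_paper` helper below are made up for illustration and cover only a few of the statistics in the hunk above.

```python
from collections import Counter, defaultdict

# Hypothetical descriptors; the script's actual set of statistics is larger.
COUNT_STATS = (
    "papers",
    "papers with DOI",
    "solo-authored papers",
    "authorships",
)

# stat descriptor -> year -> count
counts = {name: Counter() for name in COUNT_STATS}
# stat descriptor -> year -> set of author IDs
unique = {"authors": defaultdict(set)}


def record_paper(year: str, author_ids: list[str], has_doi: bool) -> None:
    """Update every per-year statistic for a single paper."""
    counts["papers"][year] += 1
    if has_doi:
        counts["papers with DOI"][year] += 1
    if len(author_ids) == 1:
        counts["solo-authored papers"][year] += 1
    counts["authorships"][year] += len(author_ids)
    unique["authors"][year].update(author_ids)


# Toy input, not real Anthology data
record_paper("2025", ["a", "b"], has_doi=True)
record_paper("2025", ["a"], has_doi=False)
```

Adding a statistic then means adding one descriptor, and the keys double as column labels when writing the CSV.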

mbollmann · 2026-02-02T12:22:23Z

bin/dashboard_stats.py

+    if venue in {
+        'coling',
+        'cl',
+        'acl',
+        'aacl',
+        'eacl',
+        'naacl',
+        'starsem',
+        'semeval',
+        'tacl',
+        'wmt',
+    }:


Why only a subset of venues?

mbollmann · 2026-02-02T12:23:25Z

bin/dashboard_stats.py

+    if papers_in_event >= 100:
+        big_events_by_year[year] += 1
+        big_event_names_by_year[year].add(venue)


If we simply dump the number of papers per event, this can be derived from that information later, which has the advantage of not deciding on an arbitrary threshold here.
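A sketch of that idea, under the assumption that papers can be reduced to (year, venue) pairs. The sample pairs, the column names, and both helper functions are hypothetical:

```python
import csv
import io
from collections import Counter

# Toy (year, venue) pairs, one per paper; not real Anthology data
papers = [("2025", "acl"), ("2025", "acl"), ("2025", "bea"), ("2024", "acl")]
papers_per_event = Counter(papers)


def write_event_sizes(counter: Counter, handle) -> None:
    """Dump one row per event with its raw paper count."""
    writer = csv.writer(handle)
    writer.writerow(["year", "venue", "papers"])
    for (year, venue), n in sorted(counter.items()):
        writer.writerow([year, venue, n])


def big_events(counter: Counter, year: str, threshold: int) -> list[str]:
    """Derive 'big' events afterwards, with whatever threshold is wanted."""
    return sorted(
        venue for (y, venue), n in counter.items() if y == year and n >= threshold
    )


buffer = io.StringIO()  # stands in for the CSV file on disk
write_event_sizes(papers_per_event, buffer)
```

The threshold becomes a parameter of the analysis step rather than something baked into the collected data.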

mbollmann · 2026-02-02T12:23:49Z

bin/dashboard_stats.py

+data['pctdoi_vols'] = data['doi_vols'] / data['vols']
+
+YR = "2025"  # latest full year
+print(


This print statement is completely unreadable to me.
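The full statement is not shown in this hunk, so the following is only a guess at how the summary lines could be built more readably. The helpers `share` and `summary_line` are hypothetical; the numbers in the example call are taken from the summary in the PR description.

```python
def share(count: int, total: int) -> str:
    """Format a count with its rounded percentage of the total."""
    pct = round(100 * count / total) if total else 0
    return f"{count} ({pct}%)"


def summary_line(total: int, noun: str, count: int, qualifier: str) -> str:
    """Build one line such as '3331 volumes, 482 (14%) with DOIs'."""
    return f"{total} {noun}, {share(count, total)} {qualifier}"


print(summary_line(3331, "volumes", 482, "with DOIs"))
# prints: 3331 volumes, 482 (14%) with DOIs
```

Each output line then corresponds to one short call instead of one clause inside a single large `print`.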

mbollmann · 2026-02-02T12:34:32Z

bin/dashboard_stats.py

+data['pctdoi_papers'] = data['doi_papers'] / data['papers']
+data['pctdoi_vols'] = data['doi_vols'] / data['vols']
+
+YR = "2025"  # latest full year


This should not be hard-coded.

nschneid · 2026-02-02T14:13:22Z

Yeah, sorry, this was a quick-and-dirty script that evolved as I was experimenting with new metrics. Some of the semi-ugly parts were made even uglier by black.

I'll have to go through and clean it up later, but first let me rip out the parts that I decided not to use.

mbollmann · 2026-02-02T15:16:09Z

Fair enough. I think in general it would be great if there were clearly labelled parts for each (conceptually different) statistic that is being gathered, so that it’s easy to go through the script and see what’s being computed where, in case we decide to add something later, or check the logic behind some of the figures.

Oh, another suggestion: I would encourage us to follow the convention of naming each script with a verb describing what it does, in this case maybe collect_dashboard_stats.py. (In contrast to analyzing or visualizing the stats, for which we might have a script at a later point. :))

mbollmann requested changes Feb 2, 2026


mbollmann reviewed Feb 2, 2026


nschneid added 9 commits March 2, 2026 00:07

dashboard_stats.py and CSV output

0f11976

linting

8b3dd5d

also summarize 2025 stats

2793155

count big events (100+ papers)

3f574cc

remove unused code

d22aadb

rearrange instantiation of datastructures in a black-friendly way

cdd8e42

person.is_explicit checks if verified

747584b

from_within_repo()

79ede9b

March 1 stats and Feb, March summary stats

f8e1ba0

nschneid force-pushed the dashboard-stats branch from 9022426 to f8e1ba0 Compare March 2, 2026 05:13

nschneid wants to merge 9 commits into master from dashboard-stats
