Dashboard stats #7400

Open
nschneid wants to merge 9 commits into master from dashboard-stats

nschneid (Collaborator) commented Feb 2, 2026

Per #5324, which envisions a dashboard to track statistics of the Anthology database, this adds dashboard_stats.py to collect statistics at a snapshot in time.

The script produces a CSV file (in a new directory, stats) as well as a summary:

Overall
-------

2110 events, 253 (12%) with 100+ papers
3331 volumes, 482 (14%) with DOIs
119948 papers, 59269 (49%) with DOIs
115594 people:     98150 (85%) unverified,     14299 (12%) verified with ORCID,     3145 (3%) verified without ORCID
... 47 people have a registered degree institution
... Of the ORCID-verified people, 608 (4%) have an ORCID-based ID
432224 authorships (3.6 per paper; 14358 solo-authored papers; max authors per paper = 125)
... 157772 (37%) verified
... 115497 (27%) with ORCID
... 89897 (21%) explicit author ID at paper level

2025
----
181 events, 19 (10%) with 100+ papers: aacl acl arabicnlp bea clicit coling dravidianlangtech emnlp findings ijcnlp inlg jeptalnrecital mtsummit naacl nodalida ranlp semeval wmt ws
225 volumes, 90 (40%) with DOIs
14547 papers, 10344 (71%) with DOIs
39200 people publishing that year:     27210 (69%) unverified,     11557 (29%) verified with ORCID,     433 (1%) verified without ORCID
... 37 of these people have a registered degree institution
... Of the ORCID-verified people, 544 (5%) have an ORCID-based ID
72957 authorships (5.0 per paper; 492 solo-authored papers; max authors per paper = 92)
... 29445 (40%) verified
... 28139 (39%) with ORCID
... 22253 (31%) explicit author ID at paper level

The plan is to run the script on the first of every month. Then the stats can be compared over time.

github-actions bot commented Feb 2, 2026

Build successful. Some useful links:

This preview will be removed when the branch is merged.

mbollmann (Member) left a comment

I’m trying to be kind, but this is extremely hard to read and understand (and, by extension, to maintain) and has a ton of bad practices.

For one, the entire script should be wrapped in a __name__ guard and be better documented.
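
A minimal sketch of what such a guard could look like. The function names `collect_stats` and `main` and the placeholder values are assumptions for illustration, not names from the PR.

```python
"""Collect snapshot statistics (hypothetical module docstring)."""


def collect_stats() -> dict:
    """Return a mapping of statistic name to value (placeholder data)."""
    # Placeholder: a real script would read the Anthology data here.
    return {"papers": 0, "volumes": 0}


def main() -> None:
    """Entry point: gather the statistics and print them."""
    for name, value in collect_stats().items():
        print(f"{value} {name}")


if __name__ == "__main__":
    # Runs only when executed as a script, not when imported.
    main()
```

With this layout, importing the module (for example from a test) would not trigger the data collection.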

I would note that there is no rush to make a script like this to collect stats, as we can always go back in time with git (as long as the library version is compatible with the data, which is clear from the versioning).

Comment on lines +24 to +44
papers_by_year, doi_papers_by_year = Counter(), Counter()
solo_papers_by_year, max_authors_by_year = Counter(), Counter()
volumes_by_year, doi_volumes_by_year, events_by_year, big_events_by_year = (
    Counter(),
    Counter(),
    Counter(),
    Counter(),
)
authorships_by_year, explicit_authorships_by_year, verif_authorships_by_year = (
    Counter(),
    Counter(),
    Counter(),
)
orcid_authorships_by_year, orcid_suffix_authorships_by_year = Counter(), Counter()
papers_by_venue_year = defaultdict(Counter)
uniq_authors_by_year, uniq_verif_authors_by_year = defaultdict(set), defaultdict(set)
uniq_orcid_authors_by_year, uniq_orcid_suffix_authors_by_year = defaultdict(
    set
), defaultdict(set)
uniq_degree_authors_by_year = defaultdict(set)
big_event_names_by_year = defaultdict(set)

  1. Should this be a dictionary rather than dozens of different variables?
  2. It's not clear to me what all of these do; this should be documented.

BTW, if this was a dictionary, the keys could be natural language descriptors instead, aiding readability.
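
One way the dictionary idea could look, as a sketch only. The key strings and the toy records are made up for illustration and are not taken from the PR.

```python
from collections import Counter, defaultdict

# Hypothetical layout: a readable description maps to a per-year Counter.
counts_by_year = defaultdict(Counter)

# Toy records (year, has_doi); not real Anthology data.
toy_papers = [("2024", True), ("2025", True), ("2025", False)]

for year, has_doi in toy_papers:
    counts_by_year["papers"][year] += 1
    if has_doi:
        counts_by_year["papers with a DOI"][year] += 1

print(counts_by_year["papers"]["2025"])             # 2
print(counts_by_year["papers with a DOI"]["2025"])  # 1
```

The keys then double as documentation, and writing the CSV can iterate over the dictionary instead of listing each variable by hand.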

Comment on lines +98 to +109
if venue in {
    'coling',
    'cl',
    'acl',
    'aacl',
    'eacl',
    'naacl',
    'starsem',
    'semeval',
    'tacl',
    'wmt',
}:

Why only a subset of venues?

Comment on lines +95 to +97
if papers_in_event >= 100:
    big_events_by_year[year] += 1
    big_event_names_by_year[year].add(venue)

If we simply dump the number of papers per event, this can be derived from that information later, which has the advantage of not deciding on an arbitrary threshold here.
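
A rough sketch of that split, assuming the raw counts are keyed by (event, year). The helper `big_events` and the toy data are hypothetical.

```python
from collections import Counter

# Toy (event, year) records, one per paper; not real Anthology data.
toy_papers = [("acl", "2025")] * 3 + [("bea", "2025")] * 1

# Collection step: store the raw count per (event, year), with no threshold.
papers_per_event = Counter(toy_papers)


def big_events(counts, year, threshold):
    """Analysis step: events in `year` with at least `threshold` papers."""
    return sorted(
        event for (event, yr), n in counts.items() if yr == year and n >= threshold
    )


print(big_events(papers_per_event, "2025", threshold=2))  # ['acl']
```

The threshold becomes a parameter of the later analysis, so a different cutoff can be tried without recollecting anything.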

data['pctdoi_vols'] = data['doi_vols'] / data['vols']

YR = "2025" # latest full year
print(

This print statement is completely unreadable to me.
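
As one possible direction, the summary lines could be built by small helpers rather than one large print call. The helpers `with_pct` and `summary_line` below are hypothetical, and rounding to the nearest whole percent is an assumption about how the posted summary was produced.

```python
def with_pct(part, total):
    """Format a count with its share of the total, e.g. '482 (14%)'."""
    pct = round(100 * part / total) if total else 0
    return f"{part} ({pct}%)"


def summary_line(total, noun, part, label):
    """Build one summary line from a total and a subset count."""
    return f"{total} {noun}, {with_pct(part, total)} {label}"


print(summary_line(3331, "volumes", 482, "with DOIs"))
# 3331 volumes, 482 (14%) with DOIs
```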

data['pctdoi_papers'] = data['doi_papers'] / data['papers']
data['pctdoi_vols'] = data['doi_vols'] / data['vols']

YR = "2025" # latest full year

Should not be hard-coded
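
A small sketch of deriving the value instead, assuming "latest full year" means the calendar year before the run date. The function name `latest_full_year` is hypothetical; a command-line argument would be another option.

```python
from datetime import date
from typing import Optional


def latest_full_year(today: Optional[date] = None) -> str:
    """Return the most recent completed calendar year as a string."""
    today = today or date.today()
    return str(today.year - 1)


print(latest_full_year(date(2026, 2, 2)))  # 2025
```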

nschneid (Collaborator, Author) commented Feb 2, 2026

Yeah, sorry, this was a quick-and-dirty script that evolved as I was experimenting with new metrics. Some of the semi-ugly parts were made even uglier by black.

I'll have to go through and clean it up later, but first let me rip out the parts that I decided not to use.

mbollmann (Member) commented

Fair enough. I think in general it would be great if there were clearly labelled parts for each (conceptually different) statistic that is being gathered, so that it’s easy to go through the script and see what’s being computed where, in case we decide to add something later, or check the logic behind some of the figures.

Oh, another suggestion: I would encourage us to follow the convention of naming each script with a verb describing what it does, in this case maybe collect_dashboard_stats.py. (In contrast to analyzing or visualizing the stats, for which we might have a script at a later point. :))
