mbollmann
left a comment
I’m trying to be kind, but this is extremely hard to read and understand (and, by extension, to maintain) and has a ton of bad practices.
For one, the entire script should be wrapped in a __name__ guard and be better documented.
I would note that there is no rush to make a script like this to collect stats, as we can always go back in time with git (as long as the library version is compatible with the data, which is clear from the versioning).
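For illustration, a minimal sketch of the `__name__` guard suggested above. The function names and the toy input are hypothetical and do not come from the script under review; the point is only that the logic lives in documented functions and nothing runs on import.

```python
"""Collect snapshot statistics (hypothetical module docstring)."""
from collections import Counter


def collect_stats(papers):
    """Count papers per year from an iterable of (paper_id, year) pairs."""
    papers_by_year = Counter()
    for _paper_id, year in papers:
        papers_by_year[year] += 1
    return papers_by_year


def main():
    # Hypothetical toy input standing in for the real Anthology data.
    toy_papers = [("p1", "2024"), ("p2", "2025"), ("p3", "2025")]
    for year, count in sorted(collect_stats(toy_papers).items()):
        print(f"{year}: {count} papers")


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when imported.
    main()
```

With this layout, `collect_stats` can be imported and tested without triggering the full run.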
    papers_by_year, doi_papers_by_year = Counter(), Counter()
    solo_papers_by_year, max_authors_by_year = Counter(), Counter()
    volumes_by_year, doi_volumes_by_year, events_by_year, big_events_by_year = (
        Counter(),
        Counter(),
        Counter(),
        Counter(),
    )
    authorships_by_year, explicit_authorships_by_year, verif_authorships_by_year = (
        Counter(),
        Counter(),
        Counter(),
    )
    orcid_authorships_by_year, orcid_suffix_authorships_by_year = Counter(), Counter()
    papers_by_venue_year = defaultdict(Counter)
    uniq_authors_by_year, uniq_verif_authors_by_year = defaultdict(set), defaultdict(set)
    uniq_orcid_authors_by_year, uniq_orcid_suffix_authors_by_year = defaultdict(
        set
    ), defaultdict(set)
    uniq_degree_authors_by_year = defaultdict(set)
    big_event_names_by_year = defaultdict(set)
- Should this rather be a dictionary than dozens of different variables?
- It's not clear to me what all of these do; this should be documented.
BTW, if this was a dictionary, the keys could be natural language descriptors instead, aiding readability.
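A rough sketch of what the dictionary approach could look like. The key strings and the `record_paper` helper are hypothetical, chosen only to show how descriptive keys can double as documentation; they are not taken from the script.

```python
from collections import Counter, defaultdict

# One container instead of many variables: descriptive key -> (year -> count).
# The key names below are illustrative assumptions.
counts = defaultdict(Counter)


def record_paper(year, has_doi, num_authors):
    """Update every per-year counter that applies to a single paper."""
    counts["papers"][year] += 1
    if has_doi:
        counts["papers with a DOI"][year] += 1
    if num_authors == 1:
        counts["single-author papers"][year] += 1


record_paper("2025", has_doi=True, num_authors=1)
record_paper("2025", has_doi=False, num_authors=3)
```

Iterating over `counts.items()` then yields each statistic together with its label, which also makes it easier to write the results out generically.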
bin/dashboard_stats.py
Outdated
    if venue in {
        'coling',
        'cl',
        'acl',
        'aacl',
        'eacl',
        'naacl',
        'starsem',
        'semeval',
        'tacl',
        'wmt',
    }:
        if papers_in_event >= 100:
            big_events_by_year[year] += 1
            big_event_names_by_year[year].add(venue)
If we simply dump the number of papers per event, this can be derived from that information later, which has the advantage of not deciding on an arbitrary threshold here.
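A small sketch of that idea, assuming papers are available as `(year, venue)` pairs. Both helper names and the toy data are hypothetical: the raw per-event counts are what gets stored, and the "big event" cut-off is applied only at analysis time.

```python
from collections import Counter


def papers_per_event(papers):
    """Count papers per (year, venue) pair; this is the data to dump."""
    return Counter(papers)


def big_events(per_event, threshold):
    """Derive the venues at or above `threshold` papers, grouped by year."""
    result = {}
    for (year, venue), n in per_event.items():
        if n >= threshold:
            result.setdefault(year, set()).add(venue)
    return result


# Toy data: three ACL papers and one WMT paper in the same year.
toy = [("2025", "acl")] * 3 + [("2025", "wmt")]
per_event = papers_per_event(toy)
```

Because the threshold is a parameter of the later step, it can be changed without re-collecting anything.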
    data['pctdoi_vols'] = data['doi_vols'] / data['vols']


    YR = "2025"  # latest full year
    print(
This print statement is completely unreadable to me.
    data['pctdoi_papers'] = data['doi_papers'] / data['papers']
    data['pctdoi_vols'] = data['doi_vols'] / data['vols']


    YR = "2025"  # latest full year
Yeah, sorry, this was a quick-and-dirty script that evolved as I was experimenting with new metrics. Some of the semi-ugly parts were made even uglier by black. I'll have to go through and clean it up later, but first let me rip out the parts that I decided not to use.
Fair enough. I think in general it would be great if there were clearly labelled parts for each (conceptually different) statistic that is being gathered, so that it’s easy to go through the script and see what’s being computed where, in case we decide to add something later, or check the logic behind some of the figures. Oh, another suggestion: I would encourage that we follow the convention of naming each script with a verb of what it’s doing, in this case maybe |
9022426 to
f8e1ba0
Per #5324, which envisions a dashboard to track statistics of the Anthology database, this adds dashboard_stats.py to collect statistics at a snapshot in time.

The script produces the CSV file (in a new directory, stats) as well as a summary:

The plan is to run the script on the first of every month. Then the stats can be compared over time.
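
As a rough illustration of the monthly-snapshot idea, here is a sketch that writes one dated CSV per run. The `write_snapshot` helper, the date-based file name, and the column selection are assumptions for the example and may not match what the script actually does.

```python
import csv
from datetime import date
from pathlib import Path


def write_snapshot(rows, out_dir, snapshot_date):
    """Write one CSV per snapshot; the file name pattern is an assumption."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{snapshot_date.isoformat()}.csv"
    with path.open("w", newline="", encoding="utf-8") as f:
        # Column names mirror keys seen in the diff; the selection is illustrative.
        writer = csv.DictWriter(f, fieldnames=["year", "papers", "doi_papers"])
        writer.writeheader()
        writer.writerows(rows)
    return path
```

For example, `write_snapshot(rows, "stats", date.today())` would add a new file on each run, so earlier snapshots stay available for comparison.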