One third of the Elixir Core Data Resources (CDRs) provide their data as RDF, coupled with a SPARQL endpoint to query this data. While SPARQL is a powerful query language, only a minority of all data scientists and bioinformaticians are familiar with it. Therefore, to enable a wider reuse of RDF data, complementary data access interfaces are highly required.
Python and R are two of the most popular programming languages among data scientists and bioinformaticians respectively. Therefore, to enable this target audience easy access to public RDF data, we have been experimenting with generating R and Python APIs for each of these datasets in a fully automatic manner. We do so by leveraging automatically generated descriptions of each dataset, i.e. information regarding the available classes and properties, as well as their cardinalities.
This auto generation is important because:
-
- it significantly speeds up the API creation: a dataset maintainer will only need to verify the auto-generated code, without the need to actually write it by hand;
-
- it significantly enhances dataset Findability and Reuse - even Elixir CDRs have uneven representation of API packages across different programming languages, making their reusability depend on a technical hurdle - how familiar a given user is with the range of available programming languages.
An under-resourced or less well-known dataset will likely have at most one API package. Being able to quickly generate complete APIs given only an RDF file or SPARQL endpoint will help better connect data providers with their users.
Ana Claudia Sima, Jerven Bolleman