This document lays out the architecture of the ML2Grow AI project, which was started to leverage the power of AI in current and future LBLOD projects. This blueprint is meant to lay bare that architecture so that potential improvements become apparent. [...]
Throughout this document we use prefixes to abbreviate recurring URIs and keep the text readable.
These prefixes are explained here:
Prefix | Full link |
---|---|
ext | http://mu.semte.ch/vocabularies/ext/ |
rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
info | http://data.lblod.info/ |
The project uses multiple models to achieve various functionalities. These models and their roles are listed here. [...]
NER (Named Entity Recognition) is used to recognize entities within sentences. For example, given the sentence "I love Berlin.", the model will recognize that "Berlin" is most likely a location. This is useful for tagging entities within documents so they can later be linked to related topics. [...]
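To make this concrete, here is a minimal sketch of such an NER step using a Hugging Face `transformers` pipeline; the model name is an assumption for illustration and not necessarily the one deployed in this project:

```python
from transformers import pipeline

# Assumed model for illustration; the project may use a different (e.g. Dutch) NER model.
ner = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",
)

for entity in ner("I love Berlin."):
    # Each entity comes with a label (LOC/PER/ORG), a confidence score and
    # character offsets, which map naturally onto ext:start/ext:end further down.
    print(entity["word"], entity["entity_group"], entity["start"], entity["end"])
```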
BERTopic is used to identify topics within a given document. For example, [...].
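As a hedged sketch of how such a topic model is typically fitted, using the public `bertopic` package and an example corpus (the project's actual training data and configuration may differ):

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Public example corpus; in the project the documents would be the submission texts.
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Each document gets a topic id; get_topic() lists the relevant words for that topic,
# the kind of information stored later under ext:relevant_words.
print(topic_model.get_topic(topics[0]))
```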
This model links an embedding vector to a document based on its content. This vector can later be used to perform similarity search in Elasticsearch.
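A minimal sketch of how such a document embedding can be produced, assuming a `sentence-transformers` model; the model name is a placeholder, not necessarily the one used in the project:

```python
from sentence_transformers import SentenceTransformer

# Assumed multilingual model; the project may use a different embedding model.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

document = "De gemeenteraad keurde het nieuwe parkeerreglement goed."
vector = model.encode(document)  # fixed-size float vector (384 dimensions for this model)

print(vector.shape)
```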
This model is used to classify documents into BBC topics. These BBC topics are a way of classifying documents based on the content and subject matter addressed in the document.
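As the ZS_BBC task further down suggests, this classification is done with a zero-shot model that scores a document against the taxonomy labels. A hedged sketch with placeholder labels (in the project the candidate labels are loaded from the taxonomy query shown later):

```python
from transformers import pipeline

# Assumed model; any NLI-based zero-shot classifier could be plugged in here.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

document = "The council discussed the budget for road maintenance."
# Placeholder labels; the real candidate labels come from the ext:nl_taxonomy query.
candidate_labels = ["mobility", "finance", "culture", "environment"]

result = classifier(document, candidate_labels, multi_label=True)
for label, score in zip(result["labels"], result["scores"]):
    print(label, round(score, 3))  # one score per label, stored later as ext:score
```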
Apache Airflow is a workflow orchestration framework used to deploy data pipelines that can be shared amongst multiple projects. A number of containers hold various scripts that can be run from the CLI. These scripts perform important tasks such as saving and loading models or data. [...]
DAGs [...]
DAGs configure the containers that run the various tasks of the project.
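A minimal sketch of what such a DAG could look like, assuming the Docker provider is installed; the DAG id, image name, commands and schedule are placeholders rather than the project's actual configuration:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

# Hypothetical DAG chaining the load -> ner -> save CLI scripts described below.
with DAG(
    dag_id="ml2grow_ner_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load = DockerOperator(task_id="load", image="ml2grow/ner:latest", command="python load.py")
    ner = DockerOperator(task_id="ner", image="ml2grow/ner:latest", command="python ner.py")
    save = DockerOperator(task_id="save", image="ml2grow/ner:latest", command="python save.py")

    load >> ner >> save
```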
Link to the repository containing the scripts. During these tasks we first load the required data and model, then perform the NER-related processing and save the results.
- Load: During this script we load the data from the triplestore and export it to a JSON file, which can later be used to perform transformations. The script runs the following query:

  ```sparql
  PREFIX prov: <http://www.w3.org/ns/prov#>
  PREFIX dct: <http://purl.org/dc/terms/>
  PREFIX soic: <http://rdfs.org/sioc/ns#>
  PREFIX ext: <http://mu.semte.ch/vocabularies/ext/>

  SELECT DISTINCT ?thing ?text WHERE {
    ?thing a <http://rdf.myexperiment.org/ontologies/base/Submission>;
      prov:generated/dct:hasPart ?part.
    ?part soic:content ?text.
    FILTER NOT EXISTS { ?thing ext:ingestedml2GrowSmartRegulationsNer "1" }
  }
  ```
- NER: Afterwards we run the NER model on the loaded data. The data gets processed and the results are written to a JSON file on disk.
- Save: Finally the results are persisted in the triplestore, as described below (a hedged sketch of the load and save scripts follows after this list).

  | Predicate | Description |
  |---|---|
  | rdf:type | A constant value, ext:Ner, representing the type of the subject. |
  | ext:start | The start position of the word, relative to [...]. |
  | ext:end | The end position of the word, relative to [...]. |
  | ext:word | The word that was recognized by the AI model. |
  | ext:entity | Either "Location", "Person" or "Organization"; the value predicted by the AI model. |

  Additionally we also add predicates to the file that was used to generate the NER:

  | Predicate | Description |
  |---|---|
  | ext:hasNer | A link to the NER result associated with the document. A document can relate to many NER results. |
  | ext:ingestedml2GrowSmartRegulationsNer | Tag indicating the document has already been ingested, for future runs of the model. |
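A hedged Python sketch of this load-and-save pattern, assuming the scripts use SPARQLWrapper against a local endpoint; the endpoint URL, file names and example URIs are assumptions, and the NER step itself is omitted:

```python
import json

from SPARQLWrapper import JSON, POST, SPARQLWrapper

ENDPOINT = "http://localhost:8890/sparql"  # assumed endpoint URL

# Load: run the SELECT query shown above and dump the results to a JSON file.
select = SPARQLWrapper(ENDPOINT)
select.setReturnFormat(JSON)
select.setQuery(open("load_query.rq").read())  # hypothetical file holding the Load query
rows = select.query().convert()["results"]["bindings"]
with open("documents.json", "w") as f:
    json.dump([{"thing": r["thing"]["value"], "text": r["text"]["value"]} for r in rows], f)

# Save: write one NER result back and tag the document as ingested.
update = SPARQLWrapper(ENDPOINT)
update.setMethod(POST)
update.setQuery("""
PREFIX ext: <http://mu.semte.ch/vocabularies/ext/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
INSERT DATA {
  # Hypothetical URIs, purely for illustration.
  <http://data.lblod.info/ner/example-1> a ext:Ner ;
      ext:start 7 ; ext:end 13 ;
      ext:word "Berlin" ; ext:entity "Location" .
  <http://data.lblod.info/submissions/example> ext:hasNer <http://data.lblod.info/ner/example-1> ;
      ext:ingestedml2GrowSmartRegulationsNer "1" .
}
""")
update.query()
```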
Link to the repository containing the scripts
- Load: Script that loads the following query.

  ```sparql
  PREFIX prov: <http://www.w3.org/ns/prov#>
  PREFIX dct: <http://purl.org/dc/terms/>
  PREFIX soic: <http://rdfs.org/sioc/ns#>
  PREFIX ext: <http://mu.semte.ch/vocabularies/ext/>

  SELECT DISTINCT ?thing ?part ?text WHERE {
    ?thing a <http://rdf.myexperiment.org/ontologies/base/Submission>;
      prov:generated/dct:hasPart ?part.
    ?part soic:content ?text.
  }
  ```
- Retrain & Save: [...]
- Restart: [...]
- Transform: [...]
- Save:
  The results are saved in the triplestore as follows. The document is linked to its topics with:

  | Predicate | Description |
  |---|---|
  | ext:HasTopic | URI to the linked topic. |
  | ext:ingestedByMl2GrowSmartRegulationsTopics | Tag indicating it has been ingested by an ML2Grow model. |

  The TopicScore resource:

  | Predicate | Description |
  |---|---|
  | rdf:type | TopicScore |
  | ext:TopicURI | URI to the linked topic. |
  | ext:score | Score of the linked topic. |

  The Topic resource:

  | Predicate | Description |
  |---|---|
  | rdf:type | isTopic |
  | ext:relevant_words | The relevant words found by the model. A topic can have multiple relevant words. |
  | ext:count | The count of [...] |
  | ext:topic_label | The label of the topic. |
Link to the repository containing these scripts
- Load: Initiates the task by loading data from the triplestore:

  ```sparql
  PREFIX prov: <http://www.w3.org/ns/prov#>
  PREFIX dct: <http://purl.org/dc/terms/>
  PREFIX soic: <http://rdfs.org/sioc/ns#>
  PREFIX ext: <http://mu.semte.ch/vocabularies/ext/>

  SELECT DISTINCT ?thing ?part ?text WHERE {
    ?thing a <http://rdf.myexperiment.org/ontologies/base/Submission>;
      prov:generated/dct:hasPart ?part.
    ?part soic:content ?text.
  }
  ```
- Transform: [...]
- Save: This script saves the same data as step 3.1 of the previous section.
Link to the repository containing the scripts
- Load: Loads the data from the triplestore with the following query:

  ```sparql
  PREFIX prov: <http://www.w3.org/ns/prov#>
  PREFIX dct: <http://purl.org/dc/terms/>
  PREFIX soic: <http://rdfs.org/sioc/ns#>
  PREFIX ext: <http://mu.semte.ch/vocabularies/ext/>

  SELECT DISTINCT ?thing ?text WHERE {
    ?thing a <http://rdf.myexperiment.org/ontologies/base/Submission>;
      prov:generated/dct:hasPart ?part.
    ?part soic:content ?text.
    FILTER NOT EXISTS { ?thing ext:ingestedByMl2GrowSmartRegulationsEmbedding "1" }
  }
  ```
- Embed: [...]
- Save: The results are saved with the following predicates (a search sketch using such an embedding follows below).

  | Predicate | Description |
  |---|---|
  | ext:searchEmbedding | Embedding vector linked to the file. |
  | ext:ingestedByMl2GrowSmartRegulationsEmbedding | Tag indicating it has been ingested by an ML2Grow model. |
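To illustrate how a stored embedding could later drive the search mentioned in the model description, here is a hedged sketch against Elasticsearch; the index name, field name and embedding model are assumptions, not the project's actual setup:

```python
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

# Assumed index/field names; the real mapping lives in the search stack, not in this document.
es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed model

query_vector = model.encode("parkeerreglement gemeente").tolist()

# Cosine-similarity ranking over documents whose ext:searchEmbedding was indexed
# into a dense_vector field called "search_embedding".
response = es.search(
    index="submissions",
    query={
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'search_embedding') + 1.0",
                "params": {"query_vector": query_vector},
            },
        }
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```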
Link to the repository containing the scripts
- Load: Loads the documents that have not yet been ingested, together with the taxonomy labels (ext:nl_taxonomy), using the following queries:

  ```sparql
  PREFIX prov: <http://www.w3.org/ns/prov#>
  PREFIX dct: <http://purl.org/dc/terms/>
  PREFIX soic: <http://rdfs.org/sioc/ns#>
  PREFIX ext: <http://mu.semte.ch/vocabularies/ext/>

  SELECT DISTINCT ?thing ?text WHERE {
    ?thing a <http://rdf.myexperiment.org/ontologies/base/Submission>;
      prov:generated/dct:hasPart ?part.
    ?part soic:content ?text.
    FILTER NOT EXISTS { ?thing ext:ingestedMl2GrowSmartRegulationsBBC "1" }
  }
  ```

  ```sparql
  PREFIX ext: <http://mu.semte.ch/vocabularies/ext/>

  SELECT ?nl WHERE {
    <http://data.lblod.info/ML2GrowClassification> ?o ?taxo.
    ?taxo ext:nl_taxonomy ?nl
  }
  ```
- ZS_BBC: [...]
- Save: The results are saved in the triplestore with the following predicates.

  BBC scoring instance:

  | Predicate | Description |
  |---|---|
  | rdf:type | One of many values. |
  | ext:score | Score of the model's prediction. |

  File:

  | Predicate | Description |
  |---|---|
  | ext:BBC_scoring | Link to the BBC scoring instance. A document can have multiple of these instances. |
  | ext:ingestedMl2GrowSmartRegulationsBBC | Tag indicating it has been ingested by the zero-shot model. |
The data flow in the project is as follows. [...]
SPARQL triplestores are used for persisting most data, though PostgreSQL also appears in the Airflow setup. The use of PostgreSQL can be an inconvenience when it comes to linking this project to other LBLOD projects. [...]
[...]