This document lays out the architecture of the ML2Grow AI project, which was started to leverage the power of AI in current and future LBLOD projects. This blueprint is meant to lay bare that architecture so that potential improvements become apparent. [...]
Throughout this document we use prefixes to abbreviate recurring URIs and keep the text readable.
These prefixes are explained here:
Prefix | Full link |
---|---|
ext | http://mu.semte.ch/vocabularies/ext/ |
rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
info | http://data.lblod.info/ |
The project uses multiple models to achieve various functionalities. These models and their roles are listed here. [...]
NER (Named Entity Recognition) is used to recognize entities within sentences. For example, given the sentence "I love Berlin.", the model will recognize that "Berlin" is most likely a location. This is useful for tagging entities within documents so they can later be linked to related topics. [...]
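To make this concrete, here is a minimal sketch of such an NER step using a Hugging Face `transformers` pipeline; the model name is an assumption for illustration and not necessarily the one deployed in this project:

```python
from transformers import pipeline

# Assumed model for illustration; the project may use a different (e.g. Dutch) NER model.
ner = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",
)

for entity in ner("I love Berlin."):
    # Each entity comes with a label (LOC/PER/ORG), a confidence score and
    # character offsets, which map naturally onto ext:start/ext:end further down.
    print(entity["word"], entity["entity_group"], entity["start"], entity["end"])
```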
BERTopic is used to identify topics within a given document. For example, [...].
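As a hedged sketch of how such a topic model is typically fitted, using the public `bertopic` package and an example corpus (the project's actual training data and configuration may differ):

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Public example corpus; in the project the documents would be the submission texts.
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Each document gets a topic id; get_topic() lists the relevant words for that topic,
# the kind of information stored later under ext:relevant_words.
print(topic_model.get_topic(topics[0]))
```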
This model links an embedding vector to a document based on its content. This vector can later be used to perform similarity search in Elasticsearch.
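A minimal sketch of how such a document embedding can be produced, assuming a `sentence-transformers` model; the model name is a placeholder, not necessarily the one used in the project:

```python
from sentence_transformers import SentenceTransformer

# Assumed multilingual model; the project may use a different embedding model.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

document = "De gemeenteraad keurde het nieuwe parkeerreglement goed."
vector = model.encode(document)  # fixed-size float vector (384 dimensions for this model)

print(vector.shape)
```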
This model is used to classify documents into BBC topics. These BBC topics are a way of classifying documents based on the content and subject matter addressed in the document.
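As the ZS_BBC task further down suggests, this classification is done with a zero-shot model that scores a document against the taxonomy labels. A hedged sketch with placeholder labels (in the project the candidate labels are loaded from the taxonomy query shown later):

```python
from transformers import pipeline

# Assumed model; any NLI-based zero-shot classifier could be plugged in here.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

document = "The council discussed the budget for road maintenance."
# Placeholder labels; the real candidate labels come from the ext:nl_taxonomy query.
candidate_labels = ["mobility", "finance", "culture", "environment"]

result = classifier(document, candidate_labels, multi_label=True)
for label, score in zip(result["labels"], result["scores"]):
    print(label, round(score, 3))  # one score per label, stored later as ext:score
```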
Apache Airflow is a workflow orchestration framework used to deploy data pipelines that can be shared amongst multiple projects. A number of containers hold various scripts that can be run from the CLI. These scripts perform important tasks such as saving and loading models or data. [...]
DAGs [...]
DAGs configure the containers that run the various tasks of the project.
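A minimal sketch of what such a DAG could look like, assuming the Docker provider is installed; the DAG id, image name, commands and schedule are placeholders rather than the project's actual configuration:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

# Hypothetical DAG chaining the load -> ner -> save CLI scripts described below.
with DAG(
    dag_id="ml2grow_ner_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load = DockerOperator(task_id="load", image="ml2grow/ner:latest", command="python load.py")
    ner = DockerOperator(task_id="ner", image="ml2grow/ner:latest", command="python ner.py")
    save = DockerOperator(task_id="save", image="ml2grow/ner:latest", command="python save.py")

    load >> ner >> save
```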
Link to the repository containing the scripts. During these tasks we first load the required data and model, then perform the NER-related processing and save the results.
- Load: During this script we load the data from the triplestore and export it to a JSON file, which can later be used to perform transformations. The script runs the following query:

  ```sparql
  PREFIX prov: <http://www.w3.org/ns/prov#>
  PREFIX dct: <http://purl.org/dc/terms/>
  PREFIX soic: <http://rdfs.org/sioc/ns#>
  PREFIX ext: <http://mu.semte.ch/vocabularies/ext/>

  SELECT DISTINCT ?thing ?text WHERE {
    ?thing a <http://rdf.myexperiment.org/ontologies/base/Submission>;
      prov:generated/dct:hasPart ?part.
    ?part soic:content ?text.
    FILTER NOT EXISTS { ?thing ext:ingestedml2GrowSmartRegulationsNer "1" }
  }
  ```
- NER: Afterwards we run the NER model on the loaded data. The data gets processed and the results are written to a JSON file on disk.
- Save: Finally the results are persisted in the triplestore, as described below (a hedged sketch of the load and save scripts follows after this list).

  | Predicate | Description |
  |---|---|
  | rdf:type | A constant value, ext:Ner, representing the type of the subject. |
  | ext:start | The start position of the word, relative to [...]. |
  | ext:end | The end position of the word, relative to [...]. |
  | ext:word | The word that was recognized by the AI model. |
  | ext:entity | Either "Location", "Person" or "Organization"; the value predicted by the AI model. |

  Additionally we also add predicates to the file that was used to generate the NER:

  | Predicate | Description |
  |---|---|
  | ext:hasNer | A link to the NER result associated with the document. A document can relate to many NER results. |
  | ext:ingestedml2GrowSmartRegulationsNer | Tag indicating the document has already been ingested, for future runs of the model. |
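A hedged Python sketch of this load-and-save pattern, assuming the scripts use SPARQLWrapper against a local endpoint; the endpoint URL, file names and example URIs are assumptions, and the NER step itself is omitted:

```python
import json

from SPARQLWrapper import JSON, POST, SPARQLWrapper

ENDPOINT = "http://localhost:8890/sparql"  # assumed endpoint URL

# Load: run the SELECT query shown above and dump the results to a JSON file.
select = SPARQLWrapper(ENDPOINT)
select.setReturnFormat(JSON)
select.setQuery(open("load_query.rq").read())  # hypothetical file holding the Load query
rows = select.query().convert()["results"]["bindings"]
with open("documents.json", "w") as f:
    json.dump([{"thing": r["thing"]["value"], "text": r["text"]["value"]} for r in rows], f)

# Save: write one NER result back and tag the document as ingested.
update = SPARQLWrapper(ENDPOINT)
update.setMethod(POST)
update.setQuery("""
PREFIX ext: <http://mu.semte.ch/vocabularies/ext/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
INSERT DATA {
  # Hypothetical URIs, purely for illustration.
  <http://data.lblod.info/ner/example-1> a ext:Ner ;
      ext:start 7 ; ext:end 13 ;
      ext:word "Berlin" ; ext:entity "Location" .
  <http://data.lblod.info/submissions/example> ext:hasNer <http://data.lblod.info/ner/example-1> ;
      ext:ingestedml2GrowSmartRegulationsNer "1" .
}
""")
update.query()
```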
Link to the repository containing the scripts
- Load: Script that loads the following query.

  ```sparql
  PREFIX prov: <http://www.w3.org/ns/prov#>
  PREFIX dct: <http://purl.org/dc/terms/>
  PREFIX soic: <http://rdfs.org/sioc/ns#>
  PREFIX ext: <http://mu.semte.ch/vocabularies/ext/>

  SELECT DISTINCT ?thing ?part ?text WHERE {
    ?thing a <http://rdf.myexperiment.org/ontologies/base/Submission>;
      prov:generated/dct:hasPart ?part.
    ?part soic:content ?text.
  }
  ```
- Retrain & Save: [...]
- Restart: [...]
- Transform: [...]
- Save:
  The results are saved in the triplestore as follows. The document is linked to its topics with:

  | Predicate | Description |
  |---|---|
  | ext:HasTopic | URI to the linked topic. |
  | ext:ingestedByMl2GrowSmartRegulationsTopics | Tag indicating it has been ingested by an ML2Grow model. |

  The TopicScore resource:

  | Predicate | Description |
  |---|---|
  | rdf:type | TopicScore |
  | ext:TopicURI | URI to the linked topic. |
  | ext:score | Score of the linked topic. |

  The Topic resource:

  | Predicate | Description |
  |---|---|
  | rdf:type | isTopic |
  | ext:relevant_words | The relevant words found by the model. A topic can have multiple relevant words. |
  | ext:count | The count of [...] |
  | ext:topic_label | The label of the topic. |
Link to the repository containing these scripts
- Load: Initiates the task by loading data from the triplestore:

  ```sparql
  PREFIX prov: <http://www.w3.org/ns/prov#>
  PREFIX dct: <http://purl.org/dc/terms/>
  PREFIX soic: <http://rdfs.org/sioc/ns#>
  PREFIX ext: <http://mu.semte.ch/vocabularies/ext/>

  SELECT DISTINCT ?thing ?part ?text WHERE {
    ?thing a <http://rdf.myexperiment.org/ontologies/base/Submission>;
      prov:generated/dct:hasPart ?part.
    ?part soic:content ?text.
  }
  ```
- Transform: [...]
- Save: This script saves the same data as step 3.1 of the previous section.
Link to the repository containing the scripts
- Load: Loads the data from the triplestore with the following query:

  ```sparql
  PREFIX prov: <http://www.w3.org/ns/prov#>
  PREFIX dct: <http://purl.org/dc/terms/>
  PREFIX soic: <http://rdfs.org/sioc/ns#>
  PREFIX ext: <http://mu.semte.ch/vocabularies/ext/>

  SELECT DISTINCT ?thing ?text WHERE {
    ?thing a <http://rdf.myexperiment.org/ontologies/base/Submission>;
      prov:generated/dct:hasPart ?part.
    ?part soic:content ?text.
    FILTER NOT EXISTS { ?thing ext:ingestedByMl2GrowSmartRegulationsEmbedding "1" }
  }
  ```
- Embed: [...]
- Save: The results are saved with the following predicates (a search sketch using such an embedding follows below).

  | Predicate | Description |
  |---|---|
  | ext:searchEmbedding | Embedding vector linked to the file. |
  | ext:ingestedByMl2GrowSmartRegulationsEmbedding | Tag indicating it has been ingested by an ML2Grow model. |
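To illustrate how a stored embedding could later drive the search mentioned in the model description, here is a hedged sketch against Elasticsearch; the index name, field name and embedding model are assumptions, not the project's actual setup:

```python
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

# Assumed index/field names; the real mapping lives in the search stack, not in this document.
es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed model

query_vector = model.encode("parkeerreglement gemeente").tolist()

# Cosine-similarity ranking over documents whose ext:searchEmbedding was indexed
# into a dense_vector field called "search_embedding".
response = es.search(
    index="submissions",
    query={
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'search_embedding') + 1.0",
                "params": {"query_vector": query_vector},
            },
        }
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```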
Link to the repository containing the scripts
- Load: Loads the documents that have not yet been ingested, together with the taxonomy labels (ext:nl_taxonomy), using the following queries:

  ```sparql
  PREFIX prov: <http://www.w3.org/ns/prov#>
  PREFIX dct: <http://purl.org/dc/terms/>
  PREFIX soic: <http://rdfs.org/sioc/ns#>
  PREFIX ext: <http://mu.semte.ch/vocabularies/ext/>

  SELECT DISTINCT ?thing ?text WHERE {
    ?thing a <http://rdf.myexperiment.org/ontologies/base/Submission>;
      prov:generated/dct:hasPart ?part.
    ?part soic:content ?text.
    FILTER NOT EXISTS { ?thing ext:ingestedMl2GrowSmartRegulationsBBC "1" }
  }
  ```

  ```sparql
  PREFIX ext: <http://mu.semte.ch/vocabularies/ext/>

  SELECT ?nl WHERE {
    <http://data.lblod.info/ML2GrowClassification> ?o ?taxo.
    ?taxo ext:nl_taxonomy ?nl
  }
  ```
- ZS_BBC: [...]
- Save: The results are saved in the triplestore with the following predicates.

  BBC scoring instance:

  | Predicate | Description |
  |---|---|
  | rdf:type | One of many values. |
  | ext:score | Score of the model's prediction. |

  File:

  | Predicate | Description |
  |---|---|
  | ext:BBC_scoring | Link to the BBC scoring instance. A document can have multiple of these instances. |
  | ext:ingestedMl2GrowSmartRegulationsBBC | Tag indicating it has been ingested by the zero-shot model. |
The data flow in the project is as follows. [...]
SPARQL triplestores are used for persisting most data, though PostgreSQL also appears in the Airflow setup. The use of PostgreSQL can be an inconvenience when it comes to linking this project to other LBLOD projects. [...]
[...]