A detailed record of my research into the interplay of Linked Data (LD) and Machine Learning (ML). Mostly interested in maintaining ML provenance, from requirements to phase-out and archival, with a focus on generative AI models. The general focus of the research:
- Getting generative models to spit out valid triples and maybe DL axioms
- ML workflows, end to end, in an operational context
- GPT alternatives, including ones that can run locally and can be fine-tuned more explicitly
- Prompting techniques... we already know a lot about this, but there are interesting fragments in the literature worth considering
This repo reflects that research, organizing it as appropriate.
All linked papers saved locally as PDFs in this repo :)
On the interplay between Linked Data constructs and LLMs, Tim (@A-J-S97) has included me on a paper of his that surveys this exact topic. He identified five key sub-topics of research:
- Knowledge Graph Generation
- Knowledge Graph Completion
- Knowledge Graph Enrichment
- Ontology Alignment
- Language Model Probing
It is in the publication pipeline and stuck on Teams. Any paper cited here is outside the scope of Tim's survey, or else it was missed there.
- OntoGPT GitHub - a Python package for extracting semantics and creating ontologies from raw text; three approaches: SPIRES, HALO, and SPINDOCTOR
- Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES) - the main method underlying OntoGPT
- KG-BERT
- LLMs and SPARQL
One thought I had was fine-tuning GPT on LD instance data and DL axioms to get a model really good at spitting out triples. This could work, but it would require collecting prompt-response pairs (which could themselves be generated with GPT) to fine-tune on, and that still takes manual effort. This is why most work in the literature simply prompts GPT with a few good examples to get valid RDF and OWL output; a minimal few-shot sketch follows.
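To make the few-shot route concrete, here is a minimal sketch using the OpenAI Python client (>= 1.0); the model name, system prompt, and example pair are assumptions for illustration, not anything prescribed by the papers above.

```python
# Minimal few-shot prompt aimed at getting valid Turtle triples out of a chat model.
# Assumes the OpenAI Python client (>= 1.0) and OPENAI_API_KEY in the environment;
# the model name and the example sentence/answer are placeholders.
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "You convert English sentences into RDF triples. "
    "Respond ONLY with valid Turtle using the namespace ex: <http://example.org/> ."
)

FEW_SHOT = [
    {"role": "user", "content": "Ada Lovelace wrote the first algorithm."},
    {
        "role": "assistant",
        "content": "@prefix ex: <http://example.org/> .\n"
                   "ex:AdaLovelace ex:wrote ex:FirstAlgorithm .",
    },
]

def sentence_to_turtle(sentence: str) -> str:
    """Ask the model for Turtle triples describing the given sentence."""
    messages = [{"role": "system", "content": SYSTEM}, *FEW_SHOT,
                {"role": "user", "content": sentence}]
    response = client.chat.completions.create(
        model="gpt-4", messages=messages, temperature=0
    )
    return response.choices[0].message.content

# Example: print(sentence_to_turtle("Grace Hopper developed the first compiler."))
```

Validating the returned Turtle with a parser (e.g., rdflib) before accepting it, and re-prompting on failure, is the obvious guardrail.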
By "ML Workflows", I mean all associated nomenclature, e.g.:
- ML Operations (MLOps)
- ML Engineering
- AutoML
- Etc.
What platforms exist for managing ML in an operational context? I mean not just training and storing models, but tracking their creation, phasing them out, archiving them, and so on: the entire lifecycle. It is commonly assumed that this is an ad-hoc process or entirely proprietary; that is incorrect. Below are several tools for MLOps, including the important aspect of provenance (a minimal MLflow tracking sketch follows the list).
- Kubeflow - from Google, for deploying ML workflows on Kubernetes, which is for deploying and scaling containerized apps
- MLflow / MLflow GitHub - end-to-end ML lifecycle management tool
- TensorFlow Extended - mostly for deploying production ML
- Metaflow / Metaflow GitHub - from Netflix, meant for any data science project (not just ML), from exploration to deployment and monitoring
- Seldon / Seldon GitHub - an enterprise MLOps framework to deal with thousands of ML models
- Hydra / Hydra GitHub - from Facebook, meant for configuring large apps (may not be entirely applicable)
- DVC (Data Version Control) / DVC GitHub - Git-based version control for ML
- Pachyderm / Pachyderm GitHub - acquired by HPE; automates data-driven pipelines for ML
- Neptune / Neptune GitHub - provenance and metadata store for ML models
- Weights & Biases - data versioning and ML collaboration platform
- Tecton - enterprise platform for the ML feature lifecycle
- Allegro Trains / Allegro GitHub - the system is now called ClearML; touted as CI/CD for the ML workflow
- OpenML - ??
- Pipeline AI - Mystic interface
- More to come... and I will investigate these more deeply...
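Here is the MLflow tracking sketch promised above: a minimal run that logs parameters, a metric, lifecycle tags, and the model artifact. The experiment name, tag names, and the toy scikit-learn model are illustrative assumptions, not anything MLflow mandates.

```python
# Minimal sketch of run/lifecycle tracking with MLflow; the experiment name, tags,
# and the toy scikit-learn model are illustrative placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

mlflow.set_experiment("triple-extraction-demo")  # hypothetical experiment name

X, y = load_iris(return_X_y=True)

with mlflow.start_run(run_name="baseline") as run:
    model = LogisticRegression(max_iter=200).fit(X, y)

    # Parameters, metrics, and free-form tags become queryable run metadata.
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.set_tag("lifecycle_stage", "experimental")
    mlflow.set_tag("prov_iri", "http://example.org/run/123")  # hypothetical IRI

    # The serialized model is stored as an artifact tied to this run's metadata.
    mlflow.sklearn.log_model(model, "model")

    print("run_id:", run.info.run_id)
```

In principle the same run metadata could later be exported to RDF (e.g., PROV-O) by a small adapter, which is where the provenance papers below come in.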
- Managing Machine Learning Workflow Components
- ModelDB: A System for Machine Learning Model Management
- Machine Learning Operations (MLOps): Overview, Definition, and Architecture
- Automatically Tracking Metadata and Provenance of Machine Learning Experiments
- Implicit Provenance for Machine Learning Artifacts
- Time Travel and Provenance for Machine Learning Pipelines
- Machine Learning Pipelines: Provenance, Reproducibility and FAIR Data Principles - pg. 226
- A Provenance-based Semantic Approach to Support Understandability, Reproducibility, and Reuse of Scientific Experiments
- PROV
- Workflow Provenance in the Lifecycle of Scientific Machine Learning - defines the PROV-ML ontology
It would be interesting to see whether it is possible to embed RDF data, or at least IRIs, into a model file to carry its provenance along with it, so that models transferred between workers maintain their provenance. Some options:
- Send a zip containing the model and its provenance (e.g., its relevant provenance graph)
- Embed the provenance into the model
- Normalize the serialization regardless of format with a ZIP script and embed the provenance into the archive header; unzipping it will produce the model, which can be used without problem
- The Python library zipfile lets you add comments to zip files that can be read when decompressing. See this script, and the sketch after this list.
- Embed it some other way, depending on the serialization format (e.g., PMML is XML-based, but this is probably unscalable for large models with billions of parameters)
- Embed just an actual IRI (e.g., in the filename, or in the header/footer of the file) and let that IRI dereference to a web resource that contains the full model info
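As flagged in the zipfile bullet above, here is a minimal sketch of the ZIP-comment approach; the file names are hypothetical, and the archive comment is capped at 65,535 bytes, so a large provenance graph would have to fall back on the embed-an-IRI option.

```python
# Minimal sketch of the ZIP-comment approach; "model.onnx" and "prov.ttl" are
# hypothetical file names. The archive comment is limited to 65,535 bytes.
import zipfile
from pathlib import Path

def pack_model_with_provenance(model_path: str, prov_ttl_path: str, out_zip: str) -> None:
    """Zip the model and attach its RDF provenance (Turtle) as the archive comment."""
    prov_bytes = Path(prov_ttl_path).read_bytes()
    if len(prov_bytes) > 65535:
        raise ValueError("Provenance too large for a ZIP comment; embed an IRI instead.")
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(model_path, arcname=Path(model_path).name)
        zf.comment = prov_bytes  # written when the archive is closed

def read_provenance(zip_path: str) -> str:
    """Read the provenance back without extracting the model."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        return zf.comment.decode("utf-8")

# Usage (hypothetical):
# pack_model_with_provenance("model.onnx", "prov.ttl", "model_with_prov.zip")
# print(read_provenance("model_with_prov.zip"))
```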
- Hierarchical Data Format 5 (HDF5) - binary
- Open Neural Network Exchange (ONNX) - binary; carries key-value metadata_props (see the sketch after this list)
- Predictive Model Markup Language (PMML) - XML
- Pickle - binary
- SavedModel - folder structure
- Model checkpoints - binary
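For formats that expose a metadata slot, the embed-an-IRI idea maps directly; ONNX, for instance, carries key-value metadata_props on the model. A minimal sketch, with hypothetical file names and IRI:

```python
# Minimal sketch: stamp a provenance IRI into an ONNX model's metadata_props.
# File names and the IRI are hypothetical placeholders.
import onnx

def stamp_provenance_iri(model_path: str, out_path: str, prov_iri: str) -> None:
    """Load an ONNX model, attach a provenance IRI as metadata, and save it."""
    model = onnx.load(model_path)
    entry = model.metadata_props.add()  # repeated key-value field on ModelProto
    entry.key = "prov_iri"
    entry.value = prov_iri
    onnx.save(model, out_path)

def read_provenance_iri(model_path: str):
    """Return the provenance IRI stored in the model's metadata, or None."""
    model = onnx.load(model_path)
    return next((p.value for p in model.metadata_props if p.key == "prov_iri"), None)

# Usage (hypothetical):
# stamp_provenance_iri("model.onnx", "model_prov.onnx", "http://example.org/model/42")
# print(read_provenance_iri("model_prov.onnx"))
```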
In a bit of Orwellian doublespeak, "OpenAI" is entirely closed-source. Directly quoting from the 100-page GPT-4 technical report:
"Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar."
So, this section pertains to any "alternatives" to GPT, including ways of getting around its limitations. This matters because a lot of military data is classified and cannot even be discussed on online platforms like ChatGPT.
There are tons of large transformer models, e.g., BERT. All of them are potential alternatives to GPT, but (disclaimer!) they are all inferior in almost every circumstance. OpenAI has some secret sauce that simply places its models leagues above the rest.
- Awesome Huge Models - The best resource on all of them (GPTs, LLaMA, PaLM, BLOOM, etc.); I contributed some to it and it is a one-stop shop
It is possible to prompt GPT so heavily with instructional input that it can be "persuaded" to evade some of OpenAI's restrictions (e.g., ethical ones):
This is the definitive guide on prompt engineering, including links to papers and outside resources, as a GitHub repo: