A detailed record of my research into the interplay of Linked Data (LD) and Machine Learning (ML). Mostly interested in maintaining ML provenance, from requirements to phase-out and archival, with a focus on generative AI models. The general focus of the research:
- Getting generative models to spit out valid triples and maybe DL axioms
- ML workflows, end to end, in an operational context
- GPT alternatives, including ones that can run locally and can be fine-tuned more explicitly
- Prompting techniques... we already know a lot about this, but there are interesting fragments in the literature worth considering
This repo reflects that research, organizing it as appropriate.
All linked papers saved locally as PDFs in this repo :)
On the interplay between Linked Data constructs and LLMs, Tim (@A-J-S97) has included me on a paper of his that surveys this exact topic. He identified five key sub-topics of research:
- Knowledge Graph Generation
- Knowledge Graph Completion
- Knowledge Graph Enrichment
- Ontology Alignment
- Language Model Probing
It is in the publication pipeline and stuck on Teams. Any paper cited here is outside the scope of Tim's survey, or else it was missed there.
- OntoGPT GitHub - a Python package for extracting semantics and creating ontologies from raw text; three approaches: SPIRES, HALO, and SPINDOCTOR
- Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES) - the main method underlying OntoGPT
- KG-BERT
- LLMs and SPARQL
One thought I had was fine-tuning GPT on LD instance data and DL axioms to get a model really good at spitting out triples. This could work, but it would require collecting prompt-response pairs (which could themselves be generated with GPT) to fine-tune on, and that still takes manual effort. This is why most work in the literature simply prompts GPT with a few good examples to get valid RDF and OWL output; a minimal few-shot sketch follows.
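To make the few-shot route concrete, here is a minimal sketch using the OpenAI Python client (>= 1.0); the model name, system prompt, and example pair are assumptions for illustration, not anything prescribed by the papers above.

```python
# Minimal few-shot prompt aimed at getting valid Turtle triples out of a chat model.
# Assumes the OpenAI Python client (>= 1.0) and OPENAI_API_KEY in the environment;
# the model name and the example sentence/answer are placeholders.
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "You convert English sentences into RDF triples. "
    "Respond ONLY with valid Turtle using the namespace ex: <http://example.org/> ."
)

FEW_SHOT = [
    {"role": "user", "content": "Ada Lovelace wrote the first algorithm."},
    {
        "role": "assistant",
        "content": "@prefix ex: <http://example.org/> .\n"
                   "ex:AdaLovelace ex:wrote ex:FirstAlgorithm .",
    },
]

def sentence_to_turtle(sentence: str) -> str:
    """Ask the model for Turtle triples describing the given sentence."""
    messages = [{"role": "system", "content": SYSTEM}, *FEW_SHOT,
                {"role": "user", "content": sentence}]
    response = client.chat.completions.create(
        model="gpt-4", messages=messages, temperature=0
    )
    return response.choices[0].message.content

# Example: print(sentence_to_turtle("Grace Hopper developed the first compiler."))
```

Validating the returned Turtle with a parser (e.g., rdflib) before accepting it, and re-prompting on failure, is the obvious guardrail.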
By "ML Workflows", I mean all associated nomenclature, e.g.:
- ML Operations (MLOps)
- ML Engineering
- AutoML
- Etc.
What platforms exist for managing ML in an operational context? I mean not just training and storing models, but tracking their creation, phasing them out, archiving them, and so on: the entire lifecycle. It is commonly assumed that this is an ad-hoc process or entirely proprietary; that is incorrect. Below are several tools for MLOps, including the important aspect of provenance (a minimal MLflow tracking sketch follows the list).
- Kubeflow - from Google, for deploying ML workflows on Kubernetes, which is for deploying and scaling containerized apps
- MLflow / MLflow GitHub - end-to-end ML lifecycle management tool
- TensorFlow Extended - mostly for deploying production ML
- Metaflow / Metaflow GitHub - from Netflix, meant for any data science project (not just ML), from exploration to deployment and monitoring
- Seldon / Seldon GitHub - an enterprise MLOps framework to deal with thousands of ML models
- Hydra / Hydra GitHub - from Facebook, meant for configuring large apps (may not be entirely applicable)
- DVC (Data Version Control) / DVC GitHub - Git-based version control for ML
- Pachyderm / Pachyderm GitHub - acquired by HPE; automates data-driven pipelines for ML
- Neptune / Neptune GitHub - provenance and metadata store for ML models
- Weights & Biases - data versioning and ML collaboration platform
- Tecton - enterprise platform for the ML feature lifecycle
- Allegro Trains / Allegro GitHub - the system is now called ClearML; touted as CI/CD for the ML workflow
- OpenML - ??
- Pipeline AI - Mystic interface
- More to come... and I will investigate these more deeply...
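Here is the MLflow tracking sketch promised above: a minimal run that logs parameters, a metric, lifecycle tags, and the model artifact. The experiment name, tag names, and the toy scikit-learn model are illustrative assumptions, not anything MLflow mandates.

```python
# Minimal sketch of run/lifecycle tracking with MLflow; the experiment name, tags,
# and the toy scikit-learn model are illustrative placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

mlflow.set_experiment("triple-extraction-demo")  # hypothetical experiment name

X, y = load_iris(return_X_y=True)

with mlflow.start_run(run_name="baseline") as run:
    model = LogisticRegression(max_iter=200).fit(X, y)

    # Parameters, metrics, and free-form tags become queryable run metadata.
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.set_tag("lifecycle_stage", "experimental")
    mlflow.set_tag("prov_iri", "http://example.org/run/123")  # hypothetical IRI

    # The serialized model is stored as an artifact tied to this run's metadata.
    mlflow.sklearn.log_model(model, "model")

    print("run_id:", run.info.run_id)
```

In principle the same run metadata could later be exported to RDF (e.g., PROV-O) by a small adapter, which is where the provenance papers below come in.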
- Managing Machine Learning Workflow Components
- ModelDB: A System for Machine Learning Model Management
- Machine Learning Operations (MLOps): Overview, Definition, and Architecture
- Automatically Tracking Metadata and Provenance of Machine Learning Experiments
- Implicit Provenance for Machine Learning Artifacts
- Time Travel and Provenance for Machine Learning Pipelines
- Machine Learning Pipelines: Provenance, Reproducibility and FAIR Data Principles - pg. 226
- A Provenance-based Semantic Approach to Support Understandability, Reproducibility, and Reuse of Scientific Experiments
- PROV
- Workflow Provenance in the Lifecycle of Scientific Machine Learning - defines the PROV-ML ontology
It would be interesting to see whether it is possible to embed RDF data, or at least IRIs, into a model file to carry its provenance along with it, so that models transferred between workers maintain their provenance. Some options:
- Send a zip containing the model and its provenance (e.g., its relevant provenance graph)
- Embed the provenance into the model
- Normalize the serialization regardless of format with a ZIP script and embed the provenance into the archive header; unzipping it will produce the model, which can be used without problem
- The Python library zipfile lets you add comments to zip files that can be read when decompressing. See this script, and the sketch after this list.
- Embed it some other way, depending on the serialization format (e.g., PMML is XML-based, but this is probably unscalable for large models with billions of parameters)
- Embed just an actual IRI (e.g., in the filename, or in the header/footer of the file) and let that IRI dereference to a web resource that contains the full model info
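As flagged in the zipfile bullet above, here is a minimal sketch of the ZIP-comment approach; the file names are hypothetical, and the archive comment is capped at 65,535 bytes, so a large provenance graph would have to fall back on the embed-an-IRI option.

```python
# Minimal sketch of the ZIP-comment approach; "model.onnx" and "prov.ttl" are
# hypothetical file names. The archive comment is limited to 65,535 bytes.
import zipfile
from pathlib import Path

def pack_model_with_provenance(model_path: str, prov_ttl_path: str, out_zip: str) -> None:
    """Zip the model and attach its RDF provenance (Turtle) as the archive comment."""
    prov_bytes = Path(prov_ttl_path).read_bytes()
    if len(prov_bytes) > 65535:
        raise ValueError("Provenance too large for a ZIP comment; embed an IRI instead.")
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(model_path, arcname=Path(model_path).name)
        zf.comment = prov_bytes  # written when the archive is closed

def read_provenance(zip_path: str) -> str:
    """Read the provenance back without extracting the model."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        return zf.comment.decode("utf-8")

# Usage (hypothetical):
# pack_model_with_provenance("model.onnx", "prov.ttl", "model_with_prov.zip")
# print(read_provenance("model_with_prov.zip"))
```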
- Hierarchical Data Format 5 (HDF5) - binary
- Open Neural Network Exchange (ONNX) - binary; carries key-value metadata_props (see the sketch after this list)
- Predictive Model Markup Language (PMML) - XML
- Pickle - binary
- SavedModel - folder structure
- Model checkpoints - binary
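For formats that expose a metadata slot, the embed-an-IRI idea maps directly; ONNX, for instance, carries key-value metadata_props on the model. A minimal sketch, with hypothetical file names and IRI:

```python
# Minimal sketch: stamp a provenance IRI into an ONNX model's metadata_props.
# File names and the IRI are hypothetical placeholders.
import onnx

def stamp_provenance_iri(model_path: str, out_path: str, prov_iri: str) -> None:
    """Load an ONNX model, attach a provenance IRI as metadata, and save it."""
    model = onnx.load(model_path)
    entry = model.metadata_props.add()  # repeated key-value field on ModelProto
    entry.key = "prov_iri"
    entry.value = prov_iri
    onnx.save(model, out_path)

def read_provenance_iri(model_path: str):
    """Return the provenance IRI stored in the model's metadata, or None."""
    model = onnx.load(model_path)
    return next((p.value for p in model.metadata_props if p.key == "prov_iri"), None)

# Usage (hypothetical):
# stamp_provenance_iri("model.onnx", "model_prov.onnx", "http://example.org/model/42")
# print(read_provenance_iri("model_prov.onnx"))
```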
In a bit of Orwellian doublespeak, "OpenAI" is entirely closed-source. Directly quoting from the 100-page GPT-4 technical report:
"Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar."
So, this section pertains to any "alternatives" to GPT, including ways of getting around its limitations. This matters because a lot of military data is classified and cannot even be discussed on online platforms like ChatGPT.
There are tons of large transformer models, e.g., BERT. All of them are potential alternatives to GPT, but (disclaimer!) they are all inferior in almost every circumstance. OpenAI has some secret sauce that simply places its models leagues above the rest.
- Awesome Huge Models - The best resource on all of them (GPTs, LLaMA, PaLM, BLOOM, etc.); I contributed some to it and it is a one-stop shop
It is possible to prompt GPT so heavily with instructional input that it can be "persuaded" to evade some of OpenAI's restrictions (e.g., ethical ones):
This is the definitive guide on prompt engineering, including links to papers and outside resources, as a GitHub repo: