Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Store example using ml job #372

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
65 changes: 65 additions & 0 deletions feature_store/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
# Oracle Feature Store (ADS)

[![Python](https://img.shields.io/badge/python-3.8-blue?style=for-the-badge&logo=pypi&logoColor=white)](https://pypi.org/project/oracle-ads/) [![PysparkConda](https://img.shields.io/badge/fspyspark32_p38_cpu_v2-1.0-blue?style=for-the-badge&logo=pypi&logoColor=white)](https://docs.oracle.com/en-us/iaas/data-science/using/conda-pyspark-fam.htm) [![Notebook Examples](https://img.shields.io/badge/docs-notebook--examples-blue?style=for-the-badge&logo=pypi&logoColor=white)](https://github.com/oracle-samples/oci-data-science-ai-samples/tree/master/notebook_examples) [![Delta](https://img.shields.io/badge/delta-2.0.1-blue?style=for-the-badge&logo=pypi&logoColor=white)](https://delta.io/) [![PySpark](https://img.shields.io/badge/pyspark-3.2.1-blue?style=for-the-badge&logo=pypi&logoColor=white)](https://spark.apache.org/docs/3.2.1/api/python/index.html) [![Great Expectations](https://img.shields.io/badge/greatexpectations-0.17.19-blue?style=for-the-badge&logo=pypi&logoColor=white)](https://greatexpectations.io/) [![Pandas](https://img.shields.io/badge/pandas-1.5.3-blue?style=for-the-badge&logo=pypi&logoColor=white)](https://pandas.pydata.org/) [![PyArrow](https://img.shields.io/badge/pyarrow-11.0.0-blue?style=for-the-badge&logo=pypi&logoColor=white)](https://arrow.apache.org/docs/python/index.html)

Managing many datasets, data sources, and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift, and training serving skew all lead to increased model development time and poor model performance. Feature store solves many of the problems because it is a centralized way to transform and access data for training and serving time, Feature stores help define a standardised pipeline for ingestion of data and querying of data.

ADS feature store is a stack-based solution that is deployed in your tenancy using OCI Resource Manager.

Following are brief descriptions of key concepts and the main components of ADS feature store.

- ``Feature Vector``: Set of feature values for any one primary and identifier key. For example, all and a subset of features of customer ID 2536 can be called as one feature vector .
- ``Feature``: A feature is an individual measurable property or characteristic of an event being observed.
- ``Entity``: An entity is a group of semantically related features. The first step a consumer of features would typically do when accessing the feature store service is to list the entities and the entities associated with features. Another way to look at it is that an entity is an object or concept that's described by its features. Examples of entities are customer, product, transaction, review, image, document, and so on.
- ``Feature Group``: A feature group in a feature store is a collection of related features that are often used together in ML models. It serves as an organizational unit within the feature store for users to manage, version, and share features across different ML projects. By organizing features into groups, data scientists and ML engineers can efficiently discover, reuse, and collaborate on features reducing the redundant work and ensuring consistency in feature engineering.
- ``Feature Group Job``: Feature group jobs are the processing instance of a feature group. Each feature group job includes validation results and statistics results.
- ``Dataset``: A dataset is a collection of features that are used together to either train a model or perform model inference.
- ``Dataset Job``: A dataset job is the processing instance of a dataset. Each dataset job includes validation results and statistics results.

## Documentation

- [Oracle Feature Store SDK (ADS) Documentation](https://feature-store-accelerated-data-science.readthedocs.io/en/latest/)
- [OCI Data Science and AI services Examples](https://github.com/oracle/oci-data-science-ai-samples)
- [Oracle AI & Data Science Blog](https://blogs.oracle.com/ai-and-datascience/)
- [OCI Documentation](https://docs.oracle.com/en-us/iaas/data-science/using/data-science.htm)

## Examples

### Quick start examples

| Jupyter Notebook | Description |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [Feature store querying](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/notebook_examples/feature_store_querying.ipynb) | - Ingestion, querying and exploration of data. |
| [Feature store quickstart](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/notebook_examples/feature_store_quickstart.ipynb) | - Ingestion, querying and exploration of data. |
| [Schema enforcement and schema evolution](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/notebook_examples/feature_store_schema_evolution.ipynb) | - `Schema evolution` allows you to easily change a table's current schema to accommodate data that is changing over time. `Schema enforcement`, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that don't match the table's schema. |
| [Storage of medical records in feature store](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/notebook_examples/feature_store_ehr_data.ipynb) | Example to demonstrate storage of medical records in feature store |

### Big data operations using OCI DataFlow

| Jupyter Notebook | Description |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|
| [Big data operations with feature store](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/notebook_examples/feature_store_spark_magic.ipynb) | - Ingestion of data using Spark Magic, querying and exploration of data using Spark Magic. |

### LLM Use cases

| Jupyter Notebook | Description |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [Embeddings in Feature Store](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/notebook_examples/feature_store_embeddings.ipynb) | - `Embedding feature stores` are optimized for fast and efficient retrieval of embeddings. This is important because embeddings can be high-dimensional and computationally expensive to calculate. By storing them in a dedicated store, you can avoid the need to recalculate embeddings for the same data repeatedly. |
| [Synthetic data generation in feature store using OpenAI and FewShotPromptTemplate](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/notebook_examples/feature_store_medical_synthetic_data_openai.ipynb) | - `Synthetic data` is artificially generated data, rather than data collected from real-world events. It's used to simulate real data without compromising privacy or encountering real-world limitations. |
| [PII Data redaction, Summarise Content and Translate content using doctran and open AI](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/notebook_examples/feature_store_pii_redaction_and_transformation.ipynb) | - One way to think of Doctran is a LLM-powered black box where messy strings go in and nice, clean, labelled strings come out. Another way to think about it is a modular, declarative wrapper over OpenAI's functional calling feature that significantly improves the developer experience. |
| [OpenAI embeddings in feature store](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/notebook_examples/feature_store_embeddings_openai.ipynb) | - `Embedding feature stores` are optimized for fast and efficient retrieval of embeddings. This is important because embeddings can be high-dimensional and computationally expensive to calculate. By storing them in a dedicated store, you can avoid the need to recalculate embeddings for the same data repeatedly. |


## Contributing

This project welcomes contributions from the community. Before submitting a pull request, please [review our contribution guide](./../../CONTRIBUTING.md)

Find Getting Started instructions for developers in [README-development.md](https://github.com/oracle/accelerated-data-science/blob/main/README-development.md)

## Security

Consult the security guide [SECURITY.md](https://github.com/oracle/accelerated-data-science/blob/main/SECURITY.md) for our responsible security vulnerability disclosure process.

## License

Copyright (c) 2020, 2022 Oracle and/or its affiliates. Licensed under the [Universal Permissive License v1.0](https://oss.oracle.com/licenses/upl/)
31 changes: 31 additions & 0 deletions feature_store/feature_store_creation_ingestion_with_jobs/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
Feature Store Creation and Ingestion using ML Job
=====================

In this Example, you use the Oracle Cloud Infrastructure (OCI) Data Science service MLJob component to create OCI Feature store design time constructs and then ingest feature values into the offline feature store.

Tutorial picks use case of Electronic Heath Data consisting of Patient Test Results. The example demonstrates creation of feature store, entity , transformation and feature group design time constructs using a python script which is provided as job artifact. Another job artifact demonstrates ingestion of feature values into pre-created feature group.

# Prerequisites

The notebook makes connections to other OCI resources. This is done using [resource principals](https://docs.oracle.com/en-us/iaas/Content/Functions/Tasks/functionsaccessingociresources.htm). If you have not configured your tenancy to use resource principals then you can do so using the instructions that are [here](https://docs.oracle.com/en-us/iaas/data-science/using/create-dynamic-groups.htm). Alternatively, you can use API keys. The preferred method for authentication is resource principals.


# Instructions

1. Open a Data Science Notebook session (i.e. JupyterLab).
2. Open a file terminal by clicking on File -> New -> Terminal.
3. In the terminal run the following commands:
4. `odsc conda install -s fspyspark32_p38_cpu_v1` to install the feature store conda.
1. `conda activate /home/datascience/conda/fspyspark32_p38_cpu_v1` to activate the conda.
5. Copy the `notebooks` folder into the notebook session.
6. Open the notebook `notebook/feature_store_using_mljob.ipynb`.
7. Change the notebook kernel to `Python [conda env:fspyspark32_p38_cpu_v1]`.
8. Read the notebook and execute each cell.
9. Once the ml job run is completed successfully, user can validate creation of feature store construct using the feature store notebook ui extension.
10. Now open the notebook `notebook/feature_store_ingestion_via_mljob.ipynb`.
11. Change the notebook kernel to `Python [conda env:fspyspark32_p38_cpu_v1]`.
12. Read the notebook and execute each cell.
13. validate the ingestion ml job is executed successfully.
14. User can validate the ingested data and other metadata using the feature store notebook ui extension.


Loading