feat: schema validation #663

Merged
merged 9 commits into main from feat/schema-validation
Feb 20, 2025


nathanielc
Collaborator

With this change, models and model instance documents are validated. The validation rules are largely the same as in ComposeDB, with a few differences:

  • Models can now be updated (more below on this)
  • Relations are only partially validated: we ensure the field is a valid stream id of the correct type and nothing more. Applications building on Ceramic can extend this validation to load the stream.

Model streams now allow data events that update the model. These are the rules for updating a model:

  • Only the name, description, implements and schema fields of a model may be modified. All other fields are immutable.
  • The schema must not make a breaking change
  • An interface may not be updated
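As a rough illustration of how the three update rules compose, here is a minimal sketch. It is not the code in this PR. The `Model` struct is a simplified stand-in, `account_relation` is a hypothetical placeholder for all immutable fields, and the toy schema representation and the definition of "breaking" (a property removed, a property type changed, or a property newly required) are assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// Toy schema (an assumption for this sketch): property name -> (type name, required).
pub type Schema = BTreeMap<String, (String, bool)>;

/// Simplified stand-in for a model definition.
#[derive(Clone, Debug, PartialEq)]
pub struct Model {
    // The four fields that may be modified by an update.
    pub name: String,
    pub description: Option<String>,
    pub implements: Vec<String>,
    pub schema: Schema,
    /// Whether this model is an interface.
    pub interface: bool,
    /// Hypothetical placeholder standing in for every other, immutable field.
    pub account_relation: String,
}

#[derive(Debug, PartialEq)]
pub enum UpdateError {
    InterfaceUpdate,
    ImmutableFieldChanged,
    BreakingSchemaChange(String),
}

/// Assumed definition of a breaking change: a property is removed, a property
/// changes type, or a property becomes required when it was not before.
fn breaking_change(old: &Schema, new: &Schema) -> Option<String> {
    for (name, (ty, _)) in old {
        match new.get(name) {
            None => return Some(format!("property `{name}` was removed")),
            Some((new_ty, _)) if new_ty != ty => {
                return Some(format!("property `{name}` changed type"))
            }
            _ => {}
        }
    }
    for (name, (_, required)) in new {
        let was_required = old.get(name).map(|(_, r)| *r).unwrap_or(false);
        if *required && !was_required {
            return Some(format!("property `{name}` became required"));
        }
    }
    None
}

/// Applies the three update rules in order.
pub fn check_model_update(old: &Model, new: &Model) -> Result<(), UpdateError> {
    // Rule: an interface may not be updated.
    if old.interface {
        return Err(UpdateError::InterfaceUpdate);
    }
    // Rule: everything except name, description, implements and schema is immutable.
    if new.interface != old.interface || new.account_relation != old.account_relation {
        return Err(UpdateError::ImmutableFieldChanged);
    }
    // Rule: the schema must not make a breaking change.
    if let Some(reason) = breaking_change(&old.schema, &new.schema) {
        return Err(UpdateError::BreakingSchemaChange(reason));
    }
    Ok(())
}
```

Under these assumptions, adding an optional property or editing the description passes, while removing a property or touching an immutable field is rejected.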

A model instance document stream now supports a new header value, modelVersion, that defines which version of a model to use when validating the document. The modelVersion must be set to the CID (not the stream id) of the event within the model stream that corresponds to the version of the model used to validate the instance. If not set, the modelVersion defaults to the CID of the init event of the model, i.e. the CID part of the model stream id. Note that the model version is mutable, so a stream that initially validated against one version of a model may change to validate against a newer version. This change is always explicit: an instance that wants to validate against a newer version is required to update its modelVersion.
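The defaulting rule for modelVersion can be sketched as follows. This is illustrative only and not the code in this PR: `ModelStream`, `VersionError` and `resolve_model_version` are hypothetical names, and CIDs are represented as plain strings to keep the example free of dependencies.

```rust
/// Hypothetical view of a model stream: the CIDs of its events, oldest first.
/// `event_cids[0]` is the init event, whose CID is also the CID part of the
/// model stream id.
pub struct ModelStream {
    pub event_cids: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum VersionError {
    /// The model stream has no events, so there is no init event to default to.
    EmptyModelStream,
    /// The header names a CID that is not an event in the model stream.
    UnknownVersion(String),
}

/// Resolves which model version an instance validates against.
/// `header` is the optional modelVersion header of the instance.
pub fn resolve_model_version<'a>(
    model: &'a ModelStream,
    header: Option<&'a str>,
) -> Result<&'a str, VersionError> {
    let init = model
        .event_cids
        .first()
        .ok_or(VersionError::EmptyModelStream)?;
    match header {
        // Not set: default to the CID of the model's init event.
        None => Ok(init.as_str()),
        // Set: it must be the CID of an event within the model stream.
        Some(cid) if model.event_cids.iter().any(|c| c.as_str() == cid) => Ok(cid),
        Some(cid) => Err(VersionError::UnknownVersion(cid.to_string())),
    }
}
```

An instance that never sets the header keeps validating against the init version of the model, even after the model publishes updates.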

The result of these rules is that models can publish new backwards-compatible versions. Model instances can explicitly update to those new versions by updating both their document and the model version header, which ensures that the new document structure is correctly validated. Existing applications that take no action will see both old and new model instance documents; however, because the schema changes are backwards compatible, those applications will see no ill effects. Applications that update to handle the new schema can do so while still handling the many model instances that remain on the old schema.

Additionally, because the controller of a model instance must explicitly opt in to a new version of a model, a model publisher cannot invalidate existing model instances by making changes to the model.

Collaborator Author

@nathanielc nathanielc left a comment


This change is large. To help with review, the changes fall into two halves:

  1. The half that, given a model or a model instance, can validate it
  2. The half that, given a DataFusion record batch of events, calls the validation code from the first half

See pipeline/src/aggregator/validation for the first half and src/pipeline/aggregator/mod.rs for the second half.

.collect()
.await?;
Ok(concat_batches(&schemas::event_states(), &ordered_events)?)
}
Collaborator Author


This method is the heart of all the changes. The aggregator flow now includes four explicit steps:

  1. Join events with previous
  2. Apply patch/updates based on stream type specific rules
  3. Validate newly updated events
  4. Store events with their validation status


/// Helpers to validate interface details.
impl InterfaceUtil {
/// TODO: this is basically unimplemented, but https://github.com/getsentry/json-schema-diff looks super promising!
Collaborator Author


This file has a few TODOs, and its logic is currently unused. Validating that models implement interfaces will be a follow-up change, as this one is large enough already. This scaffold code simply shows where it will fit in.

@nathanielc nathanielc force-pushed the feat/schema-validation branch from 3446426 to a568424 Compare February 12, 2025 20:52
@nathanielc nathanielc requested a review from smrz2001 February 12, 2025 20:52
@nathanielc nathanielc force-pushed the feat/schema-validation branch from a568424 to 2452b54 Compare February 12, 2025 20:55
Base automatically changed from feat/udf-helpers to main February 13, 2025 19:41
@nathanielc nathanielc force-pushed the feat/schema-validation branch from 337c85a to 55c04d4 Compare February 14, 2025 15:25
@nathanielc nathanielc marked this pull request as ready for review February 17, 2025 16:17
@nathanielc nathanielc requested review from dav1do and a team as code owners February 17, 2025 16:17
@nathanielc nathanielc requested review from ukstv and removed request for a team February 17, 2025 16:17
dav1do and others added 7 commits February 19, 2025 16:47
Prior to this change, it was assumed that the index column was a global
ordering value for all tables in the pipeline. However, that cannot be
true, as event_states have stricter ordering constraints than conclusion
events. As such, this change replaces that pattern with explicit order
columns, each with a defined meaning.

This is prework for correctly buffering and reordering event states
according to cross-stream dependencies.
Collaborator

@smrz2001 smrz2001 left a comment


:shipit:

@nathanielc nathanielc enabled auto-merge February 20, 2025 20:51
@nathanielc nathanielc added this pull request to the merge queue Feb 20, 2025
Merged via the queue into main with commit 29944ce Feb 20, 2025
5 checks passed
@nathanielc nathanielc deleted the feat/schema-validation branch February 20, 2025 21:51
3 participants