diff --git a/README.md b/README.md
index b0e26be..cc376a5 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,8 @@
 # Data Product Specification
 
 This repository wants to define an open specification to define data products with the following principles in mind:
-- Data Product as an indipendent unit of deployment
-- Technology indipendence
+- Data Product as an independent unit of deployment
+- Technology independence
 - Extensibility
 
 With an open specification it will be possible to create services for automatic deployment and interoperable components to build a Data Mesh platform.
@@ -15,9 +15,9 @@ The Data Product is composed by a general section with Data Product level inform
 * **Output Ports**: representing all the different interfaces of the Data Product to expose the data to consumers
 * **Workloads**: internal jobs/processes to feed the Data Product and to perform housekeeping (GDPR, regulation, audit, data quality, etc)
 * **Storage Areas**: internal data storages where the Data Product is deployed, not exposed to consumers
-* **Observability**: provides transparency to the data conusmer about how the Data Product is currently working. This is not declarative, but exposing runtime data.
+* **Observability**: provides transparency to the data consumer about how the Data Product is currently working. This is not declarative; it exposes runtime data.
 
-Each Data Product component trait (output ports, workloads, observabilities, etc) will have a well defined and fixed structure and a "specific" one to handle technology specific stuff.
+Each Data Product component trait (output ports, workloads, observability, etc) will have a well-defined, fixed structure and a "specific" one to handle technology-specific details.
 The fixed structure must be technology agnostic. The first fields of the fixed structure are more technical and linked to how the platform will handle them, while the last fields (specific excluded) are to be treated as pure metadata that will simplify management and consumption.
 
 ### General
@@ -28,10 +28,11 @@ The fixed structure must be technology agnostic. The first fields of teh fixed s
   * the ID is a URN of the form `urn:dmb:dp:$DPDomain:$DPName:$DPMajorVersion`
 * `Name: [String]*` the name of the Data Product. This name is also used for display purposes, so it can contain all kinds of characters. When used inside the Data Product ID all special characters are replaced with standard ones and spaces are replaced with dashes.
 * `FullyQualifiedName: [Option[String]]` human-readable name that describes the Data Product.
+  Following the Open Metadata convention, it should be of the form `domain.dataproduct`.
 * `Description: [String]` detailed description about what functional area this Data Product represents, what purpose it has, and business-related information.
 * `Kind: [String]*` type of the entity. Since this is a Data Product the only allowed value is `dataproduct`.
 * `Domain: [String]*` the identifier of the domain this Data Product belongs to.
-* `Version: [String]*` this is representing the version of the Data Product. Displayed as `X.Y.Z` where X is the major version, Y is the minor version, and Z is the patch. Major version (X) is also shown in the Data Product ID and those fields (version and ID) must always be aligned with one another. We consider a Data Product as an indipendent unit of deployment, so if a breaking change is needed, we create a brand new version of it by chaning the major version. If we introduce a new feature (or patch) we will not create a new major version, but we can just change Y (new feature) or Z patch, thus not creating a new ID (and hence not creating a new Data Product).
+* `Version: [String]*` this represents the version of the Data Product. Displayed as `X.Y.Z` where X is the major version, Y is the minor version, and Z is the patch. Major version (X) is also shown in the Data Product ID and those fields (version and ID) must always be aligned with one another. We consider a Data Product as an independent unit of deployment, so if a breaking change is needed, we create a brand new version of it by changing the major version. If we introduce a new feature (or patch) we will not create a new major version; we can just change Y (new feature) or Z (patch), thus not creating a new ID (and hence not creating a new Data Product).
 Constraints:
 * Major version of the Data Product is always the same as the major version of all of its components, and it is the same version that is shown in both Data Product ID and component IDs.
 * `Environment: [String]*`: logical environment where the Data Product will be deployed.
@@ -45,7 +46,7 @@ The fixed structure must be technology agnostic. The first fields of teh fixed s
 * `Maturity: [Option[String]]` this is an enum to let the consumer understand if it is a tactical solution or not. It is really useful during migration from Data Warehouse or Data Lake. Allowed values are: `[Tactical|Strategic]`.
 * `Billing: [Option[Yaml]]` this is a free-form key-value area where it is possible to put information useful for resource tagging and billing.
 * `Tags: [Array[Yaml]]` Tag labels at DP level (please refer to OpenMetadata https://docs.open-metadata.org/metadata-standard/schemas/types/taglabel).
-* `Specific: [Yaml]` this is a custom section where we can put all the information strictly related to a specific execution environment. It can also refer to an additional file. At this level we also embed all the information to provision the general infrastructure (resource groups, networking, etc) needed for a specific Data Product. For example if a company decides to create a ResourceGroup for each data product and have a subscription reference for each domain and environment, it will be specified at this level. Also it is reccommended to put general security here, Azure Policy or IAM policies, VPC/Vnet, Subnet. This will be filled merging data defined at common level with values defined specifically for the selected environment.
+* `Specific: [Yaml]` this is a custom section where we can put all the information strictly related to a specific execution environment. It can also refer to an additional file. At this level we also embed all the information to provision the general infrastructure (resource groups, networking, etc) needed for a specific Data Product. For example, if a company decides to create a ResourceGroup for each data product and to have a subscription reference for each domain and environment, it will be specified at this level. It is also recommended to put general security here: Azure Policy or IAM policies, VPC/Vnet, Subnet. This will be filled by merging data defined at the common level with values defined specifically for the selected environment.
 
 The **unique identifier** of a Data Product is the concatenation of Domain, Name and Version. So we will refer to the `DP_UK` as a URN which ends in the following way: `$DPDomain:$DPName:$DPMajorVersion`.
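To make the alignment between `ID` and `Version` concrete, here is a minimal sketch of the general section of a descriptor (the field names come from the spec; every value is invented for illustration):

```yaml
id: urn:dmb:dp:sales:customer_360:2    # the trailing "2" is the major version
name: Customer 360
fullyQualifiedName: sales.customer_360
description: consolidated view of each customer across all sales channels
kind: dataproduct                      # the only allowed value at Data Product level
domain: sales                          # must match the $DPDomain segment of the id
version: 2.1.0                         # X=2 must equal the major version in the id
environment: development
specific: {}
```

Bumping `version` to `2.1.1` or `2.2.0` keeps the same `id`; moving to `3.0.0` requires a new `id` ending in `:3`, i.e. a brand new Data Product.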
@@ -57,13 +58,14 @@ The **unique identifier** of a Data Product is the concatenation of Domain, Name
   * allowed characters are `[a-zA-Z0-9]` and `[_-]`.
   * the ID is a URN of the form `urn:dmb:cmp:$DPDomain:$DPName:$DPMajorVersion:$OutputPortName`.
 * `Name: [String]*` the name of the Output Port. This name is also used for display purposes, so it can contain all kinds of characters. When used inside the Output Port ID all special characters are replaced with standard ones and spaces are replaced with dashes.
-* `FullyQualifiedName: [Option[String]]` human-readable name that describes better the Output Port. It can also contain specific details (if this is a table this field could contain also indications regarding the databse and the schema).
+* `FullyQualifiedName: [Option[String]]` human-readable name that describes the Output Port.
+  Following the Open Metadata convention, it should be of the form `domain.dataproduct.outputport`.
 * `Description: [String]` detailed explanation about the function and the meaning of the output port.
 * `Kind: [String]*` type of the entity. Since this is an Output Port the only allowed value is `outputport`.
-* `Version: [String]*` specific version of the output port. Displayed as `X.Y.Z` where X is the major version of the Data Product, Y is the minor feature and Z is the patch. Major version (X) is also shown in the component ID and those fields( version and ID) are always aligned with one another. Please note that the major version of the component *must always* corresponde to the major version of the Data Product it belongs to.
+* `Version: [String]*` specific version of the output port. Displayed as `X.Y.Z` where X is the major version of the Data Product, Y is the minor feature and Z is the patch. Major version (X) is also shown in the component ID and those fields (version and ID) are always aligned with one another. Please note that the major version of the component *must always* correspond to the major version of the Data Product it belongs to.
 Constraints:
 * Major version of the Data Product is always the same as the major version of all of its components, and it is the same version that is shown in both Data Product ID and component ID.
-* `InfrastructureTemplateId: [String]*` the id of the microservice responsible for provisioning the component. A microservice may be capable of provisioning several components generated from different use case templates. 
+* `InfrastructureTemplateId: [String]*` the id of the microservice responsible for provisioning the component. A microservice may be capable of provisioning several components generated from different use case templates.
 * `UseCaseTemplateId: [Option[String]]*` the id of the template used in the builder to create the component. It could be empty in case the component was not created from a builder template.
 * `DependsOn: [Array[String]]*` A component could depend on other components belonging to the same Data Product, for example a SQL Output Port could be dependent on a Raw Output Port because it is just an external table. This is also used to define the provisioning order among components.
 Constraints:
@@ -78,21 +80,21 @@ Constraints:
 * `Schema: [Array[Yaml]]` when it comes to describing a schema we propose to leverage the OpenMetadata specification: Ref https://docs.open-metadata.org/metadata-standard/schemas/entities/table#column. Each column can have a tag array and you can choose between simple LabelTags, ClassificationTags or DescriptiveTags.
   Here is an example of a classification Tag: https://github.com/open-metadata/OpenMetadata/blob/main/catalog-rest-service/src/main/resources/json/data/tags/piiTags.json.
 * `SLA: [Yaml]` Service Level Agreement, describes the quality of data delivery and the output port in general. It represents the producer's overall promise to the consumers.
   * `IntervalOfChange: [Option[String]]` how often changes in the data are reflected.
-  * `Timeliness: [Option[String]]` the skew between the time that a business fact occuts and when it becomes visibile in the data.
+  * `Timeliness: [Option[String]]` the skew between the time that a business fact occurs and when it becomes visible in the data.
   * `UpTime: [Option[String]]` the percentage of port availability.
 * `TermsAndConditions: [Option[String]]` If the data is usable only in specific environments.
 * `Endpoint: [Option[URL]]` this is the API endpoint that self-describes the output port and provides insightful information at runtime about the physical location of the data, the protocol to be used, etc.
-* `DataSharingAgreement: [Yaml]` This part is covering usage, privacy, purpose, limitations and is indipendent by the data contract.
+* `DataSharingAgreement: [Yaml]` This part covers usage, privacy, purpose and limitations, and is independent of the data contract.
   * `Purpose: [Option[String]]` what is the goal of this data set.
   * `Billing: [Option[String]]` how a consumer will be charged back when it consumes this output port.
-  * `Security: [Option[String]]` additional information related to security aspects, like restrictions, maskings, sensibile information and privacy.
+  * `Security: [Option[String]]` additional information related to security aspects, like restrictions, maskings, sensitive information and privacy.
   * `IntendedUsage: [Option[String]]` any other information needed by the consumer in order to effectively consume the data; it could be related to technical aspects (e.g. extract no more than one year of data for good performance) or to business domains (e.g. this data is only useful in the marketing domain).
   * `Limitations: [Option[String]]` If any limitation is present it must be made super clear to the consumers.
   * `LifeCycle: [Option[String]]` Describes how the data will be historicized and how and when it will be deleted.
   * `Confidentiality: [Option[String]]` Describes what a consumer should do to keep the information confidential, how to process and store it. Permission to share or report it.
 * `Tags: [Array[Yaml]]` Tag labels at OutputPort level; here we can have security classification, for example (please refer to OpenMetadata https://docs.open-metadata.org/metadata-standard/schemas/types/taglabel).
 * `SampleData: [Option[Yaml]]` provides sample data of your Output Port (please refer to the OpenMetadata specification: https://docs.open-metadata.org/metadata-standard/schemas/entities/table#tabledata).
-* `SemanticLinking: [Option[Yaml]]` here we can express semantic relationships between this output port and other outputports (also coming from other domains and data products). For example we could say that column "customerId" of our SQL Output Port references the column "id" of the SQL Output Port of the "Customer" Data Product.
+* `SemanticLinking: [Option[Yaml]]` here we can express semantic relationships between this output port and other output ports (also coming from other domains and data products). For example we could say that column "customerId" of our SQL Output Port references the column "id" of the SQL Output Port of the "Customer" Data Product.
 * `Specific: [Yaml]` this is a custom section where we must put all the information strictly related to a specific technology or dependent on a standard/policy defined in the federated governance.
@@ -104,12 +106,13 @@ Constraints:
   * the ID is a URN of the form `urn:dmb:cmp:$DPDomain:$DPName:$DPMajorVersion:$WorkloadName`.
 * `Name: [String]*` the name of the Workload. This name is also used for display purposes, so it can contain all kinds of characters. When used inside the Workload ID all special characters are replaced with standard ones and spaces are replaced with dashes.
 * `FullyQualifiedName: [Optional[String]]` human-readable name that better describes the Workload.
-* `Description: [String]` detailed explaination about the purpose of the workload, what sources is reading, what business logic is applying, etc.
+  Following the Open Metadata convention, it should be of the form `domain.dataproduct.workload`.
+* `Description: [String]` detailed explanation about the purpose of the workload, what sources it reads, what business logic it applies, etc.
 * `Kind: [String]*` type of the entity. Since this is a Workload the only allowed value is `workload`.
-* `Version: [String]*` specific version of the workload. Displayed as `X.Y.Z` where X is the major version of the Data Product, Y is the minor feature and Z is the patch. Major version (X) is also shown in the component ID and those fields( version and ID) are always aligned with one another. Please note that the major version of the component *must always* corresponde to the major version of the Data Product it belongs to.
+* `Version: [String]*` specific version of the workload. Displayed as `X.Y.Z` where X is the major version of the Data Product, Y is the minor feature and Z is the patch. Major version (X) is also shown in the component ID and those fields (version and ID) are always aligned with one another. Please note that the major version of the component *must always* correspond to the major version of the Data Product it belongs to.
 Constraints:
 * Major version of the Data Product is always the same as the major version of all of its components, and it is the same version that is shown in both Data Product ID and component ID.
-* `InfrastructureTemplateId: [String]*` the id of the microservice responsible for provisioning the component. A microservice may be capable of provisioning several components generated from different use case templates. 
+* `InfrastructureTemplateId: [String]*` the id of the microservice responsible for provisioning the component. A microservice may be capable of provisioning several components generated from different use case templates.
 * `UseCaseTemplateId: [Option[String]]*` the id of the template used in the builder to create the component. It could be empty in case the component was not created from a builder template.
 * `DependsOn: [Array[String]]*` A component could depend on other components belonging to the same Data Product, for example a SQL Output Port could be dependent on a Raw Output Port because it is just an external table. This is also used to define the provisioning order among components.
 Constraints:
@@ -117,9 +120,9 @@ Constraints:
 * `Platform: [Option[String]]` represents the vendor: Azure, GCP, AWS, CDP on AWS, etc. It is a free field but it is useful to better understand the platform where the component will be running.
 * `Technology: [Option[String]]` represents which technology is used to define the workload, like: Spark, Flink, pySpark, etc. The underlying technology is useful to better understand how the workload processes data.
 * `WorkloadType: [Option[String]]` explains what type of workload it is: Ingestion ETL, Streaming, Internal Process, etc.
-* `ConnectionType: [Option[String]]` an enum with allowed values: `[HouseKeeping|DataPipeline]`; `Housekeeping` is for all the workloads that are acting on internal data without any external dependency. `DataPipeline` instead is for workloads that are reading from outputport of other DP or external systems.
+* `ConnectionType: [Option[String]]` an enum with allowed values: `[HouseKeeping|DataPipeline]`; `HouseKeeping` is for all the workloads that are acting on internal data without any external dependency. `DataPipeline` instead is for workloads that are reading from output ports of other DPs or external systems.
 * `Tags: [Array[Yaml]]` Tag labels at Workload level (please refer to OpenMetadata https://docs.open-metadata.org/metadata-standard/schemas/types/taglabel).
-* `ReadsFrom: [Array[String]]` This is filled only for `DataPipeline` workloads and it represents the list of Output Ports or external systems that the workload uses as input. Output Ports are identified with `DP_UK:$OutputPortName`, while external systems will be defined by a URN in the form `urn:dmb:ex:$SystemName`. This filed can be elaborated more in the future and create a more semantic struct.
+* `ReadsFrom: [Array[String]]` This is filled only for `DataPipeline` workloads and it represents the list of Output Ports or external systems that the workload uses as input. Output Ports are identified with `DP_UK:$OutputPortName`, while external systems will be defined by a URN in the form `urn:dmb:ex:$SystemName`. This field can be elaborated further in the future to create a more semantic structure.
 Constraints:
 * This array will only contain Output Port IDs and/or external system identifiers.
 * `Specific: [Yaml]` this is a custom section where we can put all the information strictly related to a specific technology or dependent on a standard/policy defined in the federated governance.
@@ -133,10 +136,11 @@ Constraints:
   * the ID is a URN of the form `urn:dmb:cmp:$DPDomain:$DPName:$DPMajorVersion:$StorageAreaName`.
 * `Name: [String]*` the name of the Storage Area. This name is also used for display purposes, so it can contain all kinds of characters. When used inside the Storage Area ID all special characters are replaced with standard ones and spaces are replaced with dashes.
 * `FullyQualifiedName: [Optional[String]]` human-readable name that better describes the Storage Area.
+  Following the Open Metadata convention, it should be of the form `domain.dataproduct.storage_area`.
 * `Description: [String]` detailed explanation about the function and the meaning of this storage area.
 * `Kind: [String]*` type of the entity. Since this is a Storage Area the only allowed value is `storage`.
 * `Owners: [Array[String]]` It is an array of user/role/group related to LDAP/AD users. This field defines who has all permissions on this specific storage area.
-* `InfrastructureTemplateId: [String]*` the id of the microservice responsible for provisioning the component. A microservice may be capable of provisioning several components generated from different use case templates. 
+* `InfrastructureTemplateId: [String]*` the id of the microservice responsible for provisioning the component. A microservice may be capable of provisioning several components generated from different use case templates.
 * `UseCaseTemplateId: [Option[String]]*` the id of the template used in the builder to create the component. It could be empty in case the component was not created from a builder template.
 * `DependsOn: [Array[String]]*` A component could depend on other components belonging to the same Data Product, for example a SQL Output Port could be dependent on a Raw Output Port because it is just an external table. This is also used to define the provisioning order among components.
 Constraints:
@@ -156,6 +160,7 @@ Anyway is good to formalize what kind of information should be included and veri
 * `ID: [String]*` the unique identifier of the observability API
 * `Name: [String]*` the name of the observability API
 * `FullyQualifiedName: [String]` human-readable name that uniquely identifies an entity
+  Following the Open Metadata convention, it should be of the form `domain.dataproduct.observability_api_name`.
 * `Description: [String]` detailed explanation about what this observability is exposing
 * `Endpoint: [URL]` this is the API endpoint that will expose the observability for each OutputPort
 * `Completeness: [Yaml]` degree of availability of all the necessary information along the entire history
@@ -177,7 +182,7 @@ In general the version should be used to notify users of the changes between the
 - a change in the patch version means that there are no significant changes, but just bug fixes or small corrections (e.g. an improvement in the field description, a typo that was fixed, an improvement in the validation files)
 
 CUE also offers a [standard way](https://cuelang.org/docs/usecases/datadef/#validating-backwards-compatibility) to check if new versions of a schema are backwards-compatible with older versions.
-It is highly reccommended to check for schema compatibilities when multiple and/or complex changes are introduced.
+It is highly recommended to check for schema compatibility when multiple and/or complex changes are introduced.
 
 In the following sections we will list all the extensions and modifications of this specification and the impact they have on the overall contract:
@@ -207,7 +212,7 @@ Any change of this kind should **always** increase the major version number.
 
 Discouraged customizations are:
 - change in the name or type of existing fields. This kind of change breaks compatibility with previous versions, and should be performed keeping in mind that it will certainly impact all the logic based on those fields.
 - moving fields as sub-fields of other sections (e.g. moving the "workload type" field as a sub-field of a new "type" field). This is actually a specific case of the one above, and should be treated accordingly.
-- deletion of existing fields. This is generally somethign that will impact a lot modules that are leveraging the specification, and you must think very carefully before doing deletions. Think that you can always make a field optional, and this choice will impact the specification way less.
+- deletion of existing fields. This is generally something that will impact many modules that leverage the specification, and you must think very carefully before doing deletions. Remember that you can always make a field optional instead, which impacts the specification far less.
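To illustrate the "moving fields" bullet with the document's own example: the reshuffle below looks harmless, but it breaks every module that reads the old top-level key (the snippet is purely illustrative; the values are invented):

```yaml
# before: modules read the workload type from a top-level field
workloadType: Streaming

# after the discouraged move: same information, new shape;
# every consumer of `workloadType` must now be migrated
type:
  workload: Streaming
```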
 **N.B.: all the changes described above are allowed only if they do not affect reserved fields, which are treated in the Forbidden customization section.**
diff --git a/example.yaml b/example.yaml
index 5569095..5756495 100644
--- a/example.yaml
+++ b/example.yaml
@@ -1,6 +1,6 @@
 id: urn:dmb:dp:my_domain:my_data_product:1
 name: my data product
-fullyQualifiedName: My Data Product
+fullyQualifiedName: my_domain.my_data_product
 description: this data product represents the xxx functional entity
 kind: dataproduct
 domain: my_domain
@@ -20,7 +20,7 @@ specific: {}
 components:
   - id: urn:dmb:cmp:my_domain:my_data_product:1:my_raw_s3_port
     name: my raw s3 port
-    fullyQualifiedName: My Raw S3 Port
+    fullyQualifiedName: my_domain.my_data_product.my_raw_s3_port
     description: s3 raw output port
     kind: outputport
     version: 1.0.1
@@ -37,7 +37,7 @@ components:
       schema: []
       SLA:
         intervalOfChange: 1 hours
-        timeliness: 1 minutes 
+        timeliness: 1 minutes
         upTime: 99.9%
       termsAndConditions: only usable in development environment
       endpoint: https://myurl/development/my_domain/my_data_product/1.0.0/my_raw_s3_port
@@ -45,10 +45,10 @@ components:
       purpose: this output port wants to provide a rich set of profitability KPIs related to the customer
       billing: 5$ for each full scan
       security: In order to consume this output port an additional security check with compliance must be done
-      intendedUsage: the dataset is huge so it is reccomended to extract maximum 1 year of data and to use these KPIs in the marketing or sales domain, but not for customer care
+      intendedUsage: the dataset is huge so it is recommended to extract a maximum of 1 year of data and to use these KPIs in the marketing or sales domain, but not for customer care
       limitations: it is not possible to use this data without a compliance check
       lifeCycle: the maximum retention is 10 years, and eviction is happening on the first of January
-      confidentiality: if you want to store this data somewhere else, PII columns must be masked 
+      confidentiality: if you want to store this data somewhere else, PII columns must be masked
     tags:
       - tagFQN: experimental
         source: Tag
@@ -65,7 +65,7 @@ components:
       bucket: ms-datamesh-s3
   - id: urn:dmb:cmp:my_domain:my_data_product:1:my_view_impala_port
     name: my view impala port
-    fullyQualifiedName: My View Impala Port
+    fullyQualifiedName: my_domain.my_data_product.my_view_impala_port
     description: impala view output port
     kind: outputport
     version: 1.1.0
@@ -86,7 +86,7 @@ components:
           dataType: string
       SLA:
         intervalOfChange: 1 hours
-        timeliness: 1 minutes 
+        timeliness: 1 minutes
         upTime: 99.9%
       termsAndConditions: only usable in development environment
       endpoint: https://myurl/development/my_domain/my_data_product/1.0.0/my_raw_s3_port
@@ -94,10 +94,10 @@ components:
       purpose: this output port wants to provide a rich set of profitability KPIs related to the customer
       billing: 5$ for each full scan
       security: In order to consume this output port an additional security check with compliance must be done
-      intendedUsage: the dataset is huge so it is reccomended to extract maximum 1 year of data and to use these KPIs in the marketing or sales domain, but not for customer care
+      intendedUsage: the dataset is huge so it is recommended to extract a maximum of 1 year of data and to use these KPIs in the marketing or sales domain, but not for customer care
       limitations: it is not possible to use this data without a compliance check
       lifeCycle: the maximum retention is 10 years, and eviction is happening on the first of January
-      confidentiality: if you want to store this data somewhere else, PII columns must be masked 
+      confidentiality: if you want to store this data somewhere else, PII columns must be masked
     tags: []
     sampleData:
       columns:
@@ -121,7 +121,7 @@ components:
           format: PARQUET
   - id: urn:dmb:cmp:my_domain:my_data_product:1:my_spark_workload
     name: my spark workload
-    fullyQualifiedName: My Spark workload
+    fullyQualifiedName: my_domain.my_data_product.my_spark_workload
     description: spark batch workload
     kind: workload
     version: 1.1.1
@@ -151,7 +151,7 @@ components:
       cronExpression: 0 0 0,22 ? * * *
   - id: urn:dmb:cmp:my_domain:my_data_product:1:my_observability
     name: my observability
-    fullyQualifiedName: My Observability
+    fullyQualifiedName: my_domain.my_data_product.my_observability
     description: observability for my data product
     kind: observability
     infrastructureTemplateId: microservice-id-4
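Since the diff above only shows the changed hunks, here is a hedged sketch of how a `DataPipeline` workload's `readsFrom` list would combine the two identifier forms described in the Workloads section (both entries are invented):

```yaml
readsFrom:
  # the Output Port ID of another Data Product: DP_UK plus the port name
  - urn:dmb:cmp:finance:billing:2:billing_monthly_sql_port
  # an external (non-mesh) system, using the urn:dmb:ex:$SystemName form
  - urn:dmb:ex:legacy-crm
```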