Annotation Design Discussion #56

**Questions & Todo:**
- Discuss how Annotations should be implemented in HashStore
- What format should we use to store annotation content in `/hashstore/metadata`? JSON-LD or EML?
- What is HashStore's responsibility when storing annotations?
    - Is the EML document already formed at this point?
    - Where is the content coming from?
    - Who currently creates the EML documents to be stored?
- Summarize issue discussion into substorage design document

**Initial Proposal to kickstart the conversation (the content below is not final, and will likely change):**
- A dataset represented by an EML document can be broken down into two components:
    - Attributes that describe the dataset (e.g. title, author, method, keywordSet)
    - Attributes that represent the tables associated with the dataset (e.g. dataTable, otherEntity)
- A `HashStore annotation` is a mapping document that consists of a single parent member and a list of child members
    - This document's location in `/hashstore/metadata` is formed by calculating the SHA-256 hex digest of a given `pid` and `formatId`
    - The parent member's value is the _id (location)_ of the parent metadata document in `/hashstore/metadata`
        - The id/location/address of the parent document is formed by calculating the SHA-256 hex digest of a given `pid`, `formatId` and the string "parent". Ex. `sha-256(pid + formatId + "parent")`
        - The parent document is composed of the attributes/content that describe the dataset (e.g. title, author, method, keywordSet)
    - Each entry in the List/HashMap of child members is a key-value pair, with a number as the key and the _id (location)_ of the child's metadata document in `/hashstore/metadata` as the value
        - The id/address of each child is formed by calculating the SHA-256 hex digest of a given `pid`, `formatId` and `(int) key`. Ex. `sha-256(pid + formatId + 0)`, where 0 is the first table in the dataset
        - Each child represents a data table in the dataset, or a chunk of data that belongs to the dataset
- _Note: The format of the parent/child metadata documents to be stored/chunked requires further discussion/clarification_
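
The id derivation described above can be sketched in Python as follows. This is only a minimal sketch, assuming plain string concatenation with no delimiter and UTF-8 encoding before hashing; the helper names `annotation_id`, `parent_id` and `child_id` are hypothetical and not part of HashStore.

```python
import hashlib


def _sha256_hex(*parts: str) -> str:
    """Return the SHA-256 hex digest of the concatenated parts (UTF-8 assumed)."""
    return hashlib.sha256("".join(parts).encode("utf-8")).hexdigest()


def annotation_id(pid: str, format_id: str) -> str:
    """Id of the mapping document: sha-256(pid + formatId)."""
    return _sha256_hex(pid, format_id)


def parent_id(pid: str, format_id: str) -> str:
    """Id of the parent metadata document: sha-256(pid + formatId + "parent")."""
    return _sha256_hex(pid, format_id, "parent")


def child_id(pid: str, format_id: str, key: int) -> str:
    """Id of a child metadata document: sha-256(pid + formatId + key)."""
    return _sha256_hex(pid, format_id, str(key))
```

One thing this sketch makes visible: with undelimited concatenation, different `pid`/`formatId` pairs can yield the same input string (for example `"a" + "bc"` and `"ab" + "c"`), so whether a separator is needed may be worth settling as part of this discussion.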

```mermaid
---
title: HashStoreAnnotation Class
---
classDiagram
    direction RL
    class HashStoreAnnotation{
        +String Parent
        +List~Dict/KVP~ Children
        +setParent(string)
        +setChildren(List)
        +getContent()
        +setContent()
        +getChildrenTotal()
    }
```
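
To make the diagram concrete, here is a rough Python sketch of the class. It assumes plain JSON as the serialized format purely for illustration (the JSON-LD vs. EML question above is still open), and the method names mirror the diagram in snake_case; none of this is an existing HashStore API.

```python
import json
from typing import Dict, List, Optional


class HashStoreAnnotation:
    """Mapping document: one parent id plus a list of {key: child id} pairs."""

    def __init__(self) -> None:
        self.parent: Optional[str] = None
        self.children: List[Dict[int, str]] = []

    def set_parent(self, parent: str) -> None:
        self.parent = parent

    def set_children(self, children: List[Dict[int, str]]) -> None:
        self.children = list(children)

    def get_children_total(self) -> int:
        return len(self.children)

    def get_content(self) -> str:
        """Serialize to JSON (assumed format). JSON object keys must be strings."""
        children = [{str(k): v for k, v in child.items()} for child in self.children]
        return json.dumps({"parent": self.parent, "children": children})

    def set_content(self, content: str) -> None:
        """Populate this object from a JSON string produced by get_content()."""
        doc = json.loads(content)
        self.parent = doc["parent"]
        self.children = [
            {int(k): v for k, v in child.items()} for child in doc["children"]
        ]
```

In this sketch `set_content()` is an instance method, so a caller would create an empty `HashStoreAnnotation` first and then load the retrieved content into it.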

```
Example/flow to store an annotation document:

hs_annotation = HashStoreAnnotation()

// Get and store parent content
// Get and store children content

// Get parent location
dataset_parent = sha-256(pid + formatId + "parent")
// Create child list
dataset_children = [
    {0: sha-256(pid + formatId + 0)},
    {1: sha-256(pid + formatId + 1)},
    ...
]
hs_annotation.setParent(dataset_parent)
hs_annotation.setChildren(dataset_children)

// getContent() will format the document to be written based on the chosen format
hs_annotation_content = hs_annotation.getContent()

hashstore.store_metadata(pid, hs_annotation_content, formatId) 

```

```
Example/flow to work with/retrieve an annotation document:

// Retrieve the mapping document
hs_annotation_stream = hashstore.retrieve_metadata(pid, formatId)
hs_annotation = HashStoreAnnotation()
hs_annotation.setContent(hs_annotation_stream)
hsa_parent = hs_annotation.Parent
hsa_children = hs_annotation.Children

// Iterate over (at most) the first 1000 table items
for i in range(0, min(1000, hs_annotation.getChildrenTotal())):
     // Each child is a {key: id} pair, so look up the id by its key
     rel_path = shard(hsa_children[i][i])
     location = "/hashstore/metadata/" + rel_path
     // ... Do what we will with each child element
```
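
The `shard` helper used above is not defined in this issue. A minimal Python sketch of what it might do, assuming a directory depth of 3 and a width of 2 characters per directory (both values are assumptions here and would come from the store's configuration), and using a made-up digest as the child id:

```python
from typing import List


def shard(hex_digest: str, depth: int = 3, width: int = 2) -> str:
    """Split a hex digest into `depth` directory tokens of `width` characters,
    followed by the remainder, and join them into a relative path."""
    tokens: List[str] = [
        hex_digest[i * width:(i + 1) * width] for i in range(depth)
    ]
    remainder = hex_digest[depth * width:]
    if remainder:
        tokens.append(remainder)
    # Drop empty tokens in case the digest is shorter than depth * width
    return "/".join(token for token in tokens if token)


# Hypothetical child id, for illustration only
child_digest = "0d555ed77052d7e166017f779cbc193357c3a5006ee8b8457230bcf7abcef65e"
location = "/hashstore/metadata/" + shard(child_digest)
```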
