Annotation Design Discussion #56

@doulikecookiedough

Questions & Todo:

  • Discuss how Annotations should be implemented in HashStore
  • What format should we use to store annotation content in /hashstore/metadata? JSON-LD or EML?
  • What is HashStore's responsibility when storing annotations?
    • Is the EML document already formed at this point?
    • Where is the content coming from?
    • Who currently creates the EML documents to be stored?
  • Summarize issue discussion into substorage design document

Initial Proposal to kickstart the conversation (the content below is not final, and will likely change):

  • A dataset that is represented by an EML document can be broken down into two components:
    • Attributes that describe the dataset (ex. title, author, method, keywordSet, etc.)
    • Attributes that represent the tables associated with the dataset (ex. dataTable, otherEntity, etc.)
  • A HashStore annotation is a mapping document that should consist of a single parent member and a list that represents the child members
    • This document's location in hashstore/metadata is formed by calculating the SHA-256 hex digest of a given pid and formatId
      • The parent member's value is the id (location) of the parent metadata document in hashstore/metadata
        - The id/location/address of this document is formed by calculating the SHA-256 hex digest of a given pid, formatId and the string "parent". Ex. sha-256(pid + formatId + "parent")
        - This document is composed of the attributes/content that describe the dataset (ex. title, author, method, keywordSet, etc.)
      • The child members are held in a List/HashMap, with a number as the key and the id (location) of the child's metadata document in hashstore/metadata as the value
        - The id/address of each child is formed by calculating the SHA-256 hex digest of a given pid, formatId and (int) key. Ex. sha-256(pid + formatId + 0) where 0 is the first table in the dataset
        - Each child represents a data table in the dataset, or chunk of data that belongs to the dataset
  • Note: The format of the parent/child metadata documents to be stored/chunked requires further discussion/clarification
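
To make the addressing scheme above concrete, here is a minimal sketch of how the parent and child ids could be derived. It assumes the three parts are joined by plain string concatenation, encoded as UTF-8, and that the integer key is rendered as its decimal string; none of that is settled in this proposal. `annotation_address` is a hypothetical helper name, and the pid and formatId values are placeholders.

```python
import hashlib
from typing import Union


def annotation_address(pid: str, format_id: str, member: Union[str, int]) -> str:
    """SHA-256 hex digest of pid + format_id + member, joined as plain strings."""
    joined = f"{pid}{format_id}{member}"
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Placeholder identifiers, for illustration only.
pid = "example.pid.1"
format_id = "https://eml.ecoinformatics.org/eml-2.2.0"

# Address of the parent metadata document.
parent_id = annotation_address(pid, format_id, "parent")

# One {key: address} entry per table, keyed by its position in the dataset.
child_ids = [{key: annotation_address(pid, format_id, key)} for key in range(3)]
```

One thing to keep in mind for the discussion: with plain concatenation, different (pid, formatId) pairs can produce the same joined string, so a delimiter between the parts may be worth considering.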
---
title: HashStoreAnnotation Class
---
classDiagram
    direction RL
    class HashStoreAnnotation{
        +String Parent
        +List~Dict/KVP~ Children
        +setParent(string)
        +setChildren(List)
        +getContent()
        +setContent()
        +getChildrenTotal()
    }
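
As a rough illustration of the class in the diagram, the sketch below implements the same members in Python. The serialization format is still an open question in this issue (JSON-LD or EML), so plain JSON is used here purely as a stand-in, and the `"parent"` / `"children"` field names are assumptions. The ids in the usage lines are dummy strings, not real digests.

```python
import json
from typing import Dict, List


class HashStoreAnnotation:
    """Mapping document: one parent id plus a list of {key: child id} entries."""

    def __init__(self) -> None:
        self.parent: str = ""
        self.children: List[Dict[int, str]] = []

    def setParent(self, parent: str) -> None:
        self.parent = parent

    def setChildren(self, children: List[Dict[int, str]]) -> None:
        self.children = list(children)

    def getChildrenTotal(self) -> int:
        return len(self.children)

    def getContent(self) -> str:
        # JSON object keys must be strings, so the integer keys are converted.
        children = [{str(k): v for k, v in c.items()} for c in self.children]
        return json.dumps({"parent": self.parent, "children": children})

    def setContent(self, content: str) -> None:
        # Inverse of getContent(): restore the integer keys on load.
        doc = json.loads(content)
        self.parent = doc["parent"]
        self.children = [{int(k): v for k, v in c.items()} for c in doc["children"]]


# Round trip with dummy ids.
original = HashStoreAnnotation()
original.setParent("parent-id")
original.setChildren([{0: "child-id-0"}, {1: "child-id-1"}])

restored = HashStoreAnnotation()
restored.setContent(original.getContent())
```

Here `setContent()` is an instance method that fills in an existing object, matching the diagram; a class-level constructor would work equally well if that reads better to people.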
Example/flow to store an annotation document:

hs_annotation = HashStoreAnnotation()

# Get and store parent content
# Get and store children content

# sha256(...) below stands for the SHA-256 hex digest of the concatenated string
# Get parent location
dataset_parent = sha256(pid + formatId + "parent")
# Create child list
dataset_children = [
    {0: sha256(pid + formatId + "0")},
    {1: sha256(pid + formatId + "1")},
    ...
]
hs_annotation.setParent(dataset_parent)
hs_annotation.setChildren(dataset_children)

# getContent() will format the document to be written based on the chosen format
hs_annotation_content = hs_annotation.getContent()

hashstore.store_metadata(pid, hs_annotation_content, formatId)

Example/flow to work with/retrieve an annotation document:

# Retrieve the mapping document
hs_annotation_stream = hashstore.retrieve_metadata(pid, formatId)
hs_annotation = HashStoreAnnotation()
hs_annotation.setContent(hs_annotation_stream)
hsa_parent = hs_annotation.parent
hsa_children = hs_annotation.children

# Iterate over the first 1000 table items (or fewer, if the dataset has fewer)
for i, child in enumerate(hsa_children[:1000]):
    # Each child entry is a single-item mapping of {key: id}
    rel_path = shard(child[i])
    location = "/hashstore/metadata/" + rel_path
    # ... Do what we will with each child element
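
The `shard()` call in the retrieval flow is not defined in this issue. A minimal sketch of what it might do, assuming the hex digest is split into a fixed number of short directory tokens followed by the remainder, is shown below. The `depth` and `width` defaults are assumptions chosen for illustration, not values taken from this discussion.

```python
def shard(digest: str, depth: int = 3, width: int = 2) -> str:
    """Split a hex digest into `depth` directory tokens of `width` characters,
    followed by the remaining characters, joined into a relative path."""
    tokens = [digest[i * width:(i + 1) * width] for i in range(depth)]
    tokens.append(digest[depth * width:])
    # Skip empty tokens so short inputs do not produce empty path segments.
    return "/".join(token for token in tokens if token)


rel_path = shard("abcdef1234")  # "ab/cd/ef/1234"
location = "/hashstore/metadata/" + rel_path
```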
