Commit d1f6b38

Some docs cleanup before 2024-12-03 Client Success WG (#12)
* Some docs cleanup before 2024-12-03 Client Success WG
* Typo fixes
1 parent 5b954e6 commit d1f6b38

3 files changed: +28 -24 lines changed

README.md

+16-15
Original file line numberDiff line numberDiff line change
@@ -18,23 +18,23 @@
1818

1919

2020
# Purpose
21-
Houses definitions, discussions, and supporting materials/processses for Filecoin service classes, SLOs, and SLIs.
21+
Houses definitions, discussions, and supporting materials/processes for Filecoin service classes, SLOs, and SLIs.
2222

2323
# Background
2424

25-
Storage clients have a diverse set of needs and as a result, storage providers like AWS, GCP, etc. have created a plethora of storage options to meet these needs. At least as of 202410, Filecoin doesn’t articulate clearly what storage classes are supported, how we define them, and how we’re measuring against them. Filecoin makes strong guarantees of replication with its daily spacetime proofs, but there are additional dimensions that storage clients want to have visibility into (e.g, retrievability, performance). This was a topic of conversation during [FIL Dev Summit #4](https://www.fildev.io/FDS-4), and a ["PMF Targets Working Group"](https://www.notion.so/Filecoin-PMF-Targets-Working-Group-111837df73d480b6a3a9e5bfd73063de) was started in 2024Q3 in an attempt to change this so storage clients can know what to expect and so the Filecoin ecosystem can clearly see opportunities to fill or improve.
25+
Storage clients have a diverse set of needs and as a result, storage providers like AWS, GCP, etc. have created a plethora of storage options to meet these needs. At least as of 202410, Filecoin doesn’t articulate clearly what storage classes are supported, how we define them, and how we’re measuring against them. Filecoin makes strong guarantees of replication with its daily spacetime proofs, but there are additional dimensions that storage clients want to have visibility into (e.g, retrievability, performance). This was a topic of conversation during [FIL Dev Summit #4](https://www.fildev.io/FDS-4), and a ["PMF Targets Working Group"](https://protocollabs.notion.site/Filecoin-PMF-Targets-Working-Group-111837df73d480b6a3a9e5bfd73063de?pvs=4) was started in 2024Q3 in an attempt to change this so storage clients can know what to expect and so the Filecoin ecosystem can clearly see opportunities to fill or improve. In 202412, this work has shared with the "["Client Success Working Group"](https://protocollabs.notion.site/Filecoin-Client-Success-Working-Group-150837df73d480eabccff836d2553990?pvs=4) to get more feedback is what is needed by onramps and aggregators.
2626

2727
# Service Classes
2828

2929
Service Class | Status
3030
:--: | --
31-
["(TBD) Warm"](./service-classes/warm.md) | 2024-11-04: This is a sketch of a service class definition to represent data stored with Filecoin that also has an accompanying unsealed copy for retrieval. Key details like the threshold SLO values and even the name have not been determined.
32-
["(TBD) Cold"](./service-classes/cold.md) | 2024-11-04: This is a placeholder service class to illustrate that there should be multipel service classes including one with slower retrievability than ["warm"](./service-classes/warm.md).
31+
["(TBD) Warm"](./service-classes/warm.md) | 2024-12-03: This is a sketch of a service class definition to represent data stored with Filecoin that also has an accompanying unsealed copy for retrieval. Key details like the threshold SLO values have guessed placeholder values. Input is needed on what other SLOs are needed, SLO target values, and the name.
32+
["(TBD) Cold"](./service-classes/cold.md) | 2024-11-04: This is a placeholder service class to illustrate that there should be multiple service classes including one with slower retrievability than ["warm"](./service-classes/warm.md).
3333

3434
* A service class is set of dimensions that define a type of storage. “Archival” and “Hot” are a couple of examples with dimensions like "availability", "durability", and "performance". These service class dimensions have various [SLOs](#service-level-objectives) that should be met to satisfy the needs of that service class.
3535
* Service classes are defined in the [`service-classes` directory](./service-classes/).
3636
* There are intended to be many service classes.
37-
* A service class should correspond with a set of expectations that a group of storage cliens would have for certain data. This group of storage clients would expect to see all the corresponding SLOs consistently met by an SP in order to store their corresponding data with that SP.
37+
* A service class should correspond with a set of expectations that a group of storage clients would have for certain data. This group of storage clients would expect to see all the corresponding SLOs consistently met by an SP in order to store their corresponding data with that SP.
3838

3939
# Service Level Objectives
4040
* A Service Level Objective (SLO) is quality target for a service class. It defines the “acceptable” value or threshold for a [SLI](#service-level-indicator). They set expectations for a storage clients using the storage service, and then also give clear targets that storage providers need to hit and measure themselves against.
@@ -45,32 +45,33 @@ Service Class | Status
4545

4646
Service Level Indicator | Status
4747
-- | --
48-
[Spark Retrieval Success Rate](./service-level-indicators/spark-retrieval-success-rate.md) | ![review](https://img.shields.io/badge/status-review-yellow.svg?style=flat-square) this is the first SLI that has been worked to meet expectations both in terms of supporting documentation and being on chain. It is expected to serve as an example for to-be-created SLIs.
48+
[Spark Retrieval Success Rate v1](./service-level-indicators/spark-retrieval-success-rate.md) | ![review](https://img.shields.io/badge/status-review-yellow.svg?style=flat-square) this is the first SLI that has been worked to meet expectations both in terms of supporting documentation and being on chain. It is expected to serve as an example for to-be-created SLIs.
4949
["(TBD) Sector Health Rate"](./service-level-indicators/sector-health-rate.md) | ![wip](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square) while this using the original proof of spacetime (PoSt) that has always been with Filecoin, the documentation for how the metric is computed and what it does and doesn't measure hasn't been developed.
5050

5151
* A Service Level Indicator (SLI) is a metric that measures compliance with an [SLO](#service-level-objectives). SLI is the actual measurement. To meet the SLO, the SLI will need to meet or exceed the promises made by the SLO.
5252
* SLIs are defined in the [`service-level-indicators` directory](./service-level-indicators/).
5353

5454
# Tenets
5555
Below are tenets that have been guiding this work:
56-
1. SLIs must be on chain. We are holding this line because:
56+
57+
1. The "rules of the game" are knowable and discoverable - All participants in any markets created from this work should understand what is being measured, how it's being measured, what isn't being measured, etc. so they can reason appropriately about what something means and what actions to take if any.
58+
2. Flowing from #1, SLIs must be on chain. We are holding this line because:
5759
1. Forcing function for the data to actually get onchain. Compromising to allow non-chain data to start has historically made it hard to then do the lift of actually get it on chain despite all the best intentions at the start.
5860
2. We get the benefit of onchain data, as in the immutability guarantees. This is particularly important in an assumed future where these onchain “scores” will affect the reward structures of SPs.
59-
2. The "rules of the game" are knowable and discoverable - All participants in any markets created from this work should understand what is being measured, how it's being measured, what isn't being measured, etc. so they can reason appropriately about what something means and what actions to take if any.
60-
3. Gatekeep on quality but not exploration - There isn't one group of people that knows what the authoriative list of service classes should be. There should be room to explore, Peer review and approval should be applied to make sure that proposed service classes and SLIs are well documented and explained more than deciding what are the right set.
61-
4. Make room for alternatives - This is related to exploration. As a concrete example, we shouldn't discuss an unqualified "retrieval success rate" assuming there will only be a signle SLI measuring retrieval. Instead, SLIs should have proper qualification (e.g., "_Spark_ retrieval success rate") to make clear that there is opportunity for other "retrieval success rate" SLIs to emerage.
61+
3. Gatekeep on quality but not exploration - There isn't one group of people that knows what the authoritative list of service classes should be. There should be room to explore, Peer review and approval should be applied to make sure that proposed service classes and SLIs are well documented and explained more than deciding what are the right set.
62+
4. Make room for alternatives - This is related to exploration. As a concrete example, we shouldn't discuss an unqualified "retrieval success rate" assuming there will only be a single SLI measuring retrieval. Instead, SLIs should have proper qualification (e.g., "_Spark_ retrieval success rate") to make clear that there is opportunity for other "retrieval success rate" SLIs to emerge.
6263

6364
# Conventions
6465
* If something has a placeholder name, it is usually wrapped in quotes and prefixed with `(TBD)` (e.g., "(TBD) Warm" for a to-be-named service class that stores data that is "warmer" than the "(TBD) cold" service class)
6566

6667
## Abbreviations
67-
Throughout this repo, these appreviations are used:
68-
* SLI - service level indicator
69-
* SLO - service level objective
68+
Throughout this repo, these abbreviations are used:
69+
* SLI - [service level indicator](#service-level-indicators)
70+
* SLO - [service level objective](#service-level-objectives)
7071
* SP - storage provider, meaning the entity as defined in the Filecoin protocol with an individual id that commits sectors, accepts deals, etc. We're not referring to a brand/company which might compose multiple providerIds/minderIds.
7172

7273
# Improvement Proposal Process
73-
The process for proposing new service classes or SLIs, or modifying existing service classes and SLOs hasn't been determined yet. This is something we hope to get more formalized during 202411.
74+
The process for proposing new service classes or SLIs, or modifying existing service classes and SLOs hasn't been determined yet. This is something we hope to get more formalized during 202412 or 2025Q1. ([Tracking issue](https://github.com/filecoin-project/service-classes/issues/11).)
7475

7576
# FAQ
7677
## Where is performance against a service class measured and presented?
@@ -83,7 +84,7 @@ There currently isn't any way for an SP to opt in to some service classes and ou
8384
TODO; fill this in
8485

8586
## What is a "service class" vs. "storage class"?
86-
The terms are synonimous in our context, but we are using the term "service class" since that is the industry norm.
87+
The terms are synonymous in our context, but we are using the term "service class" since that is the industry norm.
8788

8889
## Why don't we use the term "SLA" currently?
8990
“Service Level Agreement” is avoided for now because it means different things to different people. For example, anyone in the storage world who uses S3 may have come to learn that [S3 only has one SLA](https://aws.amazon.com/s3/sla/), and it only pertains to service availability. S3 has [other performance dimensions that it evaluates it storage products against](https://aws.amazon.com/s3/storage-classes/#Performance_across_the_S3_storage_classes), but there are no SLAs there. An SLA is technically a legal contract that if breached will have financial penalty. Filecoin the protocol doesn’t really have this currently except for proof of replication. (Clients may have off-chain agreements with SPs.) To keep the conversation clearer for now, we’re focusing on service classes and their SLOs. When there is actually reward and penalty for meeting these SLOs, we as a group can start introducing SLA terminology.

service-classes/warm.md

+7-6
@@ -1,6 +1,7 @@
11

22
# Status
3-
* 2024-11-04: This is a sketch of a service class definition to represent data stored with Filecoin that also has an accompanying unsealed copy for retrieval. Key details like the threshold SLO values and even the name have not been determined or ageed upon.
3+
* 2024-12-03: Placeholder SLO threshold values were set as a starting point before the 2024-12-03 "Client Success Working Group". Input is needed on what other SLOs are needed, SLO target values, and the name.
4+
* 2024-11-04: This is a sketch of a service class definition to represent data stored with Filecoin that also has an accompanying unsealed copy for retrieval. Key details like the threshold SLO values and even the name have not been determined or agreed upon.
45

56
# Intended Users
67
This service class is targeting users who 1) expect to retrieve at least some subset of their data at least weekly and 2) when they do retrieve, to have the first byte in under a second.
@@ -9,13 +10,13 @@ This service class is targeting users who 1) expect to retrieve at least some su
910
Dimension | SLI | Threshold
1011
-- | -- | --
1112
Retrievability | [Spark Retrieval Success Rate](../service-level-indicators/spark-retrieval-success-rate.md) | 90% per day
12-
"(TBD) Durability" | ["(TBD) Sector Health Rate"](../service-level-indicators/sector-health-rate.md) | 99% per day
13+
"(TBD) Durability" | ["(TBD) Sector Health Rate"](../service-level-indicators/sector-health-rate.md) | 95% per day
1314

14-
At least as of 202411, we're targeting a retrieval success rate of 90, which seems low when compared to the "availability" guarantees that other cloud providers make. This is for a few reasons:
15-
1. Retrievability in this decentralized Filecoin context is quite different from availability in a web2 context. Retrievability is being measured from a untrusted set of clents. web2 availability is being measured from the server side, and thus has less uncontrollable variables.
15+
At least as of 202411, we're targeting a retrieval success rate of 90%, which seems low when compared to the "availability" guarantees that other cloud providers make. This is for a few reasons:
16+
1. Retrievability in this decentralized Filecoin context is quite different from availability in a web2 context. Retrievability is being measured from a untrusted set of clients. web2 availability is being measured from the server side, and thus has less uncontrollable variables.
1617
2. The [Spark Retrieval Success Rate docs](../service-level-indicators/spark-retrieval-success-rate.md) do a good job enumerating the various ways that results can be poisoned by malicious actors. This lower-than-99+% target is to account for these possibilities.
1718
3. This level of Spark RSR is already significantly higher than the level of retrievability that most SPs were offering in early 2024. This SLO is moving SPs in a new direction, and it can be adjusted once a better threshold is determined.
1819

19-
This "(TBD) sector health rat"e of 99% doesn't match the "durability" targets with many 9's that web2 providers have because:
20+
This "(TBD) sector health rat"e of 95% doesn't match the "durability" targets with many 9's that web2 providers have because:
2021
1. They are different metrics. web2 providers are looking at the durability of each byte written to their service which benefits from their infrastructure setup and erasure encoding.
21-
2. Often in the cases where a Storage Provider misses a PoSt, they meet it in future prooving windows. This means the data wasn't lost, but rather that a sector was not-proven to the network within its prooving deadline.
22+
2. Often in the cases where a Storage Provider misses a PoSt, they meet it in future proving windows. This means the data wasn't lost, but rather that a sector was not-proven to the network within its proving deadline.
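
As a rough illustration (not part of this commit), the placeholder thresholds in the updated table could be checked as sketched below. The values 90% and 95% come from the table, and the "meet or exceed" rule comes from the README's SLI definition. The dictionary keys and the function name are hypothetical labels chosen for this sketch.

```python
# Placeholder SLO thresholds from the "(TBD) Warm" table, as daily fractions.
# The keys are hypothetical labels, not identifiers defined in the repo.
WARM_THRESHOLDS = {
    "spark_retrieval_success_rate": 0.90,
    "sector_health_rate": 0.95,
}


def meets_warm_slos(daily_slis: dict[str, float]) -> bool:
    """Return True if every SLI meets or exceeds its threshold for the day.

    A missing SLI is treated as 0.0, so it fails the check.
    """
    return all(
        daily_slis.get(name, 0.0) >= threshold
        for name, threshold in WARM_THRESHOLDS.items()
    )
```

Treating a missing measurement as a failure is a design choice of this sketch; the repo does not say how gaps in SLI data should be handled.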

service-level-indicators/sector-health-rate.md

+5-3
@@ -23,13 +23,15 @@ This document is intended to become the canonical resource that is referenced in
2323
## Versions / Status
2424
SLI Version | Status | Comment
2525
-- | -- | --
26-
v1.0.0 | ![wip](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square) | 2024-11-04: this was started as a placeholder to start moving the exploration work from https://github.com/davidgasquez/filecoin-data-portal/issues/79 over and to seed this repo with more than one metric definition. It needs more review, and particularly SP feedback on the caveats of this metric. It is not decided that "Sector Health Rate" is the right name or that this should be under "durability". Agains, this current iteration was done to move fast so there is more skeleton in this repo before FDS 5.
26+
v1.0.0 | ![wip](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square) | 2024-11-04: this was started as a placeholder to start moving the exploration work from https://github.com/davidgasquez/filecoin-data-portal/issues/79 over and to seed this repo with more than one metric definition. It needs more review, and particularly SP feedback on the caveats of this metric. It is not decided that "Sector Health Rate" is the right name or that this should be under "durability". Again, this current iteration was done to move fast so there is more skeleton in this repo before FDS 5.
2727

2828

2929
## Support, Questions, and Feedback
3030
If you see errors in this document, please open a PR.
31+
3132
If you have a question that isn't answered by the document, then ...
32-
If you want to discuss ideas for improving this proposoal, then ...
33+
34+
If you want to discuss ideas for improving this proposal, then ...
3335

3436
# TL;DR
3537
Filecoin has a robust mechanism already for proving spacetime on chain for each sector. The proportion of successful proofs over time gives indication of the "durability" of data stored on these sectors.
@@ -51,7 +53,7 @@ There are multiple ways to compute this metric. Multiple options are outlined a
5153
Below explains the way to compute this method when using Lotus RPC:
5254

5355
* This metric is computed based on a single sampling per SP per day. This works because:
54-
1. A sector that is faulted stays in the fault state for a duration that is a multiple of 24 hours given a sector's state transitions in and out of faulted state happens during the providing dealine for the sector.
56+
1. A sector that is faulted stays in the fault state for a duration that is a multiple of 24 hours given a sector's state transitions in and out of faulted state happens during the proving deadline for the sector.
5557
2. New sectors in a given day may get missed until the next day, but sectors aren't a highly transient resource flipping into and out of existence. Since sectors tend to have a lifespan of months or years, not counting them on their first day isn't a significant impact on the metric over time.
5658
* `Number of Active Sectors` is computed by getting the SP's Raw Power ([StateMinerPower](https://lotus.filecoin.io/reference/lotus/state/#stateminerpower)) divided by the SP's sector size ([StateMinerInfo](https://lotus.filecoin.io/reference/lotus/state/#stateminerinfo))).
5759
* `Number of Faulted Sectors` is computed by daily querying for the [`StateMinerFaults`](https://lotus.filecoin.io/reference/lotus/state/#stateminerfaults) for each SP with sectors.
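
The arithmetic in the bullets above can be sketched as follows. This is a minimal sketch, assuming the raw power, sector size, and faulted-sector count have already been fetched via `StateMinerPower`, `StateMinerInfo`, and `StateMinerFaults`; no RPC calls are made here. The function names are hypothetical, and the final ratio is an assumption, since this hunk does not show how the two counts are combined into the rate.

```python
def active_sector_count(raw_byte_power: int, sector_size: int) -> int:
    """Number of Active Sectors = SP raw power divided by SP sector size."""
    if sector_size <= 0:
        raise ValueError("sector_size must be positive")
    return raw_byte_power // sector_size


def sector_health_rate(active_sectors: int, faulted_sectors: int) -> float:
    """Share of active sectors that are not faulted on the sampled day.

    ASSUMPTION: rate = (active - faulted) / active. The diff shown here
    does not state the formula, so treat this as illustrative only.
    """
    if active_sectors <= 0:
        raise ValueError("active_sectors must be positive")
    return (active_sectors - faulted_sectors) / active_sectors
```

For example, an SP with 100 sectors of 32 GiB each and 5 faulted sectors would score 0.95 under this assumed formula, which sits exactly at the placeholder "(TBD) Warm" threshold.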
