This `Data` object is used for all non-leaf nodes in UnixFS.

For files that are comprised of more than a single block, the 'Type' field will be set to 'File', the 'filesize' field will be set to the total number of bytes in the file (not the graph structure) represented by this node, and 'blocksizes' will contain a list of the filesizes of each child node.

This data is serialized and placed inside the 'Data' field of the outer merkledag protobuf, which also contains the actual links to the child nodes of this object.

### IPLD `dag-pb`

A very important spec for UnixFS is the `dag-pb` IPLD spec: https://ipld.io/specs/codecs/dag-pb/spec/
```protobuf
message PBLink {
  // binary CID (with no multibase prefix) of the target object
  optional bytes Hash = 1;

  // UTF-8 string name
  optional string Name = 2;

  // cumulative size of target object
  optional uint64 Tsize = 3; // also known as dagsize
}

message PBNode {
  // refs to other objects
  repeated PBLink Links = 2;

  // opaque user data
  optional bytes Data = 1;
}
```

The two schemas play together, and it is important to understand their different roles:

- `dag-pb`, also named `PBNode`, is the "outer" protobuf message; it is the one you decode first. It contains the list of links.
- `Message` is the "inner" protobuf message. It is decoded by first decoding the `PBNode` object and then decoding `Message` from the `PBNode.Data` field; it contains all of the remaining information.

This means we deal with protobuf inside protobuf.
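
To make the two-stage decoding concrete, here is a minimal, dependency-free Python sketch. It hand-rolls the protobuf wire format (real implementations use generated protobuf code), and the helper names (`read_varint`, `iter_fields`, `decode_pbnode`, `decode_message`) are ours, not from any implementation. The inner field numbers (`Type = 1`, `Data = 2`, `filesize = 3`, `blocksizes = 4`) are those of the UnixFS `Data` message.

```python
def read_varint(buf, i):
    """Decode an unsigned LEB128 varint starting at buf[i]."""
    shift, value = 0, 0
    while True:
        b = buf[i]
        value |= (b & 0x7F) << shift
        i += 1
        if not b & 0x80:
            return value, i
        shift += 7

def iter_fields(buf):
    """Yield (field_number, wire_type, value) for every field in a protobuf message."""
    i = 0
    while i < len(buf):
        key, i = read_varint(buf, i)
        field, wire = key >> 3, key & 0x07
        if wire == 0:                        # varint
            value, i = read_varint(buf, i)
        elif wire == 2:                      # length-delimited
            length, i = read_varint(buf, i)
            value, i = buf[i:i + length], i + length
        else:
            raise ValueError("unsupported wire type %d" % wire)
        yield field, wire, value

def decode_pbnode(block):
    """Outer message: PBNode, with Links = 2 (repeated PBLink) and Data = 1 (bytes)."""
    links, data = [], b""
    for field, _, value in iter_fields(block):
        if field == 2:
            names = {1: "Hash", 2: "Name", 3: "Tsize"}
            links.append({names[f]: v for f, _, v in iter_fields(value)})
        elif field == 1:
            data = value
    return links, data

def decode_message(data):
    """Inner message: the UnixFS Data message kept in PBNode.Data."""
    msg = {"Type": None, "Data": b"", "filesize": None, "blocksizes": []}
    for field, wire, value in iter_fields(data):
        if field == 1:
            msg["Type"] = value
        elif field == 2:
            msg["Data"] = value
        elif field == 3:
            msg["filesize"] = value
        elif field == 4:                     # repeated uint64, unpacked in proto2
            msg["blocksizes"].append(value)
    return msg

# Protobuf inside protobuf: decode the outer node, then the inner message.
# links, data = decode_pbnode(block_bytes)
# message = decode_message(data)
```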
## How to read a File

First you get some CID out-of-band (OOB); this is what we will be trying to decode.

This CID MUST include:
1. A [multicodec](https://github.com/multiformats/multicodec), also called a codec.
2. A [Multihash](https://github.com/multiformats/multihash) (used to specify a hashing algorithm, some hashing parameters, and a digest).
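
As an illustration, a binary CIDv1 can be split into those two parts with a few varint reads. This sketch reuses `read_varint` from above and ignores CIDv0 and multibase prefixes:

```python
RAW, DAG_PB = 0x55, 0x70    # multicodec codes for the two codecs UnixFS uses
SHA2_256 = 0x12             # multihash code for sha2-256

def parse_cidv1(cid):
    """Split a binary CIDv1 into (codec, multihash code, digest)."""
    version, i = read_varint(cid, 0)
    assert version == 1, "this sketch only handles CIDv1"
    codec, i = read_varint(cid, i)
    hash_code, i = read_varint(cid, i)      # start of the multihash
    digest_len, i = read_varint(cid, i)
    return codec, hash_code, cid[i:i + digest_len]
```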
### Get the block

The first step is to get the block; by "get the block" we mean getting the actual bytes which, when hashed with the function named in the multihash, give you the same digest back.

This step can be achieved in many ways (bitswap, downloading a CAR file, ...); all we care about is that you got the bytes and that you confirmed they are correct using the hashing function indicated in the CID.

This step is repeated when downloading any block, and thus is implicitly assumed to be done whenever a block is downloaded.
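
For example, assuming the common case of a sha2-256 multihash, verification is just re-hashing the received bytes and comparing digests (a sketch; `SHA2_256` is the constant defined above):

```python
import hashlib

def verify_block(block, hash_code, digest):
    """Check that the received bytes match the digest carried in the CID."""
    if hash_code != SHA2_256:
        raise NotImplementedError("this sketch only handles sha2-256")
    return hashlib.sha256(block).digest() == digest
```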
### Start decoding the bytes

With UnixFS we deal with two codecs, which are decoded differently:

- `Raw`, single-block files
- `Dag-PB`, possibly multi-block files (a single block is limited to 2MiB, but it may point to children, joining them together)
#### `Raw` files

The simplest file is a `Raw` file.

They can be recognised because their CID uses the `Raw` codec.

Their content is purely the block body.

They never have any children, and thus are also known as single-block files.

Their size (both `dagsize` and `blocksize`) is the length of the block body.

##### `Raw` Example

Let's build a `Raw` file whose content is `test`.
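
A sketch of how that block and its CID come together (the byte values are the multicodec and multihash codes used above; each fits in a single-byte varint):

```python
import hashlib

body = b"test"                  # the block body is exactly the file content
digest = hashlib.sha256(body).digest()
multihash = bytes([SHA2_256, len(digest)]) + digest  # code, length, digest
cid = bytes([1, RAW]) + multihash                    # version 1, raw codec
# Both dagsize and blocksize are len(body) == 4.
```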
#### `Dag-PB` files

Those nodes support many different types (found in `decodeMessage(PBNode.Data).Type`); `File` is the only type allowed when decoding files.

For files comprised of a single block, the 'Type' field will be set to 'File', 'filesize' will be set to the total number of bytes in the file, and the file data will be stored in the 'Data' field.

##### The sister-lists `PBNode.Links` and `decodeMessage(PBNode.Data).blocksizes`
The sister-lists are the key to why `dag-pb` is important.

They allow us to concatenate files together.

Linked files may be loaded recursively with the same process.

Children may be any file (so `Dag-PB` where the type is `File`, or `Raw`).

For example, take this pseudo-JSON block:
```json
{
  "Links": [{"Hash": "Qmfoo"}, {"Hash": "Qmbar"}],
  "Data": {
    "blocksizes": [20, 30]
  }
}
```

This indicates that this file is the concatenation of the `Qmfoo` and `Qmbar` files.

So when reading this file, the `blocksizes` array gives us the size in bytes of each child file; each entry in `blocksizes` gives the size of the link at the same index in `Links`.

This allows fast indexing into the file. For example, if someone is trying to read bytes 25 to 35, we can compute an offset list by summing all previous entries in `blocksizes`, then do a search to find which indexes contain the range we are interested in.

For example, here the offset list would be `[0, 20]`, and thus we know we only need to download `Qmbar` to get the range we are interested in.

If `blocksizes` and `Links` are not of the same length, the block is invalid.
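
A sketch of that computation on the example above (children of 20 and 30 bytes, reading bytes 25 to 35):

```python
blocksizes = [20, 30]

# Offset of each child = sum of all previous blocksizes -> [0, 20].
offsets = [sum(blocksizes[:i]) for i in range(len(blocksizes))]

def children_for_range(start, end):
    """Indexes of the children overlapping the byte range [start, end)."""
    return [i for i, off in enumerate(offsets)
            if off < end and off + blocksizes[i] > start]

print(children_for_range(25, 35))  # [1] -> only Qmbar needs to be fetched
```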
##### `decodeMessage(PBNode.Data).Data`

This field is an array of bytes; it is also file content, and it comes before the content of the links.

This must be taken into account when doing offset calculations (the length of the `Data.Data` field acts as the zeroth element of `blocksizes` when computing offsets).
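
Continuing the sketch above, a non-empty `Data` field simply shifts every child offset by its length:

```python
data_prefix = b"..."   # hypothetical decodeMessage(PBNode.Data).Data content
offsets = [len(data_prefix) + sum(blocksizes[:i])
           for i in range(len(blocksizes))]
```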
### Metadata

UnixFS currently supports two optional metadata fields; for the `mtime` field the following rules apply:
- When no `mtime` is specified or the resulting `UnixTime` is negative: implementations must assume `0`/`1970-01-01T00:00:00Z` (note that such values are not merely academic: e.g. the OpenVMS epoch is `1858-11-17T00:00:00Z`).
- When the resulting `UnixTime` is larger than the target's range (e.g. 32bit vs 64bit mismatch), implementations must assume the highest possible value in the target's range (in most cases that would be `2038-01-19T03:14:07Z`).
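
A sketch of these two rules for a target with 32-bit signed seconds (the helper name is ours):

```python
INT32_MAX = 2**31 - 1           # seconds for 2038-01-19T03:14:07Z

def clamp_mtime(seconds):
    """Apply the rules above: missing/negative -> 0, overflow -> target max."""
    if seconds is None or seconds < 0:
        return 0                # 1970-01-01T00:00:00Z
    return min(seconds, INT32_MAX)
```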
### Deduplication and inlining

Where the file data is small it would normally be stored in the `Data` field of the UnixFS `File` node.

To aid in deduplication of data even for small files, file data can be stored in a separate node linked to from the `File` node, in order for the data to have a constant [CID] regardless of the metadata associated with it.

As a further optimization, if the `File` node's serialized size is small, it may be inlined into its v1 [CID] by using the [`identity`](https://github.com/multiformats/multicodec/blob/master/table.csv) [multihash].
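
A sketch of such an inline CID, reusing the constants from earlier (the identity multihash code is `0x00`; this assumes the serialized node is shorter than 128 bytes so its length fits in a one-byte varint):

```python
IDENTITY = 0x00                 # multihash code: the "digest" is the input bytes

def inline_cid(serialized_node):
    """Embed a small serialized node directly inside a v1 CID."""
    assert len(serialized_node) < 128, "varint length must fit in one byte"
    mh = bytes([IDENTITY, len(serialized_node)]) + serialized_node
    return bytes([1, DAG_PB]) + mh   # version 1, dag-pb codec
```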
## Importing

Importing a file into UnixFS is split up into two parts. The first is chunking, the second is layout.

### Chunking

Chunking has two main parameters: chunking strategy and leaf format.

Leaf format should always be set to 'raw'; this is mainly configurable for backwards compatibility with earlier formats that used a UnixFS Data object with type 'Raw'. Raw leaves means that the nodes output from chunking will be just raw data from the file, with a CID type of 'raw'.

Chunking strategy currently has two different options: 'fixed size' and 'rabin'. Fixed size chunking will chunk the input data into pieces of a given size. Rabin chunking will chunk the input data using Rabin fingerprinting to determine the boundaries between chunks.
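
Fixed size chunking is simple enough to sketch in a few lines (the 256 KiB default below is an assumption for illustration, not something this spec mandates):

```python
def fixed_size_chunks(stream, size=256 * 1024):
    """Yield consecutive chunks of `size` bytes; the last one may be shorter."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            return
        yield chunk             # each chunk becomes one raw leaf

# Usage: for chunk in fixed_size_chunks(open("file.bin", "rb")): ...
```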
### Layout

Layout defines the shape of the tree that gets built from the chunks of the input file.

There are currently two options for layout: balanced and trickle.
Additionally, a 'max width' must be specified. The default max width is 174.

The balanced layout creates a balanced tree of width 'max width'. The tree is formed by taking up to 'max width' chunks from the chunk stream and creating a UnixFS file node that links to all of them. This is repeated until 'max width' UnixFS file nodes are created, at which point a UnixFS file node is created to hold all of those nodes, recursively. The root node of the resultant tree is returned as the handle to the newly imported file.

If there is only a single chunk, no intermediate UnixFS file nodes are created, and the single chunk is returned as the handle to the file.
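
A sketch of the balanced layout (the toy `Node` type stands in for a UnixFS file node linking to its children; trickle is not shown):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    children: List["Node"] = field(default_factory=list)
    data: bytes = b""           # leaf nodes carry chunk data

def balanced_layout(leaves, max_width=174):
    """Group nodes under parents, max_width at a time, until one root remains."""
    if len(leaves) == 1:
        return leaves[0]        # single chunk: the chunk itself is the file
    level = leaves
    while len(level) > 1:
        level = [Node(children=level[i:i + max_width])
                 for i in range(0, len(level), max_width)]
    return level[0]             # root of the balanced tree
```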
## Exporting

To read the file data out of the UnixFS graph, perform an in-order traversal, emitting the data contained in each of the leaves.
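
A sketch of that traversal, using the toy `Node` from the layout sketch above:

```python
def export(node):
    """In-order traversal yielding the data held in each leaf."""
    if not node.children:
        yield node.data
    for child in node.children:
        yield from export(child)

# b"".join(export(root)) reconstructs the original file bytes.
```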