38 changes: 19 additions & 19 deletions docs/croissant-spec-draft.md
@@ -70,9 +70,9 @@ Croissant is designed to be modular and extensible. One such extension is the Cr

**Croissant dataset**: A dataset that comes with a description in the Croissant format. Note that the Croissant description of a dataset does not generally contain the actual data of the dataset (with the exception of small examples or enumerations). The data itself is contained in separate files, referenced by the Croissant dataset description.

**Data record**: A granular part of a dataset, such as an image, text file, or a row in a table.
**FileObject**: A granular part of a dataset, such as an image, text file, archive file, or a row in a table.

**Recordset**: A set of homogeneous data records, such as a collection of images, text files, or all the rows in a table.
**RecordSet**: A set of structured data records obtained from one or more data sources (typically a file or set of files), such as a collection of images, text files, or all the rows in a table.

## Format Example

@@ -183,7 +183,7 @@ Furthermore, we can describe the structure and the data types in the data using
- the hash of the image, extracted from its filename
- the date the image was taken, extracted from the metadata CSV file

The [RecordSets](#recordsets) section explains how to define recordsets and fields, as well as extract, transform and join their data.
The [RecordSets](#recordsets) section explains how to define RecordSets and Fields, as well as extract, transform and join their data.

## Prerequisites

@@ -534,7 +534,7 @@ Datasets may change over time. Versioning is hence important to enable reproduci
Croissant datasets are versioned using the `version` property defined in [schema.org](http://schema.org). The recommended versioning scheme to use for datasets is `MAJOR.MINOR.PATCH`, following [Semantic Versioning 2.0.0](https://semver.org/spec/v2.0.0.html). More specifically:

- If the `PATCH` version is incremented, the data remains the same, although it might be serialized differently or is packaged using different file formats.
- If the `MINOR` version is incremented, the existing data is the same and can still be retrieved as it: there might be additional data (e.g. new fields, new RecordSet), or even new records in existing RecordSets, as long as the old recordSets can still be retrieved (eg: the new records are added to a different Split).
- If the `MINOR` version is incremented, the existing data is the same and can still be retrieved as is: there might be additional data (e.g. new fields, new RecordSets), or even new records in existing RecordSets, as long as the old RecordSets can still be retrieved (e.g., the new records are added to a different Split).
- If the `MAJOR` version is incremented, the existing data has been changed (edited, removed or shuffled records across splits), or extended in a way which doesn't allow for easy access to the data as it was at the previous version.
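
The three rules above can be illustrated with a minimal Python sketch that reports which component changed between two version strings. The helper names `parse_version` and `classify_bump` and the `EXPECTATIONS` table are hypothetical and not part of Croissant or any library. The sketch assumes plain `MAJOR.MINOR.PATCH` strings and does not handle Semantic Versioning pre-release or build suffixes.

```python
from typing import Tuple

# Hypothetical summary of the rules listed above, keyed by the bumped component.
EXPECTATIONS = {
    "PATCH": "data unchanged; serialization or packaging may differ",
    "MINOR": "existing data still retrievable as is; data may have been added",
    "MAJOR": "existing data changed, or old version not easily retrievable",
}


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a plain 'MAJOR.MINOR.PATCH' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def classify_bump(old: str, new: str) -> str:
    """Return 'MAJOR', 'MINOR' or 'PATCH' for the highest component that changed."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        raise ValueError(f"{new} is not newer than {old}")
    if new_v[0] != old_v[0]:
        return "MAJOR"
    if new_v[1] != old_v[1]:
        return "MINOR"
    return "PATCH"


bump = classify_bump("1.2.0", "1.3.0")
print(bump, "->", EXPECTATIONS[bump])
```

A consumer that pins a dataset to `1.2.0` could use such a check to decide whether an upgrade to `1.3.0` is expected to preserve the records it already relies on.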

#### Checksums
@@ -719,7 +719,7 @@ A `FileSet` is a set of files located in a container, which can be an archive `F
</thead>
<tr>
<td>containedIn</td>
<td>Reference</td>
<td>FileObject or FileSet <code>@id</code></td>
<td>MANY</td>
<td>The source of data for the <code>FileSet</code>, e.g., an archive. If multiple values are provided for <code>containedIn</code>, then the union of their contents is taken (e.g., this can be used to combine files from multiple archives).</td>
</tr>
@@ -871,7 +871,7 @@ In addition to `Field`s, RecordSet also supports defining a `key` for the record

### Field

A `Field` is part of a `RecordSet`. It may represent a column of a table, or a nested data structure.
A `Field` represents one or more properties of a `RecordSet`, such as a column of a table, Exif data, or a nested data structure.

`Field` is a subclass of [sc:Intangible](https://schema.org/Intangible). It defines the following additional properties:

@@ -917,21 +917,21 @@ A `Field` is part of a `RecordSet`. It may represent a column of a table, or a n
</tr>
<tr>
<td>references</td>
<td>Reference</td>
<td>Field <code>@id</code></td>
<td>MANY</td>
<td>Another <code>Field</code> of another <code>RecordSet</code> that this field references. This is the equivalent of a foreign key reference in a relational database.</td>
<td>A list of references to other <code>RecordSet</code> <code>Field</code>s. This is the equivalent of a foreign key reference in a relational database. Missing or circular references should result in an error.</td>
</tr>
<tr>
<td>subField</td>
<td>Field</td>
<td>Field <code>@id</code></td>
<td>MANY</td>
<td>Another <code>Field</code> that is nested inside this one.</td>
<td>A list of references to <code>Field</code>s that are nested within this one. Missing or circular references should result in an error.</td>
</tr>
<tr>
<td>parentField</td>
<td>Reference</td>
<td>Field <code>@id</code></td>
<td>MANY</td>
<td>A special case of <code>SubField</code> that should be hidden because it references a <code>Field</code> that already appears in the <code>RecordSet</code>.</td>
<td>A list of references to one or more <code>Field</code>s. A special case of <code>subField</code> that should be hidden because it references a <code>Field</code> that already appears in the <code>RecordSet</code>. Missing or circular references should result in a warning.</td>
</tr>
<tr>
<td>annotation</td>
@@ -1026,21 +1026,21 @@ The ratings `RecordSet` above corresponds to a CSV table, declared elsewhere as
</thead>
<tr>
<td>fileObject</td>
<td>Reference</td>
<td>FileObject <code>@id</code></td>
<td>ONE</td>
<td>The name of the referenced <code>FileObject</code> source of the data.</td>
<td>The id of the <code>FileObject</code> source of the data.</td>
</tr>
<tr>
<td>fileSet</td>
<td>Reference</td>
<td>FileSet <code>@id</code></td>
<td>ONE</td>
<td>The name of the referenced <code>FileSet</code> source of the data.</td>
<td>The id of the <code>FileSet</code> source of the data.</td>
</tr>
<tr>
<td>recordSet</td>
<td>Reference</td>
<td>RecordSet <code>@id</code></td>
<td>ONE</td>
<td>The name of the referenced <code>RecordSet</code> source.</td>
<td>The id of the referenced <code>RecordSet</code> source.</td>
</tr>
<tr>
<td>extract</td>
@@ -1371,7 +1371,7 @@ If the example values cannot easily be provided directly within the Croissant de

### Joins

Croissant provides a simple mechanism to create a "foreign key" reference between fields of recordsets. The property `references` of `RecordSet` means that values in the `Field` that contains the reference are taken from the values of the target `Field`. The target is generally the key of the target `RecordSet`.
Croissant provides a simple mechanism to create a "foreign key" reference between fields of RecordSets. The property `references` of `Field` means that values in the `Field` that contains the reference are taken from the values of the target `Field`. The target is generally the key of the target `RecordSet`.
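
The foreign-key semantics can be sketched in Python as a check that every value of the referencing field also occurs in the target field. This is only an illustration under stated assumptions: the records are modeled as plain dictionaries, the sample rows are made up, and the helper `find_dangling_references` is hypothetical rather than part of any Croissant library.

```python
from typing import Any, Dict, Iterable, List, Set


def find_dangling_references(
    referencing_values: Iterable[Any], target_values: Iterable[Any]
) -> Set[Any]:
    """Return referencing values that do not occur among the target values."""
    return set(referencing_values) - set(target_values)


# Made-up rows standing in for two RecordSets (illustrative data only).
movies: List[Dict[str, Any]] = [
    {"movie_id": 1, "title": "A"},
    {"movie_id": 2, "title": "B"},
    {"movie_id": 3, "title": "C"},
]
ratings: List[Dict[str, Any]] = [
    {"movie_id": 1, "rating": 4.5},
    {"movie_id": 2, "rating": 3.0},
    {"movie_id": 4, "rating": 5.0},  # no movie with this id exists
]

dangling = find_dangling_references(
    (row["movie_id"] for row in ratings),
    (row["movie_id"] for row in movies),
)
print(sorted(dangling))  # [4]
```

A validator built along these lines could report a non-empty result as a data-quality problem, since a referencing value with no match in the target field cannot be joined.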

For example, the `ratings` `RecordSet` below has a `movie_id` field that references the `movies` `RecordSet`.
