feat: add ClipSet to schema #90

adgad · 2025-07-23T09:51:22Z

This is a draft PR for what I think the Clips Schema should be (based on earlier work #86 and what is currently in Workarounds). While the externals are still subject to change when we start working on cp-content-pipeline, I think the transit-tree properties are in a good place, and can be merged to help with the publishing/migration aspects.

There are a few considerations / changes to how it currently works:

1. Rename dataLayout to layoutWidth for consistency with other nodes (this will require a change in cp-content-pipeline / spark / possibly next-home-page / possibly ft-app)
   a. ClipSet only allows a limited number of layouts; I assume that is intentional?
2. There is a challenge around transcripts, which are currently modelled as a nested body. The current component in cp-content-pipeline-ui expects another RichText graphql type (which has a graphql-ish data structure with fields like raw, structured, references). I'm not really sure how we model that in content-tree, or whether we should. If we need content-tree to be different to cp-content-pipeline (i.e. maintain a workaround), that would also mean the UI component itself isn't really transferable.
   a. DECISION IRL - we should not replicate the graphql structure in content-tree, but instead make cp-content-pipeline work with this somehow. Some ideas below, but still a bit hazy.
3. Similarly, the transcript is currently a body with only text. Our current Body node in content-tree doesn't allow Text as a top-level node - should it? Or should it be a separate thing?
   a. note - live events also do this, but I've been ignoring them for now!
   b. DECISION IRL - Add Text to the allowed bodyNodes (will do this separately)
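
For consumers that still hold the workaround shape, the dataLayout to layoutWidth rename could be bridged with a small adapter. This is only a sketch: `renameDataLayout` is a hypothetical helper, and the node shape in the usage below is an assumption for illustration, not the final schema.

```typescript
// The three layouts ClipSet allows, as listed in this PR's README change.
type ClipSetLayoutWidth = "in-line" | "mid-grid" | "full-grid";

// Hypothetical helper: copies a workaround-style node, moving the value of
// `dataLayout` to `layoutWidth` and leaving every other property untouched.
function renameDataLayout<T extends { dataLayout: ClipSetLayoutWidth }>(
  node: T
): Omit<T, "dataLayout"> & { layoutWidth: ClipSetLayoutWidth } {
  const { dataLayout, ...rest } = node;
  return { ...rest, layoutWidth: dataLayout };
}

// Usage with an assumed, simplified node.
const migrated = renameDataLayout({
  type: "clip-set",
  dataLayout: "mid-grid",
  autoplay: false,
});
console.log(migrated.layoutWidth);
```

The adapter returns a new object rather than mutating its input, so it can sit at the boundary of whichever consumer migrates last.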

Appendix - Nested Bodies

As mentioned, there is a challenge around the mismatch between how cp-content-pipeline represents nested bodies (GraphQL RichText type), and how it is in content tree (Body as an attribute). I have a couple of ideas around how we can make cp-content-pipeline work with the Clip Transcript, but someone that knows it better than me might have more thoughts!

Change cp-content-pipeline to return body instead of the RichText type:

✅ bodyTree in cp-content-pipeline-api more accurately reflects content-tree
⚠️ would be a breaking api change
❌ I think it works okay for this case, because a transcript is simple. I don't know if it works as well for nested bodies that might have references (e.g. if we had the same logic for a CCC fallback that includes images). Maybe there's something around merging the references with the top-level ones? But that sounds complicated.

cp-content-pipeline-ui has a Clip workaround that expects a different format:

export default function Clip(props: ContentTree.full.ClipSet) {
...
}

export default function ClipWithRichTextTranscript(props: GraphQLProps) {
    // Unwrap the GraphQL RichText value and pass its tree on as the transcript
    const transcript = props.transcript.structured.tree;
    return <Clip {...props} transcript={transcript} />
}

✅ not a breaking change maybe?
✅ still have a shared Clip component that can be used in e.g. Spark Preview that expects content-tree format
🤔 how does it work with references?
🤔 not sure if it's logically the right place??

Testing Notes

I have tested the transformer with an article that contains a clip that is currently failing:

export CONTENT_API_READ_KEY=<redacted>
node libraries/from-bodyxml/validate.js ae37def3-7f15-46d4-8e8d-b3246ad079b4

epavlova

thought (non-blocking): The from-bodyxml transformer currently supports transforming Text nodes as children of Body; you can try with 3e535c8f-a0db-58ba-b797-3933bc45187c as an example.

epavlova · 2025-08-06T08:53:08Z

README.md

+### `ClipSet`
+
+```ts
+interface ClipSet extends Node {


question (non-blocking): How feasible would it be to support alternativeText and alternativeImage now or in a future iteration? I’m wondering if we could use the poster image as an alternative.

So based on what's in Spark Clips:
description is the equivalent of alternativeText - "Describe this clip (for those who cannot see it)"

poster should be usable as an alternativeImage. I'm now wondering why it's a string and not ImageSet (as it is in the CAPI response) 🤔

How feasible would it be to support alternativeText and alternativeImage now or in a future iteration?

Is this for distributable reasons?

If the answer is yes and the objective is to produce a valid HTML tag that is renderable in every context, then with ClipSet data it's possible to create a basic HTML video tag that is playable by every browser.
The following is a simplified example extracted from this article. It could be simplified even further by using a single source as src instead of multiple sources:

<video poster="<POSTER_URL>" >
  <source id="video-source-0-daecfa57-a12a-468b-8045-ad32cfa79b3b" src="https://spark-clips-prod.s3.eu-west-1.amazonaws.com/optimised-media-files/16984229396750/640x360.mp4" type="video/mp4">
  <source id="video-source-1-daecfa57-a12a-468b-8045-ad32cfa79b3b" src="https://spark-clips-prod.s3.eu-west-1.amazonaws.com/optimised-media-files/16984229396750/1280x720.mp4" type="video/mp4">
  <source id="video-source-2-daecfa57-a12a-468b-8045-ad32cfa79b3b" src="https://spark-clips-prod.s3.eu-west-1.amazonaws.com/optimised-media-files/16984229396750/1920x1080.mp4" type="video/mp4">
  <source id="video-source-3-daecfa57-a12a-468b-8045-ad32cfa79b3b" src="https://spark-clips-prod.s3.eu-west-1.amazonaws.com/optimised-media-files/16984229396750/0x0.mp3" type="audio/mpeg">
  <track label="English" kind="captions" srclang="en" src="https://next-media-api.ft.com/clips/captions/32065539">
</video>
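
Generating a tag like the one above from clip data could look roughly like this. It is a sketch under assumptions: `ClipLike`, `ClipSource` and `renderVideoTag` are hypothetical names, and the fields are a simplification rather than the ClipSet schema.

```typescript
// Hypothetical, simplified view of the clip data needed for a <video> tag.
interface ClipSource {
  url: string;
  mediaType: string; // e.g. "video/mp4" or "audio/mpeg"
}

interface ClipLike {
  poster?: string;
  sources: ClipSource[];
  captionsUrl?: string;
}

// Escape characters that are unsafe inside a double-quoted HTML attribute.
function escapeAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Build a plain <video> element with one <source> per rendition and an
// optional English captions <track>.
function renderVideoTag(clip: ClipLike): string {
  const poster = clip.poster ? ` poster="${escapeAttr(clip.poster)}"` : "";
  const sources = clip.sources.map(
    (s) => `  <source src="${escapeAttr(s.url)}" type="${escapeAttr(s.mediaType)}">`
  );
  const track = clip.captionsUrl
    ? [`  <track label="English" kind="captions" srclang="en" src="${escapeAttr(clip.captionsUrl)}">`]
    : [];
  return [`<video${poster}>`, ...sources, ...track, "</video>"].join("\n");
}

// Usage with placeholder URLs.
console.log(
  renderVideoTag({
    poster: "https://example.com/poster.jpg",
    sources: [{ url: "https://example.com/640x360.mp4", mediaType: "video/mp4" }],
    captionsUrl: "https://example.com/captions.vtt",
  })
);
```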

@epavlova I suspect you may want to use XML for the bodyXML field. In that case you should be able to easily map the data from the Clipset model into an XML format, something like the following (Not one of our clips):

<video id="abc123" xmlns="https://example.com/video/1.0">
  <title>Building a Birdhouse</title>
  <description>Step-by-step guide.</description>
  <language>en</language>
  <published>2025-08-01T10:30:00Z</published>
  <duration>PT4M12S</duration>
  <people>
    <creator role="host">Pat Lee</creator>
    <contributor role="editor">R. Singh</contributor>
  </people>
  <content>
    <container>mp4</container>
    <videoCodec>h264</videoCodec>
    <audioCodec>aac</audioCodec>
    <width>1920</width>
    <height>1080</height>
    <frameRate>29.97</frameRate>
    <bitrate unit="bps">3500000</bitrate>
    <aspectRatio>16:9</aspectRatio>
  </content>
  <files>
    <file role="main" bytes="184563210" checksum="sha256:...">
      <url>https://cdn.example.com/v/abc123/master.mp4</url>
    </file>
    <file role="1080p" bitrate="3500000">
      <url>https://cdn.example.com/v/abc123/1080p.mp4</url>
    </file>
    <file role="720p" bitrate="1800000">
      <url>https://cdn.example.com/v/abc123/720p.mp4</url>
    </file>
  </files>
  <tracks>
    <captions lang="en" kind="subtitles" format="vtt">
      <url>https://cdn.example.com/v/abc123/en.vtt</url>
    </captions>
    <audio lang="en" channels="2"/>
  </tracks>
  <chapters>
    <chapter start="PT0S" title="Intro"/>
    <chapter start="PT1M10S" title="Tools"/>
    <chapter start="PT2M45S" title="Assembly"/>
  </chapters>
  <thumbnails>
    <image width="1280" height="720">https://cdn.example.com/v/abc123/cover.jpg</image>
    <sprite columns="10" rows="10">https://cdn.example.com/v/abc123/sprite.jpg</sprite>
  </thumbnails>
  <rights>
    <license>CC-BY-4.0</license>
    <drm scheme="fairplay" keyId="..."/>
  </rights>
  <tags>
    <tag>DIY</tag>
    <tag>woodwork</tag>
  </tags>
</video>

umbobabo · 2025-08-22T13:29:15Z

Some background on Transcript.

In the transcript field from CAPI, Spark stores an HTML fragment

(Clipset id dfdbeb54-22aa-468d-a6b4-f7ded02befd5)

<p>Are you getting sacked for telling the truth, home secretary? </p>
<p>[ALARM BLARING IN DISTANCE] </p>
<p></p>
<p>Thank you. </p>
<p>Morning. </p>
<p>Are you going to be the next foreign secretary? </p>
<p>The new foreign secretary, David Cameron? </p>

I believe (though Amir or Ash may know better) that this is because it was the payload received from the tool that automatically generates the transcript, so the easiest option for Spark was to store the payload as it was.

To simplify the rendering and do everything server-side, we added the following steps:

1. Convert the HTML string into a RichText model via the internal bodyXMLToTree
2. Pass the new structure to a RichText component to render the list of paragraphs

This seems a convoluted solution, and tech debt we should repay by moving Transcript away from RichTextSource, which doesn't seem meant for this purpose.

I see some possible options:

1. Spark/CAPI store the transcript in a different format, e.g. a list of strings where each string is a paragraph, and cp-pipeline handles that accordingly in the UI. This would require some work from all the DS teams and a Platform review for the changes in cp-content-pipeline, but it seems the neatest solution - storing a string of HTML doesn't seem to me a good, sustainable choice.
2. We could try to go back to dangerouslySetInnerHTML. The string is trusted and should be renderable as-is without the need for a parser. This would let us remove the transcript from RichText sources. It would be a breaking change in the API and would need to be discussed with @apaleslimghost
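
To give a feel for the list-of-strings option, here is a rough sketch of turning the stored fragment into paragraphs. It assumes the fragment is always a flat run of `<p>` elements with no nested markup, as in the sample above; `transcriptToParagraphs` is a hypothetical helper, not existing Spark or cp-content-pipeline code, and a real migration would probably want a proper HTML parser.

```typescript
// Decode the few entities likely to appear in a plain-text transcript.
// &amp; is decoded last so that "&amp;lt;" becomes "&lt;" rather than "<".
function decodeEntities(text: string): string {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&");
}

// Extract the text of each <p>...</p>, trimming whitespace and dropping
// empty paragraphs.
function transcriptToParagraphs(html: string): string[] {
  const paragraphs: string[] = [];
  const pattern = /<p>([\s\S]*?)<\/p>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    const text = decodeEntities(match[1]).trim();
    if (text.length > 0) {
      paragraphs.push(text);
    }
  }
  return paragraphs;
}

// Usage with part of the sample transcript.
const sample = [
  "<p>[ALARM BLARING IN DISTANCE] </p>",
  "<p></p>",
  "<p>Thank you. </p>",
  "<p>Morning. </p>",
].join("\n");
console.log(transcriptToParagraphs(sample));
```

With a shape like this the UI could render one paragraph per string, with no RichText model in between.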

umbobabo · 2025-09-22T11:52:20Z

README.md

+	autoplay: boolean
+	loop: boolean
+	muted: boolean
+	layoutWidth: 'in-line' | 'mid-grid' | 'full-grid'


@adgad Consistent with what we have done in other PRs, perhaps we can use Extract from the Layout type here too
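
A sketch of what that suggestion might look like. The `LayoutWidth` union below is a stand-in written for this example (the real shared type and its members may differ), and `isClipSetLayoutWidth` is a hypothetical runtime guard.

```typescript
// Assumption: stand-in for the shared layout union used by other nodes.
type LayoutWidth =
  | "auto"
  | "in-line"
  | "inset-left"
  | "inset-right"
  | "full-bleed"
  | "full-grid"
  | "mid-grid"
  | "full-width";

// Extract keeps only the members of LayoutWidth that are also listed here,
// so a value dropped from or misspelt against LayoutWidth falls out of the
// resulting type instead of being silently accepted.
type ClipSetLayoutWidth = Extract<LayoutWidth, "in-line" | "mid-grid" | "full-grid">;

const clipSetLayoutWidths: ClipSetLayoutWidth[] = ["in-line", "mid-grid", "full-grid"];

// Hypothetical runtime guard mirroring the compile-time type.
function isClipSetLayoutWidth(value: string): value is ClipSetLayoutWidth {
  return (clipSetLayoutWidths as string[]).indexOf(value) !== -1;
}

console.log(isClipSetLayoutWidth("mid-grid"), isClipSetLayoutWidth("full-bleed"));
```

Compared with a hand-written union, this keeps ClipSet's allowed layouts tied to the shared definition.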

There are a few changes to the current ClipSet workaround in cp-content-pipeline:
- dataLayout -> layoutWidth (for consistency with the other nodes)
- changes to how the Body for transcripts is represented (pending further discussion with CP)

This commit also updates the from-bodyxml transformer for clipset

adgad force-pushed the clip-set-again branch from 2c8b233 to 47e08af Compare August 5, 2025 10:09

adgad marked this pull request as ready for review August 5, 2025 11:07

adgad requested review from a team as code owners August 5, 2025 11:07

epavlova reviewed Aug 6, 2025

epavlova requested a review from a team August 6, 2025 07:04

epavlova reviewed Aug 6, 2025

adgad added this to the v1 milestone Sep 9, 2025

adgad force-pushed the clip-set-again branch from 47e08af to 8d7b087 Compare September 9, 2025 15:08

umbobabo reviewed Sep 22, 2025

adgad force-pushed the clip-set-again branch from 8d7b087 to 0b9ebbf Compare October 14, 2025 09:46
