@adgad adgad commented Jul 23, 2025

This is a draft PR for what I think the Clips Schema should be (based on earlier work #86 and what is currently in Workarounds). While the externals are still subject to change when we start working on cp-content-pipeline, I think the transit-tree properties are in a good place, and can be merged to help with the publishing/migration aspects.

There are a few considerations / changes to how it currently works:

  1. Rename dataLayout to layoutWidth for consistency with other nodes (this will require a change in cp-content-pipeline / spark / possibly next-home-page / possibly ft-app)
    a. ClipSet only allows a limited number of layouts, I assume that is intentional?
  2. There is a challenge around transcripts, which are currently modelled as a nested body. The current component in cp-content-pipeline-ui expects another RichText graphql type (which has a graphql-ish data structure with fields like raw, structured, references). I'm not really sure how we model that in content-tree, or whether we should. If we need content-tree to be different to cp-content-pipeline (i.e. maintain a workaround), that would also mean the UI component itself isn't really transferable.
    a. DECISION IRL - we should not replicate the graphql structure in content-tree, but instead make cp-content-pipeline work with this somehow. Some ideas below, but still a bit hazy.
  3. Similarly, the transcript is currently a body containing only text. Our current Body node in content-tree doesn't allow Text as a top-level node - should it? Or should it be a separate thing?
    a. note - live events also do this, but i've been ignoring them for now!
    b. DECISION IRL - Add Text to the allowed bodyNodes (will do this separately)
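
To make the rename in point 1 concrete, here is a minimal sketch of how a consumer might migrate an existing node from dataLayout to layoutWidth. The LegacyClipSet and ClipSet shapes below are trimmed-down assumptions for illustration (the real nodes carry more fields), and renameDataLayout is a hypothetical helper, not something in this PR.

```typescript
// Assumed, simplified shapes: only the fields relevant to the rename are shown.
interface LegacyClipSet {
  type: "clip-set";
  id: string;
  dataLayout?: string; // old property name used by the workaround
}

interface ClipSet {
  type: "clip-set";
  id: string;
  layoutWidth?: string; // new name, consistent with other nodes
}

// Hypothetical helper: copies the node, moving dataLayout to layoutWidth.
// Nodes without a dataLayout are returned without a layoutWidth key.
function renameDataLayout(node: LegacyClipSet): ClipSet {
  const { dataLayout, ...rest } = node;
  return dataLayout === undefined ? rest : { ...rest, layoutWidth: dataLayout };
}

// Usage: "full-grid" is only an example value, not a statement of the allowed layouts.
const migrated = renameDataLayout({ type: "clip-set", id: "example-id", dataLayout: "full-grid" });
console.log(migrated);
```

Consumers such as cp-content-pipeline or spark would presumably need an equivalent change wherever they read the old property.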

Appendix - Nested Bodies

As mentioned, there is a challenge around the mismatch between how cp-content-pipeline represents nested bodies (GraphQL RichText type), and how it is in content tree (Body as an attribute). I have a couple of ideas around how we can make cp-content-pipeline work with the Clip Transcript, but someone that knows it better than me might have more thoughts!

  1. Change cp-content-pipeline to return body instead of the RichText type:

bodyTree in cp-content-pipeline-api more accurately reflects content-tree
⚠️ would be a breaking api change
❌ i think it works okay for this case, because a transcript is simple. I don't know if it works as well for nested bodies that might have references (e.g. if we had the same logic for a CCC fallback that includes images). Maybe there's something around merging the references with the top level ones?? but sounds complicated

  2. cp-content-pipeline-ui has a Clip workaround that expects a different format:
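// Shared component: takes the content-tree ClipSet shape directly.
export default function Clip(props: ContentTree.full.ClipSet) {
...
}

// Workaround wrapper (a named export, since a module can only have one default export).
// It unwraps the GraphQL RichText transcript into the content-tree body that Clip expects.
export function ClipWithRichTextTranscript(props: GraphQLProps) {
    const transcript = props.transcript.structured.tree;
    return <Clip {...props} transcript={transcript} />;
}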

✅ not a breaking change maybe?
✅ still have a shared Clip component that can be used in e.g. Spark Preview that expects content-tree format
🤔 how does it work with references?
🤔 not sure if it's logically the right place??

Testing Notes

I have tested the transformer with an article that contains a clip that is currently failing:

export CONTENT_API_READ_KEY=<redacted>
node libraries/from-bodyxml/validate.js ae37def3-7f15-46d4-8e8d-b3246ad079b4

@adgad adgad marked this pull request as ready for review August 5, 2025 11:07
@adgad adgad requested review from a team as code owners August 5, 2025 11:07
@epavlova epavlova left a comment

thought (non-blocking): The from-bodyxml transformer currently supports transforming Text nodes as children of Body; you can try with 3e535c8f-a0db-58ba-b797-3933bc45187c as an example.

@epavlova epavlova requested a review from a team August 6, 2025 07:04
### `ClipSet`

```ts
interface ClipSet extends Node {
```

@epavlova epavlova Aug 6, 2025

question (non-blocking): How feasible would it be to support alternativeText and alternativeImage now or in a future iteration? I’m wondering if we could use the poster image as an alternative.

Collaborator Author

So based on what's in Spark Clips:
description is the equivalent of alternativeText - "Describe this clip (for those who cannot see it)"

poster should be usable as an alternativeImage. I'm now wondering why it's a string and not ImageSet (as it is in the CAPI response) 🤔

@umbobabo umbobabo Aug 21, 2025

> How feasible would it be to support alternativeText and alternativeImage now or in a future iteration?

Is this for distributable reasons?

If the answer is yes and the objective is to produce a valid HTML tag that is renderable in every context, then with Clipset data it's possible to create a basic HTML video tag that is playable by every browser.
The following is a simplified example extracted from this article. It could be simplified further by using just one of the sources as src instead of multiple source elements:

<video
    poster="<POSTER_URL>"
   >
    <source id="video-source-0-daecfa57-a12a-468b-8045-ad32cfa79b3b"
        src="https://spark-clips-prod.s3.eu-west-1.amazonaws.com/optimised-media-files/16984229396750/640x360.mp4"
        type="video/mp4">
    <source id="video-source-1-daecfa57-a12a-468b-8045-ad32cfa79b3b"
        src="https://spark-clips-prod.s3.eu-west-1.amazonaws.com/optimised-media-files/16984229396750/1280x720.mp4"
        type="video/mp4">
    <source id="video-source-2-daecfa57-a12a-468b-8045-ad32cfa79b3b"
        src="https://spark-clips-prod.s3.eu-west-1.amazonaws.com/optimised-media-files/16984229396750/1920x1080.mp4"
        type="video/mp4">
    <source id="video-source-3-daecfa57-a12a-468b-8045-ad32cfa79b3b"
        src="https://spark-clips-prod.s3.eu-west-1.amazonaws.com/optimised-media-files/16984229396750/0x0.mp3"
        type="audio/mpeg">
    <track label="English" kind="captions" srclang="en" src="https://next-media-api.ft.com/clips/captions/32065539">
</video>
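
As a rough sketch of how that tag could be generated from Clipset-like data, something like the following might work. The ClipLike and ClipSource shapes, the field names (url, mediaType, captionsUrl) and the renderVideoTag helper are all assumptions made for illustration; they are not the schema proposed in this PR.

```typescript
// Assumed input shapes, loosely modelled on the data visible in the example above.
interface ClipSource {
  url: string;
  mediaType: string; // e.g. "video/mp4"
}

interface ClipLike {
  poster?: string;
  sources: ClipSource[];
  captionsUrl?: string;
}

// Escape characters that are unsafe inside a double-quoted HTML attribute.
function escapeAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: builds a plain <video> element with one <source> per rendition
// and an optional English captions <track>, mirroring the hand-written example.
function renderVideoTag(clip: ClipLike): string {
  const posterAttr = clip.poster ? ` poster="${escapeAttr(clip.poster)}"` : "";
  const sources = clip.sources.map(
    (s) => `  <source src="${escapeAttr(s.url)}" type="${escapeAttr(s.mediaType)}">`
  );
  const track = clip.captionsUrl
    ? [`  <track label="English" kind="captions" srclang="en" src="${escapeAttr(clip.captionsUrl)}">`]
    : [];
  return [`<video${posterAttr}>`, ...sources, ...track, "</video>"].join("\n");
}

// Usage with placeholder URLs.
console.log(
  renderVideoTag({
    poster: "https://example.com/poster.jpg",
    sources: [{ url: "https://example.com/640x360.mp4", mediaType: "video/mp4" }],
  })
);
```

The captions label and language are hard-coded here only because the example above does the same; a real mapping would presumably take them from the data.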

Contributor

@epavlova I suspect you may want to use XML for the bodyXML field. In that case you should be able to easily map the data from the Clipset model into an XML format, something like the following (Not one of our clips):

<video id="abc123" xmlns="https://example.com/video/1.0">
  <title>Building a Birdhouse</title>
  <description>Step-by-step guide.</description>
  <language>en</language> <!-- BCP 47 -->
  <published>2025-08-01T10:30:00Z</published> <!-- RFC 3339 -->
  <duration>PT4M12S</duration> <!-- ISO 8601 -->
  <people>
    <creator role="host">Pat Lee</creator>
    <contributor role="editor">R. Singh</contributor>
  </people>

  <content>
    <container>mp4</container>
    <videoCodec>h264</videoCodec>
    <audioCodec>aac</audioCodec>
    <width>1920</width>
    <height>1080</height>
    <frameRate>29.97</frameRate>
    <bitrate unit="bps">3500000</bitrate>
    <aspectRatio>16:9</aspectRatio>
  </content>

  <files>
    <file role="main" bytes="184563210" checksum="sha256:...">
      <url>https://cdn.example.com/v/abc123/master.mp4</url>
    </file>
    <file role="1080p" bitrate="3500000">
      <url>https://cdn.example.com/v/abc123/1080p.mp4</url>
    </file>
    <file role="720p" bitrate="1800000">
      <url>https://cdn.example.com/v/abc123/720p.mp4</url>
    </file>
  </files>

  <tracks>
    <captions lang="en" kind="subtitles" format="vtt">
      <url>https://cdn.example.com/v/abc123/en.vtt</url>
    </captions>
    <audio lang="en" channels="2"/>
  </tracks>

  <chapters>
    <chapter start="PT0S" title="Intro"/>
    <chapter start="PT1M10S" title="Tools"/>
    <chapter start="PT2M45S" title="Assembly"/>
  </chapters>

  <thumbnails>
    <image width="1280" height="720">https://cdn.example.com/v/abc123/cover.jpg</image>
    <sprite columns="10" rows="10">https://cdn.example.com/v/abc123/sprite.jpg</sprite>
  </thumbnails>

  <rights>
    <license>CC-BY-4.0</license>
    <drm scheme="fairplay" keyId="..."/>
  </rights>

  <tags>
    <tag>DIY</tag><tag>woodwork</tag>
  </tags>
</video>

@umbobabo

Some background on Transcript.

In the transcript field from CAPI, Spark stores an HTML fragment

(Clipset id dfdbeb54-22aa-468d-a6b4-f7ded02befd5)

<p>Are you getting sacked for telling the truth, home secretary? </p>
<p>[ALARM BLARING IN DISTANCE] </p>
<p></p>
<p>Thank you. </p>
<p>Morning. </p>
<p>Are you going to be the next foreign secretary? </p>
<p>The new foreign secretary, David Cameron? </p>

I believe (though Amir or Ash may know better) that this was the payload received from the tool that automatically generates the transcript, so the easiest option for Spark was to store the payload as it was.

To simplify rendering and do everything server side, we added the following steps:

This seems a convoluted solution, and tech debt we should repay by moving Transcript away from RichTextSource, which doesn't seem meant for this purpose.

I see some possible options:

  1. Spark/CAPI to store the transcript in a different format, e.g. a list of strings where each string is a paragraph, and cp-pipeline to handle that accordingly in the UI. This would require some work for all the DS teams and Platform review for the changes in cp-content-pipeline, however it seems the neatest solution - storing a string of HTML doesn't seem to me a good sustainable choice.
  2. We could try to go back to dangerouslySetInnerHTML. The string is trusted and should be renderable as-is without needing a parser. This will permit us to remove the transcript from RichText sources. This will generate a breaking change in the API and will need to be discussed with @apaleslimghost
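
For option 1, a minimal sketch of converting the stored fragment into a list of paragraph strings could look like this. It assumes the fragment only ever contains plain `<p>` elements with no attributes, nested markup or HTML entities (as in the sample above); transcriptToParagraphs is a hypothetical helper, and dropping empty paragraphs is a choice made here, not an agreed behaviour.

```typescript
// Hypothetical helper: extracts the text of each <p>...</p> in a transcript fragment.
// Assumes flat, attribute-free <p> elements; anything else would need a real HTML parser.
function transcriptToParagraphs(html: string): string[] {
  const paragraphs: string[] = [];
  const pattern = /<p>([\s\S]*?)<\/p>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    const text = (match[1] ?? "").trim();
    // Skip empty paragraphs such as <p></p>.
    if (text.length > 0) {
      paragraphs.push(text);
    }
  }
  return paragraphs;
}

// Usage with a few lines from the sample transcript.
const sample = [
  "<p>[ALARM BLARING IN DISTANCE] </p>",
  "<p></p>",
  "<p>Thank you. </p>",
].join("\n");
console.log(transcriptToParagraphs(sample));
```

If the empty paragraphs carry meaning (e.g. a pause between speakers), they would need to be kept rather than filtered out.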

@adgad adgad added this to the v1 milestone Sep 9, 2025
There are a few changes to the current ClipSet workaround in
cp-content-pipeline:

- dataLayout -> layoutWidth (for consistency with the other nodes)
- changes to how the Body for transcripts is represented (pending
  further discussion with CP)

This commit also updates the from-bodyxml transformer for clipset