Feature/tokenizer pipeline #2
base: main
Conversation
Pull Request Overview
This PR introduces a tokenizer pipeline for text-to-tokenized sign language video translation using NVIDIA Cosmos Tokenizers. The pipeline handles tokenization of videos from the PHOENIX-2014-T dataset and includes infrastructure for logging and metrics tracking (a minimal usage sketch follows the list below).
Key changes include:
- Implementation of sample tokenization scripts for testing individual video sequences
- Full dataset tokenization pipeline for processing the PHOENIX-2014-T dataset
- Project configuration and CI/CD setup for the tokenizer pipeline module
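For orientation, here is a minimal sketch of the core tokenization step these scripts perform. The encode() call follows the public Cosmos-Tokenizer API, the checkpoint path mirrors the layout used in this PR's scripts, and the input tensor is random stand-in data rather than real PHOENIX frames:

```python
import torch
from cosmos_tokenizer.video_lib import CausalVideoTokenizer

# Discrete-video (DV) tokenizer; checkpoint layout as in this PR's scripts.
encoder = CausalVideoTokenizer(
    checkpoint_enc="checkpoints/DV8x16x16/encoder.jit",
    device="cuda",
    dtype="bfloat16",
)

# Video as a (batch, channels, frames, height, width) tensor scaled to [-1, 1].
video = torch.randn(1, 3, 17, 256, 256, dtype=torch.bfloat16, device="cuda")

# DV models return discrete token indices plus their codebook embeddings.
indices, codes = encoder.encode(video)
print(indices.shape)
```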
Reviewed Changes
Copilot reviewed 20 out of 22 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tokenize_sample.py | Sample tokenization script that processes single PHOENIX sequences with both CV8x8x8 and DV8x16x16 models |
| tokenize_dataset.py | Full dataset tokenization pipeline that processes all PHOENIX-2014-T splits and saves discrete tokens |
| text_to_tokenized_video/tokenizer_pipeline/ | Organized tokenizer pipeline module with duplicate sample and dataset scripts |
| text_to_tokenized_video/tokenizer_pipeline/pyproject.toml | Project configuration for the tokenizer pipeline package |
| Various requirements/metadata files | Package dependencies and metadata for NVIDIA Cosmos integration |
tokenize_dataset.py (Outdated)
```python
for model_name in model_names:
    encoder_ckpt = f"checkpoints/{model_name}/encoder.jit"
    decoder_ckpt = f"checkpoints/{model_name}/decoder.jit"

    tokenizer = CausalVideoTokenizer(
        checkpoint_enc=encoder_ckpt,
        checkpoint_dec=decoder_ckpt,
        device="cuda",
        dtype="bfloat16",
    )
```
Copilot AI (Oct 8, 2025)
Creating a new tokenizer instance for each model inside the sequence loop is inefficient. The tokenizer should be created once per model outside the sequence loop and reused for all sequences.
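A sketch of the suggested restructuring, using the names from the diff above; `sequences` and `tokenize_sequence` are hypothetical stand-ins for the script's per-sequence logic:

```python
for model_name in model_names:
    # Build each tokenizer once per model...
    tokenizer = CausalVideoTokenizer(
        checkpoint_enc=f"checkpoints/{model_name}/encoder.jit",
        checkpoint_dec=f"checkpoints/{model_name}/decoder.jit",
        device="cuda",
        dtype="bfloat16",
    )
    # ...then reuse it for every sequence instead of reconstructing it per item.
    for sequence in sequences:  # hypothetical iterable of PHOENIX sequences
        tokenize_sequence(tokenizer, sequence)  # hypothetical per-sequence worker
```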
tokenize_sample.py (Outdated)
```python
for model_name in model_names:
    encoder_ckpt = f"checkpoints/{model_name}/encoder.jit"
    decoder_ckpt = f"checkpoints/{model_name}/decoder.jit"

    print(f"\n=== Running model {model_name} ===")
    t0 = time.time()

    tokenizer = CausalVideoTokenizer(
        checkpoint_enc=encoder_ckpt,
        checkpoint_dec=decoder_ckpt,
        device="cuda",
        dtype="bfloat16",  # change to float32 if GPU complains
    )
```
Copilot AI (Oct 8, 2025)
The tokenizer is recreated for each model iteration, which is inefficient. Consider creating tokenizers once and reusing them, or moving the initialization outside the timing measurement if you need fresh instances.
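A sketch of the second option (fresh construction kept, but moved outside the timed region), again using the diff's names; `video` is a hypothetical preloaded (1, 3, T, H, W) tensor in [-1, 1]:

```python
# Construct before timing so t1 - t0 covers only the encode call.
tokenizer = CausalVideoTokenizer(
    checkpoint_enc=f"checkpoints/{model_name}/encoder.jit",
    checkpoint_dec=f"checkpoints/{model_name}/decoder.jit",
    device="cuda",
    dtype="bfloat16",
)

print(f"\n=== Running model {model_name} ===")
t0 = time.time()
indices, codes = tokenizer.encode(video)  # video: hypothetical preloaded tensor
t1 = time.time()
print(f"encode took {t1 - t0:.2f}s")
```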
You are pushing a directory that contains this repo, not the repo itself. Notice that all your files are under text_to_tokenized_video/tokenizer_pipeline, so there are duplicates. You also have tokenize_sample and tokenize_dataset twice.
My recommendation (a command sketch follows this list):
- git clone into a fresh directory.
- Copy the files you need into that directory.
- Push and open a new pull request.
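A minimal shell sketch of that cleanup; the repository URL, branch name, and file paths are placeholders, not values from this PR:

```bash
# Clone a fresh copy of the repo (URL is a placeholder)
git clone git@github.com:<owner>/<repo>.git fresh-clone

# Copy only the project files you need (paths are placeholders)
cp tokenize_sample.py tokenize_dataset.py pyproject.toml fresh-clone/

cd fresh-clone
git checkout -b feature/tokenizer-pipeline
git add .
git commit -m "Add tokenizer pipeline"
git push -u origin feature/tokenizer-pipeline  # then open a new PR from this branch
```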
.gitignore (Outdated)
```
output/
runs/
wandb/
cosmos_output/
```
Add *.egg-info to .gitignore and remove the egg-info artifacts already tracked in git.
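A sketch of that change, assuming the egg-info directories sit at the repo root and are already tracked:

```bash
# Ignore setuptools build metadata from now on
echo "*.egg-info/" >> .gitignore

# Remove already-tracked egg-info directories from the index (files stay on disk)
git rm -r --cached *.egg-info
git commit -m "Stop tracking egg-info directories"
```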
Force-pushed from d7ec0d4 to 6434b4e
Please commit only our project files to git, and use Cosmos as a CLI.
Missing README changes; remove the checkpoint from git (add instructions for how to clone it).
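One way the README could document checkpoint retrieval instead of committing the weights; the Hugging Face repo id and local path are assumptions based on NVIDIA's published Cosmos tokenizer releases, not something confirmed in this PR:

```python
# Fetch a tokenizer checkpoint at setup time rather than tracking it in git.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="nvidia/Cosmos-Tokenizer-DV8x16x16",  # assumed repo id
    local_dir="checkpoints/DV8x16x16",            # assumed to match the scripts' paths
)
```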