Feature/tokenizer pipeline #2
base: main
Conversation
Pull Request Overview
This PR introduces a tokenizer pipeline for text-to-tokenized sign language video translation using NVIDIA Cosmos Tokenizers. The pipeline handles tokenization of videos from the PHOENIX-2014-T dataset and includes infrastructure for logging and metrics tracking (a minimal usage sketch follows the list below).
Key changes include:
- Implementation of sample tokenization scripts for testing individual video sequences
- Full dataset tokenization pipeline for processing the PHOENIX-2014-T dataset
- Project configuration and CI/CD setup for the tokenizer pipeline module
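For orientation, here is a minimal sketch of the core tokenization step these scripts perform. The encode() call follows the public Cosmos-Tokenizer API, the checkpoint path mirrors the layout used in this PR's scripts, and the input tensor is random stand-in data rather than real PHOENIX frames:

```python
import torch
from cosmos_tokenizer.video_lib import CausalVideoTokenizer

# Discrete-video (DV) tokenizer; checkpoint layout as in this PR's scripts.
encoder = CausalVideoTokenizer(
    checkpoint_enc="checkpoints/DV8x16x16/encoder.jit",
    device="cuda",
    dtype="bfloat16",
)

# Video as a (batch, channels, frames, height, width) tensor scaled to [-1, 1].
video = torch.randn(1, 3, 17, 256, 256, dtype=torch.bfloat16, device="cuda")

# DV models return discrete token indices plus their codebook embeddings.
indices, codes = encoder.encode(video)
print(indices.shape)
```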
Reviewed Changes
Copilot reviewed 20 out of 22 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tokenize_sample.py | Sample tokenization script that processes single PHOENIX sequences with both CV8x8x8 and DV8x16x16 models |
| tokenize_dataset.py | Full dataset tokenization pipeline that processes all PHOENIX-2014-T splits and saves discrete tokens |
| text_to_tokenized_video/tokenizer_pipeline/ | Organized tokenizer pipeline module with duplicate sample and dataset scripts |
| text_to_tokenized_video/tokenizer_pipeline/pyproject.toml | Project configuration for the tokenizer pipeline package |
| Various requirements/metadata files | Package dependencies and metadata for NVIDIA Cosmos integration |
tokenize_dataset.py (Outdated)
```python
for model_name in model_names:
    encoder_ckpt = f"checkpoints/{model_name}/encoder.jit"
    decoder_ckpt = f"checkpoints/{model_name}/decoder.jit"

    tokenizer = CausalVideoTokenizer(
        checkpoint_enc=encoder_ckpt,
        checkpoint_dec=decoder_ckpt,
        device="cuda",
        dtype="bfloat16",
    )
```
Copilot AI (Oct 8, 2025)
Creating a new tokenizer instance for each model inside the sequence loop is inefficient. The tokenizer should be created once per model outside the sequence loop and reused for all sequences.
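A sketch of the suggested restructuring, using the names from the diff above; `sequences` and `tokenize_sequence` are hypothetical stand-ins for the script's per-sequence logic:

```python
for model_name in model_names:
    # Build each tokenizer once per model...
    tokenizer = CausalVideoTokenizer(
        checkpoint_enc=f"checkpoints/{model_name}/encoder.jit",
        checkpoint_dec=f"checkpoints/{model_name}/decoder.jit",
        device="cuda",
        dtype="bfloat16",
    )
    # ...then reuse it for every sequence instead of reconstructing it per item.
    for sequence in sequences:  # hypothetical iterable of PHOENIX sequences
        tokenize_sequence(tokenizer, sequence)  # hypothetical per-sequence worker
```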
tokenize_sample.py (Outdated)
```python
for model_name in model_names:
    encoder_ckpt = f"checkpoints/{model_name}/encoder.jit"
    decoder_ckpt = f"checkpoints/{model_name}/decoder.jit"

    print(f"\n=== Running model {model_name} ===")
    t0 = time.time()

    tokenizer = CausalVideoTokenizer(
        checkpoint_enc=encoder_ckpt,
        checkpoint_dec=decoder_ckpt,
        device="cuda",
        dtype="bfloat16",  # change to float32 if GPU complains
    )
```
Copilot AI (Oct 8, 2025)
The tokenizer is recreated for each model iteration, which is inefficient. Consider creating tokenizers once and reusing them, or moving the initialization outside the timing measurement if you need fresh instances.
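A sketch of the second option (fresh construction kept, but moved outside the timed region), again using the diff's names; `video` is a hypothetical preloaded (1, 3, T, H, W) tensor in [-1, 1]:

```python
# Construct before timing so t1 - t0 covers only the encode call.
tokenizer = CausalVideoTokenizer(
    checkpoint_enc=f"checkpoints/{model_name}/encoder.jit",
    checkpoint_dec=f"checkpoints/{model_name}/decoder.jit",
    device="cuda",
    dtype="bfloat16",
)

print(f"\n=== Running model {model_name} ===")
t0 = time.time()
indices, codes = tokenizer.encode(video)  # video: hypothetical preloaded tensor
t1 = time.time()
print(f"encode took {t1 - t0:.2f}s")
```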
You are pushing a directory that contains this repo, not the repo itself. Notice that all your files are under text_to_tokenized_video/tokenizer_pipeline, so there are duplicates. You also have tokenize_sample and tokenize_dataset twice.
My recommendation (a command sketch follows this list):
- git clone into a fresh directory.
- Copy the files you need into that directory.
- Push and open a new pull request.
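A minimal shell sketch of that cleanup; the repository URL, branch name, and file paths are placeholders, not values from this PR:

```bash
# Clone a fresh copy of the repo (URL is a placeholder)
git clone git@github.com:<owner>/<repo>.git fresh-clone

# Copy only the project files you need (paths are placeholders)
cp tokenize_sample.py tokenize_dataset.py pyproject.toml fresh-clone/

cd fresh-clone
git checkout -b feature/tokenizer-pipeline
git add .
git commit -m "Add tokenizer pipeline"
git push -u origin feature/tokenizer-pipeline  # then open a new PR from this branch
```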
.gitignore (Outdated)
```
output/
runs/
wandb/
cosmos_output/
```
Add *.egg-info to .gitignore and remove the egg-info artifacts already tracked in git.
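A sketch of that change, assuming the egg-info directories sit at the repo root and are already tracked:

```bash
# Ignore setuptools build metadata from now on
echo "*.egg-info/" >> .gitignore

# Remove already-tracked egg-info directories from the index (files stay on disk)
git rm -r --cached *.egg-info
git commit -m "Stop tracking egg-info directories"
```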
Force-pushed from d7ec0d4 to 6434b4e
Please commit only our project files to git, and use Cosmos as a CLI.
Missing README changes; remove the checkpoint from git (add instructions for how to clone it).
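One way the README could document checkpoint retrieval instead of committing the weights; the Hugging Face repo id and local path are assumptions based on NVIDIA's published Cosmos tokenizer releases, not something confirmed in this PR:

```python
# Fetch a tokenizer checkpoint at setup time rather than tracking it in git.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="nvidia/Cosmos-Tokenizer-DV8x16x16",  # assumed repo id
    local_dir="checkpoints/DV8x16x16",            # assumed to match the scripts' paths
)
```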