Conversation

@nicholas-maselli

What this does

When combining more than one video file, the offset incorrectly resets back to `latest_duration`. This is a fairly major bug because it prevents training on the affected files at all.

This happens because, when rotating to a new file, we set the offset to 0 and then advance `current_offset` by `src_duration`, but outside of the loop we reset `current_offset` back to the large `latest_duration` value again.

This PR removes `latest_duration`, adds an episode offset variable so `current_offset` is saved cleanly, and removes the dead code around `latest_duration` when saving the data. A sketch of the before/after behavior follows.
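To make the failure mode concrete, here is a minimal sketch of the offset bookkeeping, using simplified, hypothetical names (`src_durations`, `max_file_duration`, `rotate_to_new_file`) rather than the real aggregate.py code:

```python
# Buggy flow: the running offset is re-derived from a global total that
# ignores file rotations.
current_offset = videos_idx[key]["latest_duration"]  # total across ALL files
for src_duration in src_durations:
    if current_offset + src_duration > max_file_duration:
        rotate_to_new_file()
        current_offset = 0  # correct: a new file starts at 0...
    current_offset += src_duration
# ...but the reset is lost: the next episode reads "latest_duration" again,
# which still counts every file ever written, so the offset jumps past the
# end of the new file.

# Fixed flow: persist the offset on every iteration and resume from it.
current_offset = videos_idx[key]["episode_offset"]
for src_duration in src_durations:
    if current_offset + src_duration > max_file_duration:
        rotate_to_new_file()
        current_offset = 0
    current_offset += src_duration
    videos_idx[key]["episode_offset"] = current_offset  # survives rotation
```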

How it was tested

This code was tested on a set of 25 datasets with 10 episodes per dataset and 4 cameras active. Before the changes, the metadata indices were resetting to 1000+ after the first episode of the second video file.

I can provide the datasets on request, but they are very large for GitHub.

How to checkout & try? (for the reviewer)

You can use the dataset aggregator to test. I wrote a custom script for this (I can add it if you want); a rough sketch is below.
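For reference, a minimal sketch of such a merge-and-read harness, assuming the `aggregate_datasets` helper from lerobot/datasets/aggregate.py (the exact signature may differ) and hypothetical repo ids:

```python
from lerobot.datasets.aggregate import aggregate_datasets
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Hypothetical repo ids; substitute your own datasets.
repo_ids = [f"user/pick_place_{i:02d}" for i in range(25)]
aggregate_datasets(repo_ids, "user/pick_place_merged")

# Walk every frame of the merged dataset. Before the fix, this failed as
# soon as an episode landed in the second video file, because the stored
# offsets pointed past that file's end.
merged = LeRobotDataset("user/pick_place_merged")
for i in range(len(merged)):
    _ = merged[i]
```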

@michel-aractingi michel-aractingi self-requested a review October 16, 2025 07:54
@michel-aractingi michel-aractingi self-assigned this Oct 16, 2025
@michel-aractingi michel-aractingi added bug and dataset labels Oct 16, 2025
@michel-aractingi
Collaborator

Thanks @nicholas-maselli! I'll test it out.

I have a script to stress-test the merging: https://gist.github.com/michel-aractingi/d9e7a41f1738bf976518c7abdd63fe20

Maybe it wasn't comprehensive enough to catch this bug.

@fracapuano
Collaborator

Hey @michel-aractingi 👋 I actually remember testing this myself back when I worked on #1264. I had a fairly comprehensive set of tests to verify aggregation of multiple data & video files, and the tests check things frame by frame, so I am fairly surprised to read this @nicholas-maselli, but I am sure something is off. Could you perhaps provide an MRE of the issue so that we can see from there what exactly is off? Thank you so much for looking into this btw 🤗

@nicholas-maselli
Author

> Hey @michel-aractingi 👋 I actually remember testing this myself back when I worked on #1264. I had a fairly comprehensive set of tests to verify aggregation of multiple data & video files, and the tests check things frame by frame, so I am fairly surprised to read this @nicholas-maselli, but I am sure something is off. Could you perhaps provide an MRE of the issue so that we can see from there what exactly is off? Thank you so much for looking into this btw 🤗

Is there somewhere I can send you the series of datasets I tested this on?

They are fairly large, with custom cameras/robots, so it's possible there were some corner cases as a result of that. But I'll be using this code a lot, so I'll be able to catch any oddities I see and let you guys know.

```python
)
current_offset += src_duration

videos_idx[key]["episode_offset"] = current_offset
```
@nicholas-maselli (Author) Oct 16, 2025

The current offset must be saved for each iteration of the loop in the function above this.

Currently the current offset is not saved after it resets to 0; it continues on from the total frame count in the next iteration of the outer loop, which makes the bug tricky to spot.

@brysonjones
Contributor

Thanks for working on this, @nicholas-maselli. I've noticed a very similar problem.

I'm not sure if I'm hitting a different edge case than you're describing here, but on both main and this branch, I get the following error when training after merging:

```
Traceback (most recent call last):
  File "/pkg/modal/_runtime/container_io_manager.py", line 778, in handle_input_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 243, in run_input_sync
    res = io_context.call_finalized_function()
  File "/pkg/modal/_runtime/container_io_manager.py", line 197, in call_finalized_function
    res = self.finalized_function.callable(*args, **kwargs)
  File "/root/manipulation/main.py", line 306, in train_policy
    batch = next(dl_iter)
  File "/root/lerobot/datasets/utils.py", line 898, in cycle
    yield next(iterator)
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 734, in __next__
    data = self._next_data()
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1489, in _next_data
    return self._process_data(data, worker_id)
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1551, in _process_data
    data.reraise()
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/_utils.py", line 769, in reraise
    raise exception
IndexError: Caught IndexError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/lerobot/datasets/lerobot_dataset.py", line 874, in __getitem__
    video_frames = self._query_videos(query_timestamps, ep_idx)
  File "/root/lerobot/datasets/lerobot_dataset.py", line 851, in _query_videos
    frames = decode_video_frames(video_path, shifted_query_ts, self.tolerance_s, self.video_backend)
  File "/root/lerobot/datasets/video_utils.py", line 69, in decode_video_frames
    return decode_video_frames_torchcodec(video_path, timestamps, tolerance_s)
  File "/root/lerobot/datasets/video_utils.py", line 259, in decode_video_frames_torchcodec
    frames_batch = decoder.get_frames_at(indices=frame_indices)
  File "/lerobot/.venv/lib/python3.10/site-packages/torchcodec/decoders/_video_decoder.py", line 235, in get_frames_at
    data, pts_seconds, duration_seconds = core.get_frames_at_indices(
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/_ops.py", line 829, in __call__
    return self._op(*args, **kwargs)
IndexError: Invalid frame index=391218 for streamIndex=0; must be less than 370967
```

As a test here, I've merged 2 datasets, one with ~1600 episodes and one with ~30 or so.

Hoping to explore and solve this issue with you!

data_idx = {"chunk": 0, "file": 0}
videos_idx = {
key: {"chunk": 0, "file": 0, "latest_duration": 0, "episode_duration": 0} for key in video_keys
key: {"chunk": 0, "file": 0, "episode_duration": 0, "episode_offset": 0, "src_to_offset": {} } for key in video_keys
Collaborator

Suggested change

```diff
-key: {"chunk": 0, "file": 0, "episode_duration": 0, "episode_offset": 0, "src_to_offset": {} } for key in video_keys
+key: {"chunk": 0, "file": 0, "episode_duration": 0, "episode_offset": 0, "src_to_offset": {} }
+for key in video_keys
```

To fix the Quality test.

@michel-aractingi
Collaborator

Hey @brysonjones, could you confirm if this PR fixes your issue?

@brysonjones
Contributor

brysonjones commented Oct 19, 2025

> Hey @brysonjones, could you confirm if this PR fixes your issue?

Hey @michel-aractingi, I just tested and got this same error with the most recent changes:

```
IndexError: Caught IndexError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/lerobot/datasets/lerobot_dataset.py", line 881, in __getitem__
    video_frames = self._query_videos(query_timestamps, ep_idx)
  File "/root/lerobot/datasets/lerobot_dataset.py", line 858, in _query_videos
    frames = decode_video_frames(video_path, shifted_query_ts, self.tolerance_s, self.video_backend)
  File "/root/lerobot/datasets/video_utils.py", line 69, in decode_video_frames
    return decode_video_frames_torchcodec(video_path, timestamps, tolerance_s)
  File "/root/lerobot/datasets/video_utils.py", line 259, in decode_video_frames_torchcodec
    frames_batch = decoder.get_frames_at(indices=frame_indices)
  File "/lerobot/.venv/lib/python3.10/site-packages/torchcodec/decoders/_video_decoder.py", line 235, in get_frames_at
    data, pts_seconds, duration_seconds = core.get_frames_at_indices(
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/_ops.py", line 829, in __call__
    return self._op(*args, **kwargs)
IndexError: Invalid frame index=391218 for streamIndex=0; must be less than 370967
```

To give a bit more color about the test here:

  • I have 2 datasets being merged
  • 1 large dataset (~1600 episodes) that was originally recorded in the V21 dataset format and converted to V30 using the script in the repo -- I can confirm that the dataset is valid and usable after conversion
  • The second dataset is a smaller one (~30 episodes) recorded natively in V30
  • The test I am running here is whether I can properly extend my existing large dataset after converting it to V30, to make sure I don't have any problems going forward
  • The dataset has 4 total camera views, with 3 simple webcams using the OpenCV config and 1 RealSense

@michel-aractingi
Collaborator

Thanks @brysonjones

I'll try to reproduce your issue by merging a converted dataset with a recorded dataset.
We changed aggregate.py in the dataset tools PR #2100, so maybe it's a subtle bug in the converted dataset.

If you can provide any minimal way to reproduce the buggy behaviour you're getting, it would be a great help, @brysonjones @nicholas-maselli

@brysonjones
Contributor

> Thanks @brysonjones
>
> I'll try to reproduce your issue by merging a converted dataset with a recorded dataset. We changed aggregate.py in the dataset tools PR #2100, so maybe it's a subtle bug in the converted dataset.
>
> If you can provide any minimal way to reproduce the buggy behaviour you're getting, it would be a great help, @brysonjones @nicholas-maselli

@michel-aractingi I tried again using the most recent conversion script updates and unfortunately saw no change. I'll continue to run some experiments and let you know what I find. If there's additional info in the stack trace or logging that may be helpful, let me know and I can get that!

@nicholas-maselli
Author

> Thanks @brysonjones
> I'll try to reproduce your issue by merging a converted dataset with a recorded dataset. We changed aggregate.py in the dataset tools PR #2100, so maybe it's a subtle bug in the converted dataset.
> If you can provide any minimal way to reproduce the buggy behaviour you're getting, it would be a great help, @brysonjones @nicholas-maselli

> @michel-aractingi I tried again using the most recent conversion script updates and unfortunately saw no change. I'll continue to run some experiments and let you know what I find. If there's additional info in the stack trace or logging that may be helpful, let me know and I can get that!

My apologies, I was away. Let me send you guys the test files I've been using (custom robot with 4 cameras, 25 datasets, 10 episodes per set).

```
IndexError: Caught IndexError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/lerobot/datasets/lerobot_dataset.py", line 881, in __getitem__
    video_frames = self._query_videos(query_timestamps, ep_idx)
  File "/root/lerobot/datasets/lerobot_dataset.py", line 858, in _query_videos
    frames = decode_video_frames(video_path, shifted_query_ts, self.tolerance_s, self.video_backend)
  File "/root/lerobot/datasets/video_utils.py", line 69, in decode_video_frames
    return decode_video_frames_torchcodec(video_path, timestamps, tolerance_s)
  File "/root/lerobot/datasets/video_utils.py", line 259, in decode_video_frames_torchcodec
    frames_batch = decoder.get_frames_at(indices=frame_indices)
  File "/lerobot/.venv/lib/python3.10/site-packages/torchcodec/decoders/_video_decoder.py", line 235, in get_frames_at
    data, pts_seconds, duration_seconds = core.get_frames_at_indices(
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/_ops.py", line 829, in __call__
    return self._op(*args, **kwargs)
IndexError: Invalid frame index=391218 for streamIndex=0; must be less than 370967
```

This error you posted was exactly the same error I was getting.

The reason is that when the video rotates to file-001, the episode metadata starts at frame 0 again (which is correct), but after the first episode in that new video file the frame index jumps back to 1000+, because the current offset is only temporarily reset to 0, not for the entire loop.
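Reverse-engineering the numbers in the traceback above illustrates this. A back-of-the-envelope sketch, assuming 30 fps (my assumption) and that query timestamps are shifted by the stored offset before being converted to frame indices:

```python
# Hypothetical numbers reconstructed from the traceback above.
fps = 30                    # assumed recording rate
frames_in_file = 370_967    # what the decoder reports for file-001
stale_offset_s = 13_040.6   # offset still accumulated across ALL files
query_ts = 0.0              # first frame of an episode in file-001

frame_index = round((query_ts + stale_offset_s) * fps)
print(frame_index)          # 391218 -> decoder raises IndexError, since
                            # valid indices for this file stop at 370967
```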

@nicholas-maselli
Author

This was my explanation from the code review, but I'll add it to the conversation here:

"
Current offset must be saved for each iteration of the loop in the function above this.

Currently the current offset is not saved after it resets to 0, it continues on from the total frame count in the next iteration of the outer loop which makes the bug tricky to spot
"

@brysonjones
Contributor

@nicholas-maselli @michel-aractingi I've continued to do more experimentation, and the way I've been able to recreate this error is by taking a large V21 dataset, converting it to V30, and then merging it with another V30 dataset.

Something in this process is causing the merge to be corrupted: the frame indices are incorrect, and there appear to be some video decoding problems as well.

Unfortunately, this update doesn't seem to fix that issue from what I see.
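One way to localize this kind of corruption, sketched below, is to compare each episode's stored video offsets against the real duration of the file it points to. This is only a sketch: the metadata field names (`chunk_index`, `file_index`, `to_timestamp`), the `ds.meta.episodes` access, and the video path layout are my assumptions about the v3.0 format and may need adjusting:

```python
from torchcodec.decoders import VideoDecoder

def check_video_offsets(ds, key):
    """Flag episodes whose stored offsets point past the end of their video file.

    Assumes v3.0-style episode metadata fields and path layout; adjust both
    to your actual schema.
    """
    durations = {}  # cache: video path -> duration in seconds
    for ep in ds.meta.episodes:
        chunk = ep[f"videos/{key}/chunk_index"]
        file = ep[f"videos/{key}/file_index"]
        to_ts = ep[f"videos/{key}/to_timestamp"]
        path = ds.root / "videos" / key / f"chunk-{chunk:03d}" / f"file-{file:03d}.mp4"
        if path not in durations:
            durations[path] = VideoDecoder(str(path)).metadata.duration_seconds
        if to_ts > durations[path] + 1e-3:
            print(f"episode {ep['episode_index']}: to_timestamp {to_ts:.2f}s "
                  f"exceeds {path.name} duration {durations[path]:.2f}s")
```

Running this per camera key on the merged dataset should show exactly which episodes (and which destination file) carry out-of-range offsets, which would help tell a conversion bug apart from an aggregation bug.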
