Conversation

@nicholas-maselli

What this does

When combining more than one video file, the offset incorrectly resets back to `latest_duration`. This is a fairly major bug because it prevents training on the affected files at all.

This happens because, when rotating to a new file, we set the offset to 0 and then advance `current_offset` by `src_duration`, but outside of the loop we reset `current_offset` back to the large `latest_duration` value again.

This PR removes `latest_duration`, adds an episode offset variable so `current_offset` is saved cleanly, and removes the dead code around `latest_duration` when saving the data. A sketch of the before/after behavior follows.
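To make the failure mode concrete, here is a minimal sketch of the offset bookkeeping, using simplified, hypothetical names (`src_durations`, `max_file_duration`, `rotate_to_new_file`) rather than the real aggregate.py code:

```python
# Buggy flow: the running offset is re-derived from a global total that
# ignores file rotations.
current_offset = videos_idx[key]["latest_duration"]  # total across ALL files
for src_duration in src_durations:
    if current_offset + src_duration > max_file_duration:
        rotate_to_new_file()
        current_offset = 0  # correct: a new file starts at 0...
    current_offset += src_duration
# ...but the reset is lost: the next episode reads "latest_duration" again,
# which still counts every file ever written, so the offset jumps past the
# end of the new file.

# Fixed flow: persist the offset on every iteration and resume from it.
current_offset = videos_idx[key]["episode_offset"]
for src_duration in src_durations:
    if current_offset + src_duration > max_file_duration:
        rotate_to_new_file()
        current_offset = 0
    current_offset += src_duration
    videos_idx[key]["episode_offset"] = current_offset  # survives rotation
```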

How it was tested

This code was tested on a set of 25 datasets with 10 episodes per dataset and 4 cameras active. Before the changes, the metadata indices were resetting to 1000+ after the first episode of the second video file.

I can provide the datasets on request, but they are very large for GitHub.

How to checkout & try? (for the reviewer)

You can use the dataset aggregator to test. I wrote a custom script for this (I can add it if you want); a rough sketch is below.
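For reference, a minimal sketch of such a merge-and-read harness, assuming the `aggregate_datasets` helper from lerobot/datasets/aggregate.py (the exact signature may differ) and hypothetical repo ids:

```python
from lerobot.datasets.aggregate import aggregate_datasets
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Hypothetical repo ids; substitute your own datasets.
repo_ids = [f"user/pick_place_{i:02d}" for i in range(25)]
aggregate_datasets(repo_ids, "user/pick_place_merged")

# Walk every frame of the merged dataset. Before the fix, this failed as
# soon as an episode landed in the second video file, because the stored
# offsets pointed past that file's end.
merged = LeRobotDataset("user/pick_place_merged")
for i in range(len(merged)):
    _ = merged[i]
```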

@michel-aractingi michel-aractingi self-requested a review October 16, 2025 07:54
@michel-aractingi michel-aractingi self-assigned this Oct 16, 2025
@michel-aractingi michel-aractingi added bug and dataset labels Oct 16, 2025
@michel-aractingi
Collaborator

Thanks @nicholas-maselli! I'll test it out.

I have a script to stress-test the merging: https://gist.github.com/michel-aractingi/d9e7a41f1738bf976518c7abdd63fe20

Maybe it wasn't comprehensive enough to catch this bug.

@fracapuano
Collaborator

Hey @michel-aractingi 👋 I actually remember testing this myself back when I worked on #1264. I had a fairly comprehensive set of tests to verify aggregation of multiple data & video files, and the tests check things frame by frame, so I am fairly surprised to read this @nicholas-maselli, but I am sure something is off. Could you perhaps provide an MRE of the issue so that we can see from there what exactly is off? Thank you so much for looking into this btw 🤗

@nicholas-maselli
Author

> Hey @michel-aractingi 👋 I actually remember testing this myself back when I worked on #1264. I had a fairly comprehensive set of tests to verify aggregation of multiple data & video files, and the tests check things frame by frame, so I am fairly surprised to read this @nicholas-maselli, but I am sure something is off. Could you perhaps provide an MRE of the issue so that we can see from there what exactly is off? Thank you so much for looking into this btw 🤗

Is there somewhere I can send you the series of datasets I tested this on?

They are fairly large, with custom cameras/robots, so it's possible there were some corner cases as a result of that. But I'll be using this code a lot, so I'll be able to catch any oddities I see and let you guys know.

```python
)
current_offset += src_duration

videos_idx[key]["episode_offset"] = current_offset
```
@nicholas-maselli (Author) Oct 16, 2025

The current offset must be saved for each iteration of the loop in the function above this.

Currently the current offset is not saved after it resets to 0; it continues on from the total frame count in the next iteration of the outer loop, which makes the bug tricky to spot.

@brysonjones
Contributor

Thanks for working on this, @nicholas-maselli. I've noticed a very similar problem.

I'm not sure if I'm hitting a different edge case than you're describing here, but on both main and this branch, I get the following error when training after merging:

```
Traceback (most recent call last):
  File "/pkg/modal/_runtime/container_io_manager.py", line 778, in handle_input_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 243, in run_input_sync
    res = io_context.call_finalized_function()
  File "/pkg/modal/_runtime/container_io_manager.py", line 197, in call_finalized_function
    res = self.finalized_function.callable(*args, **kwargs)
  File "/root/manipulation/main.py", line 306, in train_policy
    batch = next(dl_iter)
  File "/root/lerobot/datasets/utils.py", line 898, in cycle
    yield next(iterator)
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 734, in __next__
    data = self._next_data()
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1489, in _next_data
    return self._process_data(data, worker_id)
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1551, in _process_data
    data.reraise()
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/_utils.py", line 769, in reraise
    raise exception
IndexError: Caught IndexError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/lerobot/datasets/lerobot_dataset.py", line 874, in __getitem__
    video_frames = self._query_videos(query_timestamps, ep_idx)
  File "/root/lerobot/datasets/lerobot_dataset.py", line 851, in _query_videos
    frames = decode_video_frames(video_path, shifted_query_ts, self.tolerance_s, self.video_backend)
  File "/root/lerobot/datasets/video_utils.py", line 69, in decode_video_frames
    return decode_video_frames_torchcodec(video_path, timestamps, tolerance_s)
  File "/root/lerobot/datasets/video_utils.py", line 259, in decode_video_frames_torchcodec
    frames_batch = decoder.get_frames_at(indices=frame_indices)
  File "/lerobot/.venv/lib/python3.10/site-packages/torchcodec/decoders/_video_decoder.py", line 235, in get_frames_at
    data, pts_seconds, duration_seconds = core.get_frames_at_indices(
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/_ops.py", line 829, in __call__
    return self._op(*args, **kwargs)
IndexError: Invalid frame index=391218 for streamIndex=0; must be less than 370967
```

As a test here, I've merged 2 datasets, one with ~1600 episodes and one with ~30 or so.

Hoping to explore and solve this issue with you!

data_idx = {"chunk": 0, "file": 0}
videos_idx = {
key: {"chunk": 0, "file": 0, "latest_duration": 0, "episode_duration": 0} for key in video_keys
key: {"chunk": 0, "file": 0, "episode_duration": 0, "episode_offset": 0, "src_to_offset": {} } for key in video_keys
Collaborator

Suggested change

```diff
-key: {"chunk": 0, "file": 0, "episode_duration": 0, "episode_offset": 0, "src_to_offset": {} } for key in video_keys
+key: {"chunk": 0, "file": 0, "episode_duration": 0, "episode_offset": 0, "src_to_offset": {} }
+for key in video_keys
```

To fix the Quality test.

@michel-aractingi
Collaborator

Hey @brysonjones, could you confirm if this PR fixes your issue?

@brysonjones
Contributor

brysonjones commented Oct 19, 2025

> Hey @brysonjones, could you confirm if this PR fixes your issue?

Hey @michel-aractingi, I just tested and got this same error with the most recent changes:

```
IndexError: Caught IndexError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/lerobot/datasets/lerobot_dataset.py", line 881, in __getitem__
    video_frames = self._query_videos(query_timestamps, ep_idx)
  File "/root/lerobot/datasets/lerobot_dataset.py", line 858, in _query_videos
    frames = decode_video_frames(video_path, shifted_query_ts, self.tolerance_s, self.video_backend)
  File "/root/lerobot/datasets/video_utils.py", line 69, in decode_video_frames
    return decode_video_frames_torchcodec(video_path, timestamps, tolerance_s)
  File "/root/lerobot/datasets/video_utils.py", line 259, in decode_video_frames_torchcodec
    frames_batch = decoder.get_frames_at(indices=frame_indices)
  File "/lerobot/.venv/lib/python3.10/site-packages/torchcodec/decoders/_video_decoder.py", line 235, in get_frames_at
    data, pts_seconds, duration_seconds = core.get_frames_at_indices(
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/_ops.py", line 829, in __call__
    return self._op(*args, **kwargs)
IndexError: Invalid frame index=391218 for streamIndex=0; must be less than 370967
```

To give a bit more color about the test here:

  • I have 2 datasets being merged
  • 1 large dataset (~1600 episodes) that was originally recorded in the V21 dataset format and converted to V30 using the script in the repo -- I can confirm that the dataset is valid and usable after conversion
  • The second dataset is a smaller one (~30 episodes) recorded natively in V30
  • The test I am running here is whether I can properly extend my existing large dataset after converting it to V30, to make sure I don't have any problems going forward
  • The dataset has 4 total camera views, with 3 simple webcams using the OpenCV config and 1 RealSense

@michel-aractingi
Collaborator

Thanks @brysonjones

I'll try to reproduce your issue by merging a converted dataset with a recorded dataset.
We changed aggregate.py in the dataset tools PR #2100, so maybe it's a subtle bug in the converted dataset.

If you can provide any minimal way to reproduce the buggy behaviour you're getting, it would be a great help, @brysonjones @nicholas-maselli

@brysonjones
Contributor

> Thanks @brysonjones
>
> I'll try to reproduce your issue by merging a converted dataset with a recorded dataset. We changed aggregate.py in the dataset tools PR #2100, so maybe it's a subtle bug in the converted dataset.
>
> If you can provide any minimal way to reproduce the buggy behaviour you're getting, it would be a great help, @brysonjones @nicholas-maselli

@michel-aractingi I tried again using the most recent conversion script updates and unfortunately saw no change. I'll continue to run some experiments and let you know what I find. If there's additional info in the stack trace or logging that may be helpful, let me know and I can get that!

@nicholas-maselli
Author

> Thanks @brysonjones
> I'll try to reproduce your issue by merging a converted dataset with a recorded dataset. We changed aggregate.py in the dataset tools PR #2100, so maybe it's a subtle bug in the converted dataset.
> If you can provide any minimal way to reproduce the buggy behaviour you're getting, it would be a great help, @brysonjones @nicholas-maselli

> @michel-aractingi I tried again using the most recent conversion script updates and unfortunately saw no change. I'll continue to run some experiments and let you know what I find. If there's additional info in the stack trace or logging that may be helpful, let me know and I can get that!

My apologies, I was away. Let me send you guys the test files I've been using (custom robot with 4 cameras, 25 datasets, 10 episodes per set).

```
IndexError: Caught IndexError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/lerobot/datasets/lerobot_dataset.py", line 881, in __getitem__
    video_frames = self._query_videos(query_timestamps, ep_idx)
  File "/root/lerobot/datasets/lerobot_dataset.py", line 858, in _query_videos
    frames = decode_video_frames(video_path, shifted_query_ts, self.tolerance_s, self.video_backend)
  File "/root/lerobot/datasets/video_utils.py", line 69, in decode_video_frames
    return decode_video_frames_torchcodec(video_path, timestamps, tolerance_s)
  File "/root/lerobot/datasets/video_utils.py", line 259, in decode_video_frames_torchcodec
    frames_batch = decoder.get_frames_at(indices=frame_indices)
  File "/lerobot/.venv/lib/python3.10/site-packages/torchcodec/decoders/_video_decoder.py", line 235, in get_frames_at
    data, pts_seconds, duration_seconds = core.get_frames_at_indices(
  File "/lerobot/.venv/lib/python3.10/site-packages/torch/_ops.py", line 829, in __call__
    return self._op(*args, **kwargs)
IndexError: Invalid frame index=391218 for streamIndex=0; must be less than 370967
```

This error you posted was exactly the same error I was getting.

The reason is that when the video rotates to file-001, the episode metadata starts at frame 0 again (which is correct), but after the first episode in that new video file the frame index jumps back to 1000+, because the current offset is only temporarily reset to 0, not for the entire loop.
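Reverse-engineering the numbers in the traceback above illustrates this. A back-of-the-envelope sketch, assuming 30 fps (my assumption) and that query timestamps are shifted by the stored offset before being converted to frame indices:

```python
# Hypothetical numbers reconstructed from the traceback above.
fps = 30                    # assumed recording rate
frames_in_file = 370_967    # what the decoder reports for file-001
stale_offset_s = 13_040.6   # offset still accumulated across ALL files
query_ts = 0.0              # first frame of an episode in file-001

frame_index = round((query_ts + stale_offset_s) * fps)
print(frame_index)          # 391218 -> decoder raises IndexError, since
                            # valid indices for this file stop at 370967
```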

@nicholas-maselli
Author

This was my explanation from the code review, but I'll add it to the conversation here:

"
Current offset must be saved for each iteration of the loop in the function above this.

Currently the current offset is not saved after it resets to 0, it continues on from the total frame count in the next iteration of the outer loop which makes the bug tricky to spot
"

@brysonjones
Contributor

@nicholas-maselli @michel-aractingi I've continued to do more experimentation, and the way I've been able to recreate this error is by taking a large V21 dataset, converting it to V30, and then merging it with another V30 dataset.

Something in this process is causing the merge to be corrupted: the frame indices are incorrect, and there appear to be some video decoding problems as well.

Unfortunately, this update doesn't seem to fix that issue from what I see.
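One way to localize this kind of corruption, sketched below, is to compare each episode's stored video offsets against the real duration of the file it points to. This is only a sketch: the metadata field names (`chunk_index`, `file_index`, `to_timestamp`), the `ds.meta.episodes` access, and the video path layout are my assumptions about the v3.0 format and may need adjusting:

```python
from torchcodec.decoders import VideoDecoder

def check_video_offsets(ds, key):
    """Flag episodes whose stored offsets point past the end of their video file.

    Assumes v3.0-style episode metadata fields and path layout; adjust both
    to your actual schema.
    """
    durations = {}  # cache: video path -> duration in seconds
    for ep in ds.meta.episodes:
        chunk = ep[f"videos/{key}/chunk_index"]
        file = ep[f"videos/{key}/file_index"]
        to_ts = ep[f"videos/{key}/to_timestamp"]
        path = ds.root / "videos" / key / f"chunk-{chunk:03d}" / f"file-{file:03d}.mp4"
        if path not in durations:
            durations[path] = VideoDecoder(str(path)).metadata.duration_seconds
        if to_ts > durations[path] + 1e-3:
            print(f"episode {ep['episode_index']}: to_timestamp {to_ts:.2f}s "
                  f"exceeds {path.name} duration {durations[path]:.2f}s")
```

Running this per camera key on the merged dataset should show exactly which episodes (and which destination file) carry out-of-range offsets, which would help tell a conversion bug apart from an aggregation bug.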
