dvc exp run: git-submodule destroys workspace #10823

@hemker

Bug Report

Description

If a repository contains a git submodule and a dvc stage executes a file from that submodule while writing its output to the project's root folder, dvc exp run fails to run the experiment and leaves a cluttered, broken git repository behind.

Reproduce

  1. Create a local git repository and initialize it with dvc.
  2. Prepare a second git repository to use as the submodule. In my case it contains a single script, run.py, with the following content:
import os

path = "model"
filename = "output.txt"

if not os.path.exists(path):
    os.makedirs(path)

with open(os.path.join(path, filename), 'wb') as temp_file:
    temp_file.write(b"Output")

The script only creates a "fake" output.
  3. Set up your dvc.yaml pipeline as:

stages:
  run-test:
    cmd: python external/run.py
    deps:
      - external/
    outs:
      - model/output.txt
  4. Add the second repository as a submodule with git submodule add <my-subrepo-url> external/
  5. Commit the added submodule and all changes to your local main repository
  6. Run dvc exp run
Reproducing experiment 'meaty-teff'
Building workspace index
Comparing indexes
Applying changes
Stage 'run-test' didn't change, skipping
ERROR: unexpected error - invalid data in index - invalid entry

After termination, the workspace is heavily cluttered by git changes like:

      new file:   .git/objects/ff/b6ed23131c48887b35e4e0d9e9bb8954b547bb
      new file:   .git/objects/ff/b81afb96b85488c92dd2a4ce0c7ebf68f533f6
      new file:   .git/objects/ff/b87e48727d95d793d8bb17ed20bc847107339f
      new file:   .git/objects/ff/b8e5f7ed10543ab4940212978c3eea6dd6d19f
      new file:   .git/objects/ff/b999f452746ae926fe6592eb4fc499804072c9
      new file:   .git/objects/ff/ba93b2cb261ac8c0235c4608cb6d0e79087078
      new file:   .git/objects/ff/bc49cd878675079f05cc65d0ebfa42590d52d3
      new file:   .git/objects/ff/bd5f4a623f5a0dd293e555e73f2f6c17be9cb2
      new file:   .git/objects/ff/bdc6ad64e18a10a82e7ea828fbdb2ed1c4fab6
      new file:   .git/objects/ff/c0207458f1ea7d2b70e9f4ba6d81107ac32147
      new file:   .git/objects/ff/c1d5324ceae9a256cd0a5180ed4b05012eb26c
      new file:   .git/objects/ff/c1dc8332b015572fe01371fde74173a6087aaf
      new file:   .git/objects/ff/c629a3e3484340fc326324889e4aa705164517
      new file:   .git/objects/ff/c7535b45a35b7e39f18332bdbf2e722a3104a6
      new file:   .git/objects/ff/c79bd1e5b4f91c81a40816d090880962d4a746

It also shows changes like:

        modified:   .git/HEAD
        modified:   .git/index
        modified:   .git/logs/HEAD
        deleted:    .git/refs/exps/exec/EXEC_BASELINE
        deleted:    .git/refs/exps/exec/EXEC_MERGE
  7. See the full stacktrace attached at the end
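
The reproduction steps above can be scripted so they run in a throwaway directory. The following is a minimal sketch, not part of the original report: the helper names reproduction_commands and reproduce and the commit message are made up for illustration. It assumes git and dvc are on PATH, that a git user name and email are configured, and that the submodule URL points to a repository containing run.py. With the bug present, the final dvc exp run is expected to exit with an error, which subprocess reports as CalledProcessError.

```python
import subprocess
from pathlib import Path

# Pipeline definition copied from step 3 of the report.
DVC_YAML = """\
stages:
  run-test:
    cmd: python external/run.py
    deps:
      - external/
    outs:
      - model/output.txt
"""


def reproduction_commands(submodule_url: str) -> list[list[str]]:
    """Return the commands for steps 1 and 4 to 6, in execution order."""
    return [
        ["git", "init"],
        ["dvc", "init"],
        ["git", "submodule", "add", submodule_url, "external/"],
        ["git", "add", "."],
        # Hypothetical commit message; any message works.
        ["git", "commit", "-m", "Add submodule and pipeline"],
        ["dvc", "exp", "run"],
    ]


def reproduce(workdir: Path, submodule_url: str) -> None:
    """Run all steps in a new directory; raises on the first failing command."""
    workdir.mkdir(parents=True, exist_ok=False)
    # Write dvc.yaml up front so that `git add .` picks it up before the commit.
    (workdir / "dvc.yaml").write_text(DVC_YAML)
    for cmd in reproduction_commands(submodule_url):
        subprocess.run(cmd, cwd=workdir, check=True)
```

Calling reproduce(Path("submodule-test"), "<my-subrepo-url>") with a real URL runs the whole sequence; using a fresh directory keeps the resulting clutter away from any repository you care about.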

Expected

This setup used to work in the past. I reactivated a training-pipeline with a setup like this recently and found it broken. dvc exp run should execute the job and successfully add the experiment to the git refs for further consumption.

Environment information

Tested with Ubuntu 24.04 and Python 3.10, as well as Ubuntu 24.04 under Windows WSL2 with Python 3.12.

Output of dvc doctor:

DVC version: 3.61.0 (pip)
-------------------------
Platform: Python 3.12.3 on Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.39
Subprojects:
        dvc_data = 3.16.10
        dvc_objects = 5.1.1
        dvc_render = 1.0.2
        dvc_task = 0.40.2
        scmrepo = 3.4.0
Supports:
        http (aiohttp = 3.12.15, aiohttp-retry = 2.9.1),
        https (aiohttp = 3.12.15, aiohttp-retry = 2.9.1),
        s3 (s3fs = 2025.7.0, boto3 = 1.39.8)
Config:
        Global: /home/hemker/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sdd
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/sdd
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/cf372b30375dfe9c40a81de6a73671ef

Additional Information (if any):
Hint: I temporarily modified the installed scmrepo package to exclude the pygit2 backend. This avoids the error, because scmrepo then only iterates over the remaining dulwich and gitpython backends:

File changed: lib/python3.12/site-packages/scmrepo/git/__init__.py

class GitBackends(Mapping):
    DEFAULT: ClassVar[dict[str, BackendCls]] = {
        "dulwich": DulwichBackend,
        #"pygit2": Pygit2Backend, <-------
        "gitpython": GitPythonBackend,
    }
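
Why dropping one entry from the mapping changes the outcome can be illustrated with a simplified model of a backend fallback loop. This is a sketch, not scmrepo's code: the classes UnsupportedBackend, FailingBackend and WorkingBackend and the function call_first_supported are hypothetical. It assumes the dispatcher tries backends in mapping order, skips a backend that raises NotImplementedError, and lets any other exception propagate. That assumption is consistent with the traceback, where the error surfaces from the pygit2 backend although dulwich is listed first, but it is not confirmed here.

```python
class UnsupportedBackend:
    """Stand-in for a backend that does not implement the operation."""

    def reset(self) -> str:
        raise NotImplementedError


class FailingBackend:
    """Stand-in for a backend that implements the operation but errors out."""

    def reset(self) -> str:
        raise RuntimeError("invalid data in index - invalid entry")


class WorkingBackend:
    """Stand-in for a backend that completes the operation."""

    def reset(self) -> str:
        return "reset done"


def call_first_supported(backends: dict[str, object], name: str) -> str:
    """Call `name` on the first backend that implements it (assumed behavior)."""
    for backend in backends.values():
        try:
            return getattr(backend, name)()
        except NotImplementedError:
            continue  # this backend lacks the operation, try the next one
    raise NotImplementedError(f"no backend implements {name!r}")


default = {
    "first": UnsupportedBackend(),
    "second": FailingBackend(),
    "third": WorkingBackend(),
}
# Equivalent of commenting out one entry in the mapping.
patched = {key: value for key, value in default.items() if key != "second"}
```

In this model, call_first_supported(default, "reset") raises the RuntimeError from the second backend, while the same call on patched reaches the third backend and succeeds.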

Output of dvc exp run -vvv.

Traceback (most recent call last):
  File "/home/hemker/dev/submodule-test/env/lib/python3.12/site-packages/dvc/cli/__init__.py", line 211, in main
    ret = cmd.do_run()
          ^^^^^^^^^^^^
  File "/home/hemker/dev/submodule-test/env/lib/python3.12/site-packages/dvc/cli/command.py", line 30, in do_run
    return self.run()
           ^^^^^^^^^^
  File "/home/hemker/dev/submodule-test/env/lib/python3.12/site-packages/dvc/commands/experiments/run.py", line 14, in run
    self.repo.experiments.run(
  File "/home/hemker/dev/submodule-test/env/lib/python3.12/site-packages/dvc/repo/experiments/__init__.py", line 354, in run
    return run(self.repo, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hemker/dev/submodule-test/env/lib/python3.12/site-packages/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hemker/dev/submodule-test/env/lib/python3.12/site-packages/dvc/repo/experiments/run.py", line 77, in run
    return repo.experiments.reproduce_one(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hemker/dev/submodule-test/env/lib/python3.12/site-packages/dvc/repo/experiments/__init__.py", line 126, in reproduce_one
    results = self._reproduce_queue(
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hemker/dev/submodule-test/env/lib/python3.12/site-packages/dvc/repo/experiments/utils.py", line 62, in wrapper
    ret = f(exp, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hemker/dev/submodule-test/env/lib/python3.12/site-packages/dvc/repo/experiments/__init__.py", line 249, in _reproduce_queue
    exec_results = queue.reproduce(copy_paths=copy_paths, message=message)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hemker/dev/submodule-test/env/lib/python3.12/site-packages/dvc/repo/experiments/queue/workspace.py", line 93, in reproduce
    self._reproduce_entry(
  File "/home/hemker/dev/submodule-test/env/lib/python3.12/site-packages/dvc/repo/experiments/queue/workspace.py", line 137, in _reproduce_entry
    executor.cleanup(infofile)
  File "/home/hemker/dev/submodule-test/env/lib/python3.12/site-packages/dvc/repo/experiments/executor/local.py", line 251, in cleanup
    with self._detach_stack:
  File "/usr/lib/python3.12/contextlib.py", line 610, in __exit__
    raise exc_details[1]
  File "/usr/lib/python3.12/contextlib.py", line 595, in __exit__
    if cb(*exc_details):
       ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/home/hemker/dev/submodule-test/env/lib/python3.12/site-packages/scmrepo/git/__init__.py", line 468, in detach_head
    self.reset()
  File "/home/hemker/dev/submodule-test/env/lib/python3.12/site-packages/scmrepo/git/__init__.py", line 308, in _backend_func
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/hemker/dev/submodule-test/env/lib/python3.12/site-packages/scmrepo/git/backend/pygit2/__init__.py", line 868, in reset
    self.repo.index.read(False)
    ^^^^^^^^^^^^^^^
  File "/home/hemker/dev/submodule-test/env/lib/python3.12/site-packages/pygit2/repository.py", line 650, in index
    check_error(err, io=True)
  File "/home/hemker/dev/submodule-test/env/lib/python3.12/site-packages/pygit2/errors.py", line 66, in check_error
    raise GitError(message)
_pygit2.GitError: invalid data in index - invalid entry
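
Because the failure occurs while pygit2 reads the index, one possible diagnostic is to list which index entries are gitlinks (submodule entries), which git stores with mode 160000. Whether such an entry is what pygit2 rejects is not established by this report. The sketch below only parses the text printed by git ls-files --stage; the helper names gitlink_paths and gitlinks_in_repo are made up for illustration, and the git command itself may fail if the index is unreadable.

```python
import subprocess

# Mode that git uses for gitlink (submodule) entries in the index.
GITLINK_MODE = "160000"


def gitlink_paths(ls_files_stage_output: str) -> list[str]:
    """Return paths of gitlink entries from `git ls-files --stage` output.

    Each line has the form "<mode> <object> <stage>\t<path>".
    """
    paths = []
    for line in ls_files_stage_output.splitlines():
        meta, sep, path = line.partition("\t")
        if not sep:
            continue  # not an index entry line
        mode = meta.split(" ", 1)[0]
        if mode == GITLINK_MODE:
            paths.append(path)
    return paths


def gitlinks_in_repo(repo_dir: str) -> list[str]:
    """Run `git ls-files --stage` in repo_dir and return its gitlink paths."""
    result = subprocess.run(
        ["git", "ls-files", "--stage"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return gitlink_paths(result.stdout)
```

Running gitlinks_in_repo(".") in the main repository before and after dvc exp run shows whether the submodule entry is still recorded in the index.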

Metadata

Assignees: No one assigned
Labels: git (Related to git and git backends), p3-nice-to-have (It should be done this or next sprint)
Status: Done
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests