fix checkpointing and validation with InfoNCE #26
Open
hummuscience wants to merge 1 commit into paninski-lab:main from
Conversation
Author
Hmm, I am not sure this fixes things. The checkpoint saving depends on validation working (unless one activates saving checkpoints every N epochs). But somehow I am not getting validation losses while training. Will try to dig deeper.
Four issues fixed:

1. ContrastBatchSampler bug: pos_indices from extract_anchor_indices are local (subset) indices, but were compared against global dataset_indices in the valid_positives filter. This caused all positives to be rejected for non-zero-based subsets (val/test), producing empty batches. Removed the incorrect check since local indices are valid by construction.
2. val_dataloader now uses ContrastBatchSampler and contrastive_collate_fn when use_sampler=True, matching the training dataloader. Both dataloaders use batch_sampler= (not sampler=) since ContrastBatchSampler yields batches of indices, not individual indices.
3. Add save_last=True to ModelCheckpoint so a last.ckpt is always maintained, preventing total loss of training progress on interruption.
4. Fix validation never triggering with ContrastBatchSampler. Even with batch_sampler=, Lightning 2.6.1 sets max_batches=inf when use_distributed_sampler=False, preventing epoch-boundary validation. Workaround: use step-based val_check_interval (dataloader_len * check_val_every_n_epoch) with check_val_every_n_epoch=None. Also make num_sanity_val_steps configurable (default: 2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
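The step-based workaround in point 4 comes down to converting an epoch-based validation cadence into a global-step one. A minimal sketch, assuming the conversion described above (the helper name and the kwargs dict are illustrative, not the project's actual code):

```python
# Sketch of the step-based validation workaround (point 4).
# The helper name is a hypothetical illustration of the described fix.
def step_based_val_interval(dataloader_len: int, check_val_every_n_epoch: int) -> int:
    """Convert an epoch-based validation cadence into a step-based interval."""
    return dataloader_len * check_val_every_n_epoch

# e.g. 125 batches per epoch, validate every 2 epochs -> every 250 steps
interval = step_based_val_interval(125, 2)
assert interval == 250

# These kwargs would then be passed to the Lightning Trainer, per the
# description above (shown as a plain dict so the sketch stays self-contained):
trainer_kwargs = dict(
    val_check_interval=interval,   # step-based validation trigger
    check_val_every_n_epoch=None,  # disable the (broken) epoch-based trigger
    num_sanity_val_steps=2,        # now configurable; default 2
)
```

This sidesteps the max_batches=inf issue because Lightning evaluates val_check_interval against the global step counter rather than epoch boundaries.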
dcf9b68 to 8571d61
Author
This took a while to dig through. I must admit that after a while I couldn't follow where it was going, since I am not familiar with the details, but the robots seem to agree. Here is a repro script:
Same as what was described here: #21
Three issues fixed:

1. ContrastBatchSampler bug: pos_indices from extract_anchor_indices are local (subset) indices, but were compared against global dataset_indices in the valid_positives filter. This caused all positives to be rejected for non-zero-based subsets (val/test), producing empty batches. Removed the incorrect check since local indices are valid by construction.
2. val_dataloader now uses ContrastBatchSampler and contrastive_collate_fn when use_sampler=True, matching the training dataloader. This enables proper InfoNCE computation during validation, so val_loss is logged and the best-model checkpoint callback can trigger.
3. Add save_last=True to ModelCheckpoint so a last.ckpt is always maintained, preventing total loss of training progress on interruption.
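The sampler bug in point 1 can be reproduced with a minimal sketch. The variable names follow the description above, but the simplified filter logic is an assumption, not the project's actual code:

```python
# Sketch of the local-vs-global index bug (point 1), assumed simplified logic.
# A validation subset covers GLOBAL dataset indices 1000..1999; positives
# returned by the anchor-extraction step are LOCAL indices into the subset.
dataset_indices = list(range(1000, 2000))  # global indices of the val subset
pos_indices = [3, 17, 42]                  # local (subset-relative) indices

# Buggy filter: compares local indices against global ones, so every positive
# is rejected for any non-zero-based subset -> empty validation batches.
valid_positives = [i for i in pos_indices if i in dataset_indices]
assert valid_positives == []  # all positives wrongly dropped

# Fix: local indices are valid by construction (0 <= i < len(subset)),
# so the membership check against global indices is simply removed.
valid_positives = pos_indices
assert all(0 <= i < len(dataset_indices) for i in valid_positives)
```

The bug is invisible on the training split, whose subset happens to start at global index 0, which is why it only surfaced as empty batches and missing val_loss during validation.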