
Cannot load datasets #191

Description

@qsh-zh

After following the setup, running python train.py fails to load the dataset and raises this error:

--- Train Config ---
TrainConfig(lr_mp=0.00512, lr_vision_backbone=5e-05, lr_language_backbone=5e-05, val_size=50000, batch_size=2, gradient_accumulation_steps=8, max_grad_norm=1.0, eval_in_epochs=True, eval_interval=500, stats_log_interval=100, max_training_steps=40000, max_images_per_example=4, max_images_per_knapsack=18, max_sample_length=4096, compile=False, resume_from_vlm_checkpoint=False, train_dataset_path='HuggingFaceM4/FineVision_concat_shuffled_2', train_dataset_name=('default',), stream_dataset=True, relevance_min_rating=1, image_correspondence_min_rating=1, visual_dependency_min_rating=1, formatting_min_rating=1, wandb_entity='qsh-team', log_wandb=True, use_lmms_eval=True, lmms_eval_tasks='mmstar,mmmu_val,ocrbench,textvqa_val,docvqa_val,scienceqa,mme,infovqa_val,chartqa', lmms_eval_limit=None, lmms_eval_batch_size=64)
Getting dataloaders from HuggingFaceM4/FineVision_concat_shuffled_2
Resize to max side len: True
Loading dataset: default
Warning: Failed to load dataset config 'default' from 'HuggingFaceM4/FineVision_concat_shuffled_2'. Error: (ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: 38f977ad-6afb-4cda-84c2-a842cb1f9ca9)')
Traceback (most recent call last):
  File "/mnt/scratch/plays/nanoVLM/train.py", line 702, in <module>
    main()
  File "/mnt/scratch/plays/nanoVLM/train.py", line 696, in main
    train(train_cfg, vlm_cfg)
  File "/mnt/scratch/plays/nanoVLM/train.py", line 265, in train
    train_loader, val_loader, iter_train_loader, iter_val_loader = get_dataloaders(train_cfg, vlm_cfg)
                                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/scratch/plays/nanoVLM/train.py", line 155, in get_dataloaders
    raise ValueError("No valid datasets were loaded. Please check your dataset path and configurations.")
ValueError: No valid datasets were loaded. Please check your dataset path and configurations.
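The warning in the log above reports `read timeout=10`, which matches the default 10-second limit that `huggingface_hub` applies to its Hub requests. As a sketch of one thing worth trying, assuming the failure is just a slow connection to huggingface.co, the limits can be raised through the `HF_HUB_ETAG_TIMEOUT` and `HF_HUB_DOWNLOAD_TIMEOUT` environment variables. The helper name `raise_hub_timeouts` and the 60-second value are illustrative, not part of nanoVLM.

```python
import os


def raise_hub_timeouts(seconds: int = 60) -> dict[str, str]:
    """Set longer Hub timeouts for the current process (hypothetical helper).

    Assumption: huggingface_hub reads these variables when it is first
    imported, so this must run before importing datasets or huggingface_hub.
    """
    settings = {
        "HF_HUB_ETAG_TIMEOUT": str(seconds),      # metadata (ETag) requests
        "HF_HUB_DOWNLOAD_TIMEOUT": str(seconds),  # file download requests
    }
    os.environ.update(settings)
    return settings


if __name__ == "__main__":
    # Call this at the very top of the entry point, before other imports.
    print(raise_hub_timeouts(60))
```

The same effect can be had by exporting the two variables in the shell before launching train.py. This may not help if the Hub is unreachable for another reason (proxy, firewall).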

When I try another config:

    train_dataset_path: str = 'HuggingFaceM4/FineVision'
    train_dataset_name: tuple[str, ...] = ("LLaVA_Instruct_150K", ) 

it fails with a different error:

Loading dataset: LLaVA_Instruct_150K
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:02<00:00, 23.01it/s]
Warning: Failed to load dataset config 'LLaVA_Instruct_150K' from 'HuggingFaceM4/FineVision'. Error: BuilderConfig ParquetConfig(name='LLaVA_Instruct_150K', version=0.0.0, data_dir=None, data_files={'train': ['LLaVA_Instruct_150K/train-*']}, description=None, batch_size=None, columns=None, features=None, filters=None) doesn't have a 'on_bad_files' key.
Traceback (most recent call last):
  File "/mnt/scratch/plays/nanoVLM/train.py", line 702, in <module>
    main()
  File "/mnt/scratch/plays/nanoVLM/train.py", line 696, in main
    train(train_cfg, vlm_cfg)
  File "/mnt/scratch/plays/nanoVLM/train.py", line 265, in train
    train_loader, val_loader, iter_train_loader, iter_val_loader = get_dataloaders(train_cfg, vlm_cfg)
                                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/scratch/plays/nanoVLM/train.py", line 155, in get_dataloaders
    raise ValueError("No valid datasets were loaded. Please check your dataset path and configurations.")
ValueError: No valid datasets were loaded. Please check your dataset path and configurations.
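The `doesn't have a 'on_bad_files' key` message suggests that the installed `datasets` library may not recognize an argument that the training script passes for Parquet loading, which would point to a version mismatch. That is an assumption, not something confirmed here. A small sketch for collecting the relevant package versions to attach to a report like this one (the function name `report_versions` and the package list are illustrative):

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(
    packages: tuple[str, ...] = ("datasets", "huggingface_hub", "pyarrow"),
) -> dict[str, str]:
    """Return the installed version of each package, or 'not installed'."""
    found: dict[str, str] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


if __name__ == "__main__":
    for name, ver in report_versions().items():
        print(f"{name}: {ver}")
```

Comparing the printed `datasets` version against the one pinned in the project's setup instructions would show whether the environment differs from what the script expects.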

train_dataset_path: str = 'HuggingFaceM4/FineVision_concat_shuffled_2'
