Training a Vision Model with Text-only Inputs #1590

Open
jjhoow opened this issue Jan 28, 2025 · 8 comments
jjhoow commented Jan 28, 2025

I need to train the vision model using only text inputs. I tried using Colab notebooks but noticed that images are mandatory in the data. After researching a bit more, I found a Colab notebook that trains with images, and I modified it. The code now looks like this:

# Create a data collator to encode text and image pairs
def collate_fn(examples):
    # Extract the messages in the correct format
    processed_examples = [example['messages'] for example in examples]
    
    # Apply the chat template to each example
    texts = [tokenizer.apply_chat_template(messages, tokenize=False) 
            for messages in processed_examples]

    # Tokenize the texts
    batch = tokenizer(
        text=texts, 
        images=None, 
        return_tensors="pt", 
        padding=True
    )

    # Create labels from the input_ids
    labels = batch["input_ids"].clone()
    labels[labels == tokenizer.tokenizer.pad_token_id] = -100

    batch["labels"] = labels

    return batch
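
The label-masking step above follows the standard convention: positions holding the pad token are set to -100 so the cross-entropy loss ignores them. A minimal sketch of just that step, using plain Python and an illustrative pad id (no real tokenizer involved):

```python
# Illustrative sketch of the label-masking convention used in collate_fn above.
# PAD_TOKEN_ID is a made-up value for demonstration; a real run would use
# tokenizer.tokenizer.pad_token_id.
PAD_TOKEN_ID = 0

def mask_pad_labels(input_ids, pad_token_id=PAD_TOKEN_ID):
    """Clone input_ids into labels, replacing pad positions with -100."""
    return [tok if tok != pad_token_id else -100 for tok in input_ids]

# A padded sequence: real token ids followed by padding.
padded = [101, 42, 7, 0, 0]
labels = mask_pad_labels(padded)
print(labels)  # [101, 42, 7, -100, -100]
```
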

from unsloth import is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator  # unused here; replaced by collate_fn above
from trl import SFTTrainer, SFTConfig

FastVisionModel.for_training(model) # Enable for training!

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = collate_fn, # Custom text-only collator (replaces UnslothVisionDataCollator)
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 120,
        num_train_epochs = 1, # Set this instead of max_steps for full training runs
        learning_rate = 2e-4,
        fp16 = not is_bf16_supported(),
        bf16 = is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",     # Set to "wandb" to log to Weights and Biases

        # You MUST put the below items for vision finetuning:
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        dataset_num_proc = 4,
        max_seq_length = 2048,
    ),
)

I am not entirely sure if this is 100% correct, but it seems to work (it's training).

Is there a proper way to train using only text datasets?

Additionally, can someone suggest a way to train on tools using the original chat template from LLaMA? I couldn’t understand how to structure the dataset so it works with the tokenizer.
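
For the tool-training question, one common layout (an assumption on my part, following the general transformers chat-template convention rather than anything confirmed in this thread) puts the model's call in an assistant turn under `tool_calls` and the tool's result in a `"tool"` role turn. The function name `get_weather` and its arguments below are purely illustrative:

```python
# Hedged sketch of a tool-use conversation record; field names follow the
# common chat-template convention, values are made up for illustration.
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {
        "role": "assistant",
        "tool_calls": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",          # hypothetical tool
                    "arguments": {"city": "Paris"},
                },
            }
        ],
    },
    # The tool's output is fed back in as a "tool" role turn.
    {"role": "tool", "name": "get_weather", "content": '{"temp_c": 18}'},
    {"role": "assistant", "content": "It is about 18 °C in Paris."},
]

print(messages[1]["tool_calls"][0]["function"]["name"])  # get_weather
```

Whether this renders correctly depends on the specific model's chat template, so it is worth decoding one formatted example and inspecting it before training.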

@danielhanchen
Contributor

Apologies for the issue - I'm still working out the best way for people to do text-only training (or a mixture of text and images) - for now, your code looks correct!

@yukiarimo

Any updates? Does text-only already work?

@jjhoow
Author

jjhoow commented Feb 6, 2025

Any updates? Does text-only already work?

This approach above works in Qwen based on my tests.

@yukiarimo

Great to hear that! Will try this later!

Will something like this code (raw text in JSONL) work? #1505 (comment)

@jjhoow
Author

jjhoow commented Feb 6, 2025

If the text is already in the structure the model expects, with the system, user, and assistant tags, I believe it will work. I trained using a ShareGPT-format dataset and Qwen2-VL, because Qwen2.5-VL is not supported in vLLM.

Edit:
There's one detail: this is my model: JJhooww/Fluxi_AI_Small_Vision.

If you want to check the datasets used, I had to modify the structure a bit to support the "type": "text", "text": {content} format.

{"conversations": [
  { "role": "system", "content": [ { "type": "text", "text": system_content } ] },
  { "role": "user", "content": [ { "type": "text", "text": user_content } ] },
  { "role": "assistant", "content": [ { "type": "text", "text": assistant_content } ] }
]}
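
To get plain role/content strings into that nested form, a small conversion helper (the function name is mine, not from the model card) might look like:

```python
def to_nested_content(turns):
    """Wrap each plain string content in the [{"type": "text", "text": ...}] form."""
    return [
        {"role": t["role"], "content": [{"type": "text", "text": t["content"]}]}
        for t in turns
    ]

# Plain ShareGPT-style turns with string contents.
plain = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi!"},
]
nested = to_nested_content(plain)
print(nested[1]["content"][0]["text"])  # Hi!
```
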
def collate_fn(examples):
    # Extract the messages in the correct format
    processed_examples = [example['conversations'] for example in examples]
    
    # Apply the chat template to each example
    texts = [processor.apply_chat_template(messages, tokenize=False) 
            for messages in processed_examples]

    # Tokenize the texts
    batch = processor(
        text=texts, 
        return_tensors="pt", 
        padding=True
    )

    # Create labels from the input_ids
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100

    batch["labels"] = labels

    return batch

@yukiarimo

Yay! Time to do some magic! By the way, is Qwen2.5-VL 7B supported for RLHF and full training too, or only LoRA/QLoRA?

Also, I would like to do RLHF with a question plus right/wrong answers. That's called DPO, right? But why are there so many other methods? What's the difference?

@jjhoow
Author

jjhoow commented Feb 6, 2025

Unsloth is also great because it's hackable, so I could test the changes above however I wanted. As for the multiple RLHF methods, Hugging Face has a brief explanation of each method and their differences.
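
For DPO in particular, the usual preference dataset is a set of records pairing one prompt with a preferred and a dispreferred completion (the `prompt`/`chosen`/`rejected` field names follow the common TRL convention; the values below are illustrative only):

```python
# One DPO preference record: same prompt, a preferred ("chosen") and a
# dispreferred ("rejected") completion. Values are made up for illustration.
record = {
    "prompt": "Explain overfitting in one sentence.",
    "chosen": "Overfitting is when a model memorizes its training data and fails to generalize.",
    "rejected": "Overfitting is when the model is too small.",
}

print(sorted(record))  # ['chosen', 'prompt', 'rejected']
```

Other methods (PPO, KTO, ORPO, ...) mainly differ in whether they need a reward model, paired preferences, or just per-example labels.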

@yukiarimo

Hey! Just tried again. New error: #1651

Do you know how to fix this?
