
Vision-tuning LLaMA-3.2-11B on text-only data: can Axolotl preserve multimodal abilities?

1 min read
257 words
Topics: Opinions on LLMs · Vision-tuning LLaMA-3.2-11B

A Reddit post lays out a bold bet: vision-tuning alpindale/Llama-3.2-11B-Vision-Instruct with a text-only dataset, hoping to keep multimodal chops intact. The author leans on Axolotl to pull this off without images in the training data. [1]

What they're trying

The goal is to fine-tune a vision-capable model on a purely text dataset of domain-specific Q&As, aiming to sharpen instruction-following while still handling OCR and image queries. [1]
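For concreteness, here is a minimal sketch of what one such text-only training record could look like in a chat-style schema. The question, answer, and field names are invented for illustration; the post does not share its actual data.

```yaml
# Hypothetical text-only Q&A record, shown as YAML for readability;
# in practice each record would typically be one JSON line in a .jsonl file.
messages:
  - role: user
    content: "What does error code E-314 mean on the X200 controller?"   # invented domain question
  - role: assistant
    content: "E-314 indicates an undervoltage fault on the 24 V supply rail; check the power wiring before swapping the board."   # invented answer
```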

The post's Axolotl configuration, as listed (a YAML reconstruction follows the list): [1]

- base_model: alpindale/Llama-3.2-11B-Vision-Instruct
- processor_type: AutoProcessor
- chat_template: llama3_2_vision
- datasets: HuggingFaceH4/llava-instruct-mix-vsft (via the Hugging Face Hub)
- lora_r: 32
- lora_alpha: 16
- lora_target_modules: a complex regex covering attention/projection blocks
- sequence_len: 8192
- lora_dropout: 0.05
- adapter: lora
- tokenizer_type: AutoTokenizer
- tokenizer_config: (not given in the post)
- output_dir: (not given in the post)
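Assembled into Axolotl's YAML format, the config might look roughly like this. It is a reconstruction from the list above, not the author's actual file: the target-module regex, dataset type, and output path are placeholders for values the post elides.

```yaml
# Reconstructed sketch of the post's Axolotl config (not the author's actual file).
base_model: alpindale/Llama-3.2-11B-Vision-Instruct
processor_type: AutoProcessor   # multimodal models load a processor, not just a tokenizer
tokenizer_type: AutoTokenizer
chat_template: llama3_2_vision

datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft
    type: chat_template          # assumption: the post does not name the dataset type

sequence_len: 8192

adapter: lora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
# Placeholder: the post only says "complex regex for attention/projection blocks".
lora_target_modules: 'model.layers.[\d]+.(self_attn|mlp).(q|k|v|o|gate|up|down)_proj'

output_dir: ./outputs/llama32-vision-text-only   # placeholder: not given in the post
```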

Why this matters (the challenges)

The plan sets up a tug-of-war: keep the model faithful to vision inputs while training it on text alone. Tokenizer handling, dataset format, and chat-template alignment all have to cooperate for multimodal behavior to survive this cross-domain shift. [1]
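As one illustration of the dataset-format concern, swapping in a local text-only JSONL file might look like the stanza below. The path and field mapping are assumptions, and Axolotl's chat_template dataset type is used here as a plausible fit rather than something the post confirms.

```yaml
# Hypothetical stanza for a local, text-only Q&A dataset.
datasets:
  - path: data/domain_qa.jsonl     # placeholder path
    type: chat_template            # format records with the model's chat template
    field_messages: messages       # assumption: records store turns under "messages"
```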

Community take (what people are watching for)

Discussion centers on cross-domain training viability and what practical results look like when vision-capable models are trained with pure text. The outcome remains uncertain, but the experiment highlights real-world constraints and engineering specifics. [1]

Closing thought

No published results yet: this is an ongoing experiment to see if Axolotl can keep multimodal access alive when you feed it only text. Watch how tokenizer choices and dataset formats evolve in future updates. [1]

References

[1] Reddit: "[D] Training a Vision model on a Text-Only Dataset using Axolotl". Discusses vision-tuning LLaMA-3.2-11B with Axolotl: tokenizer handling, dataset formats, errors, and strategies to retain multimodal abilities.

