Lavender: Diffusion Instruction Tuning

Boosting SoTA vision-language models with Stable Diffusion

Preprint available on arXiv
AstraZeneca · Google DeepMind

Lavender (Language-and-Vision fine-tuning with Diffusion Aligner) is a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion.

Diffusion Instruction Tuning
“The visual expertise of image generators in Stable Diffusion can be transferred to enhance text generation in SoTA VLMs.”

Lavender aligns the text-vision attention inside the VLM transformer with the equivalent attention computed by Stable Diffusion during SFT, rather than adapting separate encoders. This alignment enriches the model’s visual understanding and significantly boosts performance on both in- and out-of-distribution tasks.

Key highlights:

  • Significant Gains: +30% on 20 tasks, +68% on OOD WorldMedQA.
  • Data-Efficient: Needs only 0.13M samples (~2.5% of typical VLM datasets).
  • Low Compute: Finetunes in ~1 day on 8 NVIDIA A10G GPUs.
  • Model-Agnostic: Works with Llama-3.2-11B, MiniCPM-Llama3-V-2.5 & more.
  • Precise Alignment: Transfers strong text-vision alignment from Stable Diffusion.
  • Open-Source: Code, data & finetuned models will be available.

Key Insights and Figures

Figure 1. Average Performance on 20 Vision-Language Reasoning Benchmarks (Grouped into 4 Categories).
Figure 2. Image generation models (Stable Diffusion on the left) exhibit stronger word-to-region attention alignment than VLMs (Open-Flamingo on the right). Per-word average attention maps suggest that diffusion models may be closer to an ideal distribution correlating image regions with textual tokens.
Figure 3. Lavender: Diffusion Instruction Tuning. Lavender uses the text-vision attention maps of a Stable Diffusion Model, \(Attention_{SDM}\), as a guiding objective for the attention of the target vision-language model (VLM), \(Attention_{VLM}\). The Attention Alignment module employs a 3-Layer ConvNet to transform \(Attention_{VLM}\) to match \(Attention_{SDM}\) via an MSE loss, acting as a regularisation term during supervised fine-tuning.

Method

Lavender is built on the simple idea of harnessing text-to-image (T2I) diffusion models to improve image-to-text (I2T) generation. We hypothesize that the cross-attention in T2I models provides finer-grained spatial alignment, whereas a VLM trained purely on next-token prediction learns weaker text-region alignment. By adding a mean-squared error (MSE) loss between the VLM and diffusion cross-attention on the same data, we shift the VLM attention distribution closer to an ideal alignment.
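As a rough illustration, the alignment term can be written as a simple MSE between per-word attention maps. The following PyTorch sketch assumes both sets of maps have already been extracted and resized to a common spatial grid; the function name, tensor shapes, and the normalisation step are illustrative assumptions, not the released implementation.

    import torch
    import torch.nn.functional as F

    def attention_alignment_loss(attn_vlm: torch.Tensor,
                                 attn_sdm: torch.Tensor) -> torch.Tensor:
        """MSE between per-word attention maps from the VLM and Stable Diffusion.

        Both tensors are assumed to have shape (batch, num_text_tokens, H, W),
        resized to the same spatial grid (an assumption for this sketch).
        """
        def normalise(a: torch.Tensor) -> torch.Tensor:
            # Treat each map as a spatial distribution: sum to 1 over H * W.
            flat = a.flatten(start_dim=-2)
            flat = flat / (flat.sum(dim=-1, keepdim=True) + 1e-8)
            return flat.reshape(a.shape)

        return F.mse_loss(normalise(attn_vlm), normalise(attn_sdm))

Normalising each map first (again an assumption here) makes the regulariser compare attention distributions rather than raw magnitudes, which can differ between the two models.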


We propose an Aligner Network (a few light convolutional layers) to transform the raw VLM attention into a distribution that can be directly matched to the Stable Diffusion attention; a sketch follows below. Combined with parameter-efficient fine-tuning (LoRA), this yields strong results without destabilizing the original VLM’s capabilities.
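Under the same caveats, a minimal sketch of what the Aligner Network and the combined SFT objective might look like in PyTorch is given below; the layer widths, the per-word map layout, and the loss weight lam are illustrative assumptions rather than the released implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AlignerNet(nn.Module):
        """Lightweight 3-layer ConvNet that maps raw VLM attention maps towards
        the space of Stable Diffusion attention maps (channel widths here are
        placeholders)."""

        def __init__(self, hidden: int = 16):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, hidden, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(hidden, 1, kernel_size=3, padding=1),
            )

        def forward(self, attn_vlm: torch.Tensor) -> torch.Tensor:
            # attn_vlm: (batch * num_tokens, 1, H, W) per-word attention maps.
            return self.net(attn_vlm)

    def lavender_sft_loss(lm_loss: torch.Tensor,
                          attn_vlm: torch.Tensor,
                          attn_sdm: torch.Tensor,
                          aligner: AlignerNet,
                          lam: float = 1.0) -> torch.Tensor:
        """Next-token prediction loss plus the attention-alignment regulariser
        (lam is a placeholder weight)."""
        return lm_loss + lam * F.mse_loss(aligner(attn_vlm), attn_sdm)

In a LoRA setup, only the adapter weights and the aligner would typically be trained, keeping the base VLM weights frozen.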

Performance Across Benchmarks

We evaluate Lavender on 16 mainstream vision-language benchmarks covering chart/diagram/document understanding, perception and multi-discipline reasoning, real-world visual tasks, and visual hallucination detection. The figures below show the typical relative improvement pattern for Lavender-Llama-3.2, with gains of up to 50% over the Small Budget-Constrained SOTA.

Lavender improves MiniCPM-Llama3-V-2.5 and Llama-3.2-11B, surpassing the Small Budget-Constrained SOTA by up to 50% with minimal fine-tuning data (~0.13M samples).
Among Large SOTA Models, Lavender-Llama-3.2-11B performs comparably to certain High-Resource State-of-the-Art Models that are at least an order of magnitude larger.
Lavender boosts the cross-attention-equipped Llama-3.2-11B by up to 30% on 19/20 benchmarks, while mitigating catastrophic forgetting. Raw scores are shown on the bars.
Lavender enhances the self-attention-only MiniCPM-Llama3-V-2.5 by up to 4% on 16/20 benchmarks, despite further fine-tuning on its pre-training dataset. Raw scores are shown on the bars.

We observe that aligning attention with diffusion models also helps reduce visual hallucinations, especially on tasks that require domain-specific knowledge.

Without any tuning on medical data, Lavender boosts Llama-3.2-11B's performance on the out-of-distribution benchmark WorldMedQA by 68%. Raw scores are shown on the bars.

Qualitative Examples

Below are sample VQA questions comparing the original Llama-3.2-11B with Lavender-Llama-3.2-11B:

Example VQA 1

BibTeX


    @misc{jin2025diffusioninstructiontuning,
      title={Diffusion Instruction Tuning},
      author={Chen Jin and Ryutaro Tanno and Amrutha Saseendran and Tom Diethe and Philip Teare},
      year={2025},
      eprint={2502.06814},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.06814},
    }

Acknowledgements

We thank the creators of Stable Diffusion, Llama, MiniCPM and Open-Flamingo for providing the foundation models and code used in Lavender. We also appreciate the open-source community around parameter-efficient fine-tuning (PEFT) frameworks, which made this project feasible under constrained data and compute budgets.

Usage and License Notices: Our data and code are for research use only. They are also restricted by the licenses of Llama, Stable Diffusion, and other upstream models. See our GitHub repository for license details.