Lavender: Diffusion Instruction Tuning

Boosting SoTA vision-language models with Stable Diffusion

Preprint available on arXiv
AstraZeneca · Google DeepMind

Lavender (Language-and-Vision fine-tuning with Diffusion Aligner) is a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion.

Diffusion Instruction Tuning
“The visual expertise of image generators in Stable Diffusion can be transferred to enhance text generation in SoTA VLMs.”

Lavender aligns the text-vision attention inside the VLM transformer with the equivalent attention computed by Stable Diffusion during SFT, rather than adapting separate encoders. This alignment enriches the model’s visual understanding and significantly boosts performance on both in- and out-of-distribution tasks.

Key highlights:

  • Significant Gains: +30% on 20 tasks, +68% on OOD WorldMedQA.
  • Data-Efficient: Needs only 0.13M samples (~2.5% of typical VLM datasets).
  • Low Compute: Finetunes in ~1 day on 8 NVIDIA A10G GPUs.
  • Model-Agnostic: Works with Llama-3.2-11B, MiniCPM-Llama3-V-2.5 & more.
  • Precise Alignment: Transfers strong text-vision alignment from Stable Diffusion.
  • Open-Source: Code, data & finetuned models will be available.

Key Insights and Figures

Figure 1. Average Performance on 20 Vision-Language Reasoning Benchmarks (Grouped into 4 Categories).
Figure 2. Image generation models (Stable Diffusion on the left) exhibit stronger word-to-region attention alignment than VLMs (Open-Flamingo on the right). Per-word average attention maps suggest that diffusion models may be closer to an ideal distribution correlating image regions with textual tokens.
Figure 3. Lavender: Diffusion Instruction Tuning. Lavender uses the text-vision attention maps of a Stable Diffusion Model, \(Attention_{SDM}\), as a guiding objective for the attention of the target vision-language model (VLM), \(Attention_{VLM}\). The Attention Alignment module employs a 3-Layer ConvNet to transform \(Attention_{VLM}\) to match \(Attention_{SDM}\) via an MSE loss, acting as a regularisation term during supervised fine-tuning.

Method

Lavender is built on the simple idea of harnessing text-to-image (T2I) diffusion models to improve image-to-text (I2T) generation. We hypothesize that the cross-attention in T2I models provides finer-grained spatial alignment, whereas a VLM trained purely on next-token prediction learns weaker text-region alignment. By adding a mean-squared error (MSE) loss between the VLM and diffusion cross-attention on the same data, we shift the VLM attention distribution closer to an ideal alignment.
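As a rough illustration, the alignment term can be written as a simple MSE between per-word attention maps. The following PyTorch sketch assumes both sets of maps have already been extracted and resized to a common spatial grid; the function name, tensor shapes, and the normalisation step are illustrative assumptions, not the released implementation.

    import torch
    import torch.nn.functional as F

    def attention_alignment_loss(attn_vlm: torch.Tensor,
                                 attn_sdm: torch.Tensor) -> torch.Tensor:
        """MSE between per-word attention maps from the VLM and Stable Diffusion.

        Both tensors are assumed to have shape (batch, num_text_tokens, H, W),
        resized to the same spatial grid (an assumption for this sketch).
        """
        def normalise(a: torch.Tensor) -> torch.Tensor:
            # Treat each map as a spatial distribution: sum to 1 over H * W.
            flat = a.flatten(start_dim=-2)
            flat = flat / (flat.sum(dim=-1, keepdim=True) + 1e-8)
            return flat.reshape(a.shape)

        return F.mse_loss(normalise(attn_vlm), normalise(attn_sdm))

Normalising each map first (again an assumption here) makes the regulariser compare attention distributions rather than raw magnitudes, which can differ between the two models.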


We propose an Aligner Network (a few light convolutional layers) to transform the raw VLM attention into a distribution that can be directly matched to the Stable Diffusion attention; a sketch follows below. Combined with parameter-efficient fine-tuning (LoRA), this yields strong results without destabilizing the original VLM’s capabilities.
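Under the same caveats, a minimal sketch of what the Aligner Network and the combined SFT objective might look like in PyTorch is given below; the layer widths, the per-word map layout, and the loss weight lam are illustrative assumptions rather than the released implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AlignerNet(nn.Module):
        """Lightweight 3-layer ConvNet that maps raw VLM attention maps towards
        the space of Stable Diffusion attention maps (channel widths here are
        placeholders)."""

        def __init__(self, hidden: int = 16):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, hidden, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(hidden, 1, kernel_size=3, padding=1),
            )

        def forward(self, attn_vlm: torch.Tensor) -> torch.Tensor:
            # attn_vlm: (batch * num_tokens, 1, H, W) per-word attention maps.
            return self.net(attn_vlm)

    def lavender_sft_loss(lm_loss: torch.Tensor,
                          attn_vlm: torch.Tensor,
                          attn_sdm: torch.Tensor,
                          aligner: AlignerNet,
                          lam: float = 1.0) -> torch.Tensor:
        """Next-token prediction loss plus the attention-alignment regulariser
        (lam is a placeholder weight)."""
        return lm_loss + lam * F.mse_loss(aligner(attn_vlm), attn_sdm)

In a LoRA setup, only the adapter weights and the aligner would typically be trained, keeping the base VLM weights frozen.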

Performance Across Benchmarks

We evaluate Lavender on 16 mainstream vision-language benchmarks covering chart/diagram/document understanding, perception and multi-discipline reasoning, real-world visual tasks, and visual hallucination detection. The figures below show the typical relative improvement pattern for Lavender-Llama-3.2, with gains of up to 50% over the Small Budget-Constrained SOTA.

Lavender improves MiniCPM-Llama3-V-2.5 and Llama-3.2-11B, surpassing the Small Budget-Constrained SOTA by up to 50% with minimal fine-tuning data (~0.13M samples).
Among Large SOTA Models, Lavender-Llama-3.2-11B performs comparably to certain High-Resource State-of-the-Art Models that are at least an order of magnitude larger.
Lavender boosts the cross-attention-equipped Llama-3.2-11B by up to 30% on 19/20 benchmarks, while mitigating catastrophic forgetting. Raw scores are shown on the bars.
Lavender enhances the self-attention-only MiniCPM-Llama3-V-2.5 by up to 4% on 16/20 benchmarks, despite further fine-tuning on its pre-training dataset. Raw scores are shown on the bars.

We observe that aligning attention with diffusion models also helps reduce visual hallucinations, especially on tasks that require domain-specific knowledge.

Without any tuning on medical data, Lavender boosts Llama-3.2-11B's performance on the out-of-distribution benchmark WorldMedQA by 68%. Raw scores are shown on the bars.

Qualitative Examples

Below are sample VQA questions comparing the original Llama-3.2-11B with Lavender-Llama-3.2-11B:

Example VQA 1

BibTeX


    @misc{jin2025diffusioninstructiontuning,
      title={Diffusion Instruction Tuning},
      author={Chen Jin and Ryutaro Tanno and Amrutha Saseendran and Tom Diethe and Philip Teare},
      year={2025},
      eprint={2502.06814},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.06814},
    }

Acknowledgements

We thank the creators of Stable Diffusion, Llama, MiniCPM and Open-Flamingo for providing the foundation models and code used in Lavender. We also appreciate the open-source community around parameter-efficient fine-tuning (PEFT) frameworks, which made this project feasible under constrained data and compute budgets.

Usage and License Notices: Our data and code are for research use only. They are also restricted by the licenses of Llama, Stable Diffusion, and other upstream models. See our GitHub repository for license details.