Method
Lavender is built on a simple idea: harnessing text-to-image (T2I) diffusion models for
image-to-text (I2T) generation tasks. We hypothesize that the cross-attention in T2I models
provides fine-grained spatial alignment between text tokens and image regions, whereas a VLM
trained purely on next-token prediction develops weaker text-region alignment. By adding a
mean-squared error (MSE) loss between the VLM's attention and the diffusion model's
cross-attention on the same data, we shift the VLM's attention distribution closer to an
ideal alignment.
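As a concrete illustration, a minimal sketch of this combined objective is given below. The tensor shapes, the softmax normalization of the attention maps, and the weighting term lambda_align are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def alignment_loss(vlm_attn: torch.Tensor,
                   diff_attn: torch.Tensor,
                   lm_loss: torch.Tensor,
                   lambda_align: float = 1.0) -> torch.Tensor:
    """Next-token prediction loss plus an MSE attention-alignment term.

    vlm_attn, diff_attn: (batch, tokens, H, W) attention maps over image regions,
    assumed already extracted and resized to a shared spatial resolution.
    lm_loss: the VLM's usual next-token cross-entropy loss.
    """
    # Normalize each map to a distribution over spatial locations so the two
    # models' attention magnitudes are comparable (normalization choice is ours).
    vlm_p = vlm_attn.flatten(2).softmax(dim=-1)
    diff_p = diff_attn.flatten(2).softmax(dim=-1)
    # The diffusion attention serves as a fixed target, so it is detached.
    align = F.mse_loss(vlm_p, diff_p.detach())
    return lm_loss + lambda_align * align
```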
We propose an Aligner Network, a few lightweight convolutional layers that transform
the raw VLM attention into a distribution that can be directly matched to the Stable Diffusion
attention. Combined with parameter-efficient finetuning via LoRA, this yields strong results
without destabilizing the original VLM's capabilities.
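A minimal sketch of such an aligner follows; the channel width, activation, and output renormalization are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Aligner(nn.Module):
    """A small convolutional stack mapping raw VLM attention maps toward
    the shape and scale of the diffusion cross-attention target."""

    def __init__(self, channels: int = 16):
        super().__init__()
        # Treat each token's (H, W) attention map as a 1-channel image.
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, vlm_attn: torch.Tensor) -> torch.Tensor:
        # vlm_attn: (batch, tokens, H, W) raw VLM attention maps.
        b, t, h, w = vlm_attn.shape
        x = vlm_attn.reshape(b * t, 1, h, w)
        x = self.net(x).reshape(b, t, h, w)
        # Renormalize so each token's output is a spatial distribution.
        return x.flatten(2).softmax(dim=-1).reshape(b, t, h, w)

# Usage: aligned maps feed the MSE term in the objective sketched above.
aligner = Aligner()
aligned = aligner(torch.rand(2, 8, 24, 24))  # batch of 2, 8 tokens, 24x24 maps
```

Under this setup, only the aligner and the LoRA adapters would be trained, leaving the base VLM weights frozen.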