CLoRA: A Contrastive Approach to Compose Multiple LoRA Models

¹Virginia Tech
²ETH Zürich - DALAB
³Google
⁴TUM
Equal contribution.

CLoRA is a training-free method that operates at test time, using contrastive learning to compose multiple concept and style LoRAs simultaneously. Given pre-trained LoRA models, such as L1 for a black-and-white cat and L2 for a specific type of flower, the goal is to create an image that accurately represents both concepts described by the respective LoRAs. However, directly combining these LoRA models often leads to poor outcomes (see LoRA Merge). This failure arises primarily because the attention mechanism fails to form coherent attention maps for the subjects and their corresponding attributes. CLoRA revises the attention maps to clearly separate the attention associated with the distinct concept LoRAs (depicted as Ours).

Abstract

Low-Rank Adaptations (LoRAs) have emerged as a powerful and popular technique in the field of image generation, offering a highly effective way to adapt and refine pre-trained deep learning models for specific tasks without comprehensive retraining. Given pre-trained LoRA models, such as one representing a specific cat and another a particular dog, the objective is to generate an image that faithfully embodies both animals as defined by the LoRAs. However, seamlessly blending multiple concept LoRAs to capture a variety of concepts in one image proves to be a significant challenge. Common approaches often fall short, primarily because the attention mechanisms within different LoRA models overlap, leading to scenarios where one concept is ignored entirely (e.g., omitting the dog) or where concepts are incorrectly combined (e.g., producing an image of two cats instead of one cat and one dog). CLoRA addresses these issues by updating the attention maps of multiple LoRA models at test time and leveraging the updated maps to create semantic masks that guide the fusion of latent representations. Our method enables the creation of composite images that truly reflect the characteristics of each LoRA, successfully merging multiple concepts or styles. Our comprehensive qualitative and quantitative evaluations demonstrate that our approach outperforms existing methods, marking a significant advancement in image generation with LoRAs. Furthermore, we share our source code, benchmark dataset, and trained LoRA models to promote further research on this topic.
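
To make this concrete, below is a minimal, hedged sketch of how a contrastive objective over per-token cross-attention maps could look: maps of tokens belonging to the same concept LoRA are pulled together, while maps from different LoRAs are pushed apart (an InfoNCE-style loss). The tensor shapes, the grouping of token indices into concepts, and the temperature are illustrative assumptions, not our exact formulation.

    import torch
    import torch.nn.functional as F

    def contrastive_attention_loss(attn_maps, concept_groups, temperature=0.5):
        # attn_maps: (T, H, W) cross-attention maps, one per prompt token.
        # concept_groups: list of token-index lists, one list per concept LoRA.
        flat = F.normalize(attn_maps.flatten(1), dim=1)  # (T, H*W), unit norm
        sim = flat @ flat.t() / temperature              # pairwise similarities
        loss, count = 0.0, 0
        for g, group in enumerate(concept_groups):
            # Attention maps of every other concept act as negatives.
            negatives = [i for gg, grp in enumerate(concept_groups) if gg != g for i in grp]
            for anchor in group:
                for positive in group:
                    if positive == anchor:
                        continue
                    logits = sim[anchor, [positive] + negatives].unsqueeze(0)
                    target = torch.zeros(1, dtype=torch.long)  # positive sits at index 0
                    loss = loss + F.cross_entropy(logits, target)
                    count += 1
        return loss / max(count, 1)

    # Example: 6 prompt tokens; tokens 1-2 describe concept A, tokens 4-5 concept B.
    maps = torch.rand(6, 16, 16, requires_grad=True)
    loss = contrastive_attention_loss(maps, [[1, 2], [4, 5]])
    loss.backward()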

Motivation

Attention overlap and attribute-binding issues when merging multiple LoRA models (a). Integrating multiple LoRAs often fails to represent both concepts accurately because their attention maps overlap. Our technique adjusts the attention maps at test time to distinctly separate the different LoRA models, producing compositions that accurately reflect the intended concepts (b).

Framework

Illustration of our method, CLoRA, which combines LoRA (Low-Rank Adaptation) models using attention maps to guide Stable Diffusion image generation with user-defined concepts. The process involves prompt breakdown, attention-guided diffusion, and a contrastive loss that enforces consistency.
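
As a rough sketch of the attention-guided diffusion step, a loss like the one above can steer generation at test time by differentiating it with respect to the current latent and taking a small gradient step before the next denoising iteration. The compute_loss callable, the step size, and the loop structure below are hypothetical stand-ins for the full pipeline, not our exact implementation.

    import torch

    def contrastive_latent_update(latent, compute_loss, step_size=0.1):
        # One test-time guidance step: compute_loss maps a latent to a scalar
        # (e.g., the contrastive attention loss evaluated on attention maps
        # captured during a UNet forward pass), and the latent is nudged
        # against its gradient before the next denoising step.
        latent = latent.detach().requires_grad_(True)
        loss = compute_loss(latent)
        grad = torch.autograd.grad(loss, latent)[0]
        return (latent - step_size * grad).detach()

    # Dummy usage: any differentiable function of the latent works here.
    z = torch.randn(1, 4, 64, 64)
    z = contrastive_latent_update(z, lambda x: x.pow(2).mean())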

Ablation Study

CLoRA ablation study. Using the L1 cat and L2 dog LoRAs, we show the effect of the two components of our method: the latent update and latent masking.
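
For the latent masking component, one plausible minimal implementation is sketched below: each concept's attention map is normalized and thresholded into a binary semantic mask, and the masks stitch together the latents produced by the individual LoRA models. The threshold value, the min-max normalization, and the fallback to the first latent are illustrative choices rather than our exact procedure.

    import torch

    def masked_latent_fusion(latents, attn_maps, threshold=0.5):
        # latents: list of (C, H, W) latents, one per LoRA model.
        # attn_maps: list of (H, W) attention maps for each LoRA's concept tokens.
        fused = latents[0].clone()  # regions outside all masks keep the first latent
        for latent, attn in zip(latents, attn_maps):
            a = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)  # scale to [0, 1]
            mask = (a > threshold).to(latent.dtype)                     # binary semantic mask
            fused = mask * latent + (1 - mask) * fused
        return fused

    # Example with two LoRA latents and their concept attention maps.
    latents = [torch.randn(4, 64, 64) for _ in range(2)]
    maps = [torch.rand(64, 64) for _ in range(2)]
    fused = masked_latent_fusion(latents, maps)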

BibTeX


    @misc{meral2024clora,
      title={CLoRA: A Contrastive Approach to Compose Multiple LoRA Models},
      author={Tuna Han Salih Meral and Enis Simsar and Federico Tombari and Pinar Yanardag},
      year={2024},
      eprint={2403.19776},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
    }