LatexBlend: Scaling Multi-concept Customized Generation with Latent Textual Blending

Jian Jin, Zhenbo Yu, Yang Shen, Zhenyong Fu, Jian Yang
CVPR 2025
Teaser Image

LatexBlend simultaneously addresses two key challenges in scaling multi-concept generation: ensuring high generation quality (including concept fidelity and layout coherence) and maintaining computational efficiency.

Abstract

Customized text-to-image generation renders user-specified concepts into novel contexts based on textual prompts. Scaling the number of concepts in customized generation serves a broader range of creative needs, yet existing methods struggle with generation quality and computational efficiency. In this paper, we propose LatexBlend, a novel framework for effectively and efficiently scaling multi-concept customized generation. The core idea of LatexBlend is to represent single concepts and blend multiple concepts within a Latent Textual space, which is positioned after the text encoder and a linear projection. LatexBlend customizes each concept individually, storing it in a concept bank as a compact representation of latent textual features that captures sufficient concept information to ensure high fidelity. At inference, concepts from the bank can be freely and seamlessly combined in the latent textual space, offering two key merits for multi-concept generation: 1) excellent scalability, and 2) a significant reduction of denoising deviation, which preserves coherent layouts. Extensive experiments demonstrate that LatexBlend can flexibly integrate multiple customized concepts with harmonious structures and high subject fidelity, substantially outperforming baselines in both generation quality and computational efficiency.

Method

We propose LatexBlend, a novel framework for effectively and efficiently scaling multi-concept customized text-to-image generation. The core idea of LatexBlend is to represent single concepts and blend multiple concepts within a latent textual space, which is positioned after the text encoder and a linear projection. We identify the latent textual space as a pivotal point in conditional diffusion models for customized generation: it encompasses sufficient customized information, yet is shallow enough that merging concepts there remains inexpensive. Moreover, blending customized concepts in this space eliminates their interference during the earlier textual encoding process, thereby reducing denoising deviation. As a result, LatexBlend can efficiently integrate multiple customized concepts with high subject fidelity and coherent layouts.

Overall framework of the proposed LatexBlend. LatexBlend customizes each concept individually and stores them in a concept bank with a compact representation of latent textual features. At inference, concepts from the bank can be seamlessly combined in the latent textual space on the fly for multi-concept generation, without needing any additional tuning.
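The blending step described above can be sketched abstractly: prompt tokens are passed through the text encoder and a linear projection into the latent textual space, and stored concept features are then spliced in at the tokens corresponding to each concept. The sketch below is a minimal toy illustration of this idea with NumPy stand-ins; the shapes, the single-vector-per-concept bank, and the overwrite-at-position blending rule are all simplifying assumptions, not the paper's actual architecture or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
d_text, d_latent, seq_len = 8, 8, 6

# Stand-in for the linear projection that follows the frozen text encoder
# and maps token features into the latent textual space (hypothetical).
W_proj = rng.standard_normal((d_text, d_latent))

def to_latent_textual(token_features):
    """Project text-encoder outputs into the latent textual space."""
    return token_features @ W_proj

# "Concept bank": each customized concept is stored as a compact latent
# textual feature -- here a single vector per concept, purely illustrative.
concept_bank = {
    "my_dog": rng.standard_normal(d_latent),
    "my_mug": rng.standard_normal(d_latent),
}

def blend(token_features, placements):
    """Blend stored concepts into the prompt's latent textual features by
    overwriting the features at each concept's token position (toy rule)."""
    latent = to_latent_textual(token_features)
    for pos, name in placements.items():
        latent[pos] = concept_bank[name]
    return latent

# e.g. "a photo of <my_dog> next to <my_mug>", with the two concept
# tokens at positions 1 and 4 of the encoded prompt.
prompt_features = rng.standard_normal((seq_len, d_text))
blended = blend(prompt_features, {1: "my_dog", 4: "my_mug"})
print(blended.shape)  # (6, 8)
```

Because each concept is tuned and stored independently, combining any subset at inference is just this constant-time splice in the latent textual space, with no joint fine-tuning of the concepts.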

Comparisons

Visual Comparisons.

Visual comparison of generations with additional layout conditioning.

BibTeX

@inproceedings{jin2025latexblend,
  title={LatexBlend: Scaling Multi-concept Customized Generation with Latent Textual Blending},
  author={Jin, Jian and Yu, Zhenbo and Shen, Yang and Fu, Zhenyong and Yang, Jian},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}