
Lumina-T2X is a unified framework for Text to Any Modality

It helps you build a generation model for any modality. Fast.

lumina infer "A snowman of ..."

Lumina-T2I - Image Generation

Architecture 🏗️

We are excited to unveil Lumina-T2X, a unified framework that seamlessly transforms text into a variety of modalities, including images, videos, multi-view images, and audio.

At the heart of Lumina-T2X lies the Flow-based Large Diffusion Transformer (Flag-DiT), which scales up to 7B parameters and supports generation of up to 512K tokens. We will be open-sourcing both the training code and pre-trained models to foster further research and development.

Flow-based Large Diffusion Transformer (Flag-DiT)

Lumina-T2X is trained with the flow matching objective and incorporates techniques such as RoPE, RMSNorm, and KQ-norm, yielding faster training convergence, stable training dynamics, and a simplified pipeline.
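
To make the objective concrete, here is a minimal sketch of one flow matching training step in PyTorch; the linear interpolation path, the velocity target, and the `flow_matching_loss` helper are illustrative assumptions, not the exact Lumina-T2X implementation.

```python
# Minimal flow matching training step (illustrative sketch, not Lumina-T2X code).
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, text_cond):
    """x1: clean latent tokens (B, L, D); text_cond: text-encoder features."""
    b = x1.shape[0]
    t = torch.rand(b, device=x1.device)            # timestep t ~ U(0, 1)
    x0 = torch.randn_like(x1)                      # Gaussian noise endpoint
    t_ = t.view(b, 1, 1)
    xt = (1.0 - t_) * x0 + t_ * x1                 # linear path from noise to data
    v_target = x1 - x0                             # constant velocity along the path
    v_pred = model(xt, t, text_cond)               # Flag-DiT predicts the velocity
    return F.mse_loss(v_pred, v_target)            # regress predicted velocity
```

At inference, the model integrates the learned velocity field from noise (t = 0) to data (t = 1), e.g. with an Euler or higher-order ODE solver.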

Any Modality with One Framework

The model can encode any modality, including images, videos, multi-views of 3D objects, and spectrograms, into a unified 1-D token sequence at any resolution, aspect ratio, and temporal duration.

Any Aspect Ratio with One Framework

Lumina-T2X naturally encodes any modality, regardless of resolution, aspect ratio, or temporal duration, into a unified 1-D token sequence akin to LLMs. Using Flag-DiT with text conditioning, it iteratively transforms noise into outputs across any modality, resolution, and duration at inference time.
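
As a rough illustration of this unified tokenization, the sketch below flattens an image latent and a video latent into 1-D token sequences with a hypothetical `patchify` helper; the patch size and tensor shapes are assumptions for exposition, not the actual Lumina-T2X code.

```python
# Flattening latents of different modalities into one 1-D token sequence
# (illustrative sketch; shapes and patch size are assumptions).
import torch

def patchify(frames: torch.Tensor, patch: int = 2) -> torch.Tensor:
    """frames: (T, C, H, W) -> patch grid (T, H//patch, W//patch, C*patch*patch)."""
    T, C, H, W = frames.shape
    x = frames.reshape(T, C, H // patch, patch, W // patch, patch)
    x = x.permute(0, 2, 4, 1, 3, 5)                # (T, h, w, C, p, p)
    return x.reshape(T, H // patch, W // patch, C * patch * patch)

image = torch.randn(1, 4, 32, 32)                  # one latent frame = an image
video = torch.randn(8, 4, 32, 32)                  # eight latent frames = a clip
img_tokens = patchify(image).flatten(0, 2)         # (256, 16) 1-D sequence
vid_tokens = patchify(video).flatten(0, 2)         # (2048, 16) 1-D sequence
```

Because every modality ends up as the same kind of 1-D sequence, a single Flag-DiT backbone can attend over all of them.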

Any Duration with One Framework

By introducing [nextline] and [nextframe] tokens, our model supports resolution extrapolation, i.e., generating images and videos at out-of-domain resolutions not encountered during training.
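
The sketch below illustrates the idea: explicit placeholder tokens mark where a line or a frame ends, so spatial structure travels with the sequence itself rather than being tied to a fixed grid size seen in training. The tensors and helper here are hypothetical stand-ins (the placeholders are learnable embeddings in practice).

```python
# Inserting [nextline] / [nextframe] placeholders while flattening a patch grid
# (illustrative sketch; in practice these are learnable embeddings).
import torch

D = 16                                             # token dimension (assumed)
NEXTLINE = torch.zeros(1, D)                       # stand-in [nextline] token
NEXTFRAME = torch.ones(1, D)                       # stand-in [nextframe] token

def flatten_with_placeholders(grid: torch.Tensor) -> torch.Tensor:
    """grid: (T, h, w, D) patch grid -> 1-D sequence with boundary tokens."""
    seq = []
    for frame in grid:                             # iterate over time steps
        for row in frame:                          # iterate over patch rows
            seq.append(row)                        # (w, D) tokens for one line
            seq.append(NEXTLINE)                   # mark the end of the line
        seq.append(NEXTFRAME)                      # mark the end of the frame
    return torch.cat(seq, dim=0)

# A grid of unseen height/width still flattens to a valid sequence, which is
# what enables resolution extrapolation at inference time.
seq = flatten_with_placeholders(torch.randn(2, 16, 16, D))
```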

Low Training Resources

Our Flag-DiT reduces the total number of training iterations needed, minimizing overall training time and computational resources. The default Lumina-T2I configuration, equipped with a 5B Flag-DiT and a 7B LLaMA text encoder, requires only 20% of the computational resources needed by PixArt-$\alpha$.