Multimodal generative models that can understand and generate across multiple modalities are dominated by autoregressive (AR) approaches, which process tokens sequentially from left to right, or top to bottom. These models jointly handle images, text, video, and audio for various tasks such as image captioning, question answering, and image generation. While AR models have been highly successful in the text domain, they have been found suboptimal for processing images, videos, and audio due to the high correlation between adjacent tokens, which wastes inference-time compute by predicting each token separately. In this work, we explore discrete diffusion models as a unified generative formulation in the joint text and image domain, building upon their recent success in the text domain alone. Discrete diffusion models offer several advantages over AR models, including improved control over quality versus diversity of generated samples, the ability to perform joint multimodal inpainting (across both text and image domains), and greater controllability in generation through guidance. Leveraging these benefits, we present the first Unified Multimodal Discrete Diffusion (UniDisc) model, which is capable of jointly processing text and images for a variety of downstream tasks. We compare UniDisc to multimodal AR models of similar capacity, demonstrating that UniDisc outperforms them in both quality and inference-time compute, offers enhanced controllability, editability, and inpainting, and provides a flexible trade-off between inference time and generation quality.
UniDisc is a unified multimodal discrete diffusion model that can jointly process and generate text and images. First, each modality is converted into a sequence of discrete tokens, and we randomly replace a subset of these tokens with the [MASK] token according to a noise schedule (denoted in the figure with grey boxes). We jointly denoise the image and text, supervising with a weighted cross-entropy loss. At inference time, we begin with a set of [MASK] tokens and iteratively unmask them.
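The masking step of this training procedure can be sketched as below; this is a minimal illustration of masked discrete diffusion, with the `MASK` id, the independent per-token masking, and the `1/t` loss weight all being our assumptions rather than the exact UniDisc implementation:

```python
import numpy as np

MASK = -1  # hypothetical id for the [MASK] token


def mask_tokens(tokens, t, rng):
    """Replace each token with [MASK] independently with probability t,
    the noise level at diffusion time t in [0, 1]."""
    tokens = np.asarray(tokens)
    mask = rng.random(tokens.shape) < t
    return np.where(mask, MASK, tokens), mask


def loss_weight(t):
    """Per-sample weight on the cross-entropy loss; 1/t is one common
    choice so lightly-masked samples are not under-weighted."""
    return 1.0 / max(t, 1e-6)
```

During training, a time `t` is drawn per sample, the concatenated image and text tokens are noised with `mask_tokens`, and the cross-entropy on the masked positions is scaled by `loss_weight(t)`.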
Scaling Analysis for AR and UniDisc models: (Left) IsoFLOP curves for UniDisc, varying model size at a fixed FLOP budget. (Right) Estimating the optimal parameter count for each budget (the minimum of the fitted parabola), we plot scaling laws for both AR and UniDisc. We find that UniDisc requires 13.2x more compute to achieve the same overall loss as AR.
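The isoFLOP analysis can be sketched as a quadratic fit of loss against log model size, whose minimum gives the estimated compute-optimal size at that budget; the function name is ours and the data would come from actual training runs:

```python
import numpy as np


def optimal_params(model_sizes, losses):
    """Fit a parabola to loss vs. log(model size) at one FLOP budget and
    return the argmin, i.e. the estimated compute-optimal parameter count."""
    x = np.log(np.asarray(model_sizes, dtype=float))
    a, b, _c = np.polyfit(x, np.asarray(losses, dtype=float), 2)
    return float(np.exp(-b / (2 * a)))  # vertex of a*x^2 + b*x + c
```

Repeating this over several FLOP budgets yields the (budget, optimal size) pairs from which the scaling law on the right is fit.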
UniDisc can generate images with far fewer forward passes than AR models.
Conditional generation results for both FID and CLIP metrics, across a range of CFG values. We find that AR is more sensitive to the CFG weighting, with a narrower optimal range. UniDisc achieves better FID and CLIP scores than unified autoregressive models such as Chameleon.
UniDisc can automatically improve a user-provided image and caption. We adopt a best-of-n sampling strategy with n distinct noise masks. We unroll each generation until completion and use the model's own likelihood to select the best generation.
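The best-of-n selection described above can be sketched as follows; `generate` and `log_likelihood` are hypothetical callables standing in for the UniDisc sampler (seeded with a distinct noise mask each call) and the model's own likelihood scorer:

```python
def best_of_n(generate, log_likelihood, n, rng):
    """Draw n full generations, each from a distinct noise mask, and keep
    the one the model itself scores highest under its own likelihood."""
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=log_likelihood)
```

The same mechanism works for joint image-text pairs, since the likelihood is computed over the concatenated token sequence.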
We augment real images by overlaying random objects from the COCO dataset. Similarly, we augment captions by asking an LLM to generate purposely incorrect variations. We then randomly mask the image and text inputs and unmask as described above, automatically removing these undesired image artifacts and generating the correct caption. There is no human intervention or masking in any examples. In the final row, we fix the text prompt, and only allow updates to the image.
Intermediate Steps during Joint Infilling of Image and Text. UniDisc jointly infills both image and text during generation.
To quantitatively analyze the generation order, we use a language-grounded segmentation model (Grounded SAM 2) to segment the image given the text prompt. We then record the order of token decoding when using confidence-based sampling and plot the progression of each region. We observe that the model generates uniformly over concepts and modalities. With AR models this is not possible, as the model must generate in a fixed order (e.g., text first, then raster order), and thus cannot jointly reason over modalities and multiple parts of the image.
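Confidence-based sampling can be sketched as a loop that repeatedly commits the masked position the model scores most confidently; here `confidence` is a fixed per-position score standing in for the model's predicted probabilities:

```python
def decode_order(confidence, masked_positions):
    """Return the order in which masked positions are revealed under
    confidence-based sampling: at each step, unmask the remaining
    position with the highest model confidence."""
    remaining = list(masked_positions)
    order = []
    while remaining:
        best = max(remaining, key=lambda j: confidence[j])
        order.append(best)
        remaining.remove(best)
    return order
```

In practice the confidences are recomputed after every unmasking step from a fresh forward pass; the static scores here are only for illustration.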
L2 distance between unconditional and conditional logits on currently masked tokens as sampling steps increase.
| Steps | CLIP Score |
|---|---|
| [1-3] | 0.301 |
| [12-14] | 0.293 |
| [22-24] | 0.283 |
| All (24) | 0.312 |
Comparing CLIP scores when applying CFG only on specific steps. This shows that CFG has the most impact on the initial denoising steps (total steps = 24).
CFG significantly impacts the performance difference between UniDisc and autoregressive (AR) models. To analyze this, we compare UniDisc with an AR model. The left figure shows the difference between conditional and unconditional predictions at various decoding stages. We observe that (a) the difference decreases as more tokens are decoded and (b) UniDisc maintains higher logit distances than AR throughout the process. We believe UniDisc's flexibility to generate tokens in any order allows it to maintain a larger gap between unconditional and conditional predictions, and thus to leverage CFG more effectively. The right table supports this: applying CFG in just the first 3 steps achieves a CLIP score similar to applying it throughout all steps, while applying CFG only in later steps has minimal impact, consistent with the conditional-unconditional distance shrinking as more tokens are decoded.
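The early-steps-only guidance from the ablation can be sketched as below; the function name and the default of 3 guided steps mirror the table but are otherwise illustrative, not the exact UniDisc code:

```python
import numpy as np


def cfg_logits(cond, uncond, w, step, cfg_steps=3):
    """Classifier-free guidance: move the logits away from the
    unconditional prediction by weight w, but only during the first
    `cfg_steps` denoising steps, where the ablation shows CFG matters."""
    cond = np.asarray(cond, dtype=float)
    uncond = np.asarray(uncond, dtype=float)
    if step < cfg_steps:
        return uncond + w * (cond - uncond)
    return cond
```

With `w = 1` this reduces to the conditional prediction; `w > 1` sharpens conditioning, which is where the quality/diversity trade-off noted in the abstract comes from.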
Visualization of the conditional and unconditional predictions, p(x_0), of UniDisc as the denoising step increases. At the 0th step, the unconditional prediction differs significantly from the conditional prediction. By just the 3rd step, however, the unconditional prediction closely matches the conditional one, demonstrating why autoregressive models obtain a smaller benefit from CFG than discrete diffusion models. For a quantitative analysis, please refer to the CFG Analysis section.
0.15 steps/token: contributions the forest sun Sun fruit untouchical undorn trees unicorn prancing through beauty fruit fruit fruit under dapp sun sun unappear lightlight field on fruit fruit sun everywhere
0.50 steps/token: sunlight breaking through a forest of trees and into a cushappled meadow where a unicorn runs free among the fruitland vegetables
1.00 steps/token: A lush green forest with a sun blue sky brightly, shining through the trees and casting long shadows. A unicorn with a spiraled horn and a green mane is visible in the foreground. In the bottom right corner, there are various fruits and leaves including apples, oranges. There are also some yellow flowers scattered in the lush green grass.
0.15 steps/token: panda with waving paw wavingaving nature panda withaving human thumb weaving Hawaiian print shirt
0.50 steps/token: cute adult panda in a Hawaiian shirt waving baby gorilla in a Hawaiian shirt
1.00 steps/token: Panda with floral shirt, waving to zoo visitor, tropical background
0.15 steps/token: intricatecut clock clock clock clock clock drawing drawing clock drawing drawing drawing detailed carv stone cathedral building facade
0.50 steps/token: etching of a medieval cathedral clock's intricate stone carvings
1.00 steps/token: Detailed drawing of clock on the facade of a historic, medieval-style monastery, with intricate stone carvings.
0.15 steps/token: futurical burgereburgerurgerurger azure motor bike neon cityss chrome
0.50 steps/token: Silver motorcycle, hamburger sculpture, neon signs, cyberpunk cityscape
1.00 steps/token: A metallic hamburger motorcycle on a neon-lit cityscape
Above: UniDisc's text generation quality varies with steps-per-token ratio. At low ratios (0.15), text is incoherent with repeated words and grammar errors. As the ratio increases (0.50), coherence improves but descriptions remain brief. At higher ratios (1.00), generations become detailed and descriptive with proper structure, demonstrating the quality-compute tradeoff available with discrete diffusion models.
a cozy cabin's interior, including a wooden bed and a stone fireplace (shown at 0.01, 0.03, 0.05, and 0.10 steps/token)
a sculpture of a bird by Paolo Uccello (shown at 0.01, 0.03, 0.05, and 0.10 steps/token)
colorful small to massive vases artfully arranged in front of a colorful street art mural (shown at 0.01, 0.03, 0.05, and 0.10 steps/token)
A stylized digital rendering of Mount Olympus from ancient Greek mythology (shown at 0.01, 0.03, 0.05, and 0.10 steps/token)
Above: UniDisc's text-to-image generation quality varies with steps-per-token ratio. At a very low ratio (0.01), the image is reasonable but lower quality; by 0.03 to 0.05, the quality saturates, and a higher ratio of 0.10 yields no further improvement.
To take advantage of this, we design a novel multimodal caching mechanism that allows UniDisc to reuse the same denoising steps for specific modalities, reducing overall inference time.
We maintain different noising schedules for image and text tokens, effectively setting a larger \(dt_{\text{image}}\) and a smaller \(dt_{\text{text}}\).
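One way to realize this caching idea is a per-step plan of which modalities get a fresh denoising pass; the function and its `image_every` parameter are our hypothetical sketch, not the exact UniDisc mechanism:

```python
def denoise_schedule(n_steps, image_every=4):
    """Per-step modality plan: text tokens are re-denoised every step
    (small dt_text), image tokens only every `image_every` steps (large
    dt_image); on the remaining steps, cached image predictions are
    reused, cutting forward-pass cost."""
    return [("text", "image") if s % image_every == 0 else ("text",)
            for s in range(n_steps)]
```

This exploits the observation above that image quality saturates at far fewer steps per token than text quality.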
| Configuration | DataComp1B Validation PPL |
|---|---|
| UniDisc | 93.8 |
| w/o QK Norm | 92.7 |
| w/ Zero-linear init | 93.8 |
| w/o RMSNorm | 93.8 |
| w/o -inf for invalid tokens | 94.7 |
| w/o Softmin SNR | 109.6 |
| None (Baseline/AR) | 111.2 |
Table 4. Ablations with a 115M-parameter model: QK Norm, zero initialization of linear layers, RMSNorm, setting invalid tokens to \(-\infty\) during training and generation, and Softmin SNR.