UniDisc: Unified Multimodal Discrete Diffusion

Anonymous Authors

Abstract

Multimodal generative models that can understand and generate across multiple modalities are dominated by autoregressive (AR) approaches, which process tokens sequentially from left to right or top to bottom. These models jointly handle images, text, video, and audio for various tasks such as image captioning, question answering, and image generation. While AR models have been highly successful in the text domain, they have been found suboptimal for processing images, video, and audio, as the high correlation between adjacent tokens wastes inference-time compute on predicting each one separately. In this work, we explore discrete diffusion models as a unified generative formulation for the joint text and image domain, building on their recent success in the text domain alone. Discrete diffusion models offer several advantages over AR models, including improved control over the quality-versus-diversity trade-off of generated samples, the ability to perform joint multimodal inpainting (across both text and image domains), and greater controllability in generation through guidance. Leveraging these benefits, we present the first Unified Multimodal Discrete Diffusion (UniDisc) model, capable of jointly processing text and images for a variety of downstream tasks. We compare UniDisc to multimodal AR models of similar capacity and demonstrate that UniDisc outperforms them in generation quality and inference-time compute, while offering enhanced controllability, editability, inpainting, and a flexible trade-off between inference time and generation quality.

Interactive demo: build a caption from word pairs (with [MASK] tokens) and provide an input image; UniDisc jointly inpaints the masked regions to produce the output image and model response.

Demo Video

Demo video: UniDisc can jointly inpaint image-text pairs.

UniDisc Overview

Modality Plot

UniDisc is a unified multimodal discrete diffusion model that can jointly process and generate text and images. First, each modality is converted into a sequence of discrete tokens, and we randomly replace a subset of these tokens with the [MASK] token according to a noise schedule (denoted by grey boxes in the figure). We jointly denoise the image and text, supervising with a weighted cross-entropy loss. At inference time, we begin with a set of [MASK] tokens and iteratively unmask them.
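A minimal sketch of this training step in PyTorch, assuming a linear masking schedule and a 1/t loss reweighting; the paper's exact weighting (e.g., Softmin SNR, see the ablation below) may differ:

```python
import torch
import torch.nn.functional as F

def training_step(model, tokens, mask_id, vocab_size):
    """tokens: (B, L) concatenated image and text token ids."""
    B, L = tokens.shape
    # Sample a per-sequence noise level t ~ U(0, 1); a linear schedule is assumed.
    t = torch.rand(B, 1, device=tokens.device)
    # Independently replace each token with [MASK] with probability t.
    is_masked = torch.rand(B, L, device=tokens.device) < t
    noisy = torch.where(is_masked, torch.full_like(tokens, mask_id), tokens)
    # Jointly denoise image and text: predict the clean token at every position.
    logits = model(noisy)  # (B, L, vocab_size)
    ce = F.cross_entropy(logits.reshape(-1, vocab_size), tokens.reshape(-1),
                         reduction="none").view(B, L)
    # Reweight by noise level and average over masked positions only.
    loss = (ce * is_masked / t.clamp(min=1e-3)).sum() / is_masked.sum()
    return loss
```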

UniDisc Training Scaling Laws

Scaling Laws
NAR vs AR Scaling

Scaling analysis for AR and UniDisc models: (Left) IsoFLOP curves for UniDisc, varying model size at a fixed FLOP budget. (Right) Estimating the optimal parameter count for each budget as the minimum of the fitted parabola, we plot scaling laws for both AR and UniDisc. We find that UniDisc requires 13.2x more compute to achieve the same overall loss as AR.

UniDisc vs. Autoregressive at Inference

UniDisc can generate images with far fewer forward passes than AR models.

Concept Generation

Conditional generation results for both FID and CLIP metrics across a range of CFG values. We find that AR is more sensitive to the CFG weighting, with a narrower optimal range, and that UniDisc achieves better FID and CLIP scores than unified autoregressive models such as Chameleon.

Zero-shot Multimodal Editing with UniDisc

Zero-shot Multimodal Editing Results

Image Inpainting

UniDisc can automatically improve a user-provided image and caption. We adopt a best-of-n sampling strategy with n distinct noise masks. We unroll each generation until completion and use the model's own likelihood to select the best generation.
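A sketch of this best-of-n procedure; `unmask_until_done` and `sequence_log_likelihood` are hypothetical helpers standing in for the model's iterative denoiser and its likelihood scoring:

```python
import torch

def best_of_n_edit(model, tokens, mask_id, n=8, mask_prob=0.5):
    best, best_score = None, float("-inf")
    for _ in range(n):
        # Each candidate uses a distinct random noise mask over image+text tokens.
        is_masked = torch.rand_like(tokens, dtype=torch.float) < mask_prob
        noisy = torch.where(is_masked, torch.full_like(tokens, mask_id), tokens)
        # Unroll the denoiser until no [MASK] tokens remain (hypothetical helper).
        candidate = unmask_until_done(model, noisy)
        # Rank candidates by the model's own likelihood (hypothetical helper).
        score = sequence_log_likelihood(model, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best
```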


We augment real images by overlaying random objects from the COCO dataset. Similarly, we augment captions by asking an LLM to generate purposely incorrect variations. We then randomly mask the image and text inputs and unmask as described above, automatically removing these undesired image artifacts and generating the correct caption. There is no human intervention or masking in any examples. In the final row, we fix the text prompt, and only allow updates to the image.

Generation Analysis

Joint Infilling

Intermediate steps during joint infilling: UniDisc infills both image and text simultaneously during generation.

Uniform Concept Generation

Concept Generation

To quantitatively analyze the generation order, we use a language-grounded segmentation model (Grounded SAM 2) to segment the image given the text prompt. We then record the order in which tokens are decoded under confidence-based sampling and plot the progression of each region. We observe that the model generates uniformly across concepts and modalities. This is not possible with AR models, which must generate in a fixed order (e.g., text first, then raster order) and therefore cannot jointly reason over modalities and multiple parts of the image.
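A minimal sketch of confidence-based unmasking with the per-token decode-order bookkeeping used in this analysis; the `model` interface and the even per-step unmasking budget are assumptions:

```python
import torch

@torch.no_grad()
def confidence_decode(model, tokens, mask_id, steps):
    # tokens: (L,) sequence starting as all [MASK]; decoded in place.
    order = torch.full_like(tokens, -1)  # records the step each token was decoded
    for step in range(steps):
        logits = model(tokens.unsqueeze(0)).squeeze(0)  # (L, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        conf[tokens != mask_id] = -1.0  # only still-masked positions compete
        remaining = int((tokens == mask_id).sum())
        k = max(1, remaining // (steps - step))  # unmask an even share per step
        idx = conf.topk(min(k, remaining)).indices
        tokens[idx] = pred[idx]
        order[idx] = step
    return tokens, order  # `order` reveals which regions/modalities decode first
```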

Classifier-Free Guidance Analysis

CFG Distance vs Percent Tokens

L2 distance between unconditional and conditional logits on currently masked tokens as sampling steps increase.

CFG applied at steps    CLIP Score
[1-3]                   0.301
[12-14]                 0.293
[22-24]                 0.283
All (24)                0.312

CLIP scores when CFG is applied only at specific denoising steps. CFG has the most impact on the initial denoising steps (total steps = 24).

CFG significantly impacts the performance gap between UniDisc and AR models. To analyze this, we compare UniDisc with an AR model. The left figure shows the difference between conditional and unconditional predictions at various decoding stages. We observe that (a) the difference decreases as more tokens are decoded, and (b) UniDisc maintains higher logit distances than AR throughout the process. We believe UniDisc's flexibility to generate tokens in any order allows it to keep a larger distance between unconditional and conditional predictions, letting it leverage CFG more effectively. The table on the right supports this: applying CFG in just the first 3 steps achieves a CLIP score similar to applying it throughout all steps, while later-step CFG has minimal impact, consistent with the conditional-unconditional distance shrinking as more tokens are decoded.
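A sketch of CFG applied only during the first few denoising steps, as in the table above; the `model(tokens, cond=...)` interface and the guidance weight are illustrative assumptions:

```python
def guided_logits(model, tokens, cond, step, w=5.0, cfg_steps=3):
    # The conditional pass is always needed.
    cond_logits = model(tokens, cond=cond)
    if step >= cfg_steps:
        # Later steps: CFG has little effect, so skip the unconditional pass.
        return cond_logits
    uncond_logits = model(tokens, cond=None)
    # Standard CFG extrapolation in logit space with guidance weight w.
    return uncond_logits + w * (cond_logits - uncond_logits)
```

Skipping the unconditional pass after the first few steps also halves the number of forward passes for the remainder of decoding.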

Qualitative CFG Visualization

Visualization of UniDisc's conditional and unconditional predictions of \(p(x_0)\) as the denoising step increases. At the 0th step, the unconditional prediction differs significantly from the conditional prediction; by just the 3rd step, the unconditional prediction closely matches the conditional one, illustrating why autoregressive models obtain a smaller benefit from CFG than discrete diffusion models. For a quantitative analysis, see the CFG Analysis section above.

Qualitative Generation with Varying Steps/Token Ratio


0.15 steps/token: contributions the forest sun Sun fruit untouchical undorn trees unicorn prancing through beauty fruit fruit fruit under dapp sun sun unappear lightlight field on fruit fruit sun everywhere

0.50 steps/token: sunlight breaking through a forest of trees and into a cushappled meadow where a unicorn runs free among the fruitland vegetables

1.00 steps/token: A lush green forest with a sun blue sky brightly, shining through the trees and casting long shadows. A unicorn with a spiraled horn and a green mane is visible in the foreground. In the bottom right corner, there are various fruits and leaves including apples, oranges. There are also some yellow flowers scattered in the lush green grass.

Generation 1

0.15 steps/token: panda with waving paw wavingaving nature panda withaving human thumb weaving Hawaiian print shirt

0.50 steps/token: cute adult panda in a Hawaiian shirt waving baby gorilla in a Hawaiian shirt

1.00 steps/token: Panda with floral shirt, waving to zoo visitor, tropical background

Generation 2

0.15 steps/token: intricatecut clock clock clock clock clock drawing drawing clock drawing drawing drawing detailed carv stone cathedral building facade

0.50 steps/token: etching of a medieval cathedral clock's intricate stone carvings

1.00 steps/token: Detailed drawing of clock on the facade of a historic, medieval-style monastery, with intricate stone carvings.

Generation 3

0.15 steps/token: futurical burgereburgerurgerurger azure motor bike neon cityss chrome

0.50 steps/token: Silver motorcycle, hamburger sculpture, neon signs, cyberpunk cityscape

1.00 steps/token: A metallic hamburger motorcycle on a neon-lit cityscape

Above: UniDisc's text generation quality varies with the steps-per-token ratio. At a low ratio (0.15), text is incoherent, with repeated words and grammatical errors. As the ratio increases (0.50), coherence improves but descriptions remain brief. At a higher ratio (1.00), generations become detailed and well-structured, demonstrating the quality-compute tradeoff available with discrete diffusion models.

"a cozy cabin's interior, including a wooden bed and a stone fireplace" (image generations at 0.01, 0.03, 0.05, and 0.10 steps/token)

"a sculpture of a bird by Paolo Uccello" (image generations at 0.01, 0.03, 0.05, and 0.10 steps/token)

"colorful small to massive vases artfully arranged in front of a colorful street art mural" (image generations at 0.01, 0.03, 0.05, and 0.10 steps/token)

"A stylized digital rendering of Mount Olympus from ancient Greek mythology" (image generations at 0.01, 0.03, 0.05, and 0.10 steps/token)

Above: UniDisc's text-to-image generation quality varies with the steps-per-token ratio. At a very low ratio (0.01), the image is reasonable but lower quality; by 0.03 to 0.05, image quality saturates, and a higher ratio of 0.10 yields no further improvement.

Multimodal Caching

Modality Plot

As shown above, image generations saturate at a far lower steps-per-token ratio than text. To take advantage of this, we design a novel multimodal caching mechanism that allows UniDisc to reuse the same denoising steps for specific modalities, reducing overall inference time.

Text Generation from Image

We maintain different noising schedules for image and text tokens, effectively setting a larger \(dt_{\text{image}}\) and a smaller \(dt_{\text{text}}\).
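A sketch of how such a two-rate schedule enables caching, under the assumption that image tokens finish unmasking after k steps; `denoise_step` and `model.encode_kv` are hypothetical helpers standing in for the actual implementation:

```python
def cached_decode(model, tokens, img_slice, txt_slice, k, total_steps):
    # Image tokens follow a coarse schedule (large dt_image) and are fully
    # unmasked after k steps; text follows a finer schedule (small dt_text).
    kv_cache = None
    for step in range(total_steps):
        if step < k:
            # Early steps: denoise both modalities jointly, no caching yet.
            tokens = denoise_step(model, tokens, active=(img_slice, txt_slice))
        else:
            if kv_cache is None:
                # Image is now static: compute and store its activations once.
                kv_cache = model.encode_kv(tokens[img_slice])
            # Remaining steps: only text positions are recomputed, attending
            # to the cached image keys/values.
            tokens = denoise_step(model, tokens, active=(txt_slice,),
                                  kv_cache=kv_cache)
    return tokens
```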

Caching Inference Analysis


Latency vs. sequence length for our caching approach, with an image-to-text token ratio of 1/4. We empirically set k = 10 based on the saturation steps for image-to-text generation.

Design Choices Ablation Study

Configuration                    DataComp1B Validation PPL
UniDisc                          93.8
w/o QK Norm                      92.7
w/ Zero-linear init              93.8
w/o RMSNorm                      93.8
w/o -inf for invalid tokens      94.7
w/o Softmin SNR                  109.6
None (Baseline/AR)               111.2

Ablation with a 115M-parameter model: QK norm, zero initialization of linear layers, RMSNorm, setting invalid tokens to \(-\infty\) during training and generation, and Softmin SNR.
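For the "-inf for invalid tokens" row, a minimal sketch of what such masking might look like over a joint vocabulary; the layout (text ids first, image codebook ids after) is an assumption for illustration:

```python
import torch

def mask_invalid(logits, is_text_pos, text_vocab):
    # logits: (B, L, V) with V = text_vocab + image_vocab; text ids come first.
    # is_text_pos: (B, L) bool, True where the position holds a text token.
    neg_inf = torch.finfo(logits.dtype).min
    logits[is_text_pos, text_vocab:] = neg_inf   # no image tokens at text slots
    logits[~is_text_pos, :text_vocab] = neg_inf  # no text tokens at image slots
    return logits
```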

Dataset Visualization - Reconstruction with VQ16 Tokenizer

Dataset Visualization with Tokenizer

Reconstruction of samples from the synthetic dataset, encoded from 512x512 with a VQ16 tokenizer.

Clevr VQA Evaluation

Clevr VQA

Validation loss on the Clevr VQA dataset, comparing UniDisc (purple) and AR (red) models. UniDisc achieves and maintains a lower validation loss.

Qualitative Effect of Classifier-Free Guidance

Classifier-Free Guidance

Effect of classifier-free guidance in UniDisc; guidance strength increases from left to right.