DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation

Fig. 1: Selected samples from our class-conditional DPAR-384-XL model trained on ImageNet.

Abstract

Decoder-only autoregressive image generation typically relies on fixed-length tokenization schemes whose token counts grow quadratically with resolution, substantially increasing the computational and memory demands of attention. We present DPAR, a novel decoder-only autoregressive model that dynamically aggregates image tokens into a variable number of patches for efficient image generation. Our work is the first to demonstrate that next-token prediction entropy from a lightweight, unsupervised autoregressive model provides a reliable criterion for merging tokens into larger patches based on information content. DPAR makes minimal modifications to the standard decoder architecture, which preserves compatibility with multimodal generation frameworks, and it allocates more compute to high-information image regions. Further, we demonstrate that training with dynamically sized patches yields representations that are robust to patch boundaries, allowing DPAR to scale to larger patch sizes at inference. DPAR reduces token count by 1.81× and 2.06× for ImageNet generation at 256 and 384 resolution, respectively, which cuts training FLOPs by up to 40.4%. Our method also converges faster and improves FID by up to 29.6% relative to baseline models.

Background

Autoregressive (AR) image generation with Transformer backbones faces two core challenges:

Quadratic complexity of attention
Quadratic growth in token count with image resolution: for example, HD images require ~4,024 tokens and 4K images require ~65K tokens for generation.
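
The two challenges compound: the token count grows with the square of the image side, and attention cost grows with the square of the token count. A minimal sketch of that arithmetic follows, assuming a grid tokenizer that downsamples each side by 16 (the factor and the `token_count` helper are illustrative assumptions, not stated on this page). Under that assumption a 1024×1024 image yields 4,096 tokens and a 4096×4096 image yields 65,536, in line with the approximate figures quoted above.

```python
def token_count(height: int, width: int, downsample: int = 16) -> int:
    """Tokens emitted by a grid tokenizer (downsample factor is an assumption)."""
    return (height // downsample) * (width // downsample)


for side in (256, 512, 1024, 4096):
    n = token_count(side, side)
    # Full self-attention scores every token against every other: n * n pairs.
    print(f"{side}x{side}: {n} tokens, {n * n} attention pairs")
```

Doubling the image side multiplies the token count by 4 and the number of attention pairs by 16.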

Prior works [1,2,3] alleviate computational demands through architectural improvements, but these approaches suffer from efficiency–performance tradeoffs. In contrast, direct strategies aimed at reducing token count remain underexplored. DPAR addresses this gap by proposing a novel method for dynamically merging tokens based on local information content (see Fig. 2).

Method

We hypothesize that images can be dynamically represented using a variable number of tokens based on their information content, motivated by two observations:

Images often contain redundant low-information regions (e.g., sky) that can be represented with fewer tokens.
Low-detail regions should be generated with lower computational cost — an idea recently explored in language models like BLT [4].

Our method estimates next-token prediction entropy using a lightweight, unsupervised AR model as a proxy for local image information content, then merges tokens into patches for efficient transformer-based AR generation (see Fig. 3).
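
The idea can be sketched in a few lines. The example below is a toy illustration, not the DPAR implementation: it computes Shannon entropy from hand-written next-token distributions (standing in for the output of the lightweight AR model) and applies a hypothetical global-threshold rule, in the spirit of BLT [4], where a high-entropy token starts a new patch and low-entropy tokens join the current one. The names `entropy` and `patch_boundaries`, the threshold value, and the 4-entry codebook are all assumptions made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def patch_boundaries(entropies, threshold):
    """Group consecutive token indices into patches.

    Hypothetical rule: a token whose entropy exceeds `threshold` starts a new
    patch; a low-entropy token is appended to the current patch.
    """
    patches = []
    for i, h in enumerate(entropies):
        if i == 0 or h > threshold:
            patches.append([i])
        else:
            patches[-1].append(i)
    return patches


# Toy next-token distributions over a 4-entry codebook, one per position.
dists = [
    [0.25, 0.25, 0.25, 0.25],  # uncertain prediction: high entropy
    [0.97, 0.01, 0.01, 0.01],  # predictable token: low entropy
    [0.97, 0.01, 0.01, 0.01],
    [0.40, 0.30, 0.20, 0.10],  # uncertain again
    [0.97, 0.01, 0.01, 0.01],
]
entropies = [entropy(d) for d in dists]
patches = patch_boundaries(entropies, threshold=1.0)
print(patches)  # [[0, 1, 2], [3, 4]]
```

Five tokens collapse into two patches here, so the downstream transformer would process two positions instead of five.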

Fig. 2 — Entropy-guided Token Aggregation. (Top) Input images with increasing information content. (Bottom) Corresponding entropy heatmaps, where warmer colors indicate higher next-token prediction entropy. Low-information regions yield low-entropy tokens that are merged into patches, while high-information regions retain fine-grained token-level granularity. Black outlines denote final patch boundaries for 256×256 images.

Fig. 3 — Overview of DPAR. (a) Conventional AR image generation employs decoder-only transformers that operate on a fixed number of tokens per image, where the token count scales quadratically with image resolution. (b) DPAR dynamically aggregates image tokens based on information content, producing a variable number of patches per image. Decoder-only transformers then operate on this reduced set of patches, lowering both computational and memory overhead of attention. DPAR requires minimal modifications to the standard decoder architecture, preserving compatibility with multimodal generation frameworks.

Key Results

Result 1 — Efficient Model Scaling

1.81× / 2.06× token reduction on ImageNet-256 / 384, with the efficiency increasing with resolution.

Fig. 4 — Efficient Model Scaling. DPAR maintains competitive FID across model sizes (S → XL) while using significantly fewer FLOPs per training step, showing dynamic patchification scales favorably with model capacity.
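
To give a rough sense of what these reduction factors mean for sequence length, the sketch below converts them into average patch counts. It assumes a 16× downsampling tokenizer, so 256 tokens at resolution 256 and 576 at resolution 384; this factor and the `average_patches` helper are assumptions for illustration. The squared ratio covers only the pairwise attention term, so it is not a reproduction of the 40.4% training-FLOPs figure, which accounts for the whole model.

```python
def average_patches(resolution, reduction, downsample=16):
    """Return (token count, average patch count) for a square image.

    `downsample` is an assumed tokenizer factor; `reduction` is the reported
    token-reduction ratio.
    """
    tokens = (resolution // downsample) ** 2
    return tokens, tokens / reduction


for resolution, reduction in ((256, 1.81), (384, 2.06)):
    tokens, patches = average_patches(resolution, reduction)
    # Pairwise attention work scales with the square of sequence length.
    attn_ratio = (patches / tokens) ** 2
    print(f"{resolution}px: {tokens} tokens -> ~{patches:.0f} patches, "
          f"attention pairs x{attn_ratio:.2f}")
```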

Result 2 — Efficient Resolution Scaling

Sub-quadratic patch count growth with resolution — along with consistently lower FID at every resolution.

Fig. 5 — Efficient Resolution Scaling. Comparison of DPAR-B and LlamaGen-B at resolutions 128–512. DPAR yields sub-quadratic growth in patch count and achieves consistently lower FID at every resolution.

Result 3 — Improved Generation Quality

Up to 29.6% FID improvement vs. LlamaGen — dynamic patching acts as an implicit regularizer that improves image generation.

Fig. 6 — Improved Generation Quality. DPAR achieves lower FID than fixed-token LlamaGen baselines at matched compute budgets. Training with variable-length patches acts as an implicit regularizer, improving image generation.

Result 4 — Faster Training Convergence

Faster convergence across all model sizes — DPAR consistently outperforms LlamaGen throughout training on ImageNet-384.

Fig. 7: Faster Training Convergence. FID vs. training epochs on ImageNet-384 for the B, L, and XL variants. DPAR consistently achieves lower FID throughout training, demonstrating both faster convergence and better final quality.

Result 5 — Adaptive Patching at Inference

Zero-cost inference scaling: patch size can be increased at test time, without retraining, for further efficiency gains with minimal impact on quality.

Fig. 8 — Adaptive Patching at Inference. A model trained with one patch configuration generalizes to larger patches at test time, offering a flexible speed–quality tradeoff without retraining.
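
One way to picture this tradeoff is as a cap on patch length that is raised at test time. The sketch below is a toy model under stated assumptions: the `merge_with_cap` helper, the threshold rule, and the `max_patch_size` parameter are hypothetical and are not taken from the DPAR code. It shows that, for a fixed entropy trace, a larger cap yields fewer patches in low-entropy regions while high-entropy tokens still start their own patches.

```python
def merge_with_cap(entropies, threshold, max_patch_size):
    """Return patch lengths for a 1D entropy trace.

    Hypothetical rule: start a new patch when the token's entropy exceeds
    `threshold` or when the current patch already holds `max_patch_size` tokens.
    """
    lengths = []
    for i, h in enumerate(entropies):
        if i == 0 or h > threshold or lengths[-1] >= max_patch_size:
            lengths.append(1)
        else:
            lengths[-1] += 1
    return lengths


# Toy trace: two detailed tokens followed by a long flat region.
entropies = [2.1, 1.9] + [0.1] * 10

for cap in (2, 4, 8):
    lengths = merge_with_cap(entropies, threshold=1.0, max_patch_size=cap)
    print(f"max_patch_size={cap}: {len(lengths)} patches {lengths}")
```

On this trace the patch count falls from 7 to 4 to 3 as the cap grows from 2 to 4 to 8, with all 12 tokens still covered.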

References

[1] Havtorn, Jakob Drachmann, et al. "MSViT: Dynamic mixed-scale tokenization for vision transformers." ICCV 2023.
[2] Shen, Junhong, et al. "CAT: Content-adaptive image tokenization." arXiv:2501.03120 (2025).
[3] Ma, Xu, et al. "Token-Shuffle: Towards high-resolution image generation with autoregressive models." arXiv:2504.17789 (2025).
[4] Pagnoni, Artidoro, et al. "Byte latent transformer: Patches scale better than tokens." arXiv:2412.09871 (2024).
[5] Pang, Ziqi, et al. "RandAR: Decoder-only autoregressive visual generation in random orders." CVPR 2025.

Acknowledgements

We thank the authors of LlamaGen for releasing their code, which served as the starting point for our implementation.

@InProceedings{Srivastava_2026_CVPR,
  author    = {Srivastava, Divyansh and Mehra, Akshay and Maneriker, Pranav and Sanyal, Debopam and Raj, Vishnu and Kamarshi, Vijay and Du, Fan and Kimball, Joshua},
  title     = {DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2026},
  pages     = {23215-23226}
}