TUNA: Taming Unified Visual Representations for
Native Unified Multimodal Models

¹Meta BizAI  ²HKU  ³University of Waterloo  ⁴KAUST
Joint first authors, listed alphabetically by last name. Core contributors. *Joint project lead.

Introducing TUNA, a family of native unified multimodal models

  • TUNA leverages unified visual representations to enable image/video understanding, image/video generation, and image editing within a single framework (see the sketch after this list).
  • Our extensive experiments show that TUNA's unified visual representation is highly effective, achieving state-of-the-art performance across multiple multimodal understanding and generation tasks.
  • Our comprehensive ablation studies demonstrate that our unified visual representation design outperforms both prior methods built on unified representations and models that rely on decoupled representations.
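
To make the single-framework point above concrete, here is a minimal, purely illustrative Python sketch. It is not the released TUNA API: every name in it (UnifiedModel, VisualTokens, encode, understand, generate, edit) is a hypothetical stand-in, intended only to show how one shared visual-token space could back all three capabilities.

    # Minimal illustrative sketch -- NOT the released TUNA API. All names here
    # are hypothetical stand-ins; the point is only that one shared visual-token
    # space can serve understanding, generation, and editing in a single model.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class VisualTokens:
        """Stand-in for the unified visual representation shared by all tasks."""
        data: List[int]

    class UnifiedModel:
        def encode(self, pixels: List[int]) -> VisualTokens:
            # A single encoder maps any visual input into the shared token space.
            return VisualTokens(data=list(pixels))

        def understand(self, tokens: VisualTokens, question: str) -> str:
            # Understanding consumes the shared tokens and produces text.
            return f"answer based on {len(tokens.data)} visual tokens: {question}"

        def generate(self, prompt: str) -> VisualTokens:
            # Generation produces tokens in the same space, later decoded to pixels.
            return VisualTokens(data=[0] * len(prompt))

        def edit(self, tokens: VisualTokens, instruction: str) -> VisualTokens:
            # Editing transforms tokens within the same space (dummy transform here).
            return VisualTokens(data=tokens.data[::-1])

    model = UnifiedModel()
    tokens = model.encode(pixels=[1, 2, 3])
    print(model.understand(tokens, "What is shown?"))  # understanding
    video = model.generate("a cat surfing")            # generation
    edited = model.edit(tokens, "make it night")       # editing

In this sketch, understanding, generation, and editing all read from or write to the same VisualTokens space, which is the design property the bullets above claim for TUNA's unified representation.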

Text-to-Video Generation

All videos have a resolution of 384×672 and a frame rate of 12 fps.




Citation

If you find our work helpful, please cite our paper:

@misc{liu2025tunatamingunifiedvisual,
  title={TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models},
  author={Zhiheng Liu and Weiming Ren and Haozhe Liu and Zijian Zhou and Shoufa Chen and Haonan Qiu and Xiaoke Huang and Zhaochong An and Fanny Yang and Aditya Patel and Viktar Atliha and Tony Ng and Xiao Han and Chuyan Zhu and Chenyang Zhang and Ding Liu and Juan-Manuel Perez-Rua and Sen He and Jürgen Schmidhuber and Wenhu Chen and Ping Luo and Wei Liu and Tao Xiang and Jonas Schult and Yuren Cong},
  year={2025},
  eprint={2512.02014},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.02014},
}