TUNA: Taming Unified Visual Representations for
Native Unified Multimodal Models

¹Meta BizAI  ²HKU  ³University of Waterloo  ⁴KAUST
Joint first authors, listed alphabetically by last name. Core contributors. *Joint project lead.

Introducing TUNA, a family of native unified multimodal models

  • TUNA leverages unified visual representations to enable image/video understanding, image/video generation, and image editing within a single framework (see the sketch after this list).
  • Our extensive experiments show that TUNA's unified visual representation is highly effective, achieving state-of-the-art performance across multiple multimodal understanding and generation tasks.
  • Our comprehensive ablation studies demonstrate that our unified visual representation design outperforms both prior methods built on unified representations and models that rely on decoupled representations.
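
To make the single-framework point above concrete, here is a minimal, purely illustrative Python sketch. It is not the released TUNA API: every name in it (UnifiedModel, VisualTokens, encode, understand, generate, edit) is a hypothetical stand-in, intended only to show how one shared visual-token space could back all three capabilities.

    # Minimal illustrative sketch -- NOT the released TUNA API. All names here
    # are hypothetical stand-ins; the point is only that one shared visual-token
    # space can serve understanding, generation, and editing in a single model.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class VisualTokens:
        """Stand-in for the unified visual representation shared by all tasks."""
        data: List[int]

    class UnifiedModel:
        def encode(self, pixels: List[int]) -> VisualTokens:
            # A single encoder maps any visual input into the shared token space.
            return VisualTokens(data=list(pixels))

        def understand(self, tokens: VisualTokens, question: str) -> str:
            # Understanding consumes the shared tokens and produces text.
            return f"answer based on {len(tokens.data)} visual tokens: {question}"

        def generate(self, prompt: str) -> VisualTokens:
            # Generation produces tokens in the same space, later decoded to pixels.
            return VisualTokens(data=[0] * len(prompt))

        def edit(self, tokens: VisualTokens, instruction: str) -> VisualTokens:
            # Editing transforms tokens within the same space (dummy transform here).
            return VisualTokens(data=tokens.data[::-1])

    model = UnifiedModel()
    tokens = model.encode(pixels=[1, 2, 3])
    print(model.understand(tokens, "What is shown?"))  # understanding
    video = model.generate("a cat surfing")            # generation
    edited = model.edit(tokens, "make it night")       # editing

In this sketch, understanding, generation, and editing all read from or write to the same VisualTokens space, which is the design property the bullets above claim for TUNA's unified representation.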

Text-to-Video Generation

All videos have a resolution of 384×672 and a frame rate of 12 fps.




Citation

If you find our work helpful, please cite our paper:

@misc{liu2025tunatamingunifiedvisual,
  title={TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models},
  author={Zhiheng Liu and Weiming Ren and Haozhe Liu and Zijian Zhou and Shoufa Chen and Haonan Qiu and Xiaoke Huang and Zhaochong An and Fanny Yang and Aditya Patel and Viktar Atliha and Tony Ng and Xiao Han and Chuyan Zhu and Chenyang Zhang and Ding Liu and Juan-Manuel Perez-Rua and Sen He and Jürgen Schmidhuber and Wenhu Chen and Ping Luo and Wei Liu and Tao Xiang and Jonas Schult and Yuren Cong},
  year={2025},
  eprint={2512.02014},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.02014},
}