Can Sound Replace Vision in LLaVA With Token Substitution?

A Systematic Investigation of Audio-Visual Alignment Trade-offs in Multimodal Systems

Ali Vosoughi, Jing Bi, Pinxin Liu, Yunlong Tang, Chenliang Xu

Computer Science Department, University of Rochester, NY, USA

Abstract

What happens when we push audio-visual alignment to its absolute limits? To systematically investigate this question, we needed datasets with granular alignment quality annotations, but existing datasets treat alignment as binary, either synchronized or not. To address this limitation, we developed a comprehensive dataset featuring detailed alignment scores that reveal the hidden spectrum of audio-visual perceptual correspondence. Using these precise scores, we create "superaligned" representations by training exclusively on the most perfectly matched audio-visual pairs, then conduct our systematic investigation into how this extreme alignment transforms perceptual model behavior across retrieval and generation tasks.

Our findings reveal that the initial architectural type of the encoder determines how it responds to the alignment process. Image-centric encoders demonstrate exceptional performance in cross-modal retrieval, but this intensive alignment causes compression of unique linguistic information and reduces the quality of their text description generation. In contrast, text-centric encoders maintain better balance between the two objectives, revealing a fundamental trade-off where excessive alignment with the visual manifold leads to improved retrieval capabilities, but simultaneously reduces the richness of acoustic and linguistic information necessary for quality text description generation.

AVE-2 Dataset

We introduce AudioVisual Event Evaluation (AVE-2), a dataset of 570,138 three-second audiovisual clips with fine-grained alignment annotations. Unlike existing datasets that treat alignment as binary, AVE-2 provides detailed alignment scores across five dimensions, enabling systematic investigation of audio-visual correspondence quality.

570K Audio-Visual Clips · 5 Alignment Dimensions · 3 Seconds per Clip

🚀 Getting Started with AVE-2

AVE-2 is now available on Hugging Face. Get started with the dataset in a few lines of code:

📥 Step 1: Basic Loading

Load the dataset metadata and explore 570K samples with alignment scores instantly.

🎬 Step 2: Download Media

Download chunked video files (237GB total) for complete audio-visual analysis.

🔍 Step 3: Start Analyzing

Filter by quality scores, explore alignment dimensions, and build your models.

# Install required packages (run in a shell)
pip install datasets opencv-python librosa

# Load the dataset metadata (Python)
from datasets import load_dataset
dataset = load_dataset("ali-vosoughi/ave-2")

# Explore a sample
sample = dataset["train"][0]
print(f"Video ID: {sample['youtube_id']}")
print(f"Temporal Alignment: {sample['temporal_alignment_score']}/10")
print(f"Video Caption: {sample['video_caption'][:100]}...")
                

# For the complete dataset with video files (shell commands):
# 1. Download media parts (the repository is a dataset, so the repo type must be given)
huggingface-cli download ali-vosoughi/ave-2 --repo-type dataset --include="ave2_media_part_*" --local-dir ./

# 2. Reconstruct media archive
cat ave2_media_part_* > ave2_media.zip && unzip ave2_media.zip

# 3. Verify reconstruction (requires ave2_media.md5 in the current directory)
md5sum -c ave2_media.md5

# 4. Use with media files (Python)
dataset = load_dataset("ali-vosoughi/ave-2")  # Now includes video paths!
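
On systems without `cat` and `md5sum`, the reconstruction and verification steps above can be approximated in Python. This is a minimal sketch using only the standard library. It assumes the part files sort lexicographically into the correct order (as the shell glob above does), and the helper names `reassemble` and `md5_of` are illustrative, not part of the dataset tooling.

```python
import hashlib
from pathlib import Path


def reassemble(parts_dir, pattern, out_path, chunk_size=1 << 20):
    """Concatenate split archive parts, in sorted name order, into out_path."""
    parts = sorted(Path(parts_dir).glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files match {pattern!r} in {parts_dir}")
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Copy in chunks so large parts never sit fully in memory.
                while chunk := src.read(chunk_size):
                    out.write(chunk)
    return parts


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

A typical call would be `reassemble(".", "ave2_media_part_*", "ave2_media.zip")`, followed by comparing `md5_of("ave2_media.zip")` with the digest listed in `ave2_media.md5`.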
                

# Quality-aware filtering
high_quality = dataset.filter(lambda x: all([
    x['temporal_alignment_score'] >= 8,
    x['spatial_coherence_score'] >= 8,
    x['physical_causality_score'] >= 8
]))

# Analyze invisible sound sources
invisible_samples = dataset.filter(lambda x: len(x['invisible_active_sources']) > 0)
print(f"Samples with invisible sources: {len(invisible_samples['train'])}")
                

Interactive Demo

Explore how different audio encoders (Raw vs Projected) generate different captions for the same audiovisual content.

Example 1

Video ID: rvbmYs4Kl3Y (Segment 01) Alignment Score: 47.0/50

Visual Description

The video takes place in a workshop setting, specifically at Finney's. The background features various tools and materials typically found in woodworking or similar crafts, such as wooden planks, a workbench, and shelves with additional items. There is also signage indicating "Finney's Expert Advice Simple Solutions Beautiful Work" on the wall. In the first scene, an individual wearing glasses and a black shirt with a white apron stands behind a workbench. They are holding a can of paint and appear to be preparing it for use. On the workbench, there is a rectangular piece of wood that looks like a cutting board, along with other tools and materials. A bottle labeled "Paint" is placed nearby, suggesting that painting might be part of their task. The second scene shows the same person continuing to prepare the paint can. This time, they are using their hands to apply the paint onto the cutting board. The environment remains consistent throughout, reinforcing the idea that this is a dedicated workspace for painting tasks. The third scene continues with the individual still engaged in painting. The focus remains on the process of applying the paint, which appears to be a light color given its sheen. The surrounding area includes more tools and materials, maintaining the workshop ambiance. The fourth scene captures another moment where the individual is seen working on the painting process. The lighting seems slightly dimmer than in previous scenes, possibly due to the nature of the work being done. The overall mood remains focused and industrious. The fifth scene returns to the individual standing behind the workbench, now holding a small container that likely contains additional paint or a different type of material. The arrangement of objects and characters suggests a continuation of the painting activity. The final scene shifts to a different angle, showing the individual from a side perspective. 
The lighting is brighter compared to earlier scenes, emphasizing the ongoing painting activity. The surroundings remain unchanged, reinforcing the continuity of the workshop environment. Overall, the video provides a detailed look into a typical day in a woodworking workshop, focusing on the activities involved in painting.

Audio Description

Labels with acoustic features: Deep, resonant, authoritative, and assertive -> Male speech, man speaking; Characterized by its loudness, pitch, and timbre -> Speech; Generally high frequency, short duration -> Animal

Alignment Analysis

Temporal: 10/10
Spatial: 10/10
Contextual: 10/10
Causality: 9/10
Visibility: 8/10
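
The overall score shown for this example (47.0/50) is consistent with a plain sum of the five per-dimension scores, each out of 10. The sketch below makes that arithmetic explicit. The summation rule is inferred from the numbers on this page rather than stated by the authors, and the dictionary keys are illustrative labels, not confirmed dataset field names.

```python
# Per-dimension scores for Example 1, copied from the page (each out of 10).
example_1 = {
    "temporal": 10,
    "spatial": 10,
    "contextual": 10,
    "causality": 9,
    "visibility": 8,
}


def overall_alignment(scores, max_per_dimension=10):
    """Sum per-dimension scores; return (total, maximum possible total)."""
    for name, value in scores.items():
        if not 0 <= value <= max_per_dimension:
            raise ValueError(f"{name} score {value} is out of range")
    return sum(scores.values()), max_per_dimension * len(scores)


total, maximum = overall_alignment(example_1)
print(f"Alignment Score: {total}/{maximum}")  # prints "Alignment Score: 47/50"
```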

Sound Source Visibility Analysis

Mistral's analysis of sound sources in the audiovisual scene:

👁️ Visible Active: Man, Cutting board, Paint canister

Encoders available in the demo: AudioCLIP, CLAP, Wav2CLIP, WhisperCLIP, ImageBind.

Example 2

Video ID: IwqD859w2_E (Segment 02) Alignment Score: 44.0/50

Visual Description

The video begins with a close-up of a person playing a red acoustic guitar on a wooden floor. The lighting is dim, creating shadows and highlighting the guitar's details. The background is dark, emphasizing the subject in focus. As the video progresses, the scene transitions to a wider shot showing the same person from behind, still holding the guitar. The lighting remains consistent, casting soft shadows around the person and the guitar. The setting appears to be indoors, possibly a room or studio, given the presence of what looks like a microphone stand in the background. In the final part of the video, the camera shifts slightly to show more of the person's hands as they play the guitar. The lighting continues to be dim, maintaining the mood of the earlier scenes. The person's movements are deliberate and focused, suggesting a practice session or performance. Throughout the video, there are no significant changes in the environment or actions apart from the progression of the shots. The overall atmosphere is one of concentration and musical expression.

Audio Description

Labels with acoustic features: Rich and varied -> Musical instrument; Bright and percussive -> Plucked string instrument; Full, bright, and complex -> Guitar; Rich and dynamic -> Music; Strong and resonant -> Piano; Bright and percussive -> Percussion; Warm and full-bodied -> Acoustic guitar; Bright and percussive -> Strum

Alignment Analysis

Temporal: 9/10
Spatial: 10/10
Contextual: 8/10
Causality: 9/10
Visibility: 8/10

Sound Source Visibility Analysis

Mistral's analysis of sound sources in the audiovisual scene:

👁️ Visible Active: Guitar

👁️ Visible Silent: Microphone stand

Example 3

Video ID: DDer7K8WG4I (Segment 02) Alignment Score: 46.0/50

Visual Description

The video begins with a view of an old, tall brick tower with pointed arches and decorative elements. The tower has three levels, each adorned with ornate details such as statues and intricate patterns. At the top level, there are two bells hanging from what appears to be a small balcony or ledge. Below this, on the second level, there is another set of bells, also hanging from a similar structure. In the foreground, there's a building with a sign that reads "PITI" in red letters against a white background. To the left side of the frame, part of another building can be seen, which seems to have a more modern appearance compared to the one in the foreground. The scene transitions smoothly into a festive atmosphere, indicated by garlands draped across the lower portion of the image. The sky remains overcast, suggesting it might be a cloudy day. As the video progresses, the focus shifts back to the tower, now showing its full height and detailed architecture. The bells remain visible, and the overall setting remains consistent throughout the video. Towards the end of the video, the camera angle changes slightly, revealing more of the surroundings, including additional buildings and trees. The lighting dims slightly, indicating either early morning or late afternoon light. Throughout the video, the visual elements include the brick tower, the bell towers, the greenery, and the architectural style of the surrounding structures. There are no significant movements apart from the slight change in perspective.

Audio Description

Audio caption: A church bell is ringing in a town square.

Alignment Analysis

Temporal: 10/10
Spatial: 10/10
Contextual: 9/10
Causality: 9/10
Visibility: 8/10

Sound Source Visibility Analysis

Mistral's analysis of sound sources in the audiovisual scene:

👁️ Visible Active: Church bells

👁️ Visible Silent: Tower, surrounding structures

Example 4

Video ID: UorSpZVnX_M (Segment 02) Alignment Score: 46.0/50

Visual Description

The video begins with a close-up view of a red stand mixer on a kitchen countertop. The mixer is in the process of mixing dough, which appears to be light yellow and slightly sticky. The background shows a granite countertop and some kitchen utensils, indicating that this scene takes place in a home kitchen. As the video progresses, the focus shifts to the dough being mixed by the mixer. The dough is thick and smooth, with visible air bubbles indicating it has been kneaded. The mixer's motor is spinning rapidly as the dough is being incorporated into the bowl. The lighting is bright, illuminating the dough and making its texture more visible. Towards the end of the video, the dough is fully mixed and ready for the next step in the recipe. The mixer continues to spin at high speed, ensuring the dough is well-mixed and evenly spread out. The background remains consistent throughout, showing the same kitchen countertop and utensils. Throughout the video, there are no significant changes or transitions between scenes; the setting and actions remain constant. The overall mood conveyed by the visual elements is one of preparation and cooking, suggesting that the person preparing the dough is likely following a recipe or instructions provided in the video.

Audio Description

Labels with acoustic features: High-pitched and whirring -> Power tool; Rich in frequency modulation -> Speech

Alignment Analysis

Temporal: 9/10
Spatial: 10/10
Contextual: 10/10
Causality: 9/10
Visibility: 8/10

Sound Source Visibility Analysis

Mistral's analysis of sound sources in the audiovisual scene:

👁️ Visible Active: mixer, speaker

👁️ Visible Silent: kitchen utensils, granite countertop

Example 5

Video ID: EQHrQIaQNv8 (Segment 03) Alignment Score: 45.0/50

Visual Description

The video begins with a close-up of a person's hands playing an electric guitar. The individual is wearing a black shirt and brown pants, seated on what appears to be a bench or chair. The background is out of focus, but it seems to be an indoor setting with artificial lighting. The scene transitions smoothly into the next frame, where the same person is seen holding the same electric guitar. This time, the guitar has a unique design that resembles a long-necked instrument, possibly a custom-built model known as a "Guitar Hero." The guitar features a prominent logo at the bottom that reads "Guitar Hero," indicating its brand identity. The person continues to play the guitar, adjusting their fingers on the fretboard and strumming the strings. In the final part of the video, the person is still engaged in playing the Guitar Hero guitar. They are shown from a slightly different angle, maintaining the same posture and focus on the guitar. The background remains consistent throughout this segment, reinforcing the continuity between scenes. Throughout the video, there are no significant changes in the environment or actions; the setting remains unchanged, and the main action revolves around the guitar player and their interaction with the instrument.

Audio Description

Labels with acoustic features: Rich in frequency modulation -> Speech; Bright and clear -> Harmonica; Smooth and resonant -> Guitar

Alignment Analysis

Temporal: 10/10
Spatial: 10/10
Contextual: 8/10
Causality: 9/10
Visibility: 8/10

Sound Source Visibility Analysis

Mistral's analysis of sound sources in the audiovisual scene:

👁️ Visible Active: Guitar

👂 Invisible Active: Harmonica

Example 6

Video ID: GJhYkfI7jpU (Segment 03) Alignment Score: 42.0/50

Visual Description

The video begins with a close-up of a person's hand holding a black speaker. The background shows a wooden block and a green plastic container on a table, suggesting an indoor setting possibly in a kitchen or workshop. The scene transitions to the speaker being turned off, indicating that it is not currently playing music. Next, the focus shifts to a small circuit board placed on the same table as the speaker. A blue LED light is illuminated, which could be part of an electronic project or experiment. The circuit board has several wires connected to it, some of which are visible and appear to be part of the electronics setup. The video then moves forward to show the same circuit board now powered up by a battery, with the blue LED light still on. This suggests that the circuitry is functioning correctly and is likely part of a larger electronic device or system. In the final segment, the video continues with the circuit board now powered up, showing the blue LED light glowing brightly. The surrounding environment remains consistent with the previous scenes, maintaining continuity between the different stages of the project. Throughout the video, there are no significant changes in the actions or events occurring; instead, the sequence of images captures the progression from turning off the speaker to powering up the circuit board and finally illuminating the LED light.

Audio Description

Labels with acoustic features: Rich in frequency modulation -> Speech; Bright and sharp -> Alarm clock

Alignment Analysis

Temporal: 8/10
Spatial: 10/10
Contextual: 7/10
Causality: 9/10
Visibility: 8/10

Sound Source Visibility Analysis

Mistral's analysis of sound sources in the audiovisual scene:

👁️ Visible Active: Speaker

👁️ Visible Silent: Green plastic container

Example 7

Video ID: AVL7Kbpw13U (Segment 01) Alignment Score: 42.0/50

Visual Description

The video begins with a person sitting on a chair, wrapped in a maroon blanket or towel. The background features a wooden floor and various plants, creating an indoor setting that appears cozy and lived-in. As the scene develops, the person starts to move around, bending down and adjusting their position slightly. They eventually sit up straight again, still wrapped in the same maroon item. In the next few frames, the person continues to adjust their posture while still wrapped in the blanket. This suggests they are either getting ready for bed or preparing to get out of bed. The room's decor remains consistent throughout, maintaining a warm and homely atmosphere. Towards the end of this segment, the person is seen standing up from the chair, holding the blanket over their head as if it were a hat. This action indicates they might be about to put it back on or have just taken it off. The final frame shows the person walking away from the chair, leaving the scene empty once more. Throughout the video, there are no significant changes in the environment or actions apart from the person's movements. The lighting is soft and natural, suggesting daytime or well-lit indoor conditions. There are no visible texts or subtitles within the video itself.

Audio Description

Labels with acoustic features: Rich in frequency modulation -> Speech; High-pitched and short -> Sneeze; Generally high pitched, sweet, and giggly -> Child speech, kid speaking; Loud, harsh, and discordant -> Noise; Chaotic and non-structured -> Background noise; Full range and varied -> Human voice; Richly complex -> Human voice

Alignment Analysis

Temporal: 8/10
Spatial: 10/10
Contextual: 7/10
Causality: 9/10
Visibility: 8/10

Sound Source Visibility Analysis

Mistral's analysis of sound sources in the audiovisual scene:

👁️ Visible Active: Person

👂 Invisible Active: Sneeze sound

Key Findings

Our systematic investigation reveals a fundamental trade-off where excessive alignment with the visual manifold leads to improved retrieval capabilities, but simultaneously reduces the richness of acoustic and linguistic information necessary for quality text description generation.

Performance Trade-offs

Image-centric encoders (ImageBind, AudioCLIP, Wav2CLIP) demonstrate exceptional performance in cross-modal retrieval due to their inherent design for visual alignment, but this intensive alignment causes compression of unique linguistic information and reduces text generation quality. Text-centric encoders (CLAP, Whisper) maintain stronger linguistic authenticity and achieve better balance between retrieval and generation objectives.
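
Cross-modal retrieval quality is commonly summarized with Recall@K: the fraction of audio queries whose paired video appears among the K most similar candidates. The following is a generic sketch of that metric in plain Python, assuming item i in the audio list is paired with item i in the video list. It is not the authors' evaluation code, and the function names and toy embeddings are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(audio_embs, video_embs, k):
    """Fraction of audio queries whose paired video (same index) ranks in the top k."""
    if not audio_embs:
        raise ValueError("need at least one query")
    hits = 0
    for i, query in enumerate(audio_embs):
        sims = [cosine(query, cand) for cand in video_embs]
        # Rank candidate indices by descending similarity to the query.
        ranked = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(audio_embs)


# Toy 2-D embeddings, invented purely to exercise the function.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video = [[0.9, 0.1], [0.8, 0.6], [0.2, 0.9]]
for k in (1, 2, 3):
    print(f"Recall@{k}: {recall_at_k(audio, video, k):.2f}")
```

Recall@K only checks whether the correct pair is retrieved; it says nothing about how much linguistic detail the embedding retains, which is why retrieval and generation quality can diverge.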

Architectural Insights

The initial architectural type of the encoder determines how it responds to the alignment process. Encoders pre-trained with text supervision maintain stronger generative capabilities than those focused primarily on audiovisual alignment, highlighting the value of language exposure for generation tasks.

Pareto Frontier Discovery

We establish a clear Pareto frontier for cross-modal learning, providing guidelines for choosing between retrieval accuracy and generative richness based on application needs. This challenges the assumption that stronger cross-modal alignment necessarily benefits all multimodal tasks.
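
To make the notion of a Pareto frontier concrete: a configuration lies on the frontier if no other configuration is at least as good on both retrieval and generation and strictly better on one of them. Below is a minimal sketch of that test. The configuration names and scores are placeholders invented for illustration and are not results from the paper.

```python
def pareto_front(points):
    """Return names of points not dominated by any other point.

    `points` maps a name to a (retrieval, generation) pair; higher is better on both.
    """
    front = []
    for name, (r, g) in points.items():
        dominated = any(
            (r2 >= r and g2 >= g) and (r2 > r or g2 > g)
            for other, (r2, g2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (retrieval, generation) scores; not measurements.
candidates = {
    "config_a": (0.90, 0.40),
    "config_b": (0.70, 0.70),
    "config_c": (0.60, 0.60),  # dominated by config_b
    "config_d": (0.40, 0.90),
}
print(pareto_front(candidates))  # ['config_a', 'config_b', 'config_d']
```

Every configuration on the frontier is a defensible choice; which one to pick depends on whether the application weights retrieval accuracy or generative richness more heavily.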

Code and Resources

All code, data, and pre-trained models are made available to facilitate reproducibility and future research in audio-visual alignment.

Resources: GitHub Repository, arXiv Paper, Hugging Face Dataset

Citation

If you use SoundCLIP or the AVE-2 dataset in your research, please cite our paper:

@article{vosoughi2025soundclip,
  title={Can Sound Replace Vision in LLaVA With Token Substitution?},
  author={Vosoughi, Ali and Bi, Jing and Liu, Pinxin and Tang, Yunlong and Xu, Chenliang},
  journal={arXiv},
  year={2025}
}