PromptReverb

Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching

¹University of Rochester, ²Smule Labs, ³UC San Diego, ⁴Stanford University
†Work done during internship at Smule Labs | ⁺These authors contributed equally
Accepted at ICASSP 2026, the IEEE International Conference on Acoustics, Speech, and Signal Processing
Coming Soon: We will release inference weights and pre-trained models in four sizes (S, B, L, XL); stay tuned!

Citation

If you use PromptReverb in your research, please cite our ICASSP 2026 paper:

BibTeX:
@inproceedings{vosoughi2026promptreverb,
  title={Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching},
  author={Ali Vosoughi and Yongyi Zang and Qihui Yang and Nathan Paek and Randal Leistikow and Chenliang Xu},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026},
  organization={IEEE}
}

Project Page: https://ali-vosoughi.github.io/PromptReverb

Code: https://github.com/ali-vosoughi/PromptReverb

Key Features

Two-Stage Architecture

A VAE for band-limited-to-full-band upsampling (to 48 kHz), combined with a conditional diffusion transformer for text-to-RIR generation

Rectified Flow Matching

Flow-based generation with deterministic ODE integration on a DiT backbone for high-quality reverb synthesis
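
Rectified flow learns a velocity field whose straight-line ODE can be integrated with a plain Euler loop. Below is a minimal sampling sketch, assuming a model(x, t, cond) interface that returns a velocity; it illustrates the technique, not the repository's actual API.

Python:
import torch

@torch.no_grad()
def sample_rectified_flow(model, cond, latent_shape, steps=50):
    """Integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (data).
    The model interface and latent shape are assumptions for illustration."""
    x = torch.randn(latent_shape)                  # Gaussian noise at t=0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((latent_shape[0],), i * dt)
        v = model(x, t, cond)                      # predicted velocity field
        x = x + v * dt                             # forward Euler step
    return x                                       # generated latent at t=1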

Natural Language Control

First method to synthesize complete RIRs from free-form textual descriptions using CLAP-based text encoding

Superior Accuracy

Achieves 8.8% mean RT60 error, versus -37% for the Image2Reverb baseline, while producing realistic acoustic parameters

Audio Samples

Compare our PromptReverb method with Image2Reverb and ground truth recordings across diverse acoustic environments

Method Overview

PromptReverb Architecture

Pipeline: Text Input → CLAP Encoder → Rectified Flow DiT → VAE Latents → VAE Decoder → Full-Band RIR
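
Read as code, the pipeline above amounts to three calls: embed the prompt, integrate the latent ODE, and decode. The sketch below reuses sample_rectified_flow from the earlier sketch; the clap, dit, and vae objects, their methods, and the latent shape are illustrative placeholders rather than the repository's interfaces.

Python:
import torch

@torch.no_grad()
def text_to_rir(prompt, clap, dit, vae, steps=50):
    # Stage 0: free-form text -> CLAP embedding (conditioning signal).
    cond = clap.encode_text([prompt])
    # Stage 1: sample a VAE latent with the rectified-flow DiT
    # (sample_rectified_flow is defined in the sketch above; the latent
    # shape here is an assumed placeholder).
    z = sample_rectified_flow(dit, cond, latent_shape=(1, 64, 128), steps=steps)
    # Stage 2: decode the latent to a full-band 48 kHz RIR waveform.
    return vae.decode(z)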

Key Technical Components:

β-VAE Pipeline

ResBlock encoder and ConvNeXt+Transformer decoder for compact latent representations and band-limited-to-full-band upsampling.
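
For orientation, the β in β-VAE scales the KL term against reconstruction. A minimal sketch of that objective, omitting the adversarial loss mentioned under Training Details and using an arbitrary β and plain MSE as stand-ins:

Python:
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_hat, mu, logvar, beta=0.5):
    # Reconstruction term (the paper likely uses richer spectral losses;
    # plain MSE is an illustrative stand-in).
    recon = F.mse_loss(x_hat, x)
    # KL divergence of the approximate posterior N(mu, sigma^2) from N(0, I).
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl                       # beta=0.5 is illustrative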

CLAP Text Encoder

Audio-language representations from CLAP provide acoustically grounded text embeddings for text-to-reverb conditioning.
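
If the open-source LAION-CLAP package is what backs this encoder (an assumption; the page only says CLAP), obtaining a text embedding for conditioning looks roughly like:

Python:
import laion_clap

# Load a pretrained CLAP checkpoint (downloads weights on first call).
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()

# Embed a free-form room description; returns a (1, 512) embedding.
prompts = ["a large stone cathedral with long, lush reverberation"]
text_embed = model.get_text_embedding(prompts, use_tensor=True)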

Rectified Flow DiT

Diffusion Transformer trained with rectified flow matching for stable latent generation, with classifier-free guidance support.
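
Classifier-free guidance at sampling time mixes conditional and unconditional velocity predictions. A minimal sketch of one guided evaluation, where null_cond stands for the dropped-text embedding and the guidance scale is illustrative:

Python:
def guided_velocity(model, x, t, cond, null_cond, scale=3.0):
    # Two forward passes: with the text condition and with the null condition.
    v_cond = model(x, t, cond)
    v_uncond = model(x, t, null_cond)
    # Push the velocity toward the conditional direction
    # (scale=3.0 is an illustrative choice, not the paper's setting).
    return v_uncond + scale * (v_cond - v_uncond)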

Caption-then-Rewrite

VLM captioning followed by LLM-based creative rewriting to generate diverse natural language training data.
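
The control flow of that data pipeline is two prompting stages; in the sketch below, vlm_caption and llm_rewrite are hypothetical helpers standing in for whichever VLM and LLM were actually used:

Python:
def caption_then_rewrite(image, vlm_caption, llm_rewrite, n_variants=4):
    # Stage 1: a VLM describes the pictured space (hypothetical helper).
    caption = vlm_caption(image)           # e.g. "a small tiled bathroom"
    # Stage 2: an LLM rewrites the caption into varied acoustic prompts
    # (the instruction wording here is illustrative).
    instruction = ("Rewrite this room description as a natural-language "
                   "prompt about how the room sounds: ")
    return [llm_rewrite(instruction + caption) for _ in range(n_variants)]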

Training Details:

  • Dataset: 145,976 training samples from multiple sources (C4DM, RIRS_NOISES, Image2Reverb, etc.)
  • Sample Rate: 48 kHz full-band output with 22 kHz latent training
  • Duration: 5-second standardized RIR length
  • Loss Function: Flow matching with pseudo-Huber penalty + β-VAE + adversarial losses (a sketch of the flow matching term follows this list)
  • Architecture: Scalable DiT models from 213M (S) to 1.5B (XL) parameters
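
As referenced in the list above, a minimal sketch of the flow matching term with a pseudo-Huber penalty; x1 is a VAE latent, x0 is noise, and both the constant c and the model interface are assumptions rather than the repository's settings:

Python:
import torch

def flow_matching_pseudo_huber(model, x1, cond, c=1e-3):
    # Straight path x_t = (1 - t) * x0 + t * x1 with velocity target x1 - x0.
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], device=x1.device)
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast t over x's shape
    x_t = (1 - t_) * x0 + t_ * x1
    v_pred = model(x_t, t, cond)
    err = (v_pred - (x1 - x0)).flatten(1)
    # Pseudo-Huber penalty sqrt(||err||^2 + c^2) - c, smoother than L1
    # near zero (c=1e-3 is an illustrative constant).
    return (torch.sqrt(err.pow(2).sum(dim=1) + c**2) - c).mean()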

Results & Performance

  • Mean RT60 error (XL, Long): 8.8%
  • VAE reconstruction SNR: -0.75 dB
  • Human evaluation score: 3.79/5
  • Inference speed: 62× faster than Griffin-Lim

Comparison with Baselines

RT60 estimation accuracy across model scales (test set n=1957)

Method                      Mean RT60 Error (%)   Mean RT60 (s)
Ground Truth                —                     3.299
Image2Reverb                -37.0                 1.295
PromptReverb (XL, Long)     8.8                   2.189
PromptReverb (XL, Short)    4.8                   2.044
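
One way to read the signed values: if the metric is the mean relative error (our assumed definition, not taken verbatim from the paper), a negative number means systematic underestimation of RT60, which is consistent with Image2Reverb's 1.295 s mean against the 3.299 s ground truth:

Python:
import numpy as np

def mean_rt60_error(rt60_pred, rt60_true):
    # Signed mean relative error in percent; negative values indicate
    # underestimated reverberation time on average. This is an assumed
    # reading of the table's metric, not the paper's exact definition.
    pred, true = np.asarray(rt60_pred), np.asarray(rt60_true)
    return 100.0 * float(np.mean((pred - true) / true))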

Code & Repository


Implementation Details

The complete implementation, training scripts, and model architectures are available in our GitHub repository. The codebase includes the two-stage VAE architecture, rectified flow matching implementation, and our caption-then-rewrite pipeline for generating diverse text conditioning data.

Repository Contents

  • Two-stage VAE architecture implementation
  • Rectified flow matching with DiT backbone
  • CLAP text encoder integration
  • Training scripts and configuration files
  • Data preprocessing and caption generation pipeline
  • Evaluation metrics and comparison tools