PromptReverb

Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching

¹University of Rochester, ²Smule Labs, ³UC San Diego, ⁴Stanford University
†Work done during internship at Smule Labs | ⁺These authors contributed equally
Accepted at ICASSP 2026, the IEEE International Conference on Acoustics, Speech, and Signal Processing
Coming Soon: We will release inference weights and pre-trained models in four sizes (S, B, L, XL); stay tuned!

Citation

If you use PromptReverb in your research, please cite our ICASSP 2026 paper:

BibTeX:
@inproceedings{vosoughi2026promptreverb,
  title={Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching},
  author={Ali Vosoughi and Yongyi Zang and Qihui Yang and Nathan Paek and Randal Leistikow and Chenliang Xu},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026},
  organization={IEEE}
}

Project Page: https://ali-vosoughi.github.io/PromptReverb

Code: https://github.com/ali-vosoughi/PromptReverb

Key Features

Two-Stage Architecture

A VAE for band-limited-to-full-band upsampling (to 48 kHz), combined with a conditional diffusion transformer for text-to-RIR generation

Rectified Flow Matching

Flow-based generation with deterministic ODE integration on a DiT backbone for high-quality reverb synthesis
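
Rectified flow learns a velocity field whose straight-line ODE can be integrated with a plain Euler loop. Below is a minimal sampling sketch, assuming a model(x, t, cond) interface that returns a velocity; it illustrates the technique, not the repository's actual API.

Python:
import torch

@torch.no_grad()
def sample_rectified_flow(model, cond, latent_shape, steps=50):
    """Integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (data).
    The model interface and latent shape are assumptions for illustration."""
    x = torch.randn(latent_shape)                  # Gaussian noise at t=0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((latent_shape[0],), i * dt)
        v = model(x, t, cond)                      # predicted velocity field
        x = x + v * dt                             # forward Euler step
    return x                                       # generated latent at t=1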

Natural Language Control

First method to synthesize complete RIRs from free-form textual descriptions using CLAP-based text encoding

Superior Accuracy

Achieves 8.8% mean RT60 error, versus -37% for the Image2Reverb baseline, while producing realistic acoustic parameters

Audio Samples

Compare our PromptReverb method with Image2Reverb and ground truth recordings across diverse acoustic environments

Method Overview

PromptReverb Architecture

Pipeline: Text Input → CLAP Encoder → Rectified Flow DiT → VAE Latents → VAE Decoder → Full-Band RIR
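
Read as code, the pipeline above amounts to three calls: embed the prompt, integrate the latent ODE, and decode. The sketch below reuses sample_rectified_flow from the earlier sketch; the clap, dit, and vae objects, their methods, and the latent shape are illustrative placeholders rather than the repository's interfaces.

Python:
import torch

@torch.no_grad()
def text_to_rir(prompt, clap, dit, vae, steps=50):
    # Stage 0: free-form text -> CLAP embedding (conditioning signal).
    cond = clap.encode_text([prompt])
    # Stage 1: sample a VAE latent with the rectified-flow DiT
    # (sample_rectified_flow is defined in the sketch above; the latent
    # shape here is an assumed placeholder).
    z = sample_rectified_flow(dit, cond, latent_shape=(1, 64, 128), steps=steps)
    # Stage 2: decode the latent to a full-band 48 kHz RIR waveform.
    return vae.decode(z)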

Key Technical Components:

β-VAE Pipeline

ResBlock encoder and ConvNeXt+Transformer decoder for compact latent representations and band-limited-to-full-band upsampling.
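
For orientation, the β in β-VAE scales the KL term against reconstruction. A minimal sketch of that objective, omitting the adversarial loss mentioned under Training Details and using an arbitrary β and plain MSE as stand-ins:

Python:
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_hat, mu, logvar, beta=0.5):
    # Reconstruction term (the paper likely uses richer spectral losses;
    # plain MSE is an illustrative stand-in).
    recon = F.mse_loss(x_hat, x)
    # KL divergence of the approximate posterior N(mu, sigma^2) from N(0, I).
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl                       # beta=0.5 is illustrative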

CLAP Text Encoder

Audio-language representations from CLAP provide acoustically grounded text embeddings for text-to-reverb conditioning.
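
If the open-source LAION-CLAP package is what backs this encoder (an assumption; the page only says CLAP), obtaining a text embedding for conditioning looks roughly like:

Python:
import laion_clap

# Load a pretrained CLAP checkpoint (downloads weights on first call).
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()

# Embed a free-form room description; returns a (1, 512) embedding.
prompts = ["a large stone cathedral with long, lush reverberation"]
text_embed = model.get_text_embedding(prompts, use_tensor=True)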

Rectified Flow DiT

Diffusion Transformer trained with rectified flow matching for stable latent generation, with classifier-free guidance support.
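
Classifier-free guidance at sampling time mixes conditional and unconditional velocity predictions. A minimal sketch of one guided evaluation, where null_cond stands for the dropped-text embedding and the guidance scale is illustrative:

Python:
def guided_velocity(model, x, t, cond, null_cond, scale=3.0):
    # Two forward passes: with the text condition and with the null condition.
    v_cond = model(x, t, cond)
    v_uncond = model(x, t, null_cond)
    # Push the velocity toward the conditional direction
    # (scale=3.0 is an illustrative choice, not the paper's setting).
    return v_uncond + scale * (v_cond - v_uncond)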

Caption-then-Rewrite

VLM captioning followed by LLM-based creative rewriting to generate diverse natural language training data.
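
The control flow of that data pipeline is two prompting stages; in the sketch below, vlm_caption and llm_rewrite are hypothetical helpers standing in for whichever VLM and LLM were actually used:

Python:
def caption_then_rewrite(image, vlm_caption, llm_rewrite, n_variants=4):
    # Stage 1: a VLM describes the pictured space (hypothetical helper).
    caption = vlm_caption(image)           # e.g. "a small tiled bathroom"
    # Stage 2: an LLM rewrites the caption into varied acoustic prompts
    # (the instruction wording here is illustrative).
    instruction = ("Rewrite this room description as a natural-language "
                   "prompt about how the room sounds: ")
    return [llm_rewrite(instruction + caption) for _ in range(n_variants)]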

Training Details:

  • Dataset: 145,976 training samples from multiple sources (C4DM, RIRS_NOISES, Image2Reverb, etc.)
  • Sample Rate: 48 kHz full-band output with 22 kHz latent training
  • Duration: 5-second standardized RIR length
  • Loss Function: Flow matching with pseudo-Huber penalty + β-VAE + adversarial losses (a sketch of the flow matching term follows this list)
  • Architecture: Scalable DiT models from 213M (S) to 1.5B (XL) parameters
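
As referenced in the list above, a minimal sketch of the flow matching term with a pseudo-Huber penalty; x1 is a VAE latent, x0 is noise, and both the constant c and the model interface are assumptions rather than the repository's settings:

Python:
import torch

def flow_matching_pseudo_huber(model, x1, cond, c=1e-3):
    # Straight path x_t = (1 - t) * x0 + t * x1 with velocity target x1 - x0.
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], device=x1.device)
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast t over x's shape
    x_t = (1 - t_) * x0 + t_ * x1
    v_pred = model(x_t, t, cond)
    err = (v_pred - (x1 - x0)).flatten(1)
    # Pseudo-Huber penalty sqrt(||err||^2 + c^2) - c, smoother than L1
    # near zero (c=1e-3 is an illustrative constant).
    return (torch.sqrt(err.pow(2).sum(dim=1) + c**2) - c).mean()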

Results & Performance

  • Mean RT60 error (XL, Long): 8.8%
  • VAE reconstruction SNR: -0.75 dB
  • Human evaluation score: 3.79/5
  • Inference speed: 62× faster than Griffin-Lim

Comparison with Baselines

RT60 estimation accuracy across model scales (test set n=1957)

Method                      Mean RT60 Error (%)   Mean RT60 (s)
Ground Truth                —                     3.299
Image2Reverb                -37.0                 1.295
PromptReverb (XL, Long)     8.8                   2.189
PromptReverb (XL, Short)    4.8                   2.044
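
One way to read the signed values: if the metric is the mean relative error (our assumed definition, not taken verbatim from the paper), a negative number means systematic underestimation of RT60, which is consistent with Image2Reverb's 1.295 s mean against the 3.299 s ground truth:

Python:
import numpy as np

def mean_rt60_error(rt60_pred, rt60_true):
    # Signed mean relative error in percent; negative values indicate
    # underestimated reverberation time on average. This is an assumed
    # reading of the table's metric, not the paper's exact definition.
    pred, true = np.asarray(rt60_pred), np.asarray(rt60_true)
    return 100.0 * float(np.mean((pred - true) / true))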

Code & Repository


Implementation Details

The complete implementation, training scripts, and model architectures are available in our GitHub repository. The codebase includes the two-stage VAE architecture, rectified flow matching implementation, and our caption-then-rewrite pipeline for generating diverse text conditioning data.

Repository Contents

  • Two-stage VAE architecture implementation
  • Rectified flow matching with DiT backbone
  • CLAP text encoder integration
  • Training scripts and configuration files
  • Data preprocessing and caption generation pipeline
  • Evaluation metrics and comparison tools