Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching
If you use PromptReverb in your research, please cite our ICASSP 2026 paper:
A VAE for band-limited-to-full-band upsampling (to 48 kHz), combined with a conditional diffusion transformer for text-to-RIR generation
Flow-based generation with deterministic ODE integration, using a DiT architecture for high-quality reverb synthesis
First method to synthesize complete RIRs from free-form textual descriptions using CLAP-based text encoding
Achieves 8.8% mean RT60 error, compared with -37.0% for the Image2Reverb baseline, while producing realistic acoustic parameters
Compare our PromptReverb method with Image2Reverb and ground truth recordings across diverse acoustic environments
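The highlights above describe one pipeline: a text prompt is embedded with CLAP, a DiT samples a latent via rectified flow, and the VAE decoder upsamples to a full-band 48 kHz RIR. The sketch below shows only that data flow; every function body is a hypothetical stand-in for the real modules, with illustrative shapes and a toy exponential-decay "decoder":

```python
import numpy as np

# Hypothetical stand-ins for the real modules; all shapes are illustrative only.
def clap_text_embed(prompt: str, dim: int = 512) -> np.ndarray:
    """Stand-in for the CLAP text encoder: prompt -> fixed-size embedding."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(dim)

def sample_latent(cond: np.ndarray, latent_dim: int = 64, steps: int = 8) -> np.ndarray:
    """Stand-in for the DiT + rectified flow sampler: noise -> latent, text-conditioned."""
    x = np.random.default_rng(0).standard_normal(latent_dim)
    for _ in range(steps):
        x = x + (1.0 / steps) * np.tanh(x + cond[:latent_dim])  # toy velocity field
    return x

def decode_rir(latent: np.ndarray, sr: int = 48_000, seconds: float = 2.0) -> np.ndarray:
    """Stand-in for the VAE decoder: latent -> full-band RIR waveform."""
    n = int(sr * seconds)
    t = np.arange(n) / sr
    rt60 = 1.0 + float(np.abs(latent).mean())   # toy RT60 derived from the latent
    decay = np.exp(-6.9078 * t / rt60)          # amplitude falls 60 dB over rt60 seconds
    noise = np.random.default_rng(1).standard_normal(n)
    return decay * noise

cond = clap_text_embed("a large stone cathedral with long reverb")
rir = decode_rir(sample_latent(cond))
print(rir.shape)  # (96000,) — 2 s of audio at 48 kHz
```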
ResBlock encoder and ConvNeXt+Transformer decoder for compact latent representations and band-limited to full-band upsampling.
CLAP audio-language embeddings provide acoustically grounded semantic understanding for text-to-reverb conditioning.
Diffusion Transformer with rectified flow matching for stable latent generation with classifier-free guidance support.
VLM captioning followed by LLM-based creative rewriting to generate diverse natural language training data.
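The rectified flow matching component above amounts to Euler integration of a learned velocity field from noise (t = 0) to data (t = 1), with classifier-free guidance mixing conditional and unconditional predictions. A minimal sketch under stated assumptions: `toy_velocity` is an illustrative analytic field (the closed-form velocity toward a point target), not the trained DiT:

```python
import numpy as np

def euler_sample(v_model, cond, dim=16, steps=100, guidance=2.0, seed=0):
    """Euler integration of dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (data),
    with classifier-free guidance: v = v_u + w * (v_c - v_u)."""
    x = np.random.default_rng(seed).standard_normal(dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_c = v_model(x, t, cond)   # conditional velocity
        v_u = v_model(x, t, None)   # unconditional velocity (condition dropped)
        v = v_u + guidance * (v_c - v_u)
        x = x + dt * v
    return x

def toy_velocity(x, t, cond):
    """Illustrative analytic field: the optimal straight-path velocity toward a
    point target mu is v(x, t) = (mu - x) / (1 - t)."""
    mu = np.zeros_like(x) if cond is None else cond
    return (mu - x) / max(1.0 - t, 1e-3)

target = np.full(16, 3.0)
sample = euler_sample(toy_velocity, target, steps=200, guidance=1.0)
print(np.abs(sample - target).max() < 0.1)  # samples converge onto the target
```

With `guidance > 1` the update over-weights the conditional velocity, the same trade-off (stronger conditioning versus diversity) familiar from guided diffusion samplers.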
Headline metrics: mean RT60 error (XL, Long), VAE reconstruction SNR, human evaluation score, and synthesis speed relative to Griffin-Lim.
RT60 estimation accuracy across model scales (test set n=1957)
| Method | Mean RT60 Error (%) | Mean RT60 (s) |
|---|---|---|
| Ground Truth | — | 3.299 |
| Image2Reverb | -37.0 | 1.295 |
| PromptReverb XL, Long | 8.8 | 2.189 |
| PromptReverb XL, Short | 4.8 | 2.044 |
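The table reports signed percent deviation from ground truth, which is why Image2Reverb's systematic under-estimation of reverberation time appears as a negative number. A sketch of the metric alongside one common RT60 estimator (Schroeder backward integration with a T20-style linear fit — an assumption for illustration; the paper's exact estimator may differ):

```python
import numpy as np

def rt60_schroeder(rir, sr, db_lo=-25.0, db_hi=-5.0):
    """Estimate RT60 via Schroeder backward integration: fit the energy-decay
    curve between db_hi and db_lo, then extrapolate the slope to -60 dB."""
    energy = rir.astype(float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]              # backward-integrated energy
    edc_db = 10.0 * np.log10(edc / edc[0])
    t = np.arange(len(rir)) / sr
    mask = (edc_db <= db_hi) & (edc_db >= db_lo)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)  # decay rate in dB per second
    return -60.0 / slope

def rt60_percent_error(pred, true):
    """Signed percent error; negative values mean RT60 is under-estimated."""
    return 100.0 * (pred - true) / true

# Synthetic exponentially decaying noise with a known RT60 of 2.0 s.
sr, rt60_true = 48_000, 2.0
t = np.arange(int(sr * 3.0)) / sr
rir = np.exp(-6.9078 * t / rt60_true) * np.random.default_rng(0).standard_normal(t.size)
est = rt60_schroeder(rir, sr)
print(round(est, 2), round(rt60_percent_error(est, rt60_true), 1))
```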
The complete implementation, training scripts, and model architectures are available in our GitHub repository. The codebase includes the two-stage VAE architecture, rectified flow matching implementation, and our caption-then-rewrite pipeline for generating diverse text conditioning data.