Paper and repository
ICASSP 2024 paper companion
Learning Audio Concepts from Counterfactual Natural Language.
Counterfactual Audio releases counterfactual caption annotations for AudioCaps, Clotho, and MACS so researchers can study not only which text matches an audio clip, but which nearby alternatives should be rejected for that same clip.
Novelty: this release adds counterfactual language for the same audio clip, not just factual caption matching.
The audio path stays fixed. The intervention happens in language, by rewriting sound sources and scene descriptions into plausible counterfactual alternatives.
Direct release files
Browse the JSON annotations
Citation
Make the paper easy to cite
Use the recommended BibTeX entry directly from this page or from the repository citation file.
What is in this repo
A paper companion release, not just a storage bucket.
The repository packages the released annotation files, prompt scaffold, usage guidance, and a static project page so a researcher can understand the method, inspect the released data, and plug the annotations into an existing audio-text pipeline with minimal setup.
Counterfactual captions
Each released JSON row preserves the original human caption(s) and adds a counterfactual rewrite for the same clip.
Path-aligned to source audio
The path field keeps the annotations joinable to the official AudioCaps, Clotho, or MACS audio tree.
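As a minimal sketch of that join, assuming a list-of-rows JSON layout with a relative `path` field (the field name comes from this release; the dataset root and folder layout are whatever your local download uses):

```python
import json
from pathlib import Path

def iter_joined_rows(release_json, dataset_root):
    """Yield (absolute_path, exists, row) for every annotation row,
    resolving the relative `path` field against the local audio tree."""
    rows = json.loads(Path(release_json).read_text())
    root = Path(dataset_root)
    for row in rows:
        clip = root / row["path"]  # relative path into the official dataset layout
        yield clip, clip.is_file(), row
```

Rows whose clip is missing locally surface with `exists == False`, which makes partial downloads easy to detect before training.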
Prompt scaffold included
The release exposes the caption-to-source decomposition and rewrite prompt used to generate counterfactual text.
Research-ready website
The GitHub Pages site lives directly in docs/ and is prepared to serve as a stable paper landing page.
Why this matters
Counterfactual captions change the supervision signal, not just the wording.
Standard CLAP-style audio-text contrastive learning tells the model which caption matches a clip. Counterfactual Audio adds a second text view for that exact same clip, creating a stronger signal for what should not match.
Standard CLAP-style setup
Audio and factual captions are matched contrastively.
The model learns which text is correct for the clip, but it is not explicitly told which nearby textual alternatives should be rejected for the same audio.
Counterfactual Audio setup
The same clip is paired with factual and counterfactual text.
This adds structured linguistic negatives and supports training objectives that preserve factual consistency while separating alternative scenarios in embedding space.
Match vs. reject
The model sees both a positive caption and a semantically nearby alternative attached to the same audio item.
Same audio, different language
The supervision intervenes in text only, which makes it useful for retrieval, representation learning, and causal-style ablations.
Hard negatives with structure
These are not generic random mismatches. They are caption-conditioned alternatives intended to challenge the model more directly.
What you can do
Concrete ways to use the release in research pipelines.
Hard negatives
Treat counterfactual_caption as a structured hard negative for the same audio clip in an existing audio-text setup.
Retrieval analysis
Measure whether an embedding space separates factual captions from plausible but wrong alternatives.
Representation learning
Study whether the model encodes sound sources and acoustic events more explicitly under counterfactual supervision.
Causal-style ablations
Probe how language-level interventions change retrieval or classification behavior without altering the audio itself.
Dataset augmentation
Use the paired text to augment caption corpora with alternative linguistic scenarios while preserving clip alignment.
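The retrieval-analysis use case above reduces to a simple per-clip margin. A sketch, assuming you already have L2-normalized embeddings for each clip, its factual caption, and its counterfactual caption (the embedding model itself is up to you):

```python
import numpy as np

def counterfactual_margin(audio_emb, caption_emb, cf_emb):
    """Per-clip margin between factual and counterfactual similarity.
    All arrays are (batch, dim) and assumed L2-normalized; a positive
    margin means the space ranks the factual caption above the
    counterfactual one for that clip."""
    sim_factual = np.sum(audio_emb * caption_emb, axis=1)  # cosine sim, factual
    sim_counter = np.sum(audio_emb * cf_emb, axis=1)       # cosine sim, counterfactual
    return sim_factual - sim_counter
```

The fraction of clips with a positive margin is a direct measure of how well the embedding space separates factual captions from plausible but wrong alternatives.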
Release coverage
Dataset-level view of what is available.
How to use the data
Download the source audio, join on path, and flatten pairs.
This repository does not redistribute the original audio. It complements the official dataset releases and keeps the alignment intentionally simple: the relative path in each JSON row points back to the source clip.
Download source audio
Obtain AudioCaps, Clotho, and MACS from their official providers and preserve the original folder structure when possible.
Load one release JSON
Each row already includes factual captions, counterfactual captions, relative audio path, and metadata such as duration or samplerate.
Flatten into training pairs
Use the example loader to emit JSON Lines records with audio_path, caption, and counterfactual_caption.
python examples/resolve_audio_paths.py \
--release clotho-development-counterfactual.json \
--dataset-root /path/to/clotho \
--limit 3 \
--check-files
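The same flattening can be sketched in a few lines of Python. The `path` and `captions_counterfactual` field names come from this release; the list-valued `captions` field is an assumption, so adjust the keys to the actual schema of the file you load:

```python
import json
from pathlib import Path

def flatten_release(release_json, out_jsonl):
    """Flatten one release file into JSON Lines training pairs and
    return the number of records written. Assumes list-valued
    `captions` / `captions_counterfactual` fields per row."""
    rows = json.loads(Path(release_json).read_text())
    n = 0
    with open(out_jsonl, "w") as f:
        for row in rows:
            # Pair each factual caption with its counterfactual rewrite.
            for cap, cf in zip(row["captions"], row["captions_counterfactual"]):
                f.write(json.dumps({
                    "audio_path": row["path"],
                    "caption": cap,
                    "counterfactual_caption": cf,
                }) + "\n")
                n += 1
    return n
```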
If you already have a CLAP-style pipeline, the simplest integration is to use caption as the factual positive and counterfactual_caption as a structured hard negative tied to the same audio item.
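One possible shape for that integration, not the paper's exact objective: an InfoNCE-style audio-to-text loss where each clip's counterfactual caption embedding is appended as one extra negative column alongside the in-batch captions (NumPy sketch; all embeddings assumed L2-normalized):

```python
import numpy as np

def clap_loss_with_counterfactual(audio_emb, text_emb, cf_text_emb, temperature=0.07):
    """Audio->text InfoNCE loss with the counterfactual caption as an
    extra per-clip negative. Inputs are (batch, dim), L2-normalized."""
    b = len(audio_emb)
    logits_pos = audio_emb @ text_emb.T / temperature  # (B, B), diagonal = positives
    logits_cf = np.sum(audio_emb * cf_text_emb, axis=1, keepdims=True) / temperature  # (B, 1)
    logits = np.concatenate([logits_pos, logits_cf], axis=1)  # counterfactual column appended
    logits -= logits.max(axis=1, keepdims=True)               # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(b), np.arange(b)].mean()      # pick the positive per row
```

Because the counterfactual shares the same clip, this penalizes the model precisely when it cannot tell the factual description from the nearby rewrite.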
Counterfactual examples
Curated text-first examples from the released files.
The page stays lightweight and license-safe by keeping the demo text-first. Each card below fixes the audio path and shows how the paired language shifts.
How to read this: the audio clip is fixed for every card. Only the language intervention changes from factual to counterfactual.
Source / event swaps
Curated examples
Prompt scaffold
The release also exposes the text-generation recipe.
The prompt template documents the source-identification and counterfactual-rewrite setup used in the released annotation process.
Release caveats
Important limits to keep in view when you use the release.
No audio redistribution
This repository releases text annotations and supporting documentation, not hosted audio clips from the source datasets.
Validation split note
In the current Clotho validation JSON, the captions_counterfactual field mirrors the factual captions entry-for-entry, so that split does not yet provide distinct counterfactual text.
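A quick way to screen any split for this mirroring before training. The `captions_counterfactual` field name comes from this release; the list-valued `captions` field is an assumption about the schema:

```python
import json
from pathlib import Path

def distinct_counterfactual_fraction(release_json):
    """Fraction of rows whose counterfactual captions actually differ
    from the factual ones (1.0 = fully distinct, 0.0 = mirrored)."""
    rows = json.loads(Path(release_json).read_text())
    if not rows:
        return 0.0
    distinct = sum(
        row.get("captions_counterfactual") != row.get("captions")
        for row in rows
    )
    return distinct / len(rows)
```

A value near 0.0 means the file behaves like the current Clotho validation JSON and should be excluded from counterfactual-specific evaluation.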
Original dataset licenses apply
Any workflow that touches the original audio should follow the licenses and terms attached to AudioCaps, Clotho, MACS, and their upstream sources.
Resources
Paper, data release, dataset sources, and citation paths.
Paper
Companion page for the accepted ICASSP 2024 paper.
Data release
Official source datasets
Citation
Use the recommended BibTeX below or the repository’s CITATION.cff.
Recommended citation
Cite the ICASSP 2024 paper if you use this release.
The release is designed to make the paper and dataset easy to adopt together: inspect the JSON files, wire them into your pipeline, and cite the published work when it supports your research.
@inproceedings{vosoughi2024counterfactualaudio,
title={Learning Audio Concepts from Counterfactual Natural Language},
author={Vosoughi, Ali and Bondi, Luca and Wu, Ho-Hsiang and Xu, Chenliang},
booktitle={2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year={2024},
url={https://ieeexplore.ieee.org/document/10446736}
}