ICASSP 2024 paper companion

Learning audio concepts from counterfactual natural language.

Ali Vosoughi, Luca Bondi, Ho-Hsiang Wu, Chenliang Xu

Accepted at IEEE ICASSP 2024

Counterfactual Audio releases counterfactual caption annotations for AudioCaps, Clotho, and MACS so researchers can study not only which text matches an audio clip, but which nearby alternatives should be rejected for that same clip.

Novelty: this release adds counterfactual language for the same audio clip, not just factual caption matching.

Teaser figure from the Counterfactual Audio paper
What changes?

The audio path stays fixed. The intervention happens in language, by rewriting sound sources and scene descriptions into plausible counterfactual alternatives.

Citation

Make the paper easy to cite

Use the recommended BibTeX entry directly from this page or from the repository citation file.

@inproceedings{vosoughi2024counterfactualaudio, ...}

What is in this repo

A paper companion release, not just a storage bucket.

The repository packages the released annotation files, prompt scaffold, usage guidance, and a static project page so a researcher can understand the method, inspect the released data, and plug the annotations into an existing audio-text pipeline with minimal setup.

Counterfactual captions

Each released JSON row preserves the original human caption(s) and adds a counterfactual rewrite for the same clip.

Path-aligned to source audio

The path field keeps the annotations joinable to the official AudioCaps, Clotho, or MACS audio tree.

Prompt scaffold included

The release exposes the caption-to-source decomposition and rewrite prompt used to generate counterfactual text.

Research-ready website

The GitHub Pages site lives directly in docs/ and is prepared to serve as a stable paper landing page.

Why this matters

Counterfactual captions change the supervision signal, not just the wording.

Standard CLAP-style audio-text contrastive learning tells the model which caption matches a clip. Counterfactual Audio adds a second text view for that exact same clip, creating a stronger signal for what should not match.

Standard CLAP-style setup

Audio and factual captions are matched contrastively.

The model learns which text is correct for the clip, but it is not explicitly told which nearby textual alternatives should be rejected for the same audio.

Counterfactual Audio setup

The same clip is paired with factual and counterfactual text.

This adds structured linguistic negatives and supports training objectives that preserve factual consistency while separating alternative scenarios in embedding space.

Match vs. reject

The model sees both a positive caption and a semantically nearby alternative attached to the same audio item.

Same audio, different language

The supervision intervenes in text only, which makes it useful for retrieval, representation learning, and causal-style ablations.

Hard negatives with structure

These are not generic random mismatches. They are caption-conditioned alternatives intended to challenge the model more directly.
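The setup above can be sketched as a contrastive objective in which each clip's counterfactual caption is appended as one extra hard negative alongside the usual in-batch negatives. This is a minimal NumPy sketch of that idea, not the paper's actual training objective; the function name and the assumption of L2-normalized embeddings are illustrative.

```python
import numpy as np

def info_nce_with_counterfactuals(audio, pos_text, neg_text, temperature=0.07):
    """InfoNCE-style loss where each clip's counterfactual caption embedding
    is appended as an extra hard negative. All inputs: (B, D) L2-normalized."""
    sim_pos = audio @ pos_text.T                               # (B, B) in-batch caption similarities
    sim_neg = np.sum(audio * neg_text, axis=1, keepdims=True)  # (B, 1) each clip's own counterfactual
    logits = np.concatenate([sim_pos, sim_neg], axis=1) / temperature
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(audio.shape[0])                            # row i's positive is column i
    return float(-log_probs[idx, idx].mean())
```

Because the counterfactual sits in the same softmax as the in-batch captions, the model is penalized directly when the rewrite scores close to the factual caption for the same clip.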

What you can do

Concrete ways to use the release in research pipelines.

Hard negatives

Treat counterfactual_caption as a structured hard negative for the same audio clip in an existing audio-text setup.

Retrieval analysis

Measure whether an embedding space separates factual captions from plausible but wrong alternatives.

Representation learning

Study whether the model encodes sound sources and acoustic events more explicitly under counterfactual supervision.

Causal-style ablations

Probe how language-level interventions change retrieval or classification behavior without altering the audio itself.

Dataset augmentation

Use the paired text to augment caption corpora with alternative linguistic scenarios while preserving clip alignment.

Release coverage

Dataset-level view of what is available.

How to use the data

Download the source audio, join on path, and flatten pairs.

This repository does not redistribute the original audio. It complements the official dataset releases and keeps the alignment intentionally simple: the relative path in each JSON row points back to the source clip.

Step 1. Download source audio

Obtain AudioCaps, Clotho, and MACS from their official providers and preserve the original folder structure when possible.

Step 2. Load one release JSON

Each row already includes factual captions, counterfactual captions, relative audio path, and metadata such as duration or samplerate.
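A row can be inspected like this. The field names (path, captions, captions_counterfactual, duration, samplerate) are assumptions pieced together from the descriptions in this document; check the actual release files for each dataset's exact schema.

```python
import json

# Hypothetical release row, shaped after the fields described on this page.
row = json.loads("""
{
  "path": "clotho/development/rain_on_window.wav",
  "captions": ["Rain taps steadily against a window."],
  "captions_counterfactual": ["Hail clatters steadily against a window."],
  "duration": 15.0,
  "samplerate": 44100
}
""")
print(row["path"])                        # join key back to the source audio tree
print(row["captions_counterfactual"][0])  # the counterfactual rewrite
```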

Step 3. Flatten into training pairs

Use the example loader to emit JSON Lines records with audio_path, caption, and counterfactual_caption.

python examples/resolve_audio_paths.py \
  --release clotho-development-counterfactual.json \
  --dataset-root /path/to/clotho \
  --limit 3 \
  --check-files

If you already have a CLAP-style pipeline, the simplest integration is to use caption as the factual positive and counterfactual_caption as a structured hard negative tied to the same audio item.
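The flattening step can be sketched in a few lines of plain Python. Again, the input field names are assumptions based on this page (the output keys audio_path, caption, and counterfactual_caption match the loader description above); adapt them to the real release schema.

```python
import json

def flatten_rows(rows):
    """Flatten release rows into JSON Lines-ready training records.
    Input field names are assumptions based on this page's description."""
    for row in rows:
        for cap, cf in zip(row["captions"], row["captions_counterfactual"]):
            yield {
                "audio_path": row["path"],
                "caption": cap,                # factual positive
                "counterfactual_caption": cf,  # structured hard negative
            }

rows = [{"path": "development/creek.wav",
         "captions": ["Water trickles over rocks in a creek."],
         "captions_counterfactual": ["Wind rustles leaves beside a creek."]}]
for record in flatten_rows(rows):
    print(json.dumps(record))
```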

Counterfactual examples

Curated text-first examples from the released files.

The page stays lightweight and license-safe by keeping the demo text-first. Each card below keeps the audio clip fixed and shows how the paired language shifts.

How to read this: the audio clip is fixed for every card. Only the language intervention changes from factual to counterfactual.

Source / event swaps

Curated examples

Prompt scaffold

The release also exposes the text-generation recipe.

The prompt template documents the source-identification and counterfactual-rewrite setup used in the released annotation process.

Release caveats

Important limits to keep in view when you use the release.

No audio redistribution

This repository releases text annotations and supporting documentation, not hosted audio clips from the source datasets.

Validation split note

In the current Clotho validation JSON, the captions_counterfactual field mirrors the factual captions entry-for-entry, so that split carries no counterfactual contrast on its own.
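Since a mirrored row contributes no negative signal, a simple guard can skip such rows before mining hard negatives. The field names here are the same assumed schema used elsewhere on this page; this is a sketch, not part of the release.

```python
def has_counterfactual_signal(row):
    """Return True only if the counterfactual captions actually differ
    from the factual ones; mirrored rows contribute no contrast."""
    return row.get("captions_counterfactual") != row.get("captions")

mirrored = {"captions": ["A dog barks."],
            "captions_counterfactual": ["A dog barks."]}
distinct = {"captions": ["A dog barks."],
            "captions_counterfactual": ["A siren wails."]}
```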

Original dataset licenses apply

Any workflow that touches the original audio should follow the licenses and terms attached to AudioCaps, Clotho, MACS, and their upstream sources.

Resources

Paper, data release, dataset sources, and citation paths.

Recommended citation

Cite the ICASSP 2024 paper if you use this release.

The release is designed to make the paper and dataset easy to adopt together: inspect the JSON files, wire them into your pipeline, and cite the published work when it supports your research.

@inproceedings{vosoughi2024counterfactualaudio,
  title={Learning Audio Concepts from Counterfactual Natural Language},
  author={Vosoughi, Ali and Bondi, Luca and Wu, Ho-Hsiang and Xu, Chenliang},
  booktitle={2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2024},
  url={https://ieeexplore.ieee.org/document/10446736}
}