Francesca Ronchini1, Luca Comanducci1, and Fabio Antonacci1
1 Dipartimento di Elettronica, Informazione e Bioingegneria - Politecnico di Milano
Paper accepted @ DCASE Workshop 2024
In the past few years, text-to-audio models have emerged as a significant advancement in automatic audio generation. Although they represent impressive technological progress, the effectiveness of their use in the development of audio applications remains uncertain. This paper investigates these aspects, focusing on the task of environmental sound classification. Specifically, it analyzes the performance of two different environmental sound classification systems when data generated by text-to-audio models is used for training. Two cases are considered: a) the training dataset is augmented with data coming from two different text-to-audio models; and b) the training dataset consists solely of synthetic audio generated by the text-to-audio models. In both cases, the performance of the classification task is tested on real data. Results indicate that text-to-audio models are effective for dataset augmentation, whereas performance drops when relying solely on generated audio.
To generate the data, we used AudioLDM2 and AudioGen.
Please refer to the AudioLDM2 GitHub repo and follow the installation instructions. For this study, we used the official checkpoints available in the Hugging Face 🧨 Diffusers library and the audioldm checkpoint.
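For reference, generating a single clip with the Diffusers checkpoint looks roughly like the sketch below. This is a minimal illustration, assuming the cvssp/audioldm2 weights and illustrative generation parameters; it is not the exact code of class_generation_audioldm.py:

```python
# Minimal sketch: generating one clip with AudioLDM2 via Hugging Face Diffusers.
# Checkpoint and generation parameters are assumptions, not necessarily the
# exact settings used in class_generation_audioldm.py.
import scipy.io.wavfile
import torch
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

audio = pipe(
    "a dog barking",
    num_inference_steps=200,
    audio_length_in_s=10.0,
).audios[0]

# AudioLDM2 outputs audio at 16 kHz.
scipy.io.wavfile.write("dog_bark_0.wav", rate=16000, data=audio)
```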
When AudioLDM2 has been installed, you can generate the audio files by running the script audio_generation/class_generation_audioldm.py. Before running the script, you need to specify the path to the output folder, the audio class to generate, the prompt used to generate the files, and the number of files to generate directly in audio_generation/class_generation_audioldm.py.
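The exact variable names in the script may differ; the following is a hypothetical illustration of the parameters to set:

```python
# Hypothetical parameter block for class_generation_audioldm.py;
# variable names are illustrative, not necessarily the script's actual ones.
output_folder = "data/generated/audioldm/dog_bark"  # where the generated .wav files are saved
audio_class = "dog_bark"                            # environmental sound class to generate
prompt = "a dog barking"                            # text prompt passed to the model
num_files = 100                                     # number of clips to generate
```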
After that, you can run the script with the command:
```sh
cd audio_generation
python class_generation_audioldm.py
```
Please refer to the AudioGen GitHub repo and follow the installation instructions.
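Once installed, generating a clip with the AudioCraft API looks roughly like the sketch below; the checkpoint name and duration are assumptions and not necessarily the settings used in class_generation_audiogen.py:

```python
# Minimal sketch: generating one clip with AudioGen via the AudioCraft API.
# Checkpoint and duration are assumptions, not necessarily the settings
# used in class_generation_audiogen.py.
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained("facebook/audiogen-medium")
model.set_generation_params(duration=10)  # length of each generated clip, in seconds

wavs = model.generate(["a dog barking"])  # one waveform per text description

# Save with loudness normalization; audio_write appends the .wav extension.
audio_write("dog_bark_0", wavs[0].cpu(), model.sample_rate, strategy="loudness")
```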
The audio files are generated with the script audio_generation/class_generation_audiogen.py. As for AudioLDM2, before running the script you need to specify the path to the output folder, the audio class to generate, the prompt used to generate the files, and the number of files to generate directly in audio_generation/class_generation_audiogen.py. After that, you can run the script with the command:
```sh
cd audio_generation
python class_generation_audiogen.py
```
When all the data have been generated, you can reproduce the experiments.
First, install all the required packages by running the following command in your terminal:
```sh
pip install -r requirements.txt
```
When all packages have been installed, you need to specify which dataset to use, following the instructions in the config/default.yaml file.
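As an illustration, a configuration entry could look like the sketch below; the keys and values here are hypothetical, and the actual options are documented in config/default.yaml itself:

```yaml
# Hypothetical excerpt of config/default.yaml; key names are illustrative,
# the actual options are documented in the file itself.
dataset:
  name: audioldm_augmented      # e.g. real, augmented, or synthetic-only training set
  data_path: data/generated/audioldm
training:
  batch_size: 32
  epochs: 100
```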
After all the parameters have been defined, you can run the code with the following command:
```sh
python main.py
```
Additional material and audio samples are available on the companion website.
For more details, see: Francesca Ronchini, Luca Comanducci, and Fabio Antonacci, "Synthetic training set generation using text-to-audio models for environmental sound classification", in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024), Tokyo, Japan, October 2024, pp. 126–130.
If you use code or material from this work, please cite our paper:
```bibtex
@inproceedings{Ronchini2024,
    author    = "Ronchini, Francesca and Comanducci, Luca and Antonacci, Fabio",
    title     = "Synthetic Training Set Generation using Text-To-Audio Models for Environmental Sound Classification",
    booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024)",
    address   = "Tokyo, Japan",
    month     = "October",
    year      = "2024",
    pages     = "126--130",
}
```