Automated Audio Captioning - DCASE (2023)

Automatic creation of textual content descriptions for general audio signals.

If you are interested in the task, you can join us on the dedicated Slack channel.

Automated audio captioning (AAC) is the task of general audio content description using free text. It is an inter-modal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. AAC methods can model concepts (e.g. "muffled sound"), physical properties of objects and environment (e.g. "the sound of a big car", "people talking in a small and empty room"), and high-level knowledge ("a clock rings three times"). This modeling can be used in various applications, ranging from automatic content description to intelligent and content-oriented machine-to-machine interaction.


The task of AAC is a continuation of the AAC task from DCASE2021. As in DCASE2021, this year the task of AAC will allow the usage of any external data and/or pre-trained models. For example, participants are allowed to use other datasets for AAC, or even datasets for sound event detection/tagging, acoustic scene classification, or datasets from any other task that might be deemed fit. Additionally, participants can use pre-trained models, such as (but not limited to) Word2Vec, BERT, and YAMNet, wherever they want in their model. Please see below for some recommendations for datasets and pre-trained models.

This year, the task of Automated Audio Captioning is partially supported by:

  • the European Union's Horizon 2020 research and innovation programme under grant agreement No 957337, project MARVEL.
  • the ANR under Grant No ANR-18-CE23-0020, project LEAUDS.


The AAC task in DCASE2022 will use the Clotho v2 dataset for the evaluation of the submissions. However, participants can use any other dataset, from any other task, for the development of their methods. In this section, we will describe the Clotho v2 dataset.

Clotho dataset

Clotho v2 is an extension of the original Clotho dataset (i.e. v1) and consists of audio samples of 15 to 30 seconds duration, each audio sample having five captions of eight to 20 words length. There is a total of 6972 (4981 from version 1 and 1991 from v2) audio samples in Clotho, with 34 860 captions (i.e. 6972 audio samples * 5 captions per each sample). Clotho v2 is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. The new data in Clotho v2 will not affect the splits used in order to assess the performance of methods using the previous version of Clotho (i.e. the evaluation and testing splits of Clotho v1). All audio samples are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing.

Clotho v2 has a total of around 4500 words and is divided into four splits: development, validation, evaluation, and testing. Audio samples are publicly available for all four splits, but captions are publicly available only for the development, validation, and evaluation splits. There are no overlapping audio samples between the four different splits, and there is no word that appears in the evaluation, validation, or testing splits that does not also appear in the development split. Also, there is no word that appears in the development split and does not appear in at least one of the other three splits. All words appear proportionally between splits (the word distribution is kept similar across splits), i.e. 55% in the development, 15% in the validation, 15% in the evaluation, and 15% in the testing split.

Words that could not be divided using the above scheme of 55-15-15-15 (e.g. words that appear only two times in all four splits combined) appear at least one time in the development split and at least one time in one of the other three splits. This splitting process is similar to the one used for the previous version of Clotho. More detailed information about the splitting process can be found in the paper presenting Clotho, freely available online here and cited as:

Publication

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 736–740. 2020.

Clotho: an Audio Captioning Dataset

Abstract

Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).

@inproceedings{Drossos_2020_icassp, author = "Drossos, Konstantinos and Lipping, Samuel and Virtanen, Tuomas", title = "Clotho: an Audio Captioning Dataset", booktitle = "Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP)", year = "2020", pages = "736-740", abstract = "Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684)."}

The data collection of Clotho received funding from the European Research Council, grant agreement 637422 EVERYSOUND.


Audio samples in Clotho

Audio samples have durations ranging from 10s to 30s, no spelling errors in the first sentence of the description on Freesound, good quality (44.1kHz and 16-bit), and no tags on Freesound indicating sound effects, music, or speech. Before extraction, all files were normalized and the preceding and trailing silences were trimmed.

The content of audio samples in Clotho greatly varies, ranging from ambiance in a forest (e.g. water flowing over some rocks), animal sounds (e.g. goats bleating), and crowd yelling or murmuring, to machines and engines operating (e.g. inside a factory) or revving (e.g. cars, motorbikes), and devices functioning (e.g. container with contents moving, doors opening/closing). For a thorough description of how the audio samples are selected and filtered, you can check the paper presenting the Clotho dataset.

The following figure shows the distribution of the duration of the audio files in Clotho v2.

[Figure: distribution of audio file durations in Clotho v2]

Captions in Clotho

The captions in the Clotho dataset range from 8 to 20 words in length, and were gathered by employing the crowdsourcing platform Amazon Mechanical Turk and a three-step framework. The three steps are:

  1. audio description,
  2. description editing, and
  3. description scoring.

In step 1, five initial captions were gathered for each audio clip from distinct annotators. In step 2, these initial captions were edited to fix grammatical errors. Grammatically correct captions were instead rephrased, in order to acquire diverse captions for the same audio clip. In step 3, the initial and edited captions were scored based on accuracy, i.e. how well the caption describes the audio clip, and fluency, i.e. the English fluency of the caption itself. The initial and edited captions were scored by three distinct annotators. The scores were then summed together and the captions were sorted by total accuracy score first and total fluency score second. The top five captions, after sorting, were selected as the final captions of the audio clip. More information about the caption scoring (e.g. scoring values, scoring threshold, etc.) can be found in the corresponding paper on the three-step framework.


We then manually sanitized the final captions of the dataset by removing apostrophes, making compound words consistent, removing phrases describing the content of speech, and replacing named entities. We used in-house annotators to replace transcribed speech in the captions. If a resulting caption was under eight words, we attempted to find captions among the lower-scored captions (i.e. those not selected in step 3) that still had decent scores for accuracy and fluency. If there were no such captions, or these captions could not be rephrased to eight words or more, the audio file was removed from the dataset entirely. The same in-house annotators were also used to replace unique words that appeared only in the captions of one audio clip. Since audio clips are not shared between splits, if there are words that appear only in the captions of one audio clip, then these words will appear only in one split.

A thorough description of the three-step framework can be found in the corresponding paper, freely available online here and cited as:

Publication

Samuel Lipping, Konstantinos Drossos, and Tuomas Virtanen. Crowdsourcing a dataset of audio captions. In Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE). Nov. 2019. URL: https://arxiv.org/abs/1907.09238.

Crowdsourcing a Dataset of Audio Captions

Abstract

Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. “people talking in a big room”). The creation of a dataset for this task requires a considerable amount of work, rendering the crowdsourcing a very attractive option. In this paper we present a three steps based framework for crowdsourcing an audio captioning dataset, based on concepts and practises followed for the creation of widely used image captioning and machine translations datasets. During the first step initial captions are gathered. A grammatically corrected and/or rephrased version of each initial caption is obtained in second step. Finally, the initial and edited captions are rated, keeping the top ones for the produced dataset. We objectively evaluate the impact of our framework during the process of creating an audio captioning dataset, in terms of diversity and amount of typographical errors in the obtained captions. The obtained results show that the resulting dataset has less typographical errors than the initial captions, and on average each sound in the produced dataset has captions with a Jaccard similarity of 0.24, roughly equivalent to two ten-word captions having in common four words with the same root, indicating that the captions are dissimilar while they still contain some of the same information.

@inproceedings{Drossos_2019_dcase, Author = "Lipping, Samuel and Drossos, Konstantinos and Virtanen, Tuoams", title = "Crowdsourcing a Dataset of Audio Captions", year = "2019", booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop ({DCASE})", month = "Nov.", abstract = "Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. “people talking in a big room”). The creation of a dataset for this task requires a considerable amount of work, rendering the crowdsourcing a very attractive option. In this paper we present a three steps based framework for crowdsourcing an audio captioning dataset, based on concepts and practises followed for the creation of widely used image captioning and machine translations datasets. During the first step initial captions are gathered. A grammatically corrected and/or rephrased version of each initial caption is obtained in second step. Finally, the initial and edited captions are rated, keeping the top ones for the produced dataset. We objectively evaluate the impact of our framework during the process of creating an audio captioning dataset, in terms of diversity and amount of typographical errors in the obtained captions. The obtained results show that the resulting dataset has less typographical errors than the initial captions, and on average each sound in the produced dataset has captions with a Jaccard similarity of 0.24, roughly equivalent to two ten-word captions having in common four words with the same root, indicating that the captions are dissimilar while they still contain some of the same information.", url = "https://arxiv.org/abs/1907.09238"}

Development, validation, and evaluation datasets of Clotho

Clotho v2 is divided into a development split of 3839 audio clips with 19195 captions, a validation split of 1045 audio clips with 5225 captions, an evaluation split of 1045 audio clips with 5225 captions, and a testing split of 1043 audio clips with 5215 captions. These splits are created by first constructing the sets of unique words of the captions of each audio clip. These sets of words are combined to form the bag of words of the whole dataset, from which we can derive the frequency of a given word. With the unique words of audio files as classes, we use multi-label stratification. More information on the splits of Clotho can be found in the corresponding paper.
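To make the split construction concrete, the following minimal Python sketch (with hypothetical clip names and captions) builds the per-clip sets of unique caption words and the dataset-level word frequencies that the stratification operates on; the actual splitting procedure is the one described in the Clotho paper.

from collections import Counter

# Hypothetical toy data: two clips with their captions.
clip_captions = {
    "clip_a.wav": ["a clock rings three times", "a bell chimes slowly three times"],
    "clip_b.wav": ["water flows over rocks in a small stream"],
}

# Per-clip sets of unique caption words (these act as the multi-label "classes").
clip_word_sets = {clip: set(" ".join(caps).split()) for clip, caps in clip_captions.items()}

# Bag of words of the whole (toy) dataset, giving the frequency of each word.
word_frequency = Counter(word for words in clip_word_sets.values() for word in words)
print(word_frequency.most_common(3))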

The names of the splits of Clotho differ from the DCASE terminology. To avoid confusion for participants, the correspondence of splits between Clotho and the DCASE challenge is:

  Clotho naming of splits     DCASE Challenge naming of splits
  development                 development
  validation                  development
  evaluation                  development
  testing                     evaluation

Clotho development and validation splits are meant for optimizing audio captioning methods. The performance of the audio captioning methods can then be assessed (e.g. for reporting results in a conference or journal paper) using the Clotho evaluation split. The Clotho testing split is meant only for usage in scientific challenges, e.g. the DCASE challenge. For the rest of this text, the DCASE challenge terminology will be used. For differentiating between Clotho development and evaluation, the terms development-training, development-validation, and development-testing will be used, wherever necessary. Development-training refers to the Clotho development split, development-validation refers to the Clotho validation split, and development-testing refers to the Clotho evaluation split.

Clotho download

DCASE Development split of Clotho can be found at the online Zenodo repository. Make sure that you download Clotho v2.1, as there were some minor fixes in the dataset (fixing of file naming and some corrupted files).

Development-training data are:

  • clotho_audio_development.7z: The development-training audio clips.
  • clotho_captions_development.csv: The captions of the development-training audio clips.
  • clotho_metadata_development.csv: The meta-data of the development-training audio clips.

Development-validation data are:

  • clotho_audio_validation.7z: The development-validation audio clips.
  • clotho_captions_validation.csv: The captions of the development-validation audio clips.
  • clotho_metadata_validation.csv: The meta-data of the development-validation audio clips.

Development-testing data are:

  • clotho_audio_evaluation.7z: The development-testing audio clips.
  • clotho_captions_evaluation.csv: The captions of the development-testing audio clips.
  • clotho_metadata_evaluation.csv: The meta-data of the development-testing audio clips.

DCASE Evaluation split of Clotho (i.e. Clotho-testing) can be found at the online Zenodo repository.

Task 6 Evaluation dataset (1.2 GB)

Evaluation data are:

  • clotho_audio_test.7z: The DCASE evaluation (Clotho-testing) audio clips.
  • clotho_metadata_test.csv: The meta-data of the DCASE evaluation (Clotho-testing) audio clips, containing only authors and licence.

NOTE: Participants are strictly prohibited from using any additional information for the DCASE evaluation (testing) of their method, apart from the provided audio files of the DCASE Evaluation split.

In this year's edition, we are introducing supplementary data for evaluation. The objective is to provide contrastive analysis and performance measures on out-of-domain audio and captions. This supplementary subset, Clotho-analysis, can be obtained on Zenodo:

Task 6a Analysis dataset (12.7 GB)

Note that results on this new subset will not affect submission rankings.

Other suggested resources

Participants are allowed to use any available resource (e.g. pre-trained models, other datasets) for developing their AAC methods. In order to help participants discover and use other resources, we have put up a GitHub repository with AAC resources.


Participants are encouraged to browse the information in the GitHub repository with the suggested AAC resources and use any resource that they deem proper. Additionally, participants are allowed to use any other resource, whether or not it is listed in the GitHub repository.

Participants are free to use any dataset and pre-trained model in order to develop their AAC method(s). The assessment of the methods will be performed using the withheld split of Clotho, which is the same as last year, offering direct comparison with the results of the AAC task at DCASE2020 and DCASE2021.

Participants are allowed to:

  • Use external data (e.g. audio files, text, annotations).
  • Use pre-trained models (e.g. text models like Word2Vec, audio tagging models, sound event detection models).
  • Augment the development dataset (i.e. development-training and development-testing) with or without the use of external data.
  • Use all the available metadata provided, but they must explicitly state that they do so. This will not affect the rating of their method.

Participants are NOT allowed to:

  • Make subjective judgments of the DCASE evaluation (testing) data, nor annotate it.
  • Use any additional information about the DCASE evaluation (testing) data for their method, apart from the provided audio files of the DCASE Evaluation split.

All participants should submit:

  • the output of their audio captioning with the Clotho-testing split (*.csv file),
  • the output of their audio captioning with the Clotho-analysis split (*.csv file),
  • metadata for their submission (*.yaml file), and
  • a technical report for their submission (*.pdf file).

We allow up to 4 system output submissions per participant/team. For each system, metadata should be provided in a separate file, containing the task-specific information. All files should be packaged into a zip file for submission. Please make a clear connection between the system name in the submitted metadata (the .yaml file), the submitted system output (the .csv file), and the technical report (the .pdf file)! For indicating the connection between your files, you can consider using the following naming convention:

<author>_<institute>_task6a_submission_<submission_index>_<testing_output or analysis_output or metadata or report>.<csv or yaml or pdf>

For example:

Gontier_INR_task6a_submission_1_testing_output.csv
Gontier_INR_task6a_submission_1_analysis_output.csv
Gontier_INR_task6a_submission_1_metadata.yaml
Gontier_INR_task6a_submission_1_report.pdf

The field <submission_index> is used to differentiate your submissions in case you have multiple submissions.

System output file

The system output file should be a *.csv file, and should have the following two columns:

  1. file_name, which will contain the file name of the audio file.
  2. caption_predicted, which will contain the output of the audio captioning method for the file with file name as specified in the file_name column.

For example, if a file has the file name test_0001.wav and the predicted caption of the audio captioning method for the file test_0001.wav is hello world, then the CSV file should have the entry:

  file_name        caption_predicted
  ...              ...
  test_0001.wav    hello world

Please note: automated procedures will be used for the evaluation of the submitted results. Therefore, the column names should be exactly as indicated above.
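As a sanity check before submission, a minimal sketch of writing such a file with Python's csv module is shown below; the predictions dictionary and the output file name are placeholders.

import csv

# Hypothetical predictions: audio file name -> predicted caption.
predictions = {"test_0001.wav": "hello world"}

with open("Gontier_INR_task6a_submission_1_testing_output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["file_name", "caption_predicted"])
    writer.writeheader()  # column names must match the specification exactly
    for file_name, caption in predictions.items():
        writer.writerow({"file_name": file_name, "caption_predicted": caption})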

Two output files are expected for each submitted system: one for the DCASE evaluation subset (Clotho-testing), and one for the Clotho-analysis subset.

For each system, metadata should be provided in a separate file. The file format should be as indicated below.

# Submission information for task 6, subtask A
submission:
  # Submission label
  # Label is used to index submissions.
  # Generate your label in the following way to avoid
  # overlapping codes among submissions:
  # [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
  label: gontier_inr_task6a_1
  #
  # Submission name
  # This name will be used in the results tables when space permits
  name: DCASE2022 baseline system
  #
  # Submission name abbreviated
  # This abbreviated name will be used in the results table when space is tight.
  # Use maximum 10 characters.
  abbreviation: Baseline

  # Authors of the submitted system. Mark authors in
  # the order you want them to appear in submission lists.
  # One of the authors has to be marked as corresponding author,
  # this will be listed next to the submission in the results tables.
  authors:
    # First author
    - lastname: Gontier
      firstname: Felix
      email: felix.gontier@inria.fr     # Contact email address
      corresponding: true               # Mark true for one of the authors

      # Affiliation information for the author
      affiliation:
        abbreviation: INRIA
        institute: INRIA
        department: Multispeech         # Optional
        location: Nancy, France

    # Second author
    # ...

# System information
system:
  # System description, meta data provided here will be used to do
  # meta analysis of the submitted system.
  # Use general level tags, when possible use the tags provided in comments.
  # If information field is not applicable to the system, use "!!null".
  description:

    # Audio input / sampling rate
    # e.g. 16kHz, 22.05kHz, 44.1kHz, 48.0kHz
    input_sampling_rate: 16kHz

    # Acoustic representation
    # Here you should indicate what kind of audio representation
    # you used. If your system used hand-crafted features (e.g.
    # mel band energies), then you can do:
    #
    # `acoustic_features: mel energies`
    #
    # Else, if you used some pre-trained audio feature extractor,
    # you can indicate the name of the system, for example:
    #
    # `acoustic_features: audioset`
    acoustic_features: VGGish

    # Word embeddings
    # Here you can indicate how you treated word embeddings.
    # If your method learned its own word embeddings (i.e. you
    # did not use any pre-trained word embeddings) then you can
    # do:
    #
    # `word_embeddings: learned`
    #
    # Else, specify the pre-trained word embeddings that you used
    # (e.g. Word2Vec, BERT, etc).
    word_embeddings: BART

    # Data augmentation methods
    # e.g. mixup, time stretching, block mixing, pitch shifting, ...
    data_augmentation: !!null

    # Method scheme
    # Here you should indicate the scheme of the method that you
    # used. For example:
    machine_learning_method: encoder-decoder

    # Learning scheme
    # Here you should indicate the learning scheme.
    # For example, you could specify either
    # supervised, self-supervised, or even
    # reinforcement learning.
    learning_scheme: supervised

    # Ensemble
    # Here you should indicate if you used ensemble
    # of systems or not.
    ensemble: No

    # Audio modelling
    # Here you should indicate the type of system used for
    # audio modelling. For example, if you used some stacked CNNs, then
    # you could do:
    #
    # audio_modelling: cnn
    #
    # If you used some pre-trained system for audio modelling,
    # then you should indicate the system used (e.g. COALA, COLA,
    # transformer).
    audio_modelling: transformer

    # Word modelling
    # Similarly, here you should indicate the type of system used
    # for word modelling. For example, if you used some RNNs,
    # then you could do:
    #
    # word_modelling: rnn
    #
    # If you used some pre-trained system for word modelling,
    # then you should indicate the system used (e.g. transformer).
    word_modelling: transformer

    # Loss function
    # Here you should indicate the loss function that you employed.
    loss_function: crossentropy

    # Optimizer
    # Here you should indicate the name of the optimizer that you
    # used.
    optimizer: adamw

    # Learning rate
    # Here you should indicate the learning rate of the optimizer
    # that you used.
    learning_rate: 1e-5

    # Gradient clipping
    # Here you should indicate if you used any gradient clipping.
    # You do this by indicating the value used for clipping. Use
    # 0 for no clipping.
    gradient_clipping: 0

    # Gradient norm
    # Here you should indicate the norm of the gradient that you
    # used for gradient clipping. This field is used only when
    # gradient clipping has been employed.
    gradient_norm: !!null

    # Metric monitored
    # Here you should report the monitored metric
    # for optimizing your method. For example, did you
    # monitor the loss on the validation data (i.e. validation
    # loss)? Or did you monitor the SPIDEr metric? Maybe the training
    # loss?
    metric_monitored: validation_loss

  # System complexity, meta data provided here will be used to evaluate
  # submitted systems from the computational load perspective.
  complexity:
    # Total amount of parameters used in the acoustic model.
    # For neural networks, this information is usually given before training process
    # in the network summary.
    # For other than neural networks, if parameter count information is not directly
    # available, try estimating the count as accurately as possible.
    # In case of ensemble approaches, add up parameters for all subsystems.
    # In case embeddings are used, add up parameter count of the embedding
    # extraction networks and classification network.
    # Use numerical value (do not use comma for thousands-separator).
    total_parameters: 140000000

  # List of external datasets used in the submission.
  # Development dataset is used here only as example, list only external datasets
  external_datasets:
    # Dataset name
    - name: Clotho
      # Dataset access url
      url: https://doi.org/10.5281/zenodo.3490683
      # Has audio:
      has_audio: Yes
      # Has images
      has_images: No
      # Has video
      has_video: No
      # Has captions
      has_captions: Yes
      # Number of captions per audio
      nb_captions_per_audio: 5
      # Total amount of examples used
      total_audio_length: 24430
      # Used for (e.g. audio_modelling, word_modelling, audio_and_word_modelling)
      used_for: audio_and_word_modelling

  # URL to the source code of the system [optional]
  source_code: https://github.com/audio-captioning/dcase-2021-baseline

# System results
results:
  development_evaluation:
    # System results for development evaluation split.
    # Full results are not mandatory, however, they are highly recommended
    # as they are needed for thorough analysis of the challenge submissions.
    # If you are unable to provide all results, also incomplete
    # results can be reported.
    bleu1: 0.555
    bleu2: 0.358
    bleu3: 0.239
    bleu4: 0.156
    rougel: 0.364
    meteor: 0.164
    cider: 0.358
    spice: 0.109
    spider: 0.233
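Before packaging the submission, it can also help to verify that the metadata file parses as valid YAML and contains the expected top-level sections; a minimal sketch using PyYAML is given below (the file name is the example from the naming convention above).

import yaml  # PyYAML

# Parse the submission metadata and check the expected top-level sections.
with open("Gontier_INR_task6a_submission_1_metadata.yaml") as f:
    meta = yaml.safe_load(f)

assert {"submission", "system", "results"} <= set(meta), "missing top-level section"
print(meta["submission"]["label"])                        # e.g. gontier_inr_task6a_1
print(meta["results"]["development_evaluation"]["spider"])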

Open and reproducible research

Finally, for supporting open and reproducible research, we kindly ask each participant/team to consider making available the code of their method (e.g. on GitHub) and pre-trained models, after the challenge is over.

The submitted systems will be evaluated according to their performance on the withheld evaluation split. For the evaluation, the captions will not have any punctuation and all letters will be lowercase. Therefore, participants are advised to optimize their methods using captions which do not have punctuation and in which all letters are lowercase.

All of the following metrics will be reported for every submitted method:

  1. BLEU1
  2. BLEU2
  3. BLEU3
  4. BLEU4
  5. ROUGEL
  6. METEOR
  7. CIDEr
  8. SPICE
  9. SPIDEr

Ranking of the methods will be performed according to the SPIDEr metric, which is a combination of CIDEr and SPICE. Specifically, the evaluation will be performed based on the average of CIDEr and SPICE, referred to as SPIDEr and shown to have the combined benefits of CIDEr and SPICE. More information is available in the corresponding paper, available online here.
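Because SPIDEr is simply the average of CIDEr and SPICE, it can be recomputed from those two scores; the short sketch below illustrates this, using the baseline scores reported further down this page (small differences from the reported SPIDEr come from rounding of the individual metrics).

def spider_score(cider: float, spice: float) -> float:
    """SPIDEr is the arithmetic mean of CIDEr and SPICE."""
    return 0.5 * (cider + spice)

# Baseline development-testing scores (rounded): CIDEr 0.358, SPICE 0.109.
print(spider_score(0.358, 0.109))  # approximately 0.2335, reported as SPIDEr 0.233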

For a brief introduction and more pointers on the above-mentioned metrics, you can refer to the original audio captioning paper:

Publication

Konstantinos Drossos, Sharath Adavanne, and Tuomas Virtanen. Automated audio captioning with recurrent neural networks. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New Paltz, New York, U.S.A., Oct. 2017. URL: https://arxiv.org/abs/1706.10006.

Automated Audio Captioning with Recurrent Neural Networks

Abstract

We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model are fully connected layers with shared weights between timesteps. The proposed method is evaluated using data drawn from a commercial sound effects library, ProSound Effects. The resulting captions were rated through metrics utilized in machine translation and image captioning fields. Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.

@inproceedings{Drossos_2017_waspaa, Author = "Drossos, Konstantinos and Adavanne, Sharath and Virtanen, Tuomas", title = "Automated Audio Captioning with Recurrent Neural Networks", booktitle = "IEEE Workshop on Applications of Signal Processing to Audio and Acoustics ({WASPAA})", address = "New Paltz, New York, U.S.A.", year = "2017", month = "Oct.", abstract = "We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model are fully connected layers with shared weights between timesteps. The proposed method is evaluated using data drawn from a commercial sound effects library, ProSound Effects. The resulting captions were rated through metrics utilized in machine translation and image captioning fields. Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.", url = "https://arxiv.org/abs/1706.10006"}

In this edition of the task, we are also introducing more recent model-based metrics for captioning. Note that these additional metrics are contrastive, and will not affect rankings of the submissions.

Complete results and technical reports can be found on the results page.

To provide a starting point and some initial results for the challenge, there is a baseline system for the task of automated audio captioning. The baseline system is freely available online, is a sequence-to-sequence model, and is implemented using PyTorch.

The baseline system consists of three parts:

  1. the caption evaluation part,
  2. the dataset pre-processing/feature extraction part, and
  3. the deep neural network (DNN) method part.

You can find the baseline system of automated audio captioning task on GitHub.

Caption evaluation

Caption evaluation is performed using a version of the caption evaluation tools used for the MS COCO challenge. This version of the code has been updated in order to be compliant with Python 3.6 and above and with the needs of the automated audio captioning task. The code for the evaluation is included in the baseline system, but can also be found online.

Clotho examples are available as WAV and CSV files. For use in the development of an audio captioning method, features have to be extracted from the audio clips (i.e. WAV files), and the captions in the CSV files have to be pre-processed (e.g. punctuation removal). The extracted features and processed words then have to be matched as input-output pairs.

In the baseline system there is code that implements the above. This code is also available as a stand-alone package in a separate repository.

Deep neural network (DNN) method

Lastly, the baseline is a sequence-to-sequence system consisting of an encoder and a decoder, and is implemented with the PyTorch HuggingFace framework. Specifically, it is a standard sequence-to-sequence transformer with 6 encoder and 6 decoder layers. The encoder takes as input embeddings from a pre-trained VGGish model with dimension 128, transformed by an affine layer to embeddings of dimension 768, and outputs a sequence of embeddings of the same length as the input sequence. Each transformer layer of the encoder outputs 768 features for each timestep of the input representation.

The decoder uses the encoder outputs to generate the caption in an autoregressive manner. To do so, previously generated words are tokenized and transformed into embeddings as inputs to the decoder. In the baseline model, the tokenizer is pre-trained with a byte-pair-encoding process, i.e. each token corresponds to a sub-word from the model vocabulary instead of a full English word. This tokenizer has a vocabulary size of 50265 tokens. Each token in the past sequence is then associated with a feature vector through an embedding map, and input to the decoder. Each layer of the decoder attends to the previously generated tokens with self-attention, as well as to the entire encoder output sequence with cross-attention. The embedding dimension of each decoder layer is 768. Lastly, a classifier head consisting of one linear layer with softmax activation outputs a probability for each token of the vocabulary.

At evaluation, generation is performed through beam search, although greedy decoding is also provided in the code.
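As an illustration, the following minimal sketch builds a comparable encoder-decoder with the HuggingFace transformers library; it is not the official baseline code (see the task GitHub repository for that), and the specific classes and values below are assumptions chosen only to match the dimensions described above.

import torch
import torch.nn as nn
from transformers import BartConfig, BartForConditionalGeneration

# Sequence-to-sequence transformer: 6 encoder and 6 decoder layers, 768-dimensional
# embeddings, 12 attention heads, 3072-dimensional feed-forward layers, 50265 tokens.
config = BartConfig(
    vocab_size=50265,
    d_model=768,
    encoder_layers=6,
    decoder_layers=6,
    encoder_attention_heads=12,
    decoder_attention_heads=12,
    encoder_ffn_dim=3072,
    decoder_ffn_dim=3072,
)
model = BartForConditionalGeneration(config)

# Affine layer mapping 128-dimensional VGGish embeddings to 768-dimensional encoder inputs.
audio_projection = nn.Linear(128, 768)

vggish_embeddings = torch.randn(1, 30, 128)           # e.g. 30 one-second windows
encoder_inputs = audio_projection(vggish_embeddings)

# One decoding step: the classifier head outputs a distribution over the vocabulary.
decoder_input_ids = torch.tensor([[config.decoder_start_token_id]])
logits = model(inputs_embeds=encoder_inputs, decoder_input_ids=decoder_input_ids).logits
print(logits.shape)                                   # torch.Size([1, 1, 50265])

Beam search over these per-step token distributions can then be run at inference time, e.g. with the generation utilities of the same library.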

Hyper-parameters

Feature extraction and caption processing: The baseline system uses the following hyper-parameters for the feature extraction:

  • Pre-trained VGGish embeddings of dimension 128
  • Windows of 1s without overlap

Captions are pre-processed according to the following (a minimal sketch is given after the list):

  • Removal of all punctuation
  • All letters to lowercase
  • Tokenization
  • Pre-pending of start-of-sentence token (i.e. <s>)
  • Appending of end-of-sentence token (i.e. </s>)
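A minimal sketch of this caption pre-processing is shown below; for clarity it uses naive whitespace tokenization, whereas the baseline itself uses the pre-trained BPE tokenizer described earlier.

import re

def preprocess_caption(caption: str) -> str:
    """Lowercase, strip punctuation, tokenize, and add sentence boundary tokens."""
    caption = caption.lower()                  # all letters to lowercase
    caption = re.sub(r"[^\w\s]", "", caption)  # remove all punctuation
    tokens = caption.split()                   # naive whitespace tokenization
    return " ".join(["<s>"] + tokens + ["</s>"])

print(preprocess_caption("A clock rings three times."))
# <s> a clock rings three times </s>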

Neural network and optimization hyper-parameters: The deep neuralnetwork used in the baseline system has the following hyper-parameters:

  • One affine layer with no activation function, which transforms 128-dimensional VGGish features into 768-dimensional embeddings.
  • Six transformer encoder layers with 768 output embeddings, each composed of:
    • A multi-head self-attention layer with 12 heads, followed by a residual connection to the layer input, and layer normalization.
    • Two affine layers of dimension 3072 and 768 respectively, followed by a residual connection to the output of self-attention, and layer normalization.
  • Six transformer decoder layers with 768 output embeddings, each composed of:
    • A multi-head self-attention layer with 12 heads, followed by a residual connection to the layer input, and layer normalization.
    • A multi-head cross-attention layer attending on the encoder outputs with 12 heads, followed by a residual connection to the self-attention output, and layer normalization.
    • Two affine layers of dimension 3072 and 768 respectively, followed by a residual connection to the output of cross-attention, and layer normalization.
  • An affine layer with 50265 output features and softmax activation, which predicts tokens from the output of the last decoder layer.

Parameter optimization is performed using AdamW on the cross-entropy loss for 20 epochs, with a batch size of 4 examples and gradient accumulation of 2 steps (equivalent to a batch size of 8). The learning rate is 1e-5. The loss on the validation set is evaluated every 1000 iterations, and the checkpoint corresponding to the lowest validation loss is retained for inference. Training takes about 2 hours on a single GPU (GTX 1080 Ti).
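A minimal sketch of this optimization setup is given below, reusing the model from the architecture sketch above; the train_loader and the batch field names are hypothetical placeholders.

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
accumulation_steps = 2                       # effective batch size 4 * 2 = 8

model.train()
for step, batch in enumerate(train_loader):              # hypothetical DataLoader
    outputs = model(
        inputs_embeds=batch["audio_features"],           # padded VGGish sequences
        attention_mask=batch["audio_mask"],
        labels=batch["caption_token_ids"],               # cross-entropy targets;
    )                                                    # padded label positions are
    loss = outputs.loss / accumulation_steps             # typically set to -100
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()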

All input audio features and captions in a batch are padded to a fixed length for convenient batching. Specifically, the input audio features are padded with zero vectors, and tokenized captions are padded with a special <pad> token. For both audio and token input sequences, an attention mask is provided to the model in order to ignore padded elements.
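The audio-side padding and attention mask can be produced as in the following sketch (feature shapes and variable names are illustrative).

import torch
from torch.nn.utils.rnn import pad_sequence

# Two clips of different lengths (18 and 25 one-second VGGish frames of dimension 128).
features = [torch.randn(18, 128), torch.randn(25, 128)]
lengths = torch.tensor([f.shape[0] for f in features])

padded = pad_sequence(features, batch_first=True)   # zero-pads to the longest clip: (2, 25, 128)
audio_mask = (torch.arange(padded.shape[1])[None, :] < lengths[:, None]).long()  # 1 = real frame, 0 = padding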

Results for the development dataset

The results of the baseline system for the development dataset of Clotho v2.1 are:

  Metric     Value
  BLEU1      0.555
  BLEU2      0.358
  BLEU3      0.239
  BLEU4      0.156
  ROUGEL     0.364
  METEOR     0.164
  CIDEr      0.358
  SPICE      0.109
  SPIDEr     0.233

The pre-trained weights for the DNN of the baseline system yielding the above results are freely available on Zenodo.

If you are participating in this task, you might want to check the following papers. If you find a paper that should be cited here but is not (e.g. a paper for one of the suggested resources that is missing), please contact us and report it.

  • The initial publication on audio captioning:

Publication

Konstantinos Drossos, Sharath Adavanne, and Tuomas Virtanen. Automated audio captioning with recurrent neural networks. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New Paltz, New York, U.S.A., Oct. 2017. URL: https://arxiv.org/abs/1706.10006.



Automated Audio Captioning with Recurrent Neural Networks

Abstract

We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model are fully connected layers with shared weights between timesteps. The proposed method is evaluated using data drawn from a commercial sound effects library, ProSound Effects. The resulting captions were rated through metrics utilized in machine translation and image captioning fields. Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.

@inproceedings{Drossos_2017_waspaa, Author = "Drossos, Konstantinos and Adavanne, Sharath and Virtanen, Tuomas", title = "Automated Audio Captioning with Recurrent Neural Networks", booktitle = "IEEE Workshop on Applications of Signal Processing to Audio and Acoustics ({WASPAA})", address = "New Paltz, New York, U.S.A.", year = "2017", month = "Oct.", abstract = "We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model are fully connected layers with shared weights between timesteps. The proposed method is evaluated using data drawn from a commercial sound effects library, ProSound Effects. The resulting captions were rated through metrics utilized in machine translation and image captioning fields. Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.", url = "https://arxiv.org/abs/1706.10006"}
  • The three-step framework, employed for collecting the annotations of Clotho (if you use the three-step framework, consider citing the paper):

Publication

Samuel Lipping, Konstantinos Drossos, and Tuomas Virtanen. Crowdsourcing a dataset of audio captions. In Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE). Nov. 2019. URL: https://arxiv.org/abs/1907.09238.

Crowdsourcing a Dataset of Audio Captions

Abstract

Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. “people talking in a big room”). The creation of a dataset for this task requires a considerable amount of work, rendering the crowdsourcing a very attractive option. In this paper we present a three steps based framework for crowdsourcing an audio captioning dataset, based on concepts and practises followed for the creation of widely used image captioning and machine translations datasets. During the first step initial captions are gathered. A grammatically corrected and/or rephrased version of each initial caption is obtained in second step. Finally, the initial and edited captions are rated, keeping the top ones for the produced dataset. We objectively evaluate the impact of our framework during the process of creating an audio captioning dataset, in terms of diversity and amount of typographical errors in the obtained captions. The obtained results show that the resulting dataset has less typographical errors than the initial captions, and on average each sound in the produced dataset has captions with a Jaccard similarity of 0.24, roughly equivalent to two ten-word captions having in common four words with the same root, indicating that the captions are dissimilar while they still contain some of the same information.

@inproceedings{Drossos_2019_dcase, Author = "Lipping, Samuel and Drossos, Konstantinos and Virtanen, Tuoams", title = "Crowdsourcing a Dataset of Audio Captions", year = "2019", booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop ({DCASE})", month = "Nov.", abstract = "Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. “people talking in a big room”). The creation of a dataset for this task requires a considerable amount of work, rendering the crowdsourcing a very attractive option. In this paper we present a three steps based framework for crowdsourcing an audio captioning dataset, based on concepts and practises followed for the creation of widely used image captioning and machine translations datasets. During the first step initial captions are gathered. A grammatically corrected and/or rephrased version of each initial caption is obtained in second step. Finally, the initial and edited captions are rated, keeping the top ones for the produced dataset. We objectively evaluate the impact of our framework during the process of creating an audio captioning dataset, in terms of diversity and amount of typographical errors in the obtained captions. The obtained results show that the resulting dataset has less typographical errors than the initial captions, and on average each sound in the produced dataset has captions with a Jaccard similarity of 0.24, roughly equivalent to two ten-word captions having in common four words with the same root, indicating that the captions are dissimilar while they still contain some of the same information.", url = "https://arxiv.org/abs/1907.09238"}
  • The Clotho dataset (if you use Clotho, consider citing the Clotho paper):

Publication

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 736–740. 2020.

Clotho: an Audio Captioning Dataset

Abstract

Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).

@inproceedings{Drossos_2020_icassp, author = "Drossos, Konstantinos and Lipping, Samuel and Virtanen, Tuomas", title = "Clotho: an Audio Captioning Dataset", booktitle = "Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP)", year = "2020", pages = "736-740", abstract = "Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684)."}