Title: Learning of units and knowledge representation
Publication Type: Conference Paper
Year of Publication: 2013
Authors: Metze F, Anguera X, Ewert S, Gemmeke J, Kolossa D, Provost EMower, Schuller B, Serrà J
Editor: S. Müller NM, Schuller B.
Conference Name: Dagstuhl Seminar 13451: Computational Audio Analysis
Conference Location: Wadern, Germany
Date Published: 04/11/2013

Our group came together to discuss how knowledge could be used to define and infer units of sound that are portable across a number of tasks. Participants felt that a top-down approach is needed to complement the purely data-driven, bottom-up clustering approaches that are currently prevalent in classification experiments. Members specifically wanted to investigate how attempting to solve multiple problems at the same time (a “holistic” approach) could benefit each individual task by exposing and exploiting correlations and complementarity that would otherwise stay hidden. Members also felt that a sound statistical framework is needed, with careful modeling of uncertainty and a mechanism to feed back confidences. This would also be beneficial in the presence of multiple, possibly overlapping signals, as is typically the case for sounds.
Finally, members were interested in working on meta-data of speech. First ideas were discussed on how to learn from data units representing emotions that would be both acoustically discriminative and useful in the context of a certain application, or discernible by humans. Most members had some background in low-level feature extraction and in deep learning. Against this background, members developed an experiment that they intend to execute in a distributed collaboration over the next couple of weeks. The experiment will be performed on the IEMOCAP database using various existing tools available to the group members. Collaboration tools will be set up at CMU.
To establish a baseline, members will investigate the suitability of multi-task learning by training a single deep neural network (DNN) to predict both binary and continuous-valued emotion targets on the IEMOCAP benchmark database. The network will then be adapted to other databases (most likely AVEC and CreativeIT) to investigate the portability of the learner and the utility of multi-task learning.
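As a rough illustration of the multi-task baseline described above, the following minimal sketch trains one shared hidden layer with two output heads, one for a binary emotion target and one for a continuous-valued target. Everything in it is an assumption made for illustration: the data are synthetic rather than IEMOCAP features, the network has a single hidden layer, and the function names, the mixing weight `alpha`, and the learning rate are hypothetical choices, not the group's actual setup.

```python
import numpy as np

def init_params(n_in, n_hidden, rng):
    """Shared hidden layer plus two task-specific output heads (hypothetical layout)."""
    return {
        "W_h": rng.normal(0, 0.1, (n_in, n_hidden)), "b_h": np.zeros(n_hidden),
        "w_bin": rng.normal(0, 0.1, n_hidden), "b_bin": 0.0,  # binary head
        "w_con": rng.normal(0, 0.1, n_hidden), "b_con": 0.0,  # continuous head
    }

def forward(p, X):
    H = np.tanh(X @ p["W_h"] + p["b_h"])  # representation shared by both tasks
    p_bin = 1.0 / (1.0 + np.exp(-(H @ p["w_bin"] + p["b_bin"])))  # sigmoid output
    y_con = H @ p["w_con"] + p["b_con"]  # linear output
    return H, p_bin, y_con

def joint_loss(p_bin, y_con, t_bin, t_con, alpha=0.5):
    """Weighted sum of binary cross-entropy and mean squared error."""
    eps = 1e-12
    bce = -np.mean(t_bin * np.log(p_bin + eps) + (1 - t_bin) * np.log(1 - p_bin + eps))
    mse = np.mean((y_con - t_con) ** 2)
    return alpha * bce + (1 - alpha) * mse

def train_step(p, X, t_bin, t_con, lr=0.1, alpha=0.5):
    """One full-batch gradient step; returns the loss before the update."""
    n = X.shape[0]
    H, p_bin, y_con = forward(p, X)
    # Gradients of the joint loss w.r.t. each head's pre-activation output.
    d_bin = alpha * (p_bin - t_bin) / n
    d_con = (1 - alpha) * 2.0 * (y_con - t_con) / n
    # Both heads send gradient into the shared layer.
    dH = np.outer(d_bin, p["w_bin"]) + np.outer(d_con, p["w_con"])
    dZ = dH * (1 - H ** 2)  # derivative of tanh
    grads = {
        "W_h": X.T @ dZ, "b_h": dZ.sum(axis=0),
        "w_bin": H.T @ d_bin, "b_bin": d_bin.sum(),
        "w_con": H.T @ d_con, "b_con": d_con.sum(),
    }
    for k in p:
        p[k] = p[k] - lr * grads[k]
    return joint_loss(p_bin, y_con, t_bin, t_con, alpha)

# Synthetic stand-in data (an assumption; not IEMOCAP).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
t_con = np.tanh(X @ rng.normal(size=5))  # continuous target in [-1, 1]
t_bin = (t_con > 0).astype(float)        # correlated binary target

params = init_params(n_in=5, n_hidden=8, rng=rng)
losses = [train_step(params, X, t_bin, t_con) for _ in range(200)]
_, p_bin, y_con = forward(params, X)
```

Because the two targets in this toy setup are correlated, the shared layer receives a consistent gradient signal from both heads, which is the kind of complementarity the multi-task experiment is meant to probe.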
These experiments can be performed with feed-forward as well as recurrent architectures. Next, prior knowledge will be incorporated into the classification by adding database information, speaker information, or other meta-data (automatically extracted or manually labeled) as additional inputs during network training. Finally, the recurrence loop will be optimized by investigating which information should be fed back. This information may comprise the utility of certain features or classes in a certain task, the saliency of some features, or the classification accuracy (posterior probabilities) of some classes on a held-out dataset. Members discussed an uncertainty-weighted combination approach that should be able to update the structure and parameters of the classifier so as to improve classification accuracy. The goal is to allocate parameters towards modeling useful target units rather than towards accurately modeling distinctions that will ultimately not be used in an application. Results will be published in peer-reviewed literature and will hopefully lead to follow-up collaborations, including the organization of future workshops and joint proposals.
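Two of the ideas above can be sketched in a few lines, under stated assumptions. The first function appends one-hot encoded database and speaker identifiers to an acoustic feature matrix, which is one simple way to supply meta-data as additional network inputs. The second shows inverse-variance weighting, a standard way to combine estimates according to their uncertainty; it is swapped in here purely as an example, since the text does not specify which uncertainty-weighted scheme the members had in mind, and it combines predictions only, without updating any classifier structure. All function names and values are hypothetical.

```python
import numpy as np

def one_hot(ids, n_classes):
    """Encode integer ids as one-hot rows."""
    ids = np.asarray(ids)
    out = np.zeros((len(ids), n_classes))
    out[np.arange(len(ids)), ids] = 1.0
    return out

def append_metadata(features, database_ids, speaker_ids, n_databases, n_speakers):
    """Concatenate acoustic features with one-hot database and speaker codes."""
    return np.hstack([
        features,
        one_hot(database_ids, n_databases),
        one_hot(speaker_ids, n_speakers),
    ])

def inverse_variance_combine(means, variances):
    """Combine estimates (rows) by weighting each with 1 / variance."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    precision = 1.0 / variances
    weights = precision / precision.sum(axis=0)
    combined = (weights * means).sum(axis=0)
    combined_var = 1.0 / precision.sum(axis=0)
    return combined, combined_var

# Toy example: 4 frames, 3 acoustic features, 3 databases, 2 speakers.
features = np.ones((4, 3))
augmented = append_metadata(features, [0, 0, 1, 2], [0, 1, 1, 0],
                            n_databases=3, n_speakers=2)

# Two estimates of the same quantity; the first is more certain.
combined, combined_var = inverse_variance_combine([[0.2], [0.8]], [[0.1], [0.3]])
```

In the toy example the more certain estimate receives three times the weight of the less certain one, so the combined value lies closer to it, and the combined variance is smaller than either input variance.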