
International Workshop on Voice and Speech Processing in Social Interactions

Latest news

The program is available here


Voice and speech play a fundamental role in social interactions, but in recent years they have been relatively neglected compared with other aspects of social exchanges, such as facial expressions or gestures. Furthermore, research on voice and speech has been carried out with different methodologies and goals in different domains, but no major attempts have been made to identify common research questions, to integrate findings produced in different communities, or simply to raise awareness of the multiple facets of the problem. The goal of this workshop is to bridge this gap by gathering researchers from different communities, in particular neuroscience, psychology and computing science, to share the latest findings on social aspects of voice and speech and to identify the most important open issues in the domain.

The event, possibly the first of a series, will launch the newly created “Social Interaction Research Centre” of the University of Glasgow, and it is expected to result in a collection of articles to be published together as a book.

Call for Participation

Interested participants are invited to submit an abstract (max 500 words). Authors of accepted submissions will be invited to present their work in oral or poster format (please indicate your preference in the submission). The submission deadline is March 15th, 2011. Submissions should be sent by e-mail to the organisers:

Pascal Belin: Pascal.Belin@Glasgow.ac.uk
Alessandro Vinciarelli: Alessandro.Vinciarelli@Glasgow.ac.uk


Wolfson Medical School of the University of Glasgow
Yudowitz Seminar Room

Wolfson Medical School Building
University Avenue
University of Glasgow
G12 8QQ


The Workshop is sponsored by:

Social Signal Processing Network (SSPNet)
Scottish Information and Computer Science Alliance (SICSA)


Keynote lecture

  • Klaus Scherer (University of Geneva)

Title: The voice of power and influence

Abstract: I make a strong case for an important appeal function of vocal emotion expression in social influence settings, particularly persuasion. It can be argued that appropriate emotional expression by a persuader will tend to increase the effectiveness of the persuasive message because of a) the attribution of greater credibility and trustworthiness to the sender, and b) the production of appropriate emotions in the audience, which may induce the desired attitudes or behaviors or make the cognitive processing more amenable to accepting the message emitted by the persuader. What is the underlying mechanism? I suggest that one can conceive of a symbolic function of vocal affect signals by assuming that the acoustic characteristics of an emotional vocalization reflect the complete pattern of the cognitive appraisal process that produced the emotional state in the sender. This information about the criteria used in the emotion-antecedent evaluation should allow the listener to reconstruct the major features of the emotion-producing event and its effect on the speaker. I suggest that it is possible to elaborate predictions on how we would expect the major phonation characteristics to vary as a result of the major emotion-antecedent evaluation criteria, and I report data from actor portrayal studies that confirm many of the theoretical predictions on vocal patterning based on the component process model of emotion. I further suggest that the inference of felt power and competence, as well as the attribution of authenticity, are central determinants of the emotion-induction aspect of successful persuasion, especially in political speech, and I will review some of the evidence for voice markers of power and dominance.

Scherer, K. R. (1979). Voice and speech correlates of perceived social influence. In H. Giles & R. St.Clair (Eds.), The social psychology of language (pp. 88-120). London: Blackwell.

Scherer, K. R. (2003). Vocal communication of emotion: A review of research paradigms. Speech Communication, 40, 227-256.

Invited lectures

Psychology / Cognitive neuroscience perspective

  • Pascal Belin (University of Glasgow)

Title: Cerebral processing of vocal information

  • Sonja A. Kotz (Minerva Research Group “Neurocognition of Rhythm in Communication”, MPI for Human Cognitive and Brain Sciences, Leipzig, Germany)

Title: On the encoding of vocal emotion expressions

Abstract: Speech is an important carrier of emotional information. However, little is known about (1) how distinct vocal emotion expressions are recognized in a receiver’s brain, (2) whether such possible distinctions are encoded as specific modulations of the human voice, and (3) whether the neural correlates of such vocal expressions are phylogenetically similar and thus shared across species. I will present EEG as well as classical fMRI data and will end with recent results from a multivariate pattern analysis of fMRI data that investigated to what extent vocal emotion expressions are represented in local brain activity patterns.

  • Caroline Floccia (Plymouth University)

Title: Dialect and accent perception

Abstract: Foreign accents and dialects, which constitute an important source of variability in speech, are a challenge for models of speech perception as well as for automatic speech recognition. Until recently, little was known about the mechanisms underlying normalisation of these within-language variations in on-line speech processing. Here we present some psycholinguistic data exploring infants’, children’s and adults’ perception and normalisation of accents, which argue for a possible dissociation between the processing of foreign accents and regional dialects.

  • Stefan R. Schweinberger (University of Jena and DFG Research Unit)

Title: Recognition of Speaker Identity

Abstract: Although the voice is an important cue to speaker identity, voice identification and its neural correlates have received relatively little scientific attention. I will briefly present evidence on the role of the stimulus and its duration in famous voice recognition, on the ability of specific kinds of retrieval cues to resolve failures in voice identification, and on evidence from case studies regarding specific types of functional impairment. In more recent research, we used adaptation to demonstrate contrastive effects in the identification of personally familiar voices. In experiments using different whole-sentence utterances as adaptors and test stimuli, prolonged exposure to speaker A’s voice strongly biased the perception of identity-ambiguous voice morphs between speakers A and B towards speaker B (and vice versa). Intriguingly, we observed significant, albeit smaller, bias effects on voice identification when adaptors were videos of familiar speakers’ silently articulating faces. More recent research measured event-related potentials (ERPs) to investigate the neural representation and the time course of vocal identity processing, using short VCV utterances. Here, contrastive voice identity aftereffects were much more pronounced when the same syllable, rather than a different syllable, was used as adaptor. Identity adaptation induced amplitude reductions of the frontocentral N1-P2 complex and a prominent reduction of a parietal P3 component, for test voices preceded by identity-corresponding adaptors. Importantly, only the P3 modulation remained clear for across-syllable combinations of adaptor and test stimuli. Our results are consistent with the view that voice identity is contrastively processed by specialised neurons in auditory cortex within ~250 ms after stimulus onset, with identity processing becoming less dependent on speech content after ~300 ms.

  • Jean-Luc Schwartz

Title: Binding and mixing together the sounds and sights of a speaking partner

Abstract: It has long been known that speech perception is multimodal, and various proposals have been made to explain how audiovisual fusion could be achieved in speech perception. In this talk I will try to make clear that fusion is not an automatic process, and that it probably involves a preliminary stage of audiovisual speech scene analysis in which the appropriate pieces of speech and sound are bound together before further processing and categorization.

  • David Feinberg (McMaster University)

Title: Voice Attractiveness

Computational/Signal Processing Perspective

  • Simon King (University of Edinburgh)

Title: Synthetic speech – beyond mere intelligibility

Abstract: Some text-to-speech synthesisers are now as intelligible as human speech. This is a remarkable achievement, but the next big challenge is to approach human-like naturalness, which will be even harder. I will describe several lines of research which are attempting to imbue speech synthesisers with the properties they need to sound more “natural” – whatever that means.

The starting point is personalised speech synthesis, which allows the synthesiser to sound like an individual person without requiring substantial amounts of their recorded speech. I will then describe how we can work from imperfect recordings or achieve personalised speech synthesis across languages, with a few diversions to consider what it means to sound like the same person in two different languages and how vocal attractiveness plays a role.

Since the voice is not only our preferred means of communication but also a central part of our identity, losing it can be distressing. Current voice-output communication aids offer a very poor selection of voices, but recent research means that soon it will be possible to provide people who are losing the ability to speak, perhaps due to conditions such as Motor Neurone Disease, with personalised communication aids that sound just like they used to, even if we do not have a recording of their original voice.

There will be plenty of examples, including synthetic child speech, personalised synthesis across the language barrier, and the reconstruction of voices from recordings of disordered speech.

This work was done with Junichi Yamagishi, Sandra Andraszewicz, Oliver Watts, Mirjam Wester and many others.

Relevant publications:

J. Yamagishi, B. Usabaev, S. King, O. Watts, J. Dines, J. Tian, R. Hu, Y. Guan, K. Oura, K. Tokuda, R. Karhila, and M. Kurimo. Thousands of voices for HMM-based speech synthesis – analysis and application of TTS systems built on various ASR corpora. IEEE Transactions on Audio, Speech, and Language Processing, 18(5):984-1004, July 2010.

O. Watts, J. Yamagishi, S. King, and K. Berkling. Synthesis of child speech with HMM adaptation and voice conversion. IEEE Transactions on Audio, Speech, and Language Processing, 18(5):1005-1016, July 2010.

S. Andraszewicz, J. Yamagishi, and S. King. Vocal Attractiveness of Statistical Speech Synthesisers. To appear in Proc. ICASSP 2011, Prague, May 2011.

M. Wester et al. Speaker adaptation and the evaluation of speaker similarity in the EMIME speech-to-speech translation project. Proc. SSW7, Kyoto, Japan, 2010.


  • Marc Schroeder (DFKI)

Title: The challenge of synthesizing expressive prosody in speech and non-verbal vocalizations

Abstract: There is no doubt that prosody plays an important role in communicating socially relevant information, including emotions, stance, and others. Many studies provide aspects of relevant information; nevertheless, it remains difficult to describe expressive prosody in a way that would lead to a satisfactory realisation in synthetic speech. This talk looks at the problem from a data-driven perspective, including corpus analysis and speech synthesis technology. We will illustrate some of the problems that arise from the limitations of current prosody models and signal processing technologies.

Schröder, M. (2009). Expressive Speech Synthesis: Past, Present, and Possible Futures, Affective Information Processing (Tao, J., Tan, T., eds.), pp. 111-126. London: Springer. http://dx.doi.org/10.1007/978-1-84800-306-4_7

Pammi, S., Schröder, M., Charfuelan, M., Türk, O., & Steiner, I. (2010). Synthesis of listener vocalisations with imposed intonation contours. In Proc. Seventh ISCA Tutorial and Research Workshop on Speech Synthesis, Kyoto, Japan. http://www.dfki.de/~schroed/articles/pammi_etal2010a.pdf

Charfuelan, M., Schröder, M., & Steiner, I. (2010). Prosody and voice quality of vocal social signals: the case of dominance in scenario meetings. In Proc. Interspeech. Makuhari, Japan. http://www.dfki.de/~schroed/articles/charfuelan_etal2010.pdf

  • Nick Campbell (Trinity College Dublin)

Title: Talking with People (and robots); the Nonverbal Aspects

Abstract: This talk focusses on the use of voice and tone-of-voice in social interaction, with particular emphasis on the way that people signal not just their affective and cognitive states but also their short-term and long-term social relationships in conversational discourse. Current speech processing technology is well able to model and interpret the linguistic content in human speech, both for input and for output, but is virtually unaware of the rich source of interpersonal information that is carried alongside the propositional content in conversational interaction. The talk briefly describes a forthcoming ISO standard for representing the multidimensional aspects of such discourse interaction and then focusses on methods for capturing and processing the audio-visual data in corpora that are optimised to be representative of such social aspects in conversational interaction.

  • Alessandro Vinciarelli (University of Glasgow)

Title: Social Signal Processing: understanding nonverbal communication in social interactions

Abstract: There is more to linguistic communication than words. Whenever involved in social interactions, people display a wide range of nonverbal behavioural cues (facial expressions, vocalisations, gestures, postures, etc.) that add entirely new layers of meaning to the words being uttered. Social Signal Processing (SSP) is the new, emerging domain aimed at conceptual modelling, automatic analysis and machine synthesis of nonverbal cues used as social signals, i.e. signals conveying information about social actions, social relations, social emotions and social attitudes. The goal of this talk is to illustrate the general aspects of the domain, present some examples of SSP work, and show how SSP can help make computers more adept at handling, and more robust to, realistic socio-cultural phenomena.
