Challenge Guidelines

To get started

Obtain an account for the AVEC2012 database and download the data. Even if you intend to participate in the word-level sub-challenge (WLSC) only, you will probably want to download the FCSC set as well (see the detailed description of features below).

After downloading the data you can directly start your experiments with the train and development sets. Once you have found your best method you should write your paper for the Workshop. At the same time you can compute your results per instance of the test set and upload them. We will then let you know your performance result. See below for more information on the submission process and the way performance is measured.

Please note that the goal of AVEC2012 is the recognition of the continuous labels, whereas the goal of AVEC2011 was the detection of binarised positive/negative classes of the four dimensions. The binarised labels are still included for download from the same web site, but solely for the purpose of comparison with the AVEC2011 results. Note that the test set for AVEC2012 also differs from that of AVEC2011.

Please also note that, in contrast with AVEC2011, there is only a single challenge. There are no separate audio, video, and audio-visual sub-challenges. You are free to use either modality, or both, but everyone will compete for the same prize. Aligned transcripts of the user’s speech are also available, which can be used as an additional cue.

The Challenge measure will be the cross-correlation between the ground truth and predicted labels of the four dimensions. The scripts used to compute your scores are publicly available for both the FCSC and WLSC sub-challenges. Participants are allowed to use their own features and their own regression algorithm. However, standard feature sets (for audio and video separately) are provided and may be used. Participants have to stick to the definition of the training, development, and test sets. They may report results obtained on the development set, but are limited to five uploads of results on the test set, whose labels are unknown to them.
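As a rough illustration of the measure, the sketch below computes a correlation coefficient per sequence and averages it over all sequences. It is not the official scoring script: reading "cross-correlation" as Pearson's correlation coefficient, the function names, and the handling of constant sequences are all assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Assumption: a constant sequence is scored as zero correlation.
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mean_correlation(truth_by_session, pred_by_session):
    """Average the per-session correlation over all sessions (hypothetical helper)."""
    scores = [
        pearson(truth_by_session[name], pred_by_session[name])
        for name in truth_by_session
    ]
    return sum(scores) / len(scores)
```

One such average would be computed per dimension (arousal, expectancy, power, valence), using the development labels as ground truth.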

See below for a list of frequently asked questions (FAQ).

Challenge data

The organisers provide the following data for the two sub-challenges:

  • Video data (avi)
  • Audio data (wav)
  • Aligned transcripts with word timing information
  • Labels for the train and development partitions
  • Video features
  • Audio features

N.B.1 The complete set of video test features is now available separately in a double-volume zip archive. This contains the video features for all 32 test sequences.

N.B.2 In the distributed data, the labels for the ‘power’ dimension of development session 20 are incorrect. To save people from having to download the full dataset again, we have made the corrected labels and associated audio-arff files available separately in an errata zip file. Within it you will find files such as labels_continuous_devel020_power.dat. Please replace the files extracted from the main AVEC2012 zip file with the ones contained in the errata zip file.

Corresponding SEMAINE sessions

The AVEC dataset is a subset of the SEMAINE database. In fact, it’s the Solid-SAL partition of that database, split into three subsets. Below is a mapping of AVEC sessions to SEMAINE sessions.

Mapping of AVEC sessions to SEMAINE sessions. Each pair of columns lists the AVEC session number and the corresponding SEMAINE session number for one of the three subsets.
AVEC  SEMAINE    AVEC  SEMAINE    AVEC  SEMAINE
1     2          1     8          1     13
2     3          2     9          2     14
3     4          3     10         3     15
4     5          4     11         4     16
5     29         5     19         5     25
6     30         6     20         6     26
7     31         7     21         7     27
8     40         8     22         8     52
9     41         9     34         9     53
10    42         10    35         10    54
11    43         11    36         11    55
12    58         12    37         12    64
13    59         13    46         13    65
14    60         14    47         14    66
15    61         15    48         15    67
16    70         16    49         16    100
17    71         17    82         17    101
18    72         18    83         18    102
19    73         19    84         19    103
20    76         20    85         20    118
21    77         21    94         21    119
22    78         22    95         22    120
23    79         23    96         23    121
24    88         24    97         24    122
25    89         25    112        25    125
26    90         26    113        26    126
27    91         27    114        27    127
28    106        28    115        28    128
29    107        29    131        29    137
30    108        30    132        30    138
31    109        31    133        31    139
-     -          32    134        32    140


Labels are provided for the train and development partitions. They are the average value of 3-8 raters, who provided continuous-time, continuous-value labels for their perceived strength of the four dimensions arousal, expectancy, power, and valence.

There are two types of labels, corresponding to the two sub-challenges FCSC and WLSC (fully continuous and word-level, respectively). The FCSC labels consist of one continuous value every 0.02 s. This time interval corresponds to the frame rate of the video data, i.e. 50 frames per second. The WLSC labels consist of one continuous value per word.

The label data of the FCSC is provided in separate files, with naming convention labels_continuous_$PARTITION$SESSIONNR_$DIMENSION.dat, e.g. labels_continuous_devel001_arousal.dat. The label data of the WLSC is not provided separately; instead, it is part of the WLSC audio features (see below). The reasoning behind this is that people who partake in the WLSC sub-challenge are assumed to always use audio.
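The naming convention can be handled programmatically, for instance as in the sketch below. The helper names are hypothetical, and the three-digit zero padding of the session number is an assumption inferred from example names such as labels_continuous_devel001_arousal.dat.

```python
import re

DIMENSIONS = ("arousal", "expectancy", "power", "valence")

# Assumption: lowercase partition name followed by a 3-digit session number.
_LABEL_NAME = re.compile(r"labels_continuous_([a-z]+)(\d{3})_([a-z]+)\.dat")


def label_filename(partition, session, dimension):
    """Build an FCSC label file name, e.g. for ("devel", 1, "arousal")."""
    if dimension not in DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    return f"labels_continuous_{partition}{session:03d}_{dimension}.dat"


def parse_label_filename(name):
    """Split an FCSC label file name into (partition, session, dimension)."""
    match = _LABEL_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not an FCSC label file name: {name}")
    return match.group(1), int(match.group(2)), match.group(3)
```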

Provided features

For the video modality, we provide appearance-based features on a single-frame basis. For every frame, LBP features are derived from coarsely registered faces. To save disk space, features are only provided for the FCSC, which means there are features for every frame in every video. No separate video features are provided for the WLSC. These would have been a subset of the FCSC features, and using the word timings contained in the WLSC audio features this subset can easily be created by the participants themselves. For the same reason, the audio and video data (avis and wavs) as well as the turn and word timing information files are not provided with the WLSC dataset, as they would duplicate the FCSC data. It is therefore most likely that you will need the FCSC data even if you intend to participate in the WLSC only. When converting audio feature times to corresponding frame numbers, please bear in mind that the exact frame rate of the video is 49.979 fps. On a three-minute video, the difference between recording at 50 fps and 49.979 fps would be 4 frames, so if you use 50 fps you probably won’t notice.
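The conversion from word times to frame numbers could look like the following sketch. The function names are hypothetical, and zero-based frame indexing with rounding down is an assumption; check it against your own video reader.

```python
import math

VIDEO_FPS = 49.979  # exact frame rate of the video, as stated above


def time_to_frame(seconds, fps=VIDEO_FPS):
    """Zero-based index of the frame displayed at the given time (assumed convention)."""
    return math.floor(seconds * fps)


def word_frame_range(start, end, fps=VIDEO_FPS):
    """Frame indices covered by a word lasting from start to end (in seconds)."""
    return range(time_to_frame(start, fps), time_to_frame(end, fps) + 1)


# Drift between the nominal and the exact frame rate after three minutes:
drift = time_to_frame(180.0, fps=50.0) - time_to_frame(180.0)
```

Here `drift` evaluates to 4 frames, which matches the figure quoted for a three-minute video.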

For the audio modality, for both FCSC and WLSC, features were only computed when the user was speaking. The FCSC audio features are a set of OpenSMILE features, computed over 2-second segments. The first segment of a word starts at the start time of that word, and subsequent segments are separated 0.5 seconds from each other, if a word lasts long enough to allow multiple segments to be extracted. The WLSC audio features are the same set of OpenSMILE features, but this time computed over a variable length segment, which is defined by the start and end time of each word.

The test features will be released at a later date.

Results submission

Participants’ results should be sent as a single zip file per sub-challenge to the organisers by email. The zip file name should include the name of your team, the sub-challenge (FCSC/WLSC), and the number of this attempt. Exact formatting doesn’t matter for the zip file name, as a human will process this. The zip file should not contain any sub-directories.

The data in the results files themselves should be formatted the same way as the training/development ground truth label files, that is, ASCII values, one prediction per line. Their filenames should also be formatted similarly: e.g. ‘labels_continuous_test002_power.dat’ for FCSC result files, and ‘labels_wordlevel_test001_arousal.dat’ for WLSC results.
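A minimal sketch of producing such a submission is given below: one ASCII value per line in each result file, and a zip archive without sub-directories. The helper names and the six-decimal number format are assumptions for illustration, not requirements stated by the organisers.

```python
import os
import zipfile


def write_predictions(path, predictions):
    """Write one prediction per line as plain ASCII text."""
    with open(path, "w", encoding="ascii") as handle:
        for value in predictions:
            handle.write(f"{value:.6f}\n")  # assumption: six decimals suffice


def zip_results(zip_path, result_files):
    """Pack result files into a flat zip archive (no sub-directories)."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in result_files:
            # arcname drops the directory part so files sit at the archive root.
            archive.write(path, arcname=os.path.basename(path))
```

Storing each file under its base name keeps the archive flat even when the result files live in a nested working directory.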

The organisers will provide, for each dimension, the cross-correlation, which will be used to rank participants, as well as the RMS error, which can be used by the authors to further discuss their results in the paper accompanying their submission.

To increase the transparency of the challenge, we have released the evaluation code used by the organisers to calculate the scores of participants’ entries. Please note that the code will not run on your own machines without modification of the data paths, and that you will not be able to calculate scores on the test set, whose labels are withheld. After modifying the code, you should be able to calculate scores on the development partition, though.

Paper on the Challenge

The introductory paper on the challenge, Björn Schuller, Michel Valstar, Florian Eyben, Roddy Cowie, Maja Pantic: AVEC 2012 – The Continuous Audio/Visual Emotion Challenge, to appear in Proc. Second International Audio/Visual Emotion Challenge and Workshop (AVEC 2012), Grand Challenge and Satellite of ACM ICMI 2012, ACM, Santa Monica, CA, 22.-26.10.2012, is now ready for download. This paper provides extensive descriptions of the challenge data, provided labels, baseline features, and baseline results. All participants are asked not to repeat the description of the challenge, data, or features in their submissions, but to refer to this paper for this type of information.

Baseline results

Below you can find the baseline results, which are also included in the paper on the baseline. Please refer to that manuscript for an explanation of how these results were obtained and a short discussion of their meaning. In the table, FCSC stands for Fully Continuous Sub-Challenge, and WLSC for Word-Level Sub-Challenge. Please note that until the paper on the baseline results is published, the results remain subject to change.

Baseline results. Performance is measured in cross-correlation averaged over all sequences.
Partition          Arousal  Expectancy  Power  Valence  Mean
FCSC test          0.141    0.101       0.072  0.136    0.112
WLSC test          0.021    0.028       0.009  0.003    0.015
FCSC development   0.181    0.148       0.084  0.215    0.157
WLSC development   0.018    0.009       0.001  0.002    0.007
Audio Only
WLSC test          0.014    0.038       0.016  0.040    0.027
WLSC development   0.054    0.020       0.019  0.062    0.039
Video Only
FCSC test          0.077    0.128       0.030  0.134    0.093
WLSC test          0.005    0.012       0.018  0.005    0.011
FCSC development   0.151    0.122       0.031  0.207    0.128
WLSC development   0.032    0.013       0.005  0.003    0.014

Background Information:

You can find additional information on the data on the SEMAINE homepage.

Paper Submission:

Each contribution to the Challenge must be accompanied by a paper submitted to the Second International Audio/Visual Emotion Challenge and Workshop (AVEC 2012) with the following conditions:

  • The deadline for submission of the papers and results is found under “Important Dates”.
  • The papers will undergo a normal review process.
  • Papers shall not repeat the descriptions of database, labels, partitioning etc. of the SEMAINE corpus but cite the introductory paper (cf. above).
  • Per participating site five result uploads on the test set are allowed until the Camera Ready Paper deadline.
  • Four continuous prediction problems need to be solved for Challenge participation: arousal, valence, expectancy, and power. The Challenge competition measure is the cross-correlation on frame level.
  • Papers may well report additional results on other databases.
  • The development set allows for tests and results to be reported by the participants in addition to their results on the official test set.
  • An additional publication is planned that summarises all results of the challenge and results combination by ROVERING or ensemble techniques. However, this publication is expected to appear after AVEC 2012.

In submitting a manuscript to this workshop, the authors acknowledge that no paper substantially similar in content has been submitted to another conference or workshop. Accepted workshop papers will be included in the proceedings of ICMI 2012. Manuscripts should follow the ICMI main conference paper format: 8 pages ACM style. Authors should submit papers as a PDF file via the official ICMI system. Once you are in the conference management system, please choose AVEC for submission. AVEC 2012 reviewing is double blind. Reviewing will be by members of the program committee. Each paper will receive at least two reviews. Acceptance will be based on relevance to the workshop, novelty, and technical quality.

Frequently asked questions/made comments:

How many times can I submit my results? You can submit results five times. We will not count badly formatted results towards one of your five submissions.

I downloaded the zip files but cannot open them. The 17 main AVEC 2012 zip files and two video test partition zip files are actually two multi-volume archives. This means that you shouldn’t try to open each zip file separately. You need a zip program that can handle multi-volume archives.

Why does the number of features not equal the number of labels? The video features are directly extracted from the videos, frame by frame. The labels were created using a tool called FeelTrace, which was based on the QuickTime player. The QuickTime player handles keyframes slightly differently, so there can be a few more or a few fewer labels than there are features. You should rectify this by matching the number of labels to the number of features. The difference between the two lengths is negligible compared with the number of samples in each session.
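One way to match the two lengths is sketched below. How to trim or extend is not prescribed by the organisers: dropping surplus labels at the end and repeating the last label when there are too few are assumptions of this example, as is the function name.

```python
def match_labels_to_features(labels, n_features):
    """Return a label list with exactly n_features entries."""
    labels = list(labels)
    if not labels:
        raise ValueError("no labels given")
    if len(labels) >= n_features:
        # Assumption: surplus labels are dropped at the end of the session.
        return labels[:n_features]
    # Assumption: missing labels are filled by repeating the last value.
    return labels + [labels[-1]] * (n_features - len(labels))
```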

If you measure only correlation, there could be a huge difference between the predicted and true labels, yet they could still be perfectly correlated! That’s correct. In an ideal world, we would also take the RMS value into account. However, having two performance measures makes it incredibly hard to rank participants. Correlation is a more meaningful measure than RMS.
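The effect raised in this question is easy to reproduce with made-up numbers. In the sketch below the values are invented for illustration, and the correlation is taken to be Pearson's coefficient, which is an assumption about the official measure.

```python
import math


def rmse(truth, pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))


def correlation(truth, pred):
    """Pearson correlation coefficient (assumed form of the Challenge measure)."""
    n = len(truth)
    mean_t, mean_p = sum(truth) / n, sum(pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, pred))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov / math.sqrt(var_t * var_p)


truth = [0.1, 0.2, 0.4, 0.3]           # invented ground-truth values
pred = [2.0 * t + 5.0 for t in truth]  # scaled and shifted predictions

score = correlation(truth, pred)  # essentially 1: perfectly correlated
error = rmse(truth, pred)         # larger than 5: far from the true values
```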

The beginning of many sessions is the same as the end of some others, why do their associated labels have different values? This is an effect of the human rating process. It is related to the fact that what the labellers rate isn’t an immediate response to what they see and hear, but something that depends on the history of the interaction. That is why the last three seconds of a certain interaction are more richly (and differently) annotated than the same three seconds at the beginning of an interaction.

There are lots of zeros in some of the video feature files, for instance in train files 25, 26, 27. Is this normal, an error, or is there a specific reason? The all-zero entries in the video feature files indicate that face detection has failed for that frame.
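Frames without a detected face can be located with a check like the one sketched below. It assumes that each frame's features have already been loaded as a sequence of numbers; the function names are hypothetical.

```python
def face_detected(feature_row):
    """True unless every value in the frame's feature vector is zero."""
    return any(value != 0 for value in feature_row)


def failed_frames(rows):
    """Indices of frames whose feature vector is all zeros."""
    return [index for index, row in enumerate(rows) if not face_detected(row)]
```

The resulting indices can then be used to skip or interpolate those frames, depending on your method.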

For the WLSC, sometimes, the time markers in word-based audio features are slightly different from the timestamps shown in the word transcription files. Why is this? As the baseline paper explains, if a word is too short to allow the necessary statistics to be computed, the window was increased step-wise until it was long enough. This is reflected in the start time of the relevant audio features.

For the WLSC, the number of words in the transcription files and the number of feature vectors in the given audio features are different. For example, there are 656 words in the train_transcrip001.txt. However, there are only 649 audio feature instances in train_words001.arff to represent these 656 words. Why is this? This is because the transcription was performed on audio files that were cut at a slightly different point. You should take the number of audio features as the correct number of words.

What is the order in which the video features are stored? The features are stored as follows: horizontal position of the top-left corner of the face box, vertical position of the top-left corner of the face box, width of the face box, height of the face box, followed by the LBP histograms of the blocks concatenated in lexicographic order. If you are interested in obtaining the same spatio-temporally reduced features used in the baseline system, you can use the MATLAB function readReducedVideoData (either directly or as inspiration). You will also need the function line2feats.
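Given that ordering, one frame's feature vector can be split as in the sketch below. The type and function names are hypothetical, and the number of LBP values per frame is deliberately left open because it is not stated here.

```python
from typing import List, NamedTuple, Sequence


class FrameFeatures(NamedTuple):
    box_x: float       # horizontal position of the top-left corner of the face box
    box_y: float       # vertical position of the top-left corner of the face box
    box_width: float   # width of the face box
    box_height: float  # height of the face box
    lbp: List[float]   # concatenated LBP histograms of the blocks


def split_frame_features(row: Sequence[float]) -> FrameFeatures:
    """Separate the four face-box values from the LBP histogram values."""
    if len(row) < 4:
        raise ValueError("a feature row needs at least the four face-box values")
    box_x, box_y, box_width, box_height = row[:4]
    return FrameFeatures(box_x, box_y, box_width, box_height, list(row[4:]))
```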