Challenge Guidelines

Secondary emotion recognition test

The organisers have decided to follow up the emotion sub-challenge with a second test, in which participants do not get to see the data. Participants have two options for taking this test: end-to-end programmes can either be sent to the organisers, or be brought to the FG conference, where the test will be performed on site. The second test set will be approximately half the size of the original test set.

If you choose to go with the first option, we will do our utmost to accommodate any type of programme, on any type of OS, within reason. To allow timely handling of this evaluation, the programmes must be in by the 3rd of February 2011. We will return results as soon as possible. Please contact the organisers for more information.

If you choose to go with the second option, please bring your own computer hardware (e.g. laptop) on which your end-to-end emotion recognition software runs. A five-hour slot has been arranged in which the test can be performed. The test will be held in the Grand Ballroom of the FG’11 conference venue, on Thursday the 24th of March, from 5 till 10 pm. The procedure will be as follows:

  1. We provide the test videos on a memory stick (same format as test set 1)
  2. You generate your results within 4 hours in the presence of the organisers (results should have same format as test 1)
  3. You email the results to the organisers
  4. We reply with your scores and update the emotion recognition ranking accordingly

We are currently working with the organisers of FG to find a suitable time and place for this, and will keep you informed.

If this change of protocol is unacceptable to you and you wish to withdraw your submission (including your paper), please let us know as well.

Overview of the data

The GEMEP-FERA dataset consists of recordings of 10 actors displaying a range of expressions, while uttering a meaningless phrase, or the word ‘Aaah’. There are 7 subjects in the training data, and 6 subjects in the test set, 3 of which are not present in the training set.

Participants are encouraged to use other databases of FACS AU coding to train their AU detection systems. Examples of this are the MMI Facial Expression database, as well as the Cohn-Kanade database. Because of the nature of the emotion categories in this challenge, it is not possible to use other training data for the emotion recognition sub-challenge.

Black frames

In a number of videos, the first (few) frames are entirely black. This occurs in the following videos:

AU training:
train_035 frames 1, 2, 3, 4
train_054 frame 1
train_055 frame 1
train_062 frame 1
train_078 frame 1
train_079 frame 1
AU test:
test_043 frame 1
Emotion train:
train_05 frames 1, 2
train_081 frame 1
train_082 frame 1
train_105 frame 1
train_128 frame 1
train_129 frame 1
Emotion test:
test_005 frame 1

Test data

The test data will be made available through the GEMEP-FERA database website on January 17th, at 9 am GMT. The test data contains six subjects: three of these are also present in the training data (person-specific data), while the other three are new (person-independent data). Participants are allowed to match the test subjects with the training subjects, in case they wish to optimise test results using a person-specific facial expression recognition method.

Challenge scores

For the FERA2011 challenge, scores will be computed in terms of F1-measure for AU detection and classification rate for emotion detection. To obtain the overall score for the AU-detection sub-challenge, we will first obtain the F1-score for each AU independently, and then compute the average over all 12 AUs. Similarly, for the emotion categories we will first obtain the classification rate per emotion, and then compute the average over all 5 emotions. The F1-measure for AUs is computed based on a per-frame detection (i.e. an AU prediction has to be specified for every frame, for every AU, as being either present or absent). The function used to calculate the scores is now available. The classification rate for emotions is computed based on a per-video prediction (event-based detection). It will be calculated per emotion as the number of videos correctly classified as that emotion divided by the total number of videos of that emotion in the test set.
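The two measures described above can be sketched as follows. This is a minimal illustration, not the organisers' scoring function: the function names are hypothetical, and it assumes that the frames of all test videos are pooled into one list per AU and that an AU with no positives and no detections scores 0.

```python
def f1_score(truth, pred):
    """F1 for one AU over per-frame binary labels (1 = present, 0 = absent)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    if 2 * tp + fp + fn == 0:
        return 0.0  # assumption: nothing to detect and nothing detected scores 0
    return 2 * tp / (2 * tp + fp + fn)


def classification_rate(truth, pred, emotion):
    """Videos of `emotion` predicted as `emotion`, divided by all videos of `emotion`."""
    total = sum(1 for t in truth if t == emotion)
    correct = sum(1 for t, p in zip(truth, pred) if t == emotion and p == emotion)
    return correct / total if total else 0.0


def mean(values):
    """Plain average, used over the 12 AUs or the 5 emotions."""
    values = list(values)
    return sum(values) / len(values)
```

The overall AU score would then be `mean(f1_score(truth[au], pred[au]) for au in aus)`, and the overall emotion score the mean of the five per-emotion classification rates.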

Participants will be given access to the data one week before the submission deadline to test their systems. The results should be emailed to the organisers, in one zip file for the AU results and one zip file for the emotion results. The zip files should contain one file per video, using the same naming scheme as used for the training data, i.e.:

  • test_001-au.dat
  • test_001-emotion.dat

Please make sure to include your name in the zip filenames. Participants have two chances to send us results for every method they wish to report on. The number of methods you can report on is limited by common sense. We encourage participants to submit a separate paper for each method, unless the methods are very similar or if a comparison between the methods would be a meaningful improvement to the paper. We will send the scores immediately after computing them. The scores must be reported in the submitted paper.

If possible, please also send us, in separate zip files, the unsigned output of your classifier. We will use this to report performance in terms of the area under the ROC curve.

AU detection

There are 12 AUs that need to be detected: AU1, AU2, AU4, AU6, AU7, AU10, AU12, AU15, AU17, AU18, AU25, and AU26. Note that during speech (coded as AD50), there is NO coding for AU25 or AU26. Because we make the annotation of AD50 available together with the other AU labels, you will be able to exclude sections of speech from your training for these two AUs. Likewise, for the computation of your scores, we will discard any detections of AU25 and AU26 during speech.
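As an illustration of how speech can be excluded for AU25 and AU26, the sketch below drops every frame in which AD50 is marked before training or scoring. The function name is hypothetical, and the three inputs are assumed to be per-frame lists of equal length.

```python
def drop_speech_frames(truth, pred, speech):
    """Keep only the frames in which speech (AD50) is absent.

    truth, pred: per-frame 0/1 labels for AU25 or AU26.
    speech: per-frame 0/1 AD50 labels for the same frames.
    """
    kept = [(t, p) for t, p, s in zip(truth, pred, speech) if s == 0]
    return [t for t, _ in kept], [p for _, p in kept]
```

For the other ten AUs no such filtering applies, so the full label track is used.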

AUs are labelled frame-by-frame, and should thus be detected on a frame-by-frame basis (i.e. we’re doing AU spotting). An AU prediction has to be specified for every frame, for every AU, as being either present or absent.

The test data for AU detection consists of 71 videos, of the same kind as the training data videos. Half of the subjects in the test data also appear in the training data, while the other half does not. This way it is possible to assess how well systems generalise to unseen subjects.

Participants are kindly requested to report in their paper on the F1 measure obtained for every AU, as well as the average F1 measure.

Emotion detection

There are five discrete, mutually-exclusive emotion categories that need to be detected. The categories are: Anger, Fear, Joy, Relief, and Sadness. Emotions are labelled per video, and should thus be detected on an event, or portrayal, basis. That is, one label is given for an entire video, and one prediction per video is requested for the test data.

The test data for emotion detection consists of 134 videos, of the same kind as the training data videos. Half of the subjects in the test data also appear in the training data, while the other half does not. This way it is possible to assess how well systems generalise to unseen subjects.

Participants are kindly requested to report in their paper on the classification result obtained for every emotion, as well as the average classification rate. Besides this, participants are requested to include the confusion matrix of their predictions, with rows being the predicted values and columns the true values.
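A confusion matrix in the requested orientation (rows are predicted values, columns are true values) could be built as in the sketch below. The function name and the lower-case label spelling are assumptions made for illustration.

```python
EMOTIONS = ["anger", "fear", "joy", "relief", "sadness"]


def confusion_matrix(truth, pred, labels=EMOTIONS):
    """Return matrix[i][j] = number of videos predicted as labels[i] whose true label is labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(truth, pred):
        matrix[index[p]][index[t]] += 1  # row = predicted, column = true
    return matrix
```

With this orientation, the classification rate of an emotion is the diagonal entry of its column divided by the column sum.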

Data format

The AU labelling is provided as ASCII text, where every line is the labelling for a single frame, and every column represents an AU. So, line 1, column 1 is AU1 for the first frame, line 2, column 2 is AU2 for frame 2, and line 3, column 3 represents a non-existing label (there’s no AU3). This was done for ease of interfacing with the data. Note that column 50 represents speech. The AU labels indicate presence (a 1) or absence (a 0) of that AU.

The emotion category labelling is done using a single word in an ASCII text file. That label represents the emotion shown in the video.

The format of the result files should be the same as the training label files.
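The column layout described above can be read with a few lines of code. The sketch below is only an illustration: the function names are hypothetical, and the column separator is assumed to be whitespace or commas, which the text does not specify.

```python
import re


def parse_au_labels(text):
    """Parse an AU label file: one line per frame, column k holds the label for AU k."""
    rows = []
    for line in text.splitlines():
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if fields:
            rows.append([int(float(f)) for f in fields])
    return rows


def au_track(rows, au):
    """Per-frame labels for one AU; columns are 1-based, so AU12 is column 12 and speech is column 50."""
    return [row[au - 1] for row in rows]
```

For example, `au_track(rows, 50)` would return the per-frame speech annotation, provided the file has at least 50 columns.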

Baseline results

The baseline method we applied utilises static dense appearance descriptors and statistical machine learning techniques. To wit, we apply Uniform Local Binary Patterns (Uniform LBP) with 8 neighbours and radius 1 to extract the appearance features, PCA to reduce the dimensionality of the descriptor, and Support Vector Machines with Radial Basis Function kernels to classify the data. The method used is generally that described for the static LBP method in the paper titled ‘Action Unit detection using appearance in sparse space-time volumes’, accepted for publication in the main conference of FG2011. A preview of that paper can be found here.

The data is pre-processed as follows. We use the OpenCV face detector to extract the face location in each frame. The detected face is scaled to be 200 by 200 pixels. We then apply the OpenCV implementation of eye detection, and use the detected eye locations to remove any in-plane rotation of the face. We also translate the face so that the subject’s right eye centre is always at the coordinates x=60, y=60. We do not use the detected eye locations to normalise for scale, as the OpenCV eye detection is too inaccurate for this. For training, the pre-processed images were manually verified. Incorrectly pre-processed images were removed from the training set but were not replaced. For the test set, no manual verification was done.
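The rotation and translation step can be written as a single point transform, sketched below. This illustrates the geometry only and is not the baseline code: it assumes the rotation is taken about the right eye centre, that the eye centres are given as (x, y) pixel coordinates, and the function name is hypothetical. No scaling is applied, in line with the description above.

```python
import math


def alignment_transform(right_eye, left_eye, target=(60.0, 60.0)):
    """Return a function mapping image points so that the eye line becomes
    horizontal and the subject's right eye lands on `target`."""
    dx = left_eye[0] - right_eye[0]
    dy = left_eye[1] - right_eye[1]
    angle = math.atan2(dy, dx)  # in-plane rotation of the eye line
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)

    def apply(point):
        # Rotate about the right eye (assumption), then translate to the target.
        x = point[0] - right_eye[0]
        y = point[1] - right_eye[1]
        return (cos_a * x - sin_a * y + target[0],
                sin_a * x + cos_a * y + target[1])

    return apply
```

After the transform, both eyes share the same y coordinate, while the distance between them is unchanged.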

The baseline features extracted from the GEMEP-FERA training partition of the AU detection sub-challenge are now available from here, as a MATLAB struct. The cell field .X holds the features, per subject, and the .Y cell field holds the corresponding labels.

Below we give details that are specific to the two sub-challenges.

AU detection sub-challenge

We divided the 12 AUs into upper and lower face AUs: AU1, AU2, AU4, AU6, and AU7 belong to the set of upper-face AUs, while AU10, AU12, AU15, AU17, AU18, AU25, and AU26 belong to the set of lower-face AUs. To extract features, we split the face into two halves, and use the upper half to extract features for the upper-face AUs, and the lower half for the lower-face AUs. In both cases, we divide the half face into 50 squares: 5 rows and 10 columns. Each square has a side of 20 pixels. Within each block we apply the Uniform LBP operator to every pixel, and the results are used to create a 59-bin histogram for each block (Uniform LBP can generate 59 different binary words). The histograms of all 50 blocks are concatenated, resulting in a 2950-dimensional feature vector.
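The block-histogram construction can be sketched in pure Python as below. This is a simplified illustration rather than the baseline implementation: it samples the 8 neighbours on the square 3x3 ring instead of an interpolated circle, skips pixels on the image border, and all names are hypothetical. It shows where the 59 bins (58 uniform patterns plus one bin for all other patterns) and the 50 x 59 = 2950 dimensions come from.

```python
def transitions(code):
    """Number of 0/1 changes when reading the 8-bit pattern circularly."""
    bits = [(code >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8))


# Uniform patterns have at most two transitions: 58 of the 256 codes.
UNIFORM_CODES = [c for c in range(256) if transitions(c) <= 2]
BIN_OF = {c: i for i, c in enumerate(UNIFORM_CODES)}
NON_UNIFORM_BIN = len(UNIFORM_CODES)  # bin 58 collects every non-uniform code
N_BINS = NON_UNIFORM_BIN + 1          # 59

# Neighbour offsets (row, column) in circular order around the centre pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, row, col):
    """8-bit LBP code: bit is set where the neighbour is >= the centre."""
    centre = image[row][col]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if image[row + dr][col + dc] >= centre:
            code |= 1 << bit
    return code


def block_histogram(image, top, left, side=20):
    """59-bin Uniform LBP histogram of one square block."""
    hist = [0] * N_BINS
    for r in range(top, top + side):
        for c in range(left, left + side):
            # Assumption: border pixels without a full neighbourhood are skipped.
            if 0 < r < len(image) - 1 and 0 < c < len(image[0]) - 1:
                hist[BIN_OF.get(lbp_code(image, r, c), NON_UNIFORM_BIN)] += 1
    return hist


def half_face_features(half, rows=5, cols=10, side=20):
    """Concatenate the histograms of all blocks of a 100 x 200 half face."""
    features = []
    for block_row in range(rows):
        for block_col in range(cols):
            features.extend(block_histogram(half, block_row * side, block_col * side, side))
    return features
```

Here `half` is assumed to be a list of 100 rows of 200 grey values; calling the same function with `rows=10` on the full 200 by 200 face gives 100 blocks.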

Not all frames of the training set were used to train the classifiers. Instead, from every video we used one frame for each AU combination present. This ensures that we do not include the same facial appearance more than once in our training set. Here we assume that the pre-processing deals with the rigid head motion and that there are no effects of occlusion or varying lighting conditions.
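One way to pick a single frame per AU combination is sketched below. The text does not say which frame of each combination was kept, so taking the first occurrence is an assumption, as is the function name.

```python
def select_training_frames(label_rows):
    """Return the indices of the first frame showing each distinct AU combination.

    label_rows: one list of 0/1 AU labels per frame of a single video.
    """
    seen = set()
    selected = []
    for index, row in enumerate(label_rows):
        combination = tuple(row)
        if combination not in seen:
            seen.add(combination)
            selected.append(index)
    return selected
```

The function would be applied to each training video separately, so the same combination can still be sampled once from every video in which it occurs.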

PCA was applied to the training set, retaining 98% of the variance in the reduced set. This was used to train one SVM for every AU, using RBF kernels. We used the popular libSVM implementation. The optimal values for the kernel parameter and the slack variable were found using 5-fold subject-independent cross-validation on the training set. After the optimal parameter values were found, the classifiers were trained on the entire data.
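Two parts of this procedure lend themselves to a short sketch: choosing the number of principal components from the retained variance, and building cross-validation folds in which no subject appears in both the training and the validation part. Both functions are hypothetical illustrations, and the round-robin assignment of subjects to folds is an assumption; the SVM training itself is not shown.

```python
def components_for_variance(eigenvalues, fraction=0.98):
    """Smallest number of leading components whose eigenvalues sum to
    at least `fraction` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= fraction * total:
            return count
    return len(ordered)


def subject_independent_folds(subject_ids, n_folds=5):
    """Split sample indices into folds so that all samples of a subject
    fall into the same fold (assumption: round-robin over sorted subjects)."""
    subjects = sorted(set(subject_ids))
    fold_of = {s: i % n_folds for i, s in enumerate(subjects)}
    folds = []
    for k in range(n_folds):
        held_out = [i for i, s in enumerate(subject_ids) if fold_of[s] == k]
        train = [i for i, s in enumerate(subject_ids) if fold_of[s] != k]
        folds.append((train, held_out))
    return folds
```

Each candidate pair of kernel parameter and slack value would be scored by averaging its performance over the held-out parts of these folds.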

The test data was pre-processed as described above, and the trained classifiers were applied to all frames in all videos. If the face detector failed during pre-processing, that frame was taken to contain no AUs. If the eye detection was off, we simply used the detection that the classifiers provided for the mis-aligned face. Below you can find the results of this baseline method for the AU detection, where we list the performance in terms of F1-measure of the classifiers with the decision threshold set at 0.0:

          Person independent    Person specific    Overall
AU1       0.633                 0.362              0.567
AU2       0.675                 0.400              0.589
AU4       0.133                 0.298              0.192
AU6       0.536                 0.255              0.463
AU7       0.493                 0.481              0.489
AU10      0.445                 0.526              0.479
AU12      0.769                 0.688              0.742
AU15      0.082                 0.199              0.133
AU17      0.378                 0.349              0.369
AU18      0.126                 0.240              0.176
AU25      0.796                 0.809              0.802
AU26      0.371                 0.474              0.415
Avg.      0.453                 0.423              0.451

Emotion recognition sub-challenge

To extract features for the emotion recognition sub-challenge, we used the entire face. The face area was divided into 100 squares: 10 rows and 10 columns, each square with a side of 20 pixels. As with the AU detection sub-challenge, Uniform LBP was applied, and the histograms of all 100 blocks were concatenated into a single 5900-dimensional feature vector.

As the videos do not have a clear neutral element, all frames in a video of a certain emotion are assumed to depict that emotion, and thus all frames are used to train the classifiers for emotions. We trained five one-versus-all binary classifiers. During testing, every frame from the test video is passed to the five classifiers, and the emotion belonging to the classifier with the highest decision function value is assigned to that frame. To decide which emotion label should be assigned to the entire test video, we apply majority voting over all frames in the video. In case of a tie, the emotion that occurs first in an alphabetically sorted list is chosen (i.e. if the emotions ‘anger’ and ‘fear’ tie for the highest number of detected frames, then ‘anger’ is chosen as it occurs before ‘fear’ in alphabetical order).
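The per-frame decision and the majority vote with alphabetical tie-breaking can be sketched as follows. The function names are hypothetical, and breaking ties between equal decision values within a single frame alphabetically is an assumption; the text only specifies the tie-break at video level.

```python
from collections import Counter


def frame_label(decision_values):
    """Emotion whose one-versus-all classifier gives the highest decision value.

    decision_values: dict mapping emotion name to the decision function value.
    """
    # max() keeps the first maximum it meets, so sorting the names first
    # makes exact ties resolve alphabetically (assumption).
    return max(sorted(decision_values), key=lambda emotion: decision_values[emotion])


def video_label(frame_labels):
    """Majority vote over all frames; ties go to the alphabetically first emotion."""
    counts = Counter(frame_labels)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)
```

For instance, a video whose frames are labelled `["fear", "anger", "anger", "fear", "joy"]` has a two-way tie between ‘anger’ and ‘fear’, and `video_label` resolves it to ‘anger’.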

PCA was applied to the training set, retaining 95% of the variance in the reduced set.

Images for which pre-processing failed were treated in the same way as in the AU detection sub-challenge. The classification rate for every emotion is given below. Note that we cannot provide the confusion matrix here, as it would reveal too many details about the test set.

            Person-independent partition    Person-specific partition    Overall
Anger       0.86                            0.92                         0.89
Fear        0.07                            0.40                         0.20
Joy         0.70                            0.73                         0.71
Relief      0.31                            0.70                         0.46
Sadness     0.27                            0.90                         0.52
Avg         0.44                            0.73                         0.56

Random results

To aid participants in the analysis of their results, please find below the results of a ‘random’, or naive, system. In the case of AU detection, this has been obtained by setting all predictions of every AU to true, which results in the highest possible F1 measure by a naive approach. For emotion detection, we assigned a random label to each video with no prior knowledge of the frequency of occurrence of each emotion.
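For the naive AU system, recall is 1 and precision equals the fraction of frames in which the AU is present, so the F1 measure reduces to 2p / (1 + p) for a base rate p. The short sketch below illustrates this; the function name is hypothetical, and returning 0 when the AU never occurs is an assumption.

```python
def all_true_f1(truth):
    """F1 obtained by predicting 'present' for every frame of one AU."""
    n = len(truth)
    positives = sum(truth)
    if n == 0 or positives == 0:
        return 0.0  # assumption: an AU that never occurs scores 0
    precision = positives / n  # every frame is a detection
    recall = 1.0               # no positive frame is missed
    return 2 * precision * recall / (precision + recall)
```

This explains why frequently occurring AUs obtain a high naive F1 measure: the score depends only on how often the AU is present in the test frames.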

AU detection

F1 measure for AU detection results, random data

          Person independent    Person specific    Overall
AU1       0.600840              0.307167           0.505762
AU2       0.515969              0.405145           0.477156
AU4       0.589979              0.526316           0.567277
AU6       0.694118              0.489121           0.625726
AU7       0.607989              0.636828           0.618707
AU10      0.473644              0.531303           0.495311
AU12      0.749948              0.720707           0.739379
AU15      0.155125              0.227667           0.182412
AU17      0.470138              0.223132           0.387943
AU18      0.210746              0.244692           0.223348
AU25      0.820785              0.829545           0.824529
AU26      0.482947              0.509946           0.494557
Avg.      0.531019              0.470964           0.511842


Emotion detection

Random classification

            Person independent    Person specific    Overall
anger       0.214286              0.230769           0.222222
fear        0.200000              0.133333           0.160000
joy         0.200000              0.090909           0.161290
relief      0.111111              0.125000           0.115385
sadness     0.222222              0.142857           0.200000
Avg.        0.189524              0.144574           0.171779