Secondary emotion recognition test
If you choose to go with the second option, please bring your own computer hardware (e.g. laptop) on which your end-to-end emotion recognition software runs. A five-hour slot has been arranged in which the test can be performed. The test will be held in the Grand Ballroom of the FG’11 conference venue, on Thursday the 24th of March, from 5 till 10 pm.
- We provide the test videos on a memory stick (same format as test set 1)
- You generate your results within 4 hours in the presence of the organisers (results should have same format as test 1)
- You email the results to the organisers
- We reply with your scores and update the emotion recognition ranking accordingly
Overview of the data
The GEMEP-FERA dataset consists of recordings of 10 actors displaying a range of expressions, while uttering a meaningless phrase, or the word ‘Aaah’.
Participants are encouraged to use other databases of FACS AU coding to train their AU detection systems. Examples of this are the MMI Facial Expression database, as well as the Cohn-Kanade database. Because of the nature of the emotion categories in this challenge, it is not possible to use other training data for the emotion recognition sub-challenge.
In a number of videos, the first (few) frames are entirely black. This occurs in the following videos:
The test data will be made available through the gemep-fera database website on January 17th, at 9 am GMT. The test data contains six subjects: three of these are also present in the training data (person-specific data), while the other three are new (person-independent data). Participants are allowed to match the test subjects with the training subjects, in case they wish to optimise test results using a person-specific facial expression recognition method.
For the FERA2011 challenge, scores will be computed in terms of F1-measure for AU detection and classification rate for emotion detection. To obtain the overall score for the AU-detection sub-challenge, we will first obtain the F1-score for each AU independently, and then compute the average over all 12 AUs. Similarly, for the emotion categories we will first obtain the classification rate per emotion, and then compute the average over all 5 emotions. The F1-measure for AUs is computed based on a per-frame detection (i.e. an AU prediction has to be specified for every frame, for every AU, as being either present or absent). The classification rate for emotions is computed based on a per-video prediction (event-based detection). It will be calculated per emotion as the number of videos correctly classified as that emotion divided by the total number of videos of that emotion in the test set. The function used to calculate the scores is now available.
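The two measures described above can be sketched as follows. This is a minimal illustration and not the organisers' scoring function: the function names are hypothetical, all frames of an AU are assumed to be pooled into one list before the F1-measure is computed, and an AU that never occurs and is never predicted is assumed to score 0.

```python
def f1_score(truth, pred):
    """F1-measure for one AU from per-frame binary labels (1 = present, 0 = absent)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    if 2 * tp + fp + fn == 0:
        return 0.0  # assumption: no positives anywhere scores 0
    return 2 * tp / (2 * tp + fp + fn)


def average_f1(truth_by_au, pred_by_au):
    """Per-AU F1 scores and their unweighted mean over all AUs."""
    scores = {au: f1_score(truth_by_au[au], pred_by_au[au]) for au in truth_by_au}
    return scores, sum(scores.values()) / len(scores)


def classification_rates(true_labels, pred_labels, emotions):
    """Per-emotion classification rate (one label per video) and the unweighted mean."""
    rates = {}
    for emotion in emotions:
        total = sum(1 for t in true_labels if t == emotion)
        correct = sum(1 for t, p in zip(true_labels, pred_labels)
                      if t == emotion and p == emotion)
        rates[emotion] = correct / total if total else 0.0
    return rates, sum(rates.values()) / len(rates)
```

Because both averages are unweighted, a rare AU or emotion counts as much towards the overall score as a frequent one.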
Participants will be given access to the data one week before the submission deadline to test their systems. The results should be emailed to email@example.com, in one zip file for the AU results, and one zip file for the emotion results. The zip files should contain one file per video, using the same naming scheme as used for the training data, i.e.:
Please make sure to include your name in the zip filenames. Participants have two chances to send us results for every method they wish to report on. The number of methods you can report on is limited by common sense. We encourage participants to submit a separate paper for each method, unless the methods are very similar or if a comparison between the methods would be a meaningful improvement to the paper. We will send the scores immediately after computing them. The scores must be reported in the submitted paper.
If possible, please send us in separate zip files the unsigned output of your classifier. We will use this to report on the area under the ROC curve performance.
There are 12 AUs that need to be detected: AU1, AU2, AU4, AU6, AU7, AU10, AU12, AU15, AU17, AU18, AU25, and AU26. Note that during speech (coded as AD50), there is NO coding for AU25 or AU26. Because we make the annotation of AD50 available together with the other AU labels, you will be able to exclude sections of speech from your training for these two AUs. Likewise, for the computation of your scores, we will discard any detections of AU25 and AU26 during speech.
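As an illustration of the speech rule above, the sketch below selects the frames that count for a given AU. The helper name and the data layout (one 0/1 speech flag per frame) are assumptions made for this example.

```python
SPEECH_AFFECTED_AUS = {25, 26}  # AUs that are not coded during speech (AD50)


def frames_to_use(au, speech_flags):
    """Return the indices of the frames that count for this AU.

    speech_flags holds one 0/1 value per frame (1 = AD50 present).
    For AU25 and AU26 the speech frames are dropped; for all other
    AUs every frame is kept.
    """
    if au in SPEECH_AFFECTED_AUS:
        return [i for i, speaking in enumerate(speech_flags) if speaking == 0]
    return list(range(len(speech_flags)))
```

The same index list can be applied to the ground truth and to the predictions, both when selecting training frames and when computing scores.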
AUs are labelled frame-by-frame, and should thus be detected on a frame-by-frame basis (i.e. we’re doing AU spotting). An AU prediction has to be specified for every frame, for every AU, as being either present or absent.
The test data for AU detection consists of 71 videos, of the same kind as the training data videos. Half of the subjects in the test data also appear in the training data, while the other half do not. This way it is possible to assess how well systems generalise to unseen subjects.
Participants are kindly requested to report in their paper on the F1 measure obtained for every AU, as well as the average F1 measure.
There are five discrete, mutually-exclusive emotion categories that need to be detected. The categories are: Anger, Fear, Joy, Relief, and Sadness. Emotions are labelled per video, and should thus be detected on an event, or portrayal, basis. That is, one label is given for an entire video, and one prediction per video is requested for the test data.
The test data for emotion detection consists of 134 videos, of the same kind as the training data videos. Half of the subjects in the test data also appear in the training data, while the other half do not. This way it is possible to assess how well systems generalise to unseen subjects.
Participants are kindly requested to report in their paper on the classification result obtained for every emotion, as well as the average classification rate. Besides this, participants are requested to include the confusion matrix of their predictions, with rows being the predicted values and columns the true values.
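A confusion matrix in the requested orientation (rows are predicted values, columns are true values) could be built as sketched below. The lower-case label spelling and the function name are assumptions made for this example.

```python
EMOTIONS = ["anger", "fear", "joy", "relief", "sadness"]


def confusion_matrix(true_labels, pred_labels, classes=EMOTIONS):
    """Count videos per (predicted, true) pair: matrix[predicted][true]."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for true, pred in zip(true_labels, pred_labels):
        matrix[index[pred]][index[true]] += 1  # row = predicted, column = true
    return matrix
```

With this orientation, dividing each diagonal entry by the sum of its column gives the per-emotion classification rate.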
The AU labelling is provided as ASCII text, where every line is the labelling for a single frame, and every column represents an AU. So, line 1, column 1 is AU1 for the first frame, line 2, column 2 is AU2 for frame 2, and line 3, column 3 represents a non-existing label (there’s no AU3). This was done for ease of interfacing with the data. Note that column 50 represents speech. The AU labels indicate presence (a 1) or absence (a 0) of that AU.
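A label file in this layout could be read as sketched below. The column separator is not specified here, so the sketch assumes whitespace- or comma-separated integer values; the function name is hypothetical.

```python
AUS = [1, 2, 4, 6, 7, 10, 12, 15, 17, 18, 25, 26]
SPEECH_COLUMN = 50  # AD50, 1-based column number


def parse_au_labels(text):
    """Turn the contents of an AU label file into one dict per frame.

    Column k (1-based) holds the label for AU k, so AU12 is read from
    the twelfth column. Assumes whitespace- or comma-separated integers.
    """
    frames = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        values = [int(v) for v in line.replace(",", " ").split()]
        labels = {au: values[au - 1] for au in AUS}
        labels["speech"] = values[SPEECH_COLUMN - 1]
        frames.append(labels)
    return frames
```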
The emotion category labelling is done using a single word in an ASCII text file. That label represents the emotion shown in the video.
The format of the result files should be the same as the training label files.
The baseline method we applied utilises static dense appearance descriptors and statistical machine learning techniques. To wit, we apply Uniform Local Binary Patterns (Uniform LBP) with 8 neighbours and radius 1 to extract the appearance features, PCA to reduce the dimensionality of the descriptor, and Support Vector Machines with Radial Basis Function kernels to classify the data. The method used is generally that described for the static LBP method in the paper titled ‘Action Unit detection using appearance in sparse space-time volumes’, accepted for publication in the main conference of FG2011. A preview of that paper can be found here.
Data is pre-processed as follows. We use the OpenCV face detector to extract the face location in each frame. The detected face is scaled to be 200 by 200 pixels. We then apply the OpenCV implementation of eye-detection, and use the detected eye locations to remove any in-plane rotation of the face. We also translate the face so that the subject’s right eye centre is always at the coordinates x=60, y=60. We do not use the detected eye locations to normalise for scale, as the OpenCV eye detection is too inaccurate for this. For training, the pre-processed images were manually verified. Incorrectly pre-processed images were removed from the training set but were not replaced. For the test set, no manual verification was done.
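The rotation and translation step can be illustrated with the geometry below. This is a sketch of the idea only, not the baseline code: it assumes image coordinates with x to the right and y downwards, takes the two eye centres as given, and returns a point mapping rather than warping an image.

```python
import math

TARGET_RIGHT_EYE = (60.0, 60.0)  # x, y in the 200 by 200 face image


def alignment_transform(right_eye, left_eye, target=TARGET_RIGHT_EYE):
    """Build a mapping that levels the eye line and moves the right eye to target.

    The mapping rotates about the right eye by minus the angle of the
    eye line, then translates. Scale is left unchanged.
    """
    angle = math.atan2(left_eye[1] - right_eye[1], left_eye[0] - right_eye[0])
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)

    def transform(point):
        dx, dy = point[0] - right_eye[0], point[1] - right_eye[1]
        return (target[0] + cos_a * dx - sin_a * dy,
                target[1] + sin_a * dx + cos_a * dy)

    return transform
```

After the mapping, both eyes lie on the line y=60 and the distance between them is the same as before, which matches the decision not to normalise for scale.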
The baseline features extracted from the GEMEP-FERA training partition of the AU detection sub-challenge are now available from here, as a MATLAB struct. The cell field .X holds the features, per subject, and the .Y cell field holds the corresponding labels.
Below we give details that are specific to the two sub-challenges.
AU detection sub-challenge
We divided the 12 AUs into upper-face and lower-face AUs: AU1, AU2, AU4, AU6, and AU7 belong to the set of upper-face AUs, while AU10, AU12, AU15, AU17, AU18, AU25, and AU26 belong to the set of lower-face AUs. To extract features, we split the face into two halves, and use the upper half to extract features for the upper-face AUs, and the lower half for the lower-face AUs. In both cases, we divide the half-face into 50 squares: 5 rows and 10 columns. Each square has a side of 20 pixels. Within each block we apply the Uniform LBP operator to every pixel, and the results are used to create a 59-bin histogram for each block (with 8 neighbours, Uniform LBP yields 58 uniform patterns plus one bin shared by all non-uniform patterns). The histograms of all 50 blocks are concatenated, resulting in a 2950-dimensional feature vector.
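The block-wise descriptor can be sketched in plain Python as follows. Several simplifications are assumptions of this sketch and may differ from the baseline: the 8 neighbours are taken from the 3 by 3 neighbourhood without interpolation, a neighbour counts as 1 when it is greater than or equal to the centre, and pixels on the image border are skipped.

```python
# (row, column) offsets of the 8 neighbours, in circular order
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def uniform_lbp_table():
    """Map each 8-bit code to a bin: 58 uniform codes get bins 0..57, the rest share bin 58."""
    table, next_bin = {}, 0
    for code in range(256):
        bits = [(code >> i) & 1 for i in range(8)]
        transitions = sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
        if transitions <= 2:  # uniform: at most two 0/1 changes around the circle
            table[code] = next_bin
            next_bin += 1
        else:
            table[code] = 58
    return table


def lbp_code(img, r, c):
    """8-bit LBP code of pixel (r, c); img is a list of rows of grey values."""
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if img[r + dr][c + dc] >= img[r][c]:
            code |= 1 << bit
    return code


def block_histograms(img, block=20):
    """Concatenate one 59-bin histogram per block, scanning blocks row by row."""
    table = uniform_lbp_table()
    rows, cols = len(img), len(img[0])
    features = []
    for top in range(0, rows, block):
        for left in range(0, cols, block):
            hist = [0] * 59
            for r in range(max(top, 1), min(top + block, rows - 1)):
                for c in range(max(left, 1), min(left + block, cols - 1)):
                    hist[table[lbp_code(img, r, c)]] += 1
            features.extend(hist)
    return features
```

For a half-face of 100 rows by 200 columns this yields 5 x 10 x 59 = 2950 values, in line with the dimensionality stated above.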
Not all frames of the training set were used to train the classifiers. Instead, we used from every video one frame for every AU combination present. This ensures that we do not include the same facial appearance more than once in our training set. Here we assume that the pre-processing deals with the rigid head motion and that there are no effects of occlusion or varying lighting conditions.
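The frame selection described above could look like the sketch below. Which frame represents a combination is not stated here, so the sketch assumes the first frame in which the combination occurs; it also treats the neutral case (no AU active) as a combination of its own. The function name is hypothetical.

```python
def select_training_frames(frame_labels):
    """Return the index of the first frame for every distinct AU combination.

    frame_labels holds, for one video, one sequence of 0/1 AU labels
    per frame. Taking the first occurrence is an assumption of this sketch.
    """
    seen = set()
    selected = []
    for index, labels in enumerate(frame_labels):
        combination = tuple(labels)
        if combination not in seen:
            seen.add(combination)
            selected.append(index)
    return selected
```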
PCA was applied to the training set, retaining 98% of the variance in the reduced set. This was used to train one SVM for every AU, using RBF kernels. We used the popular libSVM implementation. The optimal values for the kernel parameter and the slack variable were found using 5-fold subject-independent cross-validation on the training set. After the optimal parameter values were found, the classifiers were trained on the entire data.
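Two ideas in this paragraph can be illustrated without any library: choosing how many principal components to keep for a given fraction of the variance, and building cross-validation folds in which no subject appears in both the training and the validation part. The function names and the round-robin assignment of subjects to folds are assumptions of this sketch.

```python
def components_for_variance(eigenvalues, fraction=0.98):
    """Smallest number of leading components whose eigenvalues reach the given variance fraction."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running >= fraction * total:
            return k
    return len(ordered)


def subject_folds(subject_ids, n_folds=5):
    """Yield (train_indices, validation_indices) with disjoint subjects per fold.

    subject_ids holds one subject identifier per training sample.
    Subjects are assigned to folds round-robin (an assumption).
    """
    subjects = sorted(set(subject_ids))
    fold_of = {s: i % n_folds for i, s in enumerate(subjects)}
    folds = []
    for fold in range(n_folds):
        validation = [i for i, s in enumerate(subject_ids) if fold_of[s] == fold]
        train = [i for i, s in enumerate(subject_ids) if fold_of[s] != fold]
        folds.append((train, validation))
    return folds
```

Keeping subjects disjoint means the parameter search measures generalisation to unseen people rather than to unseen frames of known people.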
The test data was pre-processed as described above. We applied the trained classifiers to all frames in all videos. If the face detector failed on a frame during pre-processing, we labelled that frame as containing no AUs. If the eye detection was inaccurate, we simply used the output that the classifiers gave for the mis-aligned face. Below you can find the results of this baseline method for the AU detection, where we list the performance in terms of F1-measure of the classifiers with the decision threshold set at 0.0:
Emotion recognition sub-challenge
To extract features for the emotion recognition sub-challenge, we used the entire face. The face area is divided into 100 squares: 10 rows and 10 columns with a side of 20 pixels. As with the AU detection sub-challenge, Uniform LBP was applied, and the histograms of all 100 blocks are concatenated into a single 5900-dimensional feature vector.
As the videos do not have a clear neutral element, all frames in a video of a certain emotion are assumed to depict that emotion, and thus all frames are used to train the classifiers for emotions. We trained five one-versus-all binary classifiers. During testing, every frame from the test video is passed to the five classifiers, and the emotion belonging to the classifier with the highest decision function value is assigned to that frame. To decide which emotion label should be assigned to the entire test video, we apply majority voting over all frames in the video. In case of a tie, the emotion that occurs first in an alphabetically sorted list is chosen (i.e. if the emotions ‘anger’ and ‘fear’ tie for the highest number of detected frames, then ‘anger’ is chosen as it occurs before ‘fear’ in alphabetical order).
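The per-frame decision and the per-video vote can be sketched as follows. The function names are hypothetical, and applying the alphabetical tie-break at frame level as well is an assumption; the text above only specifies it for the video-level vote.

```python
from collections import Counter


def classify_frame(decision_values):
    """Pick the emotion whose one-versus-all classifier gives the highest decision value.

    decision_values maps emotion name to decision value. Iterating in
    sorted order makes ties resolve alphabetically (an assumption).
    """
    return max(sorted(decision_values), key=lambda emotion: decision_values[emotion])


def classify_video(frame_labels):
    """Majority vote over per-frame labels; ties go to the alphabetically first emotion."""
    counts = Counter(frame_labels)
    best = max(counts.values())
    return min(emotion for emotion, n in counts.items() if n == best)
```

For example, a video whose frames are labelled fear, anger, fear, anger and joy has a tie between anger and fear at two frames each, and is therefore labelled anger.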
PCA was applied to the training set, retaining 95% of the variance in the reduced set.
Frames for which pre-processing failed were treated in the same way as in the AU detection sub-challenge. The classification rate for every emotion is given below. Note that we cannot provide the confusion matrix here, as it would reveal too many details about the test set.
To aid participants in the analysis of their results, please find below the results of a ‘random’, or naive system. In case of AU detection, this has been obtained by setting all predictions of every AU to true, which results in the highest possible F1 measure by a naive approach. For emotion detection, we assigned a random label to each video with no prior knowledge on the frequency of occurrence of each emotion.
F1 measure for AU detection results, person independent
F1 measure for AU detection results, person specific
F1 measure for AU detection results, overall
Emotion   Person independent   Person specific   Overall
anger     0.214286             0.230769          0.222222
fear      0.200000             0.133333          0.160000
joy       0.200000             0.090909          0.161290
relief    0.111111             0.125000          0.115385
sadness   0.222222             0.142857          0.200000
Avg.:     0.189524             0.144574          0.171779
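For the naive AU system described above, the F1 measure follows directly from how often an AU occurs: if every frame is predicted as present, recall is 1 and precision equals the fraction of frames in which the AU is present. The sketch below illustrates this; the function name is hypothetical and it is not the code used to produce the figures reported here.

```python
def all_positive_f1(truth):
    """F1 measure obtained by predicting 'present' for every frame.

    truth holds one 0/1 label per frame. Recall is 1 by construction,
    and precision equals the fraction of positive frames.
    """
    positives = sum(truth)
    if positives == 0:
        return 0.0  # the AU never occurs, so there are no true positives
    precision = positives / len(truth)
    recall = 1.0
    return 2 * precision * recall / (precision + recall)
```

An AU present in a quarter of the frames therefore gets a naive F1 of 0.4. For the emotions, a uniformly random label over five categories has an expected classification rate of 0.2 per emotion.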