Secondary emotion recognition test
The organisers have decided to follow up the emotion sub-challenge with a second test, where the participants do not get to see the data. There will be two options for participants to perform this test: either end-to-end programmes can be sent to the organisers, or end-to-end programmes can be brought to the FG conference where the test will be performed on site. The secondary test set will be approximately half the size of the original test set.
If you choose to go with the first option, we will do our utmost to accommodate any type of programme, on any type of OS, within reason. To allow timely handling of this evaluation, the programmes must be in by the 3rd of February 2011. We will return results as soon as possible. Please contact michel.valstar@imperial.ac.uk for more information.
If you choose to go with the second option, please bring your own computer hardware (e.g. laptop) on which your end-to-end emotion recognition software runs. A five-hour slot has been arranged in which the test can be performed. The test will be held in the Grand Ballroom of the FG’11 conference venue, on Thursday the 24th of March, from 5 till 10 pm. The procedure will be as follows:
- We provide the test videos on a memory stick (same format as test set 1)
- You generate your results within 4 hours in the presence of the organisers (results should have same format as test 1)
- You email the results to the organisers
- We reply with your scores and update the emotion recognition ranking accordingly
We are currently working with the organisers of FG to find a suitable time and place for this, and will keep you informed.
If this change of protocol is unacceptable to you and you wish to withdraw your submission (including your paper), please let us know as well.
Overview of the data
The GEMEP-FERA dataset consists of recordings of 10 actors displaying a range of expressions, while uttering a meaningless phrase, or the word ‘Aaah’. There are 7 subjects in the training data, and 6 subjects in the test set, 3 of which are not present in the training set.
Participants are encouraged to use other databases of FACS AU coding to train their AU detection systems. Examples of this are the MMI Facial Expression database, as well as the Cohn-Kanade database. Because of the nature of the emotion categories in this challenge, it is not possible to use other training data for the emotion recognition sub-challenge.
Black frames
In a number of videos, the first (few) frames are entirely black. This occurs in the following videos:
Test data
The test data will be made available through the gemep-fera database website on January 17th, at 9 am GMT. The test data contains six subjects: three of these are also present in the training data (person-specific data), while the other three are new (person-independent data). Participants are allowed to match the test subjects with the training subjects, in case they wish to optimise test results using a person-specific facial expression recognition method.
Challenge scores
For the FERA2011 challenge, scores will be computed in terms of the F1-measure for AU detection and the classification rate for emotion detection. To obtain the overall score for the AU-detection sub-challenge, we will first compute the F1-score for each AU independently, and then average over all 12 AUs. Similarly, for the emotion categories we will first compute the classification rate per emotion, and then average over all 5 emotions. The F1-measure for AUs is computed on a per-frame basis (i.e. an AU prediction has to be specified for every frame, for every AU, as being either present or absent). The function used to calculate the scores is now available. The classification rate for emotions is computed on a per-video basis (event-based detection). For each emotion, it is the number of videos correctly classified as that emotion divided by the total number of videos of that emotion in the test set.
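The two measures described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the organisers' scoring function: the function names are hypothetical, and returning an F1 of 0 when an AU has no positive frames or detections is an assumption.

```python
def f1_score(truth, pred):
    """F1 for one AU from per-frame binary labels (1 = present, 0 = absent)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Assumption: F1 is taken as 0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


def average_f1(truth_by_au, pred_by_au):
    """Per-AU F1 scores and their unweighted average."""
    scores = {au: f1_score(truth_by_au[au], pred_by_au[au]) for au in truth_by_au}
    return scores, sum(scores.values()) / len(scores)


def classification_rates(true_labels, pred_labels):
    """Per-emotion rate (correct videos / videos of that emotion) and the average."""
    rates = {}
    for emotion in sorted(set(true_labels)):
        idx = [i for i, t in enumerate(true_labels) if t == emotion]
        correct = sum(1 for i in idx if pred_labels[i] == emotion)
        rates[emotion] = correct / len(idx)
    return rates, sum(rates.values()) / len(rates)
```

Note that both averages are unweighted: a rare AU or emotion counts as much as a frequent one.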
Participants will be given access to the data one week before the submission deadline to test their systems. The results should be emailed to michel.valstar@imperial.ac.uk, in one zip file for the AU results, and one zip file for the emotion results. The zip files should contain one file per video, using the same naming scheme as used for the training data, i.e.:
- test_001-au.dat
- test_001-emotion.dat
Please make sure to include your name in the zip filenames. Participants have two chances to send us results for every method they wish to report on. The number of methods you can report on is limited by common sense. We encourage participants to submit a separate paper for each method, unless the methods are very similar or if a comparison between the methods would be a meaningful improvement to the paper. We will send the scores immediately after computing them. The scores must be reported in the submitted paper.
If possible, please send us in separate zip files the unsigned output of your classifier. We will use this to report on the area under the ROC curve performance.
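As a rough illustration of the packaging described above, the sketch below builds one zip archive with one file per video. The function name, the `suffix` argument and the in-memory return value are assumptions made for the example; only the `test_001-au.dat` and `test_001-emotion.dat` naming pattern comes from the text.

```python
import io
import zipfile


def build_results_zip(results, suffix):
    """Pack per-video results into a zip archive and return it as bytes.

    results: dict mapping a video name such as 'test_001' to the file contents.
    suffix:  'au' or 'emotion', giving names like 'test_001-au.dat'.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for video in sorted(results):
            archive.writestr(f"{video}-{suffix}.dat", results[video])
    return buffer.getvalue()
```

The returned bytes can be written to a file whose name includes your own name, as requested above.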
AU detection
There are 12 AUs that need to be detected: AU1, AU2, AU4, AU6, AU7, AU10, AU12, AU15, AU17, AU18, AU25, and AU26. Note that during speech (coded as AD50), there is NO coding for AU25 or AU26. Because we make the annotation of AD50 available together with the other AU labels, you will be able to exclude sections of speech from your training for these two AUs. Likewise, for the computation of your scores, we will discard any detections of AU25 and AU26 during speech.
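The exclusion of speech frames for AU25 and AU26 might look like the following sketch. The helper names are hypothetical, and `speech_flags` is assumed to hold the per-frame AD50 annotation (1 during speech, 0 otherwise).

```python
SPEECH_MASKED_AUS = {25, 26}  # AUs that are not coded during speech (AD50)


def frames_to_score(au, speech_flags):
    """Indices of the frames that count for this AU."""
    if au in SPEECH_MASKED_AUS:
        return [i for i, s in enumerate(speech_flags) if s == 0]
    return list(range(len(speech_flags)))


def mask_for_au(au, truth, pred, speech_flags):
    """Drop speech frames from the labels and predictions where required."""
    keep = frames_to_score(au, speech_flags)
    return [truth[i] for i in keep], [pred[i] for i in keep]
```

The same mask can be applied when selecting training frames for these two AUs.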
AUs are labelled frame-by-frame, and should thus be detected on a frame-by-frame basis (i.e. we’re doing AU spotting). An AU prediction has to be specified for every frame, for every AU, as being either present or absent.
The test data for AU detection consists of 71 videos, of the same kind as the training data videos. Half of the subjects in the test data also appear in the training data, while the other half does not. This way it is possible to assess how well systems generalise to unseen subjects.
Participants are kindly requested to report in their paper on the F1 measure obtained for every AU, as well as the average F1 measure.
Emotion detection
There are five discrete, mutually-exclusive emotion categories that need to be detected. The categories are: Anger, Fear, Joy, Relief, and Sadness. Emotions are labelled per video, and should thus be detected on an event, or portrayal, basis. That is, one label is given for an entire video, and one prediction per video is requested for the test data.
The test data for emotion detection consists of 134 videos, of the same kind as the training data videos. Half of the subjects in the test data also appear in the training data, while the other half does not. This way it is possible to assess how well systems generalise to unseen subjects.
Participants are kindly requested to report in their paper on the classification result obtained for every emotion, as well as the average classification rate. Besides this, participants are requested to include the confusion matrix of their predictions, with rows being the predicted values and columns the true values.
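A confusion matrix in the requested orientation (rows are predicted values, columns are true values) could be built as in this minimal sketch. The function name and the lower-case, alphabetical class order are assumptions.

```python
EMOTIONS = ["anger", "fear", "joy", "relief", "sadness"]


def confusion_matrix(true_labels, pred_labels, classes=EMOTIONS):
    """Return matrix[predicted][true] as a list of rows of video counts."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, pred in zip(true_labels, pred_labels):
        matrix[index[pred]][index[truth]] += 1  # row = predicted, column = true
    return matrix
```

With this orientation, dividing a diagonal entry by its column sum gives the classification rate for that emotion.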
Data format
The AU labelling is provided as ASCII text, where every line is the labelling for a single frame, and every column represents an AU. So, line 1, column 1 is AU1 for the first frame, line 2, column 2 is AU2 for frame 2, and line 3, column 3 represents a non-existing label (there’s no AU3). This was done for ease of interfacing with the data. Note that column 50 represents speech. The AU labels indicate presence (a 1) or absence (a 0) of that AU.
The emotion category labelling is done using a single word in an ASCII text file. That label represents the emotion shown in the video.
The format of the result files should be the same as the training label files.
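Reading the AU label files could be sketched as follows. The text does not state the column delimiter, so accepting whitespace or commas is an assumption, and the function names are hypothetical.

```python
SPEECH_COLUMN = 50  # column 50 carries the speech (AD50) annotation


def parse_au_labels(text):
    """Parse an AU label file: one row of 0/1 integers per frame."""
    frames = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Assumption: columns are separated by whitespace or commas.
        frames.append([int(value) for value in line.replace(",", " ").split()])
    return frames


def au_column(frames, au):
    """Per-frame labels for one AU; column k (1-based) holds AU k."""
    return [row[au - 1] for row in frames]


def parse_emotion_label(text):
    """An emotion file holds a single word naming the emotion."""
    return text.strip()
```

For example, `au_column(frames, SPEECH_COLUMN)` gives the per-frame speech flags.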
Baseline results
The baseline method we applied utilises static dense appearance descriptors and statistical machine learning techniques. To wit, we apply Uniform Local Binary Patterns (Uniform LBP) with 8 neighbours and radius 1 to extract the appearance features, PCA to reduce the dimensionality of the descriptor, and Support Vector Machines with Radial Basis Function kernels to classify the data. The method used is generally that described for the static LBP method in the paper titled ‘Action Unit detection using appearance in sparse space-time volumes’, accepted for publication in the main conference of FG2011. A preview of that paper can be found here.
The data is pre-processed as follows. We use the OpenCV face detector to extract the face location in each frame. The detected face is scaled to be 200 by 200 pixels. We then apply the OpenCV implementation of eye-detection, and use the detected eye locations to remove any in-plane rotation of the face. We also translate the face so that the subject’s right eye centre is always at the coordinates x=60, y=60. We do not use the detected eye locations to normalise for scale, as the OpenCV eye detection is too inaccurate for this. For training, the pre-processed images were manually verified. Incorrectly pre-processed images were removed from the training set but were not replaced. For the test set, no manual verification was done.
The baseline features extracted from the GEMEP-FERA training partition of the AU detection sub-challenge are now available from here, as a MATLAB struct. The cell field .X holds the features, per subject, and the .Y cell field holds the corresponding labels.
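The rotation and translation step can be illustrated with plain geometry. This sketch is not the baseline code: it assumes the two eye centres are already given in the coordinates of the 200 by 200 face image, that the rotation is performed about the right eye, and the function name is hypothetical. No scaling is applied, in line with the description above.

```python
import math


def alignment_transform(right_eye, left_eye, target=(60.0, 60.0)):
    """Return a function mapping image points to aligned coordinates.

    The inter-ocular line is rotated to horizontal and the subject's right
    eye is moved to `target`.
    """
    rx, ry = right_eye
    lx, ly = left_eye
    angle = math.atan2(ly - ry, lx - rx)  # tilt of the line between the eyes
    c, s = math.cos(-angle), math.sin(-angle)

    def apply(point):
        x, y = point[0] - rx, point[1] - ry  # coordinates relative to the right eye
        return (c * x - s * y + target[0], s * x + c * y + target[1])

    return apply
```

After the transform, both eyes lie on the row y=60 and the distance between them is unchanged.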
Below we give details that are specific to the two sub-challenges.
AU detection sub-challenge
We divided the 12 AUs into upper and lower face AUs: AU1, AU2, AU4, AU6, and AU7 belong to the set of upper-face AUs, while AU10, AU12, AU15, AU17, AU18, AU25, and AU26 belong to the set of lower-face AUs. To extract features, we split the face into two halves, and use the upper half to extract features for the upper-face AUs, and the lower half for the lower-face AUs. In both cases, we divide the half-face into 50 square blocks: 5 rows and 10 columns. Each block has a side of 20 pixels. Within each block we apply the Uniform LBP operator to every pixel, and the results are used to create a 59-bin histogram for each block (Uniform LBP can generate 59 different binary words). The histograms of all 50 blocks are concatenated, resulting in a 2950-dimensional feature vector.
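The block-histogram construction can be sketched in pure Python as follows. Several details here are assumptions rather than the baseline's actual choices: the 8 neighbours are taken from the 3x3 square around each pixel instead of being interpolated on a circle of radius 1, pixels on the image border are skipped, histograms are left unnormalised, and the bin order is arbitrary. All function names are hypothetical.

```python
# 8 neighbours in circular order around the centre pixel (row, column offsets).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def uniform_lbp_table():
    """Map each uniform 8-bit code (at most 2 circular bit transitions) to a bin."""
    table = {}
    for code in range(256):
        bits = [(code >> i) & 1 for i in range(8)]
        transitions = sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
        if transitions <= 2:
            table[code] = len(table)
    return table  # 58 uniform codes; bin 58 collects all non-uniform codes


def lbp_code(image, r, c):
    """8-bit LBP code of pixel (r, c): one bit per neighbour >= centre."""
    centre, code = image[r][c], 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if image[r + dr][c + dc] >= centre:
            code |= 1 << bit
    return code


def block_histograms(image, block=20):
    """Concatenate one 59-bin uniform-LBP histogram per block x block square."""
    table = uniform_lbp_table()
    rows, cols = len(image), len(image[0])
    features = []
    for br in range(0, rows, block):
        for bc in range(0, cols, block):
            hist = [0] * 59
            # Assumption: border pixels without a full neighbourhood are skipped.
            for r in range(max(br, 1), min(br + block, rows - 1)):
                for c in range(max(bc, 1), min(bc + block, cols - 1)):
                    hist[table.get(lbp_code(image, r, c), 58)] += 1
            features.extend(hist)
    return features
```

A 100 by 200 half-face gives 50 blocks and therefore 50 x 59 = 2950 values, matching the dimensionality quoted above.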
Not all frames of the training set were used to train the classifiers. Instead, we used from every video one frame for every AU combination present. This ensures that we do not include the same facial appearance more than once in our training set. Here we assume that the pre-processing deals with the rigid head motion and that there are no effects of occlusion or varying lighting conditions.
PCA was applied to the training set, retaining 98% of the variance in the reduced set. This was used to train one SVM for every AU, using RBF kernels. We used the popular libSVM implementation. The optimal values for the kernel parameter and the slack variable were found using 5-fold subject-independent cross-validation on the training set. After the optimal parameter values were found, the classifiers were trained on the entire data.
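Two parts of this procedure lend themselves to a short sketch: choosing how many principal components retain a given fraction of the variance, and building folds in which no subject appears on both sides of a split. Both functions are hypothetical illustrations. The first assumes the eigenvalues of the training covariance matrix are already available, and the round-robin assignment of subjects to folds is an assumption, as the text does not say how the folds were formed.

```python
def components_for_variance(eigenvalues, fraction=0.98):
    """Smallest number of leading components whose variance share >= fraction."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= fraction:
            return k
    return len(ordered)


def subject_folds(subject_ids, n_folds=5):
    """Split sample indices into (train, validation) pairs by subject.

    subject_ids holds one subject identifier per training sample.
    """
    subjects = sorted(set(subject_ids))
    # Assumption: subjects are dealt to folds in round-robin order.
    fold_of = {s: i % n_folds for i, s in enumerate(subjects)}
    folds = []
    for f in range(n_folds):
        train = [i for i, s in enumerate(subject_ids) if fold_of[s] != f]
        val = [i for i, s in enumerate(subject_ids) if fold_of[s] == f]
        folds.append((train, val))
    return folds
```

With 7 training subjects and 5 folds, two of the folds hold out two subjects each and the remaining three hold out one.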
The test data was pre-processed as described above. We apply the trained classifier to all frames in all videos. If during pre-processing the face detector failed, we decided that that frame had no AUs in it. If the eye-detection was off, we simply use the detection that the classifiers provided given the mis-aligned face. Below you can find the results of this baseline method for the AU detection, where we list the performance in terms of F1-measure of the classifiers with the decision threshold set at 0.0:
Emotion recognition sub-challenge
To extract features for the emotion recognition sub-challenge, we used the entire face. The face area is divided into 100 squares: 10 rows and 10 columns with a side of 20 pixels. As with the AU detection sub-challenge, Uniform LBP was applied, and the histograms of all 100 blocks are concatenated into a single 5900-dimensional feature vector.
As the videos do not have a clear neutral element, all frames in a video of a certain emotion are assumed to depict that emotion, and thus all frames are used to train the classifiers for emotions. We trained five one-versus-all binary classifiers. During testing, every frame from the test video is passed to the five classifiers, and the emotion belonging to the classifier with the highest decision function value is assigned to that frame. To decide which emotion label should be assigned to the entire test video, we apply majority voting over all frames in the video. In case of a tie, the emotion that occurs first in an alphabetically sorted list is chosen (i.e. if the emotions ‘anger’ and ‘fear’ tie for the highest number of detected frames, then ‘anger’ is chosen as it occurs before ‘fear’ in alphabetical order).
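The frame-level decision and the video-level vote can be sketched as follows. The function name and the input layout (one dictionary of decision values per frame) are assumptions, as is breaking ties between classifiers within a single frame alphabetically, which the text does not specify.

```python
from collections import Counter


def video_label(frame_scores):
    """Assign one emotion to a video from per-frame classifier outputs.

    frame_scores: list with one dict per frame, mapping emotion name to the
    decision value of that emotion's one-versus-all classifier.
    """
    # Per frame: the emotion with the highest decision value. Iterating over
    # sorted names makes max() return the alphabetically first name on a tie.
    frame_labels = [max(sorted(s), key=lambda e: s[e]) for s in frame_scores]
    counts = Counter(frame_labels)
    best = max(counts.values())
    # Majority vote; ties go to the alphabetically first emotion.
    return min(e for e, n in counts.items() if n == best)
```

For instance, a two-frame video with one ‘fear’ frame and one ‘anger’ frame is labelled ‘anger’.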
PCA was applied to the training set, retaining 95% of the variance in the reduced set.
Images for which pre-processing failed were treated in the same way as in the AU detection sub-challenge. Below is given the classification rate for every emotion. Note that we cannot provide the confusion matrix here, as it would reveal too many details about the test set.
Random results
To aid participants in the analysis of their results, please find below the results of a ‘random’, or naive system. In case of AU detection, this has been obtained by setting all predictions of every AU to true, which results in the highest possible F1 measure by a naive approach. For emotion detection, we assigned a random label to each video with no prior knowledge on the frequency of occurrence of each emotion.
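For the all-true AU predictor, recall is 1 and precision equals the fraction p of scored frames in which the AU is present, so F1 = 2p / (1 + p). The sketch below expresses this relation and its inverse; the function names are hypothetical and the code is meant only as an aid for interpreting the numbers that follow.

```python
def all_true_f1(base_rate):
    """F1 obtained by predicting 'present' for every frame."""
    return 2 * base_rate / (1 + base_rate)


def implied_base_rate(f1):
    """Fraction of positive frames implied by an all-true F1 score."""
    return f1 / (2 - f1)
```

For a uniformly random choice among the five emotions, the expected classification rate is 1/5 = 0.2 for every emotion. The rates listed below differ from this because they come from a single random draw on a limited number of videos.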
AU detection
F1 measure for AU detection results, person independent
--------------------------------------------------
Random data
--------------------------------------------------
AU1:   0.600840
AU2:   0.515969
AU4:   0.589979
AU6:   0.694118
AU7:   0.607989
AU10:  0.473644
AU12:  0.749948
AU15:  0.155125
AU17:  0.470138
AU18:  0.210746
AU25:  0.820785
AU26:  0.482947
----------------
Avg.:  0.531019
--------------------------------------------------
F1 measure for AU detection results, person specific
--------------------------------------------------
Random data
--------------------------------------------------
AU1:   0.307167
AU2:   0.405145
AU4:   0.526316
AU6:   0.489121
AU7:   0.636828
AU10:  0.531303
AU12:  0.720707
AU15:  0.227667
AU17:  0.223132
AU18:  0.244692
AU25:  0.829545
AU26:  0.509946
----------------
Avg.:  0.470964
--------------------------------------------------
F1 measure for AU detection results, overall
--------------------------------------------------
Random data
--------------------------------------------------
AU1:   0.505762
AU2:   0.477156
AU4:   0.567277
AU6:   0.625726
AU7:   0.618707
AU10:  0.495311
AU12:  0.739379
AU15:  0.182412
AU17:  0.387943
AU18:  0.223348
AU25:  0.824529
AU26:  0.494557
----------------
Avg.:  0.511842
--------------------------------------------------
Emotion detection
Random classification
----------------------------------------------------------------
           Person independent   Person specific   Overall
----------------------------------------------------------------
anger      0.214286             0.230769          0.222222
fear       0.200000             0.133333          0.160000
joy        0.200000             0.090909          0.161290
relief     0.111111             0.125000          0.115385
sadness    0.222222             0.142857          0.200000
----------------------------------------------------------------
Avg.:      0.189524             0.144574          0.171779
----------------------------------------------------------------
