Labeled Faces in the Wild
University of Massachusetts - Amherst

README contents:
--------------------------------

1. lfw.tgz - the database
2. training paradigms
   2a. Image Restricted Configuration
   2b. Unrestricted Configuration
   2c. test procedure
3. training, validation, and testing
   3a. View 1: development training/testing sets
   3b. View 2: performance testing configurations
   3c. pairs.txt format
   3d. people.txt format
4. additional details


1. lfw.tgz - the database
--------------------------------

The entire Labeled Faces in the Wild database can be downloaded as a
gzipped tar file.  After uncompressing, the contents of the database
will be placed in a new directory "lfw".  

Each image is available as "lfw/name/name_xxxx.jpg", where "xxxx" is
the image number padded to four characters with leading zeroes.  For
example, the 10th George_W_Bush image can be found as
"lfw/George_W_Bush/George_W_Bush_0010.jpg".
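
As a small illustration of the naming scheme above, the following
Python sketch builds the relative path of an image from a person's
name and an image number.  The helper name lfw_image_path and its
root argument are illustrative only and are not part of the database
distribution.

```python
from pathlib import Path


def lfw_image_path(name, number, root="lfw"):
    """Return the path of image `number` of person `name` under `root`."""
    # The image number is zero-padded to four characters.
    return Path(root) / name / f"{name}_{number:04d}.jpg"


print(lfw_image_path("George_W_Bush", 10).as_posix())
# lfw/George_W_Bush/George_W_Bush_0010.jpg
```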

There are a total of 13233 images and 5749 people in the database.  

Each image is a 250x250 jpg, detected and centered using the OpenCV
implementation of the Viola-Jones face detector.  The cropping region
returned by the detector was then automatically enlarged by a factor
of 2.2 in each dimension to capture more of the head and then scaled
to a uniform size.
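
The enlargement step can be sketched as follows.  This is only an
illustration of the arithmetic, not the code used to build the
database: it assumes the box is given as (x, y, width, height) and is
enlarged about its center, neither of which is stated above, and it
does not handle boxes that extend past the image border.  The
function name enlarge_box is hypothetical.

```python
def enlarge_box(x, y, w, h, factor=2.2):
    """Scale a box (top-left x, y, width w, height h) about its center."""
    cx, cy = x + w / 2.0, y + h / 2.0      # center of the original box
    new_w, new_h = w * factor, h * factor  # enlarged width and height
    # Assumption: the enlarged box keeps the same center.
    return (cx - new_w / 2.0, cy - new_h / 2.0, new_w, new_h)


# A 50x50 detection becomes a roughly 110x110 crop region.
print(enlarge_box(100, 100, 50, 50))
```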


2. training paradigms
--------------------------------

We give two possibilities for forming the training sets.  

2a. Image Restricted Configuration
----------------

In the first formulation, the training information is restricted to
the image pairs given in the pairs.txt file.  No information about the
actual names of the people in the image pairs should be used.  This is
meant to address the issue of transitivity.

In other words, if one matched pair consists of the 10th and 12th
images of George_W_Bush, and another pair consists of the 42nd and
50th images of George_W_Bush, then under this formulation it would not
be allowable to use the fact that both pairs consist of images of
George_W_Bush in order to form new pairs such as the 10th and 42nd
images of George_W_Bush.

To ensure this holds, one should use the name information only to
locate the image files, and not provide it to the actual algorithm.
For this reason, we refer to this formulation as the Image
Restricted Configuration.  Under this formulation, only the pairs.txt
file is needed.

2b. Unrestricted Configuration
----------------

In the second formulation, the training information is provided as
simply the names of the people in each set and the associated images.
From this information, one can, for example, formulate as many match
and mismatch pairs as one desires, from people within each set.

For instance, if George_W_Bush and John_Kerry both appear in one set,
then any pair of George_W_Bush images can be used as a match pair, and
any image of George_W_Bush can be matched with any image of John_Kerry
to form a mismatch pair.

We refer to this formulation as the Unrestricted Configuration, and
provide the people.txt that gives the names of people in each set.
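
As a rough sketch of how pairs might be enumerated within one set
under the Unrestricted Configuration, the Python example below forms
every match and mismatch pair from a mapping of names to image
numbers.  The function name make_pairs and the dictionary layout are
assumptions made for this example; in practice one would likely
sample from these pairs rather than enumerate all of them.

```python
from itertools import combinations, product


def make_pairs(images_by_person):
    """images_by_person: dict mapping a name to a list of image numbers."""
    # Match pairs: two different images of the same person.
    match = [
        (name, a, name, b)
        for name, numbers in images_by_person.items()
        for a, b in combinations(sorted(numbers), 2)
    ]
    # Mismatch pairs: one image from each of two different people.
    mismatch = [
        (name1, a, name2, b)
        for name1, name2 in combinations(sorted(images_by_person), 2)
        for a, b in product(images_by_person[name1], images_by_person[name2])
    ]
    return match, mismatch


one_set = {"George_W_Bush": [10, 12, 42], "John_Kerry": [8, 9]}
match, mismatch = make_pairs(one_set)
print(len(match), len(mismatch))
```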

2c. test procedure
----------------

Under both configurations, the test procedure is the same.  That is,
the training sets are formed from 9 of the 10 sets, with the held-out
set as the test set.  The algorithm must then classify each pair from
the held-out set, given in pairs.txt, based on the image information
from that pair alone.  In other words, the algorithm's classification
must be a function of the single pair of images, and not attempt to
leverage the other test pairs.

Note that pairs.txt is needed (for the purposes of computing the
test performance), even under the Unrestricted Configuration.  Also,
under the Unrestricted Configuration, one can form mismatch pairs from
images across different sets in the training data.
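
The test procedure can be sketched in Python as a leave-one-set-out
loop.  The names evaluate_folds, train, and classify are placeholders
for this example and do not refer to any provided code.  Note that
classify is called on one pair at a time, so its output cannot depend
on the other test pairs.

```python
def evaluate_folds(sets, train, classify):
    """Return one accuracy per held-out set.

    sets:     list of sets, each a list of (pair, label) items
    train:    hypothetical callable, training items -> model
    classify: hypothetical callable, (model, pair) -> predicted label
    """
    accuracies = []
    for test_idx, test_items in enumerate(sets):
        # Train on every set except the held-out one.
        training_items = [
            item
            for i, items in enumerate(sets)
            if i != test_idx
            for item in items
        ]
        model = train(training_items)
        # Classify each held-out pair from that pair alone.
        correct = sum(classify(model, pair) == label
                      for pair, label in test_items)
        accuracies.append(correct / len(test_items))
    return accuracies
```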


3. training, validation, and testing
--------------------------------

We organize our data into two "Views".  View 1 is for algorithm
development and general experimentation, prior to formal evaluation,
i.e. model selection or validation.  View 2 is for performance
reporting, and should be used only for the final evaluation of a
method, to minimize "fitting to the test data".

3a. View 1: development training/testing sets
----------------

We give the development sets in both configurations (image restricted
and unrestricted).  The first configuration consists of
pairsDevTraining.txt and pairsDevTest.txt.  The format for these two
files is: the first line gives the number of matched pairs N (equal to
the number of mismatched pairs) in the set, followed by N lines of
matched pairs and N lines of mismatched pairs in the same format as
the files for the performance reporting sets.
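
A minimal Python sketch of a reader for this format is given below.
It assumes whitespace-separated fields and names without spaces; the
function name parse_dev_pairs and the returned tuple layout are
choices made for this example.

```python
def parse_dev_pairs(text):
    """Parse the contents of pairsDevTraining.txt or pairsDevTest.txt."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    n = int(rows[0][0])  # number of matched (and of mismatched) pairs
    # Matched rows have three fields: name n1 n2.
    matched = [(name, int(a), name, int(b))
               for name, a, b in rows[1:1 + n]]
    # Mismatched rows have four fields: name1 n1 name2 n2.
    mismatched = [(name1, int(a), name2, int(b))
                  for name1, a, name2, b in rows[1 + n:1 + 2 * n]]
    return matched, mismatched


sample = "1\nGeorge_W_Bush 10 24\nGeorge_W_Bush 12 John_Kerry 8\n"
print(parse_dev_pairs(sample))
```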

The second configuration consists of peopleDevTraining.txt and
peopleDevTest.txt.  The format for these two files is: the first line
gives the number of people N in the set, followed by N lines of names
and number of images per person in the same format as the files for
the performance reporting sets.
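
A corresponding sketch for the people files, under the same
assumptions (whitespace-separated fields, names without spaces), is
shown below.  The function name parse_dev_people is hypothetical.

```python
def parse_dev_people(text):
    """Parse the contents of peopleDevTraining.txt or peopleDevTest.txt."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    n = int(rows[0][0])  # number of people in the set
    # Each following row is: name  number_of_images
    return {name: int(count) for name, count in rows[1:1 + n]}


sample = "2\nGeorge_W_Bush 530\nJohn_Kerry 17\n"  # counts are made up
print(parse_dev_people(sample))
```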

3b. View 2: performance testing configurations
----------------

We randomly split the database into 10 sets (uniformly random at the
person level).  We then randomly chose 300 matched pairs and 300
mismatched pairs within each set.  This information is provided in
pairs.txt.  Using this split, performance on the database can be given
using 10-fold cross validation.
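
One way to summarize the per-set accuracies from cross validation is
sketched below in Python.  This README does not prescribe a summary
statistic; the mean and the standard error of the mean used here are
an assumption of this example, and the function name summarize is
hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(accuracies):
    """Return (mean accuracy, standard error of the mean)."""
    mu = mean(accuracies)
    # stdev is the sample standard deviation (n - 1 in the denominator).
    se = stdev(accuracies) / sqrt(len(accuracies))
    return mu, se


# Ten made-up per-set accuracies, one per held-out set.
print(summarize([0.81, 0.79, 0.80, 0.82, 0.78, 0.80, 0.81, 0.79, 0.80, 0.80]))
```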

3c. pairs.txt format
----------------

The pairs.txt file is formatted as follows: The top line gives the
number of sets followed by the number of matched pairs per set (equal
to the number of mismatched pairs per set).  The next 300 lines give
the matched pairs in the following format:

name   n1   n2

which means the matched pair consists of the n1-th and n2-th images
for the person with the given name.  For instance,

George_W_Bush   10   24

would mean that the pair consists of images George_W_Bush_0010.jpg and
George_W_Bush_0024.jpg.

The following 300 lines give the mismatched pairs in the following format:

name1   n1   name2   n2

which means the mismatched pair consists of the n1-th image of person
name1 and the n2-th image of person name2.  For instance,

George_W_Bush   12   John_Kerry   8

would mean that the pair consists of images George_W_Bush_0012.jpg and
John_Kerry_0008.jpg.

This procedure is then repeated 9 more times to give the pairs for the
next 9 sets.
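
The layout described above can be read with a short Python sketch
such as the following.  It assumes whitespace-separated fields and
names without spaces; the function name parse_pairs and the returned
structure are choices made for this example.

```python
def parse_pairs(text):
    """Parse the contents of pairs.txt into a list of sets.

    Each set is a tuple (matched, mismatched); every pair is stored as
    (name1, n1, name2, n2), with name1 == name2 for matched pairs.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    num_sets, per_set = int(rows[0][0]), int(rows[0][1])
    sets, pos = [], 1
    for _ in range(num_sets):
        # Matched rows: name n1 n2
        matched = [(name, int(a), name, int(b))
                   for name, a, b in rows[pos:pos + per_set]]
        pos += per_set
        # Mismatched rows: name1 n1 name2 n2
        mismatched = [(name1, int(a), name2, int(b))
                      for name1, a, name2, b in rows[pos:pos + per_set]]
        pos += per_set
        sets.append((matched, mismatched))
    return sets


# Tiny made-up file: 2 sets with 1 matched and 1 mismatched pair each.
sample = (
    "2 1\n"
    "George_W_Bush 10 24\n"
    "George_W_Bush 12 John_Kerry 8\n"
    "John_Kerry 1 2\n"
    "John_Kerry 3 George_W_Bush 5\n"
)
print(parse_pairs(sample))
```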

3d. people.txt format
----------------

The people.txt file is formatted as follows: The top line gives the
number of sets.  The following line gives the number of people in the
first set.  Let that number be N.  The next N lines give the names and
number of images of the people in the first set, one per line.  For
instance, if George_W_Bush was in the first set, one line would be:

George_W_Bush   530

The next line gives the number of people in the second set,
followed by the names and number of images of the people in the second
set.  This procedure is repeated for all 10 sets.
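
A reader for this layout might look like the Python sketch below.  It
assumes whitespace-separated fields and names without spaces; the
function name parse_people is hypothetical, and the counts in the
sample are made up.

```python
def parse_people(text):
    """Parse the contents of people.txt into one dict per set.

    Each dict maps a person's name to that person's number of images.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    num_sets = int(rows[0][0])
    sets, pos = [], 1
    for _ in range(num_sets):
        n = int(rows[pos][0])  # number of people in this set
        pos += 1
        sets.append({name: int(count)
                     for name, count in rows[pos:pos + n]})
        pos += n
    return sets


sample = "2\n1\nGeorge_W_Bush 530\n2\nJohn_Kerry 17\nJane_Doe 1\n"
print(parse_people(sample))
```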


4. additional details
--------------------------------

For additional details on how the database was constructed, as well as
how the configurations were chosen for performance reporting, please
refer to our technical report:

Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller.
Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments.
University of Massachusetts, Amherst, Technical Report 07-49, October, 2007.

For updated details on categories of LFW results, including
information concerning unsupervised methods and methods using external
training data, please refer to our follow-up technical report:

Gary B. Huang and Erik Learned-Miller.
Labeled Faces in the Wild: Updates and New Reporting Procedures.
UMass Amherst Technical Report UM-CS-2014-003, 2014.