A Restricted Boltzmann Machine (RBM) is a fully connected layer in a neural network that is trained as an autoencoder via a special method called contrastive divergence. See the RBM section at deeplearning.net for the theory. Unlike a simple feed-forward network trained as an autoencoder, an RBM is a generative model that can be used to (a) generate samples from the data distribution, and (b) produce log-likelihood values for a given input sample. I have implemented a variety of RBM types in my TensorFlow-based comprehend package on GitHub. Here I test them on the Labeled Faces in the Wild (LFW) dataset [1]. They do learn what typical faces look like, generating samples of them from scratch. They are also able to move a given actual face towards a more typical representation.
Fig. 1: 64 random faces from the LFW dataset, after preprocessing
Figure 1 shows a sample from the LFW dataset, after the faces have been frontalized [2], converted to grayscale, resized to 64x64 pixels, and normalized to have pixel values with mean 0.0 and standard deviation 0.3, clipped to the range [-1.0, 1.0]. These processed images, 13,200 of them, form our input data. For the purpose of dimension reduction, and to take advantage of the local structure in the data, I first use a convolutional neural network (CNN) before adding an RBM layer on top. The CNN has five layers; the first four have a convolution stride of 2 and do the bulk of the dimension reduction. The input layer, i.e. the images, with a dimension of 64x64x1 = 4096, is reduced to 4x4x32 = 512 at the output layer of the CNN. I train the CNN layer by layer as an autoencoder with an L2 cost, using standard backpropagation and the Adam optimizer.
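The encoder's shape progression can be sketched in a few lines. This is a minimal illustration of the arithmetic, not code from the comprehend package; it assumes 'same' padding on each convolution:

```python
def conv_out_size(size, stride):
    # With 'same' padding, a strided convolution outputs ceil(size / stride).
    return -(-size // stride)

# Five conv layers: the first four use stride 2, the last stride 1.
size = 64
for stride in [2, 2, 2, 2, 1]:
    size = conv_out_size(size, stride)

print(size)              # spatial size at the top layer: 4
print(size * size * 32)  # 4 x 4 x 32 = 512 hidden values
```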
The output of the CNN, i.e. the hidden values of the top layer, is then saved to disk and serves as the input data for a single-layer RBM with 500 hidden units. RBMs use a sigmoid activation function, and so expect values in the range [0.0, 1.0]. This is easily addressed by a simple linear transformation of the input data. The harder issue is that vanilla RBMs model probabilities of binary input values: the data is presumed to be either very close to 0.0 or very close to 1.0, and when training an RBM, values are sampled to be either 0 or 1. This works great for the MNIST dataset, but not here, where the input data is continuous. In fact, this is the main reason for this post: I wanted to test my modifications to vanilla RBMs for handling continuous-valued input (I had already tested vanilla RBMs on MNIST, where they work well).
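Both the linear range transformation and the binary sampling can be illustrated with a short sketch; the function names here are my own, not from the comprehend package:

```python
import numpy as np

rng = np.random.default_rng(0)

def to_unit_range(x, lo=-1.0, hi=1.0):
    # Linearly map CNN outputs in [lo, hi] onto the sigmoid's [0, 1] range.
    return (x - lo) / (hi - lo)

def sample_bernoulli(p, rng=rng):
    # Vanilla RBM sampling: each unit is on (1) with probability p, else off (0).
    return (rng.random(p.shape) < p).astype(float)

x = np.array([-1.0, 0.0, 1.0])
p = to_unit_range(x)      # -> [0.0, 0.5, 1.0]
s = sample_bernoulli(p)   # every sampled value is exactly 0 or 1
```

The binary sampling step is exactly what discards the continuous information when the data is not near 0 or 1.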
Literature on continuous-valued RBMs is sparse; what I came across suggested sampling from a Gaussian with some fixed standard deviation, e.g. 0.1, centered at the values output by the sigmoid activation function. This seemed not quite right to me, as the resulting sampled values would require clipping to stay in the range [0.0, 1.0]. So instead I sample from a beta distribution, whose range is naturally [0.0, 1.0]. See the R3BM classes in the networks.py module of my comprehend package on GitHub. There was some adaptation involved, as the beta distribution is not normally parameterized by its mean and standard deviation.
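One way to do that adaptation is to solve for the beta distribution's shape parameters from a target mean and standard deviation. The sketch below is my own illustration of this re-parameterization, not the exact R3BM code:

```python
import numpy as np

rng = np.random.default_rng(0)

def beta_shape_params(mean, std):
    # For Beta(alpha, beta): mean = alpha / (alpha + beta) and
    # var = mean * (1 - mean) / (alpha + beta + 1).  Solving for the
    # shape parameters (valid only when std**2 < mean * (1 - mean),
    # so in practice the mean must be kept away from 0 and 1):
    nu = mean * (1.0 - mean) / std**2 - 1.0
    return mean * nu, (1.0 - mean) * nu

def sample_beta(mean, std=0.1, rng=rng):
    a, b = beta_shape_params(mean, std)
    return rng.beta(a, b)  # always lands in [0, 1]; no clipping needed
```

For example, a mean of 0.5 with std 0.1 gives alpha = beta = 12, a symmetric bump centered at 0.5.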
Fig. 2: Faces from Figure 1 after autoencoding by the CNN-RBM network
After the RBM was trained, it was stacked on top of the CNN (with the simple linear transformation in between to translate the ranges of values). Figure 2 shows the results when the faces from Figure 1 are fed as input to this CNN-RBM network (encoding) and the output is fed back down from the RBM on top (decoding) to perform an autoencoding. The faces look regularized towards an average-looking face: glasses and moustaches go missing. We can control the extent to which this happens by mixing the hidden values on the upward pass with those on the downward pass at some intermediate layer (not shown).
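The mixing trick can be sketched as below, with placeholder encode/decode functions standing in for the CNN's lower layers and the RBM round trip; none of these names come from the actual package:

```python
import numpy as np

def autoencode_mixed(x, enc_lower, enc_upper, dec_upper, dec_lower, mix):
    h_up = enc_lower(x)                  # features on the way up
    h_down = dec_upper(enc_upper(h_up))  # round trip through the RBM on top
    # mix = 1.0 gives the full autoencoding; mix = 0.0 reproduces the input.
    h = mix * h_down + (1.0 - mix) * h_up
    return dec_lower(h)

# Toy stand-ins: the "RBM" here typicalizes everything to zeros.
enc = lambda x: 2.0 * x
dec = lambda h: 0.5 * h
rbm_up = lambda h: h
rbm_down = lambda h: np.zeros_like(h)

x = np.array([1.0, -0.5])
out_keep = autoencode_mixed(x, enc, rbm_up, rbm_down, dec, mix=0.0)  # original x
out_full = autoencode_mixed(x, enc, rbm_up, rbm_down, dec, mix=1.0)  # fully typicalized
```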
Fig. 3: Faces generated from scratch by the RBM
Now, RBMs can also be used to generate samples. Starting with random input values, after 1000 iterations of Gibbs sampling I obtain the faces in Figure 3 (once the RBM samples are decoded back through the CNN). They look like real faces, though smoother and closer to the canonical face than the actual samples. To get more detail, we would likely need a deep, multi-layered RBM [3].
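A bare-bones version of that generation loop, using random placeholder weights instead of the trained model, might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_generate(W, b_vis, b_hid, n_steps=1000, rng=rng):
    v = rng.random(b_vis.shape)  # random starting visible values
    for _ in range(n_steps):
        p_h = sigmoid(v @ W + b_hid)                      # up: hidden probabilities
        h = (rng.random(p_h.shape) < p_h).astype(float)   # sample binary hiddens
        v = sigmoid(h @ W.T + b_vis)                      # down: visible means
    return v  # these are then decoded back through the CNN

# 512 visible units (the CNN's top layer) and 500 hidden units, as above.
W = rng.normal(scale=0.01, size=(512, 500))
v = gibbs_generate(W, np.zeros(512), np.zeros(500), n_steps=100)
```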
Now for some fun: let's try interpolating faces. My initial intention was to interpolate between faces at the RBM hidden layer. But since the RBM likes to typicalize faces, losing distinctiveness, I do it here at the top CNN layer instead. Figure 4 below shows the result of mixing George W. Bush's faces with Tony Blair's. I believe we would get better results by using a deep RBM and mixing hidden values at the top of that; I will leave that for another day.
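The interpolation itself is just a convex combination of the two faces' top-layer CNN features, which is then decoded back down through the CNN. A sketch with placeholder feature vectors:

```python
import numpy as np

def interpolate_features(h_a, h_b, t=0.5):
    # t = 0 gives face A, t = 1 gives face B, t = 0.5 the middle-row mix.
    return (1.0 - t) * h_a + t * h_b

h_bush = np.array([0.2, 0.8, 0.4])   # placeholder top-layer features
h_blair = np.array([0.6, 0.0, 0.4])
h_mid = interpolate_features(h_bush, h_blair)  # -> [0.4, 0.4, 0.4]
```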
Fig. 4: Top row faces mixed with bottom row faces to get middle row faces
[1] Erik Learned-Miller, Gary B. Huang, Aruni RoyChowdhury, Haoxiang Li, and Gang Hua. Labeled Faces in the Wild: A Survey. Advances in Face Detection and Facial Image Analysis, 2016.
[2] Tal Hassner, Shai Harel, Eran Paz, and Roee Enbar. Effective Face Frontalization in Unconstrained Images. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
[3] G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, Vol. 313, 2006.