It is often useful to visualize the weights learnt by the first hidden layer of an artificial neural net. We can then see which features are being picked up by each node in the layer. The following image is such a visualization. It is a heatmap of the weights of each node in the first hidden layer of an RBM trained on the MNIST dataset. There are 450 nodes, tiled into 25 columns by 18 rows. We can see the nodes picking up patterns in the data; and we notice similarities between some of these features. Can we structure the presentation better to highlight these similarities?
It would be nice if we could cluster the 450 tiles such that similar tiles appear near each other in the visualization. We can pull out some heavy-duty clustering algo that works in the high dimensional space of the features (28x28 = 784 dimensions), and present the features one cluster at a time, as perhaps one row per cluster. This would work, but would probably need some bespoke attention every time we have a different set of tiles, as it is not quite a push-button procedure. Is there a simpler way?
We can compute the 450x450 pairwise correlation coefficients between the 450 features, and just arrange the features in a single line such that adjacent features have high correlation. We will then get sequences of clusters, that we can break at cluster boundaries where the running correlation has a dip. Formally:
Find an ordering
1 < i <= Nfeatures, such that the sum of correlation coefficients,
1 < j < N, is maximized.
This sounds familiar, doesn’t it? Because, it is equivalent to the Traveling Salesman Problem! The N features are the N cities the salesman needs to visit, the distances between the cities being the negative of the correlation coefficient between the features (so minimizing the distance traveled is the same as maximizing the sum correlations between the features in order).
I used a TSP solver on the weights shown above, and then wrapped the resulting ordering by row to generate the image below. Much neater, isn’t it?
The TSP solver, which gives an approximate solution, has O(N^4) computational complexity and uses O(N^2) memory. It took a minute to run. I have implemented a much simpler greedy algorithm, which is less optimal but runs in O(N^2) time and O(N^2) memory (i.e., same as what’s required for computing the correlation matrix), and runs with no noticeable delay. See
comprehend.features.corrsort() in my github project.
For further improvement, one can segment the ordered sequence of features into subsequences, i.e. clusters with high correlation within and lower correlation without. Then, add blank tiles to use as delimiters between subsequences, taking care to avoid linebreaking a subsequence. In other words, treat each subsequence like one would treat a word when typesetting. Center justification with ragged edges would probably look nicest. Any takers?