The Silhouette Loss Function: Metric Learning with a Cluster Validity Index

Jim Bremner

April 09, 2019

At its heart, is all about visualisation. We aim to take any large dataset of unlabelled images and place each image at a position on the screen, such that similar images are positioned closer together. In this way, we can help users label vast numbers of images with minimal effort.

This is relatively straightforward to do if our model has already been trained on the same labels that we are testing it with (e.g. cars, animals), but the bigger challenge, and where is most useful, is with unseen image classes. That is to say: our model is already good at discriminating between a sea lion and a Japanese spaniel, because it has been trained on images of these animals, but we want to leverage what it has learnt about sea lions and Japanese spaniels (amongst many other things) to help it pick up subtle differences in, for example, images of capybaras (which don’t appear in the training data).

This comes under the broad field of transfer learning, and the good news is that current approaches already do quite a good job of it.


The traditional approach

Typically, the unseen images are passed through a CNN which has been pretrained on a large, generic dataset - i.e. it has already learnt to “see”. Then, rather than taking the output of the network (on the right in Figure 1), we carry out layer introspection, which is just the process of pulling out the activations from layers inside the network. In a CNN, these activations are of dimensions convolution height x convolution width x number of channels (as in Figure 1), so we must first flatten them out into a vector, typically with 1000’s of components.

Figure 1: The traditional approach to creating image representations from unseen images using layer introspection.

This vector is an encoding of information contained in the image, which is later used to perform classification. In other words, it is a high-dimensional vector embedding of our test image. However, it’s hard to visualise 1000’s of dimensions, so we then must apply further dimensionality reduction techniques (e.g. PCA) in order to condense all of this information down into 2 or 3 dimensions for plotting.

Layer introspection has a number of advantages. First and foremost, the 2D projections it produces (the final coordinates above) are generally quite good visual representations of the relationship between the images. What’s more, CNN’s have a useful property in that the earlier layers capture lower-level features such as colours or simple shapes, whereas the later layers capture higher-level concepts which tend to be more complex shapes with a closer resemblance to the objective classes in the very final layer (Yosinski et al. 2014). In this way, one can choose to cluster images based on these low- or high-level features depending on the task at hand.

However, this approach still suffers from a number of drawbacks. Whilst the flattened activations are rich in information, they sometimes contain in excess of 200,000 components. Condensing this information down to just 2 dimensions is never a quick process. The methods that are quicker tend to throw away a lot of information, producing poorer quality projections (e.g. PCA), whilst those that do a better job of preserving the shape of clusters from the hyperdimensional space are painfully slow (e.g. t-SNE, UMAP).

On a more fundamental level, these pretrained models were optimised to classify images into distinct categories in their final layer - at no point were they optimised to produce the tight, well-separated clusters that we want. The usefulness of layer introspection really only arises as a by-product of our model’s true objective: to provide discrete classification. As a result, whilst introspection performs well in a large number of cases, sometimes we can still be left with sprawling, overlapping clusters of images.


Possible solutions

It’s clear that we should maybe try optimising our network to cluster images directly. There have been a number of previous attempts at solving this problem; I shall briefly discuss just two here. Xie, Girshick and Farhadi train their network to learn cluster centroids for each image class as well as the image embeddings themselves (Xie, Girshick, and Farhadi 2015).  Caron et al. apply PCA to the flattened vectors from introspection, then creating pseudo labels from k-means clustering, which they use to then optimise the network output in alternating steps (see Figure 2) (Caron et al. 2018).

Figure 2: In (Caron et al. 2018), the authors apply k-means clustering to predict labels which they then use to optimise their model.

These methods seem to work quite well, but there are a few reasons why they aren’t so suitable for us. First, they both optimise using cluster labels, and although we are eventually interested in these, initially we’re much more interested in creating a nice spatial distribution of clusters to help the user in assigning the labels they want.

Both of these methods are unsupervised, and although the images provided on don’t initially have labels, we’d still like to make use of a large amount of labelled background data to pretrain with (autoencoders don’t take advantage of these either). They’re also slow and require multiple stages of optimisation. They would need to be trained on the new images before even producing any visualisations, whereas a pretrained model would be able to start outputting projections straight away.


Spatially optimised clustering

Ideally, we want to output our 2D coordinates at the end of the network and optimise them to produce tight, well-separated clusters. In a way, we want our network to carry out dimensionality reduction for us. We’ve no reason to design a new network from scratch, we already have a large number of candidates that we know “understand” images well (ResNet, VGG etc.), in so far as they can classify objects extremely well. The missing piece is a loss function which will minimise intra-class and maximise inter-class Euclidean distance between image embeddings from the output of the network.

This is known as metric learning, and it is really the core of what platform aims to do: embed images in a 2D space where distance relates inversely to similarity. A couple of popular functions are contrastive loss (Hadsell, Chopra, and LeCun 2006) and triplet loss (Schroff, Kalenichenko, and Philbin 2015), which come from siamese networks (often applied in facial recognition). These compare image embeddings (pairs and triplets of image embeddings respectively) and will result in a smaller loss if the images have the same class and are close together, and a larger loss if they are close but of different classes. The combined effect is to push and pull the embeddings into clusters of distinct classes after each batch update.

Contrastive and triplet loss are widely used but require very specific batch processing (e.g. hard example mining) in order to train well. Only making comparisons between pairs or triplets also seems to be inefficient, can’t we compare more image embeddings at once?


Cluster validity indices

Cluster validity indices (CVI’s) aim to provide a measure of the quality of a particular cluster partition. That is, if we have a number of data points and we assign each with a class label, CVI’s provide a measure of how well those class labels match the underlying spatial distribution - whether points of the same class all lie close to each other in distinct clusters, or if they’re spread out in sprawling, overlapping clusters.

Arbelaitz et al. have carried out an extensive review of around 30 different cluster validity indices, comparing their performance on a number of different synthetic and real datasets (Arbelaitz et al. 2013). They found that the silhouette score generally performs the best.

Figure 3:  Success in identifying the true ground truth partition of cluster labels across a number of different synthetic datasets for various cluster validity indices (Arbelaitz et al. 2013). Here, the silhouette score (Sil) performs best.

This analysis measured how well the cluster labels matched the spatial distribution of points, but we can also change our perspective slightly to measure instead how well the spatial distribution matches the cluster labels. In this case, the cluster labels are fixed (as our ground truth) and the spatial distribution is the variable. From here, it’s easy to see that if the silhouette score provides such a faithful measure of clustering quality, then why can’t we directly optimise our network to maximise it?


The silhouette score

The form of the silhouette score (Rousseeuw 1987) for a single point can be seen in Figure 4. In essence, it’s a normalised difference in distance, capturing how close a point is to other points in its own cluster compared to points in the next nearest cluster.

The formula for the silhouette score $s(i)$ for a single point $i$ (in Figure 4) contains two important terms:

$a(i)$ - the mean distance between $i$ and all other points in its own cluster.

$b(i)$ - the distance between $i$ and its next nearest cluster.

Towards the positive extreme of $s(i) = 1$, we must have that $b(i) \gg a(i)$, so the point $i$ is either sat right on top of all of the other points in its cluster ($a(i)=0$) or the nearest cluster is infinitely far away ($b(i) \rightarrow \infty$). Therefore, $s(i)=1$ represents good quality clustering.

Moving towards the negative extreme of $s(i)=-1$, we now must have that $a(i) \gg b(i)$, so point $i$ is either sat right on top of all of the points in its nearest neighbour cluster ($b(i)=0$), or all of the other points in its own assigned cluster are infinitely far away ($a(i) \rightarrow \infty$). In the latter case, point $i$ most likely should have been assigned to its nearest neighbour, therefore, $s(i)=-1$ represents poor quality clustering.

Figure 4: The silhouette score

The formula in Figure 4, represents the silhouette score for a single point in a projection containing multiple points, but we can easily find an average silhouette score across all of the points. This is a really attractive feature relative to contrastive and triplet losses. With the latter two, we only carry out $\frac{N}{2}$ and $\frac{2N}{3}$ (respectively) comparisons per batch containing $N$ images. The number of comparisons made when using the silhouette score is less obvious because of the dynamics of selecting nearest neighbour clusters, but it is lower bounded at $N-1$, and some simple experiments have shown that it sits at $\sim 1.35 N$ on average. The gradients produced would therefore presumably be much smoother as each batch takes into account the relationship between many more embeddings (and in a much more efficient way).

The only thing that remains to be done to convert the score into a proper loss function is to simply flip and rescale it to between 0 and 1, where 0 represents perfect clustering, and 1 represents poor clustering.


Alternatives: COREL loss

To really test the performance of the silhouette loss function, it would be nice to have an alternative approach with a similar objective. Clustering-Oriented Representation Learning with Attractive-Repulsive Loss (or COREL loss for short) was recently proposed by Kenyon-Dean et al. to address the problem of the poor representational (cluster) quality of embeddings learnt using the categorical cross entropy (CCE) loss function (Kenyon-Dean et al. 2018). They start interestingly by reframing CCE as an attractive-repulsive loss function as shown in Figure 5.

Figure 5: Softmax categorical cross entropy refactored as an attractive-repulsive loss function.

By breaking up the log of the fraction from the softmax CCE equation, they reveal attractive and repulsive terms. The attractive term is simply the negative sum of the dot products of the ground truth vectors with the predicted vectors, which is an unnormalised version of the cosine similarity between the two. Therefore, if the angles between the ground truth and predicted vectors are small, the loss is reduced, as we’d expect. The repulsive term is the positive log of the sum of the exponentials of this same unnormalised cosine similarity but between the predicted vector and all of the negative ground truth vectors. Therefore, if a predicted vector has a small angle between a negative class, then the loss is increased, acting to push it away.

Motivated by this formulation of CCE, the authors go on to form their own loss functions: Cosine-COREL and Gaussian-COREL, which cluster in cosine and Euclidean space respectively. In Figure 6 I show the former just for the sake of brevity. Here, the authors change the attractive term to a normalised cosine similarity, and the repulsive term to only include the nearest neighbour negative ground truth class, much the same as in the silhouette loss function that I described in the previous section.

Figure 6: The Cosine-COREL loss function, where $s_{cos}$ is the cosine similarity.

The authors find that these COREL loss functions produce much better cluster representations than CCE without sacrificing much in terms of classification accuracy. So these seem like good loss functions to compare with our silhouette loss.


Seen labels: CIFAR-10

Although we are more interested in the performance on unseen labels, it’s useful to assess performance on seen classes too just to get a sense of what the models are learning. I carried out a comparison between contrastive, CCE, Cosine-COREL, Gaussian-COREL and silhouette losses, the results (after dimensionality reduction with UMAP) can be seen in Figure 7. All models were trained from a ResNet34 pretrained on ImageNet, and all but the model with contrastive loss (for reasons I shall explain) were trained until convergence.

It’s important to note that since the main reason for researching different methods was to avoid carrying out excessive dimensionality reduction, I decided to keep the embedding dimensions as small as possible. This meant that for contrastive loss, with a modest 256-d embedding dimension, I struggled to train the model at all. I’d imagine we’d see much better results for this loss function if I had used a larger embedding dimension as seen in the literature, but this would have been contrary to my aims.

Figure 7: Clustering performance on seen labels from CIFAR-10. Result of UMAP dimensionality reduction on (a) 256-d embedding (b) 10-d embedding (c) 10-d embedding (d) 10-d embedding (e) 16-d embedding

We know CCE loss performs very well on discrete classification tasks, but if we try and plot its learnt embeddings, the clusters formed are often broken apart and sometimes have interclass connections. The COREL loss functions, which, as we have seen, are similar in form, instead produce round, whole and nicely separated clusters. It’s worth pointing out that for the CCE and the COREL loss functions, the number of classes in the training set must equal the embedding dimension - so once we reach a large number of training classes, we have no choice but to use a large embedding.

It’s nice to see that the silhouette loss performs well, although perhaps not quite as well as the COREL loss functions. The difference is that we have complete control over the output dimensionality - I used a 16-d embedding here.


Unseen labels: CIFAR-100

In order to test performance on unseen labels, I took the CIFAR-100 dataset and randomly partitioned its 100 classes into 50 ‘seen’ labels to train on and 50 ‘unseen’ labels with which to test. This process is summarised in Figure 8. As the best performing model on the seen labels, I also included a Cosine-COREL model trained on all of the 50 seen classes. Finally, to test, I included all of the aforementioned models as well as the traditional approach using layer introspection from a ResNet34 trained on ImageNet as a baseline.

Figure 8: The train/test split for testing clustering performance on unseen labels.

At this point, we could make a quantitative analysis of each of the model’s performance on the unseen labels with a cluster validity index such as the silhouette score. But since we’ve trained one of the models to directly optimise this score, this seems a little unfair. Instead, we can still draw broad conclusions from plotting subsamples of the 50 unseen classes as I have done in Figure 9.

Figure 9: Unseen class clustering performance on three separate subsamples of the CIFAR-100 unseen labels partition. UMAP reductions from 25088-, 50- and 16-dimensional embeddings for introspection, COREL and silhouette models respectively.

The first thing to note is that all models do quite a good job of visually separating the different classes into clusters - although overall, the COREL and silhouette models are superior to layer introspection. Of course, we are limited to making qualitative judgements here, but from these and many other plots that I have examined, it seems that in general, the silhouette loss probably produces tighter and better-separated clusters on unseen labels than Cosine-COREL. This also comes after much less training (~25 mins as opposed to ~1hr30 on a Tesla P4 GPU) and with complete control over the embedding dimension.

It’s interesting to point out a failure case, however. In the centre plot of Figure 9, both of the proposed models struggle to separate sunflowers from orchids. This is an example of poor intradomain performance, where our models have learnt the concepts of tulips and roses from the seen labels, but this hasn’t provided them with much help in distinguishing between the unseen sunflowers and orchids. We could probably explain this effect if we notice that a sunflower is probably more visually similar to an orchid, than a trout is to a tiger, hence the different abilities in differentiating between each. The good thing is that performance on these intradomain examples will probably improve with further pretraining on a larger dataset (exploratory experiments I have done confirm this). Testing on a small set of labelled images also shows that on, our models can already differentiate between difficult examples after the user has labelled only a few dozen images, working in a virtuous cycle to accelerate labelling further.


What about the feature hierarchy?

I mentioned earlier that one of the benefits of layer introspection was that by looking at the activations in different layers, one can choose to cluster images based on lower or higher level features (e.g. simple colours or complex object classes respectively). Unfortunately, the COREL and silhouette loss functions are only trained with one level of features so don’t allow for this at their final layer, unless we trained with some complicated multi-objective loss function which created hierarchical clusters according to different levels of features - but I don’t see this working well. However, there’s nothing stopping us from applying both methods in parallel, with little extra overhead.

It may be interesting to explore DenseNets (Huang et al. 2016), where early/late layer features are concatenated. Perhaps embeddings learnt in this way combine lower and higher level features more effectively?



From this very early exploratory analysis, it’s easy to see the potential usefulness of the silhouette loss function. Here are some of my ideas for future research:

The silhouette score is just one of many cluster validity indices to optimise, perhaps others / combinations of others may provide more useful properties.

I’ve only focused on clustering in Euclidean space as this is directly applicable to visualising data on a 2-d plane. One could just as easily use any distance or dissimilarity function depending relating to a particular application.

Clustering is also useful for almost all other forms of data (e.g. sound and text), I don’t see why you couldn’t use a silhouette-style loss function for these.

Moving further away from visualising data, the concept of learning representations is crucial in few-shot learning, where embeddings for background classes are traditionally learnt before testing on unseen classes - exactly as I have done here (see (Snell, Swersky, and Zemel 2017)). Some early testing on OMNIGLOT gives competitive results. I’m also interested in whether such a framework might provide any advantages when training with a relation network (Sung et al. 2017) in place of our standard Euclidean distance function.



Arbelaitz, Olatz, Ibai Gurrutxaga, Javier Muguerza, Jesús M. Pérez, and Iñigo Perona. 2013. “An Extensive Comparative Study of Cluster Validity Indices.” Pattern Recognition 46 (1): 243–56.

Caron, Mathilde, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. 2018. “Deep Clustering for Unsupervised Learning of Visual Features.”

Hadsell, R., S. Chopra, and Y. LeCun. 2006. “Dimensionality Reduction by Learning an Invariant Mapping.” In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, 1735–42.

Huang, Gao, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2016. “Densely Connected Convolutional Networks.”

Kenyon-Dean, Kian, Andre Cianflone, Lucas Page-Caccia, Guillaume Rabusseau, Jackie Chi Kit Cheung, and Doina Precup. 2018. “Clustering-Oriented Representation Learning with Attractive-Repulsive Loss.”

Rousseeuw, Peter J. 1987. “Rousseeuw, P.J.: Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Comput. Appl. Math. 20, 53-65.” Journal of Computational and Applied Mathematics 20 (November): 53–65.

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. 2015. “FaceNet: A Unified Embedding for Face Recognition and Clustering.”

Snell, Jake, Kevin Swersky, and Richard S. Zemel. 2017. “Prototypical Networks for Few-Shot Learning.”

Sung, Flood, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. 2017. “Learning to Compare: Relation Network for Few-Shot Learning.

Xie, Junyuan, Ross Girshick, and Ali Farhadi. 2015. “Unsupervised Deep Embedding for Clustering Analysis.”

Yosinski, Jason, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. “How Transferable Are Features in Deep Neural Networks?”


Twitter / Facebook / Email