Adding Language to

Krisztian Kovacs

April 30, 2019

by Krisztian Kovacs and Jim Bremner

The platform aims to make it easy for anyone to label large datasets of images. As it stands, this process is already much easier than manually sorting through them, as the platform already clusters images in a meaningful way. Wouldn’t it be great if, in addition, the platform could automatically label its clusters? Or if we could feed it the labels we are searching for, and let it project our images accordingly?

In a way, our interaction with the platform when searching for a particular type of image is not unlike interacting with a helpful librarian when in search of a particular genre of book. At the moment, the librarian has essentially managed to organise all of the books into their relevant shelves (clusters), as any good librarian should. Unfortunately, however, none of the shelves in our library are actually labelled. We can probably still find the shelf we are searching for just by browsing through ourselves. But we want to do better; we want to tell our librarian exactly what we are looking for so they can take us to the relevant shelf directly.

In practice, doing so is not straightforward. We need to create a model that can not only handle images it has never seen before, but also cluster them according to labels it hasn’t seen before. In other words, we need a model that is capable of decent zero-shot classification performance.

Our solution, as one might guess, is to teach our model English. Specifically, we use a visual-semantic embedding network, a model that uses the word vector representations of labels instead of their one-hot encoded form. Our network is essentially transforming images into a semantically meaningful space so that we can compare them to arbitrary words.

There has been previous work in this area. However, it was not obvious how to extend these models to a multi-label setting, and they never gained much traction. A common theme in all previous visual-semantic models was that they compared image activations to label embeddings after pooling, at the linear layers. Doing so lost the spatial correspondence between labels and images. It also set up a much harder objective for the network, as it had to represent concepts at different locations with the same ‘average’ embedding.

In contrast, we compare image and label embeddings at the final convolutional layer. This small change allows the network to infer different concepts at different locations -- improving its classification performance and allowing it to produce basic segmentation masks. In fact, the network can be tweaked for some new applications as well, such as searching images by concept and location (for example, ‘image with a dog on the right, but not on the left’). Importantly for the platform, such a network can produce the new semantic projections that we want (as shown in Figure 1).


Figure 1: An example projection of our model. We want to be able to project images along arbitrary labels. The model did not see the labels (person, ocean) during training.


Previous Work

Visual-semantic classification models aim to combine computer vision and word vectors. A simple single label example would be to predict the word2vec embedding of a label instead of its one-hot-encoded version (as used in the original DeViSE paper (Frome et al. 2013)).

What is attractive about such an approach? First, it can evaluate labels not seen during training (zero-shot learning). Second, it can handle a varying number of classes. We can train the model on 10 classes, then decide to add more without changing the network (and thus we can train it on arbitrarily many labels - even 100k different ones). Third, each class teaches the network not only about itself, but also about the general embedding space. For example, learning to correctly predict the embedding of ‘dog’ will presumably help in predicting the embedding for ‘puppy’. Thus, this type of network is well suited for tasks that have few images per label.

One difficulty with visual-semantic embedding models is that it’s not obvious how to extend them to a multi-label setting. (Wang et al. 2016) used a CNN-RNN framework where the RNN predicts a sequence of labels. (Wang et al. 2016; Ren et al. 2015) used a two-stage process: first, object detection to create image subregions; second, prediction of a single label embedding for each of the resulting subregions.

(Li and Yeh 2018) is the closest to our approach. For each image, they sample positive and negative labels. Positive labels appear in the image, negative labels are random samples from the list of classes that don’t appear.

For each image, their model outputs a k x d dimensional matrix, where d is the embedding dimension, and k is a hyperparameter. They then take the dot product between this matrix and each label embedding, resulting in a k-dimensional vector for each image-label combination. They use a loss function that pushes positive (correct) labels to zero, and negative (incorrect) labels away from zero.

At evaluation time, they provide an image with a list of potential labels and calculate the dot product between the output image matrix and each label embedding. That results in a k-dimensional vector for each label. They then predict the label with the smallest distance from the origin.
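The evaluation step described above can be sketched roughly as follows. This is only an illustration of the description in this post, not the authors' code: the random arrays stand in for a real model output and real label embeddings, and the label names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 300                                  # k is a hyperparameter, d the embedding size
image_matrix = rng.normal(size=(k, d))         # stand-in for the model's k x d output
label_names = ["dog", "cat", "car"]            # hypothetical candidate labels
label_embeddings = rng.normal(size=(len(label_names), d))

# One k-dimensional vector per candidate label: shape (num_labels, k)
projections = label_embeddings @ image_matrix.T

# Distance of each k-dimensional vector from the origin
distances = np.linalg.norm(projections, axis=1)

# Predict the label whose vector lies closest to the origin
predicted = label_names[int(np.argmin(distances))]
```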

Our approach

Our data sampling approach is similar to (Li and Yeh 2018). For each image, we sample positive and negative labels - 50 in total. Images usually only contain a handful of positive labels, so we include all of them and sample as many negative labels as needed to make 50 labels in total.
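A minimal sketch of this sampling step, assuming labels are plain strings and negatives are drawn uniformly at random. The function name `sample_labels` and the toy vocabulary are illustrative, not taken from our codebase.

```python
import random

def sample_labels(positive_labels, all_labels, total=50, seed=None):
    """Return (labels, targets): every positive label plus random negatives."""
    rng = random.Random(seed)
    positives = list(dict.fromkeys(positive_labels))[:total]  # dedupe, keep order
    positive_set = set(positive_labels)
    candidates = [label for label in all_labels if label not in positive_set]
    n_negatives = min(total - len(positives), len(candidates))
    negatives = rng.sample(candidates, n_negatives)
    labels = positives + negatives
    targets = [1.0] * len(positives) + [0.0] * len(negatives)
    return labels, targets

# Toy vocabulary of 200 hypothetical labels, two of which appear in the image
vocabulary = [f"label_{i}" for i in range(200)]
labels, targets = sample_labels(["label_3", "label_7"], vocabulary, seed=0)
```

The returned `targets` list marks positives with 1 and negatives with 0, which is the form a binary loss expects.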

Our model then predicts a [0, 1] score for each image label combination, indicating whether or not the two match. It is similar to a Siamese Network, but instead of predicting the similarity of two images, we predict the similarity between an image and a label embedding.

For label embeddings, we use the standard 300-dimensional word2vec vectors trained on Google News. If a label contains multiple words (‘small dog’), we simply average the embeddings of each word. We also normalize label embeddings to unit norm for calculating cosine distances later.
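The averaging and normalization can be sketched as below. The `word_vectors` lookup is assumed to be any mapping from word to vector (a plain dict here); the toy 3-dimensional vectors stand in for the real 300-dimensional word2vec embeddings.

```python
import numpy as np

def embed_label(label, word_vectors):
    """Average the vectors of each word in `label`, then scale to unit length."""
    vectors = [word_vectors[word] for word in label.split()]
    mean_vector = np.mean(vectors, axis=0)
    return mean_vector / np.linalg.norm(mean_vector)

# Toy vectors for illustration only
toy_vectors = {
    "small": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.0, 1.0, 0.0]),
}
small_dog = embed_label("small dog", toy_vectors)
```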

Our backbone network is a pretrained Resnet34. We chose this network not because it’s ideal for our task (it’s not), but because 1) it’s simple and 2) it’s the platform’s own choice, allowing for fair comparisons later.

We depart from (Li and Yeh 2018) by comparing image and label embeddings at the convolutional level. If the network’s final convolutional layer contained 300 channels (the dimension for our label embedding), then we could calculate the dot product for each height and width dimension (see Figure 2).


Figure 2: An example of combining a convolutional block with a label embedding, giving similarity scores at (x, y) coordinates. For a 224x224 image, a Resnet34’s final convolutional layer has spatial dimensions of 7 x 7. The dotted outline represents an image-region embedding within the convolutional block.


This architecture embeds each spatial location in 300-dimensional word2vec space. It naturally accommodates different objects at different locations -- if there is a cat at the left side of the image, and a dog on the right side, we don’t have to represent both with a single word2vec vector - each can be mapped to its own concept.

However, we still have one difficulty. How do we treat labels that refer to the same location? For example, ‘dog’ and ‘furry’ would both point to the same x, y coordinates. The solution is simple: we use multiple 7x7x300 blocks! This is equivalent to learning a k x 300-dimensional matrix at each height-width location in (Li and Yeh 2018)’s formulation. However, we think it’s easier to visualize it as k separate 7x7x300 convolutional blocks. Figure 3 illustrates the whole pipeline.

Figure 3: Illustration of the model head with k=3 (our main model uses k=5). It shows a single image-label comparison. For a 224x224 image, a Resnet34’s last convolutional layer has a height and width of 7.


We then calculate the cosine distances between image and label embeddings in each of the k distinct blocks. That gives us k separate 7x7 distance matrices. We treat this set of maps just as another convolutional block (with k being the channel dimension) and pass it through a few simple conv and ReLU layers. That results in a single 7x7 map giving the similarity between the label and the image at each spatial location.
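To make the shapes concrete, here is a rough NumPy sketch of this head. It computes cosine similarities rather than distances, and it replaces the "few simple conv and ReLU layers" with a single 1x1 convolution over the k channels; both are simplifying assumptions, and the random features stand in for real network activations.

```python
import numpy as np

def cosine_maps(features, label_embedding, eps=1e-8):
    """features: (k, d, h, w) region embeddings; label_embedding: (d,).
    Returns (k, h, w) cosine similarities, one map per block."""
    dots = np.einsum("kdhw,d->khw", features, label_embedding)
    feature_norms = np.linalg.norm(features, axis=1)      # (k, h, w)
    label_norm = np.linalg.norm(label_embedding)
    return dots / (feature_norms * label_norm + eps)

def mix_maps(maps, weights, bias=0.0):
    """A 1x1 convolution over the k channels followed by ReLU: (k, h, w) -> (h, w)."""
    mixed = np.tensordot(weights, maps, axes=1) + bias
    return np.maximum(mixed, 0.0)

rng = np.random.default_rng(0)
k, d, h, w = 5, 300, 7, 7
features = rng.normal(size=(k, d, h, w))                  # stand-in activations
label_embedding = rng.normal(size=d)                      # stand-in word vector

maps = cosine_maps(features, label_embedding)             # shape (5, 7, 7)
similarity_map = mix_maps(maps, weights=np.full(k, 1.0 / k))  # shape (7, 7)
```

In a trained model the mixing weights would be learned; the uniform weights here are only placeholders.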

To get our final prediction, we pool this map. We use average and max-pooling, and interpolate between the two with a learnable weight.
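One way to write this pooling is sketched below. Passing the learnable parameter through a sigmoid so that the interpolation weight stays between 0 and 1 is an assumption of this sketch; the post does not specify how the weight is parameterised.

```python
import numpy as np

def mixed_pool(similarity_map, alpha_logit):
    """Blend max- and average-pooling; the sigmoid keeps the weight in [0, 1]."""
    alpha = 1.0 / (1.0 + np.exp(-alpha_logit))
    return alpha * similarity_map.max() + (1.0 - alpha) * similarity_map.mean()

similarity_map = np.zeros((7, 7))
similarity_map[3, 4] = 1.0                     # a single strongly matching region
score = mixed_pool(similarity_map, alpha_logit=0.0)   # alpha = 0.5
```

With a weight near 1 the score behaves like max-pooling and responds to small objects; near 0 it behaves like average-pooling and favours concepts that fill the image.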

To sum up, we feed our model two inputs: an image and a list of labels (both positive and negative). For each label, our model gives a similarity measure between the image and that label. If our targets are one-hot encoded vectors (with 1 for positive labels), we can actually train this model with standard loss functions! Since we are in a multi-label setting, we use binary-cross-entropy as our loss.
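For reference, binary cross-entropy over one image's label scores can be written in a few lines. The scores and targets below are made-up numbers for illustration; in practice a deep learning framework's built-in loss would be used.

```python
import numpy as np

def binary_cross_entropy(scores, targets, eps=1e-7):
    """Mean binary cross-entropy between predicted scores in [0, 1] and 0/1 targets."""
    scores = np.clip(scores, eps, 1.0 - eps)   # avoid log(0)
    losses = -(targets * np.log(scores) + (1.0 - targets) * np.log(1.0 - scores))
    return losses.mean()

scores = np.array([0.9, 0.2, 0.1, 0.7])        # model output per label
targets = np.array([1.0, 0.0, 0.0, 1.0])       # 1 for positive labels
loss = binary_cross_entropy(scores, targets)
```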

It’s important to note that the network is kept deliberately simple. Many of the architecture choices could be optimized for better performance:

  • Changing the backbone network. Instead of a Resnet34, we could use a more advanced model, with less down-sampling (so the height - width dimensions of the last convolutional layer are larger).
  • Instead of cosine distances, we could use a more meaningful measure. For example, concatenating the image and word embeddings and passing it through a mini-neural net, which then learns a more flexible distance metric.
  • We could choose a better representation than simply averaging the word2vec embeddings for labels with multiple words.
  • We haven’t tuned k, only set it to k=5.

In addition to these architecture choices, we have not optimized our training procedure either. We only trained on 224x224 images and stopped at 13 epochs. We did use default fastai image transformations, and the 1-cycle policy for faster training described in (Smith 2018).

Our model produced multilabel classification results that were competitive with those in the literature, but we won’t reproduce them here for the sake of brevity. What’s more relevant to the platform is how we can now apply this work to speed up the process of clustering and labelling images.

Semantic Projections

The projections in the platform are based on intermediate-level features in the convolutional part of the network. These represent edges and textures (at the lower levels), basic object shapes such as faces (at the middle layers), or high-level features such as the eye of a chicken (at the higher levels). The platform plots images using these features after they’ve been reduced to manageable dimensions.

Figure 4: PCA reduction of image activations (first 2 dimensions). From the NUS-WIDE dataset, based on the layer immediately before the dot product with label embedding, for 200 random images.


Figure 4 shows the first 4 dimensions of the image embeddings (after reducing the 300 dimensions to 10 using PCA). The image clusters are interpretable: animals, people, buildings, landscapes etc.

Our architecture allows for new types of projections. In addition to the projections described above, it’s able to represent images in word2vec ‘concept’ space (think of our helpful librarian). This concept space is often much closer to how people want to view object categories, as it is not based on proxies such as shapes and textures. To achieve this, all we need to do is rank images based on how close they are to our chosen label’s word2vec representation.
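The ranking step can be sketched as follows, assuming each image has already been summarised by a single vector in the same space as the label embedding (in our model the score comes from the pooled similarity map instead). The 2-dimensional toy vectors are for illustration only.

```python
import numpy as np

def rank_images(image_embeddings, label_embedding):
    """Return image indices ordered from most to least similar to the label."""
    image_unit = image_embeddings / np.linalg.norm(
        image_embeddings, axis=1, keepdims=True
    )
    label_unit = label_embedding / np.linalg.norm(label_embedding)
    similarities = image_unit @ label_unit     # cosine similarity per image
    return np.argsort(-similarities), similarities

image_embeddings = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.0, -1.0]])
label_embedding = np.array([0.0, 2.0])
order, similarities = rank_images(image_embeddings, label_embedding)
```

Plotting `similarities` on the x-axis gives a 1D projection along the chosen label; repeating this for a second label gives the second axis of a 2D projection.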

All projections below are from a random subset of the NUS-WIDE dataset, based on labels not seen during training (so we can see how well we generalize into ‘unseen’ word2vec space).


Figure 5: Projections for the 'animal' label. This label was not seen during training. The y-axis is randomly generated to spread out the images.


Figure 5 ranks images based on a class that was not seen during training (‘animal’). While the model has not encountered this class before, it has seen similar ones and is able to roughly place images in the correct neighborhood. Note that the ranking is meaningful, despite the model not having seen the labels we project along during training.

We can also create 2D plots based on 2 labels (possibly to visualize the interaction between them) - Figure 6 shows such a case.

Figure 6: 2D projection along the garden-animal dimensions


Another interesting analysis we can make is of the diversity of objects within an image. By analysing the variance of the region embeddings for an image, we can produce a diversity score and treat it as simply another axis to plot against. So, for example, if we wanted to find images containing only sky (low diversity), we could make a plot such as Figure 7, where a cluster of empty skies has been picked out in the bottom right area of the plot (high levels of sky, low levels of diversity).

Figure 7: 2D projection along the diversity-sky dimensions
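One plausible way to compute such a diversity score is sketched below. The exact statistic is an assumption of this sketch: it takes the variance of each embedding dimension across the 7x7 regions and averages the result.

```python
import numpy as np

def diversity_score(region_embeddings):
    """region_embeddings: (h, w, d). Mean per-dimension variance across regions."""
    flat = region_embeddings.reshape(-1, region_embeddings.shape[-1])
    return flat.var(axis=0).mean()

rng = np.random.default_rng(0)
# Same embedding in every region, e.g. an image of empty sky
uniform_image = np.tile(rng.normal(size=300), (7, 7, 1))
# A different embedding in every region, e.g. a cluttered scene
varied_image = rng.normal(size=(7, 7, 300))

low = diversity_score(uniform_image)
high = diversity_score(varied_image)
```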


For projecting along multiple labels, we can either scroll through the 1- or 2-dimensional projections or cluster them around a few points in embedding space. Figure 8 demonstrates this idea.

Figure 8: Clustering images based on 5 classes. Note that none of these labels were seen during training.


Spatial Applications

These projections are already useful. However, another attractive characteristic of our model is that, unlike traditional classification networks, its predictions are inherently spatial. Rather than pooling and flattening the spatial features contained in the final convolutional layers, we preserve their structure. In essence, we have a coarse semantic segmentation network, where the model encodes each different region of the image in an embedding.

Below we illustrate what kind of applications these intra-image embeddings lend themselves to.

Localisation maps

First, we can localize objects or concepts in an image. By extracting cosine similarities (shown in Figure 1), we can easily build semantic localisation maps (reminiscent of class activation maps (CAMs) (Zhou et al. 2015)). Figure 9 shows a number of different heat maps representing the local strength of different word embeddings in an image.

Figure 9: Semantic localisation maps


We applied this procedure on the PASCAL VOC2012 segmentation dataset to see how well these coarse localisation maps match pixel-level annotations. Setting the background threshold at 0.8 gave us very reasonable results, with a micro-averaged segmentation accuracy (mean intersection over union) of ~30%. A few example segmentations can be seen in Figure 10. We must bear in mind here that our model was only trained on image-level annotations. It’d be interesting to use these in a similar way to (Wei et al. 2018), where the authors supplement their strongly supervised pixel-level dataset with more noisy masks (in their case produced from CAMs).

Figure 10: Example semantic segmentation performance on PASCAL VOC2012
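As a rough illustration of how a localisation map can be scored against an annotation, the sketch below thresholds a similarity map and computes intersection over union for one class. Treating 0.8 as a per-cell cut-off is one reading of the background threshold mentioned above, and the 3x3 arrays are toy data; the real evaluation works on maps resized to pixel resolution.

```python
import numpy as np

def threshold_mask(similarity_map, threshold=0.8):
    """Cells at or above the threshold count as the class, the rest as background."""
    return similarity_map >= threshold

def intersection_over_union(predicted, target):
    """IoU between two boolean masks of the same shape."""
    intersection = np.logical_and(predicted, target).sum()
    union = np.logical_or(predicted, target).sum()
    return intersection / union if union > 0 else 1.0

similarity_map = np.array([[0.90, 0.85, 0.10],
                           [0.95, 0.30, 0.20],
                           [0.10, 0.10, 0.10]])
target = np.array([[1, 1, 0],
                   [1, 1, 0],
                   [0, 0, 0]], dtype=bool)

predicted = threshold_mask(similarity_map)
iou = intersection_over_union(predicted, target)
```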


We can also carry out an even less supervised localisation for our image, instead simply applying k-means clustering to the region embeddings (with a supplied value of k). Even though the clustering algorithm has been given no spatial information whatsoever, it can still coarsely segment the image into its different objects (see Figure 11).

Figure 11: K-means clustering on region embeddings for coarse object localisation (here with randomly chosen values of 2 ≤ k ≤ 5).
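The k-means step can be sketched with a small hand-written implementation, shown below under the assumption that region embeddings arrive as an (h, w, d) array. In practice a library implementation would do the same job; the toy image with two distinct halves is for illustration only.

```python
import numpy as np

def kmeans_segment(region_embeddings, n_clusters, n_iters=20, seed=0):
    """Cluster (h, w, d) region embeddings; returns an (h, w) map of cluster ids."""
    h, w, d = region_embeddings.shape
    points = region_embeddings.reshape(-1, d)
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen regions
    centroids = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iters):
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        assignments = distances.argmin(axis=1)
        for c in range(n_clusters):
            members = points[assignments == c]
            if len(members) > 0:      # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return assignments.reshape(h, w)

# Toy image: left half has one embedding, right half another
toy = np.zeros((4, 4, 3))
toy[:, :2] = np.array([1.0, 0.0, 0.0])
toy[:, 2:] = np.array([0.0, 1.0, 0.0])
segments = kmeans_segment(toy, n_clusters=2)
```

Note that the clustering only sees the embeddings, never the (x, y) positions, so any spatial coherence in `segments` comes from the embeddings themselves.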


It’s worth pointing out that not only can these localisation maps produce multiple labels for a given region, but they can also gauge the spatial distribution of anything you can express in language. That’s not only simple object classes such as “sea” and “boat”, but more descriptive words such as “stormy” and “dark” or more abstract concepts such as “ominous”. Indeed, there’s nothing stopping us from creating a localisation map for whole sentences by using an upscaled version of our model.

Localisation map search

We’ve seen that we can easily transform images into semantic localisation maps, but why not try going in the other direction: to search for images in our dataset which best match a given set of maps?

We could first create search queries simply by painting on a grid where we would (positive grid values) and where we wouldn’t (negative grid values) like to see particular concepts. We can multiply each of these query maps (one for each search concept), with the corresponding semantic localisation maps that we described in the previous section, and then sum these values together. The final result would be a score for each image representing the closeness of the match with the search query.
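The scoring described above amounts to an element-wise product followed by a sum. A minimal sketch, assuming localisation maps are stored per concept as (n_images, h, w) arrays; the dictionary layout and the 2x2 toy grids are illustrative choices, not the platform's data format.

```python
import numpy as np

def search_scores(localisation_maps, query_maps):
    """localisation_maps: {concept: (n_images, h, w)}; query_maps: {concept: (h, w)}.
    Returns one score per image: the sum over concepts and cells of query * map."""
    n_images = next(iter(localisation_maps.values())).shape[0]
    scores = np.zeros(n_images)
    for concept, query in query_maps.items():
        scores += (localisation_maps[concept] * query).sum(axis=(1, 2))
    return scores

# Query: "person" wanted on the right (+1), unwanted on the left (-1)
query_maps = {"person": np.array([[-1.0, 1.0],
                                  [-1.0, 1.0]])}
localisation_maps = {"person": np.array([
    [[0.0, 1.0], [0.0, 1.0]],    # image 0: person on the right
    [[1.0, 0.0], [1.0, 0.0]],    # image 1: person on the left
])}

scores = search_scores(localisation_maps, query_maps)
best_match = int(np.argmax(scores))
```

Adding further concepts to `query_maps` (for example, a negative map for "clouds") combines several criteria in the same score.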

We can carry out sophisticated image searches by combining any number of query maps with different criteria. For example, below we search for images with sky in the top half but without any clouds. Our query maps may look something like the plots at the top of Figure 12. The bottom of Figure 12 shows the search results.

Figure 12: Localisation map search through NUS-WIDE validation set for the query (top). Matching images contain sky with no clouds located in the top half of the image



Figure 13 shows a search with “person” on the right-hand side but not on the left.


Figure 13: Localisation map search through NUS-WIDE validation set for the query (top). Matching images contain a “person” on the right but no “person” on the left.


We modified visual-semantic embedding models by placing the label-image comparison at the convolutional layer. The resulting model is capable of zero-shot learning, and can create projections along the axis spanned by an arbitrary label. The particular outputs of our network also allow us to create unsupervised segmentation masks, localisation maps and a localisation map search function.



Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. 2013. “DeViSE: A Deep Visual-Semantic Embedding Model.” In Advances in Neural Information Processing Systems, 2121–29.

Li, Yi-Nan, and Mei-Chen Yeh. 2018. “Learning Image Conditioned Label Space for Multilabel Classification.”

Ren, Zhou, Hailin Jin, Zhe Lin, Chen Fang, and Alan Yuille. 2015. “Multi-Instance Visual-Semantic Embedding.”

Smith, Leslie N. 2018. “A Disciplined Approach to Neural Network Hyper-Parameters: Part 1 -- Learning Rate, Batch Size, Momentum, and Weight Decay.”

Wang, Jiang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, and Wei Xu. 2016. “CNN-RNN: A Unified Framework for Multi-Label Image Classification.”

Wei, Yunchao, Huaxin Xiao, Honghui Shi, Zequn Jie, Jiashi Feng, and Thomas S. Huang. 2018. “Revisiting Dilated Convolution: A Simple Approach for Weakly- and Semi- Supervised Semantic Segmentation.”

Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2015. “Learning Deep Features for Discriminative Localization.”

Chua, Tat-Seng, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yantao Zheng. 2009. “NUS-WIDE: A Real-World Web Image Database from National University of Singapore.”

