CNN Explainer - Interpreting Convolutional Neural Networks (3/N)

Visualizing Boosted Convolutional Features

In today’s article, we are going to investigate what Convolutional Neural Networks (CNNs) learn during the object classification task. Visualizing a CNN’s features allows us to see what, from the CNN’s point of view, makes a thing a thing. By the end of this article, you will be able to visualize hierarchical features reflecting how CNNs ‘understand’ images.

In other words, if you are curious about what’s in the image below, keep reading!

(source: author)

This is the third part of the CNN Explainer series. If you haven’t read the previous parts yet, feel free to do so now.

The Essence of Deep Learning in Computer Vision

Deep Neural Networks in Computer Vision applications are usually trained by being exposed to a vast number of annotated visual examples, with the idea that the network will implicitly learn the essence of the matter necessary to give an accurate prediction, without being explicitly told what to look for.

For example, when training a mug detector, we would gather a dataset of various mugs and hope that, during training, the CNN would implicitly learn what makes a mug a mug, without being explicitly told to look for, say, a specific shape.

The notion of identity is a very interesting philosophical problem, investigated over the ages by various thinkers who, at the end of the day, were just trying to answer the following question:

What makes a thing what it is?

Convolutional Neural Networks are probably not suited to answer such a broad philosophical question on a general level, but they might be able to do it on a very specific level in object classification applications.

Let’s try to verify that!

Going back to the mug detection problem, we humans somehow intuitively know that both of these pictures show mugs:

Even though these objects have different colors and slightly different shapes, and their handles are not in the same place, we can easily agree that both of them are mugs.

So what makes a mug a mug?

Let’s see if a Convolutional Neural Network can answer that question.

Boosted Features Visualizations

The idea of visualizing features learned by CNNs that we are going to use in this project is based on publications by Alexander Mordvintsev et al. (2015) and Erhan et al. (2009).

To better understand the underlying concept, let’s stop for a second and remind ourselves how CNNs work.

During the training phase, CNNs are exposed to image datasets and, using gradient descent, adapt the weights of their filters to these images.

You can think of a deep network as a multistage information-distillation operation, where information goes through successive filters and comes out increasingly purified.

François Chollet, Deep Learning with Python

In this project, we are going to do something kind of the opposite, i.e. we are going to start with a random noise image, and optimize its pixels with gradient ascent so that it activates our target filter the most. So instead of optimizing model weights, we are optimizing input pixels.
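In symbols, the idea can be sketched roughly as follows. The notation is an assumption made for illustration and does not come from the original project: x is the input image, A_f(x) is the activation map of the target filter f with spatial size H × W, and η is a step size.

```latex
a_f(x) = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} A_f(x)_{ij},
\qquad
x_{t+1} = x_t + \eta \, \nabla_x a_f(x_t)
```

This is the plain gradient-ascent form. Later in the article an Adam optimizer is applied to the negated activation instead, which follows the same direction with adaptive step sizes. In both cases the model weights stay fixed and only the pixels of x change.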

With that approach, we are going to enhance the patterns that the CNN ‘understands’.


First, we need to select a target layer. We are going to use the last non-output layer because it should contain a mix of the high-level features learned at the end of the network and the low-level features learned in the very first layers. However, feel free to experiment and try selecting different layers.

After selecting the target layer, it’s time to collect its activations while passing our target image, which in our case contains a mug, through the network.

(source: author)

These are the activations of the filters of the target layer of a ResNet-18 CNN (pretrained on ImageNet) after seeing the target image.

layer = list(model.children())[-2]  # last non-output layer
activations = ActivationsExtractor(layer)  # activations' hook
_ = model(preprocessed_image.cuda())  # forward pass to collect activations
average_activations_per_filters = [activations.features[0, i].mean().item() for i in range(activations.features.shape[1])]
(source: author)

We can see that one of the filters, the one at index 327, activated significantly more strongly than the others.

most_activated_filter_idx = np.argmax(average_activations_per_filters)
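The snippets above rely on an `ActivationsExtractor` helper that the article does not show. Below is a minimal sketch of how such a helper could look, assuming it simply wraps a PyTorch forward hook. The class body, the tiny stand-in network, and the random input image are all assumptions made so the example runs without downloading pretrained weights. In the article the model is a pretrained resnet18 and the input is a preprocessed mug photo.

```python
import torch
import torch.nn as nn


class ActivationsExtractor:
    """Hypothetical helper: stores the output of `module` on each forward pass."""

    def __init__(self, module: nn.Module):
        self.features = None
        self.hook = module.register_forward_hook(self._store)

    def _store(self, module, inputs, output):
        # Store the raw output tensor of the hooked layer.
        self.features = output

    def close(self):
        # Remove the hook once the activations are no longer needed.
        self.hook.remove()


torch.manual_seed(0)

# Tiny stand-in network (an assumption; the article uses a pretrained resnet18).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
).eval()

layer = model[3]  # output of the last convolutional block in this stand-in
activations = ActivationsExtractor(layer)

image = torch.rand(1, 3, 32, 32)  # stand-in for the preprocessed target image
with torch.no_grad():
    _ = model(image)  # forward pass fills activations.features

# Mean activation of every filter (channel) in the hooked layer.
average_activations_per_filter = activations.features[0].mean(dim=(1, 2))
most_activated_filter_idx = int(average_activations_per_filter.argmax())
activations.close()
```

Averaging over the spatial dimensions gives one number per filter, which is what makes the per-filter comparison and the `argmax` selection possible.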

The filter that activated the most is very likely to have features corresponding closely to the input image of a mug. Let’s try to visualize them!

  1. First, we start with a random noise image.
  2. Next, we set up an Adam optimizer for the random noise image.
  3. Then we collect the activations of the target filter and make them our loss. Note that we take the negation because we want gradient ascent instead of gradient descent.
  4. We repeat this process for several iterations.
  5. Finally, we display the optimized image that originated from random pixels.
(source: author)
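The steps above can be sketched as follows. This is a minimal illustration and not the author's original listing: the tiny randomly initialized network, the filter index, the image size, the learning rate, and the iteration count are all assumptions chosen so the example runs quickly on a CPU. With a pretrained resnet18 and a hooked target layer, the loop itself would have the same shape.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in network (an assumption; the article uses a pretrained resnet18).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the image is optimized, never the weights

target_filter = 5  # hypothetical index; the article picks the most activated filter

# 1. Start from a random noise image whose pixels are trainable.
image = torch.rand(1, 3, 32, 32, requires_grad=True)

# 2. Attach an Adam optimizer to the image pixels.
optimizer = torch.optim.Adam([image], lr=0.05)

with torch.no_grad():
    initial_activation = model(image)[0, target_filter].mean().item()

# 3-4. Repeatedly push the mean activation of the target filter upwards.
for _ in range(30):
    optimizer.zero_grad()
    activation = model(image)[0, target_filter].mean()
    loss = -activation  # minimizing the negation is gradient ascent on the activation
    loss.backward()
    optimizer.step()

with torch.no_grad():
    final_activation = model(image)[0, target_filter].mean().item()

# 5. The optimized pixels are the visualization, clamped to a displayable range
#    and rearranged to height x width x channels for plotting.
visualization = image.detach().clamp(0, 1)[0].permute(1, 2, 0)
```

The resulting `visualization` tensor can be passed to a plotting function such as matplotlib's `imshow`. On a randomly initialized stand-in the picture is not meaningful; recognizable patterns only emerge with trained filters.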

As we can see from the above output with boosted convolutional features,

for our CNN, the most important thing that makes a mug a mug is its handle.


Take a look at the other results I came up with, but don’t worry, I won’t rob you of the joy of analyzing them.

I encourage you to look closely and try to discover various patterns in these psychedelic-looking images.

What’s Next?

Visualizing the highly hierarchical features of CNNs brings us one step closer to understanding them. Moreover, these representations, while being very informative from the research perspective, can be both beautiful and eerie, adding another dimension to how we can perceive neural networks.

I encourage you to play with the project’s hyperparameters and generate such stunning representations on your own!

And don’t forget to 👏 if you enjoyed this article 🙂.
