How It Works

It does need a few words, so please bear with me.

Neural networks are typically structured in layers. Each of these layers processes the results from the previous layer and produces a result that is fed into the next layer. The first layer processes the input data (in our case, an image), and the last layer produces the final output (in our case, a stylized image). The intermediate layers produce internal representations of the data.
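To make that concrete, here is a minimal sketch of such a stack of layers. I'm using PyTorch purely as an illustration (the post doesn't depend on any particular framework), and the layer sizes are made up.

```python
import torch
import torch.nn as nn

# Each layer consumes the previous layer's output and feeds the next one.
layers = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # first layer: reads the input image
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # intermediate layer: internal representation
    nn.ReLU(),
    nn.Conv2d(32, 3, kernel_size=3, padding=1),   # last layer: produces the output image
)

image = torch.rand(1, 3, 224, 224)  # a dummy RGB image (batch, channels, height, width)
output = layers(image)              # the data flows through the layers in order
print(output.shape)                 # torch.Size([1, 3, 224, 224])
```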

In computer vision, most layers are "convolutional": they apply filters to their input, much like the filters in image-processing software. The difference is that we don't tell the network which filters to use; it learns them on its own during training.
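For example, here is a hand-written edge-detection filter applied with a convolution, which is the same operation a convolutional layer performs. The only difference is that a layer's filter values would be learned rather than written by hand. (Again a sketch; the kernel and the image are just placeholders.)

```python
import torch
import torch.nn.functional as F

# A hand-made 3x3 edge-detection filter, similar to what image editors use.
# A convolutional layer holds many such kernels, but their values are learned.
kernel = torch.tensor([[-1., -1., -1.],
                       [-1.,  8., -1.],
                       [-1., -1., -1.]]).reshape(1, 1, 3, 3)

gray_image = torch.rand(1, 1, 64, 64)            # dummy single-channel image
edges = F.conv2d(gray_image, kernel, padding=1)  # slide the filter over the image
```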

The training process works by feeding the network a (large) set of example images. The network produces a result for each image, and we compare its output to the desired output. We then adjust the network's parameters (its filters) in the direction that reduces the difference between the two. We repeat this process many times, and the results get better with each step.
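In code, that feedback loop looks roughly like this. It's a sketch with a toy model and made-up data, but the shape is the same: predict, measure the difference, nudge the parameters, repeat.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 3, 3, padding=1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-in dataset: pairs of (input image, desired output image).
dataset = [(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)) for _ in range(10)]

for image, target in dataset:
    prediction = model(image)           # the network produces a result
    loss = loss_fn(prediction, target)  # how far is it from the desired output?
    optimizer.zero_grad()
    loss.backward()                     # find the direction that reduces the difference
    optimizer.step()                    # adjust the filters a little in that direction
```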

After training, it turns out that the first layers of the network have learned to detect low-level features, like lines and edges, while the last layers have learned to detect high-level features, like faces. When we build a network to detect what's in an image (called "image classification"), the very last layers (the "head") weigh the presence or absence of the detected high-level features and classify the image based on that. For example, if it has eyes, ears, a long wet nose, a long tongue and sharp teeth, but no wheels, it must be a dog. But if it has wheels, doors and windows, it must be a car. This is an oversimplification, of course: such networks can even tell apart different dog breeds and different vehicle types.
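As a sketch, the split between the feature-extracting layers and the head can look like this. The layer sizes and the two classes are made up for illustration.

```python
import torch
import torch.nn as nn

# Body: convolutional layers that extract features (low-level first, high-level last).
body = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
)
# Head: weighs which high-level features are present and turns that into class scores.
head = nn.Linear(64, 2)  # two hypothetical classes: dog, car

image = torch.rand(1, 3, 224, 224)
features = body(image).flatten(1)  # presence/absence of 64 learned features
scores = head(features)            # e.g. [dog_score, car_score]
```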

It also turns out that the middle layers of image classification networks can detect textures, or styles: shiny metal, wood, brown fur; a pattern of stripes, dots, or waves. So we can take an original image and stylize it to look like another image. The final image keeps the subject and high-level structure of the original, but takes on the textures of the style image.

Since the style is represented by the output of the network's middle layers, it is just a set of numbers. The same is true for the content, which is represented by the numbers in the output of the last layers. The task, then, is to come up with an image such that, when it passes through the network, the middle layers produce an output close to their output for the style image, and the last layers produce an output close to their output for the original image. We start with the original image and pass it through a pre-trained network multiple times. Each time, we measure the differences between:

- the output of the middle layers for our current image and their output for the style image (the style difference), and
- the output of the last layers for our current image and their output for the original image (the content difference).

Then we adjust the image in the direction that would reduce these differences. Rinse and repeat until both differences are small enough, et voilĂ !
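Here is roughly what that optimization looks like as a minimal sketch. The choice of network (VGG-19 from torchvision), the particular layers standing in for "style" and "content", and the loss weights are all my assumptions for illustration, not something this post prescribes.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# A pre-trained image classification network, used only as a feature extractor.
features = vgg19(weights="IMAGENET1K_V1").features.eval()
for p in features.parameters():
    p.requires_grad_(False)

STYLE_LAYERS = {1, 6, 11, 20}   # outputs of some middle layers (assumed choice)
CONTENT_LAYERS = {29}           # output of a deeper layer (assumed choice)

def extract(image):
    """Run the image through the network, collecting the chosen layer outputs."""
    style_out, content_out, x = [], [], image
    for i, layer in enumerate(features):
        x = layer(x)
        if i in STYLE_LAYERS:
            style_out.append(x)
        if i in CONTENT_LAYERS:
            content_out.append(x)
    return style_out, content_out

def gram(x):
    """Summarize a layer's output as texture statistics (a Gram matrix, batch of 1)."""
    b, c, h, w = x.shape
    f = x.reshape(c, h * w)
    return f @ f.t() / (c * h * w)

content_image = torch.rand(1, 3, 256, 256)  # placeholders for the real images
style_image = torch.rand(1, 3, 256, 256)

style_targets = [gram(s) for s in extract(style_image)[0]]
content_targets = [c.detach() for c in extract(content_image)[1]]

# Start from the original image and repeatedly nudge it.
image = content_image.clone().requires_grad_(True)
optimizer = torch.optim.Adam([image], lr=0.02)

for step in range(300):
    style_out, content_out = extract(image)
    style_diff = sum(F.mse_loss(gram(s), t) for s, t in zip(style_out, style_targets))
    content_diff = sum(F.mse_loss(c, t) for c, t in zip(content_out, content_targets))
    loss = 1e4 * style_diff + content_diff  # the relative weight is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # adjust the image itself, not the network
```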

Pretty cool, right? But there's a catch. The process is computationally intensive and slow. It can take minutes to stylize a single high-resolution image, even when we already have a pre-trained image classification network. For a desktop application, this may not be a problem, but who would wait minutes for a web page to load?

So to speed this up, we use another neural network, trained specifically for this task. Before training, we use the above (slow) method to produce a (large) set of stylized images. Then, during training, we feed the network the original images and gradually adjust it to produce results close enough to the stylized images. After training, each time we need to stylize a new image, we feed it to the network only once, and the result is a stylized image, hopefully close to what the previous method would have produced. This is much faster because it happens in one go and does not require multiple iterations. It also requires less memory. It still needs a few seconds, but that's much better than minutes. With good hardware and/or modest resolutions, it can even work in real time on video.
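A sketch of that idea: a tiny stand-in for the dedicated stylization network, trained on pairs produced beforehand with the slow method, and then applied to a new image in a single pass. Real networks for this are deeper (often with residual blocks) and trained with more care; everything here is illustrative.

```python
import torch
import torch.nn as nn

# A small stand-in for the dedicated stylization network described above.
transform_net = nn.Sequential(
    nn.Conv2d(3, 32, 9, padding=4), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 9, padding=4), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(transform_net.parameters(), lr=1e-3)

# Hypothetical training pairs: originals and their stylized versions,
# produced beforehand with the slow, iterative method.
pairs = [(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)) for _ in range(8)]

for original, stylized in pairs:  # training: adjust the network, not the image
    loss = nn.functional.mse_loss(transform_net(original), stylized)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

new_image = torch.rand(1, 3, 256, 256)
with torch.no_grad():
    result = transform_net(new_image)  # inference: a single pass, no iterations
```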

Links:

Cheers!