Pix2pix Utilizing the Deep Learning Reference Stack
This article describes how to perform image-to-image translation using end-to-end system stacks from Intel. In this scenario, we used the Deep Learning Reference Stack, a highly performant and optimized stack for Intel® Xeon® Scalable Processors.
Artists are faced with the challenge of effectively synthesizing photos from label maps, reconstructing objects from edge maps or colorizing images, among other tasks. Pix2pix is a fun, popular cGAN deep learning model that, given an abstract input, can create realistic outputs for use in art, mapping, or colorization. The pix2pix architecture is complex, but utilizing it is easy and an excellent showcase of the abilities of the Deep Learning Reference Stack.
- TensorFlow* v2.0
- Deep Learning Reference Stack (DLRS) v0.5.0
Pix2pix is a conditional Generative Adversarial Network (cGAN) that uses a discriminator model to train a generative model. The discriminator is a traditional image classifier that tells if an image has been “generated” or is real. The generator takes an abstract image and tries to generate a realistic image. For example, the generator can be given a drawing of a cat, and it will try to create a realistic cat. The generator and discriminator are then trained against each other to make each other better. As the discriminator improves its ability to discern real images from fake ones, the generator must improve its ability to generate realistic images.
There are many ways to build a generator, but the goals are the same. A generator needs to take an input, whether that is random noise or an image, and create an output that fools the discriminator. In the case of pix2pix, the generator takes a 256x256 RGB color image and outputs the same thing. The input image is an abstraction of what we want to create, while the output is a realistic image that the discriminator cannot distinguish from a real image.
Real and Abstract images. We leave it to the reader to decipher which is which.
Between the input and output layers are convolutional layers that interpret the input image and generate the output image. Explaining convolutional layers is outside the scope of this paper, though there are many excellent online resources that explain the concept if you are not familiar with it.
The generator model is designed as a U-net, shown below alongside an encoder-decoder architecture. Understanding the encoder-decoder architecture is necessary to understanding a U-net, so the encoder-decoder will be explained first.
The encoder-decoder takes a 256x256x3 image and passes it through a convolutional layer that outputs a 128x128x64 matrix. This process continues until the center layer, which is a 1x1x512 matrix. The first half of the encoder-decoder architecture is the encoder. After the middle, each layer applies a transposed convolutional layer, which functions like a convolutional layer but in reverse, transforming the matrices back into a 256x256x3 shape, which we use as our output image. The section after the middle layer is the decoder.
A U-net is the same, but each layer of the encoder is concatenated with its mirror layer in the decoder, which allows each layer of the decoder to use information from the interpretations of previous layers in the encoder. This allows the generator to make more realistic and accurate images.
Batch normalization and dropout are also applied within the generator, but these concepts and their purpose will be covered below.
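The U-net generator described above can be sketched with the tf.keras functional API. The code below is a minimal illustration of the pattern, not the exact use case code: the layer counts and filter sizes follow the common pix2pix convention, and the function names (`downsample`, `upsample`, `build_unet`) are our own.

```python
import tensorflow as tf
from tensorflow.keras import layers


def downsample(filters):
    # A stride-2 convolution halves the spatial dimensions (256 -> 128 -> ...)
    return tf.keras.Sequential([
        layers.Conv2D(filters, 4, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
    ])


def upsample(filters, dropout=False):
    # A stride-2 transposed convolution doubles the spatial dimensions
    block = tf.keras.Sequential([
        layers.Conv2DTranspose(filters, 4, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(),
    ])
    if dropout:
        block.add(layers.Dropout(0.5))
    block.add(layers.ReLU())
    return block


def build_unet():
    inputs = layers.Input(shape=(256, 256, 3))
    # Encoder: 256x256x3 down to the 1x1x512 bottleneck
    down_stack = [downsample(f) for f in (64, 128, 256, 512, 512, 512, 512, 512)]
    # Decoder: mirror of the encoder; dropout on the first few layers
    up_stack = [upsample(512, dropout=True), upsample(512, dropout=True),
                upsample(512, dropout=True), upsample(512),
                upsample(256), upsample(128), upsample(64)]

    x = inputs
    skips = []
    for down in down_stack:
        x = down(x)
        skips.append(x)
    skips = reversed(skips[:-1])  # the bottleneck has no mirror layer

    for up, skip in zip(up_stack, skips):
        x = up(x)
        # The U-net skip connection: concatenate each decoder layer
        # with its mirror layer in the encoder
        x = layers.Concatenate()([x, skip])

    # Final layer maps back to a 256x256x3 image
    outputs = layers.Conv2DTranspose(3, 4, strides=2, padding="same",
                                     activation="tanh")(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)
```

Removing the `Concatenate` call turns this back into a plain encoder-decoder, which is the only structural difference between the two architectures.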
The discriminator is a traditional image classification convolutional neural network with one wrinkle thrown in. Instead of the convolutions leading down to a single numerical result between 0 and 1, the network outputs a 16x16 matrix with values between 0 and 1. Each value represents a “patch” of the original image and classifies the probability that the patch came from a real image versus a generated image. So what is a “patch” in this scenario?
The concept of a patch is closely tied to the concept of “receptive field” that comes up in convolutional neural networks. A receptive field is the region of input space that determines a unit of a neural net. For each value in the 16x16 output matrix, there is a corresponding receptive field, or patch, on the input image that is determined by the size and number of convolutional layers between the output and the input.
The pix2pix receptive fields are 70x70, and the input image is 256x256. Thus, every result in the 16x16 output matrix classifies whether a patch of the original image is real, rather than having a single value classify the entire image.
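The relationship between a layer stack and its receptive field can be computed directly by working backwards from a single output unit. The layer stack in the sketch below (three stride-2 and two stride-1 convolutions, all with 4x4 kernels) is an assumption based on the standard 70x70 PatchGAN, not necessarily the exact layers of this implementation.

```python
def receptive_field(conv_layers):
    """Compute the receptive field of a stack of (kernel, stride) conv layers.

    Starting from a single output unit (field of 1), each earlier layer
    scales the field by its stride and widens it by (kernel - stride).
    """
    rf = 1
    for kernel, stride in reversed(conv_layers):
        rf = rf * stride + (kernel - stride)
    return rf


# Assumed 70x70 PatchGAN stack: three stride-2 convs, then two stride-1 convs
patchgan = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
print(receptive_field(patchgan))  # -> 70
```

Changing the kernel sizes or adding stride-2 layers grows the patch quickly, which is why patch size is a sensitive hyperparameter (see the training notes below).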
Below is a model plot of the discriminator. Notice that there are two inputs: the discriminator takes an abstract image as one input, and is tasked with deciding whether the other input is an image generated from the abstract or the real image that the abstract was created from.
Combined Training Methodology
Training a cGAN is difficult and convoluted, and requires several steps. The discriminator and the generator cannot be trained at the same time, so they need to be alternated. In the pix2pix model, the discriminator training starts the cycle. The diagram below shows the training model.
The discriminator is run twice, once with a real image and an abstract image, and again with a generated image and an abstract image. (In the diagram, the abstract image is only shown once, due to quirks of how tf.keras accepts input and plots models.) The results are then combined through a concatenate operation, and the model is updated simultaneously to reject generated images and accept real images.
There is an important difference here with how this model handles discriminator training, compared to other implementations, and it improves on common techniques by reducing the number of forward and backward passes and thus improving training time. Most implementations will train separately on real data and generated data. Though they are able to achieve reasonable results, this gives rise to the possibility that a discriminator will simply learn to reject inputs while being fed generated data, and then to accept inputs when being fed real data, without ever learning how to discern between the two. By combining these two steps, we ensure that the discriminator is learning to discern significant differences between real and generated data.
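The combined step can be sketched by building one training batch that mixes real and generated samples, so that a single discriminator update must separate the two. The function below is a NumPy illustration with our own names; the `train_on_batch` call in the comment is the standard tf.keras method one might use with its output.

```python
import numpy as np


def combined_discriminator_batch(abstract_imgs, real_imgs, fake_imgs):
    """Build one batch containing both real and generated images.

    Because real and fake samples share a single update, the discriminator
    cannot get away with blindly accepting or rejecting a whole batch; it
    must learn features that distinguish the two halves.
    """
    images = np.concatenate([real_imgs, fake_imgs], axis=0)
    # Each image is conditioned on the same abstract input
    conditions = np.concatenate([abstract_imgs, abstract_imgs], axis=0)
    # PatchGAN targets: one 16x16 map per image, 1 = real patch, 0 = generated
    labels = np.concatenate([
        np.ones((len(real_imgs), 16, 16, 1)),
        np.zeros((len(fake_imgs), 16, 16, 1)),
    ], axis=0)
    return [conditions, images], labels


# In a training loop one might then call (hypothetical model name):
# discriminator.train_on_batch(*combined_discriminator_batch(a, r, f))
```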
Below is the generator training model. It looks simple, and in many ways it is. An abstract image is fed to the generator, which then generates an image. The discriminator is fed an abstract image and the generated image, and decides if the generated image is real or not. Then the generator is trained to fool the discriminator.
There is an important detail to the generator that is not shown in the diagram below. Unlike a traditional cGAN, the loss function is not decided completely by the output of the discriminator. Pix2pix also compares the generated image with the L1 loss, which measures the absolute difference between result and target parameters. In the case of pix2pix, the L1 loss penalizes the difference between the generated image and the real image pixel by pixel, channel by channel.
L1 loss proved to be a necessity for obtaining realistic images and advancing training. Without L1 loss, the generator learned to create realistic shapes, but would alternate the color of the generated image to fool the discriminator rather than create more realistic images. Training would have the generator cycle unstably between ghoulish colors of purple, orange, green, and perhaps purple again, while the discriminator seemed to forget what it had learned several cycles ago. L1 loss penalizes color variation and introduced stability to the generated images so that progress could be made on creating realistic images.
Similar to improvements made in the discriminator’s training methodology, this version of pix2pix improves on generator training methodology. Most implementations separate the generator training into two parts: adversarial training, when the generator is compared to the discriminator, and L1 loss training, where the generated image is compared to the real image. Separating these steps introduces the possibility of the generator learning how to fool the discriminator and then having that learning reversed by training using the L1 loss objective, such as the case of color changing. This implementation combines these two steps so that the generator is learning to fool the discriminator while still minimizing L1 loss.
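The combined objective can be sketched as a single loss function. The NumPy version below is illustrative only, not the tf.keras implementation; `LAMBDA = 100` is the L1 weighting used in the original pix2pix paper and may differ in this implementation.

```python
import numpy as np

LAMBDA = 100  # weight of the L1 term relative to the adversarial term


def generator_loss(disc_output, generated, real):
    """Combined generator objective: fool the discriminator (adversarial
    term) while staying close to the real image (L1 term), in one update."""
    eps = 1e-7
    # Adversarial term: binary cross-entropy against an all-ones patch map,
    # i.e. the generator wants every patch to be classified as real
    adversarial = -np.mean(np.log(np.clip(disc_output, eps, 1.0)))
    # L1 term: mean absolute difference, pixel by pixel, channel by channel
    l1 = np.mean(np.abs(real - generated))
    return adversarial + LAMBda * l1 if False else adversarial + LAMBDA * l1
```

Minimizing both terms together is what prevents the color-cycling failure mode described above: a color swing that fools the discriminator still pays a large L1 penalty in the same gradient step.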
Various Issues Encountered While Training
The following section is a grab bag of observations and sticking points encountered during the development of pix2pix.
Batch normalization is a technique where the inputs to each layer are normalized. This has been shown in research to improve training, and during development there was improved performance after batch normalization was introduced. However, there is no consensus on why batch normalization improves neural net training performance. It is added to almost every layer and is considered a default option, but still there are competing theories as to why it is useful. Initially, it was thought to reduce covariate shift, but that was proven false. Currently it is theorized that batch normalization smooths the objective function and decouples length and direction, both of which speed up training.
The generator has dropout added to several layers. Dropout is a technique where a percentage of values in a matrix are randomly set to zero. The pix2pix generator applies dropout with a rate of 50% to several layers, which prevents overfitting.
If a discriminator’s loss is near zero, we observe that the generator trains slowly. This is a common issue when training cGANs. To avoid this, the common solution in cGAN training is to set discriminator labels to a random value close to 1 or 0. For example, instead of a real image being labelled as 1, it will be labelled between 0.7 and 1.0, and instead of a generated image being labelled as 0, it will be labelled between 0.0 and 0.3. While this did solve the issue of a vanishing gradient for the discriminator, it was found after further improvements to other parts of the model that noisy labels were not necessary, thus the labels in the final version of pix2pix are binary.
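Noisy (smoothed) labels of this kind are simple to generate. The helper below is a hypothetical sketch of the idea with our own name and ranges matching the example above; recall that the final model abandoned this in favor of binary labels.

```python
import numpy as np

rng = np.random.default_rng()


def noisy_labels(batch_size, real=True):
    """One-sided label smoothing for cGAN discriminator targets.

    Real images get labels drawn from [0.7, 1.0) instead of exactly 1;
    generated images get labels drawn from [0.0, 0.3) instead of exactly 0.
    This keeps the discriminator loss away from zero, which would otherwise
    starve the generator of gradient signal.
    """
    low, high = (0.7, 1.0) if real else (0.0, 0.3)
    return rng.uniform(low, high, size=(batch_size, 1))
```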
The size of the receptive field in the discriminator was a key value in getting successful results from the generator, and results can vary wildly with different sized patches. 70x70 patches seem to have the best performance on 256x256 images.
The most egregious inhibitor to development progress was the time it took to train a model. Pix2pix takes several hours to train. While this is insignificant compared to some modern machine learning models, it proved prohibitive to the trial and error process of coding something new. Because of this issue, much of development was focused on increasing model turnaround time. The following techniques stood out as time savers.
First, separating generator and discriminator training into functions proved to slow training significantly. Functional separation required that models be reloaded or redefined during every training run, which introduced several bugs because the techniques used in this architecture run counter to the assumptions of most machine learning implementations and frameworks. Several solutions to this exist, such as caching the model or memoizing it, but for a slight loss in readability and memory efficiency, defining all training logic in a single function proved to be the better route.
Second, it is common practice to save checkpoints of models during training, but in pix2pix it is also good practice to save generated images during training. Because of the adversarial nature of cGANs, training accuracy is not a reliable indicator of real image generation. Only by seeing the image visually can you assess the quality of the generator, and waiting for the end of training is unnecessary. The training would regularly pause to save a generated image so that the progress of the generator could be assessed periodically. This has the added benefit of creating a series of images that show the incremental change in the generator.
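A minimal training loop with periodic image dumps might look like the following sketch. All of the names here (`generator`, `train_step`, `save_image`) are placeholders rather than names from the use case code; `save_image` stands in for any image writer, such as `tf.keras.preprocessing.image.save_img`.

```python
def train(steps, generator, sample_abstract, train_step, save_image, every=500):
    """Run training, pausing periodically to dump a generated image.

    Because cGAN training accuracy is not a reliable indicator of image
    quality, the saved images are the real progress metric; the series of
    files also documents the generator's incremental improvement.
    """
    for step in range(steps):
        train_step()  # one alternation of discriminator/generator updates
        if step % every == 0:
            save_image(f"progress/step_{step:06d}.png",
                       generator(sample_abstract))
```

Using the same `sample_abstract` input every time makes the saved series directly comparable from one checkpoint to the next.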
Third, it is unnecessary to train through the entire dataset before switching from training the generator to the discriminator, and vice versa. It was found that both the generator and discriminator achieve greater than 90% accuracy after training on 40 samples of data, so to continue training (in our case) for another 360 samples is a waste.
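Switching models every few samples rather than every full pass can be expressed as a simple schedule. The sketch below is our own illustration; the `chunk=40` default reflects the observation above about accuracy after 40 samples.

```python
def alternating_schedule(dataset, chunk=40):
    """Yield (samples, phase) pairs, alternating which model trains.

    Instead of exhausting the dataset before switching models, hand off
    after every `chunk` samples, since both models reach high accuracy
    well before a full pass completes.
    """
    phases = ("discriminator", "generator")
    for i in range(0, len(dataset), chunk):
        yield dataset[i:i + chunk], phases[(i // chunk) % 2]
```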
Fourth, initially L1 loss and adversarial loss were trained separately. This created three rounds of training in a cycle comprising discriminator training, adversarial training, and L1 training. All three trained at approximately the same speed. When the L1 and adversarial training were combined, we found they trained at the same rate as a single training round had before. The most expensive portion of training was the forward and backward pass through the neural net, whereas the loss calculation proved to be a small portion of the time to train. Thus training speed was improved by roughly one third.
Use Case Architecture
The following diagram shows the design of the use case.
The Deep Learning Reference Stack trains and serves the pix2pix model. The Web Browser Interface is created with node and will call DLRS to generate images.
Running Pix2pix with the Deep Learning Reference Stack
This section contains a step-by-step tutorial to run pix2pix. This tutorial assumes you have an Intel® Advanced Vector Extensions 512 (Intel® AVX-512) capable machine and some familiarity with Docker*. These steps are also possible on any cloud service provider with a few modifications, but those details are beyond the scope of this paper.
- An Intel® AVX-512 capable machine
- Familiarity with Docker
- Familiarity with git
- Use Docker to pull the DLRS image to your local machine:
docker pull clearlinux/stacks-dlrs-mkl
- Run the DLRS container:
docker run -it clearlinux/stacks-dlrs-mkl
- Clone the pix2pix use case into the container:
git clone https://github.com/intel/stacks-usecase
- Train the model:
- Run inference with your trained model:
python infer.py <path to your image>
That’s it! You have a working version of pix2pix running out of the Deep Learning Reference Stack. For more advanced examples and detailed usage instructions, please visit https://github.com/intel/stacks-usecase.
The pix2pix model is complex, but using it is easy, and using DLRS is even easier. This example had a pretrained model available and imported custom code, but you can use your own code and import your own models. There are no changes to the frontend of the TensorFlow* and PyTorch* frameworks, so any existing machine learning code will function as intended in DLRS and take advantage of the full power of Intel hardware, such as the Intel® Xeon® Scalable Processors.
So go try out the stacks today! The code is publicly available at https://github.com/intel/stacks-usecase.