IMAGE SHADOW REMOVAL BASED ON GENERATIVE ADVERSARIAL NETWORKS

Accurate detection and removal of shadows in images are complicated tasks, as it is difficult to tell whether darkening or gray color is caused by a shadow. This paper proposes an image shadow removal method based on generative adversarial networks. Our approach is trained in an unsupervised fashion, which means it does not depend on time-consuming data collection and data labeling. This, together with training in a single end-to-end framework, significantly raises its practical relevance. Taking an existing method for unsupervised image transfer between different domains, we have researched its applicability to the shadow removal problem. Two networks have been used: the first adds shadows to images and the second removes them. The ISTD dataset has been used for evaluation clarity because it has ground-truth shadow-free images as well as shadow masks. For shadow removal we have used the root mean squared error between generated and real shadow-free images in LAB color space. Evaluation is divided into region and global, where the former is applied to shadow regions and the latter to whole images. Shadow detection is evaluated with Intersection over Union, also known as the Jaccard index. It is computed between the generated and ground-truth binary shadow masks by dividing the area of overlap by the area of their union. We selected 100 random images for validation purposes. During the experiments multiple hypotheses have been tested. The majority of tests we conducted concerned how to use an attention module and where to localize it. Our network produces better results compared to the existing approach in the field. Analysis showed that attention maps obtained from the auxiliary classifier encourage the networks to concentrate on the more distinctive regions between domains. However, generative adversarial networks demand a more accurate and consistent architecture to solve the problem more efficiently.


Introduction
Shadow is a common visual phenomenon that occurs when an object occludes an illumination source. Detected shadows can provide important clues for better visual scene understanding [1; 2]. However, they can degrade the performance of algorithms in several computer vision areas such as object detection [17], tracking [11] and intrinsic image decomposition [15]. Therefore, effective shadow removal could give a performance boost to these tasks.
Shadow removal is a very challenging task because it is not enough to detect and remove the shadow; we also need to fill the background so that it looks natural to both a human and a computer system.
Current works can be divided into two groups: classical and deep learning-based. Classical solutions used user input or hand-crafted features for shadow detection [7; 3], after which they tried to make the shadow region match the background. Meanwhile, deep learning-based approaches use neural networks for high-level feature extraction and background filling. One of the early works [21] used three different neural networks operating in different contexts for higher-quality feature extraction. Later, Hu et al. [9] explored direction-aware spatial context for this task to compensate for the lack of data. After that, Wang et al. [24] used adversarial learning by stacking two networks together, one for shadow detection and one for shadow removal. More recently, Ding et al. [5] constructed a framework that used attention maps and recurrent learning. Another approach [25] argued that earlier works were not directly constructed for the shadow removal task and proposed a novel architecture with hierarchical feature aggregation.
However, all these methods are trained on supervised data and thus demand tedious collection and annotation processes. For a similar reason, these approaches are also constrained by the complexity of the scenes. It is also argued that such an approach may lead to a change in illumination between shadow and shadow-free images [27].
Recently, an unsupervised solution was presented [9] where the problem was formulated as unsupervised mapping learning between two domains (shadow and shadow-free) using the CycleGAN [27] framework. The first network detects and removes the shadow while the other tries to generate it given an image and a shadow mask as the input. The shadow mask, in turn, is obtained by running Otsu's algorithm on the difference between the generated and input images. The final results are competitive with those outlined above and significantly better than general image-to-image translation with CycleGAN. However, this approach is not directly constructed to fit the shadow removal task, so it has problems such as leaving artifacts on the shadow boundaries and using binary masks for detected shadows. One more problem, with consequences for real-world applications, is that the shadow generator network chooses the mask at random, so we could receive an image with an inappropriate shadow on it.

Generative adversarial networks (GAN)
GAN is a framework for estimating generative models that was first proposed by Ian J. Goodfellow et al. [6]. It consists of two networks: a generator network G captures the data distribution, and a discriminator network D estimates the probability that a sample came from the data distribution. In the initial variant, G took a noise vector (usually from a uniform distribution) as the input, trying to generate a sample that will "trick" the discriminator D. This framework corresponds to a minimax (two-player non-cooperative) game where G is maximizing the probability of D making a mistake. Formally, we can consider such a minimax game in which the following objective is optimized:

min_G max_D V(D, G) = E_{x∼p_r}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))], (1)

where p_r is the real data distribution over samples x and p_z is a noise distribution over noise z. This work provoked a significant amount of research due to its capacity to generate high-quality samples. It was further improved by the introduction of the Conditional GAN [18], which uses label information to present a multi-modal solution. Thus, a researcher can control what kind of sample the network should generate.
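As an illustration, objective (1) can be estimated by Monte-Carlo sampling. The toy one-dimensional generator and discriminator below are hypothetical stand-ins chosen only to make the expectation computable, not networks from this work:

```python
import numpy as np

rng = np.random.default_rng(0)

def G(z):
    # Hypothetical 1-D "generator": a fixed affine map of the noise.
    return 2.0 * z + 0.5

def D(x):
    # Hypothetical "discriminator": probability that x is real,
    # here a fixed logistic function of |x|.
    return 1.0 / (1.0 + np.exp(np.abs(x) - 1.0))

# Monte-Carlo estimate of the value function in (1):
# V(D, G) = E_{x~p_r}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]
x = rng.standard_normal(100_000)      # samples from p_r = N(0, 1)
z = rng.uniform(-1.0, 1.0, 100_000)   # samples from p_z = U(-1, 1)
V = np.mean(np.log(D(x))) + np.mean(np.log(1.0 - D(G(z))))
print(V)  # both terms are logs of values in (0, 1), so V < 0
```

In actual training, D ascends this objective while G descends it, alternating gradient updates.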
CGANs were successfully used for learning mappings between domains; for example, the pix2pix [10] approach can handle multiple vision tasks such as day-to-night, summer-to-winter or aerial-to-map translation in a single framework, given pairs of images from the two domains. However, it cannot handle domains without one-to-one mappings, as in style transfer.
Hence, for unsupervised domain mapping, CycleGAN [27] was proposed, introducing a cycle-consistency loss between domains. Still, it suffered from mode collapse, generating the same sample for different inputs. For that reason, research was conducted [1; 14] to extend the initial solution to cope with "many-to-many" mappings with the use of latent variables.
There has been a significant amount of work done and now GANs can generate high-quality images that are hardly distinguishable from the real ones. They are particularly good at face generation [12], style transfer [24; 3], inpainting [19; 20], domain transfer/adaptation [27; 1; 14] and are also used for shadow removal/detection [24; 5].

Method
We divide training into two parts: one to learn from shadow images and one to learn from shadow-free ones.

Adversarial learning
Let I_s be an image from the shadow domain X_s. We use a generator network G_{s→f} to translate the image to the shadow-free domain X_f and obtain I_f. The G_{s→f} network also includes an auxiliary classifier η_s, where η_s(x) represents the probability that x is taken from the shadow domain [13].
Then, we use the corresponding discriminator D_f to discriminate whether the data comes from X_f or from G_{s→f}(X_s). This network also contains an auxiliary classifier η_{D_f} that is aimed at the same task as the discriminator itself. We should also note that Least Squares GAN objectives are used for more stable training, so the adversarial loss takes the form

L_adv^{s→f} = E_{x∼X_f}[(D_f(x) − 1)^2] + E_{x∼X_s}[(D_f(G_{s→f}(x)))^2].
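A minimal sketch of the least-squares objectives mentioned above, with hypothetical discriminator outputs in place of real network activations:

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    # Discriminator objective: push D(real) toward 1 and D(fake) toward 0.
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    # Generator objective: push D(fake) toward 1. Penalizing the squared
    # distance instead of the log-probability gives smoother gradients
    # and more stable training.
    return np.mean((d_fake - 1.0) ** 2)

# Hypothetical discriminator outputs for a small batch.
d_real = np.array([0.9, 0.8, 0.95])
d_fake = np.array([0.1, 0.3, 0.2])
print(lsgan_d_loss(d_real, d_fake))  # small: D already separates well
print(lsgan_g_loss(d_fake))          # large: G is easily detected
```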

CAM attention
The auxiliary classifier η_s is used in G_{s→f} to distinguish between domains and is inspired by CAM (class activation maps) [24].
Let C_s^k be the k-th feature map of the output of the l-th layer G_{s→f}^l. Then C_{s,ij}^k is the value at position (i, j), and we want to learn an importance weight for each feature map by using global pooling layers (i.e. average, max). Thus, we obtain:

η_s(x) = σ( Σ_{k=1}^n w_s^k Σ_{i,j} C_{s,ij}^k(x) ),

where n is the number of feature maps and w_s^k is the importance weight for the k-th feature map. We learn those weights by making η_s distinguish between the domains, optimizing the corresponding cross-entropy loss:

L_cam^s = −( E_{x∼X_s}[log η_s(x)] + E_{x∼X_f}[log(1 − η_s(x))] ). (4)
Then, the re-weighted feature maps a_s^k = w_s^k C_s^k are transferred as the input to the following layer of the network, and learning continues.
The attention features a_s are aggregated into an attention map A_s to be transferred to G_{f→s}:

A_s = Σ_c a_s^c,

where we sum the values over the channels c.
An auxiliary classifier η_{D_f} is also integrated into D_f to decide whether the data comes from X_f or from G_{s→f}(X_s); it is trained with the same least-squares objective as the discriminator:

L_cam^{D_f} = E_{x∼X_f}[(η_{D_f}(x) − 1)^2] + E_{x∼X_s}[(η_{D_f}(G_{s→f}(x)))^2].
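The auxiliary classifier described above reduces to a pooled, weighted sum of feature maps passed through a sigmoid. A small sketch, assuming global average pooling (a mean over positions, which differs from a plain sum only by a constant factor) and randomly drawn feature maps and weights in place of learned ones:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def aux_classifier(feats, w):
    # feats: (n, H, W) feature maps from an encoder layer,
    # w: (n,) learned importance weights, one per feature map.
    # Global-average-pool each map, then take the weighted sum --
    # a linear classifier over pooled features.
    pooled = feats.mean(axis=(1, 2))    # (n,)
    return sigmoid(np.dot(w, pooled))   # P(x comes from the shadow domain)

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16, 16))  # hypothetical feature maps
w = rng.standard_normal(8)
p = aux_classifier(feats, w)
print(p)  # a probability strictly between 0 and 1
```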

Cycle consistency and identity loss
If we only use adversarial loss for learning then the mapping is highly under-constrained. That is why we present the inverse transformation G f→ s to transform the images back and encourage the contents to be the same.
As we outlined above, the generator network G_{f→s} additionally takes the attention map A_s and the generated mask M_s [9] as the input (concatenated with the image as additional channels). To preserve consistency between the generated shadow image and the original one, we take the same attention map and shadow mask extracted from the shadow removal generator G_{s→f}. This allows producing multiple shadow images from one shadow-free image, raising the generalization capacity. The shadow mask is a binary map where −1 indicates the non-shadow region and 1 the shadow region. The attention map is also normalized to [−1, 1] and is used to complement the shadow mask.
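The input construction just described amounts to a channel-wise concatenation; a sketch with hypothetical arrays in place of real images, attention maps and masks:

```python
import numpy as np

# Hypothetical inputs to the shadow generator: an RGB image, a
# single-channel attention map and a binary shadow mask, all scaled
# to [-1, 1] and sharing the same spatial size.
h, w = 64, 64
image = np.random.uniform(-1.0, 1.0, (3, h, w))
attention = np.random.uniform(-1.0, 1.0, (1, h, w))
mask = np.where(np.random.rand(1, h, w) > 0.5, 1.0, -1.0)

# Concatenate along the channel axis to form a 5-channel input tensor.
g_input = np.concatenate([image, attention, mask], axis=0)
print(g_input.shape)  # (5, 64, 64)
```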
Then, we formulate the following cycle-consistency loss:

L_cycle^s = E_{x∼X_s}[ ||G_{f→s}(G_{s→f}(x), A_s, M_s) − x||_1 ].
However, using only adversarial and cycle-consistency losses gives the generators freedom to change colors in images without being penalized. That is why the researchers in the original work [12] introduced an identity loss to regularize the generators to be near an identity mapping when inputs from the target domain are provided. Furthermore, this approach allows our solution to remove/generate shadows only when an image from the proper domain is given:

L_identity^s = E_{x∼X_s}[ ||G_{f→s}(x, A_n, M_n) − x||_1 ],

where A_n, M_n are constructed only from −1 (non-shadow), which penalizes the network for generating shadows on images where a shadow is already present.

Learning from non-shadow images
Given the generator network G_{f→s} together with the attention map A_s and the shadow mask M_s, we can define the corresponding losses for the inverse transformation. We have the same adversarial loss, where the generator maximizes the probability of the discriminator making a mistake:

L_adv^{f→s} = E_{x∼X_s}[(D_s(x) − 1)^2] + E_{x∼X_f}[(D_s(G_{f→s}(x, A_s, M_s)))^2].
However, we do not integrate the CAM module into the inverse transformation networks due to stability issues and because this approach gives better results in experiments.
The cycle-consistency constraint also stays the same: we generate a shadow image from a shadow-free image x ∈ X_f and then use G_{s→f} to restore the image back, optimizing:

L_cycle^f = E_{x∼X_f}[ ||G_{s→f}(G_{f→s}(x, A_s, M_s)) − x||_1 ].
Finally, we adapt G_{s→f} to produce a shadow-free image given a real shadow-free image from X_f. That means we encourage the network not to remove anything if there is nothing to remove:

L_identity^f = E_{x∼X_f}[ ||G_{s→f}(x) − x||_1 ].

Shadow mask generation
Our generator G_{f→s} uses the shadow mask as the input, so we can condition the network on it and generate multiple shadows from one shadow-free image. We follow the same approach as [9] and construct a threshold binarizer B between the generated shadow-free image I_f and the original image I_s:

M_s = B(I_f, I_s). (13)

Thus, when we obtain a pair of images, we compute the difference I_f − I_s and compute a threshold, assigning values greater than it as 1 and those less as −1. The threshold is computed using Otsu's algorithm, which separates the shadow from non-shadow regions by maximizing the between-class variance.
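A sketch of the binarizer B, with a from-scratch Otsu threshold and a synthetic grayscale pair standing in for real network outputs:

```python
import numpy as np

def otsu_threshold(values, bins=256):
    # Otsu's method: choose the threshold that maximizes the
    # between-class variance of the two resulting groups.
    hist, edges = np.histogram(values, bins=bins)
    p = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(p)            # class-0 probability at each cut
    mu = np.cumsum(p * centers)  # cumulative mean
    mu_t = mu[-1]
    w1 = 1.0 - w0
    valid = (w0 > 0) & (w1 > 0)
    between = np.zeros_like(w0)
    between[valid] = (mu_t * w0[valid] - mu[valid]) ** 2 / (w0[valid] * w1[valid])
    return centers[np.argmax(between)]

def shadow_mask(shadow_free, shadow):
    # B(I_f, I_s): threshold the per-pixel difference between the
    # generated shadow-free image and the input shadow image,
    # labelling shadow pixels 1 and non-shadow pixels -1.
    diff = shadow_free.astype(float) - shadow.astype(float)
    t = otsu_threshold(diff.ravel())
    return np.where(diff > t, 1.0, -1.0)

# Hypothetical grayscale pair: the "shadow" image is darker in one corner.
sf = np.full((64, 64), 0.8)
s = sf.copy()
s[:32, :32] -= 0.5               # 32x32 shadowed region
m = shadow_mask(sf, s)
print(int((m == 1).sum()))  # 1024 pixels detected as shadow
```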

Attention map generation
The attention map A_s is obtained from the auxiliary classifier η_s by applying pooling operations to the feature maps. In our approach we use global average pooling (GAP) together with global max pooling (GMP) layers to get a complete picture. GAP is able to find all discriminative regions in the image, while GMP is encouraged to find only one [26]. So we decided to combine the best of both worlds by applying both GAP and GMP and concatenating the corresponding results.
So, additionally, we have an analogous classifier η_m whose importance weights are learned over max-pooled feature maps. After that, we concatenate the outputs of η_s and η_m and feed them to a 1×1 convolutional layer with a following ReLU non-linearity to restore the input dimensions after the channel-axis concatenation.
The output of this operation is followed by aggregation to obtain an attention map A_s, which we additionally scale to the [−1, 1] range.
During the training process those maps are generated for each shadow image in the same way as binary shadow masks.
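The GAP/GMP attention construction can be sketched as follows. The random feature maps and importance weights are stand-ins for learned ones, and the 1×1 convolution + ReLU step that restores the channel count is only noted in a comment:

```python
import numpy as np

def attention_map(feats, w_gap, w_gmp):
    # Re-weight the feature maps with the importance weights learned by
    # the average-pooling and max-pooling auxiliary classifiers, then
    # concatenate along the channel axis. (In the full model a 1x1
    # convolution + ReLU would restore the channel count here.)
    a = np.concatenate([feats * w_gap[:, None, None],
                        feats * w_gmp[:, None, None]], axis=0)
    # Aggregate over channels and rescale to [-1, 1].
    amap = a.sum(axis=0)
    amap = 2.0 * (amap - amap.min()) / (amap.max() - amap.min()) - 1.0
    return amap

rng = np.random.default_rng(1)
feats = rng.standard_normal((8, 32, 32))
amap = attention_map(feats, rng.standard_normal(8), rng.standard_normal(8))
print(amap.shape)  # (32, 32)
```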

Losses
To conclude, we present the final loss function, which is a weighted sum of the adversarial, CAM, cycle-consistency and identity losses outlined above for both directions:

L = λ_adv(L_adv^{s→f} + L_adv^{f→s}) + λ_cam(L_cam^s + L_cam^{D_f}) + λ_cycle(L_cycle^s + L_cycle^f) + λ_identity(L_identity^s + L_identity^f),

where the λ coefficients balance the contributions of the individual terms.

Shadow removal generator network
The network architecture follows Johnson et al. [11] and resembles an encoder-decoder architecture without skip-connections. The encoder is constructed from two down-sampling convolutional layers; importantly, there are no pooling layers, and down-sampling is implemented using convolutions with stride 2.
Then we have a bottleneck, where most of the work takes place. It includes nine residual blocks with linear dilation growth starting from the sixth block. Dilation factors should be tuned depending on the receptive field size. In our experiments, we made the assumption that the first part of the network would extract the shadow region operating on the local level, while the second part would be responsible for filling this region and thus would need background information. For that reason, we added receptive field growth at the end, but in a way that does not exceed the input image size.
The bottleneck also integrates the attention CAM module described above. We inserted it before the receptive field growth (i.e. before the shadow removal process takes place) so it would help to localize the shadow more efficiently.
Finally, the decoder restores the image back to its initial size with transposed convolutions; importantly, the network learns the down-sampling and up-sampling operations itself.
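A simplified PyTorch sketch of such a generator. The layer widths, the exact dilation schedule and the omission of the CAM module are our assumptions for illustration, not the precise configuration of the trained model:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    # Residual block with optional dilation to grow the receptive field.
    def __init__(self, ch, dilation=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReflectionPad2d(dilation),
            nn.Conv2d(ch, ch, 3, dilation=dilation),
            nn.InstanceNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.ReflectionPad2d(dilation),
            nn.Conv2d(ch, ch, 3, dilation=dilation),
            nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    # Johnson-style encoder/bottleneck/decoder: stride-2 convolutions
    # downsample (no pooling), nine residual blocks with linearly growing
    # dilation from the sixth block, and transposed convolutions restore
    # the input resolution.
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.ReflectionPad2d(3), nn.Conv2d(in_ch, base, 7),
            nn.InstanceNorm2d(base), nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1),
            nn.InstanceNorm2d(base * 2), nn.ReLU(inplace=True),
            nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1),
            nn.InstanceNorm2d(base * 4), nn.ReLU(inplace=True),
        )
        dilations = [1, 1, 1, 1, 1, 2, 3, 4, 5]  # growth from block six
        self.bottleneck = nn.Sequential(
            *[ResBlock(base * 4, d) for d in dilations])
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(base * 4, base * 2, 3, stride=2,
                               padding=1, output_padding=1),
            nn.InstanceNorm2d(base * 2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base * 2, base, 3, stride=2,
                               padding=1, output_padding=1),
            nn.InstanceNorm2d(base), nn.ReLU(inplace=True),
            nn.ReflectionPad2d(3), nn.Conv2d(base, in_ch, 7), nn.Tanh(),
        )

    def forward(self, x):
        return self.decoder(self.bottleneck(self.encoder(x)))

x = torch.randn(1, 3, 64, 64)
with torch.no_grad():
    y = Generator(base=16)(x)
print(y.shape)  # torch.Size([1, 3, 64, 64])
```

The final Tanh keeps outputs in [−1, 1], matching the input scaling described in the training strategy below.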

Shadow free discriminator
Recall that the discriminator network D_f is used for discriminating real shadow-free images from generated ones. Its architecture follows the idea of PatchGAN [16], where the network does not look at the whole image but at patches of it (usually 70×70), deciding whether each patch is real or not. We additionally complemented it with a CAM attention module, which is trained on the same task as the discriminator itself. The CAM attention operates before the final layer. The discriminators do not use dilated convolutions.

Shadow generator network
This network also follows the Johnson et al. [11] architecture and has dilated convolutions in it, which might help in shadow generation; however, we have not seen any difference in experiments. We did not use an attention module here because it led to unstable training.
The generator takes a shadow-free image together with an attention map and a shadow mask (binary map), all three concatenated along the channel axis. The attention map and mask are scaled to [−1, 1] to match the image range.
The discriminator D_s is also a PatchGAN with 70×70 patches, with no attention module in it.

Training strategy
We used the Rectified Linear Unit (ReLU) non-linearity and reflection padding for the generator networks.
Instance normalization (IN) is applied in all networks just after each convolutional layer. The exceptions are the input and output layers, where we want to encourage the networks to learn the normalization by themselves. We add a hyperbolic tangent (Tanh) function to the generator network to output values in the [−1, 1] range. All networks use spectral normalization [19], as this has been shown to improve GAN learning. It constrains the Lipschitz constant by restricting the spectral norm of each layer.
We also used different heuristics to stabilize training: smoothing the target label from 1 to 0.9 and dividing the discriminator loss by 2 to further address slow learning. Moreover, we update the discriminator with a history of previously generated samples, as this is reported to reduce model oscillations [27]. The same approach is used for the shadow mask and attention map pairs, which are added to a separate queue.
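The history of generated samples can be sketched as a small replay pool; the class name, capacity and replacement probability below are our illustrative choices, not values from the original implementation:

```python
import random

class ImagePool:
    # Replay buffer of previously generated samples: with probability 0.5
    # the discriminator is shown an older fake instead of the newest one,
    # which reduces model oscillation [27]. The same structure can hold
    # (shadow mask, attention map) pairs in a separate queue.
    def __init__(self, capacity=50, seed=0):
        self.capacity = capacity
        self.items = []
        self.rng = random.Random(seed)

    def query(self, item):
        if len(self.items) < self.capacity:
            self.items.append(item)   # fill the pool first
            return item
        if self.rng.random() < 0.5:
            # Swap a random stored sample for the new one and return
            # the old sample to the discriminator.
            idx = self.rng.randrange(self.capacity)
            old, self.items[idx] = self.items[idx], item
            return old
        return item

pool = ImagePool(capacity=2)
outs = [pool.query(i) for i in range(10)]
print(len(pool.items))  # 2
```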
All parameters are initialized using a zero-centered Gaussian distribution with a 0.02 standard deviation. For data augmentation, we resize the images to 286×286, crop them randomly to 256×256 and flip them horizontally with 0.5 probability. The method is implemented using the PyTorch framework.

Evaluation
Dataset. There are many community datasets for shadow detection and removal. Mostly they are supervised and not large enough for deep learning solutions, but there are some which fit well: ISTD [24], SRD [14]. There is also an unpaired one, USR [13], where more complex scenes are presented.
In our experiments we only use the ISTD dataset, for evaluation clarity, because it has ground-truth shadow-free images as well as shadow masks. We do not correct our evaluation method to cover the issues with illumination and color change between shadow and shadow-free images. In the future, we will extend our solution to the USR dataset, as it is more appropriate for our unsupervised setting.
Metrics. In our evaluations we aim to estimate how good our network is at removing shadows as well as detecting them. We are also concerned about global image consistency. That is why we present three metrics.
For shadow removal we follow recent works [14; 9; 4] and use the RMSE (root mean squared error) between generated and real shadow-free images in LAB color space. Evaluation is divided into region and global, where the former is applied to shadow regions and the latter to whole images. In general, a lower RMSE score indicates better results.
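A sketch of the region/global RMSE computation, assuming the images have already been converted to LAB (e.g. with skimage.color.rgb2lab); the function name and the toy arrays are ours:

```python
import numpy as np

def rmse(pred_lab, gt_lab, mask=None):
    # Root mean squared error between two images in LAB space.
    # With mask=None this is the "global" score over the whole image;
    # passing a boolean shadow mask restricts it to the "region" score.
    diff = (pred_lab.astype(float) - gt_lab.astype(float)) ** 2
    if mask is not None:
        diff = diff[mask]  # keep only shadow pixels (all 3 channels)
    return float(np.sqrt(diff.mean()))

gt = np.zeros((4, 4, 3))
pred = gt.copy()
pred[:2, :2] += 3.0                  # error confined to one corner
shadow = np.zeros((4, 4), dtype=bool)
shadow[:2, :2] = True
print(rmse(pred, gt))          # global: 1.5
print(rmse(pred, gt, shadow))  # region: 3.0
```

The region score is larger here because averaging over the whole image dilutes the error that is concentrated in the shadow area.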
Shadow detection is evaluated with IOU (Intersection over Union), also known as the Jaccard index, which is a widely used metric for image segmentation and object detection tasks. It is computed between the generated and ground-truth binary shadow masks by dividing the area of overlap by the area of their union. A greater IOU indicates better shadow detection.
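The IOU computation is straightforward on boolean masks; the function name and the toy masks below are illustrative:

```python
import numpy as np

def iou(pred_mask, gt_mask):
    # Jaccard index between two binary shadow masks:
    # |intersection| / |union|.
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: perfect agreement by convention
    return np.logical_and(pred, gt).sum() / union

gt = np.zeros((4, 4), dtype=bool)
gt[:2, :] = True        # 8 ground-truth shadow pixels
pred = np.zeros((4, 4), dtype=bool)
pred[:3, :] = True      # over-detects one extra row
print(iou(pred, gt))    # 8 / 12
```

Over-detection of the kind discussed in the experiments lowers this score by inflating the union while the intersection stays fixed.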

Experimental results
Although ISTD is a paired dataset, our method is trained with an unpaired strategy: the first image is sampled from the shadow domain while the second is randomly chosen from the shadow-free one. We also selected 100 random images for validation purposes. During the experiments multiple hypotheses were tested, and the most successful of them are shown in Table 1. Global RMSE reflects the quality of shadow removal coupled with background restoration. Shadow-region RMSE directly estimates removal by comparing shadow regions. IOU evaluates the shadow detection quality against the ground-truth shadow mask. All scores are computed on the validation set after the models are trained. Other solutions from the field, except Mask-ShadowGAN [9], are not included but will be tested in the future.
Since we added two new components to this work, we wanted to test how they affect the generation quality.
We saw that the method with dilated convolutions stays at roughly the same level as the one without them. However, we decided to keep the dilated convolutions, expecting they might improve the results once attention was added.
The majority of tests we conducted were about how to use an attention module and where to localize it. At first, we added attention to all networks, following the original approach [13]. The results improved drastically, as measured by the RMSE score in both global and region contexts.

Table 2 Examples of images
However, IOU showed lower results, for reasons we will detail below. Adding the attention module introduced new instability issues while converging faster. We also encountered messy outputs compared to the initial methods, where the shadow removal generator identified not only the shadow but also the background behind it. Thus, shadow detection quality decreased significantly, as indicated by the lower IOU.
For these reasons, we removed the attention module from G_{f→s} and D_s, which helped to suppress the instability, though it remained considerably higher than in the solutions without attention. The qualitative results improved a little, but there was still a problem with over-detection.
Hence, the next solution added an attention map transfer from the shadow removal generator to the input of the opposite one. We aimed to complement the binary mask with an attention map to make the learning more consistent. As a result, it reduced the model oscillations and raised the quality of generated samples. However, the shadow detection performance declined even more.
The attention map transfer showed the capacity to improve the results, so we researched other ways to share this information with the shadow generator network. For instance, the shadow mask M_s was removed from the input of G_{f→s}, so the network had to use the information from the attention map A_s alone. This again resulted in messy outputs with roughly the same metric values.
Rather than integrating the attention information at the input of the network, we researched ways to insert it inside. After a solid amount of experimentation, we came up with transferring the CAM weights directly to re-balance the corresponding feature maps in G_{f→s}. This improved the shadow removal performance but exacerbated the over-detection problem.

Conclusions
In this paper, we presented a solution to the unsupervised shadow removal problem using generative adversarial networks with attention modules and multi-context aggregation. Our network produces better results compared to the existing approach in the field. Analysis showed that attention maps obtained from the auxiliary classifier encourage the networks to concentrate on the more distinctive regions between domains. However, GANs demand a more accurate and consistent architecture to solve the problem more efficiently. We have also shown how attention modules can improve the quality of shadow removal while introducing problems with shadow over-detection. For that reason, in future work we will research a more consistent architecture to address this problem.