Abstract:
Image classification is among the most prominent problems solved by machine learning (ML) techniques, and convolutional neural networks (CNNs) are the best-performing technique for it. However, state-of-the-art CNNs are vulnerable to small adversarially-created perturbations. In response, many researchers add a defense mechanism to their classification models as follows: given a classifier built from a training set, they retrain it on the original training data augmented with adversarially-created perturbations. We refer to the resulting models as adversarially-trained classification models. These models come in two types, depending on the characteristics of the adversarial perturbation generator (attacker): the attacker is either a fixed algorithm (fixed attacker) or one that evolves based on the training data it is exposed to (adaptive attacker). A natural hypothesis is that adversarially-trained models born of adaptive attacks would be stronger than those born of fixed ones. However, we find that adversarial training with either type of attacker offers significant improvements over the original classification model, but generally only against attacks by the same algorithm used during adversarial training. To overcome this weakness, we propose an end-to-end training framework that subjects the model being trained to multiple attackers, so that the resulting model is robust against different types of attacks. We find that classifiers trained under this framework can be adapted to be robust against adaptive attackers, but it is more difficult to simultaneously obtain robustness against fixed white-box adversaries. To address this, we propose several regularization techniques, such as weight clipping, to improve classifier robustness against both types of adversaries, but we obtain only slight improvements. Our next step is to identify suitable techniques that improve classification models' ability to learn from both fixed and adaptive adversaries during training.
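For concreteness, the defense the abstract refers to (retraining on data augmented with adversarial perturbations) might look like the following minimal PyTorch sketch. It uses FGSM as the fixed attacker purely for illustration; the function names and the perturbation budget eps are our assumptions, not the framework or attackers actually studied in the paper.

    import torch
    import torch.nn as nn

    def fgsm_perturb(model, x, y, eps):
        """Fixed attacker: one-step FGSM, x_adv = x + eps * sign(grad_x loss)."""
        x = x.clone().detach().requires_grad_(True)
        loss = nn.CrossEntropyLoss()(model(x), y)
        loss.backward()
        # Keep perturbed pixels in the valid [0, 1] range.
        return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

    def adversarial_training_step(model, optimizer, x, y, eps=8 / 255):
        """One adversarial-training step: fit the classifier on perturbed inputs."""
        model.train()
        x_adv = fgsm_perturb(model, x, y, eps)         # augment with adversarial examples
        optimizer.zero_grad()                          # discard gradients left by the attack
        loss = nn.CrossEntropyLoss()(model(x_adv), y)
        loss.backward()
        optimizer.step()
        return loss.item()

Training against a single fixed attacker like this corresponds to the setting in which, per the abstract, the resulting robustness generally holds only against the same attack algorithm used during training.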