Attention Meets Perturbations: Robust and Interpretable Attention with Adversarial Training

Shunsuke Kitada and Hitoshi Iyatomi

Attention mechanisms1 are widely used in the natural language processing (NLP) field with deep neural networks (DNNs). As their effectiveness became apparent across various tasks2 3 4 5 6 7, attention mechanisms were applied not only to recurrent neural networks (RNNs) but also to convolutional neural networks (CNNs). Moreover, Transformers8, which rely heavily on attention mechanisms, have also achieved excellent results. However, it has been pointed out that DNN models tend to be locally unstable: even tiny perturbations to the original inputs9 or to the attention mechanisms can mislead the models10. Specifically, Jain and Wallace10 investigated the effect of attention mechanisms using a practical bi-directional RNN (BiRNN) model and reported that the learned attention weights are vulnerable to perturbations.

To tackle this vulnerability to perturbations, Goodfellow et al.11 proposed adversarial training (AT), which increases robustness by adding adversarial perturbations to the input and forcing the model to cope with the resulting worst-case examples. Previous studies11 12 in the image recognition field have theoretically explained the regularization effect of AT and shown that it improves the robustness of models to unseen images.
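For reference, the standard AT formulation11 computes the worst-case perturbation on the input $X$ under a norm constraint of size $\epsilon$ and approximates it with a single gradient step (the fast-gradient approximation):

$$
r_{\mathrm{adv}} = \mathop{\arg\max}_{\|r\|_2 \le \epsilon} \ell(X + r, y; \hat{\theta}) \approx \epsilon \frac{g}{\|g\|_2}, \qquad g = \nabla_{X}\, \ell(X, y; \hat{\theta}),
$$

where $\ell$ is the training loss and $\hat{\theta}$ denotes the current (fixed) model parameters. The model is then additionally trained on $\ell(X + r_{\mathrm{adv}}, y; \theta)$. Our proposed techniques apply the same idea to attention mechanisms rather than to the input.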

In this paper, we propose new general training techniques for attention mechanisms based on AT, called adversarial training for attention (Attention AT) and more interpretable adversarial training for attention (Attention iAT). The proposed techniques are the first attempt to employ AT for attention mechanisms. Attention AT/iAT is expected to improve both the robustness and the interpretability of the model by appropriately overcoming adversarial perturbations to the attention mechanisms13 14 15. Because the proposed AT techniques for attention mechanisms are model-independent and general, they can be applied to various DNN models (e.g., RNNs and CNNs) with attention mechanisms. They can also be used with any similarity function commonly employed in attention mechanisms, e.g., the additive function1 and the scaled dot-product function8.
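To make this concrete, the sketch below shows how such a worst-case perturbation on the attention might be computed and applied with a single fast-gradient step. This is a minimal PyTorch-style sketch in the spirit of Attention AT, not the authors' implementation; the interface `model(token_ids, attn_perturbation=...)`, which returns the class logits and the attention weights, is an assumed placeholder (a compatible model is sketched in the architecture section below).

```python
import torch
import torch.nn.functional as F

def attention_adversarial_loss(model, token_ids, labels, epsilon=0.02):
    """Training loss in the spirit of Attention AT (illustrative sketch).

    The worst-case perturbation on the attention is approximated with a
    single normalized gradient step; `model(token_ids, attn_perturbation=...)`
    is an assumed interface returning class logits and attention weights.
    """
    # Clean forward pass.
    logits, attn = model(token_ids, attn_perturbation=None)
    clean_loss = F.cross_entropy(logits, labels)

    # Gradient of the loss with respect to the attention weights.
    grad, = torch.autograd.grad(clean_loss, attn, retain_graph=True)

    # Fast-gradient approximation of the worst-case perturbation,
    # scaled to an epsilon-sized L2 ball per example.
    r_adv = epsilon * grad / (grad.norm(p=2, dim=-1, keepdim=True) + 1e-12)

    # Adversarial forward pass with the perturbed attention
    # (the perturbation is treated as a constant, hence detach()).
    adv_logits, _ = model(token_ids, attn_perturbation=r_adv.detach())
    adv_loss = F.cross_entropy(adv_logits, labels)

    # Optimize the sum of the clean and adversarial losses.
    return clean_loss + adv_loss
```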

To demonstrate the effects of these techniques, we compared them with several state-of-the-art AT-based techniques16 17 on ten common datasets for different NLP tasks. These datasets cover binary classification (BC), question answering (QA), and natural language inference (NLI). We also evaluated how well the attention weights obtained with the proposed AT techniques agreed with the word importance calculated from the gradients18. From these evaluations, we obtained the following findings concerning AT for attention mechanisms in NLP:

  • AT for attention mechanisms improves the prediction performance of various NLP tasks.
  • AT for attention mechanisms helps the model learn cleaner attention that correlates more strongly with the word importance calculated from the model gradients.
  • The proposed training techniques are much less sensitive to the perturbation size in AT.

In particular, our Attention iAT demonstrated the best performance in nine out of ten tasks and yielded more interpretable attention, i.e., the resulting attention weights correlated more strongly with gradient-based word importance.

Hosei University

  1. D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” CoRR preprint arXiv:1409.0473, 2014. ↩︎

  2. Z. Lin, M. Feng, C. N. dos Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, “A structured self-attentive sentence embedding,” in Proc. of the 5th International Conference on Learning Representations, ICLR, Conference Track Proceedings, 2017. ↩︎

  3. Y. Wang, M. Huang, and L. Zhao, “Attention-based LSTM for aspect-level sentiment classification,” in Proc. of the 2016 Conference on Empirical Methods in Natural Language Processing, ser. Association for Computational Linguistics (ACL), 2016, pp. 606–615. ↩︎

  4. X. He and D. Golub, “Character-level question answering with attention,” in Proc. of the 2016 Conference on Empirical Methods in Natural Language Processing, ser. Association for Computational Linguistics (ACL), 2016, pp. 1598–1607. ↩︎

  5. A. Parikh, O. Täckström, D. Das, and J. Uszkoreit, “A decomposable attention model for natural language inference,” in Proc. of the 2016 Conference on Empirical Methods in Natural Language Processing, ser. Association for Computational Linguistics (ACL), 2016, pp. 2249–2255. ↩︎

  6. T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proc. of the 2015 Conference on Empirical Methods in Natural Language Processing, ser. Association for Computational Linguistics (ACL), 2015, pp. 1412–1421. ↩︎

  7. A. M. Rush, S. Chopra, and J. Weston, “A neural attention model for abstractive sentence summarization,” in Proc. of the 2015 Conference on Empirical Methods in Natural Language Processing, ser. Association for Computational Linguistics (ACL), 2015, pp. 379–389. ↩︎

  8. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. of the 30th International Conference on Neural Information Processing Systems, 2017, pp. 5998–6008. ↩︎

  9. C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in 2nd International Conference on Learning Representations, ICLR, Conference Track Proceedings, 2013. ↩︎

  10. S. Jain and B. C. Wallace, “Attention is not explanation,” in Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), ser. Association for Computational Linguistics (ACL), 2019, pp. 3543–3556. ↩︎

  11. I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in 3rd International Conference on Learning Representations, ICLR, Conference Track Proceedings, 2014. ↩︎

  12. U. Shaham, Y. Yamada, and S. Negahban, “Understanding adversarial training: Increasing local stability of supervised models through robust optimization,” Neurocomputing, vol. 307, pp. 195–204, 2018. ↩︎

  13. D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry, “Robustness may be at odds with accuracy,” in Proc. of the International Conference on Learning Representations, ICLR, 2019. ↩︎

  14. T. Itazuri, Y. Fukuhara, H. Kataoka, and S. Morishima, “What do adversarially robust models look at?” CoRR preprint arXiv:1905.07666, 2019. ↩︎

  15. T. Zhang and Z. Zhu, “Interpreting adversarially trained convolutional neural networks,” in International Conference on Machine Learning. PMLR, 2019, pp. 7502–7511. ↩︎

  16. T. Miyato, A. M. Dai, and I. Goodfellow, “Adversarial training methods for semi-supervised text classification,” in Proc. of the 5th International Conference on Learning Representations, ICLR, Conference Track Proceedings, 2016. ↩︎

  17. M. Sato, J. Suzuki, H. Shindo, and Y. Matsumoto, “Interpretable adversarial perturbation in input embedding space for text,” in Proc. of the 27th International Joint Conference on Artificial Intelligence, ser. AAAI Press, 2018, pp. 4323–4330. ↩︎

  18. K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” in Proc. of the 2nd International Conference on Learning Representations, ICLR, Workshop Track Proceedings, 2013. ↩︎

Common Model Architectures

Illustration of the base models to which our proposed training techniques are applied: (a) a single sequence model for the binary classification (BC) task and (b) a pair sequence model for the question answering (QA) and natural language inference (NLI) tasks. In (a), the input of the model is the word embeddings {${\bf w_1}$, $\cdots$, ${\bf w_{T_S}}$} associated with the input sentence $X_S$. In (b), the inputs are the word embeddings {${\bf w_{1}^{(p)}}$, $\cdots$, ${\bf w_{T_P}^{(p)}}$} and {${\bf w_{1}^{(q)}}$, $\cdots$, ${\bf w_{T_Q}^{(q)}}$} from the two input sequences $X_P$ and $X_Q$, respectively. These inputs are encoded into hidden states by a bi-directional RNN (BiRNN). In conventional AT for text, the perturbation ${\bf r}$ is added to the word embeddings ${\bf w}$. In our adversarial training for attention mechanisms, we instead compute and add the worst-case perturbation ${\bf r}$ to the attention ${\bf a}$ to improve both the prediction performance and the interpretability of the model.

(a) Single sequence model (b) Pair sequence model
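For readers who prefer code to diagrams, the following is a minimal sketch of the single sequence model in (a): word embeddings are encoded by a BiRNN (here a BiLSTM), additive attention produces weights $a_t$ over the hidden states, and an optional additive perturbation ${\bf r}$ on the attention mirrors the figure. Layer sizes and names are illustrative assumptions, not the authors' exact configuration; the forward interface matches the training sketch given earlier.

```python
import torch
import torch.nn as nn

class BiRNNAttentionClassifier(nn.Module):
    """Single-sequence BiRNN model with additive attention (illustrative)."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        # Additive (Bahdanau-style) attention over the BiRNN hidden states.
        self.attn_proj = nn.Linear(2 * hidden_dim, hidden_dim)
        self.attn_vector = nn.Linear(hidden_dim, 1, bias=False)
        self.output = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids, attn_perturbation=None):
        # token_ids: (batch, T) -> hidden states h_t: (batch, T, 2*hidden_dim)
        h, _ = self.encoder(self.embedding(token_ids))

        # Attention weights a_t over the time steps.
        scores = self.attn_vector(torch.tanh(self.attn_proj(h))).squeeze(-1)
        attn = torch.softmax(scores, dim=-1)            # (batch, T)

        # Optionally add a perturbation r to the attention, as in the figure.
        if attn_perturbation is not None:
            attn = attn + attn_perturbation

        # Attention-weighted sentence representation and prediction.
        context = torch.bmm(attn.unsqueeze(1), h).squeeze(1)
        return self.output(context), attn
```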


Visualization

The following tables visualize the attention weights for each word and the gradient-based word importance on the Stanford Sentiment Treebank (SST)1 test set.

Vanilla vs. Attention AT vs. Attention iAT

Attention AT yielded clearer attention than the Vanilla model or Attention iAT; specifically, it tended to concentrate attention strongly on a few words.

Vanilla Attention AT Attention iAT
 an  unabashedly  schmaltzy  and  thoroughly  enjoyable  true  story   an  unabashedly  schmaltzy  and  thoroughly  enjoyable  true  story   an  unabashedly  schmaltzy  and  thoroughly  enjoyable  true  story 
 one  of  the  greatest  romantic  comedies  of  the  past  decade   one  of  the  greatest  romantic  comedies  of  the  past  decade   one  of  the  greatest  romantic  comedies  of  the  past  decade 
 an  offbeat  romantic  comedy  with  a  great  meet  cute  gimmick   an  offbeat  romantic  comedy  with  a  great  meet  cute  gimmick   an  offbeat  romantic  comedy  with  a  great  meet  cute  gimmick 
 a  film  of  precious  artfully  as  everyday  activities   a  film  of  precious  artfully  as  everyday  activities   a  film  of  precious  artfully  as  everyday  activities 
 it  s  not  horrible  just  horribly  mediocre   it  s  not  horrible  just  horribly  mediocre   it  s  not  horrible  just  horribly  mediocre 
 watching  this  film  nearly  provoked  me  to  take  my  own  life   watching  this  film  nearly  provoked  me  to  take  my  own  life   watching  this  film  nearly  provoked  me  to  take  my  own  life 
 too  bad  the  former  murphy  brown  does  n  t  pop  reese  back   too  bad  the  former  murphy  brown  does  n  t  pop  reese  back   too  bad  the  former  murphy  brown  does  n  t  pop  reese  back 
 unfortunately  the  picture  failed  to  capture  me   unfortunately  the  picture  failed  to  capture  me   unfortunately  the  picture  failed  to  capture  me 

Attention and Gradient

Regarding the correlation between word importance based on attention weights and gradient-based word importance, Attention iAT demonstrated higher agreement than the other models.
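As a rough illustration of how such an agreement score can be computed, the sketch below takes the gradient of the loss with respect to the word embeddings as a saliency-style word importance, normalizes it per sentence, and correlates it with the attention weights using the Pearson coefficient. It reuses the illustrative BiRNNAttentionClassifier sketched above; the exact metric and normalization used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def attention_gradient_agreement(model, token_ids, labels):
    """Pearson correlation between attention weights and gradient-based
    word importance (illustrative sketch)."""
    # Forward pass, keeping the word embeddings in the graph so their
    # gradient (saliency) can be read out after the backward pass.
    embedded = model.embedding(token_ids)
    embedded.retain_grad()
    h, _ = model.encoder(embedded)
    scores = model.attn_vector(torch.tanh(model.attn_proj(h))).squeeze(-1)
    attn = torch.softmax(scores, dim=-1)
    context = torch.bmm(attn.unsqueeze(1), h).squeeze(1)
    loss = F.cross_entropy(model.output(context), labels)
    loss.backward()

    # Gradient-based word importance: L2 norm of the embedding gradient,
    # normalized to sum to one per sentence.
    grad_importance = embedded.grad.norm(dim=-1)
    grad_importance = grad_importance / grad_importance.sum(dim=-1, keepdim=True)

    # Pearson correlation between attention and gradient importance,
    # averaged over the batch.
    attn = attn.detach()
    a = attn - attn.mean(dim=-1, keepdim=True)
    g = grad_importance - grad_importance.mean(dim=-1, keepdim=True)
    corr = (a * g).sum(dim=-1) / (a.norm(dim=-1) * g.norm(dim=-1) + 1e-12)
    return corr.mean()
```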

Vanilla

Ground truth Predicted Correct? Attention Gradient
Pos. Pos. Yes  an  unabashedly  schmaltzy  and  thoroughly  enjoyable  true  story   an  unabashedly  schmaltzy  and  thoroughly  enjoyable  true  story 
Pos. Neg. No  one  of  the  greatest  romantic  comedies  of  the  past  decade   one  of  the  greatest  romantic  comedies  of  the  past  decade 
Pos. Pos. Yes  an  offbeat  romantic  comedy  with  a  great  meet  cute  gimmick   an  offbeat  romantic  comedy  with  a  great  meet  cute  gimmick 
Pos. Pos. Yes  a  film  of  precious  artfully  as  everyday  activities   a  film  of  precious  artfully  as  everyday  activities 
Neg. Neg. Yes  it  s  not  horrible  just  horribly  mediocre   it  s  not  horrible  just  horribly  mediocre 
Neg. Neg. Yes  watching  this  film  nearly  provoked  me  to  take  my  own  life   watching  this  film  nearly  provoked  me  to  take  my  own  life 
Neg. Neg. Yes  too  bad  the  former  murphy  brown  does  n  t  pop  reese  back   too  bad  the  former  murphy  brown  does  n  t  pop  reese  back 
Neg. Neg. Yes  unfortunately  the  picture  failed  to  capture  me   unfortunately  the  picture  failed  to  capture  me 

Attention AT

Ground truth Predicted Correct? Attention Gradient
Pos. Pos. Yes  an  unabashedly  schmaltzy  and  thoroughly  enjoyable  true  story   an  unabashedly  schmaltzy  and  thoroughly  enjoyable  true  story 
Pos. Pos. Yes  one  of  the  greatest  romantic  comedies  of  the  past  decade   one  of  the  greatest  romantic  comedies  of  the  past  decade 
Pos. Pos. Yes  an  offbeat  romantic  comedy  with  a  great  meet  cute  gimmick   an  offbeat  romantic  comedy  with  a  great  meet  cute  gimmick 
Pos. Pos. Yes  a  film  of  precious  artfully  as  everyday  activities   a  film  of  precious  artfully  as  everyday  activities 
Neg. Neg. Yes  it  s  not  horrible  just  horribly  mediocre   it  s  not  horrible  just  horribly  mediocre 
Neg. Neg. Yes  watching  this  film  nearly  provoked  me  to  take  my  own  life   watching  this  film  nearly  provoked  me  to  take  my  own  life 
Neg. Neg. Yes  too  bad  the  former  murphy  brown  does  n  t  pop  reese  back   too  bad  the  former  murphy  brown  does  n  t  pop  reese  back 
Neg. Neg. Yes  unfortunately  the  picture  failed  to  capture  me   unfortunately  the  picture  failed  to  capture  me 

Attention iAT

Ground truth Predicted Correct? Attention Gradient
Pos. Pos. Yes  an  unabashedly  schmaltzy  and  thoroughly  enjoyable  true  story   an  unabashedly  schmaltzy  and  thoroughly  enjoyable  true  story 
Pos. Pos. Yes  one  of  the  greatest  romantic  comedies  of  the  past  decade   one  of  the  greatest  romantic  comedies  of  the  past  decade 
Pos. Pos. Yes  an  offbeat  romantic  comedy  with  a  great  meet  cute  gimmick   an  offbeat  romantic  comedy  with  a  great  meet  cute  gimmick 
Pos. Pos. Yes  a  film  of  precious  artfully  as  everyday  activities   a  film  of  precious  artfully  as  everyday  activities 
Neg. Neg. Yes  it  s  not  horrible  just  horribly  mediocre   it  s  not  horrible  just  horribly  mediocre 
Neg. Neg. Yes  watching  this  film  nearly  provoked  me  to  take  my  own  life   watching  this  film  nearly  provoked  me  to  take  my  own  life 
Neg. Neg. Yes  too  bad  the  former  murphy  brown  does  n  t  pop  reese  back   too  bad  the  former  murphy  brown  does  n  t  pop  reese  back 
Neg. Neg. Yes  unfortunately  the  picture  failed  to  capture  me   unfortunately  the  picture  failed  to  capture  me 

  1. R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” in Proc. of the 2013 Conference on Empirical Methods in Natural Language Processing, ser. Association for Computational Linguistics (ACL), 2013, pp. 1631–1642. ↩︎

Conclusion

We proposed robust and interpretable attention training techniques that exploit AT. In experiments on various NLP tasks, we confirmed that AT for attention mechanisms achieves better prediction performance and model interpretability than techniques that apply AT to word embeddings. Specifically, the Attention iAT technique introduced adversarial perturbations that emphasize differences in the importance of words in a sentence, and it combined high accuracy with interpretable attention that correlated more strongly with gradient-based word importance. The proposed techniques can be applied to various models and NLP tasks. This paper provides strong support and motivation for utilizing AT with attention mechanisms in NLP tasks.

In the experiments, we demonstrated the effectiveness of the proposed techniques for RNN models, whose attention mechanisms have been reported to be vulnerable to perturbations. In future work, we will confirm the effectiveness of the proposed techniques for large language models with attention mechanisms, such as the Transformer1 or BERT2. Because the proposed techniques are model-independent and general techniques for attention mechanisms, we expect them to improve the prediction performance and interpretability of such language models as well.


  1. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. of the 30th International Conference on Neural Information Processing Systems, 2017, pp. 5998–6008. ↩︎

  2. J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), ser. Association for Computational Linguistics (ACL), 2019, pp. 4171–4186. ↩︎