Mozes, Maximilian Attila Janos; (2024) Understanding and Guarding against Natural Language Adversarial Examples. Doctoral thesis (Ph.D), UCL (University College London).
Text: mmozes_thesis.pdf (2MB)
Abstract
Despite their success, machine learning models have been shown to be susceptible to adversarial examples: carefully constructed perturbations of model inputs that are intended to lead a model into misclassifying those inputs. While this phenomenon was first discovered in the context of computer vision, an increasing body of work focuses on adversarial examples in natural language processing (NLP). This PhD thesis presents an investigation into such adversarial examples in the context of text classification, characterizing them through both computational analyses and behavioral studies. As a computational analysis, we present results showing that the effectiveness of adversarial word-level perturbations stems from the replacement of input words with low-frequency synonyms. Based on these insights, we propose an effective detection method for adversarial examples (Study 1). As a behavioral analysis, we present (Study 2) a data collection effort comprising human-written word-level adversarial examples, and conduct statistical comparisons between human- and machine-generated adversarial examples with respect to their preservation of sentiment, naturalness, and grammaticality. We find that human- and machine-authored adversarial examples are of similar quality across most comparisons, yet humans can generate adversarial examples with much greater efficiency. In Study 3, we investigate the patterns of human behavior when authoring adversarial examples and derive “human strategies” for generating adversarial examples that have the potential to advance automated attacks. Study 4 discusses the NLP-related scientific safety and security literature with respect to more recent large language models (LLMs). We provide a taxonomy of existing efforts in this area, categorized into threats arising from the generative capabilities of LLMs, prevention measures developed to safeguard models against misuse, and vulnerabilities stemming from imperfect prevention measures. We conclude the thesis by discussing this work’s contributions and impact on the research community as well as potential future work arising from the obtained insights.
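The frequency-based finding of Study 1 can be illustrated with a small sketch. The Python snippet below is a minimal, hypothetical illustration of a frequency-guided detection heuristic in the spirit described above, not the thesis’ actual implementation; the names `word_frequency`, `get_synonyms`, and `model_confidence` are placeholder callables assumed for the example.

```python
# Illustrative sketch only: a frequency-guided check for word-level adversarial
# examples. All helper names below are hypothetical placeholders.

def frequency_guided_detection(tokens, word_frequency, get_synonyms,
                               model_confidence, freq_threshold, delta):
    """Flag an input as potentially adversarial if replacing its low-frequency
    words with higher-frequency synonyms changes the classifier's confidence
    by more than `delta`."""
    transformed = list(tokens)
    for i, word in enumerate(tokens):
        if word_frequency(word) < freq_threshold:
            # Consider only synonyms that occur more often in the reference corpus.
            candidates = [s for s in get_synonyms(word)
                          if word_frequency(s) > word_frequency(word)]
            if candidates:
                # Substitute the most frequent synonym.
                transformed[i] = max(candidates, key=word_frequency)

    original_conf = model_confidence(tokens)
    transformed_conf = model_confidence(transformed)
    # A large confidence shift after substitution suggests the input relied on
    # rare-word replacements, i.e. it may be adversarial.
    return (original_conf - transformed_conf) > delta
```

The intuition behind this kind of check is that an adversarially perturbed input depends on rare synonym substitutions, so undoing them with more frequent alternatives should noticeably shift the classifier’s confidence, whereas a benign input should be largely unaffected.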
| Type: | Thesis (Doctoral) |
|---|---|
| Qualification: | Ph.D |
| Title: | Understanding and Guarding against Natural Language Adversarial Examples |
| Open access status: | An open access version is available from UCL Discovery |
| Language: | English |
| Additional information: | Copyright © The Author 2024. Original content in this thesis is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) Licence (https://creativecommons.org/licenses/by-nc/4.0/). Any third-party copyright material present remains the property of its respective owner(s) and is licensed under its existing terms. Access may initially be restricted at the author’s request. |
| Keywords: | machine learning, natural language processing, adversarial machine learning |
| UCL classification: | UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Security and Crime Science |
| URI: | https://discovery-pp.ucl.ac.uk/id/eprint/10190224 |