Adversarial attacks in machine learning: What they are and how to stop them
Adversarial machine learning, a technique that attempts to fool models with deceptive data, is a growing threat in the AI and machine learning research community. The most common reason is to cause a malfunction in a machine learning model. An adversarial attack might entail presenting a model with inaccurate or misrepresentative data during training, or introducing maliciously designed data to deceive an already trained model.
As the 2019 interim report of the United States National Security Commission on Artificial Intelligence notes, only a very small percentage of current AI research goes toward defending AI systems against adversarial efforts. Some systems already used in production could be vulnerable to attack. For example, by placing a few small stickers on the ground, researchers showed they could cause a self-driving car to move into the opposite lane of traffic. Other studies have shown that making imperceptible changes to an image can trick a medical analysis system into classifying a benign mole as malignant, and that pieces of tape can deceive a computer vision system into wrongly classifying a stop sign as a speed limit sign.
The growing adoption of AI is likely to correlate with a rise in adversarial attacks. It’s a never-ending arms race, but fortunately, effective approaches exist today to mitigate the worst of the attacks.
Types of adversarial attacks
Attacks against AI models are often categorized along three primary axes – influence on the classifier, the security violation, and their specificity – and can be further subcategorized as “white box” or “black box.” In white box attacks, the attacker has access to the model’s parameters, while in black box attacks, the attacker has no access to these parameters.
An attack can influence the classifier – that is, the model – by disrupting the model as it makes predictions, while a security violation involves supplying malicious data that gets classified as legitimate. A targeted attack attempts to allow a specific intrusion or disruption – or, alternatively, to create general chaos.
Evasion attacks are the most prevalent type of attack, where data is modified to evade detection or to be classified as legitimate. Evasion doesn’t involve influencing the data used to train a model, but it is comparable to the way spammers and hackers obfuscate the content of spam emails and malware. An example of evasion is image-based spam, in which spam content is embedded within an attached image to evade analysis by anti-spam models. Another example is spoofing attacks against AI-powered biometric verification systems.
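To make the idea concrete, one of the simplest ways an evasion example can be crafted against an image classifier is the fast gradient sign method (FGSM), which nudges every pixel slightly in the direction that increases the model’s loss. The sketch below is a minimal illustration in PyTorch; `model`, `image`, and `label` are placeholders for a generic trained classifier and its inputs, not any specific system mentioned above.

```python
import torch
import torch.nn.functional as F

def fgsm_evasion(model, image, label, epsilon=0.03):
    """Craft an evasion example: a small perturbation that raises the
    model's loss on the true label, often flipping its prediction."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step each pixel by epsilon in the direction that increases the loss.
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.clamp(0, 1).detach()
```

The perturbation is bounded by `epsilon`, which is why the altered image can look unchanged to a human while still being misclassified.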
Poisoning, another attack type, is the “adversarial contamination” of data. Machine learning systems are often retrained using data collected while they are in operation, and an attacker can poison this data by injecting malicious samples that subsequently disrupt the retraining process. An adversary can input data during the training phase that is falsely labeled as harmless when it is actually malicious. For example, research has shown that large language models like OpenAI’s GPT-3 can reveal sensitive, private information when fed certain words and phrases.
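As a rough illustration of how poisoning could slip into a retraining pipeline, the hypothetical sketch below flips the labels on a small fraction of incoming samples before they reach the trainer; the function and variable names are assumptions for illustration, not drawn from any real system described in this article.

```python
import numpy as np

def poison_labels(X, y, num_classes, poison_fraction=0.05, seed=0):
    """Simulate label-flip poisoning: relabel a small fraction of samples
    so a model retrained on them learns distorted decision boundaries."""
    rng = np.random.default_rng(seed)
    n_poison = int(len(y) * poison_fraction)
    idx = rng.choice(len(y), size=n_poison, replace=False)
    y_poisoned = y.copy()
    # Shift each chosen label by a random nonzero offset to a wrong class.
    y_poisoned[idx] = (y_poisoned[idx] + rng.integers(1, num_classes, n_poison)) % num_classes
    return X, y_poisoned
```

Even a few percent of mislabeled samples can be enough to degrade a retrained model, which is why data provenance checks matter in continuously learning systems.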
Meanwhile, model stealing, also known as model extraction, involves an adversary probing a “black box” machine learning system in order to reconstruct the model or extract the data it was trained on. This can cause problems when either the training data or the model itself is sensitive and confidential. For example, model stealing could be used to extract a proprietary stock-trading model, which the adversary could then use for their own financial gain.
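The sketch below illustrates the basic idea of model extraction under simplified assumptions: the attacker repeatedly queries a black-box prediction endpoint (represented here by a hypothetical `query_victim` function that returns a predicted label) and trains a local surrogate on the harvested responses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_model(query_victim, input_dim, n_queries=10_000, seed=0):
    """Approximate a black-box classifier by training a surrogate model
    on (probe input, predicted label) pairs harvested from its API."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, size=(n_queries, input_dim))   # synthetic probe inputs
    y = np.array([query_victim(x) for x in X])            # victim's predictions
    surrogate = LogisticRegression(max_iter=1000).fit(X, y)
    return surrogate  # a local copy that mimics the victim's decisions
```

In practice, rate limiting and query monitoring are common countermeasures, since extraction generally requires a large number of API calls.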
Attacks in the wild
Numerous examples of adversarial attacks have been documented to date. One showed it’s possible to 3D-print a toy turtle with a texture that causes Google’s object detection AI to classify it as a rifle, regardless of the angle from which the turtle is photographed. In another attack, a machine-tweaked image of a dog looked like a cat to both computers and humans. So-called “adversarial patterns” on glasses or clothing have been designed to deceive facial recognition systems and license plate readers. And researchers have created adversarial audio inputs to disguise commands to intelligent assistants in benign-sounding audio.
In a paper published in April, researchers from Google and the University of California at Berkeley demonstrated that even the best forensic classifiers – AI systems trained to distinguish between real and synthetic content – are susceptible to adversarial attacks. It’s a troubling, if not necessarily new, development for organizations attempting to build fake media detectors, particularly considering the meteoric rise of deepfake content online.
One of the most infamous recent examples is Microsoft’s Tay, a Twitter chatbot programmed to learn to participate in conversation through interactions with other users. While Microsoft’s intention was for Tay to engage in “casual and playful conversation,” internet trolls noticed the system had insufficient filters and began feeding Tay profane and offensive tweets. The more these users engaged, the more offensive Tay’s tweets became, forcing Microsoft to shut the bot down just 16 hours after its launch.
As VentureBeat contributor Ben Dickson notes, recent years have seen a surge in research on adversarial attacks. In 2014, no adversarial machine learning papers were submitted to the preprint server Arxiv.org, while in 2020, around 1,100 papers on adversarial examples and attacks were. Adversarial attacks and defense methods have also become a highlight of prominent conferences, including NeurIPS, ICLR, DEF CON, Black Hat, and Usenix.
With growing interest in adversarial attacks and techniques to combat them, startups like Resistant AI are making a name for themselves with products that “harden” algorithms against adversaries. Beyond these new commercial solutions, emerging research holds promise for companies looking to invest in defenses against adversarial attacks.
One way to test machine learning models for robustness is with what’s called a trojan attack, which involves modifying a model so it responds to input triggers that cause it to infer an incorrect response. In an attempt to make these tests more repeatable and scalable, researchers at Johns Hopkins University developed a framework dubbed TrojAI, a set of tools that generate triggered datasets and associated models with trojans. They say it will enable researchers to understand the effects of various dataset configurations on the generated “trojaned” models and will help comprehensively test new trojan detection methods to harden models.
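To make the trigger mechanism concrete, a trojan (backdoor) attack typically stamps a small pattern onto a subset of training images and relabels them with an attacker-chosen class, so the trained model behaves normally until the trigger appears at inference time. The sketch below is a generic illustration of that data manipulation, not TrojAI’s actual API; it assumes images stored as a NumPy array of shape (N, H, W) with pixel values in [0, 1].

```python
import numpy as np

def add_trojan_trigger(images, labels, target_class, rate=0.1, seed=0):
    """Stamp a small white square (the trigger) onto a fraction of images
    and relabel them, so a model trained on the result misfires whenever
    the trigger is present."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(len(images) * rate), replace=False)
    images[idx, -4:, -4:] = 1.0   # 4x4 trigger patch in the bottom-right corner
    labels[idx] = target_class    # attacker-chosen label
    return images, labels
```

Detection tooling like TrojAI aims to flag exactly this kind of hidden conditional behavior before a model is deployed.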
The Johns Hopkins team is far from alone in tackling the challenge of adversarial machine learning attacks. In February, Google researchers published a paper describing a framework that either detects attacks or pressures attackers to produce images that resemble the target class of images. Baidu, Microsoft, IBM, and Salesforce offer toolkits – Advbox, Counterfit, Adversarial Robustness Toolbox, and Robustness Gym – for generating adversarial examples that can fool models in frameworks such as MxNet, Keras, Facebook’s PyTorch and Caffe2, Google’s TensorFlow, and Baidu’s PaddlePaddle. And MIT’s Computer Science and Artificial Intelligence Laboratory recently released a tool called TextFooler that generates adversarial text to strengthen natural language models.
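For teams that want to probe their own models, these toolkits wrap common attacks behind a few calls. The sketch below shows roughly how IBM’s Adversarial Robustness Toolbox can generate FGSM examples against a PyTorch classifier; exact argument names may differ between versions, and `model`, `loss_fn`, and `x_test` are placeholders for a trained network, its loss function, and a NumPy batch of test images.

```python
import numpy as np
from art.attacks.evasion import FastGradientMethod
from art.estimators.classification import PyTorchClassifier

# Wrap an existing PyTorch model so the toolbox's attacks can query it.
classifier = PyTorchClassifier(
    model=model,             # trained torch.nn.Module (placeholder)
    loss=loss_fn,            # e.g. torch.nn.CrossEntropyLoss()
    input_shape=(3, 32, 32),
    nb_classes=10,
)

# Generate adversarial versions of a test batch and inspect the predictions.
attack = FastGradientMethod(classifier, eps=0.05)
x_adv = attack.generate(x=x_test.astype(np.float32))
preds = classifier.predict(x_adv).argmax(axis=1)
```

Comparing `preds` against the clean-data predictions gives a quick, if rough, measure of how fragile a model is to small perturbations.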
Most recently, Microsoft, the nonprofit MITRE Corporation, and 11 organizations including IBM, Nvidia, Airbus, and Bosch released the Adversarial ML Threat Matrix, an industry-focused open framework designed to help security analysts detect, respond to, and remediate threats against machine learning systems. Microsoft says it worked with MITRE to build a schema that organizes the approaches malicious actors use to subvert machine learning models, bolstering monitoring strategies around organizations’ critical systems.
The future could bring outside-the-box approaches, including several inspired by neuroscience. For example, researchers at MIT and the MIT-IBM Watson AI Lab have found that directly mapping features of the mammalian visual cortex onto deep neural networks creates AI systems that are more robust to adversarial attacks. While adversarial AI is likely to become a never-ending arms race, these sorts of solutions offer hope that attackers won’t always have the upper hand – and that biological intelligence still has a lot of untapped potential.