Machine Learning-based (ML-based) systems have been applied to many areas in recent years, ranging from medical diagnosis to weather forecasting to fraud detection, to name a few. New ML-based services and architectures have also emerged thanks to the growth of cloud-based services and the scale of data available on online platforms. One such service is Machine Learning-as-a-Service (MLaaS), which offers machine learning tools as part of cloud computing services; key players providing these cloud-based services currently include Amazon, Google, and Microsoft. A relatively new architecture that has so far been applied to limited areas is federated learning, a decentralized form of machine learning in which geographically distributed parties collaboratively learn a shared global model (in a cloud) while keeping the local data of each party private.

ML-based systems have thus penetrated nearly every aspect of human life, often without people realizing it. For this reason, a line of security research known as adversarial machine learning has gained momentum in recent years; it studies the security and privacy issues of ML-based systems in the presence of adversaries. Throughout this blog post, different adversarial attacks against ML-based systems are discussed by considering three important characteristics of an attacker: its goals, its knowledge, and its capabilities. Attacker-related assumptions are discussed first, and then adversarial attacks against ML-based systems are reviewed from two perspectives: attacks that compromise the security of ML-based systems, and those that compromise their privacy.


Attacker-related Assumptions

An attacker plays the key role in any type of adversarial attack. Any threat model must therefore consider the attacker's characteristics, including its goals, knowledge, and capabilities.

The first characteristic, the attacker's goal, is defined in terms of the desired security or privacy violation. An attacker may aim to cause an integrity or availability violation in an ML-based system, or it may aim to violate privacy constraints. In a classification system, an integrity violation happens when the attacker makes the system misclassify only a few samples, whereas an availability violation occurs when the attacker makes the system misclassify the majority of samples, a situation known as Denial-of-Service (DoS). Finally, a privacy violation happens when confidential information about either the model or its features is leaked to an attacker.

Regardless of the goal, the attacker can have different levels of knowledge about the target ML-based system, including the training data and their labels, the features used to train the model, and the learning algorithm. The attacker may know everything about the target system (a white-box attack); it may know only the features used to train the ML model (a gray-box attack); or it may have zero knowledge about the system (a black-box attack).

Specifying the attacker's capabilities is equally important when modeling the threat. The attacker may only be able to manipulate the test data (called exploratory influence) or, in addition, the training data (known as causative influence). Furthermore, it may aim to influence specific classes of data (a targeted attack) or, instead, to launch attacks on a wide range of instances (an indiscriminate attack).
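
To make this taxonomy concrete, the sketch below encodes the three attacker characteristics as a small data structure; the class and field names are illustrative choices, not part of any established framework.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Goal(Enum):
    INTEGRITY = auto()      # misclassify only a few chosen samples
    AVAILABILITY = auto()   # misclassify most samples (DoS)
    PRIVACY = auto()        # leak model or training-data information

class Knowledge(Enum):
    WHITE_BOX = auto()      # full knowledge of data, features, and algorithm
    GRAY_BOX = auto()       # partial knowledge, e.g., the feature set only
    BLACK_BOX = auto()      # zero knowledge of the target system

class Capability(Enum):
    EXPLORATORY = auto()    # can manipulate test data only
    CAUSATIVE = auto()      # can also manipulate training data

@dataclass
class ThreatModel:
    goal: Goal
    knowledge: Knowledge
    capability: Capability
    targeted: bool          # True: specific classes; False: indiscriminate

# Example: a white-box, training-time attacker targeting specific classes.
poisoning_threat = ThreatModel(Goal.INTEGRITY, Knowledge.WHITE_BOX,
                               Capability.CAUSATIVE, targeted=True)
```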


Security-based Attacks

Security-based attacks can be divided into two categories depending on the learning phase they target: training-phase attacks and testing-phase (or inference-phase) attacks. In what follows, the different types of attacks in each category are discussed briefly.

Training-Phase Attacks

Security-based attacks that target the training phase of a machine learning algorithm include poisoning attacks and Trojaning (or backdooring) attacks.

Poisoning: Poisoning attacks fall into two major categories: data poisoning and model poisoning attacks. Conventional poisoning attacks consider an adversary who manipulates (or poisons) some attributes and some fraction of the training data, aiming to decrease the discrimination capability of a machine learning model by shifting its decision boundary (sometimes called model skewing). These attacks are normally known as data poisoning attacks. In the federated learning context, though, the model updates (parameters) communicated between the parties and the shared global model can also be poisoned by a malicious agent, which introduces a new type of poisoning attack called model poisoning. Both attacks are considered causative since they occur in the training phase. In addition, the attacker's knowledge about the target system can range from zero to partial to full.

Data poisoning attacks comprise any attempt by a malicious agent to manipulate training samples, and they are performed in two different ways: clean-label and dirty-label. Clean-label attacks assume that the adversary cannot change the label of any training sample, whereas dirty-label attacks assume that the adversary can inject into the training dataset multiple copies of the sample it wishes the machine learning algorithm to misclassify, labeled with the desired target class. Such samples are called adversarial examples (or inputs) in the machine learning community. Data poisoning attacks have also been explored in distributed and collaborative machine learning, where research suggests that directed small changes to many parameters of a few local clients (or systems) can interfere with, or even gain control over, the training process.
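
As a minimal illustration of the dirty-label case, the toy sketch below (using scikit-learn and synthetic data; the sample, label, and copy count are arbitrary choices) injects replicated copies of a chosen sample with the attacker's desired label into the training set before fitting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))             # toy training features
y_train = (X_train[:, 0] > 0).astype(int)       # toy labels

# Dirty-label poisoning: replicate one sample with the attacker's target label.
x_adv = np.array([[1.5, 0.0]])                  # sample the attacker wants misclassified
target_label = 0                                # the (wrong) label the attacker wants
n_copies = 200                                  # poisoned copies injected

X_poisoned = np.vstack([X_train, np.repeat(x_adv, n_copies, axis=0)])
y_poisoned = np.concatenate([y_train, np.full(n_copies, target_label)])

clean_model = LogisticRegression().fit(X_train, y_train)
poisoned_model = LogisticRegression().fit(X_poisoned, y_poisoned)

print("clean prediction:   ", clean_model.predict(x_adv))     # expected: class 1
print("poisoned prediction:", poisoned_model.predict(x_adv))  # pushed toward class 0
```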

In addition to manipulating training samples, an adversary can manipulate the training process directly via model poisoning attacks. For these attacks to succeed, the machine learning model needs to be under the full control of the adversary; in many application contexts, however, it is easier to access the training data than the model itself. One of the most popular techniques in these attacks is to manipulate the model's hyper-parameters, since the performance of machine learning algorithms is severely affected by changes to some of these parameters.
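
A minimal sketch of the idea follows, assuming an adversary who can silently tamper with one training hyper-parameter (here the learning rate of a scikit-learn SGD classifier on synthetic data; all values are illustrative).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Toy data standing in for the victim's training set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def train_and_score(eta0):
    # eta0 (the learning rate) is one hyper-parameter that an adversary with
    # control over the training pipeline could silently tamper with.
    clf = SGDClassifier(learning_rate="constant", eta0=eta0, max_iter=50,
                        tol=None, random_state=0)
    return clf.fit(X_tr, y_tr).score(X_te, y_te)

print("benign   eta0=0.01:", train_and_score(0.01))   # reasonable accuracy
print("tampered eta0=100 :", train_and_score(100.0))  # accuracy typically collapses
```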

Trojaning (Backdooring): The basic idea behind Trojaning and backdooring attacks is simple. Trojaning attacks aim to change the model's behavior only under certain conditions (called triggers) while leaving the model's intended behavior otherwise unchanged. Typically, the attacker takes a previously trained model (e.g., a classifier) and embeds a (malicious) backdoor functionality into a joint model obtained by retraining the old model on samples that contain attacker-chosen triggers. Since the target model needs to be retrained with an extra dataset, these attacks can be considered causative.
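
A minimal sketch of constructing such a retraining set, assuming an image classifier and a small square trigger patch; the array shapes, patch, and poisoning fraction are illustrative assumptions.

```python
import numpy as np

def stamp_trigger(images, patch_value=1.0, size=3):
    """Stamp a small square trigger in the bottom-right corner of each image."""
    stamped = images.copy()
    stamped[:, -size:, -size:] = patch_value
    return stamped

def build_trojan_set(x_clean, y_clean, target_label, poison_fraction=0.05):
    """Mix clean data with trigger-stamped copies relabeled to the target class."""
    n_poison = int(len(x_clean) * poison_fraction)
    idx = np.random.choice(len(x_clean), n_poison, replace=False)
    x_trigger = stamp_trigger(x_clean[idx])
    y_trigger = np.full(n_poison, target_label)
    return np.concatenate([x_clean, x_trigger]), np.concatenate([y_clean, y_trigger])

# Retraining the pre-trained model on this mixed set aims to preserve clean
# accuracy while causing any trigger-stamped input to be classified as target_label.
```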

Trojaning attacks can be seen as a specific type of data poisoning attack in which the attacker is interested in influencing the integrity of the ML-based system; hence they are also known as data poisoning integrity attacks. Instead of influencing integrity, an attacker may decide to cause misclassification of a large fraction of samples and thereby create a Denial-of-Service (DoS), as happens in data poisoning availability attacks.

Testing-Phase Attacks

Evasion attacks and reprogramming attacks are two main security-based attacks that target the testing phase or inference phase of a machine learning algorithm.

Evasion: Evasion attacks aim to manipulate samples so that they are misclassified by an ML-based system as the attacker desires. When a specific label is chosen for the misclassification, the evasion attack is said to be targeted. In these attacks, the attacker's knowledge about the target system can range from zero (black-box) to full knowledge (white-box). The adversary may modify a sample by changing all of its features or only specific, possibly interdependent, ones; however, particular application contexts may prevent the attacker from manipulating arbitrary features. Contrary to poisoning attacks, which occur at training time, these attacks happen during the testing phase and are thus considered exploratory. Simple evasion attacks can be thwarted by adjusting the system's decision boundary; an advanced attacker would therefore create samples that are misclassified with high confidence.
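
One well-known white-box instance of this idea is the Fast Gradient Sign Method (FGSM), sketched below with PyTorch; `model`, `x`, and `y_true` are assumed to be a differentiable classifier, an input batch with values in [0, 1], and its true class indices.

```python
import torch
import torch.nn.functional as F

def fgsm_evasion(model, x, y_true, epsilon=0.03):
    """Fast Gradient Sign Method: a one-step white-box evasion attack."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y_true)   # y_true: true class indices
    loss.backward()
    # Step in the direction that increases the loss, then clip to a valid range.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```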

Reprogramming: Reprogramming attacks are the most recent type of adversarial attack, in which a machine learning model is repurposed to perform a new task chosen by the attacker. The main goal is to use the resources of an open machine learning system (or API) to solve other tasks chosen by the attacker. Thus, the objective of the adversary in a reprogramming attack is to obtain specific functionality, rather than a specific output/label for samples containing triggers, as in Trojaning attacks.

To better understand the concept of reprogramming attacks, consider a classifier trained to perform some original task, i.e., the model outputs $f(x)$ as the label for some input $x$. Also, consider an attacker who wishes to perform some adversarial task, i.e., the adversary wishes to compute an output $g(\hat{x})$ as the label for some input $\hat{x}$, which is not necessarily in the same domain as $x$. The adversary can achieve this goal by means of two functions, known as adversarial reprogramming functions. The first function, $h_g$, is a hard-coded one-to-one label mapping from the labels of the adversarial task to the label space of the classifier, whereas the second function, $h_f(\cdot;\theta)$, converts an input from the input space of the adversarial task ($\hat{x}$) to that of the classifier ($x$). The latter transformation is referred to as the adversarial program.
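
A minimal sketch of these two functions follows, assuming an ImageNet-scale victim classifier and MNIST-sized adversarial inputs; the sizes and the trivial label mapping are illustrative assumptions, not taken from the original work.

```python
import torch

class AdversarialProgram(torch.nn.Module):
    """h_f(.; theta): embed a small adversarial-task input into the victim
    classifier's input space and add a learned perturbation theta."""
    def __init__(self, victim_size=224, task_size=28):
        super().__init__()
        self.victim_size, self.task_size = victim_size, task_size
        self.theta = torch.nn.Parameter(torch.zeros(3, victim_size, victim_size))
        self.pad = (victim_size - task_size) // 2

    def forward(self, x_hat):                        # x_hat: (N, 1, task, task)
        n = x_hat.size(0)
        canvas = torch.zeros(n, 3, self.victim_size, self.victim_size)
        lo, hi = self.pad, self.pad + self.task_size
        canvas[:, :, lo:hi, lo:hi] = x_hat           # 1 channel broadcast to 3
        return canvas + torch.tanh(self.theta)       # bounded adversarial program

# h_g: a hard-coded one-to-one mapping from victim labels to adversarial labels,
# e.g., the first ten victim classes stand in for the ten digits of the new task.
label_map = {victim_class: digit for digit, victim_class in enumerate(range(10))}

# Training (omitted): freeze the victim model, feed it h_f(x_hat), and optimize
# theta so that label_map[victim's predicted class] matches the desired g(x_hat).
```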


Privacy-based Attacks

Machine learning models have access to a wide range of training data, from users' Personally Identifiable Information (PII) to sensitive information from IoT devices. Privacy-based adversarial attacks are any attacks that allow an attacker to extract sensitive information from a machine learning model, whether about the underlying data (e.g., features) or about the model itself. Privacy-based attacks are divided into several categories, discussed next.

Membership Inference: Membership inference attacks aim to find out whether or not an input belongs to the training dataset used to train a model. A recent study divides these attacks into passive and active in a collaborative learning setting. In decentralized collaborative learning, a set of geographically distributed systems train models on their local training datasets while sharing their model updates with other systems at a global scale, improving classification performance without revealing their sensitive training data. In such a setting, a passive membership inference attack is one where the adversary observes the updates and performs inference without changing anything at the local scale (i.e., the distributed systems with models trained on local data) or the global scale (i.e., a centralized server that aggregates the model updates of the local systems). In an active inference attack, on the other hand, the adversary can perform additional local computations and submit the resulting values into the collaborative learning protocol. Beyond information about training data, membership inference attacks may also aim to determine how much a local system (or client) contributed during the training phase of a decentralized machine learning system.
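
In its simplest centralized form, membership inference exploits the fact that models tend to be more confident on samples they were trained on. The sketch below shows only that confidence-threshold variant, assuming black-box access to a scikit-learn-style `predict_proba`; stronger attacks train additional "shadow" models to calibrate the decision.

```python
import numpy as np

def membership_score(model, x):
    """Top-class confidence per sample: typically higher for training members."""
    probs = model.predict_proba(x)      # black-box access to prediction output
    return probs.max(axis=1)

def infer_membership(model, x, threshold=0.9):
    # Above-threshold confidence is taken as evidence that x was a training member.
    return membership_score(model, x) >= threshold
```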

Data Extraction (Model Inversion): Initially raised as a concern in the context of genomic privacy, where an attacker could estimate aspects of someone's genotype by abusing black-box access to prediction models, data extraction (or model inversion) attacks try to find an average representation of each of the classes a machine learning model was trained on. These attacks are specifically tailored to black-box MLaaS APIs, where the attacker has no access to the remote model and wishes to obtain an approximate view of each class of data kept on the remote server. For instance, given a face recognition model that is only accessible via an API, the attacker's goal is to use the probabilities the model returns for queried face images to reconstruct an approximate image of a specific person whose data were used to train the model. A recent study proposes a more effective type of model inversion attack in collaborative learning, where an attacker on a local client (or system) tries to extract information about a class of data belonging to other clients; the adversary also deceives other clients into revealing further details about a targeted class rather than waiting for the final model to be trained, as happens in conventional model inversion attacks.
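
A minimal white-box sketch of the core reconstruction step, written with PyTorch: gradient ascent on the input to maximize the confidence of the target class while the model's weights stay frozen. Black-box variants against MLaaS APIs must estimate this gradient through the prediction interface instead; the shapes and step counts below are arbitrary.

```python
import torch

def invert_class(model, target_class, shape=(1, 3, 64, 64), steps=200, lr=0.1):
    """Recover an average-looking representative of `target_class` by gradient
    ascent on the input while keeping the model's weights fixed."""
    x = torch.zeros(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        confidence = torch.softmax(model(x), dim=1)[0, target_class]
        (-confidence).backward()        # ascend on the target-class confidence
        opt.step()
        x.data.clamp_(0.0, 1.0)         # keep the reconstruction a valid image
    return x.detach()
```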

Model Extraction (Model Stealing): Contrary to data extraction attacks, where the attacker's goal is to find an approximate representation of each class used in the training phase, model extraction attacks aim to steal a machine learning model and/or its hyper-parameters. Like data extraction, however, model extraction attacks have emerged with the appearance of cloud-based ML services offered by companies such as Amazon, Google, and Microsoft. The attacker's motivation in these attacks is twofold: model cloning (i.e., duplicating or reusing the model) or launching white-box evasion attacks.
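
A minimal sketch of the cloning step, assuming a hypothetical `query_api` function that wraps the victim's prediction endpoint and returns a label per query; the query distribution and surrogate model are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def steal_model(query_api, n_queries=5000, n_features=20):
    """Fit a local surrogate on (input, predicted label) pairs obtained by
    querying the victim's prediction API."""
    X_query = np.random.uniform(-1, 1, size=(n_queries, n_features))
    y_query = np.array([query_api(x) for x in X_query])   # black-box labels
    surrogate = DecisionTreeClassifier().fit(X_query, y_query)
    return surrogate    # usable for model cloning or white-box evasion attacks
```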

Class Sniffing: With the emergence of novel machine learning algorithms and architectures such as federated learning, a decentralized machine learning setting, more adversarial attacks have been proposed in recent years. Class sniffing is one such attack, in which the adversary's goal is to infer whether a particular class of training data has appeared in a single training round.

Quality Inference: Another type of adversarial attack proposed for the federated learning setting is quality inference, where the attacker can judge whether a certain training label is present in a specific group of clients, or even predict the exact number of clients that hold this training label.

Whole Determination: The last adversarial attack tailored to federated learning is whole determination, where a malicious participant aims to obtain the composition proportion of labels in the current global model, normally kept on a centralized remote server.


Discussion and Conclusion

With the widespread application of machine learning to different areas, especially in recent years, novel attacks are being devised by adversaries either to degrade the performance of systems that rely solely on machine learning or to leak sensitive information used to train machine learning models. Such systems are therefore required to be robust against adversarial attacks and to preserve the confidentiality of the data used to train their models.