Machine Learning Integrity

The LAS Machine Learning Integrity group supports research intended to make trustworthy and useful ML-driven systems more widely available to intelligence analysts. In 2020, this work included focus areas such as ML explainability and fairness, data labeling solutions for supervised ML, knowledge graph population and sharing, and research in the cybersecurity application domain, among others.

Team Leads: Mike Green, Jascha Swisher, Stephen Shauger, Aaron Wiechmann

Research has shown that neural networks (NNs) are vulnerable to transferable adversarial attacks. An adversarial example is an input that has been perturbed by an adversary to cause unintended behavior; the example is transferable when it can be created on one system and then proves effective against a second. The most effective attacks depend upon knowledge of a model's internal parameters, so it would be ideal for an adversary to create an attack on a known system and then transfer it to an unknown system. The researchers hypothesized that these adversarial techniques are dangerous because they are highly transferable: attacks that work on a white-box model will similarly affect a black-box model of similar foundations (e.g., both CNNs). Developing a surrogate model enabled the team to deeply research and understand an attack vector's intricacies and then precisely deploy it against a black-box model. The hypothesis was tested using three experiments and two datasets. A high signal-to-noise ratio (SNR) dataset and a low SNR dataset were used to represent different types of communication data that could be found in the real world. Three types of adversarial attacks (Fast Gradient Sign Method, DeepFool, and Carlini and Wagner) were used across three experiments to test how the different attacks would affect the models. Experiment 1 transferred an adversarial attack created with a white-box model to a black-box model to determine whether it would affect the black-box model in the same way as the white-box model. Experiment 2 created a defense against attacks by training with a combined dataset consisting of original data and adversarial data; a model trained with similar adversarial examples should become hardened and able to defend against new adversarial examples. Lastly, Experiment 3 tested preconditioning the data using a low-pass filter to remove any noise introduced by the adversarial examples.
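
As a hedged illustration of the attack-crafting step (not the project's actual code), the sketch below shows the Fast Gradient Sign Method perturbing inputs against a known white-box surrogate model; the resulting examples could then be replayed against a black-box model to measure transferability. The model names, data variables, and epsilon value are hypothetical.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    """Fast Gradient Sign Method: nudge x along the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

# Hypothetical transfer test: craft the attack on the white-box surrogate,
# then evaluate the same perturbed signals against the black-box model.
# x_adv = fgsm_attack(surrogate_model, signals, labels, epsilon=0.01)
# transfer_acc = (blackbox_model(x_adv).argmax(1) == labels).float().mean()
```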

Participants: Kaitlyn Yingling (CACI), Adam Davis (CACI), Stephanie Beard (CACI)

Request follow-up on this project

RED THREAD is a knowledge extraction tool that allows an analyst to leverage Entity Extraction and Question Answering to populate knowledge graphs. It acts as a demonstration of a dedicated human-machine workflow and a novel use of NLP for information extraction.
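
As a rough sketch of the kind of workflow RED THREAD demonstrates (not its actual implementation), entity extraction can surface candidate nodes and a question-answering model can propose relations to record as graph triples. The pipeline models, the templated question, the relation name, and the confidence threshold below are all assumptions.

```python
from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")
qa = pipeline("question-answering")

def extract_triples(document):
    """Find entities, then ask a templated question about each to form
    (subject, relation, object) triples for the knowledge graph."""
    triples = []
    for entity in ner(document):
        if entity["entity_group"] == "PER":
            answer = qa(question=f"Where does {entity['word']} work?",
                        context=document)
            if answer["score"] > 0.5:  # keep only confident answers
                triples.append((entity["word"], "works_at", answer["answer"]))
    return triples
```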

Participants: Liam Wilkinson (LAS)

Request follow-up on this project

Expert Extractor and Abstract Explorer are two practical applications of machine learning that allow us to explore some fundamental concepts of Machine Learning Operations. Furthermore, these use cases demonstrate alternatives when large amounts of hand-labeled data are not available.

Participants: Mike Green (LAS)

Request follow-up on this project

Text-based documents, such as reports, blogs, and news stories, may contain descriptions of malicious cyber activity. Text analytics techniques can be used to automatically extract information (e.g., threat indicators) from these reports to produce actionable cyber threat intelligence (CTI). The text is parsed to extract the most relevant threat indicators, which are used to populate a structured format; incidents reported in such a structured format enable analysis across incidents to produce contextualized threat intelligence and situational awareness. The goal of this project is to aid analysts in producing contextualized threat intelligence and situational awareness by using NLP techniques and machine learning to identify attack techniques, based on the ATT&CK framework, from text-based reports and to populate a structured format that enables analysis across incidents. First, we conducted a Systematic Literature Review (SLR) of 38 papers related to the research area. We identified seven text-based sources and seven goals to mine for CTI, proposed a high-level pipeline for mining CTI from text-based resources based on these papers, and specified details of the pipeline for each of the seven goals. Second, we propose a novel approach to mine CTI from text-based reports. We use a one-class SVM (with RoBERTa sentence embeddings as features) to select technique-related sentences in the text reports. Then, we use two approaches to extract techniques from those sentences: 1) semantic similarity (again using RoBERTa sentence embeddings) between technique-related sentences and training dataset sentences; and 2) a BERT transformer performing multi-class classification. Both approaches assign one ATT&CK technique as a label to each technique-related sentence in the text reports.
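
A minimal sketch of the two-stage idea, assuming a sentence-transformers RoBERTa encoder and illustrative variable names (the project's exact models, data, and thresholds may differ): a one-class SVM trained on technique-related sentences flags candidate sentences in a new report, and cosine similarity to labeled training sentences assigns the closest ATT&CK technique.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import OneClassSVM

# train_sentences / train_technique_ids are assumed to hold labeled
# technique-related sentences and their ATT&CK technique IDs (e.g. "T1566").
encoder = SentenceTransformer("all-distilroberta-v1")  # RoBERTa-based encoder
train_emb = encoder.encode(train_sentences)
train_labels = np.array(train_technique_ids)
detector = OneClassSVM(kernel="rbf", nu=0.1).fit(train_emb)

def label_report(report_sentences):
    """Return (sentence, ATT&CK technique) pairs for technique-related sentences."""
    emb = encoder.encode(report_sentences)
    results = []
    for sent, vec in zip(report_sentences, emb):
        if detector.predict(vec.reshape(1, -1))[0] == 1:   # technique-related?
            sims = cosine_similarity(vec.reshape(1, -1), train_emb)[0]
            results.append((sent, train_labels[sims.argmax()]))
    return results
```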

Participants: Rezvan Mahdavi Hezaveh (NCSU), Laurie Williams (NCSU)

Request follow-up on this project

This video introduces the labeling service Infinitypool. Topics include a quick overview of machine learning, an explanation of how Infinitypool uses the tools of machine learning to assist classified and unclassified projects, a description of the current capabilities of Infinitypool and past project statistics, and information on how to access Infinitypool.

Participants: Aaron Wiechmann (LAS), Stephen Williamson (LAS), Stephen Shauger (LAS), Lauren Fulcher (LAS)

Request follow-up on this project

Data labeling is essential for supervised machine learning projects, yet labeling sessions can sometimes be long and tedious. LAS partnered with a Spring 2020 CS Senior Design team to design and prototype gamification for LAS's data labeling application, Infinitypool.

Participants: Aaron Wiechmann (LAS), Stephen Williamson (LAS), Stephen Shauger (LAS)

Request follow-up on this project

Production systems consume time and energy when applying neural network models. This study explored ways to significantly increase the energy demands and slow the responsiveness of such systems. It tested natural language processing (NLP) software of the kind found in analytic systems. In conclusion, it recommended further studies extending the work into image recognition software.

Participants: Paul Davis (LAS)

Request follow-up on this project

People should be able to understand the reasons that underlie an automated decision, and such decisions should be demonstrably fair and equitable to those they impact. This presentation describes several recent efforts by the NCSU RAISE lab, aimed at assessing and mitigating algorithmic bias, leveraging ML explainability methods to suggest improvements to software systems, and applying these techniques to assess the fairness of real-world decisions captured in an anonymized data set.
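
As a simple, hedged illustration of what assessing algorithmic bias can look like (not the RAISE lab's specific methods), the disparate impact ratio compares favorable-outcome rates across a protected attribute; values well below 1.0 flag decisions worth closer review.

```python
import numpy as np

def disparate_impact(y_pred, protected):
    """Ratio of favorable-outcome rates: unprivileged group over privileged group."""
    unpriv_rate = y_pred[protected == 1].mean()
    priv_rate = y_pred[protected == 0].mean()
    return unpriv_rate / priv_rate

# Hypothetical binary decisions and a binary protected attribute.
decisions = np.array([1, 0, 1, 1, 0, 1, 0, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(disparate_impact(decisions, group))  # 0.25 / 0.75 ≈ 0.33 -> potential disparity
```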

Participants: Joymalla Chakraborty (NCSU), Kewen Peng (NCSU), Tim Menzies (NCSU), Jascha Swisher (LAS), Aaron Wiechmann (LAS), Michael Green (LAS), Stephen Shauger (LAS)

Request follow-up on this project

We present a method for generating counterfactuals for text classifiers without a parallel corpus. The counterfactuals serve as an explanation for a single sample by highlighting changes to a document that would change the decision of a classifier. Through rationales, binary masks over input features, we identify portions of a sample where changes would have the greatest impact on a classifier's decision. We use a neural model to produce counterfactuals from the complement of the rationales. We show how these explanations are useful for understanding classifiers trained on a toy beer review corpus and a small-scale version of the Twitter influencer identification problem.
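
As a rough approximation of the idea (not the authors' model), one can hold the complement of the rationale fixed, let a masked language model rewrite the rationale tokens, and keep the rewrite only if it flips the classifier. The models, whitespace tokenization, and one-token-at-a-time replacement strategy here are simplifying assumptions.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")   # proposes replacement tokens
clf = pipeline("sentiment-analysis")                  # stand-in text classifier

def counterfactual(tokens, rationale_mask):
    """rationale_mask[i] == 1 marks tokens the rationale says drive the decision."""
    original = clf(" ".join(tokens))[0]["label"]
    edited = list(tokens)
    # Rewrite each rationale token in turn, keeping the complement fixed.
    for i, is_rationale in enumerate(rationale_mask):
        if is_rationale:
            masked = edited.copy()
            masked[i] = fill.tokenizer.mask_token
            edited[i] = fill(" ".join(masked))[0]["token_str"].strip()
    candidate = " ".join(edited)
    # Return the edit only if it actually changes the classifier's decision.
    return candidate if clf(candidate)[0]["label"] != original else None
```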

Participants: Mitchell Plyler (NCSU), Min Chi (NCSU), Stephen Shauger (LAS), Jascha Swisher (LAS)

Request follow-up on this project

Knowledge Graphs have the potential to revolutionize the way intelligence analysts work. However, a monolithic graph is unlikely to be a silver bullet for every problem; a decentralized approach is required to accommodate the unique data environment of the IC. This presentation of my whitepaper proposes an Integrated Knowledge Ecosystem (IKE), which would provide a framework for building more advanced human-machine collaborative systems and advancing analytic rigor.

Participants: Liam Wilkinson (LAS)

Request follow-up on this project

One of the most significant bottlenecks in developing machine learning (ML) applications in cybersecurity is efficiently collecting accurately labeled training data at the massive scale required to train transferable models. Many recent ML breakthroughs in vision and language have been catalyzed by the release of labeled training datasets. However, the few datasets available in cybersecurity are idealized, synthetic, and not transferable to most applications. Labeling datasets manually (i.e., strong supervision) is prohibitively slow, expensive, and too static to be useful to the average analyst. In contrast, weakly supervised methods aim to programmatically generate labels that are cheap and dynamic, at the reasonable cost of being noisy and originating from potentially overlapping sources. While this appears to trade quality for quantity, recent studies at Google utilizing the open-source Snorkel package have shown that the generalization of models trained via weak supervision is asymptotically identical to the strongly supervised case. The benefits were clear: Snorkel helped build models 2.8× faster and increased predictive performance by an average of 45.5% versus seven hours of hand labeling. Motivated by these results, we introduce Cyber Snorkel, a weak supervision management system that extends Snorkel into the cyber landscape and enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that take in unlabeled data and output a label or abstain. Accuracies and correlations of the various labeling functions are estimated based on their observed agreements and disagreements, yielding probabilistic labels that can then be incorporated by a noise-aware, transferable ML model. Ultimately, Cyber Snorkel incorporates organizational knowledge into an automated labeling pipeline, allowing for easy creation of labeled datasets and freeing analysts from rudimentary data enrichment.
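
For context on the weak supervision workflow (a generic Snorkel sketch, not the Cyber Snorkel codebase), labeling functions encode analyst heuristics that vote or abstain on each record, and a label model estimates their accuracies from agreement patterns to emit probabilistic labels. The feature names and heuristics below are illustrative assumptions.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

MALICIOUS, BENIGN, ABSTAIN = 1, 0, -1

# Hypothetical labeling functions; the fields (domain, entropy) are illustrative.
@labeling_function()
def lf_known_bad_domain(x):
    return MALICIOUS if x.domain.endswith(".badtld") else ABSTAIN

@labeling_function()
def lf_low_entropy(x):
    return BENIGN if x.entropy < 2.0 else ABSTAIN

df = pd.DataFrame({"domain": ["a.badtld", "mail.example.com"],
                   "entropy": [4.2, 1.5]})
L_train = PandasLFApplier([lf_known_bad_domain, lf_low_entropy]).apply(df)

# The label model estimates LF accuracies and correlations from their observed
# agreements and disagreements, then outputs probabilistic training labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=100)
probs = label_model.predict_proba(L_train)  # feed to a noise-aware end model
```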

Participants: Alex Fitts (Punch Cyber), Mike Geide (Punch Cyber)

Request follow-up on this project
