Machine Learning Integrity

Machine Learning Integrity (MLI) projects focused on helping machine learning (ML) practitioners get quality data and address barriers to integrating ML methods into existing human processes. Some of the MLI team's efforts this year focused on generating the labeled data needed to train more bespoke ML models, both through scalable methods of data collection (Labeling Alternatives) and through model-driven data generation methods (Data / Text Generation). Other efforts focused on better integrating the user into the ML modeling process, whether through training workflows that incorporate user feedback (User-In-The-Loop), methods that provide insight into the inner workings of an ML model (Explainability), or applications of ML that help address bottlenecks in cyber threat intelligence (CTI) workflows (Text Classification).

Team Leads: Michael Green, Lori Wachter, Jascha Swisher

Labeling Alternatives

Participants: Michael Green (LAS)

A significant trend in machine learning is the creation and widespread use of foundation models. Another trend is an emphasis on data-centric AI. Together, these two trends suggest opportunities to produce high-quality models with relatively little human-annotated training data. This video summarizes what is meant by foundation models and the opportunities and concerns surrounding them. It also discusses what is meant by data-centric modeling. Finally, it links these trends to current and future work at LAS.
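
As a hedged illustration of how a foundation model can reduce the need for human-annotated training data, the sketch below classifies text zero-shot with an off-the-shelf pretrained model; the model name, example sentence, and candidate labels are assumptions for illustration, not part of the LAS work:

```python
# Minimal sketch: use a pretrained foundation model for zero-shot
# classification instead of training a bespoke model from scratch.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")
result = classifier(
    "The attacker exfiltrated credentials through a phishing campaign.",
    candidate_labels=["cyber threat", "sports", "finance"],
)
print(result["labels"][0], result["scores"][0])  # highest-scoring label and its score
```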

Data / Text Generation

Participants: Baekkwan Park (ECU)

In many classification problems that use text as data, such as spam filtering, sentiment analysis, and entity recognition, annotated data is scarce, which often leads to sensitive, underperforming models. Preparing a sufficiently large amount of labeled data is time-consuming and costly. This paper proposes a way to alleviate this problem through the development of an informed textual data augmentation (ITDA) algorithm and tools for analyzing textual data. Building on an approach that uses modified stochastic gradient descent (MSGD) and generalized large-scale language models, the proposed algorithm generates new training data that are sufficiently different from the originals yet preserve the original labels, improving model performance and robustness.
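
The paper's ITDA/MSGD algorithm is not reproduced here, but the following minimal sketch illustrates the general principle of label-preserving textual augmentation, generating a variant that differs from the original while keeping its label; the synonym-swap strategy and example are illustrative assumptions:

```python
# Generic label-preserving text augmentation: swap a few words for
# WordNet synonyms so the new example differs from the original but
# keeps its label. Illustrative only; not the paper's ITDA algorithm.
import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def synonym_augment(sentence, n_swaps=2, seed=0):
    random.seed(seed)
    words = sentence.split()
    for _ in range(n_swaps):
        idx = random.randrange(len(words))
        synsets = wordnet.synsets(words[idx])
        if synsets:
            lemmas = [l.name().replace("_", " ") for l in synsets[0].lemmas()]
            candidates = [l for l in lemmas if l.lower() != words[idx].lower()]
            if candidates:
                words[idx] = random.choice(candidates)
    return " ".join(words)

original = ("This message is an unsolicited advertisement", "spam")
augmented = (synonym_augment(original[0]), original[1])   # label is preserved
print(augmented)
```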

Participants: Felecia Morgan-Lopez (LAS)

This presentation discusses the outcome of applying synthetic data generation to unclassified cyber data to predict the probability of a cyber attack.
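
As a hedged, generic sketch of the idea (not the project's actual data or pipeline), the code below oversamples rare "attack" examples with synthetic points via SMOTE and then reads off a predicted attack probability from a simple classifier:

```python
# Illustrative assumptions throughout: features, counts, and the use of
# SMOTE are stand-ins, not the project's actual approach.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                 # e.g., flow/log-derived features
y = (rng.random(500) < 0.05).astype(int)      # rare "attack" label

X_syn, y_syn = SMOTE(random_state=0).fit_resample(X, y)   # add synthetic attack examples
model = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)

p_attack = model.predict_proba(X[:5])[:, 1]   # predicted probability of attack
print(p_attack)
```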

User-In-The-Loop

Participants: Michael Green (LAS)

Redthread is a user-in-the-loop application for experimenting with machine-assisted information extraction workflows and aiding in the creation of knowledge graphs. The project was first briefed at the 2020 Symposium, and a fully functional prototype was completed this year. This video includes a brief overview of redthread and its information extraction workflow, a discussion of results from analyst testing, and our proposed way ahead for the project.
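
For illustration, the sketch below shows one generic shape a user-in-the-loop extraction step can take, with an off-the-shelf NER model proposing entities and the user confirming them before they enter a graph; it is not the redthread implementation, and the spaCy model and example text are assumptions:

```python
# Generic machine-assisted extraction loop: the model proposes entities,
# the user confirms or rejects them, accepted entities become graph nodes.
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")        # assumes this model is installed
graph = nx.Graph()

text = "Acme Corp hired Jane Doe in Raleigh."
doc = nlp(text)

for ent in doc.ents:
    answer = input(f"Keep entity '{ent.text}' ({ent.label_})? [y/n] ")
    if answer.strip().lower() == "y":     # user-in-the-loop confirmation
        graph.add_node(ent.text, label=ent.label_)

# Link entities that were accepted from the same sentence.
nodes = list(graph.nodes)
for i in range(len(nodes)):
    for j in range(i + 1, len(nodes)):
        graph.add_edge(nodes[i], nodes[j], source=text)

print(graph.nodes(data=True), graph.edges(data=True))
```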

Explainability

Participants: Kewen Peng (NCSU), Joymallya Chakraborty (NCSU), Tim Menzies (NCSU), Jascha Swisher (LAS), Aaron Wiechmann (LAS), Michael Green (LAS), Stephen Shauger (LAS)

In 2021, we, the researchers of the RAISE lab, along with LAS researchers, focused on improving traditional machine learning in three major areas. The first is explainability. Machine learning models often achieve very good performance, but they remain black boxes. It is now crucial to explain the output of a specific model: how and why the model makes a given decision. Two frameworks, TimeLIME and VEER, have been built to achieve better performance and explainability. The second problem was reducing the training time of deep learning (DL) models. DL is notorious for being slow to train, and some very popular DL models (BERT, Transformer) are extremely slow and resource-intensive to train. We found that feature engineering and subsampling of the training data can significantly reduce the training time of DL models. The third problem we focused on is generating fair predictions, so that a model does not discriminate against anyone on the basis of protected attributes such as gender, race, or nationality. In the future, we will continue improving our solutions to generate explainable, fast, and fair predictions using various machine learning models.
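
As a hedged illustration of the local-explanation idea (a LIME-style surrogate, not the TimeLIME or VEER frameworks themselves), the sketch below perturbs one instance, queries a black-box model, and fits a sparse linear model whose coefficients act as the explanation; the data and models are made up:

```python
# Local surrogate explanation: perturb one instance, query the black box,
# fit a sparse linear model, and read its coefficients as feature weights.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)

black_box = RandomForestClassifier(random_state=0).fit(X, y)

x0 = X[0]                                         # instance to explain
neighborhood = x0 + rng.normal(scale=0.3, size=(200, 5))
probs = black_box.predict_proba(neighborhood)[:, 1]

surrogate = Lasso(alpha=0.01).fit(neighborhood, probs)
print("local feature weights:", surrogate.coef_)  # which features drove the prediction
```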

Participants: Mitchell Plyler (NCSU), Min Chi (NCSU)

Rationales, snippets of extracted text that explain an inference, have emerged as a popular framework for interpretable natural language processing (NLP). Rationale models typically consist of two cooperating modules, a selector and a classifier, trained to maximize the mutual information (MMI) between the selected text and the document label. Despite their promise, MMI-based methods often pick up on spurious text patterns and result in models with nonsensical behaviors. In this work, we investigate whether counterfactual data augmentation (CDA), without human assistance, can improve the performance of the selector by lowering the mutual information between spurious signals and the document label. Our counterfactuals are produced in an unsupervised fashion using class-dependent generative models. From an information-theoretic perspective, we derive properties of the unaugmented dataset under which our CDA approach would succeed. The effectiveness of CDA is empirically evaluated by comparing against several baselines, including an improved MMI-based rationale schema, on two multi-aspect datasets. Our results show that CDA produces rationales that better capture the signal of interest.
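
As a hedged sketch of the two-module rationale framework described above (not the authors' implementation, and without the CDA component), the toy PyTorch code below pairs a selector that soft-masks tokens with a classifier that predicts the label from the masked tokens, plus a sparsity penalty that encourages short rationales:

```python
# Minimal selector/classifier rationale skeleton; embeddings, labels,
# and hyperparameters are placeholders for illustration.
import torch
import torch.nn as nn

class Selector(nn.Module):
    """Scores each token; high scores mark tokens kept as the rationale."""
    def __init__(self, emb_dim):
        super().__init__()
        self.score = nn.Linear(emb_dim, 1)

    def forward(self, token_embs):
        logits = self.score(token_embs).squeeze(-1)        # (batch, seq_len)
        return torch.sigmoid(logits)                       # soft mask in [0, 1]

class Classifier(nn.Module):
    """Predicts the document label from the selected tokens only."""
    def __init__(self, emb_dim, n_classes):
        super().__init__()
        self.out = nn.Linear(emb_dim, n_classes)

    def forward(self, token_embs, mask):
        masked = token_embs * mask.unsqueeze(-1)           # zero out unselected tokens
        pooled = masked.sum(1) / (mask.sum(1, keepdim=True) + 1e-8)
        return self.out(pooled)

# Toy forward/backward pass with random "embeddings" standing in for an encoder.
embs = torch.randn(4, 20, 32)             # 4 docs, 20 tokens, 32-dim embeddings
selector, classifier = Selector(32), Classifier(32, 2)
mask = selector(embs)
logits = classifier(embs, mask)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 2, (4,))) \
       + 0.01 * mask.mean()               # sparsity penalty on the rationale
loss.backward()
```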

Text Classification

Participants: Elemendar

In collaboration with the Laboratory for Analytic Sciences (LAS), Elemendar has conducted research into the application of Graph Neural Networks (GNNs) to Cyber Threat Intelligence (CTI). Over 2021, our research has broadly considered three questions. First, how can AI help solve CTI bottlenecks for the security community? Second, can graph neural networks help when applied to CTI data? Third, can we build proofs of concept for different ML-assisted CTI tasks? Considering these questions over 2021 has enabled a better understanding of how applicable GNNs are for supporting CTI analysis. Specifically, the research indicates that GNNs are likely to help automate aspects of report ingestion and entity prediction, both of which can help CTI analysts.
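
As an illustration only, and not Elemendar's system, the sketch below hand-rolls a small graph convolutional network over a made-up CTI-style entity graph to show what GNN-based entity prediction can look like; the node features, adjacency, and labels are all hypothetical:

```python
# Minimal GCN over a CTI-style entity graph (nodes = indicators/entities,
# edges = co-occurrence in a report). Everything here is illustrative.
import torch
import torch.nn as nn

def normalize_adj(adj):
    """Symmetrically normalize an adjacency matrix with self-loops (GCN-style)."""
    adj = adj + torch.eye(adj.size(0))
    deg_inv_sqrt = adj.sum(1).pow(-0.5)
    return deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)

class GCN(nn.Module):
    def __init__(self, in_dim, hid_dim, n_classes):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim)
        self.w2 = nn.Linear(hid_dim, n_classes)

    def forward(self, x, adj_norm):
        h = torch.relu(adj_norm @ self.w1(x))   # aggregate neighbor features
        return adj_norm @ self.w2(h)            # node-level class logits

# Toy graph: 5 entities (e.g., malware, IP, domain, CVE, actor) with 8-dim features.
x = torch.randn(5, 8)
adj = torch.tensor([[0, 1, 1, 0, 0],
                    [1, 0, 1, 0, 1],
                    [1, 1, 0, 1, 0],
                    [0, 0, 1, 0, 1],
                    [0, 1, 0, 1, 0]], dtype=torch.float)
labels = torch.tensor([0, 1, 1, 2, 0])          # hypothetical entity types

model = GCN(8, 16, 3)
logits = model(x, normalize_adj(adj))
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
```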