The Triage Team’s projects develop and demonstrate methods and tools that identify useful and analyzable collections of data to support an analysis. The process of finding the “right” data to support an analysis can introduce significant errors and biases, so the methods and tools must be characterized to understand their appropriate use in rigorous analytic processes.
Team Leads: Lori Wachter
Effective triage of voice media could greatly increase its value. This video describes LAS's efforts to give analysts methods for selecting a manageable amount of voice data based on how people are talking. These efforts included developing voice corpora and assessing their validity for the problem at hand, identifying where existing analytics met or fell short of requirements, and creating new analytics for unmet requirements, such as identifying abnormal speech in an explainable fashion.
Participants: Tina Kohler (LAS), Tracy Standafer (LAS)
Network communication logs contain a vast amount of information, and the amount of network communication data being collected is growing exponentially, as are the storage needed to keep it and the resources required to process and analyze it. We conducted a model- and simulation-based experiment, using real data, to determine the feasibility of using unequal probability random sampling to reduce storage requirements while maintaining the ability to conduct appropriate analyses. The experiment included the development of a target record generation model and an assessment of how well sampling captures target records under different assumptions about how much is known about the target generation model.
Participants: David Wilson (RTI)
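To make the idea of unequal probability sampling in the network log project above concrete, here is a minimal Python sketch. It assumes Poisson sampling (each record kept independently with its own probability) and Horvitz-Thompson weighting; the abstract does not say which sampling design the project used. The record fields, the `keep_prob` rule, and the 10% retention rate are hypothetical.

```python
import random

def poisson_sample(records, inclusion_prob, rng):
    """Keep each record independently with its own inclusion probability.

    Returns (record, weight) pairs, where weight = 1 / inclusion probability.
    """
    kept = []
    for rec in records:
        p = inclusion_prob(rec)
        if not 0.0 < p <= 1.0:
            raise ValueError("inclusion probability must be in (0, 1]")
        if rng.random() < p:
            kept.append((rec, 1.0 / p))
    return kept

def weighted_count(sample, predicate):
    """Horvitz-Thompson estimate of how many records in the full log satisfy predicate."""
    return sum(w for rec, w in sample if predicate(rec))

# Hypothetical log: 1 in 20 records uses a rare port (assumed more likely to matter).
logs = [{"id": i, "rare_port": i % 20 == 0} for i in range(10_000)]

def keep_prob(rec):
    # Assumed rule: always keep rare-port records, keep 10% of everything else.
    return 1.0 if rec["rare_port"] else 0.1

sample = poisson_sample(logs, keep_prob, random.Random(0))
estimated_total = weighted_count(sample, lambda rec: True)
```

Storing the weight next to each kept record is what allows totals over the full log to be estimated later from the much smaller sample. How well such a scheme captures target records depends on how closely the inclusion probabilities track the true target generation process, which is the question the experiment examines.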
In our work this year, we developed an automatic approach to identifying unusual trends of interest to a user in temporal graph data, along with the factors causing them. Our approach consists of two parts: a scheme for targeting the data values of interest, which automatically translates user interests into graph queries and evaluates the queries on the graphs, and a greedy search strategy, which identifies the most unusual trends in the data of interest for a given target number of factors. The strategy identifies unusual trends and the factors causing them at the same time. Preliminary experimental results show that our approach addresses the problem in almost all cases of given target numbers of factors. Our ongoing work focuses on the failure cases.
Participants: Jing Ao (NCSU), Rada Chirkova (NCSU)
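The greedy search in the temporal graph project above can be pictured with a small Python sketch. This is not the authors' algorithm. It assumes each candidate factor contributes one time series, that "unusualness" is the absolute least-squares slope of the summed series, and that the search adds one factor at a time until it reaches the target number `k` or no factor improves the score. All names and data are hypothetical.

```python
def slope(series):
    """Least-squares slope of a series against time steps 0, 1, 2, ..."""
    n = len(series)
    mean_t = (n - 1) / 2
    mean_y = sum(series) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(series))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den

def unusualness(series):
    # Assumed score: a steeper aggregate trend counts as more unusual.
    return abs(slope(series))

def greedy_factors(factor_series, k):
    """Greedily pick up to k factors whose summed series is most unusual."""
    chosen, total, best_score = [], None, 0.0
    remaining = dict(factor_series)
    while remaining and len(chosen) < k:
        best_name, best_total, best_new = None, None, best_score
        for name, series in remaining.items():
            if total is None:
                candidate = list(series)
            else:
                candidate = [a + b for a, b in zip(total, series)]
            score = unusualness(candidate)
            if score > best_new:  # require a strict improvement
                best_name, best_total, best_new = name, candidate, score
        if best_name is None:
            break  # no remaining factor makes the trend more unusual
        chosen.append(best_name)
        total, best_score = best_total, best_new
        del remaining[best_name]
    return chosen, best_score

# Hypothetical per-factor series extracted from the queried part of the graph.
factors = {
    "A": [1, 2, 3, 4],
    "B": [4, 3, 2, 1],
    "C": [0, 0, 2, 4],
    "D": [5, 5, 5, 5],
}
chosen, score = greedy_factors(factors, 3)
```

Because the trend and its factors are built up together, the search returns both at once. In this toy data it stops after two factors even though three were requested, since adding either remaining factor would not steepen the trend.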
With ever-increasing data, it can be challenging to make sense of large sets of images in a timely manner. Developing an image recognition capability (preferably in near real-time) would significantly enhance mission effectiveness and understanding of the environment. Image recognition is an important application of AI techniques, as images usually act as sensory input for further problems to be solved.
Participants: Steve Cook (LAS), Jenaye Minter (LAS), Liam Wilkinson (LAS), Sheila Bent (LAS), Felecia Morgan-Lopez (LAS), Stephen Shauger (LAS), Tina Kohler (LAS), Dawn Hendricks (LAS), Sandra Harrell-Cook (LAS), James Smith (LAS)
There is an increasing demand for content filtering and triage over large structured datasets for investigative analysis. Analysts face several challenges in such a task. First, during exploration, search targets are vague. Second, there is a lack of systematic filtering processes for uncovering detailed evidence once a potential inconsistency has been identified. Finally, constructing a visual report by synthesizing disparate pieces of information is challenging. In this project, we developed an interactive narrative report generation system for two large email communication datasets. The resulting system integrates topic modeling, social network analysis, narrative generation, and visualization tools in a systematic process for automatic and user-controlled narrative analysis reports. We show results from 250K emails from the Enron dataset and nearly 1M emails from the Avocado dataset. This work has yielded three novel contributions. First, a classification of emails by the weight of content over intent, using a Pointwise Mutual Information (PMI) metric across LDA- and doc2vec-based topic modeling algorithms. Second, the integration of an algorithm for labeling communicative intention within the social network analysis module to extract rich interaction patterns within the social network. Finally, a narrative generation system that identifies and presents embedded stories of email address owners in the dataset in the form of recognizable storytelling patterns such as fall from grace and rags to riches.
Participants: Colin M. Potts (NCSU), Akshat Savaliya (NCSU), Arnav Jhala (NCSU), Tracy Standafer (LAS), Lori Wachter (LAS), Aaron Wiechmann (LAS)
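For readers unfamiliar with the PMI metric mentioned in the email narrative project above, the following Python sketch computes pointwise mutual information, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), from co-occurrence counts. The pairing of a topic label with an intent label per email is an assumption made for illustration; the abstract does not specify how the project applies PMI, and the labels below are hypothetical.

```python
import math
from collections import Counter

def pmi_table(pairs):
    """Return PMI in bits for every (x, y) pair observed in `pairs`."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    return {
        (x, y): math.log2((count / n) / ((left[x] / n) * (right[y] / n)))
        for (x, y), count in joint.items()
    }

# Hypothetical (topic, intent) labels, one pair per email.
associated = [("trading", "request"), ("trading", "request"),
              ("lunch", "inform"), ("lunch", "inform")]
independent = [("trading", "request"), ("trading", "inform"),
               ("lunch", "request"), ("lunch", "inform")]

pmi_associated = pmi_table(associated)
pmi_independent = pmi_table(independent)
```

A positive PMI means the two labels co-occur more often than they would if independent, and a PMI of zero means they co-occur exactly as often as independence predicts. In the first toy dataset each topic always appears with the same intent, so every observed pair has a PMI of 1 bit; in the second, every pair has a PMI of 0.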
In the big data era, methods for combining different sources of information and efficiently assimilating new data as it becomes available are essential. The process of merging multiple databases, often in the absence of unique identifiers, is known as record linkage (also called de-duplication or entity resolution). This project aims to develop new methodology to perform record linkage with streaming data from a fully model-driven perspective. To accomplish this, we will leverage the idea of recursive Bayes to adapt the Bayesian model of Sadinle (2014) to perform record linkage with new incoming data in a scalable fashion.
Participants: Ian Taylor (Colorado State University), Brenda Betancourt (University of Florida), Andee Kaplan (Colorado State University)
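The recursive Bayes idea named in the streaming record linkage project above is that the posterior from the data seen so far becomes the prior for the next batch. The Python sketch below shows only that principle, using a conjugate Beta-Bernoulli model; it is not the Sadinle (2014) record linkage model. The quantity being estimated (the probability that a field agrees among truly matching record pairs) and the data are hypothetical.

```python
def update(prior, agreements):
    """Update a Beta(a, b) prior with 0/1 field-agreement indicators."""
    a, b = prior
    k = sum(agreements)
    return (a + k, b + len(agreements) - k)

def posterior_mean(beta_params):
    a, b = beta_params
    return a / (a + b)

# Hypothetical stream: 1 means the field agreed for a matching record pair.
batches = [[1, 1, 0], [1, 0, 1, 1]]

# Streaming: carry the posterior forward, never revisiting old batches.
state = (1, 1)  # uniform Beta(1, 1) prior
for batch in batches:
    state = update(state, batch)

# Batch: process everything at once, for comparison.
all_at_once = update((1, 1), [x for batch in batches for x in batch])
```

In this conjugate case the streaming and all-at-once posteriors are identical, so earlier data never needs to be reprocessed. Non-conjugate models generally need approximations to carry the posterior forward, and the sketch does not attempt to show that.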
Analysts often must search through large amounts of unstructured information (text, images, video, etc.) to find knowledge relevant to their specific task. This year, SAS developed a prototype analytic dashboard that facilitates this triage process in the specific application domain of captured enemy materials (CEM). Please note this presentation is not available to the general public.