Triage

Triage projects focused on the analytic process of finding information of intelligence value within large data sets. The Triage team’s efforts this year focused on assembling data for better data management and search processes (Search); integrating and presenting data to allow analysts to discover the "knowns and unknowns" (Exploration); discovering data most relevant to the user's workflow (Prioritization); extracting knowledge and value from unstructured data (Extraction); and addressing data deduplication and linkage issues to maximize value while minimizing retention costs (Retention). Additionally, the Triage team focused some of its efforts on understanding and characterizing the processes and behaviors that analysts demonstrate while using software tools for information search tasks (Analyst Behavior).

Team Leads: Felecia Morgan-Lopez, Jody Coward, Jascha Swisher, Lori Wachter

Search

Participants: John Slankas (LAS)

The Common Analytic Platform (CAP) is a system used by several research projects at LAS. We use the platform 1) to quickly explore an existing dataset; 2) as an instrumented platform to run experiments; 3) to compare the performance of different approaches and analytics; and 4) to host analytics generated by researchers in other projects. The system consists primarily of a web application and an Elasticsearch database. We also leverage Kibana and Jupyter for direct data access. We deploy the application as a series of containers. CAP is currently used by the PandaJam and EyeReckon projects.
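
As a concrete illustration of the direct-data-access path, the sketch below runs a full-text query against a CAP-style Elasticsearch index from Python, as one might in a Jupyter notebook. The host, index name, and field names are hypothetical placeholders, and the snippet assumes the 8.x Elasticsearch Python client rather than CAP's actual configuration.

```python
# Minimal sketch of direct data access to a CAP-style Elasticsearch index
# from a Jupyter notebook. The host, index name, and field names below are
# hypothetical placeholders, not CAP's actual configuration.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local deployment

# Full-text query against a hypothetical document index.
resp = es.search(
    index="cap-documents",
    query={"match": {"text": "supply chain"}},
    size=10,
)

for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title", "<untitled>"))
```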

Participants: Jing Ao (NCSU), Rada Chirkova (NCSU)

Knowledge graphs have become a popular data model for characterizing and organizing diverse real-world entities and their relationships. As a fundamental step in analyzing such data, querying knowledge graphs is performed extensively every day across a wide range of applications. However, this step can be challenging for real-life users, who often lack the background knowledge needed to understand complex knowledge-graph models and their corresponding formal query languages. To address the problem, we propose an easy-to-use knowledge-graph-querying approach: the user simply provides an intuitive shape and some examples of expected query answers, and the developed system automatically interprets the user's hidden query intent by matching the shape and the examples against the ground-truth knowledge-graph structure and content, and returns to the user the most likely formal query supported by the knowledge graph for satisfying that intent. This presentation features a high-level description of the approach's purpose, architecture, and design decisions, and a system demonstration of how it works in practice.
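
To make the query-by-example idea concrete, the toy sketch below infers which relation in a small knowledge graph explains a set of example (entity, answer) pairs. It is a drastic simplification of the approach described above (it ignores the user-provided shape entirely), and the graph content and relation names are invented.

```python
# Toy illustration of query-by-example over a knowledge graph: given a few
# example (entity, expected-answer) pairs, find the relation(s) in the graph
# that connect every pair. This is a drastic simplification of the approach
# described above; the graph content and relation names are invented.
from collections import defaultdict

# Knowledge graph as (subject, relation, object) triples.
triples = [
    ("Marie Curie", "bornIn", "Warsaw"),
    ("Marie Curie", "field", "Physics"),
    ("Alan Turing", "bornIn", "London"),
    ("Alan Turing", "field", "Mathematics"),
    ("Ada Lovelace", "bornIn", "London"),
]

examples = [("Marie Curie", "Warsaw"), ("Alan Turing", "London")]

# Index relations by (subject, object) for quick lookup.
rel_index = defaultdict(set)
for s, r, o in triples:
    rel_index[(s, o)].add(r)

# A candidate relation must explain every example pair.
candidates = set.intersection(*(rel_index[pair] for pair in examples))
print(candidates)  # {'bornIn'} -> interpreted intent: "Where was X born?"
```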

Participants: Jaime Arguello (UNC-CH), Rob Capra (UNC-CH), Bogeum Choi (UNC-CH), Sarah Casteel (UNC-CH)

In this research, we investigate the information-seeking practices of intelligence analysts (IAs) employed by a U.S. government agency. Specifically, we focus on the needs, practices, and challenges related to IAs searching for procedural knowledge using an internal system called the Tradecraft Hub (TC Hub). The TC Hub is a searchable repository of procedural knowledge documents written by agency employees. Procedural knowledge (as opposed to factual and conceptual knowledge) includes knowledge about step-by-step procedures, techniques, methods, tools, technologies, and skills, and is inherently task-oriented. We report on a survey study involving 22 IAs who routinely use the TC Hub. Our survey was designed to address four research questions. In RQ1, we investigate the types of work-related objectives that motivate IAs to search the TC Hub. In RQ2, we investigate the types of information IAs seek when they search the TC Hub. In RQ3, we investigate important relevance criteria used by IAs when judging the usefulness of information. Finally, in RQ4, we investigate the challenges faced by IAs when searching the TC Hub. Based on our findings, we discuss implications for improving and extending searchable knowledge base systems such as the TC Hub that exist in many organizations.

Participants: Ken Thompson (LAS)

This research project investigates the marketing research field's suitability as a proxy for analysts' triage, search, and discovery activities. The video describes the field's workflows, tools, and practitioners, and compares and contrasts them to "Technologically Complex Search and Discovery" workflows.

Exploration

Participants: Mengtian Guo (UNC-CH), Zhilan Zhou (UNC-CH), Yue Wang (UNC-CH), David Gotz (UNC-CH)

Our team's previous work proposed computational models of an analyst's focus during visual analysis, which enabled contextual recommendation of relevant documents during an analysis task. These results motivated us to explore two further threads of research this year: (1) What is the best strategy for making contextual recommendations in a visual analytics system? Many approaches to contextual recommendation in visual analytics systems exist; can we systematically categorize them to characterize the design space, highlight frequently studied approaches, and identify underexplored alternatives? (2) What is an effective interface for surfacing structured concepts and relations present in unstructured documents? Users' interactions with these semantic units can then be used to model their analytic focus, which in turn can retrieve relevant structured data visualizations. We present our progress along both threads in this video. For the first thread, we present a multi-dimensional framework that categorizes recommendation interface approaches based on their characteristics (e.g., "what to recommend", "where to recommend", "when to recommend", "how forcefully to recommend"). For the second thread, we present the design of a novel semantic search interface that helps analysts construct a mental model of an unfamiliar information space during exploratory search. The interface organizes and visualizes important concepts and their relations present in a set of search results. We evaluate the interface against a baseline faceted search system in an open-ended search task on medical literature.
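
For readers unfamiliar with interaction-driven recommendation, the sketch below shows a generic baseline: text the analyst has recently interacted with is pooled into a "focus" representation, and documents are ranked by TF-IDF cosine similarity to it. It is not the focus model or interface developed in this project; the corpus and interaction snippets are invented.

```python
# Sketch of a simple interaction-driven focus model: text the analyst has
# recently interacted with is pooled into a "focus" query, and documents are
# ranked by TF-IDF cosine similarity to it. This is a generic baseline, not
# the focus model described above; corpus and interactions are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Measles outbreak traced to an unvaccinated community.",
    "New firewall rules reduce lateral movement in the network.",
    "Vaccination rates decline in several rural counties.",
    "Quarterly earnings beat analyst expectations.",
]

# Snippets the analyst clicked on or hovered over recently.
recent_interactions = [
    "measles vaccination coverage",
    "outbreak in rural counties",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(corpus)
focus_vector = vectorizer.transform([" ".join(recent_interactions)])

scores = cosine_similarity(focus_vector, doc_vectors).ravel()
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.2f}  {corpus[idx]}")
```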

Participants: SAS, Lori Wachter (LAS)

The Avocado Research Email Collection was used to showcase the methods that the research team developed to increase analyst efficiencies in processing and analyzing unstructured communications data. The data corpus is a set of emails and attachments for 282 accounts, provided as Outlook PST files that store emails, calendar entries, contact details, and related metadata. Due to the volume of unstructured text, the research team demonstrates how an analyst can be augmented by a machine process in order to rapidly prioritize, search, filter, and query massive amounts of data. This video focuses on the various machine learning techniques the team applied to this email data set, including methods such as community detection and complex entity extraction.
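
As a small illustration of one of the techniques named above, the sketch below runs community detection over a toy sender-recipient graph using networkx. The addresses and edge weights are invented; a real pipeline would derive the graph from the parsed PST data.

```python
# Sketch of community detection on an email communication graph, in the
# spirit of the techniques mentioned above. Addresses and edge weights are
# invented; a real pipeline would build the graph from parsed PST data.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

edges = [
    ("alice@avocado.com", "bob@avocado.com", 14),
    ("alice@avocado.com", "carol@avocado.com", 9),
    ("bob@avocado.com", "carol@avocado.com", 11),
    ("dave@avocado.com", "erin@avocado.com", 7),
    ("erin@avocado.com", "frank@avocado.com", 5),
    ("carol@avocado.com", "dave@avocado.com", 1),  # weak cross-group link
]

G = nx.Graph()
G.add_weighted_edges_from(edges)

# Greedy modularity maximization groups frequently communicating accounts.
communities = greedy_modularity_communities(G, weight="weight")
for i, members in enumerate(communities):
    print(f"Community {i}: {sorted(members)}")
```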

Prioritization

Participants: PUNCH Cyber Analytics Group

The goal of RADS is to use the techniques of recommender systems and machine learning to develop a pipeline for prioritizing and recommending data based on its underlying properties and on the interactions of analysts. The video shows how we used publicly available e-commerce data as a proxy for a restricted dataset. It presents a breakdown of the various methodologies that went into the development of the pipeline, including pre-processing, various recommender system algorithms, and result evaluation. Finally, it briefly discusses the application of this pipeline to classified data.
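
The sketch below illustrates one generic step such a pipeline might include: item-item collaborative filtering over an implicit-feedback interaction matrix. It is a baseline for illustration only, not the RADS pipeline itself, and the interaction counts are invented.

```python
# Sketch of item-item collaborative filtering over an implicit-feedback
# interaction matrix (rows: users/analysts, columns: items/data objects).
# This is a generic recommender baseline, not the RADS pipeline itself.
import numpy as np

# Invented interaction counts: 4 users x 5 items.
interactions = np.array([
    [3, 0, 1, 0, 0],
    [2, 0, 0, 1, 0],
    [0, 4, 0, 0, 2],
    [0, 3, 0, 0, 1],
], dtype=float)

# Cosine similarity between item columns.
norms = np.linalg.norm(interactions, axis=0, keepdims=True)
norms[norms == 0] = 1.0
item_sim = (interactions / norms).T @ (interactions / norms)

# Score unseen items for user 0 by similarity to items they interacted with.
user = interactions[0]
scores = item_sim @ user
scores[user > 0] = -np.inf  # do not re-recommend seen items
print("Recommended item index for user 0:", int(np.argmax(scores)))
```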

Extraction

Participants: Dominic Cassidy (Bricolage), Ross Marwood (Bricolage)

The mountain of electronic information being generated each day presents both problems and opportunities for intelligence analysts. The problems are related to the ability of organisations to process the volume and variety of data that is being constantly created. But the same volume and variety also creates new opportunities if the data can be captured and transformed into a useful analytic structure.

One part of the challenge of volume and variety is to capture data from previously difficult and complex sources such as structured forms. The structure of a form is made up of questions, answers and, crucially, the relations between the two. Any system addressing data extraction from forms must be able to extract these three elements reliably. In addition, the hierarchy of a form, as well as question order, provide key information for form understanding.

Current techniques for processing large volumes of forms rely on template matching, where a system has been pre-loaded with a series of standardised templates (e.g. tax forms) and the input form is matched to a specific template. Other work, such as the processing of cheques, invoices and receipts, relies on standardised field names and data formats to reliably extract the corresponding data. These techniques work well for these specific use cases. But for heterogeneous datasets that contain a variety of forms that do not follow a set template or have a set of standard field names and formats, we still rely on manual or semi-automated processing approaches that are slow and resource intensive.

So, the question we set out to answer was: Is it possible to process a corpus of documents containing a variety of form structures using automated techniques to identify, categorize and extract data and related information about structure from those forms, and to provide an output in an accessible format that allows rapid information retrieval and analysis?
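
To ground the question-answer-relation framing, the sketch below shows one possible (and deliberately naive) representation of extracted form elements, plus a proximity heuristic for linking answers to questions. The data structure, coordinates, and heuristic are illustrative assumptions, not the system developed in this project.

```python
# Sketch of a data structure for form-extraction output: questions, answers,
# and the relations linking them, plus a naive proximity heuristic for
# pairing them. Field names, coordinates, and the heuristic are illustrative
# assumptions, not the system described above.
from dataclasses import dataclass

@dataclass
class Element:
    text: str
    kind: str       # "question" or "answer"
    x: float        # layout position (e.g., from an OCR bounding box)
    y: float

def pair_by_proximity(questions, answers):
    """Link each answer to the nearest question on the page."""
    pairs = []
    for a in answers:
        q = min(questions, key=lambda q: (q.x - a.x) ** 2 + (q.y - a.y) ** 2)
        pairs.append((q.text, a.text))
    return pairs

questions = [Element("Full name", "question", 0.1, 0.1),
             Element("Date of birth", "question", 0.1, 0.2)]
answers = [Element("J. Doe", "answer", 0.4, 0.1),
           Element("1990-01-01", "answer", 0.4, 0.2)]

print(pair_by_proximity(questions, answers))
```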

Participants: Dominic Cassidy (Bricolage), Ross Marwood (Bricolage)

The mountain of electronic information being generated each day presents both problems and opportunities for intelligence analysts. The problems are related to the ability of organisations to process the volume and variety of data that is being constantly created. But the same volume and variety also creates new opportunities if the data can be captured and transformed into a useful analytic structure. This presentation examines how to appropriately assess, i.e., score, the data extracted from forms.
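
One simple way to frame such an assessment, sketched below under the assumption of exact-match scoring, is precision, recall, and F1 of extracted (question, answer) pairs against a ground-truth annotation. This is an illustrative baseline, not the scoring method presented here.

```python
# Sketch of one simple way to score form-extraction output: exact-match
# precision, recall, and F1 of extracted (question, answer) pairs against a
# ground-truth annotation. The scoring scheme and data are illustrative
# assumptions, not the assessment method presented above.
def score_pairs(extracted, ground_truth):
    extracted, ground_truth = set(extracted), set(ground_truth)
    tp = len(extracted & ground_truth)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

truth = [("Full name", "J. Doe"), ("Date of birth", "1990-01-01")]
found = [("Full name", "J. Doe"), ("Date of birth", "1990-01-61")]  # OCR error
print(score_pairs(found, truth))  # (0.5, 0.5, 0.5)
```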

Retention

Participants: Ian Taylor (Col. St.), Brenda Betancourt (UFlorida), Andee Kaplan (Col. St.)

Record linkage is the task of combining records from multiple files which refer to overlapping sets of entities when there is no unique identifying field in the records. In streaming record linkage, files arrive in time and estimates of links are desired after the arrival of each file. This problem arises in settings such as longitudinal surveys. The challenge in streaming record linkage is efficiently updating parameter estimates as new files arrive. We approach the problem from a Bayesian perspective with estimates in the form of posterior samples of parameters and present a method for updating link estimates after the arrival of a new file that is faster than starting an MCMC from scratch. We generalize a Bayesian Fellegi-Sunter model for two files and compare two methods for streaming sample updates. We examine the effect of prior distribution and strength of prior information on the resulting estimates.
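
For context, the sketch below computes classical Fellegi-Sunter match weights: each field comparison contributes a log-likelihood-ratio weight, and the summed weight ranks candidate record pairs. The m/u probabilities are invented for illustration; the project estimates such quantities within a Bayesian framework via MCMC rather than fixing them by hand.

```python
# Sketch of the core Fellegi-Sunter idea behind the Bayesian model discussed
# above: each field comparison contributes a log-likelihood-ratio weight,
# log(m/u) on agreement and log((1-m)/(1-u)) on disagreement, and the summed
# weight ranks candidate record pairs. The m/u probabilities here are
# invented for illustration; the project estimates such quantities via MCMC.
import math

# Per-field probabilities: m = P(agree | true match), u = P(agree | non-match)
fields = {
    "surname":    {"m": 0.95, "u": 0.01},
    "birth_year": {"m": 0.90, "u": 0.05},
    "zip_code":   {"m": 0.85, "u": 0.10},
}

def match_weight(agreement):
    """Sum log-likelihood-ratio weights over field agreement indicators."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = fields[field]["m"], fields[field]["u"]
        total += math.log(m / u) if agrees else math.log((1 - m) / (1 - u))
    return total

# Candidate pair agreeing on surname and birth year but not zip code.
print(match_weight({"surname": True, "birth_year": True, "zip_code": False}))
```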

Analyst Behavior

Participants: Brent Harrison (UKentucky), Stephen Ware (UKentucky), Chengxi Li (UKentucky), Anton Vinogradov (UKentucky)

In this presentation, we explore how analyst workflows can be hierarchically represented in graph form for the purpose of better understanding how analysts approach novel problems. This will enable us to provide guidance to analysts or to develop interventions that help them explore novel datasets. We built a visualization of the analytic process from the bottom up, using logs from real analysts playing a serious game. We cluster events and show how analysts move from one task to another. We evaluated the graph by showing that we can train a model to predict what action an analyst will take next based on their history.
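
As a simple point of reference, the sketch below builds a first-order transition model from invented action logs and predicts the most likely next action. The project's hierarchical graph and trained predictive model are more sophisticated; this is only a minimal stand-in to show the log-to-graph-to-prediction flow.

```python
# Sketch of a first-order Markov model over analyst action logs: build a
# transition graph by counting consecutive actions, then predict the most
# likely next action. The log and action names are invented; the project's
# hierarchical graph and trained model are more sophisticated than this.
from collections import Counter, defaultdict

logs = [
    ["search", "filter", "read", "annotate", "search", "filter", "read"],
    ["search", "read", "annotate", "report"],
]

transitions = defaultdict(Counter)
for session in logs:
    for current, nxt in zip(session, session[1:]):
        transitions[current][nxt] += 1

def predict_next(action):
    """Return the most frequently observed follow-on action."""
    return transitions[action].most_common(1)[0][0] if transitions[action] else None

print(predict_next("search"))  # -> 'filter'
print(predict_next("read"))    # -> 'annotate'
```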

Participants: Alvitta Ottley (WUSTL), Jordan Crouser (Smith), Sunwoo Ha (WUSTL)

We routinely use interactive visualization tools to support the analyst's workflow and hypothesis inquiries. However, the ongoing development of visual analytics tools for insertion into the analytical workflow creates a need to establish an evaluation infrastructure for such systems. Creating such a framework requires a thorough understanding of the analytic process and empirical measures of how analysts leverage existing visualization tools in their workflows. To this end, we present the results of a qualitative user study that captures the investigative behaviors of analysts as they examine social media posts to understand and characterize the spread of an epidemic outbreak in the fictional city of Vastapolis. Our findings aim to provide an evidence-based understanding of human–visualization performance.