Coleridge Initiative

Show US the Data

About Rich Context

Our goal is to develop tools that show how data are used to produce evidence. The results can be used by agencies to show how their data are used, by researchers and analysts to collaborate and share code, and by academic institutions, publishers and editors to find experts.

Rich Context is aimed at building a knowledge community around government data using AI tools. It is not feasible for agencies to provide high levels of support for the entire array of federal datasets, which are typically not well documented and consequently can be difficult to use. Agencies often do not know who is using their data or exactly how the data are being used, if at all. Coleridge has worked with a number of agencies, and with over 1,600 teams in a high-profile Kaggle data competition, to develop automated ways of finding out how any given dataset (e.g., NOAA climate data or USDA data on food security or crops) has been used. Modern machine learning (ML) and natural language processing (NLP) techniques can be used to move the data dissemination system to an equilibrium that can achieve the government’s goals.

There are four steps:

  1. The first, and critical, step is to use ML/NLP techniques to find datasets that are used in publications, beginning with, but not restricted to, scientific research. The approach can be expanded to any text document, such as government reports, Federal Register notices, or newspaper articles.

  2. The second step is to characterize the dataset ecosystem or its use by mining the associated publications to identify the topics, authors, and visibility of the work.

  3. The third is to characterize the use with metrics that can convey the value of the dataset to decision makers and to the public at large.  

  4. The fourth step is to create incentives for publishers, government agencies, the public and academic researchers to continuously update and validate the data ecosystem and associated.  The goal is to create a self-reinforcing “information marketplace” – an for data – so that knowledge about how data are used can be shared by all stakeholders.
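
As a rough illustration of the first two steps, the sketch below finds dataset mentions with a plain alias lookup and then gathers the authors and topics of the matching publications. It is a deliberately naive stand-in for the ML/NLP models described above, not the Coleridge or Kaggle method; the alias table, the `Publication` fields, and the sample records are all hypothetical.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical alias table: canonical dataset name -> surface forms seen in text.
DATASET_ALIASES = {
    "USDA Food Security Data": ["food security supplement", "usda food security"],
    "NOAA Climate Data": ["noaa climate data"],
}


@dataclass
class Publication:
    title: str
    authors: list
    topics: list
    text: str


def find_datasets(text):
    """Step 1 (naive): return canonical names of datasets whose aliases appear in text."""
    lowered = text.lower()
    found = set()
    for name, aliases in DATASET_ALIASES.items():
        if any(re.search(r"\b" + re.escape(alias) + r"\b", lowered) for alias in aliases):
            found.add(name)
    return found


def characterize(publications):
    """Step 2: collect authors, topics, and titles for each dataset that is mentioned."""
    profile = defaultdict(lambda: {"authors": set(), "topics": set(), "publications": []})
    for pub in publications:
        for name in find_datasets(pub.text):
            entry = profile[name]
            entry["authors"].update(pub.authors)
            entry["topics"].update(pub.topics)
            entry["publications"].append(pub.title)
    return dict(profile)


# Invented sample records, for illustration only.
SAMPLE = [
    Publication("Household hunger trends", ["A. Rivera"], ["nutrition"],
                "We use the Food Security Supplement to estimate prevalence."),
    Publication("Storm damage and crop yields", ["B. Chen", "A. Rivera"], ["agriculture"],
                "Estimates draw on NOAA climate data and the USDA food security series."),
    Publication("A survey of survey methods", ["C. Osei"], ["methods"],
                "This review uses no administrative data."),
]

profile = characterize(SAMPLE)
```

A real system would replace `find_datasets` with a trained model, since publications rarely cite datasets by a fixed, predictable name.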

The goal is to produce a metadata platform, easily tailored to individual agency needs, that has the right incentive structure. With the AI tools, agencies can easily produce information about data use, build usage scorecards with drill-down capabilities, and get attention for their efforts. We would produce an API based on dataset-publication dyads (pairs linking a dataset to a publication that uses it), providing a major digital enhancement for the public’s interactions with the agency. Training programs based on agency data can use the rich context to enable experts to connect with each other, get visibility and form networks of interest. Researchers will get visibility for highlighting their use of data; publishers can provide information to editors and reviewers about experts; data repositories can demonstrate the value of their efforts. By creating a value proposition for all participants, we will be producing a sustainable self-reinforcing incentive system.
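
One way to picture dataset-publication dyads and a usage scorecard with drill-down is sketched below. The `Dyad` fields, the scorecard layout, and the sample rows are assumptions made for illustration; the source does not describe the actual API schema.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Dyad(NamedTuple):
    """Hypothetical record linking one dataset to one publication that uses it."""
    dataset: str
    publication: str
    year: int
    agency: str


def scorecard(dyads):
    """Summarize usage per dataset: distinct publications and a per-year breakdown."""
    pubs = defaultdict(set)
    by_year = defaultdict(Counter)
    for d in dyads:
        # Count each publication once per dataset, even if the dyad is repeated.
        if d.publication not in pubs[d.dataset]:
            pubs[d.dataset].add(d.publication)
            by_year[d.dataset][d.year] += 1
    return {
        dataset: {"publications": len(titles), "by_year": dict(by_year[dataset])}
        for dataset, titles in pubs.items()
    }


def drill_down(dyads, dataset, year):
    """Return the sorted publication titles behind one scorecard cell."""
    return sorted({d.publication for d in dyads if d.dataset == dataset and d.year == year})


# Invented sample rows, including one duplicate dyad.
DYADS = [
    Dyad("NOAA Climate Data", "Coastal flood risk", 2020, "NOAA"),
    Dyad("NOAA Climate Data", "Storm damage and crop yields", 2021, "NOAA"),
    Dyad("NOAA Climate Data", "Storm damage and crop yields", 2021, "NOAA"),
    Dyad("USDA Food Security Data", "Storm damage and crop yields", 2021, "USDA"),
]

card = scorecard(DYADS)
```

Keeping the dyad as the basic unit means the same records can back both the summary counts and the drill-down view without a separate data model.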

The Show US the Data conference was intended to be the first step in creating a new data ecosystem that uses modern tools and technologies.

Related resources: the winning Kaggle models, a SPARC newsletter article, the paper, and the book Rich Search and Discovery for Research Datasets.

What Stakeholders are Saying

Suzette Kent, former Federal CIO
We say that federal data is a strategic asset and important for use in policy making, research, the economy and serving American citizens. We now have a fact-driven pathway discovering where and how data is used for research and in turn, why the data matter. Excited to advance the ways that we prove the real impact of federal data!
Spiro Stefanou, Administrator, USDA Economic Research Service
All of us want to understand the impact of what we do. This is an evidence-based approach to showing the impact of data.
Nancy Potok, former Chief Statistician
This game-changing capability will not only help agencies meet statutory mandates, it will provide both agencies and researchers with incredibly useful information at their fingertips in real time. It fills a massive gap in our knowledge of how federal data are actually being used.