Coleridge Initiative

Show US the Data

October 20, 2021


Evidence must be grounded in data and science. New legislation – the Foundations for Evidence-Based Policymaking Act – requires agencies to report on the usage of their data, produce data inventories, and expand the public use of their data. The Office of Management and Budget requires data for evidence building across all agencies.

Yet critical questions must be answered. How can federal agencies show how their data are used? How can they make decisions about what data to invest in? How can researchers, academic institutions, and publishers help build data and evidence to better inform policy? 

Agencies have acted. They challenged over 1600 data science teams from across the world to develop artificial intelligence and machine learning approaches to identify how data are being used in written text. This conference is designed to show the results. It brings together challenge winners, scientific journal publishers, the philanthropic foundations and government agencies who supported the competition, as well as the Government Accountability Office, the US General Services Administration, the federal Chief Data Officer Council, and the research community. Data scientists will summarize the machine learning models that, combined with text, produce usage statistics that can be fed into an API. The sponsoring agencies will show examples of tools that can help them produce dashboards, inventories and analyses.

The conference is also intended to be the first step in creating a new data ecosystem that uses modern tools and technologies. Agencies can publicly show which data are most used and for what purpose, researchers and academic institutions can receive credit for promoting and supporting data use, and publishers can foster the reproducibility and replicability of empirical research. The tools can be expanded to include the usage of data in government reports, in the press, and by the public at large. Participants will have the opportunity to help shape this vision. More details are provided in this SPARC newsletter, on the Coleridge Initiative website here, or in an early draft of an academic paper here.

Three agencies – NSF, USDA and NOAA – as well as CHORUS partnered with the Coleridge Initiative to develop easy-to-use, evidence-based, automated approaches that government agencies can use to document and understand the public use of their data for research. 

Click here for an example of a possible API. 


All times are EDT

10:00 AM - 10:10 AM
Welcome and Context

“We say that federal data is a strategic asset and important for use in policy making, research, the economy and serving American citizens.  We now have a fact-driven pathway discovering where and how data is used for research and in turn, why the data matter.  Excited to advance the ways that we prove the real impact of federal data!”

10:10 AM - 10:20 AM
Keynote Speech - Data, Evidence and Policy

10:20 AM - 10:30 AM
Show US the Data
Machine Learning, Natural Language Processing and Data Science

Julia Lane, Coleridge Initiative

“This conference has the potential to create a new public service – a public ‘’ for data because it brings together public agencies, researchers, academic institutions and publishers to celebrate what has been achieved and identify a future roadmap.”

10:30 AM - 10:40 AM
The Winning Methods

Rayid Ghani

Presenting an Overview of the Winning Methods

Nguyen Tuan Khoi & Nguyen Quan Anh Minh

Mikhail Arkhipov

10:40 AM - 11:00 AM
The Agency Dashboards

Spiro Stefanou

USDA – Spiro Stefanou, Administrator, United States Department of Agriculture Economic Research Service

NSF – Dorothy Aronson, Chief Data Officer

NSF – Vipin Arora, Deputy Director, National Center for Science and Engineering Statistics

“All of us want to understand the impact of what we do. This is an evidence based approach to showing the impact of data” ~ Spiro Stefanou

“This approach will empower USDA researchers to find out what work has been done using our data in a way that is efficient, relevant and timely” ~ Mark Denbaly

11:00 AM - 11:30 AM
Reactions from Stakeholders

Moderator: Tim Persons, Chief Scientist, GAO

This conference is intended to engage stakeholders (researchers, academic institutions, publishers and federal agencies) in discussing new approaches to documenting the use of agency data assets. We conducted a series of panels to gather reactions, identify strengths and weaknesses, and suggest ways to incorporate researcher feedback. To view the questions asked, click here.

11:30 AM - 11:45 AM
Audience Discussion


Functionality and Access

Next Steps

11:45 AM - 12:00 PM
Next Steps

Nancy Potok

“This game-changing capability will not only help agencies meet statutory mandates, it will provide both agencies and researchers with incredibly useful information at their fingertips in real time. It fills a massive gap in our knowledge of how federal data are actually being used.”


Khôi and Minh are colleagues at VNG. Their areas of expertise include natural language and speech processing, as well as deploying machine learning models to real-world applications.

Chun Ming Lee

Transformer-Enhanced Heuristic Search

Chun Ming is a Data Scientist active in the Singaporean startup scene. He’s also worked as a Management Consultant at McKinsey & Co. and as a Software Developer in finance. Lee earned his MBA from London Business School and Bachelor’s in Computer Science from Carnegie Mellon University. 

Mikhail Arkhipov

Pure Pattern Matching

Mikhail Arkhipov is from Moscow, Russia. Since 2017, he has worked on open-source NLP tools and conducted research on Multilingual Transfer Learning and Named Entity Recognition.