Coleridge Initiative

Rich Context Workshop


The new Rich Context leaderboard competition has launched on GitHub. Please compete, and advertise widely!

The focus of the Rich Context workshop is to build a scientific basis for the empirical foundations of data science in government. Empirical research relies critically on knowing how data have been produced and used before: the required elements include what the data measure, what research has been done by which researchers, with what code, and with what results. The interest of funders in supporting data science, combined with the recent passage of the Evidence-Based Policymaking Act and the launch of the federal data strategy, makes this an opportune time for such a workshop.


When:  November 15-16, 2019
Where: The National Press Club
              529 14th St. NW, 13th Floor
              Washington, DC 20045
Hotel: We have a room block at the Sofitel Washington DC Lafayette Square located at 806 15th St NW, Washington, DC 20005. The room rates are $259 per night plus tax. Please reach out to Sofitel directly at (202) 730-8800 no later than Tuesday, October 15. Please ask for group reservations and identify yourself as a member of the Rich Context Workshop.


We expect the outcomes of the workshop to be:
1. A roadmap that will identify the opportunities, gaps, and necessary investments.
2. The development of an interdisciplinary community of computer scientists, life scientists, and social scientists who can work together to address the problems.
3. The engagement of key stakeholders, notably funding agencies and government agencies.

Here is our final report on the workshop.


Day 1
09:00 Opening and introduction
Visualizing the Workshop Goals
Attendees post unconference-session topics based on the goals / mingle
10:00 Lightning Talks (5 minutes plus 1 question from floor)
10:30 Unconference session (4x themes) “free for all” topics
11:30 Unconference session (4x themes) “free for all” topics
12:30 Lunch (Report Back and General Discussion)
13:30 Lightning Talks
14:00 Structured session (4x, tied to goals)
17:00 Wrap-up/disband

Day 2
09:00 Summary and core foci for the day
09:30 Lightning Talks
10:30 Structured session (structure these to feed wrap-up session?)
11:30 Closing session
12:00 Disband (informal meet afterwards at ANXO)

Workshop Goals

Goal #1: Identify compelling use cases that would be transformed by access to dataset search and discovery tools (starting from Evidence-Based Policymaking)
Goal #2: Take stock of existing practices and identify the gaps for the following:

      Goal #2.1: entity linking and coreference (ML/NLP research)
      Goal #2.2: metadata exchange: persistent identifiers, crosswalks, data dictionary discovery, translation between W3C and ISO standards, etc.
      Goal #2.3: knowledge graph representation and inference
      Goal #2.4: how the resulting metadata can be used to augment scholarly infrastructure (SAGE Pub, MIT Press, RePEc, ResearchGate, etc.)
      Goal #2.5: human-in-the-loop approaches: semi-supervised learning, weak supervision, and other variations; plus, how to incorporate authorized contributors

Goal #3: Catalyze a community that works together to integrate open source projects for common needs in data/metadata infrastructure (JupyterLab, spaCy, PyTorch, rdflib, Egeria, W3C standards for metadata, Amundsen and its emerging category, etc.)
Goal #4: Identify where we need centralized services (e.g., a global repository of datasets, having persistent identifiers) to complete the knowledge lifecycle.
Goal #5: Define a platform (akin to Amazon, Etsy, LinkedIn) for the initial use cases, which can be broadly adopted:

       – What are the desiderata?
       – What does an implementation look like?
       – How readily can that be implemented?

Goal #6: Generate business model(s) that can be seeded with initial research-funding support and subsequently become self-sustaining.

Lightning Talks - Request for Submissions

Share your work with us in a Lightning Talk that lasts exactly five minutes. We’re looking for a concise presentation, with or without slides, on a topic of your choice: current research or projects, a work-in-progress, an idea you’d love to find collaborators for, or a lesson learned. Interested? Submit a talk proposal before Monday, October 7. We’ll notify you by Friday, October 18, and if your talk is chosen, you’ll need to provide slides by Monday, November 11. We’ll have three half-hour Lightning Talks sessions during the Rich Context Workshop.

Important Dates
October 7: Submissions close
October 18: Notifications
November 11: Slides due


There is substantial interest in building an empirical basis for evidence-based policy. Doing so involves learning what data have been used by which experts to examine which topics, building better search and discovery tools for finding those datasets and experts, and building a platform that will improve those tools and disseminate the knowledge to the scientific and policy-making communities.

The technical capacity to build such a platform now exists, as demonstrated by a successful recent competition. Computer scientists and domain experts in the life sciences have developed the scientific underpinnings necessary to build each component: document corpus development, ontology development for dataset entity classification, natural language processing and machine learning models for dataset entity extraction, graph models for improving search and discovery, and telemetry to capture dataset engagement and use. This workshop will bring together scientists who are working at the cutting edge of knowledge in each of these areas, policy makers who need the results of their work, and funding agencies that have historically supported these efforts.

The scientific committee includes: Stefan Bender, Deutsche Bundesbank; Julia Lane, New York University; Paco Nathan; Ian Mulvany, SAGE Publications; Christine Borgman, UCLA; and Waleed Ammar, Allen Institute for Artificial Intelligence. The workshop will be convened as an unconference in the “Foo Camp” style.

Most of this work focuses on complex data governance for data science workflows involving sensitive data in regulated environments. We believe that current efforts in open source projects are contributing to much-improved dataset modeling and knowledge graph applications for these purposes.


This workshop is generously funded by Eric and Wendy Schmidt at the recommendation of Schmidt Sciences, the Alfred P. Sloan Foundation, the Overdeck Family Foundation, and the National Science Foundation.