Coleridge Initiative

Rich Context

The new Rich Context leaderboard competition has launched on GitHub. Please compete, and advertise widely! Read more about this effort here.

Goal

The goal of the Rich Context project is to create a new platform that enables empirical analysts to search for and discover datasets.

The challenge that empirical researchers face is that, for a given dataset, it is difficult to find out who has worked with the data before, what methods and code were used, and what results were produced.

Some estimates suggest that more than a third of analysts’ time is spent finding out about data rather than on model development and production, and the Federal Data Strategy has directed government agencies to streamline access to federal data assets.

The scientific capacity now exists to build a platform that will automate search and discovery. A competition brought together computer scientists from around the globe to develop models that would identify and extract dataset mentions from full-text publications. The finalists convened in New York City in February 2019 to present their work. A follow-on workshop brought together more than 70 international experts to identify gaps and develop an operational roadmap.

A recently published SAGE Publications book edited by Julia Lane, Ian Mulvany, and Paco Nathan provides an overview of initial work in each area. Download a free copy here!

Technical Approach

Much work has already been done

The Coleridge Initiative, in cooperation with Project Jupyter, the Deutsche Bundesbank, several federal agencies, and Derwen.ai, has developed the following set of key inputs.

  1. A Python library richcontext-scholapi which provides API integrations for federating discovery services and metadata exchange across multiple scholarly infrastructure providers. Additional APIs are constantly being added to the library.

  2. A Knowledge Graph of known links between datasets, research publications, researchers and experts, and projects within the sciences.

  3. A leaderboard competition inviting NLP research teams to develop models to automatically detect dataset usage from the text of research publications.

    Model development relies on a corpus of known linkages between research publications and the datasets used to produce the research. The corpus continues to expand in size and scope, as new linkages, which are validated by domain experts, are added.

    Teams use a subset of the corpus as a training set to develop their models. Using another subset of the corpus as a test set, teams can test and improve their models. Our leaderboard evaluates and scores model performance; scores are updated as teams release new and improved model versions. Read more about how models are evaluated on our Competition Wiki.

    Teams are also contributing to shared tools for PDF retrieval and text extraction. The competition is open to the public.

  4. A set of Jupyter extensions which treat datasets as top-level constructs. Features like the Data Registry, the Metadata Explorer, and Commenting enable users to register datasets within research projects; browse metadata on their datasets while they work; and comment and collaborate in real time on shared projects and datasets. The eventual goal is to enrich the knowledge graph with information gleaned from comments, collaboration, and computing within the Jupyter environment, and to develop a recommendation engine for datasets and code.

    Figure: Data Registry (left) and Metadata Browser (right) in JupyterLab. (Credit: Project Jupyter)

    Figure: Demo of Linked Data generated by a user appearing in the Jupyter Data Registry. (Credit: Project Jupyter)

    Figure: Users can comment on datasets and interact with others in the same project. (Credit: Project Jupyter)

  5. A Human-in-the-Loop pilot that enables scientists to validate and contribute metadata on research. In collaboration with RePEc (Research Papers in Economics), RePEc’s community of contributors will validate outputs from the machine learning models that are developed during the competition.

Read more about this work on RePEc’s blog: New initiative to help with discovery of dataset use in scholarly work, by Christian Zimmermann.
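
To make the idea of a corpus of known linkages more concrete, the following minimal Python sketch shows one way a model's predicted publication-to-dataset links could be compared against validated links. Everything in it is an illustrative assumption: the identifiers (`pub-001`, `dataset-A`), the `score_predictions` helper, and the choice of precision, recall, and F1 are hypothetical and are not taken from the competition. The metric the leaderboard actually uses is described on the Competition Wiki and may differ.

```python
from dataclasses import dataclass

# Hypothetical validated corpus: publication id -> dataset ids used in it.
# These identifiers are made up for illustration only.
KNOWN_LINKS = {
    "pub-001": {"dataset-A", "dataset-B"},
    "pub-002": {"dataset-C"},
    "pub-003": set(),  # a publication with no known dataset usage
}

# Hypothetical model output in the same shape as the corpus.
PREDICTED_LINKS = {
    "pub-001": {"dataset-A"},
    "pub-002": {"dataset-C", "dataset-D"},
    "pub-003": set(),
}


@dataclass(frozen=True)
class Score:
    precision: float
    recall: float
    f1: float


def as_pairs(links):
    """Flatten a publication -> datasets mapping into (publication, dataset) pairs."""
    return {(pub, ds) for pub, datasets in links.items() for ds in datasets}


def score_predictions(known, predicted):
    """Compare predicted links with known links (assumed metric: precision/recall/F1)."""
    truth = as_pairs(known)
    guesses = as_pairs(predicted)
    hits = len(truth & guesses)  # links that are both predicted and validated
    precision = hits / len(guesses) if guesses else 0.0
    recall = hits / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return Score(precision, recall, f1)


score = score_predictions(KNOWN_LINKS, PREDICTED_LINKS)
print(score)
```

In this toy example the model finds two of the three validated links and also proposes one link that is not in the corpus. A link of that kind is what a human-in-the-loop step could either confirm as a new linkage or reject.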

Get Involved

Are you a researcher? Tell us about your recent publications and the datasets you used by filling out our Rich Context Publication Submission Form. Researchers from any domain are encouraged to submit. Your submission will be added to our Knowledge Graph and will help other researchers and data users benefit from your work.

Questions? Want to get involved in other ways? Get in touch!

Presentations