Coleridge Initiative

Rich Context - Technical Details

Technical Approach

Much work has already been done

The Coleridge Initiative, in cooperation with Project Jupyter, the Deutsche Bundesbank, several federal agencies, and Derwen.ai, has developed the following set of key inputs.

  1. A Python library, richcontext-scholapi, which provides API integrations for federating discovery services and metadata exchange across multiple scholarly infrastructure providers. Support for additional APIs is added to the library on an ongoing basis.

  2. A Knowledge Graph of known links between datasets, research publications, researchers and experts, and projects within the sciences.

  3. A leaderboard competition inviting NLP research teams to develop models to automatically detect dataset usage from the text of research publications.

    Model development relies on a corpus of known linkages between research publications and the datasets used to produce the research. The corpus continues to expand in size and scope, as new linkages, which are validated by domain experts, are added.

    Teams use a subset of the corpus as a training set to develop their models. Using another subset of the corpus as a test set, teams can test and improve their models. Our leaderboard evaluates and scores model performance; scores are updated as teams release new and improved model versions. Read more about how models are evaluated on our Competition Wiki.

    Teams are also contributing to shared tools for PDF retrieval and text extraction. The competition is open to the public.

  4. A set of Jupyter extensions which treat datasets as top-level constructs. Features like the Data Registry, the Metadata Explorer, and Commenting enable users to register datasets within research projects; browse metadata on their datasets while they work; and comment and collaborate in real time on shared projects and datasets. The eventual goal is to enrich the knowledge graph with information gleaned from comments, collaboration, and computing within the Jupyter environment, and to develop a recommendation engine for datasets and code.
    Figure: Data Registry (left) and Metadata Browser (right) in JupyterLab. Credit: Project Jupyter.

    Figure: Demo of linked data generated by a user appearing in the Jupyter Data Registry. Credit: Project Jupyter.

    Figure: Users can comment on datasets and interact with others in the same project. Credit: Project Jupyter.

  5. A Human-in-the-Loop pilot that enables scientists to validate and contribute metadata on research. The pilot is a collaboration with RePEc (Research Papers in Economics): RePEc’s community of contributors will validate outputs from the machine learning models that are developed during the competition.

Read more about this work on RePEc’s blog: “New initiative to help with discovery of dataset use in scholarly work,” by Christian Zimmermann.
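The competition workflow described above (a corpus of known publication-to-dataset linkages, a training subset, a test subset, and a leaderboard score) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the identifiers are invented, the function names `split_corpus` and `score` are hypothetical, and precision, recall, and F1 over (publication, dataset) pairs merely stand in for the actual evaluation method, which is documented on the Competition Wiki.

```python
import random

# Hypothetical corpus: each publication ID maps to the set of dataset IDs
# that domain experts have confirmed it uses. All identifiers are invented.
corpus = {
    "pub-001": {"dataset-A", "dataset-B"},
    "pub-002": {"dataset-C"},
    "pub-003": {"dataset-A"},
    "pub-004": {"dataset-B", "dataset-D"},
    "pub-005": set(),
}


def split_corpus(corpus, test_fraction=0.4, seed=0):
    """Split the corpus into disjoint training and test subsets."""
    pub_ids = sorted(corpus)
    random.Random(seed).shuffle(pub_ids)  # reproducible shuffle
    n_test = max(1, round(len(pub_ids) * test_fraction))
    test_ids = set(pub_ids[:n_test])
    train = {p: corpus[p] for p in pub_ids if p not in test_ids}
    test = {p: corpus[p] for p in pub_ids if p in test_ids}
    return train, test


def score(predicted, known):
    """Return (precision, recall, F1) over (publication, dataset) pairs.

    This metric is an assumption chosen for illustration; it is not
    necessarily the one used by the leaderboard.
    """
    pred_pairs = {(p, d) for p, ds in predicted.items() for d in ds}
    true_pairs = {(p, d) for p, ds in known.items() for d in ds}
    hits = len(pred_pairs & true_pairs)
    precision = hits / len(pred_pairs) if pred_pairs else 0.0
    recall = hits / len(true_pairs) if true_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


train, test = split_corpus(corpus)

# A made-up model output for two publications, scored against known links.
known = {"pub-001": {"dataset-A", "dataset-B"}, "pub-002": {"dataset-C"}}
predicted = {"pub-001": {"dataset-A", "dataset-X"}, "pub-002": {"dataset-C"}}
precision, recall, f1 = score(predicted, known)
```

In this sketch a team would fit its model on `train` only and report scores on `test`, so that a new model version can be re-scored against the same held-out linkages.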

Presentations