Rich Context - Technical Details
Technical Approach
Much work has already been done
The Coleridge Initiative, in cooperation with Project Jupyter, the Deutsche Bundesbank, several federal agencies, and Derwen.ai, has developed the following set of key inputs.
- A Python library, richcontext-scholapi, that provides API integrations for federating discovery services and metadata exchange across multiple scholarly infrastructure providers. Support for additional APIs continues to be added to the library.
- A Knowledge Graph of known links between datasets, research publications, researchers and experts, and projects within the sciences.
- A leaderboard competition inviting NLP research teams to develop models to automatically detect dataset usage from the text of research publications.
Model development relies on a corpus of known linkages between research publications and the datasets used to produce the research. The corpus continues to expand in size and scope, as new linkages, which are validated by domain experts, are added.
Teams use a subset of the corpus as a training set to develop their models. Using another subset of the corpus as a test set, teams can test and improve their models. Our leaderboard evaluates and scores model performance; scores are updated as teams release new and improved model versions. Read more about how models are evaluated on our Competition Wiki.
Teams are also contributing to shared tools for PDF retrieval and text extraction. The competition is open to the public.

- A set of Jupyter extensions that treat datasets as top-level constructs. Features like the Data Registry, the Metadata Explorer, and Commenting enable users to register datasets within research projects, browse metadata on their datasets while they work, and comment and collaborate in real time on shared projects and datasets. The eventual goal is to enrich the knowledge graph with information gleaned from comments, collaboration, and computing within the Jupyter environment, and to develop a recommendation engine for datasets and code.



- A Human-in-the-Loop pilot that enables scientists to validate and contribute metadata on research. In collaboration with RePEc (Research Papers in Economics), RePEc’s community of contributors will validate outputs from the machine learning models that are developed during the competition.
Read more about this work on RePEc’s blog: New initiative to help with discovery of dataset use in scholarly work, by Christian Zimmermann.
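The competition workflow described above (a corpus of publication-to-dataset linkages, split into a training subset and a test subset, with predictions scored for a leaderboard) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the identifiers, the `split_corpus` and `score` helpers, the split ratio, and the choice of precision, recall, and F1 as metrics. The actual evaluation method is described on the Competition Wiki and may differ.

```python
import random
from collections import Counter

# Hypothetical corpus: each publication ID maps to the set of datasets it uses.
# All IDs are invented for illustration.
corpus = {
    "pub-001": {"dataset-A"},
    "pub-002": {"dataset-A", "dataset-B"},
    "pub-003": {"dataset-C"},
    "pub-004": {"dataset-B"},
    "pub-005": {"dataset-C", "dataset-D"},
}


def split_corpus(corpus, test_fraction=0.4, seed=0):
    """Shuffle publication IDs and split the corpus into train and test subsets."""
    pub_ids = sorted(corpus)
    random.Random(seed).shuffle(pub_ids)
    n_test = max(1, round(len(pub_ids) * test_fraction))
    test_ids = set(pub_ids[:n_test])
    train = {p: corpus[p] for p in pub_ids if p not in test_ids}
    test = {p: corpus[p] for p in pub_ids if p in test_ids}
    return train, test


def score(predicted, gold):
    """Micro-averaged precision, recall, and F1 over (publication, dataset) pairs.

    The metric choice is an assumption, not the competition's documented method.
    """
    pred_pairs = {(p, d) for p, ds in predicted.items() for d in ds}
    gold_pairs = {(p, d) for p, ds in gold.items() for d in ds}
    true_positives = len(pred_pairs & gold_pairs)
    precision = true_positives / len(pred_pairs) if pred_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


train, test = split_corpus(corpus)

# Trivial stand-in for a model: predict the most frequent training dataset
# for every test publication.
most_common = Counter(d for ds in train.values() for d in ds).most_common(1)[0][0]
predicted = {pub_id: {most_common} for pub_id in test}

print(score(predicted, test))
```

A real entry would replace the frequency baseline with an NLP model that reads the publication text; the split and scoring steps would keep the same shape.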
Related Readings
- Rich Search and Discovery for Research Datasets: Building the next generation of scholarly infrastructure. Julia Lane, Ian Mulvany, and Paco Nathan. Forthcoming.
- Google data set search. Ian Mulvany, SAGE Publishing, November 19, 2019.
- Human-in-the-loop AI for scholarly infrastructure. Paco Nathan, Derwen.ai, September 14, 2019.
- Federal Data Strategy Action Plan. U.S. Federal Data Strategy.
- Empty rhetoric over data sharing slows science. Editorial. Nature, June 12, 2017.
- The State of OA: A large-scale analysis of the prevalence and impact of Open Access articles. Piwowar, Heather, et al. doi:10.7287/peerj.preprints.3119v1. August 2, 2017.
- Bringing Citations and Usage Metrics Together to Make Data Count. Cousijn, Helena, et al. doi:10.5334/dsj-2019-009. March 1, 2019.
Presentations
- Graph Realities. Paco Nathan, Derwen.ai. October 4, 2019, Connected Data London, London, England.
- Using open competitions to drive innovation and collaboration. Ian Mulvany, SAGE Publishing. October 17, 2019, FORCE Conference, Edinburgh, Scotland.
- Beyond research data infrastructures: exploiting artificial & crowd intelligence towards building research knowledge graphs. Stefan Dietze, GESIS. Keynote at LWDA2019, October 2, 2019, Berlin, Germany.
- Experiences of the Deutsche Bundesbank: Efforts to strengthen information sharing and dissemination. Stefan Bender, Research Data and Service Center, Deutsche Bundesbank. May 28, 2019, Financial Information Forum of Latin American and Caribbean Central Banks, Lima, Peru.
- Where’s Waldo: Finding datasets in empirical research publications. Julia Lane, Coleridge Initiative. May 22, 2019, Automated Knowledge Base Construction Conference, UMass Amherst, MA.
- Where’s the Data: A New Approach to Social Science Search & Discovery. May 23, 2019, Rev 2 Data Science Leaders Summit.
- Scientific Literature Knowledge Bases. May 22, 2019, UMass Amherst, MA.
Resources
- NLP-Progress
- Open Academic Society Graph
- Social Science Research Network (SSRN)
- Papers with Code
- Research Papers in Economics
- Google Dataset Search
- ORCID
- Force11
- Unpaywall
- ResearchGate
- EuropePMC
- OpenAIRE
- Dimensions.ai
- Semantic Scholar
- arXiv
- Google Scholar
- Scholix
- CrossRef
- DataCite
- Project Freya
- Dissemin
- NIH Data Commons
- Knowledge Futures Group
- Microsoft Project Academic Knowledge
- The Ditchley Project
- SAGE Publishing
- ReplicationWiki