The goal of the Rich Context project is to create a new platform that enables empirical analysts to search for and discover datasets.
The challenge that empirical researchers face is that, for a given dataset, it is difficult to find out who has worked with the data before, what methods and code were used, and what results were produced.
Some estimates suggest that analysts spend more than a third of their time finding out about data rather than developing and running models, and the Federal Data Strategy has directed government agencies to streamline access to federal data assets.
There now exists the scientific capacity to build a platform that will automate search and discovery. A competition brought together computer scientists from around the globe to develop models that identify and extract dataset mentions from full-text publications. The finalists convened in New York City in February 2019 to present their work, and a follow-on workshop brought together more than 70 international experts to identify gaps and develop an operational roadmap.
A recently published book from SAGE Publishing, edited by Julia Lane, Ian Mulvany, and Paco Nathan, provides an overview of initial work in each area. Download a free copy here!
Much work has already been done
The Coleridge Initiative, in cooperation with Project Jupyter, the Deutsche Bundesbank, several federal agencies, and Derwen.ai, has developed the following set of key inputs.
- A Python library, richcontext-scholapi, which provides API integrations for federating discovery services and metadata exchange across multiple scholarly infrastructure providers. Additional APIs are continually being added to the library.
- A Knowledge Graph of known links between datasets, research publications, researchers and experts, and projects within the sciences.
- A leaderboard competition inviting NLP research teams to develop models to automatically detect dataset usage from the text of research publications.
Model development relies on a corpus of known linkages between research publications and the datasets used to produce the research. The corpus continues to expand in size and scope as new linkages, validated by domain experts, are added.
Teams use a subset of the corpus as a training set to develop their models. Using another subset of the corpus as a test set, teams can test and improve their models. Our leaderboard evaluates and scores model performance; scores are updated as teams release new and improved model versions. Read more about how models are evaluated on our Competition Wiki.
Teams are also contributing to shared tools for PDF retrieval and text extraction. The competition is open to the public.
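The leaderboard mechanics described above amount to comparing each team's predicted publication–dataset links against the validated corpus. The project's actual scoring is defined on the Competition Wiki; purely as an illustration, here is a minimal sketch of micro-averaged precision/recall/F1 over predicted links (all identifiers and structures hypothetical):

```python
def score_predictions(gold, predicted):
    """Micro-averaged precision/recall/F1 over (publication, dataset) links.

    gold, predicted: dict mapping publication id -> set of dataset ids.
    These structures are illustrative only; the real leaderboard metric
    is documented on the Competition Wiki.
    """
    tp = fp = fn = 0
    for pub in gold.keys() | predicted.keys():
        g = gold.get(pub, set())
        p = predicted.get(pub, set())
        tp += len(g & p)   # links the model found that are in the corpus
        fp += len(p - g)   # links the model invented
        fn += len(g - p)   # corpus links the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical gold links from the test split vs. one model's output.
gold = {"pub1": {"ds_a", "ds_b"}, "pub2": {"ds_c"}}
pred = {"pub1": {"ds_a"}, "pub2": {"ds_c", "ds_d"}}
p, r, f1 = score_predictions(gold, pred)
```

Micro-averaging pools the counts across publications before computing the ratios, so publications that mention many datasets weigh more than those that mention one.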
- A set of Jupyter extensions which treat datasets as top-level constructs. Features like the Data Registry, the Metadata Explorer, and Commenting enable users to register datasets within research projects; browse metadata on their datasets while they work; and comment and collaborate in real time on shared projects and datasets. The eventual goal is to enrich the knowledge graph with information gleaned from comments, collaboration, and computation within the Jupyter environment, and to develop a recommendation engine for datasets and code.
- A Human-in-the-Loop pilot that enables scientists to validate and contribute metadata on research. In collaboration with RePEc (Research Papers in Economics), RePEc’s community of contributors will validate outputs from the machine learning models that are developed during the competition.
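The knowledge graph among these key inputs can be pictured, in miniature, as a set of subject–predicate–object triples linking publications, datasets, and researchers. The sketch below is a toy illustration of the kind of query such a graph supports ("who has worked with this dataset before?"); every identifier is hypothetical and unrelated to the project's actual schema:

```python
from collections import defaultdict

# Toy triple store of links between publications, datasets, and
# researchers.  All identifiers are made up for illustration.
triples = [
    ("pub:smith2018", "uses_dataset", "ds:census_acs"),
    ("pub:smith2018", "authored_by", "person:smith"),
    ("pub:lee2019",   "uses_dataset", "ds:census_acs"),
    ("pub:lee2019",   "authored_by", "person:lee"),
]

# Index subjects by (predicate, object) so we can look up, e.g.,
# every publication that used a given dataset.
by_object = defaultdict(list)
for s, p, o in triples:
    by_object[(p, o)].append(s)

# Publications that used the (hypothetical) ACS dataset:
pubs = by_object[("uses_dataset", "ds:census_acs")]

# Experts on that dataset: authors of those publications.
authors = {o for s, p, o in triples
           if p == "authored_by" and s in pubs}
```

Chaining the dataset-to-publication lookup with the publication-to-author links is exactly the kind of traversal that lets the platform surface prior users, methods, and code for a dataset.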
Are you a researcher? Tell us about your recent publications and the datasets you used by filling out our Rich Context Publication Submission Form. Researchers from any domain are encouraged to submit. Your submission will be added to our Knowledge Graph and will help other researchers and data users benefit from your work.
Questions? Want to get involved in other ways? Get in touch!
- Rich Search and Discovery for Research Datasets: Building the next generation of scholarly infrastructure. Julia Lane, Ian Mulvany, and Paco Nathan. Forthcoming.
- Google data set search. Ian Mulvany, SAGE Publishing, November 19, 2019
- Human-in-the-loop AI for scholarly infrastructure Paco Nathan, Derwen.ai, September 14, 2019.
- Federal Data Strategy Action Plan U.S. Federal Data Strategy
- Empty rhetoric over data sharing slows science. Editorial, Nature, June 12, 2017.
- The State of OA: A large-scale analysis of the prevalence and impact of Open Access articles. Piwowar, Heather, et al. doi:10.7287/peerj.preprints.3119v1. August 2, 2017.
- Bringing Citations and Usage Metrics Together to Make Data Count. Cousijn, Helena, et al. doi:10.5334/dsj-2019-009. March 1, 2019.
- Graph Realities. Paco Nathan, Derwen.ai. October 4, 2019, Connected Data London, London, England.
- Using open competitions to drive innovation and collaboration. Ian Mulvany, SAGE Publishing. October 17, 2019, FORCE Conference, Edinburgh, Scotland.
- Beyond research data infrastructures: exploiting artificial & crowd intelligence towards building research knowledge graphs. Stefan Dietze, GESIS. Keynote at LWDA2019, October 2, 2019, Berlin, Germany.
- Experiences of the Deutsche Bundesbank: Efforts to strengthen information sharing and dissemination. Stefan Bender, Research Data and Service Center, Deutsche Bundesbank. May 28, 2019, Financial Information Forum of Latin American and Caribbean Central Banks, Lima, Peru.
- Where's Waldo: Finding datasets in empirical research publications. Julia Lane, Coleridge Initiative. May 22, 2019, Automated Knowledge Base Construction Conference, UMass Amherst, MA.
- Where’s the Data: A New Approach to Social Science Search & Discovery. May 23, 2019, Rev 2, Data Science Leaders Summit.
- Scientific Literature Knowledge Bases. May 22, 2019, UMass Amherst, MA.
- Open Academic Society Graph
- Social Science Research Network (SSRN)
- Papers with code
- Research Papers in Economics
- Google Dataset Search
- Semantic Scholar
- Google Scholar
- Project Freya
- NIH Data Commons
- Knowledge Futures Group
- Microsoft Project Academic Knowledge
- The Ditchley Project
- SAGE Publishing