The goal of the Rich Context project is to create a new platform that enables the search and discovery of datasets.
The challenge that empirical researchers face is that it is very difficult to find out, for a given dataset, who has worked with the data before, what methods were used, what was produced, and where those products can be found. Addressing that lack of information consumes an enormous amount of time and energy for many social scientists. A platform that provided automated tools and services to improve search and discovery could greatly facilitate the process—often by passively capitalizing on the accumulated labor of one’s extended research community.
The scientific capacity to build such a platform now exists. Knowledge bases are a fast-growing field, and they provide a natural representation for scientific hypotheses in many domains. Manually curated ontologies and knowledge graphs are already central to scientific work in domains such as biomedical research. However, existing knowledge bases cover only a small fraction of the findings reported in the scientific literature, so the field is developing automated approaches to build these resources faster. The scientific underpinnings needed for those automated approaches include: (i) document corpus development, (ii) ontology development for dataset entity classification, (iii) natural language processing and machine learning models for dataset entity extraction, (iv) graph models for improving search and discovery, and (v) engagement to get scientists to contribute code and knowledge.
Our first step in developing a model was to host a competition in which we asked participants around the globe to develop named entity recognition models that identify and extract dataset mentions from full-text social science publications. Twenty teams from across the world submitted letters of intent, and four finalist teams presented their models in New York in February 2019. Their papers and presentations are available online. The Allen Institute for Artificial Intelligence won the competition.
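To make the task concrete, here is a minimal sketch of what "extracting dataset mentions" means in practice. It is a simple dictionary lookup, not a learned named entity recognition model and not any of the competition entries. The dataset names, the function name, and the sample sentence are all hypothetical placeholders.

```python
import re

# Hypothetical dataset names used only for illustration.
KNOWN_DATASETS = ["Example Household Survey", "Sample Labor Panel"]


def find_dataset_mentions(text, names=KNOWN_DATASETS):
    """Return (start, end, canonical_name) for each known name found in text.

    This is a case-insensitive dictionary match. A trained NER model would
    also be expected to find mentions that are not in any predefined list.
    """
    mentions = []
    for name in names:
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            mentions.append((match.start(), match.end(), name))
    # Sort by position so mentions come back in reading order.
    return sorted(mentions)


if __name__ == "__main__":
    sample = "We use the Example Household Survey and the sample labor panel."
    for start, end, name in find_dataset_mentions(sample):
        print(name, "->", repr(sample[start:end]))
```

A lookup like this only finds names it already knows, which is one reason the competition asked for models that can recognize previously unseen dataset mentions from context.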
The first application of the Rich Context work is our Data Stewardship module, which enables data owners and stewards to manage permissions and access to their datasets. Metadata from this module will interact with Rich Context. For example, information on usage of the datasets, results from analyses using them, and linkage of datasets with other pieces of data will emerge from the Data Stewardship module and be used to enrich the knowledge graph that conveys relationships in the Rich Context module.
Our book on this subject, to be published in November 2019, gives in-depth background on use cases for Rich Context, the development of models and the training data they rely on, ontologies and knowledge graphs, and the technical and cultural considerations around creating tools for metadata development and sustained engagement with the research community.
We are also developing a better training corpus for our next competition and implementing a knowledge graph to produce better recommendations. We’re using existing, known connections between datasets and research publications to build out a structure for the knowledge graph that will be the basis of our Rich Context Explorer.
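As a rough illustration of how known dataset-publication links can support recommendations, the sketch below builds a small two-sided graph and suggests datasets that appear in the same publications as a given dataset. It is a minimal example under stated assumptions, not the project's actual knowledge graph or recommendation method. All identifiers and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (publication, dataset) links; placeholders, not real records.
LINKS = [
    ("pub-1", "dataset-A"),
    ("pub-1", "dataset-B"),
    ("pub-2", "dataset-A"),
    ("pub-2", "dataset-B"),
    ("pub-2", "dataset-C"),
    ("pub-3", "dataset-C"),
]


def build_graph(links):
    """Index the links in both directions: publication -> datasets and back."""
    pubs_to_datasets = defaultdict(set)
    datasets_to_pubs = defaultdict(set)
    for pub, dataset in links:
        pubs_to_datasets[pub].add(dataset)
        datasets_to_pubs[dataset].add(pub)
    return pubs_to_datasets, datasets_to_pubs


def recommend(dataset, pubs_to_datasets, datasets_to_pubs, top_n=3):
    """Rank other datasets by how many publications they share with `dataset`."""
    counts = Counter()
    for pub in datasets_to_pubs.get(dataset, ()):
        for other in pubs_to_datasets[pub]:
            if other != dataset:
                counts[other] += 1
    # Highest shared-publication count first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


if __name__ == "__main__":
    p2d, d2p = build_graph(LINKS)
    print(recommend("dataset-A", p2d, d2p))
```

Counting shared publications is about the simplest possible signal. A fuller graph could also carry authors, methods, and research fields as nodes, which is the kind of added context the project aims to capture.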
HELP BUILD RICH CONTEXT
Research datasets are the bread and butter of your work. We need your help in creating rich context around them to make it easier for you and your colleagues to search for and discover new datasets. Please submit metadata on your work at https://coleridgeinitiative.org/pubsubmission.
FOR MORE INFORMATION
Recent presentations on our project
May 22, 2019, UMass Amherst, MA
Workshop: “Scientific Literature Knowledge Bases”
May 23, 2019, Rev 2 - Data Science Leaders Summit
Presentation: “Where's the Data: A New Approach to Social Science Search & Discovery”