Researchers and analysts who want to use data for evidence and policy cannot easily find out who else worked with the data, on what topics and with what results. As a result, good research is underused, great data go undiscovered and are undervalued, and time and resources are wasted redoing empirical work.
We want you to help us develop and identify the best text analysis and machine learning techniques to discover relationships between data sets, researchers, publications, research methods, and fields. We will use the results to create a rich context for empirical research – and build new metrics to describe data use.
This challenge is the first step in that discovery process.
The goal of this competition is to automate the discovery of research datasets and the associated research methods and fields in social science research publications. Participants should use any combination of machine learning and data analysis methods to identify the datasets used in a corpus of social science publications and to infer both the scientific methods used in the analysis and the research fields of each publication.
The competition has two phases (details below).
First Phase: You will be provided with a listing of datasets and a labeled corpus of 5,000 publications, plus an additional dev fold of 100 publications. Each publication is labeled to indicate which of the datasets from our list are referenced within it and what specific text is used to refer to each dataset. You can use these data to train and tune your algorithms to detect mentions of data in publication text and, when a dataset in our list is mentioned, to tie each mention to the appropriate dataset. A separate corpus of 5,000 labeled publications will be held back as an evaluation corpus; we will run submitted models against it and use the results to evaluate your submissions. Before final submission, you will be able to submit a model for test validation against this evaluation corpus up to two times. On final submission, you will be scored primarily on the accuracy of your techniques, the quality of your documentation and code, and the efficiency of the algorithm, and also on your ability to infer the methods and research fields used in each of the publications.
Second Phase: Up to five teams will be asked to participate in the second phase. If selected, you will be provided with a large corpus of unlabeled publications and asked to discover which of the datasets were used in each publication, as well as the associated research methods and fields. As in the first phase, you will be scored on the accuracy of your techniques, the quality of your documentation and code, and the efficiency of the algorithm, and also on your ability to infer the methods and research fields used and to retrieve the associated passages.
Teams reaching the second phase will be awarded a prize of $2,000 and economy-class travel costs for one participant to the finalist workshop in New York City. A $20,000 stipend will be awarded to the winning team, which will then work with the sponsors on the subsequent implementation of the algorithm.
All submitted algorithms will be made publicly available as open source tools.
New York University’s Coleridge Initiative
Funding provided by: the Schmidt Family Foundation, Overdeck Family Foundation, and Alfred P. Sloan Foundation
Data provided by: ICPSR, Digital Science and SAGE Publications
THE BIGGER PICTURE
Our long-range goal is to transform the empirical foundation of social and health sciences research. We are building a computational research platform, the NYU Administrative Data Research Facility, and working in collaboration with SAGE Publications, Digital Science, and Project Jupyter to do so.
Our eventual goal is to build a set of tools for this platform that enable collaborative knowledge creation and discovery with confidential microdata. We propose to accomplish this by working with the research community to create rich contextual metadata about the datasets: Jupyter notebooks with analysis and code, user-created annotations, and machine-learned relationships from publications about the methods used and fields studied by researchers using the datasets.
HOW TO PARTICIPATE
We welcome the participation of US and international researchers who can bring creative new approaches to this problem. We invite you to review the parameters below, explore the publications data, and send a brief letter of intent to participate to [email protected] by September 30, 2018. Submissions that detail a viable and compelling approach to solving the challenge will be accepted for participation.
In your letter of intent to participate, please provide the following details:
* We understand that the exact specifications of your platform may change
** All algorithms will be distributed under a BSD-2 license.
Second phase finalists will be awarded a prize of $2,000. One $20,000 stipend will be provided to the winning research team at the end of the competition.
The Coleridge Initiative will provide a stipend to cover travel expenses for one representative from up to 5 participating teams to attend a workshop in New York, NY for the final presentation and judging of submissions on February 15, 2019.
Ophir Frieder, Georgetown University
Rayid Ghani, University of Chicago
Ian Mulvany, SAGE Publications
Jordan Boyd-Graber, University of Maryland
Alex Wade, Chan Zuckerberg Initiative
Stefan Bender, Deutsche Bundesbank
Julia Lane, New York University
Frauke Kreuter, University of Maryland
Accepted participants will be given access to a GitHub repository that contains a Dockerfile and related resources, including starting materials. For your submission, you will update the Dockerfile and other files in a branch of this repository to install needed software or libraries, and you will include a command-line program that runs any set of publications through your team's model. The program must conform to an API so that it can be plugged into automated tooling for evaluating the model post-submission. It will accept standard input and produce standard output indicating which datasets are mentioned, which mentions refer to datasets in our list, and which research fields and research methods are described in the provided texts. Detailed specifications for the Docker container and the article-processing program will be provided to participants. The complete Docker container (Dockerfile plus any accompanying materials) must be submitted by 5:00 pm EST on December 1, 2018.
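To illustrate the expected shape of such a command-line program, here is a minimal sketch in Python. It is not the required interface: the line-delimited JSON input and the field names (publication_id, text, data_set_ids, methods, research_fields) are assumptions, and the authoritative schema will be given in the detailed specifications.

```python
#!/usr/bin/env python3
"""Minimal sketch of a stdin/stdout prediction program (schema is illustrative)."""
import json
import sys


def predict(text):
    # Placeholder for the team's trained model; the output keys below are assumptions.
    return {"data_set_mentions": [], "data_set_ids": [], "methods": [], "research_fields": []}


def main():
    # Assumes one JSON object per line on stdin: {"publication_id": ..., "text": ...}
    for line in sys.stdin:
        publication = json.loads(line)
        result = {"publication_id": publication["publication_id"], **predict(publication["text"])}
        sys.stdout.write(json.dumps(result) + "\n")


if __name__ == "__main__":
    main()
```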
The input files consist of 5,000 plain text publications and a dev fold of 100 plain text publications for validation, along with metadata about these publications, a list of datasets of interest, and the subset of those datasets that are explicitly referenced in the curated corpus. Note: the articles may mention additional datasets that are not in our list. The metadata are provided in JSON format, with the text_file_name field in each publication's JSON object in article_metadata.json giving the name of the text file associated with that publication. In addition, each publication is assigned a unique integer ID in the publication JSON, and this ID is also the name of the publication's associated text file, with the file extension ".txt".
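For example, loading the publication metadata and locating each text file might look like the sketch below; the directory name and the assumption that the metadata file is a JSON array of publication objects are illustrative only.

```python
import json
from pathlib import Path

# Load publication metadata; assumes the file is a JSON array of publication objects.
with open("article_metadata.json", encoding="utf-8") as f:
    publications = json.load(f)

# Read each publication's plain text; "files/text" is an assumed directory name.
text_dir = Path("files/text")
for publication in publications:
    text_path = text_dir / publication["text_file_name"]  # e.g. "<publication_id>.txt"
    full_text = text_path.read_text(encoding="utf-8")
    # ... run mention detection and dataset linking on full_text ...
```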
Download the compressed competition dataset here.
| Resource | Description | File(s) |
| --- | --- | --- |
| Training Corpus | Set of article PDFs and their plain-text conversions corresponding to the publications in the training labels and mentions file. Each publication's metadata includes the names of its corresponding files, and each file is named with the ID of its publication followed by an appropriate extension (".txt" for plain text, ".pdf" for PDF). | files/pdf, files/text |
| Dataset Metadata | Metadata for the specific datasets that submitted models should be trained to identify, some of which have been manually identified and labeled in the provided curated training corpus. Includes all the discrete text strings used to refer to each dataset across all publications in our training corpus that refer to it. | data_sets.json |
| Article Metadata | Metadata for articles in the curated training corpus, including paths to the related text and PDF files for each publication. | publications.json |
| Dataset Citation Training Labels and Mentions | Article-dataset pairs for each dataset from our list that is mentioned in the provided articles, including the specific human-annotated text string(s) used to refer to each dataset. | data_set_citations.json |
| Example Social Science Methods Vocabulary | Set of social science research methods; an example is provided by SAGE Publications, but others can be identified. | sage_research_methods.skos, sage_research_methods.json |
| Example Social Science Research Fields Vocabulary | Set of social science fields as identified by the team; example set from SAGE Publications provided. | sage_research_fields.csv, sage_research_fields.json |
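As a concrete illustration of how these files could be combined into training examples, the sketch below joins the citation labels to the dataset metadata. The key names used here (data_set_id, publication_id, mention_list, title) are assumptions; the authoritative schema ships with the competition materials.

```python
import json

# Load dataset metadata and the human-annotated citation labels.
with open("data_sets.json", encoding="utf-8") as f:
    data_sets = {d["data_set_id"]: d for d in json.load(f)}
with open("data_set_citations.json", encoding="utf-8") as f:
    citations = json.load(f)

# Assemble (publication, mention string, dataset) training examples.
training_examples = []
for citation in citations:
    data_set = data_sets[citation["data_set_id"]]
    for mention in citation["mention_list"]:
        training_examples.append({
            "publication_id": citation["publication_id"],
            "mention": mention,
            "data_set_id": citation["data_set_id"],
            "data_set_title": data_set.get("title"),
        })
```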
Output files format
Please submit four output files for the first phase: a dataset citation file, a dataset mention file, a methods file, and a research field file.
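To make the expected shape of these outputs concrete, the sketch below writes one hypothetical record to each of the four files. The file names, field names, and values shown are assumptions only and will be superseded by the detailed specifications provided to participants.

```python
import json

# Hypothetical single-record examples for each of the four output files;
# file names, field names, and values here are illustrative assumptions.
outputs = {
    "data_set_citations.json": [
        {"publication_id": 101, "data_set_id": 42, "mention_list": ["NLSY97"], "score": 0.92},
    ],
    "data_set_mentions.json": [
        {"publication_id": 101, "mention": "National Longitudinal Survey of Youth", "score": 0.88},
    ],
    "methods.json": [
        {"publication_id": 101, "method": "fixed effects regression", "score": 0.75},
    ],
    "research_fields.json": [
        {"publication_id": 101, "research_field": "labor economics", "score": 0.80},
    ],
}
for file_name, records in outputs.items():
    with open(file_name, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
```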
The algorithm should not run for more than 24 hours when processing the full set of publication texts and dataset metadata.
The implementation should be able to run on hardware equivalent to a single Amazon Web Services (AWS) T2 Large instance or smaller. The panel will review any requests for software or hardware updates that might be required to accommodate the incorporation of a novel algorithm into the proposed infrastructure. These requests must be submitted in your letter of intent to participate.
Algorithms will be evaluated for accuracy, run-time, usability, and novelty. Those terms are defined further as:
First Phase Participation
Participants will identify datasets and infer the methods and fields used in each publication in the provided corpus of 5,000 labeled publications. Participants will be able to validate their trained model on a dev fold of 100 additional publications. Algorithms submitted by participants will be tested by the competition organizers on a separate held-out corpus of 5,000 labeled publications; the precision, recall, and F1 score of the dataset identification will be returned to the team. Participants can validate their models up to two times prior to final submission.
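For reference, a minimal sketch of how precision, recall, and F1 could be computed over predicted article-dataset pairs is shown below; the organizers' actual scoring script may apply different matching rules.

```python
def score_dataset_identification(predicted_pairs, gold_pairs):
    """Precision, recall, and F1 over sets of (publication_id, data_set_id) pairs.

    This is a simple sketch; the organizers' scoring procedure may differ.
    """
    true_positives = len(predicted_pairs & gold_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```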
At the end of the first phase, the team will submit a docker container specification (template provided in the competition starter kit) including:
First Phase Evaluation
During this round, the results of applying the participants' models to the held-out corpus will be evaluated by the expert team using the following criteria:
The team submission will also be evaluated on the following:
Second Phase Participation
Up to five competition participants will be invited to a second evaluation phase, where they will be provided with the results of the first-phase scoring. They will be given the opportunity to revise and refine their algorithms. Their final algorithms will then be applied to a large corpus of approximately 10,000 unlabeled publications in a server environment provided by the sponsors. Each finalist team will be awarded a $2,000 prize.
Second Phase Evaluation
During this phase, participants will be evaluated on:
COMPETITION TERMS AND CONDITIONS
All submitted algorithms and related information will be subject to the same open source license (BSD-2) and will be made available to the public on the Rich Context GitHub repository. The copyright holders will be New York University and the creator(s) of the submitted algorithm.
You will be provided with 5,100 published journal articles as inputs for building and training your model in this competition. We are making these materials available for noncommercial, scholarly use only. You may not redistribute or re-license these materials or use them for purposes outside of the scope of this competition.
There is no limit to the size or composition of participating researcher teams. We will provide economy-class travel expenses for one representative from the top teams (up to 5 teams) to present their technology to the workshop attendees.