Rich Context Competition

PROBLEM DESCRIPTION

Researchers and analysts who want to use data for evidence and policy cannot easily find out who else worked with the data, on what topics and with what results. As a result, good research is underused, great data go undiscovered and are undervalued, and time and resources are wasted redoing empirical work.

We want you to help us develop and identify the best text analysis and machine learning techniques to discover relationships between data sets, researchers, publications, research methods, and fields. We will use the results to create a rich context for empirical research – and build new metrics to describe data use.

This challenge is the first step in that discovery process.

COMPETITION GOAL

The goal of this competition is to automate the discovery of research datasets and the associated research methods and fields in social science research publications. Participants should use any combination of machine learning and data analysis methods to identify the datasets used in a corpus of social science publications and to infer both the scientific methods used in the analysis and the research fields of the publications.

COMPETITION SPECIFICS

The competition has two phases (details below).

First Phase: You will be provided a listing of datasets and a labeled corpus of 5,000 publications with an additional dev fold of 100 publications. Each publication will be labeled to indicate which of the datasets from our list are referenced within and what specific text is used to refer to each dataset. You can use this data to train and tune your algorithms to detect mentions of data in publication text and, when a data set in our list is mentioned, tie each mention to the appropriate data set. A separate corpus of 5,000 labeled publications will be held back and serve as an evaluation corpus against which we will run submitted models and use the results to evaluate your submissions. Before final submission, you will be able to submit a model for test validation against this evaluation corpus up to 2 times. On final submission, you will primarily be scored on the accuracy of your techniques, the quality of your documentation and code, and the efficiency of the algorithm – and also on your ability to infer methods and research fields used in each of the publications.

Second Phase: Up to five teams will be asked to participate in the second phase. If selected, you will be provided with a large corpus of unlabeled publications and asked to discover which of the datasets were used in each publication as well as the associated research methods and fields. As in the first phase, you will be scored on the accuracy of your techniques, the quality of your documentation and code, and the efficiency of the algorithm – and also on your ability to infer the methods and research fields used in each publication and to retrieve the associated passages.

Teams reaching the second phase will be awarded a prize of $2,000 and economy-class travel costs for one participant to the finalist workshop in New York City. A stipend of $20,000 will be awarded to the winning team; the winning team will work with the sponsors in the subsequent implementation of the algorithm.

All submitted algorithms will be made publicly available as open source tools.

SPONSORS

New York University’s Coleridge Initiative

Funding provided by: the Schmidt Family Foundation, Overdeck Family Foundation, and Alfred P. Sloan Foundation

Data provided by: ICPSR, Digital Science and SAGE Publications

THE BIGGER PICTURE

Our long range goal is to transform the empirical foundation of social and health sciences research. We are building a computational research platform, the NYU Administrative Data Research Facility, and working in collaboration with SAGE Publications, Digital Science, and Project Jupyter to do so.

Our eventual goal is to build a set of tools for this platform that enable collaborative knowledge creation and discovery with confidential microdata. We propose to accomplish this by working with the research community to create rich contextual metadata about the data sets: Jupyter notebooks with analysis and code, user-created annotations, and machine-learned relationships from publications about the methods used and fields studied by researchers using the datasets.

COMPETITION SCHEDULE

  • September 30, 2018: Participants submit a letter of intent (see How to Participate)
  • October 15, 2018: Participants notified and first phase data provided (see First Phase Participation)
  • December 1, 2018: Final first phase algorithms submitted (see Program Requirements)
  • December 14, 2018: 5 finalists selected (see First Phase Evaluation)
  • December 15, 2018 - January 15, 2019: Finalists refine algorithms (see Second Phase Participation)
  • January 15, 2019: Refined second phase algorithms submitted (see Second Phase Evaluation)
  • January 15, 2019 - February 14, 2019: Second phase algorithms applied to the second phase corpus and evaluated by competition panels
  • February 15, 2019: Workshop held in New York, NY for final presentation and selection of winning algorithms (see Second Phase Evaluation)

HOW TO PARTICIPATE

We welcome the participation of US and international researchers who can bring creative new approaches to this problem. We invite you to review the parameters below, explore the publications data, and send a brief letter of intent to participate to [email protected] by September 30, 2018. Submissions that detail a viable and compelling approach to solving the challenge will be accepted for participation.

In your intent to participate please provide the following details:

  • The outline of an algorithmic approach, including the feature engineering steps and a means by which to validate and test your model
  • Any external datasets (i.e., not provided by this competition) or non-traditional computing environments (e.g., other than an AWS T2 Large instance or similar) that you wish to incorporate in your proposed solution
  • A description of your development platform, including operating system, CPU specifications, available memory, and GPU specifications if applicable *
  • Any software that cannot be redistributed under a BSD license that you intend to use **

* We understand that the exact specifications of your platform may change

** All algorithms will be distributed under a BSD-2 license.

REMUNERATION

Second phase finalists will be awarded a prize of $2,000. One $20,000 stipend will be provided to the winning research team at the end of the competition.

  • That stipend will support the team’s work to refine their algorithm
  • The stipend will additionally support the research team’s technical guidance as competition sponsor staff integrate their work into the broader project

The Coleridge Initiative will provide a stipend to cover travel expenses for one representative from up to 5 participating teams to attend a workshop in New York, NY for the final presentation and judging of submissions on February 15, 2019.

JUDGES


Technical

Ophir Frieder, Georgetown University

Rayid Ghani, University of Chicago

Ian Mulvany, SAGE Publications

Jordan Boyd-Graber, University of Maryland

Sam Molyneux, Chan Zuckerberg Meta


Social Science

Stefan Bender, Deutsche Bundesbank

Julia Lane, New York University

Frauke Kreuter, University of Maryland

PROGRAM REQUIREMENTS

Accepted participants will be given access to a GitHub repository that contains a DockerFile and related resources, including starting materials. For your submission, you will update the DockerFile and other files in a branch of this repository to install any needed software or libraries, and include a command line program, conforming to a provided API, that runs any set of publications through your team’s model so that it can be plugged into our automated post-submission evaluation. The program will accept standard input and produce standard output indicating which datasets are mentioned, which mentions refer to data sets in our list, and which research fields and research methods are described in the provided texts. Detailed specifications for the docker container and the program that processes articles will be provided to participants. The complete DockerFile plus any accompanying materials must be submitted by 5:00 pm EST on December 1, 2018.
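As a rough illustration only, the sketch below shows one possible shape for such a command line program: it reads publication text file paths from standard input and writes one JSON object per publication to standard output. The interface, the output schema, and the detect_* model hooks are assumptions for this sketch; the authoritative API will come from the detailed specifications provided to participants.

```python
# Minimal sketch of a conforming command line program, assuming the final API
# reads publication text file paths from standard input (one per line) and
# writes one JSON object per publication to standard output. The detect_*
# functions are hypothetical placeholders for a team's trained model.
import json
import sys


def detect_datasets(text):
    """Hypothetical model hook: return [(data_set_id, mention, score), ...]."""
    return []


def detect_methods(text):
    """Hypothetical model hook: return [(method, score), ...]."""
    return []


def detect_fields(text):
    """Hypothetical model hook: return [(field, score), ...]."""
    return []


def main():
    for line in sys.stdin:
        path = line.strip()
        if not path:
            continue
        with open(path, encoding="utf-8") as f:
            text = f.read()
        result = {
            "publication": path,
            "datasets": [
                {"data_set_id": d, "mention": m, "score": s}
                for d, m, s in detect_datasets(text)
            ],
            "methods": [{"method": m, "score": s} for m, s in detect_methods(text)],
            "research_fields": [{"field": r, "score": s} for r, s in detect_fields(text)],
        }
        sys.stdout.write(json.dumps(result) + "\n")


if __name__ == "__main__":
    main()
```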

Input files

The input files consist of 5,000 plain text publications and a dev fold of 100 plain text publications for validation, along with metadata about these publications, a list of data sets of interest, and the subset of these data sets that are explicitly referenced in the curated corpus. Note: these articles may mention additional datasets that are not in our list of data sets. The metadata are provided in JSON format, with the text_file_name field in each publication’s JSON object in publications.json providing the name of the text file associated with the publication. In addition, each publication is assigned a unique integer ID in the publication JSON, and this ID is the name of the publication’s associated text file, with the file extension “.txt”.
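A minimal sketch of reading these inputs, assuming the layout described above and in the Data Files table below; the publication_id field name and the directory paths are assumptions for illustration.

```python
# Sketch of loading the publication metadata and associated text files,
# assuming the text_file_name field and integer publication IDs described
# above. The publication_id field name and paths are illustrative assumptions.
import json
import os

with open("publications.json", encoding="utf-8") as f:
    publications = json.load(f)

texts = {}
for pub in publications:
    pub_id = pub["publication_id"]  # assumed name of the unique integer ID field
    txt_path = os.path.join("files", "text", pub["text_file_name"])
    with open(txt_path, encoding="utf-8") as f:
        texts[pub_id] = f.read()

print(f"Loaded {len(texts)} publication texts")
```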

Data Files

Download the compressed competition dataset here.

  • Training Corpus: Article PDFs and their plain text conversions, corresponding to the publications in the training labels and mentions file. Each publication’s metadata includes the names of its corresponding files, and each file is named with the ID of its publication followed by an appropriate extension for the type (“.txt” for plain text, “.pdf” for PDF). Files: files/pdf, files/text
  • Dataset Metadata: Metadata for the specific datasets that submitted models should be trained to identify, some of which have been manually identified and labeled in the provided curated training corpus. Includes all the discrete text strings used to refer to each data set across all publications in the training corpus that refer to it. File: data_sets.json
  • Article Metadata: Metadata for articles in the curated training corpus, including paths to the related text and PDF files for each publication. File: publications.json
  • Dataset Citation Training Labels and Mentions: Article-dataset pairs for each data set from our list that is mentioned in the provided articles, including the specific human-annotated text string(s) used to refer to each dataset. File: data_set_citations.json
  • Example of Social Science Methods Vocabulary: A set of social science research methods; an example is provided by SAGE Publications, but others can be identified. Files: sage_research_methods.skos, sage_research_methods.json
  • Example of Social Science Research Fields Vocabulary: A set of social science fields as identified by the team; an example set from SAGE Publications is provided. Files: sage_research_fields.csv, sage_research_fields.json
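For orientation, the sketch below joins the training labels to the candidate data sets, assuming data_set_citations.json and data_sets.json are JSON lists of objects; the field names (publication_id, data_set_id, mention_list) are assumptions to be checked against the actual files.

```python
# Sketch of assembling training examples from the provided labels. Field names
# (publication_id, data_set_id, mention_list) are assumed for illustration and
# should be verified against the actual JSON files.
import json

with open("data_set_citations.json", encoding="utf-8") as f:
    citations = json.load(f)

with open("data_sets.json", encoding="utf-8") as f:
    data_sets = {d["data_set_id"]: d for d in json.load(f)}  # assumed ID field

# Map each publication to the data sets it cites and the mention strings used.
labels = {}
for cite in citations:
    pub_id = cite["publication_id"]
    labels.setdefault(pub_id, []).append(
        {
            "data_set_id": cite["data_set_id"],
            "mentions": cite["mention_list"],
        }
    )

print(f"{len(labels)} labeled publications, {len(data_sets)} candidate data sets")
```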

Output files format

Please submit four output files for the first phase: a publication-dataset relation file, a dataset mention file, a methods file, and a research field file. Each is described below, and a sketch of the expected JSON shapes follows the list.

  • The publication-dataset relation file should be a JSON file that contains publication-dataset pairs for each detected mention of any of the data sets provided in the contest data set file. The JSON file should contain a JSON list of objects, where each object represents a single publication-dataset pair and includes four properties. The first property should be the publication document id. The second property should be an integer ID that identifies the cited dataset. The third property should be a score on a scale of 0 to 1 representing the level of confidence that the dataset is referenced in the publication. The fourth property should be a list of the text of explicit mentions of the data set in the publication.

  • The dataset mention output file should contain the text string detected as a data set mention for every data set detected within each publication, regardless of whether it is one of the data sets provided in the contest data set file. The JSON file should contain a JSON list of objects, where each object contains a single publication-mention pair and includes three properties. The first property should be the publication document id. The second property should be the data set mention text found in the publication. The third property should be a score on a scale of 0 to 1 representing the level of confidence that the dataset is referenced in the publication.

  • The methods output file should be a JSON file where each object has three properties. The first property should be the publication document id. The second property should be a method inferred to have been used in the research reported in the publication. The third property should be a score on a scale of 0 to 1 representing the level of confidence that the method is referenced in the publication. An illustrative example of what such research methods might include is provided in the SAGE Publications research methods vocabulary.

  • The research field output file should be a JSON file where each object has three properties. The first property should be the publication document id. The second property should be the inferred primary research field of the publication. The third property should be a score on a scale of 0 to 1 representing the level of confidence that this is the publication’s research field. An illustrative example of such research fields is provided in the SAGE Publications research fields vocabulary.
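The sketch below illustrates one plausible shape for the four output files, written as Python structures dumped to JSON. The output file names, property names, and example values are assumptions for the sketch, not the official submission specification.

```python
# Illustrative shapes for the four first-phase output files. All file names,
# field names, and values are assumptions for this sketch.
import json

publication_dataset_relations = [
    {
        "publication_id": 1234,                      # publication document id
        "data_set_id": 42,                           # integer ID of the cited data set
        "score": 0.93,                               # confidence the data set is referenced
        "mention_list": ["Example National Survey", "ENS 2016"],
    }
]

dataset_mentions = [
    {"publication_id": 1234, "mention": "ENS 2016", "score": 0.88},
]

methods = [
    {"publication_id": 1234, "method": "Regression analysis", "score": 0.71},
]

research_fields = [
    {"publication_id": 1234, "research_field": "Economics", "score": 0.64},
]

for name, payload in [
    ("publication_dataset_relations.json", publication_dataset_relations),
    ("dataset_mentions.json", dataset_mentions),
    ("methods.json", methods),
    ("research_fields.json", research_fields),
]:
    with open(name, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2)
```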

The algorithm should not run for more than 24 hours when processing the full set of publication texts and dataset metadata.

The implementation should be able to run on hardware equivalent to a single Amazon Web Services (AWS) T2 Large instance or smaller. The panel will review any requests for software or hardware updates that might be required to accommodate the incorporation of a novel algorithm into the proposed infrastructure. These requests must be submitted in your letter of intent to participate.

External data

  • Proprietary datasets cannot be included in any algorithm.
  • The panel will review any requests to incorporate additional non-proprietary data into a submitted algorithm. Please specify any additional data you intend to use in your letter of intent to participate.

FIRST PHASE

Algorithms will be evaluated for accuracy, run-time, usability, and novelty. Those terms are defined further as:

  • Accuracy: Precision, recall and F1 score for datasets from the provided list referenced in a given document
  • Run-time: The time it takes for the model to train and predict. The model should run on a single machine and complete in a specified time period
  • Usability: The ability for the code to be understood by another informed user and for the model to be re-run to predict dataset references in new, unseen articles
  • Novelty: Solution creatively tackles the problem of identifying research methods and research fields

First Phase Participation

Participants will indicate the datasets used in, and infer the methods and fields of, each publication in the provided corpus of 5,000 labeled publications. Participants will be able to validate their trained model on a dev fold of 100 additional publications. Algorithms submitted by participants will be tested by the competition organizers on a separate, held-out corpus of 5,000 labeled publications; the precision, recall, and F1 score of the dataset identification will be returned to the team. Participants can validate their models up to 2 times prior to final submission.

At the end of the first phase, the team will submit a docker container specification (template provided in the competition starter kit) including:

  • DockerFile that captures all container setup needed to train and run the model, and any necessary related files.
  • Trained model, ready to be run against a set of plain text articles derived from the publications
  • Documentation on how to install, train, configure, and run the model program
  • Documentation on any pre-processing of the text files that takes place before training the model
  • All source code and data used to train and test the model
  • Written high-level summary of what the program does, including intermediate steps in the process and external data sources used
  • Written high-level summary of the results of the research methods identification

First Phase Evaluation

During this round, the results of the application of the participants’ model on the held out corpus will be evaluated by the expert team using the following criteria:

Dataset identification

  • For data set identification, we expect your model to output a list of publication-data set pairs where each pair represents a data set cited in the publication. We ask that this include the score assigned to each pair by your model so we can get an idea of the range of raw scores that indicate whether a data set is or is not present in a publication.
  • Evaluation is based on standard binary classification precision, recall, and F1 measure, as illustrated in the sketch following this list.
  • Recall is defined as Recall = (# of true positives) / ( (# of true positives) + (# of false negatives) )
  • Precision is defined as Precision = (# of true positives) / ( (# of true positives) + (# of false positives) )
  • F1 is defined as F1 = 2 / ( ( 1 / Recall ) + ( 1 / Precision ) )
  • As a baseline, we'll compare the resulting Precision, Recall, and F1 scores to those generated by a model that searches for the dataset title in the publication text and considers a data set present in a publication if the publication is among the top 20 matches for the title. A successful algorithm is expected to perform better than this search for a data set title over the publication text as indexed in an off-the-shelf, open source search engine such as Apache Lucene or Terrier.
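As a concrete illustration of the measures above, the following sketch scores a set of predicted publication-dataset pairs against gold labels; representing each pair as a (publication_id, data_set_id) tuple is an assumption for the example, not the official evaluation code.

```python
# Minimal sketch of the precision/recall/F1 computation over
# publication-dataset pairs, each represented as a (publication_id,
# data_set_id) tuple. The pair representation is assumed for illustration.
def score_pairs(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    false_positives = len(predicted - gold)
    false_negatives = len(gold - predicted)
    precision = true_positives / (true_positives + false_positives) if predicted else 0.0
    recall = true_positives / (true_positives + false_negatives) if gold else 0.0
    # F1 is the harmonic mean of precision and recall, as defined above.
    f1 = 2 / ((1 / recall) + (1 / precision)) if precision > 0 and recall > 0 else 0.0
    return precision, recall, f1


# Example: two correct pairs, one spurious prediction, one missed pair.
print(score_pairs(
    predicted=[(1, 42), (1, 7), (2, 13)],
    gold=[(1, 42), (2, 13), (3, 99)],
))
```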

The team submission will also be evaluated on the following:

  • Self-reported algorithm training run-time
  • Quality of implementation, including:
    • Quality of code (Scored by expert technical team)
    • Replicability and generalizability (Scored by expert technical team)
    • Documentation (Scored by expert technical team)
  • Identification of methods
    • Closeness to actual method (scored by social science team)
    • Novelty of new methods identified (scored by social science team)
  • Identification of research fields
    • Closeness to actual field (scored by social science team)
    • Novelty of new fields identified (scored by social science team)

SECOND PHASE

Second Phase Participation

Up to five competition participants will be invited to a second evaluation phase, where they will be provided with the results of the first phase scoring. They will be given the opportunity to revise and refine their algorithms. The final algorithms will then be applied to a large corpus of approximately 10,000 unlabeled publications in a server environment provided by the sponsors. Each finalist team will be awarded a $2,000 prize.

Second Phase Evaluation

During this phase, participants will be evaluated on:

  • Algorithm generalizability, which we will assess by computing the recall rate (both overall and at k) and the precision rate (both overall and at k) for different sets of training and evaluation data, as well as the F1 measure (see the precision-at-k sketch after this list)
  • Baseline comparison against searching for the data set title in the publication text. It is expected that a winning algorithm would perform better than searching over the publication text as indexed in an off-the-shelf, open source search engine such as Apache Lucene or Terrier
  • Self-reported algorithm run-time
  • Quality of implementation, including:
    • Quality of code (Scored by expert technical team)
    • Replicability and generalizability (Scored by expert technical team)
    • Documentation (Scored by expert technical team)
  • Identification of methods
    • Closeness to actual method (scored by social science team)
    • Novelty of new methods identified (scored by social science team)
  • Identification of fields
    • Closeness to actual field (scored by social science team)
    • Novelty of new fields identified (scored by social science team)
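To make the generalizability criterion concrete, the sketch below computes precision at k and recall at k over a ranked list of predicted publication-dataset pairs; the ranked-list representation and the choice of k are assumptions for illustration, not the official evaluation procedure.

```python
# Sketch of precision@k and recall@k over a ranked list of predicted
# publication-dataset pairs (highest-confidence first). The ranked-list input
# and the cutoff k are assumptions for illustration only.
def precision_recall_at_k(ranked_predictions, gold, k):
    gold = set(gold)
    top_k = ranked_predictions[:k]
    hits = sum(1 for pair in top_k if pair in gold)
    precision_at_k = hits / k if k else 0.0
    recall_at_k = hits / len(gold) if gold else 0.0
    return precision_at_k, recall_at_k


# Example: predictions ranked by model confidence, evaluated at k = 3.
ranked = [(1, 42), (2, 13), (1, 7), (3, 99)]
gold = [(1, 42), (3, 99), (2, 13)]
print(precision_recall_at_k(ranked, gold, k=3))
```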

COMPETITION TERMS AND CONDITIONS

Intellectual Property

All submitted algorithms and related information will be subject to the same open source copyright (BSD-2 open source license) and will be made available to the public on the Rich Context GitHub repository. The copyright holders will be New York University and the creator(s) of the submitted algorithm.

Competition Data Terms of Use

You will be provided with 5,100 published journal articles as inputs for building and training your model in this competition. We are making these materials available for noncommercial, scholarly use only. You may not redistribute or re-license these materials or use them for purposes outside of the scope of this competition.

TEAMS

There is no limit to the size or composition of participating researcher teams. We will provide economy-class travel expenses for one representative from the top teams (up to 5 teams) to present their technology to the workshop attendees.