Coleridge Initiative

Show US the Data

Kaggle Winners' Methods

Nguyễn Tuấn Khôi & Nguyễn Quán Anh Minh - Context Similarity via Deep Metric Learning

The authors of this approach emphasized the importance of semantic context and aimed for a general solution that uses the context around a mention to detect a dataset citation. In this model, the input text is randomly split into a Support Set and a Query Set. In the Support Set, the authors replaced all dataset titles in the input text with a <MASK> label and then passed the masked text into a BERT model. In the Query Set, they passed the raw input into the same BERT model, which is shared with the Support Set. The outputs from both sets are passed into an ArcFace metric learning layer. In addition, part of the output from the Support Set is used to calculate a scaled cosine similarity with information from the input text in the Query Set, which is then used to train the named entity recognition (NER) model via binary cross-entropy (BCE) loss. After training, the NER model is used to predict the datasets in the test set.
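
Two of the steps above, masking dataset titles and computing a scaled cosine similarity, can be illustrated with a minimal sketch. This is not the authors' code: the function names and the scale value of 10.0 are assumptions made for illustration, plain lists of floats stand in for BERT outputs, and the ArcFace layer is not shown.

```python
import math
import re

MASK = "<MASK>"

def mask_titles(text, titles):
    """Replace every known dataset title in `text` with the mask label."""
    # Longest titles first, so a short title cannot split a longer one.
    for title in sorted(titles, key=len, reverse=True):
        text = re.sub(re.escape(title), MASK, text)
    return text

def scaled_cosine_similarity(u, v, scale=10.0):
    """Cosine similarity of two non-zero vectors times `scale` (assumed value)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return scale * dot / norm

masked = mask_titles(
    "Data from the Consumer Expenditure Survey (CES) were used.",
    ["Consumer Expenditure Survey"],
)
same_direction = scaled_cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = scaled_cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

In the real model the vectors would come from the shared BERT encoder, and the scaled similarity would be fed to the BCE loss.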


The authors also applied post-processing. They used a frequency threshold of two, meaning that a dataset title is accepted only if the model detects it at least twice. They chose this value because many dataset titles in the training set appear only once or twice. They also tried thresholds from 1 to 4, but the results changed by less than 1%. The final submission is based on an ensemble of the allenai/scibert and allenai/biomed_roberta models.
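
The frequency threshold can be sketched as follows. The function name, the lowercasing used to group variants of a title, and the example titles are assumptions. The source does not say whether detections are counted per document or across the corpus, so this sketch simply counts over whatever list it is given.

```python
from collections import Counter

def filter_by_frequency(predicted_titles, min_count=2):
    """Keep titles that were predicted at least `min_count` times."""
    # Lowercasing and stripping is an assumed normalization step.
    counts = Counter(title.strip().lower() for title in predicted_titles)
    return sorted(title for title, n in counts.items() if n >= min_count)

predictions = [
    "Consumer Expenditure Survey",
    "consumer expenditure survey",
    "National Health Survey",
]
kept = filter_by_frequency(predictions)
kept_all = filter_by_frequency(predictions, min_count=1)
```

With the threshold of two, only the title detected twice survives. Lowering `min_count` to 1 keeps both titles.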


In the following example, their model predicts “Information Resource Incorporated (IRI) Consumer Network Panel (CNP)” and “Consumer Expenditure Survey (CES)” as datasets: “Results from shoppers’ intercept survey data collected at 13 stores in LI areas in nine Northeastern locations were compared with those obtained using secondary household food purchasing data from the Information Resource Incorporated (IRI) Consumer Network Panel (CNP) courtesy of the USDA Economic Research Service (ERS), and food expenditures from the Consumer Expenditure Survey (CES) of the US Bureau of Labor Statistics”.

Chun Ming Lee - Transformer-Enhanced Heuristic Search

The key insight of his approach is that datasets are often referenced in academic papers as mixed-case words followed by an acronym in parentheses, e.g., “Baltimore Longitudinal Study of Aging (BLSA)”. His solution searches across documents for strings in the format “LONG-NAME (ACRONYM)” using the Schwartz-Hearst algorithm, a non-learning string search algorithm comparable to an advanced regex search. Candidate strings are then classified as datasets (e.g., “Baltimore Longitudinal Study of Aging (BLSA)”) or as non-dataset references (e.g., “Organization for Economic Co-operation and Development (OECD)”), using a RoBERTa-base Transformer binary classifier fine-tuned on hand-annotated labels. Because not all papers refer to datasets using the full “LONG-NAME (ACRONYM)” format, a search for LONG-NAME strings without the acronym is also performed if the LONG-NAME (ACRONYM) form occurs more than 50 times across documents.
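
The “LONG-NAME (ACRONYM)” search can be approximated with a plain regular expression. This sketch is a simplified stand-in, not the Schwartz-Hearst algorithm itself: it does not check that the acronym's letters appear in the long name, and the connector word list is an assumption.

```python
import re

# Lowercase words allowed between capitalized words (assumed list).
CONNECTORS = r"(?:of|for|and|the|in|on)"
# Two or more capitalized words, optionally joined by connector words.
LONG_NAME = rf"[A-Z][A-Za-z\-]+(?:\s+(?:{CONNECTORS}\s+)*[A-Z][A-Za-z\-]+)+"
# The long name followed by an upper-case acronym in parentheses.
PATTERN = re.compile(rf"({LONG_NAME})\s+\(([A-Z][A-Z0-9\-]+)\)")

def find_candidates(text):
    """Return (long_name, acronym) pairs found in `text`."""
    return PATTERN.findall(text)

text = (
    "We use the Baltimore Longitudinal Study of Aging (BLSA) and reports from "
    "the Organization for Economic Co-operation and Development (OECD)."
)
pairs = find_candidates(text)
```

Both strings are returned as candidates. Separating the dataset from the organization is the job of the classifier described above, not of the pattern.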

The predictions are then collated for each document, and only candidates that meet one of two requirements are accepted: either a minimum document frequency, or a regex match against strings that are very likely to be datasets (e.g., mixed-case words ending with Study/Survey). In the following example, the “National Longitudinal Survey of Youth” is predicted by the classifier as a dataset with 0.99 probability: “Using data from the National Longitudinal Survey of Youth, the purpose of this thesis is to investigate whether and how parental divorce affects children’s post-baccalaureate educational attainment.” In lieu of the Transformer classifier, a simple regex match of “(A-Z)[a-z]+ Study/Survey$” and “(Study/Survey) of ” may capture a significant share of datasets.
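
The two fallback patterns can be written as runnable regular expressions. This sketch assumes that “(A-Z)” in the text stands for the character class `[A-Z]` and that “Study/Survey” means either word. The function name is hypothetical.

```python
import re

# A capitalized word followed by "Study" or "Survey" at the end of the string.
ENDS_WITH_KEYWORD = re.compile(r"[A-Z][a-z]+ (?:Study|Survey)$")
# "Study of " or "Survey of " anywhere in the string.
KEYWORD_OF = re.compile(r"(?:Study|Survey) of ")

def looks_like_dataset(candidate):
    """True if the candidate title matches either pattern."""
    return bool(ENDS_WITH_KEYWORD.search(candidate) or KEYWORD_OF.search(candidate))

titles = [
    "National Longitudinal Survey of Youth",
    "Consumer Expenditure Survey",
    "Organization for Economic Co-operation and Development",
]
flags = [looks_like_dataset(t) for t in titles]
```

The first title matches through “Survey of ”, the second through the end-of-string pattern, and the organization name matches neither.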

Chun Ming attributes the successful performance of his model to his focus on building a model that could reject non-dataset strings. Scraping data for positive examples of datasets (e.g., from and building models on such data could induce a bias toward making more predictions, as such models are trained with many positive examples and relatively few negative examples. This matters because the competition metric punished false positives more than false negatives. Any attempt to inject semantic context into the models seemed to worsen performance; thus, the decision was made to focus on identifying datasets purely by title instead of using semantic context.

Mikhail Arkhipov - Pure Pattern Matching

The solution is based on pattern matching with a simple heuristic: a capitalized sequence of words that includes a keyword and is followed by parentheses usually refers to a dataset; therefore, any sequence like Xxx Xxx Keyword Xxx (XXX) is considered a good dataset candidate. For both train and test documents, all capitalized sequences that are followed by parentheses and contain data-specific keywords (from a manually created list, such as “study”, “survey”, “assessment”, etc.) are extracted from the text to form a candidate set. All candidates containing stop-words (from a manually created list that mainly names organizations, such as “lab”, “centre”, “consortium”, etc.) are discarded. Candidates that co-occur with the word “data” in the same sentence in fewer than 10% of cases are also discarded.
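
The three filters (keyword present, no stop-word, co-occurrence with “data”) can be sketched as below. The keyword and stop-word sets contain only the examples quoted in the text; the real lists are longer. The function names, the whole-word match on “data”, and the example sentences are assumptions.

```python
import re

KEYWORDS = {"study", "survey", "assessment"}    # examples quoted in the text
STOP_WORDS = {"lab", "centre", "consortium"}    # examples quoted in the text

def has_keyword_and_no_stop_word(candidate):
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", candidate)}
    return bool(words & KEYWORDS) and not (words & STOP_WORDS)

def data_cooccurrence_rate(candidate, sentences):
    """Share of sentences mentioning the candidate that also contain 'data'."""
    containing = [s for s in sentences if candidate in s]
    if not containing:
        return 0.0
    with_data = [s for s in containing if re.search(r"\bdata\b", s, re.IGNORECASE)]
    return len(with_data) / len(containing)

def keep_candidate(candidate, sentences, min_rate=0.10):
    return (has_keyword_and_no_stop_word(candidate)
            and data_cooccurrence_rate(candidate, sentences) >= min_rate)

sentences = [
    "We analyse data from the National Education Longitudinal Study (NELS).",
    "The National Education Longitudinal Study (NELS) began in 1988.",
    "The Early Childhood Research Consortium Study (ECRCS) funded the work.",
    "The Program for International Student Assessment (PISA) runs every three years.",
]
kept = [c for c in (
    "National Education Longitudinal Study",
    "Early Childhood Research Consortium Study",
    "Program for International Student Assessment",
) if keep_candidate(c, sentences)]
```

Here the second candidate is dropped by the stop-word “consortium”, and the third is dropped because it never shares a sentence with “data”.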

Moreover, capitalized words are allowed to connect through prepositions or conjunctions. However, sequences ending with parentheses that enclose abbreviations are restricted, to partially prevent merging multiple datasets connected via conjunctions. The final prediction is a simple substring search, in each document, for all extracted and filtered mentions without their parentheses. In addition, for each detected mention, it is important to search for parentheses with abbreviations right after the mention. Sometimes they contain specific versions of the dataset (e.g., NELS:88). The abbreviations from the parentheses are added as separate predictions.
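
The final prediction step can be sketched as a substring search plus a check for an abbreviation in parentheses right after each hit. The function name and the abbreviation pattern (an upper-case letter followed by letters, digits, colons, or hyphens, with no spaces) are assumptions.

```python
import re

# Assumed shape of an abbreviation in parentheses, e.g. "(NELS:88)".
ABBREVIATION = re.compile(r"\s*\(([A-Z][A-Za-z0-9:\-]*)\)")

def predict_mentions(document, known_mentions):
    """Return mentions found in `document`, plus abbreviations that follow them."""
    predictions = set()
    for mention in known_mentions:
        for match in re.finditer(re.escape(mention), document):
            predictions.add(mention)
            # Look only at the text immediately after the mention.
            following = ABBREVIATION.match(document[match.end():])
            if following:
                predictions.add(following.group(1))
    return predictions

mentions = ["National Education Longitudinal Study"]
with_version = predict_mentions(
    "Data come from the National Education Longitudinal Study (NELS:88).", mentions
)
without_abbreviation = predict_mentions(
    "The National Education Longitudinal Study was used.", mentions
)
```

The first document yields both the mention and “NELS:88” as separate predictions; the second yields only the mention.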

Example of a dataset prediction by this model.

Mikhail Arkhipov notes that there is some evidence that in a “low-resource regime”, simple string-matching methods work surprisingly well for entity linking, which is quite similar to the detection of dataset mentions. Even modern, sophisticated neural methods show significantly lower results in the absence of such a “mention string collection”. Also, there are many successful cases of pattern matching (e.g., Hearst Patterns) solving such problems, which motivated Mikhail Arkhipov to focus on simple string/pattern matching heuristics.