About Rich Context
Reproducibility, replicability, and value are critical for any data infrastructure. Coleridge is collaborating with federal agencies, 24 major publishers represented by the not-for-profit consortium CHORUS, and Kaggle to support a national data science competition that shows how publicly funded data are used to serve science and society. The goal is to help federal agencies use state-of-the-art methods to develop automated ways of finding out what datasets are being used to solve problems, what measures are being generated, and which researchers are the experts. The work also helps agencies respond to the Foundations for Evidence-based Policymaking Act, which requires agencies to modernize their data management, and to Presidential Executive Orders urging evidence-based decisions based on the best available data and science. The Open Government Act (Title II) directed agencies to assist the public in expanding the use of public data assets by hosting challenges, competitions, events, and other initiatives.
Almost 1500 data science teams are applying machine learning and natural language processing techniques to a set of publications to develop tools that can automatically answer critical questions, such as what research has been done using what data, by which researchers, on what topics, and in what publications. Agencies as diverse as the National Institutes of Health, NCSES, USDA, Commerce, and the US Geological Survey will learn from the results, which will also be useful for federally supported data repositories. The former US Chief Statistician calls the approach “transformative.”
We are building three proofs of concept:
- Data usage scorecards that summarize usage for high priority datasets.
- Automated Data Inventories that provide prioritized lists of agency data assets.
- Evidence bases for high priority topics that provide lists of data assets that can be used to address cross-cutting administration issues. Examples include the link between jobs and educational credentials, such as postsecondary education, continuing/technical education, and apprenticeships; the digital divide between rural and urban communities; and social and racial differences in the take-up and receipt of government transfer programs.
The Rich Context Approach
Government agencies have massive amounts of administrative and programmatic data to curate and disseminate. The Federal Data Strategy charges agencies with leveraging their data as a strategic asset and producing inventories of their data. The Information Quality Act requires agencies to consider the appropriate level of quality (utility, integrity, and objectivity) for each of the products that they disseminate, based on the likely use of that information. The Foundations for Evidence-based Policymaking Act requires the federal government to modernize its data management practices, and requires agencies' strategic plans to contain an assessment of the coverage, quality, methods, effectiveness, and independence of the agency’s statistics, evaluation, research, and analysis efforts.
Ironically, there is little data on data: on what different datasets measure, and on what research has been done using the data, by which researchers, with what code, and with what results. The lack of an automated way to search for and discover what datasets are used in empirical research makes empirical science fundamentally hard to reproduce and threatens its legitimacy and utility.
What We've Done So Far
Our Show Us the Data Kaggle Competition challenges data scientists to show how publicly funded data are used to serve science and society. Evidence through data is critical if government is to address the many challenges facing society, including pandemics, climate change, Alzheimer’s disease, child hunger, increasing food production, and maintaining biodiversity. Yet much of the information about data that is needed to inform evidence and science is locked inside publications.
For the competition, we created a corpus of 22,000 publications for contestants to use for their models. Competitors will use natural language processing (NLP) to automate the discovery of how scientific data are referenced in publications. Using the full text of scientific publications from numerous research areas, they will identify the datasets that the publications’ authors used in their work.
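The task described above can be illustrated with a deliberately simple baseline: normalize the publication text and look for exact matches against a list of known dataset names. This is only a sketch under stated assumptions, not the method used by competitors or by Coleridge. The dataset names, the sample sentence, and the function names `normalize` and `find_dataset_mentions` are all chosen for illustration, and a real entry would need NLP models that can also recognize datasets that are not on any predefined list.

```python
import re

# Illustrative dataset names chosen for this sketch (an assumption);
# they are not drawn from the competition corpus.
KNOWN_DATASETS = [
    "National Education Longitudinal Study",
    "Survey of Earned Doctorates",
]


def normalize(text: str) -> str:
    """Lowercase and collapse each run of non-alphanumeric characters to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_dataset_mentions(full_text: str, dataset_names: list[str]) -> list[str]:
    """Return the dataset names whose normalized form appears in the text."""
    # Pad with spaces so that matches respect word boundaries.
    padded = f" {normalize(full_text)} "
    return [name for name in dataset_names if f" {normalize(name)} " in padded]


# Hypothetical sentence with a line break inside the dataset name.
sample = "We analyze responses from the Survey of Earned\nDoctorates, 2015-2018."
mentions = find_dataset_mentions(sample, KNOWN_DATASETS)
print(mentions)
```

Normalizing both sides lets the match survive line breaks, punctuation, and capitalization differences, but exact matching still misses abbreviations and paraphrased references, which is where learned models are expected to help.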
Estimates suggest that more than a third of analysts’ time is spent finding out about data rather than developing models and putting them into production. The scientific capacity to build a platform that automates the search for and discovery of data now exists. Our current knowledge graph (KG) incorporates metadata from nearly 100 different projects at over a dozen participating agencies, initially acquired as preparation materials for the Administrative Data Research Facility (ADRF) classes.
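As a rough illustration of how a knowledge graph of this kind supports search and discovery, the sketch below stores metadata as (subject, relation, object) triples and indexes them so that a question like "which publications use this dataset?" becomes a lookup. The identifiers, relation names, and the `build_index` helper are hypothetical and do not describe the actual schema or tooling of the Rich Context graph.

```python
from collections import defaultdict

# Hypothetical triples (subject, relation, object); all identifiers are made up.
TRIPLES = [
    ("pub:001", "uses", "dataset:A"),
    ("pub:002", "uses", "dataset:A"),
    ("pub:002", "uses", "dataset:B"),
    ("pub:001", "authored_by", "author:X"),
]


def build_index(triples):
    """Map each (relation, object) pair to the set of subjects linked to it."""
    index = defaultdict(set)
    for subj, rel, obj in triples:
        index[(rel, obj)].add(subj)
    return index


index = build_index(TRIPLES)

# Which publications use dataset A?
pubs_using_a = sorted(index[("uses", "dataset:A")])
print(pubs_using_a)
```

A production graph would more likely use a dedicated graph store or an RDF library, but the triple structure shown here is the common underlying idea.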
In 2020 Q1, we began to include authors, keywords, projects, stewards, and other entities in this graph. Whenever possible, we leverage persistent identifiers for these entities: