Estimates suggest that more than a third of analysts’ time is spent in finding out about data rather than in model development and production, and the Federal Data Strategy has directed government agencies to streamline access to federal data assets.
There now exists the scientific capacity to build a platform that will automate search and discovery. A competition brought together computer scientists from around the globe to develop models that identify and extract dataset mentions from full-text publications. The finalists convened in New York City in February 2019 to present their work. A follow-on workshop brought together more than 70 international experts to identify gaps and develop an operational roadmap.
What We've Done So Far
Our current KG incorporates metadata from nearly 100 different projects at over a dozen participating agencies, initially assembled in preparation for the ADRF classes:
- ~4000 publications linked to datasets
- ~600 datasets formally described
- ~300 providers formally described
- ~1000 journals formally described
As of 2020 Q1, we are beginning to add authors, keywords, projects, stewards, and other entities to this graph. Whenever possible, we leverage persistent identifiers for these entities.
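As a rough illustration of why persistent identifiers matter here, the sketch below models two KG entity types linked by PIDs. The class names, fields, and example DOIs are hypothetical, not the project's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical minimal entity types for a KG like the one described above;
# field names are illustrative, not the project's actual schema.
@dataclass
class Dataset:
    pid: str    # persistent identifier, e.g. a DOI
    title: str

@dataclass
class Publication:
    pid: str    # e.g. a DOI
    title: str
    dataset_pids: list = field(default_factory=list)  # PIDs of linked datasets

# Linking a publication to a dataset by PID keeps the edge stable
# even if titles, hosting URLs, or metadata change later.
ds = Dataset(pid="doi:10.0000/example-dataset", title="Example survey data")
pub = Publication(pid="doi:10.0000/example-paper", title="Example analysis")
pub.dataset_pids.append(ds.pid)
```

The same pattern extends to the newer entity types, e.g. ORCID iDs for authors.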
The latest release of the public version of our corpus (v0.1.8, 2020-01-03) includes ~1500 publications linked to datasets, each with an open access PDF available. By definition, the public version is a subset of the overall KG, since it adds the constraint that each publication must have an open access PDF.
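The subset relationship can be sketched as a simple filter. The record layout and the `open_access_pdf` flag below are hypothetical stand-ins for the corpus metadata:

```python
# Hypothetical publication records; the `open_access_pdf` flag is illustrative.
corpus = [
    {"pid": "doi:10.0000/a", "open_access_pdf": True},
    {"pid": "doi:10.0000/b", "open_access_pdf": False},
    {"pid": "doi:10.0000/c", "open_access_pdf": True},
]

# The public corpus is the subset of the KG's publications
# that have an open access PDF available.
public_corpus = [rec for rec in corpus if rec["open_access_pdf"]]
```

Every record in `public_corpus` is drawn from `corpus`, which is what makes the public version a strict subset of the overall KG.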