Modernizing Statistical Processing Pipeline (USDA)

Overview

Coleridge Initiative, in collaboration with USDA’s Economic Research Service (ERS), is streamlining and updating the code used to process the Agricultural Resource Management Survey (ARMS) and U.S. Agricultural Productivity (USAgP) data products. The modernization includes a complete rewrite of the codebase used to produce these data products. The rewrite migrates processing from proprietary tools to open-source tools (R) and applies software engineering best practices such as modular code, unit testing, versioning, and automated documentation. The effort aims to help ERS increase reliability and reproducibility through more rigorous testing and debugging in the workflow, enhance continuity of operations by documenting the processes so that they can be rewritten or recreated, and increase the usability of the data for end users.

Challenges & Objectives

The current processing of the ARMS Phase 2 and Phase 3 and USAgP data products relies on legacy code made up of multiple routines, and several areas for improvement have been identified. Over successive years of processing, idiosyncratic changes to the workflow have accumulated, leaving a codebase that is increasingly complicated to understand and maintain. To ensure the future reliability of the data products, ERS is therefore looking to streamline the statistical processing of the data and to provide researchers and maintainers with more comprehensive documentation of the workflow.

The main objectives of the project are to 1) rewrite the data ingestion, preparation, and analysis codebase, 2) document and describe the process flow for data preparation and processing, and 3) create a web-based codebook and navigation tools to help users understand the code used to produce each data product. The increased transparency of the data processing will result in a more maintainable and user-friendly codebase for each of the ARMS Phase 2, ARMS Phase 3, and USAgP data products. For code maintenance, future ERS developers will be able to understand more readily the exact steps taken to source, ingest, and process the relevant data for each project. For end users, web-based access to variable definitions and constructions will support better-informed research projects whose decision rules align more closely with those used in USDA processing.

Process & Work

Coleridge Initiative will rewrite the SAS modules used for processing the ARMS Phase III Cost and Returns Report data and for processing the ARMS Phase II Crop Production Practices data for populating the public-facing webtool. Coleridge Initiative will also help assess the current code for the USAgP data product, help create specifications for its rewrite using open-source tools (R), and perform comparisons to ensure reproducibility. An Agile approach is used to create, for each statistical processing module, a specification (what it needs to do), the business logic involved (why it needs to work a particular way), and the acceptance criteria (how we know it worked). The rewrite will be organized in sprints, each consisting of development, testing, and a demo.
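As an illustration of what a reproducibility comparison and an acceptance criterion might look like in practice, the sketch below checks a rewritten module's output against the legacy output, record by record, within a numeric tolerance. It is written in Python purely for brevity (the project itself targets R), and the function name, column names, tolerance values, and sample figures are all hypothetical; none of them come from the ARMS or USAgP code.

```python
import math


def compare_outputs(legacy, rewritten, key, rel_tol=1e-9, abs_tol=1e-12):
    """Return a list of mismatch descriptions between two tables of records.

    Each table is a list of dicts. `key` names the column that identifies a
    record. The tolerances are illustrative assumptions, not project values.
    """
    legacy_by_key = {row[key]: row for row in legacy}
    new_by_key = {row[key]: row for row in rewritten}
    mismatches = []

    for k in sorted(set(legacy_by_key) | set(new_by_key)):
        old, new = legacy_by_key.get(k), new_by_key.get(k)
        if old is None or new is None:
            side = "legacy" if old is None else "rewritten"
            mismatches.append(f"{key}={k}: missing from {side} output")
            continue
        for col in sorted(set(old) | set(new)):
            if col == key:
                continue
            a, b = old.get(col), new.get(col)
            both_numeric = isinstance(a, (int, float)) and isinstance(b, (int, float))
            # Numeric values are compared within tolerance; everything else exactly.
            if both_numeric:
                same = math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
            else:
                same = a == b
            if not same:
                mismatches.append(f"{key}={k}, {col}: legacy={a!r}, rewritten={b!r}")
    return mismatches


# Hypothetical records standing in for one module's legacy and rewritten output.
legacy_rows = [{"id": 1, "cost": 100.0}, {"id": 2, "cost": 250.5}]
rewritten_rows = [{"id": 1, "cost": 100.0}, {"id": 2, "cost": 251.0}]

for line in compare_outputs(legacy_rows, rewritten_rows, key="id"):
    print(line)
```

Under this sketch, an acceptance criterion for a module could be phrased as "the comparison returns an empty list", which gives each sprint's demo a concrete pass or fail check.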

This work has been presented at the Federal Computer Assisted Survey Information Collection (FedCASIC) Workshop and at the Symposium on Data Science and Statistics.
