Exporting Results

The ADRF is a FedRAMP-authorized remote access environment that uses the “five safes” approach (safe projects, safe people, safe settings, safe data, and safe export) to protect confidential data. Before accessing the ADRF, all users sign a non-disclosure agreement (NDA) in which they agree not to attempt to re-identify or re-disclose information about any individual or entity. No data can be uploaded to or downloaded from the ADRF. Users also agree not to extract data by other means, including taking handwritten notes, taking screenshots or pictures, talking to an unauthorized individual about empirical results, or working on the ADRF in a public space.

Analytical results can be exported from the ADRF if FedRAMP security protocols are followed to limit the potential for disclosure. The ADRF team reviews all export requests to ensure compliance with those protocols. The export guidelines are described in detail below. If you have any questions about the disclosure review guidelines, please email [email protected].

To submit an export request, please follow the instructions below:

  1. Create a subfolder in your home folder titled “For-Export”.
  2. Inside this folder, create separate “Input” and “Output” folders. Place the files you would like to receive outside of the ADRF in the Output folder, and the underlying code and any required supplemental statistics in the Input folder.
    • A blank version of the Export Request Memo is available in the shared folder. Copy this document into the Input folder and complete it.
  3. Once you have completed the above steps, send an email to [email protected] notifying us that you have submitted an export request.
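
The folder layout from steps 1 and 2 can also be created with a short script. This is a minimal sketch, not an official ADRF tool: the function name make_export_folders is hypothetical, and it assumes your home folder is the location that Path.home() returns inside the ADRF.

```python
from pathlib import Path


def make_export_folders(home: Path) -> Path:
    """Create For-Export/Input and For-Export/Output under `home`.

    Existing folders are left untouched, so the function is safe to re-run.
    """
    export_root = home / "For-Export"
    for name in ("Input", "Output"):
        (export_root / name).mkdir(parents=True, exist_ok=True)
    return export_root


if __name__ == "__main__":
    # Assumption: Path.home() points at your ADRF home folder.
    print(make_export_folders(Path.home()))
```

The Export Request Memo still has to be copied into the Input folder and completed by hand.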

Context

The aim of disclosure control is to ensure that no unauthorized individual, technically competent with public data and private information, could: 1) identify any information not already public knowledge with a reasonable degree of confidence, and 2) associate that information with the supplier of the information.

The ADRF team implements the statistical disclosure guidelines established for each dataset by the agency data steward. The guidelines described here generally follow standard U.S. statistical guidelines [1, 2] as well as international standards [4, 5, 6].

For more information on disclosure limitation, please refer to our textbook [3] or view our videos.

Export Guidelines

Required Information

An Export Request Memo must be completed for every export request. A blank version of the Export Request Memo is available in your shared folder. If you cannot locate the Export Request Memo, please email [email protected].

General Guidelines

Disclosure review means that ADRF disclosure review staff manually examine all of your data output and study results, which can be time-consuming. Please limit the volume of your export requests for two reasons. First, each additional release adds disclosure risk and limits subsequent releases. Second, although we try to release tables as quickly as possible, large requests are slow to review. In short, request only the minimum number of tables or graphs necessary to produce your report or paper, and do not request the release of intermediate output. If you request the export of more than 10 files, there may be an additional charge; please email [email protected] for more information about those charges.

Please structure your input and output folders to enable ADRF Staff to find information quickly if needed. This section provides guidelines about how to do so.

Naming Conventions: The more detailed your documentation, the easier it is for ADRF staff to follow your study during disclosure control. Please use meaningful file names that tell the ADRF team what each file contains. In particular, provide one folder in which you store all the code, one folder for graphs, one folder for data, etc. If you have multiple files of code that depend on each other, please name them so that it is clear which files run first, second, etc. (e.g. 1_DataPrep.py, 2_SummaryStats.py, 3_MultivarAnalyses.py).

Every file of code should have a header including a description of the content of the file, a timestamp, and summary information about data manipulation (e.g. regression; table; graph).
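
As an illustration, a header for one of the Python files named above might look like the following. This is only a sketch of one possible layout: the field names, the description, and the placeholder timestamp are hypothetical, and the ADRF does not prescribe a specific template.

```python
"""
File:              2_SummaryStats.py
Description:       Builds summary tables of outflows (hypothetical example).
Last updated:      YYYY-MM-DD HH:MM  (replace with the actual timestamp)
Data manipulation: table; graph
Runs after:        1_DataPrep.py
Aggregation:       note here the level of aggregation and the step where it happens
"""
```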

Please use meaningful variable names as well. For instance, if you are calculating outflows, it is better to name the variable outflows instead of var1.

Code documentation is also helpful for disclosure reviews. The better the documentation, the faster the turnaround of export requests. If data files are aggregated, please provide documentation on the level of aggregation and during which step in the program the aggregation took place.

Specific Guidelines

Sample Description: Please clearly describe the construction of the sample, the datasets used, the time period, and the sample frame. Please also clearly denote the unit of analysis (individuals, regions, etc.).

Code and logs: Always report the total number of observations as an input for descriptive statistics. Do not include actual numbers in programming code and logs. 

Please do not report actual numbers in table titles or figures.

Tables

Cell sizes: Each agency has specific disclosure rules; please refer to those rules when preparing the export memo. For individual-level data, report the number of observations in each cell. The default rule applied by the team for individual-level data is to suppress cells with fewer than 10 observations unless otherwise designated in the agency’s guidelines. If your table includes row or column totals, or if it depends on a preceding or subsequent table, complementary suppression must also be applied so that suppressed cells cannot be recovered by subtraction. Review the references included below for details. For business data, report the proportion of the cell count or value accounted for by the top 4 businesses in the cell. The default rule applied by the team for business data is to suppress cells in which the top 4 businesses account for more than 80% of the cell.
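
The two default rules can be checked before you submit a table. The sketch below is illustrative only: the function names and constants are hypothetical, it uses the default thresholds stated above (which an agency's own rules may override), and it does not perform complementary suppression, which still has to be handled separately.

```python
MIN_CELL_COUNT = 10     # default threshold for individual-level data
MAX_TOP4_SHARE = 0.80   # default concentration limit for business data


def suppress_small_cells(counts, threshold=MIN_CELL_COUNT):
    """Return a copy of `counts` with cells below `threshold` set to None."""
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}


def top4_share(values):
    """Share of a cell's total contributed by its four largest businesses."""
    total = sum(values)
    if total == 0:
        return 0.0
    return sum(sorted(values, reverse=True)[:4]) / total


def fails_concentration_rule(values, limit=MAX_TOP4_SHARE):
    """True if the top 4 businesses account for more than `limit` of the cell."""
    return top4_share(values) > limit
```

For example, suppress_small_cells({"A": 25, "B": 9}) keeps cell A and blanks cell B, and a cell whose business values are [50, 20, 10, 10, 5, 5] fails the concentration rule because its top 4 businesses account for 90% of the total.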

Cell values: Round all reported values to the nearest sensible units. For an illustrative example, refer to NCES guidelines. 

Weighted data: If weighted results are to be exported, you must report both weighted and unweighted counts.
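
One way to produce both counts side by side is sketched below. The function name and the (group, weight) record layout are assumptions made for illustration; adapt them to your own data structures.

```python
def weighted_and_unweighted_counts(records):
    """Tally counts per group from an iterable of (group, weight) pairs.

    Returns {group: (unweighted_n, weighted_n)}.
    """
    totals = {}
    for group, weight in records:
        n, w = totals.get(group, (0, 0.0))
        totals[group] = (n + 1, w + weight)
    return totals
```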

Ratios: If ratios are reported, please report the number of valid cases for both the numerator and the denominator (e.g. the number of men in state X and the number of women in state X, in addition to the ratio of women in state X).

Percentiles: Do not report exact percentiles. For example, instead of the exact median, you can calculate a fuzzy median by averaging the true 45th and 55th percentiles.
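
The fuzzy median described above can be sketched with Python's standard library. The function name is hypothetical, and the choice of the "inclusive" interpolation method is an assumption; other percentile definitions give slightly different values.

```python
from statistics import quantiles


def fuzzy_median(values):
    """Average of the 45th and 55th percentiles of `values`."""
    # quantiles(..., n=100) returns the 99 cut points for percentiles 1..99,
    # so index 44 is the 45th percentile and index 54 is the 55th.
    cuts = quantiles(values, n=100, method="inclusive")
    return (cuts[44] + cuts[54]) / 2
```

The same pattern extends to other percentiles by averaging the cut points on either side of the one you want to report.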

Maxima and minima: Suppress maximum and minimum values in general. You can replace an exact maximum with a top-coded value, or an exact minimum with a bottom-coded value.
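
Top-coding replaces every value above a chosen cap with the cap itself, so the exact extreme is never shown; the mirror-image operation can be applied to minima. The sketch below is a minimal illustration with hypothetical function names. How the cap and floor are chosen is not specified here and should follow the agency's guidelines.

```python
def top_code(values, cap):
    """Replace every value above `cap` with `cap`."""
    return [min(v, cap) for v in values]


def bottom_code(values, floor):
    """Replace every value below `floor` with `floor`."""
    return [max(v, floor) for v in values]
```

After top-coding at a cap of 50, for instance, the largest reported value would be described as "50 or more" rather than as the exact maximum.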

Graphs

Graphs are representations of tables. Thus, for each graph (a file with a jpg, pdf, png, or tif extension), provide the source data underlying the graph in a csv or txt file, prepared in the same manner as the tables above.

Please request graphs only if you cannot create them yourself from exported tables. Keep the number of graphs as low as possible, because disclosure control of graphs requires considerable effort.

Graphs can usually fit into one of the following categories:

  1. Type A: Graphs produced from aggregated data, or tables that have been confidentialized (e.g. frequency histograms, bar charts of magnitudes). Provide the underlying tables.
  2. Type B: Graphs produced directly from the unit record data but aggregated in the process by the software (e.g. frequency histograms). Provide the underlying tables.
  3. Type C: Graphs produced directly from the unit record data that display unit record values (e.g. scatterplots, residual plots). For this type of graph to be released, you need to ensure that individuals cannot be reidentified and that values can only be estimated with a high level of uncertainty. Further processing can include, but is not restricted to: cutting off the tails of a distribution, removing outliers, jittering the actual values, and removing or modifying axis values.
  4. Type D: Graphs produced from the results of modeling or derivation that use the unit record data (e.g. regression curves). These graphs can be released only if the values cannot be used to find original data values, and are generally automatically cleared. For precision/recall graphs, you would need to report the sample size used to generate your model(s).
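
Two of the processing steps listed for Type C graphs, cutting off tails and jittering, can be sketched as follows. The function names, the cut-off bounds, and the jitter scale are all hypothetical; whether a given amount of trimming or noise is sufficient remains a judgment for disclosure review.

```python
import random


def trim_tails(points, low, high):
    """Keep only (x, y) points whose y value lies within [low, high]."""
    return [(x, y) for x, y in points if low <= y <= high]


def jitter(points, scale, rng=None):
    """Add uniform noise in [-scale, scale] to both coordinates of each point.

    Pass a seeded random.Random only for testing; a known seed would make
    the noise reproducible and therefore easier to undo.
    """
    rng = rng or random.Random()
    return [
        (x + rng.uniform(-scale, scale), y + rng.uniform(-scale, scale))
        for x, y in points
    ]
```

Plot the processed points rather than the original unit record values, and consider removing or coarsening the axis values as well.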

Modeled Output

Output from regression or machine learning models is generally non-disclosive, as long as it is not based on small samples. Only request the release of the key coefficients, and suppress the coefficients of control variables.

References

[1] Confidential Information Protection and Statistical Efficiency Act of 2002. (2002). Washington, D.C.: U.S. G.P.O.

[2] FCSM. 2005. “Report on Statistical Disclosure Limitation Methodology.” Statistical Policy Working Paper 22 (Second version, 2005). Federal Committee on Statistical Methodology. https://nces.ed.gov/fcsm/pdf/spwp22.pdf

[3] Foster, Ian, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter, and Julia Lane, eds. Big data and social science: A practical guide to methods and tools, 2nd ed. CRC Press, 2020. (textbook.coleridgeinitiative.org)

[4] How to use microdata properly: Self-study material for the Users of Eurostat microdata sets. (2018). Retrieved from https://ec.europa.eu/eurostat/web/microdata/overview/self-study-material-for-microdata-users

[5] Research Data Centre of the German Federal Employment Agency at the Institute for Employment Research. (2020, December 8). Remote Data Access and On-Site Use at the FDZ of the BA at the IAB. Retrieved from http://doku.iab.de/fdz/access/Vorgaben_DAFE_EN.PDF

[6] Welpton, Richard (2019): SDC Handbook. figshare. Book. https://doi.org/10.6084/m9.figshare.9958520.v1