Automatic Documentation of Datasets

This is a very good question: people should be very concerned about all of the sequences of data collection, aggregation, transformation, etc., that form the basis for statistical results. Unfortunately, this is not widely practiced.

Before addressing your questions, I want to emphasize that this appears quite related to the general aim of managing data provenance. I might as well give you a Google link to read more. :) You'll find a bunch of resources, such as surveys, software tools (e.g. some listed in the Wikipedia entry), various research projects (e.g. the Provenance Challenge), and more.

That's a conceptual start, now to address practical issues:

I'm working on a project right now where I have been slowly accumulating a bunch of different variables from a bunch of different sources. Being a somewhat clever person, I created a different sub-directory for each under a main "original_data" directory, and included a .txt file with the URL and other descriptors of where I got the data from. Being an insufficiently clever person, these .txt files have no structure.

Welcome to everyone's nightmare. :)

Now I am faced with the task of compiling a methods section which documents all the different data sources. I am willing to go through and add structure to the data, but then I would need to find or build a reporting tool to scan through the directories and extract the information.

No problem. list.files(...,recursive = TRUE) might become a good friend; see also listDirectory() in R.utils.
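For instance, a minimal sketch of scanning your layout, assuming original_data/<source>/ sub-directories, each holding a descriptor .txt file (the names here are illustrative, not prescriptive):

descriptor_files <- list.files("original_data",
                               pattern = "\\.txt$",
                               recursive = TRUE,
                               full.names = TRUE)
# Keep track of which sub-directory (i.e. which source) each descriptor came from
descriptors <- lapply(descriptor_files, readLines)
names(descriptors) <- basename(dirname(descriptor_files))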

It's worth noting that filling in a methods section on data sources is a narrow application within data provenance. In fact, it's rather unfortunate that the CRAN Task View on Reproducible Research focuses only on documentation. The aims of data provenance are, in my experience, a subset of reproducible research, and documentation of data manipulation and results is a subset of data provenance. Thus, this task view is still in its infancy regarding reproducible research. It might be useful for your aims, but you'll eventually outgrow it. :)

Does such a tool exist?

Yes. What are such tools? Mon dieu... they tend to be very application-centric in general. Within R, I think that these tools are not given much attention (* see below). That's rather unfortunate - either I'm missing something, or else the R community is missing something that we should be using.

For the basic process that you've described, I typically use JSON (see this answer and this answer for comments on what I'm up to). For much of my work, I represent this as a "data flow model" (that term can be ambiguous, by the way, especially in the context of computing, but I mean it from a statistical analyses perspective). In many cases, this flow is described via JSON, so it is not hard to extract the sequence from JSON to address how particular results arose.
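For a concrete picture, here is the sort of per-source descriptor I mean; the field names, values, and URL are purely illustrative, not a standard, and jsonlite is just one of several JSON parsers you could use:

# A hypothetical descriptor, stored as e.g. original_data/acs/descriptor.json
descriptor <- '{
  "dataset":    "acs_income",
  "source_url": "http://example.org/acs/income.csv",
  "retrieved":  "2011-10-01",
  "transforms": ["dropped rows with missing income", "converted to 2010 dollars"]
}'

library(jsonlite)
info <- fromJSON(descriptor)
info$source_url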

For more complex or regulated projects, JSON is not enough, and I use databases to define how data was collected, transformed, etc. For regulated projects, the database may have lots of authentication, logging, and more in it, to ensure that data provenance is well documented. I suspect that that kind of DB is well beyond your interest, so let's move on...

1. A markup language should be used (YAML?)

Frankly, whatever you need to describe your data flow will be adequate. Most of the time, I find it adequate to have good JSON, good data directory layouts, and good sequencing of scripts.

2. All sub-directories should be scanned

Done: listDirectory()

3. To facilitate (2), a standard extension for a dataset descriptor should be used

Trivial: ".json". ;-) Or ".SecretSauce" works, too.

4. Critically, to make this most useful there needs to be some way to match variable descriptors with the name that they ultimately take on. Therefore either all renaming of variables has to be done in the source files rather than in a cleaning step (less than ideal), some code-parsing has to be done by the documentation engine to track variable name changes (ugh!), or some simpler hybrid such as allowing the variable renames to be specified in the markup file should be used.

As stated, this doesn't quite make sense. Suppose that I take var1 and var2, and create var3 and var4. Perhaps var4 is just a mapping of var2 to its quantiles and var3 is the observation-wise maximum of var1 and var2; or I might create var4 from var2 by truncating extreme values. If I do so, do I retain the name of var2? On the other hand, if you're referring to simply matching "long names" with "simple names" (i.e. text descriptors to R variables), then this is something only you can do. If you have very structured data, it's not hard to create a list of text names matching variable names; alternatively, you could create tokens upon which string substitution could be performed. I don't think it's hard to create a CSV (or, better yet, JSON ;-)) that matches variable name to descriptor. Simply keep checking that all variables have matching descriptor strings, and stop once that's done.
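As a hedged sketch of that check (the file name, the column names, and the analysis_data object are all hypothetical):

# Lookup table with two columns: variable, descriptor
lookup <- read.csv("variable_descriptors.csv", stringsAsFactors = FALSE)

# Variables in the analysis data that still lack a descriptor string
missing_desc <- setdiff(names(analysis_data), lookup$variable)
if (length(missing_desc) > 0) {
  warning("No descriptor yet for: ", paste(missing_desc, collapse = ", "))
}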

5. Ideally the report would be templated as well (e.g. "We pulled the [var] variable from [dset] dataset on [date]."), and possibly linked to Sweave.

That's where others' suggestions of roxygen and roxygen2 can apply.
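As a rough illustration of the templating idea using sprintf() (the sources data frame and its columns are made up for the example):

sources <- data.frame(var  = c("income", "age"),
                      dset = c("acs_income", "acs_demog"),
                      date = c("2011-10-01", "2011-10-03"),
                      stringsAsFactors = FALSE)

sentences <- sprintf("We pulled the %s variable from the %s dataset on %s.",
                     sources$var, sources$dset, sources$date)
cat(sentences, sep = "\n")  # or write them into a .Rnw file for Sweave to pick up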

6. The tool should be flexible enough to not be overly burdensome. This means that minimal documentation would simply be a dataset name.

Hmm, I'm stumped here. :)

(*) By the way, if you want one FOSS project that relates to this, check out Taverna. It has been integrated with R as documented in several places. This may be overkill for your needs at this time, but it's worth investigating as an example of a decently mature workflow system.


Note 1: Because I frequently use bigmemory for large data sets, I have to name the columns of each matrix. These are stored in a descriptor file for each binary file. That process encourages the creation of descriptor files matching variable names (and matrices) to descriptions. If you store your data in a database or other external files supporting random access and multiple R/W access (e.g. memory mapped files, HDF5 files, anything but .rdat files), you will likely find that adding descriptors becomes second nature.
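A hedged sketch of what I mean with bigmemory (the file names and dimensions are illustrative):

library(bigmemory)
x <- filebacked.big.matrix(nrow = 100, ncol = 3, type = "double",
                           backingfile    = "mydata.bin",
                           descriptorfile = "mydata.desc",
                           dimnames = list(NULL, c("var1", "var2", "var3")))

# Later, or in another session, reattach via the descriptor file;
# the column names travel with it
y <- attach.big.matrix("mydata.desc")
colnames(y)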

How can I document data sets with roxygen?

Roxygen can be used anywhere within an R file (in other words, it doesn't have to be followed by a function). It can also be used to document any docType in the R documentation.

So you can just document your data in a separate block (something like this):

#' This is data to be included in my package
#'
#' @name data-name
#' @docType data
#' @author My Name \email{blahblah@@roxygen.org}
#' @references \url{data_blah.com}
#' @keywords data
NULL

What's the best way to automatically generate roxygen2 documentation for a data frame?

Start with a list of the frames' names, then something like this is a quick hack:

frames <- c("iris", "mtcars")
unlist(sapply(frames,
              function(d) c(paste("#'", d),
                            "#' @format data.frame",
                            gsub("^", "#'", capture.output(str(get(d)))),
                            dQuote(d)),
              simplify = FALSE),
       use.names = FALSE)
# [1] "#' iris"
# [2] "#' @format data.frame"
# [3] "#''data.frame':\t150 obs. of 5 variables:"
# [4] "#' $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ..."
# [5] "#' $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ..."
# [6] "#' $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ..."
# [7] "#' $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ..."
# [8] "#' $ Species : Factor w/ 3 levels \"setosa\",\"versicolor\",..: 1 1 1 1 1 1 1 1 1 1 ..."
# [9] "\"iris\""
# [10] "#' mtcars"
# [11] "#' @format data.frame"
# [12] "#''data.frame':\t32 obs. of 11 variables:"
# [13] "#' $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ..."
# [14] "#' $ cyl : num 6 6 4 6 8 6 8 4 4 6 ..."
# [15] "#' $ disp: num 160 160 108 258 360 ..."
# [16] "#' $ hp : num 110 110 93 110 175 105 245 62 95 123 ..."
# [17] "#' $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ..."
# [18] "#' $ wt : num 2.62 2.88 2.32 3.21 3.44 ..."
# [19] "#' $ qsec: num 16.5 17 18.6 19.4 17 ..."
# [20] "#' $ vs : num 0 0 1 1 0 1 0 1 1 1 ..."
# [21] "#' $ am : num 1 1 1 0 0 0 0 0 0 0 ..."
# [22] "#' $ gear: num 4 4 4 3 3 3 3 4 4 4 ..."
# [23] "#' $ carb: num 4 4 1 1 2 1 4 2 2 4 ..."
# [24] "\"mtcars\""

Then you can cat it out to a file and have most of what you need.
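For instance, if you assign the result of the snippet above to a variable (say doc_lines, an assumed name), writing it out is one line:

cat(doc_lines, sep = "\n", file = "R/data-documentation.R")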

How can I document datasets without adding them to the Collate field?

Since it is good practice to document your package at the package level as well as the function level, I always have a file inside the R folder called packagename-package.R (e.g. granovaGG-package.R in your case) where I keep the package documentation as well as data documentation.

So your granovaGG-package.R file might look something like:

#' One sentence summary of your package.
#'
#' More detail
#' ...
#' @name granovaGG-package
#' @aliases granovaGG
#' @docType package
#' @title One sentence summary of your package.
#' @author \email{your.name@@email.com}
#' @keywords package
#' @seealso \code{\link{...}}
NULL
#' Your dataset documentation goes here.
#'
#' Exactly as in your example.
#' @docType data
#' etc.
#' ...
NULL

Automation for Generating Reports

Since you're using Stata to do the analysis, you can let it do the heavy lifting of the report automation as well.

The trick is using a Stata package like -rtfutil- to export the tables and graphics you describe to a single document. At that point you'll need to convert that to pdf before emailing it.

Here is some sample code using -rtfutil- to automate the creation of an RTF document containing a table, two graphics, and a few paragraphs of text (using the system dataset "auto.dta" as an example):

******

clear

//RTF UTILITY FOR INSERTING GRAPHICS & TABLES//

local sf "/users/ppri/desktop/"

//SETUP
sysuse auto, clear
twoway scatter mpg price, mlabel(make) || lfitci mpg price
graph export "`sf'myplot1.eps", replace
twoway scatter price mpg, mlabel(make) by(for)
graph export "`sf'myplot2.eps", replace

**
tempname handle1

//RTFUTIL
rtfopen `handle1' using "`sf'mydoc1.rtf", replace
file write `handle1' _n _tab "{\pard\b SAMPLE DOCUMENT \par}" _tab _n
file write `handle1' _n "{\line}"
// Figure1
file write `handle1' "{\pard\b FIGURE 1: Plot of Price\par}" _n
rtflink `handle1' using "`sf'myplot1.eps"
// Figure2
file write `handle1' _n "{\page}" _n /*
*/ "{\pard Here is the plot and a paragraph about it. Here is the plot and a paragraph about it. Here is the plot and a paragraph about it. Here is the plot and a paragraph about it.....blah blah blah blah blah \line}" _n
file write `handle1' _n "{\line}"
file write `handle1' "{\pard\b FIGURE2: Plots of Price vs. MPG\par}" _n
rtflink `handle1' using "`sf'myplot2.eps"
// Table Title
file write `handle1' _n "{\page}" _n
file write `handle1' _n "{\par\pard}" _n /*
*/ "{\par\pard HERE IS A TABLE WITH THE CARS: \par\pard}" _n
file write `handle1' _n "{\par\pard}" _n

// Summary Table
rtfrstyle make mpg weight, cwidths(2000 1440 1440) local(b d e)
listtex make foreign mpg if mpg<15, /*
*/ handle(`handle1') begin("`b'") delim("`d'") end("`e'") /*
*/ head("`b'\ql{\i Make}`d'\qr{\i Foreign}`d'\qr{\i MPG }`e'")
file write `handle1' _n "{\line}"
file write `handle1' _n _tab(2) /*
*/ "{\pard\b Sources: Census Data, etc... \par}" _n _n
**
rtfclose `handle1'

******

This will put all the elements you asked about into an RTF document (be careful with any line-wrapping issues when copying and pasting this code from the webpage).

In your question, you also mentioned wanting to create a PDF during this process. Here you'll need to use some non-Stata solution. If you're using Mac OS X you can use the Terminal -convert- utility or Automator to do this, or here are some other solutions: http://codesnippets.joyent.com/posts/show/1601

I don't use Windows, so I'm not sure about solutions for that OS. Good luck.

Is there a standard way to document data frames?

There is a base function, comment(), which can assign or retrieve text stored in an attribute.

(And I do not understand the question about why str prints the label. Shouldn't all (non-name, non-class, non-rowname) attributes be displayed by str?)
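A quick example of assigning and retrieving such a comment:

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
comment(df) <- "Toy data; x is an index, y is a label. Source: made up for illustration."
comment(df)
# [1] "Toy data; x is an index, y is a label. Source: made up for illustration."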

Build documentation automatically for macros

No need to re-invent the wheel - a great approach for documentation is doxygen.

We use it for the open source SASjs Macro Core library (which also lists a lot of good practices for SAS Macro development).

Simply define your attributes in the header (markdown is accepted), e.g.:

/**
@file
@brief Logs a key value pair to a control dataset
@details If the dataset does not exist, it is created. Usage:

%mp_setkeyvalue(someindex,22,type=N)
%mp_setkeyvalue(somenewindex,somevalue)

@param key Provide a key on which to perform the lookup
@param value Provide a value
@param type= either C or N will populate valc and valn respectively. C is
default.
@param libds= define the target table to hold the parameters
@version 9.2
@author Allan Bowe
@source https://github.com/sasjs/core

**/

Then simply point doxygen at your source folder, tell it which config file to use (a good one for SAS is here) and then choose an output directory for your documentation.

It'll look like this.

There's no PDF option, but it can create files in DocBook format that can be used to generate a PDF: http://www.doxygen.nl/manual/config.html#config_docbook

UPDATE - we recently added doxygen support to SASjs - with a single command (sasjs doc) you can document all your jobs, and even generate a graphviz data lineage diagram, integrated into the output.

Overview: https://www.youtube.com/watch?v=ESNdCtXKRrw


