How to Create, Structure, Maintain and Update Data Codebooks in R

How to create, structure, maintain and update data codebooks in R?

You could add any special attribute to any R object with the attr function. E.g.:

x <- cars
attr(x,"source") <- "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."

And see the given attribute in the structure of the object:

> str(x)
'data.frame':   50 obs. of  2 variables:
 $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
 $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
 - attr(*, "source")= chr "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."

And could also load the specified attribute with the same attr function:

> attr(x, "source")
[1] "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."

If you only add new cases to your data frame, the given attribute will not be affected (see: str(rbind(x,x)) while altering the structure will erease the given attributes (see: str(cbind(x,x))).

UPDATE: based on comments

If you want to list all non-standard attributes, check the following:

setdiff(names(attributes(x)),c("names","row.names","class"))

This will list all non-standard attributes (standard are: names, row.names, class in data frames).

Based on that, you could write a short function to list all non-standard attributes and also the values. The following does work, though not in a neat way... You could improve it and make up a function :)

First, define the uniqe (=non standard) attributes:

uniqueattrs <- setdiff(names(attributes(x)),c("names","row.names","class"))

And make a matrix which will hold the names and values:

attribs <- matrix(0,0,2)

Loop through the non-standard attributes and save in the matrix the names and values:

for (i in 1:length(uniqueattrs)) {
    attribs <- rbind(attribs, c(uniqueattrs[i], attr(x,uniqueattrs[i])))
}

Convert the matrix to a data frame and name the columns:

attribs <- as.data.frame(attribs)
names(attribs) <- c('name', 'value')

And save in any format, eg.:

write.csv(attribs, 'foo.csv')

To your question about the variable labels, check the read.spss function from package foreign, as it does exactly what you need: saves the value labels in the attrs section. The main idea is that an attr could be a data frame or other object, so you do not need to make a unique "attr" for every variable, but make only one (e.g. named to "varable labels") and save all information there. You could call like: attr(x, "variable.labels")['foo'] where 'foo' stands for the required variable name. But check the function cited above and also the imported data frames' attributes for more details.

I hope these could help you to write the required functions in a lot neater way than I tried above! :)

How do I handle multiple kinds of missingness in R?

I know what you look for, and that is not implemented in R. I have no knowledge of a package where that is implemented, but it's not too difficult to code it yourself.

A workable way is to add a dataframe to the attributes, containing the codes. To prevent doubling the whole dataframe and save space, I'd add the indices in that dataframe instead of reconstructing a complete dataframe.

eg :

NACode <- function(x,code){
    Df <- sapply(x,function(i){
        i[i %in% code] <- NA
        i
    })

    id <- which(is.na(Df))
    rowid <- id %% nrow(x)
    colid <- id %/% nrow(x) + 1
    NAdf <- data.frame(
        id,rowid,colid,
        value = as.matrix(x)[id]
    )
    Df <- as.data.frame(Df)
    attr(Df,"NAcode") <- NAdf
    Df
}

This allows to do :

> Df <- data.frame(A = 1:10,B=c(1:5,-1,-2,-3,9,10) )
> code <- list("Missing"=-1,"Not Answered"=-2,"Don't know"=-3)
> DfwithNA <- NACode(Df,code)
> str(DfwithNA)
'data.frame':   10 obs. of  2 variables:
 $ A: num  1 2 3 4 5 6 7 8 9 10
 $ B: num  1 2 3 4 5 NA NA NA 9 10
 - attr(*, "NAcode")='data.frame':      3 obs. of  4 variables:
  ..$ id   : int  16 17 18
  ..$ rowid: int  6 7 8
  ..$ colid: num  2 2 2
  ..$ value: num  -1 -2 -3

The function can also be adjusted to add an extra attribute that gives you the label for the different values, see also this question. You could backtransform by :

ChangeNAToCode <- function(x,code){
    NAval <- attr(x,"NAcode")
    for(i in which(NAval$value %in% code))
        x[NAval$rowid[i],NAval$colid[i]] <- NAval$value[i]

    x
}

> Dfback <- ChangeNAToCode(DfwithNA,c(-2,-3))
> str(Dfback)
'data.frame':   10 obs. of  2 variables:
 $ A: num  1 2 3 4 5 6 7 8 9 10
 $ B: num  1 2 3 4 5 NA -2 -3 9 10
 - attr(*, "NAcode")='data.frame':      3 obs. of  4 variables:
  ..$ id   : int  16 17 18
  ..$ rowid: int  6 7 8
  ..$ colid: num  2 2 2
  ..$ value: num  -1 -2 -3

This allows to change only the codes you want, if that ever is necessary. The function can be adapted to return all codes when no argument is given. Similar functions can be constructed to extract data based on the code, I guess you can figure that one out yourself.

But in one line : using attributes and indices might be a nice way of doing it.

Adding Descriptive Text to a Variable

Brandon,

comment() and attr() can be useful here. This recent post has some really good information on this.

From the help page for comment():

x <- matrix(1:12, 3,4)
comment(x) <- c("This is my very important data from experiment #0234",
                "Jun 5, 1998")
x
comment(x)

and str(x) returns:

> str(x)
 int [1:3, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "comment")= chr [1:2] "This is my very important data from experiment #0234" "Jun 5, 1998"

Best way to store company and employee data

You can create a third table with the name of shop_customers where you have both the ids of shop and customer as foreign key mapping.

Create table shops(
shop_id integer,
primary key (shop_id)
);

Create table customers(
customer_id integer,
primary key (customer_id)
);

CREATE TABLE shop_customers(
shop_id integer,
customer_id integer,
Primary Key (shop_id,customer_id),
Foreign Key (shop_id) REFERENCES shops(shop_id),
Foreign Key (customer_id) REFERENCES customers(customer_id)
);

Update Picker Input, reuse selected data

That is because your disabled_choices had all the rows, while the choices had unique Polygons. The number of elements should be same in both. Try this

library(shiny)
library(shinyWidgets)
library(shinythemes)
library(shinycssloaders)
library(shinydashboard)
library(DT)

codebook <- structure(list(X = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
                                 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Polygon = c("polygon_a",
                                                                                      "polygon_a", "polygon_a", "polygon_a", "polygon_a", "polygon_a",
                                                                                      "polygon_a", "polygon_a", "polygon_b", "polygon_b", "polygon_b",
                                                                                      "polygon_b", "polygon_b", "polygon_b", "polygon_b", "polygon_b",
                                                                                      "polygon_c", "polygon_c", "polygon_c", "polygon_c"), Year = c("year_1",
                                                                                                                                                    "year_1", "year_1", "year_1", "year_2", "year_2", "year_2", "year_2",
                                                                                                                                                    "year_1", "year_1", "year_1", "year_1", "year_2", "year_2", "year_2",
                                                                                                                                                    "year_2", "year_1", "year_1", "year_1", "year_1"), Variable = c("Variable1",
                                                                                                                                                                                                                    "Variable2", "Variable3", "Variable4", "Variable1", "Variable2",
                                                                                                                                                                                                                    "Variable3", "Variable4", "Variable1", "Variable2", "Variable3",
                                                                                                                                                                                                                    "Variable4", "Variable1", "Variable2", "Variable3", "Variable4",
                                                                                                                                                                                                                    "Variable1", "Variable2", "Variable3", "Variable4"), Value = c(1L,
                                                                                                                                                                                                                                                                                   245L, 23L, 2L, 0L, 34L, 1L, 245L, 1L, 23L, 2L, 0L, 0L, 34L, 0L,
                                                                                                                                                                                                                                                                                   34L, 0L, 34L, 90L, 9L)), class = "data.frame", row.names = c(NA,
                                                                                                                                                                                                                                                                                                                                                -20L))

##########

# Define UI -----------------------------------------------
ui <- fluidPage(

  # Application title
  titlePanel("Colonial Concessions Within DRC"),

  # Parameters
  sidebarLayout(
    sidebarPanel(
      selectInput(inputId = "input_period", label = "Period",
                  choices = c("2001" = "year_1", "2002" = "year_2", "2003" = "year_3")),
      pickerInput(
        inputId = "picker_cny",
        label = "Select Polygon",
        choices = unique(codebook$Polygon),
        options = list(`actions-box` = TRUE),
        multiple = TRUE),
      width = 2),

    # Displat the reactive map
    mainPanel(
      #leafletOutput("m"),
      DTOutput("t1"),
      #tableOutput("table"),
      width = 10)
  ))

# Define Server ------------------------------------------
server <- function(input, output, session) {

  output$t1 <- renderDT({
    ###  this will display all data when no Polygon is selected 
    # if (is.null(input$input_period)) myperiod <- unique(codebook$Year)  else myperiod <- input$input_period
    # if (is.null(input$picker_cny)) mypolygon <- unique(codebook$Polygon) else mypolygon <- input$picker_cny
    # code1 <- codebook[codebook$Year %in% myperiod & (codebook$Polygon %in% mypolygon),]
    
    ###  below code will display data only when at least one Polygon is selected; comment the line below and uncomment 3 lines above to display the other way
    code1 <- codebook[codebook$Year %in% input$input_period & (codebook$Polygon %in% input$picker_cny),]
    
    code1
  })

  # Reactive pickerInput ---------------------------------
  observeEvent(input$input_period, {

    #codebook_mod <- codebook[codebook$Year == paste0(input$input_period), ]

    # Generate reactive picker input
    code1 <- codebook[codebook$Year %in% input$input_period,]
    codeu <- unique(codebook$Polygon)
    code1u <- unique(code1$Polygon)
    disabled_choices <- ifelse(codeu %in% code1u, 0,1)
    #print(disabled_choices)
    updatePickerInput(session = session,
                      inputId = "picker_cny",
                      choices = unique(codebook$Polygon),
                      choicesOpt = list(
                        disabled = disabled_choices,
                        style = ifelse(disabled_choices,
                                       yes = "color: rgba(119, 119, 119, 0.5);",
                                       no = "")
                      )
                      )
  }, ignoreInit = TRUE)

}

# Run the application
shinyApp(ui = ui, server = server)

making the distinction between missing value types (non-response vs skip patterns)

Using NA, Inf, -Inf and NaN we can represent 4 categories of numeric missing values. Below we show the use of NA with Inf and then NA with NaN. In the third approach we discuss the use of naniar package.

1) Recode q2 values of Yes, No, structural missing and missing to 1, 0, Inf and NA respectively. Note that is.na(x) will only report TRUE for an actual NA, is.infinite(x) will only report TRUE for an Inf and !is.finite(x) will report TRUE for NA or Inf in case you need to perform tests. Optionally recode the output back.

df %>% 
  count(q2 = recode(q2, Yes = 1, No = 0, .missing = ifelse(q1 == "No", Inf, NA)))

giving:

# A tibble: 3 x 2
# Groups:   q2 [3]
     q2     n
  <dbl> <int>
1     1     1
2   Inf     2
3    NA     1

2) A variation on this is to use NaN in place of Inf. In that case tests can use is.na(x), is.nan(x) and !is.finite(x)

df %>% 
  count(q2 = recode(q2, Yes = 1, No = 0, .missing = ifelse(q1 == "No", NaN, NA)))

giving:

# A tibble: 3 x 2
     q2     n
  <dbl> <int>
1     1     1
2    NA     1
3   NaN     2

3) The naniar package can create auxilliary columns that define the type of each NA using bind_shadow. We can then recode the auxilliary columns using recode_shadow and then use those in our counting.

library(naniar)

library(naniar)
df %>%
  bind_shadow %>%
  recode_shadow(q2 = .where(is.na(q2) & q1 == "No" ~ "struct")) %>%
  count(q2, q2_NA)

giving:

# A tibble: 3 x 3
  q2    q2_NA         n
  <chr> <fct>     <int>
1 Yes   !NA           1
2 <NA>  NA            1
3 <NA>  NA_struct     2

How to Create, Structure, Maintain and Update Data Codebooks in R