What Can R Do About a Messy Data Format

What can R do about a messy data format?

Using data.table::fread:

x = '
+------------+------+------+----------+--------------------------+
| Date       | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A    | A1   | 0        | 0                        |
| 2018-06-03 | A    | A2   | 0        | 1                        |
| 2018-06-03 | A    | A3   | 0        | 2                        |
| 2018-06-03 | A    | A4   | 1        | 1                        |
| 2018-06-03 | A    | A5   | 2        | 1                        |
| 2018-06-04 | A    | A6   | 0        | 3                        |
| 2018-06-01 | B    | B1   | 0        | 1                        |
| 2018-06-02 | B    | B2   | 0        | 2                        |
| 2018-06-03 | B    | B3   | 0        | 3                        |
+------------+------+------+----------+--------------------------+
'

library(data.table)

fread(gsub('[\\+-]+\\n', '', x), drop = c(1, 7))

#          Date Emp1 Case Priority PriorityCountinLast7days
# 1: 2018-06-01    A   A1        0                        0
# 2: 2018-06-03    A   A2        0                        1
# 3: 2018-06-03    A   A3        0                        2
# 4: 2018-06-03    A   A4        1                        1
# 5: 2018-06-03    A   A5        2                        1
# 6: 2018-06-04    A   A6        0                        3
# 7: 2018-06-01    B   B1        0                        1
# 8: 2018-06-02    B   B2        0                        2
# 9: 2018-06-03    B   B3        0                        3

The gsub() call removes the horizontal rules; drop = c(1, 7) removes the empty columns created by the pipe delimiters at the start and end of each line.
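
To see what that cleaning step produces before fread() parses it, you can print the intermediate string (a quick check, not part of the original answer):

# the "+----+" rule lines are stripped, leaving a plain pipe-delimited block
cat(gsub('[\\+-]+\\n', '', x))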

How to clean messy date formats in a dataframe using R

Use extract from tidyr. If a value contains two years, it will use the first one.

library(dplyr)
library(tidyr)
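
The original answer does not define df; here is a sketch that reproduces the output below (the values are inferred from that output, so treat them as an assumption):

# hypothetical reconstruction of the input data
df <- data.frame(
  z    = paste("Date", 1:10),
  Date = c("2000-10-22", "9/21/2001", "2003", "2017/2018", "9/28/2010",
           "9/27/2011", "2019/2020", "2017-10/2018-12", "NA", NA),
  stringsAsFactors = FALSE
)
# rows 9 and 10: one literal "NA" string and one true NA (a guess from the printed output)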

df %>% extract(Date, "Year", "(\\d{4})", remove = FALSE, convert = TRUE)

giving:

         z            Date Year
1   Date 1      2000-10-22 2000
2   Date 2       9/21/2001 2001
3   Date 3            2003 2003
4   Date 4       2017/2018 2017
5   Date 5       9/28/2010 2010
6   Date 6       9/27/2011 2011
7   Date 7       2019/2020 2019
8   Date 8 2017-10/2018-15 2017
9   Date 9              NA   NA
10 Date 10            <NA>   NA

If the second year is needed as well then:

df %>%
  extract(Date, "Year2", "\\d{4}.*(\\d{4})", remove = FALSE, convert = TRUE) %>%
  extract(Date, "Year", "(\\d{4})", remove = FALSE, convert = TRUE)

giving:

         z            Date Year Year2
1   Date 1      2000-10-22 2000    NA
2   Date 2       9/21/2001 2001    NA
3   Date 3            2003 2003    NA
4   Date 4       2017/2018 2017  2018
5   Date 5       9/28/2010 2010    NA
6   Date 6       9/27/2011 2011    NA
7   Date 7       2019/2020 2019  2020
8   Date 8 2017-10/2018-12 2017  2018
9   Date 9              NA   NA    NA
10 Date 10            <NA>   NA    NA

Tidy complicated messy data into long data format in R

Here is a solution using functions from dplyr and tidyr. dt2 is the final output.

# Load package
library(dplyr)
library(tidyr)

# Create example data frame
dt <- read.table(text = "id A B
1 2 3
1 3 4
1 5 5
1 NA 6
1 NA 7",
header = TRUE, stringsAsFactors = FALSE)

# Process the data
dt2 <- dt %>%
  gather(Label, Value, -id) %>%
  drop_na(Value) %>%
  group_by(id, Label) %>%
  mutate(Label_Id = 1:n()) %>%
  unite(Col, Label, Label_Id, sep = "") %>%
  spread(Col, Value)
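
gather() and spread() still work but are superseded in current tidyr. For reference, a sketch of the same pipeline written with pivot_longer()/pivot_wider() (an alternative I am adding here, not part of the original answer; it should give the same table as dt2 above):

dt2_alt <- dt %>%
  pivot_longer(-id, names_to = "Label", values_to = "Value",
               values_drop_na = TRUE) %>%
  group_by(id, Label) %>%
  mutate(Label_Id = row_number()) %>%
  unite(Col, Label, Label_Id, sep = "") %>%
  pivot_wider(names_from = Col, values_from = Value)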

Update: Create a function to generalize the process

Based on the comment, the OP asked for a more "generalized" approach, which I may not fully understand, but here I show how to convert the above code into a function and set up three test cases. The function flatten has one argument, the input tbl or data frame, which should have an id column followed by value columns such as A, B, C, D, and so on.

# Load package
library(dplyr)
library(tidyr)

# Process the data
flatten <- function(dt){
  dt %>%
    gather(Label, Value, -id) %>%
    drop_na(Value) %>%
    group_by(id, Label) %>%
    mutate(Label_Id = 1:n()) %>%
    unite(Col, Label, Label_Id, sep = "") %>%
    spread(Col, Value)
}


### Test Case 1
test1 <- data_frame(id = rep(1, 5),
                    A = c(2, 3, 5, NA, NA),
                    B = 3:7)
test1_result <- flatten(test1)

### Test Case 2
test2 <- data_frame(id = c(rep(1, 5), rep(2, 8)),
                    A = c(2, 3, 5, NA, NA, 3, 4, 6, 8, 9, NA, 10, 12),
                    B = 3:15)
test2_result <- flatten(test2)

### Test Case 3
test3 <- data_frame(id = c(rep(1, 5), rep(2, 8)),
                    A = c(2, 3, 5, NA, NA, 3, 4, 6, 8, 9, NA, 10, 12),
                    B = 3:15,
                    C = c(rep(c(1, 2, 3, 4, 5), each = 2), NA, NA, NA),
                    D = seq(2, 26, 2))

test3_result <- flatten(test3)
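
If several such data frames need the same treatment, flatten() can simply be mapped over a list; a minimal sketch reusing the test cases above:

# apply flatten() to each test case in one go
all_results <- lapply(list(test1 = test1, test2 = test2, test3 = test3), flatten)
all_results$test1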

fixing a messy dataframe with tidyr in R

I don't have a solution that uses only tidyr (and dplyr), though perhaps somebody more familiar with them can improve on it. (There is room to include more of the tidyverse, specifically purrr, to replace some of the base R code, but I thought it unnecessary.) I'll walk through each step, with the full solution at the bottom.

Data

First, I think some of the columns are misnamed (lower-case "debt"), so I fixed them; that's not absolutely critical, but it makes some things much easier. I also disable factors, as some operations (on debt, below) require strings. If having factors is important, I suggest you re-factor after this process.

df <- data.frame(
  id = c(rep(1, 4), rep(2, 5)), order = c(1:4, 1:5),
  age = c(54, 20, 23, 17, 60, 57, 28, 33, 19),
  educDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA),
  educDebt2 = c(3, NA, NA, NA, 5, NA, NA, NA, NA),
  educDebt3 = c(NA, NA, NA, NA, 4, NA, NA, NA, NA),
  educDebt1t = c("student loan", NA, NA, NA, "student loan", NA, NA, NA, NA),
  educDebt2t = c("student fund", NA, NA, NA, "bank credit", NA, NA, NA, NA),
  educDebt3t = c(NA, NA, NA, NA, "bank credit", NA, NA, NA, NA),
  educDebt1t_r = c("yes", NA, NA, NA, "no", NA, NA, NA, NA),
  educDebt2t_r = c("no", NA, NA, NA, "no", NA, NA, NA, NA),
  educDebt3t_r = c(NA, NA, NA, NA, "yes", NA, NA, NA, NA),
  bankDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA),
  bankDebt2 = c(4, NA, NA, NA, 2, NA, NA, NA, NA),
  bankDebt3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
  bankDebt1t = c("car loan", NA, NA, NA, "consumer loan", NA, NA, NA, NA),
  bankDebt2t = c("car loan", NA, NA, NA, "car loan", NA, NA, NA, NA),
  bankDebt3t = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
  bankDebt1t_r = c("yes", NA, NA, NA, "yes", NA, NA, NA, NA),
  bankDebt2t_r = c("no", NA, NA, NA, "no", NA, NA, NA, NA),
  bankDebt3t_r = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)
library(dplyr)
library(tidyr)

Step-through

Eventually we're going to merge age back in, and since all respondents are identified by both id and order, we set those three columns aside:

maintbl <- select(df, id, order, age)

The first thing to realize (for me) is that you need to convert from wide to tall, but separately for each three-column group. I'll start with the first group of three:

grp <- "educDebt"
select(df, id, matches(paste0(grp, "[0-9]+$"))) %>%
  gather(debt, order, -id) %>%
  filter(! is.na(order)) %>%
  arrange(id, order)
#   id      debt order
# 1  1 educDebt1     1
# 2  1 educDebt2     3
# 3  2 educDebt1     3
# 4  2 educDebt3     4
# 5  2 educDebt2     5

(BTW 1: the reason I'm using grp will be apparent later.)

(BTW 2: I used the regex [0-9]+ to match one or more digits, in case this is expanded to include either more than 9 or "arbitrary" numbering. Feel free to omit the +.)
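
To see which columns that pattern actually selects (and which it excludes), a quick check against names(df):

grep(paste0(grp, "[0-9]+$"), names(df), value = TRUE)
# [1] "educDebt1" "educDebt2" "educDebt3"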

This seems fine. We now need to cbind the *t variant of these three:

select(df, id, matches(paste0(grp, "[0-9]+t$"))) %>%
  gather(debt, type, -id) %>%
  filter(! is.na(type)) %>%
  mutate(debt = gsub("t$", "", debt))
#   id      debt         type
# 1  1 educDebt1 student loan
# 2  2 educDebt1 student loan
# 3  1 educDebt2 student fund
# 4  2 educDebt2  bank credit
# 5  2 educDebt3  bank credit

I changed debt to remove the trailing t, as I'm going to use that as a merging column later. I do the same thing for the third group of three (for "educDebt"), the t_r columns.

These three columns need to be combined, so here I place them in a list and Reduce them:

Reduce(function(x, y) left_join(x, y, by = c("id", "debt")),
       list(
         select(df, id, matches(paste0(grp, "[0-9]+$"))) %>%
           gather(debt, order, -id) %>%
           filter(! is.na(order)) %>%
           arrange(id, order),
         select(df, id, matches(paste0(grp, "[0-9]+t$"))) %>%
           gather(debt, type, -id) %>%
           filter(! is.na(type)) %>%
           mutate(debt = gsub("t$", "", debt)),
         select(df, id, matches(paste0(grp, "[0-9]+t_r$"))) %>%
           gather(debt, r, -id) %>%
           filter(! is.na(r)) %>%
           mutate(debt = gsub("t_r$", "", debt))
       ))
#   id      debt order         type   r
# 1  1 educDebt1     1 student loan yes
# 2  1 educDebt2     3 student fund  no
# 3  2 educDebt1     3 student loan  no
# 4  2 educDebt3     4  bank credit yes
# 5  2 educDebt2     5  bank credit  no

I'll need to rename the last two columns, and since I'm done combining the type and r columns, I can drop debt. (I'd normally suggest dplyr::rename_, but since it is being deprecated shortly, I'm doing it manually. If you have significantly more columns than shown here, you may need to adjust the column numbering, etc.)
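
As an aside, with current dplyr the same rename can be done by name rather than by position, using tidy evaluation; a sketch, assuming ret and grp as defined in the full pipeline below:

# rename the generic type/r columns to e.g. educDebt / educDebt_r
ret <- rename(ret, !!grp := type, !!paste0(grp, "_r") := r)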

Lastly, we need to do this for each of "educDebt" and "bankDebt", join these by id and order (using another Reduce), and finally re-merge in the age.

TL;DR

Reduce(function(x, y) left_join(x, y, by = c("id", "order")),
       lapply(c("educDebt", "bankDebt"), function(grp) {
         ret <- Reduce(function(x, y) left_join(x, y, by = c("id", "debt")),
                       list(
                         select(df, id, matches(paste0(grp, "[0-9]+$"))) %>%
                           gather(debt, order, -id) %>%
                           filter(! is.na(order)) %>%
                           arrange(id, order),
                         select(df, id, matches(paste0(grp, "[0-9]+t$"))) %>%
                           gather(debt, type, -id) %>%
                           filter(! is.na(type)) %>%
                           mutate(debt = gsub("t$", "", debt)),
                         select(df, id, matches(paste0(grp, "[0-9]+t_r$"))) %>%
                           gather(debt, r, -id) %>%
                           filter(! is.na(r)) %>%
                           mutate(debt = gsub("t_r$", "", debt))
                       ))
         names(ret)[4:5] <- c(grp, paste0(grp, "_r"))
         select(ret, -debt)
       })
) %>%
  left_join(maintbl, ., by = c("id", "order"))
#   id order age     educDebt educDebt_r      bankDebt bankDebt_r
# 1  1     1  54 student loan        yes      car loan        yes
# 2  1     2  20         <NA>       <NA>          <NA>       <NA>
# 3  1     3  23 student fund         no          <NA>       <NA>
# 4  1     4  17         <NA>       <NA>          <NA>       <NA>
# 5  2     1  60         <NA>       <NA>          <NA>       <NA>
# 6  2     2  57         <NA>       <NA>          <NA>       <NA>
# 7  2     3  28 student loan         no consumer loan        yes
# 8  2     4  33  bank credit        yes          <NA>       <NA>
# 9  2     5  19  bank credit         no          <NA>       <NA>

How can I structure and recode messy categorical data in R?

Make it "long":

library(reshape)
dnow <- read.csv("~/Downloads/catsample20100504.csv")
dnow <- melt(dnow, id.vars=c("Student", "instructor"))
dnow$variable <- NULL ## since ordering does not matter
subset(dnow, Student%in%c(41,42)) ## see the results

What to do next will depend on the kind of analysis you would like to do, but the long format is the most useful for irregular data such as yours.
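
For illustration, since the CSV itself isn't shown, here is a toy wide data set (the column names and values are hypothetical) run through the same melt() call:

toy <- data.frame(Student    = c(41, 42),
                  instructor = c("Smith", "Jones"),
                  cat1       = c("algebra", "geometry"),
                  cat2       = c("fractions", NA),
                  stringsAsFactors = FALSE)
toy_long <- melt(toy, id.vars = c("Student", "instructor"))
toy_long$variable <- NULL  ## ordering does not matter, as above
toy_long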

importing messy data using R

How is this?

> nicelyFormatted
     [,1]           [,2]    [,3]    [,4]   [,5] [,6]  [,7]  [,8]  [,9]   [,10]
[1,] "Black Eagles" "01/12" "12/11" "1500" "W"  "7.0" "420" "48"  "Away" "+3"
[2,] "Blue State"   "02/18" "04/21" "1293" "L"  "8.0" "490" "48"  "Home" "+1"
[3,] "Hawks"        "01/13" "02/17" "1028" "L"  "4.0" "46"  "460" "Away" NA
[4,] "New Apple"    "09/23" "11/23" "563"  "L"  "3.0" "470" "47"  "Home" "+2"
[5,] "Black White"  "07/05" "09/26" "713"  "L"  "5.2" "500" "45"  "Home" "+4"
[6,] "PBO"          "10/24" "10/30" "1495" "L"  "1.9" "47"  "410" "Away" NA



Here is the code that was used to get the above table:

library(stringr)

# Open connection to file
pathToFile <- path.expand("~/path/to/file/myfile.txt")
f <- file(pathToFile, "rb")

# Read in lines, then close the connection
rawText <- readLines(f)
close(f)

# Find the dashes
dsh <- str_locate_all(rawText, " - ")

# Splice, using the dashes as a guide
lng <- length(rawText)
spliced <- sapply(1:lng, function(i)
  spliceOnDash(rawText[[i]], dsh[[c(i, 1)]], dsh[[c(i, 2)]])
)

# Make it purtty
nicelyFormatted <- formatNicely(spliced)
nicelyFormatted


#-------------------#
#     FUNCTIONS     #
#-------------------#


spliceOnDash <- function(strn, start, end) {

  # split around the date
  pre   <- substr(strn, 1, start - 6)
  dates <- substr(strn, start - 5, end + 5)
  post  <- substr(strn, end + 6, str_length(strn))

  # Clean up
  pre <- str_trim(pre)

  # replace all double spaces with single spaces
  while (str_detect(post, "  ")) {
    post <- str_replace_all(str_trim(post), "  ", " ")
  }

  # splice on space
  post <- str_split(post, " ")

  # if dates are one field, remove this next line
  dates <- str_split(dates, " - ")

  # return
  c(unlist(pre), unlist(dates), unlist(post))
}

# Function to clean up the list into a nice table
formatNicely <- function(spliced) {
  lngst <- max(sapply(spliced, length))
  t(sapply(spliced, function(x)
    if (length(x) < lngst) c(x, rep(NA, lngst - length(x))) else x))
}
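
For reference, a stand-alone call on a single line (the raw line below is a guess at the original file's layout, inferred from the table at the top of this answer):

line <- "Black Eagles  01/12 - 12/11  1500 W 7.0 420 48 Away +3"  # hypothetical raw line
d <- str_locate_all(line, " - ")[[1]]                             # start/end of the " - " separator
spliceOnDash(line, d[1, 1], d[1, 2])
# returns: "Black Eagles" "01/12" "12/11" "1500" "W" "7.0" "420" "48" "Away" "+3"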

Is there an R function to clean messy salaries in character format?

Here are some first steps you can try.
I define two functions: one replaces a k or K with three zeros.
The other expands the lower bound of a range (e.g. 50 becomes 50,000) when only the upper bound is written out in thousands.

rem_k <- function(x) {
  sub("(\\d)[kK]", "\\1,000", x)
}

add_zero <- function(x) {
  ifelse(grepl("[1-9]0\\-\\d[0,]{2,}", x),
         sub("([1-9]0)(\\-\\d[0,]{2,})", "\\1,000\\2", x),
         x)
}
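
The answer does not define df; reconstructing it from the output at the end of this section (so the exact strings are an assumption):

# hypothetical input, inferred from the printed result below
library(dplyr)

df <- data.frame(
  salary = c("40,000-60,000", "40-80K", "$100,000", "$70/hr",
             "Between $65-80/hour", "$100k", "50-60,000 a year", "90"),
  stringsAsFactors = FALSE
)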

Finally, I remove all non-essential characters:

df %>%
  mutate(salary2 = gsub("[^0-9,\\-]", "", add_zero(rem_k(salary))))

               salary       salary2
1       40,000-60,000 40,000-60,000
2              40-80K 40,000-80,000
3            $100,000       100,000
4              $70/hr            70
5 Between $65-80/hour         65-80
6               $100k       100,000
7    50-60,000 a year 50,000-60,000
8                  90            90

Use function to convert messy data in many data frames to single tidier format

I think you overcomplicate the task:

  1. Use read.table with the skip option to skip the first lines
  2. Extract the X columns and flatten them into a single vector: volume
  3. Extract the Y columns and flatten them into a single vector: flow
  4. Use recycling to create the other columns

Here is my code:

## replace text = dat with file = <your file name> to read from a file
d_f <- read.table(text = dat, header = TRUE, skip = 3)[, -1]
## extract only the X columns and flatten them into a single vector
volume <- unlist(as.list(d_f[grep('X', colnames(d_f))]))
## extract only the Y columns and flatten them into a single vector
flow <- unlist(as.list(d_f[grep('Y', colnames(d_f))]))
## create the data frame, using recycling for the other columns
data.frame(id = 1, name = 'Ball', routine = 'tech',
           trial = 'pos1', volume = volume, flow = flow)

     id name routine trial     volume   flow
X1    1 Ball    tech  pos1 0.00000000 0.0000
X2    1 Ball    tech  pos1 0.05022222 2.1945
X3    1 Ball    tech  pos1 0.10044444 5.7684
X.11  1 Ball    tech  pos1 0.00000000 0.0000
X.12  1 Ball    tech  pos1 0.05022222 1.9437
X.13  1 Ball    tech  pos1 0.10044444 4.7652
X.21  1 Ball    tech  pos1 0.00000000 0.0000
X.22  1 Ball    tech  pos1 0.05000000 1.2540
X.23  1 Ball    tech  pos1 0.10000000 4.2636

