What Can R Do About a Messy Data Format

What can R do about a messy data format?

Using data.table::fread:

x = '
+------------+------+------+----------+--------------------------+
| Date       | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A    | A1   | 0        | 0                        |
| 2018-06-03 | A    | A2   | 0        | 1                        |
| 2018-06-03 | A    | A3   | 0        | 2                        |
| 2018-06-03 | A    | A4   | 1        | 1                        |
| 2018-06-03 | A    | A5   | 2        | 1                        |
| 2018-06-04 | A    | A6   | 0        | 3                        |
| 2018-06-01 | B    | B1   | 0        | 1                        |
| 2018-06-02 | B    | B2   | 0        | 2                        |
| 2018-06-03 | B    | B3   | 0        | 3                        |
+------------+------+------+----------+--------------------------+
'

library(data.table)

fread(gsub('[\\+-]+\\n', '', x), drop = c(1, 7))

#          Date Emp1 Case Priority PriorityCountinLast7days
# 1: 2018-06-01    A   A1        0                        0
# 2: 2018-06-03    A   A2        0                        1
# 3: 2018-06-03    A   A3        0                        2
# 4: 2018-06-03    A   A4        1                        1
# 5: 2018-06-03    A   A5        2                        1
# 6: 2018-06-04    A   A6        0                        3
# 7: 2018-06-01    B   B1        0                        1
# 8: 2018-06-02    B   B2        0                        2
# 9: 2018-06-03    B   B3        0                        3

The gsub() call removes the horizontal rules; drop = c(1, 7) removes the empty columns created by the pipe delimiters at the start and end of each line.
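
To see what that cleaning step produces before fread() parses it, you can print the intermediate string (a quick check, not part of the original answer):

# the "+----+" rule lines are stripped, leaving a plain pipe-delimited block
cat(gsub('[\\+-]+\\n', '', x))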

How to clean messy date formats in a dataframe using R

Use extract from tidyr. If a value contains two years, it will use the first one.

library(dplyr)
library(tidyr)
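
The original answer does not define df; here is a sketch that reproduces the output below (the values are inferred from that output, so treat them as an assumption):

# hypothetical reconstruction of the input data
df <- data.frame(
  z    = paste("Date", 1:10),
  Date = c("2000-10-22", "9/21/2001", "2003", "2017/2018", "9/28/2010",
           "9/27/2011", "2019/2020", "2017-10/2018-12", "NA", NA),
  stringsAsFactors = FALSE
)
# rows 9 and 10: one literal "NA" string and one true NA (a guess from the printed output)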

df %>% extract(Date, "Year", "(\\d{4})", remove = FALSE, convert = TRUE)

giving:

         z            Date Year
1   Date 1      2000-10-22 2000
2   Date 2       9/21/2001 2001
3   Date 3            2003 2003
4   Date 4       2017/2018 2017
5   Date 5       9/28/2010 2010
6   Date 6       9/27/2011 2011
7   Date 7       2019/2020 2019
8   Date 8 2017-10/2018-15 2017
9   Date 9              NA   NA
10 Date 10            <NA>   NA

If the second year is needed as well then:

df %>%
  extract(Date, "Year2", "\\d{4}.*(\\d{4})", remove = FALSE, convert = TRUE) %>%
  extract(Date, "Year", "(\\d{4})", remove = FALSE, convert = TRUE)

giving:

         z            Date Year Year2
1   Date 1      2000-10-22 2000    NA
2   Date 2       9/21/2001 2001    NA
3   Date 3            2003 2003    NA
4   Date 4       2017/2018 2017  2018
5   Date 5       9/28/2010 2010    NA
6   Date 6       9/27/2011 2011    NA
7   Date 7       2019/2020 2019  2020
8   Date 8 2017-10/2018-12 2017  2018
9   Date 9              NA   NA    NA
10 Date 10            <NA>   NA    NA

Tidy complicated messy data into long data format in R

Here is a solution using functions from dplyr and tidyr. dt2 is the final output.

# Load package
library(dplyr)
library(tidyr)

# Create example data frame
dt <- read.table(text = "id A B
1 2 3
1 3 4
1 5 5
1 NA 6
1 NA 7",
header = TRUE, stringsAsFactors = FALSE)

# Process the data
dt2 <- dt %>%
  gather(Label, Value, -id) %>%
  drop_na(Value) %>%
  group_by(id, Label) %>%
  mutate(Label_Id = 1:n()) %>%
  unite(Col, Label, Label_Id, sep = "") %>%
  spread(Col, Value)
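
gather() and spread() still work but are superseded in current tidyr. For reference, a sketch of the same pipeline written with pivot_longer()/pivot_wider() (an alternative I am adding here, not part of the original answer; it should give the same table as dt2 above):

dt2_alt <- dt %>%
  pivot_longer(-id, names_to = "Label", values_to = "Value",
               values_drop_na = TRUE) %>%
  group_by(id, Label) %>%
  mutate(Label_Id = row_number()) %>%
  unite(Col, Label, Label_Id, sep = "") %>%
  pivot_wider(names_from = Col, values_from = Value)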

Update: Create a function to generalize the process

Based on the comment, the OP asked for a more "generalized" approach, which I may not fully understand, but here I show how to convert the above code into a function and set up three test cases. The function flatten has one argument, the input tbl or data frame, which should have an id column followed by value columns such as A, B, C, D, and so on.

# Load package
library(dplyr)
library(tidyr)

# Process the data
flatten <- function(dt){
  dt %>%
    gather(Label, Value, -id) %>%
    drop_na(Value) %>%
    group_by(id, Label) %>%
    mutate(Label_Id = 1:n()) %>%
    unite(Col, Label, Label_Id, sep = "") %>%
    spread(Col, Value)
}


### Test Case 1
test1 <- data_frame(id = rep(1, 5),
                    A = c(2, 3, 5, NA, NA),
                    B = 3:7)
test1_result <- flatten(test1)

### Test Case 2
test2 <- data_frame(id = c(rep(1, 5), rep(2, 8)),
                    A = c(2, 3, 5, NA, NA, 3, 4, 6, 8, 9, NA, 10, 12),
                    B = 3:15)
test2_result <- flatten(test2)

### Test Case 3
test3 <- data_frame(id = c(rep(1, 5), rep(2, 8)),
                    A = c(2, 3, 5, NA, NA, 3, 4, 6, 8, 9, NA, 10, 12),
                    B = 3:15,
                    C = c(rep(c(1, 2, 3, 4, 5), each = 2), NA, NA, NA),
                    D = seq(2, 26, 2))

test3_result <- flatten(test3)
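
If several such data frames need the same treatment, flatten() can simply be mapped over a list; a minimal sketch reusing the test cases above:

# apply flatten() to each test case in one go
all_results <- lapply(list(test1 = test1, test2 = test2, test3 = test3), flatten)
all_results$test1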

fixing a messy dataframe with tidyr in R

I don't have a solution that uses only tidyr (and dplyr), though perhaps somebody more familiar with them can improve on it. (There is room to include more of the tidyverse, specifically purrr, to replace some of the base R code, but I thought it unnecessary.) I'll walk through each step, with the full solution at the bottom.

Data

First, I think some of the columns are misnamed (lower-case "debt"), so I fixed them; that's not absolutely critical, but it makes some things much easier. I also disable factors, as some operations (on debt, below) require strings. If having factors is important, I suggest you re-factor after this process.

df <- data.frame(
  id = c(rep(1, 4), rep(2, 5)), order = c(1:4, 1:5),
  age = c(54, 20, 23, 17, 60, 57, 28, 33, 19),
  educDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA),
  educDebt2 = c(3, NA, NA, NA, 5, NA, NA, NA, NA),
  educDebt3 = c(NA, NA, NA, NA, 4, NA, NA, NA, NA),
  educDebt1t = c("student loan", NA, NA, NA, "student loan", NA, NA, NA, NA),
  educDebt2t = c("student fund", NA, NA, NA, "bank credit", NA, NA, NA, NA),
  educDebt3t = c(NA, NA, NA, NA, "bank credit", NA, NA, NA, NA),
  educDebt1t_r = c("yes", NA, NA, NA, "no", NA, NA, NA, NA),
  educDebt2t_r = c("no", NA, NA, NA, "no", NA, NA, NA, NA),
  educDebt3t_r = c(NA, NA, NA, NA, "yes", NA, NA, NA, NA),
  bankDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA),
  bankDebt2 = c(4, NA, NA, NA, 2, NA, NA, NA, NA),
  bankDebt3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
  bankDebt1t = c("car loan", NA, NA, NA, "consumer loan", NA, NA, NA, NA),
  bankDebt2t = c("car loan", NA, NA, NA, "car loan", NA, NA, NA, NA),
  bankDebt3t = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
  bankDebt1t_r = c("yes", NA, NA, NA, "yes", NA, NA, NA, NA),
  bankDebt2t_r = c("no", NA, NA, NA, "no", NA, NA, NA, NA),
  bankDebt3t_r = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)
library(dplyr)
library(tidyr)

Step-through

Eventually we're going to merge age back in, and since all respondents are identified by both id and order, we set those three columns aside:

maintbl <- select(df, id, order, age)

The first thing to realize (for me) is that you need to convert from wide to tall, but separately for each three-column group. I'll start with the first group of three:

grp <- "educDebt"
select(df, id, matches(paste0(grp, "[0-9]+$"))) %>%
  gather(debt, order, -id) %>%
  filter(! is.na(order)) %>%
  arrange(id, order)
#   id      debt order
# 1  1 educDebt1     1
# 2  1 educDebt2     3
# 3  2 educDebt1     3
# 4  2 educDebt3     4
# 5  2 educDebt2     5

(BTW 1: the reason I'm using grp will be apparent later.)

(BTW 2: I used the regex [0-9]+ to match one or more digits, in case this is expanded to include either more than 9 or "arbitrary" numbering. Feel free to omit the +.)
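
To see which columns that pattern actually selects (and which it excludes), a quick check against names(df):

grep(paste0(grp, "[0-9]+$"), names(df), value = TRUE)
# [1] "educDebt1" "educDebt2" "educDebt3"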

This seems fine. We now need to cbind the *t variant of these three:

select(df, id, matches(paste0(grp, "[0-9]+t$"))) %>%
  gather(debt, type, -id) %>%
  filter(! is.na(type)) %>%
  mutate(debt = gsub("t$", "", debt))
#   id      debt         type
# 1  1 educDebt1 student loan
# 2  2 educDebt1 student loan
# 3  1 educDebt2 student fund
# 4  2 educDebt2  bank credit
# 5  2 educDebt3  bank credit

I changed debt to remove the trailing t, as I'm going to use that as a merging column later. I do the same thing for the third group of three (for "educDebt"), the t_r columns.

These three columns need to be combined, so here I place them in a list and Reduce them:

Reduce(function(x, y) left_join(x, y, by = c("id", "debt")),
       list(
         select(df, id, matches(paste0(grp, "[0-9]+$"))) %>%
           gather(debt, order, -id) %>%
           filter(! is.na(order)) %>%
           arrange(id, order),
         select(df, id, matches(paste0(grp, "[0-9]+t$"))) %>%
           gather(debt, type, -id) %>%
           filter(! is.na(type)) %>%
           mutate(debt = gsub("t$", "", debt)),
         select(df, id, matches(paste0(grp, "[0-9]+t_r$"))) %>%
           gather(debt, r, -id) %>%
           filter(! is.na(r)) %>%
           mutate(debt = gsub("t_r$", "", debt))
       ))
#   id      debt order         type   r
# 1  1 educDebt1     1 student loan yes
# 2  1 educDebt2     3 student fund  no
# 3  2 educDebt1     3 student loan  no
# 4  2 educDebt3     4  bank credit yes
# 5  2 educDebt2     5  bank credit  no

I'll need to rename the last two columns, and since I'm done combining the type and r columns, I can drop debt. (I'd normally suggest dplyr::rename_, but since it is being deprecated shortly, I'm doing it manually. If you have significantly more columns than shown here, you may need to adjust the column numbering, etc.)
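
As an aside, with current dplyr the same rename can be done by name rather than by position, using tidy evaluation; a sketch, assuming ret and grp as defined in the full pipeline below:

# rename the generic type/r columns to e.g. educDebt / educDebt_r
ret <- rename(ret, !!grp := type, !!paste0(grp, "_r") := r)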

Lastly, we need to do this for each of "educDebt" and "bankDebt", join these by id and order (using another Reduce), and finally re-merge in the age.

TL;DR

Reduce(function(x, y) left_join(x, y, by = c("id", "order")),
       lapply(c("educDebt", "bankDebt"), function(grp) {
         ret <- Reduce(function(x, y) left_join(x, y, by = c("id", "debt")),
                       list(
                         select(df, id, matches(paste0(grp, "[0-9]+$"))) %>%
                           gather(debt, order, -id) %>%
                           filter(! is.na(order)) %>%
                           arrange(id, order),
                         select(df, id, matches(paste0(grp, "[0-9]+t$"))) %>%
                           gather(debt, type, -id) %>%
                           filter(! is.na(type)) %>%
                           mutate(debt = gsub("t$", "", debt)),
                         select(df, id, matches(paste0(grp, "[0-9]+t_r$"))) %>%
                           gather(debt, r, -id) %>%
                           filter(! is.na(r)) %>%
                           mutate(debt = gsub("t_r$", "", debt))
                       ))
         names(ret)[4:5] <- c(grp, paste0(grp, "_r"))
         select(ret, -debt)
       })
) %>%
  left_join(maintbl, ., by = c("id", "order"))
#   id order age     educDebt educDebt_r      bankDebt bankDebt_r
# 1  1     1  54 student loan        yes      car loan        yes
# 2  1     2  20         <NA>       <NA>          <NA>       <NA>
# 3  1     3  23 student fund         no          <NA>       <NA>
# 4  1     4  17         <NA>       <NA>          <NA>       <NA>
# 5  2     1  60         <NA>       <NA>          <NA>       <NA>
# 6  2     2  57         <NA>       <NA>          <NA>       <NA>
# 7  2     3  28 student loan         no consumer loan        yes
# 8  2     4  33  bank credit        yes          <NA>       <NA>
# 9  2     5  19  bank credit         no          <NA>       <NA>

How can I structure and recode messy categorical data in R?

Make it "long":

library(reshape)
dnow <- read.csv("~/Downloads/catsample20100504.csv")
dnow <- melt(dnow, id.vars=c("Student", "instructor"))
dnow$variable <- NULL ## since ordering does not matter
subset(dnow, Student%in%c(41,42)) ## see the results

What to do next will depend on the kind of analysis you would like to do, but the long format is the most useful for irregular data such as yours.
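
For illustration, since the CSV itself isn't shown, here is a toy wide data set (the column names and values are hypothetical) run through the same melt() call:

toy <- data.frame(Student    = c(41, 42),
                  instructor = c("Smith", "Jones"),
                  cat1       = c("algebra", "geometry"),
                  cat2       = c("fractions", NA),
                  stringsAsFactors = FALSE)
toy_long <- melt(toy, id.vars = c("Student", "instructor"))
toy_long$variable <- NULL  ## ordering does not matter, as above
toy_long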

importing messy data using R

How is this?

> nicelyFormatted
     [,1]           [,2]    [,3]    [,4]   [,5] [,6]  [,7]  [,8]  [,9]   [,10]
[1,] "Black Eagles" "01/12" "12/11" "1500" "W"  "7.0" "420" "48"  "Away" "+3"
[2,] "Blue State"   "02/18" "04/21" "1293" "L"  "8.0" "490" "48"  "Home" "+1"
[3,] "Hawks"        "01/13" "02/17" "1028" "L"  "4.0" "46"  "460" "Away" NA
[4,] "New Apple"    "09/23" "11/23" "563"  "L"  "3.0" "470" "47"  "Home" "+2"
[5,] "Black White"  "07/05" "09/26" "713"  "L"  "5.2" "500" "45"  "Home" "+4"
[6,] "PBO"          "10/24" "10/30" "1495" "L"  "1.9" "47"  "410" "Away" NA



Here is the code that was used to get the above table:

library(stringr)

# Open connection to file
pathToFile <- path.expand("~/path/to/file/myfile.txt")
f <- file(pathToFile, "rb")

# Read in lines, then close the connection
rawText <- readLines(f)
close(f)

# Find the dashes
dsh <- str_locate_all(rawText, " - ")

# Splice, using the dashes as a guide
lng <- length(rawText)
spliced <- sapply(1:lng, function(i)
  spliceOnDash(rawText[[i]], dsh[[c(i, 1)]], dsh[[c(i, 2)]])
)

# Make it purtty
nicelyFormatted <- formatNicely(spliced)
nicelyFormatted


#-------------------#
#     FUNCTIONS     #
#-------------------#


spliceOnDash <- function(strn, start, end) {

  # split around the date
  pre   <- substr(strn, 1, start - 6)
  dates <- substr(strn, start - 5, end + 5)
  post  <- substr(strn, end + 6, str_length(strn))

  # Clean up
  pre <- str_trim(pre)

  # replace all double spaces with single spaces
  while (str_detect(post, "  ")) {
    post <- str_replace_all(str_trim(post), "  ", " ")
  }

  # splice on space
  post <- str_split(post, " ")

  # if dates are one field, remove this next line
  dates <- str_split(dates, " - ")

  # return
  c(unlist(pre), unlist(dates), unlist(post))
}

# Function to clean up the list into a nice table
formatNicely <- function(spliced) {
  lngst <- max(sapply(spliced, length))
  t(sapply(spliced, function(x)
    if (length(x) < lngst) c(x, rep(NA, lngst - length(x))) else x))
}
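
For reference, a stand-alone call on a single line (the raw line below is a guess at the original file's layout, inferred from the table at the top of this answer):

line <- "Black Eagles  01/12 - 12/11  1500 W 7.0 420 48 Away +3"  # hypothetical raw line
d <- str_locate_all(line, " - ")[[1]]                             # start/end of the " - " separator
spliceOnDash(line, d[1, 1], d[1, 2])
# returns: "Black Eagles" "01/12" "12/11" "1500" "W" "7.0" "420" "48" "Away" "+3"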

Is there an R function to clean messy salaries in character format?

Here are some first steps you can try.
I define two functions: one replaces a k or K with three zeros.
The other expands the lower bound of a range (e.g. 50 becomes 50,000) when only the upper bound is written out in thousands.

rem_k <- function(x) {
  sub("(\\d)[kK]", "\\1,000", x)
}

add_zero <- function(x) {
  ifelse(grepl("[1-9]0\\-\\d[0,]{2,}", x),
         sub("([1-9]0)(\\-\\d[0,]{2,})", "\\1,000\\2", x),
         x)
}
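
The answer does not define df; reconstructing it from the output at the end of this section (so the exact strings are an assumption):

# hypothetical input, inferred from the printed result below
library(dplyr)

df <- data.frame(
  salary = c("40,000-60,000", "40-80K", "$100,000", "$70/hr",
             "Between $65-80/hour", "$100k", "50-60,000 a year", "90"),
  stringsAsFactors = FALSE
)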

Finally, I remove all non-essential characters:

df %>%
  mutate(salary2 = gsub("[^0-9,\\-]", "", add_zero(rem_k(salary))))

               salary       salary2
1       40,000-60,000 40,000-60,000
2              40-80K 40,000-80,000
3            $100,000       100,000
4              $70/hr            70
5 Between $65-80/hour         65-80
6               $100k       100,000
7    50-60,000 a year 50,000-60,000
8                  90            90

Use function to convert messy data in many data frames to single tidier format

I think you overcomplicate the task:

  1. Use read.table with the skip option to skip the first lines
  2. Extract the X columns and flatten them into a single vector: volume
  3. Extract the Y columns and flatten them into a single vector: flow
  4. Use recycling to create the other columns

Here is my code:

## replace text = dat with file = <your file name> to read from a file
d_f <- read.table(text = dat, header = TRUE, skip = 3)[, -1]
## extract only the X columns and flatten them into a single vector
volume <- unlist(as.list(d_f[grep('X', colnames(d_f))]))
## extract only the Y columns and flatten them into a single vector
flow <- unlist(as.list(d_f[grep('Y', colnames(d_f))]))
## create the data frame, using recycling for the other columns
data.frame(id = 1, name = 'Ball', routine = 'tech',
           trial = 'pos1', volume = volume, flow = flow)

     id name routine trial     volume   flow
X1    1 Ball    tech  pos1 0.00000000 0.0000
X2    1 Ball    tech  pos1 0.05022222 2.1945
X3    1 Ball    tech  pos1 0.10044444 5.7684
X.11  1 Ball    tech  pos1 0.00000000 0.0000
X.12  1 Ball    tech  pos1 0.05022222 1.9437
X.13  1 Ball    tech  pos1 0.10044444 4.7652
X.21  1 Ball    tech  pos1 0.00000000 0.0000
X.22  1 Ball    tech  pos1 0.05000000 1.2540
X.23  1 Ball    tech  pos1 0.10000000 4.2636

