What can R do about a messy data format?
Using data.table::fread
:
x = '
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
'
fread(gsub('[\\+-]+\\n', '', x), drop = c(1,7))
# Date Emp1 Case Priority PriorityCountinLast7days
# 1: 2018-06-01 A A1 0 0
# 2: 2018-06-03 A A2 0 1
# 3: 2018-06-03 A A3 0 2
# 4: 2018-06-03 A A4 1 1
# 5: 2018-06-03 A A5 2 1
# 6: 2018-06-04 A A6 0 3
# 7: 2018-06-01 B B1 0 1
# 8: 2018-06-02 B B2 0 2
# 9: 2018-06-03 B B3 0 3
The gsub
part removes the horizontal rules. drop
removes the extra columns caused by delimiters at the line ends.
How to clean messy date formats in a dataframe using R
Use extract from tidyr. If there are two years it will use the first.
library(dplyr)
library(tidyr)
df %>% extract(Date, "Year", "(\\d{4})", remove = FALSE, convert = TRUE)
giving:
z Date Year
1 Date 1 2000-10-22 2000
2 Date 2 9/21/2001 2001
3 Date 3 2003 2003
4 Date 4 2017/2018 2017
5 Date 5 9/28/2010 2010
6 Date 6 9/27/2011 2011
7 Date 7 2019/2020 2019
8 Date 8 2017-10/2018-12 2017
9 Date 9 NA NA
10 Date 10 NA
If the second year is needed as well then:
df %>%
extract(Date, "Year2", "\\d{4}.*(\\d{4})", remove = FALSE, convert = TRUE) %>%
extract(Date, "Year", "(\\d{4})", remove = FALSE, convert = TRUE)
giving:
z Date Year Year2
1 Date 1 2000-10-22 2000 NA
2 Date 2 9/21/2001 2001 NA
3 Date 3 2003 2003 NA
4 Date 4 2017/2018 2017 2018
5 Date 5 9/28/2010 2010 NA
6 Date 6 9/27/2011 2011 NA
7 Date 7 2019/2020 2019 2020
8 Date 8 2017-10/2018-12 2017 2018
9 Date 9 NA NA NA
10 Date 10 NA NA
Tidy complicated messy data into long data format in R
Here is a solution using functions from dplyr
and tidyr
. dt2
is the final output.
# Load package
library(dplyr)
library(tidyr)
# Create example data frame
dt <- read.table(text = "id A B
1 2 3
1 3 4
1 5 5
1 NA 6
1 NA 7",
header = TRUE, stringsAsFactors = FALSE)
# Process the data
dt2 <- dt %>%
gather(Label, Value, -id) %>%
drop_na(Value) %>%
group_by(id, Label) %>%
mutate(Label_Id = 1:n()) %>%
unite(Col, Label, Label_Id, sep = "") %>%
spread(Col, Value)
Update: Create a function to generalize the process
Based on the comment, the OP asks for a more "generalized" approach, which I may not fully understand, but here I demonstrated how to convert the above code to a function and design three test cases. The function flatten
has one argument, which is the input tbl
or a data frame
. The input tbl
or data frame
should have column as id, A, B, C, D ...
.
# Load package
library(dplyr)
library(tidyr)
# Process the data
flatten <- function(dt){
dt %>%
gather(Label, Value, -id) %>%
drop_na(Value) %>%
group_by(id, Label) %>%
mutate(Label_Id = 1:n()) %>%
unite(Col, Label, Label_Id, sep = "") %>%
spread(Col, Value)
}
### Test Case 1
test1 <- data_frame(id = rep(1, 5),
A = c(2, 3, 5, NA, NA),
B = 3:7)
test1_result <- flatten(test1)
### Test Case 2
test2 <- data_frame(id = c(rep(1, 5), rep(2, 8)),
A = c(2, 3, 5, NA, NA, 3, 4, 6, 8, 9, NA, 10, 12),
B = 3:15)
test2_result <- flatten(test2)
### Test Case 3
test3 <- data_frame(id = c(rep(1, 5), rep(2, 8)),
A = c(2, 3, 5, NA, NA, 3, 4, 6, 8, 9, NA, 10, 12),
B = 3:15,
C = c(rep(c(1, 2, 3, 4, 5), each = 2), NA, NA, NA),
D = seq(2, 26, 2))
test3_result <- flatten(test3)
fixing a messy dataframe with tidyr in R
I don't have a solution that is completely tidyr
(and dplyr
), though perhaps somebody more familiar with it can assist. (There is room to include more of the tidyverse
, specifically purrr
, to replace some of the base R code, but I thought it unnecessary.) I'll walk through each step with the solution at the bottom.
Data
First, I think some of the columns are misnamed (lower-case "debt"), so I fixed it; that's not absolutely critical, but it makes some things much easier. I also disable factors, as some operations (on debt
, below) require strings. If having factor
s is important, I suggest you re-factor
after this process.
df <- data.frame(
id = c(rep(1,4), rep(2,5)), order = c(1:4,1:5),
age = c(54,20,23,17, 60,57,28,33,19),
educDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA),
educDebt2 = c(3, NA, NA, NA, 5, NA, NA, NA, NA),
educDebt3 = c(NA, NA, NA, NA, 4, NA, NA, NA, NA),
educDebt1t = c("student loan", NA,NA,NA, "student loan", NA, NA, NA, NA),
educDebt2t = c("student fund", NA, NA, NA, "bank credit", NA, NA, NA, NA),
educDebt3t = c(NA, NA, NA, NA, "bank credit", NA, NA, NA, NA),
educDebt1t_r = c("yes", NA,NA,NA, "no",NA,NA,NA,NA),
educDebt2t_r = c("no", NA, NA, NA, "no", NA,NA,NA,NA),
educDebt3t_r = c(NA,NA,NA,NA, "yes", NA,NA,NA,NA),
bankDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA),
bankDebt2 = c(4, NA, NA, NA, 2, NA, NA, NA, NA),
bankDebt3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
bankDebt1t = c("car loan", NA,NA,NA, "consumer loan", NA, NA, NA, NA),
bankDebt2t = c("car loan", NA, NA, NA, "car loan", NA, NA, NA, NA),
bankDebt3t = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
bankDebt1t_r = c("yes", NA,NA,NA, "yes",NA,NA,NA,NA),
bankDebt2t_r = c("no", NA, NA, NA, "no", NA,NA,NA,NA),
bankDebt3t_r = c(NA,NA,NA,NA, NA, NA,NA,NA,NA),
stringsAsFactors = FALSE
)
library(dplyr)
library(tidyr)
Step-through
Eventually, we're going to merge in age
, and since all respondents are identified by both id
and order
, so we separate the three:
maintbl <- select(df, id, order, age)
The first thing to realize (for me) is that you need to convert from wide-to-tall, but individually for each three column group. I'll start with the first bunch of three:
grp <- "educDebt"
select(df, id, matches(paste0(grp, "[0-9]+$"))) %>%
gather(debt, order, -id) %>%
filter(! is.na(order)) %>%
arrange(id, order)
# id debt order
# 1 1 educDebt1 1
# 2 1 educDebt2 3
# 3 2 educDebt1 3
# 4 2 educDebt3 4
# 5 2 educDebt2 5
(BTW 1: the reason I'm using grp
will be apparent later.)
(BTW 2: I used the regex [0-9]+
to match one or more digit, in case this is expanded to include either more than 9 or "arbitrary" numbering. Feel free to omit the +
.)
This seems fine. We now need to cbind
the *t
variant of these three:
select(df, id, matches(paste0(grp, "[0-9]+t$"))) %>%
gather(debt, type, -id) %>%
filter(! is.na(type)) %>%
mutate(debt = gsub("t$", "", debt))
# id debt type
# 1 1 educDebt1 student loan
# 2 2 educDebt1 student loan
# 3 1 educDebt2 student fund
# 4 2 educDebt2 bank credit
# 5 2 educDebt3 bank credit
I changed debt
to remove the trailing t
, as I'm going to use that as a merging column later. I do the same thing for the third group of three (for "educDebt"), the t_r
columns.
These three columns need to be combined, so here I place them in a list and Reduce
them:
Reduce(function(x,y) left_join(x, y, by = c("id", "debt")),
list(
select(df, id, matches(paste0(grp, "[0-9]+$"))) %>%
gather(debt, order, -id) %>%
filter(! is.na(order)) %>%
arrange(id, order),
select(df, id, matches(paste0(grp, "[0-9]+t$"))) %>%
gather(debt, type, -id) %>%
filter(! is.na(type)) %>%
mutate(debt = gsub("t$", "", debt)),
select(df, id, matches(paste0(grp, "[0-9]+t_r$"))) %>%
gather(debt, r, -id) %>%
filter(! is.na(r)) %>%
mutate(debt = gsub("t_r$", "", debt))
))
# id debt order type r
# 1 1 educDebt1 1 student loan yes
# 2 1 educDebt2 3 student fund no
# 3 2 educDebt1 3 student loan no
# 4 2 educDebt3 4 bank credit yes
# 5 2 educDebt2 5 bank credit no
I'll need to rename the last two columns, and since I'm done combining the type
and r
columns, I can drop debt
. (I'd normally suggest dplyr::rename_
, but since it is being deprecated shortly, I'm doing it manually. If you have significantly more columns than shown here, you may need to adjust the column numbering, etc.)
Lastly, we need to do this for each of "educDebt"
and "bankDebt"
, join these by id
and order
(using another Reduce
), and finally re-merge in the age
.
TL;DR
Reduce(function(x,y) left_join(x, y, by = c("id", "order")),
lapply(c("educDebt", "bankDebt"), function(grp) {
ret <- Reduce(function(x,y) left_join(x, y, by = c("id", "debt")),
list(
select(df, id, matches(paste0(grp, "[0-9]+$"))) %>%
gather(debt, order, -id) %>%
filter(! is.na(order)) %>%
arrange(id, order),
select(df, id, matches(paste0(grp, "[0-9]+t$"))) %>%
gather(debt, type, -id) %>%
filter(! is.na(type)) %>%
mutate(debt = gsub("t$", "", debt)),
select(df, id, matches(paste0(grp, "[0-9]+t_r$"))) %>%
gather(debt, r, -id) %>%
filter(! is.na(r)) %>%
mutate(debt = gsub("t_r$", "", debt))
))
names(ret)[4:5] <- c(grp, paste0(grp, "_r"))
select(ret, -debt)
})
) %>%
left_join(maintbl, ., by = c("id", "order"))
# id order age educDebt educDebt_r bankDebt bankDebt_r
# 1 1 1 54 student loan yes car loan yes
# 2 1 2 20 <NA> <NA> <NA> <NA>
# 3 1 3 23 student fund no <NA> <NA>
# 4 1 4 17 <NA> <NA> <NA> <NA>
# 5 2 1 60 <NA> <NA> <NA> <NA>
# 6 2 2 57 <NA> <NA> <NA> <NA>
# 7 2 3 28 student loan no consumer loan yes
# 8 2 4 33 bank credit yes <NA> <NA>
# 9 2 5 19 bank credit no <NA> <NA>
How can I structure and recode messy categorical data in R?
Make it "long":
library(reshape)
dnow <- read.csv("~/Downloads/catsample20100504.csv")
dnow <- melt(dnow, id.vars=c("Student", "instructor"))
dnow$variable <- NULL ## since ordering does not matter
subset(dnow, Student%in%c(41,42)) ## see the results
What to do next will depend on the kind of analysis you would like to do. But the long format is the useful for irregular data such as yours.
importing messy data using R
How is this?
> nicelyFormatted
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "Black Eagles" "01/12" "12/11" "1500" "W" "7.0" "420" "48" "Away" "+3"
[2,] "Blue State" "02/18" "04/21" "1293" "L" "8.0" "490" "48" "Home" "+1"
[3,] "Hawks" "01/13" "02/17" "1028" "L" "4.0" "46" "460" "Away" NA
[4,] "New Apple" "09/23" "11/23" "563" "L" "3.0" "470" "47" "Home" "+2"
[5,] "Black White" "07/05" "09/26" "713" "L" "5.2" "500" "45" "Home" "+4"
[6,] "PBO" "10/24" "10/30" "1495" "L" "1.9" "47" "410" "Away" NA
Here is the code that was used to get the above table:
library(stringr)
# Open Connection to file
pathToFile <- path.expand("~/path/to/file/myfile.txt")
f <- file(pathToFile, "rb")
# Read in lines
rawText <- readLines(f)
# Find the dahses
dsh <- str_locate_all(rawText, " - ")
# Splice, using the dashes as a guide
lng <- length(rawText)
spliced <- sapply(1:lng, function(i)
spliceOnDash(rawText[[i]], dsh[[c(i, 1)]], dsh[[c(i, 2)]])
)
# make it purtty
nicelyFormatted <- formatNicely(spliced)
nicelyFormatted
#-------------------#
# FUNCTIONS #
#-------------------#
spliceOnDash <- function(strn, start, end) {
# split around the date
pre <- substr(strn, 1, start-6)
dates <- substr(strn, start-5, end+5)
post <- substr(strn, end+6, str_length(strn))
# Clean up
pre <- str_trim(pre)
# replace all double spaces with single spaces
while(str_detect(post, " ")) {
post <- str_replace_all(str_trim(post), " ", " ")
}
# splice on space
post <- str_split(post, " ")
# if dates are one field, remove this next line
dates <- str_split(dates, " - ")
# return
c(unlist(pre), unlist(dates), unlist(post))
}
# Function to clean up the list into a nice table
formatNicely <- function(spliced) {
lngst <- max(sapply(spliced, length))
t(sapply(spliced, function(x)
if(length(x) < lngst) c(x, rep(NA, lngst-length(x))) else x ))
}
Is there an R function to clean messy salaries in character format?
Here are some first steps you can try.
I define two functions: one replaces a k
or K
with three zeros.
The other adds leading zeros if one number is denoted in thousands and the other is not.
rem_k <- function(x) {
sub("(\\d)[kK]", "\\1,000", x)
}
add_zero <- function(x) {
ifelse(grepl("[1-9]0\\-\\d[0,]{2,}", x), sub("([1-9]0)(\\-\\d[0,]{2,})", "\\1,000\\2", x), x)
}
Finally, I remove all non essential characters:
df %>%
mutate(salary2 = gsub("[^0-9,\\-]", "", add_zero(rem_k(salary))))
salary salary2
1 40,000-60,000 40,000-60,000
2 40-80K 40,000-80,000
3 $100,000 100,000
4 $70/hr 70
5 Between $65-80/hour 65-80
6 $100k 100,000
7 50-60,000 a year 50,000-60,000
8 90 90
Use function to convert messy data in many data frames to single tidier format
I think you over complicate the task:
- Use read.table with
skip
option to skip first lines - Convert extract x columns as a single vector : volume
- Convert extract y columns as a single vector : flow
- Use recycling to create other columns
here my code:
## here you change text by your file=file_name
d_f <- read.table(text=dat,header=TRUE,skip=3)[,-1]
## extract only X column and expand them in a single vector
volume <- unlist(as.list(d_f[grep('X',colnames(d_f))]))
## extract only Y column and expand them in a single vector
flow <- unlist(as.list(d_f[grep('Y',colnames(d_f))]))
## create you data frame using recycling for other columns
data.frame(id=1,name='Ball',routine='tech',
trial='pos1',volume=volume,flow=flow)
id name routine trial volume flow
X1 1 Ball tech pos1 0.00000000 0.0000
X2 1 Ball tech pos1 0.05022222 2.1945
X3 1 Ball tech pos1 0.10044444 5.7684
X.11 1 Ball tech pos1 0.00000000 0.0000
X.12 1 Ball tech pos1 0.05022222 1.9437
X.13 1 Ball tech pos1 0.10044444 4.7652
X.21 1 Ball tech pos1 0.00000000 0.0000
X.22 1 Ball tech pos1 0.05000000 1.2540
X.23 1 Ball tech pos1 0.10000000 4.2636
Related Topics
Why Would R Use the "L" Suffix to Denote an Integer
How to Round Up to the Nearest 10 (Or 100 or X)
Force Character Vector Encoding from "Unknown" to "Utf-8" in R
"Correct" Way to Specifiy Optional Arguments in R Functions
Convert Data.Frame Column to a Vector
Use Trycatch Skip to Next Value of Loop Upon Error
How to Use R with Google Colaboratory
Check for Installed Packages Before Running Install.Packages()
How to Use Objects from Global Environment in Rstudio Markdown
Remove Multiple Objects with Rm()
Creating a Prompt/Answer System to Input Data into R
How to Install Development Version of R Packages Github Repository
Display a Time Clock in the R Command Line
How to Draw a Line Across a Multiple-Figure Environment in R
How to Detect the Right Encoding for Read.Csv