Fill option for fread
Not currently; I wasn't aware of read.csv's fill feature. The plan was to add the ability to read dual-delimited files (sep2 as well as sep, as mentioned in ?fread). Then variable-length vectors could be read into a list column where each cell is itself a vector, but without padding with NA.
Could you add it to the list please? That way you'll get notified when its status changes.
Are there many irregular data formats like this out there? I only recall ever seeing regular files, where the incomplete lines would be considered an error.
UPDATE: Very unlikely to be done. fread is optimized for regular delimited files (where each row has the same number of columns). However, irregular files could be read into list columns (each cell itself a vector) when sep2 is implemented; not filled in separate columns as read.csv can do.
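To make the distinction concrete, here is a minimal sketch of the two shapes being discussed: a list column whose cells are vectors, versus separate columns padded with NA. It does not use sep2 (which the answer describes as not yet implemented); the input lines and the names dt_list and dt_fill are made up for illustration.

```r
library(data.table)

# Hypothetical ragged input: each line has a different number of fields
lines <- c("1 2 3", "4 5", "6")
fields <- lapply(strsplit(lines, " ", fixed = TRUE), as.integer)

# Shape A: one list column, each cell holds a vector (no NA padding)
dt_list <- data.table(id = seq_along(fields), values = fields)

# Shape B: separate columns, short rows padded with NA,
# which is the layout read.csv(fill = TRUE) produces
width <- max(lengths(fields))
dt_fill <- as.data.table(t(sapply(fields, function(v) v[seq_len(width)])))

print(dt_list)
print(dt_fill)
```

Shape A keeps each row's true length, while Shape B is easier to use with column-wise operations.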
Simple fread operation with fill=TRUE fails
We could find the maximum number of columns, write a header with that many column names, and then call fread:
library(data.table)
x <- readLines("notworking1.dat")
myHeader <- paste(paste0("V", seq(max(lengths(strsplit(x, " ", fixed = TRUE))))), collapse = " ")
# write with headers
write(myHeader, "tmp_file.txt")
write(x, "tmp_file.txt", append = TRUE)
# read as usual with fill
d1 <- fread("tmp_file.txt", fill = TRUE)
# check output
dim(d1)
# [1] 102 1101
d1[100:102, 1101]
# V1101
# 1: NA
# 2: NA
# 3: 1101
But as we already have the data imported with readLines, we could just parse it:
x <- readLines("notworking1.dat")
xSplit <- strsplit(x, " ", fixed = TRUE)
# rowbind unequal length list, and convert to data.table
d2 <- data.table(t(sapply(xSplit, '[', seq(max(lengths(xSplit))))))
# check output
dim(d2)
# [1] 102 1101
d2[100:102, 1101]
# V1101
# 1: <NA>
# 2: <NA>
# 3: 1101
This is a known issue (GitHub issue 5119). It is not implemented, but it has been suggested that fill should also accept an integer as input. So the solution would be something like:
d <- fread(input = "notworking1.dat", fill = 1101)
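The same idea, giving the reader the column count up front, can be sketched in base R with count.fields and the col.names argument of read.table. This is only an illustration on a made-up temporary file, not the behaviour proposed in the issue; the file contents and the names path, n_cols and d are assumptions.

```r
# Hypothetical ragged file: the widest row appears late in the file
path <- tempfile(fileext = ".dat")
writeLines(c("1 2", "3 4", "5 6", "7 8", "9 10", "11 12", "13 14 15 16"), path)

# Count the fields on every line and take the maximum
n_cols <- max(count.fields(path, sep = " "))

# Supplying n_cols column names makes read.table allocate that many columns,
# and fill = TRUE pads the shorter rows with NA
d <- read.table(path, sep = " ", fill = TRUE, header = FALSE,
                col.names = paste0("V", seq_len(n_cols)))
print(dim(d))
```

Without col.names, read.table sizes the table from the first five lines only, so the late wide row would not fit.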
r - Error: Text after processing all cols in fread (data.table)
Actually, there is a difference between the two files you provided, and I think this is the cause of the different fread outputs.
In the first file, every line ends after the 3rd column, except line 258088, where there is a tab, a 4th column, and then the end of the line. (You can use your editor's 'show all characters' option to confirm this.)
The second file, on the other hand, has an extra tab in every row, i.e. a new empty column.
So in the first case fread expects 3 columns and then encounters a 4th. In the second file, by contrast, fread expects 4 columns from the start.
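If no editor with a 'show all characters' option is at hand, the odd line can also be located from R. The following is a rough sketch using count.fields on a small made-up tab-separated file; the file contents and the names path and n_fields are hypothetical stand-ins for the files in the question.

```r
# Hypothetical tab-separated file: line 3 carries an extra 4th field
path <- tempfile(fileext = ".tsv")
writeLines(c("a\tb\tc", "1\t2\t3", "4\t5\t6\textra", "7\t8\t9"), path)

# Number of tab-separated fields on each line
# (quote and comment handling switched off so every character counts as data)
n_fields <- count.fields(path, sep = "\t", quote = "", comment.char = "")

# How many lines have each field count
print(table(n_fields))

# Line numbers whose field count differs from the first line
print(which(n_fields != n_fields[1]))
```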
I checked read.table with fill=TRUE and it worked with both files. So I think that the fill option of fread does something differently.
I would expect that, with fill=TRUE, all the lines would be used to infer the number of columns (at a cost in computation time).
In the comments there are some nice workarounds you can use.
R data.table problem when read file with inconsistent column
Not sure why you still have the problem even with fill=T... But if nothing helps, you can try playing with something like this:
tryCatch(
  expr = {dt1 <<- fread(file_path)},
  # handle the warning raised when fread stops before the end of the file
  warning = function(w) {
    cat('Warning: ', w$message, '\n\n')
    # pull the line number out of the warning text
    n_line <- as.numeric(gsub('Stopped early on line (\\d+)\\..*', '\\1', w$message))
    if (!is.na(n_line)) {
      cat('Found ', n_line, '\n')
      # read the file in two parts around that line, then stack them
      dt1_part1 <- fread(file_path, nrows = n_line)
      dt1_part2 <- fread(file_path, skip = n_line)
      dt1 <<- rbind(dt1_part1, dt1_part2, fill = TRUE)
    }
  },
  finally = cat("\nFinished. \n")
)
The tryCatch() construct catches the warning message so you can extract the line number and process the file accordingly.
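The line-number extraction can be tried on its own, without a problem file. Below is a small sketch; extract_stop_line is a hypothetical helper, and the sample message is only assumed to resemble the "Stopped early on line ..." text that the pattern in the answer targets, so the exact wording in your data.table version may differ.

```r
# Hypothetical helper: return the line number from a "Stopped early" message,
# or NA when the message does not contain one
extract_stop_line <- function(msg) {
  m <- regmatches(msg, regexec("Stopped early on line ([0-9]+)", msg))[[1]]
  if (length(m) == 2) as.integer(m[2]) else NA_integer_
}

# Assumed example of the warning text
msg <- "Stopped early on line 258088. Expected 3 fields but found 4."
print(extract_stop_line(msg))

# A message without a line number yields NA instead of a conversion warning
print(extract_stop_line("some other warning"))
```

Returning NA for unrelated warnings keeps the is.na() check in the handler meaningful.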