Fill Option for Fread

Fill option for fread

Not currently; I wasn't aware of read.csv's fill feature. The plan was to add the ability to read dual-delimited files (sep2 as well as sep, as mentioned in ?fread). Variable-length vectors could then be read into a list column where each cell is itself a vector, but without padding with NA.

Could you add it to the list please? That way you'll get notified when its status changes.

Are there many irregular data formats like this out there? I only recall ever seeing regular files, where the incomplete lines would be considered an error.

UPDATE: Very unlikely to be done. fread is optimized for regular delimited files (where each row has the same number of columns). However, irregular files could be read into list columns (each cell itself a vector) once sep2 is implemented, rather than filled into separate columns as read.csv can do.
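To make the distinction concrete, here is a minimal base R sketch of the two shapes being contrasted: a list column holding one variable-length vector per row, versus separate columns padded with NA. The three input lines are made-up data, and the names as_list_col and padded are only illustrative; this is not how fread or sep2 works internally.

```r
# Three irregular lines, as they might appear in a ragged file (made-up data)
lines <- c("a b c", "d e", "f")
fields <- strsplit(lines, " ", fixed = TRUE)

# Shape 1: a list column, one variable-length vector per row, no padding
as_list_col <- data.frame(row = seq_along(fields))
as_list_col$fields <- fields

# Shape 2: separate columns, with short rows padded with NA
n_max <- max(lengths(fields))
padded <- as.data.frame(t(sapply(fields, `[`, seq_len(n_max))),
                        stringsAsFactors = FALSE)

lengths(as_list_col$fields)  # number of fields kept per row
padded                       # same data, padded to n_max columns
```

The list column keeps each row exactly as long as it was in the file, while the padded form forces every row to the width of the longest one.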

Simple fread operation with fill=TRUE fails

We could find the maximum number of columns, write a header row with that many column names, and then read the file with fread:

library(data.table)

x <- readLines("notworking1.dat")
# build a header with one name per column of the widest line
myHeader <- paste(paste0("V", seq_len(max(lengths(strsplit(x, " ", fixed = TRUE))))), collapse = " ")

# write with headers
write(myHeader, "tmp_file.txt")
write(x, "tmp_file.txt", append = TRUE)
# read as usual with fill
d1 <- fread("tmp_file.txt", fill = TRUE)

# check output
dim(d1)
# [1] 102 1101
d1[100:102, 1101]
# V1101
# 1: NA
# 2: NA
# 3: 1101

But as we already have the data imported with readLines, we could just parse it:

x <- readLines("notworking1.dat")
xSplit <- strsplit(x, " ", fixed = TRUE)

# rowbind unequal length list, and convert to data.table
d2 <- data.table(t(sapply(xSplit, '[', seq(max(lengths(xSplit))))))

# check output
dim(d2)
# [1] 102 1101
d2[100:102, 1101]
# V1101
# 1: <NA>
# 2: <NA>
# 3: 1101
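One difference between the two results is visible in the output: the strsplit route yields character columns (hence <NA> rather than NA). A minimal sketch of converting the columns afterwards, using made-up input lines and a base R data.frame to stay self-contained; the variable names are only illustrative:

```r
# Made-up ragged input, standing in for the lines read with readLines()
x <- c("1 2 3", "4 5", "6")
xSplit <- strsplit(x, " ", fixed = TRUE)

# Pad every row to the widest row; everything is still character here
m <- t(sapply(xSplit, `[`, seq_len(max(lengths(xSplit)))))
df <- as.data.frame(m, stringsAsFactors = FALSE)

# Let R guess a type for each column (integer here), keeping NA as NA
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)
```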

This is a known issue (GitHub issue 5119). It is not implemented, but the suggestion there is that fill should also accept an integer as input. The solution would then be something like:

d <- fread(input = "notworking1.dat", fill = 1101)

r - Error: Text after processing all cols in fread (data.table)

Actually, there is a difference between the two files you provided, and I think this is what causes the different outputs from fread.

In the first file, every line ends after the 3rd column, except line 258088, where there is a tab, a 4th column, and then the end of the line. (You can use your editor's 'show all characters' option to confirm that.)

The second file, on the other hand, has an extra tab in every row, i.e. a new empty column.
So in the first case fread expects 3 columns and then runs into a 4th one. In the second file, by contrast, fread expects 4 columns from the start.
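If you prefer to check this from R rather than in an editor, base R's count.fields() reports the number of fields on each line. A minimal sketch, using a small made-up tab-separated file as a stand-in for the first file (the path and contents are assumptions for illustration):

```r
# Hypothetical stand-in for the first file: three tab-separated columns,
# except one line that carries a fourth field
path <- tempfile(fileext = ".tsv")
writeLines(c("1\t2\t3", "4\t5\t6", "7\t8\t9\t10", "11\t12\t13"), path)

# Number of fields found on each line
n_fields <- count.fields(path, sep = "\t")
table(n_fields)

# Lines whose field count differs from the first line
odd_lines <- which(n_fields != n_fields[1])
odd_lines
```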

I checked read.table with fill=TRUE and it worked with both files. So I think fread handles the fill option differently.

With fill=TRUE, I would expect all lines to be used to infer the number of columns (at a cost in computation time).
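A minimal sketch of that idea in base R: scan the whole file once with count.fields() to get the widest line, then pass that many column names to read.table so short lines are padded. The file contents and the V1, V2, ... names are assumptions for illustration, and this is read.table, not a description of what fread does.

```r
# Made-up tab-separated file where only the last line has a 4th field
path <- tempfile(fileext = ".tsv")
writeLines(c("1\t2\t3", "4\t5\t6", "7\t8\t9\t10"), path)

# Use every line to find the maximum number of columns
n_cols <- max(count.fields(path, sep = "\t"))

# Fix the column count up front so shorter lines are filled with NA
df <- read.table(path, sep = "\t", fill = TRUE,
                 col.names = paste0("V", seq_len(n_cols)))
df
```

The extra pass over the file is the computational cost mentioned above.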

In the comments there are some nice workarounds you can use.

R data.table problem when read file with inconsistent column

Not sure why you still have the problem even with fill=T... But if nothing helps, you can try playing with something like this:

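tryCatch(
  expr = {
    dt1 <<- fread(file_path)
  },
  warning = function(w) {
    cat('Warning: ', w$message, '\n\n')
    # Pull the line number out of a "Stopped early on line N" warning
    n_line <- as.numeric(gsub('Stopped early on line (\\d+)\\..*', '\\1', w$message))
    if (!is.na(n_line)) {
      cat('Found ', n_line, '\n')
      # Re-read the file in two pieces around that line and stack them;
      # the nrows/skip offsets may need adjusting for your file
      dt1_part1 <- fread(file_path, nrows = n_line)
      dt1_part2 <- fread(file_path, skip = n_line)
      dt1 <<- rbind(dt1_part1, dt1_part2, fill = TRUE)
    }
  },
  finally = cat("\nFinished. \n")
)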

The tryCatch() construct catches the warning message, so you can extract the line number and process the file accordingly.
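The extraction step can be tried in isolation, without any file. This is a minimal sketch in which the warning is raised by hand with a message of the same form; extract_stop_line is a hypothetical helper name, and the exact wording of the warning is an assumption based on the pattern matched above.

```r
# Hypothetical helper: return N from "Stopped early on line N", or NA
extract_stop_line <- function(msg) {
  m <- regmatches(msg, regexec("Stopped early on line ([0-9]+)", msg))[[1]]
  if (length(m) == 2) as.integer(m[2]) else NA_integer_
}

# Simulate the warning instead of reading a real file
res <- tryCatch(
  {
    warning("Stopped early on line 42. Expected 3 fields but found 4.")
    "no warning"
  },
  warning = function(w) extract_stop_line(conditionMessage(w))
)
res
```

Returning NA for unrelated warnings lets the handler skip the split-and-rbind step when the message is about something else.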


