How to Load Xlsx File Using Fread Function

How to load xlsx file using fread function?

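One option is to convert the workbook to CSV on the fly with a command-line tool such as csvkit's in2csv and let fread read the command's output. In recent data.table versions the command goes in the cmd argument; older versions took it as the first argument.

my.dt <- fread(cmd = 'in2csv my.xls')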

data.table::fread Read all worksheets in an Excel workbook

I used openxlsx::read.xlsx the last time I needed to read many sheets from an XLSX.

#install.packages("openxlsx")
library(openxlsx)
#?openxlsx::read.xlsx

#using file chooser:
filename <- file.choose()
#or hard coded file name:
#filename <- "filename.xlsx"

# get all the sheet names from the workbook
sheet_names <- getSheetNames(filename)

# read every sheet into a list of data frames, one element per sheet
sheets <- lapply(sheet_names, function(s) openxlsx::read.xlsx(filename, sheet = s))

# stack the sheets into a single data frame
# (avoids relying on whether an object called "input" already exists)
input <- do.call(rbind, sheets)

Note: This assumes each worksheet has an identical column structure and identical data types. You may need to standardize/normalize the data (e.g. tmp_sheet <- as.data.frame(sapply(tmp_sheet,as.character), stringsAsFactors=FALSE)), or load each sheet into its own data frame and pre-process further before merging.
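The standardization mentioned in the note can be sketched as follows. This is a minimal illustration, not the answer's original code: the two small data frames stand in for worksheets and their names and values are made up. lapply is used rather than sapply because sapply can collapse a single-row sheet into a plain vector.

```r
# Two small data frames stand in for worksheets read with openxlsx::read.xlsx();
# the names and values are invented for illustration.
sheet_a <- data.frame(id = 1:2, amount = c(10.5, 20), stringsAsFactors = FALSE)
sheet_b <- data.frame(id = c("3", "4"), amount = c("n/a", "7"),
                      stringsAsFactors = FALSE)

# Convert every column to character so all sheets share the same types.
# df[] <- ... keeps the data frame shape even for one-row sheets.
to_character <- function(df) {
  df[] <- lapply(df, as.character)
  df
}

sheets <- lapply(list(sheet_a, sheet_b), to_character)

# With matching column names and types, the sheets can be stacked safely.
input <- do.call(rbind, sheets)
```

Numeric columns can be converted back afterwards (for example with as.numeric) once the problem values have been cleaned up.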

I can't read excel file using dt.fread from datatable AttributeError

The issue is that the datatable package has not yet been updated to work with xlrd versions newer than 1.2.0, so to make it work you have to install xlrd 1.2.0:

pip install xlrd==1.2.0

I hope it helped.

How to read tab separated file into data.table using fread?

This has been fixed recently in the devel version, v1.9.5 (will be soon available on CRAN as v1.9.6):

require(data.table) # v1.9.5+
fread("~/Downloads/tmp.txt")
#       V1   V2 V3
# 1:  Beth 4.00  0
# 2:   Dan 3.75  0
# 3: Kathy 4.00 10
# 4:  Mark 5.00 20
# 5:  Mary 5.50 22
# 6: Susie 4.25 18

See README.md on the project page for more info. fread gained a strip.white argument (among other new features and bug fixes), which defaults to TRUE.


Update: it also has a col.names argument now:

fread("~/Downloads/tmp.txt", col.names = c("Name", "PayRate", "HoursWorked"))
#     Name PayRate HoursWorked
# 1:  Beth    4.00           0
# 2:   Dan    3.75           0
# 3: Kathy    4.00          10
# 4:  Mark    5.00          20
# 5:  Mary    5.50          22
# 6: Susie    4.25          18
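To try these arguments without the original tmp.txt, a small self-contained sketch can write its own tab-separated file first. The three rows are copied from the output above; the trailing space after "Beth" is added here only to show what strip.white does, and sep and header are set explicitly rather than left to auto-detection.

```r
library(data.table)

# Write a small tab-separated file to a temporary location.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Beth \t4.00\t0",
             "Dan\t3.75\t0",
             "Kathy\t4.00\t10"), tmp)

# sep = "\t" forces tab as the delimiter, header = FALSE says the first line
# is data, and strip.white = TRUE trims the padding around unquoted fields.
dt <- fread(tmp, sep = "\t", header = FALSE, strip.white = TRUE,
            col.names = c("Name", "PayRate", "HoursWorked"))
```

Setting sep by hand is mainly useful when fields contain spaces that might otherwise confuse delimiter detection.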

Fastest way to read large Excel xlsx files? To parallelize or not?

You could try to run it in parallel using the parallel package, but it is a bit hard to estimate how fast it will be without sample data:

library(parallel)
library(readxl)

excel_path <- ""
sheets <- excel_sheets(excel_path)

Make a cluster with a specified number of cores:

cl <- makeCluster(detectCores() - 1)

Use parLapplyLB to go through the excel sheets and read them in parallel using load balancing:

parLapplyLB(cl, sheets, function(sheet, excel_path) {
  readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)

You can use the package microbenchmark to test how fast certain options are:

library(microbenchmark)

microbenchmark(
  lapply = {lapply(sheets, function(sheet) {
    read_excel(excel_path, sheet = sheet)
  })},
  parallel = {parLapplyLB(cl, sheets, function(sheet, excel_path) {
    readxl::read_excel(excel_path, sheet = sheet)
  }, excel_path)},
  times = 10
)

In my case, the parallel version is faster:

Unit: milliseconds
     expr       min        lq     mean    median        uq      max neval
   lapply 133.44857 167.61801 179.0888 179.84616 194.35048 226.6890    10
 parallel  58.94018  64.96452 118.5969  71.42688  80.48588 316.9914    10

The test file consists of 6 sheets, each containing this table:

   test test1 test3 test4 test5
1     1     1     1     1     1
2     2     2     2     2     2
3     3     3     3     3     3
4     4     4     4     4     4
5     5     5     5     5     5
6     6     6     6     6     6
7     7     7     7     7     7
8     8     8     8     8     8
9     9     9     9     9     9
10   10    10    10    10    10
11   11    11    11    11    11
12   12    12    12    12    12
13   13    13    13    13    13
14   14    14    14    14    14
15   15    15    15    15    15

Note: you can use stopCluster(cl) to shut down the workers when the process is finished.
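If a task fails part-way through, the workers can be left running. One way to guard against that, sketched here under the assumption that a short-lived cluster per call is acceptable, is to wrap the cluster in a function and register stopCluster with on.exit. The helper name par_map and its workers argument are hypothetical, not part of the parallel package.

```r
library(parallel)

# Hypothetical helper: run `fun` over `xs` on a temporary cluster and always
# stop the workers afterwards, even if one of the tasks throws an error.
par_map <- function(xs, fun, ..., workers = 2) {
  cl <- makeCluster(workers)
  on.exit(stopCluster(cl), add = TRUE)
  parLapplyLB(cl, xs, fun, ...)
}

# Toy usage so the sketch runs without an Excel file.
squares <- par_map(1:4, function(x) x^2)
```

With the objects from the answer above, the call would look like par_map(sheets, function(sheet, excel_path) readxl::read_excel(excel_path, sheet = sheet), excel_path). Starting a cluster has a cost of its own, so for repeated calls it may be better to keep one cluster alive and stop it once at the end.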

Cannot import XLSX file

At least two issues here:

  • you have a bogus-looking tilde (~) at the beginning of your file name
  • data.table::fread() reads "delimited" files (i.e., space or whitespace or tab or comma-separated), not XLSX files

Try e.g.

readxl::read_excel("C:/matly/Desktop/Grad School/Class 4/Customer.xlsx")

Other style points:

  • read_excel automatically uses stringsAsFactors=FALSE; it returns a "tibble", which is almost (but not quite!) the same as a data frame
  • using / as a path separator works cross-platform and is a little easier to read
  • I'd strongly encourage you to change your working directory and use relative path names, e.g.
setwd("C:/matly/Desktop/Grad School/Class 4/")
readxl::read_excel("Customer.xlsx")
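Since fread handles delimited text and read_excel handles workbooks, one way to get a data.table either way is a small dispatcher on the file extension. This is only a sketch: read_any is a hypothetical helper name, it assumes both data.table and readxl are installed, and it reads only the first sheet of a workbook (the read_excel default).

```r
# Hypothetical helper: pick a reader based on the file extension and
# always return a data.table.
read_any <- function(path) {
  ext <- tolower(tools::file_ext(path))
  if (ext %in% c("xlsx", "xls")) {
    # read_excel returns a tibble; convert it to a data.table
    data.table::as.data.table(readxl::read_excel(path))
  } else {
    # anything else is treated as delimited text
    data.table::fread(path)
  }
}

# Example call with the file name from the answer above:
# customers <- read_any("Customer.xlsx")
```

Converting after reading keeps the rest of a data.table-based workflow unchanged, at the cost of one extra copy of the data.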

