Using R to List All Files with a Specified Extension

Using R to list all files with a specified extension

files <- list.files(pattern = "\\.dbf$")

The $ at the end anchors the match to the end of the string. "dbf$" would work too, but adding \\. (. is a special character in regular expressions, so you need to escape it) ensures that you match only files with the extension .dbf (in case you have e.g. .adbf files).
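You can see the difference by testing both patterns against a few made-up file names with grepl(), which uses the same kind of pattern matching:

fnames <- c("roads.dbf", "rivers.adbf", "notes.dbf.bak")
grepl("dbf$", fnames)     # TRUE  TRUE FALSE (also matches the .adbf file)
grepl("\\.dbf$", fnames)  # TRUE FALSE FALSE (only a genuine .dbf extension)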

list.files and file extension selection - R

We need to use the pattern argument to match all files with a . (as . is a metacharacter, we escape it with \\) followed by the string 'txt', and specify that it is at the end ($) of the string:

lf <- list.files(path = "E:/UUU/", pattern = "\\.txt$", full.names = TRUE)

By default, pattern is set to NULL, so it will select all the files in the folder. We can see this in the Usage section of ?list.files:

list.files(path = ".", pattern = NULL, all.files = FALSE,
full.names = FALSE, recursive = FALSE,
ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)
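The other arguments shown in the Usage combine naturally with pattern. For example (sticking with the same folder), ignore.case = TRUE would also match .TXT files, and recursive = TRUE descends into subdirectories:

lf <- list.files(path = "E:/UUU/", pattern = "\\.txt$", full.names = TRUE,
                 ignore.case = TRUE, recursive = TRUE)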

List files with specific word and file extension

The pattern argument of list.files takes a regular expression rather than a shell wildcard, so a glob like *_2000*.bil will not work. You should be able to do something like:

list.files("/../directory", pattern = ".*_2000.*\\.bil$")

In a regular expression, .* plays the role of the wildcard *, \\. matches a literal dot, and $ anchors the pattern to the end of the file name.
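As a minimal sketch, assuming the directory contains the made-up files tmin_2000_01.bil, tmax_1999_01.bil and tmin_2000_01.bil.aux, only the first would be returned:

list.files("/../directory", pattern = ".*_2000.*\\.bil$")
# [1] "tmin_2000_01.bil"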

list.files pattern for specific word and file extension

try

   pattern = '.*students.*\\.csv$'

You can test regular expressions directly in R with grepl(), or with an online regex tester.
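For example, checking the pattern against some hypothetical file names:

f <- c("students_2021.csv", "students.csv", "students.csv.bak", "teachers.csv")
grepl('.*students.*\\.csv$', f)  # TRUE TRUE FALSE FALSE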

Reading in all files with a specific extension

I will reiterate that fread is significantly quicker, as is shown in this post on Stack Overflow: Quickly reading very large tables as dataframes in R. In summary, the tests (on a 51 Mb file; 1e6 rows x 6 columns) showed a performance improvement of over 70% against the best alternative methods, including sqldf, ff and read.table with and without the optimised settings recommended in the answer by @lukeA. This was backed up in the comments, which report a 4 GB file loading in under a minute with fread, compared to 15 hours with base functions.

I ran some tests of my own, to compare alternative methods of reading and combining CSV files. The experimental setup is as follows:

  1. Generate a 4-column CSV file (1 character column, 3 numeric) for each run. There are 6 runs, each with a different number of rows, ranging over 10^1, 10^2, ..., 10^6 records in the data file.
  2. Import the CSV file into R 10 times, joining with rbind or rbindlist to create a single table.
  3. Test out read.csv & read.table, with and without optimised arguments such as colClasses, against fread.
  4. Using microbenchmark, repeat every test 10 times (probably unnecessarily high!) and collect the timings for each run.

The results again come out in favour of fread with rbindlist over optimised read.table with rbind.

This table shows the median total duration of 10 file reads & combines for each method and number of rows per file. The first 3 columns are in milliseconds, the last 3 in seconds.

              expr       10       100     1000     10000    1e+05     1e+06
1:           FREAD  3.93704  5.229699 16.80106 0.1470289 1.324394  12.28122
2:        READ.CSV 12.38413 18.887334 78.68367 0.9609491 8.820387 187.89306
3:   READ.CSV.PLUS 10.24376 14.480308 60.55098 0.6985101 5.728035  51.83903
4:      READ.TABLE 12.82230 21.019998 74.49074 0.8096604 9.420266 123.53155
5: READ.TABLE.PLUS 10.12752 15.622499 57.53279 0.7150357 5.715737  52.91683

[Plot omitted: comparison of timings when run 10 times on the HPC]

Normalising these values against the fread timing shows how much longer these other methods take for all scenarios:

                      10      100     1000    10000    1e+05     1e+06
FREAD           1.000000 1.000000 1.000000 1.000000 1.000000  1.000000
READ.CSV        3.145543 3.611553 4.683256 6.535784 6.659941 15.299223
READ.CSV.PLUS   2.601893 2.768861 3.603998 4.750835 4.325023  4.221001
READ.TABLE      3.256838 4.019352 4.433693 5.506811 7.112887 10.058576
READ.TABLE.PLUS 2.572370 2.987266 3.424355 4.863232 4.315737  4.308762

Table of results for 10 microbenchmark iterations on the HPC
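For reference, the normalised ratios can be derived from the raw timings along these lines (a sketch assuming the results.all table built by the code at the end of this answer):

med <- results.all[, .(median_time = median(time)), by = .(NUM_ROWS, expr)]
med[, ratio := median_time / median_time[expr == "FREAD"], by = NUM_ROWS]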

Interestingly, for 1 million rows per file the optimised versions of read.csv and read.table take roughly 4.2 and 4.3 times as long as fread, whilst without optimisation this leaps to roughly 15 and 10 times as long.

Note that when I conducted this experiment on my powerful laptop, as opposed to the HPC cluster, the performance gains were somewhat smaller (around 81% slower, as opposed to around 400% slower). This is interesting in itself; I am not sure I can explain it, however!

                      10      100     1000    10000    1e+05     1e+06
FREAD           1.000000 1.000000 1.000000 1.000000 1.000000  1.000000
READ.CSV        2.595057 2.166448 2.115312 3.042585 3.179500  6.694197
READ.CSV.PLUS   2.238316 1.846175 1.659942 2.361703 2.055851  1.805456
READ.TABLE      2.191753 2.819338 5.116871 7.593756 9.156118 13.550412
READ.TABLE.PLUS 2.275799 1.848747 1.827298 2.313686 1.948887  1.832518

Table of results for only 5 microbenchmark iterations on my i7 laptop

Given that the data volume is reasonably large, I'd suggest the benefits will come not only from reading the files with fread but also from the subsequent manipulation of the data with the data.table package, as opposed to traditional data.frame operations! I was lucky to learn this lesson at an early stage and would recommend others follow suit...

Here is the code used in the tests.

rm(list = ls()); gc()
library(data.table); library(microbenchmark)

#=============== FUNCTIONS TO BE TESTED ===============

# fread, combining the tables with data.table::rbindlist
f_FREAD = function(NUM_READS) {
  for (i in 1:NUM_READS) {
    if (i == 1) x = fread("file.csv") else x = rbindlist(list(x, fread("file.csv")))
  }
}

# read.table with default arguments, combining with base rbind
f_READ.TABLE = function(NUM_READS) {
  for (i in 1:NUM_READS) {
    if (i == 1) x = read.table("file.csv") else x = rbind(x, read.table("file.csv"))
  }
}

# read.table with optimised arguments (explicit sep, header and colClasses)
f_READ.TABLE.PLUS = function(NUM_READS) {
  for (i in 1:NUM_READS) {
    if (i == 1) {
      x = read.table("file.csv", sep = ",", header = TRUE, comment.char = "",
                     colClasses = c("character", "numeric", "numeric", "numeric"))
    } else {
      x = rbind(x, read.table("file.csv", sep = ",", header = TRUE, comment.char = "",
                              colClasses = c("character", "numeric", "numeric", "numeric")))
    }
  }
}

# read.csv with default arguments
f_READ.CSV = function(NUM_READS) {
  for (i in 1:NUM_READS) {
    if (i == 1) x = read.csv("file.csv") else x = rbind(x, read.csv("file.csv"))
  }
}

# read.csv with optimised arguments (explicit colClasses)
f_READ.CSV.PLUS = function(NUM_READS) {
  for (i in 1:NUM_READS) {
    if (i == 1) {
      x = read.csv("file.csv", header = TRUE,
                   colClasses = c("character", "numeric", "numeric", "numeric"))
    } else {
      x = rbind(x, read.csv("file.csv", comment.char = "", header = TRUE,
                            colClasses = c("character", "numeric", "numeric", "numeric")))
    }
  }
}

#=============== MAIN EXPERIMENTAL LOOP ===============
NUM_ITERATIONS = 10 # microbenchmark repetitions per test

for (i in 1:6) {
  NUM_ROWS = 10^i # the loop allows us to test the performance over varying numbers of rows
  NUM_READS = 10

  # create a test data.table with the specified number of rows and write it to file
  dt = data.table(
    col1 = sample(letters, NUM_ROWS, replace = TRUE),
    col2 = rnorm(NUM_ROWS),
    col3 = rnorm(NUM_ROWS),
    col4 = rnorm(NUM_ROWS)
  )
  write.csv(dt, "file.csv", row.names = FALSE)

  # run the imports for each method, recording results with microbenchmark
  results = microbenchmark(
    FREAD = f_FREAD(NUM_READS),
    READ.TABLE = f_READ.TABLE(NUM_READS),
    READ.TABLE.PLUS = f_READ.TABLE.PLUS(NUM_READS),
    READ.CSV = f_READ.CSV(NUM_READS),
    READ.CSV.PLUS = f_READ.CSV.PLUS(NUM_READS),
    times = NUM_ITERATIONS)
  results = data.table(NUM_ROWS = NUM_ROWS, results)
  if (i == 1) results.all = results else results.all = rbindlist(list(results.all, results))
}

results.all[, time := time / 1e9] # convert from nanoseconds to seconds
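To tie this back to reading in all files with a specific extension: a common idiom (a sketch, with a made-up folder path) is to let list.files find the files and rbindlist combine the fread results:

files <- list.files("data/", pattern = "\\.csv$", full.names = TRUE)
combined <- rbindlist(lapply(files, fread))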

Using R to read all files in a specific format and with specific extension

@jay.sf's solution works for creating a regular expression that pulls out exactly the files you want.

However, generally speaking, if you want to cross two vectors to find the subset of elements contained in both (in your case, the files that satisfy both conditions), you can use intersect():

intersect(master1, master2)

This will show you all the files that satisfy both pattern 1 and pattern 2.
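A minimal sketch, using hypothetical patterns for the two conditions:

master1 <- list.files(pattern = "students")  # files satisfying condition 1
master2 <- list.files(pattern = "\\.csv$")   # files satisfying condition 2
intersect(master1, master2)                  # files satisfying both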


