Using R to list all files with a specified extension
files <- list.files(pattern = "\\.dbf$")
The $ at the end anchors the match to the end of the string. "dbf$" would work too, but adding \\. (the . is a special character in regular expressions, so you need to escape it) ensures that you match only files with the extension .dbf (in case you also have e.g. .adbf files).
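As a quick check, here is a minimal sketch with hypothetical file names showing the difference the escaped dot makes:
# Hypothetical file names, just to illustrate the two patterns
fnames <- c("sites.dbf", "sites.adbf", "notes.txt")
grepl("dbf$", fnames)     # TRUE  TRUE FALSE -- also matches the .adbf file
grepl("\\.dbf$", fnames)  # TRUE FALSE FALSE -- only genuine .dbf files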
list.files and file extension selection - R
We need to use the pattern argument to match all files with a . (as . is a metacharacter, we escape it with \\) followed by the string 'txt', and specify that it sits at the end of the string ($):
lf <- list.files(path = "E:/UUU/", pattern = "\\.txt$", full.names=TRUE)
By default, pattern is NULL, so all the files in the folder are selected. We can see this in the Usage section of ?list.files:
list.files(path = ".", pattern = NULL, all.files = FALSE,
full.names = FALSE, recursive = FALSE,
ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)
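So, as a small sketch (reusing the example directory from above), the default call lists everything, while a pattern restricts the result:
all_files <- list.files("E:/UUU/")                                           # every file and folder name
txt_files <- list.files("E:/UUU/", pattern = "\\.txt$", full.names = TRUE)   # only .txt files, with full paths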
List files with specific word and file extension
The list.files command has a pattern option, so you should be able to do something like:
list.files("/../directory", pattern = "*_2000*//.bil")
or maybe
list.files("/../directory", pattern = ".*_2000.*\\.bil")
I'm not 100% clear on whether list.files uses a regex pattern and I don't have access to R at the moment, so let me know if that works. (The pattern argument is in fact interpreted as a regular expression, so the second form is the one to use.)
list.files pattern for specific word and file extension
try
pattern = '.*students.*\\.csv$'
You can test regular expressions directly in R, for example with grepl, as in the sketch below.
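A minimal sketch with hypothetical file names:
# Hypothetical file names to check the pattern against
fnames <- c("students_2019.csv", "old_students.csv.bak", "teachers_2019.csv")
grepl('.*students.*\\.csv$', fnames)  # TRUE FALSE FALSE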
Reading in all files with a specific extension
I will reiterate that fread is significantly quicker, as shown in this post on Stack Overflow: Quickly reading very large tables as dataframes in R. In summary, the tests (on a 51 MB file, 1e6 rows x 6 columns) showed a performance improvement of over 70% against the best alternative methods, including sqldf, ff and read.table with and without the optimised settings recommended in the answer by @lukeA. This was backed up in the comments, which report a 4 GB file loading in under a minute with fread, compared to 15 hours with base functions.
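Tying this back to the question, a minimal sketch of reading every file with a given extension and stacking them with data.table (the directory path and .csv extension are just assumptions for the example):
library(data.table)
# Assumed example directory; adjust to your own
csv_files <- list.files("E:/UUU/", pattern = "\\.csv$", full.names = TRUE)
# Read each file with fread and stack the results into one data.table
combined <- rbindlist(lapply(csv_files, fread))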
I ran some tests of my own, to compare alternative methods of reading and combining CSV files. The experimental setup is as follows:
- Generate a 4 column CSV file (character x 1, numeric x 3) for each run. There are 6 runs, each with a different number of rows, ranging from 10^1, 10^2, ..., 10^6 records in the data file.
- Import the CSV file into R 10 times, joining with rbind or rbindlist to create a single table.
- Test out read.csv & read.table, with and without optimised arguments such as colClasses, against fread.
- Using microbenchmark, repeat every test 10 times (probably unnecessarily high!) and collect the timings for each run.
The results again come out in favour of fread with rbindlist over optimised read.table with rbind.
This table shows the median total duration for 10 file reads & combines for each method and number of rows per file. The first 3 columns are in milliseconds, the last 3 in seconds.
expr 10 100 1000 10000 1e+05 1e+06
1: FREAD 3.93704 5.229699 16.80106 0.1470289 1.324394 12.28122
2: READ.CSV 12.38413 18.887334 78.68367 0.9609491 8.820387 187.89306
3: READ.CSV.PLUS 10.24376 14.480308 60.55098 0.6985101 5.728035 51.83903
4: READ.TABLE 12.82230 21.019998 74.49074 0.8096604 9.420266 123.53155
5: READ.TABLE.PLUS 10.12752 15.622499 57.53279 0.7150357 5.715737 52.91683
[Plot: comparison of timings when run 10 times on the HPC]
Normalising these values against the fread timing shows how much longer the other methods take in every scenario:
10 100 1000 10000 1e+05 1e+06
FREAD 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
READ.CSV 3.145543 3.611553 4.683256 6.535784 6.659941 15.299223
READ.CSV.PLUS 2.601893 2.768861 3.603998 4.750835 4.325023 4.221001
READ.TABLE 3.256838 4.019352 4.433693 5.506811 7.112887 10.058576
READ.TABLE.PLUS 2.572370 2.987266 3.424355 4.863232 4.315737 4.308762
Table of results for 10 microbenchmark iterations on the HPC
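For reference, here is a minimal sketch of how such a normalisation can be computed, assuming the median timings are held in a matrix called med_times (a made-up object, one row per method with "FREAD" among the row names, one column per file size):
# Divide every column by the FREAD timing for that column
normalised <- sweep(med_times, 2, med_times["FREAD", ], "/")
round(normalised, 6)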
Interestingly, for 1 million rows per file the optimised versions of read.csv and read.table take roughly 4.2 and 4.3 times as long as fread, whilst without the optimisations this leaps to roughly 15.3 and 10.1 times as long.
Note that when I conducted this experiment on my powerful laptop, as opposed to the HPC cluster, the performance gains were somewhat smaller (around 81% slower rather than 400% slower). That is interesting in itself; I am not sure I can explain it, however!
10 100 1000 10000 1e+05 1e+06
FREAD 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
READ.CSV 2.595057 2.166448 2.115312 3.042585 3.179500 6.694197
READ.CSV.PLUS 2.238316 1.846175 1.659942 2.361703 2.055851 1.805456
READ.TABLE 2.191753 2.819338 5.116871 7.593756 9.156118 13.550412
READ.TABLE.PLUS 2.275799 1.848747 1.827298 2.313686 1.948887 1.832518
Table of results for only 5 `microbenchmark` iterations on my i7 laptop
Given that the data volume is reasonably large, I'd suggest the benefits will come not only from reading the files with fread but also from the subsequent manipulation of the data with the data.table package, as opposed to traditional data.frame operations! I was lucky to learn this lesson at an early stage and would recommend others follow suit...
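To illustrate that point, here is a minimal sketch of a grouped aggregation in data.table syntax, on made-up data that is not part of the benchmark:
library(data.table)
# Made-up example table
dt <- data.table(grp = sample(c("a", "b"), 1e6, replace = TRUE), val = rnorm(1e6))
# Grouped mean using data.table's concise by-group syntax
dt[, .(mean_val = mean(val)), by = grp]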
Here is the code used in the tests.
rm(list=ls()) ; gc()
library(data.table) ; library(microbenchmark)
#=============== FUNCTIONS TO BE TESTED ===============
f_FREAD = function(NUM_READS) {
  for (i in 1:NUM_READS) {
    if (i == 1) x = fread("file.csv") else x = rbindlist(list(x, fread("file.csv")))
  }
}
f_READ.TABLE = function(NUM_READS) {
  for (i in 1:NUM_READS) {
    if (i == 1) x = read.table("file.csv") else x = rbind(x, read.table("file.csv"))
  }
}
f_READ.TABLE.PLUS = function(NUM_READS) {
  for (i in 1:NUM_READS) {
    if (i == 1) {
      x = read.table("file.csv", sep = ",", header = TRUE, comment.char = "", colClasses = c("character", "numeric", "numeric", "numeric"))
    } else {
      x = rbind(x, read.table("file.csv", sep = ",", header = TRUE, comment.char = "", colClasses = c("character", "numeric", "numeric", "numeric")))
    }
  }
}
f_READ.CSV = function(NUM_READS) {
  for (i in 1:NUM_READS) {
    if (i == 1) x = read.csv("file.csv") else x = rbind(x, read.csv("file.csv"))
  }
}
f_READ.CSV.PLUS = function(NUM_READS) {
  for (i in 1:NUM_READS) {
    if (i == 1) {
      x = read.csv("file.csv", header = TRUE, colClasses = c("character", "numeric", "numeric", "numeric"))
    } else {
      x = rbind(x, read.csv("file.csv", comment.char = "", header = TRUE, colClasses = c("character", "numeric", "numeric", "numeric")))
    }
  }
}
#=============== MAIN EXPERIMENTAL LOOP ===============
for (i in 1:6)
{
  NUM_ROWS = (10^i)    # the loop allows us to test the performance over varying numbers of rows
  NUM_READS = 10       # number of file reads & combines per test
  NUM_ITERATIONS = 10  # number of times microbenchmark repeats each test
  # create a test data.table with the specified number of rows and write it to file
  dt = data.table(
    col1 = sample(letters, NUM_ROWS, replace = TRUE),
    col2 = rnorm(NUM_ROWS),
    col3 = rnorm(NUM_ROWS),
    col4 = rnorm(NUM_ROWS)
  )
  write.csv(dt, "file.csv", row.names = FALSE)
  # run the imports for each method, recording results with microbenchmark
  results = microbenchmark(
    FREAD = f_FREAD(NUM_READS),
    READ.TABLE = f_READ.TABLE(NUM_READS),
    READ.TABLE.PLUS = f_READ.TABLE.PLUS(NUM_READS),
    READ.CSV = f_READ.CSV(NUM_READS),
    READ.CSV.PLUS = f_READ.CSV.PLUS(NUM_READS),
    times = NUM_ITERATIONS)
  results = data.table(NUM_ROWS = NUM_ROWS, results)
  if (i == 1) results.all = results else results.all = rbindlist(list(results.all, results))
}
results.all[,time:=time/1000000000] # convert from nanoseconds
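As a follow-up sketch (not part of the original script), the median timings per method and file size, like those tabulated above, can then be pulled out of results.all with data.table:
# Median total duration (in seconds, after the conversion above) per method and number of rows
median_times <- results.all[, .(median_time = median(time)), by = .(expr, NUM_ROWS)]
# Reshape to one row per method and one column per file size, as in the tables above
dcast(median_times, expr ~ NUM_ROWS, value.var = "median_time")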
Using R to read all files in a specific format and with specific extension
@jay.sf's solution works for creating a regular expression that pulls out the condition you want.
However, generally speaking, if you want to cross two lists to find the subset of elements contained in both (in your case, the files that satisfy both conditions), you can use intersect():
intersect(master1, master2)
will show you all the files that satisfy pattern 1 and pattern 2.
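A minimal sketch of that approach (the directory and patterns are placeholders):
# Files matching each condition separately (placeholder patterns)
master1 <- list.files("data/", pattern = "_2000")     # files containing "_2000"
master2 <- list.files("data/", pattern = "\\.bil$")   # files with the .bil extension
# Files satisfying both conditions
intersect(master1, master2)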