Read SAS sas7bdat data into R

The sas7bdat package worked fine for all but one of the files I was looking at (specifically, this one). When I reported the error to the sas7bdat developer, Matthew Shotwell, he pointed me toward Hadley's haven package, which also has a read_sas function.

This method is superior for two reasons:

1) It had no trouble reading the above-linked file.
2) It is much (I'm talking much) faster than read.sas7bdat. Here's a quick benchmark (on this file, which is smaller than the others) as evidence:

library(microbenchmark)
library(sas7bdat)  # read.sas7bdat
library(haven)     # read_sas

microbenchmark(times = 10L,
               read.sas7bdat("psu97ai.sas7bdat"),
               read_sas("psu97ai.sas7bdat"))

Unit: milliseconds
expr min lq mean median uq max neval cld
read.sas7bdat("psu97ai.sas7bdat") 66696.2955 67587.7061 71939.7025 68331.9600 77225.1979 82836.8152 10 b
read_sas("psu97ai.sas7bdat") 397.9955 402.2627 410.4015 408.5038 418.1059 425.2762 10 a

That's right--haven::read_sas takes, on average, over 99% less time than sas7bdat::read.sas7bdat.

minor update

I previously wasn't able to figure out whether the two methods produced the same data (i.e., that both have equal levels of fidelity with respect to reading the data), but have finally done so:

library(data.table)

# Keep both results as data.tables
sas7bdat <- setDT(read.sas7bdat("psu97ai.sas7bdat"))
haven    <- setDT(read_sas("psu97ai.sas7bdat"))

# read.sas7bdat reads strings as factors, and as of now has no
# stringsAsFactors argument with which to prevent this; find those
# columns in its output (not haven's output, which has no factors)
idj_factor <- names(sas7bdat)[sapply(sas7bdat, is.factor)]

# Reset all factor columns as characters
sas7bdat[ , (idj_factor) := lapply(.SD, as.character), .SDcols = idj_factor]

# Check equality of the tables
all.equal(sas7bdat, haven, check.attributes = FALSE)
# [1] TRUE

However, note that read.sas7bdat has kept a massive list of attributes for the file, presumably a holdover from SAS:

str(sas7bdat)
# ...
# - attr(*, "column.info")=List of 70
# ..$ :List of 12
# .. ..$ name : chr "NCESSCH"
# .. ..$ offset: int 200
# .. ..$ length: int 12
# .. ..$ type : chr "character"
# .. ..$ format: chr "$"
# .. ..$ fhdr : int 0
# .. ..$ foff : int 76
# .. ..$ flen : int 1
# .. ..$ label : chr "UNIQUE SCHOOL ID (NCES ASSIGNED)"
# .. ..$ lhdr : int 0
# .. ..$ loff : int 44
# .. ..$ llen : int 32
# ...

So, if by any chance you need these attributes (I know some people are particularly keen on the labels, for instance), perhaps read.sas7bdat is the option for you after all.
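If you do want those attributes, they are easy to pull out of the object. A minimal sketch, assuming the same file as above and that each element of the "column.info" attribute carries $name and $label entries as in the str() output (labels can be absent for some columns, so the code tolerates NULL):

```r
library(sas7bdat)

df   <- read.sas7bdat("psu97ai.sas7bdat")
info <- attr(df, "column.info")

# Build a named vector of variable labels, tolerating missing ones
labels <- vapply(info,
                 function(col) if (is.null(col$label)) NA_character_ else col$label,
                 character(1))
names(labels) <- vapply(info, function(col) col$name, character(1))

labels["NCESSCH"]
# per the str() output above: "UNIQUE SCHOOL ID (NCES ASSIGNED)"
```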

How to import a formatted .sas7bdat file into R without format deletion?

I think the issue you have, most likely, is misunderstanding how R labels work.

When I use the following SAS code:

libname temp 'h:\temp\';
proc format lib=temp;
  value sexf
    1 = 'Female'
    2 = 'Male'
  ;
  value racef
    1 = 'Black'
    2 = 'Asian'
    3 = 'White'
    4 = 'Other'
  ;
  value hispf
    1 = 'Of Hispanic Origin'
    2 = 'Not of Hispanic Origin'
  ;
quit;
options fmtsearch=(temp);
data temp.rtest;
  input sex race hisp;
  format sex sexf. race racef. hisp hispf.;
datalines;
1 1 1
2 1 1
1 2 1
2 2 1
1 3 1
2 3 1
1 4 1
2 4 1
1 1 2
2 1 2
1 2 2
2 2 2
1 3 2
2 3 2
1 4 2
2 4 2
;;;;
run;

And then use the following R code:

library(haven)
data <- read_sas("H:/temp/rtest.sas7bdat",
                 catalog_file = "H:/temp/formats.sas7bcat")
print(data)

It works as expected - the console prints the labelled text.

# A tibble: 16 x 3
sex race hisp
<dbl+lbl> <dbl+lbl> <dbl+lbl>
1 1 [Female] 1 [Black] 1 [Of Hispanic Origin]
2 2 [Male] 1 [Black] 1 [Of Hispanic Origin]
3 1 [Female] 2 [Asian] 1 [Of Hispanic Origin]
4 2 [Male] 2 [Asian] 1 [Of Hispanic Origin]
5 1 [Female] 3 [White] 1 [Of Hispanic Origin]
6 2 [Male] 3 [White] 1 [Of Hispanic Origin]
7 1 [Female] 4 [Other] 1 [Of Hispanic Origin]
8 2 [Male] 4 [Other] 1 [Of Hispanic Origin]
9 1 [Female] 1 [Black] 2 [Not of Hispanic Origin]
10 2 [Male] 1 [Black] 2 [Not of Hispanic Origin]
11 1 [Female] 2 [Asian] 2 [Not of Hispanic Origin]
12 2 [Male] 2 [Asian] 2 [Not of Hispanic Origin]
13 1 [Female] 3 [White] 2 [Not of Hispanic Origin]
14 2 [Male] 3 [White] 2 [Not of Hispanic Origin]
15 1 [Female] 4 [Other] 2 [Not of Hispanic Origin]
16 2 [Male] 4 [Other] 2 [Not of Hispanic Origin]

However, if I view it in RStudio's data viewer (by double-clicking the dataset in the Data pane), the labels don't appear, and that is what you pasted into the question (a screenshot of that viewer). I don't believe the viewer supports value labels (variable labels, i.e. column-header labels, are supported, but value labels are not). If you want to verify that, you may want to ask a new question specifically about the viewer, with the code here cleaned up (you're welcome to use my example code).

What you will probably want to do is convert the value labels to factors, since factors are how R typically handles categorical variables. This can be done a few ways; the labelled package documentation discusses why, and that package is one option, but there are several approaches. Again, this would make a good separate question if you can't figure it out on your own.
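For instance, haven's own as_factor() converts labelled columns. A minimal sketch, assuming the rtest file written by the SAS code above (the paths are from my example; adjust as needed):

```r
library(haven)

data <- read_sas("H:/temp/rtest.sas7bdat",
                 catalog_file = "H:/temp/formats.sas7bcat")

# Convert every labelled column to a factor, leaving other columns untouched
data[] <- lapply(data, function(x) if (is.labelled(x)) as_factor(x) else x)

head(data$sex)
# now a factor with levels Female and Male rather than a dbl+lbl column
```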

Reading SAS datasets R sas7bdat package

Do you have to use the sas7bdat package? Can you use haven instead?

> library("haven")
> read_sas("c:/downloads/empty.sas7bdat")
# A tibble: 0 x 5
# ... with 5 variables: Name <chr>, Sex <chr>, Age <dbl>, Height <dbl>, Weight <dbl>
> read_sas("c:/downloads/one.sas7bdat")
# A tibble: 1 x 5
Name Sex Age Height Weight
<chr> <chr> <dbl> <dbl> <dbl>
1 Alfred M 14.0 69.0 112
>

Importing a SAS data set into R

SAS does NOT store character values in numeric columns. But there are some ways that numeric values will be printed using characters.

First is if you are using the BEST format (the default for numeric variables). If a value cannot be represented exactly in the number of characters available, SAS will print it in scientific notation.

Second is special missing values. SAS has 28 missing values: regular missing is represented by a period, and the other 27 by a single letter (A through Z) or an underscore.

Third would be a custom format that displays the numbers using letters.

The first should not cause any trouble when importing into R. The last two can be handled by haven; see the semantics vignette in its documentation.
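To illustrate the special-missing handling, a small sketch with values constructed directly in R rather than read from a file (haven represents SAS's .A through .Z as "tagged" NAs):

```r
library(haven)

# Two special missings (as SAS's .A and .Z would arrive) plus a regular NA
x <- c(1, 2, tagged_na("a"), tagged_na("z"), NA)

is.na(x)   # TRUE for all three missings, so regular R code treats them as NA
na_tag(x)  # but the original tag ("a", "z") remains recoverable
```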

As to your multiple-line CSV file, there are two possible issues. The first is simply that you did not tell SAS to use lines long enough for your data; make sure to use a longer LRECL setting on the file you are writing to.

filename csv 'myfile.csv' lrecl=1000000;
proc export data=mydata file=csv dbms=csv;
run;

The second possible issue is that some of your character variables contain end-of-line characters. It is best to remove or replace those characters; you can always add them back if they are really wanted. For example, these steps will export the same file as above, but first replace the carriage returns and line feeds in the character variables with pipe characters.

data for_export;
  set mydata;
  array _c _character_;
  do over _c;
    /* replace line feeds ('0A'x) and carriage returns ('0D'x) with pipes */
    _c = translate(_c, '||', '0A0D'x);
  end;
run;
proc export data=for_export file=csv dbms=csv;
run;

How to read multiple large sas data files into R, filter rows and save subset datasets as .rds

I slept on it, and woke up with a working solution using a function to do the work, and a loop to cycle through filenames. This also enables me to save the output in a different folder (my raw data folder is read-only):

library(haven)
library(data.table)
fromFolder <- "~/folder_with_input_data/"
toFolder <- "~/folder_with_output_data/"
import_sas <- function(filename) {
  dt <- read_sas(paste(fromFolder, filename, sep = ""))
  dt <- as.data.table(dt)
  # Keep only rows where column A starts with "10"
  dt <- dt[grep("^10", A)]
  saveRDS(dt, paste(toFolder, filename, ".rds", sep = ""), compress = FALSE)
  rm(dt)
}

file_list <- list.files(path = fromFolder, pattern = "^dataset")
for (filename in file_list) {
  import_sas(filename)
}

I haven't tested this with the full 30 files yet. I'll do that tonight.
If I encounter problems, I will post an update tomorrow. Otherwise, this question can be closed in 48 hours.

Update: It worked without a hitch and completed the 297 GB conversion in around 13 hours. I don't think it can be optimized to run much faster; the vast majority of the time is spent opening the SAS files, which I don't think can be done faster by means other than haven. Unless someone has an idea to optimize the process, this question can be closed.
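One possible optimization, untested against the original data: each file's conversion is independent of the others, so on a Unix-alike with enough RAM (and a disk that isn't already the bottleneck) the conversions could run in parallel. A sketch using base R's parallel package together with the import_sas() function defined above; mc.cores = 4 is an assumption to tune for your machine:

```r
library(parallel)
library(haven)
library(data.table)

fromFolder <- "~/folder_with_input_data/"
file_list  <- list.files(path = fromFolder, pattern = "^dataset")

# Fork up to 4 workers, each converting one file at a time
# (mclapply forks, so this will not parallelize on Windows)
invisible(mclapply(file_list, import_sas, mc.cores = 4))
```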


