Import Dat File into R

import dat file into R

The dat file has some lines of extra information before the actual data. Skip them with the skip argument:

read.table("http://www.nilu.no/projects/ccc/onlinedata/ozone/CZ03_2009.dat",
           header = TRUE, skip = 3)

An easy way to check this if you are unfamiliar with the dataset is to first use readLines to check a few lines, as below:

readLines("http://www.nilu.no/projects/ccc/onlinedata/ozone/CZ03_2009.dat",
          n = 10)
# [1] "Ozone data from CZ03 2009" "Local time: GMT + 0"
# [3] "" "Date Hour Value"
# [5] "01.01.2009 00:00 34.3" "01.01.2009 01:00 31.9"
# [7] "01.01.2009 02:00 29.9" "01.01.2009 03:00 28.5"
# [9] "01.01.2009 04:00 32.9" "01.01.2009 05:00 20.5"

Here, we can see that the actual data starts at [4], so we know to skip the first three lines.
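The same pattern can be sketched with a small local stand-in file (made-up lines mimicking the NILU format, in case the URL is unavailable):

```r
# Stand-in .dat file: 3 lines of metadata, then a header line, then data.
f <- tempfile(fileext = ".dat")
writeLines(c("Ozone data from CZ03 2009",
             "Local time: GMT + 0",
             "",
             "Date Hour Value",
             "01.01.2009 00:00 34.3",
             "01.01.2009 01:00 31.9"), f)

readLines(f, n = 4)   # inspect: the data header appears on line 4
dat <- read.table(f, header = TRUE, skip = 3)
```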

Update

If you really only wanted the Value column, you could do that by:

as.vector(
  read.table("http://www.nilu.no/projects/ccc/onlinedata/ozone/CZ03_2009.dat",
             header = TRUE, skip = 3)$Value)

Again, readLines is useful for helping us figure out the actual name of the columns we will be importing.

But I don't see much advantage to doing that over reading the whole dataset in and extracting later.

Opening and Saving a .dat file in R

There are two issues, the header and the separator. Header: It is not practical to have measurement units and column names together in the first row, because it makes creation of the header difficult. It is of course possible in R, but to avoid complication, let's just skip the first row and add the header afterwards.

txt <- 
"Time [d] T_in [degC]
0 2
-1 -1.00 0.00
1 3.08503 3.08503
2 2.01752 2.01752
3 2.4856 2.4856
4 2.14033 2.14033
5 2.35846 2.35846"

dat <- read.table(text=txt, skip=1)
names(dat) <- c("Time [d]", "T_in [degC]")

Note that I embedded the data directly in the code and read it from the txt string instead of a file. It works the same with files using read.table(file=filename, skip=1). The default separator of read.table is whitespace, i.e. either space or tab.
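A tiny sketch of that default: with sep = "" (the default), any run of spaces or tabs separates fields, so the two can even be mixed:

```r
# Mixed spaces and tabs both work with read.table's default separator.
d <- read.table(text = "1  2\t3\n4 5\t6")
d
#   V1 V2 V3
# 1  1  2  3
# 2  4  5  6
```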

To write the file back, use write.table with the following options:

write.table(dat, file="output.dat", sep=" ", quote=FALSE, row.names = FALSE)

Here we use again the space " " as the separator, but it may be better to use a tab "\t" in this case.
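A round-trip sketch with a tab separator (using a temporary file instead of output.dat):

```r
# Write a small data frame with sep = "\t" and read it back unchanged.
dat <- data.frame(time = 0:2, value = c(2.0, 3.1, 2.5))
f <- tempfile(fileext = ".dat")
write.table(dat, file = f, sep = "\t", quote = FALSE, row.names = FALSE)
dat2 <- read.table(f, header = TRUE, sep = "\t")
all.equal(dat, dat2)   # the round trip preserves the data
```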

More details are found on the help pages, where one can see that read.delim and write.csv are just convenience functions, i.e. versions of read.table and write.table with different defaults.

Alternative approach

Finally, we see that row 2 has only two values, so the idea is to handle the two-lines header and the data separately. readLines reads n=2 lines separately as is, and then read.table reads the data. Writing is then done analogously:

header <- readLines("input.dat", n=2)
dat <- read.table("input.dat", skip = 2)

writeLines(header, "output.dat")
write.table(dat, "output.dat", sep = "\t",
            quote = FALSE, col.names = FALSE, row.names = FALSE, append = TRUE)

R - Import .dat files into R, which are not delimited

You can do this with readr's read_fwf (read fixed width file).

library(readr)

data <- read_fwf("C:/dataset.dat",
                 col_positions = fwf_positions(start = c(1, 3, 9, 19),
                                               end = c(2, 8, 18, 20)))

read_fwf will try to figure out the types of each column, but if there are mistakes you can specify the types yourself using the col_types argument.
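As a sketch of that col_types argument (the two-field layout and values here are made up, and the data is passed as a literal string rather than a file):

```r
library(readr)

# Hypothetical 5-character layout: a 2-char code then a 3-char count.
# col_types = "ci" forces character then integer instead of guessing.
txt <- I("AB  7\nCD 42\n")
d <- read_fwf(txt,
              col_positions = fwf_widths(c(2, 3), c("code", "count")),
              col_types = "ci")
```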

How to read this particular .dat file in R?

This looks like a fixed-width data file.

If you read in the first 10 lines and look at the length:

library(readr)

con = file("nov18pub.dat", "r")
line = readLines(con, n = 10)
num_char <- nchar(line)
close(con)

num_char
[1] 1018 1018 1018 1018 1018 1018 1018 1018 1018 1018

it appears to be 1018 characters.

You can use read.fwf from the utils package (part of base R) to read in the file. In this example, only the first 5 columns are read in; read.fwf simply ignores any characters after the last width (a negative entry in widths can also be used to skip characters explicitly).

read.fwf("nov18pub.dat",
         widths = c(15, 2, 4, 2, 3),
         header = FALSE,
         col.names = c("Household_ID", "Month", "Year", "Line_Number", "Final_Outcome"),
         colClasses = c("character", rep("numeric", 4)))

Household_ID Month Year Line_Number Final_Outcome
1 000110118096587 11 2018 2 201
2 000110118096587 11 2018 2 201
3 710004140617571 11 2018 1 201
4 761077501690006 11 2018 1 201
5 761077501690006 11 2018 1 201
6 067091706007561 11 2018 1 201
7 067091706007561 11 2018 1 201
8 067091706007561 11 2018 1 201
9 067091706007561 11 2018 1 201
10 691715007600067 11 2018 2 201
...
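The negative-width behavior of read.fwf (a -k entry discards k characters) can be sketched on an in-memory stand-in, since nov18pub.dat itself isn't needed:

```r
# Two fixed-width lines: 2-char id, 3-digit a, 3 junk chars, 2-digit b.
lines <- c("AB123xxx45",
           "CD456yyy78")
d <- read.fwf(textConnection(lines),
              widths = c(2, 3, -3, 2),   # -3 discards the junk field
              col.names = c("id", "a", "b"))
```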

This document provides details on the column widths and codes for each variable:

https://www2.census.gov/programs-surveys/cps/techdocs/cpsnov18.pdf

In there, it mentions the format as well:

Structure: Rectangular.

File Size: 143,050 logical records; 968 character logical record length.

However, with the supplement data, the record lengths appear to go to 1018.

How to import .dat file?

In addition to @sindri_baldur's answer, use the faminc_99.zip file which includes the same faminc99.dat. The advantage is you may directly process the file in R.

The widths= are different, and finding them out without a codebook would be a little forensic, but the information "Width:" is actually in your linked codebook! We can also open faminc99.dat in a text editor and create a representative row with the given 48 distinguishable values to get a vector x,

x <- c("06933"," 4"," 2"," 213800.0"," 200000.0"," 13800","     0","    0","    0","     0","xyrid","      0","htdck","      0","      0","     0","      0","0","0","0","     0","     0","xyrid","     0","asgnd","    0","xscid","    0","xscid","     0","asgnd","     0","asgnd","    0","asgnd","    0","asgnd","     0","asgnd","200000","asgnd","  0","  0"," 45"," 69"," 2667"," 0915"," 4.197")

whose character lengths we can count using nchar,

wdt <- nchar(x)

which gives us the desired widths wdt that we may plug into read.fwf.

## lk <- "http://data.nber.org/psid/supp/faminc99.dat.Z"  ## unknown .Z archive
lk <- "http://data.nber.org/psid/supp/faminc_99.zip"

temp <- tempfile()  ## temporary file for the download

download.file(lk, temp)
r <- read.fwf(unz(temp, "faminc99.dat"), wdt)

unlink(temp)  ## delete the temporary file

In the string variables there are elements with just spaces that we might want to turn into NA. (I think r[sapply(r, is.character)] isn't absolutely necessary here).

r[] <- lapply(r, function(z) {z[grep("^\\s*$", z)] <- NA;z})

Result

head(r)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
# 1 1 19 14 25500 25500 0 0 0 0 0 <NA> 0 <NA> 0 0 0 0
# 2 2 47 41 27060 22260 4800 0 0 0 0 <NA> 0 <NA> 0 0 0 0
# 3 3 47 41 11718 0 5400 0 0 6318 0 <NA> 0 <NA> 0 0 0 0
# 4 4 26 21 73928 73428 500 0 0 0 0 <NA> 7114 <NA> 0 0 3557 3557
# 5 5 37 32 32760 19000 4800 0 0 8960 0 <NA> 0 <NA> 0 0 0 0
# 6 6 29 24 30430 3000 3940 11085 0 12405 0 <NA> 0 <NA> 0 0 0 0
# V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34
# 1 0 0 0 25000 25000 <NA> 0 <NA> 0 <NA> 0 <NA> 0 <NA> 0 <NA> 0
# 2 0 0 0 20900 20800 <NA> 100 <NA> 0 <NA> 0 <NA> 0 <NA> 0 <NA> 0
# 3 0 0 0 0 0 <NA> 0 <NA> 0 <NA> 0 <NA> 0 <NA> 0 <NA> 0
# 4 1 0 1 56970 56970 <NA> 0 <NA> 0 <NA> 0 <NA> 0 <NA> 0 <NA> 0
# 5 0 0 0 0 0 <NA> 0 <NA> 0 <NA> 0 <NA> 0 <NA> 0 <NA> 0
# 6 0 0 0 0 0 <NA> 0 <NA> 0 <NA> 0 <NA> 0 <NA> 0 <NA> 0
# V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47 V48
# 1 <NA> 0 <NA> 0 <NA> 0 <NA> 430 69 0 0 1922 8480 25.278
# 2 <NA> 0 <NA> 0 <NA> 1360 <NA> 310 628 663 319 4087 17088 18.344
# 3 <NA> 0 <NA> 0 <NA> 0 <NA> 0 0 0 0 2338 9344 25.921
# 4 <NA> 0 <NA> 0 <NA> 9344 <NA> 152 219 281 339 4283 19453 15.649
# 5 <NA> 0 <NA> 0 <NA> 15000 <NA> 0 0 372 857 2476 9853 34.565
# 6 <NA> 0 <NA> 0 <NA> 0 <NA> 0 0 0 0 6360 27715 5.825

Where

str(r)
# 'data.frame': 6997 obs. of 48 variables:
# $ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
# $ V2 : int 19 47 47 26 37 29 6 8 39 45 ...
# $ V3 : int 14 41 41 21 32 24 4 5 34 39 ...
# $ V4 : num 25500 27060 11718 73928 32760 ...
# $ V5 : num 25500 22260 0 73428 19000 ...
# $ V6 : int 0 4800 5400 500 4800 3940 0 200 0 0 ...
# $ V7 : int 0 0 0 0 0 11085 0 0 0 0 ...
# $ V8 : int 0 0 0 0 0 0 0 0 0 0 ...
# $ V9 : int 0 0 6318 0 8960 12405 0 0 8760 0 ...
# $ V10: int 0 0 0 0 0 0 0 0 0 0 ...
# $ V11: chr NA NA NA NA ...
# $ V12: int 0 0 0 7114 0 0 0 0 0 0 ...
# $ V13: chr NA NA NA NA ...
# $ V14: int 0 0 0 0 0 0 0 0 0 0 ...
# $ V15: int 0 0 0 0 0 0 0 0 0 0 ...
# $ V16: int 0 0 0 3557 0 0 0 0 0 0 ...
# $ V17: int 0 0 0 3557 0 0 0 0 0 0 ...
# $ V18: int 0 0 0 1 0 0 0 0 0 0 ...
# $ V19: int 0 0 0 0 0 0 0 0 0 0 ...
# $ V20: int 0 0 0 1 0 0 0 0 0 0 ...
# $ V21: int 25000 20900 0 56970 0 0 34000 24825 0 34500 ...
# $ V22: int 25000 20800 0 56970 0 0 28000 24825 0 34500 ...
# $ V23: chr NA NA NA NA ...
# $ V24: int 0 100 0 0 0 0 0 0 0 0 ...
# $ V25: chr NA NA NA NA ...
# $ V26: int 0 0 0 0 0 0 0 0 0 0 ...
# $ V27: chr NA NA NA NA ...
# $ V28: int 0 0 0 0 0 0 0 0 0 0 ...
# $ V29: chr NA NA NA NA ...
# $ V30: int 0 0 0 0 0 0 0 0 0 0 ...
# $ V31: chr NA NA NA NA ...
# $ V32: int 0 0 0 0 0 0 3000 0 0 0 ...
# $ V33: chr NA NA NA NA ...
# $ V34: int 0 0 0 0 0 0 0 0 0 0 ...
# $ V35: chr NA NA NA NA ...
# $ V36: int 0 0 0 0 0 0 3000 0 0 0 ...
# $ V37: chr NA NA NA NA ...
# $ V38: int 0 0 0 0 0 0 0 0 0 0 ...
# $ V39: chr NA NA NA NA ...
# $ V40: int 0 1360 0 9344 15000 0 0 0 0 0 ...
# $ V41: chr NA NA NA NA ...
# $ V42: int 430 310 0 152 0 0 706 14 0 133 ...
# $ V43: int 69 628 0 219 0 0 398 187 0 858 ...
# $ V44: int 0 663 0 281 372 0 984 0 0 0 ...
# $ V45: int 0 319 0 339 857 0 769 0 0 0 ...
# $ V46: int 1922 4087 2338 4283 2476 6360 2667 2932 1648 3268 ...
# $ V47: int 8480 17088 9344 19453 9853 27715 10915 13120 7818 16246 ...
# $ V48: num 25.3 18.3 25.9 15.6 34.6 ...

how to import dat file in R

To read .dat files you can use read.delim(). See ?read.delim for info on how to specify the formatting.
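A minimal sketch of read.delim, which is just read.table preconfigured for tab-separated files with a header row (passed here as a literal string rather than a file):

```r
# read.delim defaults: sep = "\t", header = TRUE.
txt <- "name\tscore\nalice\t10\nbob\t12\n"
d <- read.delim(text = txt)
```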

EDIT:

Maybe after you read it as a single line you could do something like this:

# Creating sample data
data <- data.frame(x = c(1, 'a', 5, 2, 'b', 6, 3,'c', 7))
# Creating the columns
a <- as.numeric(data$x[seq(1,nrow(data), 3)])
b <- as.character(data$x[seq(2,nrow(data), 3)])
c <- as.numeric(data$x[seq(3, nrow(data), 3)])
# Putting it all together
data1 <- data.frame(a,b,c)

This works even if you want to create a lot of rows but gets tedious if you have a lot of columns. Honestly I have never had only a single line in a file so I don't know how to specify that directly when reading it. Hope that still helps though :)
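If there are many columns, the seq() indexing above does get tedious; one sketch that scales is reshaping the single interleaved column with matrix(..., byrow = TRUE):

```r
# Same sample values, interleaved record by record in one vector.
x <- c(1, 'a', 5, 2, 'b', 6, 3, 'c', 7)   # mixed types coerce to character
m <- matrix(x, ncol = 3, byrow = TRUE)    # one row per record
data1 <- data.frame(a = as.numeric(m[, 1]),
                    b = m[, 2],
                    c = as.numeric(m[, 3]))
```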

Import .dat file from repository using R

A few years ago I also tried doing that and eventually cobbled together a method, but in the meantime, @AnthonyDamico wrote the SAScii-package and all these efforts are now unnecessary. Thank you, Anthony, for all your excellent efforts at making public use data available to R users. I suspect you've saved the Kaiser Foundation hundreds of thousands of dollars over the course of your employment and consulting career.

library(SAScii)
youth.tf <- tempfile()
daturl <- "https://wwwn.cdc.gov/nchs/data/nhanes3/2a/youthk.dat"
code_url <- "https://wwwn.cdc.gov/nchs/data/nhanes3/2a/youthk.sas"
Sas_code <- url(code_url)
writeLines(readLines(Sas_code), con = youth.tf)
youth.fwf.parameters <- parse.SAScii(youth.tf, beginline = 5)
str( youth.fwf.parameters )
#-----
'data.frame': 90 obs. of 4 variables:
$ varname: chr "SEQN" "HYK1A" "HYK1B" "HYK2A" ...
$ width : num 5 1 1 2 2 2 2 4 4 2 ...
$ char : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ divisor: num 1 1 1 1 1 1 1 1 1 1 ...
#------

daturl <- "https://wwwn.cdc.gov/nchs/data/nhanes3/2a/youthk.dat"
in.youth <- read.fwf(daturl, widths = youth.fwf.parameters$width,
                     col.names = youth.fwf.parameters$varname)

I'm afraid there are quite a few NA's, but it appears your code was mostly successful.

str(in.youth)
'data.frame': 13944 obs. of 90 variables:
$ SEQN : int 7 12 13 14 16 20 21 23 26 27 ...
$ HYK1A : int 1 1 2 2 2 2 1 2 2 2 ...
$ HYK1B : int 2 2 2 2 2 2 2 2 2 2 ...
$ HYK2A : int 1 1 NA NA NA NA 1 NA NA NA ...
$ HYK2B : int NA NA NA NA NA NA NA NA NA NA ...
$ HYK3CG : int 1 3 NA NA NA NA 4 NA NA NA ...
$ HYK3DG : int 2 3 NA NA NA NA 2 NA NA NA ...
$ HYK6SG : int 30 30 NA NA NA NA 30 NA NA NA ...
$ HYK8SG : int 24 1 NA NA NA NA 24 NA NA NA ...
$ HYK3CH : int NA NA NA NA NA NA NA NA NA NA ...
$ HYK3DH : int NA NA NA NA NA NA NA NA NA NA ...
$ HYK6SH : int NA NA NA NA NA NA NA NA NA NA ...
$ HYK8SH : int NA NA NA NA NA NA NA NA NA NA ...
$ HYK3CI : int NA NA NA NA NA NA NA NA NA NA ...
$ HYK3DI : int NA NA NA NA NA NA NA NA NA NA ...
$ HYK6SI : int NA NA NA NA NA NA NA NA NA NA ...
.... snipped the rest of the 90 variables

I only see these as having possibly invalid data and they appear to have not been loaded correctly:

 HYK11AG: Factor w/ 179 levels "      ","003.9 ",
HYK11AH: Factor w/ 108 levels " ","011.9 ",
HYK11AK: Factor w/ 12 levels " ","079.9 ",

These appear to be ICD-9-CM codes that should have been read into R with character format, but for which there were not the proper $ entries in the INPUT statements in the CDC SAS code. You can get a full set of diagnostics with the "read.SAScii"-function:

in2 <- read.SAScii( daturl, code_url)

You can check the counts of non-NA values against the values reported in the codebook at: https://wwwn.cdc.gov/nchs/data/nhanes3/1a/YOUTH-acc.pdf

This is a partial screenshot of page 210 of that codebook and shows that there should be all NA's for the three "HYK__" variables on that page:


Note added for amusement. This is actually now a counter-example of sorts to this R fortune. There are quite a few items in that package referencing SAS (often in a less than totally favorable light) and it took me 7 or 8 tries before the one I was looking for was brought up:

fortunes::fortune("SAS")

Jim Gustafsson: I would like to put my SAS-code into R. Could I do that,
if yes, how?
Frank Harrell: Just reverse the procedure you use when you put R code into
SAS. ;)
-- Jim Gustafsson and Frank Harrell
R-help (February 2004)

How can I convert this ABES ASCII/.dat file into a workable CSV file within R?

A combination of the tidyverse and the SAS and SPSS scripts provided in the ABES documentation can be a timesaver here. You can use the widths from the SPSS script with the readr::read_fwf() option fwf_cols(); the first few columns to pass to fwf_cols() would look like this: Q1 = c(1, 17), Q2 = c(18, 18), Q3 = c(19, 19). Once you have it working, you can check your frequencies against Appendix E in the ABES Data User's Guide.
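A sketch of that fwf_cols() call on made-up 19-character records (the real positions and values come from the ABES SPSS script, so treat everything here as a placeholder):

```r
library(readr)

# Fake 19-character lines standing in for the ABES .dat layout:
# a 17-char Q1 field followed by two single-character fields.
txt <- I(paste0(strrep("a", 17), "23\n",
                strrep("b", 17), "56\n"))
d <- read_fwf(txt, fwf_cols(Q1 = c(1, 17), Q2 = c(18, 18), Q3 = c(19, 19)))
```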


