Read a text file in R line by line
Here is a solution with a for loop. Importantly, it moves the single call to readLines out of the for loop, so that the file is not needlessly read again on every iteration:
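fileName <- "up_down.txt"
conn <- file(fileName, open = "r")
linn <- readLines(conn)        # read every line once, before the loop
for (i in seq_along(linn)) {   # seq_along() also behaves correctly for an empty file
  print(linn[i])
}
close(conn)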
How to read a txt file line by line in R/Rstudio?
You can use the readLines function.
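A minimal sketch of that call, assuming a small temporary file created only for illustration (the file contents and the names tmp, all_lines and first_two are not from the original question):

```r
# Create a throwaway file so the example is self-contained (illustrative data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)

# readLines() returns a character vector with one element per line.
all_lines <- readLines(tmp)

# The n argument limits how many lines are read.
first_two <- readLines(tmp, n = 2)

# Iterate over the lines that were read.
for (line in all_lines) {
  print(line)
}

unlink(tmp)  # remove the temporary file
```

Reading everything at once is usually the simplest option when the file fits in memory; the n argument is the knob to reach for when it does not.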
What is a good way to read line-by-line in R?
The example Josh linked to is one that I use all the time.
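inputFile <- "/home/jal/myFile.txt"
con <- file(inputFile, open = "r")
dataList <- list()
ecdfList <- list()
# Read one line per iteration; readLines() returns character(0) at end of file.
while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
  # Split the line on spaces and convert the pieces to numbers.
  myVector <- strsplit(oneLine, " ")
  myVector <- list(as.numeric(myVector[[1]]))
  dataList <- c(dataList, myVector)
  # Build the empirical CDF for this line and append it.
  myEcdf <- ecdf(myVector[[1]])
  ecdfList <- c(ecdfList, myEcdf)
}
close(con)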
I edited the example to create two lists from your example data. dataList is a list where each item in the list is a vector of numeric values from each line in your text file. ecdfList is a list where each element is an ecdf for each line in your text file.
You should probably add some try() or tryCatch() logic in there to properly handle situations where the ecdf can't be created because a line has no usable numeric values or some such. But the above example should get you pretty close. Good luck!
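One way the error handling could look, as a rough sketch: safe_ecdf is a hypothetical helper name, and returning NULL for a bad line is an assumption about what you want to happen, not part of the original answer.

```r
# Hypothetical helper: build an ecdf from one line of text,
# or return NULL when ecdf() cannot be created for that line.
safe_ecdf <- function(oneLine) {
  # Non-numeric tokens become NA; the coercion warning is silenced on purpose.
  values <- suppressWarnings(as.numeric(strsplit(oneLine, " ")[[1]]))
  tryCatch(
    ecdf(values),
    error = function(e) {
      message("Skipping line: ", conditionMessage(e))
      NULL
    }
  )
}

good <- safe_ecdf("1 2 3 4")  # a step function
bad  <- safe_ecdf("")         # no numeric values, so ecdf() fails and NULL comes back
```

Inside the while loop you would then test is.null() on the result before appending it to ecdfList.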
R - Reading lines from a .txt-file after a specific line
1) read.pattern: read.pattern in gsubfn can be used to read only lines matching a specific pattern. In this example we match beginning of line, optional space(s), 1 or more digits, 1 or more spaces, an optional minus followed by 1 or more digits, optional space(s), end of line. The portions matching the parenthesized portions of the regexp are returned as columns in a data.frame. text = Lines in this self-contained example can be replaced with "myfile.txt", say, if the data is coming from a file. Modify the pattern to suit.
Lines <- "junk
junk
##XYDATA= (X++(Y..Y))
131071 -2065
131070 -4137
131069 -6408
131068 -8043"
library(gsubfn)
DF <- read.pattern(text = Lines, pattern = "^ *(\\d+) +(-?\\d+) *$")
giving:
> DF
V1 V2
1 131071 -2065
2 131070 -4137
3 131069 -6408
4 131068 -8043
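If installing gsubfn is not an option, the same filtering idea can be approximated in base R. This is only a sketch that reuses the regular expression without its capture groups; it is not how read.pattern is implemented, and the names all_lines, keep and DF2 are made up for the example.

```r
Lines <- "junk
junk
##XYDATA= (X++(Y..Y))
131071 -2065
131070 -4137"

# Split the text into individual lines.
con <- textConnection(Lines)
all_lines <- readLines(con)
close(con)

# Keep only lines made of an integer, spaces, and an optionally negative integer.
keep <- grepl("^ *\\d+ +-?\\d+ *$", all_lines)

# Parse the surviving lines as whitespace-separated columns.
DF2 <- read.table(text = all_lines[keep])
```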
2) read twice: Another possibility using only base R is simply to read it once to determine the value of skip= and a second time to do the actual read using that value. To read from a file myfile.txt, replace text = Lines and textConnection(Lines) with "myfile.txt".
read.table(text = Lines,
skip = grep("##XYDATA=", readLines(textConnection(Lines))))
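For reference, here is how the read-twice approach might look against an actual file. The temporary file stands in for myfile.txt, and taking the first match with [1] is an added assumption for the case where the marker appears more than once.

```r
# Illustrative file with the same layout as the example data.
tmp <- tempfile(fileext = ".txt")
writeLines(c("junk", "junk", "##XYDATA= (X++(Y..Y))",
             "131071 -2065", "131070 -4137"), tmp)

# First pass: find the line number of the marker (matched as a literal string).
marker_line <- grep("##XYDATA=", readLines(tmp), fixed = TRUE)[1]

# Second pass: skip everything up to and including the marker line.
DF <- read.table(tmp, skip = marker_line)

unlink(tmp)
```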
Added Some revisions and added second approach.
reading text file in r and store what is read conditioned on the next line
This will be somewhat problematic because the format is so irregular from item to item. Here's a run at the codebook text for the first item:
txt <- "m5a2 A2. Confirm how much time child lives with respondent
--------------------------------------------------------------------------------------------------------------
type: numeric (byte)
label: BM_101F
range: [-9,7] units: 1
unique values: 8 missing .: 0/4898
tabulation: Freq. Numeric Label
1383 -9 -9 Not in wave
4 -2 -2 Don't know
2 -1 -1 Refuse
3272 1 1 all or most of the time
29 2 2 about half of the time
76 3 3 some of the time
80 4 4 none of the time
52 7 7 only on weekends
"
Lines <- readLines( textConnection(txt))
# isolate lines with letter in first column
Lines[grep("^[a-zA-Z]", Lines)]
# Now replace long runs of spaces with commas and scan:
scan(text=sub("[ ]{10,100}", ",", Lines[grep("^[a-zA-Z]", Lines)] ),
sep=",", what="")
#----
Read 2 items
[1] "m5a2"
[2] "A2. Confirm how much time child lives with respondent"
The "tabulation" line can be used to create column labels.
colnames <- scan(text=sub(".*tabulation[:]", "",
Lines[grep("tabulation[:]", Lines)] ), sep="", what="")
#Read 3 items
The substitution-with-commas strategy needs to be a bit more involved for the lines afterward. First isolate the rows where a numeric digit is the first non-space character:
dataRows <- Lines[grep("^[ ]*\\d", Lines)]
Then substitute commas for the pattern digit-2+spaces and read with read.csv:
myDat <- read.csv(text=
gsub("(\\d)[ ]{2,}", "\\1,", dataRows ),
header=FALSE ,col.names=colnames)
#------------
myDat
V1 V2 V3
1 1383 -9 -9 Not in wave
2 4 -2 -2 Don't know
3 2 -1 -1 Refuse
4 3272 1 1 all or most of the time
5 29 2 2 about half of the time
6 76 3 3 some of the time
7 80 4 4 none of the time
8 52 7 7 only on weekends
Looping over multiple items might be possible with a counter generated from cumsum( grepl("^-------", Lines) ) if the Lines object were the entire file, such as the one at:
Lines <- readLines("http://fragilefamilies.princeton.edu/sites/fragilefamilies/files/ff_mom_cb9.txt")
sum( grepl("^-------", Lines) )
#----------------------
[1] 1966
Warning messages:
1: In grepl("^-------", Lines) :
input string 6995 is invalid in this locale
2: In grepl("^-------", Lines) :
input string 7349 is invalid in this locale
3: In grepl("^-------", Lines) :
input string 7350 is invalid in this locale
4: In grepl("^-------", Lines) :
input string 7352 is invalid in this locale
5: In grepl("^-------", Lines) :
input string 7353 is invalid in this locale
My "hand-held scan()-er" suggested to me that there were only two types of codebook record: "tabulations" (presumably items with fewer than 10 or so instances) and "examples" (ones with more). They have different structures (as can be seen above in your codebook fragment), so maybe only two types of parsing logic would need to be built and deployed. I think the tools to do that are described above.
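The cumsum() counter mentioned above could be used to split the file into one chunk per item. The sketch below runs on a made-up miniature codebook and assumes, as in the fragment shown earlier, that each item's name line sits directly above its dashed rule, so the counter is shifted up by one line.

```r
# Illustrative stand-in for the codebook: name line, dashed rule, details.
Lines <- c("item1 First question",
           "-------",
           "   type: numeric",
           "item2 Second question",
           "-------",
           "   type: string")

is_rule <- grepl("^-------", Lines)

# Shift the flags up one line so the name line starts each record,
# then count records with a running sum.
record_id <- cumsum(c(is_rule[-1], FALSE))

# One character vector per codebook item.
records <- split(Lines, record_id)
```

Each element of records can then be handed to whichever of the two parsing routines fits its structure.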
The warnings all relate to the character "\x92" being used as an apostrophe. Regex and R share an escape-character "\", so you need to escape the escapes. They could be corrected with:
Lines <- gsub("\\\x92", "'", Lines )
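A byte-wise variant that avoids the regex escaping question altogether is sketched here. It assumes the offending byte really is 0x92 and that a plain ASCII apostrophe is an acceptable replacement; the sample string is invented.

```r
# Invented sample containing the raw byte 0x92 where an apostrophe belongs.
x <- "Don\x92t know"

# fixed = TRUE treats the pattern literally; useBytes = TRUE compares raw bytes,
# so the current locale's idea of a valid character does not matter.
cleaned <- gsub("\x92", "'", x, fixed = TRUE, useBytes = TRUE)
```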
Reading a txt file line by line with skip function of every second line and the output saved as a dataframe using R
We read the data with readLines
lines <- readLines('file.txt')
Then use a recycled logical index, c(FALSE, TRUE), to keep every second line, and split each kept line into a list of single characters
lst1 <- strsplit(gsub("\t", "", lines[c(FALSE, TRUE)]), "")
lst1
#[[1]]
# [1] "D" "M" "E" "S" "P" "V" "F" "A" "F" "P" "K" "A" "L" "D" "L" "E" "T" "H" "I" "E" "K" "L" "F" "L" "Y"
#[[2]]
# [1] "D" "D" "T" "L" "D" "D" "S" "D" "E" "D" "D" "I" "V" "V" "E" "S" "Q" "D" "P" "P" "L" "P" "S" "W" "G"
#[[3]]
# [1] "P" "R" "R" "E" "T" "E" "E" "F" "N" "D" "L" "K" "A" "L" "D" "F" "I" "L" "S" "N" "S" "L" "T" "H" "P"
#[[4]]
# [1] "E" "K" "A" "R" "M" "I" "Y" "E" "D" "D" "E" "T" "Y" "L" "S" "P" "K" "E" "V" "S" "L" "D" "S" "R" "V"
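To see the recycling at work and to get from the list to a data frame, here is a small sketch on invented input. It assumes every kept line has the same number of characters, which rbind needs in order to produce a rectangular result.

```r
# Invented stand-in for the contents of file.txt: header lines alternate with data lines.
lines <- c("name1", "AB\tCD", "name2", "EF\tGH")

# c(FALSE, TRUE) is recycled along the vector: drop lines 1, 3, ... and keep 2, 4, ...
seqs <- lines[c(FALSE, TRUE)]

# Remove tabs, then split each line into single characters.
lst1 <- strsplit(gsub("\t", "", seqs), "")

# One row per kept line, one column per character.
df <- as.data.frame(do.call(rbind, lst1), stringsAsFactors = FALSE)
```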