Import multiple files and then find the array of averages of the columns
Group by coordinates
Combining the data frames by rows works as long as you don't require the final result to be in any particular order and no file contains different rows with the same coordinates. If that is the case, you can simply use the common coordinates to group rows, and then aggregate over them like this:
aggregate(Value ~ Lat + Lon, hello, mean)
Group by row numbers
If, on the other hand, you have duplicate coordinates, or want the final result to be in the same order as the inputs, then you should extract the Value column from each data.frame and combine them into a matrix. Then you can compute the mean of each matrix row, and combine those means with the two coordinate columns of any input data frame. This whole approach relies heavily on the order of the input rows, i.e. on the row number of a given place being the same in all files. You could implement it like this:
mean_values <- apply(do.call(cbind, lapply(data_list, function(df) df$Value)), 1, mean)
cbind(data_list[[1]][1:2], Value=mean_values)
Trying this out
Here is an example session of what this looks like on my system:
> data_list <- list(File.1=data.frame(Lat=c(10,12),Lon=c(12,13),Value=c(15,16)),
+ File.2=data.frame(Lat=c(10,12),Lon=c(12,13),Value=c(11,15)))
> hello <- as.data.frame(do.call(rbind,data_list))
> dim(hello)
[1] 4 3
> str(hello)
'data.frame': 4 obs. of 3 variables:
$ Lat : num 10 12 10 12
$ Lon : num 12 13 12 13
$ Value: num 15 16 11 15
> aggregate(Value ~ Lat + Lon, hello, mean)
Lat Lon Value
1 10 12 13.0
2 12 13 15.5
> value_matrix <- do.call(cbind, lapply(data_list, function(df) df$Value))
> value_matrix
File.1 File.2
[1,] 15 11
[2,] 16 15
> mean_values <- apply(value_matrix, 1, mean)
> cbind(data_list[[1]][1:2], Value=mean_values)
Lat Lon Value
1 10 12 13.0
2 12 13 15.5
Only a single column?
As you only get a single column from reading your input files (according to your dim output), you should investigate that data frame using head or str to see what went wrong. Most likely, your columns aren't separated by tabs but by commas, spaces, or some such. Notice that if you do not specify sep, then any sequence of spaces and/or tabs will be used as a column separator. Read the documentation for read.table for details.
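To see the effect, here is a small self-contained sketch (it writes a throwaway comma-separated file to a temporary location, so the file name is not one of yours) comparing the default whitespace splitting with an explicit sep:

```r
# Write a small comma-separated file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("Lat,Lon,Value", "10,12,15", "12,13,16"), tmp)

# Default sep: splits on runs of spaces/tabs, so each line stays one column
wrong <- read.table(tmp, header = TRUE)
dim(wrong)   # 2 rows, 1 column

# Explicit sep: the commas are now recognised as column separators
right <- read.table(tmp, header = TRUE, sep = ",")
dim(right)   # 2 rows, 3 columns
```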
How to make read.csv in R read a .csv file without typing the full name of the .csv file
This is not specifically an answer to your question (others have covered that), but rather some advice that may be helpful for accomplishing your task in a different way.
First, some of the GUIs for R have file name completion. You can type the first part: read.csv("001- and then hit a key or key combination (in the Windows GUI you press TAB) and the rest of the filename will be filled in for you (as long as it is unique).
You can use the file.choose or choose.files functions to open a dialog box and pick your file with the mouse: read.csv(file.choose()).
If you want to read in all of the above files, then you can do this in one step using lapply and either sprintf or list.files (or others):
mycsvlist <- lapply( 1:150, function(x) read.csv( sprintf("%03d-XXX.csv", x) ) )
or
mycsvlist <- lapply( list.files(pattern="\\.csv$"), read.csv )
You could also use list.files to get a list of all the files matching a pattern and then pass one of the returned values to read.csv:
tmp <- list.files(pattern="001.*csv$")
read.csv(tmp[1])
Extracting speaker interventions from a text using R? Or something else?
If you don't want to replace the speaker names, you can use what is called a 'positive lookahead', like this:
# some example data:
bla <- c("Mr. X : blablabla bla bla bla. Mrs. Y : bla bla blablablab Mr. XY : bla bla balblabla blabl abl" )
# replace with lookahead (the dots are escaped so they match literal periods):
gsub("(?=(Mr\\.|Mrs\\.))", "@ ", bla, perl = TRUE)
[1] "@ Mr. X : blablabla bla bla bla. @ Mrs. Y : bla bla blablablab @ Mr. XY : bla bla balblabla blabl abl"
The @ is a good starting point for extracting the individual interventions. This can be done thus:
pattern <- "@.[^@]*"
matches <- gregexpr(pattern, bla)
interventions <- regmatches(bla, matches)
interventions <- unlist(interventions)
interventions
[1] "@ Mr. X : blablabla bla bla bla. " "@ Mrs. Y : bla bla blablablab " "@ Mr. XY : bla bla balblabla blabl abl"
Which function should I use to read unstructured text file into R?
read.delim reads in data in table format (with rows and columns, as in Excel). It is not very useful for reading a string of text.

To read text from a text file into R you can use readLines(). readLines() creates a character vector with as many elements as there are lines of text. A line, for this kind of software, is any string of text that ends with a newline. (Read about newline on Wikipedia.) When you write text, you enter your system-specific newline character(s) by pressing Return. A line of text is therefore not defined by the width of your software window, but can run over many visual rows; in a book, such a line would be a paragraph. So readLines() splits your text at the paragraphs:
> readLines("/path/to/tom_sawyer.txt")
[1] "\"TOM!\""
[2] "No answer."
[3] "\"TOM!\""
[4] "No answer."
[5] "\"What's gone with that boy, I wonder? You TOM!\""
[6] "No answer."
[7] "The old lady pulled her spectacles down and looked over them about the room; then she put them up and looked out under them. She seldom or never looked through them for so small a thing as a boy; they were her state pair, the pride of her heart, and were built for \"style,\" not service—she could have seen through a pair of stove-lids just as well. She looked perplexed for a moment, and then said, not fiercely, but still loud enough for the furniture to hear:"
[8] "\"Well, I lay if I get hold of you I'll—\""
Note that you can scroll long text to the left here on Stack Overflow. That seventh line is longer than this column is wide.
As you can see, readLines() read that long seventh paragraph as one line. And, as you can also see, readLines() added a backslash in front of each quotation mark. Since R holds the individual lines in quotation marks, it needs to distinguish these from the ones that are part of the original text, so it "escapes" the original quotation marks. Read about escaping on Wikipedia.

readLines() may output a warning that an "incomplete final line" was found in your file. This only means that there was no newline after the last line. You can suppress this warning with readLines(..., warn = FALSE), but you don't have to: it is not an error, and suppressing the warning does nothing but suppress the warning message.
If you don't want to just output your text to the R console but process it further, create an object that holds the output of readLines():
mytext <- readLines("textfile.txt")
Besides readLines(), you can also use scan(), readBin(), and other functions to read text from files. Look at the manual by entering ?scan etc. Look at ?connections to learn about many different methods to read files into R.
I would strongly advise you to write your text into a .txt file with a text editor like Vim, Notepad, TextWrangler, etc., and not to compose it in a word processor like MS Word. Word files contain more than the text you see on screen or in print, and R will read all of that extra content, too. You can try and see what you get, but for good results you should either save your file as a .txt file from Word or compose it in a text editor.
You can also copy-paste your text from a text file open in any other software to R or compose your text in the R console:
myothertext <- c("What did you do?
+ I wrote some text.
+ Ah, interesting.")
> myothertext
[1] "What did you do?\nI wrote some text.\nAh, interesting."
Note how entering Return does not cause R to execute the command before the string is closed with "). R just replies with +, telling me that I can continue to edit. I did not type in those plusses. Try it. Note also that the newlines are now part of your string of text. (I'm on a Mac, so my newline is \n.)
If you input your text manually, I would load the whole text as one string into a vector:
x <- c("The text of your book.")
You could load different chapters into different elements of this vector:
y <- c("Chapter 1", "Chapter 2")
For better reference, you can name the elements:
z <- c(ch1 = "This is the text of the first chapter. It is not long! Why was the author so lazy?", ch2 = "This is the text of the second chapter. It is even shorter.")
Now you can split the elements of any of these vectors:
sentences <- strsplit(z, "[.!?] *")
Enter ?strsplit to read the manual for this function and learn about the arguments it takes. The second argument takes a regular expression. In this case I told strsplit to split the elements of the vector at any of the three punctuation marks followed by an optional space (if you don't allow for a space here, the resulting "sentences" will be preceded by a space). sentences now contains:
> sentences
$ch1
[1] "This is the text of the first chapter" "It is not long"
[3] "Why was the author so lazy"
$ch2
[1] "This is the text of the second chapter" "It is even shorter"
You can access the individual sentences by indexing:
> sentences$ch1[2]
[1] "It is not long"
R has no way of knowing that it should not split after "Mr.". You must define exceptions in your regular expression. Explaining this is beyond the scope of this question.
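That said, here is a minimal sketch of one way to encode such exceptions (the abbreviation list is purely illustrative): a negative lookbehind with perl = TRUE, so a period preceded by "Mr" or "Mrs" is not treated as a sentence end:

```r
txt <- "Mr. Smith arrived. He was late!"
# (?<!Mr|Mrs) blocks the split when the punctuation follows Mr or Mrs
strsplit(txt, "(?<!Mr|Mrs)[.!?] *", perl = TRUE)
# should yield "Mr. Smith arrived" and "He was late"
```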
How you would tell R how to recognize subjects or objects, I have no idea.
Error in reading all files present in a folder
Lists are not like arrays in other languages; they do not have a predetermined size and are not sparse. Hence, assigning to a specific index when the list has length 0 throws an IndexError, because you are saying "re-assign the i-th element to be this" when there is no i-th element. Instead, you want to append elements to the end of the list.
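To make that concrete, here is a minimal sketch (not your original code) contrasting index assignment on an empty list with append:

```python
data = []

# Index assignment fails on an empty list: there is no element 0 yet.
try:
    data[0] = "first"
except IndexError as err:
    print(err)  # list assignment index out of range

# append grows the list, so no pre-existing element is required.
data.append("first")
data.append("second")
print(data)  # ['first', 'second']
```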
Also, your code has many other confusing parts to it, so rather than trying to muddle through it, the following code will simply produce a list of strings of the contents of each file in the current working directory (note that this will include the Python script itself, so you may want to filter out the name of the script).
import os

file_data = []
files_in_cwd = os.listdir()
for file_name in files_in_cwd:
    with open(file_name) as file_handler:
        file_data.append(file_handler.read())
print(file_data)
Note that you should always use a with statement when opening files and that mode='r' is the default; also, the usual way of creating an empty list is with [].