Load documents (represented by sentences) from a single text file in R
Try readLines("/path/to/yourfile.txt")
Each line becomes one element of a character vector that is n_lines long, where n_lines is the number of lines in your file.
Otherwise, see scan().
scan() has a skip option if you need it and an nlines option if you want to read the file in chunks; readLines() has no skip argument, but its n argument limits how many lines are read.
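As a minimal sketch of reading part of a file, assuming a small temporary file created only for illustration (the file and its contents are hypothetical, not from the question):

```r
# Hypothetical sample file: one sentence ("document") per line
tmp <- tempfile(fileext = ".txt")
writeLines(c("First sentence.", "Second sentence.",
             "Third sentence.", "Fourth sentence."), tmp)

# Read everything: one vector element per line
all_lines <- readLines(tmp)

# readLines(): n limits how many lines are read
first_two <- readLines(tmp, n = 2)

# scan(): skip the first two lines, then read the next two;
# sep = "\n" keeps each line as a single element
next_two <- scan(tmp, what = character(), sep = "\n",
                 skip = 2, nlines = 2, quiet = TRUE)

length(all_lines)
first_two
next_two
```

Without sep = "\n", scan() splits on whitespace and returns individual words rather than whole lines.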
Combining .txt files with character data into a data frame for tidytext analysis
One approach could be to use the dplyr package and a for loop to import each file and combine them into a data frame, with filename and paragraph number used as the index, then use tidytext to tidy up:
#install.packages(c("dplyr", "tidytext"))
library(dplyr)
library(tidytext)
file_list <- list.files(pattern = "\\.txt$") # pattern is a regular expression, not a glob
texts <- data.frame(file = character(),
                    paragraph = numeric(),
                    text = character(),
                    stringsAsFactors = FALSE) # creates empty data frame
for (i in seq_along(file_list)) {
  p <- read.delim(file_list[i],
                  header = FALSE,
                  col.names = "text",
                  stringsAsFactors = FALSE) # read.delim puts each line (here, each paragraph) in its own row
  p <- p %>% mutate(file = sub("\\.txt$", "", file_list[i]), # add filename as label
                    paragraph = row_number()) # add paragraph number
  texts <- bind_rows(texts, p) # adds to existing data frame
}
words <- texts %>% unnest_tokens(word, text) # creates dataframe with one word per row, indexed
Your final output would then be:
head(words)
file paragraph word
1 SampleTextFile_10kb 1 lorem
1.1 SampleTextFile_10kb 1 ipsum
1.2 SampleTextFile_10kb 1 dolor
1.3 SampleTextFile_10kb 1 sit
1.4 SampleTextFile_10kb 1 amet
1.5 SampleTextFile_10kb 1 consectetur
...
Is this what you're looking for for your next stages of analysis?
read multiple text files into r for text mining purposes
I often have this same problem. The textreadr package that I maintain is designed to make reading .csv, .pdf, .doc, and .docx documents and directories of these documents easy. It would reduce what you're doing to:
textreadr::read_dir("../data/InauguralSpeeches/")
Your example is not reproducible, so I created one below (please make your example reproducible in the future).
library(textreadr)
## Minimal working example
dir.create('delete_me')
file.copy(dir(system.file("docs/Maas2011/pos", package = "textreadr"), full.names=TRUE), 'delete_me', recursive=TRUE)
write.csv(mtcars, 'delete_me/mtcars.csv')
write.csv(CO2, 'delete_me/CO2.csv')
cat('test\n\ntesting\n\ntester', file='delete_me/00_00.txt')
## the read in of a directory
read_dir('delete_me')
output
The output below shows the tibble output with each document registered in the document column. For every line in the document there is one row for that document. Depending on what's in the csv files, this may not be fine-grained enough.
## document content
## 1 0_9 Bromwell High is a cartoon comedy. It ra
## 2 00_00 test
## 3 00_00
## 4 00_00 testing
## 5 00_00
## 6 00_00 tester
## 7 1_7 If you like adult comedy cartoons, like
## 8 10_9 I'm a male, not given to women's movies,
## 9 11_9 Liked Stanley & Iris very much. Acting w
## 10 12_9 Liked Stanley & Iris very much. Acting w
## .. ... ...
## 141 mtcars "Ferrari Dino",19.7,6,145,175,3.62,2.77,
## 142 mtcars "Maserati Bora",15,8,301,335,3.54,3.57,1
## 143 mtcars "Volvo 142E",21.4,4,121,109,4.11,2.78,18
How to do Text Mining from a HTML document, and convert it into a CSV file?
If you need the information from the table on the website, you can use rvest:
library(rvest)
url <- 'https://www.bmkg.go.id/gempabumi/gempabumi-terkini.bmkg'
out_df <- url %>% read_html() %>% html_table() %>% .[[1]]
head(out_df)
# # Waktu Gempa Lintang Bujur Magnitudo Kedalaman Wilayah
#1 1 02-Apr-20 09:13:13 WIB -7.93 125.62 5.5 10 Km 125 km TimurLaut ALOR-NTT
#2 2 29-Mar-20 06:10:35 WIB -7.39 124.19 5.2 631 Km 108 km BaratLaut ALOR-NTT
#3 3 28-Mar-20 22:43:17 WIB -1.72 120.14 5.8 10 Km 46 km Tenggara SIGI-SULTENG
#4 4 27-Mar-20 21:32:48 WIB 0.28 133.53 5.5 10 Km 139 km BaratLaut MANOKWARI-PAPUABRT
#5 5 27-Mar-20 04:36:40 WIB -2.72 139.26 5.9 11 Km 72 km BaratLaut KAB-JAYAPURA-PAPUA
#6 6 26-Mar-20 22:38:03 WIB 5.58 125.16 6.3 10 Km 221 km BaratLaut TAHUNA-KEP.SANGIHE-SULUT
You could use write.csv to write this data to a CSV file:
write.csv(out_df, 'earthquake_data.csv', row.names = FALSE)
creating corpus from multiple txt files
If you have text files and you want tidy data, I would go straight from one to the other and not bother with the tm package in between.
To find all the text files within a working directory, you can use list.files with a pattern argument:
all_txts <- list.files(pattern = "\\.txt$")
The all_txts object will then be a character vector that contains all your filenames.
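Note that pattern is a regular expression rather than a shell glob, so an unescaped dot matches any character. A small sketch of the difference, using a hypothetical temporary directory and made-up file names:

```r
# Hypothetical directory with two real .txt files and one look-alike
d <- tempfile("txt_demo_")
dir.create(d)
file.create(file.path(d, c("a.txt", "b.txt", "mytxt")))

# Unescaped dot: also matches "mytxt", because "." matches the "y"
loose <- list.files(d, pattern = ".txt$")

# Escaped dot: only names that really end in ".txt"
strict <- list.files(d, pattern = "\\.txt$")

loose
strict
```

If you would rather write a glob such as "*.txt", base R's glob2rx() converts it into the equivalent regular expression.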
Then, you can set up a pipe to read in all the text files and unnest them using tidytext with a map function from purrr. You can use a mutate() within the map() to annotate each line with the filename, if you'd like.
library(tidyverse)
library(tidytext)
map_df(all_txts, ~ tibble(txt = read_file(.x)) %>% # tibble() replaces the deprecated data_frame()
         mutate(filename = basename(.x)) %>%
         unnest_tokens(word, txt))
How to read csv file for text mining
I don't know why you removed the file from the original post, @Yes Boss, but this answer is based on that file rather than your dput output. The file had two problems that kept you from reading it in: 1. your quote character was ' instead of the more common "; 2. ' is also used inside the review column, which is a bit too much for base R (it tries to split into new columns in these instances). Luckily, the package data.table is a bit smarter and can take care of problem #2:
library(data.table)
df <- fread(file = "deception.csv", quote="\'")
The resulting object will be a data.table instead of a data.frame:
> str(df)
Classes ‘data.table’ and 'data.frame': 92 obs. of 3 variables:
$ lie : chr "f" "f" "f" "f" ...
$ sentiment: chr "n" "n" "n" "n" ...
$ review : chr "Mike\\'s Pizza High Point, NY Service was very slow and the quality was low. You would think they would know at"| __truncated__ "i really like this buffet restaurant in Marshall street. they have a lot of selection of american, japanese, an"| __truncated__ "After I went shopping with some of my friend, we went to DODO restaurant for dinner. I found worm in one of the dishes ." "Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat."| __truncated__ ...
- attr(*, ".internal.selfref")=<externalptr>
You can turn this behaviour off by setting data.table = FALSE in fread() if you want to, though I recommend learning how to work with data.table.
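To illustrate the difference, here is a minimal sketch using a small hypothetical CSV written to a temporary file (not the deception.csv file from the question):

```r
library(data.table)

# Hypothetical sample file, used only for illustration
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,review", "1,slow service", "2,great buffet"), tmp)

dt <- fread(tmp)                      # default: a data.table (which is also a data.frame)
df <- fread(tmp, data.table = FALSE)  # a plain data.frame

class(dt)
class(df)
```

An object that has already been read in as a data.table can also be converted afterwards with as.data.frame().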
A personal opinionated note: If you want to get into text mining, look into the quanteda package instead of tm. It is a lot faster and has a more modern approach to many tasks.