R Text File and Text Mining...How to Load Data

Load documents (represented by sentences) from a single text file in R

Try readLines("/path/to/yourfile.txt").
Each line becomes one element of a character vector whose length equals the number of lines in your document.
Otherwise, see scan().
scan() has a skip option if you need it and an nlines option if you want to read the file in chunks; readLines() has an n argument that limits how many lines are read.
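
As a minimal sketch of these options, assuming a small temporary file written only for illustration (the file contents and variable names are hypothetical):

```r
# Write a small throwaway file so the example is self-contained
tmp <- tempfile(fileext = ".txt")
writeLines(c("First sentence.", "Second sentence.",
             "Third sentence.", "Fourth sentence."), tmp)

# readLines(): one element per line; n limits how many lines are read
all_lines <- readLines(tmp)
first_two <- readLines(tmp, n = 2)

# scan(): sep = "\n" keeps whole lines together; skip and nlines pick a chunk
middle <- scan(tmp, what = character(), sep = "\n",
               skip = 1, nlines = 2, quiet = TRUE)
```

Here middle holds the second and third lines, which is the pattern to repeat with a growing skip value when reading a large file chunk by chunk.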

Combining .txt files with character data into a data frame for tidytext analysis

One approach is to use the dplyr package and a for loop to import each file and combine them into a single data frame, indexed by filename and paragraph number, and then use tidytext to tidy it up:

#install.packages(c("dplyr", "tidytext"))
library(dplyr)
library(tidytext)

file_list <- list.files(pattern = "\\.txt$") # pattern is a regular expression, not a glob

texts <- data.frame(file = character(),
                    paragraph = numeric(),
                    text = character(),
                    stringsAsFactors = FALSE) # creates empty data frame

for (i in seq_along(file_list)) {
  p <- read.delim(file_list[i],
                  header = FALSE,
                  col.names = "text",
                  stringsAsFactors = FALSE) # read.delim reads one row per line, so each paragraph must sit on its own line
  p <- p %>% mutate(file = sub("\\.txt$", "", file_list[i]), # add filename as label
                    paragraph = row_number()) # add paragraph number
  texts <- bind_rows(texts, p) # adds to existing data frame
}

words <- texts %>% unnest_tokens(word, text) # creates dataframe with one word per row, indexed

Your final output would then be:

head(words)
                   file paragraph        word
1   SampleTextFile_10kb         1       lorem
1.1 SampleTextFile_10kb         1       ipsum
1.2 SampleTextFile_10kb         1       dolor
1.3 SampleTextFile_10kb         1         sit
1.4 SampleTextFile_10kb         1        amet
1.5 SampleTextFile_10kb         1 consectetur
...

Is this what you're looking for for your next stages of analysis?
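
The same file/paragraph layout can also be sketched in base R without growing a data frame inside a loop. This is only an illustration under assumptions: the two input files are created here as hypothetical stand-ins, and each line is treated as one paragraph.

```r
# Hypothetical input: two small .txt files in a temporary directory
dir_path <- file.path(tempdir(), "txt_demo")
dir.create(dir_path, showWarnings = FALSE)
writeLines(c("Lorem ipsum dolor.", "Sit amet."), file.path(dir_path, "a.txt"))
writeLines("Consectetur adipiscing elit.", file.path(dir_path, "b.txt"))

files <- list.files(dir_path, pattern = "\\.txt$", full.names = TRUE)

# Read each file into its own data frame, one row per line (paragraph)
pieces <- lapply(files, function(f) {
  lines <- readLines(f)
  data.frame(file = sub("\\.txt$", "", basename(f)),
             paragraph = seq_along(lines),
             text = lines,
             stringsAsFactors = FALSE)
})

# Combine once at the end instead of appending on every iteration
texts <- do.call(rbind, pieces)
```

The resulting texts data frame has the same three columns as above, so it can be passed to unnest_tokens() in the same way.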

read multiple text files into r for text mining purposes

I often have this same problem. The textreadr package that I maintain is designed to make reading .csv, .pdf, .doc, and .docx documents and directories of these documents easy. It would reduce what you're doing to:

textreadr::read_dir("../data/InauguralSpeeches/")

Your example is not reproducible, so I build one below (please make your example reproducible in the future).

library(textreadr)

## Minimal working example
dir.create('delete_me')
file.copy(dir(system.file("docs/Maas2011/pos", package = "textreadr"), full.names=TRUE), 'delete_me', recursive=TRUE)
write.csv(mtcars, 'delete_me/mtcars.csv')
write.csv(CO2, 'delete_me/CO2.csv')
cat('test\n\ntesting\n\ntester', file='delete_me/00_00.txt')

## the read in of a directory
read_dir('delete_me')

output

The output below is a tibble in which each document is identified in the document column. Each line of a document becomes one row. Depending on what is in the csv files, this may not be fine-grained enough.

##     document content
## 1   0_9      Bromwell High is a cartoon comedy. It ra
## 2   00_00    test
## 3   00_00
## 4   00_00    testing
## 5   00_00
## 6   00_00    tester
## 7   1_7      If you like adult comedy cartoons, like
## 8   10_9     I'm a male, not given to women's movies,
## 9   11_9     Liked Stanley & Iris very much. Acting w
## 10  12_9     Liked Stanley & Iris very much. Acting w
## ..  ...      ...
## 141 mtcars   "Ferrari Dino",19.7,6,145,175,3.62,2.77,
## 142 mtcars   "Maserati Bora",15,8,301,335,3.54,3.57,1
## 143 mtcars   "Volvo 142E",21.4,4,121,109,4.11,2.78,18

How to do Text Mining from a HTML document, and convert it into a CSV file?

If you need the information from the table on the website, you can use rvest:

library(rvest)
url <- 'https://www.bmkg.go.id/gempabumi/gempabumi-terkini.bmkg'
out_df <- url %>% read_html() %>% html_table() %>% .[[1]]

head(out_df)
#  # Waktu Gempa            Lintang  Bujur Magnitudo Kedalaman Wilayah
#1 1 02-Apr-20 09:13:13 WIB   -7.93 125.62       5.5     10 Km 125 km TimurLaut ALOR-NTT
#2 2 29-Mar-20 06:10:35 WIB   -7.39 124.19       5.2    631 Km 108 km BaratLaut ALOR-NTT
#3 3 28-Mar-20 22:43:17 WIB   -1.72 120.14       5.8     10 Km 46 km Tenggara SIGI-SULTENG
#4 4 27-Mar-20 21:32:48 WIB    0.28 133.53       5.5     10 Km 139 km BaratLaut MANOKWARI-PAPUABRT
#5 5 27-Mar-20 04:36:40 WIB   -2.72 139.26       5.9     11 Km 72 km BaratLaut KAB-JAYAPURA-PAPUA
#6 6 26-Mar-20 22:38:03 WIB    5.58 125.16       6.3     10 Km 221 km BaratLaut TAHUNA-KEP.SANGIHE-SULUT

You can then use write.csv() to write this data to a CSV file:

write.csv(out_df, 'earthquake_data.csv', row.names = FALSE)
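
To check that a file written this way reads back cleanly, a small round trip can help. The sketch below uses a hypothetical two-row stand-in for out_df (named quakes_demo here) so that it runs without network access:

```r
# Hypothetical stand-in for out_df, so no network access is needed
quakes_demo <- data.frame(
  Magnitudo = c(5.5, 5.2),
  Wilayah = c("125 km TimurLaut ALOR-NTT", "108 km BaratLaut ALOR-NTT"),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")
write.csv(quakes_demo, csv_path, row.names = FALSE)

# Read the file back to confirm the columns and rows survived
check <- read.csv(csv_path, stringsAsFactors = FALSE)
```

Setting row.names = FALSE keeps the row numbers out of the file, so the columns read back match the original ones.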

creating corpus from multiple txt files

If you have text files and you want tidy data, I would go straight from one to the other and not bother with the tm package in between.

To find all the text files within a working directory, you can use list.files() with a pattern argument:

all_txts <- list.files(pattern = "\\.txt$")

The all_txts object will then be a character vector that contains all your filenames.
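
Note that list.files() treats pattern as a regular expression, so an unescaped dot matches any character. A small sketch with hypothetical file names shows the difference:

```r
# Hypothetical file names, used only to compare the two patterns
candidates <- c("notes.txt", "notes_txt", "notes.txt.bak", "atxt")

# Unescaped dot matches any character, so "notes_txt" and "atxt" slip through
loose <- grepl(".txt$", candidates)

# Escaped dot only matches a literal ".txt" at the end of the name
strict <- grepl("\\.txt$", candidates)
```

Only "notes.txt" passes the strict pattern, which is usually what is wanted when collecting text files.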

Then, you can set up a pipe to read in all the text files and unnest them using tidytext with a map function from purrr. You can use a mutate() within the map() to annotate each line with the filename, if you'd like.

library(tidyverse)
library(tidytext)

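# read each file as one string, label it with its filename, then split it into words
map_df(all_txts, ~ tibble(txt = read_file(.x)) %>% # tibble() replaces the deprecated data_frame()
         mutate(filename = basename(.x)) %>%
         unnest_tokens(word, txt))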

How to read csv file for text mining

I don't know why you removed the file from the original post, @Yes Boss, but this answer is based on that file rather than on your dput output. There were two reasons you couldn't read the file in: 1. The quote character was ' instead of the more common ". 2. ' is also used inside the review column, which is a bit too much for base R (it tries to split into new columns in these instances). Luckily, the data.table package is a bit smarter and can take care of problem #2:

library(data.table)

df <- fread(file = "deception.csv", quote="\'")

The resulting object will be a data.table instead of a data.frame:

> str(df)
Classes ‘data.table’ and 'data.frame': 92 obs. of 3 variables:
$ lie : chr "f" "f" "f" "f" ...
$ sentiment: chr "n" "n" "n" "n" ...
$ review : chr "Mike\\'s Pizza High Point, NY Service was very slow and the quality was low. You would think they would know at"| __truncated__ "i really like this buffet restaurant in Marshall street. they have a lot of selection of american, japanese, an"| __truncated__ "After I went shopping with some of my friend, we went to DODO restaurant for dinner. I found worm in one of the dishes ." "Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat."| __truncated__ ...
- attr(*, ".internal.selfref")=<externalptr>

If you would rather get a plain data.frame, you can turn this behaviour off by setting data.table = FALSE in fread(), although I recommend learning how to work with data.table.
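
For comparison, base R can cope with the first problem (a single quote as the quote character) as long as no apostrophes appear inside the fields. A minimal sketch with made-up data, not the original file:

```r
# Made-up rows in the same three-column layout; fields are wrapped in single quotes
csv_text <- paste(
  "lie,sentiment,review",
  "f,n,'Service was slow, and the food was cold.'",
  "t,p,'Great buffet, lots of choice.'",
  sep = "\n"
)

# quote = "'" tells read.csv that single quotes delimit a field
reviews <- read.csv(text = csv_text, quote = "'", stringsAsFactors = FALSE)
```

The commas inside the quoted reviews stay within one column here. Once a review contains an apostrophe of its own, this approach breaks down, which is the case fread() handles above.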

A personal, opinionated note: if you want to get into text mining, look into the quanteda package instead of tm. It is a lot faster and takes a more modern approach to many tasks.


