R (arules) Convert dataframe into transactions and remove NA
Ogustari is right. Here is the complete code that also handles the transaction IDs.
library("arules")
library("dplyr") ### for tbl_df
df <- structure(list(Transaction_ID = c("A001", "A002", "A003", "A004", "A005", "A006"),
Fruits = c(NA, "Apple", "Orange", NA, "Pear", "Grape"),
Vegetables = c(NA, NA, NA, "Potato", NA, "Yam"),
Personal = c("ToothP", "ToothP", NA, "ToothB", "ToothB", NA),
Drink = c("Coff", NA, "Coff", "Milk", "Milk", "Coff"),
Other = c(NA, NA, NA, NA, "Promo", NA)),
.Names = c("Transaction_ID", "Fruits", "Vegetables", "Personal", "Drink", "Other"),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))
### save the transaction IDs, then drop the ID column
tid <- as.character(df[["Transaction_ID"]])
df <- df[,-1]
### make all columns factors (NA does not become a level, so NAs yield no items)
for(i in seq_len(ncol(df))) df[[i]] <- as.factor(df[[i]])
trans <- as(df, "transactions")
### set transactionIDs
transactionInfo(trans)[["transactionID"]] <- tid
inspect(trans)
items transactionID
[1] {Personal=ToothP,Drink=Coff} A001
[2] {Fruits=Apple,Personal=ToothP} A002
[3] {Fruits=Orange,Drink=Coff} A003
[4] {Vegetables=Potato,Personal=ToothB,Drink=Milk} A004
[5] {Fruits=Pear,Personal=ToothB,Drink=Milk,Other=Promo} A005
[6] {Fruits=Grape,Vegetables=Yam,Drink=Coff} A006
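The NAs disappear because as.factor() does not create a level for NA, so an NA cell contributes no item to its transaction. A minimal base R sketch of that reasoning, on a small made-up data frame (the values here are illustrative, not the data above):

```r
# Hypothetical wide data: one row per transaction, NA where nothing was bought.
df <- data.frame(Fruits = c(NA, "Apple", "Orange"),
                 Drink  = c("Coff", NA, "Coff"),
                 stringsAsFactors = FALSE)

# NA is not turned into a factor level, so it cannot become an item.
fruit_levels <- levels(as.factor(df$Fruits))

# Each non-NA cell should end up as exactly one item in its row's transaction.
expected_sizes <- rowSums(!is.na(df))
```

With arules loaded, size(trans) on the converted object should match these per-row counts, which makes for a quick sanity check after the coercion.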
Correctly convert data.frame to transactions for arules
We may need to split by the 'data' column and then unlist each group
df_trans <- as(setNames(lapply(split(noticias_json[-3],
noticias_json$data), unlist), NULL), "transactions")
inspect(df_trans)
# items
#[1] {icarai,
# trafico de drogas}
#[2] {danilo passos,
# porte ilegal de armas,
# roubo,
# serra verde,
# trafico de drogas}
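To see what the split/unlist step produces before the coercion, here is a rough base R sketch on a made-up data frame with list columns. The column names day, place and crime are hypothetical, and the unique() call is my own addition, on the assumption that the same item may show up twice once several rows are merged into one group:

```r
# Hypothetical data: one row per report, list columns holding several values.
reports <- data.frame(day = c("d1", "d1", "d2"), stringsAsFactors = FALSE)
reports$place <- I(list("north", "south", c("east", "west")))
reports$crime <- I(list("theft", c("theft", "fraud"), "fraud"))

# Split the item columns by day, then flatten each group into one vector.
# unique() drops items that occur more than once within the same day.
items <- lapply(split(reports[c("place", "crime")], reports$day),
                function(g) unique(unlist(g, use.names = FALSE)))
```

Each element of items is then a plain character vector, which is the shape that as(items, "transactions") expects.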
data
noticias_json <- structure(list(bairro = structure(list("icarai",
c("danilo passos",
"serra verde")), class = "AsIs"), crime = structure(list("trafico de drogas",
c("trafico de drogas", "porte ilegal de armas", "roubo")), class = "AsIs"),
data = c("01-02-2016", "31-02-2016")), .Names = c("bairro",
"crime", "data"), row.names = c(NA, -2L), class = "data.frame")
How to convert a data frame to arules' transaction object
Here is what I tried. I think you need to reshape your data and build a list. First, I created a transaction ID, just in case. Then I pivoted the data into a long-format data frame so that all products sit in one column, removed the rows where the product is NA, and converted the products to a factor. Finally, for each group (transaction ID), I created a list containing all of its products. The result, x, has a list column called whatever, with one element per transaction. This is the list you want to use to create a transaction object.
library(tidyverse)
library(arules)
mutate(mydata, transaction_id = 1:n()) %>%
pivot_longer(cols = contains("Item"), names_to = "item", values_to = "product") %>%
filter(complete.cases(product)) %>%
mutate(product = factor(product)) %>%
group_by(transaction_id) %>%
summarize(whatever = list(product)) -> x
# Assign transaction ID as name to whatever
names(x$whatever) <- x$transaction_id
$`1`
[1] lipstick Bronzer Mascara
Levels: Bronzer Eyeliner Eyeshadow Lip gloss lipstick Mascara Nail varnish Powder Remover
$`2`
[1] Eyeshadow lipstick
Levels: Bronzer Eyeliner Eyeshadow Lip gloss lipstick Mascara Nail varnish Powder Remover
$`3`
[1] Powder Remover
Levels: Bronzer Eyeliner Eyeshadow Lip gloss lipstick Mascara Nail varnish Powder Remover
$`4`
[1] Nail varnish Lip gloss Eyeliner
Levels: Bronzer Eyeliner Eyeshadow Lip gloss lipstick Mascara Nail varnish Powder Remover
Finally, I created a transaction-class object.
mybasket <- as(x$whatever, "transactions")
> summary(mybasket)
transactions as itemMatrix in sparse format with
4 rows (elements/itemsets/transactions) and
9 columns (items) and a density of 0.2777778
most frequent items:
lipstick Bronzer Eyeliner Eyeshadow Lip gloss (Other)
2 1 1 1 1 4
element (itemset/transaction) length distribution:
sizes
2 3
2 2
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.0 2.0 2.5 2.5 3.0 3.0
includes extended item information - examples:
labels
1 Bronzer
2 Eyeliner
3 Eyeshadow
includes extended transaction information - examples:
transactionID
1 1
2 2
3 3
DATA
mydata <- structure(list(Transaction = c("12/09/2001", "2/09/2001", "13/09/2002",
"14/09/2003"), Item1 = c("lipstick", "Eyeshadow", "Powder", "Nail varnish"
), Item2 = c("Bronzer", "lipstick", "Remover", "Lip gloss"),
Item3 = c("Mascara", NA, NA, "Eyeliner")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
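If you would rather not pull in the tidyverse, the same list can be built in base R. This is only a sketch under the assumption that the item columns all contain "Item" in their names, as above; the two-row data frame is made up for illustration:

```r
# Hypothetical wide data with NA padding, shaped like mydata above.
wide <- data.frame(Item1 = c("lipstick", "Eyeshadow"),
                   Item2 = c("Bronzer", NA),
                   stringsAsFactors = FALSE)

# Pick the item columns by name, as contains("Item") does in the pipeline.
item_cols <- grep("Item", names(wide), value = TRUE)

# One character vector per row, with the NA cells dropped.
baskets <- lapply(seq_len(nrow(wide)), function(i) {
  row_items <- unlist(wide[i, item_cols], use.names = FALSE)
  row_items[!is.na(row_items)]
})

# Use the row number as the transaction ID.
names(baskets) <- seq_len(nrow(wide))
```

The resulting baskets list can be passed to as(baskets, "transactions") in the same way as x$whatever.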
long dataframe to transactions for arules in R
Here is one way I do it, which I find to be faster. The idea is to create a wide data frame of 0/1 values, with one row per transaction and one column per item, and then coerce that to transactions. It does not require any split.
library(dplyr)
library(tidyr)
library(arules)
df <- df %>%
select(TID, itemNO) %>%
distinct() %>%
mutate(value = 1) %>%
spread(itemNO, value, fill = 0)
itemMatrix <- as(as.matrix(df[, -1]), 'transactions')
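The same wide 0/1 layout can also be sketched in base R with table(), which may help to see what the matrix looks like. The column names TID and itemNO follow the code above; the values are made up:

```r
# Hypothetical long data: one row per (transaction, item) pair, with a duplicate.
long <- data.frame(TID    = c(1, 1, 2, 3, 3, 3),
                   itemNO = c("a", "b", "a", "b", "c", "c"),
                   stringsAsFactors = FALSE)

# Cross-tabulate transactions against items; counts above 1 come from duplicates.
tab <- table(long$TID, long$itemNO)

# Turn the counts into a logical incidence matrix (TRUE = item is in the transaction).
incidence <- matrix(as.vector(tab) > 0, nrow = nrow(tab),
                    dimnames = list(rownames(tab), colnames(tab)))
```

Testing the counts with > 0 plays the role of distinct() here, and the row names keep the transaction IDs. The logical matrix can then be coerced with as(incidence, "transactions").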
Convert R data.frame column to Arules transactions
Have a look at the examples in ?transactions. You need a list of item vectors (item labels), not a data.frame.
items <- strsplit(as.character(a_df$Tags), ", ")
trans3 <- as(items, "transactions")
rules <- apriori(trans3, parameter = list(sup = 0.1, conf = 0.5, target="rules",minlen=1))
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen maxlen
0.5 0.1 1 none FALSE TRUE 5 0.1 1 10
target ext
rules FALSE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 0
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[22 item(s), 7 transaction(s)] done [0.00s].
sorting and recoding items ... [22 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 done [0.00s].
writing ... [198 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
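It can help to look at the list before coercing it. A small sketch with a made-up a_df (the Tags values are hypothetical), which also trims stray whitespace and skips missing rows in case the separator is not always exactly ", ":

```r
# Hypothetical data: comma-separated tags in a single character column.
a_df <- data.frame(Tags = c("r, arules, apriori", "r,dplyr", NA),
                   stringsAsFactors = FALSE)

# Drop rows without tags so they do not turn into a literal NA item.
tags <- a_df$Tags[!is.na(a_df$Tags)]

# Split on commas, then trim whitespace and remove repeated tags per row.
items <- lapply(strsplit(tags, ","), function(v) unique(trimws(v)))
```

Each element of items is a character vector of item labels, so as(items, "transactions") should work on it the same way as above.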
correct converting dataframe into transactions for arules in R
This is because the downloaded data is comma-delimited, but in g=read.csv("g.csv",sep=";") you are splitting the data on a semicolon. You should get the desired output if you remove sep = ";" from your definition of g.
See the following, which sets sep to ";":
> trans <- read.transactions("~/Downloads/groceries.csv", format = 'basket', sep = ';')
> str(trans)
Formal class 'transactions' [package "arules"] with 3 slots
..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
.. .. ..@ i : int [1:9835] 1265 6162 6377 4043 3585 6475 4431 3535 4401 6490 ...
.. .. ..@ p : int [1:9836] 0 1 2 3 4 5 6 7 8 9 ...
.. .. ..@ Dim : int [1:2] 7011 9835
.. .. ..@ Dimnames:List of 2
.. .. .. ..$ : NULL
.. .. .. ..$ : NULL
.. .. ..@ factors : list()
..@ itemInfo :'data.frame': 7011 obs. of 1 variable:
.. ..$ labels: chr [1:7011] "abrasive cleaner" "abrasive cleaner,napkins" "artif. sweetener" "artif. sweetener,coffee" ...
..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
And this, which sets sep to ",":
> trans <- read.transactions("~/Downloads/groceries.csv", format = 'basket', sep = ',')
> str(trans)
Formal class 'transactions' [package "arules"] with 3 slots
..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
.. .. ..@ i : int [1:43367] 29 88 118 132 33 157 167 166 38 91 ...
.. .. ..@ p : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
.. .. ..@ Dim : int [1:2] 169 9835
.. .. ..@ Dimnames:List of 2
.. .. .. ..$ : NULL
.. .. .. ..$ : NULL
.. .. ..@ factors : list()
..@ itemInfo :'data.frame': 169 obs. of 1 variable:
.. ..$ labels: chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" ...
..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
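A quick way to check which separator a file actually uses, before reading it as transactions, is to count the fields per line. A minimal sketch on a throwaway temp file (the file and its contents are made up):

```r
# Write a small comma-delimited basket file to a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c("milk,bread,butter", "bread,coffee"), path)

# With the wrong separator every line is a single field (one giant "item").
fields_semicolon <- count.fields(path, sep = ";")

# With the right separator each line splits into its individual items.
fields_comma <- count.fields(path, sep = ",")
```

If every line comes back as one field, the separator is most likely wrong, which is what the 7011 "items" with labels such as "abrasive cleaner,napkins" indicate in the first str() output.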