How to Prep Transaction Data into Basket for Arules

Take a look at the help page for the "transactions" class for examples of how to get your data in:

library(arules)
?transactions

For your data, you want to split the parts by Order, then use as() to coerce the resulting list to a transactions object:

trans <- as(split(data[,"Part"], data[,"Order"]), "transactions")
inspect(trans)
    items     transactionID
[1] {A,B,G}   1
[2] {R}       2
[3] {A,B}     3
[4] {E}       4
[5] {Y}       5
[6] {A,B,F,V} 6
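For completeness, here is a minimal sketch of the kind of input this assumes (the Order and Part column names come from the call above; the values are made up to reproduce the first few transactions):

```r
library(arules)

# Hypothetical input in "long" format: one row per part per order
data <- data.frame(
  Order = c(1, 1, 1, 2, 3, 3),
  Part  = c("A", "B", "G", "R", "A", "B"),
  stringsAsFactors = FALSE
)

# Split the parts by order, then coerce the list to transactions
trans <- as(split(data[, "Part"], data[, "Order"]), "transactions")
inspect(trans)
```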

How to prepare transaction data for arules

Apparently the unique(s1) call is what is causing the problem in your code. Is it required?

I managed to create the transactions object simply by commenting out that line.

sales <- structure(list(sku = c(207426L, 207422L, 207424L, 9793L, 33186L, 
72406L), product_id = c(15729L, 15725L, 15727L, 15999L, 15983L,
15992L), item_id = 1:6, order_id = c(1L, 1L, 1L, 2L, 2L, 2L)),
.Names = c("sku", "product_id", "item_id", "order_id"),
class = "data.frame", row.names = c(NA, -6L))

s1 <- split(sales$product_id, sales$order_id)
#s1 <- unique(s1)

tr <- as(s1, "transactions")
tr

transactions in sparse format with
2 transactions (rows) and
6 items (columns)

If unique() is really required, run this instead:

s1 <- lapply(s1, unique)
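The difference matters: unique() applied to the list itself compares whole transactions, while lapply(s1, unique) de-duplicates items within each transaction. A base-R sketch with made-up product IDs:

```r
# Hypothetical split result: order "1" contains product 15729 twice
s1 <- list(`1` = c(15729, 15729, 15725),
           `2` = c(15999, 15983))

# unique(s1) would only drop transactions that are identical as a whole;
# lapply() applies unique() inside each transaction instead
s1 <- lapply(s1, unique)
s1
#> $`1`
#> [1] 15729 15725
#>
#> $`2`
#> [1] 15999 15983
```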

R arules preparing dataset for transactions

I would do it this way (following the examples in the manual page for transactions):

data_list <- split(data$Product, paste(data$OrderDate, data$Customer))
trans <- as(data_list, "transactions")
inspect(trans)

    items                    transactionID
[1] {Milk}                   1-Oct John
[2] {Bread,Eggs}             2-Oct John
[3] {Butter,Eggs,Milk}       2-Oct Tom
[4] {Bread,Butter,Eggs,Wine} 3-Oct Sally

itemFrequencyPlot(trans, topN = 5)
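From there, mining the rules themselves is one more call. A self-contained sketch that rebuilds the four example baskets above as a named list (the support and confidence thresholds are placeholders, not recommendations):

```r
library(arules)

# Rebuild the example baskets shown above as a named list
data_list <- list("1-Oct John"  = "Milk",
                  "2-Oct John"  = c("Bread", "Eggs"),
                  "2-Oct Tom"   = c("Butter", "Eggs", "Milk"),
                  "3-Oct Sally" = c("Bread", "Butter", "Eggs", "Wine"))
trans <- as(data_list, "transactions")

# Mine association rules; tune supp/conf for real data
rules <- apriori(trans, parameter = list(supp = 0.25, conf = 0.8))
inspect(sort(rules, by = "lift"))
```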

Hope this helps!

correct converting dataframe into transactions for arules in R

This is because the downloaded data is comma-delimited, but in g = read.csv("g.csv", sep = ";") you are splitting the data on a semicolon. You should get the desired output if you remove sep = ";" from your definition of g.

See the following, which defines sep as ;:

> trans <-  read.transactions("~/Downloads/groceries.csv", format = 'basket', sep = ';')
> str(trans)
Formal class 'transactions' [package "arules"] with 3 slots
..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
.. .. ..@ i : int [1:9835] 1265 6162 6377 4043 3585 6475 4431 3535 4401 6490 ...
.. .. ..@ p : int [1:9836] 0 1 2 3 4 5 6 7 8 9 ...
.. .. ..@ Dim : int [1:2] 7011 9835
.. .. ..@ Dimnames:List of 2
.. .. .. ..$ : NULL
.. .. .. ..$ : NULL
.. .. ..@ factors : list()
..@ itemInfo :'data.frame': 7011 obs. of 1 variable:
.. ..$ labels: chr [1:7011] "abrasive cleaner" "abrasive cleaner,napkins" "artif. sweetener" "artif. sweetener,coffee" ...
..@ itemsetInfo:'data.frame': 0 obs. of 0 variables

And this, which defines sep as ,:

> trans <-  read.transactions("~/Downloads/groceries.csv", format = 'basket', sep = ',')
> str(trans)
Formal class 'transactions' [package "arules"] with 3 slots
..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
.. .. ..@ i : int [1:43367] 29 88 118 132 33 157 167 166 38 91 ...
.. .. ..@ p : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
.. .. ..@ Dim : int [1:2] 169 9835
.. .. ..@ Dimnames:List of 2
.. .. .. ..$ : NULL
.. .. .. ..$ : NULL
.. .. ..@ factors : list()
..@ itemInfo :'data.frame': 169 obs. of 1 variable:
.. ..$ labels: chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" ...
..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
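The tell-tale sign is the item count: 7011 "items" with sep = ';' versus 169 with sep = ','. With the wrong separator, each comma-joined line survives as a single bogus item. You can reproduce the effect with a tiny throwaway file (the file name and contents here are made up for illustration):

```r
library(arules)

# Write a tiny comma-delimited basket file
tmp <- tempfile(fileext = ".csv")
writeLines(c("milk,bread", "bread,eggs,butter"), tmp)

# Correct separator: items are split apart
good <- read.transactions(tmp, format = "basket", sep = ",")
nitems(good)   # 4 distinct items

# Wrong separator: each whole line becomes one "item"
bad <- read.transactions(tmp, format = "basket", sep = ";")
nitems(bad)    # 2 "items", each a full comma-joined line
```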

How to load transaction (basket) data in RapidMiner for association rule?

I finally understood what you meant - sorry I was being slow. This can be done using operators from the Text Processing Extension, which you have to install from the RapidMiner repository. Once you have it, you can try this process.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.0.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="7.0.000" expanded="true" height="68" name="Read CSV" width="90" x="246" y="85">
<parameter key="csv_file" value="C:\Temp\is.txt"/>
<parameter key="column_separators" value="\r\n"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations"/>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="att1.true.polynominal.attribute"/>
</list>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.0.000" expanded="true" height="82" name="Nominal to Text" width="90" x="380" y="85"/>
<operator activated="true" class="text:process_document_from_data" compatibility="7.0.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="514" y="85">
<parameter key="vector_creation" value="Term Occurrences"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.0.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34">
<parameter key="mode" value="specify characters"/>
<parameter key="characters" value=","/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

The trick is to use Read CSV to read the original file in but use end of line as the delimiter. This reads the entire line in as a polynominal attribute. From there, you have to convert this to text so that the text processing operators can do their work. The Process Documents from Data operator is then used to make the final example set. The important point is to use the Tokenize operator to split the lines into words separated by commas.

R (arules) Convert dataframe into transactions and remove NA

Ogustari is right. Here is the complete code that also handles the transaction IDs.

library("arules")
library("dplyr") ### for dbl_df
df <- structure(list(Transaction_ID = c("A001", "A002", "A003", "A004", "A005", "A006"),
Fruits = c(NA, "Apple", "Orange", NA, "Pear", "Grape"),
Vegetables = c(NA, NA, NA, "Potato", NA, "Yam"),
Personal = c("ToothP", "ToothP", NA, "ToothB", "ToothB", NA),
Drink = c("Coff", NA, "Coff", "Milk", "Milk", "Coff"),
Other = c(NA, NA, NA, NA, "Promo", NA)),
.Names = c("Transaction_ID", "Fruits", "Vegetables", "Personal", "Drink", "Other"),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))

### remove transaction IDs
tid <- as.character(df[["Transaction_ID"]])
df <- df[,-1]

### make all columns factors
for(i in 1:ncol(df)) df[[i]] <- as.factor(df[[i]])

trans <- as(df, "transactions")

### set transactionIDs
transactionInfo(trans)[["transactionID"]] <- tid

inspect(trans)

    items                                          transactionID
[1] {Personal=ToothP,Drink=Coff}                   A001
[2] {Personal=ToothP}                              A002
[3] {Drink=Coff}                                   A003
[4] {Vegetables=Potato,Personal=ToothB,Drink=Milk} A004
[5] {Personal=ToothB,Drink=Milk,Other=Promo}       A005
[6] {Vegetables=Yam,Drink=Coff}                    A006
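Note that the NA cells simply never become items, because NA is not a factor level and so produces no column=value item during coercion. A minimal check of that behavior (the column names here are made up):

```r
library(arules)

# Tiny factor data.frame with NAs, coerced to transactions
df2 <- data.frame(Drink = factor(c("Coff", NA, "Milk")),
                  Fruit = factor(c(NA, "Apple", NA)))
trans2 <- as(df2, "transactions")

itemLabels(trans2)   # only Drink=Coff, Drink=Milk, Fruit=Apple; no NA items
inspect(trans2)
```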
