Split Data.Frame Row into Multiple Rows Based on Commas

Split (explode) pandas dataframe string entry to separate rows

How about something like this:

In [55]: pd.concat([pd.Series(row['var2'], row['var1'].split(','))
                    for _, row in a.iterrows()]).reset_index()
Out[55]:
  index  0
0     a  1
1     b  1
2     c  1
3     d  2
4     e  2
5     f  2

Then you just have to rename the columns index and 0 to the names you want.
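Putting both steps together, a minimal sketch could look like the one below. The contents of a are an assumption, chosen so that the result matches the output shown above; the data from the original question is not reproduced here.

```python
import pandas as pd

# Assumed input: one comma-separated string column and one value column.
a = pd.DataFrame({"var1": ["a,b,c", "d,e,f"], "var2": [1, 2]})

# One Series per row: the split pieces become the index,
# and the scalar var2 value is broadcast to every piece.
pieces = [
    pd.Series(row["var2"], index=row["var1"].split(","))
    for _, row in a.iterrows()
]

# reset_index() turns the pieces into a column named "index";
# the values end up in a column named 0, so rename both.
out = pd.concat(pieces).reset_index()
out.columns = ["var1", "var2"]
print(out)
```

Note that iterrows() is slow on large frames, so this is best suited to small inputs.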

Split data.frame row into multiple rows based on commas

This should work.

install.packages("splitstackshape")
library(splitstackshape)
out <- concat.split.multiple(mydat, c("v1","v2"), seps=",", "long")

out
      v1 v2 v3
1:  name  1  1
2: name2  2  1
3: name3  3  2
4: name4  4  3
5: name5  5  3

Splitting a comma separated value into separate rows in Pandas

Here is one way using explode:

df.Col1 = df.Col1.str.split(',')
df.explode('Col1')

Output:

  Col1   Col2  Col3
0    1  0.034   0.1
1    2  1.234   0.2
1    3  1.234   0.2
1    4  1.234   0.2
2    5  0.300   1.3
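The output above keeps the original row labels, so index 1 appears three times, and the exploded values are still strings. If that is not what you want, a sketch along these lines may help. The input frame is an assumption reconstructed from the output above, and ignore_index requires pandas 1.1 or later.

```python
import pandas as pd

# Assumed input, reconstructed from the output shown above.
df = pd.DataFrame({
    "Col1": ["1", "2,3,4", "5"],
    "Col2": [0.034, 1.234, 0.300],
    "Col3": [0.1, 0.2, 1.3],
})

out = (
    df.assign(Col1=df["Col1"].str.split(","))   # strings -> lists, without mutating df
      .explode("Col1", ignore_index=True)       # fresh 0..n-1 index
      .astype({"Col1": int})                    # exploded pieces are strings; cast them
)
print(out)
```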

Split cell into multiple rows in pandas dataframe

Here's one way using numpy.repeat and itertools.chain. Conceptually, this is exactly what you want to do: repeat some values, chain others. Recommended for a small number of columns; otherwise, stack-based methods may fare better.

import numpy as np
import pandas as pd
from itertools import chain

# return list from series of comma-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split(',')))

# calculate lengths of splits
lens = df['package'].str.split(',').map(len)

# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({'order_id': np.repeat(df['order_id'], lens),
                    'order_date': np.repeat(df['order_date'], lens),
                    'package': chainer(df['package']),
                    'package_code': chainer(df['package_code'])})

print(res)

   order_id order_date package package_code
0         1  20/5/2018      p1         #111
0         1  20/5/2018      p2         #222
0         1  20/5/2018      p3         #333
1         3  22/5/2018      p4         #444
2         7  23/5/2018      p5         #555
2         7  23/5/2018      p6         #666
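As an alternative to the numpy.repeat / chain approach (a different technique, not the one above), pandas 1.3 and later let DataFrame.explode take a list of columns, provided the lists in each row have matching lengths. A sketch, with the input frame assumed from the printed result:

```python
import pandas as pd

# Assumed input, reconstructed from the printed result above.
df = pd.DataFrame({
    "order_id": [1, 3, 7],
    "order_date": ["20/5/2018", "22/5/2018", "23/5/2018"],
    "package": ["p1,p2,p3", "p4", "p5,p6"],
    "package_code": ["#111,#222,#333", "#444", "#555,#666"],
})

cols = ["package", "package_code"]

# Split every target column into lists, then explode them together.
# Requires pandas >= 1.3 and equal list lengths within each row.
res = df.assign(**{c: df[c].str.split(",") for c in cols}).explode(cols)
print(res)
```

The remaining columns (order_id, order_date) are repeated automatically, so there is no need to list them one by one.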

How to split values separated by commas in the same row to different rows in R

You can do:

library(tidyr)
library(dplyr)

dat %>%
  pivot_longer(-c(Q2, names)) %>%
  separate_rows(value) %>%
  group_by(names, name) %>%
  mutate(row = row_number()) %>%
  pivot_wider() %>%
  select(-row)

# A tibble: 8 × 4
# Groups: names [3]
names Q2 Q3 Q4
<chr> <chr> <chr> <chr>
1 PART_1 fruits "bananas" brocolli
2 PART_1 fruits "apples" lettuce
3 PART_1 fruits NA potatoes
4 PART_2 vegetables "bananas" brocolli
5 PART_2 vegetables "oranges" NA
6 PART_3 fruits "" carrots
7 PART_3 fruits NA brocolli
8 PART_3 fruits NA lettuce

Python or pandas split columns by comma and append into rows

A pandas DataFrame has an explode method that does exactly what you want. See the explode() documentation. It works on list-like objects, so if the column you want to explode holds strings, you first need to split each string into a list. See the str.split() documentation. You can also strip any surrounding whitespace with the Series map method.

Full code example:

import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": ["a, b, c, d", "e, f, g", "h, i", "j, k, l, m, n"]
})

# Convert string with commas into list of string and strip spaces
df['y'] = df['y'].str.split(',').map(lambda elements: [e.strip() for e in elements])

# Explode lists in the column 'y' into separate values
df.explode('y')

Output:

   x  y
0  1  a
0  1  b
0  1  c
0  1  d
1  2  e
1  2  f
1  2  g
2  3  h
2  3  i
3  4  j
3  4  k
3  4  l
3  4  m
3  4  n

Excel: Split rows of table in multiple rows based on different cell values

There are multiple ways of doing this: Excel or Power Query.

Excel

This is one way of doing it in Excel. You can try the following in cell E2 (Formula 1):

=LET(input, A2:C3, fAndF, SUBSTITUTE(INDEX(input,,2) & "," & INDEX(input,,3), 
", ",","), names, INDEX(input,,1),
GETCOL, LAMBDA(x, TEXTSPLIT(TEXTJOIN(",",, x),,",",TRUE)),
mapresult, MAP(names, fAndF, LAMBDA(name,f,
LET(num, LEN(f) - LEN(SUBSTITUTE(f, ",","")) + 1, REPT(name&",",num)))),
HSTACK(GETCOL(mapresult), GETCOL(fAndF))
)


Note: For large datasets, Formula 1 is limited by TEXTJOIN: the joined string cannot exceed 32,767 characters (the cell limit), and TEXTJOIN returns #VALUE! beyond that. To circumvent this, you can use the following approach based on the DROP/REDUCE/VSTACK functions, which adds new elements incrementally as we iterate (Formula 2):

=LET(input, A2:C3, fAndF, SUBSTITUTE(INDEX(input,,2) & "," & 
INDEX(input,,3), ", ",","),names, INDEX(input,,1),
DROP(REDUCE("", names, LAMBDA(ac, name, VSTACK(ac, LET(
ff, TEXTSPLIT(XLOOKUP(name, names, fAndF), ","),
DROP(REDUCE("", ff, LAMBDA(acc, f, VSTACK(acc, HSTACK(name,f)))),1)
)))),1)
)

Power Query

With Power Query, there is no need to write M-code by hand; all the required functionality is provided by the UI. Here is the corresponding M-code (Advanced Editor):

let
Source = Excel.CurrentWorkbook(){[Name="TB_INPUT"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Name", type text}, {"Friends", type text},
{"Family", type text}}),
RemoveExtraSpaceAfterComma = Table.ReplaceValue(#"Changed Type",", ",",",Replacer.ReplaceText,
{"Friends", "Family"}),
#"Merge Friend and Family" = Table.CombineColumns(RemoveExtraSpaceAfterComma,{"Friends", "Family"},
Combiner.CombineTextByDelimiter(",", QuoteStyle.None),"F&F"),
#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Merge Friend and Family",
{{"F&F", Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv), let itemType = (type nullable text) meta
[Serialized.Text = true] in type {itemType}}}), "F&F")
in
#"Split Column by Delimiter"


Explanation

Excel

For Formula 1, we use LET for easy reading and composition. The name fAndF represents the row-wise concatenation of the Friends and Family columns. We also remove the space after the comma (,) delimiter.

GETCOL is a user-defined LAMBDA function that converts an array of comma-delimited values, one per row, into a single column.

We use a MAP function to determine how many times we need to repeat the Name column value. The num:

LEN(f) - LEN(SUBSTITUTE(f, ",","")) + 1

is a trick to determine how many rows we need by counting the total number of commas (,).
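The same arithmetic can be checked outside Excel. In this small sketch, count_items is a hypothetical helper name: removing every comma shortens the string by exactly the number of commas, and the number of items is one more than that.

```python
def count_items(f: str, sep: str = ",") -> int:
    """Mirror LEN(f) - LEN(SUBSTITUTE(f, ",", "")) + 1 for a single-character separator."""
    return len(f) - len(f.replace(sep, "")) + 1

# The count agrees with simply splitting on the separator.
for text in ["a", "a,b", "a,b,c,d,e"]:
    assert count_items(text) == len(text.split(","))
```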

The name mapresult generates the following output:

John,John,John,John,John,
Jane,Jane,Jane,Jane,

Now, all we need to do is to join both rows via TEXTJOIN, then split again by comma via TEXTSPLIT and this is what the user function GETCOL does.

Note: The fourth input argument of TEXTSPLIT (ignore_empty) is set to TRUE in GETCOL to avoid generating an empty row from the trailing delimiter.

Finally, we combine the result via HSTACK.
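To make the flow of Formula 1 easier to follow, here is a rough Python sketch of the same steps. It is only an illustration, not a translation that Excel would accept. The friend and family values are hypothetical placeholders (the original sample data is not shown here), picked so that mapresult matches the output above.

```python
# Hypothetical sample rows: (Name, Friends, Family). The values are placeholders.
rows = [
    ("John", "f1, f2, f3", "m1, m2"),
    ("Jane", "f4, f5", "m3, m4"),
]

# fAndF: join Friends and Family per row and drop the space after each comma.
f_and_f = [(friends + "," + family).replace(", ", ",") for _, friends, family in rows]

def get_col(items):
    """Mimic GETCOL: join everything with commas, split again, skip empty pieces."""
    return [piece for piece in ",".join(items).split(",") if piece]

# mapresult: repeat "Name," once per item, e.g. "John,John,John,John,John,".
map_result = [
    (name + ",") * (len(f) - len(f.replace(",", "")) + 1)
    for (name, _, _), f in zip(rows, f_and_f)
]

# HSTACK: pair the name column with the item column, row by row.
result = list(zip(get_col(map_result), get_col(f_and_f)))
for name, item in result:
    print(name, item)
```

Skipping empty pieces in get_col plays the role of ignore_empty = TRUE in TEXTSPLIT.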

For Formula 2, we take a different approach. We use a pattern that avoids the nested array error that TEXTSPLIT and other functions produce. Check the answer to this question: How to split texts from dynamic range? provided by @JvdV. The main idea is to combine the DROP, REDUCE and VSTACK functions to iterate, appending rows on each new iteration. We use this idea twice: once over names, and once over the fAndF items of a given name inside the first REDUCE call.

For each name in the first REDUCE call, we find the corresponding fAndF values via XLOOKUP, then invoke TEXTSPLIT to get the corresponding array (ff). For each element of ff (f), the second REDUCE call adds a 1x2 row with the name and the corresponding f value via HSTACK.

Under this pattern, we need to remove the first row, that contains the initialization value of the accumulator (ac, acc). The pattern is always the same:

DROP(REDUCE("", array, LAMBDA(acc, arr, VSTACK(acc, func))),1)

and func, is where we do the calculation to build the content of the row we want to add. Usually we need to create additional variables and we encapsulate the calculation inside a LET function call.
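The DROP/REDUCE/VSTACK pattern maps fairly directly onto functools.reduce. The sketch below mirrors Formula 2 under the assumption of placeholder data (the names and items are hypothetical); the dummy seed row is kept only to show why DROP is needed, since in Python an empty list would do.

```python
from functools import reduce

# Hypothetical data: names and their already-cleaned, comma-joined items.
names = ["John", "Jane"]
f_and_f = ["f1,f2,f3,m1,m2", "f4,f5,m3,m4"]
lookup = dict(zip(names, f_and_f))  # stands in for XLOOKUP(name, names, fAndF)

def rows_for(name):
    """Inner REDUCE: one (name, item) row per item of this name."""
    ff = lookup[name].split(",")                              # TEXTSPLIT
    stacked = reduce(lambda acc, f: acc + [(name, f)], ff, [""])  # VSTACK onto the seed
    return stacked[1:]                                        # DROP(..., 1) removes the seed

# Outer REDUCE: append each name's rows, then drop the seed row again.
outer = reduce(lambda ac, name: ac + rows_for(name), names, [""])
result = outer[1:]
for row in result:
    print(row)
```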

Power Query

Once the data is defined in an Excel Table, load it into the Power Query editor and apply the following steps.

Remove the extra space after the comma delimiter. Select both the Friends and Family columns and then Home -> Replace Values, searching for a comma followed by a space (", ") and replacing it with a comma alone (",").

Select the Friends and Family columns again, then Transform -> Merge Columns. Name the merged column F&F and specify a comma as the separator.

Now we need to split the column: Home -> Split Column -> By Delimiter. In the Advanced options select Rows, indicating that we want to split into rows.

As @Ron Rosenfeld pointed out in the comments section, another way of doing it is as follows:

After removing the extra space after the comma (step RemoveExtraSpaceAfterComma), select the Friends and Family columns, then right-click and select Unpivot Columns. Select the resulting Value column, then Home -> Split Column -> By Delimiter, and in the Advanced options select Rows. Finally, remove the Attribute column.

Here is the M-code:

let
Source = Excel.CurrentWorkbook(){[Name="TB_INPUT"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Name", type text}, {"Friends", type text}, {"Family", type text}}),
RemoveExtraSpaceAfterComma = Table.ReplaceValue(#"Changed Type",", ",",",Replacer.ReplaceText,{"Friends", "Family"}),
#"Unpivoted Columns" = Table.UnpivotOtherColumns(RemoveExtraSpaceAfterComma, {"Name"}, "Attribute", "Value"),
#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Unpivoted Columns",
{{"Value", Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv), let itemType = (type nullable text) meta
[Serialized.Text = true] in type {itemType}}}), "Value"),
#"Removed Columns" = Table.RemoveColumns(#"Split Column by Delimiter",{"Attribute"})
in
#"Removed Columns"
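For readers who work in pandas rather than Power Query, the unpivot-then-split idea can be sketched as follows. This is an analogue under assumptions, not a port of the M-code: the table contents are hypothetical placeholders, and melt stands in for Unpivot Columns.

```python
import pandas as pd

# Hypothetical stand-in for the TB_INPUT table.
tb_input = pd.DataFrame({
    "Name": ["John", "Jane"],
    "Friends": ["f1, f2, f3", "f4, f5"],
    "Family": ["m1, m2", "m3, m4"],
})

out = (
    tb_input
    # Unpivot: one row per (Name, Attribute, Value), keeping the source row label.
    .melt(id_vars="Name", var_name="Attribute", value_name="Value", ignore_index=False)
    # melt goes column by column; a stable sort restores row-by-row order.
    .sort_index(kind="mergesort")
    # Drop the space after each comma, then split into lists.
    .assign(Value=lambda d: d["Value"].str.replace(", ", ",", regex=False).str.split(","))
    .explode("Value")              # split into rows
    .drop(columns="Attribute")     # same as the "Removed Columns" step
    .reset_index(drop=True)
)
print(out)
```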

duplicating rows by splitting comma separated multiple values in another column pandas

First split the column on commas to get a list and then you can explode that Series of lists. Move 'title' to the index so it gets repeated for each element in 'country'. The last two parts just clean up the names and remove title from the index.

(df.set_index('title')['country']
   .str.split(',')
   .explode()
   .rename('country')
   .reset_index())


    title country
0  title1       a
1  title2       a
2  title2       b
3  title2       c
4  title3       d
5  title3       e
6  title3       f
7  title4       e



Also, your original code is logically fine, but you need to create your object properly. I would recommend importing the module instead of individual classes/methods, so that you create a Series with pd.Series, not Series.

import pandas as pd

desired_df = pd.concat([pd.Series(row['title'], row['country'].split(','))
                        for _, row in original_df.iterrows()]).reset_index()

