Pandas Column of Lists, Create a Row For Each List Element


UPDATE: the solution below was helpful for older Pandas versions, because DataFrame.explode() wasn't available. Starting from Pandas 0.25.0 you can simply use DataFrame.explode().
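For reference, a minimal sketch of the explode() route (the frame below is made up, shaped like the question's data):

```python
import pandas as pd

# hypothetical frame with a list column, shaped like the question's data
df = pd.DataFrame({
    'subject': [1, 1, 2],
    'trial_num': [1, 2, 1],
    'samples': [[0.10, -0.20], [0.25], [1.77, 0.89, 0.65]],
})

# one row per list element; reset_index(drop=True) renumbers the rows
r = df.explode('samples').reset_index(drop=True)
```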



import numpy as np
import pandas as pd

lst_col = 'samples'

r = pd.DataFrame({
    col: np.repeat(df[col].values, df[lst_col].str.len())
    for col in df.columns.drop(lst_col)
}).assign(**{lst_col: np.concatenate(df[lst_col].values)})[df.columns]

Result:

In [103]: r
Out[103]:
    samples  subject  trial_num
0      0.10        1          1
1     -0.20        1          1
2      0.05        1          1
3      0.25        1          2
4      1.32        1          2
5     -0.17        1          2
6      0.64        1          3
7     -0.22        1          3
8     -0.71        1          3
9     -0.03        2          1
10    -0.65        2          1
11     0.76        2          1
12     1.77        2          2
13     0.89        2          2
14     0.65        2          2
15    -0.98        2          3
16     0.65        2          3
17    -0.30        2          3

PS: here you may find a slightly more generic solution


UPDATE: some explanations. IMO the easiest way to understand this code is to execute it step by step:

in the following line we are repeating the values in one column N times, where N is the length of the corresponding list:

In [10]: np.repeat(df['trial_num'].values, df[lst_col].str.len())
Out[10]: array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int64)
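The key detail is that np.repeat also accepts an array of per-element counts, so each scalar is duplicated exactly as many times as its list is long. A tiny standalone illustration:

```python
import numpy as np

# each element is repeated according to its own count: 1 twice, 2 once, 3 three times
out = np.repeat(np.array([1, 2, 3]), [2, 1, 3])
```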

this can be generalized for all columns containing scalar values:

In [11]: pd.DataFrame({
    ...:     col: np.repeat(df[col].values, df[lst_col].str.len())
    ...:     for col in df.columns.drop(lst_col)}
    ...: )
Out[11]:
    trial_num  subject
0           1        1
1           1        1
2           1        1
3           2        1
4           2        1
5           2        1
6           3        1
..        ...      ...
11          1        2
12          2        2
13          2        2
14          2        2
15          3        2
16          3        2
17          3        2

[18 rows x 2 columns]

using np.concatenate() we can flatten all values in the list column (samples) and get a 1D vector:

In [12]: np.concatenate(df[lst_col].values)
Out[12]: array([-1.04, -0.58, -1.32, 0.82, -0.59, -0.34, 0.25, 2.09, 0.12, 0.83, -0.88, 0.68, 0.55, -0.56, 0.65, -0.04, 0.36, -0.31])

putting all this together:

In [13]: pd.DataFrame({
    ...:     col: np.repeat(df[col].values, df[lst_col].str.len())
    ...:     for col in df.columns.drop(lst_col)}
    ...: ).assign(**{lst_col: np.concatenate(df[lst_col].values)})
Out[13]:
    trial_num  subject  samples
0           1        1    -1.04
1           1        1    -0.58
2           1        1    -1.32
3           2        1     0.82
4           2        1    -0.59
5           2        1    -0.34
6           3        1     0.25
..        ...      ...      ...
11          1        2     0.68
12          2        2     0.55
13          2        2    -0.56
14          2        2     0.65
15          3        2    -0.04
16          3        2     0.36
17          3        2    -0.31

[18 rows x 3 columns]

the trailing [df.columns] guarantees that we select the columns of the result in their original order.
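Putting the whole recipe into a self-contained sketch (the two-row frame below is made up for illustration):

```python
import numpy as np
import pandas as pd

# hypothetical frame shaped like the question's data
df = pd.DataFrame({
    'subject': [1, 2],
    'trial_num': [1, 1],
    'samples': [[0.1, -0.2, 0.05], [-0.03, -0.65]],
})
lst_col = 'samples'

# repeat the scalar columns, flatten the list column, restore column order
r = pd.DataFrame({
    col: np.repeat(df[col].values, df[lst_col].str.len())
    for col in df.columns.drop(lst_col)
}).assign(**{lst_col: np.concatenate(df[lst_col].values)})[df.columns]
```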

Compare each element in a list with a column of lists in a dataframe python

In order to have a working example, I assumed that your data were lists of strings:

df = pd.DataFrame({
    'fruits': [['apple', 'orange', 'berry'], ['orange']]
})

ml = ['apple', 'orange', 'banana']

Then I created a function that returns the list of matching elements, or 0 if there's no match:

def matchFruits(row):
    result = []
    for fruit in row['fruits']:
        if fruit in ml:
            result.append(fruit)
    return result if len(result) > 0 else 0

For list comprehension aficionados: result = [fruit for fruit in row['fruits'] if fruit in ml].

Finally, I called this function on the whole DataFrame with axis=1 to add a new column to the initial DataFrame:

df["match"] = df.apply(matchFruits, axis = 1)

The output is the following; it differs from your example, since your result was 0 even though 'orange' was in both lists. If this is not the requested behavior, please edit your question.

                   fruits            match
0  [apple, orange, berry]  [apple, orange]
1                [orange]         [orange]
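The same logic fits in a self-contained sketch using the list-comprehension form, with `or 0` providing the fallback when nothing matches:

```python
import pandas as pd

df = pd.DataFrame({'fruits': [['apple', 'orange', 'berry'], ['orange']]})
ml = ['apple', 'orange', 'banana']

# keep the fruits found in ml; an empty match list falls back to 0
df['match'] = df['fruits'].apply(
    lambda fruits: [f for f in fruits if f in ml] or 0
)
```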

How do I iterate through a df column (where each row is a list), looking for elements in a different list?

A better approach would be to use set intersection (assuming you're trying to count unique matches, i.e. you're not interested in how many times "apple" is mentioned in a review, only that it is mentioned at all).

This should get you what you want, again, assuming you want to count unique matches and assuming your lemmatized column values are indeed lists of strings:

df["lemmatized"].apply(lambda r: len(set(r) & set(menu_items)))
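A self-contained sketch (the review data and menu_items below are made up) that also shows a duplicated word counting only once:

```python
import pandas as pd

# hypothetical lemmatized reviews and menu; 'apple' appears twice in the first review
df = pd.DataFrame({'lemmatized': [['apple', 'pie', 'apple'], ['soup']]})
menu_items = ['apple', 'burger']

# set intersection counts each matching item once, however often it occurs
counts = df['lemmatized'].apply(lambda r: len(set(r) & set(menu_items)))
```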

Pandas column tolist() while each row data being list of strings?

Your problem lies with the saving method: CSV cannot natively store lists, so the values come back as strings unless you specifically parse them after reading.

Would it be possible for you to save time and effort by saving in another format instead? JSON natively supports lists and is also a format that can be easily read by humans.

Here is an obligatory snippet for you:

import pandas as pd
df = pd.DataFrame([{"sentence":['aa', 'bb', 'cc']},{"sentence":['dd', 'ee', 'ff']}])

df.to_json("myfile.json")
df2 = pd.read_json("myfile.json")

Giving the following result:

>>> df2
       sentence
0  [aa, bb, cc]
1  [dd, ee, ff]
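If you do have to stay with CSV, the parsing step mentioned above can be done with the standard library's ast.literal_eval; a sketch round-tripping through an in-memory CSV:

```python
import ast
import io

import pandas as pd

df = pd.DataFrame([{'sentence': ['aa', 'bb']}, {'sentence': ['cc']}])

# round-trip through CSV: the lists come back as strings like "['aa', 'bb']"
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf)

# ast.literal_eval safely turns those strings back into real lists
df2['sentence'] = df2['sentence'].apply(ast.literal_eval)
```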

how to create columns for each element given a column of lists pandas

The fastest way is to initialise the new dataframe with the list of values obtained using Series.tolist():

df1 = pd.DataFrame(df['omega'].tolist())

Example:

df = pd.DataFrame({'omega': [np.arange(12).tolist()] * 5})
print(df)

                                     omega
0  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
1  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
2  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
3  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
4  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

df1 = pd.DataFrame(df['omega'].tolist()).add_prefix('col')

Result:

# print(df1)

   col0  col1  col2  col3  col4  col5  col6  col7  col8  col9  col10  col11
0     0     1     2     3     4     5     6     7     8     9     10     11
1     0     1     2     3     4     5     6     7     8     9     10     11
2     0     1     2     3     4     5     6     7     8     9     10     11
3     0     1     2     3     4     5     6     7     8     9     10     11
4     0     1     2     3     4     5     6     7     8     9     10     11

Create a new column from two columns of a dataframe where rows of each column contains list in string format

IIUC, first do some data cleanup by removing the intra-string single quotes. Then use the yaml library to convert the strings into actual lists in each dataframe cell with applymap. Lastly, apply explode to the dataframe twice, once for each column you want to expand.

import yaml
import pandas as pd

df = pd.read_csv('Downloads/nodes_list.csv', index_col=[0])

df['Opp1'] = df['Opp1'].str.replace("[\'\"]s",'s', regex=True)
df['Opp2'] = df['Opp2'].str.replace("[\'\"]s",'s', regex=True)

df = df.applymap(yaml.safe_load)

df_new = df.explode('Opp1').explode('Opp2').apply(list, axis=1)

df_new

Output:

0                           [KingdomofPoland, Georgia]
0                     [GrandDuchyofLithuania, Georgia]
1                       [NorthernYuanDynasty, Georgia]
2                     [SpanishEmpire, ChechenRepublic]
2           [CaptaincyGeneralofChile, ChechenRepublic]
                            ...
3411                 [SyrianOpposition, SpanishEmpire]
3412                     [UnitedStates, SpanishEmpire]
3412                    [UnitedKingdom, SpanishEmpire]
3412                      [SaudiArabia, SpanishEmpire]
3413                                  [Turkey, Russia]
Length: 31170, dtype: object
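The same pipeline as a minimal sketch on inline data, using the standard library's ast.literal_eval in place of yaml.safe_load (the remaining steps are unchanged):

```python
import ast

import pandas as pd

# inline stand-in for the CSV: each cell holds a list in string form
df = pd.DataFrame({
    'Opp1': ["['KingdomofPoland', 'GrandDuchyofLithuania']"],
    'Opp2': ["['Georgia']"],
})

# parse every cell's string into an actual list
df = df.apply(lambda col: col.map(ast.literal_eval))

# one explode per list column, then collect each row into a plain list
df_new = df.explode('Opp1').explode('Opp2').apply(list, axis=1)
```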

How to add column(s) for each classification that contains values in a list

You can store the conditions/codes in a dictionary, loop over it, and then use isin + any(axis=1) to check whether any codes for each condition appear in each row of the dataframe:

all_codes = {
    'dementia': ['F01', 'F02', 'F03', 'F051', 'G30', 'G311'],
    'solid_tumour': ['C77', 'C78', 'C79', 'C80'],
}

for condition, codes in all_codes.items():
    df[condition + '_yn'] = df.isin(codes).any(axis=1).astype(int)

Output:

>>> df
  patient_num DIAGX1 DIAGX2 DIAGX3 DIAGX4  dementia_yn  solid_tumour_yn
0        pat1    C77    F01    M32   M315            1                1
1        pat2   I099   I278    M05    F01            1                0
2        pat3   N057   N057   N058   N057            0                0
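A runnable version, reconstructing the sample frame from the output above:

```python
import pandas as pd

# sample frame rebuilt from the question's data
df = pd.DataFrame({
    'patient_num': ['pat1', 'pat2', 'pat3'],
    'DIAGX1': ['C77', 'I099', 'N057'],
    'DIAGX2': ['F01', 'I278', 'N057'],
    'DIAGX3': ['M32', 'M05', 'N058'],
    'DIAGX4': ['M315', 'F01', 'N057'],
})

all_codes = {
    'dementia': ['F01', 'F02', 'F03', 'F051', 'G30', 'G311'],
    'solid_tumour': ['C77', 'C78', 'C79', 'C80'],
}

# flag each condition: 1 if any of its codes appears anywhere in the row
for condition, codes in all_codes.items():
    df[condition + '_yn'] = df.isin(codes).any(axis=1).astype(int)
```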

