How to Concatenate Multiple Column Values into a Single Column in Pandas Dataframe

How to concatenate multiple column values into a single column in a Pandas dataframe based on start and end time

Let's do this in a few steps.
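For concreteness, here is a minimal sketch of the kind of input assumed below: a Timestamp column plus one numeric reading per appliance (the column names A, B, C and the values are hypothetical):

import pandas as pd

# Hypothetical input: one row per day, one numeric column per appliance.
df = pd.DataFrame({
    'Timestamp': ['2013-02-01', '2013-02-02', '2013-02-03'],
    'A': [1, 2, 0],
    'B': [0, 3, 0],
    'C': [0, 1, 5],
})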

First, let's make sure your Timestamp is a datetime.

df['Timestamp'] = pd.to_datetime(df['Timestamp'])

Then we can create a new dataframe based on the min and max values of your timestamp.

df1 = pd.DataFrame({'start_time' : pd.date_range(df['Timestamp'].min(), df['Timestamp'].max())})

df1['end_time'] = df1['start_time'] + pd.DateOffset(days=1)

start_time end_time
0 2013-02-01 2013-02-02
1 2013-02-02 2013-02-03
2 2013-02-03 2013-02-04
3 2013-02-04 2013-02-05
4 2013-02-05 2013-02-06
5 2013-02-06 2013-02-07
6 2013-02-07 2013-02-08
7 2013-02-08 2013-02-09

Now we need to create a dataframe that we can merge onto your start_time column.

Let's filter out any values that are 0 or less and create a list of active appliances:

df = df.set_index('Timestamp')
# the remaining columns MUST be numeric for this to work,
# or you'll need to subselect them.
df2 = (df.mask(df.le(0))                        # hide readings that are 0 or less
         .stack()                               # long format; masked NaNs are dropped
         .reset_index(1)                        # appliance names -> column 'level_1'
         .groupby(level=0)
         .agg(active_appliances=('level_1', list))
         .reset_index(0))

# change .agg(active_appliances=('level_1', list))
# to     .agg(active_appliances=('level_1', ','.join))
# if you prefer strings.



Timestamp active_appliances
0 2013-02-01 [A]
1 2013-02-02 [A, B, C]
2 2013-02-03 [A, C]
3 2013-02-04 [A, B, C]
4 2013-02-05 [B, C]
5 2013-02-06 [A, B, C]
6 2013-02-07 [A, B, C]

Then we can merge:

final = pd.merge(df1, df2, left_on='start_time', right_on='Timestamp', how='left').drop(columns='Timestamp')


start_time end_time active_appliances
0 2013-02-01 2013-02-02 [A]
1 2013-02-02 2013-02-03 [A, B, C]
2 2013-02-03 2013-02-04 [A, C]
3 2013-02-04 2013-02-05 [A, B, C]
4 2013-02-05 2013-02-06 [B, C]
5 2013-02-06 2013-02-07 [A, B, C]
6 2013-02-07 2013-02-08 [A, B, C]
7 2013-02-08 2013-02-09 NaN
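If you would rather not carry NaN for days with no active appliances, one optional tweak (not part of the original answer) is to swap it for an empty list:

final['active_appliances'] = final['active_appliances'].apply(
    lambda x: x if isinstance(x, list) else []
)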

Concatenate multiple columns into a list in a single column

With the landing of this PR, we can reshape a Series/Expr into a Series/Expr of type List. These can then be concatenated per row.

import polars as pl

df = pl.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6]
})

df.select([
    pl.concat_list([
        pl.col("a").reshape((-1, 1)),
        pl.col("b").reshape((-1, 1))
    ])
])

Outputs:

shape: (3, 1)
┌────────────┐
│ a │
│ --- │
│ list [i64] │
╞════════════╡
│ [1, 4] │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [2, 5] │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [3, 6] │
└────────────┘

Note that we give the shape (-1, 1), where -1 means infer the dimension size. So this reads as (infer the rows, 1 column).

You can compile polars from source to use this new feature, or wait a few days until it has landed on PyPI.
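For what it's worth, in current polars releases the reshape step is no longer necessary: pl.concat_list accepts plain columns or expressions directly. A quick sketch, assuming a recent polars version:

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Combine the columns element-wise into a single List column.
df.select(pl.concat_list(["a", "b"]).alias("ab"))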

Append multiple columns to single column

Try:

single_column_frame = pd.concat([df[col] for col in df.columns])

If you want to create a single column and get rid of month names:

df_new = df.melt()['value'].to_frame()

Or you can do:

single_column_frame = single_column_frame.reset_index(drop=True)

You can also do:

single_column_frame = df.stack().reset_index().loc[:,0]
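A minimal sketch tying the three variants together, using hypothetical month columns. Note that stack flattens row-by-row, so its output order differs from the other two:

import pandas as pd

df = pd.DataFrame({'Jan': [1, 2], 'Feb': [3, 4]})

by_concat = pd.concat([df[col] for col in df.columns])  # column-by-column, keeps original index
by_melt = df.melt()['value'].to_frame()                 # column-by-column, fresh 0..n-1 index
by_stack = df.stack().reset_index().loc[:, 0]           # row-by-row order instead

print(by_concat.tolist())  # [1, 2, 3, 4]
print(by_stack.tolist())   # [1, 3, 2, 4]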

Combine Multiple Column Values in a Single Column

The idea is to convert column A to the index first, then process each row in a custom function, removing missing values via the NaN != NaN trick:

f = lambda x : ', '.join(f'{k}={int(v)}' for k, v in x.items() if not v != v)
df = df.set_index('A').apply(f, axis=1).reset_index(name='Result')
print (df)
A Result
0 C1 B=2, D=2, E=3, F=2
1 C2 B=4, C=5, E=6, F=1
2 C3 B=9, C=2, D=4
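The trick relies on NaN being the only value that compares unequal to itself, so v != v doubles as a NaN test without any library call:

import numpy as np

v = np.nan
print(v != v)  # True  -> v is NaN
print(5 != 5)  # False -> ordinary values equal themselves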

If you don't want to use the NaN != NaN trick, test for missing values and Nones with notna:

f = lambda x : ', '.join(f'{k}={int(v)}' for k, v in x.items() if pd.notna(v))
df = df.set_index('A').apply(f, axis=1).reset_index(name='Result')
print (df)

A Result
0 C1 B=2, D=2, E=3, F=2
1 C2 B=4, C=5, E=6, F=1
2 C3 B=9, C=2, D=4

EDIT:

Some other solutions, tested on 30k rows:

import numpy as np
import pandas as pd

df = pd.DataFrame([['C1',2,np.nan,2,3,2],['C2',4,5,np.nan,6,1],['C3',9,2,4,np.nan,np.nan]],
                  columns=list('ABCDEF'))
df = pd.concat([df] * 10000, ignore_index=True)

#pure pandas solution, slow
In [247]: %%timeit
...: df1 = df.set_index('A', append=True).stack().astype(int).astype(str).reset_index(level=-1)
...: df1.columns = ['a', 'b']
...: df1 = df1.a + '=' + df1.b
...: df1.groupby(level=[0, 'A']).agg(', '.join).reset_index(level=[1], name='Result')
...:
...:
...:
1.66 s ± 14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [248]: %%timeit
...: f = lambda x : ', '.join(f'{k}={int(v)}' for k, v in x.items() if not v != v)
...: df.set_index('A').apply(f, axis=1).reset_index(name='Result')
...:
344 ms ± 9.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [249]: %%timeit
...: f = lambda x : ', '.join(f'{k}={int(v)}' for k, v in x.items() if pd.notna(v))
...: df.set_index('A').apply(f, axis=1).reset_index(name='Result')
...:
431 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

List comprehension solutions:

In [258]: %%timeit
...: L = [', '.join(f'{k}={int(v)}' for k, v in x.items() if not v != v)
...: for x in df.drop('A', axis=1).to_dict('records')]
...:
...: df[['A']].assign(Result = L)
...:
250 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [259]: %%timeit
...: L = [', '.join(f'{k}={int(v)}' for k, v in x.items() if pd.notna(v))
...: for x in df.drop('A', axis=1).to_dict('records')]
...:
...: df[['A']].assign(Result = L)
...:
336 ms ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Group multiple columns by a single column in pandas and concatenate the rows of each column being grouped

Use a nested list comprehension in GroupBy.agg, with the filtered column names in a list:

files_list = ["Col1", "Col2", "Col3"]
f = lambda x: [z for y in x for z in y]  # flatten each group's lists into one list
df_1 = df_1.groupby('ID', sort=False, as_index=False)[files_list].agg(f)

If performance is not important, or the DataFrame is small, it is possible to use sum to join the lists:

files_list=["Col1", "Col2", "Col3"]
df_1 = df_1.groupby('ID', sort=False, as_index=False)[files_list].agg(sum)
print (df_1)
ID Col1 Col2 Col3
0 S [A, A1] [R, R1] [y, ii1]
1 T [B] [S] []
2 L [B2, C2, D1] [R2, Q2, T1] [m2, i2, p1]
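For reference, a hypothetical input that reproduces the output above (each cell already holds a list):

import pandas as pd

df_1 = pd.DataFrame({
    'ID':   ['S', 'S', 'T', 'L', 'L', 'L'],
    'Col1': [['A'], ['A1'], ['B'], ['B2'], ['C2'], ['D1']],
    'Col2': [['R'], ['R1'], ['S'], ['R2'], ['Q2'], ['T1']],
    'Col3': [['y'], ['ii1'], [], ['m2'], ['i2'], ['p1']],
})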

Merge multiple column values into one column in python pandas

You can call apply, passing axis=1 to operate row-wise, then convert the dtype to str and join:

In [153]:
df['ColumnA'] = df[df.columns[1:]].apply(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1
)
df

Out[153]:
Column1 Column2 Column3 Column4 Column5 ColumnA
0 a 1 2 3 4 1,2,3,4
1 a 3 4 5 NaN 3,4,5
2 b 6 7 8 NaN 6,7,8
3 c 7 7 NaN NaN 7,7

Here I call dropna to get rid of the NaN values. However, columns that contained NaN are upcast to float, so you may want to cast back to int before converting to str so you don't end up with values like '3.0'.
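A variant with the int cast applied, assuming every surviving value is a whole number:

df['ColumnA'] = df[df.columns[1:]].apply(
    lambda x: ','.join(x.dropna().astype(int).astype(str)),
    axis=1
)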

How to combine multiple dataframe columns into one given each column has nan values

With stack:

df["XYZ"] = df.stack().values

to get

>>> df

X Y Z XYZ
0 NaN NaN ZVal1 ZVal1
1 NaN NaN ZVal2 ZVal2
2 XVal1 NaN NaN XVal1
3 NaN YVal1 NaN YVal1

since you guarantee only 1 non-NaN per row and stack drops NaNs by default.
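For reference, a hypothetical input matching the output above, with exactly one non-NaN value per row:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'X': [np.nan, np.nan, 'XVal1', np.nan],
    'Y': [np.nan, np.nan, np.nan, 'YVal1'],
    'Z': ['ZVal1', 'ZVal2', np.nan, np.nan],
})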


Another way with fancy indexing:

df["XYZ"] = df.to_numpy()[np.arange(len(df)),
df.columns.get_indexer(df.notna().idxmax(axis=1))]

which, for each row, looks at the index of the non-NaN value and selects it.
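Broken into steps, a sketch of what the fancy indexing does (the intermediate names are only for illustration):

labels = df.notna().idxmax(axis=1)     # column label of the first non-NaN in each row
cols = df.columns.get_indexer(labels)  # labels -> integer column positions
rows = np.arange(len(df))              # 0, 1, ..., n-1
df["XYZ"] = df.to_numpy()[rows, cols]  # pick one element per (row, col) pair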

How to efficiently combine multiple pandas columns into one array-like column?

Using numpy on large data is much faster than the rest.

Update: numpy with a list comprehension is much faster, taking only 0.77 s:

pdf['combined'] = [x for x in pdf[['a', 'b', 'c']].to_numpy()]
# pdf['combined'] = pdf[['a', 'b', 'c']].to_numpy().tolist()

Comparison of speed:

import pandas as pd
import sys
import time

def f1():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000, "b": [4, 5, 6]*1000000, "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf.assign(combined=pdf.agg(list, axis=1))
    print(time.time() - s0)

def f2():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000, "b": [4, 5, 6]*1000000, "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf['combined'] = [x for x in pdf[['a', 'b', 'c']].to_numpy()]
    # pdf['combined'] = pdf[['a', 'b', 'c']].to_numpy().tolist()
    print(time.time() - s0)

def f3():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000, "b": [4, 5, 6]*1000000, "c": [7, 8, 9]*1000000})
    s0 = time.time()
    cols = ['a', 'b', 'c']
    pdf['combined'] = pdf[cols].apply(lambda row: list(row.values), axis=1)
    print(time.time() - s0)

def f4():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000, "b": [4, 5, 6]*1000000, "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf["combined"] = pdf.apply(pd.Series.tolist, axis=1)
    print(time.time() - s0)

if __name__ == '__main__':
    eval(f'{sys.argv[1]}()')

➜ python test.py f1
17.766116857528687
➜ python test.py f2
0.7762737274169922
➜ python test.py f3
14.403311252593994
➜ python test.py f4
12.631694078445435

