How to concatenate multiple column values into a single column in a Pandas dataframe based on start and end time
Let's do this in a few steps.
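For illustration, assume input along these lines (a hypothetical sample reconstructed from the outputs shown below; values less than or equal to 0 mean the appliance was inactive):
import pandas as pd

df = pd.DataFrame({
    'Timestamp': ['2013-02-01', '2013-02-02', '2013-02-03', '2013-02-04',
                  '2013-02-05', '2013-02-06', '2013-02-07'],
    'A': [1, 2, 3, 4, 0, 6, 7],  # appliance readings per day
    'B': [0, 5, 0, 2, 3, 1, 4],
    'C': [0, 8, 9, 1, 2, 3, 5],
})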
First, let's make sure your Timestamp is a datetime:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
Then we can create a new dataframe based on the min and max values of your timestamp.
df1 = pd.DataFrame({'start_time' : pd.date_range(df['Timestamp'].min(), df['Timestamp'].max())})
df1['end_time'] = df1['start_time'] + pd.DateOffset(days=1)
start_time end_time
0 2013-02-01 2013-02-02
1 2013-02-02 2013-02-03
2 2013-02-03 2013-02-04
3 2013-02-04 2013-02-05
4 2013-02-05 2013-02-06
5 2013-02-06 2013-02-07
6 2013-02-07 2013-02-08
7 2013-02-08 2013-02-09
Now we need to create a dataframe to merge onto your start_time column.
Let's filter out any values that are less than or equal to 0 and create a list of active appliances:
df = df.set_index('Timestamp')
# the remaining columns MUST be integers for this to work.
# or you'll need to subselect them.
df2 = df.mask(df.le(0)).stack().reset_index(1).groupby(level=0)\
.agg(active_appliances=('level_1',list)).reset_index(0)
# change .agg(active_appliances=('level_1', list))
# to .agg(active_appliances=('level_1', ','.join))
# if you prefer strings.
Timestamp active_appliances
0 2013-02-01 [A]
1 2013-02-02 [A, B, C]
2 2013-02-03 [A, C]
3 2013-02-04 [A, B, C]
4 2013-02-05 [B, C]
5 2013-02-06 [A, B, C]
6 2013-02-07 [A, B, C]
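If you prefer the string variant, the same pipeline with ','.join looks like this (a sketch of the substitution described in the comments above):
df2 = df.mask(df.le(0)).stack().reset_index(1).groupby(level=0)\
    .agg(active_appliances=('level_1', ','.join)).reset_index(0)
# active_appliances then holds e.g. 'A,B,C' instead of ['A', 'B', 'C']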
Then we can merge:
final = pd.merge(df1, df2, left_on='start_time', right_on='Timestamp', how='left').drop(columns='Timestamp')
start_time end_time active_appliances
0 2013-02-01 2013-02-02 [A]
1 2013-02-02 2013-02-03 [A, B, C]
2 2013-02-03 2013-02-04 [A, C]
3 2013-02-04 2013-02-05 [A, B, C]
4 2013-02-05 2013-02-06 [B, C]
5 2013-02-06 2013-02-07 [A, B, C]
6 2013-02-07 2013-02-08 [A, B, C]
7 2013-02-08 2013-02-09 NaN
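If you would rather have empty lists than NaN for the days with no readings, one option (a sketch, not part of the original answer):
final['active_appliances'] = final['active_appliances'].apply(
    lambda x: x if isinstance(x, list) else [])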
Concatenate multiple columns into a list in a single column
With the landing of this PR, we can reshape a Series/Expr into a Series/Expr of type List. These can then be concatenated per row.
import polars as pl

df = pl.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6]
})
df.select([
    pl.concat_list([
        pl.col("a").reshape((-1, 1)),
        pl.col("b").reshape((-1, 1))
    ])
])
Outputs:
shape: (3, 1)
┌────────────┐
│ a │
│ --- │
│ list [i64] │
╞════════════╡
│ [1, 4] │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [2, 5] │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [3, 6] │
└────────────┘
Note that we give the shape (-1, 1), where -1 means "infer the dimension size". So this reads as (infer the rows, 1 column).
You can compile polars from source to use this new feature, or wait a few days until it has landed on PyPI.
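On current polars releases the reshape step is no longer needed, since concat_list accepts column expressions directly. A minimal sketch, assuming a recent polars version:
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
# Concatenate the columns row-wise into a single List column.
df.select(pl.concat_list(pl.col("a"), pl.col("b")).alias("a"))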
Append multiple columns to single column
Try:
single_column_frame = pd.concat([df[col] for col in df.columns])
If you want to create a single column and get rid of month names:
df_new = df.melt()['value'].to_frame()
Or you can do:
single_column_frame = single_column_frame.reset_index().drop(columns=['index'])
You can also do:
single_column_frame = df.stack().reset_index().loc[:,0]
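A minimal demonstration of the first option, using hypothetical month-named columns (the names are assumed for illustration):
import pandas as pd

df = pd.DataFrame({'Jan': [1, 2], 'Feb': [3, 4], 'Mar': [5, 6]})
single_column_frame = pd.concat([df[col] for col in df.columns])
print(single_column_frame)
# 0    1
# 1    2
# 0    3
# 1    4
# 0    5
# 1    6
# dtype: int64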
Combine Multiple Column Values in a Single Column
The idea is to convert column A to the index first, then process the data in a custom function, removing missing values with the NaN != NaN trick:
f = lambda x : ', '.join(f'{k}={int(v)}' for k, v in x.items() if not v != v)
df = df.set_index('A').apply(f, axis=1).reset_index(name='Result')
print (df)
A Result
0 C1 B=2, D=2, E=3, F=2
1 C2 B=4, C=5, E=6, F=1
2 C3 B=9, C=2, D=4
If you don't want to use the NaN != NaN trick, test for missing values and Nones with notna:
f = lambda x : ', '.join(f'{k}={int(v)}' for k, v in x.items() if pd.notna(v))
df = df.set_index('A').apply(f, axis=1).reset_index(name='Result')
print (df)
A Result
0 C1 B=2, D=2, E=3, F=2
1 C2 B=4, C=5, E=6, F=1
2 C3 B=9, C=2, D=4
EDIT:
Some alternative solutions, tested on 30k rows:
import numpy as np

df = pd.DataFrame([['C1',2,np.nan,2,3,2],['C2',4,5,np.nan,6,1],['C3',9,2,4,np.nan,np.nan]], columns=list('ABCDEF'))
df = pd.concat([df] * 10000, ignore_index=True)
#pure pandas solution, slow
In [247]: %%timeit
...: df1 = df.set_index('A', append=True).stack().astype(int).astype(str).reset_index(level=-1)
...: df1.columns = ['a', 'b']
...: df1 = df1.a + '=' + df1.b
...: df1.groupby(level=[0, 'A']).agg(', '.join).reset_index(level=[1], name='Result')
...:
...:
...:
1.66 s ± 14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [248]: %%timeit
...: f = lambda x : ', '.join(f'{k}={int(v)}' for k, v in x.items() if not v != v)
...: df.set_index('A').apply(f, axis=1).reset_index(name='Result')
...:
344 ms ± 9.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [249]: %%timeit
...: f = lambda x : ', '.join(f'{k}={int(v)}' for k, v in x.items() if pd.notna(v))
...: df.set_index('A').apply(f, axis=1).reset_index(name='Result')
...:
431 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
List comprehension solutions:
In [258]: %%timeit
...: L = [', '.join(f'{k}={int(v)}' for k, v in x.items() if not v != v)
...: for x in df.drop('A', axis=1).to_dict('records')]
...:
...: df[['A']].assign(Result = L)
...:
250 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [259]: %%timeit
...: L = [', '.join(f'{k}={int(v)}' for k, v in x.items() if pd.notna(v))
...: for x in df.drop('A', axis=1).to_dict('records')]
...:
...: df[['A']].assign(Result = L)
...:
336 ms ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Group multiple columns by a single column in pandas and concatenate the rows of each column being grouped
Use a nested list comprehension in GroupBy.agg with the filtered column names in a list:
files_list=["Col1", "Col2", "Col3"]
f = lambda x: [z for y in x for z in y]
df_1 = df_1.groupby('ID', sort=False, as_index=False)[files_list].agg(f)
If performance is not important or the DataFrame is small, it is possible to use sum to join the lists:
files_list=["Col1", "Col2", "Col3"]
df_1 = df_1.groupby('ID', sort=False, as_index=False)[files_list].agg(sum)
print (df_1)
ID Col1 Col2 Col3
0 S [A, A1] [R, R1] [y, ii1]
1 T [B] [S] []
2 L [B2, C2, D1] [R2, Q2, T1] [m2, i2, p1]
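For reference, input like the following reproduces the result above (the exact values are an assumption reconstructed from the output; each cell already holds a possibly empty list):
import pandas as pd

df_1 = pd.DataFrame({
    'ID':   ['S', 'S', 'T', 'L', 'L', 'L'],
    'Col1': [['A'], ['A1'], ['B'], ['B2'], ['C2'], ['D1']],
    'Col2': [['R'], ['R1'], ['S'], ['R2'], ['Q2'], ['T1']],
    'Col3': [['y'], ['ii1'], [], ['m2'], ['i2'], ['p1']],
})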
Merge multiple column values into one column in python pandas
You can call apply with axis=1 to apply it row-wise, then convert the dtype to str and join:
In [153]:
df['ColumnA'] = df[df.columns[1:]].apply(
lambda x: ','.join(x.dropna().astype(str)),
axis=1
)
df
Out[153]:
Column1 Column2 Column3 Column4 Column5 ColumnA
0 a 1 2 3 4 1,2,3,4
1 a 3 4 5 NaN 3,4,5
2 b 6 7 8 NaN 6,7,8
3 c 7 7 NaN NaN 7,7
Here I call dropna to get rid of the NaN values; however, since columns containing NaN are floats, we may need to cast back to int so we don't end up with floats as str.
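A sketch of that cast, assuming the numeric columns became floats because of the NaNs:
df['ColumnA'] = df[df.columns[1:]].apply(
    lambda x: ','.join(x.dropna().astype(int).astype(str)),
    axis=1
)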
How to combine multiple dataframe columns into one given each column has nan values
With stack:
df["XYZ"] = df.stack().values
to get
>>> df
X Y Z XYZ
0 NaN NaN ZVal1 ZVal1
1 NaN NaN ZVal2 ZVal2
2 XVal1 NaN NaN XVal1
3 NaN YVal1 NaN YVal1
since you guarantee only 1 non-NaN per row and stack drops NaNs by default.
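If you want to verify that guarantee before stacking, a quick check (a sketch):
assert df.notna().sum(axis=1).eq(1).all(), 'expected exactly one non-NaN per row'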
Another way with fancy indexing:
df["XYZ"] = df.to_numpy()[np.arange(len(df)),
df.columns.get_indexer(df.notna().idxmax(axis=1))]
which, for each row, looks at the index of the non-NaN value and selects it.
How to efficiently combine multiple pandas columns into one array-like column?
Using numpy on large data is much faster than the rest.
Update: numpy with a list comprehension is the fastest, taking only 0.77s.
pdf['combined'] = [x for x in pdf[['a', 'b', 'c']].to_numpy()]
# pdf['combined'] = pdf[['a', 'b', 'c']].to_numpy().tolist()
Comparison of speed:
import pandas as pd
import sys
import time

def f1():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000, "b": [4, 5, 6]*1000000, "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf.assign(combined=pdf.agg(list, axis=1))
    print(time.time() - s0)

def f2():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000, "b": [4, 5, 6]*1000000, "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf['combined'] = [x for x in pdf[['a', 'b', 'c']].to_numpy()]
    # pdf['combined'] = pdf[['a', 'b', 'c']].to_numpy().tolist()
    print(time.time() - s0)

def f3():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000, "b": [4, 5, 6]*1000000, "c": [7, 8, 9]*1000000})
    s0 = time.time()
    cols = ['a', 'b', 'c']
    pdf['combined'] = pdf[cols].apply(lambda row: list(row.values), axis=1)
    print(time.time() - s0)

def f4():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000, "b": [4, 5, 6]*1000000, "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf["combined"] = pdf.apply(pd.Series.tolist, axis=1)
    print(time.time() - s0)

if __name__ == '__main__':
    eval(f'{sys.argv[1]}()')
➜ python test.py f1
17.766116857528687
➜ python test.py f2
0.7762737274169922
➜ python test.py f3
14.403311252593994
➜ python test.py f4
12.631694078445435