Using condition to split pandas column of lists into multiple columns.
Use apply with a custom function:
def my_split(row):
    return pd.Series({
        '1st_half_T1': [i for i in row.Time1 if i <= 96],
        '2nd_half_T1': [i for i in row.Time1 if i > 96],
        '1st_half_T2': [i for i in row.Time2 if i <= 96],
        '2nd_half_T2': [i for i in row.Time2 if i > 96]
    })

df.apply(my_split, axis=1)
Out[]:
1st_half_T1 1st_half_T2 2nd_half_T1 2nd_half_T2
0 [93] [16, 48, 66] [109, 187] [128]
1 [] [] [159] [123, 136]
2 [94, 96] [40] [154, 169] [177, 192]
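For a self-contained run, here is a minimal sketch. The input frame is an assumption reconstructed from the output above (the original input is not shown), and split_at is a hypothetical helper that takes the threshold as a parameter instead of hard-coding 96.

```python
import pandas as pd

# Assumed input, reconstructed from the output shown above
df = pd.DataFrame({
    "Time1": [[93, 109, 187], [159], [94, 96, 154, 169]],
    "Time2": [[16, 48, 66, 128], [123, 136], [40, 177, 192]],
})

def split_at(row, threshold=96):
    """Hypothetical helper: split each list in the row at `threshold`."""
    out = {}
    for col, short in (("Time1", "T1"), ("Time2", "T2")):
        out[f"1st_half_{short}"] = [i for i in row[col] if i <= threshold]
        out[f"2nd_half_{short}"] = [i for i in row[col] if i > threshold]
    return pd.Series(out)

result = df.apply(split_at, axis=1)
print(result)
```

Extra keyword arguments can be forwarded through apply, e.g. df.apply(split_at, axis=1, threshold=100).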
Split a column into multiple columns with condition
Assuming the "Value" column contains strings, you can use str.split and pivot like so:
value = df["Value"].str.split(",").explode().astype(int).reset_index()
output = value.pivot(index="index", columns="Value", values="Value")
output = output.reindex(range(value["Value"].min(), value["Value"].max()+1), axis=1)
>>> output
Value 1 2 3 4 5 6 7 8 9
index
0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 1.0 NaN 3.0 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN 4.0 NaN 6.0 NaN 8.0 NaN
3 1.0 NaN 3.0 NaN NaN NaN NaN NaN NaN
4 NaN 2.0 NaN NaN NaN NaN 7.0 NaN 9.0
Input df:
df = pd.DataFrame({"Value": ["1", "1,3", "4,6,8", "1,3", "2,7,9"]})
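If a 0/1 indicator per value is enough (rather than repeating the value itself in each cell), Series.str.get_dummies is a possible alternative to the pivot. This is a swapped-in technique, not the method above; a rough sketch on the same input, where the column range 1 to 9 is assumed from the sample data:

```python
import pandas as pd

df = pd.DataFrame({"Value": ["1", "1,3", "4,6,8", "1,3", "2,7,9"]})

# One indicator column per distinct token in the comma-separated strings
dummies = df["Value"].str.get_dummies(sep=",")

# Column labels come back as strings; convert them so they sort numerically
dummies.columns = dummies.columns.astype(int)
dummies = dummies.sort_index(axis=1)

# Add the values that never occur (here: 5) as all-zero columns
full = dummies.reindex(range(1, 10), axis=1, fill_value=0)
print(full)
```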
pandas - split column with arrays into multiple columns and count values
from collections import Counter
import numpy as np
import pandas as pd

np.random.seed(2022)
df = pd.DataFrame(data=[[[np.random.randint(1,7) for _ in range(10)] for _ in range(5)]],
                  index=["col1"])
df = df.transpose()
You can use Series.explode with SeriesGroupBy.value_counts and reshape by Series.unstack:
df1 = (df['col1'].explode()
                 .groupby(level=0)
                 .value_counts()
                 .unstack(fill_value=0)
                 .add_prefix('col')
                 .rename_axis(None, axis=1))
print(df1)
col1 col2 col3 col4 col5 col6
0 4 2 1 0 1 2
1 3 2 0 4 0 1
2 3 1 3 2 0 1
3 1 1 3 0 1 4
4 1 1 1 1 3 3
Or use a list comprehension with Counter and the DataFrame constructor:
df1 = (pd.DataFrame([Counter(x) for x in df['col1']])
         .sort_index(axis=1)
         .fillna(0)
         .astype(int)
         .add_prefix('col'))
print(df1)
col1 col2 col3 col4 col5 col6
0 4 2 1 0 1 2
1 3 2 0 4 0 1
2 3 1 3 2 0 1
3 1 1 3 0 1 4
4 1 1 1 1 3 3
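To check that both approaches agree without depending on the random seed, here is a small sketch on a fixed, made-up frame (the data is an assumption for illustration; the extra astype(int) after explode is my addition so the exploded values are integers rather than objects):

```python
from collections import Counter

import pandas as pd

# Made-up data: every row is a list of die rolls between 1 and 6
df = pd.DataFrame({"col1": [[1, 1, 2, 6], [3, 3, 3, 5], [2, 4, 4, 6]]})

# Approach 1: explode, count per original row, reshape to wide
via_explode = (df["col1"].explode()
                         .astype(int)
                         .groupby(level=0)
                         .value_counts()
                         .unstack(fill_value=0)
                         .add_prefix("col")
                         .rename_axis(None, axis=1))

# Approach 2: one Counter per row, fed to the DataFrame constructor
via_counter = (pd.DataFrame([Counter(x) for x in df["col1"]])
                 .sort_index(axis=1)
                 .fillna(0)
                 .astype(int)
                 .add_prefix("col"))

print(via_explode)
print(via_counter)
```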
Split lists into multiple columns in a pandas DataFrame
Instead of appending to the DataFrame on every iteration, you could collect the rows in a list and build the DataFrame once after the loop:
d_list = []
for index, row in df.iterrows():
    for value in str(row['hobbies']).split(';'):
        d_list.append({'name': row['name'],
                       'value': value})
# DataFrame.append was removed in pandas 2.0, so build the frame from the list in one step
df3 = pd.DataFrame(d_list, columns=['name', 'value'])
df3 = df3.groupby('name')['value'].value_counts()
df3 = df3.unstack(level=-1).fillna(0)
df3
I checked how long this takes on your example DataFrame. With the improvement I suggest, it's ~50 times faster.
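The Python-level loop can probably be avoided altogether. The following is only a sketch of a vectorized variant using str.split, explode and pd.crosstab, which is a different technique from the loop above; the sample frame and its 'name' / 'hobbies' contents are assumptions, and I have not timed it against the original data.

```python
import pandas as pd

# Assumed sample data: one ';'-separated string of hobbies per row
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann"],
    "hobbies": ["chess;golf", "golf", "chess"],
})

# One row per (name, hobby) pair
long_df = (df.assign(value=df["hobbies"].astype(str).str.split(";"))
             .explode("value")
             .reset_index(drop=True))

# Count how often each hobby occurs for each name
counts = pd.crosstab(long_df["name"], long_df["value"])
print(counts)
```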
Split a column containing a list into multiple rows in Pandas based on a condition
After using literal_eval to evaluate the strings in column status as Python lists, you can use the following.
For pandas version >= 0.25 you can use explode:
# Explode dataframe
df_out = df.explode('status').reset_index(drop=True)
# fill the NaN with empty lists
df_out['status'] = df_out['status'].dropna().reindex(df_out.index, fill_value=[])
For pandas version < 0.25, where explode is not available, you can replicate the explode-like behaviour using index.repeat, then flatten out the nested lists using chain:
from itertools import chain
l = df['status'].str.len()
m = l > 0
df_out = df.reindex(df[m].index.repeat(l[m]))
df_out['status'] = list(chain(*df.loc[m, 'status']))
df_out = df_out.append(df[~m]).sort_index().reset_index(drop=True)
>>> df_out
date status
0 2021-02-06 08:18:01.212763 [London, New York, BUSY]
1 2021-02-06 08:17:01.018633 [Mumbai, Tokyo, IDLE]
2 2021-02-06 08:16:01.182888 [Amsterdam, Chicago, IDLE]
3 2021-02-06 08:16:01.182888 [Amsterdam, London, IDLE]
4 2021-02-06 08:16:01.182888 [Amsterdam, Berlin, BUSY]
5 2021-02-06 08:15:01.245619 [Tokyo, Moscow, IDLE]
6 2021-02-06 07:18:01.413066 [Mumbai, Los Angeles, IDLE]
7 2021-02-06 07:18:01.413066 [Mumbai, Berlin, IDLE]
8 2021-02-06 07:17:01.154138 []
9 2021-02-06 07:16:01.253111 []
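The literal_eval step mentioned at the start is not shown, so here is a minimal end-to-end sketch for pandas >= 0.25. The two-row frame and its shortened date values are assumptions for illustration, and the isinstance-based fill is just one possible way to turn the NaN left by empty lists back into [].

```python
import ast

import pandas as pd

# Assumed input: 'status' holds the string representation of a list of lists
df = pd.DataFrame({
    "date": ["2021-02-06 08:18:01", "2021-02-06 07:17:01"],
    "status": [
        "[['London', 'New York', 'BUSY'], ['Mumbai', 'Tokyo', 'IDLE']]",
        "[]",
    ],
})

# Safely parse the strings into real Python lists
df["status"] = df["status"].apply(ast.literal_eval)

# One row per inner list; an empty outer list yields a single NaN row
df_out = df.explode("status").reset_index(drop=True)

# Put an empty list back where explode left NaN
df_out["status"] = df_out["status"].apply(
    lambda v: v if isinstance(v, list) else []
)
print(df_out)
```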
Pandas: Split and/or update columns, based on inconsistent data?
Use:
#part of cities with space
cities = ['York','Angeles']
#test rows
m = df['Team'].str.contains('|'.join(cities))
#first split by first space to 2 new columns
df[['City','Franchise']] = df['Team'].str.split(n=1, expand=True)
#split by second space only filtered rows
s = df.loc[m, 'Team'].str.split(n=2)
#update values
df.update(pd.concat([s.str[:2].str.join(' '), s.str[2]], axis=1, ignore_index=True).set_axis(['City','Franchise'], axis=1))
print (df)
Team City Franchise
0 New York Giants New York Giants
1 Atlanta Braves Atlanta Braves
2 Chicago Cubs Chicago Cubs
3 Chicago White Sox Chicago White Sox
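If the full names of the multi-word cities are known up front, a single str.extract with named groups might be a simpler route. This is a sketch of an alternative, not the method above; the multiword_cities list is an assumption and has to be maintained by hand.

```python
import re

import pandas as pd

df = pd.DataFrame({"Team": ["New York Giants", "Atlanta Braves",
                            "Chicago Cubs", "Chicago White Sox"]})

# Assumed list of cities whose names contain a space
multiword_cities = ["New York", "Los Angeles"]

# Try the known multi-word cities first, then fall back to the first word
city_alternatives = "|".join(re.escape(c) for c in multiword_cities)
pattern = rf"^(?P<City>{city_alternatives}|\S+)\s+(?P<Franchise>.+)$"

df[["City", "Franchise"]] = df["Team"].str.extract(pattern)
print(df)
```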
How to split a list-column in a Pandas Dataframe based on a condition on the elements of the list?
df_out = pd.DataFrame(
    df["combined_list"]
    .apply(lambda x: list(zip(*[s.split("|") for s in x])))
    .tolist(),
    columns=["countries", "abbreviations"],
)
print(df_out)
Prints:
countries abbreviations
0 (Netherlands, Germany, United_States, Poland) (NL, DE, US, PL)
1 (Netherlands, Austria, Belgium) (NL, AU, BE)
2 (United_States, Germany) (US, DE)
To have lists in the columns:
df_out = pd.DataFrame(
    df["combined_list"]
    .apply(lambda x: list(map(list, zip(*[s.split("|") for s in x]))))
    .tolist(),
    columns=["countries", "abbreviations"],
)
print(df_out)
Prints:
countries abbreviations
0 [Netherlands, Germany, United_States, Poland] [NL, DE, US, PL]
1 [Netherlands, Austria, Belgium] [NL, AU, BE]
2 [United_States, Germany] [US, DE]
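The zip(*...) idiom transposes a list of [country, abbreviation] pairs into two sequences. The sketch below spells that step out with a hypothetical helper, split_pairs, that also returns two empty lists for an empty input (a case that zip(*[]) alone would not give two columns for). The sample frame is an assumption based on the printed output.

```python
import pandas as pd

# Assumed input, modelled on the printed output above
df = pd.DataFrame({"combined_list": [
    ["Netherlands|NL", "Germany|DE"],
    ["Austria|AU"],
    [],
]})

def split_pairs(items, sep="|"):
    """Hypothetical helper: ['A|a', 'B|b'] -> (['A', 'B'], ['a', 'b'])."""
    pairs = [s.split(sep) for s in items]
    if not pairs:
        # zip(*[]) yields nothing, so handle the empty list explicitly
        return [], []
    left, right = zip(*pairs)  # transpose rows of pairs into two tuples
    return list(left), list(right)

df_out = pd.DataFrame(
    df["combined_list"].apply(split_pairs).tolist(),
    columns=["countries", "abbreviations"],
)
print(df_out)
```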
How to split multiline text in a dataframe column into multiple columns, using start and end words as a pattern to capture the text in between
Even though it's not entirely clear what you want to achieve, I think you're looking for extract.
Make some test data
import pandas as pd
import re # for the re.DOTALL flag
data = {"index_id": ["ROW1"],
        "raw_text": ["STARTWORD: MULTILINE TEXT TO COL1 ENDWORD MULTILINE\nTEXT TO COL2 ENDWORD2 MULTILINE TEXT TO COL3 ENDWORD3"]}
df = pd.DataFrame(data).set_index("index_id")
df looks like this:
raw_text
index_id
ROW1 STARTWORD: MULTILINE TEXT TO COL1 ENDWORD MULT...
Extract the columns
The following matches everything in between the split words as long as the order in the list matches the order of their occurrence in the raw string.
(You need the re.DOTALL flag so the dot . matches newlines, too.)
split_words = ["STARTWORD:", "ENDWORD", "ENDWORD2", "ENDWORD3"]
new_df = df.raw_text.str.extract("(.+)".join(split_words), flags=re.DOTALL)
Result
0 1 2
index_id
ROW1 MULTILINE TEXT TO COL1 MULTILINE\nTEXT TO COL2 MULTILINE TEXT TO COL3
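The captured groups above keep the surrounding spaces. A possible refinement, sketched here under the assumption that leading and trailing whitespace is not wanted, is to escape the split words with re.escape and join them with a lazy group wrapped in \s*. The column names col1 to col3 are my own labels.

```python
import re

import pandas as pd

raw = ("STARTWORD: MULTILINE TEXT TO COL1 ENDWORD MULTILINE\n"
       "TEXT TO COL2 ENDWORD2 MULTILINE TEXT TO COL3 ENDWORD3")
df = pd.DataFrame({"raw_text": [raw]},
                  index=pd.Index(["ROW1"], name="index_id"))

split_words = ["STARTWORD:", "ENDWORD", "ENDWORD2", "ENDWORD3"]

# Lazy capture between consecutive split words, whitespace kept outside the group
pattern = r"\s*(.+?)\s*".join(re.escape(w) for w in split_words)

new_df = df["raw_text"].str.extract(pattern, flags=re.DOTALL)
new_df.columns = ["col1", "col2", "col3"]  # assumed column names
print(new_df)
```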
Python: how to split a column into multiple columns in a dataframe with dynamic column naming
We can reconstruct your data with tolist and pd.DataFrame. Then concat everything together again:
d = [pd.DataFrame(df[col].tolist()).add_prefix(col) for col in df.columns]
df = pd.concat(d, axis=1)
id0 id1 id2 value0 value1 value2
0 10 10 NaN apple orange None
1 15 67 NaN banana orange None
2 12 34 45.0 apple banana orange
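As a runnable sketch, with an input frame that is an assumption reconstructed from the output above: passing index=df.index to the constructor is my addition, so the pieces stay aligned if the original frame does not have a default RangeIndex.

```python
import pandas as pd

# Assumed input, reconstructed from the output above: ragged lists per cell
df = pd.DataFrame({
    "id": [[10, 10], [15, 67], [12, 34, 45]],
    "value": [["apple", "orange"], ["banana", "orange"],
              ["apple", "banana", "orange"]],
})

# Expand each list column into its own frame; shorter lists are padded
# with missing values, and add_prefix turns 0, 1, 2 into id0, id1, id2
parts = [
    pd.DataFrame(df[col].tolist(), index=df.index).add_prefix(col)
    for col in df.columns
]
wide = pd.concat(parts, axis=1)
print(wide)
```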
How can I split Pandas dataframe column with strings according to multiple conditions
Use Series.str.split with a lookahead regex:
s = df['Col.A'].str.split(r'\s+(?=\b(?:dark|\d+)\b)', n=1, expand=True)
df[['ID']].join(s.set_axis(['Col.A', 'Col.B'], axis=1))
ID Col.A Col.B
0 28654 This is a dark chocolate which is sweet
1 39876 Sky is blue 1234 Sky is cloudy 3423
2 88776 Stars can be seen in the dark sky
3 35491 Schools are closed 4568 but shops are open
Regex details:
\s+ : Matches any whitespace character one or more times
(?=\b(?:dark|\d+)\b) : Positive lookahead
    \b : Word boundary to prevent partial matches
    (?:dark|\d+) : Non-capturing group
        dark : First alternative, matches the characters dark literally
        \d+ : Second alternative, matches any digit one or more times
    \b : Word boundary to prevent partial matches
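To try the pattern end to end, here is a small sketch. The input frame is an assumption reconstructed from the output above by re-joining the two result columns.

```python
import pandas as pd

# Assumed input, rebuilt from the printed result
df = pd.DataFrame({
    "ID": [28654, 39876, 88776, 35491],
    "Col.A": [
        "This is a dark chocolate which is sweet",
        "Sky is blue 1234 Sky is cloudy 3423",
        "Stars can be seen in the dark sky",
        "Schools are closed 4568 but shops are open",
    ],
})

# Split once, at the whitespace that precedes the word 'dark' or a number
pattern = r"\s+(?=\b(?:dark|\d+)\b)"
s = df["Col.A"].str.split(pattern, n=1, expand=True)

out = df[["ID"]].join(s.set_axis(["Col.A", "Col.B"], axis=1))
print(out)
```

Because n=1 limits the split to the first match, the second number in "Sky is blue 1234 Sky is cloudy 3423" stays inside Col.B.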