How to split a dataframe string column into two columns?
There might be a better way, but here's one approach:
row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
# split each string on the first space only; pass n by keyword
df = pd.DataFrame(df.row.str.split(' ', n=1).tolist(),
                  columns=['fips', 'row'])
fips row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
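As a self-contained sketch of the same idea, the split can also be done with expand=True, which returns the pieces as columns directly. The three sample rows are copied from the data above; the name `out` is only illustrative.

```python
import pandas as pd

# Sample rows copied from the example above
df = pd.DataFrame({"row": ["00000 UNITED STATES",
                           "01000 ALABAMA",
                           "01001 Autauga County, AL"]})

# n=1 splits on the first space only, so the place name stays intact;
# expand=True returns a DataFrame with one column per piece
out = df["row"].str.split(" ", n=1, expand=True)
out.columns = ["fips", "row"]
print(out)
```

Passing n by keyword keeps the call valid on both older and newer pandas releases.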
Split one column into multiple columns using delimiter
Use str.split to split:
df[['date', 'date2', 'date3']] = df['date'].replace('NULL', np.nan).str.split('+', expand=True)
and count to count the non-missing values in each row:
df['number of dates'] = df[['date', 'date2', 'date3']].count(axis=1)
print(df)
ID date date2 date3 number of dates
0 3009 2016 2017 None 2
1 129 2015 None None 1
2 119 2014 2019 2020 3
3 120 2020 None None 1
4 121 NaN NaN NaN 0
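The two steps above can be sketched end to end as follows. The input frame is an assumption reconstructed from the printed result (the original input is not shown), and `parts` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical input, reconstructed from the printed output above
df = pd.DataFrame({
    "ID": [3009, 129, 119, 120, 121],
    "date": ["2016+2017", "2015", "2014+2019+2020", "2020", "NULL"],
})

# Turn the 'NULL' placeholder into a real missing value, then split.
# A single-character pattern such as '+' is treated as a literal string.
parts = df["date"].replace("NULL", np.nan).str.split("+", expand=True)
parts.columns = ["date", "date2", "date3"]

# count(axis=1) counts the non-missing cells in each row
parts["number of dates"] = parts[["date", "date2", "date3"]].count(axis=1)
print(parts)
```

Working on a separate `parts` frame avoids overwriting the original date column until the result has been checked.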
How to split a column into multiple (non equal) columns in R
We could use cSplit from splitstackshape:
library(splitstackshape)
cSplit(DF, "Col1",",")
Output:
Col1_1 Col1_2 Col1_3 Col1_4
1: a b c <NA>
2: a b <NA> <NA>
3: a b c d
How to split a Pandas DataFrame column into multiple columns if the column is a string of varying length?
You can try using str.rsplit, which, per the documentation, "splits string around given separator/delimiter, starting from the right":
df['Col_1'].str.rsplit(' ', n=2, expand=True)
Output:
0 1 2
0 Hello X Y
1 Hello world Q R
2 Hi S T
As a full dataframe:
df['Col_1'].str.rsplit(' ', n=2, expand=True).add_prefix('nCol_').join(df)
Output:
nCol_0 nCol_1 nCol_2 Col_1 Col_2
0 Hello X Y Hello X Y A
1 Hello world Q R Hello world Q R B
2 Hi S T Hi S T C
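For reference, here is a runnable sketch of the right-hand split. The frame is rebuilt from the output shown above, and `parts` and `out` are illustrative names.

```python
import pandas as pd

# Sample data rebuilt from the output shown above
df = pd.DataFrame({
    "Col_1": ["Hello X Y", "Hello world Q R", "Hi S T"],
    "Col_2": ["A", "B", "C"],
})

# Split at most twice, counting from the right, so any extra words
# at the start of the string stay together in the first column
parts = df["Col_1"].str.rsplit(" ", n=2, expand=True).add_prefix("nCol_")

# Join on the shared index to put the new columns next to the originals
out = parts.join(df)
print(out)
```

Because the split starts from the right, this only works when the last two tokens are always single words.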
Python split one column into multiple columns and reattach the split columns into original dataframe
The original data has a unique index, and the code below does not change it in either DataFrame, so you can use concat to stack the two pieces back together and then add the result to the original with DataFrame.join or with concat and axis=1:
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
# select the first two columns before renaming them, to avoid a length-mismatch error
nonUSA.columns = ['Country', 'State']
df = pd.concat([df, pd.concat([USA, nonUSA])], axis=1)
Or:
df = df.join(pd.concat([USA, nonUSA]))
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NaN NaN
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 NaN NaN
3 Charlotte None
4 NaN NaN
5 NaN NaN
6 None None
But it seems it is possible to simplify:
c = ['Country', 'State', 'County', 'City']
df[c] = df['Residence'].str.split(';',expand=True)
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NA None
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 None None
3 Charlotte None
4 None None
5 None None
6 None None
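The simplified assignment relies on at least one row splitting into exactly four parts. A slightly more defensive sketch, assuming a small sample taken from the data above, reindexes the split result to a fixed number of columns first; `cols`, `parts` and `out` are illustrative names.

```python
import pandas as pd

# A few rows taken from the example above
df = pd.DataFrame({
    "ID": [1, 3, 7],
    "Residence": ["USA;CA;Los Angeles;Los Angeles", "Canada;ON", "USA;AZ"],
})

cols = ["Country", "State", "County", "City"]
parts = df["Residence"].str.split(";", expand=True)

# Force exactly len(cols) columns, so the renaming below also works
# when no row happens to contain all four parts
parts = parts.reindex(columns=list(range(len(cols))))
parts.columns = cols

out = df.join(parts)
print(out)
```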
Split column to multiple columns by another column value (complicated separator)
option 1
Splitting on spaces is an option if the last two columns each contain a single word. Use rsplit:
df['column1'].str.rsplit(n=2, expand=True)
output:
0 1 2
0 abc 33 aaa 9g98f
1 cde aaa 95fwf
2 12 faf bbb 92gcs
3 faf bbb 7t87f
NB. this doesn't work with the updated example
option 2
Alternatively, to split on the provided delimiter:
df[['new_column1', 'new_column2']] = [a.split(f' {b} ') for a,b in
zip(df['column1'], df['column2'])]
output:
column1 column2 new_column1 new_column2
0 abc 33 aaa 9g98f 333 aaa abc 33 9g98f 333
1 cde aaa 95fwf aaa cde 95fwf
2 12 faf bbb 92gcs bbb 12 faf 92gcs
3 faf bbb 7t87f bbb faf 7t87f
option 3
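Finally, if the same delimiters repeat many times across many rows, it might be worth using vectorized splitting per group:
(df
 # group_keys=False keeps the original row index in the result
 .groupby('column2', group_keys=False)
 # raw f-string so that \s is passed to the regex engine unchanged
 .apply(lambda g: g['column1'].str.split(rf'\s*{g.name}\s*', expand=True))
)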
output:
0 1
0 abc 33 9g98f 333
1 cde 95fwf
2 12 faf 92gcs
3 faf 7t87f
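As a runnable sketch of option 2, the per-row split can be written with a plain list comprehension. The sample rows are rebuilt from the outputs above, with the first delimiter assumed to be 'aaa' so that it actually occurs in the string. `pairs` and `out` are illustrative names.

```python
import pandas as pd

# Sample rows rebuilt from the outputs above (delimiter values assumed)
df = pd.DataFrame({
    "column1": ["abc 33 aaa 9g98f 333", "cde aaa 95fwf", "12 faf bbb 92gcs"],
    "column2": ["aaa", "aaa", "bbb"],
})

# Split each string on its own row's delimiter, padded with spaces;
# maxsplit=1 guarantees at most two pieces per row
pairs = [a.split(f" {b} ", 1) for a, b in zip(df["column1"], df["column2"])]

# Build the new columns on the same index, then attach them
out = df.join(pd.DataFrame(pairs,
                           columns=["new_column1", "new_column2"],
                           index=df.index))
print(out)
```

Plain str.split treats the delimiter literally, so delimiters that contain regex metacharacters need no escaping here.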
How to split a column in multiple columns using data.table
Use tstrsplit with keep = 1:3 to keep only the first three columns:
dt[, c("bin", "position", "ID") := tstrsplit(name, "_", fixed = TRUE, keep = 1:3)]
name bin position ID
1: bin1_position1_ID1 bin1 position1 ID1
2: bin2_position2_ID2 bin2 position2 ID2
3: bin3_position3_ID3 bin3 position3 ID3
4: bin4_position4_ID4 bin4 position4 ID4
5: bin5_position5_ID5_another5_more5 bin5 position5 ID5
Split list in a column to multiple columns
You could map ast.literal_eval to the items in df2["1"], build a DataFrame from the result, and join it to df1:
import ast
out = df1.join(pd.DataFrame(map(ast.literal_eval, df2["1"].tolist())).add_prefix('feature_'))
Output:
Text Topic feature_0 feature_1 feature_2
0 Where is the party tonight? Party -0.011571 -0.010117 0.062448
1 Let's dance Party -0.082682 -0.001614 0.020942
2 Hello world Other -0.063768 -0.015903 0.020942
3 It is rainy today Weather 0.063796 -0.028781 0.056791
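A self-contained sketch of this approach, assuming that df2["1"] holds string representations of Python lists (the shortened sample values below are made up for illustration, and `features` is an illustrative name):

```python
import ast

import pandas as pd

# Hypothetical inputs: df2["1"] stores lists serialized as strings
df1 = pd.DataFrame({"Text": ["Let's dance", "Hello world"],
                    "Topic": ["Party", "Other"]})
df2 = pd.DataFrame({"1": ["[-0.08, -0.001, 0.02]",
                          "[-0.06, -0.015, 0.02]"]})

# literal_eval safely parses each string into a real Python list;
# each list then becomes one row of the new DataFrame
features = pd.DataFrame([ast.literal_eval(s) for s in df2["1"]],
                        index=df2.index).add_prefix("feature_")

out = df1.join(features)
print(out)
```

ast.literal_eval only evaluates Python literals, which makes it a safer choice than eval for data read from a file.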