How to One Hot Encode in Python

Applying One hot encoding on a particular column of a dataset but result was not as expected

You're not saving the output.

out = encoder.transform(...).todense()

data['employed'] = out

It may take some wrangling to get the datasets to go together. I have found pd.concat(numerical_in, categorical_encoded_in, axis=1) is needed in the past but you might simply find it works once you save the dense output.

how to make one hot encoding to column in data frame in python

For your specific case, if you reshape the underlying array (along with setting sparse=False) it will give you your one-hot encoded array:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'EDUCATION':['high school','high school','high school',
'university','university','university',
'graduate school', 'graduate school','graduate school',
'others','others','others']})

onehot_encoder = OneHotEncoder(sparse=False)
onehot_encoder.fit_transform(df['EDUCATION'].to_numpy().reshape(-1,1))

>>>

array([[0., 1., 0., 0.],
[0., 1., 0., 0.],
[0., 1., 0., 0.],
[0., 0., 0., 1.],
[0., 0., 0., 1.],
[0., 0., 0., 1.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[0., 0., 1., 0.],
[0., 0., 1., 0.],
[0., 0., 1., 0.]])

The most straightforward approach in my opinion is using pandas.get_dummies:

pd.get_dummies(df['EDUCATION'])

Sample Image

One-hot encoding in Python for array values in a DataFrame

IIUC, and if target contains lists, you could do:

(df.drop('trace',1)
.join(df['trace']
.apply('|'.join)
.str.get_dummies()
)
)

or for in place modification of df:

df = (df.join(df.pop('trace')
.apply('|'.join)
.str.get_dummies())
)

Or using explode and pivot_table:

(df.explode('trace')
.assign(x=1)
.pivot_table(index=['ID', 'length'], columns='trace', values='x', aggfunc='first')
.fillna(0, downcast='infer')
.reset_index()
)

Output:

   ID  length  A  B  C  D  E
0 3 4 1 1 1 0 0
1 4 5 1 1 1 1 0
2 5 6 1 1 1 1 1
3 24 4 1 1 1 0 0
4 25 5 1 1 1 1 0

Sklearn One Hot Encoding produces non-tabular output

The column transformer has opted to transform into a scipy sparse matrix because the one-hot encoder does and it has sufficiently many columns compared to the passthrough.

Many ML models will accept sparse input, and this will be much more memory-efficient.

Otherwise, you can force dense arrays throughout by specifying sparse_threshold=0.0 in the ColumnTransformer, or sparse=False in the OneHotEncoder. Or you can cast the sparse output to dense after transforming; you cannot do that with the np.array(...) you've tried, but using .todense() instead will work (see https://stackoverflow.com/a/55639087/10495893).

One hot encoding with duplicate columns

Given you are OHE the features and using them as input for LightGBM, it won't hurt to rename the conflicting values, or slightly modify them to avoid any issues. Therefore I would suggest to just proceed with:

import pandas as pd 
#One Hot Encoding of the Categorical features
one_hot_city_name=pd.get_dummies(data.city_name.rename({'Limburg':'Limburg (city)'})
one_hot_state_of_the_building=pd.get_dummies(data.state_of_the_building)
one_hot_province=pd.get_dummies(data.province)
one_hot_region=pd.get_dummies(data.region)


Related Topics



Leave a reply



Submit