Applying One hot encoding on a particular column of a dataset but result was not as expected
You're not saving the output.
out = encoder.transform(...).todense()
data['employed'] = out
It may take some wrangling to get the datasets to go together. I have found pd.concat(numerical_in, categorical_encoded_in, axis=1)
is needed in the past but you might simply find it works once you save the dense output.
how to make one hot encoding to column in data frame in python
For your specific case, if you reshape the underlying array (along with setting sparse=False
) it will give you your one-hot encoded array:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'EDUCATION':['high school','high school','high school',
'university','university','university',
'graduate school', 'graduate school','graduate school',
'others','others','others']})
onehot_encoder = OneHotEncoder(sparse=False)
onehot_encoder.fit_transform(df['EDUCATION'].to_numpy().reshape(-1,1))
>>>
array([[0., 1., 0., 0.],
[0., 1., 0., 0.],
[0., 1., 0., 0.],
[0., 0., 0., 1.],
[0., 0., 0., 1.],
[0., 0., 0., 1.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[0., 0., 1., 0.],
[0., 0., 1., 0.],
[0., 0., 1., 0.]])
The most straightforward approach in my opinion is using pandas.get_dummies
:
pd.get_dummies(df['EDUCATION'])
One-hot encoding in Python for array values in a DataFrame
IIUC, and if target contains lists, you could do:
(df.drop('trace',1)
.join(df['trace']
.apply('|'.join)
.str.get_dummies()
)
)
or for in place modification of df
:
df = (df.join(df.pop('trace')
.apply('|'.join)
.str.get_dummies())
)
Or using explode
and pivot_table
:
(df.explode('trace')
.assign(x=1)
.pivot_table(index=['ID', 'length'], columns='trace', values='x', aggfunc='first')
.fillna(0, downcast='infer')
.reset_index()
)
Output:
ID length A B C D E
0 3 4 1 1 1 0 0
1 4 5 1 1 1 1 0
2 5 6 1 1 1 1 1
3 24 4 1 1 1 0 0
4 25 5 1 1 1 1 0
Sklearn One Hot Encoding produces non-tabular output
The column transformer has opted to transform into a scipy sparse matrix because the one-hot encoder does and it has sufficiently many columns compared to the passthrough.
Many ML models will accept sparse input, and this will be much more memory-efficient.
Otherwise, you can force dense arrays throughout by specifying sparse_threshold=0.0
in the ColumnTransformer
, or sparse=False
in the OneHotEncoder
. Or you can cast the sparse output to dense after transforming; you cannot do that with the np.array(...)
you've tried, but using .todense()
instead will work (see https://stackoverflow.com/a/55639087/10495893).
One hot encoding with duplicate columns
Given you are OHE the features and using them as input for LightGBM, it won't hurt to rename the conflicting values, or slightly modify them to avoid any issues. Therefore I would suggest to just proceed with:
import pandas as pd
#One Hot Encoding of the Categorical features
one_hot_city_name=pd.get_dummies(data.city_name.rename({'Limburg':'Limburg (city)'})
one_hot_state_of_the_building=pd.get_dummies(data.state_of_the_building)
one_hot_province=pd.get_dummies(data.province)
one_hot_region=pd.get_dummies(data.region)
Related Topics
How to Filter Only Printable Characters in a File on Bash (Linux) or Python
Multi Platform Portable Python
Running a Bash Script from Python
Executable Python Program with All Dependencies for Linux
Checking Running Python Script Within the Python Script
I Have a Problem with Sending Mail:Typeerror: _Init_() Got an Unexpected Keyword Argument 'Context'
Use Df Command to Show Only the %Used
How to Run Celery Workers by Superuser
Change Parent Shell's Environment from a Subprocess
Creating a Symbolic in Shared Volume of Docker and Accessing It in Host MAChine
Capturing Repeating Subpatterns in Python Regex
Class Method Differences in Python: Bound, Unbound and Static
How to Make a Timezone Aware Datetime Object