Label Encoding Across Multiple Columns in Scikit-Learn

Label encoding across multiple columns with the same attributes in scikit-learn

pandas method

You could create a dictionary of {country: value} pairs and map the dataframe to that:

country_map = {country: i for i, country in enumerate(df.stack().unique())}

df['Origin'] = df['Origin'].map(country_map)
df['Destination'] = df['Destination'].map(country_map)

>>> df
   Origin  Destination
0       0            1
1       0            2
2       1            0
3       1            2
4       1            3
5       3            0
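For anyone who wants to run this end to end, here is a minimal sketch. The sample DataFrame is an assumption, reconstructed so that it is consistent with the outputs shown in this answer; `encoded`, `reverse_map` and `decoded` are names made up for the illustration. It applies the one shared mapping to every column in a single pass and then reverses it.

```python
import pandas as pd

# Assumed sample data (not given in the original post), chosen to be
# consistent with the encoded output shown in the answer.
df = pd.DataFrame(
    {
        "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
        "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
    },
    dtype=object,
)

# One shared mapping built from every value in the frame,
# numbered in order of first appearance (row by row).
country_map = {country: i for i, country in enumerate(df.stack().unique())}

# Apply the same mapping to every column instead of naming each one.
encoded = df.apply(lambda col: col.map(country_map))

# Invert the dictionary to recover the original labels.
reverse_map = {i: country for country, i in country_map.items()}
decoded = encoded.apply(lambda col: col.map(reverse_map))

print(encoded)
```

Because the codes follow the order of first appearance, they depend on how the rows happen to be ordered.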

sklearn method

Since you tagged sklearn, you could use LabelEncoder():

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(df.stack().unique())

df['Origin'] = le.transform(df['Origin'])
df['Destination'] = le.transform(df['Destination'])

>>> df
   Origin  Destination
0       0            3
1       0            2
2       3            0
3       3            2
4       3            1
5       1            0

To get the original labels back:

>>> le.inverse_transform(df['Origin'])
# array(['China', 'China', 'USA', 'USA', 'USA', 'Russia'], dtype=object)
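The codes here differ from the pandas version because LabelEncoder stores its classes in sorted order rather than in order of appearance. A small self-contained sketch of the round trip follows; the sample DataFrame is again an assumption reconstructed from the outputs above, and `encoded` / `decoded` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed sample data (not given in the original post).
df = pd.DataFrame(
    {
        "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
        "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
    },
    dtype=object,
)

# Fit once on the values of all columns so they share one mapping.
le = LabelEncoder()
le.fit(df.stack().unique())

# classes_ is sorted; the position of a label is its integer code.
print(list(le.classes_))

# Encode every column with the shared encoder, then decode again.
encoded = df.apply(le.transform)
decoded = encoded.apply(le.inverse_transform)
```

Fitting once and only calling `transform` afterwards is what keeps the codes consistent between columns; calling `fit_transform` per column would give each column its own numbering.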

How to Label encode multiple non-contiguous dataframe columns

Yes, that's correct.

LabelEncoder was designed primarily for labels (the target), not features, so it accepts only a single column at a time.

Up to the current version of scikit-learn (0.19.2), what you are doing is the correct way to encode multiple columns. See this question, which takes the same approach:

  • Label encoding across multiple columns in scikit-learn

From the next version (0.20) onwards, OrdinalEncoder can be used to encode all categorical feature columns at once.
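As a rough sketch of what that looks like: by default OrdinalEncoder learns a separate set of categories for each column, so if the columns should share one set of codes, the same category list can be passed for every column through the `categories` parameter. The sample data and the variable names below are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed sample data (not part of the original answer).
df = pd.DataFrame(
    {
        "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
        "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
    },
    dtype=object,
)

# Shared, sorted vocabulary taken from all columns.
categories = sorted(df.stack().unique())

# Pass the same category list once per column so the codes line up.
enc = OrdinalEncoder(categories=[categories] * df.shape[1])

# fit_transform returns floats by default; cast to int for readability.
codes = enc.fit_transform(df).astype(int)
encoded = pd.DataFrame(codes, columns=df.columns, index=df.index)

# inverse_transform maps the codes back to the original labels.
decoded = enc.inverse_transform(codes)
```

Leaving out `categories` would still work, but then a code such as 1 could mean a different country in each column.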

Label encoding several columns in a DataFrame, but only those that need it

Try this:

# Select the numerical and the categorical columns
num_cols = X_train.select_dtypes(exclude="object").columns.tolist()
cat_cols = X_train.select_dtypes(include="object").columns.tolist()

# You can also pass a list of dtypes:
cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()

After that, you can build a pipeline like this:

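from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numerical data preprocessing pipeline: fill gaps with the median, then scale
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# categorical data preprocessing pipeline: fill gaps with "NA", then one-hot encode
# (`sparse_output` replaced `sparse` in scikit-learn 1.2; use sparse=False on older versions)
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    OneHotEncoder(handle_unknown="ignore", sparse_output=False),
)

# full pipeline: each sub-pipeline only sees its own columns
full_pipe = ColumnTransformer(
    [("num", num_pipe, num_cols), ("cat", cat_pipe, cat_cols)]
)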
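Note that the pipeline above one-hot encodes the categorical columns. If integer codes are wanted instead, as the question title suggests, one option is to swap in OrdinalEncoder and let the numeric columns pass through untouched. This is only a sketch: the toy `X_train`, the `Weight` column and the `new_row` frame are hypothetical, and `handle_unknown="use_encoded_value"` needs scikit-learn 0.24 or later.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame: two string columns and one numeric column.
X_train = pd.DataFrame(
    {
        "Origin": pd.Series(["China", "USA", "Russia"], dtype=object),
        "Destination": pd.Series(["USA", "Turkey", "China"], dtype=object),
        "Weight": [1.5, 2.0, 0.5],
    }
)

# Only the columns that need encoding are selected.
cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()

encoder = ColumnTransformer(
    [
        (
            "cat",
            # Categories not seen during fit are encoded as -1 instead of raising.
            OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
            cat_cols,
        )
    ],
    remainder="passthrough",  # numeric columns are appended unchanged
)

# Output columns: encoded cat_cols first, then the passthrough columns.
encoded = encoder.fit_transform(X_train)

# A row with a country that was not present during fit.
new_row = pd.DataFrame({"Origin": ["Brazil"], "Destination": ["USA"], "Weight": [1.0]})
encoded_new = encoder.transform(new_row)
```

Here each column gets its own mapping, so the same integer can stand for different countries in Origin and Destination. That is usually fine for independent features; when the columns must share codes, fit one encoder on the combined values instead.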

