Label encoding across multiple columns with same attributes in scikit-learn
pandas method
You could create a dictionary of {country: value} pairs and map the dataframe to that:
country_map = {country:i for i, country in enumerate(df.stack().unique())}
df['Origin'] = df['Origin'].map(country_map)
df['Destination'] = df['Destination'].map(country_map)
>>> df
Origin Destination
0 0 1
1 0 2
2 1 0
3 1 2
4 1 3
5 3 0
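As a rough end-to-end sketch of the mapping approach, the snippet below builds a small sample frame and encodes both columns with one shared dictionary. The sample data is an assumption: the original DataFrame is not shown, and the fourth country ("UK") is a placeholder chosen so that the codes line up with the output above. reverse_map is a helper name introduced here for illustration.

```python
import pandas as pd

# Hypothetical sample data; "UK" is a placeholder country.
df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "UK", "China", "UK", "Russia", "China"],
})

# One shared mapping built from every value in both columns,
# numbered in order of first appearance (reading row by row).
country_map = {country: i for i, country in enumerate(df.stack().unique())}

# Apply the same mapping to each column.
encoded = df.apply(lambda col: col.map(country_map))

# Invert the dictionary to get the original labels back.
reverse_map = {code: country for country, code in country_map.items()}
decoded = encoded.apply(lambda col: col.map(reverse_map))

print(country_map)
print(encoded)
```

Note that map() yields NaN for any value missing from the dictionary, so the mapping should be built from all columns that will be encoded.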
sklearn method
Since you tagged sklearn, you could use LabelEncoder():
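from sklearn.preprocessing import LabelEncoder
# fit a single encoder on the unique values of both columns
le = LabelEncoder()
le.fit(df.stack().unique())
# reuse the fitted encoder so both columns share the same codes
df['Origin'] = le.transform(df['Origin'])
df['Destination'] = le.transform(df['Destination'])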
>>> df
Origin Destination
0 0 3
1 0 2
2 3 0
3 3 2
4 3 1
5 1 0
To get the original labels back:
>>> le.inverse_transform(df['Origin'])
# array(['China', 'China', 'USA', 'USA', 'USA', 'Russia'], dtype=object)
How to Label encode multiple non-contiguous dataframe columns
Yes, that's correct.
LabelEncoder was primarily made to deal with labels, not features, so it accepts only a single column at a time.
Up to the current version of scikit-learn (0.19.2), what you are doing is the correct way to encode multiple columns. See this question, which takes the same approach:
- Label encoding across multiple columns in scikit-learn
From the next version (0.20) onwards, OrdinalEncoder can be used to encode all categorical feature columns at once.
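A minimal sketch of the OrdinalEncoder route, assuming scikit-learn 0.20 or later. By default OrdinalEncoder learns the categories of each column separately, so the same value can receive different codes in different columns. Passing one shared category list for every column, as below, is one way to keep the codes consistent. The sample data and the names cols and shared are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample data; "UK" is a placeholder country.
df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "UK", "China", "UK", "Russia", "China"],
})
cols = ["Origin", "Destination"]

# Sorted list of every distinct value found in either column.
shared = np.unique(df[cols].to_numpy().ravel()).tolist()

# Give every column the same category list so the codes are shared.
enc = OrdinalEncoder(categories=[shared] * len(cols))
codes = enc.fit_transform(df[cols])  # 2D float array, one column per feature

encoded = pd.DataFrame(codes.astype(int), columns=cols, index=df.index)
print(encoded)

# Recover the original labels from the codes.
restored = enc.inverse_transform(encoded[cols])
```

Unlike LabelEncoder, this encodes all listed columns in a single fit_transform call, and the fitted encoder can be reused on new data with transform.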
Label encoding several columns in DataFrame but only those who need it
Try this -
# To select numerical and categorical columns
num_cols = X_train.select_dtypes(exclude="object").columns.tolist()
cat_cols = X_train.select_dtypes(include="object").columns.tolist()
# you can also pass a list like -
cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()
After that you can make a pipeline like this -
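from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# numerical data preprocessing pipeline
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# categorical data preprocessing pipeline
# (scikit-learn >= 1.2 names the dense-output flag sparse_output; older versions use sparse=False)
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    OneHotEncoder(handle_unknown="ignore", sparse_output=False),
)
# full pipeline
full_pipe = ColumnTransformer(
    [("num", num_pipe, num_cols), ("cat", cat_pipe, cat_cols)]
)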
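The question asks for label encoding, while the pipeline above one-hot encodes the categorical columns. If integer codes are wanted instead, one possible variation is to put an OrdinalEncoder in a ColumnTransformer and pass the remaining columns through unchanged. This is a sketch under stated assumptions, not the answer's own method: X_train below is made-up data with hypothetical column names, and the non-numeric columns are picked with select_dtypes(exclude="number") rather than include="object".

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training data with mixed column types.
X_train = pd.DataFrame({
    "age": [25, 32, 47],
    "city": ["Paris", "Oslo", "Paris"],
    "income": [40000.0, 52000.0, 61000.0],
    "plan": ["basic", "pro", "basic"],
})

# Only the non-numeric columns need encoding.
cat_cols = X_train.select_dtypes(exclude="number").columns.tolist()
other_cols = [c for c in X_train.columns if c not in cat_cols]

# Encode the categorical columns and pass everything else through untouched.
ct = ColumnTransformer(
    [("cat", OrdinalEncoder(), cat_cols)],
    remainder="passthrough",
)
out = ct.fit_transform(X_train)

# The output puts the encoded columns first, then the passthrough columns.
result = pd.DataFrame(out, columns=cat_cols + other_cols, index=X_train.index)
print(result)
```

Each categorical column gets its own code table here, which is usually what is wanted for unrelated features. The numeric columns keep their values but move after the encoded ones in the output.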