Pandas - Convert a categorical column to binary encoded form
I think you need get_dummies:

df = pd.get_dummies(df['month'])

If you need to add the new columns to the original DataFrame and remove month, use join with pop:

df2 = df.join(pd.get_dummies(df.pop('month')))
print (df2.head())
yyyy tmax tmin April August December February January July June \
0 1908 5.0 -1.4 0 0 0 0 1 0 0
1 1908 7.3 1.9 0 0 0 1 0 0 0
2 1908 6.2 0.3 0 0 0 0 0 0 0
3 1908 7.4 2.1 1 0 0 0 0 0 0
4 1908 16.5 7.7 0 0 0 0 0 0 0
March May November October September
0 0 0 0 0 0
1 0 0 0 0 0
2 1 0 0 0 0
3 0 0 0 0 0
4 0 1 0 0 0
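As a side note, the join/pop step can also be done in a single call by passing the whole DataFrame with columns=; a minimal sketch with made-up data (note the dummies get a 'month_' prefix this way, unlike the bare month names above):

```python
import pandas as pd

# Passing the frame with columns= creates the dummy columns and drops the
# original 'month' column in one call, equivalent to the join/pop above.
df = pd.DataFrame({"yyyy": [1908, 1908], "month": ["January", "February"], "tmax": [5.0, 7.3]})
df2 = pd.get_dummies(df, columns=["month"])
print(df2.columns.tolist())  # ['yyyy', 'tmax', 'month_February', 'month_January']
```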
If you do NOT need to remove the month column:
df2 = df.join(pd.get_dummies(df['month']))
print (df2.head())
yyyy month tmax tmin April August December February January \
0 1908 January 5.0 -1.4 0 0 0 0 1
1 1908 February 7.3 1.9 0 0 0 1 0
2 1908 March 6.2 0.3 0 0 0 0 0
3 1908 April 7.4 2.1 1 0 0 0 0
4 1908 May 16.5 7.7 0 0 0 0 0
July June March May November October September
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0
If you need the columns in calendar order, use reindex (the older reindex_axis method did the same job but was deprecated in pandas 0.21 and has since been removed):

months = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']
df1 = pd.get_dummies(df['month']).reindex(columns=months)
print (df1.head())
January February March April May June July August September \
0 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0
3 0 0 0 1 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0
October November December
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
Or convert the month column to an ordered categorical (note: the astype('category', categories=..., ordered=True) signature was removed in newer pandas; use CategoricalDtype instead):

df1 = pd.get_dummies(df['month'].astype(pd.CategoricalDtype(categories=months, ordered=True)))
print (df1.head())
January February March April May June July August September \
0 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0
3 0 0 0 1 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0
October November December
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
Convert various dummy/logical variables into a single categorical variable/factor from their name in R
Try:
library(dplyr)
library(tidyr)
df %>% gather(type, value, -id) %>% na.omit() %>% select(-value) %>% arrange(id)
Which gives:
# id type
#1 1 conditionA
#2 2 conditionB
#3 3 conditionC
#4 4 conditionD
#5 5 conditionA
Update
To handle the case you detailed in the comments, you could do the operation on the desired portion of the data frame and then left_join() the other columns:
df %>%
select(starts_with("condition"), id) %>%
gather(type, value, -id) %>%
na.omit() %>%
select(-value) %>%
left_join(., df %>% select(-starts_with("condition"))) %>%
arrange(id)
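For reference, the same collapse of dummy columns into a single variable can be sketched in pandas (the column names here are assumptions mirroring the R example):

```python
import pandas as pd

# Toy frame mirroring the R example: one 0/1 column per condition.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "conditionA": [1, 0, 0],
    "conditionB": [0, 1, 0],
    "conditionC": [0, 0, 1],
})
# idxmax(axis=1) returns the name of the column holding the 1 in each row.
df["type"] = df.filter(like="condition").idxmax(axis=1)
print(df[["id", "type"]].to_string(index=False))
```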
Convert categorical data in pandas dataframe
First, to convert a Categorical column to its numerical codes, there is an easier way: dataframe['c'].cat.codes.

Further, it is possible to select all columns of a certain dtype in a DataFrame automatically, using select_dtypes. This way, you can apply the above operation on multiple, automatically selected columns.
First making an example dataframe:
In [75]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'), 'col3':list('ababb')})
In [76]: df['col2'] = df['col2'].astype('category')
In [77]: df['col3'] = df['col3'].astype('category')
In [78]: df.dtypes
Out[78]:
col1 int64
col2 category
col3 category
dtype: object
Then by using select_dtypes
to select the columns, and then applying .cat.codes
on each of these columns, you can get the following result:
In [80]: cat_columns = df.select_dtypes(['category']).columns
In [81]: cat_columns
Out[81]: Index([u'col2', u'col3'], dtype='object')
In [83]: df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
In [84]: df
Out[84]:
col1 col2 col3
0 1 0 0
1 2 1 1
2 3 2 0
3 4 0 1
4 5 1 1
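One caveat worth adding (my note, not part of the original answer): .cat.codes is a one-way conversion, so keep the categories around if you need to decode later. A sketch:

```python
import pandas as pd

# Save the categories before replacing the column with its codes;
# Categorical.from_codes rebuilds the original values from them.
s = pd.Series(list("abcab")).astype("category")
codes, categories = s.cat.codes, s.cat.categories
restored = pd.Categorical.from_codes(codes, categories)
print(list(restored))  # ['a', 'b', 'c', 'a', 'b']
```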
scikit learn creation of dummy variables
For which algorithms in scikit-learn is this transformation into dummy variables necessary? And for those algorithms that aren't, it can't hurt, right?
All algorithms in sklearn, with the notable exception of tree-based methods, require one-hot encoding (also known as dummy variables) for nominal categorical variables.

Using dummy variables for categorical features with very high cardinality may hurt tree-based methods, especially randomized tree methods, by introducing a bias into the feature split sampler. Tree-based methods tend to work reasonably well with a basic integer encoding of categorical features.
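A small pandas-only sketch contrasting the two encodings discussed above (toy data; whether the single integer column is safe depends on the model, as noted):

```python
import pandas as pd

# One-hot: one 0/1 column per category, needed by linear/distance models.
# Integer codes: a single column, usually fine for tree-based methods.
s = pd.Series(["red", "green", "blue", "green"], dtype="category")
one_hot = pd.get_dummies(s)
int_codes = s.cat.codes  # blue=0, green=1, red=2 (alphabetical categories)
print(one_hot.shape, list(int_codes))  # (4, 3) [2, 1, 0, 1]
```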
Why does onehotencoding convert binary data into 2 mutually exclusive features?
OneHotEncoder cannot know what you want and need, but in any case it should not behave differently for features containing 2 versus 100 categories.

Imagine you have 5 or 100 categories within a feature. Maybe by chance it would drop a category X that has a very strong correlation with the target. Then your ML algorithm would have a hard time generalizing well (for example, a tree-based algorithm would need splits establishing that all of the remaining 4 or 99 binary columns equal 0, which leads to many splits).

But indeed, there is redundant information. Older versions of OneHotEncoder do not allow configuring the transformation to drop one of the categories (which could be beneficial for linear models, for example); newer scikit-learn versions add a drop parameter for this. Alternatively, you can use pandas.get_dummies instead: it has a drop_first argument, and by default it transforms only categorical columns instead of all columns.
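A quick sketch of drop_first with toy data: for k categories it keeps k-1 columns, dropping the first one:

```python
import pandas as pd

# With drop_first=True the 'a' column is dropped; a row of all zeros then
# implicitly means 'a'. This removes the redundancy for linear models.
s = pd.Series(["a", "b", "c", "a"])
full = pd.get_dummies(s)
reduced = pd.get_dummies(s, drop_first=True)
print(full.shape, reduced.shape)  # (4, 3) (4, 2)
```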
How to use Binary Encoding of Categorical Columns to predict labels in Python?
You just call transform() on the test data (and do not fit the encoder again). Values that don't occur in the training dataset will be encoded as 0 in all of the categories (as long as you don't change the handle_unknown parameter). For example:
import category_encoders as ce
train = pd.DataFrame({"var1": ["A", "B", "A", "B", "C"], "var2":["A", "A", "A", "A", "B"]})
encoder = ce.BinaryEncoder(cols = ['var1', 'var2'] , return_df = True)
x_train_data = encoder.fit_transform(train)
# var1_0 var1_1 var2_0 var2_1
#0 0 1 0 1
#1 1 0 0 1
#2 0 1 0 1
#3 1 0 0 1
#4 1 1 1 0
test = pd.DataFrame({"var1": ["C", "D", "B"], "var2":["A", "C", "F"]})
x_test_data = encoder.transform(test)
# var1_0 var1_1 var2_0 var2_1
#0 1 1 0 1
#1 0 0 0 0
#2 1 0 0 0
'D' doesn't occur in var1 in the training data, so it was encoded as 0 0. 'C' and 'F' don't occur in var2 in the training data, so they were both encoded as 0 0.
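Conceptually, binary encoding is just each category's integer code written in base 2, with unseen values mapped to code 0; a minimal pure-Python sketch (the exact code assignment in category_encoders is an implementation detail, but this matches the output above):

```python
import math

def binary_encode(values, categories):
    # Categories get integer codes 1..n; unseen values get code 0.
    # Each code is written in base 2 across ceil(log2(n + 1)) columns.
    codes = {c: i + 1 for i, c in enumerate(categories)}
    width = max(1, math.ceil(math.log2(len(categories) + 1)))
    return [[int(b) for b in format(codes.get(v, 0), f"0{width}b")] for v in values]

print(binary_encode(["A", "B", "C"], ["A", "B", "C"]))  # [[0, 1], [1, 0], [1, 1]]
print(binary_encode(["D"], ["A", "B", "C"]))            # [[0, 0]]
```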
Recode categorical variable as new variable in R
Found this answer here (https://rstudio-pubs-static.s3.amazonaws.com/116317_e6922e81e72e4e3f83995485ce686c14.html#/9)
df <- mutate(df, cat = ifelse(grepl("Sailfin molly", common_name), "Fish",
ifelse(grepl("Hardhead silverside", common_name), "Fish", "Crab")))