Transform One Column from Categoric to Binary, Keep the Rest

Pandas - Convert a categorical column to binary encoded form

I think you need get_dummies:

df1 = pd.get_dummies(df['month'])

If you also need to add the new columns to the original DataFrame and remove month, use join with pop:

df2 = df.join(pd.get_dummies(df.pop('month')))
print (df2.head())
   yyyy  tmax  tmin  April  August  December  February  January  July  June  \
0  1908   5.0  -1.4      0       0         0         0        1     0     0
1  1908   7.3   1.9      0       0         0         1        0     0     0
2  1908   6.2   0.3      0       0         0         0        0     0     0
3  1908   7.4   2.1      1       0         0         0        0     0     0
4  1908  16.5   7.7      0       0         0         0        0     0     0

   March  May  November  October  September
0      0    0         0        0          0
1      0    0         0        0          0
2      1    0         0        0          0
3      0    0         0        0          0
4      0    1         0        0          0

If you do NOT need to remove the month column:

df2 = df.join(pd.get_dummies(df['month']))
print (df2.head())
   yyyy     month  tmax  tmin  April  August  December  February  January  \
0  1908   January   5.0  -1.4      0       0         0         0        1
1  1908  February   7.3   1.9      0       0         0         1        0
2  1908     March   6.2   0.3      0       0         0         0        0
3  1908     April   7.4   2.1      1       0         0         0        0
4  1908       May  16.5   7.7      0       0         0         0        0

   July  June  March  May  November  October  September
0     0     0      0    0         0        0          0
1     0     0      0    0         0        0          0
2     0     0      1    0         0        0          0
3     0     0      0    0         0        0          0
4     0     0      0    1         0        0          0
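
A shorter route to the same kind of result is sketched below on a small made-up frame (not the original data). get_dummies also accepts the whole DataFrame together with a columns argument, so only the listed column is encoded and the rest pass through. The prefix='' and prefix_sep='' values are a choice made here to keep the bare month names, and dtype=int is passed because recent pandas versions return booleans by default.

```python
import pandas as pd

# Small made-up frame with the same column names as in the question.
df = pd.DataFrame({
    'yyyy': [1908, 1908, 1908],
    'month': ['January', 'February', 'March'],
    'tmax': [5.0, 7.3, 6.2],
    'tmin': [-1.4, 1.9, 0.3],
})

# Encode only 'month'; every other column is passed through unchanged.
# prefix='' and prefix_sep='' keep the bare month names as column names.
df2 = pd.get_dummies(df, columns=['month'], prefix='', prefix_sep='', dtype=int)
print(df2)
```

The encoded column is dropped automatically, so this behaves like the join/pop variant without modifying df in place.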

If you need to sort the columns, there are several possible solutions. Use reindex, either with axis=1 or with the columns keyword (the older reindex_axis was removed in pandas 1.0):

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
df1 = pd.get_dummies(df['month']).reindex(months, axis=1)
print (df1.head())
   January  February  March  April  May  June  July  August  September  \
0        1         0      0      0    0     0     0       0          0
1        0         1      0      0    0     0     0       0          0
2        0         0      1      0    0     0     0       0          0
3        0         0      0      1    0     0     0       0          0
4        0         0      0      0    1     0     0       0          0

   October  November  December
0        0         0         0
1        0         0         0
2        0         0         0
3        0         0         0
4        0         0         0

df1 = pd.get_dummies(df['month']).reindex(columns=months)
print (df1.head())
   January  February  March  April  May  June  July  August  September  \
0        1         0      0      0    0     0     0       0          0
1        0         1      0      0    0     0     0       0          0
2        0         0      1      0    0     0     0       0          0
3        0         0      0      1    0     0     0       0          0
4        0         0      0      0    1     0     0       0          0

   October  November  December
0        0         0         0
1        0         0         0
2        0         0         0
3        0         0         0
4        0         0         0

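Or convert the month column to an ordered categorical (current pandas no longer accepts categories= and ordered= in astype, so pass a CategoricalDtype):

df1 = pd.get_dummies(df['month'].astype(pd.api.types.CategoricalDtype(categories=months, ordered=True)))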
print (df1.head())
   January  February  March  April  May  June  July  August  September  \
0        1         0      0      0    0     0     0       0          0
1        0         1      0      0    0     0     0       0          0
2        0         0      1      0    0     0     0       0          0
3        0         0      0      1    0     0     0       0          0
4        0         0      0      0    1     0     0       0          0

   October  November  December
0        0         0         0
1        0         0         0
2        0         0         0
3        0         0         0
4        0         0         0
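
One practical difference of the categorical route is worth a small sketch. With made-up data that contains only three months, get_dummies on a categorical Series still produces one column per declared category, in the declared order. The CategoricalDtype spelling below is the one current pandas expects, and dtype=int is an assumption added here to get 0/1 instead of booleans.

```python
import pandas as pd

months = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']

# Made-up data that only contains three of the twelve months.
df = pd.DataFrame({'month': ['March', 'January', 'May']})

# Declare all twelve months as ordered categories.
month_type = pd.api.types.CategoricalDtype(categories=months, ordered=True)

# Unobserved categories still get a column, filled with zeros.
dummies = pd.get_dummies(df['month'].astype(month_type), dtype=int)

print(dummies.shape)                    # (3, 12)
print(list(dummies.columns) == months)  # True
```

This keeps the column layout stable between data sets that happen to miss some months, which the plain string version does not guarantee.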

Convert various dummy/logical variables into a single categorical variable/factor from their name in R

Try:

library(dplyr)
library(tidyr)

df %>% gather(type, value, -id) %>% na.omit() %>% select(-value) %>% arrange(id)

Which gives:

#  id       type
#1 1 conditionA
#2 2 conditionB
#3 3 conditionC
#4 4 conditionD
#5 5 conditionA

Update

To handle the case you detailed in the comments, you could do the operation on the desired portion of the data frame and then left_join() the other columns:

df %>% 
select(starts_with("condition"), id) %>%
gather(type, value, -id) %>%
na.omit() %>%
select(-value) %>%
left_join(., df %>% select(-starts_with("condition"))) %>%
arrange(id)
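
For readers working in pandas rather than R, the same pipeline can be sketched roughly as follows. This is a loose translation, not part of the original answer: the frame is made up, and melt, dropna and merge stand in for gather(), na.omit() and left_join().

```python
import numpy as np
import pandas as pd

# Made-up frame: one non-missing condition flag per id, plus an extra column.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'conditionA': [1, np.nan, np.nan],
    'conditionB': [np.nan, 1, np.nan],
    'conditionC': [np.nan, np.nan, 1],
    'other': ['x', 'y', 'z'],
})

cond_cols = [c for c in df.columns if c.startswith('condition')]

# Wide to long, keep only the flags that are set, then drop the flag values.
long_df = (
    df.melt(id_vars='id', value_vars=cond_cols, var_name='type')
      .dropna(subset=['value'])
      .drop(columns='value')
)

# Re-attach the columns that were not part of the reshaping.
result = (
    long_df.merge(df.drop(columns=cond_cols), on='id', how='left')
           .sort_values('id')
           .reset_index(drop=True)
)
print(result)
```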

Convert categorical data in pandas dataframe

First, to convert a Categorical column to its numerical codes, there is an easier way: dataframe['c'].cat.codes.

Further, it is possible to automatically select all columns with a certain dtype in a dataframe using select_dtypes. This way, you can apply the above operation to multiple, automatically selected columns.

First, make an example dataframe:

In [75]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'),  'col3':list('ababb')})

In [76]: df['col2'] = df['col2'].astype('category')

In [77]: df['col3'] = df['col3'].astype('category')

In [78]: df.dtypes
Out[78]:
col1 int64
col2 category
col3 category
dtype: object

Then by using select_dtypes to select the columns, and then applying .cat.codes on each of these columns, you can get the following result:

In [80]: cat_columns = df.select_dtypes(['category']).columns

In [81]: cat_columns
Out[81]: Index([u'col2', u'col3'], dtype='object')

In [83]: df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)

In [84]: df
Out[84]:
   col1  col2  col3
0     1     0     0
1     2     1     1
2     3     2     0
3     4     0     1
4     5     1     1
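
Two details of .cat.codes are easy to miss, so here is a small sketch on a made-up Series: missing values get the code -1, and the code-to-label mapping is only available while the column is still categorical, so it is worth saving before overwriting the column.

```python
import pandas as pd

# Made-up categorical Series with one missing value.
s = pd.Series(['a', 'b', None, 'a'], dtype='category')

# Integer codes follow the order of s.cat.categories; missing values become -1.
codes = s.cat.codes

# Save the mapping from code back to label before discarding the categorical.
code_to_label = dict(enumerate(s.cat.categories))

print(codes.tolist())   # [0, 1, -1, 0]
print(code_to_label)    # {0: 'a', 1: 'b'}
```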

scikit learn creation of dummy variables

For which algorithms in scikit-learn is this transformation into dummy variables necessary? And for those algorithms that aren't, it can't hurt, right?

All algorithms in sklearn with the notable exception of tree-based methods require one-hot encoding (also known as dummy variables) for nominal categorical variables.

Using dummy variables for categorical features with very large cardinalities might hurt tree-based methods, especially randomized tree methods, by introducing a bias in the feature split sampler. Tree-based methods tend to work reasonably well with a basic integer encoding of categorical features.
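
To make the cardinality point concrete, here is a minimal sketch with a hypothetical feature that has 1000 distinct values. It only compares the shapes produced by the two encodings and says nothing about model quality.

```python
import pandas as pd

# Hypothetical feature with 1000 distinct categories.
s = pd.Series([f'cat_{i}' for i in range(1000)], name='feature')

# One-hot: one column per category.
one_hot = pd.get_dummies(s)

# Basic integer encoding: a single column of category codes.
int_codes = s.astype('category').cat.codes

print(one_hot.shape)     # (1000, 1000)
print(int_codes.shape)   # (1000,)
```

Note that the integer codes impose an arbitrary order on the categories, which tree-based splits can usually cope with but linear models cannot.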

Why does onehotencoding convert binary data into 2 mutually exclusive features?

OneHotEncoder cannot know what you want or need. In any case, it should not behave differently for features containing 2 categories and features containing 100.

Imagine you have 5 or 100 categories within a feature. If the encoder dropped one of them, it might by chance drop category X, which has a very strong correlation with the target. Then your ML algorithm would have a hard time generalizing well (for example, a tree-based algorithm would need splits establishing that all of the remaining 4 or 99 binary columns are equal to 0, which leads to many splits).

But indeed, there is redundant information. Older versions of OneHotEncoder do not let you configure the transformation to drop one of the categories (which could be beneficial for linear models, for example); scikit-learn 0.21 and later provide a drop parameter for this. Alternatively, you can use pandas.get_dummies. It has a drop_first argument, and by default it transforms only categorical (object, string or category dtype) columns instead of all features.
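
A minimal sketch of dropping the redundant column, on a made-up binary feature. The pandas part uses drop_first; the scikit-learn part assumes version 0.21 or later, where OneHotEncoder accepts a drop argument. dtype=int is an assumption added to get 0/1 instead of booleans.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up binary feature.
df = pd.DataFrame({'flag': ['yes', 'no', 'yes', 'no']})

# pandas: drop_first=True removes the first level ('no' sorts before 'yes').
dummies = pd.get_dummies(df['flag'], drop_first=True, dtype=int)
print(list(dummies.columns))      # ['yes']

# scikit-learn (0.21+ assumed): drop='first' does the same inside the encoder.
enc = OneHotEncoder(drop='first')
encoded = enc.fit_transform(df[['flag']]).toarray()
print(encoded.ravel().tolist())   # [1.0, 0.0, 1.0, 0.0]
```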

How to use Binary Encoding of Categorical Columns to predict labels in Python?

You just call transform() on the test data (and do not fit the encoder again). Values that don't occur in the training dataset are encoded as 0 in all of the categories (as long as you don't change the handle_unknown parameter). For example:

import pandas as pd
import category_encoders as ce

train = pd.DataFrame({"var1": ["A", "B", "A", "B", "C"], "var2":["A", "A", "A", "A", "B"]})

encoder = ce.BinaryEncoder(cols = ['var1', 'var2'] , return_df = True)
x_train_data = encoder.fit_transform(train)

#    var1_0  var1_1  var2_0  var2_1
# 0       0       1       0       1
# 1       1       0       0       1
# 2       0       1       0       1
# 3       1       0       0       1
# 4       1       1       1       0

test = pd.DataFrame({"var1": ["C", "D", "B"], "var2":["A", "C", "F"]})
x_test_data = encoder.transform(test)

#    var1_0  var1_1  var2_0  var2_1
# 0       1       1       0       1
# 1       0       0       0       0
# 2       1       0       0       0

'D' doesn't occur in var1 in the training data, so it was encoded as 0 0. 'C' and 'F' don't occur in var2 in the training data, so they were both encoded as 0 0.
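
To see why unseen values end up as all zeros, here is a pure-Python illustration of the idea. It is not the category_encoders implementation; it is a sketch that assumes training values are numbered from 1 in order of first appearance, that 0 is reserved for unknowns, and that the number is then written in binary. The helper names are hypothetical.

```python
def fit_binary_mapping(values):
    """Number each distinct training value from 1; 0 stays reserved for unknowns."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
    return mapping


def binary_encode(values, mapping):
    """Write each value's number as a fixed-width list of bits (most significant first)."""
    n_bits = max(1, len(mapping).bit_length())
    rows = []
    for v in values:
        ordinal = mapping.get(v, 0)  # unseen values fall back to 0
        rows.append([(ordinal >> shift) & 1 for shift in range(n_bits - 1, -1, -1)])
    return rows


mapping = fit_binary_mapping(["A", "B", "A", "B", "C"])   # the var1 training values
train_bits = binary_encode(["A", "B", "A", "B", "C"], mapping)
test_bits = binary_encode(["C", "D", "B"], mapping)       # 'D' was never seen

print(train_bits)  # [[0, 1], [1, 0], [0, 1], [1, 0], [1, 1]]
print(test_bits)   # [[1, 1], [0, 0], [1, 0]]
```

Under these assumptions the bits reproduce the var1_0 and var1_1 columns shown above, including the 0 0 row for the unseen 'D'.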

Recode categorical variable as new variable in R

Found this answer here (https://rstudio-pubs-static.s3.amazonaws.com/116317_e6922e81e72e4e3f83995485ce686c14.html#/9)

df <- mutate(df, cat = ifelse(grepl("Sailfin molly", common_name), "Fish",
ifelse(grepl("Hardhead silverside", common_name), "Fish", "Crab")))
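
A rough pandas counterpart of this recode, for comparison. It is a sketch on made-up rows ('Blue crab' is invented for illustration), using str.contains in place of grepl and numpy.where in place of the nested ifelse.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the two fish names come from the R snippet.
df = pd.DataFrame({'common_name': ['Sailfin molly', 'Blue crab', 'Hardhead silverside']})

# One regular expression covers both fish names; everything else becomes 'Crab'.
fish_pattern = 'Sailfin molly|Hardhead silverside'
df['cat'] = np.where(df['common_name'].str.contains(fish_pattern), 'Fish', 'Crab')
print(df)
```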

