Pandas - Convert a categorical column to binary encoded form
I think you need get_dummies:

df = pd.get_dummies(df['month'])

If you need to add the new columns to the original DataFrame and remove month, use join with pop:

df2 = df.join(pd.get_dummies(df.pop('month')))
print (df2.head())
yyyy tmax tmin April August December February January July June \
0 1908 5.0 -1.4 0 0 0 0 1 0 0
1 1908 7.3 1.9 0 0 0 1 0 0 0
2 1908 6.2 0.3 0 0 0 0 0 0 0
3 1908 7.4 2.1 1 0 0 0 0 0 0
4 1908 16.5 7.7 0 0 0 0 0 0 0
March May November October September
0 0 0 0 0 0
1 0 0 0 0 0
2 1 0 0 0 0
3 0 0 0 0 0
4 0 1 0 0 0
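As a side note, the join/pop step can also be done in a single call by passing the whole DataFrame with columns=; a minimal sketch with made-up data (note the dummies get a 'month_' prefix this way, unlike the bare month names above):

```python
import pandas as pd

# Passing the frame with columns= creates the dummy columns and drops the
# original 'month' column in one call, equivalent to the join/pop above.
df = pd.DataFrame({"yyyy": [1908, 1908], "month": ["January", "February"], "tmax": [5.0, 7.3]})
df2 = pd.get_dummies(df, columns=["month"])
print(df2.columns.tolist())  # ['yyyy', 'tmax', 'month_February', 'month_January']
```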
If you do NOT need to remove the month column:
df2 = df.join(pd.get_dummies(df['month']))
print (df2.head())
yyyy month tmax tmin April August December February January \
0 1908 January 5.0 -1.4 0 0 0 0 1
1 1908 February 7.3 1.9 0 0 0 1 0
2 1908 March 6.2 0.3 0 0 0 0 0
3 1908 April 7.4 2.1 1 0 0 0 0
4 1908 May 16.5 7.7 0 0 0 0 0
July June March May November October September
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0
If you need the columns in calendar order, use reindex (the older reindex_axis method did the same job but was deprecated in pandas 0.21 and has since been removed):

months = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']
df1 = pd.get_dummies(df['month']).reindex(columns=months)
print (df1.head())
January February March April May June July August September \
0 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0
3 0 0 0 1 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0
October November December
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
Or convert the month column to an ordered categorical (note: the astype('category', categories=..., ordered=True) signature was removed in newer pandas; use CategoricalDtype instead):

df1 = pd.get_dummies(df['month'].astype(pd.CategoricalDtype(categories=months, ordered=True)))
print (df1.head())
January February March April May June July August September \
0 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0
3 0 0 0 1 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0
October November December
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
Convert various dummy/logical variables into a single categorical variable/factor from their name in R
Try:
library(dplyr)
library(tidyr)
df %>% gather(type, value, -id) %>% na.omit() %>% select(-value) %>% arrange(id)
Which gives:
# id type
#1 1 conditionA
#2 2 conditionB
#3 3 conditionC
#4 4 conditionD
#5 5 conditionA
Update
To handle the case you detailed in the comments, you could do the operation on the desired portion of the data frame and then left_join() the other columns:
df %>%
select(starts_with("condition"), id) %>%
gather(type, value, -id) %>%
na.omit() %>%
select(-value) %>%
left_join(., df %>% select(-starts_with("condition"))) %>%
arrange(id)
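For reference, the same collapse of dummy columns into a single variable can be sketched in pandas (the column names here are assumptions mirroring the R example):

```python
import pandas as pd

# Toy frame mirroring the R example: one 0/1 column per condition.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "conditionA": [1, 0, 0],
    "conditionB": [0, 1, 0],
    "conditionC": [0, 0, 1],
})
# idxmax(axis=1) returns the name of the column holding the 1 in each row.
df["type"] = df.filter(like="condition").idxmax(axis=1)
print(df[["id", "type"]].to_string(index=False))
```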
Convert categorical data in pandas dataframe
First, to convert a Categorical column to its numerical codes, there is an easier way: dataframe['c'].cat.codes.

Further, it is possible to select all columns of a certain dtype in a DataFrame automatically, using select_dtypes. This way, you can apply the above operation on multiple, automatically selected columns.
First making an example dataframe:
In [75]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'), 'col3':list('ababb')})
In [76]: df['col2'] = df['col2'].astype('category')
In [77]: df['col3'] = df['col3'].astype('category')
In [78]: df.dtypes
Out[78]:
col1 int64
col2 category
col3 category
dtype: object
Then by using select_dtypes
to select the columns, and then applying .cat.codes
on each of these columns, you can get the following result:
In [80]: cat_columns = df.select_dtypes(['category']).columns
In [81]: cat_columns
Out[81]: Index([u'col2', u'col3'], dtype='object')
In [83]: df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
In [84]: df
Out[84]:
col1 col2 col3
0 1 0 0
1 2 1 1
2 3 2 0
3 4 0 1
4 5 1 1
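One caveat worth adding (my note, not part of the original answer): .cat.codes is a one-way conversion, so keep the categories around if you need to decode later. A sketch:

```python
import pandas as pd

# Save the categories before replacing the column with its codes;
# Categorical.from_codes rebuilds the original values from them.
s = pd.Series(list("abcab")).astype("category")
codes, categories = s.cat.codes, s.cat.categories
restored = pd.Categorical.from_codes(codes, categories)
print(list(restored))  # ['a', 'b', 'c', 'a', 'b']
```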
scikit learn creation of dummy variables
For which algorithms in scikit-learn is this transformation into dummy variables necessary? And for those algorithms that aren't, it can't hurt, right?
All algorithms in sklearn, with the notable exception of tree-based methods, require one-hot encoding (also known as dummy variables) for nominal categorical variables.

Using dummy variables for categorical features with very high cardinality may hurt tree-based methods, especially randomized tree methods, by introducing a bias into the feature split sampler. Tree-based methods tend to work reasonably well with a basic integer encoding of categorical features.
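A small pandas-only sketch contrasting the two encodings discussed above (toy data; whether the single integer column is safe depends on the model, as noted):

```python
import pandas as pd

# One-hot: one 0/1 column per category, needed by linear/distance models.
# Integer codes: a single column, usually fine for tree-based methods.
s = pd.Series(["red", "green", "blue", "green"], dtype="category")
one_hot = pd.get_dummies(s)
int_codes = s.cat.codes  # blue=0, green=1, red=2 (alphabetical categories)
print(one_hot.shape, list(int_codes))  # (4, 3) [2, 1, 0, 1]
```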
Why does onehotencoding convert binary data into 2 mutually exclusive features?
OneHotEncoder cannot know what you want and need, but in any case it should not behave differently for features containing 2 versus 100 categories.

Imagine you have 5 or 100 categories within a feature. Maybe by chance it would drop a category X that has a very strong correlation with the target. Then your ML algorithm would have a hard time generalizing well (for example, a tree-based algorithm would need splits establishing that all of the remaining 4 or 99 binary columns equal 0, which leads to many splits).

But indeed, there is redundant information. Older versions of OneHotEncoder do not allow configuring the transformation to drop one of the categories (which could be beneficial for linear models, for example); newer scikit-learn versions add a drop parameter for this. Alternatively, you can use pandas.get_dummies instead: it has a drop_first argument, and by default it transforms only categorical columns instead of all columns.
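A quick sketch of drop_first with toy data: for k categories it keeps k-1 columns, dropping the first one:

```python
import pandas as pd

# With drop_first=True the 'a' column is dropped; a row of all zeros then
# implicitly means 'a'. This removes the redundancy for linear models.
s = pd.Series(["a", "b", "c", "a"])
full = pd.get_dummies(s)
reduced = pd.get_dummies(s, drop_first=True)
print(full.shape, reduced.shape)  # (4, 3) (4, 2)
```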
How to use Binary Encoding of Categorical Columns to predict labels in Python?
You just call transform() on the test data (and do not fit the encoder again). Values that don't occur in the training dataset will be encoded as 0 in all of the categories (as long as you don't change the handle_unknown parameter). For example:
import category_encoders as ce
train = pd.DataFrame({"var1": ["A", "B", "A", "B", "C"], "var2":["A", "A", "A", "A", "B"]})
encoder = ce.BinaryEncoder(cols = ['var1', 'var2'] , return_df = True)
x_train_data = encoder.fit_transform(train)
# var1_0 var1_1 var2_0 var2_1
#0 0 1 0 1
#1 1 0 0 1
#2 0 1 0 1
#3 1 0 0 1
#4 1 1 1 0
test = pd.DataFrame({"var1": ["C", "D", "B"], "var2":["A", "C", "F"]})
x_test_data = encoder.transform(test)
# var1_0 var1_1 var2_0 var2_1
#0 1 1 0 1
#1 0 0 0 0
#2 1 0 0 0
'D' doesn't occur in var1 in the training data, so it was encoded as 0 0. 'C' and 'F' don't occur in var2 in the training data, so they were both encoded as 0 0.
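Conceptually, binary encoding is just each category's integer code written in base 2, with unseen values mapped to code 0; a minimal pure-Python sketch (the exact code assignment in category_encoders is an implementation detail, but this matches the output above):

```python
import math

def binary_encode(values, categories):
    # Categories get integer codes 1..n; unseen values get code 0.
    # Each code is written in base 2 across ceil(log2(n + 1)) columns.
    codes = {c: i + 1 for i, c in enumerate(categories)}
    width = max(1, math.ceil(math.log2(len(categories) + 1)))
    return [[int(b) for b in format(codes.get(v, 0), f"0{width}b")] for v in values]

print(binary_encode(["A", "B", "C"], ["A", "B", "C"]))  # [[0, 1], [1, 0], [1, 1]]
print(binary_encode(["D"], ["A", "B", "C"]))            # [[0, 0]]
```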
Recode categorical variable as new variable in R
Found this answer here (https://rstudio-pubs-static.s3.amazonaws.com/116317_e6922e81e72e4e3f83995485ce686c14.html#/9)
df <- mutate(df, cat = ifelse(grepl("Sailfin molly", common_name), "Fish",
ifelse(grepl("Hardhead silverside", common_name), "Fish", "Crab")))