Convert Categorical Data in Pandas Dataframe

First, the easiest way to convert a Categorical column to its numerical codes is dataframe['c'].cat.codes.

Further, you can use select_dtypes to automatically select all columns of a certain dtype in a dataframe. This way, you can apply the operation above to several automatically selected columns at once.

First, make an example dataframe:

In [75]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'),  'col3':list('ababb')})

In [76]: df['col2'] = df['col2'].astype('category')

In [77]: df['col3'] = df['col3'].astype('category')

In [78]: df.dtypes
Out[78]:
col1       int64
col2    category
col3    category
dtype: object

Then use select_dtypes to select the categorical columns and apply .cat.codes to each of them to get the following result:

In [80]: cat_columns = df.select_dtypes(['category']).columns

In [81]: cat_columns
Out[81]: Index(['col2', 'col3'], dtype='object')

In [83]: df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)

In [84]: df
Out[84]:
   col1  col2  col3
0     1     0     0
1     2     1     1
2     3     2     0
3     4     0     1
4     5     1     1
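
For reference, here is the same walkthrough as a standalone script. It is only a sketch: the missing value in col3 is an assumption added here to show that .cat.codes assigns -1 to missing entries, and the codes are written to a separate frame instead of overwriting df.

```python
import pandas as pd

# Same shape as the example above, plus one missing value in col3
# (an assumption added for illustration).
df = pd.DataFrame({
    "col1": [1, 2, 3, 4, 5],
    "col2": list("abcab"),
    "col3": ["a", "b", None, "b", "b"],
})
df = df.astype({"col2": "category", "col3": "category"})

# Select every categorical column, then take its integer codes.
cat_columns = df.select_dtypes(["category"]).columns
codes = df[cat_columns].apply(lambda s: s.cat.codes)

print(codes)
```

Categories are numbered in sorted order starting at 0, so 'a' becomes 0, 'b' becomes 1, and so on. A -1 in the output means the original value was missing.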

Convert Categorical values to custom number in pandas dataframe

new_label = {"cat_column": {"low": 1, "high": 0}}
df.replace(new_label, inplace=True)

To do custom label encoding, create a nested dict that maps each column name to a {category: number} mapping and pass it to replace() to swap your categorical values for numerical ones. You can pick whichever numerical values you prefer.
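
As an alternative sketch (not the replace() approach above), a single column can also be encoded with Series.map. The column name cat_column and the low/high labels are carried over from the snippet above; the sample rows and the cat_code column are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data using the labels from the mapping above.
df = pd.DataFrame({"cat_column": ["low", "high", "low", "high"]})

mapping = {"low": 1, "high": 0}

# map() looks up each value in the dict; values that are not in the
# dict become NaN, whereas replace() would leave them unchanged.
df["cat_code"] = df["cat_column"].map(mapping)

print(df)
```

Writing to a new column keeps the original labels around, which helps when you need to check the encoding later.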

Pandas: convert categories to numbers

First, change the type of the column:

df.cc = pd.Categorical(df.cc)

Now the data look similar but are stored categorically. To capture the category codes:

df['code'] = df.cc.cat.codes

Now you have:

   cc  temp  code
0  US  37.0     2
1  CA  12.0     1
2  US  35.0     2
3  AU  20.0     0

If you don't want to modify your DataFrame but simply get the codes:

df.cc.astype('category').cat.codes

Or use the categorical column as an index:

df2 = pd.DataFrame(df.temp)
df2.index = pd.CategoricalIndex(df.cc)
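
If you later need to translate the codes back to labels, the categories can be enumerated. A small sketch, with the input rows reconstructed from the table above (the name code_to_label is just illustrative):

```python
import pandas as pd

# Input reconstructed from the table shown above.
df = pd.DataFrame({
    "cc": ["US", "CA", "US", "AU"],
    "temp": [37.0, 12.0, 35.0, 20.0],
})

cc = df["cc"].astype("category")
df["code"] = cc.cat.codes

# Position i in .cat.categories is the label for code i.
code_to_label = dict(enumerate(cc.cat.categories))

print(df)
print(code_to_label)
```

Because categories are sorted by default, AU gets 0, CA gets 1 and US gets 2, which matches the code column in the table.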

Convert a column with categorical data to separate column for each category and transpose variable data of corresponding columns to rows

Use df.pivot() + stack(), as follows:

(df.pivot(index='date', columns='type')
   .stack(level=0)
   .rename_axis(index=['date', 'hour'], columns=None)
).reset_index()

Result:

         date hour    A    B    C
0  2021-01-01    1  6.0  0.1  1.0
1  2021-01-01    2  7.0  0.2  2.0
2  2021-01-01    3  8.0  0.3  3.0
3  2021-02-02    1  6.0  0.1  1.0
4  2021-02-02    2  7.0  0.2  2.0
5  2021-02-02    3  8.0  0.3  3.0
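
To try this end to end, here is a self-contained sketch. The input frame is an assumption reconstructed from the result above (one row per date and type, with the hours 1, 2 and 3 as columns); the original question's data may have looked different.

```python
import pandas as pd

# Assumed input layout: one row per (date, type), one column per hour.
df = pd.DataFrame({
    "date": ["2021-01-01"] * 3 + ["2021-02-02"] * 3,
    "type": list("ABC") * 2,
    1: [6.0, 0.1, 1.0] * 2,
    2: [7.0, 0.2, 2.0] * 2,
    3: [8.0, 0.3, 3.0] * 2,
})

result = (
    df.pivot(index="date", columns="type")   # columns become (hour, type)
      .stack(level=0)                        # move the hour level into the index
      .rename_axis(index=["date", "hour"], columns=None)
      .reset_index()
)

print(result)
```

The keyword arguments in pivot() matter on recent pandas releases, where positional arguments are no longer accepted.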

Convert categorical values to columns in Pandas

You can use:

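# one output column per distinct attribute name
vendors = df['attribute_name'].unique()
# index by the identifying columns, then group the values by attribute name
groups = df.set_index(['Platform_id', 'part_id', 'class_id']).groupby('attribute_name')['attribute_value']
# put each attribute's values side by side, aligned on the index
df2 = pd.concat([groups.get_group(key) for key in vendors], axis=1)
df2.columns = vendors
df2.reset_index(inplace=True)
print(df2)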

   Platform_id part_id   class_id Vendor_name Cache Clock-speed Model_name
0          772  GHTYW2  PROCESSOR        None     4         NaN        NaN
1         4356  XCVT43  PROCESSOR       Intel     4         3.1        NaN
2        23675  TT3344  PROCESSOR       Intel   NaN         2.3      4500U
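
A shorter alternative (a different technique from the groupby/concat above) is DataFrame.pivot with a list of index columns, which should work as long as each (Platform_id, part_id, class_id, attribute_name) combination occurs only once. The rows below are hypothetical and only loosely based on the output above.

```python
import pandas as pd

# Hypothetical long-format input: one row per attribute.
df = pd.DataFrame({
    "Platform_id": [772, 4356, 4356, 4356, 23675, 23675, 23675],
    "part_id": ["GHTYW2", "XCVT43", "XCVT43", "XCVT43",
                "TT3344", "TT3344", "TT3344"],
    "class_id": ["PROCESSOR"] * 7,
    "attribute_name": ["Cache", "Vendor_name", "Cache", "Clock-speed",
                       "Vendor_name", "Clock-speed", "Model_name"],
    "attribute_value": ["4", "Intel", "4", "3.1", "Intel", "2.3", "4500U"],
})

wide = (
    df.pivot(index=["Platform_id", "part_id", "class_id"],
             columns="attribute_name", values="attribute_value")
      .rename_axis(columns=None)   # drop the 'attribute_name' axis label
      .reset_index()
)

print(wide)
```

pivot() raises an error on duplicate index/column pairs, so if duplicates are possible in your data you would need pivot_table with an aggregation function instead.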

Convert Categorical data to numeric percentage in Pandas

You want a cross-tabulation of two factors (col1 and col2) with the frequencies normalized over each row. To do this, use pd.crosstab() with normalize='index':

>> df = pd.DataFrame({'col1': list('aaaaaabbb'), 'col2': list('xyxzzzxyx')})
>> pd.crosstab(df['col1'], df['col2'], normalize='index') * 100
col2    x           y           z
col1            
a       33.333333   16.666667   50.0
b       66.666667   33.333333   0.0

If you want to use multiple factors, just call crosstab with a list of factors:

>> df['col3'] = list('112231345')
>> pd.crosstab([df['col1'], df['col3']], df['col2'], normalize='index') * 100
        col2    x           y           z
col1    col3            
a       1       33.333333   33.333333   33.333333
        2       50.000000   0.000000    50.000000
        3       0.000000    0.000000    100.000000
b       3       100.000000  0.000000    0.000000
        4       0.000000    100.000000  0.000000
        5       100.000000  0.000000    0.000000

If you want to round the percentages, just call round:

>> round(pd.crosstab(df['col1'], df['col2'], normalize='index') * 100, 2)
col2    x       y       z
col1            
a       33.33   16.67   50.0
b       66.67   33.33   0.0
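
Here is a standalone version of the same idea that also checks that each row adds up to 100. The by_column variant is added for comparison; normalize='columns' makes each column sum to 100 instead.

```python
import pandas as pd

df = pd.DataFrame({"col1": list("aaaaaabbb"), "col2": list("xyxzzzxyx")})

# Percentages within each col1 group (each row sums to 100).
by_row = pd.crosstab(df["col1"], df["col2"], normalize="index") * 100

# Percentages within each col2 value (each column sums to 100).
by_column = pd.crosstab(df["col1"], df["col2"], normalize="columns") * 100

print(by_row.round(2))
print(by_row.sum(axis=1))
```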

Identifying the categorical columns of a dataframe

Your independent features include categorical data. The error is raised because some of your columns contain strings, which cannot be interpreted as floats when training the model.

My suggestion is to use get_dummies.

This example might help you:

import pandas as pd

r = pd.DataFrame(['France','Japan','Spain','France','USA'], columns=['Country'])
r['gender'] = ['male','female','female','female','male']
# dtype=int gives the 0/1 output shown below (newer pandas versions default to True/False)
r = pd.get_dummies(r, dtype=int)
r.head()

   Country_France  Country_Japan  ...  gender_female  gender_male
0               1              0  ...              0            1
1               0              1  ...              1            0
2               0              0  ...              1            0
3               1              0  ...              1            0
4               0              0  ...              0            1
[5 rows x 6 columns]

All categorical columns are automatically converted using one-hot encoding.

Once you convert your categorical data you can fit the LogisticRegression.
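
If the frame also has numeric columns and you first want to see which columns are categorical, one possible sketch is to select everything that is not numeric and encode only those columns. The age column and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two text columns and one numeric column.
df = pd.DataFrame({
    "Country": ["France", "Japan", "Spain", "France", "USA"],
    "gender": ["male", "female", "female", "female", "male"],
    "age": [34, 28, 45, 51, 39],
})

# Treat every non-numeric column as categorical. Assumption: this would
# also pick up bool or datetime columns if the frame had any.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# One-hot encode only those columns; numeric columns pass through as-is.
encoded = pd.get_dummies(df, columns=categorical_cols, dtype=int)

print(categorical_cols)
print(encoded.columns.tolist())
```

Printing categorical_cols before encoding is a quick way to find which columns were causing the string-to-float error.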

Pandas DataFrame: How to convert numeric columns into pairwise categorical data?

Use DataFrame.stack with filtering and Index.to_frame:

s = df.stack()

df = s[s!=0].index.to_frame(index=False).rename(columns={1:'result'})
print (df)
   id result
0   0      A
1   0      D
2   1      A
3   1      B
4   2      A
5   2      B
6   2      C
7   3      D
8   5      B

Or, if performance is important, use numpy.where to get the row and column positions of the non-zero values and pass them to the DataFrame constructor:

i, c = np.where(df != 0)

df = pd.DataFrame({'id':df.index.values[i],
                   'result':df.columns.values[c]})
print (df)
   id result
0   0      A
1   0      D
2   1      A
3   1      B
4   2      A
5   2      B
6   2      C
7   3      D
8   5      B

EDIT:

To also keep the matched values, for the first approach:

s = df.stack()

df = s[s!=0].reset_index()
df.columns= ['id','result','vals']
print (df)
   id result  vals
0   0      A     3
1   0      D     1
2   1      A     4
3   1      B     1
4   2      A     1
5   2      B     7
6   2      C    20
7   3      D     4
8   5      B     1

For the second approach:

df = pd.DataFrame({'id':df.index.values[i],
                   'result':df.columns.values[c],
                   'vals':df.values[i,c]})
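
Both approaches can be compared side by side with a small script. The input frame here is an assumption reconstructed from the outputs above (index named id, columns A to D, zeros everywhere else); via_stack and via_where are illustrative names.

```python
import numpy as np
import pandas as pd

# Input reconstructed from the printed results (assumption).
df = pd.DataFrame(
    {
        "A": [3, 4, 1, 0, 0, 0],
        "B": [0, 1, 7, 0, 0, 1],
        "C": [0, 0, 20, 0, 0, 0],
        "D": [1, 0, 0, 4, 0, 0],
    },
    index=pd.Index(range(6), name="id"),
)

# Approach 1: stack to long format, then keep the non-zero entries.
s = df.stack()
via_stack = s[s != 0].reset_index()
via_stack.columns = ["id", "result", "vals"]

# Approach 2: positions of non-zero cells straight from the NumPy array.
values = df.to_numpy()
i, c = np.where(values != 0)
via_where = pd.DataFrame({
    "id": df.index.to_numpy()[i],
    "result": df.columns.to_numpy()[c],
    "vals": values[i, c],
})

print(via_stack)
print(via_where)
```

Both results list the non-zero cells row by row, so the two frames should hold the same rows in the same order.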
