Convert Categorical Data in Pandas Dataframe

Convert categorical data in pandas dataframe

First, to convert a categorical column to its numerical codes, the simplest way is dataframe['c'].cat.codes.

Further, you can automatically select all columns with a certain dtype in a dataframe using select_dtypes. This way, you can apply the above operation to multiple, automatically selected columns.

First, make an example dataframe:

In [75]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'),  'col3':list('ababb')})

In [76]: df['col2'] = df['col2'].astype('category')

In [77]: df['col3'] = df['col3'].astype('category')

In [78]: df.dtypes
Out[78]:
col1       int64
col2    category
col3    category
dtype: object

Then, by using select_dtypes to select the columns and applying .cat.codes to each of them, you get the following result:

In [80]: cat_columns = df.select_dtypes(['category']).columns

In [81]: cat_columns
Out[81]: Index([u'col2', u'col3'], dtype='object')

In [83]: df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)

In [84]: df
Out[84]:
   col1  col2  col3
0     1     0     0
1     2     1     1
2     3     2     0
3     4     0     1
4     5     1     1
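
One caveat worth checking before relying on the codes: missing values do not get a category of their own, and .cat.codes reports them as -1. A minimal sketch with a made-up Series (the name s and its values are only for illustration):

```python
import pandas as pd

# Hypothetical data: one missing value in a categorical column.
s = pd.Series(['a', None, 'b', 'a'], dtype='category')

# Categories are sorted ('a' -> 0, 'b' -> 1); the missing entry becomes -1.
codes = s.cat.codes
print(codes.tolist())  # [0, -1, 1, 0]
```

If -1 would be misread as a real class downstream, it is probably safer to fill or drop the missing values first.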

Convert Categorical values to custom number in pandas dataframe

new_label = {"cat_column": {"low": 1, "high": 0}}
df.replace(new_label , inplace = True)

To do custom label encoding, create a dict of mappings and use replace() to swap your categorical values for numerical ones. You can choose whichever numerical values suit your use case.
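
If you would rather not modify the whole frame in place, a per-column alternative is Series.map. This is a different technique from replace(), sketched here on a hypothetical frame (the column names and values are assumptions):

```python
import pandas as pd

# Hypothetical frame; 'cat_column' matches the key used in the mapping above.
df = pd.DataFrame({'cat_column': ['low', 'high', 'low'],
                   'value': [10, 20, 30]})

mapping = {'low': 1, 'high': 0}

# Series.map returns a new Series; labels missing from the dict become NaN.
df['cat_column'] = df['cat_column'].map(mapping)
print(df['cat_column'].tolist())  # [1, 0, 1]
```

Unlike replace(), map() turns any label that is not in the dict into NaN, which makes unmapped categories easy to spot.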

Hope this is what you are looking for.

Pandas: convert categories to numbers

First, change the type of the column:

df.cc = pd.Categorical(df.cc)

Now the data look similar but are stored categorically. To capture the category codes:

df['code'] = df.cc.cat.codes

Now you have:

   cc  temp  code
0  US  37.0     2
1  CA  12.0     1
2  US  35.0     2
3  AU  20.0     0

If you don't want to modify your DataFrame but simply get the codes:

df.cc.astype('category').cat.codes

Or use the categorical column as an index:

df2 = pd.DataFrame(df.temp)
df2.index = pd.CategoricalIndex(df.cc)
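
The snippets above assume an existing df with cc and temp columns. A self-contained sketch, with the data rebuilt from the table shown earlier, might look like this:

```python
import pandas as pd

# Data reconstructed from the table shown above.
df = pd.DataFrame({'cc': ['US', 'CA', 'US', 'AU'],
                   'temp': [37.0, 12.0, 35.0, 20.0]})

# Codes without modifying the frame: categories sort as AU=0, CA=1, US=2.
cat = df['cc'].astype('category')
codes = cat.cat.codes
print(codes.tolist())  # [2, 1, 2, 0]

# The categories attribute maps each code back to its label.
categories = list(cat.cat.categories)
print(categories)  # ['AU', 'CA', 'US']
```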

Convert a column with categorical data to separate column for each category and transpose variable data of corresponding columns to rows

Use df.pivot() + stack(), as follows:

(df.pivot(index='date', columns='type')   # keyword arguments are required in pandas 2.0+
   .stack(level=0)
   .rename_axis(index=['date', 'hour'], columns=None)
).reset_index()

Result:

         date  hour    A    B    C
0  2021-01-01     1  6.0  0.1  1.0
1  2021-01-01     2  7.0  0.2  2.0
2  2021-01-01     3  8.0  0.3  3.0
3  2021-02-02     1  6.0  0.1  1.0
4  2021-02-02     2  7.0  0.2  2.0
5  2021-02-02     3  8.0  0.3  3.0
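
The input frame is not shown here, so the following is only a sketch under an assumed layout: one row per (date, type) pair, with the hours 1 to 3 as separate columns. The values are chosen to match the result above.

```python
import pandas as pd

# Assumed input: one row per (date, type), hours 1..3 as columns.
df = pd.DataFrame({
    'date': ['2021-01-01'] * 3 + ['2021-02-02'] * 3,
    'type': list('ABC') * 2,
    1: [6.0, 0.1, 1.0] * 2,
    2: [7.0, 0.2, 2.0] * 2,
    3: [8.0, 0.3, 3.0] * 2,
})

out = (df.pivot(index='date', columns='type')  # columns become (hour, type)
         .stack(level=0)                       # move the hour level into the rows
         .rename_axis(index=['date', 'hour'], columns=None)
         .reset_index())
print(out)
```

After the pivot, the column index has two levels (hour, then type); stacking level 0 moves the hour into the row index and leaves one column per type.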

Convert categorical values to columns in Pandas

You can use:

vendors=df['attribute_name'].unique()
df2=pd.concat([df.set_index(['Platform_id','part_id','class_id']).groupby('attribute_name')['attribute_value'].get_group(key) for key in vendors],axis=1)
df2.columns=vendors
df2.reset_index(inplace=True)
print(df2)

   Platform_id part_id   class_id Vendor_name Cache Clock-speed Model_name
0          772  GHTYW2  PROCESSOR        None     4         NaN        NaN
1         4356  XCVT43  PROCESSOR       Intel     4         3.1        NaN
2        23675  TT3344  PROCESSOR       Intel   NaN         2.3      4500U
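
A shorter route to a similar wide table is set_index followed by unstack. This is a different technique from the get_group/concat approach above, and the long-format input below is an assumption reconstructed from the printed result (all values kept as strings):

```python
import pandas as pd

# Assumed long-format input, reconstructed from the wide result above.
df = pd.DataFrame({
    'Platform_id': [772, 4356, 4356, 4356, 23675, 23675, 23675],
    'part_id': ['GHTYW2', 'XCVT43', 'XCVT43', 'XCVT43',
                'TT3344', 'TT3344', 'TT3344'],
    'class_id': ['PROCESSOR'] * 7,
    'attribute_name': ['Cache', 'Vendor_name', 'Cache', 'Clock-speed',
                       'Vendor_name', 'Clock-speed', 'Model_name'],
    'attribute_value': ['4', 'Intel', '4', '3.1', 'Intel', '2.3', '4500U'],
})

keys = ['Platform_id', 'part_id', 'class_id']

# Move attribute_name into the index, then unstack it into columns.
wide = (df.set_index(keys + ['attribute_name'])['attribute_value']
          .unstack()
          .rename_axis(columns=None)
          .reset_index())
print(wide)
```

Note that unstack sorts the new columns alphabetically and requires each (keys, attribute_name) combination to be unique; duplicates would raise an error.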

Convert Categorical data to numeric percentage in Pandas

You want to make a cross-tabulation of two factors (col1 and col2) with the frequency normalized over each row. To do this you can use pd.crosstab() with normalize set to index:

>> df = pd.DataFrame({'col1': list('aaaaaabbb'), 'col2': list('xyxzzzxyx')})
>> pd.crosstab(df['col1'], df['col2'], normalize='index') * 100
col2          x          y     z
col1
a     33.333333  16.666667  50.0
b     66.666667  33.333333   0.0

If you want to use multiple factors, just call crosstab with a list of factors:

>> df['col3'] = list('112231345')
>> pd.crosstab([df['col1'], df['col3']], df['col2'], normalize='index') * 100
col2                x           y           z
col1 col3
a    1      33.333333   33.333333   33.333333
     2      50.000000    0.000000   50.000000
     3       0.000000    0.000000  100.000000
b    3     100.000000    0.000000    0.000000
     4       0.000000  100.000000    0.000000
     5     100.000000    0.000000    0.000000

If you want to round the percentages (here to two decimal places), just call round:

>> round(pd.crosstab(df['col1'], df['col2'], normalize='index') * 100, 2)
col2      x      y     z
col1
a     33.33  16.67  50.0
b     66.67  33.33   0.0

Identifying the categorical columns of a dataframe

Your independent features include categorical data. The error is raised because some columns contain strings, which cannot be interpreted as floats when training the model.

My suggestion is to use get_dummies.

This example might help you:

import pandas as pd

r = pd.DataFrame(['France', 'Japan', 'Spain', 'France', 'USA'], columns=['Country'])
r['gender'] = ['male', 'female', 'female', 'female', 'male']
# dtype=int gives 0/1 columns; recent pandas versions return booleans by default
r = pd.get_dummies(r, dtype=int)
r.head()

   Country_France  Country_Japan  ...  gender_female  gender_male
0               1              0  ...              0            1
1               0              1  ...              1            0
2               0              0  ...              1            0
3               1              0  ...              1            0
4               0              0  ...              0            1
[5 rows x 6 columns]

All categorical columns are automatically converted using one-hot encoding.

Once you convert your categorical data you can fit the LogisticRegression.
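
To identify the categorical columns first and encode only those, one option is select_dtypes. The sketch below assumes that every non-numeric column should be treated as categorical, which may not hold for your data; the frame X and its columns are hypothetical:

```python
import pandas as pd

# Hypothetical frame mixing numeric and string features.
X = pd.DataFrame({'Country': ['France', 'Japan', 'Spain'],
                  'gender': ['male', 'female', 'female'],
                  'age': [30, 25, 41]})

# Assumption: every non-numeric column is categorical.
cat_cols = X.select_dtypes(exclude='number').columns.tolist()
print(cat_cols)  # ['Country', 'gender']

# One-hot encode only those columns; dtype=int gives 0/1 instead of booleans.
X_encoded = pd.get_dummies(X, columns=cat_cols, dtype=int)
print(X_encoded.columns.tolist())
```

The numeric column age passes through unchanged, so X_encoded can be handed to a model that expects only numbers.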

Pandas DataFrame: How to convert numeric columns into pairwise categorical data?

Use DataFrame.stack with filtering and Index.to_frame:

s = df.stack()

df = s[s != 0].index.to_frame(index=False).rename(columns={1: 'result'})
print(df)
   id result
0   0      A
1   0      D
2   1      A
3   1      B
4   2      A
5   2      B
6   2      C
7   3      D
8   5      B

Or, if performance is important, use numpy.where to get the row and column positions of the matching values and pass them to the DataFrame constructor:

i, c = np.where(df != 0)

df = pd.DataFrame({'id': df.index.values[i],
                   'result': df.columns.values[c]})
print(df)
   id result
0   0      A
1   0      D
2   1      A
3   1      B
4   2      A
5   2      B
6   2      C
7   3      D
8   5      B

EDIT:

To also keep the non-zero values in a third column, for the first approach:

s = df.stack()

df = s[s != 0].reset_index()
df.columns = ['id', 'result', 'vals']
print(df)
   id result  vals
0   0      A     3
1   0      D     1
2   1      A     4
3   1      B     1
4   2      A     1
5   2      B     7
6   2      C    20
7   3      D     4
8   5      B     1

For the second approach, reusing i and c from np.where on the original df:

df = pd.DataFrame({'id': df.index.values[i],
                   'result': df.columns.values[c],
                   'vals': df.values[i, c]})
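
The snippets above overwrite df and do not show the input, so here is a self-contained sketch of both approaches. The input frame is an assumption reconstructed from the printed outputs (id 4 is taken to be all zeros), and the names long_stack and long_np are only for illustration:

```python
import numpy as np
import pandas as pd

# Assumed input, reconstructed from the outputs above; id 4 is all zeros.
df = pd.DataFrame(
    {'A': [3, 4, 1, 0, 0, 0],
     'B': [0, 1, 7, 0, 0, 1],
     'C': [0, 0, 20, 0, 0, 0],
     'D': [1, 0, 0, 4, 0, 0]},
    index=pd.Index(range(6), name='id'))

# Approach 1: stack into a long Series, then keep the non-zero entries.
s = df.stack()
long_stack = s[s != 0].reset_index()
long_stack.columns = ['id', 'result', 'vals']

# Approach 2: locate non-zero cells with numpy, then look up labels and values.
values = df.to_numpy()
i, c = np.where(values != 0)
long_np = pd.DataFrame({'id': df.index.to_numpy()[i],
                        'result': df.columns.to_numpy()[c],
                        'vals': values[i, c]})

print(long_stack)
print(long_np)
```

Both results are kept under new names so that the original df stays available for either approach.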

