Convert categorical data in pandas dataframe
To convert a Categorical column to its numerical codes, use dataframe['c'].cat.codes.
You can also select all columns of a given dtype automatically with select_dtypes, which lets you apply the same operation to several columns at once.
First, create an example dataframe:
In [75]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'), 'col3':list('ababb')})
In [76]: df['col2'] = df['col2'].astype('category')
In [77]: df['col3'] = df['col3'].astype('category')
In [78]: df.dtypes
Out[78]:
col1 int64
col2 category
col3 category
dtype: object
Then use select_dtypes to select the categorical columns and apply .cat.codes to each of them:
In [80]: cat_columns = df.select_dtypes(['category']).columns
In [81]: cat_columns
Out[81]: Index([u'col2', u'col3'], dtype='object')
In [83]: df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
In [84]: df
Out[84]:
col1 col2 col3
0 1 0 0
1 2 1 1
2 3 2 0
3 4 0 1
4 5 1 1
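One detail worth knowing when using .cat.codes: missing values get the code -1, and the original labels can be recovered from .cat.categories. The following is a minimal sketch on a made-up Series (the data and variable names are illustrative, not from the answer above):

```python
import pandas as pd

# Illustrative data: one missing value in a categorical Series.
s = pd.Series(["a", "b", None, "c", "a"], dtype="category")

codes = s.cat.codes            # integer code per row, -1 marks a missing value
categories = s.cat.categories  # lookup table: position -> original label

# Map the codes back to labels, keeping missing values as None.
restored = [categories[c] if c != -1 else None for c in codes]

print(codes.tolist())
print(list(categories))
print(restored)
```

If the -1 codes would be a problem downstream, fill or drop the missing values before encoding.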
Convert Categorical values to custom number in pandas dataframe
new_label = {"cat_column": {"low": 1, "high": 0}}
df.replace(new_label, inplace=True)
To do custom label encoding, create a dict of mappings and use replace() to swap the categorical values for numerical ones. The outer key is the column name and the inner dict maps each label to the number you choose.
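As an alternative sketch (not the replace() approach shown above), Series.map applies the same kind of label-to-number dict to a single column. The practical difference is that map turns any label missing from the dict into NaN, which makes unmapped labels easy to spot. The column contents below, including the unmapped "medium" label, are assumptions for illustration:

```python
import pandas as pd

# Illustrative frame; "medium" is deliberately absent from the mapping.
df = pd.DataFrame({"cat_column": ["low", "high", "low", "medium"]})

mapping = {"low": 1, "high": 0}

# Series.map looks each value up in the dict; unknown labels become NaN.
encoded = df["cat_column"].map(mapping)

print(encoded.tolist())
print("unmapped rows:", encoded.isna().sum())
```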
Pandas: convert categories to numbers
First, change the type of the column:
df.cc = pd.Categorical(df.cc)
Now the data look the same but are stored categorically. To capture the category codes:
df['code'] = df.cc.cat.codes
Now you have:
   cc  temp  code
0  US  37.0     2
1  CA  12.0     1
2  US  35.0     2
3  AU  20.0     0
If you don't want to modify your DataFrame but simply get the codes:
df.cc.astype('category').cat.codes
Or use the categorical column as an index:
df2 = pd.DataFrame(df.temp)
df2.index = pd.CategoricalIndex(df.cc)
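The codes above follow the sorted order of the categories (AU, CA, US). If codes in order of first appearance are preferred, pd.factorize is one option. Here is a small sketch using the same cc/temp values as the table above; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

# Categorical codes: categories are sorted, so AU=0, CA=1, US=2.
cat_codes = df["cc"].astype("category").cat.codes

# factorize: codes follow order of first appearance, so US=0, CA=1, AU=2.
fact_codes, uniques = pd.factorize(df["cc"])

print(cat_codes.tolist())
print(fact_codes.tolist(), list(uniques))
```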
Convert a column with categorical data to separate column for each category and transpose variable data of corresponding columns to rows
Use df.pivot() + stack(), as follows (index and columns must be passed as keywords in pandas 2.0 and later):
(df.pivot(index='date', columns='type')
   .stack(level=0)
   .rename_axis(index=['date', 'hour'], columns=None)
).reset_index()
Result:
date hour A B C
0 2021-01-01 1 6.0 0.1 1.0
1 2021-01-01 2 7.0 0.2 2.0
2 2021-01-01 3 8.0 0.3 3.0
3 2021-02-02 1 6.0 0.1 1.0
4 2021-02-02 2 7.0 0.2 2.0
5 2021-02-02 3 8.0 0.3 3.0
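The input frame is not shown in the answer, so the sketch below assumes one: a date column, a type column, and one column per hour (named "1", "2", "3" here, which is a guess). With that assumed input, the pivot-then-stack chain produces a table of the shape shown above:

```python
import pandas as pd

# Hypothetical input, reconstructed to match the result table.
df = pd.DataFrame({
    "date": ["2021-01-01"] * 3 + ["2021-02-02"] * 3,
    "type": ["A", "B", "C"] * 2,
    "1": [6.0, 0.1, 1.0] * 2,
    "2": [7.0, 0.2, 2.0] * 2,
    "3": [8.0, 0.3, 3.0] * 2,
})

out = (
    df.pivot(index="date", columns="type")  # columns become (hour, type) pairs
      .stack(level=0)                       # move the hour level into the row index
      .rename_axis(index=["date", "hour"], columns=None)
      .reset_index()
)

print(out)
```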
Convert categorical values to columns in Pandas
You can use:
vendors = df['attribute_name'].unique()
grouped = df.set_index(['Platform_id', 'part_id', 'class_id']).groupby('attribute_name')['attribute_value']
df2 = pd.concat([grouped.get_group(key) for key in vendors], axis=1)
df2.columns = vendors
df2.reset_index(inplace=True)
print(df2)
Platform_id part_id class_id Vendor_name Cache Clock-speed Model_name
0 772 GHTYW2 PROCESSOR None 4 NaN NaN
1 4356 XCVT43 PROCESSOR Intel 4 3.1 NaN
2 23675 TT3344 PROCESSOR Intel NaN 2.3 4500U
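A different route to the same wide layout, offered here as an alternative rather than the method above, is DataFrame.pivot with a list of index columns. It assumes each (Platform_id, part_id, class_id, attribute_name) combination occurs at most once. The long-format rows below are hypothetical and only loosely modeled on the printed output:

```python
import pandas as pd

# Hypothetical long-format input: one row per attribute of a part.
df = pd.DataFrame({
    "Platform_id": [772, 4356, 4356, 4356, 23675, 23675, 23675],
    "part_id": ["GHTYW2", "XCVT43", "XCVT43", "XCVT43",
                "TT3344", "TT3344", "TT3344"],
    "class_id": ["PROCESSOR"] * 7,
    "attribute_name": ["Cache", "Vendor_name", "Cache", "Clock-speed",
                       "Vendor_name", "Clock-speed", "Model_name"],
    "attribute_value": ["4", "Intel", "4", "3.1", "Intel", "2.3", "4500U"],
})

wide = (
    df.pivot(index=["Platform_id", "part_id", "class_id"],
             columns="attribute_name",
             values="attribute_value")
      .rename_axis(columns=None)  # drop the "attribute_name" header label
      .reset_index()
)

print(wide)
```

Attributes that a part does not have come out as NaN, as in the output above.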
Convert Categorical data to numeric percentage in Pandas
You want a cross-tabulation of two factors (col1 and col2) with the frequencies normalized over each row. To do this, use pd.crosstab() with normalize set to 'index':
>> df = pd.DataFrame({'col1': list('aaaaaabbb'), 'col2': list('xyxzzzxyx')})
>> pd.crosstab(df['col1'], df['col2'], normalize='index') * 100
col2 x y z
col1
a 33.333333 16.666667 50.0
b 66.666667 33.333333 0.0
If you want to use multiple factors, just call crosstab with a list of factors:
>> df['col3'] = list('112231345')
>> pd.crosstab([df['col1'], df['col3']], df['col2'], normalize='index') * 100
col2 x y z
col1 col3
a 1 33.333333 33.333333 33.333333
2 50.000000 0.000000 50.000000
3 0.000000 0.000000 100.000000
b 3 100.000000 0.000000 0.000000
4 0.000000 100.000000 0.000000
5 100.000000 0.000000 0.000000
If you want to round the result, just call round:
>> round(pd.crosstab(df['col1'], df['col2'], normalize='index') * 100, 2)
col2 x y z
col1
a 33.33 16.67 50.0
b 66.67 33.33 0.0
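To see what normalize='index' is doing, it can help to compute the same percentages by hand: take the raw counts and divide each row by its own total. This sketch reuses the col1/col2 data from above; the names counts, manual and builtin are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"col1": list("aaaaaabbb"), "col2": list("xyxzzzxyx")})

# Raw counts per (col1, col2) pair.
counts = pd.crosstab(df["col1"], df["col2"])

# Divide every row by its row total, then scale to percent.
manual = counts.div(counts.sum(axis=1), axis=0) * 100

# The built-in row normalization should give the same table.
builtin = pd.crosstab(df["col1"], df["col2"], normalize="index") * 100

print(counts)
print(manual.round(2))
```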
Identifying the categorical columns of a dataframe
Your independent features include categorical data. The error is raised because some columns hold strings, which cannot be interpreted as floats when training the model.
My suggestion is to use get_dummies.
This example might help you:
import pandas as pd
r = pd.DataFrame(['France','Japan','Spain','France','USA'],columns= ['Country'])
r['gender'] = ['male','female','female','female','male']
r = pd.get_dummies(r)
r.head()
Country_France Country_Japan ... gender_female gender_male
0 1 0 ... 0 1
1 0 1 ... 1 0
2 0 0 ... 1 0
3 1 0 ... 1 0
4 0 0 ... 0 1
[5 rows x 6 columns]
All categorical columns are automatically converted using one-hot encoding. Once you have converted your categorical data, you can fit the LogisticRegression.
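Note that since pandas 2.0, get_dummies returns boolean columns by default rather than 0/1 integers. A small sketch that asks for integers explicitly and names the columns to encode (the frame mirrors the example above; the column names are illustrative):

```python
import pandas as pd

r = pd.DataFrame({
    "Country": ["France", "Japan", "Spain", "France", "USA"],
    "gender": ["male", "female", "female", "female", "male"],
})

# columns= restricts encoding to the listed columns;
# dtype=int yields 0/1 instead of True/False.
dummies = pd.get_dummies(r, columns=["Country", "gender"], dtype=int)

print(dummies.columns.tolist())
print(dummies.head())
```

Listing the columns explicitly also protects against numeric columns being left out or string ID columns being encoded by accident.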
Pandas DataFrame: How to convert numeric columns into pairwise categorical data?
Use DataFrame.stack with filtering and Index.to_frame:
s = df.stack()
df = s[s!=0].index.to_frame(index=False).rename(columns={1:'result'})
print (df)
id result
0 0 A
1 0 D
2 1 A
3 1 B
4 2 A
5 2 B
6 2 C
7 3 D
8 5 B
Or, if performance is important, use numpy.where to get the indices of the matching values and pass them to the DataFrame constructor:
i, c = np.where(df != 0)
df = pd.DataFrame({'id':df.index.values[i],
'result':df.columns.values[c]})
print (df)
id result
0 0 A
1 0 D
2 1 A
3 1 B
4 2 A
5 2 B
6 2 C
7 3 D
8 5 B
EDIT: To also keep the values, for the first solution:
s = df.stack()
df = s[s!=0].reset_index()
df.columns= ['id','result','vals']
print (df)
id result vals
0 0 A 3
1 0 D 1
2 1 A 4
3 1 B 1
4 2 A 1
5 2 B 7
6 2 C 20
7 3 D 4
8 5 B 1
For the second:
df = pd.DataFrame({'id':df.index.values[i],
'result':df.columns.values[c],
'vals':df.values[i,c]})
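The input frame is not shown, so here is a self-contained sketch with an assumed wide frame whose non-zero cells match the id/result/vals output above. It runs both variants side by side; the names stacked and long are illustrative:

```python
import numpy as np
import pandas as pd

# Assumed input: rows are ids, columns are categories, 0 means "absent".
df = pd.DataFrame(
    {"A": [3, 4, 1, 0, 0, 0],
     "B": [0, 1, 7, 0, 0, 1],
     "C": [0, 0, 20, 0, 0, 0],
     "D": [1, 0, 0, 4, 0, 0]}
)
df.index.name = "id"

# Variant 1: stack to long form, then keep the non-zero entries.
s = df.stack()
stacked = s[s != 0].reset_index()
stacked.columns = ["id", "result", "vals"]

# Variant 2: locate the non-zero cells with numpy and index directly.
i, c = np.where((df != 0).to_numpy())
long = pd.DataFrame({"id": df.index.values[i],
                     "result": df.columns.values[c],
                     "vals": df.to_numpy()[i, c]})

print(stacked)
print(long)
```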