Factorize a column of strings in pandas
series
RowX yes
RowY no
RowW yes
RowJ no
RowA yes
RowR no
RowX yes
RowY yes
RowW yes
RowJ yes
RowA yes
RowR no
Name: Column 3, dtype: object
pd.factorize
1 - series.factorize()[0]
array([1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])
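To see why subtracting from 1 gives the desired result, it may help to look at both values that factorize returns. The sketch below rebuilds the example series from the listing above (the construction itself is an assumption, since the original does not show it) and inspects the codes and the unique values.

```python
import pandas as pd

# Rebuild the example series (values copied from the listing above).
series = pd.Series(
    ['yes', 'no', 'yes', 'no', 'yes', 'no',
     'yes', 'yes', 'yes', 'yes', 'yes', 'no'],
    index=['RowX', 'RowY', 'RowW', 'RowJ', 'RowA', 'RowR'] * 2,
    name='Column 3',
)

# factorize returns (codes, uniques); codes follow order of first appearance.
codes, uniques = series.factorize()
print(list(uniques))   # ['yes', 'no']
print(codes[:4])       # [0 1 0 1]

# 'yes' was seen first, so it got code 0; subtracting from 1 swaps 0 and 1.
flipped = 1 - codes
```

This trick only works for a two-valued column whose first entry is the value that should become 1.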
np.where
np.where(series == 'yes', 1, 0)
array([1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])
pd.Categorical / astype('category')
pd.Categorical(series).codes
array([1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0], dtype=int8)
series.astype('category').cat.codes
RowX 1
RowY 0
RowW 1
RowJ 0
RowA 1
RowR 0
RowX 1
RowY 1
RowW 1
RowJ 1
RowA 1
RowR 0
dtype: int8
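With the categorical approach, 'yes' ends up as 1 only because categories are sorted by default and 'no' sorts before 'yes'. If a different numbering is wanted, the order can be given explicitly. A minimal sketch, using a short illustrative list rather than the original series:

```python
import pandas as pd

values = ['yes', 'no', 'yes', 'no']  # illustrative data, not the original series

# Default: categories are sorted, so 'no' -> 0 and 'yes' -> 1.
default_codes = pd.Categorical(values).codes

# Explicit order: the position of a value in `categories` is its code.
explicit_codes = pd.Categorical(values, categories=['yes', 'no']).codes

print(default_codes.tolist())   # [1, 0, 1, 0]
print(explicit_codes.tolist())  # [0, 1, 0, 1]
```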
pd.Series.replace
series.replace({'yes' : 1, 'no' : 0})
RowX 1
RowY 0
RowW 1
RowJ 0
RowA 1
RowR 0
RowX 1
RowY 1
RowW 1
RowJ 1
RowA 1
RowR 0
Name: Column 3, dtype: int64
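A related option is Series.map with the same dict. The difference, as sketched below on made-up data, is that replace leaves values without a dict entry untouched, while map turns them into NaN, which makes unexpected values easier to spot.

```python
import pandas as pd

series = pd.Series(['yes', 'no', 'yes', 'maybe'])  # illustrative data

mapping = {'yes': 1, 'no': 0}

# 'maybe' has no entry in the dict, so map returns NaN for it.
mapped = series.map(mapping)
print(mapped.isna().sum())  # 1
```

Because of the NaN, the result here is a float column; with only 'yes' and 'no' present it would be an integer column.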
A fun, generalised version of the above:
series.replace({r'^(?!yes).*$' : 0}, regex=True).astype(bool).astype(int)
RowX 1
RowY 0
RowW 1
RowJ 0
RowA 1
RowR 0
RowX 1
RowY 1
RowW 1
RowJ 1
RowA 1
RowR 0
Name: Column 3, dtype: int64
Anything that does not start with "yes" becomes 0.
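The negative lookahead in that pattern only inspects the start of each string, so a value such as 'yesterday' would still come out as 1. If only an exact 'yes' should count, a plain equality test is a simpler sketch (the data below is made up for illustration):

```python
import pandas as pd

series = pd.Series(['yes', 'no', 'YES', '', 'yesterday', 'yes'])  # illustrative data

# eq() compares element-wise; astype(int) turns True/False into 1/0.
is_yes = series.eq('yes').astype(int)
print(is_yes.tolist())  # [1, 0, 0, 0, 0, 1]
```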
How to factorize selected columns in pandas
Use:
def gnumeric_func(data, columns):
    data[columns] = data[columns].apply(lambda x: pd.factorize(x)[0])
    return data
For all columns:
def gnumeric_func(data):
    data = data.apply(lambda x: pd.factorize(x)[0])
    return data
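A usage sketch for the selected-columns variant follows. The function name factorize_columns and the sample frame are hypothetical; the only change from the function above is that it works on a copy, so the caller's frame is left as it was.

```python
import pandas as pd

def factorize_columns(data, columns):
    """Return a copy of `data` with the given columns replaced by integer codes."""
    out = data.copy()
    # Each column is factorized on its own, so codes are not shared across columns.
    out[columns] = out[columns].apply(lambda col: pd.factorize(col)[0])
    return out

df = pd.DataFrame({
    'A': ['x', 'y', 'x'],
    'B': ['p', 'p', 'q'],
    'C': [1.5, 2.5, 3.5],
})

encoded = factorize_columns(df, ['A', 'B'])
print(encoded['A'].tolist())  # [0, 1, 0]
print(encoded['B'].tolist())  # [0, 0, 1]
```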
pandas.factorize on an entire data frame
You can use apply if you need to factorize each column separately:
df = pd.DataFrame({'A':['type1','type2','type2'],
                   'B':['type1','type2','type3'],
                   'C':['type1','type3','type3']})
print (df)
A B C
0 type1 type1 type1
1 type2 type2 type3
2 type2 type3 type3
print (df.apply(lambda x: pd.factorize(x)[0]))
A B C
0 0 0 0
1 1 1 1
2 1 2 1
If the same string value should get the same number in every column:
print (df.stack().rank(method='dense').unstack())
A B C
0 1.0 1.0 1.0
1 2.0 2.0 3.0
2 2.0 3.0 3.0
If you need to apply the function only for some columns, use a subset:
df[['B','C']] = df[['B','C']].stack().rank(method='dense').unstack()
print (df)
A B C
0 type1 1.0 1.0
1 type2 2.0 3.0
2 type2 3.0 3.0
Solution with factorize:
stacked = df[['B','C']].stack()
df[['B','C']] = pd.Series(stacked.factorize()[0], index=stacked.index).unstack()
print (df)
A B C
0 type1 0 0
1 type2 1 2
2 type2 2 2
Translating the numbers back is possible with map and a dict (starting again from the original string-valued df); the duplicates need to be removed first with drop_duplicates:
vals = df.stack().drop_duplicates().values
b = [x for x in df.stack().drop_duplicates().rank(method='dense')]
d1 = dict(zip(b, vals))
print (d1)
{1.0: 'type1', 2.0: 'type2', 3.0: 'type3'}
df1 = df.stack().rank(method='dense').unstack()
print (df1)
A B C
0 1.0 1.0 1.0
1 2.0 2.0 3.0
2 2.0 3.0 3.0
print (df1.stack().map(d1).unstack())
A B C
0 type1 type1 type1
1 type2 type2 type3
2 type2 type3 type3
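When the codes come from factorize rather than rank, the array of unique values already acts as the reverse lookup, so no separate dict is needed. The following is a sketch of that round trip; it flattens the frame with NumPy instead of stack, which is a substitution made here for brevity and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

# Factorize all cells at once so equal strings share one code.
codes, uniques = pd.factorize(df.to_numpy().ravel())
encoded = pd.DataFrame(codes.reshape(df.shape),
                       index=df.index, columns=df.columns)

# uniques[code] recovers the original string, so indexing with the
# whole code array decodes every cell in one step.
decoded = pd.DataFrame(uniques[encoded.to_numpy()],
                       index=df.index, columns=df.columns)

print(encoded.values.tolist())  # [[0, 0, 0], [1, 1, 2], [1, 2, 2]]
```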
Pandas how to Factorize in Unusual Text Order
Consider a DataFrame with a string column as shown:
df = pd.DataFrame(dict(col=['A','B','AA','C','BB','AAA','BC','AB','AA']))
df
Custom Function:
(i) Take unique entries from the column under consideration.
(ii) Group them by string length, sort each group lexicographically, and stack the groups horizontally.
(iii) Factorize them.
def complex_factorize(df, col):
    ser = pd.Series(df[col].unique())
    func = lambda x: sorted(x.values.ravel())
    arr = np.hstack(ser.groupby(ser.str.len()).apply(func).values)
    return pd.factorize(arr)
Take the labels and the unique elements returned by factorize and feed them to DataFrame.replace to construct the mapping.
val, ser = complex_factorize(df, 'col')
df.replace(ser, val)
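The same ordering (shorter strings first, alphabetical within each length) can also be reached by sorting the unique values with a (length, string) key and mapping them to their positions. This is a different technique from the groupby version above, sketched here as an alternative; the names order and mapping are illustrative.

```python
import pandas as pd

df = pd.DataFrame(dict(col=['A', 'B', 'AA', 'C', 'BB', 'AAA', 'BC', 'AB', 'AA']))

# Sort unique values by length first, then lexicographically.
order = sorted(df['col'].unique(), key=lambda s: (len(s), s))
print(order)  # ['A', 'B', 'C', 'AA', 'AB', 'BB', 'BC', 'AAA']

# The position in the sorted list becomes the code.
mapping = {value: code for code, value in enumerate(order)}
codes = df['col'].map(mapping)
print(codes.tolist())  # [0, 1, 3, 2, 5, 7, 6, 4, 3]
```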
factorize columns of a pandas data frame
First line: iteritems iterates over the columns of a dataframe and returns (column_name, actual_column) pairs. By zipping and destructuring in the for line, you end up with:
train_name: name of the current column in the train dataframe;
test_name: name of the corresponding column in the test dataframe;
train_series and test_series: the actual columns (as pandas Series).
Second line: this checks if the column is of type object, essentially meaning that it contains strings and is a categorical column.
Third line: factorize will return, in second position, the list of unique values (or categorical labels) in the provided column, and, in first position, the indices that would let you recreate the original column from the unique values. In other words:
labels, uniques = pd.factorize(column)
for i in range(len(column)):
    print(column[i] == uniques[labels[i]])  # True
Continuing with destructuring assignments, the current train column train[train_name] will be replaced by its index-based representation, while tmp_indexer will contain the unique values in the original train[train_name].
Fourth line: get_indexer will return the indices where the values in test[test_name] are to be found in tmp_indexer. As a result, the current test column is replaced by a list of indices in the exact same way the corresponding train column was in the line above.
End result: both columns in train and test have gone from a series of strings (categorical values) to a series of numerical index values, both indexed on the same (temporary) object.
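The code being explained is not reproduced in this answer. A reconstruction that matches the line-by-line description might look like the sketch below. Several things in it are assumptions: the sample frames are made up, items is used because iteritems was removed in pandas 2.0, and the type check uses pd.api.types.is_string_dtype instead of a direct comparison with the object dtype so that it also covers the dedicated string dtype.

```python
import pandas as pd

# Hypothetical train/test frames with the same columns.
train = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': [1, 2, 3]})
test = pd.DataFrame({'color': ['blue', 'green'], 'size': [4, 5]})

# Line 1: walk over both frames column by column, in parallel.
for (train_name, train_series), (test_name, test_series) in zip(train.items(), test.items()):
    # Line 2: only string-like (categorical) columns are encoded.
    if pd.api.types.is_string_dtype(train_series):
        # Line 3: codes replace the train column; the uniques come back as an Index.
        train[train_name], tmp_indexer = pd.factorize(train[train_name])
        # Line 4: look up each test value in the train uniques.
        test[test_name] = tmp_indexer.get_indexer(test[test_name])

print(train['color'].tolist())  # [0, 1, 0]
print(test['color'].tolist())   # [1, -1]
```

Note the -1 for 'green': get_indexer returns -1 for values that never occurred in the train column.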
How to iterate over pandas dataframe columns and factorize based on a conditional?
This is closer to what you want:
for i in columns:
    if dataframe[i].dtypes == 'object':
        xtrain[i] = pd.Categorical(pd.factorize(dataframe[i])[0])
And since you are doing MLP, let us use LabelEncoder:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for i in columns:
    if dataframe[i].dtypes == 'object':
        dataframe[i] = le.fit_transform(dataframe[i])
Factorize column before or after train/test split?
I would encode/factorize it before you split it into testing and training datasets.
This way you ensure consistent factorization in both splits.
I would also suggest looking into sklearn.preprocessing.LabelEncoder.
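A small sketch of why the order matters, using a made-up city column and a plain positional split in place of a real train/test split: codes assigned before the split stay consistent, while factorizing the test part on its own restarts the numbering.

```python
import pandas as pd

df = pd.DataFrame({'city': ['Oslo', 'Rome', 'Oslo', 'Lima', 'Rome', 'Lima']})

# Factorize first: Oslo -> 0, Rome -> 1, Lima -> 2 for the whole column.
df['city_code'] = pd.factorize(df['city'])[0]

# Then split (a simple positional split stands in for a real one).
train, test = df.iloc[:4], df.iloc[4:]
print(test['city_code'].tolist())  # [1, 2]

# Factorizing the test part separately restarts at 0: Rome -> 0, Lima -> 1.
separate_codes = pd.factorize(test['city'])[0]
print(separate_codes.tolist())     # [0, 1]
```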
Turning a column of strings into a column of integers in Pandas
How about using factorize?
>>> labels, uniques = df.A.factorize()
>>> df.A = labels
>>> df
A B
0 0 4
1 1 4
2 0 4
3 2 4
http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.factorize.html
Pandas: How to convert column of string objects representing categories to integers?
You can use factorize to encode the different values in the column as integers:
df['day'] = pd.factorize(df.day)[0]
This sets the 'day' column of the example DataFrame to the following:
>>> df
day hour price booked
0 0 7 12 True
1 0 8 12 False
2 1 7 13 True
3 2 8 13 False
4 0 7 15 True
5 0 8 13 False
6 1 7 13 True
7 1 8 15 False
The 'day' column is of integer type:
>>> df.dtypes
day int64
hour int64
price float64
booked bool
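Two details of factorize are worth knowing when converting a column this way: missing values receive the code -1 and are left out of the uniques, and sort=True numbers the values in sorted order instead of order of first appearance. A minimal sketch on made-up data:

```python
import pandas as pd

days = pd.Series(['tue', None, 'mon', 'tue'])  # illustrative data with a missing value

# Default: codes follow first appearance; the missing value gets -1.
codes, uniques = pd.factorize(days)
print(codes.tolist())   # [0, -1, 1, 0]
print(list(uniques))    # ['tue', 'mon']

# sort=True: uniques are sorted first, so 'mon' -> 0 and 'tue' -> 1.
sorted_codes, sorted_uniques = pd.factorize(days, sort=True)
print(sorted_codes.tolist())  # [1, -1, 0, 1]
```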