Factorize a Column of Strings in Pandas

Factorize a column of strings in pandas

series

RowX    yes
RowY     no
RowW    yes
RowJ     no
RowA    yes
RowR     no
RowX    yes
RowY    yes
RowW    yes
RowJ    yes
RowA    yes
RowR     no
Name: Column 3, dtype: object

`pd.factorize`

1 - series.factorize()[0]
array([1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])
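
The `1 -` is there because `factorize` numbers values in order of first appearance, so the first value it sees ('yes' here) gets code 0. A minimal sketch of that, using a shortened stand-in for the series above (the four values are made up for illustration):

```python
import pandas as pd

# Shortened stand-in for the series above (values only, default index).
series = pd.Series(['yes', 'no', 'yes', 'no'], dtype=object)

codes, uniques = series.factorize()
# Codes follow order of first appearance: 'yes' -> 0, 'no' -> 1.
print(codes.tolist())      # [0, 1, 0, 1]
print(list(uniques))       # ['yes', 'no']

# Flipping the two codes gives yes -> 1, no -> 0.
flipped = 1 - codes
print(flipped.tolist())    # [1, 0, 1, 0]
```

If the series happened to start with 'no', the subtraction would not be needed, so this trick depends on the data order.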

`np.where`

np.where(series == 'yes', 1, 0)
array([1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])

`pd.Categorical`/`astype('category')`

pd.Categorical(series).codes
array([1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0], dtype=int8)

series.astype('category').cat.codes

RowX    1
RowY    0
RowW    1
RowJ    0
RowA    1
RowR    0
RowX    1
RowY    1
RowW    1
RowJ    1
RowA    1
RowR    0
dtype: int8

`pd.Series.replace`

series.replace({'yes' : 1, 'no' : 0})
 
RowX    1
RowY    0
RowW    1
RowJ    0
RowA    1
RowR    0
RowX    1
RowY    1
RowW    1
RowJ    1
RowA    1
RowR    0
Name: Column 3, dtype: int64

A fun, generalised version of the above:

series.replace({r'^(?!yes).*$' : 0}, regex=True).astype(bool).astype(int)

RowX    1
RowY    0
RowW    1
RowJ    0
RowA    1
RowR    0
RowX    1
RowY    1
RowW    1
RowJ    1
RowA    1
RowR    0
Name: Column 3, dtype: int64

Anything that does not start with "yes" becomes 0; "yes" (and any other string beginning with "yes") becomes 1.
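
One subtlety: the pattern `^(?!yes).*$` only looks at how the string starts, so a value such as 'yesterday' would also end up as 1. If exact matching is what you want, a plain equality test is probably simpler. A small sketch with an invented series, where `str.startswith` stands in for what the regex version effectively tests on string values:

```python
import pandas as pd

# Invented values, including one that merely starts with 'yes'.
series = pd.Series(['yes', 'no', 'yesterday', 'maybe'], dtype=object)

# Exact match: only the literal string 'yes' becomes 1.
exact = series.eq('yes').astype(int)
print(exact.tolist())    # [1, 0, 0, 0]

# Prefix match: anything beginning with 'yes' becomes 1.
prefix = series.str.startswith('yes').astype(int)
print(prefix.tolist())   # [1, 0, 1, 0]
```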

How to factorize selected columns in pandas

Use:

def gnumeric_func(data, columns):
    # Replace each selected column with its integer codes (modifies data in place).
    data[columns] = data[columns].apply(lambda x: pd.factorize(x)[0])
    return data

For all columns:

def gnumeric_func(data):
    # Factorize every column independently and return the new DataFrame.
    data = data.apply(lambda x: pd.factorize(x)[0])
    return data
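
Note that the column-subset version assigns into the DataFrame it receives, so the caller's frame changes too. If that is not wanted, a copying variant is one option. The sketch below uses a hypothetical helper name, `factorize_columns`, and a made-up frame:

```python
import pandas as pd

def factorize_columns(data, columns):
    """Return a copy of `data` with the given columns replaced by integer codes."""
    out = data.copy()
    out[columns] = out[columns].apply(lambda col: pd.factorize(col)[0])
    return out

# Made-up example frame: two string columns and one numeric column.
df = pd.DataFrame({'A': ['x', 'y', 'x'],
                   'B': ['p', 'p', 'q'],
                   'C': [10, 20, 30]})

encoded = factorize_columns(df, ['A', 'B'])
print(encoded['A'].tolist())   # [0, 1, 0]
print(encoded['B'].tolist())   # [0, 0, 1]
print(df['A'].tolist())        # ['x', 'y', 'x'], the original is untouched
```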

pandas.factorize on an entire data frame

You can use apply if you need to factorize each column separately:

df = pd.DataFrame({'A':['type1','type2','type2'],
                   'B':['type1','type2','type3'],
                   'C':['type1','type3','type3']})

print (df)
       A      B      C
0  type1  type1  type1
1  type2  type2  type3
2  type2  type3  type3

print (df.apply(lambda x: pd.factorize(x)[0]))
   A  B  C
0  0  0  0
1  1  1  1
2  1  2  1

If the same string value should get the same number in every column:

print (df.stack().rank(method='dense').unstack())
     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  3.0
2  2.0  3.0  3.0

If you need to apply the function only for some columns, use a subset:

df[['B','C']] = df[['B','C']].stack().rank(method='dense').unstack()
print (df)
       A    B    C
0  type1  1.0  1.0
1  type2  2.0  3.0
2  type2  3.0  3.0

Solution with factorize:

stacked = df[['B','C']].stack()
df[['B','C']] = pd.Series(stacked.factorize()[0], index=stacked.index).unstack()
print (df)
       A  B  C
0  type1  0  0
1  type2  1  2
2  type2  2  2

Translating the numbers back is possible with map and a dict; build the dict from the distinct values, which drop_duplicates gives you (this assumes df still holds the original strings):

vals = df.stack().drop_duplicates().values
b = [x for x in df.stack().drop_duplicates().rank(method='dense')]

d1 = dict(zip(b, vals))
print (d1)
{1.0: 'type1', 2.0: 'type2', 3.0: 'type3'}

df1 = df.stack().rank(method='dense').unstack()
print (df1)
     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  3.0
2  2.0  3.0  3.0

print (df1.stack().map(d1).unstack())
       A      B      C
0  type1  type1  type1
1  type2  type2  type3
2  type2  type3  type3
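
The same round trip also works with factorize instead of rank, since factorize hands back the unique values alongside the codes. A sketch, assuming the original string DataFrame from the start of this answer:

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

# Factorize all cells together so each string gets one code across columns.
stacked = df.stack()
codes, uniques = pd.factorize(stacked)
encoded = pd.Series(codes, index=stacked.index).unstack()
print(encoded)

# Decode by looking every code up in the array of unique values.
decoded = pd.DataFrame(uniques.to_numpy()[encoded.to_numpy()],
                       index=encoded.index, columns=encoded.columns)
print(decoded)
```

Here the codes start at 0 and stay integers, whereas rank produces floats starting at 1.0.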

Pandas how to Factorize in Unusual Text Order

Consider a DF with a string column as shown:

df = pd.DataFrame(dict(col=['A','B','AA','C','BB','AAA','BC','AB','AA']))
df


Custom Function:

(i) Take unique entries from the column under consideration.

(ii) Group them by string length, sort each group lexicographically, and concatenate the groups in order of length.

(iii) Factorize them.

def complex_factorize(df, col):
    ser = pd.Series(df[col].unique())
    func = lambda x: sorted(x.values.ravel())
    arr = np.hstack(ser.groupby(ser.str.len()).apply(func).values)
    return pd.factorize(arr)

Take the codes and the unique values returned by factorize and pass them to DataFrame.replace, which applies the mapping (each unique value is replaced by its code).

val, ser = complex_factorize(df, 'col')
df.replace(ser, val)
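
For comparison, the same ordering (shorter strings first, then alphabetical within each length) can be had with a sort key and a dict. This is an alternative sketch, not the function above; the names `ordered` and `mapping` are just illustrative:

```python
import pandas as pd

df = pd.DataFrame(dict(col=['A', 'B', 'AA', 'C', 'BB', 'AAA', 'BC', 'AB', 'AA']))

# Order unique strings by length first, then alphabetically.
ordered = sorted(df['col'].unique(), key=lambda s: (len(s), s))
mapping = {value: code for code, value in enumerate(ordered)}

codes = df['col'].map(mapping)
print(list(ordered))    # ['A', 'B', 'C', 'AA', 'AB', 'BB', 'BC', 'AAA']
print(codes.tolist())   # [0, 1, 3, 2, 5, 7, 6, 4, 3]
```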


factorize columns of a pandas data frame

First line: iteritems iterates over the columns of a dataframe and returns (column_name, actual_column) pairs. By zipping and destructuring in the for line, you end up with:

train_name: name of current column in train dataframe;
test_name: name of corresponding column in test dataframe;
train_series and test_series: actual columns (as pandas Series).

Second line: this checks if the column is of type Object, essentially meaning that it contains strings and is a categorical column.

Third line: factorize will return, in second position, the list of unique values (or categorical labels) in the provided column, and, in first position, the indices that would let you recreate the original column from the unique values. In other words:

labels, uniques = pd.factorize(column)
for i in range(len(column)):
    print(column[i] == uniques[labels[i]])  # True

Continuing with destructuring assignments, the current train column train[train_name] is replaced by its index-based representation, while tmp_indexer holds the unique values from the original train[train_name].

Fourth line: get_indexer will return the indices where the values in test[test_name] are to be found in tmp_indexer. As a result, the current test column is replaced by a list of indices in the exact same way the corresponding train column was in the line above.

End result: both columns in train and test have gone from a series of strings (categorical values) to a series of numerical index values, both indexed on the same (temporary) object.
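
The snippet being walked through is not reproduced here. A reconstruction that fits the description might look like the following; treat it as a sketch rather than the asker's exact code. The `train` and `test` frames are invented, and `items` is used in place of `iteritems`, which was removed in pandas 2.0:

```python
import pandas as pd

# Invented frames; the string column is given dtype=object explicitly.
train = pd.DataFrame({'city': pd.Series(['paris', 'rome', 'paris'], dtype=object),
                      'size': [1, 2, 3]})
test = pd.DataFrame({'city': pd.Series(['rome', 'oslo', 'paris'], dtype=object),
                     'size': [4, 5, 6]})

# Line 1: walk both frames column by column in parallel.
for (train_name, train_series), (test_name, test_series) in zip(train.items(), test.items()):
    # Line 2: only touch object (string) columns.
    if train_series.dtype == object:
        # Line 3: codes replace the train column; the uniques are kept as an Index.
        train[train_name], tmp_indexer = pd.factorize(train[train_name])
        # Line 4: look the test values up in the same uniques.
        test[test_name] = tmp_indexer.get_indexer(test[test_name])

print(train['city'].tolist())   # [0, 1, 0]
print(test['city'].tolist())    # [1, -1, 0]
```

A value that appears only in test ('oslo' here) comes out as -1, because get_indexer returns -1 for anything it cannot find.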

How to iterate over pandas dataframe columns and factorize based on a conditional?

It should look more like this:

for i in columns:
    if dataframe[i].dtypes=='object':
        xtrain[i] = pd.Categorical(pd.factorize(dataframe[i])[0])

And since you are training an MLP, you could use LabelEncoder instead:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

for i in columns:
    if dataframe[i].dtypes=='object':
        dataframe[i] = le.fit_transform(dataframe[i])
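
Both loops throw away the mapping from codes back to strings (the single LabelEncoder is re-fitted on every column, so it only remembers the last one). If the original labels are needed later, one option is to keep the unique values per column. A pandas-only sketch with an invented frame; `uniques_by_column` is a hypothetical name:

```python
import pandas as pd

# Invented frame: two string columns and one numeric column.
dataframe = pd.DataFrame({
    'color': pd.Series(['red', 'blue', 'red'], dtype=object),
    'shape': pd.Series(['disc', 'disc', 'cube'], dtype=object),
    'weight': [1.5, 2.0, 0.5],
})

uniques_by_column = {}
for name in dataframe.columns:
    if dataframe[name].dtype == object:
        codes, uniques = pd.factorize(dataframe[name])
        dataframe[name] = codes
        uniques_by_column[name] = uniques   # remember how to decode this column

print(dataframe['color'].tolist())   # [0, 1, 0]
print(dataframe['shape'].tolist())   # [0, 0, 1]

# Decode one column again by indexing its uniques with the codes.
restored_color = uniques_by_column['color'][dataframe['color'].to_numpy()]
print(list(restored_color))          # ['red', 'blue', 'red']
```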

Factorize column before or after train/test split?

I would encode/factorize it before you split it into testing and training datasets.
This way you ensure consistent factorization in both splits.

I would also suggest looking into sklearn.preprocessing.LabelEncoder.
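
If the data has to be split first, similar consistency can be had by taking the category list from the training part and reusing it on the test part. A sketch with made-up data, using pd.Categorical with an explicit `categories` argument:

```python
import pandas as pd

# Made-up columns; 'd' occurs only in the test part.
train_col = pd.Series(['a', 'b', 'a', 'c'])
test_col = pd.Series(['b', 'd', 'a'])

# Fix the category order from the training data only.
categories = train_col.unique()

train_codes = pd.Categorical(train_col, categories=categories).codes
test_codes = pd.Categorical(test_col, categories=categories).codes

print(train_codes.tolist())   # [0, 1, 0, 2]
print(test_codes.tolist())    # [1, -1, 0]
```

Values never seen in training get the code -1, so they still need handling downstream.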

Turning a column of strings into a column of integers in Pandas

How about using factorize?

>>> labels, uniques = df.A.factorize()
>>> df.A = labels
>>> df
   A  B
0  0  4
1  1  4
2  0  4
3  2  4

http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.factorize.html
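
One thing to watch for: by default, missing values are not given a code of their own and show up as -1. A short sketch with a hypothetical frame (the values in column A are invented, since the original df is not shown):

```python
import pandas as pd

# Hypothetical frame with one missing value in column A.
df = pd.DataFrame({'A': pd.Series(['x', None, 'x', 'y'], dtype=object),
                   'B': [4, 4, 4, 4]})

labels, uniques = df['A'].factorize()
print(labels.tolist())   # [0, -1, 0, 1]
print(list(uniques))     # ['x', 'y']

df['A'] = labels
```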

Pandas: How to convert column of string objects representing categories to integers?

You can use factorize to encode the different values in the column as integers:

df['day'] = pd.factorize(df.day)[0]

This sets the 'day' column of the example DataFrame to the following:

>>> df
   day  hour  price booked
0    0     7     12   True
1    0     8     12  False
2    1     7     13   True
3    2     8     13  False
4    0     7     15   True
5    0     8     13  False
6    1     7     13   True
7    1     8     15  False

The 'day' column is of integer type:

>>> df.dtypes
day         int64
hour        int64
price     float64
booked       bool
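
The integers follow the order in which each value first appears, not any natural order of the days. If a stable, sorted numbering is preferred, factorize accepts `sort=True`. A sketch with an invented `day` series (the example DataFrame's actual day values are not shown here):

```python
import pandas as pd

# Invented day values.
day = pd.Series(['wed', 'mon', 'tue', 'mon'])

# Default: codes follow order of first appearance.
codes, uniques = pd.factorize(day)
print(codes.tolist(), list(uniques))                 # [0, 1, 2, 1] ['wed', 'mon', 'tue']

# sort=True: the unique values are sorted, and codes refer to that sorted order.
codes_sorted, uniques_sorted = pd.factorize(day, sort=True)
print(codes_sorted.tolist(), list(uniques_sorted))   # [2, 0, 1, 0] ['mon', 'tue', 'wed']
```

Sorting here is alphabetical, so it will not give calendar order for weekday names.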
