Closest Equivalent of a Factor Variable in Python Pandas

This question is about a year old, but since it is still open, here is an update: pandas has introduced a categorical dtype that behaves very much like factors in R. See this link for more information:

http://pandas-docs.github.io/pandas-docs-travis/categorical.html

Here is a snippet from the link above showing how to create a "factor" variable in pandas:

In [1]: s = Series(["a","b","c","a"], dtype="category")

In [2]: s
Out[2]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a < b < c]
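
For R users who rely on ordered factors, here is a minimal sketch of the same idea with explicit, ordered categories. The level names and the variable name `sizes` are made up for illustration; the `.cat` accessor gives access to the levels and the underlying integer codes.

```python
import pandas as pd

# Hypothetical example data; the level names are made up for illustration.
sizes = pd.Series(
    pd.Categorical(
        ["low", "high", "medium", "low"],
        categories=["low", "medium", "high"],  # declared level order
        ordered=True,
    )
)

print(sizes.cat.categories.tolist())  # the levels, in the declared order
print(sizes.cat.codes.tolist())       # integer code of each value
print((sizes > "low").tolist())       # comparisons respect the declared order
```

Without `ordered=True` the categories still exist, but comparisons such as `>` are not allowed.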

Python equivalent to R 's factor data type

Try pandas

http://pandas.pydata.org/

and take a look at:

http://pandas.pydata.org/pandas-docs/stable/categorical.html

It's an amazing library that does pretty much what you're asking for.

How to generate pandas DataFrame column of Categorical from string column?

The only workaround for pandas pre-0.15 I found is as follows:

The column must be converted to a Categorical for the classifier, but numpy will immediately coerce the levels back to int, losing the factor information. So store the factor in a global variable outside the dataframe:

train_LocationNFactor = pd.Categorical.from_array(train['LocationNormalized']) # default order: alphabetical

train['LocationNFactor'] = train_LocationNFactor.labels # insert in dataframe

[UPDATE: pandas 0.15+ added decent support for Categorical]
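
With pandas 0.15 or later, a rough sketch of the same thing could look like the following. The `train` frame here is a made-up stand-in and the column name `LocationNCode` is hypothetical; the point is only that the categorical column can now live inside the DataFrame, and the integer codes can be derived from it when needed.

```python
import pandas as pd

# Hypothetical stand-in for the `train` frame in the answer above.
train = pd.DataFrame({"LocationNormalized": ["London", "Leeds", "London", "Bath"]})

# Keep the categorical column itself in the DataFrame ...
train["LocationNFactor"] = train["LocationNormalized"].astype("category")
# ... and, if a classifier needs plain integers, derive the codes from it.
train["LocationNCode"] = train["LocationNFactor"].cat.codes

print(train["LocationNFactor"].cat.categories.tolist())  # levels, sorted alphabetically
print(train["LocationNCode"].tolist())                   # one integer code per row
```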

Applying Decay Factor to Return Data in Pandas and Saving as New Variable

Try this:

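import numpy as np

df     = df.set_index("Date")
# decay weights: 1 for the latest row, .97 for the one before, .97**2, ...
retFac = np.fromfunction(lambda i,j : .97**(i), df.values.shape)[::-1]
df*retFac                       # decay-weighted returns (displayed, not stored)
df['ret_decay'] = retFac[:,0]   # keep the weights as a new column
df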

                AAPL        DE       IBM      MSFT  ret_decay
Date                                                         
2016-01-19       NaN       NaN       NaN       NaN   0.693842
2016-01-20 -0.021780 -0.019701 -0.078557 -0.019177   0.715301
2016-01-21  0.016271  0.014681  0.021823  0.024440   0.737424
2016-01-22  0.036128  0.034555  0.009910  0.019085   0.760231
2016-01-25  0.008539 -0.026477 -0.001068  0.007608   0.783743
2016-01-26 -0.011491 -0.000907  0.004933 -0.001936   0.807983
2016-01-27 -0.048231  0.021428 -0.013007 -0.010279   0.832972
2016-08-16  0.010455  0.010135 -0.006738 -0.005712   0.858734
2016-08-17 -0.007966 -0.006432 -0.005290 -0.000698   0.885293
2016-08-18  0.006277 -0.006603  0.003754  0.000699   0.912673
2016-08-19 -0.006054  0.034406 -0.005735 -0.001222   0.940900
2016-08-22 -0.004707  0.091092 -0.002445  0.001049   0.970000
2016-08-23  0.006305  0.007621  0.006913  0.010304   1.000000

Using the newly added code:

retFac = np.fromfunction(lambda i,j : .97**(i), ret.values.shape)[::-1]
ret*retFac
ret['ret_decay'] = retFac[:,0]
ret

Symbol          AAPL        DE       IBM      MSFT  ret_decay
Date                                                         
2016-08-16       NaN       NaN       NaN       NaN   0.858734
2016-08-17 -0.007966 -0.006432 -0.005290 -0.000698   0.885293
2016-08-18  0.006277 -0.006603  0.003754  0.000699   0.912673
2016-08-19 -0.006054  0.034406 -0.005735 -0.001222   0.940900
2016-08-22 -0.004707  0.091092 -0.002445  0.001049   0.970000
2016-08-23  0.006305  0.007621  0.006913  0.010304   1.000000
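
If the goal is to keep the decay-weighted returns rather than just display them, something along these lines might work. This is only a sketch: the return values are made up, and `decay`, `weights` and `weighted` are names I introduced for illustration. It builds the weights as a 1-D array and multiplies row-wise instead of using np.fromfunction.

```python
import numpy as np
import pandas as pd

# Hypothetical returns; the numbers are made up for illustration.
ret = pd.DataFrame(
    {"AAPL": [0.01, -0.02, 0.03], "MSFT": [0.00, 0.01, -0.01]},
    index=pd.to_datetime(["2016-08-19", "2016-08-22", "2016-08-23"]),
)

decay = 0.97
# Newest row gets weight 1, the one before it 0.97, then 0.97**2, ...
weights = decay ** np.arange(len(ret) - 1, -1, -1)

# Multiply every column by the row weights and keep the result.
weighted = ret.mul(weights, axis=0)
weighted["ret_decay"] = weights
print(weighted)
```

Storing the product in a separate frame leaves the raw returns in `ret` untouched.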

Convert array of string (category) to array of int from a pandas dataframe

If you have a vector of strings or other objects and you want to give it categorical labels, you can use the Factor class (available in the pandas namespace in early releases; later versions renamed it to Categorical):

In [1]: s = Series(['single', 'touching', 'nuclei', 'dusts', 'touching', 'single', 'nuclei'])

In [2]: s
Out[2]: 
0    single
1    touching
2    nuclei
3    dusts
4    touching
5    single
6    nuclei
Name: None, Length: 7

In [4]: Factor(s)
Out[4]: 
Factor:
array([single, touching, nuclei, dusts, touching, single, nuclei], dtype=object)
Levels (4): [dusts nuclei single touching]

The factor has attributes labels and levels:

In [7]: f = Factor(s)

In [8]: f.labels
Out[8]: array([2, 3, 1, 0, 3, 2, 1], dtype=int32)

In [9]: f.levels
Out[9]: Index([dusts, nuclei, single, touching], dtype=object)

This is intended for 1D vectors, so I'm not sure whether it can be applied directly to your problem, but have a look.
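
On current pandas versions, where Factor no longer exists, a minimal sketch of the same steps uses pd.Categorical, with `codes` playing the role of `labels` and `categories` the role of `levels`:

```python
import pandas as pd

s = pd.Series(['single', 'touching', 'nuclei', 'dusts', 'touching', 'single', 'nuclei'])

c = pd.Categorical(s)
print(c.codes.tolist())       # integer code per element (what f.labels returned)
print(c.categories.tolist())  # sorted unique values (what f.levels returned)
```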

BTW I recommend that you ask these questions on the statsmodels and / or scikit-learn mailing list since most of us are not frequent SO users.

Pandas long to wide reshape, by two variables

A simple pivot might be sufficient for your needs but this is what I did to reproduce your desired output:

df['idx'] = df.groupby('Salesman').cumcount()

Just adding a within-group counter/index will get you most of the way there, but the column labels will not be as you desired:

print(df.pivot(index='Salesman',columns='idx')[['product','price']])

        product              price        
idx            0     1     2      0   1   2
Salesman                                   
Knut         bat  ball  wand      5   1   3
Steve        pen   NaN   NaN      2 NaN NaN

To get closer to your desired output I added the following:

df['prod_idx'] = 'product_' + df.idx.astype(str)
df['prc_idx'] = 'price_' + df.idx.astype(str)

product = df.pivot(index='Salesman',columns='prod_idx',values='product')
prc = df.pivot(index='Salesman',columns='prc_idx',values='price')

reshape = pd.concat([product,prc],axis=1)
reshape['Height'] = df.set_index('Salesman')['Height'].drop_duplicates()
print(reshape)

         product_0 product_1 product_2  price_0  price_1  price_2  Height
Salesman                                                                 
Knut           bat      ball      wand        5        1        3       6
Steve          pen       NaN       NaN        2      NaN      NaN       5

Edit: if you want to generalize the procedure to more variables I think you could do something like the following (although it might not be efficient enough):

df['idx'] = df.groupby('Salesman').cumcount()

tmp = []
for var in ['product','price']:
    df['tmp_idx'] = var + '_' + df.idx.astype(str)
    tmp.append(df.pivot(index='Salesman',columns='tmp_idx',values=var))

reshape = pd.concat(tmp,axis=1)
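
As an alternative to the loop (not what I did above, just a sketch), the same wide layout can probably be reached with set_index and unstack, flattening the two-level column labels afterwards. The sample frame below is rebuilt from the output shown earlier, and the name `wide` is my own.

```python
import pandas as pd

# Sample data rebuilt from the tables in this answer.
df = pd.DataFrame({
    "Salesman": ["Knut", "Knut", "Knut", "Steve"],
    "Height": [6, 6, 6, 5],
    "product": ["bat", "ball", "wand", "pen"],
    "price": [5, 1, 3, 2],
})

# Within-group counter, as before.
df["idx"] = df.groupby("Salesman").cumcount()

# Move the counter into the columns; Height stays in the index because
# it is constant within each Salesman.
wide = df.set_index(["Salesman", "Height", "idx"])[["product", "price"]].unstack("idx")

# Flatten ('product', 0) -> 'product_0', and so on.
wide.columns = [f"{var}_{i}" for var, i in wide.columns]
wide = wide.reset_index()
print(wide)
```

This only works as written if Height really is constant per Salesman; otherwise each distinct (Salesman, Height) pair becomes its own row.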

@Luke said:

I think Stata can do something like this with the reshape command.

You can, but I think you also need a within-group counter for reshape in Stata to produce your desired output:

     +-------------------------------------------+
     | salesman   idx   height   product   price |
     |-------------------------------------------|
  1. |     Knut     0        6       bat       5 |
  2. |     Knut     1        6      ball       1 |
  3. |     Knut     2        6      wand       3 |
  4. |    Steve     0        5       pen       2 |
     +-------------------------------------------+

If you add idx, then you can do the reshape in Stata:

reshape wide product price, i(salesman) j(idx)

What is the most efficient way of finding all the factors of a number in Python?

from functools import reduce

def factors(n):    
    return set(reduce(list.__add__, 
                ([i, n//i] for i in range(1, int(n**0.5) + 1) if n % i == 0)))

This will return all of the factors, very quickly, of a number n.

Why square root as the upper limit?

sqrt(x) * sqrt(x) = x. So if the two factors are the same, they're both the square root. If you make one factor bigger, you have to make the other factor smaller. This means that one of the two will always be less than or equal to sqrt(x), so you only have to search up to that point to find one of the two matching factors. You can then use x / fac1 to get fac2.

The reduce(list.__add__, ...) is taking the little lists of [fac1, fac2] and joining them together in one long list.

The [i, n//i] for i in range(1, int(n**0.5) + 1) if n % i == 0 returns a pair of factors if the remainder when you divide n by the smaller one is zero (it doesn't need to check the larger one too; it just gets that by dividing n by the smaller one.)

The set(...) on the outside is getting rid of duplicates, which only happens for perfect squares. For n = 4, this will return 2 twice, so set gets rid of one of them.
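
For very large n, int(n**0.5) goes through a float and can be off by one, so here is a variant sketch of the same square-root idea that uses math.isqrt (Python 3.8+) and builds the set directly. It is a swapped-in formulation, not the one-liner above.

```python
from math import isqrt

def factors(n):
    """Return the set of all positive divisors of a positive integer n."""
    result = set()
    # Only test candidates up to the integer square root of n.
    for i in range(1, isqrt(n) + 1):
        if n % i == 0:
            result.add(i)       # the smaller factor of the pair
            result.add(n // i)  # its partner; equals i for perfect squares
    return result

print(sorted(factors(36)))
```

Adding to a set as you go also removes the need to de-duplicate the square root afterwards.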

How to create a new dataframe with an average of panel data with different IDs and times?

Just remove Date from your groupby key. In this case, you want the mean value of column Ranking from all rows in each Code column, so your groupby key should be only Code.

df_avg = df.groupby(['Code'],as_index=False)['Ranking'].mean().rename(columns={'Ranking':'Avg_Ranking'})
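
A small self-contained sketch of that line in action; the Code, Date and Ranking values are made up for illustration:

```python
import pandas as pd

# Hypothetical panel data: two codes observed on two dates each.
df = pd.DataFrame({
    "Code": ["A", "A", "B", "B"],
    "Date": ["2020-01", "2020-02", "2020-01", "2020-02"],
    "Ranking": [1, 3, 2, 6],
})

# Group by Code only, so the mean is taken across all dates of each code.
df_avg = (
    df.groupby(["Code"], as_index=False)["Ranking"]
      .mean()
      .rename(columns={"Ranking": "Avg_Ranking"})
)
print(df_avg)
```

Grouping by both Code and Date instead would give one group per row here, so every "average" would just be the original ranking.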

How can I count categorical columns by month in Pandas?

I tried this, so I'm posting it, though I like @Scott Boston's solution better, since mine combines the A and B values up front.

df.date = pd.to_datetime(df.date, format = '%Y-%m-%d')
df.loc[(df.category == 'A')|(df.category == 'B'), 'category'] = 'AB'

new_df = df.groupby([df.date.dt.year,df.date.dt.month]).category.value_counts().unstack().fillna(0)
new_df.columns = ['a_or_b_count', 'c_count']
new_df.index.names = ['Year', 'Month']

            a_or_b_count  c_count
Year Month                       
2017 1               3.0      0.0
     2               1.0      3.0
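
Here is a self-contained sketch of the same approach for anyone who wants to run it. The sample rows are made up to match the counts above. Two small changes from my code: the columns are renamed by name rather than by position, and unstack(fill_value=0) keeps the counts as integers.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration.
df = pd.DataFrame({
    "date": ["2017-01-03", "2017-01-10", "2017-01-20",
             "2017-02-01", "2017-02-05", "2017-02-11", "2017-02-25"],
    "category": ["A", "B", "A", "A", "C", "C", "C"],
})
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

# Merge A and B into a single label before counting.
df["category"] = df["category"].replace({"A": "AB", "B": "AB"})

year = df["date"].dt.year.rename("Year")
month = df["date"].dt.month.rename("Month")
new_df = (
    df.groupby([year, month])["category"]
      .value_counts()
      .unstack(fill_value=0)  # months with no rows for a category get 0
      .rename(columns={"AB": "a_or_b_count", "C": "c_count"})
)
print(new_df)
```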
