Closest equivalent of a factor variable in Python Pandas
This question seems to be from a year back, but since it is still open, here's an update: pandas has introduced a categorical dtype, and it operates very similarly to factors in R. Please see this link for more information:
http://pandas-docs.github.io/pandas-docs-travis/categorical.html
Reproducing a snippet from the link above showing how to create a "factor" variable in pandas.
In [1]: s = Series(["a","b","c","a"], dtype="category")
In [2]: s
Out[2]:
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a < b < c]
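As a small follow-up sketch (not part of the linked documentation snippet), the integer codes and the levels behind such a Series can be read through the `.cat` accessor. The variable names `codes` and `categories` are only illustrative.

```python
import pandas as pd

# Same data as the snippet above, created with an explicit pandas import.
s = pd.Series(["a", "b", "c", "a"], dtype="category")

codes = s.cat.codes.tolist()            # integer code for each row
categories = s.cat.categories.tolist()  # the distinct levels

print(codes)
print(categories)
```

This is roughly what `as.integer(f)` and `levels(f)` give you for a factor `f` in R, except that pandas codes start at 0.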
Python equivalent to R's factor data type
Try pandas
http://pandas.pydata.org/
and take a look at:
http://pandas.pydata.org/pandas-docs/stable/categorical.html
It's an amazing library that does pretty much what you're asking for.
How to generate pandas DataFrame column of Categorical from string column?
The only workaround I found for pandas pre-0.15 is as follows:
- the column must be converted to a Categorical for the classifier, but numpy will immediately coerce the levels back to int, losing the factor information
- so store the factor in a global variable outside the dataframe:
train_LocationNFactor = pd.Categorical.from_array(train['LocationNormalized']) # default order: alphabetical
train['LocationNFactor'] = train_LocationNFactor.labels # insert in dataframe
[UPDATE: pandas 0.15+ added decent support for Categorical]
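For pandas 0.15 and later, a minimal sketch of the same idea without the global variable might look like the following. The `train` frame here is hypothetical sample data standing in for the one in the question, and `.cat.codes` is used where the older snippet used `.labels`.

```python
import pandas as pd

# Hypothetical stand-in for the question's `train` DataFrame.
train = pd.DataFrame(
    {"LocationNormalized": ["London", "Leeds", "London", "Bath"]}
)

# Convert the string column to a categorical in place; the levels stay
# attached to the column, so nothing needs to be stored outside the frame.
train["LocationNormalized"] = train["LocationNormalized"].astype("category")

# Integer codes for the classifier (categories are sorted alphabetically
# by default, so Bath -> 0, Leeds -> 1, London -> 2).
train["LocationNFactor"] = train["LocationNormalized"].cat.codes

print(train)
```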
Applying Decay Factor to Return Data in Pandas and Saving as New Variable
Try This:
df = df.set_index("Date")
retFac = np.fromfunction(lambda i,j : .97**(i), df.values.shape)[::-1]
df*retFac
df['ret_decay'] = retFac[:,0]
df
AAPL DE IBM MSFT ret_decay
Date
2016-01-19 NaN NaN NaN NaN 0.693842
2016-01-20 -0.021780 -0.019701 -0.078557 -0.019177 0.715301
2016-01-21 0.016271 0.014681 0.021823 0.024440 0.737424
2016-01-22 0.036128 0.034555 0.009910 0.019085 0.760231
2016-01-25 0.008539 -0.026477 -0.001068 0.007608 0.783743
2016-01-26 -0.011491 -0.000907 0.004933 -0.001936 0.807983
2016-01-27 -0.048231 0.021428 -0.013007 -0.010279 0.832972
2016-08-16 0.010455 0.010135 -0.006738 -0.005712 0.858734
2016-08-17 -0.007966 -0.006432 -0.005290 -0.000698 0.885293
2016-08-18 0.006277 -0.006603 0.003754 0.000699 0.912673
2016-08-19 -0.006054 0.034406 -0.005735 -0.001222 0.940900
2016-08-22 -0.004707 0.091092 -0.002445 0.001049 0.970000
2016-08-23 0.006305 0.007621 0.006913 0.010304 1.000000
Using the newly added code:
retFac = np.fromfunction(lambda i,j : .97**(i), ret.values.shape)[::-1]
ret*retFac
ret['ret_decay'] = retFac[:,0]
ret
Symbol AAPL DE IBM MSFT ret_decay
Date
2016-08-16 NaN NaN NaN NaN 0.858734
2016-08-17 -0.007966 -0.006432 -0.005290 -0.000698 0.885293
2016-08-18 0.006277 -0.006603 0.003754 0.000699 0.912673
2016-08-19 -0.006054 0.034406 -0.005735 -0.001222 0.940900
2016-08-22 -0.004707 0.091092 -0.002445 0.001049 0.970000
2016-08-23 0.006305 0.007621 0.006913 0.010304 1.000000
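Note that in the snippets above the product `ret*retFac` is evaluated but never assigned, so only the decay weights end up in the frame. A hedged sketch of one way to keep the decayed returns as well is shown below. The sample returns are made up, and the names `weights` and `ret_decayed` are illustrative, not from the original answer.

```python
import numpy as np
import pandas as pd

# Made-up returns for illustration only.
ret = pd.DataFrame(
    {"AAPL": [0.01, -0.02, 0.03], "IBM": [0.00, 0.01, -0.01]},
    index=pd.to_datetime(["2016-08-19", "2016-08-22", "2016-08-23"]),
)

decay = 0.97
n = len(ret)

# Most recent row gets weight 1, the row before it 0.97, then 0.97**2, ...
weights = pd.Series(decay ** np.arange(n - 1, -1, -1), index=ret.index)

# Multiply every column row-wise by the weights and keep the result.
ret_decayed = ret.mul(weights, axis=0)
ret_decayed["ret_decay"] = weights

print(ret_decayed)
```

Building the weights with `np.arange` gives the same values as the first column of the `np.fromfunction(...)[::-1]` array, just as a one-dimensional vector.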
Convert array of string (category) to array of int from a pandas dataframe
If you have a vector of strings or other objects and you want to give it categorical labels, you can use the Factor class (available in the pandas namespace):
In [1]: s = Series(['single', 'touching', 'nuclei', 'dusts', 'touching', 'single', 'nuclei'])
In [2]: s
Out[2]:
0 single
1 touching
2 nuclei
3 dusts
4 touching
5 single
6 nuclei
Name: None, Length: 7
In [4]: Factor(s)
Out[4]:
Factor:
array([single, touching, nuclei, dusts, touching, single, nuclei], dtype=object)
Levels (4): [dusts nuclei single touching]
The factor has attributes labels and levels:
In [7]: f = Factor(s)
In [8]: f.labels
Out[8]: array([2, 3, 1, 0, 3, 2, 1], dtype=int32)
In [9]: f.levels
Out[9]: Index([dusts, nuclei, single, touching], dtype=object)
This is intended for 1D vectors so not sure if it can be instantly applied to your problem, but have a look.
BTW I recommend that you ask these questions on the statsmodels and / or scikit-learn mailing list since most of us are not frequent SO users.
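The `Factor` class shown above comes from an early pandas release. In more recent versions, a comparable result can be sketched with `pd.factorize`, where `codes` corresponds to `labels` and `uniques` to `levels`. Passing `sort=True` is an assumption made here so that the levels come out in the same alphabetical order as in the output above.

```python
import pandas as pd

s = pd.Series(
    ["single", "touching", "nuclei", "dusts", "touching", "single", "nuclei"]
)

# sort=True orders the unique values before assigning integer codes;
# without it, codes are assigned in order of first appearance.
codes, uniques = pd.factorize(s, sort=True)

print(codes)
print(uniques)
```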
Pandas long to wide reshape, by two variables
A simple pivot might be sufficient for your needs but this is what I did to reproduce your desired output:
df['idx'] = df.groupby('Salesman').cumcount()
Just adding a within group counter/index will get you most of the way there but the column labels will not be as you desired:
print(df.pivot(index='Salesman',columns='idx')[['product','price']])
product price
idx 0 1 2 0 1 2
Salesman
Knut bat ball wand 5 1 3
Steve pen NaN NaN 2 NaN NaN
To get closer to your desired output I added the following:
df['prod_idx'] = 'product_' + df.idx.astype(str)
df['prc_idx'] = 'price_' + df.idx.astype(str)
product = df.pivot(index='Salesman',columns='prod_idx',values='product')
prc = df.pivot(index='Salesman',columns='prc_idx',values='price')
reshape = pd.concat([product,prc],axis=1)
reshape['Height'] = df.set_index('Salesman')['Height'].drop_duplicates()
print(reshape)
product_0 product_1 product_2 price_0 price_1 price_2 Height
Salesman
Knut bat ball wand 5 1 3 6
Steve pen NaN NaN 2 NaN NaN 5
Edit: if you want to generalize the procedure to more variables I think you could do something like the following (although it might not be efficient enough):
df['idx'] = df.groupby('Salesman').cumcount()
tmp = []
for var in ['product','price']:
    df['tmp_idx'] = var + '_' + df.idx.astype(str)
    tmp.append(df.pivot(index='Salesman',columns='tmp_idx',values=var))
reshape = pd.concat(tmp,axis=1)
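For reference, here is a self-contained sketch of the same reshape that pivots both variables in one call and then flattens the resulting two-level column labels. The sample frame is reconstructed from the output shown above. The name `wide` and the use of `groupby(...).first()` for `Height` are choices made for this sketch; `first()` is used because `drop_duplicates()` on the height values would drop a salesman who happens to share a height with another.

```python
import pandas as pd

# Sample data reconstructed from the printed output above.
df = pd.DataFrame(
    {
        "Salesman": ["Knut", "Knut", "Knut", "Steve"],
        "Height": [6, 6, 6, 5],
        "product": ["bat", "ball", "wand", "pen"],
        "price": [5, 1, 3, 2],
    }
)

# Within-group counter: 0, 1, 2 for Knut and 0 for Steve.
df["idx"] = df.groupby("Salesman").cumcount()

# Pivot both variables at once; columns become (variable, idx) pairs.
wide = df.pivot(index="Salesman", columns="idx", values=["product", "price"])

# Flatten the (variable, idx) pairs into labels such as "product_0".
wide.columns = [f"{var}_{i}" for var, i in wide.columns]

# Height is constant per salesman, so take the first value in each group.
wide["Height"] = df.groupby("Salesman")["Height"].first()

print(wide)
```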
@Luke said:
I think Stata can do something like this with the reshape command.
You can, but I think you also need a within-group counter for the reshape in Stata to produce your desired output:
+-------------------------------------------+
| salesman idx height product price |
|-------------------------------------------|
1. | Knut 0 6 bat 5 |
2. | Knut 1 6 ball 1 |
3. | Knut 2 6 wand 3 |
4. | Steve 0 5 pen 2 |
+-------------------------------------------+
If you add idx then you could do the reshape in Stata:
reshape wide product price, i(salesman) j(idx)
What is the most efficient way of finding all the factors of a number in Python?
from functools import reduce

def factors(n):
    return set(reduce(list.__add__,
                ([i, n//i] for i in range(1, int(n**0.5) + 1) if n % i == 0)))
This will return all of the factors, very quickly, of a number n.
Why square root as the upper limit?
sqrt(x) * sqrt(x) = x. So if the two factors are the same, they're both the square root. If you make one factor bigger, you have to make the other factor smaller. This means that one of the two will always be less than or equal to sqrt(x), so you only have to search up to that point to find one of the two matching factors. You can then use x / fac1 to get fac2.
The reduce(list.__add__, ...) is taking the little lists of [fac1, fac2] and joining them together in one long list.
The [i, n//i] for i in range(1, int(n**0.5) + 1) if n % i == 0 returns a pair of factors if the remainder when you divide n by the smaller one is zero (it doesn't need to check the larger one too; it just gets that by dividing n by the smaller one).
The set(...) on the outside is getting rid of duplicates, which only happens for perfect squares. For n = 4, this will return 2 twice, so set gets rid of one of them.
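The same search-up-to-the-square-root idea can also be written as a plain loop. The sketch below is a variant, not the original answer's code: it swaps `int(n**0.5)` for `math.isqrt` (available from Python 3.8), which computes the integer square root without going through floating point, and it assumes n is a positive integer.

```python
from math import isqrt


def factors(n):
    """Return the set of positive divisors of a positive integer n."""
    result = set()
    for i in range(1, isqrt(n) + 1):
        if n % i == 0:
            # i is the smaller factor of the pair; n // i is its partner.
            result.add(i)
            result.add(n // i)
    return result


print(sorted(factors(28)))
print(sorted(factors(16)))
```

Because the results go straight into a set, the duplicated square root of a perfect square (4 for n = 16) is stored only once.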
How to create a new dataframe with an average of panel data with different IDs and times?
Just remove Date from your groupby key. In this case, you want the mean value of column Ranking from all rows in each Code column, so your groupby key should be only Code.
df_avg = df.groupby(['Code'],as_index=False)['Ranking'].mean().rename(columns={'Ranking':'Avg_Ranking'})
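To see the effect on a concrete frame, here is a small sketch with made-up panel data. The column names `Code`, `Date` and `Ranking` follow the answer; the values are hypothetical.

```python
import pandas as pd

# Hypothetical panel data: two IDs observed on different dates.
df = pd.DataFrame(
    {
        "Code": ["A", "A", "B"],
        "Date": ["2020-01-01", "2020-01-02", "2020-01-01"],
        "Ranking": [1, 3, 5],
    }
)

# Group by Code only, so all dates for an ID are averaged together.
df_avg = (
    df.groupby(["Code"], as_index=False)["Ranking"]
    .mean()
    .rename(columns={"Ranking": "Avg_Ranking"})
)

print(df_avg)
```

With `Date` left in the key, every (Code, Date) pair would form its own group and each mean would just be the single original value.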
How can I count categorical columns by month in Pandas?
I tried this, so I'm posting it, though I like @Scott Boston's solution better, as I combined the A and B values earlier.
df.date = pd.to_datetime(df.date, format = '%Y-%m-%d')
df.loc[(df.category == 'A')|(df.category == 'B'), 'category'] = 'AB'
new_df = df.groupby([df.date.dt.year,df.date.dt.month]).category.value_counts().unstack().fillna(0)
new_df.columns = ['a_or_b_count', 'c_count']
new_df.index.names = ['Year', 'Month']
a_or_b_count c_count
Year Month
2017 1 3.0 0.0
2 1.0 3.0
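A self-contained sketch of the same approach follows, using made-up rows chosen to reproduce the counts above. It differs from the answer in two assumed details: the combined label goes into a separate `group` column instead of overwriting `category`, and the columns are renamed by label instead of by position, so the result does not depend on column order.

```python
import pandas as pd

# Made-up rows chosen to match the counts in the output above.
df = pd.DataFrame(
    {
        "date": ["2017-01-03", "2017-01-10", "2017-01-20",
                 "2017-02-01", "2017-02-05", "2017-02-11", "2017-02-25"],
        "category": ["A", "B", "A", "B", "C", "C", "C"],
    }
)
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

# Merge categories A and B into a single label.
df["group"] = df["category"].replace({"A": "AB", "B": "AB"})

counts = (
    df.groupby(
        [df["date"].dt.year.rename("Year"),
         df["date"].dt.month.rename("Month"),
         "group"]
    )
    .size()
    .unstack(fill_value=0)  # one column per label, 0 where a label is absent
    .rename(columns={"AB": "a_or_b_count", "C": "c_count"})
)

print(counts)
```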