How to Take Column-Slices of Dataframe in Pandas

How to take column-slices of dataframe in pandas

2017 Answer - pandas 0.20: .ix is deprecated. Use .loc

See the deprecation in the docs

.loc uses label-based indexing to select both rows and columns. The labels are the values of the index (for rows) or of the columns. Unlike positional slicing, slicing with .loc includes the last element.

Let's assume we have a DataFrame with the following columns:

foo, bar, quz, ant, cat, sat, dat.

# selects all rows and all columns beginning at 'foo' up to and including 'sat'
df.loc[:, 'foo':'sat']
# foo bar quz ant cat sat

.loc accepts the same slice notation as Python lists, for both rows and columns: start:stop:step

# slice from 'foo' to 'cat' by every 2nd column
df.loc[:, 'foo':'cat':2]
# foo quz cat

# slice from the beginning to 'bar'
df.loc[:, :'bar']
# foo bar

# slice from 'quz' to the end by 3
df.loc[:, 'quz'::3]
# quz sat

# attempt from 'sat' to 'bar'
df.loc[:, 'sat':'bar']
# no columns returned

# slice from 'sat' back to 'bar' with a negative step
df.loc[:, 'sat':'bar':-1]
# sat cat ant quz bar

# slice notation is syntactic sugar for the slice function
# slice from 'quz' to the end by 2 with slice function
df.loc[:, slice('quz',None, 2)]
# quz cat dat

# select specific columns with a list
# select columns foo, bar and dat
df.loc[:, ['foo','bar','dat']]
# foo bar dat

You can slice rows and columns at the same time. For instance, if you have 5 rows with labels v, w, x, y, z:

# slice from 'w' to 'y' and 'foo' to 'ant' by 3
df.loc['w':'y', 'foo':'ant':3]
# foo ant
# w
# x
# y

python pandas slice column

Use the iloc indexer. Python counts from 0, so the second column is at position 1:

print (df.iloc[:, 1])
0 2
1 5
2 8
Name: 1, dtype: int64

For range:

print (df.iloc[:, 0:3])
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
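
The frame behind this output is not shown in the answer. The sketch below assumes a 3x3 frame with default integer labels that matches the printed values, and adds two related positional selections (the names `last_two` and `as_frame` are illustrative).

```python
import pandas as pd

# Assumed reconstruction of the frame printed above.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

second = df.iloc[:, 1]         # single position -> Series
first_three = df.iloc[:, 0:3]  # positional slice, right end excluded
last_two = df.iloc[:, -2:]     # negative positions count from the end
as_frame = df.iloc[:, [1]]     # list of positions -> DataFrame with one column

print(second.tolist())
print(first_three.shape, last_two.shape, as_frame.shape)
```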

How to get multiple column-slices of a dataframe in pandas

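You can use numpy.r_ to concatenate index ranges, but it works only with positions, so first convert the labels with get_loc (or with searchsorted, which is reliable only when the columns are sorted) and then use iloc:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(1, 8), columns = list('abcdefgh'))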
print (df)
a b c d e f g h
0 0 1 2 3 4 5 6 7

b = df.columns.get_loc('b')
d = df.columns.get_loc('d')
f = df.columns.get_loc('f')
h = df.columns.get_loc('h')
print (b,d,f,h)
1 3 5 7

# searchsorted returns the same positions here because the columns are sorted
b = df.columns.searchsorted('b')
d = df.columns.searchsorted('d')
f = df.columns.searchsorted('f')
h = df.columns.searchsorted('h')
print (b,d,f,h)
1 3 5 7

df1 = df.iloc[:, np.r_[b:d+1, f:h+1]]
print (df1)
b c d f g h
0 1 2 3 5 6 7

This is the same as:

df1 = df.iloc[:, np.r_[1:4, 5:8]]
print (df1)
b c d f g h
0 1 2 3 5 6 7

numpy.r_ does not accept label slices:

df1 = df.iloc[:, np.r_['b':'d', 'f':'h']]
print (df1)
#TypeError: unsupported operand type(s) for -: 'str' and 'str'

Another solution with loc + join:

df1 = df.loc[:,'b':'d'].join(df.loc[:,'f':'h'])
print (df1)
b c d f g h
0 1 2 3 5 6 7
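
One caveat with searchsorted: it performs a binary search and returns an insertion point, so it matches get_loc only when the column labels are sorted in ascending order. The sketch below uses a hypothetical frame, `df_unsorted`, whose columns are deliberately in descending order to show the two diverging.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose columns are deliberately in descending order.
df_unsorted = pd.DataFrame(np.arange(4).reshape(1, 4), columns=list('dcba'))

# get_loc looks the label up, so it returns the true position (3).
pos_getloc = df_unsorted.columns.get_loc('a')

# searchsorted assumes ascending order and returns an insertion point,
# which is not the position of 'a' in this frame.
pos_search = df_unsorted.columns.searchsorted('a')

print(pos_getloc, pos_search)
```

When the column order is not guaranteed, get_loc is the safer choice.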

Pandas dataframe slicing with multiple column ranges

Slicing by multiple label ranges is more challenging and has less support, so let's try to slice on index ranges instead:

loc = df.columns.get_loc
df.iloc[:, np.r_[loc('lat'):loc('long')+1, loc('year'):loc('day')+1]]

lat long year month day
0 0.218559 0.418508 0.345499 0.166776 0.878559
1 0.572760 0.898007 0.702427 0.386477 0.694439
2 0.803740 0.983359 0.945517 0.649540 0.860832
3 0.873401 0.906277 0.463535 0.610538 0.496282
4 0.187359 0.687674 0.039455 0.647117 0.638054
5 0.169531 0.794548 0.352917 0.484498 0.697736
6 0.022867 0.375123 0.444112 0.498140 0.414346
7 0.729086 0.415919 0.430047 0.734766 0.556216
8 0.138769 0.614932 0.109311 0.539576 0.289299
9 0.037969 0.500108 0.758036 0.262273 0.100859

When indexing by position, I need to add 1 to the right endpoint because positional slices are right-exclusive.


Another option is to slice individual sections and concatenate:

ranges = [('lat', 'long'), ('year', 'day')]
pd.concat([df.loc[:, i:j] for i, j in ranges], axis=1)

lat long year month day
0 0.218559 0.418508 0.345499 0.166776 0.878559
1 0.572760 0.898007 0.702427 0.386477 0.694439
2 0.803740 0.983359 0.945517 0.649540 0.860832
3 0.873401 0.906277 0.463535 0.610538 0.496282
4 0.187359 0.687674 0.039455 0.647117 0.638054
5 0.169531 0.794548 0.352917 0.484498 0.697736
6 0.022867 0.375123 0.444112 0.498140 0.414346
7 0.729086 0.415919 0.430047 0.734766 0.556216
8 0.138769 0.614932 0.109311 0.539576 0.289299
9 0.037969 0.500108 0.758036 0.262273 0.100859
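
The two approaches can be compared side by side. The sketch below wraps the concat version in a hypothetical helper, `select_label_ranges`, and checks it against the positional version. The data is random, and every column name other than lat, long, year, month and day is invented for the example.

```python
import numpy as np
import pandas as pd


def select_label_ranges(frame, ranges):
    """Concatenate inclusive label slices of `frame` column-wise (hypothetical helper)."""
    return pd.concat([frame.loc[:, start:stop] for start, stop in ranges], axis=1)


# Hypothetical data: 'id', 'name' and 'note' are made-up filler columns.
rng = np.random.default_rng(0)
cols = ['id', 'lat', 'long', 'name', 'year', 'month', 'day', 'note']
df = pd.DataFrame(rng.random((3, len(cols))), columns=cols)

# Label-based version: both endpoints of each range are included.
subset = select_label_ranges(df, [('lat', 'long'), ('year', 'day')])

# Position-based version: +1 because positional slices exclude the right end.
loc = df.columns.get_loc
by_position = df.iloc[:, np.r_[loc('lat'):loc('long') + 1, loc('year'):loc('day') + 1]]

print(list(subset.columns))
print(subset.equals(by_position))
```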

Pandas data slicing by column names

According to the documentation:

With DataFrame, slicing inside of [] slices the rows. This is provided largely as a convenience since it is such a common operation.

You get an empty DataFrame because your index contains strings and it can't find the values 'area' and 'pop' there. Here is what you get with a numeric index:

>> data.reset_index()['area':'pop']
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.range.RangeIndex'> with these indexers [area] of <class 'str'>

What you want instead is

>> data.loc[:, 'area':'pop']
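
A minimal sketch of the difference, assuming a small hypothetical frame with a sorted string index (the values are placeholders, not real statistics):

```python
import pandas as pd

# Hypothetical data with a sorted string index.
data = pd.DataFrame(
    {'area': [1, 2, 3], 'density': [10.0, 20.0, 30.0], 'pop': [100, 200, 300]},
    index=['California', 'Florida', 'Texas'],
)

# A slice inside [] is applied to the ROWS: no index label falls
# between 'area' and 'pop', so nothing is selected.
row_slice = data['area':'pop']

# .loc with an explicit row selector slices the COLUMNS by label.
col_slice = data.loc[:, 'area':'pop']

print(row_slice.shape)
print(list(col_slice.columns))
```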

How to slice column values in Python pandas DataFrame

Use string slicing through the .str accessor:

df['days'] = df['days'].str[1:]
df
Out[759]:
element id year month days tmax tmin
0 0 MX17004 2010 1 1 NaN NaN
1 1 MX17004 2010 1 10 NaN NaN
2 2 MX17004 2010 1 11 NaN NaN
3 3 MX17004 2010 1 12 NaN NaN
4 4 MX17004 2010 1 13 NaN NaN
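
A self-contained sketch of the same idea, assuming the original values had a one-letter prefix before the day number (the sample values and the `days_int` column are hypothetical):

```python
import pandas as pd

# Hypothetical values: a one-letter prefix followed by the day number.
df = pd.DataFrame({'days': ['d1', 'd10', 'd11']})

# .str[1:] applies ordinary Python slicing to every string in the column.
df['days'] = df['days'].str[1:]

# The result is still text; convert it if numbers are needed.
df['days_int'] = df['days'].astype(int)

print(df)
```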

Pandas slice row by index list for column

You can try str.rsplit:

df0['rate'] = df0['county'].str.rsplit(' ', n = 1).str[-1]
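
A short sketch of what this does, assuming each `county` value ends with a rate after the last space (the sample strings and the `name` column are hypothetical):

```python
import pandas as pd

# Hypothetical input: a name that may contain spaces, then a rate.
df0 = pd.DataFrame({'county': ['Los Angeles 5.2', 'Cook 4.8']})

# Split once, starting from the right, so only the last space is used.
parts = df0['county'].str.rsplit(' ', n=1)

df0['rate'] = parts.str[-1]  # text after the last space
df0['name'] = parts.str[0]   # everything before it

print(df0)
```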

Leveraging for loop to run slices of dataframe through supervised model based on one column value

A for loop is perfect here!

from dateutil.relativedelta import relativedelta
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor

# feature columns: everything except the date and the target
columns = whatever_df.columns.tolist()
cols = [c for c in columns if c not in ['Date', 'CPR']]

for i in range(5):
    cluster = whatever_df[whatever_df['cluster'] == i]

    # hold out the last 3 months of each cluster as the test set
    cutoff = max(cluster['Date']) - relativedelta(months = 3)
    train = cluster[cluster['Date'] <= cutoff]
    test = cluster[cluster['Date'] > cutoff]

    X_train = train[cols]
    y_train = train['CPR']
    X_test = test[cols]
    y_test = test['CPR']

    rf = RandomForestRegressor(max_depth=5)
    rf.fit(X_train, y_train)

    y_pred = rf.predict(X_test)
    print(f'MAE {i}: ', metrics.mean_absolute_error(y_test, y_pred))
    print(f'MSE {i}: ', metrics.mean_squared_error(y_test, y_pred))
    print('\n')
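
The per-cluster date split can be checked on its own, without a model. The sketch below is one possible variant, not the answer's method: it uses groupby instead of range(5) and pd.DateOffset instead of relativedelta, and the data and the `splits` dictionary are hypothetical.

```python
import pandas as pd

# Hypothetical data: two clusters, twelve monthly dates each.
dates = pd.date_range('2020-01-01', periods=12, freq='MS')
whatever_df = pd.DataFrame({
    'cluster': [0] * 12 + [1] * 12,
    'Date': list(dates) * 2,
    'CPR': range(24),
})

splits = {}
for cluster_id, cluster in whatever_df.groupby('cluster'):
    # Everything in the last 3 months of the cluster goes to the test set.
    cutoff = cluster['Date'].max() - pd.DateOffset(months=3)
    train = cluster[cluster['Date'] <= cutoff]
    test = cluster[cluster['Date'] > cutoff]
    splits[cluster_id] = (train, test)

for cluster_id, (train, test) in splits.items():
    print(cluster_id, len(train), len(test))
```

Iterating over groupby avoids hard-coding the number of clusters.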

Pandas Slice Columns and select subsets based on between condition

Basically, it would be best if you got the ordered sequence of timestamps; then, you can manipulate it to get the differences. If the question is only about Pandas slicing and not about timestamp operations, then you need to do the following operation:

df.loc[(df["100"] >= 0.5) & (df["100"] <= 1), "timestamp"].values

Pandas data frame comparison operations

For Pandas data frames, the normal comparison operators are overridden. If you evaluate dataframe_instance >= 0.5, the result is a sequence of boolean values, each obtained by comparing an individual data frame value to 0.5.

Pandas data frame slicing

This boolean sequence can be used to filter rows of your data frame. This is possible because Pandas overrides indexing to implement a rich filtering mechanism.
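
A minimal sketch of boolean-mask filtering, assuming a hypothetical frame with a "timestamp" column and a column named "100" (the values are invented):

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    'timestamp': [10, 20, 30, 40],
    '100': [0.2, 0.5, 0.9, 1.3],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than >= and <=.
mask = (df['100'] >= 0.5) & (df['100'] <= 1)
selected = df.loc[mask, 'timestamp'].values

# Series.between is an equivalent shorthand (both bounds inclusive by default).
same = df.loc[df['100'].between(0.5, 1), 'timestamp'].values

print(mask.tolist())
print(selected, same)
```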


