Selecting Across Multiple Columns with Python Pandas

Selecting multiple columns in a Pandas dataframe

The column names (which are strings) cannot be sliced in the manner you tried.

Here you have a couple of options. If you know from context which variables you want to slice out, you can select just those columns by passing a list into the __getitem__ syntax (the []'s).

df1 = df[['a', 'b']]

Alternatively, if you need to index the columns numerically rather than by name (say, your code should grab the first two columns without knowing what they are called), you can do this instead:

df1 = df.iloc[:, 0:2] # Remember that Python does not slice inclusive of the ending index.

Additionally, you should familiarize yourself with the difference between a view into a Pandas object and a copy of that object. The first of the above methods returns a new in-memory copy of the desired sub-object (the desired slices).

Sometimes, however, Pandas indexing conventions don't do this, and instead give you a new variable that just refers to the same chunk of memory as the sub-object or slice in the original object. This can happen with the second way of indexing, in which case modifying what you think is an independent slice can also alter the original object. It's always good to be on the lookout for this; appending the .copy() method gets you a regular copy:

df1 = df.iloc[:, 0:2].copy() # To avoid the case where changing df1 also changes df
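
The difference is easy to demonstrate (a minimal sketch; the DataFrame here is made up, and whether iloc returns a view or a copy can depend on the dtypes involved):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

maybe_view = df.iloc[:, 0:2]          # may share memory with df
independent = df.iloc[:, 0:2].copy()  # always an independent copy

independent.iloc[0, 0] = 99           # guaranteed not to touch df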

To use iloc, you need to know the column positions (or indices). Since column positions may change, instead of hard-coding indices you can build a mapping from column names to positions with the get_loc method of the DataFrame's columns attribute:

col_idx = {c: df.columns.get_loc(c) for c in df.columns}

Now you can use this dictionary to look up positions by name while indexing with iloc.
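
For example (a minimal sketch; the DataFrame and its column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
col_idx = {c: df.columns.get_loc(c) for c in df.columns}  # {'a': 0, 'b': 1, 'c': 2}

# Select 'a' and 'b' by name, but through positional iloc indexing
df1 = df.iloc[:, [col_idx['a'], col_idx['b']]].copy()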

Selecting across multiple columns with pandas

I encourage you to pose these questions on the mailing list, but in any case, it's still very much a low-level affair, working with the underlying NumPy arrays. For example, to select the rows where the value in any column exceeds, say, 1.5:

In [11]: df
Out[11]:
                  A        B        C        D
2000-01-03 -0.59885 -0.18141 -0.68828 -0.77572
2000-01-04  0.83935  0.15993  0.95911 -1.12959
2000-01-05  2.80215 -0.10858 -1.62114 -0.20170
2000-01-06  0.71670 -0.26707  1.36029  1.74254
2000-01-07 -0.45749  0.22750  0.46291 -0.58431
2000-01-10 -0.78702  0.44006 -0.36881 -0.13884
2000-01-11  0.79577 -0.09198  0.14119  0.02668
2000-01-12 -0.32297  0.62332  1.93595  0.78024
2000-01-13  1.74683 -1.57738 -0.02134  0.11596
2000-01-14 -0.55613  0.92145 -0.22832  1.56631
2000-01-17 -0.55233 -0.28859 -1.18190 -0.80723
2000-01-18  0.73274  0.24387  0.88146 -0.94490
2000-01-19  0.56644 -0.49321  1.17584 -0.17585
2000-01-20  1.56441  0.62331 -0.26904  0.11952
2000-01-21  0.61834  0.17463 -1.62439  0.99103
2000-01-24  0.86378 -0.68111 -0.15788 -0.16670
2000-01-25 -1.12230 -0.16128  1.20401  1.08945
2000-01-26 -0.63115  0.76077 -0.92795 -2.17118
2000-01-27  1.37620 -1.10618 -0.37411  0.73780
2000-01-28 -1.40276  1.98372  1.47096 -1.38043
2000-01-31  0.54769  0.44100 -0.52775  0.84497
2000-02-01  0.12443  0.32880 -0.71361  1.31778
2000-02-02 -0.28986 -0.63931  0.88333 -2.58943
2000-02-03  0.54408  1.17928 -0.26795 -0.51681
2000-02-04 -0.07068 -1.29168 -0.59877 -1.45639
2000-02-07 -0.65483 -0.29584 -0.02722  0.31270
2000-02-08 -0.18529 -0.18701 -0.59132 -1.15239
2000-02-09 -2.28496  0.36352  1.11596  0.02293
2000-02-10  0.51054  0.97249  1.74501  0.20525
2000-02-11  0.10100  0.27722  0.65843  1.73591

In [12]: df[(df.values > 1.5).any(1)]
Out[12]:
                 A       B        C       D
2000-01-05  2.8021 -0.1086 -1.62114 -0.2017
2000-01-06  0.7167 -0.2671  1.36029  1.7425
2000-01-12 -0.3230  0.6233  1.93595  0.7802
2000-01-13  1.7468 -1.5774 -0.02134  0.1160
2000-01-14 -0.5561  0.9215 -0.22832  1.5663
2000-01-20  1.5644  0.6233 -0.26904  0.1195
2000-01-28 -1.4028  1.9837  1.47096 -1.3804
2000-02-10  0.5105  0.9725  1.74501  0.2052
2000-02-11  0.1010  0.2772  0.65843  1.7359
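
(The same selection can also be written at the DataFrame level, avoiding the drop to the raw NumPy array:)

df[(df > 1.5).any(axis=1)]  # rows where any column exceeds 1.5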

Multiple conditions have to be combined using & or | (and parentheses!):

In [13]: df[(df['A'] > 1) | (df['B'] < -1)]
Out[13]:
                  A       B        C       D
2000-01-05  2.80215 -0.1086 -1.62114 -0.2017
2000-01-13  1.74683 -1.5774 -0.02134  0.1160
2000-01-20  1.56441  0.6233 -0.26904  0.1195
2000-01-27  1.37620 -1.1062 -0.37411  0.7378
2000-02-04 -0.07068 -1.2917 -0.59877 -1.4564

I'd be very interested to have some kind of query API to make these kinds of things easier.
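
(pandas has since grown exactly such an API, DataFrame.query, which evaluates a boolean expression passed as a string:)

df.query('A > 1 or B < -1')  # selects the same rows as the mask above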

Select multiple columns by labels in pandas

Name- or Label-Based (using regular expression syntax)

df.filter(regex='[A-CEG-I]')   # does NOT depend on the column order

Note that any regular expression is allowed here, so this approach can be very general. E.g. if you wanted all columns starting with a capital or lowercase "A" you could use: df.filter(regex='^[Aa]')
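
As a quick check (a minimal sketch; the single-letter column names are assumed from the example output below):

import pandas as pd

df = pd.DataFrame(columns=list('ABCDEFGHI'))
df.filter(regex='[A-CEG-I]').columns.tolist()
# ['A', 'B', 'C', 'E', 'G', 'H', 'I']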

Location-Based (depends on column order)

df[ list(df.loc[:,'A':'C']) + ['E'] + list(df.loc[:,'G':'I']) ]

Note that unlike the regex method, this slices by the position of the labels in the current column order, so it only works as expected when the columns you want are contiguous. This is not necessarily a problem, however. For example, if your columns go ['A','C','B'], then you could replace 'A':'C' above with 'A':'B'.

The Long Way

And for completeness, you always have the option shown by @Magdalena of simply listing each column individually, although it could be much more verbose as the number of columns increases:

df[['A','B','C','E','G','H','I']]   # does NOT depend on the column order

Results for any of the above methods

          A         B         C         E         G         H         I
0 -0.814688 -1.060864 -0.008088  2.697203 -0.763874  1.793213 -0.019520
1  0.549824  0.269340  0.405570 -0.406695 -0.536304 -1.231051  0.058018
2  0.879230 -0.666814  1.305835  0.167621 -1.100355  0.391133  0.317467

Selecting Multiple Sets of Columns in a DataFrame

Try this:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((10,25)))

df.iloc[:, np.r_[1:5, 10:15, 24]]

Output:

          1         2         3         4        10        11        12  \
0  0.919851  0.852250  0.296771  0.562167  0.926956  0.425690  0.347112
1  0.053743  0.709286  0.866658  0.873554  0.588566  0.349387  0.582820
2  0.910201  0.918976  0.170105  0.967791  0.839613  0.200846  0.680498
3  0.606104  0.932580  0.857744  0.876963  0.199340  0.303397  0.103754
4  0.310878  0.386755  0.792151  0.664561  0.295020  0.980937  0.161358
5  0.808738  0.473452  0.190060  0.882827  0.778226  0.054262  0.052157
6  0.381418  0.216191  0.034603  0.314118  0.806126  0.535102  0.903150
7  0.531248  0.411528  0.644153  0.994051  0.727920  0.587441  0.679924
8  0.585064  0.352427  0.940689  0.684018  0.544400  0.765451  0.018906
9  0.075305  0.526637  0.911727  0.945098  0.105858  0.299441  0.862912

         13        14        24
0  0.084237  0.317501  0.906934
1  0.949726  0.744821  0.149304
2  0.529243  0.492711  0.933917
3  0.723055  0.898373  0.642724
4  0.929206  0.540533  0.467883
5  0.825112  0.357224  0.235781
6  0.258703  0.114978  0.506079
7  0.758599  0.440214  0.863970
8  0.936511  0.117202  0.089875
9  0.968953  0.509748  0.584470

Selecting multiple columns, both consecutive and non-consecutive, in a Pandas dataframe

Use np.r_:

import numpy as np

X = d.iloc[:, np.r_[13, 30, 35:45]].to_numpy()  # d is your DataFrame

Intermediate output of np.r_[13, 30, 35:45]:

array([13, 30, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44])

Pandas dataframe select rows with multiple columns' string conditions

This works. Following your pattern, it keeps the rows where both labels start with P/p or both start with N/n:

ptp = df.loc[((df['label_one'].str.startswith('P')) &
              (df['label_two'].str.startswith('p'))) |
             ((df['label_one'].str.startswith('N')) &
              (df['label_two'].str.startswith('n')))]

gives

PTP
   year               text label_one label_two
0  2017          yes it is  POSITIVE  positive
3  2018  it has to be done  POSITIVE  positive
4  2018                 no  NEGATIVE  negative
6  2019        he is right  POSITIVE  positive
8  2020     that is a trap  NEGATIVE  negative
9  2021     I am on my way  POSITIVE  positive
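
A more compact equivalent (a sketch; it assumes every label begins with a letter) compares the lowercased first characters of the two label columns:

first_one = df['label_one'].str[0].str.lower()
first_two = df['label_two'].str[0].str.lower()
ptp = df.loc[(first_one == first_two) & first_one.isin(['p', 'n'])]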

Pandas - Selecting over multiple columns

You'll have to specify your conditions one way or another. You can create individual masks for each condition, which you eventually reduce to a single one:

import seaborn as sns
import operator
import numpy as np

# Load a sample dataframe to play with
df = sns.load_dataset('iris')

# Define individual conditions as tuples
# ([column], [compare_function], [compare_value])
cond1 = ('sepal_length', operator.gt, 5)
cond2 = ('sepal_width', operator.lt, 2)
cond3 = ('species', operator.eq, 'virginica')
conditions = [cond1, cond2, cond3]

# Apply those conditions on the df, creating a list of 3 masks
masks = [fn(df[var], val) for var, fn, val in conditions]
# Reduce those 3 masks to one using logical OR
mask = np.logical_or.reduce(masks)

result = df.loc[mask]  # .ix was removed from pandas; .loc works here

When we compare this with the "hand-made" selection, we see they're the same:

result_manual = df[(df.sepal_length>5) | (df.sepal_width<2) | (df.species == 'virginica')]
result_manual.equals(result) # == True
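
As a side note, the same reduction can be done with the standard library, since boolean Series support elementwise |:

from functools import reduce
import operator

mask = reduce(operator.or_, masks)  # elementwise OR of all masks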

