Selecting multiple columns in a Pandas dataframe
The column names (which are strings) cannot be sliced in the manner you tried.
Here you have a couple of options. If you know from context which variables you want to slice out, you can select only those columns by passing a list into the __getitem__ syntax (the []'s):
df1 = df[['a', 'b']]
Alternatively, if it matters to index them numerically and not by their name (say your code should automatically do this without knowing the names of the first two columns), then you can do this instead:
df1 = df.iloc[:, 0:2]  # Remember that Python does not slice inclusive of the ending index.
Additionally, you should familiarize yourself with the idea of a view into a Pandas object vs. a copy of that object. The first of the above methods will return a new copy in memory of the desired sub-object (the desired slices). Sometimes, however, there are indexing conventions in Pandas that don't do this and instead give you a new variable that just refers to the same chunk of memory as the sub-object or slice in the original object. This can happen with the second way of indexing, so you can use the .copy() method to get a regular copy. When a view is returned, changing what you think is the sliced object can sometimes alter the original object. Always good to be on the lookout for this.
df1 = df.iloc[:, 0:2].copy()  # To avoid the case where changing df1 also changes df
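A minimal sketch of the view-vs-copy pitfall described above, using a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Label-based selection with a list returns a new copy,
# so writing to df1 does not touch df:
df1 = df[['a', 'b']]
df1.loc[0, 'a'] = 99

# Positional slicing may hand back a view; take an explicit
# copy so that writes to df2 can never leak into df:
df2 = df.iloc[:, 0:2].copy()
df2.loc[0, 'a'] = 99

print(df.loc[0, 'a'])  # df is unchanged in both cases
```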
To use iloc, you need to know the column positions (or indices). As the column positions may change, instead of hard-coding indices, you can use iloc along with the get_loc method of the dataframe's columns attribute to obtain column indices:
{c: df.columns.get_loc(c) for c in df.columns}
Now you can use this dictionary to access columns through their names while still using iloc.
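A short self-contained sketch of that trick, on a toy frame (column names 'a', 'b', 'c' are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Map each column name to its integer position
col_idx = {c: df.columns.get_loc(c) for c in df.columns}

# Select columns 'a' and 'c' by name, but through iloc positions
sub = df.iloc[:, [col_idx['a'], col_idx['c']]]
print(list(sub.columns))  # ['a', 'c']
```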
Selecting across multiple columns with pandas
I encourage you to pose these questions on the mailing list, but in any case, it's still very much a low-level affair working with the underlying NumPy arrays. For example, to select rows where the value in any column exceeds, say, 1.5 in this example:
In [11]: df
Out[11]:
A B C D
2000-01-03 -0.59885 -0.18141 -0.68828 -0.77572
2000-01-04 0.83935 0.15993 0.95911 -1.12959
2000-01-05 2.80215 -0.10858 -1.62114 -0.20170
2000-01-06 0.71670 -0.26707 1.36029 1.74254
2000-01-07 -0.45749 0.22750 0.46291 -0.58431
2000-01-10 -0.78702 0.44006 -0.36881 -0.13884
2000-01-11 0.79577 -0.09198 0.14119 0.02668
2000-01-12 -0.32297 0.62332 1.93595 0.78024
2000-01-13 1.74683 -1.57738 -0.02134 0.11596
2000-01-14 -0.55613 0.92145 -0.22832 1.56631
2000-01-17 -0.55233 -0.28859 -1.18190 -0.80723
2000-01-18 0.73274 0.24387 0.88146 -0.94490
2000-01-19 0.56644 -0.49321 1.17584 -0.17585
2000-01-20 1.56441 0.62331 -0.26904 0.11952
2000-01-21 0.61834 0.17463 -1.62439 0.99103
2000-01-24 0.86378 -0.68111 -0.15788 -0.16670
2000-01-25 -1.12230 -0.16128 1.20401 1.08945
2000-01-26 -0.63115 0.76077 -0.92795 -2.17118
2000-01-27 1.37620 -1.10618 -0.37411 0.73780
2000-01-28 -1.40276 1.98372 1.47096 -1.38043
2000-01-31 0.54769 0.44100 -0.52775 0.84497
2000-02-01 0.12443 0.32880 -0.71361 1.31778
2000-02-02 -0.28986 -0.63931 0.88333 -2.58943
2000-02-03 0.54408 1.17928 -0.26795 -0.51681
2000-02-04 -0.07068 -1.29168 -0.59877 -1.45639
2000-02-07 -0.65483 -0.29584 -0.02722 0.31270
2000-02-08 -0.18529 -0.18701 -0.59132 -1.15239
2000-02-09 -2.28496 0.36352 1.11596 0.02293
2000-02-10 0.51054 0.97249 1.74501 0.20525
2000-02-11 0.10100 0.27722 0.65843 1.73591
In [12]: df[(df.values > 1.5).any(1)]
Out[12]:
A B C D
2000-01-05 2.8021 -0.1086 -1.62114 -0.2017
2000-01-06 0.7167 -0.2671 1.36029 1.7425
2000-01-12 -0.3230 0.6233 1.93595 0.7802
2000-01-13 1.7468 -1.5774 -0.02134 0.1160
2000-01-14 -0.5561 0.9215 -0.22832 1.5663
2000-01-20 1.5644 0.6233 -0.26904 0.1195
2000-01-28 -1.4028 1.9837 1.47096 -1.3804
2000-02-10 0.5105 0.9725 1.74501 0.2052
2000-02-11 0.1010 0.2772 0.65843 1.7359
Multiple conditions have to be combined using & or | (and parentheses!):
In [13]: df[(df['A'] > 1) | (df['B'] < -1)]
Out[13]:
A B C D
2000-01-05 2.80215 -0.1086 -1.62114 -0.2017
2000-01-13 1.74683 -1.5774 -0.02134 0.1160
2000-01-20 1.56441 0.6233 -0.26904 0.1195
2000-01-27 1.37620 -1.1062 -0.37411 0.7378
2000-02-04 -0.07068 -1.2917 -0.59877 -1.4564
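Since this answer was written, pandas has in fact grown such a query API: DataFrame.query, which accepts the condition as a string expression. A minimal sketch of the same OR-filter as In [13], on a small made-up frame rather than the df above:

```python
import pandas as pd

df = pd.DataFrame({'A': [2.8, 0.5, 1.7], 'B': [-0.1, -0.5, 0.2]})

# Equivalent to df[(df['A'] > 1) | (df['B'] < -1)]
result = df.query('A > 1 or B < -1')
print(list(result.index))  # rows 0 and 2 match
```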
I'd be very interested to have some kind of query API to make these kinds of things easier.
Select multiple columns by labels in pandas
Name- or Label-Based (using regular expression syntax)
df.filter(regex='[A-CEG-I]') # does NOT depend on the column order
Note that any regular expression is allowed here, so this approach can be very general. E.g. if you wanted all columns starting with a capital or lowercase "A" you could use:
df.filter(regex='^[Aa]')
Location-Based (depends on column order)
df[ list(df.loc[:,'A':'C']) + ['E'] + list(df.loc[:,'G':'I']) ]
Note that unlike the label-based method, this only works if your columns are alphabetically sorted. This is not necessarily a problem, however. For example, if your columns go ['A','C','B'], then you could replace 'A':'C' above with 'A':'B'.
The Long Way
And for completeness, you always have the option shown by @Magdalena of simply listing each column individually, although it could be much more verbose as the number of columns increases:
df[['A','B','C','E','G','H','I']]  # does NOT depend on the column order
Results for any of the above methods
A B C E G H I
0 -0.814688 -1.060864 -0.008088 2.697203 -0.763874 1.793213 -0.019520
1 0.549824 0.269340 0.405570 -0.406695 -0.536304 -1.231051 0.058018
2 0.879230 -0.666814 1.305835 0.167621 -1.100355 0.391133 0.317467
Selecting Multiple Sets of Columns in a DataFrame
Try this:
df = pd.DataFrame(np.random.random((10,25)))
df.iloc[:, np.r_[1:5, 10:15, 24]]
Output:
1 2 3 4 10 11 12 \
0 0.919851 0.852250 0.296771 0.562167 0.926956 0.425690 0.347112
1 0.053743 0.709286 0.866658 0.873554 0.588566 0.349387 0.582820
2 0.910201 0.918976 0.170105 0.967791 0.839613 0.200846 0.680498
3 0.606104 0.932580 0.857744 0.876963 0.199340 0.303397 0.103754
4 0.310878 0.386755 0.792151 0.664561 0.295020 0.980937 0.161358
5 0.808738 0.473452 0.190060 0.882827 0.778226 0.054262 0.052157
6 0.381418 0.216191 0.034603 0.314118 0.806126 0.535102 0.903150
7 0.531248 0.411528 0.644153 0.994051 0.727920 0.587441 0.679924
8 0.585064 0.352427 0.940689 0.684018 0.544400 0.765451 0.018906
9 0.075305 0.526637 0.911727 0.945098 0.105858 0.299441 0.862912
13 14 24
0 0.084237 0.317501 0.906934
1 0.949726 0.744821 0.149304
2 0.529243 0.492711 0.933917
3 0.723055 0.898373 0.642724
4 0.929206 0.540533 0.467883
5 0.825112 0.357224 0.235781
6 0.258703 0.114978 0.506079
7 0.758599 0.440214 0.863970
8 0.936511 0.117202 0.089875
9 0.968953 0.509748 0.584470
Selecting multiple columns, both consecutive and non-consecutive, in a Pandas dataframe
Use np.r_:
import numpy as np
X = d.iloc[:, np.r_[13, 30, 35:45]].to_numpy()
Intermediate output of np.r_[13, 30, 35:45]:
array([13, 30, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44])
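A self-contained sketch of the same pattern, with a toy 50-column frame standing in for the question's d:

```python
import numpy as np
import pandas as pd

# Toy frame wide enough for the indices used below
d = pd.DataFrame(np.zeros((3, 50)))

# np.r_ concatenates scalar indices and slices into one index array,
# so single columns 13 and 30 combine with the run 35..44
X = d.iloc[:, np.r_[13, 30, 35:45]].to_numpy()
print(X.shape)  # (3, 12): 2 single columns + 10 consecutive ones
```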
Pandas dataframe select rows with multiple columns' string conditions
This works. Following your pattern, both labels start with P/p or both with N/n:
ptp = df.loc[((df['label_one'].str.startswith('P')) &
(df['label_two'].str.startswith('p'))) |
((df['label_one'].str.startswith('N')) &
(df['label_two'].str.startswith('n')))]
This gives ptp:
year text label_one label_two
0 2017 yes it is POSITIVE positive
3 2018 it has to be done POSITIVE positive
4 2018 no NEGATIVE negative
6 2019 he is right POSITIVE positive
8 2020 that is a trap NEGATIVE negative
9 2021 I am on my way POSITIVE positive
Pandas - Selecting over multiple columns
You'll have to specify your conditions one way or another. You can create individual masks for each condition which you eventually reduce to a single one:
import seaborn as sns  # seaborn.apionly was removed in newer seaborn versions
import operator
import numpy as np
# Load a sample dataframe to play with
df = sns.load_dataset('iris')
# Define individual conditions as tuples
# ([column], [compare_function], [compare_value])
cond1 = ('sepal_length', operator.gt, 5)
cond2 = ('sepal_width', operator.lt, 2)
cond3 = ('species', operator.eq, 'virginica')
conditions = [cond1, cond2, cond3]
# Apply those conditions on the df, creating a list of 3 masks
masks = [fn(df[var], val) for var, fn, val in conditions]
# Reduce those 3 masks to one using logical OR
mask = np.logical_or.reduce(masks)
result = df.loc[mask]  # .ix is long deprecated; use .loc
When we compare this with the "hand-made" selection, we see they're the same:
result_manual = df[(df.sepal_length>5) | (df.sepal_width<2) | (df.species == 'virginica')]
result_manual.equals(result)  # == True
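Since both seaborn.apionly and DataFrame.ix are gone from current libraries, here is a fully runnable sketch of the same mask-reduction technique with no seaborn dependency; the three-row frame below is a made-up stand-in for the iris dataset (seaborn.load_dataset('iris') would work the same way):

```python
import operator

import numpy as np
import pandas as pd

# Tiny stand-in for the iris sample
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 6.3],
    'sepal_width': [3.5, 2.9, 3.3],
    'species': ['setosa', 'setosa', 'virginica'],
})

# Each condition as ([column], [compare_function], [compare_value])
conditions = [
    ('sepal_length', operator.gt, 5),
    ('sepal_width', operator.lt, 2),
    ('species', operator.eq, 'virginica'),
]

# Build one boolean mask per condition, then OR them together
masks = [fn(df[var], val) for var, fn, val in conditions]
mask = np.logical_or.reduce(masks)
result = df.loc[mask]
print(list(result.index))  # rows 0 and 2 satisfy at least one condition
```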