Select Multiple Ranges of Columns in Pandas Dataframe

Slicing multiple ranges of columns in Pandas, by list of names

I think you need numpy.r_ to concatenate the column positions, then use iloc to select them:

print (df.iloc[:, np.r_[1:3, 6:len(df.columns)]])

For the second approach, subset by a list of column names:

print (df[years_month])

Sample:

df = pd.DataFrame({'2000-1':[1,3,5],
'2000-2':[5,3,6],
'2000-3':[7,8,9],
'2000-4':[1,3,5],
'2000-5':[5,3,6],
'2000-6':[7,8,9],
'2000-7':[1,3,5],
'2000-8':[5,3,6],
'2000-9':[7,4,3],
'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})

print (df)
2000-1 2000-2 2000-3 2000-4 2000-5 2000-6 2000-7 2000-8 2000-9 A \
0 1 5 7 1 5 7 1 5 7 1
1 3 3 8 3 3 8 3 3 4 2
2 5 6 9 5 6 9 5 6 3 3

B C
0 4 7
1 5 8
2 6 9

print (df.iloc[:, np.r_[1:3, 6:len(df.columns)]])
2000-2 2000-3 2000-7 2000-8 2000-9 A B C
0 5 7 1 5 7 1 4 7
1 3 8 3 3 4 2 5 8
2 6 9 5 6 3 3 6 9

You can also add the ranges together (in Python 3, each range must be cast to a list first):

rng = list(range(1,3)) + list(range(6, len(df.columns)))
print (rng)
[1, 2, 6, 7, 8, 9, 10, 11]

print (df.iloc[:, rng])
2000-2 2000-3 2000-7 2000-8 2000-9 A B C
0 5 7 1 5 7 1 4 7
1 3 8 3 3 4 2 5 8
2 6 9 5 6 3 3 6 9
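The second approach above uses a list called years_month that the answer never defines. A minimal sketch, assuming the list is meant to hold every column whose name starts with "2000-" (that rule, and the filler cell values, are assumptions made only for illustration):

```python
import pandas as pd

# Same column layout as the sample above; the cell values are arbitrary filler.
cols = [f"2000-{m}" for m in range(1, 10)] + ["A", "B", "C"]
df = pd.DataFrame([[i * 10 + j for j in range(len(cols))] for i in range(3)],
                  columns=cols)

# Hypothetical definition of years_month: the answer does not show how it is built.
years_month = [c for c in df.columns if c.startswith("2000-")]

# Indexing with a list of labels returns those columns in list order.
subset = df[years_month]
print(subset.columns.tolist())
```

Because the selection is by name, it keeps working if the columns are later reordered, which is not true of the positional iloc variant.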

Select multiple ranges of columns in Pandas DataFrame

Use np.r_, which accepts slices and single positions and flattens them into one array:

np.r_[1:10, 15, 17, 50:100]

array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 15, 17, 50, 51, 52, 53, 54, 55,
56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

So you can do:

df.iloc[:, np.r_[1:10, 15, 17, 50:100]]
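To try this end to end, here is a small sketch on a made-up frame with 100 columns (the names c0..c99 are hypothetical). Every position passed to iloc must exist, so the frame needs at least 100 columns for this particular selection to work:

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame: 2 rows, 100 columns named c0..c99.
df = pd.DataFrame(np.arange(200).reshape(2, 100),
                  columns=[f"c{i}" for i in range(100)])

# Slices and single positions are flattened into one integer array.
positions = np.r_[1:10, 15, 17, 50:100]
selected = df.iloc[:, positions]

print(len(positions))                  # how many columns were requested
print(selected.columns[:3].tolist())   # first few selected labels
```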

Selecting multiple ranges of dates from dataframe

Another way:

df['date'] = pd.to_datetime(df['date'])
df[df.date.dt.year.isin([2015, 2016]) & df.date.dt.day.lt(3)]

date price
0 2015-01-01 78
1 2015-01-02 87
3 2016-01-01 94
4 2016-01-02 55
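The answer does not show its input, so the sketch below uses made-up data (only the four rows in the output above come from the answer; the rest are assumptions). It builds the mask from explicit (start, stop) date ranges instead, which may be easier to adapt when the ranges do not line up with year and day-of-month:

```python
import pandas as pd

# Hypothetical input; rows 2 and 5 are invented so that something gets filtered out.
df = pd.DataFrame({
    "date": ["2015-01-01", "2015-01-02", "2015-01-03",
             "2016-01-01", "2016-01-02", "2016-01-03"],
    "price": [78, 87, 52, 94, 55, 61],
})
df["date"] = pd.to_datetime(df["date"])

# Each tuple is an inclusive (start, stop) range.
ranges = [("2015-01-01", "2015-01-02"), ("2016-01-01", "2016-01-02")]

# OR together one between() mask per range.
mask = pd.Series(False, index=df.index)
for start, stop in ranges:
    mask |= df["date"].between(pd.Timestamp(start), pd.Timestamp(stop))

result = df[mask]
print(result)
```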

Pandas dataframe slicing with multiple column ranges

Slicing by multiple label ranges is more challenging and has less support, so let's try to slice on index ranges instead:

loc = df.columns.get_loc
df.iloc[:, np.r_[loc('lat'):loc('long')+1, loc('year'):loc('day')+1]]

lat long year month day
0 0.218559 0.418508 0.345499 0.166776 0.878559
1 0.572760 0.898007 0.702427 0.386477 0.694439
2 0.803740 0.983359 0.945517 0.649540 0.860832
3 0.873401 0.906277 0.463535 0.610538 0.496282
4 0.187359 0.687674 0.039455 0.647117 0.638054
5 0.169531 0.794548 0.352917 0.484498 0.697736
6 0.022867 0.375123 0.444112 0.498140 0.414346
7 0.729086 0.415919 0.430047 0.734766 0.556216
8 0.138769 0.614932 0.109311 0.539576 0.289299
9 0.037969 0.500108 0.758036 0.262273 0.100859

Positional slices are right-exclusive, so I add 1 to the right-hand position to include the end column.


Another option is to slice individual sections and concatenate:

ranges = [('lat', 'long'), ('year', 'day')]
pd.concat([df.loc[:, i:j] for i, j in ranges], axis=1)

lat long year month day
0 0.218559 0.418508 0.345499 0.166776 0.878559
1 0.572760 0.898007 0.702427 0.386477 0.694439
2 0.803740 0.983359 0.945517 0.649540 0.860832
3 0.873401 0.906277 0.463535 0.610538 0.496282
4 0.187359 0.687674 0.039455 0.647117 0.638054
5 0.169531 0.794548 0.352917 0.484498 0.697736
6 0.022867 0.375123 0.444112 0.498140 0.414346
7 0.729086 0.415919 0.430047 0.734766 0.556216
8 0.138769 0.614932 0.109311 0.539576 0.289299
9 0.037969 0.500108 0.758036 0.262273 0.100859
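Both options can be checked against each other on a small frame. This is only a sketch: the answer does not show the full column list, so the extra columns 'id' and 'value' are hypothetical, and get_loc is assumed to return an integer, which holds only when the column labels are unique:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; 'id' and 'value' are made-up columns that should be skipped.
cols = ["id", "lat", "long", "value", "year", "month", "day"]
df = pd.DataFrame(np.arange(14).reshape(2, 7), columns=cols)

ranges = [("lat", "long"), ("year", "day")]

# Positional route: translate each label pair into a right-exclusive slice.
loc = df.columns.get_loc
slices = tuple(slice(loc(a), loc(b) + 1) for a, b in ranges)
by_position = df.iloc[:, np.r_[slices]]

# Label route: .loc slices include both endpoints, so no +1 is needed.
by_label = pd.concat([df.loc[:, a:b] for a, b in ranges], axis=1)

print(by_position.columns.tolist())
print(by_position.equals(by_label))
```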

How to efficiently select several value ranges in Pandas?

METHOD#1

You can use pd.cut to create the groups dynamically and save them in a dictionary, then look up each key to get the individual dataframe:

bins = [0,5,10,20,30,40,50,60,np.inf]
labels = ['five','ten','twenty','thirty','forty','fifty','sixty','over']

u = df1.assign(grp=pd.cut(df1['a'],bins,labels=labels))
d = dict(iter(u.groupby("grp")))

test runs:

print(f"""Group five is \n\n {d['five']}\n\n 
Group forty is \n\n{d['forty']} \n\n Group over is \n\n{d['over']}""")

Group five is

x a grp
3 d 5 five
13 fc 2 five


Group forty is

x a grp
0 a 34 forty
10 cs 34 forty
11 ca 32 forty

Group over is

x a grp
4 e 120 over
8 cf 67 over
12 ac 1213 over
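A self-contained version of Method 1, as a sketch. The answer does not show df1, so the frame below is a hypothetical one built from a few of the values visible in the outputs above. Passing observed=True is a choice made here so that labels with no rows are left out of the dictionary:

```python
import numpy as np
import pandas as pd

# Hypothetical df1 using a handful of the values shown in the outputs above.
df1 = pd.DataFrame({"x": ["a", "c", "d", "e", "fc"],
                    "a": [34, 51, 5, 120, 2]})

bins = [0, 5, 10, 20, 30, 40, 50, 60, np.inf]
labels = ["five", "ten", "twenty", "thirty", "forty", "fifty", "sixty", "over"]

# Bins are right-inclusive by default: 5 lands in (0, 5], 51 lands in (50, 60].
u = df1.assign(grp=pd.cut(df1["a"], bins, labels=labels))

# observed=True skips labels that have no rows, so d holds only non-empty groups.
d = {name: group for name, group in u.groupby("grp", observed=True)}

print(sorted(d))
print(d["five"])
```

Note that each label names the upper edge of its bin, so 'forty' means values above 30 and up to 40.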

METHOD#2
You can also use locals() to turn the dictionary keys into local variables, but the dict method is better:

bins = [0,5,10,20,30,40,50,60,np.inf]
labels = ['five','ten','twenty','thirty','forty','fifty','sixty','over']

u = df1.assign(grp=pd.cut(df1['a'],bins,labels=labels))
d = dict(iter(u.groupby("grp")))
for k,v in d.items():
    locals().update({k:v})

print(over,'\n\n',five,'\n\n',sixty)

x a grp
4 e 120 over
8 cf 67 over
12 ac 1213 over

x a grp
3 d 5 five
13 fc 2 five

x a grp
2 c 51 sixty
7 cd 56 sixty
9 cv 54 sixty
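One reason the dict method is safer: updating locals() only creates usable names at module level, where locals() is the same mapping as globals(). Inside a function, CPython does not turn those entries into real local variables. A small sketch with stand-in data (the function name and the lists are hypothetical):

```python
def unpack_with_locals(groups):
    # Inside a function, writing to locals() does not bind real local names.
    locals().update(groups)
    try:
        return five  # looked up as a global, which does not exist here
    except NameError:
        return None


groups = {"five": [5, 2], "over": [120, 67]}

result = unpack_with_locals(groups)
print(result)            # None: the name 'five' was never bound
print(groups["five"])    # the dictionary lookup works everywhere
```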

Select multiple columns by labels in pandas

Name- or Label-Based (using regular expression syntax)

df.filter(regex='[A-CEG-I]')   # does NOT depend on the column order

Note that any regular expression is allowed here, so this approach can be very general. E.g. if you wanted all columns starting with a capital or lowercase "A" you could use: df.filter(regex='^[Aa]')

Location-Based (depends on column order)

df[ list(df.loc[:,'A':'C']) + ['E'] + list(df.loc[:,'G':'I']) ]

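Note that, unlike the regex method, a label slice such as 'A':'C' returns every column positioned between 'A' and 'C', so the result depends on the order of your columns (the example assumes they are sorted alphabetically). This is not necessarily a problem, however. For example, if your columns go ['A','C','B'], then you could replace 'A':'C' above with 'A':'B'.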

The Long Way

And for completeness, you always have the option shown by @Magdalena of simply listing each column individually, although it could be much more verbose as the number of columns increases:

df[['A','B','C','E','G','H','I']]   # does NOT depend on the column order

Results for any of the above methods

          A         B         C         E         G         H         I
0 -0.814688 -1.060864 -0.008088 2.697203 -0.763874 1.793213 -0.019520
1 0.549824 0.269340 0.405570 -0.406695 -0.536304 -1.231051 0.058018
2 0.879230 -0.666814 1.305835 0.167621 -1.100355 0.391133 0.317467
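One caveat with the regex method, shown as a sketch: DataFrame.filter matches the pattern anywhere in the label, so '[A-CEG-I]' also picks up longer names that merely contain one of those letters. The frame and the extra 'DATA' column below are hypothetical; anchoring the pattern with ^ and $ restricts the match to single-letter labels:

```python
import pandas as pd

# Hypothetical frame: single-letter columns A..J plus one longer name.
df = pd.DataFrame([[0] * 11], columns=list("ABCDEFGHIJ") + ["DATA"])

# Unanchored: any label containing one of the listed letters matches.
loose = df.filter(regex="[A-CEG-I]")

# Anchored: only labels that consist of exactly one listed letter match.
strict = df.filter(regex="^[A-CEG-I]$")

print(loose.columns.tolist())
print(strict.columns.tolist())
```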

Pandas: Find values within multiple ranges defined by start- and stop-columns

I believe you need the parameter closed='both' in IntervalIndex.from_arrays, so that both the start and stop values count as inside the range:

intervals = pd.IntervalIndex.from_arrays(df2['start'], df2['stop'], 'both')

And then select matching values:

df = df[intervals.get_indexer(df.age.values) != -1]
print (df)
age some_random_value
0 1 100
1 2 200
2 3 300
4 5 500
5 6 600
6 7 700

Detail (computed on the original df, before the filter above was applied):

print (intervals.get_indexer(df.age.values))
[ 0 0 0 -1 1 1 1 -1 -1 -1]
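The inputs are not shown in the answer, so here is a sketch with hypothetical df and df2 chosen to reproduce the indexer above. It assumes the intervals do not overlap; as far as I know, get_indexer refuses to work on an overlapping IntervalIndex:

```python
import pandas as pd

# Hypothetical inputs chosen to match the output shown above.
df = pd.DataFrame({"age": range(1, 11),
                   "some_random_value": range(100, 1100, 100)})
df2 = pd.DataFrame({"start": [1, 5], "stop": [3, 7]})

# closed='both' makes start and stop themselves part of each interval.
intervals = pd.IntervalIndex.from_arrays(df2["start"], df2["stop"], closed="both")

# Position of the matching interval per age, or -1 when no interval matches.
idx = intervals.get_indexer(df["age"].values)
inside = df[idx != -1]

print(idx)
print(inside)
```

Keeping the filtered result in a new name (inside) rather than overwriting df means the indexer can still be recomputed on the full frame later.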

