Define and apply custom bins on a dataframe
Another cut answer that takes into account extrema:
dat <- read.table("clipboard", header=TRUE)
cuts <- apply(dat, 2, cut, c(-Inf,seq(0.5, 1, 0.1), Inf), labels=0:6)
cuts[cuts=="6"] <- "0"
cuts <- as.data.frame(cuts)
  cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1           3         0            0           1         1            0       0
2           0         0            5           0         2            2       0
3           1         0            2           0         0            1       0
4           0         0            3           0         1            1       0
5           1         3            1           0         4            0       0
6           0         0            1           0         0            0       0
Explanation
The cut function splits a vector into bins at the break points you specify. So let's take 1:10 and split it at 3, 5 and 7.
cut(1:10, c(3, 5, 7))
[1] <NA> <NA> <NA> (3,5] (3,5] (5,7] (5,7] <NA> <NA> <NA>
Levels: (3,5] (5,7]
You can see how it has made a factor where the levels are the intervals between the breaks. Also notice it doesn't include 3 (there's an include.lowest argument which will include it). But these are terrible names for groups, so let's call them groups 1 and 2.
cut(1:10, c(3, 5, 7), labels=1:2)
[1] <NA> <NA> <NA> 1 1 2 2 <NA> <NA> <NA>
Better, but what's with the NAs? Those values are outside our boundaries and not counted. To count them, in my solution I added -Inf and Inf as the outermost breaks, so all points would be included. Notice that as we have more breaks, we'll need more labels:
x <- cut(1:10, c(-Inf, 3, 5, 7, Inf), labels=1:4)
x
[1] 1 1 1 2 2 3 3 4 4 4
Levels: 1 2 3 4
Ok, but we didn't want a group 4 (as per your problem); we wanted all the 4s to be in group 1. So let's relabel the entries which are labelled '4'.
x[x=="4"] <- "1"
x
[1] 1 1 1 2 2 3 3 1 1 1
Levels: 1 2 3 4
This is slightly different from what I did at the top (there I replaced the last label across all columns at once), but I've done it this way here so you can better see how cut works.
Ok, now the apply function. So far we've been using cut on a single vector, but you want it applied to a collection of vectors: each column of your data frame. That's what the second argument of apply does: 1 applies the function to each row, 2 applies it to each column. So apply(dat, 2, cut, ...) applies cut to each column of your data frame, and everything after cut in the apply call is simply passed along as arguments to cut, which we discussed above.
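For readers working in pandas rather than R, the same -Inf/Inf trick carries over; a minimal sketch with made-up scores in the same 0-1 range (the column names here are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical similarity scores in the same 0-1 range as the question
dat = pd.DataFrame({'cosinFcolor': [0.2, 0.55, 0.72, 1.3],
                    'cosinEdge':   [0.61, 0.49, 0.88, 0.95]})

# -inf/inf endpoints guarantee every value falls in some bin
breaks = [-np.inf, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, np.inf]
cuts = dat.apply(pd.cut, bins=breaks, labels=range(7)).astype(str)

# Fold the top bin (values above 1) back into group 0, as in the R answer
cuts[cuts == '6'] = '0'
```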
Hope that helps.
Binning Pandas Dataframe by custom and variable length datetime ranges
If you have the start and end times of each period, then you really don't need to create a range: you can just use boolean logic on the datetime objects. This should be easy to generalize to more tests if you have their start and stop times as well.
import pandas as pd
start_t1 = pd.to_datetime('2017-10-14 00:20:00')
stop_t1 = pd.to_datetime('2017-10-14 00:33:15')
start_t2 = pd.to_datetime('2017-10-14 00:49:15')
stop_t2 = pd.to_datetime('2017-10-14 01:15:15')
df.loc[(df.Timestamp > start_t1) & (df.Timestamp < stop_t1), 'Test'] = 'Test_1'
df.loc[(df.Timestamp > start_t2) & (df.Timestamp < stop_t2), 'Test'] = 'Test_2'
  DeviceID  QuantResult2  QuantResult1            Timestamp    Test
0      15D        387403          7903  2017-10-14 00:28:00  Test_1
1      15D           786       3429734  2017-10-14 00:29:10  Test_1
2      15D           546       2320923  2017-10-14 00:31:15  Test_1
3      15D        435869           232  2017-10-14 00:50:05  Test_2
4      15D            12      34032984  2017-10-14 01:10:07  Test_2
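With more test windows, the same logic generalizes to a loop over (start, stop, label) tuples; a sketch with a stand-in frame (the Timestamp values below are made up for illustration):

```python
import pandas as pd

# Stand-in for the question's frame: one timestamp per row
df = pd.DataFrame({'Timestamp': pd.to_datetime([
    '2017-10-14 00:28:00',    # inside window 1
    '2017-10-14 00:40:00',    # between windows
    '2017-10-14 01:10:07'])}) # inside window 2

windows = [
    ('2017-10-14 00:20:00', '2017-10-14 00:33:15', 'Test_1'),
    ('2017-10-14 00:49:15', '2017-10-14 01:15:15', 'Test_2'),
]
for start, stop, label in windows:
    # Label every row whose timestamp falls strictly inside the window
    mask = (df.Timestamp > pd.to_datetime(start)) & (df.Timestamp < pd.to_datetime(stop))
    df.loc[mask, 'Test'] = label
```

Rows falling outside every window are left as NaN, which makes them easy to spot afterwards.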
Binning a column with Python Pandas
You can use pandas.cut:
bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = pd.cut(df['percentage'], bins)
print (df)
percentage binned
0 46.50 (25, 50]
1 44.20 (25, 50]
2 100.00 (50, 100]
3 42.12 (25, 50]
bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1,2,3,4,5,6]
df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels)
print (df)
percentage binned
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
Or numpy.searchsorted:
bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = np.searchsorted(bins, df['percentage'].values)
print (df)
percentage binned
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
...and then value_counts or groupby and aggregate size:
s = pd.cut(df['percentage'], bins=bins).value_counts()
print (s)
(25, 50] 3
(50, 100] 1
(10, 25] 0
(5, 10] 0
(1, 5] 0
(0, 1] 0
Name: percentage, dtype: int64
s = df.groupby(pd.cut(df['percentage'], bins=bins)).size()
print (s)
percentage
(0, 1] 0
(1, 5] 0
(5, 10] 0
(10, 25] 0
(25, 50] 3
(50, 100] 1
dtype: int64
By default cut returns a Categorical. Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data; see the pandas documentation on operations with categorical data.
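One small note on the value_counts output above: it sorts by count, so the intervals appear out of order; chaining sort_index restores bin order:

```python
import pandas as pd

df = pd.DataFrame({'percentage': [46.50, 44.20, 100.00, 42.12]})
bins = [0, 1, 5, 10, 25, 50, 100]

# Sort by the interval index instead of by count
s = pd.cut(df['percentage'], bins=bins).value_counts().sort_index()
```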
Grouping values into custom bins
Try making the bins fall between the thresholds:
bins = [0.5, 8.5, 11.5, 12.5, 15.5, 16.5]
labels = ['Middle School or less', 'Some High School',
          'High School Grad', 'Some College', 'College Grad']
adult_df_educrace['education_bins'] = pd.cut(x=adult_df_educrace['education'],
                                             bins=bins,
                                             labels=labels)
Test:
adult_df_educrace = pd.DataFrame({'education':np.arange(1,17)})
Output:
education education_bins
0 1 Middle School or less
1 2 Middle School or less
2 3 Middle School or less
3 4 Middle School or less
4 5 Middle School or less
5 6 Middle School or less
6 7 Middle School or less
7 8 Middle School or less
8 9 Some High School
9 10 Some High School
10 11 Some High School
11 12 High School Grad
12 13 Some College
13 14 Some College
14 15 Some College
15 16 College Grad
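Because each edge sits halfway between two integer codes (8.5 between 8 and 9, and so on), values on either side of a threshold land in different groups. A runnable restatement of the snippet above:

```python
import numpy as np
import pandas as pd

# Bin edges placed between the integer education codes
bins = [0.5, 8.5, 11.5, 12.5, 15.5, 16.5]
labels = ['Middle School or less', 'Some High School',
          'High School Grad', 'Some College', 'College Grad']

adult_df_educrace = pd.DataFrame({'education': np.arange(1, 17)})
adult_df_educrace['education_bins'] = pd.cut(x=adult_df_educrace['education'],
                                             bins=bins, labels=labels)
```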
How to create bins and assign labels based on a given condition pandas
The solution involves editing your bins list:
# Same labels as yours
labels = ['0', '1.0 - 1.19', '1.2 – 1.39', '1.4 – 1.59', '1.6 – 1.79',
'1.8 – 1.99', '2.0 – 2.99', '3.0 – 3.99', '4.0 – 4.99', '5.0 – 5.99',
'6.0 – 6.99', '7.0 – 7.99', '8.0 – 8.99', '9.0 – 9.99', '10.0+']
# Define the edges between bins
bins = [0, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 3.0, 4.0,
5.0, 6.0, 7.0, 8.0, 9.0, 10.0, np.inf]
# pd.cut each column, with each bin closed on left and open on right
res = df.apply(lambda x: pd.cut(x, bins=bins, labels=labels, right=False))
# rename columns and print result
res.columns = [f'new{i+1}' for i in range(df.shape[1])]
print(res)
new1 new2 new3 new4 new5 new6 new7 new8 new9 new10
0 10.0+ 5.0 – 5.99 10.0+ 10.0+ 0 1.0 - 1.19 3.0 – 3.99 2.0 – 2.99 0 1.4 – 1.59
1 3.0 – 3.99 7.0 – 7.99 0 3.0 – 3.99 10.0+ 0 8.0 – 8.99 10.0+ 0 0
2 6.0 – 6.99 10.0+ 10.0+ 3.0 – 3.99 10.0+ 6.0 – 6.99 10.0+ 4.0 – 4.99 0 0
3 2.0 – 2.99 5.0 – 5.99 0 8.0 – 8.99 10.0+ 0 6.0 – 6.99 10.0+ 0 0
4 4.0 – 4.99 10.0+ 10.0+ 10.0+ 1.4 – 1.59 10.0+ 10.0+ 10.0+ 0 0
5 7.0 – 7.99 2.0 – 2.99 0 4.0 – 4.99 3.0 – 3.99 0 10.0+ 0 0 0
6 6.0 – 6.99 10.0+ 8.0 – 8.99 10.0+ 6.0 – 6.99 3.0 – 3.99 9.0 – 9.99 10.0+ 0 0
7 6.0 – 6.99 5.0 – 5.99 6.0 – 6.99 10.0+ 10.0+ 10.0+ 4.0 – 4.99 10.0+ 0 0
8 3.0 – 3.99 10.0+ 4.0 – 4.99 8.0 – 8.99 0 10.0+ 10.0+ 3.0 – 3.99 10.0+ 0
Explanation
The sequence of scalars passed as bins to pd.cut() "defines the bin edges allowing for non-uniform width" (https://pandas.pydata.org/docs/reference/api/pandas.cut.html).
By default, each bin is open on the left and closed on the right. To switch this, pass right=False, which closes the left edge and opens the right edge of each bin.
For example, bins=[0, 1.0, 1.19, 1.2] with right=False causes pd.cut to make 3 intervals: [0.0, 1.0) < [1.0, 1.19) < [1.19, 1.2).
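A quick check of the two closures with that same edge list, inspecting the interval each boundary value lands in:

```python
import pandas as pd

# Two values sitting exactly on bin edges
x = pd.Series([1.0, 1.19])
edges = [0, 1.0, 1.19, 1.2]

closed_right = pd.cut(x, bins=edges)              # default: (left, right]
closed_left = pd.cut(x, bins=edges, right=False)  # [left, right)
```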
Map numeric data into bins in Pandas dataframe for separate groups using dictionaries
A simpler solution would be to use groupby and apply a custom function to each group. In this case, we can define a function reclass that obtains the correct bins and ids for the group and then uses pd.cut:
def reclass(group, name):
    bins = bins_dic[name]
    ids = ids_dic[name]
    return pd.cut(group, bins, labels=ids)
df['id'] = df.groupby('polyid')['value'].apply(lambda x: reclass(x, x.name))
Result:
polyid value id
0 1 0.56 1
1 1 0.59 1
2 1 0.62 2
3 1 0.83 3
4 2 0.85 2
5 2 0.01 1
6 2 0.79 2
7 3 0.37 1
8 3 0.99 3
9 3 0.48 1
10 3 0.55 2
11 3 0.06 1
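Since the question's bins_dic and ids_dic aren't shown, here is a self-contained sketch with made-up dictionaries; note group_keys=False, which keeps the original index so the assignment aligns on newer pandas versions:

```python
import pandas as pd

# Hypothetical per-polygon bin edges and labels, standing in for the
# question's bins_dic and ids_dic
bins_dic = {1: [0, 0.6, 0.8, 1.0], 2: [0, 0.5, 0.9, 1.0], 3: [0, 0.7, 0.9, 1.0]}
ids_dic = {1: [1, 2, 3], 2: [1, 2, 3], 3: [1, 2, 3]}

df = pd.DataFrame({'polyid': [1, 1, 2, 3],
                   'value': [0.56, 0.83, 0.85, 0.99]})

def reclass(group, name):
    # Look up this polyid's edges and labels, then bin with pd.cut
    return pd.cut(group, bins_dic[name], labels=ids_dic[name])

df['id'] = (df.groupby('polyid', group_keys=False)['value']
              .apply(lambda x: reclass(x, x.name)))
```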
How to define a function that will check any data frame for Age column and return bins?
The problem lies here:
x = input("Enter Dataframe Name: ") # type of x is a string
df = x # now type of df is also a string
df['Age'] # python uses [] as a slicing operation for string, hence generate error
This resolves the problem:
def age_range(df):
    bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    labels = ['0-9', '10-19', '20s', '30s', '40s', '50s', '60s', '70s', '80s', '90s']
    result = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
    return result
For example, you can run it like:
import random
df = pd.DataFrame({'Age' : [random.randint(1, 99) for i in range(500)]})
df["AgeRange"] = age_range(df)
or
df = pd.DataFrame({'Age' : [random.randint(1, 99) for i in range(500)]})
AgeRangeDf = pd.DataFrame({"Age_Range" :age_range(df)})
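Because the function uses right=False, the bins are closed on the left: a boundary age of exactly 10 lands in '10-19', not '0-9'. A quick check, re-declaring the function so this runs standalone:

```python
import pandas as pd

def age_range(df):
    bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    labels = ['0-9', '10-19', '20s', '30s', '40s', '50s', '60s', '70s', '80s', '90s']
    # right=False closes each bin on the left: [0, 10), [10, 20), ...
    return pd.cut(df['Age'], bins=bins, labels=labels, right=False)

df = pd.DataFrame({'Age': [9, 10, 35, 99]})
df['AgeRange'] = age_range(df)
```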