Add Column Which Contains Binned Values of a Numeric Column

Add column which contains binned values of a numeric column

See ?cut and specify breaks (and maybe labels).

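# sample data, reconstructed from the printed output below
x <- data.frame(rank=c(1,3,6,3,15),
                name=c("steve","joe","john","liz","jon"),
                info=c("red","blue","green","yellow","pink"))

# intervals are (0,4], (4,10], (10,15], so the last label starts at 11
x$bins <- cut(x$rank, breaks=c(0,4,10,15), labels=c("1-4","5-10","11-15"))
x
#   rank  name   info  bins
# 1    1 steve    red   1-4
# 2    3   joe   blue   1-4
# 3    6  john  green  5-10
# 4    3   liz yellow   1-4
# 5   15   jon   pink 11-15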

Binning a column with Python Pandas

You can use pandas.cut:

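import pandas as pd

# sample data, reconstructed from the printed output below
df = pd.DataFrame({'percentage': [46.50, 44.20, 100.00, 42.12]})

bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = pd.cut(df['percentage'], bins)
print(df)
   percentage     binned
0       46.50   (25, 50]
1       44.20   (25, 50]
2      100.00  (50, 100]
3       42.12   (25, 50]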


bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1, 2, 3, 4, 5, 6]
df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels)
print(df)
   percentage binned
0       46.50      5
1       44.20      5
2      100.00      6
3       42.12      5
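The interval notation in the output, such as (25, 50], means the bins are open on the left and closed on the right. A minimal sketch of how that affects values sitting exactly on an edge or outside the range; the sample values here are hypothetical and chosen only to hit those cases:

```python
import pandas as pd

# Hypothetical values placed on bin edges (0, 1, 5, 100) and outside the range (150).
values = pd.Series([0, 1, 5, 100, 150])
bins = [0, 1, 5, 10, 25, 50, 100]

# Default: right-closed intervals (a, b], so 0 is not in (0, 1] and becomes NaN.
default = pd.cut(values, bins)

# include_lowest=True also closes the first interval on the left, so 0 is kept.
with_lowest = pd.cut(values, bins, include_lowest=True)

# right=False switches to left-closed intervals [a, b), so 100 drops out instead.
left_closed = pd.cut(values, bins, right=False)

print(default.isna().tolist())      # [True, False, False, False, True]
print(with_lowest.isna().tolist())  # [False, False, False, False, True]
print(left_closed.isna().tolist())  # [False, False, False, True, True]
```

Values that fall outside every bin come back as NaN in all three variants, so it is worth checking for missing values after binning.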

Or numpy.searchsorted:

import numpy as np

bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = np.searchsorted(bins, df['percentage'].values)
print(df)
   percentage  binned
0       46.50       5
1       44.20       5
2      100.00       6
3       42.12       5
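To see why numpy.searchsorted gives the same numbers as the labelled pd.cut call, here is a small comparison sketch. The extra value 5.0 is a hypothetical addition so that one value lands exactly on a bin edge:

```python
import numpy as np
import pandas as pd

bins = [0, 1, 5, 10, 25, 50, 100]
# The first four values come from the example above; 5.0 is added for illustration.
values = pd.Series([46.50, 44.20, 100.00, 42.12, 5.0])

# side="left" (the default) returns the index of the first edge >= value,
# so a value equal to an edge lands in the bin that ends at that edge: (a, b].
idx = np.searchsorted(bins, values.to_numpy(), side="left")

# labels=False makes pd.cut return zero-based bin positions; add 1 to compare.
cut_idx = pd.cut(values, bins, labels=False) + 1

print(idx.tolist())      # [5, 5, 6, 5, 2]
print(cut_idx.tolist())  # [5, 5, 6, 5, 2]

# Out-of-range values differ: searchsorted returns 0 or len(bins), never NaN.
print(np.searchsorted(bins, [0, 150]).tolist())  # [0, 7]
```

The two approaches agree only for values inside the bin range, so searchsorted needs an explicit range check if out-of-range data is possible.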

...and then value_counts or groupby and aggregate size:

s = pd.cut(df['percentage'], bins=bins).value_counts()
print(s)
(25, 50]     3
(50, 100]    1
(10, 25]     0
(5, 10]      0
(1, 5]       0
(0, 1]       0
Name: percentage, dtype: int64


# observed=False keeps the empty bins in the result, as shown below
s = df.groupby(pd.cut(df['percentage'], bins=bins), observed=False).size()
print(s)
percentage
(0, 1]       0
(1, 5]       0
(5, 10]      0
(10, 25]     0
(25, 50]     3
(50, 100]    1
dtype: int64

By default, cut returns a categorical.

Series methods like Series.value_counts() use all categories, even if some of them are not present in the data. This is why the empty bins above appear with a count of 0.
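A short sketch of that behaviour, using the same four percentages as above. The variable names are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"percentage": [46.50, 44.20, 100.00, 42.12]})
bins = [0, 1, 5, 10, 25, 50, 100]
binned = pd.cut(df["percentage"], bins=bins)

# The result has categorical dtype, with one category per interval.
print(binned.dtype)                # category
print(len(binned.cat.categories))  # 6

# value_counts reports every category, including the four empty bins.
counts = binned.value_counts()
print(len(counts))                 # 6

# Keep only the bins that actually occur in the data.
observed = counts[counts > 0]
print(len(observed))               # 2

# sort_index puts the counts back into bin order instead of frequency order.
ordered = counts.sort_index()
print(ordered.tolist())            # [0, 0, 0, 0, 3, 1]
```

Filtering on the counts is one simple way to drop empty bins; whether empty bins should be shown or hidden depends on what the table is for.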

How do I reassign the values of a column based on different ranges in R?

We could use case_when from the dplyr package:

library(dplyr)
df %>%
  mutate(NEW = case_when(sleep_duration < 5 ~ 3,
                         sleep_duration >= 5 & sleep_duration < 6 ~ 2,
                         sleep_duration >= 6 & sleep_duration < 7 ~ 1,
                         sleep_duration >= 7 ~ 0))

Output:

  sleep_duration NEW
1            6.0   1
2            7.5   0
3            8.0   0
4           10.0   0
5            5.0   2
6            9.0   0

data:

df <- data.frame(sleep_duration = c(6, 7.5, 8, 10, 5, 9))

How to bin data based on values in one column, and count occurrences from another column excluding duplicates in R?

Will this work?

df <- data.frame(CNV=c("1:10405137","1:10405137","1:10405137","1:101161140","1:110028467"),
                 r_value=c(0.035118621,0.070643341,0.391963719,0.376573375,0.950231679))

> df # minimal example
          CNV    r_value
1  1:10405137 0.03511862
2  1:10405137 0.07064334
3  1:10405137 0.39196372
4 1:101161140 0.37657337
5 1:110028467 0.95023168

df1 <- transform(df, group=cut(r_value,
                               breaks=c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 1),
                               labels=c("<0.1", "0.1", "0.2", "0.3", "0.4", "0.5<")))

res <- do.call(data.frame, aggregate(r_value~group, df1,
                                     FUN=function(x) c(Count=length(x))))

> res # counts of intervals
  group r_value
1  <0.1       2
2   0.3       2
3  0.5<       1

dNew <- data.frame(group=levels(df1$group))
dNew <- merge(res, dNew, all=TRUE)
colnames(dNew) <- c("interval", "count")

> dNew # count of CNV by interval
  interval count
1     <0.1     2
2      0.1    NA
3      0.2    NA
4      0.3     2
5      0.4    NA
6     0.5<     1

Adapted from "Group/bin/bucket data in R and get count per bucket and sum of values per bucket".

Add column into a dataframe based on condition

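EDIT:
Your code is not wrong.
You just have to convert the result back into a factor, like this:

# stringsAsFactors=TRUE is needed in R >= 4.0, where strings are no longer converted to factors by default
df <- data.frame(B=c("A","B","C","C"), C=c("A","C","B","B"), D=c("B","A","C","A"), stringsAsFactors=TRUE)
df$A <- levels(df$B)[with(df, ifelse(df$B==df$C, df$D, df$C))]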

To see why this happens, you have to look at what ifelse does:

debugonce(ifelse)
ifelse(df$B==df$C,df$D,df$C)

Keep in mind: "Factor variables are stored, internally, as numeric variables together with their levels. The actual values of the numeric variable are 1, 2, and so on."
In particular, ifelse starts its answer vector from the test result, that is, from a logical vector. Then, based on the test comparison, ifelse assigns the "yes" and "no" values into subsets of this answer vector. R therefore keeps the plain vector representation, and the result holds the integer codes of the factors rather than their labels.

Briefly, something like this happens and you lose the factor representation:

a <- c(TRUE, FALSE)
a[1] <- df$D[1]   # when df$D is a factor, its integer code is stored, not its label
df$D
a

You can also try this working example (an alternative way to do the same thing):

df <- data.frame(B=c("A","B","C","C"), C=c("A","C","B","B"), D=c("B","A","C","A"))
df

# return z when x and y match, otherwise y
f <- function(x, y, z) {
  if (x == y) {
    z
  } else {
    y
  }
}

df$A <- unlist(Map(f, df$B, df$C, df$D))

