Combining Low Frequency Counts

Python: Combining Low Frequency Factors/Category Counts

Create a mask by finding the percentage of occurrence of each category, i.e.:

import numpy as np
import pandas as pd

series = df['column'].value_counts()
mask = (series / series.sum() * 100).lt(1)  # True for categories under 1%
# Replace the low-frequency values in df['column'] with 'Other' using np.where
df['column'] = np.where(df['column'].isin(series[mask].index), 'Other', df['column'])

To collapse the counts themselves, keep the frequent categories and add an 'Other' entry holding the sum of the rest:

new = series[~mask]
new['Other'] = series[mask].sum()

Windows 26083
iOS 19711
Android 13077
Macintosh 5799
Other 832
Name: 1, dtype: int64

If you want to replace the index instead:

series.index = np.where(series.index.isin(series[mask].index),'Other',series.index)

Windows 26083
iOS 19711
Android 13077
Macintosh 5799
Other 347
Other 285
Other 167
Other 22
Other 11
Name: 1, dtype: int64
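
Note that replacing the index this way leaves several separate 'Other' rows. If you want them collapsed into a single row, a minimal sketch (assuming the renamed series from above) is to group by the index and sum:

# Merge the duplicated 'Other' entries into one row
series = series.groupby(level=0).sum()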

Explanation

(series/series.sum() * 100) # This gives you the percentage of each category, i.e.:

Windows 39.820158
iOS 30.092211
Android 19.964276
Macintosh 8.853165
Chrome OS 0.529755
Linux 0.435101
Windows Phone 0.254954
(not set) 0.033587
BlackBerry 0.016793
Name: 1, dtype: float64

.lt(1) is equivalent to "less than 1". It returns a boolean mask; based on that mask, index the series and assign the data.
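
For instance, on a small made-up series (hypothetical counts, not the data above), the mask looks like this:

import pandas as pd

s = pd.Series({'Windows': 26083, 'iOS': 19711, 'Linux': 285})
mask = (s / s.sum() * 100).lt(1)  # Linux is roughly 0.6% of the total
print(mask)
# Windows    False
# iOS        False
# Linux       True
# dtype: bool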

Group low frequency counts into a single 'other' category

One option to collapse the factor or character levels is fct_collapse from forcats:

library(dplyr)
library(forcats)

threshold <- 7
out <- df1 %>%
  count(Col1 = fct_collapse(Col1, Other = unique(Col1[Col2 < threshold])),
        wt = Col2)
out
# A tibble: 3 x 2
# Col1 n
# <fct> <int>
#1 A 25
#2 Other 8
#3 C 15

Then, we can create a pie chart:

library(ggplot2)

out %>%
  ggplot(aes(x = "", y = n, fill = Col1)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0)

(pie chart of the collapsed categories)

Update

Based on the OP's dput, we can adapt the code to the OP's column names:

df2 %>%
  count(Haplogroup = fct_collapse(as.character(Haplogroup),
                                  Other = unique(as.character(Haplogroup)[n < threshold])),
        wt = n, name = "n1")
# A tibble: 6 x 2
# Haplogroup n1
# <fct> <int>
#1 Other 40
#2 E1b 14
#3 N1a 12
#4 R1 10
#5 R1a 15
#6 R1b 25

Or another option is base R (assuming the column is of character class): create a logical vector by comparing 'Col2' with the 'threshold', assign "Other" to the elements of 'Col1' where 'i1' is TRUE, then do a group-by sum with aggregate:

i1 <- df1$Col2 < threshold
df1$Col1[i1] <- "Other"
aggregate(Col2 ~ Col1, df1, sum)
# Col1 Col2
#1 A 25
#2 C 15
#3 Other 8

Data

df1 <- structure(list(Col1 = c("A", "B", "C", "D", "E"), Col2 = c(25L, 
1L, 15L, 5L, 2L)), row.names = c(NA, -5L), class = "data.frame")

How can I convert the Python code that combines low frequency categories/values into a function that can be applied to any pandas dataframe column?

The solution can be simplified with normalize=True in Series.value_counts:

import numpy as np

def replace_thresh(df, col, thresh, new_val):
    # Percentage frequency of each value in the column
    s = df[col].value_counts(normalize=True).mul(100)
    # Replace values whose frequency is below the threshold
    df[col] = np.where(df[col].isin(s.index[s < thresh]), new_val, df[col])
    return df

df = replace_thresh(df, 'col', 1, 'Other')
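
As a quick sanity check, a minimal sketch with a made-up DataFrame (hypothetical column 'col'), where 'z' appears in only 1 of 1000 rows and is therefore below the 1% threshold:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['a'] * 600 + ['b'] * 399 + ['z']})
df = replace_thresh(df, 'col', 1, 'Other')
print(df['col'].value_counts())
# a        600
# b        399
# Other      1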

How do I assign 'other' to low frequency categories? (pandas)

Let's go through the logic of the expected actions:

  1. Count the frequency of every city
  2. Calculate the bottom-10% cutoff
  3. Find the cities with frequencies at or below that cutoff
  4. Change them to "other"

You started in the right direction. To get frequencies for every city:

city_freq = (df['city'].value_counts())/df.shape[0]

We want to find the bottom 10%. We use pandas' quantile to do it:

bottom_decile = city_freq.quantile(q=0.1)

Now bottom_decile is a float representing the cutoff that separates the bottom 10% from the rest. Cities with frequency at or below that cutoff:

less_freq_cities = city_freq[city_freq <= bottom_decile]

less_freq_cities will hold the entries for those cities. If you want to change their values in 'df' to "other":

df.loc[df["city"].isin(less_freq_cities.index.tolist()), "city"] = "other"

Complete code:

city_freq = (df['city'].value_counts())/df.shape[0]
bottom_decile = city_freq.quantile(q=0.1)
less_freq_cities = city_freq[city_freq <= bottom_decile]
df.loc[df["city"].isin(less_freq_cities.index.tolist()), "city"] = "other"

This is how you replace the bottom 10% (or whatever share you want; just change the q parameter in quantile) with a value of your choice.

EDIT:
As suggested in a comment, to get normalized frequencies it's better to use
city_freq = df['city'].value_counts(normalize=True) instead of dividing by the shape. But actually, we don't need normalized frequencies here: pandas' quantile will work even if they are not normalized. We can use
city_freq = df['city'].value_counts() and it will still work.
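
A minimal sketch of why that holds (made-up city counts, not the OP's data): quantile only depends on the relative ordering of the frequencies, and dividing every count by the same total does not change that ordering, so the raw counts and the normalized frequencies select the same set of cities.

import pandas as pd

city = pd.Series(['NY'] * 50 + ['LA'] * 30 + ['SF'] * 15 + ['Oslo'] * 4 + ['Bern'])
raw = city.value_counts()                 # absolute counts
norm = city.value_counts(normalize=True)  # relative frequencies

low_raw = raw[raw <= raw.quantile(q=0.1)].index
low_norm = norm[norm <= norm.quantile(q=0.1)].index
print(set(low_raw) == set(low_norm))      # True -- same cities either way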

Is there a way to add frequency counts together in a for loop?

1

m = sapply(unique(dd$Allele), function(x)
  sapply(unique(dd$Range), function(y) {
    sum(dd$Frequency[dd$Allele == x & dd$Range == y])
  }))
dimnames(m) = list(Range = unique(dd$Range), Allele = unique(dd$Allele))
m
# Allele
#Range 0 1 2
# High 1 9 28
# Normal 2 7 17

2

with(dd, tapply(Frequency, list(Range, Allele), sum))
# 0 1 2
#High 1 9 28
#Normal 2 7 17

3

library(reshape2)
dcast(dd, Range ~ Allele, value.var = "Frequency", fun.aggregate = sum)
# Range 0 1 2
#1 High 1 9 28
#2 Normal 2 7 17

