Combining Low Frequency Counts

Python: Combining Low Frequency Factors/Category Counts

Create a mask by finding the percentage of occurrence of each category, i.e.:

import numpy as np
import pandas as pd

series = df['column'].value_counts()
mask = (series / series.sum() * 100).lt(1)  # True for categories under 1%
# Replace the low-frequency values in df['column'] with 'Other' using np.where
df['column'] = np.where(df['column'].isin(series[mask].index), 'Other', df['column'])

To collapse the counts themselves, keep the frequent categories and add an 'Other' entry holding the sum of the rest:

new = series[~mask]
new['Other'] = series[mask].sum()

Windows 26083
iOS 19711
Android 13077
Macintosh 5799
Other 832
Name: 1, dtype: int64

If you want to replace the index instead:

series.index = np.where(series.index.isin(series[mask].index),'Other',series.index)

Windows 26083
iOS 19711
Android 13077
Macintosh 5799
Other 347
Other 285
Other 167
Other 22
Other 11
Name: 1, dtype: int64
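
Note that replacing the index this way leaves several separate 'Other' rows. If you want them collapsed into a single row, a minimal sketch (assuming the renamed series from above) is to group by the index and sum:

# Merge the duplicated 'Other' entries into one row
series = series.groupby(level=0).sum()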

Explanation

(series/series.sum() * 100) # This gives you the percentage of each category, i.e.:

Windows 39.820158
iOS 30.092211
Android 19.964276
Macintosh 8.853165
Chrome OS 0.529755
Linux 0.435101
Windows Phone 0.254954
(not set) 0.033587
BlackBerry 0.016793
Name: 1, dtype: float64

.lt(1) is equivalent to "less than 1". It returns a boolean mask; based on that mask, index the series and assign the data.
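
For instance, on a small made-up series (hypothetical counts, not the data above), the mask looks like this:

import pandas as pd

s = pd.Series({'Windows': 26083, 'iOS': 19711, 'Linux': 285})
mask = (s / s.sum() * 100).lt(1)  # Linux is roughly 0.6% of the total
print(mask)
# Windows    False
# iOS        False
# Linux       True
# dtype: bool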

Group low frequency counts into a single 'other' category

One option to collapse the factor or character levels is fct_collapse from forcats:

library(dplyr)
library(forcats)

threshold <- 7
out <- df1 %>%
  count(Col1 = fct_collapse(Col1, Other = unique(Col1[Col2 < threshold])),
        wt = Col2)
out
# A tibble: 3 x 2
# Col1 n
# <fct> <int>
#1 A 25
#2 Other 8
#3 C 15

Then, we can create a pie chart:

library(ggplot2)

out %>%
  ggplot(aes(x = "", y = n, fill = Col1)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0)

(pie chart of the collapsed categories)

Update

Based on the OP's dput, we can adapt the code to the OP's column names:

df2 %>%
  count(Haplogroup = fct_collapse(as.character(Haplogroup),
                                  Other = unique(as.character(Haplogroup)[n < threshold])),
        wt = n, name = "n1")
# A tibble: 6 x 2
# Haplogroup n1
# <fct> <int>
#1 Other 40
#2 E1b 14
#3 N1a 12
#4 R1 10
#5 R1a 15
#6 R1b 25

Or another option is base R (assuming the column is of character class): create a logical vector by comparing 'Col2' with the 'threshold', assign "Other" to the elements of 'Col1' where 'i1' is TRUE, then do a group-by sum with aggregate:

i1 <- df1$Col2 < threshold
df1$Col1[i1] <- "Other"
aggregate(Col2 ~ Col1, df1, sum)
# Col1 Col2
#1 A 25
#2 C 15
#3 Other 8

Data

df1 <- structure(list(Col1 = c("A", "B", "C", "D", "E"), Col2 = c(25L, 
1L, 15L, 5L, 2L)), row.names = c(NA, -5L), class = "data.frame")

How can I convert the Python code that combines low frequency categories/values into a function that can be applied to any pandas dataframe column?

The solution can be simplified with normalize=True in Series.value_counts:

import numpy as np

def replace_thresh(df, col, thresh, new_val):
    # Percentage frequency of each value in the column
    s = df[col].value_counts(normalize=True).mul(100)
    # Replace values whose frequency is below the threshold
    df[col] = np.where(df[col].isin(s.index[s < thresh]), new_val, df[col])
    return df

df = replace_thresh(df, 'col', 1, 'Other')
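
As a quick sanity check, a minimal sketch with a made-up DataFrame (hypothetical column 'col'), where 'z' appears in only 1 of 1000 rows and is therefore below the 1% threshold:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['a'] * 600 + ['b'] * 399 + ['z']})
df = replace_thresh(df, 'col', 1, 'Other')
print(df['col'].value_counts())
# a        600
# b        399
# Other      1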

How do I assign 'other' to low frequency categories? (pandas)

Let's go through the logic of the expected actions:

  1. Count the frequency of every city
  2. Calculate the bottom-10% cutoff
  3. Find the cities with frequencies at or below that cutoff
  4. Change them to "other"

You started in the right direction. To get frequencies for every city:

city_freq = (df['city'].value_counts())/df.shape[0]

We want to find the bottom 10%. We use pandas' quantile to do it:

bottom_decile = city_freq.quantile(q=0.1)

Now bottom_decile is a float representing the cutoff that separates the bottom 10% from the rest. Cities with frequency at or below that cutoff:

less_freq_cities = city_freq[city_freq <= bottom_decile]

less_freq_cities will hold the entries for those cities. If you want to change their values in 'df' to "other":

df.loc[df["city"].isin(less_freq_cities.index.tolist()), "city"] = "other"

Complete code:

city_freq = (df['city'].value_counts())/df.shape[0]
bottom_decile = city_freq.quantile(q=0.1)
less_freq_cities = city_freq[city_freq <= bottom_decile]
df.loc[df["city"].isin(less_freq_cities.index.tolist()), "city"] = "other"

This is how you replace the bottom 10% (or whatever share you want; just change the q parameter in quantile) with a value of your choice.

EDIT:
As suggested in a comment, to get normalized frequencies it's better to use
city_freq = df['city'].value_counts(normalize=True) instead of dividing by the shape. But actually, we don't need normalized frequencies here: pandas' quantile will work even if they are not normalized. We can use
city_freq = df['city'].value_counts() and it will still work.
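
A minimal sketch of why that holds (made-up city counts, not the OP's data): quantile only depends on the relative ordering of the frequencies, and dividing every count by the same total does not change that ordering, so the raw counts and the normalized frequencies select the same set of cities.

import pandas as pd

city = pd.Series(['NY'] * 50 + ['LA'] * 30 + ['SF'] * 15 + ['Oslo'] * 4 + ['Bern'])
raw = city.value_counts()                 # absolute counts
norm = city.value_counts(normalize=True)  # relative frequencies

low_raw = raw[raw <= raw.quantile(q=0.1)].index
low_norm = norm[norm <= norm.quantile(q=0.1)].index
print(set(low_raw) == set(low_norm))      # True -- same cities either way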

Is there a way to add frequency counts together in a for loop?

1

m = sapply(unique(dd$Allele), function(x)
  sapply(unique(dd$Range), function(y) {
    sum(dd$Frequency[dd$Allele == x & dd$Range == y])
  }))
dimnames(m) = list(Range = unique(dd$Range), Allele = unique(dd$Allele))
m
# Allele
#Range 0 1 2
# High 1 9 28
# Normal 2 7 17

2

with(dd, tapply(Frequency, list(Range, Allele), sum))
# 0 1 2
#High 1 9 28
#Normal 2 7 17

3

library(reshape2)
dcast(dd, Range ~ Allele, value.var = "Frequency", fun.aggregate = sum)
# Range 0 1 2
#1 High 1 9 28
#2 Normal 2 7 17

