Python: Combining Low Frequency Factors/Category Counts
Mask by finding the percentage of occupancy, i.e.:
import numpy as np
import pandas as pd

series = df['column'].value_counts()
mask = (series / series.sum() * 100).lt(1)
# To replace values in df['column'], use np.where, i.e.:
df['column'] = np.where(df['column'].isin(series[mask].index), 'Other', df['column'])
To collapse the masked categories into a single 'Other' entry with their summed count:
new = series[~mask]
new['Other'] = series[mask].sum()
Windows 26083
iOS 19711
Android 13077
Macintosh 5799
Other 832
Name: 1, dtype: int64
If you want to replace the index instead:
series.index = np.where(series.index.isin(series[mask].index),'Other',series.index)
Windows 26083
iOS 19711
Android 13077
Macintosh 5799
Other 347
Other 285
Other 167
Other 22
Other 11
Name: 1, dtype: int64
Explanation
(series/series.sum() * 100) # This gives you the percentage, i.e.:
Windows 39.820158
iOS 30.092211
Android 19.964276
Macintosh 8.853165
Chrome OS 0.529755
Linux 0.435101
Windows Phone 0.254954
(not set) 0.033587
BlackBerry 0.016793
Name: 1, dtype: float64
.lt(1)
is equivalent to "less than 1". That gives you a boolean mask; based on that mask, select the index entries and assign the replacement.
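Putting it together, a minimal end-to-end sketch of this approach on made-up data (the column values and the 5% threshold are illustrative, not from the original question):
import numpy as np
import pandas as pd

# Made-up data: two dominant values and two rare ones
df = pd.DataFrame({'column': ['Windows'] * 50 + ['iOS'] * 40
                             + ['Linux'] + ['BlackBerry']})

series = df['column'].value_counts()
mask = (series / series.sum() * 100).lt(5)  # 5% threshold for this small sample

# Replace the rare categories in the original column
df['column'] = np.where(df['column'].isin(series[mask].index),
                        'Other', df['column'])
print(df['column'].value_counts())
# Windows    50
# iOS        40
# Other       2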
Group low frequency counts into a single 'other' category
One option to collapse the factor levels (or character values) is fct_collapse:
library(dplyr)
library(forcats)
threshold <- 7
out <- df1 %>%
  count(Col1 = fct_collapse(Col1, Other = unique(Col1[Col2 < threshold])),
        wt = Col2)
out
# A tibble: 3 x 2
# Col1 n
# <fct> <int>
#1 A 25
#2 Other 8
#3 C 15
Then, we can create a pie chart:
library(ggplot2)
out %>%
  ggplot(aes(x = "", y = n, fill = Col1)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0)
Update
Based on the OP's dput, we can adapt the code to use the OP's column names:
df2 %>%
  count(Haplogroup = fct_collapse(as.character(Haplogroup),
                                  Other = unique(as.character(Haplogroup)[n < threshold])),
        wt = n, name = "n1")
# A tibble: 6 x 2
# Haplogroup n1
# <fct> <int>
#1 Other 40
#2 E1b 14
#3 N1a 12
#4 R1 10
#5 R1a 15
#6 R1b 25
Or another option is base R (assuming the column is of character class): create a logical vector ('i1') by comparing 'Col2' against the 'threshold', assign "Other" to the elements of 'Col1' where 'i1' is TRUE, then do a group-by sum with aggregate:
i1 <- df1$Col2 < threshold
df1$Col1[i1] <- "Other"
aggregate(Col2 ~ Col1, df1, sum)
# Col1 Col2
#1 A 25
#2 C 15
#3 Other 8
Data
df1 <- structure(list(Col1 = c("A", "B", "C", "D", "E"), Col2 = c(25L,
1L, 15L, 5L, 2L)), row.names = c(NA, -5L), class = "data.frame")
How can I convert the Python code that combines low frequency categories/values to a function that can be applied to any pandas dataframe column?
The solution can be simplified with normalize=True in Series.value_counts:
import numpy as np

def replace_thresh(df, col, thresh, new_val):
    # Percentage share of each category in the column
    s = df[col].value_counts(normalize=True).mul(100)
    # Replace categories whose share falls below the threshold
    df[col] = np.where(df[col].isin(s.index[s < thresh]), new_val, df[col])
    return df
df = replace_thresh(df, 'col', 1, 'Other')
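For example, a quick check on a small made-up frame (assuming the function above is in scope):
import pandas as pd

df = pd.DataFrame({'col': ['A'] * 120 + ['B'] * 79 + ['C']})
df = replace_thresh(df, 'col', 1, 'Other')
print(df['col'].value_counts())
# A        120
# B         79
# Other      1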
How do I assign 'other' to low frequency categories? (pandas)
Let's go through the logic of the expected actions:
- Count the frequency of every city
- Calculate the bottom decile of those frequencies
- Find the cities whose frequency falls at or below that cutoff
- Change them to 'other'
You started in the right direction. To get frequencies for every city:
city_freq = (df['city'].value_counts())/df.shape[0]
We want to find the bottom 10%. We use pandas' quantile to do it:
bottom_decile = city_freq.quantile(q=0.1)
Now bottom_decile
is a float representing the cutoff that separates the bottom 10% of cities from the rest. Cities with frequencies at or below that cutoff:
less_freq_cities = city_freq[city_freq <= bottom_decile]
less_freq_cities
will hold the entries of those cities. If you want to change their value in 'df' to "other":
# note the "city" column selector: without it, .loc would overwrite entire rows
df.loc[df["city"].isin(less_freq_cities.index.tolist()), "city"] = "other"
Complete code:
city_freq = (df['city'].value_counts())/df.shape[0]
bottom_decile = city_freq.quantile(q=0.1)
less_freq_cities = city_freq[city_freq <= bottom_decile]
df.loc[df["city"].isin(less_freq_cities.index.tolist()), "city"] = "other"
This is how you replace the bottom 10% (or whatever share you want; just change the q parameter in quantile) with a value of your choice.
EDIT:
As suggested in a comment, to get normalized frequencies it's better to use city_freq = df['city'].value_counts(normalize=True)
instead of dividing by the shape. But we don't actually need normalized frequencies here: pandas' quantile works even if they are not normalized, so city_freq = df['city'].value_counts()
works just as well.
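A small sketch on made-up data showing why: quantile scales linearly, so dividing every count by the same total does not change which cities fall at or below the bottom decile:
import pandas as pd

s = pd.Series(['NY'] * 5 + ['LA'] * 4 + ['SF'] * 3 + ['Austin'])
raw = s.value_counts()
norm = s.value_counts(normalize=True)

# The selected set is identical either way, because normalizing
# divides every count (and hence the quantile) by the same total.
print(raw[raw <= raw.quantile(q=0.1)].index.tolist())     # ['Austin']
print(norm[norm <= norm.quantile(q=0.1)].index.tolist())  # ['Austin']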
Is there a way to add frequency counts together in a for loop?
Option 1
m = sapply(unique(dd$Allele), function(x)
  sapply(unique(dd$Range), function(y) {
    sum(dd$Frequency[dd$Allele == x & dd$Range == y])
  }))
dimnames(m) = list(Range = unique(dd$Range), Allele = unique(dd$Allele))
m
# Allele
#Range 0 1 2
# High 1 9 28
# Normal 2 7 17
Option 2
with(dd, tapply(Frequency, list(Range, Allele), sum))
# 0 1 2
#High 1 9 28
#Normal 2 7 17
Option 3
library(reshape2)
dcast(dd, Range ~ Allele, value.var = "Frequency", fun.aggregate = sum)
# Range 0 1 2
#1 High 1 9 28
#2 Normal 2 7 17
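For completeness, the same sum-by-group aggregation can be sketched in pandas with pivot_table (the dd data frame below is made up to reproduce the totals shown above):
import pandas as pd

dd = pd.DataFrame({'Range': ['High', 'Normal'] * 3,
                   'Allele': [0, 0, 1, 1, 2, 2],
                   'Frequency': [1, 2, 9, 7, 28, 17]})

# Sum Frequency over every Range x Allele combination
print(dd.pivot_table(index='Range', columns='Allele',
                     values='Frequency', aggfunc='sum'))
# Allele   0  1   2
# Range
# High     1  9  28
# Normal   2  7  17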