Filtering a dataframe showing only duplicates
Considering df as your input, you can use dplyr and try:
df %>% group_by(V1) %>% filter(n() > 1)
for the duplicates
and
df %>% group_by(V1) %>% filter(n() == 1)
for the unique entries.
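For readers working in pandas, the same two filters can be sketched with a groupby size per row (hypothetical sample data; V1 named as in the dplyr call above):

```python
import pandas as pd

# Hypothetical sample data mirroring the dplyr example
df = pd.DataFrame({'V1': ['a', 'a', 'b', 'c', 'c', 'c']})

# Rows whose V1 value occurs more than once (the duplicates)
dups = df[df.groupby('V1')['V1'].transform('size') > 1]

# Rows whose V1 value occurs exactly once (the unique entries)
uniques = df[df.groupby('V1')['V1'].transform('size') == 1]
```

transform('size') broadcasts each group's row count back onto every row, so the comparison plays the role of dplyr's n().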
Filter and display all duplicated rows based on multiple columns in Pandas
The following code works by adding keep=False:
df = df[df.duplicated(subset = ['month', 'year'], keep = False)]
df = df.sort_values(by=['name', 'month', 'year'], ascending = False)
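A minimal runnable sketch of the effect of keep=False, on hypothetical data with the same columns:

```python
import pandas as pd

# Hypothetical sample data with the columns used above
df = pd.DataFrame({
    'name':  ['ann', 'bob', 'cat', 'dan'],
    'month': [1, 1, 2, 1],
    'year':  [2020, 2020, 2020, 2021],
})

# keep=False marks *every* member of a duplicate group, not just the extras
mask = df.duplicated(subset=['month', 'year'], keep=False)
out = df[mask].sort_values(by=['name', 'month', 'year'], ascending=False)
```

With the default keep='first' only the later occurrences would be flagged; keep=False is what makes both ann and bob survive the filter here.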
How do you filter duplicate columns in a dataframe based on a value in another column
IIUC, you want to keep all rows if Code is not equal to 10 but drop the first of duplicates otherwise, right? Then you could add that into the boolean mask:
cols = ['NID', 'Lact', 'Code']
out = df[~df.duplicated(cols, keep=False) | df.duplicated(cols) | df['Code'].ne(10)]
Output:
NID Lact Code
2 1 1 0
3 1 1 10
4 1 2 0
5 2 2 0
6 2 2 10
7 1 1 0
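To see how the three-part mask behaves, here is a self-contained sketch on hypothetical data (not the asker's original frame): two exact-duplicate pairs, one of them with Code equal to 10:

```python
import pandas as pd

# Hypothetical data: two exact-duplicate pairs, one with Code == 10
df = pd.DataFrame({
    'NID':  [1, 1, 2, 2, 3],
    'Lact': [1, 1, 2, 2, 1],
    'Code': [10, 10, 0, 0, 10],
})

cols = ['NID', 'Lact', 'Code']
# Keep a row if it is not part of any duplicate group (keep=False marks all
# members), or it is a later duplicate (default keep='first' marks those),
# or its Code is not 10 — i.e. only the *first* Code==10 duplicate is dropped.
out = df[~df.duplicated(cols, keep=False) | df.duplicated(cols) | df['Code'].ne(10)]
```

Only row 0 (the first of the duplicated Code==10 pair) fails all three conditions and is removed.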
Filter duplicate records in a dataframe using pandas and perform operations
You can leave one value per group right away like this:
columns = ['col1', 'col2', 'col3', 'col4']
grouped = dup_df.groupby(columns)
(grouped[['Sex', 'Count']]
 .apply(lambda sub_df: (sub_df.groupby('Sex')
                        .agg('sum').T
                        .rename(columns={'Male': 'Total_Male',
                                         'Female': 'Total_Female',
                                         'Null': 'Null_column'})))
 .assign(Total=lambda x: x.sum(axis=1))
 .reset_index(level=4, drop=True)
 .reset_index().rename_axis(columns=None))
col1 col2 col3 col4 Total_Female Total_Male Null_column Total
0 A B C D 50 100 NaN 150.0
1 X Y Z A 50 50 10.0 110.0
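An arguably simpler route to the same table is pivot_table, which does the group-then-transpose in one step. The data below is a hypothetical reconstruction chosen to reproduce the output shown above:

```python
import pandas as pd

# Hypothetical duplicate records: (col1..col4) key plus per-Sex counts
dup_df = pd.DataFrame({
    'col1': ['A', 'A', 'X', 'X', 'X'],
    'col2': ['B', 'B', 'Y', 'Y', 'Y'],
    'col3': ['C', 'C', 'Z', 'Z', 'Z'],
    'col4': ['D', 'D', 'A', 'A', 'A'],
    'Sex':  ['Male', 'Female', 'Male', 'Female', 'Null'],
    'Count': [100, 50, 50, 50, 10],
})

columns = ['col1', 'col2', 'col3', 'col4']
# One row per key, one column per Sex value, summed counts
out = (dup_df.pivot_table(index=columns, columns='Sex',
                          values='Count', aggfunc='sum')
       .rename(columns={'Male': 'Total_Male',
                        'Female': 'Total_Female',
                        'Null': 'Null_column'})
       .assign(Total=lambda x: x.sum(axis=1))  # NaNs are skipped in the sum
       .reset_index().rename_axis(columns=None))
```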
Pandas: How to filter dataframe for duplicate items that occur at least n times in a dataframe
You can use value_counts to get the per-item counts, build a boolean mask from that, take the index of the surviving entries, and test membership with isin:
In [3]:
df = pd.DataFrame({'a':[0,0,0,1,2,2,3,3,3,3,3,3,4,4,4]})
df
Out[3]:
a
0 0
1 0
2 0
3 1
4 2
5 2
6 3
7 3
8 3
9 3
10 3
11 3
12 4
13 4
14 4
In [8]:
df[df['a'].isin(df['a'].value_counts()[df['a'].value_counts()>2].index)]
Out[8]:
a
0 0
1 0
2 0
6 3
7 3
8 3
9 3
10 3
11 3
12 4
13 4
14 4
So breaking the above down:
In [9]:
df['a'].value_counts() > 2
Out[9]:
3 True
4 True
0 True
2 False
1 False
Name: a, dtype: bool
In [10]:
# construct a boolean mask
df['a'].value_counts()[df['a'].value_counts()>2]
Out[10]:
3 6
4 3
0 3
Name: a, dtype: int64
In [11]:
# we're interested in the index here, pass this to isin
df['a'].value_counts()[df['a'].value_counts()>2].index
Out[11]:
Int64Index([3, 4, 0], dtype='int64')
EDIT
As user @JonClements suggested, a simpler and faster method is to groupby on the column of interest and filter it:
In [4]:
df.groupby('a').filter(lambda x: len(x) > 2)
Out[4]:
a
0 0
1 0
2 0
6 3
7 3
8 3
9 3
10 3
11 3
12 4
13 4
14 4
EDIT 2
To get just a single entry for each repeated value, call drop_duplicates and pass subset='a':
In [2]:
df.groupby('a').filter(lambda x: len(x) > 2).drop_duplicates(subset='a')
Out[2]:
a
0 0
6 3
12 4
Filtering duplicates from pandas dataframe with preference based on additional column
I think a more straightforward way is to first sort the DataFrame, then drop duplicates keeping the first entry. This is pretty robust (here 'a' was a string with two values, but you could derive an integer sort key from the string if there were more values to order).
x = x.sort_values('a').drop_duplicates(subset='c')
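A quick sketch of the sort-then-dedupe idea on hypothetical data (column 'c' holds the duplicated key, 'a' encodes the preference, with the preferred value sorting first):

```python
import pandas as pd

# Hypothetical frame: 'c' has duplicates, 'a' sorts preferred rows first
x = pd.DataFrame({'c': [1, 1, 2, 2],
                  'a': ['secondary', 'primary', 'primary', 'secondary']})

# After sorting, drop_duplicates keeps the first (preferred) row per 'c'
x = x.sort_values('a').drop_duplicates(subset='c')
```

Because 'primary' sorts before 'secondary', the preferred row of each duplicate group is the one that survives.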
How to filter the data from two data frames with the repeated values in pandas?
It looks like you want a right merge:
df1.merge(df2[['Age']].dropna(), on='Age', how='right')
output:
Named Age
0 Raj 20
1 kir 21
2 cena 18
3 Raj 20
4 ang 30
5 Raj 20
6 cena 18
7 Raj 20
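A self-contained sketch of the right merge, on hypothetical frames shaped like the question (names in df1, a repeating Age column with a missing value in df2):

```python
import pandas as pd

# Hypothetical frames: lookup table of names, and ages with repeats and a NaN
df1 = pd.DataFrame({'Name': ['Raj', 'kir', 'cena', 'ang'],
                    'Age':  [20.0, 21.0, 18.0, 30.0]})
df2 = pd.DataFrame({'Age': [20.0, 21.0, 18.0, None, 20.0]})

# how='right' keeps one output row per non-null Age in df2,
# repeating matching df1 rows as often as the Age repeats
out = df1.merge(df2[['Age']].dropna(), on='Age', how='right')
```

Note the dropna() before merging: without it the NaN age would produce an all-NaN output row.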
Filtering with two conditions - Remove duplicates less than a certain value while keeping the original
The conditions should be enclosed in parentheses; on the right you have square brackets. And to get the output you showed, you also need, in my opinion, to add the condition df['type'] == "Original":
a = df[(df['total'] > 10) & (df['type'] == "Duplicate")|(df['type'] == "Original")]
print(a)
Output a
total type
0 23 Original
2 11 Duplicate
3 5 Original
4 16 Duplicate
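The precedence here matters: & binds tighter than |, so the expression keeps every Original row plus only those Duplicate rows with total > 10. A runnable check on hypothetical data reconstructed to match the output above:

```python
import pandas as pd

# Hypothetical data matching the output shown above
df = pd.DataFrame({'total': [23, 8, 11, 5, 16],
                   'type': ['Original', 'Duplicate', 'Duplicate',
                            'Original', 'Duplicate']})

# Reads as (total > 10 AND Duplicate) OR Original
a = df[(df['total'] > 10) & (df['type'] == "Duplicate") | (df['type'] == "Original")]
```

The row with total 8 is the only one dropped: it is a Duplicate that fails the threshold.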