Filtering a Dataframe Showing Only Duplicates

Filtering a dataframe showing only duplicates

Considering df as your input, you can use dplyr and try:

df %>% group_by(V1) %>% filter(n() > 1)

for the duplicates

and

df %>% group_by(V1) %>% filter(n() == 1)

for the unique entries.

Filter and display all duplicated rows based on multiple columns in Pandas

The following code works. The key is keep=False, which flags every row in a duplicate group instead of leaving the first occurrence unflagged:

df = df[df.duplicated(subset = ['month', 'year'], keep = False)]
df = df.sort_values(by=['name', 'month', 'year'], ascending = False)
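
To see what keep=False changes, here is a minimal sketch on made-up data (the values below are assumptions, not the asker's data). With the default keep='first', the first row of each duplicate group is not flagged; with keep=False, every member is.

```python
import pandas as pd

# Hypothetical data: the first two rows share (month, year) == (1, 2020).
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "month": [1, 1, 2],
    "year": [2020, 2020, 2020],
})

# Default keep='first': only the later occurrence is flagged.
first_mask = df.duplicated(subset=["month", "year"])

# keep=False: every row that belongs to a duplicate group is flagged.
all_mask = df.duplicated(subset=["month", "year"], keep=False)

dupes = df[all_mask].sort_values(by=["name", "month", "year"], ascending=False)
print(dupes)
```

Only the keep=False mask selects both rows of the repeated pair, which is what "display all duplicated rows" needs.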

How do you filter duplicate columns in a dataframe based on a value in another column

IIUC, you want to keep all rows if Code is not equal to 10 but drop the first of duplicates otherwise, right? Then you could add that into the boolean mask:

cols = ['NID', 'Lact', 'Code']
out = df[~df.duplicated(cols, keep=False) | df.duplicated(cols) | df['Code'].ne(10)]

Output:

   NID  Lact  Code
2    1     1     0
3    1     1    10
4    1     2     0
5    2     2     0
6    2     2    10
7    1     1     0
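
The input frame is not shown in this answer, so the sketch below uses a hypothetical one, picked so that the same mask reproduces the output above. Splitting the mask into named parts makes it easier to see which rule keeps each row.

```python
import pandas as pd

# Hypothetical input (an assumption): rows 0 and 1 are the first occurrences
# of duplicated (NID, Lact, Code) combinations with Code == 10.
df = pd.DataFrame({
    "NID":  [1, 2, 1, 1, 1, 2, 2, 1],
    "Lact": [1, 2, 1, 1, 2, 2, 2, 1],
    "Code": [10, 10, 0, 10, 0, 0, 10, 0],
})
cols = ["NID", "Lact", "Code"]

not_duplicated = ~df.duplicated(cols, keep=False)  # row has no duplicate at all
later_duplicate = df.duplicated(cols)              # row is a repeat, not the first one
code_not_10 = df["Code"].ne(10)                    # the drop rule only targets Code == 10

# A row is dropped only when all three tests fail, i.e. it is the first
# occurrence of a duplicated combination and its Code is 10.
out = df[not_duplicated | later_duplicate | code_not_10]
print(out)
```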

Filter duplicate records in a dataframe using pandas and perform operations

You can leave one value per group right away like this:

columns = ['col1', 'col2', 'col3',"col4"]
grouped = dup_df.groupby(columns)

# The outer parentheses let the method chain span several lines.
(
    grouped[['Sex', 'Count']]
    .apply(
        lambda sub_df: (
            sub_df.groupby('Sex')
            .agg('sum').T
            .rename(columns={'Male': 'Total_Male',
                             'Female': 'Total_Female',
                             'Null': 'Null_column'})
        )
    )
    .assign(Total=lambda x: x.sum(axis=1))
    .reset_index(level=4, drop=True)
    .reset_index().rename_axis(columns=None)
)

Output:

  col1 col2 col3 col4  Total_Female  Total_Male  Null_column  Total
0    A    B    C    D            50         100          NaN  150.0
1    X    Y    Z    A            50          50         10.0  110.0
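
If the nested groupby is hard to follow, a similar table can be built with pivot_table. This is an alternative technique, not the code from the answer, and dup_df below is hypothetical data invented to match the totals shown above.

```python
import pandas as pd

# Hypothetical input (an assumption): the original dup_df is not shown.
dup_df = pd.DataFrame({
    "col1": ["A", "A", "A", "X", "X", "X"],
    "col2": ["B", "B", "B", "Y", "Y", "Y"],
    "col3": ["C", "C", "C", "Z", "Z", "Z"],
    "col4": ["D", "D", "D", "A", "A", "A"],
    "Sex": ["Male", "Male", "Female", "Male", "Female", "Null"],
    "Count": [60, 40, 50, 50, 50, 10],
})
columns = ["col1", "col2", "col3", "col4"]

out = (
    # One row per key combination, one column per Sex value, summing Count.
    dup_df.pivot_table(index=columns, columns="Sex", values="Count", aggfunc="sum")
    .rename(columns={"Male": "Total_Male",
                     "Female": "Total_Female",
                     "Null": "Null_column"})
    # Row-wise total; missing combinations are NaN and are skipped by sum.
    .assign(Total=lambda x: x.sum(axis=1))
    .reset_index()
    .rename_axis(columns=None)
)
print(out)
```

Because a missing Sex value introduces NaN, the count columns may come back as floats here.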

Pandas: How to filter dataframe for duplicate items that occur at least n times in a dataframe

You can use value_counts to get the count of each item, keep only the counts above your threshold, and then pass the index of that result (the item values themselves) to isin to build a boolean mask:

In [3]:
df = pd.DataFrame({'a':[0,0,0,1,2,2,3,3,3,3,3,3,4,4,4]})
df

Out[3]:
a
0 0
1 0
2 0
3 1
4 2
5 2
6 3
7 3
8 3
9 3
10 3
11 3
12 4
13 4
14 4

In [8]:
df[df['a'].isin(df['a'].value_counts()[df['a'].value_counts()>2].index)]

Out[8]:
a
0 0
1 0
2 0
6 3
7 3
8 3
9 3
10 3
11 3
12 4
13 4
14 4

So breaking the above down:

In [9]:
df['a'].value_counts() > 2

Out[9]:
3 True
4 True
0 True
2 False
1 False
Name: a, dtype: bool

In [10]:
# construct a boolean mask
df['a'].value_counts()[df['a'].value_counts()>2]

Out[10]:
3 6
4 3
0 3
Name: a, dtype: int64

In [11]:
# we're interested in the index here, pass this to isin
df['a'].value_counts()[df['a'].value_counts()>2].index

Out[11]:
Int64Index([3, 4, 0], dtype='int64')

EDIT

As user @JonClements suggested, a simpler and faster method is to group by the column of interest and filter the groups:

In [4]:
df.groupby('a').filter(lambda x: len(x) > 2)

Out[4]:
a
0 0
1 0
2 0
6 3
7 3
8 3
9 3
10 3
11 3
12 4
13 4
14 4

EDIT 2

To get just a single entry for each repeated value, call drop_duplicates and pass subset='a':

In [2]:
df.groupby('a').filter(lambda x: len(x) > 2).drop_duplicates(subset='a')

Out[2]:
a
0 0
6 3
12 4
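
Another option, not part of the original answer, is to attach each row's group size with transform('size') and compare it directly. A minimal sketch on the same frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4]})

# Size of the group each row belongs to, aligned with df's index.
group_size = df.groupby('a')['a'].transform('size')

# Rows whose value occurs more than twice.
repeated = df[group_size > 2]

# One row per repeated value (the first occurrence is kept).
one_per_value = repeated.drop_duplicates(subset='a')
print(one_per_value)
```

This avoids calling a Python lambda once per group, which may help on frames with many distinct values.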

Filtering duplicates from pandas dataframe with preference based on additional column

I think a more straightforward way is to first sort the DataFrame, then drop duplicates keeping the first entry. This is pretty robust (here, 'a' was a string with two values but you could apply a function that makes an integer column from the string if there were more string values to sort).

x = x.sort_values(['a']).drop_duplicates(subset='c')
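
A small sketch of the idea, using made-up data (the column values and the priority mapping are assumptions). The first variant relies on the strings in 'a' sorting in the preferred order; the second maps them to integers first, as suggested above for cases with more string values.

```python
import pandas as pd

# Hypothetical data: 'c' holds the key to de-duplicate on,
# 'a' holds the preference ("primary" should win over "secondary").
x = pd.DataFrame({
    "a": ["secondary", "primary", "secondary", "primary"],
    "c": [1, 1, 2, 3],
    "val": [10, 20, 30, 40],
})

# "primary" sorts before "secondary", so it is the row kept for each 'c'.
deduped = x.sort_values(["a"]).drop_duplicates(subset="c")

# With more string values, map them to an explicit rank and sort on that.
priority = {"primary": 0, "secondary": 1}  # hypothetical ordering
deduped_ranked = (
    x.assign(rank=x["a"].map(priority))
    .sort_values("rank")
    .drop_duplicates(subset="c")
    .drop(columns="rank")
)
print(deduped_ranked.sort_values("c"))
```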

How to filter the data from two data frames with the repeated values in pandas?

It looks like you want a right merge:

df1.merge(df[['Age']].dropna(), on='Age', how='right')

output:

  Named  Age
0   Raj   20
1   kir   21
2  cena   18
3   Raj   20
4   ang   30
5   Raj   20
6  cena   18
7   Raj   20
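
The two frames are not shown in this answer, so here is a sketch with hypothetical ones: df1 is assumed to hold one name per age, and df a longer Age column with repeats and a missing value. The right merge keeps one output row for every non-null Age in df and looks up the matching name in df1.

```python
import pandas as pd

# Hypothetical inputs (assumptions, not the asker's data).
df1 = pd.DataFrame({
    "Named": ["Raj", "kir", "cena", "ang"],
    "Age": [20, 21, 18, 30],
})
df = pd.DataFrame({"Age": [20, 21, 18, float("nan"), 20, 30, 20, 18, 20]})

# Drop the missing ages first, then keep every remaining row of the right frame.
out = df1.merge(df[["Age"]].dropna(), on="Age", how="right")
print(out)
```

Because the NaN turns df's Age column into floats, Age may display as 20.0 rather than 20 in this sketch.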

Filtering with two conditions - Remove duplicates less than a certain value while keeping the original

Each condition must be wrapped in parentheses; on the right-hand side you used square brackets instead. To get the output you showed, I think you also need to add the condition (df['type'] == "Original").

a = df[((df['total'] > 10) & (df['type'] == "Duplicate")) | (df['type'] == "Original")]
print(a)

Output:

   total       type
0     23   Original
2     11  Duplicate
3      5   Original
4     16  Duplicate
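
For reference, here is a self-contained sketch with a hypothetical input frame (the asker's data is not shown, so the values are chosen to match the output above). Naming the two conditions avoids relying on & binding more tightly than |.

```python
import pandas as pd

# Hypothetical input (an assumption): row 1 is a duplicate with total <= 10.
df = pd.DataFrame({
    "total": [23, 8, 11, 5, 16],
    "type": ["Original", "Duplicate", "Duplicate", "Original", "Duplicate"],
})

# Duplicates are kept only when their total exceeds 10.
big_duplicate = (df["total"] > 10) & (df["type"] == "Duplicate")

# Originals are always kept, whatever their total.
original = df["type"] == "Original"

a = df[big_duplicate | original]
print(a)
```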

