How to Count Duplicate Rows in Pandas Dataframe

How to count duplicate rows in pandas dataframe?

You can groupby on all the columns and call size the index indicates the duplicate values:

In [28]:
df.groupby(df.columns.tolist(),as_index=False).size()

Out[28]:
one three two
False False True 1
True False False 2
True True 1
dtype: int64

pandas dataframe count duplicates by group and date

IIUC, you can replace the duplicated values with 1, the others with zero, then groupby+agg+sum:

(df.assign(Date=pd.to_datetime(df['Date']).dt.normalize(),
ID=df['ID'].duplicated(keep=False).astype(int)
)
.groupby(['Date', 'Group'], as_index=False).agg(repetitions=('ID', 'sum'))
)

output:

         Date Group  repetitions
0 2021-10-29 A 1
1 2021-10-29 B 3
2 2021-10-29 C 3
3 2021-10-30 A 2
4 2021-10-30 B 2
5 2021-10-30 C 2
6 2021-10-31 A 4
7 2021-10-31 C 1
8 2021-11-01 A 1
9 2021-11-01 B 2
10 2021-11-01 C 7
11 2021-11-02 A 1
12 2021-11-02 B 4
13 2021-11-02 C 5
14 2021-11-03 A 3
15 2021-11-03 B 1
16 2021-11-03 C 2
17 2021-11-04 A 0
18 2021-11-04 C 4

adding the missing combinations

dates = pd.to_datetime(df['Date']).dt.normalize()
idx = pd.MultiIndex.from_product([dates.unique(), df['Group'].unique()], names=['Date', 'Group'])

(df.assign(Date=dates,
ID=df['ID'].duplicated(keep=False).astype(int)
)
.groupby(['Date', 'Group']).agg(repetitions=('ID', 'sum'))
.reindex(idx, fill_value=0)
.reset_index()
)

output:

         Date Group  repetitions
0 2021-10-29 B 3
1 2021-10-29 A 1
2 2021-10-29 C 3
3 2021-10-30 B 2
4 2021-10-30 A 2
5 2021-10-30 C 2
6 2021-10-31 B 0
7 2021-10-31 A 4
8 2021-10-31 C 1
9 2021-11-01 B 2
10 2021-11-01 A 1
11 2021-11-01 C 7
12 2021-11-02 B 4
13 2021-11-02 A 1
14 2021-11-02 C 5
15 2021-11-03 B 1
16 2021-11-03 A 3
17 2021-11-03 C 2
18 2021-11-04 B 0
19 2021-11-04 A 0
20 2021-11-04 C 4

Count duplicates in column and add them to new col Pandas

I would first create the count of duplicates

df['Count'] = 1
df.groupby(['id','letter']).Count.count().reset_index()

And then drop the duplicates

df.drop_duplicates()

Duplicate rows in pandas DF

You can groupby these two columns and then calculate the sizes of the groups:

In [16]: df.groupby(['Letters', 'Numbers']).size()
Out[16]:
Letters Numbers
A 1 2
2 1
3 1
B 1 1
2 1
3 1
C 2 2
dtype: int64

To get a DataFrame like in your example output, you can reset the index with reset_index.

How to count the unique duplicate values in each column

Option 1

if we need to calculate the number of duplicate values

import pandas as pd

df = pd.DataFrame(data = {'A': [1,2,3,3,2,4,5,3],
'B': [9,6,7,9,2,5,3,3],
'C': [4,4,4,5,9,3,2,1]})

df1 = df.apply(lambda x:sum(x.duplicated()))
print(df1)

Prints:

A    3
B 2
C 2
dtype: int64

Option 2

if we need to calculate the number of values that have duplicates

df1 = df.agg(lambda x: sum(x.value_counts() > 1)) # or df1 = df.apply(lambda x: sum(x.value_counts() > 1))
print(df1)

Prints:

A    2
B 2
C 1
dtype: int64

Option 2.1

detailed

df1 = df.apply(lambda x: ' '.join([f'[val = {i}, cnt = {v}]' for i, v in x.value_counts().iteritems() if v > 1]))
print(df1)

Prints:

A    [val = 3, cnt = 3] [val = 2, cnt = 2]
B [val = 9, cnt = 2] [val = 3, cnt = 2]
C [val = 4, cnt = 3]
dtype: object

count the duplicate rows in pandas, and a very big csv file

Use Counter from collections module:

Input data:

>>> %cat data.csv
head;tail;count
134;135;1
134;136;1
134;137;2
134;135;2
134;136;1
from collections import Counter

for df in pd.read_csv(io.StringIO(text), sep=';', chunksize=2):
c.update(df.groupby(['head', 'tail'])['count'].sum().to_dict())

Output result:

>>> c
Counter({(134, 135): 3, (134, 136): 2, (134, 137): 2})

Convert the Counter to a DataFrame:

df = pd.DataFrame.from_dict(c, orient='index', columns=['count'])
mi = pd.MultiIndex.from_tuples(df.index, names=['head', 'tail'])
df = df.set_index(mi).reset_index()
>>> df
head tail count
0 134 135 3
1 134 136 2
2 134 137 2

Pandas: count number of duplicate rows using groupby

One solution is use GroupBy.size for aggregate output with counter:

d = d.groupby(by=['n','v']).size().reset_index(name='c')
print (d)
n v c
0 a 1 2
1 a 2 1

Your solution working if specify some column name after groupby, because no another columns n, v in input DataFrame:

d = d.groupby(by=['n','v'])['n'].count().reset_index(name='c')

print (d)
n v c
0 a 1 2
1 a 2 1

What is also necessary if need new column with GroupBy.transform - new column is filled by aggregate values:

d['c'] = d.groupby(by=['n','v'])['n'].transform('size')
print (d)
n v c
0 a 1 2
1 a 2 1
2 a 1 2

How to count duplicates in column Pandas?

For new DataFrame call Series.duplicated per columns in DataFrame.apply, count Trues by sum and for one row DataFrame convert Series to DataFrame with transpose, also rename columns:

d = {'num': [1, 1, 1, 1, 3, 1, 2, 2],
'age': [10, 10, 10, 11, 11, 98, 99, 102]}
df = pd.DataFrame(data=d)
print (df)
num age
0 1 10
1 1 10
2 1 10
3 1 11
4 3 11
5 1 98
6 2 99
7 2 102

f = lambda x: f'Duplicates {x} (total)'
df = (df[['num','age']].apply(lambda x: x.duplicated(keep=False))
.sum()
.rename(f)
.to_frame()
.T)

print (df)
Duplicates num (total) Duplicates age (total)
0 7 5

Alternative solution:

df = pd.DataFrame({f'Duplicates {x} (total)' : [df[x].duplicated(keep=False).sum()] 
for x in ['num','age']})

print (df)

Duplicates num (total) Duplicates age (total)
0 7 5

EDIT: For test non duplicated rows use:

df1 = df[~df['AGE'].duplicated(keep=False)]
print (df1)
NUM AGE
3 4 20

For get duplicated values by coumn NUM in list use:

duplicatesNums = df.loc[df['AGE'].duplicated(keep=False), 'NUM'].tolist()
print (duplicatesNums)
[1, 2, 3]

If NUM is index:

print (df)
AGE
NUM
1 18
2 18
3 18
4 20

duplicatesNums = df.index[df['AGE'].duplicated(keep=False)].tolist()
print (duplicatesNums)
[1, 2, 3]

Parse CSV data to get a count on rows with duplicate values

The solution I came up with is in parts. My first issue was the casing, I need everything to be in lowercase. So after I appended items to employeeList, I added this code:

for i in range(len(employeeList)):
for j in range(len(employeeList[i])):
employeeList[i][j] = employeeList[i][j].lower()

This makes everything in my employeeList lowercase.

Now once I fixed that, I then needed to change the output of my ouList from a single list, and keep it as a list of lists. So all rows with only ou= are going to be in ouList.

#list for storing only OUs
ouList = []

#moving the items to the ouList that are only OUs
for i in range(len(employeeList)):
ouList.append([])
for j in range(len(employeeList[i])):
if employeeList[i][j].startswith('ou='):
ouList[i].append(employeeList[i][j])

Then I needed to remove any items that end with users, userMger, or employee. I reverse iterated and used .endswith() to achieve this without any errors.

#need to iterate in reverse as I am removing items from the list
for i in reversed(range(len(ouList))):
for j in reversed(range(len(ouList[i]))):
if (ouList[i][j].endswith('users')
or ouList[i][j].endswith('usermger')
or ouList[i][j].endswith('employee')):
ouList[i].remove(ouList[i][j])

Then to strip ou= or unnecessary strings, I used re (aka regular expressions or regex). Then I appended these new values to another list called ouListStrip

#stripping ou= and other strings
ouListStrip = []
for i in range(len(ouList)):
ouListStrip.append([])
for j in range(len(ouList[i])):
ou = re.sub("ou=|_hq", "", ouList[i][j])
ouListStrip[i].append(ou)

This list outputs this:

[['news'], ['news', 'news'], ['news', 'news'], ['news'], ['ice cream', 'ice cream'], ['ice cream'], ['store'], ['store', 'store'], ['store'], ['store'], ['sprinkles', 'sprinkles'], ['sprinkles', 'sprinkles'], ['sprinkles'], ['sprinkles']]

Now that I have only a list of lists, I can now work on removing duplicates in the sublists. I achieve this through using not in and appending them still as a list of lists.

no_repeats = []
for i in range(len(ouListStrip)):
no_repeats.append([])
for j in range(len(ouListStrip[i])):
if ouListStrip[i][j] not in no_repeats[i]:
no_repeats[i].append(ouListStrip[i][j])

no_repeats outputs this:

[['news'], ['news'], ['news'], ['news'], ['ice cream'], ['ice cream'], ['store'], ['store'], ['store'], ['store'], ['sprinkles'], ['sprinkles'], ['sprinkles'], ['sprinkles']]

Finally, I combine my list of list items into one single list:

allOUs = []
for i in range(len(no_repeats)):
for j in range(len(no_repeats[i])):
allOUs.append(no_repeats[i][j])

allOUs outputs:

['news', 'news', 'news', 'news', 'ice cream', 'ice cream', 'store', 'store', 'store', 'store', 'sprinkles', 'sprinkles', 'sprinkles', 'sprinkles']

Then I make this list into a dictionary and count the items within it using .count():

dict_of_counts = {item:allOUs.count(item) for item in allOUs}

Outputs:

{'news': 4, 'ice cream': 2, 'store': 4, 'sprinkles': 4}

To make it visually similar to what I want:

for key, value in dict_of_counts.items():
print(key,',',value)

Outputs:

news , 4
ice cream , 2
store , 4
sprinkles , 4


Related Topics



Leave a reply



Submit