How to count duplicate rows in pandas dataframe?
You can groupby
on all the columns and call size
the index indicates the duplicate values:
In [28]:
df.groupby(df.columns.tolist(),as_index=False).size()
Out[28]:
one three two
False False True 1
True False False 2
True True 1
dtype: int64
pandas dataframe count duplicates by group and date
IIUC, you can replace the duplicated values with 1, the others with zero, then groupby
+agg
+sum
:
(df.assign(Date=pd.to_datetime(df['Date']).dt.normalize(),
ID=df['ID'].duplicated(keep=False).astype(int)
)
.groupby(['Date', 'Group'], as_index=False).agg(repetitions=('ID', 'sum'))
)
output:
Date Group repetitions
0 2021-10-29 A 1
1 2021-10-29 B 3
2 2021-10-29 C 3
3 2021-10-30 A 2
4 2021-10-30 B 2
5 2021-10-30 C 2
6 2021-10-31 A 4
7 2021-10-31 C 1
8 2021-11-01 A 1
9 2021-11-01 B 2
10 2021-11-01 C 7
11 2021-11-02 A 1
12 2021-11-02 B 4
13 2021-11-02 C 5
14 2021-11-03 A 3
15 2021-11-03 B 1
16 2021-11-03 C 2
17 2021-11-04 A 0
18 2021-11-04 C 4
adding the missing combinations
dates = pd.to_datetime(df['Date']).dt.normalize()
idx = pd.MultiIndex.from_product([dates.unique(), df['Group'].unique()], names=['Date', 'Group'])
(df.assign(Date=dates,
ID=df['ID'].duplicated(keep=False).astype(int)
)
.groupby(['Date', 'Group']).agg(repetitions=('ID', 'sum'))
.reindex(idx, fill_value=0)
.reset_index()
)
output:
Date Group repetitions
0 2021-10-29 B 3
1 2021-10-29 A 1
2 2021-10-29 C 3
3 2021-10-30 B 2
4 2021-10-30 A 2
5 2021-10-30 C 2
6 2021-10-31 B 0
7 2021-10-31 A 4
8 2021-10-31 C 1
9 2021-11-01 B 2
10 2021-11-01 A 1
11 2021-11-01 C 7
12 2021-11-02 B 4
13 2021-11-02 A 1
14 2021-11-02 C 5
15 2021-11-03 B 1
16 2021-11-03 A 3
17 2021-11-03 C 2
18 2021-11-04 B 0
19 2021-11-04 A 0
20 2021-11-04 C 4
Count duplicates in column and add them to new col Pandas
I would first create the count of duplicates
df['Count'] = 1
df.groupby(['id','letter']).Count.count().reset_index()
And then drop the duplicates
df.drop_duplicates()
Duplicate rows in pandas DF
You can groupby these two columns and then calculate the sizes of the groups:
In [16]: df.groupby(['Letters', 'Numbers']).size()
Out[16]:
Letters Numbers
A 1 2
2 1
3 1
B 1 1
2 1
3 1
C 2 2
dtype: int64
To get a DataFrame like in your example output, you can reset the index with reset_index
.
How to count the unique duplicate values in each column
Option 1
if we need to calculate the number of duplicate values
import pandas as pd
df = pd.DataFrame(data = {'A': [1,2,3,3,2,4,5,3],
'B': [9,6,7,9,2,5,3,3],
'C': [4,4,4,5,9,3,2,1]})
df1 = df.apply(lambda x:sum(x.duplicated()))
print(df1)
Prints:
A 3
B 2
C 2
dtype: int64
Option 2
if we need to calculate the number of values that have duplicates
df1 = df.agg(lambda x: sum(x.value_counts() > 1)) # or df1 = df.apply(lambda x: sum(x.value_counts() > 1))
print(df1)
Prints:
A 2
B 2
C 1
dtype: int64
Option 2.1
detailed
df1 = df.apply(lambda x: ' '.join([f'[val = {i}, cnt = {v}]' for i, v in x.value_counts().iteritems() if v > 1]))
print(df1)
Prints:
A [val = 3, cnt = 3] [val = 2, cnt = 2]
B [val = 9, cnt = 2] [val = 3, cnt = 2]
C [val = 4, cnt = 3]
dtype: object
count the duplicate rows in pandas, and a very big csv file
Use Counter
from collections
module:
Input data:
>>> %cat data.csv
head;tail;count
134;135;1
134;136;1
134;137;2
134;135;2
134;136;1
from collections import Counter
for df in pd.read_csv(io.StringIO(text), sep=';', chunksize=2):
c.update(df.groupby(['head', 'tail'])['count'].sum().to_dict())
Output result:
>>> c
Counter({(134, 135): 3, (134, 136): 2, (134, 137): 2})
Convert the Counter to a DataFrame:
df = pd.DataFrame.from_dict(c, orient='index', columns=['count'])
mi = pd.MultiIndex.from_tuples(df.index, names=['head', 'tail'])
df = df.set_index(mi).reset_index()
>>> df
head tail count
0 134 135 3
1 134 136 2
2 134 137 2
Pandas: count number of duplicate rows using groupby
One solution is use GroupBy.size
for aggregate output with counter:
d = d.groupby(by=['n','v']).size().reset_index(name='c')
print (d)
n v c
0 a 1 2
1 a 2 1
Your solution working if specify some column name after groupby
, because no another columns n
, v
in input DataFrame
:
d = d.groupby(by=['n','v'])['n'].count().reset_index(name='c')
print (d)
n v c
0 a 1 2
1 a 2 1
What is also necessary if need new column with GroupBy.transform
- new column is filled by aggregate values:
d['c'] = d.groupby(by=['n','v'])['n'].transform('size')
print (d)
n v c
0 a 1 2
1 a 2 1
2 a 1 2
How to count duplicates in column Pandas?
For new DataFrame call Series.duplicated
per columns in DataFrame.apply
, count True
s by sum
and for one row DataFrame convert Series
to DataFrame
with transpose, also rename
columns:
d = {'num': [1, 1, 1, 1, 3, 1, 2, 2],
'age': [10, 10, 10, 11, 11, 98, 99, 102]}
df = pd.DataFrame(data=d)
print (df)
num age
0 1 10
1 1 10
2 1 10
3 1 11
4 3 11
5 1 98
6 2 99
7 2 102
f = lambda x: f'Duplicates {x} (total)'
df = (df[['num','age']].apply(lambda x: x.duplicated(keep=False))
.sum()
.rename(f)
.to_frame()
.T)
print (df)
Duplicates num (total) Duplicates age (total)
0 7 5
Alternative solution:
df = pd.DataFrame({f'Duplicates {x} (total)' : [df[x].duplicated(keep=False).sum()]
for x in ['num','age']})
print (df)
Duplicates num (total) Duplicates age (total)
0 7 5
EDIT: For test non duplicated rows use:
df1 = df[~df['AGE'].duplicated(keep=False)]
print (df1)
NUM AGE
3 4 20
For get duplicated values by coumn NUM
in list use:
duplicatesNums = df.loc[df['AGE'].duplicated(keep=False), 'NUM'].tolist()
print (duplicatesNums)
[1, 2, 3]
If NUM
is index:
print (df)
AGE
NUM
1 18
2 18
3 18
4 20
duplicatesNums = df.index[df['AGE'].duplicated(keep=False)].tolist()
print (duplicatesNums)
[1, 2, 3]
Parse CSV data to get a count on rows with duplicate values
The solution I came up with is in parts. My first issue was the casing, I need everything to be in lowercase. So after I appended items to employeeList
, I added this code:
for i in range(len(employeeList)):
for j in range(len(employeeList[i])):
employeeList[i][j] = employeeList[i][j].lower()
This makes everything in my employeeList lowercase.
Now once I fixed that, I then needed to change the output of my ouList
from a single list, and keep it as a list of lists. So all rows with only ou=
are going to be in ouList
.
#list for storing only OUs
ouList = []
#moving the items to the ouList that are only OUs
for i in range(len(employeeList)):
ouList.append([])
for j in range(len(employeeList[i])):
if employeeList[i][j].startswith('ou='):
ouList[i].append(employeeList[i][j])
Then I needed to remove any items that end with users, userMger, or employee. I reverse iterated and used .endswith()
to achieve this without any errors.
#need to iterate in reverse as I am removing items from the list
for i in reversed(range(len(ouList))):
for j in reversed(range(len(ouList[i]))):
if (ouList[i][j].endswith('users')
or ouList[i][j].endswith('usermger')
or ouList[i][j].endswith('employee')):
ouList[i].remove(ouList[i][j])
Then to strip ou=
or unnecessary strings, I used re (aka regular expressions or regex). Then I appended these new values to another list called ouListStrip
#stripping ou= and other strings
ouListStrip = []
for i in range(len(ouList)):
ouListStrip.append([])
for j in range(len(ouList[i])):
ou = re.sub("ou=|_hq", "", ouList[i][j])
ouListStrip[i].append(ou)
This list outputs this:
[['news'], ['news', 'news'], ['news', 'news'], ['news'], ['ice cream', 'ice cream'], ['ice cream'], ['store'], ['store', 'store'], ['store'], ['store'], ['sprinkles', 'sprinkles'], ['sprinkles', 'sprinkles'], ['sprinkles'], ['sprinkles']]
Now that I have only a list of lists, I can now work on removing duplicates in the sublists. I achieve this through using not in
and appending them still as a list of lists.
no_repeats = []
for i in range(len(ouListStrip)):
no_repeats.append([])
for j in range(len(ouListStrip[i])):
if ouListStrip[i][j] not in no_repeats[i]:
no_repeats[i].append(ouListStrip[i][j])
no_repeats
outputs this:
[['news'], ['news'], ['news'], ['news'], ['ice cream'], ['ice cream'], ['store'], ['store'], ['store'], ['store'], ['sprinkles'], ['sprinkles'], ['sprinkles'], ['sprinkles']]
Finally, I combine my list of list items into one single list:
allOUs = []
for i in range(len(no_repeats)):
for j in range(len(no_repeats[i])):
allOUs.append(no_repeats[i][j])
allOUs
outputs:
['news', 'news', 'news', 'news', 'ice cream', 'ice cream', 'store', 'store', 'store', 'store', 'sprinkles', 'sprinkles', 'sprinkles', 'sprinkles']
Then I make this list into a dictionary and count the items within it using .count()
:
dict_of_counts = {item:allOUs.count(item) for item in allOUs}
Outputs:
{'news': 4, 'ice cream': 2, 'store': 4, 'sprinkles': 4}
To make it visually similar to what I want:
for key, value in dict_of_counts.items():
print(key,',',value)
Outputs:
news , 4
ice cream , 2
store , 4
sprinkles , 4
Related Topics
How to Check Whether All Elements of Array Are in Between Two Values
Loop Through Json Data in Python
How to Print Just the First Letters of Each Word
Convert a Standard Python Key Value Dictionary List to Pyspark Data Frame
How to Set the Default Python Path for Anaconda on Linux
How to Remove Commas and Dots of Individual Word in Two Dimensional List
How to Upgrade the Sqlite Version Used by Python'S Sqlite3 Module on Mac
How to Split a Huge Text File in Python
Find First Non-Zero Value in Each Column of Pandas Dataframe
How to Use Authenticated Proxy in Selenium Chromedriver
How to Read Pdf Files One by One from a Folder in Python
How to Properly Setup Pipenv in Pycharm
How to Get the Coordinates of the Bounding Box in Yolo Object Detection
How to Get Rid of the B-Prefix in a String in Python
Python Super :Typeerror: _Init_() Takes 2 Positional Arguments But 3 Were Given
Stripping Whitespaces from a List Inside the List of Tuples
Extract Values Between Two Strings in a Text File Using Python