Check for duplicate values in Pandas dataframe column
Main question
Is there a duplicate value in a column, True/False?
╔═════════╦═══════════════╗
║ Student ║ Date          ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2017 ║
║ Bob     ║ April 2018    ║
║ Joe     ║ December 2018 ║
╚═════════╩═══════════════╝
Assuming the above dataframe (df), we can quickly check whether the Student column contains duplicates with either of:
boolean = not df["Student"].is_unique # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True
Further reading and references
Above we are using one of the Pandas Series methods. The pandas DataFrame has several useful methods, two of which are:
- drop_duplicates(self[, subset, keep, inplace]) - Return DataFrame with duplicate rows removed, optionally only considering certain columns.
- duplicated(self[, subset, keep]) - Return boolean Series denoting duplicate rows, optionally only considering certain columns.
These methods can be applied to the DataFrame as a whole, not just to a single Series (column) as above. The equivalent would be:
boolean = df.duplicated(subset=['Student']).any() # True
# We were expecting True, as Joe can be seen twice.
However, if we are interested in the whole frame we could go ahead and do:
boolean = df.duplicated().any() # False
boolean = df.duplicated(subset=['Student','Date']).any() # False
# We were expecting False here - no duplicate rows,
# i.e. (Joe, Dec 2017) and (Joe, Dec 2018) differ
And a final useful tip. By using the keep parameter we can often skip a few steps and access what we need directly:
keep : {‘first’, ‘last’, False}, default ‘first’
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
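A quick sketch of how keep changes which rows get marked, reusing the three-row frame above:

```python
import io
import pandas as pd

data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''
df = pd.read_csv(io.StringIO(data))

# keep='first' (default): only Joe's second row is flagged
print(df.duplicated(subset=['Student'], keep='first').tolist())  # [False, False, True]

# keep='last': Joe's first row is flagged instead
print(df.duplicated(subset=['Student'], keep='last').tolist())   # [True, False, False]

# keep=False: every member of a duplicate group is flagged
print(df.duplicated(subset=['Student'], keep=False).tolist())    # [True, False, True]
```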
Example to play around with
import pandas as pd
import io
data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''
df = pd.read_csv(io.StringIO(data), sep=',')
# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n')  # True

# Approach 2: First store the boolean array, check it, then filter
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')

# Approach 3: Use the drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)
Returns
True
  Student           Date
0     Joe  December 2017
1     Bob     April 2018

  Student           Date
0     Joe  December 2017
1     Bob     April 2018
How to identify consecutive repeating values in data frame column?
To detect consecutive runs in the series, we first find the turning points: the locations where the difference from the previous entry isn't 0. The cumulative sum of that boolean marker then labels the groups:
# for the second frame
>>> consecutives = df.Value.diff().ne(0).cumsum()
>>> consecutives
0    1
1    1
2    2
3    2
4    3
5    4
6    4
7    4
8    5
9    5
But since you're interested in a particular value's consecutive runs (e.g., 0), we can mask the above to put NaNs wherever the original series isn't 0:
>>> masked_consecs = consecutives.mask(df.Value.ne(0))
>>> masked_consecs
0    NaN
1    NaN
2    2.0
3    2.0
4    NaN
5    4.0
6    4.0
7    4.0
8    NaN
9    NaN
Now we can group by this series and look at the groups' sizes:
>>> consec_sizes = df.Value.groupby(masked_consecs).size().to_numpy()
>>> consec_sizes
array([2, 3])
The final decision can be made with the threshold given (e.g., 2) to see if any of the sizes satisfy that:
>>> is_okay = (consec_sizes >= 2).any()
>>> is_okay
True
Now we can wrap this procedure in a function for reusability:
def is_consec_found(series, value=0, threshold=2):
    # mark consecutive groups
    consecs = series.diff().ne(0).cumsum()
    # disregard groups that are not runs of `value`
    masked_consecs = consecs.mask(series.ne(value))
    # get the size of each remaining group
    consec_sizes = series.groupby(masked_consecs).size().to_numpy()
    # check the sizes against the threshold
    is_okay = (consec_sizes >= threshold).any()
    # whether a suitable run was found or not
    return is_okay
and we can run it as:
# these are all for the second dataframe you posted
>>> is_consec_found(df.Value, value=0, threshold=2)
True
>>> is_consec_found(df.Value, value=0, threshold=5)
False
>>> is_consec_found(df.Value, value=1, threshold=2)
True
>>> is_consec_found(df.Value, value=1, threshold=3)
False
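For completeness, here is a hypothetical df that reproduces the group numbers and run sizes shown above (the asker's original frame isn't included here):

```python
import pandas as pd

# hypothetical data consistent with the printed group labels and run sizes
df = pd.DataFrame({'Value': [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]})

consecutives = df.Value.diff().ne(0).cumsum()          # 1,1,2,2,3,4,4,4,5,5
masked_consecs = consecutives.mask(df.Value.ne(0))     # keeps only the zero-runs
print(df.Value.groupby(masked_consecs).size().to_numpy())  # [2 3]
```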
Change duplicate value in a column
After the command:

SELECT Model, COUNT(*) FROM Devices GROUP BY Model HAVING COUNT(*) > 1;

I get the result:
- 1895 rows where Model is NULL;
- 3383 rows with duplicated Model values;
- together these fall into 1243 distinct groups.

After applying your command:

UPDATE Devices
SET Model = '-'
WHERE id NOT IN
    (SELECT MIN(Devices.id)
     FROM Devices
     GROUP BY Devices.Model);

I got 4035 rows changed. If you count, it works out: 3383 + 1895 = 5278, and 5278 - 1243 = 4035, so everything fits together; the result is what I wanted, and it works.
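If the same cleanup were done in pandas rather than SQL, a sketch might look like this (the devices frame and its values are hypothetical; duplicated(keep='first') plays the role of the MIN(id) subquery, and pandas treats repeated NULLs as duplicates of each other, just as GROUP BY puts them in one group):

```python
import pandas as pd

# hypothetical Devices table: id plus a Model column with duplicates and NULLs
devices = pd.DataFrame({
    'id':    [1, 2, 3, 4, 5],
    'Model': ['1243', '1243', None, '1243', None],
})

# mark every row whose Model already appeared on an earlier row
dupe = devices['Model'].duplicated(keep='first')
devices.loc[dupe, 'Model'] = '-'
print(devices)
```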
repeat values of a column based on a condition
Use this code after you calculate s to get the Slope column with the desired values:
sum_distance = 0
count = 0
idx = 0
slopes = []
for i in df['Distance'].values:
    idx += 1
    sum_distance += i
    if sum_distance >= 10:
        slopes += [s[count]] * idx
        count += 1
        sum_distance = 0
        idx = 0
if idx > 0:
    slopes += [s[count]] * idx
df['Slope'] = slopes
Output:
>>> df
   Altitude  Distance     Slope
0      11.2     0.000  0.898848
1      11.2     3.018  0.898848
2      10.9     4.180  0.898848
3      10.1     4.873  0.898848
4       9.9     5.499  0.844861
5       9.4     5.923  0.844861
6       9.2     6.415  0.693368
7       8.5     1.063  0.693368
8       8.4     1.667  0.693368
9       7.9     3.114  0.693368
We traverse the Distance column, summing the values and counting the rows traversed. Whenever the sum reaches 10 or more, we pick the next value from s and insert it as many times as the count indicates, then reset the sum and the count and continue.
reconciling duplicates in one column with various values in another column
If we turn FLD into an ordered factor, we can then use aggregate to get the minimum value per ID as follows:
df$FLD <- factor(df$FLD, levels=c("TERRIBLE", "BAD", "GOOD", "NA"), ordered = TRUE)
aggregate(data=df, FLD ~ ID, min)
ID FLD
1 A GOOD
2 B TERRIBLE
3 C BAD
4 D BAD
5 E TERRIBLE
6 F NA
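The same trick carries over to pandas via an ordered Categorical in recent pandas versions (the frame below is a hypothetical reconstruction, since the question's data isn't shown here):

```python
import pandas as pd

# hypothetical data with several ratings per ID
df = pd.DataFrame({
    'ID':  ['A', 'A', 'B', 'C', 'C'],
    'FLD': ['GOOD', 'NA', 'TERRIBLE', 'BAD', 'GOOD'],
})
order = ['TERRIBLE', 'BAD', 'GOOD', 'NA']
df['FLD'] = pd.Categorical(df['FLD'], categories=order, ordered=True)

# worst (minimum) rating per ID, mirroring aggregate(FLD ~ ID, min)
print(df.groupby('ID')['FLD'].min())
```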
assign a unique ID number for every repeated value in a column R
It can be done using rowid from data.table:
library(data.table)
library(dplyr)
weighted_df %>%
mutate(ID = rowid(Name))
Output:
#    Name       Room1       Room2       Room3 ID
#1   H001 0.579649851  0.84602529 0.620850211  1
#2   H001 0.579649851  0.84602529 0.620850211  2
#3   H001 0.579649851  0.84602529 0.620850211  3
#4   H001 0.579649851  0.84602529 0.620850211  4
#5   H001 0.579649851  0.84602529 0.620850211  5
#6   H001 0.579649851  0.84602529 0.620850211  6
#7   H001 0.579649851  0.84602529 0.620850211  7
#8   H001 0.579649851  0.84602529 0.620850211  8
#9   H001 0.579649851  0.84602529 0.620850211  9
#10  H001 0.579649851  0.84602529 0.620850211 10
#11  H001 0.579649851  0.84602529 0.620850211 11
#12  H001 0.579649851  0.84602529 0.620850211 12
#13  H001 0.579649851  0.84602529 0.620850211 13
#14  H001 0.579649851  0.84602529 0.620850211 14
#15  H001 0.579649851  0.84602529 0.620850211 15
#16  H001 0.579649851  0.84602529 0.620850211 16
#17  H001 0.579649851  0.84602529 0.620850211 17
#18  H002 1.457267473 -1.18612874 0.553957293  1
#19  H002 1.457267473 -1.18612874 0.553957293  2
# ...
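For readers on the pandas side, the equivalent per-group running ID is groupby(...).cumcount() + 1 (the small frame below is hypothetical):

```python
import pandas as pd

weighted_df = pd.DataFrame({'Name': ['H001'] * 3 + ['H002'] * 2})
# cumcount is 0-based within each group, so add 1 to mirror rowid()
weighted_df['ID'] = weighted_df.groupby('Name').cumcount() + 1
print(weighted_df)
#    Name  ID
# 0  H001   1
# 1  H001   2
# 2  H001   3
# 3  H002   1
# 4  H002   2
```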
Repeat values down a grouping column using R
You may want to use the following:
library(dplyr)
df <- data.frame(value = 1:23)
df %>%
  mutate(group = case_when(value < 4 ~ "A",
                           value >= 4 & value < 18 ~ "B",
                           value >= 18 & value <= 23 ~ "C")) %>%
  group_by(group) %>%
  mutate(color = if_else(row_number() %% 2 == 1, "white", "black"),
         distance = rep(1:n(), each = 2)[1:n()]) %>%
  ungroup()
#> # A tibble: 23 × 4
#>    value group color distance
#>    <int> <chr> <chr>    <int>
#>  1     1 A     white        1
#>  2     2 A     black        1
#>  3     3 A     white        2
#>  4     4 B     white        1
#>  5     5 B     black        1
#>  6     6 B     white        2
#>  7     7 B     black        2
#>  8     8 B     white        3
#>  9     9 B     black        3
#> 10    10 B     white        4
#> # … with 13 more rows
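For comparison, a pandas sketch of the same idea, assuming the same 1..23 values; cumcount plays the role of row_number():

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': range(1, 24)})
# same bucketing as the case_when above
df['group'] = np.select([df['value'] < 4, df['value'] < 18], ['A', 'B'], default='C')

pos = df.groupby('group').cumcount()            # 0-based position within each group
df['color'] = np.where(pos % 2 == 0, 'white', 'black')
df['distance'] = pos // 2 + 1                   # 1,1,2,2,3,3,... per group
print(df.head(10))
```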
How to count the unique duplicate values in each column
Option 1
If we need to count the total number of duplicated entries per column:
import pandas as pd

df = pd.DataFrame(data={'A': [1, 2, 3, 3, 2, 4, 5, 3],
                        'B': [9, 6, 7, 9, 2, 5, 3, 3],
                        'C': [4, 4, 4, 5, 9, 3, 2, 1]})
df1 = df.apply(lambda x: sum(x.duplicated()))
print(df1)
Prints:
A 3
B 2
C 2
dtype: int64
Option 2
If we need to count how many distinct values have duplicates:
df1 = df.agg(lambda x: sum(x.value_counts() > 1)) # or df1 = df.apply(lambda x: sum(x.value_counts() > 1))
print(df1)
Prints:
A 2
B 2
C 1
dtype: int64
Option 2.1
detailed
df1 = df.apply(lambda x: ' '.join([f'[val = {i}, cnt = {v}]' for i, v in x.value_counts().items() if v > 1]))  # iteritems() was removed in pandas 2.0; use items()
print(df1)
Prints:
A [val = 3, cnt = 3] [val = 2, cnt = 2]
B [val = 9, cnt = 2] [val = 3, cnt = 2]
C [val = 4, cnt = 3]
dtype: object