Repeating Values in a Column

Check for duplicate values in a Pandas DataFrame column

Main question

Is there a duplicate value in a column, True/False?

╔═════════╦═══════════════╗
║ Student ║ Date          ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob     ║ April 2018    ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2018 ║
╚═════════╩═══════════════╝

Assuming the above DataFrame (df), we can do a quick check for duplicates in the Student column with either of:

boolean = not df["Student"].is_unique      # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True


Further reading and references

Above we are using one of the Pandas Series methods. The pandas DataFrame has several useful methods, two of which are:

  1. drop_duplicates(self[, subset, keep, inplace]) - Return DataFrame with duplicate rows removed, optionally only considering certain columns.
  2. duplicated(self[, subset, keep]) - Return boolean Series denoting duplicate rows, optionally only considering certain columns.

These methods can be applied to the DataFrame as a whole, and not just to a single Series (column) as above. The equivalent would be:

boolean = df.duplicated(subset=['Student']).any() # True
# We were expecting True, as Joe can be seen twice.

However, if we are interested in the whole frame we could go ahead and do:

boolean = df.duplicated().any() # False
boolean = df.duplicated(subset=['Student','Date']).any() # False
# We were expecting False here - no row-wise duplicates,
# i.e. (Joe, December 2017) and (Joe, December 2018) are distinct rows

And a final useful tip: by using the keep parameter we can often skip a few rows and directly access what we need (see the sketch after the list below):

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.
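
For instance, a minimal sketch on the sample frame from above, showing how keep changes which rows get flagged:

import pandas as pd

df = pd.DataFrame({'Student': ['Joe', 'Bob', 'Joe'],
                   'Date': ['December 2017', 'April 2018', 'December 2018']})

# keep='first' (default): only the second Joe is flagged
print(df.duplicated(subset=['Student'], keep='first').tolist())  # [False, False, True]

# keep='last': only the first Joe is flagged
print(df.duplicated(subset=['Student'], keep='last').tolist())   # [True, False, False]

# keep=False: every occurrence of a duplicated value is flagged
print(df.duplicated(subset=['Student'], keep=False).tolist())    # [True, False, True]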


Example to play around with

import pandas as pd
import io

data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''

df = pd.read_csv(io.StringIO(data), sep=',')

# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n') # True

# Approach 2: First store the boolean array, check it, then filter
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')

# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)

Returns

True

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

How to identify consecutive repeating values in a data frame column?

To detect consecutive runs in the series, we first detect the turning points by looking at the locations where the difference from the previous entry isn't 0. The cumulative sum of this indicator then marks the groups:
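
The "second frame" the answer refers to isn't shown here; a Value series consistent with all the outputs below would be the following (a reconstruction, so treat it as an assumption):

import pandas as pd

# Hypothetical reconstruction of the question's second frame, chosen so it
# reproduces every output shown below.
df = pd.DataFrame({'Value': [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]})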

# for the second frame
>>> consecutives = df.Value.diff().ne(0).cumsum()
>>> consecutives

0    1
1    1
2    2
3    2
4    3
5    4
6    4
7    4
8    5
9    5

But since you're interested in a particular value's consecutive runs (e.g., 0), we can mask the above to put NaNs wherever we don't have 0 in the original series:

>>> masked_consecs = consecutives.mask(df.Value.ne(0))
>>> masked_consecs

0    NaN
1    NaN
2    2.0
3    2.0
4    NaN
5    4.0
6    4.0
7    4.0
8    NaN
9    NaN

Now we can group by this series and look at the groups' sizes:

>>> consec_sizes = df.Value.groupby(masked_consecs).size().to_numpy()
>>> consec_sizes

array([2, 3])

The final decision can be made against the given threshold (e.g., 2) by checking whether any of the sizes satisfy it:

>>> is_okay = (consec_sizes >= 2).any()
>>> is_okay
True

Now we can wrap this procedure in a function for reusability:

def is_consec_found(series, value=0, threshold=2):
    # mark consecutive groups
    consecs = series.diff().ne(0).cumsum()

    # disregard those groups that are not of `value`
    masked_consecs = consecs.mask(series.ne(value))

    # get the size of each group
    consec_sizes = series.groupby(masked_consecs).size().to_numpy()

    # check the sizes against the threshold
    is_okay = (consec_sizes >= threshold).any()

    # whether a suitable sequence is found or not
    return is_okay

and we can run it as:

# these are all for the second dataframe you posted
>>> is_consec_found(df.Value, value=0, threshold=2)
True

>>> is_consec_found(df.Value, value=0, threshold=5)
False

>>> is_consec_found(df.Value, value=1, threshold=2)
True

>>> is_consec_found(df.Value, value=1, threshold=3)
False

Change duplicate values in a column

After the command:

SELECT Model, COUNT(*) FROM Devices GROUP BY Model HAVING COUNT(*) > 1;

I get the result:

  • 1895 rows where Model is NULL;
  • 3383 rows with duplicate values;
  • together these rows cover 1243 distinct values.

After applying your command:

UPDATE Devices
SET Model = '-'
WHERE id NOT IN
    (SELECT MIN(Devices.id)
     FROM Devices
     GROUP BY Devices.Model);

I got 4035 rows changed.
If you count: 3383 + 1895 = 5278, and 5278 - 1243 = 4035,
so everything fits together; the result is what I needed, it works.
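
For readers following the earlier pandas sections, a rough analogue of the same idea (keep one row per Model and blank out the rest), sketched on a hypothetical frame rather than the actual Devices table:

import pandas as pd

# Hypothetical data; the real Devices table is not shown in the answer.
# Rows are assumed to be ordered by id, so "first occurrence" matches MIN(id).
devices = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                        'Model': ['X', 'X', None, None, 'Y']})

# duplicated() treats NaN values as equal to each other, so like the SQL
# above this keeps the first occurrence per Model (NULL group included).
dupes = devices.duplicated(subset=['Model'], keep='first')
devices.loc[dupes, 'Model'] = '-'
print(devices)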

Repeat values of a column based on a condition

Use this code after you calculate s to fill the Slope column with the desired values:

sum_distance = 0
count = 0
idx = 0
slopes = []

for i in df['Distance'].values:
    idx += 1
    sum_distance += i
    if sum_distance >= 10:
        slopes += [s[count]] * idx
        count += 1
        sum_distance = 0
        idx = 0

if idx > 0:
    slopes += [s[count]] * idx

df['Slope'] = slopes

Output:

>>> df
   Altitude  Distance     Slope
0      11.2     0.000  0.898848
1      11.2     3.018  0.898848
2      10.9     4.180  0.898848
3      10.1     4.873  0.898848
4       9.9     5.499  0.844861
5       9.4     5.923  0.844861
6       9.2     6.415  0.693368
7       8.5     1.063  0.693368
8       8.4     1.667  0.693368
9       7.9     3.114  0.693368

The loop traverses the Distance column, summing the values and keeping count of the values traversed. Whenever the running sum reaches 10 or more, it picks the next value from s and inserts it as many times as the count indicates, then resets the sum and the count and continues.
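
To run the snippet above on its own, the inputs can be reconstructed from the printed output (an assumption: in the original question both df and s were computed beforehand, with s holding exactly the three slope values shown):

import pandas as pd

# Reconstructed from the output above, for demonstration only.
df = pd.DataFrame({
    'Altitude': [11.2, 11.2, 10.9, 10.1, 9.9, 9.4, 9.2, 8.5, 8.4, 7.9],
    'Distance': [0.000, 3.018, 4.180, 4.873, 5.499, 5.923, 6.415, 1.063, 1.667, 3.114],
})
s = [0.898848, 0.844861, 0.693368]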

Reconciling duplicates in one column with various values in another column

If we turn FLD into an ordered factor we can then use aggregate to get the minimum value per ID as follows:

df$FLD <- factor(df$FLD, levels=c("TERRIBLE", "BAD", "GOOD", "NA"), ordered = TRUE)
aggregate(data=df, FLD ~ ID, min)

  ID      FLD
1  A     GOOD
2  B TERRIBLE
3  C      BAD
4  D      BAD
5  E TERRIBLE
6  F       NA

Assign a unique ID number for every repeated value in a column in R

It can be done using rowid from data.table:

library(data.table)
library(dplyr)
weighted_df %>%
    mutate(ID = rowid(Name))

Output:

#    Name       Room1       Room2       Room3 ID
#1   H001 0.579649851  0.84602529 0.620850211  1
#2   H001 0.579649851  0.84602529 0.620850211  2
#3   H001 0.579649851  0.84602529 0.620850211  3
#4   H001 0.579649851  0.84602529 0.620850211  4
#5   H001 0.579649851  0.84602529 0.620850211  5
#6   H001 0.579649851  0.84602529 0.620850211  6
#7   H001 0.579649851  0.84602529 0.620850211  7
#8   H001 0.579649851  0.84602529 0.620850211  8
#9   H001 0.579649851  0.84602529 0.620850211  9
#10  H001 0.579649851  0.84602529 0.620850211 10
#11  H001 0.579649851  0.84602529 0.620850211 11
#12  H001 0.579649851  0.84602529 0.620850211 12
#13  H001 0.579649851  0.84602529 0.620850211 13
#14  H001 0.579649851  0.84602529 0.620850211 14
#15  H001 0.579649851  0.84602529 0.620850211 15
#16  H001 0.579649851  0.84602529 0.620850211 16
#17  H001 0.579649851  0.84602529 0.620850211 17
#18  H002 1.457267473 -1.18612874 0.553957293  1
#19  H002 1.457267473 -1.18612874 0.553957293  2
# ...

Repeat values down a grouping column using R

You may want to use the following:

library(dplyr)

df <- data.frame(value = 1:23)

df %>%
    mutate(group = case_when(value < 4 ~ "A",
                             value >= 4 & value < 18 ~ "B",
                             value >= 18 & value <= 23 ~ "C")) %>%
    group_by(group) %>%
    mutate(color = if_else(row_number() %% 2 == 1, "white", "black"),
           distance = rep(1:n(), each = 2)[1:n()]) %>%
    ungroup()

#> # A tibble: 23 × 4
#>    value group color distance
#>    <int> <chr> <chr>    <int>
#>  1     1 A     white        1
#>  2     2 A     black        1
#>  3     3 A     white        2
#>  4     4 B     white        1
#>  5     5 B     black        1
#>  6     6 B     white        2
#>  7     7 B     black        2
#>  8     8 B     white        3
#>  9     9 B     black        3
#> 10    10 B     white        4
#> # … with 13 more rows

How to count the unique duplicate values in each column

Option 1

If we need to calculate the number of duplicate values in each column (i.e., occurrences beyond the first):

import pandas as pd

df = pd.DataFrame(data={'A': [1, 2, 3, 3, 2, 4, 5, 3],
                        'B': [9, 6, 7, 9, 2, 5, 3, 3],
                        'C': [4, 4, 4, 5, 9, 3, 2, 1]})

df1 = df.apply(lambda x: sum(x.duplicated()))
print(df1)

Prints:

A    3
B    2
C    2
dtype: int64

Option 2

If we need to calculate the number of values that have duplicates (distinct values occurring more than once, in contrast to Option 1, which counts the extra occurrences themselves):

df1 = df.agg(lambda x: sum(x.value_counts() > 1)) # or df1 = df.apply(lambda x: sum(x.value_counts() > 1))
print(df1)

Prints:

A    2
B    2
C    1
dtype: int64

Option 2.1

A more detailed breakdown, listing each duplicated value with its count:

# note: use .items() here; Series.iteritems() was removed in pandas 2.0
df1 = df.apply(lambda x: ' '.join([f'[val = {i}, cnt = {v}]' for i, v in x.value_counts().items() if v > 1]))
print(df1)

Prints:

A    [val = 3, cnt = 3] [val = 2, cnt = 2]
B    [val = 9, cnt = 2] [val = 3, cnt = 2]
C                       [val = 4, cnt = 3]
dtype: object


