Check for duplicate values in Pandas dataframe column
Main question
Is there a duplicate value in a column, True/False?
╔═════════╦═══════════════╗
║ Student ║ Date          ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2017 ║
║ Bob     ║ April 2018    ║
║ Joe     ║ December 2018 ║
╚═════════╩═══════════════╝
Assuming the above dataframe (df), we can quickly check whether the Student column contains duplicates with either of:
boolean = not df["Student"].is_unique # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True
Further reading and references
Above we are using one of the Pandas Series methods. The pandas DataFrame has several useful methods, two of which are:
- drop_duplicates(self[, subset, keep, inplace]) - Return DataFrame with duplicate rows removed, optionally only considering certain columns.
- duplicated(self[, subset, keep]) - Return boolean Series denoting duplicate rows, optionally only considering certain columns.
These methods can be applied to the DataFrame as a whole, not just to a single Series (column) as above. The equivalent would be:
boolean = df.duplicated(subset=['Student']).any() # True
# We were expecting True, as Joe can be seen twice.
However, if we are interested in the whole frame we could go ahead and do:
boolean = df.duplicated().any() # False
boolean = df.duplicated(subset=['Student','Date']).any() # False
# We were expecting False here - no duplicate rows,
# i.e. (Joe, Dec 2017) and (Joe, Dec 2018) differ
And a final useful tip. By using the keep parameter we can often skip a few steps and access what we need directly:
keep : {‘first’, ‘last’, False}, default ‘first’
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
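A quick sketch of how keep changes which rows get marked, reusing the three-row frame above:

```python
import io
import pandas as pd

data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''
df = pd.read_csv(io.StringIO(data))

# keep='first' (default): only Joe's second row is flagged
print(df.duplicated(subset=['Student'], keep='first').tolist())  # [False, False, True]

# keep='last': Joe's first row is flagged instead
print(df.duplicated(subset=['Student'], keep='last').tolist())   # [True, False, False]

# keep=False: every member of a duplicate group is flagged
print(df.duplicated(subset=['Student'], keep=False).tolist())    # [True, False, True]
```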
Example to play around with
import pandas as pd
import io
data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''
df = pd.read_csv(io.StringIO(data), sep=',')
# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n')  # True

# Approach 2: First store the boolean array, check it, then filter
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')

# Approach 3: Use the drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)
Returns
True
  Student           Date
0     Joe  December 2017
1     Bob     April 2018

  Student           Date
0     Joe  December 2017
1     Bob     April 2018
How to identify consecutive repeating values in data frame column?
To detect consecutive runs in the series, we first find the turning points: the locations where the difference from the previous entry isn't 0. The cumulative sum of that boolean marker then labels the groups:
# for the second frame
>>> consecutives = df.Value.diff().ne(0).cumsum()
>>> consecutives
0    1
1    1
2    2
3    2
4    3
5    4
6    4
7    4
8    5
9    5
But since you're interested in a particular value's consecutive runs (e.g., 0), we can mask the above to put NaNs wherever the original series isn't 0:
>>> masked_consecs = consecutives.mask(df.Value.ne(0))
>>> masked_consecs
0    NaN
1    NaN
2    2.0
3    2.0
4    NaN
5    4.0
6    4.0
7    4.0
8    NaN
9    NaN
Now we can group by this series and look at the groups' sizes:
>>> consec_sizes = df.Value.groupby(masked_consecs).size().to_numpy()
>>> consec_sizes
array([2, 3])
The final decision can be made with the threshold given (e.g., 2) to see if any of the sizes satisfy that:
>>> is_okay = (consec_sizes >= 2).any()
>>> is_okay
True
Now we can wrap this procedure in a function for reusability:
def is_consec_found(series, value=0, threshold=2):
    # mark consecutive groups
    consecs = series.diff().ne(0).cumsum()
    # disregard groups that are not runs of `value`
    masked_consecs = consecs.mask(series.ne(value))
    # get the size of each remaining group
    consec_sizes = series.groupby(masked_consecs).size().to_numpy()
    # check the sizes against the threshold
    is_okay = (consec_sizes >= threshold).any()
    # whether a suitable run was found or not
    return is_okay
and we can run it as:
# these are all for the second dataframe you posted
>>> is_consec_found(df.Value, value=0, threshold=2)
True
>>> is_consec_found(df.Value, value=0, threshold=5)
False
>>> is_consec_found(df.Value, value=1, threshold=2)
True
>>> is_consec_found(df.Value, value=1, threshold=3)
False
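For completeness, here is a hypothetical df that reproduces the group numbers and run sizes shown above (the asker's original frame isn't included here):

```python
import pandas as pd

# hypothetical data consistent with the printed group labels and run sizes
df = pd.DataFrame({'Value': [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]})

consecutives = df.Value.diff().ne(0).cumsum()          # 1,1,2,2,3,4,4,4,5,5
masked_consecs = consecutives.mask(df.Value.ne(0))     # keeps only the zero-runs
print(df.Value.groupby(masked_consecs).size().to_numpy())  # [2 3]
```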
Change duplicate value in a column
After the command:

SELECT Model, COUNT(*) FROM Devices GROUP BY Model HAVING COUNT(*) > 1;

I get the result:
- 1895 rows where Model is NULL;
- 3383 rows with duplicated Model values;
- together these fall into 1243 distinct groups.

After applying your command:

UPDATE Devices
SET Model = '-'
WHERE id NOT IN
    (SELECT MIN(Devices.id)
     FROM Devices
     GROUP BY Devices.Model);

I got 4035 rows changed. If you count, it works out: 3383 + 1895 = 5278, and 5278 - 1243 = 4035, so everything fits together; the result is what I wanted, and it works.
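If the same cleanup were done in pandas rather than SQL, a sketch might look like this (the devices frame and its values are hypothetical; duplicated(keep='first') plays the role of the MIN(id) subquery, and pandas treats repeated NULLs as duplicates of each other, just as GROUP BY puts them in one group):

```python
import pandas as pd

# hypothetical Devices table: id plus a Model column with duplicates and NULLs
devices = pd.DataFrame({
    'id':    [1, 2, 3, 4, 5],
    'Model': ['1243', '1243', None, '1243', None],
})

# mark every row whose Model already appeared on an earlier row
dupe = devices['Model'].duplicated(keep='first')
devices.loc[dupe, 'Model'] = '-'
print(devices)
```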
repeat values of a column based on a condition
Use this code after you calculate s to get the Slope column with the desired values:
sum_distance = 0
count = 0
idx = 0
slopes = []
for i in df['Distance'].values:
    idx += 1
    sum_distance += i
    if sum_distance >= 10:
        slopes += [s[count]] * idx
        count += 1
        sum_distance = 0
        idx = 0
if idx > 0:
    slopes += [s[count]] * idx
df['Slope'] = slopes
Output:
>>> df
   Altitude  Distance     Slope
0      11.2     0.000  0.898848
1      11.2     3.018  0.898848
2      10.9     4.180  0.898848
3      10.1     4.873  0.898848
4       9.9     5.499  0.844861
5       9.4     5.923  0.844861
6       9.2     6.415  0.693368
7       8.5     1.063  0.693368
8       8.4     1.667  0.693368
9       7.9     3.114  0.693368
We traverse the Distance column, summing the values and counting the rows traversed. Whenever the sum reaches 10 or more, we pick the next value from s and insert it as many times as the count indicates, then reset the sum and the count and continue.
reconciling duplicates in one column with various values in another column
If we turn FLD into an ordered factor, we can then use aggregate to get the minimum value per ID as follows:
df$FLD <- factor(df$FLD, levels=c("TERRIBLE", "BAD", "GOOD", "NA"), ordered = TRUE)
aggregate(data=df, FLD ~ ID, min)
ID FLD
1 A GOOD
2 B TERRIBLE
3 C BAD
4 D BAD
5 E TERRIBLE
6 F NA
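The same trick carries over to pandas via an ordered Categorical in recent pandas versions (the frame below is a hypothetical reconstruction, since the question's data isn't shown here):

```python
import pandas as pd

# hypothetical data with several ratings per ID
df = pd.DataFrame({
    'ID':  ['A', 'A', 'B', 'C', 'C'],
    'FLD': ['GOOD', 'NA', 'TERRIBLE', 'BAD', 'GOOD'],
})
order = ['TERRIBLE', 'BAD', 'GOOD', 'NA']
df['FLD'] = pd.Categorical(df['FLD'], categories=order, ordered=True)

# worst (minimum) rating per ID, mirroring aggregate(FLD ~ ID, min)
print(df.groupby('ID')['FLD'].min())
```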
assign a unique ID number for every repeated value in a column R
It can be done using rowid from data.table:
library(data.table)
library(dplyr)
weighted_df %>%
mutate(ID = rowid(Name))
Output:
#    Name       Room1       Room2       Room3 ID
#1   H001 0.579649851  0.84602529 0.620850211  1
#2   H001 0.579649851  0.84602529 0.620850211  2
#3   H001 0.579649851  0.84602529 0.620850211  3
#4   H001 0.579649851  0.84602529 0.620850211  4
#5   H001 0.579649851  0.84602529 0.620850211  5
#6   H001 0.579649851  0.84602529 0.620850211  6
#7   H001 0.579649851  0.84602529 0.620850211  7
#8   H001 0.579649851  0.84602529 0.620850211  8
#9   H001 0.579649851  0.84602529 0.620850211  9
#10  H001 0.579649851  0.84602529 0.620850211 10
#11  H001 0.579649851  0.84602529 0.620850211 11
#12  H001 0.579649851  0.84602529 0.620850211 12
#13  H001 0.579649851  0.84602529 0.620850211 13
#14  H001 0.579649851  0.84602529 0.620850211 14
#15  H001 0.579649851  0.84602529 0.620850211 15
#16  H001 0.579649851  0.84602529 0.620850211 16
#17  H001 0.579649851  0.84602529 0.620850211 17
#18  H002 1.457267473 -1.18612874 0.553957293  1
#19  H002 1.457267473 -1.18612874 0.553957293  2
# ...
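For readers on the pandas side, the equivalent per-group running ID is groupby(...).cumcount() + 1 (the small frame below is hypothetical):

```python
import pandas as pd

weighted_df = pd.DataFrame({'Name': ['H001'] * 3 + ['H002'] * 2})
# cumcount is 0-based within each group, so add 1 to mirror rowid()
weighted_df['ID'] = weighted_df.groupby('Name').cumcount() + 1
print(weighted_df)
#    Name  ID
# 0  H001   1
# 1  H001   2
# 2  H001   3
# 3  H002   1
# 4  H002   2
```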
Repeat values down a grouping column using R
You may want to use the following:
library(dplyr)
df <- data.frame(value = 1:23)
df %>%
  mutate(group = case_when(value < 4 ~ "A",
                           value >= 4 & value < 18 ~ "B",
                           value >= 18 & value <= 23 ~ "C")) %>%
  group_by(group) %>%
  mutate(color = if_else(row_number() %% 2 == 1, "white", "black"),
         distance = rep(1:n(), each = 2)[1:n()]) %>%
  ungroup()
#> # A tibble: 23 × 4
#>    value group color distance
#>    <int> <chr> <chr>    <int>
#>  1     1 A     white        1
#>  2     2 A     black        1
#>  3     3 A     white        2
#>  4     4 B     white        1
#>  5     5 B     black        1
#>  6     6 B     white        2
#>  7     7 B     black        2
#>  8     8 B     white        3
#>  9     9 B     black        3
#> 10    10 B     white        4
#> # … with 13 more rows
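For comparison, a pandas sketch of the same idea, assuming the same 1..23 values; cumcount plays the role of row_number():

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': range(1, 24)})
# same bucketing as the case_when above
df['group'] = np.select([df['value'] < 4, df['value'] < 18], ['A', 'B'], default='C')

pos = df.groupby('group').cumcount()            # 0-based position within each group
df['color'] = np.where(pos % 2 == 0, 'white', 'black')
df['distance'] = pos // 2 + 1                   # 1,1,2,2,3,3,... per group
print(df.head(10))
```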
How to count the unique duplicate values in each column
Option 1
If we need to count the total number of duplicated entries per column:
import pandas as pd

df = pd.DataFrame(data={'A': [1, 2, 3, 3, 2, 4, 5, 3],
                        'B': [9, 6, 7, 9, 2, 5, 3, 3],
                        'C': [4, 4, 4, 5, 9, 3, 2, 1]})
df1 = df.apply(lambda x: sum(x.duplicated()))
print(df1)
Prints:
A 3
B 2
C 2
dtype: int64
Option 2
If we need to count how many distinct values have duplicates:
df1 = df.agg(lambda x: sum(x.value_counts() > 1)) # or df1 = df.apply(lambda x: sum(x.value_counts() > 1))
print(df1)
Prints:
A 2
B 2
C 1
dtype: int64
Option 2.1
detailed
df1 = df.apply(lambda x: ' '.join([f'[val = {i}, cnt = {v}]' for i, v in x.value_counts().items() if v > 1]))  # iteritems() was removed in pandas 2.0; use items()
print(df1)
Prints:
A [val = 3, cnt = 3] [val = 2, cnt = 2]
B [val = 9, cnt = 2] [val = 3, cnt = 2]
C [val = 4, cnt = 3]
dtype: object