Get All Rows That Have Same Value in Pandas


Another approach. The result is not in exactly the format you mentioned; the rows come out grouped.

import pandas as pd

# iris.data.txt has five columns; the ID column is added from the index below
data = pd.read_csv('iris.data.txt', sep=',', header=None)
data.columns = ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']
data['ID'] = data.index

# I guess you don't want these
data.drop(['Petal.Width', 'Petal.Length', 'Species'], axis=1, inplace=True)

def check(data):
    if len(data) > 1:
        index_list = list(data.index.values)
        index_list.append(index_list[0])
        data['ExSepal.Length'] = data['Sepal.Length']
        data['ExSepal.Width'] = data['Sepal.Width']
        # pair each row with the index of the next row in the group, wrapping around
        data['ExId'] = [int(index_list[i]) for i in range(1, len(index_list))]
        return data

data.groupby('Sepal.Length').apply(check)

Output

                 Sepal.Length  Sepal.Width  ID  ExSepal.Length  ExSepal.Width  \
Sepal.Length
4.4          8            4.4          2.9   8             4.4            2.9
             38           4.4          3.0  38             4.4            3.0
             42           4.4          3.2  42             4.4            3.2
4.6          3            4.6          3.1   3             4.6            3.1
             6            4.6          3.4   6             4.6            3.4
             22           4.6          3.6  22             4.6            3.6
             47           4.6          3.2  47             4.6            3.2
4.7          2            4.7          3.2   2             4.7            3.2
             29           4.7          3.2  29             4.7            3.2
4.8          11           4.8          3.4  11             4.8            3.4

                 ExId
Sepal.Length
4.4          8     38
             38    42
             42     8
4.6          3      6
             6     22
             22    47
             47     3
4.7          2     29
             29     2
4.8          11    12
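If the grouping isn't required and you just want every row whose Sepal.Length occurs more than once, DataFrame.duplicated with keep=False is a simpler sketch (the small sample data below is assumed, not taken from the iris file):

```python
import pandas as pd

df = pd.DataFrame({
    'Sepal.Length': [4.4, 4.6, 4.4, 4.7, 5.0],
    'Sepal.Width':  [2.9, 3.1, 3.0, 3.2, 3.5],
})

# keep=False marks every member of a duplicated group, not just the repeats
dupes = df[df.duplicated('Sepal.Length', keep=False)]
print(dupes)
```

Here only 4.4 appears twice, so exactly those two rows survive the mask.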

How to get all the rows with the same values on a certain set of columns of an other specified row in Pandas?

Take the subset of columns as a list, get the values at the given index for that subset with the .loc accessor, check for equality, call all with axis=1, and finally index the dataframe with the resulting mask.

>>> cols = ['B', 'C']
>>> index = 3
>>> df[(df[cols]==df.loc[index, cols]).all(1)]

OUTPUT:

   A   B    C
0  9  80  900
3  8  80  900
6  2  80  900
9  7  80  900
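As a self-contained sketch (the data below is assumed so that it reproduces the output shown):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [9, 1, 5, 8, 4, 3, 2, 6, 0, 7],
    'B': [80, 10, 20, 80, 30, 40, 80, 50, 60, 80],
    'C': [900, 100, 200, 900, 300, 400, 900, 500, 600, 900],
})

cols = ['B', 'C']
index = 3

# compare every row's B/C values against row 3's, keep rows where all match
result = df[(df[cols] == df.loc[index, cols]).all(axis=1)]
print(result)
```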

Get rows that have the same value across its columns in pandas

Similar to Andy Hayden's answer: check whether the minimum equals the maximum (if so, all elements in the row are equal):

df[df.apply(lambda x: min(x) == max(x), axis=1)]
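A vectorized equivalent (a sketch, not from the original answer) compares each row against its first column, or counts unique values per row:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 5, 3], 'c': [1, 6, 3]})

# rows where every value equals the first column's value
same = df[df.eq(df.iloc[:, 0], axis=0).all(axis=1)]

# equivalent: rows with exactly one unique value
same2 = df[df.nunique(axis=1) == 1]
print(same)
```

Both avoid the Python-level lambda, which matters on large frames.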

Find rows of dataframe with the same column value in Pandas

I believe you need DataFrame.duplicated for all dupes by column and for ordering use DataFrame.sort_values:

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'code': list('abcdac'),
})

print (df)
   id code
0   1    a
1   2    b
2   3    c
3   4    d
4   5    a
5   6    c

df1 = df[df.duplicated('code', keep=False)].sort_values('code')
print (df1)
   id code
0   1    a
4   5    a
2   3    c
5   6    c

Or if need lists use groupby with list:

df2 = df[df.duplicated('code', keep=False)].groupby('code')['id'].apply(list).reset_index()
print (df2)
  code      id
0    a  [1, 5]
1    c  [3, 6]
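GroupBy.filter gives the same rows as df1 in one step, if that reads better (a sketch using the same sample data):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6], 'code': list('abcdac')})

# keep only groups with more than one row, then order by code
df1 = df.groupby('code').filter(lambda g: len(g) > 1).sort_values('code')
print(df1)
```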

Select rows containing certain values from pandas dataframe

Introduction

At the heart of selecting rows is a 1D mask: a pandas Series of boolean elements with the same length as df; let's call it mask. Then df[mask] gives us the selected rows of df via boolean indexing.

Here's our starting df :

In [42]: df
Out[42]:
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear

I. Match one string

Now, if we need to match just one string, it's straightforward with elementwise equality:

In [42]: df == 'banana'
Out[42]:
       A      B      C
1  False   True  False
2  False  False  False
3   True  False  False
4  False  False  False

If we need ANY one match in each row, use the .any method:

In [43]: (df == 'banana').any(axis=1)
Out[43]:
1     True
2    False
3     True
4    False
dtype: bool

To select corresponding rows :

In [44]: df[(df == 'banana').any(axis=1)]
Out[44]:
        A       B     C
1   apple  banana  pear
3  banana    pear  pear
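Putting section I together as a runnable sketch (the fruit data is assumed from the table above):

```python
import pandas as pd

df = pd.DataFrame(
    {'A': ['apple', 'pear', 'banana', 'apple'],
     'B': ['banana', 'pear', 'pear', 'apple'],
     'C': ['pear', 'apple', 'pear', 'pear']},
    index=[1, 2, 3, 4],
)

# boolean frame -> per-row ANY -> boolean indexing
mask = (df == 'banana').any(axis=1)
result = df[mask]
print(result)
```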


II. Match multiple strings

1. Search for ANY match

Here's our starting df :

In [42]: df
Out[42]:
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear

NumPy's np.isin would work here (or use DataFrame.isin as listed in other posts) to get all matches of the list of search strings in df. So, say we are looking for 'pear' or 'apple' in df:

In [51]: np.isin(df, ['pear','apple'])
Out[51]:
array([[ True, False,  True],
       [ True,  True,  True],
       [False,  True,  True],
       [ True,  True,  True]])

# ANY match along each row
In [52]: np.isin(df, ['pear','apple']).any(axis=1)
Out[52]: array([ True,  True,  True,  True])

# Select corresponding rows with masking
In [56]: df[np.isin(df, ['pear','apple']).any(axis=1)]
Out[56]:
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear
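The pure-pandas equivalent uses DataFrame.isin, which returns a boolean frame directly (a sketch, with the same assumed data):

```python
import pandas as pd

df = pd.DataFrame(
    {'A': ['apple', 'pear', 'banana', 'apple'],
     'B': ['banana', 'pear', 'pear', 'apple'],
     'C': ['pear', 'apple', 'pear', 'pear']},
    index=[1, 2, 3, 4],
)

# ANY-match: keep rows containing at least one of the search strings
out = df[df.isin(['pear', 'apple']).any(axis=1)]
print(out)
```

Every row here contains 'pear' or 'apple', so all four rows come back.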

2. Search for ALL match

Here's our starting df again :

In [42]: df
Out[42]:
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear

So, now we are looking for rows that contain BOTH of, say, ['pear','apple']. We will make use of NumPy broadcasting:

In [66]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1)
Out[66]:
array([[ True,  True],
       [ True,  True],
       [ True, False],
       [ True,  True]])

So, we have a search list of 2 items, hence a 2D mask with number of rows = len(df) and number of columns = number of search items. Thus, in the above result, the first column is for 'pear' and the second for 'apple'.

To make things concrete, let's get a mask for three items ['apple','banana', 'pear'] :

In [62]: np.equal.outer(df.to_numpy(copy=False), ['apple','banana', 'pear']).any(axis=1)
Out[62]:
array([[ True,  True,  True],
       [ True, False,  True],
       [False,  True,  True],
       [ True, False,  True]])

The columns of this mask are for 'apple','banana', 'pear' respectively.

Back to 2 search items case, we had earlier :

In [66]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1)
Out[66]:
array([[ True,  True],
       [ True,  True],
       [ True, False],
       [ True,  True]])

Since we are looking for ALL matches in each row:

In [67]: np.equal.outer(df.to_numpy(copy=False),  ['pear','apple']).any(axis=1).all(axis=1)
Out[67]: array([ True, True, False, True])

Finally, select rows :

In [70]: df[np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1).all(axis=1)]
Out[70]:
       A       B      C
1  apple  banana   pear
2   pear    pear  apple
4  apple   apple   pear
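Without the NumPy outer comparison, a plain-pandas sketch checks each search item separately and combines the per-item masks:

```python
import pandas as pd

df = pd.DataFrame(
    {'A': ['apple', 'pear', 'banana', 'apple'],
     'B': ['banana', 'pear', 'pear', 'apple'],
     'C': ['pear', 'apple', 'pear', 'pear']},
    index=[1, 2, 3, 4],
)

search = ['pear', 'apple']

# one boolean Series per search item: does the row contain it anywhere?
mask = pd.concat([(df == s).any(axis=1) for s in search], axis=1).all(axis=1)
result = df[mask]
print(result)
```

Row 3 drops out because it contains 'pear' but never 'apple'.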

In Pandas how do I select rows that have a duplicate in one column but different values in another?

First step is to find the names that have more than 1 unique Country, and then you can use loc on your dataframe to filter in only those values.

Method 1: groupby

# groupby name and return a boolean of whether each has more than 1 unique Country
multi_country = df.groupby(["Name"]).Country.nunique().gt(1)

# use loc to only see those values that have `True` in `multi_country`:
df.loc[df.Name.isin(multi_country[multi_country].index)]

   Name Country
2  Mary      US
3  Mary  Canada
4  Mary      US
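As a self-contained sketch (the Name/Country data is assumed so it matches the output shown):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bob', 'Bob', 'Mary', 'Mary', 'Mary'],
    'Country': ['UK', 'UK', 'US', 'Canada', 'US'],
})

# Series indexed by Name: True where a name has more than 1 unique Country
multi_country = df.groupby('Name').Country.nunique().gt(1)

# keep only rows whose Name is flagged True
result = df.loc[df.Name.isin(multi_country[multi_country].index)]
print(result)
```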

Method 2: drop_duplicates and value_counts

You can follow the same logic, but use drop_duplicates and value_counts instead of groupby:

multi_country = df.drop_duplicates().Name.value_counts().gt(1)

df.loc[df.Name.isin(multi_country[multi_country].index)]

   Name Country
2  Mary      US
3  Mary  Canada
4  Mary      US

Method 3: drop_duplicates and duplicated

Note: this will give slightly different results: you'll only see Mary's unique rows, which may or may not be what you want.

You can drop the duplicates in the original frame, and return only the names that have multiple entries in the deduped frame:

no_dups = df.drop_duplicates()

no_dups[no_dups.duplicated(keep = False, subset="Name")]

   Name Country
2  Mary      US
3  Mary  Canada
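The same filtering can also be written with GroupBy.filter, which keeps the full rows in one step (a sketch, using the same assumed data as above):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bob', 'Bob', 'Mary', 'Mary', 'Mary'],
    'Country': ['UK', 'UK', 'US', 'Canada', 'US'],
})

# keep groups whose Country column has more than one unique value
result = df.groupby('Name').filter(lambda g: g.Country.nunique() > 1)
print(result)
```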

