get all rows that have same value in pandas
Another approach. The results are not in the exact format you mentioned; they are grouped.
data = pd.read_csv('iris.data.txt', sep=',', header=None)
data.columns = ['Sepal.Length' , 'Sepal.Width' , 'Petal.Length', 'Petal.Width' ,'Species' , 'ID']
data['ID'] = data.index
# I guess you don't want these
data.drop(['Petal.Width','Petal.Length','Species'], axis=1, inplace=True)
def check(data):
    if len(data) > 1:
        index_list = list(data.index.values)
        index_list.append(index_list[0])
        data['ExSepal.Length'] = data['Sepal.Length']
        data['ExSepal.Width'] = data['Sepal.Width']
        data['ExId'] = [int(index_list[i]) for i in range(1, len(index_list))]
        return data
data.groupby('Sepal.Length').apply(check)
Output
Sepal.Length Sepal.Width ID ExSepal.Length ExSepal.Width \
Sepal.Length
4.4 8 4.4 2.9 8 4.4 2.9
38 4.4 3.0 38 4.4 3.0
42 4.4 3.2 42 4.4 3.2
4.6 3 4.6 3.1 3 4.6 3.1
6 4.6 3.4 6 4.6 3.4
22 4.6 3.6 22 4.6 3.6
47 4.6 3.2 47 4.6 3.2
4.7 2 4.7 3.2 2 4.7 3.2
29 4.7 3.2 29 4.7 3.2
4.8 11 4.8 3.4 11 4.8 3.4
ExId
Sepal.Length
4.4 8 38
38 42
42 8
4.6 3 6
6 22
22 47
47 3
4.7 2 29
29 2
4.8 11 12
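A self-contained sketch of the same idea on a tiny frame (the column names follow the answer above; the data values are illustrative, not the iris file):

```python
import pandas as pd

# Tiny illustrative frame in place of the iris file
df = pd.DataFrame({
    'Sepal.Length': [4.4, 4.6, 4.4, 4.6],
    'Sepal.Width': [2.9, 3.1, 3.0, 3.4],
})
df['ID'] = df.index

def check(group):
    # For groups sharing a Sepal.Length, pair each row with the
    # index of the next row in the group (wrapping to the first).
    if len(group) > 1:
        idx = list(group.index) + [group.index[0]]
        group = group.copy()
        group['ExId'] = idx[1:]
        return group

out = df.groupby('Sepal.Length').apply(check)
```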
How to get all the rows with the same values on a certain set of columns as another specified row in Pandas?
You can put the subset of columns in a list, get the values of the given row at those columns with the .loc accessor, check for equality, call all with axis=1, and finally select the rows with the resulting boolean mask.
>>> cols = ['B', 'C']
>>> index = 3
>>> df[(df[cols]==df.loc[index, cols]).all(1)]
OUTPUT:
A B C
0 9 80 900
3 8 80 900
6 2 80 900
9 7 80 900
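For reference, here is a runnable version with a hypothetical df consistent with the output above (only the B and C values at the shown rows are constrained; the rest is made up):

```python
import pandas as pd

# Hypothetical frame consistent with the output shown above
df = pd.DataFrame({
    'A': [9, 5, 1, 8, 4, 6, 2, 3, 0, 7],
    'B': [80, 10, 20, 80, 30, 40, 80, 50, 60, 80],
    'C': [900, 100, 200, 900, 300, 400, 900, 500, 600, 900],
})

cols = ['B', 'C']
index = 3

# Rows whose values in `cols` equal row 3's values in those columns
result = df[(df[cols] == df.loc[index, cols]).all(axis=1)]
print(result)
```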
Get rows that have the same value across its columns in pandas
Similar to Andy Hayden's answer: check whether the row minimum equals the row maximum (if so, all elements in the row are the same):
df[df.apply(lambda x: min(x) == max(x), 1)]
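An equivalent vectorized check (an alternative sketch, not from the answer above): a row whose values are all equal has exactly one unique value, so DataFrame.nunique(axis=1) avoids the Python-level apply:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 5, 3], 'c': [1, 6, 3]})

# apply-based check from the answer: row min equals row max
same_apply = df[df.apply(lambda x: min(x) == max(x), axis=1)]

# vectorized equivalent: rows with a single unique value
same_nunique = df[df.nunique(axis=1) == 1]
```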
Find rows of dataframe with the same column value in Pandas
I believe you need DataFrame.duplicated to find all duplicates by column, and DataFrame.sort_values for ordering:
df = pd.DataFrame({
'id':[1,2,3,4,5,6],
'code':list('abcdac'),
})
print (df)
id code
0 1 a
1 2 b
2 3 c
3 4 d
4 5 a
5 6 c
df1 = df[df.duplicated('code', keep=False)].sort_values('code')
print (df1)
id code
0 1 a
4 5 a
2 3 c
5 6 c
Or, if you need lists, use groupby with list:
df2 = df[df.duplicated('code', keep=False)].groupby('code')['id'].apply(list).reset_index()
print (df2)
code id
0 a [1, 5]
1 c [3, 6]
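An equivalent way to keep all duplicated rows (a sketch, not part of the answer above) uses groupby with transform('size') instead of duplicated:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6], 'code': list('abcdac')})

# Keep rows whose 'code' appears more than once, then order by code
df1 = df[df.groupby('code')['code'].transform('size').gt(1)].sort_values('code')
print(df1)
```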
Select rows containing certain values from pandas dataframe
Introduction
At the heart of selecting rows, we need a 1D mask: a pandas Series of boolean elements with the same length as df; let's call it mask. Then df[mask] gives us the selected rows of df via boolean indexing.
Here's our starting df:
In [42]: df
Out[42]:
A B C
1 apple banana pear
2 pear pear apple
3 banana pear pear
4 apple apple pear
I. Match one string
Now, if we need to match just one string, it's straightforward with elementwise equality:
In [42]: df == 'banana'
Out[42]:
A B C
1 False True False
2 False False False
3 True False False
4 False False False
If we need to look for ANY match in each row, use the .any method:
In [43]: (df == 'banana').any(axis=1)
Out[43]:
1 True
2 False
3 True
4 False
dtype: bool
To select the corresponding rows:
In [44]: df[(df == 'banana').any(axis=1)]
Out[44]:
A B C
1 apple banana pear
3 banana pear pear
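The interactive steps above, collected into one runnable script (the df is reconstructed from the display above, including the index values):

```python
import pandas as pd

# df reconstructed from the display above
df = pd.DataFrame({
    'A': ['apple', 'pear', 'banana', 'apple'],
    'B': ['banana', 'pear', 'pear', 'apple'],
    'C': ['pear', 'apple', 'pear', 'pear'],
}, index=[1, 2, 3, 4])

# Elementwise equality, then ANY match along each row
mask = (df == 'banana').any(axis=1)
print(df[mask])
```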
II. Match multiple strings
1. Search for ANY match
Here's our starting df:
In [42]: df
Out[42]:
A B C
1 apple banana pear
2 pear pear apple
3 banana pear pear
4 apple apple pear
NumPy's np.isin works here (or pandas isin, as listed in other posts) to get all matches from a list of search strings in df. So, say we are looking for 'pear' or 'apple' in df:
In [51]: np.isin(df, ['pear','apple'])
Out[51]:
array([[ True, False, True],
[ True, True, True],
[False, True, True],
[ True, True, True]])
# ANY match along each row
In [52]: np.isin(df, ['pear','apple']).any(axis=1)
Out[52]: array([ True, True, True, True])
# Select corresponding rows with masking
In [56]: df[np.isin(df, ['pear','apple']).any(axis=1)]
Out[56]:
A B C
1 apple banana pear
2 pear pear apple
3 banana pear pear
4 apple apple pear
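The same ANY-match selection can stay in pure pandas with DataFrame.isin (a sketch on the reconstructed df):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['apple', 'pear', 'banana', 'apple'],
    'B': ['banana', 'pear', 'pear', 'apple'],
    'C': ['pear', 'apple', 'pear', 'pear'],
}, index=[1, 2, 3, 4])

# True wherever a cell is one of the search strings
hits = df.isin(['pear', 'apple'])
result = df[hits.any(axis=1)]
print(result)  # every row here contains at least one match
```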
2. Search for ALL matches
Here's our starting df again:
In [42]: df
Out[42]:
A B C
1 apple banana pear
2 pear pear apple
3 banana pear pear
4 apple apple pear
So, now we are looking for rows that contain BOTH of, say, ['pear','apple']. We will make use of NumPy broadcasting:
In [66]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1)
Out[66]:
array([[ True, True],
[ True, True],
[ True, False],
[ True, True]])
So, we have a search list of 2 items and hence a 2D mask with number of rows = len(df) and number of columns = number of search items. Thus, in the above result, the first column is for 'pear' and the second one for 'apple'.
To make things concrete, let's get a mask for three items ['apple','banana', 'pear']:
In [62]: np.equal.outer(df.to_numpy(copy=False), ['apple','banana', 'pear']).any(axis=1)
Out[62]:
array([[ True, True, True],
[ True, False, True],
[False, True, True],
[ True, False, True]])
The columns of this mask are for 'apple','banana', 'pear'
respectively.
Back to the 2-search-item case, we had earlier:
In [66]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1)
Out[66]:
array([[ True, True],
[ True, True],
[ True, False],
[ True, True]])
Since we are looking for ALL matches in each row:
In [67]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1).all(axis=1)
Out[67]: array([ True, True, False, True])
Finally, select the rows:
In [70]: df[np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1).all(axis=1)]
Out[70]:
A B C
1 apple banana pear
2 pear pear apple
4 apple apple pear
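A pure-pandas equivalent of the broadcasting trick (an alternative sketch, not from the answer above): build one boolean column per search item, then require all of them:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['apple', 'pear', 'banana', 'apple'],
    'B': ['banana', 'pear', 'pear', 'apple'],
    'C': ['pear', 'apple', 'pear', 'pear'],
}, index=[1, 2, 3, 4])

items = ['pear', 'apple']
# One column per item: does the row contain that item anywhere?
per_item = pd.concat([(df == item).any(axis=1) for item in items], axis=1)
result = df[per_item.all(axis=1)]
print(result)
```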
In Pandas how do I select rows that have a duplicate in one column but different values in another?
The first step is to find the names that have more than one unique Country; then you can use loc on your dataframe to filter to only those values.
Method 1: groupby
# groupby name and return a boolean of whether each has more than 1 unique Country
multi_country = df.groupby(["Name"]).Country.nunique().gt(1)
# use loc to only see those values that have `True` in `multi_country`:
df.loc[df.Name.isin(multi_country[multi_country].index)]
Name Country
2 Mary US
3 Mary Canada
4 Mary US
Method 2: drop_duplicates and value_counts
You can follow the same logic, but use drop_duplicates and value_counts instead of groupby:
multi_country = df.drop_duplicates().Name.value_counts().gt(1)
df.loc[df.Name.isin(multi_country[multi_country].index)]
Name Country
2 Mary US
3 Mary Canada
4 Mary US
Method 3: drop_duplicates and duplicated
Note: this gives slightly different results: you'll only see Mary's unique values, which may or may not be desired.
You can drop the duplicates in the original frame, and return only the names that have multiple entries in the deduped frame:
no_dups = df.drop_duplicates()
no_dups[no_dups.duplicated(keep = False, subset="Name")]
Name Country
2 Mary US
3 Mary Canada
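For reference, a runnable sketch of Method 1 on a hypothetical input consistent with the outputs above (the John rows are made up):

```python
import pandas as pd

# Hypothetical input consistent with the outputs shown above
df = pd.DataFrame({
    'Name': ['John', 'John', 'Mary', 'Mary', 'Mary'],
    'Country': ['US', 'US', 'US', 'Canada', 'US'],
})

# Names with more than 1 unique Country
multi_country = df.groupby(['Name']).Country.nunique().gt(1)

# Filter to only those names
result = df.loc[df.Name.isin(multi_country[multi_country].index)]
print(result)
```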