Check If String Is in a Pandas Dataframe

Check if string is in a pandas dataframe

a['Names'].str.contains('Mel') returns a boolean Series with one entry per row of the DataFrame (i.e. of length len(a)).

Therefore, you can use

mel_count = a['Names'].str.contains('Mel').sum()
if mel_count > 0:
    print("There are {m} Mels".format(m=mel_count))

Or any(), if you don't care how many records match your query

if a['Names'].str.contains('Mel').any():
    print("Mel is there")
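Putting the two checks together, here is a minimal runnable sketch; the DataFrame a and its contents are hypothetical stand-ins for the original data:

```python
import pandas as pd

# Hypothetical data; any DataFrame with a 'Names' column of strings works
a = pd.DataFrame({'Names': ['Bob', 'Jessica', 'Mary', 'John', 'Mel']})

# Count matching rows
mel_count = a['Names'].str.contains('Mel').sum()
if mel_count > 0:
    print("There are {m} Mels".format(m=mel_count))

# Or just test for existence
if a['Names'].str.contains('Mel').any():
    print("Mel is there")
```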

Check if a string value of a column in a Pandas DataFrame starts with the value of another column

Inspired by this answer: https://stackoverflow.com/a/64332351/18090994

Write your own startswith function and vectorize it with numpy.vectorize. In this way, you can compare the strings in col1 and col2 row by row.

from numpy import vectorize

def startswith(str1, str2):
    """Check if str1 starts with str2 (case insensitive)."""
    return str1.lower().startswith(str2.lower())

startswith = vectorize(startswith)
df['result'] = df['col2'].where(startswith(df['col2'], df['col1']), df['col1'] + df['col2'])
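A small self-contained run of the same idea; the two-row DataFrame here is made-up example data, not from the original question:

```python
import numpy as np
import pandas as pd

def startswith(str1, str2):
    """Check if str1 starts with str2 (case insensitive)."""
    return str1.lower().startswith(str2.lower())

# Vectorize so the comparison runs element-wise over two Series
startswith = np.vectorize(startswith)

# Hypothetical example data
df = pd.DataFrame({'col1': ['ab', 'xy'], 'col2': ['ABCD', 'abcd']})

# Keep col2 where it starts with col1 (case-insensitively),
# otherwise fall back to the concatenation col1 + col2
df['result'] = df['col2'].where(startswith(df['col2'], df['col1']),
                                df['col1'] + df['col2'])
print(df)
```

Here 'ABCD' starts with 'ab' once lowercased, so it is kept; 'abcd' does not start with 'xy', so the concatenation 'xyabcd' is used instead.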

Check if Pandas DataFrame cell contains certain string

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['NAN'] * 10,
                   'b': ['BABA UN EQUITY', '2018', '2017', '2016', 'NAN',
                         '700 HK EQUITY', '2018', '2017', '2016', 'NAN']})

# Make sure that all NaN values are `np.nan` not `'NAN'` (strings)
df = df.replace('NAN', np.nan)
mask = df['b'].str.contains(r'EQUITY', na=True)
df.loc[mask, 'a'] = df['b']
df['a'] = df['a'].ffill()
df.loc[mask, 'a'] = np.nan

yields

                a               b
0             NaN  BABA UN EQUITY
1  BABA UN EQUITY            2018
2  BABA UN EQUITY            2017
3  BABA UN EQUITY            2016
4             NaN             NaN
5             NaN   700 HK EQUITY
6   700 HK EQUITY            2018
7   700 HK EQUITY            2017
8   700 HK EQUITY            2016
9             NaN             NaN

One slightly tricky bit above is how mask is defined. Notice that str.contains
returns a Series which contains not only True and False values, but also NaN:

In [114]: df['b'].str.contains(r'EQUITY')
Out[114]:
0     True
1    False
2    False
3    False
4      NaN
5     True
6    False
7    False
8    False
9      NaN
Name: b, dtype: object

str.contains(..., na=True) is used to make the NaNs be treated as True:

In [116]: df['b'].str.contains(r'EQUITY', na=True)
Out[116]:
0     True
1    False
2    False
3    False
4     True
5     True
6    False
7    False
8    False
9     True
Name: b, dtype: bool

Once you have mask, the idea is simple: copy the values from b into a wherever mask is True:

df.loc[mask, 'a'] = df['b']

Forward-fill the NaN values in a:

df['a'] = df['a'].ffill()

Replace the values in a with NaN wherever mask is True:

df.loc[mask, 'a'] = np.nan

Check if at least one column contains a string in pandas

An option via applymap (renamed to DataFrame.map in pandas 2.1):

df['C'] = df.applymap(lambda x: 'c' in str(x).lower()).any(axis=1)

Via stack/unstack:

df['C'] = df.stack().str.contains('c', case=False).unstack().any(axis=1)
df['C'] = df.stack().str.lower().str.contains('c').unstack().any(axis=1)
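As a runnable sketch of the stack/unstack variant (the data is reconstructed from the output shown below, so treat it as an assumption):

```python
import pandas as pd

# Hypothetical data reconstructed from the expected output
df = pd.DataFrame({'A': ['ax', 'bx', 'cx', 'ax', 'bx', 'cx'],
                   'B': ['YCm', 'YAm', 'YBm', 'YAm', 'YBm', 'YCm']})

# stack() flattens all cells into one Series, the string test runs once,
# then unstack() restores the shape so any(axis=1) checks per row
df['C'] = df.stack().str.contains('c', case=False).unstack().any(axis=1)
print(df)
```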

OUTPUT:

    A    B      C
0  ax  YCm   True
1  bx  YAm  False
2  cx  YBm   True
3  ax  YAm  False
4  bx  YBm  False
5  cx  YCm   True

Pandas dataframe: Check if regex contained in a column matches a string in another column in the same row

You can't use a pandas built-in method directly; you need to apply re.search per row:

import re

mask = df.apply(lambda r: bool(re.search(r['patterns'], r['strings'])), axis=1)
df2 = df[mask]

or using a (faster) list comprehension:

mask = [bool(re.search(p, s)) for p, s in zip(df['patterns'], df['strings'])]

output:

  strings patterns  group
0   apple      \ba      1
3   train      n\b      2
4     tan      n\b      2
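End to end, with example data reconstructed from the output above (the rows that get filtered out are assumptions):

```python
import re
import pandas as pd

# Hypothetical data; rows 1 and 2 are assumed non-matching fillers
df = pd.DataFrame({'strings': ['apple', 'orange', 'banana', 'train', 'tan'],
                   'patterns': [r'\ba', r'\bo\b', r'\bb\b', r'n\b', r'n\b'],
                   'group': [1, 1, 1, 2, 2]})

# Test each row's pattern against that row's string
mask = [bool(re.search(p, s)) for p, s in zip(df['patterns'], df['strings'])]
df2 = df[mask]
print(df2)
```

A plain Python list of booleans works fine as a row mask here; the list comprehension avoids the per-row overhead of DataFrame.apply.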

Check if ENTIRE pandas object column is a string

You can use pandas.api.types.infer_dtype:

>>> pd.api.types.infer_dtype(df2["postal"])
'string'
>>> pd.api.types.infer_dtype(df1["postal"])
'floating'

From the docs:

Efficiently infer the type of a passed val, or list-like array of values. Return a string describing the type.
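A minimal reproduction of the two calls above; the postal-code frames are hypothetical stand-ins for df1 and df2:

```python
import pandas as pd

# Hypothetical frames: postal codes stored as strings vs. as floats
df2 = pd.DataFrame({'postal': ['12345', '90210']})
df1 = pd.DataFrame({'postal': [12345.0, 90210.0]})

print(pd.api.types.infer_dtype(df2['postal']))  # 'string'
print(pd.api.types.infer_dtype(df1['postal']))  # 'floating'
```

Unlike checking dtype == object, infer_dtype inspects the actual values, so it distinguishes an object column of pure strings from one holding floats or mixed types.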

Check if string is in another column pandas

You can also replace the square brackets with word boundaries \b and use re.search, as in:

import re
# ...
df.apply(lambda x: bool(re.search(x['col1'].replace("[", r"\b").replace("]", r"\b"), x['col2'])), axis=1)
# => 0     True
#    1     True
#    2    False
#    3     True
#    dtype: bool

This will work because \b7\b will find a match in [0%, 7%] as 7 is neither preceded nor followed with letters, digits or underscores. There won't be any match found in [30%, 7%] as \b0\b does not match a zero after a digit (here, 3).
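A self-contained sketch of the boundary trick using the two cases discussed above as made-up sample rows:

```python
import re
import pandas as pd

# Hypothetical data: col1 holds a bracketed value, col2 the text to search
df = pd.DataFrame({'col1': ['[7]', '[0]'],
                   'col2': ['[0%, 7%]', '[30%, 7%]']})

# Turn '[7]' into the regex \b7\b, then search it in col2 row by row
result = df.apply(
    lambda x: bool(re.search(x['col1'].replace("[", r"\b").replace("]", r"\b"),
                             x['col2'])),
    axis=1)
print(result)
```

Row 0 matches because the 7 in '[0%, 7%]' sits between non-word characters; row 1 fails because the 0 in '30%' is preceded by the digit 3, so there is no word boundary before it.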


