Check if string is in a pandas dataframe
a['Names'].str.contains('Mel')
returns a boolean Series (an indicator vector) with one value per row, i.e. of size len(a).
Therefore, you can use
mel_count = a['Names'].str.contains('Mel').sum()
if mel_count > 0:
    print("There are {m} Mels".format(m=mel_count))
Or use any() if you don't care how many records match your query:
if a['Names'].str.contains('Mel').any():
    print("Mel is there")
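As a self-contained sketch of both variants, the snippet below builds a small frame and computes the count and the existence check from one mask. The sample names are hypothetical, and passing na=False is an assumption added here so that missing names count as non-matches.

```python
import pandas as pd

# Hypothetical sample data; the column name 'Names' follows the snippet above.
a = pd.DataFrame({"Names": ["Mel", "Bob", "Melanie", "Jessica", None]})

# na=False treats missing names as non-matches, so the mask stays boolean.
matches = a["Names"].str.contains("Mel", na=False)

mel_count = int(matches.sum())   # number of rows containing 'Mel'
has_mel = bool(matches.any())    # True if at least one row matches

print("There are {m} Mels".format(m=mel_count))
print("Mel is there" if has_mel else "No Mel found")
```

Note that str.contains does substring matching, so 'Melanie' is counted as well.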
Check if a string value of a column in a Pandas DataFrame starts with the value of another column
Inspired by this answer: https://stackoverflow.com/a/64332351/18090994
Write your own startswith function and vectorize it with numpy.vectorize. This way, you can compare the strings in col1 and col2 row by row.
from numpy import vectorize
def startswith(str1, str2):
    """Check if str1 starts with str2 (case insensitive)"""
    return str1.lower().startswith(str2.lower())
startswith = vectorize(startswith)
df['result'] = df['col2'].where(startswith(df['col2'], df['col1']), df['col1'] + df['col2'])
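To see the row-by-row comparison on concrete values, here is a minimal runnable sketch. The frame contents are hypothetical. Keep in mind that numpy.vectorize is documented as a convenience wrapper around a Python loop, so it should not be expected to run faster than a list comprehension.

```python
import numpy as np
import pandas as pd

# Hypothetical data: prepend col1 to col2 unless col2 already starts with it.
df = pd.DataFrame({
    "col1": ["Dr. ", "Mr. ", "Ms. "],
    "col2": ["dr. Smith", "Jones", "Ms. Lee"],
})

def startswith(str1, str2):
    """Check if str1 starts with str2 (case insensitive)."""
    return str1.lower().startswith(str2.lower())

# Apply the scalar function element-wise to the two columns.
mask = np.vectorize(startswith)(df["col2"], df["col1"])

# Keep col2 where it already has the prefix, otherwise use col1 + col2.
df["result"] = df["col2"].where(mask, df["col1"] + df["col2"])
print(df)
```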
Check if Pandas DataFrame cell contains certain string
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': ['NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN'],
                   'b': ['BABA UN EQUITY', '2018', '2017', '2016', 'NAN', '700 HK EQUITY', '2018', '2017', '2016', 'NAN']})
# Make sure that all NaN values are `np.nan` not `'NAN'` (strings)
df = df.replace('NAN', np.nan)
mask = df['b'].str.contains(r'EQUITY', na=True)
df.loc[mask, 'a'] = df['b']
df['a'] = df['a'].ffill()
df.loc[mask, 'a'] = np.nan
yields
                a               b
0             NaN  BABA UN EQUITY
1  BABA UN EQUITY            2018
2  BABA UN EQUITY            2017
3  BABA UN EQUITY            2016
4             NaN             NaN
5             NaN   700 HK EQUITY
6   700 HK EQUITY            2018
7   700 HK EQUITY            2017
8   700 HK EQUITY            2016
9             NaN             NaN
One slightly tricky bit above is how mask is defined. Notice that str.contains returns a Series which contains not only True and False values, but also NaN:
In [114]: df['b'].str.contains(r'EQUITY')
Out[114]:
0 True
1 False
2 False
3 False
4 NaN
5 True
6 False
7 False
8 False
9 NaN
Name: b, dtype: object
str.contains(..., na=True) is used to make the NaNs be treated as True:
In [116]: df['b'].str.contains(r'EQUITY', na=True)
Out[116]:
0 True
1 False
2 False
3 False
4 True
5 True
6 False
7 False
8 False
9 True
Name: b, dtype: bool
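The effect of the na argument can be checked in isolation with a short sketch. The three-row Series below is hypothetical, and dtype=object is set explicitly as an assumption so that the missing value stays a plain NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row Series; dtype=object keeps the missing value as NaN.
s = pd.Series(["BABA UN EQUITY", "2018", np.nan], dtype=object)

raw = s.str.contains("EQUITY")                 # missing value propagates
as_true = s.str.contains("EQUITY", na=True)    # missing value counts as a match
as_false = s.str.contains("EQUITY", na=False)  # missing value counts as a non-match

print(as_true.tolist())
print(as_false.tolist())
```

Only the variants with na= set give a purely boolean mask that is safe to pass to df.loc.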
Once you have mask, the idea is simple. Copy the values from b into a wherever mask is True:
df.loc[mask, 'a'] = df['b']
Forward-fill the NaN values in a:
df['a'] = df['a'].ffill()
Replace the values in a with NaN wherever mask is True:
df.loc[mask, 'a'] = np.nan
Check if at least one column contains a string in pandas
An option via applymap (named DataFrame.map in pandas 2.1 and later):
df['C'] = df.applymap(lambda x: 'c' in str(x).lower()).any(axis=1)
Via stack/unstack (either of the two lines works):
df['C'] = df.stack().str.contains('c', case=False).unstack().any(axis=1)
df['C'] = df.stack().str.lower().str.contains('c').unstack().any(axis=1)
OUTPUT:
    A    B      C
0  ax  YCm   True
1  bx  YAm  False
2  cx  YBm   True
3  ax  YAm  False
4  bx  YBm  False
5  cx  YCm   True
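For a version that can be run end to end, the sketch below rebuilds columns A and B from the output above and uses a per-column str.contains. This per-column apply is a swapped-in alternative to the applymap and stack/unstack approaches, not the method from the answer, and it assumes every selected column holds strings.

```python
import pandas as pd

# Sample data matching columns A and B of the output above.
df = pd.DataFrame({
    "A": ["ax", "bx", "cx", "ax", "bx", "cx"],
    "B": ["YCm", "YAm", "YBm", "YAm", "YBm", "YCm"],
})

# Test each column with the vectorized string method (case-insensitive),
# then reduce across columns: True if any column in the row matches.
contains_c = df[["A", "B"]].apply(lambda col: col.str.contains("c", case=False))
df["C"] = contains_c.any(axis=1)
print(df)
```

Selecting the columns explicitly keeps the check from picking up the result column C if the code is run twice.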
Pandas dataframe: Check if regex contained in a column matches a string in another column in the same row
You can't use a pandas builtin method directly. You will need to apply a re.search per row:
import re
mask = df.apply(lambda r: bool(re.search(r['patterns'], r['strings'])), axis=1)
df2 = df[mask]
or using a (faster) list comprehension:
mask = [bool(re.search(p,s)) for p,s in zip(df['patterns'], df['strings'])]
output:
  strings patterns  group
0   apple      \ba      1
3   train      n\b      2
4     tan      n\b      2
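The list-comprehension variant can be tried on a small frame as follows. The data is hypothetical: only rows 0, 3 and 4 mirror the output above, rows 1 and 2 are invented non-matching examples, and the group column is left out.

```python
import re
import pandas as pd

# Hypothetical data; rows 1 and 2 are invented to show non-matching cases.
df = pd.DataFrame({
    "strings": ["apple", "banana", "cat", "train", "tan"],
    "patterns": [r"\ba", r"\ba", r"n\b", r"n\b", r"n\b"],
})

# Search each row's pattern in that row's string; zip pairs them up row by row.
mask = [bool(re.search(p, s)) for p, s in zip(df["patterns"], df["strings"])]

# A plain list of booleans works as a row filter.
df2 = df[mask]
print(df2)
```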
Check if ENTIRE pandas object column is a string
You can use pandas.api.types.infer_dtype:
>>> pd.api.types.infer_dtype(df2["postal"])
'string'
>>> pd.api.types.infer_dtype(df1["postal"])
'floating'
From the docs:
Efficiently infer the type of a passed val, or list-like array of values. Return a string describing the type.
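A minimal sketch of turning this into a yes/no check follows. The postal-code values and the helper name is_all_strings are hypothetical, and the columns are created with dtype=object to mimic an object column read from a file.

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical postal-code columns stored with object dtype.
all_strings = pd.Series(["10115", "80331", np.nan], dtype=object)
all_floats = pd.Series([10115.0, 80331.0], dtype=object)
mixed = pd.Series(["10115", 80331.0], dtype=object)

print(infer_dtype(all_strings))  # missing values are skipped by default
print(infer_dtype(all_floats))
print(infer_dtype(mixed))

def is_all_strings(column):
    """Hypothetical helper: True if every non-missing value is a str."""
    return infer_dtype(column, skipna=True) == "string"

print(is_all_strings(all_strings), is_all_strings(mixed))
```

With skipna=True, NaN entries do not prevent a column from being reported as 'string'; pass skipna=False if missing values should disqualify it.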
Check if string is in another column pandas
You can also replace the square brackets with word boundaries (\b) and use re.search, as in:
import re
#...
df.apply(lambda x: bool(re.search(x['col1'].replace("[",r"\b").replace("]",r"\b"), x['col2'])), axis=1)
# => 0     True
#    1     True
#    2    False
#    3     True
#    dtype: bool
This works because \b7\b finds a match in [0%, 7%], as 7 is neither preceded nor followed by letters, digits or underscores. No match is found in [30%, 7%], as \b0\b does not match a zero after a digit (here, 3).
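The two cases described above can be reproduced with a short sketch. The frame contents and the helper name to_word_pattern are hypothetical. The sketch also assumes the text inside the brackets contains no regex metacharacters; otherwise it would need escaping, which is not done here.

```python
import re
import pandas as pd

# Hypothetical data in the bracketed format discussed above.
df = pd.DataFrame({
    "col1": ["[7]", "[0]", "[0]"],
    "col2": ["[0%, 7%]", "[0%, 7%]", "[30%, 7%]"],
})

def to_word_pattern(value):
    # Hypothetical helper: turn '[7]' into a regex with word boundaries on both sides.
    return value.replace("[", r"\b").replace("]", r"\b")

# Search each row's col1 pattern inside that row's col2 string.
result = df.apply(
    lambda row: bool(re.search(to_word_pattern(row["col1"]), row["col2"])),
    axis=1,
)
print(result.tolist())
```

The last row is the non-match case: the 0 in 30% follows a digit, so there is no word boundary in front of it.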