Python & Pandas: How to Query If a List-Type Column Contains Something

Filter dataframe rows if value in column is in a given list of values

Use the isin method:

rpt[rpt['STK_ID'].isin(stk_list)]
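
For context, a minimal self-contained sketch; the rpt frame and stk_list here are hypothetical stand-ins for the question's data:

import pandas as pd

# hypothetical data standing in for the question's rpt frame
rpt = pd.DataFrame({'STK_ID': ['600809', '600141', '600329'],
                    'RPT_Date': ['20120331', '20120331', '20120331']})
stk_list = ['600809', '600329']

# isin returns a boolean Series; indexing with it keeps matching rows
print(rpt[rpt['STK_ID'].isin(stk_list)])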

Check if certain value is contained in a dataframe column in pandas

You need str.contains if you want rows where the values of column date contain the string 07311954:

print(df[df['date'].astype(str).str.contains('07311954')])

Or, if the date column is already of string type:

print(df[df['date'].str.contains('07311954')])

If you want to check whether the last 4 digits of column date contain the string 1954:

print(df[df['date'].astype(str).str[-4:].str.contains('1954')])

Sample:

print(df['date'])
0    8152007
1    9262007
2    7311954
3    2252011
4    2012011
5    2012011
6    2222011
7    2282011
Name: date, dtype: int64

print(df['date'].astype(str).str[-4:].str.contains('1954'))
0    False
1    False
2     True
3    False
4    False
5    False
6    False
7    False
Name: date, dtype: bool

print(df[df['date'].astype(str).str[-4:].str.contains('1954')])
     cmte_id trans_typ entity_typ state employer occupation     date  \
2  C00119040       24K        CCM    MD      NaN        NaN  7311954

  amount     fec_id    cand_id
2   1000  C00140715  H2MD05155
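
As a side note, since the check targets the last four characters, str.endswith is a simpler, regex-free alternative (a sketch against the same df as above):

# plain-string suffix test, no slicing or regex needed
print(df[df['date'].astype(str).str.endswith('1954')])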

How to test if a string contains one of the substrings in a list, in pandas?

One option is to use the regex alternation character | to match any of the substrings against the words in your Series s (still using str.contains).

You can construct the regex by joining the words in searchfor with |:

>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0    cat
1    hat
2    dog
3    fog
dtype: object

As @AndyHayden noted in the comments, take care if your substrings contain special characters such as $ and ^ that you want to match literally. These characters have specific meanings in regular expressions and will affect the matching.

You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:

>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']

The strings in this new list will match each character literally when used with str.contains.
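
Putting the two pieces together, a minimal sketch (the Series contents are invented for illustration):

import re
import pandas as pd

s = pd.Series(['$money maker', 'x^y plot', 'nothing here'])
matches = ['$money', 'x^y']
safe_matches = [re.escape(m) for m in matches]

# the escaped substrings join into one alternation pattern, so $ and ^
# are matched literally instead of acting as regex anchors
print(s[s.str.contains('|'.join(safe_matches))])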

How to filter out data in a column using pandas DF

You can use

>>> import pandas as pd
>>> df = pd.DataFrame({"Temp": ["Temperature 1:33.1, Temperature 2:-50.0, Temperature 3:-50.0, Temperature 4:-50.0^",
...                             "Temperature 1:26.7, Temperature 2:-50.0, Temperature 3:-50.0, Temperature 4:-50.0^",
...                             "Temperature 1:31.1, Temperature 2:-50.0, Temperature 3:-50.0, Temperature 4:-50.0^",
...                             "302,16/06/2021 15:28:49,0,0,0,0,0,0^",
...                             "$36,515,0,1,1,00124F^"]})
>>> df['Temp'] = pd.to_numeric(df['Temp'].str.extract(r'^Temperature\s+1:(\d+(?:\.\d+)?)', expand=False))
>>> df
   Temp
0  33.1
1  26.7
2  31.1
3   NaN
4   NaN

Details of the regex:

  • ^ - start of string
  • Temperature - the literal word Temperature
  • \s+ - one or more whitespace characters
  • 1: - the literal string 1:
  • (\d+(?:\.\d+)?) - Group 1: one or more digits, then an optional sequence of a . and one or more digits.

Compare value of Dataframe column with list value

Pass the type as the string "int" rather than Python's native int type, which Spark does not recognize. Also, to create a column in a Spark DataFrame, use the withColumn method instead of direct assignment:

df.withColumn('E', df.articles.isin(a_list).astype('int')).show()
+---+--------+---+
| id|articles|  E|
+---+--------+---+
|  1|       4|  1|
|  2|       3|  0|
|  5|       6|  1|
+---+--------+---+
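
For reference, a self-contained sketch; the SparkSession setup and sample data are assumptions added here, and only the withColumn call comes from the answer above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical data matching the output shown above
df = spark.createDataFrame([(1, 4), (2, 3), (5, 6)], ['id', 'articles'])
a_list = [4, 6]

# isin yields a boolean column; astype('int') casts True/False to 1/0
df.withColumn('E', df.articles.isin(a_list).astype('int')).show()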

Pandas dataframe get first row of each group

>>> df.groupby('id').first()
     value
id
1    first
2    first
3    first
4   second
5    first
6    first
7   fourth

If you need id as a column:

>>> df.groupby('id').first().reset_index()
   id   value
0   1   first
1   2   first
2   3   first
3   4  second
4   5   first
5   6   first
6   7  fourth

To get the first n records of each group, use head():

>>> df.groupby('id').head(2).reset_index(drop=True)
    id   value
0    1   first
1    1  second
2    2   first
3    2  second
4    3   first
5    3   third
6    4  second
7    4   fifth
8    5   first
9    6   first
10   6  second
11   7  fourth
12   7   fifth
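
For reproducibility, here is one hypothetical reconstruction of the frame behind the outputs above (reverse-engineered from the printed results):

import pandas as pd

df = pd.DataFrame({
    'id':    [1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 7],
    'value': ['first', 'second', 'third', 'first', 'second',
              'first', 'third', 'second', 'fifth', 'first',
              'first', 'second', 'fourth', 'fifth'],
})

# first row per group, then first two rows per group
print(df.groupby('id').first())
print(df.groupby('id').head(2).reset_index(drop=True))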

Select rows such that specific column contains values from a list

Use np.in1d to build a boolean mask marking which rows contain any of the elements we are searching for in the given column, then use boolean indexing to select those rows from the input array:

arr[np.in1d(arr[:,3], [4,8])]

Sample run -

In [149]: arr
Out[149]:
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [150]: np.in1d(arr[:,3], [4,8]) # Mask of valid ones
Out[150]: array([ True,  True, False], dtype=bool)

In [151]: arr[np.in1d(arr[:,3], [4,8])] # Select rows off arr
Out[151]:
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
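
On NumPy 1.13+, np.isin is the recommended successor to np.in1d and behaves the same for this use; a quick sketch:

import numpy as np

arr = np.array([[ 1,  2,  3,  4],
                [ 5,  6,  7,  8],
                [ 9, 10, 11, 12]])

# same mask-and-index pattern, using the newer np.isin
print(arr[np.isin(arr[:, 3], [4, 8])])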

Set value for particular cell in pandas DataFrame with iloc

For mixed positional and label-based indexing, the historical answer was .ix. Note, however, that .ix is deprecated and has been removed in modern pandas (1.0+), and it caused confusion whenever the index itself was integer-typed:

df.ix[0, 'COL_NAME'] = x

Update:

In current pandas, use .iloc together with columns.get_loc:

df.iloc[0, df.columns.get_loc('COL_NAME')] = x

Example:

import pandas as pd
import numpy as np

# your data
# ========================
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 2), columns=['col1', 'col2'], index=np.random.randint(1,100,10)).sort_index()

print(df)

      col1    col2
10  1.7641  0.4002
24  0.1440  1.4543
29  0.3131 -0.8541
32  0.9501 -0.1514
33  1.8676 -0.9773
36  0.7610  0.1217
56  1.4941 -0.2052
58  0.9787  2.2409
75 -0.1032  0.4106
76  0.4439  0.3337

# .iloc with get_loc
# ===================================
df.iloc[0, df.columns.get_loc('col2')] = 100

print(df)

      col1      col2
10  1.7641  100.0000
24  0.1440    1.4543
29  0.3131   -0.8541
32  0.9501   -0.1514
33  1.8676   -0.9773
36  0.7610    0.1217
56  1.4941   -0.2052
58  0.9787    2.2409
75 -0.1032    0.4106
76  0.4439    0.3337
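
If the cell is addressed by label rather than by position, df.at (or df.loc) is a concise alternative. A short sketch reusing the frame above:

# label-based scalar assignment: row label 10, column 'col2'
df.at[10, 'col2'] = 100

df.at is optimized for scalar access, so it is the idiomatic choice when both the row and column labels are known.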

