Filter dataframe rows if value in column is in a set list of values
Use the isin method:
rpt[rpt['STK_ID'].isin(stk_list)]
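A minimal runnable sketch of the same idea; the rpt DataFrame and stk_list contents here are hypothetical stand-ins for the question's data:

```python
import pandas as pd

# hypothetical data standing in for the question's rpt / stk_list
rpt = pd.DataFrame({'STK_ID': ['600001', '600002', '600003', '600004'],
                    'price':  [10.5, 8.2, 12.1, 9.9]})
stk_list = ['600002', '600004']

# isin builds a boolean mask; indexing with it keeps only matching rows
filtered = rpt[rpt['STK_ID'].isin(stk_list)]
print(filtered)
```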
Check if certain value is contained in a dataframe column in pandas
I think you need str.contains, if you need rows where the values of column date contain the string 07311954:
print(df[df['date'].astype(str).str.contains('07311954')])
Or if the type of the date column is string:
print(df[df['date'].str.contains('07311954')])
If you want to check the last 4 digits for the string 1954 in column date:
print(df[df['date'].astype(str).str[-4:].str.contains('1954')])
Sample:
print(df['date'])
0 8152007
1 9262007
2 7311954
3 2252011
4 2012011
5 2012011
6 2222011
7 2282011
Name: date, dtype: int64
print(df['date'].astype(str).str[-4:].str.contains('1954'))
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
Name: date, dtype: bool
print(df[df['date'].astype(str).str[-4:].str.contains('1954')])
cmte_id trans_typ entity_typ state employer occupation date \
2 C00119040 24K CCM MD NaN NaN 7311954
amount fec_id cand_id
2 1000 C00140715 H2MD05155
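The snippets above can be reproduced end to end; a small self-contained sketch, where the int-typed date column is a hypothetical reconstruction mirroring the sample output:

```python
import pandas as pd

# hypothetical int-typed date column mirroring the sample output above
df = pd.DataFrame({'date': [8152007, 9262007, 7311954, 2252011]})

# cast to string first: str.contains requires string data
mask = df['date'].astype(str).str.contains('7311954')
print(df[mask])

# or match only the last four characters
last4 = df['date'].astype(str).str[-4:].str.contains('1954')
print(df[last4])
```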
How to test if a string contains one of the substrings in a list, in pandas?
One option is just to use the regex | character to try to match each of the substrings in the words in your Series s (still using str.contains).
You can construct the regex by joining the words in searchfor with |:
>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0 cat
1 hat
2 dog
3 fog
dtype: object
As @AndyHayden noted in the comments below, take care if your substrings have special characters such as $ and ^ which you want to match literally. These characters have specific meanings in the context of regular expressions and will affect the matching. You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:
>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']
The strings in this new list will match each character literally when used with str.contains.
How to filter out data in a column using pandas DF
You can use
>>> import pandas as pd
>>> df= pd.DataFrame({"Temp":["Temperature 1:33.1, Temperature 2:-50.0, Temperature 3:-50.0, Temperature 4:-50.0^",
"Temperature 1:26.7, Temperature 2:-50.0, Temperature 3:-50.0, Temperature 4:-50.0^",
"Temperature 1:31.1, Temperature 2:-50.0, Temperature 3:-50.0, Temperature 4:-50.0^",
"302,16/06/2021 15:28:49,0,0,0,0,0,0^",
"$36,515,0,1,1,00124F^"]})
>>> df['Temp'] = pd.to_numeric(df['Temp'].str.extract(r'^Temperature\s+1:(\d+(?:\.\d+)?)', expand=False))
>>> df
Temp
0 33.1
1 26.7
2 31.1
3 NaN
4 NaN
See this regex demo. Details:
^ - start of string
Temperature - a word
\s+ - one or more whitespaces
1: - a 1: string
(\d+(?:\.\d+)?) - Group 1: one or more digits and then an optional sequence of a . and one or more digits.
Compare value of Dataframe column with list value
Provide your type as the string "int" instead of int, which is Python's native type that Spark doesn't recognize. Also, to create a column in a Spark DataFrame, use the withColumn method instead of direct assignment:
df.withColumn('E', df.articles.isin(a_list).astype('int')).show()
+---+--------+---+
| id|articles| E|
+---+--------+---+
| 1| 4| 1|
| 2| 3| 0|
| 5| 6| 1|
+---+--------+---+
Pandas dataframe get first row of each group
>>> df.groupby('id').first()
value
id
1 first
2 first
3 first
4 second
5 first
6 first
7 fourth
If you need id as a column:
>>> df.groupby('id').first().reset_index()
id value
0 1 first
1 2 first
2 3 first
3 4 second
4 5 first
5 6 first
6 7 fourth
To get the first n records, you can use head():
>>> df.groupby('id').head(2).reset_index(drop=True)
id value
0 1 first
1 1 second
2 2 first
3 2 second
4 3 first
5 3 third
6 4 second
7 4 fifth
8 5 first
9 6 first
10 6 second
11 7 fourth
12 7 fifth
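The same idea in a self-contained sketch; the id/value frame below is a hypothetical reconstruction matching the shape of the question's data, not its exact content:

```python
import pandas as pd

# hypothetical data: several rows per id, in insertion order
df = pd.DataFrame({'id':    [1, 1, 2, 2, 3, 3],
                   'value': ['first', 'second', 'first', 'second', 'first', 'third']})

# first row of each group, with id restored as a column
firsts = df.groupby('id').first().reset_index()
print(firsts)

# first two rows of each group
top2 = df.groupby('id').head(2).reset_index(drop=True)
print(top2)
```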
Select rows such that specific column contains values from a list
Use np.in1d to create a mask of any occurrence of the elements that we are searching for and then simply use boolean indexing to select the valid rows off the input array:
arr[np.in1d(arr[:,3], [4,8])]
Sample run:
In [149]: arr
Out[149]:
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
In [150]: np.in1d(arr[:,3], [4,8]) # Mask of valid ones
Out[150]: array([ True, True, False], dtype=bool)
In [151]: arr[np.in1d(arr[:,3], [4,8])] # Select rows off arr
Out[151]:
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
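Note that newer NumPy versions recommend np.isin over np.in1d; for 1-D test inputs like this it behaves the same. A minimal sketch:

```python
import numpy as np

arr = np.array([[1,  2,  3,  4],
                [5,  6,  7,  8],
                [9, 10, 11, 12]])

# np.isin is the modern spelling of np.in1d for this use case
mask = np.isin(arr[:, 3], [4, 8])
selected = arr[mask]
print(selected)
```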
Set value for particular cell in pandas DataFrame with iloc
For mixed position and index, older pandas versions offered .ix, BUT you need to make sure that your index is not of integer type, otherwise it will cause confusion. Note that .ix was deprecated in pandas 0.20 and removed in 1.0, so on modern pandas prefer the .iloc approach below.
df.ix[0, 'COL_NAME'] = x
Update:
Alternatively, try
df.iloc[0, df.columns.get_loc('COL_NAME')] = x
Example:
import pandas as pd
import numpy as np
# your data
# ========================
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 2), columns=['col1', 'col2'], index=np.random.randint(1,100,10)).sort_index()
print(df)
col1 col2
10 1.7641 0.4002
24 0.1440 1.4543
29 0.3131 -0.8541
32 0.9501 -0.1514
33 1.8676 -0.9773
36 0.7610 0.1217
56 1.4941 -0.2052
58 0.9787 2.2409
75 -0.1032 0.4106
76 0.4439 0.3337
# .iloc with get_loc
# ===================================
df.iloc[0, df.columns.get_loc('col2')] = 100
df
col1 col2
10 1.7641 100.0000
24 0.1440 1.4543
29 0.3131 -0.8541
32 0.9501 -0.1514
33 1.8676 -0.9773
36 0.7610 0.1217
56 1.4941 -0.2052
58 0.9787 2.2409
75 -0.1032 0.4106
76 0.4439 0.3337
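For setting a single scalar cell, .iat (the position-based scalar accessor) is a faster alternative to .iloc; a small sketch of the same pattern, still using get_loc for the column position:

```python
import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame(np.random.randn(3, 2), columns=['col1', 'col2'])

# .iat takes integer positions only and is optimized for scalar access
df.iat[0, df.columns.get_loc('col2')] = 100
print(df)
```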