Filter a column which contains several keywords
grep treats | as an "or" inside a pattern, so why not paste your filters together with | as a separator:
dfilter <- df1[grep(paste0(filter1, collapse = "|"), df1$type),]
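For readers working in pandas rather than R, the same "join the keywords with |" idea can be sketched as follows. The frame, the column values and the keyword list are made up for illustration; `re.escape` is added on the assumption that the keywords are meant as literal text, so that characters such as `.` are not read as regex syntax.

```python
import re
import pandas as pd

# Hypothetical data standing in for df1 and filter1 from the R answer.
df1 = pd.DataFrame({'type': ['red apple', 'pear v1.0', 'pear v1x0', 'kiwi']})
filter1 = ['apple', 'v1.0']

# Escape each keyword so it is matched literally, then join with "|" (regex OR).
pattern = '|'.join(re.escape(k) for k in filter1)

# Keep the rows whose 'type' contains at least one keyword.
dfilter = df1[df1['type'].str.contains(pattern, na=False)]
print(dfilter)
```

Without the escaping step, `v1.0` would also match `v1x0`, because an unescaped dot matches any character.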
Filter a column with multiple keywords in Django
You can build the filter with Q() objects, OR-ing one condition per keyword:
from django.db.models import Q
query = Q()
for k in datas["keywords"]:
    query |= Q(task_name__contains=k)
tasks = Task.objects.filter(query)
Filter Data by multiple keywords
I think you can create a separate mask for each keyword and then combine them by chaining with &. To keep the rows that have at least one True, use DataFrame.any:
df_rest = pd.DataFrame({0:['OpenSSL XYZ dd','dd OpenSSL','g OpenSSL'],
1:['CVE-2017-XX OpenSSL dd','dd OpenSSL','g XYZ'],
2:['OpenSSL t','dd XYZ','g CVE-2017-XX XYZ OpenSSL']})
cols = [0,1,2]
m1 = df_rest[cols].apply(lambda r: r.str.contains('OpenSSL', case=False))
print (m1)
      0      1      2
0  True   True   True
1  True   True  False
2  True  False   True
m2 = df_rest[cols].apply(lambda r: r.str.contains('XYZ', case=False))
print (m2)
       0      1      2
0   True  False  False
1  False  False   True
2  False   True   True
m3 = df_rest[cols].apply(lambda r: r.str.contains('CVE-2017-XX', case=False))
print (m3)
       0      1      2
0  False   True  False
1  False  False  False
2  False  False   True
print (m1 & m2)
       0      1      2
0   True  False  False
1  False  False  False
2  False  False   True
print ((m1 & m2).any(axis=1))
0 True
1 False
2 True
dtype: bool
df = df_rest[(m1 & m2).any(axis=1)]
print (df)
                0                       1                          2
0  OpenSSL XYZ dd  CVE-2017-XX OpenSSL dd                  OpenSSL t
2       g OpenSSL                   g XYZ  g CVE-2017-XX XYZ OpenSSL
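When the number of keywords is not fixed, writing m1, m2, m3 by hand gets tedious. A minimal sketch of the same approach for an arbitrary keyword list is shown here; the `keywords` list and the helper `cell_mask` are hypothetical names, and `regex=False` is used on the assumption that the keywords are plain text.

```python
from functools import reduce
import operator

import pandas as pd

df_rest = pd.DataFrame({0: ['OpenSSL XYZ dd', 'dd OpenSSL', 'g OpenSSL'],
                        1: ['CVE-2017-XX OpenSSL dd', 'dd OpenSSL', 'g XYZ'],
                        2: ['OpenSSL t', 'dd XYZ', 'g CVE-2017-XX XYZ OpenSSL']})

keywords = ['OpenSSL', 'XYZ']  # hypothetical list; any length works


def cell_mask(frame, keyword):
    """Boolean frame: True where a cell contains the keyword (case-insensitive)."""
    return frame.apply(lambda col: col.str.contains(keyword, case=False, regex=False))


# One mask per keyword, combined with & so a cell must contain every keyword.
masks = [cell_mask(df_rest, k) for k in keywords]
all_in_cell = reduce(operator.and_, masks)

# Keep the rows where at least one cell contains all keywords.
df = df_rest[all_in_cell.any(axis=1)]
print(df)
```

With the two keywords above this selects the same rows as `(m1 & m2).any(axis=1)`.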
EDIT:
Some keywords may be interpreted as regular expressions. To avoid this, pass regex=False:
df_rest = pd.DataFrame({0:['XYZ dd','dd OpenSSL 0.9.4','g 0.9.4'],
1:['0.9.4 OpenSSL dd','dd 0.9','g XYZ'],
2:['OpenSSL t','dd XYZ','OpenSSL 0.9.7']})
print (df_rest)
                  0                 1              2
0            XYZ dd  0.9.4 OpenSSL dd      OpenSSL t
1  dd OpenSSL 0.9.4            dd 0.9         dd XYZ
2           g 0.9.4             g XYZ  OpenSSL 0.9.7
cols = [0,1,2]
m = df_rest[cols].apply(lambda r: (r.str.contains('0.9.4', case=False, regex=False) &
r.str.contains('OpenSSL', case=False, regex=False)))
df = df_rest[m.any(axis=1)]
print (df)
                  0                 1          2
0            XYZ dd  0.9.4 OpenSSL dd  OpenSSL t
1  dd OpenSSL 0.9.4            dd 0.9     dd XYZ
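To see why regex=False matters for a keyword like 0.9.4, here is a small sketch with made-up strings. By default str.contains treats the pattern as a regular expression, where each dot matches any single character.

```python
import pandas as pd

# Hypothetical values: the second one only "looks like" 0.9.4 to a regex.
s = pd.Series(['OpenSSL 0.9.4', 'OpenSSL 0x9y4'])

# Default behaviour: the pattern is a regex, so "." matches any character.
as_regex = s.str.contains('0.9.4')

# Literal matching: the dots must be real dots.
as_literal = s.str.contains('0.9.4', regex=False)

print(as_regex.tolist())    # [True, True]
print(as_literal.tolist())  # [True, False]
```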
EDIT1:
df_rest = pd.DataFrame({0:['XYZ dd','dd OpenSSL 0.9.1','g 0.9.4'],
1:['0.9.2 OpenSSL dd','dd 0.9','g XYZ'],
2:['OpenSSL t','dd XYZ','OpenSSL 0.9.1']})
print (df_rest)
df = pd.read_csv('keywords.txt', names=('a','b'))
print (df)
         a      b
0  OpenSSL  0.9.1
1  OpenSSL  0.9.2
2  OpenSSL  0.9.4
cols = [0,1,2]
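for i, x in df.iterrows():
    m = df_rest[cols].apply(lambda r: (r.str.contains(x['a'], case=False, regex=False) &
                                       r.str.contains(x['b'], case=False, regex=False)))
    # use a separate name so the keyword frame df is not overwritten inside the loop
    df_out = df_rest[m.any(axis=1)]
    # one output file per keyword pair, e.g. OpenSSL_0.9.1.txt
    f = '{0[0]}_{0[1]}.txt'.format((x['a'], x['b']))
    df_out.to_csv(f, index=False, header=False)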
EDIT2:
dfs = []
cols = [0,1,2,3,4,5]
for i, x in dfkey.iterrows():
    m = df_rest[cols].apply(lambda r: (r.str.contains(x['a'], case=False, regex=False) &
                                       r.str.contains(x['b'], case=False, regex=False)))
    # do not overwrite df_rest here, so every keyword pair is matched against the full data
    dfs.append(df_rest[m.any(axis=1)])
pd.concat(dfs).to_csv('text.csv', index=False, header=False)
Advanced Filter for multiple keywords anywhere in a cell
You just put asterisks around the text you want to search for, not <> before it.
So:
*vice*
*health*
*medical*
Etc.
Filter a dataframe column for a keyword, return separate column value (name) from the row where each keyword is found
You could do
list(df[df['words'].str.contains('apple', na=False)]['names'])
resulting in
['a', 'b']
- df['words'].str.contains('apple', na=False) builds a boolean pandas Series for the condition and takes care of any missing values in the column.
- The Series from the previous step is used to filter the original dataframe df.
- In the resulting dataframe, the 'names' column is selected.
- Finally, that column is cast to a list.
Full code:
import io
import pandas as pd
data = """
names words
a apple
b apple
c pear
"""
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
lst = list(df[df['words'].str.contains('apple', na=False)]['names'])
print(lst)
['a', 'b']
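The steps described above can also be written one per line, which makes the role of na=False easier to see. This is only a sketch: the extra row with a missing value is invented for illustration and is not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in 'words'.
df = pd.DataFrame({'names': ['a', 'b', 'c', 'd'],
                   'words': ['apple', 'green apple', np.nan, 'pear']})

# Step 1: boolean Series; na=False turns the missing value into False.
mask = df['words'].str.contains('apple', na=False)

# Steps 2 and 3: filter the rows and select the 'names' column in one go.
selected = df.loc[mask, 'names']

# Step 4: convert the resulting Series to a plain list.
lst = selected.tolist()
print(lst)  # ['a', 'b']
```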
Filtering text from dataframe based on keywords in a list
How to filter a DataFrame by a volatile subset of words?
Dummy data
import numpy as np
import pandas as pd
columns = ['transaction_description', 'value']
data = [
['pac c.misalud conv. unificado', 12320.0],
['cargo seguro proteccion bancaria', 31222.0],
['pac sura cia seguros generales', 8657.0],
['cargo seguro proteccion bancaria', 31222.0],
['pac c.misalud conv. unificado', 12320.0],
['pac sura cia seguros generales', 8657.0],
['cargo seguro proteccion bancaria', 31222.0],
['pac c.misalud conv. unificado', 12320.0],
['pac sura cia seguros generales', 8657.0],
['cargo seguro proteccion bancaria', 31222.0],
['cargo seguro proteccion bancaria', 40222.0]]
df=pd.DataFrame(data, columns=columns)
keywords = [
[('tarifa',), ('mantenimiento',), ('mensual',)],
[('tasa',), ('anual',)],
[('seguro',), ('bancaria',)],
[('seguro',), ('generales',)],
[('mi salud',), ('unific',)]]
Solving
I will use a structure where the words of the sublists are arranged in columns, or to be precise, each word is placed in the list as the only element of a tuple.
Let's vectorize str.__contains__ to make the str1 in str2 check applicable to arrays:
contains = np.vectorize(str.__contains__)
Now, I'll test this function on df["transaction_description"] and the 4th set of keywords, [('seguro',), ('generales',)], for example:
desc = df['transaction_description']
contains(desc, keywords[3])
In this case, we get the following result:
array([[False, True, True, True, False, True, True, False, True, True, True],
[False, False, True, False, False, True, False, False, True, False, False]])
Now, to see if all words of this subset can be found in a description, we apply the method all along the first axis (axis=0) of the previous matrix:
df[contains(desc, keywords[3]).all(axis=0)]
And we obtain these filtered data:
          transaction_description   value
2  pac sura cia seguros generales  8657.0
5  pac sura cia seguros generales  8657.0
8  pac sura cia seguros generales  8657.0
Long story short
contains = np.vectorize(str.__contains__)
desc = df['transaction_description']
contain_all = lambda words: df[contains(desc, words).all(axis=0)]
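As a possible way to run this over every keyword set at once, here is a small sketch on a shortened copy of the dummy data. It assumes a row should be kept when it satisfies at least one keyword set, which the original does not state; the names `per_set` and `matched` are introduced only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'transaction_description': [
        'pac c.misalud conv. unificado',
        'cargo seguro proteccion bancaria',
        'pac sura cia seguros generales',
    ],
    'value': [12320.0, 31222.0, 8657.0],
})

keywords = [
    [('tasa',), ('anual',)],
    [('seguro',), ('bancaria',)],
    [('seguro',), ('generales',)],
]

contains = np.vectorize(str.__contains__)
desc = df['transaction_description'].to_numpy(dtype=object)

# Each keyword set has shape (n_words, 1) and desc has shape (n_rows,), so
# broadcasting yields an (n_words, n_rows) matrix; all(axis=0) collapses it
# to one flag per description: True if every word of the set occurs in it.
per_set = np.array([contains(desc, words).all(axis=0) for words in keywords])

# Assumption: keep a description if it satisfies at least one keyword set.
matched = df[per_set.any(axis=0)]
print(matched)
```

Wrapping each word in a one-element tuple is what gives the keyword set its column shape, so no explicit reshape is needed before broadcasting.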