Filter a Column Which Contains Several Keywords

Filter a column which contains several keywords

grep can use | as an "or", so why not paste your filters together with | as the separator:

dfilter <- df1[grep(paste0(filter1, collapse = "|"), df1$type),]
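
The same "join the keywords with |" idea carries over to pandas via str.contains. A minimal sketch, not part of the original answer, using a hypothetical df1 with a type column and a keyword list filter1 mirroring the R example:

import pandas as pd

# hypothetical inputs mirroring the R example above
df1 = pd.DataFrame({'type': ['apple pie', 'banana split', 'cherry cake']})
filter1 = ['apple', 'cherry']

# join the keywords into one alternation pattern and keep the matching rows
pattern = '|'.join(filter1)
dfilter = df1[df1['type'].str.contains(pattern)]
print(dfilter)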

Filter a column with multiple keywords in Django

You can do it like this using Q():

from django.db.models import Q

query = Q()

for k in datas["keywords"]:
    query |= Q(task_name__contains=k)

tasks = Task.objects.filter(query)
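
Equivalently, the per-keyword Q objects can be folded together with functools.reduce instead of starting from an empty Q(); a minimal sketch, assuming the same datas["keywords"] input and Task model as above:

from functools import reduce
from operator import or_

from django.db.models import Q

# fold all per-keyword Q objects into one OR expression
# (assumes datas["keywords"] is non-empty, otherwise reduce() raises TypeError)
query = reduce(or_, (Q(task_name__contains=k) for k in datas["keywords"]))
tasks = Task.objects.filter(query)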

Filter Data by multiple keywords

I think you can create a separate mask for each keyword and then combine them by chaining with &; to keep rows where at least one cell is True, use DataFrame.any:

import pandas as pd

df_rest = pd.DataFrame({0:['OpenSSL XYZ dd','dd OpenSSL','g OpenSSL'],
                        1:['CVE-2017-XX OpenSSL dd','dd OpenSSL','g XYZ'],
                        2:['OpenSSL t','dd XYZ','g CVE-2017-XX XYZ OpenSSL']})

cols = [0,1,2]
m1 = df_rest[cols].apply(lambda r: r.str.contains('OpenSSL', case=False))
print (m1)
       0      1      2
0   True   True   True
1   True   True  False
2   True  False   True

m2 = df_rest[cols].apply(lambda r: r.str.contains('XYZ', case=False))
print (m2)
       0      1      2
0   True  False  False
1  False  False   True
2  False   True   True

m3 = df_rest[cols].apply(lambda r: r.str.contains('CVE-2017-XX', case=False))
print (m3)
       0      1      2
0  False   True  False
1  False  False  False
2  False  False   True

print (m1 & m2)
       0      1      2
0   True  False  False
1  False  False  False
2  False  False   True

print ((m1 & m2).any(axis=1))
0     True
1    False
2     True
dtype: bool

df = df_rest[(m1 & m2).any(axis=1)]
print (df)
                0                       1                          2
0  OpenSSL XYZ dd  CVE-2017-XX OpenSSL dd                  OpenSSL t
2       g OpenSSL                   g XYZ  g CVE-2017-XX XYZ OpenSSL
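
If there are more than two or three keywords, the per-keyword masks can be built in a loop and folded together instead of writing m1 & m2 & ... by hand. A minimal sketch, assuming the same df_rest and cols as above and a hypothetical keywords list:

from functools import reduce

keywords = ['OpenSSL', 'XYZ']   # hypothetical keyword list

# one mask per keyword, each testing every cell of the selected columns
masks = [df_rest[cols].apply(lambda r: r.str.contains(k, case=False))
         for k in keywords]

# cellwise AND of all masks: a cell is True only if it contains every keyword
m_all = reduce(lambda a, b: a & b, masks)

# keep rows where at least one cell contains all the keywords
df_filtered = df_rest[m_all.any(axis=1)]
print(df_filtered)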

EDIT:

It is possible that some keywords are interpreted as regex. To avoid that, use regex=False:

df_rest = pd.DataFrame({0:['XYZ dd','dd OpenSSL 0.9.4','g 0.9.4'],
                        1:['0.9.4 OpenSSL dd','dd 0.9','g XYZ'],
                        2:['OpenSSL t','dd XYZ','OpenSSL 0.9.7']})

print (df_rest)
                  0                 1              2
0            XYZ dd  0.9.4 OpenSSL dd      OpenSSL t
1  dd OpenSSL 0.9.4            dd 0.9         dd XYZ
2           g 0.9.4             g XYZ  OpenSSL 0.9.7

cols = [0,1,2]
m = df_rest[cols].apply(lambda r: (r.str.contains('0.9.4', case=False, regex=False) &
                                   r.str.contains('OpenSSL', case=False, regex=False)))

df = df_rest[m.any(axis=1)]
print (df)
                  0                 1          2
0            XYZ dd  0.9.4 OpenSSL dd  OpenSSL t
1  dd OpenSSL 0.9.4            dd 0.9     dd XYZ
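
An alternative to regex=False is to escape each keyword with re.escape, which keeps the search as a regex but treats the dots in '0.9.4' literally; a minimal sketch with the same df_rest and cols:

import re

# re.escape('0.9.4') yields the pattern 0\.9\.4, so the dots are matched literally
m = df_rest[cols].apply(lambda r: (r.str.contains(re.escape('0.9.4'), case=False) &
                                   r.str.contains(re.escape('OpenSSL'), case=False)))

df = df_rest[m.any(axis=1)]
print(df)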

EDIT1:

df_rest = pd.DataFrame({0:['XYZ dd','dd OpenSSL 0.9.1','g 0.9.4'],
                        1:['0.9.2 OpenSSL dd','dd 0.9','g XYZ'],
                        2:['OpenSSL t','dd XYZ','OpenSSL 0.9.1']})

print (df_rest)

df = pd.read_csv('keywords.txt', names=('a','b'))
print (df)
         a      b
0  OpenSSL  0.9.1
1  OpenSSL  0.9.2
2  OpenSSL  0.9.4

cols = [0,1,2]
for i, x in df.iterrows():
    m = df_rest[cols].apply(lambda r: (r.str.contains(x['a'], case=False, regex=False) &
                                       r.str.contains(x['b'], case=False, regex=False)))

    # write the rows matching this keyword pair to a file such as OpenSSL_0.9.1.txt
    df = df_rest[m.any(axis=1)]
    f = '{0[0]}_{0[1]}.txt'.format((x['a'], x['b']))
    df.to_csv(f, index=False, header=False)

EDIT2:

dfs = []
for i, x in dfkey.iterrows():

    cols = [0,1,2,3,4,5]
    m = df_rest[cols].apply(lambda r: (r.str.contains(x['a'], case=False, regex=False) &
                                       r.str.contains(x['b'], case=False, regex=False)))

    # each pass narrows df_rest, keeping only rows that also match this keyword pair
    df_rest = df_rest[m.any(axis=1)]

    dfs.append(df_rest)

pd.concat(dfs).to_csv('text.csv', index=False, header=False)

Advanced Filter for multiple keywords anywhere in a cell

Oh, you just put asterisks around the text you want to search for, not <> in front of it.
So

*vice*
*health*
*medical*

Etc.

Filter a dataframe column for a keyword, return separate column value (name) from the row where each keyword is found

You could do

list(df[df['words'].str.contains('apple', na=False)]['names'])

resulting in

['a', 'b']
  1. df['words'].str.contains('apple', na=False) builds a boolean pandas Series for the condition, taking care of any missing values in the column.
  2. The Series from the previous step is used to filter the original dataframe df.
  3. From the filtered dataframe, the 'names' column is selected.
  4. That column is cast to a list.

Full code:

import io
import pandas as pd
data = """
names words
a apple
b apple
c pear
"""
df = pd.read_csv(io.StringIO(data), sep=r'\s+')

lst = list(df[df['words'].str.contains('apple')]['names'])

print(lst)

['a', 'b']
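
If you have several keywords and want the matching names for each one separately, you can wrap the same expression in a dict comprehension; a minimal sketch, assuming the df built above and a hypothetical keyword list:

keywords = ['apple', 'pear']   # hypothetical keyword list

names_by_keyword = {
    k: list(df.loc[df['words'].str.contains(k, na=False), 'names'])
    for k in keywords
}
print(names_by_keyword)   # {'apple': ['a', 'b'], 'pear': ['c']}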

Filtering text from dataframe based on keywords in a list

How do you filter a DataFrame by a varying subset of words?

Dummy data

import numpy as np
import pandas as pd

columns = ['transaction_description', 'value']
data = [
    ['pac c.misalud conv. unificado', 12320.0],
    ['cargo seguro proteccion bancaria', 31222.0],
    ['pac sura cia seguros generales', 8657.0],
    ['cargo seguro proteccion bancaria', 31222.0],
    ['pac c.misalud conv. unificado', 12320.0],
    ['pac sura cia seguros generales', 8657.0],
    ['cargo seguro proteccion bancaria', 31222.0],
    ['pac c.misalud conv. unificado', 12320.0],
    ['pac sura cia seguros generales', 8657.0],
    ['cargo seguro proteccion bancaria', 31222.0],
    ['cargo seguro proteccion bancaria', 40222.0]]

df = pd.DataFrame(data, columns=columns)

keywords = [
    [('tarifa',), ('mantenimiento',), ('mensual',)],
    [('tasa',), ('anual',)],
    [('seguro',), ('bancaria',)],
    [('seguro',), ('generales',)],
    [('mi salud',), ('unific',)]]

Solving

I will use a structure where the words of each sublist are arranged as a column; to be precise, each word is placed in the sublist as the only element of a tuple, so that the resulting (n_words, 1) keyword array broadcasts against the description column.

Let's vectorize str.__contains__ to make the str1 in str2 code applicable to arrays:

contains = np.vectorize(str.__contains__)

Now, I'll test this function on df["transaction_description"] and the 4th set of keywords [('seguro',), ('generales',)] for example:

desc = df['transaction_description']
contains(desc, keywords[3])

In this case, we get the following result:

array([[False,  True,  True,  True, False,  True,  True, False,  True,  True,  True],
       [False, False,  True, False, False,  True, False, False,  True, False, False]])

Now, to see whether all words of this subset can be found in a description, we apply the all method along the first axis (axis=0) of the previous matrix:

df[contains(desc, keywords[3]).all(axis=0)]

And we obtain these filtered data:

          transaction_description   value
2   pac sura cia seguros generales  8657.0
5   pac sura cia seguros generales  8657.0
8   pac sura cia seguros generales  8657.0

Long story short

contains = np.vectorize(str.__contains__)
desc = df['transaction_description']
contain_all = lambda words: df[contains(desc, words).all(axis=0)]
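
For example, assuming the df and keywords defined above, the contain_all shortcut can be applied to any of the keyword groups:

# rows whose description contains both 'seguro' and 'bancaria'
print(contain_all(keywords[2]))

# rows whose description contains both 'seguro' and 'generales'
print(contain_all(keywords[3]))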
