Python Searching for Partial Matches in a List

How to retrieve partial matches from a list of strings

  • startswith and in both return a Boolean.
  • The in operator is a test of membership; between two strings, it tests whether one is a substring of the other.
  • Partial matching can be performed with a list-comprehension or filter.
  • Using a list-comprehension with in is the fastest implementation tested.
  • If case is not an issue, consider mapping all the words to lowercase.
    • l = list(map(str.lower, l)).
  • Tested with Python 3.10.0.
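
As a minimal sketch of the case-insensitive variant mentioned above: instead of overwriting the list with lowercased words, the comparison below lowercases each word only for the test, so the results keep their original casing. The sample list and search term are hypothetical.

```python
# Hypothetical sample data, not from the original answer.
l = ['Ones', 'TWOS', 'Threes', 'threefold']
wanted = 'THREE'

# Lowercase the search term once, then compare it against a lowercased copy of each word.
wanted_lower = wanted.lower()
result = [v for v in l if wanted_lower in v.lower()]

print(result)
```

This prints ['Threes', 'threefold']. For text beyond plain ASCII, str.casefold() may be a better fit than str.lower().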

filter:

  • Using filter creates a filter object, so list() is used to show all the matching values in a list.
l = ['ones', 'twos', 'threes']
wanted = 'three'

# using startswith
result = list(filter(lambda x: x.startswith(wanted), l))

# using in
result = list(filter(lambda x: wanted in x, l))

print(result)
[out]:
['threes']

list-comprehension

l = ['ones', 'twos', 'threes']
wanted = 'three'

# using startswith
result = [v for v in l if v.startswith(wanted)]

# using in
result = [v for v in l if wanted in v]

print(result)
[out]:
['threes']

Which implementation is faster?

  • Tested in Jupyter Lab using the words corpus from nltk v3.6.5, which has 236736 words
  • Words with 'three'
    • ['three', 'threefold', 'threefolded', 'threefoldedness', 'threefoldly', 'threefoldness', 'threeling', 'threeness', 'threepence', 'threepenny', 'threepennyworth', 'threescore', 'threesome']
from nltk.corpus import words

%timeit list(filter(lambda x: x.startswith(wanted), words.words()))
[out]:
64.8 ms ± 856 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit list(filter(lambda x: wanted in x, words.words()))
[out]:
54.8 ms ± 528 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit [v for v in words.words() if v.startswith(wanted)]
[out]:
57.5 ms ± 634 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit [v for v in words.words() if wanted in v]
[out]:
50.2 ms ± 791 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
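
Outside Jupyter, a similar comparison can be sketched with the standard-library timeit module. The word list here is a small synthetic stand-in (an assumption, not the nltk corpus), and it is built once, outside the timed statements, so only the matching is measured. The timings above also include the call to words.words(), so absolute numbers will differ.

```python
import timeit

# Synthetic stand-in for a large word list (assumption: not the nltk corpus).
word_list = [f"word{i}" for i in range(20_000)] + ['three', 'threefold', 'threesome']
wanted = 'three'

# The four implementations compared above, each wrapped in a zero-argument callable.
candidates = {
    'filter + startswith': lambda: list(filter(lambda x: x.startswith(wanted), word_list)),
    'filter + in': lambda: list(filter(lambda x: wanted in x, word_list)),
    'comprehension + startswith': lambda: [v for v in word_list if v.startswith(wanted)],
    'comprehension + in': lambda: [v for v in word_list if wanted in v],
}

for name, func in candidates.items():
    # Run each candidate 10 times per measurement and keep the best of 3 measurements.
    best = min(timeit.repeat(func, number=10, repeat=3))
    print(f"{name}: {best / 10 * 1000:.2f} ms per loop")
```

The ranking may vary with the data, the Python version, and the machine, so it is worth re-running on your own list before relying on it.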

Python matching partial strings in list elements between two lists

You don't want to remove elements from the list you are iterating over. Instead, you can add a condition that checks whether the matched word has already been added to your output list.

It should be something like:

lst = []
for i in match:
    has_match = False
    for j in data:
        # match on the first word of i
        if i.split()[0] in j:
            has_match = True
            print(i, j)
            if j not in lst:
                lst.append(j)
        # match on the first two words of i, if it has more than one word
        if len(i.split()) > 1:
            k = ' '.join(i.split()[:2])
            if k in j:
                has_match = True
                print(i, j)
                if j not in lst:
                    lst.append(j)
    if not has_match:
        lst.append(i + ' - not found')

I also removed the break keywords, since they may stop your code from finding matches in multiple strings in data. Using a boolean should do the work. Let us know if you have further questions.
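
As a rough sketch of how this could be packaged and tried out, the function below wraps the same idea and returns the list instead of printing. The function name find_partial_matches and the sample match and data lists are hypothetical, chosen only for illustration.

```python
def find_partial_matches(match, data):
    """Collect entries of data that contain the first word (or first two words) of an entry in match."""
    lst = []
    for i in match:
        parts = i.split()
        has_match = False
        for j in data:
            # Candidate substrings: the first word, and the first two words joined by a space.
            keys = [parts[0], ' '.join(parts[:2])]
            if any(k in j for k in keys):
                has_match = True
                if j not in lst:  # skip duplicates instead of removing items from data
                    lst.append(j)
        if not has_match:
            lst.append(i + ' - not found')
    return lst


# Hypothetical inputs for illustration only.
match = ['apple pie recipe', 'banana bread', 'cherry']
data = ['green apple', 'apple pie', 'banana split']

print(find_partial_matches(match, data))
```

With these inputs the result is ['green apple', 'apple pie', 'banana split', 'cherry - not found']. The sketch assumes every entry in match contains at least one word; an empty string would raise an IndexError.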

Finding partial string matches between list and elements of list of lists

I want to suggest a solution to your problem.

Firstly, we create a function that checks whether a word is a substring of any string in another list, and returns the index of the first such string:

def is_substring_of_element_in_list(word, list_of_str):
    if len(list_of_str) == 0:
        return (False, -1)
    is_sub = any(word in s for s in list_of_str)
    if is_sub:
        # index of the first string that contains word
        ix = [word in s for s in list_of_str].index(True)
    else:
        ix = -1
    return is_sub, ix

Now we can use this function to check whether each word from the test list is a substring of a string in your list. Note that each string can be used only once, so we need to remove a string from the list once a given word has been matched to it.

def is_list_is_in_mylist(t, mylist):
    mylist_now = sorted(mylist, key=len)
    test_now = sorted(t, key=len)
    counter = 0
    for word in t:
        is_sub, index = is_substring_of_element_in_list(word, mylist_now)
        if is_sub:
            mylist_now.pop(index)
            test_now.remove(word)
            counter += 1
    if counter == len(t) and counter == len(mylist):
        print("success")
    else:
        print("fail")

Note that we need to sort the strings in the list by length to avoid mistakes caused by the order of the words. For example, if mylist = ['foo', 'f'] and the test list is ['f', 'foo'], then without sorting, 'f' is matched against 'foo' first and removes it, leaving nothing for 'foo' to match, so the check fails even though a valid pairing exists.

Now, you can iterate over your test with simple for loop:

for t in test:
    is_list_is_in_mylist(t, mylist)
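
For experimenting with the idea, here is a compact, self-contained variant that returns a Boolean instead of printing. It is a sketch rather than the code above: the name all_words_matched and the sample mylist and test values are hypothetical. It keeps the same greedy strategy, so it does not guarantee finding a pairing in every case where one exists.

```python
def all_words_matched(t, mylist):
    """Return True if the lists have equal length and every word in t is a substring of a distinct element of mylist."""
    if len(t) != len(mylist):
        return False
    remaining = sorted(mylist, key=len)  # try the shortest candidates first
    for word in sorted(t, key=len):
        # Index of the first remaining element that contains word, or None if there is none.
        index = next((ix for ix, s in enumerate(remaining) if word in s), None)
        if index is None:
            return False
        remaining.pop(index)  # each element may be used only once
    return True


# Hypothetical data for illustration only.
mylist = ['foo', 'f']
test = [['f', 'foo'], ['foo', 'f'], ['f', 'bar']]

for t in test:
    print(t, all_words_matched(t, mylist))
```

The first two test lists give True regardless of their order, and the third gives False because 'bar' is not a substring of any remaining element.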

Finding partial matches in a list of lists in Python

My guess is, you're just not matching the second condition properly e.g. if you do something like this:

'127.0.0.1' in i and 'Misconfiguration' in i

but i looks like:

['2014', '127.0.0.1', '127', 'DNS sever Misconfiguration']

then '127.0.0.1' will be in i, but 'Misconfiguration' won't, because i is a list, and in on a list tests for an exact match against whole elements, whereas what you're looking for is a substring of an element of i. If the position of that element is consistent, you can do something like:

'127.0.0.1' in i and 'Misconfiguration' in i[3]

or if they aren't, and you have to substring check all entries:

'127.0.0.1' in i and any('Misconfiguration' in x for x in i)

should do it. That will substring check each item in i for your search term.
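
To make the difference concrete, here is a small sketch that applies both conditions to a list of rows. The first row is the one shown above; the other rows and the name rows are made up for illustration.

```python
# The first row comes from the example above; the other rows are hypothetical.
rows = [
    ['2014', '127.0.0.1', '127', 'DNS sever Misconfiguration'],
    ['2014', '10.0.0.5', '10', 'Misconfiguration'],
    ['2015', '127.0.0.1', '127', 'Timeout'],
]

# List membership is an exact match on whole elements, so no row qualifies here.
exact = [i for i in rows if '127.0.0.1' in i and 'Misconfiguration' in i]

# Substring check against every element of the row.
partial = [i for i in rows if '127.0.0.1' in i and any('Misconfiguration' in x for x in i)]

print(exact)
print(partial)
```

Here exact is an empty list, while partial contains only the first row.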


