Python - Remove Any Element from a List of Strings That Is a Substring of Another Element

First building block: substring.

You can use the in operator to test whether one string occurs inside another:

>>> 'rest' in 'resting'
True
>>> 'sing' in 'resting'
False

Next, we're going to choose the naive method of creating a new list. We'll go through the items one by one and add each to the new list only if it is not a substring of some other element.

def substringSieve(string_list):
    out = []
    for s in string_list:
        # keep s only if no other (different) element contains it
        if not any(s in r for r in string_list if s != r):
            out.append(s)
    return out
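As a quick check of the behaviour, here is a self-contained sketch of the same idea run on a small made-up input. The name remove_substrings and the sample list are illustrative only, not part of the original answer. Note that because of the s != r test, exact duplicates never eliminate each other, so both copies of a repeated string survive unless a longer string contains them.

```python
def remove_substrings(strings):
    """Return the strings that are not contained in a different element."""
    kept = []
    for s in strings:
        # s != r skips the element itself (and any exact duplicate of it)
        if not any(s in r for r in strings if s != r):
            kept.append(s)
    return kept


# Hypothetical sample data for illustration
sample = ["rest", "resting", "sing", "rest", "look", "looked"]
print(remove_substrings(sample))      # ['resting', 'sing', 'looked']
print(remove_substrings(["a", "a"]))  # ['a', 'a'] (duplicates are both kept)
```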

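You can speed it up by sorting by length, longest first, so that each string only has to be compared against the strings already kept (after all, a longer string can never be a substring of a shorter one, and an equal-length string only if the two are identical). Be aware that this version sorts the input list in place, returns the result in length order, and keeps only one copy of a duplicated string: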

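def substringSieve(string_list):
    # longest strings first (sorts the caller's list in place)
    string_list.sort(key=len, reverse=True)
    out = []
    for s in string_list:
        # only strings already kept (same length or longer) can contain s
        if not any(s in o for o in out):
            out.append(s)
    return out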
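The sorted version above reorders the list passed in (list.sort works in place) and returns the survivors longest-first. If the original order matters, one possible variation is sketched below: it works on a sorted copy and then re-emits the survivors in their original positions. The name substring_sieve_stable is made up for this example.

```python
def substring_sieve_stable(strings):
    """Drop strings contained in another element, keeping the original order."""
    # Visit the longest strings first without modifying the caller's list.
    by_length = sorted(strings, key=len, reverse=True)
    kept = []
    for s in by_length:
        # Only already-kept strings (same length or longer) can contain s.
        if not any(s in k for k in kept):
            kept.append(s)

    # Re-emit the survivors in their original order, once each.
    remaining = set(kept)
    result = []
    for s in strings:
        if s in remaining:
            result.append(s)
            remaining.discard(s)
    return result


data = ["rest", "resting", "sing", "look", "looked"]
print(substring_sieve_stable(data))  # ['resting', 'sing', 'looked']
print(data)                          # unchanged
```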

Remove items in a list based on a list of substrings in another list in python3.x

Using a regular expression, for example:

import re

list1 = ['lunch time', 'sandwich shop', 'starts at noon', 'grocery store']
list2 = ['lunch', 'noon']
# re.escape keeps regex metacharacters in list2 from being read as pattern syntax
pattern = re.compile("|".join(map(re.escape, list2)))
print([i for i in list1 if not pattern.search(i)])

Output:

['sandwich shop', 'grocery store']
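Since the substrings here are plain text rather than patterns, the same filter can also be written without a regular expression, using the in operator together with any. This is an alternative sketch, not the approach of the answer above, and drop_containing is a name chosen for illustration.

```python
def drop_containing(items, substrings):
    """Keep the items that contain none of the given substrings."""
    return [item for item in items
            if not any(sub in item for sub in substrings)]


list1 = ['lunch time', 'sandwich shop', 'starts at noon', 'grocery store']
list2 = ['lunch', 'noon']
print(drop_containing(list1, list2))  # ['sandwich shop', 'grocery store']

# Characters such as '.' are matched literally, with no escaping needed
print(drop_containing(['a.b', 'axb'], ['a.b']))  # ['axb']
```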

Removing substrings in list of list of strings, maintain order - Python

Instead of removing elements from a list in place, why not build a new list that matches your requirements? That is safer than modifying a list while you iterate over it.

# method to filter out substrings
def substr_in_list(elem, lst):
    for s in lst:
        # elem is contained in a different string of the same list
        if elem != s and elem in s:
            return True
    return False

words = [[j for j in i if not substr_in_list(j, i)] for i in words]

Output:

[['gamma_ray_bursts', 'merger', 'death', 'throes', 'magnetic_flares', 'neutrino_antineutrino', 'objections', 'double_neutron_star', 'parker_instability', 'positrons'], ['dot', 'gravitational_lensing', 'splittings', 'limits', 'amplifications', 'time_delays', 'extracting_information', 'fix', 'distant_quasars'], ['recoil', 'gamma_ray_bursts', 'neutron_stars', 'jennings', 'possible_origins', 'birthplaces', 'disjoint', 'arrival_directions'], ['sn_sn', 'type_ii_supernovae', 'distances', 'dilution', 'extinction', 'extragalactic_distance_scale', 'expanding_photosphere', 'photospheres', 'supernovae_sn', 'span_wide_range'], ['photon_pair', 'high_energy', 'gamma_ray_burst', 'optical_depth', 'absorbing_medium', 'implications', 'problem', 'annihilation_radiation', 'emergent_spectrum', 'limit', 'radiation_transfer', 'collimation', 'regions']]
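The input list is not shown above, so here is a self-contained run of the same approach on a small made-up list of lists. The sample words are invented for illustration, and is_substring_of_other is just a compact equivalent of the helper, written with any. Order within each inner list is preserved because the comprehension walks each list front to back.

```python
def is_substring_of_other(elem, lst):
    """True if elem is contained in a different string of lst."""
    return any(elem != s and elem in s for s in lst)


# Hypothetical input, not the data from the original question
words = [
    ["gamma_ray", "gamma_ray_bursts", "merger", "bursts"],
    ["limit", "limits", "fix"],
]

filtered = [[w for w in group if not is_substring_of_other(w, group)]
            for group in words]
print(filtered)  # [['gamma_ray_bursts', 'merger'], ['limits', 'fix']]
```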

Python: Remove Strings in a List that are contained by at least one other String in the same List

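A reasonably optimized function with two nested loops that skips elements already marked for deletion, which saves a lot of loop iterations. Note that, as written, whenever one element contains another it marks the containing (longer) string for deletion and keeps the shorter one: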

def filterlist(l):
    # keep track of elements which will be deleted
    deletelist = [False for _ in l]

    for i, el in enumerate(l):
        # already in deletelist, jump right to the next el
        if deletelist[i]:
            continue

        for j, el2 in enumerate(l):
            # comparing item to itself or el2 already in deletelist?
            # jump to next el2
            if i == j or deletelist[j]:
                continue

            # el is a substring of el2: mark el2 for deletion
            if el in el2:
                deletelist[j] = True

            # also check the other way around (el contains el2);
            # will save loop iterations later
            elif el2 in el:
                deletelist[i] = True
                break  # causes jump to next el

    # create new list, keep elements that are not in deletelist
    return [el for i, el in enumerate(l) if not deletelist[i]]

Usually built-in functions are faster, so let's compare it to Ed Ward's solution:

# result of Ed Ward's solution using timeit:
100000 loops, best of 10: 5.38 usec per loop

# filterlist function with loops using timeit:
100000 loops, best of 10: 4.42 usec per loop

Interesting, but to get a really representative result, you should run timeit with a larger element list.
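A rough way to time a candidate on a larger list is sketched below with the standard timeit module. The function sieve is only a stand-in written for this example (it is not Ed Ward's solution, which is not reproduced here), and the random test data, list size, and repeat count are arbitrary assumptions; swap in whichever functions you want to compare.

```python
import random
import string
import timeit


def sieve(strings):
    """Stand-in candidate: keep strings not contained in a different element."""
    return [s for s in strings if not any(s in r for r in strings if s != r)]


def make_words(n, seed=0):
    """Generate n pseudo-random lowercase strings of length 2 to 8."""
    rng = random.Random(seed)
    return ["".join(rng.choices(string.ascii_lowercase, k=rng.randint(2, 8)))
            for _ in range(n)]


words = make_words(500)
runs = 5
seconds = timeit.timeit(lambda: sieve(words), number=runs)
print(f"sieve on {len(words)} strings: {seconds / runs:.4f} s per call")
```

Timings depend heavily on the data (string lengths, how many elements really are substrings of others), so it is worth benchmarking with input that resembles your own.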

Find and remove some substrings from a long list of strings in Python

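Create a string with all the special characters you'd like to remove, and strip them off the right side. Keep in mind that str.rstrip treats its argument as a set of characters, not as whole substrings, so any trailing run of those characters is removed: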

strings = ['short', 'club', 'edit', 'post\C2', 'le\C3', 'lundi', 'janvier', '2008']
special = ''.join(['\C2','\C3','\E2']) # see note

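Note at this point that \ is a special character in string literals and you should escape it whenever you use it, to avoid ambiguity (sequences such as \C are not recognized escapes, and recent Python versions warn about them). You can also simply write a string literal rather than using str.join: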

special = '\\C2\\C3\\E2' # that's better

strings[:] = [item.rstrip(special) for item in strings]
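Because rstrip works on a set of characters, it can remove more than intended: with the character set above, a value such as '2002' loses its final '2'. If the goal is to remove whole suffixes only, a sketch along the following lines may be closer to what is wanted. The helper strip_suffixes and the extra sample value '2002' are assumptions added for illustration.

```python
def strip_suffixes(text, suffixes):
    """Remove at most one trailing suffix from text (first match wins)."""
    for suffix in suffixes:
        # skip empty suffixes: text[:-0] would wrongly return ''
        if suffix and text.endswith(suffix):
            return text[:-len(suffix)]
    return text


suffixes = ['\\C2', '\\C3', '\\E2']
strings = ['short', 'club', 'post\\C2', 'le\\C3', 'janvier', '2008', '2002']

print([strip_suffixes(s, suffixes) for s in strings])
# ['short', 'club', 'post', 'le', 'janvier', '2008', '2002']

# For comparison, the character-set behaviour of rstrip:
print('2002'.rstrip('\\C2\\C3\\E2'))  # 200
```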

