Find Substring in String But Only If Whole Words

Matching whole words using in in Python

If you want to do a plain match of just one word, use ==:

'abc' == 'abc123' # False

If you're doing 'abc' in ['cde','fdabc','abc123'], that returns False anyway:

'abc' in ['cde','fdabc','abc123'] # False

The reason 'abc' in 'abc123' returns True, from the docs:

For the Unicode and string types, x in y is true if and only if x is a substring of y. An equivalent test is y.find(x) != -1.

So to compare against a single string, use ==. To test membership in a collection of strings, use in (for example, 'abc' in ['abc123'] is False), since in behaves as your intuition expects when y is a list or another collection.
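The three behaviours described above can be checked side by side. This is a small sketch; the variable names are only illustrative.

```python
word = 'abc'

# On a string, `in` is a substring test.
in_string = word in 'abc123'

# On a list, `in` compares the word against each element with ==.
in_list = word in ['cde', 'fdabc', 'abc123']
in_list_exact = word in ['cde', 'abc']

# Equality only succeeds when the whole string matches.
equal = word == 'abc123'

print(in_string, in_list, in_list_exact, equal)
# True False True False
```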

PySpark, find substring as whole word(s)

This probably still has edge cases but I hope you get some ideas.
I would use regexp_extract to match the candidate against the sentence.

First, I convert the candidate to a regex (i.e., replace each space with \s), then use regexp_extract with word boundaries (\b) around it.

from pyspark.sql import functions as F

df = (df.withColumn('regex', F.regexp_replace(F.col('candidate'), ' ', r'\\s'))
        .withColumn('match', F.expr(r"regexp_extract(sentence, concat('\\b', regex, '\\b'), 0)")))

Result

+-------------+-----------------------+--------------+-------------+
|    candidate|               sentence|         regex|        match|
+-------------+-----------------------+--------------+-------------+
|           su|  We saw the survivors.|            su|             |
|Roman emperor|He was a Roman emperor.|Roman\semperor|Roman emperor|
+-------------+-----------------------+--------------+-------------+
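To try the same idea without a Spark session, here is a rough plain-Python analogue using the re module. It is not the Spark code: extract_whole is a hypothetical helper, and it additionally escapes regex metacharacters in the candidate with re.escape, which the Spark version above does not do.

```python
import re

def extract_whole(candidate, sentence):
    """Return the matched text, or '' when the candidate is not a whole-word match."""
    # Escape each word of the candidate, then join the words with \s so that
    # any single whitespace character can stand where the candidate had a space.
    pattern = r'\s'.join(re.escape(part) for part in candidate.split(' '))
    # Wrap the pattern in word boundaries so partial words are rejected.
    match = re.search(r'\b' + pattern + r'\b', sentence)
    return match.group(0) if match else ''

print(repr(extract_whole('su', 'We saw the survivors.')))
print(repr(extract_whole('Roman emperor', 'He was a Roman emperor.')))
```

The first call should give an empty string and the second the full phrase, mirroring the match column in the table.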

Check for (whole only) words in string

You'd be better off splitting your sentence into words and counting the words, not the substrings:

textt="When I was One I had just begun When I was Two I was nearly new"
wwords=['i', 'was', 'three', 'near']
text_words = textt.lower().split()
result = {w:text_words.count(w) for w in wwords}

print(result)

prints:

{'three': 0, 'i': 4, 'near': 0, 'was': 3}

If the text contains punctuation, you're better off using a regular expression to split the string on non-alphanumeric characters:

import re

textt="When I was One, I had just begun.I was Two when I was nearly new"

wwords=['i', 'was', 'three', 'near']
text_words = re.split(r"\W+",textt.lower())
result = {w:text_words.count(w) for w in wwords}

result:

{'was': 3, 'near': 0, 'three': 0, 'i': 4}

(another alternative is to use findall on word characters: text_words = re.findall(r"\w+",textt.lower()))

Now if your list of "important" words is big, maybe it's better to count all the words, and filter afterwards, using the classical collections.Counter:

import collections

text_words = collections.Counter(re.split(r"\W+",textt.lower()))
result = {w:text_words[w] for w in wwords}  # indexing a Counter gives 0 for missing words; .get(w) would give None
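Putting the pieces together, a self-contained sketch of the count-everything-then-filter approach might look like this. The function name count_whole_words is hypothetical, and it uses the findall variant mentioned above rather than re.split.

```python
import collections
import re

def count_whole_words(text, words):
    # Count every word in the text once, case-insensitively.
    counts = collections.Counter(re.findall(r"\w+", text.lower()))
    # Indexing a Counter returns 0 for words that never occurred.
    return {w: counts[w] for w in words}

textt = "When I was One, I had just begun.I was Two when I was nearly new"
wwords = ['i', 'was', 'three', 'near']
result = count_whole_words(textt, wwords)
print(result)
# {'i': 4, 'was': 3, 'three': 0, 'near': 0}
```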

Python - string match only whole words

Here is one way:

re.search(r'\b' + re.escape(' '.join(query)) + r'\b', ' '.join(line)) is not None
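The one-liner above can be wrapped in a function for easier testing. This sketch assumes, as the expression suggests, that query and line are both lists of words; phrase_in_line is a hypothetical name.

```python
import re

def phrase_in_line(query, line):
    # Join the query words into one phrase and escape any regex metacharacters.
    pattern = r'\b' + re.escape(' '.join(query)) + r'\b'
    # Search the re-joined line; the word boundaries reject partial-word hits.
    return re.search(pattern, ' '.join(line)) is not None

found_phrase = phrase_in_line(['Roman', 'emperor'], ['He', 'was', 'a', 'Roman', 'emperor.'])
found_partial = phrase_in_line(['su'], ['We', 'saw', 'the', 'survivors.'])
print(found_phrase, found_partial)
# True False
```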

How to search for specific whole words within a string, via SQL, compatible with both HIVE/IMPALA

You can add word boundary \\b to match only exact words:

rlike '(?i)\\bFECHADO\\b|\\bCIERRE\\b|\\bCLOSED\\b'

(?i) makes the match case-insensitive, so there is no need to use UPPER.

The last alternative in your regex pattern is REVISTO. NORMAL.

If the dots in it are meant to be literal dots, escape them as \\.

Like this: REVISTO\\. NORMAL\\.

A dot in a regexp matches any character, so it must be escaped with two backslashes to match a literal dot.

The regex above works in Hive. Unfortunately, I have no Impala instance to test it on.
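If no Hive or Impala instance is at hand, the pattern logic can be sanity-checked locally. The sketch below uses Python's re module, which is an assumption of convenience: it is not the engine Hive or Impala uses, although word boundaries, (?i), and escaped dots should behave the same way for these simple cases. In a Python raw string a single backslash is enough, whereas the SQL string literal needs two. The sample strings are made up.

```python
import re

# The alternatives from the rlike pattern, plus the escaped last alternative.
pattern = re.compile(r'(?i)\bFECHADO\b|\bCIERRE\b|\bCLOSED\b|REVISTO\. NORMAL\.')

# The same last alternative with unescaped dots, for comparison.
unescaped = re.compile(r'(?i)REVISTO. NORMAL.')

whole_word = pattern.search('ticket closed today') is not None
partial_word = pattern.search('enclosed form') is not None
literal_dots = pattern.search('Revisto. Normal.') is not None
any_char_escaped = pattern.search('REVISTOS NORMALES') is not None
any_char_unescaped = unescaped.search('REVISTOS NORMALES') is not None

print(whole_word, partial_word, literal_dots, any_char_escaped, any_char_unescaped)
# True False True False True
```

The last two values show why the dots need escaping: the unescaped pattern also accepts text in which other characters stand where the dots should be.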

Check if a word is in a string in Python

What is wrong with:

if word in mystring:
    print('success')
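For the whole-word case, the catch with a plain in test is that it also succeeds on partial words. A minimal sketch of the difference, using a hypothetical contains_word helper built on \b word boundaries:

```python
import re

def contains_word(word, text):
    # \b on both sides means the word must not be part of a longer word.
    return re.search(r'\b' + re.escape(word) + r'\b', text) is not None

mystring = 'We saw the survivors.'

substring_hit = 'su' in mystring                      # substring test
whole_word_hit = contains_word('su', mystring)        # whole-word test
real_word_hit = contains_word('survivors', mystring)  # whole-word test

print(substring_hit, whole_word_hit, real_word_hit)
# True False True
```

If substring matches are acceptable, the plain in test remains the simplest option.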

