Matching whole words using in in Python
If you want to do a plain match of just one word, use ==:
'abc' == 'abc123' # False
If you're doing 'abc' in ['cde','fdabc','abc123'], that returns False anyway:
'abc' in ['cde','fdabc','abc123'] # False
The reason 'abc' in 'abc123' returns True, from the docs:
For the Unicode and string types, x in y is true if and only if x is a substring of y. An equivalent test is y.find(x) != -1.
So for comparing against a single string, use ==. When checking against a collection of strings, in can be used (you could also do 'abc' in ['abc123']), since in compares against each element for equality, as your intuition expects, when y is a list or another collection.
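The difference between the two behaviours of in can be seen side by side in a minimal sketch. The variable names here are illustrative only and do not come from the original question.

```python
# Membership semantics of `in` depend on the type of the right-hand operand.
word = 'abc'

# Against a string: substring test.
in_string = word in 'abc123'                  # 'abc' is a substring
same_as_find = 'abc123'.find(word) != -1      # the equivalent test quoted from the docs

# Against a list: each element is compared for equality.
in_list = word in ['cde', 'fdabc', 'abc123']  # no element equals 'abc'
in_list_exact = word in ['cde', 'abc']        # one element equals 'abc'

# Exact comparison with a single string.
exact = word == 'abc123'

print(in_string, same_as_find, in_list, in_list_exact, exact)
# True True False True False
```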
Pyspark, find substring as whole word(s)
This probably still has edge cases, but I hope it gives you some ideas.
I would use regexp_extract to match the candidate against the sentence.
First, I convert the candidate to a regex (i.e., replace each space with \s), then use regexp_extract with word boundaries (\b).
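from pyspark.sql import functions as F
# Replace each space in the candidate with \s, then extract it only when it sits between word boundaries.
df = (df.withColumn('regex', F.regexp_replace(F.col('candidate'), ' ', r'\\s'))
        .withColumn('match', F.expr(r"regexp_extract(sentence, concat('\\b', regex, '\\b'), 0)")))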
Result
+-------------+-----------------------+--------------+-------------+
| candidate| sentence| regex| match|
+-------------+-----------------------+--------------+-------------+
| su| We saw the survivors.| su| |
|Roman emperor|He was a Roman emperor.|Roman\semperor|Roman emperor|
+-------------+-----------------------+--------------+-------------+
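For readers who want to try the same idea without a Spark session, here is a rough plain-Python sketch of the pattern construction using the re module. The function name whole_word_match is hypothetical, and unlike the Spark snippet it also escapes regex metacharacters in the candidate with re.escape, which is an addition and not part of the original answer.

```python
import re

def whole_word_match(candidate, sentence):
    """Return the matched text if candidate occurs as whole word(s), else ''."""
    # Escape each word, then join the words with \s so a space matches any whitespace.
    pattern = r'\s'.join(re.escape(part) for part in candidate.split(' '))
    # Wrap the pattern in word boundaries so partial words are rejected.
    m = re.search(r'\b' + pattern + r'\b', sentence)
    return m.group(0) if m else ''

print(repr(whole_word_match('su', 'We saw the survivors.')))               # ''
print(repr(whole_word_match('Roman emperor', 'He was a Roman emperor.')))  # 'Roman emperor'
```

The empty string for a non-match mirrors what regexp_extract shows in the result table above.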
Check for (whole only) words in string
You'd be better off splitting your sentence and then counting the words, not the substrings:
textt="When I was One I had just begun When I was Two I was nearly new"
wwords=['i', 'was', 'three', 'near']
text_words = textt.lower().split()
result = {w:text_words.count(w) for w in wwords}
print(result)
prints:
{'three': 0, 'i': 4, 'near': 0, 'was': 3}
If the text has punctuation, you're better off using a regular expression to split the string on non-alphanumeric characters:
import re
textt="When I was One, I had just begun.I was Two when I was nearly new"
wwords=['i', 'was', 'three', 'near']
text_words = re.split(r"\W+",textt.lower())
result = {w:text_words.count(w) for w in wwords}
result:
{'was': 3, 'near': 0, 'three': 0, 'i': 4}
(Another alternative is to use findall on word characters: text_words = re.findall(r"\w+",textt.lower()))
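One practical difference between the two approaches is worth a small illustration: when the text starts or ends with punctuation, re.split leaves an empty string in the list, while re.findall does not. The sample text below is shortened from the one above and the variable names are illustrative.

```python
import re

text = "When I was One, I had just begun."

# Splitting on runs of non-word characters: the trailing '.' produces a final empty string.
split_words = re.split(r"\W+", text.lower())

# Collecting runs of word characters: no empty strings are produced.
found_words = re.findall(r"\w+", text.lower())

print(split_words)  # ['when', 'i', 'was', 'one', 'i', 'had', 'just', 'begun', '']
print(found_words)  # ['when', 'i', 'was', 'one', 'i', 'had', 'just', 'begun']
```

The empty string is harmless when counting specific words, but it matters if you count or iterate over all tokens.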
Now, if your list of "important" words is big, it may be better to count all the words and filter afterwards, using the classic collections.Counter:
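import collections
text_words = collections.Counter(re.split(r"\W+",textt.lower()))
# Index the Counter directly: missing words give 0, whereas .get(w) would give None.
result = {w:text_words[w] for w in wwords}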
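A self-contained version of the Counter approach might look like the sketch below. It uses re.findall for tokenizing, and it contrasts indexing the Counter with calling .get on it, since the two differ for words that never occur. The names text, wanted, and counts are illustrative.

```python
import re
from collections import Counter

text = "When I was One, I had just begun.I was Two when I was nearly new"
wanted = ['i', 'was', 'three', 'near']

# Count every word once, then look up only the words of interest.
counts = Counter(re.findall(r"\w+", text.lower()))

by_index = {w: counts[w] for w in wanted}    # missing words count as 0
by_get = {w: counts.get(w) for w in wanted}  # missing words come back as None

print(by_index)  # {'i': 4, 'was': 3, 'three': 0, 'near': 0}
print(by_get)    # {'i': 4, 'was': 3, 'three': None, 'near': None}
```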
python - string match only whole words
Here is one way:
re.search(r'\b' + re.escape(' '.join(query)) + r'\b', ' '.join(line)) is not None
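Wrapped in a function, the expression above could be used as in this sketch. It assumes, as the join calls suggest, that query and line are both lists of words; the function name contains_phrase and the sample data are hypothetical.

```python
import re

def contains_phrase(query, line):
    """Return True if the words in query appear in line as a whole-word phrase."""
    # query and line are assumed to be lists of words, as implied by the ' '.join calls.
    pattern = r'\b' + re.escape(' '.join(query)) + r'\b'
    return re.search(pattern, ' '.join(line)) is not None

print(contains_phrase(['new', 'york'], ['i', 'love', 'new', 'york', 'city']))  # True
print(contains_phrase(['new', 'york'], ['new', 'yorkshire']))                  # False
```

re.escape keeps any punctuation in the query from being interpreted as regex syntax.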
How to search for specific whole words within a string via SQL, compatible with both Hive and Impala
You can add the word boundary \\b to match only exact words:
rlike '(?i)\\bFECHADO\\b|\\bCIERRE\\b|\\bCLOSED\\b'
(?i) means case-insensitive, so there is no need to use UPPER.
The last alternative in your regex pattern is REVISTO. NORMAL. If the dots in it should be literal dots, use \\. like this: REVISTO\\. NORMAL\\.
A dot in a regexp matches any character and must be escaped with two backslashes to match a literal dot.
The above regex works in Hive. Unfortunately, I have no Impala to test it on.
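The effect of the word boundaries and the escaped dots can be checked outside SQL with a small Python sketch. This is only an illustration with Python's re module, not Hive or Impala code: in a Python raw string a single backslash is enough, and combining the four alternatives into one pattern is an assumption, since the full original pattern is not shown here.

```python
import re

# Same idea as the rlike pattern, written for Python's re module.
# re.IGNORECASE plays the role of the inline (?i) flag.
pattern = re.compile(
    r'\bFECHADO\b|\bCIERRE\b|\bCLOSED\b|REVISTO\. NORMAL\.',
    re.IGNORECASE,
)

def is_match(text):
    """Return True if any alternative of the pattern is found in text."""
    return pattern.search(text) is not None

print(is_match('Ticket closed by user'))  # True
print(is_match('Undisclosed reason'))     # False: 'closed' is not a whole word here
print(is_match('revisto. normal.'))       # True
print(is_match('revistoX normalX'))       # False: escaped dots match only literal dots
```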
Check if a word is in a string in Python
What is wrong with:
if word in mystring:
    print('success')
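One caveat is that this is a substring test, so it also succeeds when word is only part of a longer word. The sketch below contrasts it with a check against the split words; the sample values are illustrative, and the simple split does not strip punctuation.

```python
mystring = 'We saw the survivors.'
word = 'su'

# Substring test: succeeds because 'su' occurs inside 'survivors'.
substring_hit = word in mystring

# Whole-word test: compare against the whitespace-separated words instead.
whole_word_hit = word in mystring.lower().split()

print(substring_hit, whole_word_hit)  # True False
```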