Finding If a Sentence Contains a Specific Phrase in Ruby

Finding if a sentence contains a specific phrase in Ruby

Here are some variations:

require 'benchmark'

lorem = ('Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut' # !> unused literal ignored
'enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in' # !> unused literal ignored
'reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident,' # !> unused literal ignored
'sunt in culpa qui officia deserunt mollit anim id est laborum.' * 10) << ' foo'


lorem.split.include?('foo') # => true
lorem['foo'] # => "foo"
lorem.include?('foo') # => true
lorem[/foo/] # => "foo"
lorem[/fo{2}/] # => "foo"
lorem[/foo$/] # => "foo"
lorem[/fo{2}$/] # => "foo"
lorem[/fo{2}\Z/] # => "foo"
/foo/.match(lorem)[-1] # => "foo"
/foo$/.match(lorem)[-1] # => "foo"
/foo/ =~ lorem # => 621

n = 500_000

puts RUBY_VERSION
puts "n=#{ n }"
Benchmark.bm(25) do |x|
  x.report("array search:") { n.times { lorem.split.include?('foo') } }
  x.report("literal search:") { n.times { lorem['foo'] } }
  x.report("string include?:") { n.times { lorem.include?('foo') } }
  x.report("regex:") { n.times { lorem[/foo/] } }
  x.report("wildcard regex:") { n.times { lorem[/fo{2}/] } }
  x.report("anchored regex:") { n.times { lorem[/foo$/] } }
  x.report("anchored wildcard regex:") { n.times { lorem[/fo{2}$/] } }
  x.report("anchored wildcard regex2:") { n.times { lorem[/fo{2}\Z/] } }
  x.report("/regex/.match") { n.times { /foo/.match(lorem)[-1] } }
  x.report("/regex$/.match") { n.times { /foo$/.match(lorem)[-1] } }
  x.report("/regex/ =~") { n.times { /foo/ =~ lorem } }
  x.report("/regex$/ =~") { n.times { /foo$/ =~ lorem } }
  x.report("/regex\Z/ =~") { n.times { /foo\Z/ =~ lorem } }
end

And the results for Ruby 1.9.3:

1.9.3
n=500000
                                user     system      total        real
array search:              12.960000   0.010000  12.970000 ( 12.978311)
literal search:             0.800000   0.000000   0.800000 (  0.807110)
string include?:            0.760000   0.000000   0.760000 (  0.758918)
regex:                      0.660000   0.000000   0.660000 (  0.657608)
wildcard regex:             0.660000   0.000000   0.660000 (  0.660296)
anchored regex:             0.660000   0.000000   0.660000 (  0.664025)
anchored wildcard regex:    0.660000   0.000000   0.660000 (  0.664897)
anchored wildcard regex2:   0.320000   0.000000   0.320000 (  0.328876)
/regex/.match               1.430000   0.000000   1.430000 (  1.424602)
/regex$/.match              1.430000   0.000000   1.430000 (  1.434538)
/regex/ =~                  0.530000   0.000000   0.530000 (  0.538128)
/regex$/ =~                 0.540000   0.000000   0.540000 (  0.536318)
/regexZ/ =~                 0.210000   0.000000   0.210000 (  0.214547)

And 1.8.7:

1.8.7
n=500000
                                user     system      total        real
array search:              21.250000   0.000000  21.250000 ( 21.296039)
literal search:             0.660000   0.000000   0.660000 (  0.660102)
string include?:            0.610000   0.000000   0.610000 (  0.612433)
regex:                      0.950000   0.000000   0.950000 (  0.946308)
wildcard regex:             2.840000   0.000000   2.840000 (  2.850198)
anchored regex:             0.950000   0.000000   0.950000 (  0.951270)
anchored wildcard regex:    2.870000   0.010000   2.880000 (  2.874209)
anchored wildcard regex2:   2.870000   0.000000   2.870000 (  2.868291)
/regex/.match               1.470000   0.000000   1.470000 (  1.479383)
/regex$/.match              1.480000   0.000000   1.480000 (  1.498106)
/regex/ =~                  0.680000   0.000000   0.680000 (  0.677444)
/regex$/ =~                 0.700000   0.000000   0.700000 (  0.704486)
/regexZ/ =~                 0.700000   0.000000   0.700000 (  0.701943)

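So, from the 1.9.3 results, using a fixed string search like 'foobar'['foo'] is slower than using a regex 'foobar'[/foo/], which is slower than the equivalent 'foobar' =~ /foo/. The 1.8.7 numbers show the fixed string search ahead of the regex instead, so the ordering depends on the Ruby version.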

The OP's original solution suffers badly because it traverses the string twice: once to split it into individual words, and a second time to iterate over the resulting array looking for the target word. Its performance will degrade further as the string size increases.
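Speed aside, the variations do not all answer the same question. Here is a small sketch (the sample strings are made up for illustration) showing that split.include? only finds whitespace-delimited words, include? finds any substring, and a \b regex finds whole words in a single scan without building an array:

```ruby
text  = 'the foobar is here'
punct = 'I said foo.'

# Substring test: true because "foo" occurs inside "foobar".
substring_hit = text.include?('foo')

# Whitespace-delimited word test: builds an array of words first.
split_hit = text.split.include?('foo')

# Whole-word test in one scan: \b marks a word boundary.
boundary_hit = !(text =~ /\bfoo\b/).nil?

# Trailing punctuation sticks to the word when splitting on whitespace.
split_punct    = punct.split.include?('foo')
boundary_punct = !(punct =~ /\bfoo\b/).nil?

puts substring_hit   # true
puts split_hit       # false
puts boundary_hit    # false
puts split_punct     # false
puts boundary_punct  # true
```

Which one is right depends on whether "contains" means "contains the word" or "contains the characters".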

One thing I find interesting about Ruby's performance is that an anchored regex is slightly slower than an unanchored one. In Perl, the opposite was true when I first ran this sort of benchmark several years ago.
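The three anchors used in the benchmarks are not interchangeable, which may be part of why they behave differently; that link to the timings is my assumption, not something the numbers prove. What is certain is the matching behavior: $ matches at the end of any line, \Z matches at the end of the string or just before a final newline, and \z matches only at the very end. A minimal sketch with made-up strings:

```ruby
multi    = "foo\nbar"   # "foo" ends a line, but not the string
trailing = "foo\n"      # "foo" is followed only by a final newline

dollar_mid       = (multi =~ /foo$/)      # $ matches before any newline
big_z_mid        = (multi =~ /foo\Z/)     # \Z does not match mid-string
big_z_trailing   = (trailing =~ /foo\Z/)  # \Z tolerates one final newline
small_z_trailing = (trailing =~ /foo\z/)  # \z requires the absolute end

p dollar_mid        # 0
p big_z_mid         # nil
p big_z_trailing    # 0
p small_z_trailing  # nil
```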

Here's an updated version using Fruity. The various expressions return different results. Any of them can be used if you only want to know whether the target string exists. If you want to know whether the value is at the end of the string, as the anchored tests here do, or to get the location of the target, then some are definitely faster than others, so pick accordingly.

require 'fruity'

TARGET_STR = (' ' * 100) + ' foo'

TARGET_STR['foo'] # => "foo"
TARGET_STR[/foo/] # => "foo"
TARGET_STR[/fo{2}/] # => "foo"
TARGET_STR[/foo$/] # => "foo"
TARGET_STR[/fo{2}$/] # => "foo"
TARGET_STR[/fo{2}\Z/] # => "foo"
TARGET_STR[/fo{2}\z/] # => "foo"
TARGET_STR[/foo\Z/] # => "foo"
TARGET_STR[/foo\z/] # => "foo"
/foo/.match(TARGET_STR)[-1] # => "foo"
/foo$/.match(TARGET_STR)[-1] # => "foo"
/foo/ =~ TARGET_STR # => 101
/foo$/ =~ TARGET_STR # => 101
/foo\Z/ =~ TARGET_STR # => 101
TARGET_STR.include?('foo') # => true
TARGET_STR.index('foo') # => 101
TARGET_STR.rindex('foo') # => 101


puts RUBY_VERSION
puts "TARGET_STR.length = #{ TARGET_STR.length }"

puts
puts 'compare fixed string vs. unanchored regex'
compare do
  fixed_str { TARGET_STR['foo'] }
  unanchored_regex { TARGET_STR[/foo/] }
end

puts
puts 'compare /foo/ to /fo{2}/'
compare do
  unanchored_regex { TARGET_STR[/foo/] }
  unanchored_regex2 { TARGET_STR[/fo{2}/] }
end

puts
puts 'compare unanchored vs. anchored regex'
compare do
  unanchored_regex { TARGET_STR[/foo/] }
  anchored_regex_dollar { TARGET_STR[/foo$/] }
  anchored_regex_Z { TARGET_STR[/foo\Z/] }
  anchored_regex_z { TARGET_STR[/foo\z/] }
end

puts
puts 'compare /foo/, match and =~'
compare do
  unanchored_regex { TARGET_STR[/foo/] }
  unanchored_match { /foo/.match(TARGET_STR)[-1] }
  unanchored_eq_match { /foo/ =~ TARGET_STR }
end

puts
puts 'compare fixed, unanchored, Z, include?, index and rindex'
compare do
  fixed_str { TARGET_STR['foo'] }
  unanchored_regex { TARGET_STR[/foo/] }
  anchored_regex_Z { TARGET_STR[/foo\Z/] }
  include_eh { TARGET_STR.include?('foo') }
  _index { TARGET_STR.index('foo') }
  _rindex { TARGET_STR.rindex('foo') }
end

Which results in:

# >> 2.2.3
# >> TARGET_STR.length = 104
# >>
# >> compare fixed string vs. unanchored regex
# >> Running each test 8192 times. Test will take about 1 second.
# >> fixed_str is faster than unanchored_regex by 2x ± 0.1
# >>
# >> compare /foo/ to /fo{2}/
# >> Running each test 8192 times. Test will take about 1 second.
# >> unanchored_regex2 is similar to unanchored_regex
# >>
# >> compare unanchored vs. anchored regex
# >> Running each test 8192 times. Test will take about 1 second.
# >> anchored_regex_z is similar to anchored_regex_Z
# >> anchored_regex_Z is faster than unanchored_regex by 19.999999999999996% ± 10.0%
# >> unanchored_regex is similar to anchored_regex_dollar
# >>
# >> compare /foo/, match and =~
# >> Running each test 8192 times. Test will take about 1 second.
# >> unanchored_eq_match is faster than unanchored_regex by 2x ± 0.1 (results differ: 101 vs foo)
# >> unanchored_regex is faster than unanchored_match by 3x ± 0.1
# >>
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 32768 times. Test will take about 3 seconds.
# >> _rindex is similar to include_eh (results differ: 101 vs true)
# >> include_eh is faster than _index by 10.000000000000009% ± 10.0% (results differ: true vs 101)
# >> _index is faster than fixed_str by 19.999999999999996% ± 10.0% (results differ: 101 vs foo)
# >> fixed_str is faster than anchored_regex_Z by 39.99999999999999% ± 10.0%
# >> anchored_regex_Z is similar to unanchored_regex

Changing the size of the string shows how the relative speeds shift.

Changing to 1,000 characters:

# >> 2.2.3
# >> TARGET_STR.length = 1004
# >>
# >> compare fixed string vs. unanchored regex
# >> Running each test 4096 times. Test will take about 1 second.
# >> fixed_str is faster than unanchored_regex by 50.0% ± 10.0%
# >>
# >> compare /foo/ to /fo{2}/
# >> Running each test 2048 times. Test will take about 1 second.
# >> unanchored_regex2 is similar to unanchored_regex
# >>
# >> compare unanchored vs. anchored regex
# >> Running each test 8192 times. Test will take about 1 second.
# >> anchored_regex_z is faster than anchored_regex_Z by 10.000000000000009% ± 10.0%
# >> anchored_regex_Z is faster than unanchored_regex by 3x ± 0.1
# >> unanchored_regex is similar to anchored_regex_dollar
# >>
# >> compare /foo/, match and =~
# >> Running each test 4096 times. Test will take about 1 second.
# >> unanchored_eq_match is similar to unanchored_regex (results differ: 1001 vs foo)
# >> unanchored_regex is faster than unanchored_match by 2x ± 0.1
# >>
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 32768 times. Test will take about 4 seconds.
# >> _rindex is faster than anchored_regex_Z by 2x ± 1.0 (results differ: 1001 vs foo)
# >> anchored_regex_Z is faster than include_eh by 2x ± 0.1 (results differ: foo vs true)
# >> include_eh is faster than fixed_str by 10.000000000000009% ± 10.0% (results differ: true vs foo)
# >> fixed_str is similar to _index (results differ: foo vs 1001)
# >> _index is similar to unanchored_regex (results differ: 1001 vs foo)

Bumping it to 10,000:

# >> 2.2.3
# >> TARGET_STR.length = 10004
# >>
# >> compare fixed string vs. unanchored regex
# >> Running each test 512 times. Test will take about 1 second.
# >> fixed_str is faster than unanchored_regex by 39.99999999999999% ± 10.0%
# >>
# >> compare /foo/ to /fo{2}/
# >> Running each test 256 times. Test will take about 1 second.
# >> unanchored_regex2 is similar to unanchored_regex
# >>
# >> compare unanchored vs. anchored regex
# >> Running each test 8192 times. Test will take about 3 seconds.
# >> anchored_regex_z is similar to anchored_regex_Z
# >> anchored_regex_Z is faster than unanchored_regex by 21x ± 1.0
# >> unanchored_regex is similar to anchored_regex_dollar
# >>
# >> compare /foo/, match and =~
# >> Running each test 256 times. Test will take about 1 second.
# >> unanchored_eq_match is similar to unanchored_regex (results differ: 10001 vs foo)
# >> unanchored_regex is faster than unanchored_match by 10.000000000000009% ± 10.0%
# >>
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 32768 times. Test will take about 18 seconds.
# >> _rindex is faster than anchored_regex_Z by 2x ± 0.1 (results differ: 10001 vs foo)
# >> anchored_regex_Z is faster than include_eh by 15x ± 1.0 (results differ: foo vs true)
# >> include_eh is similar to _index (results differ: true vs 10001)
# >> _index is similar to fixed_str (results differ: 10001 vs foo)
# >> fixed_str is faster than unanchored_regex by 39.99999999999999% ± 10.0%

Ruby v2.6.5 results:

# >> 2.6.5
# >> n=500000
# >>                                 user     system      total        real
# >> array search:               6.744581   0.012204   6.756785 (  6.766078)
# >> literal search:             0.351014   0.000334   0.351348 (  0.351866)
# >> string include?:            0.325576   0.000493   0.326069 (  0.326331)
# >> regex:                      0.373231   0.000512   0.373743 (  0.374197)
# >> wildcard regex:             0.371914   0.000356   0.372270 (  0.372549)
# >> anchored regex:             0.373606   0.000568   0.374174 (  0.374736)
# >> anchored wildcard regex:    0.374923   0.000349   0.375272 (  0.375729)
# >> anchored wildcard regex2:   0.136772   0.000384   0.137156 (  0.137474)
# >> /regex/.match               0.662532   0.003377   0.665909 (  0.666605)
# >> /regex$/.match              0.671762   0.005036   0.676798 (  0.677691)
# >> /regex/ =~                  0.322114   0.000404   0.322518 (  0.322917)
# >> /regex$/ =~                 0.332067   0.000995   0.333062 (  0.334226)
# >> /regexZ/ =~                 0.078958   0.000069   0.079027 (  0.079082)

and:

# >> 2.6.5
# >> TARGET_STR.length = 104
# >>
# >> compare fixed string vs. unanchored regex
# >> Running each test 32768 times. Test will take about 1 second.
# >> fixed_str is faster than unanchored_regex by 2x ± 0.1
# >>
# >> compare /foo/ to /fo{2}/
# >> Running each test 8192 times. Test will take about 1 second.
# >> unanchored_regex is similar to unanchored_regex2
# >>
# >> compare unanchored vs. anchored regex
# >> Running each test 16384 times. Test will take about 1 second.
# >> anchored_regex_z is similar to anchored_regex_Z
# >> anchored_regex_Z is similar to anchored_regex_dollar
# >> anchored_regex_dollar is similar to unanchored_regex
# >>
# >> compare /foo/, match and =~
# >> Running each test 16384 times. Test will take about 1 second.
# >> unanchored_eq_match is similar to unanchored_regex (results differ: 101 vs foo)
# >> unanchored_regex is faster than unanchored_match by 3x ± 1.0 (results differ: foo vs )
# >>
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 65536 times. Test will take about 3 seconds.
# >> _rindex is similar to include_eh (results differ: 101 vs true)
# >> include_eh is similar to _index (results differ: true vs 101)
# >> _index is similar to fixed_str (results differ: 101 vs foo)
# >> fixed_str is faster than anchored_regex_Z by 2x ± 0.1
# >> anchored_regex_Z is faster than unanchored_regex by 19.999999999999996% ± 10.0%

# >> 2.6.5
# >> TARGET_STR.length = 1004
# >>
# >> compare fixed string vs. unanchored regex
# >> Running each test 32768 times. Test will take about 2 seconds.
# >> fixed_str is faster than unanchored_regex by 7x ± 1.0
# >>
# >> compare /foo/ to /fo{2}/
# >> Running each test 2048 times. Test will take about 1 second.
# >> unanchored_regex is similar to unanchored_regex2
# >>
# >> compare unanchored vs. anchored regex
# >> Running each test 8192 times. Test will take about 1 second.
# >> anchored_regex_z is similar to anchored_regex_Z
# >> anchored_regex_Z is faster than unanchored_regex by 3x ± 1.0
# >> unanchored_regex is similar to anchored_regex_dollar
# >>
# >> compare /foo/, match and =~
# >> Running each test 2048 times. Test will take about 1 second.
# >> unanchored_eq_match is faster than unanchored_regex by 10.000000000000009% ± 10.0% (results differ: 1001 vs foo)
# >> unanchored_regex is faster than unanchored_match by 39.99999999999999% ± 10.0% (results differ: foo vs )
# >>
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 65536 times. Test will take about 4 seconds.
# >> _rindex is similar to include_eh (results differ: 1001 vs true)
# >> include_eh is similar to _index (results differ: true vs 1001)
# >> _index is similar to fixed_str (results differ: 1001 vs foo)
# >> fixed_str is faster than anchored_regex_Z by 2x ± 1.0
# >> anchored_regex_Z is faster than unanchored_regex by 4x ± 1.0

# >> 2.6.5
# >> TARGET_STR.length = 10004
# >>
# >> compare fixed string vs. unanchored regex
# >> Running each test 8192 times. Test will take about 2 seconds.
# >> fixed_str is faster than unanchored_regex by 31x ± 10.0
# >>
# >> compare /foo/ to /fo{2}/
# >> Running each test 512 times. Test will take about 1 second.
# >> unanchored_regex2 is similar to unanchored_regex
# >>
# >> compare unanchored vs. anchored regex
# >> Running each test 8192 times. Test will take about 3 seconds.
# >> anchored_regex_z is similar to anchored_regex_Z
# >> anchored_regex_Z is faster than unanchored_regex by 27x ± 1.0
# >> unanchored_regex is similar to anchored_regex_dollar
# >>
# >> compare /foo/, match and =~
# >> Running each test 512 times. Test will take about 1 second.
# >> unanchored_eq_match is similar to unanchored_regex (results differ: 10001 vs foo)
# >> unanchored_regex is faster than unanchored_match by 10.000000000000009% ± 10.0% (results differ: foo vs )
# >>
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 65536 times. Test will take about 14 seconds.
# >> _rindex is faster than _index by 2x ± 1.0
# >> _index is similar to include_eh (results differ: 10001 vs true)
# >> include_eh is similar to fixed_str (results differ: true vs foo)
# >> fixed_str is similar to anchored_regex_Z
# >> anchored_regex_Z is faster than unanchored_regex by 26x ± 1.0

"Best way to find a substring in a string" is related.

Check if a sentence includes any words in an array

You are almost there:

splitDescription.select { |word| keywords.include?(word) }

since keywords is the array in which each single word is looked up.

A more robust solution would be:

new_score += 0.5 * (splitDescription & keywords).size

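The latter intersects the split description with the keywords array, takes the size of the intersection, and increases the score by half that size. Note that & returns unique elements, so a keyword that appears several times in the description is counted once.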
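To see how the two approaches differ, here is a small sketch. The sample sentence, the keyword list, and the snake_case variable names are made up for illustration; only select, include? and & come from the answer above.

```ruby
split_description = 'fast red car beats slow red truck'.split
keywords = %w[red car bike]

# select keeps every matching word, including repeats.
matches_with_repeats = split_description.select { |word| keywords.include?(word) }

# & keeps each common element once, in the order of the left-hand array.
unique_matches = split_description & keywords

new_score = 0.0
new_score += 0.5 * unique_matches.size

p matches_with_repeats  # ["red", "car", "red"]
p unique_matches        # ["red", "car"]
p new_score             # 1.0
```

If repeated keywords should count each time, use the size of the select result instead.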

How to check whether a string contains a substring in Ruby

You can use the include? method:

my_string = "abcdefg"
if my_string.include? "cde"
  puts "String includes 'cde'"
end
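Keep in mind that include? is case-sensitive. If case should not matter, one option is to downcase both sides, and another is a regex with the i flag. A minimal sketch reusing my_string from above:

```ruby
my_string = 'abcdefg'

exact     = my_string.include?('CDE')                    # case-sensitive
folded    = my_string.downcase.include?('CDE'.downcase)  # compare lowercased
regex_hit = !(my_string =~ /CDE/i).nil?                  # i flag ignores case

p exact      # false
p folded     # true
p regex_hit  # true
```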

Check if string contains one word or more

Since you're asking about the 'most Ruby' way, I'd rename the method to single_word?

One way is to check for the presence of a space character.

def single_word?(string)
  !string.strip.include? " "
end

But if you want to allow a particular set of characters that meet your definition of word, perhaps including apostrophes and hyphens, use a regex:

def single_word?(string)
  string.scan(/[\w'-]+/).length == 1
end
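The two versions disagree on a few edge cases, so it is worth trying them side by side. In this sketch the method names single_word_by_space? and single_word_by_scan? are hypothetical, used only so both can be defined at once:

```ruby
# Space-based check: anything without an inner space counts as one word.
def single_word_by_space?(string)
  !string.strip.include?(' ')
end

# Regex-based check: exactly one run of word characters, apostrophes, hyphens.
def single_word_by_scan?(string)
  string.scan(/[\w'-]+/).length == 1
end

samples = ['hello', 'hello world', 'mother-in-law', '', "hello\tworld"]
samples.each do |s|
  puts "#{s.inspect}: #{single_word_by_space?(s)} / #{single_word_by_scan?(s)}"
end
# "hello": true / true
# "hello world": false / false
# "mother-in-law": true / true
# "": true / false
# "hello\tworld": true / false
```

The empty string and the tab-separated pair are where the space check is too permissive.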

How to find a specific word in string with Ruby/Rails


Regex

1st regex : extract word for further checking

Here's a regex which only matches the interesting part :

(?<=_)[a-z_]+(?=(?:_|\b))

It matches a lowercase word, possibly containing underscores, that sits between two underscores or follows one underscore and precedes a word boundary.
If you need some logic depending on the case (widesky, sky, light or dark), you could use this solution.

Here in action.

2nd regex : direct check if one of 4 words is present

If you just want to know if any of the 4 cases is present :

(?<=_)(?:wide)?sky_(?:dark|light)(?=(?:_|\b))

Here in action, with either _something_after or nothing.
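Both patterns can be tried directly in Ruby with String#[], which returns the matched text or nil. This is only a sketch: the constant names WORD_PART and KNOWN_THEME are mine, and no_suffix is a shortened variant of the sample strings to cover the "nothing after" case.

```ruby
WORD_PART   = /(?<=_)[a-z_]+(?=(?:_|\b))/
KNOWN_THEME = /(?<=_)(?:wide)?sky_(?:dark|light)(?=(?:_|\b))/

with_suffix = 'TFjyg9780878_867978-DGB097908-78679iuhi698_widesky_light_87689uiyhk'
no_suffix   = 'TFjyg9780878_867978-DGB097908-78679iuhi698_sky_dark'
other       = 'TFjyg9780878_867978-DGB097908-78679iuhi698_trash_dark_87689uiyhk'

# 1st regex: pull out the interesting part for further checking.
p with_suffix[WORD_PART]  # "widesky_light"
p no_suffix[WORD_PART]    # "sky_dark"
p other[WORD_PART]        # "trash_dark"

# 2nd regex: nil unless one of the four known combinations is present.
p with_suffix[KNOWN_THEME]  # "widesky_light"
p no_suffix[KNOWN_THEME]    # "sky_dark"
p other[KNOWN_THEME]        # nil
```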

Case statement

list = %w(
TFjyg9780878_867978-DGB097908-78679iuhi698_widesky_light_87689uiyhk
TFjyg9780878_867978-DGB097908-78679iuhi698_sky_light_87689uiyhk
TFjyg9780878_867978-DGB097908-78679iuhi698_widesky_dark_87689uiyhk
TFjyg9780878_867978-DGB097908-78679iuhi698_sky_dark_87689uiyhk
TFjyg9780878_867978-DGB097908-78679iuhi698_trash_dark_87689uiyhk
)

list.each do |string|
  case string
  when /widesky_light/ then puts "widesky light found!"
  when /sky_light/ then puts "sky light found!"
  when /widesky_dark/ then puts "widesky dark found!"
  when /sky_dark/ then puts "sky dark found!"
  else puts "Nothing found!"
  end
end

In this order, the case statement should be fine. widesky_dark won't match twice, for example.
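The order matters because case stops at the first when that matches, and /sky_light/ also matches inside "widesky_light". A short sketch of what happens if the branches are swapped; the method names and the returned symbols are made up for this comparison:

```ruby
# More specific patterns first: each string lands in the intended branch.
def theme_specific_first(string)
  case string
  when /widesky_light/ then :widesky_light
  when /sky_light/     then :sky_light
  else :nothing
  end
end

# More general pattern first: "widesky_light" is caught by /sky_light/.
def theme_general_first(string)
  case string
  when /sky_light/     then :sky_light
  when /widesky_light/ then :widesky_light
  else :nothing
  end
end

sample = 'TFjyg9780878_867978-DGB097908-78679iuhi698_widesky_light_87689uiyhk'
p theme_specific_first(sample)  # :widesky_light
p theme_general_first(sample)   # :sky_light
```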

Find common words in sentences with Ruby

You could do this (I modified your example slightly):

str = "a lorem ipsum lorem dolor sit amet. a tut toje est lorem! a i tuta toje lorem?"  

str.split(/[.!?]/).map(&:split).reduce(:&)
#=> ["a", "lorem"]

We have:

d = str.split(/[.!?]/)
#=> ["a lorem ipsum lorem dolor sit amet",
# " a tut toje est lorem",
# " a i tuta toje lorem"]
e = d.map(&:split)
#=> [["a", "lorem", "ipsum", "lorem", "dolor", "sit", "amet"],
# ["a", "tut", "toje", "est", "lorem"],
# ["a", "i", "tuta", "toje", "lorem"]]
e.reduce(:&)
#=> ["a", "lorem"]

To make it case-insensitive, change str.split... to str.downcase.split....
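For example, with a made-up string where the capitalization varies between sentences, the difference looks like this:

```ruby
str = 'A lorem ipsum. a Lorem dolor! Lorem a?'

# Without downcase, "A"/"a" and "lorem"/"Lorem" are different words.
case_sensitive = str.split(/[.!?]/).map(&:split).reduce(:&)

# Lowercasing first makes the intersection ignore case.
case_insensitive = str.downcase.split(/[.!?]/).map(&:split).reduce(:&)

p case_sensitive    # []
p case_insensitive  # ["a", "lorem"]
```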

How to check if a string begins with prefix and contains specific keyword in Ruby

The code looks good. My only feedback concerns using include? for the prefix. Use start_with? instead, so that you don't get true when "Dr." appears somewhere other than the start of the string.

def match(st)
  # true only when the string starts with "Dr." and contains "Alex"
  st.start_with?('Dr.') && st.include?('Alex')
end
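A quick sketch of the difference; the method names loose_match? and strict_match? and the sample names are hypothetical, chosen only to put the two checks side by side:

```ruby
# Prefix checked with include?: "Dr." may appear anywhere.
def loose_match?(st)
  st.include?('Dr.') && st.include?('Alex')
end

# Prefix checked with start_with?: "Dr." must open the string.
def strict_match?(st)
  st.start_with?('Dr.') && st.include?('Alex')
end

p loose_match?('Dr. Alex Smith')       # true
p strict_match?('Dr. Alex Smith')      # true
p loose_match?('Alex, friend of Dr.')  # true
p strict_match?('Alex, friend of Dr.') # false
```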

Extract a word from a sentence in Ruby

If you have numbers, use the following regex:

(?<=host:)\d+

The lookbehind will find the numbers right after host:.

See IDEONE demo:

str = "XXX host:1233455 YYY ZZZ!"
puts str.match(/(?<=host:)\d+/)

Note that if you want to match alphanumerics and not any punctuation, you can replace \d+ with \w+.

If the numbers can also contain dots or commas, you can use

/(?<=host:)\d+(?:[.,]\d+)*/

It will extract values like 4,445 or 44.45.455.

UPDATE:

In case you need a more universal solution (especially if the regex must run on another platform where look-behind is not supported, as was long the case in JavaScript), use the capture group approach:

str.match(/\bhost:(\d+)/).captures.first

Note that \b makes sure we find host: as a whole word, not localhost:. (\d+) is the capture group whose value we can refer to with the backreferences, or via .captures.first in Ruby.
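To make the localhost: point concrete, here is a small sketch with a made-up input that contains both. It uses String#[] with a capture index, which returns nil when nothing matches, whereas calling .captures on a failed match raises because match returns nil.

```ruby
str = 'localhost:8080 host:1233455'

# Look-behind alone: "host:" also matches at the end of "localhost:".
lookbehind_hit = str[/(?<=host:)\d+/]

# \b plus a capture group: only the standalone "host:" qualifies.
captured = str[/\bhost:(\d+)/, 1]

# No match: String#[] returns nil instead of raising.
missing = 'no port here'[/\bhost:(\d+)/, 1]

# The dots-and-commas variant from above.
dotted = 'host:44.45.455 up'[/(?<=host:)\d+(?:[.,]\d+)*/]

p lookbehind_hit  # "8080"
p captured        # "1233455"
p missing         # nil
p dotted          # "44.45.455"
```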


