Ruby: Extracting Words From String
The split command.
words = @string1.split(/\W+/)
will split the string into an array, using a regular expression as the delimiter. \W matches any "non-word" character, and the "+" makes a run of consecutive delimiters count as a single separator.
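As a rough illustration (the sample sentence below is made up, not from the original question), this compares split with scan on the same input. One thing to watch for: split leaves an empty first element when the string starts with a delimiter, while scan does not.

```ruby
# Hypothetical sample input, chosen to start with whitespace.
sentence = "  Hello, world! Ruby is fun."

# split treats each run of non-word characters as one delimiter;
# a delimiter at the very start leaves an empty first element.
split_words = sentence.split(/\W+/)

# scan collects the runs of word characters directly.
scan_words = sentence.scan(/\w+/)

p split_words # => ["", "Hello", "world", "Ruby", "is", "fun"]
p scan_words  # => ["Hello", "world", "Ruby", "is", "fun"]
```

If the leading empty string is unwanted, scan(/\w+/) may be the simpler choice.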
Ruby regular expression to extract words in a string that contain no spaces
Using String#scan with character class ranges will get you what you want with a simple, easy-to-understand regex:
str = "ASimpleNoSpaceTitle"
str.scan(/[A-Z][a-z]*/) # => ["A", "Simple", "No", "Space", "Title"]
You could use the POSIX bracket expressions [[:upper:]] and [[:lower:]], which would allow your regex to also deal with non-ASCII letters such as À or ç:
str = "ÀSimpleNoSpaçeTitle"
str.scan(/[A-Z][a-z]*/) # => ["Simple", "No", "Spa", "Title"]
str.scan(/[[:upper:]][[:lower:]]*/) # => ["À", "Simple", "No", "Spaçe", "Title"]
To allow words to begin with a lowercase letter when not preceded by another letter, you can use this variation:
str = "ASimpleNoSpaceTitle and a subtitle"
str.scan(/[A-Za-z][a-z]*/) # => ["A", "Simple", "No", "Space", "Title", "and", "a", "subtitle"]
# OR
str.scan(/[[:alpha:]][[:lower:]]*/)
Ruby regex extracting words
result = ' hello "my name" is "Tom"'.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/)
will work for you. It will print
=> ["", "hello", "\"my name\"", "is", "\"Tom\""]
Just ignore the empty strings.
Explanation
\s # Match a single whitespace character (space, tab, or line break)
+ # One or more times, as many as possible, giving back as needed (greedy)
(?= # Positive lookahead: assert that the regex below can be matched, starting at this position
(?: # Non-capturing group: match the regular expression below
[^"] # Match any character that is NOT a "
* # Zero or more times, as many as possible, giving back as needed (greedy)
" # Match the character " literally
[^"] # Match any character that is NOT a "
* # Zero or more times, as many as possible, giving back as needed (greedy)
" # Match the character " literally
)* # Repeat the group zero or more times, as many as possible, giving back as needed (greedy)
[^"] # Match any character that is NOT a "
* # Zero or more times, as many as possible, giving back as needed (greedy)
$ # Assert position at the end of a line (at the end of the string or before a line break character)
)
In short, the string is split only at whitespace that is followed by an even number of double quotes, that is, whitespace that lies outside any quoted phrase.
You can use reject
like this to avoid empty strings
result = ' hello "my name" is "Tom"'
.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/).reject {|s| s.empty?}
prints
=> ["hello", "\"my name\"", "is", "\"Tom\""]
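As an alternative sketch (not part of the original answer), the same tokens can be collected with scan instead of split, which avoids the empty string altogether. It assumes the double quotes are balanced and never nested or escaped.

```ruby
# Same sample string as in the split example.
str = ' hello "my name" is "Tom"'

# Match either a double-quoted chunk or a run of non-whitespace characters.
# The quoted alternative comes first so that it wins at a quote character.
tokens = str.scan(/"[^"]*"|\S+/)

p tokens # => ["hello", "\"my name\"", "is", "\"Tom\""]
```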
Extract a substring from a string in Ruby using a regular expression
String1.scan(/<([^>]*)>/).last.first
scan creates an array which, for each <item> in String1, contains the text between the < and the > in a one-element array (because when used with a regex containing capturing groups, scan creates an array containing the captures for each match). last gives you the last of those arrays and first then gives you the string in it.
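A small sketch of the intermediate values may make this clearer. The input string here is invented for illustration; the safe-navigation guard at the end is an addition, since last returns nil when scan finds nothing.

```ruby
# Hypothetical input standing in for String1.
string1 = "first <alpha> then <beta> and finally <gamma>"

# With one capturing group, scan returns one single-element array per match.
captures = string1.scan(/<([^>]*)>/)
last_item = captures.last.first

p captures  # => [["alpha"], ["beta"], ["gamma"]]
p last_item # => "gamma"

# scan returns [] when nothing matches, so #last would be nil; guard with &.
no_match = "no brackets here".scan(/<([^>]*)>/).last&.first
p no_match  # => nil
```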
Extract a word from a string based on the character index
If you just want the first word that starts with a capital T:
"What the hell is T3GARY and U81J9H"[/T\w+/]
# => "T3GARY"
If you want all those words comprised of just upper-case letters and numbers:
"What the hell is T3GARY and U81J9H".scan(/\b[A-Z0-9]+\b/)
# => ["T3GARY", "U81J9H"]
Extract a word from a sentence in Ruby
If you have numbers, use the following regex:
(?<=host:)\d+
The lookbehind will find the numbers right after host:.
For example:
str = "XXX host:1233455 YYY ZZZ!"
puts str.match(/(?<=host:)\d+/)
Note that if you want to match alphanumerics and not any punctuation, you can replace \d+ with \w+.
Also, if the numbers contain dots or commas, you can use
/(?<=host:)\d+(?:[.,]\d+)*/
It will extract values like 4,445 or 44.45.455.
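A quick check of that pattern on a few invented inputs (the sample strings are assumptions for illustration only):

```ruby
# Digits after "host:", optionally followed by groups of "." or "," plus more digits.
pattern = /(?<=host:)\d+(?:[.,]\d+)*/

with_comma = "host:4,445 up"[pattern]
with_dots  = "host:44.45.455, ok"[pattern]
trailing   = "host:12. done"[pattern]

p with_comma # => "4,445"
p with_dots  # => "44.45.455"
p trailing   # => "12"
```

A separator is kept only when digits follow it, so trailing punctuation is left out of the match.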
UPDATE:
In case you need a more universal solution (especially if you need to use the regex on another platform where look-behind is not supported, as in JavaScript), use the capture group approach:
str.match(/\bhost:(\d+)/).captures.first
Note that \b makes sure we find host: as a whole word, not localhost:. (\d+) is the capture group whose value we can refer to with backreferences, or via .captures.first in Ruby.
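One caveat is that match returns nil when nothing matches, so calling .captures on the result raises NoMethodError. A hedged sketch of two nil-safe variants, using String#[] with a group index and the safe-navigation operator (the non-matching inputs are made up):

```ruby
str = "XXX host:1233455 YYY ZZZ!"

# String#[] with a regex and a group index returns that capture,
# or nil when there is no match.
host_id = str[/\bhost:(\d+)/, 1]
p host_id # => "1233455"

# \b keeps "localhost:" from matching.
local = "localhost:42"[/\bhost:(\d+)/, 1]
p local # => nil

# With match, use &. so a failed match yields nil instead of raising.
missing = "no host here".match(/\bhost:(\d+)/)&.captures&.first
p missing # => nil
```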
Extracting unique words
Let's first create a test file.
str =<<END
We like pancakes for breakfast,
but we know others like waffles.
END
FName = 'temp'
File.write(FName, str)
#=> 65 (characters written)
We need to return an array containing the first nbr_unique unique words from the file named fname, so let's write a method that will do that.
def unique_words(fname, nbr_unique)
<code needed here>
end
You need to add unique words to an array that will be returned by this method, so let's begin by creating an empty array and then return that array at the end of the method.
def unique_words(fname, nbr_unique)
arr = []
<code needed here>
arr
end
You know how to read a file line-by-line, so let's do that, using the class method IO::foreach (see note 1).
def unique_words(fname, nbr_unique)
arr = []
File.foreach(fname) do |line|
<code needed here to process line>
end
arr
end
The block variable line equals "We like pancakes for breakfast,\n" after the first line is read. Firstly, the newline character needs to be removed. Examine the methods of the class String to see if one can be used to do that.
The second line contains the word "we". I assume "We" and "we" are not to be regarded as distinct words. This is usually handled by converting all characters of a string to either all lowercase or all uppercase. You can do this to each line or to each word (after words have been extracted from a line). Again, look for a suitable method in the class String for doing this.
Next you need to extract words from each line. Once again, look for a String method for doing that.
Next we need to determine if, say, "like" (or "LIKE") is to be added to the array arr. Look at the instance methods for the class Array for a suitable method. If it is added we need to see if arr now contains nbr_unique words. If it does we don't need to read any more lines of the file, so we need to break out of foreach's block (perhaps use the keyword break).
There's one more thing we need to take care of. The first line contains "breakfast,", the second, "waffles.". We obviously don't want the words returned to contain punctuation. There are two ways to do that. The first is to remove the punctuation, the second is to accept only letters.
Given a string that contains punctuation (a line or a word) we can create a second string that equals the original string with the punctuation removed. One way to do that is to use the method String#tr. Suppose the string is "breakfast,". Then
"breakfast,".tr(".,?!;:'", "") #=> "breakfast"
To only accept letters we could remove every non-letter with any of the following regular expressions (all return "breakfast"):
"breakfast,".gsub(/[^a-zA-Z]+/, "")
"breakfast,".gsub(/[^a-z]+/i, "")
"breakfast,".gsub(/[^[:alpha:]]+/, "")
"breakfast,".gsub(/[^\p{L}]+/, "")
The first two work with ASCII characters only. The third (POSIX bracket expression) and fourth (\p{} construct) work with Unicode (search within Regexp).
Note that it is more efficient to remove punctuation from a line before words are extracted.
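Putting the hints together, here is one possible completion of the skeleton. It is a sketch, not the only valid answer: the choice of chomp, downcase, scan and include? is an assumption, and it uses return rather than break because the early exit happens inside a nested loop. It also assumes nbr_unique is a positive integer.

```ruby
require 'tempfile'

# One possible way to fill in the skeleton (assumed method choices).
def unique_words(fname, nbr_unique)
  arr = []
  File.foreach(fname) do |line|
    # chomp drops the newline, downcase makes "We" and "we" equal,
    # and scan keeps runs of letters only, which discards punctuation.
    line.chomp.downcase.scan(/[[:alpha:]]+/).each do |word|
      arr << word unless arr.include?(word)
      return arr if arr.size >= nbr_unique # enough words: stop reading
    end
  end
  arr
end

# Try it on a temporary file holding the same two lines as the test file.
first_five = Tempfile.create('words') do |f|
  f.write("We like pancakes for breakfast,\nbut we know others like waffles.\n")
  f.flush
  unique_words(f.path, 5)
end

p first_five # => ["we", "like", "pancakes", "for", "breakfast"]
```

Here scan takes the "accept only letters" route; removing punctuation with tr and then splitting on whitespace would work as well.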
Extra credit: use Enumerator#with_object
Whenever you see an object (here arr) initialized to be empty, manipulated and then returned at the end of a method, you should consider using the method Enumerator#with_object or (more commonly) Enumerable#each_with_object. Both of these return the object referred to in the method name.
The method IO::foreach returns an enumerator (an instance of the class Enumerator) when it does not have a block (see doc). We therefore could write
def unique_words(fname, nbr_unique)
File.foreach(fname).with_object([]) do |line, arr|
<code needed here to process line>
end
end
We have eliminated two lines (arr = [] and arr), but have also confined arr's scope to the block. This is not a big deal but is the Ruby way.
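To see the return value in isolation, here is a minimal sketch on a plain array of words (the word list is invented for illustration). It shows that with_object and each_with_object both hand back the accumulator.

```ruby
# Hypothetical word list with duplicates.
words = %w[we like pancakes we like waffles]

# with_object passes the same array to every iteration and returns it at the end.
seen = words.each.with_object([]) do |word, acc|
  acc << word unless acc.include?(word)
end

# each_with_object is the more common spelling of the same idea.
same = words.each_with_object([]) { |word, acc| acc << word unless acc.include?(word) }

p seen         # => ["we", "like", "pancakes", "waffles"]
p same == seen # => true
```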
More extra credit: use methods of the class Set
Suppose we wrote the following.
require 'set'
def unique_words(fname, nbr_unique)
File.foreach(fname).with_object(Set.new) do |line, set|
<code needed here to process line>
end.to_a
end
When we extract the word "we" from the second line we need to check if it should be added to the set. Since sets have unique elements we can just try to add it. The attempt will fail, because set will already contain that word from the first line of the file. A handy method for doing that is Set#add?:
set.add?("we")
#=> nil
Here the method returns nil, meaning the set already contains that word. It also tells us that we don't need to check if the set now contains nbr_unique words. Had we been able to add the word to the set, set (with the added word) would be returned.
The block returns the value of set (a set). The method Set#to_a converts that set to an array, which is returned by the method.
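A hedged sketch of how the Set version might be filled in. The per-line processing (downcase plus scan) and the early return are assumptions, not the original author's code, and nbr_unique is assumed to be positive.

```ruby
require 'set'
require 'tempfile'

# One possible completion of the Set-based skeleton (assumed method choices).
def unique_words(fname, nbr_unique)
  File.foreach(fname).with_object(Set.new) do |line, set|
    line.downcase.scan(/[[:alpha:]]+/).each do |word|
      # add? returns nil for a duplicate, so the size check
      # only runs after a word has actually been inserted.
      return set.to_a if set.add?(word) && set.size >= nbr_unique
    end
  end.to_a
end

# Same two lines as the test file, written to a temporary file.
first_seven = Tempfile.create('words') do |f|
  f.write("We like pancakes for breakfast,\nbut we know others like waffles.\n")
  f.flush
  unique_words(f.path, 7)
end

p first_seven # => ["we", "like", "pancakes", "for", "breakfast", "but", "know"]
```

Set keeps insertion order, so to_a returns the words in the order they first appeared.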
1 Notice that I've invoked the class method IO::foreach by writing File.foreach(fname)... in the code above. This is permissible because File is a subclass of IO (File.superclass #=> IO). I could have instead written IO.foreach(fname)..., but it is more common to use File as the receiver.