Ruby: Extracting Words from String

The split command.

   words = @string1.split(/\W+/)

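will split the string into an array based on a regular expression. \W matches any "non-word" character (anything other than a letter, digit, or underscore), and the "+" matches one or more of them in a row, so a run of consecutive delimiters is treated as a single separator.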
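As a quick illustration, here is a minimal sketch of how runs of punctuation and spaces act as single delimiters. The sample sentence and the variable name sentence are made up for this example and are not from the original answer.

```ruby
# Hypothetical sample input; any string works here.
sentence = "Hello, world!  This is Ruby."

# \W+ treats each run of non-word characters (", ", "!  ", " ", ".") as one delimiter.
words = sentence.split(/\W+/)
# => ["Hello", "world", "This", "is", "Ruby"]
```

Note that split only drops trailing empty strings, so a string that begins with a delimiter produces an empty string as its first element.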

Ruby regular expression to extract words in a string that contain no spaces

Using String#scan with character class ranges will get you what you want with a simple, easy-to-understand regex:

str = "ASimpleNoSpaceTitle"
str.scan(/[A-Z][a-z]*/) # => ["A", "Simple", "No", "Space", "Title"]

You could use the POSIX bracket expressions [[:upper:]] and [[:lower:]], which would allow your regex to also deal with non-ASCII letters such as À or ç:

str = "ÀSimpleNoSpaçeTitle"
str.scan(/[A-Z][a-z]*/) # => ["Simple", "No", "Spa", "Title"]
str.scan(/[[:upper:]][[:lower:]]*/) # => ["À", "Simple", "No", "Spaçe", "Title"]

To allow words to begin with a lowercase letter when not preceded by another letter, you can use this variation:

str = "ASimpleNoSpaceTitle and a subtitle"
str.scan(/[A-Za-z][a-z]*/) # => ["A", "Simple", "No", "Space", "Title", "and", "a", "subtitle"]
# OR
str.scan(/[[:alpha:]][[:lower:]]*/)
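The POSIX variant can be wrapped in a small helper to see what it returns on mixed input. This is only a sketch: split_camel_case is a hypothetical name chosen for illustration, and the input combines the two sample strings used above.

```ruby
# split_camel_case is a hypothetical helper name used only for this sketch.
def split_camel_case(text)
  # One letter of any case, followed by any number of lowercase letters.
  text.scan(/[[:alpha:]][[:lower:]]*/)
end

split_camel_case("ÀSimpleNoSpaçeTitle and a subtitle")
# => ["À", "Simple", "No", "Spaçe", "Title", "and", "a", "subtitle"]
```

One limitation of this pattern: a run of capitals is split letter by letter, so "HTTPServer" comes back as ["H", "T", "T", "P", "Server"].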

Ruby regex extracting words

result = '   hello "my name" is    "Tom"'.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/)

will work for you. It will print

=> ["", "hello", "\"my name\"", "is", "\"Tom\""]

Just ignore the empty strings.

Explanation

\s            # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
   +             # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?=           # Assert that the regex below can be matched, starting at this position (positive lookahead)
   (?:           # Match the regular expression below
      [^"]          # Match any character that is NOT a “"”
         *             # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      "             # Match the character “"” literally
      [^"]          # Match any character that is NOT a “"”
         *             # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      "             # Match the character “"” literally
   )*            # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   [^"]          # Match any character that is NOT a “"”
      *             # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   $             # Assert position at the end of a line (at the end of the string or before a line break character)
)

You can use reject like this to avoid empty strings

result = '   hello "my name" is    "Tom"'
            .split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/).reject {|s| s.empty?}

prints

=> ["hello", "\"my name\"", "is", "\"Tom\""]
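A different technique, not the one from the answer above, is to describe the tokens themselves with String#scan instead of describing the delimiters with split. The sketch below assumes quotes are always balanced and never nested; the variable names text and tokens are illustrative.

```ruby
text = '   hello "my name" is    "Tom"'

# Try a complete quoted section first; otherwise take a run of non-whitespace.
tokens = text.scan(/"[^"]*"|\S+/)
# => ["hello", "\"my name\"", "is", "\"Tom\""]
```

Because scan only returns matches, no empty strings are produced and no reject step is needed.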

Extract a substring from a string in Ruby using a regular expression

String1.scan(/<([^>]*)>/).last.first

scan creates an array which, for each <item> in String1 contains the text between the < and the > in a one-element array (because when used with a regex containing capturing groups, scan creates an array containing the captures for each match). last gives you the last of those arrays and first then gives you the string in it.
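The intermediate values described above can be inspected step by step. The input string here is invented for illustration, and string1 stands in for String1.

```ruby
# Hypothetical input containing three <item> sections.
string1 = "<first> some text <second> more <last>"

# With one capturing group, scan returns one single-element array per match.
captures = string1.scan(/<([^>]*)>/)
# => [["first"], ["second"], ["last"]]

captures.last        # => ["last"]
captures.last.first  # => "last"
```

If the string contains no <item> at all, scan returns an empty array and last returns nil, so calling first on it would raise NoMethodError.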

Extract a word from a string based on the character index

If you just want the first word that starts with a capital T:

"What the hell is T3GARY and U81J9H"[/T\w+/]
# => "T3GARY"

If you want all those words comprised of just upper-case letters and numbers:

"What the hell is T3GARY and U81J9H".scan(/\b[A-Z0-9]+\b/)
# => ["T3GARY", "U81J9H"]

Extract a word from a sentence in Ruby

If you have numbers, use the following regex:

(?<=host:)\d+

The lookbehind will find the numbers right after host:.

For example:

str = "XXX host:1233455 YYY ZZZ!"
puts str.match(/(?<=host:)\d+/)

Note that if you want to match alphanumerics and not any punctuation, you can replace \d+ with \w+.

If the value can also contain dots or commas, you can use

/(?<=host:)\d+(?:[.,]\d+)*/

It will extract values like 4,445 or 44.45.455.
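A short sketch of that pattern applied to a few inputs. The sample strings are made up, and pattern is just an illustrative constant name.

```ruby
pattern = /(?<=host:)\d+(?:[.,]\d+)*/

"XXX host:4,445 YYY"[pattern]      # => "4,445"
"XXX host:44.45.455 YYY"[pattern]  # => "44.45.455"
"XXX host:1233455 YYY"[pattern]    # => "1233455"

# A trailing separator is not part of the match, because each [.,] must be followed by a digit.
"XXX host:12. YYY"[pattern]        # => "12"
```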

UPDATE:

In case you need a more universal solution (especially if you need to use the regex on another platform where lookbehind is not supported, as in older JavaScript engines), use the capture group approach:

str.match(/\bhost:(\d+)/).captures.first

Note that \b makes sure we find host: as a whole word, not localhost:. (\d+) is the capture group whose value we can refer to with the backreferences, or via .captures.first in Ruby.
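The capture group approach raises NoMethodError when nothing matches, because match returns nil. A hedged alternative is String#[] with a group index, which returns nil instead. The second input string is invented to show the effect of \b.

```ruby
str = "XXX host:1233455 YYY ZZZ!"

str.match(/\bhost:(\d+)/).captures.first  # => "1233455"

# String#[] with a regex and a group number returns that group, or nil on no match.
str[/\bhost:(\d+)/, 1]                    # => "1233455"

# \b prevents a match inside "localhost:", so nothing is extracted here.
"localhost:8080"[/\bhost:(\d+)/, 1]       # => nil
```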

Extracting unique words

Let's first create a test file.

str =<<END
We like pancakes for breakfast,
but we know others like waffles.
END

FName = 'temp'
File.write(FName, str)
  #=> 65 (characters written)

We need to return an array containing the first nbr_unique unique words from the file named fname, so let's write a method that will do that.

def unique_words(fname, nbr_unique)
  <code needed here>
end

You need to add unique words to an array that will be returned by this method, so let's begin by creating an empty array and then return that array at the end of the method.

def unique_words(fname, nbr_unique)
  arr = []
  <code needed here>
  arr
end

You know how to read a file line-by-line, so let's do that, using the class method IO::foreach¹.

def unique_words(fname, nbr_unique)
  arr = []
  File.foreach(fname) do |line|
    <code need here to process line>
  end
  arr
end

The block variable line equals "We like pancakes for breakfast,\n" after the first line is read. Firstly, the newline character needs to be removed. Examine the methods of the class String to see if one can be used to do that.

The second line contains the word "we". I assume "We" and "we" are not to be regarded as unique words. This is usually handled by converting all characters of a string to either all lowercase or all uppercase. You can do this to each line or to each word (after words have been extracted from a line). Again, look for a suitable method in the class String for doing this.

Next you need to extract words from each line. Once again, look for a String method for doing that.

Next we need to determine if, say, "like" (or "LIKE") is to be added to the array arr. Look at the instance methods for the class Array for a suitable method. If it is added we need to see if arr now contains nbr_unique words. If it does we don't need to read any more lines of the file, so we need to break out of foreach's block (perhaps use the keyword break).

There's one more thing we need to take care of. The first line contains "breakfast,", the second, "waffles.". We obviously don't want the words returned to contain punctuation. There are two ways to do that. The first is to remove the punctuation, the second is to accept only letters.

Given a string that contains punctuation (a line or a word) we can create a second string that equals the original string with the punctuation removed. One way to do that is to use the method String#tr. Suppose the string is "breakfast,". Then

"breakfast,".tr(".,?!;:'", "") #=> "breakfast"

To only accept letters we could use any of the following regular expressions (all return "breakfast"):

"breakfast,".gsub(/[^a-zA-Z]+/, "")
"breakfast,".gsub(/[^a-z]+/i, "")
"breakfast,".gsub(/[^[:alpha:]]+/, "")
"breakfast,".gsub(/[^\p{L}]+/, "")

The first two work with ASCII characters only. The third (a POSIX bracket expression) and the fourth (the \p{} construct) also work with Unicode letters; both are described in the Regexp documentation.

Note that it is more efficient to remove punctuation from a line before words are extracted.
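Putting the steps above together, one possible completion looks like the sketch below. It is not the only way to fill in the placeholders: it picks String#chomp, String#downcase, String#scan and Array#include? as the "suitable methods", and it writes the test file to a temporary path chosen for this example.

```ruby
require 'tmpdir'

def unique_words(fname, nbr_unique)
  arr = []
  File.foreach(fname) do |line|
    # chomp drops the newline, downcase folds case, scan keeps only runs of letters.
    line.chomp.downcase.scan(/[[:alpha:]]+/).each do |word|
      break if arr.size >= nbr_unique
      arr << word unless arr.include?(word)
    end
    # Stop reading the file once enough words have been collected.
    break if arr.size >= nbr_unique
  end
  arr
end

# Hypothetical temporary file holding the same two lines as the test file.
path = File.join(Dir.tmpdir, "unique_words_sketch.txt")
File.write(path, "We like pancakes for breakfast,\nbut we know others like waffles.\n")

unique_words(path, 5)
# => ["we", "like", "pancakes", "for", "breakfast"]
```

Accepting only letters means a word such as "don't" is split into "don" and "t"; removing punctuation with tr instead would avoid that.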

Extra credit: use Enumerator#with_object

Whenever you see an object (here arr) initialized to be empty, manipulated and then returned at the end of a method, you should consider using the method Enumerator#with_object or (more commonly) Enumerable#each_with_object. Both of these return the object referred to in the method name.

The method IO::foreach returns an enumerator (an instance of the class Enumerator) when it is called without a block. We therefore could write

def unique_words(fname, nbr_unique)
  File.foreach(fname).with_object([]) do |line, arr|
    <code need here to process line>
  end
end

We have eliminated two lines (arr = [] and arr) and have also confined arr's scope to the block. This is not a big deal, but it is the Ruby way.
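A filled-in version of the with_object form might look like the following sketch. One detail matters when leaving the block early: a bare break would make the with_object call, and therefore the method, return nil, so the sketch uses break arr. The temporary file path and the choice of scan with [[:alpha:]] are assumptions made for this example.

```ruby
require 'tmpdir'

def unique_words(fname, nbr_unique)
  File.foreach(fname).with_object([]) do |line, arr|
    line.downcase.scan(/[[:alpha:]]+/).each do |word|
      # Skip duplicates, and skip everything once the quota is reached.
      arr << word unless arr.include?(word) || arr.size >= nbr_unique
    end
    # break with a value, so the method still returns the array.
    break arr if arr.size >= nbr_unique
  end
end

# Hypothetical temporary file with the same two lines as the test file.
path = File.join(Dir.tmpdir, "unique_words_with_object.txt")
File.write(path, "We like pancakes for breakfast,\nbut we know others like waffles.\n")

unique_words(path, 5)
# => ["we", "like", "pancakes", "for", "breakfast"]
```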

More extra credit: use methods of the class Set

Suppose we wrote the following.

require 'set'

def unique_words(fname, nbr_unique)
  File.foreach(fname).with_object(Set.new) do |line, set|
    <code need here to process line>
  end.to_a
end

When we extract the word "we" from the second line we need to check if it should be added to the set. Since sets have unique elements we can just try to add it. The attempt will fail because set already contains that word from the first line of the file. A handy method for this is Set#add?:

set.add?("we")
  #=> nil

Here the method returns nil, meaning the set already contains that word. It also tells us that we don't need to check if the set now contains nbr_unique words. Had we been able to add the word to the set, set (with the added word) would be returned.

with_object returns set (a set) once the enumeration finishes. The method Set#to_a converts that set to an array, which is returned by the method.
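As a sketch of how Set#add? could drive the whole method: the size check only runs when add? actually inserted a word. This version assumes nbr_unique is a positive integer, and the temporary file path is invented for the example.

```ruby
require 'set'
require 'tmpdir'

def unique_words(fname, nbr_unique)
  File.foreach(fname).with_object(Set.new) do |line, set|
    line.downcase.scan(/[[:alpha:]]+/).each do |word|
      # add? returns nil for a duplicate, so the size check is skipped in that case.
      break if set.add?(word) && set.size >= nbr_unique
    end
    # break with the set so that to_a below still has a receiver.
    break set if set.size >= nbr_unique
  end.to_a
end

# Hypothetical temporary file with the same two lines as the test file.
path = File.join(Dir.tmpdir, "unique_words_set.txt")
File.write(path, "We like pancakes for breakfast,\nbut we know others like waffles.\n")

unique_words(path, 6)
# => ["we", "like", "pancakes", "for", "breakfast", "but"]
```

A Set keeps its elements in insertion order, so the resulting array lists the words in the order they first appear in the file.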

¹ Notice that I've invoked the class method IO::foreach by writing File.foreach(fname)... below. This is permissible because File is a subclass of IO (File.superclass #=> IO). I could have instead written IO.foreach(fname)..., but it is more common to use File as the receiver.
