How to Tokenize This String in Ruby

Better way to tokenize string in Ruby?

Don't use split, use the scan method:

> "ab.:u/87z".scan(/\w+|\W/)
=> ["ab", ".", ":", "u", "/", "87z"]
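A small side-by-side comparison shows why scan fits better here. This is only a sketch: the helper name tokenize is made up for illustration, and it assumes you want each run of word characters as one token and every other character as its own token.

```ruby
# Hypothetical helper: word runs become one token, every other
# character (punctuation, whitespace) becomes its own token.
def tokenize(str)
  str.scan(/\w+|\W/)
end

p tokenize("ab.:u/87z")      # => ["ab", ".", ":", "u", "/", "87z"]

# split discards the separators unless they are captured,
# and can leave empty strings behind:
p "ab.:u/87z".split(/\W/)    # => ["ab", "", "u", "87z"]
```

Note that \W also matches whitespace, so spaces come back as tokens unless you filter them out afterwards.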

How do I tokenize this string in Ruby?

For a real language, a lexer's the way to go - like Guss said. But if the full language is only as complicated as your example, you can use this quick hack:

irb> text = %{Children^10 Health "sanitation management"^5}
irb> text.scan(/(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/).map do |word,phrase,boost|
  { :keywords => (word || phrase).downcase, :boost => (boost.nil? ? nil : boost.to_i) }
end
#=> [{:boost=>10, :keywords=>"children"}, {:boost=>nil, :keywords=>"health"}, {:boost=>5, :keywords=>"sanitation management"}]

If you're trying to parse a regular language then this method will suffice - though it wouldn't take many more complications to make the language non-regular.

A quick breakdown of the regex:

  • \w+ matches any single-term keywords
  • (?:\\.|[^\\"])* uses non-capturing parentheses ((?:...)) to match the contents of an escaped double-quoted string - either an escaped symbol (\n, \", \\, etc.) or any single character that's not an escape symbol or an end quote.
  • "((?:\\.|[^\\"])*)" captures only the contents of a quoted keyword phrase.
  • (?:(\w+)|"((?:\\.|[^\\"])*)") matches any keyword - single term or phrase, capturing single terms into $1 and phrase contents into $2
  • \d+ matches a number.
  • \^(\d+) captures a number following a caret (^). Since this is the third set of capturing parentheses, it will be captured into $3.
  • (?:\^(\d+))? captures a number following a caret if it's there, matches the empty string otherwise.

String#scan(regex) matches the regex against the string as many times as possible, returning an array of "matches". If the regex contains capturing parens, a "match" is an array of items captured - so $1 becomes match[0], $2 becomes match[1], etc. Any capturing parenthesis that doesn't get matched against part of the string maps to a nil entry in the resulting "match".
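The nil behaviour is easy to check on a toy input. The string and pattern below are made up purely for illustration:

```ruby
# Each match is an array with one slot per capturing group;
# a group that did not take part in the match shows up as nil.
PAIRS = "a1 b c3".scan(/([a-z])(\d)?/)
p PAIRS                        # => [["a", "1"], ["b", nil], ["c", "3"]]

# Without capturing groups, scan returns the matched strings themselves.
p "a1 b c3".scan(/[a-z]\d?/)   # => ["a1", "b", "c3"]
```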

The #map then takes these matches, uses some block magic to break each captured term into different variables (we could have done do |match| ; word,phrase,boost = *match), and then creates your desired hashes. Exactly one of word or phrase will be nil, since the two sit on opposite sides of an alternation and only one side can match at a time, so (word || phrase) will return the non-nil one, and #downcase will convert it to all lowercase. boost.to_i will convert a string to an integer while (boost.nil? ? nil : boost.to_i) will ensure that nil boosts stay nil.
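For readability, the same regex can be spelled out in extended (/x) mode and wrapped in a method. This is a sketch, not a different algorithm: the names QUERY_TOKEN and parse_query are invented here, and boost && boost.to_i is just a shorter way of keeping nil boosts nil.

```ruby
# Same pattern as above, written with the /x flag so that whitespace
# and comments inside the regex are ignored.
QUERY_TOKEN = /
  (?: (\w+)                    # group 1: a single-word keyword
    | "((?:\\.|[^\\"])*)"      # group 2: contents of a quoted phrase
  )
  (?: \^(\d+) )?               # group 3: optional boost after a caret
/x

# Hypothetical wrapper around the scan-and-map step.
def parse_query(text)
  text.scan(QUERY_TOKEN).map do |match|
    word, phrase, boost = match          # explicit destructuring
    { :keywords => (word || phrase).downcase,
      :boost    => boost && boost.to_i } # nil stays nil
  end
end

p parse_query(%{Children^10 Health "sanitation management"^5})
```

On current Ruby versions the hash keys print in insertion order (keywords first, then boost), unlike the older output shown above.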

Regular expression to tokenize string

Keyword matching is straightforward if you know the exact boundaries. In this case, you have single apostrophes as string boundaries and a comma as a separator. So, this is the regex to match a value for a given key (based on your input example):

(?<=key1\:).+?(?=,|'|$) --> finds 3 "value1" matches
(?<=key2\:).+?(?=,|'|$) --> finds 1 "value2" match
(?<=key3\:).+?(?=,|'|$) --> finds 2 "value3" matches
(?<=key4\:).+?(?=,|'|$) --> no match
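In Ruby, these patterns can be tried with String#scan. The input string below is hypothetical, invented only to reproduce the match counts listed above, and values_for is an illustrative helper name:

```ruby
# Hypothetical input, made up to mirror the match counts listed above.
INPUT = "'key1:value1,key2:value2','key1:value1,key3:value3','key1:value1,key3:value3'"

# Build the lookbehind + lazy match + lookahead pattern for a given key
# and return every value found for it.
def values_for(key, input)
  input.scan(/(?<=#{Regexp.escape(key)}:).+?(?=,|'|$)/)
end

p values_for("key1", INPUT)   # => ["value1", "value1", "value1"]
p values_for("key2", INPUT)   # => ["value2"]
p values_for("key3", INPUT)   # => ["value3", "value3"]
p values_for("key4", INPUT)   # => []
```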

How to split a string containing an arithmetic expression while preserving floating point numbers in Ruby

I think a regex is a good approach. Something like the following pattern could be
used to extract the floats:

  • one or more digits: \d+
  • a period (optional): \.?
  • any number of digits (optional): \d*

All together now: /\d+\.?\d*/

Here's a more complete code example:

s = "2 /2+3 * 4.75- -6"
s.gsub(" ", "").split(/(\d+\.?\d*)/).reject(&:empty?)
# => ["2", "/", "2", "+", "3", "*", "4.75", "--", "6"]

A couple things to note here:

  • the regex is wrapped in parentheses (i.e. /()/) so that the matched text is
    included in the results array. I actually looked here
    to figure out how to do that.
  • This solves your problems except for the - minus sign. You might be able to
    figure this out with some more Regex wrangling, but I think a simpler solution is
    to interpret -- as a plus sign when it comes time to process the math operators.
  • The above regex requires your float strings to have at least one digit before the
    decimal point, i.e. 0.5 and not just .5
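Both caveats can be handled with small tweaks. The following is a sketch under assumptions, not part of the original answer: NUMBER adds an alternative for floats like .5, and the helper names split_expression and normalize_operators are invented for illustration.

```ruby
# Variation on the pattern above: the extra alternative \.\d+ also accepts
# floats with no digit before the decimal point (an assumption about what
# input should be allowed).
NUMBER = /\d+\.?\d*|\.\d+/

# Hypothetical helper: strip spaces, then split around numbers while
# keeping them (the capture group makes split return the matches too).
def split_expression(expr)
  expr.gsub(" ", "").split(/(#{NUMBER})/).reject(&:empty?)
end

# Hypothetical helper: read a doubled minus as a plus. Other operator
# pairs such as "*-" are left untouched.
def normalize_operators(tokens)
  tokens.map { |tok| tok == "--" ? "+" : tok }
end

p normalize_operators(split_expression("2 /2+3 * 4.75- -6"))
# => ["2", "/", "2", "+", "3", "*", "4.75", "+", "6"]
p split_expression("1 + .5")
# => ["1", "+", ".5"]
```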

In response to your comment, you are correct that this won't separate parentheses
from operators. It can be updated to do so, though:

Using the same method as above, but with an example string that contains
parentheses:

s = "( 2 / 2 ) +3 * 4.75- -6"
new_string = s.gsub(" ", "").split(/(\d+\.?\d*)/).reject(&:empty?)
# => ["(", "2", "/", "2", ")+", "3", "*", "4.75", "--", "6"]

Then you could write the following to separate out the parentheses:

new_string.map { |str| str.split(/([\(\)])/) }.flatten.reject(&:empty?)
# => ["(", "2", "/", "2", ")", "+", "3", "*", "4.75", "--", "6"]

This is an ugly looking regex (aren't they all), but in short:

  • wrap the whole regex in parens (i.e. /()/) so that split includes
    the matched part in the produced array.
  • use [\(\)] to select either a ( or ) character.

The use of map & flatten enables you to split each string in your array without
creating sub-arrays.
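As an alternative to split followed by map and flatten, a single String#scan pass can list numbers, operators and parentheses as alternatives in one pattern. This is a different technique from the one above, sketched under the assumption that the only operators are + - * / and parentheses; TOKEN and tokenize_expression are illustrative names.

```ruby
# One alternative per token type: a number, or a single operator/paren.
TOKEN = /\d+\.?\d*|[-+*\/()]/

# Hypothetical helper: scan skips anything the pattern does not match,
# so spaces (and any unknown characters) are silently dropped.
def tokenize_expression(expr)
  expr.scan(TOKEN)
end

p tokenize_expression("( 2 / 2 ) +3 * 4.75- -6")
# => ["(", "2", "/", "2", ")", "+", "3", "*", "4.75", "-", "-", "6"]
```

With this approach the doubled minus arrives as two separate "-" tokens, so deciding which one is a unary minus is left to whatever processes the tokens.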

Tokenise lines with quoted elements in Ruby

As Andrew said, the most straightforward way is to parse the input with the stock CSV library and set the appropriate :col_sep and :quote_char options.
If you insist on parsing manually, you can use the following pattern in a more Ruby-like way:

file.each do |line|
  tokens = line.scan(/\s*("[^"]+")|(\w+)/).flatten.compact
  # do whatever with array of tokens
end
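The CSV route mentioned above might look like the sketch below. It assumes the fields are separated by single spaces and quoted with double quotes, which may not match your actual file; tokenize_line is a made-up helper name.

```ruby
require "csv"

# Hypothetical helper: let the CSV parser handle the quoting rules.
# Assumes single-space separators and double-quote quoting.
def tokenize_line(line)
  CSV.parse_line(line, col_sep: " ", quote_char: '"')
end

p tokenize_line('foo "bar baz" qux')   # => ["foo", "bar baz", "qux"]
```

Unlike the scan version, the CSV parser strips the surrounding quotes from quoted fields.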

Decompose words into letters with Ruby

You might be able to get started looking at String#scan, which appears to be giving decent results for your examples:

"csobolyó".scan(Regexp.union(abc.keys))
# => ["cs", "o", "b", "o", "ly", "ó"]
"nyirettyű".scan(Regexp.union(abc.keys))
# => ["ny", "i", "r", "e", "tty", "ű"]
"dzsesszmuzsikus".scan(Regexp.union(abc.keys))
# => ["dzs", "e", "ssz", "m", "u", "zs", "i", "k", "u", "s"]

The last case doesn't match your expected output, but it matches your statement in the comments:

I sorted the letters in the alphabet: if a letter appears earlier, then it should be recognized instead of its simple letters. When a word contains "dzs" it should be considered to "dzs" and not to "d" and "zs"
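The abc hash itself comes from the question and isn't shown here, so the sketch below uses a hypothetical subset of multi-character letters instead. Regexp.union tries its alternatives in the order given, which is why the longer letters are placed first; the fallback /\p{L}/ for single letters is an assumption, not part of the original.

```ruby
# Hypothetical subset of multi-character letters; the real abc hash
# comes from the question and is not reproduced here.
MULTI_LETTERS = %w[dzs ssz tty cs dz gy ly ny sz ty zs]

# Longest alternatives first, so "dzs" wins over "dz" or "d" + "zs".
# Any other single letter falls through to \p{L}.
LETTER = Regexp.union(*MULTI_LETTERS.sort_by { |l| -l.length }, /\p{L}/)

# Hypothetical helper name.
def letters(word)
  word.scan(LETTER)
end

p letters("csobolyó")
# => ["cs", "o", "b", "o", "ly", "ó"]
p letters("dzsesszmuzsikus")
# => ["dzs", "e", "ssz", "m", "u", "zs", "i", "k", "u", "s"]
```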


