Better way to tokenize string in Ruby?
Don't use split, use the scan method:
> "ab.:u/87z".scan(/\w+|\W/)
=> ["ab", ".", ":", "u", "/", "87z"]
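To see why scan fits better here, it may help to compare what the two methods return for the same input. This is only a sketch; the variable names are illustrative and not part of the original answer.

```ruby
text = "ab.:u/87z"

# split treats the pattern as a delimiter, so the punctuation is discarded
# and an empty string shows up between the adjacent "." and ":".
split_tokens = text.split(/\W/)
# => ["ab", "", "u", "87z"]

# scan returns every match of the pattern, so both the word runs (\w+)
# and each single non-word character (\W) are kept as tokens.
scan_tokens = text.scan(/\w+|\W/)
# => ["ab", ".", ":", "u", "/", "87z"]
```

Note that \W also matches whitespace, so spaces in the input would come back as tokens too unless they are filtered out afterwards.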
How do I tokenize this string in Ruby?
For a real language, a lexer's the way to go - like Guss said. But if the full language is only as complicated as your example, you can use this quick hack:
irb> text = %{Children^10 Health "sanitation management"^5}
irb> text.scan(/(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/).map do |word,phrase,boost|
{ :keywords => (word || phrase).downcase, :boost => (boost.nil? ? nil : boost.to_i) }
end
#=> [{:boost=>10, :keywords=>"children"}, {:boost=>nil, :keywords=>"health"}, {:boost=>5, :keywords=>"sanitation management"}]
If you're trying to parse a regular language then this method will suffice - though it wouldn't take many more complications to make the language non-regular.
A quick breakdown of the regex:
- \w+ matches any single-term keyword.
- (?:\\.|[^\\"])* uses non-capturing parentheses ((?:...)) to match the contents of an escaped double-quoted string: either an escaped symbol (\n, \", \\, etc.) or any single character that's not an escape symbol or an end quote.
- "((?:\\.|[^\\"])*)" captures only the contents of a quoted keyword phrase.
- (?:(\w+)|"((?:\\.|[^\\"])*)") matches any keyword, single term or phrase, capturing single terms into $1 and phrase contents into $2.
- \d+ matches a number.
- \^(\d+) captures a number following a caret (^). Since this is the third set of capturing parentheses, it will be captured into $3.
- (?:\^(\d+))? captures a number following a caret if it's there, and matches the empty string otherwise.
String#scan(regex) matches the regex against the string as many times as possible, outputting an array of "matches". If the regex contains capturing parens, a "match" is an array of the items captured, so $1 becomes match[0], $2 becomes match[1], etc. Any capturing parenthesis that doesn't get matched against part of the string maps to a nil entry in the resulting "match".
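A smaller pattern makes the nil behaviour easier to see. The input string and the key=value format below are made up for illustration only.

```ruby
# Two capturing groups: a name, and an optional number after "=".
pairs = "a=1 b c=3".scan(/(\w+)(?:=(\d+))?/)
# => [["a", "1"], ["b", nil], ["c", "3"]]

# Each inner array lines up with $1, $2, ... for that match; the second
# group did not take part in the match for "b", so its slot is nil.
names_without_value = pairs.select { |_name, value| value.nil? }.map(&:first)
# => ["b"]
```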
The #map then takes these matches, uses some block magic to break each captured term into different variables (we could have done do |match| ; word,phrase,boost = *match), and then creates your desired hashes. Exactly one of word or phrase will be nil, since both can't be matched against the input, so (word || phrase) will return the non-nil one, and #downcase will convert it to all lowercase. boost.to_i will convert a string to an integer, while (boost.nil? ? nil : boost.to_i) will ensure that nil boosts stay nil.
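If the language grows beyond what one regex can comfortably handle, the lexer route mentioned at the top could look roughly like the following. This is a minimal sketch using StringScanner from Ruby's standard library; the method name lex_query and the token tags (:word, :phrase, :boost) are assumptions made for this example, not something from the original question.

```ruby
require "strscan"

# Hypothetical helper: turns a query string into [type, value] pairs.
def lex_query(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    if scanner.skip(/\s+/)
      next # whitespace only separates tokens
    elsif scanner.scan(/"((?:\\.|[^\\"])*)"/)
      tokens << [:phrase, scanner[1]]       # contents of a quoted phrase
    elsif scanner.scan(/\w+/)
      tokens << [:word, scanner.matched]    # single-term keyword
    elsif scanner.scan(/\^(\d+)/)
      tokens << [:boost, scanner[1].to_i]   # number following a caret
    else
      raise ArgumentError,
            "unexpected #{scanner.peek(1).inspect} at position #{scanner.pos}"
    end
  end
  tokens
end

lex_query(%{Children^10 Health "sanitation management"^5})
# => [[:word, "Children"], [:boost, 10], [:word, "Health"],
#     [:phrase, "sanitation management"], [:boost, 5]]
```

Unlike the scan version, this reports input it does not understand instead of silently skipping it; pairing each boost with the keyword before it would be a separate pass over the tokens.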
regular expression to tokenize string
Keyword matching is straightforward if you know the exact boundaries. In this case, you have single apostrophes as string boundaries and a comma as a separator. So, this is the regex to match a value for a given key (based on your input example):
(?<=key1\:).+?(?=,|'|$) --> finds 3 "value1" matches
(?<=key2\:).+?(?=,|'|$) --> finds 1 "value2" match
(?<=key3\:).+?(?=,|'|$) --> finds 2 "value3" matches
(?<=key4\:).+?(?=,|'|$) --> no match
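In Ruby these lookaround patterns can be used with String#scan. The original input is not shown here, so the sample string below is an assumption constructed to produce the same match counts, and values_for is a hypothetical helper name.

```ruby
# Assumed sample input: apostrophe-delimited groups of comma-separated key:value pairs.
input = "'key1:value1,key2:value2','key1:value1,key3:value3','key1:value1,key3:value3'"

# Hypothetical helper: returns every value that follows "<key>:".
def values_for(input, key)
  # Lookbehind anchors the match right after "key:"; the lazy .+? stops
  # before the next comma, apostrophe, or end of line.
  input.scan(/(?<=#{Regexp.escape(key)}:).+?(?=,|'|$)/)
end

values_for(input, "key1") # => ["value1", "value1", "value1"]
values_for(input, "key2") # => ["value2"]
values_for(input, "key3") # => ["value3", "value3"]
values_for(input, "key4") # => []
```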
How to split a string containing an arithmetic expression while preserving floating point numbers in Ruby
I think a regex is a good approach. Something like the following pattern could be
used to extract the floats:
- one or more digits: \d+
- a period (optional): \.?
- any number of digits (optional): \d*
All together now: /\d+\.?\d*/
Here's a more complete code example:
s = "2 /2+3 * 4.75- -6"
s.gsub(" ", "").split(/(\d+\.?\d*)/).reject(&:empty?)
# => ["2", "/", "2", "+", "3", "*", "4.75", "--", "6"]
A couple of things to note here:
- The regex is wrapped in parentheses (i.e. /()/) so that the matched text is included in the results array. I actually looked here to figure out how to do that.
- This solves your problems except for the - minus sign. You might be able to figure this out with some more regex wrangling, but I think a simpler solution is to interpret -- as a plus sign when it comes time to process the math operators.
- The above regex requires your float strings to have at least one digit before the decimal point, i.e. 0.5 and not just .5
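For what it's worth, the "more regex wrangling" for the minus sign and the leading-dot floats might look like the sketch below. It switches from split to scan and uses a negative lookbehind, which is a different technique from the one in the answer; the names TOKEN and tokenize_expression are made up for this example, and it assumes a unary minus only ever appears directly in front of a number.

```ruby
TOKEN = %r{
  (?<![\d.)])-(?:\d+\.?\d*|\.\d+)  # negative number, minus not preceded by digit, dot or close paren
  | \d+\.?\d*                      # number with a leading digit, such as 2 or 4.75
  | \.\d+                          # number without a leading digit, such as .5
  | [-+*/()]                       # a single operator or parenthesis
}x

# Hypothetical helper: strips spaces, then collects every token match.
def tokenize_expression(expr)
  expr.delete(" ").scan(TOKEN)
end

tokenize_expression("2 /2+3 * 4.75- -6")
# => ["2", "/", "2", "+", "3", "*", "4.75", "-", "-6"]
tokenize_expression(".5*-2")
# => [".5", "*", "-2"]
```

The lookbehind is what separates the two meanings of "-": after a digit, a dot, or a closing parenthesis it is treated as the binary operator, anywhere else it is glued to the number that follows.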
In response to your comment, you are correct that this won't separate parentheses
from operators. It can be updated to do so, though.
Using the same method as above, but with an example string that contains
parentheses:
s = "( 2 / 2 ) +3 * 4.75- -6"
new_string = s.gsub(" ", "").split(/(\d+\.?\d*)/).reject(&:empty?)
# => ["(", "2", "/", "2", ")+", "3", "*", "4.75", "--", "6"]
Then you could write the following to separate out the parentheses:
new_string.map { |str| str.split(/([\(\)])/) }.flatten.reject(&:empty?)
# => ["(", "2", "/", "2", ")", "+", "3", "*", "4.75", "--", "6"]
This is an ugly looking regex (aren't they all), but in short:
- wrap the whole regex in parens (i.e. /()/) so that split includes the matched part in the produced array.
- use [\(\)] to select either a ( or ) character.
The use of map & flatten enables you to split each string in your array without
creating sub-arrays.
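The two steps can also be chained into one method. The sketch below uses flat_map, which behaves like map followed by a one-level flatten; the method name split_expression is made up for this example.

```ruby
# Hypothetical helper combining both splitting steps from above.
def split_expression(s)
  s.delete(" ")
   .split(/(\d+\.?\d*)/)                       # peel off the numbers, keeping them
   .flat_map { |part| part.split(/([()])/) }   # then separate parentheses from operators
   .reject(&:empty?)                           # drop the empty strings split leaves behind
end

split_expression("( 2 / 2 ) +3 * 4.75- -6")
# => ["(", "2", "/", "2", ")", "+", "3", "*", "4.75", "--", "6"]
```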
Tokenise lines with quoted elements in Ruby
Like Andrew said, the most straightforward way is to parse the input with the stock CSV library and set appropriate :col_sep and :quote_char options.
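As a rough illustration of the CSV route: the question's actual input format isn't shown here, so the space-separated sample line below is an assumption.

```ruby
require "csv"

# Assumed sample line: fields separated by single spaces, with double
# quotes around any field that contains a space itself.
line = 'foo "bar baz" qux'

fields = CSV.parse_line(line, col_sep: " ", quote_char: '"')
# => ["foo", "bar baz", "qux"]
```

The CSV parser strips the surrounding quotes and handles doubled quotes inside a field, which is the part that tends to go wrong in hand-written patterns.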
If you insist on parsing manually, you can use the following pattern in a more Ruby-like way:
file.each do |line|
tokens = line.scan(/\s*("[^"]+")|(\w+)/).flatten.compact
# do whatever with array of tokens
end
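Applied to a single line, the manual pattern behaves as follows. The sample line is made up for illustration; note that, unlike the CSV approach, the quotes stay part of the token.

```ruby
line = 'foo "bar baz" qux'

# Each match fills only one of the two capture groups, so flatten + compact
# reduces the [quoted, word] pairs to a flat list of tokens.
tokens = line.scan(/\s*("[^"]+")|(\w+)/).flatten.compact
# => ["foo", "\"bar baz\"", "qux"]

# Strip the surrounding quotes if only the contents are wanted.
unquoted = tokens.map { |t| t.delete_prefix('"').delete_suffix('"') }
# => ["foo", "bar baz", "qux"]
```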
Decompose words into letters with Ruby
You might be able to get started looking at String#scan, which appears to be giving decent results for your examples:
"csobolyó".scan(Regexp.union(abc.keys))
# => ["cs", "o", "b", "o", "ly", "ó"]
"nyirettyű".scan(Regexp.union(abc.keys))
# => ["ny", "i", "r", "e", "tty", "ű"]
"dzsesszmuzsikus".scan(Regexp.union(abc.keys))
# => ["dzs", "e", "ssz", "m", "u", "zs", "i", "k", "u", "s"]
The last case doesn't match your expected output, but it matches your statement in the comments:
"I sorted the letters in the alphabet: if a letter appears earlier, then it should be recognized instead of its simple letters. When a word contains "dzs" it should be considered to "dzs" and not to "d" and "zs""
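The snippets above rely on an abc hash that isn't shown. A self-contained sketch might look like this, assuming abc maps each letter of the alphabet to its position; the letter list here is a guess that includes the doubled forms "ssz" and "tty" seen in the outputs, and decompose is a hypothetical helper name. Regexp.union tries its alternatives in the order given, so the multi-character letters have to come before the single characters they start with.

```ruby
# Assumed alphabet: multi-character letters plus the single letters.
LETTERS = %w[dzs ssz tty cs dz gy ly ny sz ty zs
             a á b c d e é f g h i í j k l m n o ó ö ő p q r s t u ú ü ű v w x y z].freeze

# Hypothetical stand-in for the abc hash: letter => position.
ABC = LETTERS.each_with_index.to_h

# Longest letters first, so "dzs" wins over "dz" and "d".
LETTER_PATTERN = Regexp.union(ABC.keys.sort_by { |letter| -letter.length })

def decompose(word)
  word.scan(LETTER_PATTERN)
end

decompose("csobolyó")        # => ["cs", "o", "b", "o", "ly", "ó"]
decompose("nyirettyű")       # => ["ny", "i", "r", "e", "tty", "ű"]
decompose("dzsesszmuzsikus") # => ["dzs", "e", "ssz", "m", "u", "zs", "i", "k", "u", "s"]
```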