Indentation Sensitive Parser Using Parslet in Ruby

Indentation sensitive parser using Parslet in Ruby?

There are a few approaches.

  1. Parse the document by recognising each line as a collection of indents and an identifier, then apply a transformation afterwards to reconstruct the hierarchy based on the number of indents.

  2. Use captures to store the current indent and expect the next node to include that indent plus more to match as a child (I didn't dig into this approach much as the next one occurred to me)

  3. Rules are just methods. So you can define 'node' as a method, which means you can pass parameters! (as follows)

This lets you define node(depth) in terms of node(depth+1). The problem with this approach, however, is that the node method doesn't match a string, it generates a parser. So a recursive call will never finish.

This is why dynamic exists. It returns a parser that isn't resolved until the point it tries to match it, allowing you to now recurse without problems.

See the following code:

require 'parslet'

class IndentationSensitiveParser < Parslet::Parser

def indent(depth)
str(' '*depth)
end

rule(:newline) { str("\n") }

rule(:identifier) { match['A-Za-z0-9'].repeat(1).as(:identifier) }

def node(depth)
indent(depth) >>
identifier >>
newline.maybe >>
(dynamic{|s,c| node(depth+1).repeat(0)}).as(:children)
end

rule(:document) { node(0).repeat }

root :document
end

This is my favoured solution.

Create a sepBy parser combinator sensitive to the indentation of the first parser

I think there are two problems here:

  1. You're requiring the first "Example" to be indented beyond itself, which is impossible. You should instead let the first parser succeed regardless of the current position.
  2. greater is not atomic, so when it fails, your parser is left in an invalid state. This might or might not be considered a bug in the library. In any case, you can make it atomic via attempt.

With that in mind, I think the following parser does roughly what you want:

let indentSepBy p sep =
parse {
let! pos = getPosition
let! head = p
let! tail =
let p' = attempt (greater pos p)
let sep' = attempt (greater pos sep)
many (sep' >>. p')
return head :: tail
}

You can test this as follows:

let test =
indentSepBy (pstring "Example") (pchar '.')

let run text =
printfn "***"
runParser (test .>> eof) () text
|> printfn "%A"

[<EntryPoint>]
let main argv =
run "Example.Example" // success
run "Example\n.Example" // failure
run "Example\n .Example" // success
0

Note that I've forced the test parser to consume the entire input via eof. Otherwise, it will falsely report success when it can't in fact parse the full string.

Parsing text structured as tree with fixed width columns using parslet in ruby

I was going to say the same thing as "the Tin Man". There has to be another format you can generate the data in.

If you want to parse this however... Parslet works like a map/reduce algorythm.
You're first pass (parsing) is not intended to give you your final output, just to capture all the information you need from your source document.

Once you have that stored in a tree, you can then transform it to get the output you want.

So... I would write a parser that records each white space as a node, aswell as matching the text and percentages you need. I would group the white space nodes in an "indentation" node.

I would then use a transform to replace the whitespace nodes with a count of nodes to calculate the indentations.

Remember: Parslet generates a standard ruby hash. You can then write whatever code you like to make sense of this tree.

The parser is just converting the text file into a data-stucture you can manipulate.

Just to reiterate though. I think "the Tin Man" has the right answer.. generate the data in a machine readable way instead.

Update:

For an alternative approach you can check out: Indentation sensitive parser using Parslet in Ruby?

Parse markdown indented code block

In this case newline can be eof. In which case newline. repeat(2) repeatedly matches eof. You want "repeat(2,2)". You can made these bugs easy to find :)... Just use my fork.

You can detect how it's looping by using my fork of parslet.
It catches loops and tells you what's happening.
It's slower than the usual parslet, so switch back for production parsing.

Use this Gemfile:

source "https://rubygems.org"

gem "parslet" , :git => "https://github.com/NigelThorne/parslet.git"
gem 'rspec'

And you get these results:

 9:23:40.20 > bundle exec rspec parser.rb
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Failures:

1) RecurringGroupParser should parse a
Failure/Error: is_expected.to parse "a"
RuntimeError:
Grammar contains an infinite loop applying 'NEWLINE{2, }' at char position 1
...a<-- here
# ./parser.rb:33:in `block (2 levels) in <top (required)>'

2) RecurringGroupParser should parse aa
Failure/Error: is_expected.to parse "a\na"
RuntimeError:
Grammar contains an infinite loop applying 'NEWLINE{2, }' at char position 3
...a
a<-- here
# ./parser.rb:37:in `block (2 levels) in <top (required)>'

3) RecurringGroupParser should parse aaa
Failure/Error: is_expected.to parse "a\na\na"
RuntimeError:
Grammar contains an infinite loop applying 'NEWLINE{2, }' at char position 5
...a
a
a<-- here
# ./parser.rb:41:in `block (2 levels) in <top (required)>'

4) RecurringGroupParser should parse a a
Failure/Error: is_expected.to parse "a\n\na"
expected BLOCK to be able to parse "a\n\na"
# ./parser.rb:45:in `block (2 levels) in <top (required)>'

5) RecurringGroupParser should parse aa a
Failure/Error: is_expected.to parse "a\na\n\na"
expected BLOCK to be able to parse "a\na\n\na"
# ./parser.rb:49:in `block (2 levels) in <top (required)>'

6) RecurringGroupParser should parse aaa a
Failure/Error: is_expected.to parse "a\naa\n\na"
expected BLOCK to be able to parse "a\naa\n\na"
# ./parser.rb:53:in `block (2 levels) in <top (required)>'

7) RecurringGroupParser should parse a aa
Failure/Error: is_expected.to parse "a\n\na\na"
expected BLOCK to be able to parse "a\n\na\na"
# ./parser.rb:57:in `block (2 levels) in <top (required)>'

8) RecurringGroupParser should parse a aaa
Failure/Error: is_expected.to parse "a\n\na\na\na"
expected BLOCK to be able to parse "a\n\na\na\na"
# ./parser.rb:61:in `block (2 levels) in <top (required)>'

9) RecurringGroupParser should parse aa a
Failure/Error: is_expected.to parse "a\na\n\na"
expected BLOCK to be able to parse "a\na\n\na"
# ./parser.rb:65:in `block (2 levels) in <top (required)>'

10) RecurringGroupParser should parse aa aa
Failure/Error: is_expected.to parse "a\na\n\na\na"
expected BLOCK to be able to parse "a\na\n\na\na"
# ./parser.rb:69:in `block (2 levels) in <top (required)>'

11) RecurringGroupParser should parse aa aaa
Failure/Error: is_expected.to parse "a\na\n\na\na\na"
expected BLOCK to be able to parse "a\na\n\na\na\na"
# ./parser.rb:73:in `block (2 levels) in <top (required)>'

12) RecurringGroupParser should parse aaa aa
Failure/Error: is_expected.to parse "a\naa\n\na\na"
expected BLOCK to be able to parse "a\naa\n\na\na"
# ./parser.rb:77:in `block (2 levels) in <top (required)>'

13) RecurringGroupParser should parse aaa aaa
Failure/Error: is_expected.to parse "a\naa\n\na\na\na"
expected BLOCK to be able to parse "a\naa\n\na\na\na"
# ./parser.rb:81:in `block (2 levels) in <top (required)>'

14) RecurringGroupParser should parse a a a
Failure/Error: is_expected.to parse "a\n\na\n\na"
expected BLOCK to be able to parse "a\n\na\n\na"
# ./parser.rb:85:in `block (2 levels) in <top (required)>'

15) RecurringGroupParser should parse aa a a
Failure/Error: is_expected.to parse "a\na\n\na\n\na"
expected BLOCK to be able to parse "a\na\n\na\n\na"
# ./parser.rb:89:in `block (2 levels) in <top (required)>'

16) RecurringGroupParser should parse aaa a a
Failure/Error: is_expected.to parse "a\naa\n\na\n\na"
expected BLOCK to be able to parse "a\naa\n\na\n\na"
# ./parser.rb:93:in `block (2 levels) in <top (required)>'

17) RecurringGroupParser should parse a aa a
Failure/Error: is_expected.to parse "a\n\na\na\n\na"
expected BLOCK to be able to parse "a\n\na\na\n\na"
# ./parser.rb:97:in `block (2 levels) in <top (required)>'

18) RecurringGroupParser should parse aa aa a
Failure/Error: is_expected.to parse "a\na\n\na\na\n\na"
expected BLOCK to be able to parse "a\na\n\na\na\n\na"
# ./parser.rb:101:in `block (2 levels) in <top (required)>'

19) RecurringGroupParser should parse aaa aa a
Failure/Error: is_expected.to parse "a\naa\n\na\na\n\na"
expected BLOCK to be able to parse "a\naa\n\na\na\n\na"
# ./parser.rb:105:in `block (2 levels) in <top (required)>'

20) RecurringGroupParser should parse a aaa a
Failure/Error: is_expected.to parse "a\n\na\naa\n\na"
expected BLOCK to be able to parse "a\n\na\naa\n\na"
# ./parser.rb:109:in `block (2 levels) in <top (required)>'

21) RecurringGroupParser should parse aa aaa a
Failure/Error: is_expected.to parse "a\na\n\na\naa\n\na"
expected BLOCK to be able to parse "a\na\n\na\naa\n\na"
# ./parser.rb:113:in `block (2 levels) in <top (required)>'

22) RecurringGroupParser should parse aaa aaa a
Failure/Error: is_expected.to parse "a\naa\n\na\naa\n\na"
expected BLOCK to be able to parse "a\naa\n\na\naa\n\na"
# ./parser.rb:117:in `block (2 levels) in <top (required)>'

23) RecurringGroupParser should parse a a aa
Failure/Error: is_expected.to parse "a\n\na\n\na\na"
expected BLOCK to be able to parse "a\n\na\n\na\na"
# ./parser.rb:121:in `block (2 levels) in <top (required)>'

24) RecurringGroupParser should parse aa a aa
Failure/Error: is_expected.to parse "a\na\n\na\n\na\na"
expected BLOCK to be able to parse "a\na\n\na\n\na\na"
# ./parser.rb:125:in `block (2 levels) in <top (required)>'

25) RecurringGroupParser should parse aaa a aa
Failure/Error: is_expected.to parse "a\naa\n\na\n\na\na"
expected BLOCK to be able to parse "a\naa\n\na\n\na\na"
# ./parser.rb:129:in `block (2 levels) in <top (required)>'

26) RecurringGroupParser should parse a aa aa
Failure/Error: is_expected.to parse "a\n\na\na\n\na\na"
expected BLOCK to be able to parse "a\n\na\na\n\na\na"
# ./parser.rb:133:in `block (2 levels) in <top (required)>'

27) RecurringGroupParser should parse aa aa aa
Failure/Error: is_expected.to parse "a\na\n\na\na\n\na\na"
expected BLOCK to be able to parse "a\na\n\na\na\n\na\na"
# ./parser.rb:137:in `block (2 levels) in <top (required)>'

28) RecurringGroupParser should parse aaa aa aa
Failure/Error: is_expected.to parse "a\naa\n\na\na\n\na\na"
expected BLOCK to be able to parse "a\naa\n\na\na\n\na\na"
# ./parser.rb:141:in `block (2 levels) in <top (required)>'

29) RecurringGroupParser should parse a aaa aa
Failure/Error: is_expected.to parse "a\n\na\naa\n\na\na"
expected BLOCK to be able to parse "a\n\na\naa\n\na\na"
# ./parser.rb:145:in `block (2 levels) in <top (required)>'

30) RecurringGroupParser should parse aa aaa aa
Failure/Error: is_expected.to parse "a\na\n\na\naa\n\na\na"
expected BLOCK to be able to parse "a\na\n\na\naa\n\na\na"
# ./parser.rb:149:in `block (2 levels) in <top (required)>'

31) RecurringGroupParser should parse aaa aaa aa
Failure/Error: is_expected.to parse "a\naa\n\na\naa\n\na\na"
expected BLOCK to be able to parse "a\naa\n\na\naa\n\na\na"
# ./parser.rb:153:in `block (2 levels) in <top (required)>'

32) RecurringGroupParser should parse a a aaa
Failure/Error: is_expected.to parse "a\n\na\n\na\na\na"
expected BLOCK to be able to parse "a\n\na\n\na\na\na"
# ./parser.rb:157:in `block (2 levels) in <top (required)>'

33) RecurringGroupParser should parse aa a aaa
Failure/Error: is_expected.to parse "a\na\n\na\n\na\na\na"
expected BLOCK to be able to parse "a\na\n\na\n\na\na\na"
# ./parser.rb:161:in `block (2 levels) in <top (required)>'

34) RecurringGroupParser should parse aaa a aaa
Failure/Error: is_expected.to parse "a\naa\n\na\n\na\na\na"
expected BLOCK to be able to parse "a\naa\n\na\n\na\na\na"
# ./parser.rb:165:in `block (2 levels) in <top (required)>'

35) RecurringGroupParser should parse a aa aaa
Failure/Error: is_expected.to parse "a\n\na\na\n\na\na\na"
expected BLOCK to be able to parse "a\n\na\na\n\na\na\na"
# ./parser.rb:169:in `block (2 levels) in <top (required)>'

36) RecurringGroupParser should parse aa aa aaa
Failure/Error: is_expected.to parse "a\na\n\na\na\n\na\na\na"
expected BLOCK to be able to parse "a\na\n\na\na\n\na\na\na"
# ./parser.rb:173:in `block (2 levels) in <top (required)>'

37) RecurringGroupParser should parse aaa aa aaa
Failure/Error: is_expected.to parse "a\naa\n\na\na\n\na\na\na"
expected BLOCK to be able to parse "a\naa\n\na\na\n\na\na\na"
# ./parser.rb:177:in `block (2 levels) in <top (required)>'

38) RecurringGroupParser should parse a aaa aaa
Failure/Error: is_expected.to parse "a\n\na\naa\n\na\na\na"
expected BLOCK to be able to parse "a\n\na\naa\n\na\na\na"
# ./parser.rb:181:in `block (2 levels) in <top (required)>'

39) RecurringGroupParser should parse aa aaa aaa
Failure/Error: is_expected.to parse "a\na\n\na\naa\n\na\na\na"
expected BLOCK to be able to parse "a\na\n\na\naa\n\na\na\na"
# ./parser.rb:185:in `block (2 levels) in <top (required)>'

40) RecurringGroupParser should parse aaa aaa aaa
Failure/Error: is_expected.to parse "a\naa\n\na\naa\n\na\na\na"
expected BLOCK to be able to parse "a\naa\n\na\naa\n\na\na\na"
# ./parser.rb:189:in `block (2 levels) in <top (required)>'

Finished in 0.01702 seconds (files took 0.26725 seconds to load)
40 examples, 40 failures

see this question on parsing indentation with Parslet.

Parslet : exclusion clause

You can do something like this:

rule(:word) { match['^")(\\s'].repeat(1) } # normal word
rule(:op) { str('AND') | str('OR') | str('NOT') }
rule(:keyword) { str('all:') | str('any:') }
rule(:searchterm) { keyword.absent? >> op.absent? >> word }

In this case, the absent? does a lookahead to make sure the next token is not a keyword; if not, then it checks to make sure it's not an operator; if not, finally see if it's a valid word.

An equivalent rule would be:

rule(:searchterm) { (keyword | op).absent? >> word }

Ruby:parslet for a system verilog interface parser

Ok... this parses the file you mentioned. I don't understand the desired format so I can't say it will work for all your files, but hopefully this will get you started.

require 'parslet'

class MyParse < Parslet::Parser
rule(:lparen) { space? >> str('(') }
rule(:rparen) { space? >> str(')') }
rule(:lbox) { space? >> str('[') }
rule(:rbox) { space? >> str(']') }
rule(:lcurly) { space? >> str('{') }
rule(:rcurly) { space? >> str('}') }
rule(:comma) { space? >> str(',') }
rule(:semicolon) { space? >> str(';') }
rule(:eof) { any.absent? }
rule(:space) { match["\t\s"] }
rule(:whitespace) { space.repeat(1) }
rule(:space?) { space.repeat(0) }
rule(:blank_line) { space? >> newline.repeat(1) }
rule(:newline) { str("\n") }

# Things
rule(:integer) { space? >> match('[0-9]').repeat(1).as(:int) >> space? }
rule(:identifier) { match['a-z'].repeat(1) }

def line( expression )
space? >>
expression >>
space? >>
str(';') >>
space? >>
str("\n")
end

rule(:expression?) { ( interface ).repeat(0) }

rule(:interface) { intf_start >> interface_body.repeat(0) >> intf_end }

rule(:interface_body) {
intf_end.absent? >>
interface_bodyline >>
blank_line.repeat(0)
}

rule(:intf_start) {
line (
str('interface') >>
space? >>
( match['a-zA-Z_'].repeat(1,1) >>
match['[:alnum:]_'].repeat(0)).as(:intf_name)
)
}

rule(:interface_bodyline) {
line ( protocol | transmit )
}

rule(:protocol) {
str('protocol') >> whitespace >>
(str('validonly').maybe).as(:protocol)
}

rule(:transmit) {
str('transmit') >> whitespace >>
(bool | transmit_width) >> whitespace >>
name.as(:transmit_name)
}

rule(:name) {
match('[a-zA-Z_]') >>
(match['[:alnum:]'] | str("_")).repeat(0)
}

rule(:bool) { lbox >> str('Bool').as(:bool) >> rbox }

rule(:transmit_width) {
lbox >>
space? >>
match('[0-9]').repeat(1).as(:msb) >>
space? >>
str(':') >>
space? >>
match('[0-9]').repeat(1).as(:lsb) >>
space? >>
rbox
}

rule(:intf_end) { str('endinterface') }

root :expression?
end

require 'rspec'
require 'parslet/rig/rspec'

RSpec.describe MyParse do
let(:parser) { MyParse.new }
context "simple_rule" do
it "should consume protocol line" do
expect(parser.interface_bodyline).to parse(' protocol validonly;
')
end
it 'name' do
expect(parser.name).to parse('valid')
end
it "bool" do
expect(parser.bool).to parse('[Bool]')
end
it "transmit line" do
expect(parser.transmit).to parse('transmit [Bool] valid')
end
it "transmit as bodyline'" do
expect(parser.interface_bodyline).to parse(' transmit [Bool] valid;
')
end
end
end

RSpec::Core::Runner.run(['--format', 'documentation'])

begin
doc = File.read("test.txt")
MyParse.new.parse(doc)
rescue Parslet::ParseFailed => error
puts error.cause.ascii_tree
end

The main changes...

  • Don't consume whitespace both side of your tokens.
    You had expressions that parsed "[Bool] valid" as LBOX BOOL RBOX SPACE? then expected another WHITESPACE but couldn't find one (as the previous rule had consumed it).

  • When an expression can validly parse as a zero length (e.g. something with repeat(0)) and there is a problem with who it's written, then you get an odd error. The rule pass and match nothing, then the next rule will typically fail. I explicitly matched 'body lines' as 'not the end line' so it would fail with an error.

  • 'repeat' defaults to (0) which I would love to change. I see mistakes around this all the time.

  • x.repeat(1,1) means make one match. That's the same as having x. :)

  • there were more whitespace problems

so....

Write your parser from the top down. Write tests from the bottom up.
When your tests get to the top you are done! :)

Good luck.

SystemStackError: when parsing SCIM 2.0 filter query using Parslet

Always consume something before you recurse.

For example:
Don't define a list of numbers as

NumList = NumList >> "," >> NumList | Number

Defined it as

NumList = Number >> ("," >> NumList).maybe

or even

NumList = Number >> ("," >> Number).repeat(0)

So for logican_expression...

 # logExp = FILTER SP ("and" / "or") SP FILTER
rule(:logical_expression) do
filter >> space >> (and_op | or_op) >> space >> filter
end

You need the first filter to be something that can't be a logical_expression.

   # FILTER = attrExp / logExp / valuePath / *1"not" "(" FILTER ")"
rule(:filter) do
logical_expression | filter_atom
end

rule(:filter_atom) do
(not_op? >> lparen >> filter >> rparen) | attribute_expression | value_path
end

# logExp = FILTER SP ("and" / "or") SP FILTER
rule(:logical_expression) do
filter_atom >> space >> (and_op | or_op) >> space >> filter
end

Fast and reliable way to find out if a source code file implements an interface

I would say that parsing a source code file is both, fast and reliable. But "fast" is such a vague notion that it really depends, I guess. I wouldn't expect too much overhead when compared to scanning the source file for occurrences of the words "implements", though -- thus if the latter is fine for you, I'd assume the former should be acceptable too?

The javax.tools.* API would be a good entrance point to get started; however, there are also a number of (open source) source code parsers for Java out there.

Also, here is an introductory blog post on Oracle's websites.

RESTful web services and HTTP verbs

Yes, you can live without PUT and DELETE.

This article tells you why:
http://www.artima.com/lejava/articles/why_put_and_delete.html

While to true RESTafrians this may be heresy, in the real world you do what you can, with what you have. Be as rational as you can and as consistent with your own convention as you can, but you can definitely build a good RESTful system without P and D.

rp



Related Topics



Leave a reply



Submit