How to Provider User with Autocomplete Suggestions for Given Boost::Spirit Grammar

How to provider user with autocomplete suggestions for given boost::spirit grammar?

Out of curiosity, I decided to try the concept.

Here's my attempt.

Plan

Let's parse arithmetic expressions with function calls.

Now, we want to parse (partial) expression with possible unknown identifiers.

In case of incomplete expressions, we want to "imply" the minimal token to complete it (and potentially continue parsing).

In case of unknown identifiers, we want to fuzzy-match the known identifiers in the domain (either variables or functions) and rank them in order of decreasing probability.


Base Definitions

Let's start out by deciding we want our input to be in memory, so we can refer to locations/substrings by using string_views:

#include <boost/utility/string_view.hpp>

using Source = boost::string_view;
using Location = Source::const_iterator;

Completion Hints

Besides the AST, we want our parser to generate completion hints (the implicit completion tokens and suggestions)

namespace Completion {

using Candidates = std::vector<std::string>;

class Hints {
struct ByLocation {
template <typename T, typename U>
bool operator()(T const& a, U const& b) const { return loc(a) < loc(b); }
private:
static Location loc(Source const& s) { return s.begin(); }
static Location loc(Location const& l) { return l; }
};

public:
std::map<Location, std::string, ByLocation> incomplete;
std::map<Source, Candidates, ByLocation> suggestions;

/*explicit*/ operator bool() const { return incomplete.size() || suggestions.size(); }
};

In addition, let's code up a quick & dirty fuzzy identifier match scoring function.

I opted for a simple synchronizing compare that

  • scores corresponding runs of characters progressively, and
  • favours skipping characters from candidates over skipping characters from the input (meaning that adj_diff is a good fuzzy match for adjacent_difference even though characters have been skipped from the candidate, but adj_qqq_diff is worse because the qqq from the input could not be matched)
  • the algorithm is done recursively and without allocations
  • first characters get a boost if matched (rate=1 at first invocation)


    static int fuzzy_match(Source input, boost::string_view candidate, int rate = 1) { // start with first-letter boost
int score = 0;

while (!(input.empty() || candidate.empty())) {
if (input.front() != candidate.front()) {
return score + std::max(
fuzzy_match(input.substr(1), candidate, std::max(rate-2,0)), //penalty for ignoring an input char
fuzzy_match(input, candidate.substr(1), std::max(rate-1,0)));
}

input.remove_prefix(1);
candidate.remove_prefix(1);
score += ++rate;
}
return score;
}

} // namespace Completion

We'll see how this is used in the grammar.

AST

A run-of-the-mill AST that can do binary expressions, string/numeric literals, variables and (nested) function calls:

#include <boost/variant.hpp>

namespace Ast {
using NumLiteral = double;
using StringLiteral = std::string; // de-escaped, not source view

struct Identifier : std::string {
using std::string::string;
using std::string::operator=;
};

struct BinaryExpression;
struct CallExpression;

using Expression = boost::variant<
NumLiteral,
StringLiteral,
Identifier,
boost::recursive_wrapper<BinaryExpression>,
boost::recursive_wrapper<CallExpression>
>;

struct BinaryExpression {
Expression lhs;
char op;
Expression rhs;
};

using ArgList = std::vector<Expression>;

struct CallExpression {
Identifier function;
ArgList args;
};
}

Grammar


Basics

The grammar starts off pretty "basic" as well:

namespace Parsing {
namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;

template <typename It>
struct Expression : qi::grammar<It, Ast::Expression()> {
Expression(Completion::Hints& hints) : Expression::base_type(start), _hints(hints) {
using namespace qi;

start = skip(space) [expression];

expression = term [_val = _1] >> *(char_("-+") >> expression) [_val = make_binary(_val, _1, _2)];
term = factor [_val = _1] >> *(char_("*/") >> term) [_val = make_binary(_val, _1, _2)];
factor = simple [_val = _1] >> *(char_("^") >> factor) [_val = make_binary(_val, _1, _2)];

simple = call | variable | compound | number | string;

So far so good: The constructor stores a reference to the Completion::Hints& to be recorded. All these rules have been declared as qi::rule<It, Expression(), qi::space_type>.

Implied Tokens

Now it gets slightly more interesting, some rules here are lexemes¹

            number     = double_;
identifier = raw[(alpha|'_') >> *(alnum|'_')];

And some productions are tolerant of missing closing tokens:

            // imply the closing quotes
string %= '"' >> *('\\' >> char_ | ~char_('"')) >> implied('"');
compound %= '(' >> expression >> implied(')');

The implied magic makes it so that if the expected closing token is missing, it will be recorded as a hint, and parsing continues:

            auto implied = [=](char ch) {
return copy(omit[lit(ch) | raw[eps][imply(_1, ch)]]);
};

Of course, this begs the question what imply(_1, ch) really does, and we'll see later.

For now, observe that we do raw[eps] to get an (empty) source iterator_range to construct a Location from in the semantic action.

Identifier Lookup

We store "symbol tables" in Domain members _variables and _functions:

        using Domain = qi::symbols<char>;
Domain _variables, _functions;

Then we declare some rules that can do lookups on either of them:

        // domain identifier lookups
qi::_r1_type _domain;
qi::rule<It, Ast::Identifier(Domain const&)> maybe_known, known, unknown;

The corresponding declarations will be shown shortly.

Variables are pretty simple:

        variable   = maybe_known(phx::ref(_variables));

Calls are trickier. If a name is unknown we don't want to assume it implies a function unless it's followed by a '(' character. However, if an identifier is a known function name, we want even to imply the ( (this gives the UX the appearance of autocompletion where when the user types sqrt, it suggests the next character to be ( magically).

        // The heuristics:
// - an unknown identifier followed by (
// - an unclosed argument list implies )
call %= ( known(phx::ref(_functions)) // known -> imply the parens
| &(identifier >> '(') >> unknown(phx::ref(_functions))
) >> implied('(') >> -(expression % ',') >> implied(')');

It all builds on known, unknown and maybe_known:

            ///////////////////////////////
// identifier loopkup, suggesting
{
maybe_known = known(_domain) | unknown(_domain);

// distinct to avoid partially-matching identifiers
using boost::spirit::repository::qi::distinct;
auto kw = distinct(copy(alnum | '_'));

known = raw[kw[lazy(_domain)]];
unknown = raw[identifier[_val=_1]] [suggest_for(_1, _domain)];
}

Debug, Predefined Variables/Functions

Remaining is a bit of red tape:

            BOOST_SPIRIT_DEBUG_NODES(
(start)
(expression)(term)(factor)(simple)(compound)
(call)(variable)
(identifier)(number)(string)
//(maybe_known)(known)(unknown) // qi::symbols<> non-streamable
)

_variables += "foo", "bar", "qux";
_functions += "print", "sin", "tan", "sqrt", "frobnicate";
}

private:
Completion::Hints& _hints;

using Domain = qi::symbols<char>;
Domain _variables, _functions;

qi::rule<It, Ast::Expression()> start;
qi::rule<It, Ast::Expression(), qi::space_type> expression, term, factor, simple;
// completables
qi::rule<It, Ast::Expression(), qi::space_type> compound;
qi::rule<It, Ast::CallExpression(), qi::space_type> call;

// implicit lexemes
qi::rule<It, Ast::Identifier()> variable, identifier;
qi::rule<It, Ast::NumLiteral()> number;
qi::rule<It, Ast::StringLiteral()> string;

// domain identifier lookups
qi::_r1_type _domain;
qi::rule<It, Ast::Identifier(Domain const&)> maybe_known, known, unknown;

Phoenix Helpers

Let's start with the usual helper to construct binary AST nodes:

        ///////////////////////////////
// binary expression factory
struct make_binary_f {
Ast::BinaryExpression operator()(Ast::Expression const& lhs, char op, Ast::Expression const& rhs) const {
return {lhs, op, rhs};
}
};
boost::phoenix::function<make_binary_f> make_binary;

Similarly, we can have a Phoenix function object to register implied chars:

        ///////////////////////////////
// auto-completing incomplete expressions
struct imply_f {
Completion::Hints& _hints;
void operator()(boost::iterator_range<It> where, char implied_char) const {
auto inserted =
_hints.incomplete.emplace(&*where.begin(), std::string(1, implied_char));
// add the implied char to existing completion
if (!inserted.second)
inserted.first->second += implied_char;
}
};
boost::phoenix::function<imply_f> imply { imply_f { _hints } };

Note that it orders by location (which makes UX easier) and detects when a previous implied token lived at the same location, in which case we simply append to it.

By now it won't be much of a surprise that generating relevant suggestions for unknown identifiers is also a phoenix actor:

        ///////////////////////////////
// suggest_for
struct suggester {
Completion::Hints& _hints;

void operator()(boost::iterator_range<It> where, Domain const& symbols) const {
using namespace Completion;
Source id(&*where.begin(), where.size());
Candidates c;

symbols.for_each([&](std::string const& k, ...) { c.push_back(k); });

auto score = [id](Source v) { return fuzzy_match(id, v); };
auto byscore = [=](Source a, Source b) { return score(a) > score(b); };

sort(c.begin(), c.end(), byscore);
c.erase( remove_if(c.begin(), c.end(), [=](Source s) { return score(s) < 3; }), c.end());

_hints.suggestions.emplace(id, c);
}
};
boost::phoenix::function<suggester> suggest_for {suggester{_hints}};

That's all. If it looks more complicated, that's because it does a lot more: it scores all candidates fuzzily, orders them by descending score and removes any candidates with score < 3.

    };
}

BONUS

Let's alter things a little more and allow ',' to be implied inside argument lists:

        call %= ( known(phx::ref(_functions)) // known -> imply the parens
| &(identifier >> '(') >> unknown(phx::ref(_functions))
)
>> implied('(')
>> -(expression % ',')
>> implied(')')
;

Let's replace ',' there:

            >> -(expression % (',' | !(')'|eoi) >> implied(',')))

NOTE: It might seem to make more sense to detect &expression to assert that an expression follows, instead asserting that the end of the input/argument list has not been reached.

Doing that goes bad, though, because any contained expressions could trigger more implied tokens as a side effect, leading to duplicated implied tokens.

This (side-effectful semantic actions) is one of the chief reasons why I usually avoid semantic actions, or use them to store effect only in the rule's (exposed) attribute(s). See Boost Spirit: "Semantic actions are evil"?

TEST DRIVER

This post would be nothing without some nice test cases. So here goes:

int main() {
for (Source const input : {
"", // invalid
"(3", // incomplete, imply ')'
"3*(6+sqrt(9))^7 - 1e8", // completely valid
"(3*(((6+sqrt(9))^7 - 1e8", // incomplete, imply ")))"
"print(\"hello \\\"world!", // completes the string literal and the parameter list
"foo", // okay, known variable
"baz", // (suggest bar)
"baz(", // incomplete, imply ')', unknown function
"taz(", // incomplete, imply ')', unknown function
"san(", // 2 suggestions (sin/tan)
"print(1, 2, \"three\", complicated(san(78",
"(print sqrt sin 9) -0) \"bye",
})
{
std::cout << "-------------- '" << input << "'\n";
Location f = input.begin(), l = input.end();

Ast::Expression expr;
Completion::Hints hints;
bool ok = parse(f, l, Parsing::Expression<Location>{hints}, expr);

if (hints) {
std::cout << "Input: '" << input << "'\n";
}
for (auto& c : hints.incomplete) {
std::cout << "Missing " << std::setw(c.first - input.begin()) << "" << "^ implied: '" << c.second << "'\n";
}
for (auto& id : hints.suggestions) {
std::cout << "Unknown " << std::setw(id.first.begin() - input.begin()) << "" << std::string(id.first.size(), '^');
if (id.second.empty())
std::cout << " (no suggestions)\n";
else {
std::cout << " (did you mean ";
once_t first;
for (auto& s : id.second)
std::cout << (first?"":" or ") << "'" << s << "'";
std::cout << "?)\n";
}
}

if (ok) {
std::cout << "AST: " << expr << "\n";
} else {
std::cout << "Parse failed\n";
}

if (f!=l)
std::cout << "Remaining input: '" << std::string(f,l) << "'\n";
}
}

Note that, besides the first input ("") everything is heuristically parsed as some kind of expression! The last one is designed to show off how all the parameter lists are implied because print, sqrt and sin are known functions. Then some ',' are implied, before finally the unclosed string literal and remaining parentheses are closed. Here's the (non-debug) output:

-------------- ''
Parse failed
-------------- '(3'
Input: '(3'
Missing ^ implied: ')'
AST: 3
-------------- '3*(6+sqrt(9))^7 - 1e8'
AST: ((3 * ((6 + (sqrt (9))) ^ 7)) - 1e+08)
-------------- '(3*(((6+sqrt(9))^7 - 1e8'
Input: '(3*(((6+sqrt(9))^7 - 1e8'
Missing ^ implied: ')))'
AST: (3 * (((6 + (sqrt (9))) ^ 7) - 1e+08))
-------------- 'print("hello \"world!'
Input: 'print("hello \"world!'
Missing ^ implied: '")'
AST: (print (hello "world!))
-------------- 'foo'
AST: foo
-------------- 'baz'
Input: 'baz'
Unknown ^^^ (did you mean 'bar'?)
AST: baz
-------------- 'baz('
Input: 'baz('
Missing ^ implied: ')'
Unknown ^^^ (no suggestions)
AST: (baz ())
-------------- 'taz('
Input: 'taz('
Missing ^ implied: ')'
Unknown ^^^ (did you mean 'tan'?)
AST: (taz ())
-------------- 'san('
Input: 'san('
Missing ^ implied: ')'
Unknown ^^^ (did you mean 'sin' or 'tan'?)
AST: (san ())
-------------- 'print(1, 2, "three", complicated(san(78'
Input: 'print(1, 2, "three", complicated(san(78'
Missing ^ implied: ')))'
Unknown ^^^^^^^^^^^ (did you mean 'frobnicate' or 'print'?)
Unknown ^^^ (did you mean 'sin' or 'tan'?)
AST: (print (1, 2, three, (complicated ((san (78))))))
-------------- '(print sqrt sin 9) -0) "bye'
Input: '(print sqrt sin 9) -0) "bye'
Missing ^ implied: '('
Missing ^ implied: '('
Missing ^ implied: '('
Missing ^ implied: ','
Missing ^ implied: '"))'
AST: (print ((sqrt (((sin (9)) - 0))), bye))

Full Listing / Live Demo

Live On Coliru

//#define BOOST_SPIRIT_DEBUG
#include <boost/utility/string_view.hpp>

using Source = boost::string_view;
using Location = Source::const_iterator;

#include <map>
#include <vector>

namespace Completion {

static int fuzzy_match(Source input, boost::string_view candidate, int rate = 1) { // start with first-letter boost
int score = 0;

while (!(input.empty() || candidate.empty())) {
if (input.front() != candidate.front()) {
return score + std::max(
fuzzy_match(input.substr(1), candidate, std::max(rate-2,0)), //penalty for ignoring an input char
fuzzy_match(input, candidate.substr(1), std::max(rate-1,0)));
}

input.remove_prefix(1);
candidate.remove_prefix(1);
score += ++rate;
}
return score;
}

using Candidates = std::vector<std::string>;

class Hints {
struct ByLocation {
template <typename T, typename U>
bool operator()(T const& a, U const& b) const { return loc(a) < loc(b); }
private:
static Location loc(Source const& s) { return s.begin(); }
static Location loc(Location const& l) { return l; }
};

public:
std::map<Location, std::string, ByLocation> incomplete;
std::map<Source, Candidates, ByLocation> suggestions;

/*explicit*/ operator bool() const { return incomplete.size() || suggestions.size(); }
};
}

#include <boost/variant.hpp>

namespace Ast {
using NumLiteral = double;
using StringLiteral = std::string; // de-escaped, not source view

struct Identifier : std::string {
using std::string::string;
using std::string::operator=;
};

struct BinaryExpression;
struct CallExpression;

using Expression = boost::variant<
NumLiteral,
StringLiteral,
Identifier,
boost::recursive_wrapper<BinaryExpression>,
boost::recursive_wrapper<CallExpression>
>;

struct BinaryExpression {
Expression lhs;
char op;
Expression rhs;
};

using ArgList = std::vector<Expression>;

struct CallExpression {
Identifier function;
ArgList args;
};
}

#include <boost/fusion/adapted/struct.hpp>
BOOST_FUSION_ADAPT_STRUCT(Ast::BinaryExpression, lhs, op, rhs)
BOOST_FUSION_ADAPT_STRUCT(Ast::CallExpression, function, args)

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/spirit/repository/include/qi_distinct.hpp>

// for debug printing:
namespace {
struct once_t { // an auto-reset flag
operator bool() { bool v = flag; flag = false; return v; }
bool flag = true;
};
}

// for debug printing:
namespace Ast {

static inline std::ostream& operator<<(std::ostream& os, std::vector<Expression> const& args) {
os << "(";
once_t first;
for (auto& a : args) os << (first?"":", ") << a;
return os << ")";
}

static inline std::ostream& operator<<(std::ostream& os, BinaryExpression const& e) {
return os << boost::fusion::as_vector(e);
}
static inline std::ostream& operator<<(std::ostream& os, CallExpression const& e) {
return os << boost::fusion::as_vector(e);
}
}

namespace Parsing {
namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;

template <typename It>
struct Expression : qi::grammar<It, Ast::Expression()> {
Expression(Completion::Hints& hints) : Expression::base_type(start), _hints(hints) {
using namespace qi;

start = skip(space) [expression];

expression = term [_val = _1] >> *(char_("-+") >> expression) [_val = make_binary(_val, _1, _2)];
term = factor [_val = _1] >> *(char_("*/") >> term) [_val = make_binary(_val, _1, _2)];
factor = simple [_val = _1] >> *(char_("^") >> factor) [_val = make_binary(_val, _1, _2)];

simple = call | variable | compound | number | string;

auto implied = [=](char ch) {
return copy(omit[lit(ch) | raw[eps][imply(_1, ch)]]);
};

variable = maybe_known(phx::ref(_variables));

compound %= '(' >> expression >> implied(')');

// The heuristics:
// - an unknown identifier followed by (
// - an unclosed argument list implies )
call %= ( known(phx::ref(_functions)) // known -> imply the parens
| &(identifier >> '(') >> unknown(phx::ref(_functions))
)
>> implied('(')
>> -(expression % (',' | !(')'|eoi) >> implied(',')))
>> implied(')')
;

// lexemes, primitive rules
identifier = raw[(alpha|'_') >> *(alnum|'_')];

// imply the closing quotes
string %= '"' >> *('\\' >> char_ | ~char_('"')) >> implied('"'); // TODO more escapes

number = double_; // TODO integral arguments

///////////////////////////////
// identifier loopkup, suggesting
{
maybe_known = known(_domain) | unknown(_domain);

// distinct to avoid partially-matching identifiers
using boost::spirit::repository::qi::distinct;
auto kw = distinct(copy(alnum | '_'));

known = raw[kw[lazy(_domain)]];
unknown = raw[identifier[_val=_1]] [suggest_for(_1, _domain)];
}

BOOST_SPIRIT_DEBUG_NODES(
(start)
(expression)(term)(factor)(simple)(compound)
(call)(variable)
(identifier)(number)(string)
//(maybe_known)(known)(unknown) // qi::symbols<> non-streamable
)

_variables += "foo", "bar", "qux";
_functions += "print", "sin", "tan", "sqrt", "frobnicate";
}

private:
Completion::Hints& _hints;

using Domain = qi::symbols<char>;
Domain _variables, _functions;

qi::rule<It, Ast::Expression()> start;
qi::rule<It, Ast::Expression(), qi::space_type> expression, term, factor, simple;
// completables
qi::rule<It, Ast::Expression(), qi::space_type> compound;
qi::rule<It, Ast::CallExpression(), qi::space_type> call;

// implicit lexemes
qi::rule<It, Ast::Identifier()> variable, identifier;
qi::rule<It, Ast::NumLiteral()> number;
qi::rule<It, Ast::StringLiteral()> string;

// domain identifier lookups
qi::_r1_type _domain;
qi::rule<It, Ast::Identifier(Domain const&)> maybe_known, known, unknown;

///////////////////////////////
// binary expression factory
struct make_binary_f {
Ast::BinaryExpression operator()(Ast::Expression const& lhs, char op, Ast::Expression const& rhs) const {
return {lhs, op, rhs};
}
};
boost::phoenix::function<make_binary_f> make_binary;

///////////////////////////////
// auto-completing incomplete expressions
struct imply_f {
Completion::Hints& _hints;
void operator()(boost::iterator_range<It> where, char implied_char) const {
auto inserted =
_hints.incomplete.emplace(&*where.begin(), std::string(1, implied_char));
// add the implied char to existing completion
if (!inserted.second)
inserted.first->second += implied_char;
}
};
boost::phoenix::function<imply_f> imply { imply_f { _hints } };

///////////////////////////////
// suggest_for
struct suggester {
Completion::Hints& _hints;

void operator()(boost::iterator_range<It> where, Domain const& symbols) const {
using namespace Completion;
Source id(&*where.begin(), where.size());
Candidates c;

symbols.for_each([&](std::string const& k, ...) { c.push_back(k); });

auto score = [id](Source v) { return fuzzy_match(id, v); };
auto byscore = [=](Source a, Source b) { return score(a) > score(b); };

sort(c.begin(), c.end(), byscore);
c.erase( remove_if(c.begin(), c.end(), [=](Source s) { return score(s) < 3; }), c.end());

_hints.suggestions.emplace(id, c);
}
};
boost::phoenix::function<suggester> suggest_for {suggester{_hints}};

};
}

#include <iostream>
#include <iomanip>

int main() {
for (Source const input : {
"", // invalid
"(3", // incomplete, imply ')'
"3*(6+sqrt(9))^7 - 1e8", // completely valid
"(3*(((6+sqrt(9))^7 - 1e8", // incomplete, imply ")))"
"print(\"hello \\\"world!", // completes the string literal and the parameter list
"foo", // okay, known variable
"baz", // (suggest bar)
"baz(", // incomplete, imply ')', unknown function
"taz(", // incomplete, imply ')', unknown function
"san(", // 2 suggestions (sin/tan)
"print(1, 2, \"three\", complicated(san(78",
"(print sqrt sin 9) -0) \"bye",
})
{
std::cout << "-------------- '" << input << "'\n";
Location f = input.begin(), l = input.end();

Ast::Expression expr;
Completion::Hints hints;
bool ok = parse(f, l, Parsing::Expression<Location>{hints}, expr);

if (hints) {
std::cout << "Input: '" << input << "'\n";
}
for (auto& c : hints.incomplete) {
std::cout << "Missing " << std::setw(c.first - input.begin()) << "" << "^ implied: '" << c.second << "'\n";
}
for (auto& id : hints.suggestions) {
std::cout << "Unknown " << std::setw(id.first.begin() - input.begin()) << "" << std::string(id.first.size(), '^');
if (id.second.empty())
std::cout << " (no suggestions)\n";
else {
std::cout << " (did you mean ";
once_t first;
for (auto& s : id.second)
std::cout << (first?"":" or ") << "'" << s << "'";
std::cout << "?)\n";
}
}

if (ok) {
std::cout << "AST: " << expr << "\n";
} else {
std::cout << "Parse failed\n";
}

if (f!=l)
std::cout << "Remaining input: '" << std::string(f,l) << "'\n";
}
}

¹ Boost spirit skipper issues

How to provide user with advanced autocomplete suggestions for given boost::spirit grammar?

In relation to syntax highlighting, have you seen these?

  • Map input to ast types in boost spirit
  • Cannot get Boost Spirit grammar to use known keys for std::map<>

In relation to the second part of the question about context-sensitive completion, you simply need semantic analysis. The most important thing to remember here is to separate parsing and semantic analysis to simplify your life.

In the original answer we already have an AST capable of representing partially correct inputs.
This is all you need.



Related Topics



Leave a reply



Submit