Using Boost Tokenizer Escaped_List_Separator with Different Parameters

Try this:

#include <iostream>
#include <boost/tokenizer.hpp>
#include <string>

int main()
{
    using namespace std;
    using namespace boost;
    string s = "exec script1 \"script argument number one\"";
    string separator1("");     // don't let quoted arguments escape themselves (no escape character)
    string separator2(" ");    // split on spaces
    string separator3("\"\'"); // let it have quoted arguments (double or single quotes)

    escaped_list_separator<char> els(separator1, separator2, separator3);
    tokenizer<escaped_list_separator<char>> tok(s, els);

    for (tokenizer<escaped_list_separator<char>>::iterator beg = tok.begin(); beg != tok.end(); ++beg)
    {
        cout << *beg << "\n";
    }
}

Splitting string with multiple delimiters, allowing quoted values

Judging by the Boost Tokenizer documentation, you are correct: when boost::escaped_list_separator encounters multiple consecutive delimiters, it produces empty tokens. Unlike boost::char_separator, boost::escaped_list_separator does not provide a constructor that lets you choose whether to keep or discard empty tokens.

While having the option to discard empty tokens can be nice, when you consider the use case (parsing CSV files) presented in the documentation (http://www.boost.org/doc/libs/1_64_0/libs/tokenizer/escaped_list_separator.htm), keeping empty tokens makes perfect sense. An empty field is still a field.

One option is to simply discard empty tokens after tokenizing. If the generation of empty tokens concerns you, an alternative is to remove repeated delimiters from the input before passing it to the tokenizer, but you will need to take care not to remove anything inside quotes.
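Both options can be sketched with the standard library alone. The helper names drop_empty and collapse_delimiters are hypothetical, and the second helper assumes a single delimiter character, a single quote character, and no escape character:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Option 1 (sketch): discard empty tokens after tokenizing.
// `tokens` is whatever container was filled from the tokenizer.
void drop_empty(std::vector<std::string>& tokens) {
    tokens.erase(std::remove_if(tokens.begin(), tokens.end(),
                                [](const std::string& t) { return t.empty(); }),
                 tokens.end());
}

// Option 2 (sketch): collapse runs of the delimiter before tokenizing,
// leaving everything between quote characters untouched.
// Assumption: one delimiter, one quote character, no escape character.
std::string collapse_delimiters(const std::string& in,
                                char delim = ' ', char quote = '"') {
    std::string out;
    out.reserve(in.size());
    bool in_quotes = false;
    for (char c : in) {
        if (c == quote) {
            in_quotes = !in_quotes;  // entering or leaving a quoted section
        } else if (c == delim && !in_quotes &&
                   !out.empty() && out.back() == delim) {
            continue;                // repeated delimiter outside quotes: skip it
        }
        out.push_back(c);
    }
    return out;
}
```

The second helper only collapses runs; a delimiter at the very start or end of the input is kept and may still yield an empty token, so the two helpers can be combined if that matters.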

boost tokenizer / char separator

You forgot the newline separator: string separator2(",\n");

#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
#include <boost/algorithm/string.hpp>

using namespace std;
using namespace boost;

int main() {
    string str = "TEst,hola\nhola";
    string separator1("");    // don't let quoted arguments escape themselves
    string separator2(",\n"); // split on comma and newline
    string separator3("\"");  // let it have quoted arguments

    escaped_list_separator<char> els(separator1, separator2, separator3);
    tokenizer<escaped_list_separator<char>> tok(str, els);

    int counter = 0;
    string next;

    for (tokenizer<escaped_list_separator<char>>::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
        next = *beg;
        boost::trim(next); // strip surrounding whitespace from each token
        cout << counter << " " << next << endl;
        counter++;
    }
    return 0;
}

Using escaped_list_separator with boost split

There doesn't seem to be a simple way to do this with boost::split. The shortest piece of code I can find that does it is:

vector<string> tokens;
tokenizer<escaped_list_separator<char> > t(str, escaped_list_separator<char>("\\", ",", "\""));
BOOST_FOREACH(string s, t)
    tokens.push_back(s);

which is only marginally more verbose than the original snippet:

vector<string> tokens;  
boost::split(tokens, str, boost::is_any_of(","));

Boost tokenizer to treat quoted string as one token

Try the code below. It avoids both Boost.Tokenizer and Boost.Spirit:

#include <vector>
#include <string>
#include <iostream>

const char Separators[] = { ' ', '\t' };

bool Str_IsSeparator( const char Ch )
{
for ( size_t i = 0; i != sizeof( Separators ); i++ )
{
if ( Separators[i] == Ch ) { return true; }
}

return false;
}

void SplitLine( size_t FromToken, size_t ToToken, const std::string& Str, std::vector<std::string>& Components /*, bool ShouldTrimSpaces*/ )
{
size_t TokenNum = 0;
size_t Offset = FromToken - 1;

const char* CStr = Str.c_str();
const char* CStrj = Str.c_str();

while ( *CStr )
{
// bypass spaces & delimiting chars
while ( *CStr && Str_IsSeparator( *CStr ) ) { CStr++; }

if ( !*CStr ) { return; }

bool InsideQuotes = ( *CStr == '\"' );

if ( InsideQuotes )
{
for ( CStrj = ++CStr; *CStrj && *CStrj != '\"'; CStrj++ );
}
else
{
for ( CStrj = CStr; *CStrj && !Str_IsSeparator( *CStrj ); CStrj++ );
}

// extract token
if ( CStr != CStrj )
{
TokenNum++;

// store each token found
if ( TokenNum >= FromToken )
{
Components[ TokenNum-Offset ].assign( CStr, CStrj );
// if ( ShouldTrimSpaces ) { Str_TrimSpaces( &Components[ TokenNum-Offset ] ); }
// proceed to next token
if ( TokenNum >= ToToken ) { return; }
}
CStr = CStrj;

// exclude last " from token, handle EOL
if ( *CStr ) { CStr++; }
}
}
}

int main()
{
std::string test = "1st 2nd \"3rd with some comment\" 4th";
std::vector<std::string> Out;

Out.resize(5);
SplitLine(1, 4, test, Out);

for(size_t j = 0 ; j != Out.size() ; j++) { std::cout << Out[j] << std::endl; }

return 0;
}

It fills a preallocated vector of strings (indices start at 1 rather than 0, but that is easy to change) and is fairly simple.
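A zero-based variant might look like the sketch below. It is not the code above with an index shifted: the hypothetical split_quoted returns all tokens in a freshly built vector instead of filling a preallocated one, drops the FromToken/ToToken window, and keeps an empty quoted token ("") as an empty string. Like the original, it treats space and tab as separators and does not handle escape characters.

```cpp
#include <string>
#include <vector>

// Sketch: split on spaces/tabs, treating a double-quoted run as one token.
// Tokens are returned zero-based in a new vector.
std::vector<std::string> split_quoted(const std::string& line) {
    std::vector<std::string> tokens;
    const std::size_t n = line.size();
    std::size_t i = 0;
    while (i < n) {
        // Skip separators between tokens.
        while (i < n && (line[i] == ' ' || line[i] == '\t')) ++i;
        if (i == n) break;

        std::size_t start, end;
        if (line[i] == '"') {
            start = ++i;                       // token begins after the opening quote
            end = line.find('"', start);
            if (end == std::string::npos) end = n;  // unterminated quote: take the rest
            i = (end < n) ? end + 1 : n;       // continue after the closing quote
        } else {
            start = i;
            end = line.find_first_of(" \t", start);
            if (end == std::string::npos) end = n;
            i = end;
        }
        tokens.push_back(line.substr(start, end - start));
    }
    return tokens;
}
```

Called with the test string from main above, it should yield four tokens, with the quoted comment as a single entry at index 2.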

Tokenize a Braced Initializer List-Style String in C++ (With Boost?)

Simple (Flat) Take

Defining a flat data structure like:

using token  = std::string;
using tokens = std::vector<token>;

We can define an X3 parser like:

namespace Parser {
using namespace boost::spirit::x3;

rule<struct list_, token> item;

auto quoted = lexeme [ '"' >> *('\\' >> char_ | ~char_('"')) >> '"' ];
auto bare = lexeme [ +(graph-','-'}') ];

auto list = '{' >> (item % ',') >> '}';
auto sublist = raw [ list ];

auto item_def = sublist | quoted | bare;

BOOST_SPIRIT_DEFINE(item)
}

Live On Wandbox

#include <boost/spirit/home/x3.hpp>
#include <iostream>
#include <iomanip>

using token = std::string;
using tokens = std::vector<token>;

namespace x3 = boost::spirit::x3;

namespace Parser {
using namespace boost::spirit::x3;

rule<struct list_, token> item;

auto quoted = lexeme [ '"' >> *('\\' >> char_ | ~char_('"')) >> '"' ];
auto bare = lexeme [ +(graph-','-'}') ];

auto list = '{' >> (item % ',') >> '}';
auto sublist = raw [ list ];

auto item_def = sublist | quoted | bare;

BOOST_SPIRIT_DEFINE(item)
}

int main() {
for (std::string const input : {
R"({one, "five, six"})",
R"({one, {2, "three four"}, "five, six", {"seven, eight"}})",
})
{
auto f = input.begin(), l = input.end();

std::vector<std::string> parsed;
bool ok = phrase_parse(f, l, Parser::list, x3::space, parsed);

if (ok) {
std::cout << "Parsed: " << parsed.size() << " elements\n";
for (auto& el : parsed) {
std::cout << " - " << std::quoted(el, '\'') << "\n";
}
} else {
std::cout << "Parse failed\n";
}

if (f != l)
std::cout << "Remaining unparsed: " << std::quoted(std::string{f, l}) << "\n";
}
}

Prints

Parsed: 2 elements
- 'one'
- 'five, six'
Parsed: 4 elements
- 'one'
- '{2, "three four"}'
- 'five, six'
- '{"seven, eight"}'

Nested Data

Changing the data structure to be a bit more specific/realistic:

namespace ast {
using value = boost::make_recursive_variant<
double,
std::string,
std::vector<boost::recursive_variant_>
>::type;
using list = std::vector<value>;
}

Now we can change the grammar, as we no longer need to treat sublist as if it is a string:

namespace Parser {
using namespace boost::spirit::x3;

rule<struct item_, ast::value> item;

auto quoted = lexeme [ '"' >> *('\\' >> char_ | ~char_('"')) >> '"' ];
auto bare = lexeme [ +(graph-','-'}') ];

auto list = x3::rule<struct list_, ast::list> {"list" }
= '{' >> (item % ',') >> '}';

auto item_def = list | double_ | quoted | bare;

BOOST_SPIRIT_DEFINE(item)
}

Everything "still works": Live On Wandbox

#include <boost/spirit/home/x3.hpp>
#include <iostream>
#include <iomanip>

namespace ast {
using value = boost::make_recursive_variant<
double,
std::string,
std::vector<boost::recursive_variant_>
>::type;
using list = std::vector<value>;
}

namespace x3 = boost::spirit::x3;

namespace Parser {
using namespace boost::spirit::x3;

rule<struct item_, ast::value> item;

auto quoted = lexeme [ '"' >> *('\\' >> char_ | ~char_('"')) >> '"' ];
auto bare = lexeme [ +(graph-','-'}') ];

auto list = x3::rule<struct list_, ast::list> {"list" }
= '{' >> (item % ',') >> '}';

auto item_def = list | double_ | quoted | bare;

BOOST_SPIRIT_DEFINE(item)
}

struct pretty_printer {
using result_type = void;
std::ostream& _os;
int _indent;

pretty_printer(std::ostream& os, int indent = 0) : _os(os), _indent(indent) {}

void operator()(ast::value const& v) { boost::apply_visitor(*this, v); }

void operator()(double v) { _os << v; }
void operator()(std::string s) { _os << std::quoted(s); }
void operator()(ast::list const& l) {
_os << "{\n";
_indent += 2;
for (auto& item : l) {
_os << std::setw(_indent) << "";
operator()(item);
_os << ",\n";
}
_indent -= 2;
_os << std::setw(_indent) << "" << "}";
}
};

int main() {
pretty_printer print{std::cout};

for (std::string const input : {
R"({one, "five, six"})",
R"({one, {2, "three four"}, "five, six", {"seven, eight"}})",
})
{
auto f = input.begin(), l = input.end();

ast::value parsed;
bool ok = phrase_parse(f, l, Parser::item, x3::space, parsed);

if (ok) {
std::cout << "Parsed: ";
print(parsed);
std::cout << "\n";
} else {
std::cout << "Parse failed\n";
}

if (f != l)
std::cout << "Remaining unparsed: " << std::quoted(std::string{f, l}) << "\n";
}
}

Prints:

Parsed: {
"one",
"five, six",
}
Parsed: {
"one",
{
2,
"three four",
},
"five, six",
{
"seven, eight",
},
}

How to define boost tokenizer to return boost::iterator_range<const char*>

As I said, you want parsing, not splitting. Specifically, if you were to split the input into iterator ranges, you would have to repeat the effort of parsing e.g. quoted constructs to get the intended (unquoted) value.

I'd go by your specifications with Boost Spirit:

using Attribute = std::pair<std::string /*key*/, //
std::string /*value*/>;
using Line = std::vector<Attribute>;
using File = std::vector<Line>;

A Grammar

Now using X3 we can write expressions to define the syntax:

auto file      = x3::skip(x3::blank)[ line % x3::eol ];

Within a file, blank space (std::isblank) is generally skipped.

The content consists of one or more lines separated by newlines.

auto line      = attribute % ';';

A line consists of one or more attributes separated by ';'

auto attribute = field >> -x3::lit('=') >> field;
auto field = quoted | unquoted;

An attribute is two fields, optionally separated by =. Note that each field is either a quoted or unquoted value.

Now, things get a little more tricky: when defining the field rules we want them to be "lexemes", i.e. any whitespace is not to be skipped.

auto unquoted = x3::lexeme[+(x3::graph - ';' - '=')];

Note how graph already excludes whitespace (see std::isgraph). In addition we prohibit a naked ';' or '=' so that we don't run into the next attribute/field.

For fields that may contain whitespace and/or those special characters, we define the quoted lexeme:

auto quoted      = x3::lexeme['"' >> *quoted_char >> '"'];

So, that's just "" with any number of quoted characters in between, where

auto quoted_char = '\\' >> x3::char_ | ~x3::char_('"');

the character can be anything escaped with \ OR any character other than the closing quote.

TEST TIME

Let's exercise it: Live On Compiler Explorer

for (std::string const& str :
{
R"(a 1)",
R"(b = 2 )",
R"("c"="3")",
R"(a=1;two 222;three "3 3 3")",
R"(b=2;three 333;four "4 4 4"
c=3;four 444;five "5 5 5")",
// special cases
R"("e=" "5")",
R"("f=""7")",
R"("g="="8")",
R"("\"Hello\\ World\\!\"" '8')",
R"("h=10;i=11;" bogus;yup "nope")",
// not ok?
R"(h i j)",
// allowing empty lines/attributes?
"",
"a 1;",
";",
";;",
R"(a=1;two 222;three "3 3 3"

n=1;gjb 222;guerr "3 3 3"
)",
}) //
{
File contents;
if (parse(begin(str), end(str), parser::file, contents))
fmt::print("Parsed:\n\t- {}\n", fmt::join(contents, "\n\t- "));
else
fmt::print("Not Parsed\n");
}

Prints

Parsed:
- {("a", "1")}
Parsed:
- {("b", "2")}
Parsed:
- {("c", "3")}
Parsed:
- {("a", "1"), ("two", "222"), ("three", "3 3 3")}
Parsed:
- {("b", "2"), ("three", "333"), ("four", "4 4 4")}
- {("c", "3"), ("four", "444"), ("five", "5 5 5")}
Parsed:
- {("e=", "5")}
Parsed:
- {("f=", "7")}
Parsed:
- {("g=", "8")}
Parsed:
- {(""Hello\ World\!"", "'8'")}
Parsed:
- {("h=10;i=11;", "bogus"), ("yup", "nope")}
Not Parsed
Not Parsed
Not Parsed
Not Parsed
Not Parsed
Not Parsed

Allowing empty elements

This is as simple as replacing line with:

auto line = -(attribute % ';');

To also allow redundant separators:

auto line = -(attribute % +x3::lit(';')) >> *x3::lit(';');

See that Live On Compiler Explorer

Insisting on Iterator Ranges

I explained above why I think this is a bad idea. Consider how you would correctly interpret the key/value from this line:

"\"Hello\\ World\\!\"" '8'

You simply don't want to deal with the grammar outside the parser. However, maybe your data is a 10 gigabyte memory mapped file:

using Field     = boost::iterator_range<std::string::const_iterator>;
using Attribute = std::pair<Field /*key*/, //
Field /*value*/>;

And then add x3::raw[] to the lexemes:

auto quoted      = x3::lexeme[x3::raw['"' >> *quoted_char >> '"']];

auto unquoted = x3::lexeme[x3::raw[+(x3::graph - ';' - '=')]];

See it Live On Compiler Explorer
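To make the cost of raw ranges concrete: once the parser hands back ranges, the caller has to redo the unquoting that the quoted rule otherwise performs. The following is a minimal sketch of that extra step, assuming C++17 and that the raw text of one field is available as a std::string_view; the name field_value is hypothetical and not part of the grammar above.

```cpp
#include <string>
#include <string_view>

// Sketch: recover the value of a field from its raw text.
// Mirrors quoted_char: inside quotes, a backslash makes the next character literal.
std::string field_value(std::string_view raw) {
    // Unquoted field: the raw text is already the value.
    if (raw.size() < 2 || raw.front() != '"' || raw.back() != '"')
        return std::string(raw);

    raw = raw.substr(1, raw.size() - 2);  // strip the surrounding quotes
    std::string out;
    out.reserve(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i) {
        // Skip the backslash and keep the character that follows it.
        if (raw[i] == '\\' && i + 1 < raw.size()) ++i;
        out.push_back(raw[i]);
    }
    return out;
}
```

This duplicates knowledge of the quoting rules outside the parser, which is exactly the drawback described above; it only pays off when copying the values during parsing is too expensive.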


