Simulate PHP Array Language Construct or Parse with Regexp

Simulate php array language construct or parse with regexp?

While writing a parser using the Tokenizer, which turned out to be harder than I expected, I came up with another idea: why not parse the array using eval, but first validate that it contains nothing harmful?

So, what the code does: it checks the tokens of the array against a whitelist of allowed tokens and characters and then runs eval. I hope I included all harmless tokens; if not, simply add them. (I intentionally left out HEREDOC and NOWDOC because I think they are unlikely to be used.)

function parseArray($code) {
    $allowedTokens = array(
        T_ARRAY => true,
        T_CONSTANT_ENCAPSED_STRING => true,
        T_LNUMBER => true,
        T_DNUMBER => true,
        T_DOUBLE_ARROW => true,
        T_WHITESPACE => true,
    );
    $allowedChars = array(
        '(' => true,
        ')' => true,
        ',' => true,
    );

    $tokens = token_get_all('<?php '.$code);
    array_shift($tokens); // remove opening php tag

    foreach ($tokens as $token) {
        // char token
        if (is_string($token)) {
            if (!isset($allowedChars[$token])) {
                throw new Exception('Disallowed token \''.$token.'\' encountered.');
            }
            continue;
        }

        // array token

        // true, false and null are okay, too
        if ($token[0] == T_STRING && ($token[1] == 'true' || $token[1] == 'false' || $token[1] == 'null')) {
            continue;
        }

        if (!isset($allowedTokens[$token[0]])) {
            throw new Exception('Disallowed token \''.token_name($token[0]).'\' encountered.');
        }
    }

    // fetch error messages
    ob_start();
    try {
        $ok = eval('$returnArray = '.$code.';');
    } catch (ParseError $e) {
        // PHP 7+ throws ParseError instead of returning false
        ob_end_clean();
        throw new Exception('Array couldn\'t be eval()\'d: '.$e->getMessage());
    }
    if (false === $ok) {
        throw new Exception('Array couldn\'t be eval()\'d: '.ob_get_clean());
    }
    ob_end_clean();
    return $returnArray;
}

var_dump(parseArray('array("a", "b", "c", array("1", "2", array("A", "B")), array("3", "4"), "d")'));

I think this is a good compromise between security and convenience: there is no need to write the parser yourself.

For example,

parseArray('exec("haha -i -thought -i -was -smart")');

would throw an exception:

Disallowed token 'T_STRING' encountered.
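
One thing the whitelist above does not look at is the order of the tokens. As far as I know, PHP 7 and later accept a call on a string literal, such as 'exec'('...'), and that expression consists only of allowed tokens. A stricter check could additionally require that every "(" directly follows the array keyword. Here is a minimal sketch of that idea; the function name isPlainArrayLiteral is made up for this example, and it only validates, it does not eval anything:

```php
<?php
// Hypothetical helper: true only if $code looks like a plain array(...) literal.
function isPlainArrayLiteral(string $code): bool
{
    $allowedTokens = [T_ARRAY, T_CONSTANT_ENCAPSED_STRING, T_LNUMBER,
                      T_DNUMBER, T_DOUBLE_ARROW];
    $keywords = ['true', 'false', 'null'];

    $tokens = token_get_all('<?php ' . $code);
    array_shift($tokens); // remove opening php tag

    $previous = null; // last non-whitespace token id or char
    foreach ($tokens as $token) {
        if (is_string($token)) {
            if ($token === '(') {
                // "(" may only open an array, never a call
                if ($previous !== T_ARRAY) {
                    return false;
                }
            } elseif ($token !== ')' && $token !== ',') {
                return false;
            }
            $previous = $token;
            continue;
        }

        list($id, $text) = $token;
        if ($id === T_WHITESPACE) {
            continue; // whitespace does not change what came before
        }
        $isKeyword = $id === T_STRING && in_array(strtolower($text), $keywords, true);
        if (!$isKeyword && !in_array($id, $allowedTokens, true)) {
            return false;
        }
        $previous = $id;
    }
    return true;
}
```

You would call it before eval and refuse the input when it returns false.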

PHP regex parsing - splitting tokens in my own language. Is there a better way?

[Update] To address the complexity and maintainability of the pattern, you can split it across several lines with the PCRE_EXTENDED modifier (x) and add comments. PHP has no g modifier; use preg_match_all if you want every match:

preg_match_all('/
# read constant (?)
\:((?:cons@(?:\d+(?:\.\d+)?|
# read a string (?)
(?:"(?:(?:\\\\)+"|[^"]|(?:\r\n|\r|\n))*")))|
# read an identifier (?)
(?:[a-z]+(?:@[a-z]+)?|
# read whatever
\^?[\~\&](?:[a-z]+|\d+|\-1)))
/x
', $input, $matches);

Beware that in extended mode all whitespace in the pattern is ignored unless it is escaped or sits inside a character class (an escape sequence such as \n is normally "safe").
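
To illustrate the whitespace rule, here is a small sketch. The pattern and the input are made up for the example; the point is only that a space has to be written as [ ] (or escaped) to be matched under the x modifier:

```php
<?php
// Made-up pattern: two numbers separated by exactly one space.
$pattern = '/
    ^(\d+)  # first number
    [ ]     # literal space, a bare space would be ignored under x
    (\d+)$  # second number
/x';

$matched = preg_match($pattern, '12 34', $m);
$unmatched = preg_match($pattern, '1234');
```

Also keep the pattern delimiter out of the comments, since it would end the pattern early.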


Now, if you want to pimp your lexer and parser, read on:

What (f)lex [GNU equivalent of LEX] does is simply let you pass a list of regexps, and optionally a "group". You can also try ANTLR and its PHP Target Runtime to get the work done.

As for your request, I wrote a lexer in the past that follows the principle of FLEX. The idea is to cycle through the regexps the way FLEX does:

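// each pattern must be anchored at the start (^), otherwise a match
// found later in the input would make substr() skip unlexed text
$regexp = [reg1 => STRING, reg2 => ID, reg3 => WS];
$input = ...;
$tokens = [];
while ($input !== '') {
    $best = null;
    $k = null;
    foreach ($regexp as $re => $kind) {
        // ignore empty matches, they would never consume any input
        if (preg_match($re, $input, $match) && $match[0] !== '') {
            $best = $match[0];
            $k = $kind;
            break;
        }
    }

    if (null === $best) {
        throw new Exception("could not analyze input, invalid token");
    }

    $tokens[] = ['kind' => $k, 'value' => $best];

    $input = substr($input, strlen($best)); // move.
}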

Since FLEX and Yacc/Bison integrate, the usual pattern is to read only up to the next token (that is, they don't run a loop that reads all the input before parsing).

The $regexp array can take any shape. I assumed a "regexp" => "kind" key/value map, but you can also use an array like this:

$regexp = [['reg' => '...', 'kind' => STRING], ...]

You can also enable or disable regexps using groups (the way FLEX groups work). For example, consider the following code:

class Foobar {
const FOOBAR = "arg";
function x() {...}
}

There is no need to activate the string regexp until you need to read an expression (here, the expression is what comes after the "="). And there is no need to activate the class identifier regexp when you are already inside a class.

FLEX's groups make it possible to read comments: a first regexp activates a group in which the other regexps are ignored, until some match occurs (like "*/").
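
A rough sketch of the group idea in PHP, assuming a rule table keyed by state. The state names, token kinds and patterns are all made up for the example; this is not how FLEX implements it internally:

```php
<?php
// Hypothetical rule table: state => list of [pattern, kind, next state or null].
const STATE_RULES = [
    'INITIAL' => [
        ['/\G\/\*/',    'COMMENT_START', 'COMMENT'],
        ['/\G[a-z]+/i', 'ID',            null],
        ['/\G\s+/',     'WS',            null],
    ],
    'COMMENT' => [
        ['/\G\*\//',               'COMMENT_END',  'INITIAL'],
        ['/\G(?:[^*]|\*(?!\/))+/', 'COMMENT_TEXT', null],
    ],
];

function lexWithStates(string $input): array
{
    $tokens = [];
    $state = 'INITIAL';
    $offset = 0;
    $length = strlen($input);
    while ($offset < $length) {
        $matched = false;
        // only the rules of the active state (group) are tried
        foreach (STATE_RULES[$state] as $rule) {
            list($re, $kind, $next) = $rule;
            if (preg_match($re, $input, $m, 0, $offset) === 1 && $m[0] !== '') {
                $tokens[] = [$kind, $m[0]];
                $offset += strlen($m[0]);
                if ($next !== null) {
                    $state = $next; // switch group
                }
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            throw new Exception("invalid token at offset $offset");
        }
    }
    return $tokens;
}
```

While the lexer is in the COMMENT state the ID and WS rules are simply not tried, so the comment text comes back as a single token.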

Note that this is a naïve approach: a lexer like FLEX actually generates an automaton, which uses different states to represent your needs (a regexp is itself an automaton).

It uses an algorithm based on packed indexes or something similar (I used the naïve "for each" because I did not understand the algorithm well enough), which is efficient in both memory and speed.

As I said, it was something I made in the past - something like 6/7 years ago.

  • It was on Windows.
  • It was not particularly quick (well it is O(N²) because of the two loops).
  • I also think that PHP was compiling the regexp each time. Now that I write Java, I use the Pattern implementation, which compiles the regexp once and lets you reuse it. I don't know whether PHP does the same by first looking in a regexp cache for an already compiled regexp.
  • I was using preg_match with an offset, to avoid doing the substr($input, ...) at the end.
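
The offset-based variant from the last bullet might look like the sketch below. The token kinds and patterns are made up for the example. Each pattern starts with \G so that it can only match exactly at the offset passed to preg_match:

```php
<?php
// Hypothetical token kinds and patterns, chosen only for illustration.
const RULES = [
    'STRING' => '/\G"[^"]*"/',
    'ID'     => '/\G[a-z]+/i',
    'NUMBER' => '/\G\d+/',
    'WS'     => '/\G\s+/',
];

function lex(string $input): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($input);
    while ($offset < $length) {
        $matched = false;
        foreach (RULES as $kind => $re) {
            // the fifth argument is the byte offset at which matching starts
            if (preg_match($re, $input, $m, 0, $offset) === 1 && $m[0] !== '') {
                $tokens[] = ['kind' => $kind, 'value' => $m[0]];
                $offset += strlen($m[0]); // move, without copying the input
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            throw new Exception("could not analyze input at offset $offset");
        }
    }
    return $tokens;
}
```

As in the loop above, the first matching rule wins, so the order of the rules matters.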

You should try the ANTLR3 PHP Code Generation Target: the ANTLR grammar editor is pretty easy to use, and you will end up with much more readable and maintainable code :)

Create set of all possible matches for a given regex

I have begun working on a solution on GitHub. It can already lex most examples and give the solution set for finite regexes.

It currently passes the following unit tests.

<?php

class RegexCompiler_Tests_MatchTest extends PHPUnit_Framework_TestCase
{

    function dataProviderForTestSimpleRead()
    {
        return array(
            array( "^ab$", array( "ab" ) ),
            array( "^(ab)$", array( "ab" ) ),
            array( "^(ab|ba)$", array( "ab", "ba" ) ),
            array( "^(ab|(b|c)a)$", array( "ab", "ba", "ca" ) ),
            array( "^(ab|ba){0,2}$", array( "", "ab", "ba", "abab", "abba", "baab", "baba" ) ),
            array( "^(ab|ba){1,2}$", array( "ab", "ba", "abab", "abba", "baab", "baba" ) ),
            array( "^(ab|ba){2}$", array( "abab", "abba", "baab", "baba" ) ),
            array( "^hello?$", array( "hell", "hello" ) ),
            array( "^(0|1){3}$", array( "000", "001", "010", "011", "100", "101", "110", "111" ) ),
            array( "^[1-9][0-9]{0,1}$", array_map( function( $input ) { return (string)$input; }, range( 1, 99 ) ) ),
            array( '^\n$', array( "\n" ) ),
            array( '^\r$', array( "\r" ) ),
            array( '^\t$', array( "\t" ) ),
            array( '^[\\\\\\]a\\-]$', array( "\\", "]", "a", "-" ) ), //the regex is actually '^[\\\]a\-]$' after PHP string parsing
            array( '^[\\n-\\r]$', array( chr( 10 ), chr( 11 ), chr( 12 ), chr( 13 ) ) ),
        );
    }

    /**
     * @dataProvider dataProviderForTestSimpleRead
     */
    function testSimpleRead( $regex_string, $expected_matches_array )
    {
        $lexer = new RegexCompiler_Lexer();
        $actual_matches_array = $lexer->lex( $regex_string )->getMatches();
        sort( $actual_matches_array );
        sort( $expected_matches_array );
        $this->assertSame( $expected_matches_array, $actual_matches_array );
    }

}

?>
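
To give an idea of what the solution set for a bounded repetition such as (ab|ba){0,2} involves, here is a small standalone sketch. It is not taken from the project; the function name expandRepeat is made up, and it only handles a flat list of literal alternatives:

```php
<?php
// Hypothetical helper: all strings made of $min..$max repetitions
// of any of the given literal alternatives.
function expandRepeat(array $alternatives, int $min, int $max): array
{
    $results = [];
    $current = ['']; // every string built from exactly $n repetitions
    for ($n = 0; $n <= $max; $n++) {
        if ($n >= $min) {
            $results = array_merge($results, $current);
        }
        // extend each string by one more alternative
        $next = [];
        foreach ($current as $prefix) {
            foreach ($alternatives as $alt) {
                $next[] = $prefix . $alt;
            }
        }
        $current = $next;
    }
    return array_values(array_unique($results));
}
```

The size of the result grows exponentially with the upper bound, which is one reason an iterator is attractive for larger or infinite sets.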

I would like to build a MatchIterator class that could handle infinite lists, as well as one that would randomly generate matches from the regex. I'd also like to look into building a regex from a match set as a way of optimizing lookups or compressing data.

Reliably convert string containing PHP array info to array

Is there a way to explode this reliably even if there are commas inside the desired chunks?

PHP by default does not provide such a function. However, your string contains a compact subset of PHP, and PHP offers some tools for that: a PHP tokenizer and a PHP parser.

Therefore, given your string specification, it is possible to create a helper function that validates the input against allowed tokens and then parses it:

$str = "array(1,3,4),array(array(4,5,6)),'this is a comma , inside a string', array('asdf' => 'lalal')";

function explode_string($str)
{
    // validate string
    $isValid = TRUE;
    $tokens = token_get_all(sprintf('<?php %s', $str));
    array_shift($tokens);
    // use the T_* constants: their numeric values differ between PHP versions
    $valid = array(T_LNUMBER, T_CONSTANT_ENCAPSED_STRING, T_DOUBLE_ARROW, T_ARRAY, T_WHITESPACE, '(', ')', ',');
    foreach($tokens as $token)
    {
        list($index) = (array) $token;
        if (!in_array($index, $valid, TRUE))
        {
            $isValid = FALSE;
            break;
        }
    }
    if (!$isValid)
        throw new InvalidArgumentException('Invalid string.');

    // parse string
    $return = eval(sprintf('return array(%s);', $str));

    return $return;
}

echo $str, "\n";

$result = explode_string($str);

print_r($result);

The tokens used are:

T_LNUMBER (305)
T_CONSTANT_ENCAPSED_STRING (315)
T_DOUBLE_ARROW (358)
T_ARRAY (360)
T_WHITESPACE (371)

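A token number can be translated to its token name with token_name. The numbers in parentheses come from the PHP version used for this answer; they vary between PHP versions.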
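
For instance, a short loop like the following prints the name and the number of each allowed token on whatever PHP version you run it on (the variable name $ids is just for the example):

```php
<?php
$ids = [T_LNUMBER, T_CONSTANT_ENCAPSED_STRING, T_DOUBLE_ARROW, T_ARRAY, T_WHITESPACE];
foreach ($ids as $id) {
    // token_name() maps the numeric id back to the constant name
    printf("%s (%d)\n", token_name($id), $id);
}
```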

Running the script gives you:

Array
(
    [0] => Array
        (
            [0] => 1
            [1] => 3
            [2] => 4
        )

    [1] => Array
        (
            [0] => Array
                (
                    [0] => 4
                    [1] => 5
                    [2] => 6
                )

        )

    [2] => this is a comma , inside a string
    [3] => Array
        (
            [asdf] => lalal
        )

)

Create a script-parser in PHP

The proper solution is mentioned in the comments. You need to actually write a compiler/parser. My memory is a little fuzzy from my compilers course, but here is how you would approach it.

The basic concept is to convert the input to tokens (this is where regular expressions are okay). This is called lexical analysis.

So:

[Config Object]
{Loop 3
Section[i]
{Loop 3
Setting[i] = Value[i]
}
}
OtherSetting=X

becomes (pseudo code tokens, and maybe not exactly what you need)

OPEN_BRACKET STRING(=Config Object) CLOSE_BRACKET
START_LOOP NUMBER(=3)
STRING(=Section) OPEN_BRACKET STRING(=i) CLOSE_BRACKET
START_LOOP NUMBER(=3)
STRING(=Setting) OPEN_BRACKET STRING(=i) CLOSE_BRACKET EQUAL STRING(=Value) OPEN_BRACKET STRING(=i) CLOSE_BRACKET
END_LOOP
END_LOOP
STRING(=OtherSetting) EQUAL STRING(=X)

So if your lexer gets you an array of tokens like the above, you just need to parse it against an actual grammar (so this is where you don't want to use regular expressions).
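
A rough sketch of such a lexer in PHP follows. The rule table, the function name lexConfig and the exact patterns are assumptions made for this example (in particular, STRING is taken to be words separated by single spaces), so adjust them to your real syntax:

```php
<?php
// Hypothetical token table for the sample config syntax; order matters.
const CONFIG_RULES = [
    'START_LOOP'    => '/\G\{Loop\b/',
    'END_LOOP'      => '/\G\}/',
    'OPEN_BRACKET'  => '/\G\[/',
    'CLOSE_BRACKET' => '/\G\]/',
    'EQUAL'         => '/\G=/',
    'NUMBER'        => '/\G\d+/',
    'STRING'        => '/\G[A-Za-z]+(?: [A-Za-z]+)*/',
    'WS'            => '/\G\s+/',
];

function lexConfig(string $input): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($input);
    while ($offset < $length) {
        foreach (CONFIG_RULES as $kind => $re) {
            if (preg_match($re, $input, $m, 0, $offset) === 1) {
                if ($kind !== 'WS') {
                    $tokens[] = [$kind, $m[0]]; // whitespace is dropped
                }
                $offset += strlen($m[0]);
                continue 2; // go on with the next token
            }
        }
        throw new Exception("unexpected character at offset $offset");
    }
    return $tokens;
}
```

Each token is a [kind, value] pair, which is all the parser needs.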

Your grammar (for the loops) is something along these lines (pseudo code syntax kind of like Bison, and I'm probably forgetting parts/leaving things out on purpose):

INDEXED_CONFIG_LINES: INDEXED_CONFIG_LINE | INDEXED_CONFIG_LINES INDEXED_CONFIG_LINE;
INDEXED_CONFIG_LINE: STRING OPEN_BRACKET STRING CLOSE_BRACKET EQUAL STRING OPEN_BRACKET STRING CLOSE_BRACKET;
LOOP: START_LOOP NUMBER LOOP_BODY END_LOOP;
LOOP_BODY: INDEXED_CONFIG_LINES | LOOP;

So instead of a regular expression, you need a parser that can use that grammar to build a syntax tree. You would basically just be building a state machine, where you transition on the next token to some state (like in a loop body, etc.).
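
As an illustration, the loop rules above can be turned into a small hand-written recursive-descent parser, one method per rule. This is only a sketch under a few assumptions: tokens are [kind, value] pairs, the class name LoopParser and the shape of the resulting tree are made up, and it covers just the four rules shown (typed properties need PHP 7.4 or later):

```php
<?php
// Minimal recursive-descent parser for the LOOP rules above (hypothetical class).
final class LoopParser
{
    private array $tokens;
    private int $pos = 0;

    public function __construct(array $tokens)
    {
        $this->tokens = $tokens;
    }

    private function peek(): ?string
    {
        return $this->tokens[$this->pos][0] ?? null;
    }

    private function expect(string $kind): string
    {
        if ($this->peek() !== $kind) {
            throw new Exception("expected $kind at token {$this->pos}");
        }
        return $this->tokens[$this->pos++][1];
    }

    // LOOP: START_LOOP NUMBER LOOP_BODY END_LOOP
    public function parseLoop(): array
    {
        $this->expect('START_LOOP');
        $count = (int) $this->expect('NUMBER');
        $body = $this->parseLoopBody();
        $this->expect('END_LOOP');
        return ['type' => 'loop', 'count' => $count, 'body' => $body];
    }

    // LOOP_BODY: INDEXED_CONFIG_LINES | LOOP
    private function parseLoopBody(): array
    {
        if ($this->peek() === 'START_LOOP') {
            return [$this->parseLoop()];
        }
        $lines = [$this->parseLine()]; // at least one line
        while ($this->peek() === 'STRING') {
            $lines[] = $this->parseLine();
        }
        return $lines;
    }

    // INDEXED_CONFIG_LINE: STRING [ STRING ] = STRING [ STRING ]
    private function parseLine(): array
    {
        $name = $this->expect('STRING');
        $this->expect('OPEN_BRACKET');
        $index = $this->expect('STRING');
        $this->expect('CLOSE_BRACKET');
        $this->expect('EQUAL');
        $value = $this->expect('STRING');
        $this->expect('OPEN_BRACKET');
        $valueIndex = $this->expect('STRING');
        $this->expect('CLOSE_BRACKET');
        return ['type' => 'line', 'name' => $name, 'index' => $index,
                'value' => $value, 'valueIndex' => $valueIndex];
    }
}
```

The call stack plays the role of the state machine here: being inside parseLoopBody is the "in a loop body" state.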

Honestly, YAML would probably meet your needs instead of re-inventing the wheel or resorting to regex gymnastics. But if you really need to have the loop syntax you are proposing, you could take a look at the Symfony Yaml component to see how they do the parsing. https://github.com/symfony/Yaml

Or you can take a look at Twig for another parser that does have loops: https://github.com/fabpot/Twig/tree/master/lib/Twig


