What Are the Valid Characters in PHP Variable, Method, Class, etc Names

What are the valid characters in PHP variable, method, class, etc names?

The [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]* regex only applies when the name is used directly in some special syntactical element. Some examples:

$varName           // <-- varName needs to satisfy the regex
$foo->propertyName // <-- propertyName needs to satisfy the regex
class ClassName {} // <-- ClassName needs to satisfy the regex
// and can't be a reserved keyword

Note that this regex is applied byte-per-byte, without consideration for encoding. That's why it also allows many weird Unicode names.
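
To illustrate the byte-wise matching, here is a minimal sketch that applies the same regex with preg_match and no /u modifier. The helper name matches_label_regex is made up for this example; it only mimics the pattern and is not how PHP's lexer is actually implemented.

```php
<?php
// Hypothetical helper: checks a name against the label regex quoted above.
function matches_label_regex(string $name): bool
{
    // No /u modifier, so the pattern is applied to raw bytes.
    // The D modifier keeps $ from matching before a trailing newline.
    return preg_match('/^[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*$/D', $name) === 1;
}

var_dump(matches_label_regex('varName'));   // bool(true)
var_dump(matches_label_regex("a\xc3\xb6")); // bool(true): "aö" in UTF-8, both extra bytes are >= 0x7f
var_dump(matches_label_regex('1abc'));      // bool(false): starts with a digit
var_dump(matches_label_regex('foo-bar'));   // bool(false): "-" is not allowed
```

Because every byte of a multi-byte UTF-8 character is 0x80 or above, such characters pass the check without the regex knowing anything about Unicode.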

But the regex restricts only these "direct" uses of names. Through the various dynamic features PHP provides, it's possible to use virtually arbitrary names.

In general you should make no assumptions about which characters names in PHP can contain. For the most part they are just arbitrary strings. Doing things like "validating if something is a valid class name" is just meaningless in PHP.

In the following I will provide examples of how you can create weird names for the different categories.

Variables

Variable names can be arbitrary strings:

${''} = 'foo';
echo ${''}; // foo
${"\0"} = 'bar';
echo ${"\0"}; // bar

Constants

Global constants can also be arbitrary strings:

define('', 'foo');
echo constant(''); // foo
define("\0", 'bar');
echo constant("\0"); // bar

There is no way to dynamically define class constants that I'm aware of, so those cannot be arbitrary. The only way to create weird class constants seems to be via extension code.

Properties

Properties can not be the empty string and can not start with a NUL byte, but apart from this they are arbitrary:

$obj = new stdClass;
$obj->{''} = 'foo'; // Fatal error: Cannot access empty property
$obj->{"\0"} = 'foo'; // Fatal error: Cannot access property started with '\0'
$obj->{'*'} = 'foo';
echo $obj->{'*'}; // foo

Methods

Method names are arbitrary and can be handled by __call magic:

class Test {
    public function __call($method, $args) {
        echo "Called method \"$method\"";
    }
}

$obj = new Test;
$obj->{''}(); // Called method ""
$obj->{"\0"}(); // Called method "\0"

Classes

Arbitrary class names can be created using class_alias with the exception of the empty string:

class Test {}

class_alias('Test', '');
$className = '';
$obj = new $className; // Fatal error: Class '' not found

class_alias('Test', "\0");
$className = "\0";
$obj = new $className; // Works!

Functions

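I'm not aware of a way to create arbitrary function names from userland, but there are still some occasions where internal code produces "weird" names. For example, on PHP versions that still ship create_function (it was deprecated in PHP 7.2 and removed in PHP 8.0):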

var_dump(create_function('',''));
// string(9) "\0lambda_1"

PHP variable / function / class names using special characters

If you check the docs on variables it says that:

Variable names follow the same rules as other labels in PHP. A valid variable name starts with a letter or underscore, followed by any number of letters, numbers, or underscores. As a regular expression, it would be expressed thus: '[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*'

But basically people have agreed to only use a-zA-Z0-9_ and not the "fancy" names, since they might break depending on the encoding one uses.

So you can have a variable that is named $aöäüÖÄ but if you save that with the wrong encoding you might run into trouble.


The same goes for functions too btw.

So

function fooööö($aà) { echo $aà; }

fooööö("hi"); // will just echo 'hi'

will just work out (at least at first).
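
To make the encoding problem concrete, here is a small sketch that spells the name out as raw bytes. The name fooö and the use of eval are only for illustration (eval lets the name be built from explicit bytes, independent of how this file itself is saved); this is not something you would do in real code.

```php
<?php
// The same visible name "fooö" written as raw bytes in two encodings.
$latin1 = "foo\xf6";     // ö is the single byte 0xf6 in ISO-8859-1
$utf8   = "foo\xc3\xb6"; // ö is the two bytes 0xc3 0xb6 in UTF-8

echo bin2hex($latin1), "\n"; // 666f6ff6
echo bin2hex($utf8), "\n";   // 666f6fc3b6

// Define a function under the UTF-8 byte sequence only.
eval("function {$utf8}() { return 'hi'; }");

// PHP looks names up as byte strings, so the Latin-1 spelling is unknown.
var_dump(function_exists($utf8));   // bool(true)
var_dump(function_exists($latin1)); // bool(false)
```

So a file that defines the function in one encoding and a file that calls it in another are, as far as PHP is concerned, talking about two different functions.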


Also check out:

Exotic names for methods, constants, variables and fields - Bug or Feature?

for some discussion on the subject.

Exotic names for methods, constants, variables and fields - Bug or Feature?

This question starts to mention class names in the title, but then goes on to an example that includes exotic names for methods, constants, variables, and fields. There are actually different rules for these. Let's start with the case insensitive ones.

Case-insensitive identifiers (class and function/method names)

The general guideline here would be to use only printable ASCII characters. The reason is that these identifiers are normalized to their lowercase version, and this conversion is locale-dependent. Consider the following PHP file, encoded in ISO-8859-1:

<?php
function func_á() { echo "worked"; }
func_Á();

Will this script work? Maybe. It depends on what tolower(193) will return, which is locale-dependent:


$ LANG=en_US.iso88591 php a.php
worked
$ LANG=en_US.utf8 php a.php

Fatal error: Call to undefined function func_Á() in /home/glopes/a.php on line 3

Therefore, it's not a good idea to use non-ASCII characters. However, even ASCII characters may give trouble in some locales. See this discussion. It's likely that this will be fixed in the future by doing a locale-independent lowercasing that only works with ASCII characters.

In conclusion, if we use multi-byte encodings for these case-insensitive identifiers, we're looking for trouble. It's not just that we can't take advantage of the case insensitivity. We might actually run into unexpected collisions because all the bytes that compose a multi-byte character are individually turned into lowercase using locale rules. It's possible that two different multi-byte characters map to the same modified byte stream representation after applying the locale lowercase rules to each of the bytes.

Case-sensitive identifiers (variables, constants, fields)

The problem is less serious here, since these identifiers are case sensitive. However, they are just interpreted as bytestreams. This means that if we use Unicode, we must consistently use the same byte representation; we can't mix UTF-8 and UTF-16; we also can't use BOMs.

In fact, we must stick to UTF-8. Outside of the ASCII range, UTF-8 uses lead bytes from 0xc2 to 0xf4 and trail bytes in the range 0x80 to 0xbf, all of which are in the allowed range per the manual. Now let's say we use the character "Ġ" in a UTF-16BE encoded file. This will translate to 0x01 0x20, so the second byte will be interpreted as a space.
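
As a rough check of the byte values mentioned above, the following sketch tests each byte of "Ġ" in both encodings against the character class the manual gives for the non-initial part of a label. The function name describe_bytes is hypothetical.

```php
<?php
// Hypothetical helper: report, byte by byte, whether a byte may appear
// inside a label (using the non-initial character class from the manual).
function describe_bytes(string $bytes): array
{
    $report = [];
    foreach (str_split($bytes) as $byte) {
        $ok = preg_match('/^[a-zA-Z0-9_\x7f-\xff]$/D', $byte) === 1;
        $report[] = sprintf('0x%02x %s', ord($byte), $ok ? 'allowed' : 'not allowed');
    }
    return $report;
}

print_r(describe_bytes("\xc4\xa0")); // "Ġ" in UTF-8: both bytes allowed
print_r(describe_bytes("\x01\x20")); // "Ġ" in UTF-16BE: neither byte allowed
```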

Having multi-byte characters being read as if they were single-byte characters is, of course, no Unicode support at all. PHP does have some multi-byte support in the form of the compilation switch "--enable-zend-multibyte" (as of PHP 5.4, multibyte support is compiled in by default, but disabled; you can enable it with zend.multibyte=On in php.ini). This allows you to declare the encoding of the script:

<?php
declare(encoding='ISO-8859-1');
// code here
?>

It will also handle BOMs, which are used to auto-detect the encoding and do not become part of the output. There are, however, a few downsides:

  • Performance hit, both memory and CPU. It stores a representation of the script in an internal multi-byte encoding, which takes more space (and it also seems to store in memory the original version) and it also spends some CPU converting the encoding.
  • Multi-byte support is usually not compiled in, so it's less tested (more bugs).
  • Portability issues between installations that have the support compiled in and those that don't.
  • Refers only to the parsing stage; does not solve the problem outlined for case-insensitive identifiers.

Finally, there is the problem of lack of normalization – the same character may be represented with different Unicode code points (independently of the encoding). This may lead to some very difficult to track bugs.
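
A short sketch of the normalization issue, using variable variables so the exact bytes are visible. The two spellings of "é" below are standard Unicode forms; the variable names $precomposed and $decomposed are just for this example.

```php
<?php
// Two byte sequences that both render as "é".
$precomposed = "\xc3\xa9";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE
$decomposed  = "e\xcc\x81"; // U+0065 followed by U+0301 COMBINING ACUTE ACCENT

${$precomposed} = 'first';
${$decomposed}  = 'second';

// The names look identical on screen but are distinct variables.
var_dump(${$precomposed});              // string(5) "first"
var_dump(${$decomposed});               // string(6) "second"
var_dump($precomposed === $decomposed); // bool(false)
```

An editor or input method that silently switches between the two forms would therefore produce an "undefined variable" that looks correctly spelled.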

Why can I define a PHP function with a non-printing character?

From the manual:

Function names follow the same rules as other labels in PHP. A valid
function name starts with a letter or underscore, followed by any
number of letters, numbers, or underscores. As a regular expression,
it would be expressed thus: [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*.

As explained in the other answer linked, the regular expression is applied byte-per-byte, allowing "many weird Unicode names".

Doing it that way has some side effects, like the one you've seen. However, I can't imagine it was the original intent of the people behind PHP; it's more likely just a direct consequence of the way they implemented it.

single equal and minus php at php

If you must have the hyphen in a variable name, you can do:

${'some-random-variable'} = "random-variable";

However, it is definitely not advised to name your variables that way. The above should only be used if you need to access a variable that you don't control and that has hyphens in its name, for example one coming from a third-party library or a web service response.

PHP - valid variable names

You can literally choose any name for a variable. "i" and "foo" are obvious choices, but "", "\n", and "foo.bar" are also valid. The reason? The PHP symbol table is just a dictionary: a string key of zero or more bytes maps to a structured value (called a zval). Interestingly, there are two ways to access this symbol table: lexical variables and dynamic variables.

Lexical variables are what you read about in the "variables" documentation. Lexical variables define the symbol table key during compilation (ie, while the engine is lexing and parsing the code). To keep this lexer simple, lexical variables start with a $ sigil and must match the regex [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*. Keeping it simple this way means the parser doesn't have to figure out, for example, whether $foo.bar is a variable keyed by "foo.bar" or a variable "foo" string concatenated with a constant bar.
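
The $foo.bar ambiguity can be seen in a few lines. This sketch assumes a constant named bar exists (defined here purely for illustration); without it, newer PHP versions raise an error for the undefined constant.

```php
<?php
define('bar', 'B');  // hypothetical constant, only for this example
$foo = 'A';
${'foo.bar'} = 'C';  // a variable whose name really contains a dot

echo $foo.bar, "\n";     // AB  (variable $foo concatenated with constant bar)
echo ${'foo.bar'}, "\n"; // C   (the variable keyed by "foo.bar")
```

Because the lexer stops the variable name at the dot, the literal form can only ever mean the concatenation.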

Now dynamic variables are where it gets interesting. Dynamic variables let you access those more uncommon variable names. PHP calls these variable variables. (I'm not fond of that name, as their opposite is logically "constant variable", which is confusing. But I'll call them variable variables from here on.) The basic usage goes like:

$a = 'b';
$b = 'SURPRISE!';
var_dump($$a, ${$a}); // both emit a surprise

Variable variables are parsed differently than lexical variables. Rather than defining the symbol table key at lexing time, the symbol table key is evaluated at run time. The logic goes like this: the PHP lexer sees the variable variable syntax (either $$a or more generally ${expression}), the PHP parser defers evaluation of the expression until run time, and at run time the engine uses the result of the expression to key into the symbol table. It's a little more work than lexical variables, but far more powerful.

Inside of ${} you can have an expression that evaluates to any byte sequence. Empty string, null byte, all of it. Anything goes. That is handy, for example, in heredocs. It's also handy for accessing remote variables as PHP variables. For example, JSON allows any character in a key name, and you might want to access those as straight variables (rather than array elements):

$decoded = json_decode('{ "foo.bar" : 1 }');
foreach ($decoded as $key => $value) {
    ${$key} = $value;
}
var_dump(${'foo.bar'});

Using variable variables in this way is similar to using an array as a "symbol table", like $array['foo.bar'], but the variable variable approach is perfectly acceptable and slightly faster.
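
For comparison, here is a minimal sketch of the array-as-symbol-table alternative, assuming you decode the JSON into an associative array. The key "some-key" and the variable name $symbols are made up for the example.

```php
<?php
// Decode into an associative array instead of creating variables.
$symbols = json_decode('{ "foo.bar" : 1, "some-key" : "x" }', true);

var_dump($symbols['foo.bar']);  // int(1)
var_dump($symbols['some-key']); // string(1) "x"

// Nothing is written into the surrounding scope, and unknown keys
// can be checked for without triggering a warning.
var_dump(array_key_exists('missing', $symbols)); // bool(false)
```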


Addendum

By "slightly faster" we are talking so far to the right of the decimal point that they're practically indistinguishable. It's not until 10^8 symbol accesses that the difference exceeds 1 second in my tests.

Set array key: 0.000000119529
Set var-var: 0.000000101196
Increment array key: 0.000000159856
Increment var-var: 0.000000136778

The loss of clarity and convention is likely not worth it.

$N = 100000000;

$elapsed = -microtime(true);
$syms = [];
for ($i = 0; $i < $N; $i++) { $syms['foo.bar'] = 1; }
printf("Set array key: %.12f\n", ($elapsed + microtime(true)) / $N);

$elapsed = -microtime(true);
for ($i = 0; $i < $N; $i++) { ${'foo.bar'} = 1; }
printf("Set var-var: %.12f\n", ($elapsed + microtime(true)) / $N);

$elapsed = -microtime(true);
$syms['foo.bar'] = 1;
for ($i = 0; $i < $N; $i++) { $syms['foo.bar']++; }
printf("Increment array key: %.12f\n", ($elapsed + microtime(true)) / $N);

$elapsed = -microtime(true);
${'foo.bar'} = 1;
for ($i = 0; $i < $N; $i++) { ${'foo.bar'}++; }
printf("Increment var-var: %.12f\n", ($elapsed + microtime(true)) / $N);

Sanitize Strings for Legal Variable Names in PHP

First decide what you need: a filter or a validator. A validator will return true/false. Then you can raise an exception, produce an error for the user, or just ignore the file. The other option is to use a filter, which will effectively remove characters from the input string.

public function sanitize($input)
{
    $pattern = '/[^a-zA-Z0-9]/';

    return preg_replace($pattern, '', (string) $input);
}

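You might also want to allow Unicode letters and digits, falling back to ASCII when PCRE was built without Unicode support:

public function sanitize($input)
{
    // Probe whether this PCRE build understands Unicode properties.
    if (!@preg_match('/\pL/u', 'a')) {
        $pattern = '/[^a-zA-Z0-9]/';
    } else {
        $pattern = '/[^\p{L}\p{N}]/u';
    }

    return preg_replace($pattern, '', (string) $input);
}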

Issues also to consider:

  • Do you want to enable whitespace support? In this case you will need to add a space in the $pattern variables.
  • Are the filenames in a language other than English? Then you will need to do some locale specific manipulation to get the $pattern up to date.
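
Note that the filters above can still return a string that starts with a digit or is empty, neither of which can be written as a literal $name. A possible variation that handles this is sketched below; to_variable_name is a hypothetical helper, not part of the original answer, and it does not deal with the reserved name $this.

```php
<?php
// Hypothetical helper: turn an arbitrary string into a name that can be
// written literally as $name.
function to_variable_name(string $input): string
{
    // Replace every run of characters outside [a-zA-Z0-9_] with one underscore.
    $name = preg_replace('/[^a-zA-Z0-9_]+/', '_', $input);

    // A label may not be empty and may not start with a digit.
    if ($name === '' || preg_match('/^[0-9]/', $name) === 1) {
        $name = '_' . $name;
    }

    return $name;
}

echo to_variable_name('campaign-ID'), "\n"; // campaign_ID
echo to_variable_name('2nd value'), "\n";   // _2nd_value
echo to_variable_name(''), "\n";            // _
```

Replacing instead of deleting keeps word boundaries visible, at the cost of different inputs possibly mapping to the same name.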

HTH

Displaying variables with special characters $ or - in the name

I'm not sure where you got this class, but it contains a lot of invalid names.

Example

$goal campaign-ID etc.

I had to reconstruct your class and it looked like this

$st= new stdClass();
$st->{"campaign-ID"} = 1 ;
$st->campaign_name = "Sample Campaign" ;
$st->start_duration = "2012-04-17" ;
$st->{'activity$'} = null ;
$st->survey_settings = "Ordering K-Cup Packs" ;
$st->{'$limit'} = "sample" ;
$st->{'$goal'} = null ;

$std = new stdClass();
foreach ($st as $key => $value) {
    // Drop "$" and turn "-" into "_" so the key becomes a plain property name.
    $key = str_replace(array('$', "-"), array('', "_"), $key);
    $std->{$key} = $value;
}

echo "<pre>" ;
print_r($std);

Output

stdClass Object
(
[campaign_ID] => 1
[campaign_name] => Sample Campaign
[start_duration] => 2012-04-17
[activity] =>
[survey_settings] => Ordering K-Cup Packs
[limit] => sample
[goal] =>
)

I'd advise you to use valid variable names in the first place rather than relying on conversions like this.

Is there a limit in PHP to the length of a variable name or function name?

Generally, such a limit is imposed by the threat of violence from other folks who interact with your code.


