Exotic Names for Methods, Constants, Variables and Fields - Bug or Feature

Exotic names for methods, constants, variables and fields - Bug or Feature?

The question mentions class names in the title, but then goes on to an example that includes exotic names for methods, constants, variables, and fields. There are actually different rules for these. Let's start with the case-insensitive ones.

Case-insensitive identifiers (class and function/method names)

The general guideline here would be to use only printable ASCII characters. The reason is that these identifiers are normalized to their lowercase version, and this conversion is locale-dependent. Consider the following PHP file, encoded in ISO-8859-1:

<?php
function func_á() { echo "worked"; }
func_Á();

Will this script work? Maybe. It depends on what tolower(193) will return, which is locale-dependent:


$ LANG=en_US.iso88591 php a.php
worked
$ LANG=en_US.utf8 php a.php

Fatal error: Call to undefined function func_Á() in /home/glopes/a.php on line 3

Therefore, it's not a good idea to use non-ASCII characters. However, even ASCII characters may give trouble in some locales. See this discussion. It's likely that this will be fixed in the future by doing a locale-independent lowercasing that only works with ASCII characters.

In conclusion, if we use multi-byte encodings for these case-insensitive identifiers, we're looking for trouble. It's not just that we can't take advantage of the case insensitivity. We might actually run into unexpected collisions because all the bytes that compose a multi-byte character are individually turned into lowercase using locale rules. It's possible that two different multi-byte characters map to the same modified byte stream representation after applying the locale lowercase rules to each of the bytes.
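To make the collision concrete, here is a rough sketch that simulates per-byte lowercasing. The helper name lowercaseBytewise is hypothetical, and its table is a simplified excerpt modelled on Windows-1251 case rules; that is an assumption for illustration, not PHP's actual implementation or any particular system's tolower().

```php
<?php
// Hypothetical helper: lowercases every byte on its own, the way a
// single-byte locale would. The mapping below is an assumed, simplified
// excerpt of Windows-1251 rules, not PHP's real code.
function lowercaseBytewise(string $name): string
{
    $out = '';
    for ($i = 0; $i < strlen($name); $i++) {
        $code = ord($name[$i]);
        if ($code >= 0x41 && $code <= 0x5A) {
            $code += 0x20;   // ASCII A-Z
        } elseif ($code >= 0xC0 && $code <= 0xDF) {
            $code += 0x20;   // capital letters in Windows-1251
        } elseif ($code === 0x80) {
            $code = 0x90;    // one more capital/small pair in Windows-1251
        }
        $out .= chr($code);
    }
    return $out;
}

$a = "func_\xC3\x80"; // UTF-8 bytes of "func_À" (U+00C0)
$b = "func_\xC3\x90"; // UTF-8 bytes of "func_Ð" (U+00D0)

var_dump($a === $b);                                       // bool(false)
var_dump(lowercaseBytewise($a) === lowercaseBytewise($b)); // bool(true)
```

Under these assumed rules, two different UTF-8 names end up as the same byte sequence after lowercasing, so they would be treated as the same identifier.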

Case-sensitive identifiers (variables, constants, fields)

The problem is less serious here, since these identifiers are case sensitive. However, they are just interpreted as byte streams. This means that if we use Unicode, we must consistently use the same byte representation; we can't mix UTF-8 and UTF-16; we also can't use BOMs.

In fact, we must stick to UTF-8. Outside of the ASCII range, UTF-8 uses lead bytes from 0xc2 to 0xf4 (0xc0 to 0xfd in the original definition that predates RFC 3629) and trail bytes in the range 0x80 to 0xbf, all of which are in the allowed range per the manual. Now let's say we use the character "Ġ" (U+0120) in a UTF-16BE encoded file. This will translate to 0x01 0x20, so the first byte falls outside the allowed range and the second byte will be interpreted as a space.
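A quick way to see the difference is to run both byte sequences through the label pattern from the manual. This is only a sketch: the bytes are written out by hand so that no extension is needed, and the 'x' prefix is just an arbitrary placeholder for the rest of a name.

```php
<?php
$utf8    = "\xC4\xA0"; // "Ġ" (U+0120) encoded as UTF-8
$utf16be = "\x01\x20"; // "Ġ" (U+0120) encoded as UTF-16BE

// Label pattern quoted in the PHP manual, anchored to the whole string.
$label = '/\A[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*\z/';

var_dump(preg_match($label, 'x' . $utf8));    // int(1)
var_dump(preg_match($label, 'x' . $utf16be)); // int(0)
```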

Having multi-byte characters being read as if they were single-byte characters is, of course, no Unicode support at all. PHP does have some multi-byte support in the form of the compilation switch "--enable-zend-multibyte" (as of PHP 5.4, multibyte support is compiled in by default, but disabled; you can enable it with zend.multibyte=On in php.ini). This allows you to declare the encoding of the script:

<?php
declare(encoding='ISO-8859-1');
// code here
?>

It will also handle BOMs, which are used to auto-detect the encoding and do not become part of the output. There are, however, a few downsides:

Performance hit, both memory and CPU. It stores a representation of the script in an internal multi-byte encoding, which takes more space (it also seems to keep the original version in memory), and it spends some CPU converting the encoding.
Multi-byte support is usually not compiled in, so it's less tested (more bugs).
Portability issues between installations that have the support compiled in and those that don't.
Refers only to the parsing stage; does not solve the problem outlined for case-insensitive identifiers.

Finally, there is the problem of lack of normalization – the same character may be represented with different Unicode code points (independently of the encoding). This may lead to some very difficult to track bugs.
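As a small illustration of the normalization problem, the two strings below both render as "café", but one uses the precomposed U+00E9 and the other uses "e" followed by the combining accent U+0301. The bytes are spelled out explicitly so the result does not depend on how this file is saved.

```php
<?php
$precomposed = "caf\xC3\xA9";  // "café" with U+00E9
$decomposed  = "cafe\xCC\x81"; // "café" with "e" + U+0301

// Used as variable names, these are two unrelated variables.
${$precomposed} = 'first';
${$decomposed}  = 'second';

var_dump($precomposed === $decomposed); // bool(false)
var_dump(${$precomposed});              // string(5) "first"
var_dump(${$decomposed});               // string(6) "second"
```

Two names that look identical in an editor therefore refer to different variables, which is exactly the kind of bug that is hard to track down.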

PHP variable name and braces

According to this link: http://cowburn.info/2008/01/12/php-vars-curly-braces/

The answers are:

Question 1: because what is inside the braces is treated less as a variable name and more as a key in the $GLOBALS array.
Question 2: they would be part of your script's $GLOBALS, but you can't access them with the regular $var notation.
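A minimal sketch of that behaviour, assuming the code runs at global scope (inside a function the variable would be local and would not show up in $GLOBALS); the name 'my var' is just an example:

```php
<?php
// Run at global scope. The name contains a space, so $my var is not valid syntax.
${'my var'} = 'hello';

var_dump(isset($GLOBALS['my var'])); // bool(true)
echo $GLOBALS['my var'], "\n";       // hello
echo ${'my var'}, "\n";              // hello
```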

What are the valid characters in PHP variable, method, class, etc names?

The [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]* regex only applies when the name is used directly in some special syntactical element. Some examples:

$varName           // <-- varName needs to satisfy the regex
$foo->propertyName // <-- propertyName needs to satisfy the regex
class ClassName {} // <-- ClassName needs to satisfy the regex
                   //     and can't be a reserved keyword

Note that this regex is applied byte by byte, without consideration for encoding. That's why it also allows many weird Unicode names.

But the regex restricts only these "direct" uses of names. Through the various dynamic features PHP provides, it's possible to use virtually arbitrary names.

In general you should make no assumptions about which characters names in PHP can contain. For the most part they are just arbitrary strings. Doing things like "validating if something is a valid class name" is just meaningless in PHP.

In the following I will provide examples of how you can create weird names for the different categories.

Variables

Variable names can be arbitrary strings:

${''} = 'foo';
echo ${''};      // foo
${"\0"} = 'bar';
echo ${"\0"};    // bar

Constants

Global constants can also be arbitrary strings:

define('', 'foo');
echo constant('');   // foo
define("\0", 'bar');
echo constant("\0"); // bar

There is no way to dynamically define class constants that I'm aware of, so those cannot be arbitrary. The only way to create weird class constants seems to be via extension code.

Properties

Properties can not be the empty string and can not start with a NUL byte, but apart from this they are arbitrary:

$obj = new stdClass;
$obj->{''} = 'foo';   // Fatal error: Cannot access empty property
$obj->{"\0"} = 'foo'; // Fatal error: Cannot access property started with '\0'
$obj->{'*'} = 'foo';
echo $obj->{'*'};     // foo

Methods

Method names are arbitrary and can be handled by __call magic:

class Test {
    public function __call($method, $args) {
        echo "Called method \"$method\"";
    }
}

$obj = new Test;
$obj->{''}();    // Called method ""
$obj->{"\0"}();  // Called method "\0"

Classes

Arbitrary class names can be created using class_alias with the exception of the empty string:

class Test {}

class_alias('Test', '');
$className = '';
$obj = new $className; // Fatal error: Class '' not found

class_alias('Test', "\0");
$className = "\0";
$obj = new $className; // Works!

Functions

I'm not aware of a way to create arbitrary function names from userland, but there are still some occasions where internal code produces "weird" names (note that create_function, used below, was deprecated in PHP 7.2 and removed in PHP 8.0):

var_dump(create_function('',''));
// string(9) "\0lambda_1"

Unicode identifiers (function names) for non-localization purposes advisable?

Which symbol makes sense as separator for mixed-in parameters in a virtual function name.

\u2639?

But other than localization and amusement and decorative effects, which uses of Unicode identifiers are advisable?

The biggest hurdle after font support is going to be making the character one that can be typed. Outside of a macro or copy/paste, Unicode characters are not spectacularly easy to enter. Forcing this upon others is very likely going to violate the "assume the people that work with your code after you are murderous psychopaths that know where you live" rule.

We use Unicode characters in only a few comments in our codebase, like

// Even though this is the end of the file and we should get an implicit exit, 
// if we don't actually expressly exit here, PHP segfaults.
// ♫ Oh, PHP, I love you. ♫

I think that falls into the "amusement and decorative" category. Or the "shoot self in head after slaughtering the php-internals team" category. Pick one.

Anyway, this is not a good idea because it's going to make your code hard to modify.

Calling a class method with some special character on it (spanish here) sometimes fails and sometimes works

AFAIK, the developers postponed (maybe even cancelled) built-in Unicode support, which had been planned for PHP 6. It would have allowed you to use Unicode characters in your class/function/method/var names.

~~For now, your only option is to use plain ascii characters.~~

Very unsafe behavior of the extract() and compact() functions with non-standard variables: is it a bug or a feature ;)?

PHP is open source. Feel free to contribute a patch to fix this behavior if you feel it should be fixed and/or join the PHP mailing list to discuss this topic with other PHP developers.

Simply ranting is pointless, especially here.

PHP variable / function / class names using special characters

If you check the docs on variables it says that:

Variable names follow the same rules as other labels in PHP. A valid variable name starts with a letter or underscore, followed by any number of letters, numbers, or underscores. As a regular expression, it would be expressed thus: '[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*'

But basically people have agreed to only use a-zA-Z0-9_ and not the "fancy" names, since they might break depending on the encoding one uses.

So you can have a variable that is named $aöäüÖÄ but if you save that with the wrong encoding you might run into trouble.

The same goes for functions too btw.

function fooööö($aà) { echo $aà; }

fooööö("hi"); // will just echo 'hi'

will just work out (at least at first).

Also check out:

Exotic names for methods, constants, variables and fields - Bug or Feature?

for some discussion on the subject.

How to check if a string can be used as a variable name in PHP?

From the manual:

Variable names follow the same rules as other labels in PHP. A valid variable name starts with a letter or underscore, followed by any number of letters, numbers, or underscores. As a regular expression, it would be expressed thus: '[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*'

So if you run your string through the regex (anchored at both ends, so that the whole string has to match), you should be able to tell whether it's valid or not.
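Something along these lines should do. The function name looksLikeLabel is made up for this example, and the pattern is the one quoted from the manual, anchored with \A and \z so that a trailing newline does not slip through.

```php
<?php
// Hypothetical helper: true if $name matches the manual's label pattern.
function looksLikeLabel(string $name): bool
{
    return preg_match('/\A[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*\z/', $name) === 1;
}

var_dump(looksLikeLabel('varName'));     // bool(true)
var_dump(looksLikeLabel('_private1'));   // bool(true)
var_dump(looksLikeLabel('1abc'));        // bool(false)
var_dump(looksLikeLabel('great-lines')); // bool(false)
var_dump(looksLikeLabel(''));            // bool(false)
var_dump(looksLikeLabel("name\n"));      // bool(false)
```

Keep in mind that this only checks the pattern. It does not know about special cases such as $this, and names that fail the check can still be used through the brace syntax.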

It should be noted that the ability to access 'invalid' Object property names using a variable variable is the correct approach for some XML parsing.

For example, from the SimpleXML docs:

Accessing elements within an XML document that contain characters not permitted under PHP's naming convention (e.g. the hyphen) can be accomplished by encapsulating the element name within braces and the apostrophe.

Followed by this code example:

echo $xml->movie->{'great-lines'}->line;

So it's not necessarily wrong to have properties that can only be accessed this way.

However, if your code both creates and uses the object, one would wonder why you would use that kind of property. The exception, of course, is a situation similar to the SimpleXML example, where an object is created to represent something outside the scope of your control.
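The same situation comes up with decoded JSON. As a sketch, assuming the json extension is available and using a made-up payload in place of data you do not control:

```php
<?php
// Hypothetical payload standing in for data produced outside your code.
$json  = '{"title": "PHP: Behind the Parser", "great-lines": "PHP solves all my web problems"}';
$movie = json_decode($json); // stdClass by default

echo $movie->title, "\n";           // PHP: Behind the Parser
echo $movie->{'great-lines'}, "\n"; // PHP solves all my web problems
```

The key "great-lines" is dictated by the data, so the brace syntax is the natural way to reach it.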

Is it safe to have 1 letter class names in PHP, e.g A, B, C

No.

Your future self will build a time machine for the sole purpose of slapping you for writing such unreadable code. And then, a paradox will result, and all of reality as we know it will be destroyed.

Why can I define a PHP function with a non-printing character?

From manual:

Function names follow the same rules as other labels in PHP. A valid
function name starts with a letter or underscore, followed by any
number of letters, numbers, or underscores. As a regular expression,
it would be expressed thus: [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*.

As explained in the other answer linked, the regular expression is applied byte by byte, allowing "many weird Unicode names".

Doing it that way has some side-effects like you've seen. However, I can't imagine it was the original intent of people behind PHP, it would be just a direct consequence of the way they've implemented it.
