How Is the PHP Array Implemented on the C Level

How is the PHP array implemented on the C level?

PHP associative arrays are in fact implementation of HashTables.

Internally, it is possible to make numeric arrays or associative arrays.
If you combine them, it is associative array.

In numeric arrays, it is very similar to C. You have array of pointers to ZVAL structs.

Because pointers have fixed-length (let's call it n), the offset (x) calculation is easy: x * n.

In PHP types are ZVAL structs (because that way it implements dynamic types), but it also helps in associative array, because you can assume fixed-length. So even if direct access to array is slower, it is still considered O(1).

So what happens in string keys? PHP uses hash function to convert them to intergers.

Searching in numeric and associative array has similar efficiency, because internally they are all numeric.

Only direct-access to array keys is slower, because of the additional level (hash function).

How does PHP keep track of order in an associative array?

How are associative arrays implemented in PHP? might give you some insight.

It seems that PHP arrays are essentially hash tables, so the order of the array will stay the same until you reorder it (e.g. by sorting the array).

EDIT: It appears this is getting downvoted, allow me to explicitly include the sources I linked to in the comment below here...

"PHP associative arrays are in fact an implementation of HashTables", from
How is the PHP array implemented on the C level?
Also from that source: "The PHP array is a chained hash table (lookup of O(c) and O(n) on key collisions) that allows for int and string keys. It uses 2 different hashing algorithms to fit the two types into the same hash key space."
"Everything is a HashTable" from http://nikic.github.io/2012/03/28/Understanding-PHPs-internal-array-implementation.html

php array lookup time

An array in PHP is actually an ordered map. A map is a type that
associates values to keys. This type is optimized for several
different uses; it can be treated as an array, list (vector), hash
table (an implementation of a map), dictionary, collection, stack,
queue, and probably more. As array values can be other arrays, trees
and multidimensional arrays are also possible.

The implementation is in fact some sort of a HashTable.

-> http://php.net/manual/en/language.types.array.php

-> How is the PHP array implemented on the C level?

What data-structure does the PHP array() constructor create?

I'd suggest that it's best to view arrays as a hash table implementation primarily, as if you spend a little time with the source you'll notice that whilst Zend HTs can be traversed as DLLs most of the operations you might be interested to compare (add, random access) are typical HT implementations.

Where we're dealing purely in numerical keys then it's essentially a pass-through to a C array without involving the hash function, and in this case there will be precisely one HT bucket per key. Of course the ability to dynamically extend the array is facilitated by a periodic memory reallocation, so you'll suffer a penalty there (as you might if you didn't pre-allocate a large std::vector in C++).

HTH.

Does PHP handle numerically-indexed arrays differently (internally)?

In short PHP does not have non-associative arrays.

@Sectus has posted a good answer re the underlying implementation of PHP arrays. It is often beneficial to understand what's going on "under the hood". But regardless of their underlying implementation, PHP arrays are exposed to the PHP developer as an ordered-map of values associated to keys (ie, an associated array). Always.

From PHP:Arrays - Manual

An array in PHP is actually an ordered map. A map is a type that associates values to keys.

and

PHP arrays can contain integer and string keys at the same time as PHP does not distinguish between indexed and associative arrays.

and

The key can either be an integer or a string.

The misconception that arrays are numerically indexed is caused by the feature that integer keys are auto-incremented as a matter of convenience, in the case where no key is explicitly specified.

But note, even when all keys are integers, there's no guarantee in PHP that an item exists at eg $arr[0], which to my knowledge is a given in any other language that does use indexed arrays (that is, assuming the indexed array contains at least one element, and is 0-based).

This is not a trivial differentiation. IMHO programmers who rely on PHP arrays behaving like indexed array without thought for the potential consequences or understanding of the difference* may be setting themselves (or future maintainers) up for strange/unexpected behaviour.

*I've qualified this, as there are obviously cases where an informed decision to take advantage of the index-like conveniences/features of the PHP language around arrays can provide benefit.

Is the following array well defined for binary search?

You can use file to get the entire contents of a file as an array of lines. Assuming that the first int in each pair is unique, you can use it as the key for the array:

foreach (file('ints.txt') as $line) {
  list($key, $value) = explode(';', $line);
  $elements[$key] = $value;
}

Looking up values by their keys in $elements will be O(n) but really close to O(1); this might be fast enough for your purposes.

How PHP (the zend engine) stores an array internally?

Internally PHP just uses a simple HashTable. (First a hash lookup; upon collision there's a simple list lookup.)

By the way, there are also some special classes in the SPL http://php.net/spl.datastructures which you might want to use for special things… (Only use them if really necessary… it's mostly just not worth it.)

+ operator for array in PHP?

Quoting from the PHP Manual on Language Operators

The + operator returns the right-hand array appended to the left-hand array; for keys that exist in both arrays, the elements from the left-hand array will be used, and the matching elements from the right-hand array will be ignored.

So if you do

$array1 = ['one',   'two',          'foo' => 'bar'];
$array2 = ['three', 'four', 'five', 'foo' => 'baz']; 

print_r($array1 + $array2);

You will get

Array
(
    [0] => one   // preserved from $array1 (left-hand array)
    [1] => two   // preserved from $array1 (left-hand array)
    [foo] => bar // preserved from $array1 (left-hand array)
    [2] => five  // added from $array2 (right-hand array)
)

So the logic of + is equivalent to the following snippet:

$union = $array1;

foreach ($array2 as $key => $value) {
    if (false === array_key_exists($key, $union)) {
        $union[$key] = $value;
    }
}

If you are interested in the details of the C-level implementation head to

php-src/Zend/zend_operators.c

Note, that + is different from how array_merge() would combine the arrays:

print_r(array_merge($array1, $array2));

would give you

Array
(
    [0] => one   // preserved from $array1
    [1] => two   // preserved from $array1
    [foo] => baz // overwritten from $array2
    [2] => three // appended from $array2
    [3] => four  // appended from $array2
    [4] => five  // appended from $array2
)

See linked pages for more examples.

List of Big-O for PHP functions

Since it doesn't seem like anyone has done this before I thought it'd be good idea to have it for reference somewhere. I've gone though and either via benchmark or code-skimming to characterize the array_* functions. I've tried to put the more interesting Big-O near the top. This list is not complete.

Note: All the Big-O where calculated assuming a hash lookup is O(1) even though it's really O(n). The coefficient of the n is so low, the ram overhead of storing a large enough array would hurt you before the characteristics of lookup Big-O would start taking effect. For example the difference between a call to array_key_exists at N=1 and N=1,000,000 is ~50% time increase.

Interesting Points:

isset/array_key_exists is much faster than in_array and array_search
+(union) is a bit faster than array_merge (and looks nicer). But it does work differently so keep that in mind.
shuffle is on the same Big-O tier as array_rand
array_pop/array_push is faster than array_shift/array_unshift due to re-index penalty

Lookups:

array_key_exists O(n) but really close to O(1) - this is because of linear polling in collisions, but because the chance of collisions is very small, the coefficient is also very small. I find you treat hash lookups as O(1) to give a more realistic big-O. For example the different between N=1000 and N=100000 is only about 50% slow down.

isset( $array[$index] ) O(n) but really close to O(1) - it uses the same lookup as array_key_exists. Since it's language construct, will cache the lookup if the key is hardcoded, resulting in speed up in cases where the same key is used repeatedly.

in_array O(n) - this is because it does a linear search though the array until it finds the value.

array_search O(n) - it uses the same core function as in_array but returns value.

Queue functions:

array_push O(∑ var_i, for all i)

array_pop O(1)

array_shift O(n) - it has to reindex all the keys

array_unshift O(n + ∑ var_i, for all i) - it has to reindex all the keys

Array Intersection, Union, Subtraction:

array_intersect_key if intersection 100% do O(Max(param_i_size)*∑param_i_count, for all i), if intersection 0% intersect O(∑param_i_size, for all i)

array_intersect if intersection 100% do O(n^2*∑param_i_count, for all i), if intersection 0% intersect O(n^2)

array_intersect_assoc if intersection 100% do O(Max(param_i_size)*∑param_i_count, for all i), if intersection 0% intersect O(∑param_i_size, for all i)

array_diff O(π param_i_size, for all i) - That's product of all the param_sizes

array_diff_key O(∑ param_i_size, for i != 1) - this is because we don't need to iterate over the first array.

array_merge O( ∑ array_i, i != 1 ) - doesn't need to iterate over the first array

+ (union) O(n), where n is size of the 2nd array (ie array_first + array_second) - less overhead than array_merge since it doesn't have to renumber

array_replace O( ∑ array_i, for all i )

Random:

shuffle O(n)

array_rand O(n) - Requires a linear poll.

Obvious Big-O:

array_fill O(n)

array_fill_keys O(n)

range O(n)

array_splice O(offset + length)

array_slice O(offset + length) or O(n) if length = NULL

array_keys O(n)

array_values O(n)

array_reverse O(n)

array_pad O(pad_size)

array_flip O(n)

array_sum O(n)

array_product O(n)

array_reduce O(n)

array_filter O(n)

array_map O(n)

array_chunk O(n)

array_combine O(n)

I'd like to thank Eureqa for making it easy to find the Big-O of the functions. It's an amazing free program that can find the best fitting function for arbitrary data.

EDIT:

For those who doubt that PHP array lookups are O(N), I've written a benchmark to test that (they are still effectively O(1) for most realistic values).

php array lookup graph

$tests = 1000000;
$max = 5000001;


for( $i = 1; $i <= $max; $i += 10000 ) {
    //create lookup array
    $array = array_fill( 0, $i, NULL );

    //build test indexes
    $test_indexes = array();
    for( $j = 0; $j < $tests; $j++ ) {
        $test_indexes[] = rand( 0, $i-1 );
    }

    //benchmark array lookups
    $start = microtime( TRUE );
    foreach( $test_indexes as $test_index ) {
        $value = $array[ $test_index ];
        unset( $value );
    }
    $stop = microtime( TRUE );
    unset( $array, $test_indexes, $test_index );

    printf( "%d,%1.15f\n", $i, $stop - $start ); //time per 1mil lookups
    unset( $stop, $start );
}

How Is the PHP Array Implemented on the C Level