Are Pointer Variables Just Integers with Some Operators or Are They "Symbolic"


C was conceived as a language in which pointers and integers were very intimately related, with the exact relationship depending upon the target platform. The relationship between pointers and integers made the language very suitable for purposes of low-level or systems programming. For purposes of discussion below, I'll thus call this language "Low-Level C" [LLC].

The C Standards Committee wrote up a description of a different language, where such a relationship is not expressly forbidden, but is not acknowledged in any useful fashion, even when an implementation generates code for a target and application field where such a relationship would be useful. I'll call this language "High Level Only C" [HLOC].

In the days when the Standard was written, most things that called themselves C implementations processed a dialect of LLC. Most useful compilers today process a dialect which defines useful semantics in more cases than HLOC, but not as many as LLC. Whether pointers behave more like integers or more like abstract mystical entities depends upon which exact dialect one is using. If one is doing systems programming, it is reasonable to view C as treating pointers and integers as intimately related, because LLC dialects suitable for that purpose do so, and HLOC dialects that don't do so aren't suitable for that purpose. When doing high-end number crunching, however, one would far more often be using dialects of HLOC which do not recognize such a relationship.
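As a rough illustration of the two views, here is a minimal sketch in C. The function names are hypothetical. The first function relies only on the round trip that the Standard describes for uintptr_t (an optional type that most hosted implementations provide). The second does arithmetic on the integer and converts back, which is the LLC-style assumption of a flat address space; it typically works on mainstream platforms but is not something the Standard promises.

```c
#include <assert.h>
#include <stdint.h>

/* Pointer -> integer -> pointer. The integer value itself is
   implementation-defined; only the round trip is specified. */
int *round_trip(int *p)
{
    uintptr_t bits = (uintptr_t)(void *)p;
    return (int *)(void *)bits;
}

/* LLC-style assumption (not guaranteed by the Standard): adding sizeof(int)
   to the integer form of an address yields the next int in an array. */
int *second_element_via_integer(int *first)
{
    uintptr_t bits = (uintptr_t)(void *)first;
    bits += sizeof(int);
    return (int *)(void *)bits;
}
```

Calling round_trip(&x) should give back a pointer equal to &x on any implementation that has uintptr_t; whether second_element_via_integer is meaningful depends on the dialect and target.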

The real problem, and source of so much contention, lies in the fact that LLC and HLOC are increasingly divergent, and yet are both referred to by the name C.

Why are the pointer (*) and array ([]) symbols bound to the variable name and not to the type in a variable declaration?

Kernighan and Ritchie write, in The C Programming Language, 1978, page 90:

The declaration of the pointer px is new.

int *px;

is intended as a mnemonic; it says the combination *px is an int, that is, if px occurs in the context *px, it is equivalent to a variable of the type int. In effect, the syntax of the declaration for a variable mimics the syntax of expressions in which the variable might appear. This reasoning is useful in all cases involving complicated declarations. For example,

double atof(), *dp;

says that in an expression atof() and *dp have values of type double.

Thus, in a declaration such as int X, Y, Z; the declarators X, Y, and Z give us “pictures” of expressions, such as b, *b, b[10], *b[10], and so on. The actual type of the declared identifier is derived from the picture: since *b[10] is an int, b[10] is a pointer to an int, so b is an array of 10 pointers to int.
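A small sketch of reading declarations as pictures, with hypothetical function and variable names. The first function uses the array of pointers described above; the second uses parentheses to change the picture so that the identifier is a pointer to an array instead.

```c
#include <assert.h>

/* Picture: *b[10] is an int, so b is an array of 10 pointers to int. */
int sum_through_picture(void)
{
    int x = 3, y = 4;
    int *b[10] = {0};        /* unused elements are null pointers */

    b[0] = &x;               /* b[0] is a pointer to int */
    b[9] = &y;

    return *b[0] + *b[9];    /* *b[i] is an int, as the declaration pictured */
}

/* Picture: (*c)[10] is an int, so c is a pointer to an array of 10 ints. */
int read_through_array_pointer(void)
{
    int row[10] = {0};
    int (*c)[10] = &row;

    row[2] = 42;
    return (*c)[2];
}
```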

What are the distinctions between the various symbols (*,&, etc) combined with parameters?

To understand this you'll first need to understand pointers and references. I'll simply explain the type declaration syntax you're asking about assuming you already know what pointers and references are.

In C, it is said that 'declaration follows use.' That means the syntax for declaring a variable mimics using the variable: generally in a declaration you'll have a base type like int or float followed by something that looks like an expression. For example, in int *y the base type is int and the expression look-alike is *y. Thereafter that expression evaluates to a value with the given base type.

So int *y means that later an expression *y is an int. That implies that y must be a pointer to an int. The same holds true for function parameters, and in fact for whole function declarations:

int *foo(int **bar);

In the above int **bar says **bar is an int, implying *bar is a pointer to an int, and bar is a pointer to a pointer to an int. It also declares that *foo(arg) will be an int (given arg of the appropriate type), implying that foo(arg) results in a pointer to an int.¹ So the whole function declaration reads "foo is a function taking a pointer to a pointer to an int, and returning a pointer to an int."
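To make that reading concrete, here is a sketch that gives foo a made-up body (the original only shows the declaration) and then uses each expression at the type the declaration promises. The helper call_foo is likewise hypothetical.

```c
#include <assert.h>

/* Hypothetical body: increments the int and returns the inner pointer. */
int *foo(int **bar)
{
    **bar += 1;      /* **bar is an int */
    return *bar;     /* *bar is a pointer to int */
}

int call_foo(void)
{
    int n = 41;
    int *p = &n;
    int value = *foo(&p);   /* *foo(arg) is an int */

    assert(foo(&p) == &n);  /* foo(arg) is a pointer to int (n is now 43) */
    return value;
}
```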

C++ adds the concept of references, and messes C style declarations up a little bit in the process. Because taking the address of a variable using the address-of operator & must result in a pointer, C doesn't have any use for & in declarations; int &x would mean &x is an int, implying that x is some type where taking the address of that type results in an int.² So because this syntax is unused, C++ appropriates it for a completely different purpose.

In C++ int &x means that x is a reference to an int. Using the variable does not involve any operator to 'dereference' the reference, so it doesn't matter that the reference declarator symbol clashes with the address-of operator. The same symbol means completely different things in the two contexts, and there is never a need to use one meaning in the context where the other is allowed.

So char &foo(int &a) declares a function taking a reference to an int and returning a reference to a char. func(&x) is an expression taking the address of x and passing it to func.


1. In fact in the original C syntax for declaring functions 'declarations follow use' was even more strictly followed. For example you'd declare a function as int foo(a,b) and the types of parameters were declared elsewhere, so that the declaration would look exactly like a use, without the extra typenames.

2. Of course int *&x; could make sense in that *&x could be an int, but C doesn't actually do that.

How to explain C pointers (declaration vs. unary operators) to a beginner?

For your student to understand the meaning of the * symbol in different contexts, they must first understand that the contexts are indeed different. Once they understand that the contexts are different (i.e. the difference between a declaration and a general expression) it isn't too much of a cognitive leap to understand what the differences are.

First, explain that the part of a declaration that names the variable is not an expression and cannot contain arithmetic operators (demonstrate this by showing that putting a - or + symbol in front of the variable name in a declaration simply causes an error). Then go on to show that an expression (e.g. the right hand side of an assignment) can contain operators. Make sure the student understands that an expression and a variable declaration are two completely different contexts.

When they understand that the contexts are different, you can go on to explain that when the * symbol is in a variable declaration in front of the variable identifier, it means 'declare this variable as a pointer'. Then you can explain that when used in an expression (as a unary operator) the * symbol is the 'dereference operator' and it means 'the value at the address of' rather than its earlier meaning.

To truly convince your student, explain that the creators of C could have used any symbol for the dereference operator (e.g. they could have used @ instead) but for whatever reason they made the design decision to use *.

All in all, there's no way around explaining that the contexts are different. If the student doesn't understand the contexts are different, they can't understand why the * symbol can mean different things.

Why is the Dereference operator used to declare pointers?

Many symbols in C and C++ are overloaded. That is, their meanings depend on the context where they are used. For example, the symbol & can denote the address-of operator and the binary bitwise AND operator.

The symbol * used in a declaration denotes a pointer:

int b = 10;
int *a = &b;

but used in expressions, when applied to a variable of a pointer type, denotes the dereference operator, for example:

printf( "%d\n", *a );

It also can denote the multiplication operator, for example you can write:

printf( "%d\n", b ** a );

that is the same as

printf( "%d\n", b * *a );

Similarly, the pair of square brackets can be used in a declaration of an array, like:

int a[10];

and as the subscript operator:

a[5] = 5;
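The roles listed above can be collected into one short sketch. The function name and the array name arr are made up for illustration (the text reuses the name a for both a pointer and an array, which would clash in a single scope).

```c
#include <assert.h>

int star_and_bracket_roles(void)
{
    int b = 10;
    int *a = &b;            /* declaration: '*' makes a a pointer to int      */
    int arr[10] = {0};      /* declaration: '[]' makes arr an array of 10 int */

    arr[5] = *a;            /* expression: '[]' subscripts, unary '*' dereferences */

    /* expression: the first '*' multiplies, the second dereferences */
    return b * *a + arr[5]; /* 10 * 10 + 10 */
}
```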

Why does the arrow (->) operator in C exist?

I'll interpret your question as two questions: 1) why -> even exists, and 2) why . does not automatically dereference the pointer. Answers to both questions have historical roots.

Why does -> even exist?

In one of the very first versions of the C language (which I will refer to as CRM for "C Reference Manual", which came with 6th Edition Unix in May 1975), the -> operator had a very exclusive meaning, not synonymous with the * and . combination.

The C language described by CRM was very different from the modern C in many respects. In CRM struct members implemented the global concept of byte offset, which could be added to any address value with no type restrictions. I.e. all names of all struct members had independent global meaning (and, therefore, had to be unique). For example you could declare

struct S {
int a;
int b;
};

and the name a would stand for offset 0, while the name b would stand for offset 2 (assuming int type of size 2 and no padding). The language required all members of all structs in the translation unit to either have unique names or stand for the same offset value. E.g. in the same translation unit you could additionally declare

struct X {
int a;
int x;
};

and that would be OK, since the name a would consistently stand for offset 0. But this additional declaration

struct Y {
int b;
int a;
};

would be formally invalid, since it attempted to "redefine" a as offset 2 and b as offset 0.

And this is where the -> operator comes in. Since every struct member name had its own self-sufficient global meaning, the language supported expressions like these

int i = 5;
i->b = 42; /* Write 42 into `int` at address 7 */
100->a = 0; /* Write 0 into `int` at address 100 */

The first assignment was interpreted by the compiler as "take address 5, add offset 2 to it and assign 42 to the int value at the resultant address". I.e. the above would assign 42 to int value at address 7. Note that this use of -> did not care about the type of the expression on the left-hand side. The left hand side was interpreted as an rvalue numerical address (be it a pointer or an integer).
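Modern C rejects i->b when i is an int, but the "address plus member offset" behaviour can be approximated explicitly. The following is only a sketch of that idea using offsetof and memcpy, with hypothetical function names; it is not how the 1975 compiler was implemented, and it writes into a local byte buffer rather than a literal address such as 5 or 100.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct S {
    int a;
    int b;
};

/* Approximates CRM-era `addr->b = value`: add b's byte offset to an address
   of any type and store an int there. memcpy sidesteps alignment and
   aliasing concerns that a direct cast would raise. */
void store_at_b_offset(void *addr, int value)
{
    unsigned char *target = (unsigned char *)addr + offsetof(struct S, b);
    memcpy(target, &value, sizeof value);
}

int demo_offset_store(void)
{
    unsigned char block[sizeof(struct S)] = {0};  /* raw bytes, never declared as struct S */
    int stored;

    store_at_b_offset(block, 42);
    memcpy(&stored, block + offsetof(struct S, b), sizeof stored);
    return stored;
}
```

On a typical modern target offsetof(struct S, b) is 4 rather than the 2 mentioned in the text, since int is usually 4 bytes there.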

This sort of trickery was not possible with * and . combination. You could not do

(*i).b = 42;

since *i is already an invalid expression. The * operator, since it is separate from ., imposes more strict type requirements on its operand. To provide a capability to work around this limitation CRM introduced the -> operator, which is independent from the type of the left-hand operand.

As Keith noted in the comments, this difference between -> and the * and . combination is what CRM is referring to as "relaxation of the requirement" in 7.1.8: Except for the relaxation of the requirement that E1 be of pointer type, the expression E1->MOS is exactly equivalent to (*E1).MOS

Later, in K&R C many features originally described in CRM were significantly reworked. The idea of "struct member as global offset identifier" was completely removed. And the functionality of -> operator became fully identical to the functionality of * and . combination.
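In today's C the equivalence is easy to check. A minimal sketch, using a hypothetical struct point:

```c
#include <assert.h>

struct point {   /* hypothetical type for illustration */
    int x;
    int y;
};

int arrow_matches_star_dot(void)
{
    struct point pt = {1, 2};
    struct point *p = &pt;

    p->x = 10;       /* same object as (*p).x */
    (*p).y = 20;     /* same object as p->y   */

    assert(&p->x == &(*p).x);   /* both spellings name the same member */
    return pt.x + pt.y;
}
```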

Why can't . dereference the pointer automatically?

Again, in CRM version of the language the left operand of the . operator was required to be an lvalue. That was the only requirement imposed on that operand (and that's what made it different from ->, as explained above). Note that CRM did not require the left operand of . to have a struct type. It just required it to be an lvalue, any lvalue. This means that in CRM version of C you could write code like this

struct S { int a, b; };
struct T { float x, y, z; };

struct T c;
c.b = 55;

In this case the compiler would write 55 into an int value positioned at byte-offset 2 in the contiguous memory block known as c, even though type struct T had no field named b. The compiler would not care about the actual type of c at all. All it cared about was that c was an lvalue: some sort of writable memory block.

Now note that if you did this

struct S *s;
...
s.b = 42;

the code would be considered valid (since s is also an lvalue) and the compiler would simply attempt to write data into the pointer s itself, at byte-offset 2. Needless to say, things like this could easily result in memory overrun, but the language did not concern itself with such matters.

I.e. in that version of the language your proposed idea about overloading operator . for pointer types would not work: operator . already had very specific meaning when used with pointers (with lvalue pointers or with any lvalues at all). It was very weird functionality, no doubt. But it was there at the time.

Of course, this weird functionality is not a very strong reason against introducing overloaded . operator for pointers (as you suggested) in the reworked version of C - K&R C. But it hasn't been done. Maybe at that time there was some legacy code written in CRM version of C that had to be supported.


Interpreting the & sign in C++

Since there are only so many ASCII symbols, it's inevitable that they get re-used for contradictory purposes, and & is one such character.

It means either "address of" in an expression or "reference to" in a declaration. Thinking of the "right-hand-side" is close, but what really matters is whether the & appears in an expression or in a declarator. Remember the rules apply in function definitions as well, as in int f(int &a) where there's no RHS in play.

In your example:

int & b = a;

This reads as "b is a reference to a, which is an int".

Keep in mind references might seem similar to addresses as in pointers, but they are not at all the same. A reference is an alias: it is entirely equivalent to the object it refers to. A pointer is, by definition, always one degree removed.

That is in the case of:

b = 2;

This assigns directly to b, which is an alias for a, so they both change. With a pointer int *c = &a you would need to write *c = 2, which changes a but does not change c: the address stored in c remains the same.

Remember you can have references to pointers, such as int *&r, and pointers to pointers to any depth. The reverse does not hold: C++ has no pointers to references and no references to references, so a declaration like int &*&**&***a is ill-formed no matter how much you might wish otherwise.

So things to keep in mind for a simple example of int &b = a:

  • b and a share the same address
  • Modifying b always modifies a
  • To the compiler, b is just an alias for a
  • b[0] is a compile error unless a can be indexed as an array
  • *b is a compile error unless a can be de-referenced as a pointer

Whereas for int *c = &a:

  • c has a different address from a
  • c is a different variable from a
  • Modifying c directly does not modify a
  • Modifying a de-referenced *c does modify a
  • c behaves like an array, as in c[0] can be used to fetch or modify a
  • Pointers generally incur an additional level of overhead when de-referencing and "exercising" them
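The pointer half of this comparison can be checked in plain C (the reference half needs C++). A minimal sketch, assuming c is initialized with &a; the function name and the variable other are made up for illustration:

```c
#include <assert.h>

int pointer_properties(void)
{
    int a = 1;
    int other = 99;
    int *c = &a;

    assert((void *)&c != (void *)&a);  /* c is its own object with its own address */

    *c = 2;          /* modifying the de-referenced *c modifies a */
    assert(a == 2);

    c[0] = 3;        /* c[0] means the same thing as *c */
    assert(a == 3);

    c = &other;      /* modifying c itself leaves a untouched */
    assert(a == 3);

    return a + *c;   /* 3 + 99 */
}
```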

If you're ever wondering what's going on, look at the assembly output of a simple program that uses both pointers and references. The differences can be substantial.

Beginners question - Why does C use * to declare a pointer and not &?

Using * to declare pointers is mostly a matter of convention, but there is a reason of consistency: the * in the declaration int *p means int is the type of *p.

It might seem more consistent to write int &p = &n as p is initialized to the address of n but this convention would not hold for double pointers: int **pp defines pp as a pointer to a pointer to an int, yet pp cannot be initialized with &(&n).
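A short sketch of why the intermediate pointer is needed (names are hypothetical). The expression &n is a value rather than an object, so there is nothing for a second & to take the address of; the commented-out line is what a compiler would reject.

```c
#include <assert.h>

int double_pointer_demo(void)
{
    int n = 5;
    int *p = &n;       /* p is an object that holds the address of n */
    int **pp = &p;     /* pp points at p, a real int * object */
    /* int **bad = &(&n);   rejected: & requires an lvalue operand */

    assert(*pp == &n); /* *pp is a pointer to int */
    **pp = 6;          /* **pp is an int, as the declaration says */
    return n;
}
```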

Note that int& p = n; is a valid definition in C++ for a reference to an int, which can be thought of as a pointer in disguise. Modifying p would then modify n. Compilers typically implement references as pointers without the indirection notation, although the language does not require this.


