What Are the Stages of Compilation of a C++ Program

What are the stages of compilation of a C++ program?

Are the stages of compilation of a C++ program specified by the standard?

Yes and no.

The C++ standard defines 9 "phases of translation". Quoting from the N3242 draft (10MB PDF), dated 2011-02-28 (prior to the release of the official C++11 standard), section 2.2:

The precedence among the syntax rules of translation is specified by the following phases [see footnote].

1. Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. [SNIP]
2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. [SNIP]
3. The source file is decomposed into preprocessing tokens (2.5) and sequences of white-space characters (including comments). [SNIP]
4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. [SNIP]
5. Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set; [SNIP]
6. Adjacent string literal tokens are concatenated.
7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. (2.7). The resulting tokens are syntactically and semantically analyzed and translated as a translation unit. [SNIP]
8. Translated translation units and instantiation units are combined as follows: [SNIP]
9. All external entity references are resolved. Library components are linked to satisfy external references to entities not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

[footnote] Implementations must behave as if these separate phases occur, although in practice different phases might be folded together.

As indicated by the [SNIP] markers, I haven't quoted the entire section, just enough to get the idea across.

To emphasize, compilers are not required to follow this exact model, as long as the final result is as if they did.

Phases 1-6 correspond more or less to the preprocessor, 7 to what you might normally think of as compilation, 8 deals with templates, and 9 corresponds to linking.

(C's translation phases are similar, but #8 is omitted.)

What are the main steps behind compiling?

By the end of the day I need to transform my C code to a language that specifically my CPU should understand. So, who cares about knowing my CPU-specific instructions? The operating system?

You are not very clear here. If you are asking which tool has knowledge of your CPU-specific instructions, it is the assembler, the disassembler, the debugger, and possibly some others. They can generate machine code or convert it back into assembly.

If you are asking who cares about which instructions are used, it is the processor that needs to execute them, because each instruction set encodes even such a common instruction as "add two integers" in a completely different manner.

Is gcc converting any C to assembly language?

Yes, C (or a program in any other supported language) is converted to assembly by GCC. There are many steps involved, and at least two additional internal representations are used in the process. The details are explained in the GCC internals document. Finally, the compiler "backend" generates the assembly representation of simple "patterns" produced by the previous compiler passes. You can ask GCC to output this assembly by using the -S flag. If you don't specifically ask for it, the next step (assembling) is executed automatically and you only see your final executable file.
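
To try the -S flag yourself, a tiny translation unit is enough. This is only a sketch: the file name add.cpp and the function add are made up for the example, and the exact assembly you get depends on your target and optimization level.

```cpp
// add.cpp: a hypothetical file name used only for this sketch.
// Running `g++ -S add.cpp` should stop after compilation proper and
// leave the generated assembly in add.s instead of building an executable.
int add(int a, int b) {
    return a + b;  // the interesting part of add.s is the code for this line
}
```

Comparing the add.s files produced for two different architectures is a quick way to see that the same source maps to different instructions.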

I know (actually guess) that for each processor type I will need an assembler that will interpret (?) the assembly code and translate it to my CPU-specific instructions. Where is this assembler (who ships it)? Does it come with the OS?

First, note that the assembly language of each CPU differs, as it is supposed to represent that CPU's machine language 1:1. The assembler then translates assembly code into machine code. Who ships it? Anyone who builds it. With the GNU toolchain it is part of the binutils package, which is usually installed by default on most Linux distributions. It is not the only assembler available. Also note that although the GNU "suite" (GCC/binutils/gdb) supports many architectures, you need to use the appropriate port for your architecture. Your desktop PC's default assembler, for example, cannot assemble into ARM machine code.

Why exactly I can't see the 0s and 1s if I open the binary file with a text editor?

Because a text editor is supposed to show the text representation of those 0s and 1s. Assuming each character in the file takes 8 bits, it interprets each subsequent group of 8 bits as a single character instead of showing the separate bits. If you know that in ASCII (stored one character per 8-bit byte) the letter 'A' is represented by the value 65, you can also convert this back to binary: 01000001. It is a bit easier to convert a hexadecimal representation back to binary; for this you can use hexdump (or a similar tool).
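
The 'A' to 01000001 conversion can be checked with a few lines of C++. This is a minimal sketch; byte_to_bits is a helper name invented for the example.

```cpp
#include <bitset>
#include <string>

// Returns the 8 bits of one byte as text, most significant bit first.
std::string byte_to_bits(unsigned char byte) {
    return std::bitset<8>(byte).to_string();
}
```

Calling byte_to_bits('A') on an ASCII system gives "01000001", the same bit pattern worked out by hand above.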

At what stage of C program compilation, an inline function is inserted into the caller function

"Compiler stages" is not really a standardized term. The C language only specifies something called translation phases, which in detail specify the various pre-processor stages, but is very vague about all the work that goes on after pre-processing. It is summarized by the standard as the final translation phase:

All external object and function references are resolved. Library components are
linked to satisfy external references to functions and objects not defined in the
current translation. All such translator output is collected into a program image
which contains information needed for execution in its execution environment.

This includes optimization and anything else that needs to be done before producing the executable binary. All the details of how and when this is done are left to the compiler implementation to decide as it sees fit.

Compiling process in terms of a simple c++ program

This is a very broad question in general, but I'll try to answer as briefly as possible.
A typical language processing system has the following phases:

1. Preprocessing Phase - In this phase all preprocessor directives and macros are handled, and code is generated that is free of them. This involves replacing each macro call with the macro body and replacing the formal parameters with the actual arguments.

2. Compilation Phase - This has several smaller phases, such as:
lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, target code generation, etc.
The compilation phase may or may not produce assembly code. Both approaches have their own pros and cons. We will assume in this discussion that assembly code was produced.

3. Assembly Phase - The assembler converts the output of the compiler to target code. Assemblers can be one-pass or two-pass in nature.

4. Linking Phase - The code that has been produced has many references and calls to subroutines that are defined in other modules. Such modules are linked to the code in this phase, and addresses are assigned to the instructions that have outside references.

5. Loading Phase - In this phase, all the segments produced in the previous phase are loaded into RAM for actual execution, and control is passed to the first instruction.

The components listed in this answer all have many intricacies and sub-parts, and this is in no way a complete explanation of a language processor.

There are useful books on these topics by authors such as D. M. Dhamdhere, Tanenbaum and Alfred Aho.
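
To make the preprocessing phase from the list above a little more concrete, here is a minimal sketch of macro replacement. SQUARE and square_of_next are hypothetical names chosen for the example.

```cpp
// SQUARE is a hypothetical function-like macro used only for illustration.
#define SQUARE(x) ((x) * (x))

// After preprocessing, the body below reads: return ((n + 1) * (n + 1));
// The formal parameter x has been replaced by the actual argument n + 1.
int square_of_next(int n) {
    return SQUARE(n + 1);
}
```

The parentheses in the macro body matter because the replacement is purely textual: without them, the argument n + 1 would be pasted into x * x and the result would follow ordinary operator precedence instead.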

Execute Large C Program By Generating Intermediate Stages

Problem 2: Opening the file on each iteration of the loop because it's changed

I may not be the best qualified to answer this, but doing fopen (and fclose) on each iteration seems wasteful and slow. To answer, or to have anyone more qualified answer, I think we'd need to know more about your data.

For instance:

Is it text or binary?
Are you processing records or a stream of text? That is, is it a file of records or a stream of data? (you aren't cracking genes are you? :-)

I ask because, judging by your comment "because it's changed each iteration", you might be better off using a random-access file. I'm guessing you're re-opening in order to fseek to a point that you may already have passed (in your stream of data) and make a change there. However, if you open a file as binary, you can move anywhere in the file using fsetpos and fseek. That is, you can "seek" backwards.

Additionally, if your data is record-based or otherwise organised, you could also create an index for it. With this, you could use fsetpos to set the file position to the index entry you're interested in and traverse from there, saving time in finding the area of data to change. You could even persist your index in an accompanying index file.

Note that you can write plain text to a binary file. Perhaps worth investigating?
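
As a rough sketch of the "open once, seek backwards" idea, assuming the stream was opened for update in binary mode (for example "rb+" or "wb+"): the helper names overwrite_byte and read_byte are invented for this example and are not part of the standard library.

```cpp
#include <cstdio>

// Overwrites one byte at the given offset in an already-open stream.
// Returns true on success. The stream is never closed or re-opened here.
bool overwrite_byte(std::FILE* f, long offset, char value) {
    if (std::fseek(f, offset, SEEK_SET) != 0) return false;  // may seek backwards
    if (std::fputc(value, f) == EOF) return false;
    return std::fflush(f) == 0;  // push the change out before any later read
}

// Reads one byte at the given offset; returns EOF on failure.
int read_byte(std::FILE* f, long offset) {
    if (std::fseek(f, offset, SEEK_SET) != 0) return EOF;
    return std::fgetc(f);
}
```

The fseek calls also serve as the repositioning that the C library requires between a write and a following read on the same update stream, so one FILE* can be reused for the whole loop.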

How does the compilation/linking process work?

The compilation of a C++ program involves three steps:

Preprocessing: the preprocessor takes a C++ source code file and deals with the #includes, #defines and other preprocessor directives. The output of this step is a "pure" C++ file without pre-processor directives.
Compilation: the compiler takes the pre-processor's output and produces an object file from it.
Linking: the linker takes the object files produced by the compiler and produces either a library or an executable file.

Preprocessing

The preprocessor handles the preprocessor directives, like #include and #define. It is agnostic of the syntax of C++, which is why it must be used with care.

It works on one C++ source file at a time by replacing #include directives with the content of the respective files (which is usually just declarations), doing replacement of macros (#define), and selecting different portions of text depending on #if, #ifdef and #ifndef directives.

The preprocessor works on a stream of preprocessing tokens. Macro substitution is defined as replacing tokens with other tokens (the operator ## enables merging two tokens when it makes sense).
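
A minimal sketch of token merging with ##, under the assumption that a couple of trivial getter functions are wanted. MAKE_GETTER, get_width and get_height are hypothetical names used only for illustration.

```cpp
// The ## operator pastes "get_" and the macro argument into one identifier.
#define MAKE_GETTER(name, value) \
    int get_##name() { return (value); }

MAKE_GETTER(width, 640)   // expands to: int get_width() { return (640); }
MAKE_GETTER(height, 480)  // expands to: int get_height() { return (480); }
```

The paste only "makes sense" when the result is a single valid preprocessing token, as it is here, where get_ and width combine into the identifier get_width.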

After all this, the preprocessor produces a single output that is a stream of tokens resulting from the transformations described above. It also adds some special markers that tell the compiler where each line came from so that it can use those to produce sensible error messages.

Some errors can be produced at this stage with clever use of the #if and #error directives.
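
For instance, a configuration check along these lines fails during preprocessing, before the compiler proper ever sees the code. BUFFER_SIZE and buffer_size are made-up names for this sketch.

```cpp
// BUFFER_SIZE is a hypothetical configuration macro; a build could also
// supply it on the command line, in which case this default is skipped.
#ifndef BUFFER_SIZE
#define BUFFER_SIZE 256
#endif

// Preprocessing stops with this message if the value is unusable.
#if BUFFER_SIZE < 16
#error "BUFFER_SIZE must be at least 16"
#endif

int buffer_size() { return BUFFER_SIZE; }
```

With the default of 256 the #error branch is skipped; defining BUFFER_SIZE as, say, 8 before this point would abort translation with the message shown.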

Compilation

The compilation step is performed on each output of the preprocessor. The compiler parses the pure C++ source code (now without any preprocessor directives) and converts it into assembly code. It then invokes the underlying back end (the assembler in the toolchain), which assembles that code into machine code, producing an actual binary file in some format (ELF, COFF, a.out, ...). This object file contains the compiled code (in binary form) of the symbols defined in the input. Symbols in object files are referred to by name.

Object files can refer to symbols that are not defined. This is the case when you use a declaration, and don't provide a definition for it. The compiler doesn't mind this, and will happily produce the object file as long as the source code is well-formed.
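
Here is a small sketch of what "declared but not defined" looks like in source. The functions twice and twice_plus_one are hypothetical; the definition of twice is included at the bottom only so that the example links on its own, and it could equally live in a different source file.

```cpp
// Declaration only: enough for the compiler to type-check calls to twice().
int twice(int x);

// Compiling just this function yields an object file that refers to
// the symbol for twice by name without knowing where it is defined.
int twice_plus_one(int x) {
    return twice(x) + 1;
}

// The definition the linker eventually needs to find.
int twice(int x) {
    return 2 * x;
}
```

If the last definition were removed, each file would still compile, and the missing symbol would only be reported when linking.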

Compilers usually let you stop compilation at this point. This is very useful because with it you can compile each source code file separately. The advantage this provides is that you don't need to recompile everything if you only change a single file.

The produced object files can be put in special archives called static libraries, for easier reusing later on.

It's at this stage that "regular" compiler errors, like syntax errors or failed overload resolution errors, are reported.

Linking

The linker is what produces the final compilation output from the object files the compiler produced. This output can be either a shared (or dynamic) library (and while the name is similar, they haven't got much in common with static libraries mentioned earlier) or an executable.

It links all the object files by replacing the references to undefined symbols with the correct addresses. Each of these symbols can be defined in other object files or in libraries. If they are defined in libraries other than the standard library, you need to tell the linker about them.

At this stage the most common errors are missing definitions or duplicate definitions. The former means that either the definitions don't exist (i.e. they are not written), or that the object files or libraries where they reside were not given to the linker. The latter is obvious: the same symbol was defined in two different object files or libraries.

At what stage of compilation are reserved identifiers reserved?

(The comments on the question explain that we're talking about reserved identifiers in the sense of C99 section 7.1.3, i.e., identifiers matching /^_[A-Z_]/ anywhere, /^_/ in file scope, /^str[a-z]/ with external linkage, etc. So here's my guess at at least a part of what you're asking...)

They're not reserved in the sense that (any particular phase of) the compiler is expected to diagnose their misuse. Rather, they're reserved in that if you're foolish enough to (mis)use them yourself, you don't get to complain if your program stops working or stops compiling at a later date.

We've all seen what happens when people with only a dangerous amount of knowledge look inside system headers and then write their own header guards:

#ifndef _MYHEADER_H
#define _MYHEADER_H
// ...
#endif

They're invoking undefined behaviour, but nothing diagnoses this as "error: reserved identifier used by end-user code". Instead mostly they're lucky and all is well; but occasionally they collide with an identifier of interest to the implementation, and confusing things happen.
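
For comparison, a guard that stays out of the reserved namespace could look like the sketch below. The header name, the MYPROJECT_ prefix and the function answer are all hypothetical; any name without a leading underscore (and, in C++, without a double underscore) would do.

```cpp
// myheader.h: a hypothetical header used only for illustration.
// MYPROJECT_MYHEADER_H has no leading underscore, so it is not one of
// the identifiers reserved for the implementation.
#ifndef MYPROJECT_MYHEADER_H
#define MYPROJECT_MYHEADER_H

inline int answer() { return 42; }

#endif  // MYPROJECT_MYHEADER_H
```

A project-specific prefix also lowers the chance of colliding with another library's guard, which is a separate problem from the reserved-identifier one.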

Similarly, I often have an externally-visible function named strip() or so:

#include <ctype.h>
char *strip(char *s) {
  // remove leading whitespace
  while (isspace((unsigned char)*s)) s++;
  return s;
}

By my reading of C99's 7.1.3, 7.26, and 7.26.11, this invokes undefined behaviour. However I have decided not to care about this. The identifier is not reserved in that anything bad is expected to happen today, but because the Standard reserves to itself the right to invent a new standard str-ip() routine in a future revision. And I've decided that I reckon string-ip, whatever that might be, is an unlikely name for a string operation to be added in the future -- so in the unlikely event that happens, I'll cross that bridge when I get to it. Technically I'm invoking undefined behaviour, but I don't expect to get bitten.

Finally, a counter-example to your point 4:

#include <string.h>
#define memcpy(d,s,n)  (my_crazy_function((n), (s)))
void foo(char *a, char *b) {
  memcpy(a, b, 5);  // intends to invoke my_crazy_function
  memmove(a, b, 5); // standard behaviour expected
}

This complies with your 4.1, 4.2, 4.3 (if I understand your intention on that last one). However, if memmove is additionally implemented as a macro (via 7.1.4/1) that is written in terms of memcpy, then you're going to be in trouble.
