What Do C and Assembler Actually Compile To

What do C and Assembler actually compile to?

C typically compiles to assembler, just because that makes life easy for the poor compiler writer.

Assembly code always assembles (not "compiles") to relocatable object code. You can think of this as binary machine code and binary data, but with lots of decoration and metadata. The key parts are:

Code and data appear in named "sections".
Relocatable object files may include definitions of labels, which refer to locations within the sections.
Relocatable object files may include "holes" that are to be filled with the values of labels defined elsewhere. The official name for such a hole is a relocation entry.

For example, if you compile and assemble (but don't link) this program

#include <stdio.h>
int main () { printf("Hello, world\n"); }

you are likely to wind up with a relocatable object file with

A text section containing the machine code for main
A label definition for main which points to the beginning of the text section
A rodata (read-only data) section containing the bytes of the string literal "Hello, world\n"
A relocation entry that depends on printf and that points to a "hole" in a call instruction in the middle of the text section.
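
To make the "hole" idea concrete, here is a minimal sketch of what filling one relocation entry could look like. The struct layouts and the name apply_reloc are hypothetical, and the model is deliberately simplified: it patches a 4-byte absolute address, whereas real object formats define several relocation types (PC-relative ones, for instance).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified relocation entry: "write the address of
   `symbol` into the section at byte offset `offset`". */
struct reloc {
    const char *symbol;   /* name defined elsewhere, e.g. "printf" */
    size_t      offset;   /* where the hole starts in the section  */
};

/* Hypothetical symbol-table entry, as a linker might see it once all
   object files and libraries have been read. */
struct symbol {
    const char *name;
    uint32_t    address;
};

/* Fill one hole. Returns 0 on success, -1 if the symbol is undefined
   or the hole would not fit inside the section. */
int apply_reloc(uint8_t *section, size_t section_len,
                const struct reloc *r,
                const struct symbol *syms, size_t nsyms)
{
    if (r->offset > section_len ||
        section_len - r->offset < sizeof(uint32_t))
        return -1;
    for (size_t i = 0; i < nsyms; i++) {
        if (strcmp(syms[i].name, r->symbol) == 0) {
            /* Copy the 4-byte address into the hole (host byte order). */
            memcpy(section + r->offset, &syms[i].address, sizeof(uint32_t));
            return 0;
        }
    }
    return -1;  /* undefined symbol: a real linker would report an error */
}
```

In this model, hello.o would carry one struct reloc naming printf, and the hole stays empty until a linker supplies a symbol table that defines it.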

If you are on a Unix system, a relocatable object file is generally called a .o file, as in hello.o. You can explore the label definitions and uses with a simple tool called nm, and you can get more detailed information from a somewhat more complicated tool called objdump.

I teach a class that covers these topics, and I have students write an assembler and linker, which takes a couple of weeks, but when they've done that most of them have a pretty good handle on relocatable object code. It's not such an easy thing.

Does a C compiler compile to generic assembly?

There is no requirement to compile C to any specific assembly language, or to assembly at all. Those choices are left to the implementor of a compiler and are not part of the language specification. Many CPU vendors supply a C compiler that targets their specific architecture.

There are more generic compilers like GCC and Clang though, which can target many different instruction sets.

To use Clang as an example, it is built on LLVM (originally short for Low Level Virtual Machine), which defines an "intermediate representation" language, LLVM IR. A back-end is written for each architecture that LLVM can target, converting LLVM IR to that instruction set, so any compiler which compiles to LLVM IR can target the CPUs supported by LLVM.

The compiler decides which back-end to use when it is run, based on the arguments you pass to it. It typically has a default back-end, set through configuration when the compiler itself is built (which will probably default to the architecture you're building the compiler on).
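
One way to see which target was selected is from inside the program itself, since GCC and Clang predefine different macros for each architecture. The following is only a sketch: the function name target_arch_name is made up for illustration, and the list of macros is far from complete.

```c
/* Returns a name for the architecture this translation unit was
   compiled for, based on macros that GCC and Clang predefine. */
const char *target_arch_name(void)
{
#if defined(__x86_64__)
    return "x86-64";
#elif defined(__i386__)
    return "x86 (32-bit)";
#elif defined(__aarch64__)
    return "AArch64";
#elif defined(__arm__)
    return "ARM (32-bit)";
#elif defined(__riscv)
    return "RISC-V";
#else
    return "unknown";   /* some other target, or a different compiler */
#endif
}
```

Compiling the same source with a cross-compiler, or with Clang and a different --target= triple, should make this function return a different string without any change to the C code.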

GCC probably uses a similar approach with some intermediate representation, but I'm not sure of the details. There's also a GCC back-end which can target LLVM too.

Why C language build process includes 'assemble' process?

Timwi was simplifying the build process.

The point is that C code is typically not compiled to bytecode for a virtual machine, but to native code for your machine. This is "direct" because there's no further processing to be done at run time to make the program runnable, as there typically is with languages like C# or Java.

There may be a multi-step process, including compiling, assembling, and linking, to achieve that, but the steps in the process aren't important to the point Timwi was making.

Do programming language compilers first translate to assembly or directly to machine code?

gcc actually produces assembler and assembles it using the as assembler. Not all compilers do this - the MS compilers produce object code directly, though you can make them generate assembler output. Translating assembler to object code is a pretty simple process, at least compared with C→Assembly or C→Machine-code translation.

Some compilers produce other high-level language code as their output - for example, cfront, the first C++ compiler, produced C as its output which was then compiled to machine code by a C compiler.

Note that neither direct compilation nor assembly actually produces an executable. That is done by the linker, which takes the various object code files produced by compilation/assembly, resolves all the names they contain and produces the final executable binary.

Does a compiler have an assembler too?

It depends on the compiler; many compilers can compile to assembly. For instance, if you pass the '-S' flag to gcc, like:

gcc -S -o test.s test.c

That will output assembly for your test.c file into the file test.s, which you can look at. (I recommend using -O0 if you're gonna be trying to read the assembly, because compiler optimizations in there will likely confuse the heck out of you).
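
If you want something small to try this on, a file along these lines could serve as test.c. The function names are invented for the example, and the exact assembly you get will depend on your compiler, target, and flags.

```c
/* test.c: a small translation unit to inspect with the -S option. */

/* A global function: expect a label for it in the assembly output
   (spelled square on ELF targets, _square on some other platforms). */
int square(int x)
{
    return x * x;
}

/* A string literal: expect its bytes to show up in a read-only
   data section of the assembly output. */
const char *greeting(void)
{
    return "Hello, world\n";
}
```

There is no main here on purpose: -S stops before assembling and linking, so the file only needs to be a valid translation unit.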

Since you mentioned Visual C++ in your question, Paul Dixon points out below that Visual C++ uses the /FA flag to accomplish the same thing.

Does a compiler always produce an assembly code?

TL;DR: different object file formats / easier portability to new Unix platforms (historically) is one of the main reasons for gcc keeping the assembler separate from the compiler, I think. Outside of gcc, the mainstream x86 C and C++ compilers (clang/LLVM, MSVC, ICC) go straight to machine code, with the option of printing asm text if you ask them to.

LLVM and MSVC are, or come with, complete toolchains, not just compilers: they also come with an assembler and linker. LLVM already has object-file handling as a library function, so it can use that instead of writing out asm text to feed to a separate program.

Smaller projects often choose to leave object-file format details to the assembler. For example, FreePascal can go straight to an object file on a few of its target platforms, but otherwise only to asm. There are many claims that almost all compilers go through asm text, but that's not true for many of the biggest, most widely used compilers (except GCC) that have lots of developers working on them.

C compilers tend to either target a single platform only (like a vendor's compiler for a microcontroller) and were written as "the/a C implementation for this platform", or be very large projects like LLVM where including machine code generation isn't a big fraction of the compiler's own code size. Compilers for less widely used languages are more usually portable, but without wanting to write their own machine-code / object-file handling. (Many compilers these days are front-ends for LLVM, so get .o output for free, like rustc, but older compilers didn't have that option.)

Out of all compilers ever, most do go to asm. But if you weight by how often each one is used every day, going straight to a relocatable object file (.o / .obj) accounts for a significant fraction of the total builds done on any given day worldwide. That is, the compiler you care about if you're reading this might well work this way.

Also, compilers like javac that target a portable bytecode format have less reason to use asm; the same output file and bytecode format work across every platform they have to run on.

https://retrocomputing.stackexchange.com/questions/14927/when-and-why-did-high-level-language-compilers-start-targeting-assembly-language on retrocomputing has some other answers about advantages of keeping as separate.
What is the need to generate ASM code in gcc, g++
What do C and Assembler actually compile to? - even compilers that go straight to machine code don't produce linked executables directly, they produce relocatable object files (.o or .obj). Except for tcc, the Tiny C Compiler, intended for use on the fly for one-file C programs.
Semi-related: Why do we even need assembler when we have compiler? asm is useful for humans to look at machine code, not as a necessary part of C -> machine code.

Why GCC does what it does

Yes, as is a separate program that the gcc front-end actually runs separately from cc1 (the C preprocessor+compiler that produces text asm).

This makes gcc slightly more modular, making the compiler itself a text -> text program.

GCC internally uses some binary data structures for GIMPLE and RTL internal representations, but it doesn't write (text representations of) those IR formats to files unless you use a special option for debugging.

So why stop at assembly? This means GCC doesn't need to know about different object file formats for the same target. For example, different x86-64 OSes use ELF, PE/COFF, MachO64 object files, and historically a.out. as assembles the same text asm into the same machine code surrounded by different object file metadata on different targets. (There are minor differences gcc has to know about, like whether to prepend an _ to symbol names or not, and whether 32-bit absolute addresses can be used, and whether code has to be PIC.)
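
As a rough illustration of one of those minor differences, here is a sketch of how a compiler might pick the assembler-level name for a C identifier. The struct target_conv and the function asm_symbol_name are hypothetical and are not taken from GCC's sources.

```c
#include <stdio.h>

/* Hypothetical description of a target detail the compiler still has
   to know, even when the assembler handles the object-file format. */
struct target_conv {
    int leading_underscore;  /* e.g. 1 for Mach-O, 0 for ELF on Linux */
};

/* Write the assembler-level name for C identifier `c_name` into `out`.
   Returns the length that snprintf reports. */
int asm_symbol_name(const struct target_conv *t, const char *c_name,
                    char *out, size_t out_len)
{
    return snprintf(out, out_len, "%s%s",
                    t->leading_underscore ? "_" : "", c_name);
}
```

With this model, main stays main for an ELF target and becomes _main for a Mach-O one, while everything else about the emitted text asm can stay the same.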

Any platform-specific quirks can be left to GNU binutils as (aka GAS), or gcc can use the vendor-supplied assembler that comes with a system.

Historically, there were many different Unix systems with different CPUs, or especially the same CPU but different quirks in their object file formats. More importantly, they shared a fairly compatible set of assembler directives like .globl main, .asciiz "Hello World!\n", and similar. GAS syntax comes from Unix assemblers.

It really was possible in the past to port GCC to a new Unix platform without porting as, just using the assembler that comes with the OS.

Nobody has ever gotten around to integrating an assembler as a library into GCC's cc1 compiler. That's been done for the C preprocessor (which historically was also done in a separate process), but not the assembler.

Most other compilers do produce object files directly from the compiler, without a text asm temporary file / pipe. Often because the compiler was only designed for one or a couple targets, like MSVC or ICC or various compilers that started out as x86-only, or many vendor-supplied compilers for embedded chips.

clang/LLVM was designed much more recently than GCC. It was designed to work as an optimizing JIT back-end, so it needed a built-in assembler to make it fast to generate machine code. To work as an ahead-of-time compiler, adding support for different object-file formats was presumably a minor thing since the internal software architecture was there to go straight to binary machine code.

LLVM of course uses LLVM-IR internally for target-independent optimizations before looking for back-end-specific optimizations, but again it only writes out this format as text if you ask it to.

Where do the compiler and assembler reside on a computer?

You are overcomplicating this. A compiler takes text in some format and converts it, typically, to text in another format. Say, for example, a C compiler turns C into assembly. A compiler is just a program, with nothing special about it: your web browser is a program, the text editor you use for writing the programs is just a program, and the command line/console, if you use one, is just a program. No magic.

An assembler is just a program that takes text in and typically outputs some form of binary file. There are many formats just like there are many binary formats for images and videos (bmp, jpg, png, gif, tiff, m4v, mpeg, etc). No magic, just a program that does a job like any of the ones listed above.
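
To show how un-magical the "text in, binary out" step is, here is a toy sketch that handles only three one-byte x86 instructions. The function name assemble_one is invented for the example, and a real assembler also has to deal with operands, labels, sections, and an object-file format.

```c
#include <stdint.h>
#include <string.h>

/* Toy "assembler" for three one-byte x86 instructions.
   Writes the encoding to *out and returns the number of bytes emitted,
   or 0 if the mnemonic is not in this tiny table. */
size_t assemble_one(const char *mnemonic, uint8_t *out)
{
    static const struct { const char *name; uint8_t opcode; } table[] = {
        { "nop",  0x90 },
        { "ret",  0xC3 },
        { "int3", 0xCC },
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(mnemonic, table[i].name) == 0) {
            *out = table[i].opcode;
            return 1;
        }
    }
    return 0;  /* unknown mnemonic */
}
```

Calling assemble_one("ret", &byte) stores 0xC3 in byte, which is the same byte a real x86 assembler would place in the text section for that instruction.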

Same goes for the linker, it takes binary files in and typically outputs a binary file out.

These programs typically live, like all other programs, on your hard drive, or at least on a drive you have mounted and can access, just like the web browser and text editor. To run them you ideally need them "in the path". If they are part of some IDE, the IDE might not need them in the path; it may know where they are relative to itself. Likewise the compiler, which often calls the assembler and linker for you, might not need the path; it may know or assume where they are relative to its own location. They live on the file system like any other program or file, but to execute them they need to be able to be found. Depending on the operating system and the installer for the toolchain, there are often different choices and not one global rule.

There is no reason why you can't have as many different compilers and assemblers as you can fit on your filesystem. They are just programs like any other, so you have to find a place for them and have a way to run them. There is no reason to assume that any two compilers produce the same binary from the same source code; likewise there is no reason to assume that any assembler is able to assemble the output of any compiler. That is where the term toolchain comes from: a set of tools that link together in a chain. The compiler outputs something the assembler in the toolchain knows how to deal with, and the assembler outputs something the linker knows how to deal with. You might have some cross-compatibility among different toolchains or vendors, but that doesn't mean they have to be compatible; it could be either by design or dumb luck.

Does the compiler actually produce Machine Code?

You are confusing a few things. A retargetable compiler like gcc, and other generic compilers, compile files to objects. The linker later links objects with other libraries as needed to make a so-called binary that the operating system can then read, parse, load the loadable blocks from, and start executing.

A sane compiler author will use assembly language as the output of the compiler; then the compiler, or the user in their makefile, calls the assembler, which creates the object. This is how gcc works. clang can work this way too, but by default it uses its integrated assembler to produce objects directly, and llc can also make objects directly rather than only assembly that gets assembled.

It makes far more sense to generate debuggable assembly language than to produce raw machine code. You really need a good reason, like JIT, to skip the step. I would avoid toolchains that go straight to machine code just because they can; they are harder to maintain and more likely to have bugs or take longer to fix bugs.

If the architecture is the same, there is no reason why you can't have a generic toolchain generate code for incompatible operating systems. The GNU tools, for example, can do this. Operating system differences are not, by definition, at the machine code level; most are at the high-level language level. The C libraries that you call to create GUI windows, etc. have nothing to do with the machine code or the processor architecture, and for some operating systems the same operating-system-specific C code can be used on MIPS, ARM, PowerPC, or x86. Where the architecture becomes specific is the mechanism by which actual system calls are invoked; a specific instruction is often used. Machine code is eventually used, yes, but there is no reason why this can't be coded in real or inline assembly.

This leads to libraries. Even fopen and printf, which are generic C calls, eventually have to make a system call. Much of the library support code can be written in a high-level language that is compatible across systems, but there will need to be a system- and architecture-specific bit of code for the last mile. You should see this in the glibc sources, for example, or in the hooks into newlib in other library solutions.
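
As a Linux-only sketch of that last mile, the function below asks the kernel to write bytes without going through stdio. The name raw_write is made up for the example. Note that syscall() is itself a C library function: the architecture-specific instruction that actually enters the kernel lives inside it, which is exactly the small non-portable piece described above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE        /* for the syscall() declaration in glibc */
#endif
#include <sys/syscall.h>   /* SYS_write */
#include <unistd.h>        /* syscall() */

/* Linux-specific sketch: ask the kernel to write `len` bytes from `buf`
   to file descriptor `fd`. Returns the number of bytes written, or -1
   on error (with errno set by the C library). */
long raw_write(int fd, const void *buf, unsigned long len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

Everything above this function, such as the formatting done by printf, can be ordinary portable C; only the system call number and the entry mechanism differ between systems and architectures.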

Same is true for other languages like C++ as it is for C. Interpreted languages have additional layers but their virtual machines are just programs that sit on similar layers.

Low-level programming doesn't mean machine or assembly language; it just means that whatever programming language you are using accesses things at a lower level, below the application or below the operating system, etc.

What are the main steps behind compiling?

By the end of the day I need to transform my C code to a language that specifically my CPU should understand. So, who cares about knowing my CPU-specific instructions? The operating system?

You are not very clear here. If you are asking, which tool has knowledge of your CPU specific instructions, it's the assembler, disassembler, debugger, and maybe some others. They can generate machine code or convert it back to disassembly.

If you are asking who cares about which instructions are used, it's the processor that needs to execute them, as each instruction set represents even such a common instruction as "add two integers" in a completely different manner.
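
To make that concrete, here is a sketch pairing one C function with byte sequences an optimizing compiler may emit for it on two architectures. The array names are invented for the example, and the bytes are typical output rather than guaranteed output; a given compiler, version, or optimization level can choose different instructions.

```c
#include <stdint.h>

/* The same C source on every target. */
int add_ints(int a, int b)
{
    return a + b;
}

/* One plausible x86-64 encoding (System V calling convention). */
const uint8_t add_x86_64[] = {
    0x8D, 0x04, 0x37,         /* lea eax, [rdi+rsi] */
    0xC3                      /* ret                */
};

/* One plausible AArch64 encoding (little-endian byte order). */
const uint8_t add_aarch64[] = {
    0x00, 0x00, 0x01, 0x0B,   /* add w0, w0, w1 */
    0xC0, 0x03, 0x5F, 0xD6    /* ret            */
};
```

The two byte sequences have nothing in common, even though both implement the same one-line function; that is why a CPU of one architecture cannot execute the code generated for another.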

Is gcc converting any C to assembly language?

Yes, C (or a program in any other supported language) is converted to assembly by GCC. There are many steps involved, and at least two additional internal representations are used in the process. Details are explained in the GCC internals document. Finally, the compiler "backend" generates an assembly representation of simple "patterns" generated by previous compiler passes. You can ask GCC to output this assembly by using the -S flag. If you don't specifically ask for it, the next steps (assembling and linking) are executed automatically and you only see your final executable file.

I know (actually guess) that for each processor type I will need an assembler that will interpret (?) the assembly code and translate to my CPU specific instructions. Where is this assembler (who ships it)? Does it comes with the OS?

First, take note that the assembly languages for each CPU differ, as they are supposed to represent the CPU's machine language 1:1. The assembler then translates assembly code into machine code. Who ships it? Anyone who builds it. With the GNU toolchain it's part of the binutils package, which is usually installed by default on most Linux distributions. It is not the only assembler available. Also note that although the GNU "suite" (GCC/binutils/gdb) supports many architectures, you need to use the appropriate port for your architecture. Your desktop PC's default assembler, for example, cannot assemble into ARM machine code.

Why exactly I can't see the 0s and 1s if I open the binary file with a text editor?

Because a text editor is supposed to show the text representation of those 0s and 1s. Assuming each character in the file takes 8 bits, it interprets each subsequent 8 bits as a single character instead of showing separate bits. If you know that in ASCII the letter 'A' is represented by the value 65, you can also convert this back to binary: 01000001. It's a bit easier to convert a hexadecimal representation back to binary. For this you can use the hexdump (or a similar) tool.
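
If you do want to see the individual bits, the conversion is short enough to write yourself. This is a minimal sketch with a made-up function name, byte_to_bits, that prints nothing on its own and just fills a caller-supplied buffer.

```c
/* Write the 8 bits of `byte` into `out` as '0'/'1' characters,
   most significant bit first, followed by a terminating NUL. */
void byte_to_bits(unsigned char byte, char out[9])
{
    for (int i = 0; i < 8; i++)
        out[i] = (byte & (0x80u >> i)) ? '1' : '0';
    out[8] = '\0';
}
```

Passing the value 65 (the letter 'A' in ASCII) fills the buffer with 01000001, matching the conversion described above.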
