Why Does the Solaris Assembler Generate Different Machine Code Than the GNU Assembler Here?

Why does the Solaris assembler generate different machine code than the GNU assembler here?

Some x86 instructions have multiple encodings that do the same thing. In particular, for any instruction that acts on two registers and has both a reg, r/m and an r/m, reg opcode form, the registers can be swapped in the ModR/M byte and the direction bit in the opcode reversed to compensate.

Which one a given assembler/compiler picks simply depends on what the tool authors chose.
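
For example (a generic illustration, not necessarily the exact instruction from the original question), both of these byte sequences mean xor %eax,%ecx; one assembler may default to the first form and another to the second:

31 c1   xor %eax,%ecx   # opcode 31: xor r/m32, r32 ("store" direction, ModR/M holds the destination)
33 c8   xor %eax,%ecx   # opcode 33: xor r32, r/m32 ("load" direction, ModR/M holds the source)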

Why does GNU GAWK 3.1.5 on Solaris 5.10 not match the pattern?

The part of the command after the pipe cannot see those lines because the pipe only carries stdout, and sftp writes those error messages to stderr.

If you want to redirect both of them to gawk, you should add 2>&1 to your command:

bash-3.2$ sftp my-server < batch_ls.sftp 2>&1 | gawk 'BEGIN{d=-1;wd=1}/^sftp> c/{d++;wd=0}/Coul/{wd=1}wd==0{print $0,d,wd}'
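
A minimal sketch of the difference, using generic commands rather than your actual sftp session:

( echo "to stdout"; echo "to stderr" >&2 ) | gawk '{ print "gawk saw:", $0 }'
# gawk only sees "to stdout"; the stderr line bypasses the pipe and goes to the terminal

( echo "to stdout"; echo "to stderr" >&2 ) 2>&1 | gawk '{ print "gawk saw:", $0 }'
# with 2>&1 before the pipe, gawk sees both lines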

Does a compiler always produce assembly code?

TL;DR: different object file formats and (historically) easier portability to new Unix platforms are, I think, the main reasons gcc keeps the assembler separate from the compiler. Outside of gcc, the mainstream x86 C and C++ compilers (clang/LLVM, MSVC, ICC) go straight to machine code, with the option of printing asm text if you ask them to.

LLVM and MSVC are, or come with, complete toolchains, not just compilers; they ship their own assembler and linker. LLVM already has object-file handling available as a library, so it can use that instead of writing out asm text to feed to a separate program.

Smaller projects often choose to leave object-file format details to the assembler. For example, FreePascal can go straight to an object file on a few of its target platforms, but otherwise emits only asm. There are many claims (1, 2, 3, 4) that almost all compilers go through asm text, but that's not true for many of the biggest, most widely used compilers (other than GCC) that have lots of developers working on them.

C compilers tend either to target a single platform (like a vendor's compiler for a microcontroller), having been written as "the/a C implementation for this platform", or to be very large projects like LLVM, where machine-code generation isn't a big fraction of the compiler's own code size. Compilers for less widely used languages are more often portable, but without wanting to write their own machine-code / object-file handling. (Many compilers these days are front-ends for LLVM, so they get .o output for free, like rustc, but older compilers didn't have that option.)

Out of all compilers ever written, most do go through asm. But if you weight by how often each one is used, going straight to a relocatable object file (.o / .obj) accounts for a significant fraction of the total builds done on any given day worldwide; i.e. the compiler you care about, if you're reading this, may well work that way.

Also, compilers like javac that target a portable bytecode format have less reason to use asm; the same output file and bytecode format work across every platform they have to run on.
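
You can see the difference on a typical Linux machine (a rough sketch with a hypothetical hello.c; exact temp-file names and verbose output vary by version):

gcc -save-temps -c hello.c    # keeps hello.s, the text asm that gcc's driver feeds to as
gcc -v -c hello.c             # verbose output shows the separate cc1 and as invocations
clang -c hello.c              # clang uses its integrated assembler by default: no asm temp file
clang -S hello.c -o hello.s   # but it will still print text asm if you ask for it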

Related:

  • When and why did high-level language compilers start targeting assembly language? (https://retrocomputing.stackexchange.com/questions/14927/when-and-why-did-high-level-language-compilers-start-targeting-assembly-language) on Retrocomputing.SE has some other answers about the advantages of keeping as separate.
  • What is the need to generate ASM code in gcc, g++
  • What do C and Assembler actually compile to? - even compilers that go straight to machine code don't produce linked executables directly; they produce relocatable object files (.o or .obj), except for tcc, the Tiny C Compiler, which is intended for on-the-fly use with one-file C programs.
  • Semi-related: Why do we even need assembler when we have compiler? asm is useful for humans to look at machine code, not as a necessary part of C -> machine code.


Why GCC does what it does

Yes, as is a separate program that the gcc front-end actually runs separately from cc1 (the C preprocessor+compiler that produces text asm).

This makes gcc slightly more modular, making the compiler itself a text -> text program.
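
You can run the same steps by hand to see the split (a sketch with a hypothetical hello.c; the gcc driver normally does this for you, via a temp file or pipe):

gcc -S hello.c -o hello.s     # compiler proper (cc1): C source -> text asm
as hello.s -o hello.o         # GAS: text asm -> relocatable object file
gcc hello.o -o hello          # driver invokes the linker to produce the executable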

GCC internally uses some binary data structures for GIMPLE and RTL internal representations, but it doesn't write (text representations of) those IR formats to files unless you use a special option for debugging.
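
For example, with GCC's debugging-dump options (real flags; the exact dump file names vary by GCC version):

gcc -c -fdump-tree-gimple -fdump-rtl-expand hello.c   # also writes text dumps of the GIMPLE and RTL IR, purely for debugging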

So why stop at assembly? This means GCC doesn't need to know about different object file formats for the same target. For example, different x86-64 OSes use ELF, PE/COFF, MachO64 object files, and historically a.out. as assembles the same text asm into the same machine code surrounded by different object file metadata on different targets. (There are minor differences gcc has to know about, like whether to prepend an _ to symbol names or not, and whether 32-bit absolute addresses can be used, and whether code has to be PIC.)

Any platform-specific quirks can be left to GNU binutils as (aka GAS), or gcc can use the vendor-supplied assembler that comes with a system.

Historically, there were many different Unix systems with different CPUs, or, especially, the same CPU but different quirks in their object file formats. More importantly, they shared a fairly compatible set of assembler directives like .globl main, .asciz "Hello World!\n", and so on. GAS syntax itself comes from Unix assemblers.

It really was possible in the past to port GCC to a new Unix platform without porting as, just using the assembler that comes with the OS.

Nobody has ever gotten around to integrating an assembler as a library into GCC's cc1 compiler. That's been done for the C preprocessor (which historically was also done in a separate process), but not the assembler.


Most other compilers do produce object files directly from the compiler, without a text asm temporary file / pipe. Often because the compiler was only designed for one or a couple targets, like MSVC or ICC or various compilers that started out as x86-only, or many vendor-supplied compilers for embedded chips.

clang/LLVM was designed much more recently than GCC. It was designed to work as an optimizing JIT back-end, so it needed a built-in assembler to make it fast to generate machine code. To work as an ahead-of-time compiler, adding support for different object-file formats was presumably a minor thing since the internal software architecture was there to go straight to binary machine code.

LLVM of course uses LLVM-IR internally for target-independent optimizations before looking for back-end-specific optimizations, but again it only writes out this format as text if you ask it to.
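
For example, with standard clang options and a hypothetical foo.c:

clang -O2 -c foo.c -o foo.o               # normal path: straight to machine code in an object file
clang -O2 -S -emit-llvm foo.c -o foo.ll   # textual LLVM-IR, only written out because you asked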



Encoding ADC EAX, ECX - 2 different ways to encode? (arch x86)

This is redundancy in the instruction encoding. Any architecture that encodes multiple operands in an instruction can have it.

Think of a RISC architecture that has add rx, ry, rz, which assigns the sum of ry and rz to rx. Since addition is commutative, you can encode it as either add rx, ry, rz or add rx, rz, ry; they are equivalent.

In x86 we (normally) have only two operands per instruction, but you can select the direction between them because one operand can be a memory location that you either store to or read from. If neither operand is memory, you can still choose the direction between the two registers, so there are two possible encodings.

You can use this to identify some compilers/assemblers. Some assemblers let you choose which encoding to use; in GAS you can use the .s suffix to force the alternate encoding:

10 de   adcb   %bl,%dh
12 f3   adcb.s %bl,%dh

x86 XOR opcode differences

x86 has 2 redundant ways to encode a 2-register instance of any of the basic ALU instructions (that date back to 8086), using either the r/m-source or the r/m-destination form of the opcode.

This redundancy for reg,reg encoding is a consequence of how x86 machine code allows a memory-destination or a memory-source for most instructions: instead of spending bits in the ModR/M byte to have a flexible encoding for both operands, there are simply two separate opcodes for most instructions.

(This is why an instruction with two explicit memory operands, like xor [eax], [ecx], isn't allowed at all. Only a few instructions where one or both memory operands are implicit, like rep movs or push [mem], can access two memory locations, and there's never an instruction with two separate ModR/M-encoded addressing modes.)



There are patterns to the encodings

Notice that 31 vs. 33 for word/dword/qword-sized xor differ only in bit #1. Other instructions like 29 vs. 2B sub follow the same pattern. Bit #1 is sometimes called the "direction" bit of the opcode. (Not to be confused with DF in EFLAGS, the direction flag).

Also note that byte vs. word/dword/qword operand-size versions of those instructions differ only in the low bit, like 30 XOR r/m8, r8 vs. 31 XOR r/m16, r16. Again, this pattern shows up in the ALU instruction encodings that date back to 8086. Bit #0 of those opcodes is sometimes called the "size" bit.
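
Putting the two patterns together for xor (the same layout applies to add, or, adc, sbb, and, sub, and cmp):

30   xor r/m8,  r8     # size bit = 0 (byte), direction bit = 0 (reg is the source)
31   xor r/m32, r32    # size bit = 1 (word/dword/qword)
32   xor r8,  r/m8     # direction bit = 1 (reg is the destination)
33   xor r32, r/m32    # both bits set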

These "basic ALU" instructions that have an encoding for each direction and size combo date back to original 8086; many later instructions like 386 bsf r, r/m or 186 imul r, r/m, imm don't have a form that could allow a memory destination. Or for bt* r/m, r only the destination can be reg/mem.

That's also why later instructions (or new forms of them like imul) usually don't have a separate opcode for byte operand-size, only allowing word/dword/qword via the normal prefix mechanisms. 8086 used up much of the coding space, and later extensions wanted to leave room for more future extensions. So that's why there's no imul r, r/m8.

(dword and qword operand size were themselves extensions; 8086 didn't have operand-size or REX prefixes. So original 8086 was fairly sensible in terms of using its opcode coding space, and having patterns to make decoding not a total mess.)
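
For example, the single 31 /r opcode covers all the non-byte operand sizes via prefixes (the REX.W form only existing in 64-bit mode, of course):

31 c1      xor %eax,%ecx    # no prefix: 32-bit operand size in 32/64-bit mode
66 31 c1   xor %ax,%cx      # 66 operand-size prefix: 16-bit
48 31 c1   xor %rax,%rcx    # REX.W prefix: 64-bit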



No execution differences between forms

For reg,reg instructions, there's no difference in how they decode and execute on any CPUs I'm aware of; the only time you need to care about which encoding your assembler uses is when you want the machine code to meet some other requirement, like using only bytes that represent printable ASCII characters (e.g. for an exploit payload).



Specifying which form you want the assembler to use

Some assemblers have syntax for overriding their default choice of encoding, e.g. GAS had a .s suffix to get the non-default encoding. That's now deprecated, and you should use {load} or {store} prefixes before the mnemonic (see the docs), like so:

{load} xor %eax, %ecx
{store} xor %eax, %ecx
{vex3} vpaddd %xmm0, %xmm1, %xmm1
vpaddd %xmm0, %xmm1, %xmm1 # default is to use 2-byte VEX when possible

gcc -c foo.S && objdump -drwC foo.o

0:   33 c8                   xor    %eax,%ecx
2:   31 c1                   xor    %eax,%ecx
4:   c4 e1 71 fe c8          vpaddd %xmm0,%xmm1,%xmm1
9:   c5 f1 fe c8             vpaddd %xmm0,%xmm1,%xmm1

(Related: What methods can be used to efficiently extend instruction length on modern x86? for use-cases for {vex3}, {evex} and {disp32}.)

NASM also has {vex2}, {vex3}, and {evex} prefixes with the same syntax as GAS, e.g. {vex3} vpaddd xmm1, xmm1, xmm0. But I don't see a way to override the op r/m, r vs. op r, r/m choice of opcodes.



Related Q&As, some basically duplicates

  • Why does the Solaris assembler generate different machine code than the GNU assembler here? - some assemblers have a different default choice of "direction".
  • Some assemblers have even used that choice (and maybe other redundancies) as a way to fingerprint / watermark their output machine code. Notably A86, which was shareware and used this to detect binaries built by people who hadn't paid the shareware fee; see the A86 tag wiki.
  • What is the ".s" suffix in x86 instructions? (the precursor to the {load} and {store} overrides in GAS source).
  • Encoding ADC EAX, ECX - 2 different ways to encode? (arch x86)
  • x86 sub instruction opcode confusion

What is the .s suffix in x86 instructions?

To understand what the .s suffix means, you need to understand how x86 instructions are encoded. If we take adc as an example, there are four main forms that the operands can take:

  1. The source operand is an immediate, and the destination operand is the accumulator register.
  2. The source operand is an immediate, and the destination operand is a register or memory location.
  3. The source operand is a register, and the destination operand is a register or memory location.
  4. The source operand is a register or memory location, and the destination operand is a register.

And of course there are variants of these for the different operand sizes: 8-bit, 16-bit, 32-bit, etc.
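
For adc specifically, the 8-bit opcodes for those four forms are (from Intel's opcode tables):

14 ib     adc al, imm8      # form 1: accumulator destination, immediate source
80 /2 ib  adc r/m8, imm8    # form 2: register-or-memory destination, immediate source
10 /r     adc r/m8, r8      # form 3: register-or-memory destination, register source
12 /r     adc r8, r/m8      # form 4: register destination, register-or-memory source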

When one of your operands is a register and the other is a memory location, it is obvious which of forms 3 and 4 the assembler should use, but when both operands are registers, either form is applicable. The .s suffix tells the assembler which form to use (or, in a disassembly, shows you which form was used).

Looking at the specific example of adcb %bl,%dh, the two ways it can be encoded are as follows:

10 de   adcb   %bl,%dh
12 f3   adcb.s %bl,%dh

The first byte determines the form of the instruction used, which I'll get back to later. The second byte is what is known as a ModR/M byte and specifies the addressing mode and register operands that are used. The ModR/M byte can be split into three fields: Mod (the most significant 2 bits), REG (the next 3) and R/M (the last 3).

de: Mod = 11, REG = 011, R/M = 110
f3: Mod = 11, REG = 110, R/M = 011

The Mod and R/M fields together determine the effective address of the memory location if one of the operands is a memory location, but when that operand is just a register, the Mod field is set to 11, and R/M is the value of the register. The REG field obviously just represents the other register.

So in the de byte, the R/M field holds the dh register, and the REG field holds the bl register. And in the f3 byte, the R/M field holds the bl register, and the REG field holds the dh register. (8-bit registers are encoded as the numbers 0 to 7 in the order al, cl, dl, bl, ah, ch, dh, bh.)

Getting back to the first byte, the 10 tells us to use the form 3 encoding, where the source operand is always a register (i.e. it comes from the REG field), and the destination operand is a memory location or register (i.e. it is determined by the Mod and R/M fields). The 12 tells us to use the form 4 encoding, where the operands are the other way around - the source operand is determined by the Mod and R/M fields and the destination operand comes from the REG field.

So the positions in which the registers are stored in the ModR/M byte are swapped, and the first byte of the instruction tells us which operand is stored where.
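
To summarize the two encodings side by side (just restating the breakdown above in one place):

10 de:  opcode 10 = adc r/m8, r8   ModR/M de = 11 011 110  -> REG = 011 (%bl, source), R/M = 110 (%dh, destination)
12 f3:  opcode 12 = adc r8, r/m8   ModR/M f3 = 11 110 011  -> REG = 110 (%dh, destination), R/M = 011 (%bl, source)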

How do assembly languages depend on operating systems?

As others have pointed out, system calls and interrupts are different. I can think of another few differences.

The instruction set is the same across all OSes on a given processor, but the executable file format might not be. For example, on x86, Windows uses the PE format, Linux uses ELF, and macOS uses Mach-O. That means assemblers on those platforms must produce their output in those formats.

Relatedly, the calling convention can also differ across OSes. That mostly matters when you are writing assembly code that calls or is called by compiled-code routines, or perhaps when you are writing inline assembler inside compiled code. The calling convention governs which registers are used for what purposes during a function call, so different conventions require different use of registers by calling and called code. They also put constraints on the position of the stack pointer, and various other things. As it happens, calling conventions have historically been a rare example of consistency across OSes: I believe the Windows and UNIX calling conventions are essentially the same on 32-bit x86 (the UNIX side being specified by the venerable System V ABI), and they are consistent across OSes on most other architectures. However, the conventions now differ between Windows and UNIX on x86_64.
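
As a concrete illustration (register names as given in the published ABI documents), the first few integer/pointer arguments on x86_64 go in different registers:

System V AMD64 (Linux, macOS, other UNIX):  rdi, rsi, rdx, rcx, r8, r9
Microsoft x64 (Windows):                    rcx, rdx, r8, r9  (plus 32 bytes of "shadow space" on the stack)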

In addition, there may be differences in the syntax used by the assembly language. Again on the x86, the Windows and Linux assemblers used to use different syntax, with the Windows assembler using a syntax invented by Intel, and the Linux assembler (really, the GNU assembler) using a traditional UNIX syntax invented by AT&T. These syntaxes describe the same instruction set, but are written differently. Nowadays, the GNU assembler can also understand the Intel syntax, so there is less of a difference.
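
For example, the same instruction written in the two syntaxes (a generic illustration, not taken from the question):

mov eax, DWORD PTR [ebx+8]    ; Intel syntax: destination first, brackets around memory operands
movl 8(%ebx), %eax            # AT&T syntax: source first, % register prefixes, size suffix on the mnemonic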


