How to Remove "Noise" from Gcc/Clang Assembly Output

How to remove noise from GCC/clang assembly output?

Stripping out the .cfi directives, unused labels, and comment lines is a solved problem: the scripts behind Matt Godbolt's Compiler Explorer are open source on its GitHub project. It can even do colour highlighting to match source lines to asm lines (using the debug info).

You can set it up locally, so you can feed it files that are part of your project with all the #include paths and so on (using -I/...), and so you can use it on private source code that you don't want to send out over the Internet.

Matt Godbolt's CppCon 2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” shows how to use it (it's pretty self-explanatory but has some neat features if you read the docs on GitHub), and also how to read x86 asm, with a gentle introduction to x86 asm itself for total beginners, and to looking at compiler output. He goes on to show some neat compiler optimizations (e.g. for dividing by a constant), and what kind of functions give useful asm output for looking at optimized compiler output (function args, not int a = 123;).

On the Godbolt compiler explorer, it can be useful to use -g0 -fno-asynchronous-unwind-tables if you want to uncheck the filter option for directives, e.g. because you want to see the .section and .p2align stuff in the compiler output. The default is to add -g to your options to get the debug info it uses to colour-highlight matching source and asm lines, but this means .cfi directives for every stack operation, and .loc for every source line, among other things.

With plain gcc/clang (not g++), -fno-asynchronous-unwind-tables avoids .cfi directives. Possibly also useful: -fno-exceptions -fno-rtti -masm=intel. Make sure to omit -g.

Copy/paste this for local use:

g++ -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -fverbose-asm \
    -Wall -Wextra  foo.cpp   -O3 -masm=intel -S -o- | less

Or -Os can be more readable, e.g. using div for division by non-power-of-2 constants instead of a multiplicative inverse even though that's a lot worse for performance and only a bit smaller, if at all.
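To make the multiplicative-inverse remark concrete, here is a minimal sketch of what that trick computes for an unsigned 32-bit divide by 10. The function names are made up for illustration. The constant 0xCCCCCCCD with a shift of 35 is the usual choice for this divisor, but the exact instruction sequence a given compiler emits may differ.

```cpp
#include <cstdint>

// What you write. At -O3, gcc and clang typically avoid a div instruction here.
std::uint32_t div10(std::uint32_t x) { return x / 10; }

// Roughly what the optimized asm computes: multiply by ceil(2^35 / 10)
// and keep the high bits of the 64-bit product.
std::uint32_t div10_mul(std::uint32_t x) {
    return static_cast<std::uint32_t>((static_cast<std::uint64_t>(x) * 0xCCCCCCCDu) >> 35);
}
```

Comparing the asm for div10 at -O3 and at -Os shows the difference described above.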

But really, I'd recommend just using Godbolt directly (online or set it up locally)! You can quickly flip between versions of gcc and clang to see if old or new compilers do something dumb. (Or what ICC does, or even what MSVC does.) There's even ARM / ARM64 gcc 6.3, and various gcc for PowerPC, MIPS, AVR, MSP430. (It can be interesting to see what happens on a machine where int is wider than a register, or isn't 32-bit. Or on a RISC vs. x86).

For C instead of C++, you can use -xc -std=gnu11 to avoid flipping the language drop-down to C, which resets your source pane and compiler choices, and has a different set of compilers available.

Useful compiler options for making asm for human consumption:

Remember, your code only has to compile, not link: passing a pointer to an external function like void ext(void*p) is a good way to stop something from optimizing away. You only need a prototype for it, with no definition, so the compiler can't inline it or make any assumptions about what it does. (Or inline asm like Benchmark::DoNotOptimize can force a compiler to materialize a value in a register, or forget about it being a known constant, if you know GNU C inline asm syntax well enough to use constraints and understand what you're requiring of the compiler.)
I'd recommend using -O3 -Wall -Wextra -fverbose-asm -march=haswell for looking at code. (-fverbose-asm can just make the source look noisy, though, when all you get are numbered temporaries as names for the operands.) When you're fiddling with the source to see how it changes the asm, you definitely want compiler warnings enabled. You don't want to waste time scratching your head over the asm when the explanation is that you did something that deserves a warning in the source.
To see how the calling convention works, you often want to look at caller and callee without inlining.
You can use __attribute__((noipa)) foo_t foo(bar_t x) { ... } on a definition, or compile with gcc -O3 -fno-inline-functions -fno-inline-functions-called-once -fno-inline-small-functions to disable inlining. (But those command line options don't disable cloning a function for constant-propagation. noipa = no Inter-Procedural Analysis. It's even stronger than __attribute__((noinline,noclone)).) See From compiler perspective, how is reference for array dealt with, and, why passing by value(not decay) is not allowed? for an example.
Or if you just want to see how functions pass / receive args of different types, you could use different names but the same prototype so the compiler doesn't have a definition to inline. This works with any compiler. Without a definition, a function is just a black box to the optimizer, governed only by the calling convention / ABI.
-ffast-math will get many libm functions to inline, some to a single instruction (esp. with SSE4 available for roundsd). Some will inline with just -fno-math-errno, or other "safer" parts of -ffast-math, without the parts that allow the compiler to round differently. If you have FP code, definitely look at it with/without -ffast-math. If you can't safely enable any of -ffast-math in your regular build, maybe you'll get an idea for a safe change you can make in the source to allow the same optimization without -ffast-math.
-O3 -fno-tree-vectorize will optimize without auto-vectorizing, so you can get full optimization without vectorization if you want to compare with -O2 (which doesn't enable auto-vectorization on GCC 11 and earlier, but does on all clang versions).
-Os (optimize for size and speed) can be helpful to keep the code more compact, which means less code to understand. clang's -Oz optimizes for size even when it hurts speed, even using push 1 / pop rax instead of mov eax, 1, so that's only interesting for code golf.
Even -Og (minimal optimization) might be what you want to look at, depending on your goals. -O0 is full of store/reload noise, which makes it harder to follow, unless you use register vars. The only upside is that each C statement compiles to a separate block of instructions, and it makes -fverbose-asm able to use the actual C var names.
clang unrolls loops by default, so -fno-unroll-loops can be useful in complex functions. You can get a sense of "what the compiler did" without having to wade through the unrolled loops. (gcc enables -funroll-loops with -fprofile-use, but not with -O3). (This is a suggestion for human-readable code, not for code that would run faster.)
Definitely enable some level of optimization, unless you specifically want to know what -O0 did. Its "predictable debug behaviour" requirement makes the compiler store/reload everything between every C statement, so you can modify C variables with a debugger and even "jump" to a different source line within the same function, and have execution continue as if you did that in the C source. -O0 output is so noisy with stores/reloads (and so slow) not just from lack of optimization, but forced de-optimization to support debugging. (also related).
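The inline-asm approach from the list above can be sketched like this. forget_value is a hypothetical helper, not part of any library. It uses GNU C inline asm, so it assumes gcc or clang, and this simple "+r" constraint only suits values that fit in a general-purpose register.

```cpp
// Hypothetical helper (not from any library): an empty asm statement that claims
// to read and modify `value` in a register, so the compiler can no longer
// assume it knows what the value is.
static inline void forget_value(int& value) {
    asm volatile("" : "+r"(value));
}

// Constant-propagation typically turns this into "return 200".
int folded() {
    int a = 10, b = 20;
    return a * b;
}

// Here the compiler has to materialize a and b and emit a real multiply.
int not_folded() {
    int a = 10, b = 20;
    forget_value(a);
    forget_value(b);
    return a * b;
}
```

Both functions return the same value at run time; only the generated asm differs.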

To get a mix of source and asm, use gcc -Wa,-adhln -c -g foo.c | less to pass extra options to as. (More discussion of this in a blog post, and another blog.) Note that the output of this isn't valid assembler input, because the C source is there directly, not as an assembler comment. So don't call it a .s. A .lst might make sense if you want to save it to a file.

Godbolt's color highlighting serves a similar purpose, and is great at helping you see when multiple non-contiguous asm instructions come from the same source line. I haven't used that gcc listing command at all, so IDK how well it does, and how easy it is for the eye to see, in that case.

I like the high code density of godbolt's asm pane, so I don't think I'd like having source lines mixed in. At least not for simple functions. Maybe with a function that was too complex to get a handle on the overall structure of what the asm does...

And remember, when you want to just look at the asm, leave out the main() and the compile-time constants. You want to see the code for dealing with a function arg in a register, not for the code after constant-propagation turns it into return 42, or at least optimizes away some stuff.

Removing static and/or inline from functions will produce a stand-alone definition for them, as well as a definition for any callers, so you can just look at that.
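A small sketch of that difference, with made-up function names. Whether the static inline version gets its own asm depends on the compiler's inlining decisions, so treat the comments as the typical outcome with optimization enabled rather than a guarantee.

```cpp
// static inline: if every call gets inlined, there is typically no
// stand-alone definition in the asm output to look at.
static inline int twice_plus_one(int x) { return 2 * x + 1; }

// External linkage: the compiler has to emit a stand-alone definition,
// because some other translation unit might call it.
int twice_plus_one_visible(int x) { return 2 * x + 1; }

// A caller, so the static inline function is actually used.
int user(int x) { return twice_plus_one(x) + twice_plus_one_visible(x); }
```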

Don't put your code in a function called main(). gcc knows that main is special and assumes it will only be called once, so it marks it as "cold" and optimizes it less.

The other thing you can do: If you did make a main(), you can run it and use a debugger. stepi (si) steps by instruction. See the bottom of the x86 tag wiki for instructions. But remember that code might optimize away after inlining into main with compile-time-constant args.

__attribute__((noinline)) may help, on a function that you want to not be inlined. gcc will also make constant-propagation clones of functions, i.e. a special version with one of the args as a constant, for call-sites that know they're passing a constant. The symbol name will be something like foo.constprop.0 in the asm output. You can use __attribute__((noclone)) to disable that, too.
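A minimal sketch of using the attribute to look at a caller and callee separately. scale and caller are made-up names. Only noinline is used here because clang doesn't accept noclone; on GCC you could add noclone (or use noipa) if you see a constant-propagated clone instead of a call to the real function.

```cpp
// Keep the callee as a separate function so the call sequence is visible.
__attribute__((noinline))
int scale(int x, int factor) { return x * factor; }

// The asm for this should contain an actual call: args set up in registers,
// then call, then the add on the return value. GCC may call a clone of scale
// specialized for factor == 3 unless noclone / noipa is also used.
int caller(int x) { return scale(x, 3) + 1; }
```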

For example

If you want to see how the compiler multiplies two integers: I put the following code on the Godbolt compiler explorer to get the asm (from gcc -O3 -march=haswell -fverbose-asm) for the wrong way and the right way to test this.

// the wrong way, which people often write when they're used to creating a runnable test-case with a main() and a printf
// or worse, people will actually look at the asm for such a main()
int constants() { int a = 10, b = 20; return a * b; }
    mov     eax, 200  #,
    ret                     # compiles the same as  return 200;  not interesting

// the right way: compiler doesn't know anything about the inputs
// so we get asm like what would happen when this inlines into a bigger function.
int variables(int a, int b) { return a * b; }
    mov     eax, edi  # D.2345, a
    imul    eax, esi        # D.2345, b
    ret

(This mix of asm and C was hand-crafted by copy-pasting the asm output from godbolt into the right place. I find it's a good way to show how a short function compiles in SO answers / compiler bug reports / emails.)

Clean x86_64 assembly output with gcc? [duplicate]

The stuff that goes into the .eh_frame section is unwind descriptors, which you only need to unwind the stack (e.g. with GDB). While learning assembly, you can simply ignore it. Here is a way to do the "clean up" you want:

gcc -S -o - test.c | sed -e '/^\.L/d' -e '/\.eh_frame/Q'
        .file   "test.c"
        .text
.globl main
        .type   main,@function
main:
        pushq   %rbp
        movq    %rsp, %rbp
        movl    $0, %eax
        leave
        ret
        .size   main,.Lfe1-main
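If you'd rather do the same filtering inside a program than with sed, the two sed expressions amount to roughly the following. This is a sketch with a made-up function name (strip_noise) that mirrors the sed command above: drop lines that begin with .L, and stop at the first line mentioning .eh_frame without printing it.

```cpp
#include <sstream>
#include <string>

// Mirrors: sed -e '/^\.L/d' -e '/\.eh_frame/Q'
std::string strip_noise(const std::string& asm_text) {
    std::istringstream in(asm_text);
    std::string out, line;
    while (std::getline(in, line)) {
        if (line.compare(0, 2, ".L") == 0)
            continue;                       // /^\.L/d : delete local labels
        if (line.find(".eh_frame") != std::string::npos)
            break;                          // /\.eh_frame/Q : quit without printing
        out += line;
        out += '\n';
    }
    return out;
}
```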

Getting assembler output from GCC/Clang in LTO mode

For GCC, just add -save-temps to the linker command:

$ gcc -flto -save-temps ... *.o -o bin/libsortcheck.so
$ ls -1
...
libsortcheck.so.ltrans0.s

For Clang the situation is more complicated. In case you use GNU ld (default or -fuse-ld=ld) or Gold linker (enabled via -fuse-ld=gold), you need to run with -Wl,-plugin-opt=emit-asm:

$ clang tmp.c -flto -Wl,-plugin-opt=emit-asm -o tmp.s

For newer (11+) versions of LLD linker (enabled via -fuse-ld=lld) you can generate asm with -Wl,--lto-emit-asm.

How to generate godbolt like clean assembly locally?

A while ago, I needed something like this locally so I wrote a small tool to make the asm readable.

It attempts to 'clean' and make the 'asm' output from 'gcc' readable using C++ itself. It does something similar to Compiler Explorer and tries to remove all the directives and unused labels, making the asm clean. Only standard library is used for this.

Some things I should mention:

Will only work with gcc and clang
Only tested with C++ code
compile with -S -fno-asynchronous-unwind-tables -fno-dwarf2-cfi-asm -masm=intel, (remove -masm= if you want AT&T asm)
AT&T syntax will probably work but I didn't test it much. The other two options are to remove the .cfi directives. It can be handled using the code below but the compiler itself does a much better job of this. See the answer by Peter Cordes above.
This program can work as standalone, but I would highly recommend reading this SO answer to tune your asm output and then process it using this program to remove unused labels / directives etc.
abi::__cxa_demangle() is used for demangling
Disclaimer: This isn't a perfect solution, and hasn't been tested extensively.

The strategy used for cleaning the asm (there are probably better, faster, more efficient ways to do this):

Collect all the labels
Go through the asm line by line and check if the labels are used/unused
If the labels are unused, they get deleted
Every line beginning with '.' gets deleted, unless it is a label that is used somewhere

Update 1: Not all static data gets removed now.

#include <algorithm>
#include <cstdlib>
#include <cxxabi.h>
#include <fstream>
#include <iostream>
#include <regex>
#include <sstream>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// trim from both ends; returns a trimmed view and leaves the caller's string untouched
std::string_view trim(std::string_view s)
{
    s.remove_prefix(std::min(s.find_first_not_of(" \t\r\v\n"), s.size()));
    s.remove_suffix(std::min(s.size() - s.find_last_not_of(" \t\r\v\n") - 1, s.size()));
    return s;
}

static inline bool startsWith(const std::string_view s, const std::string_view searchString)
{
    return (s.rfind(searchString, 0) == 0);
}

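std::string demangle(std::string &&asmText)
{
    std::string::size_type last = 0;
    while (true) {
        // find the next candidate mangled name, starting after the previous one
        const auto next = asmText.find("_Z", last);
        if (next == std::string::npos)
            break;
        //get token
        auto tokenEnd = asmText.find_first_of(":,.@[]() \n", next + 1);
        if (tokenEnd == std::string::npos)
            tokenEnd = asmText.size();
        const auto len = tokenEnd - next;
        const std::string tok = asmText.substr(next, len);
        int status = 0;
        char* name = abi::__cxa_demangle(tok.c_str(), nullptr, nullptr, &status);
        if (status != 0 || name == nullptr) {
            std::cout << "Demangling of: " << tok << " failed, status: " << status << '\n';
            free(name);
            // skip this token, otherwise the same "_Z" would be found again forever
            last = tokenEnd;
            continue;
        }
        std::string demangledName{name};
        free(name);
        demangledName.insert(demangledName.begin(), ' ');
        asmText.replace(next, len, demangledName);
        // resume searching after the text we just inserted
        last = next + demangledName.size();
    }
    return std::move(asmText);
}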

std::string clean_asm(const std::string& asmText)
{
    std::string output;
    output.reserve(asmText.length());
    std::stringstream s{asmText};

    //1. collect all the labels
    //2. go through the asm line by line and check if the labels are used/unused
    //3. if the labels are unused, they get deleted
    //4. every line beginning with '.' gets deleted, unless it is a used label

    std::regex exp {"^\\s*[_|a-zA-Z]"};
    
    std::regex directiveRe { "^\\s*\\..*$" };
    std::regex labelRe { "^\\.*[a-zA-Z]+[0-9]+:$" };
    std::regex hasOpcodeRe { "^\\s*[a-zA-Z]" };
    std::regex numericLabelsRe { "\\s*[0-9]:" };

    const std::vector<std::string> allowedDirectives =
    {
        ".string", ".zero", ".byte", ".value", ".long", ".quad", ".ascii"
    };

    //<label, used>
    std::unordered_map<std::string, bool> labels;

    //1
    std::string line;
    while (std::getline(s, line)) {
        if (std::regex_match(line, labelRe)) {
            // labelRe only matches unindented labels, so there is nothing to trim;
            // just remove the trailing ':'
            line.pop_back();
            labels[line] = false;
        }
    }

    s.clear();
    s.str(asmText);
    line = "";

    //2
    while (std::getline(s, line)) {
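        // regex_search, not regex_match: the pattern only describes the start of the line
        if (std::regex_search(line, hasOpcodeRe)) {
            for (auto& label : labels) {
                // find() returns npos when the label is absent, so compare explicitly
                if (line.find(label.first) != std::string::npos) {
                    label.second = true;
                }
            }
        }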
    }

    //remove false labels from labels hash-map
    for (auto it = labels.begin(); it != labels.end();) {
        if (it->second == false)
            it = labels.erase(it);
        else
            ++it;
    }

    s.clear();
    s.str(asmText);
    line = "";

    std::string currentLabel;

    //3
    while (std::getline(s, line)) {
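        // trim() returns a view and leaves its argument untouched, so assign the result back
        line = std::string(trim(line));
        if (line.empty())
            continue;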

        if (std::regex_match(line, labelRe)) {
            auto l = line;
            l = l.substr(0, l.size() - 1);
            currentLabel = "";
            if (labels.find(l) != labels.end()) {
                currentLabel = line;
                output += line + "\n";
            }
            continue;
        }

        if (std::regex_match(line, directiveRe)) {
            //if we are in a label
            if (!currentLabel.empty()) {
                auto trimmedLine = trim(line);
                for (const auto& allowedDir : allowedDirectives) {
                    if (startsWith(trimmedLine, allowedDir)) {
                        output += line;
                        output += '\n';
                    }
                }
            }
            continue;
        }

        if (std::regex_match(line, numericLabelsRe)) {
            continue;
        }

        if (line == "endbr64") {
            continue;
        }

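        // any line containing ':' is treated as a label; find() also covers a
        // trailing ':' and, unlike indexing size() - 1, is safe on an empty line
        if (line.find(':') != std::string::npos) {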
            currentLabel = line;
            output += line + '\n';
            continue;
        }

        line.insert(line.begin(), '\t');

        output += line + '\n';
    }

    return output;
}

int main(int argc, char* argv[])
{
    if (argc < 2) {
        std::cout << "Please provide the asm filename you want to process.\n";
        return 1;
    }
    std::ifstream file(argv[1]);
    if (!file.is_open()) {
        std::cout << "Could not open '" << argv[1] << "'\n";
        return 1;
    }
    std::cout << "File '" << argv[1] << "' is opened\n";
    std::string output;
    std::string line;
    while (std::getline(file, line)) {
        output += line + '\n';
    }

    output = demangle(std::move(output));
    output = clean_asm(output);

    std::string fileName = argv[1];
    auto dotPos = fileName.rfind('.');
    if (dotPos != std::string::npos)
        fileName.erase(fileName.begin() + dotPos, fileName.end());

    std::cout << "Asm processed. Saving as '"<< fileName <<".asm'";
    std::ofstream out;
    out.open(fileName + ".asm");
    out << output;

    return 0;
}

How should I correctly learn GNU assembly?

When learning assembly, you can mostly focus on the instructions, and the .section and .globl directives, unless you're trying to learn how GAS directives produce metadata, including debug info and other stuff that's more useful for debugging high-level languages than hand-written asm.

A lot of debug-info directives like apparently .scl only have their syntax documented, no real details on what the values mean or what other things might care about what value you put there.

You can write working hand-written asm to play around with using pretty much only .section and .globl. (And for static data, .byte / .short / .long / .quad and .ascii / .asciz for initialized, .space in the BSS, and .p2align if you need it).

That's why Matt Godbolt's "compiler explorer" site filters out directives by default, except for data initializers, because of course data is in .data or .rdata, and code in .text, and the interesting part about compiler asm output is the actual instructions (and static data). See How to remove "noise" from GCC/clang assembly output?

Being fully compliant with Windows expectations for SEH metadata for stack unwinding (especially in 64-bit code) may take some extra directives, same for x86-64 SysV .cfi stack-unwind metadata. But that's something you can worry about after you understand the basics of assembly, if you ever need to use hand-written asm in a robust production-quality context, rather than just as a one-off experiment to learn how instructions work.

https://stackoverflow.com/tags/x86/info has some links to tutorials (and manuals), and Programming from the Ground Up is a good free book for 32-bit x86 with AT&T syntax. (And GAS directives). It's aimed at running on Linux, so it can teach some OS / computing concepts along the way, the kind of background knowledge necessary for assembly (and system calls) to make sense. Online HTML version

To follow it on a modern 64-bit GNU/Linux distro, you may need as --32 and ld -m elf_i386 to override the defaults to 32-bit. And gcc -m32 -fno-pie -no-pie anywhere the book says gcc.

You might also want -fno-stack-protector to further simplify the asm output from C, if you're comparing book examples of how C compiles. But be aware that different GCC versions will compile differently, especially as default tuning options have changed over the years. -mtune=pentium or -mtune=pentium3 might also get GCC to choose code-gen strategies that are more like an old book. Of course, current GCC's choices are also correct, and more appropriate for newer CPUs, just different from old GCC!

Also, the i386 SysV ABI as used on Linux has changed to require 16-byte stack alignment before a call instruction. This happened because GCC accidentally generated 32-bit code (e.g. using movaps) that relied on the 16-byte alignment GCC had been choosing to maintain as a performance optimization. Calling libc functions will usually still happen to work in 32-bit code with only 4-byte stack alignment, but if you see modern GCC reserving more space than it needs, that's usually why.

Nitpick:

I found some of assemblers, like NASM, MASM, and GAS. Their syntax is different(particularly pseudo-directives), NASM and MASM support Intel syntax, GAS support AT&T Syntax.

Directives are orthogonal to instruction syntax, and there are different flavours of Intel syntax, especially between MASM vs. NASM. (As well as major differences between MASM and NASM for directives).

Also, GAS supports .intel_syntax noprefix to use a somewhat MASM-like instruction syntax, but still with GAS directives.

(Similarly, YASM has an AT&T mode to use AT&T instruction syntax, but still NASM/YASM preprocessor and directives.)

Remove needless assembler statements from g++ output

I don't see any obvious missed optimizations in your -O1 output. Except of course setting up RBP as a frame pointer, but you used -fno-omit-frame-pointer so clearly you know why GCC didn't optimize that away.

The function has no local variables

Your function is a non-static class member function, so it has one implicit arg: this in rdi. Which g++ spills to the stack because of -O0. Function args count as local variables.
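To see the implicit this argument for yourself, compare a member function with a free function that takes the object explicitly. Counter and get_free are made-up names for illustration. On x86-64 System V you'd typically expect both to compile to the same load through rdi, since a reference is passed as a pointer just like this.

```cpp
struct Counter {
    int value;
    int get() const;
};

// Non-static member function: `this` arrives as a hidden first argument.
int Counter::get() const { return value; }

// Free-function equivalent with the object passed explicitly.
int get_free(const Counter& self) { return self.value; }
```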

How does a cyclic move without an effect improve the debugging experience. Please elaborate.

To improve C/C++ debugging: at -O0 the debug info describes each C variable as living at one fixed location relative to RSP or RBP for its whole scope, rather than tracking which register it's currently in. Also, so you can modify any variable with a debugger and continue, getting the expected results as if you'd done that in the C++ abstract machine. Every statement is compiled to a separate block of asm with no values alive in registers (Fun fact: except register int foo: that keyword does affect debug-mode code gen).

Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? applies to G++ and other compilers as well.

Which options would I have to set?

If you're reading / debugging the asm, use at least -Og or higher to disable the debug-mode spill-everything-between-statements behaviour of -O0. Preferably -O2 or -O3 unless you like seeing even more missed optimizations than you'd get with full optimization. But -Og or -O1 will do register allocation and make sane loops (with the conditional branch at the bottom), and various simple optimizations. Although still not the standard peephole of xor-zeroing.

How to remove "noise" from GCC/clang assembly output? explains how to write functions that take args and return a value so you can write functions that don't optimize away.

Loading into RAX and then movq %rax, %rdi is just a side-effect of -O0. GCC spends so little time optimizing the GIMPLE and/or RTL internal representations of the program logic (before emitting x86 asm) that it doesn't even notice it could have loaded into RDI in the first place. Part of the point of -O0 is to compile quickly, as well as consistent debugging.

Why is the subq $8, %rsp statement generated at all?

Because the ABI requires 16-byte stack alignment before a call instruction, and this function did an even number of 8-byte pushes. (call itself pushes a return address). It will go away at -O1 without -fno-omit-frame-pointer because you aren't forcing g++ to push/pop RBP as well as the call-preserved register it actually needs.
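The arithmetic behind that can be written out as a small sketch. padding_before_call is a hypothetical helper that assumes the x86-64 System V rule (RSP is 16-byte aligned right before a call, so it's 8 bytes off on function entry), and it only counts 8-byte pushes, ignoring any space reserved for locals.

```cpp
// Extra bytes to subtract from RSP so it is 16-byte aligned at the next call,
// after the return address (8 bytes) plus `pushes` 8-byte register pushes.
constexpr int padding_before_call(int pushes) {
    return (16 - (8 + 8 * pushes) % 16) % 16;
}

// An even number of pushes leaves RSP 8 bytes off, hence the sub $8, %rsp.
static_assert(padding_before_call(2) == 8, "even number of pushes needs padding");
// An odd number of pushes already restores 16-byte alignment.
static_assert(padding_before_call(1) == 0, "odd number of pushes is already aligned");
```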

Why does System V / AMD64 ABI mandate a 16 byte stack alignment?

Fun fact: clang will often use a dummy push %rcx/pop or something, depending on -mtune options, instead of an 8-byte sub.

If it were a leaf function, g++ would just use the red-zone below RSP for locals, even at -O0. Why is there no "sub rsp" instruction in this function prologue and why are function parameters stored at negative rbp offsets?

In un-optimized code it's not rare for G++ to allocate an extra 16 bytes it doesn't ever use. Even sometimes with optimization enabled g++ rounds up its stack allocation size too far when aiming for a 16-byte boundary. This is a missed-optimization bug. e.g. Memory allocation and addressing in Assembly
