Determine Target ISA Extensions of Binary File in Linux (Library or Executable)


I think you need a tool that checks every instruction to determine exactly which set it belongs to. Is there even an official name for the specific set of instructions implemented by the C3 processor? If not, it's even hairier.

A quick'n'dirty variant might be to do a raw search in the file, if you can determine the bit pattern of the disallowed instructions. Just test for them directly; this could be done with a simple objdump | grep chain, for instance.

How do I determine the target architecture of a static library (.a) on Mac OS X?

Another option is lipo; its output is brief and more readable than otool's.

An example:

% lipo -info /usr/lib/libiodbc.a 
Architectures in the fat file: /usr/lib/libiodbc.a are: x86_64 i386 ppc
% lipo -info libnonfatarchive.a
input file libnonfatarchive.a is not a fat file
Non-fat file: libnonfatarchive.a is architecture: i386
%

Different ISA binary profiling results contradiction

You're profiling qemu interpreting / emulating RISC-V, not the RISC-V "guest" code inside QEMU. QEMU can't do that; it's not a cycle-accurate simulator of anything.

That's slower and takes more instructions than native code compiled for your x86-64 in the first place.

Using binfmt_misc to transparently run qemu-riscv64 on RISC-V binaries makes ./unit_tests exactly equivalent to qemu-riscv64 ./unit_tests.

Your test results prove this: perf stat qemu-riscv64 ./unit_tests gave you approximately the same results as what's in your question.


Somewhat related: Modern Microprocessors: A 90-Minute Guide! has some good details about how CPU pipelines work. RISC isn't always better: modern x86 CPUs spend enough transistors to run x86-64 code fast.

You actually would expect more total instructions for the same work from a RISC CPU, just not that many more instructions. Like maybe 1.1x or 1.25x?

Performance depends on the microarchitecture, not (just) the instruction set. IPC and total time or cycles depend entirely on how aggressive the microarchitecture is at finding instruction-level parallelism. Modern Intel designs are some of the best at that, even in fairly dense CISC x86 code where memory-source instructions are common.
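To make the IPC point concrete, here is the usual decomposition of run time (a textbook identity, not measured data), with N instructions retired at a given IPC and clock frequency f:

```latex
T = \frac{N}{\mathrm{IPC}\cdot f}
\qquad\Longrightarrow\qquad
\frac{T_{\text{RISC-V}}}{T_{\text{x86}}}
  = \frac{N_{\text{RISC-V}}}{N_{\text{x86}}}
    \cdot \frac{\mathrm{IPC}_{\text{x86}}}{\mathrm{IPC}_{\text{RISC-V}}}
    \cdot \frac{f_{\text{x86}}}{f_{\text{RISC-V}}}
```

For instance (hypothetical numbers), 1.25x as many instructions at equal IPC and clock gives 1.25x the run time, while a RISC-V core sustaining 1.25x the IPC at the same clock would break even.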

Scan binary for CPU feature usage

There are two good approaches:

  • Run under a debugger and look at instruction that caused an illegal-instruction fault
  • Run under a simulator/emulator that can show you an instruction mix, like SDE.

But your idea, statically scanning the binary, can't distinguish code in functions that are only called after checking cpuid.



Using a debugger to look at the faulting instruction

Pick any debugger. GDB is easy to install on any Linux distro, and probably also on Windows or Mac (or lldb there). Or pick any other debugger, e.g. one with a GUI.

Run the program. Once it faults, use the debugger to examine the faulting instruction.

Look it up in Intel's or AMD's x86 asm reference manual; e.g. https://www.felixcloutier.com/x86/ is an HTML scrape of Intel's PDFs. See which ISA extension this form of the instruction requires.

For example, this source can compile to use AVX-512 instructions if you let the compiler do so, but only needs SSE2 to compile in the first place.

#include <immintrin.h>

// stores to global vars typically aren't optimized out, even without volatile
int buf[16];

int main(int argc, char **argv)
{
    __m128i v = _mm_set1_epi32(argc);    // broadcast scalar to vector
    _mm_storeu_si128((__m128i*)buf, v);  // 16-byte store (storeu: no alignment required)
}

(See it on Godbolt with different compile options.)

Build with gcc -march=skylake-avx512 -O3 ill.c.

Then try to run it, e.g. on my Skylake-client (non-AVX512) GNU/Linux desktop. (I also used strip a.out to remove the symbol table (function names), like a binary-only software release.)

$ ./a.out 
Illegal instruction (core dumped)
$ gdb a.out
...
(gdb) run
Starting program: /tmp/a.out

Program received signal SIGILL, Illegal instruction.
0x0000555555555020 in ?? ()

(gdb) disas
No function contains program counter for selected frame.

(gdb) disas /r $pc,+20 # from current program counter to +20 bytes
Dump of assembler code from 0x555555555020 to 0x555555555034:
=> 0x0000555555555020: 62 f2 7d 08 7c c7 vpbroadcastd xmm0,edi
0x0000555555555026: c5 f9 7f 05 32 30 00 00 vmovdqa XMMWORD PTR [rip+0x3032],xmm0 # 0x555555558060
0x000055555555502e: 31 c0 xor eax,eax
0x0000555555555030: c3 ret
0x0000555555555031: 66 2e 0f 1f 84 00 00 00 00 00 cs nop WORD PTR [rax+rax*1+0x0]
End of assembler dump.

The => indicates the current program counter (RIP in x86-64, but GDB portably defines $pc as an alias on any ISA).

So we faulted on vpbroadcastd xmm0,edi. (The way GCC implemented _mm_set1_epi32(argc) when we told it AVX512 was available.)

That instruction doesn't access memory, and the fault was SIGILL (illegal instruction), not SIGSEGV (segmentation fault) anyway, so we can be sure that actually trying to execute an unsupported instruction was the direct cause of the crash here. (It's also possible for it to be an indirect cause, e.g. a program using lzcnt eax, ecx but an old CPU running it as bsr eax, ecx, and then using that different integer as an array index. LZCNT confusion is somewhat unlikely in your case, since AMD has supported LZCNT for longer than Intel.)

So let's check on vpbroadcastd: there are multiple entries for vpbroadcast in Intel's manual:

  • VPBROADCAST Load Integer and Broadcast - nope, only has entries with XMM and memory sources.
  • VPBROADCASTB/VPBROADCASTW/VPBROADCASTD/VPBROADCASTQ — Load with Broadcast Integer Data from General Purpose Register - this is the one we want
  • VBROADCAST — Load with Broadcast Floating-Point Data - nope, this one also only has memory or vector register source operands. And it's vbroadcastss etc., not vP... integer instructions. (Intel's convention is that p... is packed-integer, and ...ps/...pd is packed-single or packed-double.)

If the mnemonic starts with v and you can't find an entry (e.g. vaddps), that's because the instruction existed before AVX and is documented under its legacy-SSE mnemonic. For example, the SSE1 addps entry lists both addps and vaddps encodings, including the AVX-512 encodings that allow ZMM registers, x/ymm16..31, and masking like vaddps ymm0{k3}{z}, ymm1, ymm2. That last form is an AVX-512F+VL instruction.

Anyway, back to our example. The table entry that matches the faulting instruction was the following. Note the 7C opcode byte before the ModR/M (/r) byte that encodes the operands. It comes right after the 4-byte EVEX prefix in the dump, a cross-check that this is indeed the opcode we're looking for.

EVEX.128.66.0F38.W0 7C /r VPBROADCASTD xmm1 {k1}{z}, r32

It requires "AVX512VL AVX512F" according to the table. The {k1}{z} is optional masking. r32 is a 32-bit general-purpose integer register, like edi in this case. xmm1 means any XMM register can be the first xmm operand to this instruction; in this case GCC chose XMM0.

My CPU doesn't have AVX-512 at all, so it faulted.



SDE instruction mix

This should work equally well on Windows or any other OS that SDE supports.

Intel's SDE (Software Development Emulator) has a -mix option whose output includes a breakdown by required ISA extension. See the Q&A "How do I monitor the amount of SIMD instruction usage" for more about using it.

Using the same example a.out I used with GDB:

Running /opt/sde-external-8.33.0-2019-02-07-lin/sde64 -mix -- ./a.out created a file sde-mix-out.txt containing a lot of output, including stats for how often different basic blocks were executed. (Some blocks in the dynamic linker ran many times.) I don't know if there's an option to omit that section; I'd expect it to get pretty bloated for a large program, although it might only print the top few blocks even if there are many more.

Then we get to the part we want:

...
# END_TOP_BLOCK_STATS
# EMIT_DYNAMIC_STATS FOR TID 0 OS-TID 1168465 EMIT #1
#
# $dynamic-counts
#
# TID 0
# opcode count
#
*stack-read 8806
*stack-write 8314
*iprel-read 1003
*iprel-write 437

...

*isa-ext-AVX 4
*isa-ext-AVX2 5
*isa-ext-AVX512EVEX 1
*isa-ext-BASE 133338
*isa-ext-LONGMODE 545
*isa-ext-SSE 56
*isa-ext-SSE2 2560
*isa-ext-XSAVE 1
*isa-set-AVX 4
*isa-set-AVX2 5
*isa-set-AVX512F_128 1
*isa-set-CMOV 266
*isa-set-FAT_NOP 891
*isa-set-I186 2676
*isa-set-I386 7626
*isa-set-I486REAL 71
*isa-set-I86 121192
*isa-set-LONGMODE 545
*isa-set-PENTIUMREAL 8
*isa-set-PPRO 608
*isa-set-SSE 56
*isa-set-SSE2 2560
*isa-set-XSAVE 1

The 1 count for isa-set-AVX512F_128 is the instruction that would have faulted on my CPU, which doesn't support AVX-512 at all. AVX512F_128 is AVX512F (foundation) + AVX512VL (vector length, allowing vectors other than 512-bit ZMM registers).

(It was also counted under isa-ext-AVX512EVEX. EVEX is the machine-code prefix for AVX-512 vector instructions; AVX-512 mask instructions like kandw k0, k1, k2 use VEX encoding, like AVX1/AVX2 SIMD instructions. But the isa-ext category alone wouldn't distinguish an Ice Lake new instruction like vpermb faulting on a Skylake-server CPU that supports AVX-512F but not AVX512VBMI.)

Everything other than AVX-512 is probably simpler, since there's a fully separate name for each extension.



Static disassembly

You can disassemble most binaries; if they're not obfuscated then disassembly should find all the instructions that might ever execute. (And high-performance code that uses new instructions is unlikely to be using hacks that throw off a disassembler, like jumping into the middle of what straight-line disassembly would see as a different instruction; x86 machine code is a byte-stream of variable-length instructions.)

But that doesn't tell you which instructions actually do execute; some might be in functions that are only called after checking CPUID to find out if the necessary extensions are supported.

(And I don't know of a tool to categorize them by ISA extension, although I've never looked for one; usually developers wanting to make sure they didn't use AVX2 instructions in code that will run on AVX1-only CPUs use build-time checks, or test by running under an emulator or on a real CPU.)

Difference between a shared library (.so) and a Linux executable file without an extension?


  1. One of the main differences is that a shared library does not have a main() function. It also contains position-independent code, which may or may not be the case for executables. If you do put a main() function in the library, you still need to link it with a normal object file (containing no main() function).
  2. Yes. To create a shared library, compile your code with -fpic or -fPIC to generate position-independent code (PIC) suitable for use in a shared library, and link it with -shared.

Nothing prevents you from naming an executable myexe.so, though; it still can't be used as a shared library.

Linux - How to fix shared library links within executable image?

I would suggest creating a static build of your application: link with libxxx.a (the static library) when you create the build.

Or you can create a .deb package (for Ubuntu and Debian); the package manager will then resolve the library dependencies.

By
SIVA K


