When Compiling Programs to Run Inside a Vm, What Should March and Mtune Be Set To

When compiling programs to run inside a VM, what should march and mtune be set to?

Some incomplete and out-of-order excerpts from section 3.17.14, Intel 386 and AMD x86-64 Options, of the GCC 4.6.3 manual (which I hope are pertinent).

-march=cpu-type
Generate instructions for the machine type cpu-type.
The choices for cpu-type are the same as for -mtune.
Moreover, specifying -march=cpu-type implies -mtune=cpu-type.

-mtune=cpu-type
Tune to cpu-type everything applicable about the generated code,
except for the ABI and the set of available instructions.
The choices for cpu-type are:
generic
Produce code optimized for the most common IA32/AMD64/EM64T processors.
native
This selects the CPU to tune for at compilation time by determining
the processor type of the compiling machine.
Using -mtune=native will produce code optimized for the local machine
under the constraints of the selected instruction set.
Using -march=native will enable all instruction subsets supported by
the local machine (hence the result might not run on different machines).

What I found most interesting is that specifying -march=cpu-type implies -mtune=cpu-type. My take on the rest was that if you are specifying both -march and -mtune, you're probably getting close to tweaking overkill.

My suggestion would be to just use -m64, and you should be safe enough since you're running inside an x86-64 Linux, correct?

But if you don't need to run in another environment, and you're feeling lucky and fault-tolerant, then -march=native might also work just fine for you.

-m32
The 32-bit environment sets int, long and pointer to 32 bits
and generates code that runs on any i386 system.
-m64
The 64-bit environment sets int to 32 bits and long and pointer
to 64 bits and generates code for AMD's x86-64 architecture.


For what it's worth ...

Out of curiosity I tried using the technique described in the article you referenced. I tested gcc v4.6.3 in 64-bit Ubuntu 12.04 which was running as a VMware Player guest. The VMware VM was running in Windows 7 on a desktop using an Intel Pentium Dual-Core E6500 CPU.

The gcc option -m64 was replaced with just -march=x86-64 -mtune=generic.

However, compiling with -march=native resulted in gcc using all of the much more specific compiler options below.

-march=core2 -mtune=core2 -mcx16 
-mno-abm -mno-aes -mno-avx -mno-bmi -mno-fma -mno-fma4 -mno-lwp
-mno-movbe -mno-pclmul -mno-popcnt -mno-sse4.1 -mno-sse4.2
-mno-tbm -mno-xop -msahf --param l1-cache-line-size=64
--param l1-cache-size=32 --param l2-cache-size=2048

So, yes, as the gcc documentation states, when "Using -march=native ... the result might not run on different machines". To play it safe, you should probably use only -m64, or its apparent equivalent -march=x86-64 -mtune=generic, for your compiles.

I can't see how you would have any problem with this, since the intent of those compiler options is that gcc will produce code capable of running correctly on any x86-64/amd64-compliant CPU. (No?)

I am frankly astounded at how specific the gcc -march=native CPU options turned out to be. I have no idea how a CPU's L1 cache size being 32 KB could be used to fine-tune the generated code. But apparently if there is a way to do this, then using -march=native will allow gcc to do it.

I wonder if this might result in any noticeable performance improvements?

mtune and march when compiling in a docker image

If I use native in an image built by Docker Hub, I guess this will use the specs of the machine used by Docker Hub, and this will impact the image binary available for download?

That's true. When the Docker image is built, it is built on the host machine, using its resources, so -march=native and -mtune=native will pick up the specs of the host machine.

For building Docker images that may be used by a wide audience, and making them work on as many (x86) targets as possible, it's best to use a common instruction set. If you need to specify -march and -mtune, these would probably be the safest choice:

-march=x86-64 -mtune=generic

There may be a performance hit compared to -march=native -mtune=native in certain cases, but fortunately, for most applications the change goes almost unnoticed (specific applications may be more affected, especially if they depend on a small set of hot kernel functions that GCC can optimize well, for example by utilizing the CPU's vector instruction sets).

For reference, check this detailed benchmark comparison by Phoronix:

GCC Compiler Tests At A Variety Of Optimization Levels Using Clear Linux

It compares about a dozen benchmarks with GCC 6.3 using different optimization flags. The benchmarks were run on an Intel Core i7-6800K machine, which supports modern Intel instruction sets including SSE, AVX, BMI, etc. (see here for the complete list). Specifically, -O3 vs. -O3 -march=native is the interesting comparison.
You can see that in most benchmarks, the advantage of -O3 -march=native over -O3 is minor to negligible (and in one case, -O3 wins...).

To conclude, -march=x86-64 -mtune=generic is a decent choice for Docker images and should provide good portability and a typically minor performance hit.

What are my available march/mtune options?

Use gcc --target-help

-march=CPU[,+EXTENSION...]
generate code for CPU and EXTENSION, CPU is one of:
generic32, generic64, i386, i486, i586, i686,
pentium, pentiumpro, pentiumii, pentiumiii, pentium4,
prescott, nocona, core, core2, corei7, l1om, k1om,
iamcu, k6, k6_2, athlon, opteron, k8, amdfam10,
bdver1, bdver2, bdver3, bdver4, znver1, znver2,
btver1, btver2
...

The choices are usually not general architectures like x86 or x86-64 but specific microarchitectures. There is, however, x86-64 (not x86_64) for a generic x86 CPU with 64-bit extensions. The full list for each architecture can be found in GCC's -march manual. For x86:

  • -march=cpu-type

    Generate instructions for the machine type cpu-type. In contrast to -mtune=cpu-type, which merely tunes the generated code for the specified cpu-type, -march=cpu-type allows GCC to generate code that may not run at all on processors other than the one indicated. Specifying -march=cpu-type implies -mtune=cpu-type.


...

https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-march-13


While the baseline version of -march is -march=x86-64, the baseline / default tune option is -mtune=generic. That aims to not be terrible anywhere, avoiding performance pitfalls even at the cost of extra instructions or code size.


-march=native will pick the right arch and tune settings for the machine the compiler is running on, falling back to -mtune=generic if the compiler doesn't recognize the specific model of CPU it's running on.

(e.g. an old gcc on a Skylake will still enable -mavx2 -mpopcnt -mbmi2 and so on, but will set -mtune=generic instead of something closer to appropriate.)

How to correctly determine -march and -mtune for Intel processors?

In the gcc version you're using, Haswell was called core-avx2. Other microarchitectures also had unhelpful names: for example, Ivy Bridge, Sandy Bridge, and Westmere were called core-avx-i, corei7-avx, and corei7, respectively. Starting with gcc 4.9.0, the actual names of the microarchitectures are used, so on a Haswell processor gcc -march=native -Q --help=target | grep march will print haswell instead of core-avx2 (see the patch).

When passing -mtune=native to gcc and the host processor is not known to the version of gcc you're using, it will apply generic tuning. Your processor model (63) is only known to gcc 5.1.0 and later (see the patch).

The name-printing part of -Q --help=target has to pick some name for -march=native. For CPUs too new for your GCC to recognize specifically, it will pick something like Broadwell if the processor supports ADX, or the microarchitecture that supports the highest SIMD extension (up to AVX2) that is supported on the host processor (as determined by cpuid).

But the actual effect of -march=native is to enable all the appropriate -mavx -mpopcnt -mbmi2 -mcx16 and so on options, all detected separately using cpuid. So for code-gen purposes, -march=native always works for enabling ISA extensions that your GCC knows how to use, even if it doesn't recognize your CPU.

But for setting tune options, -march=native or -mtune=native totally fails and falls back to generic when it doesn't recognize your CPU exactly. It unfortunately doesn't do things like tune=intel for unknown-Intel CPUs.


On your processor, gcc knows that it supports AVX2, so it assumes it is a Haswell processor (called core-avx2 in your gcc version), because AVX2 support starts with Haswell, but it doesn't know for sure that it actually is a Haswell. That's why it applies generic tuning instead of tuning for core-avx2 (i.e., Haswell). In this case, I think that would have the same effect as tuning for core-avx2, because to that compiler version only Haswell supports AVX2 and the compiler knows that the host processor supports AVX2. In general, though, it may not tune for the native microarchitecture even if -march was guessed correctly on an unknown CPU.

(Editor's note: no, tune=generic doesn't adapt to which instruction-set options are enabled. It's still fully generic tuning, including caring about CPUs like AMD Phenom or Intel Sandybridge that don't support AVX2. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 and Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?.

This is one reason why you should use -march=native or -march=haswell (with a new enough gcc), not just -mavx2 -mfma. The other reason is that you're probably going to forget -mbmi2 -mpopcnt -mcx16, and maybe even forget -mfma)

Why is -march=native used so rarely?

Conservative

If you take a closer look at the defaults of gcc, the oldest compiler in your list, you'll realize that they are very conservative:

  • By default, on x86, only SSE 2 is activated; not even SSE 4.
  • The set of flags in -Wall and -Wextra has not changed for years; there are new useful warnings, but they are NOT added to -Wall or -Wextra.

Why? Because it would break things!

There are entire development chains relying on those convenience defaults, and any alteration brings the risk of either breaking them, or of producing binaries that will not run on the targets.

The more users, the greater the threat, so developers of gcc are very, very conservative to avoid world-wide breakage. And developers of the next batch of compilers follow in the footsteps of their elders: it's proven to work.

Note: rustc defaults to static linking, and boasts that you can just copy the binary and drop it on another machine; obviously, -march=native would be an impediment there.

Masses Friendly

And the truth is, it probably doesn't matter. You actually recognized it yourself:

In my own experience, this flag can provide massive speedups for numerically-intensive code

Most code is full of virtual calls and branches (typically OO code) and not at all numerically-intensive. Thus, for the majority of the code, SSE 2 is often sufficient.

The few codebases for which performance really matters will require significant time invested in performance tuning anyway, both at the code and compiler level. And if vectorization matters, it won't be left to the whim of the compiler: developers will use the built-in intrinsics and write the vectorized code themselves, as that's cheaper than setting up a monitoring tool to ensure that auto-vectorization actually happened.

Also, even for numerically intensive code, the host machine and the target machine might differ slightly. Compilation benefits from lots of cores, even at a lower frequency, while execution benefits from a high frequency and possibly fewer cores, unless the work is easily parallelizable.

Conclusion

Not activating -march=native by default makes it easier for users to get started; and since even performance seekers may not care for it much, there's more to lose than to gain.


In an alternative history where the default had been -march=native from the beginning, users would be used to specifying the target architecture, and we would not be having this discussion.

Configuring compilers on Mac M1 (Big Sur, Monterey) for Rcpp and other tools

Background

Currently (2022-04-24), CRAN builds R 4.2 binaries for Apple silicon using Apple clang from Command Line Tools for Xcode 13.1 and using an experimental fork of GNU Fortran 12.

If you obtain R from CRAN (i.e., here), then you need to replicate CRAN's compiler setup on your system before building R packages that contain C/C++/Fortran code from their sources (and before using Rcpp, etc.). This requirement ensures that your package builds are compatible with R itself.

A further complication is the fact that Apple clang doesn't support OpenMP, so you need to do even more work to compile programs that make use of multithreading. You could circumvent the issue by building R itself and all R packages from sources with LLVM clang, which does support OpenMP, but that approach is onerous and "for experts only".

There is another approach that has been tested by a few people, including Simon Urbanek, the maintainer of R for macOS. It is experimental and also "for experts only", but it works on my machine and is much simpler than learning to build R yourself.

Instructions for obtaining a working toolchain

Warning: These come with no warranty and could break at any time. Some level of familiarity with C/C++/Fortran program compilation, Makefile syntax, and Unix shells is assumed. Everyone is encouraged to consult official documentation, which is more likely to be maintained than answers on SO. As usual, sudo at your own risk.

I will try to address compilers and OpenMP support at the same time. I am going to assume that you are starting from nothing. Feel free to skip steps you've already taken, though you might find a fresh start helpful.

I've tested these instructions on a machine running Big Sur, and at least one person has tested them on a machine running Monterey. I would be glad to hear from others.

  1. Download an R 4.2 binary from CRAN here and install. Be sure to select the binary built for Apple silicon.

  2. Run

    $ sudo xcode-select --install

    in Terminal to install the latest release version of Apple's Command Line Tools for Xcode, which includes Apple clang. You can obtain earlier versions from your browser here. However, the version that you install should not be older than the one that CRAN used to build your R binary.

  3. Download the gfortran binary recommended here and install by unpacking to root:

    $ curl -LO https://mac.r-project.org/tools/gfortran-12.0.1-20220312-is-darwin20-arm64.tar.xz
    $ sudo tar xvf gfortran-12.0.1-20220312-is-darwin20-arm64.tar.xz -C /
    $ sudo ln -sfn $(xcrun --show-sdk-path) /opt/R/arm64/gfortran/SDK

    The last command updates a symlink inside of the gfortran installation so that it points to the SDK inside of your Command Line Tools installation.

  4. Download an OpenMP runtime suitable for your Apple clang version here and install by unpacking to root. You can query your Apple clang version with clang --version. For example, I have version 1300.0.29.3, so I did:

    $ curl -LO https://mac.r-project.org/openmp/openmp-12.0.1-darwin20-Release.tar.gz
    $ sudo tar xvf openmp-12.0.1-darwin20-Release.tar.gz -C /

    After unpacking, you should find these files on your system:

    /usr/local/lib/libomp.dylib
    /usr/local/include/ompt.h
    /usr/local/include/omp.h
    /usr/local/include/omp-tools.h
  5. Add the following lines to $(HOME)/.R/Makevars, creating the file if necessary.

    CPPFLAGS+=-I/usr/local/include -Xclang -fopenmp
    LDFLAGS+=-L/usr/local/lib -lomp

    FC=/opt/R/arm64/gfortran/bin/gfortran -mtune=native
    FLIBS=-L/opt/R/arm64/gfortran/lib/gcc/aarch64-apple-darwin20.6.0/12.0.1 -L/opt/R/arm64/gfortran/lib -lgfortran -lemutls_w -lm
  6. Run R and test that you can compile a program with OpenMP support. For example:

    if (!requireNamespace("RcppArmadillo", quietly = TRUE)) {
    install.packages("RcppArmadillo")
    }
    Rcpp::sourceCpp(code = '
    #include <RcppArmadillo.h>
    #ifdef _OPENMP
    # include <omp.h>
    #endif

    // [[Rcpp::depends(RcppArmadillo)]]
    // [[Rcpp::export]]
    void omp_test()
    {
    #ifdef _OPENMP
    Rprintf("OpenMP threads available: %d\\n", omp_get_max_threads());
    #else
    Rprintf("OpenMP not supported\\n");
    #endif
    }
    ')
    omp_test()
    OpenMP threads available: 8

    If the C++ code fails to compile, or if it compiles without error but you get linker warnings or you find that OpenMP is not supported, then one of us has probably made a mistake. Please report any issues.

References

Everything is a bit scattered:

  • R Installation and Administration manual [link]
  • R for macOS Developers page [link]

Wrong compiler used in the target offload process in CLion during CUDA compilation

So, I filed a bug against CLion about this issue - the one @MaximBanaev pointed out:

CPP-19089: Can't get CLion to use a non-default C++ compiler with CUDA

CLion has simply not properly taken this issue into account in its toolchain mechanism. The issue is not even just the environment - there needs to be proper UI to set the "backing" C++ compiler for CUDA, in addition to CLion noticing the relevant settings when it comes up.

This isn't the only "fundamental" bug I've encountered trying to use CLion with CUDA projects. The support is still rather half-baked, IMHO. I wish they would get it right, because I would like to try this IDE out; for now, I can't.


