When compiling programs to run inside a VM, what should march and mtune be set to?
Some incomplete, out-of-order excerpts from section 3.17.14, "Intel 386 and AMD x86-64 Options", of the GCC 4.6.3 manual (which I hope are pertinent).
-march=cpu-type
Generate instructions for the machine type cpu-type.
The choices for cpu-type are the same as for -mtune.
Moreover, specifying -march=cpu-type implies -mtune=cpu-type.
-mtune=cpu-type
Tune to cpu-type everything applicable about the generated code,
except for the ABI and the set of available instructions.
The choices for cpu-type are:
generic
Produce code optimized for the most common IA32/AMD64/EM64T processors.
native
This selects the CPU to tune for at compilation time by determining
the processor type of the compiling machine.
Using -mtune=native will produce code optimized for the local machine
under the constraints of the selected instruction set.
Using -march=native will enable all instruction subsets supported by
the local machine (hence the result might not run on different machines).
What I found most interesting is that specifying -march=cpu-type implies -mtune=cpu-type. My take on the rest was that if you are specifying both -march and -mtune, you're probably getting too close to tweak overkill.
My suggestion would be to just use -m64 and you should be safe enough, since you're running inside an x86-64 Linux, correct? But if you don't need to run in another environment and you're feeling lucky and fault tolerant, then -march=native might also work just fine for you.
-m32
The 32-bit environment sets int, long and pointer to 32 bits
and generates code that runs on any i386 system.
-m64
The 64-bit environment sets int to 32 bits and long and pointer
to 64 bits and generates code for AMD's x86-64 architecture.
For what it's worth ...
Out of curiosity I tried using the technique described in the article you referenced. I tested gcc v4.6.3 in 64-bit Ubuntu 12.04 which was running as a VMware Player guest. The VMware VM was running in Windows 7 on a desktop using an Intel Pentium Dual-Core E6500 CPU.
The gcc option -m64 was replaced with just -march=x86-64 -mtune=generic. However, compiling with -march=native resulted in gcc using all of the much more specific compiler options below.
-march=core2 -mtune=core2 -mcx16
-mno-abm -mno-aes -mno-avx -mno-bmi -mno-fma -mno-fma4 -mno-lwp
-mno-movbe -mno-pclmul -mno-popcnt -mno-sse4.1 -mno-sse4.2
-mno-tbm -mno-xop -msahf --param l1-cache-line-size=64
--param l1-cache-size=32 --param l2-cache-size=2048
So, yes, as the gcc documentation states, when "Using -march=native ... the result might not run on different machines". To play it safe you should probably only use -m64, or its apparent equivalent -march=x86-64 -mtune=generic, for your compiles.
I can't see how you would have any problem with this, since the intent of those compiler options is that gcc will produce code capable of running correctly on any x86-64/amd64-compliant CPU. (No?)
I am frankly astounded at how specific the gcc -march=native CPU options turned out to be. I have no idea how a CPU's L1 cache size being 32k could be used to fine-tune the generated code. But apparently if there is a way to do this, then using -march=native will allow gcc to do it. I wonder if this might result in any noticeable performance improvements?
mtune and march when compiling in a docker image
If I use native in an image built by Docker Hub, I guess this will use the spec of the machine used by Docker Hub, and this will impact the image binary available for download?
That's true. When the docker image is built, it is done on the host machine and using its resources, so -march=native and -mtune=native will take the specs of the host machine.
For building docker images that may be used by a wide audience, making them work on as many x86 targets as possible, it's best to use a common instruction set. If you need to specify march and mtune, these would probably be the safest choice:
-march=x86-64 -mtune=generic
There may be some performance hits compared to -march=native -mtune=native in certain cases, but fortunately, in most applications the change could go almost unnoticed (specific applications may be more affected, especially if they spend their time in a small set of hot kernel functions that GCC is able to optimize well, for example by utilizing the CPU's vector instruction sets).
For reference, check this detailed benchmark comparison by Phoronix:
GCC Compiler Tests At A Variety Of Optimization Levels Using Clear Linux
It compares about a dozen benchmarks with GCC 6.3 using different optimization flags. The benchmarks ran on an Intel Core i7 6800K machine, which supports modern Intel instruction sets including SSE, AVX, BMI, etc. (see here for the complete list). Specifically, -O3 vs. -O3 -march=native is the interesting metric. You can see that in most benchmarks, the advantage of -O3 -march=native over -O3 is minor to negligible (and in one case, -O3 wins...).
To conclude, -march=x86-64 -mtune=generic is a decent choice for Docker images and should provide good portability with a typically minor performance hit.
What are my available march/mtune options?
Use gcc --target-help
-march=CPU[,+EXTENSION...]
generate code for CPU and EXTENSION, CPU is one of:
generic32, generic64, i386, i486, i586, i686,
pentium, pentiumpro, pentiumii, pentiumiii, pentium4,
prescott, nocona, core, core2, corei7, l1om, k1om,
iamcu, k6, k6_2, athlon, opteron, k8, amdfam10,
bdver1, bdver2, bdver3, bdver4, znver1, znver2,
btver1, btver2
...
The accepted values are often not general architectures like x86 or x86-64 but specific microarchitectures. There is, however, x86-64 (not x86_64) for a generic x86 CPU with 64-bit extensions. The full list for each architecture can be found in GCC's -march manual. For x86:
-march=cpu-type
Generate instructions for the machine type cpu-type. In contrast to -mtune=cpu-type, which merely tunes the generated code for the specified cpu-type, -march=cpu-type allows GCC to generate code that may not run at all on processors other than the one indicated. Specifying -march=cpu-type implies -mtune=cpu-type.
...
https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-march-13
While the baseline version of -march is -march=x86-64, the baseline / default tune option is -mtune=generic. That aims to not be terrible anywhere, avoiding performance pitfalls even at the cost of extra instructions or code size.
-march=native will pick the right arch and tune settings for the machine the compiler is running on, or fall back to -mtune=generic if the compiler doesn't recognize the specific model of CPU it's running on. (E.g. an old gcc on a Skylake will still enable -mavx2 -mpopcnt -mbmi2 and so on, but will set -mtune=generic instead of something closer to appropriate.)
How to correctly determine -march and -mtune for Intel processors?
In the gcc version you're using, Haswell was called core-avx2. Other microarchitectures also had awkward names: for example, Ivy Bridge, Sandy Bridge, and Westmere were called core-avx-i, corei7-avx, and corei7, respectively. Starting with gcc 4.9.0, the actual names of the microarchitectures are used, so gcc will print Haswell instead of core-avx2 when using gcc -march=native -Q --help=target | grep march on a Haswell processor (see the patch).
When passing -mtune=native to gcc and the host processor is not known to the version of gcc you're using, it will apply generic tuning. Your processor model (63) is only known to gcc 5.1.0 and later (see the patch).
The name-printing part of -Q --help=target has to pick some name for -march=native. For CPUs too new for your GCC to recognize specifically, it will pick something like Broadwell if the processor supports ADX, or else the microarchitecture that supports the highest SIMD extension (up to AVX2) supported on the host processor (as determined by cpuid).
But the actual effect of -march=native is to enable all the appropriate -mavx -mpopcnt -mbmi2 -mcx16 and so on options, all detected separately using cpuid. So for code-gen purposes, -march=native always works for enabling ISA extensions that your GCC knows how to use, even if it doesn't recognize your CPU.
But for setting tune options, -march=native or -mtune=native totally fails and falls back to generic when it doesn't recognize your CPU exactly. It unfortunately doesn't do things like tune=intel for unknown Intel CPUs.
On your processor, gcc knows that it supports AVX2, so it assumes that it is a Haswell processor (called core-avx2 in your gcc version) because AVX2 was introduced with Haswell, but it doesn't know for sure that it is actually a Haswell processor. That's why it applies generic tuning instead of tuning for core-avx2 (i.e., Haswell). But in this case, I think this would have the same effect as tuning for core-avx2 because, to that compiler version, only Haswell supports AVX2 and the compiler knows that the host processor supports AVX2. In general, though, it may not tune for the native microarchitecture even if -march was guessed correctly on an unknown CPU.
(Editor's note: no, tune=generic doesn't adapt to which instruction-set options are enabled. It's still fully generic tuning, including caring about CPUs like AMD Phenom or Intel Sandy Bridge that don't support AVX2. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 and Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?.
This is one reason why you should use -march=native or -march=haswell (with a new enough gcc), not just -mavx2 -mfma. The other reason is that you're probably going to forget -mbmi2 -mpopcnt -mcx16, and maybe even forget -mfma.)
Why is -march=native used so rarely?
Conservative
If you take a closer look at the defaults of gcc, the oldest compiler in your list, you'll realize that they are very conservative:
- By default, on x86, only SSE 2 is activated; not even SSE 4.
- The set of flags in -Wall and -Wextra has not changed for years; new useful warnings exist, but they are NOT added to -Wall or -Wextra.
Why? Because it would break things!
There are entire development chains relying on those convenience defaults, and any alteration brings the risk of either breaking them, or of producing binaries that will not run on the targets.
The more users, the greater the threat, so developers of gcc are very, very conservative to avoid world-wide breakage. And developers of the next batch of compilers follow in the footsteps of their elders: it's proven to work.
Note: rustc defaults to static linking, and boasts that you can just copy the binary and drop it on another machine; obviously -march=native would be an impediment there.
Masses Friendly
And in truth, it probably doesn't matter. You actually recognized it yourself:
In my own experience, this flag can provide massive speedups for numerically-intensive code
Most code is full of virtual calls and branches (typically OO code) and not at all numerically-intensive. Thus, for the majority of the code, SSE 2 is often sufficient.
The few codebases for which performance really matters will require significant time invested in performance tuning anyway, both at code and compiler level. And if vectorization matters, it won't be left at the whim of the compiler: developers will use the built-in intrinsics and write the vectorized code themselves, as it's cheaper than putting up a monitoring tool to ensure that auto-vectorization did happen.
Also, even for numerically intensive code, the host machine and the target machine might differ slightly. Compilation benefits from many cores, even at a lower frequency, while execution benefits from high frequency and possibly fewer cores, unless the work is easily parallelizable.
Conclusion
Not activating -march=native by default makes it easier for users to get started; since even performance seekers may not care much for it, there is more to lose than to gain. In an alternative history where the default had been -march=native from the beginning, users would be used to specifying the target architecture, and we would not be having this discussion.
Configuring compilers on Mac M1 (Big Sur, Monterey) for Rcpp and other tools
Background
Currently (2022-04-24), CRAN builds R 4.2 binaries for Apple silicon using Apple clang
from Command Line Tools for Xcode 13.1 and using an experimental fork of GNU Fortran 12.
If you obtain R from CRAN (i.e., here), then you need to replicate CRAN's compiler setup on your system before building R packages that contain C/C++/Fortran code from their sources (and before using Rcpp
, etc.). This requirement ensures that your package builds are compatible with R itself.
A further complication is the fact that Apple clang
doesn't support OpenMP, so you need to do even more work to compile programs that make use of multithreading. You could circumvent the issue by building R itself and all R packages from sources with LLVM clang
, which does support OpenMP, but that approach is onerous and "for experts only".
There is another approach that has been tested by a few people, including Simon Urbanek, the maintainer of R for macOS. It is experimental and also "for experts only", but it works on my machine and is much simpler than learning to build R yourself.
Instructions for obtaining a working toolchain
Warning: These come with no warranty and could break at any time. Some level of familiarity with C/C++/Fortran program compilation, Makefile syntax, and Unix shells is assumed. Everyone is encouraged to consult official documentation, which is more likely to be maintained than answers on SO. As usual, sudo
at your own risk.
I will try to address compilers and OpenMP support at the same time. I am going to assume that you are starting from nothing. Feel free to skip steps you've already taken, though you might find a fresh start helpful.
I've tested these instructions on a machine running Big Sur, and at least one person has tested them on a machine running Monterey. I would be glad to hear from others.
Download an R 4.2 binary from CRAN here and install. Be sure to select the binary built for Apple silicon.
Run

$ sudo xcode-select --install

in Terminal to install the latest release version of Apple's Command Line Tools for Xcode, which includes Apple clang. You can obtain earlier versions from your browser here. However, the version that you install should not be older than the one that CRAN used to build your R binary.

Download the gfortran binary recommended here and install by unpacking to root:

$ curl -LO https://mac.r-project.org/tools/gfortran-12.0.1-20220312-is-darwin20-arm64.tar.xz
$ sudo tar xvf gfortran-12.0.1-20220312-is-darwin20-arm64.tar.xz -C /
$ sudo ln -sfn $(xcrun --show-sdk-path) /opt/R/arm64/gfortran/SDK

The last command updates a symlink inside of the gfortran installation so that it points to the SDK inside of your Command Line Tools installation.

Download an OpenMP runtime suitable for your Apple clang version here and install by unpacking to root. You can query your Apple clang version with clang --version. For example, I have version 1300.0.29.3, so I did:

$ curl -LO https://mac.r-project.org/openmp/openmp-12.0.1-darwin20-Release.tar.gz
$ sudo tar xvf openmp-12.0.1-darwin20-Release.tar.gz -C /

After unpacking, you should find these files on your system:

/usr/local/lib/libomp.dylib
/usr/local/include/ompt.h
/usr/local/include/omp.h
/usr/local/include/omp-tools.h

Add the following lines to $(HOME)/.R/Makevars, creating the file if necessary.

CPPFLAGS+=-I/usr/local/include -Xclang -fopenmp
LDFLAGS+=-L/usr/local/lib -lomp
FC=/opt/R/arm64/gfortran/bin/gfortran -mtune=native
FLIBS=-L/opt/R/arm64/gfortran/lib/gcc/aarch64-apple-darwin20.6.0/12.0.1 -L/opt/R/arm64/gfortran/lib -lgfortran -lemutls_w -lm

Run R and test that you can compile a program with OpenMP support. For example:

if (!requireNamespace("RcppArmadillo", quietly = TRUE)) {
  install.packages("RcppArmadillo")
}
Rcpp::sourceCpp(code = '
#include <RcppArmadillo.h>
#ifdef _OPENMP
# include <omp.h>
#endif
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
void omp_test()
{
#ifdef _OPENMP
Rprintf("OpenMP threads available: %d\\n", omp_get_max_threads());
#else
Rprintf("OpenMP not supported\\n");
#endif
}
')
omp_test()

OpenMP threads available: 8
If the C++ code fails to compile, or if it compiles without error but you get linker warnings or you find that OpenMP is not supported, then one of us has probably made a mistake. Please report any issues.
References
Everything is a bit scattered:
- R Installation and Administration manual [link]
- R for macOS Developers page [link]
Wrong compiler used in the target offload process in CLion during CUDA compilation
So, I filed a bug against CLion about this issue - the one @MaximBanaev pointed out:
CPP-19089: Can't get CLion to use a non-default C++ compiler with CUDA
CLion has simply not properly taken this issue into account in its toolchain mechanism. The issue is not even just the environment - there needs to be proper UI to set the "backing" C++ compiler for CUDA, in addition to CLion noticing the relevant settings when it comes up.
This isn't the only "fundamental" bug I've encountered trying to use CLion with CUDA projects. The support is still rather half-baked, IMHO. I wish they'd get it right, because I would like to try this IDE out, but for now I can't.