Mpirun: Unrecognized Argument Mca

mpirun: Unrecognized argument mca


[mpiexec@tamnun] match_arg (./utils/args/args.c:194): unrecognized argument mca
[mpiexec@tamnun] HYDU_parse_array (./utils/args/args.c:214): argument matching returned error
[mpiexec@tamnun] parse_args (./ui/mpich/utils.c:2964): error parsing input array
[mpiexec@tamnun] HYD_uii_mpx_get_parameters (./ui/mpich/utils.c:3238): unable to parse user arguments

This error output comes from MPICH. MPICH is not Open MPI, and its process launcher does not recognize the --mca parameter, which is specific to Open MPI (MCA stands for Modular Component Architecture, the basic framework Open MPI is built upon). This is a typical case of mixing up multiple MPI implementations.
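
A quick way to confirm which implementation a given launcher belongs to is to ask it for its version string, for example:

which mpirun mpiexec
mpirun --version     # Open MPI prints something like "mpirun (Open MPI) 1.10.x"
mpiexec --version    # MPICH's Hydra launcher prints "HYDRA build details:" followed by its configuration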

Job fails while using srun or mpirun in Slurm

The root cause is a mix of several MPI implementations that do not interoperate:

  • mpirun is from Open MPI
  • mpiexec is likely the built-in MPICH shipped with ParaView
  • your application is built with Intel MPI

Try using /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/bin/mpirun (or /nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/bin64/mpirun) instead, so that the launcher matches your MPI library.
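
For example (./your_app below is just a placeholder for your Intel MPI-built executable), either call the Intel launcher by its full path or put its bin directory first on your PATH:

export PATH=/nfs/apps/Compilers/Intel/ParallelStudio/2016.3.067/impi/5.1.3.210/bin64:$PATH
which mpirun              # should now resolve to the Intel MPI launcher
mpirun -np 4 ./your_app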

If you want to use srun with Intel MPI, an extra step is required: you first need to point Intel MPI at Slurm's PMI library:

export I_MPI_PMI_LIBRARY=/path/to/slurm/pmi/library/libpmi.so
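
A minimal Slurm batch script might then look like the sketch below; the PMI library path, the resource requests and the executable name are placeholders you need to adjust for your site:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
# point Intel MPI at Slurm's PMI library (site-specific path)
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
srun ./your_app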

openmpi ignored error: mca interface is not recognized

I've figured out the problem thanks to Gilles Gouaillardet's help on the OpenMPI forums.

Problem:

I installed the newer version 2.0.1 without uninstalling 1.10. Since I installed it in the same location, some MCA plugin files were overwritten, while others had been removed or renamed in the newer version, so stale copies from 1.10 were still present in the directory. These leftover module files were not recognized by version 2.0.1, resulting in the warnings above.

Solution:

  1. Remove all the plugin files: rm -rf /usr/local/lib/openmpi
  2. Reinstall Open MPI: make install (see the sketch below)
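
Assuming the default /usr/local prefix used above and that you run this from the Open MPI 2.0.1 build directory, the sequence is roughly:

# wipe the stale plugin directory left over from 1.10
sudo rm -rf /usr/local/lib/openmpi
# reinstall the 2.0.1 libraries and plugins
sudo make install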

How to enable CUDA-aware Open MPI?

This was an issue in the 20.7 release when adding UCX support. You can lower the optimization level to -O1 to work around the problem, or update your NV HPC compiler to version 20.9, where we've resolved the issue.

https://developer.nvidia.com/nvidia-hpc-sdk-version-209-downloads
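
Once you have a CUDA-aware build, you can verify it with ompi_info:

ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
# a CUDA-aware build reports: mca:mpi:base:param:mpi_built_with_cuda_support:value:true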

How can I increase OpenFabrics memory limit for Torque jobs?

Your mlx4_core parameters allow for the registration of only 2^20 * 2^4 * 4 KiB = 64 GiB. With 192 GiB of physical memory per node, and given that it is recommended to have at least twice as much registerable memory, you should set log_num_mtt to 23, which would increase the limit to 512 GiB, the closest power of two greater than or equal to twice the amount of RAM. Be sure to reboot the node(s), or unload and then reload the kernel module, for the change to take effect.
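
On most Linux distributions the module parameters are set via a modprobe configuration file; a sketch (the file name is arbitrary, and log_mtts_per_seg stays at its current value of 4):

echo "options mlx4_core log_num_mtt=23 log_mtts_per_seg=4" | sudo tee /etc/modprobe.d/mlx4_core.conf
# reboot the node (or unload and reload the mlx4 modules) for the new value to take effect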

You should also submit a simple Torque job script that executes ulimit -l in order to verify the limit on locked memory and make sure it is set to unlimited. Note that ulimit -c unlimited does not remove the limit on the amount of locked memory; it only removes the limit on the size of core dump files.
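
A minimal Torque job for that check could look like this (resource requests are placeholders; adjust them for your site):

#!/bin/bash
#PBS -N ulimit-check
#PBS -l nodes=1:ppn=1
# prints the locked-memory limit seen inside the job; it should say "unlimited"
ulimit -l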

Spawn issue with mpi4py in the Anaconda Python distribution

I ran into the same problem, and one solution was to compile mpi4py against Open MPI instead of MPICH (see the 'Compute Pi' example in the mpi4py documentation).
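
To check which MPI library your mpi4py build is actually linked against, you can query it from Python, for example:

python -c "from mpi4py import MPI; print(MPI.Get_library_version())"
# Open MPI reports something like "Open MPI v1.10.2, ...", MPICH something like "MPICH Version: 3.2"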

See this unresolved issue.

Tested on:
Ubuntu 16.04
Anaconda 4.0.0
python 3.5.0
mpich 3.2.0
openmpi 1.10.2
mpi4py 2.0.0


