Is returning a 2-tuple less efficient than std::pair?

The short answer is that the libstdc++ standard library implementation used by gcc and clang on Linux implements std::tuple with a non-trivial move constructor (in particular, the _Tuple_impl base class has a non-trivial move constructor), while the copy and move constructors for std::pair are all defaulted.

This in turn causes a C++ ABI-related difference in the calling convention for returning these objects from functions, as well as for passing them by value.
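For concreteness, the functions under discussion presumably looked something like the following sketch; the names f1/f2 and the constants 273 and 546 are inferred from the assembly shown later in this answer, not copied from the original question:

```cpp
#include <tuple>
#include <utility>

// Identical payloads, different return types. On Linux (SysV x86-64 ABI,
// libstdc++), f1's pair comes back packed into rax, while f2's tuple is
// written through a hidden pointer supplied by the caller.
std::pair<int, int> f1() { return {273, 546}; }

std::tuple<int, int> f2() { return std::make_tuple(273, 546); }
```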

The Gory Details

You ran your tests on Linux, which adheres to the SysV x86-64 ABI. This ABI has specific rules for passing classes and structures to functions and returning them, which you can read more about here. The specific question we are interested in is whether the two int fields in these structures get the INTEGER class or the MEMORY class.

A recent version of the ABI specification has this to say:

The classification of aggregate (structures and arrays) and union
types works as follows:

  1. If the size of an object is larger than eight eightbytes, or it contains unaligned fields, it has class MEMORY.
  2. If a C++ object has either a non-trivial copy constructor or a non-trivial destructor, it is passed by invisible reference (the object is replaced in the parameter list by a pointer that has class INTEGER).
  3. If the size of the aggregate exceeds a single eightbyte, each is classified separately. Each eightbyte gets initialized to class NO_CLASS.
  4. Each field of an object is classified recursively so that always two fields are considered. The resulting class is calculated according to the classes of the fields in the eightbyte.

It is condition (2) that applies here. Note that it mentions only copy constructors, not move constructors - but that is probably just a defect in the specification: given the introduction of move constructors, they generally need to be included in any classification algorithm where copy constructors were included before. In particular, the Itanium C++ ABI, which gcc is documented to follow, does include move constructors:

If the parameter type is non-trivial for the purposes of calls, the
caller must allocate space for a temporary and pass that temporary by
reference. Specifically:

  • Space is allocated by the caller in the usual manner for a temporary, typically on the stack.

and then the definition of non-trivial:

A type is considered non-trivial for the purposes of calls if:

  • it has a non-trivial copy constructor, move constructor, or destructor, or
  • all of its copy and move constructors are deleted.

So because tuple is not considered trivially copyable from an ABI perspective, it gets the invisible-reference treatment, which means that your function must populate the stack-allocated object whose address is passed in by the caller in rdi. The std::pair function can just pass back the entire structure in rax, since it fits in one eightbyte and has class INTEGER.

Does it matter? Yes - strictly speaking, a standalone function like the one you compiled will be less efficient for tuple, since this ABI difference is "baked in".

Often, however, the compiler can see the body of the function and inline it, or perform inter-procedural analysis even when it doesn't inline. In both cases the ABI no longer matters, and it is likely that both approaches would be equally efficient, at least with a decent optimizer. For example, let's call your f1() and f2() functions and do some math on the result:

int add_pair() {
    auto p = f1();
    return p.first + p.second;
}

int add_tuple() {
    auto t = f2();
    return std::get<0>(t) + std::get<1>(t);
}

In principle the add_tuple version starts at a disadvantage: it has to call f2(), which is less efficient, and it also has to create a temporary tuple object on the stack so it can pass it to f2 as the hidden parameter. No matter - both functions are fully optimized down to returning the right value directly:

add_pair():
mov eax, 819
ret
add_tuple():
mov eax, 819
ret

So overall you can say that the effect of this ABI issue with tuple is relatively muted: it adds a small fixed overhead to functions that must comply with the ABI, but this will only really matter in a relative sense for very small functions - and such functions are likely to be declared in a place where they can be inlined (or, if not, you are leaving performance on the table).
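A corollary worth noting (my addition, following rule (2) quoted above): a plain aggregate with all-trivial special members avoids the invisible-reference treatment entirely, so a hand-rolled struct behaves like std::pair here:

```cpp
// All special members are trivial, so this is not passed by invisible
// reference: the two ints share one eightbyte with class INTEGER, and the
// whole struct is returned in rax, just like std::pair<int, int>.
struct two_ints {
    int a;
    int b;
};

two_ints f3() { return {273, 546}; }
```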

libstdc++ vs libc++

As explained above, this is an ABI issue, not an optimization issue per se. Both clang and gcc already optimize the library code to the maximum extent possible under the constraints of the ABI - if they generated code like f1() for the std::tuple case, they would break ABI-compliant callers.

You can see this clearly if you switch to libc++ rather than the Linux default of libstdc++: that implementation doesn't have the explicit move constructor (as Marc Glisse mentions in the comments, libstdc++ is stuck with its implementation for backwards-compatibility reasons). Now clang (and presumably gcc, although I didn't try it) generates the same optimal code in both cases:

f1():                                 # @f1()
movabs rax, 2345052143889
ret
f2(): # @f2()
movabs rax, 2345052143889
ret

Earlier Versions of Clang

Why did earlier versions of clang compile it differently? It was simply a bug in clang, or a bug in the spec, depending on how you look at it. The spec didn't explicitly include move construction among the cases where a hidden pointer to a temporary needs to be passed, so clang's old behavior wasn't conforming to the (eventual reading of the) Itanium C++ ABI. For example, code compiled the way clang used to do it was not compatible with gcc or with newer versions of clang. The spec was eventually clarified, and the clang behavior changed in version 5.0.

Update: Marc Glisse mentions in the comments that there was initially confusion about the interaction of non-trivial move constructors and the C++ ABI, and clang changed their behavior at some point, which probably explains the switch:

The ABI specification for some argument passing cases involving move
constructors were unclear, and when they were clarified, clang changed
to follow the ABI. This is probably one of those cases.

Why is std::pair faster than std::tuple?

You are missing some crucial information: which compiler do you use? What do you use to measure the performance of the microbenchmark? Which standard library implementation do you use?

My system:

g++ (GCC) 4.9.1 20140903 (prerelease)
GLIBCXX_3.4.20

Anyhow, I ran your examples, but reserved the proper size of the vectors first to get rid of the memory-allocation overhead. With that, I observe something interesting - the reverse of what you see:

g++ -std=c++11 -O2 pair.cpp -o pair
perf stat -r 10 -d ./pair
Performance counter stats for './pair' (10 runs):

1647.045151 task-clock:HG (msec) # 0.993 CPUs utilized ( +- 1.94% )
346 context-switches:HG # 0.210 K/sec ( +- 40.13% )
7 cpu-migrations:HG # 0.004 K/sec ( +- 22.01% )
182,978 page-faults:HG # 0.111 M/sec ( +- 0.04% )
3,394,685,602 cycles:HG # 2.061 GHz ( +- 2.24% ) [44.38%]
2,478,474,676 stalled-cycles-frontend:HG # 73.01% frontend cycles idle ( +- 1.24% ) [44.55%]
1,550,747,174 stalled-cycles-backend:HG # 45.68% backend cycles idle ( +- 1.60% ) [44.66%]
2,837,484,461 instructions:HG # 0.84 insns per cycle
# 0.87 stalled cycles per insn ( +- 4.86% ) [55.78%]
526,077,681 branches:HG # 319.407 M/sec ( +- 4.52% ) [55.82%]
829,623 branch-misses:HG # 0.16% of all branches ( +- 4.42% ) [55.74%]
594,396,822 L1-dcache-loads:HG # 360.887 M/sec ( +- 4.74% ) [55.59%]
20,842,113 L1-dcache-load-misses:HG # 3.51% of all L1-dcache hits ( +- 0.68% ) [55.46%]
5,474,166 LLC-loads:HG # 3.324 M/sec ( +- 1.81% ) [44.23%]
<not supported> LLC-load-misses:HG

1.658671368 seconds time elapsed ( +- 1.82% )

versus:

g++ -std=c++11 -O2 tuple.cpp -o tuple
perf stat -r 10 -d ./tuple
Performance counter stats for './tuple' (10 runs):

996.090514 task-clock:HG (msec) # 0.996 CPUs utilized ( +- 2.41% )
102 context-switches:HG # 0.102 K/sec ( +- 64.61% )
4 cpu-migrations:HG # 0.004 K/sec ( +- 32.24% )
181,701 page-faults:HG # 0.182 M/sec ( +- 0.06% )
2,052,505,223 cycles:HG # 2.061 GHz ( +- 2.22% ) [44.45%]
1,212,930,513 stalled-cycles-frontend:HG # 59.10% frontend cycles idle ( +- 2.94% ) [44.56%]
621,104,447 stalled-cycles-backend:HG # 30.26% backend cycles idle ( +- 3.48% ) [44.69%]
2,700,410,991 instructions:HG # 1.32 insns per cycle
# 0.45 stalled cycles per insn ( +- 1.66% ) [55.94%]
486,476,408 branches:HG # 488.386 M/sec ( +- 1.70% ) [55.96%]
959,651 branch-misses:HG # 0.20% of all branches ( +- 4.78% ) [55.82%]
547,000,119 L1-dcache-loads:HG # 549.147 M/sec ( +- 2.19% ) [55.67%]
21,540,926 L1-dcache-load-misses:HG # 3.94% of all L1-dcache hits ( +- 2.73% ) [55.43%]
5,751,650 LLC-loads:HG # 5.774 M/sec ( +- 3.60% ) [44.21%]
<not supported> LLC-load-misses:HG

1.000126894 seconds time elapsed ( +- 2.47% )

As you can see, in my case the reason is the much higher number of stalled cycles, both in the frontend and in the backend.

Now where does this come from? I bet it comes down to some failed inlining, similar to what is explained here: std::vector performance regression when enabling C++11

Indeed, enabling -flto equalizes the results for me:

Performance counter stats for './pair' (10 runs):

1021.922944 task-clock:HG (msec) # 0.997 CPUs utilized ( +- 1.15% )
63 context-switches:HG # 0.062 K/sec ( +- 77.23% )
5 cpu-migrations:HG # 0.005 K/sec ( +- 34.21% )
195,396 page-faults:HG # 0.191 M/sec ( +- 0.00% )
2,109,877,147 cycles:HG # 2.065 GHz ( +- 0.92% ) [44.33%]
1,098,031,078 stalled-cycles-frontend:HG # 52.04% frontend cycles idle ( +- 0.93% ) [44.46%]
701,553,535 stalled-cycles-backend:HG # 33.25% backend cycles idle ( +- 1.09% ) [44.68%]
3,288,420,630 instructions:HG # 1.56 insns per cycle
# 0.33 stalled cycles per insn ( +- 0.88% ) [55.89%]
672,941,736 branches:HG # 658.505 M/sec ( +- 0.80% ) [56.00%]
660,278 branch-misses:HG # 0.10% of all branches ( +- 2.05% ) [55.93%]
474,314,267 L1-dcache-loads:HG # 464.139 M/sec ( +- 1.32% ) [55.73%]
19,481,787 L1-dcache-load-misses:HG # 4.11% of all L1-dcache hits ( +- 0.80% ) [55.51%]
5,155,678 LLC-loads:HG # 5.045 M/sec ( +- 1.69% ) [44.21%]
<not supported> LLC-load-misses:HG

1.025083895 seconds time elapsed ( +- 1.03% )

and for tuple:

Performance counter stats for './tuple' (10 runs):

1018.980969 task-clock:HG (msec) # 0.999 CPUs utilized ( +- 0.47% )
8 context-switches:HG # 0.008 K/sec ( +- 29.74% )
3 cpu-migrations:HG # 0.003 K/sec ( +- 42.64% )
195,396 page-faults:HG # 0.192 M/sec ( +- 0.00% )
2,103,574,740 cycles:HG # 2.064 GHz ( +- 0.30% ) [44.28%]
1,088,827,212 stalled-cycles-frontend:HG # 51.76% frontend cycles idle ( +- 0.47% ) [44.56%]
697,438,071 stalled-cycles-backend:HG # 33.15% backend cycles idle ( +- 0.41% ) [44.76%]
3,305,631,646 instructions:HG # 1.57 insns per cycle
# 0.33 stalled cycles per insn ( +- 0.21% ) [55.94%]
675,175,757 branches:HG # 662.599 M/sec ( +- 0.16% ) [56.02%]
656,205 branch-misses:HG # 0.10% of all branches ( +- 0.98% ) [55.93%]
475,532,976 L1-dcache-loads:HG # 466.675 M/sec ( +- 0.13% ) [55.69%]
19,430,992 L1-dcache-load-misses:HG # 4.09% of all L1-dcache hits ( +- 0.20% ) [55.49%]
5,161,624 LLC-loads:HG # 5.065 M/sec ( +- 0.47% ) [44.14%]
<not supported> LLC-load-misses:HG

1.020225388 seconds time elapsed ( +- 0.48% )

So remember: -flto is your friend, and failed inlining can have extreme consequences on heavily templated code. Use perf stat to find out what's happening.

Which is better: returning tuple or passing arguments to function as references?

Look at the disassembly (compiled with GCC -O3). It takes more instructions to implement the tuple callee:

0000000000000000 <returnValues(int, int)>:
0: 83 c2 64 add $0x64,%edx
3: 83 c6 64 add $0x64,%esi
6: 48 89 f8 mov %rdi,%rax
9: 89 17 mov %edx,(%rdi)
b: 89 77 04 mov %esi,0x4(%rdi)
e: c3 retq
f: 90 nop

0000000000000010 <returnValuesVoid(int&, int&)>:
10: 83 07 64 addl $0x64,(%rdi)
13: 83 06 64 addl $0x64,(%rsi)
16: c3 retq
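For reference, source code consistent with this disassembly would be roughly the following (a reconstruction: the +100 is read off the add $0x64 instructions):

```cpp
#include <tuple>

// Tuple version: the result is written through the hidden return
// pointer passed in rdi, as discussed in the first answer above.
std::tuple<int, int> returnValues(int a, int b) {
    return std::make_tuple(a + 100, b + 100);
}

// Reference version: writes straight through the caller's lvalues.
void returnValuesVoid(int &a, int &b) {
    a += 100;
    b += 100;
}
```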

But the tuple caller needs fewer instructions:

0000000000000000 <callTuple()>:
0: 48 83 ec 18 sub $0x18,%rsp
4: ba 14 00 00 00 mov $0x14,%edx
9: be 0a 00 00 00 mov $0xa,%esi
e: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
13: e8 00 00 00 00 callq 18 <callTuple()+0x18> // call returnValues
18: 8b 74 24 0c mov 0xc(%rsp),%esi
1c: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
23: e8 00 00 00 00 callq 28 <callTuple()+0x28> // std::cout::operator<<
28: 8b 74 24 08 mov 0x8(%rsp),%esi
2c: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
33: e8 00 00 00 00 callq 38 <callTuple()+0x38> // std::cout::operator<<
38: 48 83 c4 18 add $0x18,%rsp
3c: c3 retq
3d: 0f 1f 00 nopl (%rax)

0000000000000040 <callRef()>:
40: 48 83 ec 18 sub $0x18,%rsp
44: 48 8d 74 24 0c lea 0xc(%rsp),%rsi
49: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
4e: c7 44 24 08 0a 00 00 movl $0xa,0x8(%rsp)
55: 00
56: c7 44 24 0c 14 00 00 movl $0x14,0xc(%rsp)
5d: 00
5e: e8 00 00 00 00 callq 63 <callRef()+0x23> // call returnValuesVoid
63: 8b 74 24 08 mov 0x8(%rsp),%esi
67: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
6e: e8 00 00 00 00 callq 73 <callRef()+0x33> // std::cout::operator<<
73: 8b 74 24 0c mov 0xc(%rsp),%esi
77: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
7e: e8 00 00 00 00 callq 83 <callRef()+0x43> // std::cout::operator<<
83: 48 83 c4 18 add $0x18,%rsp
87: c3 retq

I don't think there is any considerable performance difference, but the tuple version is clearer and more readable.

I also tried an inlined call: there is absolutely no difference at all. Both generate exactly the same assembly code.

0000000000000000 <callTuple()>:
0: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
7: 48 83 ec 08 sub $0x8,%rsp
b: be 6e 00 00 00 mov $0x6e,%esi
10: e8 00 00 00 00 callq 15 <callTuple()+0x15>
15: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
1c: be 78 00 00 00 mov $0x78,%esi
21: 48 83 c4 08 add $0x8,%rsp
25: e9 00 00 00 00 jmpq 2a <callTuple()+0x2a> // TCO, optimized way to call a function and also return
2a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)

0000000000000030 <callRef()>:
30: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
37: 48 83 ec 08 sub $0x8,%rsp
3b: be 6e 00 00 00 mov $0x6e,%esi
40: e8 00 00 00 00 callq 45 <callRef()+0x15>
45: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
4c: be 78 00 00 00 mov $0x78,%esi
51: 48 83 c4 08 add $0x8,%rsp
55: e9 00 00 00 00 jmpq 5a <callRef()+0x2a> // TCO, optimized way to call a function and also return

Comparing std::tuple (or std::pair) of custom types that have alternative orderings: is it possible to plug in a custom less-than / comparison function?

The easy way would be to manually write compare( tup, tup, f ) that uses f to lexicographically compare the elements in the tuples. But that is boring.
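For the record, that boring manual version might look like this for a pair (a hypothetical helper, shown only for contrast with the generic machinery below):

```cpp
#include <utility>

// Lexicographic comparison of two pairs using a caller-supplied
// less-than f: earlier elements dominate, later elements break ties.
template <class A, class B, class F>
bool compare(std::pair<A, B> const &l, std::pair<A, B> const &r, F f) {
    if (f(l.first, r.first)) return true;   // l < r on the first element
    if (f(r.first, l.first)) return false;  // r < l on the first element
    return f(l.second, r.second);           // tie: decide on the second
}
```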

// This type wraps a reference of type X&&.
// It then overrides < and == with L and E respectively.
template<class X, class L, class E>
struct reorder_ref {
    using ref = reorder_ref;
    X&& x;
    friend bool operator<(ref lhs, ref rhs) {
        return L{}((X&&) lhs.x, (X&&) rhs.x);
    }
    friend bool operator==(ref lhs, ref rhs) {
        return E{}((X&&) lhs.x, (X&&) rhs.x);
    }
    // other comparison ops based off `==` and `<` go here
    friend bool operator!=(ref lhs, ref rhs) { return !(lhs == rhs); }
    friend bool operator>(ref lhs, ref rhs) { return rhs < lhs; }
    friend bool operator<=(ref lhs, ref rhs) { return !(lhs > rhs); }
    friend bool operator>=(ref lhs, ref rhs) { return !(lhs < rhs); }

    reorder_ref(X&& x_) : x((X&&) x_) {}
    reorder_ref(reorder_ref const&) = default;
};

The above is a reference type that changes how we order.

// a type tag, to pass a type to a function:
template<class X> struct tag { using type = X; };

// This type takes stateless less-than and equals functors L and E,
// takes a tuple as input, and builds a tuple of reorder_refs.
// Basically it uses L and E to compare the elements, but otherwise
// uses std::tuple's lexicographic comparison code.
template<class L, class E>
struct reorder_tuple {
    // indexes trick:
    template<class Tuple, class R, size_t... Is>
    R operator()(tag<R>, std::index_sequence<Is...>, Tuple const& in) const {
        // use indexes trick to do conversion
        return R( std::get<Is>(in)... );
    }

    // forward to the indexes trick above:
    template<class... Ts, class R = std::tuple<reorder_ref<Ts const&, L, E>...>>
    R operator()(std::tuple<Ts...> const& in) const {
        return (*this)(tag<R>{}, std::index_sequence_for<Ts...>{}, in);
    }
    // pair overload:
    template<class... Ts, class R = std::pair<reorder_ref<Ts const&, L, E>...>>
    R operator()(std::pair<Ts...> const& in) const {
        return (*this)(tag<R>{}, std::index_sequence_for<Ts...>{}, in);
    }
};

The above stateless function object takes new less and equals operations and maps any tuple to a tuple of reorder_ref<Ts const&, ...>, which changes the ordering to follow L and E respectively.

This next type does what std::less<void> does for std::less<T>, sort of: it takes a type-specific stateless ordering function template and makes it a type-generic stateless ordering function object:

// This takes a type-specific ordering stateless function type, and turns
// it into a generic ordering function type
template<template<class...> class order>
struct generic_order {
    template<class T>
    bool operator()(T const& lhs, T const& rhs) const {
        return order<T>{}(lhs, rhs);
    }
};

So if we have a template<class T> class Z such that Z<T> is an ordering on T, the above gives you a universal ordering on anything.
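For instance (a sketch of mine; generic_order is repeated here so the snippet stands alone), wrapping a per-type ordering Z into one object that compares any type:

```cpp
#include <functional>
#include <string>

// generic_order as defined above, repeated for self-containment.
template <template <class...> class order>
struct generic_order {
    template <class T>
    bool operator()(T const &lhs, T const &rhs) const {
        return order<T>{}(lhs, rhs);
    }
};

// Z<T>: a per-type ordering; here simply std::less<T>.
template <class T>
struct my_per_type_less : std::less<T> {};

// One stateless object that can compare ints, strings, doubles, ...
using universal_less = generic_order<my_per_type_less>;
```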

This next one is a favorite of mine. It takes a type T, and orders it based on a mapping to a type U. This is surprisingly useful:

// Suppose there is a type X for which we have an ordering L,
// and we have a map O from Y->X. This builds an ordering on
// (Y lhs, Y rhs) -> L( O(lhs), O(rhs) ). We "order" our type
// "by" the projection of our type into another type. For
// a concrete example, imagine we have an "id" structure with a name
// and an age field. We can write a function "return s.age;" to
// map our id type into ints (age). If we order by that map,
// then we order the "id" by age.
template<class O, class L = std::less<>>
struct order_by {
    template<class T, class U>
    bool operator()(T&& t, U&& u) const {
        return L{}( O{}((T&&) t), O{}((U&&) u) );
    }
};
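To make the comment's age example concrete (my sketch; the id and get_age names are hypothetical, and order_by is repeated so the snippet compiles on its own):

```cpp
#include <functional>
#include <string>

// order_by as defined above, repeated for self-containment.
template <class O, class L = std::less<>>
struct order_by {
    template <class T, class U>
    bool operator()(T &&t, U &&u) const {
        return L{}(O{}((T &&) t), O{}((U &&) u));
    }
};

struct id {
    std::string name;
    int age;
};

// The projection O: maps an id to its age.
struct get_age {
    int operator()(id const &s) const { return s.age; }
};

// Orders ids by comparing their ages.
using order_by_age = order_by<get_age>;
```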

Now we glue it all together:

// Here is where we build a special order.  Suppose we have a template Z<X> that returns
// a stateless order on type X. This takes that ordering, and builds an ordering on
// tuples based on it, using the above code as glue:
template<template<class...>class Less, template<class...>class Equals=std::equal_to>
using tuple_order = order_by< reorder_tuple< generic_order<Less>, generic_order<Equals> > >;

tuple_order does most of the work for us. All we need to do is provide it with an element-wise ordering template stateless function object; tuple_order will then produce a tuple-ordering functor based on it.

// Here is a concrete use of the above.
// my_less is a sorting function that sorts everything else the usual way,
// but sorts Foos backwards.
// First, a toy type. It wraps an int; by default it sorts in the usual way:
struct Foo {
    int value = 0;
    // usual sort:
    friend bool operator<( Foo lhs, Foo rhs ) {
        return lhs.value < rhs.value;
    }
    friend bool operator==( Foo lhs, Foo rhs ) {
        return lhs.value == rhs.value;
    }
};

template<class T>
struct my_less : std::less<T> {};

// backwards sort:
template<>
struct my_less<Foo> {
    bool operator()(Foo const& lhs, Foo const& rhs) const {
        return rhs.value < lhs.value;
    }
};

using special_order = tuple_order< my_less >;

And Bob's your uncle.

special_order can be passed to a std::map or std::set, and it will order any tuples or pairs encountered with my_less replacing the default ordering of the elements.

Returning std::pair versus passing by non-const reference

I tried that with VC++ 2008, using cl.exe /c /O2 /FAs foo.cpp (that's "compile only and do not link", "optimize for speed", and "dump assembly output with matching source-code lines in comments"). Here's what getPair() ended up being.

"byref" version:

PUBLIC  ?getPair@@YAXHAAH0@Z                ; getPair
; Function compile flags: /Ogtpy
; COMDAT ?getPair@@YAXHAAH0@Z
_TEXT SEGMENT
_inp$ = 8 ; size = 4
_a$ = 12 ; size = 4
_b$ = 16 ; size = 4
?getPair@@YAXHAAH0@Z PROC ; getPair, COMDAT

; 9 : myGetPair(inp, a, b);

mov eax, DWORD PTR _inp$[esp-4]
mov edx, DWORD PTR _a$[esp-4]
lea ecx, DWORD PTR [eax+1023]
mov DWORD PTR [edx], ecx
mov ecx, DWORD PTR _b$[esp-4]
add eax, 31 ; 0000001fH
mov DWORD PTR [ecx], eax

; 10 : }

ret 0
?getPair@@YAXHAAH0@Z ENDP ; getPair

"byval" std::pair-returning version:

PUBLIC  ?getPair@@YAXHAAH0@Z                ; getPair
; Function compile flags: /Ogtpy
; COMDAT ?getPair@@YAXHAAH0@Z
_TEXT SEGMENT
_inp$ = 8 ; size = 4
_a$ = 12 ; size = 4
_b$ = 16 ; size = 4
?getPair@@YAXHAAH0@Z PROC ; getPair, COMDAT

; 8 : std::pair<int,int> result = myGetPair(inp);

mov eax, DWORD PTR _inp$[esp-4]

; 9 :
; 10 : a = result.first;

mov edx, DWORD PTR _a$[esp-4]
lea ecx, DWORD PTR [eax+1023]
mov DWORD PTR [edx], ecx

; 11 : b = result.second;

mov ecx, DWORD PTR _b$[esp-4]
add eax, 31 ; 0000001fH
mov DWORD PTR [ecx], eax

; 12 : }

ret 0
?getPair@@YAXHAAH0@Z ENDP ; getPair

As you can see, the actual assembly is identical; the only difference is in mangled names and comments.

Boost::Tuples vs Structs for return values

tuples

I think I agree with you that the issue of which position corresponds to which variable can introduce confusion. But I think there are two sides: one is the call side and the other is the callee side:

int remainder; 
int quotient;
tie(quotient, remainder) = div(10, 3);

I think it's crystal clear what we get, but it can become confusing if you have to return more values at once. Once the caller's programmer has looked up the documentation of div, he will know which position is which, and can write effective code. As a rule of thumb, I would say not to return more than 4 values at once. For anything beyond that, prefer a struct.
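For reference, a tuple-returning div like the one used above could be sketched as follows (hypothetical; named my_div to avoid clashing with std::div, which returns a struct):

```cpp
#include <tuple>

// Returns (quotient, remainder) so callers can tie() them apart.
std::tuple<int, int> my_div(int numerator, int denominator) {
    return std::make_tuple(numerator / denominator, numerator % denominator);
}
```

Usage then mirrors the example above: `std::tie(quotient, remainder) = my_div(10, 3);`.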

output parameters

Output parameters can be used too, of course:

int remainder; 
int quotient;
div(10, 3, &quotient, &remainder);

Now I think that illustrates how tuples are better than output parameters: we have mixed the input of div with its output, while gaining no advantage. Worse, we leave the reader of that code in doubt about what the actual return value of div might be. There are wonderful cases where output parameters are useful, but in my opinion you should use them only when you have no other way, because the return value is already taken and can't be changed to either a tuple or a struct. operator>> is a good example of where output parameters are appropriate, because the return value is already reserved for the stream, so you can chain operator>> calls. If you're not dealing with operators and the context is not crystal clear, I recommend using pointers, to signal at the call side that the object is actually used as an output parameter, in addition to comments where appropriate.
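A pointer-based signature along those lines might look like this (a sketch; the name div_ptr is mine):

```cpp
// Taking pointers makes the output role visible at the call site:
// div_ptr(10, 3, &q, &r) clearly mutates q and r.
void div_ptr(int numerator, int denominator, int *quotient, int *remainder) {
    *quotient = numerator / denominator;
    *remainder = numerator % denominator;
}
```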

returning a struct

The third option is to use a struct:

div_result d = div(10, 3);

I think that definitely wins the award for clarity. But note that you still have to access the result within that struct; the result is not "laid bare" on the table, as it was for the output parameters and for the tuple used with tie.

I think a major point these days is to make everything as generic as possible. So, say you have a function that can print out tuples. You can just do:

cout << div(10, 3);

And have your result displayed. I think that tuples clearly win here for their versatile nature. To do that with div_result, you need to overload operator<<, or output each member separately.
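Such an overload is short, but must be written by hand for every struct (a sketch; the field names quotient/remainder are assumptions):

```cpp
#include <ostream>
#include <sstream>  // only needed to exercise the operator below

struct div_result {
    int quotient;
    int remainder;
};

// Hand-written printer; a generic tuple printer would cover
// every tuple instantiation at once.
std::ostream &operator<<(std::ostream &os, div_result const &d) {
    return os << d.quotient << ' ' << d.remainder;
}
```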

Is there a way to easily handle functions returning std::pairs?

This looks like enough of a common case to prompt a helper function:

template <class T, std::size_t... Idx>
auto deref_impl(T &&tuple, std::index_sequence<Idx...>) {
    return std::tuple<decltype(*std::get<Idx>(std::forward<T>(tuple)))...>(
        *std::get<Idx>(std::forward<T>(tuple))...);
}

template <class T>
auto deref(T &&tuple)
    -> decltype(deref_impl(std::forward<T>(tuple),
                           std::make_index_sequence<std::tuple_size<std::remove_reference_t<T>>::value>{})) {
    return deref_impl(std::forward<T>(tuple),
                      std::make_index_sequence<std::tuple_size<std::remove_reference_t<T>>::value>{});
}

// ...

int lhsMin;
int lhsMax;
std::tie(lhsMin,lhsMax) = deref(std::minmax_element(lhs.begin(), lhs.end()));

index_sequence is C++14, but a full implementation can be made in C++11.
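For example, a minimal C++11 replacement can be built recursively (my sketch, prefixed my_ to avoid colliding with the std versions):

```cpp
#include <cstddef>
#include <type_traits>  // only needed by the sanity checks below

// C++11 stand-ins for std::index_sequence / std::make_index_sequence.
template <std::size_t... Is>
struct my_index_sequence {
    static constexpr std::size_t size() { return sizeof...(Is); }
};

// Recursive builder: peels N down to 0, prepending N-1 at each step.
template <std::size_t N, std::size_t... Is>
struct my_make_index_sequence_impl
    : my_make_index_sequence_impl<N - 1, N - 1, Is...> {};

template <std::size_t... Is>
struct my_make_index_sequence_impl<0, Is...> {
    using type = my_index_sequence<Is...>;
};

template <std::size_t N>
using my_make_index_sequence = typename my_make_index_sequence_impl<N>::type;
```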

Note: I'd keep the repeated decltype in deref's return type even in C++14, so that SFINAE can apply.
