Generate Tail Call Opcode

The C# compiler does not give you any guarantees about tail-call optimization, because C# programs usually use loops and so do not rely on it. In C#, tail-call elimination is simply a JIT optimization that may or may not happen, and you cannot rely on it.

The F# compiler, by contrast, is designed to handle functional code that uses recursion, so it does give you certain guarantees about tail calls. This is done in two ways:

  • if you write a recursive function that calls itself (like your fib), the compiler turns it into a function that uses a loop in the body (this is a simple optimization, and the produced code is faster than using a tail call; a minimal sketch follows this list)

  • if you use a recursive call in a more complex position (such as in continuation passing style, where a function is passed as an argument), the compiler generates a tail. instruction that tells the JIT that it must use a tail call.
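
As a minimal sketch of the first case (the function is illustrative, assuming Release mode): this self-recursive call is in tail position, and fsc compiles it into a loop in the method body, with no call instruction at all.

// Compiled by fsc to a branch back to the top of the method, not a call.
let rec countdown n =
    if n = 0 then 0
    else countdown (n - 1)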

As an example of the second case, compile the following simple F# function (F# does not emit tail calls in Debug mode, to simplify debugging, so you may need Release mode or the --tailcalls+ flag):

let foo a cont = cont (a + 1)

The function simply calls the function cont with the first argument incremented by one. In continuation passing style, you have a long sequence of such calls, so the optimization is crucial (you simply cannot use this style without some handling of tail calls). The generated IL code looks like this:

IL_0000: ldarg.1
IL_0001: ldarg.0
IL_0002: ldc.i4.1
IL_0003: add
IL_0004: tail. // Here is the 'tail' opcode!
IL_0006: callvirt instance !1 class [FSharp.Core]Microsoft.FSharp.Core.FSharpFunc`2<int32, !!a>::Invoke(!0)
IL_000b: ret
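
To see how such calls chain in practice, here is a hedged CPS sketch (sumCps and its parameter names are illustrative, not from the question): the self-call is compiled to a loop, while the calls to the continuation compile to tail.-prefixed calls through FSharpFunc.Invoke, so the stack stays flat even for long lists.

// Sums a list by threading the result through continuations.
// Every call below is in tail position.
let rec sumCps list cont =
    match list with
    | [] -> cont 0
    | x :: rest -> sumCps rest (fun acc -> cont (acc + x))

// Builds 100,000 nested continuations; relies on the tail calls being
// honoured to unwind them in constant stack space.
let total = sumCps [1 .. 100000] id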

Why does tail call optimization need an op code?

Following the links you already provided, this is the part that, it seems to me, answers your question most closely:


Source

The CLR and tail calls

When you're dealing with languages managed by the CLR, there are two kinds of compilers in play. There's the compiler that goes from your language's source code down to IL (C# developers know this as csc.exe), and then there's the compiler that goes from IL to native code (the JIT 32/64 bit compilers that are invoked at run time or NGEN time). Both the source->IL and IL->native compilers understand the tail call optimization. But the IL->native compiler--which I'll just refer to as JIT--has the final say on whether the tail call optimization will ultimately be used. The source->IL compiler can help to generate IL that is conducive to making tail calls, including the use of the "tail." IL prefix (more on that later). In this way, the source->IL compiler can structure the IL it generates to persuade the JIT into making a tail call. But the JIT always has the option to do whatever it wants.

When does the JIT make tail calls?

I asked Fei Chen and Grant Richins, neighbors down the hall from me who happen to work on the JIT, under what conditions the various JITs will employ the tail call optimization. The full answer is rather detailed. The quick summary is that the JITs try to use the tail call optimization whenever they can, but there are lots of reasons why the tail call optimization can't be used. Some reasons why tail calling is a non-option:

  • Caller doesn't return immediately after the call (duh :-))
  • Stack arguments between caller and callee are incompatible in a way that would require shifting things around in the caller's frame before the callee could execute
  • Caller and callee return different types
  • We inline the call instead (inlining is way better than tail calling, and opens the door to many more optimizations)
  • Security gets in the way
  • The debugger / profiler turned off JIT optimizations

The most interesting part in the context of your question, and the one that makes it clearest in my opinion, is the security example mentioned above.

Security in .NET in many cases depends on the stack being accurate at runtime, which is why, as highlighted above, the burden is shared between the source-to-CIL compiler and the runtime CIL-to-native JIT compiler, with the final say resting with the latter.

tail. prefix in ILAsm – any example of use?

You are not going to find it in any code produced by the current MS C# compiler. You will find it in code produced by the F# compiler, but not as much as you might expect, for almost opposite reasons.

Now, to first correct one mistake in your statement:

ECMA-335, III.2.4 specifies tail. prefix that can be used in recursive functions.

This is not strictly true. The tail. prefix can be used on tail calls; not all recursive functions are tail-recursive, and not all tail calls are part of a recursion.

A tail call is any call to a function (including an OOP method) where the last operation in that code-path is to make that call and then return the value it returns, or just return if the function called doesn't return a value. Hence in:

int DoSomeCalls(int x)
{
    if (A(x))
        return B(x);
    if (DoSomeCalls(x * 2) > 3)
    {
        int ret = C(x);
        return ret;
    }
    return D(DoSomeCalls(x - 1));
}

Here, the calls to B and D are tail calls, because the only thing done after each call is to return the value it returned. The call to C isn't a tail call, but it can easily be converted to one by removing the redundant assignment to ret and returning directly. The call to A is not a tail call, and nor are the calls to DoSomeCalls, even though they are recursive.

Now, the normal function call mechanism is implementation-dependent, but generally involves saving enregistered values that might be needed after the call onto the stack, putting parameters onto the stack and/or into registers along with the current instruction position (to return to), moving the instruction pointer, and then reading the return value from a register or the stack when the instruction pointer is moved back to just after where the call was made. With a tail call it's possible to skip a lot of this, because the called-into function can use the current stack frame and then return straight to the earlier caller.

The tail. prefix requests that this be done with a call.

While this isn't necessarily related to recursion, you were correct to talk about recursion, because the benefits of eliminating tail calls are greater in recursive cases than otherwise: calls that are O(n) in stack space when actually using the function-call mechanism become O(1) in stack space, and the constant factor of the per-call time cost shrinks too (still O(n) in that regard, but O(n) time means it takes n×k seconds, and we get a smaller k). In many cases this can be the difference between a call that works and a call that throws a StackOverflowException.
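
As a hedged F# illustration of that difference (the names are illustrative): the first version below keeps a pending multiplication alive across each recursive call, so it needs O(n) stack; the second carries an accumulator, making every recursive call a tail call, so the stack stays O(1) once the calls are eliminated.

// O(n) stack: the multiplication is a pending operation that runs after
// the recursive call returns.
let rec factorial n =
    if n <= 1L then 1L else n * factorial (n - 1L)

// O(1) stack: the recursive call is the last operation, so its frame can
// be reused (fsc in fact compiles this self-recursion into a loop).
let factorialTail n =
    let rec loop acc m =
        if m <= 1L then acc else loop (acc * m) (m - 1L)
    loop 1L n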

Now, ECMA-335 states a few cases in which tail. may not always be honoured. In particular, §III.2.4 contains this text:

There can also be implementation-specific restrictions that prevent the tail. prefix from being obeyed in certain cases.

At its loosest, we could interpret this as preventing it in all manner of cases.

Conversely, the jitter is allowed to apply all manner of optimisations, including performing tail call elimination even when it wasn't requested by tail.

Because of this, there are in fact four ways to do tail-call elimination in IL:

  1. Use the tail. prefix just before the call, and have it honoured (not guaranteed).
  2. Don't use the tail. prefix before the call, but have the jitter decide to apply it any way (even less guaranteed).
  3. Use the jmp IL instruction which is effectively a special case of tail call elimination (never used by C# because it produces unverifiable code for a normally relatively small gain, though it can be the easiest approach sometimes when hand-coding due to its relative simplicity).
  4. Re-write the whole method to use a different approach; in particular the sort of recursive code that most benefits from tail call elimination can be re-written to explicitly use the sort of iterative algorithm the tail-call elimination effectively turns the recursion into.* (In other words, the tail-call elimination happens before the jitting or even the compilation).

(There's also, sort of, the case where the call is inlined: inlining doesn't require a new stack frame, normally gives a stronger improvement overall, and in turn often allows further optimisations to be performed, but it isn't generally considered a case of tail-call elimination, because it is a call elimination that doesn't depend on the call being a tail call.)

Now, the first implementations of the jitter tended not to do tail call elimination in a lot of cases, even if it was requested.

Meanwhile, on the C# side of things, the decision was made not to emit the tail. prefix. The general approach with C# is not to heavily optimise the code produced. Some optimisations are done (dead-code removal in particular), but for the most part, since the optimisation effort would just duplicate what the jitter does (or even get in its way), the downsides of optimising (more complication means more possible bugs, and the IL would be more confusing to many developers) outweigh the upsides. Use of tail. is a classic example of this: sometimes insisting on tail calls actually costs more than it saves in .NET, so if the jitter is already trying to work out when a tail call is a good idea, the C# compiler would often just be making things worse, and making no difference the rest of the time.

It's also worth noting that with the styles of coding most common with a C-style language like C#:

  1. Developers tend not to write code that would particularly benefit from tail-call elimination compared to the styles more common in other languages.
  2. Developers tend to know how to optimise the sort of recursive calls that would most benefit from tail-call elimination by re-writing them to be iterative.
  3. Developers tend to have written them in the iterative manner in the first place.

Now, along came F#.

With the sort of functional and declarative programming F# encourages, there are a lot of cases where what is most naturally done in an iterative way in C# is most naturally done with a recursive approach. Where people hacking in C-style languages learn to turn recursive cases into iterative code, people hacking in F#-style languages learn to turn iterative cases into recursive code, and non-tail-calling recursive code into tail-calling recursive code.

So F# used tail. a lot.

And it got StackOverflowException a lot, because the jitter wasn't honouring it.

This is one of the things that led the jitter people to increase the number of cases where they eliminated tail calls, both in general and even further if tail. is used.

Meanwhile, the F# people couldn't just depend on tail., so F#'s compiler optimises much more heavily than C#'s: just as we can manually rewrite recursive calls to be iterative (as in the footnote below), so the F# compiler does the equivalent when producing IL.

And for this reason, a lot of the time when you write an F# method where you'd expect to see some IL that uses tail., what you'd actually get is IL that does the equivalent thing iteratively.

However, F# will still use tail. when a method calls another method in a mutually-recursive manner like:

let rec even n =
    if n = 0 then
        true
    else
        odd (n - 1)
and odd n =
    if n = 0 then
        false
    else
        even (n - 1)

Which I totally stole from this answer because I've only played a tiny bit with F# so I'd rather depend upon someone more familiar than I am.

In this case, because the tail calls aren't within a single function, they can't simply be rewritten away at the point of IL compilation, so the compiler has to hope the jitter will do the elimination, and it uses tail. to increase the chances that it will.


*An example of turning a recursive call into an iterative one would be to start with a recursive call like:

void ClearAllNodes(Node node)
{
    if (node != null)
    {
        node.Value = null;
        ClearAllNodes(node.Next);
    }
}

The simplest change is to then manually add what a tail-call elimination does, by ourselves setting up the parameter, and jumping back to the start of the method:

void ClearAllNodes(Node node)
{
start:
    if (node != null)
    {
        node.Value = null;
        node = node.Next;
        goto start;
    }
}

Since there are good reasons to avoid goto if we can, we would generally change it to something that does the same through more strictly-defined looping mechanisms:

void ClearAllNodes(Node node)
{
    while (node != null)
    {
        node.Value = null;
        node = node.Next;
    }
}

What is tail call optimization?

Tail-call optimization is where you are able to avoid allocating a new stack frame for a function because the calling function will simply return the value that it gets from the called function. The most common use is tail-recursion, where a recursive function written to take advantage of tail-call optimization can use constant stack space.

Scheme is one of the few programming languages that guarantee in the spec that any implementation must provide this optimization, so here are two examples of the factorial function in Scheme:

(define (fact x)
  (if (= x 0) 1
      (* x (fact (- x 1)))))

(define (fact x)
  (define (fact-tail x accum)
    (if (= x 0) accum
        (fact-tail (- x 1) (* x accum))))
  (fact-tail x 1))

The first function is not tail recursive because when the recursive call is made, the function needs to keep track of the multiplication it needs to do with the result after the call returns. As such, the stack looks as follows:

(fact 3)
(* 3 (fact 2))
(* 3 (* 2 (fact 1)))
(* 3 (* 2 (* 1 (fact 0))))
(* 3 (* 2 (* 1 1)))
(* 3 (* 2 1))
(* 3 2)
6

In contrast, the stack trace for the tail recursive factorial looks as follows:

(fact 3)
(fact-tail 3 1)
(fact-tail 2 3)
(fact-tail 1 6)
(fact-tail 0 6)
6

As you can see, we only need to keep track of the same amount of data for every call to fact-tail because we are simply returning the value we get right through to the top. This means that even if I were to call (fact 1000000), I need only the same amount of space as (fact 3). This is not the case with the non-tail-recursive fact, and as such large values may cause a stack overflow.

Why doesn't .NET/C# optimize for tail-call recursion?

JIT compilation is a tricky balancing act between not spending too much time doing the compilation phase (thus slowing down short lived applications considerably) vs. not doing enough analysis to keep the application competitive in the long term with a standard ahead-of-time compilation.

Interestingly, the NGen compilation step is not targeted at being more aggressive in its optimizations. I suspect this is because they simply don't want bugs where the behaviour depends on whether the JIT or NGen was responsible for the machine code.

The CLR itself does support tail call optimization, but the language-specific compiler must know how to generate the relevant opcode and the JIT must be willing to respect it.
F#'s fsc will generate the relevant opcodes (though for a simple recursion it may just convert the whole thing into a while loop directly). C#'s csc does not.

See this blog post for some details (quite possibly now out of date given recent JIT changes). Note that with the CLR changes in 4.0, the x86, x64 and ia64 JITs will respect it.

Why does the F# compiler not create a tail call for this function?

In fact, you have already answered your own question. Quoting the comment from the source code,

// Throw away the unit result

is the pending operation after the call, preventing the compiler from using a tail call here.

There's a great blog post by Keith Battocchi, "Tail calls in F#" (scroll to the section "Limitations / Calling function values returning unit"), which covers a lot of the details.

In short:

Normally, F# functions … -> unit are compiled to .NET methods returning void.

However, functions treated as values (e.g., those passed as arguments to higher-order functions) are stored in objects of type ('a->'b) so they actually return Microsoft.FSharp.Core.Unit, not void.

The compiler needs to pop the dummy unit value off of the stack before returning.

Therefore, there is a pending operation after the recursive call, and therefore the compiler can't optimize it to a tail call.

Good news:

Note that this issue only arises when using first-class functions as values. Calling normal .NET methods which return void doesn’t present this problem, because there is no return value to pop off of the stack.
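
As a hedged sketch of that distinction (the function names are illustrative): the first definition below calls through a function value, whose Invoke returns a unit object that must be popped before the method can return void, blocking the tail call; the second calls a void .NET method, which leaves nothing on the stack to pop.

// Calls through FSharpFunc<_,_>.Invoke, which returns a unit value;
// popping it is a pending operation, so no tail call can be emitted here.
let callFunctionValue (f: unit -> unit) : unit =
    f ()

// Calls a void method: there is no return value to pop, so nothing
// stands in the way of a tail call.
let callVoidMethod () : unit =
    System.Console.WriteLine "done"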

C# compilation with tail recursive optimization?

There are different levels at which tail call optimization can be supported. The JIT is really responsible for most of the optimizations in any .NET program; the C# compiler itself doesn't even do method inlining, which is the JIT compiler's responsibility. The C# compiler could use the tail. IL prefix to designate a call as a tail call, but I believe no version of the C# compiler does this. The JIT compiler is permitted to make tail call optimizations whenever it sees fit. In particular, I believe only the 64-bit JIT does this. This blog post outlines a number of scenarios in which JIT64 cannot use tail call optimization. The criteria may well change, since a rewrite of the JIT compiler, codenamed RyuJIT, is in the works.

If you want a short example of a program that can use TCO try this:

using System;

class Program
{
    static void Main(string[] args)
    {
        Test(1);
    }

    private static void Test(int i)
    {
        Console.WriteLine(i);
        Test(i + 1);
    }
}

Set the project to build Release/x64 (or AnyCPU without "Prefer 32-bit") and start without the debugger attached. The program will run forever. If I don't do all of those things, I get a StackOverflowException at around i = 20947.

F# has tail call elimination?

The problem with Proper Tail Calls in Scala is one of engineering trade-offs. It would be easy to add PTCs to Scala: just add a sentence to the SLS. Voilà: PTCs in Scala. From a language-design perspective, we are done.

Now the poor compiler writers need to implement that spec. Well, compiling into a language with PTCs is easy … but unfortunately, the JVM byte code isn't such a language. Okay, so what about GOTO? Nope. Continuations? Nope. Exceptions (which are known to be equivalent to Continuations)? Ah, now we are getting somewhere! So, we could use exceptions to implement PTCs. Or, alternatively, we could just not use the JVM call stack at all and implement our own stack.

After all, there are multiple Scheme implementations on the JVM, all of them support PTCs just fine. It's a myth that you cannot have PTCs on the JVM, just because the JVM doesn't support them. After all, x86 doesn't have them either, but nonetheless, there are languages running on x86 that have them.

So, if implementing PTCs on the JVM is possible, then why doesn't Scala have them? Like I said above, you could use exceptions or your own stack to implement them. But using exceptions for control flow or implementing your own stack means that everything which expects the JVM call stack to look a certain way would no longer work.

In particular, you would lose pretty much all interoperability with the Java tooling ecosystem (debuggers, visualizers, static analyzers). You would also have to build bridges to interoperate with Java libraries, which would be slow, so you lose interop with the Java library ecosystem as well.

But that interoperability is a major design goal of Scala! And that's why Scala doesn't have PTCs.

I call this "Hickey's Theorem", after Rich Hickey, the designer of Clojure who once said in a talk "Tail Calls, Interop, Performance – Pick Two."

You would also present the JIT compiler with some very unusual byte code patterns that it may not know how to optimize well.

If you were to port F# to the JVM, you would basically have to make exactly that choice: do you give up Tail Calls (you can't, because they are required by the Language Spec), do you give up Interop or do you give up Performance? On .NET, you can have all three, because Tail Calls in F# can simply be compiled into Tail Calls in MSIL. (Although the actual translation is more complex than that, and the implementation of Tail Calls in MSIL is buggy in some corner cases.)

This poses the question: why not add Tail Calls to the JVM? Well, this is very hard, due to a design flaw in the JVM byte code. The designers wanted the JVM byte code to have certain safety properties. But instead of designing the JVM byte code language in such a way that you cannot write an unsafe program in the first place (like, say, in Java, for example, where you cannot write a program that violates pointer safety, because the language just doesn't give you access to pointers in the first place), JVM byte code in itself is unsafe and needs a separate byte code verifier to make it safe.

That byte code verifier is based on stack inspection, and Tail Calls change the stack. So, the two are very hard to reconcile, but the JVM simply doesn't work without the byte code verifier. It took a long time and some very smart people to finally figure out how to implement Tail Calls on the JVM without losing the byte code verifier (see A Tail-Recursive Machine with Stack Inspection by Clements and Felleisen and tail calls in the VM by John Rose (JVM lead designer)), so we have now moved from the stage where it was an open research problem to the stage where it is "just" an open engineering problem.

Note that Scala and some other languages do have intra-method direct tail-recursion. However, that is pretty boring, implementation-wise: it is just a while loop. Most targets have while loops or something equivalent; e.g. the JVM has intra-method GOTO. Scala also has the scala.util.control.TailCalls object, which is kind of a reified trampoline. (See Stackless Scala With Free Monads by Rúnar Óli Bjarnason for a more general version of this idea, which can eliminate all use of the stack, not just in tail calls.) This can be used to implement a tail-calling algorithm in Scala, but it is then not compatible with the JVM stack, i.e. it doesn't look like a recursive method call to other languages or to a debugger:

import scala.util.control.TailCalls._

def isEven(xs: List[Int]): TailRec[Boolean] =
  if (xs.isEmpty) done(true) else tailcall(isOdd(xs.tail))

def isOdd(xs: List[Int]): TailRec[Boolean] =
  if (xs.isEmpty) done(false) else tailcall(isEven(xs.tail))

isEven((1 to 100000).toList).result

def fib(n: Int): TailRec[Int] =
  if (n < 2) done(n)
  else for {
    x <- tailcall(fib(n - 1))
    y <- tailcall(fib(n - 2))
  } yield x + y

fib(40).result

Clojure has the recur special form for explicit intra-function tail recursion, and a trampoline function in its core library for the mutually-recursive case.
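
To make the trampoline idea concrete, here is a hedged F# sketch of the same technique (the type and function names are illustrative): each step returns either a finished value or a thunk for the next step, and a driver loop runs the thunks, so the mutual recursion never deepens the stack.

// Each step either finishes with a value or hands back the next step.
type Trampoline<'a> =
    | Done of 'a
    | More of (unit -> Trampoline<'a>)

// The driver is self-tail-recursive, so fsc compiles it to a loop.
let rec run step =
    match step with
    | Done x -> x
    | More k -> run (k ())

let rec isEven n =
    if n = 0 then Done true else More (fun () -> isOdd (n - 1))
and isOdd n =
    if n = 0 then Done false else More (fun () -> isEven (n - 1))

// Constant stack: every step returns to 'run' before the next one starts.
let isMillionEven = run (isEven 1000000)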

What is tail recursion?

Consider a simple function that adds the first N natural numbers. (e.g. sum(5) = 0 + 1 + 2 + 3 + 4 + 5 = 15).

Here is a simple JavaScript implementation that uses recursion:

function recsum(x) {
    if (x === 0) {
        return 0;
    } else {
        return x + recsum(x - 1);
    }
}

If you called recsum(5), this is what the JavaScript interpreter would evaluate:

recsum(5)
5 + recsum(4)
5 + (4 + recsum(3))
5 + (4 + (3 + recsum(2)))
5 + (4 + (3 + (2 + recsum(1))))
5 + (4 + (3 + (2 + (1 + recsum(0)))))
5 + (4 + (3 + (2 + (1 + 0))))
5 + (4 + (3 + (2 + 1)))
5 + (4 + (3 + 3))
5 + (4 + 6)
5 + 10
15

Note how every recursive call has to complete before the JavaScript interpreter begins to actually do the work of calculating the sum.

Here's a tail-recursive version of the same function:

function tailrecsum(x, running_total = 0) {
    if (x === 0) {
        return running_total;
    } else {
        return tailrecsum(x - 1, running_total + x);
    }
}

Here's the sequence of events that would occur if you called tailrecsum(5) (which is effectively tailrecsum(5, 0), because of the default second argument):

tailrecsum(5, 0)
tailrecsum(4, 5)
tailrecsum(3, 9)
tailrecsum(2, 12)
tailrecsum(1, 14)
tailrecsum(0, 15)
15

In the tail-recursive case, with each evaluation of the recursive call, the running_total is updated.

Note: The original answer used examples from Python. These have been changed to JavaScript, since Python interpreters don't support tail call optimization. However, while tail call optimization is part of the ECMAScript 2015 spec, most JavaScript interpreters don't support it.


