How Do Stackless Coroutines Differ from Stackful Coroutines

How do stackless coroutines differ from stackful coroutines?

First, thank you for taking a look at CO2 :)

The Boost.Coroutine doc describes the advantages of stackful coroutines well:

stackfulness

In contrast to a stackless coroutine, a stackful coroutine can be suspended from within a nested stackframe. Execution resumes at exactly the same point in the code where it was suspended before. With a stackless coroutine, only the top-level routine may be suspended. Any routine called by that top-level routine may not itself suspend. This prohibits providing suspend/resume operations in routines within a general-purpose library.

first-class continuation

A first-class continuation can be passed as
an argument, returned by a function and stored in a data structure to
be used later. In some implementations (for instance C# yield) the
continuation can not be directly accessed or directly manipulated.

Without stackfulness and first-class semantics, some useful execution
control flows cannot be supported (for instance cooperative
multitasking or checkpointing).

What does that mean for you? For example, imagine you have a function that takes a visitor:

template<class Visitor>
void f(Visitor& v);

You want to transform it into an iterator. With a stackful coroutine, you can:

asymmetric_coroutine<T>::pull_type pull_from([](asymmetric_coroutine<T>::push_type& yield)
{
    f(yield);
});

But with a stackless coroutine, there's no way to do so:

generator<T> pull_from()
{
    // yield can only be used here; it cannot be passed to f
    f(???);
}
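The same wall exists with Python generators (my own sketch, not part of the original answer): there is no way to hand `yield` into a visitor-driving function, so you must buffer eagerly, losing the laziness a stackful coroutine would give you. Here `f` is a stand-in for the visitor-taking library function above:

```python
def f(visit):
    # stand-in library function that drives a visitor;
    # it knows nothing about generators
    for x in (1, 2, 3):
        visit(x)

def pull_from():
    # a generator cannot pass its `yield` into f,
    # so it has to collect all values eagerly first
    items = []
    f(items.append)
    yield from items

print(list(pull_from()))  # [1, 2, 3]
```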

In general, a stackful coroutine is more powerful than a stackless one.
So why would we want stackless coroutines? The short answer: efficiency.

A stackful coroutine typically needs to allocate a certain amount of memory up front to accommodate its runtime stack (which must be large enough), and the context switch is more expensive than for a stackless one; e.g. Boost.Coroutine takes 40 cycles while CO2 takes just 7 cycles on average on my machine, because the only thing a stackless coroutine needs to restore is the program counter.

That said, with language support, a stackful coroutine could probably also take advantage of a compiler-computed maximum stack size, as long as there's no recursion in the coroutine, so memory usage could be improved as well.

Speaking of stackless coroutines, bear in mind that "stackless" doesn't mean there's no runtime stack at all; it means the coroutine uses the same runtime stack as the host side. So you can still call recursive functions; it's just that all the recursion happens on the host's runtime stack. In contrast, with a stackful coroutine, recursive calls recurse on the coroutine's own stack.
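To illustrate with Python (a sketch of mine, not from the original answer): a generator is stackless, yet it can freely call a recursive helper; the recursion simply runs and unwinds on the host's call stack before the generator suspends again.

```python
def depth(node):
    # an ordinary recursive function; its frames live on the host's call stack
    if not node:
        return 0
    return 1 + max(depth(child) for child in node)

def depths(trees):
    # a stackless coroutine (generator): only this single frame is preserved
    # across suspensions; depth()'s recursion happens on the normal stack
    for tree in trees:
        yield depth(tree)

print(list(depths([[], [[]], [[[]], []]])))  # [0, 1, 2]
```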

To answer the questions:

  • Are there any limitations on the use of automatic storage variables
    (i.e. variables "on the stack")?

No; that restriction is a limitation of CO2's emulation. With language support, the automatic storage variables visible to the coroutine will be placed in the coroutine's internal storage. Note my emphasis on "visible to the coroutine": if the coroutine calls a function that uses automatic storage variables internally, those variables will be placed on the runtime stack. More specifically, a stackless coroutine only has to preserve the variables/temporaries that can be used after it is resumed.

To be clear, you can use automatic storage variables in CO2's coroutine body as well:

auto f() CO2_RET(co2::task<>, ())
{
    int a = 1; // not ok
    CO2_AWAIT(co2::suspend_always{});
    {
        int b = 2; // ok
        doSomething(b);
    }
    CO2_AWAIT(co2::suspend_always{});
    int c = 3; // ok
    doSomething(c);
} CO2_END

As long as the variable's lifetime does not cross any await.

  • Are there any limitations on what functions I can call from a
    stackless coroutine?

No.

  • If there is no saving of stack context for a stackless coroutine,
    where do automatic storage variables go when the coroutine is
    running?

As answered above, a stackless coroutine doesn't care about the automatic storage variables used in the functions it calls; they'll just be placed on the normal runtime stack.

If you have any doubt, just check CO2's source code, it may help you understand the mechanics under the hood ;)

What's the difference between stackless and stackful coroutines in Kotlin (Android)?

It's not important to know the distinction to be able to use coroutines in Kotlin.

Kotlin coroutines are stackless: You can only call suspend functions from inside other suspend functions.

To boil it down to a very simplistic definition, a stackful coroutine model lets you suspend or yield from anywhere. A stackless model is restricted to suspending only from inside coroutines, but the benefit is that it is lighter weight. It's also probably more feasible to retrofit into an existing language. I don't know the details of how Kotlin coroutines are implemented under the hood, but I would bet it would have been near impossible to implement a stackful model and still compile to Java bytecode.

Are Python coroutines stackless or stackful?

TLDR: Going by the C++ definition, Python's coroutines too are stackless. The defining feature is that the coroutine's persistent state is stored separately from the stack.

Coroutines are stackless: they suspend execution by returning to the caller and the data that is required to resume execution is stored separately from the stack. […]

On a less technical level, Python requires await, async for, etc. to allow child coroutines to suspend their parent(s). It is not possible for a regularly called function to suspend its parent.

async def foo():
    a = await some_awaitable()  # this may suspend `foo` as well
    b = some_function()  # this cannot suspend `foo`
    return a + b
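A runnable version of that sketch, with stand-in definitions of `some_awaitable` and `some_function` (my assumptions; the original snippet leaves them undefined):

```python
import asyncio

async def some_awaitable():
    await asyncio.sleep(0)  # yields control to the event loop: may suspend foo
    return 40

def some_function():
    return 2  # a plain function: it has no way to suspend foo

async def foo():
    a = await some_awaitable()  # this may suspend `foo` as well
    b = some_function()         # this cannot suspend `foo`
    return a + b

print(asyncio.run(foo()))  # 42
```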

In contrast to a stackless coroutine a stackful coroutine can be suspended from within a nested stackframe. (Boost: Coroutines)



Stack and Coroutines

The interaction of stack and coroutine is not as clearcut as for regular routines. A routine is either executing or done. This naturally maps to function execution adding a single execution frame to the stack;1 once that frame completes the routine completes and the frame is popped from the stack and discarded.

In contrast, a coroutine can be executing, suspended, or done. This raises the question how to handle the state during suspension – practically, what to do with local variables. There are two prominent means:

  1. We can treat each partial execution like a regular routine. To get coroutine semantics, when such a routine finishes we save its state. This allows us to switch between individual coroutines.

  2. We can treat the entire execution like a regular routine. To get coroutine semantics, when such a routine suspends we save the entire stack. This allows us to switch between individual stacks.

Solution 1. is usually called stackless – the coroutine lives separately from the stack. This gives a lot of power since each step is first-class, but it exposes a lot of implementation details. We must explicitly emulate the stacking of nested coroutine execution – usually, this is what await is there for.
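Solution 1 can be sketched with Python generators (an illustrative sketch of mine, not from the original text): each partial execution runs like a regular routine, its state is saved in the coroutine object when it yields, and a driver can switch between the individual coroutines:

```python
def worker(name, n):
    # each resumption runs like a regular routine;
    # the state is saved in the generator object on yield
    for i in range(n):
        yield f"{name}:{i}"

def round_robin(*coroutines):
    # switch between individual coroutines, as Solution 1 describes
    pending = list(coroutines)
    while pending:
        coro = pending.pop(0)
        try:
            item = next(coro)
        except StopIteration:
            continue  # this coroutine is done; drop it
        yield item
        pending.append(coro)

print(list(round_robin(worker("a", 2), worker("b", 2))))
# ['a:0', 'b:0', 'a:1', 'b:1']
```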

Solution 2. is usually called stackful – the coroutine lives as part of the stack itself. This removes a lot of power since execution happens internally, but hides a lot of implementation details. The distinction of routines versus coroutines is hidden or even inconsequential – any routine may suspend at any point.

These definitions are not universal. For example, you may find 1. being called stackful since the coroutine itself keeps its own suspended stack. When in doubt, compare the semantics instead of the naming.

However, this appears to be the most prevalent naming scheme (see references).



Where does Python fit in?

From a technical standpoint, Python's coroutines are stackless. Coroutine functions create first-class coroutine objects that allow partial execution and store state internally during suspension (as cr_frame). Coroutines are nested using explicit await and routines cannot await anything.
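This can be observed directly (a small experiment of mine; `Suspend` is a hand-rolled awaitable, not a standard name): the coroutine's locals survive suspension inside `cr_frame`, not on the call stack of whoever drives it.

```python
class Suspend:
    def __await__(self):
        yield  # suspend: control returns to whoever drives the coroutine

async def counter():
    total = 0
    for i in range(3):
        total += i
        await Suspend()
    return total

coro = counter()
coro.send(None)  # run up to the first suspension
# the suspended state lives in the coroutine object's frame:
print(coro.cr_frame.f_locals)  # includes 'total' and 'i'
```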

Notably, Python itself does not support stackful coroutines: Regular routines cannot suspend execution.2


1The semantics of a frame stack do not necessarily mean that the implementation has a 1:1 relation of function execution to frames on a memory stack. Rather, consider that "the stack" as an abstract description of a runtime can be defined by function execution. Implementations may differ without having observable difference for high-level semantics of a language. Consider CPython which has a C stack running a Python stack.

2The absence of stackful coroutines is not strictly guaranteed by the language semantics. There are third-party extensions that add stackful coroutines, namely greenlet. Fittingly, this was introduced by Stackless Python – the "Stackless" refers to the interpreter's use of the C stack, not to how coroutines use the stack.

However, this is a separate suspension mechanism from Python's own await and yield suspension.

References:

  • Boost: Stackful Coroutines
  • Boost: Stackless Coroutines
  • We Never Needed Stackful Coroutines
  • Stackless vs. Stackful Coroutines

Are stackless C++20 coroutines a problem?

Foreword: When this post says just "coroutines", I am referring to the concept of a coroutine, not the specific C++20 feature. When talking about that feature, I will refer to it as "co_await" or "co_await coroutines".

On dynamic allocation

Cppreference sometimes uses looser terminology than the standard. co_await as a feature "requires" dynamic allocation; whether this allocation comes from the heap or from a static block of memory or elsewhere is a matter for the provider of the allocation. Such allocations can be elided in certain circumstances, but since the standard does not spell them out, you still have to assume that any co_await coroutine may dynamically allocate memory.

co_await coroutines do have mechanisms for users to provide allocation for the coroutine's state. So you can substitute the heap/free store allocation for any particular pool of memory you prefer.

co_await as a feature is well-designed to remove verbosity from the point of use for any co_await-able objects and functionality. The co_await machinery is incredibly complicated and intricate, with lots of interactions between objects of several types. But at the suspend/resume point, it always looks like co_await <some expression>. Adding allocator support to your awaitable objects and promises requires some verbosity, but that verbosity lives outside of the place where those things get used.

Using alloca for a coroutine would be... highly inappropriate for most uses of co_await. While the discussion around this feature tries to hide it, the fact of the matter is that co_await as a feature is designed for asynchronous use. That's its intended purpose: to halt the execution of a function and schedule that function's resumption on potentially another thread, then shepherding any eventually generated value to some receiving code which may be somewhat distant from the code which invoked the coroutine.

alloca is not appropriate for that particular use case, since the caller of the coroutine is allowed/encouraged to go do whatever so that the value can be generated by some other thread. The space allocated by alloca would therefore no longer exist, and that is kind of bad for the coroutine that lives in it.

Also note that allocation performance in such a scenario will generally be dwarfed by other considerations: thread scheduling, mutexes, and other things will often be needed to properly schedule the coroutine's resumption, not to mention the time it takes to get the value from whatever asynchronous process is providing it. So the fact that a dynamic allocation is needed is not really a substantial consideration in this case.

Now, there are circumstances where in-situ allocation would be appropriate. Generator use cases are for when you want to essentially pause a function and return a value, then pick up where the function left off and potentially return a new value. In these scenarios, the stack for the function which invokes the coroutine will certainly still be around.

co_await supports such scenarios (through co_yield), but it does so in a less-than-optimal way, at least in terms of the standard. Because the feature is designed for up-and-out suspension, turning it into a suspend-down coroutine results in a dynamic allocation that doesn't need to be dynamic.

This is why the standard does not require dynamic allocation; if a compiler is smart enough to detect a generator pattern of usage, then it can remove the dynamic allocation and just allocate the space on the local stack. But again, this is what a compiler can do, not must do.

In this case, alloca-based allocation would be appropriate.

How it got into the standard

The short version is that it got into the standard because the people behind it put in the work, and the people behind the alternatives did not.

Any coroutine idea is complicated, and there will always be questions about implementability with regard to them. For example, the "resumeable functions" proposals looked great, and I would have loved to see it in the standard. But nobody actually implemented it in a compiler. So nobody could prove that it was actually a thing you could do. Oh sure, it sounds implementable, but that doesn't mean it is implementable.

Remember what happened the last time "sounds implementable" was used as the basis for adopting a feature.

You don't want to standardize something if you don't know it can be implemented. And you don't want to standardize something if you don't know whether it actually solves the intended problem.

Gor Nishanov and his team at Microsoft put in the work to implement co_await. They did this for years, refining their implementation and the like. Other people used their implementation in actual production code and seemed quite satisfied with its functionality. Clang even implemented it. As much as I personally don't like it, it is undeniable that co_await is a mature feature.

By contrast, the "core coroutines" alternatives that were brought up a year ago as competing ideas with co_await failed to gain traction in part because they were difficult to implement. That's why co_await was adopted: because it was a proven, mature, and sound tool that people wanted and had the demonstrated ability to improve their code.

co_await is not for everyone. Personally, I will likely not use it much, as fibers work much better for my use cases. But it is very good for its specific use case: up-and-out suspension.

What does "with a stackless coroutine, only the top-level routine may be suspended" mean?

In each invocation of co_await, only the top level coroutine is suspended. To suspend a lower level, that level must suspend itself explicitly. And at that point, it is now the current "top level". So in every case, only the current top level gets suspended.

Compare this to a purely hypothetical stackful coroutine library:

//This function will always print the same thread ID.
void secondLevel(int id)
{
    while(!toBackgroundThread.poll())
        suspend_coroutine();

    cout << id << " run on " << this_thread::get_id() << endl;

    while(!toBackgroundThread.poll())
        suspend_coroutine();

    cout << id << " run on " << this_thread::get_id() << endl;
}

void topLevel()
{
    secondLevel(1);
    secondLevel(2);
}

void listen(AsyncQueue& queue)
{
    while(true)
    {
        if(!queue.poll())
            this_thread::sleep_for(100ms);
    }
}

int main()
{
    thread([]() {
        listen(toBackgroundThread);
    }).detach();

    auto coro = create_coroutine(topLevel);
    coro.switch_to();

    toMainThread.ready(); //Notes that the main thread is waiting
    while(true)
    {
        if(!toMainThread.poll())
            coro.switch_to();
    }
}

topLevel doesn't have any explicit suspension machinery. Yet its execution suspends whenever any function it calls suspends execution. The entire call-stack, defined by the function given to create_coroutine and everything it calls, suspends. That's how a stackful coroutine works.

That is what is being contrasted with when it comes to stackless coroutines. In the stackless version, every function that needs to suspend must be specifically coded to do so, and thus it isn't really "general purpose" anymore; it's now special-cased to suspending scenarios.
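In Python terms (my own sketch, not from the original answer): every frame that needs to suspend must itself be written as a generator and must delegate explicitly, which is exactly the "no longer general purpose" point:

```python
def second_level():
    # must itself be a generator in order to suspend
    yield "suspended in second_level"
    yield "resumed in second_level"

def top_level():
    # the delegation must be explicit at every level; a plain
    # call to second_level() could not suspend top_level
    yield from second_level()

g = top_level()
print(next(g))  # 'suspended in second_level'
print(next(g))  # 'resumed in second_level'
```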

Is seastar::thread a stackful coroutine?

Yes, seastar::thread and "stackful coroutines" are indeed very similar concepts.

Note that Seastar also supports stackless coroutines, using the new C++20 coroutines feature. These are now almost always preferable to stackful coroutines (seastar::thread): stackless coroutines are lighter, are useful even in heavily concurrent operations (where a stack per waiter is a big waste), and have nicer support in the C++ language. Stackless coroutines also cooperate better with future-based code: when you write a function assuming it is running in a seastar::thread (i.e., using get() on unready futures), you can only call that function from inside a seastar::thread. In contrast, if you write a stackless coroutine with the new C++ syntax, you can call it from any future-based or coroutine-based code.

Different ways of implementing a coroutine

Your option (1) is indeed a stackless coroutine, and this is how it's implemented in Kotlin, and usually JavaScript (async/await), for example. This is how you do it when you don't necessarily have the low-level control of the call stack that other methods require. Languages that use this strategy require suspendable functions to be marked in some way. That's called the "red/blue function problem". See: https://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/

Languages that do have low-level control of the call stack can implement coroutines in various ways, but none of them look like your option (2). They generally involve copying data into and out of the call stack when required (like Java project Loom) or using a completely different data structure in place of the call stack (like early Go implementations). In these languages, coroutine-callable functions don't usually need special markings.


