Coroutines: A Million Stacks

2017-10-05

Can you have 1 million entities and let them all think with stackful coroutines and achieve acceptable frame rates?

An unreasonable question, but is it possible?

Obviously not and it’s easy to see why with a back-of-the-envelope calculation:

1 million entities * 1 MiB stack per coroutine = 1 TiB of memory

My machine has 64 GiB of RAM, a respectable amount, but only 6.4% of the memory requirements just for these coroutine stacks. This is just way too much.

But…

The vast majority of coroutines yield with a very small amount of their stacks in use. While active their stacks could grow quite large and they must not overflow hence the 1 MiB stack size. Our coroutines only require a stack that can grow large while in use.

This sounds a lot like a growable stack. Unfortunately growable stacks require relocatable stacks and relocatable stacks are impossible in C++ without draconian restrictions. You can take the address of any stack variable store it in heap memory, stack memory, what have you. You would have to be able to fixup all of those pointers to point to the new stack. This essentially requires the equivalent of a garbage collector. Worse you can put things on the stack that are not relocatable like a mutex.

int main() {
    std::mutex m;

    std::thread t{ [&m]() {
        std::lock_guard<std::mutex> _( m );
        std::cout << "This is fine" << std::endl;
    }};
    t.join();
    return 0;
}

Closely related to growable stack is segmented stack.1 This essentially turns the stack into a linked lists of small stacks called stacklets.2 These have some cool properties, like pointers to stack data remain valid after growth, and your initial stack size can be very small. But there are some issues. Take alloca for example which lets you allocate dynamically from the stack. This is occasionally quite useful and doesn’t interact well with stacklets. They are also prone to stack thrashing which has led them to fall from favor in Go or Rust.34 There is also the rather large issue of segmented stacks not being supported on Windows.

This seemed like another dead-end but a forum post by Walter Bright gave me new perspective on the problem.

Also, [segmented stacks] do not save RAM, they save address space. RAM is not committed until a stack memory page is actually used.5

This jogged my memory to recall an aptly titled Raymond Chen post: It’s the address space, stupid.6

Of course! The requirement isn’t 1 TiB of RAM it’s 1 TiB of address space.

This is actually not a showstopper. The user mode address space on 64 bit windows increased to 128 TiB in Windows 8.1.7 We could reserve the full 1 TiB and still have plenty of room for the rest of the program.

However we still can’t actually use all of that memory. The OS allocates memory in pages. Pages are typically 4k in size. You can verify this for your system by running this small program.

int main() {
    SYSTEM_INFO si;
    GetSystemInfo( &si );

    std::cout << "Page Size: " << si.dwPageSize << std::endl;
    return 0;
}
Page Size: 4096

There’s a set of virtual memory APIs in Windows that allow us to reserve a virtual address range without requiring it to be backed physical memory. Later we can commit this virtual memory and the OS will allocate physical pages to associate with the reserved virtual address range. It turns out that this is exactly how the stack for your main context of your program works too. We can write a bit of code to dump the stack info from a simple C++ console app.

### Stack Dump Start
Range: 0x39a12d0000 - 0x39a13ca000 Protect: 0x000 State: RESERVE Pages: 250
Range: 0x39a13ca000 - 0x39a13cd000 Protect: 0x104 State: COMMIT  Pages: 3
Range: 0x39a13cd000 - 0x39a13d0000 Protect: 0x004 State: COMMIT  Pages: 3
### Stack Dump Finish

What does this mean? At the bottom I have 3 page committed, this is the part of the stack my program used up to this point. Next there are another 3 committed pages. I’ll come back to these in a bit. Then there are 250 remaining pages that are reserved. Sum these up and our stack has address space for 256 pages, or 1 megabyte of stack space.

You’ve probably never written any code to commit virtual address ranges for your program’s stack. So how do reserved stack pages get committed? This is what those middle 3 committed pages are for. Those protect flags are PAGE_GUARD 0x100 and PAGE_READWRITE 0x4. When your program tries to read or write to memory in those guarded pages it will raise a page fault violation which will get caught by the OS. The OS will remove the guard from that block, and commit another guard page at the top of the reserve block. The result will look like:

### Stack Dump Start
Range: 0x39a12d0000 - 0x39a13c9000 Protect: 0x000 State: RESERVE Pages: 249
Range: 0x39a13c9000 - 0x39a13cc000 Protect: 0x104 State: COMMIT  Pages: 3
Range: 0x39a13cc000 - 0x39a13d0000 Protect: 0x004 State: COMMIT  Pages: 4
### Stack Dump Finish

The cool part is this is all handled automatically, even for library-based stackful coroutines, as long as we set up our stack correctly. So the OS will commit the stack memory for us, how will it decommit it?

Short answer is it won’t, at least not until our thread terminates. This is a sensible default behavior. Programs usually either terminate quickly in which case they’ll release their stack or they run in a loop in which case they’ll likely use roughly the same amount of stack every iteration. Even if there is a long running program that has a spike in stack usage that it will never achieve again it’s going to only grow to 1 MiB at most (for a default stack size) which is nothing for todays computers. Mapping physical pages can be expensive, as we’ll see later, so it makes sense to avoid doing it repeatedly by default.

Now that we know about VirtualAlloc and friends we can verify that bit about address space limits ourselves. This system is running Windows 10 Pro x64.

int main() {
    int64_t one_mb = (1 << 20);
    int64_t one_tb = one_mb * (1 << 20);
    for ( int64_t n = 1;; ++n ) {
        auto p = VirtualAlloc( 0, n * one_tb, MEM_RESERVE, PAGE_READWRITE );
        if ( p == nullptr ) {
            std::cout << "Reserved up to "<< n << " TiB." << std::endl;
            return 0;
        }
        VirtualFree( p, 0, MEM_RELEASE );
    }
}
Reserved up to 127 TiB.

Wait wasn’t the limit 128 TiB? Yes but we already used some of that for things like our stack, iostream, and I believe reserving just over 34 billion pages requires some memory usage, about 256 MiB according to task manager. Note: this program seems to return values ranging from 125 to 127 depending on the run. I’m not sure why; perhaps due to ASLR?

What if we really want to decommit our stack memory? Turns out the stack is just memory our process owns and we use VirtualFree directly on it to shrink our stack. There is a great article on doing just this this.8 This most important bit for our purposes is the StackShrink function. Unfortunately the code doesn’t compile on VS2017, it is over 10 years old after all. We can no longer use inline assembly to read the stack pointer directly but we can get at it through a bit of hackery.9

__declspec(noinline) PBYTE GetStackPointer() {
    return (PBYTE)_AddressOfReturnAddress() + 8;
}

Now that we know how to make a stack grow and shrink in-place, let’s turn our attention back to coroutines. I’m going to use Boost.Coroutine2 but this should be applicable to any coroutine library. We want a fixed-sized stack.

int main() {
    using think_co = boost::coroutines2::coroutine< void >;
    using stack_t = boost::coroutines2::fixedsize_stack;
    stack_t stack{ 1 * 1024 * 1024 };
    think_co::push_type think{ stack, 
        [&]( think_co::pull_type& c ) {
            DbgDumpStack();
        }};
        
    think();
    return 0;
}
### Stack Dump Start
Range: 0x250ed530000 - 0x250ed630000 Protect: 0x000004 State: COMMIT  Pages: 256
### Stack Dump Finish

Hmm, that doesn’t look like our main context stack. A quick modification to try and create our 1 million coroutines and on my system the program throws bad_alloc after creating just 67,279. What exactly is fixedsize_stack doing?10

stack_context allocate() {
    ...
    void * vp = ::VirtualAlloc( 0, size__, MEM_COMMIT, PAGE_READWRITE);
    if ( ! vp) throw std::bad_alloc();
    ...
}

Aha exactly as our dump shows it’s committing the whole stack up front. Luckily it’s pretty easy to modify to suit our needs.

stack_context allocate() {
    ...
    void * vp = ::VirtualAlloc( 0, size__, MEM_RESERVE, PAGE_READWRITE );
    if ( !vp ) goto error;

    // needs at least 2 pages to fully construct the coroutine and switch to it
    const auto init_commit_size = one_page_size + one_page_size;
    auto pPtr = static_cast<PBYTE>(vp) + size__;
    pPtr -= init_commit_size;
    if ( !VirtualAlloc( pPtr, init_commit_size, MEM_COMMIT, PAGE_READWRITE ) ) 
        goto cleanup;

    // create guard page so the OS can catch page faults and grow our stack
    pPtr -= one_page_size;
    if ( !VirtualAlloc( pPtr, one_page_size, MEM_COMMIT, PAGE_READWRITE|PAGE_GUARD ) )
        goto cleanup;
    ...
    cleanup:
        ::VirtualFree( vp, 0, MEM_RELEASE );
    error:
        throw std::bad_alloc();
}

I’ll call this new allocator reserved_fixedsize_stack. And now our stack dump looks much nicer.

### Stack Dump Start
Range: 0x237fe1f0000 - 0x237fe2ed000 Protect: 00000000 State: RESERVE Pages: 253
Range: 0x237fe2ed000 - 0x237fe2ee000 Protect: 0x000104 State: COMMIT  Pages: 1
Range: 0x237fe2ee000 - 0x237fe2f0000 Protect: 0x000004 State: COMMIT  Pages: 2
### Stack Dump Finish

And we have no trouble creating 1 million coroutines. But what happens when we actually go to use them?

void StackConsume( DWORD dwSizeExtra ) {
    PBYTE pPtr = GetStackPointer();
    const DWORD page_size = (DWORD)boost::context::stack_traits::page_size();
    for ( ; dwSizeExtra >= page_size; dwSizeExtra -= page_size ) {
        // Move our pointer to the next page on the stack.
        pPtr -= page_size;
        // read from this pointer. If the page isn't allocated yet - it will be.
        volatile BYTE nVal = *pPtr;
    }
}

int main() {
    using think_co = boost::coroutines2::coroutine< void >;
    using stack_t = reserved_fixedsize_stack;
    int count = 1'000'000;
    std::vector<think_co::push_type> thinks;
    stack_t stack{ 1 * 1024 * 1024 };
    thinks.reserve( count );
    for ( int i = 0; i < count; ++i ) {
        thinks.emplace_back( stack,
        [&]( think_co::pull_type& c ) {
            StackConsume( 900 * 1024 );
            StackShrink();
        } );
    }

    for ( auto & think : thinks ) {
        think();
    }
    return 0;
}

On my system this runs for a while, crashes google chrome, and throws a Stack overflow exception after task manager memory usage climbs over 24 GiB. Of course it does; we’re back to forcibly committing all of the stack pages again just later through the OS page fault mechanism instead of directly through VirtualAlloc.

We can release the unused portion of our stack before we return from our coroutine by calling our StackShrink function from earlier. This runs and eventually finishes, but not quickly. Let’s measure just how long it takes.

class timer_t {
private:
    std::chrono::time_point<std::chrono::high_resolution_clock> start_time;
public:
    timer_t() {
        start_time = std::chrono::high_resolution_clock::now();
    }

    double stop() {
        auto stop_time = std::chrono::high_resolution_clock::now();
        return std::chrono::duration<double>( stop_time - start_time ).count();
    }
};

int main() {
    using think_co = boost::coroutines2::coroutine< void >;
    using stack_t = reserved_fixedsize_stack;
    int count = 1'000'000;
    std::vector<think_co::push_type> thinks;
    stack_t stack{ 1 * 1024 * 1024 };
    thinks.reserve( count );

    for ( int i = 0; i < count; ++i ) {
        thinks.emplace_back( stack,
        [&]( think_co::pull_type& c ) {
            StackConsume( 900 * 1024 );
            StackShrink();
        } );
    }

    timer_t timer;
    for ( auto & think : thinks ) {
        think();
    }
    double elapsed = timer.stop();
    std::cout << "Thought for " << elapsed << " seconds." << std::endl;
    return 0;
}
Thought for 293.61 seconds.

Remember when I said we’d see mapping physical pages can be expensive? Nearly 5 minutes for one iteration! What makes this so expensive? Is it all the page faults? We’re touching 900k of memory per think. That’s up to 225 page faults per think or 225 million page faults that result in a mem commit and a page guard commit each. Let’s try to reduce that, commit the whole stack when we enter the think and shrink it again when before we leave.

void StackCommit() {
    PBYTE sp = GetStackPointer();

    const auto page_size = boost::context::stack_traits::page_size();

    // Get the stack last page.
    MEMORY_BASIC_INFORMATION stMemBasicInfo;
    BOOST_VERIFY( VirtualQuery( sp, &stMemBasicInfo, sizeof( stMemBasicInfo ) ) );
    PBYTE pCur = (PBYTE)stMemBasicInfo.BaseAddress;

    // Get the base of the stack
    // Commit everything except the last page
    PBYTE pCommit = (PBYTE)stMemBasicInfo.AllocationBase + page_size;
    if ( pCommit < pCur ) {
        BOOST_VERIFY(VirtualAlloc(pCommit, pCur-pCommit, MEM_COMMIT, PAGE_READWRITE));
    }
}

int main() {
    using think_co = boost::coroutines2::coroutine< void >;
    using stack_t = reserved_fixedsize_stack;
    int count = 1'000'000;
    std::vector<think_co::push_type> thinks;
    stack_t stack{ 1 * 1024 * 1024 };
    thinks.reserve( count );

    for ( int i = 0; i < count; ++i ) {
        thinks.emplace_back( stack,
        [&]( think_co::pull_type& c ) {
            StackCommit();
            StackConsume( 900 * 1024 );
            StackShrink();
        } );
    }

    timer_t timer;
    for ( auto & think : thinks ) {
        think();
    }
    double elapsed = timer.stop();
    std::cout << "Thought for " << elapsed << " seconds." << std::endl;
    return 0;
}
Thought for 153.696 seconds.

This ran in half the time, an improvement for sure, but still woefully inadequate. Time for a trace.

xperf trace

We can see we’re spending a lot of time in ntoskrnl.exe!MiResolveDemandZeroFault. This is a type of soft page fault.

A process references an allocated virtual page for the first time (sometimes called a demand-zero fault).11

This is because we are still returning our pages to the OS in StackShrink. Every page the OS gives us must be filled with zeros when we read it for the first time our process can see it. Otherwise we could potentially read sensitive information in another process’s released memory.12

We don’t need freshly zeroed pages from the OS for every coroutine. It would be perfectly acceptable to reassign pages released during ShrinkStack to the next coroutine stack’s virtual address range. This way memory would never be fully released by our process and the OS wouldn’t have to protect us from reading that data. But is this even possible?

We can get close with Address Windowing Extensions (AWE)13. Specifically we want to look at AllocateUserPhysicalPages and MapUserPhysicalPages. This one is quite complicated, it involves changing the default local security policy user rights assignments to allow us to lock pages in memory. We then have to make a pool of physical pages that we can hand out to our coroutine when it executes. There’s too much code to post the changes inline here, you’ll have to follow the link if you want to see the implementation.14

Thought for 8.10465 seconds.

Still not great, but at least it’s not 5 minutes. Let’s grab another trace.

xperf trace

All the time is being spent in map/unmap and those in turn are spending all their time in KernelBase.dll!MapUserPhysicalPages. All of the page faults are gone from StackConsume.

How does this match up to our baseline? Oh I haven’t discussed the baseline yet? Well you see the natural baseline for an entity think here would be just a normal function call instead of a coroutine. To keep it somewhat fair we’ll ahve a std::vector of a million function pointers and call each of them. The think function will touch 900k of the stack just like th coroutine think, only this time it will always be the main context stack.15

Thought for 0.279729 seconds.

Welp… maybe it’s not quite fair. Weighing in at 279ms, our baseline is even too expensive for a frame. Of course without the volatile reads in StackConsume the baseline drops to 1ms meanwhile the AWE version still clocks in at over 7 seconds with the same treatment.

At last I reached my wit’s end I admit defeat. It really does appear to be impossible, for now.

Quick Extras:

  • Linux provides mremap which might do the trick, but sadly no analogue on Windows.
  • We can exploit shared memory16, but with 64 KiB granularity it’s a nonstarter.