A Stack Overflow post comparing C++ stackful and stackless coroutines

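OP|hgoldfish|2023-05-29 23:22:04
Stackful coroutines are the coroutine schemes previously implemented in Boost, such as boost.context and boost.fiber. The basic principle is: save the registers, execute a jmp instruction, restore the registers.

Stackless coroutines are the co_await / co_yield syntax introduced in C++20. The compiler transforms the coroutine's code into a separate generated C++ type, much like a lambda becomes a compiler-generated class, and then calls that type's methods.

The post lays out a comparison of the two kinds of coroutine.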

stackless coroutines

    stackless coroutines (C++20) do code transformation (state machine)
    stackless in this case means, that the application stack is not used to store local variables (for instance variables in your algorithm)
    otherwise the local variables of the stackless coroutine would be overwritten by invocations of ordinary functions after suspending the stackless coroutine
    stackless coroutines do need memory to store local variables too, especially if the coroutine gets suspended the local variables need to be preserved
    for this purpose stackless coroutines allocate and use a so-called activation record (equivalent to a stack frame)
    suspending from a deep call stack is only possible if all functions in between are stackless coroutines too (viral; otherwise you would get a corrupted stack)
    some clang developers are sceptical that the Heap Allocation eLision Optimization (HALO) can always be applied

stackful coroutines

    in its essence a stackful coroutine simply switches stack and instruction pointer
    allocate a side-stack that works like an ordinary stack (storing local variables, advancing the stack pointer for called functions)
    the side-stack needs to be allocated only once (can also be pooled) and all subsequent function calls are fast (because only advancing the stack pointer)
    each stackless coroutine requires its own activation record -> when called in a deep call chain, a lot of activation records have to be created/allocated
    stackful coroutines allow to suspend from a deep call chain while the functions in between can be ordinary functions (not viral)
    a stackful coroutine can outlive its caller/creator
    one version of the skynet benchmarks spawns 1 million stackful coroutines and shows that stackful coroutines are very efficient (outperforming version using threads)
    a version of the skynet benchmark using stackless coroutines was not implemented yet
    boost.context represents the thread's primary stack as a stackful coroutine/fiber - even on ARM
    boost.context supports on demand growing stacks (GCC split stacks)
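The "switch stack and instruction pointer" idea can be sketched without Boost using the POSIX `ucontext` calls (`getcontext`, `makecontext`, `swapcontext`). This is only an illustration on Linux/glibc, not how boost.context is implemented; the names `run`, `body`, `helper` and `trace` are made up. The point is that `helper()` is an ordinary function, yet the coroutine can suspend from inside it.

```cpp
#include <ucontext.h>
#include <cassert>
#include <vector>

namespace stackful_demo {

ucontext_t main_ctx, co_ctx;   // saved registers + stack pointer for each side
std::vector<int> trace;        // records the order in which code runs

void helper() {                // an ordinary function, not a coroutine
    trace.push_back(2);
    swapcontext(&co_ctx, &main_ctx);   // suspend from inside a nested call
    trace.push_back(4);                // continues here after being resumed
}

void body() {                  // entry point running on the side-stack
    trace.push_back(1);
    helper();
    trace.push_back(5);
}

std::vector<int> run() {
    static char stack[64 * 1024];      // the side-stack, allocated once
    trace.clear();

    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = stack;
    co_ctx.uc_stack.ss_size = sizeof stack;
    co_ctx.uc_link = &main_ctx;        // where to go when body() returns
    makecontext(&co_ctx, body, 0);

    swapcontext(&main_ctx, &co_ctx);   // run the coroutine until it suspends
    trace.push_back(3);
    swapcontext(&main_ctx, &co_ctx);   // resume; body() finishes, uc_link brings us back
    trace.push_back(6);
    return trace;
}

}  // namespace stackful_demo
```

`stackful_demo::run()` returns the sequence 1, 2, 3, 4, 5, 6: control leaves in the middle of `helper()` and later comes back to exactly that point, with `helper()`'s frame still intact on the side-stack.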

https://stackoverflow.com/questions/57163510/are-stackless-c20-coroutines-a-problem
--
FROM 117.24.94.*
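#4|hgoldfish|2023-05-30 18:08:50
The memory footprint is not a serious problem, because modern systems such as Linux and OpenBSD have long implemented automatically growing stacks. You only get 4 KB to begin with, and the stack grows further only as the coroutine function actually runs.

So even if you create 1M coroutines in one go, they only take up about 4 GB of memory (1M × 4 KB).

Windows has this feature too, but I haven't yet found which API creates this kind of auto-growing stack memory. I do know Windows implements it, though, because CreateFiber() has this capability.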
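A rough sketch of the arithmetic and of one way to get "pay only for what you touch" stacks on Linux. It assumes the stack memory comes from an anonymous `mmap`, whose pages are backed by physical memory only when first touched; the thread does not say which mechanism the poster's library actually uses. `reserve_stack`, `release_stack` and `resident_bytes` are hypothetical helper names.

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>

namespace lazy_stack_demo {

// Reserve `size` bytes of address space for a coroutine stack.
// Physical pages are committed lazily, on first access.
inline void* reserve_stack(std::size_t size) {
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

inline void release_stack(void* p, std::size_t size) {
    munmap(p, size);
}

// Resident memory if each coroutine has touched `touched_bytes` of its stack.
constexpr unsigned long long resident_bytes(unsigned long long coroutines,
                                            unsigned long long touched_bytes) {
    return coroutines * touched_bytes;
}

// 1M coroutines x one 4 KB page each: roughly 4 GB, as estimated above.
static_assert(resident_bytes(1'000'000, 4096) == 4'096'000'000ULL);

}  // namespace lazy_stack_demo
```

With this scheme, a coroutine that reserves 1 MB but only ever touches its first page costs one page of physical memory; a coroutine whose call chain goes deep pays for every page it reaches.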

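[Quoting ylh0315:]
: With stackful, the stack's memory footprint is similar to multithreading.
: With 1 million coroutines, the memory footprint would be frightening.
: A stack pool could be considered, analogous to a thread pool.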
: ...................
--
FROM 59.60.25.*
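#7|hgoldfish|2023-05-30 18:17:07
How can you look at it that way? Once stackless coroutines are all put to work, they have to use that memory too. And because stackless has a higher memory overhead for the same job, with constant allocations on the heap, it may well end up using a lot more memory.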

[Quoting ylh0315:]
: But sooner or later all 1M coroutines will be put to work, and once they are, the memory footprint is considerable.
--
FROM 59.60.25.*
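#9|hgoldfish|2023-05-30 18:19:38
The coroutine implementations in Go, Python, Java and C# are not the same as C++ stackful coroutines. Only C++ can do "save registers", "jmp", "restore registers", that is, an implementation working directly at the machine instruction level. The other languages either have a virtual machine or a GC, so they can't do it this way.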

[Quoting ensonmj:]
: Isn't that exactly how Go implements it? Go dig around and see which API it calls on Windows.
--
FROM 59.60.25.*
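#13|hgoldfish|2023-05-30 18:24:41
That shouldn't happen. I ran a test, and ten thousand coroutines used hardly any memory. Maybe that's because the call path in each of my coroutines is very short.

C++20 stackless coroutines simply take the program and transform it, turning each async function into a type like this: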

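#include <string>

// Stands for a compiler-generated type named after the source location (main.cpp, line 203).
struct Coroutine_main_cpp_l203 {
    int var1;
    std::string var2;
    int state = 204;   // line number of the next resume point

    int resume();
};

Then every statement that used to access a local variable is rewritten to access this struct. The whole function body is then dropped into resume(), with a case label inserted at each co_await position. For the underlying principle, look up Duff's device.

int Coroutine_main_cpp_l203::resume()
{
    switch (state) {
        case 204:  // entry point: first line of the function body
        ...
        state = 215;   // implementation of the first co_yield
        return value;
        case 215:  // line number of the first co_yield
        ....
        state = 228;
        return value;   // implementation of the second co_yield
        case 228:  // line number of the second co_yield
        ...
        return xxx;
    }
}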

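Every time you co_await() a coroutine function, a struct like this has to be new'ed on the heap (unless the compiler manages to elide the allocation), so you can imagine how much memory that takes. It also chops what would have been contiguous stack memory into separate blocks on the heap.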
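The transformation described above can be written out by hand as a small runnable sketch. This is a simplified illustration in the style of Duff's device, not what a real compiler emits; `CounterFrame` and its members are made-up names. The loop counter `i` has been moved from a local variable into the struct, and `state` selects the case label to jump back to.

```cpp
#include <cassert>

// Hand-written stand-in for a coroutine that does:
//     for (int i = 1; i <= 3; ++i) co_yield i;
struct CounterFrame {
    int i = 0;          // former local variable, now stored in the frame
    int state = 0;      // which resume point to jump to
    bool done = false;  // set once the body has run to completion

    int resume() {
        switch (state) {
        case 0:                      // first call: start at the top of the body
            for (i = 1; i <= 3; ++i) {
                state = 1;           // remember where to continue
                return i;            // plays the role of "co_yield i"
        case 1:;                     // later calls land here, inside the loop
            }
        }
        done = true;
        return -1;                   // sentinel: nothing left to yield
    }
};
```

Calling `resume()` on a fresh `CounterFrame` returns 1, 2 and 3 on successive calls, then -1 with `done` set. Each active coroutine needs one such frame, which is the per-coroutine allocation the post is complaining about.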

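[Quoting ylh0315:]
: What I don't understand is how stackless deals with the stack.
: I use stackful. Of course I haven't handled 1M coroutines, only a bit over ten thousand or so, and the memory usage was already considerable.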
--
Edited by: hgoldfish FROM 59.60.25.*
FROM 59.60.25.*
#19|hgoldfish|2023-05-30 18:47:54
It has no virtual machine, but it does have a GC. That is incompatible with the simple stack memory management scheme of C/C++.

[Quoting GoGoRoger:]
: Go doesn't have a virtual machine, does it?
: ...................
--
FROM 59.60.25.*
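#20|hgoldfish|2023-05-30 18:48:35
That is a problem with your third-party function. It has nothing to do with coroutines. If you insist on running a Scrypt algorithm inside every coroutine, not even Knuth could help you.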

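[Quoting ylh0315:]
: One of my third-party functions needs 6 MB of memory. That was measured during static testing.
: I was using stackful. Even with an auto-growing stack, the memory gets used up quickly.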
--
FROM 59.60.25.*
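#21|hgoldfish|2023-05-30 18:50:13
Right. C++ and Rust both carry too much baggage to change course. To stay compatible with existing code, they were deliberately designed with stackless syntax. Both the performance and the look of the syntax are rubbish.

Go, by contrast, was designed from scratch with coroutines as a core concept of the language. It is currently the only mainstream language to do so, and that is an important reason for its popularity.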

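[Quoting ensonmj:]
: With stackless, every resume has to walk down layer by layer from the top-level state, unlike stackful, which records the IP directly. With deep nesting this feels a bit inefficient.
: Also, Rust's async implementation apparently copies the state machine onto the executing thread's stack on every resume, which would be even slower. I don't know why it is implemented that way.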
--
FROM 59.60.25.*
#25|hgoldfish|2023-05-30 19:39:17
Ah! Then Rust is really dumb, almost on a par with the designers of JavaScript.

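[Quoting ensonmj:]
: Before 1.0, Rust had a stackful implementation. I don't know why it was removed; async was only added later. I increasingly feel that async and non-async are two separate worlds.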
--
FROM 59.60.25.*
#29|hgoldfish|2023-05-31 16:07:10
I hear it's mostly Microsoft hyping Rust these days, which is bad news.

[Quoting mygodxp:]
: Rust isn't much older than Go, is it? How can it already have too much baggage?
--
FROM 59.60.25.*