Do lock and unlock still cost 400+ clock cycles even without trapping into the kernel? Assuming 1+1 takes 1 clock cycle

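OP|stub|2023-03-27 17:41:27
I measured the cost of incrementing a std::atomic<int> in C++ on a 13900K CPU (hybrid big/little cores).

The results below show that when the atomic variable bounces between CPUs, an atomic increment costs about 200 times as much as a plain increment. Taking a lock and releasing it are two such atomic operations. With, say, 8 threads, the variable is most likely not in the cache of the CPU that is currently running, so if a plain increment takes 1 cycle, a lock/unlock pair costs roughly 2 x 200 = 400 clock cycles.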

8 threads:
Atomic counter: 800000000 Time: 6192 ms
Non-atomic counter: 128105858 Time: 24 ms

1 thread:
Atomic counter: 100000000 Time: 335 ms
Non-atomic counter: 100000000 Time: 18 ms
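So an uncontended atomic increment costs roughly 20 times as much as a plain increment (335 ms vs 18 ms), and with several threads bouncing the variable between cores it is about 250 times (6192 ms vs 24 ms).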

#include <iostream>
#include <atomic>
#include <chrono>
#include <functional>  // std::ref
#include <thread>
#include <vector>

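// Increments per thread. The timings posted above total 100000000 increments
// per thread, so they were presumably measured with a larger value than this.
const int NUM_ITERATIONS = 1000000;
const int NUM_THREADS = 8;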

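// Each ++counter is one atomic read-modify-write (sequentially consistent by default).
void atomic_increment(std::atomic<int>& counter) {
    for (int i = 0; i < NUM_ITERATIONS; ++i) {
        ++counter;
    }
}

// volatile only stops the compiler from optimizing the loop away. Concurrent
// increments still race, so updates get lost and the final count comes up short.
void non_atomic_increment(volatile int& counter) {
    for (int i = 0; i < NUM_ITERATIONS; ++i) {
        ++counter;
    }
}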

int main() {
    std::atomic<int> atomic_counter(0);
    volatile int non_atomic_counter = 0;

    auto start_atomic = std::chrono::high_resolution_clock::now();
    std::vector<std::thread> atomic_threads;
    for (int i = 0; i < NUM_THREADS; ++i) {
        atomic_threads.push_back(std::thread(atomic_increment, std::ref(atomic_counter)));
    }
    for (auto& t : atomic_threads) {
        t.join();
    }
    auto end_atomic = std::chrono::high_resolution_clock::now();
    auto atomic_duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_atomic - start_atomic).count();

    auto start_non_atomic = std::chrono::high_resolution_clock::now();
    std::vector<std::thread> non_atomic_threads;
    for (int i = 0; i < NUM_THREADS; ++i) {
        non_atomic_threads.push_back(std::thread(non_atomic_increment, std::ref(non_atomic_counter)));
    }
    for (auto& t : non_atomic_threads) {
        t.join();
    }
    auto end_non_atomic = std::chrono::high_resolution_clock::now();
    auto non_atomic_duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_non_atomic - start_non_atomic).count();

    std::cout << "Atomic counter: " << atomic_counter << " Time: " << atomic_duration << " ms" << std::endl;
    std::cout << "Non-atomic counter: " << non_atomic_counter << " Time: " << non_atomic_duration << " ms" << std::endl;

    return 0;
}

CXX = g++
CXXFLAGS = -std=c++11 -O3 -pthread
TARGET = atomic_test
OBJS = main.o

all: $(TARGET)

$(TARGET): $(OBJS)
	$(CXX) $(CXXFLAGS) -o $(TARGET) $(OBJS)

main.o: main.cpp
	$(CXX) $(CXXFLAGS) -c main.cpp

clean:
	rm -f $(OBJS) $(TARGET)

The test code and the Makefile were written by ChatGPT-4.
--
Edited by: stub FROM 114.242.248.*
FROM 61.48.14.*
#1|Bernstein|2023-03-27 22:09:55
This isn't locking and unlocking, it's just an increment.
Locking needs compare_exchange_weak/strong, and unlocking needs store/exchange.

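I don't know how other platforms implement it, but on x86/64 it is implemented by locking the memory bus, so it is obviously going to be slower.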

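[Quoting stub's post:]
: I measured the cost of incrementing a std::atomic<int> in C++
: The results below show that when the atomic variable bounces between CPUs, an atomic increment costs about 200 times as much as a plain increment. For lock and unlock, with 8 threads the variable is most likely not in the cache of the running CPU, so that costs about 400 clock cycles.
: 8 threads:
: ...................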
--
FROM 221.218.209.*
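#2|stub|2023-03-27 22:29:13
[Quoting Bernstein's post:]
: This isn't locking and unlocking, it's just an increment.
: Locking needs compare_exchange_weak/strong, and unlocking needs store/exchange.
: I don't know how other platforms implement it, but on x86/64 it is implemented by locking the memory bus, so it is obviously going to be slower.
: ...................
I know, but my guess is that the two kinds of instruction perform about the same. Also, it doesn't lock the bus anymore; these days it is implemented through cache coherence.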
--
FROM 120.244.140.*
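#3|z16166|2023-03-27 22:50:39
VMware virtual machine (5800X3D host, 4 cores assigned to the VM; a VM is not a good place to measure performance, so treat this as a rough reference only)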

happy@ubuntu:~$ g++ -std=c++11 -O3 -pthread 1.cpp
happy@ubuntu:~$ ./a.out
Atomic counter: 8000000 Time: 71 ms
Non-atomic counter: 1444110 Time: 68 ms
happy@ubuntu:~$ ./a.out
Atomic counter: 8000000 Time: 76 ms
Non-atomic counter: 1783212 Time: 72 ms
happy@ubuntu:~$ ./a.out
Atomic counter: 8000000 Time: 75 ms
Non-atomic counter: 1355109 Time: 74 ms

On an old machine, an i7-4790K, bare metal:

happy@happy:~$ g++ -std=c++11 -O3 -pthread 1.cpp
happy@happy:~$ ./a.out
Atomic counter: 8000000 Time: 139 ms
Non-atomic counter: 2049804 Time: 14 ms
happy@happy:~$ ./a.out
Atomic counter: 8000000 Time: 140 ms
Non-atomic counter: 3005134 Time: 8 ms
happy@happy:~$ ./a.out
Atomic counter: 8000000 Time: 140 ms
Non-atomic counter: 1551018 Time: 5 ms
happy@happy:~$ ./a.out
Atomic counter: 8000000 Time: 134 ms
Non-atomic counter: 1232533 Time: 5 ms
happy@happy:~$

The atomic increment compiles to a single add instruction with a lock prefix:

0000000000401120 <non_atomic_increment(int volatile&)>:
  401120:       ba 40 42 0f 00          mov    $0xf4240,%edx
  401125:       0f 1f 00                nopl   (%rax)
  401128:       8b 07                   mov    (%rdi),%eax
  40112a:       83 c0 01                add    $0x1,%eax
  40112d:       89 07                   mov    %eax,(%rdi)
  40112f:       83 ea 01                sub    $0x1,%edx
  401132:       75 f4                   jne    401128 <non_atomic_increment(int volatile&)+0x8>
  401134:       c3                      ret
  401135:       66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
  40113c:       00 00 00 00

0000000000401140 <atomic_increment(std::atomic<int>&)>:
  401140:       b8 40 42 0f 00          mov    $0xf4240,%eax
  401145:       0f 1f 00                nopl   (%rax)
  401148:       f0 83 07 01             lock addl $0x1,(%rdi)
  40114c:       83 e8 01                sub    $0x1,%eax
  40114f:       75 f7                   jne    401148 <atomic_increment(std::atomic<int>&)+0x8>
  401151:       c3                      ret
--
Edited by: z16166 FROM 221.218.163.*
FROM 221.218.163.*
#4|fanci|2023-03-29 14:02:04
Nice. I only played with assembly back in school, and since I started working I haven't had a chance to touch micro-optimization.

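[Quoting z16166 (Netguy)'s post:]
:  VMware virtual machine (5800X3D host, 4 cores assigned to the VM; a VM is not a good place to measure performance, so treat this as a rough reference only)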
:
:  happy@ubuntu:~$ g++ -std=c++11 -O3 -pthread 1.cpp
:  happy@ubuntu:~$ ./a.out
--
FROM 203.145.94.*