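- Topic: Does locking and unlocking without trapping into the kernel still cost 400+ clock cycles? Assuming 1+1 takes 1 clock cycle
I tested the performance of incrementing an atomic<int> in C++. CPU: 13900K, with P-cores and E-cores.
From the results below, when the atomic variable bounces between CPUs, an atomic increment costs about 200 times a plain increment. Locking and unlocking each need one such operation, so assuming 8 threads, the atomic variable is most likely not in the cache of the CPU that is currently running, and lock plus unlock would cost about 400 clock cycles (2 x 200).
8 threads:
Atomic counter: 800000000 Time: 6192 ms
Non-atomic counter: 128105858 Time: 24 ms
1 thread:
Atomic counter: 100000000 Time: 335 ms
Non-atomic counter: 100000000 Time: 18 ms
So without contention an atomic increment costs roughly 20 times a plain increment, and with multiple threads bouncing the variable it costs roughly 250 times.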
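To make the ratios above easier to check, here is a minimal sketch that turns the posted timings into a per-increment cost. The helper names (`ns_per_op`, `slowdown`, `cycles_per_op`) are made up for illustration, and the clock frequency is left as a parameter because the post does not state it, so any value passed in is an assumption.

```cpp
#include <cstdint>

// Average wall-clock nanoseconds per increment, over all threads combined.
double ns_per_op(double elapsed_ms, std::uint64_t total_ops) {
    return elapsed_ms * 1e6 / static_cast<double>(total_ops);
}

// How many times longer the atomic run took than the plain run.
double slowdown(double atomic_ms, double plain_ms) {
    return atomic_ms / plain_ms;
}

// Convert nanoseconds to clock cycles for an ASSUMED core frequency in GHz.
double cycles_per_op(double ns, double assumed_ghz) {
    return ns * assumed_ghz;
}
```

With the numbers posted here, `slowdown(6192, 24)` is 258 and `slowdown(335, 18)` is about 18.6, in line with the rough 250x and 20x figures. `ns_per_op(335, 100000000)` is 3.35 ns per uncontended atomic increment; how many cycles that is depends on the frequency you assume.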
#include <iostream>
#include <atomic>
#include <chrono>
#include <functional>  // std::ref
#include <thread>
#include <vector>

// Increments per thread. The 8-thread total posted above (800000000)
// suggests that run used 100000000 here.
const int NUM_ITERATIONS = 1000000;
const int NUM_THREADS = 8;

// Atomic read-modify-write on the shared counter.
void atomic_increment(std::atomic<int>& counter) {
    for (int i = 0; i < NUM_ITERATIONS; ++i) {
        ++counter;
    }
}

// volatile only keeps the compiler from optimizing the loop away.
// It does not make the increment atomic, so with several threads
// this is a data race and updates are lost.
void non_atomic_increment(volatile int& counter) {
    for (int i = 0; i < NUM_ITERATIONS; ++i) {
        ++counter;
    }
}

int main() {
    std::atomic<int> atomic_counter(0);
    volatile int non_atomic_counter = 0;

    // Time the atomic version (thread creation and join included).
    auto start_atomic = std::chrono::high_resolution_clock::now();
    std::vector<std::thread> atomic_threads;
    for (int i = 0; i < NUM_THREADS; ++i) {
        atomic_threads.push_back(std::thread(atomic_increment, std::ref(atomic_counter)));
    }
    for (auto& t : atomic_threads) {
        t.join();
    }
    auto end_atomic = std::chrono::high_resolution_clock::now();
    auto atomic_duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_atomic - start_atomic).count();

    // Time the non-atomic version the same way.
    auto start_non_atomic = std::chrono::high_resolution_clock::now();
    std::vector<std::thread> non_atomic_threads;
    for (int i = 0; i < NUM_THREADS; ++i) {
        non_atomic_threads.push_back(std::thread(non_atomic_increment, std::ref(non_atomic_counter)));
    }
    for (auto& t : non_atomic_threads) {
        t.join();
    }
    auto end_non_atomic = std::chrono::high_resolution_clock::now();
    auto non_atomic_duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_non_atomic - start_non_atomic).count();

    std::cout << "Atomic counter: " << atomic_counter << " Time: " << atomic_duration << " ms" << std::endl;
    std::cout << "Non-atomic counter: " << non_atomic_counter << " Time: " << non_atomic_duration << " ms" << std::endl;
    return 0;
}
CXX = g++
CXXFLAGS = -std=c++11 -O3 -pthread
TARGET = atomic_test
OBJS = main.o

all: $(TARGET)

$(TARGET): $(OBJS)
	$(CXX) $(CXXFLAGS) -o $(TARGET) $(OBJS)

main.o: main.cpp
	$(CXX) $(CXXFLAGS) -c main.cpp

clean:
	rm -f $(OBJS) $(TARGET)
I had ChatGPT-4 write the test code and the Makefile.
--
Edited: stub FROM 114.242.248.*
FROM 61.48.14.*
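This is not locking and unlocking, it is only an increment.
Locking needs compare_exchange_weak/strong, and unlocking needs store/exchange.
I don't know how other platforms implement it, but on x86/x64 it is implemented by locking the memory bus, so performance will obviously be lower.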
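For reference, a lock built from the operations named above could look roughly like the sketch below. `SpinLock` and `count_with_spinlock` are hypothetical names for illustration, and the memory orderings are one reasonable choice rather than anything taken from this thread.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical minimal spinlock, for illustration only.
class SpinLock {
public:
    void lock() {
        bool expected = false;
        // Try to flip false -> true. On failure compare_exchange_weak writes
        // the current value into `expected`, so reset it before retrying.
        while (!locked_.compare_exchange_weak(expected, true,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed)) {
            expected = false;
        }
    }
    void unlock() {
        locked_.store(false, std::memory_order_release);
    }
private:
    std::atomic<bool> locked_{false};
};

// Each thread adds `iterations` to a plain long under the lock; returns the total.
long count_with_spinlock(int num_threads, int iterations) {
    SpinLock lock;
    long counter = 0;  // protected by `lock`
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}
```

Because every increment happens while holding the lock, `count_with_spinlock(4, 10000)` returns 40000. Each lock/unlock pair touches the shared flag at least twice, which is why the cost of a single contended atomic operation matters for the original question.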
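【 stub wrote: 】
: I tested the performance of incrementing an atomic<int> in C++
: From the results below, when the atomic variable bounces between CPUs, an atomic increment costs about 200 times a plain increment. Locking and unlocking: assuming 8 threads, the atomic variable is most likely not on the CPU that is currently running, so it would cost 400 clock cycles.
: 8 threads: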
: ...................
--
FROM 221.218.209.*
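【 Bernstein wrote: 】
: This is not locking and unlocking, it is only an increment.
: Locking needs compare_exchange_weak/strong, and unlocking needs store/exchange.
: I don't know how other platforms implement it, but on x86/x64 it is implemented by locking the memory bus, so performance will obviously be lower.
: ...................
I know, but my guess is that the two instructions perform about the same. Also, it no longer locks the bus these days; it is implemented through cache coherence.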
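One way to check the guess that the two instructions cost about the same would be to swap the increment for a compare-exchange loop and time both variants. The sketch below is only that: `increment_fetch_add`, `increment_cas`, `RunResult`, and `run_threads` are hypothetical names, and that these compile to `lock add`/`lock xadd` and `lock cmpxchg` respectively is an assumption about GCC on x86-64, not something measured in this thread. Note that under contention the compare-exchange loop retries, so it can execute more locked instructions than it has iterations.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Increment with a single atomic read-modify-write.
void increment_fetch_add(std::atomic<int>& counter, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        counter.fetch_add(1);
    }
}

// Increment with a compare-exchange retry loop.
void increment_cas(std::atomic<int>& counter, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        int seen = counter.load();
        // On failure `seen` is refreshed with the current value; retry until our +1 lands.
        while (!counter.compare_exchange_weak(seen, seen + 1)) {
        }
    }
}

struct RunResult {
    int final_count;
    long long elapsed_ms;
};

// Run `worker` on `num_threads` threads against one shared counter and time it.
RunResult run_threads(void (*worker)(std::atomic<int>&, int),
                      int num_threads, int iterations) {
    std::atomic<int> counter(0);
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back(worker, std::ref(counter), iterations);
    }
    for (auto& th : threads) th.join();
    auto end = std::chrono::steady_clock::now();
    long long ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    return RunResult{counter.load(), ms};
}
```

Calling `run_threads(increment_fetch_add, 8, 1000000)` and `run_threads(increment_cas, 8, 1000000)` and comparing `elapsed_ms` gives a rough answer on a given machine; both should report a `final_count` of 8000000.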
--
FROM 120.244.140.*
VMware virtual machine (5800X3D, 4 cores assigned to the VM. A VM is not suitable for measuring performance, so this is for reference only)
happy@ubuntu:~$ g++ -std=c++11 -O3 -pthread 1.cpp
happy@ubuntu:~$ ./a.out
Atomic counter: 8000000 Time: 71 ms
Non-atomic counter: 1444110 Time: 68 ms
happy@ubuntu:~$ ./a.out
Atomic counter: 8000000 Time: 76 ms
Non-atomic counter: 1783212 Time: 72 ms
happy@ubuntu:~$ ./a.out
Atomic counter: 8000000 Time: 75 ms
Non-atomic counter: 1355109 Time: 74 ms
On an old i7-4790K machine, bare metal:
happy@happy:~$ g++ -std=c++11 -O3 -pthread 1.cpp
happy@happy:~$ ./a.out
Atomic counter: 8000000 Time: 139 ms
Non-atomic counter: 2049804 Time: 14 ms
happy@happy:~$ ./a.out
Atomic counter: 8000000 Time: 140 ms
Non-atomic counter: 3005134 Time: 8 ms
happy@happy:~$ ./a.out
Atomic counter: 8000000 Time: 140 ms
Non-atomic counter: 1551018 Time: 5 ms
happy@happy:~$ ./a.out
Atomic counter: 8000000 Time: 134 ms
Non-atomic counter: 1232533 Time: 5 ms
happy@happy:~$
The atomic increment is a single add instruction with a lock prefix:
0000000000401120 <non_atomic_increment(int volatile&)>:
  401120:   ba 40 42 0f 00          mov    $0xf4240,%edx
  401125:   0f 1f 00                nopl   (%rax)
  401128:   8b 07                   mov    (%rdi),%eax
  40112a:   83 c0 01                add    $0x1,%eax
  40112d:   89 07                   mov    %eax,(%rdi)
  40112f:   83 ea 01                sub    $0x1,%edx
  401132:   75 f4                   jne    401128 <non_atomic_increment(int volatile&)+0x8>
  401134:   c3                      ret
  401135:   66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
  40113c:   00 00 00 00

0000000000401140 <atomic_increment(std::atomic<int>&)>:
  401140:   b8 40 42 0f 00          mov    $0xf4240,%eax
  401145:   0f 1f 00                nopl   (%rax)
  401148:   f0 83 07 01             lock addl $0x1,(%rdi)
  40114c:   83 e8 01                sub    $0x1,%eax
  40114f:   75 f7                   jne    401148 <atomic_increment(std::atomic<int>&)+0x8>
  401151:   c3                      ret
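The listing also shows why the non-atomic counter ends up far below 8000000: the plain increment is three separate instructions (load, add, store), so two cores can load the same value and one update is lost. The sketch below replays one such interleaving deterministically on a single thread. The `Cpu` struct and both function names are hypothetical and only model the mov/add/mov sequence; they are not how the hardware is implemented.

```cpp
// Models one core executing the three-instruction increment from the listing.
struct Cpu {
    int eax = 0;                                // private register
    void load(const int& mem) { eax = mem; }    // mov (%rdi),%eax
    void add_one() { eax += 1; }                // add $0x1,%eax
    void store(int& mem) const { mem = eax; }   // mov %eax,(%rdi)
};

// Both cores load before either stores, so one increment is lost.
int interleaved_increments() {
    int counter = 0;
    Cpu a, b;
    a.load(counter);
    b.load(counter);
    a.add_one();
    b.add_one();
    a.store(counter);
    b.store(counter);
    return counter;
}

// Each core finishes its whole load/add/store before the other starts,
// which is the effect `lock addl` guarantees for a single instruction.
int serialized_increments() {
    int counter = 0;
    Cpu a, b;
    a.load(counter); a.add_one(); a.store(counter);
    b.load(counter); b.add_one(); b.store(counter);
    return counter;
}
```

`interleaved_increments()` returns 1 although two increments ran, while `serialized_increments()` returns 2.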
--
Edited: z16166 FROM 221.218.163.*
FROM 221.218.163.*
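Nice. I only played with assembly back in school, and since I started working I haven't had a chance to touch micro-optimization.
【 z16166 (Netguy) wrote: 】
: VMware virtual machine (5800X3D, 4 cores assigned to the VM. A VM is not suitable for measuring performance, so this is for reference only)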
:
: happy@ubuntu:~$ g++ -std=c++11 -O3 -pthread 1.cpp
: happy@ubuntu:~$ ./a.out
--
FROM 203.145.94.*