I'm implementing elliptic curve cryptography algorithms (specifically, X25519 and Ed25519) using AVX-512IFMA instructions.
When I compared my vectorized implementation with the x86-64 assembly version, I found that my vectorized implementation suffered severe performance degradation under cold-start conditions.
In a warm-start test, the function is executed 1000 times to warm the instruction and data caches before I start recording CPU cycles (CC).
In a cold-start test, the function is executed directly (without warming the caches) and its CC is recorded.
My tests show that the x86-64 assembly implementation suffers little performance degradation under cold-start conditions.
However, the vectorized implementation slows down by a factor of 2 to 3 under cold-start conditions; in other words, its cold-start CC is about 2 to 3 times its warm-start CC.
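The two measurement modes described above can be sketched in C roughly as follows. This is only an illustration under assumptions, not the harness used for the numbers in this post: `measure_cold`, `measure_warm`, `slowdown`, and `dummy_work` are hypothetical names, and the sketch reads the time-stamp counter with `__rdtsc`, which counts reference ticks rather than core clock cycles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC/Clang, x86-64) */

/* Hypothetical signature for the function under test. */
typedef void (*bench_fn)(void);

/* Hypothetical stand-in workload; replace with the real routine. */
static volatile uint64_t sink;
void dummy_work(void) {
    uint64_t acc = 0;
    for (uint64_t i = 0; i < 1000; i++) acc += i * i;
    sink = acc;
}

/* Read the TSC between fences so surrounding instructions
 * are not reordered across the timestamp. */
static inline uint64_t tsc_now(void) {
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* Cold start: time a single call with no warm-up beforehand. */
uint64_t measure_cold(bench_fn fn) {
    uint64_t start = tsc_now();
    fn();
    return tsc_now() - start;
}

/* Warm start: run warmup_iters untimed calls, then time one call. */
uint64_t measure_warm(bench_fn fn, size_t warmup_iters) {
    for (size_t i = 0; i < warmup_iters; i++) fn();
    uint64_t start = tsc_now();
    fn();
    return tsc_now() - start;
}

/* Cold-to-warm ratio, e.g. 3.0 means the cold call took 3x longer. */
double slowdown(uint64_t cold, uint64_t warm) {
    assert(warm > 0);
    return (double)cold / (double)warm;
}
```

With this sketch, `slowdown(measure_cold(f), measure_warm(f, 1000))` gives the cold-to-warm ratio for a function `f`, provided the cold measurement really is the first call to `f` in the process.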
In order to explore the cause of this problem, I made the following attempts.
Attempt 1: Measure their code size.
I found that the code sizes of the vectorized implementation and the x86-64 implementation are both close to 32 KB (which is also the size of the L1I cache).
I also measured their L1I misses using the perf tool. The vectorized implementation incurs about 5320 misses, while the x86-64 implementation incurs about 5100. The gap between the two is not particularly large, so it is not enough to explain my problem.
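An L1I miss count for a single call can also be taken from inside the program instead of from the perf command line. The sketch below assumes Linux and uses the `perf_event_open` system call with the generic L1I read-miss cache event; whether that is the same event behind the numbers above is not stated here. `l1i_read_miss_config` and `count_l1i_misses` are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Config for a PERF_TYPE_HW_CACHE event: cache id in bits 0-7,
 * operation in bits 8-15, result in bits 16-23. */
uint64_t l1i_read_miss_config(void) {
    return (uint64_t)PERF_COUNT_HW_CACHE_L1I
         | ((uint64_t)PERF_COUNT_HW_CACHE_OP_READ << 8)
         | ((uint64_t)PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
}

/* Count user-space L1I read misses during one call to fn.
 * Returns -1 if the counter cannot be opened or read
 * (for example when perf access is restricted). */
int64_t count_l1i_misses(void (*fn)(void)) {
    assert(fn != NULL);

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof attr;
    attr.config = l1i_read_miss_config();
    attr.disabled = 1;        /* start stopped, enable explicitly below */
    attr.exclude_kernel = 1;  /* count user-space only */
    attr.exclude_hv = 1;

    /* pid = 0 (this thread), cpu = -1 (any), no group, no flags */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    ssize_t got = read(fd, &count, sizeof count);
    close(fd);
    return got == (ssize_t)sizeof count ? (int64_t)count : -1;
}
```

Counting around exactly one cold call and one warm call separates the two cases, which an aggregate count over the whole process cannot do.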
Attempt 2: Analysis using the topdown analysis method
I found that under cold-start conditions, both implementations are bottlenecked by the CPU front-end and the CPU back-end, in similar proportions. The front-end share is 31.3% for both implementations, and the back-end share is 34% for the vectorized implementation and 28.1% for the x86-64 implementation.
So the results of the topdown analysis cannot explain my problem.
Attempt 3: Analyze Instruction Encoding
For x86-64 assembly instructions, the encoding length of `add %al, (%rax)` is 2 bytes, and the encoding length of `add %al, 0x53ab6345(%rdx)` is 6 bytes.
For AVX-512 instructions, the encoding length of `vpaddq (%rdx), %zmm7, %zmm4` is 6 bytes, and the encoding length of `vpaddq 0x40(%rdx), %zmm5, %zmm3` is 7 bytes.
On average, AVX-512 instructions have longer encodings than x86-64 instructions.
But this finding seems to be of little significance, because the total code sizes of the two implementations are close.
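The four encodings from Attempt 3 can be written out byte by byte to make the length comparison concrete. The byte values below are hand-assembled from the legacy and EVEX encoding rules, so treat them as an assumption to be checked against a disassembler; the array and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* add %al, (%rax): opcode 00, ModRM 00 */
const uint8_t ADD_AL_MEM[] = {0x00, 0x00};
/* add %al, 0x53ab6345(%rdx): opcode 00, ModRM 82, 4-byte displacement */
const uint8_t ADD_AL_MEM_DISP32[] = {0x00, 0x82, 0x45, 0x63, 0xab, 0x53};
/* vpaddq (%rdx), %zmm7, %zmm4: 4-byte EVEX prefix, opcode d4, ModRM 22 */
const uint8_t VPADDQ_MEM[] = {0x62, 0xf1, 0xc5, 0x48, 0xd4, 0x22};
/* vpaddq 0x40(%rdx), %zmm5, %zmm3: as above plus a compressed 1-byte
 * displacement (0x40 / 64 = 1) */
const uint8_t VPADDQ_MEM_DISP8[] = {0x62, 0xf1, 0xd5, 0x48, 0xd4, 0x5a, 0x01};

/* Arithmetic mean of n instruction lengths in bytes. */
double mean_length(const size_t *lengths, size_t n) {
    assert(n > 0);
    size_t total = 0;
    for (size_t i = 0; i < n; i++) total += lengths[i];
    return (double)total / (double)n;
}

/* Mean length of the two scalar examples. */
double scalar_mean_length(void) {
    const size_t lengths[] = {sizeof ADD_AL_MEM, sizeof ADD_AL_MEM_DISP32};
    return mean_length(lengths, 2);
}

/* Mean length of the two AVX-512 examples. */
double evex_mean_length(void) {
    const size_t lengths[] = {sizeof VPADDQ_MEM, sizeof VPADDQ_MEM_DISP8};
    return mean_length(lengths, 2);
}
```

For these four examples the means are 4 bytes for the scalar pair and 6.5 bytes for the AVX-512 pair. Two instructions per group are far too few to characterize a whole implementation, so a per-instruction length histogram taken from `objdump -d` output would be the more reliable comparison.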
Now I don't know how to explain this problem. Can you give me some ideas?
I know this might be a very complicated question, but I would be very grateful if you could give me some ideas.
I can provide more relevant data if required.