Hi Experts,
I am observing a huge performance impact from a locked operation that is split across two cache lines, even though the code where the locked operation happens does not show many cycles spent there.
With some simplification: in the case I have been looking at, there are two groups of threads. The first does memory-intensive calculations such as checksums and consumes most of the CPU cycles. The second group consists of relatively light threads, consuming ~1-2% of cycles, which perform an atomic operation (lock or) as part of their code path. After a code modification, the atomic's operand shifted and now spans two cache lines. The performance of the checksum dropped as a result.
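To illustrate the kind of layout change involved, here is a minimal C sketch, assuming a 64-byte cache line; the struct and field names are made up for illustration and are not from our actual code:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Packed so the 8-byte field lands at offset 60; the static object is
 * aligned to a 64-byte line, so "flags" occupies bytes 60..67 of the
 * object and straddles the cache-line boundary. */
struct stats {
    char     pad[60];   /* e.g. a neighbouring field grew by a few bytes */
    uint64_t flags;     /* 8-byte operand of the "lock or" */
} __attribute__((packed));

static struct stats s __attribute__((aligned(64)));

int main(void) {
    uintptr_t a = (uintptr_t)&s + offsetof(struct stats, flags);
    /* The operand is split when its first and last bytes fall into
       different 64-byte blocks. */
    int split = (a / 64) != ((a + sizeof s.flags - 1) / 64);
    printf("offset in line = %u, split = %d\n", (unsigned)(a % 64), split);
    /* A locked RMW on s.flags here, e.g.
       __atomic_fetch_or(&s.flags, 1, __ATOMIC_SEQ_CST),
       would count as SQ_MISC.SPLIT_LOCK whenever split == 1. */
    return 0;
}
```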
VTune's microarchitecture analysis showed that the SQ (super queue) and memory became a bottleneck for the checksum part.
Collected separately, SQ_MISC.SPLIT_LOCK showed quite a high rate of the event, attributed to the "lock or .." instruction.
Can anyone explain how a split lock may be implemented and how it affects the caches?
Thanks,
Rustem
BAD:
```
$ perf stat -e sq_misc.split_lock -a -- sleep 3

 Performance counter stats for 'system wide':

        419881      sq_misc.split_lock
```

GOOD:
```
$ perf stat -e sq_misc.split_lock -a -- sleep 3

 Performance counter stats for 'system wide':

            12      sq_misc.split_lock
```
A split lock occurs when the operand of a locked (atomic) access spans two cache lines, and it is very detrimental to performance. Because the processor cannot guarantee atomicity with an ordinary cache-line lock in that case, it asserts a bus lock that serializes memory accesses across all cores for the duration of the operation. That is why your memory-intensive checksum threads slow down even though they never touch the contended data. To avoid this, make sure the access does not cross a 64B cache-line boundary, e.g. by keeping the operand naturally aligned.
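For example, here is one way to force the alignment in C11; a minimal sketch with illustrative names, not your actual code:

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

struct stats {
    char pad[60];
    /* alignas(8) is already enough to prevent a split: an 8-byte operand
       aligned to 8 bytes can never cross a 64B boundary. alignas(64)
       additionally keeps the field on its own cache line, avoiding
       false sharing with neighbouring fields. */
    alignas(64) _Atomic uint64_t flags;
};

/* Compiles to a "lock or" on x86-64; the operand now always sits
   within a single cache line, so no split lock can occur. */
void set_flag(struct stats *s, uint64_t bit) {
    atomic_fetch_or(&s->flags, bit);
}
```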
