Hi Experts,
I am observing a huge performance impact from a locked operation that is split across two cache lines, even though the code where the locked operation happens does not show many cycles spent there.
With some simplification: in the case I have been looking at, there are two groups of threads. The first does memory-intensive calculations such as checksums and consumes most of the CPU cycles. The second group consists of relatively light threads, consuming ~1-2% of cycles, which perform an atomic operation (lock or) as part of their code path. After a code modification, the atomic's operand shifted and now spans two cache lines. The performance of the checksum dropped as a result.
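To illustrate the kind of layout change involved, here is a minimal C sketch, assuming a 64-byte cache line; the struct and field names are made up for illustration and are not from our actual code:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Packed so the 8-byte field lands at offset 60; the static object is
 * aligned to a 64-byte line, so "flags" occupies bytes 60..67 of the
 * object and straddles the cache-line boundary. */
struct stats {
    char     pad[60];   /* e.g. a neighbouring field grew by a few bytes */
    uint64_t flags;     /* 8-byte operand of the "lock or" */
} __attribute__((packed));

static struct stats s __attribute__((aligned(64)));

int main(void) {
    uintptr_t a = (uintptr_t)&s + offsetof(struct stats, flags);
    /* The operand is split when its first and last bytes fall into
       different 64-byte blocks. */
    int split = (a / 64) != ((a + sizeof s.flags - 1) / 64);
    printf("offset in line = %u, split = %d\n", (unsigned)(a % 64), split);
    /* A locked RMW on s.flags here, e.g.
       __atomic_fetch_or(&s.flags, 1, __ATOMIC_SEQ_CST),
       would count as SQ_MISC.SPLIT_LOCK whenever split == 1. */
    return 0;
}
```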
VTune's microarchitecture analysis showed that the SQ (super queue) and memory became a bottleneck for the checksum part.
Collected separately, SQ_MISC.SPLIT_LOCK showed quite a high rate of the event, attributed to the "lock or .." instruction.
Can anyone explain how a split lock may be implemented and how it affects the caches?
Thanks,
Rustem
BAD:
```
$ perf stat -e sq_misc.split_lock -a -- sleep 3

 Performance counter stats for 'system wide':

        419881      sq_misc.split_lock
```

GOOD:
```
$ perf stat -e sq_misc.split_lock -a -- sleep 3

 Performance counter stats for 'system wide':

            12      sq_misc.split_lock
```
A split lock occurs when the operand of a locked (atomic) access spans two cache lines, and it is very detrimental to performance. Because the processor cannot guarantee atomicity with an ordinary cache-line lock in that case, it asserts a bus lock that serializes memory accesses across all cores for the duration of the operation. That is why your memory-intensive checksum threads slow down even though they never touch the contended data. To avoid this, make sure the access does not cross a 64B cache-line boundary, e.g. by keeping the operand naturally aligned.
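For example, here is one way to force the alignment in C11; a minimal sketch with illustrative names, not your actual code:

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

struct stats {
    char pad[60];
    /* alignas(8) is already enough to prevent a split: an 8-byte operand
       aligned to 8 bytes can never cross a 64B boundary. alignas(64)
       additionally keeps the field on its own cache line, avoiding
       false sharing with neighbouring fields. */
    alignas(64) _Atomic uint64_t flags;
};

/* Compiles to a "lock or" on x86-64; the operand now always sits
   within a single cache line, so no split lock can occur. */
void set_flag(struct stats *s, uint64_t bit) {
    atomic_fetch_or(&s->flags, bit);
}
```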
