Thanks.
jrzhou -
What architecture are you asking about? Is this with regard to Pentium, Xeon, Itanium, or something else?
-- clay
jrzhou -
I've consulted with some architecture experts within Intel, and they've given me several possible causes of excessive pipeline clearing.
The first is false sharing. If two threads access separate data items that sit within the same cache line, you have false sharing. When one thread modifies its variable, the whole cache line becomes "dirty," and every other physical or logical processor holding that line in its local cache must invalidate its copy and re-fetch the updated line.
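To make that concrete, here is a minimal sketch of the pattern (the program, names, and counts are my own illustration, not from this thread): two OpenMP threads each increment their own counter, but the counters are adjacent in memory, so every write dirties the one cache line both elements share.

program false_sharing_sketch
   use omp_lib
   implicit none
   integer, parameter :: NITER = 10000000
   integer :: counters(2)   ! adjacent elements share a cache line
   integer :: tid, i
   counters = 0
!$omp parallel num_threads(2) private(tid, i)
   tid = omp_get_thread_num() + 1
   do i = 1, NITER
      counters(tid) = counters(tid) + 1   ! each write dirties the shared line
   end do
!$omp end parallel
   print *, counters
end program false_sharing_sketch

Padding the counters apart (for example, one element per 64-byte cache line) should make the effect disappear, since each thread then dirties only its own line.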
The second case is when the processor detects the chance of a memory order violation. Since such a violation would result in incorrect program execution, the hardware has to make sure the correct memory order is maintained. How this problem is handled is specific to the hardware implementation, but every Intel processor is built to detect the possibility and take steps to guarantee correct memory ordering. There weren't many details about this, but I'm assuming that out-of-order execution can lead to memory being accessed in an incorrect order.
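As a hypothetical illustration of what the hardware has to watch for, consider a loop-carried store-to-load dependence: out-of-order execution may issue the load of a(i-1) before the previous iteration's store to that same address has completed, and the processor must detect the conflict and clear the pipeline to stay correct. The names and sizes here are mine, not from the thread.

program memory_order_sketch
   implicit none
   integer, parameter :: N = 1000
   real :: a(N), b(N), c(N)
   integer :: i
   a = 0.0; b = 1.0; c = 2.0
   do i = 2, N
      a(i) = c(i) * 2.0      ! store to a(i)
      b(i) = a(i-1) + b(i)   ! loads a(i-1), stored one iteration earlier
   end do
   print *, b(N)
end program memory_order_sketch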
The third possibility mentioned had to do with self-modifying code, but I'm hoping that we don't need to deal with that.
Are any of the above helpful?
-- clay
SSE vectorization can produce an overlap between an aligned store and a following unaligned load, with effects analogous to false sharing, but within a single thread. Future Intel compilers will look for this and take action to remove the conflict from the generated code:
DO I = 5, N
   A(I) = ...
   B(I) = A(I-1) + B(I)
END DO
The compiler vectorizes by using parallel SSE operations. The unaligned loads of segments of A have to pick up data stored in both the previous and the current iterations of the parallelized loop. The parallel code is slower than serial code because the memory-order hardware has to delay each load until the preceding store has gone to memory. In this simple case, that is corrected by "distributing" (splitting) the loop, letting all of A go to memory before reading it back, without incurring the memory-order clear for each load.
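Here is a sketch of that distributed form, using the same reconstructed array names; the initialization values are mine, for illustration. After the split, every store to A has completed before any load of A begins, so no load can conflict with an in-flight store.

program distributed_sketch
   implicit none
   integer, parameter :: N = 1000
   real :: a(N), b(N)
   integer :: i
   a = 0.0; b = 1.0
   do i = 5, N
      a(i) = real(i)         ! first loop: only stores to a
   end do
   do i = 5, N
      b(i) = a(i-1) + b(i)   ! second loop: loads of a after all stores are done
   end do
   print *, b(N)
end program distributed_sketch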