What is the cost of an IPI? As far as I know, inter-processor interrupts are used to synchronize caches between cores and processors. Such synchronization can be "costly" (my current knowledge does not let me be more precise...). But what is the cost of the IPI itself? And is there anything besides cache synchronization that can trigger an IPI?
Please share some information on this topic.
I was going by what Wikipedia says:
"An inter-processor interrupt (IPI) is a special type of interrupt by which one processor may interrupt another processor in a multiprocessor system. IPIs are typically used to implement a cache coherency synchronization point.
In x86 based systems, an IPI synchronizes the cache and Memory Management Unit (MMU) between processors."
Are you sure that IPIs do not have anything to do with cache synchronization?
Thanks in advance.
Cache-coherency protocols do not use IPIs, and as a user-space developer you do not need to care about IPIs at all. What one is usually interested in is the cost of cache coherency itself.
However, the Win32 API does provide a function that issues IPIs to all processors (in the affinity mask of the current process): FlushProcessWriteBuffers(). You can use it to investigate the cost of IPIs if you are still interested. In a simple synthetic test on a dual-core machine I obtained the following numbers:
- 420 cycles: minimum cost of the function on the issuing core
- 1600 cycles: mean cost of the function on the issuing core
- 1300 cycles: mean cost of the function on the remote core
Note that, as far as I understand, the function issues an IPI to the remote core, the remote core acknowledges it with another IPI, and the issuing core waits for that ack IPI before returning.
Note the "This computer hardware-related article is a stub" notice at the bottom. It's unclear what "a cache coherency synchronization point" is or what it means "to synchronize the cache"; to the best of my knowledge there are no such terms.
Perhaps the author means that with an IPI one can enforce instruction ordering on a remote processor. That does not directly relate to cache coherency.
And I was wrong to say that a user-space developer does not care about IPIs. Because of the application mentioned above (enforcing instruction ordering on a remote processor), IPIs let one develop algorithms that draw their strength from the dark side of the Force. As an example, see the following asymmetric reader-writer mutex algorithm, which outperforms all other rw mutexes on read-mostly workloads:
The same effect may be achieved by other means, but IPIs are preferable because of their "reactivity".
I'm investigating IPIs because of the following paper:
(which is a successor of following paper: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1409136)
They state that setting process and network-card affinity yields a performance gain because of:
- better cache coherency,
- lower amount of IPIs.
IPIs have the indirect cost of flushing the processor pipeline. Until today I thought that the most common way of triggering IPIs (and thus pipeline flushes) was false sharing in the cache, but it turns out that isn't true. Good news :)
Still, are there any ways to trigger IPIs inadvertently? It would be good to know, so the resulting pipeline flushes can be avoided.
I haven't read the paper (either paper), but I have looked at the appropriate systems programming guide, which lists the common uses for IPIs: startup (SIPIs), self-interrupting, and propagating interrupts (either interrupting another processor or letting a processor forward an interrupt to another one). It seems logical that setting process and network-card affinity increases the likelihood that the processor that receives the initial NIC interrupt can handle it itself rather than deferring to another processor. The recommended uses seem very limited, though Dmitriy's observation that Win32 provides a function call for this could mean all kinds of crazies are using it out there.