I'm trying to optimize image processing algorithms using SSE for the "Penryn" family of processors, and I need to know which instructions can run in parallel. There does not seem to be any conclusive information on the internet (or is there?).
For example, can the following instruction combinations execute in one clock cycle:
pmadd | padd | padd
psraw/d | (un)pack
palignr | pshufb
I think I might be able to find out by writing a test program and testing all kinds of combinations, but it would be better if there was an easier way.
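A test program of the kind mentioned above might look like the following minimal sketch. It assumes GCC or Clang on x86 (for `__rdtsc` from `x86intrin.h` and for the inline-asm barrier), `time_mix` is a hypothetical helper name, and the pmaddwd/paddw/paddw mix is just the first combination from the list. The timestamp counter does not necessarily tick at the core clock rate, so the result is only useful for relative comparisons.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc; GCC/Clang specific (assumption) */

/* Hypothetical helper: runs three independent chains (pmaddwd, paddw, paddw)
 * for `iters` iterations and returns the elapsed timestamp-counter ticks.
 * *checksum receives a value derived from the results so they stay live. */
uint64_t time_mix(int iters, int *checksum)
{
    const __m128i k = _mm_set1_epi16(1);
    __m128i a = _mm_set1_epi16(1);
    __m128i b = _mm_set1_epi16(2);
    __m128i c = _mm_set1_epi16(3);

    uint64_t t0 = __rdtsc();
    for (int i = 0; i < iters; i++) {
        a = _mm_madd_epi16(a, k);   /* pmaddwd */
        b = _mm_add_epi16(b, k);    /* paddw   */
        c = _mm_add_epi16(c, k);    /* paddw   */
        /* Empty asm statement: keeps a, b, c in XMM registers on every
         * iteration so the compiler cannot fold the loop away. */
        __asm__ volatile("" : "+x"(a), "+x"(b), "+x"(c));
    }
    uint64_t t1 = __rdtsc();

    *checksum = _mm_cvtsi128_si32(a)
              + _mm_extract_epi16(b, 0)
              + _mm_extract_epi16(c, 0);
    return t1 - t0;
}
```

The idea would be to time each instruction alone in the same loop and then the mix. If the mix takes about as long as the slowest single instruction, the three are probably being issued in parallel; if it takes closer to the sum, they are probably competing for a port.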
The information you're interested in, namely which instructions can be issued in the same cycle via ports 0, 1, and 5, is essentially listed in Table 2-2 of the Intel 64 and IA-32 Architectures Optimization Reference Manual: http://developer.intel.com/products/processor/manuals/index.htm
Note that padd is not limited to a single issue port (there is more than one SIMD ALU), whereas an instruction with a throughput of 1 cycle is generally constrained to one specific issue port. You should also be aware that dependencies between instruction operands impose further constraints on parallelism. If you're writing micro-kernels for experimentation, keep in mind that the out-of-order engine will try to issue multiple micro-ops per cycle as permitted by a number of factors, so it's more practical to verify what happened via performance monitoring events than to predict it from instruction latency, throughput, and port bindings (there are some pointers in Appendix B of the Optimization manual). Those three factors form only part of the conditions the OOO engine needs to achieve instruction-level parallelism.
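To illustrate the point about operand dependencies, here is a minimal sketch that sums 16-bit values with pmaddwd and paddd in two ways: once through a single accumulator, where every paddd waits on the previous one, and once through four independent accumulators. The function names are made up for this example, and both functions assume `n` is a multiple of 32. Whether the second form is actually faster on a given core is something to measure rather than assume.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Adds the four 32-bit lanes of v. */
static int32_t hsum_epi32(__m128i v)
{
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* One accumulator: each paddd depends on the result of the previous paddd. */
int32_t sum_single_chain(const int16_t *p, size_t n)
{
    const __m128i ones = _mm_set1_epi16(1);
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        /* pmaddwd by 1 widens pairs of 16-bit values to 32-bit sums */
        acc = _mm_add_epi32(acc, _mm_madd_epi16(v, ones));
    }
    return hsum_epi32(acc);
}

/* Four accumulators: the four paddd chains do not depend on each other. */
int32_t sum_four_chains(const int16_t *p, size_t n)
{
    const __m128i ones = _mm_set1_epi16(1);
    __m128i a0 = _mm_setzero_si128(), a1 = a0, a2 = a0, a3 = a0;
    for (size_t i = 0; i < n; i += 32) {
        const __m128i *q = (const __m128i *)(p + i);
        a0 = _mm_add_epi32(a0, _mm_madd_epi16(_mm_loadu_si128(q + 0), ones));
        a1 = _mm_add_epi32(a1, _mm_madd_epi16(_mm_loadu_si128(q + 1), ones));
        a2 = _mm_add_epi32(a2, _mm_madd_epi16(_mm_loadu_si128(q + 2), ones));
        a3 = _mm_add_epi32(a3, _mm_madd_epi16(_mm_loadu_si128(q + 3), ones));
    }
    return hsum_epi32(_mm_add_epi32(_mm_add_epi32(a0, a1),
                                    _mm_add_epi32(a2, a3)));
}
```

Both functions return the same sum; only the shape of the dependency graph differs, which is the part the port tables alone do not tell you about.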
I've checked the optimization manual, but it doesn't have very detailed information. Basically it says that each issue port is connected to an "Integer SIMD ALU", but ALU is a very broad term and can include almost anything. I agree that many factors affect performance, but there are cases when knowing the issue port is useful. For example, if I use one type of instruction a lot, the issue port for those instructions will become congested, and I should try to replace some of them (e.g. replace shifts with muls, bytewise shifts with shuffles, etc.) or take a different approach to the problem. If I don't know how the issue ports are organized, it is not easy to do this kind of optimization, or even to know whether I need to do it.
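As a concrete case of the kind of substitution meant here, doubling 16-bit lanes can be written as a shift (psllw), a multiply (pmullw), or an add (paddw), and all three give the same result modulo 2^16. This is only a sketch using SSE2 intrinsics; the function names are made up for the example, and which port each instruction is bound to on a particular core is exactly the piece of information that would have to come from the manual or from measurement.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Three interchangeable ways to double each 16-bit lane. */
__m128i double_by_shift(__m128i x) { return _mm_slli_epi16(x, 1); }    /* psllw  */
__m128i double_by_mul(__m128i x)                                       /* pmullw */
{
    return _mm_mullo_epi16(x, _mm_set1_epi16(2));
}
__m128i double_by_add(__m128i x)   { return _mm_add_epi16(x, x); }     /* paddw  */

/* Returns 1 if all eight 16-bit lanes of a and b are equal, else 0. */
int all_lanes_equal(__m128i a, __m128i b)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi16(a, b)) == 0xFFFF;
}
```

If the shift unit turns out to be the bottleneck in a loop, swapping some of the shifts for one of the other forms is the sort of rebalancing I have in mind.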