I'm writing assembly code. Say I have three parallel, mostly independent streams of execution; call them P, Q and R. For the most part, the opcodes within each stream are serially dependent (each opcode uses the result of its predecessor). I'm executing the code on a Penryn processor. For the sake of simplicity, assume every opcode has a latency of 1 cycle and a throughput of 1 per cycle. Is it better to:
A. Code one opcode from stream P, one from stream Q, one from stream R, then back to P again?
B. Code 3 or 4 opcodes from P, 3 or 4 opcodes from Q, 3 or 4 opcodes from R, then back to P again?
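To make the two layouts concrete, here is a minimal Python sketch (not assembly) that just prints the program order each option would produce. The function name `interleave` and the labels such as `P0` are hypothetical, chosen only for illustration.

```python
def interleave(streams, chunk, length):
    """Return program order: `chunk` ops from each stream in turn.

    streams: iterable of stream names, e.g. "PQR"
    chunk:   how many consecutive ops to take from one stream
    length:  number of ops in each stream
    """
    order = []
    for start in range(0, length, chunk):
        for name in streams:
            for i in range(start, min(start + chunk, length)):
                order.append((name, i))
    return order


if __name__ == "__main__":
    # Option A: one op per stream, round robin.
    print(" ".join(f"{s}{i}" for s, i in interleave("PQR", 1, 4)))
    # Option B: four ops per stream before switching.
    print(" ".join(f"{s}{i}" for s, i in interleave("PQR", 4, 4)))
```

Within a stream, `P1` depends on `P0`, `P2` on `P1`, and so on, so the only difference between the two layouts is how far apart dependent opcodes sit in program order.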
The "ROB Read Port Stalls" section of the Intel 64 and IA-32 Architectures Optimization Reference Manual mildly suggests that programmers keep short dependency chains together, which makes one think that option B is preferred. On the other hand, if the reservation station (RS) is keeping tabs on 32 opcodes and option A normally supplies 3 opcodes that are ready to go on every clock cycle, maybe it works just as well.
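The "maybe it works just as well" reasoning can be sketched with a toy scheduler model in Python. Everything here is an assumption made for illustration, not a description of Penryn: a front end that delivers 4 opcodes per cycle in program order, a 32-entry window, 3 opcodes issued per cycle (oldest ready first), 1-cycle latency, and no ROB read port limit at all. Because it ignores the read port limit, it cannot say anything about the stall the manual describes; it only checks whether the window finds three ready opcodes per cycle under either layout.

```python
from collections import deque


def simulate(order, fetch_width=4, window_size=32, issue_width=3):
    """Count cycles to execute `order`, a list of (stream, index) ops.

    Toy model (all parameters are assumptions): an op is ready when it is
    the first of its stream or its predecessor issued in an earlier cycle.
    """
    pending = deque(order)   # not yet fetched, in program order
    window = []              # fetched, waiting to issue
    done = set()             # ops whose results are available
    cycles = 0
    while pending or window:
        cycles += 1
        # Issue the oldest ready ops, up to issue_width.
        ready = [op for op in window
                 if op[1] == 0 or (op[0], op[1] - 1) in done]
        issued = ready[:issue_width]
        for op in issued:
            window.remove(op)
        # Fetch in program order; new ops can issue next cycle at the earliest.
        fetched = 0
        while pending and fetched < fetch_width and len(window) < window_size:
            window.append(pending.popleft())
            fetched += 1
        # Results become visible to dependents in the following cycle.
        done.update(issued)
    return cycles


# Option A: round robin, one op per stream. Option B: chunks of four.
order_a = [(s, i) for i in range(12) for s in "PQR"]
order_b = [(s, i) for start in range(0, 12, 4)
           for s in "PQR" for i in range(start, start + 4)]

if __name__ == "__main__":
    print("option A cycles:", simulate(order_a))
    print("option B cycles:", simulate(order_b))
```

In this model both layouts settle at three opcodes per cycle, and option B only pays a short startup delay while the first Q and R opcodes arrive. Whether real hardware behaves the same way depends on exactly the effects the model leaves out, so measuring on the actual processor is still needed.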
I prefer option A simply because it helps me see opportunities to cram more work into fewer cycles.