- Option A: code one opcode from stream P, one from stream Q, one from stream R, then back to P again?
- Option B: 3 or 4 opcodes from P, 3 or 4 opcodes from Q, 3 or 4 opcodes from R, then back to P again?
- Option C: something else?
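As a rough illustration of the first two orderings listed above, here is a minimal Python sketch that merges three instruction streams either one opcode at a time or in small runs. The function name `interleave`, the `chunk` parameter, and the placeholder labels `P0`, `Q0`, `R0`, ... are made up for this example; it only shows the resulting program order and says nothing about how a real core would schedule it.

```python
from itertools import islice

def interleave(streams, chunk):
    """Merge streams by taking up to `chunk` items from each stream in turn."""
    iters = [iter(s) for s in streams]
    merged = []
    while iters:
        still_live = []
        for it in iters:
            taken = list(islice(it, chunk))
            merged.extend(taken)
            # A short (or empty) read means this stream is exhausted.
            if len(taken) == chunk:
                still_live.append(it)
        iters = still_live
    return merged

# Hypothetical placeholder opcodes for three independent dependency chains.
P = [f"P{i}" for i in range(6)]
Q = [f"Q{i}" for i in range(6)]
R = [f"R{i}" for i in range(6)]

one_at_a_time = interleave([P, Q, R], 1)  # P0 Q0 R0 P1 Q1 R1 ...
in_runs_of_3 = interleave([P, Q, R], 3)   # P0 P1 P2 Q0 Q1 Q2 R0 R1 R2 P3 ...

print(" ".join(one_at_a_time))
print(" ".join(in_runs_of_3))
```

With `chunk=1` each chain's consecutive opcodes end up three slots apart in program order; with `chunk=3` they sit next to each other, which is the "keep short dependency chains together" layout.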
The "ROB Read Port Stalls" section of the Intel 64 and IA-32 Architectures Optimization Reference Manual mildly suggests that programmers keep short dependency chains together, which points toward option B. On the other hand, if the reservation station (RS) is tracking 32 opcodes and option A normally supplies 3 opcodes that are ready to go on every clock cycle, maybe it works just as well.
I prefer option A simply because it helps me see opportunities to cram more work into fewer cycles.