Opcode ordering

Intel_C_Intel — Tue, 06 May 2008 14:54:24 GMT

I'm writing assembly code. Say I have three parallel, mostly independent streams of execution; call them P, Q and R. Within each stream, the opcodes are for the most part serially dependent (each opcode uses the result of its predecessor). I'm executing the code on a Penryn processor. For the sake of simplicity, assume every opcode has a latency and a throughput of 1 cycle. Is it better to:

A. Code one opcode from stream P, one from stream Q, one from stream R, then back to P again?
B. Code 3 or 4 opcodes from P, 3 or 4 opcodes from Q, 3 or 4 opcodes from R, then back to P again?
C. Something else?

Section 3.5.2.1 of the Intel 64 and IA-32 Architectures Optimization Reference Manual, "ROB Read Port Stalls", mildly suggests that programmers keep short dependency chains together, which makes one think that option B is preferred. On the other hand, if the reservation station (RS) is tracking 32 opcodes and option A normally supplies 3 opcodes that are ready to go on every clock cycle, maybe it works just as well.
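
To make the comparison concrete, here is a rough toy model in Python, not a Penryn simulator. Everything in it is an assumption for illustration: the function names `program_order` and `simulate`, a front end that delivers up to 4 opcodes per cycle in program order, a 32-entry RS, up to 3 ready opcodes issued per cycle (oldest first), and a latency of 1 cycle. It deliberately ignores ROB read port limits, so it cannot show the effect that section 3.5.2.1 describes; it only checks whether the RS can find three ready opcodes per cycle under either ordering.

```python
def program_order(n_ops, block, chains=("P", "Q", "R")):
    """Emit `block` opcodes from each chain in turn until every chain has n_ops."""
    order = []
    for start in range(0, n_ops, block):
        for chain in chains:
            for i in range(start, min(start + block, n_ops)):
                order.append((chain, i))
    return order


def simulate(order, rs_size=32, fetch_width=4, issue_width=3):
    """Return the cycle in which the last opcode issues (toy model, assumed parameters)."""
    ready_at = {}   # opcode -> first cycle in which its result can be consumed
    window = []     # fetched but not yet issued opcodes, oldest first (the "RS")
    next_fetch = 0
    cycle = 0
    while len(ready_at) < len(order):
        cycle += 1
        # Issue stage: pick up to issue_width ready opcodes, oldest first.
        issued = []
        for chain, i in window:
            if len(issued) == issue_width:
                break
            # Ready if it heads its chain or its predecessor's result is available.
            if i == 0 or ready_at.get((chain, i - 1), cycle + 1) <= cycle:
                issued.append((chain, i))
        for op in issued:
            window.remove(op)
            ready_at[op] = cycle + 1  # latency of 1 cycle
        # Fetch stage: refill the window in program order.
        fetched = 0
        while (next_fetch < len(order) and len(window) < rs_size
               and fetched < fetch_width):
            window.append(order[next_fetch])
            next_fetch += 1
            fetched += 1
    return cycle


if __name__ == "__main__":
    for label, block in (("A", 1), ("B", 4)):
        print(label, simulate(program_order(12, block)))
```

With 12 opcodes per stream, this model finishes option A in 13 cycles and option B (blocks of 4) in 15. In this run the two extra cycles come from start-up skew: Q and R cannot begin until their first opcodes have been fetched behind the first block of P. After that, both orderings issue three opcodes per cycle. That is consistent with the "maybe it works just as well" guess, but only within the limits of the model.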

I prefer option A simply because it helps me see opportunities to cram more work into fewer cycles.

Thanks,
Brian
