Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Opcode ordering

I'm writing assembly code. Say I have three parallel, mostly independent streams of execution; call them P, Q and R. For the most part, the opcodes in each stream are serially dependent (each opcode uses the result of its predecessor). I'm executing the code on a Penryn processor. For the sake of simplicity, assume every opcode has a latency and throughput of 1 cycle. Is it better to:

  1. code one opcode from stream P, one from stream Q, one from stream R, then back to P again?

  2. code 3 or 4 opcodes from P, then 3 or 4 from Q, then 3 or 4 from R, then back to P again?

  3. Something else?

The "ROB Read Port Stalls" section of the Intel 64 and IA-32 Architectures Optimization Reference Manual mildly suggests that programmers keep short dependency chains together, which implies that option 2 is preferred. On the other hand, if the reservation station (RS) is tracking 32 opcodes and option 1 normally supplies 3 opcodes that are ready to go on every clock cycle, maybe option 1 works just as well.
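To make the comparison concrete, here is a rough toy model of the two orderings, written in Python rather than assembly so it can be run anywhere. It is only a sketch under assumptions of mine: a front end that delivers 4 opcodes per cycle in program order, a scheduler that issues up to 3 ready opcodes per cycle (oldest first), a 32-entry window as mentioned above, and 1-cycle latency for everything. The function names and parameters are made up for illustration. It does not model ROB read port stalls at all, so it says nothing about the effect the manual is warning about.

```python
from collections import deque

def interleaved(n, streams="PQR"):
    """Option 1: one opcode from each stream in turn."""
    return [(s, i) for i in range(n) for s in streams]

def grouped(n, group, streams="PQR"):
    """Option 2: `group` opcodes from one stream, then the next stream."""
    order = []
    for start in range(0, n, group):
        for s in streams:
            order.extend((s, i) for i in range(start, min(start + group, n)))
    return order

def simulate(order, fetch_width=4, issue_width=3, window=32):
    """Return the cycle count for a toy out-of-order core.

    Assumed model (not a description of real Penryn hardware):
    opcode (s, i) depends only on (s, i - 1), and every opcode
    has a latency of 1 cycle.
    """
    pending = deque(order)   # not yet fetched, in program order
    waiting = []             # fetched but not yet issued, in program order
    result_ready = {}        # opcode -> first cycle its result can be used
    cycle = 0
    while pending or waiting:
        # Front end: bring opcodes into the window in program order.
        for _ in range(fetch_width):
            if not pending or len(waiting) >= window:
                break
            waiting.append(pending.popleft())
        # Scheduler: issue the oldest ready opcodes, up to issue_width.
        issued = 0
        still_waiting = []
        for op in waiting:
            s, i = op
            ready = i == 0 or result_ready.get((s, i - 1), cycle + 1) <= cycle
            if ready and issued < issue_width:
                result_ready[op] = cycle + 1
                issued += 1
            else:
                still_waiting.append(op)
        waiting = still_waiting
        cycle += 1
    return cycle

if __name__ == "__main__":
    n = 12  # opcodes per stream
    print("option 1 (interleaved):", simulate(interleaved(n)))
    print("option 2 (groups of 4):", simulate(grouped(n, 4)))
```

In this model, with 12 opcodes per stream, option 1 takes 12 cycles and option 2 with groups of 4 takes 14. The difference is only the start-up and drain skew between the streams; once all three chains are in the window, both orderings keep 3 opcodes issuing per cycle. Whether real hardware behaves this way depends on exactly the effects the model leaves out.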

I prefer option 1 simply because it helps me see opportunities to cram more work into fewer cycles.

