I'm writing assembly code. Say I have three parallel, mostly independent streams of execution; call them P, Q and R. For the most part, the opcodes in each stream are serially dependent (each opcode uses the result of its predecessor). I'm executing the code on a Penryn processor. For the sake of simplicity, assume every opcode has a latency and a throughput of 1 cycle. Is it better to:
- A: code one opcode from stream P, one from stream Q, one from stream R, then go back to P again?
- B: code 3 or 4 opcodes from P, 3 or 4 opcodes from Q, 3 or 4 opcodes from R, then go back to P again?
- C: something else?
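To make the two candidate orderings concrete, here is a minimal Python sketch that only prints the program order each one produces. The helper name `program_order` and the label format (`P0`, `Q0`, ...) are made up for illustration; nothing here is real assembly, and it assumes all three streams have the same length.

```python
def program_order(names, ops_per_stream, block):
    """Return the program order when `block` consecutive opcodes are
    taken from each stream in turn (hypothetical helper)."""
    order = []
    for start in range(0, ops_per_stream, block):
        for name in names:
            stop = min(start + block, ops_per_stream)
            for i in range(start, stop):
                order.append(f"{name}{i}")
    return order


# One opcode per stream at a time.
one_at_a_time = program_order("PQR", 6, block=1)
# Three opcodes per stream at a time.
blocks_of_three = program_order("PQR", 6, block=3)

print(" ".join(one_at_a_time))
print(" ".join(blocks_of_three))
```

In both listings, `P1` depends on `P0`, `P2` on `P1`, and so on; only the distance in program order between dependent opcodes changes.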
Section 3.5.2.1 of the Intel 64 and IA-32 Architectures Optimization Reference Manual, "ROB Read Port Stalls", mildly suggests that programmers keep short dependency chains together, which makes me think option B is preferred. On the other hand, if the reservation station (RS) tracks 32 opcodes and option A normally supplies 3 opcodes that are ready to go on every clock cycle, maybe option A works just as well.
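The RS argument can be explored with a toy model. The sketch below is not a Penryn simulator: it assumes a 32-entry window, 4 opcodes allocated per cycle, 3 execution ports, 1-cycle latency, and dependencies only within a stream. All of these numbers and the function names are assumptions for illustration, and the model deliberately ignores ROB read port limits, which are exactly what Section 3.5.2.1 is about.

```python
def program_order(names, ops_per_stream, block):
    """Program order as (stream, index) pairs, `block` opcodes per stream per turn."""
    order = []
    for start in range(0, ops_per_stream, block):
        for name in names:
            for i in range(start, min(start + block, ops_per_stream)):
                order.append((name, i))
    return order


def simulate(order, window=32, alloc_width=4, ports=3):
    """Count cycles for a toy out-of-order core to execute `order`.

    Opcode (s, i) depends only on (s, i - 1). Every parameter is an
    assumed value, not a measured Penryn figure.
    """
    done = set()    # opcodes whose results are already available
    waiting = []    # opcodes in the scheduler window, oldest first
    next_op = 0     # next opcode in program order to allocate
    cycles = 0
    while len(done) < len(order):
        cycles += 1
        # Issue up to `ports` ready opcodes, oldest first.
        issued = [op for op in waiting
                  if op[1] == 0 or (op[0], op[1] - 1) in done][:ports]
        waiting = [op for op in waiting if op not in issued]
        # Allocate new opcodes into whatever window slots are free.
        take = min(alloc_width, window - len(waiting), len(order) - next_op)
        waiting += order[next_op:next_op + take]
        next_op += take
        # Results become visible at the start of the next cycle.
        done.update(issued)
    return cycles


cycles_a = simulate(program_order("PQR", 30, block=1))
cycles_b = simulate(program_order("PQR", 30, block=3))
print("option A:", cycles_a, "cycles; option B:", cycles_b, "cycles")
```

Under these assumptions the window runs far enough ahead of execution that both orderings keep the three ports busy almost every cycle, so any real difference between them would have to come from effects this model leaves out, such as register read port pressure.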
I prefer option A simply because it helps me see opportunities to cram more work into fewer cycles.
Thanks,
Brian