
__asm__ volatile code placement


I have a reproducer below where I'm not sure the compiler's placement of an "__asm__ volatile" code fragment is strictly correct. I'm assuming that the volatile qualifier on an asm statement enforces ordering relative to the surrounding code, which may be wrong. I hesitate to even bring this up because I don't want a fix for this to water down other optimizations. On the other hand, for people writing device drivers, I can see how such code motion could be catastrophic.

First I'll give the sample code, then describe the issue:

typedef unsigned long long ticks;
static ticks ET_loopStat[10] ;
static int ET_loopStack[10] ;
static int ET_loopLevel ;

static __inline__ ticks elapsed(ticks t1, ticks t0) { return t1 - t0; }

static __inline__ ticks getticks(void)
{
  unsigned a, d;
  __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));
  return ((ticks)a) | (((ticks)d) << 32);
}

static __inline__ ticks ET_PushLoopSeqId(int loopId)
{
  ++ET_loopLevel ;
  ET_loopStack[ET_loopLevel] = loopId ;
  return getticks() ;
}

static __inline__ void ET_PopLoopSeqId(ticks startTime)
{
  ticks stopTime = getticks() ;
  int idx = ET_loopStack[ET_loopLevel] ;
  ET_loopStat[idx] += elapsed(stopTime,startTime) ;
  --ET_loopLevel ;
}

void mm(double a[10UL][10UL],double b[10UL][10UL],double c[10UL][10UL])
{
  static int ET_loop2 = 3 ;
  static int ET_loop1 = 2 ;
  static int ET_loop0 = 1 ;
  ticks ET_loop2_time ;
  ticks ET_loop1_time ;
  ticks ET_loop0_time ;

  ET_loop0_time = ET_PushLoopSeqId(ET_loop0);
  for (int row = 0; row < 10; ++row) {
    ET_loop1_time = ET_PushLoopSeqId(ET_loop1);
    for (int col = 0; col < 10; ++col) {
      double sum = 0.0;
      ET_loop2_time = ET_PushLoopSeqId(ET_loop2);
      for (int k = 0; k < 10; ++k) {
        sum += (((a[row])[k]) * ((b[k])[col]));
      }
      ET_PopLoopSeqId(ET_loop2_time);
      (c[row])[col] = sum;
    }
    ET_PopLoopSeqId(ET_loop1_time);
  }
  ET_PopLoopSeqId(ET_loop0_time);
}


Compile the above code using "icc -O3 -S -std=c99 mm.c".

If you look at the assembly generated for the "ET_PopLoopSeqId(ET_loop2_time);" call in the mm function, you will see this sequence with the "icc (ICC) 12.0.0 20100512" compiler:

mulsd 720(%rsi,%r14,8), %xmm8 #47.35
rdtsc #49.7
addsd %xmm8, %xmm9 #47.9

If the "__asm__ volatile" in the getticks() function were strictly enforced, the ordering would instead be like this, with the final addsd completed before the counter is read:

mulsd 720(%rsi,%r14,8), %xmm8 #47.35
addsd %xmm8, %xmm9 #47.9
rdtsc #49.7
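
One way to ask for that second ordering explicitly, rather than relying on volatile alone, is to make the value being accumulated an operand of the asm statement. The sketch below assumes GNU-style extended asm on x86; getticks_after and timed_sum are hypothetical names that are not part of the reproducer. Because *pending is listed as a read-write memory operand, the compiler has to finish computing it before the rdtsc and cannot sink the last add past it.

```c
typedef unsigned long long ticks;

/* Hypothetical variant of getticks(): *pending is a read-write memory
   operand ("+m"), so the compiler must complete and store the value
   before the asm statement and reload it afterwards. */
static inline ticks getticks_after(double *pending)
{
    unsigned a, d;
    __asm__ volatile("rdtsc"
                     : "=a" (a), "=d" (d), "+m" (*pending));
    return ((ticks)a) | (((ticks)d) << 32);
}

/* Sums 1..n and reads the counter only once the sum is complete. */
static double timed_sum(int n, ticks *stop)
{
    double sum = 0.0;
    for (int k = 1; k <= n; ++k)
        sum += (double)k;
    *stop = getticks_after(&sum);   /* sum is an input of the asm */
    return sum;
}
```

The price is a store and reload of the accumulated value around the timestamp, and this only constrains the compiler's instruction order. It says nothing about how the processor schedules rdtsc at run time.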

The Intel 12 compiler is *much* better than "icc (ICC) 11.1 20090630", whose output I include below (the rdtsc opcode should be at the end of the code block instead of in the middle). In other words, the Intel 12 compiler almost fixed the problem:

movsd 64(%rcx,%r13), %xmm7 #47.19
mulsd 640(%rsi,%r15,8), %xmm7 #47.35
movsd 72(%rcx,%r13), %xmm8 #47.19
mulsd 720(%rsi,%r15,8), %xmm8 #47.35
movl %eax, %r9d #45.23
movl %edx, %edx #45.23
shlq $32, %rdx #45.23
addsd %xmm0, %xmm9 #47.9
addsd %xmm1, %xmm9 #47.9
orq %rdx, %r9 #45.23
rdtsc #49.7
addsd %xmm2, %xmm9 #47.9
addsd %xmm3, %xmm9 #47.9
addsd %xmm4, %xmm9 #47.9
addsd %xmm5, %xmm9 #47.9
addsd %xmm6, %xmm9 #47.9
addsd %xmm7, %xmm9 #47.9
addsd %xmm8, %xmm9 #47.9

Hopefully my belief that a volatile keyword should enforce ordering is wrong, so this will end up being a non-issue.
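
For comparison, GCC's documentation for extended asm notes that the compiler can move even volatile asm statements relative to other code, so under that definition volatile alone is not an ordering guarantee. For the device-driver case, the usual GNU C idiom is an empty asm with a "memory" clobber, which acts as a compiler-level barrier for loads and stores. The sketch below assumes a GNU-compatible compiler; COMPILER_BARRIER, dev_payload, dev_doorbell and submit are hypothetical names used only for illustration.

```c
/* Empty asm with a "memory" clobber: emits no instruction, but tells the
   compiler that memory may be read or written here, so loads and stores
   are not moved across it or kept cached in registers over it. */
#define COMPILER_BARRIER() __asm__ volatile("" ::: "memory")

static int dev_payload;    /* hypothetical data word          */
static int dev_doorbell;   /* hypothetical "go" flag          */

static void submit(int value)
{
    dev_payload = value;
    COMPILER_BARRIER();    /* payload store stays ahead of the doorbell store */
    dev_doorbell = 1;
}
```

This barrier constrains memory accesses only. Register-only arithmetic, such as the addsd instructions on sum in the listings above, is not memory traffic and can still be scheduled on either side of it, and the barrier is not a hardware fence.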


1 Reply


static volatile ticks ET_loopStat[10] ;

static volatile int ET_loopStack[10] ;

Jim Dempsey