Intel® ISA Extensions
Inline asm

In the asm code below, VTune reports the largest hotspot (CPU_CLK_UNHALTED.CORE), 5.31%, on the CVTPS2PD instruction:

..B1.10: # Preds ..B1.9 ..B1.8
.loc 1 261
movslq %r14d, %r13 #261.35
lea (%rsi,%r13,4), %rdi #261.21
movslq %ecx, %rcx #261.35
lea (%rcx,%rcx,2), %r14 #261.35
lea (%r12,%r12,8), %r12 #261.35
shlq $4, %r12 #261.35
lea (%r12,%r14,8), %r15 #261.21
movss (%r15,%rdi), %xmm0 #261.35
cvtps2pd %xmm0, %xmm7 #261.35
addsd %xmm8, %xmm7 #261.21
movaps %xmm6, %xmm8 #261.92
addsd %xmm7, %xmm8 #261.92
# LOE rax rdx rbx rbp rsi r11 r8d r9d r10d xmm1 xmm2 xmm3 xmm4 xmm5 xmm8

The asm above is part of the disassembly of one C/C++ file that accounts for almost 31% of CPU utilization in the VTune profile. Within this file, at line #261, the CVTPS2PD instruction shown above is the largest hotspot.

I tried to rewrite line #261 of the C/C++ file as GNU-syntax inline asm, using the asm above:

asm (
"movslq %r14d, %r13 \n\t"
"lea (%rsi,%r13, 4), %rdi \n\t"
"movslq %ecx, %rcx \n\t"
"lea (%rcx,%rcx, 2), %r14 \n\t"
"lea (%r12,%r12, 8), %r12 \n\t"
"shlq $04, %r12 \n\t"
"lea (%r12,%r14, 8), %r15 \n\t"
"movss (%r15,%rdi), %xmm0 \n\t"
"cvtps2pd %xmm0, %xmm7 \n\t"
"addsd %xmm8, %xmm7 \n\t"
"movaps %xmm6, %xmm8 \n\t"
"addsd %xmm7, %xmm8 \n\t"
);

With the inline asm replacing the C/C++ code at line #261, the program compiled and linked successfully, but stepping through line #261 in GDB gave:

Program received signal SIGSEGV, Segmentation fault.
0x000000000044e9a1 in orangel (nonbondlist=0x10041, ptr_ad_energy_tables=0x1, tcoord=0x10, Nnb=1,
B_calcIntElec=0 '\0', B_include_1_4_interactions=17 '\021', scale_1_4=0.0449839644, qsp_abs_charge=0x7fbffef600,
parameterArray=0x7fbffde358, B_use_non_bond_cutoff=0 '\0', B_have_flexible_residues=0 '\0') at
261 asm (
(gdb) n

Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.

Debugging the original C/C++ code at line #261 in GDB works without errors.

Please suggest.

1 Solution


Generally, tools that do event sampling, like VTune, put CPU_CLK_UNHALTED.CORE samples on the instruction _next_ to the one that actually takes longer to retire. This is due to the way sampling works: the service routine captures CS:R/EIP from the interrupt stack (interrupts statistically land on the instructions that take longer to retire), and at that time the captured instruction pointer (R/EIP) points to the next instruction. So the actual "hot spot" in the case above is not the conversion but the memory load.

I don't know exactly what you are doing with the inline asm, but it does not seem to have the final section that tells the compiler about input and output dependencies and how registers relate to variables. It is not enough to paste ICC-generated asm back into the source and expect it to work automatically. It would help if you posted a full code example, though.
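
To illustrate the point about operands, here is a minimal sketch of GNU extended inline asm for x86-64. It is not the original line #261: the helper name add_float_as_double and its parameters are hypothetical, and it only shows the pattern of loading a float, widening it to double, and adding it to an accumulator. Instead of hard-coded registers such as %r15 or %xmm7, the compiler picks the registers and is told which variables are read and written.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original program):
 * returns acc + (double)*p using the same instruction
 * sequence as the compiler output (movss, cvtps2pd, addsd). */
static double add_float_as_double(double acc, const float *p)
{
    double widened; /* scratch XMM register chosen by the compiler */

    assert(p != NULL);

    __asm__ (
        "movss    %[src], %[tmp]\n\t"  /* load one float, upper lanes zeroed */
        "cvtps2pd %[tmp], %[tmp]\n\t"  /* widen the low float to double */
        "addsd    %[tmp], %[acc]\n\t"  /* acc += widened value */
        : [acc] "+x" (acc),            /* output: read and written, SSE register */
          [tmp] "=x" (widened)         /* output: write-only scratch, SSE register */
        : [src] "m" (*p)               /* input: the float in memory */
    );
    return acc;
}
```

Called as add_float_as_double(e_internal, &table_entry), where table_entry stands for whatever float element the C expression indexes, this lets the compiler keep doing the address arithmetic in C. Because every register the asm touches is declared as an operand, no clobber list is needed here; any register used without being declared would have to be listed as clobbered.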



Max, thanks a lot for your input.

This is the line of code (line #261) that produces the asm above and for which the inline asm has to be written:


e_internal += ptr_ad_energy_tables->e_vdW_Hb[index_lt_NEINT][t2][t1] + e_desolv; L# 261

So, as per your quote, "So it is not conversion but memory load is the actual "hot spot" in the case above", in this sequence -

movss (%r15,%rdi), %xmm0 #261.35
cvtps2pd %xmm0, %xmm7 #261.35
addsd %xmm8, %xmm7 #261.21
movaps %xmm6, %xmm8 #261.92

the instruction "addsd %xmm8, %xmm7", which comes right after "cvtps2pd %xmm0, %xmm7", is the actual hotspot?

Please confirm.
