Intel® ISA Extensions

Inline asm

srimks
New Contributor II
Hello,

I have the asm code below, in which VTune attributes the largest hotspot (CPU_CLK_UNHALTED.CORE, 5.31%) to the CVTPS2PD instruction -

-----
..B1.10: # Preds ..B1.9 ..B1.8
..LN119:
.loc 1 261
movslq %r14d, %r13 #261.35
..LN121:
lea (%rsi,%r13,4), %rdi #261.21
..LN123:
movslq %ecx, %rcx #261.35
lea (%rcx,%rcx,2), %r14 #261.35
lea (%r12,%r12,8), %r12 #261.35
shlq $4, %r12 #261.35
..LN125:
lea (%r12,%r14,8), %r15 #261.21
..LN127:
movss (%r15,%rdi), %xmm0 #261.35
cvtps2pd %xmm0, %xmm7 #261.35
..LN129:
addsd %xmm8, %xmm7 #261.21
..LN131:
movaps %xmm6, %xmm8 #261.92
addsd %xmm7, %xmm8 #261.92
# LOE rax rdx rbx rbp rsi r11 r8d r9d r10d xmm1 xmm2 xmm3 xmm4 xmm5 xmm8
-----

The asm above is part of the disassembly of one C/C++ file, which VTune shows as accounting for almost 31% of CPU utilization. Within this file, the CVTPS2PD instruction above, at line 261, is the largest hotspot.

I tried to rewrite the asm above as GNU-syntax inline asm for line 261 of the C/C++ file -

---
asm (
"movslq %r14d, %r13 \n\t"
"lea (%rsi,%r13, 4), %rdi \n\t"
"movslq %ecx, %rcx \n\t"
"lea (%rcx,%rcx, 2), %r14 \n\t"
"lea (%r12,%r12, 8), %r12 \n\t"
"shlq $04, %r12 \n\t"
"lea (%r12,%r14, 8), %r15 \n\t"
"movss (%r15,%rdi), %xmm0 \n\t"
"cvtps2pd %xmm0, %xmm7 \n\t"
"addsd %xmm8, %xmm7 \n\t"
"movaps %xmm6, %xmm8 \n\t"
"addsd %xmm7, %xmm8 \n\t"
);

-----

When I replaced the C/C++ code at line 261 with this inline asm, it compiled and linked successfully, but while debugging line 261 in GDB, it gave -

--
Program received signal SIGSEGV, Segmentation fault.
0x000000000044e9a1 in orangel (nonbondlist=0x10041, ptr_ad_energy_tables=0x1, tcoord=0x10, Nnb=1,
B_calcIntElec=0 '\0', B_include_1_4_interactions=17 '\021', scale_1_4=0.0449839644, qsp_abs_charge=0x7fbffef600,
parameterArray=0x7fbffde358, B_use_non_bond_cutoff=0 '\0', B_have_flexible_residues=0 '\0') at eintcal.cc:261
261 asm (
(gdb) n

Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
--

Debugging the original C/C++ code for line 261 in GDB is successful.

Please suggest.

~BR
1 Solution
Max_L
Employee

Hi,

In general, event-sampling tools such as VTune attribute CPU_CLK_UNHALTED.CORE samples to the instruction _after_ the one that actually takes long to retire. This follows from how sampling works: the interrupt service routine captures CS:R/EIP from the interrupt stack, the interrupts statistically land on instructions that take longer to retire, and by that time the captured instruction pointer (R/EIP) already points to the next instruction. So the actual "hot spot" in the case above is not the conversion but the memory load.

I don't know exactly what you are doing with the inline asm, but it does not appear to have the final section that tells the compiler about input and output dependencies and how registers relate to variables. It is not enough to paste ICC-generated asm back into the source and expect it to work automatically. It would help if you posted a full code example here, though.
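
To illustrate the operand section mentioned above, here is a minimal sketch of GNU extended asm, assuming an x86-64 target and a GCC-compatible compiler. The function name `accumulate_entry` and its parameters are hypothetical and not taken from eintcal.cc. Instead of hard-coding registers such as %xmm7 or %r15, it lets the compiler pick registers and declares which C++ values are read and written.

```cpp
// Hypothetical helper (not from the original source): computes
//   e_internal + (double)*entry + e_desolv
// with the same SSE instructions as the listing, but with declared operands.
static double accumulate_entry(double e_internal, const float *entry,
                               double e_desolv) {
    double widened;  // scratch value; the compiler assigns an XMM register
    asm("movss    %[src], %[tmp]\n\t"   // load one float from memory
        "cvtps2pd %[tmp], %[tmp]\n\t"   // widen float to double (low lane)
        "addsd    %[des], %[tmp]\n\t"   // tmp += e_desolv
        "addsd    %[tmp], %[acc]"       // e_internal += tmp
        : [acc] "+x"(e_internal),       // read-write output in an XMM register
          [tmp] "=&x"(widened)          // early-clobber output: written before
                                        // all inputs have been consumed
        : [src] "m"(*entry),            // memory input: the table entry
          [des] "x"(e_desolv));         // XMM register input
    return e_internal;
}
```

With `float f = 2.5f;`, a call such as `accumulate_entry(1.0, &f, 0.25)` should return 3.75. Without an operand section, the compiler has no way to know that the asm reads or overwrites registers it is using for other values, which is a plausible cause of a crash like the one reported.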

Thanks,

-Max

srimks
New Contributor II

Max, thank you very much for your input.

The line of code (line 261) that produces the asm above, and for which the inline asm has to be written, is -

------------------

e_internal += ptr_ad_energy_tables->e_vdW_Hb[index_lt_NEINT][t2][t1] + e_desolv; // line 261
------------------
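
For comparison, the following is a hedged sketch of how this line maps onto the listing when left as plain C++. The struct `EnergyTables`, the constants `kNeint` and `kMaps`, and the function `add_pair_energy` are hypothetical stand-ins. The 6 x 6 inner shape is only an assumption that happens to match the 144-byte and 24-byte strides in the address arithmetic of the listing, and `float` table entries with a `double` accumulator are inferred from the `movss`, `cvtps2pd` and `addsd` instructions.

```cpp
// Assumed sizes: only the 6 x 6 inner shape is suggested by the strides
// in the listing; the outer size is made up for this sketch.
constexpr int kNeint = 4;
constexpr int kMaps = 6;

struct EnergyTables {                      // hypothetical stand-in type
    float e_vdW_Hb[kNeint][kMaps][kMaps];  // single-precision table
};

// Same statement as line 261, wrapped in a function. The compiler emits the
// index arithmetic, the float load and the float-to-double conversion on its
// own, so no hand-written asm is needed for this expression.
double add_pair_energy(double e_internal, const EnergyTables *tables,
                       int index_lt_NEINT, int t2, int t1, double e_desolv) {
    e_internal += tables->e_vdW_Hb[index_lt_NEINT][t2][t1] + e_desolv;
    return e_internal;
}
```

Because the table entry is a `float` and the other operands are `double`, the language rules promote the entry to `double` before the addition, which is where the conversion instruction in the listing comes from.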

~BR
srimks
New Contributor II

So, going by your statement that the memory load rather than the conversion is the actual "hot spot", in this sequence -

--
movss (%r15,%rdi), %xmm0 #261.35
cvtps2pd %xmm0, %xmm7 #261.35
..LN129:
addsd %xmm8, %xmm7 #261.21
..LN131:
movaps %xmm6, %xmm8 #261.92
------

is the instruction "addsd %xmm8, %xmm7", which comes right after "cvtps2pd %xmm0, %xmm7", the actual hotspot?

Please confirm.

~BR