
Inline assembly to generate most heat on SB-E

DLake1
New Contributor I

I'm curious which __asm instructions would generate the most heat on an SB-E for stability testing. With Prime95 I can get the CPU package power to just over 130 W, but experimenting with my own AVX assembly I can't get more than 100 W out of it.

14 Replies
Bernard
Valued Contributor I

AFAIK there is no published information on heat dissipation per individual instruction. From what I have been able to find, AVX instructions in general dissipate more heat than scalar instructions. I can only suppose that complex AVX instructions such as VDIVPx, which are probably implemented with microcode assists, generate more heat than simple instructions that are decoded directly into their corresponding uops.

DLake1
New Contributor I
I know transistors use the most power when they change state, so instructions that cause more transistors to switch are likely to use more power and generate more heat. There are other factors too, such as cache accesses: they use some power, but their latency may cause the CPU to do less work.
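
For reference, the switching argument corresponds to the usual first-order model of CMOS dynamic power. The symbols are the conventional textbook ones, not anything measured in this thread:

```latex
P_{\text{dyn}} \approx \alpha \, C \, V^{2} f
```

Here $\alpha$ is the activity factor (the fraction of the capacitance that switches each cycle), $C$ the switched capacitance, $V$ the supply voltage and $f$ the clock frequency. An instruction mix that toggles more bits in more units per cycle raises $\alpha$, while a stall lowers it.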
Bernard
Valued Contributor I

Yes, of course, but this depends on which internal CPU units are involved in executing that instruction. A small correction to my previous post: the VDIVPx instruction may be executed by the FP divider unit rather than with the help of microcode assists. Besides the cache, you can also add the workload of the complex instruction decoder(s) and the number of physical registers needed to store operands and temporaries.

DLake1
New Contributor I

Surprisingly, simple instructions such as vaddps consume more power than the more complex ones.

Patrick_F_Intel1
Employee

Hello CommanderLake,

I don't know which instructions use the most power. LINPACK seems to be one of the most power-hungry apps.

So the floating point intensive apps seem to use the most power but I haven't dug much further than that.

Pat

Bernard
Valued Contributor I

>>>Surprisingly, simple instructions such as vaddps consume more power than the more complex ones.>>>

Do you mean consumed power per single instruction?

DLake1
New Contributor I

iliyapolak wrote:
Do you mean consumed power per single instruction?

That seemed to be the case until I made the loop more complex. With the code below, the CPU package and IA cores power readings in AIDA are higher than anything I can get in Prime95, but surprisingly it doesn't generate as much heat:

#include "stdafx.h"
#include <cstring>            // memset
#include <boost/thread.hpp>

void execute();

int _tmain(int argc, _TCHAR* argv[]){
	// Start 8 workers, one per hardware thread being stressed.
	// execute() takes no arguments and never returns, so join_all() blocks forever.
	boost::thread_group workers;
	for (int t = 0; t < 8; ++t)
		workers.create_thread(&execute);
	workers.join_all();
	return 0;
}
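void execute(){
	auto i = new unsigned char[32];       // 32 bytes of 0xFF: one YMM register's worth
	auto j = new unsigned char[1048576];  // 1 MiB zeroed buffer streamed by the loop
	memset(i, 255, 32);
	memset(j, 0, 1048576);
	__asm{
		mov rdx, i;                       // i is a pointer: load through it, not from the variable's own storage
		vmovdqu ymm8, ymmword ptr [rdx];
		vmovdqu ymm9, ymmword ptr [rdx];
		vmovdqu ymm10, ymmword ptr [rdx];
		vmovdqu ymm11, ymmword ptr [rdx];
		vmovdqu ymm12, ymmword ptr [rdx];
		vmovdqu ymm13, ymmword ptr [rdx];
		vmovdqu ymm14, ymmword ptr [rdx];
		vmovdqu ymm15, ymmword ptr [rdx];
		mov rax, j;
		mov rcx, 4096;                    // 4096 iterations x 256 bytes = 1 MiB per pass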
	loop:
		prefetcht0 256[rax];
		prefetcht0 288[rax];
		prefetcht0 320[rax];
		prefetcht0 352[rax];
		prefetcht0 384[rax];
		prefetcht0 416[rax];
		prefetcht0 448[rax];
		prefetcht0 480[rax];
		vaddps ymm0, ymm8, 0[rax];
		vaddps ymm1, ymm9, 32[rax];
		vaddps ymm2, ymm10, 64[rax];
		vaddps ymm3, ymm11, 96[rax];
		vaddps ymm4, ymm12, 128[rax];
		vaddps ymm5, ymm13, 160[rax];
		vaddps ymm6, ymm14, 192[rax];
		vaddps ymm7, ymm15, 224[rax];
		vaddps ymm0, ymm8, ymm0;
		vaddps ymm1, ymm9, ymm1;
		vaddps ymm2, ymm10, ymm2;
		vaddps ymm3, ymm11, ymm3;
		vaddps ymm4, ymm12, ymm4;
		vaddps ymm5, ymm13, ymm5;
		vaddps ymm6, ymm14, ymm6;
		vaddps ymm7, ymm15, ymm7;
		vaddps ymm0, ymm8, ymm0;
		vaddps ymm1, ymm9, ymm1;
		vaddps ymm2, ymm10, ymm2;
		vaddps ymm3, ymm11, ymm3;
		vaddps ymm4, ymm12, ymm4;
		vaddps ymm5, ymm13, ymm5;
		vaddps ymm6, ymm14, ymm6;
		vaddps ymm7, ymm15, ymm7;
		vxorps ymm0, ymm8, 0[rax];
		vxorps ymm1, ymm9, 32[rax];
		vxorps ymm2, ymm10, 64[rax];
		vxorps ymm3, ymm11, 96[rax];
		vxorps ymm4, ymm12, 128[rax];
		vxorps ymm5, ymm13, 160[rax];
		vxorps ymm6, ymm14, 192[rax];
		vxorps ymm7, ymm15, 224[rax];
		vxorps ymm0, ymm8, ymm0;
		vxorps ymm1, ymm9, ymm1;
		vxorps ymm2, ymm10, ymm2;
		vxorps ymm3, ymm11, ymm3;
		vxorps ymm4, ymm12, ymm4;
		vxorps ymm5, ymm13, ymm5;
		vxorps ymm6, ymm14, ymm6;
		vxorps ymm7, ymm15, ymm7;
		vxorps ymm0, ymm8, ymm0;
		vxorps ymm1, ymm9, ymm1;
		vxorps ymm2, ymm10, ymm2;
		vxorps ymm3, ymm11, ymm3;
		vxorps ymm4, ymm12, ymm4;
		vxorps ymm5, ymm13, ymm5;
		vxorps ymm6, ymm14, ymm6;
		vxorps ymm7, ymm15, ymm7;
		add rax, 256;
		dec rcx;
		jnz loop;
		mov rax, j;
		mov rcx, 4096;
		jmp loop;
	}
}

 

Bernard
Valued Contributor I

I was thinking about a method to approximate the heat dissipated per single instruction. Of course, it is unreliable because of the averaging involved.

Here is the link: https://software.intel.com/en-us/forums/topic/506734

DLake1
New Contributor I

I was hoping John D. McCalpin would have something to say.

McCalpinJohn
Honored Contributor III

Unfortunately this is a multidimensional problem that is not easily analyzable.

Typically the LINPACK benchmark runs pretty close to the maximum power of the processor.  Intel has pre-compiled binaries for Linux that would make an interesting comparison with Prime95.

Bernard
Valued Contributor I

>>>Surprisingly, simple instructions such as vaddps consume more power than the more complex ones>>>

I think that in the posted code snippet vaddps consumes more power because of the heavy use of all available architectural registers. A more complex instruction like vsqrtpd can probably consume more power because of its more complicated internal implementation: the machine instruction is decoded into 3-4 uops and uses some method to compute the square root (Newton-Raphson?), so the utilization of internal units such as the FP divider, and of physical registers for result accumulation, temporaries and the subtraction at intermediate steps, is greater than for vaddps. Of course, my assumption may be wrong.

DLake1
New Contributor I

You can see the latency and throughput of the instructions here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/

The instructions with lower latency use more power. I just compared vsqrtps, vrsqrtps and vaddps, and the power figures line up nicely with the latency figures: vaddps uses the most power and vsqrtps the least.

In the sample I posted, add uses more power than xor, but executing xor after add uses more power than either one on its own, even when I simply replace the instructions so that there are the same number of instructions per loop.

Bernard
Valued Contributor I

Thanks for the info. It seems I was under the wrong assumption that more complicated instructions consume more power. At first glance it seems contradictory, or at least counterintuitive, that a simpler instruction requires more power to execute.

Bernard
Valued Contributor I

I suppose that lower latency is the reason for the higher average power consumption of the vaddps/vaddpd instructions compared with more complicated ones. Put simply, the execution pipeline, mainly the FP adder unit, processes more instructions per unit of time, hence more power is consumed. A more complex instruction like vsqrtps requires ~28 CPU cycles to execute, so the pipeline is stalled for 16-28 cycles (Agner Fog's instruction tables) until the next instruction of the same type can be executed. In the same time more than 10 vaddps instructions can be scheduled for execution, provided there are no dependencies between them.
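
The arithmetic behind that last sentence can be sketched as follows. This is only an illustration: the helper name is hypothetical, the 16-28 cycle window is the figure quoted above, and the reciprocal throughput of one cycle for independent vaddps instructions is an assumption to check against the instruction tables for your CPU.

```cpp
#include <cstdint>

// Number of independent instructions that can start while another unit is
// busy for busy_cycles, given the cycles between successive issues
// (reciprocal throughput). Assumes no dependencies between the instructions.
std::uint64_t issues_during(std::uint64_t busy_cycles,
                            std::uint64_t recip_throughput_cycles) {
    if (recip_throughput_cycles == 0) {
        return 0;  // not a meaningful throughput value
    }
    return busy_cycles / recip_throughput_cycles;
}
```

With one vaddps issued per cycle, issues_during(16, 1) gives 16 and issues_during(28, 1) gives 28, so somewhere between 16 and 28 additions can start in the time one vsqrtps occupies the divider, which is consistent with "more than 10".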
