Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

IvyBridge CPU part performance degradation when GPU part used for computations

Raistmer
Beginner
945 Views

I have seen many reports (and run my own tests) showing considerable CPU performance degradation on Ivy Bridge when the GPU part of the device is used for GPGPU work (that is, while an OpenCL compute application is running).

Also, the BayTrail APU does not seem to suffer such a large performance degradation.

What could be the reasons for this behavior (both the very considerable performance hit on Ivy Bridge and the much smaller hit on BayTrail)?

6 Replies
Patrick_F_Intel1
Employee
945 Views

Hello Raistmer,

I'm guessing that the processor hits the power limit when the GPU is fired up, and the CPU frequency is then reduced to keep the total package power below the package power limit.

You can check this by looking at the frequency on the CPU during your tests. The package reaches the power limit and has to divert relatively more power to either the GPU or CPU. There are some knobs (MSR_PP0_POLICY and MSR_PP1_POLICY) that you can play with to tell the package the relative priority of power to each part of the package (PP0 == CPU, PP1 == GPU).
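For anyone who wants to look at those two knobs, here is a minimal sketch. It assumes a Linux system with the `msr` kernel module loaded and root access, which is not the setup discussed in this thread. The register addresses (0x63A and 0x642) and the 5-bit priority field are taken from the RAPL section of the Intel SDM as I understand it, so check them against the documentation for your part. The helper names are made up for illustration.

```python
import os
import struct

# Addresses as listed in the Intel SDM RAPL section (verify for your CPU model).
MSR_PP0_POLICY = 0x63A  # PP0 == CPU cores
MSR_PP1_POLICY = 0x642  # PP1 == GPU on client parts


def decode_policy(raw):
    """Return the priority level stored in bits 4:0 of a PPx_POLICY value."""
    return raw & 0x1F


def favoured_plane(pp0_raw, pp1_raw):
    """Say which power plane has the higher priority (higher value wins)."""
    pp0, pp1 = decode_policy(pp0_raw), decode_policy(pp1_raw)
    if pp0 > pp1:
        return "CPU"
    if pp1 > pp0:
        return "GPU"
    return "equal"


def read_msr(address, cpu=0):
    """Read one 64-bit MSR through /dev/cpu/N/msr (Linux, root, `modprobe msr`)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr device uses the file offset as the register address.
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)
```

On such a system, `favoured_plane(read_msr(MSR_PP0_POLICY), read_msr(MSR_PP1_POLICY))` would report which side the package currently prefers when it has to share the power budget.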

Pat

Raistmer
Beginner
945 Views

Thanks, that was indeed one of the theories. But as I understand it, a frequency drop in this case should be detected by tools like CPU-Z and GPU-Z, right?

I'll check again, but from memory no frequency drop was detected.

There are two other theories for the slowdown, and I would like advice on how best to discriminate between them:

1) Memory bus saturation

What tool could explore this possibility? For example, what should I look for in the VTune output?

2) Cache pollution. 

Could you clarify the cache structure of the Ivy Bridge device? Does it have physically separate caches for the CPU and GPU parts, or are they shared at some level? If shared, what policy does the GPU use to read and write data? And how do write-combining buffers interact with the GPU part?

 

Raistmer
Beginner
945 Views

Patrick Fay (Intel) wrote:

There are some knobs (MSR_PP0_POLICY and MSR_PP1_POLICY) that you can play with to tell the package the relative priority of power to each part of the package (PP0 == CPU, PP1 == GPU).

And is it possible to raise that power limit as a whole, provided the device is not overheating? AFAIK some GPUs with power limits allow the limit to be raised a little if needed (provided adequate cooling is used). Does the iGPU allow this, or is only redistribution within a fixed power budget possible? We use both the CPU and the GPU for computations, and on lesser models the GPU part is not very fast, so sacrificing performance in either part is undesirable.

 

McCalpinJohn
Honored Contributor III
945 Views

VTune should be able to give you a measurement of memory traffic.  I have not tried this on a system with an integrated GPU, but if VTune reports the "raw" DRAM traffic, then you should be able to measure the required traffic for your CPU workload and your GPU workload independently.  (This may take some reconfiguring of your code to run the CPU and GPU pieces independently.)  Comparing the sum of these to the available hardware memory bandwidth should give you a good idea of whether contention for memory bandwidth is likely to be a problem.
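The comparison described above is simple arithmetic, sketched here in Python. The function names, the 64-bit (8-byte) channel width, and the 0.8 "usable fraction" threshold are assumptions for illustration. Sustained DRAM bandwidth is always somewhat below the theoretical peak, and the right threshold depends on the system.

```python
def peak_dram_bandwidth_gbs(channels, mega_transfers_per_sec, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (64-bit channels by default)."""
    return channels * mega_transfers_per_sec * 1e6 * bus_bytes / 1e9


def contention_likely(cpu_gbs, gpu_gbs, peak_gbs, usable_fraction=0.8):
    """True if the separately measured CPU and GPU traffic together exceed
    the assumed usable share of peak bandwidth."""
    return (cpu_gbs + gpu_gbs) > usable_fraction * peak_gbs


# Hypothetical example: dual-channel DDR3-1600 and made-up traffic figures.
peak = peak_dram_bandwidth_gbs(channels=2, mega_transfers_per_sec=1600)
print(f"peak: {peak:.1f} GB/s")
print("contention likely:", contention_likely(cpu_gbs=12.0, gpu_gbs=10.0, peak_gbs=peak))
```

The traffic figures would come from two separate measurement runs (CPU workload alone, GPU workload alone), as suggested above.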

 

Patrick_F_Intel1
Employee
945 Views

Hello Raistmer,

I don't know of any knobs to increase the max TDP allowed. The turbo mode allows the max power usage to exceed max TDP for a certain amount of time but this is pretty short.

From the IVB datasheet (the mobile sheet is 3rd-gen-core-family-mobile-vol-2-datasheet.pdf), there are some registers you can use to see what is limiting frequencies on pp0 and pp1.

One of them is PCU_MMIO_FREQ_CLIPPING_CAUSE_STATUS (4 bytes at MCHBAR offset 0x5C20). This register indicates the reason for any frequency clipping currently being done.

For instance, for my laptop at idle, the register shows (via the output from my id_cpu utility):

chipset: 	 	PCU_MMIO_FREQ_CLIPPING_CAUSE_STATUS below Register current status: 0/0/0/MCHBAR PCU Address Offset: 5C20-5C23h
chipset: 	No	pp1_clipped: Set if the PP1 (GT) frequency requested was clipped.
chipset: 	No	pp1_clipped_non_turbo: Set if the PP1 (GT) frequency requested was clipped, but current frequency is lower than RP1 (MAX_NON_TURBO).
chipset: 	No	pp1_clipped_edp: Set if the PP1 (GT) frequency requested was clipped by EDP limit (Vmax, Iccmax, Reliability, and so on).
chipset: 	No	pp1_clipped_hot_vr: Set if the PP1 (GT) frequency requested was clipped by HOT indication from VR on SVID.
chipset: 	No	pp1_clipped_pl2: Set if the PP1 (GT) frequency requested was clipped by PL2 (POWER_LIMIT_2) power limiting algorithm.
chipset: 	No	pp1_clipped_pl1: Set if the PP1 (GT) frequency requested was clipped by PL1 (POWER_LIMIT_1) power limiting algorithm.
chipset: 	No	pp1_clipped_thermals: Set if the PP1 (GT) frequency requested was clipped by internal Thermal Throttling algorithm.
chipset: 	No	pp1_clipped_ext_prochot: Set if the PP1 (GT) frequency requested was clipped by external PROCHOT indication.
chipset: 	No	pp0_clipped: Set if the PP0 (IA) frequency requested by OS was clipped.
chipset: 	No	pp0_clipped_n_core_turbo: Set if the PP0 (IA) frequency requested by OS was clipped, but current frequency is lower than MAX_TURBO[n-cores].
chipset: 	No	pp0_clipped_non_turbo: Set if the PP0 (IA) frequency requested by OS was clipped, but current frequency is lower than MAX_NON_TURBO.
chipset: 	No	pp0_clipped_edp: Set if the PP0 (IA) frequency requested by OS was clipped by EDP limit (Vmax, Iccmax, Reliability, and so on).
chipset: 	No	pp0_clipped_mct: Set if the PP0 (IA) frequency requested by OS was clipped by Multi Core Turbo demotion algorithm.
chipset: 	No	pp0_clipped_hot_vr: Set if the PP0 (IA) frequency requested by OS was clipped by HOT indication from VR on SVID.
chipset: 	No	pp0_clipped_pl2: Set if the PP0 (IA) frequency requested by OS was clipped by PL2 (POWER_LIMIT_2) power limiting algorithm.
chipset: 	No	pp0_clipped_gt_driver: Set if the PP0 (IA) frequency requested by OS was clipped by GT driver.
chipset: 	No	pp0_clipped_pl1: Set if the PP0 (IA) frequency requested by OS was clipped by PL1 (POWER_LIMIT_1) power limiting algorithm.
chipset: 	No	pp0_clipped_thermals: Set if the PP0 (IA) frequency requested by OS was clipped by internal Thermal Throttling algorithm.
chipset: 	No	pp0_clipped_ext_prochot: Set if the PP0 (IA) frequency requested by OS was clipped by external PROCHOT indication.

When I run plugged in (AC) with a 'spin' program (attached; it has negligible memory traffic) that just does adds in a loop on all the CPUs (with no graphics program running), I get the output below (not showing all the 'No' lines). The spin program reports 316 Million_ops/cpu_sec and Task Manager reports the average frequency as 2.58 GHz (on the Performance tab, the Speed field).

chipset: 	Yes	pp0_clipped: Set if the PP0 (IA) frequency requested by OS was clipped.

When I run plugged in (AC) without the 'spin' program and with a graphics program (the reflectdino OpenGL example program, full screen) I get the output below.  Task Manager reports the average frequency as 2.07 GHz (reflectdino.exe uses about 10% of the processor).

chipset: 	Yes	pp0_clipped: Set if the PP0 (IA) frequency requested by OS was clipped.
chipset: 	Yes	pp0_clipped_n_core_turbo: Set if the PP0 (IA) frequency requested by OS was clipped, but current frequency is lower than MAX_TURBO[n-cores].
chipset: 	Yes	pp0_clipped_gt_driver: Set if the PP0 (IA) frequency requested by OS was clipped by GT driver.

When I run plugged in (AC) with the 'spin' program on all cpus and with reflectdino I get the result below (the same result as above): spin reports 243 Million_ops/cpu_second.

chipset: 	Yes	pp0_clipped: Set if the PP0 (IA) frequency requested by OS was clipped.
chipset: 	Yes	pp0_clipped_n_core_turbo: Set if the PP0 (IA) frequency requested by OS was clipped, but current frequency is lower than MAX_TURBO[n-cores].
chipset: 	Yes	pp0_clipped_gt_driver: Set if the PP0 (IA) frequency requested by OS was clipped by GT driver.

So we see that running a graphics program can have an impact on the frequency of the CPU and further that the GPU driver can cause the CPU frequency to be clipped.
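When comparing several runs like the three above, it can help to pull out just the flags that are set. The sketch below assumes the dump format shown in this post (a `chipset:` prefix, then `Yes` or `No`, then the flag name followed by a colon). The function names are made up, and nothing here reads the register itself.

```python
def clipped_flags(dump):
    """Return the names of all flags marked 'Yes' in an id_cpu-style dump."""
    flags = []
    for line in dump.splitlines():
        parts = line.split(None, 3)  # prefix, Yes/No, "name:", description
        if len(parts) >= 3 and parts[0] == "chipset:" and parts[1] == "Yes":
            flags.append(parts[2].rstrip(":"))
    return flags


def new_flags(before, after):
    """Flags set in the 'after' dump that were not set in the 'before' dump."""
    seen = set(clipped_flags(before))
    return [name for name in clipped_flags(after) if name not in seen]


# Lines copied from the outputs in this post (descriptions shortened).
spin_only = "chipset: \tYes\tpp0_clipped: Set if the PP0 (IA) frequency was clipped."
spin_and_gpu = (
    "chipset: \tYes\tpp0_clipped: Set if the PP0 (IA) frequency was clipped.\n"
    "chipset: \tYes\tpp0_clipped_n_core_turbo: Set if clipped below MAX_TURBO.\n"
    "chipset: \tYes\tpp0_clipped_gt_driver: Set if clipped by GT driver.\n"
)
print(new_flags(spin_only, spin_and_gpu))
```

For the runs above, the flags that appear only once the graphics program is started are `pp0_clipped_n_core_turbo` and `pp0_clipped_gt_driver`.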

Note that, when I run on DC (unplugged), there isn't any GT driver clipping of the CPU frequency but this is because the graphics program power usage is capped at 1.9 Watts (versus about 5-7 watts on AC).

I've attached a picture showing the variation in power usage from my upcoming release of my IPPET utility. The system is on AC and idle for T=0-30 seconds and power usage is 2 watts. Then I start spin3.exe and it uses about 8.3 watts (the reddish color). At T=51 secs I start reflectdino (the pinkish color) and it uses about 5-6.7 watts and spin3.exe gets shunted down to about 5-6 watts. At T=80 secs I unplug the AC and reflectdino gets reduced to 1.9 watts and spin3.exe bumps back up to 8.3 watts.
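A similar power trace can be approximated without IPPET. The sketch below is not how IPPET works. It assumes a Linux system that exposes the RAPL package energy counter through the powercap sysfs interface (`energy_uj` and `max_energy_range_uj`), and the directory path is an assumption that varies between machines.

```python
import time

# Package domain on many Linux systems; the exact path is an assumption.
RAPL_DIR = "/sys/class/powercap/intel-rapl:0"


def average_watts(start_uj, end_uj, seconds, max_range_uj):
    """Average power between two energy readings given in microjoules.

    The counter wraps around at max_range_uj, so a negative difference
    is corrected by adding one full counter range.
    """
    delta = end_uj - start_uj
    if delta < 0:
        delta += max_range_uj
    return delta / 1e6 / seconds


def read_int(path):
    """Read a single integer from a sysfs file."""
    with open(path) as f:
        return int(f.read().strip())


def sample_package_watts(interval=1.0, rapl_dir=RAPL_DIR):
    """Sample the package energy counter twice and return average watts."""
    max_range = read_int(f"{rapl_dir}/max_energy_range_uj")
    start = read_int(f"{rapl_dir}/energy_uj")
    time.sleep(interval)
    end = read_int(f"{rapl_dir}/energy_uj")
    return average_watts(start, end, interval, max_range)
```

Calling `sample_package_watts()` in a loop while starting and stopping the CPU and GPU workloads should show the same kind of shift in power between the two parts (reading the counter may require root on recent kernels).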

I hope this helps.

Pat

 

 
Bernard
Valued Contributor I
945 Views

I would advise running a VTune analysis while the GPU part is executing OpenCL code. I do not know whether the VTune driver can access the GPU counters, but at least CPU performance can be monitored. You should pay attention to Front-End stalls, where memory performance is probably involved.
