Taylor Kidd

Qingpeng_N_ · ‎03-19-2014

Hi, here is my way of measuring the total power consumption on Xeon Phi, but the results do not seem reasonable. I use "Total power, win 0" (in uW) as the measured parameter. The code uses one pthread to read the power in uW every 50 milliseconds, since /sys/class/micras/power is updated roughly every 50 milliseconds. Each time the power reading is updated, I add 50*power to the energy. Is anything wrong with this way of measuring the total power consumption?
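The accumulation described above can be sketched in a few lines. This is only an illustration of the arithmetic, not the poster's code: the function name energy_joules is hypothetical, and it assumes every sample represents exactly the nominal 50 ms interval, which a real sampling loop will only approximate.

```python
def energy_joules(samples_uw, interval_ms=50.0):
    """Approximate energy from equally spaced power samples.

    samples_uw  -- power readings in microwatts (e.g. "Total power, win 0")
    interval_ms -- assumed time each sample represents, in milliseconds
    """
    # uW * ms = 1e-6 W * 1e-3 s = 1e-9 J, hence the division by 1e9.
    return sum(samples_uw) * interval_ms / 1e9
```

For instance, twenty readings of 100000000 uW (100 W) at 50 ms each cover one second and give 100 J.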

For the dgemm call in MKL, I multiply a dense 5000*2500 matrix by a 2500*5000 matrix. The result, compared with the CPU energy consumption, does not make sense to me.

Intel MIC 5110P: 59.00 Joules in 459.843 milliseconds with 156.57 GFLOPS. So the GFLOPS/Watt = 156.57/(59.00/(459.843/1000)) = 1.2202 GFLOPS/Watt.

Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz: 60.89 Joules in 798.345 milliseconds with 271.831 GFLOPS. So the GFLOPS/Watt = 271.831/(60.89/(798.345/1000)) = 3.564 GFLOPS/Watt. The Intel(R) Xeon(R) CPU power is measured with the PAPI RAPL PACKAGE_ENERGY:PACKAGE0 event.
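The two efficiency figures above follow from dividing the reported GFLOPS by the average power (energy over time). A small sketch of that arithmetic, using only the numbers quoted in this post; the helper name gflops_per_watt is hypothetical:

```python
def gflops_per_watt(gflops, energy_j, time_ms):
    """Return (average power in W, GFLOPS per Watt) for one run."""
    avg_power_w = energy_j / (time_ms / 1000.0)  # J / s = W
    return avg_power_w, gflops / avg_power_w

# Figures as reported in the post above.
mic_w, mic_eff = gflops_per_watt(156.57, 59.00, 459.843)
cpu_w, cpu_eff = gflops_per_watt(271.831, 60.89, 798.345)
print(f"MIC: {mic_w:.1f} W average, {mic_eff:.4f} GFLOPS/W")
print(f"CPU: {cpu_w:.1f} W average, {cpu_eff:.4f} GFLOPS/W")
```

Written this way, the intermediate average power (roughly 128 W for the coprocessor and 76 W for the CPU package) is visible alongside the ratio, which makes it easier to judge whether each measurement is plausible on its own.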

http://www.intel.com/content/dam/www/public/us/en/documents/performance-briefs/xeon-phi-product-family-performance-brief.pdf The performance per watt reported in the above document for the DGEMM call is above 4 GFLOPS/Watt for the MIC, while for the CPU it is less than 1.5 GFLOPS/Watt.

micpowerSample.cc

// -sh-4.2$ cat /sys/class/micras/power
// 104000000 //Total power, win 0 uW
// 104000000 //Total power, win 1 uW
// 106000000 //PCI-E connector power
// 217000000 //Instantaneous power uW
// 32000000 //Max Instantaneous power
// 26000000 //2x3 connector power
// 48000000 //2x4 connector power
// 24000000 0 995000 //Core rail; Power reading (uW), Current (uA), Voltage (uV)
// 33000000 0 1000000 //Uncore rail; Power reading, Current, Voltage
// 32000000 0 1501000 //Memory subsystem rail; Power reading, Current, Voltage

#include <stdio.h>
#include <time.h>
#include <pthread.h>

#define MICPOWER_MAX_COUNTERS 16
typedef struct MICPOWER_control_state {
long long counts[MICPOWER_MAX_COUNTERS]; // used for caching
long long lastupdate;
} MICPOWER_control_state_t;

static int read_sysfs_file(long long* counts) {
       FILE* fp = NULL;
       int i;
       int retval = 1;
       fp = fopen("/sys/class/micras/power", "r");
       if (!fp)
               return 0;
       // The first 7 lines each hold a single power value in uW.
       for (i = 0; i < MICPOWER_MAX_COUNTERS - 9; i++) {
               retval &= (fscanf(fp, "%lld", &counts[i]) == 1);
       }
       // The last 3 lines (core, uncore, memory rails) hold power, current, voltage.
       for (i = MICPOWER_MAX_COUNTERS - 9; i < MICPOWER_MAX_COUNTERS; i += 3) {
               retval &= (fscanf(fp, "%lld %lld %lld", &counts[i], &counts[i+1], &counts[i+2]) == 3);
       }
       fclose(fp);
       return retval; // 1 only if every field was read
}

volatile bool keepAlive = true;
double passEnergy = 0.0;
void* recordEnergy(void *arg) {
    long long counts[MICPOWER_MAX_COUNTERS]; // used for caching
    struct timespec ts;
    ts.tv_sec = 0;
    ts.tv_nsec = 50000000L; // 50 milliseconds
    double energy = 0.0; // running sum of power samples in uW
    while (keepAlive) {
        // Accumulate only on a successful read; counts[0] is "Total power, win 0" in uW.
        if (read_sysfs_file(counts))
            energy += counts[0];
        nanosleep(&ts, NULL);
    }
    keepAlive = true; // re-arm for the next micpower_start()
    passEnergy = energy;
    // sum(uW) * 50 ms / 1e9 = Joules
    printf("Energy used in %lf\n", energy * 50.0 / 1000 / 1000 / 1000);
    return NULL;
}

pthread_t micPthread;
void micpower_start() {
int iret1 = pthread_create(&micPthread, NULL, recordEnergy, (void*)NULL);
}

double micpower_finalize() {
keepAlive = false;
pthread_join(micPthread, NULL);
return passEnergy * 50.0 / 1000 / 1000 / 1000;
}

// int main() {
// micpower_start();
// long long sum = 0;
// double la = 9.0;
// for (int j = 0; j < 500; ++j) {
// for (int i = 0; i < 100000; ++i) {
// sum += i;
// la = la * i / 60.0;
// }
// }
// double energy = micpower_finalize();
// return 0;
// }


Sumedh_N_Intel · ‎03-21-2014

I took a quick glance at your code and it seems like you are summing up the "Total power, win 0 " measurements. However, in your calculations you report this as energy (Joules) instead of Power (W or J/Sec). Hence, your calculation simply becomes 156.57/59 = 2.65 GFLOPS/W which seems more reasonable.

With that said, you are taking only 9 samples over the run of your app. You might get better accuracy with a larger number of samples.

TaylorIoTKidd · ‎03-21-2014

Qingpeng,

(Please let me know if I am referring to you in the proper way.)

I've seen a lot of your interesting questions pop up both on LinkedIn and here.

You are obviously verifying power and energy claims for the coprocessor.

If you don't mind telling me, is this academic? Industry? I'll do my best to help you address any issues you come up with.

Regards
---
Taylor

Qingpeng_N_ · ‎03-21-2014

"I took a quick glance at your code and it seems like you are summing up the "Total power, win 0 " measurements. However, in your calculations you report this as energy (Joules) instead of Power (W or J/Sec). Hence, your calculation simply becomes 156.57/59 = 2.65 GFLOPS/W which seems more reasonable. "

In "return passEnergy * 50.0 / 1000 / 1000 / 1000;" I multiply the power (in uW) by 50 milliseconds, so the result is in Joules. There is no problem here.

"With that said, you are taking only 9 samples over the run of your app. You might get better accuracy with a larger number of samples." Sorry, I forgot to mention that I ran the kernel 20 times and measured the power over all of the runs together. The number above is the average over the 20 runs, so there should be about 180 samples.

I just tried my kernel on both the CPU and the MIC with larger matrices. On the CPU, with a 15600 * 15600 matrix, I get 1.1 GFLOPS/Watt, which is reasonable.

Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz uses a total energy of 6899.917236 Joules in 41958.895947 milliseconds with 180.958813 GFLOPS.

So the GFLOPS/Watt is 180.958813/(6899.917236/(41958.895947/1000)) = 1.1004236349016445

For the MIC Xeon Phi it is 2.427 GFLOPS/Watt, which is still below 4 but better than the CPU.

Energy used in 51446.200000
40 times passed 338927.343994, time per dgmm = 8473.183600, m=11600 k=11600 n=11600 GFLOPS/s=368.432002

>>> time=8473.183600
>>> gflops=368.432002
>>> energy=51446.2/40
>>> gflops/(energy/(time/1000))
2.427228442187425

It seems that the MIC Xeon Phi only achieves better energy efficiency (performance per Watt) on larger datasets.

Qingpeng_N_ · ‎03-21-2014

Taylor Kidd

"is this academic? Industry?" It is for academic research.

I just saw your reply on LinkedIn with a list of good power management articles, which are exactly the articles I was looking for.

I will begin reading them to understand more about the energy issues of Xeon Phi. Thanks a lot for kindly replying to my questions. I will ask again after reading your articles if I still have further questions.

Sumedh_N_Intel · ‎03-21-2014

Yes. You are right. I see what you did there.

Qingpeng_N_ · ‎03-21-2014

Hi Taylor

Is there any detailed documentation on the output of /sys/class/micras/power?

The only documentation I can find is the comment in the PAPI source. The structure there corresponds to the output of /sys/class/micras/power. I use the first line, "Total power, win 0", to measure the total energy. Is this correct?

Are there any detailed explanations of "Total power, win 0", "Total power, win 1", "PCI-E connector power", and so on?

https://icl.cs.utk.edu/papi/docs/d7/d04/linux-micpower_8c_source.html

/* Intel says
 ----
 The power measurements can be obtained from the host as well as the MIC card
 over a 50msec interval. The SMC is designed to sample power consumption only
 every 50mSecs.
 ----
 */

static MICPOWER_native_event_entry_t _micpower_native_events[] = {
    { .name = "tot0",
      .units = "uW",
      .description = "Total power, win 0",
      .resources.selector = 1
    },
    { .name = "tot1",
      .units = "uW",
      .description = "Total power, win 1",
      .resources.selector = 2
    },
    { .name = "pcie",
      .units = "uW",
      .description = "PCI-E connector power",
      .resources.selector = 3
    },
    { .name = "inst",
      .units = "uW",
      .description = "Instantaneous power",
      .resources.selector = 4
    },
    { .name = "imax",
      .units = "uW",
      .description = "Max Instantaneous power",
      .resources.selector = 5
    },
    { .name = "c2x3",
      .units = "uW",
      .description = "2x3 connector power",
      .resources.selector = 6
    },

TaylorIoTKidd · ‎03-21-2014

Hi Qingpeng,

This is one of those embarrassing times where I have the information and can't release it to you. And it's not because there is anything sensitive, just because such documents have to go through certain procedures because of their location.

The best I can suggest is to go through the source tree (which is public) and look for the file "micras_api.h". It's pretty well documented and likely has the information you need.

Regards
--
Taylor

Qingpeng_N_ · ‎03-21-2014

"micras_api.h" seems to have exactly the same documentation as the PAPI MIC component docs.

If I want to measure the total power consumption of the whole card for a program running in MIC native mode, can I just use the "Total power, win 0" event as the counter?

typedef struct mr_rsp_power {
    MrRspPws tot0; /* Total power, win 0 */
    MrRspPws tot1; /* Total power, win 1 */
    MrRspPws inst; /* Instantaneous power */
    MrRspPws imax; /* Max instantaneous power */
    MrRspPws pcie; /* PCI-E connector power */
    MrRspPws c2x3; /* 2x3 connector power */
    MrRspPws c2x4; /* 2x4 connector power */
    MrRspVrr vccp; /* Core rail */
    MrRspVrr vddg; /* Uncore rail */
    MrRspVrr vddq; /* Memory subsystem rail */
} MrRspPower;
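To make the correspondence between these field names and the sysfs text concrete, here is a minimal parsing sketch. It assumes the line order of the sample output in the first post (tot0, tot1, pcie, inst, imax, c2x3, c2x4, then the three rails), which differs from the member order of the struct above; the function name parse_micras_power and the returned dictionary layout are hypothetical, and the real file format should be checked on the card itself.

```python
SINGLE_FIELDS = ["tot0", "tot1", "pcie", "inst", "imax", "c2x3", "c2x4"]
RAIL_FIELDS = ["vccp", "vddg", "vddq"]  # core, uncore, memory subsystem rails

def parse_micras_power(text):
    """Map the text of /sys/class/micras/power to named readings.

    Assumes seven single values in uW followed by three rail lines,
    each holding power (uW), current (uA) and voltage (uV).
    """
    rows = [[int(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]
    if len(rows) != len(SINGLE_FIELDS) + len(RAIL_FIELDS):
        raise ValueError("unexpected number of lines: %d" % len(rows))
    result = {}
    for name, row in zip(SINGLE_FIELDS, rows):
        (result[name],) = row  # exactly one value expected per line
    for name, row in zip(RAIL_FIELDS, rows[len(SINGLE_FIELDS):]):
        power_uw, current_ua, voltage_uv = row
        result[name] = (power_uw, current_ua, voltage_uv)
    return result
```

With readings parsed this way, result["tot0"] is the "Total power, win 0" value in uW that the measurement in this thread accumulates.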

TaylorIoTKidd · ‎04-01-2014

Hi Qingpeng,

Now that the dreaded EOQ (End Of Quarter) stuff is out of the way, I'm working on reproducing your results.

Regards
---
Taylor

TaylorIoTKidd · ‎04-11-2014

Qingpeng,

I've been trying to reproduce your MKL results for the last couple of days.

I'm doing a little better than you but still not close to the quoted figures. I haven't started tuning yet, and am using micnativeloadex for convenience. I don't think micnativeloadex is affecting the results, but underutilizing the cores on the card probably is. See my data on time vs power at the bottom.

One thing that bothers me is your quote of 157 GFLOPS for the MIC.

You are doing a dense matrix mult of (5000x2500)x(2500x5000). Each operation involves 2 floating point ops, giving 125e9 floating point ops (5000^3). Your results indicate only 72e9 floating point ops (157e9 ops/s * .460 s) for the matrix multiply.
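The operation count above can be checked directly. The sketch below uses the standard 2*m*k*n count for a dense matrix multiply and the figures quoted in the thread; the helper name dgemm_flops is hypothetical.

```python
def dgemm_flops(m, k, n):
    """Floating point operations for an (m x k) by (k x n) multiply."""
    return 2 * m * k * n  # one multiply and one add per inner-product term

flops = dgemm_flops(5000, 2500, 5000)   # 125e9 operations
implied = 156.57e9 * 0.459843           # operations implied by the reported rate and time
rate_459ms = flops / 0.459843 / 1e9     # GFLOPS if the multiply took 459.843 ms
rate_798ms = flops / 0.798345 / 1e9     # GFLOPS if the multiply took 798.345 ms
print(flops, implied, rate_459ms, rate_798ms)
```

Under these assumptions, 125e9 operations in 459.843 ms corresponds to about 271.8 GFLOPS, and in 798.345 ms to about 156.6 GFLOPS. That matches the two GFLOPS figures in the first post with the devices exchanged, so one possible explanation for the discrepancy is that the MIC and CPU GFLOPS values were swapped when the results were written up.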

Intel MIC 5110P : 59.00 Joules in 459.843 milliseconds with 156.57 GFLOPS. So the GFLOPS/Watt=156.57/(59.00/(459.843/1000))=1.2202GFLOPS/Watt

On another topic, I also looked at the change in the wattage over time of the card. Not warming up the card first, I had it start at ~100W and then ramp up to a steady state ~173W.

I'm using a 3120A (has fan) with all power management states enabled.

I suspect the reason the power is ~173W vs the data sheet's 300W is that the processor cores are underutilized, meaning a good fraction of them are in a C1 state.

--
Taylor

TaylorIoTKidd · ‎04-11-2014

Qingpeng,

This article might be useful to you. It uses cblas_dgemm() and looks at performance using single and multiple threads on a core.

http://www.arc.vt.edu/resources/hpc/blueridge_mic.php#native

--
Taylor

PS I should clarify that the 300W I quoted above is the TDP (thermal design power) and is not necessarily what you will get when you fully utilize all 60 or so cores. It is the maximum thermal specification used by the designers of the supporting host or cluster. See the datasheet, Table 5-1 on p. 49. (If the link is stale, look at http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html. You'll have to dig down a little to find the data sheet. Or you can just search for it using "datasheet phi|mic".)

Vladimir_Dergachev · ‎04-16-2014

The power used actually climbs higher if the card is used inefficiently.

For example, you can get to 200W on a 5110P if you make a tight loop that issues prefetch instructions (only); this will make sure the CPU, memory and cache are all busy.

best

Vladimir Dergachev

TaylorIoTKidd · ‎04-16-2014

Vladimir,

That is a good point, and I'm glad you brought it up.

In this case, the inefficiency (I believe) is that all the cores are not utilized. Any unused cores will drop into a lower C-state to conserve energy / minimize the temperature of the co-processor. I'll do my best to confirm my hypothesis.

Regards
--
Taylor

jcebrian · ‎09-19-2014

Hi.

I've been trying to find out what the difference is between Total Power (win 0) and (win 1). Are they two consecutive 50 ms time windows?

Best regards, Juan Manuel

TaylorIoTKidd · ‎11-14-2014

Juan,

I believe they referred to different moving integration intervals, though I can't tell you what the sizes of the windows were. micras is gone as far as I know.

Regards
--
Taylor

Energy Measurement on MIC Xeon Phi