Hi, here is my way of measuring the total power consumption on the Xeon Phi, but the results do not seem reasonable. I use "Total power, win 0 uW" as the measured value. The code uses one pthread to read the power in uW every 50 milliseconds, since /sys/class/micras/power is updated roughly every 50 milliseconds. Each time the power reading is updated, I add 50 * power to the energy total. Is anything wrong with this way of measuring the total power consumption?
For the MKL dgemm call, I multiply a dense 5000*2500 matrix by a 2500*5000 matrix. The result, compared with the CPU energy consumption, does not make sense to me.
Intel MIC 5110P: 59.00 Joules in 459.843 milliseconds with 156.57 GFLOPS. So the GFLOPS/Watt = 156.57/(59.00/(459.843/1000)) = 1.2202 GFLOPS/Watt.
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz: 60.89 Joules in 798.345 milliseconds with 271.831 GFLOPS. So the GFLOPS/Watt = 271.831/(60.89/(798.345/1000)) = 3.564 GFLOPS/Watt. The Intel(R) Xeon(R) CPU power is measured with the PAPI RAPL PACKAGE_ENERGY:PACKAGE0 event.
http://www.intel.com/content/dam/www/public/us/en/documents/performance-briefs/xeon-phi-product-family-performance-brief.pdf The performance per watt reported for the DGEMM call in the above document is above 4 GFLOPS/Watt for the MIC and less than 1.5 GFLOPS/Watt for the CPU.
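The efficiency arithmetic in this post can be written out as a small helper, which makes the units explicit (joules divided by seconds gives watts). This is only a sketch; the function names are made up for illustration and are not part of micpowerSample.cc.

```cpp
#include <cassert>

// Average power in watts from an energy in joules and a run time in milliseconds.
double average_watts(double joules, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return joules / (elapsed_ms / 1000.0);
}

// Efficiency in GFLOPS per watt, computed the same way as in the post.
double gflops_per_watt(double gflops, double joules, double elapsed_ms) {
    return gflops / average_watts(joules, elapsed_ms);
}
```

Calling gflops_per_watt(156.57, 59.00, 459.843) reproduces the MIC figure of roughly 1.22, and gflops_per_watt(271.831, 60.89, 798.345) reproduces the CPU figure of roughly 3.56.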
micpowerSample.cc
// -sh-4.2$ cat /sys/class/micras/power
// 104000000 //Total power, win 0 uW
// 104000000 //Total power, win 1 uW
// 106000000 //PCI-E connector power
// 217000000 //Instantaneous power uW
// 32000000 //Max Instantaneous power
// 26000000 //2x3 connector power
// 48000000 //2x4 connector power
// 1) Core rail; Power reading(uW)
// 2) Core rail; Current(uA)
// 3) Core rail; Voltage(uV)
// 24000000 0 995000
// 33000000 0 1000000 //Uncore rail; Power reading, Current, Voltage
// 32000000 0 1501000 //Memory subsystem rail; Power reading, Current, Voltage
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define MICPOWER_MAX_COUNTERS 16

typedef struct MICPOWER_control_state {
    long long counts[MICPOWER_MAX_COUNTERS]; // used for caching
    long long lastupdate;
} MICPOWER_control_state_t;

// Reads all 16 values from /sys/class/micras/power into counts.
// Returns 1 if every value was parsed, 0 otherwise.
static int read_sysfs_file(long long* counts) {
    FILE* fp = NULL;
    int i;
    int ok = 1;
    fp = fopen("/sys/class/micras/power", "r");
    if (!fp)
        return 0;
    // First 7 lines: one value per line.
    for (i = 0; i < MICPOWER_MAX_COUNTERS - 9; i++) {
        ok &= (fscanf(fp, "%lld", &counts[i]) == 1);
    }
    // Last 3 lines: power, current and voltage for each rail.
    for (i = MICPOWER_MAX_COUNTERS - 9; i < MICPOWER_MAX_COUNTERS; i += 3) {
        ok &= (fscanf(fp, "%lld %lld %lld", &counts[i], &counts[i+1], &counts[i+2]) == 3);
    }
    fclose(fp);
    return ok;
}

volatile bool keepAlive = true;
double passEnergy = 0.0; // sum of "Total power, win 0" samples in uW

// Sampling thread: adds one power reading to the sum every 50 milliseconds.
void* recordEnergy(void *arg) {
    long long counts[MICPOWER_MAX_COUNTERS]; // used for caching
    struct timespec ts;
    ts.tv_sec = 0;
    ts.tv_nsec = 50000000L; // 50 milliseconds
    double energy = 0.0;
    while (keepAlive) {
        // Skip the sample if the file could not be read or parsed.
        if (read_sysfs_file(counts))
            energy += counts[0];
        nanosleep(&ts, NULL);
    }
    keepAlive = true; // re-arm for the next measurement
    passEnergy = energy;
    // uW * 50 ms = 50e-9 J per unit, hence the division by 1e9.
    printf("Energy used: %lf J\n", energy * 50.0 / 1000 / 1000 / 1000);
    return NULL;
}

pthread_t micPthread;

void micpower_start() {
    if (pthread_create(&micPthread, NULL, recordEnergy, (void*)NULL) != 0)
        fprintf(stderr, "micpower_start: pthread_create failed\n");
}

// Stops the sampling thread and returns the energy in Joules.
double micpower_finalize() {
    keepAlive = false;
    pthread_join(micPthread, NULL);
    return passEnergy * 50.0 / 1000 / 1000 / 1000;
}
// int main() {
// micpower_start();
// long long sum = 0;
// double la = 9.0;
// for (int j = 0; j < 500; ++j) {
// for (int i = 0; i < 100000; ++i) {
// sum += i;
// la = la * i / 60.0;
// }
// }
// double energy = micpower_finalize();
// return 0;
// }
//
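One thing worth checking in a loop like the one above is the assumed 50 ms interval: nanosleep() waits at least the requested time, and reading the sysfs file takes time as well, so the real spacing between samples is somewhat longer than 50 ms. A possible refinement, sketched below under the assumption that each reading is stored with a timestamp (for example from clock_gettime with CLOCK_MONOTONIC), is to integrate over the measured intervals instead. The PowerSample type and the function name are hypothetical and not part of the posted code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One reading of "Total power, win 0" together with the time it was taken.
struct PowerSample {
    double t_seconds;   // timestamp of the reading, in seconds
    double microwatts;  // value read from the power file, in uW
};

// Trapezoidal integration of power over the measured time intervals.
// Returns the energy in joules.
double integrate_energy_joules(const std::vector<PowerSample>& samples) {
    double joules = 0.0;
    for (std::size_t i = 1; i < samples.size(); ++i) {
        double dt = samples[i].t_seconds - samples[i - 1].t_seconds;
        assert(dt >= 0.0);  // timestamps must not go backwards
        double mean_watts =
            0.5 * (samples[i].microwatts + samples[i - 1].microwatts) / 1e6;
        joules += mean_watts * dt;
    }
    return joules;
}
```

With readings of a constant 100 W taken at 0 s, 0.05 s and 0.10 s this gives 10 J, the same as the fixed-interval sum; the two approaches only diverge when the real sampling interval drifts away from 50 ms.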
I took a quick glance at your code and it seems like you are summing up the "Total power, win 0 " measurements. However, in your calculations you report this as energy (Joules) instead of Power (W or J/Sec). Hence, your calculation simply becomes 156.57/59 = 2.65 GFLOPS/W which seems more reasonable.
With that said, you are taking only 9 samples over the run of your app. You might get better accuracy with a larger number of samples.
Qingpeng,
(Please let me know if I am referring to you in the proper way.)
I've seen a lot of your interesting questions pop up both on LinkedIn and here.
You are obviously verifying power and energy claims for the coprocessor.
If you don't mind telling me, is this academic? Industry? I'll do my best to help you address any issues you come up with.
Regards
---
Taylor
"I took a quick glance at your code and it seems like you are summing up the "Total power, win 0 " measurements. However, in your calculations you report this as energy (Joules) instead of Power (W or J/Sec). Hence, your calculation simply becomes 156.57/59 = 2.65 GFLOPS/W which seems more reasonable. "
return passEnergy * 50.0 / 1000 / 1000 / 1000; I multiply the summed power (in uW) by 50 milliseconds, so the result is in Joules. There is no problem here.
"With that said, you are taking only 9 samples over the run of your app. You might get better accuracy with more number of samples." Sorry, I forgot to mention that I run the kernel 20 times and measure the power over all the runs together. The number above is the average over the 20 runs, so there should be about 180 samples.
I also tried my kernel on the CPU with a larger matrix size of 15600 * 15600; the result, 1.1 GFLOPS/Watt, seems reasonable.
40 times passed 338927.343994, time per dgmm = 8473.183600, m=11600 k=11600 n=11600 GFLOPS/s=368.432002
>>> gflops=368.432002
>>> energy=51446.2/40
>>> gflops/(energy/(time/1000))
2.427228442187425
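The unit conversion and the sample count discussed in this reply can be sketched as two small helpers. The names are illustrative only; the first function performs the same arithmetic as the `* 50.0 / 1000 / 1000 / 1000` expression, and the second assumes one sample per full sampling interval in each run.

```cpp
#include <cassert>

// Converts a running sum of microwatt readings into joules, assuming every
// reading covers interval_ms milliseconds: uW * ms = 1e-9 J.
double summed_microwatts_to_joules(double sum_uw, double interval_ms) {
    assert(interval_ms > 0.0);
    return sum_uw * (interval_ms / 1000.0) / 1e6;
}

// Approximate number of samples collected over `runs` runs of run_ms each.
long expected_samples(double run_ms, double interval_ms, int runs) {
    assert(interval_ms > 0.0 && runs >= 0);
    return static_cast<long>(run_ms / interval_ms) * runs;
}
```

For a 459.843 ms run sampled every 50 ms, expected_samples gives 9 samples per run and 180 samples for 20 runs, which matches the numbers quoted above.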
"is this academic? Industry?" It is for academic research.
I just saw your reply on LinkedIn with a list of good power management articles, which are exactly what I was looking for.
I will start reading them to understand more about the energy issues of the Xeon Phi. Thanks a lot for kindly replying to my questions. If I still have questions after reading your articles, I will ask.
Yes. You are right. I see what you did there.
Hi Taylor
Is there any detailed documentation on the output of /sys/class/micras/power?
Hi Qingpeng,
This is one of those embarrassing times where I have the information and can't release it to you. And it's not because there is anything sensitive, just because such documents have to go through certain procedures because of their location.
The best I can suggest is to go through the source tree (which is public) and look for the file "micras_api.h". It's pretty well documented and likely has the information you need.
Regards
--
Taylor
"micras_api.h" seems to have exactly the same documentation as the PAPI mic component docs.
If I want to measure the total power consumption of the whole system for a MIC native-mode program, can I just use the "Total power, win 0" event as the counter?
Hi Qingpeng,
Now that the dreaded EOQ (End Of Quarter) stuff is out of the way, I'm working on reproducing your results.
Regards
---
Taylor
Qingpeng,
I've been trying to reproduce your MKL results for the last couple of days.
I'm doing a little better than you but am still not close to the quoted figures. I haven't started tuning yet, and am using micnativeloadex for convenience. I don't think micnativeloadex is affecting the results, but underutilizing the cores on the card probably is. See my data on time vs power at the bottom.
One thing that bothers me is your quote of 157 GFLOPS for the MIC.
You are doing a dense matrix multiply of (5000x2500)x(2500x5000). Each multiply-add involves 2 floating point ops, giving 125e9 floating point ops (2 x 5000 x 5000 x 2500, which equals 5000^3). Your results indicate only 72e9 floating point ops (157e9 ops/s * .460 s) for the matrix multiply.
"Intel MIC 5110P : 59.00 Joules in 459.843 milliseconds with 156.57 GFLOPS. So the GFLOPS/Watt=156.57/(59.00/(459.843/1000))=1.2202GFLOPS/Watt"
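The consistency check described above can be sketched in a few lines: count the operations the multiply requires, then compare that with the operations implied by the reported rate and run time. The function names are illustrative only.

```cpp
#include <cassert>

// Floating-point operations for C (m x n) = A (m x k) * B (k x n):
// one multiply and one add per inner-product term.
double dgemm_flops(double m, double n, double k) {
    return 2.0 * m * n * k;
}

// Operations implied by a reported rate in GFLOPS and a run time in milliseconds.
double implied_flops(double gflops, double elapsed_ms) {
    assert(elapsed_ms >= 0.0);
    return gflops * 1e9 * (elapsed_ms / 1000.0);
}
```

Here dgemm_flops(5000, 5000, 2500) gives 125e9, while implied_flops(156.57, 459.843) gives roughly 72e9, which is the mismatch pointed out above.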
On another topic, I also looked at the change in the wattage over time of the card. Not warming up the card first, I had it start at ~100W and then ramp up to a steady state ~173W.
I'm using a 3120A (has fan) with all power management states enabled.
I suspect the reason the power is ~173W vs the data sheet's 300W is that the processor cores are underutilized, meaning a good fraction of them are in a C1 state.
--
Taylor
Qingpeng,
This article might be useful to you. It uses cblas_dgemm() and looks at performance using single and multiple threads on a core.
http://www.arc.vt.edu/resources/hpc/blueridge_mic.php#native
--
Taylor
PS: I should clarify that the 300W I quoted above is the TDP (thermal design power) and is not necessarily what you will get when you fully utilize all 60 or so cores. It is the maximum thermal specification used by the designers of the supporting host or cluster. See the datasheet, Table 5-1 on p. 49. (If the link is stale, look at http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html. You'll have to dig down a little to find the data sheet. Or you can just search for it using "datasheet phi|mic".)
The power used actually climbs higher if the card is used inefficiently.
For example, you can get to 200W on a 5110P if you write a tight loop that issues prefetch instructions (only); this keeps the CPU, memory and cache all busy.
best
Vladimir Dergachev
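The kind of loop described above might look roughly like the sketch below. The actual loop was not posted, so everything here is an assumption: it uses the GCC/Clang builtin __builtin_prefetch rather than any Xeon Phi specific intrinsic, it assumes a 64-byte cache line, and it keeps a counter for illustration, so it is not strictly a prefetch-only loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Issues one software prefetch hint per cache line of buf, `passes` times.
// Returns the number of prefetch hints issued.
std::size_t prefetch_sweep(const std::vector<char>& buf, std::size_t passes,
                           std::size_t line_bytes = 64) {
    assert(line_bytes > 0);
    std::size_t issued = 0;
    for (std::size_t p = 0; p < passes; ++p) {
        for (std::size_t off = 0; off < buf.size(); off += line_bytes) {
            // Arguments: address, 0 = prefetch for read, 0 = no temporal locality.
            __builtin_prefetch(buf.data() + off, 0, 0);
            ++issued;
        }
    }
    return issued;
}
```

To load the memory system rather than just the caches, the buffer would presumably need to be much larger than the last-level cache, and one such loop would run on every hardware thread.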
Vladimir,
That is a good point, and I'm glad you brought it up.
In this case, the inefficiency (I believe) is that all the cores are not utilized. Any unused cores will drop into a lower C-state to conserve energy / minimize the temperature of the co-processor. I'll do my best to confirm my hypothesis.
Regards
--
Taylor
Hi.
I've been trying to find out the difference between Total Power (win 0) and (win 1). Are they two consecutive 50 ms time windows?
Best regards, Juan Manuel
Juan,
I believe they referred to different moving integration intervals, though I can't tell you what the sizes of the windows were. micras is gone as far as I know.
Regards
--
Taylor
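To illustrate what two "moving integration intervals" over the same power signal could mean, the sketch below averages the most recent samples over windows of different lengths. The window lengths are hypothetical; the real sizes used by micras are not known from this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mean of the most recent `window` samples, or of all samples if fewer exist.
double trailing_mean(const std::vector<double>& samples, std::size_t window) {
    assert(window > 0 && !samples.empty());
    std::size_t n = samples.size() < window ? samples.size() : window;
    double sum = 0.0;
    for (std::size_t i = samples.size() - n; i < samples.size(); ++i) {
        sum += samples[i];
    }
    return sum / static_cast<double>(n);
}
```

With samples of 100, 100, 100 and 200 W, a 2-sample window reports 150 W while a 4-sample window reports 125 W, so a shorter window reacts faster to a change in load and a longer one smooths it out.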
