Hi. Here is how I measure total power consumption on the Xeon Phi, but the results do not look reasonable. I use "Total power, win 0 uW" as the measured quantity. The code uses one pthread to read the power in uW every 50 milliseconds, since /sys/class/micras/power is updated roughly every 50 milliseconds. Each time the power reading is updated, I add 50 ms * power to the energy total. Is there anything wrong with measuring total energy consumption this way?
For the MKL dgemm call, I multiply a dense 5000*2500 matrix by a 2500*5000 matrix. The result, compared with the CPU's energy consumption, does not make sense to me.
Intel MIC 5110P: 59.00 Joules in 459.843 milliseconds with 156.57 GFLOPS. So GFLOPS/Watt = 156.57/(59.00/(459.843/1000)) = 1.2202 GFLOPS/Watt.
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz: 60.89 Joules in 798.345 milliseconds with 271.831 GFLOPS. So GFLOPS/Watt = 271.831/(60.89/(798.345/1000)) = 3.564 GFLOPS/Watt. The Intel(R) Xeon(R) CPU power is measured with the PAPI RAPL PACKAGE_ENERGY:PACKAGE0 event.
http://www.intel.com/content/dam/www/public/us/en/documents/performance-briefs/xeon-phi-product-family-performance-brief.pdf The performance per watt reported in the above document for the DGEMM call is above 4 GFLOPS/Watt for the MIC and less than 1.5 GFLOPS/Watt for the CPU.
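The GFLOPS/Watt arithmetic above can be written as a small helper. This is only a sketch of the formula used in the post (rate divided by average power, where average power is energy over wall time); the function name `gflops_per_watt` is made up for illustration and is not part of any library.

```cpp
#include <cassert>

// GFLOPS per watt from a sustained rate (GFLOPS), total energy (joules),
// and wall time (milliseconds). Average power = energy / time.
double gflops_per_watt(double gflops, double joules, double millis) {
    assert(joules > 0.0 && millis > 0.0);  // guard against division by zero
    double watts = joules / (millis / 1000.0);
    return gflops / watts;
}
```

With the figures from the post, `gflops_per_watt(156.57, 59.00, 459.843)` comes out at about 1.22 and `gflops_per_watt(271.831, 60.89, 798.345)` at about 3.56, matching the values quoted above.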
micpowerSample.cc
// -sh-4.2$ cat /sys/class/micras/power
// 104000000 //Total power, win 0 uW
// 104000000 //Total power, win 1 uW
// 106000000 //PCI-E connector power
// 217000000 //Instantaneous power uW
// 32000000 //Max Instantaneous power
// 26000000 //2x3 connector power
// 48000000 //2x4 connector power
// 1) Core rail; Power reading(uW)
// 2) Core rail; Current(uA)
// 3) Core rail; Voltage(uV)
// 24000000 0 995000
// 33000000 0 1000000 //Uncore rail; Power reading, Current, Voltage
// 32000000 0 1501000 //Memory subsystem rail; Power reading, Current, Voltage
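#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define MICPOWER_MAX_COUNTERS 16
typedef struct MICPOWER_control_state {
    long long counts[MICPOWER_MAX_COUNTERS]; // used for caching
    long long lastupdate;
} MICPOWER_control_state_t;

// Reads all 16 values from /sys/class/micras/power into counts.
// Returns 1 on success, 0 on failure.
static int read_sysfs_file(long long* counts) {
    FILE* fp = NULL;
    int i;
    int retval = 1;
    fp = fopen("/sys/class/micras/power", "r");
    if (!fp)
        return 0;
    // The first 7 lines hold one value each.
    for (i = 0; i < MICPOWER_MAX_COUNTERS - 9; i++) {
        retval &= (fscanf(fp, "%lld", &counts[i]) == 1);
    }
    // The remaining 3 lines hold three values each (power, current, voltage).
    for (i = MICPOWER_MAX_COUNTERS - 9; i < MICPOWER_MAX_COUNTERS; i += 3) {
        retval &= (fscanf(fp, "%lld %lld %lld", &counts[i], &counts[i+1], &counts[i+2]) == 3);
    }
    fclose(fp);
    return retval;
}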
volatile bool keepAlive = true;
double passEnergy = 0.0;

// Sampling thread: accumulates "Total power, win 0" (uW) once per 50 ms.
void* recordEnergy(void *arg) {
    int retval = 0;
    long long counts[MICPOWER_MAX_COUNTERS]; // used for caching
    struct timespec ts;
    ts.tv_sec = 0;
    ts.tv_nsec = 50000000L; // 50 milliseconds
    double energy = 0.0;
    while (keepAlive) {
        retval = read_sysfs_file(counts);
        if (retval) // skip the sample if the read failed
            energy += counts[0];
        nanosleep(&ts, NULL);
    }
    keepAlive = true;
    passEnergy = energy;
    // Sum of uW samples * 50 ms / 1e9 = Joules
    printf("Energy used in %lf\n", energy * 50.0 / 1000 / 1000 / 1000);
    return NULL;
}

pthread_t micPthread;
void micpower_start() {
    pthread_create(&micPthread, NULL, recordEnergy, NULL);
}

double micpower_finalize() {
    keepAlive = false;
    pthread_join(micPthread, NULL);
    return passEnergy * 50.0 / 1000 / 1000 / 1000;
}
// int main() {
// micpower_start();
// long long sum = 0;
// double la = 9.0;
// for (int j = 0; j < 500; ++j) {
// for (int i = 0; i < 100000; ++i) {
// sum += i;
// la = la * i / 60.0;
// }
// }
// double energy = micpower_finalize();
// return 0;
// }
//
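One thing the listing assumes is that every sample covers exactly 50 ms. Each loop iteration sleeps for 50 ms and also spends time reading the sysfs file, so the real interval is probably a little longer. A possible alternative, sketched below, records a timestamp with each reading and integrates power over the measured intervals with the trapezoidal rule. The `PowerSample` struct and `integrate_energy_joules` function are hypothetical names for this illustration; the timestamps could come from any monotonic clock, for example `clock_gettime(CLOCK_MONOTONIC, ...)`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One reading: time in seconds (from a monotonic clock) and power in microwatts.
struct PowerSample {
    double t_seconds;
    double power_uw;
};

// Trapezoidal integration of power over the measured timestamps.
// Returns energy in joules (uW * s / 1e6).
double integrate_energy_joules(const std::vector<PowerSample>& samples) {
    double joules = 0.0;
    for (std::size_t i = 1; i < samples.size(); ++i) {
        double dt = samples[i].t_seconds - samples[i - 1].t_seconds;
        assert(dt >= 0.0);  // timestamps must be non-decreasing
        double mean_uw = 0.5 * (samples[i].power_uw + samples[i - 1].power_uw);
        joules += mean_uw * dt / 1e6;
    }
    return joules;
}
```

For a constant 100 W reading (100000000 uW) sampled over one second, this returns 100 J regardless of how unevenly the samples are spaced.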
I took a quick glance at your code, and it seems you are summing up the "Total power, win 0" measurements. However, in your calculations you report this as energy (Joules) instead of power (W, or J/s). Hence, your calculation simply becomes 156.57/59 = 2.65 GFLOPS/W, which seems more reasonable.
That said, you are taking only 9 samples over the run of your app. You might get better accuracy with more samples.
Qingpeng,
(Please let me know if I am referring to you in the proper way.)
I've seen a lot of your interesting questions pop up both on LinkedIn and here.
You are obviously verifying power and energy claims for the coprocessor.
If you don't mind telling me, is this academic? Industry? I'll do my best to help you address any issues you come up with.
Regards
---
Taylor
"I took a quick glance at your code and it seems like you are summing up the "Total power, win 0 " measurements. However, in your calculations you report this as energy (Joules) instead of Power (W or J/Sec). Hence, your calculation simply becomes 156.57/59 = 2.65 GFLOPS/W which seems more reasonable. "
return passEnergy * 50.0 / 1000 / 1000 / 1000; I multiply the power by 50 milliseconds, so the result is in Joules; there is no problem here.
"With that said, you are taking only 9 samples over the run of your app. You might get better accuracy with more number of samples." Sorry, I forgot to mention that I run the kernel 20 times and measure the power across all runs together. The numbers above are the average over the 20 runs, so there should be about 180 samples.
I just tried my kernel on the CPU with a larger matrix size of 15600 * 15600; the resulting 1.1 GFLOPS/Watt seems reasonable.
40 times passed 338927.343994, time per dgmm = 8473.183600, m=11600 k=11600 n=11600 GFLOPS/s=368.432002
>>> gflops=368.432002
>>> energy=51446.2/40
>>> gflops/(energy/(time/1000))
2.427228442187425
"is this academic? Industry?" It is for academic research.
I just saw your reply on LinkedIn with a list of good power management articles, which are the kind of articles I was looking for.
I will start reading them to understand more about the energy issues of the Xeon Phi. Thanks a lot for kindly replying to my questions. If I still have questions after reading your articles, I will ask again.
Yes. You are right. I see what you did there.
Hi Taylor
Are there any detailed documents on the output of /sys/class/micras/power?
Hi Qingpeng,
This is one of those embarrassing times where I have the information and can't release it to you. And it's not because there is anything sensitive, just because such documents have to go through certain procedures because of their location.
The best I can suggest is to go through the source tree (which is public) and look for the file "micras_api.h". It's pretty well documented and likely has the information you need.
Regards
--
Taylor
"micras_api.h" seems to have the exactly same documents with PAPI mic component docs.
If I want to measure the total power consumption of the whole system for a MIC native-mode program, can I just use the "Total power, win 0" event as the counter?
Hi Qingpeng,
Now that the dreaded EOQ (End Of Quarter) stuff is out of the way, I'm working on reproducing your results.
Regards
---
Taylor
Qingpeng,
I've been trying to reproduce your MKL results for the last couple of days.
I'm doing a little better than you, but still not close to the quoted figures. I haven't started tuning yet, and I am using micnativeloadex for convenience. I don't think micnativeloadex is affecting the results, but underutilizing the cores on the card probably is. See my data on time vs. power at the bottom.
One thing that bothers me is your quote of 157 GFLOPS for the MIC.
You are doing a dense matrix mult of (5000x2500)x(2500x5000). Each operation involves 2 floating point ops, giving 125e9 floating point ops (5000^3). Your results indicate only 72e9 floating point ops (157e9 ops/s * .460 s) for the matrix multiply.
Intel MIC 5110P : 59.00 Joules in 459.843 milliseconds with 156.57 GFLOPS. So the
GFLOPS/Watt=156.57/(59.00/(459.843/1000))=1.2202GFLOPS/Watt
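The operation count in this check can be reproduced with a short sketch. The helper names `dgemm_flops` and `implied_gflops` are invented for illustration; the only assumption is the usual count of one multiply and one add per inner-product term.

```cpp
#include <cassert>

// Floating-point operations in C = A * B with A (m x k) and B (k x n):
// one multiply and one add per inner-product term.
double dgemm_flops(double m, double n, double k) {
    return 2.0 * m * n * k;
}

// Sustained rate in GFLOPS implied by an operation count and a wall time.
double implied_gflops(double flops, double seconds) {
    assert(seconds > 0.0);  // a zero or negative time has no meaningful rate
    return flops / seconds / 1e9;
}
```

For the (5000x2500)x(2500x5000) case, `dgemm_flops(5000, 5000, 2500)` is 125e9. Over 0.459843 s that implies about 271.8 GFLOPS, and over 0.798345 s about 156.6 GFLOPS. These are close to the rates the original post attributes to the other device in each case, so the two GFLOPS figures may have been transposed there, although nothing in the thread confirms that.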
On another topic, I also looked at the change in the wattage over time of the card. Not warming up the card first, I had it start at ~100W and then ramp up to a steady state ~173W.
I'm using a 3120A (has fan) with all power management states enabled.
I suspect the reason the power is ~173W vs. the data sheet's 300W is that the processor cores are underutilized, meaning a good fraction of them are in a C1 state.
--
Taylor
Qingpeng,
This article might be useful to you. It uses cblas_dgemm() and looks at performance using single and multiple threads on a core.
http://www.arc.vt.edu/resources/hpc/blueridge_mic.php#native
--
Taylor
PS I should clarify that the 300W I quoted above is the TDP (thermal design point) and is not necessarily what you will get when you fully utilize all 60 or so cores. It is the maximum thermal specification used by the designers of the supporting host or cluster. See the datasheet, Table 5-1 on p. 49. (If the link is stale, look at http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html. You'll have to dig down a little to find the data sheet. Or you can just search for it using "datasheet phi|mic".)
The power used actually climbs higher if the card is used inefficiently.
For example, you can get to 200W on a 5110P if you write a tight loop that issues only prefetch instructions; this makes sure the CPU, memory, and cache are all busy.
best
Vladimir Dergachev
Vladimir,
That is a good point, and I'm glad you brought it up.
In this case, the inefficiency (I believe) is that not all the cores are utilized. Any unused cores will drop into a lower C-state to conserve energy and minimize the temperature of the coprocessor. I'll do my best to confirm my hypothesis.
Regards
--
Taylor
Hi.
I've been trying to find out what the difference is between Total Power (win 0) and (win 1). Are they two consecutive 50 ms time windows?
Best regards, Juan Manuel
Juan,
I believe they referred to different moving integration intervals, though I can't tell you what the sizes of the windows were. micras is gone as far as I know.
Regards
--
Taylor
