Hi all,
I'm trying to maximise use of cache in matrices, so I'm testing some of my codes with both IPC and PAPI. The problem is that the results obtained are very different. I'm measuring L2 and L3 hit ratio with 4 programs:
         | PAPI L2  | PAPI L3  | IPC L2     | IPC L3
matrix0  | 0.000257 | 0.000394 | 0.0108473  | 0.0274595
matrix1  | 0.001641 | 0.590435 | 0.00420045 | 0.0081431
matrix2  | 0.001943 | 0.641179 | 0.00416087 | 0.00807843
matrix3  | 0.001849 | 0.942466 | 0.00388092 | 0.0484803
The L3 results are especially significant. With PAPI I obtain a hit ratio of 60~90%, but when measured with IPC I obtain 0~4%. The routines measured are the same, so I don't understand the results. Is IPC measuring wrongly?
For example, the code for matrix1 (accessing the matrix by columns):
With PAPI:
[bash]/**
 * Run over a regular matrix, multithreaded with OpenMP using an
 * OpenMP for loop. Events measured with PAPI.
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include "papi.h"
#include "timer.h"

#define NUM_EVENTS 5
#define nth 4
#define F 17000
#define C 17000
#define VAR double

int Events[NUM_EVENTS] = {PAPI_TOT_CYC, PAPI_L2_TCA, PAPI_L2_TCM, PAPI_L3_TCA, PAPI_L3_TCM};
long long values[NUM_EVENTS];
long long start_usec, end_usec, start_v_usec, end_v_usec, start_cycles, end_cycles;
int EventSet = PAPI_NULL;
int num_counters;
const PAPI_hw_info_t *hwinfo = NULL;

int main(int argc, char* argv[])
{
    int n;
    if ((n = PAPI_library_init(PAPI_VER_CURRENT)) != PAPI_VER_CURRENT) {
        printf("\n PAPI version (%d) different from %d \n", n, PAPI_VER_CURRENT);
    }
    /* Get the hardware information */
    if ((hwinfo = PAPI_get_hardware_info()) == NULL) {
        printf("\n PAPI: Error, PAPI_get_hardware_info returned NULL\n");
    }
    else {
        printf("\n%d CPU at %f MHz.\n", hwinfo->totalcpus, hwinfo->mhz);
    }

    // CODE TO MEASURE
    int i, j;
    // Select the number of execution threads
    omp_set_num_threads(nth);
    // Allocate the matrix
    VAR** m;
    m = (VAR**)malloc(F*sizeof(VAR*));
    for(i = 0; i < F; i++){
        m[i] = (VAR*)malloc(C*sizeof(VAR));
    }
    // Initialize the matrix
    for(i = 0; i < F; i++){
        for(j = 0; j < C; j++){
            m[i][j] = 100+i+j;
        }
    }
    start_timer(0);
    /* Start counting events */
    if ((n = PAPI_start_counters(Events, NUM_EVENTS)) != PAPI_OK)
        printf("\n Error %d: PAPI_start_counters\n", n);
    // Non-sequential: the matrix is traversed by columns
    #pragma omp parallel for shared(m) private(i,j)
    for(i = 0; i < F; i++){
        for(j = 0; j < C; j++){
            m[j][i] = (VAR)sqrt(m[j][i]);
        }
    }
    stop_timer(0);
    printf("No padding. Execution time: ");
    print_timer(0);
    // END OF CODE TO MEASURE
    if ((n = PAPI_stop_counters(values, NUM_EVENTS)) != PAPI_OK)
        printf("\n Error %d : PAPI_stop_counters\n", n);
    printf("Total Cycles: \t%lld\n", values[0]);
    printf("\nPAPI:\n");
    printf("L2 Data Accesses:\t%lld\nL2 Data Misses:\t\t%lld\n", values[1], values[2]);
    printf("\nL3 Accesses:\t\t%lld\nL3 Data Misses:\t\t%lld\n", values[3], values[4]);
    printf("\nL2 Success Rate:\t%lf\n", 1-((double)values[2]/(double)values[1]));
    printf("L3 Success Rate:\t%lf\n", 1-((double)values[4]/(double)values[3]));
    return 0;
}
[/bash]
With IPC:
[bash]#include "cpucounters.h"
#include
#include
#include
#include
#define nth 4
#define F 17000
#define C 17000
using namespace std;
int
main(){
cout<<"Testing Intel PCM\n"<program() != PCM::Success){
printf("Error Code: %d\n",ipc->program());
return -1;
}
// Begin of custom code
int i,j;
// Reservamos matriz
double** m;
m = (double**)malloc(F*sizeof(double));
for(i = 0; i = (double*)malloc(C*sizeof(double));
}
// Inicializacion matriz
for(i=0;i = 100+i+j;
}
}
//Begin of measures
SystemCounterState before_sstate = getSystemCounterState();
/**
* Ejecucion de matriz mala con parallel for.
*/
#pragma omp parallel for shared(m) private(i,j)
for(i=0;i = sqrt(m);
}
}
// End of measures
// End of custom code
SystemCounterState after_sstate = getSystemCounterState();
// Stop and detach PMU (IMPORTANT!!)
ipc->cleanup();
cout <<< "RESULTS:"<
Am I measuring something wrong? The results I obtain with PAPI make more sense to me (at least for L3).
Thanks in advance.
korso,
what are the sizes of your matrices (0, 1, 2, 3), and what is the hardware configuration you are running on (number of sockets, processor type, etc.)?
Thanks,
Roman
Hi Roman,
The matrices are 17000x17000 (C double type) in all codes. The size was chosen so they do not exceed the RAM limit (to avoid using virtual memory).
My processor is an i7 CPU 860 @ 2.80GHz.
It has a 3-level cache (C/L/W parameters; a quick size check follows the list):
L1 -> C=64; L=8; W=64 -> 32K instructions, 32K data (per core)
L2 -> C=512; L=8; W=64 -> 256K (per core)
L3 -> C=8192; L=16; W=64 -> 8192K (unified)
Sockets -> 1
Cores -> 4
RAM -> 4GB
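Reading C as the number of sets, L as the ways and W as the line size in bytes, those parameters give exactly the sizes listed:
64 x 8 x 64 B = 32 KB (L1 data, per core)
512 x 8 x 64 B = 256 KB (L2, per core)
8192 x 16 x 64 B = 8 MB (L3, shared)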
If you need any other information, just ask for it.
Thanks.
Korso,
do you know how PAPI maps its "virtual" PAPI_L3_TCA and PAPI_L3_TCM events to real hardware events, and what those events are?
A 17K x 17K x 8-byte matrix implies a data size >= 2 GByte, while the L3 cache size is only 8 MByte. Your access pattern (by column, increasing the j index) is not sequential. Why do you expect an L3 hit rate > 60%?
[bash]for(i = 0; i < F; i++){
    for(j = 0; j < C; j++){
        m[j][i] = (VAR)sqrt(m[j][i]);
    }
}[/bash]
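Just putting numbers on it: 17000 x 17000 x 8 bytes ≈ 2.31 GB, which is roughly 275 times the 8 MB L3.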
Thanks,
Roman
Hi Roman,
Well, each program is different, and the code I posted is the worst-case scenario. Let me explain a bit: the matrix0 code is a sequential access over the matrix, and the code measured is:
[bash]// Begin of measures
SystemCounterState before_sstate = getSystemCounterState();
/**
 * Run over the matrix with parallel for (sequential, row-wise access).
 */
#pragma omp parallel for shared(m) private(i,j)
for(i = 0; i < F; i++){
    for(j = 0; j < C; j++){
        m[i][j] = sqrt(m[i][j]);
    }
}
// End of measures
// End of custom code
SystemCounterState after_sstate = getSystemCounterState();[/bash]
The matrix1 code is a non-sequential access over the matrix, and the code is the same one I posted before:
[bash]// Begin of measures
SystemCounterState before_sstate = getSystemCounterState();
/**
 * Run over the bad matrix with parallel for (column-wise access).
 */
#pragma omp parallel for shared(m) private(i,j)
for(i = 0; i < F; i++){
    for(j = 0; j < C; j++){
        m[j][i] = sqrt(m[j][i]);
    }
}
// End of measures
// End of custom code[/bash]
The matrix2 code is the same as matrix1 but applying basic array padding not optimized for multi-core (the only change is that the matrix column size is slightly larger).
The matrix3 code is an experimental method that uses array padding to access the matrices by columns so that the cache only needs to store a single column of the matrix (the L2 line width W is 64 bytes/block, so since a double is 8 bytes long, an access to m[0][0] produces a cache miss and brings cells m[0][0] to m[0][7] into the cache block).
My algorithm guarantees maximal use of the cache, so if the cache can store num_threads*num_rows*8 matrix cells, the hit ratio should be nearly the same as for a sequential access. But even if my algorithm were bad, matrix0 is a sequential access, so I would expect a higher hit ratio in both caches.
In fact, I use these large matrices so I can obtain bigger differences between the worst-case scenario and my algorithm.
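To illustrate the line-reuse idea behind matrix3 (just a rough sketch reusing m, F and C from the listings above, not my actual matrix3 padding code): if the columns are processed in groups of 8 doubles, every 64-byte line brought in by a miss is fully consumed before moving down to the next row:
[bash]// Sketch only: column-oriented traversal blocked by cache line (8 doubles = 64 bytes).
#define BLK 8   /* doubles per 64-byte line */
int i, j, jb;
#pragma omp parallel for shared(m) private(i, j, jb)
for(jb = 0; jb < C; jb += BLK){              /* take 8 adjacent columns at a time */
    for(i = 0; i < F; i++){                  /* walk down the rows */
        for(j = jb; j < jb + BLK && j < C; j++){
            m[i][j] = sqrt(m[i][j]);         /* all 8 cells of the fetched line m[i][jb..jb+7] are used */
        }
    }
}[/bash]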
Regarding the PAPI question: I use PAPI_L3_TCA and PAPI_L3_TCM for total cache accesses and misses, and I obtain the hit ratio as:
hit_ratio = 1 - (misses/accesses). Both events are available and native on my processor, but I don't know any more details.
I could be using PAPI badly, but the L3 results make more sense to me (I can't understand the low L2 hit ratio, which was the main reason I switched from PAPI to IPC).
Thanks
Replying to reupload the post...
Hi korso,
Do you know how the PAPI_L3_TCA and PAPI_L3_TCM generic PAPI events are mapped to the low-level Intel event names? As far as I understand, PAPI mappings could depend on the PAPI version and also on the underlying CPU architecture. Is there a utility in PAPI that can output such a mapping on your particular system? Or any documentation?
It would also be useful to see and compare the absolute counts of L2/L3 cache hits and misses in PCM and PAPI. Could you post them here?
Thank you,
Roman
Hi korso,
Assuming your OS is Linux, you can use the "perf" utility as a third method to check, without having to modify source code. Simply run, on the command line:
> sudo perf stat -e rXXXX,rYYYY,rZZZZ,... ./<your_program> <args>
where rXXXX etc. are the hex codes formed from the Umask and EventCode of the relevant cache events. The Intel Programming Guide (Volume 3B), Chapters 18 and 19 on performance counters, gives the event codes for your processor (Core i7, Nehalem).
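For example, using the architectural last-level-cache events (EventCode 0x2E with Umask 0x4F for LLC references and Umask 0x41 for LLC misses), and assuming your binary is simply called matrix1:
[bash]# r<umask><eventcode>: r4f2e = LLC references, r412e = LLC misses
sudo perf stat -e r4f2e,r412e ./matrix1[/bash]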
Or, have you already done it?
Sanath