Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Odd cache results

korso
Beginner
975 Views
Hi all,
I'm trying to maximise cache use when working with matrices, so I'm testing some of my codes with both IPC (Intel PCM) and PAPI. The problem is that the two sets of results are very different. I'm measuring the L2 and L3 hit ratios of 4 programs:
          PAPI                   IPC
          L2        L3           L2          L3
matrix0   0.000257  0.000394     0.0108473   0.0274595
matrix1   0.001641  0.590435     0.00420045  0.0081431
matrix2   0.001943  0.641179     0.00416087  0.00807843
matrix3   0.001849  0.942466     0.00388092  0.0484803
The L3 results are especially striking. With PAPI I obtained a hit ratio of 60~90%, but when measuring with IPC I obtained 0~4%. The routines measured are the same, so I don't understand the results. Is IPC measuring wrong?
For example, the code for matrix1 (accessing a matrix by columns):
With PAPI:
[bash]
/**
 * Plain matrix run, multithreaded with OpenMP using an
 * OpenMP for loop. Events measured with PAPI.
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include "papi.h"
#include "timer.h"

#define NUM_EVENTS 5
#define nth 4
#define F 17000
#define C 17000
#define VAR double

int Events[NUM_EVENTS] = {PAPI_TOT_CYC, PAPI_L2_TCA, PAPI_L2_TCM, PAPI_L3_TCA, PAPI_L3_TCM};
long long values[NUM_EVENTS];
long long start_usec, end_usec, start_v_usec, end_v_usec, start_cycles, end_cycles;
int EventSet = PAPI_NULL;
int num_counters;
const PAPI_hw_info_t *hwinfo = NULL;

int main(int argc, char* argv[])
{
    int n;
    int i, j;

    if ((n = PAPI_library_init(PAPI_VER_CURRENT)) != PAPI_VER_CURRENT) {
        printf("\n PAPI version (%d) differs from %d \n", n, PAPI_VER_CURRENT);
    }

    /* Query the hardware information */
    if ((hwinfo = PAPI_get_hardware_info()) == NULL) {
        printf("\n Papi: Error PAPI_get_hardware_info null\n");
    } else {
        printf("\n%d CPU at %f Mhz.\n", hwinfo->totalcpus, hwinfo->mhz);
    }

    // CODE TO MEASURE

    // Select the number of execution threads
    omp_set_num_threads(nth);

    // Allocate the matrix, one malloc per row
    VAR** m;
    m = (VAR**)malloc(F*sizeof(VAR*));
    for (i = 0; i < F; i++) {
        m[i] = (VAR*)malloc(C*sizeof(VAR));
    }

    // Initialize the matrix
    for (i = 0; i < F; i++) {
        for (j = 0; j < C; j++) {
            m[i][j] = 100+i+j;
        }
    }

    start_timer(0);

    /* Start counting events */
    if ((n = PAPI_start_counters(Events, NUM_EVENTS)) != PAPI_OK)
        printf("\n Error %d: PAPI_start_counters\n", n);

    // Column-wise access: the inner loop walks down a column
    #pragma omp parallel for shared(m) private(i,j)
    for (i = 0; i < F; i++) {
        for (j = 0; j < C; j++) {
            m[j][i] = (VAR)sqrt(m[j][i]);
        }
    }

    stop_timer(0);
    printf("No padding. Execution time: ");
    print_timer(0);

    // END OF CODE TO MEASURE

    if ((n = PAPI_stop_counters(values, NUM_EVENTS)) != PAPI_OK)
        printf("\n Error %d : PAPI_stop_counters\n", n);

    printf("Total Cycles: \t%lld\n", values[0]);
    printf("\nPAPI:\n");
    printf("L2 Data Accesses:\t%lld\nL2 Data Misses:\t\t%lld\n", values[1], values[2]);
    printf("\nL3 Accesses:\t\t%lld\nL3 Data Misses:\t\t%lld\n", values[3], values[4]);
    printf("\nL2 Success Rate:\t%lf\n", 1-((double)values[2]/(double)values[1]));
    printf("L3 Success Rate:\t%lf\n", 1-((double)values[4]/(double)values[3]));
    return 0;
}
[/bash]
With IPC:
[bash]
#include "cpucounters.h"
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <cmath>

#define nth 4
#define F 17000
#define C 17000

using namespace std;

int main(){
    cout << "Testing Intel PCM\n" << endl;

    // Program the PMU once and keep the status for the error message
    PCM * ipc = PCM::getInstance();
    PCM::ErrorCode status = ipc->program();
    if(status != PCM::Success){
        printf("Error Code: %d\n", (int)status);
        return -1;
    }

    // Begin of custom code
    int i,j;

    // Allocate the matrix, one malloc per row
    double** m;
    m = (double**)malloc(F*sizeof(double*));
    for(i = 0; i < F; i++){
        m[i] = (double*)malloc(C*sizeof(double));
    }

    // Initialize the matrix
    for(i = 0; i < F; i++){
        for(j = 0; j < C; j++){
            m[i][j] = 100+i+j;
        }
    }

    // Begin of measures
    SystemCounterState before_sstate = getSystemCounterState();

    /**
     * Bad matrix run with parallel for (column-wise access).
     */
    #pragma omp parallel for shared(m) private(i,j)
    for(i = 0; i < F; i++){
        for(j = 0; j < C; j++){
            m[j][i] = sqrt(m[j][i]);
        }
    }

    // End of measures
    // End of custom code
    SystemCounterState after_sstate = getSystemCounterState();

    // Stop and detach PMU (IMPORTANT!!)
    ipc->cleanup();

    cout << "RESULTS:" << endl;
    // (the rest of the listing is cut off in the original post)
}
[/bash]
Am I measuring wrong? The results I obtain with PAPI make more sense to me (at least for L3).
Thanks in advance.
7 Replies
Roman_D_Intel
Employee
korso,
what are the sizes of your matrices (0, 1, 2, 3), and what hardware configuration are you running on (number of sockets, processor type, etc.)?
Thanks,
Roman
korso
Beginner
Hi Roman,
The matrices are 17000x17000 in all codes, of C double type. The size was chosen so as not to reach the RAM limit (and to avoid using virtual memory).
My processor is an i7 CPU 860 @ 2.80GHz
It has a 3 level cache:
L1 -> C=64; L=8; W=64 -> 32K instructions, 32K data (per core)
L2 -> C=512; L=8; W=64 -> 256K (per core)
L3 -> C=8192; L=16; W=64 -> 8192K (unified)
Sockets -> 1
Cores -> 4
RAM -> 4GB
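The cache sizes listed above can be cross-checked with a little arithmetic. This is only a minimal sketch, assuming C is the number of sets, L the number of lines per set and W the line width in bytes (the post does not define these letters); the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Capacity in bytes of a set-associative cache.
 * Assumption: sets = C, ways = L, line_bytes = W from the list above. */
size_t cache_capacity_bytes(size_t sets, size_t ways, size_t line_bytes)
{
    assert(sets > 0 && ways > 0 && line_bytes > 0);
    return sets * ways * line_bytes;
}

/* Number of elements of a given size that share one cache line. */
size_t elems_per_line(size_t line_bytes, size_t elem_bytes)
{
    assert(elem_bytes > 0);
    return line_bytes / elem_bytes;
}
```

Under that reading, cache_capacity_bytes(64, 8, 64) gives 32K, cache_capacity_bytes(512, 8, 64) gives 256K and cache_capacity_bytes(8192, 16, 64) gives 8192K, matching the sizes in the list, and elems_per_line(64, sizeof(double)) gives 8 doubles per line.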
If you need any other information, just ask for it.
Thanks.
Roman_D_Intel
Employee
Korso,
do you know how PAPI maps its "virtual" PAPI_L3_TCA and PAPI_L3_TCM events to real hardware events, and what those events are?
A 17K x 17K x 8-byte matrix implies a data size >= 2 GByte, and the L3 cache size is only 8 MByte. Your access pattern (by column, increasing j index) is not sequential. Why do you expect an L3 hit rate > 60%?
  for(i=0;i<F;i++){
    for(j=0;j<C;j++){
      m[j][i] = (VAR)sqrt(m[j][i]);
    }
  }
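The size argument above can be written out as a small calculation. This is a rough sketch only: the function names are hypothetical, the 64-byte line size is taken from korso's cache description, and the column-sweep figure assumes each row is a separate malloc, so consecutive elements of one column sit on different cache lines.

```c
#include <stdint.h>

/* Total payload of a rows x cols matrix of elem_bytes-sized cells. */
uint64_t matrix_bytes(uint64_t rows, uint64_t cols, uint64_t elem_bytes)
{
    return rows * cols * elem_bytes;
}

/* How many times larger the data set is than a cache of the given size. */
double footprint_ratio(uint64_t data_bytes, uint64_t cache_size_bytes)
{
    return (double)data_bytes / (double)cache_size_bytes;
}

/* Bytes of cache lines touched while walking down one column,
 * assuming every row lives on its own cache line(s). */
uint64_t column_sweep_bytes(uint64_t rows, uint64_t line_bytes)
{
    return rows * line_bytes;
}
```

For the numbers in this thread, matrix_bytes(17000, 17000, 8) is 2,312,000,000 bytes, about 276 times the 8 MByte L3, and column_sweep_bytes(17000, 64) is 1,088,000 bytes of cache lines per column sweep.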
Thanks,
Roman
korso
Beginner
Hi Roman,
Well, each program is different, and the code I posted is the worst-case scenario. Let me explain a bit:
matrix0 code is a sequential access to the matrix; the code measured is:
[bash]
//Begin of measures
SystemCounterState before_sstate = getSystemCounterState();

/**
 * Bad matrix run with parallel for and simple padding.
 */
#pragma omp parallel for shared(m) private(i,j)
for(i=0;i<F;i++){
    for(j=0;j<C;j++){
        m[i][j] = sqrt(m[i][j]);
    }
}

// End of measures
// End of custom code
SystemCounterState after_sstate = getSystemCounterState();
[/bash]
matrix1 code is a non-sequential access to the matrix, and the code is the same one I posted before:
[bash]
//Begin of measures
SystemCounterState before_sstate = getSystemCounterState();

/**
 * Bad matrix run with parallel for and simple padding.
 */
#pragma omp parallel for shared(m) private(i,j)
for(i=0;i<F;i++){
    for(j=0;j<C;j++){
        m[j][i] = sqrt(m[j][i]);
    }
}

// End of measures
// End of custom code
[/bash]
matrix2 code is the same as matrix1 but applies basic array padding, not optimized for multi-core (the only change is that the matrix column size is slightly greater).
matrix3 code is an experimental method that uses array padding to access the matrices by columns, so that the cache only needs to store a single column of the matrix (the L2 W is 64 bytes/block, so if a double value is 8 bytes long, an access to m[0][0] will produce a cache miss and store the block holding cells m[0][0] to m[0][7] in L3).
My algorithm guarantees that the cache size is used to the maximum, so if the cache can store num_threads*num_files*8 matrix cells, the hit ratio should be nearly the same as with a sequential access. But even if my algorithm were bad, matrix0 is a sequential access, so I expected a higher hit ratio in both caches.
In fact, I use these large matrices so that I can obtain bigger differences between the worst-case scenario and my algorithm.
About the PAPI question: I use PAPI_L3_TCA and PAPI_L3_TCM for total cache accesses and misses, and I obtain the hit ratio as:
hit_ratio = 1-(misses/accesses). Both events are available and native on my processor, but I don't know any more details.
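The formula above can be wrapped in a small helper. This is just a sketch with a hypothetical function name; it takes the raw counts as long long, the type of the PAPI values[] array in the first post, and returns -1.0 as a sentinel when there are no accesses instead of dividing by zero.

```c
/* Hit ratio from raw access and miss counts: 1 - misses/accesses.
 * Returns -1.0 (a sentinel, not a real ratio) when accesses is not positive. */
double hit_ratio(long long accesses, long long misses)
{
    if (accesses <= 0) {
        return -1.0;
    }
    return 1.0 - (double)misses / (double)accesses;
}
```

For example, hit_ratio(values[3], values[4]) would reproduce the "L3 Success Rate" line of the PAPI program. Printing the two raw counts next to the ratio also makes it easier to compare tools, since two ratios can differ simply because the tools count different kinds of accesses.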
I could be using PAPI incorrectly, but the L3 results seem to make more sense to me (I can't understand the low L2 hit ratio, and that was the main reason for me to switch from PAPI to IPC).
Thanks
korso
Beginner
Replying to reupload the post...
Roman_D_Intel
Employee
Hi korso,

Do you know how the PAPI_L3_TCA and PAPI_L3_TCM generic PAPI events are mapped to the low-level Intel events (names)? As far as I understand, PAPI mappings could depend on the PAPI version and also on the underlying CPU architecture. Is there a utility in PAPI that can output this mapping on your particular system? Or any documentation?

It would also be useful to see and compare the absolute counts of L2/L3 cache hits and misses in PCM and PAPI. Could you post them here?

Thank you,
Roman
Sanath_Jayasena
Beginner
Hi korso,

Assuming your OS is Linux, you can use the "perf" utility as a third method to check, without having to modify the source code. Simply run, on the command line:

> sudo perf stat -e rXXXX,rYYYY,rZZZZ,... ./ ...

where rXXXX etc. are the hex codes formed from the Umask and EventCode of the relevant cache events. The Intel Programming Guide (Volume 3B), Chapters 18 and 19 on Performance Counters, gives the event codes for your processor (Core i7, Nehalem).
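As an illustration of how such a code is put together, here is a minimal C sketch with hypothetical helper names. It assumes the usual x86 layout of a perf raw code, with the Umask in the high byte and the EventCode in the low byte. The example values in the note below are the architectural last-level-cache events from the Intel manual; whether PAPI or PCM count these same events is not known from this thread.

```c
#include <stdio.h>
#include <stddef.h>

/* Combine an 8-bit Umask and an 8-bit EventCode into the 16-bit value
 * that follows the 'r' in a perf raw event (Umask high, EventCode low). */
unsigned perf_raw_code(unsigned event_code, unsigned umask)
{
    return ((umask & 0xFFu) << 8) | (event_code & 0xFFu);
}

/* Format the raw code as the string passed to "perf stat -e".
 * Returns the number of characters written (snprintf semantics). */
int perf_raw_string(char *buf, size_t len, unsigned event_code, unsigned umask)
{
    return snprintf(buf, len, "r%04x", perf_raw_code(event_code, umask));
}
```

With EventCode 0x2E, Umask 0x4F (last-level cache references) formats as r4f2e and Umask 0x41 (last-level cache misses) as r412e, which could then be passed together to perf stat -e.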

Or, have you already done it?

Sanath