Cost of covariance matrix estimation

Jean-françois_D_ · ‎11-29-2012

Hello,

I'm having a bit of trouble finding information about some BLAS functions. I want to estimate a covariance matrix from a set of K vectors (each of length N). There are two ways to do that:

- put all the vectors x_k into a matrix X (size N*K) and use zgemm to compute X*X^{H}, so the computational cost is 8*K*N² flops

- call zher K times, updating the matrix each time with x_k*x_k^{H}. What is the cost of that?
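As a rough sketch of how the two options compare, the snippet below counts real floating-point operations under the convention that one complex multiply-add costs 8 real flops (6 for the multiply, 2 for the add), which is the convention behind the 8*K*N² figure. The function names are made up for illustration. The count for zher is an upper estimate: it assumes only one triangle of the Hermitian matrix is updated and ignores the small savings from the real scaling factor and the real diagonal. The same count applies to zherk, the Level 3 BLAS routine that performs the whole rank-K update in one call.

```python
# Rough real-flop counts for forming the N x N matrix R = X * X^H
# from K complex vectors of length N.
# Assumed convention: one complex multiply-add = 8 real flops
# (6 for the complex multiply, 2 for the complex add).
COMPLEX_MAC_FLOPS = 8


def zgemm_flops(n, k):
    """Full product: N*N output entries, K complex multiply-adds each."""
    return COMPLEX_MAC_FLOPS * n * n * k


def hermitian_update_flops(n, k):
    """K rank-1 updates (zher) or one rank-K update (zherk), counting
    only the N*(N+1)/2 entries of the stored triangle (upper estimate)."""
    return COMPLEX_MAC_FLOPS * (n * (n + 1) // 2) * k


n, k = 64, 1000  # hypothetical problem size
print("zgemm      :", zgemm_flops(n, k))
print("zher/zherk :", hermitian_update_flops(n, k))
```

By this count the Hermitian update needs roughly half the flops of zgemm, about 4*K*N². A lower flop count does not by itself mean a shorter run time: zher is a Level 2 (matrix-vector class) routine and is typically limited by memory bandwidth, so K separate calls may well be slower than a single zgemm or zherk call.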

Also, I'm a bit lost when it comes to computing-power calculations. Suppose that, for a given matrix-matrix multiplication, I need 200 GFlops (calculated as 8*K*N² divided by the time I have to do it). Can I compare these 200 GFlops to the theoretical peak of my CPU? I always see the power of CPUs/GPUs given in MAD GFlops. Does this mean that I can divide the 200 GFlops by 2, because one multiplication plus addition is done per cycle?

Thank you.

Jean-François

Ilya_B_Intel · ‎11-29-2012

Hello Jean-François,

There is dedicated functionality for covariance matrix estimation in the Math Kernel Library. See the Statistical Functions - Summary Statistics chapter of the Reference Manual. You can also check the following example: ./vslc/source/vslsbasicstats.c

Ilya

Jean-françois_D_ · ‎11-29-2012

Thank you Ilya. I didn't know about these functions! But I don't see any computational cost given in the reference manual for these functions!

Jean-François

Zhang_Z_Intel · ‎12-03-2012

Jean-François,

Would you please share how you get the theoretical peak performance (GFLOPS) of your processor? It seems you assume one multiplication and one addition can be done in one cycle. This is true ONLY if the processor is capable of FMA instructions (fused multiply-add). For Intel processors, FMA will be introduced in the upcoming Haswell microarchitecture in 2013. What processor do you have?

Ilya please comment on the computational cost (in terms of the number of floating-point operations) of covariance matrix estimation, but I think it should be on par with the cost of matrix multiplication.

Ilya_B_Intel · ‎12-04-2012

The computational cost depends on the method used and on whether weights are required. Practical efficiency also depends significantly on the data format (column- or row-major) and on the actual task size. In the case of VSL_SS_METHOD_FAST, with no weights and dimension << number of observations, the cost will be dominated by the ssyrk/dsyrk matrix operation.

Jean-françois_D_ · ‎12-04-2012

Zhang Z (Intel) wrote:
Jean-François,

Would you please share how you get the theoretical peak performance (GFLOPS) of your processor? It seems you assume one multiplication and one addition can be done in one cycle. This is true ONLY if the processor is capable of FMA instructions (fused multiply-add). For Intel processors, FMA will be introduced in the upcoming Haswell microarchitecture in 2013. What processor do you have?

Ilya please comment on the computational cost (in terms of the number of floating-point operations) of covariance matrix estimation, but I think it should be on par with the cost of matrix multiplication.

No, I don't. I found approximations of the theoretical peak here, for example: http://download.intel.com/pressroom/kits/xeon/5600series/pdf/Xeon_5600_PressBriefing.pdf

That is around 80 GFlops for my X5570.

For the estimation of the covariance matrix, it is pretty simple: if I use zgemm, it's 8*N²*K floating-point operations. So can I compare this number (divided by the time I have to do the calculation) to the theoretical peak, to get an idea of whether this should run fast enough?

When programming on GPUs, which I also use for bigger matrices, FMA is supported as far as I know, and thus I guess I can divide the 8*N²*K floating-point operations by 2, as one mul-add is done in one cycle.
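The comparison described above can be sketched as follows. The function name, the problem size and the time budget are hypothetical. The sketch keeps the count at 8 real flops per complex multiply-add and does not halve it, on the assumption that the quoted peak figure counts a multiply and an add as two separate operations, which is the usual convention for vendor peak numbers, including FMA-based ones.

```python
def required_gflops(n, k, seconds, flops_per_complex_mac=8):
    """Sustained rate (GFLOP/s) needed to finish an N x K by K x N
    complex product within the given time budget."""
    total_flops = flops_per_complex_mac * n * n * k
    return total_flops / seconds / 1e9


# Hypothetical sizes and deadline, for illustration only.
needed = required_gflops(n=512, k=10000, seconds=0.1)

# Peak figure taken from the post above; it is an upper bound,
# not a rate that real code is expected to sustain.
peak = 80.0

print("needed GFLOP/s:", round(needed, 1))
print("below theoretical peak:", needed <= peak)
```

If the required rate is above the peak, the deadline cannot be met on that hardware. If it is below, the result is only a necessary condition, since a real zgemm call reaches some fraction of the peak that depends on matrix sizes and memory layout.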

Zhang_Z_Intel · ‎12-05-2012

I cannot comment on GPU peak performance. But for the Intel Xeon X5570, my calculation gives a theoretical peak of 94 GFlops for double-precision floating-point operations and 188 GFlops for single-precision floating-point operations. This is based on a 2.93 GHz CPU frequency, 2 sockets, and 4 cores per socket, and assumes all operations are vectorized. You can use this information to compute an upper limit on the speed of your operations.
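The figures above can be reproduced with a short calculation. The per-cycle throughput values are an assumption inferred from the quoted numbers rather than stated in the post: 4 double-precision flops per cycle per core (a 2-wide SSE add plus a 2-wide SSE multiply each cycle) and 8 in single precision. The function name is made up for illustration.

```python
def peak_gflops(ghz, sockets, cores_per_socket, flops_per_cycle):
    """Theoretical peak = clock rate x number of cores x flops per cycle."""
    return ghz * sockets * cores_per_socket * flops_per_cycle


# Assumed per-core throughput for this CPU generation with 128-bit SSE:
# one vector add and one vector multiply issued per cycle.
DP_FLOPS_PER_CYCLE = 4  # 2 doubles wide x (add + multiply)
SP_FLOPS_PER_CYCLE = 8  # 4 floats wide x (add + multiply)

dp_peak = peak_gflops(2.93, 2, 4, DP_FLOPS_PER_CYCLE)
sp_peak = peak_gflops(2.93, 2, 4, SP_FLOPS_PER_CYCLE)

print("double precision peak:", round(dp_peak, 2), "GFlops")
print("single precision peak:", round(sp_peak, 2), "GFlops")
```

Rounded, these give the 94 and 188 GFlops quoted above. Dividing a flop count by the peak gives a lower bound on the run time, not an estimate of it.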

Jean-françois_D_ · ‎12-06-2012

Thank you, it really helps! Indeed, I have two X5570s clocked at 2.93 GHz!