Hello,

I'm having a bit of trouble finding information about some BLAS functions. I want to estimate a covariance matrix from a set of K vectors (of length N). There are two ways of doing that:

- put all the vectors x_k into a matrix X (size N*K) and use zgemm to compute X*X^{H}, so the computational cost is 8*K*N² floating-point operations

- call zher K times, updating the matrix each time with x_k*x_k^{H} --> what is the cost of that?
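To make the two options concrete, here is a minimal pure-Python sketch. It does not call BLAS: `gemm_style` and `her_style` are hypothetical stand-ins that only mimic which matrix entries zgemm and zher touch. The flop counts assume the usual convention that one complex multiply-add costs 8 real operations. Under that assumption, K calls to zher cost roughly 4*K*N², about half of the zgemm figure, because zher updates only one triangle of the Hermitian matrix.

```python
# Sketch only: pure-Python stand-ins, not the actual BLAS routines.

def gemm_style(vectors):
    """Full product X * X^H: every one of the N*N entries is computed."""
    n = len(vectors[0])
    return [[sum(x[i] * x[j].conjugate() for x in vectors) for j in range(n)]
            for i in range(n)]

def her_style(vectors):
    """K rank-1 updates A += x * x^H, touching the upper triangle only."""
    n = len(vectors[0])
    a = [[0j] * n for _ in range(n)]
    for x in vectors:
        for i in range(n):
            for j in range(i, n):  # lower triangle is left untouched
                a[i][j] += x[i] * x[j].conjugate()
    return a

def zgemm_flops(n, k):
    # Assumed convention: 8 real flops per complex multiply-add,
    # N*N output entries, K terms per entry.
    return 8 * n * n * k

def zher_flops(n, k):
    # Same convention, but only N*(N+1)/2 entries are updated per call.
    return 8 * (n * (n + 1) // 2) * k

# Two vectors of length 2 (K = 2, N = 2), chosen so the arithmetic is exact.
xs = [[1 + 2j, 3 - 1j], [1j, 2 + 0j]]
full = gemm_style(xs)
upper = her_style(xs)
```

Both functions agree on the upper triangle, and the lower triangle can be recovered by conjugate symmetry if it is needed. For large N the ratio `zher_flops(n, k) / zgemm_flops(n, k)` approaches 1/2, although K separate rank-1 updates are memory-bound and may still run slower in practice than a single matrix-matrix call.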

Also, I'm a bit lost when it comes to estimating the computing power I need. Suppose that, for a given matrix-matrix multiplication, I need 200 GFlop per second (calculated as 8*K*N² divided by the time I have to do it). Can I compare these 200 GFlop/s to the theoretical peak of my CPU? I always see the peak of CPUs/GPUs given in MAD GFlops. Does this mean that I can divide the 200 GFlop/s by 2 because one multiplication plus one addition is done per cycle?
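The arithmetic behind the required rate can be sketched as follows. The function names and the sample values of N, K and the time budget are hypothetical, picked only so that the result lands on 200 GFlop/s. The sketch also assumes the common convention that a peak quoted in multiply-add terms counts each multiply-add as two floating-point operations. If that holds for the figure you are reading, the required rate can be compared to the quoted peak directly, and dividing by 2 gives the number of multiply-add instructions rather than a smaller flop requirement.

```python
def required_gflops(n, k, seconds):
    """Rate needed to finish an N x K by K x N complex product in `seconds`."""
    flops = 8 * n * n * k          # 8 real flops per complex multiply-add
    return flops / seconds / 1e9   # GFlop per second

def multiply_add_count(flops):
    """Number of multiply-add operations, assuming each one counts as 2 flops."""
    return flops // 2

# Hypothetical workload: N = 1000, K = 25000, one second available.
rate = required_gflops(1000, 25000, 1.0)
madds = multiply_add_count(8 * 1000 * 1000 * 25000)
```

With these sample values `rate` is 200 GFlop/s. Whether the hardware can sustain it depends on how the vendor's peak was counted, so check the footnotes of the specification you compare against.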

Thank you.

Jean-François

Quote: Zhang Z (Intel) wrote:

Jean-François,

Would you please share how you get the theoretical peak performance (GFLOPS) of your processor? It seems you assume one multiplication and one addition can be done in one cycle. This is true ONLY if the processor is capable of FMA instructions (fused multiply-add). For Intel processors, FMA will be introduced in the upcoming Haswell microarchitecture in 2013. What processor do you have?

Ilya, please comment on the computational cost (in terms of the number of floating-point operations) of covariance matrix estimation, but I think it should be on par with the cost of matrix multiplication.

No, I don't. I found approximations of the theoretical peak here, for example: http://download.intel.com/pressroom/kits/xeon/5600series/pdf/Xeon_5600_PressBriefing.pdf

That is around 80 GFlops for my X5570.

For the estimation of the covariance matrix, it is pretty simple: if I use zgemm, it takes 8*N²*K floating-point operations. So can I compare this number (divided by the time I have to do the calculation) to the theoretical peak to get an idea of whether this should run fast enough?

When programming on GPUs, which I also use for bigger matrices, FMA is supported as far as I know, so I guess I can divide the 8*N²*K floating-point operations by 2, as one mul-add is done in one cycle.

Zhang Z (Intel) wrote:

I cannot comment on GPU peak performance. But for the Intel Xeon X5570, my calculation gives 94 GFlops theoretical peak performance for double-precision floating-point operations and 188 GFlops for single-precision floating-point operations. This is based on a 2.93 GHz CPU frequency, 2 sockets, 4 cores per socket, and assumes all operations are vectorized. You can use this information to compute an upper limit on the speed of your operations.
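One way to arrive at the 94 and 188 GFlops figures is sketched below. The clock rate, socket count and core count come from the thread. The per-cycle throughput is an assumption made here to reproduce the numbers: 4 double-precision or 8 single-precision flops per core per cycle, which corresponds to one 128-bit vector add and one 128-bit vector multiply issued each cycle. The function names are hypothetical.

```python
def peak_gflops(ghz, sockets, cores_per_socket, flops_per_cycle):
    """Theoretical peak in GFlop/s, assuming fully vectorized code."""
    return ghz * sockets * cores_per_socket * flops_per_cycle

def fraction_of_peak(required, peak):
    """Values above 1.0 mean the target rate exceeds the theoretical peak."""
    return required / peak

# Dual-socket X5570 at 2.93 GHz, 4 cores per socket.
# flops_per_cycle = 4 (double) and 8 (single) are assumptions, see above.
dp_peak = peak_gflops(2.93, 2, 4, 4)
sp_peak = peak_gflops(2.93, 2, 4, 8)

# The 200 GFlop/s target from the first post, against the double-precision peak.
load = fraction_of_peak(200.0, dp_peak)
```

Under these assumptions the double-precision peak rounds to 94 GFlops and the single-precision peak to 188 GFlops. A target of 200 GFlop/s in double-precision complex arithmetic is more than twice that peak, so it could not be met on this machine regardless of how well the BLAS call is tuned.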

Thank you, it really helps! Indeed, I have two X5570s clocked at 2.93 GHz!
