Intel® C++ Compiler
Support and discussions for creating C++ code that runs on platforms based on Intel® processors.

__m512 to float*

mti
Beginner
402 Views

Hi,

I am using the gcc 9.1 compiler on Linux and trying to develop a gradient descent function with AVX-512. I think I got the algorithm right, but I would like to return the optimal solution for both theta0 and theta1. According to the intrinsics documentation, __m512 holds a vector of 16 floats. I tried defining theta[2] = {0,0}, but when I loaded it with _mm512_loadu_ps and looked at the loaded data in gdb, only the first two entries held the actual data; everything else was filled with garbage, which in turn affects the final result. The following is the code for the gradient descent:

static inline float* avx512GradientDescent(float *_x, float *_y, float _alpha, size_t num_iter){
    float* thetas = (float *)aligned_alloc(ALIGNE, col*sizeof(float));
    trans(_x, xtrans);
    __m512 nsamples = _mm512_set1_ps(2*col); // broadcast to all 16 values
    __m512 samples = _mm512_set1_ps(col); // broadcast to all 16 values
    __m512 theta = _mm512_setzero_ps();
    //assert(col % 16 == 0);
    for(uint64_t i = 0; i < col; i += ALIGNE){
        __m512 hypothesis = _mm512_setzero_ps();
        __m512 loss = _mm512_setzero_ps();
        __m512 J = _mm512_setzero_ps();
        __m512 gradient = _mm512_setzero_ps();
        __m512 alpha = _mm512_set1_ps(_alpha); // broadcast to all 16 values
        __m512 xtemp = _mm512_loadu_ps(&(_x[i]));
        __m512 ytemp = _mm512_loadu_ps(&(_y[i]));
        __m512 xtranspose = _mm512_loadu_ps(&(xtrans[i]));
        for(uint64_t iter = 0; iter < num_iter; iter += 16)
        {
            hypothesis = _mm512_mul_ps(xtemp, theta);
            loss = _mm512_sub_ps(hypothesis, ytemp);
            J = _mm512_div_ps(_mm512_fmadd_ps(loss, loss, J), nsamples);
            gradient = _mm512_div_ps(_mm512_mul_ps(xtranspose, loss), samples);
            theta = _mm512_sub_ps(theta, _mm512_mul_ps(alpha, gradient));
        }
        _mm512_storeu_ps(thetas, theta);
    }

 

RahulV_intel
Moderator
368 Views

Hi,

 

Can you try the Intel compiler and see whether the issue persists with icpc as well?

 

Command:

icpc -xCORE-AVX512 filename.cpp

 

Regards,

Rahul

 

jimdempseyatthecove
Black Belt
359 Views

When you paste code, please use the paste button on the tool bar (it looks like </>) and, in the pulldown for Markup, select the source code format (C++).

Note (gripe to Intel): when the last line of the pasted code does not contain a line terminator, clicking OK returns to the main reply page with the newly pasted code selected. When you then continue typing your reply, it deletes the selected text (code). To get around this, after inserting the code and clicking OK, press the right arrow key, then Enter.

mti,

A problem I see with your code is that it does not use the results it generates. Compilers now tend to optimize such unused computations out.

>> I tried defining theta[2] = {0,0} // struct of two int's

__m512 theta[2];
theta[0] = _mm512_setzero_ps();
theta[1] = _mm512_setzero_ps();

Jim Dempsey
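
On the original symptom (garbage in lanes 2 to 15 after loading a two-element float array): a full-width _mm512_loadu_ps always reads 16 floats, so it runs past the end of a float[2]. One possible way around that, sketched below, is a masked load and a masked store, which touch only the selected lanes. The helper name updateTwoThetas and the scalar fallback are assumptions added for illustration, so that the sketch also builds when AVX-512 is not enabled at compile time.

```cpp
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical helper: subtracts step[0..1] from theta[0..1] in place.
// Only two floats are read from or written to each array.
void updateTwoThetas(float* theta, const float* step) {
#if defined(__AVX512F__)
    const __mmask16 two_lanes = 0x0003;  // bits 0 and 1 select lanes 0 and 1
    // Lanes not selected by the mask are set to zero instead of being loaded.
    __m512 t = _mm512_maskz_loadu_ps(two_lanes, theta);
    __m512 s = _mm512_maskz_loadu_ps(two_lanes, step);
    t = _mm512_sub_ps(t, s);
    // Write back lanes 0 and 1 only; nothing past theta[1] is modified.
    _mm512_mask_storeu_ps(theta, two_lanes, t);
#else
    // Scalar fallback so the sketch still compiles without AVX-512 enabled.
    theta[0] -= step[0];
    theta[1] -= step[1];
#endif
}
```

With gcc, the intrinsic path is compiled in only when AVX-512 Foundation is enabled, for example with -mavx512f; otherwise the scalar branch is used.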

RahulV_intel
Moderator
335 Views

@mti ,

 

Could you let me know if the above suggestions worked for you?

 

Thanks,

Rahul

 

RahulV_intel
Moderator
308 Views

Hi,


I have not heard back from you, so I will go ahead and close this thread from my end. However, please note that this thread will remain open for community discussion. Feel free to post a new question if you still face any issues.


--Rahul

