Hi,

Using the gcc 9.1 compiler on Linux, I am trying to develop a gradient descent function using AVX-512, and I think I got the algorithm right. However, I would like to return the optimal solution for both theta0 and theta1. According to the intrinsics documentation, `__m512` holds a vector of 16 floats. I tried defining theta[2] = {0,0}, but when I loaded it with `_mm512_loadu_ps` and inspected the loaded data in gdb, only the first two entries held the actual data; everything else was filled with garbage, which in turn affects the final result. The following is the code for the gradient descent:

```
static inline float* avx512GradientDescent(float *_x, float *_y, float _alpha, size_t num_iter){
    float* thetas = (float *)aligned_alloc(ALIGNE , col*sizeof(float));
    trans(_x, xtrans);
    __m512 nsamples = _mm512_set1_ps(2*col); // broadcast to all 16 values
    __m512 samples = _mm512_set1_ps(col);    // broadcast to all 16 values
    __m512 theta = _mm512_setzero_ps();
    //assert(col % 16 == 0);
    for(uint64_t i = 0; i < col; i += ALIGNE){
        __m512 hypothesis = _mm512_setzero_ps();
        __m512 loss = _mm512_setzero_ps();
        __m512 J = _mm512_setzero_ps();
        __m512 gradient = _mm512_setzero_ps();
        __m512 alpha = _mm512_set1_ps(_alpha); // broadcast to all 16 values
        __m512 xtemp = _mm512_loadu_ps(&(_x[i]));
        __m512 ytemp = _mm512_loadu_ps(&(_y[i]));
        __m512 xtranspose = _mm512_loadu_ps(&(xtrans[i]));
        for(uint64_t iter = 0; iter < num_iter; iter += 16)
        {
            hypothesis = _mm512_mul_ps(xtemp, theta);
            loss = _mm512_sub_ps(hypothesis, ytemp);
            J = _mm512_div_ps(_mm512_fmadd_ps(loss, loss, J), nsamples);
            gradient = _mm512_div_ps(_mm512_mul_ps(xtranspose, loss), samples);
            theta = _mm512_sub_ps(theta, _mm512_mul_ps(alpha, gradient));
        }
        _mm512_storeu_ps(thetas, theta);
    }
```
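On the garbage lanes described above: `_mm512_loadu_ps` always reads 16 consecutive floats, so loading from a 2-element array also picks up whatever happens to follow it in memory. One possible way around this is a masked load, which reads only the selected lanes and zeroes the rest. The sketch below is illustrative only: `load_first_n` is a hypothetical helper name, the runtime dispatch assumes GCC or Clang on x86 (`__builtin_cpu_supports` and the `target` attribute), and a scalar fallback with the same semantics is used when AVX-512F is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper: load the first n floats (n <= 16) into a 512-bit
   register, zero the remaining lanes, and store all 16 lanes to out. */
__attribute__((target("avx512f")))
static void load_first_n_avx512(const float *src, size_t n, float out[16]) {
    /* Bit i of the mask enables lane i; masked-out lanes are never read. */
    __mmask16 k = (__mmask16)((1u << n) - 1u);
    __m512 v = _mm512_maskz_loadu_ps(k, src);
    _mm512_storeu_ps(out, v);
}
#endif

void load_first_n(const float *src, size_t n, float out[16]) {
    assert(n <= 16);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx512f")) {
        load_first_n_avx512(src, n, out);
        return;
    }
#endif
    /* Scalar fallback: same result without AVX-512. */
    memset(out, 0, 16 * sizeof(float));
    memcpy(out, src, n * sizeof(float));
}
```

With `float theta[2]` as the source and `n = 2`, lanes 0 and 1 hold the two values and lanes 2 to 15 are zero, so stray memory contents no longer leak into later arithmetic.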

Hi,

Could you try the Intel compiler and check whether the issue persists with icpc as well?

Command:

`icpc -xCORE-AVX512 filename.cpp`

Regards,

Rahul

When you paste code, please use the paste button on the toolbar (it looks like </>) and, in the Markup pulldown, select the source code format (C++).

Note (gripe to Intel): when the last line of the pasted code does not end with a line terminator, clicking OK returns to the main reply page with the newly pasted code selected. When you then continue typing your reply, your typing replaces the selected text (the code). To get around this, after inserting the code and clicking OK, press the right arrow key, then Enter.

mti,

A problem I see with your code is that it does not use the results it generates. Compilers now tend to optimize such code out.

>> I tried defining theta[2] = {0,0} // struct of two int's

```
__m512 theta[2];
theta[0] = _mm512_setzero_ps();
theta[1] = _mm512_setzero_ps();
```
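To illustrate how two `__m512` registers could carry the two parameters, here is a minimal sketch that fits y = theta0 + theta1 * x with batch gradient descent. It is not the original poster's algorithm: the function names (`grad_sums_avx512`, `grad_sums_scalar`, `fit_line`), the output parameter `theta[2]`, and the choice to keep theta0 and theta1 as scalars while using one vector accumulator per gradient are all assumptions made for illustration. It assumes GCC or Clang on x86 and falls back to scalar code when AVX-512F is not available.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Sums loss and loss*x over n samples, where loss = theta0 + theta1*x - y.
   A lane mask handles a tail of fewer than 16 samples. */
__attribute__((target("avx512f")))
static void grad_sums_avx512(const float *x, const float *y, size_t n,
                             float theta0, float theta1, float g[2]) {
    const __m512 t0 = _mm512_set1_ps(theta0);
    const __m512 t1 = _mm512_set1_ps(theta1);
    __m512 acc[2];                      /* one accumulator per parameter */
    acc[0] = _mm512_setzero_ps();
    acc[1] = _mm512_setzero_ps();
    for (size_t i = 0; i < n; i += 16) {
        size_t rem = n - i;
        __mmask16 k = rem >= 16 ? (__mmask16)0xFFFF
                                : (__mmask16)((1u << rem) - 1u);
        __m512 xv = _mm512_maskz_loadu_ps(k, x + i);
        __m512 yv = _mm512_maskz_loadu_ps(k, y + i);
        __m512 h = _mm512_fmadd_ps(t1, xv, t0);        /* theta1*x + theta0 */
        __m512 loss = _mm512_maskz_sub_ps(k, h, yv);   /* 0 in unused lanes */
        acc[0] = _mm512_add_ps(acc[0], loss);
        acc[1] = _mm512_fmadd_ps(loss, xv, acc[1]);
    }
    g[0] = _mm512_reduce_add_ps(acc[0]);               /* horizontal sums */
    g[1] = _mm512_reduce_add_ps(acc[1]);
}
#endif

/* Scalar version of the same sums, used when AVX-512F is unavailable. */
static void grad_sums_scalar(const float *x, const float *y, size_t n,
                             float theta0, float theta1, float g[2]) {
    g[0] = 0.0f;
    g[1] = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float loss = theta0 + theta1 * x[i] - y[i];
        g[0] += loss;
        g[1] += loss * x[i];
    }
}

/* Runs num_iter gradient descent steps and writes both parameters to theta,
   so the caller actually consumes the computed result. */
void fit_line(const float *x, const float *y, size_t n,
              float alpha, size_t num_iter, float theta[2]) {
    assert(n > 0);
    theta[0] = 0.0f;
    theta[1] = 0.0f;
    for (size_t it = 0; it < num_iter; ++it) {
        float g[2];
#if defined(__x86_64__) || defined(__i386__)
        if (__builtin_cpu_supports("avx512f"))
            grad_sums_avx512(x, y, n, theta[0], theta[1], g);
        else
#endif
            grad_sums_scalar(x, y, n, theta[0], theta[1], g);
        theta[0] -= alpha * g[0] / (float)n;
        theta[1] -= alpha * g[1] / (float)n;
    }
}
```

Writing both parameters through `theta[2]` also addresses the point about unused results: the caller reads the output, so the compiler cannot discard the computation.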

Jim Dempsey

Hi,

I have not heard back from you, so I will go ahead and close this thread from my end. However, please note that this thread will remain open for community discussion. Feel free to post a new question if you still face any issues.

--Rahul
