Intel® oneAPI DPC++/C++ Compiler

[SVML functions for __m512 performed in __m256]

arsene-marzorati
Beginner

Hello everyone, 

I am trying to work with SVML functions for __m512. When I use the optimisation flag -O0, functions such as _mm512_sin_ps and _mm512_sin_pd appear to be performed on __m256 and __m256d, whereas with -O1 or higher everything seems right.

I would like to know whether this behavior is expected or whether the counters I am using are not reliable.

To check this, I used the code in bench.cpp (see the end of this message).

I'm running on Intel(R) oneAPI DPC++/C++ Compiler 2023.0.0 (2023.0.0.20221201).

 

icpx -OX -march=native -std=gnu++20 -o bench.cpp.o -c bench.cpp
icpx -OX -march=native -rdynamic bench.cpp.o -o bench

 

 With X = 0 or 1.


Then I run: 

 

perf stat --event=fp_arith_inst_retired.512b_packed_single ./bench bench.txt

 

I get 0 for this counter with flag -O0 and 190 with flag -O1.

 

perf stat --event=fp_arith_inst_retired.256b_packed_single ./bench bench.txt

 

I get 380 for this counter with flag -O0 and 0 with flag -O1.
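
As a rough consistency check on these two numbers: 160 floats and one repetition give 160 / 16 = 10 calls to _mm512_sin_ps. Assuming the counters are accurate and that every counted instruction belongs to those ten calls, that is 190 / 10 = 19 512-bit instructions per call at -O1 and 380 / 10 = 38 256-bit instructions per call at -O0, exactly twice as many. This would be consistent with each 512-bit operation being carried out as two 256-bit halves at -O0. The helper names below are hypothetical and only reproduce this arithmetic:

```cpp
#include <cassert>

// Number of times the loop body of sin_on_m512s runs:
// n_elements floats, `lanes` floats per vector, repeated `repetitions` times.
long intrinsic_calls(long n_elements, long repetitions, long lanes) {
    assert(lanes > 0 && n_elements % lanes == 0);
    return (n_elements / lanes) * repetitions;
}

// Counted FP instructions per intrinsic call (integer division).
// Assumption: every counted instruction comes from the intrinsic calls.
long per_call(long counter_value, long calls) {
    assert(calls > 0);
    return counter_value / calls;
}
```

With the values from this post, per_call(190, intrinsic_calls(160, 1, 16)) evaluates to 19 and per_call(380, intrinsic_calls(160, 1, 16)) evaluates to 38.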

 

Thank you in advance for your answer.

 

---------------------------------------------------------------------------------------------------------------

* Making the perf counters available may require the following commands:

 

sudo -s
echo -1 >> /proc/sys/kernel/perf_event_paranoid

 

* Code for bench.cpp

 

// BENCH.CPP
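#include <immintrin.h> 
#include <numeric>
#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <fstream>
#include <sstream>   // std::stringstream
#include <string>    // std::string, std::getline
#include <vector>    // std::vector
#include <chrono>
#include <random>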

template<typename T>
void myGenerateUniform(T *__restrict__ myArray, long arrSize, const T& a, const T& b, const unsigned int& seed ) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<> dis(a, b);
    for(long i = 0; i < arrSize; i++) {
        myArray[i] = dis(gen);
    }
}

void sin_on_m512s(float *__restrict__ a, long n){
    __m512 vsin;
    for(long ind = 0; ind < n; ind+=16){
        vsin = _mm512_loadu_ps(a+ind);
        vsin = _mm512_sin_ps(vsin);
        _mm512_storeu_ps(a+ind, vsin); // unaligned store: malloc does not guarantee 64-byte alignment
    }
}

void sin_on_m512d(double *__restrict__ a, long n){
    __m512d vsin;
    for(long ind = 0; ind < n; ind+=8){
        vsin = _mm512_loadu_pd(a+ind);
        vsin = _mm512_sin_pd(vsin);
        _mm512_storeu_pd(a+ind, vsin); // unaligned store: malloc does not guarantee 64-byte alignment
    }
}

int main(int argc, char *argv[]){

    // The input file name is mandatory
    if(argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <bench file>" << std::endl;
        return -1;
    }

    int noLine(1), Test(-1);
    long Nsize(0), nbRep(0);

    std::string filename = argv[1];
    std::string line;
    std::ifstream infile(filename);
    std::vector<std::string> codeTest ={"M512D_SIN_XXX","M512S_SIN_XXX"};
    if(!infile.is_open()) {
        std::cerr << "Error: File not opened."<<std::endl;
        return -1;
    }

    while (std::getline(infile, line)){
		std::stringstream stream(line);
		if(stream >> Nsize >> nbRep >> Test) {
			std::cout << "Do: "<< Test << std::endl;

            int seed = 10;
            std::cout.precision(12);
        
            if(Test == 0){
                double a = -2, b = 2;
                double res(0.0);
                double *arr = (double*)malloc(Nsize*sizeof(double));
                myGenerateUniform(arr, Nsize, a, b, seed); 
                auto t1 = std::chrono::high_resolution_clock::now();
                for(long rep = 0; rep < nbRep; rep ++){
                    sin_on_m512d(arr, Nsize);
                }
                auto t2 = std::chrono::high_resolution_clock::now();
                std::chrono::duration<double> elapsed_seconds{t2 - t1};	
                res = std::reduce(arr,arr + Nsize, 0.0)/Nsize;
                std::cout << codeTest[Test] << "Fl, " << Nsize <<", " << nbRep <<", "<< res << ", " << elapsed_seconds.count(); 
                std::cout << std::endl;
                free(arr);
            }
            if(Test == 1){
                float a = -2, b = 2;
                float res(0.0);
                float *arr = (float*)malloc(Nsize*sizeof(float));
                myGenerateUniform(arr, Nsize, a, b, seed); 
                auto t1 = std::chrono::high_resolution_clock::now();
                for(long rep = 0; rep < nbRep; rep ++){
                    sin_on_m512s(arr, Nsize);
                }
                auto t2 = std::chrono::high_resolution_clock::now();
                std::chrono::duration<double> elapsed_seconds{t2 - t1};	
                res = std::reduce(arr,arr + Nsize, 0.0)/Nsize;
                std::cout << codeTest[Test] << "Fl, " << Nsize <<", " << nbRep <<", "<< res << ", " << elapsed_seconds.count(); 
                std::cout << std::endl;
                free(arr);
            }
         }
            
        }
    return 0;
}
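
A side note on alignment: the aligned 512-bit store intrinsics (_mm512_store_ps, _mm512_store_pd) require a 64-byte-aligned address, which malloc does not guarantee. Below is a minimal sketch of allocating such a buffer, assuming a C++17 standard library that provides std::aligned_alloc (glibc does). The helper names are hypothetical and not part of bench.cpp:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: allocate room for `count` floats on a 64-byte boundary.
// std::aligned_alloc requires the size to be a multiple of the alignment,
// so the byte count is rounded up. Release the buffer with std::free.
float* alloc_floats_aligned64(std::size_t count) {
    constexpr std::size_t alignment = 64;  // size of a __m512 in bytes
    assert(count <= (SIZE_MAX - alignment) / sizeof(float));  // no overflow
    std::size_t bytes = count * sizeof(float);
    std::size_t rounded = (bytes + alignment - 1) / alignment * alignment;
    if (rounded == 0) rounded = alignment;  // avoid a zero-size request
    return static_cast<float*>(std::aligned_alloc(alignment, rounded));
}

// True if `p` is suitable for aligned 512-bit loads and stores.
bool is_aligned64(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}
```

In bench.cpp this would stand in for the malloc call of the float test; the double test would need the same treatment with sizeof(double).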

 

* bench.txt

 

160 1 1

 

Remark: I left the double-precision function in the code so that the same observation can be made in double precision.

Changes needed:

- in bench.txt 

160 1 0

Run:

perf stat --event=fp_arith_inst_retired.512b_packed_double ./bench bench.txt

perf stat --event=fp_arith_inst_retired.256b_packed_double ./bench bench.txt

 

Viet_H_Intel
Moderator

 

I have a sudo issue. Can it be run without superuser privileges?

$ echo -1 >> /proc/sys/kernel/perf_event_paranoid
bash: /proc/sys/kernel/perf_event_paranoid: Permission denied

 

$ perf stat --event=fp_arith_inst_retired.256b_packed_double ./bench bench.txt
event syntax error: 'fp_arith_inst_retired.256b_packed_double'
\___ Bad event name

Unable to find event on a PMU of 'fp_arith_inst_retired.256b_packed_double'
Run 'perf list' for a list of valid events

Usage: perf stat [<options>] [<command>]

$

$ perf stat --event=fp_arith_inst_retired.512b_packed_double ./bench bench.txt
event syntax error: 'fp_arith_inst_retired.512b_packed_double'
\___ Bad event name

Unable to find event on a PMU of 'fp_arith_inst_retired.512b_packed_double'
Run 'perf list' for a list of valid events

Usage: perf stat [<options>] [<command>]

$

$ perf list >& t && grep fp_arith_inst_retired t

$ perf -v
perf version 6.8.12

 

arsene-marzorati
Beginner

Hello, 

I would say that it is not possible without superuser privileges: I only managed to change the perf_event_paranoid value with sudo. If I run without sudo, I get:

Error:
Access to performance monitoring and observability operations is limited.
Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
access to performance monitoring and observability operations for processes
without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
More information can be found at 'Perf events and tool security' document:
https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
perf_event_paranoid setting is 3:
  -1: Allow use of (almost) all events by all users
      Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>= 0: Disallow raw and ftrace function tracepoint access
>= 1: Disallow CPU event access
>= 2: Disallow kernel profiling
To make the adjusted perf_event_paranoid setting permanent preserve it
in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)
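
To check the current setting from a program before running perf, something like the following sketch could be used. It only reads the value and lists the restriction lines from the message above that apply to it; it does not query what the kernel actually permits, and both function names are hypothetical:

```cpp
#include <cassert>
#include <fstream>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: read the current perf_event_paranoid level.
// Returns std::nullopt if the file is missing or unreadable.
std::optional<int> read_perf_paranoid(
    const std::string& path = "/proc/sys/kernel/perf_event_paranoid") {
    assert(!path.empty());
    std::ifstream in(path);
    int level = 0;
    if (in >> level) return level;
    return std::nullopt;
}

// Restriction lines from the perf message that apply to `level`.
// The thresholds are copied from that message and are cumulative.
std::vector<std::string> restrictions_for(int level) {
    std::vector<std::string> out;
    if (level >= 0) out.push_back("Disallow raw and ftrace function tracepoint access");
    if (level >= 1) out.push_back("Disallow CPU event access");
    if (level >= 2) out.push_back("Disallow kernel profiling");
    return out;
}
```

For the value 3 reported above, restrictions_for returns all three lines; for -1 it returns an empty list.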

Nevertheless, my version of perf is older than yours:

$ perf -v
perf version 5.10.226
$ perf list >& t&& grep fp_arith_inst_retired t
  fp_arith_inst_retired.128b_packed_double          
  fp_arith_inst_retired.128b_packed_single          
  fp_arith_inst_retired.256b_packed_double          
  fp_arith_inst_retired.256b_packed_single          
  fp_arith_inst_retired.512b_packed_double          
  fp_arith_inst_retired.512b_packed_single          
  fp_arith_inst_retired.scalar_double               
  fp_arith_inst_retired.scalar_single 

 

Viet_H_Intel
Moderator

Looking at the code generation for those two intrinsics (_mm512_sin_ps and _mm512_sin_pd): in the -O0 case (disables all optimizations), I see vmovups %ymm1, 64(%rsp), whereas in the -O1 case (enables optimizations for speed), I see vmovups %zmm0, (%rsp).

So this is expected behavior.
