Hello everyone,
I am trying to work with the SVML functions for __m512. When I use the optimisation flag -O0, it seems that functions such as _mm512_sin_ps and _mm512_sin_pd are executed on __m256 and __m256d, whereas with -O1 or higher everything looks right.
I would like to know whether this behavior is expected or whether the counters I am using are unreliable.
To check this, I used the code in bench.cpp (see the end of this message).
I'm running on Intel(R) oneAPI DPC++/C++ Compiler 2023.0.0 (2023.0.0.20221201).
icpx -OX -march=native -std=gnu++20 -o bench.cpp.o -c bench.cpp
icpx -OX -march=native -rdynamic bench.cpp.o -o bench
With X = 0 or 1.
Then I run:
perf stat --event=fp_arith_inst_retired.512b_packed_single ./bench bench.txt
I get 0 for this counter with flag -O0 and 190 with flag -O1.
perf stat --event=fp_arith_inst_retired.256b_packed_single ./bench bench.txt
I get 380 for this counter with flag -O0 and 0 with flag -O1.
Thank you in advance for your answer.
---------------------------------------------------------------------------------------------------------------
* Making the perf counters available may require the following commands
sudo -s
echo -1 >> /proc/sys/kernel/perf_event_paranoid
* Code for bench.cpp
// BENCH.CPP
#include <immintrin.h>
#include <numeric>
#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include <chrono>
#include <random>

// Fill myArray with arrSize values drawn uniformly from [a, b).
template<typename T>
void myGenerateUniform(T *__restrict__ myArray, long arrSize, const T& a, const T& b, const unsigned int& seed ) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<> dis(a, b);
    for(long i = 0; i < arrSize; i++) {
        myArray[i] = dis(gen);
    }
}

// In-place sine on n floats, 16 at a time (n must be a multiple of 16).
void sin_on_m512s(float *__restrict__ a, long n){
    __m512 vsin;
    for(long ind = 0; ind < n; ind+=16){
        vsin = _mm512_loadu_ps(a+ind);
        vsin = _mm512_sin_ps(vsin);
        // Unaligned store: malloc does not guarantee 64-byte alignment.
        _mm512_storeu_ps(a+ind, vsin);
    }
}

// In-place sine on n doubles, 8 at a time (n must be a multiple of 8).
void sin_on_m512d(double *__restrict__ a, long n){
    __m512d vsin;
    for(long ind = 0; ind < n; ind+=8){
        vsin = _mm512_loadu_pd(a+ind);
        vsin = _mm512_sin_pd(vsin);
        // Unaligned store: malloc does not guarantee 64-byte alignment.
        _mm512_storeu_pd(a+ind, vsin);
    }
}

int main(int argc, char *argv[]){
    int Test(-1);
    long Nsize(0), nbRep(0);
    if(argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <input file>" << std::endl;
        return -1;
    }
    std::string filename = argv[1];
    std::string line;
    std::ifstream infile(filename);
    std::vector<std::string> codeTest ={"M512D_SIN_XXX","M512S_SIN_XXX"};
    if(!infile.is_open()) {
        std::cerr << "Error: File not opened."<<std::endl;
        return -1;
    }
    // Each input line holds: array size, number of repetitions, test id (0 = double, 1 = float).
    while (std::getline(infile, line)){
        std::stringstream stream(line);
        if(stream >> Nsize >> nbRep >> Test) {
            std::cout << "Do: "<< Test << std::endl;
            int seed = 10;
            std::cout.precision(12);
            if(Test == 0){
                double a = -2, b = 2;
                double res(0.0);
                double *arr = (double*)malloc(Nsize*sizeof(double));
                myGenerateUniform(arr, Nsize, a, b, seed);
                auto t1 = std::chrono::high_resolution_clock::now();
                for(long rep = 0; rep < nbRep; rep ++){
                    sin_on_m512d(arr, Nsize);
                }
                auto t2 = std::chrono::high_resolution_clock::now();
                std::chrono::duration<double> elapsed_seconds{t2 - t1};
                res = std::reduce(arr,arr + Nsize, 0.0)/Nsize;
                std::cout << codeTest[Test] << "Fl, " << Nsize <<", " << nbRep <<", "<< res << ", " << elapsed_seconds.count();
                std::cout << std::endl;
                free(arr);
            }
            if(Test == 1){
                float a = -2, b = 2;
                float res(0.0);
                float *arr = (float*)malloc(Nsize*sizeof(float));
                myGenerateUniform(arr, Nsize, a, b, seed);
                auto t1 = std::chrono::high_resolution_clock::now();
                for(long rep = 0; rep < nbRep; rep ++){
                    sin_on_m512s(arr, Nsize);
                }
                auto t2 = std::chrono::high_resolution_clock::now();
                std::chrono::duration<double> elapsed_seconds{t2 - t1};
                res = std::reduce(arr,arr + Nsize, 0.0)/Nsize;
                std::cout << codeTest[Test] << "Fl, " << Nsize <<", " << nbRep <<", "<< res << ", " << elapsed_seconds.count();
                std::cout << std::endl;
                free(arr);
            }
        }
    }
    return 0;
}
* bench.txt
160 1 1
Remark: I left the double-precision function in the code so that the same observation can be made in double precision.
Changes needed:
- in bench.txt
160 1 0
Run:
perf stat --event=fp_arith_inst_retired.512b_packed_double ./bench bench.txt
perf stat --event=fp_arith_inst_retired.256b_packed_double ./bench bench.txt
I have a sudo issue. Can it be run without superuser privileges?
$ echo -1 >> /proc/sys/kernel/perf_event_paranoid
bash: /proc/sys/kernel/perf_event_paranoid: Permission denied
$ perf stat --event=fp_arith_inst_retired.256b_packed_double ./bench bench.txt
event syntax error: 'fp_arith_inst_retired.256b_packed_double'
\___ Bad event name
Unable to find event on a PMU of 'fp_arith_inst_retired.256b_packed_double'
Run 'perf list' for a list of valid events
Usage: perf stat [<options>] [<command>]
$
$ perf stat --event=fp_arith_inst_retired.512b_packed_double ./bench bench.txt
event syntax error: 'fp_arith_inst_retired.512b_packed_double'
\___ Bad event name
Unable to find event on a PMU of 'fp_arith_inst_retired.512b_packed_double'
Run 'perf list' for a list of valid events
Usage: perf stat [<options>] [<command>]
$
$perf list >& t&& grep fp_arith_inst_retired t
$
$ perf -v
perf version 6.8.12
Hello,
I would say that it is not possible without superuser privileges: I only managed to change the perf_event_paranoid value with sudo. If I run without sudo, I get:
Error:
Access to performance monitoring and observability operations is limited.
Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
access to performance monitoring and observability operations for processes
without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
More information can be found at 'Perf events and tool security' document:
https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
perf_event_paranoid setting is 3:
-1: Allow use of (almost) all events by all users
Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>= 0: Disallow raw and ftrace function tracepoint access
>= 1: Disallow CPU event access
>= 2: Disallow kernel profiling
To make the adjusted perf_event_paranoid setting permanent preserve it
in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)
Nevertheless, my version of perf is older than yours:
$ perf -v
perf version 5.10.226
$ perf list >& t&& grep fp_arith_inst_retired t
fp_arith_inst_retired.128b_packed_double
fp_arith_inst_retired.128b_packed_single
fp_arith_inst_retired.256b_packed_double
fp_arith_inst_retired.256b_packed_single
fp_arith_inst_retired.512b_packed_double
fp_arith_inst_retired.512b_packed_single
fp_arith_inst_retired.scalar_double
fp_arith_inst_retired.scalar_single
I looked at the code generated for those two intrinsics (_mm512_sin_ps and _mm512_sin_pd). In the -O0 case (all optimizations disabled), I see vmovups %ymm1, 64(%rsp), whereas in the -O1 case (optimizations for speed enabled), I see vmovups %zmm0, (%rsp).
So this is expected behavior.
