#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Intended "rotation": each row of the w x h source is written out as a
   column of the h x w destination. */
int matrix_rot(uint8_t* src, uint8_t* des, int w, int h)
{
    for (int y = 0; y < h; ++y) {
        const int stride = y * w + h;
        for (int x = 0; x < w; ++x) {
            des[x*h+y] = src[stride-x];
        }
    }
    return 0;
}

/* Print a w x h matrix, one row per line. */
int matrix_dump(uint8_t* buf, int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            printf("%d ", buf[y*w+x]);
        }
        printf("\n");
    }
    return 0;
}

int main(int argc, char** argv)
{
    if (argc != 3)
        return -1;

    int w = strtol(argv[1], NULL, 10);
    int h = strtol(argv[2], NULL, 10);

    uint8_t *bmp = malloc(w*h);
    uint8_t *dst = malloc(w*h);

    /* Repeat the rotation a reproducible pseudo-random number of times. */
    srand(12);
    int c = rand();
    printf("%d\n", c);

    for (int i = 0; i < c; i++) {
        matrix_rot(bmp, dst, w, h);
        matrix_rot(bmp, dst, w, h);
    }

    matrix_dump(bmp, 4, 3);
    matrix_dump(dst, 3, 4);

    free(bmp);
    free(dst);
    return 0;
}
ICC 15.0 generates slower AVX/SSE4.2 code than SSSE3 code for the program above.
The SSSE3 compilation command line and result:
$ icc -std=c99 -xSSSE3 -O3 -o rot rot.c
$ time ./rot 320 240
201684
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
./rot 320 240  10.06s user 0.01s system 99% cpu 10.077 total
The AVX/SSE4.2 compilation command line and result (the run shown uses -xSSE4.2):
$ icc -std=c99 -xSSE4.2 -O3 -o rot rot.c
$ time ./rot 320 240
201684
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
./rot 320 240  16.64s user 0.00s system 99% cpu 16.650 total
As you can see, the AVX/SSE4.2 build runs about 65% slower than the SSSE3 build (16.64 s vs. 10.06 s user time).
You'd probably want to compare the opt-report4 or vec-report output for the two builds. I didn't think AVX offered any new facilities for these data types; if the AVX option removes optimizations, that seems like a deficiency. If there is any new AVX optimization, you would want to retry the tests with 32-byte data alignment.
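For the alignment experiment, a minimal sketch (my addition, not from the thread) is to replace the plain malloc calls with _mm_malloc/_mm_free from the Intel intrinsics headers, requesting 32-byte alignment to match the 256-bit AVX register width. The surrounding test program is hypothetical; only _mm_malloc and _mm_free are assumed from the intrinsics headers.

#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>   /* pulls in _mm_malloc and _mm_free */

int main(void)
{
    int w = 320, h = 240;

    /* Request 32-byte alignment so 256-bit AVX loads/stores can be aligned. */
    uint8_t *bmp = _mm_malloc((size_t)w * h, 32);
    uint8_t *dst = _mm_malloc((size_t)w * h, 32);
    if (!bmp || !dst)
        return 1;

    /* The printed addresses should be multiples of 32. */
    printf("bmp=%p dst=%p\n", (void*)bmp, (void*)dst);

    _mm_free(bmp);
    _mm_free(dst);
    return 0;
}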
It may not be surprising if AVX offers no advantage on strided stores, or, if the AVX option introduces vectorization that wasn't there before, that the new vectorization isn't advantageous. Does one of the options result in loop interchange?
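For context (my reconstruction, not code from the thread): loop interchange here would mean swapping the x and y loops of the OP's matrix_rot, which turns the scattered stores to des into unit-stride stores at the cost of strided loads from src. A sketch that keeps the posted indexing:

#include <stdint.h>

/* Interchanged variant of the OP's matrix_rot: the x loop is now outer,
   so des[x*h + y] is written with unit stride in the inner loop, while
   src is read with a stride of w. The indexing matches the posted code. */
int matrix_rot_interchanged(uint8_t* src, uint8_t* des, int w, int h)
{
    for (int x = 0; x < w; ++x) {
        for (int y = 0; y < h; ++y) {
            des[x*h + y] = src[y*w + h - x];
        }
    }
    return 0;
}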
I was able to compile only after adding #include <stdint.h> (the posted code needs it for uint8_t).
Apparently, the vectorization strategy changes from "outer loop vectorized" to one the report flags as having "heavy-overhead vector operations."
So adding #pragma omp simd to the outer loop (to return, in effect, to outer-loop vectorization) shows a slight advantage for AVX over SSSE3, and a larger advantage for AVX2 (on a second, warm-cache execution) with aligned allocation (aligned_malloc).
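To make that suggestion concrete, here is a hedged sketch of the OP's routine with the OpenMP SIMD pragma on the outer loop. This is my reconstruction, not the exact code that was tested, and icc generally needs an OpenMP or OpenMP-SIMD option (for example -qopenmp-simd) for the pragma to take effect.

#include <stdint.h>

/* The OP's matrix_rot with #pragma omp simd requesting vectorization of
   the outer (y) loop, so each vector lane handles a different row. */
int matrix_rot_simd(uint8_t* src, uint8_t* des, int w, int h)
{
    #pragma omp simd
    for (int y = 0; y < h; ++y) {
        const int stride = y * w + h;
        for (int x = 0; x < w; ++x) {
            des[x*h + y] = src[stride - x];
        }
    }
    return 0;
}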
