Intel® oneAPI DPC++/C++ Compiler

Critical Code Gen Bug: Silent Data Corruption in Vectorized Modulo Loops (-O2)

HavardGraff

Severity: High (Silent Data Corruption)
Component: Loop Vectorizer / Code Generation
Compiler: Intel(R) oneAPI DPC++/C++ Compiler (ICX)
Flags: -O2 (reproduces at -O2 and -O3, and with -xCORE-AVX512, -xCORE-AVX2, or -xAVX)

Summary

The ICX compiler generates logically incorrect code when vectorizing loops that initialize arrays using modulo operations with power-of-2 divisors (e.g., i % 4). This results in silent data corruption where specific elements in the sequence are written with the wrong values.

The issue persists even when the loop bound is a variable, indicating a fundamental flaw in the vectorizer's pattern generation logic, not just a constant-folding error.

Reproduction Code (Variable Size)

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

// Bug triggers even with variable 'size' parameter
void init_arr(int16_t * a_buf, int size)
{
    for (int i = 0; i < size; i++) {
        // Pattern: 3, 2, 2, 2, 3, 2, 2, 2...
        if ((i % 4) == 0) {
            a_buf[i] = 3;
        } else {
            a_buf[i] = 2;
        }
    }
}

int main(void)
{
    // Test with size 17
    int size = 17;
    int16_t * a_buf = (int16_t *)_mm_malloc (size * sizeof (int16_t), 64);
    if (a_buf == NULL) {
        fprintf(stderr, "_mm_malloc failed\n");
        return 2;
    }

    init_arr(a_buf, size);

    // Verification
    int failure_count = 0;
    for (int i = 0; i < size; i++)
    {
        int16_t expected = ((i % 4) == 0) ? 3 : 2;
        if (a_buf[i] != expected) {
            printf("Index %d: Expected %d, Got %d\n", i, expected, a_buf[i]);
            failure_count++;
        }
    }

    if (failure_count > 0) printf("Total Failures: %d\n", failure_count);
    
    _mm_free (a_buf);
    return (failure_count == 0) ? 0 : 1;
}

Observed Behavior

When compiled with -O2, the code writes 2 instead of 3 at indices 4, 8, 12, and 16. Index 0 is written correctly, so 4 of the 5 positions that should hold 3 are corrupted.

Disassembly Analysis (Proof of Logic Error)

The generated assembly for -O2 (SSE/AVX) shows that the compiler explicitly hardcodes the wrong values. For the tail case at index 16 (where size=17), the compiler emits a scalar store of 2 instead of 3.

# Disassembly of init_arr (AT&T syntax)
...
# Vector stores (filling the array with incorrect patterns)
movups %xmm0, (%rdi)
movups %xmm0, 0x10(%rdi)

# CRITICAL ERROR:
# At byte offset 0x20 (index 16, since each int16_t is 2 bytes), the compiler hardcodes the immediate value 2.
# Since 16 % 4 == 0, this instruction SHOULD be writing 3.
movw   $0x2, 0x20(%rdi)  # <-- logic error
...

Workarounds

  1. Disable vectorization: place #pragma novector immediately before the loop.

  2. Volatile Divisor: Making the divisor (e.g., 4) a volatile variable breaks the pattern recognition optimization.

  3. Non-Power-of-2: Changing the modulo to % 3 or % 5 produces correct code.

Viet_H_Intel

The issue is known, and it will be fixed in the next release. 

$ icx -V
Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64, Version 2025.3.2 Build 20260112

$ icx -O2 vec-bug.c && ./a.out
Index 4: Expected 3, Got 2
Index 8: Expected 3, Got 2
Index 12: Expected 3, Got 2
Index 16: Expected 3, Got 2
Total Failures: 4

$ icx vec-bug.c -O2 && ./a.out
$ icx -V
Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64, Upcoming Release.
