That would be true, except that the user has a struct of 128 bytes (32 ints)
and an array of these structs.
The performance varies greatly depending on the number of structs in his array
(not on the number of ints within each struct).
The relative cache line alignment is the same regardless of the size of the array of structs.
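To illustrate the alignment point, here is a minimal sketch. It assumes a 64-byte cache line and a 4-byte int; the names `Item`, `CACHE_LINE_BYTES` and `line_offset` are made up for the example and are not from the user's code. Because 128 is a multiple of 64, every element starts at the same offset within its cache line as element 0, however many elements the array holds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64   /* assumption: 64-byte cache lines */

/* Hypothetical stand-in for the user's struct: 32 ints. */
typedef struct { int v[32]; } Item;

/* Holds only where int is 4 bytes, which is assumed here. */
static_assert(sizeof(Item) == 128, "Item is expected to be 128 bytes");

/* Byte offset of element `index` within its cache line,
   for an array starting at address `base`. */
static size_t line_offset(uintptr_t base, size_t index)
{
    return (size_t)((base + index * sizeof(Item)) % CACHE_LINE_BYTES);
}
```

For any base address, `line_offset(base, i)` equals `line_offset(base, 0)` for every `i`, so growing the array does not change how the structs sit relative to cache lines.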
This is not to say you are wrong about the prefetch: the two tasks may be in lock-step (working in the same struct) or skewed (working in different structs), and the number of structs alters the lock-step/skew situation. Running VTune, or another profiler that detects cache line evictions, would confirm or refute the hypothesis.
I would like to see his reply after running with the array of structs aligned on a 4096-byte boundary (i.e., to potentially reduce the number of TLB entries required to map the array of structs).
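A rough sketch of what that experiment might look like, assuming 4 KiB pages and a C11 runtime that provides `aligned_alloc` (MSVC, for example, does not; `_aligned_malloc` would be the substitute there). `Item`, `PAGE_BYTES`, `alloc_items_page_aligned` and `pages_spanned` are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_BYTES 4096   /* assumption: 4 KiB pages */

/* Hypothetical stand-in for the user's struct: 32 ints. */
typedef struct { int v[32]; } Item;

/* Allocate `count` Items starting on a page boundary.
   Returns NULL on failure; release with free(). */
static Item *alloc_items_page_aligned(size_t count)
{
    if (count == 0 || count > (SIZE_MAX - PAGE_BYTES) / sizeof(Item))
        return NULL;
    size_t bytes = count * sizeof(Item);
    /* C11 requires the size to be a multiple of the alignment: round up. */
    size_t rounded = (bytes + PAGE_BYTES - 1) / PAGE_BYTES * PAGE_BYTES;
    return aligned_alloc(PAGE_BYTES, rounded);
}

/* Number of pages touched by `bytes` bytes starting at `base`. */
static size_t pages_spanned(const void *base, size_t bytes)
{
    if (bytes == 0)
        return 0;
    uintptr_t first = (uintptr_t)base / PAGE_BYTES;
    uintptr_t last  = ((uintptr_t)base + bytes - 1) / PAGE_BYTES;
    return (size_t)(last - first + 1);
}
```

With 100 structs (12800 bytes), a page-aligned base touches 4 pages, while a base that starts 4000 bytes into a page touches 5. That one-page difference is the most that alignment alone can save, so whether it matters for TLB misses is something only a measurement will show.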