If I had to make an educated guess, I would say that moving the code (f1 and f2) changed the instruction-cache alignment of one or both loops. You can confirm this by looking at the disassembly of the two loops under the two different values of n. Produce the address reports from the very same code that exhibits the difference in performance.
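As a rough first check before reading the disassembly, something like the sketch below could print where each function starts relative to a cache line. It assumes a 64-byte line, and `line_offset` and `report_entry` are hypothetical helper names, not anything from the original code. It only shows the function entry address, not where the loop inside it begins, so the disassembly is still the real evidence.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64u /* assumed line size; check your CPU's documentation */

/* Byte offset of an address within its (assumed 64-byte) cache line. */
unsigned line_offset(uintptr_t addr)
{
    return (unsigned)(addr % CACHE_LINE);
}

/* Print a function's entry address and its offset within a cache line.
   Converting a function pointer to an integer is implementation-defined,
   but it behaves as expected on common desktop/server platforms. */
void report_entry(const char *name, void (*fn)(void))
{
    assert(fn != NULL);
    uintptr_t addr = (uintptr_t)fn;
    printf("%s: 0x%" PRIxPTR " (offset %u in a %u-byte line)\n",
           name, addr, line_offset(addr), CACHE_LINE);
}
```

You would call it as `report_entry("f1", (void (*)(void))f1);` for each function, once per build (one for each value of n), and compare the offsets.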
Also, you might consider an alignment of 4096 (typical VM page size). This will (may) reduce the number of TLB entries required for the data from 3 to 2.
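To illustrate the page-count argument, here is a minimal sketch under some assumptions: a 4096-byte page, a 4-byte int, and C11 `aligned_alloc` being available (it is not on every toolchain). `item_t`, `pages_spanned` and `alloc_page_aligned` are hypothetical names. The 8192-byte size used in the note after the code is only an example, not a figure from the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096u /* assumed; query the OS for the real value */

/* Hypothetical stand-in for the 128-byte struct (32 ints, assuming 4-byte int). */
typedef struct { int v[32]; } item_t;

/* Number of distinct pages touched by the byte range [addr, addr + bytes). */
size_t pages_spanned(uintptr_t addr, size_t bytes)
{
    if (bytes == 0)
        return 0;
    uintptr_t first = addr / PAGE_SIZE;
    uintptr_t last = (addr + bytes - 1) / PAGE_SIZE;
    return (size_t)(last - first + 1);
}

/* Allocate n items starting on a page boundary. The size is rounded up to a
   multiple of the alignment, which C11 aligned_alloc requires.
   Release the result with free(). */
item_t *alloc_page_aligned(size_t n)
{
    size_t bytes = n * sizeof(item_t);
    size_t rounded = (bytes + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE;
    if (rounded == 0)
        rounded = PAGE_SIZE;
    item_t *p = aligned_alloc(PAGE_SIZE, rounded);
    assert(p == NULL || (uintptr_t)p % PAGE_SIZE == 0);
    return p;
}
```

With these definitions, an 8192-byte array (64 structs) that starts 64 bytes into a page touches 3 pages, while the same array starting on a page boundary touches 2.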
This would be true, except that the user has a struct of size 128 bytes (32 ints) and an array of these structs.
The performance varies greatly depending on the number of structs in the array (not the number of ints within each struct).
The relative cache line alignment is the same regardless of the size of the array of structs.
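The alignment point can be checked with plain address arithmetic. The sketch below assumes a 64-byte cache line; `elem_line_offset` and `check_offsets` are hypothetical helpers. Because 128 is a multiple of 64, every element starts at the same offset within its cache line as the array base does, whatever the element count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u   /* assumed line size */
#define STRUCT_SIZE 128u /* 32 ints, assuming 4-byte int */

/* Offset within its cache line of element i of an array starting at base. */
unsigned elem_line_offset(uintptr_t base, size_t i)
{
    return (unsigned)((base + i * STRUCT_SIZE) % CACHE_LINE);
}

/* Assert that all n elements share the base address's cache-line offset. */
void check_offsets(uintptr_t base, size_t n)
{
    unsigned expected = (unsigned)(base % CACHE_LINE);
    for (size_t i = 0; i < n; ++i)
        assert(elem_line_offset(base, i) == expected);
}
```

This only covers where each struct sits relative to a cache line. It says nothing about prefetch behaviour or about how the two tasks are skewed against each other.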
This is not to say you are wrong about the prefetch, as the two tasks may be in lock-step (working in the same struct) or may be skewed (working in different structs), and the number of these structs alters the lock-step/skew situation. Running VTune or another profiler that detects cache line evictions would confirm or refute the hypothesis.
I would like to see his reply after running with the array of structs aligned on a 4096-byte boundary (i.e., to potentially reduce the number of TLB entries required to map the array of structs).