Whoa! There's so much more to consider here even before you start splitting this task between threads. Consider the innermost loop:
[cpp]for(k=0; k < size; k++)
    arr[i][j] += arr1[i][k] * arr2[k][j];
[/cpp]
C arrays are stored in row-major order, so each successive multiplicand is pulled from the next row of arr2. Even with floats, a row of the suggested 2000x2000 array is 2000 x 4 = 8000 bytes. Since a single TLB (Translation Lookaside Buffer) entry covers only 4K bytes (assuming typical small pages), each successive iteration of this loop needs a fresh TLB entry (even the arr1 accesses will typically take two entries, sometimes three). Even on an Intel Core i7 processor with a 64-entry 4-way set associative DTLB, after at most 64 such rows every entry is in use and the processor has to start evicting the old ones (actually fewer than that, since TLB entries are also needed for the other arrays). TLB misses and the resulting page walks are not cheap.
How to fix this? One method is blocking: dividing the arrays into blocks (tiles) small enough that TLB entries and cache lines get reused many times, maximizing both read and write performance. Blocking is usually done by adding more nested loops. agalex hints at this solution with the introduction of CUDA terminology in a subsequent reply.
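To make that concrete, here is a minimal sketch of a blocked multiply. The function name, the flat row-major pointers, and the tile size B_SZ = 64 are all my own illustrative choices, not anything from the thread; in practice you'd tune B_SZ so that the tiles of all three arrays fit comfortably within the cache and the DTLB's reach:

[cpp]#include <algorithm>
#include <vector>

// Illustrative tile size: 64x64 floats = 16 KB per tile (a guess, tune it).
constexpr int B_SZ = 64;

// Blocked C += A * B for N x N row-major arrays.
// Each tile of arr2 spans only a few pages, and those pages are
// revisited B_SZ times before moving on, so TLB entries are reused.
void matmul_blocked(int N, const float* A, const float* B, float* C)
{
    for (int ii = 0; ii < N; ii += B_SZ)
        for (int kk = 0; kk < N; kk += B_SZ)
            for (int jj = 0; jj < N; jj += B_SZ)
                for (int i = ii; i < std::min(ii + B_SZ, N); ++i)
                    for (int k = kk; k < std::min(kk + B_SZ, N); ++k) {
                        float a = A[i * N + k];       // reused across j
                        for (int j = jj; j < std::min(jj + B_SZ, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
[/cpp]

Note the inner three loops sweep j fastest, so both C and B are read and written along their natural row-major order within each tile.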
Jim's suggested transposition is another way to approach this problem: transposing arr2 changes its access pattern so that both input arrays are read along their natural memory order. Sadly, this only translates the problem to the writes of arr, though it does reduce the frequency of the problem, since it removes the need for a fresh TLB entry on every iteration of the innermost loop.
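For reference, a transposed variant might look like the sketch below (again, the names and flat row-major layout are my own assumptions, not Jim's actual code). Accumulating the dot product in a local variable is a design choice that also keeps the write to arr down to one store per (i, j):

[cpp]#include <vector>

// Transposed variant: copy arr2 into Bt = transpose(B) once (O(N^2)),
// then both input streams in the hot loop are sequential in memory.
void matmul_transposed(int N, const float* A, const float* B, float* C)
{
    std::vector<float> Bt(static_cast<size_t>(N) * N);
    for (int k = 0; k < N; ++k)
        for (int j = 0; j < N; ++j)
            Bt[(size_t)j * N + k] = B[(size_t)k * N + j];

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            float sum = 0.0f;               // accumulate in a register
            for (int k = 0; k < N; ++k)     // A and Bt both read row-wise
                sum += A[(size_t)i * N + k] * Bt[(size_t)j * N + k];
            C[(size_t)i * N + j] = sum;     // single store per element
        }
}
[/cpp]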
But adding threading poses an even greater problem with this approach. Since the threading in the original code is across the index i, multiple threads will be accessing adjacent entries of array arr, likely causing lots of cache-line thrashing (false sharing) as the threads compete over which core owns the valid copy of each line.
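One common way to sidestep that contention (a sketch under my own assumptions about the layout, not the original poster's code) is to give each thread a contiguous band of whole rows of the output. A 2000-float row is 8000 bytes, far larger than a 64-byte cache line, so no two threads ever write into the same line of arr:

[cpp]#include <thread>
#include <vector>

// Each thread owns rows [lo, hi) of C exclusively, so threads never
// share a cache line of the output and false sharing is avoided.
void matmul_threaded(int N, const float* A, const float* B, float* C,
                     int nthreads)
{
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        int lo = t * N / nthreads;
        int hi = (t + 1) * N / nthreads;
        pool.emplace_back([=] {
            for (int i = lo; i < hi; ++i)
                for (int k = 0; k < N; ++k) {
                    float a = A[i * N + k];
                    for (int j = 0; j < N; ++j)   // row-wise B and C
                        C[i * N + j] += a * B[k * N + j];
                }
        });
    }
    for (auto& th : pool) th.join();
}
[/cpp]

Blocking can still be applied inside each thread's band; the two techniques compose.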