I'm testing out TBB right now. One of the algorithms I want to parallelize works on a row-major 2-D matrix by processing a column at a time.
How can I code this in TBB such that my algorithm can work on a column at a time?
Assuming that: matrix is M (row) x N (column), and given pointer float* that points to the (0,0) of the matrix. Assume further that the matrix is contiguous.
Originally, I hand-coded this such that for each element in the column, I stride by i*N to get to the next row.
Taking this a step further: how can we parallelize a column strip (i.e. if an algorithm works on multiple columns at a time)?
What are the typical array dimensions?
The easiest way to operate on columns at the same time is to use a parallel_for.
False sharingmay be a problem, particularly if the operations writes to columns of the matrix. One cache line on Core 2 Duo is 64 bytes. That's 16 floats. If you have thousands of columns, it might pay to process 16 columns at a time. Doing so would require writing range object similar to blocked_range, but always stays on multiple-of-16 boundaries. (If we ever add a fourth template parameter to blocked_range, it should be for this "multiple of" feature.) Or use the normal blocked_range, but let each index in it represent a group of 16 columns. That complicates the indexing calculations slightly but should not be unwieldly.
If there are too few columns for that scheme to work, you might actually be better off (at least with regard to cacheline traffic) copying each column to a temporary vector, operate on the temporary, and then copy back. Likewise copying might pay off if there are so many rows that a group of 16 columns will not fit in outer-level cache.
The parallel_for template tends to distribute indices relatively far apart to different threads, so if there are many more columns than worker threads, the false sharing issue might not be a big issue. I suggest trying the obvious straightforward indexing first.