As Raf remarked before, operating over columns incurs significant cache overhead. A cache line is typically 64 bytes; if an element of the array is only 4 bytes, the processor still brings in the entire 64-byte line, so you're paying a memory-bandwidth tax of 16x. Even the serial code might benefit significantly if the array could be transposed so that the physical operations run over rows.
http://www.upcrc.illinois.edu/workshops/paraplop10/papers/paraplop10_submission_4.pdf has a nice introduction to basic cache issues. It might be worth experimenting with a transposed form of the matrix: even if you can't use that in the production version, timing the transposed form would indicate whether cache issues are the bottleneck.
Over the last two decades, processor speeds have increased much faster than memory bandwidth. A cache miss is on the order of a hundred cycles. So designers of programming interfaces really do need to consider this when choosing data layouts.
"The transposition experiment should be done, but the payback will come only if the transposed data is referenced a sufficient number of times to recover the cost of rewriting it (effectively you perform one read stream of cache-line-packed data plus 20 write streams, and thereafter 20 read streams per operate()). Without additional information about what will be done with the data, it would be premature to speculate on the effectiveness of the transposition."
I think the assumption is an architectural transposition, with the data immediately written in transposed form, perhaps in a number of blocks.