
I have an i7 965 CPU running on Win7, and I am using TBB 3.0.

The attached program is designed to gather, into a (column-format) matrix, elements that live in different buffers (in this case a bigger matrix).

I am trying to see how many processing units (out of 8?) are engaged in the calculation. From the task manager I see at most 4.

Is this a reliable test? Is there a better one than eyeballing it ;)) ?

[cpp]
//stl
#include <vector>
using std::vector;
#include <cstdlib>   // rand, RAND_MAX

//tbb
#include "tbb/parallel_for.h"
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"
#include "tbb/blocked_range2d.h"

// function to gather data into a single matrix as columns
template< typename _M, typename _T >
void gather_data_into_matrix_columns( _M & m,
                                      vector<_T*> const & dataStart,
                                      vector<size_t> const & dataStride )
{
    struct column_gatherer {
        _M & m_;
        vector<_T*> const & dataStart_;
        vector<size_t> const & dataStride_;

        column_gatherer( _M & m,
                         vector<_T*> const & dataStart,
                         vector<size_t> const & dataStride )
            : m_(m), dataStart_(dataStart), dataStride_(dataStride) {}

        void operator()( const tbb::blocked_range2d<size_t> & r ) const {
            size_t rowStart = r.rows().begin();
            size_t rowEnd   = r.rows().end();
            size_t colStart = r.cols().begin();
            size_t colEnd   = r.cols().end();
            for ( size_t colIdx = colStart; colIdx != colEnd; ++colIdx ){
                size_t stride = dataStride_[colIdx];
                _T const * ptr = dataStart_[colIdx] + rowStart * stride;
                for ( size_t rowIdx = rowStart; rowIdx != rowEnd; ++rowIdx ){
                    m_( rowIdx, colIdx ) = *ptr;
                    ptr += stride;
                }
            }
        }
    };

    static const size_t GRAIN_SIZE_ROW    = 10;
    static const size_t GRAIN_SIZE_COLUMN = 10;
    column_gatherer cg( m, dataStart, dataStride );
    tbb::parallel_for( tbb::blocked_range2d<size_t>( 0, m.size1(), GRAIN_SIZE_ROW,
                                                     0, m.size2(), GRAIN_SIZE_COLUMN ),
                       cg, tbb::auto_partitioner() );
}

#include <iostream>
using std::cout;
using std::endl;

#include <boost/numeric/ublas/matrix.hpp>
using boost::numeric::ublas::matrix;

bool test_GatherScatter(){
    const size_t nRows = 80;
    const size_t nCols = 20;
    matrix< double, boost::numeric::ublas::column_major > m( nRows, nCols );

    const size_t nRowsBig = 800;
    const size_t nColsBig = 100;
    matrix< double, boost::numeric::ublas::column_major > mBig( nRowsBig, nColsBig );
    for ( size_t i = 0; i != nRowsBig; ++i ){
        for ( size_t j = 0; j != nColsBig; ++j ){
            mBig(i,j) = double(rand())/double(RAND_MAX) - 0.5;
        }
    }

    vector< double * > colStart( nCols );
    vector< size_t >   strides( nCols );
    for ( size_t i = 0; i != nCols; ++i ) {
        colStart[i] = &( mBig(0, 3*i) );
        strides[i]  = 5;
    }

    gather_data_into_matrix_columns( m, colStart, strides );

    for( size_t i = 0; i != nRows; ++i ){
        for( size_t j = 0; j != nCols; ++j ){
            if ( m(i,j) != mBig( 5*i, 3*j ) ){
                cout << "(i, j) = (" << i << ", " << j << ")" << endl;
                return false;
            }
        }
    }
    return true;
}

int main(){
    bool b;
    for ( size_t i = 0; i != 1000; ++i )
        b = test_GatherScatter();
}
[/cpp]

The code is a concatenation of various files. Conceptually it gives the right result. Please let me know if there is anything more you might need.

Thank you very much for your help,

Petros

ps: the motivation for this is to bring into one place (typically a matrix) data that will constitute the rhs of a LAPACK linear-equation solver.


22 Replies


Thank you for the quick reply. The external loop was only there to facilitate inspecting what shows in the task manager's performance tab.

I followed your advice and added bogus work inside the operator() of the nested class (repeated the assignment a good 1,000 times) and saw all 8 (logical) processors fire up. Coarsening the grain size had only partial results. Increasing the matrices past a certain limit would probably be unrealistic for the purposes of a single memory buffer living within the boost matrix.

Does this mean that this strategy cannot be improved for problems of this size? (typically nRows = 40 and nCols ~= 1000)

Finally, coming back to my initial question, is there some utility within TBB to inspect such statistics? If not, it might be a useful addition for tuning purposes.

Again, thank you very much for your help,

Petros

ps: I did not understand why I was redirected. Was I in the wrong place?


Scaling code apparently remains challenging even with a task-based framework... Unless you have an opportunity to parallelise at a higher level, this may be the best you can do, and if it's important to do better you may have to try to be creative.

Intel offers developer products that incorporate both TBB and diagnostic tools, but the latter are not part of the former.


I spent a good chunk of time today going over these ideas.

I do have Intel's Parallel Studio and tried a few experiments with the Parallel Amplifier (PA) as well, but I am afraid all this was inconclusive.

In order to get something meaningful through the PA, I had to break the code up into smaller pieces (so that, for example, the filling of the big source matrix does not mask the performance); I also tried an MKL multithreaded random number generator.

Interestingly, PA even showed the VML/VSL functions as poorly threaded - I guess to make me feel a bit better about all this ;)).

Now, there is the possibility that I am doing something less than clever with the PA.

It is noteworthy that, if I do a loop of 100 reps on the gather function, the total tick_count is approximately twice (!!) the time for a single call. Now, I do realize that this is wall-clock time, but is there any chance that most of the time is spent in setting up the thread pool rather than running it? If this is the case, is there a way to avoid this toll?

Finally, I ran this with very large grain sizes - assuming that this renders it single-threaded - correct? It turns out that my total gain is approx. 50%, which with 8 processors and nothing else happening on my PC is rather suboptimal ;-)).

Any ideas you might have will be greatly appreciated.

Thank you very much for all your help,

Petros


I was going for some kind of generalization of the pack/unpack utilities of the VML library (part of MKL).

Hence the effort.

The reason for this is that, in PDEs you very often need to solve systems of equations at each time step.

In the class of methods I am considering, in the case of multidimensional problems, one further has to solve systems along every dimension (as many as the product of the sizes of the rest of the dimensions). Hence this snippet that "calculates in a flash", as you say, is to be used thousands of times. Ideally the LAPACK routines would allow for this by accepting stride info for the vector arguments. Alas, LAPACK does not provide for this. So copying among different structures becomes indispensable (sometimes there are other organizations of the problem that bypass this, but not always).

This, as a note that I did not mean to waste your time on an inane issue ;-)).

Thank you very much for all your help,

Have a great Weekend,

Petros


*"Hence this snippet that calculates in a flash as you say, is to be used thousand of times."*

I have found it is often more efficient to speculate than to have many rounds of Q&A before a fearful first attempt at a suggestion, but in light of this additional background information I admit it does seem rather difficult to parallelise successive time steps... (Note that I really meant "challenges" instead of just silly "distractions".)

*"Ideally the lapack routines would allow for this, allowing for some stride info of the vector arguments. Alas, LAPACK does not provide for this."*

Hmm - with LAPACK supposedly designed for more efficient memory access patterns and relying on BLAS, so that the layout may be linked to performance, I wouldn't dare comment on that myself.

I don't really understand what you're doing, mixing LAPACK and boost uBLAS (I would suppose LAPACK uses its own BLAS?), but here's my ultimate attempt, in the form of a question anyway (consider it rhetorical): if you are doing a gathering operation to use only part of the big matrix, wouldn't other parts be gathered and processed too, and wouldn't that be the level at which to parallelise instead of the individual gathering operations? To avoid parallel overhead, it's typically best to aim as high as possible.


Thank you for your response. A few clarifying points:

You are right that LAPACK is based on BLAS and that the memory layout is therefore dictated thereby - I, too, am not in a position to comment on other possible implementations, especially if all that is parallelized is the BLAS substrate and not the drivers themselves, as it now seems to me to be the case.

Hence the "ideally" in my comment. LAPACK was created with different hardware constraints and, therefore, objectives than today's available options.

I do not mix uBLAS and LAPACK/BLAS. I only use the matrix structures from uBLAS - I consider the effort of rewriting BLAS in C++ largely misdirected and futile in the presence of vendor-provided and freely available multithreaded implementations.

You are absolutely right that the conclusion of this discussion seems to be that the organization of the parallelization has to happen at a higher level. However, this calls for a much more complicated design - especially if one wants to run many PDE problems based on the same PDE at once (the difference among them being the initial and boundary conditions).

Naturally, one tries the easier solution first. I was hoping there was one for my "problem", since admittedly it seems like a rather simple one, as you also noted in a previous comment. Alas...

Again thank you very much for all your help and your input.

Best Regards,

Petros


*"LAPACK was created with different hardware constraints and, therefore, objectives than today's available options."*

Could you briefly elaborate on that?


Many BLAS functions can be built with C or C++ after automated translation of the open Fortran source, but you can just as well compile them in Fortran and link them together with C++. The Fortran source, of course, is far too old to come with ISO C bindings, which you would need to add to get portable "extern C" linkage.

As you can see, I'm stabbing in the dark as to whether I can relate to the current question.


What I meant is, for example, the choice of in-place substitution of results, which was largely dictated by the fact that memory was way more "expensive", forcing the user to copy around instead of using pre-allocated buffers.

Also, having been written in FORTRAN, it cannot really provide for stride-based types (yes, it could, but at the expense of a lot of code and argument-list overhead. And yes, I understand that there are issues with efficiency, but this was not, at the time at least, the underlying concern).

As far as I am concerned, one of LAPACK's main clients is PDE solvers, and with the possibility of many more cores these days, problems that would be very slow in shared-memory configurations (like my PC, these days, and hopefully next year) are now tractable.

So, for example, when I try to solve a 4-d PDE using ADI (maybe even higher-dimensional when the Knights show up), I have the unknown function as a Boost multi-array, and when solving along each dimension at every time step I do need to make unnecessary copies to solve the linear systems. When I was able to do 2 dimensions efficiently, the overhead was a smaller fraction than with more dimensions. As a coder it is just a constant headache ;-))

There is no intention here of knocking LAPACK, of course ;-)).

For all design choices there are pros and cons. The value of LAPACK lies, as far as I am concerned, in the robustness of the mathematical algorithms rather than in its design (please note I only program in C++ and am FORTRAN-illiterate - and wish to stay this way ;-)) ).

Nice talking to you, Thank you again,

Petros


As far as threading goes, I thought GotoBLAS does address it, and it is free (no?).

Thank you for bringing up the constraint of using MKL and TBB at the same time. I was not clear on this, and it is a really useful piece of information (although somewhat disappointing...).

It so happens that in my code I do not mix the two (i.e. my TBB calls do not include MKL ones, by choice, and, of course, the opposite is simply not possible) - I had in mind to look into it in the future, though.

My response earlier had to do with af's observation that I was potentially mixing uBLAS and MKL for BLAS calls, which I don't - I have my own TBB-based version of vector expressions for pure BLAS-like (i.e. not LAPACK-feeding) calls - luckily I do not need to multiply matrices that much.

Thank you for the intervention, since I did learn something truly valuable.


*"What I meant is for example choices of in-place substitution of results which was largely dictated by the fact that memory was way more "expensive", forcing the user to copy around instead of using pre-allocated buffers."*

On http://www.netlib.org/lapack/faq.html I read: "The original goal of the LAPACK project was to make the widely used EISPACK and LINPACK libraries run efficiently on shared-memory vector and parallel processors. On these machines, LINPACK and EISPACK are inefficient because their memory access patterns disregard the multi-layered memory hierarchies of the machines, thereby spending too much time moving data instead of doing useful floating-point operations." Would this be consistent with a design constrained by expensive memory? How expensive was memory in 1992, LAPACK's first public release? Also, it has been demonstrated elsewhere that it is sometimes more efficient to copy data, operate on it, and copy back the result, as mentioned here, e.g., by Arch Robison (original architect of TBB), if I remember well.


*"(typically nRows = 40 and nCols ~= 1000)"*

This is likely too small a data set to divide up 8 ways (the number of threads) and amortize the task/thread start/stop overhead. Can you re-think your problem such that you can have 8 of these 40x1000 (sub-)arrays worked on at once?

Alternative suggestion: once you rearrange the data, I imagine you intend to do some work with it. If so, see if you can code it such that some threads rearrange the pieces and, as each piece is rearranged, the additional work is performed (enqueued) using the remaining threads. IOW, pipeline the operation (not necessarily using parallel_pipeline).

Jim Dempsey


Thank you for your response.

Let me state from the start that I am not qualified to talk about "multi-layered memory architectures" - this gets too low-level in what actually happens for me to have an educated opinion.

Naively speaking, and from my experience, LAPACK calls are only a part of bigger operations, i.e. solving a PDE and doing something with the results at every (time) step of the solution. The allocation of data in other big data structures is then no longer optional. Benchmarking just the LAPACK call is then probably not enough.

Again, I do not know enough to have an educated opinion.

The same document, for example, says a bit later that LAPACK can, as a result, solve tridiagonal systems with many rhs vectors very efficiently. In practice, though, unless solving Dirichlet-type problems where the one-dimensional operators are homogeneous wrt the rest of the dimensions, i.e. there is no mixing (a situation very common in ADI-type methods), I have never encountered a useful application for this ability/strength. It is very possible that I don't deal with the kind of problems for which this would be useful, however, and therefore have a limited view.

The Robinson result sounds interesting and I would like to know more about its rationale (is a pointer to the paper by any chance easy to produce?)

Finally, memory in 1992 was much more expensive than today. At the time, at least on PCs, it was not even shipped in GBs ;-)) - yes, I realize that LAPACK was most likely not designed for usage on PCs ;-)).

At this point, de facto, I will have to copy, and maybe I am lucky and this is the best choice.

Thank you for your comments,

Petros


I am thinking now along these lines.

I have come to realize that, typically, in a 4-dimensional equation with 50 points in each dimension, I would have 50 * 50^3 points, and then I think I would be fine.

For a 2-dimensional problem it would be suboptimal, but maybe who cares.

Your suggestion is definitely the right way to design this, but I would prefer to avoid the coding - and maintenance - overhead of it.

The inquiry sprang from a toy example I ran, and I was "struck" by the fact that I could not "take advantage" of all the computing power I had. I guess I was much more naive before this discussion.

Thank you very much,

Petros

ps: I will also look into the blog and site for educational reasons ;-))


Jim Dempsey


*"The Robinson result sounds interesting and I would like to know more about its rationale (is a pointer to the paper by any chance easy to produce?)"*

I thought I remembered Arch Robison (sic) mentioning something like that on this forum, but I only found this, which to me seems to be a lot more because of applicable sort algorithms than because of memory access reasons. Perhaps James has something to say about appropriate strategies on NUMA systems?

*"Finally, memory in 1992 was much more expensive than today."*

I do recall choosing a computer configured with 8 MB rather than the maximum around that time (later I splurged for an upgrade to 20 MB), but it was still a serial system, so I don't know what the issues were on the hardware supposedly considered for LAPACK, and how big a typical problem really is compared to the available memory. But here is where I should probably leave the discussion to any others with the relevant knowledge and experience.


Thank you very much for the suggestion - I was not aware of parallel_invoke (I probably have an older version of the book). I will look into it.

Thank you again,

Petros


Apologies for misspelling Robison's name... (from him too, actually).

Thank you for your help,

Petros

