- Intel Community
- Software Development SDKs and Libraries
- Intel® oneAPI Threading Building Blocks & Intel® Threading Building Blocks
- cpu performance w/a simple gather utility


Petros_Mamales

Beginner


08-16-2011
07:10 PM


cpu performance w/a simple gather utility

I have an i7 965 CPU running on Windows 7, using TBB 3.0.

The attached program is designed to gather, into a (column-format) matrix, elements that live in different buffers (in this case a bigger matrix).

I am trying to see how many processing units (out of 8?) are engaged in the calculation. From the Task Manager I see at most 4.

Is this a reliable test? Is there a better one than eyeballing it ;)) ?

[cpp]
// stl
#include <vector>
#include <iostream>
#include <cstdlib>   // rand, RAND_MAX
using std::vector;
using std::cout;
using std::endl;

// tbb
#include "tbb/parallel_for.h"
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"
#include "tbb/blocked_range2d.h"

// boost
#include <boost/numeric/ublas/matrix.hpp>
using boost::numeric::ublas::matrix;

// function to gather data into a single matrix as columns
template< typename _M, typename _T >
void gather_data_into_matrix_columns( _M & m,
                                      vector<_T*> const & dataStart,
                                      vector<size_t> const & dataStride )
{
    struct column_gatherer {
        _M & m_;
        vector<_T*> const & dataStart_;
        vector<size_t> const & dataStride_;
        column_gatherer( _M & m,
                         vector<_T*> const & dataStart,
                         vector<size_t> const & dataStride )
            : m_(m), dataStart_(dataStart), dataStride_(dataStride) {}
        void operator()( const tbb::blocked_range2d<size_t> & r ) const {
            size_t rowStart = r.rows().begin();
            size_t rowEnd   = r.rows().end();
            size_t colStart = r.cols().begin();
            size_t colEnd   = r.cols().end();
            for ( size_t colIdx = colStart; colIdx != colEnd; ++colIdx ) {
                size_t stride = dataStride_[colIdx];
                _T const * ptr = dataStart_[colIdx] + rowStart * stride;
                for ( size_t rowIdx = rowStart; rowIdx != rowEnd; ++rowIdx ) {
                    m_( rowIdx, colIdx ) = *ptr;
                    ptr += stride;
                }
            }
        }
    };
    static const size_t GRAIN_SIZE_ROW    = 10;
    static const size_t GRAIN_SIZE_COLUMN = 10;
    column_gatherer cg( m, dataStart, dataStride );
    tbb::parallel_for(
        tbb::blocked_range2d<size_t>( 0, m.size1(), GRAIN_SIZE_ROW,
                                      0, m.size2(), GRAIN_SIZE_COLUMN ),
        cg, tbb::auto_partitioner() );
}

bool test_GatherScatter() {
    const size_t nRows = 80;
    const size_t nCols = 20;
    matrix< double, boost::numeric::ublas::column_major > m( nRows, nCols );

    const size_t nRowsBig = 800;
    const size_t nColsBig = 100;
    matrix< double, boost::numeric::ublas::column_major > mBig( nRowsBig, nColsBig );
    for ( size_t i = 0; i != nRowsBig; ++i )
        for ( size_t j = 0; j != nColsBig; ++j )
            mBig(i,j) = double(rand())/double(RAND_MAX) - 0.5;

    vector< double * > colStart( nCols );
    vector< size_t >   strides( nCols );
    for ( size_t i = 0; i != nCols; ++i ) {
        colStart[i] = &( mBig(0, 3*i) );
        strides[i]  = 5;
    }

    gather_data_into_matrix_columns( m, colStart, strides );

    for ( size_t i = 0; i != nRows; ++i )
        for ( size_t j = 0; j != nCols; ++j )
            if ( m(i,j) != mBig( 5*i, 3*j ) ) {
                cout << "(i, j) = (" << i << ", " << j << ")" << endl;
                return false;
            }
    return true;
}

int main() {
    bool b;
    for ( size_t i = 0; i != 1000; ++i )
        b = test_GatherScatter();
}
[/cpp]

The code is a concatenation of various files. Conceptually it gives the right result. Please let me know if there is anything more you might need.

Thank you very much for your help,

Petros

ps: the motivation for this is to bring into one place (typically a matrix) data that will constitute the rhs of a LAPACK linear-equation solver.


22 Replies

RafSchietekat

Black Belt


08-16-2011
11:14 PM


Petros_Mamales

Beginner


08-17-2011
04:24 AM


Thank you for the quick reply. The external loop was only there to facilitate my inspecting what shows up in the Task Manager's performance tab.

I followed your advice and added bogus code inside the operator call of the nested class (repeated the assignment a good 1,000 times) and saw all 8 (logical) processors fire up. Coarsening the grain size had only partial results. Increasing the matrices past a certain limit would probably be unrealistic for the purposes of a single memory buffer living within the boost matrix.

Does this mean that this strategy cannot be improved for this size of problem? (typically nRows = 40 and nCols ~= 1000)

Finally, coming back to my initial question, is there some utility within TBB to inspect such statistics? If not, it might be a useful addition for tuning purposes.

Again, thank you very much for your help,

Petros

ps: I did not understand why I was redirected. Was I in the wrong place?

RafSchietekat

Black Belt


08-17-2011
05:36 AM


Scaling code apparently remains challenging even with a task-based framework... Unless you have an opportunity to parallelise at a higher level, this may be the best you can do, and if it's important to do better you may have to try to be creative.

Intel offers developer products that incorporate both TBB and diagnostic tools, but the latter are not part of the former.

Petros_Mamales

Beginner


08-17-2011
10:44 AM


I spent a good chunk of time today going over these ideas.

I do have Intel's Parallel Studio and tried a few experiments with the Parallel Amplifier (PA) as well, but I am afraid all this was inconclusive.

In order to get something meaningful through the PA, I had to break the code up into smaller pieces (so that, for example, the filling of the big source matrix does not mask the performance) - I also tried an MKL multithreaded random-number generator.

Interestingly, the PA even showed the VML/VSL functions as poorly threaded - I guess to make me feel a bit better about all this ;)).

Now, there is the possibility that I am doing something less than clever with the PA.

It is noteworthy that, if I do a loop of 100 reps of the gather function, the total tick_count is approximately twice (!!) the time for a single call. Now, I do realize that this is wall-clock time, but is there any chance that most of the time is spent setting up the thread pool rather than running it? If that is the case, is there a way to avoid this toll?

Finally, I ran this with very large grain sizes - assuming that this renders it single-threaded - correct? It turns out that my total gain is approx. 50%, which with 8 processors and nothing else happening on my PC is rather suboptimal ;-)).

Any ideas you might have will be greatly appreciated.

Thank you very much for all your help,

Petros

RafSchietekat

Black Belt


08-17-2011
12:56 PM


Petros_Mamales

Beginner


08-20-2011
11:27 AM


I was going for some kind of generalization of the pack/unpack utilities of the vml library (part of mkl).

Hence the effort.

The reason for this is that, in PDEs you very often need to solve systems of equations at each time step.

In the class of methods I am considering, in the case of multidimensional problems, one further has to solve systems along every dimension (as many as the product of the sizes of the remaining dimensions). Hence this snippet, which as you say calculates in a flash, is to be used thousands of times. Ideally the LAPACK routines would allow for this by accepting stride information for the vector arguments. Alas, LAPACK does not provide for this. So copying among different structures becomes indispensable (sometimes there are other organizations of the problem that bypass this, but not always).

This is just a note that I did not mean to waste your time on an inane issue ;-)).

Thank you very much for all your help,

Have a great Weekend,

Petros

RafSchietekat

Black Belt


08-20-2011
08:32 PM


I have found it is often more efficient to speculate than to have many rounds of Q&A before a fearful first attempt at a suggestion, but in light of this additional background information I admit it does seem rather difficult to parallelise successive time steps... (Note that I really meant "challenges" instead of just silly "distractions".)

I don't really understand what you're doing, mixing LAPACK and boost uBLAS (I would suppose LAPACK uses its own BLAS?), but here's my ultimate attempt, in the form of a question anyway (consider it rhetorical): if you are doing a gathering operation to use only part of the big matrix, wouldn't other parts be gathered and processed too, and wouldn't that be the level at which to parallelise instead of the individual gathering operations? To avoid parallel overhead, it's typically best to aim as high as possible.

Petros_Mamales

Beginner


08-22-2011
09:04 AM


Thank you for your response. A few clarifying points:

You are right about LAPACK being based on BLAS, and therefore about the memory layout being dictated thereby. I, too, am not in a position to comment on other possible implementations, especially if all that is parallelized is the BLAS substrate and not the drivers themselves, as now seems to me to be the case.

Hence the "ideally" in my comment. LAPACK was created with different hardware constraints and, therefore,

objectives than today's available options.

I do not mix uBLAS and LAPACK/BLAS. I only use the matrix structures from uBLAS - I consider the effort of rewriting BLAS in C++ largely misdirected and futile in the presence of vendor-provided and free multithreaded implementations.

You are absolutely right that the upshot of this discussion seems to be that the organization of the parallelization has to happen at a higher level. However, this calls for a much more complicated design - especially if one wants to run many PDE problems based on the same PDE at once (the difference among them being the initial and boundary conditions).

Naturally, one tries the easier solution first. I was hoping there was one to my "problem", since admittedly it seems like a rather simple one, as you also noted in a previous comment. Alas..

Again thank you very much for all your help and your input.

Best Regards,

Petros

RafSchietekat

Black Belt


08-22-2011
11:17 AM


TimP

Black Belt


08-22-2011
12:55 PM


Many BLAS functions can be built with C or C++ after automated translation of the open Fortran source, but you can just as well compile them in Fortran and link them together with C++. The Fortran source, of course, is far too old to come with ISO C binding, which you would need to add to get portable extern "C" linkage.

As you can see, I'm stabbing in the dark as to whether I can relate to the current question.

Petros_Mamales

Beginner


08-22-2011
05:49 PM


What I meant is, for example, the choice of in-place substitution of results, which was largely dictated by the fact that memory was much more "expensive", forcing the user to copy around instead of using pre-allocated buffers.

Also, having been written in FORTRAN, it cannot really provide for stride-based types (yes, it could, but at the expense of a lot of code/argument-list overhead. And yes, I understand that there are issues with efficiency, but this was not, at the time at least, the underlying concern).

As far as I am concerned, one of LAPACK's main clients is PDE solvers, and with the possibility of many more cores these days, problems that would be very slow in shared-memory configurations (like my PC, these days and hopefully next year) are now tractable.

So, for example, when I try to solve a 4-d PDE using ADI (maybe even higher-dimensional when the Knights show up), I have the unknown function as a boost multi-array, and when solving along each dimension at every time step I do need to make unnecessary copies to solve the linear systems. When I was able to do 2 dimensions efficiently, the overhead was a smaller fraction than with more dimensions. As a coder it is just a constant headache ;-))

There is no intention here of knocking LAPACK, of course ;-)).

For all design choices there are pros and cons. The value of LAPACK lies, as far as I am concerned, in the robustness of the mathematical algorithms rather than its design (please note I only program in C++ and am FORTRAN-illiterate - and wish to stay that way ;-)) ).

Nice talking to you, Thank you again,

Petros

Petros_Mamales

Beginner


08-22-2011
06:03 PM


As far as threading goes, I thought GotoBLAS does address it, and it is free (no?).

Thank you for bringing up the constraint of using MKL and TBB at the same time. I was not clear on this, and it is a really useful piece of information (although somewhat disappointing..).

It so happens that in my code I do not mix the two (i.e. my TBB calls do not include MKL ones, by choice, and of course the opposite is simply not possible) - I had in mind to look into it in the future, though.

My response earlier had to do with Raf's observation that I was potentially mixing uBLAS and MKL for BLAS calls, which I don't - I have my own TBB-based version of vector expressions for pure BLAS-like (i.e. not LAPACK-feeding) calls - luckily I do not need to multiply matrices that much.

Thank you for the intervention, since I did learn something truly valuable.

RafSchietekat

Black Belt


08-23-2011
11:56 AM


jimdempseyatthecove

Black Belt


08-24-2011
11:58 AM


> typically nRows = 40 and nCols ~= 1000

This is likely too small a data set to divide 8 ways (the number of threads) and amortize the task/thread start/stop overhead. Can you re-think your problem such that you can have 8 of these 40x1000 (sub-)arrays worked on at once?

Alternative suggestion: once you rearrange the data, I imagine you intend to do some work with it. If so, see if you can code it such that some threads rearrange the pieces while, as each piece is rearranged, the additional work is performed (enqueued) on the remaining threads. IOW, pipeline the operation (not necessarily using parallel_pipeline).

Jim Dempsey

Petros_Mamales

Beginner


08-24-2011
01:13 PM


Thank you for your response.

Let me state from the start that I am not qualified to talk about "multi-layered memory architectures" - this is getting too low-level, relative to what actually happens, for me to have an educated opinion.

Naively speaking, and from my experience, LAPACK calls are only a part of bigger operations, i.e. solving a PDE and doing something with the results at every (time) step of the solution. The allocation of data in other big data structures is then no longer optional. Benchmarking just the LAPACK call is then probably not enough.

Again, I do not know enough to have an educated opinion.

The same document, for example, says a bit later that LAPACK can, as a result, solve tridiagonal systems with many rhs vectors very efficiently. In practice though, unless solving Dirichlet-type problems where the one-dimensional operators are homogeneous with respect to the rest of the dimensions, i.e. there is no mixing (a situation very common in ADI-type methods), I have never encountered a useful application for this ability/strength. It is very possible, however, that I don't deal with the kind of problems for which this would be useful, and therefore have a limited view.

The Robinson result sounds interesting, and I would like to know more about its rationale (is a pointer to the paper by any chance easy to produce?).

Finally, memory in 1992 was much more expensive than today. At the time, at least on PCs, it was not even shipped in GBs ;-)) - yes, I realize that LAPACK was most likely not designed for usage on PCs ;-)).

At this point, de facto, I will have to copy, and maybe I am lucky and this is the best choice.

Thank you for your comments,

Petros

Petros_Mamales

Beginner


08-24-2011
01:22 PM


I am thinking now along these lines.

I have come to realize that, typically in a 4-dimensional equation with 50 points in each dimension, I would have 50 * 50^3 points, and then I think I would be fine.

For a 2-dimensional problem it would be suboptimal, but maybe who cares.

Your suggestion is definitely the right way to design this, but I would prefer to avoid the coding - and maintenance - overhead of it.

The inquiry sprang from a toy example I ran, and I was "struck" by the fact that I could not "take advantage" of all the computing power I had. I guess I was much more naive before this discussion.

Thank you very much,

Petros

ps: will also look into this blog and site for educational reasons ;-))

jimdempseyatthecove

Black Belt


08-24-2011
03:09 PM


Jim Dempsey

RafSchietekat

Black Belt


08-24-2011
03:16 PM


Petros_Mamales

Beginner


08-24-2011
03:41 PM


Thank you very much for the suggestion - I was not aware of parallel_invoke (I probably have an older version of the book). Will look into it.

Thank you again,

Petros
