Intel® oneAPI Threading Building Blocks

create_token and needless copying

Marc_G_2
Beginner

Hello,

I am trying to use parallel_pipeline in my code. I was using a large type as the output of a filter, and I noticed that it is copied right after construction:

u_helper::create_token(my_body(t_helper::token(temp_input)))

where create_token does:

return new (output_t) T(source);

It would be nice if this were changed to construct the output in place, or at least, in C++11, to move instead of copy. I would much prefer the first solution, even though gcc currently still generates an extra move there (clang doesn't).
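To make the three options concrete, here is a minimal standalone sketch. It is not TBB's actual code: `Big`, `create_token_copy`, `create_token_move` and `create_token_emplace` are hypothetical names, and the caller is assumed to supply suitably aligned storage. The "in place" variant shown here forwards constructor arguments, which is only one possible way to avoid the temporary.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>

// Hypothetical stand-in for a large filter output. It counts how often it is
// copy- or move-constructed so the three variants can be compared.
struct Big {
    static int copies;
    static int moves;
    int payload[256];

    explicit Big(int seed) { for (int& x : payload) x = seed; }
    Big(const Big& o) {
        for (std::size_t i = 0; i < 256; ++i) payload[i] = o.payload[i];
        ++copies;
    }
    Big(Big&& o) noexcept {
        for (std::size_t i = 0; i < 256; ++i) payload[i] = o.payload[i];
        ++moves;
    }
};
int Big::copies = 0;
int Big::moves = 0;

// Copying variant, same shape as `new (output_t) T(source)`.
template <typename T>
T* create_token_copy(void* storage, const T& source) {
    assert(storage != nullptr);
    return new (storage) T(source);
}

// Moving variant: must be called with an rvalue (e.g. std::move(x)).
template <typename T>
T* create_token_move(void* storage, T&& source) {
    assert(storage != nullptr);
    return new (storage) T(std::move(source));
}

// In-place variant: constructs T directly in the storage from its
// constructor arguments, so no temporary T exists at all.
template <typename T, typename... Args>
T* create_token_emplace(void* storage, Args&&... args) {
    assert(storage != nullptr);
    return new (storage) T(std::forward<Args>(args)...);
}
```

With storage declared as `alignas(Big) unsigned char buf[sizeof(Big)];`, the first variant bumps `Big::copies`, the second bumps `Big::moves`, and the third touches neither counter. Each object created this way has to be destroyed explicitly with `p->~Big()`.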

I was also surprised that TBB keeps allocating and deallocating tokens. I was hoping it would recycle memory automatically, but that might not be easy to do in the general case. If I understand correctly, the filters are really meant for pointers, with allocation handled manually on the side (using knowledge of the specific pipeline). However, although the examples in the doc do use pointers, I didn't find any note there advising against using large types directly, so I am not sure how strongly this is recommended.

4 Replies
RafSchietekat
Valued Contributor III

The code has "// TODO: add move-constructors, move-assignment, etc. where C++11 is available.", so that's the plan all right. I'd get right on that if I weren't concocting something else already. :-)

The original pipeline only handled pointers, but parallel_pipeline has a more sophisticated API, with some unfortunate overhead for larger objects (smaller objects are still handed over as efficiently as pointers). Perhaps pipelines that pass large tokens between some filters could, e.g., determine a maximum token size and buffer memory locally at the filters that receive and emit those tokens, instead of dispatching each (de)allocation call to the correct bin size, as happens even with the scalable allocator. It is not a priori clear how beneficial that would be, though, because the scalable allocator is quite efficient already.

I agree that the documentation could be improved, perhaps targeting multiple levels of experience, and if you have a single object that conceptually passes through the pipeline you should certainly use a pointer.

Marc_G_2
Beginner

Raf Schietekat wrote:

The code has "// TODO: add move-constructors, move-assignment, etc. where C++11 is available.",

Ah, good, I'd missed that. It is probably a general comment that applies to a lot of TBB.

because the scalable allocator is quite efficient already.

Do you mean linking with libtbbmalloc? I had tried it, and it makes my code significantly slower than glibc's default allocator (which isn't reputed for being the fastest allocator around).

if you have a single object that conceptually passes through the pipeline you should certainly use a pointer.

Sadly, I don't. My pipeline consists of:

1) a serial_in_order counter, that gives integers 0, 1, ... up to a billion or so.

2) a parallel filter that, for an integer, gives a structure which contains essentially a static_vector (it has to walk through a graph to find its elements, that's the slow part)

3) a final serial_in_order filter that processes those numbers and uses the static_vector as a cache for fast access to that data.

It would be simpler to do the allocation in the serial part 1); then I could use a circular buffer and pass a pointer through. But for a first version, I was trying to push as much code as possible to the parallel part.
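A minimal sketch of that circular-buffer idea, independent of TBB. `Slot` and `SlotRing` are hypothetical names. The sketch assumes that `acquire()` is called only from the serial input stage, that the last stage is serial_in_order, and that the ring has at least as many slots as the pipeline allows live tokens, so a slot is never handed out again while an older token still uses it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical token payload: the data that stage 2 fills and stage 3 reads.
struct Slot {
    long first_index = 0;
    std::vector<int> data;  // capacity is kept when the slot is reused
};

// Fixed ring of preallocated slots. The serial input stage hands out
// pointers round-robin; no locking is needed because only that one
// stage ever calls acquire().
class SlotRing {
public:
    explicit SlotRing(std::size_t n) : slots_(n), next_(0) { assert(n > 0); }

    Slot* acquire() {
        Slot* s = &slots_[next_ % slots_.size()];
        ++next_;
        return s;
    }

    std::size_t capacity() const { return slots_.size(); }

private:
    std::vector<Slot> slots_;
    std::size_t next_;
};
```

The filters would then pass `Slot*` between them, and the only allocations left are whatever growth `data` needs the first few times each slot is used.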

To minimize the overhead, I am actually working with batches: 1) gives integers 0, 1000, 2000, 3000, etc., 2) returns a vector of 1000 such structures, and 3) iterates over it. This can give good performance, but it is extremely sensitive to the parameters. If I don't explicitly restrict the number of threads in task_scheduler_init (4 seems to be optimal), performance drops even if the first argument of parallel_pipeline is low (I've seen it become more than 3 times slower on a machine with many cores).
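The batching itself can be sketched without TBB as follows. `BatchCounter` and `process_batch` are hypothetical names, and the per-index work (`i * 2`) is only a placeholder for the graph walk. The counter plays the role of the serial input stage, and `process_batch` the role of the parallel stage.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Serial stage stand-in: yields half-open ranges [begin, end) of at most
// `batch` indices until `limit` is reached.
struct BatchCounter {
    long next;
    long limit;
    long batch;

    BatchCounter(long limit_, long batch_) : next(0), limit(limit_), batch(batch_) {
        assert(batch_ > 0);
    }

    // Returns false once every index below `limit` has been handed out.
    bool next_range(long& begin, long& end) {
        if (next >= limit) return false;
        begin = next;
        end = std::min(next + batch, limit);
        next = end;
        return true;
    }
};

// Parallel stage stand-in: produces one result per index in the batch.
// The real per-index work is application specific; i * 2 is a placeholder.
inline std::vector<long> process_batch(long begin, long end) {
    std::vector<long> out;
    out.reserve(static_cast<std::size_t>(end - begin));
    for (long i = begin; i < end; ++i) out.push_back(i * 2);
    return out;
}
```

With a limit of 2500 and a batch size of 1000, the counter yields [0, 1000), [1000, 2000) and a short final batch [2000, 2500).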

Thank you for your helpful reply.

RafSchietekat
Valued Contributor III

"Do you mean linking with libtbbmalloc? I had tried it, and it makes my code significantly slower than glibc's default allocator (which isn't reputed for being the fastest allocator around)."

That right there would be cause for immediate concern. Do you have a reproducer and benchmark data to support your claim?

 

Marc_G_2
Beginner

Raf Schietekat wrote:

"Do you mean linking with libtbbmalloc? I had tried it, and it makes my code significantly slower than glibc's default allocator (which isn't reputed for being the fastest allocator around)."

That right there would be cause for immediate concern. Do you have a reproducer and benchmark data to support your claim?

Hmm, I don't remember on which machine and with which version of the code I got that. I tried to reproduce it on a couple of versions, and in all cases LD_PRELOAD=libtbbmalloc.so had no effect on the timing. On my dual-core laptop, timings are quite stable, but on the many-core server they vary wildly (the same program takes 12s once and 18s the next time). So it is possible that I only ran the program a couple of times with libtbbmalloc, got unlucky timings and gave up on it. In that case, sorry for the false alarm. I'll keep trying, and if I get a reproducible slow-down, I'll make sure I freeze the image and contact you.
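For timings that noisy, one option is to repeat each configuration several times and compare medians rather than single runs. A minimal sketch, where `time_runs` and `median` are hypothetical helper names and `work` stands for whatever runs the pipeline once:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Runs `work` the given number of times and returns each wall-clock
// duration in seconds, measured with a monotonic clock.
template <typename F>
std::vector<double> time_runs(F work, int runs) {
    std::vector<double> seconds;
    for (int i = 0; i < runs; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        work();
        auto t1 = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    return seconds;
}

// Median of a non-empty sample; less sensitive to one unlucky run
// than the mean or a single measurement.
inline double median(std::vector<double> v) {
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    std::size_t n = v.size();
    return (n % 2 == 1) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}
```

Comparing `median(time_runs(run_pipeline, 10))` with and without the preloaded allocator would show whether a difference survives the run-to-run variation; `run_pipeline` here is a placeholder for the actual program.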
