Intel® oneAPI Threading Building Blocks

Performance Tuning YetiSim For Parallel Processing

AJ13

Alright, after a lot of effort I have rewritten YetiSim from scratch with new components and a lot of template programming. Below are the timings from a run profiled with gprof (I have yet to get VTune to work without crashing). The SimExecutionObject::execute entry, which takes about 27% of the time, is where the real work happens, so the larger its share of the run time, the better. BTW, this is on a dual quad-core Xeon machine. I am also working on getting the code onto a 128-processor Itanium 2 monster, but I am having compilation issues.

My question is: what exactly does the measurement for tbb::internal::start_for represent? I know what it does in the code, but does this mean that the overhead of parallelization takes 18% of execution time?


I will be trying Thread Checker soon. It keeps giving me a floating-point exception when I run it, so I will have to look into that later today.

Any suggestions on general performance analysis of parallel code would be appreciated. The code was compiled with the Intel C++ Compiler 10.1, using the flags -O2 -lirc -ltbb -ltbbmalloc.

AJ




Flat profile:


Each sample counts as 0.01 seconds.

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name

27.61 18.28 18.28 3624777 0.01 0.01 SimExecutionObject >, tcc::ptr_vector_default> >::execute()

18.44 30.49 12.21 tbb::internal::start_for<:CONCURRENT_VECTOR> >::generic_range_type<:INTERNAL::VECTOR_ITERATOR><:CONCURRENT_VECTOR> >, ExecutionObject*> >, ParallelExecutionObjectRunner, tbb::affinity_partitioner>::execute()

15.92 41.03 10.54 30523145 0.00 0.00 boost::iterator_facade<:TRANSFORM_ITERATOR><:INTERNAL_EDGE_TO_VERTEX_EDGE_PAIR><:INTERNAL_EDGE> >, tcc::internal_vertex >, tcc::ptr_vector_default> >, tcc::vertex_edge_pair<:INTERNAL_VERTEX> >, tcc::ptr_vector_default> > >, __gnu_cxx::__normal_iterator<:INTERNAL_EDGE> >, tcc::internal_vertex >, tcc::ptr_vector_default> >**, std::vector<:INTERNAL_EDGE> >, tcc::internal_vertex >, tcc::ptr_vector_default> >*, std::allocator<:INTER nal_edge=""> >, tcc::internal_vertex >, tcc::ptr_vector_default> >*> > >, boost::use_default, boost::use_default>, tcc::vertex_edge_pair<:INTERNAL_VERTEX> >, tcc::ptr_vector_default> >, boost::random_access_traversal_tag, tcc::vertex_edge_pair<:INTERNAL_VERTEX> >, tcc::ptr_vector_default> >, int>::operator->() const

15.57 51.34 10.31 3435987 0.00 0.00 Clock::tick()

7.37 56.22 4.88 3325636 0.00 0.00 lessThanMinute(Clock&)

4.44 59.16 2.94 20885637 0.00 0.00 boost::function1 >::operator void (boost::function1 >::dummy::*)()() const

1.60 60.22 1.06 tbb::task_scheduler_init::~task_scheduler_init()

1.28 61.07 0.85 3355148 0.00 0.00 boost::function1 >::operator()(Clock&) const

1.27 61.91 0.84 tcc::ptr_vector, std::vector*, std::allocator*> >, tbb::spin_mutex, boost::checked_deleter > >::~ptr_vector()

1.13 62.66 0.75 3485931 0.00 0.00 SimLink::getDouble()

1.08 63.37 0.72 3577582 0.00 0.00 SimLink::getFunction()

1.08 64.09 0.72 3451658 0.00 0.00 boost::detail::function::void_function_obj_invoker1<:_BI::BIND_T>, boost::_bi::list1<:ARG><1> > >, void, Clock&>::invoke(boost::detail::function::any_pointer, Clock&)

0.89 64.68 0.59 3314952 0.00 0.00 boost::detail::function::function_invoker1::invoke(boost::detail::function::any_pointer, Clock &)

0.86 65.25 0.57 7097027 0.00 0.00 SimNode >::callExitFunction(Clock&)

0.20 65.38 0.13 main

0.17 65.49 0.11 1000000 0.00 0.00 tcc::execution_graph::execute(Clock&)

0.16 65.59 0.11 __gnu_cxx::new_allocator<:_BI::BIND_T>, boost::_bi::list1<:ARG><1> > > >::destroy(boost::_bi::bind_t, boost::_bi::list1<:ARG><1> > >*)

0.15 65.69 0.10 1000000 0.00 0.00 _ZN14GraphExecutionI5ClockN3tcc15internal_vertexI7SimNodeI7SimLinkIS0_EENS1_18ptr_vector_defaultEEEEC9ERS0_RS8_

0.14 65.78 0.09 1000000 0.00 0.00 GraphExecution >, tcc::ptr_vector_default> >::addExecutionObject(SimExecutionObject >, tcc::ptr_vector_default> >*)


Alexey_K_Intel3
Employee

aj.guillon@gmail.com:
My question is: what exactly does the measurement for tbb::internal::start_for represent? I know what it does in the code, but does this mean that the overhead of parallelization takes 18% of execution time?

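Not necessarily. It may be that the compiler inlined the operator() of your body object into the execute() method of the parallel_for task. In that case, the time spent in the body is attributed to start_for::execute() in the flat profile, so the 18% would include your own loop work and not only scheduling overhead.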
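
One way to check the inlining explanation is to keep the body's operator() out of line, so the profiler reports it on its own row. The following is only a minimal sketch under stated assumptions: it uses the GCC-style `__attribute__((noinline))`, which GCC-compatible compilers accept, and the names `NoInlineBody`, `run_range`, and `doubled_sum` are hypothetical stand-ins, not TBB or YetiSim code. Here `run_range` merely plays the role that start_for::execute() plays in the profile; it is serial and does not use TBB at all.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a loop body that would be passed to a
// parallel loop. It doubles each element in the half-open range [begin, end).
struct NoInlineBody {
    std::vector<double>* data;

    // GCC-style attribute (an assumption about the toolchain): keeps
    // operator() as a separate function, so a sampling profiler can
    // attribute its time to its own row instead of to the caller.
    __attribute__((noinline))
    void operator()(std::size_t begin, std::size_t end) const {
        for (std::size_t i = begin; i < end; ++i)
            (*data)[i] *= 2.0;
    }
};

// Hypothetical stand-in for the library task that invokes the body.
// It walks [0, n) in chunks of `grain` elements and calls the body on each.
template <typename Body>
void run_range(const Body& body, std::size_t n, std::size_t grain) {
    assert(grain > 0);  // a zero grain size would never advance
    for (std::size_t begin = 0; begin < n; begin += grain) {
        std::size_t end = (begin + grain < n) ? begin + grain : n;
        body(begin, end);
    }
}

// Helper for illustration: doubles a vector of n ones and returns the sum.
double doubled_sum(std::size_t n, std::size_t grain) {
    std::vector<double> v(n, 1.0);
    NoInlineBody body{&v};
    run_range(body, n, grain);
    double sum = 0.0;
    for (double x : v) sum += x;
    return sum;
}
```

If the same attribute is applied to the real body's operator() and the program is rebuilt and profiled again, time that moves from the start_for::execute() row to the body's own row was body work rather than scheduling overhead. Forcing the call out of line can itself slow the loop slightly, so this is better suited to diagnosis than to production builds.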
