Performance Tuning YetiSim For Parallel Processing

AJ13 · ‎03-18-2008

Alright, after a lot of effort I have rewritten YetiSim from scratch with new components and a lot of template programming. Here are the timings from a run under gprof (I have yet to get VTune to work without crashing). The SimExecutionObject::execute entry that you see taking 27% of the time is the beef, the actual simulation work, so the larger its share of processing time, the better. BTW, this is on a dual quad-core Xeon. I'm working on getting the code onto a 128-processor Itanium 2 monster, but I'm having compilation issues.

Question is... the measurement for tbb::internal::start_for.... what exactly is that? I mean, I know what it does in the code... but does this mean that the overhead of running parallelization takes 18% of execution time?

I will be trying "Thread Checker" soon... it keeps giving me a floating point exception when I run it... so I'll have to play with it later today.

Any suggestions on general analysis of parallel code for performance would be appreciated. This was compiled with Intel C++ compiler 10.1, with -O2 and -lirc -ltbb -ltbbmalloc flags.

AJ

Flat profile:

Each sample counts as 0.01 seconds.

     %cumulative     self               self    total
  time   seconds  seconds     calls  ms/call  ms/call  name
 27.61     18.28    18.28   3624777     0.01     0.01  SimExecutionObject >, tcc::ptr_vector_default> >::execute()
 18.44     30.49    12.21                              tbb::internal::start_for<:CONCURRENT_VECTOR> >::generic_range_type<:INTERNAL::VECTOR_ITERATOR><:CONCURRENT_VECTOR> >, ExecutionObject*> >, ParallelExecutionObjectRunner, tbb::affinity_partitioner>::execute()
 15.92     41.03    10.54  30523145     0.00     0.00  boost::iterator_facade<:TRANSFORM_ITERATOR><:INTERNAL_EDGE_TO_VERTEX_EDGE_PAIR><:INTERNAL_EDGE> >, tcc::internal_vertex >, tcc::ptr_vector_default> >, tcc::vertex_edge_pair<:INTERNAL_VERTEX> >, tcc::ptr_vector_default> > >, __gnu_cxx::__normal_iterator<:INTERNAL_EDGE> >, tcc::internal_vertex >, tcc::ptr_vector_default> >**, std::vector<:INTERNAL_EDGE> >, tcc::internal_vertex >, tcc::ptr_vector_default> >*, std::allocator<:INTER nal_edge=""> >, tcc::internal_vertex >, tcc::ptr_vector_default> >*> > >, boost::use_default, boost::use_default>, tcc::vertex_edge_pair<:INTERNAL_VERTEX> >, tcc::ptr_vector_default> >, boost::random_access_traversal_tag, tcc::vertex_edge_pair<:INTERNAL_VERTEX> >, tcc::ptr_vector_default> >, int>::operator->() const
 15.57     51.34    10.31   3435987     0.00     0.00  Clock::tick()
  7.37     56.22     4.88   3325636     0.00     0.00  lessThanMinute(Clock&)
  4.44     59.16     2.94  20885637     0.00     0.00  boost::function1 >::operator void (boost::function1 >::dummy::*)()() const
  1.60     60.22     1.06                              tbb::task_scheduler_init::~task_scheduler_init()
  1.28     61.07     0.85   3355148     0.00     0.00  boost::function1 >::operator()(Clock&) const
  1.27     61.91     0.84                              tcc::ptr_vector, std::vector*, std::allocator*> >, tbb::spin_mutex, boost::checked_deleter > >::~ptr_vector()
  1.13     62.66     0.75   3485931     0.00     0.00  SimLink::getDouble()
  1.08     63.37     0.72   3577582     0.00     0.00  SimLink::getFunction()
  1.08     64.09     0.72   3451658     0.00     0.00  boost::detail::function::void_function_obj_invoker1<:_BI::BIND_T>, boost::_bi::list1<:ARG><1> > >, void, Clock&>::invoke(boost::detail::function::any_pointer, Clock&)
  0.89     64.68     0.59   3314952     0.00     0.00  boost::detail::function::function_invoker1::invoke(boost::detail::function::any_pointer, Clock &)
  0.86     65.25     0.57   7097027     0.00     0.00  SimNode >::callExitFunction(Clock&)
  0.20     65.38     0.13                              main
  0.17     65.49     0.11   1000000     0.00     0.00  tcc::execution_graph::execute(Clock&)
  0.16     65.59     0.11                              __gnu_cxx::new_allocator<:_BI::BIND_T>, boost::_bi::list1<:ARG><1> > > >::destroy(boost::_bi::bind_t, boost::_bi::list1<:ARG><1> > >*)
  0.15     65.69     0.10   1000000     0.00     0.00  _ZN14GraphExecutionI5ClockN3tcc15internal_vertexI7SimNodeI7SimLinkIS0_EENS1_18ptr_vector_defaultEEEEC9ERS0_RS8_
  0.14     65.78     0.09   1000000     0.00     0.00  GraphExecution >, tcc::ptr_vector_default> >::addExecutionObject(SimExecutionObject >, tcc::ptr_vector_default> >*)

Alexey-Kukanov · ‎03-18-2008

aj.guillon@gmail.com:
Question is... the measurement for tbb::internal::start_for.... what exactly is that? I mean, I know what it does in the code... but does this mean that the overhead of running parallelization takes 18% of execution time?

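It can happen that the compiler inlined the operator() of the body object (ParallelExecutionObjectRunner in your profile) into the execute() method of the parallel_for task. In that case gprof charges the time spent in your body to start_for::execute(), so the 18% is not necessarily parallelization overhead.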