
Newbie

Kimman_B_
Beginner
580 Views

I searched several times before posting this.

We have a workflow application server with BerkeleyDB running on Windows 2012 - it is used for some critical back-office operations at a leading bank with up to 1000 concurrent users. The application runs on a machine with 

Performance is sometimes sluggish under heavy usage, and we are trying to figure out where the bottlenecks are. Our internal code analysis helped us get a 30% improvement at the 50th percentile of our response analysis, but we are looking for further improvements since the workload is going to increase - the peak system throughput will have to be sustained for more hours per day.

Can VTune be run in a production environment, or will its data collection 'drag' the system down to the point of being unusable? In test runs using automated tools, the performance is much better than in production.

How long should one run the collection for? Our application usage is highly repetitive - users process items in a queue and move them to the next queue in the workflow.

Appreciate the help for a newbie.

Thanks in advance

Kimman

3 Replies
TimP
Honored Contributor III

We have sometimes used artificial fixed workloads under VTune in order to correlate repeated runs, possibly adjusting the load to the point where it just triggers the bottlenecks. The ideal profiled run length would be just a few minutes, if that covers the event paths you need. As you lengthen the run or increase the number of threads, you must increase the sample-after values (possibly indirectly, via the Advanced expected run time duration setting) or increase the sampling interval so as to keep the .tb5 file size manageable.
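To make the trade-off between run length, thread count and sample-after value concrete, here is a rough back-of-the-envelope sketch. It assumes a clock-tick event that fires once per cycle on every busy logical CPU, so the sample count is roughly duration × busy CPUs × clock rate ÷ sample-after value. The function names and all the numbers (16 busy CPUs, 2.5 GHz, a sample-after value of 2,000,000) are illustrative assumptions, not figures from this thread or VTune defaults.

```python
def estimated_samples(duration_s, busy_cpus, clock_hz, sample_after):
    """Approximate number of clock-tick samples collected during a run.

    Assumes one clock-tick event per cycle on each busy logical CPU and
    one sample every `sample_after` events (a simplification).
    """
    return duration_s * busy_cpus * clock_hz // sample_after


def sample_after_for_budget(duration_s, busy_cpus, clock_hz, max_samples):
    """Smallest sample-after value that keeps a run within `max_samples`."""
    total_events = duration_s * busy_cpus * clock_hz
    return -(-total_events // max_samples)  # ceiling division on integers


# Hypothetical machine: 16 busy logical CPUs at 2.5 GHz.
CPUS = 16
CLOCK_HZ = 2_500_000_000

two_minute_run = estimated_samples(120, CPUS, CLOCK_HZ, 2_000_000)
ten_minute_sav = sample_after_for_budget(600, CPUS, CLOCK_HZ, two_minute_run)

print("samples in a 2-minute run:", two_minute_run)
print("sample-after value for a 10-minute run, same sample count:", ten_minute_sav)
```

Under these assumptions, a run five times longer needs a sample-after value five times larger to stay at the same data volume, which is the adjustment described above.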

Assuming the production workload is non-repeatable, you may want to choose a minimal group of events for a given run, since you will not be able to rely on correlating several events collected in separate runs (e.g. clock ticks vs. cache misses) anyway. Running VTune with any aggressive setting may increase the risk of failure.

Locks and waits analysis may prove useful if those are a factor in limiting throughput for heavy workload, but I haven't seen any expert discussions on that.

Possibly having source code for your DB will make VTune analysis more interesting than it might be otherwise.

Kimman_B_
Beginner

Thanks very much. Let me share this with the folks at the Bank and see whether they can allow this.

I do suspect that locks and waits are a likely source of the delays - the same application sequence, involving 5-7 DB transactions, takes anywhere from a few hundred msecs (even at peak times) to several seconds.

Based on documents on the Internet we are changing over to using Critical Sections instead of Mutexes in our application code.

Using BerkeleyDB means that we have the source code for the DB. Most of the locks / waits happen in there.

Thanks once again

Kimman

David_A_Intel1
Employee

Hi Kimman:

Ideally, if your app has a "steady state" and you can profile that for 1-2 minutes, you should get a good idea of what the code is doing. It's true that a production environment means the workload is not repeatable, but if you are seeing slowness then you will still be able to capture that in a profile. If you don't see it in a 1-2 minute collection, you can try incrementally increasing the collection time - being sure to set the "Estimated time duration" appropriately.

Using any of the "Advanced" analysis types or Advanced Hotspots should not "drag" the system - the hardware-based sampling has very low overhead.  If you select the option to include "stacks", then overhead increases, but YMMV.

You can use a couple of methods to "wait" and start collecting data only once you get into the transaction processing phase of your app: use the Start Paused button and then the "Resume" button when ready to collect; select "Start Paused" and use the Collection Control API to instrument your code so that it resumes data collection at the appropriate point; or use "system-wide" sampling and start the session once the app reaches its steady state.
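For anyone who prefers scripting this over using the GUI, the "Start Paused" then "Resume" flow can be sketched with the command-line tool. This is only a sketch under assumptions: it assumes the `amplxe-cl` tool shipped with VTune Amplifier XE (newer releases call it `vtune`) and its `-start-paused`, `-duration`, `-target-pid` and `-command resume` options, so check `amplxe-cl -help collect` for your version before relying on it. The result directory name `r001ah`, the PID `1234` and the helper function names are hypothetical.

```python
def collect_command(result_dir, pid, duration_s, start_paused=True):
    """Build an Advanced Hotspots collection command that attaches to a PID.

    Only builds the argument list; nothing is executed here.
    """
    cmd = [
        "amplxe-cl",
        "-collect", "advanced-hotspots",
        "-r", result_dir,              # where the result is written
        f"-duration={duration_s}",     # stop automatically after this many seconds
        f"-target-pid={pid}",          # attach to the already running server
    ]
    if start_paused:
        cmd.append("-start-paused")    # collect nothing until resumed
    return cmd


def resume_command(result_dir):
    """Build the command that resumes a paused collection for a result dir."""
    return ["amplxe-cl", "-r", result_dir, "-command", "resume"]


# Hypothetical values: result directory "r001ah", server PID 1234, 2 minutes.
print(" ".join(collect_command("r001ah", 1234, 120)))
print(" ".join(resume_command("r001ah")))
```

Each list can be handed to `subprocess.Popen` or `subprocess.run` as-is; the resume command would be issued from a second console once the application is in its transaction processing phase.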

Tim's comment regarding events is valid, but if you limit yourself to Advanced Hotspots to identify the code executing during transaction processing, you shouldn't need to worry, as all events are collected at the same time (this depends on your processor, however, which you haven't specified, so I can't say for certain).
