Analyzers
Talk to fellow users of Intel Analyzer tools (Intel VTune™ Profiler, Intel Advisor)

Large number of BE_EXE_BUBBLE-GRALL in application

tangouniform
Beginner
Here is my situation...
Hardware: 8-core Itanium 2 1.6 GHz box with 18 MB cache (SGI Altix)
OS: SUSE 10 with ProPack 5
Intel 11.0 compiler
The application is a 501 MB (a lot of lines of code) simulation
Using VTune 9.1 for Linux
I use the command line (vtl) on the Itanium and vtlec on a Xeon box (RHEL)

It is imperative that I make the simulation run in real time, but we are lagging by hundreds of seconds...
VTune profiling shows the following:
Profile 1:
across 19 threads I get

CPU_OP_CYCLES = 9.32E11; IA64_INST_RETIRED-THIS = 1.16E12; BACK_END_BUBBLE-ALL = 5.63E11; FE_BUBBLE-ALL = 6.71E11

which to me seems bad, but I am new to this, so an experienced opinion would be nice.
When I drill down into one of the Mains (threads) I find that vmlinux is the top dog. I would love to post an Excel file, but due to issues beyond my control this is difficult... so I will post a bit by hand:

Module        CPU_OP_CYCLES  IA64_INST_RETIRED  BACK_END_BUBBLE-ALL  FE_BUBBLE-ALL
vmlinux       2.98E11        4.85E11            1.29E11              2.03E11
Main          1.46E11        1.72E11            9.76E10              1.09E11
libc-2.4.so   4.62E10        4.49E10            3.09E10              3.37E10
.
.
.
Of all the BE bubbles, over half are attributed to GRALL, and of those, over half are in the kernel.
Is this telling me my application is spending a lot of time in system calls and, worse yet, that the kernel is causing most of my stalls? When I run Performance Co-Pilot Pro (strace blows up) I am not seeing that many system calls. Should I always expect this? My kernel is not compiled with the -g option. I have talked to the SAs and they will do it, for a fee, but Dove bars are too expensive to risk based on my own opinion :)

Also, when I look at my application and add up all the stalls versus CPU_OPs, it seems to me I am spending 80% of my time stalled; even following the article by Mr. Greco, I am spending more time stalled than not stalled... If that is the case, one would think I should have low-hanging-fruit opportunities all over the place, as long as I understand what VTune is telling me! Which would be nice.

As always, any input is appreciated.
Tango

9 Replies
geoffrey-burling
Employee
Just a couple of thoughts here, Tango, seeing how none of the real experts have weighed in yet on your question.

First, remember that if your application is running for any non-trivial period of time, there will be a lot of calls to vmlinux due to the overhead of the operating system -- hardware interrupts, daemons being awoken & put back to sleep, etc. -- which always take priority & will make processes running in userland wait. This is the case even on a stripped-down system with few other activities running at the same time, & I would guess this is why that other application is showing so few syscalls coming from your application. The trick is to identify as many of these irrelevant syscalls as you can & remove them from the mix so you can focus on the ones you are interested in.

Another consideration is just how much of the processor is being used during your run. If a tool like top shows that the processor is not very active, then that's why vmlinux is using so many cycles: this is simply the O/S maintaining its environment. You would then need to look inside your application to find what is keeping it from running efficiently. On the other hand, if the processor is busy (say pushing 90, 95%), then vmlinux is handling various operations handed to it by your application; if those syscalls are the most efficient way for your application to handle its work, then you don't have anything to worry about here.

Is there a reason you aren't using vtlec? Vtlec would give you far more detailed information on exactly what is happening here, & tell you why vmlinux is consuming so many processor cycles. You can use vtl to collect the data, then view it from within vtlec, so you aren't dealing with irrelevant X Windows processes in your analysis.

Since I mentioned one GUI application for Linux, let me mention another: OpenOffice, which comes bundled with many Linux distributions. It allows the user to create spreadsheets just like Excel, as well as import text files in CSV format, so you can massage your data there. (I don't know what these limitations are that you alluded to, but I thought I'd make the suggestion in case you didn't know about this tool. There are so many tools now for Linux that even those who are very familiar with the platform either don't know -- or have forgotten -- about them.)

A last comment. I always thought that SAs were best bribed with their $BEVERAGE_OF_CHOICE -- which is not always alcoholic, but almost always expensive. If they're willing to do favors for Dove Bars, they are definitely working below the market rate. Consider yourself lucky!

Geoff
tangouniform
Beginner


Geoff,
First of all, thank you for your reply. I am on a project of about 150 people: 60 developers, 89 managers, and 1 person trying to figure out how to run effectively on an Itanium (me). Sadly, I have the most knowledge of the Itanium architecture, and that is not much.

That being said, I don't run vtlec on the IA64 box because, as badly as it runs on the Xeon box, it is worse on the Itaniums. The GUI will come up and core dump usually about 3 times before it stays up on the Xeon box. On the Itanium it is more like 10 (configuration issue?). I may try again today to get vtlec to run on the Itanium to get a better idea of what is going on in the kernel. I was able to use histx to find that for both CPU_OP_CYCLES and BE_EXE_BUBBLE.ALL the culprit is kernel:schedule, at 20% for both. Hmmm.

I collected the data for 455 seconds, which is when the most intense processing is being done, so I believe this is a non-trivial length of time. Which brings me to a question: since I am in the module view, wouldn't the events attributed to "hardware interrupts, daemons being awoken & put back to sleep, etc." be shown on a different thread? When I run top in the individual-CPU view I see that I spend a whole lot of time in sys on the node where I am running Main.

As far as processor usage goes, the simulation framework (SPEEDES) tries to consume 100% of the CPU for the entire length of the run; some of this is real-time work, some is optimistic, and the rest is busy work (GetTimeOfDay and MSG_PEEK). It does this to try to stay ahead of real time and enforce processor affinity. The framework also uses shared memory and process forking to allow it to handle larger numbers of simulation elements. The runs I am looking at are simplified and run on "1" CPU (this is not necessarily true, because we also allow threading).
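To give a rough idea of what I mean by busy work, the spin looks something like this (a simplified sketch from memory, not the actual SPEEDES source; WallClockSeconds and BusyWaitUntil are made-up names):

```
/* Simplified sketch of the framework's busy-wait loop (illustrative only,
 * not the real SPEEDES code).  It polls the wall clock and peeks at the
 * message socket so it never blocks, which keeps the CPU pegged but enters
 * the kernel on every spin. */
#include <sys/time.h>
#include <sys/socket.h>

static double WallClockSeconds(void)          /* hypothetical helper name */
{
    struct timeval tv;
    gettimeofday(&tv, 0);                     /* goes to the kernel (or vsyscall) every call */
    return tv.tv_sec + tv.tv_usec * 1.0e-6;
}

static void BusyWaitUntil(double next_event_time, int msg_socket)
{
    char peek_buf[64];
    while (WallClockSeconds() < next_event_time)
    {
        /* MSG_PEEK | MSG_DONTWAIT: check for an incoming message without
         * consuming it and without blocking -- another system call per spin. */
        recv(msg_socket, peek_buf, sizeof(peek_buf), MSG_PEEK | MSG_DONTWAIT);
    }
}
```

If that is roughly what is going on, every pass through the loop is at least two system calls, which would line up with the high sys times I am seeing.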

Sadly, I have to type in data by hand because moving data from our isolated system to one hooked to the internet is not trivial. I have Calc and Excel on my Linux box, which helps me statistically prove whether or not something has changed the performance of our sim.

As far as the SA goes, considering myself lucky is an understatement!

Thanks again, and if I have said something stupid or even mildly wrong, please feel free to comment. I would really like to learn this stuff.

Tango

robert-reed
Valued Contributor II
Just curious. How much physical memory is on this box? Does top show any swap space being used? How many processes and how many threads total are operating when the machine is fully loaded?

p.s., your manager/developer ratio seems way outta whack. ;-)
tangouniform
Beginner

The box I am currently running on has 8 cores and 16 GB of memory; no swap space is being used. I am new to profiling with threads (it's a new feature of SPEEDES), but top shows only 1 Main running. I just ran in a cpuset; by monitoring with top I can see that both CPUs I allot to the cpuset are being utilized, and that on average the CPU is doing half user and half sys.... I am not sure what system calls are being executed when the sim gets busy. Any ideas on how to find this out?

"p.s., your manager/developer ratio seems way outta whack. ;-)" - developer complained about to many managers once, they added more to asses the situation :)
robert-reed
Valued Contributor II
Sorry, I'm just trying to get a mental picture of your simulation environment. Your previous comment, "The framework also uses shared memory and process forking", seemed to suggest lots of processes competing for shared memory and communicating via the kernel, which might explain the high kernel times.

So is SPEEDES a product you're developing, or one that you're trying to use? Is it spawning processes and within them threads?
tangouniform
Beginner

We use SPEEDES as our framework, and I do have the luxury of the source, so I can modify it when the need arises. Yes, it spawns processes and, within them, threads. I am trying to simplify things by only using one process right now; I will add processes after I feel I have addressed the "easy" issues. Using strace, I believe clock_gettime(CLOCK_REALTIME) is my kernel culprit. Any suggestions for a faster way to get the time? (It is called hundreds of times per second.)
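For reference, the timing path boils down to something like this (a minimal sketch; GetWallTimeSeconds is a made-up name, and the real SPEEDES wrapper has more bookkeeping around it):

```
/* Minimal sketch of the wall-clock wrapper in question (illustrative name;
 * the real SPEEDES routine differs).  Link with -lrt on older glibc. */
#include <time.h>

double GetWallTimeSeconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);   /* the call strace flags as the hot spot */
    return ts.tv_sec + ts.tv_nsec * 1.0e-9;
}
```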
robert-reed
Valued Contributor II
I finally had a chance to look up SPEEDES and discovered that NASA and others have put a lot of work into this framework and think pretty highly of its efficiency on high-speed computers. I suspect the high kernel times might be either a configuration problem, or an indication that things are idling -- no work available for the simulation -- so all the actors are off spinning and looking for something to do. Rather than trying to find a replacement for the clock_gettime calls, perhaps what you need is someone expert in configuring SPEEDES. Unfortunately, that's not me. Is there a SPEEDES user community to which you might turn?
tangouniform
Beginner

It is a great framework. I have been working with SPEEDES for 10 years, and the project I work on pushes it to its limits... so part of my job is to make it so we can push it further. We have ported it (SPEEDES) from the MIPS IRIX boxes to Linux on the Itaniums and are trying to work out the kinks. The biggest issue I see with both our code and SPEEDES is pointer chasing; I just need to figure out how to relieve that performance hit. Currently the 8-core Xeon boxes outperform the Itaniums... to a point, but when more than 8 cores are needed they both suffer. My goal is to get the Itanium on par with the Xeons up to 8 cores, so that when the more intense scenarios we run are executed on Itanium-based machines it fulfills our needs.
The reason I would like the most efficient means of getting the time is two-fold: SPEEDES collects performance metrics during the run, which requires timing, but more importantly it needs to get the time to decide whether it should advance the sim time, and it does this billions of times in a 10-minute run. I need to avoid context switches if at all possible, and timing calls should be as lightweight as possible.

As always thanks for your input, I am always looking for new approaches and theories.

Tango
robert-reed
Valued Contributor II
Your mention of pointer chasing and the failure to scale above eight cores sounds like classic memory contention forcing a lot of serial time on synchronization and thrashing across the threads. It's hard to avoid such inefficiencies in programs large enough to overwhelm the cache, but not impossible. It really requires a lot of understanding of the data and control flows, and trying to localize data reuse so that there's a chance the data will stay in cache until it's no longer needed, at least for the current quantum.
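To make that concrete, here is the kind of transformation I have in mind (purely illustrative; EventNode and the field names are made up, and this is not SPEEDES code): chasing records scattered behind pointers versus walking the same hot field packed contiguously so it stays in cache.

```
/* Illustrative only -- not SPEEDES code.  The first routine chases pointers
 * through nodes scattered across the heap; every hop is a potential cache miss. */
struct EventNode
{
    double            time;
    struct EventNode *next;
    /* ... other fields ... */
};

double SumTimesChasing(const struct EventNode *head)
{
    double sum = 0.0;
    for (const struct EventNode *n = head; n != 0; n = n->next)
        sum += n->time;                   /* each n->next hop can miss the cache */
    return sum;
}

/* Packing the hot field into one contiguous array makes the accesses
 * sequential, so the hardware prefetchers can keep the data in cache
 * until it is no longer needed. */
double SumTimesPacked(const double *times, int count)
{
    double sum = 0.0;
    for (int i = 0; i < count; ++i)
        sum += times[i];
    return sum;
}
```

Whether the SPEEDES event structures can actually be reorganized that way is a design question, but that is the direction I would look for the pointer-chasing hit.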

The lightest-weight timing mechanism I know of is RDTSC (Read Time Stamp Counter; the corresponding register on Itanium is ar.itc), which is a simple instruction that does not require a kernel call and has at least the accuracy of the bus clock frequency on your platform. Each HW thread has its own counter, which may present complications in terms of skew between the time stamp counters on different cores, but it may give you more precision and less overhead than the real-time clock.
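A rough sketch of what reading those counters directly looks like, assuming GCC-style inline assembly (the instruction is RDTSC on the Xeons; on Itanium the counter lives in the ar.itc application register):

```
/* Rough sketch, assuming GCC-style inline assembly.  Reads the cycle counter
 * entirely in user space, so there is no kernel entry at all. */
#include <stdint.h>

static inline uint64_t ReadCycleCounter(void)
{
#if defined(__ia64__)
    uint64_t itc;
    __asm__ __volatile__("mov %0=ar.itc" : "=r"(itc));   /* Itanium interval time counter */
    return itc;
#elif defined(__x86_64__) || defined(__i386__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));  /* x86 time stamp counter */
    return ((uint64_t)hi << 32) | lo;
#else
#error "no cycle-counter read for this architecture"
#endif
}
```

You would still have to scale counts to seconds by the counter frequency and be careful comparing values taken on different cores, but it sidesteps the clock_gettime call entirely.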