Analyzers

thread execution speed jitter

agks
Novice

Hi,

I am developing a computationally intensive application on a Dell R730 server with two Xeon E5 CPUs (24 cores in total), running 64-bit Debian Linux (Jessie).

My application is structured as a cascade of approximately 20 processing blocks: each block reads data from an input circular buffer into a local buffer, processes the data in the local buffer, and writes the result to an output circular buffer. Each circular buffer connects the output of one block to the input of the following block.
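To make the structure concrete, each link between two blocks is conceptually similar to the single-producer / single-consumer ring buffer sketched below (a simplified illustration in C, not my actual code; the names, slot count, and slot size are arbitrary):

```c
/* Simplified single-producer / single-consumer ring buffer connecting two
 * blocks of the cascade. Names and sizes are illustrative only. */
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define RING_SLOTS 1024            /* must be a power of two */
#define SLOT_BYTES 4096            /* payload size per slot  */

typedef struct {
    _Atomic size_t head;           /* next slot the producer will write */
    _Atomic size_t tail;           /* next slot the consumer will read  */
    char data[RING_SLOTS][SLOT_BYTES];
} ring_t;

/* Producer (upstream block): returns 0 if the ring is full; len <= SLOT_BYTES. */
static int ring_push(ring_t *r, const void *src, size_t len)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return 0;                                  /* full */
    memcpy(r->data[head & (RING_SLOTS - 1)], src, len);
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 1;
}

/* Consumer (downstream block): returns 0 if the ring is empty. */
static int ring_pop(ring_t *r, void *dst, size_t len)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return 0;                                  /* empty */
    memcpy(dst, r->data[tail & (RING_SLOTS - 1)], len);
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 1;
}
```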

Each block runs on its own thread, and I have bound the most computationally intensive blocks to dedicated CPU cores through affinity (a sketch of this binding is shown below). When running alone, each block has a fairly constant throughput well above my requirement. When the whole application is running (i.e. all blocks working in their threads), some threads running almost alone on a CPU core show very high jitter in execution speed, even though they are neither starved of input data nor blocked writing their output.
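For reference, the binding is plain Linux CPU affinity; a minimal sketch of what a computational thread could do at start-up is shown below (the helper name and the pthread-based call are illustrative, not necessarily my exact code):

```c
/* Sketch: bind the calling thread to a single core using Linux CPU
 * affinity (glibc). Helper name and core id are illustrative. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_current_thread_to_core(int core_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    /* Returns 0 on success, an error number on failure. */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```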

The result is that the overall throughput of my application is well under my requirement and not constant from one run to the next: processing the same amount of data sometimes takes 30 s (which is satisfactory) and sometimes up to 45 s (which is not).

I have run Intel VTune, but I still cannot identify the cause of the jitter.

Any suggestions on how to investigate this issue?

Thanks

Agks
6 Replies
McCalpinJohn
Honored Contributor III

The first thing I would try would be pinning all of the processing threads to specific cores to see if the variability persists.  If you can't do this, at least pin the threads to the socket where you think it should run.  If the threads can't be bound to specific cores, rebooting with HyperThreading disabled will give the OS less opportunity to screw up the scheduling.

I would also use the numa facilities (either using "numactl --membind" or using the equivalent libnuma API) to ensure that the memory being used for the buffers is allocated on the chip where I think it belongs.
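A minimal sketch of the libnuma route (assuming libnuma is installed and the program is linked with -lnuma; the helper name is illustrative):

```c
/* Sketch: allocate a buffer directly on a chosen NUMA node with libnuma. */
#include <numa.h>
#include <stdlib.h>

static void *alloc_buffer_on_node(size_t bytes, int node)
{
    if (numa_available() < 0)
        return malloc(bytes);               /* no NUMA support: fall back */
    return numa_alloc_onnode(bytes, node);  /* release later with numa_free() */
}
```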

If the variability persists, VTune should be able to help you distinguish between slowdowns that are due to contention for compute resources and slowdowns that are due to contention for memory resources.

agks
Novice

Thanks for your reply.

I have more threads than CPU cores, so I pinned all CPU-consuming threads to specific cores; the remaining few threads are activated with low periodicity and are not critical.

When you say socket, do you mean NUMA node? I have actually already pinned each thread to the NUMA node it should belong to, in order to avoid memory transfers from one NUMA node to the other: the first part of the cascade of blocks runs on NUMA node 0 and the second part on NUMA node 1. I am not sure whether this is what you meant.

I did not know about the NUMA facilities; I will try them as soon as I have access to the server again (in about a week?) and keep you posted on the results.

Same for HyperThreading: I will try it and keep you informed.

Thanks

Agks
McCalpinJohn
Honored Contributor III

Sorry for being imprecise -- when I said "socket", I did mean "NUMA node".  

The default memory allocation policy of Linux is "local", so most of the time if you pin a thread to a NUMA node, the memory that thread instantiates will be allocated locally.  However, this is not guaranteed, and there is no warning provided if the kernel is unable to allocate local pages.  We have seen cases where the filesystem cache used a lot more of the memory than we expected on socket 0, resulting in non-local page allocation.  This is unlikely to happen unless you are trying to use more than ~1/2 of the memory on socket 0, but since it is somewhat unpredictable, it is helpful to know how to look for the problem and how to deal with it. 

  • In our production systems, we drop the caches before each job, which eliminates the problem in the overwhelming majority of cases.
  • The "numastat" command can be used to track NUMA allocation successes and failures.  You can run it before and after a job and take the differences of the counts, or you can point it at an existing process to find out how much memory the process has allocated on each NUMA node.
  • For benchmarking, I use "numactl --membind=<node_number>" to force memory to be allocated exactly where I want, as described below.

Forcing memory binding: Using either numactl or the NUMA API ("numa_set_membind()" and related calls), you can force pages to be allocated on a target node.  This will cause the kernel to "work harder" to free up pages on the local node to allocate, and will abort the job if it cannot allocate all the requested pages on the target node.  This is not good for production, but is helpful for debugging NUMA affinity failures.
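A minimal sketch of the API variant (the node number is illustrative; link with -lnuma):

```c
/* Sketch: restrict all further page allocations of this process to one
 * NUMA node using numa_set_membind(). */
#include <numa.h>

static void bind_memory_to_node(int node)
{
    struct bitmask *nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, node);
    numa_set_membind(nodes);           /* pages now come only from 'node' */
    numa_free_nodemask(nodes);
}
```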

Since you have more threads than you have cores, I am thinking that poor OS scheduling (with respect to HyperThreads) is more likely than NUMA failures to be the source of the performance variability....

agks
Novice

Thanks for your suggestions

numastat reports no numa_miss events, so it does not seem to be a NUMA issue. I then disabled HyperThreading, and I now get much more deterministic and predictable behavior from my threads. This is not a final conclusion, because I do not have exactly the same hardware environment as in my first post; as soon as I get it back, I will redo the tests without HyperThreading and keep you posted on the outcome. In any case, it seems your conclusion is correct. Thanks a lot!

agks
Novice

Hi John

After doing some more tests, I can confirm that you were right about HyperThreading: it is not at all well suited to our case.

I currently have 2 x Xeon E5-2643 v4. I would therefore like to replace them with 2 x Xeon E5-2687W v4, since the socket looks compatible and they offer twice as many physical cores at a slightly lower frequency. Do you confirm this is a good choice? Also, since these parts dissipate 160 W instead of the 135 W of the 2643, do I need to upgrade the heatsinks on the CPUs? The power supply is currently 2 x 750 W, so I guess that is sufficient.

Thanks again !

Agks

McCalpinJohn
Honored Contributor III

You should have enough power, but I would strongly recommend that you check to see if your cooling solution is rated for 160W parts.

The maximum Turbo frequency at the number of active cores you are using matters more than either the nominal frequency or the maximum single-core Turbo frequency, but Table 5 in the Xeon E5 v4 specification update (document 333811) is completely screwed up -- showing all of the 4-, 6-, and 8-core parts as having 10 cores -- so it is not clear that any of the data in that table can be trusted.

Depending on exactly how many cores you need, you might find an 8 or 10 core part that gives enough cores and enough performance, but is closer to your current cooling budget.
