Hello,
I'm writing a simple lookup function using the OpenMP/C++ programming model.
Below is the code block where I offload from the host to my 5110P Phi device.
#pragma offload target(mic:0) \
  in(batch_size) \
  nocopy(inputbuf:length(inputbuf_sz/sizeof(uint32_t)) \
         free_if(0) alloc_if(0) align(CACHE_LINE_SIZE)) \
  nocopy(TBL24:length(TBL24_sz/sizeof(uint16_t)) \
         free_if(0) alloc_if(0)) \
  nocopy(TBLlong:length(TBLlong_sz/sizeof(uint16_t)) \
         free_if(0) alloc_if(0)) \
  nocopy(resultbuf:length(resultbuf_sz/sizeof(uint16_t)) \
         free_if(0) alloc_if(0) align(CACHE_LINE_SIZE))
{
  //#pragma omp parallel for private(index) num_threads(120)
  #pragma omp parallel for num_threads(120)
  for (int index = 0; index < batch_size; index += 16) {
#ifdef __MIC__
    // Load 16 destination addresses and gather the first-level TBL24 entries,
    // indexed by the top 24 bits of each address.
    __m512i v_daddr = _mm512_load_epi32(inputbuf + index);
    __m512i v_daddr_shift8 = _mm512_srli_epi32(v_daddr, 8);
    __m512i v_temp_dest = _mm512_i32extgather_epi32(v_daddr_shift8, TBL24,
        _MM_UPCONV_EPI32_UINT16, sizeof(uint16_t), _MM_HINT_NT);

    __m512i v_ignored_ip = _mm512_set_epi32(REPEAT_16(IGNORED_IP));
    __m512i v_zero = _mm512_setzero_epi32();
    __mmask16 m_is_not_ignored = _mm512_cmp_epu32_mask(v_daddr, v_ignored_ip, _MM_CMPINT_NE);

    __m512i v_0x8000 = _mm512_set_epi32(REPEAT_16(0x8000));
    __m512i v_0x7fff = _mm512_set_epi32(REPEAT_16(0x7fff));
    __m512i v_0xff   = _mm512_set_epi32(REPEAT_16(0xff));

    // Lanes whose TBL24 entry has the top bit set need a second-level lookup
    // in TBLlong at ((entry & 0x7fff) << 8) | (daddr & 0xff).
    __mmask16 m_top_bit_set = _mm512_cmp_epu32_mask(
        _mm512_and_epi32(v_temp_dest, v_0x8000), v_zero, _MM_CMPINT_NE);
    __mmask16 m_both_cond_met = _mm512_kand(m_is_not_ignored, m_top_bit_set);
    __m512i v_index2 = _mm512_add_epi32(
        _mm512_slli_epi32(_mm512_and_epi32(v_temp_dest, v_0x7fff), 8),
        _mm512_and_epi32(v_daddr, v_0xff));
    __m512i v_result = _mm512_mask_i32extgather_epi32(v_temp_dest, m_both_cond_met,
        v_index2, TBLlong, _MM_UPCONV_EPI32_UINT16, sizeof(uint16_t), _MM_HINT_NT);

    // Store the 16 results, skipping the ignored lanes.
    _mm512_mask_extstore_epi32(resultbuf, m_is_not_ignored, v_result,
        _MM_DOWNCONV_EPI32_UINT16, _MM_HINT_NT);
#endif
  }
}
In sequential execution, the for loop in this code block should run in O(n) time,
where n is the total number of iterations (batch_size in this case).
I, however, have used #pragma omp parallel for with num_threads(120) in order to trigger GPU-like SIMD behavior,
which should (approximately and hopefully) divide the number of sequential iterations by some figure up to 120,
and I used explicit vectorization intrinsics to further reduce it by a factor of 16 (16 inputs per iteration, not 1).
Yet I wound up with a kernel execution time of around 90us, which is about 30 times what the equivalent NVIDIA GPU kernel takes.
So I started reading vectorization reports, which read:
(FYI, Line 145 is where the for statement is)
dryrun_shared.cc(145): (col. 4) remark: *MIC* loop was not vectorized: existence of vector dependence
dryrun_shared.cc(164): (col. 13) remark: *MIC* vector dependence: assumed FLOW dependence between resultbuf line 164 and inputbuf line 147
dryrun_shared.cc(147): (col. 26) remark: *MIC* vector dependence: assumed ANTI dependence between inputbuf line 147 and resultbuf line 164
dryrun_shared.cc(147): (col. 26) remark: *MIC* vector dependence: assumed ANTI dependence between inputbuf line 147 and resultbuf line 164
dryrun_shared.cc(164): (col. 13) remark: *MIC* vector dependence: assumed FLOW dependence between resultbuf line 164 and inputbuf line 147
I wasn't really sure what the dependence was, but since I used explicit vectorization, I didn't really care.
However, when I looked at the disassembly of the offloaded code block generated with icpc -S, I found the following lines:
..LN376:
.loc 1 145 is_stmt 1
addq $16, %rcx #145.44 c13
Line 145, column 44 is where the third clause of the for statement is, i.e. index += 16.
I'm not 100% sure, but I took it to mean that the MIC is running this loop sequentially.
Given this situation, my questions are as follows:
1) Is there a way to 'make sure' the iterations of this for loop run on 120 separate threads evenly distributed across the 60 cores?
2) Also, is there a way to confirm how the code is executed corewise and threadwise?
(From what I know, the code only reflects what is executed, not where and how)
3) Since each iteration is such a small piece of code, would the extra overhead make it not worthwhile to use pthreads on the device to run it?
What's the difference between calling pthreads API and OpenMP threading in terms of how they are executed internally?
4) When I traced the execution with the OFFLOAD_REPORT env. variable set,
I observed that copying the results BACK from device is WAY costlier.
Also, copying back to host seemed to consume both host and device CPU time,
while writing to device cost no MIC CPU time at all.
I have implemented the same app in Intel OpenCL and it took 28us/5us to write-to/read-from device,
and in OpenMP it was 8us/51us, which is strange.
Is there any way to make it 8us/5us without changing the programming model?
(FYI, I assumed that both programming models use user-level SCIF API calls underneath,
and I am currently working on a new app with separate host and mic code,
where the MIC code acts as a SCIF server daemon and the host code acts as a client that connects to the device
and copies over blocks of data that need to be processed.)
5) Why is the device showing suboptimal performance under num_threads(180) and num_threads(240)?
I tried num_threads in multiples of 60, and 120 showed the best performance.
From what I read, you should run at least 2 threads per core to make sure there are no idle cycles
(a KNC core cannot issue instructions from the same hardware thread on back-to-back cycles),
and there should not be any context-switching overhead when running 2~4 threads per core.
The strange part is that when I monitored execution with micsmc, it showed full utilization on all cores at 240 threads,
yet that configuration gave the worst performance in terms of execution time.
Later I realized that the core utilization percentage loosely translates to the number of running threads per core.
(25% if 1 thread, 50% if 2 threads, and so on)
6) VTune XE GUI doesn't seem to support profiling of the offloaded region.
(Which is another reason I'm writing separate code for host and device.)
Am I doing something wrong? Or has it always been this way?
I ran Knights Corner profiling to see how my code fares, but it didn't show the hotspots in offloaded region.
It only showed a bunch of scif_wait()s from the host.
Thank you for your attention, and I welcome any insights or information that might aid my situation.
Jun
In your compiler-generated .s code, you should see where the OpenMP regions are and what the internal calls from your application to the OpenMP library are. OpenMP is built on top of pthreads; you could refer to the Intel open-source OpenMP library to study it.
You won't have a stable distribution across the cores or get much advantage from multiple threads per core unless you set affinity, e.g. by OMP_PROC_BIND or KMP_PLACE_THREADS. Even then, the L2 cache behavior of your application could limit scaling. You can observe how work is distributed by displaying the micsmc-gui bar graph and by setting KMP_AFFINITY=verbose (for MIC native). In my own observations I couldn't get much advantage from 2 threads per core in offload mode, although I did in native mode.
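If you want to try those in offload mode, the usual way is to pass them to the card with the MIC_ prefix, for example:
  MIC_ENV_PREFIX=MIC
  MIC_KMP_AFFINITY=verbose,scatter
  MIC_OMP_NUM_THREADS=120
(Variable names from memory; the verbose modifier makes the OpenMP runtime report which logical processor each thread was bound to, which also speaks to your question 2, though I haven't checked how much of that output makes it back through the offload proxy.)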
Thank you, Tim. Great pointers. I didn't realize OpenMP was built on pthreads.
If I may rant a little, I find OpenMP threading and vectorization pragmas to be rather implicit and wishy-washy
in performance-critical optimization scenarios, but surely I'll find some answers looking into the library docs and source code.
I'll definitely try your suggestions and post back if I find some answers.
Jun,
The loop you illustrated in post #1 uses the _mm512... vector intrinsics. These are already vector instructions, meaning there is no remaining scalar code for the compiler to vectorize in this loop; hence the report that it cannot vectorize it.
In addition to Tim's excellent advice, you can also consider KMP_AFFINITY=scatter (or, since this is the offload model, MIC_KMP_AFFINITY=scatter). Then you can easily test with 60, 120, 180 threads. However, in offload mode, when there are a lot of offload instances you should consider reserving 1 core (or at least 1 hardware thread) for managing the offloads.
Additional information: the first offload typically suffers the overhead of injecting the code and starting the OpenMP thread pool. For timing purposes you may want to perform a warm-up offload outside the timed region. You can add this overhead back in if it is important to your timing information.
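For example, a throwaway offload ahead of the timed loop pays both costs once (just a sketch, using the same target and thread count as your post #1):
#pragma offload target(mic:0)
{
    #pragma omp parallel num_threads(120)
    {
        // empty region: forces code injection and OpenMP thread-pool creation
        // before the measured offloads begin
    }
}
The offload that actually allocates your nocopy buffers (the alloc_if(1) counterpart of the clauses you show) belongs outside the timed region as well.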
Jim Dempsey
Thank you, Jim.
It eventually dawned on me, too, that that might have been it.
However, I still found it interesting that the compiler had flagged dependences even though the code
was written in what should be a direct mapping to machine code.
Another valid point which I will experiment with.
But isn't the 61st core of the Xeon Phi already devoted to handling microkernel operations, which include threading?
I have averaged each write, execution, and read time for every 10,000 offloads,
so initial overheads should have been amortized. And yes, the average elapsed time for the first 10,000 was always the highest.
Currently I'm browsing through the Intel OpenMP runtime documents
and at the same time pondering whether I should have 60*k pthreads
pre-spawned across the cores and waiting in the MIC daemon,
which would activate the necessary number of pthreads upon the client's (host's) scif_vwrite() input.
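Roughly what I have in mind is the following skeleton (untested; NUM_CORES, THREADS_PER_CORE and the batch handling are placeholders, and I'm assuming the worker logical-CPU numbering on the card starts at 1):
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define NUM_CORES        60   // placeholder: worker cores on the card
#define THREADS_PER_CORE 2    // placeholder: the k in 60*k

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work_ready = PTHREAD_COND_INITIALIZER;
static int have_work = 0;     // set by the SCIF daemon after the host's scif_vwrite() lands

static void *worker(void *arg)
{
    long id = (long)arg;
    int core = id / THREADS_PER_CORE;
    int ht   = id % THREADS_PER_CORE;

    // Pin to a fixed hardware thread; worker logical CPUs assumed to start at 1.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1 + 4 * core + ht, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (;;) {
        pthread_mutex_lock(&lock);
        while (!have_work)
            pthread_cond_wait(&work_ready, &lock);
        pthread_mutex_unlock(&lock);
        // ... process this thread's slice of the incoming batch ...
        // (resetting have_work and signalling completion back to the host elided)
    }
    return NULL;
}

void spawn_pool(void)
{
    for (long i = 0; i < NUM_CORES * THREADS_PER_CORE; ++i) {
        pthread_t tid;
        pthread_create(&tid, NULL, worker, (void *)i);
    }
}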
I'll be posting how my approaches fare.
Jun
The number of cores depends on the particular model of Xeon Phi: the 5110P has 60 cores, the 3120A has 57. That core is not hidden. When your application has very low offload overhead relative to the compute load inside the offload, it may make sense to use all the cores. On the flip side, when your application has relatively high offload overhead relative to the compute load inside the offload, it may make sense to reserve one of the cores (when using asynchronous offloads).
Depending on the OpenMP implementation, it may spawn 0, 1, or more threads beyond the thread pool. And the O/S may need to use some threads (independent of offload) for the virtual console and NFS file I/O. This is no different from a desktop or server, for that matter.
Jim Dempsey
Offload mode, by default, reserves an entire core for system operations (kernel, SCIF data transfer, ...). You have the option to override this, but there are good reasons for the setting. If you don't override it but set a number of threads that doesn't fit on one less than the full number of cores, threads will wrap around, resulting in uneven loading. You may be able to see this when running the micsmc-gui visualization.
Offload mode defaults to running 4 threads on each of the non-reserved cores. I haven't had any success in running so many threads. MIC_KMP_PLACE_THREADS is one of the simpler methods for choosing which cores to use and placing a specific number of threads per core (but requires that you take into account the number of non-reserved cores).
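For example, on the 60-core 5110P discussed here, something like MIC_KMP_PLACE_THREADS=59c,2t together with MIC_KMP_AFFINITY=compact should put 2 threads on each of the first 59 cores (118 threads in total) and stay off the reserved core.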
In case one of you is alluding to hidden threads (probably one for the pthreads library and one for the Intel OpenMP library), I think it's reasonable to expect those to run (infrequently) on the reserved core. VTune profiling adds to the load on the reserved core, so it's difficult to use VTune to get insights there.
In the early days, when MIC didn't have as many cores, I frequently found an advantage by running 1 worker thread on core 0 (but not when running offload mode or VTune). On KNC, there are frequent cases where performance peaks when I leave an additional core idle, beyond the one which is expected to be running system tasks. I don't know any rule to find out about this other than to experiment.
In KNC hardware thread numbering, thread 0 and the last 3 are attached to the core which is reserved by default from offload.
>>In KNC hardware thread numbering, thread 0 and the last 3 are attached to the core which is reserved by default from offload.
That doesn't make any sense. That doesn't follow any of the APIC/APIC2 numbering schemes.
The OpenMP logical processor numbers will float (no affinity) or vary in core location based on (MIC_)KMP_AFFINITY, (MIC_)KMP_PLACE_THREADS, etc.
If (stress on if) the OpenMP implementation for KMP_AFFINITY=compact, in offload mode, is designed to reserve 1 hardware thread of core 0 and 3 hardware threads of the last core, then I'd like to hear the justification for why this was done. Doing it this way means your loads per core are unbalanced. Even with KMP_AFFINITY=scatter, the cores would be unbalanced because the last core gets 1 thread assigned on the first pass and none on the remainder. The unbalanced loading results in a "weakest link" situation where performance may be limited by the worst-performing core.
IMHO, for offload, the highest core should be reserved (you could reserve the lowest core too, but my preference is the highest).
Jim Dempsey
The core numbering on Xeon Phi is certainly weird, but at least it has been clearly documented from the beginning. It works well with KMP_AFFINITY=compact for offload codes using up to 4*(N-1) threads (where N is the number of physical cores on the Xeon Phi) because the OpenMP threads are placed on logical processors starting with 1 rather than 0. So the first 4 threads go on physical core 0 (which runs logical processors 1,2,3,4), etc. The highest-numbered core (N-1) runs logical processors 0, 4*N-3, 4*N-2, and 4*N-1.
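In code form, that mapping is simply (restating the above, with N the number of physical cores and 0-based core/SMT numbers):
int knc_logical_cpu(int core, int smt, int N)
{
    if (core < N - 1)
        return 1 + 4 * core + smt;                 // cores 0..N-2 -> logical CPUs 1..4*(N-1)
    return (smt == 0) ? 0 : 4 * (N - 1) + smt;     // last core -> 0, 4N-3, 4N-2, 4N-1
}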
The default behavior is not so well suited for KMP_AFFINITY=scatter or KMP_AFFINITY=balanced: each of these allocates OpenMP threads on physical core N-1 (assuming 0-based numbering) as soon as the number of threads requested exceeds N-1.
This issue led to the introduction of KMP_PLACE_THREADS, which makes it much easier to avoid accidentally using the highest numbered core. Of course by that time I had already converted most of my codes to use KMP_AFFINITY with an explicit "proclist" parameter listing the specific logical processors I wanted to use.
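For example, something like KMP_AFFINITY=granularity=fine,proclist=[1-236],explicit keeps everything on the first 59 cores of a 60-core part (adjust the upper bound to 4 times the number of cores you want to use).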
Given the above description, it would appear that your (someone's) specification of the logical processor numbering assignment is an O/S dependency and has nothing to do with the relationship of the APIC/APIC2 IDs that you receive when querying with CPUID. This makes for asymmetry between the physical and logical processor mappings (independent of the OpenMP team member numbers).
When using O/S system calls, it is now a requirement to know this (barring use of CPUID) and then rotate the affinity bitmask in order to place the (il)logical processor numbers in a meaningful position (for grouping core-to-HT mappings). The code I use incorporates CPUID to survey the system and map each O/S logical processor to a physical core and HT. This would all be unnecessary had they not chosen this weird format.
A note to the responsible party: assuming that physical core 0, HT 0 (lowest APIC/APIC2 number) is hard-wired to handle interrupts and other communication details, and that you wanted to reserve this core, then the O/S should have been tweaked to start the logical processor numbering at core 1, HT 0, as O/S logical processor 0, wrap around at the last core back to core 0, and make those the last logical processors. Doing so would only impact one person in one place, and that would be in the O/S kernel. As configured now, it can impact everyone else.
For the OpenMP crew: if not already implemented, consider KMP_AFFINITY=usersChoice,SKIP_OS_CORE (or SKIP_CORE:0).
On non-Phi systems SKIP_OS_CORE would be a NOP.
2ct
Jim Dempsey
jimdempseyatthecove wrote:
Given the above description, it would appear that your (someone's) specification of the logical processor numbering assignment is an O/S dependency and has nothing to do with the relationship of the APIC/APIC2 id's the you receive when querying with CPUID.
Yes, use of logical processor 0 by the OS was said to be a requirement of Linux. For us old Fortran people, it's not too difficult to get used to the worker logical processors starting at 1. No Fortran programmer in their right mind tries to use I/O unit 0 either, although it exists on many systems.
Just for fun I ran the CPUID tool on a Xeon Phi SE10P to see how the reported Core number and APIC numbers lined up.
The following values correspond to "CPU 0" through "CPU 243":
process local APIC physical ID = 0xf0 (240)
process local APIC physical ID = 0x0 (0)
process local APIC physical ID = 0x1 (1)
process local APIC physical ID = 0x2 (2)
process local APIC physical ID = 0x3 (3)
process local APIC physical ID = 0x4 (4)
process local APIC physical ID = 0x5 (5)
process local APIC physical ID = 0x6 (6)
process local APIC physical ID = 0x7 (7)
process local APIC physical ID = 0x8 (8)
.....
process local APIC physical ID = 0xed (237)
process local APIC physical ID = 0xee (238)
process local APIC physical ID = 0xef (239)
process local APIC physical ID = 0xf1 (241)
process local APIC physical ID = 0xf2 (242)
process local APIC physical ID = 0xf3 (243)
The information on threading says that each thread has three neighbors sharing cache+core, but it does not list which three "CPU" numbers those neighbors correspond to.
The "cpuid" tool I am using (2012-06-01 revision) reports the same core/thread layout that I described above:
(APIC synth): PKG_ID=-252180480 CORE_ID=60 SMT_ID=0
(APIC synth): PKG_ID=16254976 CORE_ID=0 SMT_ID=0
(APIC synth): PKG_ID=33032192 CORE_ID=0 SMT_ID=1
(APIC synth): PKG_ID=49809408 CORE_ID=0 SMT_ID=2
(APIC synth): PKG_ID=66586624 CORE_ID=0 SMT_ID=3
(APIC synth): PKG_ID=83363840 CORE_ID=1 SMT_ID=0
(APIC synth): PKG_ID=100141056 CORE_ID=1 SMT_ID=1
(APIC synth): PKG_ID=116918272 CORE_ID=1 SMT_ID=2
(APIC synth): PKG_ID=133695488 CORE_ID=1 SMT_ID=3
(APIC synth): PKG_ID=150472704 CORE_ID=2 SMT_ID=0
(APIC synth): PKG_ID=167249920 CORE_ID=2 SMT_ID=1
(APIC synth): PKG_ID=184027136 CORE_ID=2 SMT_ID=2
(APIC synth): PKG_ID=200804352 CORE_ID=2 SMT_ID=3
...
(APIC synth): PKG_ID=-386398208 CORE_ID=58 SMT_ID=0
(APIC synth): PKG_ID=-369620992 CORE_ID=58 SMT_ID=1
(APIC synth): PKG_ID=-352843776 CORE_ID=58 SMT_ID=2
(APIC synth): PKG_ID=-336066560 CORE_ID=58 SMT_ID=3
(APIC synth): PKG_ID=-319289344 CORE_ID=59 SMT_ID=0
(APIC synth): PKG_ID=-302512128 CORE_ID=59 SMT_ID=1
(APIC synth): PKG_ID=-285734912 CORE_ID=59 SMT_ID=2
(APIC synth): PKG_ID=-268957696 CORE_ID=59 SMT_ID=3
(APIC synth): PKG_ID=-235403264 CORE_ID=60 SMT_ID=1
(APIC synth): PKG_ID=-218626048 CORE_ID=60 SMT_ID=2
(APIC synth): PKG_ID=-201848832 CORE_ID=60 SMT_ID=3
Apparently you have a 61 core processor.
The table you listed contradicts the statements you made (at least my interpretation).
On your system, logical processor 0 is not on core 0; rather, it is on the last core (as HT 0), and the last three logical processors are also located on the last core.
In the numbering scheme in your table above, KMP_AFFINITY=compact (no other monkey business) places the OpenMP first-level thread team correctly (IMHO):
Team Member 0 = core 0, HT0
Team Member 1 = core 0, HT1
Team Member 2 = core 0, HT2
Team Member 3 = core 0, HT3
Team Member 4 = core 1, HT0
...
The last core, if used, continues numbering with HT1, HT2, HT3, and then HT0 (if you are so inclined to use it).
This is a sane way of numbering (though I'd have been tempted to place the service thread on HT3 of the last core to keep the numbering scheme consistent).
For KMP_AFFINITY=scatter you are mostly fine, except that you may experience some cache issues from the threads landing on the last core interacting (sharing cache) with the MPSS service thread. Therefore, it might be handy to exclude the last core:
KMP_AFFINITY=compact
KMP_PLACE_THREADS=60c,3t
This would use 3 threads per core, reserving the last core (the 61st on your MIC; use a different number for MICs with a different core count).
Jim Dempsey
I don't see any inconsistency between my earlier posting and the APIC values, but there might be a typographical error that I am auto-correcting in my head.
Physical core 0 gets logical processors 1,2,3,4, and KMP_AFFINITY=compact maps OpenMP threads 0,1,2,3 to these four logical processors, etc.
For KMP_AFFINITY=compact, nothing is placed on physical core 60 for OMP_NUM_THREADS between 1 and 240. This is very convenient for the offload model, since logical processor 0 (on core 60) is fairly busy with the Xeon Phi side of the offload runtime library.
For KMP_AFFINITY=scatter or KMP_AFFINITY=balanced, a thread gets bound to physical core 60 for any value of OMP_NUM_THREADS greater than 60. With "granularity=fine" you eventually get a thread bound to logical processor 0, which maximizes the degree of conflict. Of course, KMP_PLACE_THREADS=60c,4t takes care of this problem and allows all three KMP_AFFINITY options (compact, scatter, balanced) to be distributed across the first 60 cores (staying off core 60) for all values of OMP_NUM_THREADS.
I don't usually run offload codes, so I don't need to stay away from logical processor 0. The OS can run anywhere, though I see the largest number of interrupts on logical processor 0 (and on two of the other three logical processors on physical core N-1). Under very carefully controlled situations I can get better performance using 61 cores than by using 60, but the performance downside of having conflicts when the OS gets in the way is generally much larger than the performance upside of having one extra core.
Hi all, I appreciate all of your comments and advice.
I have done some work since then, and I hit a little snag with core pinning using pthreads.
I'm moving on to the new topic, which is here: https://software.intel.com/en-us/forums/topic/536182
