I have an operation on a large array written in array notation. Since the array is large, what I really want is the work to be split up across many cores, and each core to use SIMD units to perform its work. Is there an easy way to specify that the work should be divided up among however many threads there are in the machine?
Or do I need to rewrite the array notation as a loop and turn it into a cilk_for loop? In that case, would placing a #pragma simd before the cilk_for loop succeed in telling the compiler that the work is vectorizable, or does that only work for regular loops?
Thanks in advance for any info.
cilk_for _Simd specifies a loop whose iterations are split among workers, with SIMD vectorization of each chunk. Prior to the introduction of this notation, it was necessary to split the loop explicitly to combine SIMD vectorization with multiple workers. Array notation by itself doesn't allow the work to be split among workers.
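To make the rewrite concrete, here is a minimal sketch of turning a one-line array-notation statement into a cilk_for loop with a SIMD hint. The #ifdef fallback exists only so the snippet compiles without a Cilk-enabled compiler (classic icc, or an old gcc built with -fcilkplus, provides the real cilk_for via <cilk/cilk.h>); the function name vadd is mine, not from the original post:

```c
#include <stddef.h>
#ifdef __cilk
  #include <cilk/cilk.h>   /* real cilk_for on a Cilk Plus compiler */
#else
  #define cilk_for for     /* serial fallback so the sketch compiles anywhere */
#endif

/* Array-notation form (one worker, SIMD only):
 *     a[0:n] = b[0:n] + c[0:n];
 * cilk_for form: iterations are split among workers, and the
 * #pragma simd hint asks the compiler to vectorize each chunk.
 */
void vadd(float *a, const float *b, const float *c, size_t n) {
    #pragma simd
    cilk_for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}
```

Without a Cilk toolchain the fallback runs serially, but the loop body is unchanged, so correctness can be checked either way.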
You're probably aware of this, but an array operation suitable for cilk_for _Simd normally can't use more than 1 worker per core effectively.
It's not a restriction, it's simply that you would normally get best performance with each worker running on a separate core (on CPUs other than Xeon Phi KNC).
I don't know of anything you can do on KNC beyond testing with various settings of CILK_NWORKERS. I usually get best cilk_for performance around CILK_NWORKERS=118 on my very old 61-core KNC. KNC allows each hardware thread access to the VPU at most every other clock cycle, so 2 workers per core should have some advantage (but the default of 4 workers per core is usually slower than 2).
The design of Xeon Phi KNL permits a single worker per core to achieve full performance (more like a host CPU), but I don't know whether any option will be implemented to ensure that cilk_plus(tm) workers are distributed evenly across cores. I don't have sufficient access to KNL to gain experience with cilk_plus(tm) on that CPU.
It seems that affinity mechanisms to facilitate peak multi-core performance are incompatible with the "composability" feature. Mechanisms such as KMP_HW_SUBSET for OpenMP are predicated on a single job having exclusive use of the processor.
Tim is correct that array notation does not take advantage of multiple cores. The problem with cilk_for _Simd is that it uses a very primitive algorithm for breaking a parallel loop into SIMD chunks. You might get better performance by explicitly tiling your array operation into 2-dimensional vector-aligned chunks that fit in L1 cache. Use a cilk_for outer loop to loop over the tiles and a SIMD inner loop to process each tile. Some experimentation with tile sizes will certainly be needed.
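The tiling suggestion above can be sketched as follows. TILE is a hypothetical tuning parameter (pick it so one tile of each operand fits in L1; it is exactly the knob the post says needs experimentation), saxpy_tiled is an illustrative name, and the #ifdef fallback only lets the sketch compile without a Cilk toolchain:

```c
#include <stddef.h>
#ifdef __cilk
  #include <cilk/cilk.h>
#else
  #define cilk_for for     /* serial fallback: compiles without Cilk Plus */
#endif

enum { TILE = 4096 };  /* hypothetical tile size; tune so a tile fits in L1 */

/* Outer cilk_for distributes whole tiles across workers;
 * the inner loop over one tile is the SIMD candidate. */
void saxpy_tiled(float *y, const float *x, float a, size_t n) {
    cilk_for (size_t t = 0; t < n; t += TILE) {
        size_t end = t + TILE < n ? t + TILE : n;
        #pragma simd
        for (size_t i = t; i < end; ++i)
            y[i] += a * x[i];
    }
}
```

Because each worker touches only whole tiles, there are no partial-vector remainders except at the very end of the array, which is one of the losses the primitive chunking in cilk_for _Simd can incur.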
I agree with Tim that you need to experiment with CILK_NWORKERS. If your processor has P cores, the best values for CILK_NWORKERS are often P, P - 1, 2*P, 2*P - 1, 4*P, or 4*P - 4. If you are mixing code with OpenMP or if there are other jobs running on your system, then you might be tempted to use smaller values of CILK_NWORKERS, and maybe you should, BUT -- our experience is that there is not much penalty for making CILK_NWORKERS too large. A 2x oversubscription is typically not a problem for Cilk code, except for the disappointment of not getting the 2x performance boost you might be hoping for.
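The sweep over candidate worker counts is easy to script. A minimal shell sketch, using P=61 (KNC-style) as an example; ./bench stands for your own benchmark binary (hypothetical, so the runnable line is commented out):

```shell
# Candidate CILK_NWORKERS values for P=61 cores:
# P, P-1, 2*P, 2*P-1, 4*P, 4*P-4
candidates=""
for w in 61 60 122 121 244 240; do
  candidates="$candidates $w"
  echo "CILK_NWORKERS=$w"
  # CILK_NWORKERS=$w ./bench   # uncomment with a real benchmark binary
done
```

Run each setting a few times and keep the median; single runs on a loaded machine are noisy.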
Performance peaking at CILK_NWORKERS = 2*P - 1 for P cores with hyperthreading is typical for host CPUs. I assume this is because it keeps all cores busy. It resembles the behavior of gnu OpenMP on Windows (7, 8, ...), since gnu OpenMP for Windows doesn't support affinity.
That brings up the question of gnu cilkplus, which unfortunately still falls short of being able to run my benchmarks (even on Linux).
People mean various things by the term "over-subscription." Setting more workers than available logical processors clearly justifies the term. More workers than available cores is a normal expectation for KNC (but on KNC it's typical for 1 core to be unavailable to the application). The number of available VPU slots might be a number to shoot for (or something equivalent on other CPUs), but that's not a well defined concept, nor is there a scheduler facility for cilk_plus(tm) to spread the workers out so as to match available VPU slots.
We ran tests on KNC which showed that TBB affinity facilities could boost performance over cilk_plus(tm), but (in my view) this makes TBB more difficult to use than OpenMP, so it's understandable that cilk_plus(tm) doesn't offer such facilities.
In the 5 cases of cilk_for _Simd in my benchmarks at https://github.com/tprince/lcd , performance is at least 60% of C + OpenMP 4.0 with Intel C++ on an HSW 2-core HT laptop (CILK_NWORKERS=3), so I think it's satisfactory. There are significant gains from both SIMD and multiple workers.
On my KNC, cilk_for performance generally is less than 40% of OpenMP, but I suppose it still meets the goals of cilk_plus(tm).
I haven't seen any new publications assessing cilk_plus performance on MIC since we published at TACC 2012 symposium.
There are a few cases where cilk_plus(tm) array notation performance has improved since then, but I haven't seen much as far as multi-core performance is concerned.
Intel didn't want to put any such papers up on this site when we were working on KNC. I don't think matching performance of OpenMP on MIC was a goal for cilk_plus(tm).