changes in CEAN implications re vectorization in next release?

TimP
Honored Contributor III

I note that premier.intel.com is open for submissions for the first time in months, so I submitted a ticket.  It still looks like setting up premier for beta 16 has not been completed, as the form requires choosing compiler version 15.0 or earlier from the pull-down.

I'm curious whether Cilk(tm) Plus is under deprecation, in view of the comments at IDF last year that Intel would not sponsor publications on it, and the lack of follow-through to make gcc -fcilkplus viable.

As to changes in behavior in the beta test:

CEAN seems to require vectorization now even when the compiler recognizes that its vector code is not competitive with scalar (opt-report shows a speedup factor of 0.68 in the case I submitted).

CEAN no longer vectorizes a case where the compiler diagnoses a possible (but non-existent) "output dependence."  It used to be that CEAN was equivalent to aggressive pragmas like ivdep or simd.  In spite of the announcement that OpenMP 4 pragmas will now be allowed for cilk_for, none of the pragmas are accepted, or, if accepted, have any effect, when applied to a CEAN assignment.
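To make the "output dependence" point concrete, here is a minimal sketch in plain C++. It is not the case from the ticket: `scatter_scaled` and its arguments are hypothetical, and the array notation form shown in the comment is my assumption of the equivalent CEAN assignment. It shows a scatter where the compiler cannot rule out two iterations writing the same element, with `#pragma omp simd` standing in for the programmer's assertion that it is safe.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel (not the submitted test case): out[idx[i]] = in[i] * scale.
// The compiler cannot prove that idx holds no duplicates, so it may report a
// possible output dependence (two iterations writing the same element).
// Assumed array notation form: out[idx[0:n]] = in[0:n] * scale;
void scatter_scaled(std::vector<float>& out, const std::vector<float>& in,
                    const std::vector<int>& idx, float scale) {
    assert(in.size() == idx.size());
    const std::size_t n = in.size();
    float* dst = out.data();
    const float* src = in.data();
    const int* index = idx.data();
    // The pragma is the programmer's assertion that vectorizing is safe,
    // which only holds when idx contains no repeated values.
    // Compilers built without OpenMP SIMD support ignore it.
#pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        dst[index[i]] = src[i] * scale;
    }
}
```

Called with a permutation such as `{3, 2, 1, 0}` for `idx`, the dependence the compiler worries about cannot actually occur, which is the "possible (but non-existent)" situation described above.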

CEAN code with ? operator equivalent to C++ std::max is now considered vectorizable in some cases.  This is a good step, as the max and min functions are an important locus of performance incompatibility between C and C++ and between Intel and gnu.  In fact, I got a reply recently from a gnu expert agreeing that the gcc options needed for vectorization can't be recommended.
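As an illustration of the two spellings of max being compared, here is a small sketch in standard C++. The function names are hypothetical and the array notation line in the comment is an assumed form, not copied from my test case. The two loops agree for ordinary values; they can differ when an operand is NaN, because `std::max(x, y)` is specified as `(x < y) ? y : x`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Ternary form. Assumed array notation equivalent:
//   c[0:n] = (a[0:n] > b[0:n]) ? a[0:n] : b[0:n];
std::vector<float> max_ternary(const std::vector<float>& a,
                               const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] > b[i] ? a[i] : b[i];  // yields b[i] if either value is NaN
    }
    return c;
}

// std::max form: returns the first argument unless it compares less than
// the second, so a NaN in a[i] is passed through rather than replaced.
std::vector<float> max_std(const std::vector<float>& a,
                           const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = std::max(a[i], b[i]);
    }
    return c;
}
```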

The new cilk_for _Simd keyword is advertised as an important feature, but I would expect it to do something useful in my cases if Cilk(tm) Plus were considered important for the future. In my cases, it is either ignored (with a diagnostic that vectorization is not attempted), or it acts differently from #pragma omp simd in C or C++ and produces either slow vgather code or strange comments about invoking lambda processing, which is also slow.

3 Replies
Hideki_I_Intel
Employee

Tim Prince wrote:

It still looks like setting up premier for beta 16 has not been completed, as the form requires choosing compiler version 15.0 or earlier from the pull-down

Hope that'll be resolved soon.

Tim Prince wrote:

I'm curious whether Cilk(tm) Plus is under deprecation, in view of the comments at IDF last year that Intel would not sponsor publications on it, and the lack of follow-through to make gcc -fcilkplus viable.

To the best of my knowledge, the 16.0 compiler continues to support Cilk(tm) Plus, and we are working to improve that support. I cannot comment on the GCC side since I do not work on that project.

Tim Prince wrote:

CEAN seems to require vectorization now even when the compiler recognizes that its vector code is not competitive with scalar (opt-report shows a speedup factor of 0.68 in the case I submitted).

For the record, Cilk(tm) Plus Array Notation is the proper terminology for CEAN.

This is expected behavior (i.e., a feature). If the vectorizer refuses to vectorize array notation code because of cost modeling, that's a bug. So it looks like we fixed a bug.

Tim Prince wrote:

CEAN no longer vectorizes a case where the compiler diagnoses a possible (but non-existent) "output dependence."  It used to be that CEAN was equivalent to aggressive pragmas like ivdep or simd.

We haven't changed these basic aspects of implementation. Please file a bug report so that we can take a look at what happened.

Tim Prince wrote:

In spite of the announcement that OpenMP 4 pragmas will now be allowed for cilk_for, none of the pragmas are accepted, or, if accepted, have any effect, when applied to a CEAN assignment.

cilk_for and array notation are two completely different constructs.

Tim Prince wrote:

In my cases, it is either ignored (with a diagnostic that vectorization is not attempted), or it acts differently from #pragma omp simd in C or C++ and produces either slow vgather code or strange comments about invoking lambda processing, which is also slow.

Again, please file bug reports. We certainly cannot promise that OMP SIMD and Array Notation will produce the same performance, since they often look different to the vectorizer, but we can investigate why and try to improve where we can.

Thanks.

TimP
Honored Contributor III

New premier issue 6000100605 includes 2 examples of cilk_for _Simd which invoke slow hidden lambda function code, so I guessed that it could be called a C++14 issue.

I believe the effective omp parallel for simd implementation of these cases involves strip mining of stride 1 vector chunks, but one of the cilk_for _Simd cases reports vgather indirect access, so it seems memory access pattern may be part of the problem.
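For readers unfamiliar with the term, the following is a rough sketch of what I mean by strip mining into stride-1 chunks. It is an illustration only, not the compiler's actual transformation: the saxpy-style kernel, the function name `saxpy_strip_mined`, and the default chunk size are all assumptions. The outer loop walks over chunks (the part a threaded runtime would distribute), and the inner loop is a contiguous, unit-stride loop that is easy to vectorize.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel: y[i] += a * x[i], processed in fixed-size chunks.
void saxpy_strip_mined(float a, const std::vector<float>& x,
                       std::vector<float>& y, std::size_t chunk = 1024) {
    assert(x.size() == y.size());
    assert(chunk > 0);
    const std::size_t n = x.size();
    const float* xp = x.data();
    float* yp = y.data();
    // Outer loop: one iteration per chunk. In a threaded version these
    // iterations are what would be handed out to worker threads.
    for (std::size_t start = 0; start < n; start += chunk) {
        const std::size_t end = std::min(start + chunk, n);
        // Inner loop: stride-1 access over one chunk, the SIMD candidate.
        // Compilers built without OpenMP SIMD support ignore the pragma.
#pragma omp simd
        for (std::size_t i = start; i < end; ++i) {
            yp[i] += a * xp[i];
        }
    }
}
```

The last chunk is clipped with `std::min`, so sizes that are not a multiple of the chunk length are handled without a separate remainder loop.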

TimP
Honored Contributor III

The generated code for the lambda functions does appear in a saved asm file. In several examples, though not all, the lambda code looks to be of fair quality. I'm not judging by whether the compiler chooses insertps or vgather instructions; it makes the same choice when compiling for omp simd. Running under VTune, it appears that time spent in cilkrts may be excessive, particularly when running on a multiple-CPU platform, with HyperThreading active, or on MIC.

According to my tests, when running on the dual-core + HT platform, CILK_NWORKERS=3 is optimum. It seems that with 2 workers, one core may be idle 50% of the time, while there is no gain (and possible turbo clock-down) from using all logical threads in floating point code. I have just one case where cilk_for _Simd runs faster than either plain cilk_for or CEAN single-thread vector code on the HT platform.
