The abstraction penalty will do nothing to close any performance gap between CUDA and SYCL. SYCL requires a magic compiler that intelligently and automatically segments device code from host code. Nothing like it currently exists, and I expect it will be some time before it's available on any platform.
Well, I can't see any reason for a performance gap between CUDA and OpenCL (other than a bad implementation).
All the wrapping from C++ to OpenCL kernels and calls will have to do exactly the same work the CUDA compiler is currently doing, so I would not see it as a penalty, since it is done at compile time. And since there are pragmas to guide the compiler on what to extract, it is not really that hard. CUDA is doing exactly that, so it is not true that nothing like it exists :)
In both cases you are going to have to write device-specific code if you want any sort of good performance. Memory management and access have been the main issue in all my tests; for example, memory handling in CUDA is currently very different from what I used with the Intel HD Graphics, since Intel supports memory shared between the CPU and GPU.
I don't think this part will change greatly; it will still need special attention based on the data and the workflow.
With SYCL we will gain easier access to OpenCL for developers (I personally greatly prefer C++ with templates over C code with string-defined kernels), and OpenCL opens up more devices and platforms to us than CUDA currently provides.
Please see http://lists.llvm.org/pipermail/cfe-dev/2019-January/060811.html.
You can also try Codeplay's ComputeCpp SYCL compiler using Intel's OpenCL implementation. I've tried it on both CPU and GPU with good results.