> overhead of the OpenCL API, which was built with the assumption of physically separate memory between CPU and GPU
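Actually, OpenCL defines the enqueueMapBuffer function (clEnqueueMapBuffer in the C API), which gives the host a pointer to a buffer instead of forcing an explicit copy. Intel advises using it. And, by the way, the competitor, with its recent processors with integrated GPUs, also advises using this function.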
Do you mean that OpenCL shows significantly lower performance when computing the sin(float a) function than generic host code does? Well, if that is true when calling the function 1,000,000 times (for different elements), then it is strange and needs investigation. But if you are talking about just a single call, then that is to be expected. OpenCL is for relatively large parallel computation tasks.
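To make the single-call versus 1,000,000-element point concrete, here is a rough sketch of a linear cost model: a fixed per-launch overhead plus a per-element cost. The model and the function names `offload_time` and `break_even_n` are my own simplification for illustration, not anything defined by OpenCL, and any timings you feed in have to come from your own measurements.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed model: time to process n elements on the device is a fixed
   launch/transfer overhead plus a constant cost per element (seconds). */
double offload_time(double overhead_s, double dev_per_elem_s, size_t n)
{
    return overhead_s + dev_per_elem_s * (double)n;
}

/* Smallest n for which the device is no slower than the host under the
   model above. Returns SIZE_MAX if the device never catches up, that is,
   when its per-element cost is not lower than the host's. */
size_t break_even_n(double overhead_s, double host_per_elem_s,
                    double dev_per_elem_s)
{
    if (host_per_elem_s <= dev_per_elem_s)
        return SIZE_MAX;
    double x = overhead_s / (host_per_elem_s - dev_per_elem_s);
    size_t n = (size_t)x;          /* truncate toward zero */
    if ((double)n < x)             /* round up to the next whole element */
        ++n;
    return n;
}
```

With purely hypothetical inputs of 100 microseconds of overhead, 10 ns per element on the host and 1 ns per element on the device, one element costs far more on the device than on the host, while for 1,000,000 elements the overhead is amortized and the device comes out ahead.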