Vector_Add example with 4 Compute Units

I am trying to understand the use of the "num_compute_units(N)" attribute using the "Vector_Add" example. I do not have a physical board, so I am using the emulator. I attach an image of the OpenCL code with the above attribute. I have several doubts about how it works, because the kernel execution time without the attribute is better than with the "num_compute_units(4)" attribute (as in the attached image). I expected the code with four CUs to have a shorter execution time. Do I have to make some changes to the OpenCL code?


Thanks for your help 


Marco Montini
Run time in the emulator is *not* representative of run time on the hardware; in fact, run time in the emulator *does not mean anything whatsoever*. Altera's emulator is not timing-accurate and hence should not be used for anything other than *functional verification*.
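
If you later get access to a board, kernel time is usually read from OpenCL event profiling rather than from wall-clock time around the emulator run. Below is a minimal sketch of the conversion step only. The helper name `profiled_ms` is hypothetical; it assumes the two timestamps were obtained with `clGetEventProfilingInfo` on a queue created with `CL_QUEUE_PROFILING_ENABLE`.

```c
#include <stdint.h>

/* Hypothetical helper: convert a pair of OpenCL profiling timestamps
 * (reported in nanoseconds) into elapsed milliseconds.
 *
 * On real hardware the two values would come from the kernel's event:
 *   clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
 *                           sizeof(cl_ulong), &start_ns, NULL);
 *   clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
 *                           sizeof(cl_ulong), &end_ns, NULL);
 * Timestamps taken in the emulator are still not meaningful. */
static double profiled_ms(uint64_t start_ns, uint64_t end_ns)
{
    return (double)(end_ns - start_ns) / 1e6;
}
```

For example, `profiled_ms(1000000, 3500000)` returns 2.5, i.e. 2.5 ms between the start and end of kernel execution. The numbers here are illustrative only, not measurements.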


Furthermore, I already explained here why there is no point in using multiple compute units for the vector add example: