Are you sure the Nvidia compiler did a full rebuild? I've found that it caches previous builds, keyed by a hash of the source file. For example, if you undo some changes, or modify the file so that it matches a previous build, a full rebuild will not be triggered.
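If the cache really is keyed on the source text as described above, one way to rule it out while timing is to make every source string unique before handing it to clCreateProgramWithSource. Here is a minimal sketch in plain C. The helper name make_unique_source and the "cache-bust" comment are my own, and it assumes the driver hashes the raw source, so that a trailing comment is enough to change the key.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of OpenCL): returns a malloc'd copy of
 * `src` with a trailing comment that contains `nonce`. Assumption: the
 * driver's build cache hashes the raw source text, so a different nonce
 * gives a different cache key. The caller frees the result. */
char *make_unique_source(const char *src, unsigned long nonce)
{
    assert(src != NULL);
    const char *fmt = "%s\n// cache-bust %lu\n";

    /* First pass: ask snprintf how many characters are needed. */
    int needed = snprintf(NULL, 0, fmt, src, nonce);
    if (needed < 0)
        return NULL;

    char *out = malloc((size_t)needed + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: actually format into the buffer. */
    snprintf(out, (size_t)needed + 1, fmt, src, nonce);
    return out;
}
```

Pass the returned string to clCreateProgramWithSource with a nonce that changes on every run, for example (unsigned long)time(NULL), and compare the clBuildProgram time against a run with the unmodified source. Nvidia's CUDA documentation also describes a CUDA_CACHE_DISABLE environment variable for its JIT cache; whether it covers OpenCL builds on a given driver is worth verifying.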
One of my kernels takes more than 10 seconds to compile with Nvidia (clBuildProgram), but less than two seconds with Intel. My results with Nvidia seem to vary, though, so I am considering precompiling the .cl source to .ptx and loading it with clCreateProgramWithBinary rather than clCreateProgramWithSource.
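The precompile idea could look roughly like the sketch below: build once from source, dump the binary that clGetProgramInfo returns (PTX text on Nvidia), and on later runs load it with clCreateProgramWithBinary. The helper names (save_blob, load_blob, save_program_binary, load_program_binary) and the USE_OPENCL macro are my own. The OpenCL part is guarded so that the file helpers compile without the SDK, and it assumes a program built for exactly one device.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write `size` bytes to `path`. Returns 0 on success. */
int save_blob(const char *path, const unsigned char *data, size_t size)
{
    assert(path != NULL && data != NULL);
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    size_t written = fwrite(data, 1, size, f);
    int close_failed = fclose(f);
    return (written == size && close_failed == 0) ? 0 : -1;
}

/* Hypothetical helper: read a whole file into a malloc'd buffer.
 * Returns NULL on failure; on success stores the length in *size_out. */
unsigned char *load_blob(const char *path, size_t *size_out)
{
    assert(path != NULL && size_out != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;
    unsigned char *buf = NULL;
    long len = -1;
    if (fseek(f, 0, SEEK_END) == 0 && (len = ftell(f)) >= 0 &&
        fseek(f, 0, SEEK_SET) == 0) {
        buf = malloc(len > 0 ? (size_t)len : 1);
        if (buf != NULL && fread(buf, 1, (size_t)len, f) != (size_t)len) {
            free(buf);
            buf = NULL;
        }
    }
    fclose(f);
    if (buf != NULL)
        *size_out = (size_t)len;
    return buf;
}

#ifdef USE_OPENCL /* define when building against an OpenCL SDK */
#include <CL/cl.h>

/* Save the binary of a program that was built for exactly one device. */
int save_program_binary(cl_program program, const char *path)
{
    size_t size = 0;
    if (clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                         sizeof(size), &size, NULL) != CL_SUCCESS || size == 0)
        return -1;
    unsigned char *binary = malloc(size);
    if (binary == NULL)
        return -1;
    int rc = -1;
    /* CL_PROGRAM_BINARIES expects an array of pointers, one per device. */
    if (clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                         sizeof(binary), &binary, NULL) == CL_SUCCESS)
        rc = save_blob(path, binary, size);
    free(binary);
    return rc;
}

/* Load a saved binary. clBuildProgram is still required afterwards. */
cl_program load_program_binary(cl_context ctx, cl_device_id dev,
                               const char *path)
{
    size_t size = 0;
    unsigned char *binary = load_blob(path, &size);
    if (binary == NULL)
        return NULL;
    const unsigned char *bins[] = { binary };
    cl_int status = CL_SUCCESS, err = CL_SUCCESS;
    cl_program p = clCreateProgramWithBinary(ctx, 1, &dev, &size, bins,
                                             &status, &err);
    if (err == CL_SUCCESS && status == CL_SUCCESS &&
        clBuildProgram(p, 1, &dev, NULL, NULL, NULL) == CL_SUCCESS) {
        free(binary);
        return p;
    }
    if (p != NULL)
        clReleaseProgram(p);
    free(binary);
    return NULL;
}
#endif /* USE_OPENCL */
```

Keep in mind that a saved binary is tied to the platform it came from, and PTX still has to be finalized by clBuildProgram for the actual device, so this may shorten the build rather than eliminate it. Falling back to clCreateProgramWithSource when load_program_binary returns NULL seems like the safe default.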
NVidia definitely caches builds, and this has been discussed on this forum in the past. Compile times seem to vary based on the content of the loops. For example, loop unrolling, especially in AMD's compiler, can take extremely long when compiling for the GPU.