I have some experience with NVIDIA CUDA and low-level optimization for their GPUs. Recently I had to port that code to OpenCL so that it can run on more hardware platforms. I've checked Intel's white paper on their integrated GPUs, but so far I haven't managed to find what exactly (if anything) differs between NVIDIA's SIMT model and the SIMD model implemented in Intel's integrated GPUs. Furthermore, I've failed to find any video lectures on the Intel GPU architecture (in contrast, there is a Stanford course on iTunes U, which is quite helpful).
Any help regarding those will be greatly appreciated.
You can find a new and detailed architecture presentation here: The Compute Architecture of Intel® Processor Graphics Gen8.
There are more PDFs on the Intel site. Check out some of the (Haswell) Iris Pro OpenCL presentations, as they do a good job of enumerating the speeds and feeds of Gen 7.5.
I've been developing on CUDA for 5 years and over the last year did a deep dive into the Intel GEN architecture. As far as I can tell, the two models are largely two descriptions of the same idea: GEN is also a SIMT device, and CUDA hardware is also a SIMD device. :) Both execute groups of scalar "threads" in lockstep on wide vector ALUs; the SIMT/SIMD split is mostly a difference in programming model and terminology, not in the silicon.
However, with Intel IGPs there is some new terminology to learn. To squeeze out performance you should think in terms of EUs, EU hardware threads and their register files, and, finally, slices and sub-slices.
In recent IGPs, each relatively narrow EU has 7 hardware threads, each with a rather generous register file. The same generosity extends to shared memory: each sub-slice gets up to 64 KB of shared local memory.
Another noticeable difference is that work-groups (CUDA blocks) are relatively small, topping out at roughly 256-512 work-items.
My current belief is that if you're already a GPU optimization wizard, you'll be surprised at the amount of register and shared-memory resources available per (narrow) EU. Multiply that by the number of EUs in your IGP and you'll start to see new optimization opportunities.