I think your actual questions may be obscured by buzzwords, but I'll take a stab.
Parallel programming models intended to cover the range of architectures you mention are still under active development.
Intel compilers have supported OpenMP, as well as Windows threads and pthreads, for several years. The TBB threading model for C++ has also become well established. These work with multiple vendors' products. Current Intel C++ has added Cilk Plus and ArBB (Array Building Blocks) for threaded parallelism. They are being extended to the Intel Many Integrated Core (MIC) architecture.
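To give a feel for the OpenMP model mentioned above, here is a minimal sketch of a parallel reduction in C++. The function name `parallel_sum` is made up for illustration and is not part of any library. It assumes a compiler with OpenMP enabled (for example `-fopenmp` with GCC or `-qopenmp` with recent Intel compilers); if OpenMP is not enabled, the pragma is ignored and the loop simply runs serially.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: sums the elements of `values`.
// With OpenMP enabled, loop iterations are split across threads and each
// thread's partial sum is combined by the reduction clause.
// Without OpenMP, the pragma is ignored and this is an ordinary serial loop.
double parallel_sum(const std::vector<double>& values) {
    double total = 0.0;
    // A signed loop index keeps this compatible with older OpenMP versions.
    const long n = static_cast<long>(values.size());

    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += values[static_cast<std::size_t>(i)];
    }
    return total;
}
```

The same source builds with any vendor's OpenMP-capable compiler, which is the portability point being made here. Note that floating-point results may differ slightly between serial and threaded runs, because the order of the additions changes.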
For a degree of portability across GPGPU-like architectures, the available alternatives include OpenCL and extended OpenMP-style models for traditional programming languages, such as the PGI CUDA compilers and the similar Intel MIC compilers. This area is evolving rapidly, with most vendors aiming to converge in some fashion with "big CPUs" over the next 5 years.
When you speak of web scale, one possibility that comes to mind is Hadoop clustering, which is built on a Java-based model. Big strides in expanding its deployment and efficiency have been made this year. It seems likely that serious efforts to incorporate massive parallelism will be undertaken over the next 2 or 3 years, but the software and hardware components probably don't exist yet, so this is clearly still a research topic.