>>Third did you try to partially specialize TDataSet class for SSE and for AVX data types by using manual
Partially, yes. But there are so many Intel intrinsic domains (MMX, SSE, SSE2, SSE4, AVX, AVX2, etc.) that full support for all of them is absolutely useless. For example, nobody is interested in MMX or SSE at the moment.
Also, I consider the current state of the Intel intrinsic domains very messy, with lots of inconsistencies. In my thread devoted to Integration of the Watcom C++ compiler, I have already expressed my point of view that direct usage of Intel intrinsics does not solve all performance problems.
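To make the specialization idea concrete, here is a minimal sketch of what a manual specialization per intrinsic domain could look like. It is not the actual TDataSet code: the class body, the `Sum` method, and the `ScalarTag` / `SseTag` / `AvxTag` types are all assumptions made up for illustration (the tags stand in for the `__m128` and `__m256` register types). It uses explicit specializations for brevity; a partial specialization would follow the same pattern. The AVX variant is compiled only when the compiler reports AVX support.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <immintrin.h>
#define SKETCH_HAS_SSE 1
#endif

// Hypothetical tags naming the intrinsic domain; they stand in for the
// register types (__m128, __m256) and are not part of the original class.
struct ScalarTag {};
struct SseTag {};
struct AvxTag {};

// Hypothetical stand-in for TDataSet: generic, portable scalar path.
template <typename T, typename Domain = ScalarTag>
class TDataSet {
public:
    explicit TDataSet(std::vector<T> data) : m_data(std::move(data)) {}
    T Sum() const {
        T total = T();
        for (std::size_t i = 0; i < m_data.size(); ++i) total += m_data[i];
        return total;
    }
private:
    std::vector<T> m_data;
};

#if defined(SKETCH_HAS_SSE)
// Manual specialization: float data summed in 128-bit SSE registers.
template <>
class TDataSet<float, SseTag> {
public:
    explicit TDataSet(std::vector<float> data) : m_data(std::move(data)) {}
    float Sum() const {
        const std::size_t n = m_data.size();
        __m128 acc = _mm_setzero_ps();
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4)                  // 4 floats per step
            acc = _mm_add_ps(acc, _mm_loadu_ps(&m_data[i]));
        float lanes[4];
        _mm_storeu_ps(lanes, acc);
        float total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
        for (; i < n; ++i) total += m_data[i];      // scalar tail
        return total;
    }
private:
    std::vector<float> m_data;
};
#endif

#if defined(SKETCH_HAS_SSE) && defined(__AVX__)
// Manual specialization: float data summed in 256-bit AVX registers.
template <>
class TDataSet<float, AvxTag> {
public:
    explicit TDataSet(std::vector<float> data) : m_data(std::move(data)) {}
    float Sum() const {
        const std::size_t n = m_data.size();
        __m256 acc = _mm256_setzero_ps();
        std::size_t i = 0;
        for (; i + 8 <= n; i += 8)                  // 8 floats per step
            acc = _mm256_add_ps(acc, _mm256_loadu_ps(&m_data[i]));
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);
        float total = 0.0f;
        for (int k = 0; k < 8; ++k) total += lanes[k];
        for (; i < n; ++i) total += m_data[i];      // scalar tail
        return total;
    }
private:
    std::vector<float> m_data;
};
#endif
```

Usage would be something like `TDataSet<float, SseTag> ds(values); float s = ds.Sum();`. Note that every additional domain needs its own copy of the class body, which is why supporting all of them quickly becomes impractical. Also, the vectorized versions add the elements in a different order than the scalar loop, so results for general floating-point data can differ slightly in the last bits.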
[ Computer System used for performance evaluations ]
** Dell Precision Mobile M4700 **
Intel Core i7-3840QM ( 2.80 GHz )
Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/products/70846
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
Windows 7 Professional 64-bit SP1
Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
Display resolution: 1366 x 768