Hi all,
I've been using templates to write efficient, natural-looking maths that can be compiled against a variety of instruction sets and word widths. The basic idea is to write template classes that take a datatype & math-operations class as a template argument; that class implements the operations for a specific instruction set and data layout (whether that be the x87 FPU, SSE, AVX, etc.).
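For readers who haven't seen this pattern before, it looks roughly like the following (a minimal sketch with illustrative names, not the actual attached classes):
#include <immintrin.h>
class MathOps_SSE
{
public:
    typedef __m128 vec_elem; // 4 floats per element
    static vec_elem set1(float f) { return _mm_set1_ps(f); }
    static vec_elem mul(vec_elem a, vec_elem b) { return _mm_mul_ps(a, b); }
    static vec_elem add(vec_elem a, vec_elem b) { return _mm_add_ps(a, b); }
};
class MathOps_AVX
{
public:
    typedef __m256 vec_elem; // 8 floats per element
    static vec_elem set1(float f) { return _mm256_set1_ps(f); }
    static vec_elem mul(vec_elem a, vec_elem b) { return _mm256_mul_ps(a, b); }
    static vec_elem add(vec_elem a, vec_elem b) { return _mm256_add_ps(a, b); }
};
// The algorithm is then written once against the MathOps interface and
// instantiated per instruction set:
template <class MathOps> class Gain : public MathOps
{
public:
    static typename MathOps::vec_elem Apply(typename MathOps::vec_elem x,
                                            typename MathOps::vec_elem g)
    {
        return MathOps::mul(x, g); // same source compiles for SSE or AVX
    }
};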
I have both SSE and AVX versions working well for single-element processing. However, when I extended the code to support higher-order parallelism, the compiler continues to generate great code for SSE but bogs right down for AVX: it emits lots of seemingly unnecessary loads, stores and moves that it isn't able to optimize out, which has a hugely detrimental impact on performance.
I'm running ICC 13 beta, VS2012 RC, on Win 7. The compiler command-line options are as follows:
/MP /GS- /Qftz /W3 /QxAVX /Gy /Zc:wchar_t /Zi /Ox /Ob2 /fp:fast /D "__INTEL_COMPILER=1300" /Qip /Zc:forScope /GR /arch:AVX /Gd /Oy /Oi /MT /EHsc /nologo /FAs /Ot
I've attached the C++ source and the SSE & AVX code generated by the compiler. (NB: I've done a search-and-replace on a few symbols for clarity, but the output is otherwise unmodified.)
The SSE code looks pretty much as I'd expect, aside from the compiler being smart enough to figure out that myTest2 is a constant expression it doesn't need to load multiple times, and interleaving the adds with the multiplies to help IPC along a bit. I'm really happy to be able to write natural-looking C++ and have the compiler do the hard work here!
Unfortunately the AVX code, from the same compiler version, the same optimization level and very similar MathOps base classes, is far from optimal: dozens of move instructions, switching between YMM and XMM registers, and so on. I'll attach the relevant snippets of the MathOps template classes below.
C++ source - see attachments for SSE & AVX MathOps base classes
template <class MathOps> class MyFilter : public MathOps
{
public:
    static void TestFunction()
    {
        _ReadWriteBarrier(); // prevent optimizations bleeding later/earlier

        vec_float myTest1;
        myTest1.m[0] = _mm_set1_ps(125.f); // (switch these to _mm256_set1_ps for the AVX version)
        myTest1.m[1] = _mm_set1_ps(126.f);
        myTest1.m[2] = _mm_set1_ps(127.f);
        myTest1.m[3] = _mm_set1_ps(128.f);
        vec_float myTest2 = 135.f;

        myTest1 += (myTest1 * myTest2);

        static vec_float myTest_Out = myTest1; // store it in a static so the compiler won't discard the calculations entirely
        _ReadWriteBarrier();
    }
};

int main()
{
    MyFilter<MathOps_SSEx4>::TestFunction();
    // MyFilter<MathOps_AVXx4>::TestFunction();
}
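For anyone reading along without pulling the attachments, the AVX vec_float is roughly this shape (heavily simplified; the member layout matches the test above, but the operator bodies here are my paraphrase rather than the attached code):
#include <immintrin.h>
class vec_float
{
public:
    __m256 m[4]; // 4 x 8 floats = 32 lanes of higher-order parallelism

    vec_float() {}
    vec_float(float f) // broadcast a scalar across all lanes
    {
        m[0] = m[1] = m[2] = m[3] = _mm256_set1_ps(f);
    }

    vec_float operator*(const vec_float& rhs) const
    {
        vec_float r;
        for (int i = 0; i < 4; ++i)
            r.m[i] = _mm256_mul_ps(m[i], rhs.m[i]);
        return r;
    }

    vec_float& operator+=(const vec_float& rhs)
    {
        for (int i = 0; i < 4; ++i)
            m[i] = _mm256_add_ps(m[i], rhs.m[i]);
        return *this;
    }
};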
SSE code generated by compiler
;;; vec_float myTest1;
;;; myTest1.m[0] = _mm_set1_ps(125.f);
;;; myTest1.m[1] = _mm_set1_ps(126.f);
;;; myTest1.m[2] = _mm_set1_ps(127.f);
;;; myTest1.m[3] = _mm_set1_ps(128.f);
;;; vec_float myTest2 = 135.f;
vmovups xmm0, XMMWORD PTR [_2il0floatpacket.1741] ;535.13
;;; myTest1 += (myTest1 * myTest2);
vmulps xmm1, xmm0, XMMWORD PTR [_2il0floatpacket.1737] ;536.3
vmulps xmm2, xmm0, XMMWORD PTR [_2il0floatpacket.1738] ;536.3
vmulps xmm3, xmm0, XMMWORD PTR [_2il0floatpacket.1739] ;536.3
vaddps xmm5, xmm1, xmm1 ;536.3
vmulps xmm4, xmm0, XMMWORD PTR [_2il0floatpacket.1740] ;536.3
vaddps xmm0, xmm2, xmm2 ;536.3
vaddps xmm1, xmm3, xmm3 ;536.3
vaddps xmm2, xmm4, xmm4 ;536.3
;;; static vec_float myTest_Out = myTest1; // store it in a static so the compiler won't discard the calculations entirely.
vmovups XMMWORD PTR [?myTest_Out@?1@Z@4Vvec_float@Mathops_SSEx4@4@A+16], xmm0 ;537.20
vmovups XMMWORD PTR [?myTest_Out@?1@Z@4Vvec_float@Mathops_SSEx4@4@A+32], xmm1 ;537.20
vmovups XMMWORD PTR [?myTest_Out@?1@Z@4Vvec_float@Mathops_SSEx4@4@A+48], xmm2 ;537.20
vmovups XMMWORD PTR [?myTest_Out@?1@Z@4Vvec_float@Mathops_SSEx4@4@A], xmm5 ;537.20
AVX code generated by compiler
;;;
;;; vec_float myTest1;
;;; myTest1.m[0] = _mm256_set1_ps(125.f);
vmovups ymm0, YMMWORD PTR [_2il0floatpacket.1745] ;531.3
$LN1856:
;;; myTest1.m[1] = _mm256_set1_ps(126.f);
vmovups ymm2, YMMWORD PTR [_2il0floatpacket.1746] ;532.3
$LN1857:
;;; myTest1.m[2] = _mm256_set1_ps(127.f);
vmovups ymm4, YMMWORD PTR [_2il0floatpacket.1747] ;533.3
$LN1858:
;;; myTest1.m[3] = _mm256_set1_ps(128.f);
vmovups ymm6, YMMWORD PTR [_2il0floatpacket.1748] ;534.3
$LN1859:
;;; vec_float myTest2 = 135.f;
vmovups ymm7, YMMWORD PTR [_2il0floatpacket.1749] ;535.13
$LN1860:
vmovups YMMWORD PTR [esp], ymm0 ;531.3
$LN1861:
vmovups YMMWORD PTR [32+esp], ymm2 ;532.3
$LN1862:
vmovups YMMWORD PTR [64+esp], ymm4 ;533.3
$LN1863:
vmovups YMMWORD PTR [96+esp], ymm6 ;534.3
$LN1864:
;;; myTest1 += (myTest1 * myTest2);
vmulps ymm1, ymm0, ymm7 ;536.3
$LN1865:
vmulps ymm3, ymm2, ymm7 ;536.3
$LN1866:
vmulps ymm5, ymm4, ymm7 ;536.3
$LN1867:
vmulps ymm0, ymm6, ymm7 ;536.3
$LN1868:
vmovups YMMWORD PTR [128+esp], ymm1 ;536.3
$LN1869:
vmovups YMMWORD PTR [160+esp], ymm3 ;536.3
$LN1870:
vmovups YMMWORD PTR [192+esp], ymm5 ;536.3
$LN1871:
vmovups YMMWORD PTR [224+esp], ymm0 ;536.3
$LN1872:
vmovups xmm1, XMMWORD PTR [144+esp] ;536.3
$LN1873:
vmovups XMMWORD PTR [16+esp], xmm1 ;536.3
$LN1874:
vmovups xmm2, XMMWORD PTR [160+esp] ;536.3
$LN1875:
vmovups XMMWORD PTR [32+esp], xmm2 ;536.3
$LN1876:
vmovups xmm3, XMMWORD PTR [176+esp] ;536.3
$LN1877:
vmovups XMMWORD PTR [48+esp], xmm3 ;536.3
$LN1878:
vmovups xmm4, XMMWORD PTR [192+esp] ;536.3
$LN1879:
vmovups XMMWORD PTR [64+esp], xmm4 ;536.3
$LN1880:
vmovups xmm5, XMMWORD PTR [208+esp] ;536.3
$LN1881:
vmovups XMMWORD PTR [80+esp], xmm5 ;536.3
$LN1882:
vmovups xmm6, XMMWORD PTR [224+esp] ;536.3
$LN1883:
vmovups XMMWORD PTR [96+esp], xmm6 ;536.3
$LN1884:
vmovups xmm7, XMMWORD PTR [240+esp] ;536.3
$LN1885:
vmovups XMMWORD PTR [112+esp], xmm7 ;536.3
$LN1886:
vmovups xmm0, XMMWORD PTR [128+esp] ;536.3
$LN1887:
vmovups XMMWORD PTR [esp], xmm0 ;536.3
$LN1888:
; LOE eax edx esi
.B48.3: ; Preds .B48.2
$LN1889:
vmovups ymm0, YMMWORD PTR [esp] ;536.3
$LN1890:
vmovups ymm2, YMMWORD PTR [32+esp] ;536.3
$LN1891:
vmovups ymm4, YMMWORD PTR [64+esp] ;536.3
$LN1892:
vmovups ymm6, YMMWORD PTR [96+esp] ;536.3
$LN1893:
vaddps ymm1, ymm0, ymm0 ;536.3
$LN1894:
vaddps ymm3, ymm2, ymm2 ;536.3
$LN1895:
vaddps ymm5, ymm4, ymm4 ;536.3
$LN1896:
vaddps ymm7, ymm6, ymm6 ;536.3
$LN1897:
vmovups YMMWORD PTR [128+esp], ymm1 ;536.3
$LN1898:
vmovups YMMWORD PTR [160+esp], ymm3 ;536.3
$LN1899:
vmovups YMMWORD PTR [192+esp], ymm5 ;536.3
$LN1900:
vmovups YMMWORD PTR [224+esp], ymm7 ;536.3
$LN1901:
vmovups xmm0, XMMWORD PTR [144+esp] ;536.3
$LN1902:
vmovups XMMWORD PTR [16+esp], xmm0 ;536.3
$LN1903:
vmovups xmm1, XMMWORD PTR [160+esp] ;536.3
$LN1904:
vmovups XMMWORD PTR [32+esp], xmm1 ;536.3
$LN1905:
vmovups xmm2, XMMWORD PTR [176+esp] ;536.3
$LN1906:
vmovups XMMWORD PTR [48+esp], xmm2 ;536.3
$LN1907:
vmovups xmm3, XMMWORD PTR [192+esp] ;536.3
$LN1908:
vmovups XMMWORD PTR [64+esp], xmm3 ;536.3
$LN1909:
vmovups xmm4, XMMWORD PTR [208+esp] ;536.3
$LN1910:
vmovups XMMWORD PTR [80+esp], xmm4 ;536.3
$LN1911:
vmovups xmm5, XMMWORD PTR [224+esp] ;536.3
$LN1912:
vmovups XMMWORD PTR [96+esp], xmm5 ;536.3
$LN1913:
vmovups xmm6, XMMWORD PTR [240+esp] ;536.3
$LN1914:
vmovups XMMWORD PTR [112+esp], xmm6 ;536.3
$LN1915:
vmovups xmm7, XMMWORD PTR [128+esp] ;536.3
$LN1916:
vmovups XMMWORD PTR [esp], xmm7 ;536.3
$LN1917:
; LOE eax edx esi
.B48.4: ; Preds .B48.3
$LN1918:
;;; static vec_float myTest_Out = myTest1; // store it in a static so the compiler won't discard the calculations entirely.
movzx ecx, BYTE PTR [??_B?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@3@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@51] ;537.20
$LN1919:
bts ecx, 0 ;537.20
$LN1920:
jc .B48.6 ; Prob 40% ;537.20
$LN1921:
; LOE eax edx ecx esi
.B48.5: ; Preds .B48.4
$LN1922:
mov BYTE PTR [??_B?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@3@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@51], cl ;537.31
$LN1923:
vmovups xmm0, XMMWORD PTR [16+esp] ;537.20
$LN1924:
vmovups XMMWORD PTR [?myTest_Out@?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@4@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@4Vvec_float@Mathops_AVXx4@4@A+16], xmm0 ;537.20
$LN1925:
vmovups xmm1, XMMWORD PTR [32+esp] ;537.20
$LN1926:
vmovups XMMWORD PTR [?myTest_Out@?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@4@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@4Vvec_float@Mathops_AVXx4@4@A+32], xmm1 ;537.20
$LN1927:
vmovups xmm2, XMMWORD PTR [48+esp] ;537.20
$LN1928:
vmovups XMMWORD PTR [?myTest_Out@?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@4@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@4Vvec_float@Mathops_AVXx4@4@A+48], xmm2 ;537.20
$LN1929:
vmovups xmm3, XMMWORD PTR [64+esp] ;537.20
$LN1930:
vmovups XMMWORD PTR [?myTest_Out@?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@4@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@4Vvec_float@Mathops_AVXx4@4@A+64], xmm3 ;537.20
$LN1931:
vmovups xmm4, XMMWORD PTR [80+esp] ;537.20
$LN1932:
vmovups XMMWORD PTR [?myTest_Out@?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@4@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@4Vvec_float@Mathops_AVXx4@4@A+80], xmm4 ;537.20
$LN1933:
vmovups xmm5, XMMWORD PTR [96+esp] ;537.20
$LN1934:
vmovups XMMWORD PTR [?myTest_Out@?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@4@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@4Vvec_float@Mathops_AVXx4@4@A+96], xmm5 ;537.20
$LN1935:
vmovups xmm6, XMMWORD PTR [112+esp] ;537.20
$LN1936:
vmovups XMMWORD PTR [?myTest_Out@?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@4@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@4Vvec_float@Mathops_AVXx4@4@A+112], xmm6 ;537.20
$LN1937:
vmovups xmm7, XMMWORD PTR [esp] ;537.20
$LN1938:
vmovups XMMWORD PTR [?myTest_Out@?1??ProcessBufferLoop@?$SAVXF@VIOGather_Original_AVXx4@VDSP@@@@SAXPBVSongInfo@VDSP@@PBUVoiceInfo@4@AAV?$VArray1@PAVVDSPVoice@VDSPNode@VDSP@@@VFoundation@@@Z@4Vvec_float@Mathops_AVXx4@4@A], xmm7 ;537.20