Intel® ISA Extensions

Performance of sqrt

Christian_M_2
Beginner
12,727 Views

Hello,

I am using the intrinsic for square root. I know from the Optimization Manual that I could use the reciprocal square root together with an approximation algorithm, but I need the accuracy.

The thing is that AVX shows no improvement over SSE. The Intrinsics Guide gave me some hints. Is it true that the square root operation is not pipelined for either SSE or AVX? At least the latency and throughput figures indicate this. AVX processes twice as much data per operation, but double the latency and half the throughput would mean the same overall performance. Is that so?
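The arithmetic behind that question can be written out. This is only a sketch: $T$ is a hypothetical reciprocal throughput (cycles per instruction) for the 128-bit square root, and the assumption that the 256-bit form needs exactly $2T$ is taken from the reasoning above, not from measured figures.

$$\text{SSE: } \frac{4\ \text{floats}}{T\ \text{cycles}} \qquad \text{AVX: } \frac{8\ \text{floats}}{2T\ \text{cycles}} = \frac{4\ \text{floats}}{T\ \text{cycles}}$$

Under that assumption both forms deliver the same number of square roots per cycle, so no speedup would be expected from the wider registers alone.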

My test system is an i5-2410M. In the Intrinsics Guide (I updated to the newest version) I only find latency and throughput for Sandy Bridge. Has the performance of these instructions improved in Ivy Bridge? Could anyone explain the CPUID(s) a little? Does 06_2A mean Sandy Bridge? Does this apply to all Sandy Bridge CPUs (regardless of desktop or mobile, or i3, i5, i7)?

For CPUID(s) I found: http://software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers

Does the intrinsics guide refer to a combination of family and model number? What about model numbers not mentioned in the intrinsics guide like Ivy Bridge?
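For reference, the `06_2A` notation is the display family and display model derived from the value that CPUID leaf 1 returns in EAX. The sketch below decodes such a value; `CpuSignature`, `decode_signature` and `format_signature` are hypothetical names chosen for illustration, and the EAX value is passed in as a plain integer so the code does not depend on a compiler-specific CPUID intrinsic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Family and model as written in Intel documentation (e.g. 06_2A).
struct CpuSignature {
    unsigned family;  // DisplayFamily
    unsigned model;   // DisplayModel
};

// Decode the value that CPUID leaf 1 returns in EAX.
CpuSignature decode_signature(std::uint32_t eax) {
    const unsigned base_family = (eax >> 8) & 0xF;
    const unsigned base_model  = (eax >> 4) & 0xF;
    const unsigned ext_model   = (eax >> 16) & 0xF;
    const unsigned ext_family  = (eax >> 20) & 0xFF;

    CpuSignature s;
    // The extended family is added only when the base family is 0xF.
    s.family = (base_family == 0xF) ? base_family + ext_family : base_family;
    // The extended model supplies the upper four bits for families 6 and 0xF.
    s.model = (base_family == 0x6 || base_family == 0xF)
                  ? ((ext_model << 4) | base_model)
                  : base_model;
    return s;
}

// Format as "FF_MM" in upper-case hex, e.g. "06_2A".
std::string format_signature(const CpuSignature& s) {
    char buf[16];
    const int n = std::snprintf(buf, sizeof buf, "%02X_%02X", s.family, s.model);
    assert(n > 0 && static_cast<std::size_t>(n) < sizeof buf);
    (void)n;  // silence the unused-variable warning when NDEBUG is defined
    return std::string(buf);
}
```

With an example input of `0x000206A7`, `format_signature(decode_signature(...))` yields `06_2A`; with `0x000306A9` it yields `06_3A`. Both input values are illustrative. On a real system the EAX value would come from the compiler's CPUID intrinsic, which is not shown here.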

0 Kudos
101 Replies
SergeyKostrov
Valued Contributor II
1,541 Views
>>...Do you have some idea?..

Let me take a look at the updated sources and I'll post an updated project some time next week. Thank you!
0 Kudos
SergeyKostrov
Valued Contributor II
1,541 Views
>>...Unfortunately I see a problem. You use very high iteration counts...

If I use a number that is less than 2^24, then the tests for AVX-sqrt will be executed in less than 15 ticks. I have an Intel Core i7-3840QM ( Ivy Bridge ) and it is very fast. So, there is nothing wrong here and you can use a lower number. That is why there is a piece of code like:

[cpp]
...
// int iNumberOfIterations = 16777216;  // 2^24
// int iNumberOfIterations = 33554432;  // 2^25
// int iNumberOfIterations = 67108864;  // 2^26
// int iNumberOfIterations = 134217728; // 2^27
int iNumberOfIterations = 268435456;    // 2^28
...
[/cpp]
0 Kudos
Christian_M_2
Beginner
1,541 Views

Ah this sounds quite good.

Let me know when you have the updated project. I will run it on two Sandy Bridge systems (a mobile i5 and a high-end desktop i7).

0 Kudos
SergeyKostrov
Valued Contributor II
1,541 Views
Here is the updated Visual Studio 2008 project ( Professional Edition ) with the Intel C++ compiler set by default. Please do a code review and, if everything looks good, we're ready to do testing.

Best regards,
Sergey
0 Kudos
SergeyKostrov
Valued Contributor II
1,541 Views
[ Example of Output ]

64-bit Windows platform

Notes:
- Processing is Normalized - Tests calculate 8 sqrt values per iteration
- Number of iterations is 33554432

Tests started

CRT Sqrt - float
        Calculating the Square Roots - Done in xx ticks
        625.000^0.5 = 25.000

SSE Sqrt - float
        Calculating the Square Roots - Done in xx ticks
        625.000^0.5 = 25.000

AVX Sqrt - float
        Calculating the Square Roots - Done in xx ticks
        625.000^0.5 = 25.000

STL vector size: 67108864 ( float elements )
Number of tests: 4

STL vector: STL Sqrt - float
        Calculating the Square Roots
        Test 1: xxx ticks
        Test 2: xxx ticks
        Test 3: xxx ticks
        Test 4: xxx ticks
        Average: xxx ticks

STL vector: SSE Sqrt - float
        Calculating the Square Roots
        Test 1: xx ticks
        Test 2: xx ticks
        Test 3: xx ticks
        Test 4: xx ticks
        Average: xx ticks

STL vector: AVX Sqrt - float
        Calculating the Square Roots
        Test 1: xx ticks
        Test 2: xx ticks
        Test 3: xx ticks
        Test 4: xx ticks
        Average: xx ticks

Tests completed

Press ESC to Exit...
0 Kudos
Christian_M_2
Beginner
1,541 Views

I think the code is quite good now, so let's start the tests and see what we get.

0 Kudos
SergeyKostrov
Valued Contributor II
1,541 Views
>>...Think code is quite good now, so lets start the tests and see what we get.

I'll post my results today in the afternoon. Thanks, Christian.
0 Kudos
Christian_M_2
Beginner
1,541 Views

Thanks, Sergey.

I think the whole combination of our test scenarios gives quite a good overview.

0 Kudos
SergeyKostrov
Valued Contributor II
1,541 Views
>>...I think the whole combination of our test scenarios gives quite a good overview.

If you wish I could e-mail my test results in a private message. Would you be able to create a combined report before posting it?
0 Kudos
Christian_M_2
Beginner
1,541 Views

You mean I run the test on another machine and then we post it together?

0 Kudos
SergeyKostrov
Valued Contributor II
1,541 Views
>>You mean I run the test on another machine and then we post it together?

Yes.
- I do the test on Ivy Bridge and email you results
- You do the test on Sandy Bridge and create a combined report
- You post results to the thread

Does it look good?
0 Kudos
Christian_M_2
Beginner
1,541 Views

Yes, that's fine. I think I can run the tests this evening and post the results tomorrow.

And please also email me the exe, so we test the same thing. Are you working with VS2008 and the Intel compiler?

0 Kudos
SergeyKostrov
Valued Contributor II
1,541 Views
>>...And please email me also the exe, so we test the same thing. You work with VS2008 and Intel Compiler?

Yes. I'll build binaries ( 32-bit and 64-bit Release Configurations ) for tests and pack them into a zip-archive. My test results also will be included.

Note: You will need the run-time DLLs ( Redistributable Package ) for Visual Studio 2008 and you can get them from Download.Microsoft.com.
0 Kudos
SergeyKostrov
Valued Contributor II
1,541 Views
>>>>...And please email me also the exe, so we test the same thing. You work with VS2008 and Intel Compiler?
>>
>>Yes. I'll build binaries ( 32-bit and 64-bit Release Configurations ) for tests and pack them into a zip-archive. My test results also will be included.

Done. Please check your private messages.

Best regards,
Sergey
0 Kudos
Christian_M_2
Beginner
1,541 Views

Here you find the test results, based on the project provided above. All additional information can be found in the output itself.

///////////////////////////////////////////////////////////////////////////////
    CONSOLE APPLICATION : SqrtTestApp Project Overview
///////////////////////////////////////////////////////////////////////////////

Release Notes:

    6. Tests on Sandy Bridge system:

    >> 32-bit <<

        32-bit Windows platform

        Notes:
        - Processing is Normalized - Tests calculate 8 sqrt values per iteration
        - Number of iterations is 33554432

        Tests started

        CRT Sqrt - float
                Calculating the Square Roots - Done in 47 ticks
                625.000^0.5 = 25.000

        SSE Sqrt - float
                Calculating the Square Roots - Done in 172 ticks
                625.000^0.5 = 25.000

        AVX Sqrt - float
                Calculating the Square Roots - Done in 31 ticks
                625.000^0.5 = 25.000

        STL vector size: 67108864 ( float elements )
        Number of tests: 4

        STL vector: STL sqrt - float
                Calculating the Square Roots
                Test  1: 327 ticks
                Test  2: 343 ticks
                Test  3: 344 ticks
                Test  4: 327 ticks
                Average: 335 ticks

        STL vector: SSE sqrt - float
                Calculating the Square Roots
                Test  1: 93 ticks
                Test  2: 94 ticks
                Test  3: 78 ticks
                Test  4: 94 ticks
                Average: 89 ticks

        STL vector: AVX sqrt - float
                Calculating the Square Roots
                Test  1: 94 ticks
                Test  2: 78 ticks
                Test  3: 93 ticks
                Test  4: 94 ticks
                Average: 89 ticks

        Tests completed

        Press ESC to Exit...


    >> 64-bit <<

        64-bit Windows platform

        Notes:
        - Processing is Normalized - Tests calculate 8 sqrt values per iteration
        - Number of iterations is 33554432

        Tests started

        CRT Sqrt - float
                Calculating the Square Roots - Done in 47 ticks
                625.000^0.5 = 25.000

        SSE Sqrt - float
                Calculating the Square Roots - Done in 187 ticks
                625.000^0.5 = 25.000

        AVX Sqrt - float
                Calculating the Square Roots - Done in 16 ticks
                625.000^0.5 = 25.000

        STL vector size: 67108864 ( float elements )
        Number of tests: 4

        STL vector: STL sqrt - float
                Calculating the Square Roots
                Test  1: 328 ticks
                Test  2: 343 ticks
                Test  3: 343 ticks
                Test  4: 343 ticks
                Average: 339 ticks

        STL vector: SSE sqrt - float
                Calculating the Square Roots
                Test  1: 78 ticks
                Test  2: 78 ticks
                Test  3: 93 ticks
                Test  4: 78 ticks
                Average: 81 ticks

        STL vector: AVX sqrt - float
                Calculating the Square Roots
                Test  1: 78 ticks
                Test  2: 94 ticks
                Test  3: 93 ticks
                Test  4: 94 ticks
                Average: 89 ticks

        Tests completed

        Press ESC to Exit...


    5. Sandy Bridge system:

        OS Name    Microsoft Windows 7 Home Premium
        Version    6.1.7601 Service Pack 1 Build 7601
        Other OS Description     Not Available
        OS Manufacturer    Microsoft Corporation
        System Name    DANIELA-LAPTOP
        System Manufacturer    SAMSUNG ELECTRONICS CO., LTD.
        System Model    RV420/RV520/RV720/E3530/S3530/E3420/E3520
        System Type    x64-based PC
        Processor    Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz, 2301 MHz, 2 Core(s), 4 Logical Processor(s)
        BIOS Version/Date    Phoenix Technologies Ltd. 03PQ, 08.07.2011
        SMBIOS Version    2.6
        Windows Directory    C:\Windows
        System Directory    C:\Windows\system32
        Boot Device    \Device\HarddiskVolume1
        Locale    Austria
        Hardware Abstraction Layer    Version = "6.1.7601.17514"
        User Name    Daniela-Laptop\Daniela
        Time Zone    Central European Time
        Installed Physical Memory (RAM)    6.00 GB
        Total Physical Memory    5.98 GB
        Available Physical Memory    4.28 GB
        Total Virtual Memory    12.0 GB
        Available Virtual Memory    10.3 GB
        Page File Space    5.98 GB
        Page File    C:\pagefile.sys


    4. Tests on Ivy Bridge system:

    >> 32-bit <<

        ..\SqrtTestApp\Release>SqrtTestApp32.exe
        32-bit Windows platform

        Notes:
        - Processing is Normalized - Tests calculate 8 sqrt values per iteration
        - Number of iterations is 33554432

        Tests started

        CRT Sqrt - float
                Calculating the Square Roots - Done in 62 ticks
                625.000^0.5 = 25.000

        SSE Sqrt - float
                Calculating the Square Roots - Done in 109 ticks
                625.000^0.5 = 25.000

        AVX Sqrt - float
                Calculating the Square Roots - Done in 31 ticks
                625.000^0.5 = 25.000

        STL vector size: 67108864 ( float elements )
        Number of tests: 4

        STL vector: STL sqrt - float
                Calculating the Square Roots
                Test  1: 343 ticks
                Test  2: 359 ticks
                Test  3: 343 ticks
                Test  4: 359 ticks
                Average: 351 ticks

        STL vector: SSE sqrt - float
                Calculating the Square Roots
                Test  1: 62 ticks
                Test  2: 47 ticks
                Test  3: 47 ticks
                Test  4: 62 ticks
                Average: 54 ticks

        STL vector: AVX sqrt - float
                Calculating the Square Roots
                Test  1: 47 ticks
                Test  2: 62 ticks
                Test  3: 47 ticks
                Test  4: 47 ticks
                Average: 50 ticks

        Tests completed

    >> 64-bit <<

        ..\SqrtTestApp\x64\Release>SqrtTestApp64.exe
        64-bit Windows platform

        Notes:
        - Processing is Normalized - Tests calculate 8 sqrt values per iteration
        - Number of iterations is 33554432

        Tests started

        CRT Sqrt - float
                Calculating the Square Roots - Done in 47 ticks
                625.000^0.5 = 25.000

        SSE Sqrt - float
                Calculating the Square Roots - Done in 109 ticks
                625.000^0.5 = 25.000

        AVX Sqrt - float
                Calculating the Square Roots - Done in 31 ticks
                625.000^0.5 = 25.000

        STL vector size: 67108864 ( float elements )
        Number of tests: 4

        STL vector: STL sqrt - float
                Calculating the Square Roots
                Test  1: 359 ticks
                Test  2: 343 ticks
                Test  3: 359 ticks
                Test  4: 343 ticks
                Average: 351 ticks
        
        STL vector: SSE sqrt - float
                Calculating the Square Roots
                Test  1: 47 ticks
                Test  2: 62 ticks
                Test  3: 47 ticks
                Test  4: 47 ticks
                Average: 50 ticks

        STL vector: AVX sqrt - float
                Calculating the Square Roots
                Test  1: 47 ticks
                Test  2: 47 ticks
                Test  3: 47 ticks
                Test  4: 47 ticks
                Average: 47 ticks

        Tests completed

    3. Ivy Bridge system:

        OS Name                            Microsoft Windows 7 Professional
        Version                            6.1.7601 Service Pack 1 Build 7601
        Other OS Description             Not Available
        OS Manufacturer                    Microsoft Corporation
        System Name                        DELLPM
        System Manufacturer                Dell Inc.
        System Model                    Precision M4700
        System Type                        x64-based PC
        Processor                        Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz, 2801 Mhz, 4 Core(s), 8 Logical Processor(s)
        BIOS Version/Date                Dell Inc. A05, 08/10/2012
        SMBIOS Version                    2.7
        Windows Directory                C:\Windows
        System Directory                C:\Windows\System32
        Boot Device                        \Device\HarddiskVolume2
        Locale                            Canada
        Hardware Abstraction Layer        Version = "6.1.7601.17514"
        User Name                        DellPM\Admin
        Time Zone                        Mountain Standard Time
        Installed Physical Memory (RAM)    16.0 GB
        Total Physical Memory            15.9 GB
        Available Physical Memory        14.3 GB
        Total Virtual Memory            47.9 GB
        Available Virtual Memory        46.3 GB
        Page File Space                    32.0 GB
        Page File                        C:\pagefile.sys

    2. When int iNumberOfIterations = 268435456 ( 2^28 ) there is Microsoft C++
       exception: std::length_error ( Vector is too long ) and application crashes

       Fixed. A different way of processing is used now.

    1. Renamed aligned_alloc.h to AlignedAlloc.h

///////////////////////////////////////////////////////////////////////////////
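Regarding release note 2: the notes do not say what the "different way of processing" is. One possible approach, shown here purely as an assumption, is to reuse a fixed-size buffer and process the work in chunks instead of allocating one very large vector. The function name `sqrt_in_chunks` and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: computes `total` square roots of `value` while never
// holding more than `chunk` floats in memory at once. Returns the sum of all
// results so the work has an observable effect.
double sqrt_in_chunks(std::size_t total, std::size_t chunk, float value) {
    if (total == 0 || chunk == 0) {
        return 0.0;  // nothing to do, and a zero chunk size would never advance
    }
    std::vector<float> buffer(std::min(total, chunk));
    double checksum = 0.0;
    for (std::size_t done = 0; done < total;) {
        const std::size_t n = std::min(chunk, total - done);
        // Refill the reusable buffer with input data for this chunk.
        std::fill_n(buffer.begin(), n, value);
        for (std::size_t i = 0; i < n; ++i) {
            buffer[i] = std::sqrt(buffer[i]);
            checksum += buffer[i];
        }
        done += n;
    }
    return checksum;
}
```

For example, `sqrt_in_chunks(10, 4, 625.0f)` processes the ten values in chunks of 4, 4 and 2 elements. The trade-off is that chunked processing keeps the working set small, so it measures something different from a single pass over one huge array.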

0 Kudos
SergeyKostrov
Valued Contributor II
1,541 Views
Here is a short overview of the test implemented by Christian and Sergey in order to test performance of calculation of Square Roots:

- Different sqrts are tested on two systems: Sandy Bridge and Ivy Bridge

- Sandy Bridge configuration:
        Processor: Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz, 2301 MHz, 2 Core(s), 4 Logical Processor(s)
        OS Name: Microsoft Windows 7 Home Premium ( 64-bit )
        Physical Memory (RAM): 6.00 GB

- Ivy Bridge configuration:
        Processor: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz, 2801 Mhz, 4 Core(s), 8 Logical Processor(s)
        OS Name: Microsoft Windows 7 Professional ( 64-bit )
        Physical Memory (RAM): 16.00 GB

- Both systems support AVX instruction set

- On both systems the same executable was executed ( compiled on Ivy Bridge ) in order to make results as consistent as possible ( Thanks, Christian for that idea! )

- There are 6 tests in total and here are consolidated results for a quick comparison:

Sandy Bridge vs. Ivy Bridge - 32-bit configuration

        1. CRT Sqrt - float - Done in 47 ticks vs. 62 ticks
        2. SSE Sqrt - float - Done in 172 ticks vs. 109 ticks
        3. AVX Sqrt - float - Done in 31 ticks vs. 31 ticks
        4. STL vector: STL sqrt - float - Average: 335 ticks vs. 351 ticks
        5. STL vector: SSE sqrt - float - Average: 89 ticks vs. 54 ticks
        6. STL vector: AVX sqrt - float - Average: 89 ticks vs. 50 ticks

Sandy Bridge vs. Ivy Bridge - 64-bit configuration

        1. CRT Sqrt - float - Done in 47 ticks vs. 47 ticks
        2. SSE Sqrt - float - Done in 187 ticks vs. 109 ticks
        3. AVX Sqrt - float - Done in 16 ticks vs. 31 ticks
        4. STL vector: STL sqrt - float - Average: 339 ticks vs. 351 ticks
        5. STL vector: SSE sqrt - float - Average: 81 ticks vs. 50 ticks
        6. STL vector: AVX sqrt - float - Average: 89 ticks vs. 47 ticks

- Tests 1, 2, 3 demonstrate what a developer could expect when only a couple of sqrts need to be calculated ( 1 value, or 4 values, or 8 values )

- Tests 4, 5, 6 demonstrate what a developer could expect when sqrts of a large vector need to be calculated

- Attached is a zip-file with source codes ( Visual Studio 2008 Professional Edition and Intel C++ compiler XE 13.0.0.089 is set )
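For readers without the attachment, the kind of loop exercised by the "STL vector: SSE sqrt" test can be sketched as follows. This is not the code from the zip-file; `sqrt_inplace` is a hypothetical name, and the preprocessor guard is an assumption that lets the sketch fall back to scalar code where SSE headers are unavailable. It uses the full-precision `_mm_sqrt_ps` intrinsic, not the reciprocal approximation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SQRT_DEMO_HAVE_SSE 1
#endif

// Hypothetical helper: replaces each of the n floats in `data` by its square root.
void sqrt_inplace(float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    std::size_t i = 0;
#ifdef SQRT_DEMO_HAVE_SSE
    // Process four floats per iteration; unaligned loads and stores keep the
    // sketch independent of how the buffer was allocated.
    for (; i + 4 <= n; i += 4) {
        const __m128 v = _mm_loadu_ps(data + i);
        _mm_storeu_ps(data + i, _mm_sqrt_ps(v));
    }
#endif
    // Scalar tail (or the whole array when SSE is not available).
    for (; i < n; ++i) {
        data[i] = std::sqrt(data[i]);
    }
}
```

An AVX variant would follow the same pattern with `__m256`, `_mm256_loadu_ps`, `_mm256_sqrt_ps` and `_mm256_storeu_ps` from `<immintrin.h>`, processing eight floats per iteration; it is omitted here because it needs AVX code generation enabled in the compiler.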
0 Kudos
SergeyKostrov
Valued Contributor II
1,511 Views
I'd like to clarify a couple of things:

- Tests 1, 2 and 3 were executed 33554432 ( 2^25 ) times
- STL vector size: 67108864 float elements ( 64M single-precision floating point values, that is 256MB of data at 4 bytes per value )
- Win32 API function GetTickCount was used to measure time intervals
- 1 sec = 1000 ticks

Please see the codes for more details.
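A note on the measurement itself: GetTickCount is limited to the resolution of the system timer, typically in the range of 10 to 16 milliseconds, so small differences between short runs may not be significant. As a sketch of an alternative, assuming a C++11 compiler (newer than the Visual Studio 2008 toolchain used for the project), `std::chrono::steady_clock` can be used instead. The function name `time_sqrt_us` is hypothetical.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: replaces every element of `data` by its square root
// and returns the elapsed wall-clock time in microseconds.
long long time_sqrt_us(std::vector<float>& data) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] = std::sqrt(data[i]);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

The results are written back into the vector so that the compiler cannot remove the loop as dead code, provided the caller reads them afterwards. As in the tests above, running the measurement several times and averaging helps to smooth out noise.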
0 Kudos
Christian_M_2
Beginner
1,511 Views

Sergey, thanks for describing all the important details!

0 Kudos