Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring

Will Intel Intrinsics really help here?

Michael_L_
Beginner

 

I have a short 64-bit comparison in the loop below: basically a bitwise AND accumulation followed by a find-first-set-bit. Typically the outer loop runs for ~400 iterations and the inner loop for 12 (_num_tables). If I replace the 64-bit operation with the intrinsics for 128-bit operations (and halve the outer loop iteration count to ~200), performance drops by about 35% compared to the 64-bit case. This is all on the latest hardware, compiled with -O3 etc.

No single line appears to be the offender performance-wise in the intrinsic version. Is there anything stupid I'm doing in the 128-bit version that jumps out as an obvious performance no-no?

Thanks for any advice!

 

        /* for each chunk of rules, i.e. 64 at a time */
        unsigned int end = conf->_num_chunks * 2;
        for (j = 0; j < end; j += 2) {
                /* start with all 64 rule bits set */
                long int rule_match = ~0L;
                /* For each table */
                for (i = 0; i < conf->_num_tables; ++i) {
                        /* packet[i] assumed: key byte for table i */
                        rule_match &= *((long int*)(conf->_match_table[ packet[i] ] + j));
                        if (!rule_match)
                                goto next; /* don't need to proceed, no match */
                }
                /* 1-based position of the first matching rule in this chunk */
                return ffsl(rule_match);
        next: ;
        }
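
For anyone who wants to time the 64-bit case in isolation, here is a minimal self-contained sketch of the same AND-accumulate plus first-set-bit idea. It is not the code from my project: `first_match64` and the `rows` layout (one already-selected row of 64-bit words per table) are hypothetical, `__builtin_ffsll` assumes GCC or Clang, and adding the chunk offset to the result is my assumption about what a caller would want.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout: rows[t] is the bitmap row already selected for
 * table t, holding num_chunks 64-bit words (64 rules per word).
 * Returns the 1-based index of the first rule that matches in every
 * table, or 0 when no rule matches.
 */
int first_match64(const uint64_t *const rows[], size_t num_tables,
                  size_t num_chunks)
{
    for (size_t j = 0; j < num_chunks; ++j) {
        uint64_t m = UINT64_MAX;            /* all 64 rules still alive */
        /* AND in each table's word; stop early once nothing is left */
        for (size_t i = 0; i < num_tables && m != 0; ++i)
            m &= rows[i][j];
        if (m != 0)                         /* lowest surviving rule bit */
            return (int)(j * 64) + __builtin_ffsll((long long)m);
    }
    return 0;
}
```

The early exit is folded into the inner loop condition instead of a `goto`, which should compile to much the same branch.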

 

128 bit intrinsic version below:

        /* for each chunk of rules, i.e. 128 at a time */
        for (j = 0; j < conf->_num_chunks2; ++j) {
                /* initial 128 bit wide value, all bits set */
                __m128i rule_match_128 = max;
                unsigned short jump = j * 4;
                /* For each table */
                for (i = 0; i < conf->_num_tables; ++i) {
                        /* packet[i] assumed: key byte for table i */
                        uint8_t seg = packet[i];
                        /* point at the 128 bit table entry for this chunk */
                        __m128i *match_128 = (__m128i*)(conf->_match_table[seg] + jump);
                        /* perform &= on 128 bit wide comparison */
                        rule_match_128 = _mm_and_si128(rule_match_128, *match_128);
                        /* all 16 mask bits set means every lane is zero */
                        if (_mm_movemask_epi8(_mm_cmpeq_epi32(rule_match_128, zero)) == 0xFFFF)
                                goto next;
                }

                /* Only returning first match for now */
                for (i = 0; i < 128; ++i) {
                        /* lut[i] assumed: mask with only bit i set */
                        __m128i cmp = _mm_and_si128(rule_match_128, lut[i]);
                        if (_mm_movemask_epi8(_mm_cmpeq_epi32(cmp, zero)) != 0xFFFF) {
                                return i;
                        }
                }
        next: ;
        }
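
One candidate for the slowdown is the final loop, which can issue up to 128 AND/compare/movemask rounds to locate a single bit. A hedged sketch of an alternative is below: pull the two 64-bit halves out of the register and use a scalar find-first-set. The names `is_zero128` and `ffs128` are hypothetical, the code assumes x86-64 with SSE2 and GCC or Clang (`__builtin_ffsll`), and it assumes bit 0 is the least significant bit of the low 64-bit half, which may not match how my lookup table numbers the rules.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Nonzero when all 128 bits of v are zero. */
int is_zero128(__m128i v)
{
    /* compare every byte with zero; 16 mask bits set means all zero */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128())) == 0xFFFF;
}

/* 1-based index of the lowest set bit of v, or 0 when v is zero. */
int ffs128(__m128i v)
{
    /* low 64 bits directly, high 64 bits after moving them down */
    uint64_t lo = (uint64_t)_mm_cvtsi128_si64(v);
    uint64_t hi = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v));

    if (lo != 0)
        return __builtin_ffsll((long long)lo);
    if (hi != 0)
        return 64 + __builtin_ffsll((long long)hi);
    return 0;
}
```

Where SSE4.1 is available, `_mm_testz_si128(v, v)` could replace the compare-plus-movemask zero test, though I have not measured whether that matters here.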

 

 

Bernard
Black Belt

 

Can you run a VTune analysis of both test cases and post the screenshots?
