Will Intel Intrinsics really help here?

Michael_L_ — Wed, 30 Apr 2014 20:57:25 GMT

I have a short 64-bit comparison in the loop below. Basically it is an AND accumulation followed by a find-first-set-bit. Typically the outer loop runs for ~400 iterations, while the inner loop runs 12 (_num_tables). If I replace the 64-bit operation with the intrinsics for 128-bit operations (and halve the outer loop iterations to ~200), performance drops by about 35% compared to the 64-bit case. This is all on the latest hardware, compiled with -O3 etc.

No single line appears to be the offender performance-wise in the intrinsics version. I'm curious: is there anything stupid I'm doing in the 128-bit version that jumps out as an obvious performance no-no?

Thanks for any advice!

        /* for each chunk of rules, i.e. 64 at a time */
        unsigned int end = conf->_num_chunks * 2;
        for (j = 0; j < end; ++j,++j) {
                long int rule_match = 0xFFFFFFFFFFFFFFFF;
                /* For each table */
                for (i = 0; i < conf->_num_tables; ++i) {
                        rule_match &= *((long int*)(conf->_match_table[ packet ] + j));
                        if (!rule_match)
                                goto next; /* don't need to proceed, no match */
                }
                return ffsl(rule_match);
        next: ;
        }
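
For anyone who wants to experiment with the scalar loop in isolation, here is a minimal self-contained sketch of the same idea (AND accumulation with early exit, then find-first-set). The data layout (`rows[t]` pointing at the bitmap row already selected for table `t`, one 64-bit word per chunk) and the name `first_match64` are assumptions made for illustration and are not the poster's actual structures. It uses `uint64_t` so the width is fixed, and the GCC/Clang builtin `__builtin_ffsll` in place of `ffsl`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: rows[t] is the bitmap row selected for table t,
 * holding one 64-bit word per chunk of 64 rules.
 * Returns the 0-based index of the first matching rule, or -1 if none. */
static int first_match64(const uint64_t *const *rows, size_t num_tables,
                         size_t num_chunks)
{
    assert(num_tables > 0);
    for (size_t c = 0; c < num_chunks; ++c) {
        uint64_t m = UINT64_MAX;
        /* AND the chunk across all tables; stop as soon as nothing is left */
        for (size_t t = 0; t < num_tables && m != 0; ++t)
            m &= rows[t][c];
        if (m != 0) {
            /* __builtin_ffsll is 1-based, so subtract 1 for a 0-based index */
            return (int)(c * 64) + __builtin_ffsll((long long)m) - 1;
        }
    }
    return -1;
}
```

Unlike the loop above, this sketch returns a rule index that includes the chunk offset. That is a choice made for the example, not something stated in the post.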

The 128-bit intrinsics version is below:

        /* for each chunk of rules, i.e. 128 at a time */
        for (j = 0; j < conf->_num_chunks2; ++j) {
                /* initial 128 bit wide value */
                __m128i rule_match_128 = max;
                unsigned short jump = j * 4;
                /* For each table */
                for (i = 0; i < conf->_num_tables; ++i) {
                        uint8_t seg = packet;
                        /* copy 128 bit index into comparison */
                        __m128i *match_128 = (__m128i*)(conf->_match_table[seg] + jump);
                        /* perform &= on 128 bit wide comparison */
                        rule_match_128 = _mm_and_si128(rule_match_128, *match_128);
                        if (_mm_movemask_epi8(_mm_cmpeq_epi32(rule_match_128, zero)) == 65535)
                                goto next;
                }

                /* Only returning first match for now */
                for (i = 0; i < 128; ++i) {
                        __m128i cmp = _mm_and_si128(rule_match_128, lut);
                        if (_mm_movemask_epi8(_mm_cmpeq_epi32(cmp,zero)) != 65535) {
                                return i;
                        }
                }
        next: ;
        }
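
One variation that may be worth measuring is to keep the 128-bit AND but find the first set bit by storing the result as two 64-bit halves and running find-first-set on them, rather than scanning bit by bit. The sketch below uses only SSE2 intrinsics from `<emmintrin.h>`. The layout (`rows[t]` holding two 64-bit words per 128-rule chunk) and the name `first_match128` are assumptions for illustration, and `__builtin_ffsll` is a GCC/Clang builtin. Whether this recovers the lost performance is not something the sketch can show; it would have to be profiled against the 64-bit loop.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: rows[t] is the bitmap row selected for table t,
 * holding two 64-bit words per chunk of 128 rules.
 * Returns the 0-based index of the first matching rule, or -1 if none. */
static int first_match128(const uint64_t *const *rows, size_t num_tables,
                          size_t num_chunks)
{
    const __m128i zero = _mm_setzero_si128();
    assert(num_tables > 0);
    for (size_t c = 0; c < num_chunks; ++c) {
        __m128i m = _mm_set1_epi32(-1); /* all 128 bits set */
        int empty = 0;
        for (size_t t = 0; t < num_tables && !empty; ++t) {
            /* unaligned load, so the rows need no special alignment */
            __m128i v = _mm_loadu_si128((const __m128i *)(rows[t] + 2 * c));
            m = _mm_and_si128(m, v);
            /* all 16 bytes equal to zero <=> the whole register is zero */
            empty = _mm_movemask_epi8(_mm_cmpeq_epi8(m, zero)) == 0xFFFF;
        }
        if (!empty) {
            uint64_t half[2];
            int base = (int)(c * 128);
            _mm_storeu_si128((__m128i *)half, m);
            /* the low half holds rules 0..63, the high half rules 64..127 */
            if (half[0] != 0)
                return base + __builtin_ffsll((long long)half[0]) - 1;
            return base + 64 + __builtin_ffsll((long long)half[1]) - 1;
        }
    }
    return -1;
}
```

The zero test is kept in the inner loop to mirror the early exit in the original code. Since each test adds a compare and a movemask per table, it may also be worth timing a version that tests less often.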

Bernard — Thu, 01 May 2014 17:51:45 GMT

Can you perform VTune analysis of both test cases and post the screenshots?

