I have a short 64-bit comparison in the loop below: essentially an AND accumulation followed by a find-first-set-bit. Typically the outer loop runs for ~400 iterations and the inner loop for 12 (_num_tables). If I replace the 64-bit operations with the intrinsics for 128-bit operations (which halves the outer loop to ~200 iterations), performance drops by about 35% compared to the 64-bit case. This is all on the latest hardware, compiled with -O3, etc.
No single line appears to be the offender performance-wise in the intrinsic version. Is there anything stupid I'm doing in the 128-bit version that jumps out as an obvious performance no-no?
Thanks for any advice!
/* for each chunk of rules, i.e. 64 at a time */
unsigned int end = conf->_num_chunks * 2;
for (j = 0; j < end; j += 2) {
    long int rule_match = 0xFFFFFFFFFFFFFFFF;
    /* For each table */
    for (i = 0; i < conf->_num_tables; ++i) {
        rule_match &= *((long int*)(conf->_match_table[i][ packet[i] ] + j));
        if (!rule_match)
            goto next; /* don't need to proceed, no match */
    }
    return ffsl(rule_match);
next: ;
}
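For anyone who wants to experiment with the 64-bit variant in isolation, here is a minimal self-contained sketch of the same idea. The helper name `match_chunk64` and the `rows` layout (one pointer per table to its already-selected row of 64-bit chunks) are assumptions made for illustration, not the real structure behind `conf`, and `__builtin_ffsll` is the GCC/Clang builtin standing in for `ffsl`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original code): AND together one 64-bit
 * chunk from each table row and return the 1-based index of the lowest set
 * bit, or 0 if no rule in this chunk matches. */
static int match_chunk64(const uint64_t *const *rows, size_t num_tables,
                         size_t chunk)
{
    uint64_t rule_match = UINT64_MAX;
    for (size_t i = 0; i < num_tables; ++i) {
        rule_match &= rows[i][chunk];
        if (rule_match == 0)
            return 0;   /* early exit: no rule can match any more */
    }
    /* GCC/Clang builtin; same 1-based convention as ffsl() */
    return __builtin_ffsll((long long)rule_match);
}
```

With two rows holding 0xF0 and 0x30 in chunk 0, the accumulated value is 0x30 and the function returns 5, i.e. bit 4 counted from 1.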
128-bit intrinsic version below:
/* for each chunk of rules, i.e. 128 at a time */
for (j = 0; j < conf->_num_chunks2; ++j) {
    /* initial 128 bit wide value */
    __m128i rule_match_128 = max;
    unsigned short jump = j * 4;
    /* For each table */
    for (i = 0; i < conf->_num_tables; ++i) {
        uint8_t seg = packet[i];
        /* point at the 128 bit entry for this table and packet byte */
        __m128i *match_128 = (__m128i*)(conf->_match_table[i][seg] + jump);
        /* perform &= on 128 bit wide comparison */
        rule_match_128 = _mm_and_si128(rule_match_128, *match_128);
        /* all 16 bytes equal to zero means no rule can match any more */
        if (_mm_movemask_epi8(_mm_cmpeq_epi32(rule_match_128, zero)) == 0xFFFF)
            goto next;
    }
    /* Only returning first match for now: test one bit mask per iteration */
    for (i = 0; i < 128; ++i) {
        __m128i cmp = _mm_and_si128(rule_match_128, lut[i]);
        if (_mm_movemask_epi8(_mm_cmpeq_epi32(cmp, zero)) != 0xFFFF) {
            return i;
        }
    }
next: ;
}
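One part of the 128-bit version that has no counterpart in the 64-bit code is the 128-iteration scan for the first set bit. As a rough sketch of an alternative worth measuring, the register can be split into two 64-bit halves and each half scanned with a single bit-scan. The helper names below are hypothetical, `__builtin_ctzll` is a GCC/Clang builtin, and the bit numbering (bit 0 is the least significant bit of the low half) is an assumption that may not match the ordering encoded in `lut`.

```c
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Hypothetical helper: 0-based index of the lowest set bit in a 128-bit
 * value given as two 64-bit halves, or -1 if the value is zero. */
static int first_set_bit_128(uint64_t lo, uint64_t hi)
{
    if (lo != 0)
        return __builtin_ctzll(lo);
    if (hi != 0)
        return 64 + __builtin_ctzll(hi);
    return -1;
}

#ifdef __SSE2__
/* Same query on an SSE register: store it to memory, then scan the halves.
 * On x86 (little-endian) half[0] holds the low 64 bits. */
static int first_set_bit_m128i(__m128i v)
{
    uint64_t half[2];
    _mm_storeu_si128((__m128i *)half, v);
    return first_set_bit_128(half[0], half[1]);
}
#endif
```

Whether this actually recovers the lost 35% would have to be measured; it only removes the per-bit loop and says nothing about the cost of the zero test inside the table loop.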
Can you perform VTune analysis of both test cases and post the screenshots?