I have a short 64-bit comparison in the loop below: basically a bitwise AND accumulation followed by a find-first-set-bit. Typically the outer loop runs for ~400 iterations and the inner loop for 12 (_num_tables). If I replace the 64-bit operation with the intrinsics for 128-bit operations (and halve the outer loop iteration count to ~200), performance drops by about 35% compared to the 64-bit case. This is all on the latest hardware, compiled with -O3 etc.
No single line appears to be the offender performance-wise in the intrinsic version. I'm curious: is there anything stupid I'm doing in the 128-bit version that jumps out as an obvious performance no-no?
Thanks for any advice!
/* for each chunk of rules, i.e. 64 at a time */
unsigned int end = conf->_num_chunks * 2;
for (j = 0; j < end; ++j,++j) {
    long int rule_match = 0xFFFFFFFFFFFFFFFF;
    /* For each table */
    for (i = 0; i < conf->_num_tables; ++i) {
        rule_match &= *((long int*)(conf->_match_table[ packet ] + j));
        if (!rule_match)
            goto next; /* don't need to proceed, no match */
    }
    return ffsl(rule_match);
next: ;
}
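For anyone who wants to reproduce the 64-bit pattern in isolation, here is a minimal self-contained sketch of the same idea: AND the per-table words for one chunk, stop early once the accumulator is zero, then take the lowest set bit. The table layout (`rows[t][chunk]`), the function names and the sample data are assumptions for illustration, not the original `conf` structure. It uses the GCC/Clang builtin `__builtin_ctzll` and returns a 0-based rule index across all chunks, whereas `ffsl` returns a 1-based position within a single word.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: rows[t] is the bitmap row already selected for
   table t; each row holds num_chunks 64-bit words, one bit per rule. */
static long first_matching_rule(const uint64_t *const *rows,
                                size_t num_tables, size_t num_chunks)
{
    for (size_t j = 0; j < num_chunks; ++j) {
        uint64_t rule_match = UINT64_MAX;
        /* AND across tables; the loop condition ends early on zero */
        for (size_t i = 0; i < num_tables && rule_match; ++i)
            rule_match &= rows[i][j];
        if (rule_match)
            return (long)(j * 64) + __builtin_ctzll(rule_match);
    }
    return -1; /* no rule matches in any chunk */
}

/* Made-up data: two tables, two chunks each. */
static long example_lookup(void)
{
    static const uint64_t t0[2] = { 0x0, 0x6 };
    static const uint64_t t1[2] = { 0xF, 0x2 };
    const uint64_t *rows[2] = { t0, t1 };
    return first_matching_rule(rows, 2, 2);
}
```

In the sample data chunk 0 ANDs to zero and is skipped; chunk 1 ANDs to 0x2, so the first match is bit 1 of the second word.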
128 bit intrinsic version below:
/* for each chunk of rules, i.e. 128 at a time */
for (j = 0; j < conf->_num_chunks2; ++j) {
    /* initial 128 bit wide value */
    __m128i rule_match_128 = max;
    unsigned short jump = j * 4;
    /* For each table */
    for (i = 0; i < conf->_num_tables; ++i) {
        uint8_t seg = packet;
        /* copy 128 bit index into comparison */
        __m128i *match_128 = (__m128i*)(conf->_match_table[seg] + jump);
        /* perform &= on 128 bit wide comparison */
        rule_match_128 = _mm_and_si128(rule_match_128, *match_128);
        if (_mm_movemask_epi8(_mm_cmpeq_epi32(rule_match_128, zero)) == 65535)
            goto next;
    }
    /* Only returning first match for now */
    for (i = 0; i < 128; ++i) {
        __m128i cmp = _mm_and_si128(rule_match_128, lut);
        if (_mm_movemask_epi8(_mm_cmpeq_epi32(cmp,zero)) != 65535) {
            return i;
        }
    }
next: ;
}
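One difference between the two versions that may be worth measuring: the 64-bit code finds the first match with a single `ffsl`, while the 128-bit code walks up to 128 iterations of AND, compare and movemask. A possible alternative, sketched below under the assumption of SSE2 and a GCC/Clang compiler, is to store the vector to two 64-bit halves and count trailing zeros on whichever half is non-zero. The helper names `is_zero_128` and `first_set_bit_128` are hypothetical, and whether this recovers the 35% has not been measured here.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Returns 1 if all 128 bits of v are zero, else 0. */
static int is_zero_128(__m128i v)
{
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xFFFF;
}

/* 0-based index of the lowest set bit in v, or -1 if v is zero.
   Note that ffsl() is 1-based, so add 1 to match its convention. */
static int first_set_bit_128(__m128i v)
{
    uint64_t half[2];
    _mm_storeu_si128((__m128i *)half, v); /* half[0] = bits 0..63 */
    if (half[0])
        return __builtin_ctzll(half[0]);
    if (half[1])
        return 64 + __builtin_ctzll(half[1]);
    return -1;
}
```

With something like this the final `for (i = 0; i < 128; ++i)` loop could collapse to a single call, and the early-exit test in the inner loop could call `is_zero_128`.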
Can you perform VTune analysis of both test cases and post the screenshots?