Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Cache and AVX

GGil
Beginner
760 Views

Hi,

I wrote a piece of code that searches a file for one- or two-character matches using AVX. Source code is here: https://github.com/gilshm/avx-search/blob/master/avx-search.c

Three main things I used: _mm256_cmpeq_epi8, _mm256_movemask_epi8 and _tzcnt_u64.
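To illustrate how these three pieces fit together, here is a minimal sketch of a single-byte search. It is not the code from the repository above: the names find_byte, find_byte_avx2 and find_byte_scalar are made up for this example, it uses _tzcnt_u32 on the 32-bit mask instead of _tzcnt_u64, and it assumes GCC or Clang on x86-64 (for the target attribute and __builtin_cpu_supports). Note that _mm256_cmpeq_epi8 and _mm256_movemask_epi8 are AVX2 instructions and the tzcnt intrinsics belong to BMI1, so the sketch falls back to a scalar loop when the CPU lacks them.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar loop: used as a fallback and for the tail shorter than 32 bytes. */
static ptrdiff_t find_byte_scalar(const uint8_t *buf, size_t len,
                                  uint8_t needle, size_t start)
{
    for (size_t i = start; i < len; i++)
        if (buf[i] == needle)
            return (ptrdiff_t)i;
    return -1;
}

/* The target attribute (an assumption: GCC/Clang) lets this function use
   AVX2/BMI intrinsics even when the file is built without -mavx2 -mbmi. */
__attribute__((target("avx2,bmi")))
static ptrdiff_t find_byte_avx2(const uint8_t *buf, size_t len, uint8_t needle)
{
    const __m256i pattern = _mm256_set1_epi8((char)needle);
    size_t i = 0;

    for (; i + 32 <= len; i += 32) {
        /* Unaligned load of 32 bytes from the buffer. */
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(buf + i));
        /* Each matching byte becomes 0xFF, every other byte 0x00. */
        __m256i eq = _mm256_cmpeq_epi8(chunk, pattern);
        /* Collect the top bit of every byte into a 32-bit mask. */
        uint32_t mask = (uint32_t)_mm256_movemask_epi8(eq);
        /* The lowest set bit is the offset of the first match in the chunk. */
        if (mask != 0)
            return (ptrdiff_t)(i + _tzcnt_u32(mask));
    }
    return find_byte_scalar(buf, len, needle, i);
}

/* Returns the index of the first occurrence of needle, or -1 if absent. */
ptrdiff_t find_byte(const uint8_t *buf, size_t len, uint8_t needle)
{
    assert(buf != NULL || len == 0);
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("bmi"))
        return find_byte_avx2(buf, len, needle);
    return find_byte_scalar(buf, len, needle, 0);
}
```

In this sketch the trailing-zero count only runs when a chunk actually contains a match, so on data with few matches the loop is dominated by the load, compare and movemask.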

When I flush the page cache and run the program, I get a great bandwidth of ~1 GB/sec from the SSD (Intel's P3700). But on the second run, when the data is hot in the cache, there is no major performance improvement.

I compared my performance to GNU Grep. When I run it "cold", its performance is half of mine (~500 MB/sec SSD bandwidth), but when the caches are "hot", Grep performs better than my application.

I do not understand why. Does fetching data from the caches into the AVX registers take more time than fetching data into the regular 64-bit registers? I can't think of anything else, because the data is in the cache.

Update #1 - I've noticed that _tzcnt_u64 takes plenty of time; that might be the issue.

Update #2 - I've now managed to improve my code (reduced the total number of instructions), and both Grep and my tool perform pretty much the same when the data is hot. To be honest, I can't explain it: I expected better performance from my AVX implementation both when the data is only on the disk and when it is in the page cache. Any advice?

3 Replies
McCalpinJohn
Honored Contributor III

Have you looked at the binary code generated for the relevant loop in the GNU grep utility? A lot of people have put a lot of thought into the optimization of these codes, so it may already be very effective.

Another issue: Recent processors are very fast compared to their memory subsystems, so once the code is fast enough to be limited by memory access rather than by instruction execution, further "improvements" in the instruction count may make no visible difference. The break-even point will be different for data in L1, L2, L3, memory, SSD, or spinning disk.

Travis_D_
New Contributor II

You should post the assembly of your routine to have any hope of getting specific tips on optimization.

GGil
Beginner

My comparison was not fair. I used a bigger buffer in my code, so my I/O accesses were less frequent and more efficient. That's why I got better performance when things were cold and the same performance when things were hot. Grep is indeed well optimized.
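As a rough illustration of the buffer-size effect, the sketch below reads a file with a caller-chosen buffer size and counts how many read() calls returned data. It is not the code from my repository: the names read_stats and read_with_buffer are hypothetical, and it assumes a POSIX system. A larger buffer means fewer system calls and larger requests to the storage device for the same number of bytes.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical result type: total bytes read and number of read() calls
   that returned data. bytes is -1 if the file could not be read. */
typedef struct {
    long long bytes;
    long calls;
} read_stats;

/* Reads the whole file at path using a buffer of buf_size bytes. */
read_stats read_with_buffer(const char *path, size_t buf_size)
{
    read_stats st = {0, 0};
    assert(path != NULL && buf_size > 0);

    char *buf = malloc(buf_size);
    int fd = open(path, O_RDONLY);
    if (buf == NULL || fd < 0) {
        st.bytes = -1;
    } else {
        ssize_t n;
        /* Each iteration is one system call; a search would scan buf here. */
        while ((n = read(fd, buf, buf_size)) > 0) {
            st.bytes += n;
            st.calls++;
        }
        if (n < 0)
            st.bytes = -1; /* read error */
    }

    if (fd >= 0)
        close(fd);
    free(buf);
    return st;
}
```

Calling read_with_buffer on the same file with, say, a 4 KiB and a 1 MiB buffer should report the same byte count but far fewer calls for the larger buffer, which is the difference that made my cold-cache comparison against Grep unfair.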

Thank you for your replies.

 
