Hi,
I wrote a piece of code that searches a file for one- or two-character matches using AVX. The source code is here: https://github.com/gilshm/avx-search/blob/master/avx-search.c
The three main intrinsics I used are _mm256_cmpeq_epi8, _mm256_movemask_epi8 and _tzcnt_u64.
When I flush the page cache and run the program, I get a great bandwidth of ~1GB/sec from the SSD (Intel's P3700). But on the second run, when the data is hot in the cache, there is no major performance improvement.
I compared my performance to GNU Grep. When I run Grep "cold", its performance is half of mine (~500MB/sec SSD BW), but when the caches are "hot", Grep performs better than my application.
I do not understand why. Does fetching data from the caches into the AVX registers take more time than fetching data into the regular 64-bit registers? I can't think of anything else, because the data is in the cache.
Update #1 - I've noticed that _tzcnt_u64 takes plenty of time; that might be the issue.
Update #2 - I've now managed to improve my code (reduced the total number of instructions), and both Grep and my tool perform pretty much the same when the data is hot. To be honest, I can't explain it. I expected better performance from my AVX implementation both when the data is only on the disk and when it is in the page cache. Any advice?
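For readers following along, here is a minimal sketch of the compare / movemask / trailing-zero-count pattern described above. It is not the code from the linked repository. It handles only a single-byte needle, uses _tzcnt_u32 instead of _tzcnt_u64 because the mask is 32 bits wide, and assumes GCC or Clang on x86-64. The function names (find_matches, scan_avx2, scan_scalar) are hypothetical.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar scan from `start`; also handles the tail that does not fill 32 bytes. */
static size_t scan_scalar(const uint8_t *buf, size_t start, size_t len,
                          uint8_t needle, size_t *out, size_t n, size_t max_out)
{
    for (size_t i = start; i < len && n < max_out; i++)
        if (buf[i] == needle)
            out[n++] = i;
    return n;
}

/* AVX2 scan: compare 32 bytes at once, turn the result into a bit mask,
   then walk the set bits with a trailing-zero count. */
__attribute__((target("avx2,bmi")))
static size_t scan_avx2(const uint8_t *buf, size_t len, uint8_t needle,
                        size_t *out, size_t max_out)
{
    const __m256i pattern = _mm256_set1_epi8((char)needle);
    size_t n = 0, i = 0;

    for (; i + 32 <= len && n < max_out; i += 32) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(buf + i));
        uint32_t mask =
            (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, pattern));
        while (mask != 0 && n < max_out) {
            out[n++] = i + _tzcnt_u32(mask); /* offset of lowest set bit */
            mask &= mask - 1;                /* clear that bit */
        }
    }
    return scan_scalar(buf, i, len, needle, out, n, max_out);
}

/* Stores up to max_out match offsets in `out` and returns how many were stored.
   Falls back to the scalar loop when the CPU lacks AVX2 or BMI1. */
size_t find_matches(const uint8_t *buf, size_t len, uint8_t needle,
                    size_t *out, size_t max_out)
{
    assert(buf != NULL || len == 0);
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("bmi"))
        return scan_avx2(buf, len, needle, out, max_out);
    return scan_scalar(buf, 0, len, needle, out, 0, max_out);
}
```

Chunks with no match cost one compare, one movemask and one branch, so the trailing-zero count only runs when the mask is non-zero.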
Have you looked at the binary code generated for the relevant loop in the GNU grep utility? A lot of people have put a lot of thought into the optimization of these codes, so it may already be very effective.
Another issue: Recent processors are very fast compared to their memory subsystems, so once the code is fast enough to be limited by memory access rather than by instruction execution, then further "improvements" in the instruction count may make no visible difference. The break-even point will be different for data in L1, L2, L3, memory, SSD, or spinning disk.
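The break-even argument can be put as a rough back-of-the-envelope model. The sketch below assumes that computation and data delivery overlap perfectly, so the scan takes as long as the slower of the two. The function name and any rates passed to it are illustrative assumptions, not measurements.

```c
/* Rough model: the time to scan `bytes` is set by whichever is slower,
   the instruction stream or the source delivering the data.
   Assumes perfect overlap between the two; all rates are in bytes/second. */
double scan_seconds(double bytes, double compute_bytes_per_s,
                    double source_bytes_per_s)
{
    double compute_time = bytes / compute_bytes_per_s;
    double transfer_time = bytes / source_bytes_per_s;
    return compute_time > transfer_time ? compute_time : transfer_time;
}
```

For example, with a hypothetical source that delivers 1 GB/s, raising the compute rate from 4 GB/s to 8 GB/s leaves the estimate for 1 GB unchanged at one second. The faster loop only pays off once the source is quicker than the loop.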
You should post the assembly of your routine to have any hope of getting specific tips on optimization.
My comparison was not fair. I used a bigger buffer in my code, so my I/O accesses were less frequent and more efficient. That's why I got better performance when things were cold and the same performance when things were hot. Grep is indeed well optimized.
Thank you for your replies.
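To illustrate the buffer-size effect, here is a minimal sketch that counts how many read() calls are needed to consume the same file with different buffer sizes. It assumes a POSIX system. The helper names (make_temp_file, count_reads) and the buffer sizes in the usage note are hypothetical and are not taken from either tool.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: create an anonymous temp file holding `size` bytes. */
FILE *make_temp_file(size_t size)
{
    FILE *f = tmpfile();
    if (f == NULL)
        return NULL;
    for (size_t i = 0; i < size; i++) {
        if (fputc('a', f) == EOF) {
            fclose(f);
            return NULL;
        }
    }
    if (fflush(f) != 0) {
        fclose(f);
        return NULL;
    }
    return f;
}

/* Reads the whole file behind `fd` from offset 0 in buf_size-byte requests.
   Returns the number of read() calls that returned data (or -1 on error)
   and stores the total number of bytes read in *total. */
long count_reads(int fd, size_t buf_size, size_t *total)
{
    char *buf = malloc(buf_size);
    long calls = 0;
    ssize_t got;

    if (buf == NULL)
        return -1;
    if (lseek(fd, 0, SEEK_SET) < 0) {
        free(buf);
        return -1;
    }
    *total = 0;
    while ((got = read(fd, buf, buf_size)) > 0) {
        calls++;
        *total += (size_t)got;
    }
    free(buf);
    return got < 0 ? -1 : calls;
}
```

Calling count_reads(fileno(f), 4096, &total) and then count_reads(fileno(f), 65536, &total) on the same file reads the same number of bytes, but the larger buffer needs far fewer system calls.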