I'm looking to optimize this loop:
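framesize /= 2;
for (auto i = 0; i < framesize; ++i) {
    // low nibble comes from the even byte, high nibble from the odd byte
    buf0[i] = static_cast<unsigned char>(((buf0[i * 2] >> 4) & 0x0f) | (buf0[i * 2 + 1] & 0xf0));
}
It reduces an array of bytes to nibbles: the top four bits of each pair of adjacent bytes are packed into one output byte, halving the array.
The assembly seems rather extensive for this seemingly simple operation. Is there a more efficient way of doing this?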
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
The assembly seems rather extensive for this seemingly simple operation. Is there a more efficient way of doing this?
The size of the code may not be an issue; what matters is how fast it runs. Without additional information (like the alignment of the buffer), the compiler will have to generate run-in code if it vectorizes.
If you are interested in the assembler produced by various compilers, you can easily look at it on Godbolt's Compiler Explorer site, for instance https://godbolt.org/g/76zqbK , but really you should be interested in performance, not in the instructions generated.
It's significantly slower than the LZ4 compression library I'm using. I can only use up to SSE4.2 for compatibility with an older CPU in another system.
Alignment has virtually no impact on performance with modern CPUs, but please prove me wrong if you can.
CommanderLake wrote:
Alignment has virtually no impact on performance with modern CPUs, but please prove me wrong if you can.
By "alignment" he meant that when the buffer is not guaranteed to be aligned, and especially when its length is not a multiple of the vector register width, the compiler has to generate (sometimes complicated) pre-roll and post-roll code to handle the unaligned head and the remainder before it can use vector registers and instructions for the loop itself.
If Penryn is your target architecture, try using a separate write buffer and align both buffers to 32 bytes.
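For what it's worth, here is a rough sketch of how the loop could be written by hand with SSE2 intrinsics and a separate write buffer, so the main loop and the post-roll are explicit. The function names `pack_nibbles_scalar` and `pack_nibbles_sse2` and the `src`/`dst`/`out_len` parameters are made up for illustration. It assumes a little-endian x86 target and uses unaligned loads and stores, so it makes no alignment assumption. It has not been benchmarked, so measure before relying on it.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Scalar reference: same expression as the original loop, but writing to dst.
inline void pack_nibbles_scalar(const std::uint8_t* src, std::uint8_t* dst,
                                std::size_t out_len) {
    for (std::size_t i = 0; i < out_len; ++i) {
        dst[i] = static_cast<std::uint8_t>(((src[i * 2] >> 4) & 0x0f) |
                                           (src[i * 2 + 1] & 0xf0));
    }
}

// Hypothetical SSE2 version: reads 2 * out_len bytes from src, writes out_len
// bytes to dst. Each 16-bit lane holds one input pair (even byte in the low
// half, odd byte in the high half).
inline void pack_nibbles_sse2(const std::uint8_t* src, std::uint8_t* dst,
                              std::size_t out_len) {
    const __m128i lo_mask = _mm_set1_epi16(0x000f);
    const __m128i hi_mask = _mm_set1_epi16(0x00f0);
    std::size_t i = 0;
    // Main loop: 32 input bytes -> 16 output bytes per iteration.
    for (; i + 16 <= out_len; i += 16) {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 2 * i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 2 * i + 16));
        // (w >> 4) & 0x0f is the high nibble of the even byte;
        // (w >> 8) & 0xf0 is the high nibble of the odd byte, still in place.
        __m128i ra = _mm_or_si128(_mm_and_si128(_mm_srli_epi16(a, 4), lo_mask),
                                  _mm_and_si128(_mm_srli_epi16(a, 8), hi_mask));
        __m128i rb = _mm_or_si128(_mm_and_si128(_mm_srli_epi16(b, 4), lo_mask),
                                  _mm_and_si128(_mm_srli_epi16(b, 8), hi_mask));
        // Every lane is now in 0..255, so the saturating pack is lossless.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i),
                         _mm_packus_epi16(ra, rb));
    }
    // Post-roll: the remaining (out_len % 16) outputs are handled by scalar code.
    pack_nibbles_scalar(src + 2 * i, dst + i, out_len - i);
}
```

A call would look like `pack_nibbles_sse2(in, out, framesize / 2)`. Only SSE2 is needed, so it stays within the SSE4.2 limit mentioned above.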