DLake1 · ‎11-07-2017

I'm looking to optimize this loop:

framesize /= 2;
for (auto i = 0; i < framesize; ++i){
	buf0[i] = static_cast<unsigned char>(buf0[i * 2] >> 4 & 0x0f | buf0[i * 2 + 1] & 0xf0);
}

It reduces an array of bytes to nibbles.
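To make the packing concrete, here is a minimal, self-contained sketch of the same loop. The function name pack_high_nibbles and the even-length assumption are illustrative only and not part of the original code. Each output byte takes the high nibble of the even input byte in its low four bits and the high nibble of the odd input byte in its high four bits.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (name chosen for illustration): packs the high nibble
// of every pair of input bytes into one output byte, in place, and returns
// the packed length.
std::size_t pack_high_nibbles(unsigned char* buf, std::size_t framesize) {
    assert(framesize % 2 == 0);  // assumption: the frame has an even byte count
    const std::size_t out = framesize / 2;
    for (std::size_t i = 0; i < out; ++i) {
        // Even byte: its high nibble moves down to bits 0..3.
        // Odd byte: its high nibble stays in bits 4..7.
        buf[i] = static_cast<unsigned char>(((buf[i * 2] >> 4) & 0x0f) |
                                            (buf[i * 2 + 1] & 0xf0));
    }
    return out;
}
```

For example, the input bytes 0x12, 0x34 pack to 0x31, and 0xAB, 0xCD pack to 0xCA. Writing in place is safe here because the write index i never runs ahead of the read indices 2i and 2i + 1.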

The assembly seems rather extensive for this seemingly simple operation, is there a more efficient way of doing this?

James_C_Intel2 · ‎11-07-2017

The assembly seems rather extensive for this seemingly simple operation, is there a more efficient way of doing this?

The size of the code may not be an issue; what matters is how fast it runs. Without additional information (such as the alignment of the buffer), the compiler has to generate run-in code if it vectorizes the loop.

If you are interested in the assembler produced by various compilers, you can easily look at it on Godbolt's Compiler Explorer site, for instance https://godbolt.org/g/76zqbK, but really you should be interested in performance, not the instructions generated.

DLake1 · ‎11-07-2017

It's significantly slower than the LZ4 compression library I'm using. I can only use up to SSE 4.2, for compatibility with an older CPU in another system.

Alignment has virtually no impact on performance with modern CPUs, but please prove me wrong if you can.

levicki · ‎02-13-2018

CommanderLake wrote:
Alignment has virtually no impact on performance with modern CPUs, but please prove me wrong if you can.

What he meant by "alignment" is that when a buffer is not guaranteed to be aligned, and especially when its length is not an exact multiple of the vector register width, the compiler has to generate (sometimes complicated) pre/post-roll code to handle the leftover elements before it can use vector registers and instructions for the loop itself.
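As a rough illustration of that structure, here is a hand-written sketch using SSE2 intrinsics (well within the SSE 4.2 limit mentioned above). It is only one possible approach, not what any particular compiler emits: the function name is hypothetical, it assumes an even input length, and it uses unaligned loads so that no pre-roll is needed and only a scalar post-roll handles the remainder.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: packs n input bytes (n even) into n / 2 output bytes.
void pack_high_nibbles_sse2(const unsigned char* src, unsigned char* dst,
                            std::size_t n) {
    assert(n % 2 == 0);  // assumption: even byte count
    const std::size_t out = n / 2;
    const __m128i low_mask = _mm_set1_epi16(0x000f);
    const __m128i high_mask = _mm_set1_epi16(0x00f0);
    std::size_t i = 0;

    // Main loop: 32 input bytes become 16 output bytes per iteration.
    for (; i + 16 <= out; i += 16) {
        const __m128i a = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(src + i * 2));
        const __m128i b = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(src + i * 2 + 16));
        // Each 16-bit lane holds (odd byte << 8) | even byte.
        // Shift by 4 to drop the even byte's high nibble into bits 0..3,
        // shift by 8 to bring the odd byte's high nibble into bits 4..7.
        const __m128i ra =
            _mm_or_si128(_mm_and_si128(_mm_srli_epi16(a, 4), low_mask),
                         _mm_and_si128(_mm_srli_epi16(a, 8), high_mask));
        const __m128i rb =
            _mm_or_si128(_mm_and_si128(_mm_srli_epi16(b, 4), low_mask),
                         _mm_and_si128(_mm_srli_epi16(b, 8), high_mask));
        // Every lane is now in 0..255, so the saturating pack is lossless.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i),
                         _mm_packus_epi16(ra, rb));
    }

    // Post-roll: scalar loop for the outputs that do not fill a vector.
    for (; i < out; ++i) {
        dst[i] = static_cast<unsigned char>(((src[i * 2] >> 4) & 0x0f) |
                                            (src[i * 2 + 1] & 0xf0));
    }
}
```

With a 70-byte input, for instance, the main loop runs twice (32 outputs) and the post-roll handles the last 3 outputs. Whether this beats the compiler's own vectorization would have to be measured.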

If Penryn is your target architecture, try using a separate write buffer and aligning both buffers to 32 bytes.
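A minimal sketch of that suggestion might look like the following. The frame size, the struct, and the function names are assumptions made up for illustration, and __restrict is a compiler extension (accepted by GCC, Clang, MSVC and the Intel compiler) rather than standard C++. It only shows how to set up two distinct 32-byte-aligned buffers; any speedup would still need to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical frame size, chosen only for illustration.
constexpr std::size_t kFrameBytes = 64;

// Separate input and output storage, each aligned to 32 bytes.
struct Buffers {
    alignas(32) unsigned char in[kFrameBytes];
    alignas(32) unsigned char out[kFrameBytes / 2];
};

// Returns true if p sits on a 32-byte boundary.
bool is_aligned_32(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// Same packing as the original loop, but reading from src and writing to a
// distinct dst. __restrict tells the compiler the two do not overlap.
void pack_to_separate_buffer(const unsigned char* __restrict src,
                             unsigned char* __restrict dst, std::size_t n) {
    assert(n % 2 == 0);  // assumption: even byte count
    for (std::size_t i = 0; i < n / 2; ++i) {
        dst[i] = static_cast<unsigned char>(((src[i * 2] >> 4) & 0x0f) |
                                            (src[i * 2 + 1] & 0xf0));
    }
}
```

A caller would declare a Buffers object, fill its in array, and call pack_to_separate_buffer(b.in, b.out, kFrameBytes). Because the output no longer overwrites the input, the compiler does not have to account for the store aliasing later loads.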

Convert bytes to nibbles