Intel® ISA Extensions

Convert bytes to nibbles

DLake1
New Contributor I

I'm looking to optimize this loop:

framesize /= 2;
for (auto i = 0; i < framesize; ++i){
	buf0[i] = static_cast<unsigned char>(buf0[i * 2] >> 4 & 0x0f | buf0[i * 2 + 1] & 0xf0);
}

It reduces an array of bytes to nibbles.
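To make the arithmetic concrete, here is a minimal scalar sketch of the same operation. It assumes the assignment target in the loop is `buf0[i]`; the names `pack_pair` and `pack_nibbles_scalar` are made up for illustration and are not from the original code.

```cpp
#include <cstddef>

// Hypothetical helper: combines one pair of input bytes into one output byte.
// The high nibble of the even byte becomes the low nibble of the result, and
// the high nibble of the odd byte stays in place as the high nibble.
constexpr unsigned char pack_pair(unsigned char even, unsigned char odd) {
    return static_cast<unsigned char>(((even >> 4) & 0x0f) | (odd & 0xf0));
}

// In-place scalar reference. On entry buf holds 2 * out_len bytes; on return
// the first out_len bytes hold the packed results.
inline void pack_nibbles_scalar(unsigned char* buf, std::size_t out_len) {
    for (std::size_t i = 0; i < out_len; ++i) {
        // Reads at 2*i and 2*i+1 never fall behind the write at i.
        buf[i] = pack_pair(buf[2 * i], buf[2 * i + 1]);
    }
}

// Worked example: 0xAB >> 4 = 0x0A, 0xCD & 0xF0 = 0xC0, 0x0A | 0xC0 = 0xCA.
static_assert(pack_pair(0xAB, 0xCD) == 0xCA, "high nibbles of both bytes are kept");
```

The explicit parentheses match the precedence of the original expression: `>>` binds tighter than `&`, which binds tighter than `|`.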

The assembly seems rather extensive for this seemingly simple operation. Is there a more efficient way of doing this?

3 Replies
James_C_Intel2
Employee

The assembly seems rather extensive for this seemingly simple operation. Is there a more efficient way of doing this?

The size of the code may not be an issue; what matters is how fast it runs. Without additional information (such as the alignment of the buffer), the compiler has to generate run-in code ahead of the main loop if it vectorizes.

If you are interested in the assembler produced by various compilers, you can easily look at it on Godbolt's Compiler Explorer site, for instance https://godbolt.org/g/76zqbK , but really you should be interested in performance, not in the instructions generated.

DLake1
New Contributor I

It's significantly slower than the LZ4 compression library I'm using. I can only use up to SSE 4.2, for compatibility with an older CPU in another system.

Alignment has virtually no impact on performance with modern CPUs, but please prove me wrong if you can.

levicki
Valued Contributor I

CommanderLake wrote:
Alignment has virtually no impact on performance with modern CPUs, but please prove me wrong if you can.

What he meant by "alignment" is this: when a buffer is not guaranteed to be aligned, and especially when its length is not a multiple of the vector register width, the compiler has to generate (sometimes complicated) pre-roll and post-roll code to handle the unaligned start and the leftover remainder, so that the loop itself can use vector registers and instructions.
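The structure described above can be sketched by hand with SSE2 intrinsics, which stay well within the SSE 4.2 limit mentioned earlier. This is only an illustration under stated assumptions, not what any particular compiler emits: the function name `pack_nibbles_sse2` is made up, the input and output are assumed to be separate, non-overlapping buffers, and unaligned loads are used so that only a post-roll (tail) loop is needed.

```cpp
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define PACK_NIBBLES_HAVE_SSE2 1
#endif

// Hypothetical name. Reads 2 * out_len bytes from `in` and writes out_len
// bytes to `out`. Assumes `in` and `out` do not overlap.
inline void pack_nibbles_sse2(const unsigned char* in, unsigned char* out,
                              std::size_t out_len) {
    std::size_t i = 0;
#ifdef PACK_NIBBLES_HAVE_SSE2
    const __m128i low_mask  = _mm_set1_epi16(0x000f);
    const __m128i high_mask = _mm_set1_epi16(0x00f0);
    // Main loop: 32 input bytes become 16 output bytes per iteration.
    for (; i + 16 <= out_len; i += 16) {
        const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + 2 * i));
        const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + 2 * i + 16));
        // Each 16-bit lane holds w = even | (odd << 8) (little-endian).
        // (w >> 4) & 0x000f is the high nibble of the even byte;
        // (w >> 8) & 0x00f0 is the high nibble of the odd byte, still in place.
        const __m128i ra = _mm_or_si128(_mm_and_si128(_mm_srli_epi16(a, 4), low_mask),
                                        _mm_and_si128(_mm_srli_epi16(a, 8), high_mask));
        const __m128i rb = _mm_or_si128(_mm_and_si128(_mm_srli_epi16(b, 4), low_mask),
                                        _mm_and_si128(_mm_srli_epi16(b, 8), high_mask));
        // Every lane is at most 0xff, so the saturating pack just narrows it.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), _mm_packus_epi16(ra, rb));
    }
#endif
    // Post-roll: scalar loop for the remainder (or for everything without SSE2).
    for (; i < out_len; ++i) {
        out[i] = static_cast<unsigned char>(((in[2 * i] >> 4) & 0x0f) | (in[2 * i + 1] & 0xf0));
    }
}
```

If the length is known to be a multiple of 16 output bytes, the tail loop never runs. That is the kind of extra information a compiler usually does not have when it vectorizes on its own.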

If Penryn is your target architecture, try using a separate write buffer and align both buffers to 32 bytes.
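A minimal sketch of that suggestion, using only standard C++: the `Frame` struct, its buffer sizes, and the function names are hypothetical, chosen just to show one way of getting two separate 32-byte-aligned buffers. Whether this actually helps on a given CPU would need to be measured.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is aligned to `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Out-of-place variant: reads 2 * out_len bytes from `in` and writes out_len
// bytes to `out`. The two buffers are assumed not to overlap.
inline void pack_nibbles_separate(const unsigned char* in, unsigned char* out,
                                  std::size_t out_len) {
    for (std::size_t i = 0; i < out_len; ++i) {
        out[i] = static_cast<unsigned char>(((in[2 * i] >> 4) & 0x0f) | (in[2 * i + 1] & 0xf0));
    }
}

// Example frame with separate read and write buffers, each aligned to
// 32 bytes. The sizes (64 in, 32 out) are arbitrary illustration values.
struct Frame {
    alignas(32) unsigned char in[64];
    alignas(32) unsigned char out[32];
};
```

Usage would look like `Frame f; pack_nibbles_separate(f.in, f.out, sizeof f.out);`. For heap buffers sized at run time, an aligned allocation function would be needed instead of `alignas`.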
