DLake1
New Contributor I

Convert bytes to nibbles

I'm looking to optimize this loop:

framesize /= 2;
for (auto i = 0; i < framesize; ++i){
	buf0[i] = static_cast<unsigned char>((buf0[i * 2] >> 4 & 0x0f) | (buf0[i * 2 + 1] & 0xf0));
}

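It reduces an array of bytes to nibbles: the high nibbles of each pair of bytes are packed into a single byte, so the array shrinks to half its size in place.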

The assembly seems rather extensive for this seemingly simple operation. Is there a more efficient way of doing this?
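For reference, the operation can be written out as a standalone scalar function with a separate output. This is only a sketch: the name pack_high_nibbles_scalar and the use of std::vector are assumptions for illustration, not part of the original code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: for every pair of input bytes, keep the high nibble of
// each one and pack both into a single output byte.
std::vector<std::uint8_t> pack_high_nibbles_scalar(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out(in.size() / 2);  // a trailing odd byte is ignored
    for (std::size_t i = 0; i < out.size(); ++i) {
        const unsigned even = in[2 * i];      // its high nibble becomes the low nibble
        const unsigned odd  = in[2 * i + 1];  // its high nibble stays the high nibble
        out[i] = static_cast<std::uint8_t>((even >> 4) | (odd & 0xf0u));
    }
    return out;
}
```

With the input bytes 0xAB, 0xCD the output byte is 0xCA: 0xAB >> 4 gives 0x0A, 0xCD & 0xf0 gives 0xC0, and the two are ORed together.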

3 Replies
James_C_Intel2
Employee

The assembly seems rather extensive for this seemingly simple operation. Is there a more efficient way of doing this?

The size of the code may not be an issue; what matters is how fast it runs. Without additional information (such as the alignment of the buffer), the compiler has to generate run-in code (a scalar prologue that runs until the data is aligned) if it vectorizes the loop.

If you are interested in the assembler produced by various compilers, you can easily look at it on Godbolt's Compiler Explorer site, for instance https://godbolt.org/g/76zqbK, but really you should be interested in performance, not the instructions generated.

DLake1
New Contributor I

It's significantly slower than the LZ4 compression library I'm using. I can only use up to SSE4.2, for compatibility with an older CPU in another system.

Alignment has virtually no impact on performance with modern CPUs, but please prove me wrong if you can.

ILevi1
Valued Contributor I

CommanderLake wrote:
Alignment has virtually no impact on performance with modern CPUs, but please prove me wrong if you can.

What he meant by "alignment" is that when the buffer is not guaranteed to be aligned, and especially when its length is not a multiple of the vector register width, the compiler needs to generate (sometimes complicated) pre/post-roll code to handle the leftover elements, so that it can use vector registers and instructions in the loop itself.
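To make the main-loop-plus-remainder structure concrete, here is a hand-written sketch using SSE2 intrinsics, which stays within the SSE4.2 limit mentioned above. The function name pack_high_nibbles_sse2 is hypothetical, the code assumes a little-endian x86 target, and it has not been benchmarked, so treat it as an illustration of the structure rather than a tuned implementation.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Hypothetical helper: reads 2 * out_len bytes from src, writes out_len bytes to dst.
void pack_high_nibbles_sse2(const std::uint8_t* src, std::uint8_t* dst, std::size_t out_len) {
    const __m128i low_nibble  = _mm_set1_epi16(0x000f);
    const __m128i high_nibble = _mm_set1_epi16(0x00f0);

    // Each input byte pair is viewed as one 16-bit lane: even | (odd << 8).
    auto pack8 = [&](__m128i w) {
        const __m128i lo = _mm_and_si128(_mm_srli_epi16(w, 4), low_nibble);   // even >> 4
        const __m128i hi = _mm_and_si128(_mm_srli_epi16(w, 8), high_nibble);  // odd & 0xf0
        return _mm_or_si128(lo, hi);
    };

    std::size_t i = 0;
    // Main vector loop: 32 input bytes become 16 output bytes, unaligned accesses.
    for (; i + 16 <= out_len; i += 16) {
        const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 2 * i));
        const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 2 * i + 16));
        // Every lane is at most 0xff, so the saturating pack only narrows to bytes.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i),
                         _mm_packus_epi16(pack8(a), pack8(b)));
    }
    // Post-roll: scalar loop for the remaining (fewer than 16) output bytes.
    for (; i < out_len; ++i) {
        dst[i] = static_cast<std::uint8_t>((src[2 * i] >> 4) | (src[2 * i + 1] & 0xf0));
    }
}
```

Because it uses unaligned loads and stores, this version needs no pre-roll; only the tail loop remains. Whether it actually beats the compiler's own vectorization is something to measure, not assume.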

If Penryn is your target architecture, try using a separate write buffer and aligning both buffers to 32 bytes.
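A minimal sketch of that suggestion might look like the following. The buffer names g_src and g_dst, the size kOutBytes, and the function name are all assumptions for illustration, and the kernel deliberately requires 16-byte-aligned pointers and an output length that is a multiple of 16, so that aligned loads and stores can be used with no remainder handling.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

constexpr std::size_t kOutBytes = 64;  // illustrative size, assumed to be a multiple of 16

// Separate read and write buffers, both aligned to 32 bytes.
alignas(32) std::uint8_t g_src[2 * kOutBytes];
alignas(32) std::uint8_t g_dst[kOutBytes];

// Hypothetical kernel: src and dst must be 16-byte aligned, out_len % 16 == 0.
void pack_high_nibbles_aligned(const std::uint8_t* src, std::uint8_t* dst, std::size_t out_len) {
    const __m128i low_nibble  = _mm_set1_epi16(0x000f);
    const __m128i high_nibble = _mm_set1_epi16(0x00f0);
    for (std::size_t i = 0; i < out_len; i += 16) {
        // Aligned loads: these fault if the address is not a multiple of 16.
        const __m128i a = _mm_load_si128(reinterpret_cast<const __m128i*>(src + 2 * i));
        const __m128i b = _mm_load_si128(reinterpret_cast<const __m128i*>(src + 2 * i + 16));
        // Per 16-bit lane: (even >> 4) | (odd & 0xf0).
        const __m128i ra = _mm_or_si128(_mm_and_si128(_mm_srli_epi16(a, 4), low_nibble),
                                        _mm_and_si128(_mm_srli_epi16(a, 8), high_nibble));
        const __m128i rb = _mm_or_si128(_mm_and_si128(_mm_srli_epi16(b, 4), low_nibble),
                                        _mm_and_si128(_mm_srli_epi16(b, 8), high_nibble));
        _mm_store_si128(reinterpret_cast<__m128i*>(dst + i), _mm_packus_epi16(ra, rb));
    }
}
```

A call would be pack_high_nibbles_aligned(g_src, g_dst, kOutBytes). How much the alignment and the separate write buffer help depends on the CPU, so it is worth timing both variants on the actual target.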

 
