Recently I ran into some problems with icl's auto-vectorization. In my application there are some loops whose bodies contain float-to-BYTE conversions. But there is no float-to-BYTE conversion SIMD instruction. There is a float-to-int conversion, but there is no int-to-BYTE conversion SIMD instruction either.
In this situation, what I came up with is to use temporary memory to store the float-to-int conversion results, and then convert the ints to bytes myself using PACK instructions.
If there were an int-to-byte or float-to-byte conversion SIMD instruction, the temporary memory would not be necessary, and the performance should be better.
I have tested this: when it comes to byte-to-int stream conversion, icl uses unpack instructions to unpack the bytes, and then stores the unpacked results to the int destination.
So I think, since icl can use unpack to unpack BYTEs, why can't it use PACK to pack ints? Or could there be a pragma to tell the compiler to pack ints?
There is a newer PSHUFB - shuffle bytes - instruction that would let you pack together the low bytes of a float-to-int conversion, but it's SSSE3, which might not be an option for you or the people running your program.
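A minimal sketch of the PSHUFB idea, written with intrinsics rather than taken from any actual icl output. The helper name floats_to_bytes_pshufb and the TARGET_SSSE3 macro are made up for illustration, and it assumes truncating conversion (CVTTPS2DQ, like a C cast) and that keeping only the low byte of each int (wrapping, not saturating) is what you want:

```c
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>   /* SSE2: _mm_cvttps_epi32, _mm_cvtsi128_si32 */
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

/* Assumption: lets GCC/Clang compile this one function for SSSE3 without
   a global flag; other compilers (e.g. icl) ignore it. */
#if defined(__GNUC__)
#define TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define TARGET_SSSE3
#endif

/* Hypothetical helper: convert 4 floats to 4 bytes, keeping the low byte
   of each converted int. Values outside 0..255 wrap; they do not saturate. */
TARGET_SSSE3
static void floats_to_bytes_pshufb(const float *src, uint8_t *dst)
{
    /* float -> int32 with truncation toward zero */
    __m128i ints = _mm_cvttps_epi32(_mm_loadu_ps(src));

    /* Control bytes 0, 4, 8, 12 pick the low byte of each dword;
       a control byte with the top bit set (-128) writes zero. */
    const __m128i ctrl = _mm_setr_epi8(0, 4, 8, 12,
                                       -128, -128, -128, -128,
                                       -128, -128, -128, -128,
                                       -128, -128, -128, -128);
    __m128i packed = _mm_shuffle_epi8(ints, ctrl);

    /* The four result bytes now sit in the bottom dword. */
    int32_t low = _mm_cvtsi128_si32(packed);
    memcpy(dst, &low, sizeof low);
}
```

That is one shuffle after the conversion, but it will fault on a CPU without SSSE3, so a real program would need a runtime check or a fallback path.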
With SSE2, you could do the float-to-int conversion, then keep only the bottom byte of each dword (AND with 0x000000ff000000ff...), then shift a copy of that right 3 bytes (a whole-register byte shift, PSRLDQ), OR it with the original, shift the copy right 3 more bytes, OR it in again, then do that one more time, and your 4 bytes will all be together in the bottom dword. That would be nine instructions, excluding the loads and store. Not sure if it's worth it, but it would save a trip to memory.
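The SSE2 sequence above could be sketched with intrinsics roughly like this. The function name floats_to_bytes_sse2 is hypothetical, and the truncating conversion (CVTTPS2DQ) is an assumption; as with any low-byte approach, values outside 0..255 wrap instead of saturating:

```c
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Hypothetical helper: 4 floats -> 4 bytes using only SSE2. */
static void floats_to_bytes_sse2(const float *src, uint8_t *dst)
{
    /* float -> int32 with truncation, then keep the low byte of each dword */
    __m128i v = _mm_cvttps_epi32(_mm_loadu_ps(src));
    v = _mm_and_si128(v, _mm_set1_epi32(0xFF));

    /* Whole-register byte shifts (PSRLDQ) walk the bytes at offsets
       4, 8 and 12 down to offsets 1, 2 and 3. */
    __m128i t = _mm_srli_si128(v, 3);   /* byte 4  -> byte 1 */
    __m128i r = _mm_or_si128(v, t);
    t = _mm_srli_si128(t, 3);           /* byte 8  -> byte 2 */
    r = _mm_or_si128(r, t);
    t = _mm_srli_si128(t, 3);           /* byte 12 -> byte 3 */
    r = _mm_or_si128(r, t);

    /* Only the bottom dword is meaningful; the upper bytes hold leftovers. */
    int32_t low = _mm_cvtsi128_si32(r);
    memcpy(dst, &low, sizeof low);
}
```

The compiler is free to schedule or fuse these differently, so the instruction count in the generated code may not match the hand count exactly.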
There might be a cleverer way to accomplish that in fewer instructions.
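One candidate, if saturating to 0..255 is acceptable instead of keeping the low byte, is to do by hand the PACK route mentioned in the question: PACKSSDW followed by PACKUSWB, both SSE2, working on 16 floats at a time with no temporary memory. This is only a sketch under that assumption; the function name floats16_to_bytes_pack is made up, and it uses truncating conversion:

```c
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Hypothetical helper: 16 floats -> 16 bytes, clamped to 0..255. */
static void floats16_to_bytes_pack(const float *src, uint8_t *dst)
{
    /* Four float -> int32 conversions (truncating) */
    __m128i a = _mm_cvttps_epi32(_mm_loadu_ps(src + 0));
    __m128i b = _mm_cvttps_epi32(_mm_loadu_ps(src + 4));
    __m128i c = _mm_cvttps_epi32(_mm_loadu_ps(src + 8));
    __m128i d = _mm_cvttps_epi32(_mm_loadu_ps(src + 12));

    /* PACKSSDW: int32 -> int16 with signed saturation, order preserved */
    __m128i ab = _mm_packs_epi32(a, b);
    __m128i cd = _mm_packs_epi32(c, d);

    /* PACKUSWB: int16 -> uint8 with unsigned saturation, order preserved */
    __m128i bytes = _mm_packus_epi16(ab, cd);

    _mm_storeu_si128((__m128i *)dst, bytes);
}
```

That is three pack instructions per 16 results on top of the four conversions. Note the different behavior: an input of 260.0f gives 255 here, while the low-byte approaches give 4.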