This is my first time using vTune, and I am trying to tune a fairly complex piece of C code. All it basically does is "calculate x, add it to an unsigned char, and clip the result to 255" for a lot of pixels. Because of the complex nature of the code, it's hardly possible to optimize it :-/
vTune tells me a lot of time is used for modifying the data itself:
READ/WRITE are basically pointer-access wrapper macros, and clip255 is a simple clipping function.
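To make the description concrete, here is a minimal sketch of what such helpers and the per-pixel update might look like. This is not the profiled code: the macro bodies, the `add_clipped` function, and its parameters are assumptions made up for illustration, and the real macros may do more than a plain dereference.

```c
/* Hypothetical pointer-access wrappers; the real macros may differ. */
#define READ(p)      (*(p))
#define WRITE(p, v)  (*(p) = (v))

/* Clamp to the largest value an unsigned char can hold.
   Only the upper bound is checked, matching "clip it to 255". */
unsigned char clip255(int v)
{
    return (unsigned char)(v > 255 ? 255 : v);
}

/* Hypothetical loop: add x[i] to each pixel, saturating at 255.
   Assumes every x[i] is non-negative. */
void add_clipped(unsigned char *pixels, const int *x, int n)
{
    for (int i = 0; i < n; ++i) {
        WRITE(&pixels[i], clip255(READ(&pixels[i]) + x[i]));
    }
}
```

In a loop of this shape, every pixel costs one byte load and one byte store to the same address, so the read-modify-write is where I would expect samples to accumulate.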
Any ideas why so many cycles are spent here?
Furthermore, is this really the assembly generated for the C code, or does vTune mix things up? I can only read assembly a little, but clip255 should generate at least some kind of conditional operation, like a cmov or a compare plus jump, and I don't see anything like that in the listing.
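One possible reason for the missing compare, offered only as a sketch: a clip to 255 can be written without any conditional at all, so an optimizing compiler is free to emit something that contains neither a cmov nor a jump. The two functions below are hypothetical stand-ins (not the original clip255) and give the same result for inputs in the range 0 to 511, which covers the sum of two bytes. Whether the compiler actually did something like this, or vectorized the loop with a saturating byte add such as PADDUSB, can only be confirmed from the real listing.

```c
/* Straightforward form: one compare, usually a cmov or a branch. */
unsigned char clip255_branch(int v)
{
    return (unsigned char)(v > 255 ? 255 : v);
}

/* Branch-free form. Assumes 0 <= v <= 511.
   (v >> 8) is 1 when v >= 256 and 0 otherwise; negating it yields an
   all-ones mask or zero, and OR-ing the mask in forces the low byte to 255. */
unsigned char clip255_branchless(int v)
{
    return (unsigned char)(v | -(v >> 8));
}
```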