I'm unclear on how your pipeline is parallelized. You suggest that all the attribute calculations could be run in parallel, but the note above doesn't make the pipeline structure clear, beyond its having a serial input filter and a serial output filter.
Is each of the attribute calculations substantial? Do they vary in the amount of work (time) they perform? In my experiments, the more work there is relative to the read and write times, the more parallelism is exposed.
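To put rough numbers on that last point, here is a minimal sketch of an idealized steady-state model. It is not a measurement of your code. The `StageTimes` struct and `speedup_bound` function are names I made up for illustration, and the model assumes fixed per-item times and ignores scheduling overhead and any limit on items in flight.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-item stage times, in any consistent unit (e.g. ms).
struct StageTimes {
    double read;     // serial input filter
    double compute;  // all attribute calculations for one item (parallel stage)
    double write;    // serial output filter
};

// Idealized upper bound on speedup over a single-threaded run.
// A serial filter handles one item at a time, so the pipeline cannot finish
// items faster than one per max(read, write). A single thread needs
// read + compute + write per item. The bound also cannot exceed the number
// of worker threads.
double speedup_bound(const StageTimes& t, int workers) {
    assert(workers >= 1);
    const double slowest_serial = std::max(t.read, t.write);
    assert(slowest_serial > 0.0);
    const double total = t.read + t.compute + t.write;
    return std::min(static_cast<double>(workers), total / slowest_serial);
}
```

With read = write = 1 and compute = 2, the bound is 4 however many threads you add beyond four. With compute = 18 and the same read and write times, the serial filters would allow up to 20, so 8 threads can all be kept busy.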
Have you tried running something like Intel Thread Profiler to visualize what is really happening?