You are right: the implementation can vary between functions and platforms. Here is a comment from our expert:
For these particular functions, calculations are performed in a 32s accumulator (32sc_16sc) and in a 32f accumulator (32fc_16sc). However, these functions (the ones with the direct suffix) are not optimized. From a performance point of view it is better to use the FIR functions with the state structure: their precision is the same, but they are significantly faster.
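
To make the accumulator remark concrete, here is a minimal plain-C sketch of one output sample of a direct-form FIR over 16-bit input, computed once with a 32-bit integer accumulator and once with a 32-bit float accumulator. It does not call the library and is not its implementation. The helper names fir_sample_acc32s and fir_sample_acc32f are hypothetical, and the sketch uses real-valued data for brevity, whereas the functions discussed above work on complex data.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a library function.
   One output sample of a direct-form FIR: y = sum_k taps[k] * x[n - k],
   where x points at the num_taps most recent input samples, oldest first.
   The sum is kept in a 32-bit integer accumulator with no scaling or
   saturation, so the caller must make sure it fits in 32 bits. */
int32_t fir_sample_acc32s(const int16_t *x, const int32_t *taps, size_t num_taps)
{
    int32_t acc = 0;
    for (size_t k = 0; k < num_taps; ++k) {
        acc += taps[k] * (int32_t)x[num_taps - 1 - k];
    }
    return acc;
}

/* Hypothetical helper, not a library function.
   Same sum, but the taps are float and the sum is kept in a 32-bit
   float accumulator. */
float fir_sample_acc32f(const int16_t *x, const float *taps, size_t num_taps)
{
    float acc = 0.0f;
    for (size_t k = 0; k < num_taps; ++k) {
        acc += taps[k] * (float)x[num_taps - 1 - k];
    }
    return acc;
}
```

For small values both helpers return the same result. They differ at the edges: the integer accumulator is exact until it overflows, while a 32-bit float carries a 24-bit significand, so sums above 2^24 are rounded.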