NumberY and NumberX are both around 8000. The single-threaded version runs about twice as fast as the multithreaded version using TBB. I have tried adjusting the grain size and initializing TBB up front, but neither helps.
The code applies a function to every element of a matrix; MyFunction is an interpolation function. I suspect there is no speedup because the work per element is too small. However, this is as far as I can divide the work, and I would still like something that runs faster than the single-threaded code.
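To make the setup concrete, here is a minimal sketch of this kind of per-element matrix pass, split into contiguous blocks of rows so that each thread gets one large piece of work instead of many tiny ones. It is not the real code: it uses `std::thread` as a stand-in so it compiles without TBB, and the body of `MyFunction`, the flat row-major layout, and the name `ApplyByRowBlocks` are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for MyFunction: averages an element with its right,
// lower, and lower-right neighbours, clamped at the border. The real
// interpolation is not shown in the question.
inline double MyFunction(const std::vector<double>& in,
                         std::size_t numberX, std::size_t numberY,
                         std::size_t x, std::size_t y) {
    const std::size_t x1 = std::min(x + 1, numberX - 1);
    const std::size_t y1 = std::min(y + 1, numberY - 1);
    return 0.25 * (in[y * numberX + x] + in[y * numberX + x1] +
                   in[y1 * numberX + x] + in[y1 * numberX + x1]);
}

// Applies MyFunction to every element of a row-major numberY x numberX matrix.
// Each thread processes one contiguous block of rows and writes only to its
// own rows of `out`, so no synchronization is needed beyond the final join.
void ApplyByRowBlocks(const std::vector<double>& in, std::vector<double>& out,
                      std::size_t numberX, std::size_t numberY,
                      unsigned numThreads) {
    numThreads = std::max(1u, numThreads);
    // Round up so that all rows are covered even if numberY is not divisible.
    const std::size_t rowsPerThread = (numberY + numThreads - 1) / numThreads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        const std::size_t yBegin = std::min<std::size_t>(t * rowsPerThread, numberY);
        const std::size_t yEnd = std::min(yBegin + rowsPerThread, numberY);
        if (yBegin == yEnd) break;  // no rows left for further threads

        workers.emplace_back([&in, &out, numberX, numberY, yBegin, yEnd] {
            for (std::size_t y = yBegin; y < yEnd; ++y) {
                for (std::size_t x = 0; x < numberX; ++x) {
                    out[y * numberX + x] = MyFunction(in, numberX, numberY, x, y);
                }
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

With TBB, the same inner double loop would presumably sit inside the body passed to `tbb::parallel_for` over a `tbb::blocked_range<std::size_t>` of rows. Splitting by whole rows keeps each thread reading and writing contiguous memory, which matters when the per-element work is as small as described.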