"chunk size was 12 despite how I change the task_scheduler_init. I am using TBB version 3. Does it have a bug?"
Now you make me doubt (see below)...
"Well.. If I know exactly what would happen then it is easier me to develop my application."
Since parallel overhead related to chunk size would be extremely limited at this level, simple_partitioner and grainsize will reliably let you control chunk size directly: the input range is split recursively as long as a subrange is larger than grainsize. So "grainsize" is a bit misleading in my opinion, as it is really the upper limit for chunk size, with the lower limit about grainsize/2 for a large-enough input range.
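To make the splitting rule concrete, here is a minimal stand-alone sketch that only simulates it, without using TBB at all. It assumes a subrange is halved at its midpoint for as long as its size exceeds grainsize, which is my understanding of how blocked_range behaves under simple_partitioner; the names split_range and chunk_sizes are made up for illustration and are not part of TBB.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not a TBB function): records the sizes of the leaf
// chunks obtained by recursively halving [begin, end) while the subrange
// is still "divisible", assumed here to mean size > grainsize.
void split_range(std::size_t begin, std::size_t end, std::size_t grainsize,
                 std::vector<std::size_t>& chunks) {
    const std::size_t size = end - begin;
    if (size <= grainsize) {        // not divisible any more: one chunk
        chunks.push_back(size);
        return;
    }
    const std::size_t middle = begin + size / 2;   // assumed midpoint split
    split_range(begin, middle, grainsize, chunks);
    split_range(middle, end, grainsize, chunks);
}

// Returns the chunk sizes, in left-to-right order, for a range of n items.
std::vector<std::size_t> chunk_sizes(std::size_t n, std::size_t grainsize) {
    std::vector<std::size_t> chunks;
    if (grainsize == 0) grainsize = 1;   // guard: a zero grainsize would never terminate
    if (n > 0) split_range(0, n, grainsize, chunks);
    return chunks;
}
```

Under these assumptions, chunk_sizes(100, 12) gives chunks of 12, 6 and 7 items repeated four times: never more than grainsize, and never less than about grainsize/2.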
"I read some papers submitted to IEEE by the Intel TBB team. It says about P x
V multiplication where P is number of cores and V is a constant set to
4. I am not sure if it is still valid for TBB v 3."
Hmm, 96 divided by 4x4=16 is still only 6, so I don't see how you would get to chunk size 12... Maybe somebody from the TBB team could shed some light on what was always the case and what might have changed? Somehow I was thinking of an initial factor of 16, but I'm not sure anymore, and since the new values don't make sense either, it's probably easier to just ask (sorry).