The code worked correctly. At first, I ran the program with one thread on a dual-core machine (tbb::task_scheduler_init init(deferred); ...; init.initialize(1)). The execution time for performing LU on block 0 was 118 seconds. When I ran it with two threads (init.initialize(2)), the wall-clock time grew to 179 seconds.
Where did the additional 61 seconds come from? I measured the time for just the LU factorization of a specific block, that is, the time spent executing loMatPtr->lu(). That should be the same no matter how many physical threads are available. I also believe it has nothing to do with the overhead of thread creation or the implicit synchronization at the end of parallel_for.
Could anybody tell me the reason and how to improve the performance?
I modified the code as per your suggestion. Unfortunately, there was no improvement. Many malloc calls are made inside mat.lu(), and I guess that could be the problem. I tried both scalable_malloc from TBB and Hoard, a lock-free malloc library. I also used memalign to avoid false sharing. But the performance was still the same.
You may try the following brute-force approach:
1. Run the single-threaded version under a profiler.
2. Run the multi-threaded version under a profiler.
3. Compare the profiles and identify which parts of the code execute longer in the multi-threaded version.

For example, the single-threaded version might show:
func1() - 40%
func2() - 30%
func3() - 30%