Since that was written, the scheduler has been changed so that task depth is no longer a consideration. A full answer would probably need to take NUMA effects into account, and should rest on more than just intuition. Still, not favouring breadth-first scheduling increases the frequency of task stealing, which is rumoured to be very costly.
You do not need full-fledged threads for that; lightweight fibers will do. Each worker thread needs a limited number of worker fibers so that it can switch between them when one blocks. On Windows a fiber switch is very cheap: it just saves the current registers into the current fiber's context and restores the registers from the new fiber's context, on the order of ~20 cycles. Unfortunately, on Linux a fiber switch is considerably heavier, because it needs to make a syscall in order to switch the signal mask.
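To make the idea concrete, here is a minimal sketch of one worker thread switching between a small pool of fibers. It assumes Linux with glibc's `ucontext` API (`getcontext`, `makecontext`, `swapcontext`), which is the variant that pays for the signal-mask handling mentioned above. The names `run_worker`, `fiber_body`, `fiber_yield`, and the constants are made up for illustration, and the "blocking" point is simulated by an explicit yield.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

enum { FIBER_COUNT = 2, STACK_SIZE = 64 * 1024, STEPS = 3 };

static ucontext_t scheduler_ctx;            /* context of the worker thread's loop */
static ucontext_t fiber_ctx[FIBER_COUNT];   /* one saved context per fiber */
static int fiber_done[FIBER_COUNT];
static int trace[FIBER_COUNT * STEPS];      /* order in which fibers ran */
static int trace_len;

/* Stand-in for "this fiber would block": hand control back to the worker. */
static void fiber_yield(int id)
{
    swapcontext(&fiber_ctx[id], &scheduler_ctx);
}

static void fiber_body(int id)
{
    for (int step = 0; step < STEPS; ++step) {
        assert(trace_len < FIBER_COUNT * STEPS);
        trace[trace_len++] = id;            /* do one slice of work */
        fiber_yield(id);
    }
    fiber_done[id] = 1;
    /* Returning resumes uc_link, i.e. the worker loop. */
}

/* Runs all fibers to completion on the calling thread.
   Returns the number of work slices executed, or -1 on allocation failure. */
int run_worker(void)
{
    void *stacks[FIBER_COUNT] = { 0 };
    int result = -1;

    trace_len = 0;
    for (int i = 0; i < FIBER_COUNT; ++i) {
        fiber_done[i] = 0;
        stacks[i] = malloc(STACK_SIZE);
        if (stacks[i] == NULL || getcontext(&fiber_ctx[i]) != 0)
            goto cleanup;
        fiber_ctx[i].uc_stack.ss_sp = stacks[i];
        fiber_ctx[i].uc_stack.ss_size = STACK_SIZE;
        fiber_ctx[i].uc_link = &scheduler_ctx;
        makecontext(&fiber_ctx[i], (void (*)(void))fiber_body, 1, i);
    }

    /* Round-robin: resume each unfinished fiber until it yields or finishes. */
    for (int remaining = FIBER_COUNT; remaining > 0; ) {
        for (int i = 0; i < FIBER_COUNT; ++i) {
            if (fiber_done[i])
                continue;
            swapcontext(&scheduler_ctx, &fiber_ctx[i]);
            if (fiber_done[i])
                --remaining;
        }
    }
    result = trace_len;

cleanup:
    for (int i = 0; i < FIBER_COUNT; ++i)
        free(stacks[i]);
    return result;
}
```

Calling `run_worker()` from `main` with the constants above runs six slices, alternating between fiber 0 and fiber 1. A real scheduler would yield at genuine blocking points rather than after every slice, and would likely use a hand-written or library context switch that skips the signal-mask save to avoid the syscall.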