Given a large number of tasks of several types. Tasks of the same type share the same data. Therefore, it makes sense if each physical thread runs tasks of one type until all tasks of that type are completed. Then the thread switches to a different task type and runs only those types until they are exhausted. And so on.
Are all tasks known ahead of time? If all remaining tasks are of the same type, should two different threads be working on them?
If tasks are known ahead of time and each type should be handled by a different thread, put the tasks in bins and apply a parallel_for with an iteration per bin.
If it's okay for more than one thread to be working on tasks of the same type, I'd put the tasks in one-dimensional array, sort the array by type, and then loop over it with a parallel_for. The parallel_for tends to spread the threads out over the index space. That would cause threads to generally work on tasks of different type, but not restrictively.
If new tasks are arriving on the fly, the problem is harder. Is that your case?
New tasks are arriving on the fly, but there is a large pool of waiting tasks. A thread shouldn't change its task type unless the pool doesn't have any tasks of the same type. The idea is to maximize the reuse cash data.
Is the number of types known in advance? If so, I'd be inclined to have a concurrent_queue+spin_mutex per type. A thread looking for work could:
Randomly pick a queue
Try to acquire a lock on it using try_acquire.
If not successful, go back to step 1.
If successful, munch on the queue until it is empty
Release the lock
New items would be appended to the appropriate queue without any locking.
TBB's concurrent_queue is overkill for this, because the lock restricts the queue to the single-consumer multiple-producer case. There are efficient ways to implement such a queue with little or no locking. E.g., inside the TBB scheduler, class mail_outbox implements such a queue with only a tiny bit of blocking. (It uses a little locking to avoid some headaches that full lock free would introduce.)
The single-consumer multiple-producer restriction enables the use of some structures that would otherwise suffer the ABA problem. For example, the stack cited in that article actually works when restricted to a single consumer.
A variant on the stack that might be more efficient in your case is to replace the queue with a singly-linked list like the stack example, but have the consumer swipe the entire list instead of popping an individual element.That reduces the number of compare-and-swap operations that the consumer has to execute. The TBB scheduler uses this technique to return blocks of memory to the thread that allocated them (look for "return_list").