I would guess that at the second stage day 3 was processed before day 2. But the third stage, which is ordered, can't start "reducing" day 3 before day 2. I think you could easily check this guess extending the information collected at debug points with some data-specific info (e.g. the number of the day being processed).
As far as I understand, hyperthreads aren't quite "even"; one of the threads only has chances to execute when some processor units aren't used by the other one. So I do not wonder if in your case the main thread started to process day 1, the TBB worker thread took day 2 but made slow progress (due to HT), then the main thread completed day 1, took day 3, and having kind of priority on the processor resources completed day 3 before the worker thread finished day 2.
If that's the case, I wonder if adding a pause/yield point right before taking a new token from the pipeline would help the second thread take priority on processor resources and complete its job earlier.
I believe that adding yield or pauseoperations to the main thread won't help. OS does not discern logical CPUs in HT systems (at least it was so some time ago). Therefore when the main thread relinquishes its time quantum, the system will see that another thread is already working, and so it will resume the main thread. During all this time the processing will be happening in the same (main) pipeline of the CPU and so the second thread will remain in the secondary (low priority) CPU pipeline.
I think the problem could be solved by increasing the maximal number of tokens in flight. E.g. if the hyper-thread works at 15% of the main one speed, than 7 or 8 tokens will assure the acceptable balance. You couldplay with number of tokens in the range 6-15 and find the value resulting in the maximal throughput.