You increment the counter twice for each invocation but decrement it only once, so it is no wonder that it keeps increasing. If you want to see how much real parallelism there is, use a separate atomic counter for that.
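One possible way to measure the actual concurrency is sketched below. It uses plain `std::thread` rather than your pipeline, and all names (`in_flight`, `peak`, `ConcurrencyProbe`, `work_item`, `run_probe_demo`) are made up for illustration. The idea is simply that the probe increments exactly once on entry and decrements exactly once on exit, and a second atomic remembers the highest value seen.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical counters, not taken from the original code.
std::atomic<int> in_flight{0};  // calls executing right now
std::atomic<int> peak{0};       // highest value in_flight has reached

// RAII guard: one increment on entry, one decrement on exit,
// so the counter cannot drift upwards.
struct ConcurrencyProbe {
    ConcurrencyProbe() {
        int now = in_flight.fetch_add(1) + 1;
        int seen = peak.load();
        // Raise peak to `now` unless another thread already stored more.
        while (now > seen && !peak.compare_exchange_weak(seen, now)) {}
    }
    ~ConcurrencyProbe() { in_flight.fetch_sub(1); }
};

// Stand-in for the body of a filter; the sleep only simulates work.
void work_item() {
    ConcurrencyProbe probe;
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
}

// Runs work_item() from several threads and returns the observed peak.
int run_probe_demo(int threads, int calls_per_thread) {
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([calls_per_thread] {
            for (int i = 0; i < calls_per_thread; ++i) work_item();
        });
    }
    for (auto& th : pool) th.join();
    assert(in_flight.load() == 0);  // every increment was matched
    return peak.load();
}
```

In your own code you would place the probe object at the top of the filter body and read `peak` after the pipeline has finished; the value can never exceed the number of worker threads.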
For the second question, I would use a debugger to see where the threads are stuck. Things to remember: the function process() should be a) thread-safe, i.e. safe to call concurrently from multiple threads, and b) tolerant of being called multiple times even after it has returned 0. The last requirement exists because your input filter is parallel, so it is impossible to prevent further calls to the filter after it has returned NULL. The number of such calls cannot be predicted, so the function should not expect that it will always be 4, or 3, or any other fixed number.
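To illustrate both requirements, here is a minimal sketch of a source whose process() satisfies them. I do not know what your process() really does, so `ItemSource`, its `int` items, and `drain_concurrently` are assumptions made up for this example. Each caller claims an index atomically, and any index past the end just yields a null pointer, no matter how many extra calls arrive.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical item source; the real process() may look quite different.
class ItemSource {
public:
    explicit ItemSource(std::size_t count) : items_(count) {
        for (std::size_t i = 0; i < count; ++i) items_[i] = static_cast<int>(i) + 1;
    }

    // Returns the next item, or nullptr once the input is exhausted.
    // a) Thread-safe: each caller claims a distinct index atomically.
    // b) Tolerant of extra calls: every claim past the end returns nullptr.
    const int* process() {
        std::size_t i = next_.fetch_add(1);
        return i < items_.size() ? &items_[i] : nullptr;
    }

private:
    std::vector<int> items_;            // never modified after construction
    std::atomic<std::size_t> next_{0};  // next index to hand out
};

// Drains the source from several threads and returns how many items
// were handed out in total. Each thread deliberately keeps calling
// process() a few more times after it has seen the end.
std::size_t drain_concurrently(ItemSource& src, int threads) {
    std::atomic<std::size_t> handed_out{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&src, &handed_out] {
            while (src.process() != nullptr) handed_out.fetch_add(1);
            for (int extra = 0; extra < 3; ++extra) {
                assert(src.process() == nullptr);  // still "end of input"
            }
        });
    }
    for (auto& th : pool) th.join();
    return handed_out.load();
}
```

Note that nothing here counts the end-of-input calls or waits for a particular number of them; that is exactly the kind of expectation that makes threads hang.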