Concerning the question, why the serial version is faster than the parallel, you might need to have a closer look at the overhead. The first time the parallel section is executed, the OpenMP run-time environment starts the thread pool, which needs a considerable amount of time. For time measurements, it is therefore advisable to havea warm-up call to the parallel routine before the time measurement.
Additionally, you should take into account that parallelization is only beneficial for large data sets. If you sort only 100 integers, you are probably better off using a serial algorithm. If you haven't done so already, you should choose a big data set to time your sort routine. Additionally, you shouldswitch tothe serial variant in your recursion, if the data sets become too small.
You should also be aware that the maximal possible speed-up of a parallel quicksort algorithm is limited by the first partitioning, which is serial. Amdahl's law might bite you, especially if you are running on a machine with a lot of cores.
If you are completely stuck, the Intel Thread Profiler can help you in visualizing the thread behavior.