Hello, for SIMD machines (like CPUs) BitonicSort is typically used. Try the one from the Intel OCL SDK; there is a dedicated sample for it there.
Also, as a general point, the sorting algorithm you select should match the task. E.g. BubbleSort (with the usual early exit when a pass makes no swaps) is one of the fastest approaches if the input sequence is almost sorted, demonstrating O(N). The same BubbleSort is among the slowest, with O(N*N), when the input sequence is simply reversed.
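To make the best-case/worst-case gap concrete, here is a minimal Python sketch that counts comparisons. It assumes the early-exit variant of BubbleSort; the function name and the comparison counter are mine, purely for illustration.

```python
def bubble_sort(values):
    """Sort a copy of `values`; return (sorted_list, comparison_count)."""
    a = list(values)
    comparisons = 0
    n = len(a)
    while n > 1:
        swapped = False
        for i in range(n - 1):
            comparisons += 1
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                swapped = True
        if not swapped:
            # A full pass without swaps means the data is already sorted.
            break
        # The largest remaining element is now in place; shrink the range.
        n -= 1
    return a, comparisons


if __name__ == "__main__":
    print(bubble_sort(range(8))[1])          # already sorted: N - 1 comparisons
    print(bubble_sort(range(7, -1, -1))[1])  # reversed: N * (N - 1) / 2 comparisons
```

For 8 elements that is 7 comparisons on sorted input versus 28 on reversed input.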
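In this respect Bitonic sort is predictable: it performs the same sequence of compare-exchange steps whatever the input looks like, so its cost is guaranteed. That cost is O(n*log²(n)) comparisons in total (slightly more than the n*log(n) of the best comparison sorts), but they are arranged in O(log²(n)) stages of independent operations, which is what makes it map well to SIMD. It fits most tasks.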
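Here is a rough sequential sketch of the bitonic network in Python, just to show that the amount of work does not depend on the data. This is not the Intel sample; the function name and the counter are illustrative, and I assume the input length is a power of two (the usual restriction for the plain algorithm). In an OpenCL kernel, each iteration of the inner `for` loop would be one work-item.

```python
def bitonic_sort(values):
    """Sort a copy of `values` ascending; return (sorted_list, compare_exchange_count)."""
    a = list(values)
    n = len(a)
    # Assumption: the plain bitonic network needs a power-of-two length.
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    compare_exchanges = 0
    k = 2  # size of the bitonic sequences being merged
    while k <= n:
        j = k // 2  # distance between the two elements of each pair
        while j > 0:
            # All pairs within one (k, j) stage are independent of each other.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    compare_exchanges += 1
                    ascending = (i & k) == 0
                    out_of_order = a[i] > a[partner] if ascending else a[i] < a[partner]
                    if out_of_order:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a, compare_exchanges


if __name__ == "__main__":
    print(bitonic_sort(range(8)))
    print(bitonic_sort(range(7, -1, -1)))
```

Sorted and reversed inputs of the same length both go through exactly the same number of compare-exchange steps (24 for 8 elements); only the number of actual swaps differs.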