Please use MOVMSKPS instead of PMOVMSKB. It will not change your current performance results, since both instructions are implemented very similarly on current-generation processors.
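For illustration, here is a minimal sketch of the two mask extractions written with compiler intrinsics. The function names are made up for this example, and the compiler is free to schedule or select instructions differently, so check the generated assembly if the exact instruction matters.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides the SSE (xmmintrin.h) ones */

/* Hypothetical helper: per-lane a < b via CMPLTPS, then collapse the four
   lane sign bits with MOVMSKPS. Bit i of the result is set when a[i] < b[i]. */
int mask_lt_movmskps(const float *a, const float *b)
{
    __m128 lt = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    return _mm_movemask_ps(lt);                      /* 4-bit mask */
}

/* Hypothetical helper: same compare, collapsed with PMOVMSKB instead.
   PMOVMSKB takes one bit per byte, so every true lane sets four bits. */
int mask_lt_pmovmskb(const float *a, const float *b)
{
    __m128 lt = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    return _mm_movemask_epi8(_mm_castps_si128(lt));  /* 16-bit mask */
}
```

Both helpers answer the same "is any lane true" question when the result is compared against zero; the MOVMSKPS form simply gives one bit per float lane.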
In short, your measurements are correct for the code provided. PTEST decodes into 2 uops and has a 3-cycle latency when it receives its sources from an FP instruction (CMPLTPS, in your case). In addition, an integer instruction that receives its input (flags) from PTEST has +1 cycle added to its latency. So the sequence CMPLTPS->PTEST->Jcc costs 1+3+1+1 = 6 cycles of latency and 4 uops.
MOVMSKPS, MOVMSKPD and PMOVMSKB are each 1 uop with 1-cycle latency when they receive their sources from an FP instruction, but an integer instruction that receives its source from them has +2 cycles added to its latency. So the sequence CMPLTPS->MOVMSKPS->CMP/Jcc (fused) costs 1+1+2+1 = 5 cycles of latency and 3 uops.
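As a rough sketch, the two sequences compared above correspond to the following intrinsics code. The function names and the SSE41_FN macro are made up for this example. Because the helpers return an int, the compiler may emit SETcc rather than a conditional jump, and it may pick different instructions altogether, so treat this as a starting point and inspect the disassembly.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics, needed for PTEST */

/* Assumption: GCC/Clang, where a per-function target attribute enables
   SSE4.1 without a global -msse4.1 flag. Other compilers ignore it. */
#if defined(__GNUC__)
#define SSE41_FN __attribute__((target("sse4.1")))
#else
#define SSE41_FN
#endif

/* Intended sequence: CMPLTPS -> PTEST -> Jcc */
SSE41_FN int any_lt_ptest(const float *a, const float *b)
{
    __m128i lt = _mm_castps_si128(
        _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    /* _mm_testz_si128 returns 1 when (lt & lt) is all zero */
    return !_mm_testz_si128(lt, lt);
}

/* Intended sequence: CMPLTPS -> MOVMSKPS -> CMP/Jcc */
int any_lt_movmskps(const float *a, const float *b)
{
    __m128 lt = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    return _mm_movemask_ps(lt) != 0;
}
```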
Results may also be affected by how you are testing. If the loop contains only the tiny piece of code above, you may be penalized by front-end/decoding bottlenecks that do not normally appear when the same sequence runs inside real compiled code. So, for more realistic measurements, please unroll the loops (so that your branches are not taken). Also keep in mind that latency and throughput measurements can give different answers to what is "faster", depending on how the dependency chains are formed.
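A minimal sketch of what such an unrolled test kernel could look like is shown below. The function name and the 4x unroll factor are arbitrary choices for this example, not something taken from your code. When the input never matches, every early-exit branch falls through, which is the not-taken case described above. Note that the four tests in one iteration are independent of each other, so timing this loop leans toward a throughput measurement rather than a latency one.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical kernel: returns the index of the first group of 4 floats
   containing an element below `limit`, or n_groups when there is none. */
size_t first_group_below(const float *x, size_t n_groups, float limit)
{
    const __m128 lim = _mm_set1_ps(limit);
    size_t g = 0;

    /* Unrolled 4x: one loop-carried counter update and one backward
       branch per four compare/test/branch sequences. */
    for (; g + 4 <= n_groups; g += 4) {
        const float *p = x + 4 * g;
        if (_mm_movemask_ps(_mm_cmplt_ps(_mm_loadu_ps(p +  0), lim))) return g + 0;
        if (_mm_movemask_ps(_mm_cmplt_ps(_mm_loadu_ps(p +  4), lim))) return g + 1;
        if (_mm_movemask_ps(_mm_cmplt_ps(_mm_loadu_ps(p +  8), lim))) return g + 2;
        if (_mm_movemask_ps(_mm_cmplt_ps(_mm_loadu_ps(p + 12), lim))) return g + 3;
    }

    /* Remaining groups that do not fill a whole unrolled iteration */
    for (; g < n_groups; ++g) {
        if (_mm_movemask_ps(_mm_cmplt_ps(_mm_loadu_ps(x + 4 * g), lim)))
            return g;
    }
    return n_groups;
}
```

To time it, call the kernel on a buffer that is large enough to run many iterations and that contains no matching element, then compare against the PTEST variant of the same loop.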
Hope this explains it,