I'm not an expert in the Linux kernel, but in my opinion you'd need to write your own driver to measure thread latencies accurately.
As for the time spent in each thread, any tool based on instrumentation might help.
>Can CPI reflect the program efficiency in this situation? (same code but different thread scheduler policy)
If you are measuring the efficiency of a real project, VTune sampling might help, but you will need to investigate why CPI appears better or worse depending on the scheduler mode.
If you are measuring with small benchmarks (the preferred way), it's usually better to count clock ticks within the program using the appropriate API calls.
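
For the small-benchmark case, here is a minimal sketch of timing a piece of code from inside the program. It assumes a POSIX system and uses `clock_gettime` rather than a raw cycle counter, which is my choice for portability and not the only option. `busy_work` is a hypothetical stand-in for the code under test, and `read_clock_ns`, `struct timing` and `time_busy_work` are names made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Read a POSIX clock and return its value in nanoseconds (0 on failure). */
static uint64_t read_clock_ns(clockid_t id)
{
    struct timespec ts;
    if (clock_gettime(id, &ts) != 0)
        return 0;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical workload standing in for the code being benchmarked. */
static uint64_t busy_work(uint64_t iterations)
{
    volatile uint64_t sum = 0;   /* volatile keeps the loop from being optimized away */
    for (uint64_t i = 0; i < iterations; ++i)
        sum += i;
    return sum;
}

struct timing {
    uint64_t wall_ns;  /* elapsed time on the monotonic clock */
    uint64_t cpu_ns;   /* CPU time consumed by the calling thread */
};

/* Time one run of busy_work on the calling thread. */
static struct timing time_busy_work(uint64_t iterations)
{
    struct timing t;
    uint64_t wall0 = read_clock_ns(CLOCK_MONOTONIC);
    uint64_t cpu0  = read_clock_ns(CLOCK_THREAD_CPUTIME_ID);

    busy_work(iterations);

    t.cpu_ns  = read_clock_ns(CLOCK_THREAD_CPUTIME_ID) - cpu0;
    t.wall_ns = read_clock_ns(CLOCK_MONOTONIC) - wall0;
    return t;
}
```

Calling `time_busy_work` from each thread under each scheduler policy gives two numbers per run. When `wall_ns` is noticeably larger than `cpu_ns`, the thread probably spent the difference waiting rather than running, which is a rough hint about scheduling effects and not a precise latency measurement.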