The paper "Inside Intel Next Generation Nehalem Architecture" by Ronak Singhal (SP08_NGMS001_100r_eng.pdf) compares a strlen that uses a PCMPxSTRx instruction with ordinary x86 code. The SSE4.2 code looks very nice, but what is the approximate speedup?
And why was scalar x86 code used as the baseline? strlen can also be written with SSE2 instructions; here is my implementation: http://wmula.republika.pl/proj/sse2string/src/strlen.S. I'm wondering how much faster the SSE4.2 code is compared to that.
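To make the SSE2 approach concrete, here is a rough sketch in C with intrinsics. It is not the code from the linked strlen.S, only an illustration of the same general idea: compare 16 bytes at a time against zero and locate the first match in the resulting bit mask. The name strlen_sse2_sketch is made up for this example, __builtin_ctz assumes GCC or Clang, and a plain byte loop is used when SSE2 is not available.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical name; illustrative sketch only. */
size_t strlen_sse2_sketch(const char *s)
{
    const char *p = s;
#if defined(__SSE2__)
    /* Scan byte by byte until p is 16-byte aligned, so that the
       aligned 16-byte loads below never cross a page boundary. */
    while (((uintptr_t)p & 15) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        ++p;
    }

    const __m128i zero = _mm_setzero_si128();
    for (;;) {
        /* The load may read bytes past the terminator, but only
           within the same aligned 16-byte block. */
        __m128i chunk = _mm_load_si128((const __m128i *)p);
        /* Bit i of mask is set when byte i of the chunk is zero. */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask != 0)
            return (size_t)(p - s) + (size_t)__builtin_ctz((unsigned)mask);
        p += 16;
    }
#else
    /* Fallback when SSE2 is not enabled: ordinary scalar loop. */
    while (*p != '\0')
        ++p;
    return (size_t)(p - s);
#endif
}
```

The aligned loads deliberately read past the end of the string inside the last 16-byte block. That is common practice in hand-written strlen routines, but memory checkers may report it.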
BTW, what are the latency and throughput of the PCMPxSTRx instructions? Does the latency depend on the input data, or is it constant? I didn't find answers in the recent manuals.
For software developers who might be interested in attending Fall IDF (8/19-8/21): there will be sessions on Intel AVX on Wednesday (8/20), and on Thursday afternoon there is an in-depth session on SSE4.2. Additionally, SSE4.2 will be demoed in the advanced technology zone on all three days.