I wonder whether I can write my own implementation of memset/memcpy that beats the built-in version. I am using the Intel compiler on Linux. I am thinking of using SSE, but I am not sure whether the Intel compiler already applies it. Also, I am linking with the TC-malloc library.
You might want to take a look at the implementations in the latest glibc (2.13); look under sysdeps/x86_64/multiarch. Your mileage will vary depending on the metrics you choose and the test data sets you measure with.
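To make the "your mileage will vary" point concrete, here is a minimal timing sketch, not a rigorous benchmark. The names bench_copy_ns and copy_fn are hypothetical, and using clock() (CPU time) is an assumption chosen for portability. It calls the routine through a volatile function pointer so the compiler is less likely to inline or drop the copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by memcpy and any replacement under test. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Returns average CPU nanoseconds per call of fn on `size` bytes,
   or -1.0 if the buffers could not be allocated. */
double bench_copy_ns(copy_fn fn, size_t size, int reps)
{
    assert(fn != NULL && size > 0 && reps > 0);

    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    /* Touch both buffers first so page faults are not part of the timing. */
    memset(src, 0x5A, size);
    memset(dst, 0, size);

    /* A volatile pointer forces a real indirect call on every iteration. */
    copy_fn volatile call = fn;
    volatile unsigned char sink = 0;

    clock_t start = clock();
    for (int i = 0; i < reps; ++i) {
        src[0] = (unsigned char)i;   /* vary the input between calls */
        call(dst, src, size);
        sink ^= dst[0];              /* consume the result */
    }
    clock_t stop = clock();
    (void)sink;

    free(src);
    free(dst);
    return (double)(stop - start) * 1e9 / CLOCKS_PER_SEC / reps;
}
```

You would call it as bench_copy_ns(memcpy, size, reps) and again with your own routine, over a range of sizes and alignments that resembles your real workload. Results for buffers that fit in cache and buffers that do not can point in opposite directions.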
You could use nm to determine which references to memset and memcpy have been replaced by the __intel_fast_ versions from the icc library. There should be no built-in version with icc, unless you mean those __intel_fast_ versions. As the other response indicated, current glibc versions should be good for most purposes as well. I can't see what your choice of malloc would imply; maybe you are asking which functions your non-standard malloc uses. Again, nm should be a useful tool. Apparently, you're not asking about AVX optimizations; those don't have great importance on the Sandy Bridge implementation, since the hardware splits 256-bit moves into 128-bit pieces. The main issue for big aligned memset/memmove operations is the cutover point to nontemporal stores, which is application dependent.
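As an illustration of the cutover idea, here is a rough sketch using SSE2 intrinsics. The name sketch_memset and the 1 MiB NT_THRESHOLD are assumptions for the example only, not what icc or glibc actually do. Below the threshold it defers to the library memset; above it, it aligns the destination and uses nontemporal (streaming) stores that bypass the cache.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cutover point; the right value depends on the application
   and the cache sizes, so treat this as a tuning parameter. */
#define NT_THRESHOLD ((size_t)1 << 20)

void *sketch_memset(void *dst, int value, size_t n)
{
    assert(dst != NULL || n == 0);

    /* Small fills: the data is probably reused soon, so keep it in cache. */
    if (n < NT_THRESHOLD) {
        return memset(dst, value, n);
    }

    unsigned char *p = (unsigned char *)dst;

    /* Head: fill up to 15 bytes so that p becomes 16-byte aligned. */
    size_t head = (size_t)(-(uintptr_t)p & 15u);
    memset(p, value, head);
    p += head;
    n -= head;

    /* Body: 16-byte nontemporal stores (these require an aligned address). */
    const __m128i v = _mm_set1_epi8((char)value);
    size_t blocks = n / 16;
    for (size_t i = 0; i < blocks; ++i) {
        _mm_stream_si128((__m128i *)p, v);
        p += 16;
    }
    /* Order the streaming stores before any later stores. */
    _mm_sfence();

    /* Tail: the remaining 0..15 bytes. */
    memset(p, value, n % 16);
    return dst;
}
```

Whether this beats the library version depends on where the threshold sits relative to your working set. If the filled buffer is read again right away, the nontemporal path can be slower, because the data is no longer in cache.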