Modern CPUs offer both fast-string operations (REP MOVSB/STOSB) and non-temporal access (NTA). Which one do you prefer for memory copy/fill (leaving aside any DMA resources in the system)?
There certainly have been CPU models where the built-in string moves didn't qualify as "fast," so Intel compiler developers devoted significant effort to make their compilers choose well. More recent CPUs were designed to overcome performance deficits associated with legacy choices. It's reasonable to hope that clearly written portable source will be optimized adequately until performance profiling shows otherwise.
Intel also devoted effort to fixing obvious deficiencies in the memmove/memcpy/memset provided by the OS, so there are fewer problems there than in the past. When using compilers other than Intel's, you may need to call such functions explicitly if you wish to engage automatic run-time selection of streaming/nontemporal stores.
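To make the distinction concrete, here is a minimal sketch of the two ways the source can be written. The function names `copy_loop` and `copy_libc` are made up for illustration, and whether a given compiler turns the loop into a library call (or the library picks streaming stores for large sizes) depends on the compiler, the library version and the CPU, so treat this as an assumption to verify rather than a guarantee.

```c
#include <stddef.h>
#include <string.h>

/* Hand-written byte loop. A compiler may or may not recognize this as a
   copy and replace it with a library call or a string instruction. */
void copy_loop(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

/* Explicit library call. Any run-time selection between ordinary and
   streaming/nontemporal stores happens inside the library's memcpy,
   if the library implements such a selection at all. */
void copy_libc(unsigned char *dst, const unsigned char *src, size_t n)
{
    memcpy(dst, src, n);
}
```

Both functions produce the same result; only the second one is sure to go through whatever tuning the library's memcpy contains.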
If the strings are indeed byte strings at arbitrary byte offsets in both source and destination, and if the strings are relatively short (as a rough guess, less than 256 bytes), then rep movsb/stosb may (questionably) be a good choice. You will have to run some tests. And because you will be using the tests to make a decision, be sure they are set up to provide representative results for the situations you actually encounter (IOW, not a contrived prove-your-point test).
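A rough sketch of what such a test could look like follows. It assumes GCC or Clang on x86-64 for the inline assembly (elsewhere it falls back to memcpy), and the names `copy_rep_movsb`, `copy_memcpy` and `time_copies` are hypothetical helpers, not anything from a library. The timing is deliberately crude; the sizes, offsets and repeat counts you feed it should come from your own workload.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Copy n bytes with rep movsb. Relies on the direction flag being clear,
   which the usual x86-64 calling conventions guarantee at function entry. */
void copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);   /* fallback so the sketch still builds elsewhere */
#endif
}

/* Same signature, but goes through the C library. */
void copy_memcpy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Written after every copy so the compiler cannot drop the work. */
static volatile unsigned char sink;

/* Average CPU seconds per copy of n bytes (n must be > 0, reps > 0). */
double time_copies(copy_fn fn, unsigned char *dst,
                   const unsigned char *src, size_t n, long reps)
{
    clock_t t0 = clock();
    for (long r = 0; r < reps; ++r) {
        fn(dst, src, n);
        sink = dst[n - 1];
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC / (double)reps;
}
```

To keep the comparison honest, call `time_copies` with the misaligned pointers and the mix of lengths that your application really produces (for example `time_copies(copy_rep_movsb, dst + 3, src + 1, 200, reps)` against the same call with `copy_memcpy`), not just one aligned, cache-resident buffer.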
FWIW I do agree that some optimization effort should be made to favor rep movsb/stosb over using "gobs" of registers (which then have to be saved/restored, or their values discarded).